Judge
Batch-evaluate a cohort against one synthesized rubric, rank with the rubric's ordered tie-break, override a score at the human gate, and lock the official result.
Sample cohort: real RunRecords cached from a live gemini-3.1-flash-lite run. Press "Run cohort live" to re-evaluate against the API now.
Same cohort. Two ranks.
Six-hat raw consensus is listed first; Glasshat's audited rank is listed second.
Six-hat raw consensus:
- 1. Glasshat: 59.3
- 2. MeshSight: 53.7
- 3. QuickWrap: 31.3
- 4. TodoZap: 31.0

Glasshat audited rank:
- 1. Glasshat: 57.8
- 2. MeshSight: 52.1
- 3. QuickWrap: 29.9
- 4. TodoZap: 29.6
Calibration · hit@13 (does the audit improve the ranking?)
n=50 · 13 winners · backend=mock
Offline hit@13 on 50 historical Gemini-3 Devpost submissions with binary Winner-badge labels (13 winners). The source has no rank or score, so this is hit@13, not a rank curve. Backend=mock: the mock scorer is deterministic and illustrative (it does not read meaning from text); the live figure is produced by the same harness with LLM_BACKEND=vertex.
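The page does not spell out how hit@13 is computed. A minimal sketch of one common reading follows: take the k highest-scored submissions and report the fraction that carry a winner label. The function name `hit_at_k` and the toy data are hypothetical and are not taken from the Glasshat harness.

```python
from typing import Sequence


def hit_at_k(scores: Sequence[float], is_winner: Sequence[bool], k: int) -> float:
    """Fraction of the k highest-scored items that carry a winner label.

    Assumed definition for illustration; the real harness may differ.
    """
    if len(scores) != len(is_winner):
        raise ValueError("scores and labels must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of items")
    # Rank item indices by score, highest first. sorted() is stable, so
    # items with equal scores keep their input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in order[:k] if is_winner[i])
    return hits / k


# Toy data (hypothetical): six submissions, three labeled winners.
toy_scores = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1]
toy_labels = [True, False, False, True, True, False]
toy_hit = hit_at_k(toy_scores, toy_labels, k=3)
```

When k equals the number of labeled winners, as with 13 winners and hit@13, this fraction is the same whether it is read as precision or as recall over the top k.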
| # | Submission | Final | Per-criterion | Winner | Official |
|---|---|---|---|---|---|
| 1 | Glasshat (4 self-corrections) | 57.8 | Technological Implementation 3.4/5 (25%) · Design 2.1/5 (25%) · Potential Impact 2.9/5 (25%) · Quality of the Idea 3.1/5 (25%) | | |
| 2 | MeshSight (4 self-corrections) | 52.1 | Technological Implementation 2.9/5 (25%) · Design 2.1/5 (25%) · Potential Impact 2.8/5 (25%) · Quality of the Idea 2.6/5 (25%) | | |
| 3 | QuickWrap (4 self-corrections) | 29.9 | Technological Implementation 1.3/5 (25%) · Design 1.7/5 (25%) · Potential Impact 1.6/5 (25%) · Quality of the Idea 1.4/5 (25%) | | |
| 4 | TodoZap (4 self-corrections) | 29.6 | Technological Implementation 1.3/5 (25%) · Design 1.8/5 (25%) · Potential Impact 1.5/5 (25%) · Quality of the Idea 1.3/5 (25%) | | |
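
The table shows per-criterion scores out of 5 with a 25% weight each, but the page does not state how they combine into the Final column. The sketch below assumes a weighted mean rescaled to 0-100; `weighted_final` is a hypothetical helper. Under that assumption the Glasshat row gives about 57.5 rather than the displayed 57.8, so either the displayed per-criterion scores are rounded or the actual formula differs.

```python
from typing import Dict, Tuple

# Glasshat row from the table: criterion -> (score out of 5, weight).
GLASSHAT: Dict[str, Tuple[float, float]] = {
    "Technological Implementation": (3.4, 0.25),
    "Design": (2.1, 0.25),
    "Potential Impact": (2.9, 0.25),
    "Quality of the Idea": (3.1, 0.25),
}


def weighted_final(criteria: Dict[str, Tuple[float, float]],
                   max_score: float = 5.0) -> float:
    """Weighted mean of criterion scores, rescaled to a 0-100 range.

    Assumed formula for illustration only.
    """
    total_weight = sum(weight for _, weight in criteria.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    weighted_mean = sum(score * weight for score, weight in criteria.values())
    return 100.0 * weighted_mean / max_score


glasshat_final = weighted_final(GLASSHAT)
```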
Ties break by the rubric's ordered chain (Tech → Design → Impact → Idea). Override a score to re-rank; lock to freeze the official result.
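As a rough illustration of the rank, override, and lock flow, the sketch below sorts by final score and falls back to the ordered chain Tech → Design → Impact → Idea on ties. The `Entry` and `Cohort` classes, their methods, and the override value are hypothetical; only the scores come from the table.

```python
from dataclasses import dataclass
from typing import Dict, List

# Ordered tie-break chain: compare Tech first, then Design, Impact, Idea.
TIE_BREAK = ("tech", "design", "impact", "idea")


@dataclass
class Entry:
    name: str
    final: float
    criteria: Dict[str, float]


def rank(entries: List[Entry]) -> List[Entry]:
    """Sort by final score (high to low), then by each tie-break criterion."""
    def key(entry: Entry):
        return (-entry.final, *(-entry.criteria[c] for c in TIE_BREAK))
    return sorted(entries, key=key)


@dataclass
class Cohort:
    entries: List[Entry]
    locked: bool = False

    def override(self, name: str, final: float) -> None:
        """Human-gate override of one final score; refused once locked."""
        if self.locked:
            raise RuntimeError("official result is locked")
        for entry in self.entries:
            if entry.name == name:
                entry.final = final
                return
        raise KeyError(name)

    def lock(self) -> List[Entry]:
        """Freeze the cohort and return the official ranking."""
        self.locked = True
        return rank(self.entries)


cohort = Cohort([
    Entry("Glasshat", 57.8, {"tech": 3.4, "design": 2.1, "impact": 2.9, "idea": 3.1}),
    Entry("MeshSight", 52.1, {"tech": 2.9, "design": 2.1, "impact": 2.8, "idea": 2.6}),
    Entry("QuickWrap", 29.9, {"tech": 1.3, "design": 1.7, "impact": 1.6, "idea": 1.4}),
    Entry("TodoZap", 29.6, {"tech": 1.3, "design": 1.8, "impact": 1.5, "idea": 1.3}),
])
before = [e.name for e in rank(cohort.entries)]

# Hypothetical override: raise TodoZap to 29.9 so it ties with QuickWrap.
# Tech ties at 1.3, so Design decides (1.8 vs 1.7) and TodoZap moves up.
cohort.override("TodoZap", 29.9)
official = [e.name for e in cohort.lock()]
```

Negating each value in the sort key keeps a single ascending sort while still ranking higher scores first at every level of the chain.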