Glasshat

Judge

Batch-evaluate a cohort against one synthesized rubric, rank with the rubric's ordered tie-break, override a score at the human gate, and lock the official result.

Gemini / GEAP 3.1-flash-lite · Google ADK orchestrating · Cloud Run deployed · Arize AX tracing · Phoenix MCP calibration loop

Sample cohort — real RunRecords cached from a live gemini-3.1-flash-lite run. Press Run cohort live to re-evaluate against the API now.

Same cohort. Two ranks.

Six-hat raw consensus on the left; Glasshat's audited rank on the right.

no rank change in this cohort
Without Glasshat audit · raw six-hat consensus
  1. Glasshat · 59.3
  2. MeshSight · 53.7
  3. QuickWrap · 31.3
  4. TodoZap · 31.0
With Glasshat audit · calibration applied
  1. Glasshat · 57.8
  2. MeshSight · 52.1
  3. QuickWrap · 29.9
  4. TodoZap · 29.6

Calibration · hit@13 (does the audit improve the ranking?)

n=50 · 13 winners · backend=mock
hit@13 before audit: 38% (pre-calibration ranking)
hit@13 after audit: 38% (post-calibration ranking)
Audit effect: ±0 pts (no change in top-13 hit rate on this run)

Offline hit@13 on 50 historical Gemini-3 Devpost submissions labeled with a binary Winner badge (13 winners). The source data has no rank or score, so the metric is hit@13 rather than a rank curve. With backend=mock, the scorer is deterministic and purely illustrative (it does not read meaning from the text); the live figure comes from the same harness run with LLM_BACKEND=vertex.
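The hit@13 figure can be sketched in a few lines. This is a minimal illustration, not the harness's own code: it assumes hit@k means the share of the top-k scored submissions that carry the Winner badge, and the function name `hit_at_k` and the toy inputs are hypothetical.

```python
from typing import Sequence


def hit_at_k(scores: Sequence[float], is_winner: Sequence[bool], k: int) -> float:
    """Share of the k highest-scored submissions that carry a Winner label.

    Assumed definition for illustration; the real harness may differ.
    """
    if len(scores) != len(is_winner):
        raise ValueError("scores and is_winner must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of submissions")
    # Indices sorted by score, highest first. sorted() is stable, so
    # submissions with equal scores keep their input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = order[:k]
    return sum(1 for i in top_k if is_winner[i]) / k


if __name__ == "__main__":
    # Toy data (not from the cohort): four submissions, two winners.
    toy_scores = [0.9, 0.1, 0.8, 0.4]
    toy_winners = [True, False, False, True]
    print(hit_at_k(toy_scores, toy_winners, k=2))
```

Under this definition, with k equal to the number of winners (13), a displayed 38% would correspond to 5 winners among the top 13, since 5/13 ≈ 0.385.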

# · Submission · Final · Per-criterion · Winner · Official
  1. Glasshat · 57.8 · 4 self-corrections
     Technological Implementation: 3.4/5 · 25%
     Design: 2.1/5 · 25%
     Potential Impact: 2.9/5 · 25%
     Quality of the Idea: 3.1/5 · 25%
  2. MeshSight · 52.1 · 4 self-corrections
     Technological Implementation: 2.9/5 · 25%
     Design: 2.1/5 · 25%
     Potential Impact: 2.8/5 · 25%
     Quality of the Idea: 2.6/5 · 25%
  3. QuickWrap · 29.9 · 4 self-corrections
     Technological Implementation: 1.3/5 · 25%
     Design: 1.7/5 · 25%
     Potential Impact: 1.6/5 · 25%
     Quality of the Idea: 1.4/5 · 25%
  4. TodoZap · 29.6 · 4 self-corrections
     Technological Implementation: 1.3/5 · 25%
     Design: 1.8/5 · 25%
     Potential Impact: 1.5/5 · 25%
     Quality of the Idea: 1.3/5 · 25%

Ties break by the rubric's ordered chain (Tech → Design → Impact → Idea). Override a score to re-rank; lock to freeze the official result.
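The tie-break chain, override, and lock can be sketched as a composite sort key plus a small guard. This is a sketch under assumptions, not Glasshat's implementation: the names `Entry`, `Board`, `rank`, and the criterion keys are hypothetical, and it assumes an override replaces a submission's final score.

```python
from dataclasses import dataclass

# Ordered tie-break chain from the rubric: Tech -> Design -> Impact -> Idea.
# The short keys are hypothetical labels for the four criteria.
TIE_BREAK_CHAIN = ("tech", "design", "impact", "idea")


@dataclass
class Entry:
    name: str
    final: float                # final score on the 0-100 scale
    criteria: dict[str, float]  # per-criterion scores on the 0-5 scale


def rank(entries: list[Entry]) -> list[Entry]:
    """Sort by final score, then by each criterion in chain order (all descending)."""
    def key(e: Entry) -> tuple:
        return (-e.final, *(-e.criteria[c] for c in TIE_BREAK_CHAIN))
    return sorted(entries, key=key)


@dataclass
class Board:
    entries: list[Entry]
    locked: bool = False

    def override(self, name: str, final: float) -> None:
        """Human-gate override (assumed here to replace the final score)."""
        if self.locked:
            raise RuntimeError("official result is locked")
        for e in self.entries:
            if e.name == name:
                e.final = final
                return
        raise KeyError(name)

    def lock(self) -> tuple[str, ...]:
        """Freeze the board and return the official order."""
        self.locked = True
        return tuple(e.name for e in rank(self.entries))


if __name__ == "__main__":
    board = Board([
        Entry("Glasshat", 57.8, {"tech": 3.4, "design": 2.1, "impact": 2.9, "idea": 3.1}),
        Entry("MeshSight", 52.1, {"tech": 2.9, "design": 2.1, "impact": 2.8, "idea": 2.6}),
        Entry("QuickWrap", 29.9, {"tech": 1.3, "design": 1.7, "impact": 1.6, "idea": 1.4}),
        Entry("TodoZap", 29.6, {"tech": 1.3, "design": 1.8, "impact": 1.5, "idea": 1.3}),
    ])
    # Hypothetical override: raising TodoZap to 29.9 ties it with QuickWrap.
    # Tech is equal (1.3 each), so Design decides (1.8 vs 1.7).
    board.override("TodoZap", 29.9)
    print(board.lock())
```

In the usage example, the hypothetical override creates a tie at 29.9 that the chain resolves at the Design step, so TodoZap moves ahead of QuickWrap; after `lock()`, further overrides raise an error.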