Judge

Batch-evaluate a cohort against one synthesized rubric, rank with the rubric's ordered tie-break, override a score at the human gate, and lock the official result.

Gemini / GEAP 3.1-flash-lite · Google ADK orchestrating · Cloud Run deployed · Arize AX tracing · Phoenix MCP calibration loop

Rubric · Top-K

Add submission

Sample cohort: real RunRecords cached from a live gemini-3.1-flash-lite run. Press "Run cohort live" to re-evaluate against the API now.

Same cohort. Two ranks.

Six-hat raw consensus on the left; Glasshat's audited rank on the right.

no rank change in this cohort

Without Glasshat audit · raw six-hat consensus

1	Glasshat	59.3
2	MeshSight	53.7
3	QuickWrap	31.3
4	TodoZap	31.0

With Glasshat audit · calibration applied

1	Glasshat	57.8
2	MeshSight	52.1
3	QuickWrap	29.9
4	TodoZap	29.6

Calibration · hit@13 (does the audit improve the ranking?)

n=50 · 13 winners · backend=mock

hit@13 before audit: 38% (pre-calibration ranking)

hit@13 after audit: 38% (post-calibration ranking)

Audit effect: ±0 pts (no change in top-13 hit rate on this run)

Offline hit@13 on 50 historical Gemini-3 Devpost submissions with binary Winner-badge labels (13 winners). The source has no rank or score, so this is hit@13, not a rank curve. Backend=mock: the mock scorer is deterministic and illustrative (it does not read meaning from text); the live figure is produced by the same harness with LLM_BACKEND=vertex.
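
A minimal sketch of how a hit@k figure like this could be computed, assuming hit@k means the share of the k highest-scored submissions that carry a Winner badge. The function name `hit_at_k` and the toy data are hypothetical and are not taken from the Glasshat harness.

```python
from typing import Sequence


def hit_at_k(scores: Sequence[float], is_winner: Sequence[bool], k: int) -> float:
    """Fraction of the k highest-scored items that are labeled winners.

    Assumed definition; the real harness may define hit@k differently.
    """
    if len(scores) != len(is_winner):
        raise ValueError("scores and labels must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of items")
    # Indices sorted by descending score; sorted() is stable, so ties keep input order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = sum(1 for i in order[:k] if is_winner[i])
    return hits / k


# Toy data (hypothetical): 6 submissions, 3 winners, so k = 3.
toy_scores = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1]
toy_labels = [True, False, False, True, True, False]
toy_hit = hit_at_k(toy_scores, toy_labels, k=3)  # top 3 contain 2 winners
```

Under this assumed definition, a 38% hit@13 would correspond to 5 of the 13 top-ranked submissions being badge winners (5/13 ≈ 38.5%).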

#	Submission	Final	Per-criterion
1	Glasshat (4 self-corrections)	57.8	Technological Implementation 3.4/5 · 25%; Design 2.1/5 · 25%; Potential Impact 2.9/5 · 25%; Quality of the Idea 3.1/5 · 25%
2	MeshSight (4 self-corrections)	52.1	Technological Implementation 2.9/5 · 25%; Design 2.1/5 · 25%; Potential Impact 2.8/5 · 25%; Quality of the Idea 2.6/5 · 25%
3	QuickWrap (4 self-corrections)	29.9	Technological Implementation 1.3/5 · 25%; Design 1.7/5 · 25%; Potential Impact 1.6/5 · 25%; Quality of the Idea 1.4/5 · 25%
4	TodoZap (4 self-corrections)	29.6	Technological Implementation 1.3/5 · 25%; Design 1.8/5 · 25%; Potential Impact 1.5/5 · 25%; Quality of the Idea 1.3/5 · 25%

Ties break by the rubric's ordered chain (Tech → Design → Impact → Idea). Override a score to re-rank; lock to freeze the official result.
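
The ordered tie-break can be sketched as a multi-key sort: final score first, then each criterion in chain order, all descending. This is an illustration only; the `Entry` class, the `rank` function, and the field names are hypothetical, and the tied pair at the end is invented to show the chain in action (the displayed cohort has no ties).

```python
from dataclasses import dataclass

# Tie-break order as stated on the page: Tech -> Design -> Impact -> Idea.
TIE_BREAK_CHAIN = ("tech", "design", "impact", "idea")


@dataclass(frozen=True)
class Entry:
    name: str
    final: float   # final score out of 100
    tech: float    # per-criterion scores out of 5
    design: float
    impact: float
    idea: float


def rank(entries):
    """Sort by final score, then by each criterion in chain order (all descending)."""
    def key(e):
        return (-e.final, *(-getattr(e, c) for c in TIE_BREAK_CHAIN))
    return sorted(entries, key=key)


# Values copied from the table above, deliberately shuffled.
cohort = [
    Entry("TodoZap", 29.6, 1.3, 1.8, 1.5, 1.3),
    Entry("Glasshat", 57.8, 3.4, 2.1, 2.9, 3.1),
    Entry("QuickWrap", 29.9, 1.3, 1.7, 1.6, 1.4),
    Entry("MeshSight", 52.1, 2.9, 2.1, 2.8, 2.6),
]
ranked_names = [e.name for e in rank(cohort)]

# Hypothetical tie: same final and same Tech score, so Design decides.
tied = [
    Entry("QuickWrap", 29.9, 1.3, 1.7, 1.6, 1.4),
    Entry("TodoZap", 29.9, 1.3, 1.8, 1.5, 1.3),
]
tied_names = [e.name for e in rank(tied)]  # TodoZap first on Design (1.8 > 1.7)
```

In this sketch, overriding a score at the human gate would amount to building a new `Entry` with the changed value and calling `rank` again.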