the judge scores 9.0 · then catches its own over-confidence
calibrated 7.6, pulled to where the retrieved evidence holds
Illustrative correction · clip(score − 0.8·mean_delta, p25, p75), ±2.0 cap · prior recovered from held-out spike-D anchors.
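One way to read the correction above is as three steps: shrink, clip, cap. The sketch below is a minimal Python reading of it, not Glasshat's actual code. It assumes `mean_delta` is the average over-scoring gap taken from the prior, `p25` and `p75` are quartile bounds from the anchors, and the ±2.0 cap limits the total movement away from the raw score. The function name, argument names and sample inputs are hypothetical, chosen only so the illustrative 9.0 lands on 7.6.

```python
def calibrate(score, mean_delta, p25, p75, shrink=0.8, cap=2.0):
    """Illustrative reading of clip(score - 0.8*mean_delta, p25, p75) with a +/-2.0 cap."""
    # Step 1: pull the raw score back by a fraction of the mean over-scoring gap.
    shifted = score - shrink * mean_delta
    # Step 2: clip the shifted score into the interquartile band [p25, p75].
    clipped = min(max(shifted, p25), p75)
    # Step 3 (assumed meaning of the cap): never move more than `cap` from the raw score.
    move = max(-cap, min(cap, clipped - score))
    return score + move


# Hypothetical inputs: mean_delta, p25 and p75 are made up for the example.
print(round(calibrate(9.0, mean_delta=1.75, p25=6.0, p75=8.0), 2))  # 7.6
```

If the cap is instead meant to bound `mean_delta` before the shrink step, only the third step changes; the page does not say which reading applies.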
AI writes the submissions now — by the thousands. The judging didn't change.
Trace it. Trust it.
Glasshat doesn't just judge — it audits the judge. Every hat, every sub-score opens its own trace span in Arize AX; the panel catches its own over-confidence and self-corrects, live. So the score isn't a black box — it's one you can check. Not a chatbot.
- Arize track
- Gemini · Vertex AI
- Google ADK
- Phoenix + MCP
clip(score − 0.8·mean_delta, p25, p75) · ±2.0 cap, from a prior recovered from held-out spike-D anchors.
- White
- Red
- Black
- Green
- Blue
- Yellow → calibrated
Same cohort. Two truths. The audit decides who wins.
A raw AI panel hands the crown to the loudest, least-evidenced project. Watch the board tilt: once calibration lands on retrieved evidence, a better-evidenced project rises to #1. No one touches it; the audit happens on its own.
- #3 Halcyon · grid-balancing model · deep evidence · repo + deck grounded · 7.9 score / 10 · calibrated ↑ from 7.9
- #1 Nimbus · pitch-perfect demo · thin evidence · over-confident yellow hat · 9.4 score / 10 · pulled back −1.3 · within ±2.0 cap
- #2 Tessera · multi-agent router · mixed evidence · partial grounding · 8.6 score / 10 · calibrated −0.8
- #4 Quillwright · doc-search tool · mid evidence · sparse retrieval · 7.1 score / 10 · calibrated −0.2
Projects, scores, ranks and deltas are illustrative: a depiction of the re-ranking mechanism, not live results.
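The re-ranking itself is just a sort before and after the deltas are applied. A small Python sketch using the illustrative board: the deltas for Nimbus, Tessera and Quillwright come from the cards, while Halcyon's upward delta is not stated on the page, so the +0.4 below is a hypothetical placeholder. The `rank` helper is likewise made up for the example.

```python
# (project, raw score, calibration delta) taken from the illustrative board.
board = [
    ("Nimbus", 9.4, -1.3),
    ("Tessera", 8.6, -0.8),
    ("Halcyon", 7.9, +0.4),  # HYPOTHETICAL delta: the page only says "calibrated ↑"
    ("Quillwright", 7.1, -0.2),
]


def rank(rows, key):
    """Return project names ordered from highest to lowest by `key`."""
    return [name for name, *_ in sorted(rows, key=key, reverse=True)]


raw_order = rank(board, key=lambda r: r[1])                # raw panel score
calibrated_order = rank(board, key=lambda r: r[1] + r[2])  # score after the audit

print(raw_order)         # ['Nimbus', 'Tessera', 'Halcyon', 'Quillwright']
print(calibrated_order)  # ['Halcyon', 'Nimbus', 'Tessera', 'Quillwright']
```

Any Halcyon delta that lifts it above Nimbus's calibrated 8.1 produces the same flip at the top.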
Built for the Arize track
The audit is auditable, on Arize AX.
Glasshat doesn't hide the judgment — it traces it. Every agent, every one of the six hats, and the self-correction itself opens its own trace span in Arize AX. So you don't take the score on faith — you open the trace of how it was judged and audited, and check it. Trace it. Trust it.
The calibration consultant reads a Phoenix dataset over MCP and writes each correction back. That learning loop runs live on the Cloud Run demo, which uses a Cloud-SQL-backed Phoenix. The genuine live Arize AX results (the full trace tree and the hit@13 experiment) come from the credentialed Agent Engine run shown next.
Genuinely deployed · genuinely measured
Not a mock. The evaluation brain is a real ADK agent on Agent Engine — traced, and scored.
- ADK 2.0 Workflow graph · Deployed on Gemini Enterprise Agent Platform · Agent Engine / Runtime · serving live stream_query · managed Sessions + AGENT_IDENTITY
- 104 spans · Full nested Arize AX trace · agent → Workflow → 6 hats (generate + embed) · verified via client.spans.list(project=glasshat)
- hit@13 0.6154 · 8 / 13 winners ranked into the top-13 · real Gemini · vs 0.3846 mock · 0.26 chance
Live Agent Engine resource reasoningEngines/7480191458771730432 · AX experiment glasshat-hit-at-13-gemini. hit@13 on binary Winner labels — not a rank curve; audit Δ=0 on this set.
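For readers new to the metric, one common reading of hit@k on binary labels is the share of labeled winners that land in the top k by score. The Python sketch below assumes that definition; the function name and toy data are hypothetical and are not the Glasshat experiment.

```python
def hit_at_k(scores, is_winner, k):
    """Fraction of labeled winners ranked inside the top-k by score (assumed definition)."""
    total_winners = sum(is_winner)
    if total_winners == 0:
        return 0.0
    # Indices of submissions ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(is_winner[i] for i in order[:k])
    return hits / total_winners


# Toy data: six submissions, three labeled winners; two of them rank in the top 3.
scores = [9.1, 8.7, 8.2, 7.9, 6.5, 5.0]
labels = [1, 0, 1, 0, 1, 0]
print(hit_at_k(scores, labels, k=3))
```

Under this reading, 8 of 13 winners in the top 13 gives 8/13 ≈ 0.6154, and 5 of 13 gives ≈ 0.3846, which matches the figures quoted above.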
The judge catches its own over-confidence, live.
A rubric-aware AI evaluator scores with a six-hat panel, grounds every sub-score in retrieved evidence, then pulls its own optimism back to where the evidence actually supports. Every step is a trace span in Arize AX — so the score is one you can open, and check.
illustrative score · the math runs on real evidence