Glasshat

the judge scores 9.0 · then catches its own over-confidence

calibrated 7.6, pulled to where the retrieved evidence holds

Illustrative correction · clip(score − 0.8·mean_delta, p25, p75), ±2.0 cap · prior recovered from held-out spike-D anchors.
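
The correction above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Glasshat's implementation. The function name `calibrate`, the order of operations (shift, clip into [p25, p75], then limit the net move to ±2.0) and every input except the 9.0 raw score are hypothetical. The page does not define mean_delta, p25 or p75; here mean_delta is read as the judge's average over-scoring against the held-out anchors, and p25/p75 as quartile bounds on the anchor scores. The value 1.75 was chosen only because it reproduces the 9.0 → 7.6 figure shown.

```python
def calibrate(score, mean_delta, p25, p75, shrink=0.8, cap=2.0):
    """Pull an over-confident score back toward the evidence-backed range.

    ASSUMPTION: shift first, clip second, cap last. The source only gives
    the expression clip(score - 0.8*mean_delta, p25, p75) with a +/-2.0 cap.
    """
    # Step 1: subtract a shrunken share of the measured over-confidence.
    shifted = score - shrink * mean_delta
    # Step 2: keep the result inside the interquartile band [p25, p75].
    clipped = min(max(shifted, p25), p75)
    # Step 3: never move the score by more than `cap` in either direction.
    move = min(max(clipped - score, -cap), cap)
    return score + move


# Hypothetical inputs chosen to reproduce the illustrative 9.0 -> 7.6 pull-back.
print(round(calibrate(9.0, mean_delta=1.75, p25=6.0, p75=8.5), 2))  # 7.6
```

With this ordering the cap always wins: if the clip alone would move a score by more than 2.0 points, the result stays exactly 2.0 away from the raw score, even if that leaves it outside [p25, p75].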

Evaluation constellation

AI writes the submissions now — by the thousands. The judging didn't change.

Trace it. Trust it.

Glasshat doesn't just judge — it audits the judge. Every hat, every sub-score opens its own trace span in Arize AX; the panel catches its own over-confidence and self-corrects, live. So the score isn't a black box — it's one you can check. Not a chatbot.

  • Arize track
  • Gemini · Vertex AI
  • Google ADK
  • Phoenix + MCP
Yellow hat · over-confident

clip(score − 0.8·mean_delta, p25, p75) · ±2.0 cap, from a prior recovered from held-out spike-D anchors.

Illustrative.
  • White
  • Red
  • Black
  • Green
  • Blue
  • Yellow → calibrated
Glasshat · it audits the judge

Same cohort. Two truths. The audit decides who wins.

A raw AI panel hands the loudest, least-evidenced project the crown. Watch the board tilt: once calibration lands on retrieved evidence, a better-evidenced project rises to #1. No one touches it; the audit happens on its own.

Cohort ranking · raw consensus
over-confident · evidence not yet applied
  1. Halcyon · grid-balancing model (raw rank 3)
    deep evidence · repo + deck grounded
    7.9 score / 10 · calibrated ↑ from 7.9
  2. Nimbus · pitch-perfect demo (raw rank 1)
    thin evidence · over-confident yellow hat
    9.4 score / 10 · pulled back −1.3 · within ±2.0 cap
  3. Tessera · multi-agent router (raw rank 2)
    mixed evidence · partial grounding
    8.6 score / 10 · calibrated −0.8
  4. Quillwright · doc-search tool (raw rank 4)
    mid evidence · sparse retrieval
    7.1 score / 10 · calibrated −0.2
Halcyon rises to #1 once evidence is weighed. clip(score − 0.8·mean_delta, p25, p75), ±2.0 cap — a calibration prior recovered from held-out anchors.

Projects, scores, ranks and deltas are illustrative: a depiction of the re-ranking mechanism, not live results.
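
The re-ranking on the board can be replayed with the illustrative numbers shown. This is a rough sketch, not the product's code: the helper names `rank` and `apply_calibration` are hypothetical, and the board does not state Halcyon's calibrated score (only "↑ from 7.9"), so `HALCYON_UPLIFT` is a labeled placeholder. Any uplift above +0.2 would put Halcyon ahead of Nimbus, whose calibrated score is 9.4 − 1.3 = 8.1.

```python
# Raw panel scores and calibration deltas as depicted on the board (illustrative).
RAW = {"Halcyon": 7.9, "Nimbus": 9.4, "Tessera": 8.6, "Quillwright": 7.1}
DELTA = {"Nimbus": -1.3, "Tessera": -0.8, "Quillwright": -0.2}

# ASSUMPTION: placeholder value. The board only says Halcyon moves up from 7.9.
HALCYON_UPLIFT = 0.4


def rank(scores):
    """Return project names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def apply_calibration(raw, delta):
    """Add each project's delta to its raw score (a missing delta means no change)."""
    return {name: round(score + delta.get(name, 0.0), 1) for name, score in raw.items()}


calibrated = apply_calibration(RAW, {**DELTA, "Halcyon": HALCYON_UPLIFT})

print(rank(RAW))         # ['Nimbus', 'Tessera', 'Halcyon', 'Quillwright']
print(rank(calibrated))  # ['Halcyon', 'Nimbus', 'Tessera', 'Quillwright']
```

The raw ordering matches the raw-rank badges on the board (Nimbus 1, Tessera 2, Halcyon 3, Quillwright 4), and the calibrated ordering matches the order in which the projects are listed.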

Built for the Arize track

The audit is auditable, on Arize AX.

Glasshat doesn't hide the judgment — it traces it. Every agent, every one of the six hats, and the self-correction itself opens its own trace span in Arize AX. So you don't take the score on faith — you open the trace of how it was judged and audited, and check it. Trace it. Trust it.

The calibration consultant reads a Phoenix dataset over MCP and writes each correction back. That learning loop runs live on the Cloud Run demo (a Cloud-SQL-backed Phoenix). The genuine live Arize AX results, meaning the full trace tree and the hit@13 experiment, come from the credentialed Agent-Engine run shown next.

  • Arize AX · OpenInference/OTLP
  • Phoenix · MCP (live)
  • every hat = a span

Genuinely deployed · genuinely measured

Not a mock. The evaluation brain is a real ADK agent on Agent Engine — traced, and scored.

  • ADK 2.0 Workflow graph: Deployed on Gemini Enterprise Agent Platform. Agent Engine / Runtime · serving live stream_query · managed Sessions + AGENT_IDENTITY
  • 104 spans: Full nested Arize AX trace. agent → Workflow → 6 hats (generate + embed) · verified via client.spans.list(project=glasshat)
  • hit@13 0.6154: 8 / 13 winners ranked into the top-13. real Gemini · vs 0.3846 mock · 0.26 chance

Live Agent Engine resource reasoningEngines/7480191458771730432 · AX experiment glasshat-hit-at-13-gemini. hit@13 on binary Winner labels — not a rank curve; audit Δ=0 on this set.
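
For readers unfamiliar with the metric, hit@K on binary winner labels can be sketched as below. The exact definition used in the experiment is not given on this page, so this assumes the reading suggested by "8 / 13 winners ranked into the top-13": the share of labeled winners that land in the top K projects by score. The function name `hit_at_k` and the toy cohort are hypothetical.

```python
def hit_at_k(scores, winners, k):
    """Share of labeled winners that appear among the k highest-scored projects.

    ASSUMPTION: this is one plausible definition of hit@K, not necessarily
    the one used in the glasshat-hit-at-13-gemini experiment.
    """
    # Order project ids from highest to lowest score, then keep the top k.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_k = set(ranked[:k])
    # Count how many winners made the cut, as a fraction of all winners.
    return len(top_k & set(winners)) / len(winners)


# Toy cohort: 6 projects, 3 labeled winners, K = 3.
toy_scores = {"a": 9.0, "b": 8.0, "c": 7.0, "d": 6.0, "e": 5.0, "f": 4.0}
toy_winners = {"a", "c", "e"}
print(round(hit_at_k(toy_scores, toy_winners, k=3), 4))  # 0.6667

# The reported figures are consistent with this reading.
print(round(8 / 13, 4))  # 0.6154
print(round(5 / 13, 4))  # 0.3846
```

Under this definition a random ranking is expected to score K/N. The 0.26 chance figure would match 13/50, though the cohort size is not stated here and that is only an inference.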

The judge catches its own over-confidence, live.

A rubric-aware AI evaluator scores with a six-hat panel, grounds every sub-score in retrieved evidence, then pulls its own optimism back to where the evidence actually supports. Every step is a trace span in Arize AX — so the score is one you can open, and check.

Trace it. Trust it.

Live API health: API /health → {"status":"ok"}
Gemini 3.1 Flash-Lite on the Gemini Enterprise Agent Platform · Google ADK 2.0 Workflow · Arize AX · Apache-2.0
Deployed as a genuine ADK agent on Agent Engine · full nested Arize AX trace (104 spans, agent → 6 hats) · live hit@13 0.6154 (8 / 13 winners in the top-13)

Submission · YELLOW hat (optimism) · over-confident

illustrative score · the math runs on real evidence
