clip( score − 0.8 · mean_delta , p25 , p75 ) · ±2.0 cap · spike-D anchors · evidence-bucketed · YELLOW hat → optimism flagged · Δ = 9.0 − 7.6 = 1.4 · p25 ≤ calibrated ≤ p75 · audit the judge, including itself

evidence > assumption · confident · fast · unaccountable → audited

the judge scores 9.0 · then catches its own over-confidence

calibrated 7.6 · pulled to where the retrieved evidence holds

Illustrative correction · clip(score − 0.8·mean_delta, p25, p75), ±2.0 cap · prior recovered from held-out spike-D anchors.
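
The correction above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Glasshat's actual implementation: the function name `calibrate`, the order of operations (cap the shift at ±2.0 first, then clip into the [p25, p75] band), and the sample values for `mean_delta`, `p25` and `p75` are all hypothetical. Only the formula, the 0.8 factor and the ±2.0 cap come from the page.

```python
def calibrate(score: float, mean_delta: float, p25: float, p75: float,
              shrink: float = 0.8, cap: float = 2.0) -> float:
    """Pull a raw score toward the evidence-supported band.

    Assumed order of operations (not confirmed by the source):
    1. scale the mean over-confidence delta by `shrink`,
    2. limit that shift to +/- `cap`,
    3. clip the shifted score into [p25, p75].
    """
    shift = shrink * mean_delta
    shift = max(-cap, min(cap, shift))   # +/-2.0 cap on the correction
    shifted = score - shift
    return max(p25, min(p75, shifted))   # clip(..., p25, p75)


# Hypothetical inputs, chosen so the result lines up with the 9.0 -> 7.6 example.
raw = 9.0
calibrated = calibrate(raw, mean_delta=1.75, p25=6.0, p75=8.5)
print(round(calibrated, 2), round(raw - calibrated, 2))
```

Capping before clipping keeps the result inside the band, which matches the stated invariant p25 ≤ calibrated ≤ p75; a different ordering would be equally plausible from the formula alone.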

Evaluation constellation

AI writes the submissions now — by the thousands. The judging didn't change.

Trace it. Trust it.

Glasshat doesn't just judge — it audits the judge. Every hat, every sub-score opens its own trace span in Arize AX; the panel catches its own over-confidence and self-corrects, live. So the score isn't a black box — it's one you can check. Not a chatbot.

Score a submission → · Judge a cohort

Arize track
Gemini · Vertex AI
Google ADK
Phoenix + MCP

Yellow hat · over-confident

clip(score − 0.8·mean_delta, p25, p75) · ±2.0 cap, from a prior recovered from held-out spike-D anchors.

Illustrative.

White
Red
Black
Green
Blue
Yellow → calibrated

Glasshat · it audits the judge

Same cohort. Two truths. The audit decides who wins.

A raw AI panel hands the loudest, least-evidenced project the crown. Watch the board tilt: once calibration lands on retrieved evidence, a better-evidenced project rises to #1. No one touches it; the audit happens on its own.

Cohort ranking · raw consensus

over-confident · evidence not yet applied

#3 · Halcyon · grid-balancing model · deep evidence · repo + deck grounded · 7.9 score / 10 · calibrated ↑ from 7.9
#1 · Nimbus · pitch-perfect demo · thin evidence · over-confident yellow hat · 9.4 score / 10 · pulled back −1.3 · within ±2.0 cap
#2 · Tessera · multi-agent router · mixed evidence · partial grounding · 8.6 score / 10 · calibrated −0.8
#4 · Quillwright · doc-search tool · mid evidence · sparse retrieval · 7.1 score / 10 · calibrated −0.2

Halcyon rises to #1 once evidence is weighed. clip(score − 0.8·mean_delta, p25, p75), ±2.0 cap — a calibration prior recovered from held-out anchors.

Projects, scores, ranks and deltas are illustrative: a depiction of the re-ranking mechanism, not live results.
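
The re-ranking itself is just a sort before and after the deltas are applied. The sketch below uses the illustrative scores and deltas from the board. Halcyon's uplift is not given on the page (it only says the score moves up from 7.9), so the `+0.3` here is a hypothetical placeholder chosen purely to show the flip, and the helper name `rank` is likewise an assumption.

```python
# Raw panel scores and calibration deltas from the illustrative board.
# Halcyon's uplift is NOT stated on the page; +0.3 is a hypothetical placeholder.
cohort = {
    "Halcyon": (7.9, +0.3),      # assumed uplift
    "Nimbus": (9.4, -1.3),
    "Tessera": (8.6, -0.8),
    "Quillwright": (7.1, -0.2),
}


def rank(scores: dict[str, float]) -> list[str]:
    """Return project names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


raw_scores = {name: raw for name, (raw, _) in cohort.items()}
calibrated_scores = {name: raw + delta for name, (raw, delta) in cohort.items()}

raw_order = rank(raw_scores)                # Nimbus leads on raw consensus
calibrated_order = rank(calibrated_scores)  # Halcyon leads once deltas are applied
print(raw_order)
print(calibrated_order)
```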

See the rank-flip on /judge

Built for the Arize track

The audit is auditable, on Arize AX.

Glasshat doesn't hide the judgment — it traces it. Every agent, every one of the six hats, and the self-correction itself opens its own trace span in Arize AX. So you don't take the score on faith — you open the trace of how it was judged and audited, and check it. Trace it. Trust it.

The calibration consultant reads a Phoenix dataset over MCP and writes each correction back. That learning loop runs live on the Cloud Run demo, which uses a Cloud-SQL-backed Phoenix. The live Arize AX results (the full trace tree and the hit@13 experiment) come from the credentialed Agent Engine run shown next.

Arize AX · OpenInference/OTLP
Phoenix · MCP (live)
every hat = a span

Genuinely deployed · genuinely measured

Not a mock. The evaluation brain is a real ADK agent on Agent Engine — traced, and scored.

ADK 2.0 Workflow graph · Deployed on Gemini Enterprise Agent Platform · Agent Engine / Runtime · serving live stream_query · managed Sessions + AGENT_IDENTITY
104 spans · Full nested Arize AX trace · agent → Workflow → 6 hats (generate + embed) · verified via client.spans.list(project=glasshat)
hit@13 0.6154 · 8 / 13 winners ranked into the top-13 · real Gemini · vs 0.3846 mock · 0.26 chance

Live Agent Engine resource reasoningEngines/7480191458771730432 · AX experiment glasshat-hit-at-13-gemini. hit@13 on binary Winner labels — not a rank curve; audit Δ=0 on this set.
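
For readers unfamiliar with the metric, here is a minimal sketch of hit@k on binary winner labels. It assumes the denominator is the number of labeled winners, which is consistent with the "8 / 13 winners" figure above but is not spelled out on the page. The function name `hit_at_k` and the toy cohort are hypothetical.

```python
def hit_at_k(scores: dict[str, float], winners: set[str], k: int) -> float:
    """Fraction of labeled winners that land in the top-k by score."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for name in top_k if name in winners)
    return hits / len(winners)


# Toy cohort (hypothetical data): 6 submissions, 3 labeled winners.
scores = {"a": 9.1, "b": 8.4, "c": 7.7, "d": 7.2, "e": 6.5, "f": 5.9}
winners = {"a", "c", "e"}
print(hit_at_k(scores, winners, k=3))  # "a" and "c" are in the top 3, "e" is not

# The headline figure is the same ratio with 8 of 13 winners in the top 13.
print(round(8 / 13, 4))
```

Because the labels are binary, the metric only says whether winners made the cut, not how far up the list they sit, which is why the page calls it "not a rank curve".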

Glasshat · it audits the judge

The judge catches its own over-confidence, live.

A rubric-aware AI evaluator scores with a six-hat panel, grounds every sub-score in retrieved evidence, then pulls its own optimism back to where the evidence actually supports. Every step is a trace span in Arize AX — so the score is one you can open, and check.

Trace it. Trust it.

Watch a single audit → · See rank-flip →

Live API health: API /health → {"status":"ok"}
Gemini 3.1 Flash-Lite on the Gemini Enterprise Agent Platform · Google ADK 2.0 Workflow · Arize AX · Apache-2.0
Deployed as a genuine ADK agent on Agent Engine · full nested Arize AX trace (104 spans, agent → 6 hats) · live hit@13 0.6154 (8 / 13 winners in the top-13)

Submission · YELLOW hat (optimism) · over-confident

illustrative score · the math runs on real evidence
