Glasshat

Participant

Score your submission against the official rubric, watch the audit catch its own over-confidence live, then iterate on your weakest axis.

Gemini / GEAP: 3.1-flash-lite · Google ADK: orchestrating · Cloud Run: deployed · Arize AX: tracing · Phoenix MCP: calibration loop

Sample result — a cached gemini-3.1-flash-lite evaluation so you can see the output shape. Submit your own above to run it live.

Evaluation pipeline

  1. Input: deck + rubric source
  2. RubricSynthesizer: official rules → rubric
  3. BluePlanner: hats + weights
  4. SixHatPanel: 6 hats · evidence
  5. Audit: calibrated self-correct (9.0 → 7.8)
  6. Final score: rubric native scale

Arize AX: OpenInference spans · one per agent → otlp.arize.com
Phoenix MCP: calibration read + write-back → Audit (live MCP per request · Cloud-SQL-backed Phoenix)

Every agent emits an Arize AX span; the audit consults the calibration path and self-corrects before the score is locked.
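
The ordering described here can be sketched as a sequential pipeline that records one span per agent. This is an illustration only: the `span` recorder, `run_pipeline`, and the placeholder stage bodies are hypothetical stand-ins, and the deployed app is described as using Google ADK with OpenInference spans exported to Arize AX rather than an in-memory list.

```python
import time
from contextlib import contextmanager

SPANS = []  # hypothetical in-memory stand-in for a span exporter


@contextmanager
def span(agent_name):
    """Record one span per agent, even if the stage raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({"agent": agent_name, "seconds": time.perf_counter() - start})


def run_pipeline(deck, rubric_source, stages):
    """Run the stages in order, passing a shared state dict along."""
    state = {"deck": deck, "rubric_source": rubric_source}
    for name, stage in stages:
        with span(name):
            state = stage(state)
    return state


# Placeholder stage bodies (assumptions); only the agent names come from the page.
stages = [
    ("RubricSynthesizer", lambda s: {**s, "rubric": "rapid-agent"}),
    ("BluePlanner", lambda s: {**s, "hat_count": 6}),
    ("SixHatPanel", lambda s: {**s, "yellow_score": 9.0}),
    ("Audit", lambda s: {**s, "yellow_score": 7.8}),  # self-correct before locking
]

result = run_pipeline("deck-0", "official rules", stages)
print([s["agent"] for s in SPANS])
```

Keeping the audit as the last stage before the final score mirrors the page's claim that the correction happens before the score is locked.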

Final score: 57.8 out of 100
Self-corrections: 4 over-confident axes pulled back
Rubric: rapid-agent · 4 criteria · weighted_sum

Audit · self-correction: yellow hat over-scored tech-implementation
original: 9.0
corrected: 7.8
delta: -1.16

Basis: calibrated prior · 7 past evaluations · mean over-confidence 1.45 · live-trace variant: Phoenix MCP path

yellow over/under-confident on 'tech-implementation' (evidence=low, mean_delta=+1.45, n=7)

Glasshat catches its own over-confidence and corrects the score before the judge locks it.
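
The sample figures are consistent with a damped correction: 0.8 × 1.45 = 1.16, and 9.0 − 1.16 = 7.84, which displays as 7.8. A minimal sketch of that arithmetic follows. The function name `self_correct` and the 0.8 damping factor are assumptions inferred from the displayed numbers, not taken from Glasshat's code.

```python
def self_correct(original: float, mean_over_confidence: float, damping: float = 0.8):
    """Pull an over-confident score back by a damped share of the historical bias.

    `damping` is an assumed value inferred from the sample figures.
    Returns (corrected_score, delta), where delta is negative for a pull-back.
    """
    delta = -damping * mean_over_confidence
    return original + delta, delta


corrected, delta = self_correct(9.0, 1.45)
print(f"corrected={corrected:.1f} delta={delta:.2f}")
# prints: corrected=7.8 delta=-1.16
```

The same arithmetic reproduces the other correction shown in the sample, 4.0 to 2.8, since 4.0 − 1.16 = 2.84.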

Per-criterion scores

Technological Implementation: 3.5 → 3.4/5 · 25% · self-corrected
  • deck: deck-0
yellow hat · tech-implementation · self-corrected

Score lowered from 9.0 to 7.8 after consulting 7 past evaluations (mean over-confidence 1.45).

yellow over/under-confident on 'tech-implementation' (evidence=low, mean_delta=+1.45, n=7)

Design: 2.2 → 2.1/5 · 25% · self-corrected
  • deck: deck-0
yellow hat · design · self-corrected

Score lowered from 4.0 to 2.8 after consulting 7 past evaluations (mean over-confidence 1.45).

yellow over/under-confident on 'design' (evidence=low, mean_delta=+1.45, n=7)

Potential Impact: 3.0 → 2.9/5 · 25% · self-corrected
  • deck: deck-0
yellow hat · potential-impact · self-corrected

Score lowered from 9.0 to 7.8 after consulting 7 past evaluations (mean over-confidence 1.45).

yellow over/under-confident on 'potential-impact' (evidence=low, mean_delta=+1.45, n=7)

Quality of the Idea: 3.2 → 3.1/5 · 25% · self-corrected
  • deck: deck-0
yellow hat · quality-of-idea · self-corrected

Score lowered from 9.0 to 7.8 after consulting 7 past evaluations (mean over-confidence 1.45).

yellow over/under-confident on 'quality-of-idea' (evidence=low, mean_delta=+1.45, n=7)

Self-correction graph

Axes: score · weight · evidence depth. Amber nodes were self-corrected and reshape from their over-confident origin.

Proof receipt

sample · cached real run
run id: 53e2c03e-2f9a-4c2f-b5ff-e1221287a4f3

From this run (live)
final score: 57.79
corrections: 4
criteria: 4
mode: judge
timestamp: 2026-05-22T14:46:08.377401+00:00

Deployment config (static)
model: gemini-3.1-flash-lite
tracer: Arize AX · otlp.arize.com
deployment: Cloud Run · panelyst-hackathon · us-central1
consultant: phoenix-mcp · live

Live fields are read from this run's RunRecord; static fields are the deployment's fixed config. The Phoenix MCP calibration loop reads and writes the live dataset over MCP on every request.

Synthesized rubric

preset: rapid-agent · weighted_sum · scale 0-100
Criterion | Weight | Scale | BMAD | Tie-break
Technological Implementation | 25% | 1–5 | B1, B2, C1, C2, C3, C4 | 1
Design | 25% | 1–5 | D1, D3, A2 | 2
Potential Impact | 25% | 1–5 | A4, A1 | 3
Quality of the Idea | 25% | 1–5 | A1, A3 | 4
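
To show how a weighted_sum rubric with four 25% criteria on a 1–5 scale could map to a 0-100 final score, here is a small sketch using the per-criterion scores displayed for the sample run (3.4, 2.1, 2.9, 3.1 out of 5). The normalization `score / 5 × 100` and the function name `weighted_sum_score` are assumptions; the page does not spell out the formula.

```python
# (score on the 1-5 scale, weight) per criterion, as displayed for the sample run
criteria = {
    "Technological Implementation": (3.4, 0.25),
    "Design": (2.1, 0.25),
    "Potential Impact": (2.9, 0.25),
    "Quality of the Idea": (3.1, 0.25),
}


def weighted_sum_score(criteria, scale_max=5.0, out_max=100.0):
    """Weighted sum of criterion scores, rescaled to 0..out_max (assumed formula)."""
    total_weight = sum(weight for _, weight in criteria.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1, got {total_weight}")
    weighted = sum(score * weight for score, weight in criteria.values())
    return weighted / scale_max * out_max


final = weighted_sum_score(criteria)
print(round(final, 2))
```

With the rounded scores shown on the page this gives roughly 57.5, close to the reported 57.79; the gap is presumably due to the displayed scores being rounded to one decimal place.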