Participant
Score your submission against the official rubric, watch the audit catch its own over-confidence live, then iterate on your weakest axis.
Sample result — a cached gemini-3.1-flash-lite evaluation so you can see the output shape. Submit your own above to run it live.
Evaluation pipeline
- 1 · Input: deck + rubric source
- 2 · RubricSynthesizer: official rules → rubric
- 3 · BluePlanner: hats + weights
- 4 · SixHatPanel: 6 hats · evidence
- 5 · Audit: calibrated self-correct (9.0 → 7.8)
- 6 · Final score: rubric native scale
Every agent emits an Arize AX span; the audit consults the calibration path and self-corrects before the score is locked.
Basis: calibrated prior · 7 past evaluations · mean over-confidence 1.45 · live-trace variant: Phoenix MCP path
yellow over/under-confident on 'tech-implementation' (evidence=low, mean_delta=+1.45, n=7)
Glasshat catches its own over-confidence and corrects the score before the judge locks it.
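The correction step can be sketched as follows. This is a minimal illustration, not Glasshat's actual rule: `PastEvaluation`, `mean_over_confidence`, `audit_correct`, and the `damping` and `floor` parameters are all hypothetical names introduced here. Note that the sample lowers each score by 1.2 while reporting a mean over-confidence of 1.45, so the deployed rule is evidently not a plain subtraction; the page does not say what it is.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PastEvaluation:
    """One earlier run: the panel's raw score and the reference it was compared to."""
    raw_score: float
    reference_score: float


def mean_over_confidence(history: list[PastEvaluation]) -> float:
    """Average signed gap (raw - reference); positive means the panel scored too high."""
    return mean(e.raw_score - e.reference_score for e in history)


def audit_correct(raw_score: float, mean_delta: float,
                  damping: float = 1.0, floor: float = 0.0) -> float:
    """Shift a raw score by damping * mean_delta, never dropping below floor.

    damping is an assumed knob for illustration; the page does not document one.
    """
    return max(floor, raw_score - damping * mean_delta)


# Plain subtraction of the reported mean over-confidence.
plain = audit_correct(9.0, 1.45)

# A damping back-fitted from the sample's 1.2-point step (an assumption, not a documented value).
damped = audit_correct(9.0, 1.45, damping=1.2 / 1.45)
```

With `damping=1.0` the corrected score is 7.55, not the 7.8 shown in the sample; a damping of roughly 0.83 would match the sample, but that value is inferred from the four displayed corrections rather than taken from any documentation.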
Per-criterion scores
- deck: deck-0
Score lowered from 9.0 → 7.8 after consulting 7 past evaluations (mean over-confidence 1.45).
yellow over/under-confident on 'tech-implementation' (evidence=low, mean_delta=+1.45, n=7)
- deck: deck-0
Score lowered from 4.0 → 2.8 after consulting 7 past evaluations (mean over-confidence 1.45).
yellow over/under-confident on 'design' (evidence=low, mean_delta=+1.45, n=7)
- deck: deck-0
Score lowered from 9.0 → 7.8 after consulting 7 past evaluations (mean over-confidence 1.45).
yellow over/under-confident on 'potential-impact' (evidence=low, mean_delta=+1.45, n=7)
- deck: deck-0
Score lowered from 9.0 → 7.8 after consulting 7 past evaluations (mean over-confidence 1.45).
yellow over/under-confident on 'quality-of-idea' (evidence=low, mean_delta=+1.45, n=7)
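The calibration notes above share one fixed shape, so they can be read back into structured fields. The sketch below is one way to do that with a regular expression. `CalibrationNote` and `parse_note` are hypothetical names, and treating the leading word (`yellow`) as the hat name is an assumption based on the SixHatPanel step.

```python
import re
from typing import NamedTuple, Optional


class CalibrationNote(NamedTuple):
    hat: str           # assumed: the leading word names the hat
    criterion: str     # e.g. 'tech-implementation'
    evidence: str      # e.g. 'low'
    mean_delta: float  # signed mean over-confidence
    n: int             # number of past evaluations consulted


NOTE_RE = re.compile(
    r"^(?P<hat>\w+) over/under-confident on '(?P<criterion>[^']+)' "
    r"\(evidence=(?P<evidence>\w+), "
    r"mean_delta=(?P<delta>[+-]?\d+(?:\.\d+)?), n=(?P<n>\d+)\)$"
)


def parse_note(line: str) -> Optional[CalibrationNote]:
    """Return the parsed note, or None when the line does not match the shape."""
    m = NOTE_RE.match(line.strip())
    if m is None:
        return None
    return CalibrationNote(
        m["hat"], m["criterion"], m["evidence"], float(m["delta"]), int(m["n"])
    )


note = parse_note(
    "yellow over/under-confident on 'design' (evidence=low, mean_delta=+1.45, n=7)"
)
```

Returning `None` on a mismatch keeps the parser safe to run over mixed page text where only some lines are calibration notes.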
Self-correction graph
Axes: score · weight · evidence depth. Amber nodes were self-corrected and reshape from their over-confident origin.
Proof receipt
sample · cached real run 53e2c03e-2f9a-4c2f-b5ff-e1221287a4f3
- final score: 57.79
- corrections: 4
- criteria: 4
- mode: judge
- timestamp: 2026-05-22T14:46:08.377401+00:00
- model: gemini-3.1-flash-lite
- tracer: Arize AX · otlp.arize.com
- deployment: Cloud Run · panelyst-hackathon · us-central1
- consultant: phoenix-mcp · live
Live fields are read from this run's RunRecord; static fields are the deployment's fixed config. The Phoenix MCP calibration loop reads and writes the live dataset over MCP per request.
Synthesized rubric
| Criterion | Weight | Scale | BMAD | Tie-break |
|---|---|---|---|---|
| Technological Implementation (judging criterion 1 of 4) | 25% | 1–5 | B1, B2, C1, C2, C3, C4 | 1 |
| Design (judging criterion 2 of 4) | 25% | 1–5 | D1, D3, A2 | 2 |
| Potential Impact (judging criterion 3 of 4) | 25% | 1–5 | A4, A1 | 3 |
| Quality of the Idea (judging criterion 4 of 4) | 25% | 1–5 | A1, A3 | 4 |
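A rubric with equal weights and an ordered tie-break can be combined into a ranking as sketched below. The criterion names, weights, and tie-break order come from the table. The `Criterion` class, `weighted_total`, `ranking_key`, and the two example score sets are hypothetical. The sketch does not attempt to reproduce the 57.79 in the proof receipt, whose scaling is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float   # fraction of the total, e.g. 0.25 for 25%
    tie_break: int  # 1 = compared first when totals tie


RUBRIC = [
    Criterion("Technological Implementation", 0.25, 1),
    Criterion("Design", 0.25, 2),
    Criterion("Potential Impact", 0.25, 3),
    Criterion("Quality of the Idea", 0.25, 4),
]


def weighted_total(scores: dict[str, float]) -> float:
    """Weighted sum of per-criterion scores on the rubric's 1-5 scale."""
    return sum(c.weight * scores[c.name] for c in RUBRIC)


def ranking_key(scores: dict[str, float]) -> tuple:
    """Sort key: higher total first, then higher score in tie-break order."""
    ordered = sorted(RUBRIC, key=lambda c: c.tie_break)
    return (-weighted_total(scores), *(-scores[c.name] for c in ordered))


# Two made-up submissions with the same total (4.0) but different profiles.
team_a = {c.name: 4.0 for c in RUBRIC}
team_b = {"Technological Implementation": 5.0, "Design": 3.0,
          "Potential Impact": 4.0, "Quality of the Idea": 4.0}
ranked = sorted([team_a, team_b], key=ranking_key)
```

Here both submissions total 4.0, and `team_b` ranks first because it scores higher on Technological Implementation, the first tie-break criterion.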