Participant

Score your submission against the official rubric, watch the audit catch its own over-confidence live, then iterate on your weakest axis.

Gemini / GEAP: 3.1-flash-lite · Google ADK: orchestrating · Cloud Run: deployed · Arize AX: tracing · Phoenix MCP: calibration loop

Rubric · Repo URL (optional)

Pitch / deck text

Sample result — a cached gemini-3.1-flash-lite evaluation so you can see the output shape. Submit your own above to run it live.

Evaluation pipeline

1. Input: deck + rubric source
2. RubricSynthesizer: official rules → rubric
3. BluePlanner: hats + weights
4. SixHatPanel: 6 hats · evidence
5. Audit: calibrated self-correct (9.0 → 7.8)
6. Final score: rubric native scale

Arize AX: OpenInference spans · one per agent → otlp.arize.com

Phoenix MCP: calibration read + write-back → Audit (live MCP per request · Cloud-SQL-backed Phoenix)

Every agent emits an Arize AX span; the audit consults the calibration path and self-corrects before the score is locked.

Final score

57.8

out of 100

Self-corrections

over-confident axes pulled back

Rubric

rapid-agent

4 criteria · weighted_sum

audit · self-correction: yellow hat over-scored tech-implementation

original: 9.0

corrected: 7.8

delta: -1.16

Basis: calibrated prior · 7 past evaluations · mean over-confidence 1.45 · live-trace variant: Phoenix MCP path

yellow over/under-confident on 'tech-implementation' (evidence=low, mean_delta=+1.45, n=7)

Glasshat catches its own over-confidence and corrects the score before the judge locks it.
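The correction shown above is simple arithmetic, and it can be sketched as follows. This is a minimal illustration, assuming the audit adds the reported signed delta (-1.16) to the hat's original score and rounds to one decimal for display. The helper name `apply_correction` is hypothetical and is not taken from Glasshat's code.

```python
# Hypothetical helper: shift a hat's score by the audit's signed delta.
def apply_correction(original: float, delta: float) -> float:
    return original + delta


# Signed delta as displayed in the audit card above.
REPORTED_DELTA = -1.16

# The two original hat scores that appear in this sample run.
for original in (9.0, 4.0):
    corrected = apply_correction(original, REPORTED_DELTA)
    print(f"{original:.1f} -> {corrected:.1f}")
# prints:
# 9.0 -> 7.8
# 4.0 -> 2.8
```

The reported delta of 1.16 is smaller than the mean over-confidence of 1.45 (the ratio is 0.8). Whether the audit deliberately damps the prior this way is an assumption. The page does not say how the delta is derived.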

Per-criterion scores

Technological Implementation: 3.5 → 3.4/5 · 25% · self-corrected

deck: deck-0

yellow hat · tech-implementation · self-corrected

Score lowered from 9.0 → 7.8 after consulting 7 past evaluations (mean over-confidence 1.45).

yellow over/under-confident on 'tech-implementation' (evidence=low, mean_delta=+1.45, n=7)

Design: 2.2 → 2.1/5 · 25% · self-corrected

deck: deck-0

yellow hat · design · self-corrected

Score lowered from 4.0 → 2.8 after consulting 7 past evaluations (mean over-confidence 1.45).

yellow over/under-confident on 'design' (evidence=low, mean_delta=+1.45, n=7)

Potential Impact: 3.0 → 2.9/5 · 25% · self-corrected

deck: deck-0

yellow hat · potential-impact · self-corrected

Score lowered from 9.0 → 7.8 after consulting 7 past evaluations (mean over-confidence 1.45).

yellow over/under-confident on 'potential-impact' (evidence=low, mean_delta=+1.45, n=7)

Quality of the Idea: 3.2 → 3.1/5 · 25% · self-corrected

deck: deck-0

yellow hat · quality-of-idea · self-corrected

Score lowered from 9.0 → 7.8 after consulting 7 past evaluations (mean over-confidence 1.45).

yellow over/under-confident on 'quality-of-idea' (evidence=low, mean_delta=+1.45, n=7)

Self-correction graph

Axes: score · weight · evidence depth. Amber nodes were self-corrected and reshape from their over-confident origin.

Proof receipt

sample · cached real run

run id: 53e2c03e-2f9a-4c2f-b5ff-e1221287a4f3

From this run · live

final score: 57.79
corrections: 4
criteria: 4
mode: judge
timestamp: 2026-05-22T14:46:08.377401+00:00

Deployment config · static

model: gemini-3.1-flash-lite
tracer: Arize AX · otlp.arize.com
deployment: Cloud Run · panelyst-hackathon · us-central1
consultant: phoenix-mcp · live

"live" fields are read from this run's RunRecord; "static" fields are the deployment's fixed config. The Phoenix MCP calibration loop reads and writes the live dataset over MCP on every request.

Synthesized rubric

preset: rapid-agent · weighted_sum · scale 0-100

Criterion	Weight	Scale	BMAD	Tie-break
Technological Implementation	25%	1–5	B1, B2, C1, C2, C3, C4	1
Design	25%	1–5	D1, D3, A2	2
Potential Impact	25%	1–5	A4, A1	3
Quality of the Idea	25%	1–5	A1, A3	4
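The `weighted_sum` aggregation over this rubric can be approximated from the displayed numbers. The sketch below takes the value shown immediately before "/5" in each per-criterion line and assumes each score is normalised as score / 5 before being rescaled to 0-100. That normalisation and the function name `weighted_sum` are illustrative assumptions, not confirmed details of the pipeline.

```python
# Criterion -> (displayed score on the 1-5 scale, weight from the rubric table).
CRITERIA = {
    "Technological Implementation": (3.4, 0.25),
    "Design": (2.1, 0.25),
    "Potential Impact": (2.9, 0.25),
    "Quality of the Idea": (3.1, 0.25),
}


def weighted_sum(criteria, scale_max=5.0, out_of=100.0):
    """Weighted mean of criterion scores, rescaled to the 0-100 range.

    Assumption: each score is normalised as score / scale_max.
    """
    total_weight = sum(weight for _, weight in criteria.values())
    weighted = sum(score * weight for score, weight in criteria.values())
    return weighted / total_weight / scale_max * out_of


print(round(weighted_sum(CRITERIA), 1))  # prints 57.5
```

The result (57.5) is close to, but not equal to, the reported final score of 57.79. The per-criterion values on the page are rounded to one decimal, which would plausibly account for the gap, though the page does not confirm this.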