Live calibration
Running accuracy and calibration on real submitted claims, as labeled by Veridi administrators. See also: baseline calibration on the 100-claim validation set.
Need at least 10 labeled claims for Brier to be meaningful
Across 43 distinct claims
Not enough labeled data yet to compute meaningful selective Brier, coverage, or Murphy decomposition. Check back as more claims are reviewed.
In the meantime, the baseline calibration page reports the measured Brier (0.0253 selective, across 100 pre-labeled claims).
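For readers unfamiliar with the metrics named above, here is a minimal sketch of how a selective Brier score and coverage might be computed once enough labels exist. It assumes each labeled claim carries a predicted probability, a binary admin label, and an abstention flag. The names `LabeledClaim` and `selective_brier`, the field layout, and the treatment of abstentions are illustrative assumptions, not Veridi's actual implementation. Only the 10-label minimum is taken from this page.

```python
from typing import List, NamedTuple, Optional, Tuple

MIN_LABELED = 10  # minimum stated on this page for Brier to be meaningful


class LabeledClaim(NamedTuple):
    predicted: float  # system confidence that the claim is true, in [0, 1]
    actual: int       # admin label: 1 = true, 0 = false
    abstained: bool   # hypothetical flag: system declined to issue a verdict


def selective_brier(
    claims: List[LabeledClaim],
) -> Optional[Tuple[Optional[float], float]]:
    """Return (selective Brier, coverage), or None with too few labels.

    Assumption: "selective" means the score is averaged only over claims
    the system answered, and coverage is the answered fraction.
    """
    if len(claims) < MIN_LABELED:
        return None
    answered = [c for c in claims if not c.abstained]
    coverage = len(answered) / len(claims)
    if not answered:
        return (None, coverage)
    # Brier score: mean squared gap between confidence and the 0/1 outcome.
    brier = sum((c.predicted - c.actual) ** 2 for c in answered) / len(answered)
    return (brier, coverage)
```

Under these assumptions, a perfectly confident and correct system scores 0, and a system that always answers 0.5 scores 0.25, which is why lower values indicate better calibration.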
Diagnostic tags on labels
When admins mark a label as partial, incorrect, or can't-judge, they can add an optional diagnostic tag. Recurring tags point at methodology failure modes worth investigating.
| Tag | Count |
|---|---|
| verdict too strong | 1 |
| missing context | 1 |
| methodology gap | 1 |
Relationship to the baseline
The baseline calibration page shows Veridi's measured calibration on a curated 100-claim validation set (the original 95 claims plus the GTS-D Wave 1 extension added 2026-05-04). It is reproducible, but by construction selected for verifiability. This page shows real submitted claims labeled post-hoc by admin reviewers. The two numbers should roughly agree as the live sample grows; persistent divergence is a finding worth investigating.
User perception (submitter feedback) — Veridi
Submitter feedback on expectation match, reasoning, and evidence. This measures perception, not ground truth; it diverges from admin judgment in informative ways.
42/223 Veridi claims (19%) with submitter feedback
Reasoning: 38 ratings
Evidence: 38 ratings
Flagged concerns: 5% of feedback rows
Expectation match
| Response | Count | Share |
|---|---|---|
| Lower than I expected | 5 | 12% |
| Matched my expectation | 33 | 79% |
| Higher than I expected | 0 | 0% |
| I had no prior expectation | 0 | 0% |
Reasoning rating distribution
| Rating | Count | Share |
|---|---|---|
| 1 | 0 | 0% |
| 2 | 0 | 0% |
| 3 | 0 | 0% |
| 4 | 1 | 3% |
| 5 | 0 | 0% |
| 6 | 2 | 5% |
| 7 | 4 | 11% |
| 8 | 14 | 37% |
| 9 | 11 | 29% |
| 10 | 6 | 16% |
Evidence rating distribution
| Rating | Count | Share |
|---|---|---|
| 1 | 0 | 0% |
| 2 | 1 | 3% |
| 3 | 1 | 3% |
| 4 | 1 | 3% |
| 5 | 0 | 0% |
| 6 | 2 | 5% |
| 7 | 1 | 3% |
| 8 | 13 | 34% |
| 9 | 12 | 32% |
| 10 | 7 | 18% |
Flagged concerns
| Category | Count |
|---|---|
| Wrong interpretation | 1 |
| Other | 1 |
Calibration feedback loop
Brier-lite scoring on outcomes from the last 30 / 60 / 90 days. Predicted is the system's confidence at recommendation time; actual is the realized outcome (per the methodology's outcome → ground-truth map). Lower Brier = better-calibrated predictions.
Calibration loop not yet running for this methodology — the 90-day window has fewer than 5 resolvable outcomes.
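The windowed scoring described above could be sketched roughly as follows. This is an illustration under stated assumptions: each outcome is a `(resolved_at, predicted, actual)` tuple, "Brier-lite" is treated as a plain mean squared error, and the 5-outcome minimum is applied to each window separately. The function name `windowed_brier` and the data layout are hypothetical; only the 30 / 60 / 90 day windows and the minimum of 5 come from this page.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional, Tuple

MIN_OUTCOMES = 5             # minimum resolvable outcomes stated on this page
WINDOWS_DAYS = (30, 60, 90)  # look-back windows stated on this page

# Hypothetical record layout: (resolved_at, predicted confidence, 0/1 outcome)
Outcome = Tuple[datetime, float, int]


def windowed_brier(
    outcomes: Iterable[Outcome], now: datetime
) -> Dict[int, Optional[float]]:
    """Map each window length in days to a Brier score, or None if too sparse."""
    records = list(outcomes)
    results: Dict[int, Optional[float]] = {}
    for days in WINDOWS_DAYS:
        cutoff = now - timedelta(days=days)
        in_window = [(p, a) for (t, p, a) in records if cutoff <= t <= now]
        if len(in_window) < MIN_OUTCOMES:
            results[days] = None  # not enough outcomes to score this window
        else:
            results[days] = sum((p - a) ** 2 for p, a in in_window) / len(in_window)
    return results
```

Returning `None` for a sparse window keeps a handful of early outcomes from producing a misleadingly precise number.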
Outcome submissions
User-reported outcomes for Veridi fact-checks: did the verdict hold up?
Need at least 5 outcomes for distribution to be meaningful.
Not enough outcomes yet. Check back as more users opt in to outcome tracking and report back at the scheduled intervals.