Live calibration

Running accuracy and calibration on real submitted claims, as labeled by Veridi administrators. See also: baseline calibration on the 100-claim validation set.

Admin labels are not ground truth — they are admin judgments about ground truth. The numbers on this page measure how often Veridi's verdict agrees with a post-hoc reviewer, not how often Veridi is actually right. For contested or domain-expert claims, single-reviewer judgments are a useful but not final signal. Multi-reviewer labeling is a future addition; until then, treat these numbers as directional.
Claims labeled so far: 5
Need at least 10 labeled claims for Brier to be meaningful
Submitter feedback entries: 43
Across 43 distinct claims

Not enough labeled data yet to compute meaningful selective Brier, coverage, or Murphy decomposition. Check back as more claims are reviewed.
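For readers unfamiliar with the Murphy decomposition named above, the following is a minimal sketch of the standard textbook form, in which the Brier score splits into reliability minus resolution plus uncertainty. It is not taken from Veridi's code: the function name `murphy_decomposition` and the choice to group claims by exact forecast value (rather than by bins) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Sequence


def murphy_decomposition(
    probs: Sequence[float], outcomes: Sequence[int]
) -> tuple[float, float, float]:
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Hypothetical helper: claims are grouped by identical forecast value,
    so Brier == reliability - resolution + uncertainty holds exactly.
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")

    n = len(probs)
    base_rate = sum(outcomes) / n

    # Collect the observed outcomes for each distinct forecast value.
    groups: dict[float, list[int]] = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)

    reliability = 0.0  # how far each forecast sits from its observed frequency
    resolution = 0.0   # how far each group's frequency sits from the base rate
    for p, ys in groups.items():
        observed = sum(ys) / len(ys)
        reliability += len(ys) * (p - observed) ** 2
        resolution += len(ys) * (observed - base_rate) ** 2

    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


def brier(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

With only a handful of labels, each group holds one or two claims and the three terms are dominated by noise, which is one reason a minimum sample is required before reporting them.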

In the meantime, the baseline calibration page reports the measured Brier (0.0253 selective, across 100 pre-labeled claims).
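As a rough illustration of what a selective Brier score and its coverage could look like, here is a sketch under stated assumptions. It assumes "selective" means the score is computed only over claims where the system issued a verdict rather than abstaining, with coverage being the fraction of claims answered; this page does not define the term, so that reading is a guess. The function name `selective_brier` and the `None`-for-abstain convention are hypothetical. The default threshold of 10 mirrors the minimum stated on this page.

```python
from typing import Optional, Sequence


def selective_brier(
    confidences: Sequence[Optional[float]],
    outcomes: Sequence[int],
    min_labeled: int = 10,
) -> tuple[Optional[float], float]:
    """Return (selective Brier or None, coverage).

    confidences[i] is the predicted probability that the verdict is correct,
    or None where the system abstained (assumed convention).
    outcomes[i] is 1 if the reviewer judged the verdict correct, else 0.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have equal length")
    if not confidences:
        return None, 0.0

    # Keep only the claims the system actually answered.
    answered = [(p, y) for p, y in zip(confidences, outcomes) if p is not None]
    coverage = len(answered) / len(confidences)

    # Too few answered claims: report coverage but withhold the score.
    if len(answered) < min_labeled:
        return None, coverage

    score = sum((p - y) ** 2 for p, y in answered) / len(answered)
    return score, coverage
```

Reporting coverage alongside the score matters because a system can lower its selective Brier simply by abstaining on the hard claims.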

Diagnostic tags on labels

When admins mark a label as partial, incorrect, or can't-judge, they can add an optional diagnostic tag. Recurring tags point at methodology failure modes worth investigating.

Tag Count
verdict too strong 1
missing context 1
methodology gap 1

Relationship to the baseline

The baseline calibration page shows Veridi's measured calibration on a curated 100-claim validation set (the original 95 claims plus the GTS-D Wave 1 extension added 2026-05-04). That set is reproducible but, by construction, selected for verifiability. This page instead shows real submitted claims labeled post-hoc by admin reviewers. The two numbers should roughly agree as the live sample grows; persistent divergence is a finding worth investigating.

→ baseline calibration on the 100-claim validation set

User perception (submitter feedback) — Veridi

Submitter feedback on expectation match, reasoning, and evidence. Perception, not ground truth — diverges from admin judgment in informative ways.

Feedback entries: 42
42/223 Veridi claims (19%) with submitter feedback
Mean reasoning rating: 8.3 / 10 (38 ratings)
Mean evidence rating: 8.2 / 10 (38 ratings)
Concerns logged: 2 (5% of feedback rows)

Expectation match

Response Count Share
Lower than I expected 5 12%
Matched my expectation 33 79%
Higher than I expected 0 0%
I had no prior expectation 0 0%

Reasoning rating distribution

Rating Count Share
1 0 0%
2 0 0%
3 0 0%
4 1 3%
5 0 0%
6 2 5%
7 4 11%
8 14 37%
9 11 29%
10 6 16%

Evidence rating distribution

Rating Count Share
1 0 0%
2 1 3%
3 1 3%
4 1 3%
5 0 0%
6 2 5%
7 1 3%
8 13 34%
9 12 32%
10 7 18%
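The reported means can be cross-checked against the two distributions above. The short sketch below recomputes the rating count and mean from the per-rating counts in the tables; the helper name `summarize` is hypothetical, and the counts are copied from this page.

```python
def summarize(counts: dict[int, int]) -> tuple[int, float]:
    """Return (number of ratings, mean rating) from a rating -> count map."""
    total = sum(counts.values())
    mean = sum(rating * n for rating, n in counts.items()) / total
    return total, mean


# Non-zero rows copied from the two distribution tables above.
reasoning = {4: 1, 6: 2, 7: 4, 8: 14, 9: 11, 10: 6}
evidence = {2: 1, 3: 1, 4: 1, 6: 2, 7: 1, 8: 13, 9: 12, 10: 7}

for name, counts in (("reasoning", reasoning), ("evidence", evidence)):
    total, mean = summarize(counts)
    print(f"{name}: {total} ratings, mean {mean:.1f}")
# prints:
# reasoning: 38 ratings, mean 8.3
# evidence: 38 ratings, mean 8.2
```

Both results match the summary figures, which confirms that the means are taken over the 38 rated entries rather than all 42 feedback entries.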

Flagged concerns

Category Count
Wrong interpretation 1
Other 1

Calibration feedback loop

Brier-lite scoring on outcomes from the last 30 / 60 / 90 days. Predicted is the system's confidence at recommendation time; actual is the realized outcome (per the methodology's outcome → ground-truth map). Lower Brier = better-calibrated predictions.

Calibration loop not yet running for this methodology — the 90-day window has fewer than 5 resolvable outcomes.
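The windowed scoring described above can be sketched as follows. This assumes "Brier-lite" is the ordinary mean squared difference between predicted confidence and a 0/1 outcome, which this page does not confirm. The `Outcome` type, its field names, and the function `windowed_brier` are hypothetical; the minimum of 5 outcomes per window follows the threshold stated above.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional, Sequence


class Outcome(NamedTuple):
    resolved_at: datetime  # when the realized outcome was recorded
    predicted: float       # system confidence at recommendation time
    actual: int            # realized outcome mapped to 1 or 0


def windowed_brier(
    outcomes: Sequence[Outcome],
    now: datetime,
    windows: Sequence[int] = (30, 60, 90),
    min_outcomes: int = 5,
) -> dict[int, Optional[float]]:
    """Brier score per trailing window in days, or None if too few outcomes."""
    scores: dict[int, Optional[float]] = {}
    for days in windows:
        cutoff = now - timedelta(days=days)
        recent = [o for o in outcomes if cutoff <= o.resolved_at <= now]
        if len(recent) < min_outcomes:
            scores[days] = None  # window not yet scoreable
        else:
            scores[days] = sum(
                (o.predicted - o.actual) ** 2 for o in recent
            ) / len(recent)
    return scores
```

Returning `None` for an under-filled window, rather than a score over two or three outcomes, keeps a noisy number from being read as a calibration signal.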

Outcome submissions

User-reported outcomes for Veridi fact-checks: did the verdict hold up?

Outcomes recorded: 0
Need at least 5 outcomes for the distribution to be meaningful.

Not enough outcomes yet. Check back as more users opt in to outcome tracking and report back at the scheduled intervals.