Calibration

How often is Veridi right, how well does its stated confidence match reality, and when does it correctly say "I don't know"? Baseline measurements on a curated validation set. See also: live calibration on real submitted claims.

Live calibration for Pragma and Praxis is now available; see Live calibration (Pragma · Praxis).

Methodology version anchor. The metrics below were produced under methodology versions v1.0 (GTS-A + ADV-v1, 2026-02), v2.2 (GTS-B + ADV-v2, 2026-02-25), v2.4 (GTS-C, 2026-03), and v1.1-era runtime instructions (GTS-D extension, 2026-05-04). Current production methodology is v2.8 (Veridi v1.1, Pragma v1.6, Praxis v1.4 on the rigor-extension axis), deployed 2026-05-02.

The metrics on this page reflect the validation-version run, not the live substrate. For metrics from claims fact-checked by the current substrate, see live calibration (Veridi default) or Pragma / Praxis.

Selective Brier: 0.0253. Calibration on the 89 committed verdicts. 95% CI [0.0193, 0.0320].
Coverage: 89.0%. 89/100 claims where the system committed to a verdict.
Abstention correctness: 11/11 (100%). When Veridi said UNVERIFIABLE, was that the right call?
Overall accuracy: 99.5%. 99 correct, 1 partial across all 100 claims.

What the numbers mean, plainly. Fact-checking systems that can answer "I don't know" need to be evaluated on two separate questions: when the system commits to a verdict, is it calibrated? And when the system abstains, was that correct? A raw Brier score lumps these together and penalizes a correct abstention the same as a wrong commitment. We follow Fisch et al. (2022) and report selective Brier on committed verdicts and abstention correctness separately; those are the 0.0253 and 11/11 figures above. The raw Brier across all 100 claims is 0.0745, a lower-quality signal reported here only because prior literature uses it.
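
The two-part evaluation described above can be sketched as follows. This is a minimal illustration, not Veridi's evaluation code: the `Verdict` record, its field names, and the toy data are hypothetical, and scoring a partial verdict as 0.5 is an assumption (it is consistent with the 99.5% overall accuracy reported above).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    # Hypothetical record layout, chosen for illustration only.
    confidence: float                          # stated confidence in [0, 1]
    committed: bool                            # False when the system abstained
    outcome: Optional[float] = None            # committed only: 1.0 correct, 0.5 partial, 0.0 wrong
    abstention_correct: Optional[bool] = None  # abstentions only


def selective_report(verdicts):
    """Score committed verdicts and abstentions separately."""
    committed = [v for v in verdicts if v.committed]
    abstained = [v for v in verdicts if not v.committed]
    # Selective Brier: mean squared gap between confidence and outcome,
    # taken over committed verdicts only.
    brier = (
        sum((v.confidence - v.outcome) ** 2 for v in committed) / len(committed)
        if committed else None
    )
    return {
        "selective_brier": brier,
        "coverage": len(committed) / len(verdicts),
        "abstentions_correct": sum(1 for v in abstained if v.abstention_correct),
        "abstentions_total": len(abstained),
    }


# Made-up toy data, not the validation set.
toy = [
    Verdict(0.90, True, outcome=1.0),
    Verdict(0.80, True, outcome=1.0),
    Verdict(0.58, True, outcome=1.0),
    Verdict(0.45, False, abstention_correct=True),
]
report = selective_report(toy)
print(report)
```

Keeping the two results in separate fields means a correct abstention never contributes to the Brier term, which is the point of reporting them separately.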

Calibration-coverage curve

As we demand higher confidence to count a verdict as "committed," calibration improves and coverage drops. Each point is a confidence threshold. Bigger points = more claims retained.

[Figure: calibration-coverage curve. x-axis: coverage (fraction of committed verdicts retained), 0% to 100%; y-axis: selective Brier (lower is better), 0.000 to 0.030. Plotted points:]

Threshold | n | Coverage | Selective Brier
≥95% | 19 | 21% | 0.0019
≥90% | 38 | 43% | 0.0049
≥85% | 58 | 65% | 0.0096
≥80% | 77 | 87% | 0.0162
≥75% | 85 | 96% | 0.0199
≥70% | 85 | 96% | 0.0199
≥65% | 87 | 98% | 0.0222
≥60% | 88 | 99% | 0.0236
≥55% | 89 | 100% | 0.0253
≥50% | 89 | 100% | 0.0253
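
A curve of this kind can be produced by sweeping a confidence threshold over the committed verdicts. The sketch below assumes each committed verdict is a `(confidence, outcome)` pair with outcome 1.0, 0.5, or 0.0; the function name and the toy data are hypothetical.

```python
def coverage_curve(committed, thresholds):
    """For each threshold, keep verdicts with confidence >= threshold and
    return (threshold, n, coverage, selective_brier) rows."""
    total = len(committed)
    rows = []
    for t in sorted(thresholds, reverse=True):
        kept = [(c, o) for c, o in committed if c >= t]
        if not kept:
            continue  # no verdicts clear this threshold
        brier = sum((c - o) ** 2 for c, o in kept) / len(kept)
        rows.append((t, len(kept), len(kept) / total, brier))
    return rows


# Made-up toy data, not the validation set.
toy_committed = [(0.95, 1.0), (0.90, 1.0), (0.80, 1.0), (0.60, 0.5)]
curve = coverage_curve(toy_committed, [0.5, 0.9])
for t, n, cov, brier in curve:
    print(f"threshold >= {t:.0%}: n={n}, coverage={cov:.0%}, Brier={brier:.4f}")
```

Coverage here is measured against the committed verdicts, not against all claims, matching the axis label of the figure.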

Murphy decomposition

Brier ≈ Reliability − Resolution + Uncertainty (Murphy 1973). Computed on committed verdicts only.

Component | Value | Direction | Interpretation
Reliability (REL) | 0.0224 | lower is better | How well stated confidence matches observed frequency
Resolution (RES) | 0.0000 | higher is better | How much predictions vary between correct and incorrect outcomes
Uncertainty (UNC) | 0.0056 | baseline | Irreducible variability in outcomes; context for interpreting the other two

Note on Resolution ≈ 0: this test set is too uniformly correct (99 correct and 1 partial out of 100, scored as 99.5% with the partial counted as half) for Resolution to be informative. Resolution measures how much predictions discriminate between correct and incorrect outcomes, but when almost everything is correct there is little variance to discriminate. This is a real finding about the test set, not about the system: a harder validation suite is needed to exercise Resolution.
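
As a rough check, the three components can be recomputed from the per-confidence-band summary reported further down this page. The sketch below uses the standard binned form of the Murphy decomposition; treating the ten-point confidence bands as the bins and scoring the one partial verdict as 0.5 are assumptions, and the inputs are the rounded values from the band table rather than the raw verdicts.

```python
def murphy_decomposition(bins):
    """bins: list of (n_k, mean_confidence_k, observed_accuracy_k).
    Returns (reliability, resolution, uncertainty)."""
    n = sum(nk for nk, _, _ in bins)
    base_rate = sum(nk * ok for nk, _, ok in bins) / n
    rel = sum(nk * (fk - ok) ** 2 for nk, fk, ok in bins) / n
    res = sum(nk * (ok - base_rate) ** 2 for nk, _, ok in bins) / n
    unc = base_rate * (1 - base_rate)
    return rel, res, unc


# (claims, mean confidence, observed accuracy) per band, from the band table.
# 80-89% band: 38 correct + 1 partial, scored as 38.5 / 39 (assumption).
BANDS = [
    (38, 0.934, 1.0),
    (39, 0.842, 38.5 / 39),
    (8, 0.765, 1.0),
    (3, 0.640, 1.0),
    (1, 0.580, 1.0),
]

rel, res, unc = murphy_decomposition(BANDS)
print(f"REL={rel:.4f} RES={res:.4f} UNC={unc:.4f}")
# prints: REL=0.0224 RES=0.0000 UNC=0.0056
```

Under these assumptions the result agrees with the table above to four decimal places. REL − RES + UNC comes to about 0.0280 rather than the selective Brier of 0.0253, which is why the identity is stated with ≈: the binned decomposition is not exact when confidences vary within a band.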

Reliability diagram — committed verdicts only

Each point is a confidence band on the 89 committed verdicts. Point size is proportional to the number of claims in that band. Dashed line = perfect calibration.

[Figure: reliability diagram. x-axis: predicted confidence, 0% to 100%; y-axis: observed accuracy, 0% to 100%. Plotted points:]

Band | Claims | Observed accuracy | Mean confidence
90-100% | 38 | 100% | 93%
80-89% | 39 | 99% | 84%
70-79% | 8 | 100% | 76%
60-69% | 3 | 100% | 64%
50-59% | 1 | 100% | 58%

Per-confidence-band calibration (committed verdicts)

Calibration metrics broken down by confidence band, committed verdicts only
Band | Claims | Correct | Observed accuracy | Mean confidence | Calibration gap (observed − confidence) | Mean Brier
90-100% | 38 | 38 | 100.0% | 93.4% | +6.6% | 0.0049
80-89% | 39 | 38 + 1 partial | 98.7% | 84.2% | +14.6% | 0.0271
70-79% | 8 | 8 | 100.0% | 76.5% | +23.5% | 0.0554
60-69% | 3 | 3 | 100.0% | 64.0% | +36.0% | 0.1298
50-59% | 1 | 1 | 100.0% | 58.0% | +42.0% | 0.1764
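
A per-band breakdown like the one above can be derived from the committed verdicts by grouping on ten-point confidence bands. The sketch below is illustrative: the function names are hypothetical, verdicts are assumed to be `(confidence, outcome)` pairs, and the gap is taken as observed accuracy minus mean confidence, so a positive gap means the system was underconfident.

```python
from collections import defaultdict


def band_label(conf):
    """Map a confidence in [0, 1] to a ten-point band; 100% joins the top band."""
    lo = min(int(conf * 10 + 1e-9), 9) * 10
    return "90-100%" if lo == 90 else f"{lo}-{lo + 9}%"


def band_table(committed):
    """Group (confidence, outcome) pairs by band and summarise each band."""
    groups = defaultdict(list)
    for conf, outcome in committed:
        groups[band_label(conf)].append((conf, outcome))
    rows = {}
    for label, items in groups.items():
        n = len(items)
        observed = sum(o for _, o in items) / n
        mean_conf = sum(c for c, _ in items) / n
        rows[label] = {
            "claims": n,
            "observed": observed,
            "mean_conf": mean_conf,
            "gap": observed - mean_conf,  # positive: underconfident
            "mean_brier": sum((c - o) ** 2 for c, o in items) / n,
        }
    return rows


# The 50-59% entry mirrors the one row of the table that can be rebuilt
# exactly (1 claim, 58% confidence, correct); the other two are made up.
rows = band_table([(0.58, 1.0), (0.95, 1.0), (0.91, 1.0)])
print(rows["50-59%"])
```

For the single 50-59% claim this gives a gap of 0.42 and a Brier of 0.1764, the same values as the last row of the table.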

Abstention breakdown

These are the claims where Veridi declined to commit to a substantive verdict. The question here is not "is Veridi calibrated on them?" but "was the abstention correct?"

Claim ID | Verdict | Confidence | Domain | Abstention correct? | Summary
gts-022 | UNVERIFIABLE | 45% | technology | yes | TikTok sends user data to Chinese government
gts-043 | PRED - INSUFF | 48% | scientific | yes | Predictive: insufficient evidence
gts-046 | UNVERIFIABLE | 35% | general | yes | Breaking: narrative claim
gts-048 | UNVERIFIABLE | 48% | technology | yes | AI: text detection limits
gts-063 | UNVERIFIABLE | 42% | financial | yes | Institutional capture: USDA
gts-065 | UNVERIFIABLE | 38% | scientific | yes | Contested: COVID-19 origin
adv-001 | UNVERIFIABLE | 22% | general | yes | Confidence laundering test 1
adv-003 | UNVERIFIABLE | 18% | general | yes | Citogenesis test 1
adv-005 | UNVERIFIABLE | 20% | general | yes | UNVERIFIABLE-by-design test 1
adv-006 | UNVERIFIABLE | 15% | general | yes | UNVERIFIABLE-by-design test 2
adv-007 | UNVERIFIABLE | 25% | general | yes | Tier inflation test

By test suite

Suite | Claims | Accuracy | Mean Brier | Mean confidence
Smoke Tests | 1 | 100.0% | 0.0025 | 95.0%
GTS-A | 25 | 100.0% | 0.0229 | 89.2%
GTS-B | 25 | 98.0% | 0.0665 | 80.1%
GTS-C | 20 | 100.0% | 0.0652 | 79.2%
GTS-D | 5 | 100.0% | 0.0290 | 84.0%
ADV-v1 | 12 | 100.0% | 0.2994 | 54.1%
ADV-v2 | 12 | 100.0% | 0.0139 | 89.2%

By domain

Domain | Claims | Mean Brier | 95% CI | Mean confidence
electoral | 6 | 0.0164 | [0.0079, 0.0249] | 88.3%
financial | 15 | 0.0458 | [0.0190, 0.0904] | 82.1%
general | 20 | 0.2093 | [0.1021, 0.3418] | 65.0%
historical | 3 | 0.0054 | n<5 | 92.7%
legal | 9 | 0.0468 | [0.0165, 0.0825] | 81.4%
medical | 8 | 0.0110 | [0.0037, 0.0226] | 91.1%
propaganda | 11 | 0.0250 | [0.0111, 0.0424] | 87.4%
scientific | 22 | 0.0468 | [0.0158, 0.0901] | 83.8%
technology | 6 | 0.1073 | [0.0131, 0.2014] | 74.5%
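
This page does not say how the 95% intervals were computed. One common choice for a mean of per-claim Brier scores is a percentile bootstrap, sketched below purely as an assumption. The function name, the resample count, and the toy scores are hypothetical; the `min_n=5` cutoff mirrors the "n<5" entry in the table, where no interval is reported.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, min_n=5, seed=0):
    """Percentile-bootstrap CI for the mean of per-claim Brier scores.
    Returns None when there are too few claims to report an interval."""
    if len(scores) < min_n:
        return None
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up per-claim Brier scores, not taken from the validation set.
toy_scores = [0.0, 0.01, 0.04, 0.09, 0.16, 0.25]
print(bootstrap_ci(toy_scores))
print(bootstrap_ci([0.01, 0.02, 0.03]))  # fewer than 5 claims, so None
```

With only a handful of claims per domain, intervals from any resampling method will be wide, which is consistent with the spread visible in the smaller domains above.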

Limitations

  1. Selection bias. All 100 claims have known ground truth, selected for verifiability. Real-world claims include more genuinely ambiguous cases than any test set can capture.
  2. The test set is near-uniformly correct. 99/100 verdicts correct + 1 partial. This makes Murphy Resolution ≈ 0: when almost everything is correct, there's no variance for predictions to discriminate. Harder validation claims are needed to exercise Resolution.
  3. Same-model evaluation. The validator (Claude) and the system under test share a model family. Systematic shared biases would not be detected here. Cross-model validation is planned.
  4. Small abstention sample. 11 correct abstentions out of 11 suggests good abstention discipline, but the sample is statistically thin. Planned: test cases designed to tempt over-abstention and over-commitment.
  5. No temporal drift data. Runs conducted in sessions across 2026-02 to 2026-05. No data on whether calibration drifts over extended use or across model updates.

Methodology versioning

Test suites were validated against earlier methodology versions. Current production methodology is v2.8 (deployed 2026-05-02; rigor-extension axis: Veridi v1.1, Pragma v1.6, Praxis v1.4). Re-validation against v2.8 is planned; numbers on this page reflect the as-recorded validation.

Reproduce these numbers

References