Calibration

How often is Veridi right, how well does its stated confidence match reality, and when does it correctly say "I don't know"? Baseline measurements on a curated validation set. See also: live calibration on real submitted claims.

Live calibration for Pragma and Praxis is now available; see Live calibration (Pragma · Praxis).

Methodology version anchor. The metrics below were produced under methodology versions v1.0 (GTS-A + ADV-v1, 2026-02), v2.2 (GTS-B + ADV-v2, 2026-02-25), v2.4 (GTS-C, 2026-03), and v1.1-era runtime instructions (GTS-D extension, 2026-05-04). Current production methodology is v2.8 (Veridi v1.1, Pragma v1.6, Praxis v1.4 on the rigor-extension axis), deployed 2026-05-02.

The metrics on this page reflect the validation-version run, not the live substrate. For metrics from claims fact-checked by the current substrate, see live calibration (Veridi default) or Pragma / Praxis.

Selective Brier: 0.0253
Calibration on the 89 committed verdicts. 95% CI [0.0193, 0.0320].

Coverage: 89.0%
89/100 claims where the system committed to a verdict.

Abstention correctness: 11/11 (100%)
When Veridi abstained (UNVERIFIABLE or PRED - INSUFF), was that the right call?

Overall accuracy: 99.5%
99 correct and 1 partial across all 100 claims, with the partial counted as half.

What the numbers mean, plainly. A fact-checking system that can answer "I don't know" has to be evaluated on two separate questions: when it commits to a verdict, is its confidence calibrated? And when it abstains, was abstaining correct? A Brier score computed over all claims lumps the two together: a correct abstention made at low stated confidence is penalized like a wrong commitment. We follow Fisch et al. (2022) and report selective Brier on committed verdicts and abstention correctness separately. Those are the 0.0253 and 11/11 figures above. The raw Brier across all 100 claims is 0.0745; it is a lower-quality signal, reported here only because prior literature uses it.
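The split described above can be sketched in a few lines of Python. This is an illustration, not the project's compute script: the `Claim` record, its field names, and the 1.0 / 0.5 / 0.0 outcome scoring are assumptions made for the example.

```python
from dataclasses import dataclass
from math import nan


@dataclass
class Claim:
    # Hypothetical record; field names are illustrative, not the real schema.
    confidence: float  # stated confidence in [0, 1]
    outcome: float     # assumed scoring: 1.0 correct, 0.5 partial, 0.0 incorrect
    committed: bool    # False when the system abstained


def mean_brier(claims):
    """Mean squared gap between stated confidence and outcome."""
    if not claims:
        return nan
    return sum((c.confidence - c.outcome) ** 2 for c in claims) / len(claims)


def selective_report(claims):
    """Score committed verdicts and abstentions separately."""
    committed = [c for c in claims if c.committed]
    abstained = [c for c in claims if not c.committed]
    return {
        "coverage": len(committed) / len(claims),
        "selective_brier": mean_brier(committed),
        "raw_brier": mean_brier(claims),
        # An abstention counts as correct when its outcome is marked 1.0.
        "abstentions_correct": sum(1 for c in abstained if c.outcome == 1.0),
        "abstentions_total": len(abstained),
    }


# Toy data, not the validation set.
toy = [
    Claim(0.90, 1.0, True),
    Claim(0.80, 1.0, True),
    Claim(0.30, 1.0, False),  # a correct abstention with low stated confidence
]
report = selective_report(toy)
```

On this toy input the single correct abstention pulls raw Brier up to about 0.18, while selective Brier on the two committed verdicts stays at about 0.025. The same effect explains the gap between 0.0745 and 0.0253 on this page.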

Calibration-coverage curve

As we demand higher confidence to count a verdict as "committed," calibration improves and coverage drops. Each point is a confidence threshold; larger points retain more claims.
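A curve of this kind could be produced by sweeping a confidence threshold and rescoring whatever is retained. The sketch below assumes a verdict is retained when its stated confidence is at or above the threshold; the function name and the tuple layout are hypothetical, and the page's own plotting code may differ.

```python
def calibration_coverage_curve(confidences, outcomes, thresholds):
    """Return (threshold, coverage, mean_brier, n_retained) per threshold."""
    total = len(confidences)
    points = []
    for t in thresholds:
        kept = [(p, y) for p, y in zip(confidences, outcomes) if p >= t]
        if not kept:
            continue  # nothing retained at this threshold, so no point to plot
        brier = sum((p - y) ** 2 for p, y in kept) / len(kept)
        points.append((t, len(kept) / total, brier, len(kept)))
    return points


# Toy data, not the validation set.
confidences = [0.95, 0.85, 0.60, 0.40]
outcomes = [1.0, 1.0, 1.0, 1.0]
curve = calibration_coverage_curve(confidences, outcomes, [0.5, 0.9, 0.99])
```

Each tuple carries the retained count so a plot can scale point size by it, as the chart description above suggests.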

Murphy decomposition

Brier ≈ Reliability − Resolution + Uncertainty (Murphy 1973). Computed on committed verdicts only.

Component	Value	Direction	Interpretation
Reliability (REL)	0.0224	lower better	How well stated confidence matches observed frequency
Resolution (RES)	0.0000	higher better	How much predictions vary between correct vs incorrect outcomes
Uncertainty (UNC)	0.0056	baseline	Irreducible variability in outcomes — context for interpreting the other two

Note on Resolution ≈ 0: this test set is too uniformly correct (99 correct and 1 partial out of 100, or 99.5%) for Resolution to be informative. Resolution measures how well predictions discriminate between correct and incorrect outcomes, but when almost everything is correct there is little variance to discriminate. This is a finding about the test set, not about the system: a harder validation suite is needed to exercise Resolution.
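For readers who want to check the decomposition, here is a minimal sketch of the binned Murphy terms, fed with the band counts and mean confidences from the per-confidence-band table further down this page. Two assumptions are made: the bands in that table are the bins used for the decomposition, and the "38+1P" band counts as 38.5 correct out of 39. The function name is hypothetical.

```python
def murphy_from_bins(bins):
    """bins: list of (n, mean_confidence, observed_accuracy) per confidence band."""
    total = sum(n for n, _, _ in bins)
    base_rate = sum(n * acc for n, _, acc in bins) / total
    # Reliability: weighted squared gap between confidence and accuracy.
    rel = sum(n * (conf - acc) ** 2 for n, conf, acc in bins) / total
    # Resolution: weighted squared spread of band accuracy around the base rate.
    res = sum(n * (acc - base_rate) ** 2 for n, _, acc in bins) / total
    # Uncertainty: variance of a binary outcome at the base rate.
    unc = base_rate * (1.0 - base_rate)
    return rel, res, unc


# Rounded values copied from the per-confidence-band table on this page.
bands = [
    (38, 0.934, 38 / 38),
    (39, 0.842, 38.5 / 39),  # assumption: the partial is scored as 0.5
    (8, 0.765, 8 / 8),
    (3, 0.640, 3 / 3),
    (1, 0.580, 1 / 1),
]
rel, res, unc = murphy_from_bins(bands)
print(f"REL={rel:.4f} RES={res:.4f} UNC={unc:.4f}")
```

With these rounded inputs the three terms come out close to the table values (about 0.0224, 0.0000, and 0.0056). Resolution is tiny because every band's accuracy sits within a percentage point of the overall base rate.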

Reliability diagram — committed verdicts only

Each point is a confidence band on the 89 committed verdicts. Point size is proportional to the number of claims in that band. Dashed line = perfect calibration.

Per-confidence-band calibration (committed verdicts)

Band	Claims	Correct	Observed accuracy	Mean confidence	Calibration gap (observed minus confidence)	Mean Brier
90-100%	38	38	100.0%	93.4%	+6.6%	0.0049
80-89%	39	38 + 1 partial	98.7%	84.2%	+14.6%	0.0271
70-79%	8	8	100.0%	76.5%	+23.5%	0.0554
60-69%	3	3	100.0%	64.0%	+36.0%	0.1298
50-59%	1	1	100.0%	58.0%	+42.0%	0.1764
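Rows like those in the table above could be derived from per-claim data roughly as follows. This is a sketch under stated assumptions: confidence is given as an integer percent, a partial outcome is scored 0.5, and the gap is observed accuracy minus mean confidence, which matches the sign of the gaps in the table. The function and key names are hypothetical.

```python
from collections import defaultdict


def band_table(rows, width=10):
    """rows: iterable of (confidence_pct, outcome), confidence as an integer percent."""
    grouped = defaultdict(list)
    for pct, outcome in rows:
        low = min(pct // width * width, 100 - width)  # 100% falls in the top band
        grouped[low].append((pct / 100, outcome))
    table = []
    for low in sorted(grouped, reverse=True):
        members = grouped[low]
        n = len(members)
        accuracy = sum(y for _, y in members) / n
        mean_conf = sum(p for p, _ in members) / n
        high = 100 if low + width >= 100 else low + width - 1
        table.append({
            "band": f"{low}-{high}%",
            "claims": n,
            "accuracy": accuracy,
            "mean_confidence": mean_conf,
            "gap": accuracy - mean_conf,  # positive means underconfident
            "mean_brier": sum((p - y) ** 2 for p, y in members) / n,
        })
    return table


# Toy data, not the validation set.
rows = [(95, 1.0), (92, 1.0), (85, 1.0), (84, 0.5), (58, 1.0)]
table = band_table(rows)
```

A single correct claim at 58% confidence yields a mean Brier of 0.1764, the same value shown in the 50-59% row above.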

Abstention breakdown

These are the claims where Veridi declined to commit to a substantive verdict. The question here is not whether Veridi is calibrated on them but whether each abstention was correct.

Claim ID	Verdict	Confidence	Domain	Abstention correct?	Summary
gts-022	UNVERIFIABLE	45%	technology	✓	TikTok sends user data to Chinese government
gts-043	PRED - INSUFF	48%	scientific	✓	Predictive: insufficient evidence
gts-046	UNVERIFIABLE	35%	general	✓	Breaking: narrative claim
gts-048	UNVERIFIABLE	48%	technology	✓	AI: text detection limits
gts-063	UNVERIFIABLE	42%	financial	✓	Institutional capture: USDA
gts-065	UNVERIFIABLE	38%	scientific	✓	Contested: COVID-19 origin
adv-001	UNVERIFIABLE	22%	general	✓	Confidence laundering test 1
adv-003	UNVERIFIABLE	18%	general	✓	Citogenesis test 1
adv-005	UNVERIFIABLE	20%	general	✓	UNVERIFIABLE-by-design test 1
adv-006	UNVERIFIABLE	15%	general	✓	UNVERIFIABLE-by-design test 2
adv-007	UNVERIFIABLE	25%	general	✓	Tier inflation test

By test suite

Suite	Claims	Accuracy	Mean Brier	Mean confidence
Smoke Tests	1	100.0%	0.0025	95.0%
GTS-A	25	100.0%	0.0229	89.2%
GTS-B	25	98.0%	0.0665	80.1%
GTS-C	20	100.0%	0.0652	79.2%
GTS-D	5	100.0%	0.0290	84.0%
ADV-v1	12	100.0%	0.2994	54.1%
ADV-v2	12	100.0%	0.0139	89.2%

By domain

Domain	Claims	Mean Brier	95% CI	Mean confidence
electoral	6	0.0164	[0.0079, 0.0249]	88.3%
financial	15	0.0458	[0.0190, 0.0904]	82.1%
general	20	0.2093	[0.1021, 0.3418]	65.0%
historical	3	0.0054	n<5	92.7%
legal	9	0.0468	[0.0165, 0.0825]	81.4%
medical	8	0.0110	[0.0037, 0.0226]	91.1%
propaganda	11	0.0250	[0.0111, 0.0424]	87.4%
scientific	22	0.0468	[0.0158, 0.0901]	83.8%
technology	6	0.1073	[0.0131, 0.2014]	74.5%
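This page does not say how the 95% intervals were computed. One common choice is a percentile bootstrap over per-claim Brier scores, sketched below as an assumption rather than a description of the actual method. The `min_n` cutoff mirrors the "n<5" entry in the table; the function name, resample count, and seed are hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, min_n=5, seed=0):
    """Percentile bootstrap CI for the mean; None when the sample is too small."""
    n = len(scores)
    if n < min_n:
        return None  # too few claims for a meaningful interval
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy per-claim Brier scores, not the validation set.
toy_scores = [0.0, 0.01, 0.04, 0.09, 0.16, 0.25]
interval = bootstrap_ci(toy_scores)
too_small = bootstrap_ci([0.01, 0.02, 0.03])
```

With only a handful of claims per domain, intervals from any method will be wide, which is consistent with the spread seen in the general and technology rows.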

Limitations

Selection bias. All 100 claims have known ground truth, selected for verifiability. Real-world claims include more genuinely ambiguous cases than any test set can capture.
The test set is near-uniformly correct. 99/100 verdicts correct + 1 partial. This makes Murphy Resolution ≈ 0: when almost everything is correct, there's no variance for predictions to discriminate. Harder validation claims are needed to exercise Resolution.
Same-model evaluation. The validator (Claude) and the system under test share a model family. Systematic shared biases would not be detected here. Cross-model validation is planned.
Small abstention sample. 11 abstentions is suggestive of good abstention discipline but statistically thin. Planned: test cases designed to tempt over-abstention and over-commitment.
No temporal drift data. Runs were conducted in sessions between 2026-02 and 2026-05. There is no data on whether calibration drifts over extended use or across model updates.

Methodology versioning

Test suites were validated against earlier methodology versions. Current production methodology is v2.8 (deployed 2026-05-02; rigor-extension axis: Veridi v1.1, Pragma v1.6, Praxis v1.4). Re-validation against v2.8 is planned; numbers on this page reflect the as-recorded validation.

GTS-A + ADV-v1: validated against methodology v1.0
GTS-B + ADV-v2: validated against methodology v2.2
GTS-C: validated against methodology v2.4
GTS-D: targeted extension run against methodology v1.1-era runtime instructions

Reproduce these numbers

Download the raw data: a 100-row JSONL file, one claim per line, with verdict, confidence, and outcome for each claim.
Read the original analysis report (uses classical Brier across the original 95 claims; superseded by this selective framing but preserved for comparison; the report does not include the 5 GTS-D extension rows added 2026-05-04)
Read the validation report
Compute script: Veridi/fact-checker-files/validation-results/compute-calibration.py
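Before running the compute script, the raw data can be inspected with a small loader like the one below. The key names follow the verdict/confidence/outcome description above, but the exact keys, the 0-to-1 confidence scale, the outcome strings, and the "TRUE" verdict label are all assumptions for illustration; check them against the actual file. The two abstention labels are the ones shown in the abstention table.

```python
import json

# Abstention labels as they appear in the abstention table on this page.
ABSTAIN_VERDICTS = {"UNVERIFIABLE", "PRED - INSUFF"}
# Assumed outcome encoding; the real file may use different values.
OUTCOME_SCORE = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}


def load_claims(lines):
    """Parse JSONL lines into dicts with a numeric score and a committed flag."""
    claims = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        claims.append({
            "confidence": float(row["confidence"]),
            "score": OUTCOME_SCORE[row["outcome"]],
            "committed": row["verdict"] not in ABSTAIN_VERDICTS,
        })
    return claims


# Hypothetical sample rows, not copied from the real data file.
sample = [
    '{"verdict": "TRUE", "confidence": 0.93, "outcome": "correct"}',
    '{"verdict": "UNVERIFIABLE", "confidence": 0.45, "outcome": "correct"}',
]
claims = load_claims(sample)
```

Because `load_claims` accepts any iterable of lines, an open file handle can be passed in directly once the download location is known.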

References

Fisch et al. (2022), Calibrated Selective Classification — selective Brier framing
Murphy (1973), Brier score decomposition — REL / RES / UNC
Dimitriadis et al. (2021), Stable reliability diagrams (CORP) — cited for future work