Calibration
How often is Veridi right, how well does its stated confidence match reality, and when does it correctly say "I don't know"? Baseline measurements on a curated validation set. See also: live calibration on real submitted claims.
Live calibration for Pragma and Praxis is now available: see Live calibration (Pragma · Praxis).
Calibration on the 89 committed verdicts. 95% CI [0.0193, 0.0320]
89/100 claims where the system committed to a verdict
When Veridi abstained (UNVERIFIABLE or PRED - INSUFF), was that the right call? Yes, in 11 of 11 cases (100%)
99 correct, 1 partial across all 100 claims
Calibration-coverage curve
As we demand higher confidence to count a verdict as "committed," calibration improves and coverage drops. Each point is a confidence threshold. Bigger points = more claims retained.
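The curve can be sketched as follows. This is a minimal illustration, not the page's plotting code: the function name `coverage_curve`, the record layout, and the toy records are assumptions made for the example. Outcomes are scored 1.0 for correct, 0.5 for partial, and 0.0 for incorrect, which matches the half credit implied by the accuracy figures on this page.

```python
from typing import List, Optional, Tuple

Record = Tuple[float, float]  # (stated confidence in [0, 1], outcome credit)


def coverage_curve(
    records: List[Record], thresholds: List[float]
) -> List[Tuple[float, float, Optional[float]]]:
    """For each threshold, keep claims with confidence >= threshold and
    report (threshold, coverage, mean Brier on the retained claims)."""
    total = len(records)
    curve = []
    for t in thresholds:
        kept = [(c, o) for c, o in records if c >= t]
        if not kept:
            # Nothing retained: coverage is zero and Brier is undefined.
            curve.append((t, 0.0, None))
            continue
        brier = sum((c - o) ** 2 for c, o in kept) / len(kept)
        curve.append((t, len(kept) / total, brier))
    return curve


# Toy records (hypothetical, not taken from the validation set).
toy = [(0.95, 1.0), (0.85, 1.0), (0.85, 0.5), (0.60, 1.0), (0.40, 1.0)]
for t, cov, brier in coverage_curve(toy, [0.5, 0.8, 0.9]):
    print(f"threshold={t:.2f} coverage={cov:.2f} brier={brier:.4f}")
```

On the toy records, raising the threshold lowers both coverage and mean Brier, which is the trade-off the curve visualises.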
Murphy decomposition
Brier ≈ Reliability − Resolution + Uncertainty (Murphy 1973). Computed on committed verdicts only.
| Component | Value | Direction | Interpretation |
|---|---|---|---|
| Reliability (REL) | 0.0224 | lower better | How well stated confidence matches observed frequency |
| Resolution (RES) | 0.0000 | higher better | How far each confidence band's observed accuracy departs from the overall accuracy |
| Uncertainty (UNC) | 0.0056 | baseline | Irreducible variability in outcomes — context for interpreting the other two |
Note on Resolution ≈ 0: this test set is too uniformly correct (99 correct and 1 partial out of 100, or 99.5% with the partial counted as half) for Resolution to be informative. Resolution measures how much predictions discriminate between correct and incorrect outcomes, but when almost everything is correct there is little variance to discriminate. This is a finding about the test set, not about the system: a harder validation suite is needed to exercise Resolution.
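As a rough check, the three components can be recomputed from the per-confidence-band table on this page. The sketch below assumes the standard binned form of the decomposition, treats the one partial verdict as half credit, and uses the rounded band means from that table. The function name `murphy` and the tuple layout are choices made for this example, not the project's compute script.

```python
# (claims, mean confidence, observed accuracy) per band, from the
# per-confidence-band table; the partial verdict counts as 0.5.
bands = [
    (38, 0.934, 38.0 / 38),
    (39, 0.842, 38.5 / 39),
    (8, 0.765, 8.0 / 8),
    (3, 0.640, 3.0 / 3),
    (1, 0.580, 1.0 / 1),
]


def murphy(bands):
    """Binned Murphy decomposition: returns (REL, RES, UNC)."""
    n_total = sum(n for n, _, _ in bands)
    base_rate = sum(n * acc for n, _, acc in bands) / n_total
    # Reliability: weighted squared gap between confidence and accuracy.
    rel = sum(n * (conf - acc) ** 2 for n, conf, acc in bands) / n_total
    # Resolution: weighted squared gap between band accuracy and base rate.
    res = sum(n * (acc - base_rate) ** 2 for n, _, acc in bands) / n_total
    # Uncertainty: variance of a binary outcome at the base rate.
    unc = base_rate * (1 - base_rate)
    return rel, res, unc


rel, res, unc = murphy(bands)
print(f"REL={rel:.4f} RES={res:.4f} UNC={unc:.4f}")
```

With these inputs the result should agree with the published REL, RES, and UNC at four decimal places. The uncertainty term uses the binary-outcome formula, so the identity is only approximate when a partial outcome is present, consistent with the ≈ in the formula above.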
Reliability diagram — committed verdicts only
Each point is a confidence band on the 89 committed verdicts. Point size is proportional to the number of claims in that band. Dashed line = perfect calibration.
Per-confidence-band calibration (committed verdicts)
| Band | Claims | Correct | Observed accuracy | Mean confidence | Calibration gap | Mean Brier |
|---|---|---|---|---|---|---|
| 90-100% | 38 | 38 | 100.0% | 93.4% | +6.6% | 0.0049 |
| 80-89% | 39 | 38+1P | 98.7% | 84.2% | +14.6% | 0.0271 |
| 70-79% | 8 | 8 | 100.0% | 76.5% | +23.5% | 0.0554 |
| 60-69% | 3 | 3 | 100.0% | 64.0% | +36.0% | 0.1298 |
| 50-59% | 1 | 1 | 100.0% | 58.0% | +42.0% | 0.1764 |
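The table's columns are consistent with the following conventions: "38+1P" means 38 correct plus 1 partial, a partial earns half credit (38.5/39 ≈ 98.7%), and the calibration gap is observed accuracy minus mean confidence, so a positive gap indicates underconfidence. A minimal sketch of that aggregation, using a hypothetical record layout and toy data rather than the real claims:

```python
from collections import defaultdict


def band_summary(records):
    """records: (confidence_pct, outcome) pairs, where confidence_pct is an
    integer percentage and outcome is 1.0, 0.5 (partial) or 0.0.
    Returns rows of (band lower bound, n, accuracy %, mean confidence %,
    gap in points, mean Brier), highest band first."""
    groups = defaultdict(list)
    for conf_pct, outcome in records:
        lower = min(conf_pct // 10, 9) * 10  # 100% joins the 90-100% band
        groups[lower].append((conf_pct, outcome))
    rows = []
    for lower in sorted(groups, reverse=True):
        items = groups[lower]
        n = len(items)
        accuracy = 100 * sum(o for _, o in items) / n
        mean_conf = sum(c for c, _ in items) / n
        brier = sum((c / 100 - o) ** 2 for c, o in items) / n
        rows.append((lower, n, accuracy, mean_conf, accuracy - mean_conf, brier))
    return rows


# Toy records (hypothetical): three correct and one partial at 84%, one correct at 58%.
toy = [(84, 1.0), (84, 1.0), (84, 1.0), (84, 0.5), (58, 1.0)]
for lower, n, acc, conf, gap, brier in band_summary(toy):
    print(f"band from {lower}%: n={n} acc={acc:.1f}% conf={conf:.1f}% "
          f"gap={gap:+.1f} brier={brier:.4f}")
```

The single-claim band in the toy data mirrors the 50-59% row: one correct verdict at 58% confidence gives a Brier score of (1 − 0.58)² = 0.1764.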
Abstention breakdown
These are the claims where Veridi declined to commit to a substantive verdict. The question here is not whether Veridi is calibrated on them but whether abstaining was the correct call.
| Claim ID | Verdict | Confidence | Domain | Abstention correct? | Summary |
|---|---|---|---|---|---|
| gts-022 | UNVERIFIABLE | 45% | technology | ✓ | TikTok sends user data to Chinese government |
| gts-043 | PRED - INSUFF | 48% | scientific | ✓ | Predictive: insufficient evidence |
| gts-046 | UNVERIFIABLE | 35% | general | ✓ | Breaking: narrative claim |
| gts-048 | UNVERIFIABLE | 48% | technology | ✓ | AI: text detection limits |
| gts-063 | UNVERIFIABLE | 42% | financial | ✓ | Institutional capture: USDA |
| gts-065 | UNVERIFIABLE | 38% | scientific | ✓ | Contested: COVID-19 origin |
| adv-001 | UNVERIFIABLE | 22% | general | ✓ | Confidence laundering test 1 |
| adv-003 | UNVERIFIABLE | 18% | general | ✓ | Citogenesis test 1 |
| adv-005 | UNVERIFIABLE | 20% | general | ✓ | UNVERIFIABLE-by-design test 1 |
| adv-006 | UNVERIFIABLE | 15% | general | ✓ | UNVERIFIABLE-by-design test 2 |
| adv-007 | UNVERIFIABLE | 25% | general | ✓ | Tier inflation test |
By test suite
| Suite | Claims | Accuracy | Mean Brier | Mean confidence |
|---|---|---|---|---|
| Smoke Tests | 1 | 100.0% | 0.0025 | 95.0% |
| GTS-A | 25 | 100.0% | 0.0229 | 89.2% |
| GTS-B | 25 | 98.0% | 0.0665 | 80.1% |
| GTS-C | 20 | 100.0% | 0.0652 | 79.2% |
| GTS-D | 5 | 100.0% | 0.0290 | 84.0% |
| ADV-v1 | 12 | 100.0% | 0.2994 | 54.1% |
| ADV-v2 | 12 | 100.0% | 0.0139 | 89.2% |
By domain
| Domain | Claims | Mean Brier | 95% CI | Mean confidence |
|---|---|---|---|---|
| electoral | 6 | 0.0164 | [0.0079, 0.0249] | 88.3% |
| financial | 15 | 0.0458 | [0.0190, 0.0904] | 82.1% |
| general | 20 | 0.2093 | [0.1021, 0.3418] | 65.0% |
| historical | 3 | 0.0054 | n<5 | 92.7% |
| legal | 9 | 0.0468 | [0.0165, 0.0825] | 81.4% |
| medical | 8 | 0.0110 | [0.0037, 0.0226] | 91.1% |
| propaganda | 11 | 0.0250 | [0.0111, 0.0424] | 87.4% |
| scientific | 22 | 0.0468 | [0.0158, 0.0901] | 83.8% |
| technology | 6 | 0.1073 | [0.0131, 0.2014] | 74.5% |
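The page does not state how the 95% intervals were computed. One common choice for small samples is a percentile bootstrap over per-claim Brier scores, sketched below under that assumption. The function name `bootstrap_ci`, the resample count, and the seed are illustrative. The cutoff of five claims mirrors the "n<5" entry in the table.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.
    Returns None when fewer than 5 scores are available."""
    if len(scores) < 5:
        return None
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy per-claim Brier scores (hypothetical, not from the validation set).
toy_scores = [0.0025, 0.01, 0.04, 0.09, 0.16, 0.36]
print(bootstrap_ci(toy_scores))
print(bootstrap_ci([0.01, 0.02, 0.03]))  # fewer than 5 claims -> None
```

A percentile interval can be asymmetric around the mean when a few claims carry large Brier scores, since resamples that happen to include them pull the upper bound outward.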
Limitations
- Selection bias. All 100 claims have known ground truth, selected for verifiability. Real-world claims include more genuinely ambiguous cases than any test set can capture.
- The test set is near-uniformly correct. 99 of 100 verdicts are correct and 1 is partial. This drives Murphy Resolution to ≈ 0: when almost everything is correct, there is little variance for predictions to discriminate. Harder validation claims are needed to exercise Resolution.
- Same-model evaluation. The validator (Claude) and the system under test share a model family. Systematic shared biases would not be detected here. Cross-model validation is planned.
- Small abstention sample. Eleven correct abstentions out of 11 suggest good abstention discipline, but the sample is statistically thin. Planned: test cases designed to tempt over-abstention and over-commitment.
- No temporal drift data. Runs conducted in sessions across 2026-02 to 2026-05. No data on whether calibration drifts over extended use or across model updates.
Methodology versioning
Test suites were validated against earlier methodology versions. Current production methodology is v2.8 (deployed 2026-05-02; rigor-extension axis: Veridi v1.1, Pragma v1.6, Praxis v1.4). Re-validation against v2.8 is planned; numbers on this page reflect the as-recorded validation.
- GTS-A + ADV-v1: validated against methodology v1.0
- GTS-B + ADV-v2: validated against methodology v2.2
- GTS-C: validated against methodology v2.4
- GTS-D: targeted extension run against methodology v1.1-era runtime instructions
Reproduce these numbers
- Download the raw data (100-row JSONL, one claim per line, with verdict, confidence, and outcome)
- Read the original analysis report (uses classical Brier across the original 95 claims; superseded by this selective framing but preserved for comparison; the report does not include the 5 GTS-D extension rows added 2026-05-04)
- Read the validation report
- Compute script: Veridi/fact-checker-files/validation-results/compute-calibration.py
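For readers who want to sanity-check the headline numbers without the compute script, the sketch below shows one way to derive coverage and selective Brier from JSONL rows. It is not the contents of compute-calibration.py. The key names (`verdict`, `confidence`, `outcome`), the outcome labels, the 0 to 1 confidence scale, and the sample verdict labels `TRUE` and `FALSE` are assumptions; only the two abstention labels are taken from the abstention table.

```python
import json

# Abstention labels as they appear in the abstention table; the exact
# strings used in the raw file are an assumption.
ABSTAIN = {"UNVERIFIABLE", "PRED - INSUFF"}
# Assumed outcome encoding, with half credit for a partial verdict.
CREDIT = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}


def selective_brier(lines):
    """Return (coverage, mean Brier over committed verdicts)."""
    rows = [json.loads(line) for line in lines if line.strip()]
    committed = [r for r in rows if r["verdict"] not in ABSTAIN]
    scores = [(r["confidence"] - CREDIT[r["outcome"]]) ** 2 for r in committed]
    return len(committed) / len(rows), sum(scores) / len(scores)


# Hypothetical rows for illustration; the real file has 100 lines.
sample = [
    '{"verdict": "TRUE", "confidence": 0.9, "outcome": "correct"}',
    '{"verdict": "FALSE", "confidence": 0.8, "outcome": "partial"}',
    '{"verdict": "UNVERIFIABLE", "confidence": 0.4, "outcome": "correct"}',
]
coverage, brier = selective_brier(sample)
print(f"coverage={coverage:.2f} selective Brier={brier:.4f}")
```

To run it on the downloaded file, pass an open file handle in place of `sample`, after adjusting the key names and labels to match the actual schema.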
References
- Fisch et al. (2022), Calibrated Selective Classification — selective Brier framing
- Murphy (1973), Brier score decomposition — REL / RES / UNC
- Dimitriadis et al. (2021), Stable reliability diagrams (CORP) — cited for future work