Veridi Calibration Analysis
Date: 2026-04-05
Claims analyzed: 95 validated (94 PASS, 1 PARTIAL, 0 FAIL)
Methodology versions: v1.0 (GTS-A, ADV-v1), v2.2 (GTS-B, ADV-v2), v2.4 (GTS-C)
Data sources: summary.md, gts-b-scorecard.md, gts-c-scorecard.md, adv-v2-scorecard.md
Overall Metrics
| Metric | Value |
|---|---|
| Total claims | 95 |
| Verdict accuracy | 99.5% (94 correct, 1 partial counted as 0.5) |
| Mean confidence | 80.3% |
| Overall Brier score | 0.0768 [0.0472, 0.1098] |
| Perfect-forecast Brier | 0.0000 (always right at 100% confidence) |
| Naive baseline Brier | 0.2500 (always 50% on binary) |
| 95% Bootstrap CI (1000 iter) | [0.0472, 0.1098] |
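
The bootstrap interval in the table can be reproduced in spirit with a percentile bootstrap over per-claim Brier scores. The sketch below is a minimal illustration, not the script that generated this report: the function name `bootstrap_ci`, the fixed seed, and the six sample scores are assumptions made for the example.

```python
import random

def bootstrap_ci(scores, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-claim scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(scores)
    # Resample n scores with replacement, record the mean, repeat.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(iterations)
    )
    k = int(iterations * alpha / 2)  # number of resamples cut from each tail
    return means[k], means[iterations - 1 - k]

# Illustrative per-claim Brier scores only, not the 95 validated claims.
scores = [0.0025, 0.0144, 0.1089, 0.0576, 0.2704, 0.6084]
mean_score = sum(scores) / len(scores)
lo, hi = bootstrap_ci(scores)
print(f"{mean_score:.4f} [{lo:.4f}, {hi:.4f}]")
```

With only a handful of claims the interval is wide, which is the same reason the per-domain table reports `n<5` rather than an interval for the smallest group.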
Confidence Band Calibration
The core calibration question: when Veridi says X% confidence, is it correct X% of the time?
| Band | Claims | Correct | Accuracy | Mean Conf | Expected Acc | Calibration Gap | Mean Brier |
|---|---|---|---|---|---|---|---|
| 90-100% | 37 | 37 | 100.0% | 93.5% | 93.5% | +6.5% | 0.0048 |
| 80-89% | 37 | 36+1P | 98.6% | 83.9% | 83.9% | +14.7% | 0.0278 |
| 70-79% | 6 | 6 | 100.0% | 76.3% | 76.3% | +23.7% | 0.0562 |
| 60-69% | 3 | 3 | 100.0% | 64.0% | 64.0% | +36.0% | 0.1298 |
| 50-59% | 1 | 1 | 100.0% | 58.0% | 58.0% | +42.0% | 0.1764 |
| 40-49% | 4 | 4 | 100.0% | 45.8% | 45.8% | +54.2% | 0.2949 |
| 30-39% | 2 | 2 | 100.0% | 36.5% | 36.5% | +63.5% | 0.4035 |
| 20-29% | 3 | 3 | 100.0% | 22.3% | 22.3% | +77.7% | 0.6036 |
| 10-19% | 2 | 2 | 100.0% | 16.5% | 16.5% | +83.5% | 0.6975 |
Interpretation
Calibration gap = Actual Accuracy - Mean Confidence. Positive = underconfident (more accurate than claimed). Negative = overconfident (less accurate than claimed).
90-100% band: 37 claims at mean 93.5% confidence, 100.0% accuracy. The system is underconfident by 6.5 percentage points in this band.
Below-60% bands (10-59% combined): 12 claims at mean 34.5% confidence, 100.0% accuracy. These are primarily UNVERIFIABLE, breaking-event, or contested claims where low confidence is methodologically correct. Calibration gap: +65.5 percentage points (underconfident).
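
The band table can be derived from per-claim records by grouping on confidence and comparing accuracy with mean confidence. The following is a minimal sketch under assumed inputs: the `(confidence_pct, outcome)` tuple layout, the function names, and the five sample claims are hypothetical, with PARTIAL coded as 0.5 as in the methodology note.

```python
from collections import defaultdict

def band_label(confidence_pct):
    """Map a confidence to its 10-point band; 100% joins the 90-100% band."""
    low = min(int(confidence_pct // 10) * 10, 90)
    high = 100 if low == 90 else low + 9
    return f"{low}-{high}%"

def calibration_by_band(claims):
    """claims: iterable of (confidence_pct, outcome), outcome in {0, 0.5, 1}."""
    buckets = defaultdict(list)
    for conf, outcome in claims:
        buckets[band_label(conf)].append((conf, outcome))
    rows = {}
    for band, items in buckets.items():
        n = len(items)
        accuracy = 100 * sum(o for _, o in items) / n
        mean_conf = sum(c for c, _ in items) / n
        # Positive gap = underconfident, negative gap = overconfident.
        rows[band] = {"claims": n, "accuracy": accuracy,
                      "mean_conf": mean_conf, "gap": accuracy - mean_conf}
    return rows

# Illustrative claims only, not the validated dataset.
sample = [(95, 1), (92, 1), (85, 1), (83, 0.5), (45, 1)]
rows = calibration_by_band(sample)
for band, row in sorted(rows.items(), reverse=True):
    print(band, f"{row['accuracy']:.1f}%", f"{row['gap']:+.1f}")
```

In this toy data the 80-89% band comes out overconfident, which shows the sign convention working in both directions.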
Brier Scores by Suite
| Suite | Claims | Mean Brier | Mean Confidence | Accuracy |
|---|---|---|---|---|
| Smoke Tests | 1 | 0.0025 | 95.0% | 100.0% |
| GTS-A | 25 | 0.0229 | 89.2% | 100.0% |
| GTS-B | 25 | 0.0665 | 80.1% | 98.0% |
| GTS-C | 20 | 0.0652 | 79.2% | 100.0% |
| ADV-v1 | 12 | 0.2994 | 54.1% | 100.0% |
| ADV-v2 | 12 | 0.0139 | 89.2% | 100.0% |
Brier Scores by Domain
| Domain | Claims | Mean Brier | 95% CI | Mean Confidence |
|---|---|---|---|---|
| electoral | 6 | 0.0164 | [0.0079, 0.0249] | 88.3% |
| financial | 15 | 0.0458 | [0.0190, 0.0904] | 82.1% |
| general | 20 | 0.2093 | [0.1021, 0.3418] | 65.0% |
| historical | 3 | 0.0054 | n<5 | 92.7% |
| legal | 6 | 0.0565 | [0.0169, 0.1023] | 79.8% |
| medical | 8 | 0.0110 | [0.0037, 0.0226] | 91.1% |
| propaganda | 11 | 0.0250 | [0.0111, 0.0424] | 87.4% |
| scientific | 21 | 0.0467 | [0.0143, 0.0911] | 84.0% |
| technology | 5 | 0.1259 | [0.0129, 0.2388] | 71.8% |
Brier Scores by Verdict Type
| Verdict | Claims | Mean Brier | Mean Confidence |
|---|---|---|---|
| FALSE | 20 | 0.0083 | 93.5% |
| LACKS CONTEXT | 8 | 0.0579 | 78.1% |
| MISLEADING | 17 | 0.0261 | 84.8% |
| MIXED | 5 | 0.0820 | 72.4% |
| MOSTLY FALSE | 21 | 0.0218 | 85.9% |
| MOSTLY TRUE | 6 | 0.0164 | 88.2% |
| OUTDATED | 1 | 0.0009 | 97.0% |
| PRED - FLAWED | 1 | 0.0400 | 80.0% |
| PRED - INSUFF | 1 | 0.2704 | 48.0% |
| PRED - SOUND | 1 | 0.0225 | 85.0% |
| TRUE | 4 | 0.0017 | 96.0% |
| UNVERIFIABLE | 10 | 0.4922 | 30.8% |
Limitations
- Selection bias: All 95 claims have known ground truth, selected for verifiability. Real-world claims include more genuinely ambiguous cases where the system might be wrong. This dataset tests calibration on verifiable claims, not on the full distribution of claims the system would encounter in production.
- Near-perfect accuracy inflates calibration gap: With 94/95 correct verdicts, every confidence band shows ~100% accuracy. This makes all bands appear underconfident. The system cannot be overconfident when it's almost never wrong. The calibration gap is real but its magnitude is partly an artifact of high accuracy on selected claims.
- Low-confidence claims are methodologically correct: UNVERIFIABLE verdicts at 15-45% confidence are not underconfident; low confidence IS the correct output for genuinely unverifiable claims. The Brier score penalizes correct-but-low-confidence verdicts, which conflicts with the methodology's design intent for these claim types.
- Same-model evaluation: The validator (Claude Opus 4.6) and the system under test are the same model family. Systematic shared biases would not be detected.
- No temporal degradation data: All test runs were conducted in 2-3 sessions. No data on whether calibration drifts over extended use or across model updates.
Recommendations
- Track ad-hoc claims with ground truth closing: The 4 existing ad-hoc claims in calibration.jsonl lack outcome data. Implement a `/veridi-close` workflow that revisits past claims when ground truth becomes available.
- Add adversarial accuracy tests: Create claims designed to be plausibly wrong, where the expected ground truth is non-obvious. Current test suites select claims with clear answers. The calibration gap may narrow on harder claims.
- Cross-model validation: Run the same 20-claim subset on a different model (e.g., Gemini, GPT-4) to check for shared-model bias.
- Brier decomposition: With more data, decompose Brier score into reliability (calibration) and resolution (discrimination) components to distinguish 'well-calibrated but indiscriminate' from 'miscalibrated but discriminating.'
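
One common form of the decomposition recommended above is Murphy's: Brier = reliability - resolution + uncertainty. The sketch below is an illustration rather than part of the Veridi tooling. It assumes binary outcomes (a PARTIAL coded as 0.5 would break the identity) and groups forecasts by exact value; the function name and the sample data are hypothetical.

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Forecasts are grouped by exact value, so the identity
    Brier = reliability - resolution + uncertainty holds exactly.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    # Reliability: how far each stated confidence sits from its hit rate.
    reliability = sum(
        len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()
    ) / n
    # Resolution: how much group hit rates differ from the base rate.
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2
        for obs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# Illustrative data only: confidences as probabilities, outcomes as 0/1.
forecasts = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.3, 0.3]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(f"rel={rel:.5f} res={res:.5f} unc={unc:.5f} brier={brier:.4f}")
```

When nearly every verdict is correct, as in this dataset, the base rate sits close to 1, so both uncertainty and achievable resolution are close to zero. That is one reason the decomposition becomes informative only with more incorrect or partial outcomes.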
Generated 2026-04-05 from validation scorecards. Methodology: Brier score = (confidence/100 - outcome)² where outcome = 1 (correct verdict) or 0 (incorrect). PARTIAL treated as outcome = 0.5.
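
The scoring rule in the methodology note can be written out directly. This is a minimal sketch: the `OUTCOME` mapping and the function names are assumptions for illustration, using the PASS/PARTIAL/FAIL labels and the 1/0.5/0 coding stated above.

```python
# Outcome coding as stated in the methodology note.
OUTCOME = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}

def brier_score(confidence_pct, validation):
    """Per-claim Brier score: (confidence/100 - outcome) squared."""
    return (confidence_pct / 100 - OUTCOME[validation]) ** 2

def mean_brier(claims):
    """claims: list of (confidence_pct, validation) pairs."""
    scores = [brier_score(c, v) for c, v in claims]
    return sum(scores) / len(scores)

print(f"{brier_score(95, 'PASS'):.4f}")  # 0.0025
print(f"{brier_score(48, 'PASS'):.4f}")  # 0.2704
print(f"{brier_score(50, 'FAIL'):.4f}")  # 0.2500
print(f"{brier_score(30, 'PASS'):.4f}")  # 0.4900
```

The first two values match the single-claim rows reported above (Smoke Tests at 95.0%, PRED - INSUFF at 48.0%), and the third is the naive 50% baseline. The last line shows why a correct verdict at 30% confidence still scores poorly, which is the effect described under Limitations for UNVERIFIABLE claims.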