Calibration Analysis 2026 04 05

Veridi Calibration Analysis

Date: 2026-04-05
Claims analyzed: 95 validated (94 PASS, 1 PARTIAL, 0 FAIL)
Methodology versions: v1.0 (GTS-A, ADV-v1), v2.2 (GTS-B, ADV-v2), v2.4 (GTS-C)
Data sources: summary.md, gts-b-scorecard.md, gts-c-scorecard.md, adv-v2-scorecard.md

Overall Metrics

Metric	Value
Total claims	95
Verdict accuracy	99.5% (94 correct, 1 partial counted as 0.5)
Mean confidence	80.3%
Overall Brier score	0.0768 [0.0472, 0.1098]
Perfect-score Brier	0.0000 (always right at 100% confidence)
Naive baseline Brier	0.2500 (always 50% confidence on a binary outcome)
95% Bootstrap CI (1000 iter)	[0.0472, 0.1098]
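
The mean Brier score and a bootstrap interval of this kind can be computed along the following lines. This is a minimal sketch, not the script that produced the table: the function name, the percentile method, the fixed seed, and the sample claims are assumptions made for illustration. Only the scoring rule (PARTIAL counted as 0.5) and the 1000 iterations come from this report.

```python
import random

def bootstrap_ci(scores, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-claim scores (assumed method)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the scores with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(iterations))
    lo = means[int(iterations * alpha / 2)]
    hi = means[int(iterations * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical (confidence %, outcome) pairs; outcome is 1, 0, or 0.5 for PARTIAL.
claims = [(95, 1), (84, 1), (84, 0.5), (76, 1), (46, 1), (22, 1)]
scores = [(conf / 100 - outcome) ** 2 for conf, outcome in claims]

mean_brier = sum(scores) / len(scores)
low, high = bootstrap_ci(scores)
print(f"Brier {mean_brier:.4f} [{low:.4f}, {high:.4f}]")
```

With only a handful of claims the interval is wide; the domain table below omits the interval entirely for groups with fewer than 5 claims.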

Confidence Band Calibration

The core calibration question: when Veridi says X% confidence, is it correct X% of the time?

Band	Claims	Correct	Accuracy	Mean Conf	Expected Acc	Calibration Gap	Mean Brier
90-100%	37	37	100.0%	93.5%	93.5%	+6.5%	0.0048
80-89%	37	36+1P	98.6%	83.9%	83.9%	+14.7%	0.0278
70-79%	6	6	100.0%	76.3%	76.3%	+23.7%	0.0562
60-69%	3	3	100.0%	64.0%	64.0%	+36.0%	0.1298
50-59%	1	1	100.0%	58.0%	58.0%	+42.0%	0.1764
40-49%	4	4	100.0%	45.8%	45.8%	+54.2%	0.2949
30-39%	2	2	100.0%	36.5%	36.5%	+63.5%	0.4035
20-29%	3	3	100.0%	22.3%	22.3%	+77.7%	0.6036
10-19%	2	2	100.0%	16.5%	16.5%	+83.5%	0.6975
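
A table of this shape can be derived from per-claim records roughly as follows. The sketch assumes 10-point bands with 100% folded into the top band, and the function names and sample claims are hypothetical; it is not the code behind the scorecards.

```python
from collections import defaultdict

def band_label(confidence_pct):
    """Map a confidence percentage to a 10-point band; 100% falls in 90-100%."""
    lower = min(int(confidence_pct // 10) * 10, 90)
    upper = 100 if lower == 90 else lower + 9
    return f"{lower}-{upper}%"

def band_table(claims):
    """Group (confidence %, outcome) pairs by band and compute per-band statistics."""
    bands = defaultdict(list)
    for conf, outcome in claims:
        bands[band_label(conf)].append((conf, outcome))
    rows = {}
    for label, items in bands.items():
        n = len(items)
        accuracy = 100 * sum(o for _, o in items) / n
        mean_conf = sum(c for c, _ in items) / n
        rows[label] = {
            "claims": n,
            "accuracy": accuracy,
            "mean_conf": mean_conf,
            "gap": accuracy - mean_conf,  # positive = underconfident
            "mean_brier": sum((c / 100 - o) ** 2 for c, o in items) / n,
        }
    return rows

# Hypothetical claims; outcome 0.5 marks a PARTIAL verdict.
claims = [(95, 1), (92, 1), (85, 1), (82, 0.5), (58, 1)]
rows = band_table(claims)
for label, row in sorted(rows.items(), reverse=True):
    print(label, row)
```

The single claim at 58% reproduces the 50-59% row of the table (gap +42.0, Brier 0.1764), because that band contains exactly one claim at 58.0% confidence.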

Interpretation

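Calibration gap = Actual Accuracy - Mean Confidence, in percentage points. Positive = underconfident (more accurate than claimed). Negative = overconfident (less accurate than claimed). In the Correct column, "1P" marks the one PARTIAL verdict, which counts as 0.5 toward accuracy.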

90-100% band: 37 claims at mean 93.5% confidence, 100.0% accuracy. The system is underconfident by 6.5 percentage points in this band.

Below 60% (five bands combined): 12 claims at mean 34.5% confidence, 100.0% accuracy. These are primarily UNVERIFIABLE, breaking-event, or contested claims, where low confidence is methodologically correct. Calibration gap: +65.5 percentage points (nominally underconfident).
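
The pooled figures for claims below 60% confidence follow from the five lowest rows of the band table. A short arithmetic check, using only values copied from that table (the variable names are illustrative):

```python
# (claims, mean confidence %) for the five bands below 60%, copied from the band table.
low_bands = [(1, 58.0), (4, 45.8), (2, 36.5), (3, 22.3), (2, 16.5)]

total_claims = sum(n for n, _ in low_bands)
# Claim-weighted mean confidence across the pooled bands.
pooled_conf = sum(n * conf for n, conf in low_bands) / total_claims
pooled_accuracy = 100.0  # every claim in these bands was judged correct
gap = pooled_accuracy - pooled_conf

print(total_claims, round(pooled_conf, 1), round(gap, 1))  # prints: 12 34.5 65.5
```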

Brier Scores by Suite

Suite	Claims	Mean Brier	Mean Confidence	Accuracy
Smoke Tests	1	0.0025	95.0%	100.0%
GTS-A	25	0.0229	89.2%	100.0%
GTS-B	25	0.0665	80.1%	98.0%
GTS-C	20	0.0652	79.2%	100.0%
ADV-v1	12	0.2994	54.1%	100.0%
ADV-v2	12	0.0139	89.2%	100.0%
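
As a consistency check, the claim-weighted average of the per-suite rows should recover the overall metrics. The sketch below uses only the values in the table above; the variable names are illustrative.

```python
# (suite, claims, mean Brier, mean confidence %) copied from the suite table.
suites = [
    ("Smoke Tests", 1, 0.0025, 95.0),
    ("GTS-A", 25, 0.0229, 89.2),
    ("GTS-B", 25, 0.0665, 80.1),
    ("GTS-C", 20, 0.0652, 79.2),
    ("ADV-v1", 12, 0.2994, 54.1),
    ("ADV-v2", 12, 0.0139, 89.2),
]

total = sum(n for _, n, _, _ in suites)
# Weight each suite's mean by its claim count to pool across suites.
overall_brier = sum(n * b for _, n, b, _ in suites) / total
overall_conf = sum(n * c for _, n, _, c in suites) / total

print(total, round(overall_brier, 3), round(overall_conf, 1))  # prints: 95 0.077 80.3
```

This agrees with the reported totals (95 claims, mean confidence 80.3%) and with the overall Brier score of 0.0768 up to rounding of the per-suite means.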

Brier Scores by Domain

Domain	Claims	Mean Brier	95% CI	Mean Confidence
electoral	6	0.0164	[0.0079, 0.0249]	88.3%
financial	15	0.0458	[0.0190, 0.0904]	82.1%
general	20	0.2093	[0.1021, 0.3418]	65.0%
historical	3	0.0054	n<5	92.7%
legal	6	0.0565	[0.0169, 0.1023]	79.8%
medical	8	0.0110	[0.0037, 0.0226]	91.1%
propaganda	11	0.0250	[0.0111, 0.0424]	87.4%
scientific	21	0.0467	[0.0143, 0.0911]	84.0%
technology	5	0.1259	[0.0129, 0.2388]	71.8%

Brier Scores by Verdict Type

Verdict	Claims	Mean Brier	Mean Confidence
FALSE	20	0.0083	93.5%
LACKS CONTEXT	8	0.0579	78.1%
MISLEADING	17	0.0261	84.8%
MIXED	5	0.0820	72.4%
MOSTLY FALSE	21	0.0218	85.9%
MOSTLY TRUE	6	0.0164	88.2%
OUTDATED	1	0.0009	97.0%
PRED - FLAWED	1	0.0400	80.0%
PRED - INSUFF	1	0.2704	48.0%
PRED - SOUND	1	0.0225	85.0%
TRUE	4	0.0017	96.0%
UNVERIFIABLE	10	0.4922	30.8%

Limitations

Selection bias: All 95 claims have known ground truth, selected for verifiability. Real-world claims include more genuinely ambiguous cases where the system might be wrong. This dataset tests calibration on verifiable claims, not on the full distribution of claims the system would encounter in production.
Near-perfect accuracy inflates calibration gap: With 94/95 correct verdicts, every confidence band shows ~100% accuracy, which makes all bands appear underconfident. The system cannot register as overconfident when it is almost never wrong. The calibration gap is real, but its magnitude is partly an artifact of high accuracy on selected claims.
Low-confidence claims are methodologically correct: UNVERIFIABLE verdicts at 15-45% confidence are not underconfident; low confidence is the correct output for genuinely unverifiable claims. The Brier score penalizes correct-but-low-confidence verdicts, which conflicts with the methodology's design intent for these claim types.
Same-model evaluation: The validator (Claude Opus 4.6) and the system under test belong to the same model family. Systematic shared biases would not be detected.
No temporal degradation data: All test runs were conducted in 2-3 sessions. No data on whether calibration drifts over extended use or across model updates.
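
The point about low-confidence claims can be made concrete. Under the scoring rule used in this report, a correct verdict still scores poorly when its stated confidence is low. The confidence levels below are chosen for illustration, and the function name is an assumption.

```python
def brier(confidence_pct, outcome):
    """Per-claim Brier score: (confidence/100 - outcome) squared."""
    return (confidence_pct / 100 - outcome) ** 2

# A correct verdict (outcome = 1) at several stated confidence levels.
for conf in (15, 30, 45, 50, 95):
    print(f"{conf:>3}% confidence, correct verdict -> Brier {brier(conf, 1):.4f}")
```

Any correct verdict stated below 50% confidence scores worse than the 0.2500 naive baseline, which is why the UNVERIFIABLE row has the highest mean Brier in the verdict table despite containing no wrong verdicts.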

Recommendations

Track ad-hoc claims through to ground-truth closure: The 4 existing ad-hoc claims in calibration.jsonl lack outcome data. Implement a /veridi-close workflow that revisits past claims when ground truth becomes available.
Add adversarial accuracy tests: Create claims on which the system could plausibly reach the wrong verdict, i.e. where the ground truth is non-obvious. Current test suites select claims with clear answers. The calibration gap may narrow on harder claims.
Cross-model validation: Run a 20-claim subset of the existing suites on a different model (e.g., Gemini, GPT-4) to check for shared-model bias.
Brier decomposition: With more data, decompose Brier score into reliability (calibration) and resolution (discrimination) components to distinguish 'well-calibrated but indiscriminate' from 'miscalibrated but discriminating.'
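
One standard form of the Brier decomposition (Murphy's) splits the score into reliability, resolution, and a third term, uncertainty, that depends only on the base rate: Brier = reliability - resolution + uncertainty. The sketch below is one possible implementation, not part of the current tooling. It assumes strictly binary outcomes, so the single PARTIAL claim would need to be mapped to 0 or 1 or excluded, and it groups claims by identical forecast value so that the identity holds exactly. The function name and data are hypothetical.

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Forecasts are probabilities in [0, 1]. Claims are grouped by identical
    forecast value, so Brier = reliability - resolution + uncertainty exactly.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    # Reliability: how far each forecast value sits from its observed hit rate.
    reliability = sum(
        len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()
    ) / n
    # Resolution: how far each group's hit rate sits from the overall base rate.
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# Hypothetical data: forecast = confidence / 100, outcome = 1 if the verdict was correct.
forecasts = [0.9, 0.9, 0.9, 0.8, 0.8, 0.3, 0.3]
outcomes = [1, 1, 1, 1, 0, 1, 0]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
mean_brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(f"reliability={rel:.4f} resolution={res:.4f} uncertainty={unc:.4f} brier={mean_brier:.4f}")
```

When forecasts are binned into bands rather than grouped by exact value, the identity holds only approximately.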

Generated 2026-04-05 from validation scorecards.
Methodology: per-claim Brier score = (confidence/100 - outcome)², where outcome = 1 (correct verdict) or 0 (incorrect verdict). PARTIAL is treated as outcome = 0.5. Reported Brier scores are means over the claims in each group.
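
The scoring rule can be checked against table rows backed by a single claim, where the mean Brier equals the per-claim score. In the sketch below, the mapping from PASS/PARTIAL/FAIL to outcome values and the function name are assumptions; the confidence and Brier figures are copied from the suite and verdict tables.

```python
# Assumed mapping from validation result to outcome value.
OUTCOME = {"PASS": 1.0, "PARTIAL": 0.5, "FAIL": 0.0}

def brier(confidence_pct, result):
    """Brier score for one claim, given confidence in percent and a validation result."""
    return (confidence_pct / 100 - OUTCOME[result]) ** 2

# Single-claim rows: (label, confidence %, reported mean Brier).
single_claim_rows = [
    ("Smoke Tests", 95.0, 0.0025),
    ("OUTDATED", 97.0, 0.0009),
    ("PRED - FLAWED", 80.0, 0.0400),
    ("PRED - INSUFF", 48.0, 0.2704),
    ("PRED - SOUND", 85.0, 0.0225),
]
for label, conf, reported in single_claim_rows:
    computed = round(brier(conf, "PASS"), 4)
    print(f"{label}: computed {computed:.4f}, reported {reported:.4f}")
```

All five rows match when scored as PASS, which is consistent with the one PARTIAL verdict lying elsewhere.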