
Veridi Calibration Analysis

Date: 2026-04-05
Claims analyzed: 95 validated (94 PASS, 1 PARTIAL, 0 FAIL)
Methodology versions: v1.0 (GTS-A, ADV-v1), v2.2 (GTS-B, ADV-v2), v2.4 (GTS-C)
Data sources: summary.md, gts-b-scorecard.md, gts-c-scorecard.md, adv-v2-scorecard.md


Overall Metrics

| Metric | Value |
| --- | --- |
| Total claims | 95 |
| Verdict accuracy | 99.5% (94 correct, 1 partial) |
| Mean confidence | 80.3% |
| Overall Brier score | 0.0768 [0.0472, 0.1098] |
| Perfect calibration Brier | 0.0000 (always right at 100%) |
| Naive baseline Brier | 0.2500 (always 50% on binary) |
| 95% Bootstrap CI (1000 iter) | [0.0472, 0.1098] |
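The reference values in this table follow directly from the per-claim formula stated in the methodology note at the end of this report. The sketch below is a minimal illustration of that formula, not the scoring code used to produce the scorecards; the function names `brier` and `mean_brier` are hypothetical.

```python
def brier(confidence_pct: float, outcome: float) -> float:
    """Per-claim Brier score: (confidence/100 - outcome)^2.

    outcome is 1 for a correct verdict, 0 for an incorrect one,
    and 0.5 for PARTIAL, as stated in the methodology note.
    """
    return (confidence_pct / 100.0 - outcome) ** 2


def mean_brier(records):
    """Average Brier score over (confidence_pct, outcome) pairs."""
    scores = [brier(conf, outcome) for conf, outcome in records]
    return sum(scores) / len(scores)


# Reference points from the table.
perfect = brier(100, 1)  # always right at 100% -> 0.0
naive = brier(50, 1)     # always 50% on a binary outcome -> 0.25

# The single claim in the 50-59% band: 58% confidence, correct verdict.
single = brier(58, 1)    # (0.58 - 1)^2 = 0.1764
```

Because the score is a squared error, it is symmetric: `brier(50, 0)` gives the same 0.25 as `brier(50, 1)`, which is why a constant 50% forecast scores 0.25 regardless of outcomes.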

Confidence Band Calibration

The core calibration question: when Veridi says X% confidence, is it correct X% of the time?

| Band | Claims | Correct | Accuracy | Mean Conf | Expected Acc | Calibration Gap | Mean Brier |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 90-100% | 37 | 37 | 100.0% | 93.5% | 93.5% | +6.5% | 0.0048 |
| 80-89% | 37 | 36+1P | 98.6% | 83.9% | 83.9% | +14.7% | 0.0278 |
| 70-79% | 6 | 6 | 100.0% | 76.3% | 76.3% | +23.7% | 0.0562 |
| 60-69% | 3 | 3 | 100.0% | 64.0% | 64.0% | +36.0% | 0.1298 |
| 50-59% | 1 | 1 | 100.0% | 58.0% | 58.0% | +42.0% | 0.1764 |
| 40-49% | 4 | 4 | 100.0% | 45.8% | 45.8% | +54.2% | 0.2949 |
| 30-39% | 2 | 2 | 100.0% | 36.5% | 36.5% | +63.5% | 0.4035 |
| 20-29% | 3 | 3 | 100.0% | 22.3% | 22.3% | +77.7% | 0.6036 |
| 10-19% | 2 | 2 | 100.0% | 16.5% | 16.5% | +83.5% | 0.6975 |

Interpretation

Calibration gap = Actual Accuracy - Mean Confidence. Positive = underconfident (more accurate than claimed). Negative = overconfident (less accurate than claimed).

90-100% band: 37 claims at mean 93.5% confidence, 100.0% accuracy. The system is underconfident by 6.5 percentage points in this band.

Below-60% bands (the five bands from 10-19% to 50-59%, combined): 12 claims at mean 34.5% confidence, 100.0% accuracy. These are primarily UNVERIFIABLE/breaking-event/contested claims where low confidence is methodologically correct. Calibration gap: +65.5 percentage points (underconfident).
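The gap arithmetic above can be reproduced from the band table. The following is a small sketch under the definition given in this section (gap = actual accuracy minus mean confidence); the helper name `calibration_gap` and the data layout are illustrative assumptions, and the numbers are copied from the table.

```python
def calibration_gap(accuracy_pct: float, mean_conf_pct: float) -> float:
    """Actual accuracy minus mean confidence, in percentage points.

    Positive means underconfident, negative means overconfident.
    """
    return round(accuracy_pct - mean_conf_pct, 1)


# (accuracy %, mean confidence %) for three bands from the table.
bands = {
    "90-100%": (100.0, 93.5),
    "80-89%": (98.6, 83.9),
    "50-59%": (100.0, 58.0),
}
gaps = {band: calibration_gap(acc, conf) for band, (acc, conf) in bands.items()}

# Below-60% aggregate: claim-weighted mean confidence over five bands,
# given as (claims, mean confidence %) per band.
low_bands = [(1, 58.0), (4, 45.8), (2, 36.5), (3, 22.3), (2, 16.5)]
low_claims = sum(n for n, _ in low_bands)                                  # 12
low_mean_conf = round(sum(n * c for n, c in low_bands) / low_claims, 1)    # 34.5
low_gap = calibration_gap(100.0, low_mean_conf)                            # 65.5
```

Weighting by claim count matters for the aggregate: an unweighted average of the five band means would give a different figure than the 34.5% reported above.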

Brier Scores by Suite

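| Suite | Claims | Mean Brier | Mean Confidence | Accuracy |
| --- | --- | --- | --- | --- |
| Smoke Tests | 1 | 0.0025 | 95.0% | 100.0% |
| GTS-A | 25 | 0.0229 | 89.2% | 100.0% |
| GTS-B | 25 | 0.0665 | 80.1% | 98.0% |
| GTS-C | 20 | 0.0652 | 79.2% | 100.0% |
| ADV-v1 | 12 | 0.2994 | 54.1% | 100.0% |
| ADV-v2 | 12 | 0.0139 | 89.2% | 100.0% |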

Brier Scores by Domain

| Domain | Claims | Mean Brier | 95% CI | Mean Confidence |
| --- | --- | --- | --- | --- |
| electoral | 6 | 0.0164 | [0.0079, 0.0249] | 88.3% |
| financial | 15 | 0.0458 | [0.0190, 0.0904] | 82.1% |
| general | 20 | 0.2093 | [0.1021, 0.3418] | 65.0% |
| historical | 3 | 0.0054 | n<5 | 92.7% |
| legal | 6 | 0.0565 | [0.0169, 0.1023] | 79.8% |
| medical | 8 | 0.0110 | [0.0037, 0.0226] | 91.1% |
| propaganda | 11 | 0.0250 | [0.0111, 0.0424] | 87.4% |
| scientific | 21 | 0.0467 | [0.0143, 0.0911] | 84.0% |
| technology | 5 | 0.1259 | [0.0129, 0.2388] | 71.8% |
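The intervals in this table are described as 95% bootstrap CIs, with no interval reported when a domain has fewer than five claims. The report does not say which bootstrap variant was used, so the sketch below assumes a plain percentile bootstrap over per-claim Brier scores with 1000 resamples. The function name `bootstrap_ci`, the fixed seed, and the index convention for the percentiles are all assumptions made for illustration.

```python
import random


def bootstrap_ci(scores, n_iter=1000, alpha=0.05, min_n=5, seed=0):
    """Percentile bootstrap CI for the mean of per-claim Brier scores.

    Returns None when there are fewer than min_n scores, mirroring the
    "n<5" entry in the table. The seed only makes the sketch reproducible.
    """
    n = len(scores)
    if n < min_n:
        return None
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_iter)
    )
    lo_idx = int(n_iter * alpha / 2)   # lower percentile position
    hi_idx = n_iter - 1 - lo_idx       # symmetric upper position
    return means[lo_idx], means[hi_idx]


# Hypothetical per-claim scores for a small domain (not data from this report).
example_scores = [0.0025, 0.01, 0.04, 0.0225, 0.09, 0.0049]
interval = bootstrap_ci(example_scores)
too_small = bootstrap_ci([0.0009, 0.01, 0.0049])  # None: fewer than 5 claims
```

With samples as small as 5 or 6 claims, percentile intervals tend to be unstable, which is consistent with the wide interval reported for the technology domain.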

Brier Scores by Verdict Type

| Verdict | Claims | Mean Brier | Mean Confidence |
| --- | --- | --- | --- |
| FALSE | 20 | 0.0083 | 93.5% |
| LACKS CONTEXT | 8 | 0.0579 | 78.1% |
| MISLEADING | 17 | 0.0261 | 84.8% |
| MIXED | 5 | 0.0820 | 72.4% |
| MOSTLY FALSE | 21 | 0.0218 | 85.9% |
| MOSTLY TRUE | 6 | 0.0164 | 88.2% |
| OUTDATED | 1 | 0.0009 | 97.0% |
| PRED - FLAWED | 1 | 0.0400 | 80.0% |
| PRED - INSUFF | 1 | 0.2704 | 48.0% |
| PRED - SOUND | 1 | 0.0225 | 85.0% |
| TRUE | 4 | 0.0017 | 96.0% |
| UNVERIFIABLE | 10 | 0.4922 | 30.8% |

Limitations

  1. Selection bias: All 95 claims have known ground truth, selected for verifiability. Real-world claims include more genuinely ambiguous cases where the system might be wrong. This dataset tests calibration on verifiable claims, not on the full distribution of claims the system would encounter in production.

  2. Near-perfect accuracy inflates calibration gap: With 94/95 correct verdicts, every confidence band shows ~100% accuracy, so every band appears underconfident. A system that is almost never wrong cannot register as overconfident under this measure. The calibration gap is real, but its magnitude is partly an artifact of high accuracy on selected claims.

  3. Low-confidence claims are methodologically correct: UNVERIFIABLE verdicts at 15-45% confidence are not underconfident — low confidence IS the correct output for genuinely unverifiable claims. The Brier score penalizes correct-but-low-confidence verdicts, which conflicts with the methodology's design intent for these claim types.

  4. Same-model evaluation: The validator (Claude Opus 4.6) and the system under test belong to the same model family. Systematic biases shared by both would not be detected.

  5. No temporal degradation data: All test runs were conducted in 2-3 sessions. No data on whether calibration drifts over extended use or across model updates.
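The tension described in limitation 3 can be made concrete with the scoring formula from the methodology note. The sketch below is illustrative only: it scores a correct verdict at several low confidence levels (15% and 45% are the ends of the range quoted in limitation 3, and 30.8% is the UNVERIFIABLE mean confidence) and compares each against the 0.25 naive baseline. The names `brier` and `NAIVE_BASELINE` are hypothetical.

```python
NAIVE_BASELINE = 0.25  # Brier score of always answering 50%


def brier(confidence_pct: float, outcome: float) -> float:
    """Per-claim Brier score: (confidence/100 - outcome)^2."""
    return (confidence_pct / 100.0 - outcome) ** 2


# Score a correct verdict (outcome = 1) at several stated confidences.
scores = {conf: round(brier(conf, 1), 4) for conf in (15, 30.8, 45, 50)}
# scores == {15: 0.7225, 30.8: 0.4789, 45: 0.3025, 50: 0.25}

# Confidence levels at which a correct verdict scores worse than the baseline.
worse_than_naive = [conf for conf, s in scores.items() if s > NAIVE_BASELINE]
# worse_than_naive == [15, 30.8, 45]
```

Under this scoring, any correct verdict stated below 50% confidence scores worse than the naive baseline, which is why the UNVERIFIABLE row carries the highest mean Brier score in the verdict table despite every verdict being correct.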

Recommendations

  1. Track ad-hoc claims with ground truth closing: The 4 existing ad-hoc claims in calibration.jsonl lack outcome data. Implement a /veridi-close workflow that revisits past claims when ground truth becomes available.

  2. Add adversarial accuracy tests: Create claims designed to be plausibly wrong — where the expected ground truth is non-obvious. Current test suites select claims with clear answers. The calibration gap may narrow on harder claims.

  3. Cross-model validation: Run the same 20-claim subset on a different model (e.g., Gemini, GPT-4) to check for shared-model bias.

  4. Brier decomposition: With more data, decompose Brier score into reliability (calibration) and resolution (discrimination) components to distinguish 'well-calibrated but indiscriminate' from 'miscalibrated but discriminating.'
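For recommendation 4, one standard option is the Murphy decomposition, which writes the mean Brier score as reliability minus resolution plus uncertainty. The sketch below assumes strictly binary outcomes (so PARTIAL = 0.5 would need separate handling) and groups claims by identical forecast value, in which case the identity holds exactly; grouping by confidence band instead would make it approximate. The function name `brier_decomposition` and the sample records are hypothetical, not data from this report.

```python
from collections import defaultdict


def brier_decomposition(records):
    """Murphy decomposition for (forecast in [0, 1], outcome in {0, 1}) pairs.

    Returns (reliability, resolution, uncertainty), where
    mean Brier = reliability - resolution + uncertainty.
    """
    records = list(records)
    n = len(records)
    base_rate = sum(outcome for _, outcome in records) / n

    # Group outcomes by identical forecast value.
    groups = defaultdict(list)
    for forecast, outcome in records:
        groups[forecast].append(outcome)

    reliability = 0.0  # calibration error: forecast vs observed frequency
    resolution = 0.0   # discrimination: observed frequency vs base rate
    for forecast, outcomes in groups.items():
        observed = sum(outcomes) / len(outcomes)
        reliability += len(outcomes) * (forecast - observed) ** 2
        resolution += len(outcomes) * (observed - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical records: (forecast, outcome).
sample = [(0.9, 1), (0.9, 1), (0.9, 0), (0.6, 1), (0.6, 0)]
rel, res, unc = brier_decomposition(sample)
mean_brier = sum((f - o) ** 2 for f, o in sample) / len(sample)
```

With accuracy near 100%, the base rate is close to 1, so both the uncertainty and resolution terms are close to zero and nearly all of the Brier score lands in the reliability term. This is one reason the decomposition becomes informative only once the dataset includes more incorrect verdicts.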


Generated 2026-04-05 from validation scorecards. Methodology: Brier score = (confidence/100 - outcome)² where outcome = 1 (correct verdict) or 0 (incorrect). PARTIAL treated as outcome = 0.5.