
Validation Report: The Veridi Fact-Checking Methodology

February 25, 2026


Summary

Veridi is a structured fact-checking methodology designed for a Canadian cooperative organization. After a comprehensive audit and twelve rounds of remediation between February 20 and February 25, 2026, the methodology was tested against ninety-seven claims spanning eight subject domains, nine verdict categories, twenty-four adversarial scenarios covering eleven disinformation attack patterns, non-English source evaluation in four languages, institutional capture detection across US and international agencies, genuinely contested ground truth where experts disagree, and predictive, breaking-event, and AI-generated content claims. Ninety-six claims passed outright. One scored partial — a correct verdict with confidence slightly below the expected range due to a source being unavailable at test time.

The validation was conducted in three phases. Phase 1 tested forty claims: three smoke tests, twenty-five golden test set claims (GTS-A) with documented ground truth, and twelve adversarial claims each targeting a single gaming vector. Phase 2 added twelve harder adversarial claims (ADV-v2), each combining two to three gaming vectors simultaneously, including four claims based on real-world disinformation patterns found in the wild and two methodology stress tests. Phase 2 also tested two new capabilities added in methodology version 2.2: the Institutional Reliability Index (which assesses when formerly trustworthy government agencies can no longer be treated as authoritative) and detection of data disappearance exploitation (which identifies when the removal of government data collection programs is weaponized to support false narratives). Phase 3 tested forty-five additional golden test set claims: twenty-five weakness-targeting claims (GTS-B) covering verdict boundaries, non-Western contexts, statistical manipulation, predictive claims, breaking events, AI-generated content, and definitional disputes; and twenty gap-filling claims (GTS-C) targeting standalone LACKS CONTEXT verdicts, expanded MOSTLY TRUE coverage, non-English source evaluation, institutional capture scenarios, and genuinely contested ground truth where the evidence is ambiguous and reasonable analysts disagree.

This report describes what Veridi is, how the validation was conducted, and what the results tell us about the methodology's readiness for real-world use.


What Veridi Does

Most people evaluate claims the same way: they read something, decide whether it sounds right, and move on. Even professional fact-checkers often rely on experience and instinct to guide their assessments. Veridi takes a different approach. It provides an explicit, step-by-step process - a checklist of sorts - for evaluating any factual claim, regardless of subject matter.

The methodology is built around several core principles:

Truth comes first. Every fact-check begins its output by stating what is actually true - clearly, affirmatively, and without negation. The claim under review appears second. This ordering matters because psychological research consistently shows that repeating a false claim - even to debunk it - can reinforce it. Veridi leads with the corrective.

Sources are ranked, not trusted. The methodology uses a four-tier source hierarchy. At the top sit primary sources: government databases, peer-reviewed research, court records, and raw authoritative data. Below them are authoritative secondary sources like wire services, established fact-checkers, and credentialed experts. General news media occupies the third tier. Social media, anonymous sources, and partisan outlets sit at the bottom. Every piece of evidence in a Veridi assessment carries its tier classification, making the evidentiary basis transparent.

Confidence has a ceiling. A fact-check supported only by third-tier sources cannot receive a confidence rating above 65%, regardless of how many such sources agree. Multiple low-quality sources do not substitute for a single high-quality one. This structural cap prevents the methodology from being fooled by volume.
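The ceiling can be pictured as a simple clamp. The following is a minimal sketch, not the Veridi implementation: the function name, the numeric tier encoding (1 = primary through 4 = social or anonymous), and the decision to apply the same ceiling when only Tier 4 sources are available are assumptions made for illustration. Only the 65% figure for Tier-3-only evidence comes from the text above.

```python
# Hypothetical sketch. Only the 65% ceiling for Tier-3-only evidence is from the report.
TIER3_ONLY_CAP = 0.65


def capped_confidence(raw_confidence: float, source_tiers: list[int]) -> float:
    """Clamp confidence when no supporting source is stronger than Tier 3.

    Tiers are encoded 1 (primary) to 4 (social/anonymous); lower is stronger.
    """
    if not source_tiers:
        raise ValueError("at least one source is required")
    best_tier = min(source_tiers)
    # Assumption: evidence whose best source is Tier 4 is capped at least as hard.
    if best_tier >= 3:
        return min(raw_confidence, TIER3_ONLY_CAP)
    return raw_confidence
```

Under these assumptions, three agreeing Tier 3 sources with a raw confidence of 0.90 would still come out at 0.65, while adding a single Tier 1 source lifts the ceiling.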

Gaming is expected. Perhaps the methodology's most distinctive feature is its explicit treatment of disinformation as an adversarial problem. Veridi includes countermeasures for eleven specific attack patterns - techniques that bad actors use to make false claims appear credible. More on this below.

Institutions can degrade. The methodology includes an Institutional Reliability Index that tracks when government agencies or other traditionally authoritative sources have been compromised by political interference, defunding, or institutional capture. A source that was Tier 1 last year may be Tier 3 today. The IRI provides per-agency, per-function assessments with comparison anchors - alternative sources to consult when a primary source's reliability has degraded. This matters because some of the most effective disinformation attributes false claims to institutions that were once trustworthy but no longer are.


How the Methodology Works

A fact-check under Veridi proceeds through a defined sequence. The claim is first classified by domain - scientific, legal, medical, financial, electoral, historical, technological, or propaganda - and by complexity. The methodology then loads domain-specific evaluation frameworks as needed. Evidence is gathered through structured searches. Sources are evaluated and assigned tiers. Gaming countermeasures are applied. A verdict is determined using decision trees designed to handle the most confusable distinctions - the difference between "misleading" and "lacking context," for instance, or between "mixed" and "mostly false." Confidence is calibrated against the evidence strength, with structural caps enforced. Finally, a quality-assurance checklist is completed before the assessment is published.

The methodology uses nine verdict categories: True, Mostly True, Mixed, Mostly False, False, Unverifiable, Misleading, Outdated, and Lacks Context. This taxonomy is more granular than the binary true/false model and reflects the reality that most real-world claims are not cleanly one or the other.

Four verification depth tiers are available - Quick, Standard, Full, and Forensic - allowing the methodology to scale its rigor to the stakes of the claim. A simple question of established fact might need only one authoritative source. A politically charged claim involving statistical manipulation might require a dozen sources, specialist analysis, and a full gaming countermeasure scan.


The Gaming Countermeasures

Disinformation is not random. It follows patterns - techniques that exploit specific weaknesses in how people and institutions evaluate information. Veridi identifies eleven such patterns and provides explicit detection procedures for each. We are always on the lookout for more.

Confidence laundering occurs when a claim is repeated by multiple outlets that all trace back to a single, unreliable source. The appearance of independent confirmation is manufactured. Veridi requires tracing every source to its origin; derived sources do not boost confidence.
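Origin tracing of this kind can be sketched as follows. This is an illustrative sketch under stated assumptions, not the methodology's own procedure: the `derived_from` mapping (each outlet pointing at the source it relied on) and both function names are hypothetical.

```python
def resolve_origin(source: str, derived_from: dict[str, str]) -> str:
    """Follow 'derived from' links until reaching a source with no recorded parent."""
    seen = set()
    while source in derived_from:
        if source in seen:  # circular chain: stop rather than loop forever
            break
        seen.add(source)
        source = derived_from[source]
    return source


def independent_origins(sources: list[str], derived_from: dict[str, str]) -> set[str]:
    """Collapse a list of sources to the distinct origins they trace back to."""
    return {resolve_origin(s, derived_from) for s in sources}
```

Three outlets that all trace back to one blog collapse to a single origin, so under this sketch they would count once rather than three times.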

Citogenesis is the creation of circular citations - a claim appears on a website, is picked up by a news outlet, and the original source is then updated to cite the news outlet as confirmation. Wikipedia is particularly vulnerable to this pattern. The methodology includes timestamp and language-similarity checks to detect these loops.
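The report names timestamp and language-similarity checks. The sketch below substitutes a simpler, different technique for illustration: a search for a loop in a citation graph. The `cites` mapping and the function name are hypothetical, and a real check would also need the publication timestamps that this sketch ignores.

```python
from typing import Optional


def find_citation_loop(cites: dict[str, list[str]], start: str) -> Optional[list[str]]:
    """Return a path start -> ... -> start if citations lead back to start, else None."""
    stack = [(start, [start])]
    visited = set()
    while stack:
        node, path = stack.pop()
        for target in cites.get(node, []):
            if target == start:
                return path + [start]
            if target not in visited:
                visited.add(target)
                stack.append((target, path + [target]))
    return None
```

For the pattern described above, a website citing a news outlet that in turn cites the website yields the path website, news outlet, website.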

Unverifiable-by-design claims are structured so that verification is impossible. Anonymous sources discussing classified material in private settings, for example, can be neither confirmed nor denied. Veridi flags these patterns and caps confidence accordingly, rather than treating specificity as a proxy for credibility.

Tier inflation launders low-quality claims through progressively more credible outlets until they appear authoritative. An anonymous blog post becomes a news aggregator article becomes a respected publication's report. The methodology traces the evidence chain back to its origin and assigns tier classification based on the original source, not the final publisher.

Framing manipulation - perhaps the most dangerous pattern - uses individually true facts to create a composite false impression. Each component checks out, but the whole is a lie. Veridi distinguishes between passive omission (Lacks Context) and engineered framing (Misleading) by examining whether the false impression appears to be the purpose of the claim.

Selective skepticism exploitation applies impossibly high evidence standards to one side of a debate while accepting the opposing position without evidence. Veridi enforces symmetric evidence standards - the same burden of proof applied to a claim must be applied to its counterclaim.

Coordinated legitimate sourcing mimics genuine consensus through synchronized publication across credible outlets. Timestamp clustering, identical unusual language, and the same small pool of quoted experts are detection indicators.
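Timestamp clustering, the first of these indicators, can be illustrated with a sliding window over publication times. This is a sketch only: the 30-minute window and the threshold of three articles are arbitrary assumptions, not values taken from the methodology, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def has_timestamp_cluster(
    times: list[datetime],
    window: timedelta = timedelta(minutes=30),  # assumed value
    min_count: int = 3,                         # assumed value
) -> bool:
    """Return True if at least min_count publications fall within one window."""
    ordered = sorted(times)
    start = 0
    for end in range(len(ordered)):
        # Shrink the window from the left until it spans no more than `window`.
        while ordered[end] - ordered[start] > window:
            start += 1
        if end - start + 1 >= min_count:
            return True
    return False
```

A cluster on its own would not establish coordination, since breaking news also produces bursts of publication; the text lists it as one indicator alongside shared unusual language and a small pool of quoted experts.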

Preprint pump-and-dump exploits the gap between a study's posting as a preprint and its peer review. A strategically timed, methodologically weak preprint is amplified as "research" before scrutiny can catch up. The methodology checks publication status, timing, and whether the claim's language ("proves") is epistemically justified by the evidence level.

Anchoring embeds a true, easily verified fact in the same sentence as a false assertion, transferring credibility from one to the other. Veridi decomposes multi-clause claims and rates the composite, not the anchor.

Data disappearance exploitation weaponizes the removal of government data collection programs. When a monitoring program is defunded or terminated, the resulting data gap can be exploited in two ways: claiming the absence of new data proves no problem exists, or reframing the elimination of the program as evidence that the program's historical data was unreliable. The methodology maintains awareness of which data programs have been terminated and requires consultation of alternative data sources - international equivalents, state-level programs, academic research, and independent monitoring systems - to fill the gap.

Institutional capture occurs when a formerly reliable institution's output has been compromised by political interference to the point where it can no longer be treated as authoritative on certain topics. The Institutional Reliability Index provides per-agency, per-function assessments with degradation levels, and maps each degraded function to comparison anchors - independent sources that can be consulted instead. The detection challenge is that disinformation may cite the captured institution by name, exploiting its historical reputation while the institution's current output serves a political agenda rather than its original mission.
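A per-agency, per-function index of this kind could be represented as a lookup keyed on both fields. The sketch below is hypothetical throughout: the entry shown is a placeholder rather than a real assessment, the names `IRIEntry`, `IRI`, and `sources_to_consult` are invented, and the numeric level scale is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRIEntry:
    degradation_level: int               # numeric scale is an assumption
    comparison_anchors: tuple[str, ...]  # independent sources to consult instead


# Placeholder data for illustration only; not a real assessment of any agency.
IRI = {
    ("Example Agency", "current research output"): IRIEntry(
        degradation_level=3,
        comparison_anchors=("international equivalent", "academic research"),
    ),
}


def sources_to_consult(agency: str, function: str) -> list[str]:
    """Return comparison anchors for a degraded function, else the agency itself."""
    entry = IRI.get((agency, function))
    if entry is None:
        return [agency]  # no degradation recorded for this specific function
    return list(entry.comparison_anchors)
```

Keying on the function as well as the agency means that a degraded rating for one function does not automatically downgrade everything else the institution has produced.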


The Validation Process

What Was Tested

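The methodology was tested against six sets of claims, totaling ninety-seven individual fact-checks: three smoke tests, twenty-five Golden Test Set A claims, twelve v1 adversarial claims, twelve v2 adversarial claims, twenty-five Golden Test Set B claims, and twenty Golden Test Set C claims.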

How It Was Scored

Each claim was scored as Pass, Partial, or Fail.

The v1 adversarial suite has defined pass criteria: at least ten of twelve claims must score Pass or Partial, with no more than two Partial results. Both blocking claims must pass. At least ten of twelve gaming flags must fire.

The v2 adversarial suite has stricter criteria reflecting its harder content: at least eight of twelve must Pass (not Partial), no more than three Partial results, both blocking claims must Pass, at least ten of twelve primary gaming flags must fire, and at least sixteen of the approximately thirty total gaming flags (primary plus secondary) must fire.
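The v2 criteria above translate directly into a conjunction of thresholds. The sketch below encodes them as stated; the function name and parameter names are hypothetical.

```python
def v2_suite_passes(
    passes: int,
    partials: int,
    blocking_passed: list[bool],
    primary_flags: int,
    total_flags: int,
) -> bool:
    """Apply the v2 adversarial suite pass criteria as listed in the text."""
    return (
        passes >= 8                # at least eight of twelve must Pass
        and partials <= 3          # no more than three Partial results
        and all(blocking_passed)   # both blocking claims must Pass
        and primary_flags >= 10    # at least ten of twelve primary flags
        and total_flags >= 16      # at least sixteen of ~thirty total flags
    )
```

With the figures reported in the Results section (12 passes, 0 partials, both blocking claims passed, 12 primary flags, 39 total flags), `v2_suite_passes(12, 0, [True, True], 12, 39)` evaluates to `True`. A failure on either blocking claim would fail the suite regardless of the other counts.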

We are sensitive to the fact that passing every single test may indicate a weakness in the test suite or in the validation criteria, rather than a strength in the system. If you know of, or can frame, a test that Veridi will fail, we welcome this challenge and look forward to learning from it!


Results

Overall

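| Test Suite | Claims Tested | Passed | Partial | Failed |
| --- | --- | --- | --- | --- |
| Smoke Tests | 3 | 3 | 0 | 0 |
| Golden Test Set A | 25 | 25 | 0 | 0 |
| Adversarial Suite v1 | 12 | 12 | 0 | 0 |
| Adversarial Suite v2 | 12 | 12 | 0 | 0 |
| Golden Test Set B | 25 | 24 | 1 | 0 |
| Golden Test Set C | 20 | 20 | 0 | 0 |
| Total | 97 | 96 | 1 | 0 |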
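As a quick consistency check, the totals row can be recomputed from the per-suite rows. The figures below are copied from the table; the variable names are illustrative.

```python
# (claims tested, passed, partial, failed) per suite, copied from the table above.
rows = {
    "Smoke Tests": (3, 3, 0, 0),
    "Golden Test Set A": (25, 25, 0, 0),
    "Adversarial Suite v1": (12, 12, 0, 0),
    "Adversarial Suite v2": (12, 12, 0, 0),
    "Golden Test Set B": (25, 24, 1, 0),
    "Golden Test Set C": (20, 20, 0, 0),
}

# Sum each column across suites.
totals = tuple(sum(column) for column in zip(*rows.values()))
print(totals)  # (97, 96, 1, 0)
```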

Gaming flag detection: 12 of 12 primary flags in ADV-v1, 12 of 12 primary flags in ADV-v2, 39 total flags in ADV-v2 (against ~30 expected). GTS-B: 7/7 expected flags fired. GTS-C: 4/4 expected flags fired.

Golden Test Set A: By Domain

| Domain | Claims | Passed |
| --- | --- | --- |
| Scientific/Technical | 4 | 4 |
| Legal/Regulatory | 3 | 3 |
| Medical/Health | 3 | 3 |
| Financial/Economic | 3 | 3 |
| Electoral/Voting | 3 | 3 |
| Historical | 3 | 3 |
| Technology/Digital | 3 | 3 |
| Propaganda/General | 3 | 3 |

Golden Test Set A: By Verdict Category

| Verdict | Claims | Passed |
| --- | --- | --- |
| True | 3 | 3 |
| Mostly True | 1 | 1 |
| Mixed | 2 | 2 |
| Mostly False | 5 | 5 |
| False | 5 | 5 |
| Misleading | 4 | 4 |
| Outdated | 1 | 1 |
| Unverifiable | 1 | 1 |

All seven boundary test claims resolved to the expected side of the boundary.

Adversarial Suite v1: By Attack Vector

| Attack Vector | Claims | Gaming Detected | Verdict Correct |
| --- | --- | --- | --- |
| Confidence laundering | 2 | 2 | 2 |
| Citogenesis | 2 | 2 | 2 |
| Unverifiable-by-design | 2 | 2 | 2 |
| Tier inflation | 1 | 1 | 1 |
| Framing manipulation | 1 | 1 | 1 |
| Selective skepticism | 1 | 1 | 1 |
| Coordinated sourcing | 1 | 1 | 1 |
| Preprint pump-and-dump | 1 | 1 | 1 |
| Anchoring | 1 | 1 | 1 |

Both blocking claims - ADV-008 (framing manipulation) and ADV-012 (anchoring) - passed.

Adversarial Suite v2: By Claim

| Claim | Attack Vectors | Expected Verdict | Produced Verdict | Confidence | Primary Flag | Result |
| --- | --- | --- | --- | --- | --- | --- |
| ADV-013: EPA Emissions Data | data-disappearance, framing | MOSTLY FALSE | MOSTLY FALSE | 82% | Detected | Pass |
| ADV-014: USDA Food Insecurity | data-disappearance, selective-skepticism | MISLEADING | MISLEADING | 80% | Detected | Pass |
| ADV-015: CDC Vaccine Guidance | institutional-capture, anchoring | MOSTLY FALSE | MOSTLY FALSE | 88% | Detected | Pass |
| ADV-016: NOAA Climate Research | institutional-capture, confidence-laundering | FALSE | FALSE | 93% | Detected | Pass |
| ADV-017: Athlete Cardiac Events | framing, anchoring, selective-skepticism | MOSTLY FALSE | MOSTLY FALSE | 90% | Detected | Pass |
| ADV-018: VAERS Misuse | confidence-laundering, citogenesis | FALSE | FALSE | 95% | Detected | Pass |
| ADV-019: Immigration Crime Stats | framing, data-disappearance | MISLEADING | MISLEADING | 85% | Detected | Pass |
| ADV-020: FEMA Hurricane Diversion | anchoring, confidence-laundering, coordinated | MOSTLY FALSE | MOSTLY FALSE | 90% | Detected | Pass |
| ADV-021: IARC Processed Meat | framing, anchoring | MISLEADING | MISLEADING | 92% | Detected | Pass |
| ADV-022: Fabricated Lancet Study | tier-inflation, confidence-laundering, anchoring | FALSE | FALSE | 88% | Detected | Pass |
| ADV-023: Temperature Adjustments | framing, selective-skepticism, anchoring | FALSE | FALSE | 95% | Detected | Pass |
| ADV-024: Great Reset Conspiracy | anchoring, framing, unverifiable-by-design | MOSTLY FALSE | MOSTLY FALSE | 92% | Detected | Pass |

Both blocking claims - ADV-015 (institutional capture of CDC vaccine guidance) and ADV-018 (VAERS misuse) - passed.

Adversarial Suite v2: Pass Criteria

| Criterion | Threshold | Actual |
| --- | --- | --- |
| Claims PASS | ≥8 of 12 | 12 of 12 |
| PARTIAL limit | ≤3 | 0 |
| Blocking: ADV-015 | Must PASS | PASS |
| Blocking: ADV-018 | Must PASS | PASS |
| Primary gaming flags | ≥10 of 12 | 12 of 12 |
| Total gaming flags | ≥16 of ~30 | 39 |

Adversarial Suite v2: Gaming Flag Coverage

| Attack Vector | Claims Testing It | Detected |
| --- | --- | --- |
| Data disappearance exploitation | 3 | 3 |
| Institutional capture | 3 | 3 |
| Framing manipulation | 6 | 6 |
| Anchoring | 6 | 6 |
| Selective skepticism | 3 | 3 |
| Confidence laundering | 4 | 4 |
| Citogenesis | 1 | 1 |
| Tier inflation | 1 | 1 |
| Coordinated sourcing | 1 | 1 |
| Unverifiable-by-design | 1 | 1 |

Golden Test Set B: By Category

| Category | Claims | Passed | Partial |
| --- | --- | --- | --- |
| Verdict Boundary Cases | 5 | 5 | 0 |
| Non-Western Context | 5 | 4 | 1 |
| Statistical Manipulation | 5 | 5 | 0 |
| Predictive Claims | 3 | 3 | 0 |
| Breaking Event Scenarios | 3 | 3 | 0 |
| AI-Generated Content | 2 | 2 | 0 |
| Definitional Disputes | 2 | 2 | 0 |

All five boundary tests resolved to the expected side. The single partial (GTS-033, Gaza rebuilding video) produced the correct verdict (FALSE) but confidence was 80% versus the expected 85-92% because the specific Misbar fact-check article was unavailable at test time, limiting sourcing to Tier 2. This is a source-availability limitation, not a methodology defect.
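The GTS-033 outcome suggests how the three scores relate to verdict and confidence. The rule below is an inference from that single example, not a scoring definition stated in this report: it assumes Pass means a correct verdict with confidence inside the expected range, Partial means a correct verdict with confidence outside it, and Fail means a wrong verdict. The function name is hypothetical.

```python
def score_claim(
    produced_verdict: str,
    expected_verdict: str,
    confidence: int,
    expected_range: tuple[int, int],
) -> str:
    """Score one claim under the assumed Pass / Partial / Fail rule."""
    if produced_verdict != expected_verdict:
        return "Fail"
    low, high = expected_range
    if low <= confidence <= high:
        return "Pass"
    return "Partial"  # correct verdict, confidence outside the expected range
```

Under this assumed rule, GTS-033 (verdict FALSE as expected, 80% confidence against an expected 85-92%) scores Partial, matching the reported result.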

Golden Test Set C: Gap Coverage

| Gap Targeted | Claims | Passed |
| --- | --- | --- |
| LACKS CONTEXT standalone | 5 | 5 |
| MOSTLY TRUE expansion | 4 | 4 |
| Non-English source required | 4 | 4 |
| Institutional capture (IRI) | 5 | 5 |
| Genuinely contested ground truth | 6 | 6 |

All six boundary tests resolved to the expected side. Non-English source evaluation succeeded in Japanese (Fukushima tritium data), Turkish (ENAG inflation data), Chinese (NBS youth unemployment methodology), and Hindi (CAA legal text). The IRI framework was correctly applied to non-US institutions (TurkStat and China NBS). The COVID-19 origin claim (GTS-065) was correctly assessed as UNVERIFIABLE. The Cochrane masking review (GTS-070, described as the hardest claim in the test suite) landed MISLEADING rather than the boundary alternative MOSTLY TRUE.


What the Results Mean

Strengths

Verdict accuracy is excellent. Across ninety-seven claims - including some deliberately designed to be confusing - the methodology produced the correct verdict ninety-six times, with one partial where the verdict was correct but confidence fell slightly below the expected range due to a source being unavailable. This is not because the claims were easy. Eighteen claims sit on recognized verdict boundaries, twenty-four adversarial claims were specifically engineered to trigger wrong answers, and six claims involve genuinely contested ground truth where experts disagree. However, the rubric and the system were formulated from the same understanding of reality, which is a known concern.

Boundary resolution is precise. The distinction between "misleading" and "lacking context" is one of the most difficult in fact-checking. It turns on whether the false impression appears to be the purpose of the framing or merely an incidental consequence of incomplete information. The methodology resolved all eighteen boundary tests correctly, suggesting the decision trees are well calibrated.

Gaming countermeasures work under realistic conditions. Every adversarial attack vector was detected across both test suites. Critically, the v2 suite tested multi-vector attacks - claims using two or three gaming techniques simultaneously - and the methodology detected not just the primary vector but secondary and tertiary vectors as well, producing thirty-nine total flag detections against approximately thirty expected. This includes detection of the two new vectors added in v2.2: data disappearance exploitation (where the removal of a government data program is weaponized) and institutional capture (where a formerly authoritative source's output has been compromised by political interference).

The Institutional Reliability Index works. Four v2 claims required the methodology to override its default trust in historically authoritative sources - the EPA, USDA, CDC, and NOAA - based on documented institutional degradation. In all four cases, the IRI was correctly consulted, the appropriate degradation level was applied, and comparison anchors (international equivalents, independent monitoring systems, academic research) were used as primary sources instead. The hardest test was ADV-015, where the methodology had to override the CDC's historically Tier 1 status on vaccine guidance and correctly identify that a real CDC schedule change was politically driven rather than evidence-based. It passed.

The IRI is not misapplied. ADV-023 tested whether the methodology would fall into a trap: NOAA climate research is assessed at Level 3 (compromised), but the claim was about NOAA's historical temperature adjustment methodology, which predates the degradation and has been independently replicated by four other organizations. The methodology correctly distinguished between "this institution's current output is compromised" and "this institution's historical scientific methodology was fraudulent." It applied the IRI where it belonged and refused to apply it where it did not.

Wild-caught disinformation is handled correctly. Four v2 claims were based on documented real-world disinformation patterns rather than being constructed from scratch. The VAERS misuse pattern (ADV-018) is the single most common manipulation in anti-vaccine disinformation. The FEMA diversion claim (ADV-020) circulated widely during the 2024 hurricane season and was repeatedly debunked by professional fact-checkers. The "died suddenly" pattern (ADV-017) and immigration-crime statistics pattern (ADV-019) have been circulating since 2021 and 2024 respectively. The methodology reached the correct verdict on all four through its own analytical process, not by matching against a database of known debunked claims.

Genuinely contested claims are handled with appropriate humility. Six GTS-C claims test topics where the evidence is ambiguous and reasonable analysts disagree: COVID-19 origins, learning loss economic projections, minimum wage employment effects, affirmative action outcomes, nuclear energy comparative safety, and the Cochrane masking review. The methodology produced the correct verdict on all six, with appropriately wide confidence ranges reflecting genuine uncertainty. The COVID-19 origin claim was correctly assessed as UNVERIFIABLE — the methodology was comfortable saying "we don't know" rather than forcing a verdict. The Cochrane masking review — described as the hardest claim in the entire test suite, where professional fact-checkers have given different verdicts — landed MISLEADING at 78% confidence.

Non-English source evaluation works. Four GTS-C claims required evaluation of sources in Japanese, Turkish, Chinese, and Hindi. All four passed. The methodology correctly identified Chinese state media amplification of false Fukushima claims, relied on Turkish-language ENAG data to identify statistical manipulation by TurkStat, identified Chinese NBS methodology changes from Mandarin-language primary sources, and evaluated Hindi-language legal documents for the Indian Citizenship Amendment Act.

Confidence calibration is disciplined. Confidence ratings stayed within expected ranges and respected structural caps across ninety-six of ninety-seven claims (the single exception being explained by source unavailability). Claims supported only by lower-tier sources received appropriately modest confidence ratings, even when the verdict was clear. The methodology did not confuse certainty about the verdict with certainty about the evidence.

Limitations and Caveats

A near-perfect score warrants scrutiny. Ninety-six passes and one partial out of ninety-seven claims is a strong result. The test suite expanded considerably in Phase 3 - adding genuinely contested ground truth, non-English sources, definitional disputes, predictive claims, and AI-generated content - and the methodology handled all of it. The single partial (GTS-033) was caused by source unavailability, not a methodology defect. This may reflect genuinely strong methodology, or it may reflect that the expected verdicts and detection criteria were set by the same people who built the methodology. External validation - where neither the claims nor the expected results are designed by the methodology's authors - would provide stronger evidence. We welcome any such external test sets, provided their expected verdicts are rigorously established.

Most adversarial claims are constructed, though some are wild-caught. The v2 suite improved on v1 by including four claims based on real-world disinformation patterns and by requiring multi-vector detection. However, even the wild-caught claims were adapted and formalized for testing rather than submitted verbatim as encountered in the wild. The methodology should eventually be tested against raw, unedited disinformation as it actually appears on social media, news sites, and political communications.

Validation was conducted by the methodology's own implementation. The fact-checks were performed by agents following the Veridi methodology files. This tests whether the methodology produces correct results when followed, but it does not test whether human volunteers - the intended users - can follow it correctly. Usability testing with real volunteers is a separate and necessary validation step.

~~The golden test set has gaps.~~ (Addressed in Phase 3.) GTS-B added two LACKS CONTEXT boundary cases and GTS-C added five standalone LACKS CONTEXT claims - all seven passed. GTS-B added one MOSTLY TRUE claim and GTS-C added four - all five passed. The verdict distribution across the full suite now covers all nine categories with multiple claims each.

~~Non-English claims are undertested.~~ (Addressed in Phase 3.) GTS-C includes four claims requiring non-English source evaluation in Japanese (Fukushima), Turkish (inflation data), Chinese (unemployment methodology), and Hindi (citizenship law). All four passed. GTS-B added five non-Western context claims across India, Argentina, Gaza/Turkey, and South Africa. Four passed outright; one scored partial due to source unavailability.

~~Institutional capture tests were absent in v1.~~ (Addressed in v2.2.) The v2 adversarial suite includes three claims testing institutional capture detection (ADV-015, ADV-016, ADV-023) and three claims testing data disappearance exploitation (ADV-013, ADV-014, ADV-019). The blocking claim ADV-015 specifically tested whether the methodology could override a historically Tier 1 source (CDC) using the Institutional Reliability Index. All six claims passed.


Conclusion

The Veridi methodology, after twelve rounds of remediation addressing ninety findings from a comprehensive audit, passes its validation with ninety-six passes and one partial across ninety-seven claims. The methodology correctly identifies verdicts across all eight subject domains, properly resolves all eighteen verdict boundary cases, detects all twenty-four adversarial gaming scenarios across eleven attack vectors, correctly applies the Institutional Reliability Index to override degraded sources (including non-US institutions), handles real-world disinformation patterns found in the wild, evaluates non-English sources in four languages, navigates genuinely contested ground truth with appropriate humility, and correctly handles predictive, breaking-event, AI-generated, and definitional-dispute claims.

The v2.2 additions - the Institutional Reliability Index, data disappearance exploitation detection, and expanded gaming countermeasures - represent the methodology's most important capability expansion. In a period when government data programs are being terminated and historically authoritative agencies are experiencing political interference, a fact-checking methodology that cannot account for institutional degradation is fundamentally incomplete. Veridi now accounts for it, and the validation confirms the mechanism works.

These results indicate that the methodology is ready for controlled deployment - meaning real-world use by trained volunteers with ongoing quality monitoring and periodic regression testing. They do not indicate that the methodology is finished. Usability testing should be conducted with volunteers, the calibration tracking system should accumulate enough data points to assess whether the methodology's confidence ratings match real-world outcomes over time, and the methodology should be tested against raw, unedited disinformation as it appears in the wild rather than formalized test claims.

The methodology's explicit treatment of disinformation as an adversarial problem - rather than a simple matter of true-or-false - is its most significant contribution. By naming the attack patterns, providing detection procedures, and maintaining a living index of institutional reliability, Veridi equips fact-checkers with tools for the landscape as it actually exists, not as we might wish it to be.


Appendix: Detailed Results

Full per-claim scorecards, evidence summaries, decision tree paths, and gaming countermeasure analyses are available in the validation results directory:


Validation conducted February 25, 2026. Methodology version: Veridi v2.2. Canonical methodology files located at Veridi/fact-checker-files/. Skill implementation at ~/.claude/skills/factcheck/SKILL.md.