Validation Report: The Veridi Fact-Checking Methodology
February 25, 2026
Summary
Veridi is a structured fact-checking methodology designed for a Canadian cooperative organization. After a comprehensive audit and twelve rounds of remediation between February 20 and February 25, 2026, the methodology was tested against ninety-seven claims spanning eight subject domains, nine verdict categories, twenty-four adversarial scenarios covering eleven disinformation attack patterns, non-English source evaluation in four languages, institutional capture detection across US and international agencies, genuinely contested ground truth where experts disagree, and predictive, breaking-event, and AI-generated content claims. Ninety-six claims passed outright. One scored partial: a correct verdict with confidence of 80% against an expected range of 85-92%, because a source was unavailable at test time.
The validation was conducted in three phases. Phase 1 tested forty claims: three smoke tests, twenty-five golden test set claims (GTS-A) with documented ground truth, and twelve adversarial claims each targeting a single gaming vector. Phase 2 added twelve harder adversarial claims (ADV-v2), each combining two to three gaming vectors simultaneously, including four claims based on real-world disinformation patterns found in the wild and two methodology stress tests. Phase 2 also tested two new capabilities added in methodology version 2.2: the Institutional Reliability Index (which assesses when formerly trustworthy government agencies can no longer be treated as authoritative) and detection of data disappearance exploitation (which identifies when the removal of government data collection programs is weaponized to support false narratives). Phase 3 tested forty-five additional golden test set claims: twenty-five weakness-targeting claims (GTS-B) covering verdict boundaries, non-Western contexts, statistical manipulation, predictive claims, breaking events, AI-generated content, and definitional disputes; and twenty gap-filling claims (GTS-C) targeting standalone LACKS CONTEXT verdicts, expanded MOSTLY TRUE coverage, non-English source evaluation, institutional capture scenarios, and genuinely contested ground truth where the evidence is ambiguous and reasonable analysts disagree.
This report describes what Veridi is, how the validation was conducted, and what the results tell us about the methodology's readiness for real-world use.
What Veridi Does
Most people evaluate claims the same way: they read something, decide whether it sounds right, and move on. Even professional fact-checkers often rely on experience and instinct to guide their assessments. Veridi takes a different approach. It provides an explicit, step-by-step process - a checklist of sorts - for evaluating any factual claim, regardless of subject matter.
The methodology is built around several core principles:
Truth comes first. Every fact-check begins its output by stating what is actually true - clearly, affirmatively, and without negation. The claim under review appears second. This ordering matters because psychological research consistently shows that repeating a false claim - even to debunk it - can reinforce it. Veridi leads with the corrective.
Sources are ranked, not trusted. The methodology uses a four-tier source hierarchy. At the top sit primary sources: government databases, peer-reviewed research, court records, and raw authoritative data. Below them are authoritative secondary sources like wire services, established fact-checkers, and credentialed experts. General news media occupies the third tier. Social media, anonymous sources, and partisan outlets sit at the bottom. Every piece of evidence in a Veridi assessment carries its tier classification, making the evidentiary basis transparent.
Confidence has a ceiling. A fact-check supported only by third-tier sources cannot receive a confidence rating above 65%, regardless of how many such sources agree. Multiple low-quality sources do not substitute for a single high-quality one. This structural cap prevents the methodology from being fooled by volume.
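The cap can be sketched in a few lines. This is a minimal illustration, not the methodology's actual implementation: the function name and the numeric tier encoding (1 = primary, 4 = social media and anonymous sources) are assumptions, and the report states only the 65% cap for third-tier-only evidence. Treating Tier 4-only evidence as capped at least as tightly is also an assumption.

```python
from typing import Iterable

TIER3_ONLY_CAP = 0.65  # from the report: third-tier-only evidence caps at 65%


def capped_confidence(raw_confidence: float, source_tiers: Iterable[int]) -> float:
    """Apply the structural cap based on the best (lowest-numbered) tier present."""
    tiers = list(source_tiers)
    if not tiers:
        raise ValueError("at least one source is required")
    best_tier = min(tiers)
    # Assumption: if the best source is Tier 3 or worse, the 65% cap applies.
    # The report gives the Tier 3 figure only; Tier 4 handling is a guess.
    if best_tier >= 3:
        return min(raw_confidence, TIER3_ONLY_CAP)
    return raw_confidence


# Four agreeing Tier 3 sources still cannot exceed the cap.
print(capped_confidence(0.90, [3, 3, 3, 3]))
# One Tier 1 source lifts the cap.
print(capped_confidence(0.90, [1, 3]))
```

The cap depends on the best tier present rather than on the number of sources, which is what keeps volume from substituting for quality.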
Gaming is expected. Perhaps the methodology's most distinctive feature is its explicit treatment of disinformation as an adversarial problem. Veridi includes countermeasures for eleven specific attack patterns - techniques that bad actors use to make false claims appear credible. More on this below.
Institutions can degrade. The methodology includes an Institutional Reliability Index that tracks when government agencies or other traditionally authoritative sources have been compromised by political interference, defunding, or institutional capture. A source that was Tier 1 last year may be Tier 3 today. The IRI provides per-agency, per-function assessments with comparison anchors - alternative sources to consult when a primary source's reliability has degraded. This matters because some of the most effective disinformation attributes false claims to institutions that were once trustworthy but no longer are.
How the Methodology Works
A fact-check under Veridi proceeds through a defined sequence. The claim is first classified by domain - scientific, legal, medical, financial, electoral, historical, technological, or propaganda - and by complexity. The methodology then loads domain-specific evaluation frameworks as needed. Evidence is gathered through structured searches. Sources are evaluated and assigned tiers. Gaming countermeasures are applied. A verdict is determined using decision trees designed to handle the most confusable distinctions - the difference between "misleading" and "lacking context," for instance, or between "mixed" and "mostly false." Confidence is calibrated against the evidence strength, with structural caps enforced. Finally, a quality-assurance checklist is completed before the assessment is published.
The methodology uses nine verdict categories: True, Mostly True, Mixed, Mostly False, False, Unverifiable, Misleading, Outdated, and Lacks Context. This taxonomy is more granular than the binary true/false model and reflects the reality that most real-world claims are not cleanly one or the other.
Four verification depth tiers are available - Quick, Standard, Full, and Forensic - allowing the methodology to scale its rigor to the stakes of the claim. A simple question of established fact might need only one authoritative source. A politically charged claim involving statistical manipulation might require a dozen sources, specialist analysis, and a full gaming countermeasure scan.
The Gaming Countermeasures
Disinformation is not random. It follows patterns - techniques that exploit specific weaknesses in how people and institutions evaluate information. Veridi identifies eleven such patterns and provides explicit detection procedures for each. We are always on the lookout for more.
Confidence laundering occurs when a claim is repeated by multiple outlets that all trace back to a single, unreliable source. The appearance of independent confirmation is manufactured. Veridi requires tracing every source to its origin; derived sources do not boost confidence.
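One way to picture origin tracing is to collapse every outlet to the source it ultimately derives from and count what remains. The sketch below assumes a hypothetical `derived_from` mapping (outlet to the source it copied) has already been assembled by the fact-checker; the outlet names are invented for illustration.

```python
def trace_origin(source: str, derived_from: dict[str, str]) -> str:
    """Follow 'derived from' links until reaching a source with no parent."""
    seen = {source}
    while source in derived_from:
        source = derived_from[source]
        if source in seen:  # circular chain: stop rather than loop forever
            break
        seen.add(source)
    return source


def independent_origins(sources: list[str], derived_from: dict[str, str]) -> set[str]:
    """Collapse a list of outlets to the set of distinct original sources."""
    return {trace_origin(s, derived_from) for s in sources}


# Hypothetical example: three outlets, all tracing back to one blog.
derived_from = {"outlet_a": "blog", "outlet_b": "outlet_a", "outlet_c": "blog"}
origins = independent_origins(["outlet_a", "outlet_b", "outlet_c"], derived_from)
print(origins)  # a single origin, so there is one confirmation, not three
```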
Citogenesis is the creation of circular citations - a claim appears on a website, is picked up by a news outlet, and the original source is then updated to cite the news outlet as confirmation. Wikipedia is particularly vulnerable to this pattern. The methodology includes timestamp and language-similarity checks to detect these loops.
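A circular citation is a cycle in the "cites" graph, so loop detection can be sketched as a depth-first search. This illustrates the idea only and is not the methodology's own check: the `cites` mapping is a hypothetical input, and the timestamp and language-similarity checks mentioned above are not modeled.

```python
from typing import Optional


def find_citation_loop(cites: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one citation cycle as a list of documents, or None if there is none."""
    visiting: list[str] = []  # current depth-first path
    done: set[str] = set()    # documents fully explored without finding a loop

    def visit(node: str) -> Optional[list[str]]:
        if node in done:
            return None
        if node in visiting:
            # Reached a document already on the current path: that is a loop.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for target in cites.get(node, []):
            loop = visit(target)
            if loop:
                return loop
        visiting.pop()
        done.add(node)
        return None

    for start in cites:
        loop = visit(start)
        if loop:
            return loop
    return None


# Hypothetical example: a wiki page cites a news article that cites the wiki page.
print(find_citation_loop({"wiki_page": ["news_article"], "news_article": ["wiki_page"]}))
```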
Unverifiable-by-design claims are structured so that verification is impossible by architecture. Anonymous sources discussing classified material in private settings cannot be confirmed or denied - by design. Veridi flags these patterns and caps confidence accordingly, rather than treating specificity as a proxy for credibility.
Tier inflation launders low-quality claims through progressively more credible outlets until they appear authoritative. An anonymous blog post becomes a news aggregator article becomes a respected publication's report. The methodology traces the evidence chain back to its origin and assigns tier classification based on the original source, not the final publisher.
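The origin-based rule can be sketched as follows. The `Hop` type, the function name, and the tier numbers given to each outlet are illustrative assumptions; the report specifies only that the tier comes from the original source rather than the final publisher.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    outlet: str
    tier: int  # 1 = primary source ... 4 = social media, anonymous, partisan


def assess_chain(chain: list[Hop]) -> tuple[int, bool]:
    """Return (effective tier, inflation flag) for a chain ordered origin-first."""
    if not chain:
        raise ValueError("evidence chain is empty")
    origin, final = chain[0], chain[-1]
    # Inflation: the final publisher looks more credible than the origin.
    inflated = final.tier < origin.tier
    return origin.tier, inflated


# Hypothetical chain mirroring the example in the text.
chain = [
    Hop("anonymous blog", 4),
    Hop("news aggregator", 3),
    Hop("respected publication", 2),  # tier assumed for illustration
]
print(assess_chain(chain))
```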
Framing manipulation - perhaps the most dangerous pattern - uses individually true facts to create a composite false impression. Each component checks out, but the whole is a lie. Veridi distinguishes between passive omission (Lacks Context) and engineered framing (Misleading) by examining whether the false impression appears to be the purpose of the claim.
Selective skepticism exploitation applies impossibly high evidence standards to one side of a debate while accepting the opposing position without evidence. Veridi enforces symmetric evidence standards - the same burden of proof applied to a claim must be applied to its counterclaim.
Coordinated legitimate sourcing mimics genuine consensus through synchronized publication across credible outlets. Timestamp clustering, identical unusual language, and the same small pool of quoted experts are detection indicators.
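Of the three indicators, timestamp clustering is the easiest to sketch: sort the publication times and slide a fixed-length window over them. The function below is an illustration under assumed inputs; the window length and any threshold for "suspicious" are hypothetical, since the report does not state them.

```python
from datetime import datetime, timedelta


def largest_cluster(timestamps: list[datetime], window: timedelta) -> int:
    """Return the largest number of publications inside any window of the given length."""
    ordered = sorted(timestamps)
    best, start = 0, 0
    for end, ts in enumerate(ordered):
        # Shrink the window from the left until it spans at most `window`.
        while ts - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


# Hypothetical publication times: three within 20 minutes, one hours later.
times = [
    datetime(2026, 2, 1, 9, 0),
    datetime(2026, 2, 1, 9, 5),
    datetime(2026, 2, 1, 9, 20),
    datetime(2026, 2, 1, 15, 0),
]
print(largest_cluster(times, timedelta(minutes=30)))
```

A large cluster is only an indicator: genuine breaking news also produces near-simultaneous coverage, which is why the text pairs it with language and expert-pool checks.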
Preprint pump-and-dump exploits the gap between a study's posting as a preprint and its peer review. A strategically timed, methodologically weak preprint is amplified as "research" before scrutiny can catch up. The methodology checks publication status, timing, and whether the claim's language ("proves") is epistemically justified by the evidence level.
Anchoring embeds a true, easily verified fact in the same sentence as a false assertion, transferring credibility from one to the other. Veridi decomposes multi-clause claims and rates the composite, not the anchor.
Data disappearance exploitation weaponizes the removal of government data collection programs. When a monitoring program is defunded or terminated, the resulting data gap can be exploited in two ways: claiming the absence of new data proves no problem exists, or reframing the elimination of the program as evidence that the program's historical data was unreliable. The methodology maintains awareness of which data programs have been terminated and requires consultation of alternative data sources - international equivalents, state-level programs, academic research, and independent monitoring systems - to fill the gap.
Institutional capture occurs when a formerly reliable institution's output has been compromised by political interference to the point where it can no longer be treated as authoritative on certain topics. The Institutional Reliability Index provides per-agency, per-function assessments with degradation levels, and maps each degraded function to comparison anchors - independent sources that can be consulted instead. The detection challenge is that disinformation may cite the captured institution by name, exploiting its historical reputation while the institution's current output serves a political agenda rather than its original mission.
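A per-agency, per-function entry with comparison anchors might look like the sketch below. Every field name is hypothetical, as are the `degraded_since` date and the threshold level at which anchors replace the agency. The one grounded value is Level 3 (compromised) for NOAA climate research, taken from the results later in this report. The date check reflects the assumption that degradation applies to output produced after it began, not to an institution's earlier work.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IRIEntry:
    agency: str
    function: str
    level: int                # degradation level; 3 = "compromised" in this report
    degraded_since: date      # hypothetical field: when degradation began
    anchors: tuple[str, ...]  # comparison anchors to consult instead


def sources_to_consult(entry: IRIEntry, evidence_date: date, degraded_level: int = 3) -> list[str]:
    """Use comparison anchors only for output produced after degradation began."""
    if entry.level >= degraded_level and evidence_date >= entry.degraded_since:
        return list(entry.anchors)
    return [entry.agency]


# Illustrative entry; the date is invented for the example.
noaa = IRIEntry(
    agency="NOAA",
    function="climate research",
    level=3,
    degraded_since=date(2025, 1, 1),
    anchors=("international equivalents", "academic research"),
)
print(sources_to_consult(noaa, date(2025, 6, 1)))  # current output: use anchors
print(sources_to_consult(noaa, date(2010, 1, 1)))  # historical work: agency itself
```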
The Validation Process
What Was Tested
The methodology was tested against six sets of claims, totaling ninety-seven individual fact-checks:
- Three smoke tests across different verification tiers: a simple false claim (Quick tier), a well-established scientific consensus claim (Standard tier), and a complex non-Western propaganda claim requiring statistical and specialist analysis (Full tier).
- Twenty-five golden test set claims spanning all eight domains, all complexity levels (simple, moderate, complex), and eight of nine verdict categories. Each claim has a documented ground truth from established fact-checkers or primary sources. Seven of these claims are deliberate boundary tests - claims designed to sit on the line between two verdict categories, testing whether the methodology resolves them correctly.
- Twelve single-vector adversarial claims (ADV v1) specifically engineered to exploit the nine original gaming attack patterns. These claims use real place names, real institutions, real scientific phenomena, and plausible-sounding statistics to create convincing disinformation. Two of these - a framing manipulation test and an anchoring test - were designated as blocking: failure on either would indicate a fundamental methodology weakness.
- Twelve multi-vector adversarial claims (ADV v2) designed to test the methodology under conditions closer to real-world disinformation. Each claim combines two to three gaming vectors simultaneously. Four are based on documented disinformation patterns found in the wild: the "died suddenly" athlete cardiac narrative, VAERS misuse with the Harvard Pilgrim study, the immigration-crime statistics pattern, and the FEMA hurricane relief diversion claim. Two are methodology stress tests: a claim where every individual sub-claim is true but the composite is false (the IARC processed meat misinterpretation), and a fabricated Lancet study attribution designed to test source verification. Four claims require consultation of the Institutional Reliability Index. Two blocking claims test the most common real-world attack patterns against public health fact-checking: institutional capture of CDC vaccine guidance, and the VAERS passive-surveillance-to-confirmed-deaths misrepresentation.
- Twenty-five weakness-targeting golden test set claims (GTS-B) designed to probe known methodology edge cases: five verdict boundary cases (MISLEADING/LACKS CONTEXT, MIXED/MOSTLY FALSE, MOSTLY TRUE/TRUE), five non-Western context claims verified through regional IFCN-certified fact-checkers (Alt News India, Chequeado Argentina, Misbar Middle East, Africa Check), five statistical manipulation claims (cherry-picked timeframes, baseline manipulation, denominator tricks, fabricated statistics, scope mismatches), three predictive claims (methodologically sound, flawed, and insufficient-information), three breaking event claims (core observable, peripheral, narrative), two AI-generated content claims (deepfake video, text detection limitations), and two definitional dispute claims (US recession definition, Uyghur genocide classification).
- Twenty gap-filling golden test set claims (GTS-C) targeting specific gaps identified in the validation report: five standalone LACKS CONTEXT claims (previously untested as a standalone verdict), four additional MOSTLY TRUE claims (previously tested only once), four non-English source claims requiring evaluation in Japanese, Turkish, Chinese, and Hindi, five institutional capture claims (including non-US institutions), and six genuinely contested ground truth claims where the evidence is ambiguous and reasonable analysts disagree (COVID-19 origins, learning loss economic projections, minimum wage effects, affirmative action outcomes, nuclear energy safety comparisons, and the Cochrane masking review).
How It Was Scored
Each claim was scored as Pass, Partial, or Fail:
- Pass: Verdict matches expected, confidence falls within the expected range, and (for adversarial claims) the correct gaming flag is detected.
- Partial: Verdict matches but confidence is outside range, or the verdict is the expected boundary alternative, or the gaming flag is detected but the verdict is wrong.
- Fail: Verdict is wrong and is not the expected boundary alternative, or (for adversarial claims) both the gaming flag is missed and the verdict is wrong.
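The rubric above can be sketched as a small scoring function. The type and function names are hypothetical, and one case is not covered by the rubric as written: a correct verdict with in-range confidence but a missed gaming flag. The sketch scores that as Partial, which is an assumption. The usage example applies the rubric to GTS-033, whose verdict, confidence, and expected range appear in the results of this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Expected:
    verdict: str
    confidence_range: tuple[float, float]
    boundary_alternative: Optional[str] = None
    gaming_flag: Optional[str] = None  # set only for adversarial claims


def score(expected: Expected, verdict: str, confidence: float,
          flags: frozenset = frozenset()) -> str:
    low, high = expected.confidence_range
    verdict_ok = verdict == expected.verdict
    in_range = low <= confidence <= high
    flag_ok = expected.gaming_flag is None or expected.gaming_flag in flags

    if verdict_ok and in_range and flag_ok:
        return "Pass"
    if verdict_ok:
        # Confidence out of range, or (by assumption) a missed flag.
        return "Partial"
    if verdict == expected.boundary_alternative:
        return "Partial"
    if expected.gaming_flag is not None and expected.gaming_flag in flags:
        return "Partial"  # flag detected but verdict wrong
    return "Fail"


# GTS-033: correct verdict (FALSE) at 80% against an expected 85-92%.
gts_033 = Expected(verdict="FALSE", confidence_range=(0.85, 0.92))
print(score(gts_033, "FALSE", 0.80))
```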
The v1 adversarial suite has defined pass criteria: at least ten of twelve claims must score Pass or Partial, with no more than two Partial results. Both blocking claims must pass. At least ten of twelve gaming flags must fire.
The v2 adversarial suite has stricter criteria reflecting its harder content: at least eight of twelve must Pass (not Partial), no more than three Partial results, both blocking claims must Pass, at least ten of twelve primary gaming flags must fire, and at least sixteen of the approximately thirty total gaming flags (primary plus secondary) must fire.
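Both sets of suite-level criteria reduce to simple threshold checks. The sketch below encodes them directly; the function and parameter names are hypothetical, and the example calls use the counts reported in the results tables of this report.

```python
def v1_suite_passes(passes: int, partials: int, blocking_passed: bool,
                    flags_fired: int) -> bool:
    """v1: >=10 of 12 Pass or Partial, <=2 Partial, both blocking pass, >=10 flags."""
    return (passes + partials >= 10
            and partials <= 2
            and blocking_passed
            and flags_fired >= 10)


def v2_suite_passes(passes: int, partials: int, blocking_passed: bool,
                    primary_flags: int, total_flags: int) -> bool:
    """v2: >=8 of 12 Pass, <=3 Partial, both blocking Pass, >=10 primary, >=16 total flags."""
    return (passes >= 8
            and partials <= 3
            and blocking_passed
            and primary_flags >= 10
            and total_flags >= 16)


# Reported results: v1 had 12 passes and 12 flags; v2 had 12 passes,
# 12 primary flags, and 39 total flags.
print(v1_suite_passes(12, 0, True, 12))
print(v2_suite_passes(12, 0, True, 12, 39))
```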
We are sensitive to the fact that passing every single test may indicate a weakness in the test suite or in the validation criteria, rather than a strength in the system. If you know of, or can frame, a test that Veridi will fail, we welcome this challenge and look forward to learning from it!
Results
Overall
| Test Suite | Claims Tested | Passed | Partial | Failed |
|---|---|---|---|---|
| Smoke Tests | 3 | 3 | 0 | 0 |
| Golden Test Set A | 25 | 25 | 0 | 0 |
| Adversarial Suite v1 | 12 | 12 | 0 | 0 |
| Adversarial Suite v2 | 12 | 12 | 0 | 0 |
| Golden Test Set B | 25 | 24 | 1 | 0 |
| Golden Test Set C | 20 | 20 | 0 | 0 |
| Total | 97 | 96 | 1 | 0 |
Gaming flag detection: 12 of 12 primary flags in ADV-v1, 12 of 12 primary flags in ADV-v2, 39 total flags in ADV-v2 (against ~30 expected). GTS-B: 7/7 expected flags fired. GTS-C: 4/4 expected flags fired.
Golden Test Set A: By Domain
| Domain | Claims | Passed |
|---|---|---|
| Scientific/Technical | 4 | 4 |
| Legal/Regulatory | 3 | 3 |
| Medical/Health | 3 | 3 |
| Financial/Economic | 3 | 3 |
| Electoral/Voting | 3 | 3 |
| Historical | 3 | 3 |
| Technology/Digital | 3 | 3 |
| Propaganda/General | 3 | 3 |
Golden Test Set A: By Verdict Category
| Verdict | Claims | Passed |
|---|---|---|
| True | 3 | 3 |
| Mostly True | 1 | 1 |
| Mixed | 2 | 2 |
| Mostly False | 5 | 5 |
| False | 5 | 5 |
| Misleading | 4 | 4 |
| Outdated | 1 | 1 |
| Unverifiable | 1 | 1 |
All seven boundary test claims resolved to the expected side of the boundary.
Adversarial Suite v1: By Attack Vector
| Attack Vector | Claims | Gaming Detected | Verdict Correct |
|---|---|---|---|
| Confidence laundering | 2 | 2 | 2 |
| Citogenesis | 2 | 2 | 2 |
| Unverifiable-by-design | 2 | 2 | 2 |
| Tier inflation | 1 | 1 | 1 |
| Framing manipulation | 1 | 1 | 1 |
| Selective skepticism | 1 | 1 | 1 |
| Coordinated sourcing | 1 | 1 | 1 |
| Preprint pump-and-dump | 1 | 1 | 1 |
| Anchoring | 1 | 1 | 1 |
Both blocking claims - ADV-008 (framing manipulation) and ADV-012 (anchoring) - passed.
Adversarial Suite v2: By Claim
| Claim | Attack Vectors | Expected Verdict | Produced Verdict | Confidence | Primary Flag | Result |
|---|---|---|---|---|---|---|
| ADV-013: EPA Emissions Data | data-disappearance, framing | MOSTLY FALSE | MOSTLY FALSE | 82% | Detected | Pass |
| ADV-014: USDA Food Insecurity | data-disappearance, selective-skepticism | MISLEADING | MISLEADING | 80% | Detected | Pass |
| ADV-015: CDC Vaccine Guidance | institutional-capture, anchoring | MOSTLY FALSE | MOSTLY FALSE | 88% | Detected | Pass |
| ADV-016: NOAA Climate Research | institutional-capture, confidence-laundering | FALSE | FALSE | 93% | Detected | Pass |
| ADV-017: Athlete Cardiac Events | framing, anchoring, selective-skepticism | MOSTLY FALSE | MOSTLY FALSE | 90% | Detected | Pass |
| ADV-018: VAERS Misuse | confidence-laundering, citogenesis | FALSE | FALSE | 95% | Detected | Pass |
| ADV-019: Immigration Crime Stats | framing, data-disappearance | MISLEADING | MISLEADING | 85% | Detected | Pass |
| ADV-020: FEMA Hurricane Diversion | anchoring, confidence-laundering, coordinated | MOSTLY FALSE | MOSTLY FALSE | 90% | Detected | Pass |
| ADV-021: IARC Processed Meat | framing, anchoring | MISLEADING | MISLEADING | 92% | Detected | Pass |
| ADV-022: Fabricated Lancet Study | tier-inflation, confidence-laundering, anchoring | FALSE | FALSE | 88% | Detected | Pass |
| ADV-023: Temperature Adjustments | framing, selective-skepticism, anchoring | FALSE | FALSE | 95% | Detected | Pass |
| ADV-024: Great Reset Conspiracy | anchoring, framing, unverifiable-by-design | MOSTLY FALSE | MOSTLY FALSE | 92% | Detected | Pass |
Both blocking claims - ADV-015 (institutional capture of CDC vaccine guidance) and ADV-018 (VAERS misuse) - passed.
Adversarial Suite v2: Pass Criteria
| Criterion | Threshold | Actual |
|---|---|---|
| Claims PASS | ≥8 of 12 | 12 of 12 |
| PARTIAL limit | ≤3 | 0 |
| Blocking: ADV-015 | Must PASS | PASS |
| Blocking: ADV-018 | Must PASS | PASS |
| Primary gaming flags | ≥10 of 12 | 12 of 12 |
| Total gaming flags | ≥16 of ~30 | 39 |
Adversarial Suite v2: Gaming Flag Coverage
| Attack Vector | Claims Testing It | Detected |
|---|---|---|
| Data disappearance exploitation | 3 | 3 |
| Institutional capture | 3 | 3 |
| Framing manipulation | 6 | 6 |
| Anchoring | 6 | 6 |
| Selective skepticism | 3 | 3 |
| Confidence laundering | 4 | 4 |
| Citogenesis | 1 | 1 |
| Tier inflation | 1 | 1 |
| Coordinated sourcing | 1 | 1 |
| Unverifiable-by-design | 1 | 1 |
Golden Test Set B: By Category
| Category | Claims | Passed | Partial |
|---|---|---|---|
| Verdict Boundary Cases | 5 | 5 | 0 |
| Non-Western Context | 5 | 4 | 1 |
| Statistical Manipulation | 5 | 5 | 0 |
| Predictive Claims | 3 | 3 | 0 |
| Breaking Event Scenarios | 3 | 3 | 0 |
| AI-Generated Content | 2 | 2 | 0 |
| Definitional Disputes | 2 | 2 | 0 |
All five boundary tests resolved to the expected side. The single partial (GTS-033, Gaza rebuilding video) produced the correct verdict (FALSE) but confidence was 80% versus the expected 85-92% because the specific Misbar fact-check article was unavailable at test time, limiting sourcing to Tier 2. This is a source-availability limitation, not a methodology defect.
Golden Test Set C: Gap Coverage
| Gap Targeted | Claims | Passed |
|---|---|---|
| LACKS CONTEXT standalone | 5 | 5 |
| MOSTLY TRUE expansion | 4 | 4 |
| Non-English source required | 4 | 4 |
| Institutional capture (IRI) | 5 | 5 |
| Genuinely contested ground truth | 6 | 6 |
All six boundary tests resolved to the expected side. Non-English source evaluation succeeded in Japanese (Fukushima tritium data), Turkish (ENAG inflation data), Chinese (NBS youth unemployment methodology), and Hindi (CAA legal text). The IRI framework was correctly applied to non-US institutions (TurkStat and China NBS). The COVID-19 origin claim (GTS-065) was correctly assessed as UNVERIFIABLE. The Cochrane masking review (GTS-070, described as the hardest claim in the test suite) landed MISLEADING rather than the boundary alternative MOSTLY TRUE.
What the Results Mean
Strengths
Verdict accuracy is excellent. Across ninety-seven claims - including some deliberately designed to be confusing - the methodology produced the correct verdict ninety-six times, with one partial where the verdict was correct but confidence fell slightly below the expected range due to a source being unavailable. This is not because the claims were easy. Eighteen claims sit on recognized verdict boundaries, twenty-four adversarial claims were specifically engineered to trigger wrong answers, and six claims involve genuinely contested ground truth where experts disagree. However, the scoring rubric and the methodology were formulated from the same understanding of reality, so they may share blind spots; this is a known concern.
Boundary resolution is precise. The distinction between "misleading" and "lacking context" is one of the most difficult in fact-checking. It turns on whether the false impression appears to be the purpose of the framing or merely an incidental consequence of incomplete information. The methodology resolved all eighteen boundary tests correctly, suggesting the decision trees are well calibrated.
Gaming countermeasures work under realistic conditions. Every adversarial attack vector was detected across both test suites. Critically, the v2 suite tested multi-vector attacks - claims using two or three gaming techniques simultaneously - and the methodology detected not just the primary vector but secondary and tertiary vectors as well, producing thirty-nine total flag detections against approximately thirty expected. This includes detection of the two new vectors added in v2.2: data disappearance exploitation (where the removal of a government data program is weaponized) and institutional capture (where a formerly authoritative source's output has been compromised by political interference).
The Institutional Reliability Index works. Four v2 claims required the methodology to override its default trust in historically authoritative sources - the EPA, USDA, CDC, and NOAA - based on documented institutional degradation. In all four cases, the IRI was correctly consulted, the appropriate degradation level was applied, and comparison anchors (international equivalents, independent monitoring systems, academic research) were used as primary sources instead. The hardest test was ADV-015, where the methodology had to override the CDC's historically Tier 1 status on vaccine guidance and correctly identify that a real CDC schedule change was politically driven rather than evidence-based. It passed.
The IRI is not misapplied. ADV-023 tested whether the methodology would fall into a trap: NOAA climate research is assessed at Level 3 (compromised), but the claim was about NOAA's historical temperature adjustment methodology, which predates the degradation and has been independently replicated by four other organizations. The methodology correctly distinguished between "this institution's current output is compromised" and "this institution's historical scientific methodology was fraudulent." It applied the IRI where it belonged and refused to apply it where it did not.
Wild-caught disinformation is handled correctly. Four v2 claims were based on documented real-world disinformation patterns rather than being constructed from scratch. The VAERS misuse pattern (ADV-018) is the single most common manipulation in anti-vaccine disinformation. The FEMA diversion claim (ADV-020) circulated widely after Hurricane Helene in 2024 and was debunked by multiple fact-checkers. The "died suddenly" pattern (ADV-017) and immigration-crime statistics pattern (ADV-019) have been circulating since 2021 and 2024 respectively. The methodology reached the correct verdict on all four through its own analytical process, not by matching against a database of known debunked claims.
Genuinely contested claims are handled with appropriate humility. Six GTS-C claims test topics where the evidence is ambiguous and reasonable analysts disagree: COVID-19 origins, learning loss economic projections, minimum wage employment effects, affirmative action outcomes, nuclear energy comparative safety, and the Cochrane masking review. The methodology produced the correct verdict on all six, with appropriately wide confidence ranges reflecting genuine uncertainty. The COVID-19 origin claim was correctly assessed as UNVERIFIABLE - the methodology was comfortable saying "we don't know" rather than forcing a verdict. The Cochrane masking review - described as the hardest claim in the entire test suite, where professional fact-checkers have given different verdicts - landed MISLEADING at 78% confidence.
Non-English source evaluation works. Four GTS-C claims required evaluation of sources in Japanese, Turkish, Chinese, and Hindi. All four passed. The methodology correctly identified Chinese state media amplification of false Fukushima claims, relied on Turkish-language ENAG data to identify statistical manipulation by TurkStat, identified Chinese NBS methodology changes from Mandarin-language primary sources, and evaluated Hindi-language legal documents for the Indian Citizenship Amendment Act.
Confidence calibration is disciplined. Confidence ratings stayed within expected ranges and respected structural caps across ninety-six of ninety-seven claims (the single exception being explained by source unavailability). Claims supported only by lower-tier sources received appropriately modest confidence ratings, even when the verdict was clear. The methodology did not confuse certainty about the verdict with certainty about the evidence.
Limitations and Caveats
A near-perfect score warrants scrutiny. Ninety-six passes and one partial out of ninety-seven claims is a strong result. The test suite expanded considerably in Phase 3 - genuinely contested ground truth, non-English sources, definitional disputes, predictive claims, and AI-generated content - and the methodology handled all of it. The single partial (GTS-033) was caused by source unavailability, not a methodology defect. This may reflect genuinely strong methodology, or it may reflect that the expected verdicts and detection criteria were set by the same people who built the methodology. External validation - where neither the claims nor the expected results are designed by the methodology's authors - would provide stronger evidence. We welcome any such validation, provided its expected verdicts are rigorously established.
Most adversarial claims are constructed, though some are wild-caught. The v2 suite improved on v1 by including four claims based on real-world disinformation patterns and by requiring multi-vector detection. However, even the wild-caught claims were adapted and formalized for testing rather than submitted verbatim as encountered in the wild. The methodology should eventually be tested against raw, unedited disinformation as it actually appears on social media, news sites, and political communications.
Validation was conducted by the methodology's own implementation. The fact-checks were performed by agents following the Veridi methodology files. This tests whether the methodology produces correct results when followed, but it does not test whether human volunteers - the intended users - can follow it correctly. Usability testing with real volunteers is a separate and necessary validation step.
~~The golden test set has gaps.~~ (Addressed in Phase 3.) GTS-B added two LACKS CONTEXT boundary cases and GTS-C added five standalone LACKS CONTEXT claims - all seven passed. GTS-B added one MOSTLY TRUE claim and GTS-C added four - all five passed. The verdict distribution across the full suite now covers all nine categories with multiple claims each.
~~Non-English claims are undertested.~~ (Addressed in Phase 3.) GTS-C includes four claims requiring non-English source evaluation in Japanese (Fukushima), Turkish (inflation data), Chinese (unemployment methodology), and Hindi (citizenship law). All four passed. GTS-B added five non-Western context claims across India, Argentina, Gaza/Turkey, and South Africa. Four passed outright; one scored partial due to source unavailability.
~~Institutional capture tests were absent in v1.~~ (Addressed in v2.2.) The v2 adversarial suite includes three claims testing institutional capture detection (ADV-015, ADV-016, ADV-023) and three claims testing data disappearance exploitation (ADV-013, ADV-014, ADV-019). The blocking claim ADV-015 specifically tested whether the methodology could override a historically Tier 1 source (CDC) using the Institutional Reliability Index. All six claims passed.
Conclusion
The Veridi methodology, after twelve rounds of remediation addressing ninety findings from a comprehensive audit, passes its validation with ninety-six passes and one partial across ninety-seven claims. The methodology correctly identifies verdicts across all eight subject domains, properly resolves all eighteen verdict boundary cases, detects all twenty-four adversarial gaming scenarios across eleven attack vectors, correctly applies the Institutional Reliability Index to override degraded sources (including non-US institutions), handles real-world disinformation patterns found in the wild, evaluates non-English sources in four languages, navigates genuinely contested ground truth with appropriate humility, and correctly handles predictive, breaking-event, AI-generated, and definitional-dispute claims.
The v2.2 additions - the Institutional Reliability Index, data disappearance exploitation detection, and expanded gaming countermeasures - represent the methodology's most important capability expansion. In a period when government data programs are being terminated and historically authoritative agencies are experiencing political interference, a fact-checking methodology that cannot account for institutional degradation is fundamentally incomplete. Veridi now accounts for it, and the validation confirms the mechanism works.
These results indicate that the methodology is ready for controlled deployment - meaning real-world use by trained volunteers with ongoing quality monitoring and periodic regression testing. They do not indicate that the methodology is finished. Usability testing should be conducted with volunteers, the calibration tracking system should accumulate enough data points to assess whether the methodology's confidence ratings match real-world outcomes over time, and the methodology should be tested against raw, unedited disinformation as it appears in the wild rather than formalized test claims.
The methodology's explicit treatment of disinformation as an adversarial problem - rather than a simple matter of true-or-false - is its most significant contribution. By naming the attack patterns, providing detection procedures, and maintaining a living index of institutional reliability, Veridi equips fact-checkers with tools for the landscape as it actually exists, not as we might wish it to be.
Appendix: Detailed Results
Full per-claim scorecards, evidence summaries, decision tree paths, and gaming countermeasure analyses are available in the validation results directory:
- validation-results/summary.md - Complete scorecard tables
- validation-results/smoke-test-3.md - Full-tier smoke test (GTS-026)
- validation-results/gts-batch-{1-6}.md - Golden test set results by batch
- validation-results/adv-batch-{1-3}.md - Adversarial v1 suite results by batch
- validation-results/adv-v2-wave-{1-3}.md - Adversarial v2 suite results by wave
- validation-results/adv-v2-scorecard.md - Adversarial v2 consolidated scorecard with flag counts
- validation-results/gts-b-wave-{1-5}.md - Golden test set B results by wave
- validation-results/gts-b-scorecard.md - Golden test set B consolidated scorecard
- validation-results/gts-c-wave-{1-4}.md - Golden test set C results by wave
- validation-results/gts-c-scorecard.md - Golden test set C consolidated scorecard
Validation conducted February 25, 2026. Methodology version: Veridi v2.2. Canonical methodology files located at Veridi/fact-checker-files/. Skill implementation at ~/.claude/skills/factcheck/SKILL.md.