🔬 Healthcare Fraud Detection: Full Session Report

One evening applying Karpathy's autoresearch pattern to 90M Medicare records
Date: March 8–9, 2026  |  Server: Hetzner 5.78.148.70  |  DB: 6GB DuckDB, 30 tables, 90M+ rows  |  Final AUC: 0.8098
📋 Session Summary

Starting from zero infrastructure, in one evening we built an end-to-end healthcare fraud detection system: data pipeline, iterative model (18 versions), ground truth analysis, fuzzy label expansion experiment, feature importance analysis, top suspect generation, and an LLM web validation pipeline. Final AUC: 0.8098 against OIG LEIE exclusion labels on 1.2M Medicare providers.

Table of Contents

  1. The Autoresearch Pattern
  2. Data Infrastructure
  3. 18 Model Iterations
  4. Key Discoveries
  5. Ground Truth Deep Dive: The 181 Label Question
  6. Fuzzy Matching Experiment
  7. Feature Importance Analysis
  8. Top Suspect List
  9. LLM Web Validation Pipeline
  10. Next Steps
  - 0.5561 → 0.81 AUC improvement
  - 18 model versions
  - 90M+ CMS records
  - 1.2M providers scored
  - 181 ground truth labels
  - 5 scripts built

1. The Autoresearch Pattern

Inspired by Andrej Karpathy's autoresearch project, the core loop is simple:

The Three-File Loop

The key insight from Karpathy: you need an automated, objective oracle. In his case it was validation loss on a language model. In ours it's the OIG LEIE — the federal database of providers excluded from Medicare for fraud, abuse, or license violations. Each eval.py run downloads the live LEIE, matches it to CMS billing data, and returns AUC-ROC. The agent iterates on detector.py freely.
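The loop can be sketched as follows. This is an illustrative minimal version, not the actual eval.py/detector.py code: `auc_roc` stands in for the oracle's metric, and `autoresearch_step` for one agent iteration.

```python
# Illustrative sketch of the autoresearch loop (not the actual repo code):
# a fixed oracle computes AUC-ROC against exclusion labels, and a detector
# change is kept only if it beats the best score so far.

def auc_roc(labels, scores):
    """AUC-ROC via the Mann-Whitney U statistic; ties get the average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1  # find the tie group [i, j)
        avg_rank = (i + j + 1) / 2  # average 1-based rank of the group
        rank_sum_pos += avg_rank * sum(pairs[k][1] for k in range(i, j))
        i = j
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def autoresearch_step(score_fn, providers, labels, best_auc):
    """One iteration: score all providers, keep the change only if AUC improves."""
    auc = auc_roc(labels, [score_fn(p) for p in providers])
    return auc, auc > best_auc
```

The point of keeping the oracle fixed is that the scoring side can be edited freely, by a human or an agent, while the metric stays trustworthy.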

2. Data Infrastructure

| Dataset | Rows | What It Contains | Fraud Signal |
|---|---|---|---|
| CMS Part B Physician Claims | 1.26M | Every Medicare physician's billing: services, beneficiaries, payments | Volume anomalies |
| CMS Part D Prescribing | 1.38M | Drug prescribing patterns by provider, including opioid rates | Opioid patterns |
| Open Payments (Sunshine Act) | 14.7M | Every pharmaceutical/device payment to a physician | Industry entanglement |
| NPPES NPI Registry | 7.1M | Provider demographics, taxonomy codes, address | Identity resolution |
| PECOS Enrollment | 2.54M | Medicare enrollment records (up to 75 rows per NPI) | Enrollment gaps |
| OIG LEIE Exclusion List | 82,714 | All providers excluded from Medicare — ground truth labels | Label source |

All stored in a single 6GB DuckDB database on a Hetzner server (32GB RAM, 8 vCPU). Python venv at /home/dataops/cms-data/.venv/. DuckDB opened read-only to protect data integrity.

3. Model Iterations β€” AUC Progression

Baseline 0.5561 → V2 0.7695 → V3 0.7433 ▼ → V4 0.7325 ▼ → V5 0.7483 → V6 0.7664 → V7 0.7904 → V8–V11 0.8013 → V12 0.8098 ✓ → V13–V15 0.8098
| Version | AUC | Key Change | Outcome |
|---|---|---|---|
| Baseline | 0.5561 | Raw billing z-scores, global normalization | Near random |
| V2 | 0.7695 | Z-scores within specialty peer group; LA opioid rate; PECOS gap × volume; speaking fees | +21 pts — massive jump |
| V3 | 0.7433 | Added HCPCS concentration HHI, upcoding ratio, taxonomy mismatch | −3 pts — added noise |
| V4 | 0.7325 | Percentile bucketing (90/95/99th pct) instead of z-scores | −4 pts — z-scores better |
| V7 | 0.7904 | Ensemble: 50% max(subscores) + 50% weighted mean — breakthrough insight | +2 pts |
| V8–V11 | 0.7920→0.8013 | Progressive shift toward higher max weight (60%→90%) | Confirmed max dominates |
| V12 | 0.8098 | Pure max(subscores) — no weighted mean at all | Best result |
| V13–V15 | 0.8098–0.8100 | Power transforms, top-2 mean — no improvement | Plateau reached |

4. Key Discoveries

🎯 Discovery #1: Fraud = Extreme on ANY Single Dimension

Taking max(subscores) across all features dramatically outperforms weighted averaging. Fraudsters don't need to be suspicious on every metric — being in the 99th percentile on one metric is sufficient signal. This drove the biggest AUC jump (V7→V12). The implication: the more diverse your subscore pool, the more chances you have to catch different fraud patterns.
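The two rules can be sketched like this (illustrative names and values, not the actual detector.py; subscores are assumed pre-normalized to [0, 1]):

```python
# Sketch of the two scoring rules (illustrative, not the actual detector.py).
# Subscores are assumed pre-normalized to [0, 1].

def score_max(subscores: dict) -> float:
    """V12 rule: the final score is the single most extreme subscore."""
    return max(subscores.values())

def score_weighted_mean(subscores: dict) -> float:
    """Earlier style: averaging dilutes one extreme value with benign ones."""
    return sum(subscores.values()) / len(subscores)

# A provider who is extreme on exactly one dimension:
provider = {"svc_per_bene": 0.99, "pay_per_bene": 0.30, "opioid_rate": 0.20}
```

Here `score_max` flags the provider at 0.99, while the mean lands near 0.50 and sinks them into the bulk of the distribution.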

📈 Discovery #2: Specialty Normalization Is Non-Negotiable

The single biggest jump (baseline→V2, +21 pts) came from computing z-scores within specialty peer groups, not globally. An oncologist billing 50 services/patient is normal. A family practitioner billing 50 services/patient is extreme. Without specialty normalization, legitimate high-billing specialties flood the top of the suspect list.
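A minimal sketch of this normalization (illustrative field names, not the actual detector.py):

```python
# Sketch of specialty-normalized z-scores (illustrative field names, not the
# actual detector.py): each provider is standardized against its own
# specialty's distribution rather than the global one.
from collections import defaultdict
from statistics import mean, pstdev

def specialty_z(providers, field="svc_per_bene"):
    """Z-score of `field` for each provider, within its specialty peer group."""
    by_spec = defaultdict(list)
    for p in providers:
        by_spec[p["specialty"]].append(p[field])
    # (mean, std) per specialty; guard against zero std in tiny groups
    stats = {s: (mean(v), pstdev(v) or 1.0) for s, v in by_spec.items()}
    return [(p[field] - stats[p["specialty"]][0]) / stats[p["specialty"]][1]
            for p in providers]
```

With this, 50 services/patient scores near zero inside an oncology peer group but stands far out inside a family practice peer group.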

⚠️ Discovery #3: More Features ≠ Better Performance

V3 added HCPCS concentration, upcoding ratio, and taxonomy mismatch — and AUC dropped 3 points. Adding weak features dilutes strong ones unless they're in the max pool as standalone subscores. The final model uses just 6 subscores. Adding 3 more hurt. This is a core lesson in fraud detection: signal quality > quantity.

🔬 Discovery #4: PECOS Signal Is Partially Post-Exclusion Leakage

PECOS enrollment gap has a remarkable 0.78 standalone AUC — but this is largely circular: when a provider is excluded from Medicare, their PECOS enrollment is terminated. The feature is detecting the effect of exclusion (PECOS revoked), not the cause (fraud). It floods the top suspect list with legitimate unenrolled providers and excluded-but-since-reinstated providers. The billing signal (svc/bene, pay/bene) is the genuinely predictive dimension.

5. Ground Truth Deep Dive: The 181 Label Question

Blake flagged an important concern: with only 181 matched labels out of 82,714 LEIE entries, how reliable is our AUC?

Why So Few Matches?

| LEIE Segment | Count | Status |
|---|---|---|
| Placeholder NPI (0000000000) | 74,241 | No NPI on file — pre-NPI-era exclusions or organizations |
| Real NPI, matched to CMS | 182 | ✅ Our ground truth labels |
| Real NPI, not in CMS | 8,291 | Excluded earlier — billing data scrubbed from CMS over time |
💡 The Critical Insight: Exclusion Year Distribution

Of our 181 matches, 93% were excluded in 2024–2026. Providers excluded in 2015 have had a decade of billing data removed from CMS public datasets. Our model is a leading indicator — it learns what providers looked like right before they got caught, not what historical fraud looked like.

Is 181 Enough?

Honestly — it's tight. With only 181 positive labels, AUC confidence intervals are wide, so small differences between model versions (a point or two) should not be over-interpreted.

6. Fuzzy Matching Experiment

We attempted to expand ground truth from 181 → 720+ labels by fuzzy-matching the 74,241 no-NPI LEIE entries against NPPES using last_name + state + specialty.
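The join can be sketched like this (illustrative fields and logic, not the actual fuzzy_match.py):

```python
# Sketch of the fuzzy-match join (illustrative, not the actual
# fuzzy_match.py): no-NPI LEIE rows are keyed on last name + state +
# specialty and linked to NPPES; keys that hit multiple NPIs are ambiguous.

def match_key(row):
    return (row["last_name"].strip().upper(),
            row["state"].strip().upper(),
            row["specialty"].strip().upper())

def fuzzy_match(leie_rows, nppes_rows):
    """Return (leie_row, npi) pairs for unambiguous (single-NPI) keys only."""
    index = {}
    for r in nppes_rows:
        index.setdefault(match_key(r), []).append(r["npi"])
    matches = []
    for r in leie_rows:
        npis = index.get(match_key(r), [])
        if len(npis) == 1:  # multiple hits = ambiguous, low confidence
            matches.append((r, npis[0]))
    return matches
```

As the results show, even a correct name-level match can still yield a bad label: the person matches, but the exclusion is decades stale.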

Results

| Match Confidence | Total Matched | In CMS | AUC Impact |
|---|---|---|---|
| HIGH (exact name + state + taxonomy + credentials) | 287 | 44 | 0.81 → 0.75 (AUC dropped) |
| MEDIUM (first-initial match) | 1,709 | ~495 | |
| LOW (ambiguous, first pick) | 1,454 | ~513 | Not tested |

Why Did AUC Drop?

Score Distribution Reveals the Problem

Investigation revealed: the fuzzy-matched providers were excluded in 1995–2010 — 20–30 years ago. Their current billing patterns are completely normal (they may have been reinstated, or the NPI was reassigned). Adding them as fraud labels teaches the model the wrong thing. The old LEIE entries without NPIs are not worth adding.

What would actually help: DOJ press releases (OIG publishes ~100/month with specific provider names), CMS Program Integrity dataset, state medical board disciplinary actions.

7. Feature Importance Analysis

| Subscore | Standalone AUC | Notes |
|---|---|---|
| PECOS enrollment gap | 0.7778 | ⚠️ Mostly post-exclusion leakage — floods list with wrong suspects |
| Drug cost/beneficiary | 0.5756 | Moderate — specialty-normalized drug spend |
| Services/beneficiary | 0.5678 | Moderate — the key billing anomaly signal |
| Payment/beneficiary | 0.5549 | Correlated with svc/bene |
| Open Payments (industry $) | 0.5379 | Weak individually |
| LA opioid rate | 0.4617 | Below random standalone — only adds value in the max pool |
| Combined max(subscores) | 0.8072 | The ensemble effect — each feature catches different fraudsters |

The ensemble effect is the whole game. No single signal is powerful, but they catch different fraudsters. That's exactly why max() works: it asks "is this provider extreme on anything?" rather than "is this provider above average on everything?"
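A toy illustration of that point, with made-up numbers:

```python
# Toy illustration of the ensemble effect (made-up numbers): two fraudsters
# are each extreme on a different subscore. Ranking by any single feature
# misses one of them; ranking by max(subscores) surfaces both.

providers = {
    "fraud_A": {"svc_per_bene": 0.99, "opioid_rate": 0.10},
    "fraud_B": {"svc_per_bene": 0.20, "opioid_rate": 0.97},
    "normal":  {"svc_per_bene": 0.40, "opioid_rate": 0.35},
}

# Rank by one feature vs. by the max over all features (highest first)
rank_by_svc = sorted(providers, key=lambda p: -providers[p]["svc_per_bene"])
rank_by_max = sorted(providers, key=lambda p: -max(providers[p].values()))

# Single-feature ranking: fraud_B falls below the normal provider.
# Max ranking: both fraudsters sit above the normal provider.
```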

8. Top Suspect List — Physicians Only

After removing PECOS-dominated results and filtering to individual physicians (removing ambulance services, labs, IDTFs), the top billing anomaly suspects:

| Rank | Provider | Specialty | Location | Key Metric | Why Flagged |
|---|---|---|---|---|---|
| 1 | Andrew Leavitt (NPI 1952343113) | Internal Medicine | San Francisco, CA | 10,333 svc/bene; 2.5M services, 246 patients; $4.76M paid | Extreme — implausible |
| 2 | Elisabeth Balken (NPI 1992489728) | Nurse Practitioner | Mesa, AZ | 991 svc/bene; 29,734 services, 30 patients; $23.3M paid | Extreme — NP billing $23M |
| 3 | William Price (NPI 1366423865) | Family Practice | Tulsa, OK | 2,357 svc/bene; 120K services, 51 patients; $29,870 paid | Very high volume, very low payment — data anomaly? |
| 4 | Brandon Hardesty (NPI 1952581027) | Internal Medicine | Indianapolis, IN | 5,634 svc/bene; 766K services, 136 patients; LA opioid: HIGH | Likely IU Health hematology |
| 5 | Frank Curvin (NPI 1457397382) | Family Practice | Johns Creek, GA | 191 svc/bene; 19,932 services, 104 patients; $11.8M paid | High for family practice |
| 6 | Joyce Ravain (NPI 1215980230) | Emergency Medicine | Ormond Beach, FL | 92.9 svc/bene; 4,181 services, 45 patients; $835K paid | EM docs don't have ongoing patients |

⚠️ Important Caveat

High scores ≠ fraud. Our earlier web research confirmed that similar patterns (Indianapolis cluster, oncology infusion centers) can be completely legitimate. These suspects require human review + web research before any conclusions.

9. LLM Web Validation Pipeline

Built validate_suspects.py — a batch pipeline that takes the physician suspect list and, for each provider:

  1. Builds 3–5 targeted search queries (OIG news, medical board, fraud indictment, settlement)
  2. Calls Brave Search API (3 searches × 3 results = up to 9 unique sources per provider)
  3. Feeds CMS data context + search results to Claude (claude-haiku) for structured analysis
  4. Returns structured JSON: verdict | fraud_confidence | fraud_indicators | legitimate_explanations | recommended_action
  5. Generates styled HTML report with color-coded verdicts
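Steps 1 and 4 of the pipeline above can be sketched as follows (illustrative helper names, not the actual validate_suspects.py; the Brave Search and Claude API calls themselves are omitted):

```python
# Sketch of steps 1 and 4 (illustrative helpers, not the actual
# validate_suspects.py). The network calls to Brave and Claude are omitted.
import json

VERDICTS = {"LIKELY_FRAUD", "POSSIBLE_FRAUD", "LEGITIMATE", "INSUFFICIENT_DATA"}

def build_queries(name, specialty, state):
    """Step 1: targeted queries covering OIG news, board actions, legal cases."""
    return [
        f'"{name}" {state} OIG exclusion',
        f'"{name}" {specialty} medical board discipline',
        f'"{name}" Medicare fraud indictment settlement',
    ]

def parse_verdict(raw):
    """Step 4: validate the structured JSON the model returns."""
    result = json.loads(raw)
    if result["verdict"] not in VERDICTS:
        raise ValueError(f"unknown verdict: {result['verdict']}")
    if not 0.0 <= result["fraud_confidence"] <= 1.0:
        raise ValueError("fraud_confidence must be in [0, 1]")
    return result
```

Validating the LLM's JSON against a closed verdict set is what makes the downstream HTML report safe to color-code mechanically.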

Verdict Categories

| Verdict | Meaning | Action |
|---|---|---|
| LIKELY_FRAUD | Web evidence of legal action, conviction, or OIG press release | Refer to OIG, request claims audit |
| POSSIBLE_FRAUD | Statistical anomaly + suspicious patterns but no confirmed legal action | Enhanced monitoring, deeper audit |
| LEGITIMATE | High-volume specialty center, academic medicine, specialty drug | Remove from watchlist |
| INSUFFICIENT_DATA | No web presence, new provider, generic name | Manual review needed |

To Activate

```bash
# On the Hetzner server:
export BRAVE_API_KEY=...       # free tier at https://brave.com/search/api/
export ANTHROPIC_API_KEY=...

# Run the top 25 suspects:
cd /home/dataops/fraud-detector
python3 validate_suspects.py --limit 25

# Output:
# validation/validation_report_2026-03-09.html  (HTML report)
# validation/validation_results.json            (structured data)
```

10. Next Steps — Prioritized

| Priority | Task | Expected Impact |
|---|---|---|
| P0 | Set API keys on server, run validation pipeline on top 25 suspects | First real-world validated fraud findings |
| P0 | Investigate Andrew Leavitt (10,333 svc/bene) and Elisabeth Balken ($23M on 30 patients) | May be our clearest fraud signals |
| P1 | Expand ground truth via DOJ press releases — OIG publishes ~100/month with provider names | Could grow labels from 181 → 1,000+ |
| P1 | Build frontend explorer (FastAPI + React) for interactive provider investigation | Makes the validation workflow usable by others |
| P2 | Add network analysis — referral rings where A→B→C→kickback→A | New fraud pattern not currently detected |
| P2 | Write 3-part blog post series for personal website | Content marketing / credibility for venture |
| P3 | Train proper ML model (gradient boosting) using hand-crafted features as inputs | May push AUC past 0.85 with proper cross-validation |

Repository & Files

| File | Purpose | Location |
|---|---|---|
| eval.py | Fixed oracle — do not modify | Server + GitHub |
| detector.py | Fraud scoring logic (V15 final) | Server + GitHub |
| feature_importance.py | Subscore AUC + top suspect generation | Server + GitHub |
| fuzzy_match.py | LEIE name matching against NPPES | Server + GitHub |
| validate_suspects.py | LLM web validation pipeline | Server + GitHub |
| physician_suspects.csv | Top 100 physician billing suspects | Server + GitHub |
| results.tsv | Full AUC experiment log (18 runs) | Server + GitHub |
| reports/2026-03-09_fraud-detection-comprehensive.html | Public-facing summary report | Mobile URL |

Session: March 8–9, 2026 (~6 hours)  |  GitHub: blakethom8/cms-fraud-detection  |  Server: Hetzner 5.78.148.70  |  Data: All CMS data publicly available at data.cms.gov