πŸ”¬ Healthcare Fraud Detection via Autoresearch

How Karpathy's iterative AI research loop finds Medicare fraud across 90M records β€” a technical deep dive
Prepared: March 13, 2026  |  Dataset: CMS Public Data, 6GB DuckDB, 1.2M providers  |  Final AUC-ROC: 0.8098  |  Iterations: 18  |  Time: One evening
πŸ“‹ What This Report Covers
This brief explains the autoresearch methodology (Karpathy's pattern) and its architecture β€” specifically what the AI agent does vs. what the evaluation framework handles β€” using our Medicare fraud detection project as a concrete end-to-end case study. Audience: data science and engineering practitioners.

Table of Contents

  1. The Autoresearch Pattern: Karpathy's Core Insight
  2. Architecture: What the Agent Does vs. What the Framework Does
  3. Why This Runs Fast: The Speed Mechanics
  3b. The Agent's Instruction File: CLAUDE.md
  4. Case Study: CMS Medicare Fraud Detection
  5. 18 Iterations, One Evening β€” The AUC Progression
  6. What the Agent Figured Out: Key Technical Discoveries
  7. Results: Top Fraud Suspects Identified
  8. The Full Pipeline: From Raw CMS Data to Named Suspects
AUC-ROC improvement: 0.56 β†’ 0.81
Model versions: 18
Time: ~6 hrs (one evening, start to finish)
CMS records evaluated: 90M+
Medicare providers scored: 1.2M
Ground-truth fraud labels (OIG LEIE): 181

1. The Autoresearch Pattern: Karpathy's Core Insight

Andrej Karpathy's autoresearch project makes a deceptively simple observation: most machine learning research is a loop. A researcher proposes a change, measures the result against an objective metric, decides whether it helped, and iterates. The question he asked was: what if you automated that loop?

"You need a fixed, automated evaluation function β€” an oracle that can't be gamed and doesn't change. Then you let an AI agent edit the model freely. The only rule: the oracle doesn't lie."

In Karpathy's original implementation, the oracle was validation loss on a character-level language model. In our implementation, the oracle is AUC-ROC against the OIG LEIE β€” the federal database of providers excluded from Medicare for fraud, abuse, or professional misconduct. Different domain, identical architecture.

The Three-File Structure

eval.py (Fixed Oracle: never edited) β†’ detector.py (Agent's Lab: the only file it edits) β†’ results.tsv (Experiment Log: auto-appended)
Loop back: the agent reads results.tsv, proposes the next hypothesis, edits detector.py, and eval.py runs again.

2. Architecture: What the Agent Does vs. What the Framework Does

The cleanest way to understand this system is through a strict division of responsibility. Many people conflate "the AI" with "the whole system." They're different components doing fundamentally different jobs.

Dimension | πŸ€– The AI Agent | 🧱 Karpathy's Framework
Primary role | Hypothesis generation and implementation β€” decides what to try and writes the code | Evaluation and truth β€” measures how well it worked, objectively, every time
Files owned | detector.py only β€” the scoring logic | eval.py (frozen), results.tsv (append-only)
What it reads | The full experiment history in results.tsv; its own previous code; AUC movements | Live OIG LEIE exclusion list (downloaded fresh on every run); CMS provider database
Reasoning style | Scientific: "V3 added HHI concentration and AUC dropped 3 points β€” probably added noise. Try removing it and isolating the upcoding signal." Proposes the next experiment based on the prior result. | Mechanical and deterministic: pulls NPIs, runs the detector, computes AUC-ROC and Average Precision, writes to TSV. No judgment. Same logic every run.
Knowledge used | Domain knowledge about healthcare fraud patterns, statistical signal quality, feature engineering intuition | None β€” it doesn't need to understand fraud; it just measures a labeled dataset against a scoring function
Speed | ~2 min per iteration: reads history, proposes change, rewrites detector.py | ~3–4 min per iteration: loads 1.2M NPIs, runs detector, computes AUC, appends result
Can it fail? | Yes β€” and it does. V3 and V4 both dropped AUC. That's fine: the oracle catches it and the agent course-corrects. | No (unless infrastructure fails). Deterministic, reproducible; every run is comparable to every other run.
πŸ’‘ The Elegant Constraint
The agent is free to be creative, wrong, and experimental β€” because the oracle is rigid. Neither can do the other's job: the oracle has no creativity; the agent has no ground truth. Together they form a complete research loop.

What the Agent Actually Does Per Iteration

Each iteration follows this reasoning pattern:

# Agent's mental loop (simplified):

1. Read results.tsv β†’ understand trend
   "V7 β†’ V12: gradually shifting from 50% max to 100% max, each step improved.
    V13: power transform didn't help. V14: top-2 mean didn't help.
    Conclusion: pure max is the ceiling with current features."

2. Form hypothesis
   "The marginal gains are below statistical significance at n=181 labels.
    Highest ROI next move is expanding ground truth, not tuning existing features."

3. Implement
   β†’ Edit detector.py to test hypothesis (or note that we've reached plateau)

4. Run eval.py
   β†’ Automatic. Oracle doesn't care about the hypothesis. It just measures.

5. Back to 1.

3. Why This Runs Fast: The Speed Mechanics

Traditional research on a problem like Medicare fraud detection would look like this: a data scientist spends a week building a feature pipeline, runs the model, waits a day for results, writes it up, proposes new features, repeats. A rigorous study might run 5–10 experiments over several weeks.

The autoresearch loop collapses that timeline because the bottleneck β€” the human researcher deciding what to try next β€” is replaced by something that doesn't sleep, doesn't need to write documentation between experiments, and doesn't get demoralized when AUC drops.


The key design decisions that enable this speed:

  1. No model training. detector.py is a scoring function, not a trained model. There's no gradient descent, no epochs, no train/val split to manage. It reads data, applies logic, returns scores. Evaluation is fast because it's just a comparison.
  2. DuckDB for instant queries. All 90M CMS records live in a single 6GB DuckDB file on a server with 32GB RAM. Analytical queries that would take minutes on a traditional RDBMS return in seconds.
  3. Clean stdio interface. eval.py passes NPIs via stdin; detector.py writes scores to stdout. No database writes, no file locking, no shared state. The interface is a CSV pipe.
  4. Immutable oracle. Because eval.py never changes, AUC results are perfectly comparable across all 18 versions. The agent never has to worry about "did the goalposts move?"
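The stdin/stdout contract from point 3 can be sketched in a few lines of Python. This is an illustrative skeleton with a placeholder neutral score, not the project's actual detector.py (which computes the subscores described later in the report):

```python
import sys
import pandas as pd

def score_providers(npis: pd.Series) -> pd.DataFrame:
    """Placeholder scorer: the real detector computes subscores from DuckDB here."""
    out = pd.DataFrame({"npi": npis, "score": 0.5})  # neutral score for every provider
    return out.drop_duplicates("npi")                # dedupe on NPI, per the CLAUDE.md gotchas

def run(stdin=sys.stdin, stdout=sys.stdout) -> None:
    # eval.py pipes a one-column NPI CSV on stdin; the detector answers npi,score on stdout
    npis = pd.read_csv(stdin)["npi"]
    score_providers(npis).to_csv(stdout, index=False)
```

Because the interface is just a pipe, any language or library can sit behind it; the oracle only ever sees the npi,score CSV.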

3b. The Agent's Instruction File: CLAUDE.md

The agent isn't dropped into the repository with no context. It reads a single markdown file β€” CLAUDE.md β€” at the start of every session. This file is the entire contract between the human operator and the AI: what it's trying to accomplish, what the rules are, what data is available, and what it has already learned from prior experiments.

Think of it as a standing brief. The operator writes it once and updates it as the project evolves. The agent re-reads it every session and inherits the accumulated institutional knowledge without needing to rediscover it.

What CLAUDE.md Contains (and Why Each Section Matters)

Here is the actual CLAUDE.md used in this project:

# CLAUDE.md β€” Agent Instructions

You are improving `detector.py` to maximize AUC-ROC against the OIG LEIE ground truth.

## The Rules

1. **DO NOT modify `eval.py`** β€” it is the fixed oracle
2. `detector.py` reads NPI CSV from stdin, outputs `npi,score` CSV to stdout
3. Run `python eval.py --dry-run` to test without logging
4. Run `python eval.py --description "your description"` to log a real result
5. Check `results.tsv` to see all previous attempts

## Current Best

See `results.tsv` β€” aim to beat the highest AUC-ROC in that file.

## Data Available

DuckDB at `/home/dataops/cms-data/data/provider_searcher.duckdb`:
- `raw_physician_by_provider`         β€” Part B billing (NPI, specialty, services, benes, payment)
- `raw_part_d_by_provider`            β€” Part D prescribing (NPI, drug cost, benes, opioid LA rate)
- `raw_physician_by_provider_and_service` β€” HCPCS-level billing (9.6M rows)
- `raw_open_payments_general`         β€” Industry payments (14.7M rows)
- `raw_nppes`                         β€” Provider registry (7.1M, NPI, taxonomy, address)
- `raw_pecos_enrollment`              β€” Medicare enrollment (2.54M, NPI β€” has duplicate NPIs)
- `core_providers`                    β€” Cleaned provider table (1.2M, npi, type, state, zip5)

## What Has Worked

- **Services-per-bene z-score within specialty** β€” the single strongest signal
- **LA opioid rate z-score within specialty** β€” catches pill mills
- **`max(subscores)`** β€” taking the maximum across all feature subscores beats weighted averaging
- High-volume specialties (oncology, hematology) need a dampening factor (0.4–0.5Γ—)

## What Hasn't Worked

- HCPCS concentration HHI β€” adds noise, hurts AUC
- Taxonomy mismatch β€” too many false positives
- Raw percentile buckets β€” z-scores are better
- Adding too many features β€” each new weak feature dilutes the strong ones

## Key Gotchas

- `raw_pecos_enrollment` has up to 75 rows per NPI β€” always `SELECT DISTINCT NPI`
- HCPCS table has 9.6M rows β€” compute HHI in SQL, not pandas apply()
- Sigmoid scale=2.0 works better than 1.5 for z-score transformation
- Always fill NaN with 0.5 (neutral) before outputting scores
- Deduplicate output on NPI before returning
πŸ“Œ This File Is the Compounding Advantage
Notice the What Has Worked and What Hasn't Worked sections. These don't come pre-written β€” they're updated by the operator after each session based on what the oracle revealed. By session two, the agent skips the experiments that failed in session one. By session three, it builds on two layers of accumulated insight. This is how a project with 181 ground truth labels and a relatively small feature set still produces a 0.81 AUC in a single evening β€” the agent isn't starting from zero every time.

The original autoresearch project and methodology from Andrej Karpathy is open source: github.com/karpathy/autoresearch. Our implementation adapts the pattern for a supervised anomaly detection problem on public healthcare data rather than language model training.


4. Case Study: CMS Medicare Fraud Detection

The Problem

Medicare fraud costs the U.S. an estimated $60–100B annually. The OIG (Office of Inspector General) maintains the LEIE β€” a list of ~82,000 providers excluded from Medicare for fraud, abuse, or professional misconduct. The question: can we identify high-risk providers before they're caught, using only their publicly available billing patterns?

Data Infrastructure

Dataset | Scale | Signal Used
CMS Part B Physician Claims | 1.26M providers | Services per beneficiary, payment per beneficiary β€” billing anomalies
CMS Part D Prescribing | 1.38M providers | Drug cost per beneficiary, long-acting opioid prescribing rate
Open Payments (Sunshine Act) | 14.7M payments | Total pharma/device industry payments received β€” entanglement signal
NPPES NPI Registry | 7.1M providers | Specialty taxonomy, demographics, entity resolution
PECOS Enrollment | 2.54M records | Active Medicare enrollment status
OIG LEIE Exclusion List | 82,714 excluded | Ground truth labels β€” 181 matched to active CMS billing data
⚠️ Why Only 181 Ground Truth Labels?
Of 82,714 LEIE entries, ~74,000 pre-date the NPI system (no digital identifier). Of the rest, ~8,000 have real NPIs but have been scrubbed from CMS billing data after exclusion. The 181 that remain are almost exclusively 2024–2026 exclusions β€” providers who were billing normally, then got caught. This means the model learns what fraudulent billing patterns look like right before detection. That's a leading indicator, not a lagging one.

The Ground Truth Oracle (eval.py)

On every evaluation run, eval.py does the following β€” automatically, without human involvement:

  1. Downloads the live LEIE from oig.hhs.gov (24-hour cache)
  2. Loads 1.2M provider NPIs from DuckDB
  3. Joins LEIE to CMS β€” builds binary labels: 1 = excluded, 0 = not excluded
  4. Passes all NPIs to detector.py via stdin
  5. Receives back an npi, score CSV
  6. Computes AUC-ROC and Average Precision against the labels
  7. Appends timestamp, commit hash, metrics, and description to results.tsv

The entire eval cycle takes 3–4 minutes on a Hetzner server (32GB RAM, 8 vCPU). The agent never touches this file.
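The oracle's measurement core (steps 3–6) reduces to a few lines; a simplified sketch assuming scikit-learn, with the LEIE download, DuckDB load, and TSV logging of the real eval.py omitted:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(provider_npis: pd.Series, excluded_npis: set, scores: pd.Series) -> dict:
    """Label each NPI (1 = on the LEIE exclusion list) and score the detector's ranking."""
    labels = provider_npis.isin(excluded_npis).astype(int)
    return {
        "auc_roc": roc_auc_score(labels, scores),
        "avg_precision": average_precision_score(labels, scores),
    }

# Toy illustration: four providers, one excluded, ranked highest by the detector
npis = pd.Series([101, 102, 103, 104])
metrics = evaluate(npis, excluded_npis={104}, scores=pd.Series([0.2, 0.5, 0.3, 0.9]))
# a perfect ranking yields an AUC-ROC of 1.0
```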

The Scoring Model (detector.py)

The final detector computes six independent subscores β€” each capturing a different behavioral dimension of potential fraud β€” and returns the maximum:

# Core scoring logic (V15 final β€” simplified):

subscores = [
    sub_spb,    # Services/beneficiary z-score (within specialty peer group)
    sub_ppb,    # Payment/beneficiary z-score (within specialty peer group)
    sub_la,     # Long-acting opioid prescribing rate z-score
    sub_cpb,    # Drug cost/beneficiary z-score
    sub_pay,    # Total industry payments (log-normalized)
    sub_pecos,  # Active billing but no PECOS enrollment (binary signal)
]

# The key insight: fraud = extreme on ANY single dimension
final_score = max(subscores) ** 1.2   # slight amplification of extremes

The specialty normalization is critical: every z-score is computed within peer group, not globally. An oncologist billing $500K/year per patient is normal. A family practitioner billing $500K/year per patient is a significant anomaly. Without this, the top of the suspect list is flooded with legitimate high-billing specialists.
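A minimal sketch of that peer-group normalization, assuming pandas and hypothetical column names (specialty, services_per_bene); the sigmoid scale of 2.0 follows the note in CLAUDE.md:

```python
import numpy as np
import pandas as pd

def specialty_zscore(df: pd.DataFrame, col: str) -> pd.Series:
    """Z-score a metric within each specialty peer group, not globally."""
    grouped = df.groupby("specialty")[col]
    z = (df[col] - grouped.transform("mean")) / grouped.transform("std")
    return z.fillna(0.0)  # single-member specialties get a neutral z of 0

def to_subscore(z: pd.Series, scale: float = 2.0) -> pd.Series:
    """Squash z-scores into (0, 1); scale=2.0 per CLAUDE.md (z=0 maps to a neutral 0.5)."""
    return 1.0 / (1.0 + np.exp(-z / scale))

df = pd.DataFrame({
    "specialty": ["oncology", "oncology", "family", "family", "family"],
    "services_per_bene": [48.0, 52.0, 4.0, 5.0, 50.0],
})
df["sub_spb"] = to_subscore(specialty_zscore(df, "services_per_bene"))
# 50 services/bene is unremarkable within the oncology peer group but extreme for
# family practice, so the family-practice outlier gets the highest subscore
```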


5. 18 Iterations, One Evening β€” The AUC Progression

Baseline 0.5561 β†’ V2 0.7695 (+21) β†’ V3 0.7433 (βˆ’3) β†’ V4 0.7325 (βˆ’1) β†’ V5 0.7483 (+2) β†’ V6 0.7664 (+2) β†’ V7 0.7904 (+2) β†’ V8–V11 0.8013 β†’ V12 0.8098 (peak) β†’ V13–V15 ~0.81 (plateau)
Version | AUC | What the Agent Tried | Why / Outcome
Baseline | 0.5561 | Raw billing z-scores, global normalization | Near-random β€” no peer group context
V2 | 0.7695 | Z-scores within specialty; LA opioid rate; PECOS gap Γ— volume signal; industry payments | +21 points β€” specialty normalization is the dominant lever
V3 | 0.7433 | Added HCPCS concentration HHI, upcoding ratio, taxonomy mismatch | βˆ’3 points β€” weak features dilute strong ones in a weighted sum
V4 | 0.7325 | Percentile bucketing (90/95/99th pct) instead of z-scores | βˆ’1 point from V3 (βˆ’4 from the V2 peak) β€” percentiles lose variance within the extreme tail
V5–V6 | 0.7483 β†’ 0.7664 | Reverted noisy features; recalibrated opioid signal | +4 points β€” cleanup gains
V7 | 0.7904 | Ensemble: 50% max(subscores) + 50% weighted mean | +2 points β€” first test of the max() hypothesis; confirms it helps
V8–V11 | β†’ 0.8013 | Progressive shift: 60% β†’ 70% β†’ 80% β†’ 90% max weight | Monotonic improvement β€” each step confirms max dominates
V12 | 0.8098 | Pure max(subscores) β€” eliminated the weighted mean entirely | Best result β€” hypothesis confirmed: fraud = extreme on ONE dimension
V13–V15 | ~0.81 | Power transforms, top-2 mean aggregation, subscore rescaling | Plateau β€” agent correctly identifies diminishing returns

The agent ran each of these 18 experiments, read the outcome, proposed the next hypothesis, and rewrote the code β€” with no human in the loop between iterations. The operator (human) reviewed the trajectory occasionally and provided domain guidance when useful, but did not run any experiments manually.


6. What the Agent Figured Out: Key Technical Discoveries

🎯 Discovery #1 β€” Fraud = Extreme on ANY Single Dimension (not average on all)

The most important insight of the entire project. A weighted average rewards providers who are mildly suspicious on many metrics. But real fraud tends to be extreme on one metric: 10,000 services per beneficiary, or $23M billed on 30 patients, or 100% long-acting opioid rate. max(subscores) asks "is this provider an outlier on anything?" β€” and outperforms weighted averaging by ~3 AUC points. This drove the V7β†’V12 improvement arc.
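The difference is easy to see numerically; a toy comparison with hypothetical subscore vectors:

```python
# Two providers: one mildly elevated everywhere, one extreme on a single dimension
mildly_odd = [0.6, 0.6, 0.6, 0.6, 0.6, 0.6]    # borderline on every metric
pill_mill  = [0.5, 0.5, 0.99, 0.5, 0.5, 0.5]   # extreme LA-opioid subscore only

def weighted_mean(subscores):
    return sum(subscores) / len(subscores)

def max_rule(subscores):
    return max(subscores) ** 1.2  # V12's aggregation, as described above

# A weighted mean ranks the mildly-odd provider above the pill mill...
assert weighted_mean(mildly_odd) > weighted_mean(pill_mill)
# ...while the max rule surfaces the single-dimension extreme
assert max_rule(pill_mill) > max_rule(mildly_odd)
```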

πŸ“Š Discovery #2 β€” Specialty Normalization Is Non-Negotiable

The single largest AUC jump in the project (+21 points, Baseline→V2) came from one change: computing billing z-scores within specialty peer groups rather than across all providers globally. An oncologist administering 50 infusions per patient is completely normal. A family practitioner with 50 services per patient is extreme. Without peer normalization, every oncologist and hematologist floods the top of the suspect list.

⚠️ Discovery #3 β€” Feature Dilution: More Features = Worse Performance

V3 added HCPCS service concentration (HHI), upcoding ratio, and taxonomy mismatch. AUC dropped 3 points. This is a classic feature dilution problem: when weak signals are included in a weighted sum, they water down the strong signals. The fix (which the agent discovered by V7) was to put each signal in its own subscore and take the maximum β€” which automatically isolates the strongest signal for each provider rather than averaging noise into it.

πŸ”¬ Discovery #4 β€” PECOS Signal Is Partially Post-Exclusion Leakage

PECOS enrollment gap has a standalone AUC of 0.78 β€” the strongest individual signal. But this is largely circular: when a provider is excluded from Medicare, their PECOS enrollment is terminated. The feature is detecting the consequence of exclusion, not the cause of fraud. In a pure max() model this is acceptable (it still catches confirmed fraudsters), but it would flood a weighted model with legitimate non-enrolled providers. The agent flagged this when it noticed the PECOS-top suspect list didn't overlap with the billing-anomaly top list.

Feature Importance β€” Standalone AUC by Subscore

Subscore | Standalone AUC | Combined Role | Note
PECOS enrollment gap | 0.7778 | Strong but circular | ⚠️ Post-exclusion leakage risk
Drug cost/beneficiary | 0.5756 | Distinct fraud type (pharma) | Catches different fraudsters than billing signals
Services/beneficiary | 0.5678 | Primary billing anomaly signal | Best non-circular predictor
Payment/beneficiary | 0.5549 | Correlated with svc/bene | Redundant but kept for the max pool
Open Payments (industry $) | 0.5379 | Weak standalone | Useful in the max pool: catches the pharma fraud pattern
LA opioid prescribing rate | 0.4617 | Below random standalone | Only catches opioid-specific fraud; adds value in the max pool
Combined max(subscores) | 0.8072 | Ensemble captures all fraud types | Each subscore catches different providers
πŸ’‘ The Ensemble Effect β€” Why the Max Model Outperforms Every Individual Signal
No single subscore breaks 0.58 (excluding the leaky PECOS signal). But the combined max model reaches 0.81. This is the ensemble effect in action: each subscore catches a different pattern of fraud. Billing anomalies, drug cost anomalies, opioid patterns, and industry entanglement don't co-occur in the same providers. Taking the maximum lets the model specialize per-provider without forcing it to average across irrelevant dimensions.

7. Results: Top Fraud Suspects Identified

After running the final V15 detector across 1.2M Medicare providers and filtering to individual physicians (removing ambulance services, labs, and institutional billers), the highest-scoring billing anomalies:

Rank | Provider | Specialty | Key Metric | Why Flagged
1 | Andrew Leavitt (NPI 1952343113) | Internal Medicine, SF CA | 10,333 svc/bene; 2.5M services, 246 patients; $4.76M Medicare | Statistically implausible
2 | Elisabeth Balken (NPI 1992489728) | Nurse Practitioner, Mesa AZ | $23.3M Medicare billed; 30 patients, 29,734 services | NP billing $23M on 30 patients
3 | Frank Curvin (NPI 1457397382) | Family Practice, Johns Creek GA | $11.8M Medicare billed; 104 patients, 191 svc/bene | Unusual volume for FP
4 | Joyce Ravain (NPI 1215980230) | Emergency Medicine, Ormond Beach FL | 92.9 svc/bene; 45 patients, $835K | EM physicians don't have ongoing patient relationships
⚠️ Scores Are Alerts, Not Verdicts
High scores indicate statistical outliers that warrant investigation β€” not confirmed fraud. The LLM web validation pipeline (next section) adds a second stage: Brave Search for news + OIG press releases, then Claude synthesizes a verdict (LIKELY_FRAUD / POSSIBLE_FRAUD / LEGITIMATE / INSUFFICIENT_DATA).

8. The Full Pipeline: From Raw CMS Data to Named Suspects

Zooming out, the complete system is a three-stage pipeline. Autoresearch handles Stage 2. Stages 1 and 3 are adjacent infrastructure built to make the results actionable.

Stage | Name | What Happens | Output
Stage 1 | Data Infrastructure | 5 CMS public datasets ingested into DuckDB. NPPES entity resolution. Specialty taxonomy mapping. Provider core table joining all sources. | 6GB DuckDB, 90M+ rows, 30 tables, 1.2M provider-level aggregate records
Stage 2 | Autoresearch Loop | Agent iterates on detector.py. Oracle (eval.py) evaluates against OIG LEIE ground truth. 18 iterations over one evening. Best result: AUC 0.8098. | Tuned scoring function; physician_suspects.csv (top 100 billing anomalies)
Stage 3 | LLM Web Validation | For each suspect: Brave Search (OIG news, medical board, fraud indictment). Claude synthesizes web evidence + CMS data β†’ structured verdict. | HTML report with color-coded verdicts per provider; JSON structured output for downstream use
The Speed Advantage β€” End to End

The bottleneck is now human judgment on validated suspects β€” exactly where it should be. The automated pipeline handles the 99% of providers who clearly aren't anomalous, and surfaces the ~1% that warrant a real person's attention.

βœ… What This Demonstrates
The autoresearch pattern isn't magic β€” it's a disciplined separation of concerns: the AI generates hypotheses and writes code; the oracle measures results honestly; the loop runs until it plateaus. What makes it powerful is the speed multiplier. In domains with a clean, automated evaluation function (validation loss, AUC-ROC, F1 score, backtest return), you can run in hours what used to take weeks. The AI doesn't need to be smarter than a human researcher. It just needs to iterate faster, stay disciplined about measuring, and know when it's hit a ceiling.

Project: CMS Healthcare Fraud Detection  |  Session date: March 8–9, 2026 (~6 hours)  |  Server: Hetzner 5.78.148.70 (32GB RAM, 8 vCPU)  |  GitHub: blakethom8/cms-fraud-detection  |  Data: All CMS data publicly available at data.cms.gov