πŸ”¬ Healthcare Fraud Detection via Autoresearch

How Karpathy's iterative AI research loop finds Medicare fraud across 90M records β€” a technical deep dive
Prepared: March 13, 2026  |  Dataset: CMS Public Data, 6GB DuckDB, 1.2M providers  |  Final AUC-ROC: 0.8098  |  Iterations: 18  |  Time: One evening
πŸ“‹ What This Report Covers
This brief explains the autoresearch methodology (Karpathy's pattern) and its architecture β€” specifically what the AI agent does vs. what the evaluation framework handles β€” using our Medicare fraud detection project as a concrete end-to-end case study. Audience: data science and engineering practitioners.

Table of Contents

  1. The Autoresearch Pattern: Karpathy's Core Insight
  2. Architecture: What the Agent Does vs. What the Framework Does
  3. Why This Runs Fast: The Speed Mechanics
  3b. The Agent's Instruction File: CLAUDE.md
  4. Case Study: CMS Medicare Fraud Detection
  5. 18 Iterations, One Evening β€” The AUC Progression
  6. What the Agent Figured Out: Key Technical Discoveries
  7. Results: Top Fraud Suspects Identified
  8. The Full Pipeline: From Raw CMS Data to Named Suspects
AUC-ROC improvement: 0.56 β†’ 0.81
Model versions: 18
Time: ~6 hrs (one evening, start to finish)
CMS records evaluated: 90M+
Medicare providers scored: 1.2M
Ground-truth fraud labels (OIG LEIE): 181

1. The Autoresearch Pattern: Karpathy's Core Insight

Andrej Karpathy's autoresearch project makes a deceptively simple observation: most machine learning research is a loop. A researcher proposes a change, measures the result against an objective metric, decides whether it helped, and iterates. The question he asked was: what if you automated that loop?

"You need a fixed, automated evaluation function β€” an oracle that can't be gamed and doesn't change. Then you let an AI agent edit the model freely. The only rule: the oracle doesn't lie."

In Karpathy's original implementation, the oracle was validation loss on a character-level language model. In our implementation, the oracle is AUC-ROC against the OIG LEIE β€” the federal database of providers excluded from Medicare for fraud, abuse, or professional misconduct. Different domain, identical architecture.

The Three-File Structure

eval.py (Fixed Oracle: never edited) β†’ detector.py (Agent's Lab: the only file it edits) β†’ results.tsv (Experiment Log: auto-appended)
Loop back: the agent reads results.tsv, proposes the next hypothesis, edits detector.py, and eval.py runs again.

2. Architecture: What the Agent Does vs. What the Framework Does

The cleanest way to understand this system is through a strict division of responsibility. Many people conflate "the AI" with "the whole system." They're different components doing fundamentally different jobs.

Dimension | πŸ€– The AI Agent | 🧱 Karpathy's Framework
Primary role | Hypothesis generation and implementation β€” decides what to try and writes the code | Evaluation and truth β€” measures how well it worked, objectively, every time
Files owned | detector.py only β€” the scoring logic | eval.py (frozen), results.tsv (append-only)
What it reads | The full experiment history in results.tsv; its own previous code; AUC movements | Live OIG LEIE exclusion list (downloaded fresh on every run); CMS provider database
Reasoning style | Scientific: "V3 added HHI concentration and AUC dropped 3 points β€” probably added noise. Try removing it and isolating the upcoding signal." Proposes the next experiment based on the prior result. | Mechanical and deterministic: pulls NPIs, runs the detector, computes AUC-ROC and Average Precision, writes to TSV. No judgment. Same logic every run.
Knowledge used | Domain knowledge about healthcare fraud patterns, statistical signal quality, feature engineering intuition | None β€” it doesn't need to understand fraud; it just measures a labeled dataset against a scoring function
Speed | ~2 min per iteration: reads history, proposes change, rewrites detector.py | ~3–4 min per iteration: loads 1.2M NPIs, runs detector, computes AUC, appends result
Can it fail? | Yes β€” and it does. V3 and V4 both dropped AUC. That's fine: the oracle catches it and the agent course-corrects. | No (unless infrastructure fails). Deterministic, reproducible; every run is comparable to every other run.
πŸ’‘ The Elegant Constraint
The agent is free to be creative, wrong, and experimental β€” because the oracle is rigid. Neither can do the other's job: the oracle has no creativity; the agent has no ground truth. Together they form a complete research loop.

What the Agent Actually Does Per Iteration

Each iteration follows this reasoning pattern:

# Agent's mental loop (simplified):

1. Read results.tsv β†’ understand trend
   "V7 β†’ V12: gradually shifting from 50% max to 100% max, each step improved.
    V13: power transform didn't help. V14: top-2 mean didn't help.
    Conclusion: pure max is the ceiling with current features."

2. Form hypothesis
   "The marginal gains are below statistical significance at n=181 labels.
    Highest ROI next move is expanding ground truth, not tuning existing features."

3. Implement
   β†’ Edit detector.py to test hypothesis (or note that we've reached plateau)

4. Run eval.py
   β†’ Automatic. Oracle doesn't care about the hypothesis. It just measures.

5. Back to 1.

3. Why This Runs Fast: The Speed Mechanics

Traditional research on a problem like Medicare fraud detection would look like this: a data scientist spends a week building a feature pipeline, runs the model, waits a day for results, writes it up, proposes new features, repeats. A rigorous study might run 5–10 experiments over several weeks.

The autoresearch loop collapses that timeline because the bottleneck β€” the human researcher deciding what to try next β€” is replaced by something that doesn't sleep, doesn't need to write documentation between experiments, and doesn't get demoralized when AUC drops.


The key design decisions that enable this speed:

  1. No model training. detector.py is a scoring function, not a trained model. There's no gradient descent, no epochs, no train/val split to manage. It reads data, applies logic, returns scores. Evaluation is fast because it's just a comparison.
  2. DuckDB for instant queries. All 90M CMS records live in a single 6GB DuckDB file on a server with 32GB RAM. Analytical queries that would take minutes on a traditional RDBMS return in seconds.
  3. Clean stdio interface. eval.py passes NPIs via stdin; detector.py writes scores to stdout. No database writes, no file locking, no shared state. The interface is a CSV pipe.
  4. Immutable oracle. Because eval.py never changes, AUC results are perfectly comparable across all 18 versions. The agent never has to worry about "did the goalposts move?"
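The stdin/stdout contract from point 3 can be sketched in a few lines of Python. This is an illustrative skeleton with a placeholder neutral score, not the project's actual detector.py (which computes the subscores described later in the report):

```python
import sys
import pandas as pd

def score_providers(npis: pd.Series) -> pd.DataFrame:
    """Placeholder scorer: the real detector computes subscores from DuckDB here."""
    out = pd.DataFrame({"npi": npis, "score": 0.5})  # neutral score for every provider
    return out.drop_duplicates("npi")                # dedupe on NPI, per the CLAUDE.md gotchas

def run(stdin=sys.stdin, stdout=sys.stdout) -> None:
    # eval.py pipes a one-column NPI CSV on stdin; the detector answers npi,score on stdout
    npis = pd.read_csv(stdin)["npi"]
    score_providers(npis).to_csv(stdout, index=False)
```

Because the interface is just a pipe, any language or library can sit behind it; the oracle only ever sees the npi,score CSV.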

3b. The Agent's Instruction File: CLAUDE.md

The agent isn't dropped into the repository with no context. It reads a single markdown file β€” CLAUDE.md β€” at the start of every session. This file is the entire contract between the human operator and the AI: what it's trying to accomplish, what the rules are, what data is available, and what it has already learned from prior experiments.

Think of it as a standing brief. The operator writes it once and updates it as the project evolves. The agent re-reads it every session and inherits the accumulated institutional knowledge without needing to rediscover it.

What CLAUDE.md Contains (and Why Each Section Matters)

Here is the actual CLAUDE.md used in this project:

# CLAUDE.md β€” Agent Instructions

You are improving `detector.py` to maximize AUC-ROC against the OIG LEIE ground truth.

## The Rules

1. **DO NOT modify `eval.py`** β€” it is the fixed oracle
2. `detector.py` reads NPI CSV from stdin, outputs `npi,score` CSV to stdout
3. Run `python eval.py --dry-run` to test without logging
4. Run `python eval.py --description "your description"` to log a real result
5. Check `results.tsv` to see all previous attempts

## Current Best

See `results.tsv` β€” aim to beat the highest AUC-ROC in that file.

## Data Available

DuckDB at `/home/dataops/cms-data/data/provider_searcher.duckdb`:
- `raw_physician_by_provider`         β€” Part B billing (NPI, specialty, services, benes, payment)
- `raw_part_d_by_provider`            β€” Part D prescribing (NPI, drug cost, benes, opioid LA rate)
- `raw_physician_by_provider_and_service` β€” HCPCS-level billing (9.6M rows)
- `raw_open_payments_general`         β€” Industry payments (14.7M rows)
- `raw_nppes`                         β€” Provider registry (7.1M, NPI, taxonomy, address)
- `raw_pecos_enrollment`              β€” Medicare enrollment (2.54M, NPI β€” has duplicate NPIs)
- `core_providers`                    β€” Cleaned provider table (1.2M, npi, type, state, zip5)

## What Has Worked

- **Services-per-bene z-score within specialty** β€” the single strongest signal
- **LA opioid rate z-score within specialty** β€” catches pill mills
- **`max(subscores)`** β€” taking the maximum across all feature subscores beats weighted averaging
- High-volume specialties (oncology, hematology) need a dampening factor (0.4–0.5Γ—)

## What Hasn't Worked

- HCPCS concentration HHI β€” adds noise, hurts AUC
- Taxonomy mismatch β€” too many false positives
- Raw percentile buckets β€” z-scores are better
- Adding too many features β€” each new weak feature dilutes the strong ones

## Key Gotchas

- `raw_pecos_enrollment` has up to 75 rows per NPI β€” always `SELECT DISTINCT NPI`
- HCPCS table has 9.6M rows β€” compute HHI in SQL, not pandas apply()
- Sigmoid scale=2.0 works better than 1.5 for z-score transformation
- Always fill NaN with 0.5 (neutral) before outputting scores
- Deduplicate output on NPI before returning
πŸ“Œ This File Is the Compounding Advantage
Notice the What Has Worked and What Hasn't Worked sections. These don't come pre-written β€” they're updated by the operator after each session based on what the oracle revealed. By session two, the agent skips the experiments that failed in session one. By session three, it builds on two layers of accumulated insight. This is how a project with 181 ground truth labels and a relatively small feature set still produces a 0.81 AUC in a single evening β€” the agent isn't starting from zero every time.

The original autoresearch project and methodology from Andrej Karpathy is open source: github.com/karpathy/autoresearch. Our implementation adapts the pattern for a supervised anomaly detection problem on public healthcare data rather than language model training.


4. Case Study: CMS Medicare Fraud Detection

The Problem

Medicare fraud costs the U.S. an estimated $60–100B annually. The OIG (Office of Inspector General) maintains the LEIE β€” a list of ~82,000 providers excluded from Medicare for fraud, abuse, or professional misconduct. The question: can we identify high-risk providers before they're caught, using only their publicly available billing patterns?

Data Infrastructure

Dataset | Scale | Signal Used
CMS Part B Physician Claims | 1.26M providers | Services per beneficiary, payment per beneficiary β€” billing anomalies
CMS Part D Prescribing | 1.38M providers | Drug cost per beneficiary, long-acting opioid prescribing rate
Open Payments (Sunshine Act) | 14.7M payments | Total pharma/device industry payments received β€” entanglement signal
NPPES NPI Registry | 7.1M providers | Specialty taxonomy, demographics, entity resolution
PECOS Enrollment | 2.54M records | Active Medicare enrollment status
OIG LEIE Exclusion List | 82,714 excluded | Ground truth labels β€” 181 matched to active CMS billing data
⚠️ Why Only 181 Ground Truth Labels?
Of 82,714 LEIE entries, ~74,000 pre-date the NPI system (no digital identifier). Of the rest, ~8,000 have real NPIs but have been scrubbed from CMS billing data after exclusion. The 181 that remain are almost exclusively 2024–2026 exclusions β€” providers who were billing normally, then got caught. This means the model learns what fraudulent billing patterns look like right before detection. That's a leading indicator, not a lagging one.

The Ground Truth Oracle (eval.py)

On every evaluation run, eval.py does the following β€” automatically, without human involvement:

  1. Downloads the live LEIE from oig.hhs.gov (24-hour cache)
  2. Loads 1.2M provider NPIs from DuckDB
  3. Joins LEIE to CMS β€” builds binary labels: 1 = excluded, 0 = not excluded
  4. Passes all NPIs to detector.py via stdin
  5. Receives back an npi, score CSV
  6. Computes AUC-ROC and Average Precision against the labels
  7. Appends timestamp, commit hash, metrics, and description to results.tsv

The entire eval cycle takes 3–4 minutes on a Hetzner server (32GB RAM, 8 vCPU). The agent never touches this file.
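The oracle's measurement core (steps 3–6) reduces to a few lines; a simplified sketch assuming scikit-learn, with the LEIE download, DuckDB load, and TSV logging of the real eval.py omitted:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(provider_npis: pd.Series, excluded_npis: set, scores: pd.Series) -> dict:
    """Label each NPI (1 = on the LEIE exclusion list) and score the detector's ranking."""
    labels = provider_npis.isin(excluded_npis).astype(int)
    return {
        "auc_roc": roc_auc_score(labels, scores),
        "avg_precision": average_precision_score(labels, scores),
    }

# Toy illustration: four providers, one excluded, ranked highest by the detector
npis = pd.Series([101, 102, 103, 104])
metrics = evaluate(npis, excluded_npis={104}, scores=pd.Series([0.2, 0.5, 0.3, 0.9]))
# a perfect ranking yields an AUC-ROC of 1.0
```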

The Scoring Model (detector.py)

The final detector computes six independent subscores β€” each capturing a different behavioral dimension of potential fraud β€” and returns the maximum:

# Core scoring logic (V15 final β€” simplified):

subscores = [
    sub_spb,    # Services/beneficiary z-score (within specialty peer group)
    sub_ppb,    # Payment/beneficiary z-score (within specialty peer group)
    sub_la,     # Long-acting opioid prescribing rate z-score
    sub_cpb,    # Drug cost/beneficiary z-score
    sub_pay,    # Total industry payments (log-normalized)
    sub_pecos,  # Active billing but no PECOS enrollment (binary signal)
]

# The key insight: fraud = extreme on ANY single dimension
final_score = max(subscores) ** 1.2   # slight amplification of extremes

The specialty normalization is critical: every z-score is computed within peer group, not globally. An oncologist billing $500K/year per patient is normal. A family practitioner billing $500K/year per patient is a significant anomaly. Without this, the top of the suspect list is flooded with legitimate high-billing specialists.
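A minimal sketch of that peer-group normalization, assuming pandas and hypothetical column names (specialty, services_per_bene); the sigmoid scale of 2.0 follows the note in CLAUDE.md:

```python
import numpy as np
import pandas as pd

def specialty_zscore(df: pd.DataFrame, col: str) -> pd.Series:
    """Z-score a metric within each specialty peer group, not globally."""
    grouped = df.groupby("specialty")[col]
    z = (df[col] - grouped.transform("mean")) / grouped.transform("std")
    return z.fillna(0.0)  # single-member specialties get a neutral z of 0

def to_subscore(z: pd.Series, scale: float = 2.0) -> pd.Series:
    """Squash z-scores into (0, 1); scale=2.0 per CLAUDE.md (z=0 maps to a neutral 0.5)."""
    return 1.0 / (1.0 + np.exp(-z / scale))

df = pd.DataFrame({
    "specialty": ["oncology", "oncology", "family", "family", "family"],
    "services_per_bene": [48.0, 52.0, 4.0, 5.0, 50.0],
})
df["sub_spb"] = to_subscore(specialty_zscore(df, "services_per_bene"))
# 50 services/bene is unremarkable within the oncology peer group but extreme for
# family practice, so the family-practice outlier gets the highest subscore
```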


5. 18 Iterations, One Evening β€” The AUC Progression

Baseline 0.5561 β†’ V2 0.7695 (+21) β†’ V3 0.7433 (βˆ’3) β†’ V4 0.7325 (βˆ’1) β†’ V5 0.7483 (+2) β†’ V6 0.7664 (+2) β†’ V7 0.7904 (+2) β†’ V8–V11 0.8013 β†’ V12 0.8098 (peak) β†’ V13–V15 ~0.81 (plateau)
Version | AUC | What the Agent Tried | Why / Outcome
Baseline | 0.5561 | Raw billing z-scores, global normalization | Near-random β€” no peer group context
V2 | 0.7695 | Z-scores within specialty; LA opioid rate; PECOS gap Γ— volume signal; industry payments | +21 points β€” specialty normalization is the dominant lever
V3 | 0.7433 | Added HCPCS concentration HHI, upcoding ratio, taxonomy mismatch | βˆ’3 points β€” weak features dilute strong ones in a weighted sum
V4 | 0.7325 | Percentile bucketing (90/95/99th pct) instead of z-scores | βˆ’1 point from V3 (βˆ’4 from the V2 peak) β€” percentiles lose variance within the extreme tail
V5–V6 | 0.7483 β†’ 0.7664 | Reverted noisy features; recalibrated opioid signal | +4 points β€” cleanup gains
V7 | 0.7904 | Ensemble: 50% max(subscores) + 50% weighted mean | +2 points β€” first test of the max() hypothesis; confirms it helps
V8–V11 | β†’ 0.8013 | Progressive shift: 60% β†’ 70% β†’ 80% β†’ 90% max weight | Monotonic improvement β€” each step confirms max dominates
V12 | 0.8098 | Pure max(subscores) β€” eliminated the weighted mean entirely | Best result β€” hypothesis confirmed: fraud = extreme on ONE dimension
V13–V15 | ~0.81 | Power transforms, top-2 mean aggregation, subscore rescaling | Plateau β€” agent correctly identifies diminishing returns

The agent ran each of these 18 experiments, read the outcome, proposed the next hypothesis, and rewrote the code β€” with no human in the loop between iterations. The operator (human) reviewed the trajectory occasionally and provided domain guidance when useful, but did not run any experiments manually.


6. What the Agent Figured Out: Key Technical Discoveries

🎯 Discovery #1 β€” Fraud = Extreme on ANY Single Dimension (not average on all)

The most important insight of the entire project. A weighted average rewards providers who are mildly suspicious on many metrics. But real fraud tends to be extreme on one metric: 10,000 services per beneficiary, or $23M billed on 30 patients, or 100% long-acting opioid rate. max(subscores) asks "is this provider an outlier on anything?" β€” and outperforms weighted averaging by ~3 AUC points. This drove the V7β†’V12 improvement arc.
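The difference is easy to see numerically; a toy comparison with hypothetical subscore vectors:

```python
# Two providers: one mildly elevated everywhere, one extreme on a single dimension
mildly_odd = [0.6, 0.6, 0.6, 0.6, 0.6, 0.6]    # borderline on every metric
pill_mill  = [0.5, 0.5, 0.99, 0.5, 0.5, 0.5]   # extreme LA-opioid subscore only

def weighted_mean(subscores):
    return sum(subscores) / len(subscores)

def max_rule(subscores):
    return max(subscores) ** 1.2  # V12's aggregation, as described above

# A weighted mean ranks the mildly-odd provider above the pill mill...
assert weighted_mean(mildly_odd) > weighted_mean(pill_mill)
# ...while the max rule surfaces the single-dimension extreme
assert max_rule(pill_mill) > max_rule(mildly_odd)
```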

πŸ“Š Discovery #2 β€” Specialty Normalization Is Non-Negotiable

The single largest AUC jump in the project (+21 points, Baseline→V2) came from one change: computing billing z-scores within specialty peer groups rather than across all providers globally. An oncologist administering 50 infusions per patient is completely normal. A family practitioner with 50 services per patient is extreme. Without peer normalization, every oncologist and hematologist floods the top of the suspect list.

⚠️ Discovery #3 β€” Feature Dilution: More Features = Worse Performance

V3 added HCPCS service concentration (HHI), upcoding ratio, and taxonomy mismatch. AUC dropped 3 points. This is a classic feature dilution problem: when weak signals are included in a weighted sum, they water down the strong signals. The fix (which the agent discovered by V7) was to put each signal in its own subscore and take the maximum β€” which automatically isolates the strongest signal for each provider rather than averaging noise into it.

πŸ”¬ Discovery #4 β€” PECOS Signal Is Partially Post-Exclusion Leakage

PECOS enrollment gap has a standalone AUC of 0.78 β€” the strongest individual signal. But this is largely circular: when a provider is excluded from Medicare, their PECOS enrollment is terminated. The feature is detecting the consequence of exclusion, not the cause of fraud. In a pure max() model this is acceptable (it still catches confirmed fraudsters), but it would flood a weighted model with legitimate non-enrolled providers. The agent flagged this when it noticed the PECOS-top suspect list didn't overlap with the billing-anomaly top list.

Feature Importance β€” Standalone AUC by Subscore

Subscore | Standalone AUC | Combined Role | Note
PECOS enrollment gap | 0.7778 | Strong but circular | ⚠️ Post-exclusion leakage risk
Drug cost/beneficiary | 0.5756 | Distinct fraud type (pharma) | Catches different fraudsters than billing signals
Services/beneficiary | 0.5678 | Primary billing anomaly signal | Best non-circular predictor
Payment/beneficiary | 0.5549 | Correlated with svc/bene | Redundant but kept for the max pool
Open Payments (industry $) | 0.5379 | Weak standalone | Useful in the max pool: catches the pharma fraud pattern
LA opioid prescribing rate | 0.4617 | Below random standalone | Only catches opioid-specific fraud; adds value in the max pool
Combined max(subscores) | 0.8072 | Ensemble captures all fraud types | Each subscore catches different providers
πŸ’‘ The Ensemble Effect β€” Why the Max Model Outperforms Every Individual Signal
No single subscore breaks 0.58 (excluding the leaky PECOS signal). But the combined max model reaches 0.81. This is the ensemble effect in action: each subscore catches a different pattern of fraud. Billing anomalies, drug cost anomalies, opioid patterns, and industry entanglement don't co-occur in the same providers. Taking the maximum lets the model specialize per-provider without forcing it to average across irrelevant dimensions.

7. Results: Top Fraud Suspects Identified

After running the final V15 detector across 1.2M Medicare providers and filtering to individual physicians (removing ambulance services, labs, and institutional billers), the highest-scoring billing anomalies:

Rank | Provider | Specialty | Key Metric | Why Flagged
1 | Andrew Leavitt (NPI 1952343113) | Internal Medicine, SF CA | 10,333 svc/bene; 2.5M services, 246 patients; $4.76M Medicare | Statistically implausible
2 | Elisabeth Balken (NPI 1992489728) | Nurse Practitioner, Mesa AZ | $23.3M Medicare billed; 30 patients, 29,734 services | NP billing $23M on 30 patients
3 | Frank Curvin (NPI 1457397382) | Family Practice, Johns Creek GA | $11.8M Medicare billed; 104 patients, 191 svc/bene | Unusual volume for FP
4 | Joyce Ravain (NPI 1215980230) | Emergency Medicine, Ormond Beach FL | 92.9 svc/bene; 45 patients, $835K | EM physicians don't have ongoing patient relationships
⚠️ Scores Are Alerts, Not Verdicts
High scores indicate statistical outliers that warrant investigation β€” not confirmed fraud. The LLM web validation pipeline (next section) adds a second stage: Brave Search for news + OIG press releases, then Claude synthesizes a verdict (LIKELY_FRAUD / POSSIBLE_FRAUD / LEGITIMATE / INSUFFICIENT_DATA).

8. The Full Pipeline: From Raw CMS Data to Named Suspects

Zooming out, the complete system is a three-stage pipeline. Autoresearch handles Stage 2. Stages 1 and 3 are adjacent infrastructure built to make the results actionable.

Stage | Name | What Happens | Output
Stage 1 | Data Infrastructure | 5 CMS public datasets ingested into DuckDB. NPPES entity resolution. Specialty taxonomy mapping. Provider core table joining all sources. | 6GB DuckDB, 90M+ rows, 30 tables, 1.2M provider-level aggregate records
Stage 2 | Autoresearch Loop | Agent iterates on detector.py. Oracle (eval.py) evaluates against OIG LEIE ground truth. 18 iterations over one evening. Best result: AUC 0.8098. | Tuned scoring function; physician_suspects.csv (top 100 billing anomalies)
Stage 3 | LLM Web Validation | For each suspect: Brave Search (OIG news, medical board, fraud indictment). Claude synthesizes web evidence + CMS data β†’ structured verdict. | HTML report with color-coded verdicts per provider; JSON structured output for downstream use
The Speed Advantage β€” End to End

The bottleneck is now human judgment on validated suspects β€” exactly where it should be. The automated pipeline handles the 99% of providers who clearly aren't anomalous, and surfaces the ~1% that warrant a real person's attention.

βœ… What This Demonstrates
The autoresearch pattern isn't magic β€” it's a disciplined separation of concerns: the AI generates hypotheses and writes code; the oracle measures results honestly; the loop runs until it plateaus. What makes it powerful is the speed multiplier. In domains with a clean, automated evaluation function (validation loss, AUC-ROC, F1 score, backtest return), you can run in hours what used to take weeks. The AI doesn't need to be smarter than a human researcher. It just needs to iterate faster, stay disciplined about measuring, and know when it's hit a ceiling.

Project: CMS Healthcare Fraud Detection  |  Session date: March 8–9, 2026 (~6 hours)  |  Server: Hetzner 5.78.148.70 (32GB RAM, 8 vCPU)  |  GitHub: blakethom8/cms-fraud-detection  |  Data: All CMS data publicly available at data.cms.gov