Andrej Karpathy's autoresearch project makes a deceptively simple observation: most machine learning research is a loop. A researcher proposes a change, measures the result against an objective metric, decides whether it helped, and iterates. The question he asked was: what if you automated that loop?
"You need a fixed, automated evaluation function β an oracle that can't be gamed and doesn't change. Then you let an AI agent edit the model freely. The only rule: the oracle doesn't lie."
In Karpathy's original implementation, the oracle was validation loss on a character-level language model. In our implementation, the oracle is AUC-ROC against the OIG LEIE β the federal database of providers excluded from Medicare for fraud, abuse, or professional misconduct. Different domain, identical architecture.
The cleanest way to understand this system is through a strict division of responsibility. Many people conflate "the AI" with "the whole system." They're different components doing fundamentally different jobs.
| Dimension | The AI Agent | Karpathy's Framework |
|---|---|---|
| Primary role | Hypothesis generation and implementation: decides what to try and writes the code | Evaluation and truth: measures how well it worked, objectively, every time |
| Files owned | detector.py only (the scoring logic) | eval.py (frozen), results.tsv (append-only) |
| What it reads | The full experiment history in results.tsv; its own previous code; AUC movements | Live OIG LEIE exclusion list (downloaded fresh on every run); CMS provider database |
| Reasoning style | Scientific: "V3 added HHI concentration and AUC dropped 3 points; it probably added noise. Try removing it and isolating the upcoding signal." Proposes the next experiment based on the prior result. | Mechanical and deterministic: pulls NPIs, runs the detector, computes AUC-ROC and Average Precision, writes to TSV. No judgment. Same logic every run. |
| Knowledge used | Domain knowledge about healthcare fraud patterns, statistical signal quality, feature-engineering intuition | None: it doesn't need to understand fraud. It just measures a scoring function against a labeled dataset. |
| Speed | ~2 min per iteration: reads history, proposes a change, rewrites detector.py | ~3–4 min per iteration: loads 1.2M NPIs, runs the detector, computes AUC, appends the result |
| Can it fail? | Yes, and it does: V3 and V4 both dropped AUC. That's fine; the oracle catches it and the agent course-corrects. | No (unless infrastructure fails). Deterministic and reproducible; every run is comparable to every other run. |
Each iteration follows this reasoning pattern:
```
# Agent's mental loop (simplified):

1. Read results.tsv → understand trend
   "V7 → V12: gradually shifting from 50% max to 100% max, each step improved.
    V13: power transform didn't help. V14: top-2 mean didn't help.
    Conclusion: pure max is the ceiling with current features."

2. Form hypothesis
   "The marginal gains are below statistical significance at n=181 labels.
    The highest-ROI next move is expanding ground truth, not tuning existing features."

3. Implement
   → Edit detector.py to test the hypothesis (or note that we've reached a plateau)

4. Run eval.py
   → Automatic. The oracle doesn't care about the hypothesis. It just measures.

5. Back to 1.
```
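Step 1 of that loop is mechanical enough to sketch. Below is a minimal illustration of reading the experiment history to find the score to beat; the column names (`version`, `auc_roc`, `description`) are assumptions for illustration, since the real results.tsv schema isn't shown above.

```python
import csv
import io

# Hypothetical results.tsv contents (columns and values are illustrative,
# not taken from the real file)
HISTORY = """version\tauc_roc\tdescription
V12\t0.8098\tpure max(subscores)
V13\t0.8091\tpower transform on subscores
V14\t0.8085\ttop-2 mean aggregation
"""

def best_result(tsv_text):
    """Return the (version, auc) pair with the highest AUC-ROC so far."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    top = max(rows, key=lambda r: float(r["auc_roc"]))
    return top["version"], float(top["auc_roc"])

version, auc = best_result(HISTORY)
# `auc` is now the target the next experiment must beat
```

Because the history is append-only and the oracle never changes, this one lookup is all the context the agent needs to know whether its last hypothesis helped.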
Traditional research on a problem like Medicare fraud detection would look like this: a data scientist spends a week building a feature pipeline, runs the model, waits a day for results, writes it up, proposes new features, repeats. A rigorous study might run 5β10 experiments over several weeks.
The autoresearch loop collapses that timeline because the bottleneck, the human researcher deciding what to try next, is replaced by something that doesn't sleep, doesn't need to write documentation between experiments, and doesn't get demoralized when AUC drops.
The key design decisions that enable this speed:
- detector.py is a scoring function, not a trained model. There's no gradient descent, no epochs, no train/val split to manage. It reads data, applies logic, returns scores. Evaluation is fast because it's just a comparison.
- eval.py passes NPIs via stdin; detector.py writes scores to stdout. No database writes, no file locking, no shared state. The interface is a CSV pipe.
- Because eval.py never changes, AUC results are perfectly comparable across all 18 versions. The agent never has to worry about "did the goalposts move?"

CLAUDE.md

The agent isn't dropped into the repository with no context. It reads a single markdown file, CLAUDE.md, at the start of every session. This file is the entire contract between the human operator and the AI: what it's trying to accomplish, what the rules are, what data is available, and what it has already learned from prior experiments.
Think of it as a standing brief. The operator writes it once and updates it as the project evolves. The agent re-reads it every session and inherits the accumulated institutional knowledge without needing to rediscover it.
What CLAUDE.md Contains (and Why Each Section Matters)

- The rules: never modify eval.py, the interface spec (stdin/stdout CSV), and how to log vs. dry-run. The agent can't misinterpret these because they're written plainly.
- The current best: tracked in results.tsv, so the agent always knows the target to beat.

Here is the actual CLAUDE.md used in this project:
```markdown
# CLAUDE.md - Agent Instructions

You are improving `detector.py` to maximize AUC-ROC against the OIG LEIE ground truth.

## The Rules
1. **DO NOT modify `eval.py`** - it is the fixed oracle
2. `detector.py` reads NPI CSV from stdin, outputs `npi,score` CSV to stdout
3. Run `python eval.py --dry-run` to test without logging
4. Run `python eval.py --description "your description"` to log a real result
5. Check `results.tsv` to see all previous attempts

## Current Best
See `results.tsv` - aim to beat the highest AUC-ROC in that file.

## Data Available
DuckDB at `/home/dataops/cms-data/data/provider_searcher.duckdb`:
- `raw_physician_by_provider` - Part B billing (NPI, specialty, services, benes, payment)
- `raw_part_d_by_provider` - Part D prescribing (NPI, drug cost, benes, opioid LA rate)
- `raw_physician_by_provider_and_service` - HCPCS-level billing (9.6M rows)
- `raw_open_payments_general` - Industry payments (14.7M rows)
- `raw_nppes` - Provider registry (7.1M; NPI, taxonomy, address)
- `raw_pecos_enrollment` - Medicare enrollment (2.54M; NPI - has duplicate NPIs)
- `core_providers` - Cleaned provider table (1.2M; npi, type, state, zip5)

## What Has Worked
- **Services-per-bene z-score within specialty** - the single strongest signal
- **LA opioid rate z-score within specialty** - catches pill mills
- **`max(subscores)`** - taking the maximum across all feature subscores beats weighted averaging
- High-volume specialties (oncology, hematology) need a dampening factor (0.4-0.5x)

## What Hasn't Worked
- HCPCS concentration HHI - adds noise, hurts AUC
- Taxonomy mismatch - too many false positives
- Raw percentile buckets - z-scores are better
- Adding too many features - each new weak feature dilutes the strong ones

## Key Gotchas
- `raw_pecos_enrollment` has up to 75 rows per NPI - always `SELECT DISTINCT NPI`
- HCPCS table has 9.6M rows - compute HHI in SQL, not pandas apply()
- Sigmoid scale=2.0 works better than 1.5 for z-score transformation
- Always fill NaN with 0.5 (neutral) before outputting scores
- Deduplicate output on NPI before returning
```
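One of those gotchas, the duplicate NPIs in `raw_pecos_enrollment`, is easy to demonstrate. The sketch below uses Python's built-in sqlite3 as a portable stand-in for DuckDB (the SQL is identical for this query), with made-up rows:

```python
import sqlite3

# Toy stand-in for raw_pecos_enrollment, which carries up to 75 rows per NPI.
# sqlite3 is used here only for portability; the project itself queries DuckDB.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_pecos_enrollment (npi TEXT, enrollment_id TEXT)")
con.executemany(
    "INSERT INTO raw_pecos_enrollment VALUES (?, ?)",
    [("1000000001", "E1"), ("1000000001", "E2"), ("1000000002", "E3")],
)

# Naive count double-counts providers with multiple enrollment records...
naive = con.execute("SELECT COUNT(npi) FROM raw_pecos_enrollment").fetchone()[0]

# ...which is why the brief insists on SELECT DISTINCT NPI
dedup = con.execute("SELECT COUNT(DISTINCT npi) FROM raw_pecos_enrollment").fetchone()[0]

print(naive, dedup)  # 3 2
```

Any join against this table without deduplication would silently inflate a provider's apparent enrollment footprint, which is exactly the kind of institutional knowledge the brief exists to preserve.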
The original autoresearch project and methodology from Andrej Karpathy is open source: github.com/karpathy/autoresearch. Our implementation adapts the pattern for a supervised anomaly detection problem on public healthcare data rather than language model training.
Medicare fraud costs the U.S. an estimated $60–100B annually. The OIG (Office of Inspector General) maintains the LEIE, a list of ~82,000 providers excluded from Medicare for fraud, abuse, or professional misconduct. The question: can we identify high-risk providers before they're caught, using only their publicly available billing patterns?
| Dataset | Scale | Signal Used |
|---|---|---|
| CMS Part B Physician Claims | 1.26M providers | Services per beneficiary, payment per beneficiary (billing anomalies) |
| CMS Part D Prescribing | 1.38M providers | Drug cost per beneficiary, long-acting opioid prescribing rate |
| Open Payments (Sunshine Act) | 14.7M payments | Total pharma/device industry payments received (entanglement signal) |
| NPPES NPI Registry | 7.1M providers | Specialty taxonomy, demographics, entity resolution |
| PECOS Enrollment | 2.54M records | Active Medicare enrollment status |
| OIG LEIE Exclusion List | 82,714 excluded | Ground truth labels (181 matched to active CMS billing data) |
The Oracle (eval.py)

On every evaluation run, eval.py does the following, automatically and with no human involvement:

1. Downloads the current LEIE exclusion list from oig.hhs.gov (24-hour cache)
2. Labels each provider: 1 = excluded, 0 = not excluded
3. Pipes the NPI list to detector.py via stdin
4. Reads the npi, score CSV back from stdout
5. Computes AUC-ROC and Average Precision and appends the result to results.tsv

The entire eval cycle takes 3–4 minutes on a Hetzner server (32GB RAM, 8 vCPU). The agent never touches this file.
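The measurement at the heart of that cycle reduces to a single number. The real eval.py isn't reproduced here, but as a self-contained sketch, AUC-ROC can be computed directly from the Mann-Whitney rank statistic with no ML library at all:

```python
def auc_roc(labels, scores):
    """AUC-ROC via the Mann-Whitney U statistic.

    labels: 1 = excluded (fraud), 0 = not excluded.
    scores: detector output; higher = more suspicious.
    Tied scores receive average ranks, matching the standard definition.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # group tied scores and assign them their average rank
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos = sum(labels)
    neg = len(labels) - pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)

# A perfect detector scores every excluded provider above every clean one
print(auc_roc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
```

This is why the oracle can't be gamed: the score depends only on the relative ordering of excluded vs. non-excluded providers, so cosmetic rescaling of detector output changes nothing.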
The Detector (detector.py)

The final detector computes six independent subscores, each capturing a different behavioral dimension of potential fraud, and returns the maximum:
```python
# Core scoring logic (V15 final, simplified):
subscores = [
    sub_spb,    # Services/beneficiary z-score (within specialty peer group)
    sub_ppb,    # Payment/beneficiary z-score (within specialty peer group)
    sub_la,     # Long-acting opioid prescribing rate z-score
    sub_cpb,    # Drug cost/beneficiary z-score
    sub_pay,    # Total industry payments (log-normalized)
    sub_pecos,  # Active billing but no PECOS enrollment (binary signal)
]

# The key insight: fraud = extreme on ANY single dimension
final_score = max(subscores) ** 1.2  # slight amplification of extremes
```
The specialty normalization is critical: every z-score is computed within peer group, not globally. An oncologist billing $500K/year per patient is normal. A family practitioner billing $500K/year per patient is a significant anomaly. Without this, the top of the suspect list is flooded with legitimate high-billing specialists.
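Here is a minimal sketch of that peer-group normalization on made-up providers (the real detector.py pulls these columns from DuckDB; the sigmoid scale of 2.0 follows the project's own gotcha notes):

```python
import math
from collections import defaultdict

def specialty_z_scores(providers):
    """Z-score each provider's services-per-beneficiary within its specialty."""
    by_spec = defaultdict(list)
    for p in providers:
        by_spec[p["specialty"]].append(p["svc_per_bene"])
    stats = {}
    for spec, vals in by_spec.items():
        mu = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals)) or 1.0
        stats[spec] = (mu, sd)
    return {
        p["npi"]: (p["svc_per_bene"] - stats[p["specialty"]][0]) / stats[p["specialty"]][1]
        for p in providers
    }

def z_to_subscore(z, scale=2.0):
    # squash the z-score into (0, 1); scale=2.0 per the project's gotchas
    return 1 / (1 + math.exp(-z / scale))

# Illustrative data: ~50 svc/bene is routine for oncology, extreme for family practice
providers = [
    {"npi": "A", "specialty": "oncology",        "svc_per_bene": 48},
    {"npi": "B", "specialty": "oncology",        "svc_per_bene": 52},
    {"npi": "C", "specialty": "family_practice", "svc_per_bene": 4},
    {"npi": "D", "specialty": "family_practice", "svc_per_bene": 5},
    {"npi": "E", "specialty": "family_practice", "svc_per_bene": 50},
]
z = specialty_z_scores(providers)
# The family practitioner at 50 svc/bene stands out; the oncologists do not
```

With global normalization both oncologists and provider E would look similar; grouping by specialty is what lets the detector see E as the anomaly.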
| Version | AUC | What the Agent Tried | Why / Outcome |
|---|---|---|---|
| Baseline | 0.5561 | Raw billing z-scores, global normalization | Near-random; no peer-group context |
| V2 | 0.7695 | Z-scores within specialty; LA opioid rate; PECOS gap × volume signal; industry payments | +21 points; specialty normalization is the dominant lever |
| V3 | 0.7433 | Added HCPCS concentration HHI, upcoding ratio, taxonomy mismatch | −3 points; weak features dilute strong ones in a weighted sum |
| V4 | 0.7325 | Percentile bucketing (90/95/99th pct) instead of z-scores | −4 points; percentiles lose variance within the extreme tail |
| V5–V6 | 0.7483–0.7664 | Reverted noisy features; recalibrated opioid signal | +4 points; cleanup gains |
| V7 | 0.7904 | Ensemble: 50% max(subscores) + 50% weighted mean | +2 points; first test of the max() hypothesis confirms it helps |
| V8–V11 | up to 0.8013 | Progressive shift: 60% → 70% → 80% → 90% max weight | Monotonic improvement; each step confirms max dominates |
| V12 | 0.8098 | Pure max(subscores); eliminated the weighted mean entirely | Best result; hypothesis confirmed: fraud = extreme on ONE dimension |
| V13–V15 | ~0.81 | Power transforms, top-2 mean aggregation, subscore rescaling | Plateau; the agent correctly identifies diminishing returns |
The agent ran each of these 18 experiments, read the outcome, proposed the next hypothesis, and rewrote the code β with no human in the loop between iterations. The operator (human) reviewed the trajectory occasionally and provided domain guidance when useful, but did not run any experiments manually.
The most important insight of the entire project: max() beats weighted averaging. A weighted average rewards providers who are mildly suspicious on many metrics, but real fraud tends to be extreme on one metric: 10,000 services per beneficiary, $23M billed on 30 patients, or a 100% long-acting opioid rate. max(subscores) asks "is this provider an outlier on anything?" and outperforms weighted averaging by ~3 AUC points. This drove the V7–V12 improvement arc.
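A toy illustration with hypothetical subscore vectors (made-up numbers) makes the mechanism concrete:

```python
# Hypothetical subscore vectors for two providers (illustrative numbers only)
fraudster  = [0.99, 0.30, 0.20, 0.10]  # extreme on ONE dimension
mildly_odd = [0.60, 0.60, 0.55, 0.50]  # slightly elevated everywhere

def mean(xs):
    return sum(xs) / len(xs)

# A plain (or weighted) average ranks the unremarkable provider HIGHER...
assert mean(fraudster) < mean(mildly_odd)

# ...while max() surfaces the single-dimension outlier
assert max(fraudster) > max(mildly_odd)
```

Averaging lets three unremarkable dimensions drown out one screaming anomaly; the maximum preserves it.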
The single largest AUC jump in the project (+21 points, Baseline → V2) came from one change: computing billing z-scores within specialty peer groups rather than across all providers globally. An oncologist administering 50 infusions per patient is completely normal. A family practitioner with 50 services per patient is extreme. Without peer normalization, every oncologist and hematologist floods the top of the suspect list.
V3 added HCPCS service concentration (HHI), upcoding ratio, and taxonomy mismatch. AUC dropped 3 points. This is a classic feature dilution problem: when weak signals are included in a weighted sum, they water down the strong signals. The fix (which the agent discovered by V7) was to put each signal in its own subscore and take the maximum, which automatically isolates the strongest signal for each provider rather than averaging noise into it.
PECOS enrollment gap has a standalone AUC of 0.78, the strongest individual signal. But this is largely circular: when a provider is excluded from Medicare, their PECOS enrollment is terminated. The feature is detecting the consequence of exclusion, not the cause of fraud. In a pure max() model this is acceptable (it still catches confirmed fraudsters), but it would flood a weighted model with legitimate non-enrolled providers. The agent flagged this when it noticed the PECOS-top suspect list didn't overlap with the billing-anomaly top list.
| Subscore | Standalone AUC | Combined Role | Note |
|---|---|---|---|
| PECOS enrollment gap | 0.7778 | Strong but circular | ⚠️ Post-exclusion leakage risk |
| Drug cost/beneficiary | 0.5756 | Distinct fraud type (pharma) | Catches different fraudsters than billing signals |
| Services/beneficiary | 0.5678 | Primary billing anomaly signal | Best non-circular predictor |
| Payment/beneficiary | 0.5549 | Correlated with svc/bene | Redundant but kept for max pool |
| Open Payments (industry $) | 0.5379 | Weak standalone | Useful in max pool: catches pharma fraud pattern |
| LA opioid prescribing rate | 0.4617 | Below random standalone | Only catches opioid-specific fraud; adds value in max pool |
| Combined max(subscores) | 0.8072 | Ensemble captures all fraud types | Each subscore catches different providers |
After running the final V15 detector across 1.2M Medicare providers and filtering to individual physicians (removing ambulance services, labs, and institutional billers), the highest-scoring billing anomalies:
| Rank | Provider | Specialty | Key Metric | Why Flagged |
|---|---|---|---|---|
| 1 | Andrew Leavitt (NPI 1952343113) | Internal Medicine, SF CA | 10,333 svc/bene; 2.5M services, 246 patients; $4.76M Medicare | Statistically implausible volume |
| 2 | Elisabeth Balken (NPI 1992489728) | Nurse Practitioner, Mesa AZ | $23.3M Medicare billed; 30 patients, 29,734 services | NP billing $23M on 30 patients |
| 3 | Frank Curvin (NPI 1457397382) | Family Practice, Johns Creek GA | $11.8M Medicare billed; 104 patients, 191 svc/bene | Unusual volume for FP |
| 4 | Joyce Ravain (NPI 1215980230) | Emergency Medicine, Ormond Beach FL | 92.9 svc/bene; 45 patients, $835K | EM physicians don't have ongoing patient relationships |
Zooming out, the complete system is a three-stage pipeline. Autoresearch handles Stage 2. Stages 1 and 3 are adjacent infrastructure built to make the results actionable.
| Stage | Name | What Happens | Output |
|---|---|---|---|
| Stage 1 | Data Infrastructure | 5 CMS public datasets ingested into DuckDB. NPPES entity resolution. Specialty taxonomy mapping. Provider core table joining all sources. | 6GB DuckDB, 90M+ rows, 30 tables, 1.2M provider-level aggregate records |
| Stage 2 | Autoresearch Loop | Agent iterates on detector.py. Oracle (eval.py) evaluates against OIG LEIE ground truth. 18 iterations over one evening. Best result: AUC 0.8098. | Tuned scoring function; physician_suspects.csv (top 100 billing anomalies) |
| Stage 3 | LLM Web Validation | For each suspect: Brave Search (OIG news, medical board, fraud indictment). Claude synthesizes web evidence + CMS data into a structured verdict. | HTML report with color-coded verdicts per provider; JSON structured output for downstream use |
The bottleneck is now human judgment on validated suspects β exactly where it should be. The automated pipeline handles the 99% of providers who clearly aren't anomalous, and surfaces the ~1% that warrant a real person's attention.
Project: CMS Healthcare Fraud Detection | Session date: March 8–9, 2026 (~6 hours) | Server: Hetzner 5.78.148.70 (32GB RAM, 8 vCPU) | GitHub: blakethom8/cms-fraud-detection | Data: All CMS data publicly available at data.cms.gov