Healthcare AI

I Let an AI Hunt Medicare Fraud Overnight. Here's What It Found.

One evening, 90 million records, 18 model iterations — and a nurse practitioner billing $23 million on 30 patients.

Blake Thomson  ·  March 9, 2026  ·  8 min read

Medicare loses an estimated $60–100 billion per year to fraud. That's not a rounding error — it's more than the GDP of most countries. The government has investigators, task forces, and a federal exclusion database tracking every provider who's been caught. And yet the problem keeps growing.

I spent an evening wondering: what happens if you point modern AI tooling — a large language model, a six-gigabyte database of claims data, and an automated feedback loop — directly at this problem? Could you meaningfully detect fraud on 1.2 million Medicare providers in a few hours?

Short answer: yes, kind of. And what the model found along the way was more interesting than the final score.


The Setup: Karpathy's Autoresearch Pattern

The approach is adapted from a technique called autoresearch, popularized by Andrej Karpathy. The idea is simple but powerful: split your problem into two files.

One file — eval.py — is the oracle. It's locked. It measures performance objectively and automatically. You never touch it.

The other file — detector.py — is yours to iterate on freely. The AI agent edits it, runs the oracle, reads the score, and tries again. No human in the loop. No waiting for feedback. Just pure iteration overnight.
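In code, the loop is tiny. Here's a minimal sketch of the pattern with toy types — in the real harness an LLM edits detector.py and eval.py produces the score; the `oracle`, `propose_edit`, and greedy accept rule below are illustrative assumptions:

```python
def autoresearch(detector, oracle, propose_edit, iterations=18):
    """Minimal autoresearch loop. The oracle is locked; the detector mutates.

    detector:     the current scoring artifact (here just a value)
    oracle:       maps a detector to an objective score (e.g. AUC); never edited
    propose_edit: returns a candidate variant of the detector (the agent's job)
    """
    best, best_score = detector, oracle(detector)
    history = [best_score]
    for _ in range(iterations):
        candidate = propose_edit(best)
        score = oracle(candidate)
        history.append(score)
        if score > best_score:  # greedy: keep only strict improvements
            best, best_score = candidate, score
    return best, best_score, history
```

With a toy oracle like `lambda x: -abs(x - 5)`, the loop climbs to the optimum and then rejects every further proposal — the same plateau shape you see in a real run's score log.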

"The key constraint is that the oracle must be automated and objective. You can't autoresearch 'is this essay good?' You can autoresearch anything with a measurable score."

For fraud detection, the oracle is the OIG LEIE — the Office of Inspector General's List of Excluded Individuals and Entities. Every provider who's been kicked out of Medicare for fraud, abuse, or license violations is on this list. It's publicly available, updated monthly, and completely unambiguous. If your detector correctly ranks real fraudsters higher than legitimate providers, your AUC goes up. If it doesn't, it goes down.
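For concreteness, the AUC the oracle reports is just a pairwise-win rate: the probability that a randomly chosen excluded provider outscores a randomly chosen non-excluded one. A stdlib sketch of the metric itself (not the project's eval.py), using the textbook Mann-Whitney formulation:

```python
def oracle_auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the fraction of
    (excluded, non-excluded) pairs where the excluded provider
    scores strictly higher, with ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # LEIE-excluded
    neg = [s for s, y in zip(scores, labels) if y == 0]  # everyone else
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A detector that ranks every fraudster above every legitimate provider scores 1.0; a coin flip scores 0.5.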

We ran 18 iterations in one evening.

90M+ CMS records · 1.2M providers scored · 18 iterations · 0.81 final AUC

The Data: Six Gigabytes of Medicare

The CMS (Centers for Medicare & Medicaid Services) publishes a remarkable amount of public data. Physician billing. Drug prescribing. Industry payments. Provider enrollment records. All of it free to download.

We loaded it into a single DuckDB database on a Hetzner server: Part B physician claims, Part D prescribing data, Open Payments (the Sunshine Act database tracking every dollar pharma pays to doctors), NPPES provider demographics, PECOS enrollment records. Six gigabytes. 90 million rows. Thirty tables.

The ground truth: 181 confirmed fraudsters currently in Medicare's billing data whose names appear on the LEIE exclusion list. That's 0.015% of all providers — a needle in a very large haystack. (The LEIE has 82,000 entries total, but most date from the pre-NPI era or have had their billing data scrubbed from CMS over time. Our 181 are all recent: 93% excluded in 2024–2026.)


What the Iterations Taught Us

The AUC progression tells a story:

Baseline   0.556
V2         0.770
V3         0.743  (regression)
V7         0.790
V12        0.810  (final)

Three discoveries stand out — each one a genuine surprise.

Discovery 1: You must normalize within specialty, not globally

The single biggest AUC jump — from 0.556 to 0.770, nearly +21 points — came from one change: computing billing anomaly scores within specialty peer groups rather than across all providers. An oncologist billing 200 services per patient is normal. A family practitioner billing 200 services per patient is not. Without specialty normalization, every high-volume legitimate specialist floods the top of your suspect list.

Discovery 2: More features made things worse

After V2's big jump, the temptation was to add more signals. We added HCPCS billing code concentration, upcoding ratios, taxonomy mismatches. AUC dropped from 0.770 to 0.743. Adding weak features dilutes strong ones. The final model uses just six subscores. In fraud detection, signal quality beats quantity every time.

Discovery 3: Fraud = extreme on ANY single dimension

The most counterintuitive finding. Instead of computing a weighted average across all fraud signals, we tried taking the maximum — the single highest subscore for each provider. AUC jumped from 0.790 to 0.810 and kept improving with a pure max approach. Why? Fraudsters don't need to be suspicious on every metric. They need to be extreme on one. A provider billing 10,000 services per patient doesn't also need unusual prescribing patterns to be worth investigating. One extreme signal is enough.
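The effect is easy to see with a toy pair of providers (subscore values hypothetical):

```python
def composite_score(subscores, mode="max"):
    """Collapse per-dimension anomaly subscores into one provider score.
    A mean dilutes a single extreme signal across the other dimensions;
    a max preserves it."""
    return max(subscores) if mode == "max" else sum(subscores) / len(subscores)

# One extreme dimension vs. mildly elevated everything:
fraudster = [0.05, 0.05, 0.05, 0.98]  # extreme on a single axis
busy_doc  = [0.50, 0.50, 0.50, 0.50]  # moderately high across the board
```

Under the mean, the busy-but-legitimate doctor outranks the fraudster (0.50 vs. 0.28); under the max, the order flips — the one extreme dimension wins.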


What Topped the Suspect List

After scoring all 1.2 million providers, filtering to individual physicians, and removing edge cases, the highest-anomaly providers in the dataset:

Andrew Leavitt, MD (Internal Medicine), San Francisco, CA
Flag (extreme): 10,333 services/patient on 246 patients ($4.76M). Statistically implausible for Internal Medicine.

Elisabeth Balken, NP (Nurse Practitioner), Mesa, AZ
Flag (extreme): $23.3M Medicare on 30 patients. Nurse practitioner billing at cardiologist-tier volume.

Frank Curvin, MD (Family Practice), Johns Creek, GA
Flag (high): $11.8M on 104 patients. Family practitioners don't typically generate this billing volume.

Joyce Ravain, MD (Emergency Medicine), Ormond Beach, FL
Flag (suspicious): 92.9 services/patient on a small panel. EM physicians don't have ongoing patient relationships; a high services/patient ratio in this specialty is unusual.
Important caveat

High anomaly scores don't mean fraud. Earlier analysis found that similar statistical patterns — high service volumes, unusual billing concentrations — can be completely legitimate: academic cancer centers running hemophilia infusions, specialty psychiatrists prescribing $7,700/dose tardive dyskinesia medication, large therapy headquarters with thousands of providers sharing an address. These suspects require human review and web research before any conclusions.

That's why the next phase of this project is an LLM web validation pipeline: for each flagged provider, run targeted searches against OIG press releases, medical board actions, and news sources, then have a language model synthesize the findings into a structured verdict. Statistical flags → human-readable context.


The 181-Label Problem

One concern worth being honest about: the ground truth set is small. Only 181 confirmed fraudsters are both in the LEIE and in our CMS billing data. That's 0.015% prevalence — extreme class imbalance — and it means the confidence interval on our 0.81 AUC is roughly ±0.03–0.04. The early big jumps (baseline → V2 → V7) are real and statistically meaningful. The marginal improvements in the final iterations probably aren't.
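One way to estimate that interval is a percentile bootstrap, resampling the excluded and non-excluded groups independently. A stdlib-only sketch — the post's exact ±0.03–0.04 method isn't specified, so treat this as one reasonable way to produce such a number:

```python
import random

def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC. With ~181 positives out of 1.2M
    providers, resampling the tiny positive class dominates the width."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]

    def auc(p, n):  # Mann-Whitney pairwise-win rate, ties count half
        wins = sum((a > b) + 0.5 * (a == b) for a in p for b in n)
        return wins / (len(p) * len(n))

    aucs = sorted(
        auc(rng.choices(pos, k=len(pos)), rng.choices(neg, k=len(neg)))
        for _ in range(n_boot)
    )
    lo = aucs[int(alpha / 2 * n_boot)]
    hi = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

In practice you'd subsample the 1.2M negatives first — the pairwise AUC above is quadratic and only sensible on a sample.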

We also ran an experiment trying to expand ground truth by fuzzy-matching the 74,000 LEIE entries that have no NPI on file against the provider registry. We found ~720 likely matches — but AUC dropped when we added them as labels. Investigation revealed why: most of those entries were excluded in 1995–2010. Their current billing patterns look completely normal because they've been legitimate providers (or non-providers) for 15–30 years. You can't teach a model what fraud looks like by pointing it at 30-year-old exclusions.
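For a sense of what "fuzzy-matching" means here, the crudest version can be written with stdlib difflib — names and threshold below are hypothetical, and a real matcher would also key on state, specialty, and exclusion date to avoid false joins:

```python
from difflib import SequenceMatcher

def fuzzy_match(leie_name, registry_names, threshold=0.92):
    """Match an NPI-less LEIE entry against registry names by normalized
    token-sorted similarity. 'SMITH, JOHN' and 'John Smith' normalize to
    the same string; near-misses score below the threshold."""
    def norm(name):
        return " ".join(sorted(name.lower().replace(",", " ").split()))
    target = norm(leie_name)
    return [
        name for name in registry_names
        if SequenceMatcher(None, target, norm(name)).ratio() >= threshold
    ]
```

Even at a strict threshold, name-only matching is why the experiment needed the exclusion-date check: a correct name match to a 1990s exclusion is still a useless label.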

The right path to expanding ground truth: DOJ press releases (OIG publishes ~100 per month with specific provider names and NPI numbers), state medical board disciplinary actions, and CMS program integrity datasets. That work is underway.


What This Is and Isn't

This isn't a production fraud detection system. It's a working prototype that demonstrates the approach — one evening of AI-assisted iteration on publicly available data producing a meaningful fraud signal (0.81 AUC, up from a 0.556 naive baseline, where 0.50 is random) and a ranked shortlist of suspicious providers worth investigating.

The approach is replicable. The data is public. The technique — autoresearch with an automated oracle — generalizes well beyond fraud to any problem where you have a measurable, objective outcome. We happened to use Medicare exclusions. The same loop works for clinical quality metrics, billing optimization, network adequacy analysis.

More posts in this series will go deeper: the feature engineering that drove the big jumps, what the web validation pipeline found when we actually looked up the top suspects, and what a proper machine learning approach (vs. rule-based scoring) would add.

Key Takeaways

1. Specialty normalization is non-negotiable in healthcare fraud detection.
2. max(subscores) beats weighted averages — fraud needs only one extreme dimension.
3. Ground truth quality matters more than quantity; old exclusion labels add noise, not signal.
4. PECOS enrollment gaps look predictive but are mostly circular: exclusion causes the enrollment gap, so the signal is an effect of getting caught, not a cause worth modeling.
5. The autoresearch pattern works: automated oracle + free iteration = real progress in a single session.


GitHub: blakethom8/cms-fraud-detection — all code, results.tsv (full 18-run log), and the suspect list are public.
Data: All CMS datasets used are publicly available at data.cms.gov. No patient data. Provider-level aggregates only.