Inspired by Andrej Karpathy's autoresearch project, the core loop is simple:
- eval.py: Fixed oracle. Never edited. Downloads the OIG LEIE, measures AUC-ROC, logs to results.tsv.
- detector.py: The only file the agent edits. Reads an NPI list from stdin, writes `npi,score` to stdout.
- results.tsv: Experiment log. Tracks every iteration with timestamp, AUC, and description.

The key insight from Karpathy: you need an automated, objective oracle. In his case it was validation loss on a language model. In ours it is the OIG LEIE, the federal database of providers excluded from Medicare for fraud, abuse, or license violations. Each eval.py run downloads the live LEIE, matches it to CMS billing data, and returns AUC-ROC. The agent iterates on detector.py freely.
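The stdin/stdout contract above is the only interface eval.py depends on. A minimal detector.py skeleton might look like this (the scoring body here is a placeholder, not the actual V15 logic):

```python
#!/usr/bin/env python3
"""Minimal detector.py skeleton. Illustrative only: the real scoring
logic is what gets iterated on and logged in results.tsv. The part
eval.py relies on is the contract: NPIs in on stdin, `npi,score` out."""
import sys


def score(npi: str) -> float:
    # Placeholder. A real version would look up the provider's
    # specialty-normalized billing subscores and take max() over them.
    return 0.0


def main() -> None:
    for line in sys.stdin:
        npi = line.strip()
        if npi:
            print(f"{npi},{score(npi):.6f}")


if __name__ == "__main__":
    main()
```

Because the interface is just text streams, eval.py can run any candidate detector the same way: `cat npis.txt | python3 detector.py`.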
| Dataset | Rows | What It Contains | Fraud Signal |
|---|---|---|---|
| CMS Part B Physician Claims | 1.26M | Every Medicare physician's billing: services, beneficiaries, payments | Volume anomalies |
| CMS Part D Prescribing | 1.38M | Drug prescribing patterns by provider including opioid rates | Opioid patterns |
| Open Payments (Sunshine Act) | 14.7M | Every pharmaceutical/device payment to a physician | Industry entanglement |
| NPPES NPI Registry | 7.1M | Provider demographics, taxonomy codes, address | Identity resolution |
| PECOS Enrollment | 2.54M | Medicare enrollment records (up to 75 rows per NPI) | Enrollment gaps |
| OIG LEIE Exclusion List | 82,714 | All providers excluded from Medicare β ground truth labels | Label source |
All stored in a single 6GB DuckDB database on a Hetzner server (32GB RAM, 8 vCPU). Python venv at /home/dataops/cms-data/.venv/. DuckDB opened read-only to protect data integrity.
| Version | AUC | Key Change | Outcome |
|---|---|---|---|
| Baseline | 0.5561 | Raw billing z-scores, global normalization | Near random |
| V2 | 0.7695 | Z-scores within specialty peer group; LA opioid rate; PECOS gap × volume; speaking fees | +21 pts: massive jump |
| V3 | 0.7433 | Added HCPCS concentration HHI, upcoding ratio, taxonomy mismatch | −3 pts: added noise |
| V4 | 0.7325 | Percentile bucketing (90/95/99th pct) instead of z-scores | −4 pts: z-scores better |
| V7 | 0.7904 | Ensemble: 50% max(subscores) + 50% weighted mean, the breakthrough insight | +2 pts |
| V8–V11 | 0.7920–0.8013 | Progressive shift toward higher max weight (60% → 90%) | Confirming max dominates |
| V12 | 0.8098 | Pure max(subscores), no weighted mean at all | Best result |
| V13–V15 | 0.8098–0.8100 | Power transforms, top-2 mean: no improvement | Plateau reached |
Taking max(subscores) across all features dramatically outperforms weighted averaging. Fraudsters don't need to be suspicious on every metric; being in the 99th percentile on one metric is sufficient signal. This drove the biggest AUC jump (V7 → V12). The implication: the more diverse your subscore pool, the more chances you have to catch different fraud patterns.
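The whole V7 → V12 progression amounts to sweeping a single blend weight between the weighted mean and the max. A sketch, assuming subscores are already on a comparable scale (e.g. percentile ranks in [0, 1]):

```python
import numpy as np


def ensemble(scores: np.ndarray, max_weight: float = 1.0) -> np.ndarray:
    """Blend max(subscores) with mean(subscores) per provider.

    scores: (n_providers, n_subscores) array, columns pre-normalized.
    max_weight=0.5 corresponds to the V7 blend; 1.0 is the pure-max V12.
    """
    return (max_weight * scores.max(axis=1)
            + (1.0 - max_weight) * scores.mean(axis=1))
```

A provider who is extreme on exactly one subscore (e.g. 0.99 on one, near zero on the rest) scores high under pure max but gets averaged down under the mean, which is precisely why max wins here.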
The single biggest jump (baseline → V2, +21 pts) came from computing z-scores within specialty peer groups, not globally. An oncologist billing 50 services/patient is normal. A family practitioner billing 50 services/patient is extreme. Without specialty normalization, legitimate high-billing specialties flood the top of the suspect list.
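The peer-group normalization is a one-liner with a pandas groupby; a sketch, where the column names are assumptions rather than the actual schema:

```python
import pandas as pd


def peer_zscores(df: pd.DataFrame,
                 metric: str = "svc_per_bene",
                 group: str = "specialty") -> pd.Series:
    """Z-score `metric` within each `group` (specialty peer group),
    so a value is judged against same-specialty peers, not globally."""
    g = df.groupby(group)[metric]
    return (df[metric] - g.transform("mean")) / g.transform("std")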
V3 added HCPCS concentration, upcoding ratio, and taxonomy mismatch β and AUC dropped 3 points. Adding weak features dilutes strong ones unless they're in the max pool as standalone subscores. The final model uses just 6 subscores. Adding 3 more hurt. This is a core lesson in fraud detection: signal quality > quantity.
PECOS enrollment gap has a remarkable 0.78 standalone AUC β but this is largely circular: when a provider is excluded from Medicare, their PECOS enrollment is terminated. The feature is detecting the effect of exclusion (PECOS revoked), not the cause (fraud). It floods the top suspect list with legitimate unenrolled providers and excluded-but-since-reinstated providers. The billing signal (svc/bene, pay/bene) is the genuinely predictive dimension.
Blake flagged an important concern: with only 181 matched labels out of 82,714 LEIE entries, how reliable is our AUC?
| LEIE Segment | Count | Why Not Matched |
|---|---|---|
| Placeholder NPI (0000000000) | 74,241 | No NPI on file β pre-NPI era exclusions or organizations |
| Real NPI, matched to CMS | 182 | β Our ground truth labels |
| Real NPI, not in CMS | 8,291 | Excluded earlier β billing data scrubbed from CMS over time |
Honestly β it's tight. Key implications:
We attempted to expand ground truth from 181 β 720+ labels by fuzzy-matching the 74,241 no-NPI LEIE entries against NPPES using last_name + state + specialty.
| Match Confidence | Total Matched | In CMS | AUC Impact |
|---|---|---|---|
| HIGH (exact name + state + taxonomy + credentials) | 287 | 44 | 0.81β0.75 (AUC dropped) |
| MEDIUM (first initial match) | 1,709 | ~495 | |
| LOW (ambiguous, first pick) | 1,454 | ~513 | Not tested |
Investigation revealed: the fuzzy-matched providers were excluded in 1995β2010 β 20-30 years ago. Their current billing patterns are completely normal (they may have been reinstated, or the NPI was reassigned). Adding them as fraud labels teaches the model the wrong thing. The old LEIE entries without NPIs are not worth adding.
What would actually help: DOJ press releases (OIG publishes ~100/month with specific provider names), CMS Program Integrity dataset, state medical board disciplinary actions.
| Subscore | Standalone AUC | Notes |
|---|---|---|
| PECOS enrollment gap | 0.7778 | β οΈ Mostly post-exclusion leakage β floods list with wrong suspects |
| Drug cost/beneficiary | 0.5756 | Moderate β specialty-normalized drug spend |
| Services/beneficiary | 0.5678 | Moderate β the key billing anomaly signal |
| Payment/beneficiary | 0.5549 | Correlated with svc/bene |
| Open Payments (industry $) | 0.5379 | Weak individually |
| LA opioid rate | 0.4617 | Below random standalone β only adds value in the max pool |
| Combined max(subscores) | 0.8072 | The ensemble effect β each feature catches different fraudsters |
The ensemble effect is the whole game. No single signal is powerful, but they catch different fraudsters. That's exactly why max() works: it asks "is this provider extreme on anything?" rather than "is this provider above average on everything?"
After removing PECOS-dominated results and filtering to individual physicians (removing ambulance services, labs, IDTFs), the top billing anomaly suspects:
| Rank | Provider | Specialty | Location | Key Metric | Why Flagged |
|---|---|---|---|---|---|
| 1 | Andrew Leavitt NPI 1952343113 |
Internal Medicine | San Francisco, CA | 10,333 svc/bene 2.5M services, 246 patients $4.76M paid |
Extreme β implausible |
| 2 | Elisabeth Balken NPI 1992489728 |
Nurse Practitioner | Mesa, AZ | 991 svc/bene 29,734 services, 30 patients $23.3M paid |
Extreme β NP billing $23M |
| 3 | William Price NPI 1366423865 |
Family Practice | Tulsa, OK | 2,357 svc/bene 120K services, 51 patients $29,870 paid |
Very high volume, very low payment β data anomaly? |
| 4 | Brandon Hardesty NPI 1952581027 |
Internal Medicine | Indianapolis, IN | 5,634 svc/bene 766K services, 136 patients LA opioid: HIGH |
Likely IU Health hematology |
| 5 | Frank Curvin NPI 1457397382 |
Family Practice | Johns Creek, GA | 191 svc/bene 19,932 services, 104 patients $11.8M paid |
High for family practice |
| 6 | Joyce Ravain NPI 1215980230 |
Emergency Medicine | Ormond Beach, FL | 92.9 svc/bene 4,181 services, 45 patients $835K paid |
EM docs don't have ongoing patients |
Built validate_suspects.py β a batch pipeline that takes the physician suspect list and for each provider:
verdict | fraud_confidence | fraud_indicators | legitimate_explanations | recommended_action| Verdict | Meaning | Action |
|---|---|---|
| LIKELY_FRAUD | Web evidence of legal action, conviction, or OIG press release | Refer to OIG, request claims audit |
| POSSIBLE_FRAUD | Statistical anomaly + suspicious patterns but no confirmed legal action | Enhanced monitoring, deeper audit |
| LEGITIMATE | High-volume specialty center, academic medicine, specialty drug | Remove from watchlist |
| INSUFFICIENT_DATA | No web presence, new provider, generic name | Manual review needed |
# On Hetzner server: export BRAVE_API_KEY=... # free tier at https://brave.com/search/api/ export ANTHROPIC_API_KEY=... # Run top 25 suspects: cd /home/dataops/fraud-detector python3 validate_suspects.py --limit 25 # Output: # validation/validation_report_2026-03-09.html (HTML report) # validation/validation_results.json (structured data)
| Priority | Task | Expected Impact |
|---|---|---|
| P0 | Set API keys on server, run validation pipeline on top 25 suspects | First real-world validated fraud findings |
| P0 | Investigate Andrew Leavitt (10,333 svc/bene), Elisabeth Balken ($23M on 30 patients) | May be our clearest fraud signals |
| P1 | Expand ground truth via DOJ press releases β OIG publishes ~100/month with provider names | Could grow labels from 181 β 1,000+ |
| P1 | Build frontend explorer (FastAPI + React) for interactive provider investigation | Makes the validation workflow usable by others |
| P2 | Add network analysis β referral rings where AβBβCβkickbackβA | New fraud pattern not currently detected |
| P2 | Write 3-part blog post series for personal website | Content marketing / credibility for venture |
| P3 | Train proper ML model (gradient boosting) using hand-crafted features as inputs | May push AUC past 0.85 with proper cross-validation |
| File | Purpose | Location |
|---|---|---|
eval.py | Fixed oracle β do not modify | Server + GitHub |
detector.py | Fraud scoring logic (V15 final) | Server + GitHub |
feature_importance.py | Subscore AUC + top suspect generation | Server + GitHub |
fuzzy_match.py | LEIE name matching against NPPES | Server + GitHub |
validate_suspects.py | LLM web validation pipeline | Server + GitHub |
physician_suspects.csv | Top 100 physician billing suspects | Server + GitHub |
results.tsv | Full AUC experiment log (18 runs) | Server + GitHub |
reports/2026-03-09_fraud-detection-comprehensive.html | Public-facing summary report | Mobile URL |
Session: March 8β9, 2026 (~6 hours) | GitHub: blakethom8/cms-fraud-detection | Server: Hetzner 5.78.148.70 | Data: All CMS data publicly available at data.cms.gov