🔬 Healthcare Fraud Detection — Autoresearch Design

Project: CMS Claims Fraud Detection using Karpathy's Autoresearch Pattern

Date: March 8, 2026

Status: Pre-Build Design Report โ€” Review Before Building

Author: Chief

What We're Building

This project adapts Andrej Karpathy's autoresearch pattern — an autonomous overnight experiment loop — to healthcare fraud detection. Instead of iterating on neural network training code, an AI agent iterates on fraud scoring logic, running automated evaluations against a ground-truth dataset of known fraudulent providers (the OIG LEIE exclusion list). You sleep, it runs 50–100 experiments, and you wake up to a better fraud detector.

Core Insight: The reason this works (and regular coding agents don't loop autonomously) is the oracle: AUC-ROC against LEIE-labeled providers is a hard, automated, human-judgment-free metric. The agent knows if it got better. No human in the loop required per iteration.

The Karpathy Pattern — Direct Comparison

| autoresearch (ML training) | Our version (fraud detection) |
| --- | --- |
| train.py — model architecture, optimizer, hyperparams (agent edits this) | detector.py — scoring logic, SQL features, thresholds, weights (agent edits this) |
| program.md — research instructions (human edits this) | strategy.md — fraud hypotheses to test, what patterns to try (human edits this) |
| prepare.py — fixed eval harness, data loading (never modified) | eval.py — loads LEIE, joins CMS data, measures AUC-ROC (never modified) |
| val_bpb — validation bits per byte (lower = better) | AUC-ROC — area under ROC curve (higher = better, max 1.0) |
| 5-minute fixed GPU training run | 30–60 second DuckDB eval against 90M rows |
| ~100 experiments overnight on one H100 | ~50–100 experiments overnight on Hetzner CPU server |
| Git commit if val_bpb improves | Git commit if AUC-ROC improves |
| LOOP FOREVER until interrupted | LOOP FOREVER until interrupted |

Reference: github.com/karpathy/autoresearch — cloned to ~/.openclaw/workspace/projects/autoresearch/


The Ground Truth Oracle — OIG LEIE

The List of Excluded Individuals/Entities (LEIE) is the OIG's public database of providers excluded from Medicare and Medicaid participation due to fraud, abuse, or program-related crimes. Updated monthly. ~70,000 current records.

Why LEIE Is the Perfect Oracle

Download URL: https://oig.hhs.gov/exclusions/downloadables/UPDATED.csv

Key LEIE fields: LASTNAME, FIRSTNAME, NPI, SPECIALTY, EXCLTYPE (exclusion category), EXCLDATE, REINDATE, WAIVERSTATE
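The eval harness's load_leie() could be a simple download-and-cache helper. A minimal sketch, assuming the URL above; the cache filename and dtype handling are my own choices, not from the project:

```python
import os
import urllib.request

import pandas as pd

LEIE_URL = 'https://oig.hhs.gov/exclusions/downloadables/UPDATED.csv'

def load_leie(cache_path: str = 'leie_cache.csv') -> pd.DataFrame:
    """Fetch the monthly LEIE file once, then reuse the local copy."""
    if not os.path.exists(cache_path):
        urllib.request.urlretrieve(LEIE_URL, cache_path)
    # NPI is a join key: read it as a string so leading zeros survive
    return pd.read_csv(cache_path, dtype={'NPI': str}, low_memory=False)
```

Since the file updates monthly, deleting the cache (or keying it by month) is enough to stay current.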

Join strategy: NPI match (primary) → name + specialty fallback (for pre-NPI records). Expected match rate to your CMS data: 30–60% of LEIE records will have corresponding CMS activity.
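That two-stage join might look like the sketch below. The CMS-side column names are placeholders, and the fallback matches on name only (the real version would also check specialty):

```python
import pandas as pd

def label_providers(providers: pd.DataFrame, leie: pd.DataFrame) -> pd.DataFrame:
    """Flag providers that appear in LEIE: NPI match first, name fallback second.

    Assumes providers has npi / last_name / first_name columns (hypothetical
    names) and leie has the documented NPI / LASTNAME / FIRSTNAME fields.
    """
    # Primary: exact NPI match (older LEIE rows often carry NPI '0000000000')
    npis = leie['NPI'].astype(str)
    real_npis = set(npis[npis.str.strip('0') != ''])
    by_npi = providers['npi'].astype(str).isin(real_npis)

    # Fallback: normalized last + first name, for pre-NPI exclusion records
    def norm(s: pd.Series) -> pd.Series:
        return s.fillna('').astype(str).str.upper().str.strip()

    leie_names = set(zip(norm(leie['LASTNAME']), norm(leie['FIRSTNAME'])))
    by_name = pd.Series(
        [pair in leie_names
         for pair in zip(norm(providers['last_name']), norm(providers['first_name']))],
        index=providers.index,
    )

    out = providers.copy()
    out['is_excluded'] = (by_npi | by_name).astype(int)
    return out
```

The name fallback will produce some false matches on common names, which is one reason to verify label coverage before trusting the oracle.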


Project Structure

fraud-detector/
├── eval.py        # FIXED — ground truth oracle, never modified
│                  #   loads LEIE, joins CMS DuckDB, measures AUC-ROC
├── detector.py    # AGENT EDITS THIS — fraud scoring logic
│                  #   SQL feature queries, weights, composite score formula
├── strategy.md    # HUMAN EDITS THIS — research hypotheses
│                  #   what patterns to try, domain guidance for the agent
├── results.tsv    # Auto-populated experiment log
│                  #   commit | auc_roc | n_flagged | status | description
└── reports/       # Generated HTML summaries (for website)
    └── YYYY-MM-DD_run-tag.html

The Eval Harness — eval.py

This is the fixed file the agent never touches. It defines what "better" means.

import pandas as pd
from sklearn.metrics import roc_auc_score

def evaluate(detector_output: pd.DataFrame) -> float:
    """
    Given a scored provider DataFrame with columns:
      - npi (str)
      - fraud_score (float, 0-1)
    Returns AUC-ROC against LEIE ground truth labels.
    """
    leie = load_leie()  # download or load from cache
    labeled = leie.rename(columns={'NPI': 'leie_npi'}).merge(
        detector_output, left_on='leie_npi', right_on='npi', how='right'
    )  # join on NPI; providers with no LEIE match get leie_npi = NaN
    labeled['is_excluded'] = labeled['leie_npi'].notna().astype(int)
    auc = roc_auc_score(labeled['is_excluded'], labeled['fraud_score'])
    return auc  # the oracle — higher is better, max 1.0

Baseline AUC (random scoring): ~0.500. A good fraud detection model: 0.700–0.850+. That's the improvement space the agent explores.
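As a sanity check on that baseline, AUC-ROC can be computed directly from ranks. The function below is a dependency-free stand-in for sklearn's roc_auc_score, shown to demonstrate why random scores land near 0.5 even when positives are as rare as LEIE matches:

```python
import numpy as np

def auc_roc(labels: np.ndarray, scores: np.ndarray) -> float:
    """AUC-ROC via the Mann-Whitney U statistic.

    Ignores tie correction, which is fine for continuous scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

rng = np.random.default_rng(0)
labels = (rng.random(50_000) < 0.005).astype(int)  # ~0.5% positives, LEIE-like rarity
print(auc_roc(labels, rng.random(50_000)))          # an uninformative detector scores ~0.50
```

A perfect detector, one that ranks every excluded provider above every clean one, scores exactly 1.0.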


What the Agent Iterates On — detector.py

The agent modifies the fraud scoring logic. Everything is fair game:

| Feature Category | What the Agent Explores | Data Source |
| --- | --- | --- |
| Billing outliers | Payment per beneficiary vs. specialty peers (z-scores, percentiles, thresholds) | Medicare Part B utilization |
| Opioid prescribing | Opioid rate, high-dose volume, CMS opioid flags, brand vs. generic rate | Part D prescriber data |
| Open Payments | Industry payment amounts, payment types (speaking fees, ownership), count of payers | Open Payments (Sunshine Act) |
| Geographic signals | Provider density by zip, patient travel distance, hot zone overlap | NPPES addresses |
| Temporal patterns | Year-over-year billing growth, service mix shifts, new specialty codes | Multi-year utilization data |
| Score composition | Feature weights, normalization method, composite formula, rank cutoffs | Any/all of the above |
Simplicity criterion (from Karpathy's program.md): All else being equal, simpler is better. A 0.005 AUC improvement from adding 20 lines of complexity? Don't keep it. A 0.005 improvement from a simple percentile tweak? Keep.
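To make the first table row concrete, a billing-outlier feature might z-score payment per beneficiary within specialty peer groups. The column names here are placeholders of mine, and in detector.py this would more likely be a DuckDB SQL query over the Part B table:

```python
import pandas as pd

def billing_zscore(df: pd.DataFrame) -> pd.Series:
    """Z-score of payment-per-beneficiary within each specialty peer group.

    Assumes columns total_payment, bene_count, specialty (hypothetical names).
    """
    ppb = df['total_payment'] / df['bene_count'].clip(lower=1)
    mu = ppb.groupby(df['specialty']).transform('mean')
    sd = ppb.groupby(df['specialty']).transform('std').fillna(1.0).clip(lower=1e-9)
    return (ppb - mu) / sd
```

A provider billing 10x their specialty's per-beneficiary norm gets a large positive z-score, which the composite formula can then weight against the other signals.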

The Experiment Loop

LOOP FOREVER:
1. Read strategy.md + current detector.py
2. Propose one change (feature tweak, new signal, weight adjustment)
3. git commit the change to detector.py
4. Run: python eval.py > run.log 2>&1
5. Read: grep "^auc_roc:" run.log
6. If AUC improved → keep the git commit (advance branch)
   If AUC equal or worse → git reset (discard)
7. Log to results.tsv: commit | auc_roc | n_flagged | keep/discard | description
8. NEVER STOP. NEVER ASK FOR PERMISSION. If stuck, think harder. Read strategy.md again.

Expected throughput: 30–60 sec per eval on Hetzner → ~60–120 experiments overnight.
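Steps 4 through 6 of the loop reduce to a small amount of glue code. A hypothetical sketch: the "auc_roc:" log-line format comes from the grep in the loop above, while the function names and strict-improvement rule are my own framing:

```python
import re
import subprocess

def parse_auc(log_text: str) -> float:
    """Pull the oracle metric out of run.log (expects a line like 'auc_roc: 0.713')."""
    m = re.search(r'^auc_roc:\s*([0-9.]+)', log_text, re.MULTILINE)
    if m is None:
        raise ValueError('no auc_roc line in log, treat the run as failed')
    return float(m.group(1))

def run_eval() -> float:
    # Steps 4 and 5: run the fixed harness, then read the metric from its output
    proc = subprocess.run(['python', 'eval.py'], capture_output=True, text=True)
    return parse_auc(proc.stdout + proc.stderr)

def keep_or_discard(new_auc: float, best_auc: float) -> str:
    # Step 6: only a strict improvement keeps the commit; ties get reset
    return 'keep' if new_auc > best_auc else 'discard'
```

Treating a missing metric line as a failed run matters: a crashed eval should count as a discard, not silently keep a broken detector.py.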


Our Data — What We Have in DuckDB

All data lives on Hetzner 5.78.148.70, DuckDB at /home/dataops/cms-data/data/provider_searcher.duckdb.

| Dataset | Records | Fraud Signals |
| --- | --- | --- |
| NPPES Provider Registry | 8M providers | Address anomalies, taxonomy mismatches |
| Medicare Part B Utilization | ~10M records | Billing outliers, volume anomalies, peer comparison |
| Medicare Part D Prescriber | ~10M records | Opioid flags, brand preference, pill mill patterns |
| Open Payments (General) | 14.7M transactions | Industry payments, ownership interests, kickback proxies |
| Open Payments (Research) | 1.1M transactions | Research funding conflicts |
| Doctors & Clinicians | 2.7M national | Practice patterns, group affiliations |
| Facility Affiliations | 1.6M records | Referral network construction, ownership ties |
| MIPS Performance | 541K records | Quality score outliers, low performers |
| OIG LEIE (to add) | ~70K records | Ground truth labels — known bad actors |

Total DB: ~5.5GB, 90M+ rows, 30 tables. Source: CMS public use files + Open Payments + NPPES.


Why No GPU Needed

Mac compatible. CPU only. No NVIDIA required. Unlike Karpathy's version (which trains neural networks on an H100), our fraud detection loop runs SQL queries and scoring logic against DuckDB โ€” pure CPU. The Hetzner server (32GB RAM, 8 vCPU) is more than enough. You could run the eval loop on a Mac for local testing.

Website Integration — healthcaredataai.com

This project has a double life: it's a real working tool and a marketing asset.

The Story We're Telling

Most healthcare analytics shops write a whitepaper about fraud detection. We actually ran it — 100 automated experiments overnight, iterating on our algorithm until the model stopped improving. Here's the git log. Here's the AUC curve. Here's what features moved the needle.

That's a story no consultant has told before. It positions you as a domain expert who understands both the clinical context and modern ML workflow patterns.

| Page | Content | Lead Hook |
| --- | --- | --- |
| /projects/fraud-analysis/ | Overview — problem, our approach, key findings | CTA: "We build custom FWA detection for health plans" |
| /projects/fraud-analysis/methodology | Deep dive — how the autoresearch loop works, features tried, what worked | Technical credibility, developer leads |
| /projects/fraud-analysis/findings | What the model actually found — top-scoring providers by category | Domain expertise demonstration |
| Whitepaper (gated) | Full report — methodology + code + findings; email required to download | Email capture → lead nurture sequence |

Next Steps — Build Sequence

| Step | Task | Status |
| --- | --- | --- |
| 1 | Restore SSH access to Hetzner CMS server (5.78.148.70) | Needs Blake |
| 2 | Download LEIE CSV, load into DuckDB, join to provider tables, verify label coverage | Chief — once SSH works |
| 3 | Build eval.py — fixed harness, AUC-ROC measurement, test it runs clean | Chief |
| 4 | Build baseline detector.py — simple billing outlier score, establish baseline AUC | Chief |
| 5 | Write strategy.md — initial hypotheses, what patterns to try first | Chief (with Blake input) |
| 6 | Kick off overnight run — Claude Code in autoresearch loop on Hetzner | Tonight |
| 7 | Morning review — read results.tsv, review what worked, update strategy.md | Blake + Chief |
| 8 | Build website pages from findings — methodology, case studies, interactive explorer | Week 2 |

Article / Blog Angle

The autoresearch connection is the hook for an article. Proposed title:

"We Ran Karpathy's Autoresearch on Healthcare Fraud Data — Here's What the Algorithm Found"

Structure:

Distribution: LinkedIn (Blake's profile + company page), healthcaredataai.com blog, possibly HackerNews or Towards Data Science.


Report generated by Chief · March 8, 2026 · Based on Karpathy autoresearch repo (cloned to ~/.openclaw/workspace/projects/autoresearch/) and existing fraud analysis project (~/.openclaw/workspace/projects/fraud-analysis/)