🔬 Healthcare Fraud Detection — Autoresearch Design

Project: CMS Claims Fraud Detection using Karpathy's Autoresearch Pattern

Date: March 8, 2026

Status: Pre-Build Design Report โ€” Review Before Building

Author: Chief

What We're Building

This project adapts Andrej Karpathy's autoresearch pattern — an autonomous overnight experiment loop — to healthcare fraud detection. Instead of iterating on neural network training code, an AI agent iterates on fraud scoring logic, running automated evaluations against a ground-truth dataset of known fraudulent providers (the OIG LEIE exclusion list). You sleep, it runs 50–100 experiments, and you wake up to a better fraud detector.

Core Insight: The reason this works (and regular coding agents don't loop autonomously) is the oracle: AUC-ROC against LEIE-labeled providers is a hard, automated, human-judgment-free metric. The agent knows if it got better. No human in the loop required per iteration.

The Karpathy Pattern — Direct Comparison

| autoresearch (ML training) | Our version (fraud detection) |
| --- | --- |
| train.py — model architecture, optimizer, hyperparams (agent edits this) | detector.py — scoring logic, SQL features, thresholds, weights (agent edits this) |
| program.md — research instructions (human edits this) | strategy.md — fraud hypotheses to test, what patterns to try (human edits this) |
| prepare.py — fixed eval harness, data loading (never modified) | eval.py — loads LEIE, joins CMS data, measures AUC-ROC (never modified) |
| val_bpb — validation bits per byte (lower = better) | AUC-ROC — area under ROC curve (higher = better, max 1.0) |
| 5-minute fixed GPU training run | 30–60 second DuckDB eval against 90M rows |
| ~100 experiments overnight on one H100 | ~50–100 experiments overnight on Hetzner CPU server |
| Git commit if val_bpb improves | Git commit if AUC-ROC improves |
| LOOP FOREVER until interrupted | LOOP FOREVER until interrupted |

Reference: github.com/karpathy/autoresearch — cloned to ~/.openclaw/workspace/projects/autoresearch/


The Ground Truth Oracle — OIG LEIE

The List of Excluded Individuals/Entities (LEIE) is the OIG's public database of providers excluded from Medicare and Medicaid participation due to fraud, abuse, or program-related crimes. Updated monthly. ~70,000 current records.

Why LEIE Is the Perfect Oracle

Download URL: https://oig.hhs.gov/exclusions/downloadables/UPDATED.csv

Key LEIE fields: LASTNAME, FIRSTNAME, NPI, SPECIALTY, EXCLTYPE (exclusion category), EXCLDATE, REINDATE, WAIVERSTATE
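The eval harness's load_leie() could be a simple download-and-cache helper. A minimal sketch, assuming the URL above; the cache filename and dtype handling are my own choices, not from the project:

```python
import os
import urllib.request

import pandas as pd

LEIE_URL = 'https://oig.hhs.gov/exclusions/downloadables/UPDATED.csv'

def load_leie(cache_path: str = 'leie_cache.csv') -> pd.DataFrame:
    """Fetch the monthly LEIE file once, then reuse the local copy."""
    if not os.path.exists(cache_path):
        urllib.request.urlretrieve(LEIE_URL, cache_path)
    # NPI is a join key: read it as a string so leading zeros survive
    return pd.read_csv(cache_path, dtype={'NPI': str}, low_memory=False)
```

Since the file updates monthly, deleting the cache (or keying it by month) is enough to stay current.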

Join strategy: NPI match (primary) → name + specialty fallback (for pre-NPI records). Expected match rate to your CMS data: 30–60% of LEIE records will have corresponding CMS activity.
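That two-stage join might look like the sketch below. The CMS-side column names are placeholders, and the fallback matches on name only (the real version would also check specialty):

```python
import pandas as pd

def label_providers(providers: pd.DataFrame, leie: pd.DataFrame) -> pd.DataFrame:
    """Flag providers that appear in LEIE: NPI match first, name fallback second.

    Assumes providers has npi / last_name / first_name columns (hypothetical
    names) and leie has the documented NPI / LASTNAME / FIRSTNAME fields.
    """
    # Primary: exact NPI match (older LEIE rows often carry NPI '0000000000')
    npis = leie['NPI'].astype(str)
    real_npis = set(npis[npis.str.strip('0') != ''])
    by_npi = providers['npi'].astype(str).isin(real_npis)

    # Fallback: normalized last + first name, for pre-NPI exclusion records
    def norm(s: pd.Series) -> pd.Series:
        return s.fillna('').astype(str).str.upper().str.strip()

    leie_names = set(zip(norm(leie['LASTNAME']), norm(leie['FIRSTNAME'])))
    by_name = pd.Series(
        [pair in leie_names
         for pair in zip(norm(providers['last_name']), norm(providers['first_name']))],
        index=providers.index,
    )

    out = providers.copy()
    out['is_excluded'] = (by_npi | by_name).astype(int)
    return out
```

The name fallback will produce some false matches on common names, which is one reason to verify label coverage before trusting the oracle.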


Project Structure

fraud-detector/
├── eval.py        # FIXED — ground truth oracle, never modified
│                  #   loads LEIE, joins CMS DuckDB, measures AUC-ROC
├── detector.py    # AGENT EDITS THIS — fraud scoring logic
│                  #   SQL feature queries, weights, composite score formula
├── strategy.md    # HUMAN EDITS THIS — research hypotheses
│                  #   what patterns to try, domain guidance for the agent
├── results.tsv    # Auto-populated experiment log
│                  #   commit | auc_roc | n_flagged | status | description
└── reports/       # Generated HTML summaries (for website)
    └── YYYY-MM-DD_run-tag.html

The Eval Harness — eval.py

This is the fixed file the agent never touches. It defines what "better" means.

import pandas as pd
from sklearn.metrics import roc_auc_score

def evaluate(detector_output: pd.DataFrame) -> float:
    """
    Given a scored provider DataFrame with columns:
      - npi (str)
      - fraud_score (float, 0-1)
    Returns AUC-ROC against LEIE ground truth labels.
    """
    leie = load_leie()  # download or load from cache
    labeled = leie.rename(columns={'NPI': 'leie_npi'}).merge(
        detector_output, left_on='leie_npi', right_on='npi', how='right'
    )  # join on NPI; providers with no LEIE match get leie_npi = NaN
    labeled['is_excluded'] = labeled['leie_npi'].notna().astype(int)
    auc = roc_auc_score(labeled['is_excluded'], labeled['fraud_score'])
    return auc  # the oracle — higher is better, max 1.0

Baseline AUC (random scoring): ~0.500. A good fraud detection model: 0.700–0.850+. That's the improvement space the agent explores.
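As a sanity check on that baseline, AUC-ROC can be computed directly from ranks. The function below is a dependency-free stand-in for sklearn's roc_auc_score, shown to demonstrate why random scores land near 0.5 even when positives are as rare as LEIE matches:

```python
import numpy as np

def auc_roc(labels: np.ndarray, scores: np.ndarray) -> float:
    """AUC-ROC via the Mann-Whitney U statistic.

    Ignores tie correction, which is fine for continuous scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

rng = np.random.default_rng(0)
labels = (rng.random(50_000) < 0.005).astype(int)  # ~0.5% positives, LEIE-like rarity
print(auc_roc(labels, rng.random(50_000)))          # an uninformative detector scores ~0.50
```

A perfect detector, one that ranks every excluded provider above every clean one, scores exactly 1.0.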


What the Agent Iterates On — detector.py

The agent modifies the fraud scoring logic. Everything is fair game:

| Feature Category | What the Agent Explores | Data Source |
| --- | --- | --- |
| Billing outliers | Payment per beneficiary vs. specialty peers (z-scores, percentiles, thresholds) | Medicare Part B utilization |
| Opioid prescribing | Opioid rate, high-dose volume, CMS opioid flags, brand vs. generic rate | Part D prescriber data |
| Open Payments | Industry payment amounts, payment types (speaking fees, ownership), count of payers | Open Payments (Sunshine Act) |
| Geographic signals | Provider density by zip, patient travel distance, hot zone overlap | NPPES addresses |
| Temporal patterns | Year-over-year billing growth, service mix shifts, new specialty codes | Multi-year utilization data |
| Score composition | Feature weights, normalization method, composite formula, rank cutoffs | Any/all of the above |
Simplicity criterion (from Karpathy's program.md): All else being equal, simpler is better. A 0.005 AUC improvement from adding 20 lines of complexity? Don't keep it. A 0.005 improvement from a simple percentile tweak? Keep.
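To make the first table row concrete, a billing-outlier feature might z-score payment per beneficiary within specialty peer groups. The column names here are placeholders of mine, and in detector.py this would more likely be a DuckDB SQL query over the Part B table:

```python
import pandas as pd

def billing_zscore(df: pd.DataFrame) -> pd.Series:
    """Z-score of payment-per-beneficiary within each specialty peer group.

    Assumes columns total_payment, bene_count, specialty (hypothetical names).
    """
    ppb = df['total_payment'] / df['bene_count'].clip(lower=1)
    mu = ppb.groupby(df['specialty']).transform('mean')
    sd = ppb.groupby(df['specialty']).transform('std').fillna(1.0).clip(lower=1e-9)
    return (ppb - mu) / sd
```

A provider billing 10x their specialty's per-beneficiary norm gets a large positive z-score, which the composite formula can then weight against the other signals.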

The Experiment Loop

LOOP FOREVER:
1. Read strategy.md + current detector.py
2. Propose one change (feature tweak, new signal, weight adjustment)
3. git commit the change to detector.py
4. Run: python eval.py > run.log 2>&1
5. Read: grep "^auc_roc:" run.log
6. If AUC improved → keep the git commit (advance branch)
   If AUC equal or worse → git reset (discard)
7. Log to results.tsv: commit | auc_roc | n_flagged | keep/discard | description
8. NEVER STOP. NEVER ASK FOR PERMISSION. If stuck, think harder. Read strategy.md again.

Expected throughput: 30–60 sec per eval on Hetzner → ~60–120 experiments overnight.
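Steps 4 through 6 of the loop reduce to a small amount of glue code. A hypothetical sketch: the "auc_roc:" log-line format comes from the grep in the loop above, while the function names and strict-improvement rule are my own framing:

```python
import re
import subprocess

def parse_auc(log_text: str) -> float:
    """Pull the oracle metric out of run.log (expects a line like 'auc_roc: 0.713')."""
    m = re.search(r'^auc_roc:\s*([0-9.]+)', log_text, re.MULTILINE)
    if m is None:
        raise ValueError('no auc_roc line in log, treat the run as failed')
    return float(m.group(1))

def run_eval() -> float:
    # Steps 4 and 5: run the fixed harness, then read the metric from its output
    proc = subprocess.run(['python', 'eval.py'], capture_output=True, text=True)
    return parse_auc(proc.stdout + proc.stderr)

def keep_or_discard(new_auc: float, best_auc: float) -> str:
    # Step 6: only a strict improvement keeps the commit; ties get reset
    return 'keep' if new_auc > best_auc else 'discard'
```

Treating a missing metric line as a failed run matters: a crashed eval should count as a discard, not silently keep a broken detector.py.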


Our Data — What We Have in DuckDB

All data lives on Hetzner 5.78.148.70, DuckDB at /home/dataops/cms-data/data/provider_searcher.duckdb.

| Dataset | Records | Fraud Signals |
| --- | --- | --- |
| NPPES Provider Registry | 8M providers | Address anomalies, taxonomy mismatches |
| Medicare Part B Utilization | ~10M records | Billing outliers, volume anomalies, peer comparison |
| Medicare Part D Prescriber | ~10M records | Opioid flags, brand preference, pill mill patterns |
| Open Payments (General) | 14.7M transactions | Industry payments, ownership interests, kickback proxies |
| Open Payments (Research) | 1.1M transactions | Research funding conflicts |
| Doctors & Clinicians | 2.7M national | Practice patterns, group affiliations |
| Facility Affiliations | 1.6M records | Referral network construction, ownership ties |
| MIPS Performance | 541K records | Quality score outliers, low performers |
| OIG LEIE (to add) | ~70K records | Ground truth labels — known bad actors |

Total DB: ~5.5GB, 90M+ rows, 30 tables. Source: CMS public use files + Open Payments + NPPES.


Why No GPU Needed

Mac compatible. CPU only. No NVIDIA required. Unlike Karpathy's version (which trains neural networks on an H100), our fraud detection loop runs SQL queries and scoring logic against DuckDB โ€” pure CPU. The Hetzner server (32GB RAM, 8 vCPU) is more than enough. You could run the eval loop on a Mac for local testing.

Website Integration — healthcaredataai.com

This project has a double life: it's a real working tool and a marketing asset.

The Story We're Telling

Most healthcare analytics shops write a whitepaper about fraud detection. We actually ran it — 100 automated experiments overnight, iterating on our algorithm until the model stopped improving. Here's the git log. Here's the AUC curve. Here's what features moved the needle.

That's a story no consultant has told before. It positions you as a domain expert who understands both the clinical context and modern ML workflow patterns.

| Page | Content | Lead Hook |
| --- | --- | --- |
| /projects/fraud-analysis/ | Overview — problem, our approach, key findings | CTA: "We build custom FWA detection for health plans" |
| /projects/fraud-analysis/methodology | Deep dive — how the autoresearch loop works, features tried, what worked | Technical credibility, developer leads |
| /projects/fraud-analysis/findings | What the model actually found — top-scoring providers by category | Domain expertise demonstration |
| Whitepaper (gated) | Full report — methodology + code + findings; email required to download | Email capture → lead nurture sequence |

Next Steps — Build Sequence

| Step | Task | Status |
| --- | --- | --- |
| 1 | Restore SSH access to Hetzner CMS server (5.78.148.70) | Needs Blake |
| 2 | Download LEIE CSV, load into DuckDB, join to provider tables, verify label coverage | Chief — once SSH works |
| 3 | Build eval.py — fixed harness, AUC-ROC measurement, test it runs clean | Chief |
| 4 | Build baseline detector.py — simple billing outlier score, establish baseline AUC | Chief |
| 5 | Write strategy.md — initial hypotheses, what patterns to try first | Chief (with Blake input) |
| 6 | Kick off overnight run — Claude Code in autoresearch loop on Hetzner | Tonight |
| 7 | Morning review — read results.tsv, review what worked, update strategy.md | Blake + Chief |
| 8 | Build website pages from findings — methodology, case studies, interactive explorer | Week 2 |

Article / Blog Angle

The autoresearch connection is the hook for an article. Proposed title:

"We Ran Karpathy's Autoresearch on Healthcare Fraud Data — Here's What the Algorithm Found"

Structure:

Distribution: LinkedIn (Blake's profile + company page), healthcaredataai.com blog, possibly HackerNews or Towards Data Science.


Report generated by Chief · March 8, 2026 · Based on Karpathy autoresearch repo (cloned to ~/.openclaw/workspace/projects/autoresearch/) and existing fraud analysis project (~/.openclaw/workspace/projects/fraud-analysis/)