Inspired by Andrej Karpathy's autoresearch project, the core loop is simple:
- eval.py: Fixed oracle. Never edited. Downloads the OIG LEIE, measures AUC-ROC, logs to results.tsv.
- detector.py: The only file the agent edits. Reads an NPI list from stdin, writes `npi,score` to stdout.
- results.tsv: Experiment log. Tracks every iteration with timestamp, AUC, and description.

The key insight from Karpathy: you need an automated, objective oracle. In his case it was validation loss on a language model. In ours it is the OIG LEIE, the federal database of providers excluded from Medicare for fraud, abuse, or license violations. Each eval.py run downloads the live LEIE, matches it to CMS billing data, and returns AUC-ROC. The agent iterates on detector.py freely.
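The stdin/stdout contract above is the only interface eval.py depends on. A minimal detector.py skeleton might look like this (the scoring body here is a placeholder, not the actual V15 logic):

```python
#!/usr/bin/env python3
"""Minimal detector.py skeleton. Illustrative only: the real scoring
logic is what gets iterated on and logged in results.tsv. The part
eval.py relies on is the contract: NPIs in on stdin, `npi,score` out."""
import sys


def score(npi: str) -> float:
    # Placeholder. A real version would look up the provider's
    # specialty-normalized billing subscores and take max() over them.
    return 0.0


def main() -> None:
    for line in sys.stdin:
        npi = line.strip()
        if npi:
            print(f"{npi},{score(npi):.6f}")


if __name__ == "__main__":
    main()
```

Because the interface is just text streams, eval.py can run any candidate detector the same way: `cat npis.txt | python3 detector.py`.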
| Dataset | Rows | What It Contains | Fraud Signal |
|---|---|---|---|
| CMS Part B Physician Claims | 1.26M | Every Medicare physician's billing: services, beneficiaries, payments | Volume anomalies |
| CMS Part D Prescribing | 1.38M | Drug prescribing patterns by provider including opioid rates | Opioid patterns |
| Open Payments (Sunshine Act) | 14.7M | Every pharmaceutical/device payment to a physician | Industry entanglement |
| NPPES NPI Registry | 7.1M | Provider demographics, taxonomy codes, address | Identity resolution |
| PECOS Enrollment | 2.54M | Medicare enrollment records (up to 75 rows per NPI) | Enrollment gaps |
| OIG LEIE Exclusion List | 82,714 | All providers excluded from Medicare β ground truth labels | Label source |
All stored in a single 6GB DuckDB database on a Hetzner server (32GB RAM, 8 vCPU). Python venv at /home/dataops/cms-data/.venv/. DuckDB opened read-only to protect data integrity.
| Version | AUC | Key Change | Outcome |
|---|---|---|---|
| Baseline | 0.5561 | Raw billing z-scores, global normalization | Near random |
| V2 | 0.7695 | Z-scores within specialty peer group; LA opioid rate; PECOS gap × volume; speaking fees | +21 pts: massive jump |
| V3 | 0.7433 | Added HCPCS concentration HHI, upcoding ratio, taxonomy mismatch | −3 pts: added noise |
| V4 | 0.7325 | Percentile bucketing (90/95/99th pct) instead of z-scores | −4 pts: z-scores better |
| V7 | 0.7904 | Ensemble: 50% max(subscores) + 50% weighted mean, the breakthrough insight | +2 pts |
| V8–V11 | 0.7920–0.8013 | Progressive shift toward higher max weight (60% → 90%) | Confirming max dominates |
| V12 | 0.8098 | Pure max(subscores), no weighted mean at all | Best result |
| V13–V15 | 0.8098–0.8100 | Power transforms, top-2 mean: no improvement | Plateau reached |
Taking max(subscores) across all features dramatically outperforms weighted averaging. Fraudsters don't need to be suspicious on every metric; being in the 99th percentile on one metric is sufficient signal. This drove the biggest AUC jump (V7 → V12). The implication: the more diverse your subscore pool, the more chances you have to catch different fraud patterns.
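The whole V7 → V12 progression amounts to sweeping a single blend weight between the weighted mean and the max. A sketch, assuming subscores are already on a comparable scale (e.g. percentile ranks in [0, 1]):

```python
import numpy as np


def ensemble(scores: np.ndarray, max_weight: float = 1.0) -> np.ndarray:
    """Blend max(subscores) with mean(subscores) per provider.

    scores: (n_providers, n_subscores) array, columns pre-normalized.
    max_weight=0.5 corresponds to the V7 blend; 1.0 is the pure-max V12.
    """
    return (max_weight * scores.max(axis=1)
            + (1.0 - max_weight) * scores.mean(axis=1))
```

A provider who is extreme on exactly one subscore (e.g. 0.99 on one, near zero on the rest) scores high under pure max but gets averaged down under the mean, which is precisely why max wins here.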
The single biggest jump (baseline → V2, +21 pts) came from computing z-scores within specialty peer groups, not globally. An oncologist billing 50 services/patient is normal. A family practitioner billing 50 services/patient is extreme. Without specialty normalization, legitimate high-billing specialties flood the top of the suspect list.
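The peer-group normalization is a one-liner with a pandas groupby; a sketch, where the column names are assumptions rather than the actual schema:

```python
import pandas as pd


def peer_zscores(df: pd.DataFrame,
                 metric: str = "svc_per_bene",
                 group: str = "specialty") -> pd.Series:
    """Z-score `metric` within each `group` (specialty peer group),
    so a value is judged against same-specialty peers, not globally."""
    g = df.groupby(group)[metric]
    return (df[metric] - g.transform("mean")) / g.transform("std")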
V3 added HCPCS concentration, upcoding ratio, and taxonomy mismatch β and AUC dropped 3 points. Adding weak features dilutes strong ones unless they're in the max pool as standalone subscores. The final model uses just 6 subscores. Adding 3 more hurt. This is a core lesson in fraud detection: signal quality > quantity.
PECOS enrollment gap has a remarkable 0.78 standalone AUC β but this is largely circular: when a provider is excluded from Medicare, their PECOS enrollment is terminated. The feature is detecting the effect of exclusion (PECOS revoked), not the cause (fraud). It floods the top suspect list with legitimate unenrolled providers and excluded-but-since-reinstated providers. The billing signal (svc/bene, pay/bene) is the genuinely predictive dimension.
Blake flagged an important concern: with only 181 matched labels out of 82,714 LEIE entries, how reliable is our AUC?
| LEIE Segment | Count | Why Not Matched |
|---|---|---|
| Placeholder NPI (0000000000) | 74,241 | No NPI on file β pre-NPI era exclusions or organizations |
| Real NPI, matched to CMS | 182 | β Our ground truth labels |
| Real NPI, not in CMS | 8,291 | Excluded earlier β billing data scrubbed from CMS over time |
Honestly β it's tight. Key implications:
We attempted to expand ground truth from 181 β 720+ labels by fuzzy-matching the 74,241 no-NPI LEIE entries against NPPES using last_name + state + specialty.
| Match Confidence | Total Matched | In CMS | AUC Impact |
|---|---|---|---|
| HIGH (exact name + state + taxonomy + credentials) | 287 | 44 | 0.81β0.75 (AUC dropped) |
| MEDIUM (first initial match) | 1,709 | ~495 | |
| LOW (ambiguous, first pick) | 1,454 | ~513 | Not tested |
Investigation revealed: the fuzzy-matched providers were excluded in 1995β2010 β 20-30 years ago. Their current billing patterns are completely normal (they may have been reinstated, or the NPI was reassigned). Adding them as fraud labels teaches the model the wrong thing. The old LEIE entries without NPIs are not worth adding.
What would actually help: DOJ press releases (OIG publishes ~100/month with specific provider names), CMS Program Integrity dataset, state medical board disciplinary actions.
| Subscore | Standalone AUC | Notes |
|---|---|---|
| PECOS enrollment gap | 0.7778 | β οΈ Mostly post-exclusion leakage β floods list with wrong suspects |
| Drug cost/beneficiary | 0.5756 | Moderate β specialty-normalized drug spend |
| Services/beneficiary | 0.5678 | Moderate β the key billing anomaly signal |
| Payment/beneficiary | 0.5549 | Correlated with svc/bene |
| Open Payments (industry $) | 0.5379 | Weak individually |
| LA opioid rate | 0.4617 | Below random standalone β only adds value in the max pool |
| Combined max(subscores) | 0.8072 | The ensemble effect β each feature catches different fraudsters |
The ensemble effect is the whole game. No single signal is powerful, but they catch different fraudsters. That's exactly why max() works: it asks "is this provider extreme on anything?" rather than "is this provider above average on everything?"
After removing PECOS-dominated results and filtering to individual physicians (removing ambulance services, labs, IDTFs), the top billing anomaly suspects:
| Rank | Provider | Specialty | Location | Key Metric | Why Flagged |
|---|---|---|---|---|---|
| 1 | Andrew Leavitt NPI 1952343113 |
Internal Medicine | San Francisco, CA | 10,333 svc/bene 2.5M services, 246 patients $4.76M paid |
Extreme β implausible |
| 2 | Elisabeth Balken NPI 1992489728 |
Nurse Practitioner | Mesa, AZ | 991 svc/bene 29,734 services, 30 patients $23.3M paid |
Extreme β NP billing $23M |
| 3 | William Price NPI 1366423865 |
Family Practice | Tulsa, OK | 2,357 svc/bene 120K services, 51 patients $29,870 paid |
Very high volume, very low payment β data anomaly? |
| 4 | Brandon Hardesty NPI 1952581027 |
Internal Medicine | Indianapolis, IN | 5,634 svc/bene 766K services, 136 patients LA opioid: HIGH |
Likely IU Health hematology |
| 5 | Frank Curvin NPI 1457397382 |
Family Practice | Johns Creek, GA | 191 svc/bene 19,932 services, 104 patients $11.8M paid |
High for family practice |
| 6 | Joyce Ravain NPI 1215980230 |
Emergency Medicine | Ormond Beach, FL | 92.9 svc/bene 4,181 services, 45 patients $835K paid |
EM docs don't have ongoing patients |
Built validate_suspects.py β a batch pipeline that takes the physician suspect list and for each provider:
verdict | fraud_confidence | fraud_indicators | legitimate_explanations | recommended_action| Verdict | Meaning | Action |
|---|---|---|
| LIKELY_FRAUD | Web evidence of legal action, conviction, or OIG press release | Refer to OIG, request claims audit |
| POSSIBLE_FRAUD | Statistical anomaly + suspicious patterns but no confirmed legal action | Enhanced monitoring, deeper audit |
| LEGITIMATE | High-volume specialty center, academic medicine, specialty drug | Remove from watchlist |
| INSUFFICIENT_DATA | No web presence, new provider, generic name | Manual review needed |
# On Hetzner server: export BRAVE_API_KEY=... # free tier at https://brave.com/search/api/ export ANTHROPIC_API_KEY=... # Run top 25 suspects: cd /home/dataops/fraud-detector python3 validate_suspects.py --limit 25 # Output: # validation/validation_report_2026-03-09.html (HTML report) # validation/validation_results.json (structured data)
| Priority | Task | Expected Impact |
|---|---|---|
| P0 | Set API keys on server, run validation pipeline on top 25 suspects | First real-world validated fraud findings |
| P0 | Investigate Andrew Leavitt (10,333 svc/bene), Elisabeth Balken ($23M on 30 patients) | May be our clearest fraud signals |
| P1 | Expand ground truth via DOJ press releases β OIG publishes ~100/month with provider names | Could grow labels from 181 β 1,000+ |
| P1 | Build frontend explorer (FastAPI + React) for interactive provider investigation | Makes the validation workflow usable by others |
| P2 | Add network analysis β referral rings where AβBβCβkickbackβA | New fraud pattern not currently detected |
| P2 | Write 3-part blog post series for personal website | Content marketing / credibility for venture |
| P3 | Train proper ML model (gradient boosting) using hand-crafted features as inputs | May push AUC past 0.85 with proper cross-validation |
| File | Purpose | Location |
|---|---|---|
eval.py | Fixed oracle β do not modify | Server + GitHub |
detector.py | Fraud scoring logic (V15 final) | Server + GitHub |
feature_importance.py | Subscore AUC + top suspect generation | Server + GitHub |
fuzzy_match.py | LEIE name matching against NPPES | Server + GitHub |
validate_suspects.py | LLM web validation pipeline | Server + GitHub |
physician_suspects.csv | Top 100 physician billing suspects | Server + GitHub |
results.tsv | Full AUC experiment log (18 runs) | Server + GitHub |
reports/2026-03-09_fraud-detection-comprehensive.html | Public-facing summary report | Mobile URL |
Session: March 8β9, 2026 (~6 hours) | GitHub: blakethom8/cms-fraud-detection | Server: Hetzner 5.78.148.70 | Data: All CMS data publicly available at data.cms.gov