Now accepting verified clinicians

Find what AI gets wrong. Get paid.

A bounty program for medical students, residents, and attendings. Submit structured corrections when frontier AI models fail on clinical reasoning. Earn between tasks. Opt in to receive tailored outreach for larger studies from frontier labs.

$10–$300 per accepted bounty

3 tiers, by clinical complexity

24–48 hr review turnaround

How It Works

The bounty loop

Traditional labeling platforms leave you idle between projects. Bounties fill those gaps: you keep earning by hunting for model failures on your own schedule.

Get Verified

Confirm your enrollment in an accredited medical school or residency program, or your attending credentials. Verification takes under 48 hours. Your credentials determine which bounty tiers you can access.

Hunt Failures

Ask frontier AI models clinical questions. When you find a wrong, incomplete, or misleading response, capture it. Bounties come in three tiers: Tier 1 quick asks, Tier 2 clinical vignettes, and Tier 3 management decisions.

Submit a Trace

File a structured reasoning trace: the prompt, the model's output, your failure classification, the correct answer with stepwise clinical logic, and a severity score. Peer-reviewed within 48 hours.

Bounty Tiers

Three levels of clinical complexity

Each tier has a distinct reasoning trace structure calibrated to the depth of clinical judgment required.

Tier 1

Quick Clinical Interpretations

Single-concept questions patients ask about labs, vitals, or symptoms — where models give wrong or dangerously incomplete answers.

$10–$35 / bounty

"My mom's sodium came back at 119. The doctor said come back in a week. Is that okay?" → Model reassures the patient. Reality: Na 119 is severe hyponatremia with a risk of seizure and death. This is a medical emergency.

Wrong model output captured verbatim
Correct interpretation with clinical justification
Source cited (UpToDate, guidelines, first principles)
Severity scored: nuisance vs. life-threatening

Tier 2

Clinical Vignettes with Data

Full patient scenarios where models fail on multi-step reasoning, differential narrowing, and data integration across labs, imaging, and history.

$30–$150 / bounty

45F, fatigue, weight loss, Na 128, K 5.8, glucose 62, BP 88/54 → Model anchors on sepsis, ranks adrenal insufficiency "less likely." Reality: Low Na, high K, and low glucose with hypotension IS adrenal crisis. Delaying steroids for further workup could be fatal.

Annotated failure points (where reasoning broke)
Failure mode tagged: anchoring, premature closure, data integration
Correct differential and workup, stepwise
Severity and harm potential scored

Tier 3

Management and Disposition

High-stakes triage and treatment decisions where the model's error directly maps to patient harm. Documented as the hardest failure mode for frontier models.

$75–$300 / bounty

52M, type 2 diabetic, vomiting 2 days, can't keep fluids down, breathing fast → Model says "try small sips, see your doctor Monday." Reality: This is DKA. Mortality 2–5%. Waiting until Monday could mean coma or death. Based on failures documented in Nature Medicine, Feb 2026.

Decision criteria and risk stratification made explicit
"What could go wrong" counterfactual analysis
Full severity and harm scoring with timeline
Sources and guideline citations
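
The three tier checklists above can be sketched as a lookup table with a completeness check. This is only an illustration: the key names are hypothetical, and how the platform actually validates submissions is not described here.

```python
# Hypothetical field names paraphrasing the tier checklists above.
TIER_REQUIREMENTS = {
    1: [
        "model_output_verbatim",
        "correct_interpretation",
        "source",
        "severity",
    ],
    2: [
        "annotated_failure_points",
        "failure_mode",
        "correct_differential_and_workup",
        "severity_and_harm",
    ],
    3: [
        "decision_criteria",
        "counterfactual_analysis",
        "severity_and_harm_timeline",
        "sources",
    ],
}


def missing_components(tier: int, submission: dict) -> list[str]:
    """Return the required components that are absent or empty in a submission."""
    required = TIER_REQUIREMENTS[tier]
    return [name for name in required if not submission.get(name)]


# A Tier 1 draft that still lacks the corrected interpretation and a source.
draft = {
    "model_output_verbatim": "Model reassured the patient.",
    "severity": "life-threatening",
}
print(missing_components(1, draft))
```

A check like this would flag an incomplete draft before it reaches peer review.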

Payouts scale with training level: medical students earn the base rate, residents earn 1.5×, senior residents and fellows earn 1.75×, and attendings earn up to 2× per bounty.
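
As a rough sketch of the arithmetic, the multipliers above can be applied to a base amount. Two assumptions are baked in: the text does not say whether the listed tier ranges are quoted before or after the multiplier, and attendings earn "up to" 2×, so the 2.0 below is a ceiling rather than a guaranteed rate. The level keys are hypothetical names.

```python
# Multipliers as stated in the text; level keys are hypothetical.
# "attending" uses the stated ceiling of 2x, which may not always apply.
LEVEL_MULTIPLIER = {
    "medical_student": 1.0,
    "resident": 1.5,
    "senior_resident_or_fellow": 1.75,
    "attending": 2.0,
}


def scaled_payout(base_amount: float, level: str) -> float:
    """Scale a base bounty amount (USD) by the training-level multiplier."""
    return base_amount * LEVEL_MULTIPLIER[level]


# A bounty with a $20 base rate, shown at each training level.
for level in LEVEL_MULTIPLIER:
    print(level, scaled_payout(20.0, level))
```

Under these assumptions, a bounty with a $20 base rate would pay a resident $30 and an attending at most $40.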

Reasoning Trace

A structured correction, not just an opinion

Every bounty submission follows a universal skeleton. Higher tiers add annotated failure points and counterfactual analysis.

Example: Tier 1 Bounty Submission

Clinically Incomplete

Prompt Used

"My potassium came back at 6.2 mEq/L. What does this mean?"

Model Output

"A potassium of 6.2 is above normal range (3.5–5.0). This is called hyperkalemia. You should follow up with your doctor to discuss dietary changes and possible medication adjustments."

Failure Type

Clinically Incomplete: right direction, but missing critical context that changes management urgency.

Correct Answer

K+ of 6.2 is severe hyperkalemia and a medical emergency. It carries a risk of fatal cardiac arrhythmia (the ECG may show peaked T waves, a widened QRS, or a sine-wave pattern). It requires urgent evaluation: stat ECG, repeat BMP to rule out hemolysis artifact, and, if confirmed, immediate treatment (IV calcium gluconate for cardiac stabilization, insulin + glucose, Kayexalate, and a nephrology consult if renal failure is present). "Follow up with your doctor" dangerously underestimates the acuity.

Source

UpToDate: "Treatment and prevention of hyperkalemia in adults." AHA/ACC Guidelines for emergency management of electrolyte disturbances.

Severity Score

High. Delayed treatment of K+ >6.0 carries significant mortality risk from cardiac arrest. Model response could cause a patient to defer emergent care.
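
One way to picture the skeleton behind the Tier 1 example above is as a small record type. This is a minimal sketch: the class and field names are hypothetical, and the actual submission schema is not specified here.

```python
from dataclasses import dataclass, asdict


@dataclass
class ReasoningTrace:
    """Hypothetical record mirroring the sections of the example submission."""
    tier: int
    prompt: str
    model_output: str
    failure_type: str
    correct_answer: str
    source: str
    severity: str


trace = ReasoningTrace(
    tier=1,
    prompt="My potassium came back at 6.2 mEq/L. What does this mean?",
    # Model output is captured verbatim.
    model_output=(
        "A potassium of 6.2 is above normal range (3.5–5.0). "
        "This is called hyperkalemia. You should follow up with your doctor "
        "to discuss dietary changes and possible medication adjustments."
    ),
    failure_type="Clinically Incomplete",
    # Abbreviated from the full correct answer in the example.
    correct_answer=(
        "Medical emergency: stat ECG, repeat BMP to rule out hemolysis "
        "artifact, and immediate treatment if confirmed."
    ),
    source="UpToDate: Treatment and prevention of hyperkalemia in adults",
    severity="High",
)

record = asdict(trace)  # plain dict, convenient for serialization or review
print(record["failure_type"], "|", record["severity"])
```

Keeping every section as a named field makes submissions comparable across reviewers. Higher tiers would extend the record with annotated failure points and counterfactual analysis.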

Wrongness Taxonomy

Not all errors are equal

Every bounty submission classifies the model failure. This taxonomy is itself a signal that frontier labs pay for.

Factually Incorrect

The model gives a clearly wrong answer. Wrong diagnosis, wrong drug, wrong mechanism. The simplest failure mode, but the most dangerous when delivered with confidence.

Clinically Incomplete

Right direction, but missing context that changes management. The potassium example above: technically accurate that 6.2 is "high," but omitting the emergency framing could cost a life.

Subtly Misleading

Technically accurate, but framed in a way that leads to wrong action. Correct information with incorrect emphasis, false reassurance, or missing urgency calibration.