An AI Just Scored 100% on the USMLE. Here's What That Actually Means for Med Students.
If you've been anywhere near medical Twitter or r/medicalschool in the last few months, you've probably seen the headline: an AI got a perfect score on the USMLE. All three Steps. Every question correct.
The natural reaction ranges from "cool, but whatever" to "should I even bother studying anymore?" Neither response quite captures what's going on here. The truth is more interesting, more nuanced, and ultimately more useful for your own prep than the clickbait version suggests.
Let's walk through what actually happened and why it matters — and doesn't matter — for the person reading this with First Aid open on their desk.
What Actually Happened
In August 2025, a company called OpenEvidence announced that its AI had achieved a perfect 100% accuracy on the USMLE, becoming the first AI system in history to do so. The score covered all three Steps: Step 1, Step 2 CK, and Step 3.
OpenEvidence is a Miami-based AI company founded by Daniel Nadler, a Harvard PhD who previously built Kensho Technologies (an AI analytics firm acquired by S&P Global for roughly $550 million in 2018). The company's main product is a clinical decision support platform used by physicians — think of it as a medical search engine where doctors ask clinical questions and get answers grounded in peer-reviewed evidence from sources like the New England Journal of Medicine, JAMA, Cochrane, and NCCN guidelines.
The company has moved fast. As of late 2025, over 430,000 U.S. physicians — roughly 40% of all practicing doctors in the country — had registered on the platform, which was handling about 18 million clinical consultations per month. Nadler was named to TIME's 100 Most Influential People in Global Health list in 2025, and the company's valuation has climbed from $1 billion in early 2025 to $12 billion by January 2026, with investors including Sequoia Capital, Google Ventures, Nvidia, Kleiner Perkins, and the Mayo Clinic.
So this is not a garage project. It's a well-funded, widely adopted platform with serious institutional backing.
The Fine Print You Should Actually Read
Here is where most coverage of the story stops. Here is where it gets important.
The 100% score was achieved on the Kung et al. dataset — a standardized benchmark based on the official USMLE sample exam questions available on usmle.org. The dataset includes 94 Step 1 questions, 109 Step 2 CK questions, and 122 Step 3 questions, totaling 325 items.
Three things to know about the methodology:
1. Image-based questions were excluded. The AI was tested only on text-based questions. USMLE exams include histology slides, ECG tracings, imaging studies, and dermatology photos; none of those were part of this benchmark.
2. Recording errors in the original dataset were corrected. The team identified and fixed mistakes in the Kung et al. dataset before running the evaluation. This is reasonable methodology, but it means the exact question set differs slightly from the one earlier AIs were tested on.
3. This was not the actual USMLE. It was a performance evaluation on a publicly available set of sample questions. The real USMLE is a proctored, multi-hour exam administered by the NBME at Prometric testing centers. Nobody is claiming the AI sat for the actual test.
None of this makes the achievement less impressive. Getting 325 out of 325 on USMLE-level questions is genuinely remarkable. But "AI scores 100% on a curated subset of sample questions with images removed" lands differently than "AI aces the USMLE," even though both describe the same event.
How We Got Here: The AI Performance Timeline
The USMLE has quietly become the benchmark that AI researchers use to measure medical reasoning ability. Here's how the progression looked:
| Year | AI System | Approximate USMLE Accuracy |
|---|---|---|
| Late 2022 | ChatGPT (GPT-3.5) | ~60% (near passing threshold) |
| Late 2022 | Flan-PaLM (Google) | ~68% |
| Early 2023 | GPT-4 (OpenAI) | ~86% |
| Early 2023 | Med-PaLM 2 (Google) | ~86.5% |
| July 2023 | OpenEvidence | >90% (first AI above 90%) |
| 2024 | GPT-4o (OpenAI) | ~90% |
| April 2025 | SCAI (Univ. at Buffalo) | 95.2% on Step 3 |
| August 2025 | OpenEvidence | 100% |
The jump from "can barely pass" to "perfect score" happened in roughly three years. That's fast, even by AI standards.
When OpenEvidence first crossed the 90% mark in July 2023, the company reported making 77% fewer errors than ChatGPT, 24% fewer than GPT-4, and 31% fewer than Google's Med-PaLM 2. By August 2025, the errors were gone entirely.
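If you're wondering how those "fewer errors" figures relate to the accuracies in the table, the conversion is simple arithmetic: compare error rates (one minus accuracy), not accuracies. Here's a minimal sketch, assuming ~91% as a stand-in for "above 90%" (OpenEvidence hasn't published its exact July 2023 accuracy):

```python
# Converting accuracy figures into "X% fewer errors" claims. Compare error
# rates (1 - accuracy), not accuracies. The 0.91 below is an assumed stand-in
# for "above 90%", not a number OpenEvidence has published.

def relative_error_reduction(acc_new: float, acc_old: float) -> float:
    """Fraction by which the error rate drops when accuracy rises from acc_old to acc_new."""
    return 1.0 - (1.0 - acc_new) / (1.0 - acc_old)

# ChatGPT at ~60% accuracy leaves a 40% error rate; ~91% accuracy leaves ~9%:
print(f"{relative_error_reduction(0.91, 0.60):.1%}")  # 77.5% -- close to the reported 77%
```

The same arithmetic applies to the GPT-4 and Med-PaLM 2 comparisons, though the rounded accuracies in the table won't reproduce the company's reported 24% and 31% figures exactly.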
How OpenEvidence's AI Actually Works
OpenEvidence has not published a detailed peer-reviewed paper on the architecture behind the 100% score, so we're working from public statements and press releases. Here's what's known:
The system uses what Nadler describes as "second and third derivative reasoning" — not just recalling a fact, but figuring out what a set of facts implies, and then reasoning through those implications. In his words: "not just taking the facts that come in before you, but taking the factors in before you, figuring out what those imply, and then reasoning through the implications."
The AI has access to a massive curated evidence base through multi-year content partnerships with NEJM Group (all published content from 1990 onward), JAMA Network, NCCN, Wiley, and Cochrane. So unlike a general-purpose chatbot trained on the open internet, OpenEvidence reasons over decades of peer-reviewed medical literature specifically.
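OpenEvidence hasn't published the architecture behind this, but the description matches a familiar pattern: retrieve relevant passages from a curated corpus, then reason over them and cite them. Here's a deliberately toy sketch of that pattern in Python; the corpus entries, the overlap-based retriever, and all the names are illustrative inventions, not anything from OpenEvidence:

```python
# A toy retrieval-grounded QA loop. This is NOT OpenEvidence's architecture
# (which is unpublished); it only illustrates the general pattern: retrieve
# from a curated corpus, then answer with citations to that evidence.

from dataclasses import dataclass

@dataclass
class Passage:
    source: str  # e.g. journal and year, so every answer can cite its evidence
    text: str

CORPUS = [
    Passage("NEJM (illustrative)", "ACE inhibitors reduce mortality in HFrEF patients"),
    Passage("Cochrane (illustrative)", "beta blockers improve outcomes in chronic heart failure"),
]

def retrieve(question: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Toy retriever: rank passages by word overlap with the question.
    A production system would use dense embeddings and a vector index."""
    q_words = set(question.lower().split())
    return sorted(corpus, key=lambda p: -len(q_words & set(p.text.lower().split())))[:k]

def answer(question: str) -> str:
    evidence = retrieve(question, CORPUS)
    # In a real system, a language model would reason over the retrieved text
    # here; the key property is that it sees only curated evidence.
    return "Grounded in: " + "; ".join(p.source for p in evidence)

print(answer("do ace inhibitors reduce mortality in heart failure?"))
```

Even in the toy version, the two design choices that matter are visible: answers draw only on the curated corpus, and every answer carries its sources. That's what separates this pattern from a general chatbot free-associating over its training data.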
Alongside the 100% score, the company released a free explanation model that teaches the reasoning behind each correct answer and can generate clinical vignettes tailored to the learner's training level. The stated goal is to democratize access to quality medical education resources.
The Part That Should Temper the Hype
Here's the fact that didn't make it into most headlines.
In November 2025, a pilot study posted on medRxiv tested OpenEvidence on the MedXpertQA dataset — a set of complex medical subspecialty scenarios that go well beyond the scope and format of USMLE questions. These are the kinds of messy, ambiguous, multi-system problems that show up in real clinical practice.
OpenEvidence scored 34%.
That's not a typo. The same AI that got every single USMLE question right could only answer about one in three complex subspecialty questions correctly.
This result highlights something fundamental about what standardized exams actually test. USMLE questions, by design, have one clearly correct answer among five options. They test a specific body of knowledge in a structured format. Real clinical reasoning is open-ended, ambiguous, involves incomplete information, and frequently has no single right answer. Mastering the former does not automatically translate to the latter.
This is true for AI systems, and it's equally true for human examinees: a perfect Step 1 score has never guaranteed clinical excellence.
What This Means for Your USMLE Prep
If you're a medical student reading this between question blocks, here's the honest takeaway.
Your exam is not getting replaced
The USMLE exists to certify that human physicians have the foundational knowledge needed to practice medicine safely. The fact that an AI can now ace the test doesn't change that purpose. The NBME has given no indication that AI performance will change exam policies, scoring, or requirements. You still need to pass. You still need to know this material.
AI study tools are getting significantly better
The same underlying technology that powers OpenEvidence's perfect score is making its way into study tools. The difference between a general chatbot that hallucinates drug interactions and a purpose-built medical AI grounded in peer-reviewed evidence is real, measurable, and growing. AI-powered study platforms that use this technology responsibly — with expert-reviewed content as the foundation — are going to be genuinely more useful than their predecessors.
The "what" hasn't changed; the "how" is evolving
You still need to learn the renin-angiotensin-aldosterone system. You still need to recognize the histology of granulomatous inflammation. The core knowledge hasn't changed. What's changing is the quality of the tools available to help you learn it — from better explanations and adaptive question routing to AI tutoring that can meet you where you are.
Do not use general AI chatbots as your primary study resource
This point bears repeating, especially in the wake of headlines like "AI scores 100% on USMLE." OpenEvidence is a specialized system with access to curated medical literature and purpose-built reasoning capabilities. ChatGPT is a general-purpose language model. They are not the same thing. A 2025 study found ChatGPT-3.5 had 0% accuracy on drug information categories including IV compatibility and monitoring parameters. Using a general chatbot as your primary factual reference for board prep is still risky.
The real competitive advantage is knowing how to learn
AI systems are getting better at answering questions. They are not getting better at being physicians. The clinical reasoning, pattern recognition, and judgment that residency programs are ultimately looking for cannot be outsourced to a chatbot. Your job is to build that foundation — and the best way to do that is still active recall, spaced repetition, and deliberate practice with high-quality question banks.
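Those study techniques are themselves simple algorithms. As one concrete example, here's a simplified version of the classic SM-2 scheduling rule (the algorithm behind Anki-style spaced repetition); the constants follow the published SM-2 description, and this is a sketch rather than a drop-in scheduler:

```python
# Simplified SM-2 spaced-repetition scheduler (the rule behind Anki-style tools).
# After each review you grade your recall 0-5; the function returns the next
# review interval plus the updated repetition count and ease factor.

def sm2(grade: int, reps: int, interval: float, ease: float = 2.5):
    """Return (next_interval_days, reps, ease) after one graded review."""
    if grade < 3:                   # failed recall: relearn the card from scratch
        return 1.0, 0, ease
    if reps == 0:
        interval = 1.0              # first successful review: see it again in 1 day
    elif reps == 1:
        interval = 6.0              # second: 6 days
    else:
        interval = interval * ease  # after that: intervals grow by the ease factor
    # Ease update from the classic SM-2 formula: harder recalls shrink the ease
    ease = max(1.3, ease + 0.1 - (5 - grade) * (0.08 + (5 - grade) * 0.02))
    return interval, reps + 1, ease

# A card answered "good" (grade 4) three reviews in a row:
interval, reps, ease = 0.0, 0, 2.5
for _ in range(3):
    interval, reps, ease = sm2(4, reps, interval, ease)
    print(f"next review in {interval:g} day(s), ease {ease:.2f}")
# -> 1, 6, then 15 days out
```

The point of the widening intervals is the point of spaced repetition generally: reviewing just before you'd forget is far more efficient than rereading on a fixed schedule.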
The Bigger Picture
Daniel Nadler has framed OpenEvidence's mission around equity: "There's an enormous amount of inequality in medical education in the United States and in preparation for medical school exams." The free explanation model released alongside the 100% score is meant to make high-quality reasoning accessible regardless of which medical school you attend or how much you can afford to spend on prep materials.
That vision resonates with a real problem. Comprehensive USMLE prep can easily run $2,000–$4,000 once you add up all the resources, and for international medical graduates (IMGs) in countries where that sum represents months of income, the financial barrier is significant.
Whether AI-powered tools will meaningfully close that gap remains to be seen. But the trajectory is clear: medical AI is improving faster than most people expected, and the tools available to students today are qualitatively different from what existed even two years ago.
The students who will benefit most are the ones who understand what these tools can and cannot do — and use them accordingly.
Frequently Asked Questions
Did an AI really pass the USMLE?
In August 2025, OpenEvidence became the first AI to score a perfect 100% on all three USMLE Steps using the Kung et al. benchmark dataset of 325 sample questions from usmle.org. Image-based questions were excluded. The AI did not take the actual proctored exam.
What is OpenEvidence?
OpenEvidence is an AI-powered clinical decision support platform used by over 430,000 U.S. physicians. Founded by Daniel Nadler (Harvard PhD), it provides evidence-grounded answers to clinical questions using content from NEJM, JAMA, Cochrane, and other peer-reviewed sources. It is valued at $12 billion as of January 2026.
How does OpenEvidence compare to ChatGPT for medical questions?
OpenEvidence is purpose-built for medicine, with access to curated peer-reviewed literature; ChatGPT is a general-purpose language model. In the company's July 2023 benchmark comparison, OpenEvidence made 77% fewer errors than ChatGPT. A 2025 study found ChatGPT-3.5 had 0% accuracy on several drug information categories where precision matters most.
Will AI replace the USMLE?
There is no indication from the NBME or USMLE program that AI performance will change exam requirements. The USMLE certifies human physician competency, and that purpose remains regardless of AI capabilities.
Should I use AI tools to study for the USMLE?
AI tools can be valuable supplements — especially for concept explanation, adaptive learning, and personalized study plans. The key is using purpose-built tools with expert-reviewed content rather than general chatbots, and always treating AI as a supplement to active recall and structured practice, not a replacement for it.
Studying for the USMLE? QuantaPrep combines expert-reviewed questions with AI-powered analytics to help you study smarter. All content is written and validated by medical educators — no hallucination risk in the core material. Start free with 20 questions per day, no credit card required.
Ready to start practicing?
QuantaPrep's question bank features detailed explanations, performance analytics, and study modes designed around active recall.