AI reopened 376 unsolved rare-disease cases. Doctors confirmed 18 new answers.
For families with a sick child and no name for the illness, the hardest word is “unsolved.” The tests come back. The specialists meet. And the file goes into a drawer, because there is nothing left to try.
A new study reopened 376 of those drawers. Researchers from Boston Children’s Hospital’s Manton Center for Orphan Disease Research, Harvard University, and OpenAI used a general AI reasoning model to help diagnose rare diseases that had already defeated expert review. The model did not diagnose anyone. It read each case, weighed the evidence, and handed doctors a short list of leads worth a second look.
After doctors took that second look, 18 of the 376 cases had a confirmed diagnosis. The study was published on June 18, 2026, in NEJM AI, and OpenAI detailed the results the same day.
Quick answer
- Boston Children’s, Harvard, and OpenAI ran 376 previously unsolved rare-disease cases through OpenAI’s o3 Deep Research, a general reasoning model.
- The cases were not new. Many had already been reviewed by several labs and specialist teams and were still unsolved.
- The model suggested evidence-backed explanations. Human experts then reviewed, tested, and confirmed them in a certified lab.
- Doctors established 18 new diagnoses, an added yield of 4.8% on cases specialists had already given up on.
- In 7 of the 18, the answer already existed somewhere else and had simply never reached the patient’s own record.
- The model made zero medical decisions. Every diagnosis was made by qualified clinicians through standard testing and confirmation.
What the study actually did
Rare-disease work has a backlog problem. A child gets their genome sequenced, the result is inconclusive, and the case stalls. Meanwhile, science keeps moving. New genes get linked to disease. Labs reclassify old genetic variants. Fresh papers pile up. The child’s DNA never changes, but the meaning of that DNA can change every year.
The catch is that nobody has time to go back.
“The bottleneck is time. An expert can devote only so much of their day to any one particular person.”
So old cases sit, even when the answer may already be sitting in a database somewhere.
The researchers wanted to know if a general AI model could take a first pass at that backlog, and whether that could help expert-led reanalysis scale as the science keeps changing. They chose OpenAI’s o3 Deep Research, a reasoning model that can search and connect scientific material. Their rule was strict from the start: the model widens the search, humans make every call.
Why AI can help diagnose rare diseases that stumped specialists
The reason old cases can hold new answers comes down to fragmentation. A patient’s symptoms, test results, and family history often live in different databases that use different labels, formats, and vocabularies. Stitching all of that together is slow and easy to get wrong, so even a careful specialist can miss the one connection that matters.
There is also a timing problem. A child’s genome is sometimes read before the relevant gene has ever been tied to a disease. Years later, the link exists in the literature, but no one has gone back to re-read that particular file.
This is where a reasoning model earns its place. Instead of returning a ranked list of genes and stopping there, the team asked o3 Deep Research to act as an “explanation-first” layer on top of the existing genetic pipelines.
It had to connect the clinical features, the inheritance pattern, the variant evidence, and the published science into one written argument that a human could poke holes in. That is a different job from an ordinary search, and it is closer to the kind of AI-assisted research that labs are starting to lean on.
How the reanalysis worked
For each case, the team built a de-identified packet. It held standardized Human Phenotype Ontology terms that describe the patient’s symptoms, occasional clinician notes, any working diagnosis, basic metadata like age and gender, and a filtered table of genetic variants. Most packets included data from the child and both biological parents.
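To make the packet concrete, here is a minimal sketch of how such a record could be modeled. The class and field names are hypothetical, chosen for illustration; the study's actual schema is not described here. The HPO IDs and the variant are placeholders, not study data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    gene: str        # gene symbol
    change: str      # free-text variant description
    carriers: tuple  # family members who carry it, e.g. ("proband", "mother")


@dataclass
class CasePacket:
    case_id: str                           # study-internal ID, never a patient identifier
    hpo_terms: list                        # Human Phenotype Ontology IDs
    variants: list                         # filtered variant table
    sequenced: tuple = ("proband",)        # whose data is in the packet
    clinician_notes: Optional[str] = None  # only some packets carried notes
    working_diagnosis: Optional[str] = None
    age_years: Optional[int] = None
    gender: Optional[str] = None

    def is_trio(self) -> bool:
        """True when the child and both biological parents are included."""
        return {"proband", "mother", "father"} <= set(self.sequenced)


# Placeholder values for illustration only; not data from the study.
packet = CasePacket(
    case_id="CASE-0001",
    hpo_terms=["HP:0001250", "HP:0001263"],  # seizure, global developmental delay
    variants=[Variant("GENE1", "placeholder variant", ("proband", "mother"))],
    sequenced=("proband", "mother", "father"),
)
print(packet.is_trio())
```

Keeping optional fields as None rather than empty strings makes it obvious which packets arrived without notes or a working diagnosis.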
The model was asked to propose the single most plausible molecular explanation and to show its work. Then the humans took over. Researchers reviewed every output using the ACMG/AMP framework, the same standard clinical labs use to grade genetic variants. At least two team members checked each candidate, and any disagreement was settled by consensus.
The bar for calling something a diagnosis was high. A model’s answer counted only after experts reviewed the evidence, targeted follow-up or segregation testing checked the leading hypothesis, the variant was classified as pathogenic or likely pathogenic, a CLIA-certified laboratory confirmed it, and the clinical team returned the result to the family. A raw model output was never, at any point, treated as a diagnosis.
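That bar can be read as a checklist in which every item must be satisfied. The sketch below encodes it that way; the class, field, and function names are assumptions made for illustration, not the team's software.

```python
from dataclasses import dataclass

# Only these two ACMG/AMP classes cleared the bar described above.
ACCEPTED_CLASSES = {"pathogenic", "likely pathogenic"}


@dataclass
class CandidateStatus:
    expert_reviewed: bool = False          # experts reviewed the evidence
    follow_up_tested: bool = False         # targeted follow-up or segregation testing
    acmg_class: str = "uncertain significance"
    clia_confirmed: bool = False           # confirmed by a CLIA-certified laboratory
    returned_to_family: bool = False       # clinical team returned the result


def counts_as_diagnosis(status: CandidateStatus) -> bool:
    """A candidate counts only when every step in the chain is complete."""
    return all([
        status.expert_reviewed,
        status.follow_up_tested,
        status.acmg_class.lower() in ACCEPTED_CLASSES,
        status.clia_confirmed,
        status.returned_to_family,
    ])


raw_model_output = CandidateStatus()  # nothing checked yet
confirmed = CandidateStatus(True, True, "Likely pathogenic", True, True)

print(counts_as_diagnosis(raw_model_output))  # False
print(counts_as_diagnosis(confirmed))         # True
```

Note that the model's own confidence does not appear anywhere in the check, which mirrors the study's rule that a raw output is never a diagnosis.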
Before touching a single unsolved case, the team tested and tuned the workflow on cases that already had answers, refining the prompts and the clinical context, variant filters, and evidence questions the model received. The model recovered the correct gene and variant in 48 of 51 mixed rare-disease cases, got the right diagnosis in 45 of 57 neuromuscular cases, and in a set of 15 long-read genomes it named the correct gene every time and both disease-causing copies in 12.
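As a quick worked check, those benchmark counts convert to rates as follows. The labels are shorthand for this sketch; the counts are the ones quoted above.

```python
# (correct, total) for each benchmark set quoted in the article.
benchmarks = {
    "mixed rare-disease, gene and variant": (48, 51),
    "neuromuscular, diagnosis": (45, 57),
    "long-read genomes, gene": (15, 15),
    "long-read genomes, both disease-causing copies": (12, 15),
}

# Percentage of known answers recovered, rounded to one decimal place.
rates = {
    name: round(100 * correct / total, 1)
    for name, (correct, total) in benchmarks.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

That works out to about 94.1%, 78.9%, 100%, and 80%, respectively.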
The model also scored its own confidence, and those scores lined up with reality. On calls it got consistently right, the mean minimum confidence score was 85.6. On calls that were wrong or unknown, it was 42.1.
The team did not treat these as true probabilities or as a stand-in for evidence. They used them only to point tired reviewers at the most promising leads first.
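Using a score only to order the work, not to filter it, might look like the sketch below. The Lead class, the 0 to 100 scale, and the example values are assumptions for illustration; the article reports only the two mean scores.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    case_id: str
    gene: str
    confidence: float  # model's self-reported score; not a probability


def review_queue(leads):
    """Order leads for human review, highest score first.

    Every lead stays in the queue. The score changes only the order
    in which reviewers see the leads, never whether they see them.
    """
    return sorted(leads, key=lambda lead: lead.confidence, reverse=True)


# Hypothetical leads; the gene names and scores are placeholders.
leads = [
    Lead("CASE-0003", "GENE_C", 42.0),
    Lead("CASE-0001", "GENE_A", 88.0),
    Lead("CASE-0002", "GENE_B", 61.5),
]

for lead in review_queue(leads):
    print(lead.case_id, lead.gene, lead.confidence)
```

Sorting instead of thresholding is the design choice that matters here: a low score delays a review but cannot cancel it.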
What the reanalysis actually found
Across the 376 cases, doctors confirmed 18 new diagnoses, a yield of 4.8%. The cases were split into four groups, and the hit rate was not even across them.
The neurodevelopmental group produced the most confirmed answers, 10 out of 100. The early-psychosis group had the highest rate at 13.3%, though it held only 15 cases, so that percentage carries a wide margin of error.
Neuromuscular disease added 4 out of 61, and the large sudden-death group added 2 out of 200. Yield also tracked how likely each group was to have a single-gene cause in the first place.
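The group figures can be checked with a few lines of arithmetic. The sketch below also adds a Wilson score interval to show how wide the uncertainty is for the small group. The interval is an illustration added here, not a calculation from the study, and the early-psychosis count of 2 is inferred from 13.3% of 15 cases.

```python
from math import sqrt

# (confirmed diagnoses, cases) per group, as quoted in the article.
groups = {
    "neurodevelopmental": (10, 100),
    "early psychosis": (2, 15),  # inferred from 13.3% of 15
    "neuromuscular": (4, 61),
    "sudden death": (2, 200),
}


def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion hits/n."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


total_hits = sum(hits for hits, _ in groups.values())
total_cases = sum(n for _, n in groups.values())
overall_yield = round(100 * total_hits / total_cases, 1)

for name, (hits, n) in groups.items():
    low, high = wilson_interval(hits, n)
    print(f"{name}: {hits}/{n} = {100 * hits / n:.1f}% "
          f"(95% interval {100 * low:.1f}% to {100 * high:.1f}%)")
print(f"overall: {total_hits}/{total_cases} = {overall_yield}%")
```

The groups add up to 18 diagnoses in 376 cases, or 4.8%. For the 15-case early-psychosis group the interval runs from roughly 4% to 38%, which is why its headline rate should be read with caution.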
A 4.8% hit rate sounds small, and on paper it is. What makes it meaningful is where it came from. These were not fresh cases. They had been through multiple pipelines and multidisciplinary teams and were still stuck. Squeezing single-digit gains out of that pile is genuinely hard, and similar reanalysis studies see the same modest range on heavily reviewed cases.
One finding says a lot about the real problem. Of the 18 diagnoses, 7 were rediscoveries. The answer already existed outside the local workflow, and in several cases the variant was already flagged as pathogenic in a public database. It just never made it into the record the patient’s own team was looking at.
That is not really a science failure. It is a plumbing failure, the kind that shows up when information is scattered across systems that do not talk to each other.
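In code terms, a rediscovery is a mismatch between what a patient's record says about a variant and what a current reference says. The sketch below uses in-memory dictionaries and made-up variant IDs; it is an illustration of the idea, not how the study or any public database works.

```python
# Classifications that would change what the clinical team does next.
ACTIONABLE = {"pathogenic", "likely pathogenic"}


def stale_findings(recorded, current):
    """Return variant IDs that are actionable in the current reference
    but not in the patient's own record."""
    flagged = []
    for variant_id, recorded_class in recorded.items():
        # If the reference has no entry, fall back to the recorded class.
        current_class = current.get(variant_id, recorded_class)
        if current_class in ACTIONABLE and recorded_class not in ACTIONABLE:
            flagged.append(variant_id)
    return flagged


# Hypothetical record and reference snapshot; IDs are placeholders.
patient_record = {
    "VAR-001": "uncertain significance",
    "VAR-002": "benign",
    "VAR-003": "uncertain significance",
}
reference_snapshot = {
    "VAR-001": "pathogenic",  # reclassified since the record was written
    "VAR-002": "benign",
}

print(stale_findings(patient_record, reference_snapshot))
```

Anything this kind of check flags is still only a lead. In the study, every rediscovered answer went through the same expert review and lab confirmation as the rest.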
Beyond diagnoses: new leads worth testing
The model did more than match known patterns. In one early-psychosis case, it noticed a run of low-quality readings on chromosome 22, tied that to the child’s heart, immune, developmental, and psychiatric symptoms, and proposed a 22q11.2 deletion linked to DiGeorge syndrome. That structural clue was not even listed in the input data.
It also departed from its instructions in useful ways. The prompt asked for one single-gene cause, but some presentations were too complex for that. In one case, variants in two genes, LAMA2 and FOXP1, together explained the muscle and developmental features. Another turned out to have a two-gene cause involving TTN and SRPK3 that had not been recognized before.
The most striking result was not a diagnosis at all. In one neurodevelopmental case, the model flagged an 11-amino-acid deletion in a gene called S1PR1 in a person who also had vitiligo, the condition that causes loss of skin pigment.
It then built a mechanism: the deletion could change the receptor in a way that both reduces pigment production and helps immune cells linger in the skin. That link between S1PR1 and vitiligo still needs lab work to confirm, but it shows a model pulling threads from structural biology, immunology, and genetics into one concrete, testable idea.
The team saw a similar signal in the neuromuscular group, where damaging variants in HSPB8 and CDK13 did not fit the genes’ best-known disorders, hinting at a wider disease spectrum.
Kyra’s diagnosis, nearly two decades late
The numbers get human fast. Kyra’s story started in karate class, when her mother noticed the 9-year-old could not sink as low into her stances as before. She was slowing down at soccer and walking up on her toes. Her pediatrician could not find the cause and sent her to a specialist.
What followed was a nearly 20-year search. There were tests, treatments, and consultations, but no name for what was happening to her body. She was in a wheelchair by the age of 13, though her condition later plateaued.
Kyra’s case was one of the four diagnoses confirmed in the neuromuscular group. The team traced her weakness to a frameshift variant in HSPB8 and diagnosed a form of myofibrillar myopathy, where abnormal protein clumps build up in muscle fibers.
A genetic counselor from the Manton Center called her about a week before her 28th birthday. Her form of the disease is so rare that little is known about how it will progress, but after nearly two decades, the answer itself brought some closure.
What this study does not prove
The researchers were careful to fence in what they had shown, and that honesty matters. This is not evidence that patients, clinicians, or anyone else should use OpenAI’s models to diagnose disease or make medical decisions. The study does not endorse o3 Deep Research, ChatGPT, or any product as a diagnostic tool.
The design had real limits. It was retrospective, the four groups were very different from each other, and reviewers could see the model’s confidence scores, which can bias judgment. The team did not measure time saved, cost, clinician effort, false-positive workload, or any effect on patient care. It also did not systematically test other kinds of genetic variation, such as structural variants, repeat expansions, deep-intronic changes, and mosaicism.
There is a deeper caution too. Large language models can misread context and produce clean, confident explanations that fall apart on inspection. That is exactly why every result here was forced through human review and lab confirmation.
The model widened the search and focused the human work. It never decided what to tell a family. This is the honest version of the “AI in medicine” story, and it is worth remembering the next time a headline claims a model can replace human experts.
The study also ran on de-identified data, with no protected health information leaving approved environments. Any wider use in clinics would demand the same care around privacy, security, and local regulation that all medical work requires. Model access does not replace sequencing machines, genetic counseling, confirmatory testing, or a specialist’s judgment.
What comes next
The obvious next step is bigger, better-controlled studies. Prospective, multi-center trials should compare AI-assisted reanalysis against standard practice on the things this study did not measure: time to a lead, clinician effort, false-positive burden, cost, and real effects on care. Versioned prompts, reference checks, and audit logs will be needed to make any of it reproducible and safe.
The tools will keep improving as well. This study used o3 Deep Research, a general model. Newer general models can search and synthesize more, and purpose-built systems like GPT-Rosalind are aimed at deeper life-sciences work. None of that was tested here. OpenAI supported this first study, and the Manton Center will lead the next stage through a grant from the OpenAI Foundation.
“Researchers like Catherine and me can’t possibly keep 8,000 different diseases in our heads. That’s the power of AI.”
That framing also fits the broader mood, where the public’s trust in AI rises and falls with how carefully it is used. A tool that surfaces leads for humans to check is a very different thing from a tool that decides.
The bottom line
Strip away the hype and this is a modest, careful result: a general AI model helped human experts find 18 answers in 376 cases that had already been declared dead ends. It diagnosed no one. It confirmed nothing on its own.
But 18 families now have a name for what happened, and 7 of those answers were already out there, just unreachable.
The promise here is not a machine that replaces the doctor. It is a machine that can read faster than any human and hand the doctor a lead worth chasing. For thousands of families still waiting, that is the difference between a closed drawer and one more chance.
Sources
- OpenAI. Using AI to help physicians diagnose rare genetic diseases affecting children, June 18, 2026
- The study was published in NEJM AI, June 18, 2026











