Did the AI diagnose the patients?

No. OpenAI's o3 Deep Research model surfaced evidence-linked candidate explanations for researchers to review. Every one of the 18 diagnoses was established by qualified clinicians through expert review, additional testing, and confirmation by a CLIA-certified laboratory. The model made no medical decisions.

How many cases were solved?

Doctors confirmed 18 new diagnoses out of 376 previously unsolved cases, an added diagnostic yield of 4.8%. These were not fresh cases; many had already been reviewed by multiple labs and specialist teams and were still unsolved.

Which AI model was used?

The team used OpenAI o3 Deep Research, a general-purpose reasoning model that can search and connect scientific material. It was asked to link clinical features, inheritance, variant evidence, and published science into a written explanation a human reviewer could interrogate.

Were some of the answers already known?

Yes. Of the 18 diagnoses, 7 were rediscoveries: the answer already existed outside the local workflow, and in several cases the variant was already listed as pathogenic in a public database. It had simply never reached the record the patient's own team was reviewing.

Does this mean people should use ChatGPT to diagnose illness?

No. The study is explicit that it is not evidence that patients, clinicians, or anyone else should use OpenAI models to diagnose disease or make medical decisions. It does not endorse o3 Deep Research, ChatGPT, or any product as a diagnostic tool.

AI News · July 3, 2026 · 12 min read

AI reopened 376 unsolved rare-disease cases. Doctors confirmed 18 new answers.

For families with a sick child and no name for the illness, the hardest word is “unsolved.” The tests come back. The specialists meet. And the file goes into a drawer, because there is nothing left to try.

A new study reopened 376 of those drawers. Researchers from Boston Children’s Hospital’s Manton Center for Orphan Disease Research, Harvard University, and OpenAI used a general AI reasoning model to help doctors diagnose rare diseases that had already defeated expert review. The model did not diagnose anyone. It read each case, weighed the evidence, and handed doctors a short list of leads worth a second look.

After the doctors took that second look, 18 of the 376 cases had a confirmed answer. The study was published on June 18, 2026, in NEJM AI, and OpenAI detailed the results in a post the same day.

Quick answer

Boston Children’s, Harvard, and OpenAI ran 376 previously unsolved rare-disease cases through OpenAI’s o3 Deep Research, a general reasoning model.
The cases were not new. Many had already been reviewed by several labs and specialist teams and were still unsolved.
The model suggested evidence-backed explanations. Human experts then reviewed, tested, and confirmed them in a certified lab.
Doctors established 18 new diagnoses, an added yield of 4.8% on cases specialists had already given up on.
In 7 of the 18, the answer already existed somewhere else and had simply never reached the patient’s own record.
The model made zero medical decisions. Every diagnosis was made by qualified clinicians through standard testing and confirmation.

What the study actually did

The child's genome stays unchanged for years, but the evidence around it keeps growing until a once-inconclusive test quietly becomes answerable.

Rare-disease work has a backlog problem. A child gets their genome sequenced, the result is inconclusive, and the case stalls. Meanwhile, science keeps moving. New genes get linked to disease. Labs reclassify old genetic variants. Fresh papers pile up. The child’s DNA never changes, but the meaning of that DNA can change every year.

The catch is that nobody has time to go back.

“The bottleneck is time. An expert can devote only so much of their day to any one particular person.”
Dr. Catherine Brownstein, Boston Children's Hospital's Manton Center for Orphan Disease Research

So old cases sit, even when the answer may already be sitting in a database somewhere.

The researchers wanted to know if a general AI model could take a first pass at that backlog, and whether that could help expert-led reanalysis scale as the science keeps changing. They chose OpenAI’s o3 Deep Research, a reasoning model that can search and connect scientific material. Their rule was strict from the start: the model widens the search, humans make every call.

Why AI can help diagnose rare diseases that stumped specialists

One patient's information is scattered across systems that do not talk: symptoms in HPO terms, test results in lab format, family history in another record, and a variant table with different IDs. A reasoning layer connects them.

The reason old cases can hold new answers comes down to fragmentation. A patient’s symptoms, test results, and family history often live in different databases that use different labels, formats, and vocabularies. Stitching all of that together is slow and easy to get wrong, so even a careful specialist can miss the one connection that matters.

There is also a timing problem. A child’s genome is sometimes read before the relevant gene has ever been tied to a disease. Years later, the link exists in the literature, but no one has gone back to re-read that particular file.

This is where a reasoning model earns its place. Instead of returning a ranked list of genes and stopping there, the team asked o3 to act as an “explanation-first” layer on top of the existing genetic pipelines.

It had to connect the clinical features, the inheritance pattern, the variant evidence, and the published science into one written argument that a human could poke holes in. That is a different job from an ordinary search, and it is closer to the kind of AI-assisted research that labs are starting to lean on.

How the reanalysis worked

For each case, the team built a de-identified packet. It held standardized Human Phenotype Ontology terms that describe the patient’s symptoms, occasional clinician notes, any working diagnosis, basic metadata like age and gender, and a filtered table of genetic variants. Most packets included data from the child and both biological parents.
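As a rough illustration of what such a packet might look like as a data structure, here is a minimal Python sketch. The class and field names are assumptions made for this example, not the study's actual schema, and the variant row is a placeholder rather than real data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantRow:
    """One row of the filtered variant table (field names are hypothetical)."""
    gene: str
    change: str                  # placeholder text, not a real variant
    carried_by: tuple[str, ...]  # e.g. ("child", "mother")


@dataclass
class CasePacket:
    """De-identified case packet; the layout is an assumption for illustration."""
    case_id: str                 # study-internal ID, no protected health information
    hpo_terms: list[str]         # standardized Human Phenotype Ontology term IDs
    variants: list[VariantRow]
    age_years: Optional[int] = None
    gender: Optional[str] = None
    working_diagnosis: Optional[str] = None  # only some cases have one
    clinician_notes: Optional[str] = None    # notes are occasional, so optional
    includes_parents: bool = True            # most packets held child plus both parents


packet = CasePacket(
    case_id="CASE-0001",
    hpo_terms=["HP:0001250", "HP:0001263"],  # seizure, global developmental delay
    variants=[VariantRow(gene="GENE_X", change="placeholder", carried_by=("child",))],
    age_years=7,
)
print(packet.case_id, len(packet.hpo_terms), "HPO terms,", len(packet.variants), "variant row")
```

Making the notes and working diagnosis optional mirrors the article's description that only some packets carried them.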

How one unsolved case moves through the workflow: a de-identified case packet feeds the o3 model, two experts review the hypothesis against ACMG/AMP criteria, follow-up or segregation testing checks it, a CLIA-certified lab confirms the variant, and only then is a result returned to the family.

The model was asked to propose the single most plausible molecular explanation and to show its work. Then the humans took over. Researchers reviewed every output using the ACMG/AMP framework, the same standard clinical labs use to grade genetic variants. At least two team members checked each candidate, and any disagreement was settled by consensus.

The bar for calling something a diagnosis was high. A model’s answer counted only after experts reviewed the evidence, targeted follow-up or segregation testing checked the leading hypothesis, the variant was classified as pathogenic or likely pathogenic, a CLIA-certified laboratory confirmed it, and the clinical team returned the result to the family. A raw model output was never, at any point, treated as a diagnosis.
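The bar described above works like a chain of gates that must all pass. The Python sketch below is one hypothetical way to express that rule. The field and function names are invented for this example and do not come from the study.

```python
from dataclasses import dataclass

# Classifications the article says were required before a diagnosis counted
ACCEPTED_CLASSIFICATIONS = {"pathogenic", "likely pathogenic"}


@dataclass
class CandidateStatus:
    """Status of one model-proposed explanation (hypothetical field names)."""
    expert_review_done: bool
    follow_up_testing_supports: bool   # targeted follow-up or segregation testing
    classification: str                # e.g. "pathogenic", "uncertain significance"
    clia_lab_confirmed: bool
    returned_to_family: bool


def counts_as_diagnosis(status: CandidateStatus) -> bool:
    """True only if every gate passed; raw model output alone never qualifies."""
    return (
        status.expert_review_done
        and status.follow_up_testing_supports
        and status.classification.lower() in ACCEPTED_CLASSIFICATIONS
        and status.clia_lab_confirmed
        and status.returned_to_family
    )


confirmed = CandidateStatus(True, True, "Likely pathogenic", True, True)
unconfirmed = CandidateStatus(True, True, "Likely pathogenic", False, False)
print(counts_as_diagnosis(confirmed), counts_as_diagnosis(unconfirmed))
```

A candidate that clears expert review but has not been confirmed by a CLIA-certified laboratory stays a lead, not a diagnosis.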

Before touching a single unsolved case, the team tested and tuned the workflow on cases that already had answers, refining the prompts, the clinical context, the variant filters, and the evidence questions the model received. The model recovered the correct gene and variant in 48 of 51 mixed rare-disease cases and reached the right diagnosis in 45 of 57 neuromuscular cases. In a set of 15 long-read genomes, it named the correct gene every time and identified both disease-causing copies in 12 of the 15.

Before it was trusted with unsolved cases, the model was checked on already-solved ones: 48 of 51 mixed cases, 45 of 57 neuromuscular cases, correct gene in all 15 long-read genomes, and both disease-causing copies in 12 of those 15.
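For readers who prefer percentages, the validation counts above convert to recovery rates with simple division. This short Python sketch uses only the counts reported in the article. The dictionary labels are shorthand chosen for this example.

```python
# (correct, total) pairs taken from the validation counts reported above
VALIDATION = {
    "mixed rare-disease cases, gene and variant": (48, 51),
    "neuromuscular cases, diagnosis": (45, 57),
    "long-read genomes, gene": (15, 15),
    "long-read genomes, both disease-causing copies": (12, 15),
}


def recovery_rate(correct: int, total: int) -> float:
    """Fraction of already-solved cases where the model matched the known answer."""
    return correct / total


for label, (correct, total) in VALIDATION.items():
    print(f"{label}: {correct}/{total} = {recovery_rate(correct, total):.1%}")
```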

The model also scored its own confidence, and those scores lined up with reality. On calls it got consistently right, the mean minimum confidence score was 85.6. On calls that were wrong or unknown, it was 42.1.

The team did not treat these as true probabilities or as a stand-in for evidence. They used them only to point tired reviewers at the most promising leads first.
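Using a score only to order a review queue can be sketched in a few lines of Python. This is an illustration under assumptions: the case IDs, gene labels, and scores are made up, and the study's actual triage tooling is not described in the article.

```python
from typing import NamedTuple


class Lead(NamedTuple):
    case_id: str        # hypothetical identifier
    gene: str           # placeholder gene label
    confidence: float   # model self-reported score; not a probability


def review_order(leads: list[Lead]) -> list[Lead]:
    """Sort leads so reviewers see the highest-scoring ones first.

    The score only sets the order. Every lead still goes through human review.
    """
    return sorted(leads, key=lambda lead: lead.confidence, reverse=True)


queue = review_order([
    Lead("CASE-A", "GENE_1", 42.0),
    Lead("CASE-B", "GENE_2", 88.0),
    Lead("CASE-C", "GENE_3", 61.5),
])
print([lead.case_id for lead in queue])
```

Nothing is dropped from the queue. Low-scoring leads are simply reviewed later, which matches the article's point that the scores were a prioritization aid rather than evidence.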

What the reanalysis actually found

Across the 376 cases, doctors confirmed 18 new diagnoses, a yield of 4.8%. The cases were split into four groups, and the hit rate was not even across them.

Where the 18 confirmed diagnoses came from: neurodevelopmental 10 of 100 (10%), neuromuscular 4 of 61 (6.6%), sudden unexpected death 2 of 200 (1%), early psychosis 2 of 15 (13.3%), for an overall 4.8% yield.

The neurodevelopmental group produced the most confirmed answers, 10 out of 100. The early-psychosis group had the highest rate at 13.3%, though it held only 15 cases, so that percentage carries a wide margin of error.

Neuromuscular disease added 4 out of 61, and the large sudden-death group added 2 out of 200. Yield also tracked how likely each group was to have a single-gene cause in the first place.
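The per-group arithmetic, and the reason a rate from only 15 cases is shaky, can be checked with a short Python sketch. The counts come from the article. The 95% Wilson score interval is a choice made for this illustration and is not a method the study is reported to have used.

```python
import math

# (confirmed diagnoses, cases reviewed) per group, as reported in the article
GROUPS = {
    "neurodevelopmental": (10, 100),
    "neuromuscular": (4, 61),
    "sudden unexpected death": (2, 200),
    "early psychosis": (2, 15),
}


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


total_hits = sum(hits for hits, _ in GROUPS.values())
total_cases = sum(n for _, n in GROUPS.values())

for name, (hits, n) in GROUPS.items():
    low, high = wilson_interval(hits, n)
    print(f"{name}: {hits}/{n} = {hits / n:.1%} (interval {low:.1%} to {high:.1%})")
print(f"overall: {total_hits}/{total_cases} = {total_hits / total_cases:.1%}")
```

Under this assumption the early-psychosis interval runs from roughly 4% to 38%, far wider than the one for the 100-case neurodevelopmental group, which is what a "wide margin of error" means in practice.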

A 4.8% hit rate sounds small, and on paper it is. What makes it meaningful is where it came from. These were not fresh cases. They had been through multiple pipelines and multidisciplinary teams and were still stuck. Squeezing single-digit gains out of that pile is genuinely hard, and similar reanalysis studies see the same modest range on heavily reviewed cases.

One finding says a lot about the real problem. Of the 18 diagnoses, 7 were rediscoveries. The answer already existed outside the local workflow, and in several cases the variant was already flagged as pathogenic in a public database. It just never made it into the record the patient’s own team was looking at.

Of 18 confirmed diagnoses, 11 were newly established in this workflow and 7 were rediscoveries, answers that already existed elsewhere but had never reached the patient's local record.

That is not really a science failure. It is a plumbing failure, the kind that shows up when information is scattered across systems that do not talk to each other.

Beyond diagnoses: new leads worth testing

Three kinds of leads beyond single-gene diagnoses: a hidden structural change on chromosome 22 pointing to DiGeorge syndrome, cases explained by two genes together such as LAMA2 with FOXP1 and TTN with SRPK3, and a new testable idea linking an S1PR1 deletion to vitiligo that still needs lab validation.

The model did more than match known patterns. In one early-psychosis case, it noticed a run of low-quality readings on chromosome 22, tied that to the child’s heart, immune, developmental, and psychiatric symptoms, and proposed a 22q11.2 deletion linked to DiGeorge syndrome. That structural clue was not even listed in the input data.

It also departed from its instructions in useful ways. The prompt asked for one single-gene cause, but some presentations were too complex for that. In one case, variants in two genes, LAMA2 and FOXP1, together explained the muscle and developmental features. Another turned out to have a two-gene cause involving TTN and SRPK3 that had not been recognized before.

The most striking result was not a diagnosis at all. In one neurodevelopmental case, the model flagged an 11-amino-acid deletion in a gene called S1PR1 in a person who also had vitiligo, the condition that causes loss of skin pigment.

It then built a mechanism: the deletion could change the receptor in a way that both reduces pigment production and helps immune cells linger in the skin. That link between S1PR1 and vitiligo still needs lab work to confirm, but it shows a model pulling threads from structural biology, immunology, and genetics into one concrete, testable idea.

The team saw a similar signal in the neuromuscular group, where damaging variants in HSPB8 and CDK13 did not fit the genes’ best-known disorders, hinting at a wider disease spectrum.

Kyra’s diagnosis, nearly two decades late

Kyra's timeline: muscle weakness at age 9 in karate and soccer, nearly 20 years of tests and treatments with no diagnosis, a wheelchair by age 13, the study linking her case to an HSPB8 frameshift and myofibrillar myopathy, and the call with an answer about a week before her 28th birthday.

The numbers get human fast. Kyra’s story started in karate class, when her mother noticed the 9-year-old could not sink as low into her stances as before. She was slowing down at soccer and walking up on her toes. Her pediatrician could not find the cause and sent her to a specialist.

What followed was a nearly 20-year search: tests, treatments, and consultations, but no name for what was happening to her body. She was in a wheelchair by the age of 13, though her condition later plateaued.

Kyra’s file was one of the four answers found in the neuromuscular group. The team traced her weakness to a frameshift variant in HSPB8 and diagnosed a form of myofibrillar myopathy, where abnormal protein clumps build up in muscle fibers.

A genetic counselor from the Manton Center called her about a week before her 28th birthday. Her form of the disease is so rare that little is known about how it will progress, but after nearly two decades, the answer itself brought some closure.

What this study does not prove

What the study showed versus what it did not do: it showed a general reasoning model can turn phenotype, inheritance, variants, and literature into reviewable hypotheses. It did not diagnose anyone, did not measure time, cost, effort, false positives, or care, did not test structural variants, repeat expansions, deep-intronic changes, or mosaicism, and does not replace sequencing, counseling, or specialists.

The researchers were careful to fence in what they had shown, and that honesty matters. This is not evidence that patients, clinicians, or anyone else should use OpenAI’s models to diagnose disease or make medical decisions. The study does not endorse o3 Deep Research, ChatGPT, or any product as a diagnostic tool.

The design had real limits. It was retrospective, the four groups were very different from each other, and reviewers could see the model’s confidence scores, which can bias judgment. The team did not measure time saved, cost, clinician effort, false-positive workload, or any effect on patient care. It also did not systematically test other kinds of genetic variation, such as structural variants, repeat expansions, deep-intronic changes, and mosaicism.

There is a deeper caution too. Large language models can misread context and produce clean, confident explanations that fall apart on inspection. That is exactly why every result here was forced through human review and lab confirmation.

The model widened the search and focused the human work. It never decided what to tell a family. This is the honest version of the “AI in medicine” story, and it is worth remembering the next time a headline claims a model can replace human experts.

The study also ran on de-identified data, with no protected health information leaving approved environments. Any wider use in clinics would demand the same care around privacy, security, and local regulation that all medical work requires. Model access does not replace sequencing machines, genetic counseling, confirmatory testing, or a specialist’s judgment.

What comes next

The road ahead: bigger prospective multi-center trials, safeguards like versioned prompts and audit logs and calibrated uncertainty, better models including purpose-built tools like GPT-Rosalind, and the Manton Center leading the next stage with an OpenAI Foundation grant.

The obvious next step is bigger, better-controlled studies. Prospective, multi-center trials should compare AI-assisted reanalysis against standard practice on the things this study did not measure: time to a lead, clinician effort, false-positive burden, cost, and real effects on care. Versioned prompts, reference checks, and audit logs will be needed to make any of it reproducible and safe.

The tools will keep improving as well. This study used o3 Deep Research, a general model. Newer general models can search and synthesize more, and purpose-built systems like GPT-Rosalind are aimed at deeper life-sciences work. None of that was tested here. OpenAI supported this first study, and the Manton Center will lead the next stage through a grant from the OpenAI Foundation.

“Researchers like Catherine and me can’t possibly keep 8,000 different diseases in our heads. That’s the power of AI.”
Alan Beggs, director of the Manton Center for Orphan Disease Research

That framing also fits the broader mood, where the public’s trust in AI rises and falls with how carefully it is used. A tool that surfaces leads for humans to check is a very different thing from a tool that decides.

The bottom line

The takeaway: this is not AI replacing the doctor's diagnosis. It is AI handing the doctor a lead worth checking. For thousands of families, an unanswered question no longer has to stay unanswerable.

Strip away the hype and this is a modest, careful result: a general AI model helped human experts find 18 answers in 376 cases that had already been declared dead ends. It diagnosed no one. It confirmed nothing on its own.

But 18 families now have a name for what happened, and 7 of those answers were already out there, just unreachable. The promise here is not a machine that replaces the doctor.

It is a machine that can read faster than any human and hand the doctor a lead worth chasing. For thousands of families still waiting, that is the difference between a closed drawer and one more chance.

Sources

OpenAI. Using AI to help physicians diagnose rare genetic diseases affecting children. June 18, 2026.
NEJM AI. Study published June 18, 2026.

Written by

Selvam Sivakumar

Founder, Elephas.app

Selvam Sivakumar is the founder of Elephas and an expert in AI, Mac apps, and productivity tools. He writes about practical ways professionals can use AI to work smarter while keeping their data private.
