As a timer counted down, a team of physicians from St. Michael’s Medical Center in Newark, N.J., conferred on a medical diagnosis question. Then another. And another. With each question, the stakes at Doctor’s Dilemma, an annual competition held in May in Washington, D.C., grew higher. By the end, the team had wrestled with 45 conditions, symptoms, or treatments. They defeated 50 teams to win the 2016 Osler Cup.
The stakes are even higher for real-life diagnoses, where doctors routinely face time pressure. That is why researchers have tried since the 1960s to supplement doctors’ memory and decision-making skills with computer-based diagnostic aids. In 2012, for example, IBM pitted a version of its Jeopardy!-winning artificial intelligence, Watson, against questions from Doctor’s Dilemma. But Big Blue’s brainiac couldn’t replicate the overwhelming success it had against human Jeopardy! players.
The trouble is, computerized diagnosis aids do not yet measure up to the performance of human doctors, according to several recent studies. Nor can makers of such software seem to agree on a single benchmark by which to measure performance. Using reports on such software in the peer-reviewed literature, one team of researchers found wide performance variations across different diseases, as well as different usage patterns among doctors. For example, younger doctors are likelier to spend time putting more patient data into a tool and likelier to benefit from the aid. Two presentations at the 6–8 November Diagnostic Error in Medicine Conference in Hollywood, Calif., confronted the issue of how to realistically incorporate technological aids into doctor training and hectic diagnosis routines.
Another issue is figuring out how to compare different software aids. “If you look at, for example, the big progress that has occurred in speech recognition or in image classification, it's really been brought about by having really good benchmark data sets and really like having actual competitions,” says computer scientist Ole Winther at the Technical University of Denmark in Lyngby. “We don't have the same in the medical domain.”
While IBM did publish a report in 2013 on its Watson-vs-Doctor’s Dilemma test, Winther says that he has been unable to obtain the subset of questions IBM used, so he could not directly compare Watson’s performance with that of a diagnostic aid he and colleagues built, called FindZebra. Last year, his team estimated that both FindZebra and Watson list the correct diagnosis among their top 10 results about 60 percent of the time, which is in line with what a Spanish team reported earlier this year.
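The 60 percent figure Winther cites is a standard information-retrieval measure, often called top-k accuracy or recall@k: the fraction of cases in which the true diagnosis appears anywhere in a tool's top k suggestions. A minimal sketch, using made-up case data (the diagnoses and rankings below are illustrative, not drawn from either system):

```python
def top_k_accuracy(ranked_lists, truths, k=10):
    """Fraction of cases whose true diagnosis appears among the top-k suggestions."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_lists, truths))
    return hits / len(truths)

# Hypothetical toy data: each case gets a ranked list of suggested diagnoses.
ranked = [
    ["lupus", "lyme disease", "fibromyalgia"],
    ["gout", "pseudogout"],
    ["measles", "rubella"],
]
truths = ["lyme disease", "sarcoidosis", "measles"]

print(top_k_accuracy(ranked, truths, k=10))  # 2 of 3 cases hit
```

Note that the metric is forgiving by design: a correct diagnosis ranked tenth counts the same as one ranked first, which is one reason different teams' headline numbers are hard to compare without a shared benchmark.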
Despite the lack of a unified benchmark for computer-aided diagnostics, individual doctors, family members of misdiagnosed patients, and academic and clinical groups have built and are marketing such aids. Clients include private health insurance companies and research hospitals around the world, among them a pair of medical facilities in North Carolina and Japan that have reported some success diagnosing patients with Watson. Yet, at a recent IBM Research event in Zurich, one of IBM’s clients, Jens-Peter Neumann of the Rhön-Klinikum hospital network in Germany, said that it is too early to estimate the potential cost savings of his team’s Watson collaboration.
In February 2016 the Rhön-Klinikum network began pilot-testing Watson against the ultimate challenge for any diagnostics aid: rare diseases. The 7,000 or so known rare diseases affect perhaps 7 percent of Europe’s population, according to Munich Re, an insurance and risk management firm. As genomic screening grows more sophisticated, the firm predicts, over 1,000 more such diseases will be discovered by 2020. “Memorizing them all is just not going to happen,” says computer scientist and physician Tobias Mueller of the University Clinic Marburg in Germany, who is involved in the Rhön-Klinikum pilot.
Instead the team is structuring the natural-language medical histories of the 522 patients in the pilot into the right format for Watson, a time-consuming process that combines human and computer efforts. Watson can then compare these structured histories to the medical literature and suggest ranked diagnoses.
One issue, Mueller says, has been consistently processing medical literature from both German and English. So far, the team has opted to use a combination of medical taxonomies, such as MedDRA and ICD-10, to describe symptoms and diagnoses. He also notes that sometimes the knowledge sources fed into Watson contradict each other. In other words: computerized diagnosis aids are struggling with some of the same problems humans do when sharing and comparing information. “However, this reflects the diversity of the knowledge base of Watson and is no different than having a room full of doctors with different backgrounds and different opinions. It’s more a strength than a weakness,” Mueller says.
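The structuring step Mueller's team describes boils down to mapping free-text symptom mentions onto standardized taxonomy codes so that software can compare cases against the literature. A minimal sketch of the idea, using a tiny hand-built dictionary (the code values below are illustrative placeholders, not an endorsement of any specific MedDRA or ICD-10 entries, and real pipelines use far more sophisticated natural-language processing):

```python
# Illustrative symptom-to-code lookup; real taxonomies contain tens of
# thousands of terms and require careful, often manual, curation.
SYMPTOM_CODES = {
    "shortness of breath": ("MedDRA", "10013963"),
    "joint pain": ("ICD-10", "M25.5"),
    "fatigue": ("ICD-10", "R53.83"),
}

def structure_history(text):
    """Return (taxonomy, code, term) triples for each known symptom in the text."""
    text = text.lower()
    return [
        (taxonomy, code, term)
        for term, (taxonomy, code) in SYMPTOM_CODES.items()
        if term in text
    ]

notes = "Patient reports fatigue and joint pain over six months."
print(structure_history(notes))
```

Even this toy version hints at the team's bilingual problem: the lookup table would need German synonyms for every English term, and two taxonomies can code the same complaint differently, which is one way contradictions creep into the knowledge base.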
Despite the struggles, Winther says computer-aided diagnosis will ultimately mature: “A lot of patients spend years and years juggling between [general practitioners] and the wrong specialists. That’s still a challenge where there’s room for these kinds of tools.”
This post was updated on 15 November 2016 to clarify the timing and aims of the Rhön-Klinikum pilot study.