AI in Medicine? It's back to the future, Dr Watson

Why IBM's cancer project sounds like Expert Systems Mk.2

Analysis "OK, the error rate is terrible, but it's Artificial Intelligence – so it can only improve!"

Of course. AI is always "improving" – as much is implied by the cleverly anthropomorphic phrase, "machine learning". Learning systems don't get dumber. But what if they don't actually improve?

The caveat accompanies almost any mainstream story on machine learning or AI today. But it was actually being expressed with great confidence forty years ago, the last time AI was going to "revolutionise medicine".

IBM's ambitious Watson Health initiative will unlock "$2 trillion of value," according to Deborah DiSanzo, general manager of Watson Health at IBM.

But this year it has attracted headlines of the wrong kind. In February, the cancer centre at the University of Texas put its Watson project on hold, after spending over $60m with IBM and consultants PricewaterhouseCoopers. Earlier this month, StatNews published a fascinating investigative piece into the shortcomings of its successor, IBM's Watson for Oncology. IBM marketing claims this is "helping doctors out-think cancer, one patient at a time".

The StatNews piece is a must-read if you're thinking of deploying AI, because it's only tangentially about Artificial Intelligence, and actually tells us much more about the pitfalls of systems deployment, and cultural practice. In recent weeks Gizmodo and MIT Technology Review have also run critical looks at Watson for Oncology. In the latter, the system's designers despaired at the claims being made on its behalf.

"How disappointing," wrote tech books publisher O'Reilly's books editor, Andy Oram, to net protocol pioneer Dave Farber. "This much-hyped medical AI is more like 1980s expert systems, not good at diagnosing cancer."

What does he mean?

Given how uncannily so much of today's machine learning mania echoes earlier hypes, let's take a step back and examine the fate of one showpiece Artificial Intelligence medical system, and see if there's anything we can learn from history.

MYCIN

The history of AI is one of long "winters" of indifference punctuated by brief periods of hype and investment. MYCIN, developed by Edward Shortliffe, was a backward-chaining expert system designed to help clinicians; it emerged early in the first "AI winter".

MYCIN used AI to identify the bacteria causing an infection and, based on information provided by a clinician, recommended the correct dosage of antibiotics for the patient.

MYCIN also bore the hallmarks of hard-won experience. The first two decades of AI had been an ambitious project to encode all human knowledge in symbols and rules, so they could be algorithmically processed by a digital computer. Despite great claims made on its behalf, this had yielded very little of use. Then in 1973, the UK withdrew funding for AI from all but three of its universities. The climate had gone cold again.

AI researchers were obliged to explore new approaches. The most promising seemed to be giving systems constraints, simplifying the problem space. "Micro-worlds", artificially simple situations, were one such approach; Terry Winograd's block-stacker SHRDLU was a famous example. From micro-worlds came rules-based "expert systems", and MYCIN was one of them: comprising 150 IF-THEN statements, it made inferences from a limited knowledge base.
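
For a sense of what "backward chaining" over IF-THEN rules actually involves, here is a minimal sketch in Python. The rules and facts are invented purely for illustration and bear no relation to MYCIN's real knowledge base (which was written in Lisp); the point is the mechanism: start from a goal, find a rule whose THEN part matches it, and recursively try to establish that rule's IF parts from whatever facts a clinician has entered.

# Minimal backward-chaining sketch. The rules and facts below are invented
# for illustration -- they are not MYCIN's actual rules.
RULES = [
    # (IF all of these facts hold, THEN conclude this)
    ({"gram_negative", "rod_shaped", "anaerobic"}, "organism_is_type_X"),
    ({"organism_is_type_X", "site_is_blood"}, "recommend_treatment_A"),
]

def prove(goal, facts, trace, depth=0):
    """Try to establish `goal`, chaining backwards through the rules."""
    if goal in facts:  # already known from clinician input
        trace.append("  " * depth + f"{goal}: given")
        return True
    for conditions, conclusion in RULES:
        if conclusion == goal and all(prove(c, facts, trace, depth + 1) for c in conditions):
            trace.append("  " * depth + f"{goal}: via rule IF " + ", ".join(sorted(conditions)))
            return True
    trace.append("  " * depth + f"{goal}: cannot be established")
    return False

# Facts a clinician might have keyed in about a culture.
facts = {"gram_negative", "rod_shaped", "anaerobic", "site_is_blood"}
trace = []
if prove("recommend_treatment_A", facts, trace):
    print("Goal established. Evidence trace:")
print("\n".join(trace))

A system built this way can only ever be as good as its rules and the data typed into it – which, as we'll see, is exactly where both MYCIN and Watson for Oncology came unstuck.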

Why didn't MYCIN get better?

Stanford had compared MYCIN to the work of eight experts at its own medical school. Out in the real world, though, it was deemed unfit for purpose, and went unused.

MYCIN's Achilles' heel was predicted in advance by the leading AI critic (and tormentor) Hubert Dreyfus. Not all knowledge can be finessed into "rules". This difficulty was acknowledged by the "Father of Expert Systems" Ed Feigenbaum, an academic who established the Stanford Knowledge Systems Lab, and whose 1983 book The Fifth Generation: Artificial Intelligence and Japan's Computer Challenge to the World created a huge revival in AI investment – and something close to panic in American business and government. Then, as now, we worried about an "AI gap" with Asia.

Dreyfus pointed out that a real human expert used a combination of examples and intuition based on experience. Replicating this might prove elusive:

"If internship and the use of examples play an essential role in expert judgement, ie, if there is a limit to what can be understood by rules, Feigenbaum would never see it – especially in domains such as medicine, where there is a large and increasing body of factual information."

Ah, yes, said Feigenbaum. We have that in hand.

A new job category was then born – that of "knowledge engineer". A knowledge engineer was sent in when the human expert didn't realise, or couldn't express, how she or he had come to a decision. A knowledge engineer was like a horse whisperer, taming the wild and elusive knowledge floating around in the expert's head, and then bottling it, for an expert system to use.

So how did that go?

Not learning from failure

Summarising the field at the end of the 1980s, academics Ostberg, Whitaker and Amick* discovered that the reality bore...

little resemblance to the success stories reported in high-priced insider newsletters. As of this date, we find that there are very few operational systems in use worldwide ... The literature, including the expensive and supposedly insightful expert systems newsletters, has consistently overstated the degree of expert systems penetration into the workplace. However, even the outright failures have not dissuaded organizations from pursuing the technology; they have simply categorized previous efforts as learning experiences.

Feigenbaum himself despaired. Perhaps expert knowledge was simply "10,000 special cases", he mused. It would always evade capture.

The great horse whispering project had failed.

Shortliffe discovered that MYCIN's failure to move into full use was in large part due to the ability of experts to decide how they wished to carry out their tasks – "if the tool was not directly helpful to how they wished to work, then it would simply not be used," notes Philip Leith in his study, The Rise and Fall of Legal Expert Systems, a fascinating analogy for the failure of 1980s AI in medicine.

"What was missing was proper analysis of user needs – vital in any other area of computer implementation, but apparently viewed as unnecessary by the AI community. This produced a mismatch between what the experimenter believed the user wanted and what the user would actually use."

AI is only part of what Watson Health actually does, and AI today is very different to the AI of the '60s, '70s and '80s: probabilistic AI takes a brute-force approach, using large data sets. In some cases, such as speech recognition, this has been fantastically successful. In others it's still quite useful. But in many other situations, it isn't.

MYCIN failed, but it didn't fail simply because the AI stubbornly couldn't improve. Like IBM's Watson today, it was immensely time-consuming for the staff using it, relying entirely on manual data entry. Its knowledge base was incomplete, and it was used in areas where it shouldn't have been.

In its weighty investigation, StatNews found another reason for Watson for Oncology not living up to its billing: culture. Hospitals in Denmark and the Netherlands had declined to sign up because it was too US-centric, "putting too much stress on American studies, and too little stress on big, international, European, and other-part-of-the-world studies", according to one source quoted.

Today we're in another Tulip Mania phase of AI: SoftBank's singularity-obsessed boss has pledged that much of his $100bn tech fund will focus on AI and ML investments.

Today MYCIN will be recalled by thousands of students who studied computer science in the 1980s, as it became the canonical example of an expert system. That was the fate of many such systems: useless in the field, they were used for teaching. The next generation would be better.

Or not.

At the time, MYCIN's defenders claimed that no "expert" could outperform it, and it prompted a wave of enthusiasm. MYCIN had much to commend it. It was honest, giving the user a probability figure and a full trace of all the evidence.
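
Strictly, MYCIN's figures were "certainty factors" rather than true probabilities, but the spirit was the same: tell the user how confident the system is, and why. As a rough worked example of how two pieces of supporting evidence combine under the usual certainty-factor rule (the 0.6 and 0.4 below are invented numbers), in Python:

# Combining two positive certainty factors -- the usual rule is
# cf1 + cf2 * (1 - cf1). The figures are invented for illustration.
def combine(cf1, cf2):
    return cf1 + cf2 * (1 - cf1)

print(combine(0.6, 0.4))  # 0.76: stronger than either piece of evidence alone, never above 1.0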

Today's AI rests on probabilities too: when an experiment purports to be able to tell if you're gay (and, its Stanford creator claims, in future your political views), the headline conceals a probability figure, derived after much training (or "learning").

Unlike almost all mainstream publications, which rarely if ever report the failure rate, we do. For example, the failure rate of the AI system that claimed to identify masked faces was surprising: with a cap and scarf, the AI is between 43 and 55 per cent accurate. "On a practical level it isn't that awesome," we noted.