This is an article from KFF Health News.
Preparing cancer patients for difficult decisions is an oncologist’s job, but doctors don’t always remember to do it. At the University of Pennsylvania Health System, an artificial intelligence algorithm that predicts the likelihood of death prompts doctors to talk with patients about their treatment options and end-of-life wishes.
But this is not a set-it-and-forget-it tool. A routine technology checkup revealed that the algorithm had decayed during the COVID-19 pandemic, becoming 7 percentage points worse at predicting who would die, according to a 2022 study.
It likely had real-life consequences. The study’s lead author, Ravi Parikh, an oncologist at Emory University, told KFF Health News that the tool failed hundreds of times to prompt doctors to start that important discussion with patients who needed it, conversations that could potentially have headed off unnecessary chemotherapy.
He believes that several algorithms designed to enhance medical care, not just Penn Medicine’s, weakened during the pandemic. “Many institutions do not routinely monitor the performance of their products,” Parikh said.
Algorithm glitches are one facet of a dilemma that computer scientists and doctors have long recognized but that is beginning to confound hospital executives and researchers: artificial intelligence systems require consistent monitoring and staffing to put in place and to keep working well.
Essentially, new tools require more people and more machinery to keep from messing up.
“Everyone thinks that AI will help us with our access and capacity and improve care,” said Nigam Shah, chief data scientist at Stanford Health Care. “That’s all great and good, but if it increases the cost of care by 20%, is that viable?”
Government officials worry that hospitals lack the resources to put these technologies through their paces. “I’ve looked broadly,” FDA Commissioner Robert Califf said at a recent agency panel on AI. “I don’t believe there is a single health system in the United States that is capable of validating an AI algorithm that is deployed in a clinical care system.”
AI is already pervasive in health care. Algorithms are used to predict patients’ risk of death or deterioration, suggest diagnoses or triage patients, record and summarize visits to save doctors work, and approve insurance claims.
If technology evangelists are right, technology will be ubiquitous and profitable. Investment firm Bessemer Venture Partners has identified about 20 health-focused AI startups each on track to generate $10 million in annual revenue. The FDA has approved approximately 1,000 artificial intelligence products.
It’s difficult to evaluate whether these products work. It’s even harder to assess whether they will continue to work, or whether they have developed the software equivalent of a blown gasket or a leaky engine.
Consider a recent study at Yale Medicine that evaluated six “early warning systems,” which alert clinicians when a patient’s condition may deteriorate rapidly. A supercomputer ran the data for several days, said Dana Edelson, a physician at the University of Chicago and co-founder of the company that provided one of the algorithms for the study. The process was fruitful, revealing significant differences in performance among the six products.
For hospitals and healthcare providers, choosing the algorithm that best fits their needs can be a challenge. The average doctor doesn’t have a supercomputer on hand, and there is no Consumer Reports for AI.
“We don’t have standards,” said Jesse Ehrenfeld, immediate past president of the American Medical Association. “There is nothing I can point you to today as a standard for how to evaluate, monitor, and observe the performance of algorithmic models as you deploy them, whether they are AI-enabled or not.”
Perhaps the most common AI product in doctors’ offices is something called ambient documentation, a technology-enabled assistant that listens to and summarizes patient visits. Last year, investors at Rock Health tracked $353 million flowing into these documentation companies. But, Ehrenfeld said, “Currently, there is no standard by which to compare the output of these tools.”
And that’s a problem when even a small error can be devastating. A team at Stanford University tried using large language models, the technology underlying popular AI tools such as ChatGPT, to summarize patients’ medical histories. They compared the results with what the doctors wrote.
“Even in the best case, the model had a 35% error rate,” said Stanford’s Shah. In medicine, “If you’re writing a summary and forget one word, like ‘fever,’ that’s a problem, right?”
In some cases, the reasons an algorithm fails are fairly logical. Changes to the underlying data can erode its effectiveness, for example, when a hospital switches lab providers.
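The “routine tech checkup” that caught Penn Medicine’s decaying algorithm hints at what ongoing monitoring can look like in practice. Below is a minimal sketch, in Python, of that kind of check: comparing a risk model’s accuracy on recent patients against an earlier baseline. The file names, column names, and alert threshold are assumptions for illustration, not details of any hospital’s actual system.

```python
# Minimal sketch of a routine model "checkup": compare how well a risk model
# separates patients who died from those who didn't, now versus at baseline.
# File names, column names, and the 5-point threshold are illustrative only.
import pandas as pd
from sklearn.metrics import roc_auc_score

baseline = pd.read_csv("predictions_baseline.csv")  # hypothetical validation-era scores
recent = pd.read_csv("predictions_recent.csv")      # hypothetical current scores

auc_then = roc_auc_score(baseline["died_within_6mo"], baseline["risk_score"])
auc_now = roc_auc_score(recent["died_within_6mo"], recent["risk_score"])

drop = (auc_then - auc_now) * 100  # percentage-point decline in discrimination
if drop >= 5:
    print(f"Model performance fell {drop:.1f} points; flag for review.")
```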
But sometimes the pitfalls yawn open for no apparent reason.
Sandy Aronson, a technology executive with Mass General Brigham’s personalized medicine program in Boston, said that when his team tested an application meant to help genetic counselors find relevant literature about DNA variants, the product suffered from “nondeterminism”: asked the same question multiple times within a short period, it gave different results.
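One way teams probe for that kind of instability is simply to repeat the identical question and compare the answers. The sketch below is illustrative only; `ask_model` is a hypothetical stand-in for whatever tool is under test, not part of the application Aronson’s team evaluated.

```python
# Rough sketch of a nondeterminism test: send the identical prompt several
# times in quick succession and count how many distinct answers come back.
def ask_model(prompt: str) -> str:
    # Hypothetical stand-in: wrap the tool under test here.
    raise NotImplementedError

def count_distinct_answers(prompt: str, trials: int = 10) -> int:
    answers = {ask_model(prompt) for _ in range(trials)}
    return len(answers)

# A deterministic tool should return 1; anything higher means the same
# question is producing different results.
```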
Aronson is excited about the potential for large language models to summarize knowledge for busy genetic counselors, but says “the technology needs to improve.”
What is an institution to do when metrics and standards are sparse and errors can crop up for strange reasons? Invest a lot of resources. At Stanford, Shah said, it took eight to 10 months and 115 person-hours just to audit two models for fairness and reliability.
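Part of what makes such audits labor-intensive is that the same performance checks have to be repeated across patient subgroups. The fragment below is a simplified illustration of one such subgroup comparison, with made-up file and column names rather than anything from Stanford’s audit.

```python
# Simplified illustration of one piece of a fairness audit: compute the same
# accuracy metric separately for each patient subgroup and look for gaps.
# The data file, column names, and grouping variable are assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("audit_sample.csv")  # hypothetical labeled audit data

for group, subset in df.groupby("patient_subgroup"):
    if subset["outcome"].nunique() < 2:
        continue  # AUC is undefined when a subgroup has only one outcome class
    auc = roc_auc_score(subset["outcome"], subset["model_score"])
    print(f"{group}: AUC = {auc:.3f} (n = {len(subset)})")
```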
Experts interviewed by KFF Health News floated the idea of artificial intelligence monitoring the artificial intelligence, with some (human) data experts monitoring both. All acknowledged that this would require organizations to spend even more money, a difficult ask given the realities of hospital budgets and the limited supply of AI technology specialists.
“It’s great to have a vision where we’re melting icebergs in order to have a model monitoring the model,” Shah said. “But is that really what I wanted? How many more people are we going to need?”