Large language models can analyze vast amounts of interventional radiology safety data and draw conclusions from it, helping IR specialists design safety interventions.
Experts from the University of Toronto, writing in the Canadian Association of Radiologists Journal [1], detailed that huge amounts of medical device adverse event data accumulate every day. In 2022 alone, the U.S. Food and Drug Administration's database recorded approximately 3 million incidents, most of them reported in free-text fields.
“Human analysis of these databases to gain meaningful insights is limited by a variety of factors, including expertise, time required, lack of uniformity, and fatigue,” Blair E. Warren, a radiology resident in the university’s medical imaging department at the time of the study, and his colleagues wrote on August 21. “As a result, databases housing important safety information may be underutilized.”
In the study, Warren and his coauthors analyzed FDA adverse event data related to thermal ablation, an IR procedure that uses microwave energy to destroy tumor tissue. The sample included 1,189 cases reported in the United States between 2011 and 2021, with three residents sorting the reports and an IR fellow reviewing the final tally.
GPT-4 was trained on a set of 25 IR adverse events, validated on 639 events, and tested on 79 events. OpenAI's large language model demonstrated high accuracy in classifying interventional radiology cases, achieving 86.4% accuracy on the larger validation set.
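For readers curious how such a classification pipeline might look in practice, a minimal sketch follows using the OpenAI Python client. The category labels, prompt wording, model choice, and example narrative are illustrative assumptions and are not taken from the study.

```python
# Minimal sketch: classifying a free-text adverse event narrative with an LLM.
# The categories, prompt, and example report below are hypothetical, not the
# study's actual protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIES = ["mechanical failure", "thermal injury", "bleeding", "infection", "other"]

def classify_report(narrative: str) -> str:
    """Ask the model to assign exactly one category label to a free-text report."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for reproducible labeling
        messages=[
            {
                "role": "system",
                "content": (
                    "You classify interventional radiology adverse event reports. "
                    f"Reply with exactly one label from: {', '.join(CATEGORIES)}."
                ),
            },
            {"role": "user", "content": narrative},
        ],
    )
    return response.choices[0].message.content.strip().lower()

# Example usage on a hypothetical MAUDE-style narrative
print(classify_report(
    "During microwave ablation the antenna tip fractured and a fragment was left in situ."
))
```

Labels produced this way could then be compared against human-assigned categories on a held-out set to estimate accuracy, in the spirit of the validation and test splits described above.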
“We suggest the possibility of using LLMs [large language models] to emulate human analysis and process large amounts of IR safety data as a tool for clinicians,” the authors advised.
Mechanical failure was among the most common adverse events, and probe tip breakage in particular is frequently reported in the medical literature on microwave ablation. These events are often caused by improper or excessive force applied to the probe. In nearly half of the breakages (42.7%), the medical professional left the probe tip in place. However, because of the “limited” data available in the FDA database, Warren and his co-authors were unable to determine long-term outcomes.
“This study demonstrates the feasibility of using AI to generate reports on data that might otherwise go under-evaluated,” the authors concluded. “Importantly, automated analyses of these data can be produced by non-AI experts, and the resulting LLMs could act as early detectors to identify important insights that might not have otherwise been explored. This proposed co-implementation of AI and humans is low risk given that the data is most likely already present, with AI being used to augment human analysis, with ongoing oversight by experts as a final safeguard.”
For more information, including potential study limitations, please see the links below.


