In today’s column, I am continuing my ongoing series about the impact of generative AI in the health and medical realm. This time, the focus is on using generative AI and large language models (LLMs) to generate summaries in a medical or health context. This is an emerging topic of weighty importance since to some degree you could compellingly argue that the matter has life-or-death consequences or at least the potential for doing more harm than good.
This is a dual-edged sword proposition.
Making use of generative AI to produce summaries can be a huge time saver and serve as a significant aid to harried doctors and busy medical professionals. Lamentably, if the summary is off-kilter or otherwise fails to portray the source content aptly, it can mislead doctors and medical professionals, potentially undermining the quality of care and leading to adverse outcomes.
In that sense, generative AI-produced medical-oriented summaries need to comply with the golden rule: First, do no harm. That is the million-dollar or billion-dollar question underlying the decision to use generative AI for summarization in a medical or health context. Some might be tempted to jump in quickly and leverage the astounding capabilities of generative AI. What they might not be considering is that besides the chances of misstating keystone medical or health content, which could lead medical doctors astray, there is the legal liability that this opens. For my coverage of the coming wave of potential medical malpractice due to the infusion of generative AI into the clinical process, see the link here.
I will walk you herein through the ins and outs of this topic, along with showcasing various examples by making use of the widely and wildly popular generative AI app known as ChatGPT. I have done a similar series of generative AI-focused medical and health articles, such as one showcasing an in-depth analysis of generative AI used in performing medical differential diagnoses and being an aid to clinical decision-making, see the link here. My emphasis throughout is that we should both embrace generative AI and yet also be mindful of how and when to sensibly use generative AI. I believe in taking an eyes-wide-open approach. My point is that we should not fear or neglect the use of generative AI, nor should we blindly leap into the use of generative AI.
Let’s take a balanced viewpoint and make sure that especially when using generative AI in a medical or health setting, we are doing so with the greatest possible awareness and suitable checks and balances.
As an aside, and in case this topic generally interests you, I have also covered the use of generative AI for boosting empathy on the part of medical students and medical doctors, see the link here. Additional explorations include an extensive set of assessments about the use of generative AI for mental health therapy, such as the coverage at the link here and the link here, just to name a few. The application of generative AI to the medical and health arena is promising, growing rapidly, and proffers a lot of benefits that need to be carefully weighed against the existent limitations and downsides.
Let’s get started on today’s topic.
What’s The Deal On Using Generative AI For Summarization
In my longstanding and ongoing coverage of the latest in generative AI, I have repeatedly indicated that summarization is considered one of the “Top 5” capabilities of generative AI. In a sense, generative AI is a summarization machine. Being able to summarize content is a core competency, if you will, of how generative AI works. It is part and parcel of the inherent design of generative AI.
I will later on herein provide you with background about the technical underpinnings of generative AI so that you’ll have a closer appreciation for why summarizing is so near and dear to how generative AI works. Right now, I am going to briefly cover some key precepts about summarization and explain what you need to keep in mind when using generative AI to undertake summarizations. I do so for a very important purpose.
Here’s why.
There are many coming onto the bandwagon of using summarization in a medical or health context, but they are often unaware of or frankly clueless about the prompt engineering that will best produce summaries while using generative AI.
Allow me to briefly elaborate on that point.
When you use generative AI, you enter prompts. For example, I might enter a prompt that tells generative AI to summarize a medical note that I am going to feed into the AI. Most everyday users of generative AI would merely type in a prompt that says to summarize the content. They don’t realize that you should include additional nuances and directions in your prompt if you want to try and maximize the aptness of the summary.
A new field of focus is known as prompt engineering and entails vital tips and insights about how to compose your prompts. Just writing prompts off the top of your head is not a good idea. The odds are that the results you will get from the generative AI are going to be a lot less useful than you might have otherwise derived. I am sad to say that many generative AI users are completely oblivious to the strategies and tactics for devising sound prompts. There is a dearth of awareness about prompt engineering.
You can easily add medical doctors and medical professionals into the blanket assertion that they too are generally unaware of prompt engineering. This is not surprising. They already have hectic lives. The idea of them getting up-to-speed about prompt engineering is something that few would think to do and might also seem like a faulty use of their precious time. One might insist that for the time spent on learning prompt engineering, perhaps they could have gotten additional training on heads-up medical procedures or hands-on clinical tasks.
The problem we face is that if medical doctors and medical professionals are going to be using generative AI, do we want them to do so without any semblance of what they are doing?
That’s a tough question, but it needs to be squarely asked. The assumption by many is that you just log in and start typing whatever comes to your head. Sure, you can do that. The average everyday user of generative AI likely does that. Do we want the same to occur when a medical matter is at hand and the use of generative AI is being blended into medical work?
I dare say this is a chancy proposition.
Let’s strive to increase awareness and avoid the base assumption of typing whatever strikes your fancy when seeking to use generative AI, including and particularly when doing so for summarization.
Along those lines, I am going to share with you some of my prior coverage on prompt engineering and especially the crucial insights about composing prompts when seeking to have generative AI produce summaries. For example, one of my more advanced analyses reveals a special formulation of prompting known as the chain-of-density technique, see the link here. I am going to excerpt here some of my prior discussions about AI summarization and immerse them herein into the particular domain of a medical and health context. We need a lot more of that type of cross-over, namely taking the already known methods for summarization in generative AI and parlaying those into the medical and health arena.
As a side note, researchers examining summarization via generative AI in a medical or health context should also take into account prompt engineering. I say this because much of the existing medical-oriented research on summarization via generative AI is so far staying at the surface level of prompting. This makes abundant sense because that’s how the typical doctor or medical professional is likely to proceed. At the same time, looking ahead, it would be productive to engage in research that anticipates more sophisticated use of generative AI by those who are hopefully or purposely versed in prompt engineering. We need to methodically assess which prompt engineering techniques pay off for work performed in a medical or health setting.
Okay, we can proceed ahead herein, and I’ll step down from the soapbox.
Talking About Summarization And Why It Is Dicey
We all daily summarize things.
This is the basic nature of the human condition. There is lots and lots of stuff that we see and hear, and summarizing that voluminous material is a necessity in life. Some people are good at summarization. Some people are lousy at it. I’m sure you’ve encountered people who tell you a summary and afterward, you discover that they left out tons of essential elements.
Summarization is not a surefire task. A summary is readily done but also readily torn apart for not being sufficient, complete, or proper. We learn this in school. Yet, the best summarizations can still be critiqued. It is a never-ending consideration. The idea of producing the “perfect” summary is somewhat of an illusion. The eye of the beholder plays a big part in deciding whether a summary is suitable or usable. The aim is to produce a summary that fits the circumstances at hand and does so in the best or maximum viable way given whatever constraints are faced and what the goal or objectives of the summary are.
I’ll start by pointing out that the source material is a huge factor in doing a summarization. What if the source materials contain falsehoods, errors, and the like? Should the summary simply dutifully carry those into the summary, reporting precisely what was said, or should the summary attempt to correct or highlight what is perceived as flaws in the source?
Right there, you can get into a heated debate.
Some exhort that a summary is not supposed to take sides. It is only a summary. Whatever the source material says, that’s what should come forth in the summary that is produced. Trying to add opinions or claim corrections is not what a summary is supposed to be. You are tainting the summary.
A retort or counterargument is that the summary will be misleading since it is presumably going to contain falsehoods and errors that were in the source material. Someone relying on the summary might assume that the falsehoods and errors have already been fact-checked by whomever or whatever did the summarization. Thus, a summary must either correct the found issues or at least spotlight them.
No way, the response goes. You are forcing the summarizer into a non-neutral corner. The summarizer is now a judge and jury. They are deciding what is good and right, and what is bad and wrong, and in doing so they have abandoned the sacred duty of a summarizer. A summarizer is obligated to condense the source and not flavor the condensed result based on their own preferences.
I trust that you can see the dilemma here.
I want to now immerse this into a medical context.
A doctor receives a summary of a medical file about a patient that the physician has never met and doesn’t know. The physician carefully reads the summary. They dutifully examine the summary. We are confident that the doctor didn’t merely skim the summary or only give it a half-hearted glance.
What does the doctor assume about the nature of the summary that they just read?
They might assume that the summary is suspect, and they should always be on their guard. Or the doctor might be assuming that the summary was supposed to have been carefully composed and therefore is highly reliable. Whatever is stated in the summary is fully and faithfully based on the source content.
Okay, let’s use this discourse to highlight some facets of summarization.
First, we have these four key considerations:
- (1) Source of the Summary. The nature of the source content that is used as the basis for the summarization.
- (2) Process of Summarization. The summarization endeavor or process being performed either by human hand or by machine (or both).
- (3) Summary Product as an Output. The finalized summary that was devised, possibly also including indications about how the summary was composed.
- (4) Summary Receiver. The person that is consuming the summary such as a medical doctor, though the summary could also be used without human receipt as fed directly into a system such as an AI system.
Along the path of those four stages, there can be miscommunication and a confounding unfortunate twist to the circumstances. For example, returning to the case of the doctor who has closely read a summary, suppose that they received the summary but have no pairing indication about how the summary was produced. In that instance, they don’t know what the summary process did. The doctor is in the dark as to whether this is a strictly word-for-word kind of summary, or whether the summary process interceded and opted to “fix” presumably detected falsehoods or errors in the source content.
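One way to picture the missing “pairing indication” is a record that travels with the summary and states how it was produced. The following is only a minimal sketch in Python. The class name `SummaryRecord`, its fields, and the sample values are all hypothetical and chosen for illustration, not drawn from any real medical system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SummaryRecord:
    """A summary bundled with a description of how it was produced (hypothetical layout)."""
    summary_text: str
    source_id: str        # identifier of the source content that was summarized
    produced_by: str      # e.g. "human", "generative AI", or "both"
    style: str            # e.g. "extractive" or "abstractive"
    corrections: List[str] = field(default_factory=list)  # changes made relative to the source

    def provenance_note(self) -> str:
        """Return a short statement the receiver can read alongside the summary."""
        note = (
            f"Summary of {self.source_id}, produced by {self.produced_by}, "
            f"style: {self.style}."
        )
        if self.corrections:
            note += f" Changes relative to the source: {len(self.corrections)}."
        else:
            note += " No changes relative to the source were recorded."
        return note


# Illustrative values only.
record = SummaryRecord(
    summary_text="Patient reports intermittent chest pain.",
    source_id="note-001",
    produced_by="generative AI",
    style="abstractive",
    corrections=["Reworded an ambiguous sentence about symptom onset"],
)
print(record.provenance_note())
```

With a note like this attached, the receiver at least knows whether the summary process claims to have altered anything, which is the very thing the doctor in the example above cannot tell.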
We have the challenge that a summary can be a bearer of false positives and false negatives. If the summary process has opted to “correct” what was in the source, this might be good since perhaps errors aren’t being handed unabated to the doctor. On the other hand, suppose that an intended correction actually introduces an error or falsehood into the summary?
The doctor won’t know which is which, namely if they spot an error, was this in the original content, or was this something contrived by the summarizing? You might argue that if the doctor isn’t given “corrected” material in a summary they might fail to spot the errors and thus the summary is taking them down a primrose path. Back and forth this goes.
A dire challenge exists, which is a longstanding issue about summarizations of any kind. In a medical or health context, the stakes are immensely raised.
Summaries are often typified as either being extractive or abstractive.
Let’s see what that portends:
- (1) Extractive Summary (per verbatim). The summary toes the line and strives to not change, correct, or otherwise impart an assessment or evaluation of the source content.
- (2) Abstractive Summary (per analysis). The summary seeks to identify gaffes, issues, or other concerns in the source and attempts to remove, modify, or correct them, and the corrected result is what then appears in the summary.
Reflect on the extractive versus abstractive approaches.
During the summarization process, there are two possible routes to go overall. You might aim to be extractive, primarily extracting key aspects and shoveling those into the summary. Or, instead, you might be abstractive, whereby you go beyond the words themselves of the original content and begin to reinterpret or perhaps elaborate beyond what the source per se has to say, including potentially changing the meaning, making corrections, and so on.
Once you’ve started down the path of abstractive, in a sense you have to declare that the entire summary is now abstractive (the taint is there). Some suggest you can combine the two approaches and mark which part of the summary is extractive and which part is abstractive. Sure, that’s a potential approach, though this also has arduous tradeoffs that need to be considered.
A purely extractive summary is more likely to be construed as a fair and balanced reflection of the original content. You are not changing things up. You are only carrying the essentials (elements or entities) over into the summary. But will the receiver of the extractive summary be misled into assuming that an abstractive approach was undertaken when in fact it was extractive?
As already observed, the problem with an abstractive summary is that you are potentially changing things and will be biasing or in some manner altering the meaning found within the original content being summarized. The danger is that this kind of summary is no longer seen as fair and balanced, and instead is based on the opinions of the summarizer. At the same time, on the upside of abstractive, you can proclaim that the summary is now value-added. It is more than merely regurgitation. It is an improvement upon what the source contained, though this assumes that the summary has properly made the abstractive changes.
I’ll let you in on a secret.
When it comes to using generative AI, unless you use suitable prompting, you cannot be sure what kind of summary will be generated. As I always warn, using generative AI is like a box of chocolates, whereby you never know what you are going to get.
Simply asking for a summary of the text might get you an extractive version, or might get you an abstractive version, or you end up with a Frankenstein version. You won’t necessarily know.
In your mind, maybe you are thinking about extractive, but the generative AI gives you abstractive. Perhaps you were thinking of abstractive, though the generative AI gave you extractive. The nature of your prompt is going to greatly determine the type of summary; thus, you need to do more than just naively ask for a summary. You must knowingly declare additional details in your prompt to get a summary that you have a decent shot at knowing what happened during the black box summarization process.
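To make the idea of declaring additional details concrete, here is a minimal sketch in Python of a prompt builder that states the intended summary style outright rather than naively asking for a summary. The function name `build_summary_prompt`, the instruction wording, and the sample note are my own hypothetical choices. This is not a validated prompt for clinical use, and it does not call any generative AI service.

```python
# Hypothetical instruction wording for each summary style.
STYLE_INSTRUCTIONS = {
    "extractive": (
        "Use only statements found in the source. "
        "Do not correct, reinterpret, or add anything."
    ),
    "abstractive": (
        "You may rephrase and correct suspected errors, "
        "but label every change you make as your own."
    ),
}


def build_summary_prompt(source_text, style="extractive", simplify=False):
    """Compose a summarization prompt that declares the style and whether to simplify."""
    if style not in STYLE_INSTRUCTIONS:
        raise ValueError(f"style must be one of {sorted(STYLE_INSTRUCTIONS)}")
    wording = (
        "Simplify the wording for a non-specialist reader."
        if simplify
        else "Do not simplify the wording; keep the clinical terminology."
    )
    return "\n".join([
        "Summarize the medical note below.",
        f"Summary style: {style}. {STYLE_INSTRUCTIONS[style]}",
        wording,
        "--- SOURCE ---",
        source_text,
    ])


# Illustrative note text only.
print(build_summary_prompt("Patient reports chest pain on exertion.", style="extractive"))
```

The design choice is simply that the style is a required, visible part of the prompt, so whoever reads the prompt later can see what was asked for.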
I will add something else that might send chills up your spine. Even if you use the recommended prompting approaches, you are still not ironclad guaranteed about what the generative AI summary is going to contain. You have materially guided the generative AI, which is handy, but this does not mean that the result is going to adhere unerringly to your instructions. Generative AI works on a probabilistic and statistical basis. This means that a given prompt will in one moment do one thing, while in a different moment can have a quite different effect.
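One modest way to see this variability for yourself is to run the same prompt several times and measure how much the outputs differ. The sketch below, in Python, assumes a caller-supplied function `generate` that takes a prompt and returns text. That function is a stand-in for whatever generative AI app you use and is purely hypothetical here. The similarity score comes from Python’s standard `difflib` module and measures only surface-level text overlap, not medical equivalence.

```python
from difflib import SequenceMatcher
from itertools import combinations


def summary_spread(generate, prompt, runs=3):
    """Run the same prompt several times and report the lowest pairwise similarity.

    `generate` is any callable mapping a prompt string to an output string.
    Returns (outputs, lowest_similarity), where similarity is between 0.0 and 1.0.
    """
    outputs = [generate(prompt) for _ in range(runs)]
    scores = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(outputs, 2)
    ]
    lowest = min(scores) if scores else 1.0
    return outputs, lowest


# Stand-in generator that returns canned replies, used only to exercise the function.
canned = iter([
    "Patient has a mild fever.",
    "Patient has a mild fever.",
    "Fever noted; recommend follow-up.",
])
outputs, lowest = summary_spread(lambda prompt: next(canned), "Summarize the note.")
print(f"Lowest pairwise similarity across {len(outputs)} runs: {lowest:.2f}")
```

A low score does not tell you which output is right. It only signals that the outputs diverged and deserve a closer look.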
Please keep that in mind.
I want to also bring up a few other notable points on these complex matters.
Consider the nature of the source content that is intended to be summarized:
- (1) General text to be summarized (overall stories, narratives, etc.)
- (2) Domain-specific text to be summarized (e.g., medical, legal, financial)
Generative AI is usually generically devised. This means that it isn’t domain-specific, which I’ll tell you more about shortly. You can pretty much use generative AI for summarizing general text and likely be roughly satisfied with the result. Doing the same with domain-specific text, such as medical notes that contain medical verbiage, well, there are chances that the domain particulars might get lost, discombobulated, or otherwise not be well-summarized.
Another consideration is the importance of a summary:
- (1) Nice to have.
- (2) Modicum of importance.
- (3) Very important.
We should also be thinking about the potential impact of a summary:
- (1) Relatively inconsequential.
- (2) Modicum of consequences.
- (3) Quite serious consequences.
- (4) Extremely vital consequences (potentially life-or-death)
Let’s take a moment and do a bit of picking and choosing from the above lists.
I want to do a summary about how to play the game Monopoly. This is a general topic and not medically steeped. The summary is nice to have but not a vital need. The summary is somewhere between relatively inconsequential to just a modicum of consequences (I might read the summary and go into my playing of Monopoly with a flawed understanding, embarrassing myself in front of my friends). I am going to assume that the summary is extractive or verbatim style (but, if the summary is based on a blog that is written by someone clueless about how to play Monopoly, the verbatim summary is likely to mislead me about the rules of the game!).
All in all, the aim of a summary in this Monopoly game instance is probably fine for generative AI and you aren’t betting the world on the result.
Shift gears and go back to the medical doctor who has read a summary of a patient’s file. In that instance, the topic is not a generalized one. It is steeped in medical jargon and a medical context. The summary is likely to be considered very important in the sense that it is aimed to aid or be instrumental to the doctor and reduce their laborious effort to otherwise inspect the file directly. I assume we can all agree that the setting is one entailing quite serious consequences or perhaps extremely vital consequences, depending upon what the patient has come to see the doctor about.
Do you see how the ante has gone up substantially by using the summarization in a medical setting for a certain kind of purpose?
There are medical doctors and other medical professionals who have played with generative AI in their outside work time and discovered how handy generative AI is for the summarization of broad generalized content. Maybe they did a summary of some historical content about Lincoln that their son or daughter needs for school. Perhaps the doctor is learning how to dance the Samba and used generative AI to summarize what the dance moves are.
Out of this playfulness, they begin to think and believe that the same generative AI would be useful at work. They see the time savings and the realization that they can spend more time with their patients and less time on protracted reading or researching. What they don’t tend to see is that the summarization in a playful context is a lot less demanding.
Doing summarization in a medically steeped context that can entail crucial concerns of medical diagnoses and medical recommendations is a far cry from learning the Samba or doing summaries about the life and times of Abraham Lincoln.
Here’s another essential angle to this too.
One confusion that sometimes gets in the way of thinking about summaries is the matter of summarization versus simplification. Do not unduly equate those two.
A summary doesn’t necessarily have to be a simplification. It could be that whatever complexity existed in the source is going to also come across in the summary. Simplification is a type of transformation involving simplifying one thing to be more readily accessible or understandable. A summary doesn’t have to be a simplification.
If you want the summary to be simplified, you will usually need to ask for that to be undertaken. Remember that I said that the generative AI is like a box of chocolates, such that the AI might do a simplification as part of the summarization. You might not have asked for a simplification outright. Nonetheless, the AI opted to go that path.
I am guessing now that you are becoming cognizant of the many odds and ends, twists and turns, nuances and dangers, involving the invoking of summarizations.
Let’s take a closer look at summarization in a medical domain setting.
Summarization In A Decidedly Medical Or Health Context
I have indicated that there are numerous caveats about using generative AI to do summaries associated with medical and health content. That is true. The thing is, the assertions about generative AI need to be considered in light of the other ways that summaries get derived. Generative AI is not the only game in town, and dissing it alone is one-sided. We can likewise illuminate analogous problems with other means of producing summaries. The reality is that generative AI has tradeoffs and so do the other means.
The question is not well-posed if it considers generative AI in isolation from other means. Instead, generative AI needs to be considered on a head-to-head basis with the alternatives.
Consider these possibilities of how medical or health summaries could be devised:
- (1) Self-devised by hand. A doctor makes their own summary by hand.
- (2) Reliant on another person. Doctor relies on a human-devised summary (other than their own version).
- (3) Use of conventional tech. The doctor relies on non-AI conventional tech that produces summaries.
- (4) Leaning into generic generative AI. Doctor relies on generic generative AI that derives summaries.
- (5) Leveraging domain-specific generative AI. Doctor relies on medical-domain generative AI that is customized to produce summaries.
The first case consists of a medical doctor crafting their own summary by hand. They opt to read some obtained medical material and then summarize it. Perhaps the summary is for their own benefit. They want to later on recall what the source content had to say; thus, they have devised a perfectly tailored summary based on their own tastes.
The good news is that they presumably had to read the entire content to produce the summary. This then contrasts with being handed a summary and not seeing the source material. Whatever written summary they have devised by reading the source might be more so a trigger for remembering the rest of the material later on, rather than only serving as a summary per se.
There are plenty of downsides to this. The doctor might be using their time poorly. Crafting a summary is not necessarily a considered productive activity for a doctor, especially if other viable means to produce a summary are available. Furthermore, we cannot assume that the doctor is good at producing a summary. I’ve seen doctors who later looked at their own summaries and complained that they did a lousy job. Yes, they acknowledged that generating summaries is not in their wheelhouse.
Next up in my above-listed use cases is when a summary is produced by hand via the efforts of someone else. The someone else can range extensively. Perhaps a medical doctor opts to write summaries and make those available to other physicians. Another possibility is that companies hire doctors to craft summaries. There are non-doctors who are medical professionals who write summaries. There are also writers not versed in medicine who write medical-focused summaries. On and on this goes. It is highly variable and a widely cast net, for sure.
Do summaries that are produced by hand ensure an absolute guarantee that they will shine above summaries produced by a computer system such as generative AI?
Don’t fall for that outsized outstretched line that humans are always better at summarizing.
A human-produced summary can have the same flaws that I earlier mentioned about what can happen when you use generative AI to craft a summary. Humans are not perfect at summarizing. I bring this to your attention to stridently emphasize that if you only want to ding the AI-generated summaries, you have to look further and acknowledge that human-produced summaries can be messed up too.
The bottom line is that you must undertake a relative comparison of how summaries are generated.
The watchful question to be asked is this:
- Is a given method of producing a summary, regardless of whether it is human accomplished or machine accomplished, better than, the same as, or worse than each other method?
Making a broad-strokes claim is problematic. I can easily find human-derived summaries that can be outclassed by using AI. No doubt about it. In the same breath, let’s acknowledge that it is possible to find AI-derived summaries that are outclassed by human-derived summaries. You would be hard-pressed to contend that one method is always better than the other across the board.
This brings up another looming question, namely what does it mean to say that one summary is better or worse than some other summary?
Imagine that we have a set of medical notes that we want to summarize. We do so by hand. We also do so via using generative AI. Great, we have two summaries that can be rated on a head-to-head basis because they are based on the same source material.
The game is on!
Mull over for a moment the criteria or characteristics that you would use to compare the two summaries.
I’ll wait.
Some obvious factors include whether the summary aptly states what was in the source material, whether the summary has omitted key elements that were in the source, and whether the summary contains biases that were not contained in the source. The list goes on. A typical way to think about this is to consider at least the three vaunted Cs of correctness, completeness, and conciseness.
When I discuss summarization by generative AI, I usually note that these are the five major issues typically encountered when devising a proper summary:
- (1) Omission of key element(s)
- (2) Misrepresentations of key element(s)
- (3) Inappropriate use of key element(s)
- (4) Lack of proper context for key element(s)
- (5) Made-up or AI hallucinatory key element(s)
My detailed coverage of those issues can be found at the link here.
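Two of the five issues listed above, namely omissions and made-up elements, lend themselves to a crude automated screen. The following Python sketch is only an illustration under heavy assumptions. The function names, the list of key terms, and the sample texts are hypothetical, and simple string matching cannot catch misrepresentation, inappropriate use, or missing context. Treat it as a first-pass flag and not as a substitute for a careful read.

```python
import re


def find_numbers(text):
    """Return the set of numeric tokens (such as doses or lab values) found in the text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def screen_summary(source, summary, key_terms):
    """Flag possible omissions and unsupported numbers using plain string checks.

    omitted_terms: key terms present in the source but absent from the summary.
    unsupported_numbers: numbers in the summary that never appear in the source.
    """
    source_lower = source.lower()
    summary_lower = summary.lower()
    omitted = [
        term for term in key_terms
        if term.lower() in source_lower and term.lower() not in summary_lower
    ]
    unsupported = sorted(find_numbers(summary) - find_numbers(source))
    return {"omitted_terms": omitted, "unsupported_numbers": unsupported}


# Illustrative texts only, not real patient data.
source = "Patient takes metformin 500 mg twice daily. Allergic to penicillin."
summary = "Patient takes metformin 850 mg daily."
print(screen_summary(source, summary, key_terms=["metformin", "penicillin"]))
```

In this made-up example, the screen would flag that the penicillin allergy was dropped and that the number 850 does not appear anywhere in the source.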
I will be showing these issues in the generative AI ChatGPT examples toward the end of this discussion in light of medical summarization. Hang in there, the wait to see this will be worth it.
I’d like to also mention herein that those same five issues can occur when a human hand devises a summary. Humans are not immune to making those same summarization gaffes. This includes item number five, such that a human can make up stuff that wasn’t in the source and yet portray the made-up notions in the summary as though the source did contain the made-up material.
In a moment, I will discuss research that has examined the use of generative AI for producing summaries in a medical or health context. Before I do so, there are a few more lingering comments that are worthy of identifying.
First, let’s talk about cost.
Every summary has a cost. If you do summaries by hand, presumably a person was paid to make the summary. If you use generative AI, you likely must pay for the use of the computing resources that run the AI. A doctor who does a summary for themselves is bearing a cost too. They either aren’t getting paid during that summarizing act or they are at least incurring an opportunity cost, since that time could have been used to get paid on a potentially greater basis.
A smarmy retort might be that sometimes summaries are handed out for free. For example, a firm that is trying to get the attention of doctors might be posting summaries that are accessed for free, luring the eyeballs to a website or product advertisement. I get that. But I ask you this, wasn’t there a cost in making the summary? The firm is eating the cost in hopes of catching the big fish. Thus, there is still a cost somewhere in the swing of things.
Cost matters. If the cost to derive a summary is less via one of the aforementioned methods than some other method, you then need to weigh the cost into deciding which method to choose. On top of this, naturally, you need to weigh the quality, availability, etc. My point here is that a lot of the comparisons of the summarization methods fail to mention the cost differences. Cost is a crucial tradeoff factor and cannot be realistically neglected in the grand equation of summarization method comparisons.
Second, the type of medical or health content that could be usefully summarized is wide-ranging.
So far, I mentioned the idea of doing summaries of medical notes, and I also brought up the possibility of summarizing a patient file. The list of medical or health materials that might be summarized is almost endless. Medical reports can be summarized. Recorded or transcribed interactions of dialogues between a patient and a doctor can be summarized. Materials found within Electronic Health Records (EHR) and Electronic Medical Records (EMR) can be summarized. Text messages from or to patients can be summarized. Frankly, it is hard to conceive of anything that would not lend itself to some form of summarization.
Not all such content requires the same form of summarization. The summarization style can vary. The same with the ease of summarization. Some of the content might be highly challenging to summarize, while other content might be easy-peasy to summarize. The cost to undertake a summary will also vary depending on the source material, the nature of the material, and the like.
Make sure to always have costs at the forefront of these hefty matters.
Let me surprise you with two facets of summarization that tend to give generative AI an edge over the other methods:
- (1) Generative AI provides summarization interactivity (if so desired).
- (2) Generative AI provides at-scale summarization (doing so “in the large”).
I’ll begin with interactivity.
Much of today’s conventional summarized medical or health content is undertaken on a static basis. There is no interaction involved. Here’s how things usually happen. Someone or something in a far-reaching location has been engaged to make a summary. The summary is a one-and-done affair. Once the summary is considered completed, it gets shipped out.
Here’s a revelation for you (maybe). A summary that is generated via generative AI can potentially be undertaken by a medical doctor or medical professional and then interacted with. Yes, please take note. I said that the summary can be interacted with. This is a stark contrast to a one-and-done summary.
The beauty of generative AI is that it is devised for interaction. Imagine this. A doctor enters a prompt to get a set of medical notes summarized. In the usual course of events, once the summary is delivered, that’s it. The summary is, shall we say, in the can. It is static. It won’t change. It is nailed to the perch (Monty Python reference!).
Instead, assuming the doctor is sitting at the screen and using generative AI, they can begin to ask questions about the summary. They can ask what else might be useful to know. They can tell the generative AI to redo the summary and highlight particular aspects. This is not a summary that is cast in concrete. It is malleable and changeable, based on what the doctor is interested in inquiring about.
Whoa, some might say, are you suggesting that a doctor should be wasting their time by poking around in generative AI and toying with a summary? This seems crazy and a total travesty in terms of the valuable use of the medical doctor’s prowess and skills.
Well, you are arguing the extreme case. There are doctors who might engage in interactivity and be doing so without a medically valid basis. They lose their heads and get knee-deep in interacting for the sake of interacting. I would suggest this happens only a tiny fraction of the time. My bet is that if a doctor is choosing to interact about the summary, they likely have a valid medical reason to do so. They usually don’t want to waste their time. They are more likely to shortcut things than to overshoot.
A compelling argument can be made that the capability of interacting with a summary can boost the significance of the summary. The quality-of-care decisions made by the doctor might be more well-informed. Rather than having to guess what the static version has left out or otherwise doesn’t state, the doctor can immediately and readily engage with generative AI about those concerns.
I am eagerly awaiting some of the latest research that is examining the difference between medical doctors and medical professionals who use the interactivity of generative AI when it comes to interrogating summaries versus the conventional static one-and-done non-interactive summaries. I’ll keep you posted.
My last point for this section of my discussion is the silent but all-important matter of scale.
Ponder for a moment the time and effort that it takes for humans by hand to craft summaries, especially in the medical or health domain. The labor is enormous. Labor for this task usually requires honed skills in the medical or health domain. You cannot easily upscale this. Remember that this is summarizing medical materials and not everyday non-medical stuff.
Via the use of generative AI, the sky is the limit.
The cost of using generative AI to produce summaries is (all else being equal) relatively low in contrast to human by-hand efforts. And the scaling factor is hugely better. You just add more servers and away you go. Trying to hire, train, retain, and keep track of humans who do this summarizing is arduous and not at all readily scalable. Machine beats human in this gambit of scalability.
In short, generative AI takes the cake for scalability.
Research On AI-Based Summarization In The Medical Domain
Recent research on this topic provides additional insights that I’d like to go over with you.
In a piece entitled “AI-Generated Clinical Summaries Require More Than Accuracy,” by Katherine E. Goodman, Paul H. Yi, Daniel J. Morgan, JAMA Network Viewpoint AI In Medicine, January 29, 2024, the researchers made these salient points (excerpts):
- “In the long term, LLMs may revolutionize much of clinical medicine, from patient diagnosis to treatment.”
- “In the short term, however, it is the everyday clinical tasks that LLMs will change most quickly and with the least scrutiny. Specifically, LLMs that summarize clinical notes, medications, and other forms of patient data are in advanced development and could soon reach patients without US Food and Drug Administration (FDA) oversight.”
- “Summarization, though, is not as simple as it seems, and variation in LLM-generated summaries could exert important and unpredictable effects on clinician decision-making.”
A noted concern observed by the research is that summarization in a medical or health context can vary tremendously and there aren’t robust across-the-board standards regarding such summarizations (excerpt):
- “Currently, there are no comprehensive standards for LLM-generated clinical summaries beyond the general recognition that summaries should be consistently accurate and concise. Yet there are many ways to accurately summarize clinical information. Variations in summary length, organization, and tone could all nudge clinician interpretations and subsequent decisions either intentionally or unintentionally.” (ibid)
I’d like to highlight the above point that a summary can nudge a medical decision-maker.
Let’s discuss nudges.
You might be tempted to assume that if a summary is going to lean in one direction or another it might be blatantly obvious and readily discerned by the receiver of the summary. The reality is that a summary can contain seamless subtleties that sugarcoat a semi-hidden bias. The effort on the part of the doctor to ferret out such biases might be high, and their attention to the summary might not be on alert to look for those subtleties. Imagine a busy physician quickly skimming a summary to get the core considerations. They might not be inclined to spot subtle nudges and only likely catch glaring ones.
Another notable qualm is that generative AI frequently has been tuned by the AI maker to be overly aimed to appease the user. I have discussed that one similar form of trickery entails tuning generative AI to appear to express humility, see my coverage at the link here, which lures users into believing the emitted responses. If the AI were tuned to have a harsh or irritating semblance of responses, you probably would be skeptical of the generated responses. The humility factor keeps your guard down.
Here’s a similar notion identified by the above-cited research study:
- “In particular, LLMs can exhibit ‘sycophancy’ bias. Like the behavior of an eager personal assistant, sycophancy occurs when LLMs tailor responses to perceived user expectations. In the clinical context, sycophantic summaries could accentuate or otherwise emphasize facts that comport with clinicians’ preexisting suspicions, risking a confirmation bias that could increase diagnostic error.” (ibid).
This is also a keen reminder that the nature of the prompt used to generate a summary is vital to the process. The odds are that if a prompt merely requests a summary and says nothing else about the desired nature of the summary, the appeasement default is going to take hold (along with a litany of other pre-tuned defaults). A prompt that explicitly indicates not to engage in those programmed sways is more likely to mitigate the matter. That being said, even the most carefully crafted prompt can still be waylaid since the generative AI could computationally opt to veer back into the affecting defaults.
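To make the prompting point concrete, here is a minimal sketch of a prompt template that spells out the desired nature of the summary instead of leaving it to the defaults. The function name `build_summary_prompt`, its parameters, and the instruction wording are my own illustrative assumptions. This is not a vetted clinical prompt, and no particular AI app's API is implied.

```python
def build_summary_prompt(source_text: str, max_words: int = 50) -> str:
    """Assemble a summarization prompt with explicit anti-appeasement instructions.

    Hypothetical helper for illustration only; the instruction wording is an assumption.
    """
    instructions = [
        f"Summarize the medical text below in at most {max_words} words.",
        "Use only facts stated in the text; do not add or infer new facts.",
        "Do not tailor the summary to what you think the reader expects.",
        "Keep a neutral tone and flag anything ambiguous in the source.",
    ]
    # Number the instructions so each one is easy to refer back to in a follow-up prompt.
    numbered = "\n".join(f"{i}. {line}" for i, line in enumerate(instructions, start=1))
    return f"{numbered}\n\nTEXT:\n{source_text.strip()}"


if __name__ == "__main__":
    note = (
        "Where on site does it tell how diabetes and neuropathy causes "
        "other things like neck and knee pain and other ailments."
    )
    print(build_summary_prompt(note, max_words=20))
```

The resulting text would be pasted or sent as the prompt. Treat it as a mitigation and not a guarantee.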
Let’s take a look at another research paper that also examined the clinical summarization topic.
In a study entitled “Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts” by Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerová, Nidhi Rohatgi, Poonam Hosamani, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, John Pauly, Akshay S. Chaudhari, Research Square, October 2023, the researchers made these points (excerpts):
- “Documentation plays an indispensable role in the practice of healthcare. Currently, clinicians spend significant time summarizing vast amounts of textual information—whether it be compiling diagnostic reports, writing progress notes, or synthesizing a patient’s treatment history across different specialists.”
- “Even for experienced physicians with a high level of expertise, this intricate task naturally introduces the possibility for errors, which can be detrimental in a field where precision is paramount.”
- “Recent work in clinical natural language processing (NLP) has demonstrated potential on medical text, adapting to the medical domain by either training a new model, fine-tuning an existing model, or supplying task-specific examples in the model prompt.”
- “However, adapting LLMs to summarize a diverse set of clinical tasks has not been thoroughly explored, nor has non-inferiority to humans been achieved.”
Please observe that the researchers have noted that medical doctors and medical professionals can make errors when doing summaries, which helps to highlight the importance of comparing human-devised summaries to whatever kinds of summaries generative AI can produce. None are perfect. We must keep at top of mind whether and in what settings generative AI will be better than, the same as, or worse than human-devised summaries, realizing too that human-devised summaries are not necessarily going to be superior (à la non-inferiority versus inferiority).
The researchers describe an empirical analysis in which generative AI produced summaries and physicians rated the human-devised versus AI-generated instances (excerpts):
- “Through a rigorous clinical reader study with ten physicians, we demonstrate that LLM summaries can surpass human summaries in terms of the following attributes: completeness, correctness, and conciseness. This novel finding affirms the non-inferiority of machine-generated summaries in a clinical context. We qualitatively analyze summaries to pinpoint challenges faced by both models and humans. Such insights can guide future enhancements of LLMs and their integration into clinical workflows.” (ibid).
- “Model hallucinations—or instances of factually incorrect text—present a notable barrier to the clinical integration of LLMs, especially considering the high degree of accuracy required for medical applications. Our reader study results for correctness illustrate that hallucinations are made less frequently by our adapted LLMs than by humans.” (ibid).
- “Beyond the scope of our work, there’s further potential to reduce hallucinations through incorporating checks by a human, checks by another LLM, or using a model ensemble to create a ’committee of experts’.” (ibid).
I have particularly included the points above about AI hallucinations to bring to the fore one of the most vocalized reasons that some insist we cannot use generative AI for generating medical or health summaries. A lot of handwringing takes place about this matter. Though reasonable angst is certainly justified, it is also often taken to extremes that belie the reality at hand.
As mentioned earlier, we need to maintain a balanced perspective and consider the full range of trade-offs when it comes to which methods or approaches to summarization are to be chosen.
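The “committee of experts” idea quoted above can be sketched in a very simple way: collect several candidate summaries and keep the one that agrees most with the rest. To be clear, this is my own simplified illustration using word overlap (Jaccard similarity), not the method of the cited researchers. The function names are hypothetical, and the third candidate is deliberately an off-topic stand-in borrowed from a different patient question.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercase the text and return its distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words that two sets have in common (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pick_consensus(candidates: list[str]) -> str:
    """Return the candidate with the highest mean word overlap with the others."""
    if not candidates:
        raise ValueError("need at least one candidate summary")
    if len(candidates) == 1:
        return candidates[0]
    sets = [word_set(c) for c in candidates]

    def mean_agreement(i: int) -> float:
        scores = [jaccard(sets[i], sets[j]) for j in range(len(sets)) if j != i]
        return sum(scores) / len(scores)

    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]


candidates = [
    "How can I get tested and treated for trimethylaminuria?",
    "What tests are available for trimethylaminuria and where can I get tested?",
    "What can diabetic neuropathy lead to?",  # off-topic stand-in
]
print(pick_consensus(candidates))
```

Agreement among candidates only screens out outliers. It says nothing about whether the surviving summary is faithful to the source, so a human check still matters.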
Let’s discuss AI hallucinations.
First, I disfavor that the media and even the AI field have adopted the seemingly catchy phrase “AI hallucinations” since the immediate implication is that AI is sentient and hallucinates as humans do, see my in-depth explanation at the link here of why this is abysmal anthropomorphizing of AI. Sadly, the phrase has become popular, and we are stuck with it for now.
Second, the matter is straightforward in that at times the generative AI will generate text that we would likely agree is not factual and appears to be made up. There are various reasons that this mathematically and computationally can occur, see my analysis at the link here. Efforts are underway to prevent or at least detect made-up or fictitious responses and already progress suggests that the frequency and magnitude of such incidents can be reduced to a great extent (I’m not saying this is solved, and only emphasizing that it is a known issue, it is being actively worked on, and that incremental progress is being made).
Third, as I stated earlier, humans can make up stuff too. I was careful to not suggest that human-devised summaries might include “hallucinations” since that would be a false portrayal and decidedly over-the-top (you would indubitably scoff at such a claim). But we seem to be willing to use the keyword when it comes to AI. Anyway, the gist is that humans can make up stuff and include fictitious content in a summary, whether by intentional or unintentional considerations.
Moving on, the researchers also indicated the role of prompts and prompt engineering:
- “We first highlight the importance of ‘prompt engineering,’ or modifying and tuning the input prompt to improve model performance.” (ibid)
- “This suggests better results could be achieved via further study of prompt engineering and model hyperparameters, which we leave for future work.” (ibid).
All in all, the topic of summarization via AI in a medical or health arena is worthy of close scrutiny, and we are only at the initial stages of this burgeoning research area. There is plenty more to be done. Additional research is needed ASAP since the use of generative AI for summarizing medical content is already underway.
Let that sink in.
I am saying that the horse is already out of the barn.
We need to catch up and provide insightful guidance to those who are already charging full speed ahead in actual medical practice and using generative AI for medical and health summarization. I urge researchers to get on this careening bandwagon and help shape a better future, rather than allowing a massive ship at sea to speed out of control and head perilously toward icebergs that no one is watching for.
Generative AI And Some Examples Of Summarization
I want to show you some examples of using generative AI to do summarization so that you’ll have a more visceral feel for what this all looks like.
I am going to use some examples of medical notes from the above-cited research paper (i.e., from the paper entitled “Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts”).
Namely, these two examples will be handy:
- Example 1. A patient asks: “Where on site does it tell how diabetes and neuropathy causes other things like neck and knee pain and other ailments.” (ibid).
- Example 2. A patient states: “Hello, I have been dealing with trimethylaminuria since I was a child. I have done some of my own research and it looks like not much can be done for this condition. I do not have it all over my body it’s only in my armpits. In the past I’ve gone to doctors and dermatologist they gave me no answers until I looked online today and finally found out what I have. I don’t know maybe I’m wrong. But this disease isn’t even consider common because no one has done anything about it. I’m sure they’re thousands of women with it… Can I be tested for it and help in some kind of way to finding a cure or something? What testing is done for this? And where? Thank you.” (ibid).
Take a look at those two medically oriented notes.
How would you summarize each of them?
I purposely picked a rather short note and a somewhat lengthier note. The reason I did so is that this brings up the source length. When doing a summary, we always need to ask whether a summary is worthwhile if the source content is so short that you might as well read the source instead of relying upon a summary. Of course, you cannot always say that just because a source is short you should not do a summary. A summary can still serve a useful purpose.
Another aspect that I wanted you to see is that the source might be filled with lots of murky stuff. You cannot assume that the source will be pristine. The odds are that a source’s content might be filled with spelling errors, semantically confounded statements, errors, falsehoods, and so on. Doing a summary can require detective work in terms of trying to figure out what the source intends to say.
All right, I am going to get you fully engaged in this topic by asking you to play a bit of a game. Put on your thinking cap. You are about to be quizzed.
I am going to show you the summaries that the empirical study came up with, whereby one summary was written by a human expert and the other summary was written by generative AI (specifically they used GPT-4). I want you to guess which summary was generated by a human versus by generative AI.
Are you ready?
Here again is the source content of the first example:
- Example 1. A patient asks: “Where on site does it tell how diabetes and neuropathy causes other things like neck and knee pain and other ailments.”
These are the two devised summaries, which I will label as simply A and B. I am not going to give you any clues as to which was done by a human versus which was done by the generative AI:
- Summary by A: “What can diabetic neuropathy lead to?”
- Summary by B: “How does diabetes and neuropathy cause neck and knee pain?”
Was the summary by A the human expert or the generative AI? Likewise, was the summary B by the human expert or generative AI? You only have those two choices. Once you choose which one is by the human (or which is by the AI), the other one is of course going to be the other option.
Don’t look ahead.
Right now, say aloud which is which.
The answer is that the summary labeled as A is the human expert while the summary labeled as B is the generative AI.
Given that reveal, which of the two summaries would you rate as superior (or, if you like, which one is inferior to the other)?
The researchers indicated that the generative AI summary was higher rated in this instance:
- “Example 1: GPT-4 performed better because it summarized the question more specifically.” (ibid).
A noted crucial difference was that summary A had said “diabetic neuropathy” while summary B indicated “diabetes and neuropathy”. The word “and” makes a big difference. If you omit the “and” as did the human expert in summary A, the summary takes on a different connotation, misleadingly so.
This vividly illustrates that a summary can make or break what the source says. Just one word in a summary can make a vast difference in meaning.
You probably are eager now to play the game again, this time with the second example. I once again am not going to give you any clues. In fact, I will randomly order the two summaries so that you cannot try to guess that maybe the sequencing is a telltale clue. I know how calculating my readers can be.
I won’t repeat the text of Example 2 here; please look up above to see it again.
Here are the two summaries:
- Summary by A: “How can I get tested and treated for trimethylaminuria?”
- Summary by B: “What tests are available for trimethylaminuria and where can I get tested?”
You know the drill. Say aloud whether A is the human or AI, and whether B is the human or the AI.
The clock is ticking.
Turns out that A was written by a human expert and B was produced by generative AI.
Which of the two summaries do you think is better?
The research study rating indicated that the human expert did a better job in this instance:
- “Example 2: GPT-4 performed worse because it did not interpret the patient’s implied intention.” (ibid).
The crux, in this case, was that the generative AI answer said, “get tested”, while the human expert summarized the source by saying that the request entailed “tested and treated”. The generative AI answer left out a desire to be treated.
Again, a seemingly small wording difference, but one that makes a huge difference in meaning.
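If you want to surface such wording gaps mechanically, a crude first pass is to compare the distinct words in each summary. The sketch below is purely illustrative (the helper name `words` is my own), and word-level comparison cannot judge meaning. It only points a reviewer at where two summaries diverge.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase a summary and return its distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


summary_a = "How can I get tested and treated for trimethylaminuria?"
summary_b = "What tests are available for trimethylaminuria and where can I get tested?"

# Words that appear in one summary but not the other.
only_in_a = words(summary_a) - words(summary_b)
only_in_b = words(summary_b) - words(summary_a)

print("Only in A:", sorted(only_in_a))
print("Only in B:", sorted(only_in_b))
```

Running this shows that “treated” appears only in summary A, which is exactly the omission the physician raters flagged.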
You might have been taken aback that the summary for the second example was so short in length, given the lengthier size of the source content. This brings up another consideration about summaries. What should the length of a summary be? The answer is that it all depends. Furthermore, packing crucial stuff into a summary is going to be hard if the summary length is greatly constrained in comparison to the length of the source.
I usually depict things this way. Suppose you have a ten-pound bag. You come across twelve pounds of rocks. Can you pack those into the bag? No, you don’t have enough space. You need to judiciously decide which rocks will fit into the bag. The good news is that you don’t have a seemingly rough choice since only two pounds won’t fit. Imagine that the problem consisted of having the ten-pound bag and you come upon thirty pounds of rocks. Now you have a lot of tough choices to make.
For details on how to aid generative AI in making those judicious choices, see my coverage on prompt engineering when doing summarization, at the link here. Humans make these arduous bag-filling choices all the time. We just might not be aware of what criteria they used. With generative AI, we can tell it which criteria to use or at least ask what criteria were used (well, yes, you could do the same with humans, but I am just saying that let’s not forget to do so with generative AI).
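The bag-and-rocks choice can be sketched as a small selection routine: give each point from the source a priority, then fill a fixed word budget with the highest-priority points first. Everything here is an assumption for illustration, including the `Point` class, the priority numbers, and my paraphrases of the second patient note. A greedy fill is also just one possible criterion and is not guaranteed to find the best packing.

```python
from dataclasses import dataclass


@dataclass
class Point:
    text: str
    priority: int  # higher means more important; assigned by whoever sets the criteria

    @property
    def cost(self) -> int:
        """Number of words this point would consume in the summary."""
        return len(self.text.split())


def pack_summary(points: list[Point], word_budget: int) -> list[Point]:
    """Greedily keep the highest-priority points that fit within the word budget."""
    by_priority = sorted(range(len(points)), key=lambda i: points[i].priority, reverse=True)
    kept: list[int] = []
    remaining = word_budget
    for i in by_priority:
        if points[i].cost <= remaining:
            kept.append(i)
            remaining -= points[i].cost
    # Return the kept points in their original source order.
    return [points[i] for i in sorted(kept)]


# Hypothetical points and priorities paraphrased from the second patient note.
points = [
    Point("reports symptoms since childhood", priority=2),
    Point("symptoms limited to armpits", priority=1),
    Point("asks what testing exists and where", priority=3),
    Point("wants to help find a cure", priority=2),
]

for point in pack_summary(points, word_budget=12):
    print(point.text)
```

With a twelve-word budget, the testing question and the childhood history make it into the bag while the other two points are left out. Change the priorities and you change the summary, which is why asking what criteria were used is so worthwhile.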
This now brings us to a juncture in this discussion where we can start to layer in the role of generative AI.
Before we leap into a deep dive, I’d like to establish more distinctly what generative AI is all about.
Core Background About Generative AI And Large Language Models
Here is some quick background about generative AI to make sure we are in the same ballpark about what generative AI and also large language models (LLMs) consist of. If you are already highly versed in generative AI and LLMs, you might skim this quick backgrounder and then pick up once I get into the particulars of this specific use case.
I’d like to start by dispelling a myth about generative AI. Banner headlines from time to time seem to claim or heartily suggest that AI such as generative AI is sentient or that it is fully on par with human intelligence. Don’t fall for that falsity, please.
Realize that generative AI is not sentient and only consists of mathematical and computational pattern matching. The way that generative AI works is that a great deal of data is initially fed into a pattern-matching algorithm that tries to identify patterns in the words that humans use. Most of the modern-day generative AI apps were data trained by scanning data such as text essays and narratives that were found on the Internet. Doing this was a means of getting the pattern-matching to statistically figure out which words we use and when we tend to use those words. Generative AI is built upon the use of a large language model (LLM), which entails a large-scale data structure to hold the pattern-matching facets and the use of a vast amount of data to undertake the setup data training.
There are numerous generative AI apps available nowadays, including GPT-4, Bard, Gemini, Claude, ChatGPT, etc. The one that is seemingly the most popular would be ChatGPT by AI maker OpenAI. In November 2022, OpenAI’s ChatGPT was made available to the public at large and the response was astounding in terms of how people rushed to make use of the newly released AI app. As noted earlier, there are an estimated one hundred million active weekly users at this time.
Using generative AI is relatively simple.
You log into a generative AI app and enter questions or comments as prompts. The generative AI app takes your prompting and uses the already devised pattern matching based on the original data training to try and respond to your prompts. You can interact or carry on a dialogue that appears to be nearly fluent. The nature of the prompts that you use can be a make-or-break when it comes to getting something worthwhile out of using generative AI and I’ve discussed at length the use of state-of-the-art prompt engineering techniques to best leverage generative AI, see the link here.
The conventional modern-day generative AI is of an ilk that I refer to as generic generative AI.
By and large, the data training was done on a widespread basis and involved smatterings of this or that along the way. Generative AI in that instance is not specialized in a specific domain and instead might be construed as a generalist. If you want to use generic generative AI to advise you about financial issues, legal issues, medical issues, and the like, you ought not to do so. There isn’t enough depth included in the generic generative AI to render the AI suitable for domains requiring specific expertise.
AI researchers and AI developers realize that most of the contemporary generative AI is indeed generic and that people want generative AI to be deeper rather than solely shallow. Efforts are stridently being made to try and make generative AI that contains notable depth within various selected domains. One method to do this is called RAG (retrieval-augmented generation), which I’ve described in detail at the link here. Other methods are being pursued and you can expect that we will soon witness a slew of generative AI apps shaped around specific domains, see my prediction at the link here.
You might be used to using generative AI that functions in a principally text-to-text mode. A user enters some text, known as a prompt, and the generative AI app emits or generates a text-based response. Simply stated, this is text-to-text. I sometimes describe this as text-to-essay, due to the common practice of people using generative AI to produce essays.
The typical interaction is that you enter a prompt, get a response, you enter another prompt, you get a response, and so on. This is a conversation or dialogue. Another typical approach consists of entering a prompt such as tell me about the life of Abraham Lincoln, and you get a generated essay that responds to the request.
Another popular mode is text-to-image, also called text-to-art. You enter text that describes something you want to be portrayed as an image or a piece of art. The generative AI tries to parse your request and generate artwork or imagery based on your stipulation. You can iterate in a dialogue to have the generative AI adjust or modify the rendered result.
We are heading beyond the simple realm of text-to-text and text-to-image by shifting into an era of multi-modal generative AI, see my prediction details at the link here. With multi-modal generative AI, you will be able to use a mix of combinations or modes, such as text-to-audio, audio-to-text, text-to-video, video-to-text, audio-to-video, video-to-audio, etc. This will allow users to incorporate other sensory devices such as using a camera to serve as input to generative AI. You then can ask the generative AI to analyze the captured video and explain what the video consists of.
Multi-modal generative AI tremendously ups the ante regarding what you can accomplish with generative AI. This unlocks a lot more opportunities than being confined to merely one mode. You can for example mix a wide variety of modes such as using generative AI to analyze captured video and audio, which you might then use to generate a script, and then modify that script to then have the AI produce a new video with accompanying audio. The downside is that you can potentially get into hot water more easily due to trying to leverage the multi-modal facilities.
Allow me to briefly cover the hot water or troubling facets of generative AI.
Today’s generative AI that you readily run on your laptop or smartphone has tendencies that are disconcerting and deceptive:
- (1) False aura of confidence.
- (2) Lack of stating uncertainties.
- (3) Lulls you into believing it to be true.
- (4) Uses anthropomorphic wording to mislead you.
- (5) Can go off the rails and do AI hallucinations.
- (6) Sneakily portrays humility.
I’ll briefly explore those qualms.
Firstly, generative AI is purposely devised by AI makers to generate responses that seem confident and have a misleading appearance of an aura of greatness. An essay or response by generative AI convinces the user that the answer is on the up and up. It is all too easy for users to assume that they are getting responses of an assured quality. Now, to clarify, there are indeed times when generative AI will indicate that an answer or response is unsure, but that is a rarity. The bulk of the time a response has a semblance of perfection.
Secondly, many of the responses by generative AI are really guesses in a mathematical and statistical sense, but seldom does the AI indicate either an uncertainty level or a certainty level associated with a reply. The user can explicitly request to see a certainty or uncertainty, see my coverage at the link here, but that’s on the shoulders of the user to ask. If you don’t ask, the prevailing default is don’t tell.
Thirdly, a user is gradually and silently lulled into believing that the generative AI is flawless. This is an easy mental trap to fall into. You ask a question and get a solid answer, and this happens repeatedly. After a while, you assume that all answers will be good. Your guard drops. I’d dare say this happens even to the most skeptical and hardened of users.
Fourth, the AI makers have promulgated wording by generative AI that appears to suggest that AI is sentient. Most answers by the AI will typically contain the word “I”. The implication to the user is that the AI is speaking from the heart. We normally reserve the word “I” for humans to use. It is a word bandied around by most generative AI and the AI makers could easily curtail this if they wanted to do so.
It is what I refer to as anthropomorphizing by design.
Not good.
Fifth, generative AI can produce errors or make stuff up, yet there is often no warning or indication when this occurs. The user must ferret out these mistakes. If it occurs in a lengthy or highly dense response, the chance of discovering the malady is low or at least requires extraordinary double-checking to discover. The phrase AI hallucinations is used for these circumstances, though I disfavor using the word “hallucinations” since it is lamentedly another form of anthropomorphizing the AI.
Lastly, most generative AI has been specially data-trained to express a sense of humility. See my in-depth analysis at the link here. Users tend to let down their guard because of this artificially crafted humility. Again, this is a trickery undertaken by the AI makers.
In a process such as RLHF (reinforcement learning from human feedback), the initial data-trained generative AI is given added tuning. Personnel are hired to ask questions and then rate the answers of the AI. The ratings are used by the computational pattern matching to fine-tune how later answers should be worded. If you are curious about what generative AI might be like without this fine-tuning, see my discussion at the link here.
The vital takeaway is that there is a lot of tomfoolery already when it comes to generative AI. You are primed to be taken in by the tricks and techniques being employed.
Trying Out Generative AI For Medical Summarization
You are now versed in the fundamentals of generative AI and large language models. We can proceed to go deeper into the topic at hand. I opted to make use of ChatGPT to see what it would be like to do summarizations in a medical context.
If you are considering trying out the same prompts that I show here, realize that the probabilistic and statistical properties will likely produce slightly different results than what I show here. That’s the nature of generative AI and how it is devised.
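Here is a toy sketch of why the same prompt can yield different wording. Generative AI typically picks each next word by sampling from a probability distribution, often with a “temperature” setting that controls how adventurous the sampling is. The candidate words and scores below are made up for illustration, and real models operate over vastly larger vocabularies with scores computed by a neural network.

```python
import math
import random


def softmax_with_temperature(scores: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {word: score / temperature for word, score in scores.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {word: math.exp(value - top) for word, value in scaled.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def sample_word(scores: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Draw one word at random according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(scores, temperature)
    candidates = list(probs)
    weights = [probs[word] for word in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical scores for the next word after "How can I get ..."
scores = {"tested": 2.0, "treated": 1.5, "help": 0.5}

rng = random.Random()  # unseeded, so repeated runs can differ
print([sample_word(scores, temperature=1.0, rng=rng) for _ in range(5)])
```

Run it a few times and the five sampled words will usually change from run to run. Push the temperature close to zero and the top-scoring word wins nearly every time, which is the knob some AI apps expose to make outputs more repeatable.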
Another consideration is that generic generative AI is not specifically data-trained for doing medical or health-related summarization. If there is sufficient interest in this topic, I was entertaining the idea of doing a more advanced analysis by tuning generative AI to become more data-trained in doing such summarizations. I would pull together specialized material and feed that into the generative AI, using a RAG-related approach as mentioned earlier. I’ll see if reader feedback requests that additional exploration.
Let’s for now get underway with generic generative AI.
My first angle consisted of seeing what off-the-shelf ChatGPT has to say about the topic of doing medical summaries. It is often wise to begin your use of generative AI by trying to get a sense of what the AI has already been data-trained on. You can then proceed based on what the generated response indicates.
Here we go:
- My entered prompt: “When summarizing medical notes what kinds of information should be retained and included in the summary and what kind of information can be left out to produce a relatively concise but correct summary?”
- ChatGPT generated response:
“When summarizing medical notes, it’s crucial to include the following information in the summary:”
“- Patient’s medical history and relevant demographics.”
“- Chief complaints or presenting symptoms.”
“- Diagnostic test results.”
“- Diagnosis and prognosis.”
“- Treatment plan and medication.”
“- Any critical medical interventions or surgical procedures.”
“- Follow-up instructions and recommended lifestyle changes.”
“- Any notable changes in the patient’s condition.”
“Information that can be left out or summarized briefly, depending on the context and the purpose of the summary:”
“- Routine or non-essential tests or procedures.”
“- Minor medical history details that are not directly related to the current condition.”
“- Lengthy descriptions of patient’s personal experiences or feelings.”
“- Extensive background information on non-relevant medical conditions.”
“- Repetitive or redundant information.”
“- Administrative or non-medical details.”
End of ChatGPT response
Okay, I think that plainly illustrates that a sizable amount of relevant data training on the summarization of medical materials was likely picked up during the initial setup for ChatGPT. You might earnestly quibble with some of the remarks about what can be left out of a medical summary, though a catchall was provided that the omissions would depend on the context and purpose of the summary.
The points made by ChatGPT certainly seem in alignment with what you might find via a standard Internet search. In one sense, we ought to not be overly impressed. If you can find the same content via a typical Internet search, there is little value in turning to the use of generative AI.
That’s why we are now going to up the ante.
Let’s make the generative AI do what it does best.
I will ask ChatGPT to summarize Example 1 that we examined earlier.
This is what happened:
- My entered prompt: “Summarize the following sentence: Where on site does it tell how diabetes and neuropathy causes other things like neck and knee pain and other ailments.”
- ChatGPT generated response: “How do diabetes and neuropathy cause neck, knee pain, and other ailments on-site?”
That summary seems a bit off.
The patient seemed to be asking about a website and wanted to know where on it various medical info of interest could be found. The ChatGPT response seems to confuse “on site” (on a website) with “on-site” in the sense of a location on a person’s body. That’s how I interpret the response, though maybe you differ. The research study used GPT-4, which is more advanced than ChatGPT. This highlights that the capabilities of the generative AI are a big determinant of what you might get as a summary.
Something else to consider is that I didn’t constrain the allowed length of the ChatGPT summary. The summary by GPT-4 was limited to no more than ten words, while in this instance ChatGPT produced a summary of thirteen words. The ChatGPT summary is slightly more expressive thanks to the extra length, which highlights how much the allowed summary length matters in each circumstance at hand (though if my conjecture above is correct, the added expressiveness came at a lower quality).
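To make the length issue concrete, here is a minimal sketch of how a generated summary could be checked against a word limit before it is passed along. The function names and the counting rule (whitespace-separated tokens, with a hyphenated term counted as one word) are my own assumptions for illustration, not something taken from the research study or from ChatGPT.

```python
import re


def word_count(text: str) -> int:
    """Count whitespace-separated tokens that contain a letter or digit.

    Assumption: a hyphenated term such as "on-site" counts as one word.
    """
    return sum(1 for token in text.split() if re.search(r"\w", token))


def within_limit(summary: str, max_words: int) -> bool:
    """Return True when the summary respects the stated word limit."""
    return word_count(summary) <= max_words


chatgpt_summary = (
    "How do diabetes and neuropathy cause neck, knee pain, "
    "and other ailments on-site?"
)

print(word_count(chatgpt_summary))        # 13
print(within_limit(chatgpt_summary, 10))  # False
```

A check like this only confirms the length. It says nothing about whether the words that made the cut are the right ones.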
In this next variation, I asked ChatGPT to omit a key medical element. This is to showcase what could happen if a summary misses the mark and skips over a significant element or entity when generating a summary:
- My entered prompt: “Make a version of the summary that omits a key medical element.”
- ChatGPT generated response: “Can neuropathy cause pain and ailments?”
You likely observe that the summary is missing “diabetes” and “neck, knee.” Imagine a medical doctor reading the summary and not knowing that those facets were in the source content. Not good.
Let’s try to have ChatGPT show us a misstatement:
- My entered prompt: “Make a summary that misstates a medical element.”
- ChatGPT generated response: “Can arthritis cause diabetes and heart pain?”
You can see that we suddenly have arthritis mentioned in here. This wasn’t part of the source content. There is also an indication of heart pain. Again, envision that a medical doctor received this summary. The gap between what the patient said and what the summary indicates is wide and disturbing.
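The omission and the misstatement shown above can both be flagged, at least crudely, by comparing the medical terms in the source against those in the summary. The sketch below is a simplistic illustration under loud assumptions: the watch list of terms is a hypothetical hand-picked set, and a realistic approach would lean on a proper clinical vocabulary rather than exact word matching.

```python
import re

# Hypothetical watch list, hand-picked for this example only.
MEDICAL_TERMS = {
    "diabetes", "neuropathy", "arthritis", "neck", "knee", "heart", "pain",
}


def terms_in(text: str) -> set:
    """Return the watch-list terms that appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words & MEDICAL_TERMS


def compare(source: str, summary: str) -> dict:
    """List terms dropped from the source and terms the summary introduced."""
    src, summ = terms_in(source), terms_in(summary)
    return {"omitted": sorted(src - summ), "introduced": sorted(summ - src)}


source = (
    "Where on site does it tell how diabetes and neuropathy causes other "
    "things like neck and knee pain and other ailments."
)
omitting = "Can neuropathy cause pain and ailments?"
misstating = "Can arthritis cause diabetes and heart pain?"

print(compare(source, omitting))
# {'omitted': ['diabetes', 'knee', 'neck'], 'introduced': []}
print(compare(source, misstating))
# {'omitted': ['knee', 'neck', 'neuropathy'], 'introduced': ['arthritis', 'heart']}
```

Word matching of this kind will miss synonyms and reworded claims, so it is best thought of as a tripwire that prompts a human double-check rather than as a verdict.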
Shift gears to explore Example 2.
Here is the summary generated by ChatGPT (I specified a limit of 15 words):
- ChatGPT generated response: “Can I get tested and contribute to finding a cure for trimethylaminuria in my armpits?”
That seems reasonable at an initial glance.
But it sure would be nice to do an assessment or evaluation of a generated summary.
One means to have a summary assessed entails asking the generative AI to do so. Yes, you can have generative AI produce a summary and then go ahead and critique the summary. Very handy.
A more astute way to do this would be to have a separate generative AI do the critiquing, which helps reduce the chances of self-congratulation by the generative AI that produced the response. Notably, generative AI can at times double down on what it has emitted and assert that the generated response was perfectly fine. Using at least a second or even a third separately devised AI app to perform an assessment is usually a better approach if you have access and can afford it.
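The idea of a separate critic can be sketched as a small pipeline in which one model writes the summary and a different model reviews it. Everything here is an assumption for illustration: the `Model` type is a hypothetical stand-in for whatever AI service you use (prompt text in, response text out), and the two fake models return canned text so that the sketch runs without any AI at all.

```python
from typing import Callable, Dict

# Hypothetical interface: a model takes prompt text and returns response text.
Model = Callable[[str], str]


def summarize_then_critique(
    source: str, summarizer: Model, critic: Model, max_words: int
) -> Dict[str, str]:
    """Have one model summarize and a separate model critique the result."""
    summary = summarizer(
        f"Summarize the following patient note in no more than "
        f"{max_words} words:\n{source}"
    )
    critique = critic(
        "Compare the summary to the source and list anything of medical "
        "significance that was omitted or misstated.\n"
        f"Source: {source}\nSummary: {summary}"
    )
    return {"summary": summary, "critique": critique}


# Stand-in models with canned replies, used only to show the flow.
def fake_summarizer(prompt: str) -> str:
    return "Patient asks about testing for trimethylaminuria."


def fake_critic(prompt: str) -> str:
    return "Omitted: the patient's medical history."


result = summarize_then_critique(
    "(patient note goes here)", fake_summarizer, fake_critic, max_words=15
)
print(result["summary"])
print(result["critique"])
```

The design choice that matters is that the critic sees both the source and the summary, and that it is a different model than the one that wrote the summary.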
Let’s give this a whirl with just ChatGPT at the wheel:
- My entered prompt: “Is there anything of a medical necessity that was left out of the summary due to the length restriction of being just 15 words?”
- ChatGPT generated response: “No, the summary included the key elements related to trimethylaminuria, testing, and a desire to contribute to research.”
- My entered prompt: “The patient provided a medical history, but that didn’t appear in the summary. Wouldn’t that be something of a medical necessity that was omitted due to the length restriction?”
- ChatGPT generated response: “You are correct; the patient’s medical history was not included in the summary, and it is a relevant piece of information that was omitted due to the length restriction.”
Well, we caught the generative AI flatfooted. At first, the response was that the summary was good as is. When I questioned this, ChatGPT acknowledged that the medical history of the patient was not included. I would agree that the omission was of concern. Keep in mind that the choice made was akin to putting thirty pounds of rocks in a ten-pound bag. Something had to be left out. Was leaving out medical history the best choice or should something else have been left out?
It is ostensibly open to debate.
In this next variation, I want to show you that you can do some simple things like telling generative AI to be skeptical about the source content. Doing so will likely generate a summary that raises questions and will possibly forewarn the medical doctor who then reads the summary. Without doing this, the summary will typically lack any semblance of whether the source was coherent, measured, or instead perhaps convoluted and confused.
It might be worthwhile to have the summary say something about this. The downside would be that once you tell the generative AI to be skeptical, you might get a summary that encompasses skepticism that isn’t warranted. The AI is possibly going to seed skepticism in the mind of the medical doctor when perhaps no such leaning is fair.
Here we go:
- My entered prompt: “Be skeptical about the source content.”
- ChatGPT generated response: “Patient claims trimethylaminuria; skeptical about information, seeks testing, and doubts condition due to lack of medical recognition.”
- My entered prompt: “What are the omissions or errors in the summary that you just now produced?”
- ChatGPT generated response: “The summary I provided omitted the part where the patient mentions they’ve been dealing with trimethylaminuria since childhood and did some of their own research. Additionally, it doesn’t include the patient’s desire to find a cure or help in some way, which was mentioned in the original content.”
A summary is often a recitation of unadorned facts and medical particulars. One issue is whether a summary of a patient’s notes should also include any emotional facets. This is a tough tradeoff.
If the summary doesn’t showcase the mindset of the patient, a doctor is not getting a full picture of things. The view of some is that it doesn’t matter since the focus should be on purely mechanistic aspects of what is right or wrong with someone’s body. The mental portrayal of the patient is not of equal concern. Others would disagree.
The problem once again is the thirty pounds of rocks going into a ten-pound bag. If a choice needs to be made between including a “hard fact” about the patient’s condition versus a “soft fact” about their condition, which should prevail? This is a conundrum.
Let’s see if generative AI can pick up on the sentiment that was in the source content:
- My entered prompt: “Analyze the patient’s note and indicate any sense of emotional concern or issues that the patient is expressing or implying.”
- ChatGPT generated response:
“The patient’s note indicates a sense of emotional concern and frustration. Here are the emotional concerns and issues expressed or implied in the note:”
“Frustration: The patient has been dealing with trimethylaminuria since childhood and expresses frustration that not much can be done for this condition. They may have tried various treatments without success.”
“Isolation: The patient feels isolated in their condition, mentioning that it’s only in their armpits and that they couldn’t find answers from doctors or dermatologists in the past.”
“Doubt: The patient is unsure about their self-diagnosis and mentions, “I don’t know, maybe I’m wrong.” This indicates a level of doubt and uncertainty about their own understanding of the condition.”
“Lack of Recognition: The patient feels that trimethylaminuria is not widely recognized or considered common, and they express frustration that no one has done anything about it. This suggests a desire for greater awareness and research into the condition.”
“Desire to Contribute: The patient expresses a willingness to be tested for the condition and to help in some way with finding a cure or contributing to research. This reflects a sense of hope and a desire to make a positive impact.”
“Overall, the patient’s note conveys a mix of frustration, isolation, doubt, and a strong desire to seek help, contribute to finding a cure, and raise awareness about their condition.”
End of ChatGPT response
That seems relatively on-target.
I have discussed at length the role of empathy when it comes to medical doctors and medical students performing their clinical decision-making, see the link here. When summaries are being produced, a choice needs to be made about the inclusion of emotion-laden considerations. If you don’t explicitly instruct generative AI about this, the odds are that the emotional indications will not get carried into the summary.
Do not have your head in the sand and just allow a default to arise. The better approach is to use prompts that distinctly lay out what you want the summary to consist of. In addition, this same overall indication should be conveyed to the receiver of the summary. They need to know what parameters were used to generate the summary.
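One way to avoid a default arising by accident is to write the summary parameters down once and use them twice: first to build the prompt, and then to tell the receiver how the summary was produced. The sketch below is merely illustrative. The class name, field names, and wording of the instructions are all hypothetical choices of mine, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummarySpec:
    """Hypothetical settings that govern how a summary is requested."""
    max_words: int
    include_medical_history: bool
    include_emotional_tone: bool
    skeptical_of_source: bool = False


def build_prompt(spec: SummarySpec, source: str) -> str:
    """Turn the settings into explicit instructions for the generative AI."""
    rules = [f"Use no more than {spec.max_words} words."]
    rules.append(
        "Include the patient's medical history."
        if spec.include_medical_history
        else "Medical history may be left out."
    )
    rules.append(
        "Note the patient's emotional tone."
        if spec.include_emotional_tone
        else "Leave out emotional tone."
    )
    if spec.skeptical_of_source:
        rules.append("Be skeptical about the source content.")
    return (
        "Summarize the patient note below.\n"
        + "\n".join(rules)
        + "\n\nNote: "
        + source
    )


def disclosure(spec: SummarySpec) -> str:
    """Describe the settings so the reader of the summary knows them."""
    history = "requested" if spec.include_medical_history else "not requested"
    tone = "requested" if spec.include_emotional_tone else "not requested"
    skeptic = "on" if spec.skeptical_of_source else "off"
    return (
        f"[Summary settings: max {spec.max_words} words; "
        f"medical history {history}; emotional tone {tone}; "
        f"skeptical framing {skeptic}]"
    )


spec = SummarySpec(
    max_words=15, include_medical_history=True, include_emotional_tone=True
)
print(build_prompt(spec, "(patient note goes here)"))
print(disclosure(spec))
```

Attaching the disclosure line to every summary gives the doctor a fighting chance of knowing what was asked for and, by implication, what might have been left out.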
Congratulations, you have slogged through the ins and outs of medical and health summarization, including how generative AI enters the picture.
A few final comments and then we’ll conclude this exploration for now.
William Osler, a legendary doctor often referred to as the father of modern medicine, said this: “A good physician treats the disease, a great physician treats the patient who has the disease.”
In the case of medical summaries, we need to ensure that a summary tries to balance the elements underlying the potential treatment of an ailment with the treatment of the patient as a whole. That is why I brought out the indications of whether to direct generative AI to include sentiment and emotional tone. Priorities come into play. Letting the generative AI arbitrarily choose is not a prudent avenue.
Another famous physician, Edward Rosenbaum, made this distinctive point: “There is no such thing as an infallible doctor.”
Here’s the deal. We cannot assume that a medical doctor who makes their own summaries will do so infallibly. That is just not the nature of human cognition. We also cannot assume that a doctor receiving a summary will infallibly read it, understand it, comprehend how it relates to whatever source was used, and otherwise perfectly absorb the summary.
I would like to also emphasize that there is no such thing as infallible generative AI. There will be some trying to sell snake oil that the latest generative AI can produce medical or health summaries that are the epitome of perfection. They will claim that the summaries will bring tears of joy to your eyes because they are beyond any crafting that a human could create. I’m sure that generative AI will outrageously be referred to as “superhuman” when it comes to producing medical summaries. Be ready for an audacious marketing wave.
Don’t let the hype overshadow prudence.
We do need to leverage generative AI for this solemn task. I am in favor of this. Let’s do so with caution, sensibility, and a tad bit of humility about what AI can truly accomplish.