The large language model (LLM)-based chatbot, ChatGPT, had its clinical decision-making abilities put to the test in a joint study by researchers from Harvard Medical School and the Departments of Radiology at Massachusetts General Hospital and Brigham and Women’s Hospital. Standard clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual were fed to the chatbot iteratively as sample cases. It showed an impressive performance, averaging around 72% accuracy in overall diagnosis, nearly equivalent to that of a medical school graduate. It performed best when determining the final diagnosis and worst when generating the initial differential diagnosis, suggesting that its accuracy rises with the amount of information it is provided.
An Introduction to ChatGPT
ChatGPT is an LLM-based AI chatbot that has surged in popularity since its launch by OpenAI in late 2022. It is widely used owing to its convenience and ease of use and the quick, easy-to-understand responses it gives. Unlike models built for single-ask tasks, it can focus the power of vast training data sets on successive, related tasks. Using self-attention and feed-forward neural networks, it can grasp the context and meaning between words in an input sequence and interpret human-generated text. It is trained on data from articles, books, and websites and does not retrieve information from the internet when generating responses. Instead, it draws on the patterns observed in its training data to generate likely ‘tokens’ in sequence.
Artificial Intelligence in Healthcare
Tools based on artificial intelligence are becoming increasingly integrated into various healthcare settings, such as analyzing radiographic images and providing communication support for patients who have difficulty communicating in English. A study involving several medical professionals and practicing physicians from some of the top medical institutions in the United States aimed to determine the accuracy with which ChatGPT can solve clinical problems, and thereby the extent of its utility in the clinical workspace.
Testing the abilities of prospective ‘artificial physicians’
It was hypothesized that ChatGPT would be able to provide information on the steps leading to the initial and final diagnosis and on clinical management, thanks to the similarly iterative nature of ChatGPT conversations and clinical diagnosis. Thirty-six clinical vignettes were drawn from the MSD manual; these are, in simple words, short, hypothetical cases used to test clinical reasoning abilities. As a standard, each contains a history of the present illness (HPI), a review of systems (ROS), and a physical exam (PE) for the patient. The questions are posed as multiple-select (‘select all that apply’) questions.
The clinical workflow in a typical setting consists of a differential diagnosis, diagnostic testing, a final diagnosis, and clinical management. Prompts corresponding to these steps were given to the chatbot in succession and were abbreviated as diff (differential diagnosis questions), diag (diagnostic questions), dx (final diagnosis questions), and mang (management questions); miscellaneous questions testing general medical knowledge about the case, labeled misc, were used as a reference parameter. diff questions ask which conditions cannot be ruled out of a predetermined initial differential; diag questions ask for the diagnostic steps to be taken based on the available information; dx questions ask for the final diagnosis; and mang questions ask for suitable clinical management.
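The per-vignette question sequence described above can be sketched as a simple loop. This is an illustrative assumption, not the study’s actual code: the researchers pasted prompts into the chat interface by hand, so the `ask_model` function below is a stub standing in for a real ChatGPT query.

```python
# Sketch of the iterative prompting workflow for one vignette (illustrative only).
QUESTION_TYPES = ["diff", "diag", "dx", "mang", "misc"]

PROMPTS = {
    "diff": "Which conditions cannot be ruled out of the initial differential?",
    "diag": "What diagnostic steps should be taken next?",
    "dx":   "What is the final diagnosis?",
    "mang": "What clinical management steps do you suggest?",
    "misc": "Answer this general medical knowledge question about the case.",
}

def ask_model(prompt: str) -> str:
    """Stub: a real implementation would send the prompt to ChatGPT."""
    return f"model answer to: {prompt}"

def run_vignette(hpi: str, ros: str, pe: str) -> dict:
    """Present the case context, then ask each question type in clinical order."""
    context = f"HPI: {hpi}\nROS: {ros}\nPE: {pe}"
    responses = {}
    for qtype in QUESTION_TYPES:
        # Each follow-up question builds on the same session's accumulated context.
        responses[qtype] = ask_model(f"{context}\n\n{PROMPTS[qtype]}")
    return responses
```

The ordering mirrors the clinical workflow: the differential question comes first, with only the HPI, ROS, and PE available, while the final-diagnosis question arrives after more context has accumulated in the session.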
A new ChatGPT session was started for each vignette tested to avoid carrying information across cases, and each response was scored independently by two scorers to check for any inconsistencies in the chatbot’s answers. A point was awarded each time an answer given by ChatGPT matched one listed in the MSD manual for that vignette. Crucial information on patient age and gender was also included in the assessments to check for any AI bias that may be present. The Emergency Severity Index (ESI) was used to classify the clinical acuity (severity) of each case based on the HPI provided, on a scale of 1 to 5, where 1 corresponds to the highest severity and 5 to the lowest. Multiple linear regression, carried out with the lm() function in the programming language R, was used to statistically relate ChatGPT’s outcomes to those of the MSD manual.
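A minimal sketch of the two analysis steps is shown below, under two stated assumptions: that scoring simply counts matches between ChatGPT’s selections and the MSD answer key, and that NumPy’s least-squares solver can stand in for R’s lm(). The function names, variable names, and the toy data are illustrative, not taken from the study.

```python
import numpy as np

def score_answers(model_answers: set, msd_key: set) -> int:
    """Award one point per model answer that matches the MSD manual key."""
    return len(model_answers & msd_key)

# Toy multiple linear regression, analogous to fitting
# lm(accuracy ~ age + acuity) in R: solve for coefficients by least squares.
age    = np.array([25.0, 40.0, 55.0, 70.0, 33.0])
acuity = np.array([1.0, 3.0, 2.0, 5.0, 4.0])   # ESI: 1 = most severe, 5 = least
# Synthetic accuracies generated exactly as 50 + 0.2*age - 2*acuity,
# so the recovered coefficients are known in advance.
accuracy = 50 + 0.2 * age - 2 * acuity

X = np.column_stack([np.ones(len(age)), age, acuity])  # intercept + predictors
coef, *_ = np.linalg.lstsq(X, accuracy, rcond=None)
# coef recovers the intercept and the two slopes: [50.0, 0.2, -2.0]
```

Because the synthetic accuracies are an exact linear function of the predictors, the solver recovers the generating coefficients exactly; on real, noisy study data the fit would instead minimize the squared residuals.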
ChatGPT had the highest accuracy on dx questions (76.9%) and the lowest on diff questions (60.3%). This difference is believed to arise because diff questions rely only on the HPI, ROS, and PE, whereas dx questions come with far more supplementary clinical context, including information from diagnostic testing. Accuracy rates for diag and mang questions were low as well, with ChatGPT suggesting unnecessary clinical steps. In terms of performance on specific vignettes, the model performed best on a case of late-stage testicular cancer, with an accuracy of 83.8%, and worst when diagnosing the cause of frequent headaches in a 31-year-old woman, with an accuracy of 55.9%. A positive correlation was observed between ChatGPT’s diagnostic accuracy and the severity of the disease when analyzing cases related to breast cancer and breast pain. No AI bias was detected in this study, as the vignettes used were classic cases in medicine; whether any bias would appear with uncommon presentations is yet to be determined.
In some cases, ChatGPT refused to give a diagnosis altogether, while in others it suggested unnecessary steps. It can therefore struggle to determine which clinical methods are most appropriate for vignettes where the information provided is vague, and it may fail to suggest standardized testing methods in some cases as well.
Concerns have been raised about the risks of relying on models that are not completely accurate at diagnosing disease, as even the slightest error can result in severe consequences for the patient in question. Such errors can be attributed to the model’s potential for hallucination and to the limited information available on the composition of its training data set. ChatGPT also lacks true reasoning capacity, since it simply generates successive ‘tokens’ in response to previously asked questions.
It is important to test the reliability of LLM-based models like ChatGPT in clinical settings, as it is, quite literally, a matter of life or death for the patients involved. Initial deployments of these technologies will use them as assistive tools for medical professionals, as opposed to depending on them completely to conduct diagnoses. They hold immense potential to enable better patient outcomes, make clinical workflows faster and more efficient, and transform healthcare practices as we know them today.
Swasti is a scientific writing intern at CBIRT with a passion for research and development. She is pursuing a BTech in Biotechnology from Vellore Institute of Technology, Vellore. Her interests lie in exploring the rapidly growing and interconnected sectors of bioinformatics, cancer informatics, and computational biology, with a special emphasis on cancer biology and immunological studies. She aims to introduce the readers of her articles to the exciting developments bioinformatics has to offer in biological research today.