Assessing AI Chatbots' Performance on a Medical Licensing Exam
Study evaluates AI chatbots' effectiveness in medical licensing exams.
Artificial Intelligence (AI) is changing many fields, including medicine and the way medical students learn. One promising tool is the AI chatbot, which can support training and education by providing simulated practice, tailored feedback, and assistance in clinical training. However, before these chatbots are adopted in medical programs, we need to check how well they actually perform.
Early Observations of Chatbot Performance
When chatbots first became available, medical schools started testing them with exam simulations. The results showed that while some chatbots gave correct and sensible answers, others made clear mistakes or produced confident but incorrect responses. These issues can stem from the data used to train the bots, which may contain biases or incorrect information. Overall, chatbots generally scored around the passing mark, with some outperforming students. They tended to do better on easier questions and when the exams were in English, and their scores dropped as the exams became more challenging. Still, newer versions of these bots tend to perform better than older ones, indicating that they are improving over time.
Concerns and Potential
Schools are beginning to worry about the potential for exam cheating with these chatbots. At the same time, chatbots can be useful for building tests, for example by flagging unclear or poorly written questions. So far there has been little research comparing different chatbots: existing studies usually cover only a few bots and rarely measure how often they make mistakes.
Study Overview
This study focused on a major theory exam that all medical students must pass to earn their medical degree. The exam was administered in 2021 in Belgium and is similar to licensing exams in other countries. Six different chatbots were tested on this exam. The study aimed to measure their performance, assess how often they made mistakes, and identify any weak questions on the exam.
Methodology
The study received approval from the university's ethics committee. Before becoming licensed doctors, medical students must pass an exam consisting of 102 multiple-choice questions covering various topics. This study used the exam exactly as it had been presented to students during their training. The questions were not available online, so they could not have been part of the chatbots' training data.
Selection of Chatbots
Six publicly available chatbots were selected for testing. The most popular free chatbots at the time were ChatGPT, Bard, and Bing. The paid bots Claude Instant and Claude+, as well as GPT-4, were also included to see how they compared to the free versions. Although Bing is built on the same technology as GPT-4, it also draws on other sources, making it a customized version.
Data Collection
The exam questions were translated into English using a translation service while keeping the original writing style. A few questions that were specific to Belgium or contained images were removed. The bots were then asked to answer the remaining questions; for some questions, the researchers had to prompt Bard multiple times to get a response.
Performance Evaluation
The main outcome was how well the chatbots answered the exam questions, scored as the proportion of correct answers. If a chatbot selected a second-best answer, it received partial credit, while choosing a harmful answer cost points. The questions were grouped into categories based on their difficulty, their type, and whether they included dangerous answer options.
The study also tracked how often each chatbot made mistakes, such as reasoning errors or untruthful statements, and used the bots' answers to flag problematic exam questions.
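To make the scoring scheme described above concrete, here is a minimal Python sketch. The credit values (1.0 for correct, 0.5 for second-best, -1.0 for a harmful choice) are illustrative assumptions, not the study's published weights.

```python
# Sketch of a multiple-choice scoring scheme with partial credit and
# penalties for harmful answers. All numeric weights are assumptions.

def score_answer(chosen: str, correct: str, second_best: str | None,
                 dangerous: set[str]) -> float:
    """Return the credit for a single multiple-choice answer."""
    if chosen == correct:
        return 1.0          # full credit for the best answer
    if second_best is not None and chosen == second_best:
        return 0.5          # partial credit for the second-best option
    if chosen in dangerous:
        return -1.0         # penalty for selecting a harmful option
    return 0.0              # no credit for other wrong answers


def exam_score(answers: list[dict]) -> float:
    """Aggregate score as a proportion of the maximum attainable points."""
    total = sum(score_answer(a["chosen"], a["correct"],
                             a.get("second_best"), a.get("dangerous", set()))
                for a in answers)
    return total / len(answers)


# Example: one correct answer, one second-best choice, one harmful choice
demo = [
    {"chosen": "A", "correct": "A"},
    {"chosen": "B", "correct": "A", "second_best": "B"},
    {"chosen": "D", "correct": "A", "dangerous": {"D"}},
]
print(f"Score: {exam_score(demo):.2f}")  # (1.0 + 0.5 - 1.0) / 3 ≈ 0.17
```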
Results of the Exam Performance
Overall, Bing and GPT-4 performed best on the exam with a score of 76%, while the average across all bots was 68%. Every bot answered some questions incorrectly, but Bard did not select any of the dangerous answers. Bing chose a few second-best answers, while the other bots chose more. Bard struggled to answer several questions and often needed repeated prompting.
On the difficult questions, the bots did better than the students, whose average score was significantly lower. Bing and GPT-4 were particularly strong on easier questions but struggled with the more complex ones.
Reasons for Mistakes
For the incorrect answers, the study examined how often the bots gave responses that were nonsensical or untrue. Bing produced fewer nonsensical answers than Bard and Claude Instant but still made some mistakes. These errors often arose from misunderstanding the context of a question.
Weak Questions Identified
During the analysis, a few questions were identified as weak or unclear. For example, a question about when to start renal replacement therapy had misleading options that could confuse bots and students alike.
Comparison of Bot Responses
Some bots performed similarly, while others differed more widely in accuracy. The researchers also examined how often the bots agreed with one another on individual answers and found a moderate level of agreement overall.
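The summary does not say which agreement statistic was used; a common choice for this kind of comparison is pairwise Cohen's kappa, sketched below. The answer lists are hypothetical and only illustrate the calculation.

```python
# Pairwise inter-bot agreement on multiple-choice answers using Cohen's
# kappa. The bot names and answers below are made up for illustration;
# the study's own agreement analysis may differ.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical answers (option letters) from three bots on six questions
answers = {
    "bot_a": ["A", "C", "B", "D", "A", "B"],
    "bot_b": ["A", "C", "B", "C", "A", "D"],
    "bot_c": ["B", "C", "B", "D", "A", "B"],
}

for (name1, ans1), (name2, ans2) in combinations(answers.items(), 2):
    kappa = cohen_kappa_score(ans1, ans2)
    print(f"{name1} vs {name2}: kappa = {kappa:.2f}")
```

Kappa values around 0.4 to 0.6 are conventionally read as moderate agreement, which matches the qualitative description in the study summary.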
Conclusions
The study highlighted significant differences between the chatbots in terms of their performance on the medical licensing exam. Bing stood out for its reliability, as it made fewer errors compared to the other bots. While the improvements in chatbot performance are encouraging, it’s essential to remain cautious about relying on them for medical knowledge. The findings also raise questions about the effectiveness of multiple-choice exams in assessing the skills that future doctors need, particularly when it comes to human interaction.
Recommendations for Future Use
Bing may be a useful tool for identifying poorly crafted exam questions, saving educators time and effort. The results also suggest that chatbots could be particularly helpful in areas where students struggle, such as the more difficult questions.
The study calls for more research to explore how chatbots perform on different types of questions and in various educational settings. It’s clear that while chatbots can be useful tools, they should not be mistaken for actual medical professionals.
Future Considerations
As the use of AI in education grows, ethical and legal issues need addressing, including energy consumption, data privacy, and proper usage of copyrighted material. Before implementing AI more widely in medical education, it’s crucial to gain a better understanding of these issues.
Overall, all six AI chatbots passed an important medical exam, with Bing and GPT-4 emerging as the most effective. However, the bots' weaknesses, especially on difficult questions, highlight the need for more research and improvement before they can be fully relied upon in a medical setting.
Title: Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam
Abstract: Recently developed chatbots based on large language models (further called bots) have promising features which could facilitate medical education. Several bots are freely available, but their proficiency has been insufficiently evaluated. In this study the authors have tested the current performance on the multiple-choice medical licensing exam of University of Antwerp (Belgium) of six widely used bots: ChatGPT (OpenAI), Bard (Google), New Bing (Microsoft), Claude instant (Anthropic), Claude+ (Anthropic) and GPT-4 (OpenAI). The primary outcome was the performance on the exam expressed as a proportion of correct answers. Secondary analyses were done for a variety of features in the exam questions: easy versus difficult questions, grammatically positive versus negative questions, and clinical vignettes versus theoretical questions. Reasoning errors and untruthful statements (hallucinations) in the bots answers were examined. All bots passed the exam; Bing and GPT-4 (both 76% correct answers) outperformed the other bots (62-67%, p= 0.03) and students (61%). Bots performed worse on difficult questions (62%, p= 0.06), but outperformed students (32%) on those questions even more (p
Authors: Stefan Morreel, V. Verhoeven, D. Mathysen
Last Update: 2023-08-21 00:00:00
Language: English
Source URL: https://www.medrxiv.org/content/10.1101/2023.08.18.23294263
Source PDF: https://www.medrxiv.org/content/10.1101/2023.08.18.23294263.full.pdf
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to medrxiv for use of its open access interoperability.