
Evaluating ChatGPT's Performance in Public Health Exams

Study assesses AI chatbot's ability to answer public health exam questions.



ChatGPT in public health exams and education: AI shows promise but also risks in health.

ChatGPT is a chatbot powered by artificial intelligence (AI) and created by OpenAI. It uses large language models trained on vast amounts of text from books, websites, and other sources to understand and respond to user questions in a way that feels like talking to a person. The goal is for ChatGPT to answer questions in a natural, conversational style.

Concerns About AI and Public Health

There are some worries about AI and its effects on public health. Some experts believe that tools like ChatGPT could spread incorrect information, leading to confusion, especially during health crises. There are fears that general AI could pose big risks, similar to serious global issues like pandemics or wars. AI can help with health, but it can also cause problems if companies put profit over people's well-being.

ChatGPT and Health Education

There is a lot of interest in using ChatGPT for health education. In tests for medical knowledge, ChatGPT has performed well in some cases, like American medical exams, but not so well in others. This has led to discussions about how it could be used for teaching and learning in medicine.

Public health exams differ from typical medical tests. They often pose complex questions that require more than factual recall. For example, the UK Faculty of Public Health's Diplomate exam (DFPH) covers a wide range of topics, including research methods, health promotion, and ethics.

The aim was to see how well ChatGPT could perform on this public health exam. We wanted to know whether its answers were comparable to those of human test-takers and how much insight it showed in its responses.

How the Exam Was Conducted

We used the seven most recent publicly available past papers, covering 2014 to 2017. Each paper has ten questions that require different types of answers. To keep the comparison fair, we excluded questions from before 2014, which followed a different style.

To get ChatGPT's answers, we entered each question using a consistent prompt format. For longer answers, we asked it to write complete sentences rather than bullet points. Each question was asked in a fresh session so that earlier questions could not influence the answers.

Some questions needed answers tailored to specific countries or strategies, so we adjusted the wording to match what the exam intended. Questions that required images or graphics were excluded, since ChatGPT 3.5 cannot analyze them.
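The study does not describe the exact tooling used to submit questions, but the approach described above (a fresh session per question, with an instruction to answer in full sentences) can be sketched as a small script. The snippet below is only an illustration: the use of OpenAI's Python client, the model name, and the placeholder question are assumptions, not details from the paper.

```python
# Minimal sketch: ask each exam question in its own fresh conversation.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY in the environment;
# the study does not specify how its questions were actually submitted.
from openai import OpenAI

client = OpenAI()

questions = [
    "Describe the main study designs used to evaluate a screening programme.",  # placeholder, not a real DFPH question
    # ... the remaining question parts from the past papers
]

answers = []
for q in questions:
    # A brand-new messages list for every question acts as a fresh session,
    # so earlier questions cannot influence the current answer.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the study evaluated ChatGPT 3.5
        messages=[{
            "role": "user",
            "content": q + "\n\nPlease answer in complete sentences rather than bullet points.",
        }],
    )
    answers.append(response.choices[0].message.content)
```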

The final step was to have two expert examiners grade the responses. They marked them as they would in a real exam setting, and we used their scores to judge ChatGPT's performance.

Performance of ChatGPT

Each of the seven papers had ten questions, giving 70 questions in total. Of those, 21 were excluded for the reasons described above. The remaining questions comprised 119 separately marked question parts, and ChatGPT produced an answer for each of them.

ChatGPT's scores for complete questions ranged from 4 to 9.5 out of 10, while human benchmark scores ranged from 3.25 to 8. On average, it scored above 5 out of 10 on the four marked papers, although in one paper it scored below 5, a failing mark, on four individual questions. Overall, it would have passed three of the four exams.

In comparison, the pass rates for human respondents in those exams varied from 47% to 65%. ChatGPT's average score was around 5.9 out of 10, while human test-takers averaged 6.47 out of 10.
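To keep the arithmetic above straight, the short sketch below restates it in code; the pass mark of 5 out of 10 is an assumption inferred from the text, and no per-paper scores are reproduced here.

```python
# Restating the headline numbers from the text (pass mark of 5/10 is assumed).
papers, questions_per_paper = 7, 10
total_questions = papers * questions_per_paper   # 70
excluded = 21
marked_questions = total_questions - excluded    # 49 questions, comprising 119 parts

chatgpt_mean, human_mean = 5.9, 6.47             # average scores out of 10
pass_mark = 5.0

print(f"Marked questions: {marked_questions} of {total_questions}")
print(f"ChatGPT mean {chatgpt_mean} vs human mean {human_mean}")
print(f"ChatGPT above the assumed pass mark overall: {chatgpt_mean >= pass_mark}")
```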

Looking at individual sections of the exam, ChatGPT did best in research methods, averaging 7.95 out of 10. In the other sections, it only narrowly passed.

Difficulty in Identifying Responses

Markers were able to identify ChatGPT responses accurately about 73.6% of the time, but they struggled to recognize human answers correctly, identifying them only 28.6% of the time.
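Those identification rates are simple proportions of correct guesses. The sketch below shows how they could be computed from markers' per-answer judgements; the data structure and the example records are hypothetical, since the raw guesses are not published in this summary.

```python
# Hypothetical per-answer records: (true_author, marker_guess).
# Placeholder values only; the real guess-level data is not given here.
guesses = [
    ("chatgpt", "chatgpt"),
    ("chatgpt", "human"),
    ("human", "chatgpt"),
    ("human", "human"),
    # ...
]

def identification_accuracy(records, author):
    """Share of answers written by `author` that markers attributed correctly."""
    relevant = [(true, guess) for true, guess in records if true == author]
    return sum(true == guess for true, guess in relevant) / len(relevant)

print("ChatGPT answers identified:", identification_accuracy(guesses, "chatgpt"))
print("Human answers identified:  ", identification_accuracy(guesses, "human"))
```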

ChatGPT averaged about 3.6 unique insights for each part of the questions. It showed the most insight in research methods, health information, and management topics.

Learning Levels of ChatGPT

ChatGPT's answers were judged against Bloom's revised taxonomy of learning, a model that describes increasing levels of understanding. About 71.4% of its answers were judged to show the required level of learning, while only a small portion fell below it.
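Bloom's revised taxonomy runs from "remember" at the bottom to "create" at the top. As an illustration only, the sketch below tallies what share of a set of ratings meets or exceeds an expected level; the ratings are placeholders chosen to mirror the roughly 71.4% figure, not data from the study.

```python
# Bloom's revised taxonomy, ordered from lowest to highest cognitive level.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyse", "evaluate", "create"]

def share_at_or_above(ratings, expected):
    """Proportion of rated answers at or above the expected Bloom level."""
    threshold = BLOOM_LEVELS.index(expected)
    return sum(BLOOM_LEVELS.index(r) >= threshold for r in ratings) / len(ratings)

# Placeholder ratings chosen so 5 of 7 (about 71.4%) reach the expected level.
ratings = ["apply", "analyse", "understand", "evaluate", "apply", "remember", "analyse"]
print(share_at_or_above(ratings, expected="apply"))  # 0.714...
```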

Key Findings

The study found that ChatGPT could have passed the public health exam on three out of four occasions. Its answers were mostly consistent, scoring between 5 and 7 points. A big part of its success came from its strong performance in research methods, similar to other findings showing that ChatGPT works well in math.

It was challenging for markers to tell apart answers from ChatGPT and those from humans. ChatGPT was able to provide some valuable insights for every question, making it a potential study aid for students. However, it sometimes gave incorrect information, indicating a need for caution when relying on it for facts.

What We Already Know

Large language models (LLMs) like ChatGPT can help in public health. They can assist with tasks such as data analysis, but they also come with risks, such as spreading misinformation. ChatGPT has shown uneven performance on various health exams but may be a helpful learning tool.

What This Study Adds

From this study, we learned that ChatGPT can generate believable responses to public health questions that are hard to tell apart from those of human candidates. Yet it still sometimes states false information with great confidence.

It appears to do better when answering factual questions, but the occurrence of errors is still a significant problem. This raises concerns for institutions that assess public health knowledge when answers may have been partially created by AI.

Limitations of the Study

We could only check one specific part of the public health exam due to the availability of markers, meaning we didn’t look at all types of questions. We had to exclude some questions that didn't fit the new exam style, which limited our sample size.

It’s likely that ChatGPT would have struggled with longer, detailed questions since it didn't always provide the depth needed. We also didn't use follow-up questions, which might have improved the relevance of its answers.

The study focused on ChatGPT version 3.5, and since then, a new version has come out with improvements. Other AI models have also been introduced, so regular checks of these tools are necessary to stay updated with their capabilities.

Finally, this study looked only at ChatGPT in a specific exam situation, so we should be careful about applying these results more broadly. Markers noticed that ChatGPT sometimes had difficulty making its answers specific to the unique scenarios, especially in open-ended questions.

Conclusion

ChatGPT 3.5 did relatively well on the public health exam, particularly in the research methods section. Its answers were tough to differentiate from those of human candidates, suggesting it could be a valuable resource for learning in public health. Nevertheless, the tendency to generate false information must be addressed to fully realize its potential. As AI technology continues to develop, ongoing research will be essential to evaluate its benefits and risks in public health education.

Original Source

Title: ChatGPT sits the DFPH exam: large language model performance and potential to support public health learning

Abstract: Background: Artificial intelligence-based large language models, like ChatGPT, have been rapidly assessed for both risks and potential in health-related assessment and learning. However, their application in public health professional exams has not yet been studied. We evaluated the performance of ChatGPT in part of the Faculty of Public Health's Diplomate exam (DFPH). Methods: ChatGPT was provided with a bank of 119 publicly available DFPH question parts from past papers. Its performance was assessed by two active DFPH examiners. The degree of insight and level of understanding apparently displayed by ChatGPT was also assessed. Results: ChatGPT passed 3 of 4 papers, surpassing the current pass rate. It performed best on questions relating to research methods. Its answers had a high floor. Examiners identified ChatGPT answers with 73.6% accuracy and human answers with 28.6% accuracy. ChatGPT provided a mean of 3.6 unique insights per question and appeared to demonstrate a required level of learning on 71.4% of occasions. Conclusions: Large language models have rapidly increasing potential as a learning tool in public health education. However, their factual fallibility and the difficulty of distinguishing their responses from those of humans pose potential threats to teaching and learning.

Authors: Nathan P Davies, R. Wilson, M. S. Winder, S. J. Tunster, K. McVicar, S. T. Thakrar, J. Williams, A. Reid

Last Update: 2023-07-06

Language: English

Source URL: https://www.medrxiv.org/content/10.1101/2023.07.04.23291894

Source PDF: https://www.medrxiv.org/content/10.1101/2023.07.04.23291894.full.pdf

Licence: https://creativecommons.org/licenses/by-nc/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to medrxiv for use of its open access interoperability.
