Simple Science

Cutting edge science explained simply

# Computer Science / Computation and Language

Evaluating Vocabulary Richness in ChatGPT

A study on how ChatGPT uses language and vocabulary features.

― 9 min read


Large Language Models (LLMs), like ChatGPT, are being tested on many tasks. These tasks include logical reasoning, math, and answering questions on various topics. However, not much focus is given to the way these models use language. This is surprising because understanding their language use is crucial. Language models like ChatGPT may have a major effect on how languages change over time. If these models stop using certain words, those words may fade from common use. Therefore, it's important to look at the language features in the text they create and how these features relate to how the models are set up.

This work includes a study of how rich the vocabulary is in the texts generated by LLMs and what factors affect this richness. A method is proposed to evaluate this Vocabulary Richness, with ChatGPT used as an example. The findings illustrate how vocabulary richness varies based on the version of ChatGPT and its settings, such as the presence penalty or the role the model is given. The tools and dataset used in this evaluation can be accessed publicly, allowing others to look into the language features of texts made by LLMs.

The Rise of ChatGPT and Its Impact

When ChatGPT was introduced in 2022, it led to a rapid growth in the use of AI tools based on Large Language Models. This increase has also sped up the development of LLMs, making it a major focus for tech companies. New models like Gemini from Google and Grok from xAI are examples of this trend. These base models are then fine-tuned to create conversational versions, like ChatGPT, which can answer questions and follow instructions.

As the number of tools based on LLMs continues to grow, it becomes crucial to understand how well they perform. This understanding helps in choosing the right model for specific tasks and figuring out if a tool is suitable for a given issue. Evaluating performance on different tasks is also valuable for pinpointing weaknesses in current models, guiding improvements in future versions, or new models created from scratch.

Challenges in Evaluating Conversational LLMs

Evaluating conversational LLMs presents challenges. Many benchmarks are designed to measure performance across a variety of tasks and topics. For instance, there are extensive test sets with thousands of questions that evaluate how well these models handle math problems across many subjects. There are also large benchmarks that assess how well conversational LLMs know various topics using multiple-choice questions, and some suites cover over 200 different tasks. Additionally, there are tests for common-sense reasoning that require models to select the best option to finish a sentence. In these evaluations, results are measured as the percentage of correct answers, with models aiming for scores as close to 100% as possible.

These comprehensive benchmarks can accurately measure how well conversational LLMs perform on tasks. However, LLMs do more than just answer questions or solve specific problems; they will increasingly be used to create new content. Soon, AI-generated novels or textbooks will likely become common. These texts will be read by people and might also be used to train new LLMs, impacting future authors, both human and AI alike. In generative AI models like LLMs, using AI-produced data for training may lead to performance issues. For people, the texts they read shape their language skills. Hence, it is important to examine how LLMs utilize language and vocabulary. If LLMs do not use certain words, those words may lose their frequency in everyday language and eventually be forgotten altogether.

The richness of vocabulary in LLMs could therefore have a significant effect on how languages develop in the future.

Key Questions in Language Evaluation of LLMs

Some of the key questions regarding how LLMs use language have not been covered adequately by existing benchmarks. Some studies have looked at the language features in texts produced by LLMs, analyzing things like phonological biases and comparing certain language traits or vocabulary richness between LLMs and human writers. However, there has yet to be a thorough attempt to analyze how the language traits of LLMs depend on model settings, the type of content generated, or the role assigned to the model.

Additionally, there is no dataset specifically designed to test conversational LLMs in producing various types of texts that could evaluate their language features. This study addresses both of these gaps by creating a simple dataset to test text generation capabilities in conversational LLMs and using this dataset to evaluate how vocabulary richness varies with different LLM parameters, like Temperature or Top Probability. The texts generated and the dataset are available for public use, enabling further analysis by other researchers interested in the language features of LLM-generated content.

Methodology for Evaluating Vocabulary Richness

To evaluate the vocabulary richness of ChatGPT, a method called "Cave Verba" was created. This name comes from a Latin phrase meaning "beware of words." It emphasizes the importance of words in AI-generated texts and the need for careful study of the outputs. The first part of this evaluation explains the test suite used, followed by the testing method.

Test Suite

The tests are designed to evaluate vocabulary richness comprehensively while maintaining reasonable computational costs. Key components of the test suite include tasks, roles, and parameters.

Tasks and Topics

The initial step in the test suite is defining the tasks and related prompts. Since the focus is on the features of LLM-generated text, the tests are focused on prompts that require the LLM to create new content. Tasks selected for testing include:

  1. Essay Writing: Here, the LLM is asked to write a short essay on a given topic.
  2. Question Answering: The LLM is prompted to answer questions on various topics without multiple-choice options.

For essay writing, two different sets of prompts are used. One set corresponds to TOEFL essay topics, while the other consists of prompts gathered by The New York Times for argumentative and narrative writing. For the question-answering task, subsets of 40 questions are randomly selected from categories like medicine, finance, and others.
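
To make the setup concrete, here is a minimal sketch of how the two tasks and their prompt sets might be organised in code. The topics and questions below are illustrative placeholders, not the actual TOEFL, New York Times, or question-answering items used in the study.

```python
# Illustrative sketch only: the topics and questions are placeholders,
# not the prompt sets actually used in the evaluation.
ESSAY_TOPICS = {
    "toefl": [
        "Is it better to work in a team or alone?",
        "Should students be required to attend classes?",
    ],
    "nyt": [
        "Write about a moment that changed your mind.",
        "Should homework be abolished in secondary school?",
    ],
}

QA_QUESTIONS = {
    "medicine": ["What is the main function of red blood cells?"],
    "finance": ["What does compound interest mean?"],
}

def build_prompt(task: str, item: str) -> str:
    """Wrap a topic or question in the task instruction."""
    if task == "essay":
        return f"Write a short essay on the following topic: {item}"
    return f"Answer the following question: {item}"
```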

Roles

To study how the role assigned to the LLM affects vocabulary richness, different roles were chosen that are expected to influence language use and vocabulary. The roles were selected based on factors such as age, social class, and gender, which can impact language usage in various ways.

The roles chosen for evaluation include:

  • Default: No specific role is assigned.
  • Child: Responding as a five-year-old.
  • Young Adult Male: Responding as a young male adult.
  • Young Adult Female: Responding as a young female adult.
  • Elderly Adult Male: Responding as an elderly male adult.
  • Elderly Adult Female: Responding as an elderly female adult.
  • Affluent Adult Male: Responding as an affluent male adult.
  • Affluent Adult Female: Responding as an affluent female adult.
  • Lower-Class Adult Male: Responding as a lower-class male adult.
  • Lower-Class Adult Female: Responding as a lower-class female adult.
  • Erudite: Responding as a highly educated user of the language.

These roles help to shed light on how different social parameters can shape the vocabulary richness of texts generated by the AI.
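
One simple way to assign such roles is to inject them as system prompts ahead of each task prompt. The sketch below illustrates that idea; the exact wording is an assumption for illustration, not the phrasing used in the paper.

```python
# Minimal sketch: role instructions as system prompts. The wording is an
# illustrative assumption, not the exact phrasing used in the study.
ROLE_PROMPTS = {
    "default": None,  # no role assigned
    "child": "Respond as a five-year-old child would.",
    "young_adult_male": "Respond as a young adult man would.",
    "elderly_adult_female": "Respond as an elderly woman would.",
    "erudite": "Respond as a highly educated, erudite user of the language.",
    # ...the remaining roles follow the same pattern
}

def build_messages(role: str, prompt: str) -> list[dict]:
    """Assemble chat messages, optionally prefixed with a role instruction."""
    messages = []
    if ROLE_PROMPTS.get(role):
        messages.append({"role": "system", "content": ROLE_PROMPTS[role]})
    messages.append({"role": "user", "content": prompt})
    return messages
```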

Parameters

Each LLM has its own hyperparameters, and testing must adapt to them. This evaluation focuses on parameters common to many models, including OpenAI's products. The key parameters accessible through the OpenAI API include:

  • Temperature: Higher values lead to more random outputs, while lower values result in more focused and predictable generation.
  • Top Probability: This parameter restricts sampling to the most likely tokens, based on a set probability threshold.
  • Frequency Penalty: This parameter penalizes new tokens based on their existing frequency in the text, helping to reduce repetition.
  • Presence Penalty: Similar to the frequency penalty, this parameter reduces the likelihood of using tokens that already appeared in the text.

Understanding how these parameters influence vocabulary richness is vital for evaluating the LLM's language use.
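
As a rough illustration, the sketch below shows how these four parameters map onto a single generation request, assuming the OpenAI Python client; the values are arbitrary examples, not the grid actually swept in the evaluation.

```python
# Sketch of one generation call with the four parameters of interest,
# assuming the OpenAI Python client (openai >= 1.0). Values are examples only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",                 # or "gpt-3.5-turbo" for the older version
    messages=[{"role": "user",
               "content": "Write a short essay on life in a big city."}],
    temperature=1.0,               # higher values give more random output
    top_p=1.0,                     # nucleus sampling threshold
    frequency_penalty=0.0,         # penalise tokens by how often they appeared
    presence_penalty=0.0,          # penalise tokens that have appeared at all
)

text = response.choices[0].message.content
```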

Testing Procedure

The first step is to generate text with the AI model using a script that calls its services with the desired parameters. The script cycles through each prompt, collects the responses, and stores them in processed files from which the vocabulary richness metrics are then calculated.
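
The overall driver loop might look roughly like this sketch, where `generate` stands in for a thin wrapper around the API call shown earlier and the file layout is a hypothetical choice made only for illustration.

```python
# Hypothetical driver loop: cycle through every prompt of a category and
# save each response so the richness metrics can be computed afterwards.
import json
from pathlib import Path

def run_category(category, prompts, generate, out_dir="outputs"):
    """Generate one response per prompt and store them as a JSON file.

    `generate` is any callable that takes a prompt string and returns the
    model's text (for example, a wrapper around the API call above).
    """
    results = []
    for prompt in prompts:
        results.append({"prompt": prompt, "response": generate(prompt)})
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / f"{category}.json").write_text(json.dumps(results, indent=2))
```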

Vocabulary richness is calculated with four metrics. Two are computed over all the types and tokens pooled from each category's texts, while the other two are computed on each individual text and then averaged. The first pair gives a broad view of richness across the entire text set, while the second pair focuses on individual responses to highlight how varied the vocabulary of each one is.

The Root Type-Token Ratio (RTTR) and the Maas index are used for the pooled evaluation of overall vocabulary richness across all texts in a category. The other two metrics, the Moving-Average Type-Token Ratio (MATTR) and the Measure of Textual Lexical Diversity (MTLD), evaluate each text separately, so the richness of each essay or answer is measured independently.
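
For reference, here is a minimal sketch of three of these metrics (RTTR, Maas, and MATTR) with a naive whitespace tokeniser. MTLD is omitted for brevity, and details such as the tokenisation and the MATTR window size are assumptions for illustration rather than the paper's exact implementation.

```python
# Simplified metric sketch. Tokenisation and the MATTR window size are
# assumptions for illustration, not the exact implementation from the paper.
import math

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def rttr(tokens: list[str]) -> float:
    """Root Type-Token Ratio: V / sqrt(N)."""
    return len(set(tokens)) / math.sqrt(len(tokens))

def maas(tokens: list[str]) -> float:
    """Maas index: (log N - log V) / (log N)^2 (lower means richer)."""
    n, v = len(tokens), len(set(tokens))
    return (math.log(n) - math.log(v)) / math.log(n) ** 2

def mattr(tokens: list[str], window: int = 100) -> float:
    """Moving-Average Type-Token Ratio over fixed-size windows."""
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

# Corpus-level richness: pool all texts of a category, then apply rttr / maas.
# Per-text richness: compute mattr (or MTLD) on each text and average.
```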

Key Insights from Evaluation Results

The results of the vocabulary richness evaluations reveal several important insights regarding how different factors impact language use in LLMs. It was found that:

  1. ChatGPT 4 shows greater vocabulary richness than ChatGPT 3.5 across many settings.
  2. High Temperature or Frequency Penalty values can push the model outside an acceptable range and produce invalid, nonsensical texts.
  3. Temperature has a small positive effect on vocabulary richness.
  4. Vocabulary richness tends to rise with higher Presence Penalties, which encourage the use of new words.
  5. Top Probability affects richness only minimally, with slight gains at values close to one.
  6. The assigned role influences richness, most clearly for the child role, while age, gender, and social class otherwise have little impact.
  7. Essay writing generally yields richer vocabulary than question answering.

These insights can guide users in selecting model settings to control vocabulary richness and help them understand how different settings can affect the quality of generated text.

Conclusion and Future Research Directions

This evaluation has shown how various settings affect the vocabulary richness in the texts produced by ChatGPT. The development of a dataset that exercises different tasks and roles helps illuminate the relationship between model parameters and language use in AI-generated content.

The findings highlight that while certain parameters like presence penalty can positively influence vocabulary richness, others may produce invalid outputs in specific ranges. Overall, vocabulary richness is greater in essays than in responses to questions.

Also, the impact of factors such as social class, age, and gender on vocabulary use seems minimal in many cases, except when engaging with a child role. The mixed results for the erudite role suggest some complexity in how different types of prompts can impact language richness.

This methodology lays groundwork for further research on the vocabulary richness of other LLMs, ultimately providing a broader understanding of how different language models engage with vocabulary. The shared dataset allows future studies to further analyze these features, enhancing our understanding of AI in language generation.

Original Source

Title: Beware of Words: Evaluating the Lexical Richness of Conversational Large Language Models

Abstract: The performance of conversational Large Language Models (LLMs) in general, and of ChatGPT in particular, is currently being evaluated on many different tasks, from logical reasoning or maths to answering questions on a myriad of topics. Instead, much less attention is being devoted to the study of the linguistic features of the texts generated by these LLMs. This is surprising since LLMs are models for language, and understanding how they use the language is important. Indeed, conversational LLMs are poised to have a significant impact on the evolution of languages as they may eventually dominate the creation of new text. This means that for example, if conversational LLMs do not use a word it may become less and less frequent and eventually stop being used altogether. Therefore, evaluating the linguistic features of the text they produce and how those depend on the model parameters is the first step toward understanding the potential impact of conversational LLMs on the evolution of languages. In this paper, we consider the evaluation of the lexical richness of the text generated by LLMs and how it depends on the model parameters. A methodology is presented and used to conduct a comprehensive evaluation of lexical richness using ChatGPT as a case study. The results show how lexical richness depends on the version of ChatGPT and some of its parameters, such as the presence penalty, or on the role assigned to the model. The dataset and tools used in our analysis are released under open licenses with the goal of drawing the much-needed attention to the evaluation of the linguistic features of LLM-generated text.

Authors: Gonzalo Martínez, José Alberto Hernández, Javier Conde, Pedro Reviriego, Elena Merino

Last Update: 2024-02-11

Language: English

Source URL: https://arxiv.org/abs/2402.15518

Source PDF: https://arxiv.org/pdf/2402.15518

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
