Using Images to Boost Word Prediction in Language Tasks
Study shows that images help people and AI guess next words more accurately.
― 6 min read
The Shannon game is a classic task used in language studies. It asks people to guess the next letter in a sentence based on what came before it. In this study, we expand this idea by adding images as an option. We want to see how using both text and images can help people and a computer model guess words better.
We had real people and a language model, GPT-2, take part in the game. Our findings show that when images are included, both the people and the computer could guess the next word more accurately and felt more confident in their guesses. Interestingly, certain types of words, such as nouns and determiners, benefited much more from the image help.
As the amount of context increased, meaning the extra information from the image added to the sentence, the guessing improved even more. This shows that using images alongside text can really help in language tasks.
How the Game Works
The original Shannon game is designed to show how predictable the English language is. It starts with a participant trying to guess what letter comes next. They pick from 26 letters or a space. When they make a guess, the correct letter is shown, and they guess the next one.
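To make the mechanics concrete, here is a minimal Python sketch of the letter-level game. The guesser here simply tries symbols in rough English frequency order, which is an illustrative assumption, not how Shannon's human participants played:

```python
import string

# Minimal sketch of the original letter-level Shannon game: the guesser
# proposes symbols (a-z or space) until the correct one is revealed, and
# the number of guesses per position measures how predictable the text is.
ALPHABET = set(string.ascii_lowercase + " ")
FREQ_ORDER = " etaoinshrdlcumwfgypbvkjxqz"  # rough English frequency order

def play_shannon_game(sentence: str) -> list[int]:
    """Return the number of guesses needed for each symbol in the sentence."""
    guesses_per_symbol = []
    for target in sentence.lower():
        if target not in ALPHABET:
            continue  # skip punctuation and digits
        for n, guess in enumerate(FREQ_ORDER, start=1):
            if guess == target:
                guesses_per_symbol.append(n)
                break
    return guesses_per_symbol

counts = play_shannon_game("Several plates of food are set on a table")
print(f"mean guesses per symbol: {sum(counts) / len(counts):.2f}")
```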
In our version, we present a sentence along with an image. For example, if we show the sentence "Several plates of food are set on a table" along with a related image, the participant sees the first three words and then guesses the word that comes next. They then rate how confident they feel about their guess. After the word is revealed, they can see how close they were and reflect on their accuracy.
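One round of this word-level variant can be written down as a small record. The field names and the 1-to-5 confidence scale below are illustrative assumptions, not the paper's exact interface:

```python
from dataclasses import dataclass

@dataclass
class Round:
    """One guess in the word-level game."""
    context: list[str]      # words revealed so far
    image_hint: str | None  # e.g. a path or URL to the image, or None
    guess: str              # the participant's predicted next word
    confidence: int         # self-reported, e.g. 1 (low) to 5 (high)
    target: str             # the actual next word, revealed afterwards

    @property
    def correct(self) -> bool:
        return self.guess.lower() == self.target.lower()

r = Round(
    context=["Several", "plates", "of"],
    image_hint="table_with_food.jpg",
    guess="food",
    confidence=4,
    target="food",
)
print(r.correct)  # True
```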
Previous studies have shown that humans can guess words better when they receive context. This has been explored using a method called the cloze procedure, where volunteers fill in a missing word based on what they see before and after the blank. Our game is similar, but it uses only the context to the left of the blank, without any right-side hints.
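The distinction is easy to see side by side, reusing the example sentence from above:

```python
# Cloze procedure: context on both sides of the blank is visible.
cloze = "Several plates of ____ are set on a table"
# Our game: only the left-hand context is visible.
ours = "Several plates of ____"
print(cloze)
print(ours)
```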
Many earlier studies focused only on text, but we wanted to look at how images can also help. We compare how well people and the language model perform with just text and with both text and images.
Related Studies
Before this study, researchers had already looked into how context affects word prediction. Early studies relied on sentence-completion tasks. While some teams had included images before, these were not directly tied to the guessing task itself.
We think that predicting the next word in a sentence gives us a great chance to study how context influences language processing. The effects of context can vary and have been shown in brain studies, where researchers found that how the brain reacts to a word can depend on what was said previously.
Some past studies have looked at how visual context impacts understanding sentences, but they didn't directly explore how images can serve as hints for guessing words. Our goal is to fill this gap by looking at how images or information from images can help in predicting words in the context of our game.
Priming and Prompting
Priming is a well-known idea in psychology. It occurs when presenting one stimulus affects the processing of another later stimulus. For example, if someone sees the word "cat," they will likely respond faster to "dog" afterward because these two words are related.
Prompting is similar but is used in language models. It means giving extra context to help models complete a task. In our game, we assess whether visual clues help humans and language models predict words in the same way.
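As a rough illustration, image information can be turned into a textual prompt by prepending detected object labels to the sentence prefix. The prompt format here is an assumption made for illustration, not necessarily the paper's exact template:

```python
# Turn image-derived object labels into extra textual context for the LM.
image_labels = ["plate", "food", "table"]  # e.g. output of an object detector
sentence_prefix = "Several plates of"

prompt_text_only = sentence_prefix
prompt_multimodal = (
    "Objects in the image: " + ", ".join(image_labels) + ". " + sentence_prefix
)
print(prompt_multimodal)
# Objects in the image: plate, food, table. Several plates of
```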
We looked into how visual information can aid in predicting the next word in a sentence, and our findings suggest that using images helps both people and the language model guess words better.
Experiment Setup
In our game, participants try to guess the next word based on the words before it. We tested five different setups: in one, no image was shown; in another, the full image was provided; the remaining setups gave partial image information, such as text labels or snippets of the image (see the sketch below). Participants were asked to predict the next word and rate their confidence in their guesses.
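For reference, the five setups can be sketched as an enum; the names and descriptions are assumptions pieced together from this summary, not the paper's exact labels:

```python
from enum import Enum

class ImageContext(Enum):
    """The five context configurations tested in the experiment (names assumed)."""
    NONE = "no image shown"
    FULL_IMAGE = "the entire image shown"
    LABELS = "text labels of detected objects only"
    SNIPPETS = "cropped image snippets of detected objects"
    LABELS_AND_SNIPPETS = "all labels and object snippets together"

for cfg in ImageContext:
    print(f"{cfg.name}: {cfg.value}")
```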
A total of 24 participants from various backgrounds took part. They were all non-native English speakers with a high level of English proficiency. Each participant saw 17 sentences with randomly assigned setups.
The interface was designed as a web application to allow more people to take part. The participants would see a sentence, guess the next word, and then self-evaluate how accurate their guess was. The process continued until the end of the sentence.
Results of the Experiment
We found that the presence of images significantly boosted both accuracy and confidence in guesses. The configuration where the entire image was shown led to the best results for both accuracy and confidence ratings. In setups without an image, participants felt less sure about their guesses.
When using only text labels or snippets of the image, participants still showed increased confidence compared to the no-image setup. However, the configuration that provided all labels and objects in the image was sometimes distracting.
As expected, participants showed lower confidence at the beginning of the sentences. For the first word, they mainly guessed articles or did not attempt a guess.
Certain types of words were easier to predict. For instance, participants were better at guessing determiners than nouns. This was interesting because the initial words should not have been influenced by what was shown in the images.
Language Model Results
We also ran the experiment with the GPT-2 language model. For this part, we focused on two configurations: no image, and image information provided as text labels. The model showed slightly better results when it used the image labels to assist in guessing.
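A minimal sketch of this comparison using the Hugging Face transformers library is below; it checks the probability GPT-2 assigns to a target word with and without image labels prepended. The prompt format is an assumption:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_word_probability(prompt: str, target: str) -> float:
    """Probability the model assigns to `target` as the next token after `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    # GPT-2's BPE vocabulary encodes a following word with a leading space.
    target_id = tokenizer.encode(" " + target)[0]
    return probs[target_id].item()

prefix = "Several plates of"
with_labels = "Objects in the image: plate, food, table. " + prefix
print(f"text only:   {next_word_probability(prefix, 'food'):.4f}")
print(f"with labels: {next_word_probability(with_labels, 'food'):.4f}")
```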
The patterns we saw in both humans and the model indicated that both had more confidence and made more accurate guesses when provided with extra information. However, the relationship between human scores and the model's scores varied when images were included.
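One simple way to quantify how human and model scores move together is a per-item correlation. The numbers below are placeholders, not results from the study:

```python
from scipy.stats import pearsonr

# Placeholder per-sentence accuracy scores for humans and the model.
human_scores = [0.42, 0.55, 0.61, 0.70, 0.66]
model_scores = [0.38, 0.50, 0.65, 0.72, 0.60]

r, p = pearsonr(human_scores, model_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```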
Conclusions
In summary, our study shows that visual hints can help in predicting the next word in a sentence. The game we created showed that any visual information positively affects confidence and accuracy when guessing words. Out of all options, using the full image gave the best results.
We also noticed that the amount of context and the type of word influenced how well participants could predict words. The more context there is, the clearer these effects become.
Future studies could look into different types of images or even other forms of input, like video or sound, to see how they compare. The current study focused mainly on English; other languages may behave differently in similar tasks.
Overall, our work has opened new avenues to explore how combining text and images can help both humans and machines better understand and predict language.
Title: Multimodal Shannon Game with Images
Abstract: The Shannon game has long been used as a thought experiment in linguistics and NLP, asking participants to guess the next letter in a sentence based on its preceding context. We extend the game by introducing an optional extra modality in the form of image information. To investigate the impact of multimodal information in this game, we use human participants and a language model (LM, GPT-2). We show that the addition of image information improves both self-reported confidence and accuracy for both humans and LM. Certain word classes, such as nouns and determiners, benefit more from the additional modality information. The priming effect in both humans and the LM becomes more apparent as the context size (extra modality information + sentence context) increases. These findings highlight the potential of multimodal information in improving language understanding and modeling.
Authors: Vilém Zouhar, Sunit Bhattacharya, Ondřej Bojar
Last Update: 2024-09-27
Language: English
Source URL: https://arxiv.org/abs/2303.11192
Source PDF: https://arxiv.org/pdf/2303.11192
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arXiv for use of its open access interoperability.