A New Method for Tracking Word Meaning Changes
This method uses contextual embeddings to measure how word meanings shift over time.
Measuring how the meaning of words changes over time is an important part of understanding language. Earlier work relied on fairly simple techniques built on static word representations, and more recent methods based on contextual embeddings have not shown clear improvements over those baselines. The problem is further complicated by concerns about how easily these methods can be scaled and interpreted.
Here, we discuss a fresh approach to measuring semantic change with contextual embeddings, focusing on the most probable substitutes for masked terms. This method is not only easier to understand and interpret, it also requires far less storage and performs better on the most commonly used datasets for this task. Furthermore, it allows a more nuanced investigation of how word meanings change.
The Challenge of Measuring Semantic Change
Semantic change detection has been a challenging area of natural language processing (NLP). Many of the approaches that have emerged fail to outperform traditional methods based on static word vectors. The common practice is to compare word representations from different time periods, but doing so reliably has proven difficult.
In theory, contextual embeddings should be well suited to more detailed analysis, since they give each word a distinct representation based on the context in which it occurs. Yet many attempts to use these models have produced worse performance while requiring more resources than simpler techniques.
This paper offers a new, straightforward strategy for detecting semantic change using contextual embeddings. We focus on the most probable substitutes for masked words and examine how the distributions of these substitutes vary across time periods. This approach not only aligns better with human judgment, it also runs efficiently and helps us understand how meanings change over time.
Existing Methods and Their Limitations
There are two main types of methods currently used to track how the meanings of words change. The first approach associates each term with a single vector for each time period and analyzes the distance between these vectors. Some variations have attempted to average the outputs from contextual models to get a single representation per time period, but this has not significantly improved performance over static vectors.
The second approach starts from word sense induction methods and measures the change by looking at how the different meanings of a word are used over time. Some researchers cluster the contextual representations and look at how these clusters differ between two time periods.
Although contextual embeddings offer more detailed information, methods using static embeddings remain competitive. For instance, in a recent multilingual semantic change detection competition, none of the top-performing systems used contextual embeddings.
Our Approach
In our approach, we represent each token in our data by a small set of likely replacements from a contextual embedding model. While some previous research focused on using these models for disambiguating word senses, we aim to measure semantic change directly.
For any given occurrence of a word, we mask the word of interest and pass the masked context through the model to obtain predicted probabilities for the masked token. We keep only the most probable substitutes and discard the rest.
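As an illustration of this step (not the paper's exact code), the sketch below uses Hugging Face's fill-mask pipeline to collect the top substitutes for one masked occurrence; the model name, the example sentence, and the cutoff of ten substitutes are placeholder choices.

```python
# A minimal sketch of collecting top substitutes for a masked occurrence,
# using Hugging Face's fill-mask pipeline. Model name, sentence, and top_k
# are illustrative, not the paper's exact settings.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# One occurrence of the target word "plane", with the target masked out.
context = "The [MASK] landed safely despite the storm."

# Keep only the most probable substitutes and discard the rest.
substitutes = fill_mask(context, top_k=10)
for s in substitutes:
    print(s["token_str"], round(s["score"], 4))
```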
We then build a distribution of these substitutes for a word within a specific time period. To quantify the semantic change between two periods, we calculate a score based on how different the two distributions are. Since these raw scores are influenced by how often a word appears, we rescale them to account for frequency effects.
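The summary does not pin down the exact measure of difference between distributions, so the sketch below assumes Jensen-Shannon distance between the two normalized substitute distributions; the counts and the `change_score` helper are hypothetical illustrations.

```python
# Hedged sketch: compare substitute distributions from two periods with
# Jensen-Shannon distance. The counts and the choice of divergence are
# illustrative; the paper's exact scoring may differ.
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def change_score(counts_t1: Counter, counts_t2: Counter) -> float:
    """Distance between two substitute distributions for one target word."""
    vocab = sorted(set(counts_t1) | set(counts_t2))
    p = np.array([counts_t1[w] for w in vocab], dtype=float)
    q = np.array([counts_t2[w] for w in vocab], dtype=float)
    p /= p.sum()
    q /= q.sum()
    return jensenshannon(p, q)  # 0 = identical usage, higher = more change

# Hypothetical substitute counts pooled over sampled occurrences of "plane".
t1 = Counter({"surface": 40, "line": 30, "level": 20})
t2 = Counter({"aircraft": 45, "jet": 25, "surface": 10})
print(change_score(t1, t2))
```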
Compared to approaches that require substantial storage space, ours uses much less: the dominant meanings of a word can be summarized by its most frequently occurring substitutes. Even when words are split into multiple pieces during tokenization, the substitutes can still capture a range of meanings effectively.
Data and Experimental Setup
To evaluate our method, we use datasets that have previously been annotated for semantic change detection. We focus on five datasets in which words are labeled for semantic change across two time periods. Four come from a recent shared task on unsupervised lexical semantic change detection and include words in several languages graded by human evaluators. The fifth has words labeled for semantic change from the 1960s to the 1990s.
For each dataset, we adapt a BERT model to the associated corpus by continuing masked language model training on that text. We then identify occurrences of each target word and randomly sample a set of occurrences for analysis.
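A minimal sketch of this adaptation step, assuming the standard Hugging Face masked-language-modeling setup; the corpus path, model name, and hyperparameters are placeholders rather than the paper's actual configuration.

```python
# Hedged sketch: continue masked-language-model training on the corpus text
# so the model adapts to the domain. File path, model, and hyperparameters
# are placeholders, not the paper's exact configuration.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Plain-text corpus for the dataset, one document per line (placeholder path).
corpus = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted-bert", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```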
With the sampled tokens, we apply our method to find the most likely replacements and compute scores for semantic change. Finally, we compare our results against established benchmarks using statistical measures.
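Evaluation on these benchmarks is typically reported as a rank correlation with the human-graded change scores; the sketch below assumes Spearman's rho, with made-up numbers purely for illustration.

```python
# Hedged sketch: evaluate predicted change scores against human ratings
# with Spearman rank correlation (a standard metric for this task).
# The scores below are made-up illustrations, not real results.
from scipy.stats import spearmanr

human_ratings = [0.82, 0.10, 0.45, 0.05, 0.67]     # graded change per target word
predicted_scores = [0.74, 0.22, 0.51, 0.09, 0.58]  # method's rescaled scores

rho, p_value = spearmanr(human_ratings, predicted_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```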
Results and Analysis
Our method produces varying results across datasets. While it does not always outperform previous methods, it achieves the best average performance and shows substantial improvements on specific datasets.
In addition, the raw scores from our method show a strong correlation with how often a term appears. This correlation underlines the need to rescale the raw scores so that each term is compared against terms of similar frequency.
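One simple way to implement such frequency-matched rescaling, offered here as an assumption rather than the paper's exact procedure, is to convert each raw score into a percentile among comparison words of similar corpus frequency.

```python
# Hedged sketch: one way to control for frequency effects, not necessarily
# the paper's exact procedure. Each word's raw score is converted to a
# percentile among comparison words of similar corpus frequency.
import numpy as np

def frequency_rescaled(word, raw_scores, frequencies, band=0.5):
    """Percentile of `word`'s raw score among words whose log-frequency
    lies within `band` of the target's log-frequency."""
    target_logf = np.log(frequencies[word])
    peers = [w for w, f in frequencies.items()
             if abs(np.log(f) - target_logf) <= band and w != word]
    if not peers:
        return raw_scores[word]
    peer_scores = np.array([raw_scores[w] for w in peers])
    return float(np.mean(peer_scores <= raw_scores[word]))

# Hypothetical raw change scores and corpus frequencies.
raw = {"plane": 0.61, "graft": 0.58, "the": 0.20, "surface": 0.25, "flight": 0.30}
freq = {"plane": 900, "graft": 150, "the": 50000, "surface": 800, "flight": 1100}
print(frequency_rescaled("plane", raw, freq))
```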
When examining the results, we can easily interpret the shifts in meaning. For instance, the term "plane" shows a notable shift from referring primarily to a geometric figure to being associated with flying machines in more modern contexts. On the other hand, our analysis suggests that the change in meaning for the word "graft" might have been somewhat underestimated.
Insights from the Method
Our approach offers various insights into how word meanings evolve. By taking a closer look at the dominant substitutes for each term over different time periods, we can see clear shifts in usage. The data helps clarify the gradual adoption of new meanings or the decline of old ones.
This method also allows us to investigate how different meanings are organized into clusters. For example, we can group similar substitute terms and examine how their occurrence changes over time. The ability to visualize and analyze these shifts provides a more nuanced understanding of semantic change.
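As a rough illustration of this kind of analysis (not the paper's implementation), the sketch below groups substitute terms using the model's input embeddings and tracks each cluster's share of usage per period; the clustering method, the number of clusters, and the example counts are all hypothetical.

```python
# Hedged sketch: group substitute terms into rough sense clusters using the
# model's own input embeddings, then track each cluster's share per period.
# The clustering choices and the example counts are illustrative.
import torch
from collections import Counter
from sklearn.cluster import AgglomerativeClustering
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
embeddings = model.get_input_embeddings().weight.detach()

# Hypothetical pooled substitute counts for "plane" in two periods.
period_counts = {
    "earlier": Counter({"surface": 40, "line": 30, "level": 20, "field": 10}),
    "later": Counter({"aircraft": 45, "jet": 25, "flight": 20, "surface": 10}),
}

substitutes = sorted({w for c in period_counts.values() for w in c})
ids = tokenizer.convert_tokens_to_ids(substitutes)
vectors = embeddings[torch.tensor(ids)].numpy()

labels = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)
cluster_of = dict(zip(substitutes, labels))

for period, counts in period_counts.items():
    total = sum(counts.values())
    shares = Counter()
    for word, n in counts.items():
        shares[cluster_of[word]] += n / total
    print(period, dict(shares))
```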
Limitations and Future Work
While our method shows promise, there are limitations to consider. The datasets we used were relatively small, which means the estimates of correlation with human evaluations may not be entirely stable. Additionally, while we focused on several languages, there’s no guarantee that these methods will yield the same results in other languages or time periods.
The quality of the pretrained models also varies by language, which can affect overall performance. Different configurations, such as the number of samples taken or choices made about replacement terms, could lead to different outcomes.
Additionally, measuring semantic change is inherently complex. Words are used differently by different people, and attempts to quantify these changes can oversimplify the richness of language.
Future research could focus on expanding the datasets or exploring larger samples for more robustness. It will also be important to see how well our method works with different languages and potentially refine it to improve performance.
Conclusion
In conclusion, measuring how meanings change is vital for understanding language. Although many methods have tried to tackle this issue, our new approach using contextual embeddings offers an efficient and interpretable way to track semantic change.
The findings not only provide insights into how words evolve but also pave the way for more advanced studies in the future. By continuing to refine our methods and expand our datasets, we hope to gain a deeper grasp of language and the changes it undergoes over time.
Title: Substitution-based Semantic Change Detection using Contextual Embeddings
Abstract: Measuring semantic change has thus far remained a task where methods using contextual embeddings have struggled to improve upon simpler techniques relying only on static word vectors. Moreover, many of the previously proposed approaches suffer from downsides related to scalability and ease of interpretation. We present a simplified approach to measuring semantic change using contextual embeddings, relying only on the most probable substitutes for masked terms. Not only is this approach directly interpretable, it is also far more efficient in terms of storage, achieves superior average performance across the most frequently cited datasets for this task, and allows for more nuanced investigation of change than is possible with static word vectors.
Authors: Dallas Card
Last Update: 2023-09-06 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2309.02403
Source PDF: https://arxiv.org/pdf/2309.02403
Licence: https://creativecommons.org/licenses/by-sa/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.