Questioning Language Models: Bias Insights
Using queries to expose gender biases in language models.
― 5 min read
We look at how to extract information from language models, a type of artificial intelligence, using a method based on asking questions and receiving answers. Our approach builds on Angluin's exact learning model, which relies on two types of queries: membership queries and equivalence queries. In our setting, the trained language model plays the role of the teacher, or oracle, answering queries based on what it has learned.
Language models are complex systems whose inner workings are largely opaque, which makes it difficult to understand how they reach their decisions. Our goal is to find a way to uncover what these models have learned about the world, especially biases relating gender to occupations.
Learning from Language Models
The process of learning from these models involves creating a program that can ask the language model questions about specific scenarios. The answers help reveal underlying patterns in the data the model was trained on.
What Are Membership and Equivalence Queries?
Membership queries check whether a specific example is consistent with the rules the model has learned. For instance, one may ask whether a particular scenario fits a certain occupation.
Equivalence queries ask whether a proposed set of rules, our current hypothesis, matches what the model has learned. If it does not, the oracle provides a counterexample: an example on which the hypothesis and the model disagree.
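In code, the teacher can be pictured as an object that answers these two kinds of queries. The sketch below is only an illustrative interface under assumed types (examples as sets of boolean features, hypotheses as classifiers); it is not the authors' implementation.

```python
from typing import Callable, FrozenSet, Optional

Example = FrozenSet[str]                  # the features that are true in an example
Hypothesis = Callable[[Example], bool]    # a candidate rule set, viewed as a classifier


class Oracle:
    """A trained model viewed as the teacher in Angluin's exact learning model."""

    def membership(self, example: Example) -> bool:
        """Membership query: does this example satisfy what the model has learned?"""
        raise NotImplementedError

    def equivalence(self, hypothesis: Hypothesis) -> Optional[Example]:
        """Equivalence query: return None if the hypothesis matches the model,
        otherwise return a counterexample on which the two disagree."""
        raise NotImplementedError
```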
This approach is useful because it allows us to uncover biases and recognize how the model understands various roles in society.
Addressing Challenges
Pulling knowledge from language models is not simple. There are several challenges we need to address:
Simulating Equivalence Queries: The first challenge is that we cannot directly ask a neural network whether our current hypothesis matches everything it has learned. To work around this, we randomly sample batches of examples and compare the model's answers with the predictions of our hypothesis. If the two disagree on any sampled example, that example serves as a counterexample, showing that the hypothesis is not yet correct (a sketch of this sampling step follows the list of challenges).
Input Format: The second challenge concerns the format of the input. The learning algorithm works with logical examples, while language models expect natural-language text, so each example has to be converted into a sentence the model can process.
Non-Horn Behavior: The third challenge is that language models do not necessarily behave like Horn oracles: the theory underlying their answers may not be expressible as Horn clauses, a restricted form of logical rule. Since the models may encode more complex behavior, our algorithm needs to adapt to this reality.
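To make the sampling idea concrete, here is a minimal sketch in Python. The classifier interfaces, the sample budget, and the uniform sampling of features are assumptions chosen for illustration, not the authors' implementation.

```python
import random


def simulated_equivalence_query(oracle_classify, hypothesis_classify,
                                features, num_samples=1000, seed=0):
    """Approximate an equivalence query by random sampling.

    oracle_classify / hypothesis_classify: functions mapping a set of
    features (those that are true in an example) to True/False.
    Returns a counterexample on which the two disagree, or None if no
    disagreement was found within the sample budget.
    """
    rng = random.Random(seed)
    for _ in range(num_samples):
        # Draw a random example: each feature is true with probability 1/2.
        example = frozenset(f for f in features if rng.random() < 0.5)
        if oracle_classify(example) != hypothesis_classify(example):
            return example  # evidence that the current hypothesis is wrong
    return None  # no disagreement found: keep the current hypothesis for now
```

In the experiments described later, oracle_classify would wrap a membership query to the language model, while hypothesis_classify evaluates the current set of rules.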
A Proposed Algorithm
To tackle these challenges, we developed a new algorithm that aims to extract the "tightest Horn approximation" of the model's behavior. The algorithm is guaranteed to terminate: in exponential time in the worst case, and in polynomial time when the target has only polynomially many non-Horn examples.
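To give a concrete, simplified picture of what such an extracted hypothesis looks like, the sketch below represents a Horn theory as a set of rules over boolean features and checks whether an example satisfies it. The feature names and the sample rule are illustrative assumptions, not rules reported verbatim by the paper.

```python
from typing import FrozenSet, Optional, Set, Tuple

# A Horn clause is (antecedent, consequent): if every feature in the antecedent
# is true, the consequent must be true. A consequent of None stands for "false",
# i.e. the antecedent combination is ruled out entirely.
Clause = Tuple[FrozenSet[str], Optional[str]]


def satisfies(example: Set[str], clause: Clause) -> bool:
    antecedent, consequent = clause
    if not antecedent <= example:       # body not triggered: clause holds vacuously
        return True
    return consequent is not None and consequent in example


def satisfies_all(example: Set[str], theory: Set[Clause]) -> bool:
    return all(satisfies(example, clause) for clause in theory)


# Illustrative bias rule of the kind discussed below: "mathematician implies male".
theory = {(frozenset({"mathematician"}), "male")}
print(satisfies_all({"mathematician", "male"}, theory))     # True
print(satisfies_all({"mathematician", "female"}, theory))   # False
```

A function such as `lambda x: satisfies_all(x, theory)` can then play the role of hypothesis_classify in the sampled equivalence query sketched above.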
We conduct experiments with various pre-trained language models to uncover rules indicating gender biases in professions, such as assumptions linking certain jobs predominantly with men or women.
Conducting Experiments with Language Models
We employed well-known pre-trained language models, BERT and RoBERTa, to perform our tests. The main goal was to check how these models associate various occupations with gender. To do this, we gathered data from open sources containing information on occupations, genders, nationalities, and birth years.
Experiment Setup
Data Collection: We collected a dataset from a web resource that lists various jobs alongside associated genders, birth years, and nationalities.
Template Creation: Each example from the dataset is transformed into a sentence using a specific structure. For instance, "Person was born in [year] in [continent] and is a [occupation]."
Predicting Genders: For membership queries, we let the model predict the gender of a person based on the constructed sentence. We compare the model's prediction with the actual known gender to see if they match (a code sketch of the template and prediction steps follows this list).
Random Sampling for Equivalence Queries: For equivalence queries, we randomly generate feature combinations corresponding to possible examples and check whether the model's answers agree with our current hypothesis.
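Below is a minimal sketch of how the template construction and the gender prediction can be wired to a masked language model with the Hugging Face transformers library. The model name, the exact template wording, and the rule of taking the highest-ranked gendered word are assumptions for illustration, not the exact setup used in the paper.

```python
from transformers import pipeline

# Fill-mask pipeline over a pre-trained masked language model
# (bert-base-uncased is an assumed choice for illustration).
unmasker = pipeline("fill-mask", model="bert-base-uncased")


def make_sentence(birth_year: int, continent: str, occupation: str) -> str:
    """Turn one dataset row into a templated sentence with a masked gender word."""
    return (f"This person was born in {birth_year} in {continent} and is a "
            f"{occupation}. This person is a {unmasker.tokenizer.mask_token}.")


def membership_query(birth_year: int, continent: str, occupation: str,
                     actual_gender: str) -> bool:
    """Membership query: does the model's predicted gender match the known one?"""
    predictions = unmasker(make_sentence(birth_year, continent, occupation), top_k=20)
    for p in predictions:                      # most to least probable candidates
        token = p["token_str"].strip().lower()
        if token in ("man", "woman"):
            predicted = "male" if token == "man" else "female"
            return predicted == actual_gender
    return False                               # no gendered word among the candidates


# Illustrative call:
# membership_query(1965, "Europe", "mathematician", "female")
```

After mapping sampled feature sets back to template slots, a membership query like this can serve as the oracle side of the simulated equivalence queries sketched earlier.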
Results from Experiments
From our experiments, we found clear evidence of gender biases in the models. For example, rules extracted from the models indicated that women are unlikely to be associated with roles typically seen as masculine, such as bank manager or mathematician, while men were, in turn, unfairly excluded from roles such as nursing.
These findings align with existing research on biases present in society, confirming that the models reflect these biases.
Significance of Findings
Our results are meaningful as they provide insights into how language models understand gender roles related to professions. Such insights can have implications for how these models are used in real-world applications, like hiring practices or automated decision-making systems.
By understanding these biases, we can move toward creating fairer and more balanced AI systems. These systems should ideally promote equality and avoid reinforcing negative stereotypes.
Conclusion
In summary, we have outlined a method for extracting knowledge from language models using an approach based on querying. By adapting existing learning algorithms to the unique challenges posed by neural networks, we can expose biases that exist in these models. Our experiments have shown that these models tend to reflect societal biases regarding gender and occupation.
Moving forward, we hope this work will inspire further research into reducing bias in AI and improving the fairness of automated systems. It is essential to continue examining how these powerful technologies operate and influence our perceptions and decisions.
Title: Learning Horn Envelopes via Queries from Large Language Models
Abstract: We investigate an approach for extracting knowledge from trained neural networks based on Angluin's exact learning model with membership and equivalence queries to an oracle. In this approach, the oracle is a trained neural network. We consider Angluin's classical algorithm for learning Horn theories and study the necessary changes to make it applicable to learn from neural networks. In particular, we have to consider that trained neural networks may not behave as Horn oracles, meaning that their underlying target theory may not be Horn. We propose a new algorithm that aims at extracting the "tightest Horn approximation" of the target theory and that is guaranteed to terminate in exponential time (in the worst case) and in polynomial time if the target has polynomially many non-Horn examples. To showcase the applicability of the approach, we perform experiments on pre-trained language models and extract rules that expose occupation-based gender biases.
Authors: Sophie Blum, Raoul Koudijs, Ana Ozaki, Samia Touileb
Last Update: 2023-09-13 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2305.12143
Source PDF: https://arxiv.org/pdf/2305.12143
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.