Advancements in Specialized Language Models for Software Engineering
This study highlights the effectiveness of tailored language models using Stack Overflow data.
― 9 min read
Large pre-trained language models have transformed both natural language understanding and software engineering. Models like OpenAI's GPT series have advanced well beyond earlier models such as BERT and RoBERTa. These large models learn general language patterns and semantic relationships from huge amounts of data crawled from the internet, and this diverse training data helps them capture how humans use language.
However, the biggest models are costly to build and deploy, and they are often closed-source, so we do not know how they were built or what data they were trained on. We argue that, while large general-purpose models are useful, they should be complemented by smaller, single-purpose models focused on particular domains. In this study, we take Stack Overflow, a question-and-answer site for programmers, as an example of a domain where such focused models can excel.
Our approach uses publicly available data from Stack Overflow, which includes extensive pairs of questions and answers alongside comments. Following standard practices for pre-training large language models, we train two models on this data: SOBertBase, with 109 million parameters, and SOBertLarge, with 762 million parameters, at costs of roughly $187 and $800, respectively. We compare our models against the previous best model trained exclusively on Stack Overflow data, general-purpose BERT models, and OpenAI's ChatGPT across four downstream tasks specific to Stack Overflow.
Our models consistently outperform all baselines on these tasks, and even the smaller model often provides strong results. Both models are released to the public. This success shows that models trained thoroughly and properly on in-domain data can be a powerful and cost-effective alternative to large, closed-source models.
Importance of Language Models in Software Engineering
Language models have significantly impacted both natural language processing and software engineering. In natural language processing, they support tasks such as text classification, sentiment analysis, and machine translation. In software engineering, they treat code as a language and help with tasks like code completion, bug detection, and code summarization.
Models can be trained from scratch for new tasks when enough data is available, typically millions of tokens. However, recent findings show it is more effective to pre-train models on large amounts of data before adapting them to specific tasks. After pre-training, smaller models like BERT are generally fine-tuned for each downstream task, while larger ones like GPT-3 can be prompted directly, answering questions from a task description and a few examples.
The trend has been towards ever larger models. GPT-3, for example, has 175 billion parameters, and models at this scale require training data on the order of hundreds of billions of tokens to be effective. They perform remarkably well on language tasks when given a description of a problem and a few examples.
However, using these large models comes at a price. They are expensive to run and hard to fine-tune for smaller projects with limited resources. Many of them are also closed-source, which hides how they were trained and what data they used. Furthermore, their general-purpose training often yields lower performance on tasks where enough domain-specific data exists to train a specialized model.
In response, the software engineering community has developed models focused on specific coding tasks. For instance, CodeBERT is based on BERT and trained to predict masked-out parts of code, while models like Codex and CodeGen generate code left to right from the input they receive.
In this study, we contribute new BERT-style models trained on Stack Overflow data. The platform is distinctive in that it pairs natural-language explanations with code, making it a rich source of high-quality, aligned text and code for programmers seeking help.
Generic BERT models have previously been fine-tuned with success on Stack Overflow data. Some researchers have even proposed pre-training these models specifically on Stack Overflow, finding them to be more effective for specific tasks. In our work, we expand on this idea, improving how we train these models by focusing on the vast and rich data available from Stack Overflow.
Overview of Stack Overflow
Stack Overflow is a large question-and-answer platform, particularly for software programmers. Each post consists of several key elements: a title, a question body, an answer body, and user comments. The title gives a brief overview of what the question is about, while the question body contains detailed information about the user's issue or query.
The answer body contains responses from other users addressing the question, while comments provide additional context, clarifications, or suggestions on the main post. Posts are also tagged with keywords that describe their topics, helping users find relevant questions and answers.
When users ask questions on the site, they usually provide not only text but also related materials such as sample code, screenshots, and links to resources. Additionally, they use tags to identify the programming language or library involved in their queries. Once a question is posted, other users can respond with answers, and the original poster may select one as the accepted solution.
The community can also upvote or downvote answers and questions, which helps to highlight the most helpful content and minimize low-quality posts. However, it's worth noting that the accepted answer isn't always the one with the most votes, indicating the rich diversity of opinions on how to solve problems.
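To make this structure concrete, here is a minimal sketch of a post and its answers as Python data classes. The field names and helper methods are illustrative rather than the platform's actual schema, but they capture the distinction between the accepted answer and the highest-voted one.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Answer:
    body: str                # answer text, possibly containing code blocks
    score: int               # net upvotes minus downvotes
    is_accepted: bool = False

@dataclass
class Post:
    title: str                        # brief overview of the question
    question_body: str                # detailed description, code, screenshots, links
    tags: List[str]                   # e.g. ["python", "pandas"]
    answers: List[Answer] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)

    def accepted_answer(self) -> Optional[Answer]:
        """Return the answer the asker marked as accepted, if any."""
        return next((a for a in self.answers if a.is_accepted), None)

    def top_voted_answer(self) -> Optional[Answer]:
        """Return the highest-scoring answer, which may differ from the accepted one."""
        return max(self.answers, key=lambda a: a.score, default=None)
```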
Structure of Pre-Trained Language Models
Language models predict text based on the context provided. There are two key types: auto-regressive models and masked-language models. Auto-regressive models predict the next word in a sequence based on previous words. In contrast, masked-language models predict certain hidden words in a piece of text.
Both types can be pre-trained on extensive text datasets by breaking documents down into training examples. Auto-regressive models are particularly useful when the goal is to generate new text, as seen with models like ChatGPT. Masked-language models, such as BERT, see the full context on both sides of each masked word during training, which helps them build a solid representation of the text.
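The difference between the two objectives is easy to demonstrate with the Hugging Face transformers library, using two small public checkpoints (bert-base-uncased and gpt2, not the models trained in this work):

```python
from transformers import pipeline

# Masked-language modeling: the model sees the whole sentence and fills in [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The developer fixed the [MASK] in the code."))

# Auto-regressive modeling: the model continues the text left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("To reverse a list in Python, you can", max_new_tokens=20))
```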
BERT, developed by Google, is especially noteworthy. It uses a bidirectional Transformer encoder that learns from context by predicting masked words in sentences. This design helps it grasp the meaning of text, making it effective for tasks like tagging named entities in a sentence or classifying text.
Google has released two versions of BERT, BERTBase and BERTLarge, which vary in size and complexity. BERTBase has 110 million parameters, while BERTLarge has about 340 million parameters. Our models, SOBertBase and SOBertLarge, aim to leverage similar architectures while focusing on the rich data from Stack Overflow.
Training on Stack Overflow Data
Stack Overflow provides regularly updated dumps of publicly available content, including questions, answers, comments, and user interactions. This wealth of data, coupled with its clear annotation of explanatory text alongside code, makes it an excellent resource for software engineering research.
However, working with Stack Overflow data poses some challenges. The quality of answers varies significantly, with some being incomplete or lacking context. The vast quantity of data makes it difficult to discern high-quality content relevant to specific research questions.
To improve data quality, we keep only answers with at least one upvote, indicating some level of community endorsement. After filtering, we obtain millions of answer posts and comments, yielding a training set of roughly 27 billion tokens.
We preprocess the data to ensure it's suitable for model training. This includes tokenizing and cleaning the text, while preserving the code snippets as they are an integral part of the content. The cleaned text is then tokenized for efficient processing in our language model.
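As a rough illustration of this filtering step, the sketch below streams answers with at least one upvote out of a Stack Exchange Posts.xml data dump. The attribute names (PostTypeId, Score, Body) follow the public dump schema; the actual pipeline in the paper also processes comments and applies further cleaning, so treat this as a simplified approximation.

```python
import html
import re
import xml.etree.ElementTree as ET

def iter_upvoted_answers(posts_xml_path):
    """Yield answer bodies with at least one upvote from a Posts.xml dump."""
    for _, row in ET.iterparse(posts_xml_path, events=("end",)):
        if row.tag != "row":
            continue
        # PostTypeId == "2" marks an answer; Score is the net vote count.
        if row.attrib.get("PostTypeId") == "2" and int(row.attrib.get("Score", 0)) >= 1:
            body = html.unescape(row.attrib.get("Body", ""))
            # Keep <code> markers (code snippets are an integral part of the content)
            # and strip the remaining HTML tags.
            body = re.sub(r"</?(?!code\b)[a-zA-Z][^>]*>", "", body)
            yield body
        row.clear()  # free memory while streaming the large dump
```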
Model Design and Training Process
In our design, we set the maximum input sequence length for our models to 2048 tokens, a significant increase from the conventional 512 tokens used in many models. This decision is based on the analysis showing that many Stack Overflow posts exceed the traditional length limit, allowing us to retain more useful context during training.
We train two models: SOBertBase, with 109 million parameters, and SOBertLarge, with 762 million parameters. Training uses the Megatron-LM toolkit for efficient parallel processing, with a batch size of roughly 0.5 million tokens. We train each model for 100,000 steps, so it sees tens of billions of tokens during pre-training.
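For reference, the key hyperparameters reported for the two models can be collected in one place. Optimizer settings and exact Megatron-LM options are omitted here because they are not part of this summary.

```python
# Hyperparameters stated in the paper; everything else (optimizer, learning
# rate schedule, parallelism layout) is omitted rather than guessed.
SOBERT_CONFIGS = {
    "SOBertBase": {
        "parameters": 109_000_000,
        "max_sequence_length": 2048,        # vs. the conventional 512 tokens
        "batch_size_tokens": 500_000,       # ~0.5M tokens per batch
        "training_steps": 100_000,
        "training_set_tokens": 27_000_000_000,
    },
    "SOBertLarge": {
        "parameters": 762_000_000,
        "max_sequence_length": 2048,
        "batch_size_tokens": 500_000,
        "training_steps": 100_000,
        "training_set_tokens": 27_000_000_000,
    },
}
```

Note that 100,000 steps at roughly 0.5 million tokens per batch works out to about 50 billion tokens seen, i.e. roughly two passes over the 27-billion-token training set.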
The total cost of this training is relatively low compared to many state-of-the-art models, making it accessible for smaller teams and researchers who may not have extensive resources at their disposal.
Evaluation of Model Performance
To evaluate how well our models perform, we fine-tune them on four different tasks that focus on understanding questions and answers from Stack Overflow. These tasks include:
- Question Quality Prediction: Determining the quality of questions posted on the platform.
- Closed Question Prediction: Predicting whether a question will be closed by moderators.
- Named Entity Recognition: Identifying important technical entities within the text.
- Obsoletion Detection: Identifying answers that may contain outdated information or code (a new task introduced in this work).
For each task, we analyze the effectiveness of our models compared to general-purpose models and previously established specialized models. We use metrics like accuracy, precision, and F1-score to provide a comprehensive understanding of performance.
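As a concrete illustration of this setup, here is a minimal sketch of fine-tuning a BERT-style encoder on a binary task resembling closed question prediction and scoring it with accuracy, precision, and F1. The checkpoint name and the toy examples are placeholders; substitute the released SOBert weights and a real labeled dataset.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score, precision_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "bert-base-uncased"  # placeholder; swap in the released SOBert checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# Toy examples standing in for a labeled Stack Overflow dataset
# (label 1 = question was closed, 0 = question stayed open).
data = Dataset.from_dict({
    "text": ["How do I reverse a list in Python?",
             "My code doesn't work, please fix it"],
    "label": [0, 1],
})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        padding="max_length", max_length=512),
                batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision_score(labels, preds, zero_division=0),
            "f1": f1_score(labels, preds, zero_division=0)}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sobert-finetune", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
    eval_dataset=data,  # evaluate on the toy data just to show the metric plumbing
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```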
Our results show that SOBert significantly outperforms all baselines across the four tasks, and the smaller SOBertBase model often performs remarkably well, confirming that compact models trained carefully on in-domain data can achieve strong results.
Conclusion
Through our study, we emphasize the importance of using specialized models trained on specific datasets. The results show that even models with fewer parameters can outperform larger general-purpose models when trained on high-quality, relevant data.
We release our models for public use, providing valuable tools for anyone seeking to improve their understanding of Stack Overflow content. This work illustrates that careful consideration of the training data and model design is critical to achieving successful outcomes in natural language processing and software engineering tasks.
Title: "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow
Abstract: Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are both expensive to train and deploy and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich aligned code and text data are available. We adopt standard practices for pre-training large language models, including using a very large context size (2,048 tokens), batch size (0.5M tokens) and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge with 762M parameters, at a budget of just $187 and $800 each. We compare the performance of our models with both the previous SOTA model trained on SO data exclusively as well as general-purpose BERT models and OpenAI's ChatGPT on four SO-specific downstream tasks - question quality prediction, closed question prediction, named entity recognition and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
Authors: Manisha Mukherjee, Vincent J. Hellendoorn
Last Update: 2024-01-24 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2306.03268
Source PDF: https://arxiv.org/pdf/2306.03268
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.