Introducing InternLM-Law: A Model for Legal Queries
InternLM-Law enhances responses to diverse Chinese legal questions with advanced training.
― 7 min read
Table of Contents
- Building a Dataset
- Importance of Large Language Models
- Model Performance
- Our Contributions
- Related Work in Legal AI
- Training Process of InternLM-Law
- Data Sources for Training
- Processing Legal Data
- Processing Legal NLP Task Data
- Legal Consultation Data Processing
- Processing Legal Regulations
- High-Quality Legal Data Processing
- Data Synthesis and Resampling
- Comparing Our Model
- Objective and Subjective Evaluation
- Long Context Evaluation
- Effectiveness of Training Strategies
- Conclusion
- Original Source
- Reference Links
Large language models have shown they can do many things, but they have trouble with legal questions because the law is complex and requires specialized knowledge. This article introduces InternLM-Law, a model created to handle a wide range of questions connected to Chinese law, from basic textbook exercises to complicated real-life legal issues.
Building a Dataset
To create this model, we put together a large dataset with over a million legal queries. We developed a system to filter and process this data to make sure it covers a wide range of topics and is of high quality. Our training used a new two-step method: first, we trained the model on both legal and general content to give it broad knowledge, then we fine-tuned it on high-quality legal data to help it produce better-structured responses.
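As a rough sketch of what this two-step split could look like in code, the snippet below builds a stage-one mixture of legal and general samples and a stage-two mixture restricted to high-quality legal samples. The sample records and the `quality` flag are assumptions for illustration, not details taken from the paper.

```python
import random

def build_stage_mixtures(legal, general, seed=0):
    """Return (stage-1 mix, stage-2 mix) from lists of sample dicts.

    Stage 1 mixes legal and general samples to keep broad capabilities;
    stage 2 keeps only legal samples marked as high quality. The "quality"
    field is an assumed annotation, not a detail from the paper.
    """
    stage1 = legal + general
    random.Random(seed).shuffle(stage1)
    stage2 = [ex for ex in legal if ex.get("quality") == "high"]
    return stage1, stage2

# Tiny illustrative samples; the real corpus has over a million legal queries.
legal_samples = [
    {"q": "What is the limitation period for a contract claim?", "quality": "high"},
    {"q": "Can I terminate a lease early?", "quality": "low"},
]
general_samples = [{"q": "Explain binary search."}]

stage1, stage2 = build_stage_mixtures(legal_samples, general_samples)
print(len(stage1), "stage-1 samples;", len(stage2), "stage-2 samples")
```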
InternLM-Law performed better than leading models, such as GPT-4, on many legal tasks. We plan to release InternLM-Law and our dataset to support future research on applying language models in the legal domain.
Importance of Large Language Models
Large language models are becoming an important area of study in Natural Language Processing, attracting attention because they can be applied across many different fields. Researchers are already using these models in areas like medicine, coding, and mathematics, where they can help with specific problems and respond in natural language. In law, earlier studies created models focused on specific tasks, but these often provided only limited legal advice and relied on older base models that were less effective.
There is still a strong need for a large model focused on the Chinese legal domain, which is what we aim to address with InternLM-Law.
Model Performance
Our model, called InternLM-Law-7B, received high scores across different legal tasks when evaluated, performing better than GPT-4 and other large general-purpose models. We created a comprehensive training dataset from various public legal datasets on the internet; it includes question-and-answer pairs as well as other kinds of legal information.
While building the model, we realized that legal data alone was not enough to make it effective on legal tasks. We added general data to help the model apply its broader skills to legal issues, and we used a two-step training method to help it learn important legal regulations and improve its response style.
Our Contributions
The main contributions of our work are:
- We built InternLM-Law, a large language model made for the Chinese legal field. It handles many kinds of legal tasks and sets a new state of the art on the LawBench benchmark.
- We invested a lot of time in building our dataset and training our model. The dataset has over 1 million samples, and we used effective processing techniques to ensure its quality.
- We used a two-step training pipeline, first training on both legal and general tasks, and then focusing on high-quality legal data.
Related Work in Legal AI
Legal Artificial Intelligence has been a topic in Natural Language Processing for a long time. Most previous studies focused on creating specialized tools for a single task, an approach that struggles to cope with the complexity of the legal system. More recently, researchers have started building large language models that can handle a variety of legal tasks.
A few existing models already target the legal domain specifically. For instance, SaulLM-7B is designed for legal text understanding, while models like Lawyer-LLaMA improve their consulting abilities through focused training on legal datasets. However, many of these models do not perform well across a variety of tasks, and that is the gap our approach with InternLM-Law aims to fill.
Training Process of InternLM-Law
We used InternLM2-Chat as the base for our model. The training had two stages. First, we trained on a mix of legal tasks and other general tasks; this phase gave the model a broader view of legal topics while preserving its general abilities. Next, we refined the model with focused legal training to strengthen its legal knowledge, response structure, and accuracy in answering questions.
The full training run took about 8 hours on powerful hardware, and we extended the maximum input length so the model can handle long legal texts. We carefully set the learning rate for each stage and trained both stages to completion.
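To make the two-stage schedule concrete, here is a minimal sketch of how the stages might be described as configuration. The hyperparameter values, file names, and the `run_sft` placeholder are illustrative assumptions; the paper's exact settings are not reproduced here.

```python
# Every value below is an illustrative assumption, not the paper's setting.
STAGES = [
    {
        "name": "stage1_broad",
        "data": ["legal_mix.jsonl", "general_mix.jsonl"],  # legal + general data
        "max_seq_len": 32768,   # longer inputs for lengthy legal texts (assumed)
        "learning_rate": 2e-5,  # assumed
        "epochs": 1,
    },
    {
        "name": "stage2_legal_refine",
        "data": ["high_quality_legal.jsonl"],  # curated legal data only
        "max_seq_len": 32768,
        "learning_rate": 1e-5,  # smaller step for the refinement stage (assumed)
        "epochs": 1,
    },
]

def run_sft(base_model: str, stage: dict) -> str:
    """Placeholder for a real supervised fine-tuning call."""
    print(f"fine-tuning {base_model} on {stage['data']} ({stage['name']})")
    return f"{base_model}+{stage['name']}"

model = "internlm2-chat-7b"
for stage in STAGES:
    model = run_sft(model, stage)
```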
Data Sources for Training
Our dataset had two parts: legal and general data. The legal data aimed to cover a wide range of legal knowledge, split into categories like legal education materials, consultation records, and updated legal regulations. We sourced our legal data from various competitions and public legal databases.
To gather legal consultation data, we collected millions of records from online sources. These records contained numerous real-world legal issues where individuals sought help from legal practitioners. To ensure privacy, we anonymized all sensitive information.
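The paper does not spell out its anonymization procedure, but as one plausible sketch, pattern-based scrubbing could mask obvious identifiers such as phone numbers, ID numbers, and email addresses in consultation records. The patterns below are assumptions, not the authors' actual pipeline.

```python
import re

# Assumed patterns for common identifiers; a real pipeline would also handle
# names, addresses, case numbers, and other personal details.
PATTERNS = [
    (re.compile(r"\d{17}[\dXx]"), "[ID_NUMBER]"),         # 18-digit ID card numbers
    (re.compile(r"1[3-9]\d{9}"), "[PHONE]"),              # 11-digit mobile numbers
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),  # email addresses
]

def anonymize(text: str) -> str:
    """Replace sensitive spans with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(anonymize("Please call 13812345678; ID number 110101199003074578."))
```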
The general data included a broad selection of topics such as everyday conversations, mathematical problems, and code generation, all processed to maintain quality and helpfulness.
Processing Legal Data
We developed a detailed plan to process our legal data, aiming to improve its quality. Since online legal consultations often included short and less detailed responses, we created a semi-automated method to expand and enhance these responses. We also observed that the data distribution was imbalanced, so we focused on crucial areas like laws and regulations to improve the legal dataset's quality.
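As a hedged sketch of what semi-automated expansion might look like, short consultation answers could be routed through an LLM prompt that asks for a fuller response and then queued for human review. The `call_llm` stub, the length threshold, and the prompt wording are hypothetical.

```python
MIN_CHARS = 80  # assumed threshold below which an answer counts as "too short"

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever LLM is used for expansion."""
    return "[expanded draft answer awaiting human review]"  # stub for illustration

def expand_short_answers(records):
    """Expand terse consultation answers and flag them for human review."""
    for rec in records:
        if len(rec["answer"]) < MIN_CHARS:
            prompt = (
                "Expand this short legal consultation answer into a complete, "
                "well-structured response that cites the relevant statutes.\n"
                f"Question: {rec['question']}\nOriginal answer: {rec['answer']}"
            )
            rec["answer"] = call_llm(prompt)
            rec["needs_review"] = True  # semi-automated: a human checks the output
    return records

demo = expand_short_answers(
    [{"question": "Can my employer withhold my final wages?", "answer": "No."}]
)
print(demo[0]["needs_review"])  # -> True
```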
Processing Legal NLP Task Data
When dealing with legal tasks, we categorized them into different types using existing legal benchmarks. We then generated diverse and relevant instructions for each task to create a well-structured legal dataset.
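To illustrate what instruction generation for different legal task types might involve, the sketch below maps a few assumed task categories to alternative instruction templates; the categories and wording are illustrative, since the paper describes this step only at a high level.

```python
import random

# Assumed task categories and instruction templates; varying the phrasing for
# the same task helps the model generalize across instruction styles.
TEMPLATES = {
    "charge_prediction": [
        "Read the case facts and predict the charge: {facts}",
        "Based on the following facts, which offense applies? {facts}",
    ],
    "judgment_summarization": [
        "Summarize the key points of this judgment: {document}",
        "Write a concise summary of the following court decision: {document}",
    ],
    "element_extraction": [
        "Extract the legal elements mentioned in this passage: {document}",
    ],
}

def build_instruction(task: str, **fields) -> str:
    """Pick a random template for the task and fill in its fields."""
    return random.choice(TEMPLATES[task]).format(**fields)

print(build_instruction("charge_prediction", facts="The defendant took ..."))
```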
Legal Consultation Data Processing
Our legal consultation dataset included various legal scenarios. We recognized that many of these contained unnecessary information that could harm data quality. To ensure reliability, we employed filtering methods to refine the dataset, discarding overly brief or unclear responses and maintaining quality throughout.
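A minimal sketch of this kind of filtering, assuming simple length and clarity heuristics (the threshold and the "unclear" markers are illustrative, not the authors' exact rules):

```python
MIN_ANSWER_CHARS = 50  # assumed: anything shorter counts as "overly brief"
UNCLEAR_MARKERS = ("see above", "as mentioned before")  # assumed "unclear" cues

def keep(record: dict) -> bool:
    """Return True if a consultation record passes the quality filter."""
    answer = record["answer"].strip()
    if len(answer) < MIN_ANSWER_CHARS:
        return False
    return not any(marker in answer.lower() for marker in UNCLEAR_MARKERS)

records = [
    {"question": "Is a verbal contract binding?", "answer": "Yes."},
    {"question": "Is a verbal contract binding?",
     "answer": "Generally yes, as long as both parties agreed on the essential "
               "terms, although proving those terms without a writing is harder."},
]
print([keep(r) for r in records])  # -> [False, True]
```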
Processing Legal Regulations
For legal regulations, we turned pure text data into question-and-answer pairs for training. By transforming titles of laws or regulations into questions, we helped the model retain relevant legal knowledge effectively.
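A small sketch of how a regulation's title and articles could be turned into question-and-answer pairs, assuming a simple title-as-question transformation; the template wording and example data are illustrative:

```python
def regulation_to_qa(title: str, articles: list[str]) -> list[dict]:
    """Turn a regulation's title and article texts into question-answer pairs."""
    pairs = []
    for i, text in enumerate(articles, start=1):
        question = f"What does Article {i} of the {title} provide?"
        pairs.append({"question": question, "answer": text})
    return pairs

pairs = regulation_to_qa(
    "Civil Code of the People's Republic of China",
    ["(text of the first article) ...", "(text of the second article) ..."],
)
print(pairs[0]["question"])
```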
High-Quality Legal Data Processing
To make our model's legal knowledge more precise, we used GPT-4 to semi-automate the generation of high-quality Q&A datasets. We manually checked and adjusted the generated content for accuracy.
Data Synthesis and Resampling
Since responses written by humans can differ in style and detail, we created additional data using GPT-4, refining it with human feedback. We sampled critical legal content, focusing on frequently occurring legal issues to bring clarity and improve accuracy in the model's responses.
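As a hedged sketch of resampling toward frequently occurring legal issues, one could count topic labels and draw samples in proportion to their frequency. The topic labels and weighting scheme below are assumptions:

```python
import random
from collections import Counter

def resample_by_topic_frequency(records, k, seed=0):
    """Draw k samples, weighting each record by how common its topic is."""
    counts = Counter(r["topic"] for r in records)
    weights = [counts[r["topic"]] for r in records]
    return random.Random(seed).choices(records, weights=weights, k=k)

records = [
    {"topic": "labor dispute", "question": "Unpaid overtime ..."},
    {"topic": "labor dispute", "question": "Wrongful termination ..."},
    {"topic": "labor dispute", "question": "Severance pay ..."},
    {"topic": "maritime law", "question": "Cargo damage claim ..."},
]
sampled = resample_by_topic_frequency(records, k=3)
print(Counter(r["topic"] for r in sampled))
```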
Comparing Our Model
We compared InternLM-Law with other leading models, both general and legal-specific, including the high-performing GPT-4. The evaluation showed that our model surpassed the others, particularly in legal tasks on the LawBench benchmark, which tests the model's memorization, comprehension, and application of legal knowledge.
Objective and Subjective Evaluation
Alongside the benchmark evaluation, we assessed how our model performed on subjective legal questions, reflecting real-world legal consultations. Our model achieved an impressive win rate against GPT-4 in legal consultation tasks.
Long Context Evaluation
Handling long legal documents is often necessary. We tested our model's ability to understand and answer questions based on lengthy legal judgments. Other models struggled with this type of task, while InternLM-Law effectively processed long texts and successfully answered related questions.
Effectiveness of Training Strategies
We explored how using general datasets during training impacted both legal and general tasks. Our findings indicated that including general data not only preserved the model's general capabilities but also enhanced its legal skills.
Conclusion
InternLM-Law is a significant advancement in the Chinese legal domain, outperforming existing models while providing a robust framework for future legal AI applications. Despite its success, the model still faces challenges, such as occasional inaccuracies, emphasizing the need for further improvement in handling complex legal reasoning tasks.
Original Source
Title: InternLM-Law: An Open Source Chinese Legal Large Language Model
Abstract: While large language models (LLMs) have showcased impressive capabilities, they struggle with addressing legal queries due to the intricate complexities and specialized expertise required in the legal field. In this paper, we introduce InternLM-Law, a specialized LLM tailored for addressing diverse legal queries related to Chinese laws, spanning from responding to standard legal questions (e.g., legal exercises in textbooks) to analyzing complex real-world legal situations. We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries, and implement a data filtering and processing pipeline to ensure its diversity and quality. Our training approach involves a novel two-stage process: initially fine-tuning LLMs on both legal-specific and general-purpose content to equip the models with broad knowledge, followed by exclusive fine-tuning on high-quality legal data to enhance structured output generation. InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks. We make InternLM-Law and our dataset publicly available to facilitate future research in applying LLMs within the legal domain.
Authors: Zhiwei Fei, Songyang Zhang, Xiaoyu Shen, Dawei Zhu, Xiao Wang, Maosong Cao, Fengzhe Zhou, Yining Li, Wenwei Zhang, Dahua Lin, Kai Chen, Jidong Ge
Last Update: 2024-06-21 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2406.14887
Source PDF: https://arxiv.org/pdf/2406.14887
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://github.com/InternLM/InternLM-Law
- https://huggingface.co/lyogavin/Anima33B
- https://cail.cipsc.org.cn/
- https://flk.npc.gov.cn/
- https://huggingface.co/Qwen/Qwen1.5-72B
- https://qwenlm.github.io/blog/qwen1.5/
- https://jecqa.thunlp.org/
- https://cail.cipsc.org.cn/task_summit.html?raceID=2
- https://laic.cjbdi.com/
- https://aistudio.baidu.com/datasetdetail/181754
- https://github.com/liuhuanyong/CrimeKgAssitant
- https://cail.cipsc.org.cn/task_summit.html?raceID=1
- https://github.com/china-ai-law-challenge/CAIL2021/tree/main/xxcq
- https://cail.cipsc.org.cn/task_summit.html?raceID=4
- https://cail.cipsc.org.cn/task_summit.html?raceID=5
- https://github.com/thunlp/LEVEN
- https://github.com/china-ai-law-challenge/cail2018
- https://github.com/LiuHC0428/LAW-GPT
- https://www.66law.cn/
- https://aclanthology.org/2020.emnlp-main.56.pdf
- https://github.com/thulawtech/leec