Simple Science

Cutting-edge science explained simply

Biology · Genetics

Advances in Genetic Variants and AI Support

AI models enhance understanding of genetic variants for healthcare.

Shuangjia Lu, Erdal Cosgun

― 8 min read


Genetic Insights Powered by AI: models enhance data processing for better healthcare outcomes.

Genetics can sound complicated, right? Well, let’s break it down a bit. When scientists look at our genes, they often examine tiny changes called genetic variants. These variants can tell us a lot about what might happen to our health. So, they need to catalog this information in a way that everyone can understand and use. This is where Variant Annotations come in.

Variant annotations are like the footnotes in a book. They provide important details about genetic variants, such as where they are located and what they might mean for our health. Think of it as a map guiding us through the twists and turns of our genetic makeup. These annotations are gathered from different databases, like ClinVar and gnomAD, which collect information from numerous studies and clinical reports. It’s like gathering all the pieces of a jigsaw puzzle to help us see the full picture.

Researchers and doctors have a bit of a challenge. They need to sort through millions of these genetic variants to figure out which ones are significant for patients. It’s kind of like searching for a needle in a haystack – if the haystack were made of genetic data! They look at past records of genes and diseases, how common a variant is in the population, and its predicted effects on health. This can take a lot of time and effort.

Large Language Models: Our New Helpers

Now, enter our superheroes: large language models (LLMs). These are advanced computer programs that seem to do it all! They’ve shown amazing skills in various tasks across many fields. In our world of genetics, LLMs like GPT-4 and Llama are stepping in to lend a hand. Previous studies have shown that LLMs have potential in genetics for things like predicting disease risk and identifying important genes.

But here’s the catch: Current LLMs don’t know much about genetics. It’s like having a top chef who can’t tell the difference between a tomato and a potato. To truly aid in genetic research, we need to equip these LLMs with variant annotation knowledge. By doing this, they can help process information faster and provide interpretations that are accurate and relevant. Imagine not having to sift through countless databases manually! This could save researchers a lot of time and resources.

How to Integrate Knowledge into LLMs

So, how do we give our LLMs some “genetic smarts”? There are two primary methods: retrieval-augmented generation (RAG) and fine-tuning. Let’s see what these fancy names mean!

Fine-tuning is like giving the LLM a crash course in genetics. It involves training the model on a specific set of data related to genetics, so it can adjust its knowledge based on that information. It’s like sending a student to a specialized class to learn about a specific topic.

On the other hand, RAG adds a layer of knowledge without changing the LLM itself. Instead of changing the base model, it helps the model find and use external information to generate responses. It’s like having a helpful encyclopedia nearby when you’re answering questions. When a user asks something, the model performs a search, retrieves relevant information, and combines it to provide a more informed answer.
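
To make that retrieve-then-answer loop a little more concrete, here is a minimal Python sketch. The toy annotation store, field names, and variant ID are assumptions made for illustration, not the paper’s actual pipeline; the real system searched a much larger index and then handed the retrieved text to the LLM.

```python
# A toy retrieval-augmented generation (RAG) loop: look up a variant's stored
# annotation, then fold it into the prompt that would be sent to the LLM.
# The annotation store, field names, and variant ID below are illustrative only.

TOY_ANNOTATIONS = {
    "7-117559590-ATCT-A": {
        "gene": "CFTR",
        "clinical_significance": "Pathogenic",
        "condition": "Cystic fibrosis",
    },
}


def retrieve(variant_id: str) -> dict:
    """Fetch the stored annotation for a variant (empty dict if unknown)."""
    return TOY_ANNOTATIONS.get(variant_id, {})


def build_prompt(question: str, variant_id: str) -> str:
    """Combine the user's question with whatever annotation was retrieved."""
    annotation = retrieve(variant_id)
    context = "\n".join(f"{k}: {v}" for k, v in annotation.items()) or "No annotation found."
    return (
        "Use the variant annotation below to answer the question.\n\n"
        f"Variant: {variant_id}\n{context}\n\nQuestion: {question}"
    )


print(build_prompt("What condition is this variant associated with?",
                   "7-117559590-ATCT-A"))
# In the real system, this augmented prompt is what the LLM answers from.
```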

In our endeavor, we decided to take both approaches. We integrated 190 million variant annotations into our LLMs with RAG and fine-tuned them on a smaller set of annotation data. This brought a noticeable increase in the model's ability to provide accurate annotations and interpretations.

Collecting the Data

Let’s talk about the treasure trove of data we used. We gathered variant annotations from four major databases: ClinVar, gnomAD, GWAS Catalog, and PharmGKB. Each of these databases contains a wealth of information about genetic variants and their relationships to health. It’s like collecting all the recipe books to create the ultimate cookbook!

ClinVar, for example, contains over 2.8 million variants that have been classified for clinical significance. Meanwhile, gnomAD records information from hundreds of thousands of individuals, giving insight into how common certain variants are. By combining data from these sources, we created a more comprehensive and useful set of annotations for our LLMs to work with.

Preparing Data for Fine-tuning

Fine-tuning the LLM required some preparation. We had to format our data in a specific way that the model could understand. Think of it as organizing your closet – everything needs to be in the right place for it to work! We randomly selected a training set of 3,000 variants from ClinVar and prepared them using a specific format called JSON Lines.

We took the important details around each variant, like its chromosome location and what it might mean for health. This information was carefully extracted and organized so that the model could learn from it effectively. We wanted to ensure that when we asked the model questions, it could give us answers that made sense.
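
As a rough illustration of that preparation step, the sketch below writes annotated variants into the chat-style JSON Lines format commonly used for fine-tuning GPT models. The field names, example record, and question wording are assumptions for illustration, not the exact schema used in the study.

```python
import json

# Illustrative variant records; the actual training set held 3,000 ClinVar variants.
variants = [
    {
        "variant_id": "chr7:117559590 ATCT>A",
        "gene": "CFTR",
        "clinical_significance": "Pathogenic",
        "condition": "Cystic fibrosis",
    },
]


def to_training_example(v: dict) -> dict:
    """Turn one annotated variant into a chat-format fine-tuning example."""
    return {
        "messages": [
            {"role": "system",
             "content": "You are an assistant that annotates genetic variants."},
            {"role": "user",
             "content": f"What gene and condition are associated with variant {v['variant_id']}?"},
            {"role": "assistant",
             "content": (f"Gene: {v['gene']}. Clinical significance: "
                         f"{v['clinical_significance']}. Condition: {v['condition']}.")},
        ]
    }


# JSON Lines: one JSON object per line, which the fine-tuning service reads directly.
with open("variant_finetune.jsonl", "w") as f:
    for v in variants:
        f.write(json.dumps(to_training_example(v)) + "\n")
```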

Building a RAG System

While fine-tuning was good, we also built a RAG system to complement it. We created a search index so that when the model didn’t have a direct answer, it could look up relevant information quickly. This is sort of like how we use Google to find answers. The search index was designed to help the model retrieve data from our vast collection of variant annotations.

To do this, we formatted the data in CSV files, which are easy for computers to read. This index allowed the model to search through the variant information by different categories, like gene or condition. When a user asks a question, the model can quickly find the right data and provide accurate answers.
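
A stripped-down version of that kind of keyword lookup might look like the sketch below. The CSV column names and the in-memory dictionary index are assumptions for illustration; the actual system used a dedicated search index over 190 million annotations.

```python
import csv
from collections import defaultdict


def build_index(csv_path: str, keys=("gene", "condition")):
    """Index rows of a variant-annotation CSV by one or more keyword columns."""
    index = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            for key in keys:
                value = (row.get(key) or "").strip().lower()
                if value:
                    index[(key, value)].append(row)
    return index


def search(index, key: str, term: str):
    """Return every annotation row matching a keyword, e.g. a gene symbol."""
    return index.get((key, term.strip().lower()), [])


# Example usage, assuming an annotations.csv with 'gene' and 'condition' columns:
# index = build_index("annotations.csv")
# for row in search(index, "gene", "CFTR"):
#     print(row)
```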

Evaluating the Models

After putting all this work into training our LLMs, it was time for evaluation. We wanted to see how well these models could predict the information we wanted, such as the gene associated with a variant. We randomly sampled some variants from our datasets to see how accurately the models could respond.

Initially, the base models showed less than 2% accuracy in predicting genes. Sounds disheartening, doesn’t it? But then we decided to test them using variants from the top 10 well-known genes. The models did a bit better, with GPT-4o achieving a 68% accuracy rate. Not quite perfect, but definitely an improvement!
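
Scoring works by checking, variant by variant, whether the gene the model names matches the gene in the annotation, then taking the fraction that agree. Here is a minimal sketch of that calculation with made-up example values:

```python
def gene_accuracy(predictions, references):
    """Fraction of variants whose predicted gene matches the annotated gene."""
    matches = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)


# Made-up example: three model predictions scored against their annotations.
preds = ["CFTR", "BRCA1", "unknown"]
refs = ["CFTR", "BRCA2", "TP53"]
print(f"Gene accuracy: {gene_accuracy(preds, refs):.0%}")  # -> 33%
```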

Fine-tuning for Better Performance

To enhance the model's performance further, we fine-tuned it using our prepared prompts. We used the prompts to guide the model’s responses and improve its accuracy. We also discovered that focusing on predicting individual fields led to much better results.

For example, when we concentrated on predicting just the gene name, the accuracy soared to a wonderful 95%. However, predicting the condition proved more challenging, with accuracy dropping because many variants in our data simply list the condition as “not provided.” It’s like asking a contestant on a game show the wrong question; sometimes, they can only say “I don’t know.”

RAG vs. Fine-tuning: A Showdown

After testing both methods, we found something interesting. RAG outperformed fine-tuning in several areas, including accuracy and efficiency. With RAG, we integrated a whopping 190 million variant annotations, while fine-tuning could only take in a small fraction of that.

The cost of using RAG was primarily in creating and storing the search index. Fine-tuning was a bit pricier in terms of training processes and the number of tokens needed. If we expanded fine-tuning to handle 190 million annotations, the costs would skyrocket!

In terms of flexibility, RAG is a champion. It can easily be adapted to any model, while fine-tuning ties the knowledge to a specific model. So, RAG is like the cool kid who gets invited to every party, while fine-tuning is that friend who only works well with one group.

Use Cases of RAG-Enhanced Model

The potential for our RAG-enhanced model goes beyond just providing accurate data. For instance, imagine a doctor trying to diagnose a patient based on their symptoms and variant information. Our model could play a crucial role by identifying the disease and the responsible variants efficiently.

In a scenario where we provided symptoms of cystic fibrosis along with a variant list, the model accurately identified the related disease and causal variant. It cut down the effort required by healthcare professionals, making the process smoother and more accessible. It’s like having an expert assistant on hand to sift through all the details!
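
As an illustration of what such a query might look like (the symptoms and variant IDs here are invented for the example, and the paper does not prescribe this exact prompt format), a diagnostic-style request to the RAG-enhanced model could be assembled roughly like this:

```python
# Illustrative only: combine patient symptoms with candidate variants so the
# RAG-enhanced model can retrieve annotations and suggest the likely diagnosis.
symptoms = ["chronic cough", "recurrent lung infections", "salty-tasting skin"]
candidate_variants = ["7-117559590-ATCT-A", "1-55051215-G-GA", "13-32340301-T-C"]

prompt = (
    "A patient presents with the following symptoms: " + ", ".join(symptoms) + ".\n"
    "Variants identified by sequencing: " + ", ".join(candidate_variants) + ".\n"
    "Using the retrieved variant annotations, which disease best explains the "
    "symptoms, and which variant is most likely causal?"
)
print(prompt)
```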

Conclusion: A Bright Future in Genomics

We’ve made significant strides in improving our model’s ability to analyze genetic data. By integrating 190 million variant annotations, our model can provide accurate and informative responses. Researchers and healthcare providers can now access detailed annotations about specific variants in a conversational way.

However, it’s essential to note there are still some limitations. For instance, the model doesn’t fully grasp other genetic concepts, like upper and lower allele frequencies. The RAG search method is also based on keywords, which could limit the range of questions it can handle.

By exploring new methods like vector search, we could enhance the model even further. As we continue to push the boundaries of genetic understanding through AI, the future looks promising. Our work is a stepping stone toward developing better and more comprehensive tools for supporting disease diagnosis and facilitating research discoveries in genomics.

So, as we continue this fascinating journey through genetics, let’s keep having fun decoding the mysteries of our DNA, one variant at a time!

Original Source

Title: Boosting GPT Models for Genomics Analysis: Generating Trusted Genetic Variant Annotations and Interpretations through RAG and fine-tuning

Abstract: Large language models (LLMs) have acquired a remarkable level of knowledge through their initial training. However, they lack expertise in particular domains such as genomics. Variant annotation data, an important component of genomics, is crucial for interpreting and prioritizing disease-related variants among millions of variants identified by genetic sequencing. In our project, we aimed to improve LLM performance in genomics by adding variant annotation data to LLMs by retrieval-augmented generation (RAG) and fine-tuning techniques. Using RAG, we successfully integrated 190 million highly accurate variant annotations, curated from 5 major annotation datasets and tools, into GPT-4o. This integration empowers users to query specific variants and receive accurate variant annotations and interpretations supported by advanced reasoning and language understanding capabilities of LLMs. Additionally, fine-tuning GPT-4 on variant annotation data also improved model performance in some annotation fields, although the accuracy across more fields remains suboptimal. Our model significantly improved the accessibility and efficiency of the variant interpretation process by leveraging LLM capabilities. Our project also revealed that RAG outperforms fine-tuning in factual knowledge injection in terms of data volume, accuracy, and cost-effectiveness. As a pioneering study for adding genomics knowledge to LLMs, our work paves the way for developing more comprehensive and informative genomics AI systems to support clinical diagnosis and research projects, and it demonstrates the potential of LLMs in specialized domains.

Authors: Shuangjia Lu, Erdal Cosgun

Last Update: 2024-11-15

Language: English

Source URL: https://www.biorxiv.org/content/10.1101/2024.11.12.623275

Source PDF: https://www.biorxiv.org/content/10.1101/2024.11.12.623275.full.pdf

Licence: https://creativecommons.org/licenses/by-nc/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to bioRxiv for use of its open access interoperability.
