Simple Science

Cutting edge science explained simply

Language Models in Biology: Current Insights

Researchers analyze advanced models to predict biological outcomes using gene data.

[Figure: Model predictions in biology. Current models struggle with predicting genetic outcomes accurately.]

Recently, researchers have been exploring how advanced computer models, known as language models, can help in the field of biology. These models are powerful tools that can analyze vast amounts of biological data. The aim is to teach them about living systems: how genes interact, how cells function, and more. By doing so, scientists hope the models can predict the results of experiments that have not yet been conducted, much as they already generate meaningful text or images.

Data Availability

Many large datasets are now available to train these models. For example, the Human Cell Atlas project has compiled data on many different kinds of human cells. Another resource, CELLxGENE, offers millions of gene expression profiles from various organisms, including information from healthy and diseased states. These datasets are essential for training models to understand complex biological systems.

Recent Advances in Models

Some of the latest models are called scGPT and scFoundation. These models have been trained using data from millions of single cells. They function based on deep learning techniques, especially a method known as the transformer architecture. These models are designed to perform various tasks, which include identifying cell types, inferring gene interactions, and predicting the effects of genetic changes.

Both models provide pre-trained versions, which allow researchers to adjust them for specific tasks using additional datasets. For instance, scFoundation has modified an existing tool called GEARS to predict how genetic changes affect cells, using advanced techniques including graph neural networks.

Evaluating Model Performance

To understand how well these models work, researchers tested their ability to predict changes in gene expression following genetic alterations. For this, they used a dataset in which certain genes were activated in specific cell types, and examined how gene expression changed in response to single and double genetic changes.

Different approaches were compared to see which offered the most reliable predictions. One baseline simply predicted no change at all, while another assumed that the effects of two genetic changes could simply be added together. Surprisingly, this additive baseline produced more accurate predictions than the new deep learning models.
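
The two simple baselines can be sketched in a few lines. This is a minimal illustration, not the paper's exact code; `lfc_a` and `lfc_b` stand for hypothetical arrays of per-gene log fold changes measured after each single perturbation alone.

```python
import numpy as np

def predict_additive(lfc_a, lfc_b):
    """Additive baseline: predict the expression change of the double
    perturbation as the sum of the two single-perturbation changes."""
    return lfc_a + lfc_b

def predict_no_change(n_genes):
    """No-change baseline: predict zero change for every gene."""
    return np.zeros(n_genes)

# Hypothetical log fold changes for two single perturbations
lfc_a = np.array([1.2, -0.5, 0.0])
lfc_b = np.array([0.3, -0.1, 0.8])
double = predict_additive(lfc_a, lfc_b)  # [1.5, -0.6, 0.8]
```

Despite its simplicity, this kind of additive prediction was the benchmark the deep learning models failed to beat.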

Prediction Challenges

RNA sequencing data, which measures gene expression, can be noisy. This noise can affect predictions, especially for genes expressed at low levels. Researchers found that the accuracy of all models decreased when low-expressed genes were included in the predictions. However, the ranking of the models remained consistent, indicating that the comparison was robust.

Researchers are particularly interested in cases where double genetic changes lead to unexpected results. They assessed whether the new deep learning models could identify these cases better than simpler methods, defining them by how much the observed expression change deviated from what the additive model predicted.
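
That deviation can be written as a simple per-gene score; a sketch, with illustrative variable names (the paper's exact threshold and scoring may differ):

```python
import numpy as np

def interaction_score(observed_ab, lfc_a, lfc_b):
    """Per-gene deviation of the observed double-perturbation effect
    from the additive expectation; values far from zero indicate a
    genetic interaction the additive model cannot explain."""
    return observed_ab - (lfc_a + lfc_b)

def flag_interactions(observed_ab, lfc_a, lfc_b, threshold=1.0):
    """Boolean mask of genes whose deviation exceeds the threshold."""
    return np.abs(interaction_score(observed_ab, lfc_a, lfc_b)) > threshold
```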

After analyzing the results, they found many genetic interactions that the simple additive prediction did not account for. Yet even at identifying these interactions, the simpler models outperformed the complex deep learning models.

Exploring Single Genetic Changes

Another important feature of the new models is their ability to predict the effects of previously unseen genetic changes. The hope is that these models have learned enough about the relationships between genes during training that they can apply this knowledge to new scenarios.

To test this, researchers used existing datasets and compared predictions made by the new models against a straightforward linear model, which used basic statistical techniques to relate gene expression profiles to one another. Despite the advanced techniques in the deep learning models, they did not produce better predictions than the linear model when dealing with unseen genetic changes.
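
A linear baseline of this kind can be sketched with closed-form ridge regression: represent each perturbed gene by a feature vector and learn a linear map to its expression-change profile, then apply that map to the features of an unseen gene. This is an illustration of the idea under assumed data shapes, not the paper's exact model.

```python
import numpy as np

def fit_linear_baseline(G, Y, ridge=1e-3):
    """Closed-form ridge regression: find W minimizing
    ||G @ W - Y||^2 + ridge * ||W||^2.

    G: (n_train, d) feature vectors for the perturbed genes.
    Y: (n_train, n_genes) observed expression changes.
    Returns W with shape (d, n_genes).
    """
    d = G.shape[1]
    return np.linalg.solve(G.T @ G + ridge * np.eye(d), G.T @ Y)

def predict_unseen(W, g_new):
    """Predict the expression-change profile of a perturbation whose
    target gene was never seen in training, from its features alone."""
    return g_new @ W
```

The interesting question is where the features in `G` come from; biological similarity, network context, or a model's learned embeddings can all be plugged in without changing the fitting code.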

Using Pre-Trained Models

Thinking creatively, researchers explored whether they could enhance predictions by training on one dataset and applying the model to another. Using data from one experiment consistently improved predictions on a different dataset, suggesting that the embeddings learned from the data might hold meaningful biological information.

Furthermore, they experimented with using embeddings produced by scGPT and scFoundation to see whether those led to better predictions. This approach showed some promise, though it did not consistently surpass the basic linear model.
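
Comparisons like these are typically scored by correlating predicted and observed expression changes across genes. A minimal sketch of such a metric; the paper may use a different or additional score:

```python
import numpy as np

def pearson_delta(pred, obs):
    """Pearson correlation between predicted and observed per-gene
    expression changes; 1.0 is a perfect prediction, 0.0 is no better
    than noise."""
    p = pred - pred.mean()
    o = obs - obs.mean()
    return float(p @ o / (np.linalg.norm(p) * np.linalg.norm(o)))
```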

Conclusion on Current Findings

The findings suggest a couple of critical points. First, current deep learning models have not yet proven to be superior to simpler models in predicting experimental outcomes. This indicates that there is still progress to be made before these advanced models can reliably predict results in biology.

The models were not able to leverage their complex structures to provide better insights compared to the simpler methods. Critics argue that this might not mean these models are ineffective, but rather that the specific tasks they were tested on might not showcase their full capabilities.

Overall, this research highlights the importance of developing reliable benchmarks in the field. Such benchmarks can help refine models and direct future efforts in applying machine learning in biological research. It serves as a reminder that while advanced models have potential, understanding their practical applications and limitations is crucial for translating computational advances into real-world biological insights.

Original Source

Title: Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear methods

Abstract: Advanced deep-learning methods, such as transformer-based foundation models, promise to learn representations of biology that can be employed to predict in silico the outcome of unseen experiments, such as the effect of genetic perturbations on the transcriptomes of human cells. To see whether current models already reach this goal, we benchmarked two state-of-the-art foundation models and one popular graph-based deep learning framework against deliberately simplistic linear models in two important use cases: For combinatorial perturbations of two genes for which only data for the individual single perturbations have been seen, we find that a simple additive model outperformed the deep learning-based approaches. Also, for perturbations of genes that have not yet been seen, but which may be "interpolated" from biological similarity or network context, a simple linear model performed as good as the deep learning-based approaches. While the promise of deep neural networks for the representation of biological systems and prediction of experimental outcomes is plausible, our work highlights the need for critical benchmarking to direct research efforts that aim to bring transfer learning to biology.

Authors: Constantin Ahlmann-Eltze, W. Huber, S. Anders

Last Update: Oct 28, 2024

Language: English

Source URL: https://www.biorxiv.org/content/10.1101/2024.09.16.613342

Source PDF: https://www.biorxiv.org/content/10.1101/2024.09.16.613342.full.pdf

Licence: https://creativecommons.org/licenses/by-nc/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to biorxiv for use of its open access interoperability.
