Simple Science

Enhancing Genomic Research Through Phylogenetic Augmentation

Scientists use homologous sequences to improve deep learning models in genomics.

[Image: Genomics Boosted by Phylogenetic Tools. New techniques enhance model predictions in genetic research.]

In genetics, understanding how genes behave in different situations is vital. Scientists are especially interested in how certain regions of DNA, called regulatory sequences, influence genes. These regulatory sequences help determine when a gene turns on or off and how much protein is made.

Deep learning, a type of artificial intelligence, helps scientists make predictions about these gene behaviors. By training computer models on vast amounts of data, researchers can analyze aspects of DNA that were previously hard to study.

The Role of Deep Learning in Genomics

Deep learning models have become very useful in predicting how DNA sequences will behave. They can forecast things like how accessible certain parts of the DNA are, where proteins called transcription factors will bind, and how enhancers operate. These predictions are evaluated using test sets, which are separate from the data used to teach the models. This separation ensures that the models are truly learning rather than just memorizing the training data.
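As a minimal sketch of what keeping a separate test set can look like, the Python snippet below holds out whole chromosomes. The article does not say how the split was made. Holding out chromosomes is one common convention in genomics, and the record layout (`chrom`, `seq`, `label`) and the function name `split_by_chromosome` are assumptions made for illustration.

```python
def split_by_chromosome(records, test_chroms):
    """Split records into (train, test) by the chromosome they come from."""
    test_chroms = set(test_chroms)
    train, test = [], []
    for rec in records:
        # A record goes to the test set only if its chromosome is held out.
        (test if rec["chrom"] in test_chroms else train).append(rec)
    return train, test


# Toy records (made-up sequences and labels, for illustration only).
records = [
    {"chrom": "chr1", "seq": "ACGTAC", "label": 1},
    {"chrom": "chr2", "seq": "TTGACA", "label": 0},
    {"chrom": "chr8", "seq": "GGCATC", "label": 1},
]
train, test = split_by_chromosome(records, test_chroms={"chr8"})
```

Because no sequence appears in both lists, a good score on `test` cannot come from memorizing `train`.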

Even more importantly, when these deep learning models spot biological patterns in the data, they can help deepen our knowledge of biological processes. Studies have shown that these models can identify both familiar and new patterns within DNA sequences, leading to valuable insights.

Challenges with Data Availability

However, building effective deep learning models requires a lot of data. For many organisms, especially less-studied ones, there simply isn’t enough information available. Most of the detailed data comes from well-known species like humans or mice. This presents a challenge: how can scientists create complex models when they have a limited amount of data?

Part of the concern is that natural DNA sequences may not contain enough variation to teach models everything they need to know. One proposed solution is to generate artificial data by testing random DNA sequences in the lab and evaluating these against real genomic sequences.

Data Augmentation Techniques

To boost the amount of training data, scientists often use a technique called data augmentation. This process involves making modified copies of existing data. For example, in image processing, researchers can flip, rotate, or change the color of images to create new versions without needing new images.

In genomics, there are fewer tailored augmentation methods available. Scientists frequently use techniques such as creating reverse complements of sequences or shifting sequences along the DNA strand. Recently, methods that mimic evolution, like introducing random changes to DNA sequences, have shown potential in improving model performance.
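The three augmentations named above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the study: the function names, the use of `N` as padding, and the substitution-only mutation model are all assumptions.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def shift(seq, offset, pad="N"):
    """Shift a sequence by `offset` bases, padding so the length stays fixed.

    A positive offset moves the content right, a negative one moves it left.
    """
    if offset == 0:
        return seq
    if offset > 0:
        return (pad * offset + seq)[:len(seq)]
    return (seq + pad * (-offset))[-len(seq):]


def random_mutations(seq, rate, rng):
    """Replace each A/C/G/T with a different base with probability `rate`."""
    out = []
    for base in seq:
        if base in "ACGT" and rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)  # seeded so the sketch is reproducible
mutated = random_mutations("ACGTACGT", rate=0.1, rng=rng)
```

In each case the label attached to the sequence is kept unchanged, which is the working assumption behind this kind of augmentation: the modified sequence is treated as if it behaved like the original.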

The Power of Homologous Sequences

Homologous sequences are DNA sequences that descend from a common ancestral sequence; the focus here is on homologs found in different species. They might look different but often serve similar biological roles. Because these sequences can provide valuable information about function and evolution, researchers are now considering them as a way to augment training datasets.

By incorporating homologous sequences from related species, scientists can enhance the diversity of training data, potentially leading to better model performance. The approach has been tested in several settings, including Drosophila enhancers, accessible DNA in human cell lines, and a much smaller yeast dataset.

How Phylogenetic Augmentation Works

Phylogenetic augmentation means swapping a training DNA sequence from one species for its homolog from another species. This technique uses multi-species genome alignments to enrich the training data. By including homologs as augmented versions of training sequences, the models are exposed to a broader range of sequences.

The application of this method involves three main steps. First, researchers use multi-species genome alignments to identify homologous sequences for each DNA sequence in their training set. Next, they apply phylogenetic augmentation to these sequences during the model training process. Lastly, after training, the models are fine-tuned on the original sequences to improve accuracy and lessen bias.
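The three steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (their code is in the GitHub repository listed under Original Source): `homologs` stands in for a lookup table built beforehand from a multi-species alignment, `train_step` is a hypothetical callback that would update a model, and including the original sequence in the sampling pool is a design choice made here for simplicity.

```python
import random


def train_with_phylo_augmentation(train_set, homologs, train_step,
                                  augment_epochs, finetune_epochs, rng):
    """Train with homolog swapping, then fine-tune on the original sequences.

    train_set: dict mapping region id -> (sequence, label)
    homologs:  dict mapping region id -> list of homologous sequences (step 1)
    """
    # Step 2: during training, each example may be replaced by a homolog.
    # The label always stays with the region it was measured for.
    for _ in range(augment_epochs):
        for seq_id, (seq, label) in train_set.items():
            pool = [seq] + homologs.get(seq_id, [])
            train_step(rng.choice(pool), label)

    # Step 3: fine-tune on the original, unaugmented sequences only.
    for _ in range(finetune_epochs):
        for seq, label in train_set.values():
            train_step(seq, label)


# Toy data (made-up sequences and labels, for illustration only).
train_set = {"region_1": ("ACGTACGT", 1.0), "region_2": ("TTGACCAA", 0.0)}
homologs = {"region_1": ["ACGTTCGT", "ACGAACGT"]}  # region_2 has no homologs

seen = []  # records what a real model would have been trained on
train_with_phylo_augmentation(
    train_set, homologs,
    train_step=lambda seq, label: seen.append((seq, label)),
    augment_epochs=3, finetune_epochs=1, rng=random.Random(0),
)
```

A region with no aligned homolog, like `region_2` here, simply falls back to its original sequence in every epoch.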

Benefits of Phylogenetic Augmentation

Early experiments using phylogenetic augmentation have shown promising results. For example, when training models to predict regulatory activity in the Drosophila genus, researchers found that models using phylogenetic augmentation performed better than those that did not. In one example, the model's performance increased significantly when homologs from closely related species were included.

Moreover, phylogenetic augmentation can help when working with smaller datasets. When the training set is down-sampled, or when there are too few regions of interest for effective machine learning, augmenting the training data with homologous sequences can rescue model performance.

Real-World Applications

Scientists applied the phylogenetic augmentation method to real-world genomic datasets to test its effectiveness further. One study analyzed data from the Drosophila S2 cell line, where researchers predicted enhancer activity. They extracted homologs from multiple Drosophila species and incorporated those into their training dataset.

Another analysis looked at binary DNase-seq peaks from various human cell lines. In this case, researchers used homologs from closely related mammalian species. Results showed a marked improvement in model predictions when using phylogenetic augmentation.

Furthermore, the method proved useful when training models on much smaller datasets, such as those examining RNA-binding proteins in yeast. Researchers found that applying phylogenetic augmentation significantly increased the model's ability to predict relevant biological features.

Exploring the Impact of Hyperparameters

To understand when phylogenetic augmentation works best, researchers varied several of its settings, known as hyperparameters. One critical setting was the number of species included in the augmentation process. They trained models using homologs from different sets of species and measured the change in predictive performance.

They also examined how the rate of augmentation applied during model training affected results. Initial findings indicated that applying augmentation at a moderate rate led to better outcomes than overusing it across every training sequence. This suggests that there is an optimal amount of augmentation needed to maximize performance without introducing too much variability.
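Both hyperparameters can be made concrete with a small Python sketch. The function name `maybe_augment`, the toy sequences, and the rate of 0.5 are assumptions for illustration only, since the article reports no specific values. The species names are real Drosophila species, used here purely as example keys.

```python
import random


def maybe_augment(original, homologs_by_species, species, rate, rng):
    """Swap `original` for a homolog with probability `rate`.

    Only homologs from the listed `species` are eligible, so the length of
    that list acts as the "number of species" hyperparameter.
    """
    available = [homologs_by_species[s] for s in species
                 if s in homologs_by_species]
    if available and rng.random() < rate:
        return rng.choice(available)
    return original


# Toy homologs for one training region (made-up sequences).
homologs_by_species = {
    "D. simulans": "ACGTTCGT",
    "D. yakuba": "ACGAACGT",
    "D. virilis": "ACCAACGA",
}

rng = random.Random(0)
draws = [
    maybe_augment("ACGTACGT", homologs_by_species,
                  species=["D. simulans", "D. yakuba"], rate=0.5, rng=rng)
    for _ in range(1000)
]
fraction_augmented = sum(d != "ACGTACGT" for d in draws) / len(draws)
```

With a rate of 0.5, roughly half of the draws are homologs and the rest are the original sequence. Setting the rate to 1.0 would augment every training sequence, which is the setting the findings above suggest is not optimal.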

Conclusion

Phylogenetic augmentation represents a powerful tool for advancing genomic research using deep learning. By utilizing homologous sequences from related species, researchers can overcome data limitations and create models with improved predictive capabilities.

As deep learning continues to play a critical role in understanding genetics, methods like phylogenetic augmentation have the potential to significantly enhance the efficiency and effectiveness of these models.

Because it makes more efficient use of existing experimental data, this method could help researchers glean vital biological insights even from small datasets, ultimately contributing to our understanding of complex genetic mechanisms.

With its broad applicability across various organisms and experimental conditions, phylogenetic augmentation holds promise for future advancements in genomics.

Original Source

Title: Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation

Abstract: Motivation: Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the sequence to regulatory function mapping (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models with suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training, however current data augmentation methods for genomic sequence data are limited. Results: Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves experimental data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep learning problems in genomics. Availability and implementation: The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.

Authors: Alan M Moses, A. G. Duncan, J. A. Mitchell

Last Update: 2024-01-17 00:00:00

Language: English

Source URL: https://www.biorxiv.org/content/10.1101/2023.09.15.558005

Source PDF: https://www.biorxiv.org/content/10.1101/2023.09.15.558005.full.pdf

Licence: https://creativecommons.org/licenses/by-nc/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to biorxiv for use of its open access interoperability.
