The Human Genome: Uncovering Coding Genes
A deep dive into the search and classification of human coding genes.
Miguel Maquedano, Daniel Cerdán-Vélez, Michael L. Tress
Table of Contents
- The Search for Coding Genes
- The Role of Research Teams
- The Numbers Game
- The Challenge of Misclassification
- Merging the Lists
- Coding Status: The Verdict
- What Are Potential Non-Coding Features?
- Why Do Some Genes Slip Through the Cracks?
- The Mystery of Read-Through Genes
- The Push for Consensus
- The Changing Landscape of Coding Genes
- Conclusion: The Future of Gene Annotation
- Original Source
The human genome is like a giant instruction manual that gives our cells the information they need to produce proteins, which are the building blocks of life. When scientists first sequenced the human genome, they estimated that we had between 25,000 and 40,000 protein-coding genes. However, as research advanced, that number was revised down to between 19,000 and 22,000. So, what happened to the extra genes? Were they just a figment of scientists’ imagination?
Over the years, many research teams have worked tirelessly to analyze our genome and identify the true coding genes. Their findings have led to a better understanding of which genes are real and which ones might be impostors. Like a game of "Guess Who," researchers have tried to distinguish genes that actually produce proteins from those that merely pretend.
The Search for Coding Genes
Identifying coding genes is not just about finding a name on a list. Scientists use various sources of evidence to determine whether a gene can actually produce a protein. They look at things like experimental data and how well a gene is conserved across different species. If a gene is conserved, it means it likely serves a fundamental purpose and is, therefore, more likely to be a coding gene.
New coding genes are added to the list whenever there is enough evidence to suggest that they are real. However, some genes may change status as more data become available. In a way, it's like watching a soap opera where characters frequently switch sides, leading to all sorts of dramatic twists!
The Role of Research Teams
Three main research groups have taken charge of analyzing the coding genes in our genome: Ensembl/GENCODE, RefSeq, and UniProtKB. Each group has its own take on what constitutes a coding gene. They use genomic coordinates and protein data to compile their lists. However, the differing criteria have resulted in discrepancies, much like different interpretations of the same movie script.
For example, the pseudogene WASH6P has been a character in this drama, changing its status several times based on new evidence. It’s the ultimate diva of the gene world—always in the spotlight but never quite fitting the mold of a coding gene.
The Numbers Game
In the past, estimates for the total number of coding genes were quite high. But as researchers dived deeper into the data, the numbers began to drop. More rigorous analysis revealed that the actual count may be closer to 20,000. It’s like when you go to a buffet, pile your plate high, and realize you can only eat half. The gene buffet served us a reality check!
Interestingly, reports show that the number of coding genes is on the rise again. This uptick is due to researchers actively searching for small open reading frames (ORFs) that may have previously slipped under the radar. These small genes could be the hidden gems of the coding world, and scientists are on a mission to find them.
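To make the idea of a small ORF concrete, here is a minimal Python sketch that scans the forward strand of a DNA string for stretches that begin with a start codon (ATG) and run, in frame, to a stop codon. It is only an illustration: the function name `find_orfs` and the length limits are assumptions made for this example, not the method used by any annotation group, and finding an ORF in a sequence is not by itself proof that a protein is produced.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=10, max_codons=100):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF here runs from the first ATG in a reading frame to the next
    in-frame stop codon. Lengths are counted in codons, excluding the stop.
    The default limits are hypothetical and chosen only for illustration.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_codons <= n_codons <= max_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and keep scanning this frame
    return orfs


if __name__ == "__main__":
    toy_sequence = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"
    print(find_orfs(toy_sequence, min_codons=2))
```

Each hit is reported as half-open coordinates plus the reading frame, so `seq[start:end]` gives the ORF including its stop codon.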
The Challenge of Misclassification
The search for coding genes can be tricky. Many researchers focus on discovering new coding genes because it’s often easier to find them than to prove that a predicted coding gene doesn’t produce proteins. It’s like hunting for treasure—people are more motivated to unearth gold than sift through dirt.
Some groups have attempted to identify genes that may have been misclassified. In one such analysis, researchers discovered that many newly annotated genes resembled non-coding RNA rather than coding genes. One group estimated that there were about 20,500 coding genes, while another predicted there were fewer than 20,000. Talk about a family feud: there’s no clear winner!
Over the years, researchers have flagged thousands of genes as potential non-coding, leading to a reclassification frenzy. Some genes have been reclassified multiple times as new evidence came in. It’s like a never-ending game of musical chairs—every time the music stops, someone’s seat gets taken away!
Merging the Lists
To tackle this complicated situation, researchers in 2018 merged the three major reference sets (Ensembl/GENCODE, RefSeq, and UniProtKB) to create a more unified gene list. In doing so, they found a total of 22,210 annotated coding genes. But, interestingly, one in eight annotated coding genes did not receive a stamp of approval from all three groups. It’s like getting three different opinions on your outfit: one will love it, one will hate it, and the third will simply be confused.
When the analysis was repeated with updated versions of the three reference sets, the total number of genes annotated as coding was slightly lower than in the previous merge, at 21,873. Even so, researchers identified 2,606 genes where there was no consensus on coding status. These genes are still arguing about whether they belong in the coding club or not.
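The bookkeeping behind a merge like this can be sketched with plain Python sets. The gene names below are made up, and the sketch assumes the three sets already share a common identifier. In practice the entries have to be matched using genomic coordinates and protein data, which is the hard part and is not shown here.

```python
def merge_reference_sets(ensembl, refseq, uniprot):
    """Return (union, consensus, no_consensus) for three sets of gene IDs."""
    union = ensembl | refseq | uniprot      # annotated as coding by at least one set
    consensus = ensembl & refseq & uniprot  # annotated as coding by all three
    no_consensus = union - consensus        # at least one set disagrees
    return union, consensus, no_consensus


if __name__ == "__main__":
    # Hypothetical toy data; real sets hold around 20,000 genes each.
    ensembl = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
    refseq = {"GENE_A", "GENE_B", "GENE_C", "GENE_E"}
    uniprot = {"GENE_A", "GENE_B", "GENE_D"}

    union, consensus, no_consensus = merge_reference_sets(ensembl, refseq, uniprot)
    print(len(union), len(consensus), len(no_consensus))

    # The same relationship holds for the counts reported in the source:
    # 21,873 genes in total minus 2,606 without consensus leaves 19,267.
    assert 21_873 - 2_606 == 19_267
```

Because the consensus set is always contained in the union, the no-consensus count is simply the difference between the two totals.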
Coding Status: The Verdict
Among the genes that were annotated as coding, 19,267 were deemed to be coding by all three reference sets. But for the remaining genes, the sorting process revealed various statuses such as read-through genes, pseudogenes, and others, showing that the picture of coding status can be quite complex. It’s sort of like sorting through the laundry: you think you have a clear load of whites, but soon you find a rogue red sock in the mix!
To determine the status of these non-intersection genes, researchers examined the gene annotations from the reference sets and found common statuses. Some genes were classified as read-through genes, meaning that all their transcripts were read-through transcripts, while others were considered pseudogenes—essentially, genes that have lost their functionality over time.
What Are Potential Non-Coding Features?
In the ongoing quest for clarity, researchers defined potential non-coding features for coding genes. They gathered data from various sources and devised criteria to help identify genes that might not fit the coding profile. These features act as red flags, pointing out genes that may not be candidates for protein production.
Using statistical measures like non-synonymous to synonymous ratios, researchers assessed which genes met the criteria for being potential non-coding. They narrowed down their suspect list, leading to the identification of 1,118 genes in the most recent analysis.
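As a rough illustration of how such red flags might be tallied, the Python sketch below computes a simple ratio of non-synonymous to synonymous variant counts for each gene and combines it with any other flags already attached to that gene. Everything here is an assumption made for the example: the class and function names, the flag labels, and especially the cut-off value, which is a placeholder rather than a threshold taken from the original study.

```python
from dataclasses import dataclass, field

# Hypothetical cut-off, used only to make the example runnable.
RATIO_THRESHOLD = 2.0


@dataclass
class GeneEvidence:
    gene_id: str
    nonsyn_variants: int  # count of non-synonymous variants observed
    syn_variants: int     # count of synonymous variants observed
    other_flags: set = field(default_factory=set)  # e.g. {"no_protein_evidence"}


def variant_ratio(gene):
    """Non-synonymous to synonymous ratio, or None when it is undefined."""
    if gene.syn_variants == 0:
        return None
    return gene.nonsyn_variants / gene.syn_variants


def non_coding_features(gene, threshold=RATIO_THRESHOLD):
    """Collect the potential non-coding features recorded for one gene."""
    features = set(gene.other_flags)
    ratio = variant_ratio(gene)
    if ratio is not None and ratio >= threshold:
        features.add("high_nonsyn_to_syn_ratio")
    return features


def flag_genes(genes, threshold=RATIO_THRESHOLD):
    """Return {gene_id: features} for genes with at least one feature."""
    flagged = {}
    for gene in genes:
        features = non_coding_features(gene, threshold)
        if features:
            flagged[gene.gene_id] = features
    return flagged
```

The intuition is that a working protein tolerates few changes to its amino acid sequence, so a gene whose non-synonymous changes are not held back relative to synonymous ones looks less like a coding gene. A gene with one or more features is only a suspect, not a convicted non-coding gene.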
Why Do Some Genes Slip Through the Cracks?
You might wonder—why do genes get misclassified as coding when they should be marked as non-coding? This happens because some genes might have previously shown some signs of coding but lack supporting evidence to back it up.
For instance, genes that are flagged as pseudogenes might still have intact open reading frames, but their lack of functional protein evidence is a crucial clue to their true nature. It’s much like a movie star who still has a fan following, even though they haven’t appeared in anything recently. Their past glory doesn’t necessarily mean they’re still active!
The Mystery of Read-Through Genes
Read-through genes deserve special mention. A read-through transcript arises when transcription runs past the end of one gene and continues into a neighboring gene, and a read-through gene is one in which every transcript is of this kind. These genes often cause a stir, as they can be mistaken for true coding genes. Yet, in reality, they may not produce functional proteins at all.
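The "all transcripts are read-through" rule is easy to express in code. The Python sketch below assumes each transcript comes with a set of annotation tags and that read-through transcripts carry a tag named `readthrough_transcript`. Both the data layout and the tag name are assumptions for this example, so check the conventions of whichever annotation file you are working with.

```python
READ_THROUGH_TAG = "readthrough_transcript"  # assumed tag name for this sketch


def is_read_through_gene(transcript_tags):
    """Return True only if the gene has transcripts and all are read-through.

    transcript_tags is a list with one set of tags per transcript.
    """
    if not transcript_tags:
        return False  # a gene with no transcripts cannot be classified
    return all(READ_THROUGH_TAG in tags for tags in transcript_tags)


if __name__ == "__main__":
    # Hypothetical genes: the first has only read-through transcripts,
    # the second also has an ordinary transcript.
    gene_one = [{"readthrough_transcript"}, {"readthrough_transcript", "basic"}]
    gene_two = [{"readthrough_transcript"}, {"basic"}]
    print(is_read_through_gene(gene_one), is_read_through_gene(gene_two))
```

A single ordinary transcript is enough to keep a gene out of the read-through category under this rule.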
Researchers continue to examine the coding status of read-through genes, and many believe these genes should be reclassified. As more evidence comes to light, the landscape of coding genes continues to shift, and scientists are keen to refine their lists to ensure accuracy.
The Push for Consensus
Researchers are aware that reaching a consensus on the number of coding genes is crucial for the scientific community. This matters not only for basic research but also for clinical applications. Too many misclassified genes in the reference set can confuse large-scale biomedical experiments and lead to erroneous results.
As scientists work together to harmonize their lists, they hope to arrive at a final agreed-upon set of bona fide coding genes. This project requires collaboration and open communication across various research groups to ensure everyone is on the same page. After all, trying to play a game where everyone has different rules is no fun!
The Changing Landscape of Coding Genes
With advancements in technology and more data becoming available, the landscape of coding genes is continuously evolving. Researchers are now focusing on some of the smaller, less well-studied genes, as they may hold potential for novel protein coding. Many researchers believe that the focus on small ORFs is only beginning, and there may be more discoveries just around the corner.
The recent completion of the CHM13 assembly, which identified a host of new genes, has also sparked excitement within the research community. While many of these new genes come from large, duplicated families, their introduction into the field could change our understanding of coding genes.
Conclusion: The Future of Gene Annotation
The process of detecting and validating coding genes is a complex, ongoing effort that requires collaboration, open-mindedness, and, most importantly, patience. With each new analysis, researchers are piecing together the puzzle and refining their understanding of the human genome.
As they continue to work through the discrepancies between databases and refine their lists of coding genes, researchers remain hopeful that they will eventually achieve a clear and accurate picture of what constitutes a coding gene in our genome. So, while the quest may seem daunting, it’s one that scientists are more than ready to tackle—armed with evidence, collaboration, and perhaps a few coffee breaks along the way.
Original Source
Title: More than 2,500 coding genes in the human reference gene set still have unsettled status
Abstract: In 2018 we analysed the three main repositories for the human proteome, Ensembl/GENCODE, RefSeq and UniProtKB. They disagreed on the coding status of one of every eight annotated coding genes. The analysis inspired bilateral collaborations between annotation groups. Here we have repeated our analysis with updated versions of the three reference coding gene sets. Superficially, little appears to have changed. Although there are slightly fewer genes predicted as coding overall, the three groups still disagree on the status of 2,606 annotated genes. However, a comparison without read-through genes and immunoglobulin fragments shows that the three reference sets have merged or reclassified more than 700 genes since the last analysis and that just 0.6% of Ensembl/GENCODE coding genes are not also annotated by the other two reference sets. We used eight features indicative of non-coding genes to examine the 21,873 coding genes annotated across the three reference sets. We found that more than 2,000 had one or more potential non-coding features. While some of these genes will be protein coding, we believe that most are likely to be non-coding genes or pseudogenes. Our results suggest that annotators still vastly overestimate the number of true coding genes.
Authors: Miguel Maquedano, Daniel Cerdán-Vélez, Michael L. Tress
Last Update: 2024-12-09 00:00:00
Language: English
Source URL: https://www.biorxiv.org/content/10.1101/2024.12.05.626965
Source PDF: https://www.biorxiv.org/content/10.1101/2024.12.05.626965.full.pdf
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to bioRxiv for use of its open access interoperability.