02 February 2021

Genome annotation

Tim Hubbard, Serious Science. Translation: Irina Lineva, Post-science

Genomic data processing is a relatively young field of research: the first genome of an organism was sequenced only in 1995, and by 2000 we had a first draft of the human genome. But if we want to use what we know about the genome, the most important thing is to understand which parts of it are functional and what they are responsible for. We have long understood that genes are the main units of the genome, but where are they located? If you look at the human genome, you see just a long chain of letters: DNA bases, 3 billion of them. How do we work out where the genes are?

Introns: "extra letters" in words 

If you work with bacterial genomes, computers can locate genes directly, because the genes are laid out in very simple blocks: you can see the beginning of the gene, the main part that encodes the protein, and the end. If you search for these blocks, you can determine where the genes are. But in vertebrates, and in almost all complex animals, genes are not arranged in continuous blocks. They are split into fragments: exons, the parts that end up in the finished RNA, separated by introns, which are cut out. In humans, these intervening fragments can be very long, so it is difficult to identify the location of a gene with high accuracy.
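The block search described for bacteria can be sketched in a few lines of Python. This is a minimal illustration, not a real gene finder: it assumes the gene starts with the codon ATG, ends at one of the three standard stop codons, and lies on the forward strand only. The function name `find_orfs` and the `min_codons` threshold are hypothetical choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs of simple gene-like blocks on the forward strand.

    start is the 0-based index of the ATG start codon;
    end is the index just past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    # Read the sequence in each of the three possible reading frames.
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # beginning of a candidate block
            elif start is not None and codon in STOP_CODONS:
                # Keep the block only if it is long enough (stop codon included).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A scan like this breaks down on a gene interrupted by introns, because the start and the stop no longer sit in one uninterrupted reading frame of the DNA.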

Specialists have developed computer programs that predict the location of genes, but the results are not very reliable: even now, twenty years after the genome was sequenced, the only dependable way to locate genes is to analyze the contents of the cell.

A gene's work begins at the DNA: an RNA copy of the gene is made, then this RNA is processed and translated into a protein. You can find these RNAs in the cell and match them to the corresponding stretches of DNA, and use that information to locate genes. This is the only approach that really works: it requires computers, but the identification of genes still rests on the analysis of RNA fragments.
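The idea of matching an RNA back to the DNA can be illustrated with a toy example. The sketch below assumes the RNA has already been written in DNA letters and matches the genome exactly, with no sequencing errors, and it places the RNA greedily, piece by piece, from left to right. Real aligners are far more sophisticated. The function name `locate_exons` and the `min_exon` parameter are hypothetical.

```python
def locate_exons(genome, rna, min_exon=4):
    """Greedily place an RNA on the genome as a series of exact-match blocks.

    Returns a list of (start, end) genome coordinates, one per block,
    or None if some part of the RNA cannot be placed.
    """
    blocks = []
    genome_pos = 0  # only search to the right of the previous block
    rest = rna
    while rest:
        placed = False
        # Try the longest possible prefix of the remaining RNA first.
        for length in range(len(rest), min_exon - 1, -1):
            idx = genome.find(rest[:length], genome_pos)
            if idx != -1:
                blocks.append((idx, idx + length))
                genome_pos = idx + length
                rest = rest[length:]
                placed = True
                break
        if not placed:
            return None
    return blocks


if __name__ == "__main__":
    exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "TTCGAATAA"
    genome = "CCCC" + exon1 + intron + exon2 + "GGGG"
    rna = exon1 + exon2  # the intron has been cut out of the RNA
    print(locate_exons(genome, rna))
```

The gaps between consecutive blocks correspond to introns, and the outermost coordinates give an estimate of where the gene lies.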

Cellular noise

What is the problem? Firstly, the cell is full of noise, that is, incomplete RNA and other molecules, so you never get a perfectly clear picture. Secondly, different sets of genes are active in different cell types. This means that different cells yield different sets of RNA, so however many cells you analyze, you will never have a complete set.

So do we know the location of all human genes? Not quite, because we cannot be sure that we have checked every kind of cell. There are about 37 trillion cells in the human body, and we don't know how many types they can be divided into. And even if you had samples of all cell types, only some genes are active in each of them at any given time: some of the genes that are active now will be switched off later. So when we analyze RNA, we always get a somewhat incomplete picture of where the genes are.

Nevertheless, based on these data, we have made great progress. In recent years we have collected a fairly extensive body of genetic information, which is now available in databases such as Ensembl and in other genome browsers around the world: they allow you to examine a fragment of the genome and see which genes it contains.

What genes encode (not just proteins!)

There is another problem with gene identification: how do you determine what a gene does? For a long time it was assumed that everything starts with DNA, then RNA is made, then protein is synthesized, so at the end of the chain there should always be a protein. When we isolate a specific gene, we still rush to find which part of it is responsible for protein synthesis. But as we learned from RNA analysis, in many cases no protein is made at all.

Before the human genome was sequenced, experts estimated the number of human genes at approximately 100 thousand, and guessing how many we would count once sequencing was complete became a kind of competition. After the initial analysis the estimate dropped to 30 thousand, and it kept falling until we arrived at about 20 thousand genes, but these are only the genes that encode proteins. Now we understand that there may be another 20 thousand genes, possibly more, that encode not protein but RNA, and these RNAs themselves perform important functions in the cell.

How can one determine whether a gene encodes a protein or not? You can look for the characteristic pattern of three-letter units, the triplets that specify how RNA (an alphabet of four letters) is translated into protein (an alphabet of 20 amino acids). This is a statistical method of sequence analysis, and statistics work poorly on very short sequences, yet very, very short proteins do exist. If you look at all the expressed RNAs, you will find many potential opportunities to make very short proteins, although many of them are never actually synthesized. At the moment, since our predictive power is limited, we need additional experimental data to understand whether a given genomic sequence encodes a protein or not.
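To make the triplet rule concrete, here is a minimal Python sketch that reads an RNA three letters at a time using the standard genetic code. It also lists every place where translation could in principle begin, which shows why short RNAs offer so many hypothetical short proteins. The function names `translate` and `candidate_peptides` are illustrative, and the sketch makes no attempt to judge whether a candidate is really synthesized.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate from the first letter, stopping at the first stop codon."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def candidate_peptides(rna):
    """List (position, peptide) for every AUG at which translation could start.

    Candidates that run off the end of the sequence without reaching
    a stop codon are kept as well, so this is an upper bound.
    """
    seq = rna.upper().replace("U", "T")
    return [(i, translate(seq[i:]))
            for i in range(len(seq) - 2)
            if seq[i:i + 3] == "ATG"]


if __name__ == "__main__":
    print(translate("AUGGCCUUCGAAUAA"))
    print(candidate_peptides("AUGGCCAUGUAA"))
```

Even the twelve-letter toy RNA in the usage example yields two candidates, which hints at how quickly the number of possible short proteins grows across all expressed RNAs.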

Open questions

Scientists working on genome annotation are now increasingly turning to mass spectrometry data as well, since it shows whether a protein is actually synthesized from a given sequence. In some cases they then revise the annotation: for example, they remove genes that were predicted to encode a protein but in fact do not, or reclassify genes that were thought to encode RNA but turned out to encode a small protein. So the process of genome annotation never stops: until we can run the complete human genome through a computer and determine the location of all genes directly, we will have to rely on experimental data on RNA and proteins, and these data will always be incomplete, since they come from specific cells in which only specific proteins and RNAs are made.
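The logic of checking predictions against protein evidence can be sketched as follows. This toy example assumes that mass spectrometry has already been reduced to a list of observed peptide sequences and simply checks which predicted proteins contain them. Real proteomics pipelines use dedicated search engines and statistical error control, none of which is modeled here. The names `peptide_support`, `geneA` and `geneB` are hypothetical.

```python
def peptide_support(predicted, peptides):
    """Map each predicted protein to the observed peptides found inside it.

    predicted: dict of gene name -> predicted protein sequence
    peptides:  list of peptide sequences observed experimentally
    """
    support = {name: [] for name in predicted}
    for pep in peptides:
        for name, protein in predicted.items():
            if pep in protein:  # exact substring match only
                support[name].append(pep)
    return support


if __name__ == "__main__":
    predicted = {"geneA": "MAFEKLLT", "geneB": "MSSRQW"}
    observed = ["AFEK", "KLLT", "GGGG"]
    for gene, hits in peptide_support(predicted, observed).items():
        status = "supported" if hits else "no protein evidence"
        print(gene, status, hits)
```

A prediction with no supporting peptides is not necessarily wrong, since the protein may simply not be made in the cells that were sampled.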

I think the most important open questions in this area are how many genes there are in the human genome and what the RNA-encoding genes do. It is clear that some of them perform specific functions, but it is unclear how many of these RNAs are functional and how many are just noise, and which cellular processes the functional ones take part in. This is a relatively new area: we learned that such RNAs exist in these numbers only 5-10 years ago. Of course, all this is connected with epigenetics, which should give us a better idea of which molecules bind to DNA and regulate RNA production.

Thus, future research in genome annotation will focus on the RNA-encoding genes, their number and their functions: we know what some of them do, but only a few. Accurately identifying all human genes will take a long time. We may have found most of the protein-coding genes, but we know that they come in alternative forms produced by alternative splicing, and the more cell types we analyze, the more alternative forms we will discover. There is a whole community, a project called GENCODE, that manages the process of genome annotation and will be working on this task for many years.

Portal "Eternal youth" http://vechnayamolodost.ru

