03 December 2021

Genomes of Russia

"The main application of our work is the diagnosis of rare hereditary diseases"

XX2 century

Recently, a dream of Russian medical geneticists came true: scientists created a database of the genetic diversity of our country's population. Specialists in genetic diseases, oncogenetics, neonatal screening and other fields of medicine have long needed such a database. How exactly the project will help medicine, why private companies were for a long time reluctant to share patient data with colleagues, how genetics affects the onset of diseases, and how a diet allows some children to develop normally: all this is explained by one of the authors of the study, bioinformatician Alexander Predeus.

Russia is a very multiethnic and ethnically distinctive country, and it is not fully integrated into world medical and genetic science. Because of these two factors, Russian medical geneticists long had a poor understanding of the genetic diversity of the country's inhabitants.

About the most important thing in our DNA

People who are far from science often have a poor grasp of what genes and genetic diversity are. This is largely because genetics is a complex and unintuitive science. In addition, over the last 20 years genetics has "evolved" a great deal: the biggest revolutions are now taking place in genomics, molecular biology and the life sciences in general. It is difficult to keep up with progress while honestly studying the basics, and as a result some very wild judgments about genetics in general and human genetics in particular are often voiced.

So, a short digression into biology: every person has a genome, which is our DNA. All cells, RNA and proteins are built from the inherited information it contains. It is DNA that determines the development program of any organism. Roughly speaking, "from a bird's-eye view", the DNA sequence defines our biological species. The fact that we are humans, Homo sapiens, is determined by our DNA sequence and our set of genes: sections of DNA that serve as a "blueprint" for synthesized RNAs, which in turn can be translated into proteins. We have about 20 thousand different protein-coding genes, and it is their sequences that make us human.

At the same time, DNA is constantly changing, because various errors (mutations) continuously creep into it — which can then be inherited by the next generations. Some mistakes have very serious consequences, and some — not so much. Some do not have any consequences at all and are just a neutral variation. This leads to the fact that within each species — including humans — there is some diversity of DNA. In other words, we are not only very different from whales or monkeys, but also differ from each other — although to a much lesser extent. The differences between two people can be estimated at about 1/1000 of the entire genome; in some parts of the genome, the variability is greater, and in some — less.

Our genome consists of 3 billion nucleotides, and three to five million of them usually differ from the "average", which is called the reference genome. On the one hand this is a lot, on the other not very much: any two people share about 99.9% of the DNA that does not belong to the sex chromosomes.

It is also important to understand that we are diploid organisms. We have two copies of our DNA, one from our mother and one from our father. This is important for understanding the nature of genetic diseases, and that is precisely what our work was done for.

Where the causes of genetic diseases lie

This situation arises often in biology and medicine: of the many factors that could in theory influence various traits and diseases, only a small part really matters. So the people who study a question first of all determine which factors are important and which are not.

Variability may differ slightly between parts of the genome. Or it may differ greatly, if a section of the genome is structured so poorly that our biological machinery has difficulty copying it. The 20 thousand protein-coding genes mentioned earlier are noticeably more important and less variable than the rest of the genome. It may seem surprising, but they occupy a fairly small share of the total DNA, about 1%. That is, of the 3 billion bases of our genome, only about 35 million nucleotide pairs encode proteins. Geneticists usually focus on these regions, perhaps with the addition of adjacent regulatory regions, because the vast majority of genetic diseases known to us are caused by DNA changes there. This is how we can narrow the problem down by a factor of about 100. That's not bad.

Let's go back to the calculations: we have 35 million nucleotide pairs and the variability is about 1/1000, so how many variants do we get? Applying this arithmetic, the coding part would contain 35 thousand variants, but this is wrong, because there are fewer mutations in the coding part. The main reason is that they are tolerated less well there: mutations are far from equal in their effect. If something happens in a neutral and not very necessary part of the DNA, the body may not notice it at all. But if it happens in a vital protein, the body will most likely notice it one way or another, perhaps not immediately, perhaps right away. Or it may be impossible to be born with such a variant at all: a child carrying it is simply not viable. Therefore there are usually fewer variants in protein-coding regions: about 25 thousand for Europeans, up to 30 thousand for Africans, and somewhere in between for the world's other populations.
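The arithmetic in this paragraph can be checked in a few lines. This is only a rough sketch that uses the round numbers quoted in the text; the variable names are made up for illustration.

```python
# Back-of-the-envelope check using only the round numbers quoted in the text.
CODING_BASES = 35_000_000   # protein-coding nucleotide pairs
ONE_VARIANT_PER = 1_000     # genome-wide variability of about 1/1000

# Naive estimate: assume the coding part varies as much as the genome overall.
naive_estimate = CODING_BASES // ONE_VARIANT_PER

# Typical per-person counts of coding variants quoted in the text.
observed = {"Europeans": 25_000, "Africans (upper bound)": 30_000}

print(f"Naive estimate: {naive_estimate} variants")
for population, count in observed.items():
    share = count / naive_estimate
    print(f"{population}: {count} variants, {share:.0%} of the naive estimate")
```

The gap between the naive estimate and the observed counts is the footprint of the lower tolerance to mutations in coding regions.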

Genetics and phenotypes: what is the connection?

An individual's set of genetic variants affects phenotypes, that is, things we can observe. Height, weight and diseases are all part of our phenotype. Depending on the trait, genetics determines the phenotype to a very different degree. It determines some diseases almost completely, for example cystic fibrosis, one of the typical genetic diseases. And even then, if there is a mutation in the gene responsible for this disease, the probability that the disease manifests is slightly less than 100%. This probability is called penetrance. There is a lot of discussion around this term, as indeed around many topics in biology and medicine.

There are phenotypes that are very little determined by genetics. A typical example is infections. That is, if a person is infected with HIV, it has almost nothing to do with his genetics. There are, of course, people who react less to HIV or do not react at all — this is another question. But whether a person will get an infection or not is not a question of genetics, but rather of whether he was infected or not. Most people react pretty much the same way to HIV infection.

Genetic diseases are those in which genetics explains the vast majority of what is observed. That is, genes are "to blame" for the occurrence of the disease. Formally, such diseases can be divided into monogenic (or Mendelian, in honor of the founding father of genetics, Gregor Mendel) and polygenic. Sometimes a defect in just one gene leads to the disease. Sometimes the disease is caused, for example, by the failure of a metabolic pathway or a cell type, which can be disrupted by breaking any of several genes. And there are polygenic diseases: for example, many types of autism, autoimmune diseases such as lupus, or type II diabetes. Many diseases arise from the complex interaction between a person's development and their genetic "background".

How a diet saves a child's brain

Our work is useful primarily for the neonatal diagnosis of classical monogenic Mendelian diseases. Monogenic diseases are mainly the ones for which newborns are screened right in the maternity hospital: cystic fibrosis, phenylketonuria and others. The widespread Down syndrome, trisomy of chromosome 21, also belongs here, but it occupies a special place among inherited diseases. Down syndrome is not just a replacement of one nucleotide with another, but a whole extra chromosome.

Phenylketonuria is a disease in which a person cannot metabolize the amino acid phenylalanine; if the patient eats food containing it, the brain begins to suffer from the toxic substances that reach it. However, if a child with this diagnosis is put on a special diet from the first week of life, the child will grow and develop almost normally. If instead the baby eats food containing this amino acid, which is absolutely safe for other people, intellectual disability will develop.

Therefore, one of the applications of our work is screening of various kinds, for example when planning a pregnancy. If there are hereditary diseases in the family, then with the help of this analysis and IVF you can make sure that your children do not inherit them. Also, in many countries, newborn screening is gaining momentum for an increasing number of inherited diseases in which early intervention is critically important, as with the above-mentioned phenylketonuria. Although such approaches are not cheap, in the long run they save the state huge sums on treatment.

About different types of inheritance of diseases

Let's go back to what the problem with monogenic diseases is in general. We mentioned above that we carry two copies of each non-sex chromosome, which is what being diploid means. In effect, we have two copies of our DNA and two copies of each gene, and they may differ slightly. Because of this, different types of inheritance are possible, depending on the function of the gene; the most common are autosomal dominant and autosomal recessive. Usually, when analyzing inherited diseases, we assume that a specific gene is to blame for the disease. As noted earlier, some variants "break" the gene, and it may be that only one copy is "broken". For some proteins this is not a big problem: the second copy simply produces a working protein, and everything is fine. We live without noticing it. In other cases, both copies are needed for all systems to work properly. Then even a single broken copy causes a disease, which follows an autosomal dominant pattern of inheritance. In this case, as a rule, a single variant in some gene is responsible for the disease.

In the case of autosomal recessive inheritance, both copies must be broken. It is to prevent such diseases that, from a medical point of view, marriages between close relatives are not recommended. We all carry broken copies of various genes: according to various estimates, from 20 to 100 genes in each person carry practically non-functional copies, of which about 20 are vital for the functioning of our body. With a certain degree of parental kinship, fragments of the parents' genomes will match despite recombination (the process in which DNA copies are "shuffled" during the production of germ cells). There is a chance that long sections of DNA will match almost completely, which raises the probability that both copies of a number of genes will be broken, causing hereditary diseases with autosomal recessive inheritance.

But even in marriages between unrelated people, there are many ways for such coincidences to occur. And we run into a problem that medical geneticists often face, the "needle in a haystack" problem: we have 25 thousand protein-coding variants, and we are looking for just one, exactly the one that is to blame for this genetic disease. This is, to put it mildly, not easy. To solve this problem there are many logical tricks that medical geneticists have used for decades, many of them developed before sequencing and modern molecular biology.

Each person's genetic signature is unique

It should be noted that I use the terms variant, mutation and polymorphism almost interchangeably. Medical geneticists may scold me for it, but I'm a bioinformatician, so I can get away with it. Sometimes "mutation" refers to a variant that causes a disease, and "polymorphism" to a variant that is neutral. From the point of view of classical genetics, any change in the genome is the result of mutation. Here, however, I use these terms in roughly the same sense.

The main tool of medical genetics remains the comparison of the genetic data of sick and healthy people. However, the methods by which these comparisons are carried out are constantly evolving. Recently, we (in the sense of humanity) have learned to sequence the human genome, that is, to determine its sequence, and to do so very quickly, efficiently and at a reasonable price. Much of what scientists had previously only guessed at became visible as a literal nucleotide sequence. This breakthrough in methods gave us the opportunity to sequence a very large number of both sick and healthy people.

How does this help us find causal variants in monogenic diseases? Let's say we know that the disease in question has a clear type of inheritance, dominant or recessive, as discussed earlier. Then, as a rule, we can estimate how many cases of this genetic disease we should see in the population, either from the frequency of the variant (for dominant inheritance) or from the square of the frequency (for recessive inheritance), and compare that with the observed frequency known from epidemiological studies.
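The estimate described above can be sketched as a small function. This is a simplified illustration, not the method used in our project: it assumes Hardy-Weinberg proportions and full penetrance, and the function name `expected_prevalence` is invented for the example. Under these assumptions the dominant case works out to roughly twice the allele frequency for rare variants, which is the same order of magnitude as the frequency itself.

```python
def expected_prevalence(allele_freq: float, inheritance: str) -> float:
    """Expected share of affected people if the variant were causal.

    Assumes Hardy-Weinberg proportions and full penetrance (both are
    simplifying assumptions made for this sketch).
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    if inheritance == "dominant":
        # Affected if at least one of the two copies carries the variant.
        return 1.0 - (1.0 - allele_freq) ** 2
    if inheritance == "recessive":
        # Affected only if both copies carry the variant.
        return allele_freq ** 2
    raise ValueError("inheritance must be 'dominant' or 'recessive'")


for mode in ("dominant", "recessive"):
    per_100k = expected_prevalence(0.001, mode) * 100_000
    print(f"{mode}: about {per_100k:.1f} expected cases per 100,000 people")
```

The resulting number is then compared with the prevalence reported by epidemiological studies; a large mismatch argues against the variant being causal.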

And how do we know the frequency of the variant under discussion? This is where the main benefit of our work lies. Suppose we sequence 10 thousand people and see that a specific variant occurs 20 times. Then the frequency of the variant is 20 divided by 20 thousand (we are diploids, remember?), that is, 0.001. This is a very rare variant; each particular genome contains a great many such variants and even rarer ones. This is what makes each person's genome unique. Most of the known genetic variants in the world occur in only 1, 10, 100 or 1000 people. There are, however, variants that occur in half of the planet's population or even in 95%. But they are a grain of sand in the general sea of rare and unique variants.
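The frequency calculation from this paragraph, written out as code. The function name and its parameters are illustrative only; the numbers are the ones from the example in the text.

```python
def allele_frequency(allele_count: int, n_people: int, ploidy: int = 2) -> float:
    """Share of sequenced chromosome copies that carry the variant."""
    if n_people <= 0 or ploidy <= 0:
        raise ValueError("n_people and ploidy must be positive")
    total_copies = n_people * ploidy  # diploid: two copies per person
    if not 0 <= allele_count <= total_copies:
        raise ValueError("allele_count must be between 0 and the number of copies")
    return allele_count / total_copies


# The example from the text: a variant seen 20 times among 10 thousand people.
freq = allele_frequency(allele_count=20, n_people=10_000)
print(freq)  # 0.001
```

Note that the denominator is the number of chromosome copies, not the number of people, which is why 10 thousand people give 20 thousand in the division.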

So, it is this frequency that lets you assess whether a variant can cause a disease or not, because for most of the "suspects" the implied frequency of the disease is too high. The logic is this: if a variant occurs in 1% of people and we suppose that it is to blame for a dominant disease, then 1% of people should have this disease. But 1% is shockingly high for a specific monogenic disease. About 7 thousand monogenic diseases are known, and in total 3 to 5% of people suffer from them. Some diseases manifest themselves weakly, and you may not even suspect that a person has one. Others cannot be overlooked. Still, in general every monogenic disease is a rarity, and its frequency in the population ranges from 1 per million or lower up to hundreds of cases per 100,000. Therefore, it is almost impossible to imagine an individual monogenic disease with a population frequency of 1%.

Thus, in the vast majority of cases, we can use these frequencies to filter out false positive associations. Suppose you are told: "Variant A causes disease B." You ask: "What is the frequency of variant A?" You check it in the way described above, and the frequency turns out to be too high. So you understand that variant A has nothing to do with it.
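As a toy illustration of this filtering step: the variant names, their frequencies and the cutoff below are all invented for the example, and a real analysis would choose the cutoff from the known prevalence and inheritance type of the disease in question.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    allele_freq: float  # population frequency of the variant


def plausible_candidates(candidates, max_allele_freq):
    """Keep only the variants rare enough to explain a rare disease."""
    return [c for c in candidates if c.allele_freq <= max_allele_freq]


# Hypothetical suspects for a rare monogenic disease.
suspects = [
    Candidate("variant_A", 0.01),      # seen in about 1% of chromosome copies
    Candidate("variant_B", 0.00005),   # very rare
]

# Hypothetical cutoff, chosen only for this illustration.
MAX_ALLELE_FREQ = 0.001

kept = plausible_candidates(suspects, MAX_ALLELE_FREQ)
print([c.name for c in kept])  # ['variant_B']
```

Here variant_A is dropped for exactly the reason given above: it is far too common to be responsible for a rare disease.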

"A large sample is very good"

Why did we start our own project if the world already has such projects, with openly accessible frequencies and samples of 150 thousand people? The answer is simple. Ethnic groups and populations differ in both frequent and rare variants. If you have seen pictures of genetic ancestry analysis, for example on 23andMe, it is worth explaining that such analysis usually relies on very frequent variants, for example those with a population frequency above 5%. Such variants are the most informative for determining a person's origin, but the least interesting from a medical point of view, because they obviously cannot cause hereditary diseases in any population. We, by contrast, are interested in variants with frequencies of 1% and below.

Our country has great genetic diversity, and the frequencies described above will differ in exactly the ranges that matter to us. Russian medical geneticists badly needed such a resource, precisely in order to work better with our patients.

As I wrote at the very beginning, Russia is unfortunately not very integrated into the world scientific community. But our country has the money and skills to create such a project. In addition, private commercial companies have been doing such diagnostics for a long time. The price of whole-genome sequencing is now more than 90 thousand rubles, and whole-exome sequencing of those 35 million protein-coding nucleotides costs about 40 thousand rubles. This, of course, is not cheap. But it is done quickly, efficiently and routinely, and not only in St. Petersburg and Moscow. Having collected a sample of some size, a company can draw its own conclusions and make its own estimates. But it is very important to understand: the larger the sample, the better you can estimate the frequency you are interested in, and the more rare variants you will be able to characterize reliably. A large sample is very good.

Why it is important for scientists to unite

There was a very interesting study that worked out how many people need to be sequenced before certain positions are saturated, that is, before we see every position changed at least once. It takes from many hundreds of thousands to millions of samples. We are not there yet, and neither is world genetics.

The need for such a database in Russia has been clear to everyone for quite some time. But commercial organizations did not want to share their data: they were afraid that it would be misused, and afraid of losing a competitive advantage. So far, however, everything suggests the opposite: the reputation of the companies united in our project is growing stronger, and access to valuable and reliable information is becoming faster and more efficient. Together we have definitely become stronger. In St. Petersburg, we have long been working with the company "Serbalab" and with Andrey Glotov. In the course of their work they accumulated more and more data, eventually more than 2 thousand of the above-mentioned whole exomes. In Moscow there is the laboratory "Genetiko", headed by Artur Isaev, who showed interest in a joint project. So we decided to join forces. The scientific result of our project is modest: we found known variants for 18 diseases that occur in our populations more often than in the rest of the world, found a number of new variants, and processed all the data in a uniform way, reducing the number of errors that are potentially dangerous for interpretation.

However, our main result is ideological and, so to speak, political. The data obtained are definitely important for practicing geneticists. We have successfully shown that it is very cool to unite like this: it can be done legally, effectively and profitably for all parties involved. We have developed standard contracts that will make it easier for new organizations to join our consortium. We have come up with bioinformatic approaches that minimize the exchange of raw data, leaving the data under the full control of its owners. And at the basic level, our result, population frequencies in the Russian population, can be used by everyone who signs the relevant agreement. We plan to further develop the site where the results are presented, ruseq.ru. Welcome!

Portal "Eternal youth" http://vechnayamolodost.ru

