02 November 2016


Bioinformatics, Cheburashka and DREAM-ENCODE

Julia Popova, XX2 CENTURY

The first round of DREAM-ENCODE, a machine learning competition in biology timed to coincide with the international DREAM conference, was won by the team autosome.ru from Russia. Its members are Ivan Kulakovsky, a leading researcher at the Laboratory of Computational Methods of Systems Biology of the IMB RAS; Grigory Sapunov, co-founder of Inten.to; and Vsevolod Makeev, Corresponding Member of the Russian Academy of Sciences and head of the Laboratory of Systems Biology and Computational Genetics of the IOGen RAS. They told "XX2 Century" about computational analysis of regulatory regions of the genome and about some misconceptions alive in modern society.

XX2 CENTURY: To begin with, how would you explain what transcription factors are to someone far removed from science?

Ivan: In fact, these are local switches of gene activity.

Let's recall the basics of molecular biology and look at the genome as an abstract sequence of nucleotide bases, the letters "A, C, G, T". The best-studied sections of the sequence are those that encode proteins, that is, protein-coding genes. It is useful to understand that in higher eukaryotes protein-coding genes cover only a small fraction of the genome; in the human genome, only 1-2%. The first question is what else of importance is written in the genome besides the protein-coding genes. The second question is how, on the basis of the same genome, all the diversity of cell types is realized within one multicellular organism.

Both questions relate directly to regulatory segments of the genome: sections of non-coding regions that determine the activity of genes, for example by changing the efficiency of RNA synthesis (transcription). Transcription factors are a special class of proteins that regulate the start of transcription of specific genes. Most transcription factors can independently recognize suitable "binding sites" in DNA, characteristic sequences of nucleotides in regulatory regions. By binding to a section of DNA, the transcription factor turns the corresponding target genes on or off.

XX2 CENTURY: If scientists can discover the algorithm by which proteins bind to DNA, how will they be able to use it? And do you believe that this algorithm will be discovered soon?

Ivan: The global task is quite large-scale: eukaryotic chromosomes are packed into an intricate tangle in the cell nucleus, some parts of the genome are extremely densely packed and inaccessible for interaction, while others are already occupied by competing proteins. Many experimental methods have been invented to map the binding sites occupied by a specific protein, but the experiments themselves are labor-intensive, and the resulting data are quite "noisy". The problem is complicated by the fact that there are a great many transcription factors (at least 1,500 in humans), and they work in different combinations in different types of cells. That is, for each transcription factor the experiment has to be carried out separately in each cell type. And finally, it is methodologically not so easy to move from cell cultures grown "in vitro" to normal cells and organs.

This is where bioinformatics enters the scene. A transcription factor binds DNA with a characteristic "tail" (the DNA-binding domain), which prefers to encounter specific "words" in DNA, particular sequences of nucleotides, and anchors at the best matches. That is, using statistical methods of text analysis, it is possible to determine, for a relatively short sequence, the most likely landing site of the transcription factor.
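One common way to formalize such "word preferences" in computational biology is a position weight matrix (PWM): each position of a candidate site gets a per-letter score, and the scores are summed along a window slid over the sequence. The sketch below is a minimal illustration of the general technique with an invented 4-position matrix, not the model of any real transcription factor.

```python
# Toy position weight matrix: per-position log-odds score of each base
# at a hypothetical 4-letter binding site. The numbers are invented for
# illustration; real PWMs are derived from experimental binding data.
PWM = [
    {'A': 1.2, 'C': -0.8, 'G': -0.5, 'T': -1.0},  # position 1
    {'A': -0.9, 'C': 1.1, 'G': -0.7, 'T': -0.6},  # position 2
    {'A': -1.0, 'C': -0.6, 'G': 1.3, 'T': -0.8},  # position 3
    {'A': -0.7, 'C': -0.9, 'G': -0.5, 'T': 1.0},  # position 4
]

def score(window):
    """Sum of per-position log-odds scores for one candidate site."""
    return sum(column[base] for column, base in zip(PWM, window))

def best_site(sequence):
    """Slide the PWM along the sequence; return the top-scoring window."""
    k = len(PWM)
    hits = ((i, sequence[i:i + k]) for i in range(len(sequence) - k + 1))
    return max(hits, key=lambda hit: score(hit[1]))

pos, site = best_site("TTGACGTCAAACGT")
print(pos, site, round(score(site), 2))  # -> 3 ACGT 4.6
```

The top-scoring window is only the most likely landing site in a statistical sense; in a real pipeline such scores are combined with experimental evidence such as open-chromatin maps.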

In turn, by experimentally determining the map of accessible genome regions (the so-called "open chromatin") in a particular cell type, it becomes possible to predict the binding of specific transcription factors computationally.

It cannot be said that we already fully understand how a protein finds its binding sites, but a combination of experimental and computational approaches already makes it possible to obtain a detailed "genomic map" of binding sites. The map shows where regulatory regions are located and which genes are potentially under their control. The global goals are to decipher the "grammar" of regulatory regions and to engineer regulatory sequences with desired properties. On the practical side, there is the selection of "cocktails" of transcription factors and the controlled conversion of cell types for the tasks of regenerative medicine and for modeling diseases of various tissues and organs. Already today, a detailed genomic map of binding sites makes it possible to predict the consequences of possible mutations in regulatory regions that affect the activity of specific genes.

Vsevolod: I want to additionally draw attention to the fact that direct medical applications of gene editing are still a risky business. It was not for nothing that Ivan spoke of "modeling diseases", that is, of edits that reproduce, in "artificial organs", the mutations that lead to hereditary diseases, making it possible to study their course and thereby better understand their mechanisms and possible therapies. As for direct applications, those in biotechnology look more realistic. One can try to change the dynamics of genes in domestic animals or agricultural plants by editing regulatory regions, aiming for new consumer properties.

XX2 CENTURY: Tell us about the method by which you won the competition.

Ivan: I would first like to say a few words about the contest itself, a joint project of the international ENCODE consortium and the DREAM initiative. ENCODE has been annotating regulatory regions in the human and mouse genomes using various experimental methods for more than 10 years. DREAM, in turn, holds various competitions on the application of machine learning methods to a wide range of biological tasks.

The published ENCODE results were obtained on "immortal" cell lines, but in a new round the consortium is also conducting experiments on samples of living tissue. The goal of the DREAM-ENCODE competition is to predict the binding of transcription factors in normal tissue using knowledge of open chromatin regions and features of the genomic map of binding sites obtained on cell lines. This task has a simple practical payoff: in the future, one could limit oneself to a minimal set of experiments on primary tissues and organs and reuse existing data as much as possible.

Our method is based on a deliberate choice of "training" data. For this we came up with a simple algorithm, nicknamed "Cheburashka" in the working version for its naive approach. Cheburashka thus became the team's informal mascot.

[Photo: autosome_ru.jpg]
Cheburashka and the team autosome.ru, winner of the first round of DREAM-ENCODE.
Bottom row: Vsevolod Makeev, Andrey Lando, Grigory Sapunov, Ilya Vorontsov.
Top row: Ivan Kulakovsky, Valentina Boeva, Cheburashka and Irina Eliseeva.

And for the final predictions we used a well-known machine learning library, XGBoost. I think it was this combined approach that allowed us to perform successfully in the first round of the competition, timed to coincide with the DREAM conference. The winners of the first round show their cards: they share technical details and reasoning. A member of our group, Andrey Lando (a student at the Moscow Institute of Physics and Technology), was invited to give a presentation at the DREAM conference. The second round will last until the beginning of 2017, and we expect our findings to be useful to the future leaders.
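The interview does not go into the team's actual features or settings, so the following is only a hypothetical sketch of how gradient-boosted trees from the XGBoost library can be applied to a binary "bound / not bound" classification; the synthetic data, feature choices and hyperparameters are assumptions for illustration, not the team's pipeline.

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in data: one row per genomic interval, three invented
# features (e.g. a PWM score, an open-chromatin signal, a conservation
# score); label 1 = transcription factor bound, 0 = not bound.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Gradient-boosted trees; hyperparameters are illustrative defaults.
model = xgb.XGBClassifier(
    n_estimators=200,   # number of boosted trees
    max_depth=4,        # shallow trees to curb overfitting
    learning_rate=0.1,
    eval_metric="logloss",
)
model.fit(X[:800], y[:800])
print("held-out accuracy:", model.score(X[800:], y[800:]))
```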

XX2 CENTURY: Tell us about machine learning. Why did you need a specialist in this field for your work?

Grigory: The amount of available data in biology is huge and keeps growing. No single person can take it all in, and even for a group of people it is extremely difficult. Computers come to the rescue.

If a person can hold 3-5 task-relevant variables in their head, a computer can simultaneously work with hundreds and thousands of variables, also taking into account the interactions between them. But already for 5 variables there are 10 pairwise combinations; as the number of variables increases, the number of combinations grows quadratically, and on top of that there are even more complex combinations of three, four or more factors, so that even five variables are already hard for a person to work with.
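The arithmetic behind this is simple: n variables give n(n-1)/2 pairwise combinations, which a few lines of Python make concrete:

```python
from math import comb

# Number of pairwise combinations of n variables: n * (n - 1) / 2.
for n in (5, 10, 100, 1000):
    print(f"{n} variables -> {comb(n, 2)} pairwise combinations")
# 5 variables -> 10, 1000 variables -> 499500
```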

Additional difficulties come from the amount and volume of available data. A person cannot review hundreds of thousands or millions of genomic intervals; the information has to be drastically reduced, leaving summary statistics and other aggregated descriptors. And this is already halfway to machine learning: statistics is very closely related to the field, and devising the right way to aggregate data is essentially the activity of "inventing" features suitable for solving a problem. This feature engineering is the most important element of classical machine learning (as opposed to deep learning, which can largely be spared this step).
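As a toy illustration of what such aggregation might look like (the specific statistics here are invented for the example, not taken from the team's pipeline), one can collapse a raw per-base signal over a genomic interval into a handful of summary features:

```python
import numpy as np

def interval_features(signal):
    """Collapse a raw per-base signal over one genomic interval into a
    few summary statistics usable as machine-learning features."""
    signal = np.asarray(signal, dtype=float)
    return {
        "mean": signal.mean(),               # average signal strength
        "max": signal.max(),                 # peak height
        "std": signal.std(),                 # how uneven the signal is
        "frac_covered": (signal > 0).mean(), # fraction of bases with signal
    }

# Toy open-chromatin-like signal over a 20-base interval.
print(interval_features([0, 0, 1, 3, 5, 8, 9, 7, 4, 2,
                         1, 0, 0, 0, 1, 2, 2, 1, 0, 0]))
```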

In this task we limited ourselves to classical machine learning. We did not have enough time or computing resources left for full-fledged experiments with deep learning, but preliminary experiments showed that this approach is sound and gives encouraging results, and we expect serious breakthroughs here in the future.

In general, in the coming years and decades the most noticeable and tangible changes affecting the quality of our lives will occur in biology and medicine. A huge amount of data has already been accumulated there, and in the coming years orders of magnitude more will accumulate, both through the wider spread of sequencing and through the broader digitalization of our entire lives (the Quantified Self movement is the most obvious example here; electronic medical records also belong in this category). The potential benefit from all these data is huge, and using machine learning to extract it will be a necessity.

XX2 CENTURY: While such important contests are being held, society remains in the grip of fears about things like GMOs. What might this be connected with, at least in our country?

Vsevolod: In my opinion, the problem of GMOs is largely caused by the media.

The general public does not understand well that any industrial variety today is the product of high technology, and that "traditional breeding" is far from the peasant farm of the century before last (for example, there is "genome-oriented selection", in which all possible variants of the genome of a particular crop are catalogued and methods are developed for obtaining specified combinations of those variants). What is now called a GMO, that is, the introduction of genes from other species, differs primarily in that such manipulations are easier to detect in the final product. New varieties, obtained with or without "genetic modification", are tied to "big technologies" of cultivation, and the real competition is between these big technologies, whose advantages and disadvantages the public, without access to the data, has no way to assess. Most likely someone (and not in our country) is conducting a comprehensive socio-economic analysis, while the public is fed invented horror stories about the danger of these inserted genes, along with easy-to-digest debunkings of those stories. The real debate is not even visible to us. I suspect that the very intensity of the discussion suggests that GMO-based technology and so-called "traditional" technology are roughly equal in efficiency; otherwise some answer would already have been found. But again, without a dedicated study in an area where much of the data is probably a trade secret, it is hard to say anything definite.

XX2 CENTURY: We mentioned the fear of GMOs; can you name other misconceptions that are dangerous for the development of science?

Ivan: The weak connection between fundamental science and society works in its favor: it is applied science, innovation and attempts to introduce new technologies that suffer more from misconceptions.

At the same time, amid the talk of "innovation" and "applicability", the primary goal of fundamental science gets obscured: to expand the field of objective knowledge about the structure of the world. It is good when the results of scientific work travel beyond specialized journals and find application in life, but expecting immediate practical usefulness from scientific research is the most dangerous misconception.

Portal "Eternal youth" http://vechnayamolodost.ru 02.11.2016
