29 October 2021

Big biological data

A story about biology, in which scientists are more likely to encounter a computer mouse than a live one

Chiara Makievskaya, N+1

In science fiction there is often a scene like this: a scientist searching for a solution to some grave problem of humanity presses a key on a computer, molecules spin on the monitor, a progress bar fills rapidly. The moment it reaches 100 percent (which usually takes a couple of seconds), the scientist excitedly tells colleagues that a discovery has been made. Let's figure out whether the life of a bioinformatician is really that easy.

The profession of a biologist is usually associated with pipetting in a laboratory, collecting herbaria or observing animals, but not with computers. That is a misconception: modern biology is simply unthinkable without a computer. Biologists themselves even divide experiments into "wet" ones, performed at the bench with chemical solutions, and "dry" ones, meaning computational experiments run on computers. Such computer experiments are often called in silico ("in silicon", a nod to the silicon in computer chips), by analogy with the Latin expressions in vitro ("in glass") and in vivo ("in the living").

The interdisciplinary field that combines biology, chemistry, computer science and mathematics is commonly called bioinformatics. At the same time, bioinformatics is better thought of not as a pure science but as a set of techniques and methods for working with biological data. These approaches make it possible to solve biological problems that require the analysis of large amounts of data.

There are three major areas of work in bioinformatics:

  • structural bioinformatics: approaches for studying the spatial structure of biological molecules;
  • functional and comparative genomics: approaches for determining the role of individual regions of the genome and for comparing the content and organization of the genomes of different organisms;
  • systems biology: the study and modeling of complex interactions in living organisms.

In each of these areas, scientists face computationally hard biological problems, many of which have no unambiguous solution. Let's take a closer look at some of these tasks.

How to wind up a ball?

Task: Knowing the amino acid sequence, get the protein structure.

Why is this necessary? Misfolded proteins are easily destroyed, lose their structure and function, and can lead to many diseases. Alzheimer's disease is an example of a neurodegenerative condition associated with protein misfolding: it is characterized by plaques in brain tissue formed by improperly stacked β-sheets of fibrillar β-amyloid. Huntington's disease and Parkinson's disease are other examples of neurodegenerative diseases linked to protein misfolding. Obtaining more protein structures is a route to developing new drugs, finding new therapeutic targets, and understanding how a number of pathologies develop.

Proteins are biopolymers whose monomers are amino acid residues. You may remember from school biology that most proteins have several levels of structural organization: primary, secondary, tertiary and quaternary structure (the latter usually refers to whole protein complexes). A simple chain of amino acid residues is the protein's primary structure. Under natural conditions, however, a protein rarely stays in this state.

Proteins fold to form secondary and tertiary structure; biologists call this process folding. A protein can properly perform its biological function only when it has its native ("natural") structure, that is, the spatial structure that results from folding.

Thanks to culinary experiments with frying eggs, most people are more familiar with the reverse process: under destabilizing external factors (high temperature, for example), proteins change their conformation and "unravel". This process is called denaturation. The amino acid sequence of the protein does not change; only its structure does.

From the point of view of science, and of computational biology in particular, the constructive process of folding is far more interesting. During folding, the orientation of all parts of the molecule relative to one another is adjusted with high precision: not only the position of each amino acid residue, but also the positions of the chemical bonds within them.

There is an enormous number of such combinations, yet of the countless ways the same protein molecule could fold, only a few are correct (a family of very similar conformations). Methods of structural bioinformatics help predict this correct arrangement of the protein with a certain accuracy.

biological-math1.gif

A test simulation of the Trp-cage mini-protein along precomputed trajectories.

An interesting paradox is associated with protein folding. It was formulated in 1969 by the American molecular biologist Cyrus Levinthal and is named the Levinthal paradox in his honor: the time it takes a polypeptide to reach its folded state is many orders of magnitude shorter than it would be if the polypeptide simply went through all possible configurations.

Levinthal gave the following estimate: each amino acid residue in a polypeptide chain can adopt about 10 different conformational positions. Imagine a long protein of 100 amino acid residues. Taking into account all possible combinations of the positions of its residues, we get about 10^100 conformations (a number of that size, one followed by a hundred zeros, is called a googol). Going through all possible conformations in search of the right one would take an absurdly long time.
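To feel the scale of the problem, here is the back-of-the-envelope arithmetic behind Levinthal's estimate. The sampling rate of 10^13 conformations per second is our own deliberately generous assumption, not a figure from Levinthal:

```python
# Rough arithmetic behind Levinthal's paradox (order-of-magnitude only).
conformations = 10 ** 100       # ~10 options per residue, 100 residues
rate = 10 ** 13                 # assume 10^13 conformations tried per second (generous)
seconds_needed = conformations / rate

age_of_universe = 4.35e17       # seconds, roughly 13.8 billion years
print(f"exhaustive search: ~{seconds_needed:.0e} s")                     # ~1e+87 s
print(f"that is ~{seconds_needed / age_of_universe:.0e} universe ages")  # ~2e+69
```

Real proteins, meanwhile, fold in anything from microseconds to seconds, and that contradiction is exactly what the paradox points at.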

Levinthal suggested that the native structure of a protein is determined kinetically, that is, it corresponds not to the global minimum of the chain's free energy but to a rapidly reachable one. The question he raised, however, cannot be resolved experimentally.

Later, an analytical theory of the folding of single-domain globular proteins appeared: the Finkelstein-Badretdinov theory. A protein need not collapse "all at once"; instead, a compact globule grows as new links of the protein chain attach to it one after another, with the native interactions gradually being restored. The drop in entropy during such sequential folding is almost immediately compensated by the energy of the interactions that form. As a result, the term proportional to 10^N (where N is the number of amino acid residues) drops out of Levinthal's estimate, and the folding time depends on N far more weakly.

As we have already said, structural bioinformatics methods make it possible to predict the structure of a protein and to follow the molecular dynamics of folding. Computationally, however, this is quite a demanding process, often requiring supercomputers: powerful computing clusters that parallelize a computational problem to achieve maximum performance.
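To give a sense of what such a "dry" experiment looks like in practice, here is a minimal molecular dynamics sketch using the open-source OpenMM library (version 7.6 or later). It assumes a file protein.pdb containing a small, complete protein with hydrogens; the file name, force field and run length are illustrative assumptions, not a recipe from any of the projects mentioned below:

```python
# Minimal molecular dynamics sketch in OpenMM; all settings are illustrative.
import sys
from openmm import app, unit, LangevinMiddleIntegrator

pdb = app.PDBFile("protein.pdb")                   # assumed input structure
forcefield = app.ForceField("amber14-all.xml")     # protein force field, run in vacuum
system = forcefield.createSystem(pdb.topology,
                                 nonbondedMethod=app.NoCutoff,
                                 constraints=app.HBonds)
integrator = LangevinMiddleIntegrator(300 * unit.kelvin,         # temperature
                                      1 / unit.picosecond,       # friction
                                      0.002 * unit.picoseconds)  # 2 fs time step
sim = app.Simulation(pdb.topology, system, integrator)
sim.context.setPositions(pdb.positions)
sim.minimizeEnergy()                               # relax steric clashes first
sim.reporters.append(app.StateDataReporter(sys.stdout, 100, step=True,
                                           potentialEnergy=True, temperature=True))
sim.step(1000)  # 2 ps of dynamics; folding studies need many orders of magnitude more
```

Even this toy run makes the cost obvious: two picoseconds of a small protein take a perceptible amount of time on a laptop, while watching a protein fold means simulating microseconds to milliseconds.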

biological-math2.jpg

The supercomputer of the Oak Ridge National Laboratory. Carlos Jones / ORNL

A similar effect can be achieved by running the calculations not on a supercomputer but on many ordinary computers. That is what scientists at Stanford University, the creators of the Folding@home project, came up with. Folding@home is a distributed computing project for simulating the folding of protein molecules. Its aim is to understand the causes of diseases brought on by defective proteins, such as Alzheimer's and Parkinson's disease, by modeling how protein molecules fold and unfold.

To perform its calculations, Folding@home uses not a supercomputer but the computing power of hundreds of thousands of personal computers around the world. To take part, you download a small program that runs in the background and computes only when other applications are not using the processor fully. Folding@home periodically connects to a server to receive the next batch of data and sends the results back once the calculations are finished. Participants can also see statistics on their contribution.

Folding@home is not the only such project. Rosetta@home is a distributed computing project aimed at predicting protein structure, and it is one of the most accurate systems for predicting the tertiary structure of a protein molecule. Since Rosetta@home predicts only the final folded state of a protein, without modeling the folding process itself, the two projects do not compete with each other but complement each other: they solve slightly different problems.

Rosetta@home even gave rise to a puzzle game about protein folding. Some Rosetta@home users noticed that they could see solutions while the calculations were running, but could not interact with the program to point them out. So the University of Washington developed the puzzle game Foldit (fold.it). The goal of the game is to find the three-dimensional structure of a given protein with the lowest free energy, that is, the structure closest to the native one.

The player folds the protein with their own "hands". For example, players can interactively manipulate the molecule by changing the shape of the backbone and the positions of the side groups. They can also rotate α-helices around their axis, change how chains pair up in β-structures, and lock certain parts of the protein against changes. The player learns how well the folding went from the score, which is awarded, in particular, for forming new hydrogen bonds and for burying hydrophobic residues inside the molecule.
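Purely as an illustration of the "bury the hydrophobic residues" part of such scoring, here is a toy function; the residue set, the 8 Å cutoff and the use of Cα coordinates are simplifying assumptions of ours, and the real scoring functions behind Foldit and Rosetta are far more elaborate:

```python
# Toy "burial" score: how many hydrophobic residues sit near the molecule's centre.
import numpy as np

HYDROPHOBIC = set("AVLIMFWC")   # one-letter codes treated as hydrophobic here

def burial_score(sequence, ca_coords, cutoff=8.0):
    """sequence: one-letter residue codes; ca_coords: (N, 3) C-alpha coordinates in angstroms."""
    coords = np.asarray(ca_coords, dtype=float)
    center = coords.mean(axis=0)                     # geometric centre of the chain
    dist = np.linalg.norm(coords - center, axis=1)   # distance of each residue from it
    return sum(aa in HYDROPHOBIC and d < cutoff for aa, d in zip(sequence, dist))
```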

In 2020, the developers at DeepMind made a breakthrough in predicting the three-dimensional structure of proteins. They presented the second version of their algorithm for predicting a protein's three-dimensional structure from its amino acid sequence, AlphaFold 2. AlphaFold 2 does not yet solve the folding problem as such: at the very least, it does not model the intermediate folding process, only predicts the final structure. Still, leaving the technical details aside, the appearance of such an algorithm, even if not revolutionary, is an important step for structural bioinformatics.

DeepMind did not stop there. On July 15, 2021, the DeepMind team published full information about AlphaFold 2 together with the program's source code. The company has since announced that it plans to use AlphaFold 2 to predict the structure of almost every protein in the human body, as well as the structures of hundreds of thousands of proteins found in the 20 most widely studied organisms, among them fruit flies, mice and yeast.

Over the next few months, DeepMind promises to predict the structures of more than 100 million proteins, more or less every protein known to science. If the reliability of AlphaFold 2's predictions proves high enough, scientists will be able to make a significant leap in drug development and in understanding how certain pathologies arise.
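Predictions from the resulting AlphaFold Protein Structure Database can be downloaded over plain HTTP. The sketch below fetches one predicted structure; the URL pattern, model version and example UniProt identifier are assumptions based on the database layout at the time of writing and may change:

```python
# Fetch one predicted structure from the AlphaFold database (URL pattern assumed).
import requests

uniprot_id = "P69905"   # example: the alpha subunit of human hemoglobin
url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"

resp = requests.get(url, timeout=30)
resp.raise_for_status()
with open(f"AF-{uniprot_id}.pdb", "w") as fh:
    fh.write(resp.text)
print("saved predicted structure for", uniprot_id)
```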

But how do you wind up the ball correctly?

Task: learn to model cellular processes as a whole, taking into account the interactions of all intracellular components.

Why is this necessary? Almost all of today's molecular dynamics simulations use simplified conditions that do not account for all the features of intracellular processes. It is important to understand how the biological macromolecules under study behave inside cells, because results obtained without taking the environment and some of the interactions into account may turn out to be irrelevant.

In fact, molecular dynamics modeling is widely used to study not only folding but also the conformational dynamics of proteins and nucleic acids, and it has already led to a deeper understanding of the mechanisms of a number of biochemical processes. Almost all of these studies and experiments, however, use simplified conditions that ignore the physicochemical complexity of the intracellular environment. Many questions remain about how biological macromolecules behave inside cells and how relevant the results obtained "in a vacuum" really are.

So scientists are coming to the conclusion that structures and conformational dynamics need to be studied at the atomic level while taking into account interactions at the cellular level, under real biological conditions. Ideally, that means being able to model an entire cell with all the biochemical processes inside it, and then to study the properties of individual proteins in the context of their surroundings.

The most successful whole-cell models so far are empirical mathematical models parameterized on experimental data and focused on a kinetic description of cellular processes. Because of their empirical nature, these models do not consider the interactions of individual molecules, and the lack of atomic resolution makes it impossible to examine in detail the molecular mechanisms and processes taking place inside the cell. For these reasons, it is often impossible to predict how changes at the molecular level affect the work of the cell as a whole.
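A kinetic model of this kind boils down to a system of ordinary differential equations with empirically fitted rate constants. Below is a deliberately tiny sketch, two equations for the expression of a single gene with made-up rates; real whole-cell models couple thousands of such equations:

```python
# Toy kinetic model: mRNA and protein of one gene (all rate constants are made up).
import numpy as np
from scipy.integrate import solve_ivp

k_tx, d_m = 2.0, 0.2     # transcription rate, mRNA degradation rate
k_tl, d_p = 5.0, 0.05    # translation rate, protein degradation rate

def rhs(t, y):
    m, p = y
    return [k_tx - d_m * m,        # d[mRNA]/dt
            k_tl * m - d_p * p]    # d[protein]/dt

sol = solve_ivp(rhs, (0.0, 200.0), [0.0, 0.0], t_eval=np.linspace(0, 200, 401))
print(sol.y[:, -1])   # approaches the steady state: mRNA ~ 10, protein ~ 1000
```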

Alternatively, one can use physical models. But to build a model of an entire cell at the atomic level, one must also be able to model that cell's proteins, which in turn requires their structures at sufficiently good atomic resolution. Obtaining such high-quality structures is a non-trivial task, and it is especially difficult for membrane proteins, which are very hard to crystallize. Protein crystallization is one of the key stages of X-ray diffraction analysis, the most popular method of obtaining protein structures.

Such models already make it possible to link changes at the molecular level to their effect on cellular function. The main limitation of physical modeling, however, is the lack of computing power to build models that preserve molecular detail at high resolution.

In the future, as computer technology develops and computing power grows, it may become possible to simulate most of the processes occurring inside cells. Modeling changes at the molecular level and linking them to the biological functions of cells would make drug development more effective: side effects could be caught at the computer modeling stage rather than during preclinical studies.

How to read an unwritten book?

Task: decode the genomes of organisms.

Why is this necessary? From decoded nucleotide sequences, scientists can learn what genetic information a given segment of the genome contains. For example, it is possible to determine which parts of DNA contain protein-coding genes and which carry regulatory functions. In addition, by working with the genome sequence, it is possible to identify genetic changes that can cause a number of diseases.

Today it is hard to find anyone who has not heard of nucleic acid sequencing, at least in the context of having their own DNA "decoded". Given the desire and the means, anyone can order a "decoding" of their own genome, and DNA and RNA sequencing is a fairly routine process for research laboratories.

There are several ways to sequence nucleic acids. The historically first and simplest of them, Sanger sequencing, can read sequences of up to a thousand nucleotide base pairs. It is most often used to read small fragments of the genome and to validate the results of more modern next-generation sequencing (NGS), where the length of a single read fragment varies from 25 to 500 base pairs.

biological-math3.png

The results of Sanger sequencing in the form of a chromatogram in which each peak corresponds to a nucleotide base. BRCF.

NGS methods are usually used for more accurate and deeper reading of genetic material, which is needed, for example, for resequencing, de novo assembly of new genomes, and transcriptomic and epigenomic studies. NGS is also far more productive: it reads millions of short nucleic acid fragments simultaneously.

Next-generation sequencers cannot determine the nucleotide sequence of an entire DNA molecule in one pass, because errors inevitably accumulate as the length of the fragment being read grows. So, before sequencing proper, the DNA is randomly broken into fragments about 500 nucleotides long on average, and these fragments are read from both ends. The sequencer's output is a set of "reads".
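In practice, reads arrive as FASTQ files that pair every sequence with per-base quality scores. A minimal sketch of walking over such a file with Biopython (the file name reads.fastq is an assumption):

```python
# Count reads and bases in a FASTQ file produced by a sequencer.
from Bio import SeqIO

n_reads, n_bases = 0, 0
for record in SeqIO.parse("reads.fastq", "fastq"):
    n_reads += 1
    n_bases += len(record.seq)
    quals = record.letter_annotations["phred_quality"]   # per-base quality scores
    # ...filtering, trimming or mapping of the read would happen here...
print(n_reads, "reads,", n_bases, "bases")
```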

Let's dwell a little longer on the two main NGS tasks mentioned above: resequencing and de novo sequencing. In the first case, the genome of the object under study is initially assumed to coincide with a reference, that is, some generalized DNA sequence that is already known. Resequencing detects the individual differences between a particular sample and the reference. Moreover, if each nucleotide base of the DNA sequence is checked by multiple reads, the statistical reliability of the genetic features found increases. A genome is considered to be sequenced with high "coverage" (deep sequencing) if each of its letters has been read 30 times or more on average.
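The coverage figure translates directly into how many reads a sequencing run has to produce. A quick estimate for a human-sized genome (the genome size and read length below are typical round numbers, not data from the article):

```python
genome_size = 3_200_000_000    # ~human genome, base pairs
read_length = 150              # a typical short-read length
coverage = 30                  # the "deep sequencing" threshold mentioned above

reads_needed = coverage * genome_size / read_length
print(f"~{reads_needed:.1e} reads")   # ~6.4e+08, i.e. roughly 640 million reads
```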

Computationally, resequencing is considered a relatively easy task, unlike de novo genome assembly. In the latter case, the genome has to be reconstructed from a set of reads without any "gold standard" to assemble against. In other words, this is the decoding of completely unknown DNA sequences, for example, the genome of some new, previously unknown species.

Methodologically, the approach relies on the fact that with a sufficiently large number of reads, some of them will inevitably overlap. By combining them, the decoded DNA sequence can be gradually "built up". A set of such overlapping DNA fragments is called a contig.
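The idea of "building up" a contig from overlapping reads can be shown with a naive greedy merger. Real assemblers use far more sophisticated data structures (overlap graphs or de Bruijn graphs) and have to tolerate sequencing errors; this toy version assumes error-free reads:

```python
def longest_overlap(a, b, min_len=5):
    """Length of the longest suffix of `a` equal to a prefix of `b` (at least min_len)."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return best

def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair of reads with the largest suffix-prefix overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = longest_overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break   # no overlaps left: the remaining pieces are separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)] + [merged]
    return reads

reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))   # -> ['ATTAGACCTGCCGGAATAC']
```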

Next, the contigs are combined into larger bundles called scaffolds. A scaffold is an intermediate, incomplete structure of the sequence being assembled: a series of contigs arranged in the correct order but not necessarily joined into one continuous sequence. The undeciphered "holes" in scaffolds are then filled in using other approaches. De novo genome assembly is an algorithmically and computationally complex process, almost impossible without supercomputer power, yet it is not only a promising approach but the only one available when there is no reference.

However, the lack of efficient algorithms implemented on GPU accelerators, like the lack of RAM, limits the size of the data that can be processed. So the development of this area of bioinformatics (like most others, in fact) is directly tied to the growth of the computing power available to us.

One head is good, but thousands are better!

Task: solve major problems of biology with the help of bioinformatics.

Why is this necessary? Most of the major problems of biology lie in interdisciplinary territory. Combining efforts, distributing tasks, and sharing knowledge and experience is what it takes to find the less-than-obvious answers to the questions that life, and biology in particular, poses.

Solving major biological problems with bioinformatics methods is not the preserve of experienced scientists alone, as the Folding@home project shows. There are also plenty of bio-hackathons: meetings of specialists and students from different areas of bioinformatics and biology who jointly solve some problem in a limited time.

One example of such a bio-hackathon is BioHackathon Europe, organized every November by the European intergovernmental organization ELIXIR. Over the course of a week, more than 150 specialists from all over the world work on various projects. The week begins with a symposium at which these projects are presented, and then for five days the participants write code to solve various bioinformatics problems.

Besides hackathons, there is also crowdsourced research. Crowdsourcing in the context of biomedicine and systems biology means that any specialist in biology, medicine, chemistry or another related field can take part in a study and contribute to it.

In 2011, with the support of Philip Morris International (PMI), the sbv IMPROVER project was launched on the INTERVALS platform (an open resource designed for collaboration and third-party data analysis). The project verifies the results of laboratory studies conducted by the company, and crowdsourced research is also run on top of sbv IMPROVER.

In 2019-2020, a group of scientists used sbv IMPROVER to run a crowdsourced study of the diagnostic potential of metagenomic data. Metagenomics is a branch of genomics that studies not the genome of an individual organism but the combined genetic material of microbial communities living in different natural conditions. The purpose of the study was to develop and validate classification models for metagenomic samples of biomaterial. The analysis was built on the results obtained by the winners of the scientific challenge.

The data came from patients with ulcerative colitis and from patients with Crohn's disease. Here, crowdsourcing made it possible to collect a significant amount of data and to reduce the influence of subjective factors on the results. The corresponding sbv IMPROVER challenges were open to the international scientific community from September 2019 to March 2020.

Lusine Khachatryan, a researcher at PMI Science and a computational biology specialist at the PMI Research Center, describes how the study was organized and why its results matter:

"The main types of clinical manifestations of inflammatory bowel diseases are Crohn's disease and ulcerative colitis. For their diagnosis, as a rule, highly invasive procedures are required.

However, a number of studies point to a link between the diversity of the gastrointestinal microbiota and inflammatory bowel diseases, so we set out to determine the diagnostic potential of the microbiota with respect to these diseases. It is also important to note that a stool sample can be taken non-invasively, which would greatly simplify the diagnosis of inflammatory bowel diseases.

The aim of the study was to develop and validate classification models for metagenomic samples of biomaterial. Data from two types of patients were studied: those with ulcerative colitis and those with Crohn's disease. As a control group, we used data from healthy people, in whom no inflammatory processes in the intestine had been recorded. Using the metagenomic data, we needed to find differences between patients with and without inflammatory bowel disease. It was also important to understand whether, within the group of patients, people with Crohn's disease could be distinguished from those with ulcerative colitis. The main idea was that the participants in the Metagenomics Diagnosis for IBD Challenge had to create a machine learning algorithm that would first learn from known data and could then be applied to new datasets to predict the classes of people in an unknown dataset.

The main task consisted of two subtasks. In the first subtask, participants started from raw sequencing data: they could apply their own pipeline to process the metagenomic sequencing data and derive features, and then use those features to build a machine learning algorithm.

In the second subtask, the features were already provided, so participants only had to build a machine learning algorithm. Participants could choose one of the proposed subtasks or try to solve both.

The data we used in this project were two publicly available datasets, from a Chinese and an American cohort of patients. We also offered a new test dataset of our own that had not been published anywhere before. For each subtask, the three most successful teams were identified and received awards.

The main conclusion of our project is that metagenomic data can be used to determine whether or not a person has inflammatory bowel disease. This may not seem like new information, since there are many studies on similar topics. However, all previous studies used the same cohorts both to train the model and to make predictions. There were no studies in which one cohort was used to train a model and a completely different cohort was used to predict a person's status (sick or healthy). So the results obtained in our project are new and differ from those of previous studies.

It turned out, however, that within the group of sick people it is still quite difficult to determine whether a person has ulcerative colitis or Crohn's disease, since classical approaches lead to a very high probability of erroneous prediction here."
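The cross-cohort setup Khachatryan describes, training on one cohort and testing on a completely different one, can be sketched in a few lines of scikit-learn. The data here are synthetic stand-ins for microbial abundance tables, with an artificial shift playing the role of a batch effect between cohorts; none of the numbers come from the actual challenge:

```python
# Cross-cohort classification sketch on synthetic data (not the challenge data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# One underlying "disease signal", split into two cohorts with a crude batch effect.
X, y = make_classification(n_samples=500, n_features=200, n_informative=20,
                           random_state=1)
X_train, y_train = X[:300], y[:300]            # "cohort A": used only for training
X_test, y_test = X[300:] + 0.3, y[300:]        # "cohort B": used only for evaluation

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"cross-cohort AUC: {auc:.2f}")
```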

This is not the only task proposed on the sbv IMPROVER platform. The "Identification of exposure response markers" challenge, from the field of systems toxicology, aimed to verify that markers predicting a person's smoking status can be obtained from gene expression data in human and rodent blood.

The competition included two subtasks. In the first, participants had to develop models that could predict gene signatures and biomarkers of smoking exposure from human blood gene expression data: using the available data, they had to distinguish smokers from non-smokers and classify the non-smokers as former smokers or never-smokers. In the second subtask, participants also worked with gene expression data, but from the blood of mice, and had to find species-independent gene signatures that could predict the effects of smoking exposure.

biological-math4.jpg

The scheme of the crowdsourcing study "Identification of exposure response markers", aimed at verifying that markers predicting a person's smoking status can be obtained from gene expression data in human and rodent blood. Chem. Res. Toxicol. / American Chemical Society.

The study was conducted in 2015-2016. And although this systems toxicology challenge asked participants to predict smoking status, the methods they proposed could in theory be applied to predict exposure to any toxic substance or external stimulus. Exposure to external toxic substances can cause molecular changes in human blood, and the ability to determine exposure status from readily available blood samples is of great importance for assessing the toxicological risk of chemicals, medicines and consumer goods.

***

We hope that after reading these three thousand words you have a sense of how much of modern biology and biomedicine revolves around work with huge amounts of data.

The specialists who process these data need to know how to extract information from them correctly, build efficient algorithms, and automate their work competently; this is where computer science and biology meet. And modern biology is by no means limited to working in a laboratory among flasks and pipettes: some biologists spend far more time at a computer than at the bench in a lab coat.

This article is not advertising and pursues socially significant goals of warning potential consumers of tobacco products about the harm caused by tobacco consumption, and educating the population and informing them about the dangers of tobacco consumption and the harmful effects of tobacco smoke on others.

Portal "Eternal youth" http://vechnayamolodost.ru

