18 September 2015

A primer on DNA reading technologies

Sequence it

Alexander Ershov, N+1 

Every two years, the number of transistors that fit on an integrated circuit doubles. This is the canonical version of Moore's law, an empirical observation by one of Intel's founders that is familiar to everyone who follows technology news. As the driver of the IT industry's incredible success, Moore's law has long been perhaps the most visible marker of progress. Yet there are areas of technology next to which even this "icon of the singularity" looks rather pale. One of them is DNA reading.

The first human genome, obtained as part of a ten-year international project, cost about three billion dollars. That figure covers the entire program, including the numerous scientific studies that had to be carried out along the way. By the end of the project (the draft genome was published in 2001), the cost of reading another genome of comparable size was estimated at roughly $100 million. It is easy to calculate that now, 14 years later, if Moore's law had been in effect in biotech, sequencing would cost about $750,000. In fact, as of 2015, reading a complete human genome costs about five thousand dollars, and the price of genotyping, the analysis of only selected "key" sections of the genome, has already dropped to hundreds of dollars. Even counting sample preparation and logistics, both versions of the process take not years but a few weeks.
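For the curious, the arithmetic behind that hypothetical figure is a one-liner: a minimal sketch that simply applies Moore's-law halving every two years to the 2001 price tag.

def moores_law_cost():
    # Hypothetical sequencing cost if Moore's law (halving every 2 years) applied
    initial_cost = 100_000_000  # dollars, the ~2001 estimate
    years = 14                  # 2001 -> 2015, i.e. seven halvings
    cost = initial_cost / 2 ** (years / 2)
    print(f"${cost:,.0f}")      # -> $781,250, roughly the $750,000 cited above

moores_law_cost()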


Moore's law and the dynamics of genome sequencing costs.
Image: genome.gov

At least two questions arise here. First, what has happened in sequencing technology over this time to allow such a radical reduction in cost? And second, why do we, as individuals, hardly notice this explosive progress?

The second question is easier to answer: the "explosion" of the genomic revolution has indeed occurred, but its "blast wave" has not yet reached us, ordinary consumers. Genomic data became radically more accessible to scientists and pharmaceutical companies long ago, but it has begun entering the lives of its owners only in the last few years. The first question, though (how and why this happened, how sequencing technology changed, and how scientists read DNA today), requires a separate analysis.

First we need to define what sequencing is. Sequencing (from the word "sequence") is the determination of the order of elementary units, monomers, in a polymer. These polymers need not be DNA or RNA; they can also be, for example, a protein or even a polysaccharide. The term itself arose at the moment it became clear that in biology the properties and functions of polymers cannot be deduced simply from their composition, as chemists had been used to thinking: that composition turned out to be too similar across functionally very different molecules. And if it is not the composition, then the sequence of monomers must play the key role, an idea that seems trivial now but was completely new in the 1940s.

Priority in this discovery, a key concept for biology, belongs to the British scientist Frederick Sanger. The "father of sequencing" was born in 1918 into a doctor's family and lived a very long and exceptionally fruitful scientific life. The world's only two-time Nobel laureate in chemistry (1958 and 1980), he lived to see the genomic revolution in full, a revolution created largely by his own hands.

However, the object of the world's first sequencing, which Sanger carried out in the late 1940s, was neither DNA nor RNA. It was insulin, at the time the only peptide available in more or less pure form in sufficient quantities. Which amino acids make up insulin was already known, but scientists believed their proportions were approximate and not particularly important: it was assumed that when building proteins, life takes an "analog" approach, "pouring" amino acids into different proteins by eye.

It was Sanger who showed this was not the case. He found a reagent that selectively reacts with only one amino acid in the entire polypeptide chain: the one at the very beginning of the peptide, which for that reason carries a unique chemical group. And if the intact polypeptide has only one such amino acid, then after partial cleavage any amino acid can end up terminal, which means all of them can be identified. By breaking insulin into small fragments and determining their terminal amino acids, Sanger managed (after just eight years of painstaking work) to assemble the complete "puzzle" of monomers and thereby determine the exact structure of insulin. For this work he received the Nobel Prize in Chemistry only six years after the final publication.
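To get a feel for the "puzzle" step, here is a minimal sketch (of the assembly logic only, not of Sanger's chemistry) that greedily merges overlapping fragments back into one sequence. The fragments below are the first residues of the human insulin A chain, chosen purely for illustration.

def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        k_best, i_best, j_best = 0, 0, 1
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j and overlap(frags[i], frags[j]) > k_best:
                    k_best, i_best, j_best = overlap(frags[i], frags[j]), i, j
        merged = frags[i_best] + frags[j_best][k_best:]
        frags = [f for n, f in enumerate(frags) if n not in (i_best, j_best)]
        frags.append(merged)
    return frags[0]

# Overlapping fragments of the insulin A-chain start, GIVEQCCTSI
print(greedy_assemble(["VEQCCT", "GIVEQ", "CCTSI"]))  # -> GIVEQCCTSI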

After the success with insulin, however, no one, neither Sanger nor his competitors, even tried to approach DNA sequencing. The "most important molecule" looked too huge and intimidating for anyone to attempt reading its sequence. Besides, each cell contains only one or two copies of genomic DNA, so obtaining a sufficient number of identical molecules (which the method requires) is quite difficult.

The tactical target was the other nucleic acid, RNA. More precisely, one particular variety of it, transfer RNA, which the cell uses as an adapter that "delivers" amino acids during protein synthesis. It is only 70-80 nucleotides long, and the number of copies per cell reaches hundreds of thousands. For RNA sequencing, Sanger applied the same general strategy: label the terminal monomer, then partially break the molecule into many fragments. Here, however, his luck ran out. Robert Holley and co-authors from Cornell University published the sequence of one of the yeast tRNAs as early as 1965, and it was Holley who became the first person to determine the sequence of any nucleic acid. And although just three years later Sanger's group sequenced an RNA almost twice as long (one of the molecules that make up the ribosome), he truly avenged this defeat only ten years later.

The DNA sequencing method that now bears Sanger's name was published in 1977. For more than 30 years, until the mid-2000s, it remained the main way to determine the sequence of any nucleic acid: it was by this method (with minor modifications) that the human genome was read. To this day, Sanger sequencing remains the most accurate method, used when necessary to verify the results of next-generation sequencing.

The idea behind Sanger sequencing is so simple and elegant that it is worth dwelling on in more detail. Until then, as we have seen, all sequence-reading methods were based, roughly speaking, on destruction: obtain a pure substance, fragment it, and restore the "puzzle" from the resulting pieces. Sanger decided to act in the opposite way: to read DNA not by fragmentation but by synthesis, using a natural enzyme for the purpose, DNA polymerase, the same molecular machine that copies DNA before cell division.


The procedure goes as follows: a DNA fragment labeled with a radioactive isotope at one end is divided among four test tubes. To each of them are added the reagents needed to synthesize new DNA, including the single "letters" that the polymerase will link into a new strand, in exact accordance with the original template. In addition to the ordinary "letters", however, a small number of specially "spoiled" ones are added to the solution, ones after which the next "letter" cannot be attached (they simply lack the chemical group needed to form the next bond).

As a result, by the end of synthesis each test tube contains a set of DNA molecules of different lengths, each carrying a radioactive label at its beginning and a spoiled "letter" at its end. Since we put only one type of spoiled base into each of the four tubes, we know which letter terminates every DNA fragment in a given tube. Now it is enough to run the contents of all four tubes through a gel that separates DNA by length and press photographic film against it to record the accumulations of radioactivity. A "ladder" of rungs appears on the film, each rung corresponding to one letter in the sequence. Climbing the ladder, we read off the entire sequence of the original DNA.
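As a minimal sketch of that readout logic (the chemistry itself is only simulated, and the template sequence is invented): the snippet below records, for each of the four "tubes", the lengths at which chains terminate, then reads the gel from the shortest fragment up.

def sanger_read(template):
    """Simulate reading a Sanger gel for a given template strand."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    # The polymerase synthesizes the complement of the template.
    synthesized = "".join(complement[b] for b in template)

    # Four tubes: for each "spoiled" letter, the lengths of terminated chains.
    tubes = {base: [] for base in "ATGC"}
    for length, base in enumerate(synthesized, start=1):
        tubes[base].append(length)  # chain terminated right after this base

    # Read the gel bottom-up: the shortest fragment reveals the first letter.
    ladder = sorted((length, base) for base, lengths in tubes.items()
                    for length in lengths)
    return "".join(base for _, base in ladder)

print(sanger_read("TACGGATC"))  # -> ATGCCTAG, the new strand's sequence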

Sanger's method turned out to be extremely reliable and convenient for reading sequences, perhaps because it relied on natural enzymes "trained" over billions of years of evolution rather than on synthetic reagents. It was with this method (or, more precisely, its immediate predecessor) that Sanger read the first complete genome of an individual organism in history: the genome of the bacteriophage ϕX174, containing only 5,386 bases. (Incidentally, in 2003 this same phage became the first organism whose genome was synthesized entirely artificially.)

Within a few years, by the mid-1980s, as incremental improvements gradually increased the speed and throughput of sequencing, complete genomes of ever more complex viruses began to appear, and scientists first started talking about sequencing the genomes of higher organisms, including humans.

Approaching that task became possible with the advent of molecular cloning methods, which made it possible to cut individual fragments out of the genome, insert them into model organisms (E. coli or yeast), and then sequence them gradually, piece by piece, by the same Sanger method.

In 1990, the international Human Genome Project was launched, with teams from America, Europe and Japan each allocated separate sections of the pre-mapped genome to sequence. By this point, several biotech companies, above all Applied Biosystems, had learned to automate the steps of Sanger sequencing. At first, sequencers could do no more than read the "ladders" on the photographic film automatically, converting them into letter sequences; later, the separation of DNA fragments was moved from the gel (which had to be poured fresh each time) into a thin capillary. Radioactive labels were replaced with fluorescent ones, which made it possible to read the sequence on the fly as fragments passed through the capillary: four colors, four different letters.

One of the first to grasp the possibilities of sequencing automation was Craig Venter, the controversial future instigator of the "genome race", who decided to beat the entire international collaboration to reading the human genome (and, later, the creator of the first living organism with fully synthetic DNA). In 1992, two years after the launch of the Human Genome Project, Venter organized his own institute with the catchy abbreviation TIGR (The Institute for Genomic Research), where DNA sequencing was first put on an industrial footing. In 1995, Venter's group obtained the first genome of a full-fledged cellular organism, or rather two at once: the bacteria Haemophilus influenzae and Mycoplasma genitalium.

From a purely biological point of view, the fact that these were full-fledged cellular genomes was fundamental: viruses live only inside cells and rely on cellular machinery encoded not by their own genome but by the cell's. Only cellular genomes contain the complete information necessary for life, which makes them far more interesting than viral ones.

From a technical point of view, Venter's innovation was that he was the first to apply a radical approach to genome sequencing. As already mentioned, scientists had until then worked with the genome "piece by piece", which made it possible to read the DNA of the most complex organisms but also demanded enormous amounts of time for cloning. Venter pioneered the whole-genome "shotgun method", in which the entire genome is cut into many short overlapping fragments that are read and then reassembled, from the bottom up.

This approach greatly simplifies every stage requiring human participation and lends itself well to automation. But it has two fundamental drawbacks. First, assembling a genome from millions upon millions of short fragments requires huge computational resources, demands the creation of fundamentally new mathematical algorithms, and needs multiple coverage (for example, to assemble a genome 100 nucleotides long, you must read fragments with a total length of 1,000; this is called tenfold coverage).
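The coverage arithmetic from that parenthetical example is trivial but worth making explicit; the read set here is invented.

# Tenfold coverage: total bases read divided by genome length
genome_length = 100
reads = ["ACGT" * 5] * 50        # 50 invented reads of 20 bases = 1,000 bases
coverage = sum(len(r) for r in reads) / genome_length
print(f"{coverage:.0f}x")        # -> 10x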

Second, this method works on the principle of "until everything is done, nothing is done." It was this feature that made the genome race so gripping to watch: had the quality and quantity of Venter's "raw" sequences fallen even slightly short, the whole sequencing adventure would have been a waste of time. That, however, did not happen. Both genomes, the one obtained by the international project and the one assembled by Venter's private company, were published in issues of Nature and Science that appeared within a single week in February 2001.

On hearing about the Human Genome Project, many people ask: whose genome, exactly? Who was the person on whose genome so much money and scientific effort was spent? And although the answer is trivial (it was no one in particular: it is a reference genome, assembled from DNA obtained from several anonymous donors), the question hits the nail on the head. None of us carries the consensus genome; everyone's DNA contains a huge number of unique differences, and it is precisely these differences that make us who we are. So from the very first day after the publication of the reference human genome, it was clear that the main goal of all subsequent research in human genomics would be individual differences.

There is a certain irony in the fact that the first people whose personal genomes were read were the two main rivals of the "genome race": James Watson, co-discoverer of the DNA structure, and Craig Venter, already familiar to us. Compared to the reference genome, each of them carried about three million individual polymorphisms, single-letter substitutions that differ from person to person. By the time these two individual genomes were read in 2007, sequencing each cost about a million dollars. That is, of course, far below the $100 million of 2001, but still a great deal: at such prices it would be strange to expect the genomes of hundreds or thousands of people to be read, or such a service to be offered to ordinary customers. Fortunately, just as the reference genome was being completed, a technology matured that could "catch" individual differences between genomes far more simply and cheaply than conventional sequencing. This is DNA microchip (microarray) technology.

A DNA microchip is a small plate onto which short single-stranded DNA fragments are fixed using photolithography-like technology. When the chip is incubated with a sample solution, two DNA molecules, one on the chip and one in the sample, can form a strong pair. If all the sample molecules are pre-labeled with a fluorescent marker, then after incubation we will see glowing dots on the chip wherever strong pairs have formed. And if that happened, it means the sequences of the DNA fragments in the sample exactly matched (were complementary to) sequences on the chip, which we knew in advance.

Having established which individual variants exist in the population, we can build a microchip carrying hundreds of thousands of different polymorphisms. It can report, in just a few hours, which "one-letter" substitutions are present in your DNA. Formally such a procedure is not sequencing, but it lets us read DNA sequences whose variants we already know. Microchips cannot be used to read completely new sequences (although work in this direction is underway), but personal genomics does not require that: the DNA of different people is known to be about 99 percent identical. Modern microchips can read about a million known polymorphisms, roughly one third of the number present in a genome.
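A minimal sketch of the genotyping logic (not of the hybridization chemistry): each spot carries a probe for one known variant, and a spot "lights up" only if the sample contains the probe's exact complement. All probe names and sequences here are invented.

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[b] for b in reversed(seq))

def genotype(sample_fragments, probes):
    """Report which probes hybridize, i.e. whose complement is in the sample."""
    sample = set(sample_fragments)
    return {name: (revcomp(probe) in sample) for name, probe in probes.items()}

# Two probes for one hypothetical SNP site: reference allele A, variant G
probes = {"site42_ref": "TACAGTT", "site42_alt": "TACGGTT"}
sample = ["AACTGTA"]  # fluorescently labeled fragments of the person's DNA

print(genotype(sample, probes))
# -> {'site42_ref': True, 'site42_alt': False}: the reference allele is present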

DNA microchips began appearing in scientific laboratories in the 1990s, and in the mid-2000s the first companies emerged offering personal genome analysis based on them. The much-discussed 23andMe, founded by Anne Wojcicki, Sergey Brin's ex-wife, was just one of the first. Today Wojcicki's company has many competitors, both worldwide and in Russia.

Today, however, the so-called next-generation sequencing methods are treading on the heels of genotyping technologies, in both speed and cost. It was their appearance that brought the cost of the procedure down from millions of dollars to thousands. These methods are very diverse, and there is no room to describe them all.

Perhaps it is worth pausing only on so-called pyrosequencing, a method based on the hydrolysis of pyrophosphate. Whenever nucleotides are joined to one another in a DNA chain, this compound, a high-energy fragment of a nucleotide, is released into solution; it then breaks down without a trace and by its "death" ensures that the synthesis reaction runs in one direction only. In the mid-2000s, several research groups independently realized that the breakdown of pyrophosphate could be put to use in sequencing: it can be "fed" to a special enzyme that converts the energy of the pyrophosphate bond into a pulse of light. The presence or absence of a flash then tells us whether the nucleotide-attachment reaction has taken place. When a copy is being synthesized on a DNA template, a flash signals the presence of a complementary base in the nucleic acid.

In practice it looks like this: DNA is cut into millions of short fragments, attached to microscopic beads, and copied (so that each bead carries identical copies of a single fragment); the beads are then distributed into microscopic wells etched in a special substrate. A synchronized reaction then begins in the wells. One type of nucleotide at a time is flowed over the substrate: if a well flashes, that nucleotide fits the synthesis, meaning a complementary base is present on the template. The substrate is then washed free of the first nucleotide and the second is flowed in; this time other wells light up, those containing the corresponding complementary base. By washing the wells over and over with the four nucleotides, bioengineers read DNA sequences from the flashes in individual wells. The main feature of this and similar methods is the ability to run an enormous number of reactions in parallel. And although the accuracy of the reaction in each individual well is low, the sheer number of wells makes sequencing very fast and, therefore, cheap.
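A minimal sketch of that flow logic (homopolymer runs are handled in the simplest possible way, and all sequences are invented): the four nucleotides are flowed in a fixed cyclic order, and a well "flashes" whenever the flowed base is complementary to the next base of the template.

from itertools import cycle

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def pyrosequence(template, flow_order="ATGC", max_flows=40):
    """Read a template by flowing one nucleotide type at a time.

    A flash means the flowed nucleotide was incorporated; consecutive
    identical template bases are incorporated within a single flow.
    """
    needed = [COMPLEMENT[b] for b in template]  # bases to incorporate, in order
    pos, read = 0, []
    for flow in cycle(flow_order):
        if pos >= len(needed) or max_flows == 0:
            break
        max_flows -= 1
        flashes = 0
        while pos < len(needed) and needed[pos] == flow:
            flashes += 1  # light intensity ~ number of bases incorporated
            pos += 1
        read.extend(flow * flashes)
    return "".join(read)

print(pyrosequence("TACCGA"))  # -> ATGGCT, the complement of the template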

And yet, for now, whole-genome sequencing cannot match genotyping on cost. Yes, in research laboratories next-generation technologies are already displacing microchips from that method's traditional tasks (RNA and gene-expression analysis, for example). But in personal genomics things are different: genome features visible only to full sequencing and invisible to microchips are so rare that true sequencing looks like firing a cannon at sparrows.

In the last year or two, the market for whole-genome DNA reading has shown slight stagnation (reminiscent, incidentally, of the situation just before next-generation methods appeared). It can therefore be expected that in the coming years consumer genomics will continue to rely on DNA microchips. Given how accessible they have already become, even the arrival of revolutionary new sequencing methods is unlikely to change much in the consumer market. This means the moment has come when the conversation is no longer about the technologies by which genomic data is obtained, but about its interpretation. But that is an entirely different conversation.

Portal "Eternal youth" http://vechnayamolodost.ru