19 February 2013

Genome sequencing for dummies

Genomics: problem statement and sequencing methods

In a series of articles, Sergey Nikolenko, Candidate of Physical and Mathematical Sciences and senior researcher at the Laboratory of Computational Biology of St. Petersburg Academic University, discusses bioinformatics problems related to the assembly and analysis of genomes, focusing on the mathematical, combinatorial formulation of the problem. This introductory text explains what the input data for genome assembly looks like and how it is obtained.

What does a DNA molecule look like?

Let's start with what a DNA molecule looks like. Polymer molecules are characterized by a primary structure, which is simply the composition of the molecule (in this case, the sequence of letters A, C, G and T that make up the genome); a secondary structure, i.e. which chemical bonds form between these components and which basic spatial structures result (in this case, the double helix); and a tertiary structure, i.e. the way the secondary structure is "packed" in space. The secondary structure of DNA is a double helix built from four different nucleotides.

Figure from Wikipedia

Nucleotides are designated by the nitrogenous bases they contain: adenine (A), cytosine (C), guanine (G) and thymine (T) (there is also uracil, which replaces thymine in RNA); from here on we will always use these letters. In a double helix, the nucleotides are connected to each other by hydrogen bonds, and the connection follows the principle of complementarity: if one strand of DNA has A, the complementary strand has T, and if one strand has C, the other has G. This is what makes it relatively easy to replicate (copy) DNA, for example during cell division: it is enough to break the hydrogen bonds, splitting the double helix into two strands, after which the paired strand for each "descendant" assembles correctly by itself. It is important to understand that DNA is two copies of the same "text" written in four "letters"; the "letters" in the two copies are not identical, but correspond to each other uniquely. For example:

ATGCAGAACAGACGATCAGCGACACTTTA
TACGTCTTGTCTGCTAGTCGCTGTGAAAT
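The correspondence between the two strands is mechanical enough to write down as code. Below is a minimal Python sketch; the names PAIRS and complement are our own, chosen for illustration, and do not come from any library.

```python
# Pairing rules for the four nucleotides: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the paired strand, letter by letter (hypothetical helper)."""
    return "".join(PAIRS[base] for base in strand)


top = "ATGCAGAACAGACGATCAGCGACACTTTA"
print(complement(top))  # prints TACGTCTTGTCTGCTAGTCGCTGTGAAAT
```

Applying the function twice returns the original strand, which is exactly the sense in which the two strands are copies of the same text.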

Of course, it would be convenient if we could carefully "pull out" one strand of DNA and calmly "read" it nucleotide by nucleotide from beginning to end. With such an ideal method of sequencing (reading DNA), no tricky algorithms would be needed. Unfortunately, this is not yet possible, and we have to make do with the sequencing results that we can actually get.

What is sequencing?

Sequencing is the common name for methods that establish the sequence of nucleotides in a DNA molecule. At present no sequencing method can read a whole DNA molecule at once; all of them work as follows: first, a large number of small DNA sections are prepared (the DNA molecule is cloned many times and "cut" in random places), and then each section is read separately.

Cloning is done either simply by growing cells in a Petri dish or (when that would be too slow or for some reason would not work) by the so-called polymerase chain reaction. In a brief and imprecise summary, it works like this: first, the DNA is denatured, i.e. the hydrogen bonds are broken, yielding separate strands. Then so-called primers are attached to the DNA; these are short sections of DNA to which DNA polymerase can bind. DNA polymerase is the enzyme that actually copies (replicates) the DNA strand.

Figure from Wikipedia

At the next stage, the polymerase copies the DNA, after which the process can be repeated: after a new denaturation there are twice as many individual strands, after the next cycle four times as many, and so on.
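In the idealized case where every molecule is copied in every cycle, the number of copies doubles each time. A tiny sketch of this arithmetic follows; the function name copies_after is ours, and a real reaction is assumed to be less than perfectly efficient.

```python
def copies_after(cycles, start=1):
    """Idealized PCR: every molecule is duplicated in each cycle."""
    return start * 2 ** cycles


# Growth is exponential: thirty cycles already give about a billion copies.
for n in (1, 2, 3, 10, 30):
    print(n, copies_after(n))
```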

All these effects are achieved mainly by changing the temperature of a mixture of DNA, primers and polymerase. For our purposes, what matters is that the process is fairly accurate, errors in it are rare, and the output is a large number of copies of sections of the same DNA. Sequencing methods differ from each other not in how they clone, but in how they then read the resulting "soup" of numerous copies of the same DNA.

Sanger sequencing

The first sequencing method that scientists were able to use to process entire genomes (including the human genome) was Sanger sequencing. The idea is as follows: a section of DNA is cloned, after which the resulting mixture is divided into four parts. Each part is placed in a reaction mixture that contains:

DNA polymerase, which, as we have already seen, carries out replication,
primers needed to start the replication process,
a mixture of all four nucleotides that will serve as "bricks" for building new copies of DNA,
and, most importantly, special variants of one of the nucleotides (exactly one type of nucleotide for each part), which stop further copying of the DNA molecule.

The process is almost identical to the DNA cloning we met in the previous section. The only difference is that "false" nucleotides are now mixed in with one of the four nucleotide types; they form exactly the same hydrogen bond, but the strand cannot be extended beyond them.

As a result, each part ends up with a large number of copies of prefixes of the DNA section under study. The prefixes have different lengths but always end with the same letter; the length depends on when a "false" nucleotide happened to be picked up during copying. For example, in a test tube where all sequences end in T, our example above would yield a mixture of the following prefixes:

ATGCAGAACAGACGATCAGCGACACTTTA (sample)
AT
ATGCAGAACAGACGAT
ATGCAGAACAGACGATCAGCGACACT
ATGCAGAACAGACGATCAGCGACACTT
ATGCAGAACAGACGATCAGCGACACTTT
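The contents of one test tube can be modeled in a few lines. This is a sketch under the article's simplified picture, in which a tube holds every prefix of the sample that ends with the tube's letter; the function name tube_prefixes is hypothetical.

```python
def tube_prefixes(template, terminator):
    """All prefixes of `template` that end with the letter `terminator`.

    Models one Sanger test tube: copying can stop only at positions
    where a "false" nucleotide of the given type was picked up.
    """
    return [template[:i + 1]
            for i, base in enumerate(template)
            if base == terminator]


sample = "ATGCAGAACAGACGATCAGCGACACTTTA"
for prefix in tube_prefixes(sample, "T"):
    print(prefix)
```

For the T tube this prints the five prefixes listed above, of lengths 2, 16, 26, 27 and 28.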

How can we "read" the genomic sequence from such a mixture? Note that, taken together, the four test tubes contain all possible prefixes of the section we are interested in. This means that if we can measure the length of each prefix (in fact, we do not even need to measure it, only to sort the prefixes by length), we can recover the sequence as well. Suppose we saw that the test tubes contain prefixes of the following lengths (numbered in order, from the lightest, 1, to the heaviest, 10):

A: 1, 5, 7, 8, 10
C: 4, 9
G: 3, 6
T: 2

Clearly, this sequence starts with A (because the lightest prefix, consisting of a single letter, ends with A); then comes T, then G, and so on. As a result, we can read the original section: ATGCAGAACA.
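The read-out procedure just described amounts to sorting the prefix lengths and looking up which tube each one came from. A minimal sketch, assuming the tube contents are given as a dictionary from letter to prefix lengths (the name read_from_tubes and this input format are our own):

```python
def read_from_tubes(tubes):
    """Rebuild a sequence from the prefix lengths seen in each tube.

    `tubes` maps a letter to the list of prefix lengths (ranks from
    lightest to heaviest) found in the tube for that letter.
    """
    letter_at = {}
    for letter, lengths in tubes.items():
        for length in lengths:
            letter_at[length] = letter
    # Read the letters in order of increasing prefix length.
    return "".join(letter_at[length] for length in sorted(letter_at))


tubes = {"A": [1, 5, 7, 8, 10], "C": [4, 9], "G": [3, 6], "T": [2]}
print(read_from_tubes(tubes))  # prints ATGCAGAACA
```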

To find the lengths, one can, for example, measure the mass of all prefixes in all test tubes. And to measure the mass, one can (different sequencers used different procedures, but the essence is the same) ionize the molecules and send them racing towards a charged electrode through a special gel that creates friction and slows the molecules down; this method is called electrophoresis. With the same charge, heavier molecules move more slowly, and the result looks something like this picture.

Figure from Wikipedia

It can be seen that (ideally) you can simply read the sequence of nucleotides from the lightest prefix (i.e., the prefix of one letter) to the heaviest.

Results and errors of Sanger sequencing

The output of a Sanger sequencer is a set of short sections of DNA, the so-called reads. Two things are fundamental for bioinformatics: first, how long the reads are, and second, what errors they may contain and how often (nothing in the world is perfect, of course).

Sanger reads do very well by these criteria: reads are about a thousand nucleotides long, and the quality begins to drop noticeably only after 700-800 nucleotides. The Sanger sequencing process described in the previous section explains both the drop in quality (it is harder to tell a molecule of mass 700 from a molecule of mass 701 than a mass of 5 from a mass of 6) and another unpleasant effect: if the genome contains a long run of the same letter (...AAAAAAAA...), it is difficult to determine exactly how long it is, because all the intermediate masses fall into the same tube, some of them may be missing, some merge with each other, and so on. Still, Sanger sequencing gives excellent results with sufficiently long reads, which are then relatively easy to assemble. We will talk about how this is done in the following texts.

It was with Sanger sequencing that the human genome was first decoded. Sanger sequencing is still used today, but less and less, as other methods replace it. What displaced it, and why?

Second-generation sequencers: Illumina

Modern sequencers are the so-called second-generation sequencers (SGS, second generation sequencing). In them, sections of DNA are still cloned many times, but the reading process differs from Sanger's. There are many methods, and they differ quite significantly, so we will consider only one of the most popular today: sequencing by the Solexa method (now Illumina; there is no need to look for a deep meaning in the change of name, one company simply bought another).

The process of Illumina sequencing is illustrated in the figure; in addition, you can watch one of several existing videos with animation of this process – in this case, indeed, it is better to see once than to read the text a hundred times. However, brief comments will also come in handy; this is how the Illumina sequencing process works.

1. DNA copies are cut in random places into a large number of small sections.
2. Special adapters, short nucleotide sequences known in advance, are attached to both ends of each section.
3. The resulting mixture is placed on a specially prepared substrate, from which DNA regions complementary to the adapters "grow" in the form of a lattice. They can therefore "bind" the adapter-equipped DNA sections at these places. In addition, the adapters contain primers, i.e. sites to which DNA polymerase, the enzyme that carries out replication, can bind.
4. After step 3, different sections of DNA have randomly "stuck" to different places in the lattice. Now each section is cloned many times around its place, producing whole "clusters". This process is known as bridge amplification, because DNA binds to the substrate with both ends at once; we will talk about what this means for bioinformatics in the next section.
5. The DNA sections are denatured (the hydrogen bonds are broken); as a result, different single-stranded DNA sections "grow" from the lattice nodes on the substrate.
6. The substrate is placed in a solution containing DNA polymerase and specially labeled nucleotides that immediately terminate replication (as you may remember, such nucleotides were also used in Sanger sequencing). They attach to the DNA, one per section. Each section thus receives the "letter" complementary to the one it begins with.
7. The "extra" nucleotides are then washed off, and the labels of the remaining ones are read; in Illumina technology these are fluorescent labels that can be made to glow in different colors and photographed. At this step we find out which letter each "cluster of sections" of DNA begins with.
8. After that, the chemical group that prevented further extension of the DNA molecule is chemically "cut off" from the nucleotides already bound. Now we can go back to step 6 and repeat the process, reading the second letter of each sequence in the second cycle, and so on.

As a result, in each cycle we simultaneously read a very large number of nucleotides from different sequences. The price we pay is that the sections of DNA we can read are much shorter than with Sanger sequencing: Illumina reads are usually about 100 nucleotides long.
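The read-out loop (attach one labeled nucleotide, photograph, cut off the blocking group, repeat) can be imitated with a toy simulation. This is only an illustration under strong simplifying assumptions: each cluster is represented by a single error-free template string, strand direction is ignored, and the function name sequence_by_synthesis and the sample fragments are made up.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sequence_by_synthesis(clusters, cycles):
    """Toy model of the read-out loop: one letter per cluster per cycle.

    `clusters` is a list of single-stranded templates, one per cluster.
    In every cycle each cluster binds the labeled nucleotide that is
    complementary to its next unread base, and that label is recorded.
    """
    reads = ["" for _ in clusters]
    for cycle in range(cycles):
        # One "photograph": every cluster contributes one letter at once.
        for i, template in enumerate(clusters):
            if cycle < len(template):
                reads[i] += PAIRS[template[cycle]]
    return reads


clusters = ["ATGCAGA", "CACTTTA", "GACGATC"]  # made-up fragments
print(sequence_by_synthesis(clusters, cycles=5))
```

The outer loop runs over cycles rather than over clusters, which mirrors the point of the method: the number of cycles bounds the read length, while the number of clusters read in parallel can be very large.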

Paired reads and problem statement

There is one more important detail. Sections of DNA "stick" to the substrate with both ends, and we can find out which sequences correspond to the same section. This means that in reality we read the same section, whose length is approximately known, from both ends at once. As a result, the data looks roughly like this:

ATGCAGA???????????????CACTTTA,

where the distance between the two known strings (the number of question marks) is not known exactly. Depending on the technology, it is possible to obtain both very long unknown fragments (about 1000 nucleotides) "framed" by two reads of length 100, and short fragments in which only two or three dozen nucleotides between the reads are unknown. Both kinds can help a lot in the assembly, and we will talk about this in the following articles.

So, now we can formally state the genome assembly problem. It reads: given a large number of short substrings, reconstruct the original long string over the alphabet A, C, G, T. In the case of Illumina sequencing, the input is a large number of pairs of short substrings separated in the original string by an approximately known distance. Having stated the problem, we can forget about biology, chemistry and medicine: we are facing a purely algorithmic task. However, before moving on to mathematics, let us make a few more remarks.
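To make the shape of the input concrete, here is a small sketch that generates such pairs from a known string. It is an illustration only: the function paired_reads and its parameters gap_mean and gap_jitter are our own names, both reads are reported in the same orientation, as in the example above, and sequencing errors are ignored.

```python
import random


def paired_reads(genome, n_pairs, read_len, gap_mean, gap_jitter, seed=0):
    """Sample pairs of substrings separated by an approximately known gap.

    Each pair is (left, right), taken from the two ends of a random
    fragment; the unread gap between them is `gap_mean` give or take
    `gap_jitter`. Assumes the genome is at least as long as the longest
    possible fragment and that gap_mean >= gap_jitter.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        gap = gap_mean + rng.randint(-gap_jitter, gap_jitter)
        fragment_len = 2 * read_len + gap
        start = rng.randint(0, len(genome) - fragment_len)
        fragment = genome[start:start + fragment_len]
        pairs.append((fragment[:read_len], fragment[-read_len:]))
    return pairs


genome = "ATGCAGAACAGACGATCAGCGACACTTTA"
for left, right in paired_reads(genome, n_pairs=3, read_len=7,
                                gap_mean=10, gap_jitter=3):
    print(left, "...", right)
```

The assembly problem is the inverse of this function: given only the pairs and the approximate gap, recover the genome.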

Errors and quality indicators in second-generation sequencers

As we already know, sequencing always contains errors. In Illumina sequencers and similar ones, errors usually occur at the phase when it is necessary to recognize labeled nucleotides, i.e. to understand what color and with what strength clusters of repeatedly cloned DNA sections glow. The figure shows a typical example of such a photo generated by the Illumina sequencer.

Figure from the website medicine.yale.edu

The problem here is that due to the imperfection of the other stages of the process, clusters never glow with only one color; it is always a mixture of all four colors with one intensity or another. It is necessary to identify the most intensive component and assess how likely an error is in this letter; this task is called base calling (recognition of nucleotides). Base calling is a whole science, the details of which we will not go into now.
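The core of the task can be illustrated with a deliberately naive sketch: take the brightest of the four channels and treat the share of light in the other channels as a rough error estimate. This is an assumption made purely for illustration; real base callers use far more elaborate models, and the function name call_base is hypothetical.

```python
def call_base(intensities):
    """Pick the brightest channel and estimate how likely it is wrong.

    `intensities` maps each letter to a non-negative signal strength.
    The error estimate is simply the share of light NOT coming from
    the winning channel: an illustrative assumption, not a real model.
    """
    total = sum(intensities.values())
    if total == 0:
        return "N", 1.0  # no signal at all: give up on this position
    base = max(intensities, key=intensities.get)
    return base, 1.0 - intensities[base] / total


# A clean cluster and a badly mixed one.
print(call_base({"A": 90.0, "C": 4.0, "G": 3.0, "T": 3.0}))
print(call_base({"A": 30.0, "C": 28.0, "G": 22.0, "T": 20.0}))
```

Both clusters are called as A, but the second call comes with a much larger error estimate.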

What matters for us now is that, as a result, the sequencer assigns to each nucleotide of each read the probability that this nucleotide was recognized correctly. These probabilities can also be used during assembly, and sequencers output them along with the reads themselves.

As a result, a typical read in the so-called fastq format, standard for second-generation sequencers, looks something like this:

@EAS20_8_6_1_3_25/1
GCAAAAAACTTACCCCGGAACAGGCCGAGCAGATCAAAACGCTACTGCAATACAGACCATCAAGCACCAACTCCCNNNCGTAGNNNNNNTATGTTNNNNG
+EAS20_8_6_1_3_25/1
HHHHHHHGHHHHHHHHHHHHHHHHHHHHEHHHHHHHHEGHHHHGHHGHEFD?A=A&FFBB>&::===@&@E@E>A#########################

The first and third lines contain the read's name; the second line is the nucleotide sequence itself. Note that among the letters A, C, G, T there are also letters N: these mean that the sequencer could not determine unambiguously which nucleotide was there and gave up. The fourth line encodes, on a logarithmic scale, the probability that each nucleotide was recognized correctly; for example, H here corresponds to an error probability of about one ten-thousandth. As a rule, the quality deteriorates towards the end of the read; in our example, as you can see, the tail of the read could not be read reliably at all.
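Decoding the fourth line takes only a few lines of code. The sketch below assumes the common Phred+33 encoding, in which the quality is Q = ord(character) - 33 and the error probability is 10^(-Q/10); this assumption agrees with the figure given for H above, but the encoding offset is not stated in the file itself. The function name parse_fastq_record is our own.

```python
def parse_fastq_record(lines, offset=33):
    """Split one four-line FASTQ record into name, bases, error probabilities.

    Assumes the Phred+33 encoding: quality Q = ord(char) - 33, and the
    error probability is 10 ** (-Q / 10).
    """
    name_line, bases, plus_line, quality = [line.strip() for line in lines]
    if not name_line.startswith("@") or not plus_line.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    errors = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return name_line[1:], bases, errors


# Shortened for brevity: the first five and the last five positions
# of the read shown above, stitched together.
record = ["@EAS20_8_6_1_3_25/1", "GCAAANNNNG", "+EAS20_8_6_1_3_25/1", "HHHHH#####"]
name, bases, errors = parse_fastq_record(record)
for base, p in zip(bases, errors):
    print(base, f"{p:.5f}")
```

Under this encoding H gives Q = 39, an error probability of roughly 0.0001, while # gives Q = 2, an error probability of about 0.63, which is why the tail of the read is essentially unreadable.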

Other sequencing methods

Although we have examined the Illumina (Solexa) sequencer in the most detail, it is by no means the only method. There are other second-generation sequencers with different properties.

In sequencing by ligation, the nucleotide recognition phase uses not DNA polymerase and the replication process but special short "probes" that attach to complementary nucleotides, are fixed in place and then washed away, after which the process repeats. This is how the SOLiD sequencer from Applied Biosystems works.

Pyrosequencing is based on chemiluminescent signals emitted when a nucleotide is attached to its complementary nucleotide in the DNA strand being read; for example, the 454 sequencer from 454 Life Sciences works on this principle.

The PacBio sequencer (from Pacific Biosciences) works on a principle very similar to Illumina's, but with a different detection method: special "grids" make it possible to catch signals from individual molecules (the method is called SMRT, single molecule real time sequencing). This speeds up the process, fits more reads on one substrate (less DNA cloning is needed, and there is no need to grow large clusters) and significantly increases the length of reliably readable reads.

The recently introduced method of ion semiconductor sequencing (on which the IonTorrent sequencer is based) instead simply detects the ions that are released when a new nucleotide is attached to a DNA strand. This radically reduces the time and cost of obtaining reads, although the error rate is higher, and errors are more frequent in fragments consisting of a single repeated letter.

Human thought does not stand still: sequencing methods are constantly improving. However, almost all modern methods produce relatively short reads, from 100 to 400 nucleotides; in this series we will mainly talk about how to assemble exactly such short reads.

Sanger or Illumina?

The human genome was first assembled from Sanger sequencer data, and the algorithmic side of that project was far less developed than it is now, ten years later. The algorithms used to assemble the first human genome are much simpler than those we will talk about later. Yet the first genome was assembled after all; maybe all the algorithmic progress is an unnecessary myth, and the old programs would be enough?

Surprising but true: the "old" sequencers (first generation, Sanger) produce data much better suited for assembly than the "new" ones (second generation). This shows mainly in the length of the reads, the sections of DNA that can be read contiguously and that then have to be assembled into one long string. First-generation sequencers produced reads more than five hundred nucleotides long, usually about a thousand. Modern sequencers produce pairs of reads, each about one hundred nucleotides long.

Why use second-generation sequencers at all, and what makes them better? The reason, as often happens in science and even in medicine, is purely economic: modern sequencers are much cheaper. The project to assemble the first human genome, completed in 2003, took 13 years and cost $3.8 billion. Since then, the price of sequencing has fallen exponentially; "Moore's law in genetics" works even faster than the usual one, cutting the price by almost an order of magnitude every two years, so that when the genome of Gordon Moore himself was sequenced in 2010, it cost only about $10 thousand. New sequencing technologies promise to sequence a human genome for $1,000 or even less, which opens up opportunities for mass sequencing for medical purposes.

At this level, the price of the algorithmic side of the issue becomes important. To ensure that the assembly of genomes does not take longer and does not cost more than their sequencing itself, it is necessary to develop very fast algorithms to solve the assembly problem. This will be discussed in the next article.


Portal "Eternal youth" http://vechnayamolodost.ru 19.02.2013
