07 September 2009

History of genomics. Part 1: Genomic projects

In this article, I will try to explain in accessible terms how the first methods of reading genetic sequences appeared, what they were, and how genomics moved from reading individual genes to reading complete genomes, including the complete genomes of specific people.
Part 2, "DNA Technologies", will be devoted to the most modern and unusual methods of sequencing (reading genetic sequences) and to other technologies related to DNA molecules.
Alexander Panchin, M.Sc. of the A.A. Harkevich Institute of Information Transmission Problems of the Russian Academy of Sciences, postgraduate student of the Faculty of Bioengineering and Bioinformatics of Moscow State University.

Shortly after Watson and Crick's discovery [1], genomics was born: the science that studies the genomes of organisms, which includes intensive reading of complete DNA sequences (sequencing) and placing them on genetic maps. This science also examines the interactions between genes and between alleles of genes, their diversity, and the patterns in the evolution and structure of genomes. The field developed so rapidly that until recently text editors like Microsoft Word did not know the word "genome" and tried to correct it to "gnome".


James Watson (left) and Francis Crick (right), the scientists who discovered the DNA double helix

The very first gene to be read was the coat protein gene of an RNA virus, the bacteriophage MS2, studied in the laboratory of Walter Fiers in 1972 [2]. In 1976, the primary and secondary structure of the gene of its replicase, an enzyme responsible for the reproduction of viral particles, was deciphered [3]. Short RNA molecules were already relatively easy to read by then, but large DNA molecules could not really be read yet. For example, the 24-nucleotide sequence of the lac operator, a section of the lactose operon, obtained in 1973 by Walter Gilbert and Allan Maxam [4], was considered a significant breakthrough in science. Here is the sequence:

5'-TGGAATTGTGAGCGGATAACAATT-3'
3'-ACCTTAACACTCGCCTATTGTTAA-5'

The first DNA reading techniques were very inefficient; they used radioactive labels on the DNA and chemical methods to distinguish nucleotides. For example, one could take chemical reagents that cut the nucleotide sequence with different probabilities after different "letters". The DNA molecule is written in 4 nucleotide letters, A, T, G and C (adenine, thymine, guanine and cytosine), which form a double antiparallel helix (the two chains run in opposite directions). Inside this helix, the nucleotides face each other according to the complementarity rule: opposite A in the other chain is T, opposite G is C, and vice versa.
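The complementarity rule is easy to check on the 24-letter lac operator sequence quoted above. The following is only a minimal sketch; the function names are chosen here for illustration and do not come from any particular library.

```python
# Complementarity rule: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the opposite strand, letter by letter, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the opposite strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]

top = "TGGAATTGTGAGCGGATAACAATT"     # written 5' to 3'
bottom = "ACCTTAACACTCGCCTATTGTTAA"  # written 3' to 5', as printed above

# The two printed strands really are complementary to each other.
assert complement(top) == bottom
print(len(top), reverse_complement(top))
```

Because the two chains are antiparallel, the lower strand has to be reversed before it can be read in the usual 5' to 3' direction, which is what `reverse_complement` does.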

Gilbert and Maxam used 4 types of chemical cleavage reactions. One cut after A or G, but better after A (A>G); the second cut better after G (G>A); the third cut after C; and the fourth after C or T (C+T) [5]. Each reaction was carried out in its own test tube, and the products were then loaded onto a gel. DNA is a negatively charged molecule, so when the current is turned on it runs from minus to plus. Small molecules run faster, so the cut DNA fragments sort by length. Looking at the 4 lanes of the gel, you can tell in which order the nucleotides are arranged.
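The logic of reading such a gel can be sketched in a few lines. This is a heavily idealized model, not the authors' procedure: the band intensities are invented numbers, and each band is simply assumed to mark the position of the base that was cut.

```python
# Hypothetical relative band intensities for the four cutting reactions (A>G, G>A, C, C+T).
LANES = {
    "A>G": {"A": 1.0, "G": 0.3},  # cuts at A strongly, at G weakly
    "G>A": {"G": 1.0, "A": 0.3},  # cuts at G strongly, at A weakly
    "C":   {"C": 1.0},
    "C+T": {"C": 1.0, "T": 1.0},
}

def run_gel(sequence):
    """Return {lane: {position: intensity}} for an idealized gel."""
    gel = {lane: {} for lane in LANES}
    for position, base in enumerate(sequence, start=1):
        for lane, cuts in LANES.items():
            if base in cuts:
                gel[lane][position] = cuts[base]
    return gel

def read_gel(gel, length):
    """Deduce the base at every position by comparing the four lanes."""
    bases = []
    for position in range(1, length + 1):
        a = gel["A>G"].get(position, 0.0)
        g = gel["G>A"].get(position, 0.0)
        c = gel["C"].get(position, 0.0)
        ct = gel["C+T"].get(position, 0.0)
        if c > 0:
            bases.append("C")  # band in the C lane (and in C+T)
        elif ct > 0:
            bases.append("T")  # band in C+T only
        elif a > g:
            bases.append("A")  # stronger band in the A>G lane
        elif g > a:
            bases.append("G")  # stronger band in the G>A lane
        else:
            bases.append("N")  # no usable band
    return "".join(bases)

sequence = "TGGAATTGTGAGCGGATAACAATT"
print(read_gel(run_gel(sequence), len(sequence)))
```

Note that C and T, like A and G, are told apart only by comparing two lanes, which is why four partially overlapping reactions are enough.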

A breakthrough in DNA sequencing came in 1977, when the English biochemist Frederick Sanger proposed the so-called chain termination method for reading DNA sequences. But before we talk about this method, it is necessary to understand what happens during the synthesis of new DNA molecules. DNA synthesis requires an enzyme, DNA-dependent DNA polymerase, which can complete a single-stranded DNA molecule to a double-stranded one. To do this, the enzyme needs a "seed": a primer, a short DNA sequence that can bind to the long single-stranded molecule we want to complete to a double-stranded one. The nucleotides themselves are also needed, in the form of nucleoside triphosphates, along with certain conditions, such as a certain concentration of magnesium ions in the medium and a certain temperature. Synthesis always proceeds in one direction, from the end called 5' to the end called 3'. And of course, to read DNA, a large number of templates is needed, that is, copies of the DNA that is going to be read.

Sanger came up with the following. He took special (terminating) nucleotides which, once attached to the growing chain of the DNA molecule, prevented the attachment of subsequent nucleotides, that is, "broke" the chain. Then he took 4 test tubes, to each of which he added all 4 types of ordinary nucleotides and a small amount of one type of terminating nucleotide [6]. Thus, in the test tube containing the terminating adenine nucleotide, the synthesis of each new DNA molecule could be interrupted at any place where an "A" should have stood, in the test tube with the terminating guanine at any place where a "G" should have stood, and so on. The products from the 4 test tubes were loaded onto the gel in 4 lanes, and again the shortest molecules "ran" ahead while the longest remained near the start, so that from the positions of the bands it was possible to tell which nucleotide followed which. To make the bands visible, one of the four nucleotides (A, T, G or C) was labeled with radioactive isotopes, which does not change its chemical properties.
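The four-tube experiment can be imitated with a small simulation. This is a sketch under simplifying assumptions: `p_stop`, the chance that a terminating nucleotide is incorporated instead of a normal one, is a made-up parameter, and the code follows the newly synthesized strand directly rather than modeling the template and primer.

```python
import random

def terminated_lengths(new_strand, terminator, n_molecules=2000, p_stop=0.1, seed=0):
    """Simulate one test tube: lengths of chains that ended on a terminating nucleotide."""
    rng = random.Random(seed)
    lengths = set()
    for _ in range(n_molecules):
        for position, base in enumerate(new_strand, start=1):
            # Only the terminator of this tube can break the chain, and only rarely.
            if base == terminator and rng.random() < p_stop:
                lengths.add(position)  # the chain ends right after this base
                break
    return lengths

def read_gel(lanes):
    """Read the bands of all four lanes from the shortest fragment to the longest."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)

strand = "TGGAATTGTGAGCGGATAACAATT"  # the strand being synthesized, 5' to 3'
lanes = {base: terminated_lengths(strand, base, seed=i) for i, base in enumerate("ATGC")}
print(read_gel(lanes))
```

The terminator has to be rare: if `p_stop` were close to 1, almost every chain would break at the first matching letter and the longer fragments would never appear on the gel.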


The classic gel-based version of the Sanger method. Three series of 4 lanes are shown.

Using this method, the first DNA-based genome was read: the genome of the bacteriophage ϕX174, 5386 nucleotides long (the genome of the MS2 phage, 3569 nucleotides long and read earlier, consists of RNA).

The Sanger method was significantly improved in the laboratory of Leroy Hood, where in 1985 the radioactive label was replaced with a fluorescent one [7]. This made it possible to create the first automatic sequencer: each DNA fragment was now marked with a color that depended on its last letter (the color-labeled nucleotide that breaks the chain). The fragments were separated on the gel by size, and the machine automatically read the emission spectrum of the bands as they passed, sending the results to a computer. The result of this procedure is a chromatogram, from which it is easy to establish a DNA sequence up to 1000 "letters" long with very few errors.
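Turning a chromatogram into letters comes down to asking which color is brightest at each peak. The sketch below illustrates only that idea; the intensity values are invented, and real base-calling software is far more elaborate.

```python
def call_bases(peaks):
    """Call one base per peak: the dye channel with the brightest signal wins.

    `peaks` is a list of dicts mapping a base letter to a made-up fluorescence
    intensity measured at that peak.
    """
    return "".join(max(peak, key=peak.get) for peak in peaks)

def peak_confidence(peak):
    """Share of the total signal that belongs to the winning channel (0 to 1)."""
    total = sum(peak.values())
    return max(peak.values()) / total if total else 0.0

peaks = [
    {"A": 0.05, "C": 0.02, "G": 0.03, "T": 0.90},
    {"A": 0.04, "C": 0.01, "G": 0.88, "T": 0.07},
    {"A": 0.10, "C": 0.03, "G": 0.82, "T": 0.05},
    {"A": 0.45, "C": 0.05, "G": 0.40, "T": 0.10},  # two colors nearly tie here
]
print(call_bases(peaks))
print([round(peak_confidence(peak), 2) for peak in peaks])
```

A peak where two colors are almost equally bright, like the last one, is exactly where reading errors tend to arise.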


An example of a chromatogram from a modern sequencer that uses Sanger's chain termination method with a fluorescent label.

For many years the improved Sanger method remained the main method of mass genome sequencing and was used for many complete genome projects. Sanger himself received his second Nobel Prize in Chemistry in 1980 (he had received the first back in 1958 for reading the amino acid sequence of insulin, the first sequenced protein). The first complete genome of a cellular organism was that of Haemophilus influenzae, a bacterium that causes some forms of pneumonia and meningitis [8]. The genome of this bacterium, published in 1995, is 1,830,137 nucleotides long. In 1998 came the first genome of a multicellular animal, the roundworm Caenorhabditis elegans [9], with 98 million nucleotides, and in 2000 the first plant genome, that of Arabidopsis thaliana [10]. The genome of this plant, a relative of horseradish and mustard, is 157 million nucleotides long. The speed and scale of sequencing grew at an amazing rate, and the emerging databases of nucleotide sequences filled up faster and faster.

Finally, it was the turn of mammalian genomes: mouse and human. When James Watson took charge of the project to read the complete human genome at the National Institutes of Health (NIH) in the USA in 1990, many scientists were skeptical about the idea. Such a project required enormous investments of money and time and, given the limited capabilities of the genome-reading machines of the day, it seemed simply impossible to many. On the other hand, the project promised revolutionary changes in medicine and in the understanding of how the human body works, but there were problems here too. The fact is that at that time there was no accurate estimate of the number of human genes. Many believed that the complexity of the human body implied hundreds of thousands of genes, maybe even several million, so that making sense of so many genes, even if their sequences could be read, would be an impossible task. Many assumed that it was precisely a large number of genes that fundamentally distinguished humans from other animals, a notion later refuted by the Human Genome Project.

The idea of reading the human genome was born in 1986 on the initiative of the US Department of Energy, which subsequently funded the project together with the NIH. The cost of the project was estimated at $3 billion, and the project itself was planned for 15 years, with a number of other countries taking part: China, Germany, France, Great Britain and Japan. The human genome was read using so-called bacterial artificial chromosomes (BACs). With this approach, the genome is cut into many pieces about 150 thousand nucleotides long. These fragments are inserted into artificial circular chromosomes, which are introduced into bacteria. As the bacteria multiply, so do these chromosomes, and scientists obtain many copies of the same fragment of the DNA molecule. Each such fragment is then read separately, and the read pieces of 150,000 nucleotides are mapped onto the chromosome. This method makes it possible to sequence the genome fairly accurately, but it takes a very long time.

But the Human Genome Project was moving at an extremely slow pace. The scientist Craig Venter and his company Celera Genomics, founded in 1998, played about the same role in the history of genomics as the Soviet Union played in the Americans' flight to the Moon. Venter announced that his company would finish sequencing the human genome before the state project was completed. His project would require only $300 million, a small fraction of the cost of the state project, thanks to a new sequencing technology, "whole genome shotgun": reading random short fragments of the genome. When Francis Collins, who had succeeded James Watson as head of the human genome project in 1993, found out about Venter's intentions, he was shocked. "We'll do the human genome, and you can do the mouse," Venter suggested. The scientific community was alarmed, and for a number of reasons. Firstly, Venter promised to finish his project in 2001, 4 years ahead of the deadline set for the state project. Secondly, Celera Genomics was going to make money on the project by creating a database that commercial pharmaceutical companies would have to pay to use.
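The idea behind whole genome shotgun, reading short overlapping fragments and letting a computer stitch them together, can be shown with a toy example. This is only a sketch of a naive greedy overlap merge on four error-free reads, chosen here for illustration; it is not Celera's assembler, and real programs must also cope with repeats, reading errors and both DNA strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

genome = "TGGAATTGTGAGCGGATAACAATT"
# Four overlapping 12-letter "reads", deliberately given out of order.
reads = [genome[8:20], genome[0:12], genome[12:24], genome[4:16]]
print(greedy_assemble(reads))
```

On a genome full of repeats the greedy choice can join the wrong fragments, which is one reason why few believed the approach would scale to large genomes.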

In 2000, Celera Genomics proved the effectiveness of its sequencing method by publishing, together with the laboratory of the geneticist Gerald Rubin, the genome of the fruit fly Drosophila [11] (whole genome shotgun had previously been used to read the first bacterial genome, but few believed that the method was suitable for large genomes). It was this kick from a commercial company that stimulated the development of improved and more modern genome-reading methods within the Human Genome Project. In 2001, preliminary versions of the genome were published by both the state project and Celera [12, 13]. A preliminary estimate of the number of genes in the human genome was made then: 30-40 thousand. The final version of the genome was announced in 2003, two years earlier than planned, and described in print in 2004 [14]. That article put the number of human genes at presumably only 20-25 thousand. This number is comparable to that of other animals, in particular the nematode C. elegans.

Practically no one had imagined that the number of genes that keep our body working could be so small. Later, other details became known: the human genome is about three billion nucleotides long, and most of it consists of non-coding sequences, including all kinds of repeats. Only a small part of the genome actually contains genes, sections of DNA from which functional RNA molecules are read. Interestingly, as knowledge about the human genome grew, the number of putative genes only decreased: many potential genes turned out to be pseudogenes (non-functioning genes), and in other cases several putative genes turned out to be parts of one and the same gene.

After that, the pace of sequencing increased exponentially. In 2005, the chimpanzee genome was published [15], confirming the amazing similarity between apes and humans that zoologists of the past had already noticed. By 2008, the genomes of 32 vertebrates (including cats, dogs, horses, macaques, orangutans and elephants), 3 invertebrate genomes, 15 insect genomes, 7 worm genomes and hundreds of bacterial genomes had been fully read.

Finally, in 2007, humanity came to the point of sequencing the genomes of individual people. The first person whose complete individual genome was read was Craig Venter [16]. The genome was read in such a way that the chromosomes Venter inherited from each of his parents could be compared. This showed that the two sets of chromosomes within one person differ at about three million single-letter positions, not counting a huge number of larger variable regions. A year later, the complete diploid genome of James Watson was published [17]. Watson's genome contained 3.3 million single-letter substitutions compared with the reference human genome, of which more than 10,000 led to changes in the proteins encoded by his genes. Watson's genome cost $1 million, that is, the price of reading a genome has fallen more than 3,000-fold in 10 years, and this is not the limit. Today, scientists are working toward the goal of "1 genome, $1000, 1 day", and with the advent of new sequencing technologies it no longer seems impossible. The next part of this "history" will tell about them.


James Watson and Craig Venter, the first people whose individual genomes were read.

Literature
1. Watson J, Crick F: A structure for deoxyribose nucleic acid. Nature 1953, 171(4356):737-738.
2. Min Jou W, Haegeman G, Ysebaert M, Fiers W: Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature 1972, 237(5350):82-88.
3. Fiers W, Contreras R, Duerinck F, Haegeman G, Iserentant D, Merregaert J, Min Jou W, Molemans F, Raeymaekers A, Van den Berghe A et al: Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature 1976, 260(5551):500-507.
4. Gilbert W, Maxam A: The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 1973, 70(12):3581-3584.
5. Maxam AM, Gilbert W: A new method for sequencing DNA. Proc Natl Acad Sci U S A 1977, 74(2):560-564.
6. Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 1977, 74(12):5463-5467.
7. Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SB, Hood LE: Fluorescence detection in automated DNA sequence analysis. Nature 1986, 321(6071):674-679.
8. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM et al: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995, 269(5223):496-512.
9. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998, 282(5396):2012-2018.
10. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408(6814):796-815.
11. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF et al: The genome sequence of Drosophila melanogaster. Science 2000, 287(5461):2185-2195.
12. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA et al: The sequence of the human genome. Science 2001, 291(5507):1304-1351.
13. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860-921.
14. Finishing the euchromatic sequence of the human genome. Nature 2004, 431(7011):931-945.
15. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 2005, 437(7055):69-87.
16. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G et al: The diploid genome sequence of an individual human. PLoS Biol 2007, 5(10):e254.
17. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT et al: The complete genome of an individual by massively parallel DNA sequencing. Nature 2008, 452(7189):872-876.

Portal "Eternal youth" http://vechnayamolodost.ru 07.09.2009
