24 January 2013

Store information on DNA carriers

The dream was recorded in DNA
Five files with a total volume of 5.2 megabits were recorded in artificial DNA

Ivan Kulikov, Gazeta.Ru

A joint research group from the European Bioinformatics Institute (EBI) in the UK and the European Molecular Biology Laboratory (EMBL) in Germany, together with Agilent Technologies (USA), has developed a technology that allows artificial DNA to be used as a long-term, reliable and non-volatile information carrier.

An article describing the technology (Goldman et al., Towards practical, high-capacity, low-maintenance information storage in synthesized DNA) is published today in Nature.

Using short single-stranded DNA molecules, so-called oligonucleotides (short nucleic acid chains containing a relatively small number of nucleotides, up to several dozen), as a memory device, the researchers recorded five different files on an array of such DNA: the complete collection of Shakespeare's sonnets (text in ASCII format); the article "Molecular structure of nucleic acids" by James Watson and Francis Crick, the discoverers of the structure of DNA, in PDF format; a color photo of the EBI building in JPEG format; a 26-second MP3 file with a fragment of Martin Luther King's speech "I Have a Dream"; and a file describing the Huffman code used to convert binary files into a form convenient for representing data as a sequence of the nitrogenous bases of DNA.

The total amount of useful data recorded and read from DNA was approximately 5.2 megabits.


Dr. Nick Goldman of EMBL-EBI holds a test tube containing all of Shakespeare's sonnets, a classic scientific article, a sound file and a photo of his institute, recorded in DNA. // Nature

To record this volume, 153,335 synthesized short DNA strands of 117 nucleotides each were used. In each strand the data were encoded in four blocks of 25 nucleotides. The remaining 17 nucleotides encoded the address labels needed to reassemble the data into the original files.
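The figures above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that each nucleotide carries one ternary digit and that every 25-nucleotide block is stored in four different strands, as the article describes further on. The constant names are ours, not the authors'.

```python
import math

STRANDS = 153_335          # synthesized oligonucleotides (from the article)
STRAND_LEN = 117           # nucleotides per strand
PAYLOAD_LEN = 4 * 25       # four data blocks of 25 nucleotides
ADDRESS_LEN = STRAND_LEN - PAYLOAD_LEN  # 17 nucleotides of address labels
REDUNDANCY = 4             # assumption: each block appears in four strands

total_nt = STRANDS * STRAND_LEN          # everything that was synthesized
payload_nt = STRANDS * PAYLOAD_LEN       # data-carrying nucleotides
unique_payload_nt = payload_nt // REDUNDANCY

# Upper bound on capacity if each unique nucleotide carries one trit.
max_bits = unique_payload_nt * math.log2(3)

print("nucleotides synthesized:", total_nt)
print("unique payload nucleotides:", unique_payload_nt)
print("capacity upper bound, megabits:", round(max_bits / 1e6, 2))
```

Under these assumptions the upper bound comes out at roughly six megabits, which is consistent with the reported 5.2 megabits of useful data.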

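Coding took place in three stages. The binary data were first converted on a computer into ternary code by means of a Huffman code, which represented each eight-bit block (byte) as a sequence of five or six ternary digits, or trits (0, 1, 2); five trits alone give only 3^5 = 243 combinations, fewer than the 256 possible byte values. Next, the sequence of trits was converted into a sequence of nucleotides, with three candidate nucleotides available for each trit. Finally, the resulting nucleotide sequence was cut into overlapping fragments, each supplied with an address label.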
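The first stage can be illustrated with a generic base-3 Huffman coder. This is a minimal sketch, not the code used in the Nature paper: here the code table is built from the byte frequencies of the input itself, which is our assumption, and the function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def ternary_huffman(data):
    """Return a dict mapping each byte value in data to a string of trits."""
    tie = count()  # unique tie-breaker so the heap never compares nodes
    heap = [(w, next(tie), sym) for sym, w in Counter(data).items()]
    # A full ternary tree has an odd number of leaves (at least three),
    # so pad with zero-weight dummy leaves (None) where necessary.
    while len(heap) < 3 or len(heap) % 2 == 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    # Repeatedly merge the three lightest nodes into one parent node.
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(3)]
        weight = sum(c[0] for c in children)
        heapq.heappush(heap, (weight, next(tie), [c[2] for c in children]))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):          # internal node: three branches
            for trit, child in enumerate(node):
                walk(child, prefix + str(trit))
        elif node is not None:              # real leaf (dummies are skipped)
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def encode(data, codes):
    """Replace every byte with its trit string."""
    return "".join(codes[b] for b in data)


def decode(trits, codes):
    """Invert encode(); works because a Huffman code is prefix-free."""
    inverse = {v: k for k, v in codes.items()}
    out, buf = bytearray(), ""
    for t in trits:
        buf += t
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return bytes(out)


sonnet = b"Shall I compare thee to a summer's day?"
table = ternary_huffman(sonnet)
print(encode(sonnet, table))
```

Frequent bytes receive shorter trit strings, which is where the compression comes from.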

Ternary encoding not only compressed the data but also reduced the probability of errors during subsequent DNA reading and reconstruction of the binary array.

As is well known, DNA is a polymer built from four nucleotides (adenine, guanine, thymine and cytosine: A, G, T, C). Three are enough to represent a ternary digit, so at each position one of the four nucleotides could be left out, and the bases could be assigned to the trits in different ways from one position to the next. This ensured that two or more identical nucleotides never had to be joined in a row during DNA synthesis (a so-called homopolymer), which reduces the likelihood of errors during subsequent data reconstruction.
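One way to realize this idea is to let each trit choose among the three nucleotides that differ from the previous one. The sketch below uses a rotation table of our own choosing; the exact table and the starting nucleotide are assumptions made for illustration, not details taken from the article.

```python
# For each previous nucleotide, the three allowed successors, indexed by trit.
# The previous nucleotide itself is never in its own row, so repeats
# (homopolymers) cannot occur. This particular table is an assumption.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def trits_to_dna(trits, prev="A"):
    """Convert a string of trits ('0', '1', '2') into nucleotides."""
    out = []
    for t in trits:
        prev = NEXT_BASE[prev][int(t)]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, prev="A"):
    """Invert trits_to_dna(), given the same starting nucleotide."""
    out = []
    for base in dna:
        out.append(str(NEXT_BASE[prev].index(base)))
        prev = base
    return "".join(out)


message = "0120210002221"
strand = trits_to_dna(message)
print(strand, dna_to_trits(strand) == message)
```

Even a run of identical trits such as 000 yields different neighbouring bases, because the meaning of each trit depends on the nucleotide that precedes it.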

A scheme for converting data (Shakespeare's sonnet) into a DNA array:
a) binary code
b) ternary code
c) DNA code
d) overlapping DNA fragments, each offset from the previous one by 25 nucleotides (yellow marks the DNA sections with address labels).
// Nature

The 153,335 DNA sequences obtained in this way were sent to Agilent Technologies in the USA, where they were synthesized on special equipment; each of the 117-nucleotide oligonucleotides was produced in about 12 million copies.

The freeze-dried array of synthesized DNA, a tiny pinch of organic matter in a hermetically sealed test tube, was sent by regular express mail back to England and then to Germany, to one of the EMBL laboratories. There the DNA was decoded with almost one hundred percent accuracy, which in turn made it possible to reconstruct the five original files (the number and content of which the laboratory staff did not know).

DNA memory deserves to be considered a potential future standard for storing and reading data because of the impressive advantages it has over the electronic and optical storage devices in use today. These are a huge recording density (theoretically, up to 455 exabytes of data can be recorded in one gram of single-stranded DNA; 1 exabyte = 10^18 bytes, or a billion gigabytes – VM), energy independence, and durability: DNA degrades over time, but in the natural environment it can preserve information for tens of thousands of years, and with artificial preservation even longer.

Attempts to store information in DNA have been made, with some success, since the late 1980s, but the real breakthrough has come only now, with the rapid fall in price and, most importantly, the increase in accuracy of technologies for the rapid synthesis and reading of DNA molecules.

Note that the EBI-EMBL team, which described its DNA memory technology in Nature, is not a pioneer here.

Relatively recently, George Church's group at Harvard, which has been experimenting with DNA memory for a long time, reported in the rival journal Science that it had managed to record and read several files (a book, images and a JavaScript program) from a synthesized array of short single-stranded DNA, with exactly the same total volume: 5.2 megabits (see the article "The Book from DNA" – VM).

A comparison of the technologies used shows that both groups used almost identical methods of recording and reading information from DNA.

The data array was first divided into blocks of a little more than a hundred bits in size, then recoded into a letter sequence of nucleotides, on the basis of which short DNA chains of slightly more than 100 bases were synthesized. Information was read from the array using automated polymerase chain reaction and the latest generation of parallel DNA sequencers: the DNA chains were copied many times and then read with simultaneous error correction, and the resulting codes were combined into data arrays according to the address labels located at the ends of the chains.
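The reading step can be pictured as grouping the sequenced copies by address label, taking a per-position majority vote within each group, and joining the results in address order. The sketch below is a simplified illustration: it assumes that reads arrive as (address, payload) pairs, that all payloads have equal length and that errors are substitutions only. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(copies):
    """Majority vote at each position across equal-length reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*copies)
    )


def reassemble(reads):
    """Group reads by address, correct each group, join in address order."""
    by_address = defaultdict(list)
    for address, payload in reads:
        by_address[address].append(payload)
    return "".join(consensus(by_address[a]) for a in sorted(by_address))


reads = [
    (1, "GATC"),
    (0, "ACGT"),
    (0, "ACGA"),  # one copy carries a substitution error in the last base
    (0, "ACGT"),
    (1, "GATC"),
]
print(reassemble(reads))
```

Because each chain is present in many copies, an occasional misread base is outvoted by the correct reads of the same chain.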

The only significant difference lies in the scheme for encoding the binary stream as a sequence of nucleotides. The Church group used a simple conversion scheme, assigning a pair of bases to each of the conditional "zero" and "one" (for example, AG and TC), whereas the EBI-EMBL team used a more complex algorithm, converting the bit stream into ternary code by means of a Huffman code. This made it possible to compress the data, packing more information into 5.2 megabits, and to reduce the likelihood of errors by excluding homopolymer runs from the DNA array. Another trick that increased error tolerance was the fourfold duplication of the data across the 117-nucleotide chains with a regular offset of 25 nucleotides; in addition, every second fragment was encoded in the reverse direction. With this scheme, the probability of identical errors occurring in several chains at once becomes negligible.
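The overlapping layout can be sketched as a sliding window of 100 nucleotides moved in steps of 25, so that every interior stretch of the sequence ends up in four fragments. In the illustration below a plain integer index stands in for the real 17-nucleotide address label, and "reverse" is taken to mean the reverse complement; both are our assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Read the opposite strand: complement every base, then reverse."""
    return seq.translate(COMPLEMENT)[::-1]


def fragment(dna, length=100, step=25):
    """Cut dna into overlapping windows; flip every second window."""
    fragments = []
    starts = range(0, len(dna) - length + 1, step)
    for index, start in enumerate(starts):
        piece = dna[start:start + length]
        if index % 2 == 1:
            piece = reverse_complement(piece)
        fragments.append((index, piece))
    return fragments


def coverage(position, n_fragments, length=100, step=25):
    """Count how many fragments contain the given position."""
    return sum(
        1 for i in range(n_fragments)
        if i * step <= position < i * step + length
    )


sequence = "ACGT" * 50                      # 200 nucleotides of toy data
pieces = fragment(sequence)
print(len(pieces), coverage(100, len(pieces)))
```

Flipping alternate fragments means that the same stretch of data is synthesized and read in two different chemical contexts, so a systematic error is less likely to hit all copies in the same way.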

Error resistance is precisely what the authors of the Nature article named as the main advantage of their technology when asked, at a specially organized press briefing, how their DNA memory differs from the one developed at Harvard.

However, this is open to argument. First, Church's group also built an error correction scheme into its DNA memory, comparing the codes of the multiplied "mirror" DNA chains. Second, the authors of the Nature article themselves acknowledge the "redundancy" of their scheme, since modern devices that synthesize and read short DNA chains of up to 200 bases are very accurate, and the average number of errors rarely exceeds one per 500 bases.
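A rough calculation shows what an error rate of one per 500 bases means in practice. The sketch assumes that errors are independent and equally likely at every position, which is a simplification of ours rather than a claim from the article.

```python
ERROR_RATE = 1 / 500   # about one error per 500 bases (from the article)
STRAND_LEN = 117       # nucleotides per chain
COPIES = 4             # fourfold duplication of every data block

# Probability that a single 117-nucleotide chain is read without any error.
p_strand_clean = (1 - ERROR_RATE) ** STRAND_LEN

# Probability that the same position is wrong in all four overlapping chains.
p_all_copies_wrong = ERROR_RATE ** COPIES

print(f"error-free chain: {p_strand_clean:.2f}")
print(f"position wrong in all four copies: {p_all_copies_wrong:.1e}")
```

Under these assumptions about four chains in five are read without a single error, and the chance that one position is corrupted in all four copies is on the order of 10^-11, which is why the authors can call the scheme redundant.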


An EBI photograph recorded and read using DNA. // Nature

Be that as it may, the debut of DNA memory can be considered a success. That holds despite the near identity of the two experiments on artificial DNA as a data carrier, and despite the amusing side effects of competition between the two leading scientific journals, which kept from each other nearly identical articles describing an interesting and promising technology; both manuscripts reached them at almost the same time, at the beginning of the summer of 2012. (Science, as we can see, reacted more quickly, and Nature did not get its planned little sensation.) A potential area of application is the long-term archiving of relatively infrequently requested information: having estimated the rate at which DNA synthesis and reading technology is becoming cheaper, the EBI-EMBL group predicts that within the next 50 years DNA memory will be able to compete with magnetic tape storage, which is still very popular.

Portal "Eternal youth" http://vechnayamolodost.ru 24.01.2013
