03 March 2017

A new record for storing information in DNA

Researchers stored "Arrival of a Train" and the KolibriOS operating system in DNA

Alexander Ershov, N+1

Yaniv Erlich and Dina Zielinski, researchers at Columbia University and the New York Genome Center, have developed a new technology for storing information in DNA. It approaches the theoretical limit on the density of encoded information, guarantees reliable readout despite a large number of errors, and can fit up to 215 petabytes of data into one gram of nucleic acid. The method is described in the journal Science ("DNA Fountain enables a robust and efficient storage architecture") and in a Columbia University press release ("Researchers Store Computer Operating System and Short Movie on DNA").

The informational potential of DNA became clear immediately after the discovery of its structure in 1953, but scientists began approaching the topic as an engineering problem only a few years ago. This is primarily due to a sharp drop in the cost of chemical synthesis and, especially, in the cost of reading nucleic acid sequences.

One of the first serious attempts to test the information capacity of DNA in practice was carried out in 2012 by a team led by the well-known molecular biologist George Church. The scientists encoded a 52,000-word book, several JPEG images and a small JavaScript program in a DNA sequence. The total amount of information was about 700 kilobytes, which fit into 55,000 individual DNA fragments 159 nucleotides long. Most (but not all) of the encoded information could later be read back. However, the encoding used by Church and colleagues had no error-correction or redundancy scheme: the binary sequence was simply translated into a sequence of nucleotides on the principle of one nucleotide per bit (adenine or cytosine corresponded to 0, guanine or thymine to 1).

Such a system can demonstrate the capabilities of the technology, but it is, of course, not usable in practice. Subsequently, several other teams tried to apply "real", well-established coding methods from information theory to DNA. For example, scientists used the classic Reed-Solomon code, which can correct errors in data blocks and is used, in particular, when writing information to CDs. However, according to the authors of the new article, this code is not well suited to DNA: the nature of the errors that arise when DNA is copied leads to large variation in the representation of different oligonucleotides, especially for large amounts of data, which degrades the "readability" of the code. In addition, the encoding density achieved in the works that used this code was only about half the theoretical limit. Erlich and Zielinski therefore decided to develop their own method of storing information in DNA, taking the so-called fountain code as a basis.

DNA contains four types of bases (A, T, G and C), so the maximum encoding density of information in it is two bits per nucleotide. In reality, the amount of encoded information per symbol is always lower: first, because of the redundancy needed to compensate for errors in the synthesis and reading of DNA fragments, and second, because of the "service sequences" needed for indexing (barcoding) the sequences, enabling DNA copying by PCR, and so on. According to the authors' calculations given in the article, the Shannon information density, taking into account the average oligonucleotide length, the size of the copy adapters and typical synthesis errors, is about 1.83 bits per nucleotide. The new method achieved an information density equal to 86% of this theoretical limit.
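The densities quoted above fit together arithmetically. A quick sketch of the numbers (the figures are taken from the article; the calculation itself is just for illustration):

```python
from math import log2

# Four bases (A, T, G, C) give at most log2(4) = 2 bits per nucleotide.
max_density = log2(4)

# The article's estimated Shannon limit after redundancy, adapters and
# typical synthesis errors, in bits per nucleotide.
shannon_limit = 1.83

# The new method reportedly reaches 86% of that limit.
achieved = 0.86 * shannon_limit

print(f"Maximum density:  {max_density:.2f} bits/nt")
print(f"Shannon limit:    {shannon_limit:.2f} bits/nt")
print(f"Achieved density: {achieved:.2f} bits/nt")  # about 1.57 bits/nt
```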

The new encoding works as follows. First, the binary sequence is divided into non-overlapping segments of a fixed length of 32 bytes, which are then encoded into so-called "droplets" - specially constructed sequences of slightly greater length. The droplets are then translated directly into a DNA sequence at maximum density (two bits per nucleotide) and - this is the superstructure over the fountain code - checked against the biochemical constraints imposed by DNA sequencing technology: fragments must not contain long single-nucleotide repeats or regions with too high or too low a proportion of G and C nucleotides (the A+T to G+C ratio affects the physical properties of the molecules). If a droplet violates the constraints, encoding is simply repeated until a valid sequence is produced. The result is supplemented with standard PCR adapters and sent to an automated synthesizer.
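The screening step described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation; the run-length and GC thresholds here are assumptions chosen for the example:

```python
def passes_constraints(seq: str, max_run: int = 3,
                       gc_min: float = 0.45, gc_max: float = 0.55) -> bool:
    """Reject candidate oligos with long homopolymer runs or skewed GC content
    (thresholds are illustrative, not the paper's exact values)."""
    run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        if run > max_run:
            return False               # single-nucleotide repeat too long
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc_min <= gc <= gc_max      # GC fraction must stay in bounds

print(passes_constraints("ACGTACGTACGTACGTACGT"))  # balanced, no runs -> True
print(passes_constraints("AAAAACGTACGTACGTACGT"))  # run of five A's  -> False
```

In the actual scheme, a droplet that fails such a check is simply re-generated with a new seed until a valid sequence appears.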

The Luby transform and the screening step in the DNA fountain coding method work iteratively, until a satisfactory result is obtained.

The key feature of fountain coding is that each "droplet" contains information about several different segments of the original sequence, so that even the loss of a few droplets can be compensated by the others. To index the droplets, a pseudorandom number generator is used, and its seed is attached to each droplet. The method was described by Michael Luby back in the late 1990s and is now widely used in mobile video transmission, where blocks of information are frequently lost.
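The seed-and-XOR idea can be shown with a toy sketch. This is not the DNA Fountain implementation: the uniform degree choice below is a simplification (real Luby-transform codes use a carefully tuned degree distribution), and the function name is hypothetical:

```python
import random

def make_droplet(segments: list[bytes], seed: int) -> bytes:
    """Toy Luby-transform droplet: XOR a seed-chosen subset of segments.
    Storing the seed with the droplet tells the decoder which subset it was."""
    rng = random.Random(seed)
    degree = rng.randint(1, len(segments))        # simplified degree choice
    chosen = rng.sample(range(len(segments)), degree)
    payload = bytes(len(segments[0]))
    for i in chosen:
        payload = bytes(a ^ b for a, b in zip(payload, segments[i]))
    return payload

segments = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
# The same seed always reproduces the same droplet; that determinism is
# exactly what lets the seed serve as the droplet's index.
print(make_droplet(segments, seed=42) == make_droplet(segments, seed=42))
```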

Using the new method - a fountain code modified by the authors to work with DNA - the scientists experimentally encoded and read back 2.14 megabytes of information stored as DNA oligonucleotides. The data included an Amazon gift card, the KolibriOS operating system, Shannon's paper on transmitting information over a noisy channel, a video file of the Lumiere brothers' "Arrival of a Train", and even a computer virus. The final physical density of the record, measured in an experiment with gradual, almost "homeopathic" dilution of the DNA, was 215 petabytes (215 million gigabytes) per gram of nucleic acid.

The main result of the new work - far from the first in its field - is an approach to the theoretical limit of density and reliability for DNA-based information coding. Given how close the results come to the theoretical limits, it is hard to expect any radical improvement of these figures in future work. The main obstacle to the practical use of DNA as an information carrier now remains the high cost of synthesis: in the new work, the final cost of the "DNA flash drive" was 3.5 thousand dollars per megabyte of data. However, this figure should be seen in the right context. First, once such a medium is created, it can easily be copied an almost unlimited number of times. Second, the current cost of writing information to DNA reflects the use of conventional chemical synthesis methods, developed with accuracy as the top priority. As the new work shows, such accuracy is highly redundant for data-storage tasks. A significant reduction in cost could be achieved by relaxing this requirement, but so far such "quick and dirty" DNA synthesis methods have not become widespread because they had no practical application.

Interestingly, although the new work surpasses everything done so far in recording density, it is significantly inferior in total data volume. Last summer, scientists from the University of Washington, with financial support from a private company, managed to store more than 200 megabytes of data in DNA, including digitized works of art, 100 literary works from Project Gutenberg, the UN Universal Declaration of Human Rights in more than 100 languages, the seed database of the non-profit organization Crop Trust, and a high-resolution music video of "This Too Shall Pass" by the band OK Go.

Portal "Eternal youth" http://vechnayamolodost.ru  03.03.2017
