01 July 2010

How to find the right gene in the genome?

Search engine: Don't get lost in genomes

"Popular Mechanics", based on the Technology Review article "How to Build a Better DNA Search Engine"

What do the Chinese language and bioinformatic databases containing the genomes of the inhabitants of our planet have in common?


From the point of view of organizing a fast keyword search, there is indeed a similarity.

What Google has taught modern users of the World Wide Web is that online search is fast, very fast. The small print under the search bar is a constant reminder of this. Type, for example, the word "physics" into the search engine – and get "approximately 17,900,000 results" in 0.16 seconds. At the speed of thought? Maybe even faster.

For researchers working with bioinformatic databases, such speeds still seem fantastic. These databases are huge and continue to grow exponentially. They contain data both on the genomes of different species and on the genetic information of individual organisms within a species.

It may seem that finding a gene shared by several species or by individual organisms is as simple as typing keywords into a familiar search engine and getting the result. However, this is not the case.

Bioinformatic search systems mainly use the BLAST or FASTA algorithms. In essence, this is a comparison of the query with the data of one genome, then with the next, and so on. Such a search gives satisfactory results provided that the total number of genomes to compare against is small.
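To make the structure of such a search concrete, here is a minimal Python sketch of the "one genome after another" scan; the tiny sequences and the exact substring match are invented for illustration – real BLAST and FASTA searches rely on far more elaborate alignment heuristics.

```python
# Toy linear scan over a handful of invented sequences: the query is compared
# against each genome in turn, which is fine while the collection is small.
genomes = {
    "genome_A": "ATGCGTACGTTAGC",
    "genome_B": "TTAGCATGCGTAAC",
    "genome_C": "GGCATTAGCATGCA",
}

def linear_scan(query, collection):
    """Check every genome in turn for an exact occurrence of the query."""
    hits = []
    for name, sequence in collection.items():   # one comparison per genome
        if query in sequence:
            hits.append(name)
    return hits

print(linear_scan("ATGCG", genomes))  # ['genome_A', 'genome_B']
```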

Internet search engines ran into this problem 20 years ago, as the amount of information on the World Wide Web grew. Initially, they indexed the contents of the network by recording which words each document contained. To find a page by keyword, the system looked for it first in one document, then in the next, and so on. As the total number of web pages grew, this method worked slower and slower…

Then search engines changed their approach: they turned the indexing process upside down, creating a so-called "inverted index". The principle is simple: instead of keeping a list of the words contained on each page, each word is assigned a list of the pages on which it occurs.
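In code, the idea might look roughly like this – a toy sketch with made-up documents, assuming words are separated by spaces:

```python
# Minimal sketch of an inverted index: each word maps to the set of documents
# that contain it, so lookup no longer requires scanning every document.
from collections import defaultdict

documents = {
    1: "fast online search is fast",
    2: "search engines index the web",
    3: "the web keeps growing",
}

inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():              # words are separated by spaces
        inverted_index[word].add(doc_id)   # word -> documents containing it

print(sorted(inverted_index["search"]))  # [1, 2]
print(sorted(inverted_index["web"]))     # [2, 3]
```

Looking up a keyword now means fetching one ready-made list from the index instead of rereading every document.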

This greatly speeds up the search, but there are subtleties that complicate the picture. For Russian (or English), it is easy to determine word boundaries – words are separated by spaces. The same cannot be said about genetic information – there are no "spaces" in genomes. So the question arises: what counts as a "word"?

This is where Chinese – a language notoriously hard to learn and written virtually without spaces – and the principles of Chinese search engines come to the rescue. To index text without delimiters (spaces), it can be split into n-grams – fragments n characters (letters) long. For example, after splitting the text into 2-grams (two-letter words), a search for the three-letter keyword ABC can be performed as a search for the words AB and BC. This is exactly how some Chinese search engines work.
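A rough Python sketch of the same trick, with invented strings standing in for real documents:

```python
# 2-gram indexing of text with no word boundaries, following the ABC -> AB + BC
# example above.
from collections import defaultdict

def ngrams(text, n=2):
    """All overlapping fragments of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

documents = {1: "XABCY", 2: "ABXBC", 3: "ZZABZZ"}

index = defaultdict(set)
for doc_id, text in documents.items():
    for gram in ngrams(text, 2):
        index[gram].add(doc_id)

# To search for "ABC", look up its 2-grams "AB" and "BC" and intersect the results.
query = "ABC"
candidates = set.intersection(*(index[g] for g in ngrams(query, 2)))
print(candidates)  # {1, 2}
```

Note that document 2 contains both AB and BC but not the contiguous string ABC, so candidates found through an n-gram index generally still need to be verified against the original text.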

But how many letters should each "word" of the genome contain – how long are the fragments the search engine should index? A one-letter split gives only 4 different "words" – the nucleotides A, T, C and G. With such a split, searching for longer "keywords" would take far too much time and may be practically impossible.

The optimal n-gram length can be derived from the statistical distribution of words in a text (including a genetic one), which obeys Zipf's law.

Zipf's law is named after its discoverer – the American linguist George Kingsley Zipf of Harvard University. It is an empirically observed pattern in the frequency distribution of natural-language words: if all the words of a language, or of a sufficiently long text, are ordered by decreasing frequency of use, then the frequency of the nth word in such a list is approximately inversely proportional to its ordinal number n (the so-called rank of the word). For example, the second most frequently used word occurs about half as often as the first, the third about a third as often as the first, and so on.
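In other words, the frequency of the word with rank r is roughly f(1)/r. A toy check of this relation might look as follows; the short sample phrase only shows the mechanics, and a meaningful test would need a long text (a Wikipedia dump, for example):

```python
# Count word frequencies, order them by rank and compare with the Zipf
# prediction f(rank) ~ f(1) / rank.
from collections import Counter

sample = "the cat sat on the mat and the dog sat near the cat".split()
ranked = Counter(sample).most_common()   # [('the', 4), ('cat', 2), ('sat', 2), ...]

f1 = ranked[0][1]                        # frequency of the most frequent word
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"rank {rank}: {word!r} observed {freq}, Zipf prediction {f1 / rank:.1f}")
```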


The graph shows the frequency of occurrence of words in the Russian Wikipedia

Indeed, in a sufficiently large document about 50% of the words occur only once. This can be used to choose the length of the genome "word" for indexing.


Dependence of the number of words occurring in the genome only once on the selected word length.

So, when the Chinese text is split into 1-grams, the number of "words" occurring only once is well below 50%; for 2-grams this value is about 50%; and for 3-grams it is again much less than 50%. Therefore, the "golden mean" is to split the text into two-letter words.

Applying the same principle to the genomes of Arabidopsis, Aspergillus, Drosophila and the mouse, researchers from SOSO.com (one of the three largest search engines in China) concluded that here it is optimal to use 12-letter "words".
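A hedged sketch of this criterion in Python: for each candidate word length n, compute the share of distinct n-grams that occur in the sequence exactly once and pick the n whose share is closest to one half. The random sequence below is only a stand-in for a real genome; on the genomes listed above this procedure, according to the article, points to 12-letter words.

```python
# Pick the n-gram length at which roughly half of the distinct "words" of the
# sequence occur exactly once.
import random
from collections import Counter

random.seed(0)
sequence = "".join(random.choice("ATCG") for _ in range(100_000))  # stand-in for a genome

def share_of_unique_ngrams(seq, n):
    """Fraction of distinct n-grams of the sequence that occur exactly once."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(counts)

# Choose the word length whose share of once-occurring words is closest to 0.5.
best_n = min(range(1, 16), key=lambda n: abs(share_of_unique_ngrams(sequence, n) - 0.5))
print(best_n, round(share_of_unique_ngrams(sequence, best_n), 2))
```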

Applying this approach in bioinformatics does not require creating any new technology – existing systems (including open-source ones) are quite capable of handling the task. As the volume of processed data grows, so does the need for a modern approach to search, and it looks like SOSO intends to occupy this niche.

Portal "Eternal youth" http://vechnayamolodost.ru01.07.2010

