01 October 2015

In search of an ancestor

What you can learn about your roots with the help of personal genomics

Alexander Ershov, N+1 

Last time we talked about the technologies that scientists use to read DNA: sequencing (complete or partial) and the closely related technology of genotyping, which reads only individual, but particularly informative, "selected places" in your genome. Both technologies produce a huge amount of information, and today we will talk about how scientists try to interpret it and how an ordinary person can make sense of it.

Companies engaged in direct-to-consumer (DTC) genomics, that is, those that work directly with consumers rather than with geneticists, are often the object of either unfounded hopes or, conversely, fierce but equally unfounded criticism. Both are explained by the fact that people do not know what to expect from the results of genotyping or sequencing. Let's try to figure it out.

Let's start with the necessary basics. All of a person's genetic information can be divided into three unequal parts. First, there are the "ordinary" chromosomes (autosomes), which each person has in two copies: one received from the mother and one from the father. Second, there are the sex chromosomes, X and Y. An egg always carries exactly one X chromosome, while a sperm cell carries one copy of either X or Y. Accordingly, the Y chromosome is inherited only through the male line; women do not have it. The last and smallest part of the genome is mitochondrial DNA, that is, the DNA still carried by mitochondria, the descendants of symbiotic bacteria that were enslaved billions of years ago. We get all our mitochondria from the mother, so mitochondrial DNA is inherited only through the maternal line. As you can see, there is an elegant symmetry here.

Why is it important to know about these three parts of the genome? Because they tell us about our origins in very different ways. In each generation the autosomes are shuffled (recombination), whereas mitochondrial DNA and most of the Y chromosome do not take part in this process. In our own genome, the homologous chromosomes received from the father and from the mother exist separately, but what we pass on to our children is a "mix" of the two, on average fifty-fifty. The shuffling does not respect the boundaries of chromosomes or genes: the breakpoints fall at random, so it can happen that we pass on to a child half of a certain gene from our mother and half from our father. Accordingly, it is impossible to build a family tree of an individual from autosomes: such a tree simply does not exist. It exists for each individual gene, but not for a person as a whole. In this sense the autosomal genome resembles a solution whose composition we can determine quite accurately, but whose history of mixing is almost impossible to reconstruct (although we can estimate the "degree of mixing", but more on that later).

With mitochondrial DNA and the Y chromosome the situation is completely different. These parts of the genome gradually accumulate mutations but do not take part in recombination. Therefore any new mutation is preserved in all later generations of the carrier's descendants almost forever (unless a reverse mutation occurs). This is how haplogroups are formed: groups of people who carry a given mutation and descend from the person in whose genome it first appeared. Haplogroups form a branching tree in which the thinner branches are sub-variants of the thicker ones. Each of us can be matched to a leaf on the tree of mitochondrial haplogroups, and men can also be matched to a leaf on the tree of Y-chromosome haplogroups.
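How a person is matched to a branch of such a tree can be illustrated with a small sketch. The tree, the haplogroup names and the marker names below are invented purely for illustration; real haplogroup trees are far larger, and real classification has to deal with missing and contradictory markers.

```python
# Hypothetical toy tree: every haplogroup is defined by one marker mutation
# and hangs off a parent haplogroup. All names here are made up.
TREE = {
    "root": {"parent": None, "marker": None},
    "A": {"parent": "root", "marker": "m1"},
    "B": {"parent": "root", "marker": "m2"},
    "A1": {"parent": "A", "marker": "m3"},
    "A2": {"parent": "A", "marker": "m4"},
    "A1a": {"parent": "A1", "marker": "m5"},
}


def children(node):
    """Return the haplogroups that branch directly off the given one."""
    return [name for name, info in TREE.items() if info["parent"] == node]


def assign_haplogroup(mutations):
    """Walk down from the root, following the branch whose marker is present."""
    node = "root"
    while True:
        matching = [c for c in children(node) if TREE[c]["marker"] in mutations]
        if not matching:
            return node  # no deeper branch is supported by the mutations
        node = matching[0]


def lineage(node):
    """Return the path from the root of the tree down to the given haplogroup."""
    path = []
    while node is not None:
        path.append(node)
        node = TREE[node]["parent"]
    return path[::-1]


# A person carrying markers m1, m3 and m5 ends up on the thinnest branch.
person = {"m1", "m3", "m5"}
print(assign_haplogroup(person), lineage(assign_haplogroup(person)))
```

Because mutations are inherited without recombination, a carrier of the marker of a thin branch also carries the markers of every thicker branch above it, which is why a simple walk from the root is enough in this idealized case.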

If you think a little about how these parts of the genome are inherited, it becomes clear that somewhere in the past there must have been a woman who is the most recent common ancestor of all living people in the maternal line. The same holds for the Y chromosome: there must have been a man whose Y chromosome was inherited by all living men. The reader, of course, knows these characters: they are the so-called mitochondrial Eve and Y-chromosomal Adam.

Despite the big names, there is nothing unusual in the existence of such people: this is a simple consequence of the nature of inheritance of mitochondria and Y chromosomes. For any group of people (and multicellular organisms in general), such ancestors must exist.

It may seem that mitochondrial Eve and Y-chromosomal Adam must have differed from their contemporaries in some way, or at least been the founders of some new community that separated from the rest of the human population. That is not so. If you look at the inheritance scheme, it becomes clear that for any population that has split, the mitochondrial "Eve" (or the Y-chromosomal "Adam") lived before the split. And if we take into account the inevitably random nature of inheritance, it turns out that there is no reason to consider these people special.
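That a single maternal line takes over purely by chance can be checked with a small simulation. The sketch below is a deliberately simplified model, not a reconstruction of any real study: the population size is constant, and every female of the next generation simply picks her mother at random from the previous one. The function names, the population size of 15 and the number of generations are assumptions chosen for illustration.

```python
import random


def simulate_mothers(pop_size=15, generations=200, seed=1):
    """Build a random pedigree of maternal lines.

    Element [g][i] is the index of the mother (in generation g)
    of female i in generation g + 1.
    """
    rng = random.Random(seed)
    return [[rng.randrange(pop_size) for _ in range(pop_size)]
            for _ in range(generations)]


def generations_back_to_eve(mothers):
    """Trace the maternal lines of the last generation back until they merge.

    Returns the number of generations between the last generation and its
    most recent common mother, or None if the lines do not merge within
    the simulated window.
    """
    lineages = set(range(len(mothers[-1])))
    for steps, layer in enumerate(reversed(mothers), start=1):
        lineages = {layer[i] for i in lineages}  # step one generation back
        if len(lineages) == 1:
            return steps
    return None


for seed in range(3):
    print(seed, generations_back_to_eve(simulate_mothers(seed=seed)))
```

Nothing in the model marks the common mother as special: she is an ordinary member of her generation whose line happened to survive while the lines of her contemporaries died out.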


A fictional example of a family tree of mitochondrial DNA variants. In each generation there are 15 females in the population, each of which can have zero, one, two or three daughters. All females of the 16th generation descend from the mitochondrial Eve of generation No. 2. A drawing from A. Markov's book "Human Evolution. Book 1. Monkeys, Bones and Genes" (VM).

At this point we could tell the story of how, where and when, according to the analysis of mitochondrial DNA and the Y chromosome, the human population split into different branches.

But, firstly, these data are quite well known: books have been written about them and even films have been made. And secondly, this is the first thing that any DTC genomics company will tell you in popular form. Therefore it will be more interesting to tell a few stories here about what information about one's origins can be extracted from the genome when modern methods of analysis are applied.

The first story: The Cohen Brothers

"And Moses said to the Lord, O Lord! I am not a talkative person [...]: I speak hard and am tongue-tied. And the anger of the LORD was kindled against Moses, and he said, Have you no brother Aaron, a Levite? I know that he can speak, and behold, he will come out to meet you, and when he sees you, he will rejoice in his heart; you will speak to him and put words in his mouth, and I will be with your mouth and with his mouth, and I will teach you what to do; and he will speak to the people instead of you; so he will be your mouth, and you will be to him instead of God (Exodus 4:10, 14-16)."

This is how Aaron, the brother of Moses, first appears in the Torah and the Old Testament. According to Jewish tradition it was Aaron who became the founder of the priestly class, which came to be known as the Cohens. The status of a Cohen was passed down from generation to generation strictly along the male line, from father to son. And although after the destruction of the Temple the Cohens were no longer able to fulfill their vocation, their prestigious status remained, and the name of the class gave rise to many surnames: Kohen, Kahana, Kun, Katz, and so on.

But what does personal genomics have to do with it? The point is that the status of a Cohen is inherited in the same way as the Y chromosome, which makes it possible to check, firstly, whether the living Cohens share a common origin and, secondly, when their most recent common ancestor lived. Such work was indeed carried out back in the late 1990s, even before DNA microarrays became widespread and personal genomics appeared.

It turned out that almost half of the modern people who call themselves Cohens carry a rather rare variant of the J1 haplotype, while among other Jews the frequency of this variant is several times lower. The frequency of this haplotype is almost the same among Sephardim and Ashkenazim, although these groups have been largely isolated from each other for the past 500 years. The Levites (that is, the descendants of Levi, the tribe to which both Aaron and Moses belonged) are a significantly more heterogeneous group.

Moreover, from the diversity that has accumulated within the Cohen haplotype, geneticists were able to calculate its age: about 2.5 thousand years (with 95 percent probability it lies in the range of 2,100-3,250 years), which agrees well with the time of the Exodus from Egypt and the founding of the First Temple.
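The idea of dating a lineage from its accumulated diversity can be sketched as follows. This is a textbook-style simplification and not necessarily the calculation used in the Cohen study: it assumes that the markers are short tandem repeats changing by one repeat at a time, that all sampled lines descend independently from the founder, and that the most common (modal) haplotype is the founder's. The sample haplotypes, the mutation rate of 0.002 per marker per generation and the generation length of 25 years are all hypothetical values chosen for illustration.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common repeat count at each marker, taken as the ancestral state."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*haplotypes)]


def estimate_age_generations(haplotypes, mutation_rate):
    """Estimate the age of a lineage from the average squared distance (ASD).

    Under a single-step mutation model the expected squared difference
    between a descendant and the founder grows as mutation_rate * time,
    so time is approximately ASD / mutation_rate.
    """
    modal = modal_haplotype(haplotypes)
    n_values = len(haplotypes) * len(modal)
    asd = sum((x - m) ** 2
              for haplotype in haplotypes
              for x, m in zip(haplotype, modal)) / n_values
    return asd / mutation_rate


# Hypothetical repeat counts at three markers for four men.
sample = [
    [14, 16, 23],
    [14, 16, 23],
    [14, 17, 23],
    [13, 16, 23],
]
generations = estimate_age_generations(sample, mutation_rate=0.002)
print(round(generations), "generations, about", round(generations * 25), "years")
```

The result is only as good as the assumed mutation rate and generation length, which is one reason why such estimates come with wide confidence intervals.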

But the Cohen case is quite specific. Can a similar study be carried out on other populations? It turns out that it can, and quite easily.

Studies of the population of Great Britain and Ireland have shown that in many (although not all) cases the bearers of the same surname show a sharply increased frequency of some Y-chromosome haplotype compared with the rest of the population. Of course, such studies are complicated by the fact that some surnames were adopted independently by different "founders", that surnames are passed on not only to biological but also to adopted children, and so on. It is all the more surprising that in practice the approach works. For example, almost 90 percent of Britons with the surname Attenborough carry haplogroup E1b1b1, and 95 percent of Herricks carry haplogroup I.

Many projects run by genomics enthusiasts, such as Ysearch, Ybase or the academic YSTR, are dedicated to building databases of surname matches and searching for relatives. How well this works is shown by the fact that in 2013, using open data from such databases, researchers demonstrated that genomic data can be deanonymized quite easily by predicting the owner's surname from them.

The second story: Genomic GPS

One of the most interesting illustrations of how genetic data can be used to analyze origins is an article by Carlos Bustamante's group from Cornell University. The article was published quite a long time ago, in 2008, but for some reason it is still not as well known as it deserves to be.

The work analyzes the genetic diversity of the modern population of Europe and how this diversity correlates with the geography of residence. For a person who is interested in his origins it is, of course, nice to know that he belongs, for example, to the Y-chromosome haplogroup R1a, which means that his paternal line most likely goes back to the population of eastern Europe. But that is not enough. It is all the more disappointing to receive such scant information when we know that the analysis simply threw away the lion's share of the genome.

As we have already seen, using genetic polymorphisms in autosomes is not as easy as working with mitochondrial DNA and the Y chromosome: there is nothing like a haplogroup tree here. On the other hand, the diversity of SNPs (single-nucleotide polymorphisms) in autosomes is much higher than in the other parts of the genome, simply because those parts are much smaller. This means that the resolution of a genetic analysis based on autosomes can potentially be much higher.

Carlos Bustamante's group solved this problem very elegantly. The work analyzed the genomes of more than three thousand Europeans, collected as part of the POPRES project. The data were obtained with conventional microarrays covering about half a million SNPs, of the kind now used by almost all DTC genomics companies. Out of the three thousand people, the researchers selected only those whose grandparents all came from the same country and rejected those whose genomes showed traces of recent mixing. That left 1,387 people, and the analysis was based on their data.

The mechanics of this analysis are quite simple. Let's imagine the set of SNPs of one person as a sequence. From a mathematical point of view this is a vector of dimension ~500,000 × 1 (that is, the number of SNPs read by the chip × 1 person). If we collect the data of the entire sample, we get a matrix of dimension 500,000 × 1,387 (the number of people selected for analysis). Principal component analysis (PCA) can then be applied to this matrix; its task is to find the directions along which the values in the matrix vary the most (for a two-dimensional point cloud the first principal component is the diagonal along which the cloud is elongated; for a refrigerator the principal component is its height, for a TV set its width, and so on).
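What the principal component method does with such a matrix can be shown on a toy example. The sketch below is an illustration under stated assumptions, not the pipeline of the original work: genotypes are coded as 0, 1 or 2 copies of an allele, the matrix is stored with people as rows (the transpose of the layout described above), and the first component is found by simple power iteration using only the standard library. The six "people" and five "SNPs" are invented.

```python
import math
import random


def center_columns(rows):
    """Subtract the mean of every SNP (column) from its values."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_pc_scores(genotypes, iterations=200, seed=0):
    """Coordinates of each person on the first principal component.

    Power iteration is run on the person-by-person matrix of dot products
    of the centered genotypes; its leading eigenvector, scaled by the
    square root of the eigenvalue, gives the scores on the first component.
    """
    x = center_columns(genotypes)
    n = len(x)
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    v = [rng.random() - 0.5 for _ in range(n)]  # random starting direction
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(g * vj for g, vj in zip(row, v)) for row in gram]
        eigenvalue = math.sqrt(sum(c * c for c in w))
        v = [c / eigenvalue for c in w]
    return [c * math.sqrt(eigenvalue) for c in v]


# Invented data: the first three people and the last three people
# come from two populations with different allele frequencies.
genotypes = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 1],
    [2, 2, 0, 1, 1],
    [0, 0, 2, 2, 1],
    [0, 1, 2, 2, 1],
    [0, 0, 2, 1, 1],
]
for score in first_pc_scores(genotypes):
    print(round(score, 2))
```

On this toy data the two groups land on opposite sides of zero along the first component, although the method was told nothing about who belongs where. For half a million SNPs one would use a singular value decomposition from a numerical library rather than hand-written loops.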

So, if you take all the genetic diversity of the 1,387 Europeans and compress it into two dimensions, you get the following picture. The first component is vertical: it "encodes" the largest share of the diversity and at the same time deviates by only a dozen degrees from the geographical north-south direction.

The genetic proximity of the inhabitants of Europe on a plot of the first two principal components.

Note that there is not a single bit of geographical data in this picture. The fact that it resembles a map of Europe so closely is entirely a consequence of the genetic data themselves, not of topographic information about the donors. Apart from a handful of misses, the resolution of such an analysis is truly surprising. The two-dimensional diagram shows the "boot of Italy", the "genetic Pyrenees" and the "genetic Balkans". Even the French-, German- and Italian-speaking inhabitants of tiny Switzerland are clearly distinguishable from one another.

In the same work the authors also performed the reverse procedure: they tried to predict a person's place of residence from his genetic data. To do this, the polymorphisms of one person were compared with the polymorphisms and places of residence of all the others. It turned out that for 90 percent of people the place of residence can be predicted to within 700 kilometers, and for half of those studied the error is less than 300 kilometers. This result is especially impressive when you realize that the work was based on the data of only about one and a half thousand people, and that the information about where they lived was very rough and approximate.
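The "one against all the others" logic of such a test can be sketched as follows. This is not the authors' prediction method: here a person is simply placed at the average location of the few people closest to him on the genetic plot, which is a swapped-in nearest-neighbour rule used only to show how the test is organized. The coordinates on the principal components, the places of residence and the number of neighbours are invented.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def predict_location(pc, others, k=2):
    """Average the homes of the k genetically nearest people.

    Averaging latitude and longitude directly is a rough shortcut
    that is acceptable only for small regions.
    """
    nearest = sorted(others, key=lambda other: math.dist(pc, other[0]))[:k]
    lat = sum(home[0] for _, home in nearest) / len(nearest)
    lon = sum(home[1] for _, home in nearest) / len(nearest)
    return (lat, lon)


def leave_one_out_errors(people, k=2):
    """Predict each person's home from everyone else; return errors in km."""
    errors = []
    for i, (pc, home) in enumerate(people):
        others = people[:i] + people[i + 1:]  # hide the person being tested
        errors.append(haversine_km(predict_location(pc, others, k), home))
    return errors


# Invented sample: ((PC1, PC2), (latitude, longitude)) for six people.
people = [
    ((0.0, 0.0), (48.0, 2.0)),
    ((0.1, 0.0), (48.5, 2.5)),
    ((0.0, 0.1), (47.5, 2.0)),
    ((5.0, 5.0), (60.0, 25.0)),
    ((5.1, 5.0), (60.5, 25.0)),
    ((5.0, 5.1), (60.0, 24.5)),
]
print([round(error) for error in leave_one_out_errors(people)])
```

The key point is that the person being tested is removed from the reference set before the prediction is made, so the error measures how well the rest of the sample locates a genuinely new individual.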

Fortunately, these data did not remain on the pages of the journal. They are available to anyone through several specialized genome analysis services. Anyone who has taken a genetic test and has their data "on hand" can try to find themselves on the genetic map of Europe.

Here, however, three limitations should be borne in mind. Firstly, for non-Europeans such an analysis is meaningless: you need to use a map that matches your own background (or start with the world map, which is, of course, also available). Secondly, people whose genome mixes several lineages that are distant from each other should interpret the results with care, since their position on the first two components can be far off (after all, PCA treats the genome as a whole, as a homogeneous "solution"). Thirdly, the positioning accuracy depends on the level of detail of the underlying database. For example, very few people of Slavic origin took part in the original work, so residents of Eastern Europe cannot count on good resolution. Fortunately, data have recently appeared that can potentially tell such people much more.

The third story: The Farewell of the Slavs

This story is about an article by a large international group led by Oleg Balanovsky of the Institute of General Genetics of the Russian Academy of Sciences.

The work was published just a month ago and is devoted to the genetic diversity of speakers of the Balto-Slavic languages. It not only focuses on residents of Eastern Europe but also includes data from significantly more people than the Bustamante project, and the range of data is much wider: autosomes, mitochondrial DNA and the Y chromosome. In addition, some data from comparative linguistics were used in the analysis.

According to linguistics, the Balto-Slavic languages split off as a separate branch of Indo-European about 5-7 thousand years ago. Then, about 3 thousand years ago, they divided into the Baltic (today's Latvian and Lithuanian) and Slavic branches. The latter divided into southern (Serbian, Bulgarian, Macedonian and others), western (Polish, Czech, Slovak and others) and eastern (Russian, Ukrainian, Belarusian) branches, with the most recent division dating to about the 12th-14th centuries.

By linguistic and historical standards, the period over which the Balto-Slavic languages formed is certainly long. On the time scales that genomics works with, however, it is not. Suffice it to say that most of the major mtDNA haplogroups that exist today arose tens of thousands of years ago, long before even the most ancient language groups appeared. It is all the more interesting that the resolution achieved in the project of the Russian geneticists turned out to be sufficient to draw some conclusions about today's speakers of Slavic languages and their ancestors.

Genetic proximity of speakers of Slavic and other European languages according to autosomes.
Plot of the first and third principal components.

It is worth recalling that language, ethnicity, nation, material culture and nationality are very different concepts that stand in extremely complex relationships to one another. Linguists and historians often encounter cases where some peoples adopt the languages of others, where different peoples assimilate someone else's material culture, merge, separate and form different states. Therefore, when interpreting the results, one has to pay attention to what kind of kinship is being discussed: linguistic, genetic or cultural.

Prediction of the place of residence of each study participant based on principal component analysis. An "all-against-one" test.

The work was based on the same principal component analysis of polymorphisms. This time, however, the authors chose not the first two components but the first and the third, because these give the best resolution among the Slavs in the sample used. In addition, some groups (for example, native speakers of Russian) were divided into subgroups by geography. Three separate maps were made for the three parts of the genome. The maps for autosomes and for the Y chromosome turned out to be quite similar, whereas the picture for mitochondrial DNA differs from them (and has significantly lower resolution).

A map of the settlement of Slavic-speaking peoples in Europe and genotyping centers that participated in the work.

What did they find out? Firstly, the genetic data generally confirmed what could be assumed from linguistics: the Balto-Slavs do form a relatively compact cluster among the surrounding populations. However, genetic boundaries are not always as sharp as linguistic ones. For example, there is a fairly strong genetic divide between Poles and Germans, while the border between the same Germans and Czechs is smoother, which points to a long exchange of genes.

Secondly, it turned out that within Eastern Europe genetic proximity is predicted quite well by geographical proximity. Conversely, groups that are linguistically close may be genetically distant if they are separated by large distances. Geneticists usually explain such a picture by the presence of a so-called substrate, that is, a population that inhabited the territory before the languages under study spread to it. According to this scenario, the descendants of the "aboriginal" population on the whole stayed where they had lived before and retained their genetic characteristics, but adopted a new language and, possibly, a new culture.

Thirdly, it turned out that in the genetic sense the southern Slavs are quite distant from the western and eastern Slavs, while the latter form a very compact cluster to which the speakers of the Baltic languages are very close. To a large extent this can be explained by the same "effect of distance" already mentioned: the southern Slavs who inhabit the Balkans are separated from the western and eastern Slavs by non-Slavic-speaking peoples (Romanians, Hungarians), with whom they share almost as much autosomal DNA as with their northern linguistic relatives.

And, fourthly, the distribution of genetic diversity among the Eastern Slavs turned out to be quite interesting. Native speakers of Russian were divided in the work into three geographical groups: northern, central and southern. On the resulting genetic map these three groups are stretched into a very long and narrow line that runs from the main cluster of Ukrainians, Belarusians and Poles towards the Finns, Karelians and Komi. The southern subgroup of Russians is almost indistinguishable from the other Eastern Slavs.

Evidently this reflects large-scale mixing or assimilation in the north of present-day Russia, which led to greater genetic heterogeneity among native speakers of Russian compared with other Eastern Slavs. For many Russian speakers who want to learn more about their roots, such a picture can be very interesting. Potentially it allows you to clarify your origins even when all of your known recent ancestors were native speakers of the same (Russian) language. For example, it is possible to estimate which of the three subgroups a given person is closest to: the northern, the central or the southern.

Fortunately, the data collected in this work are freely available. Right now you most likely will not be able to use them, because that requires someone to build a service and adapt the data. But judging by the example of the genetic map of western Europe, which is already available on many services, the wait should not be long.

Portal "Eternal youth" http://vechnayamolodost.ru