07 April 2017

Nationality by DNA

Do Jews buy genetic tests?

In our experience, determining a person's population membership from DNA raises three big questions for the public: whether genes can be linked to ethnic groups at all, how ancestry analysis works from a technical point of view, and whether genetic tests are able to "identify Jews". For some reason, the question of Jewish identity by DNA is of great concern both to those who have indisputable evidence of belonging to the chosen people and to those who neither eat matzo nor read the Torah.

In this new Genotek article on Geektimes, we will try to answer all of these questions in order. And yes, we will identify the Jews too.

hebrew1.jpg

Races (population groups) in biology, medicine and genetics

Humanity has a bad habit of justifying violence by the "innate" superiority of one race over another, which is why modern biologists approach the issue of genetic differences between populations with a sapper's caution. The existence (or non-existence) of biological boundaries between racial and ethnic groups was fiercely debated throughout the 20th century, and no final consensus on this issue has been reached yet [1].

There were hopes that sequencing the human genome would reconcile everyone: a genome read from beginning to end would show that the boundaries between groups are social in nature and that the genes are the same for everyone. It turned out differently: a careful study of the human nucleotide code revived and intensified interest in the biological differences between racial and ethnic populations. Genes that are broadly the same turned out to have slightly different allelic variants associated with disease risk [2], drug metabolism [3] and the body's response to environmental conditions [4], and these variants occur in different populations at different frequencies.

The search for non-existent "Indian" or "African" genes has stopped, but research in medical and population genetics continues to draw parallels between the biological characteristics of participants and their ethnicity. The use of the terms "race" and "ethnicity" in such work is actively discussed (and often condemned). There have been attempts to introduce rules that oblige researchers to justify the use of these "slippery" categories and to spell out exactly what each term means. In February last year, Science, one of the most authoritative natural science journals, published a controversial article [5] suggesting that the term "race" should be abandoned in genetic research altogether and replaced with the more accurate and neutral "ancestry".

But even while the terminology remains unsettled, it is still necessary to divide humanity into population groups, in particular to conduct clinical drug trials and assess disease risk correctly. For example, three allelic variants of the NOD2 gene (R702W, G908R and 1007fs) are associated with an increased risk of Crohn's disease in Americans of European descent [6,7]; however, none of these variants is associated with Crohn's disease in Japanese patients [8]. Alleles of the CCR5 gene are known to affect the rate at which immunodeficiency develops in HIV-infected patients [9]: among them, a variant was found that slows the progression of the disease in Americans of European descent but accelerates it in African Americans [10]. In Asians, a correlation was found between polymorphisms of the gene for the p53 protein, which regulates the stress response and suppresses tumor development, and average winter temperatures in the areas the populations inhabit, a sign of genetic adaptation to cold [11]. And whereas in the past a sample was divided into ethnic groups using only information reported by the participants themselves, in the post-genomic era this information is increasingly supplemented and refined by a genetic assessment of each subject's ancestry.

Genetic variations between populations

In everyday life, we divide people into groups by appearance or by the language they speak. Most Danes look more like each other than any of them looks like an Italian (here's a cool visualization with averaged portraits of different nationalities). Danes and Italians, in turn, are much closer to each other than either is to the inhabitants of sub-Saharan Africa: human phenotypes cluster according to a geographical pattern. The distribution of genotypes has a similar structure: members of a local group are, as a rule, more closely related than residents of remote areas, and populations inhabiting one region are closer to each other than those whose habitats are separated by geographical barriers (for example, a mountain range or a body of water).

At the same time, the genetic diversity of the human population is lower than that of many biological species. This is because humanity is a young species: individual groups have had relatively little time to accumulate differences. Two randomly selected people differ at roughly one in every 1,000 nucleotides, whereas two chimpanzees differ at about one in every 500 "letters". Even so, that adds up to about 3 million potential "points of divergence" in the human genome. Most of these differences, called single nucleotide polymorphisms (SNPs), are neutral or nearly neutral, but some of them are responsible for phenotypic differences between people.
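The figure of about 3 million follows from simple arithmetic. The sketch below assumes a round genome length of 3 billion nucleotides (a value not stated in the text) and, for the chimpanzee comparison, a genome of similar size; the function name is purely illustrative.

```python
# Back-of-the-envelope estimate of how many positions differ between two genomes.
GENOME_LENGTH_BP = 3_000_000_000  # assumed round figure for the human genome


def expected_differences(genome_length_bp: int, one_in_n: int) -> int:
    """Expected number of differing positions if one in `one_in_n` nucleotides differs."""
    return genome_length_bp // one_in_n


human_pair = expected_differences(GENOME_LENGTH_BP, 1000)
chimp_pair = expected_differences(GENOME_LENGTH_BP, 500)  # assumes a genome of similar size

print(human_pair)  # 3000000
print(chimp_pair)  # 6000000
```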

The distribution of neutral polymorphisms in the world population reflects the demographic history of our species: because they carry no biological meaning, they are not subject to directional selection and are simply carried along by the winds of migration. Genetic and archaeological evidence indicates that the human population has grown significantly over the last 100,000 years. People settled outside Africa and colonized the rest of the world. This process affected the geographical distribution of alleles in two ways. First, there was the "founder effect": a population of settlers, as a rule, carried only part of the genetic variants present in the ancestral population. Second, there was so-called assortative mating: pairs formed mainly within their own group, which limited the spread of existing and newly arising (de novo) polymorphisms between individuals living in different geographical areas. These processes led to a gradual accumulation of genetic differences.
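The founder effect can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the variant frequencies are invented, each founder contributes two independently drawn chromosomes, and the function name is hypothetical. It counts how many variants from the ancestral pool are carried by at least one founder.

```python
import random


def variants_carried_by_founders(ancestral_freqs, n_founders, rng):
    """Count variants present on at least one of the 2 * n_founders sampled chromosomes."""
    carried = 0
    for freq in ancestral_freqs:
        # A variant survives the migration if any founder chromosome carries it.
        if any(rng.random() < freq for _ in range(2 * n_founders)):
            carried += 1
    return carried


rng = random.Random(42)
# Hypothetical ancestral pool: 2000 variants, each fairly rare (0.1% to 5%).
ancestral_freqs = [rng.uniform(0.001, 0.05) for _ in range(2000)]

small_group = variants_carried_by_founders(ancestral_freqs, 10, rng)
large_group = variants_carried_by_founders(ancestral_freqs, 500, rng)
print(small_group, large_group)
```

With these made-up frequencies, a group of ten founders carries noticeably fewer of the variants than a group of five hundred, which is the effect described above.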

Genomic markers began to be studied in the context of population groups in the 1970s and 1980s, and in the 1990s they began to be used to determine which population a particular person belongs to. Researchers have repeatedly shown that genetic polymorphisms can successfully distinguish population groups and determine an individual's group membership. It was also shown that people living on the same continent are, as a rule, genetically closer to each other than people from different continents. In the early studies, each subject's place of birth, race and ethnic group were known in advance and used together with the genetic data; when subjects were assigned to clusters "blindly", solely on the basis of genetic traits, the correspondence between geographical origin, ethnicity and population structure was less obvious. As further studies showed, success depended on the genetic markers used and their number (more is better), the correct choice of reference populations and other factors [12].

By 2004, genetic determination of population affiliation was being used in the USA not only in biomedical research but also in criminal investigations: this article from Nature tells an exciting story of how the police, desperate to find a criminal, ordered a DNA test from a commercial company, determined the suspect's skin color and solved the case. Offers of genetic ancestry analysis successfully caught the wave of people's general interest in their own past. "Roots mania" is what an article in Time devoted to genealogical research, "America's latest obsession", called this hobby.

Genomic methods are actively used by specialists studying the origin and evolution of peoples. For example, in 2013 an international team of researchers used genetic analysis to refute the hypothesis that Ashkenazi Jews descend from the Khazars [13]. The genomic data set used by the authors is publicly available and represents more than 100 world populations. We invite you to run a small study with us: to find where Genotek customers fall within this sample and, along the way, to go through the technical details of determining population affiliation.

The purpose of the study

Determine the place of Genotek customers among the reference populations. Find out whether there are Ashkenazi Jews in our sample. Demonstrate the principles and methods of analyzing an individual's population affiliation.

Research objectives

Process the genotyping data of 722 subjects with the ADMIXTURE program, using the data set from Behar et al., 2013 as the reference sample.

Materials and methods

The original work by Behar et al., 2013 used data from 1,774 people, representing 88 non-Jewish populations (from Arabia, Central Asia, East Asia, Europe, the Middle East, North Africa, Siberia, South Asia and sub-Saharan Africa) and 18 Jewish populations. The authors needed an extensive data set to place the Ashkenazim accurately in the context of world populations: the task was to represent all three geographical regions from which this group could hypothetically originate, namely Europe, the Middle East and the Khazar Khaganate. The authors stressed that the samples had to be chosen differently in the two cases. Modern European, Middle Eastern and Jewish populations are direct descendants of their ancestral populations, whereas the Khazar Khaganate ceased to exist about 1,000 years ago, and the catch is that none of the populations existing today is its direct heir. As possible modern representatives of the Khazars, the authors chose residents of the South Caucasus (Abkhazians, Armenians, Azerbaijanis, Georgians) and the North Caucasus (Adygs, Balkars, Chechens, Kabardins, Ossetians and several other peoples), as well as Chuvash and Tatars. To this data set we added samples from 722 people from various regions of Russia.

For statistical analysis, we used the ADMIXTURE program, which allows us to estimate the most likely origin of an individual based on genotype data. In addition to it, the authors of the article under discussion used other statistical methods that gave a similar answer to the question posed. We will focus on ADMIXTURE, since it is this algorithm that allows us to estimate the percentage contribution of ancestral populations to the genomes under study.

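ADMIXTURE fits the same kind of statistical model as the older STRUCTURE program, but instead of sampling with Markov chain Monte Carlo (MCMC) methods it finds maximum-likelihood estimates with a fast numerical optimization algorithm. Here is a link to an article by the authors of the algorithm for those who want to understand the mathematical side of the process in more detail.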

Let's look at how ADMIXTURE works on the example of samples and populations from our set

In total, we have 2,496 samples (individuals), each of which belongs to one of 106 modern populations. We assume that modern populations most likely descend from a relatively small number of ancestral populations. "Ancestral populations" in this analysis are ancient genomic clusters grouped by genetic similarity. ADMIXTURE lets us either assume an arbitrary number of such clusters in the sample or select the number that best describes the real distribution of the genomic data.

Given the genotypes and the assumed number of "ancestral" populations (K), ADMIXTURE builds a model that estimates the contribution of each "ancestral" population to each sample. When interpreting the data, both the quantitative composition of a genome (the percentage of each cluster) and the qualitative one (the presence or absence of a cluster in specific genomes) are important. Based on these data, we can make assumptions about evolutionary processes in a population, in particular about whether population groups share common "roots". However, the conclusions are legitimate only if the model we have built is good, that is, if the optimal value of K has been selected.
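To make the idea of "contributions" concrete, here is a minimal sketch of the kind of likelihood such a model scores. It assumes the standard admixture formulation, in which the expected allele frequency for an individual at a SNP is the ancestry-weighted average of the cluster frequencies; the names `Q`, `P` and `log_likelihood` and all numbers are illustrative, and this is not ADMIXTURE's actual implementation.

```python
import math


def log_likelihood(genotypes, Q, P):
    """Log-likelihood of genotypes under a simple admixture model.

    genotypes[i][j]: copies (0, 1 or 2) of the counted allele in individual i at SNP j.
    Q[i][k]: fraction of individual i's genome attributed to cluster k (rows sum to 1).
    P[k][j]: frequency of the counted allele at SNP j in cluster k.
    Constant binomial coefficients are omitted because they do not depend on Q or P.
    """
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            # Expected allele frequency for this individual at this SNP.
            p_ij = sum(Q[i][k] * P[k][j] for k in range(len(P)))
            total += g * math.log(p_ij) + (2 - g) * math.log(1.0 - p_ij)
    return total


# Toy data with K = 2 clusters, 2 individuals and 3 SNPs (all values invented).
P = [[0.9, 0.8, 0.1],
     [0.1, 0.2, 0.7]]
genotypes = [[2, 2, 0],
             [0, 0, 1]]

Q_plausible = [[1.0, 0.0], [0.0, 1.0]]  # individual 0 from cluster 0, individual 1 from cluster 1
Q_swapped = [[0.0, 1.0], [1.0, 0.0]]    # the opposite assignment

print(log_likelihood(genotypes, Q_plausible, P) > log_likelihood(genotypes, Q_swapped, P))  # True
```

Fitting the model means searching for the `Q` and `P` that make this value as large as possible for a given K.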

Let's select the optimal value for K

How do we determine which number of "ancestral" populations best matches the true one for a given sample? Empirically!

ADMIXTURE is a smart program: having built a model of the genetic structure of the populations from the individuals' genotypes for a given K (that is, having estimated the contribution of each ancient genomic cluster to each genome in the sample), it does not forget to check the result against reality and see how well the constructed model describes the input data. The measure of comparison is the "error", a value describing the discrepancy between the model and the real data (ADMIXTURE can report it as a cross-validation error). The larger the error, the worse the assumed number of ancestral populations corresponds to reality.

How do we choose the optimal K value? We run the ADMIXTURE algorithm on the sample with different values of K and obtain a separate error value for each K. Then we plot the error against K. Here is the graph the authors of the article obtained:

hebrew2.jpg

The optimal value of K is at the minimum of the function. If there is no minimum on the chart (the function only increases or only decreases), you will have to build models with new values of K until you find the right one.
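Picking the minimum can be automated. The sketch below assumes the error is the cross-validation error that ADMIXTURE prints when run with its `--cv` option, and that the log lines look like `CV error (K=3): 0.545`; check this format against the output of your own version. The error values and the function name `pick_k` are made up for illustration and are not the authors' results.

```python
import re

# Assumed format of the cross-validation lines in the collected ADMIXTURE logs.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def pick_k(log_text):
    """Return (K with the smallest error, whether that K lies on the edge of the tested range)."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no 'CV error' lines found")
    best = min(errors, key=errors.get)
    # A minimum at the edge means the curve is still monotone: test more values of K.
    on_edge = best in (min(errors), max(errors))
    return best, on_edge


# Hypothetical log excerpt, one line per tested K.
example_log = """\
CV error (K=2): 0.560
CV error (K=3): 0.545
CV error (K=4): 0.552
"""

best, on_edge = pick_k(example_log)
print(best, on_edge)  # 3 False
```

If `on_edge` comes back true, the minimum has not really been found yet and the range of K should be extended, as described above.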

Even with an optimally matched K, the reliability of the analysis results depends on the correctness of the sample:

1. Individuals should not be related to each other.
2. Single nucleotide polymorphisms (SNPs) for which genotyping is performed should be evenly distributed across the genome with a sufficiently high density.
3. SNP alleles should be in linkage equilibrium, that is, the probability that a particular individual carries a given allele should depend only on the frequency of that allele in the population, not on other alleles in the genome.
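The third requirement is usually met by removing SNPs that are strongly correlated with one another. The sketch below is a simplified illustration rather than a production method: it uses the squared correlation of genotype values as a stand-in for linkage disequilibrium and compares every pair of SNPs, whereas real pipelines typically use a dedicated tool that works in windows along the genome. The threshold, the data and the function names are assumptions.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (values 0, 1 or 2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a SNP with no variation carries no correlation signal
    return cov * cov / (var_x * var_y)


def prune(snps, threshold=0.5):
    """Greedily keep a SNP only if its r^2 with every SNP kept so far is below the threshold."""
    kept = []
    for name, genotypes in snps:
        if all(r_squared(genotypes, other) < threshold for _, other in kept):
            kept.append((name, genotypes))
    return [name for name, _ in kept]


# Invented genotypes for six individuals at three SNPs.
snps = [
    ("snp_a", [0, 1, 2, 0, 1, 2]),
    ("snp_b", [0, 1, 2, 0, 1, 2]),  # identical to snp_a, so fully correlated
    ("snp_c", [2, 0, 1, 1, 2, 0]),
]

print(prune(snps))  # ['snp_a', 'snp_c']
```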

As can be seen from the graph, the optimal K for this sample was 10 "ancestral" populations.

Results

ADMIXTURE visualizes the results of the analysis like this (only part of the data is visible in the figure):

hebrew3.jpg

Each cluster has its own color, and populations differ (or do not differ) in the proportions of clusters in the genome. Here is an interactive version of the picture for detailed study: hover the mouse and scroll to see all the populations or to look at some of the groups in more detail.

Overall, within the Genotek "population", the ratio of clusters, as expected, corresponds to the pattern characteristic of populations of Eastern European origin. The fun begins at the level of individual samples:

hebrew4.jpg

Although the population closest to a given sample is determined precisely from numerical values, a lot of information can be obtained by visually comparing the patterns. We suggest that you determine for yourself, from the picture, the closest populations for the samples of four Genotek clients.

Answer
In this picture, samples 1 and 2 are of Asian origin: a predominance of the pink cluster is characteristic of the Japanese and the Han Chinese in our sample, and a predominance of the blue cluster is characteristic of the Yakuts. The third sample shows the ratio of components characteristic of Russians, Belarusians, Ukrainians and Poles, and the fourth is a typical Ashkenazi Jew. In total, we found 9 Ashkenazi Jews among the 722 samples.
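The numerical comparison mentioned before the quiz can be sketched as a nearest-neighbour search over cluster proportions. This is an assumption about one reasonable way to do it, not a description of Genotek's pipeline: the population names, the proportions (three clusters instead of ten, for brevity) and the choice of Euclidean distance are all hypothetical.

```python
import math


def closest_population(sample, population_means):
    """Return the population whose mean cluster proportions are nearest to the sample."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(population_means, key=lambda name: distance(sample, population_means[name]))


# Hypothetical mean cluster proportions per reference population (each row sums to 1).
population_means = {
    "population_A": [0.80, 0.15, 0.05],
    "population_B": [0.10, 0.75, 0.15],
    "population_C": [0.30, 0.30, 0.40],
}

sample = [0.72, 0.20, 0.08]  # cluster proportions estimated for one individual
print(closest_population(sample, population_means))  # population_A
```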

Conclusion

Population affiliation is far from the only factor determining a person's ethnic identity. However, it is still possible to identify a correlation between ethnic groups and the genome structure of their members. Such analysis is used for scientific and medical purposes, and by anyone who wants to explore their own roots. At the same time, it is important to understand that the models are constantly being improved, and that for greater accuracy the results should be considered together with other data, for example a family tree.

The authors of the original article found no evidence of the Khazar origin of the Ashkenazim. Genetic tests, of course, "know how" to identify Jews – however, we should not forget that "Jewishness" is, first of all, a state of mind.

In the near future, Genotek will launch an updated Genealogy DNA test with expanded results: we will bring the number of populations into the hundreds and add Jewish populations. We will update the information in the personal account of everyone who has ever submitted their genetic material to us. If you have not been genotyped yet, we invite you to join.

References

  1. Foster M., Sharp R. (2002). Race, Ethnicity, and Genomics: Social Classifications as Proxies of Biological Heterogeneity. Genome Res.

  2. Collins F.S., McKusick V.A. (2001). Implications of the Human Genome Project for medical science. JAMA.

  3. Nebert D.W., Menon A.G. (2001). Pharmacogenomics, ethnicity, and susceptibility genes. Pharmacogenomics J.

  4. Olden K., Guthrie J. (2001). Genomics: Implications for toxicology. Mutat. Res.

  5. Yudell M., Roberts D., DeSalle R., Tishkoff S. (2016). Taking race out of human genetics. Science.

  6. Ogura Y. et al. (2001). A frameshift mutation in NOD2 associated with susceptibility to Crohn’s disease. Nature.

  7. Hugot J.P. et al. (2001). Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn’s disease. Nature.

  8. Inoue N. (2002). Lack of common NOD2 variants in Japanese patients with Crohn’s disease. Gastroenterology.

  9. Martin M.P. et al. (1998). Genetic acceleration of AIDS progression by a promoter variant of CCR5. Science.

  10. Gonzalez E. et al. (1999). Race-specific HIV-1 disease-modifying effects associated with CCR5 haplotypes. Proc. Natl Acad. Sci. USA.

  11. Shi H. et al. (2009). Winter Temperature and UV Are Tightly Linked to Genetic Changes in the p53 Tumor Suppressor Pathway in Eastern Asia. American Journal of Human Genetics.

  12. Bamshad M., Wooding S., Salisbury B. et al. (2004). Deconstructing the relationship between genetics and race. Nat Rev Genet.

  13. Behar D.M. et al. (2013). No Evidence from Genome-Wide Data of a Khazar Origin for the Ashkenazi Jews. Human Biology.

Portal "Eternal youth" http://vechnayamolodost.ru  07.04.2017

