What is the difference between race and ancestry? How does ancestry inference work? What implications does it have for society? Home genetic testing kits such as 23andMe and Ancestry have exploded in popularity in recent years. Customers can learn about their ancestry, including the estimated percentages of their genomes that can be attributed to different geographic regions. Public interest in genetic heritage has helped the consumer genetics market grow into a $117 million industry as of 2017[1].

Beyond personal entertainment, ancestry inference is a crucial tool in the study of human evolution and population history. Understanding how migration and interbreeding events led to the exchange of genes, and to the genetic variation seen today, depends on identifying the origins of genetic sequences[2]. Ancestry inference is also important for disease genetics and drug development[3]. It can help explain why some populations are more predisposed to certain diseases than others and why responses to drugs differ between populations.

Race vs Ancestry

There is a key difference between race and ancestry. Race itself is not biological. A 2002 Stanford study found that most human genetic variation lies within populations rather than between them[4]. The scientists found that only 7.4% of over 4000 alleles (different variants of a gene) were specific to one geographic region, while roughly 92% of all alleles were found in two or more regions. Another series of studies, in which the first human genomes were sequenced, showed that two scientists of European descent were each more genetically similar to a scientist of East Asian descent than they were to each other[5]. A 2010 paper in Genomics examined haplotype heterozygosity, a measure of genetic diversity: the proportion of individuals in a population whose haplotype (a collection of alleles inherited as a block) from one parent differs from the haplotype passed down by the other parent. It found that haplotype heterozygosity decreased with distance from Africa, suggesting that an individual from Africa may be more similar to an individual from Europe than to another person from Africa[6]. This finding was expected: modern humans originated in Africa and migrated to other parts of the world, so younger populations have had less time to develop variation. In essence, because there is so much similarity between races and so much variation within them, race can only be defined as a social construct[7].

Unlike race, ancestry is less about categorizing people and more about unravelling the biogeographic history of genetic variation in a population. Because populations lived in relative isolation before transcontinental travel became ubiquitous, each geographic region has its own genetic fingerprint. Through admixture events, in which distant populations interbred, humans alive today carry genes from many different regions. Researchers refer to the current human genome as a “mosaic” of segments originating from around the world[8]. Ancestry studies aim to find where these segments came from.

Ancestry-Informative Markers

Looking at the genome as a whole, it is nearly impossible to tell where a person’s ancestors originated, especially if the individual is admixed. The answer lies in specific parts of your DNA called ancestry-informative markers (AIMs)[9]. AIMs are a subset of single nucleotide polymorphisms (SNPs), which are single-base genetic variants that occur in more than 1% of a population. As a result of differences in environment and evolutionary history, some SNPs occur at higher frequencies in certain populations. These SNPs are known as AIMs.

It is important to draw a line between AIMs and race. While it is true that there are genetic markers in your genome that may hint at where your ancestors came from, there is no single gene that is categorically indicative of your race. An AIM occurs more frequently in a certain population, but this does not mean that 100% of that population carries the SNP. Using AIMs, scientists can separately predict where individual parts of your DNA may have originated. Combining these discrete inferences yields an estimate of your genome’s unique biogeographic ancestry as a whole. The process of finding the proportions of your DNA that came from each source population is known as global ancestry inference. Local ancestry inference (LAI) refers to the study of the origin of individual chromosomal segments.
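To make the relationship between local and global ancestry concrete, here is a minimal Python sketch. It assumes that local inference has already produced a list of (population label, segment length) pairs, and that global proportions are simply length-weighted shares of those labels. The function name, the labels and the toy lengths are all hypothetical, and real tools may weight segments differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def global_ancestry(segments: List[Tuple[str, int]]) -> Dict[str, float]:
    """Turn local calls (population label, segment length) into proportions.

    Assumption: each segment counts in proportion to its length.
    """
    totals: Dict[str, float] = defaultdict(float)
    for population, length in segments:
        totals[population] += length
    genome_length = sum(totals.values())
    return {population: total / genome_length for population, total in totals.items()}


# Hypothetical local ancestry calls along one chromosome.
local_calls = [("pop_A", 300), ("pop_B", 100), ("pop_A", 500), ("pop_C", 100)]
proportions = global_ancestry(local_calls)
print(proportions)
```

In this toy case, 800 of the 1000 length units are labeled pop_A, so the global estimate is 80% pop_A and 10% each for pop_B and pop_C.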

To find AIMs, researchers need a large reference database of DNA sequences from a variety of populations. Every sequence must carry a verified ground-truth ancestry label, which lets scientists search the data for SNPs that correlate with particular populations. It is essential to select a diverse range of relevant AIMs that will yield correct estimates. For instance, the SNPforID 34-plex is a panel of 34 SNPs that can accurately distinguish between African, European, and East Asian ancestry[10].
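As a rough illustration of how a labeled reference panel can be searched for candidate AIMs, the sketch below ranks SNPs by the absolute difference in allele frequency between two populations, one simple informativeness score among several in use. The panel, population names and genotype coding (0, 1 or 2 copies of the alternate allele) are invented for the example, and this is not the selection procedure of any particular published panel.

```python
from typing import Dict, List, Tuple


def allele_frequency(genotypes: List[int]) -> float:
    """Alternate-allele frequency from genotypes coded as 0, 1 or 2 copies."""
    return sum(genotypes) / (2 * len(genotypes))


def rank_candidate_aims(
    reference: Dict[str, Dict[str, List[int]]], pop_a: str, pop_b: str
) -> List[Tuple[str, float]]:
    """Rank SNPs by absolute allele-frequency difference between two populations."""
    scores = []
    for snp_id, by_population in reference.items():
        delta = abs(
            allele_frequency(by_population[pop_a])
            - allele_frequency(by_population[pop_b])
        )
        scores.append((snp_id, delta))
    # Largest difference first: these SNPs separate the two populations best.
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical reference panel: four labeled individuals per population.
reference_panel = {
    "snp_1": {"pop_A": [2, 2, 1, 2], "pop_B": [0, 0, 1, 0]},
    "snp_2": {"pop_A": [1, 0, 1, 1], "pop_B": [1, 1, 0, 1]},
    "snp_3": {"pop_A": [0, 1, 0, 0], "pop_B": [2, 1, 1, 2]},
}
ranking = rank_candidate_aims(reference_panel, "pop_A", "pop_B")
print(ranking)
```

Here snp_1 ranks first with a frequency difference of 0.75, while snp_2 has the same frequency in both populations and carries no ancestry information. Note that even snp_1 is not present in every member of pop_A, which is the point made above about AIMs and race.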

It is also possible to infer ancestry at the level of sub-populations within these general categories. For instance, 23andMe’s commercial product estimates ancestry from 45 regional populations. To make these targeted predictions, scientists consult a larger set of markers. Earlier ancestry inference methods such as STRUCTURE in 2003 used AIMs, but as sequencing and computational technologies progressed, it became clear that using high-density SNP data as contextual information in addition to AIMs produced better results[11]. 23andMe uses as many as 50,000 markers per chromosome, making it possible to infer ancestry at the subcontinental level.

Current Approaches to Local and Global Ancestry Inference

According to 23andMe researchers, the Ancestry Composition algorithm is a three-step pipeline for performing local ancestry inference[12]. By combining all of the local inferences, 23andMe produces a global ancestry estimate. The pipeline requires phased sequences, that is, haplotypes: collections of genes inherited together from one parent.

Ancestry Composition’s first module is a support vector machine (SVM), a classifier that is iteratively refined to find the optimal way of separating input data points into predetermined categories. The module first splits the input haplotype into segments containing an equal number of genetic markers, which in 23andMe’s experiment was 100. The SVM then assigns each segment to one of 25 populations from across the world.
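The windowing step can be sketched in a few lines of Python. The example below only illustrates cutting a haplotype into fixed-size windows and labeling each one. The haplotype encoding (0 or 1 per marker), the function names and especially `toy_classifier` are assumptions made for illustration: the toy rule is a placeholder standing in for a trained SVM, not an implementation of one.

```python
from typing import Callable, List, Sequence


def split_into_windows(
    haplotype: Sequence[int], markers_per_window: int = 100
) -> List[Sequence[int]]:
    """Cut a haplotype into consecutive windows of equal marker count.

    The final window may be shorter if the length is not a multiple
    of markers_per_window.
    """
    return [
        haplotype[start:start + markers_per_window]
        for start in range(0, len(haplotype), markers_per_window)
    ]


def classify_windows(
    haplotype: Sequence[int],
    classifier: Callable[[Sequence[int]], str],
    markers_per_window: int = 100,
) -> List[str]:
    """Apply any window-level classifier to every window in turn."""
    return [
        classifier(window)
        for window in split_into_windows(haplotype, markers_per_window)
    ]


def toy_classifier(window: Sequence[int]) -> str:
    """Placeholder rule (NOT an SVM): label by share of alternate alleles."""
    return "pop_A" if sum(window) / len(window) >= 0.5 else "pop_B"


# Hypothetical haplotype of 250 markers, coded 0/1 per marker.
haplotype = [1] * 100 + [0] * 100 + [1] * 50
labels = classify_windows(haplotype, toy_classifier)
print(labels)
```

With 250 markers and 100 markers per window, the haplotype yields three windows, the last one only half full. Each window is labeled independently of its neighbors, which is why a later smoothing stage is useful.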

The second module is a statistical Markov model that transforms segment classifications into confidence probabilities. It also corrects probable classification errors. For instance, if a run of segments consistently classified as one population is interrupted by a single segment classified as another, the module flips the outlying segment to the majority label. The third module “calibrates” these probabilities by comparing them against empirical ancestry data.
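The error-correcting behavior described above can be imitated with a much simpler rule. The sketch below flips a segment only when both of its neighbors agree with each other and disagree with it. This is a deterministic stand-in chosen for clarity, not the Markov model that Ancestry Composition uses, and it produces no confidence probabilities. The function name and labels are hypothetical.

```python
from typing import List


def smooth_isolated_labels(labels: List[str]) -> List[str]:
    """Flip any single segment whose two neighbors share a different label.

    Neighbors are read from the original list, so the result does not
    depend on the order in which segments are visited.
    """
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        left, right = labels[i - 1], labels[i + 1]
        if left == right and labels[i] != left:
            smoothed[i] = left
    return smoothed


# A run of pop_A segments interrupted by one pop_B call.
noisy = ["pop_A", "pop_A", "pop_A", "pop_B", "pop_A", "pop_A"]
print(smooth_isolated_labels(noisy))

# A genuine switch between populations is left untouched.
switch = ["pop_A", "pop_A", "pop_B", "pop_B"]
print(smooth_isolated_labels(switch))
```

The isolated pop_B call is rewritten to pop_A, while the second example, where the ancestry really changes partway along the chromosome, is returned unchanged. A probabilistic model can make the same kind of correction while also weighing how confident each original call was.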

Ancestry Composition achieved respectable accuracy on both continental and subcontinental classification tasks. Evaluation was done in two ways: one with a confidence threshold of 0%, meaning that every sample received a prediction, and one in which the model only gave predictions when its confidence was above 80%. Ancestry Composition attained a precision (percentage of true positives out of all positive predictions) of above 98% and a recall (percentage of true positives out of all ground-truth positives) of above 94% on the continental task in both test settings. On the subcontinental task, it reached a precision of above 84% with a confidence threshold of 0%, while with a threshold of 80%, precision increased to above 90%. The trade-off was that some segments remained unclassified.
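The effect of a confidence threshold on precision and recall can be seen in a small worked example. The Python sketch below uses invented predictions and one assumption worth stating loudly: a segment whose confidence falls below the threshold is treated as unclassified and counts as a miss for recall. The evaluation in the 23andMe paper may handle such segments differently.

```python
from typing import List, Optional, Tuple


def precision_recall(
    predictions: List[Tuple[str, float]],
    truth: List[str],
    target: str,
    threshold: float = 0.0,
) -> Tuple[Optional[float], Optional[float]]:
    """Precision and recall for one population at a given confidence threshold."""
    true_pos = false_pos = false_neg = 0
    for (label, confidence), actual in zip(predictions, truth):
        # Assumption: calls below the threshold are left unclassified.
        called = label if confidence >= threshold else None
        if called == target and actual == target:
            true_pos += 1
        elif called == target:
            false_pos += 1
        elif actual == target:
            false_neg += 1
    predicted = true_pos + false_pos
    actual_total = true_pos + false_neg
    precision = true_pos / predicted if predicted else None
    recall = true_pos / actual_total if actual_total else None
    return precision, recall


# Hypothetical (label, confidence) calls and their ground-truth labels.
predictions = [
    ("pop_A", 0.95), ("pop_A", 0.60), ("pop_B", 0.90),
    ("pop_A", 0.85), ("pop_B", 0.55),
]
truth = ["pop_A", "pop_B", "pop_B", "pop_A", "pop_A"]

print(precision_recall(predictions, truth, "pop_A", threshold=0.0))
print(precision_recall(predictions, truth, "pop_A", threshold=0.8))
```

At a threshold of 0, two of the three pop_A calls are correct, so precision is 2/3. At a threshold of 0.8, the low-confidence wrong call is dropped and precision rises to 1.0, while recall stays at 2/3 because one true pop_A segment is still missed.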

While Ancestry Composition is used in industry, a common method for LAI in academia is LAMP. Ancestry Composition is limited in that its classifications are based on arbitrarily chosen windows of the input haplotype and do not account for inherited segments that span window boundaries. LAMP instead runs a clustering algorithm on all fixed-length windows that overlap a target SNP and assigns the majority classification to that SNP[13]. In Yoruba and European admixed samples, LAMP achieved an inference accuracy of 94%. However, as similarity between populations increased, LAMP’s accuracy decreased, falling to 48% for Chinese and Japanese admixed samples. Subsequent LAI methods have built upon LAMP’s benchmark; WINPOP, for example, delivers significant improvements on LAMP’s performance in closely related populations.
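The majority-vote idea can be sketched as follows. This Python example covers only the voting step: it assumes each overlapping window has already been given a label by some clustering procedure, which is omitted here, and it is not LAMP’s actual implementation. The window coordinates and labels are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def vote_per_snp(
    num_snps: int, window_calls: List[Tuple[int, int, str]]
) -> List[str]:
    """Assign each SNP the most common label among the windows covering it.

    window_calls holds (start, end_exclusive, label) for each window.
    SNPs covered by no window are reported as "unassigned".
    """
    votes = [Counter() for _ in range(num_snps)]
    for start, end, label in window_calls:
        for position in range(max(start, 0), min(end, num_snps)):
            votes[position][label] += 1
    return [
        counter.most_common(1)[0][0] if counter else "unassigned"
        for counter in votes
    ]


# Hypothetical overlapping windows of three SNPs, sliding one SNP at a time.
window_calls = [
    (0, 3, "pop_A"),
    (1, 4, "pop_A"),
    (2, 5, "pop_B"),
    (3, 6, "pop_B"),
    (4, 7, "pop_B"),
]
snp_labels = vote_per_snp(7, window_calls)
print(snp_labels)
```

Because every SNP is judged by all of the windows that contain it, the switch from pop_A to pop_B is placed at a specific SNP rather than at an arbitrary window edge. In this toy input the first three SNPs are assigned to pop_A and the remaining four to pop_B.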

Going forward, the field of ancestry inference has several challenges to address. Accuracy depends on the quality of the reference data, and it is difficult to find markers representative of an ancestral population when many people alive today are admixed. In addition, popular LAI programs often contradict each other on the same tasks. For instance, in a disease association study, LAMPLD and MULTIMIX diverged in 18% of inferences[14]. Scientists are constantly working to improve the accuracy of their methods, and advancements in computation and genomics will facilitate new developments in ancestry studies.

Conclusion

Race, being based on physical appearance, has no biological basis. There is no single allele that is completely unique to one race and appears in all of its members. Ancestry is a more valid descriptor of an individual’s population affiliations, as genetic variation is distributed according to geography rather than skin color. Members of the same preconceived race do not necessarily share the same ancestry, which is an important distinction to make when, for example, tracing how disease susceptibility genes from one region spread through history. The age of genomic medicine presents exciting opportunities and challenges. As more populations gain representation in ancestry-related association studies, it will be important to educate the public on the benefits and implications of this research, both for medicine and for societal views on race.

References

[1] Global Direct-to-Consumer Genetic Testing Market is growing with Double Digit CAGR. (2018, February). Retrieved from www.credenceresearch.com/report/direct-to-consumer-genetic-testing-market

[2] Royal, C. D., Novembre, J., Fullerton, S. M., Goldstein, D. B., Long, J. C., Bamshad, M. J., & Clark, A. G. (2010). Inferring Genetic Ancestry: Opportunities, Challenges, and Implications. The American Journal of Human Genetics, 86(5), 661-673. doi:10.1016/j.ajhg.2010.03.011

[3] Rotimi, C. N., & Jorde, L. B. (2010). Ancestry and Disease in the Age of Genomic Medicine. New England Journal of Medicine, 363(16), 1551-1558. doi:10.1056/nejmra0911564

[4] Rosenberg, N. A. (2002). Genetic Structure of Human Populations. Science, 298(5602), 2381-2385. doi:10.1126/science.1078311

[5] Ahn, S., Kim, T., Lee, S., Kim, D., Ghang, H., Kim, D., . . . Kim, S. (2009). The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Research, 19(9), 1622-1629. doi:10.1101/gr.092197.109

[6] Xing, J., Watkins, W. S., Shlien, A., Walker, E., Huff, C. D., Witherspoon, D. J., . . . Jorde, L. B. (2010). Toward a more uniform sampling of human genetic diversity: A survey of worldwide populations by high-density genotyping. Genomics, 96(4), 199-210. doi:10.1016/j.ygeno.2010.07.004

[7] Yudell, M., Roberts, D., Desalle, R., & Tishkoff, S. (2016). Taking race out of human genetics. Science, 351(6273), 564-565. doi:10.1126/science.aac4951

[8] Thornton, T. A., & Bermejo, J. L. (2014). Local and Global Ancestry Inference and Applications to Genetic Association Analysis for Admixed Populations. Genetic Epidemiology, 38(S1). doi:10.1002/gepi.21819

[9] Pfaffelhuber, P., Grundner-Culemann, F., Lipphardt, V., & Baumdicker, F. (2020). How to choose sets of ancestry informative markers: A supervised feature selection approach. Forensic Science International: Genetics, 46, 102259. doi:10.1016/j.fsigen.2020.102259

[10] Phillips, C., Salas, A., Sánchez, J., Fondevila, M., Gómez-Tato, A., Álvarez-Dios, J., . . . Carracedo, Á. (2007). Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs. Forensic Science International: Genetics, 1(3-4), 273-280. doi:10.1016/j.fsigen.2007.06.008

[11] Geza, E., Mugo, J., Mulder, N. J., Wonkam, A., Chimusa, E. R., & Mazandu, G. K. (2018). A comprehensive survey of models for dissecting local ancestry deconvolution in human genome. Briefings in Bioinformatics, 20(5), 1709-1724. doi:10.1093/bib/bby044

[12] Durand, E. Y., Do, C. B., Mountain, J. L., & Macpherson, J. M. (2014). Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution. bioRxiv. doi:10.1101/010512

[13] Sankararaman, S., Sridhar, S., Kimmel, G., & Halperin, E. (2008). Estimating Local Ancestry in Admixed Populations. The American Journal of Human Genetics, 82(2), 290-303. doi:10.1016/j.ajhg.2007.09.022

[14] Chen, M., Yang, C., Li, C., Hou, L., Chen, X., & Zhao, H. (2014). Admixture mapping analysis in the context of GWAS with GAW18 data. BMC Proceedings, 8(Suppl 1). doi:10.1186/1753-6561-8-s1-s3