KinSeeker Genealogy Services
  • Home
  • About
  • Services
  • Blog
  • Publications
  • Contact
  • FAQ
  • Testimonials
  • Links

The KinSeeker

Q & A: How Does AncestryDNA Estimate Ethnicity?

7/25/2016

 
Recently, I asked my Facebook followers to send me their questions about DNA testing.  One person wanted to know how AncestryDNA determines ethnicity percentages. In particular, she was interested in what regions of the genome Ancestry uses to draw these conclusions. 
 
First, it is important to understand that Ancestry is not actually sequencing a client’s entire genome. The vast majority of DNA would not be informative, because it is the same in all people.  Instead, Ancestry determines the client’s DNA sequence only at specific positions that are known to vary among different ethnic groups.  These differences, which are scattered all over the genome, are called single nucleotide polymorphisms (SNPs – pronounced “snips”). Determining an individual’s sequence at a variety of SNPs is called “genotyping”. 
 
Ancestry uses SNPs that were originally identified by comparing genomes from individuals of European, East Asian (Han Chinese and Japanese), and West African (Yoruba) ancestry.  Since Ancestry wants to be able to recognize other ethnicities too, they had to develop a reference panel of people from a variety of known ethnic backgrounds. To do this, they genotyped people whose ancestors all came from the same geographic region and thus were likely to descend from a single ethnic group.  They also incorporated data from the public Human Genome Diversity Project (HGDP), which genotyped individuals from about 50 different populations around the world.  When the SNP data from this reference panel was plotted on a graph, it formed clusters corresponding to 26 distinct geographic regions.  Ancestry uses these 26 regions to define a client’s ethnicity.
 
When a client’s DNA is genotyped, the data is compared to the reference panel at 300,000 SNPs (the sites for which the HGDP and Ancestry’s technique both provide information).  The most informative SNPs are then subjected to some high-powered statistical analysis.  Basically, they calculate the predicted SNP results for all possible proportions of ethnicity and compare those predictions to the client’s actual SNP results to determine which ethnicity combination has the highest probability of producing the client’s results.  The “winning” combination is reported to the client as their Ethnicity Estimate.
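To make the "try every combination and keep the most probable one" idea concrete, here is a toy sketch in Python. It is not Ancestry's actual algorithm: the simple binomial model, the function names, the two populations, and the allele frequencies are all assumptions made up for illustration.

```python
import math
from itertools import product


def log_likelihood(genotypes, freqs, proportions):
    """Log-probability of the genotypes given a mix of populations.

    genotypes:   0, 1 or 2 copies of the tracked allele, one entry per SNP
    freqs:       freqs[s][k] = frequency of that allele at SNP s in population k
    proportions: fraction of ancestry from each population (sums to 1)
    """
    total = 0.0
    for g, snp_freqs in zip(genotypes, freqs):
        # Expected allele frequency for someone with this ancestry mix.
        f = sum(q * p for q, p in zip(proportions, snp_freqs))
        f = min(max(f, 1e-6), 1 - 1e-6)  # avoid log(0)
        # Binomial probability of seeing g copies out of 2 chromosomes.
        total += (math.log(math.comb(2, g))
                  + g * math.log(f)
                  + (2 - g) * math.log(1 - f))
    return total


def candidate_proportions(n_pops, steps):
    """Yield every split of 100% across n_pops in increments of 1/steps."""
    for counts in product(range(steps + 1), repeat=n_pops):
        if sum(counts) == steps:
            yield tuple(c / steps for c in counts)


def estimate_ethnicity(genotypes, freqs, steps=10):
    """Return the candidate mix with the highest likelihood (the 'winner')."""
    n_pops = len(freqs[0])
    return max(candidate_proportions(n_pops, steps),
               key=lambda q: log_likelihood(genotypes, freqs, q))


# Made-up allele frequencies for two hypothetical populations, A and B.
toy_freqs = [[0.99, 0.01], [0.01, 0.99]]
toy_genotypes = [1, 1]  # one copy of the tracked allele at each SNP
print(estimate_ethnicity(toy_genotypes, toy_freqs))  # (0.5, 0.5)
```

A grid search like this is only practical for a handful of populations; with 26 regions and hundreds of thousands of SNPs, a real implementation would need a much smarter optimization.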
 
Obviously the results of this type of analysis are only as good as the reference panel. Ancestry has already upgraded the reference panel once (they are currently using the V2 panel) and additional improvements are in the works. 
 
The quality of results may also vary depending on the ethnicity of the subject. Because of the SNPs that were chosen, the Ancestry ethnicity test works best for people of European ancestry.  However, even some regions within Europe are difficult to distinguish due to migration and population mixing.   For example, the regions defined as Great Britain and Europe West show a lot of overlap.  Ancestry provides a brief history of each of the geographic regions, highlighting population movements that are likely to have affected the genetic makeup of its inhabitants.
 
In my personal experience, the geographical regions identified by the Ancestry Ethnicity Estimate match up fairly well with what would have been predicted based on standard genealogy.  It is important to check the error bars on each region, since they are often quite large.  For example, on one test that showed 6% Great Britain, the reported range was 0-21%.  As Ancestry upgrades their reference panel and algorithms, these results are likely to improve.
 
If you are interested in reading about Ancestry’s Ethnicity Estimate in even more detail, check out the white paper describing their method. 
 
If you have other questions about DNA testing for genealogy purposes, comment below or submit questions through the Contact form on this site.  You can also send me a message through the KinSeeker Genealogy Services Facebook page.  I will try to answer any questions in a future post.
6 Comments
John Bartelt
2/24/2017 09:18:55 pm

The white paper describes how they determine the estimated errors in the ethnicity. But I can't tell from the text or figure 4.10 what Confidence Level or statistic they are using. Is it a 90% C.L. range? FWHM? One standard deviation? Or?
Thanks

Teresa Shippy
2/27/2017 09:52:32 pm

Hi John,

Thank you for your question! I'm definitely not a statistician, but I think the answer is that bootstrapping does not rely on a statistical method to determine variability. They simply repeat the process multiple times (in this case 40) and the range of results gives them the error estimates. The ethnicity percentage they report to the customer is the mean value of the 40 results. The high and low results provide the endpoints of the range.
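As a rough illustration of the procedure as described here (40 repeats, the mean as the reported value, the lowest and highest results as the range), a minimal Python sketch might look like the following. What exactly Ancestry resamples is not stated in this thread, so the toy data, the `estimator` stand-in, and the function names are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_estimates(data, estimator, n_rounds=40, seed=0):
    """Re-run `estimator` on n_rounds resamples of `data` (with replacement)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    estimates = []
    for _ in range(n_rounds):
        resample = rng.choices(data, k=len(data))
        estimates.append(estimator(resample))
    return estimates


def summarize(estimates):
    """Mean as the reported value; lowest and highest runs as the range."""
    return {
        "reported": mean(estimates),
        "low": min(estimates),
        "high": max(estimates),
    }


# Toy stand-in data: 1 = a marker that "looks like" a given region, 0 = not.
toy_markers = [1] * 6 + [0] * 94
runs = bootstrap_estimates(toy_markers, estimator=mean)
print(summarize(runs))
```

Because the range here is taken from the most extreme of the 40 runs, a single unusual resample can widen it considerably.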

You might also want to look at https://support.ancestry.com/s/article/ka215000000TyOxAAK/Viewing-Ethnicity-Results-from-AncestryDNA-US-1460088591488-2556

Hope this helps!

Teresa

John Bartelt
2/27/2017 10:24:33 pm

The page you link to seems to agree with your description: that the range covers all 40 estimates. However, just yesterday I read a different ancestry.com page which indicated that was not true. This: https://www.ancestry.com/dna/ethnicity/1220AB3A-7AAD-40ED-A3E9-9DAC370BF958
In the example cited there, the range covered only 29 of 40 estimates. That is roughly consistent with plus or minus one standard deviation (sigma), or a 68% confidence interval. That would be a better-defined statistic than trying to cover *all* the estimates, which could include weird outliers.
So I'm not sure what to believe.
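
The "roughly consistent with one sigma" comparison can be checked with a few lines of Python. This is only a sanity check of the arithmetic, assuming the bootstrap estimates are approximately normally distributed; the helper names are made up for illustration.

```python
import math
from statistics import mean, stdev


def normal_coverage(n_sigma):
    """Fraction of a normal distribution within +/- n_sigma of its mean."""
    return math.erf(n_sigma / math.sqrt(2))


def fraction_within_one_sigma(estimates):
    """Share of estimates lying within one sample standard deviation of the mean."""
    m, s = mean(estimates), stdev(estimates)
    return sum(1 for x in estimates if abs(x - m) <= s) / len(estimates)


observed = 29 / 40              # estimates inside the range in the cited example
expected = normal_coverage(1)   # what +/- one sigma would cover
print(f"observed {observed:.1%}, one sigma {expected:.1%}")
# observed 72.5%, one sigma 68.3%
```

With only 40 estimates, each one is worth 2.5 percentage points, so 29 of 40 cannot by itself distinguish a one-sigma range from some other rule that happens to cover a similar share.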

John Bartelt
2/27/2017 10:28:28 pm

Just realized the link I provided was not correct, and I can't figure out how to link to the correct page, since it is just a pop-in when you click on "How is the range calculated?".

Teresa Shippy
2/28/2017 06:50:55 am

I found the page and I see what you mean about the range not including all of the 40 bootstrap samples. I also noticed that the arrows showing the likely range in the white paper only span part of the actual range of samples. I agree that it looks close to +/- one standard deviation, but I can't find anywhere that they actually say so.

Liam
12/22/2020 11:45:22 pm

First time reading this, thanks for sharing


    Teresa Shippy

    Teresa is the owner of KinSeeker Genealogy Services.  She has a Ph.D. in Biology and a lifelong fascination with genealogy. She has been researching her own family history for over 20 years and loves helping others "find their stories."

    This blog is owned by Teresa Shippy.  Content may not be copied without permission.

    ©2016, copyright Teresa Shippy
