Wednesday, 27 April 2016

New Project Member ... with clues to the Spearin origin?

Please welcome a new project member to Genetic Family 1 (GF1; the Limerick Spearin's). Member 458314 is not a Spearin, he's a Graham. But he is a fairly close match to the members of GF1. And he may hold clues to the origins of the Spearin's in GF1.

Evidence from STRs
Below is the summary of Mr Graham's Genetic Distance to other members of the Spearin Surname Project at the 67 marker level. He has a Genetic Distance (GD) of 5/67 with his closest match, 6/67 with 4 other members of GF1, and 7/67 with two other members. He also matches two people in the Ungrouped category but much more distantly (9/67 and 20/67).

TiP Report (at 67 markers) for new member 458314 showing his GD to his closest matches in the project

Clicking on the TiP Report for his 7/67 matches reveals that the TMRCA (Time to Most recent Common Ancestor) is estimated to be about 12 generations ago (50% probability level) with a 90% range of 6-21 generations ago. That roughly equates with 360 years before present (90% range 180-630 ybp), which in turn gives a rough estimate of the birth year of the common ancestor of 1590 (90% range 1320-1770, assuming the average birth year of those tested is 1950).
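For anyone who wants to repeat this conversion for their own matches, the arithmetic can be sketched in a few lines of Python. This is only an illustration, assuming 30 years per generation and an average birth year of 1950 for those tested (the assumptions behind the figures above). The function name is my own and is not part of any FTDNA tool.

```python
YEARS_PER_GENERATION = 30   # assumption behind the figures in this post
TESTER_BIRTH_YEAR = 1950    # assumed average birth year of those tested


def generations_to_birth_year(generations):
    """Return (years before present, estimated birth year) for a TMRCA in generations."""
    ybp = generations * YEARS_PER_GENERATION
    return ybp, TESTER_BIRTH_YEAR - ybp


# 50% level (12 generations) and the two ends of the 90% range (6 and 21 generations)
for label, gens in [("midpoint", 12), ("recent bound", 6), ("distant bound", 21)]:
    ybp, year = generations_to_birth_year(gens)
    print(f"{label}: {gens} generations = {ybp} ybp, born about {year}")
```

Changing either assumption (say, 25 years per generation) shifts all three dates, which is one reason these estimates should be treated as rough.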

So why does Mr Graham match the GF1 Spearin's?

There are several possible Scenarios based on the DNA alone:
  1. There may have been an NPE (non-paternity event, e.g. adoption, illegitimacy, etc) somewhere along the line and Mr Graham is actually a Spearin, and we all share a common Spearin ancestor born about 1590 (I say we because I too am a GF1 Spearin).
  2. We Spearin's in GF1 are actually all Graham's, and the NPE was on our line, not Mr Graham's.
  3. Mr Graham and the GF1 Spearin's are actually related prior to the common usage of surnames, which in the UK occurred around 1200-1300. However, his TMRCA estimate puts the common ancestor after 1320 with about 95% probability, so this scenario seems unlikely.
  4. What we are looking at is in fact an example of Convergence. This is when the genetic profile of one person appears to be fairly close to that of another person but in fact there are hidden back mutations or parallel mutations within their profiles that make them related much further back than they seem.

So which of these scenarios is the most likely in Mr Graham's case?

Evidence from SNPs
Well, we might get some clues from the terminal SNP markers of his closest matches. Mr Graham's own terminal SNP is the upstream SNP M223 placing him firmly in Haplogroup I (along with the GF1 Spearin's).
  • At 67 markers, he has 19 matches whose terminal SNPs include L1198 (x1), PF5268 (x1), Y18109 (x2; both GF1), Y6060 (x1), and Z166 (x2; 1 from GF1).
  • At 37 markers, he has 11 matches including L1198 (x1) and Y18109 (x1; GF1)
  • At 25 markers, he has 44 matches including CTS6433 (x1), L1198 (x1), PF5268 (x1), Y18109 (x2; both GF1), and Z166 (x2; 1 from GF1).

All these terminal SNPs (except one) are on the same or adjacent branches of the human evolutionary tree to that on which the GF1 Spearin's sit. I've marked these branches with a red dot in the diagram below. The exception is SNP CTS6433 which is on the following branch:
  • I-M223 > CTS616 > CTS10057 > Z161 > CTS2392 > Z173 > CTS6433

So although the exception is still within the I-M223 haplogroup sub-clade (like the GF1 Spearin's), it is a completely different branch.

However, taking all the evidence into consideration, there seems to be little doubt that Mr. Graham will test positive for L1198 - the only question is which of the sub-branches does he sit on. Currently there appear to be 3 possibilities - Y6060, PF5268, and Y18109 (the GF1 Spearin's). To answer this question, there are several courses of action open to Mr Graham:
  • do the I-M223 SNP Pack ($119) - this will test for most of the relevant downstream SNPs (in pink below) but not all of them (in blue). Additional single SNP testing (e.g. for Y18109) might be indicated thereafter
  • do the Big Y test ($575, or wait for the sale when it is usually $475 or lower) and a YFULL reanalysis ($49) - this will assess most/all of the relevant SNPs and detect some new ones too

Placement on the Haplotree of the Terminal SNPs of Mr Graham's closest matches

If further SNP testing reveals that Mr Graham sits on one of the adjacent branches in the haplotree (e.g. Y6060), then the connection will be very far back in time (L1198 for example was formed 3000 years ago approximately - see diagram below), and if that is the case what we are probably looking at is Scenario 4, Convergence.

However, if he sits on the same branch as the GF1 Spearin's (Y18109) then we can conclude that we are related within the past 2200 years (approximately) as this is when it is estimated that the SNP Y18109 was formed (see previous post). This is consistent with any of the first 3 scenarios above, but does not help us distinguish which scenario is the most likely.

Only by doing the Big Y test (and a YFULL reanalysis) would we get a better idea of which scenario is the most likely. If we compared his Big Y results to those of the GF1 Spearin's who have already done the Big Y test, we might find any of the following:
  • He sits on an adjacent branch, below L1198 or below Y17535 => the most likely scenario is Scenario 4: Convergence
  • He sits on branch Y18109, matches some of the 10 SNPs in the terminal SNP block, but does not match others. This splits up the Y18109 10-SNP block (as discussed in a previous post) and places him on a new adjacent branch with a branching point estimated to be either before the common usage of surnames (=> Scenario 3 is the most likely scenario i.e. he is related to the GF1 Spearin's before 1200-1300 AD) or after the common usage of surnames (=> Scenario 1 or 2 is most likely). Either way, the new branching point could be dated and would move everybody concerned further down the human evolutionary tree.
  • He sits on the Y18109 branch, matches all 10 SNPs in the terminal SNP block, but does not match any of the unique SNPs of those GF1 Spearin members already tested => Scenario 1 or 2 is most likely
  • As above, he sits on the Y18109 branch, matches all 10 SNPs in the terminal SNP block,  and in addition matches one of the existing Big Y-tested GF1 members on some of their unique SNPs => possibly Scenario 1 is the most likely and an NPE has occurred somewhere along Mr Graham's ancestral line. This would also create a new branching point which could be dated and would move (some of) us further downstream on the human evolutionary tree.

Branching points & Terminal SNP blocks below L1198

Genealogical evidence
So far, the discussion has merely focussed on the genetic evidence. But this is where we bring in the evidence from Mr. Graham's known genealogy. It is his grandson who manages his DNA results and here is what he says:
I submitted my Grandfather's YDNA to get tested as his father Edward Graham (+ Sister) took his Mother's maiden name "Graham" and he has no recorded Father on his birth record.

So once the results came in we had 9 good matches for Spearing, Speiran, Spearin, Speerin

Now the fun begins finding the link to a Spearin in New Zealand.

So clearly there is an NPE in the Graham line and it is at the level of Mr Graham's father (born in 1901 in New Zealand). The question is: does it go back to a Spearin or to some other surname?

One obvious course of action (as Mr Graham's grandson suggests) would be to search for a Spearin in New Zealand in 1901 who could have been Mr Graham's father's father. There are several potential candidates* in New Zealand around this time with the names Spearing and Sperring (more usually an English variant) but no one by the name of Spearin, Speiran, Speirin, or Spierin (more usually the Irish variant associated with the GF1 group). So, there is no clear signal currently that a GF1 Spearin was the father of Mr Graham's father.

Next Steps
One could try to track down some of the present day New Zealand Spearing's and encourage them to do a Y-DNA-37 test to see if there is a close match to Mr Graham. Or Mr Graham could do the Big Y test (and YFULL reanalysis) to see where he sits on the human evolutionary tree relative to the GF1 Spearin's.

The latter seems like the best course of action as it will give us the most information. It is likely to give us quite a bit of additional information about our relative positions on the haplotree but it won't answer all our questions - it may not identify any additional surname candidates for Mr Graham's father's father, and it may not give us any further clues to the ancestral origins of the GF1 Spearin's.

And as has been the case for many years, it will still be a waiting game to see if any closer matches to Mr Graham or the GF1 Spearin's emerge over time.

But one day, we will get there (in all likelihood). It is only a matter of time.

Maurice Gleeson
April 2016

* from the New Zealand Electoral Rolls 1853-1981 on Ancestry

Friday, 22 April 2016

Big Y results - comparing TMRCA estimates

In the previous post, we looked in depth at the SNP markers identified by FTDNA and YFULL, and compared the reports from each company for similarities and differences. However, this post explores the topic of TMRCA (Time to Most Recent Common Ancestor) and thankfully this is a lot more straightforward.

TMRCA Estimates based on SNPs
Another useful piece of information from the YFULL analysis is the estimate for when the SNPs in this terminal block emerged and the TMRCA estimate between the two volunteers who have tested (TMRCA is Time to Most Recent Common Ancestor). We discussed in a previous post that the SNPs in this block emerged about 2200 years ago (or 200 BC) but today we are looking at the TMRCA between the two volunteers.

SNP emergence estimate & TMRCA estimate for GF1

Their TMRCA is estimated to be a mere 150 years before present (ybp) by which they mean 150 years prior to the approximate date of birth of these individuals, which (let's say) is approximately 1950. This gives a common ancestor born about the year 1800. However, the 95% Confidence Intervals around this estimate indicate that it could be anywhere from 75 years ago to 500 years ago. Or in other words, we can be 95% confident that the common ancestor was born some time between 1450 and 1875. This estimate could be refined if more people from Genetic Family 1 (GF1) were to do the Big Y test and upload their results to YFULL, but for now there is no pressing need to do so.

Calculation of the TMRCA estimate

TMRCA Estimates based on STRs
But how does this compare with TMRCA estimates based on STR markers? The TiP Report for the comparison of these two volunteers is detailed below.  It is based on a comparison of their STR markers at the 67-marker level. You can access your own TiP Report by clicking on the orange TiP icon beside each of your matches. This tells you how close or how distantly you are related (based on your STR values). You can select comparisons based on 12 markers, 25 markers, 37, 67 or 111 (depending on how many markers you have personally tested).

This analysis assesses the probability that the two individuals share a common ancestor on their direct male lines within the past "X" number of generations. This is a cumulative probability and so the probability increases over time and eventually reaches 100%.

TiP Report comparing Volunteer A (H1223)  with Volunteer B (164729)
The 50% (midpoint) value is about 10 generations - in other words, there is a roughly 50% chance that the common ancestor was born within the last 10 generations, and a roughly 50% chance that it was sometime before that. The 5% and 95% probability levels are about 4 and 19 generations respectively. Allowing 30 years per generation, this gives us a midpoint TMRCA estimate of 300 years before present (ybp), with a 90% Confidence Interval of 120 to 570 years ago. And translating this into actual years gives us a midpoint estimate of 1650 (assuming an average year of birth for the two volunteers of about 1950), with a range of somewhere between 1380 to 1830 AD.

This TMRCA estimate based on STR values (1380-1650-1830) is not that close to the TMRCA estimate based on SNP values (1450-1800-1875). In fact, the midpoint estimate is out by 150 years.  Also, the range around the "best estimate" is very large, and could be quite far back in time (1380-1450). This is why we really have to be careful when interpreting TMRCA estimates - they may be out by several hundred years ... and in either direction!
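To make the comparison concrete, here is a small Python sketch that lines up the two estimates quoted above. The numbers come straight from this post. The variable names and the overlap calculation are merely my own illustration, and note that the STR range is a 90% range whereas the SNP range is a 95% interval, so they are not strictly like-for-like.

```python
# (earliest, midpoint, latest) birth year of the common ancestor, from this post
str_estimate = (1380, 1650, 1830)   # TiP Report at 67 markers, 90% range
snp_estimate = (1450, 1800, 1875)   # YFULL, 95% confidence interval

# How far apart are the two "best estimates"?
midpoint_gap = abs(snp_estimate[1] - str_estimate[1])

# The period that both ranges agree on
overlap_start = max(str_estimate[0], snp_estimate[0])
overlap_end = min(str_estimate[2], snp_estimate[2])

print(f"Midpoints differ by {midpoint_gap} years")
print(f"Both ranges cover {overlap_start}-{overlap_end}")
```

So although the midpoints disagree, the two ranges share a window of several centuries, which is about as much as either method can honestly claim.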

However, there is an additional technique we can use to try to obtain more accurate assessments of  TMRCA estimates for the entire group, and that is something we will explore in a subsequent blog post.

Maurice Gleeson
April 2016

Friday, 15 April 2016

Big Y Results - Terminal SNPs, Shared SNPs & Unique SNPs

In the previous post we looked at some of the initial results of the YFULL analysis of the Big Y test from two of the volunteers from Genetic Family 1 (GF1). In this post we will take a closer look at the SNPs revealed by the additional YFULL analysis and then compare and contrast them with the original results from FTDNA.

This is quite a long post but stick with it!

Terminal SNPs
The two volunteers from GF1 were given new ID numbers at YFULL and you can see them in the haplotree diagram below - they are the last two numbers.
  • The first volunteer (FTDNA kit number 164729) is YF04104 (results available in Sep 2015)
  • The second volunteer (H1223) is YF04316 (results Oct 2015)

The terminal SNP Block for GF1 on the YFULL Haplotree

Our two brave volunteers have been placed on the YFULL Haplotree as a sub-branch below SNP Y17535. Both our volunteers have the terminal SNP Y18109 or rather they have a whole "block" of terminal SNPs, namely:
  • Y18109
  • Y18110 
  • Y18111 
  • Y18112 
  • Y18113  
  • Y18114 
  • Y18115
  • Y18116
  • Y18117
  • Y18118 

Shared SNPs
When two or more people share several terminal SNPs in common, this terminal SNP "block" is usually named after the first SNP in the block, which in our case is Y18109. These terminal SNPs are shared between our two volunteers and no other people in the world (currently).

If we move up the tree to the next nearest branching point, we find this is marked by the SNP Y17535 (which represents a SNP block of 5 SNPs). There are 2 sub-branches below Y17535 - our own sub-branch (Y18109) with our own 2 volunteers, and another sub-branch (Y17535*) with a single individual. Thus 3 individuals (currently) share the SNP Y17535.

Shared SNPs on the L1198 branch of the YFULL haplotree

And if we move further upstream to the next nearest branching point, this is marked by the SNP L1198 (representing another SNP block of about 7 SNPs, although some are "equivalent SNPs" i.e. the same SNP with several different names because it was discovered by several different people around about the same time - these are the SNPs separated by a forward slash). There are 4 sub-branches below L1198 - the Y17535 branch just discussed above (with 3 individuals), but also an L1198* sub-branch (3 people), a Y6060 sub-branch (with 3 subsequent sub-branches, the last of which also has 3 sub-branches; 9 people in total), and an S20905 sub-branch (aka Z190, but not shown in the diagram for some strange reason; with 4 levels of sub-branching) containing 11 people altogether (currently).

So, in total, 26 people share the SNP block L1198, 3 people share the SNP block Y17535, and 2 people (our volunteers) share the SNP block Y18109.
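As a quick sanity check on that total, the head count for each sub-branch can be added up in Python. The counts are the ones given above; the dictionary is just my own way of laying them out.

```python
# People on each sub-branch directly below L1198 (counts from this post)
l1198_sub_branches = {
    "Y17535": 3,    # includes our two volunteers on Y18109
    "L1198*": 3,
    "Y6060": 9,
    "S20905": 11,   # aka Z190
}

total = sum(l1198_sub_branches.values())
print(f"{total} people share the L1198 SNP block")
```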


But as more people test, either from GF1 or our close genetic neighbours (if they exist), each SNP block should be gradually split up. In other words, we can expect our particular sub-branch of the tree to be joined by adjacent sub-branches sprouting nearby, some of which will "steal" SNPs from our current terminal SNP block. We can also expect further sub-branches to sprout below our current terminal SNP / SNP block. For example, if (say) 10 people from GF1 were to test, the current 10-SNP block might dwindle to (say) a 2-SNP block, with (say) 4 sub-branches below it - one sub-branch containing a single terminal SNP, another containing a 3-SNP block, and two containing a 2-SNP block.

The Take Home Message is: our current terminal SNP block will dwindle and will be split up as more people do the Big Y test (or similar tests).

Unique (Personal) SNPs
In addition to the 10 SNPs that the two volunteers share with each other (i.e. the Y18109 SNP Block), they each possess SNPs that the other does not have. In other words, they have their own unique, personal or "private" SNPs that no one else in the world has (currently). No doubt if we were to test other members of GF1 for these "private" SNPs we would find that some of them would no longer be unique - they would be shared with other members of the Spearin group - thus reducing the number of unique SNPs possessed by any given individual.

YFULL reports that Member YF04316 (H1223) has 16 "Novel SNPs" with 3 SNPs characterised as Best Quality, 1 as Acceptable Quality, and 11 as Ambiguous Quality. These are illustrated in the diagram below. In contrast, member YF04104 (164729) has 51 Novel SNPs but none are of best quality, 1 is of acceptable quality, 49 are of ambiguous quality, and 1 is of low quality (these are not shown here because they take up too much space). But what do they mean by quality?

The quality of a SNP is a reflection of how confident the company is about declaring it to be a true positive SNP and not a false positive finding. There are various reasons for why the test might throw up a false positive result and we don't need to go into the details here, but it is simply important to remember that some results may be false positives and it is best to focus on the SNPs that the company is most confident about (i.e. the best quality SNPs).

Unique SNPs (currently) possessed by member YF04316

If more people from GF1 tested, we would probably find that some of the 16 Novel SNPs of Member YF04316 (H1223) would turn up in the results of some of the new people, and would no longer be "private" or unique - they would be shared by other members in the group. And this might even result in one or several more branches being formed.

So, in a similar way to how the shared SNPs in the current GF1 terminal SNP block will split up as our genetic neighbours get tested, these unique SNPs to H1223 will also gradually disappear as more people test. So, for example, if everyone from GF1 were to do the Big Y test, a lot of H1223's unique SNPs would turn out to be shared by other members of GF1 (and thus they would not be unique any more). This could be useful when building a Mutation History Tree (discussed in a subsequent blog post) but we could also probably achieve this with the existing STR data instead, so there is no burning need for more people in GF1 to do the Big Y test.

Comparison between FTDNA Analysis & YFULL Analysis
We have looked at the YFULL reanalysis of the Big Y data. Now we are going to compare it to the Big Y data analysis originally performed by FTDNA to see if (and where) there are similarities and differences.

The FTDNA results report that our two volunteers match on 24,165 known SNPs and differ on 2 known SNPs, namely YSC0000155 and PF3643. In fact it is member 164729 who appears to be lacking these SNPs - H1223 appears to have them both. This is very surprising given that we expect our two volunteers to be related by a common ancestor some time in the 1600's and so there should be a very close relationship between them with no major differences in the SNPs they share. So for them to differ by two SNPs is quite a surprise.

Furthermore, these two SNPs in question are nowhere to be found on either the FTDNA Haplotree or the ISOGG Haplotree. I found YSC0000155 on YBrowse and it was discovered in a Haplogroup J-L147 person but there is no further information available on this SNP. Similarly, PF3643 was discovered in 2011 and possibly belongs in Haplogroup I. The I-M223 Yahoo Discussion Group notes that PF3643 turns up in some but not all I-M223 people and that "some people's Big Y test did not record a result for PF3643. However, there is enough data to show that Z79+ people must have had a back mutation from derived C back to ancestral A." So it is difficult to judge whether these SNPs are relevant to our own particular Spearin sub-branch of the human evolutionary tree. I suspect that these particular SNPs may be quite far upstream from where we currently sit and are of no particular relevance to the conversation that follows. But I could be wrong.

Furthermore, the nature of NGS tests (Next Generation Sequencing tests) like the Big Y means that this particular test may simply have failed to detect these two SNPs this time around, and they may in fact be present after all. If we were to repeat the same test in the same individual they might pop up in the second test.

A big thank you to John Cleary who pointed out that you can check SNP information on YFULL if you know the SNP name. Just go to Check SNPs, enter the name of the SNP in question and click on the magnifying glass icon when it comes up.

I was able to check the YFULL website for FTDNA's mystery missing Known SNPs (YSC0000155 and PF3643) but obtained no additional useful information. I still do not know where these are placed in the haplotree. Perhaps they have not been allocated a position as yet.

Enter a SNP name to get SNP details

But the above discussion relates to "known" SNPs. Let's take a look at the "unknown" SNPs - the "Novel" SNPs.

Shared SNPs
According to the FTDNA analysis, our two volunteers have 201 "Shared Novel Variants" but when you click on the number 201, the pop-up box not only has Shared Novel Variants but also the SNPs unique to each of the two individuals. So this should not really be under the heading "Shared Novel Variants" as it also includes "unique" variants that are not shared with anyone. A relatively minor criticism, but potentially confusing.

FTDNA's Big Y results page for H1223 - 201 "Shared Novel Variants" with 164729

There are 3 tabs in the Shared Novel Variants pop-up box - one has 156 "Shared" SNPs, one has 45 "unique" to H1223, and one has 13 "unique" to 164729 ... and that adds up to a total of 214 ... so where does the 201 come from?? 156 + 45 is 201 ... so did they forget the other 13 SNPs? Other numbers for nearby neighbours (155, & 190) also do not add up correctly. This is not potentially confusing - it is confusing.
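If you want to run the same check on your own Big Y results page, the sums can be sketched as follows. The tab counts and the headline figure are the ones quoted above. The code simply tries every combination of tabs to see which one reproduces the headline number, and is my own illustration rather than anything FTDNA provides.

```python
from itertools import combinations

# Counts on each tab of the pop-up box, and the headline figure (from this post)
tab_counts = {"shared": 156, "unique to H1223": 45, "unique to 164729": 13}
headline = 201

total = sum(tab_counts.values())

# Which combinations of tabs add up to the headline number?
matches = [
    combo
    for size in range(1, len(tab_counts) + 1)
    for combo in combinations(tab_counts, size)
    if sum(tab_counts[name] for name in combo) == headline
]

print(f"All three tabs together: {total}")
print(f"Combinations that give {headline}: {matches}")
```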

Pop-up box with 3 tabs showing Shared SNPs & unique SNPs

Apart from the confusion over the term "Shared" and the actual number of SNPs detected, there are several further sources of confusion.

Firstly, the definition of the term "Novel" in the phrase "Shared Novel Variant". Novel is supposed to refer to SNPs that have never been discovered before. But ... before when? The definition of Novel varies between companies so what is novel to FTDNA may not be considered novel to YFULL. And vice versa. Furthermore, presumably anything "novel" has a time-limit, after which it becomes classified as "known" ... but no one knows when this time-limit expires. And this may also differ among companies ... one man's "cutting edge" may be another's "yesterday's news". There is no standardisation. So caution is necessary when interpreting these results and comparing them between companies. There will be differences in how companies report the same data.

Here's another source of confusion. FTDNA reports 156 Shared SNPs whereas YFULL does not give this actual number - it places the two individuals together on the YFULL tree sharing 10 SNPs in their shared Terminal SNP Block (Y18109), 5 SNPs shared at the branching point above that (Y17535 branch), and possibly 7 SNPs on the branching point above that (L1198 branch). So, where on the tree are these 156 shared SNPs that FTDNA says the two volunteers share? Do they go right back up the tree, back to "genetic Adam"?

And this is also where we encounter our next problem - FTDNA do not report SNP names, only SNP positions. This makes it difficult to identify SNPs and compare results between companies - some people use SNP names for identification, other people use SNP positions. In order to find out the SNP names (and thereafter ascertain where on the tree they sit), we have to enter every SNP position on YBrowse to see if there is a corresponding name (or several corresponding names). That's 156 SNP positions!! What a palaver!

Below is a screenshot of ISOGG's YBrowse utility. By entering the position in the search box, you can find if there are any particular SNPs at that particular position on the Y chromosome. You have to enter the position in the format shown. The example below is for position 7,321,330 and there are (apparently) 4 different SNP names at this particular position. This initially suggests that they are all equivalent SNPs (i.e. same SNP, different names) but further examination of the Details for each of the 4 SNPs reveals that there is a contradictory direction of mutation - was it from C to A (SNPs 1,3,4), or from A to C (SNP 2)? Which came first? The chicken (C) or the Egg (A)? [Note: allele-anc refers to the ancestral value (i.e. the original or reference value) and allele-der refers to the derived or mutated value.]

YBrowse reveals there are 4 SNPs at position 7,321,330

Details of the 4 SNPs with contradictory directions of mutation

A further point of confusion is the fact that this particular SNP is found in several Haplogroups, namely R, O & Q, whereas we know the Spearin's are in Haplogroup I. So ... what does this mean? This does not look like a SNP that is uniquely shared by just our two volunteers. It appears to be a SNP that is shared not just by our two volunteers but by a host of other people??... including people in other haplogroups? In which case, there is really not much point in me trying to identify all 156 "shared SNPs" that FTDNA says our two volunteers have in common.

I stopped after five!

What about the 10 SNPs shared between our two volunteers (the so-called Y18109 block) on the YFULL tree? Are these included in FTDNA's list of 156 shared SNPs? And what about the shared SNPs further upstream at branching points (Y17535, L1198, etc)? Are these also in the FTDNA list of shared SNPs?

Well, it was possible to use YBrowse to identify the positions for each of the SNPs on the YFULL tree. And then compare these positions to the FTDNA list to see if they appeared there. Here's what was found:
  • all 7 SNPs in the L1198 block are missing from FTDNA's Shared Novel Variants list ... but this could be because they are relatively well-established "upstream" SNPs and therefore do not meet the criteria for "Novel"
  • 3 of the 5 SNPs in the Y17535 block are present in FTDNA's list but 2 are missing (see diagram below) ... however, one of them (Y17491) turns up in FTDNA's list of unique SNPs for H1223 (YF04316)! It seems this particular SNP was recognised as a unique SNP by FTDNA but as a shared SNP by YFULL. So who is "right"?
  • 6 of the 10 SNPs in the Y18109 block are present in FTDNA's list but 4 are missing (Y18109, -10, -16, & -18) ... and again, 2 of them turn up in FTDNA's list of unique SNPs for H1223. These SNPs are identified as unique by FTDNA but shared by YFULL.  So who do we believe?
The fact that the Y18109 SNP is missing from FTDNA's Shared SNP list is highly confusing because FTDNA have assigned the terminal SNP for both our volunteers as Y18109. How can they do this if it does not turn up as a shared SNP between the two volunteers??? However it does appear in the list of SNPs tested for each of our volunteers on their Haplotree & SNPs page on the FTDNA website.  And when I download the SNPs from each volunteer into a csv file, there it is, Y18109, in both files, and derived from the Big Y test! Why then does it not turn up in the Shared Novel Variants list? Perhaps it is classified as a "known" SNP? And that's why it turns up in the downloaded csv file but the others do not? But that still does not explain the absence of the other 3 missing SNPs from our terminal Y18109 SNP block.
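I did this cross-check by hand in a spreadsheet, but the same comparison could be sketched in Python along the following lines. Please note that the SNP names and positions below are made-up placeholders and NOT real Y-chromosome coordinates, and the function name is my own invention. In practice you would paste in the positions looked up on YBrowse and the positions from FTDNA's pop-up box.

```python
def classify_block(block, ftdna_shared, ftdna_unique):
    """Sort each YFULL block SNP by where its position appears in FTDNA's lists."""
    result = {"shared": [], "unique": [], "absent": []}
    for name, position in block.items():
        if position in ftdna_shared:
            result["shared"].append(name)
        elif position in ftdna_unique:
            result["unique"].append(name)
        else:
            result["absent"].append(name)
    return result


# Hypothetical example data: placeholder names and positions only
block = {"SNP_A": 1001, "SNP_B": 1002, "SNP_C": 1003, "SNP_D": 1004}
ftdna_shared = {1001, 1003}   # positions on FTDNA's "Shared" tab
ftdna_unique = {1002}         # positions on FTDNA's "unique" tab for one volunteer

print(classify_block(block, ftdna_shared, ftdna_unique))
```

Any SNP that lands in the "unique" or "absent" pile is one where the two companies disagree, which is exactly the situation described for Y17491 and for the 4 missing SNPs of the Y18109 block.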

It's a conundrum. A quandary. A mystery.

A portion of my spreadsheet with the 156 Shared Novel Variants reported by FTDNA

Only 3 of the 5 SNPs in the Y17535 SNP Block appear on FTDNA's Shared Novel Variants list

So FTDNA do not identify all the shared SNPs identified by YFULL. Possibly because the two companies have different thresholds / criteria for declaring a SNP to be present.

But it points to a major lack of consistency between the YFULL analysis and the FTDNA analysis. And this naturally will raise concerns in people's minds about the accuracy of the data. Who got it right? Maybe both companies did. Maybe the differences are all down to the different criteria employed by each company for declaring a SNP. Or maybe not. Which analysis do you believe? Which is more reliable?

And what about the rest of the 156 Shared SNPs? Only 9 SNPs relate to the 3 branches of the YFULL tree discussed above - where do the other 147 fit in? Are they further upstream? It would be much more helpful if FTDNA simply reported the SNPs shared uniquely by Person A and Person B and no one else.

So, thus far, the analysis of FTDNA's 156 Shared SNPs has not been very helpful at all. Maybe we'll have better luck with the unique SNPs?

Unique SNPs
FTDNA reports that H1223 (YF04316) has 45 unique SNPs (i.e. not shared with 164729 / YF04104) and similarly 164729 (YF04104) has 13 unique SNPs (i.e. not shared with H1223 / YF04316). This differs considerably from the 16 and 51 unique SNPs reported by YFULL above. 

Unique SNPs reported by each company
But once again, the different companies have different criteria for declaring a SNP and this affects the results. If we take a closer look at the reporting criteria, FTDNA describe their "confidence" in the SNP as high, medium or unknown. In contrast, YFULL describes the "quality" of the SNP as best, acceptable, ambiguous, & low. Neither set of criteria is right or wrong - they are merely different approaches.

And when we compare the two sets of unique SNPs, there is only agreement between FTDNA and YFULL with regard to 2 unique SNPs for member H1223 (YF04316) and 1 unique SNP for member 164729 (YF04104). These are illustrated in the diagrams at the end of this post. 

  • Note that for H1223, none of YFULL's "Ambiguous quality" SNPs are reported by FTDNA. And similarly all but 2 of FTDNA's "high confidence" SNPs are reported by YFULL.  There are 3 "Best Quality" SNPs from YFULL (green and yellow highlight) but only 2 of these (yellow highlight) are declared by FTDNA. 
  • For 164729 (YF04104), only 1 unique SNP is declared by both companies (yellow highlight). This is deemed to be of "high confidence" by FTDNA and "acceptable quality" by YFULL.

Therefore, in terms of consistency or agreement between the two companies, the vast majority of unique SNPs declared by one company are not declared by the other. In terms of percentages this works out as: 2/45 (4.4%) and 1/13 (7.7%) agreement for FTDNA; and 2/16 (12.5%) and 1/51 (2%) for YFULL. This gives an average consistency score of a mere 6.7%. Or to put it another way, the companies will disagree 93.3% of the time.
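For those who like to check the sums, the agreement percentages can be reproduced with a short Python sketch. The counts are those given in this post and the layout is my own. The average works out at roughly 6.6-6.7%, depending on whether you round the individual percentages before averaging them.

```python
# (who declared the SNPs, number also declared by the other company, number declared)
agreements = [
    ("FTDNA, H1223", 2, 45),
    ("FTDNA, 164729", 1, 13),
    ("YFULL, H1223", 2, 16),
    ("YFULL, 164729", 1, 51),
]

percentages = [100 * agreed / declared for _, agreed, declared in agreements]
average = sum(percentages) / len(percentages)

for (label, agreed, declared), pct in zip(agreements, percentages):
    print(f"{label}: {agreed}/{declared} = {pct:.1f}%")
print(f"Average agreement: {average:.2f}%")
```

This is a simple average of four percentages, so each list counts equally regardless of its length. Other ways of summarising the agreement would give somewhat different figures, but the overall picture of very low agreement would remain.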

So, even though we have a huge amount of information from both analyses, there are major differences between the two companies and what they put in their reports. The amount of inconsistency is quite astounding and highlights the need for caution in interpreting these reports.

To resolve these inconsistencies in reporting, we have to delve deeper into the data itself. And that means exploring the vcf files, bed files, and BAM files that contain the fine details of our DNA results (not accessible to Project Administrators without the express permission of the project members concerned). This is not a job for the faint-hearted and involves many hours of review and analysis. It is not a task that most Surname Project Administrators would embrace, and personally, I leave this type of analysis to the experts - the Haplogroup Project Administrators. This highlights the need for a close collaboration with people like Wayne Roberts and Aaron Salles Torres who are administrators of the I-M223 project. They have an overview of much more data than any Surname Project Administrator, and can potentially see patterns that would be easily missed by someone looking at a mere subset of the data.

Despite all the above caveats, we have actually learnt quite a lot from SNP testing. Both interpretations of the SNP data (by FTDNA and YFULL) place us in more or less the same position on their respective haplotrees. They both assign the same terminal SNP (Y18109). And there is some (minor) agreement on what are likely to be unique SNPs for each individual.

This entire exercise has been very useful in highlighting the fact that there is currently no standardisation in the way that the data from the Big Y test is analysed and interpreted. The same applies to other NGS tests, such as those offered by FGC (Full Genomes Corporation). And this is no surprise. We have to bear in mind that we are on the crest of the wave of scientific discovery here. We are the first explorers in a brave new world. As a community, it will take time for us to take in what we are seeing, analyse it, make sense of it, and arrive at a consensus regarding the best way to interpret and present the data. As Humphrey Bogart said to Claude Rains, this is the beginning of a beautiful friendship.

In the next post we will be looking at a topic that is (perhaps) a little bit more straightforward: TMRCA estimates - the Time to the Most Recent Common Ancestor.

Maurice Gleeson
April 2016

Unique SNPs for member H1223 - only 2 SNPs were jointly declared by both companies

Unique SNPs for member 164729 - only 1 SNP was jointly declared by both companies

Update 4 August 2016
I received this helpful comment from the I-M223 Yahoo Discussion Group:
Regarding the numbers reported in the Shared Novel Variants pop-up boxes I can offer the following. The first tab in a Shared Novel Variants pop-up box is shown in the attached figure. It states that there are 157 shared entries. Notice the position 14263127 is ancestral, i.e. G-G, and should not be in the list. There is one other like that so in reality there are 155 shared entries. In this case the mystery number is 200 leaving 45 that I cannot account for.
I reconciled the above against novel variants listed in the data exported as CSV files. Those same two bogus entries are present. The other kit in the comparison has 18 such bogus entries. After elimination of the bogus entries one kit has 28 novel variants not shared and the other has 26. These agree with the numbers reported as not shared in the other tabs of the pop-up box.

So from this it appears that there is a bug in the FTDNA system but this does not account for the discrepancies previously noted.

Tuesday, 5 April 2016

Results of Big-Y SNP testing

Last year we raised some money for SNP testing of two of the project members from the first group in the project, Genetic Family 1 - the Limerick Spearin's.

Previously we had undertaken sequential SNP testing, with member 200083 kindly volunteering to be the "group representative" for Genetic Family 1 (GF1). The progress of this SNP testing has been covered in previous project updates. He tested positive for SNPs Z78, Z185, L1198, and Z166 and negative for Z190, Z79, F3195, and PF5268. This testing helped us move down the human evolutionary tree, placing us on sub-branches that were further and further downstream, so that by last year we were placed on the Z166 sub-branch of the human evolutionary tree.

The "SNP Progression" looked like this:
I- ... M438 > L460 > P214 > M223 > CTS10057 > Z161 > C6433 > Z78 > L1198 > Z166

This progression is represented diagrammatically below and shows the particular sub-branch of the evolutionary tree whereon Genetic Family 1 sits. Or at least, where it sat in February 2015. Since then, things have changed. (Note that our terminal SNP Z166 was not included in this diagram from 2013).

To move us even further down the human evolutionary tree (also known as the Haplotree), we decided to undertake Big Y testing. This test investigates up to 50,000 SNP markers on the Y-chromosome. The SNP markers are different DNA markers to the STR markers that we see on the Results pages of the project and you can read a blog post about the differences between SNP and STR markers here.

The two volunteers for Big Y testing (H1223 & 164729) were chosen on the basis that they appeared to be the most genetically different members of Genetic Family 1, based on their STR marker differences. The Genetic Distance between the two members was 5/67 (i.e. there was a 5-step difference between them on their 67 marker test results). Also, one had origins in Limerick whereas the other had been in and around Georgia since the early part of the 1800s.
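To illustrate what a "step difference" means, here is a rough sketch in Python that adds up the differences in repeat counts, marker by marker. The marker names are real STR markers, but the values are made up purely for illustration and are not the results of any project member. FTDNA's own calculation may treat some markers differently, so this simple stepwise count is an assumption, not their method.

```python
def genetic_distance(haplotype_a, haplotype_b):
    """Sum of absolute repeat-count differences over the markers both men share."""
    shared_markers = haplotype_a.keys() & haplotype_b.keys()
    return sum(abs(haplotype_a[m] - haplotype_b[m]) for m in shared_markers)


# Hypothetical values for four markers (a real comparison would use 67).
member_a = {"DYS393": 14, "DYS390": 23, "DYS19": 15, "DYS391": 10}
member_b = {"DYS393": 14, "DYS390": 24, "DYS19": 15, "DYS391": 12}

gd = genetic_distance(member_a, member_b)
print(f"Genetic Distance = {gd}/{len(member_a)}")
```

In this toy example, one marker differs by one step and another by two steps, giving a Genetic Distance of 3 over the 4 markers compared.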

The Big Y tests were conducted after our Spearin Reunion during the summer of 2015 and the first results became available in late 2015. Thereafter, they were analysed and attempts were made to place the newly discovered SNP markers on the human evolutionary tree. This can take some time to interpret because the number of tests available for comparison is limited. Since then there has been ongoing communication with the I-M223 Project Administrators trying to interpret what the results mean and what they tell us.

Firstly, we have a new terminal SNP marker for Genetic Family 1. It is Y18109. This has moved us at least two branches further down the human evolutionary tree. Our new SNP Progression looks like this:
I- ... M438 > L460 > P214 > M223 > CTS10057 > Z161 > C6433 > Z78 > CTS8584 > Z185 > Z180 > L1198 = Z166 > Y17535 > Y18109
And here is a diagram (from SNP Z78 downwards) showing where the new SNPs (green) place us on the Haplotree (taken from FTDNA's version of the Haplotree, which can be found on your own Results pages). Note that some additional SNPs (brown) have been included since the 2013 version and this is typical - more SNPs will be discovered and added as the science evolves.

The new terminal SNP for GF1 on the FTDNA Haplotree
(green = tested positive; red = tested negative)
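The change between the old and new SNP Progressions can be picked out with a few lines of Python. This is just a sketch: the two progressions are copied from the text (with the leading "I- ..." dropped), and the function and variable names are my own. It lists the SNPs that appear in the new progression but not the old one, and those that sit downstream of our previous terminal SNP.

```python
import re


def parse_progression(text):
    """Split a progression such as 'A > B = C' into a list of SNP names."""
    return [snp for snp in re.split(r"\s*(?:>|=)\s*", text.strip()) if snp]


old_progression = (
    "M438 > L460 > P214 > M223 > CTS10057 > Z161 > C6433 > Z78 > L1198 > Z166"
)
new_progression = (
    "M438 > L460 > P214 > M223 > CTS10057 > Z161 > C6433 > Z78 > CTS8584 "
    "> Z185 > Z180 > L1198 = Z166 > Y17535 > Y18109"
)

old_snps = parse_progression(old_progression)
new_snps = parse_progression(new_progression)

# SNPs listed in the new progression that were absent from the old one.
added = [snp for snp in new_snps if snp not in old_snps]

# SNPs downstream of the previous terminal SNP (the last one in the old list).
previous_terminal = old_snps[-1]
downstream = new_snps[new_snps.index(previous_terminal) + 1:]

print("Added:", added)
print("Downstream of", previous_terminal + ":", downstream)
```

The two SNPs downstream of Z166 are the "two branches further down" mentioned above. The other additions are intermediate SNPs that fill in the path between Z78 and L1198.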

So, this new data raises several questions: when in time did these new SNP markers emerge? How far down the human evolutionary tree are we now? And who are our neighbours on this new sub-branch? Can we learn anything from them? Does this new information tell us anything about the origins of the Spearin surname?

In order to date these new SNPs, we turn to YFULL. Their experimental Haplotree has time estimates for the emergence of these SNPs, as illustrated below.


From this we see that our new terminal SNP (Y18109) is estimated to have come into existence about 2200 years ago (with a 95% Confidence Interval of 3100 to 1500 years ago). The SNP above it (Y17535) is about 2800 years old, and so came into existence about 600 years before Y18109. And the ones above that (the equivalent SNPs L1198 and Z166) are about 3000 years old. So this takes Genetic Family 1 further downstream and places us at roughly 200 BC. But it could be anywhere from 1100 BC to 500 AD. So we are still fairly far back in time, and certainly not at the point where surnames came into common usage (about 800-1000 years ago).
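The conversion from "years ago" to calendar dates is simple subtraction, sketched below in Python. I have assumed 2016 (the year of writing) as the reference year, and the function name is my own. The results are then rounded to the nearest century in the text, since the estimates themselves are only approximate.

```python
def years_ago_to_calendar(years_ago, reference_year=2016):
    """Convert 'years before present' to a calendar year label (no year zero)."""
    year = reference_year - years_ago
    if year > 0:
        return f"{year} AD"
    return f"{1 - year} BC"


# YFULL estimate for Y18109: about 2200 years ago (95% CI 3100 to 1500).
estimate = years_ago_to_calendar(2200)
earliest = years_ago_to_calendar(3100)
latest = years_ago_to_calendar(1500)

print(f"Y18109 emerged about {estimate} (range {earliest} to {latest})")

# Gap between Y17535 (about 2800 years old) and Y18109 (about 2200 years old).
gap_years = 2800 - 2200
print(f"Y17535 arose about {gap_years} years before Y18109")
```

Rounding these gives the figures quoted above: roughly 200 BC, with a range of about 1100 BC to 500 AD.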

To learn more about our neighbours we turn to the I-M223 project. Previously GF1 had been placed in Cont1 Group2 and we had quite a few neighbours here (see the previous post for details). But now we sit in a new group, Cont1h1, and we have lost all of our nearest neighbours. They have been split off into different adjacent sub-branches. 

GF1 now sits on its own branch - it's lonely being unique!

Our current nearest neighbour is a chap called Braz whose MDKA (Most Distant Known Ancestor) came from Portugal (green boxes in the diagram below). Neighbours on more upstream branches have ancestry from a fairly diverse range of places, but all in Western Europe, including Denmark, Finland, Netherlands, Germany, England & Scotland. So although recent SNP testing is helping to split this larger group out into smaller sub-branches, we still have a long way to go to narrow down the Spearin origins to a particular country or location.

Nearest neighbours (& their origins) to GF1 with dates for the various sub-branches

So what does the future hold? Well, we will continue to accrue benefits from this Big Y testing over time, so all we have to do now is wait, and keep an eye on it. As more people test, we will get more information for Genetic Family 1 in terms of both the timeline (when did our Spearin-specific branch come into existence?) and ancestral locations (where did it emerge? where did it move to?). 

In subsequent blog posts we will take a closer look at the actual SNP markers discovered for our two volunteers and what this tells us about how closely they are related. We will also explore what this means for Genetic Family 1 and generate a Mutation History Tree for this genetic family.

Maurice Gleeson
April 2016