Saturday 24 August 2019

Project Update 2019 - Part 2: Bridging the Gap

In the first part of this update, we illustrated how Genetic Family 1 (GF1, the Limerick Spearin's) sit on a very isolated branch of the Tree of Mankind with very few clues as to the origins of the group. Big Y testing by the outliers in GF1 (namely Laveaud, Wall, Graham, & Church) might provide some additional pointers but we might be more successful in addressing the question by targeted recruitment of English Spearing's and European Spiering's (targeted outreach via Facebook is ongoing).

In this update we will look at building a "family tree" for the Limerick Spearin's of Genetic Family 1 (GF1) using their DNA data to reconstruct the branching structure of the tree back to the Early Limerick Spearin's (Mathew, Nicholas & Luke, the presumed sons of George Spearin born in London in 1646).

Can we bridge the gap?

Most of the GF1 members have well-characterised Brick Walls in their family trees at around 1800. Before the 1800 timepoint, there is a gap of about 2-4 missing generations: Mathew, Nicholas & Luke were probably born in the late 1660s, their children in the 1690s-1700s, their children in the 1710s-1720s (Missing Generation 1), their children about 1740 (Missing Generation 2), their children about 1770 (Missing Generation 3), and their children were the Brick Wall ancestors we see in the family trees of many members of GF1.

Slide from Project Update YouTube video (2015)

One intriguing question is: would it be possible to bridge the gap by using DNA? In other words, could we use DNA to help define the branching pattern among the 14 members? If we could, we might be able to say that one group of families descend from 1 son, another group from another son, and a third group from the third son. We might never be able to say which son was which (Mathew, Nicholas or Luke) but knowing the branching structure of the tree would help us focus our research. We might never be able to identify every ancestor in the 3-4 missing generations but the DNA could potentially provide a framework (i.e. the branching structure) for the missing generations.

Potentially.

It is possible that both Y-DNA and Family Finder results (i.e. autosomal DNA) might be helpful in defining the branching structure. Let's take a look at Y-DNA data first.  

Using Y-DNA to extend the Family Tree into the gap

A previous attempt was made in 2015 to build a "family tree" based on Y-DNA data (specifically the STR results generated by the standard Y-DNA-37 test - see diagram below). This is reviewed in this YouTube video here. Since then, a few additional members have joined the project, others have upgraded from the 37-marker test to the 67-marker test (Y-DNA-67) or the 111-marker test (Y-DNA-111), and we have additional SNP data available (thanks to Big Y testing of 2 project members, which is reviewed in a series of earlier blog posts starting here).

Family Tree for the Limerick Spearin's (GF1) based on Y-DNA data (2015)

There is also a new online tool called the SAPP tool which allows us to combine genealogical data, STR data, and SNP data together in order to produce a "best fit" family tree for everyone in GF1. Using this new tool, a family tree containing data from the 14 Spearin's of GF1 was produced - see below; details in footnote [1]. Sadly, it does not give us much more information than what we had already produced in the earlier version of the "family tree". However, it does give a more accurate date for the overall MRCA (Most Recent Common Ancestor) for the entire group, namely 1750 (range 1650-1850), which roughly ties in with the known genealogy.

The "best fit" family tree generated by the SAPP tool (Spearin_7 MHT)
Note: the ID numbers reflect: the order of the group members on the Results Page, their initials,
the last 4 digits of their kit number, and the family to which they belong.

Translating this diagram into a more user-friendly version gives us the family tree diagram below. This shows the following features:
  • the Y18109 9-SNP Block discussed in the previous post (which was presumably also carried by George Spearin who was born back in 1646)
  • the various branches with their STR mutations identified from the standard Y-DNA results
  • the family ID  for each of the 14 members (you can see the pedigree for each family here) as well as the individual ID numbers (initials and last 4 digits of the kit number) and the S numbers used for SAPP
  • the number of STRs tested by each member
  • the Private / Unique SNPs possessed by the 2 members who have done the Big Y-500 test [2]
  • the number of potential STR mutations identified among the additional STR markers (up to 450) included in the Big Y-500 test (note that this test has been updated to the Big Y-700 as of 2019 and this new test is anticipated to detect about 50% more SNPs than the Big Y-500 and provide up to 200 additional STR markers)

User-friendly version of the "best fit" family tree generated by the SAPP tool. Shared mutations are highlighted, but only the orange highlight indicates branch-defining shared mutations.

There are several important points to note about this "best fit" family tree:
  • Despite the STR & SNP testing carried out to date, the DNA has been of practically no help in defining specific branches:
    • DNA predicts a branching point (CDYb>42) within the ON1 family (George 1775), which we already knew about from the known genealogy.
    • And it predicts another branch (pre-1790) based on a mutation in the STR marker CDYb (it decreases in value from 41 to 40) which suggests that families ON2, NSW2 & NJ1 share a more recent common ancestor than the other families. However the CDYb marker is notorious for flipping back and forth in value from generation to generation so this may be a false conclusion and I don't trust it.
  • There are 20 mutations identified via STR testing (up to 111 STRs) and (at least) an additional 3 mutations identified via the extra STRs tested as part of the Big Y-500 test. [3] This gives a total of 23 STR mutations.
  • Most of the 23 STR mutations are not shared i.e. they occur in a single individual.
  • There are 11 shared mutations, and of these, 5 of them are potentially branch-defining (CDYb<40 is shared by 3 people and CDYb>42 is shared by 2 people). The rest (6) are Parallel Mutations i.e. the same mutation occurs by chance in two separate lines of descent (413b>23, CDYa>34, & 712>21, each occurring in 2 people).


The Way Forward with STRs?

In order to define branching points within the "best fit" family tree, we need a lot of mutations (both STR & SNP) that are shared by some members but not by others. And so far we have only identified 2 branch-defining STR mutations (CDYb<40 and CDYb>42, discussed above). So what are the chances of identifying additional branch-defining mutations via more extensive Y-DNA testing (e.g. by upgrading to 111 STR markers, and/or doing the Big Y-700 test)? And would this allow us to define the branching structure of the missing 3-4 generations?

The short answer is: we wouldn't know until we did it, and the chances are probably low.

Here's why.

  • 14 mutations were identified among the 14 people who tested the first 37 STR markers (markers 1-37; n=37)
  • 4 mutations were identified among the 7 people who tested the next 30 markers (markers 38-67; n=30)
  • 2 mutations were identified among the 2 people who tested the next 43 markers (markers 68-111; n=43)
  • 3 mutations were identified among the 2 people who tested the Big Y-500 STR panel (markers 112-561; n=450)
This is summarised in the table below.

STR mutations (yellow/green) among the 14 members of GF1

From this we can calculate crude mutation rates as follows:
  • Markers 1-37 ... 14 / (37 x 14) = 0.0270 = 27 / 1000
  • Markers 38-67 ... 4 / (30 x 7) = 0.0190 = 19 / 1000
  • Markers 68-111 ... 2 / (43 x 2) = 0.0233 = 23 / 1000
  • Markers 112-561 ... 3 / (450 x 2) = 0.0033 = 3.3 / 1000
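For anyone who wants to check the arithmetic, the crude rates above can be reproduced with a few lines of Python. This is only a sketch: the panel figures are copied from this post, and the names `panels` and `crude_rate` are made up for illustration.

```python
# Figures copied from the post:
# (panel label, mutations seen, markers in panel, people tested)
panels = [
    ("Markers 1-37", 14, 37, 14),
    ("Markers 38-67", 4, 30, 7),
    ("Markers 68-111", 2, 43, 2),
    ("Markers 112-561", 3, 450, 2),
]


def crude_rate(mutations, markers, people):
    """Mutations observed per marker tested, per person tested."""
    return mutations / (markers * people)


rates = {label: crude_rate(m, k, p) for label, m, k, p in panels}

for label, rate in rates.items():
    print(f"{label}: {rate * 1000:.1f} per 1000")
```

Bear in mind that these are crude rates from very small numbers of people, so they are a rough guide rather than true per-generation mutation rates.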

This suggests that most mutations will occur among the first 37 markers (which supports the use of the Y-DNA-37 test as the standard initial test for those joining the project). However it also suggests that a significant number of mutations would also be found by testing to 67 markers and 111 markers (although this conclusion is based on only 7 and 2 participants respectively).  The STR Panel associated with the Big Y-500 test has the lowest mutation rate, but because there are 450 STR markers in this panel, it will still generate significant numbers of mutations. Upgrading from Y-DNA-37 to Y-DNA-111 would cost about $190 whereas the Big Y-700 test would cost about $500 so both options are costly.

Of the 23 STR mutations identified thus far, 11 (48%) were shared mutations, and of these 6 (26%) were Parallel Mutations (according to the "best fit" family tree) and 5 of them (22%) were branch-defining mutations, arranged in 2 sets - 2 people shared CDYb>42, and 3 people shared CDYb<40. (And to repeat, the latter may be a false finding as the CDY markers are very fast-mutating markers and may shift back and forth in value from one generation to the next).
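The breakdown above can be tallied as follows. This is a minimal sketch using the carrier counts quoted in this post; which mutations count as branch-defining and which as parallel is taken from the "best fit" tree as described here, not worked out by the code.

```python
TOTAL_MUTATIONS = 23  # all STR mutations identified so far

# Number of people carrying each shared mutation (figures from the post)
branch_defining = {"CDYb<40": 3, "CDYb>42": 2}
parallel = {"413b>23": 2, "CDYa>34": 2, "712>21": 2}

n_branch = sum(branch_defining.values())   # 5
n_parallel = sum(parallel.values())        # 6
n_shared = n_branch + n_parallel           # 11
n_unshared = TOTAL_MUTATIONS - n_shared    # 12


def pct(n):
    """Share of all STR mutations, as a whole-number percentage."""
    return round(100 * n / TOTAL_MUTATIONS)


print(f"shared: {n_shared} ({pct(n_shared)}%)")
print(f"parallel: {n_parallel} ({pct(n_parallel)}%)")
print(f"branch-defining: {n_branch} ({pct(n_branch)}%)")
```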

So, based on these data, we would predict that testing everyone to 111 markers would generate a further 16 mutations: about 4 from the 7 members who have not yet tested markers 38-67 (7 x 30 markers x 19/1000) and about 12 from the 12 members who have not yet tested markers 68-111 (12 x 43 markers x 23/1000). Of these, about 20-25% (3-4) would be shared, branch-defining mutations, and about 40% of those (1-2) would fall in the period of the 3-4 missing generations (approximately 1690 to 1800). And you need at least 2 people with a shared mutation to form a new branch, so the most we could hope to identify with STR markers is 1 new branch.
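The projection above can be sketched in Python. It assumes the crude rates seen so far simply carry over to the members who have not yet tested each panel, and it uses the post's own rough shares for branch-defining mutations (5 of 23) and for mutations falling in the 1690-1800 gap (40%). The function and variable names are illustrative only.

```python
TOTAL_MEMBERS = 14  # members of GF1

# Figures from the post: panel -> (markers in panel, people tested so far, mutations seen)
PANELS = {
    "38-67": (30, 7, 4),
    "68-111": (43, 2, 2),
}

# Assumptions taken from the post's own estimates, not measured values
BRANCH_DEFINING_SHARE = 5 / 23  # 5 of the 23 mutations so far were branch-defining
IN_GAP_SHARE = 0.40             # rough share falling in the 1690-1800 period


def projected_new_mutations(markers, tested, seen, total=TOTAL_MEMBERS):
    """Extrapolate the crude rate to the members who have not yet tested this panel."""
    rate = seen / (markers * tested)
    return rate * markers * (total - tested)


projected = {name: projected_new_mutations(*figures) for name, figures in PANELS.items()}
total_new = sum(projected.values())
branch_defining = total_new * BRANCH_DEFINING_SHARE
in_gap = branch_defining * IN_GAP_SHARE

print(f"new mutations expected: {total_new:.0f}")
print(f"of which branch-defining: {branch_defining:.1f}")
print(f"of which in the 1690-1800 gap: {in_gap:.1f}")
```

Changing either of the two assumed shares shifts the answer noticeably, which is a fair reflection of how soft this estimate is.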

But this is merely an estimate based on the data we have so far. The final picture (if everyone upgraded) could look considerably better ... or considerably worse.

Could SNPs help?

Similarly, if everyone did the Big Y-700 test, what's the best we could hope for? How many unique SNP mutations might it reveal?

The 2 members who did the Big Y test are reported to have 2 unique SNP mutations each. [2] Even if all the group members had 2 new mutations each (28 in total), not all of them would be branch-defining within the 1690-1800 time period of the missing generations. We could guesstimate that 50% (14) of the new mutations would be private SNPs unique to individual members and the other 50% (14) would be shared with other project members (i.e. branch-defining), but that only about 25% of the total (7) would fall within the missing generations period (1690-1800). This gives us only 7 branch-defining SNPs ... but this is just a guesstimate.

And as it takes a minimum of 2 shared mutations to define a branch, only a maximum of 3 branches could thus be defined within the time period of the missing generations. And this would allow us to separate the 14 members into 3 distinct family subgroups (at most) within the 1690-1800 time period.

So we could define 1 new branch with STRs and a maximum of 3 with SNPs (potentially), and this gives a maximum of 4 new branches within the 1690-1800 time period. And that might help considerably to answer the question: can we bridge the gap?
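The SNP guesstimate and the combined total can be laid out as a short calculation. Every input below is one of the rough assumptions made in this post (2 new SNPs per member, 25% falling in the gap, 2 shared mutations per branch, 1 STR-defined branch), so treat the output as a ceiling on a guess rather than a prediction.

```python
MEMBERS = 14
NEW_SNPS_PER_MEMBER = 2   # average reported for the 2 Big Y testers
SHARE_IN_GAP = 0.25       # guesstimate: shared SNPs falling in 1690-1800
SNPS_PER_BRANCH = 2       # minimum shared mutations taken to define a branch
STR_BRANCHES = 1          # best case from the STR estimate

total_snps = MEMBERS * NEW_SNPS_PER_MEMBER        # 28
snps_in_gap = int(total_snps * SHARE_IN_GAP)      # 7
snp_branches = snps_in_gap // SNPS_PER_BRANCH     # 3 (the odd SNP is left over)
max_new_branches = STR_BRANCHES + snp_branches    # 4

print(f"SNPs expected in total: {total_snps}")
print(f"SNPs in the 1690-1800 gap: {snps_in_gap}")
print(f"maximum new branches (STR + SNP): {max_new_branches}")
```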

But this is only an estimate.

And is it worth it?

What do you think?

Conclusions

This has been a very useful exercise. But there remains considerable doubt as to whether upgrading everyone to 111 markers or the Big Y would produce meaningful results. And it would only have the best chance of working if everyone upgraded (and we know that not everyone will) because we always need something to compare the results to - a single result in isolation is essentially worthless. Currently (for comparative purposes) we have 14 sets of Y-37 results, 7 sets of Y-67 results, 2 sets of Y-111 results, and 2 sets of Big Y-500 results.

I am currently using the General Fund to upgrade 2 members (the ones who did the Big Y test) from Y-DNA-67 to Y-DNA-111. It only costs $29 each and it may produce some interesting results so it is worth doing. It would bring the total number of members who have done the Y-111 test to 4.

However, cost is an important consideration. The cost of everyone upgrading to the Big Y-700 would be in the region of $6000 (for 12 people). And that is a lot of ice cream. Would the money be better spent elsewhere?

Therefore I would not recommend upgrading to Y-DNA-111 or doing the Big Y-700 test unless you are particularly curious. And the reason for not recommending this is because there are serious doubts as to whether it is capable of addressing the particular issue at hand i.e. trying to bridge the gap of the 3-4 missing generations by defining the branching structure of the family tree in that particular tranche of time (1690-1800).

Might we be better using Family Finder data (i.e. autosomal DNA, atDNA)? This will be explored in the next blog post.

Hang in there!

Do good things come to those who wait?
Maurice Gleeson
Aug 2019

Footnotes, Sources & Links

[1] The SAPP tree was generated in a series of steps. A Mutation History Tree (MHT) was generated for each step from Step 2 onwards and a sense-check was performed.
  1. firstly a text file was generated with the crude data
  2. floating STRs (from results transferred from HeritageDNA) were removed, missing values (markers 31-35) for S03 were taken from S05 (same person, duplicate test), labels were added
  3. Z166 modal was used as an anchor, SNP & genealogy data was added
  4. floating STRs were restored & missing STRs (tentatively) imputed from GF1 modal
  5. CDYa&b were ignored
  6. CDYa&b were reactivated, outliers ignored, S03 ignored (duplicate of S05), George 1775 added
  7. CDYa&b changed in Z166 modal from 34-39 to 33-41 to reflect GF1 modal, MDKA birth locations added where known

[2] The 2 members who have done the Big Y share the 9-SNP Block headed by Y18109. Presumably all of these SNPs were shared by the overall common ancestor for GF1 (which we presume to be George Spearin born in London in 1646, son of George Spearin & Rebecca Carter).

These 2 project members also appear to have several "Private" SNPs, i.e. SNP markers that are unique to each of them individually (and not shared by anyone else in the entire FTDNA database). However, because of the way FTDNA present the data, it can be very difficult to identify which unique SNP belongs to which person:
  • The GA1 member (PMS-4729) has 1 unnamed variant 
    • 8480410 = Y47137 (discovered by YFULL in 2015) 
  • The LIM10 member (JS-1223) has 2 unnamed variants
    • 4503779 = BY58131 (discovered by FTDNA in 2018)
    • 8769214 = Y47666 (discovered by YFULL in 2015)
From this we might expect them to have 3 Non-Matching Variants but only 2 are recorded in their respective Big Y results:
  • ZS2445 (position 14,706,801; discovered by Victor Was in 2014) ... where did that come from?!
  • 8480410 = Y47137 (discovered by YFULL in 2015)
The first SNP was discovered in 2014, a year before the 2 members tested (Aug 2015), so this is probably not unique to our 2 project members. But we simply don't know. The second SNP is probably a unique SNP possessed by the GA1 member (PMS-4729).

And this highlights the problem with the way FTDNA present the Big Y results - you can never be sure a) whether the SNPs are genuine / reliable; b) whether they are unique / private SNPs; and c) which particular individual they belong to.

However, both these members have had their Big Y results re-analysed by YFULL and here is what YFULL says:
  • GA1 member (PMS-4729)  has the YFULL ID YF04104.
    • He has 1 private/unique SNP of acceptable quality ... Y47137
    • He also has 42 unique SNPs of ambiguous quality and 1 of low quality. 
  • LIM10 member (JS-1223) has the YFULL ID YF04316.
    • He has 2 unique SNPs of "best quality", namely ... Y47666 (as above) & BY58131 (as above)
    • He has 1 private/unique SNP of acceptable quality ... Y54303 (where did that come from??)
    • He also has 9 unique SNPs of ambiguous quality and 1 of low quality. 
So from the above, it would seem that YFULL identifies 3 unique SNPs (of best or acceptable quality) for the LIM10 member and 1 unique SNP (of acceptable quality) for the GA1 member. This gives 4 in total between the 2 members, and thus an average of 2 per member ... and this latter figure is consistent with what FTDNA describe in the Big Y Block Tree, namely: Private Variants ... Average: 2

It is only by comparing these assessments to additional Big Y data that we can judge which of these SNPs are important and which ones are not.



[3] There may be more STR mutations among the 450 additional STR markers that come with the Big Y-500 test but we would need at least one more person to do the Big Y test in order to ascertain this. This is because at least 3 people are needed to generate the modal value for each STR marker.
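To illustrate why 2 people are not enough, here is a small Python sketch of a modal-value calculation. The repeat counts are invented for the example, and `modal_value` is a hypothetical helper, not part of any FTDNA or SAPP tool.

```python
from collections import Counter


def modal_value(values):
    """Return the single most common value, or None if there is no clear winner."""
    counts = Counter(values).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: we cannot tell which value is ancestral
    return counts[0][0]


# Hypothetical repeat counts for one STR marker
two_testers = [15, 16]        # they differ, but who carries the mutation?
three_testers = [15, 16, 15]  # the third result breaks the tie

print(modal_value(two_testers))    # None
print(modal_value(three_testers))  # 15
```

With only 2 results that differ, there is no way of knowing which person has the mutation; a third result that matches one of them settles it. Note that 3 people is the minimum rather than a guarantee, as 3 different values would still give no modal value.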




