Abstract
Mosaic mutations can be used to track cell lineages in humans. We used cell cloning to analyze embryonic cell lineages in two living individuals and a postmortem human specimen. Of ten reconstructed post-zygotic divisions, none resulted in balanced contributions of daughter lineages to tissues. In both living individuals one of two lineages from the first cleavage was dominant across tissues, with 90% frequency in blood. We propose that the efficiency of DNA repair contributes to lineage imbalance. Allocation of lineages in postmortem brain correlated with anterior-posterior axis, associating lineage history with cell fate choices in embryos. We establish a minimally invasive framework for defining cell lineages in any living individual, which paves the way for studying their relevance in health and disease.
One-sentence summary:
The first cell division in humans produces cellular lineage trees with asymmetric fates.
Somatic mutations, generated after formation of the zygote, can be used as permanent cellular markers to trace cell lineages and their spread throughout the human body (1, 2). Lineages have been observed to contribute unequally to the blood, beginning from the cleavages of the zygote (1). Similarly, normal development may result in unequal characteristics of symmetrical organs, such as different volumes of left and right frontal and occipital cerebral cortex (3); and several pathological conditions exhibit asymmetrical manifestations, such as motor symptoms in Parkinson's disease. However, the timing and rules of cell lineage separation, left-right asymmetry, spread and local expansion in organs are largely unknown.
To study this question, we reconstructed and analyzed early cell lineages in three unrelated individuals: a phenotypically normal living 66-year old female (NC0), a living 29-year old male (LB), and a post-mortem female fetus where we previously studied somatic mutations during neurogenesis, sample 316 (2). For the living individuals, we collected skin biopsies from two locations on left and right arms and one location on left and right thighs (Fig. 1A), and cultured fibroblasts from each biopsy. Multiple induced pluripotent stem cell (iPSC) lines were derived (a total of 74 for LB and 15 for NC0) and a selection of iPSC lines was sequenced from each fibroblast sample (Table S1). For the post-mortem fetal specimen, we re-analyzed sequencing data for 11 clones previously derived from telencephalic neuronal progenitors (2), and also sequenced 2 iPSC lines derived from dural fibroblasts. Most iPSC lines are clonal, and indeed, out of all iPSC lines, only one line showed evidence of being founded by two cells (Fig. S1). Comparing the genomes of iPSC lines/clones from the same person allowed us to discover the somatic mutations present in the founder cells of each clone (i.e. skin fibroblasts for LB and NC0, and brain progenitor cells for 316) (Fig. S2; Methods). Sites with somatic variants were also analyzed in high coverage sequencing data for bulk blood, saliva, and urine from the living individuals, and multiple brain regions and spleen from the fetal specimen, to determine presence and allele frequency in those tissues.
Figure 1. Reconstruction of early cell lineages in two individuals reveals lineage imbalance across tissues.

(A) Outline showing location of the biopsies used to derive iPSC lines from skin fibroblasts. Sequenced iPSC lines are listed in (B, C) at the terminus of each branch (LT#4, LT#7, LT#1 with shallow coverage are italicized). Counts of mutations per line are shown in Fig. S2. (B,C) Early lineage trees, with circles representing cells and lines representing likely parental relationships. Open circles mark cells with no mosaic variant, leading to ambiguous branching. Mosaic SNVs (black) and indels (green) are denoted by Latin and Greek letters. SNVs found only in bulk tissues are marked with asterisks. Lineage frequencies in bulk samples are shown in log scale by bar graphs, with arrows indicating the expected frequencies for a balanced lineage contribution. Squared plots show correlations between the frequencies in bulk samples for upper (Y-axis) and lower (X-axis) branches, with stars indicating expected frequencies for balanced contributions. The division at the branch marked by e, f, g, h may not be fully captured, as frequencies in saliva are inconsistent with the diagonal.
Reconstructing early lineage trees.
To reconstruct the early developmental cell lineages starting from the first zygotic cleavage of each individual, we selected somatic variants shared by clones/lines or by multiple bulk tissues (Figs. 1, S3; Table S2; Methods). To make branches in the lineage, we relied on variant sharing by lines/clones and on the rule that variants from consecutive cell divisions must be present in tissues at progressively decreasing frequencies. Unrepaired DNA damage that propagates through several cell cycles and may result in multiple mutations at the same genomic position (4) is unlikely to interfere with lineage tree reconstruction (Methods). As expected from our selection criteria, all variants in the trees were shared by at least two tissues from different germ layers, i.e., mesoderm (fibroblasts, blood and spleen), ectoderm (brain and, in part, saliva), and endoderm (urine), and therefore arose in common progenitors of the three germ layers before gastrulation.
Occasionally, due to absence of variants, lineages could not be assigned to a particular division and multiple solutions in tree branching were possible (polytomy). In such cases, we chose the solution where sister branches had the most balanced frequencies in tissues, but alternative solutions were also recorded (Figs. S3, S4). To ensure completeness of reconstructed cell divisions, we compared frequencies of the reconstructed sister branches at each division in tissues. Normalized to the frequency of the mother cell, such frequencies should sum up to 100% in all tissues or, in other words, should be located on a f1+f2=100% line in a squared plot (Fig. S5). Within measurement errors, the frequencies of ten reconstructed divisions (2 in NC0, 5 in LB, and 3 in 316) fit the line, suggesting that these divisions are fully reconstructed (Fig. 1, S3).
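The completeness check described above can be illustrated with a minimal sketch. The function names, the tolerance of 0.1, and the example frequencies are all hypothetical and are not taken from the study; the sketch only shows the arithmetic of normalizing sister-branch frequencies to the mother cell and testing whether they sum to roughly 100% in every tissue.

```python
def normalized_sister_sum(mother, daughter1, daughter2):
    """Return (f1, f2, f1 + f2), with daughter frequencies normalized to the mother."""
    if mother <= 0:
        raise ValueError("mother frequency must be positive")
    f1 = daughter1 / mother
    f2 = daughter2 / mother
    return f1, f2, f1 + f2


def division_is_complete(freqs_by_tissue, tolerance=0.1):
    """True if f1 + f2 is within `tolerance` of 1.0 in every tissue.

    `tolerance` is an illustrative stand-in for measurement error.
    """
    return all(
        abs(normalized_sister_sum(m, d1, d2)[2] - 1.0) <= tolerance
        for (m, d1, d2) in freqs_by_tissue.values()
    )


# Hypothetical cell fractions per tissue: (mother, daughter 1, daughter 2)
example = {
    "blood": (0.90, 0.60, 0.28),
    "saliva": (0.80, 0.50, 0.31),
    "urine": (0.70, 0.40, 0.29),
}
# A division with a large unexplained remainder (e.g. an unsampled sister branch)
incomplete = {"blood": (0.90, 0.30, 0.20)}

print(division_is_complete(example))
print(division_is_complete(incomplete))
```

In this toy example the first division passes the check in all three tissues, while the second leaves a large fraction of the mother lineage unaccounted for.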
Imbalanced contribution of early cell lineages.
For any given cell division, an equal 50%:50% representation of corresponding daughter lineages in the progeny would correspond to stars in the squared plots (Fig. S5). For all reconstructed divisions (starting from the first one) we observed imbalanced contributions of sister lineages in tissues, where dots in the squared plots are on the diagonal but away from the stars (Fig. 1, S3). In the two living individuals, the largest imbalance was observed for the first two blastomeres revealing the presence of a dominant and a recessive cell lineage (Figs. 1, 2A). In both individuals, one of the blastomeres generated 70%–90% of cells in tissues, with the second blastomere generating the remaining 30%–10%. The largest imbalance was present in blood, with a contribution ratio of 90:10 from dominant:recessive blastomeres in both NC0 and LB, while the smallest difference was in cells from the urinary tract, with roughly 70:30/80:20 contribution ratio in the 2 individuals; urinary tract cells showed the least deviation also in subsequent cell divisions (Fig. S6). In the postmortem individual no marked imbalance of the first two blastomeres existed, though one of the two alternative trees is consistent with a 90:10 imbalance in the first division (Fig. S3). The ambiguity also existed for LB, but the alternative trees only increase the imbalance (Fig. S4).
Figure 2.

(A) Fraction of cells contributed from the dominant blastomeres in the first cell division to various tissues. (B) Difference of sister lineages contribution to blood, saliva and urine for fully reconstructed cell divisions. (C) Lineage contribution to different brain regions at each cell division; lineages marked by one of the corresponding mutations. FR-GZ, frontal germinal zone; FR-CX, PA-CX and OC-CX, frontal, parietal and occipital cortex, respectively; CB stands for cerebellum. Arrows indicate the correspondence between mother and daughter lineages. (D) Regression of lineage frequency across brain regions. Significant correlations (p-value < 0.05) are shown by solid lines and are marked by stars in C, while non-significant correlations are shown by dashed lines. Lineages with marginal significance (p-value < 0.1) are marked by variants ‘α’, ‘a’ and ‘o’. Regressions are shown for 11 lineages with independent frequencies (see Methods).
Higher prevalence of one sister lineage over the other one was typically consistent across tissues for all divisions (Fig. 2B and S6), suggesting consistency of early lineage allocations across germ layers and the body. We noticed that in LB (Fig. 1) there is a higher fraction of indels among variants of the recessive lineage as opposed to the dominant lineage (5 vs 1, p-value = 0.03 by Fisher’s exact test) (Methods). Instead, in NC0 the cell at the origin of the recessive lineage had 10 SNVs and two indels (compared to only one SNV in the dominant lineage) (Fig. 1). Consistent with that, one of the alternative lineage trees for 316 with the largest imbalance in sister lineages also had a disproportionally large count of SNVs (i.e., 11 vs 0) in the recessive as compared to the dominant lineage (Fig. S3C). Based on these observations, we hypothesize that the efficiency of DNA repair in the early divisions of the human embryo contributes to lineage imbalance.
Lineage distribution in relation to anterior-posterior axis.
The available data allowed us to gain insight into lineages from the 1st to 4th cell division (first week of development), which corresponds to the human embryo pre-implantation stage. We genotyped mosaic SNVs from high depth resequencing, and inferred frequencies of early lineages across multiple brain regions of specimen 316 (see Methods). Six lineages showed significant correlations (p-value < 0.05 by Spearman rank-order) and three more marginally significant correlations (p-value < 0.1) between their frequencies in dorsal brain regions and the arrangement of these regions along the A-P axis, as exemplified by the order FR-GZ => FR-CX => PA-CX => OC-CX => CB (Fig. 2C,D). The two remaining lineages had the lowest frequencies, consistent with significance not being detected due to larger errors in measuring their tissue frequencies. This analysis revealed a systematic relationship between early lineage distribution and A-P axis in brain.
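As an illustration of the rank-order analysis, the sketch below computes a Spearman correlation between a lineage frequency and the position of brain regions along the A-P axis, with an exact permutation p-value that is feasible for a handful of regions. The frequencies are invented for illustration, and the exact permutation test is an assumption of this sketch; the p-values reported in the text may have been obtained with a different (e.g. asymptotic) procedure.

```python
from itertools import permutations


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_exact(x, y):
    """Spearman rho and an exact two-sided permutation p-value (small n only)."""
    rx, ry = ranks(x), ranks(y)
    rho = pearson(rx, ry)
    perms = list(permutations(ry))
    extreme = sum(1 for p in perms if abs(pearson(rx, p)) >= abs(rho) - 1e-12)
    return rho, extreme / len(perms)


# Regions ordered along the A-P axis: FR-GZ, FR-CX, PA-CX, OC-CX, CB
ap_position = [1, 2, 3, 4, 5]
# Hypothetical lineage frequencies in those regions
lineage_freq = [0.30, 0.26, 0.21, 0.15, 0.09]

rho, p_value = spearman_exact(ap_position, lineage_freq)
print(rho, p_value)
```

With five regions, a perfectly monotonic trend is the most extreme outcome possible, so the smallest attainable two-sided exact p-value in this sketch is 2/120.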
Recurrence of population and mosaic variants.
We also sampled and sequenced the blood of LB’s father and mother (Table S1). Two early mosaic SNVs used for reconstruction of the LB lineage tree (SNP β and ε, see Table S2) matched to germline single-nucleotide polymorphisms (SNPs) in LB’s mother (Figs. 3, S7). Both SNPs were also present in the catalogue of germline variants by gnomAD (one with frequencies in human population of more than 0.001). Examination of the SNV sites revealed nearly 50% variant allele frequency in all iPSC lines where the SNVs were called, 0% (or near 0% in one line) allele frequency in all iPSC lines where the SNVs were not called, and intermediate frequency in blood, saliva, and urine (Table S2). We found no deletion or loss-of-heterozygosity in the lines missing the variants, which could have accounted for the variants being inherited but absent in some lines. Furthermore, population-based phasing of germline variants in LB and his parents was consistent with the scenario that haplotypes with the matching SNPs in LB’s mother were not inherited by LB. Ultimately, physical read backed phasing confirmed the nonhereditary nature of the mosaic SNV β in LB (Fig. 3). We, therefore, concluded that these SNVs are bona fide mosaic variants recurrent of known population variants. Sequence context aware simulation (Methods) of a random match of mosaic and germline variants suggested that a count of two recurrent SNVs is unlikely to happen by chance (p-value = 0.006). In addition, we also found that 3, 1, and 3 of early mosaic SNVs in LB, NC0 and 316, respectively, match known frequent (>0.001 allele frequency) population variants in gnomAD (Table S2), which is unlikely to happen by chance according to the same simulation (p-value = 0.03). In sum, we detect a high recurrence of population variants as post-zygotic variants during embryonic development. SNV β discussed above occurs at the boundary of two homopolymers (T)12(A)6. Homopolymers are known to be prone to expansion and contraction (5). 
It is therefore possible that the SNV is the result of T-homopolymer contraction and A-homopolymer expansion, i.e., due to overlap of two indels: TT>T and A>AA. This possibility is also consistent with the above hypothesis that the recessive lineage has a higher fraction of indels, as the discussed SNV is found in branches within the recessive lineage (Fig. 1).
Figure 3. Recurrence of a germline SNP in LB’s mother as a mosaic SNV in LB.

(A) Variant allele frequency of the T>A somatic SNV variant across LB’s iPSC lines (orange), LB’s bulk samples (blue), and blood of his parents (green). (B) Schematics of germline haplotype inheritance based on population-based SNP phasing for LB (LB1, LB2), father (P1, P2), and mother (M1, M2). Ten non-contiguous variable positions in parents downstream and upstream from the somatic SNV are shown. (C) Read level evidence for the maternal haplotype with the variant not being inherited by LB. Every row has a single connected (light grey lines) pair of reads (dark grey lines). The SNP in the mother is in phase with 4 nearby variants, none of which is present in LB.
Discussion
Here we developed and applied a minimally invasive framework for studying early lineages and their imbalance in living human individuals, which consists of two components: 1) derivation and analysis of genomes of clonal iPSC lines to discover somatic mutations and conduct lineage reconstruction; and 2) analysis of variant distribution in bulk tissues such as saliva, blood and urine to establish their hierarchy in development. The iPSC derivation could be expanded to other cell types, such as epithelial cells from urine (6), erythroblast from blood (7), and keratinocytes from skin (8), increasing the diversity of sampling of early lineages. Variant distribution analysis in DNA from bulk blood, saliva, and urine can be complemented by the analysis of DNA from buccal swabs, multiple skin regions, hair follicles, feces, vaginal cells and sperm. Therefore, our study paves the way for comprehensive and large-scale analyses of early lineages and understanding their role in human health and disease.
Our study supports previous observations of an imbalanced lineage contribution to tissues in the human body after the first cleavage (1, 9); however, our results point to a generally higher imbalance of 90:10 versus 2:1 proposed previously, thereby suggesting, at least in some individuals and perhaps more generally, the existence of dominant and recessive lineages starting from the first division of the human zygote. Only minor deviations from the general trend in lineage allocation at each division were observed within each germ layer, implying that the imbalance was established by an intrinsic mechanism in the original lineage precursor cells rather than selective processes within each tissue compartment. The ordered pattern of lineage distribution along the A-P axis across several brain regions suggests that lineage founder cells may bias their daughter’s allocation according to an A-P axial rule at very early stages of human embryonic development.
For LB we observed an excess of indels in the recessive versus the dominant lineage, and we hypothesize that one factor contributing to this imbalance is the efficiency of DNA repair. Indels can be created from polymerase slippage and from faulty mismatch repair (10), while in development most SNVs arise from spontaneous deamination of 5-methylcytosine (11). Additional time spent by a cell on DNA repair may decrease its proliferation rate, leading to a lower contribution to tissues (Fig. S8A). For NC0 and possibly for the fetal specimen, no excess of indels, but rather an increase in SNV burden in the founder cell of the recessive lineage was observed. We hypothesize that this increased burden could also be caused by deficient DNA repair generating genomic instability in the recessive lineage, a phenomenon hypothesized to exist in vivo based on analysis of in vitro fertilized embryos (12). The instability may result in multiple consecutive cleavages giving only one viable daughter cell, and thus leading to accumulation of point mutations from such cleavages in the viable cell; these mutations would retrospectively seem to occur from a single division (Fig. S8B). An alternative explanation could be that at the first cleavage one of the created blastomeres commits mostly to the extra-embryonic lineages, while the other blastomere mostly commits to inner cell mass and becomes the dominant lineage. Although debated (13), such a possibility was previously proposed for mammalian embryos (14, 15). In such a case, cells segregating into the trophoblast would play the same role as non-viable cells resulting from genome instability, i.e., they will not be present in the adult body (Fig. S8B). Thus, lineage studies can shed light onto mechanisms that regulate cell fate decisions during development as well as representation of different lineages in tissues of any living individual and across the human population.
Materials and Methods
Sample Collection for fetal specimen
Collection and handling of fetal sample 316, derivation of clonal cell lines, and sequencing have been previously described (2). De-identified postmortem human brain specimens were obtained after appropriate informed consent. Tissue was handled in accordance with ethical guidelines and regulations for the research use of human brain tissue set forth by the NIH (http://bioethics.od.nih.gov/humantissue.html) and the WMA Declaration of Helsinki (http://www.wma.net/en/30publications/10policies/b3/index.html).
Sample collection for living individuals
Informed consent was obtained from each subject according to the regulations of the Institutional Review Board and Yale Center for Clinical Investigation at Yale University. Skin biopsies (approx. 3 mm³) were collected from the inside of the upper left and right arms (two locations about 5 cm apart and corresponding to different dermatomes) and from the upper left and right thighs, and cultured separately in fibroblast culture media (DMEM high glucose, FBS, L-glutamine, N.E. amino acids, Pen/Strep; all Invitrogen). Fibroblasts attached, proliferated and were passaged twice before DNA was extracted (DNeasy Blood and Tissue Kit; Qiagen, according to the manufacturer's recommendations).
Saliva DNA was collected and purified using the Oragene-Discover kit (DNA Genotek) following the manufacturer's instructions. Saliva DNA was extracted using the DNeasy Blood and Tissue kit (Qiagen) with the following modifications: 5 ml AL-buffer and 200 μl Proteinase K were added to the collected saliva in buffer and incubated at 56°C for 30 minutes. RNA was digested using 20 μl RNAse A (Qiagen) for 5 minutes, and DNA was extracted using 4 extraction columns in parallel to optimize the yield.
For the purification of DNA from urine, urine was collected into 50 ml Falcon tubes and kept on ice during transport. Cells were pelleted by centrifugation (400g for 10 minutes at RT). Supernatant was aspirated and cells were resuspended in 500 μl PBS and 1.5 ml of lysis buffer (DNeasy Blood and Tissue kit; Qiagen) per 50 ml Falcon tube. 100 μl of Proteinase K was added to each tube, and the tubes were vortexed and incubated at 56°C for 25 min. 10 μl of RNAse A was added to each sample and incubated for 5 minutes. DNA was extracted using the DNeasy Blood and Tissue kit (Qiagen).
Blood collection and DNA purification.
For LB, his mother and father, and NC0, 10–15 mL of blood was collected using BD Vacutainer ACD tubes. DNA was extracted using the Gentra Puregene Blood Kit (Qiagen) following standard manufacturer protocols.
iPSC line derivation
The iPSC lines were derived using the Epi5 Episomal iPSC Reprogramming Kit (Invitrogen catalog A15960), which delivers the five reprogramming factors Oct4, Sox2, Klf4, L-Myc, and Lin28. The iPSC lines were propagated using mTeSR1 media (Stem Cell Technologies). Genomic DNA was extracted at passage six, using the QIAamp DNA Mini Kit (Qiagen) following the manufacturer's instructions.
Sequencing of bulk tissues and iPSC lines
Whole genome sequencing (WGS) of bulk DNA samples from living individuals was conducted at 200X, while WGS of iPSC lines was typically conducted at 30X. All sequencing was conducted at BGI using 2×100 bp paired reads. For all samples except urine, sequencing library preparation was PCR-free.
Calling mosaic variants in iPSC lines and clones
Mosaic SNVs and indels were called from an exhaustive all-2-all comparison of all lines/clones for each individual (Fig. S2). For each pairwise comparison of lines/clones we used calls made by both MuTect2 (16) and Strelka2 (17). Previously we described an approach for selecting bona fide mosaic variants from such comparison (2). Conceptually, the approach selects calls that are consistently made when comparing lines from a set A to lines from a complementary set B (sets A and B account for all lines/clones in an individual). Here we extended the approach to select calls when they are found in most lines/clones (i.e., when set A is much larger than set B) and thereby resemble germline variants. Calls were required to have 35% allele frequency in lines/clones. Additionally, only indels shorter than 10 bp were retained to reduce the false positive rate. The filtering tool is freely available at https://github.com/abyzovlab/all2. To protect the individuals' privacy, the coordinates of discovered variants were de-identified in the manuscript. Full information is available from the database (see Data access).
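The idea of selecting calls that are made consistently between a set A and its complement B can be sketched as follows. This is only a conceptual illustration and is not the algorithm implemented in the all2 tool: the data layout, the function name `consistent_calls`, and the `min_fraction` threshold are hypothetical.

```python
from itertools import product


def consistent_calls(pairwise_calls, lines, min_fraction=1.0):
    """Select variants called consistently in A-versus-B comparisons.

    pairwise_calls: dict mapping (case_line, control_line) -> set of variant
        ids called in case_line when control_line serves as the control.
    lines: all lines/clones of the individual.
    Returns dict variant -> frozenset of carrier lines (set A).
    """
    variants = set()
    for calls in pairwise_calls.values():
        variants |= calls

    selected = {}
    for v in variants:
        # Set A: lines in which the variant was called against any control
        carriers = {case for (case, _ctrl), calls in pairwise_calls.items() if v in calls}
        # Set B: the complementary set of lines
        others = set(lines) - carriers
        if not others:
            continue
        pairs = list(product(carriers, others))
        hits = sum(1 for a, b in pairs if v in pairwise_calls.get((a, b), set()))
        if hits / len(pairs) >= min_fraction:
            selected[v] = frozenset(carriers)
    return selected


# Toy example with three lines: "v1" is carried by L1 and L2, "noise" is sporadic
toy_calls = {
    ("L1", "L2"): set(),
    ("L1", "L3"): {"v1", "noise"},
    ("L2", "L1"): set(),
    ("L2", "L3"): {"v1"},
    ("L3", "L1"): set(),
    ("L3", "L2"): set(),
}
result = consistent_calls(toy_calls, ["L1", "L2", "L3"])
print(result)
```

In the toy data, "v1" is called in every comparison of a carrier line against the non-carrier line and is retained, whereas "noise" appears in only one of the two comparisons expected for its putative carrier and is dropped.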
Calling mosaic variants in bulk tissue DNA
To call variants from bulk tissue we followed best practices developed by the Brain Somatic Mosaicism Network (unpublished). Briefly, calling consisted of the following steps: 1) call variants with GATK using a ploidy setting of 50; 2) eliminate calls in inaccessible genomic regions according to the 1000 Genomes mappability mask; 3) discard germline variants that have a population allele frequency of >0.001 in the gnomAD catalogue; 4) eliminate calls consistent with ~50% VAFs by a binomial test significance of 10⁻⁶; 5) eliminate calls in genomic regions exhibiting copy number gains; 6) mandate that candidate mosaic SNVs have at least 5 independent supporting reads with a minimum mapping quality of 20 and a minimum base quality of 20; 7) identify and eliminate false-positive calls using samples in the 1000 Genomes Project as a panel-of-normals filter; 8) filter calls using the MosaicForecast tool (18).
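Steps 4 and 6 of the list above can be illustrated with a short sketch. It assumes that step 4 means keeping only calls whose alternative-allele read count deviates from the 50% expected for a heterozygous germline variant with a two-sided exact binomial p-value below 10⁻⁶; the exact test used by the authors is not specified, and the function names are hypothetical.

```python
from math import comb


def binom_two_sided_p_half(alt_reads, depth):
    """Exact two-sided binomial p-value for H0: VAF = 0.5 (symmetric null)."""
    lower = sum(comb(depth, k) for k in range(0, alt_reads + 1))
    upper = sum(comb(depth, k) for k in range(alt_reads, depth + 1))
    # Doubling the smaller tail is valid because the null distribution is symmetric
    return min(1.0, 2 * min(lower, upper) / 2 ** depth)


def passes_vaf_and_support(alt_reads, depth, alpha=1e-6, min_alt_reads=5):
    """Keep a call only if it has enough supporting reads (step 6) and its
    VAF is inconsistent with ~50% at significance `alpha` (step 4)."""
    if alt_reads < min_alt_reads:
        return False
    return binom_two_sided_p_half(alt_reads, depth) < alpha


print(passes_vaf_and_support(100, 200))  # VAF 50%: looks germline
print(passes_vaf_and_support(10, 200))   # VAF 5%: candidate mosaic
print(passes_vaf_and_support(3, 200))    # too few supporting reads
```

The read counts here are already assumed to be restricted to reads passing the mapping and base quality thresholds.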
Lineage tree reconstruction in LB
We first sequenced six iPSC lines to 30X coverage and discovered mosaic variants in those six lines using the all-2-all exhaustive comparison described above. We next constructed the early lineage tree and selected 14 variants marking branches of the tree. To avoid sequencing redundant iPSC lines derived from the same fibroblast clone in the skin, we then genotyped all the iPSC lines for the presence of those fourteen mosaic variants using amplicon-seq (see below). Based on genotyping, the lines were assigned to existing and new branches in the tree. This allowed us to select from each biopsy lines likely contributing a new branch to the lineage tree for further analysis, i.e., nineteen additional lines, which were sequenced at 30X coverage. We then called mosaic variants using the all-2-all exhaustive comparison from all 25 of the sequenced iPSC lines (see above; Table S2). For each called mosaic variant we then estimated its VAF in bulk tissues (saliva, blood, urine) by using the mpileup function of samtools, requiring a minimum mapping quality of 20 and a minimum base quality of 20. Estimates of VAF for low frequency variants are less reliable, and to select the most confident sites we compared them with VAFs estimated by a different approach. Namely, we re-called variants in the bulk using GATK with ploidy 100 and extracted from the output the number of reads supporting reference and alternative alleles. Such a ploidy value allows discovering SNVs with VAF below 1% and, combined with the 200X coverage of bulk samples, allows for VAF estimation for variants with at least 2 supporting reads (i.e., for over 90% of variants with an actual VAF of 2%).
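The statement that over 90% of variants with a true VAF of 2% yield at least 2 supporting reads at 200X can be checked with a simple binomial calculation. The sketch assumes exactly 200 reads covering the site and independent sampling of reads, which is an idealization of real coverage.

```python
from math import comb


def prob_at_least(min_reads, depth, vaf):
    """P(X >= min_reads) for X ~ Binomial(depth, vaf)."""
    below = sum(
        comb(depth, k) * vaf ** k * (1 - vaf) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - below


power = prob_at_least(2, 200, 0.02)
print(f"{power:.3f}")
```

Under these assumptions the probability is about 0.91, in line with the figure quoted in the text.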
Out of the called mosaic variants we chose a subset of early (pre-gastrulation) variants for the lineage tree construction using the following criteria: (1) a mosaic variant (SNV or indel) is shared by at least two iPSC lines from different biopsies; or (2) a mosaic SNV has at least 2% VAF in at least one bulk tissue and at least one supporting read in another bulk tissue.
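The two selection criteria can be written as a small predicate. The argument names and data layout below are hypothetical, and the example values are invented; only the thresholds (two biopsies, 2% VAF, one supporting read) come from the text.

```python
def is_early_variant(kind, carrier_biopsies, bulk_vaf, bulk_alt_reads, min_vaf=0.02):
    """Return True if a mosaic variant qualifies as early (pre-gastrulation).

    kind: "SNV" or "indel".
    carrier_biopsies: biopsy of origin for each iPSC line carrying the variant.
    bulk_vaf: dict tissue -> VAF in that bulk tissue.
    bulk_alt_reads: dict tissue -> number of reads supporting the variant.
    """
    # Criterion 1: shared by at least two iPSC lines from different biopsies
    shared = len(set(carrier_biopsies)) >= 2

    # Criterion 2 (SNVs only): VAF >= 2% in one bulk tissue and at least one
    # supporting read in a different bulk tissue
    bulk_supported = kind == "SNV" and any(
        vaf >= min_vaf
        and any(
            reads >= 1
            for other, reads in bulk_alt_reads.items()
            if other != tissue
        )
        for tissue, vaf in bulk_vaf.items()
    )
    return shared or bulk_supported


# Hypothetical examples
print(is_early_variant("indel", ["LA1", "RA2"], {}, {}))
print(is_early_variant("SNV", ["LA1"], {"blood": 0.03}, {"blood": 6, "saliva": 1}))
print(is_early_variant("indel", ["LA1", "LA1"], {"blood": 0.05}, {"blood": 10, "saliva": 3}))
```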
In addition to mosaic variants discovered in iPSC lines, discovery in bulk DNA from blood, saliva and urine revealed 4 high VAF mosaic SNVs not sampled in any of the iPSC lines. We placed them in a separate branch to complement the branches reconstructed from analysis of iPSC lines. The criterion we used is that the sum of VAFs in this newly created branch with its sister branch should be roughly equal to the VAF of the parental branch (Fig. S5). The SNV at chr2:74108699 G>T discovered in both LA2#52 and RA1#11 conflicted with the reconstructed lineage tree and was excluded. One more SNV (chr3:112444103 C>T) discovered in LA2#4 had VAF in bulk tissues higher than variants in preceding branches and was not used in the tree construction (Table S2).
The frequency of a lineage was calculated as average cell frequency of corresponding variants.
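A minimal sketch of this calculation is given below. It assumes that each variant is heterozygous on an autosome in diploid cells, so that the fraction of cells carrying it is twice its VAF; this conversion is an assumption of the sketch and is not stated explicitly in the text. The VAF values are invented.

```python
def cell_frequency(vaf):
    """Fraction of cells carrying a heterozygous variant (assumed 2 x VAF, capped at 1)."""
    return min(1.0, 2.0 * vaf)


def lineage_frequency(vafs):
    """Average cell frequency over the variants that mark a lineage."""
    if not vafs:
        raise ValueError("a lineage needs at least one marker variant")
    return sum(cell_frequency(v) for v in vafs) / len(vafs)


# Hypothetical VAFs of three variants marking the same branch in one tissue
freq = lineage_frequency([0.05, 0.04, 0.06])
print(f"{freq:.2f}")
```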
Lineage tree reconstruction in NC0
For each of two biopsies in the left and right arm we sequenced four iPSC lines to 30X coverage. For each one of the other four biopsies, we sequenced one iPSC line to 30X coverage. The line from the left thigh was excluded from the analysis because it likely originated from two cells (Fig. S1). We then called mosaic variants from exhaustive comparison of the 11 iPSC lines with 30X coverage. The lineage tree was constructed as described for LB. One SNV (chr18:4708635 A>G) was excluded from tree construction as a likely germline variant. The exclusion was based on the following observations: 1) the variant had VAF above 45% in blood, saliva, and urine; 2) the variant is present in the human population with an allele frequency of over 10%; 3) the variant was present in all but three redundant (i.e., sharing almost all of their variants) iPSC lines (LA1#4, LA1#8, LA1#10), consistent with it being inherited but lost in fibroblasts from the LA1 biopsy; 4) the variant conflicts with the branches in the tree (Table S2). We sequenced three iPSC lines from the left thigh to 5X coverage as an alternative to the variant genotyping by amplicon-seq that we performed for LB. Genotyping variants in these lines allowed us to place them into the constructed tree. Mosaic variant ‘a’ with the highest VAF was validated by Sanger sequencing (Fig. S9). The frequency of a lineage was calculated as average cell frequency of corresponding variants.
Influence of unrepaired DNA damage on tree reconstruction
We think that unrepaired DNA damage that propagates through several cell cycles (4) is unlikely to interfere with lineage tree reconstruction. The crucial molecular event for retrospective lineage tracing is the occurrence of a mutation, which is then a permanent and informative mark of cell lineages (Fig. S10A). The history leading to the mutation, including if and when the original DNA damage resulting in the mutation is repaired, and how many cell cycles an unrepaired site with DNA damage propagated through, is irrelevant. Occurrence of multiple but different mutations occurring sequentially at the same nucleotide from the same DNA damage event is also not problematic and, theoretically, may yield additional information for lineage reconstruction (Fig. S10B). We found two sites where different mosaic SNVs could have recurred from the same lesion, however, even if true (see next paragraph), these SNVs originated in late development because they were in terminal branches of the tree and were absent from blood, saliva and urine (Fig. S10D,E). These SNVs do not create any problems in early lineage reconstruction.
Now we estimate the probability of two iPSC lines in terminal branches (we’ll call them neighboring lines) to have one occurrence of independently generated and different mutations at the same position. Among 25 iPSC lines analyzed for LB there are 5 pairs of iPSC lines in terminal branches: 1) one pair for RA2#11 and #40; 2) one pair for RT#13 and #11; 3) three pairs for RA1#6, #10, #11 (Fig. 1). Thus, the probability of neighboring lines having independently generated mutations at the same position by chance is . Similarly, the chance of neighboring lines in NC0 (for LA1#10, #4, #8 and for RA1#1, #7, #8, #9; Fig. 1) having independently generated and different mutations at the same position is . Therefore, it is unlikely that observed multiple mosaic SNVs occurring at the same site happen by chance.
The only challenge we see is when multiple occurrences of the same mutation arise at the same site of DNA damage in different cells/lineages (Fig. S10C). Such mutations may result in conflicts or errors when making branches of the lineage tree. Considering that DNA damage sites are typically repaired and only rarely result in a mutation, occurrence of the same mutation from the same lesion seems to be unlikely. In our analysis we found only one such conflict: the already mentioned SNV chr2:74108699 G>T discovered in LA2#52 and RA1#11 in LB. These two lines are distinguished by 8 other variants (x, l’, m’, n’, o’, p’ and r, j’); and even if the two SNVs in the two lines arise from the same DNA damage, it is unlikely that such a confounding factor hampers lineage determination because there are many other variants that allow correct placing of the branches in the lineage tree. Additionally, that SNV has 0% VAF in blood, saliva, and urine, and thus it is more likely to be a recurrent SNV occurring in later development in the lineages captured by LA2#52 and RA1#11.
Lineage tree reconstruction in 316
We previously called mosaic variants in human brain progenitor clones of the post-mortem fetal specimen 316 and constructed an initial tree of early lineages assuming equal contribution of sister lineages (2). Given the evidence of imbalanced lineage contribution in LB and NC0 we removed this assumption. Additionally, using the extended filtering approach described above, we re-called and filtered mosaic variants in previously obtained WGS data for clones and added new data for two iPSC lines derived from skull fibroblasts of the same fetal specimen. Compared to the previous call set, we found 3 additional SNVs shared by multiple clones/lines, all of which were at high frequency in brain regions and spleen (VAF of 10%-35%), allowing to resolve the common ancestry of two pairs of lineages (i.e., making two unambiguous branches) (Fig. S3). Mosaic variant ‘α’ with the highest VAF was validated by Sanger sequencing (Fig. S11). We also found 3 indels (indels were not analyzed previously) shared by multiple lines/clone, which all supported branches based on SNVs. Previously, we used a capture-seq validation approach to determine the frequency for discovered mosaic variants across brain regions. Using that data, we previously omitted from lineage reconstruction variants for which we could not determine precise VAF in tissue. Because of using 125 bp baits, capture-seq has limited power of enriching for sequence in repeats of comparable length. Contrary to that, WGS is powered to call variants in repeats of length up to the fragment size of DNA being sequenced, i.e., up to 300–500 bps. Following this consideration, we now included all SNVs shared by at least 2 clones/lines variants (7 additional SNVs) for tree reconstruction. These additional variants allowed resolving a conflict that existed in the previous tree by adding a lineage split (i.e., an additional branch) (Fig. S3). 
Following the same criteria as for LB and NC0, we also used mosaic SNVs with at least 2% VAF in at least one bulk tissue and at least one supporting read in some other bulk tissue for lineage reconstruction. The lineage tree was constructed as described for LB and NC0.
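The bulk-tissue support criterion can be illustrated with a short filter. This is a minimal sketch, not the code used in the study: the function name, the input layout (alternate and total read counts per tissue), and the tissue labels in the example are assumptions made for illustration.

```python
def passes_tissue_filter(counts, min_vaf=0.02):
    """Return True if a variant has VAF >= min_vaf in at least one bulk tissue
    and at least one supporting read in some *other* bulk tissue.

    counts: dict mapping tissue name -> (alt_reads, total_reads); hypothetical layout.
    """
    for tissue, (alt, total) in counts.items():
        if total > 0 and alt / total >= min_vaf:
            # Look for any supporting read in a different tissue.
            for other, (other_alt, _) in counts.items():
                if other != tissue and other_alt >= 1:
                    return True
    return False


# Hypothetical read counts for three candidate variants.
kept = passes_tissue_filter({"cortex": (5, 100), "spleen": (1, 200)})
no_second_tissue = passes_tissue_filter({"cortex": (5, 100), "spleen": (0, 200)})
low_vaf = passes_tissue_filter({"cortex": (1, 100), "spleen": (1, 200)})
```

In this toy example only the first variant is retained: the second lacks a supporting read in a second tissue, and the third never reaches 2% VAF.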
Comparing fractions of indels in lineages in LB
Indels are known to be more challenging to discover and genotype than SNVs. Because of that, we only used indels that were discovered in at least 2 iPSC lines from different biopsies to reconstruct the lineage tree. For consistency, we applied the same criterion to SNVs when comparing fractions of indels in the dominant and recessive lineages. We did not count variants from the first division since at that point the dominant/recessive lineages did not exist. The dominant lineage had 22 variants shared by iPSC lines from different biopsies, which included only 1 indel. The recessive lineage had 15 shared variants with 5 indels. The difference in the proportion of indels was significant by Fisher’s exact test (p-value = 0.03).
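The reported p-value can be reproduced from the counts given above. The sketch below is a standard-library implementation of the two-sided Fisher's exact test, not the authors' code. It assumes that every shared variant that is not an indel is an SNV, which gives a 2×2 table of 1 indel and 21 SNVs in the dominant lineage versus 5 indels and 10 SNVs in the recessive lineage.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Rows: dominant, recessive lineage; columns: indels, SNVs.
p = fisher_exact_two_sided(1, 21, 5, 10)
print(f"{p:.3f}")  # 0.031
```

The result, about 0.031, agrees with the p-value of 0.03 stated in the text.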
Analysis of lineage distribution in fetal brain regions
There are a total of 13 reconstructed lineages in postmortem specimen 316 (Fig. 3C). Of these, the frequencies of two lineages depend on the frequencies of other lineages: the frequency of the recessive lineage depends on that of the dominant lineage, and the frequency of lineage ‘b’ depends on the frequencies of lineages ‘a’, ‘α’, and ‘ε’. To reduce redundancy, those two lineages (recessive and ‘b’) were not included in the analysis.
Sequence context aware simulation
To assess the chance that the discovered mosaic SNVs match germline SNPs across the genome, we applied a sequence context-aware simulation (i.e., we considered the tri-nucleotide sequence around the variants). In each round of simulation, we randomly scattered all the identified mosaic SNVs across positions in the genome with the same tri-nucleotide sequence. We only considered SNVs in P-bases and scattered them across P-bases of the 1000 Genomes Project accessibility mask. We then counted the number of SNVs matching germline SNPs in the parents (for LB) or in gnomAD with a population frequency of at least 0.1% (for all 3 individuals). We repeated this process 1,000,000 times, building the null distribution of the number of mosaic SNVs matching germline SNPs. We then calculated the p-value of finding a number equal to or larger than the observed number of recurrent mosaic SNVs using these distributions.
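The simulation logic can be sketched on a toy example. This is an illustrative simplification and not the code used in the study: the genome is a short string, the accessibility mask is a list of booleans, a match is scored by position only (allele matching is omitted), and all names and data are hypothetical.

```python
import random
from collections import defaultdict


def index_by_context(genome, accessible):
    """Map each tri-nucleotide to the accessible positions centred on it."""
    index = defaultdict(list)
    for pos in range(1, len(genome) - 1):
        if accessible[pos]:
            index[genome[pos - 1:pos + 2]].append(pos)
    return index


def simulate_null(snv_positions, genome, accessible, germline, rounds, seed=0):
    """Scatter SNVs over accessible positions with the same tri-nucleotide
    context and count, per round, how many land on a germline SNP position."""
    rng = random.Random(seed)
    index = index_by_context(genome, accessible)
    contexts = [genome[p - 1:p + 2] for p in snv_positions]
    null = []
    for _ in range(rounds):
        placed = [rng.choice(index[ctx]) for ctx in contexts]
        null.append(sum(p in germline for p in placed))
    return null


def empirical_p(null, observed):
    """Fraction of rounds with at least as many matches as observed."""
    return sum(n >= observed for n in null) / len(null)


# Toy data (hypothetical): a 100 bp genome, fully accessible, two germline SNPs.
genome = "ACGTACGTACGTACGTACGT" * 5
accessible = [True] * len(genome)
germline = {10, 50}
snvs = [5, 22]  # mosaic SNV positions; each must be accessible and not at an end

null = simulate_null(snvs, genome, accessible, germline, rounds=1000)
p_value = empirical_p(null, observed=1)
```

With real data the number of rounds would be 1,000,000 as described above, and positions would be drawn per chromosome from the accessibility mask rather than from a single string.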
Amplicon-seq and Sanger-seq validations and processing
Amplicon-seq was used to genotype 14 SNVs in iPSC lines, blood, saliva, and urine of LB. Primers (Table S3) were designed by selecting a DNA template of 800 nucleotides containing the candidate SNV (SNV ± 400 bp), with an amplicon size between 200 and 450 bp. For PCR amplification, we used Phusion High-Fidelity DNA polymerase (Thermo Fisher Scientific) to minimize the polymerase error rate; the optimal annealing temperature was determined with the Tm calculator tool available on the Thermo Fisher Scientific website and validated by PCR. Primer specificity was initially determined in silico with the UCSC Genome Browser (http://genome.ucsc.edu/index.html), and then confirmed on a 2% agarose gel by the presence of a unique PCR product of the expected size. Amplicon DNA was purified using the QIAquick PCR Purification Kit (Qiagen). PCR products from the same sample were pooled, and samples were submitted for sequencing under the following conditions: MiSeq, paired-end, 250 bp.
SNVs “a” discovered in NC0 (Fig. S9) and “α” discovered in 316 (Fig. S11) were confirmed by Sanger sequencing. Primers for Sanger sequencing were designed as described for amplicon-seq and PCR products were visualized on 2% agarose gels. 40–50ng of amplified DNA per sample were submitted for sequencing.
Data access
Identified variants and full supplementary tables have been deposited to the NDA database and associated with the study #1057, DOI:10.15154/1520633. Read level data have been uploaded to the same database and have been linked to the same study. The code is available at https://github.com/abyzovlab/all2.
Supplementary Material
Table S1: Description of samples.
Table S2: Chart of somatic variants used for lineage tree reconstruction.
Table S3: Primers to genotype SNVs used for lineage tree construction.
Acknowledgments
We are most grateful to LB and his parents as well as NC0 for their willingness to participate in this study. We acknowledge the Yale Center for Clinical Investigation for clinical support in obtaining the biopsy specimens. We thank Dr. Caihong Qiu and Jason Thomson of the Yale Stem Cell Center Core Services for the generation of iPSC lines. We thank BGI Americas Corporation for library preparation and deep sequencing. We thank the Keck DNA Sequencing Facility at Yale for their assistance with DNA sequencing service. We thank Drs. Jessica Mariani and Soraya Scuderi for the generation and amplification of brain neurosphere clones. We acknowledge members of the Brain Somatic Mosaicism Network (BSMN) for helpful comments and discussions.
Funding:
This work was funded by the National Institute of Mental Health (grants R01 MH100914, U01 MH106876).
Footnotes
Competing interests: The authors declare no competing interests.
The Supplementary Materials for this article include:
Figs. S1 to S11
Data and materials availability:
All data from our work are included in the text and supplementary information. All primary data are accessible at NDAR (study #1057, DOI: 10.15154/1520633) and freely available for download.
References
1. Ju YS et al., Somatic mutations reveal asymmetric cellular dynamics in the early human embryo. Nature 543, 714–718 (2017).
2. Bae T et al., Different mutational rates and mechanisms in human cells at pregastrulation and neurogenesis. Science 359, 550–555 (2018).
3. Weinberger DR, Luchins DJ, Morihisa J, Wyatt RJ, Asymmetrical volumes of the right and left frontal and occipital regions of the human brain. Ann Neurol 11, 97–100 (1982).
4. Aitken SJ et al., Pervasive lesion segregation shapes cancer genome evolution. Nature 583, 265–270 (2020).
5. Ellegren H, Microsatellite mutations in the germline: implications for evolutionary inference. Trends Genet 16, 551–558 (2000).
6. Zhou T et al., Generation of human induced pluripotent stem cells from urine samples. Nat Protoc 7, 2080–2089 (2012).
7. Varga E, Hansen M, Wust T, von Lindern M, van den Akker E, Generation of human erythroblast-derived iPSC line using episomal reprogramming system. Stem Cell Res 25, 30–33 (2017).
8. Aasen T et al., Efficient and rapid generation of induced pluripotent stem cells from human keratinocytes. Nat Biotechnol 26, 1276–1284 (2008).
9. Lee-Six H et al., Population dynamics of normal human blood inferred from somatic mutations. Nature 561, 473–478 (2018).
10. Garcia-Diaz M, Kunkel TA, Mechanism of a genetic glissando: structural biology of indel mutations. Trends Biochem Sci 31, 206–214 (2006).
11. Ehrlich M, Zhang XY, Inamdar NM, Spontaneous deamination of cytosine and 5-methylcytosine residues in DNA and replacement of 5-methylcytosine residues with cytosine residues. Mutat Res 238, 277–286 (1990).
12. Vanneste E et al., Chromosome instability is common in human cleavage-stage embryos. Nat Med 15, 577–583 (2009).
13. Takaoka K, Hamada H, Cell fate decisions and axis determination in the early mouse embryo. Development 139, 3–14 (2012).
14. Piotrowska K, Zernicka-Goetz M, Role for sperm in spatial patterning of the early mouse embryo. Nature 409, 517–521 (2001).
15. Piotrowska K, Wianny F, Pedersen RA, Zernicka-Goetz M, Blastomeres arising from the first cleavage division have distinguishable fates in normal mouse development. Development 128, 3739–3748 (2001).
16. Cibulskis K et al., Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31, 213–219 (2013).
17. Saunders CT et al., Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28, 1811–1817 (2012).
18. Dou Y et al., Accurate detection of mosaic variants in sequencing data without matched controls. Nat Biotechnol 38, 314–319 (2020).
