Author manuscript; available in PMC: 2022 Apr 22.
Published in final edited form as: Nat Biotechnol. 2019 Dec 23;38(3):343–354. doi: 10.1038/s41587-019-0366-x

Figure 2. Analysis pipeline for locating somatic SVs in single cells.

Schematic shown for three single cells representing a heterogeneous population. (a) Single cell data are mapped to a reference genome (grey line) with Strand-seq reads (arrows) aligned in either the Watson (‘W’; orange) or Crick (‘C’; blue) direction. Left: reads are counted in 100 kb bins. Middle: Joint segmentation is performed on the binned data. Piecewise constant functions (black horizontal lines) are fitted to each segment and strand. Segmentation occurs across all cells based on change points in the fitted piecewise constant functions, to locate putative SV breakpoints between bin boundaries (vertical purple lines). Right: Heterozygous SNP positions are used to build consensus haplotypes using StrandPhaseR (ref. 22), resulting in SNPs assigned to chromosome-length haplotypes designated ‘H1’ (red) or ‘H2’ (blue). (b) Consensus haplotypes (now horizontal lines with SNP bubbles) are used to haplotype-tag (haplotag) individual Strand-seq reads in each cell. Any read overlapping a SNP is assigned to H1 (red lollipops) or H2 (blue lollipops) depending on the allele present in the read. Purple lines denote segment breakpoints. (c) Probabilistic model for SV calling. A multinomial distribution is used for the haplotagged read data (left panel). For each segment, the single cell data are considered as four different classes: C reads from H1 (C-H1), W reads from H1 (W-H1), C reads from H2 (C-H2), and W reads from H2 (W-H2). Random variables are represented by circles and parameters by boxes: N represents the true underlying copy-number (which we seek to infer) for each of these four categories, p the corresponding parameters of the multinomial distribution, and X represents the observed read counts in each category. A negative binomial (NB) distribution is used to model the total number of W and C reads (right panel). NB distributions for copy-numbers (CN) 0, 1, and 2 are depicted. Depending on the observed read counts (vertical dotted lines) for each segment, the likelihood of each CN is calculated. The full probabilistic graphical model is shown in Fig. S4. (d) Using this Bayesian model, the most probable SV type is assigned to each segment. In the schematic, two cells contain an inferred duplication on the H1 haplotype (Dup_H1; pink segment), and the other cell contains no SV (assignment to reference state; grey segment). (e) Example Strand-seq data analyzed with scTRIP for two RPE-1 cells and one C7 cell. RPE-1 cells exhibit a somatic duplication event (Dup_H1; chr3:60900000-62300000) absent in C7. Additional SVs called in Strand-seq data are shown in Fig. S8.
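
The read counting in panel (a) and the copy-number likelihood in panel (c) can be illustrated with a minimal Python sketch. This is not the scTRIP implementation: the function names, the input format (a list of `(position, strand)` tuples), and the parameters `reads_per_copy`, `p`, and `background` are all hypothetical choices made for illustration. In particular, modeling CN 0 with a small non-zero background mean is an assumption of this sketch, and the multinomial haplotype component of the model is omitted.

```python
import math
from collections import Counter

BIN_SIZE = 100_000  # 100 kb bins, as in panel (a)


def bin_reads(reads, bin_size=BIN_SIZE):
    """Count reads per (bin index, strand).

    `reads` is an iterable of (start_position, strand) tuples, where
    strand is 'W' (Watson) or 'C' (Crick). Hypothetical input format.
    """
    counts = Counter()
    for pos, strand in reads:
        counts[(pos // bin_size, strand)] += 1
    return counts


def nb_log_pmf(k, r, p):
    """Log-probability of count k under NB(r, p), with mean r * (1 - p) / p."""
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1 - p))


def cn_likelihoods(observed, reads_per_copy, p=0.5, max_cn=2, background=0.05):
    """Normalized NB likelihoods of copy numbers 0..max_cn for one read count.

    `reads_per_copy` is the expected read count for a single copy.
    `p` and `background` are illustrative values (assumptions): CN 0 is
    given a small background mean so that stray reads do not yield a
    zero likelihood.
    """
    log_liks = []
    for cn in range(max_cn + 1):
        mean = reads_per_copy * max(cn, background)
        r = mean * p / (1 - p)  # choose r so that the NB mean equals `mean`
        log_liks.append(nb_log_pmf(observed, r, p))
    top = max(log_liks)  # subtract the maximum for numerical stability
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]


def most_likely_cn(observed, reads_per_copy, **kwargs):
    """Return the copy number with the highest normalized likelihood."""
    liks = cn_likelihoods(observed, reads_per_copy, **kwargs)
    return max(range(len(liks)), key=liks.__getitem__)
```

For example, with an expected 20 reads per copy, `most_likely_cn(40, 20)` favors CN 2 and `most_likely_cn(20, 20)` favors CN 1, mirroring how the vertical dotted lines in panel (c) are compared against the NB curves for each CN.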