Skip to main content
Scientific Reports logoLink to Scientific Reports
. 2022 Mar 9;12:3836. doi: 10.1038/s41598-022-05837-0

Normalization of clonal diversity in gene therapy studies using shape constrained splines

L Del Core 1,2,, D Cesana 2, P Gallina 2, Y N Serina Secanechia 2, L Rudilosso 2, E Montini 2, E C Wit 3,, A Calabria 2,, M A Grzegorczyk 1,
PMCID: PMC8907296  PMID: 35264585

Abstract

Viral vectors are used to insert genetic material into semirandom genomic positions of hematopoietic stem cells which, after reinfusion into patients, regenerate the entire hematopoietic system. Hematopoietic cells originating from genetically modified stem cells will harbor insertions in specific genomic positions called integration sites, which represent unique genetic marks of clonal identity. Therefore, the analysis of vector integration sites present in the genomic DNA of circulating cells allows to determine the number of clones in the blood ecosystem. Shannon diversity index is adopted to evaluate the heterogeneity of the transduced population of gene corrected cells. However, this measure can be affected by several technical variables such as the DNA amount used and the sequencing depth of the library analyzed and therefore the comparison across samples may be affected by these confounding factors. We developed an advanced spline-regression approach that leverages on confounding effects to provide a normalized entropy index. Our proposed method was first validated and compared with two state of the art approaches in a specifically designed in vitro assay. Subsequently our approach allowed to observe the expected impact of vector genotoxicity on entropy level decay in an in vivo model of hematopoietic stem cell gene therapy based on tumor prone mice.

Subject terms: Statistical methods, Stem-cell research

Introduction

Gamma retroviral and Lentiviral Vectors (LVs) are widely adopted in Gene Therapy (GT) thanks to their ability to insert therapeutic transgenes in the host cell genome of hematopoietic stem/progenitor cell (HSPC). After transplantation into the patient the HSPCs reconstitute the entire hematopoietic system and correct the genetic defect. Therefore, vector integration ensures the maintenance of gene correction during self-renewal of HSPCs as well as its transmission to their cell progeny1. These vectors integrate semi-randomly within the human genome, and then each transduced cell harbours a vector integration in a distinct genomic position (integration site, IS) that can be adopted as a genetic mark to distinguish each engrafted clone. The retrieval of IS from transduced cells can be done by using PCR protocols that allow to specifically amplify the vector/genome junctions from their genomic DNA. Sequencing and mapping on the target cell reference genome allow to identify IS that can be univocally used for clonal identity. Therefore, the analysis of vector IS from DNA of blood cells harvested at specific time points after transplant from GT patients provide information on number of hundreds to thousands of clones present in circulation and their relative abundance. For this reason, IS studies are required for safety and long-term efficacy assessment in preclinical and clinical studies28.

The Shannon entropy index, a well-established measure of species diversity in ecology9, has become one of the most widely used measure of IS diversity in HSC-GT applications10. This measure has been positively correlated to high levels of genetic modification and engraftment of genetically modified cells while low levels of entropy were associated to poor levels of genetic modification, or oligoclonality due to poor engraftment or even the appearance of highly dominant clones resulting from malignant transformation11. Indeed, the complexity of a given DNA sample is computed considering both the total number of different IS obtained and their relative abundance. Thus, highly polyclonal samples characterized by large number of IS whose abundance is evenly distributed will have a higher Shannon diversity index than oligoclonal samples with a relatively smaller number of IS and/or characterised by the presence of highly dominant clones. However, the Shannon diversity index does not consider variations in sample size (amounts of DNA analyzed) or the efficiency in species retrieval (different PCR protocols for IS retrieval, sequencing platforms and sequencing depth) of complex ecosystems, such as the population of vector integrations sites in the genome12.

Thus, while Shannon diversity index provides an objective measure of the clonal complexity of any given IS sample, these confounding factors should be taken in account when the clonal complexity of different samples is compared. Since longitudinal studies of GT patients for IS monitoring could require the analysis of several samples collected under heterogeneous technical conditions, a method aimed at removing confounding effects in diversity index is needed. To remove confounding factors in the estimations of ecosystem diversity, several methods have been applied. Random subsampling without replacement, called “rarefying”13, is among the most popular methods for the normalization of species count data in ecology as well as for next generation sequencing (NGS) data in microbiology. Given a predefined sequence depth (total count, SD), a subsample from each library is generated by randomly picking reads without replacement, until the selected total number of counts is reached. Although rarefying has become the state-of-the-art tool for NGS data analysis14, some limitations have been recognized. Indeed15, demonstrated that rarefying is statistically inadmissible and should be avoided. Furthermore, in16 it was highlighted that estimates of species diversity in sites/habitats at local scale, namely the α-diversity17, for rarefied microbiome count data may be strongly biased. This is mainly due to the rare species which may be over- (or under-) represented in the samples that have been normalized to a smaller depth by rarefaction. An alternative normalization to rarefying is scaling, which adjusts the size of all samples by scaling their counts to the same total amount. Scaling preserves the relative frequencies of the species and keeps the species richness unchanged. Therefore, simple scaling does not remove the effect provided from the library depth neither on species richness nor on species diversity. Beule18 introduces a novel normalization method for species count data called scaling with ranked subsampling (SRS) and the authors demonstrate its suitability for the analysis of microbial communities.

The growing number of normalization and scaling approaches highlights that a robust method has not been developed yet. In this work we show that all proposed methods have limitations. In particular they miss of a precise quantification of the effect of each confounding variable on the Shannon entropy. Furthermore, we also show that the rescaled Shannon entropy index obtained by either rarefying or scaling with ranked subsampling still suffers from the effect of the confounders. We propose a spline-regression approach aimed to explain and remove those effects from the diversity indexes. The effect of the confounders is measured using a B-spline term whose shape is restricted according to a biological-sustained hypothesis. We test our framework by analysing a novel in-vitro dataset properly designed to simulate the same clonality state under different combinations of technical conditions. We also compare our method with the previously proposed methods from the literature in terms of efficiency according to hypothesis testing. That is, we consider a rescaling method to be more efficient if there is more evidence for the corresponding rescaled measure being independent from the effect of the candidate confounders. Finally, our rescaling approach allowed to unmask the expected impact of vector genotoxicity on entropy level decay in an in vivo model of hematopoietic stem cell gene therapy based on tumor prone mice19,20.

Next generation sequencing of clonal tracking data from gene therapy studies

There are several high-throughput systems capable to quantitatively track cell types repopulation from an individual stem cell after a gene therapy treatment2123. Tracking cells by random labeling is one of the most sensitive systems24. In HSC-GT applications, haematopoietic stem cells (HSCs) are sorted from the bone marrow of the treated subject and uniquely labeled by the random insertion of a viral vector inside its genome. Each label, called clone, or integration site (IS), is defined as the genomic coordinates where the viral vector integrates. After transplantation, all the progeny deriving through cell differentiation inherits the original labels. During follow-up, the labels are collected from tissues and blood samples using Next Generation Sequencing (NGS)2528. NGS is a recent approach for DNA and RNA sequencing, which consists of a complex interplay of chemistry, hardware, optical sensors and software2932. In gene therapy applications NGS does allow identifying, quantifying and tracking clones arising from the same HSC ancestor. Over the past decades, clonal tracking has proven to be a cutting-edge analysis capable to unveil population dynamics and hierarchical relationships in vivo3336. Clonal diversity, measuring how many distinct clones are collected and how they distribute, can address some of these aspects. Loosely speaking, the less distinct clones the lower the clonal diversity and in turn the less the system is being repopulated in that particular cell compartment. Furthermore, under the same number of different clones collected, the more their distribution is far from the uniform, the lower the clonal diversity and the more the dominance of few clones, thus suggesting the possible occurrence of an adverse event. The Shannon entropy index37, a well-established measure of population diversity in ecology studies9, nowadays is hugely used as a proxy of clonal diversity in gene therapy applications10. Following37, the Shannon entropy index is defined as

h(X)=-i=1nP(xi)logP(xi) 1

where X has possible realisations x1,,xn which occur with probabilities P(x1),,P(xn). Therefore, Shannon entropy is a special case (up to a change in sign and a multiplicative factor 1/n) of the Kullback-Leibler divergence38

DKL(PQ)=i=1nP(xi)logP(xi)Q(xi) 2

when the reference Q is the uniform distribution on x1,,xn.

Potential limitations of entropy-based measures in gene therapy applications are related to the heterogeneous nature of NGS data3943. Indeed due to sampling and technical conditions, such as the amount of the host DNA being sequenced and the PCR being adopted, the number of reads obtained per library can span orders of magnitudes12 which may affect the cellular counts and in turn their Shannon entropy. These differences in magnitude of library size/depth mainly depends on unequal pooling of PCR products before sequencing. In order to pool PCR products from individual samples in equimolar amounts44, DNA concentrations are commonly determined by ultraviolet-visible (UV) or fluorescence spectroscopy, real-time PCR or digital PCR45. Although these methods are very effective46, an identical library size across samples is difficult to achieve. Nonetheless, if we define the multiplicity of infection (MOI) as the average ratio between the number of virus particles and the number of target cells present in a defined space, then the actual number of viruses that will integrate on any given cell can be described by a stochastic process, such as some cells may absorb more than one infectious agent while others may not absorb any of them. Typically, the probability P(n|m) that a cell will absorb n virus particles when inoculated with an MOI of m can be modelled as a Poisson variable with rate m,

P(n|m)=mne-mn!

Therefore, by definition, it is possible to increase the expected number of vector copies per cell (VCN) by properly tuning the MOI in the design of the experiment/treatment. As a result, the VCN may affect the number of IS collected and in turn the Shannon entropy.

In Fig. 1 we show the behaviour of the Shannon entropy index as a function of the DNA amount, the VCN and the sequencing depth (SD) in the case of an in-vitro assay described in “In-vitro assay” Section.

Figure 1.

Figure 1

From left to right: Scatter plot of the Shannon entropy index against the DNA amount, the vector copy number (VCN) and the sequencing depth (SD) for all the samples included in the in-vitro assay. Only a single amount of DNA has been taken for every sample. The total amount of integrations found in a sample, namely the total number of sample’s sequencing reads, has been used as proxy for the sample’s sequencing depth (SD).

Figure 1 suggests that the Shannon entropy index strongly depends on the quantitative confounders, until it reaches a steady-state. These features motivate us to use shape constrained splines (SCS) in order to model the effect of the candidate confounders on the entropy measurements.

Methods: shape constrained splines

Definition of the model

Shape-constrained splines (SCS) for fitting, smoothing and interpolation have been explored and proposed in various works, such as4752. In this work we follow the cone-projection approach51,52. We model the logarithmic observed entropies hi’s, for i=1,,n, as a function of a SCS-bases Cik for every potential confounder k=1,,K plus a term Fij for any other additional feature of interest j=1,,J, so that

log(hi)=β0+k=1KCikβck+j=1JFijβfj+εii=1,,n 3

where β0 is the intercept, Cik is the basis of a quadratic spline for the k-th confounder used for observation i for which we assume a saturation state at the right boundary knot and a monotone increasing concave shape. For our applications, the boundary knots of a spline basis associated to a variable x are defined as the minimal and maximal value of x. The term Fij corresponds to the i-th observation of a basis describing the j-th additional component, such as the time or the cell type. The corresponding parameter vectors are βck and βfj respectively. Finally we assume for the noise variable:

εiiidN(0,σ2)i=1,,n

Therefore, the model can be compactly written as

logh1hnlog(h)=111β0+C11C1KCn1CnKCβc1βcKβc+F11F1JFn1FnJFβf1βfJβf+ε1εnε 4

where the number of features K and J depend on the project and/or the specific research questions that must be addressed. Quadratic splines are characterized by the discontinuity of the second-order derivative, which makes their treatments harder than cubic splines. This applies already to unconstrained spline fitting and interpolation. In particular, the definition of quadratic penalised/smoothing splines is not straightforward. Therefore, in general, cubic splines should be preferred over quadratic splines. Despite this, in our in-vitro assay (VA) application only three distinct values of both the DNA amount and the vector copy number (VCN) are available. Therefore, in order to get a good trade-off between bias and variance as well as in order to obtain a full-rank design matrix, we chose quadratic splines with one interior knot for both the DNA amount and the vector copy number (VCN), and a quadratic spline with two interior knots for the sequencing depth. In Sect. S.4.1 of the supplementary material we compare the fits of quadratic and cubic splines (cf. Fig. S.1). For consistency, we also use quadratic splines in the mice study on genotoxicity, where our goal is to evaluate the  impact of the vector design on the entropy decay; cf. Section “Viral vector safety in a mouse genotoxicity study”.

Shape-constrained splines (SCS) normalization of Shannon entropy

For simplicity, we set K=1 and J=0 in (4), namely we consider only one confounder and no additional factors of interest. The general case can be obtained straightforwardly. Here we follow53 and we represent an (r+1)-th order B-spline as

m(x)=j=1qβjBjr(x) 5

where, for j=1,,q, the bases are iteratively computed as

Bjr(x)=x-kjkj+r+1-kjBjr-1(x)+kj+r+2-xkj+r+2-kj+1Bj+1r-1(x), 6
Bj-1(x)=1,kjxkj+10,otherwise 7

for a given sequence of evenly spaced knots ξ1ξ2ξq+r+2, where q is the number of basis functions and βj’s are the corresponding coefficients. The first order derivative of (5) can be written as

m(x)=1δj=2qBjr-1(x)(βj-βj-1) 8

where δ is the distance between two adjacent knots. Since all B-spline basis functions are nonnegative by definition, a sufficient condition for m(x)0, and in turn for the monotone-increasing shape of m(x), is

βj-βj-10j=2,,q 9

Furthermore, the second order derivative of Eq. (5) can be written as

m(x)=1δ2j=3qBjr-2(x)(βj-2βj-1+βj-2) 10

Then a sufficient condition for m(x)0 and in turn for the concavity of the spline in (5) is

βj-2βj-1+βj-20j=3,,q 11

The monotonicity and concavity constraints can be written respectively as

Vβ0Wβ0β=β1βq 12

where

V=-11-11-11R(q-1)×q;W=-12-1-12-1-12-1R(q-2)×q; 13

If both monotonicity and concavity constraints must be satisfied, the first q-2 constraints/rows of V are redundant, as stated by the following Lemma.

Lemma 1

If βj-2βj-1+βj-20j=3,,q and βq-βq-10, then βj-βj-10j=2,,q-1.

Proof

graphic file with name 41598_2022_5837_Figa_HTML.jpg: βq-2βq-1+βq-20 and -βq+βq-10 hold, which together imply βq-1-βq-20. graphic file with name 41598_2022_5837_Figb_HTML.jpg: βk+1-2βk+βk-10 and -βk+1+βk0 hold, which together imply βk-βk-10.

Therefore by Lemma 1, if both constraints V and W are applied, the whole matrix of constraints reduces to

Z=-11-12-1-12-1-12-1 14

Furthermore, we need to consider that sampling might be characterised by a sequencing saturation level due to technical limitations. In this case, the saturation level can be included by considering a steady-state/stationary-point at the right boundary knot ξq+r+2, namely by setting

1B(ξq+r+2)β=0 15

where 1B is the first derivative of the spline basis

B=B1rBqr

Our R code implementation allows to switch between the presence and absence of the saturation level by the additional logical input parameter SATURATION. By default this parameter is set to TRUE, but it can be switched to FALSE if the user prefers not to implement a saturation level (or a steady state) w.r.t. a particular predictor variable. In our case of quadratic degree (r=1), the constraint (15) reduces to

βq=-B[ξq+r+2,q-1]B[ξq+r+2,q]βq-1 16

which can be written compactly using the following affine transformation

A:RnX×qRnX×(q-1),XXA 17
A=100000100-B[xn,q-1]B[xn,q]Rq×(q-1); 18

where nX is the number of rows of X.

Estimation procedure

Given n observations (xi,yi) of one predictor x and a response y, the restricted least squares estimate β^RLS of β subject to the constraints (14) and (16) can be obtained as

β^RLS=argminβS(y-Bβ)(y-Bβ) 19

where

S=βRq;β0;Zβ0;βq=-B[ξq+r+2,q-1]B[ξq+r+2,q]βq-1

Therefore, using (17)–(18) we can directly include the linear equality constraint

βq=-B[ξq+r+2,q-1]B[ξq+r+2,q]βq-1

inside the objective function, and the optimisation problem (19) reduces to

β^RLS=argminβS-2βBAy+β(BA)BAβf 20

subject to only linear inequality constraints, where

β=(β1,,βq-1)andS=β0;ZAβ0

Since f is a quadratic function, we solve (20) using quadratic optimization. To this end we use the function solve.QP() from the R package quadprog. Once the restricted least squares estimate β^RLS is obtained, we follow the cone projection approach52 and we define a point-wise confidence interval (CI) with 1-α/2 coverage for bpAβ^RLS as

bpAβ^RLSzα/2σ^RLS2(bpA)G^bpA 21

where bp=B(xp) is the B-spline basis B(x) evaluated at the prediction point xp and the variance is estimated as

σ^RLS2=(y-BAβ^RLS)(y-BAβ^RLS)n-1.5d

where d is the dimension of the cone’s face where the projection of y onto

F={ηRn|η=BAβsuch thatZAβ0}

lands on. The dimension d of F can be computed using the Hinge algorithm implemented in the R package coneproj54. The matrix G^ is computed as the following weighted average

G^=I{1,,q-1}G^(I)p^I 22

with the (q-1)×(q-1) matrix G^(I) defined as

G^kI,lI(I)=((XIXI)-1)kI,lIG^kI,lI(I)=G^kI,lI(I)=G^kI,lI(I)=0 23

where XI are the columns of BA indexed by I. Each weight p^I represents the estimated probability that the projection of y lands on the cone’s face corresponding to I. The probabilities p^I are obtained by simulating many normal random vectors with mean vector y^=BAβ^RLS and covariance matrix σ^RLS2In, and recording the resulting sets I’s, along with their frequencies. In case additional unconstrained components are present, the definition of (22) can be extended52. Furthermore, if we need to select from a set of candidate models featuring different covariates, we use information criteria55. For our analyses we use the corrected Akaike Information Criterion (AICc)

AIC(M)=-2log(L(β^RLS|y))+2p+2p·p+1n-p

for model selection, where L(β|y) is the likelihood of model M and p the number of parameters of M, which is equal to d in our set-up. In case some models have similar AICc values, we follow Burnham et al.55 and we average across all ones using the frequentist model average estimator

β^fma=l=1Lλlβ^l 24

where β^l is the parameter vector estimated under the l-th candidate model, and λl the corresponding weight which can be computed as

λl=exp(-BICl/2)j=1Lexp(-BICj/2) 25

where BICj is the Bayesian Information Criterion (BIC) associated with the j-th model. In case of model averaging, the BIC is preferred over AIC/AICc, since it provides a better estimation of the marginal likelihood55. From a Bayesian perspective, {λj}j can be interpreted as an estimator of the posterior probabilities

p(Mj|y,X)=p(y|Mj)p(Mj)l=1Jp(y|Ml)p(Ml)j=1,,J 26

of the candidate models under a uniform prior {p(Mj)}j, where

p(y|Mj)=Θjp(y|Mj;Θj)p(Θj|Mj)dΘj 27

is the marginal likelihood55.

Ethics oversight

All procedures were performed according to protocols approved by the Animal Care and Use Committee of the San Raffaele Institute (IACUC 619) and communicated to the Ministry of Health and local authorities according to Italian law. All experiments were performed in accordance with relevant guidelines and regulations. The reporting in the manuscript follows the recommendations in the ARRIVE guidelines.

Applications of SCS in NGS data

In-vitro assay

To evaluate the reliability and sensitivity of the SCS rescaling method, we generated an IS dataset originating from an EBV-transformed B cell line transduced with a LV at Multiplicity of Infection (MOI) of 0.1, 1 and 10 to obtain DNA samples with increasing levels of polyclonality. Therefore, by increasing the MOI at each transduction we expect an increase in the vector copy number (VCN). As expected, the different vector doses resulted in different VCNs (see Supplementary Table S.3). Different amounts of DNA (5, 20 and 100 ng) were used for IS retrieval. LV ISs were retrieved by Sonication Linker-mediated (SLiM) - PCR (see Sect. S.1 of the supplementary material). Briefly, DNA material was sheared by sonication, subjected to end-repair and adenylation and then split in 3 technical replicates. Each replicate was ligated to a different barcoded linker cassette and subjected to two rounds of PCR allowing the amplification of the cellular genomic portion close to the vector IS. The different barcoded PCR products from each sample were assembled in libraries and sequenced by using Illumina platform. After sequencing, reads were processed by a dedicated bioinformatic pipeline56 to identify for each PCR/sample the different vector integration sites. For each IS the clonal abundance was determined by the R package SonicLength57 using the corresponding fragment length distribution. A varying number of ISs, ranging from 22 to 40575, was obtained from each sample (see Supplementary Table S.1) and, as expected, the number of IS retrieved from each sample increased proportionally to the vector dose (see Supplementary Table S.2). The total number of sample’s sequencing reads was used as proxy for the sample’s sequencing depth (SD). The magnitude of VCN, DNA amount and SD affects the clonality so that the samples are incomparable. Indeed Fig. 1 clearly shows a positive trend between the Shannon entropy index and the potential confounders. With the VA we are able to really understand the impact of the variables (confounding factors) to the entropy index, thus allowing a robust integrated analysis. We used the VA as “ground-truth” to compare our SCS-rescaling method with the competitor approaches (RAR and SRS). In our SCS method we took in consideration the DNA amount, VCN and SD as potential confounders.

In this case the number of candidate confounders is K=3 with no additional factors of interest (J=0) and, according to the general formulation of Eq. (4), the model was defined as

log(h)=1β0+CdnaCvcnCsdCβdnaβvcnβsdβc+ε 28

where we used two equidistant interior knots and the range of values as boundary knots for every SCS term in C. We report the corresponding fitted surface in Fig. 2 . In Fig. 2 we also show the rescaled values, i.e. the residuals that remain after having adjusted for the confounders. That is, according to the model definition in Eq. (4), we used the residuals

hres=explog(h)-Cβ^c 29

as the rescaled values, where β^c is the vector of the fitted parameters.

Figure 2.

Figure 2

Observed (left) and SCS-rescaled (right) entropies (dot symbols) as a function of the confounders, together with the corresponding shape constrained (bivariate) splines (green surface). Top panels show the slices for the DNA and the VCN. Bottom panels show the slices for the SD and the VCN.

We compared our method with the two previously proposed in literature, such as the rarefaction (RAR)13 and the ranked subsampling (SRS)18 approaches. We assessed the efficiency of the rescaling methods by correlation p-values for the two-sided test problem:

H0:ρ(X1,X2)=0vsH1:ρ(X1,X2)0 30

where ρ(X1,X2) is the Spearman’s rank correlation between two random vectors X1 and X2. We preferred Spearman’s rank correlation over Pearson correlation since we assumed that the relationships are monotonic and possibly non-linear. Low p-values give statistical evidence for dependencies and thus for unsolved confounding effects. For the comparison, the total amount of reads of the sample with the lowest SD has been chosen as rarefaction level with 1000 replications for both the standard rarefaction (RAR) and its ranked version (SRS). We report the results in Fig. 3. These pictures show that our SCS method outperformed both RAR and SRS methods in terms of correlation test p-values between the rescaled entropy and every potential confounder. Indeed, for all three confounders our new approach yields high p-values (0.37, 0.15 and 0.31 for DNA, VCN and SD respectively), so that we have no indication to reject the null hypothesis that the rescaled entropies and the confounder values are not correlated. For each of the two competing approaches we got 2-3 very low p-values (0.01), so that statistically significant amounts of correlations are left.

Figure 3.

Figure 3

Each panel shows the absolute value of the Spearman’s rank ρ correlation coefficient ρ(h,confounder) (y-axis) between a confounder and the observed or rescaled Shannon entropies (different bars). For every correlation coefficient, we performed the two-sided Spearman’s rank correlation test of Eq. (30) for checking the hypotheses H0:ρ(h,confounder)=0vsH1:ρ(h,confounder)0. The number of leading zeros after the decimal point of the p-values are reported on top of each bar as white stars.

Subsequently, we also checked whether our SCS-rescaling method unveils comparable clonal levels among the samples. A proper rescaling method should return similar clonal diversities independently from the confounders. Indeed, Figs. 2 and 4 show that our SCS-rescaling method drastically reduced the variability of the observed entropies due to the effect of the confounders and, in turn, that the clonal level of the VA samples, measured by the SCS-rescaled Shannon entropy index, is approximately the same.

Figure 4.

Figure 4

Box-plot (minimum, maximum, median, first quartile and third quartile) of the observed (OBS) and rescaled Shannon entropies using the SCS, RAR and SRS approaches.

It can be seen from Fig. 4 that the competitor methods RAR and SRS are also able to reduce the variability of the observed entropies. However, unlike our new SCS-rescaling method, the competing methods did not remove the effect of the confounders, as confirmed by the p-values of Spearman’s rank correlation tests provided in Fig. 3. While the SCS-rescaling method made all (rank) correlations insignificant, the competing methods RAR and SRS left significant rank correlations (=dependencies) between confounders and the entropy. For more explanations and illustrations we refer to Sect. S.4.2 and Fig. S.2 of the supplementary material.

Viral vector safety in a mouse genotoxicity study

We analysed the IS data collected from an established hematopoietic stem cell gene therapy model previously used to demonstrate how the genotoxic impact of integrating vectors is strongly modulated by their designs19,20. In this experimental setup Cdkn2a-/- tumor prone Lin- cells were ex-vivo transduced with two different LVs expressing GFP: the highly genotoxic LV vector, LV.SF.LTR (hereinafter referred to as LTR) and the non-genotoxic SIN.LV.PGK.GFP.PRE (hereinafter referred to as PGK). Transductions protocol and culture conditions were reported in19,20 and further details can be found in Sect. S.3 of the supplementary material. Twenty-four hours after transduction, vector- and mock- transduced cells (5-7.5×105 cells/mouse) were transplanted into lethally irradiated wild-type mice by tail vein injection (Mock-control, N=19; LV.SF.LTR, N=24 and SINLV.PGK, N=23). Six days after transduction the percentages of GFP+ cells were assessed by Fluorescence Activated Cell Sorting (FACS) analysis and ranged from 90 to 95% for the all vector and conditions. Engraftment level of transduced cells was assessed by measuring the percentage of GFP-expressing cells in the peripheral blood at 8 weeks post transplantation and were 80.8±2.9 % and 46.2±4.8 % in the group of mice transplanted with PGK and LTR vector respectively. As expected, mice transplanted with Cdkn2a-/- Lin- cells transduced with the the LTR vector developed tumors and died significantly earlier compared to mock-treated mice (p<0.0001, Log-rank Mantel-Cox test, median survival time: 282 and 149.5 days for mock-control and LTR- transduced group respectively). Mice transplanted with PGK-transduced cells did not show any acceleration of tumor onset compared to the mock-control group (median survival time: 289 days). All data are in agreement with the one previously published19,20. For the retrieval of vector insertion sites (ISs), peripheral blood was collected on a monthly basis from transplanted animals receiving transduced cells. Lymphoid B and T cells as well as myeloid cells were isolated by fluorescence activated cell sorting. To recover enough DNA material, equal amounts of blood from two or three mice belonging to the same experimental group were pooled before the sorting procedure. The composition of pools was maintained constant during the whole experiment, so that each pool is composed by the same mice over time. ISs were then retrieved by SLiM-PCR58 at different time points from sorted T (CD3+) and B (CD19+) lymphocytes, from myeloid cells (CD11b+) and unsorted blood cells (total MNC). From the DNA purified from all the different sorted samples, we also measured the VCN by ddPCR. Overall, a higher amount of ISs were retrieved from the group of mice transplanted with Lin- cells transduced with PGK, reflecting the higher level of VCN observed in PGK versus LTR group of transplanted animals. Few statistics on the number of IS collected in each treatment/condition, along with the corresponding VCN, are reported in Table 1. The Shannon entropy index was then computed from each IS sample and the application of a simple spline without shape constraints and without considering any technical confounder yielded the results shown in Fig. 5.

Table 1.

Mice study: Quartiles and range of the DNA amount, VCN, PS, SD and nIS for the n=242 samples and separately for PGK (left) and LTR (right) treatment conditions.

DNA VCN PS SD nIS DNA DNA PS PS nIS
PGK LTR
Min. 8.64 1.31 1.000 60 35.0 Min. 8.64 0.240 1.000 189 35.0
1st Qu. 106.56 10.90 2.000 1969 433.0 1st Qu. 94.50 5.320 1.000 1130 217.0
Median 200.00 13.59 2.000 5881 720.0 Median 200.00 6.300 2.000 2973 383.0
Mean 181.07 12.80 1.964 9351 989.3 Mean 222.88 6.219 2.104 4695 731.9
3rd Qu. 200.25 13.90 2.000 14055 1220.0 3rd Qu. 222.50 7.800 3.000 7390 873.0
Max. 973.00 27.00 3.000 49853 4324.0 Max. 973.00 10.500 7.000 15375 3213.0

Figure 5.

Figure 5

Observed Shannon entropies (y-axis) over time (x-axis) in each treatment (different colors), along with a simple spline without any shape constraints and confounder adjustments for every combination of cell marker and viral vector. Quadratic splines are fitted using the standard lm() and bs() R functions.

From Fig. 5 we cannot see a clear separation between the entropy profiles of the two vectors PGK and LTR. The prediction intervals overlap so that the differences in the profiles do not appear to be statistically significant. Henceforth, we cannot draw the conclusion that PGK is safer than LTR. However, the high variability of DNA amount (in nanograms), VCN, and the SD used for IS retrieval has a clear impact on the entropy measurements, as suggested by Fig. 6.

Figure 6.

Figure 6

From top left to bottom right: Scatter plot of the raw (unscaled) Shannon entropy index computed from the entire dataset (n=242 IS samples) versus the DNA amount, the vector copy number(VCN), the pool size (PS) and the sequencing depth (SD).

Furthermore, since some mice died faster, the size of each pool (PS) decreased over time, leading to variation in the cell counts and in turn in the Shannon diversity index calculations (see Fig. 6). Few statistics of these quantites are reported in Table 1 separately for each vector treatment.

This suggests that initial results of Fig. 6 might be biased by the presence of these confounding factors. The heterogeneity of these factors may affect the estimate of the cell counts and in turn the corresponding Shannon entropies. We therefore applied our shape-constrained spline approach of “Methods: shape constrained splines”, including the DNA amount, VCN, SD and PS as potential confounders.

We proceeded as follows: We used the general formulation of Eq. (4), including a shape constrained spline (SCS) term with two interior knots for every confounder, plus a spline term w.r.t. the time decay for every combination of cell lineage/marker (L) and viral vector (V) as additional factors of interest. In this way we described the entropy decays separately for each combination of cell marker and treatment (viral vector) while removing the bias provided by the potential confounders. We also set the vector specific intercept V to zero to make sure all the individuals have the same clonal diversity before the treatments. Therefore, following the general formulation of Eq. (4), the model it has been explicitly defined as

log(h)=1β0+CdnaCvcnCpsCsdCβdnaβvcnβpsβsdβc+Stl1Stl2Stl3Stl4Fβl1βl2βl3βl4βf+ε 31

where C binds all the confounder’s SCS bases and βc is the vector with all the corresponding parameters stacked together. Alike, F is a block-diagonal matrix where each block Stl is defined as

Stl=1Stl,v11Stl,v2

and each sub-block Stl,v, corresponding to the l-th cell lineage and the viral vector v, is the basis of a monotone decreasing quadratic spline w.r.t. the time t for which we assume a steady-state to the left of the second right boundary knot. Indeed, each mouse pool started with 2-7 mice, which then successively died, until no mouse was left, so that no measurements could be taken anymore and therefore we do not expect any further change in the entropy thereafter. For this purpose we use again the affine transformation defined in Eqs. (17)–(18). Finally, we refer to βf as the vector with all the corresponding parameters stacked together. Therefore in this case the number of confounders is K=4 and the number of additional factors of interest is J=8, namely a spline basis for the time-decay for every combination of the two treatments and the four cell lineages.

In order to identify the most important confounders among the candidates, we have fitted our model for each of the 24-1=15 possible confounder subsets. Each candidate model included always F as fixed term and featured at least one SCS term in C for the confounders. Then we averaged across the most likely models according to the frequentist criterion defined in Eqs. (24)–(25) and we reported in Fig. 7 the posterior distribution, along with the marginal inclusion probabilities of the four individual confounders.

Figure 7.

Figure 7

Approximated posterior distribution distribution (left) of the 15 candidate models according to the frequentist model averaging method of Eqs. (24)–(25), and the marginal posterior probabilities of the 4 potential confounders (right).

Results from model averaging suggest that the posterior distribution is mainly dominated by three models namely: PS+SD (4th model), VCN+SD (6th model), and SD (8th model). The remaining 12 models get substantially lower posterior probabilities and thus have only negligible effects on the model averaging estimator. Therefore, after computing the frequentist model averaging estimator

β^fma=β^0β^c,fmaβ^f 32

of Eq. (24), we used the residuals

hres=e(log(h)-Cβ^c,fma) 33

corresponding to the confounder terms as the rescaled values. Rescaled entropies are shown in Fig. 8 together with the lineage×vector-specific spline decays with a confidence interval of 0.95 coverage.

Figure 8.

Figure 8

Rescaled Shannon entropies (y-axis) over time (x-axis) for every combination of cell marker (panels) and viral vector treatment (colors) together with the corresponding SCS-fitted decay splines.

Thanks to the SCS-rescaling approach, a significant difference in the entropy decay for MNC, T-cells (CD3+) and Myeloid cells (CD11b) was observed depending on the genotoxicity level of the vector adopted. Whereas in the B-cell compartment no major differences in the entropy decay under the two vector treatments were observed. Indeed, consistently with the previous results19,20, the B-cell compartment is less affected by the genotoxicity of the LTR vector. In Sect. S.4.2 of the supplementary material we show that the RAR and the SRS approaches are less efficient than the proposed SCS approach, cf. Figs. S.5 and S.6 of the supplementary material.

Discussion

We have shown that the Shannon entropy index, a widely used measure of genetic variability, is strongly affected by the variability of technical factors. We have introduced a shape-constrained splines (SCS) approach aimed to quantitatively measure and remove the effect of confounders from the target of interest. In particular we have shown that our approach can remove confounding effects from the Shannon entropy index. We also have shown that our SCS approach outperforms all the state of the art rarefaction approaches like the RAR13 and its ranked-subsampling version18. That is, using a correlation test, we have found statistical evidence that the SCS-rescaled diversity measure does not significantly depend on the effect of the confounders anymore. Furthermore, our method is useful for genetic applications, as it provides an unbiased and more affordable measure of clonal diversity, and in turn it avoids to draw misleading results. As an example, the entropy decay of treatment-specific longitudinal studies may be erroneously interpreted if we do not take into account that the changes in the entropy increments may depend more on the confounders than on the biological treatments. Our method allows to discriminate between the two effects and to remove the one that comes from the confounders.

Since our approach is spline-regression based, its main limitation could be related to the available sample size. This is a potential problem when the number n of libraries is too low to define a spline basis. For the same reason the degree of the splines and the number of its knots should be chosen carefully. In particular, for the case of only one library/sample it would be possible to rescale its diversity only using the parameters inferred from an external controlled environment, like the VA explored in Sect. 4.1, with a sufficient library size n. This is the main price we pay if we switch from a rarefaction-based rescaling approach to a spline-regression based one. Our model averaging approach allows also to rank the impact of the confounders according to their approximated posterior probabilities. We perform model averaging by means of the Bayesian Information Criterion which allows us to get an estimate of the marginal likelihood of each candidate model, and in turn, of the corresponding marginal confounder inclusion posterior probabilities. One possible methodological extension of this framework could be the implementation of a more precise method to estimate the marginal likelihood. This could, for example, either be done by Laplace Integration or by Bayesian thermodynamic integration.

Supplementary Information

Acknowledgements

This publication is based on work from COST Action CA15109 (COSTNET), supported by COST (European Cooperation in Science and Technology). E.C.W. acknowledges support from the Fondazione Leonardo (514.7.010.098-4) and funding from the Swiss National Science Foundation (SNSF 188534). Grant Giovani Ricercatori GR-2016-02363681 - Ministry of Health. Telethon Foundation grants TGT11D1, TGT16B01 and TGT16B03.

Author contributions

L.D.C., E.C.W., and M.A.G. analysed the data. A.C., D.C., and E.M. conceived and designed the experiments. P.G., Y.N.S.S. and L.R. performed the experiments. All authors wrote the paper. E.C.W., A.C., and M.A.G. supervised the work.

Data availability

The data that support the findings of this study are openly available in the GitHub repository SHARES at https://github.com/calabrialab/SHARES.

Code availability

The R code used to produce all the analyses of this work is openly available at the GitHub repository https://github.com/delcore-luca/SCS.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors jointly supervised this work: E. C. Wit, A. Calabria and M. A. Grzegorczyk.

Contributor Information

L. Del Core, Email: l.del.core@rug.nl

E. C. Wit, Email: ernst.jan.camiel.wit@usi.ch

A. Calabria, Email: calabria.andrea@hsr.it

M. A. Grzegorczyk, Email: m.a.grzegorczyk@rug.nl

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-022-05837-0.

References

  • 1.Dunbar CE, High KA, Joung JK, Kohn DB, Ozawa K, Sadelain M. Gene therapy comes of age. Science. 2018;359(6372):eaan4672. doi: 10.1126/science.aan4672. [DOI] [PubMed] [Google Scholar]
  • 2.Aiuti A, et al. Lentiviral hematopoietic stem cell gene therapy in patients with Wiskott–Aldrich Syndrome. Science. 2013 doi: 10.1126/science.1233151. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Biffi A, et al. Lentiviral hematopoietic stem cell gene therapy benefits Metachromatic Leukodystrophy. Science. 2013 doi: 10.1126/science.1233158. [DOI] [PubMed] [Google Scholar]
  • 4.Cesana D, et al. Retrieval of vector integration sites from cell-free DNA. Nat. Med. 2021;27:1458–1470. doi: 10.1038/s41591-021-01389-4. [DOI] [PubMed] [Google Scholar]
  • 5.Kohn DB, et al. Lentiviral gene therapy for X-linked chronic granulomatous disease. Nat. Med. 2020;26(2):200–206. doi: 10.1038/s41591-019-0735-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Magnani CF, Gaipa G, Lussana F, Belotti D, Gritti G, Napolitano S, Matera G, Cabiati B, Buracchi C, Borleri G, et al. Sleeping Beauty-engineered CAR T cells achieve antileukemic activity without severe toxicities. J. Clin. Investig. 2020;130(11):6021–6033. doi: 10.1172/JCI138473. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Marktel S, Scaramuzza S, Cicalese MP, Giglio F, Galimberti S, Lidonnici MR, Calbi V, Assanelli A, Bernardo ME, Rossi C, et al. Intrabone hematopoietic stem cell gene therapy for adult and pediatric patients affected by transfusion-dependent ß-thalassemia. Nat. Med. 2019;25(2):234–241. doi: 10.1038/s41591-018-0301-6. [DOI] [PubMed] [Google Scholar]
  • 8.Scala S, Basso-Ricci L, Dionisio F, Pellin D, Giannelli S, Salerio FA, Leonardelli L, Cicalese MP, Ferrua F, Aiuti A, et al. Dynamics of genetically engineered hematopoietic stem and progenitor cells after autologous transplantation in humans. Nat. Med. 2018;24(11):1683–1690. doi: 10.1038/s41591-018-0195-3. [DOI] [PubMed] [Google Scholar]
  • 9.Yuo TS-T, Tseng TA. The environmental product variety and retail rents on central urban shopping areas: A multi-stage spatial data mining method. Environ. Plan. B. 2021;48:2167–2187. [Google Scholar]
  • 10.Fu Y, et al. Mutational characterization of hbv reverse transcriptase gene and the genotype-phenotype correlation of antiviral resistance among chinese chronic hepatitis b patients. Emerg. Microbe Infect. 2020;9(1):2381–2393. doi: 10.1080/22221751.2020.1835446. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Abina SH-B, Gaspar HB, Blondeau J, Caccavelli L, Charrier S, Buckland K, Picard C, Six E, Himoudi N, Gilmour K, et al. Outcomes following gene therapy in patients with severe Wiskott–Aldrich syndrome. Jama. 2015;313(15):1550–1563. doi: 10.1001/jama.2015.3253. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.McNulty SN, Mann PR, Robinson JA, Duncavage EJ, Pfeifer JD. Impact of reducing DNA input on next-generation sequencing library complexity and variant detection. J. Mol. Diagn. 2020;22(5):720–727. doi: 10.1016/j.jmoldx.2020.02.003. [DOI] [PubMed] [Google Scholar]
  • 13.Sanders HL. Marine benthic diversity: A comparative study. Am. Nat. 1968;102(925):243–282. [Google Scholar]
  • 14.Weiss S, Xu ZZ, Peddada S, Amir A, Bittinger K, Gonzalez A, Lozupone C, Zaneveld JR, Vázquez-Baeza Y, Birmingham A, Hyde ER, Knight R. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017;5(1):27. doi: 10.1186/s40168-017-0237-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.McMurdie PJ, Holmes S. Waste not, want not: Why rarefying microbiome data is inadmissible. PLoS Comput. Biol. 2014;10(4):1–12. doi: 10.1371/journal.pcbi.1003531. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Willis AD. Rarefaction, alpha diversity, and statistics. Front. Microbiol. 2019;10:2407. doi: 10.3389/fmicb.2019.02407. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Whittaker RH. Evolution and measurement of species diversity. TAXON. 1972;21(2–3):213–251. [Google Scholar]
  • 18.Beule KPL. Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): Application to microbial communities. PeerJ. 2020;8:e9593. doi: 10.7717/peerj.9593. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Montini E, Cesana D, Schmidt M, Sanvito F, Ponzoni M, Bartholomae C, Sergi LS, Benedicenti F, Ambrosi A, Di Serio C, Doglioni C, von Kalle C, Naldini L. Hematopoietic stem cell gene transfer in a tumor-prone mouse model uncovers low genotoxicity of lentiviral vector integration. Nat. Biotechnol. 2006;24(6):687–696. doi: 10.1038/nbt1216. [DOI] [PubMed] [Google Scholar]
  • 20.Montini E, Cesana D, Schmidt M, Sanvito F, Bartholomae CC, Ranzani M, Benedicenti F, Sergi LS, Ambrosi A, Ponzoni M, Doglioni C, Serio CD, von Kalle C, Naldini L. The genotoxic potential of retroviral vectors is strongly modulated by vector design and integration site selection in a mouse model of HSC gene therapy. J. Clin. Investig. 2009;119(4):964–975. doi: 10.1172/JCI37630. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Lu R, Neff N, Quake S, Weissman I. Tracking single hematopoietic stem cells in vivo using high-throughput sequencing in conjunction with viral genetic barcoding. Nat. Biotechnol. 2011;29:928–33. doi: 10.1038/nbt.1977. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Nakamura T, Omasa T. Optimization of cell line development in the GS-CHO expression system using a high-throughput, single cell-based clone selection system. J. Biosci. Bioeng. 2015;120(3):323–329. doi: 10.1016/j.jbiosc.2015.01.002. [DOI] [PubMed] [Google Scholar]
  • 23.Gerrits A, Dykstra B, Kalmykowa OJ, Klauke K, Verovskaya E, Broekhuis MJC, de Haan G, Bystrykh LV. Cellular barcoding tool for clonal analysis in the hematopoietic system. Blood. 2010;115(13):2610–2618. doi: 10.1182/blood-2009-06-229757. [DOI] [PubMed] [Google Scholar]
  • 24.Harkey MA, et al. Multiarm high-throughput integration site detection: Limitations of LAM-PCR technology and optimization for clonal analysis. Stem Cells Dev. 2007;16(3):381–392. doi: 10.1089/scd.2007.0015. [DOI] [PubMed] [Google Scholar]
  • 25.Schuster SC. Next-generation sequencing transforms today’s biology. Nat. Methods. 2008;5(1):16–18. doi: 10.1038/nmeth1156. [DOI] [PubMed] [Google Scholar]
  • 26.Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim J-B, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437(7057):376–380. doi: 10.1038/nature03959. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Demkow U, Ploski R. Clinical Applications for Next-Generation Sequencing. Elsevier Science; 2015. [Google Scholar]
  • 28.Shendure J, Mitra RD, Varma C, Church GM. Advanced sequencing technologies: Methods and goals. Nat. Rev. Genet. 2004;5(5):335–344. doi: 10.1038/nrg1325. [DOI] [PubMed] [Google Scholar]
  • 29.Ledergerber C, Dessimoz C. Base-calling for next-generation sequencing platforms. Brief. Bioinform. 2011;12:489–497,09. doi: 10.1093/bib/bbq077. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Chang F, Li MM. Clinical application of amplicon-based next-generation sequencing in cancer. Cancer Genet. 2013;206(12):413–419. doi: 10.1016/j.cancergen.2013.10.003. [DOI] [PubMed] [Google Scholar]
  • 31.Kohlmann A, Grossmann V, Haferlach T. Integration of Next-Generation Sequencing into clinical practice: Are we there yet? Semin. Oncol. 2012;39(1):26–36. doi: 10.1053/j.seminoncol.2011.11.008. [DOI] [PubMed] [Google Scholar]
  • 32.Gargis AS, Kalman L, Bick DP, da Silva C, Dimmock DP, Funke BH, Gowrisankar S, Hegde MR, Kulkarni S, Mason CE, Nagarajan R, Voelkerding KV, Worthey EA, Aziz N, Barnes J, Bennett SF, Bisht H, Church DM, Dimitrova Z, Gargis SR, Hafez N, Hambuch T, Hyland FCL, Luna RA, MacCannell D, Mann T, McCluskey MR, McDaniel TK, Ganova-Raeva LM, Rehm HL, Reid J, Campo DS, Resnick RB, Ridge PG, Salit ML, Skums P, Wong L-JC, Zehnbauer BA, Zook JM, Lubin IM. Good laboratory practice for clinical next-generation sequencing informatics pipelines. Nat. Biotechnol. 2015;33:689–693,07. doi: 10.1038/nbt.3237. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Biasco L, Pellin D, Scala S, Dionisio F, Basso-Ricci L, Leonardelli L, Scaramuzza S, Baricordi C, Ferrua F, Cicalese M, Giannelli S, Neduva V, Dow D, Schmidt M, Von Kalle C, Roncarolo M, Ciceri F, Vicard P, Wit E, Di Serio C, Naldini L, Aiuti A. In vivo tracking of human hematopoiesis reveals patterns of clonal dynamics during early and steady-state reconstitution phases. Cell Stem Cell. 2016;19(1):107–119. doi: 10.1016/j.stem.2016.04.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Wu C, Li B, Lu R, Koelle S, Yang Y, Jares A, Krouse A, Metzger M, Liang F, Loré K, Wu C, Donahue R, Chen I, Weissman I, Dunbar C. Clonal tracking of rhesus macaque hematopoiesis highlights a distinct lineage origin for natural killer cells. Cell Stem Cell. 2014;14(4):486–499. doi: 10.1016/j.stem.2014.01.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Mazurier F, Gan OI, McKenzie JL, Doedens M, Dick JE. Lentivector-mediated clonal tracking reveals intrinsic heterogeneity in the human hematopoietic stem cell compartment and culture-induced stem cell impairment. Blood. 2004;103(2):545–552. doi: 10.1182/blood-2003-05-1558. [DOI] [PubMed] [Google Scholar]
  • 36.Biasco L, Rothe M, Schott JW, Schambach A. Integrating vectors for gene therapy and clonal tracking of engineered hematopoiesis. Hematology. 2020;31:737–752. doi: 10.1016/j.hoc.2017.06.009. [DOI] [PubMed] [Google Scholar]
  • 37.Shannon CE. A mathematical theory of communication. Bell Syst. Tech. J. 1948;27(4):623–656. [Google Scholar]
  • 38.Kullback S, Leibler RA. On information and sufficiency. Ann. Math. Stat. 1951;22(1):79–86. [Google Scholar]
  • 39.Carboni I, Fattorini P, Previderè C, Ciglieri SS, Iozzi S, Nutini A, Contini E, Pescucci C, Torricelli F, Ricci U. Evaluation of the reliability of the data generated by next generation sequencing from artificially degraded DNA samples. Forensic Sci. Int. 2015;5:e83–e85. [Google Scholar]
  • 40.Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: A de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–1428. doi: 10.1093/bioinformatics/bts174. [DOI] [PubMed] [Google Scholar]
  • 41.Pereira-Marques J, Hout A, Ferreira RM, Weber M, Pinto-Ribeiro I, van Doorn L-J, Knetsch CW, Figueiredo C. Impact of host DNA and sequencing depth on the taxonomic resolution of whole metagenome sequencing for microbiome analysis. Front. Microbiol. 2019;10:1277. doi: 10.3389/fmicb.2019.01277. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Hahn A, Sanyal A, Perez GF, Colberg-Poley AM, Campos J, Rose MC, Pérez-Losada M. Different next generation sequencing platforms produce different microbial profiles and diversity in cystic fibrosis sputum. J. Microbiol. Methods. 2016;130:95–99. doi: 10.1016/j.mimet.2016.09.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Sabina J, Leamon JH. Bias in Whole Genome Amplification: Causes and Considerations. Springer; 2015. pp. 15–41. [DOI] [PubMed] [Google Scholar]
  • 44.Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq illumina sequencing platform. Appl. Environ. Microbiol. 2013;79(17):5112–5120. doi: 10.1128/AEM.01043-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Nakayama Y, Yamaguchi H, Einaga N, Esumi M. Pitfalls of DNA quantification using DNA-binding fluorescent dyes and suggested solutions. PLoS ONE. 2016;11(3):1–12. doi: 10.1371/journal.pone.0150528. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Robin JD, Ludlow AT, LaRanger R, Wright WE, Shay JW. Comparison of DNA quantification methods for next generation sequencing. Sci. Rep. 2016;6(1):24067. doi: 10.1038/srep24067. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Pya N, Wood SN. Shape constrained additive models. Stat. Comput. 2015;25(3):543–559. [Google Scholar]
  • 48.Bollaerts K, Eilers PH, Van Mechelen I. Simple and multiple P-splines regression with shape constraints. Br. J. Math. Stat. Psychol. 2006;59(2):451–469. doi: 10.1348/000711005X84293. [DOI] [PubMed] [Google Scholar]
  • 49.Brezger A, Steiner WJ. Monotonic regression based on bayesian p-splines: An application to estimating price response functions from store-level scanner data. J. Bus. Econ. Stat. 2008;26(1):90–104. [Google Scholar]
  • 50.Fritsch FN, Carlson RE. Monotone piecewise cubic interpolation. SIAM J. Numer. Anal. 1980;17(2):238–246. [Google Scholar]
  • 51.Meyer MC. Inference using shape-restricted regression splines. Ann. Appl. Stat. 2008;2(3):1013–1033. [Google Scholar]
  • 52.Meyer MC. A framework for estimation and inference in generalized additive models with shape and order restrictions. Stat. Sci. 2018;33(4):595–614. [Google Scholar]
  • 53.De Boor C, De Boor C, Mathématicien E-U, De Boor C, De Boor C. A Practical Guide to Splines. Springer-Verlag; 1978. [Google Scholar]
  • 54.Liao X, Meyer MC. coneproj: An R package for the primal or dual cone projections with routines for constrained regression. J. Stat. Softw. 2014;61(12):1–22. [Google Scholar]
  • 55.Burnham KP, Anderson DR, Huyvaert KP. AIC model selection and multimodel inference in behavioral ecology: Some background, observations, and comparisons. Behav. Ecol. Sociobiol. 2011;65(1):23–35. [Google Scholar]
  • 56.Spinozzi G, Calabria A, Brasca S, Beretta S, Merelli I, Milanesi L, Montini E. VISPA2: A scalable pipeline for high-throughput identification and annotation of vector integration sites. BMC Bioinform. 2017;18(1):520. doi: 10.1186/s12859-017-1937-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Berry CC, Gillet NA, Melamed A, Gormley N, Bangham CRM, Bushman FD. Estimating abundances of retroviral insertion sites from DNA fragment length data. Bioinformatics. 2012;28:755–762,03. doi: 10.1093/bioinformatics/bts004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Benedicenti F, Calabria A, Cesana D, Albertini A, Tenderini E, Spinozzi G, Neduva V, Richard A, Brugman M, Dow D, et al. Sonication linker mediated-PCR (SLiM-PCR), an efficient method for quantitative retrieval of vector integration sites. Hum.Gene Ther. 2019;30:A214–A215. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data Availability Statement

The data that support the findings of this study are openly available in the GitHub repository SHARES at https://github.com/calabrialab/SHARES.

The R code used to produce all the analyses of this work is openly available at the GitHub repository https://github.com/delcore-luca/SCS.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group

RESOURCES