Bayesian modelling of high-throughput sequencing assays with malacoda

Andrew R Ghazi; Xianguo Kong; Ed S Chen; Leonard C Edelstein; Chad A Shaw

doi:10.1371/journal.pcbi.1007504

PLoS Comput Biol. 2020 Jul 21;16(7):e1007504. doi: 10.1371/journal.pcbi.1007504

Editor: Jian Ma⁴

PMCID: PMC7394446 PMID: 32692749

Abstract

NGS studies have uncovered an ever-growing catalog of human variation while leaving an enormous gap between observed variation and experimental characterization of variant function. High-throughput screens powered by NGS have greatly increased the rate of variant functionalization, but the development of comprehensive statistical methods to analyze screen data has lagged. In the massively parallel reporter assay (MPRA), short barcodes are counted by sequencing DNA libraries transfected into cells and the cell’s output RNA in order to simultaneously measure the shifts in transcription induced by thousands of genetic variants. These counts present many statistical challenges, including overdispersion, depth dependence, and uncertain DNA concentrations. So far, the statistical methods used have been rudimentary, employing transformations on count level data and disregarding experimental and technical structure while failing to quantify uncertainty in the statistical model. We have developed an extensive framework for the analysis of NGS functionalization screens available as an R package called malacoda (available from github.com/andrewGhazi/malacoda). Our software implements a probabilistic, fully Bayesian model of screen data. The model uses the negative binomial distribution with gamma priors to model sequencing counts while accounting for effects from input library preparation and sequencing depth. The method leverages the high-throughput nature of the assay to estimate the priors empirically. External annotations such as ENCODE data or DeepSea predictions can also be incorporated to obtain more informative priors–a transformative capability for data integration. The package also includes quality control and utility functions, including automated barcode counting and visualization methods. To validate our method, we analyzed several datasets using malacoda and alternative MPRA analysis methods. 
These data include experiments from the literature, simulated assays, and primary MPRA data. We also used luciferase assays to experimentally validate several hits from our primary data, as well as variants for which the various methods disagree and variants detectable only with the aid of external annotations.

Author summary

Genetic sequencing technology has progressed rapidly in the past two decades. Huge genomic characterization studies have resulted in a massive quantity of background information across the entire genome, including catalogs of observed human variation, gene regulation features, and computational predictions of genomic function. Meanwhile, new types of experiments use the same sequencing technology to simultaneously test the impact of thousands of mutations on gene regulation. While the design of experiments has become increasingly complex, the data analysis methods deployed have remained overly simplistic, often relying on summary measures that discard information. Here we present a statistical framework called malacoda for the analysis of massively parallel genomic experiments which is designed to incorporate prior information in an unbiased way. We validate our method by comparing it to alternatives on simulated and real datasets, by using different types of assays that provide a similar type of information, and by closely inspecting an example experimental result that only our method detected. We also present the method’s accompanying software package which provides an end-to-end pipeline with a simple interface for data preparation, analysis, and visualization.

This is a PLOS Computational Biology Methods paper.

Introduction

The advent of next generation sequencing (NGS) has generated an explosion of observed genetic variation in humans. Variants with unclear effects greatly outnumber those with severe impact. For example, the 1000 Genomes Project [1] has estimated that a typical human genome has roughly 150 protein-truncating variants, 11,000 peptide-sequence altering variants, and 500,000 variants falling into known regulatory regions. Simultaneously, genome-wide association studies (GWAS) have found strong statistical associations between thousands of noncoding variants and hundreds of human phenotypes [2,3]. Traditional methods of assessing the regulatory impact of variants are slow and low-throughput: luciferase reporter assays require multiple replications of cloning individual genomic regions, transfection into cells, and measurement of output intensity.

Massively Parallel Reporter Assays (MPRA), overviewed in Fig 1, were developed to assess simultaneously the transcriptional impact of thousands of genetic variants [4]. The simplest form of MPRA uses a carefully designed set of barcoded oligonucleotides containing roughly 150 base pairs of genomic context surrounding variants of interest. There are typically thousands of variants selected using preliminary evidence from GWAS, and there are usually ten to thirty replicates of each allele with unique, inert barcodes. The oligonucleotides are cloned into plasmids, making a complex library that is then transfected into cells. The cells use the library as genetic material and actively transcribe the inserts. Because the barcodes are preserved by transcription, counting the RNA products of each variant construct by re-identifying each barcode in the NGS product provides a direct measure of the transcriptional output of a given genetic variant. By designing the oligonucleotide library to contain multiple barcodes of both the reference and alternate alleles for each variant, one can statistically assess the transcription shift (TS) for each variant. MPRA can thus be used to identify functional driver variants among sets of statistically significant GWAS variants that are difficult to distinguish in observational studies because of linkage disequilibrium.

MPRA have successfully identified many transcriptionally functional variants [5, 6, 7], but the accompanying statistical analyses have been rudimentary. Initial studies focused on the computation of the “activity” for each barcode in each RNA sample. This involves averaging across depth-adjusted counts to compute a normalizing DNA factor for each barcode, then dividing depth-adjusted RNA counts by the DNA factor and taking the log of this ratio. Then a t-test is used to compare the activity measurements for each allele, followed by assay-wide multiple-testing corrections. The key limitations include ignoring systematic variation due to unknown DNA concentrations, compounded data transformation and summarization prior to modelling, and the failure to include the reservoir of prior data and biological knowledge concerning genes and genomic regions. The methods mpralm [8], MPRAscore [9], QuASAR-MPRA [10], and MPRAnalyze [11] are more recent, but they all suffer from some combination of common limitations: failure to model variation in input DNA concentrations, aggregation of data across barcodes, sequencing samples without modelling systematic sources of variation, and over-reliance on point estimates of dispersion that cause errors in transcription shift estimates.
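The activity-based analysis described above can be sketched as follows. This is a minimal illustration, not code from any published pipeline: the function names, the counts-per-million scaling, and the use of Welch's form of the t statistic are assumptions made here for concreteness.

```python
import math
from statistics import mean, variance

def depth_adjust(count, depth, scale=1e6):
    """Convert a raw count to counts per `scale` reads (scale is illustrative)."""
    return count / depth * scale

def barcode_activity(dna_counts, dna_depths, rna_counts, rna_depths):
    """Return one log-ratio activity per RNA sample for a single barcode.

    The DNA factor is the mean of the depth-adjusted DNA counts. A zero RNA
    count would raise ValueError in math.log, which illustrates one weakness
    of working on transformed counts.
    """
    dna_factor = mean(depth_adjust(c, d) for c, d in zip(dna_counts, dna_depths))
    return [math.log(depth_adjust(c, d) / dna_factor)
            for c, d in zip(rna_counts, rna_depths)]

def welch_t(x, y):
    """Welch's t statistic comparing the mean activity of two alleles."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se
```

Note that the DNA factor is treated as a known constant once computed, so its uncertainty never reaches the t-test; this is the first limitation listed above.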

Other areas of genomic analysis have generated a wealth of information on genomic structure and function, frequently specific to particular genomic contexts and variants. For example, the ENCODE project [12] provides genome-wide ChIP-seq data on transcription factor binding profiles, histone marks, and DNA accessibility. Computational methods such as DeepSea [13] use machine learning to provide variant-specific predictions on chromatin effects. Genome-wide databases like ENCODE and computational predictors like DeepSea contain real information about variant effects, but a method for incorporating this information into a statistical framework for experimental analysis of variants has not been developed.

We hypothesized that a structured, probabilistic modelling approach to high-throughput NGS screens such as MPRA would yield more accurate estimates of variant function while improving statistical sensitivity and specificity, particularly when incorporating prior information. This approach offers a flexible modelling system that can fit hierarchical model structures of count data while also directly accounting for experimental sources of variation. Our approach would also enable the integration of prior information and account for uncertainty in dispersion parameter estimates. These advantages offer significant improvements in statistical efficiency and provide opportunities for formulating systems-level hypotheses—for example, the impact of specific transcription factors—that are absent from other approaches. Here we present malacoda, an end-to-end Bayesian statistical framework that addresses gaps in the prior approaches while providing novel methods for incorporating prior information. The malacoda method focuses on MPRA but also has potential extension to a broad array of NGS-based high-throughput screens. We establish the superior performance of malacoda on MPRA compared to alternatives using simulation studies. We then apply the method to previously published findings to make new biological discoveries that we explore in the paper. We also apply malacoda to primary MPRA studies that we performed. We limit the analysis of our primary data to an examination of the inter-method consistency of effect size estimates in order to emphasize the potential of our statistical method. The barcode counts and cross-method effect size estimates for all of the results are included in S2 Data. To demonstrate the impact of malacoda for biologically relevant discovery, we analyzed previously published data by Ulirsch et al, and we identified the functional variant rs11865131 within the intron of the NPRL3 gene; we validated this finding by luciferase assay. 
The results demonstrate that using malacoda we can discover biologically important findings that were missed by prior approaches. We have made the software available as an open source R package on GitHub.

Methods

Overview

In malacoda we use a negative binomial model for NGS barcode counts with empirically estimated gamma priors, and we explicitly model variation in the input DNA concentrations for each barcode. By default, the method marginally estimates the priors from the maximum likelihood estimates of each variant in the assay; the method also supports informative prior estimation by using external genomic annotations for each variant as weights. This approach enables disparate knowledge sources to inform the results in a principled, data-driven way. The probabilistic model underlying malacoda uses the NGS data directly without transformation, and it accounts for all known sources of experimental variation and uncertainty in model parameters. Finally, the method provides estimate shrinkage as a means of avoiding false positives.

Description of the statistical model

MPRA data are composed of the counts of the barcoded DNA input from sequencing the plasmid library and the counts of the barcoded RNA outputs from sequencing the RNA content extracted from passaged cells. The DNA counts vary according to the sequencing depth of the sample as well as due to the inherent noise in library preparation. The RNA measurements also vary according to sequencing depth, but they are also affected by the DNA input concentration and the inherent transcription rate of their associated region of genomic context. Fig 2A shows a subset of a typical MPRA dataset, with two barcodes of each allele for two variants and several columns of counts. We find that typically MPRA are performed with four to six RNA sequencing replicates and a smaller number of DNA replicate samples. Fig 2B shows a simplified Kruschke diagram of the model underlying malacoda, using the mean-dispersion parameterization of the negative binomial. More explicitly,

\mu_{DNA_{bc}} \sim \mathrm{Gamma}(\alpha_{\mu_{DNA}}, \beta_{\mu_{DNA}})

\mu_{allele} \sim \mathrm{Gamma}(\alpha_{\mu_{RNA}}, \beta_{\mu_{RNA}})

\phi_{DNA} \sim \mathrm{Gamma}(\alpha_{\phi_{DNA}}, \beta_{\phi_{DNA}})

\phi_{allele} \sim \mathrm{Gamma}(\alpha_{\phi_{RNA}}, \beta_{\phi_{RNA}})

\mathrm{Counts}_{DNA_{s,bc}} \sim \mathrm{NegBin}(\mathrm{mean} = d_{s} \times \mu_{DNA_{bc}}, \mathrm{dispersion} = \phi_{DNA})

\mathrm{Counts}_{RNA_{s,bc}} \sim \mathrm{NegBin}(\mathrm{mean} = d_{s} \times \mu_{DNA_{bc}} \times \mu_{allele}, \mathrm{dispersion} = \phi_{allele})

Here d_s indicates the depth of a particular sequencing sample, μ_{DNA_{bc}} indicates the unknown concentration of a particular barcode in the plasmid library, and μ_{allele} indicates the effect of the genomic context of a given allele of a given variant. Parameters indexed by “bc” are vectors with an element for each barcode, while those with the “allele” subscript contain two elements, one each for the reference and alternate alleles. The shape α and rate β parameters of the Gamma priors are estimated empirically. Note that the mean of each negative binomial used to model a particular count observation is directly proportional to the sequencing depth of the sample from which that count observation arose. A more finely detailed walkthrough of the model and its implementation is available in section 1 of S1 Appendix.
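To make the likelihood concrete, the following sketch evaluates the negative binomial log-probability in a mean-dispersion parameterization and sums it over the RNA counts of one barcode. It assumes the convention in which the variance equals mean + mean²/dispersion (the convention of Stan's `neg_binomial_2`); the function and argument names are illustrative and are not taken from the malacoda package.

```python
import math

def neg_binomial_2_logpmf(y, mean, dispersion):
    """log P(Y = y) for a negative binomial with the given mean and dispersion.

    Assumed convention: Var(Y) = mean + mean**2 / dispersion, so larger
    dispersion values bring the distribution closer to a Poisson.
    """
    log_total = math.log(mean + dispersion)
    return (math.lgamma(y + dispersion) - math.lgamma(dispersion) - math.lgamma(y + 1)
            + y * (math.log(mean) - log_total)
            + dispersion * (math.log(dispersion) - log_total))

def rna_log_likelihood(counts, depths, mu_dna, mu_allele, phi_allele):
    """Log-likelihood of one barcode's RNA counts across sequencing samples.

    Each sample's mean is depth * barcode DNA concentration * allele effect,
    mirroring the RNA count equation above.
    """
    return sum(neg_binomial_2_logpmf(y, d * mu_dna * mu_allele, phi_allele)
               for y, d in zip(counts, depths))
```

Because the depth d_s enters only through the mean, a count from a deeply sequenced sample and one from a shallow sample can be compared without any prior transformation of the data.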

Fig 2 — A) The table shows a subset of our primary MPRA data. The highlighted cell containing 759 barcode counts is influenced both by the sequencing depth of its sample (blue column) and the unknown input DNA concentration of its barcode (red row). B) A simplified Kruschke diagram of the generative model underlying malacoda. After evaluating the joint posterior on all model parameters, a 95% posterior interval on a variant’s transcription shift (shaded area) may be used for a binary decision between “functional” or “non-functional”. This example TS posterior shows a negative shift that excludes zero, meaning the variant in question would be called as “functional”. C) A conceptual diagram demonstrating three prior types available in the malacoda framework. The marginal prior (left) weights all variants in the assay equally, while the grouped and conditional priors utilize informative annotations as weights in the prior estimation process.

The negative binomial distribution is a natural choice for modelling NGS count data given its ability to accurately fit overdispersed observations frequently seen in sequencing data [14]. Briefly, the observed dispersion in NGS count data usually exceeds that expected from simpler binomial or Poisson models. We chose gamma distributions as priors for several reasons. They have the appropriate [0,∞) support, and for a non-negative random variable whose expectation and expected log exist, they are the maximum entropy distribution. Additionally, they are characterized by two parameters, which gives the prior estimation process enough flexibility to accurately fit the observed population of negative binomial estimates. Probabilistic modelling of the dispersion parameters is key as demonstrated by simulation in S2 Appendix. Allocating probability across a distribution of dispersion parameter values impacts the inference on the other parameters in the model, specifically the allele-level effects that the assay aims to evaluate. The practice of modelling dispersion parameters probabilistically helps avoid pitfalls found in methods that utilize point estimates of dispersion. This barcode-level count data model that quantifies the uncertainty on the dispersion parameters is a central contribution of the malacoda method.

After computing the joint posterior on all model parameters, the posterior on transcription shift is computed as a generated quantity by taking the difference between the logs of μ_allele for the alternate and reference alleles. We then compute the narrowest interval containing 95% of the posterior on TS (the highest density interval (HDI)) for each variant. The 95% HDI is used to make binary calls on whether a variant is functional or non-functional: if the interval excludes zero as a credible value, the variant is labelled as “functional”. We note here that 95% is an arbitrary threshold based on statistical convention and commonly accepted trade-offs between sensitivity and specificity. Other common cutoffs such as 80% or 99% may be used. An optional “region of practical equivalence” may also be defined on a per-assay basis when there is particular interest in rejecting a null region of transcription shift values around zero [15].
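The decision rule above can be illustrated with a short sketch that operates on posterior draws. It assumes paired draws of μ_allele for the reference and alternate alleles are already available from a sampler; the function names are hypothetical, and the HDI is found with a simple sliding window over the sorted draws.

```python
import math

def transcription_shift(mu_ref_draws, mu_alt_draws):
    """Per-draw transcription shift: log(alternate) minus log(reference)."""
    return [math.log(a) - math.log(r) for r, a in zip(mu_ref_draws, mu_alt_draws)]

def hdi(draws, mass=0.95):
    """Narrowest interval containing at least `mass` of the draws."""
    s = sorted(draws)
    n = len(s)
    k = min(n, max(1, math.ceil(mass * n)))  # number of draws inside the interval
    start = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]

def is_functional(ts_draws, mass=0.95):
    """Call a variant functional when the HDI excludes zero."""
    lo, hi = hdi(ts_draws, mass)
    return not (lo <= 0.0 <= hi)
```

Changing `mass` to 0.80 or 0.99 gives the alternative cutoffs mentioned above without any other change to the procedure.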

Empirical priors

The gamma priors are fit empirically using maximum likelihood estimation. Specifically, each variant-level model is fit first by maximizing the likelihood component of the malacoda model, then empirical gamma distributions are fit to those estimates for the means and dispersions of the DNA, reference RNA, and alternate RNA. This approach offers several benefits. First, it leverages the high-throughput nature of the assay. The full dataset of thousands of variants determines the prior, so the contribution from each individual variant is small. Secondly, it constrains the prior to be reasonable in the context of a given assay. Specific circumstances regarding library preparation, sequencer properties, cell culture conditions, and other unknown factors will cause the underlying statistical properties of each MPRA to be unique. A less informed, general-purpose prior, such as Gamma(shape = 0.001, rate = 0.001), would assign a considerable amount of probability density to unreasonable regions of parameter space. Empirical estimation ensures that the priors capture the reasonable range of values for each parameter while avoiding putting unwarranted density on extreme values [16]. Finally, by sharing information between variants, empirical priors provide estimate shrinkage. The prior effectively regularizes all parameter estimates, a behavior which is important in multi-parameter models with relatively little data per parameter. This regularizing effect acts as an alternative to post hoc multiple testing correction: rather than widening the confidence interval on the estimate of the transcription shift, an empirical prior shrinks the estimate of transcription shift towards the global average while leaving the width of the interval intact. This data-driven approach acts as a natural safeguard against the risk of false positives found in multiple testing scenarios while simultaneously moderating the reported effect sizes of variants that display extreme behavior by chance. 
The regularization effect of the empirical prior is demonstrated in section 6 of S3 Appendix.

In order to incorporate external knowledge, the malacoda method also allows users to provide informative annotations to supplement the analysis. Fig 2C contrasts the marginal prior (left) with two prior types that make use of external annotations. These priors use the information in the annotations by employing the principle that similarly annotated variants should perform similarly in the assay. When the annotations are simply a set of descriptive categories (for example predictions of likely benign, uncertain, or likely functional), the grouped prior (2C, center) simply fits a prior distribution within each subset. When the annotations are continuous values, the conditionally weighted (2C, right) prior employs an adaptive kernel smoothing process to estimate the prior. To estimate the prior for a single variant, it initializes a t-distribution kernel centered at the annotation of the variant in question, then gradually widens this kernel until the n-th most highly weighted variant (where n is a configurable tuning parameter defaulting to 100) has a weight of at least one percent of that of the most influential variant. This ensures that the weights used to estimate the conditional prior are not dominated by the nearest neighbor in annotation space. While the diagram in Fig 2C shows this for only a single informative annotation on the horizontal axis, the software allows for an arbitrary number of continuous predictors to be used.
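The kernel-widening step for the conditional prior can be sketched for a single continuous annotation as follows. The degrees of freedom, the starting scale, and the multiplicative growth factor are hypothetical values chosen for illustration; only the stopping rule (the n-th most highly weighted variant reaches one percent of the top weight, with n defaulting to 100) comes from the description above.

```python
def t_kernel(distance, scale, df=1.0):
    """Unnormalised Student-t density at `distance` for a given scale."""
    return (1.0 + (distance / scale) ** 2 / df) ** (-(df + 1.0) / 2.0)

def conditional_weights(annotations, target, n=100, scale=0.01,
                        growth=1.2, min_ratio=0.01):
    """Widen a t kernel centred at `target` until the n-th largest weight
    is at least `min_ratio` times the largest weight.

    Returns the per-variant weights and the final kernel scale.
    """
    n = min(n, len(annotations))
    while True:
        weights = [t_kernel(a - target, scale) for a in annotations]
        ranked = sorted(weights, reverse=True)
        if ranked[n - 1] >= min_ratio * ranked[0]:
            return weights, scale
        scale *= growth  # widen the kernel and try again
```

The returned weights would then be used when fitting the gamma prior for the variant whose annotation is `target`, so that variants with similar annotations contribute most to its prior.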

Simulation and validation studies

We took several approaches to validate and compare the malacoda method with alternatives. First, we simulated MPRA data across a realistic grid of parameters governing the fraction of truly functional variants, the number of variants in the assay, and the number of barcodes per allele. These simulations also modelled distinct sequencing samples, realistic variation in sequencing depth, and barcode failure during library preparation. We then compared malacoda to alternative methods including the t-test, mpralm, MPRAscore, QuASAR-MPRA, and MPRAnalyze. Across these simulations we compared performance metrics including area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPR) and estimate accuracy. The code used to generate these simulations is provided in sections 2 and 3 of S3 Appendix. Secondly, we applied malacoda and alternative methods to real MPRA data from the Ulirsch dataset [5], using inter-method consensus as a performance metric. We repeated this using our own primary MPRA data from an assay performed in K562 cells inspecting 2666 variants related to platelet function. This assay utilized oligonucleotides with 150bp of genomic context and inert 14bp barcodes. The barcode counts from this assay are presented in S2 Data. In both cases we ran malacoda using both a marginal prior and a conditional prior informed by DeepSea predictions for DNase hypersensitivity in the relevant cell-type. Finally, we tested a subset of variants with luciferase reporter assays to assess consistency with MPRA estimates of variant function.

Computational methods and software

Our method is available as an R package from github.com/andrewGhazi/malacoda. The package includes detailed installation instructions, extensive help documentation, an analysis walkthrough vignette, and implementations of traditional activity-based analysis methods. The statistical models are fit with Stan [17], which allows us to perform a fast first pass fit with Automatic Differentiation Variational Inference [18] and, if a narrow 80% posterior interval on TS excludes zero, to perform a final Markov Chain Monte Carlo (MCMC) fit with Stan’s No-U-Turn Sampler. This presents an effective balance between the speed of approximate variational inference and asymptotically exact estimation of parameters via MCMC for functional variants. By default, each variant is first checked with a variational first pass. Then, if the variant passes the posterior interval check, MCMC is performed with 4 chains using 200 warmup and 500 post-warmup samples per chain for a total of 2000 posterior samples. These default settings can refine the limits of the TS posterior interval with satisfactory precision within a short run time. While the adaptive Hamiltonian Monte Carlo provided by Stan can efficiently explore high-dimensional posteriors, any MCMC-based method has Monte Carlo error that makes estimate precision difficult in borderline situations—an additional digit of estimate precision requires 100 times as many MCMC samples. When using the 95% posterior interval to make a binary classification of functional or non-functional, variants on the borderline can require a large number of posterior samples to precisely refine the limits of the interval that is used to classify a variant as functional or non-functional. By default, malacoda checks to see if either edge of the 95% interval is close to zero, and if necessary, lengthens the MCMC chains in order to provide better precision in this scenario. A full walkthrough of the computational methods is provided in section 2 of S1 Appendix.
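The statement that one extra digit of precision costs 100 times as many samples follows from the Monte Carlo standard error scaling as one over the square root of the number of draws. A minimal sketch, assuming approximately independent draws (in practice the effective sample size would be used) and hypothetical function names:

```python
import math

def mcse(posterior_sd, n_effective):
    """Monte Carlo standard error of a posterior mean estimate."""
    return posterior_sd / math.sqrt(n_effective)

def draws_needed(posterior_sd, target_mcse):
    """Number of effective draws required to reach a target standard error."""
    return math.ceil((posterior_sd / target_mcse) ** 2)
```

Reducing `target_mcse` by a factor of ten multiplies `draws_needed` by roughly one hundred, which is why chains are lengthened only for variants whose interval edges lie close to zero.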

Our package also includes data processing functionality to extract barcodes from reads, filter barcodes by quality, and count barcodes from a set of FASTQ files through an application of the FASTX-Toolkit [19]. Through an interface with the FreeBarcodes package [20], the package can also decode sequencing errors in the barcodes of an assay that has been designed using our previous work, mpradesigntools [21]. In our experience this typically recaptures about 5% additional data with no additional cost beyond a line of code during the assay design process. The package also contains plotting functionality to help visualize the results of analyses.
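The barcode extraction, quality filtering, and counting steps can be illustrated with a small standalone sketch. It does not reproduce the FASTX-Toolkit or the malacoda interface: the read layout (barcode in the first 14 bases), the Phred+33 encoding, and the mean-quality threshold are all assumptions made for illustration.

```python
from collections import Counter

def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)

def count_barcodes(fastq_lines, start=0, length=14, min_quality=30):
    """Count barcodes across four-line FASTQ records.

    A read contributes only if it is long enough to contain the full barcode
    and the barcode bases meet the mean quality threshold.
    """
    lines = [ln.rstrip("\n") for ln in fastq_lines]
    counts = Counter()
    for i in range(0, len(lines) - 3, 4):
        seq, qual = lines[i + 1], lines[i + 3]
        barcode = seq[start:start + length]
        barcode_qual = qual[start:start + length]
        if len(barcode) == length and mean_phred(barcode_qual) >= min_quality:
            counts[barcode] += 1
    return counts
```

Passing an open file handle as `fastq_lines` works for uncompressed files; error-correcting barcode decoding of the kind mentioned above is not covered by this sketch.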

Experimental methods

In order to collect experimental measurements of the transcriptional impact of variants through means other than MPRA, we performed luciferase reporter assays on seventeen variants. Four were among the strongest signals detected in our MPRA, six were variants from our MPRA where the statistical methods disagreed, and seven were variants from the Ulirsch dataset [5] where the malacoda marginal and DeepSea-based [13] conditional prior model fits disagreed.

150-200bp genomic DNA sequences flanking the variants were amplified by PCR using K562 lymphoblast (ATCC) genomic DNA as template, then cloned into the pGL4.28 minimal promoter luciferase reporter vector (Promega) at NheI and HindIII sites. Counterpart SNP variants were generated by site-directed mutagenesis. All the constructs were validated by DNA sequencing. 3μg plasmid preparations were co-transfected with 0.5μg β-gal plasmid into 1x10⁶ K562 cells with Lipofectamine 2000 according to the manufacturer's instructions. Each assay was repeated with 3 independent plasmid preparations. 24 hours post transfection, luciferase and β-gal were measured. Luciferase units were then normalized to β-gal values. These results are available in S1 Data.

Results

Simulation studies

We evaluated our simulation results in three ways. First, we examined the accuracy of transcription shift estimates. Fig 3A shows the results of analyzing one simulated dataset, with the true value of the simulation’s transcription shift plotted on the x-axis, with the model estimates on the y-axis. For each fit of each simulation using each analysis method, we analyzed accuracy using two metrics: standard deviation of estimates for truly non-functional variants at zero (vertical width of the grey boxplot, lower is better) and correlation with the true values for simulated functional variants with nonzero effects (off-center points, higher is better).
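The two accuracy metrics can be written out explicitly. The sketch below assumes that each simulated variant has a known true transcription shift, with exactly zero marking non-functional variants; the function names are illustrative.

```python
import math
from statistics import stdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def accuracy_metrics(true_ts, estimated_ts):
    """Return (spread at zero, correlation with truth for functional variants)."""
    null_estimates = [e for t, e in zip(true_ts, estimated_ts) if t == 0]
    functional = [(t, e) for t, e in zip(true_ts, estimated_ts) if t != 0]
    truth, estimates = zip(*functional)
    return stdev(null_estimates), pearson(truth, estimates)
```

A method with strong regularization would be expected to score well on the first metric, while the second checks that the shrinkage does not come at the cost of distorting the ordering of real effects.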

Second, we also computed area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPR) in order to characterize the binary classification performance of each method. Bayesian methods such as malacoda explicitly do not consider a null hypothesis and therefore do not output p-values. In order to create an analogous quantity needed to compute the AUC and AUPR, we instead computed one minus the minimum HDI width necessary to include zero as a credible transcription shift value to distinguish true and false positives. This process is presented in detail in section 4.1 of S3 Appendix. Fig 3B shows the ROC curves by method averaged over simulated assays with ten barcodes per allele, 5% truly functional variants, and 3000 variants. Fig 3C shows the precision-recall curves for the same simulations. Fig 3D shows that across all simulations with these characteristics, malacoda consistently showed the highest median AUC and AUPR, the highest correlation with the truth for functional variants, and the lowest standard deviation of estimates of truly non-functional variants. The last metric, “spread at zero”, particularly emphasizes the regularization effect, showing that while malacoda tends to produce the most accurate effects for functional variants, it can simultaneously provide the smallest estimates for truly non-functional variants. Other combinations of simulation parameters are shown in section 5 of S3 Appendix, displaying similar patterns.
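One way to read the p-value analogue described above is sketched below: search over HDI probability masses for the smallest mass whose interval contains zero, then report one minus that mass. This is an interpretation for illustration only; the exact procedure is the one given in section 4.1 of S3 Appendix, and the grid resolution and function names here are assumptions.

```python
import math

def hdi(sorted_draws, mass):
    """Narrowest interval containing at least `mass` of the sorted draws."""
    n = len(sorted_draws)
    k = min(n, max(1, math.ceil(mass * n)))
    start = min(range(n - k + 1),
                key=lambda i: sorted_draws[i + k - 1] - sorted_draws[i])
    return sorted_draws[start], sorted_draws[start + k - 1]

def pseudo_p_value(ts_draws, grid=100):
    """One minus the smallest HDI mass (on a grid) whose interval includes zero.

    Values near zero indicate a posterior far from zero, playing the role a
    small p-value plays for the frequentist methods.
    """
    s = sorted(ts_draws)
    for step in range(1, grid + 1):
        mass = step / grid
        lo, hi = hdi(s, mass)
        if lo <= 0.0 <= hi:
            return 1.0 - mass
    return 0.0  # zero lies outside every draw
```

Ranking variants by this quantity gives the ordering needed to trace out ROC and precision-recall curves alongside methods that report p-values.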

In order to examine the performance of malacoda on real data, we applied the various methods to both the Ulirsch data [5] and to our own primary dataset. Unlike the case with simulations, the underlying true transcription shift values are not known. However, inter-method consensus can serve as a performance metric. Methods that utilize varying model structure will tend to make errors in different ways, so methods that consistently perform well will show higher correlation with alternatives than the correlations between the methods that perform poorly. Indeed, Fig 4 shows that the other methods tend to correlate better with malacoda than with each other. This occurs despite the expected non-linear relationship between regularized and unregularized models (i.e. between malacoda and the other alternatives). The fits based on malacoda’s marginal and conditional priors (first and second rows/columns) in both panels of Fig 4 tend to correlate strongly because of the identical model structure paired with the large spread of DeepSea predictions used in the prior estimation process. The conditional prior fit only deviates significantly from the marginal prior fit for variants with high DeepSea predictions.

Fig 4 — A) A pairwise plot of TS estimate comparisons between methods in our primary MPRA dataset, showing that alternative methods generally agree with malacoda more than each other. Shaded values above the diagonal show the correlation values for the corresponding plot below the diagonal. Color below the diagonal indicates local density of points in over-plotted regions. B) A pairwise plot of TS estimates using both the marginal and DeepSea-based malacoda priors in the Ulirsch dataset, showing a similar outcome.

Biological results

The variants we tested with luciferase reporter assays were predominantly chosen from the set where malacoda’s marginal and conditional fits disagreed on functionality, not those variants showing the strongest effects. These discordant variants tended to have small effects and the noise between replicates tended to be comparable to the mean intensity ratio. Therefore, the number of variants tested was not enough to overcome the noise inherent to light intensity-based measurements and provide conclusive results on the accuracy of the various MPRA analysis methods. While we were able to recapitulate the transcriptional functionality of several variants, we did not have enough data to clearly demonstrate that any of the MPRA analysis methods outperform the others in terms of correlation with luciferase results. Nonetheless, S2 Fig shows that the various methods are consistent with MPRA-based estimates for variants with large shifts, providing further evidence that MPRA results are biologically realistic.

We closely inspected a particular biological discovery to demonstrate malacoda’s ability to identify low-signal variants. One of the functional variants we identified with malacoda using the DeepSea-based conditional prior in the Ulirsch dataset [5] is rs11865131; this variant is identified by malacoda but not by any of the other methods after multiple testing corrections or with the marginal prior. The conditional prior is compared to the marginal prior in S1 Fig. We validated that this variant is functional by luciferase assay in K562 cells, with the results shown in Fig 5. The variant rs11865131 is in an intron within the NPRL3 gene, which encodes the Natriuretic Peptide Receptor Like 3 protein. NPRL3 is part of the GTP-ase activating protein activity toward Rags [22] (GATOR1) complex. The GATOR1 complex inhibits mammalian target of rapamycin (MTOR) by inhibiting RRAGA function (reviewed in [22]). MTOR signaling has been implicated in platelet aggregation and spreading in addition to aging associated venous thrombosis [23, 24]. Analysis of the rs11865131 locus with HaploReg [25] indicates that it colocalizes with ENCODE ChIP-Seq peaks for 36 bound proteins (predominantly transcription factors) in K562 erythroleukemia cells and that it contains enhancer histone epigenetic marks. Furthermore, this variant lies roughly one thousand base pairs away from the nearest exon-intron boundary, suggesting that it is unlikely to alter splicing of the NPRL3 transcript. Together, these data indicate that this is likely an important regulatory region. In addition to the heterologous K562 cell line, data from cultured megakaryocytes indicates that rs11865131 lies within RUNX1 and SCL ChIP-Seq peaks, two well-studied megakaryopoietic transcription factors [26]. This agrees with our data that platelet NPRL3 mRNA is positively associated with platelet count in healthy humans [27, 28].
These data indicate that malacoda has identified a likely important regulatory region for megakaryocytes and platelets that was missed by other MPRA analysis methods.

Fig 5 — A bar plot showing the difference in normalized luciferase intensity for both alleles of rs11865131 (p = 0.032). Black error bars indicate +/- one standard deviation.

MCMC can be computationally expensive, so we measured the run times in our study. The computational performance was first evaluated using the default settings of the malacoda package, which are set to strike a balance between speed and precision for exploratory analysis. These settings include the variational first pass, 200 warmup samples, four chains yielding a total of 2000 posterior samples, and adaptively increased chain lengths. This initial analysis run of 8251 variants from the Ulirsch dataset took 29 minutes when parallelized across 18 threads on two Intel Xeon X5675 3.07GHz processors. We compared this to a highly precise analysis run on the same dataset with no variational first pass and excessively long 50,000 iteration MCMC chains for all variants, which took fifteen hours with the same number of threads on the same processors. The correlation of posterior mean TS between these two runs was 0.981 for non-functional variants and 0.99996 for functional variants. This result, together with the MCMC diagnostics shown in section 2.4.2 of S1 Appendix, demonstrates that the sampler used by our software is able to produce accurate estimates in a relatively short amount of time. Details of the computational methodology and results demonstrating convergence are presented in section 2.4 of S1 Appendix.
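
As a rough illustration of the comparison above, the following Python sketch computes the speed-up implied by the two reported run times and shows how a correlation between the posterior mean TS values of two runs could be computed. The TS values are made-up placeholders rather than data from the study, and the helper `pearson` is a hypothetical name that is not part of the malacoda package.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Run times reported in the text, converted to minutes.
fast_minutes = 29
long_minutes = 15 * 60
speedup = long_minutes / fast_minutes  # roughly 31-fold

# Hypothetical posterior mean TS values for five variants (illustration only).
ts_default = [0.02, -0.10, 0.85, -1.20, 0.40]
ts_long = [0.03, -0.08, 0.86, -1.19, 0.41]
r = pearson(ts_default, ts_long)
```

In practice the two vectors would hold one posterior mean per variant from each run, and the correlation could be computed separately for functional and non-functional variants as in the text.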

Discussion

We developed a fully Bayesian framework for the analysis of NGS high-throughput screens with particular focus on MPRA studies. The method, called malacoda, is an advance in statistical and computational science that probabilistically incorporates all known sources of variation for these high throughput NGS screens. The method does a better job of identifying true positives in simulated data and performs well in empirical studies. We also showed that the method identified a previously overlooked functional variant in the NPRL3 gene that has confirmatory evidence from a variety of other studies. Particular advantages of the method are accurate estimation of variant effects, the treatment of the dispersion parameter in both estimation and inference, and the potential to incorporate informative prior information.

The functional discovery of the variant rs11865131 represents a demonstration of the power of the malacoda method to identify biologically important results missed by alternative methods. This variant lies in an intronic region of the gene NPRL3, meaning approaches focused on alterations to the gene’s protein code would overlook this regulatory variant. Multiple lines of evidence point to the biological relevance of this variant, including epigenetic and transcription factor binding data as well as evidence of association with platelet count in healthy humans.

There are downsides to our method. First, Bayesian methods that estimate a joint posterior on many parameters by MCMC are significantly slower than optimization-based approaches. We took several approaches to mitigate this, utilizing Stan’s No-U-Turn Sampler and including options for first pass variational approximations, adaptive MCMC length, and parallelization. Together these features enable relatively fast model fitting. Second, our method does not account for uncertainty in our empirical prior estimation procedure [16]. Our R package includes a fully hierarchical model that adds an additional layer of hyperparameters in order to probabilistically model the gamma prior parameters at the same time as all of the variant-level parameters. This provides a joint posterior that models an entire MPRA dataset with a single MCMC fit. However, this model, featuring hundreds of thousands of parameters when used in the context of a typical MPRA, is presently too complex to fit in practice and was not used for the results presented in this work. Finally, our work is limited to MPRA performed in K562 cells; however, there is nothing cell-type specific about the malacoda model. Our method can be used in MPRA performed in alternative cell types so long as they follow the experimental structure outlined in the Methods section.

It is worthwhile to discuss the most effective ways to utilize external annotations to estimate informative empirical priors. We encourage users to utilize information that was originally used in the assay’s variant selection process. For example, assays designed around inspecting specific transcription factors with varying biological context may want to use the targeted transcription factor as the group identifier in a grouped prior as in Fig 2C. Using information independent of the original design can also be helpful, as we have demonstrated through the use of a conditional prior based on DeepSea’s K562 DNase hypersensitivity predictions which helped to refine the inference on a low-signal variant, rs11865131. The malacoda package can utilize an arbitrary number of continuous annotations, so any set of relevant, independent annotations may be used. As long as the principle of “similarly annotated variants have similar outcomes in the assay” holds, using informative annotations can help refine the analysis. Nonetheless, it is difficult to accurately predict the transcription shift of a single variant a priori. Conditional priors that make strong predictions of functionality should be treated with caution. We encourage the users to utilize the prior visualization functionality included in the package to contrast annotation-based priors against a marginal prior. Future advances in machine learning models for predictive variant annotation will likely improve the performance of the informative empirical priors.
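
To make the idea that “similarly annotated variants have similar outcomes in the assay” concrete, the following Python sketch contrasts a marginal gamma prior with an annotation-conditioned one. It is a minimal sketch under stated assumptions: it uses a Gaussian kernel on a single continuous annotation and a weighted method-of-moments gamma fit, which may differ from the estimation procedure actually implemented in malacoda. The function names, the bandwidth, and all data values are hypothetical.

```python
import math

def gamma_moments(values, weights=None):
    """Weighted method-of-moments gamma fit; returns (shape, rate)."""
    if weights is None:
        weights = [1.0] * len(values)
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean ** 2 / var, mean / var

def kernel_weights(annotations, target, bandwidth):
    """Gaussian kernel weights: variants annotated like `target` count more."""
    return [math.exp(-0.5 * ((a - target) / bandwidth) ** 2) for a in annotations]

# Hypothetical per-variant activity estimates with one continuous annotation each.
activity = [0.8, 1.0, 1.2, 0.9, 2.9, 3.1, 3.4, 3.0]
annotation = [0.10, 0.15, 0.12, 0.20, 0.80, 0.85, 0.90, 0.82]

# Marginal prior: every variant contributes equally.
marginal = gamma_moments(activity)

# Conditional prior for a variant whose annotation value is 0.85.
w = kernel_weights(annotation, target=0.85, bandwidth=0.1)
conditional = gamma_moments(activity, w)

# The mean of a gamma distribution is shape / rate.
marginal_mean = marginal[0] / marginal[1]
conditional_mean = conditional[0] / conditional[1]
```

With these placeholder values the conditional prior centres on the activity of the similarly annotated variants, while the marginal prior averages over all of them. Plotting the two priors side by side, as the package’s prior visualization is described as doing, is the natural check on whether the annotation is making an overly strong prediction.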

It is desirable to identify an orthogonal gold-standard dataset to differentiate the accuracy of MPRA analysis approaches. Such an analysis would define an independent score of functionality for all variants, and then hits and non-hits from each MPRA analysis method could be compared for their concordance or correlation with this independent score. We attempted such an analysis using the Ulirsch dataset, ENCODE K562 bound protein levels, and DeepSea DNase hypersensitivity annotations. Unfortunately, these analyses were inconclusive, showing no clear difference in annotation scores between analysis methods. There are at least two possible explanations for this difficulty. First, the noise present in both the MPRA and annotation data lowers the power to differentiate the methods. Second, there is misalignment between MPRA functionality and differential scoring in the annotation data. Both of these factors likely contribute to the negative result. We postulate that if there were an idealized dataset showing high correspondence between variants that are potentially functionalizable by MPRA and variants that are differentially scored in the orthogonal annotation data, then this hypothetical dataset could be used to compare the efficacy of the various MPRA analysis methods. At present, we know of no data source that would meet these requirements. While this limits our ability to quantify the performance of MPRA analysis methods, it speaks to the value of MPRA themselves. MPRA produce a unique biological signal that cannot be easily measured by other types of experiments or data.
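
The kind of comparison described above can be sketched in a few lines of Python. This is an illustrative outline only: the scores, hit calls, and method names are invented placeholders, and the summary statistic (difference in mean independent score between hits and non-hits) is one simple assumed choice rather than the analysis performed in this work.

```python
def mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)

def hit_score_gap(scores, is_hit):
    """Mean independent score among hits minus the mean among non-hits."""
    hits = [s for s, h in zip(scores, is_hit) if h]
    non_hits = [s for s, h in zip(scores, is_hit) if not h]
    return mean(hits) - mean(non_hits)

# Hypothetical independent functionality scores for six variants.
scores = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]

# Hypothetical hit calls from two analysis methods (names are placeholders).
calls = {
    "method_a": [True, True, False, False, True, False],
    "method_b": [True, False, True, False, False, True],
}

# A larger gap means a method's hits are better separated by the independent score.
gaps = {name: hit_score_gap(scores, hit) for name, hit in calls.items()}
```

When the independent score is noisy or only loosely aligned with MPRA functionality, the gaps for all methods shrink toward zero and become hard to distinguish, which matches the inconclusive outcome reported above.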

The statistical method and validation work presented in this article open many future directions in the statistical analysis of high-throughput sequencing assays. This article has focused primarily on the analysis of “typical” MPRA: two alleles per variant, in a single tissue type, with no other experimental perturbations. However, we have expanded the modelling capabilities of the software package beyond these limitations. Models tailored to more complicated experimental structures, such as arbitrary numbers of alleles per variant, multiple tissue types, or cell-culture perturbations, are also included with the package. We have also expanded the model framework included in the package to CRISPR screen modelling. In this CRISPR model, the counts of gRNAs targeting specific genes in survival/dropout screens can make use of an analogous negative binomial structure with similar empirical gamma priors. This opens the path to incorporating gene-level annotations into Bayesian CRISPR screen analysis.

Sophisticated high-throughput assays are a central component to the future of genomics. Therefore, the statistical methods used for these data should be as efficient as possible, accounting for all sources of variation and quantifying the resulting uncertainty. Our software, malacoda, provides an end-to-end framework for the probabilistic analysis of MPRA data. Through our well-documented, easy-to-use R package, users can perform sequencing error correction and data pre-processing before executing a fully Bayesian analysis in as little as two lines of code. The method is capable of taking advantage of informative annotations through an adaptive empirical prior estimation. We hope that this work may act as a stepping stone towards further integrative, probabilistic analysis in the field of high-throughput genomics.

Supporting information

S1 Appendix. Model description, fitting, and diagnostics.

(PDF)

S2 Appendix. Negative Binomial variance estimation.

(PDF)

S3 Appendix. Simulation details and extended results.

(PDF)

S1 Data. RData file of luciferase and primary MPRA results.

An RData file that loads two objects: luc_results, a table of the luciferase results, and mpra_results, giving the primary data on MPRA counts for the variants tested with luciferase.

(RDATA)

S2 Data. RData file of estimate comparisons and primary MPRA data.

An RData file that contains three data frames: ulirsch_comparisons, primary_comparisons, and primary_mpra_data. The first two data frames are the data necessary to produce Fig 4. Each row corresponds to one variant, and each column corresponds to a given analysis method. The values in the table give the transcription shift estimates. The third data frame gives the barcode counts from our primary MPRA dataset with anonymized variant identifiers.

(RDATA)

S1 Fig. Prior comparison plot for rs11865131.

This figure compares the allelic priors for the RNA activity for both alleles of rs11865131. The blue line shows the marginal prior, the red line the conditional prior based on the DeepSea K562 DNase hypersensitivity prediction. Dotted lines show the prior means. Black tick marks show the RNA count observations adjusted for sequencing depth and DNA input. Because this variant tended to show higher than usual activity in both alleles, both priors shrink the activity considerably. Notably however, the conditional prior shrinks less than the marginal, particularly in the reference allele. The allele-specific difference in shrinkage is what allowed the conditional prior-based analysis to identify this variant as functional.

(TIF)

S2 Fig. Luciferase versus MPRA estimates by method.

A scatterplot showing the relationship between luciferase-based estimates of TS and MPRA-based estimates from each MPRA analysis method.

(TIF)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

CS, LE - R01HL128234, National Institutes of Health, https://www.nih.gov/ The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Auton A, Abecasis GR, Altshuler DM, Durbin RM, Bentley DR, Chakravarti A, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi:10.1038/nature15393
2. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009;106(23):9362–7. doi:10.1073/pnas.0903103106
3. Nishizaki SS, Boyle AP. Mining the Unknown: Assigning Function to Noncoding Single Nucleotide Polymorphisms. Trends Genet. 2017;33(1):34–45. doi:10.1016/j.tig.2016.10.008
4. Melnikov A, Murugan A, Zhang X, Tesileanu T, Wang L, Rogov P, et al. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat Biotechnol. 2012;30(3):271–7. doi:10.1038/nbt.2137
5. Ulirsch JC, Nandakumar SK, Wang L, Giani FC, Zhang X, Rogov P, et al. Systematic functional dissection of common genetic variation affecting red blood cell traits. Cell. 2016;165(6):1530–45. doi:10.1016/j.cell.2016.04.048
6. Tewhey R, Kotliar D, Park DS, Liu B, Winnicki S, Reilly SK, et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell. 2016;165(6):1519–29. doi:10.1016/j.cell.2016.04.027
7. Shen SQ, Myers CA, Hughes AEO, Byrne LC, Flannery JG, Corbo JC. Massively parallel cis-regulatory analysis in the mammalian central nervous system. Genome Res. 2016;26(2):238–55. doi:10.1101/gr.193789.115
8. Myint L, Avramopoulos DG, Goff LA, Hansen KD. Linear models enable powerful differential activity analysis in massively parallel reporter assays. BMC Genomics. 2019;20(1):1–19. doi:10.1186/s12864-018-5379-1
9. Niroula A, Ajore R, Nilsson B. MPRAscore: robust and non-parametric analysis of massively parallel reporter assays. Bioinformatics. 2019;(July):1–3. doi:10.1093/bioinformatics/btz591
10. Kalita CA, Moyerbrailean GA, Brown C, Wen X, Luca F, Pique-Regi R. QuASAR-MPRA: Accurate allele-specific analysis for massively parallel reporter assays. Bioinformatics. 2018;34(5):787–94. doi:10.1093/bioinformatics/btx598
11. Ashuach T, Fischer DS, Kreimer A, Ahituv N, Theis FJ, Yosef N. MPRAnalyze: Statistical framework for massively parallel reporter assays. Genome Biol. 2019;20(1):1–17. doi:10.1186/s13059-018-1612-0
12. Consortium EP. An integrated encyclopedia of DNA elements in the human genome. Nature. 2013;489(7414):57–74. doi:10.1038/nature11247
13. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. (DeepSea). Nat Methods. 2015;12(10):931–4. doi:10.1038/nmeth.3547
14. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi:10.1186/s13059-014-0550-8
15. Kruschke J. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. 2nd ed. London: Academic Press; 2015. p. 336–40.
16. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd ed. Boca Raton, FL: CRC Press; 2013. p. 51–6, p. 102–4.
17. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A. Stan: A probabilistic programming language. J Stat Softw. 2017;76(1). doi:10.18637/jss.v076.i01
18. Kucukelbir A, Blei DM, Gelman A, Ranganath R, Tran D. Automatic Differentiation Variational Inference. J Mach Learn Res. 2017;18:1–45. Available from: https://arxiv.org/abs/1603.00788
19. Assaf G, Hannon GJ. FASTX-Toolkit. 2010. Available from: http://hannonlab.cshl.edu/fastx_toolkit/index.html
20. Hawkins JA, Jones SK, Finkelstein IJ, Press WH. Indel-correcting DNA barcodes for high-throughput sequencing. Proc Natl Acad Sci. 2018;115(27):E6217–26. doi:10.1073/pnas.1802640115
21. Ghazi AR, Chen ES, Henke DM, Madan N, Edelstein LC, Shaw CA. Design tools for MPRA experiments. Bioinformatics. 2018;34(15):2682–3. doi:10.1093/bioinformatics/bty150
22. Shaw RJ. GATORs take a bite out of mTOR. Science. 2013;340(6136):1056–7. doi:10.1126/science.1240315
23. Aslan JE, Tormoen GW, Loren CP, Pang J, McCarty OJT. S6K1 and mTOR regulate Rac1-driven platelet activation and aggregation. Blood. 2011;118(11):3129–36. doi:10.1182/blood-2011-02-331579
24. Yang J, Zhou X, Fan X, Xiao M, Yang D, Liang B, et al. MTORC1 promotes aging-related venous thrombosis in mice via elevation of platelet volume and activation. Blood. 2016;128(5):615–24. doi:10.1182/blood-2015-10-672964
25. Ward LD, Kellis M. HaploReg v4: Systematic mining of putative causal variants, cell types, regulators and target genes for human complex traits and disease. Nucleic Acids Res. 2016;44(D1):D877–81. doi:10.1093/nar/gkv1340
26. Chacon D, Beck D, Perera D, Wong JWH, Pimanda JE. BloodChIP: A database of comparative genome-wide transcription factor binding profiles in human blood cells. Nucleic Acids Res. 2014;42(D1):172–7. doi:10.1093/nar/gkt1036
27. Simon LM, Edelstein LC, Nagalla S, Woodley AB, Chen ES, Kong X, et al. Human platelet microRNA-mRNA networks associated with age and gender revealed by integrated plateletomics. Blood. 2014;123(16):37–45. doi:10.1182/blood-2013-12-544692
28. Edelstein LC, Simon LM, Montoya RT, Holinstat M, Chen ES, Bergeron A, et al. Racial differences in human platelet PAR4 reactivity reflect expression of PCTP and miR-376c. Nat Med. 2013;19(12):1609–16. doi:10.1038/nm.3385

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007504.r001

Decision Letter 0

Thomas Lengauer, Jian Ma

22 Dec 2019

Dear Dr. Shaw,

Thank you very much for submitting your manuscript 'Bayesian modelling of high-throughput sequencing assays with malacoda' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers thought that the method could be potentially useful but raised some substantial concerns about the manuscript as it currently stands. In particular, the details of the method description and evaluation approach should be further clarified. The data used in the study should be made available. It would also be important to put this work in the context of existing literature and highlight the advantages and limitations. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.

(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible, to show clearly where changes have been made to their manuscript, e.g. by highlighting text.

(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).

- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.

- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here.

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Jian Ma

Associate Editor

PLOS Computational Biology

Thomas Lengauer

Methods Editor

PLOS Computational Biology

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Summary:

This manuscript presents malacoda, a Bayesian method for identifying alleles with significantly different abilities to activate gene expression in an MPRA. malacoda accounts for variation due to unknown DNA concentrations and allows for users to incorporate prior information about variants, properties that existing methods do not have. In addition, unlike previous methods, it does not summarize the data before modeling. The manuscript reports results from simulations that show that malacoda is able to detect functional variants with higher accuracy than previous methods across multiple numbers of bar-codes per allele, array sizes, and fractions of functional variants. The manuscript also shows that malacoda tends to agree more with other methods on real MPRA data than the other methods agree with each other. In addition, the manuscript presents luciferase assays that validate malacoda’s results for sixteen variants, including multiple functional variants that were not identified as functional by other methods. Overall, I think that malacoda is a useful tool for using MPRA data to determine if non-coding variants are functional that is likely to be widely used because it is more accurate than existing methods and a thorough description of how it works as well as code needed to run it have been made publicly available.

Major Comments:

1. Figure 3A makes me concerned that malacoda’s precision is not very high because a large percentage of the simulated variants with transcription shifts greater than zero seem to be non-functional. Figure 3D enhances this concern because many of the luciferase estimates near zero seem to correspond to MPRA estimates that are further from zero. The manuscript would be more convincing if the authors could show that malacoda had high precision or at least substantially higher precision than other methods. In addition, since Figures 3A and 3D do suggest that malacoda does not report any non-functional variants to have transcription shift > 1.5, including the precision at different transcription shift cutoffs might also be helpful.

2. The paper mentions primary MPRA data, but this data is never described in detail. If the data is new to this paper, then a description of the data, including how the tested sequences were selected, what MPRA protocol was used, and how the plasmids were constructed, needs to be added. If the data was taken from another publication, then that publication needs to be cited.

3. The supplemental websites are extremely helpful. They provide many simulations that give readers intuition that is helpful in understanding the modeling decisions in this paper. They also help readers think about how much data would be needed to obtain good parameter estimates for the models in this paper. In addition, code is provided to generate the figures along with the packages needed to run the code, which enables readers to reproduce the figures and modify them in order to further improve their intuition.

Minor Comments:

Introduction:

1. The value of malacoda is not that it is Bayesian but that it does not have the drawbacks of previous methods – ignoring variation due to unknown DNA concentrations, summarizing the data prior to modeling, and not accounting for relevant prior information. I would re-structure the final paragraph of the introduction to emphasize its strengths over existing methods instead of its Bayesian nature.

Methods:

1. In line 144, the second dot should be a comma.

2. In lines 144-145, the reasons behind the definitions of the means for the negative binomial distribution were clear to me, but I am not sure if they would be clear to someone who has not thought about MPRAs before. It therefore might be useful to add a sentence explaining the reasons.

3. The definition of “95% highest density interval on TS” in line 171 was not clear to me; it would be great if this could be clarified.

4. The reason that the method for learning the gamma priors removes the need for post-hoc multiple testing correction, as claimed in line 193, was not clear to me. My understanding is that, even though all of the variants in the MPRA are used to learn the gamma priors, the effect of each variant on transcription is still tested separately, which would mean that multiple testing correction should still be required.

5. If an existing software package was used for model fitting, then the package and settings that were used should be added.

6. In the section on Simulation and Validation Studies, it would be great if a complete list of the parameters used in the simulations could be provided (providing them as a supplemental table would be sufficient). If the supplemental websites contain every parameter used in the simulations, then directing the reader to the appropriate supplemental website should be sufficient.

7. It would be great if the authors could also compare area under the precision-recall curve because there are probably more non-functional variants than functional variants, so high sensitivity and specificity does not guarantee high precision.

8. In the section on Experimental Procedures, it would be helpful to add a description of the full experimental procedure for the new MPRA in this manuscript.

9. In the section on Experimental Procedures, it would be helpful to direct readers to the supplemental file containing the variants tested in luciferase assays.
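
The concern raised in comment 7 above, that high sensitivity and specificity do not guarantee high precision when functional variants are rare, can be illustrated with a small worked example in Python. The sensitivity, specificity, and prevalence values are arbitrary assumptions chosen for illustration and are not taken from the manuscript or its simulations.

```python
def precision(sensitivity, specificity, prevalence):
    """Expected precision (positive predictive value) of a classifier."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assumed operating point: 90% sensitivity and 95% specificity.
rare = precision(0.90, 0.95, prevalence=0.03)      # 3% of variants functional
balanced = precision(0.90, 0.95, prevalence=0.50)  # half of variants functional
```

At the same operating point, precision is far lower when only a small fraction of variants is functional, which is why a precision-recall summary is informative in addition to sensitivity and specificity.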

Results:

1. At the beginning of the section on Biological Results, it would be helpful to include an explanation of why the number of luciferase reporter assays was not enough to overcome the amount of noise inherent to light intensity-based measurements or provide a citation of a paper that explains this.

2. Lines 294-296 state that the results from the luciferase assay are consistent with the MPRA estimates, but this seems to occur for only the largest transcription shifts. I think that this should be modified to say that the results are consistent for large transcription shifts.

3. Figure 4 suggests that evaluations were done using DeepSea-based priors, but there is no description of these results other than that the effect of rs11865131 was found using the DeepSea based priors. Figure 4B suggests that the results using the DeepSea-based priors were highly correlated with those without the priors. It would be helpful to add a summary of these results to the Results section.

4. The results provide evidence that rs11865131 is functional because it affects the activity of an intronic enhancer. The results about this variant would be more compelling if the authors added that this variant is not close to any exon-intron boundary, so it is unlikely to also affect splicing.

5. It would be helpful to add a description of how the analysis of rs11865131 overlap with ChIP-seq peaks was done (For example, were all peaks or only reproducible peaks used? When there were datasets from multiple labs for a TF or histone modification, which dataset was used?) and what the TFs and histone modifications are whose peaks overlap it.

6. If a luciferase assay was done for rs11865131, it would be helpful to add a description of the assay’s results in the Results section in addition to having them in a supplemental file.

Discussion:

1. The paper would be easier to follow if the description of how the models were fit was moved to the Methods section.

2. The phrase “seemingly worthwhile” in line 334 should be defined.

3. Line 342 mentions “an additional layer of hyper-parameters.” If these hyper-parameters were used to obtain the results in this paper, then a description of them should be added to the Methods section. If not, then it would be helpful to clarify that they are an option that the user can add but were not used to obtain this paper’s results.

4. It is not clear how much the DeepSea priors helped for the MPRAs described in this study. It would be helpful to add a section to the Discussion that provides some guidance to users about when priors are likely to be helpful.

Figures:

1. According to Figure 3A, the transcription shift for many non-functional variants is higher than the transcription shift for functional variants. It may appear to be this way because the difference between the density of points at the origin and on other parts of the y-axis is not viewable by eye. Modifying this figure to show the difference in density would be helpful. One option is to use smaller points. Alternatively, another part of the figure could be made that zooms in on the y-axis.

2. It would be helpful to add a panel to Figure 3 that illustrates how the results from malacoda are used to determine which variants are “functional.”

3. It would be helpful to add precision recall curves to Figure 3 to help the reader understand how frequently variants that are called positives are false positives.

4. It would be helpful to make the x- and y-axes the same scale in Figure 3D or add a line for y = x.

5. It would be helpful to add an explanation of the plots on the diagonals of the plot tables in Figure 4.

Supplement:

1. In the Description of point estimates of variance parameters section, it would be helpful to add a description of how the curve between maximum log-likelihood estimates of means and variances is fit.

2. In the Simulating variance parameter point estimates section, it would be helpful to add an explanation of why the chosen values are representative of the data from an MPRA or provide a citation of a paper that has such an explanation.

3. In the Simulating variance parameter point estimates section, it would be helpful to add an explanation of why the maximum likelihood estimates are systematically lower than the true value.

4. The NB likelihood surfaces section claims that the simulated draws described on the supplemental website “usually” allow for the true value. It was not clear to me why they do not always allow for the true value; in other words, I know that the true value will not always be drawn, but I am not sure why drawing the true value is not always possible.

5. In the Effect on mean estimation section, the dotted and orange lines on the x-axis were confusing to me. I would recommend removing them.

Reviewer #2: The authors presented a method and an R package called malacoda to model MPRA read count data with the negative binomial distribution. The main contribution lies in modeling the mean and dispersion parameters as variables sampled from unknown gamma distributions. The paper is well written in general, but some technical details are missing. Major comments are below.

1. The authors should describe the Bayesian model in much more detail than it is now in the manuscript.

a. On page 7, lines 144 and 145, the authors outlined the two generative NB models for DNA and RNA counts. They then use Figure 2B to depict the hyperparameters of the gamma distributions that are used to model depth_s \mu_{bc} and \varphi_{DNA} for Counts_DNA. Similarly, another gamma distribution is used to model the depth_s \mu_{bc} \mu_{allele} and \varphi_{RNA} variables.

b. Even the generative model itself is not fully described in the main text. What is the distribution of \mu_{bc}? What is the distribution of \varphi_{RNA}? If some of them are gamma distributed, what are the hyperparameters of these gamma distributions? Are they fixed or also estimated? No detail is described here whatsoever.

c. The details of how the model is inferred are completely omitted. I understand that the authors use the Stan library as an out-of-the-box optimizer to do the model fitting. But that does not mean that there is no need to describe the necessary details of the Bayesian inference.

d. In particular, given that \mu_{bc} is unknown and appears in both the DNA and RNA NegBin models, how does the coordinate ascent work in the variational Bayesian approach, and how is \mu_{bc} sampled in no-U-turn Hamiltonian Monte Carlo sampling?

e. If variational Bayes is used, what is the proposed distribution?

f. If Hamiltonian Monte Carlo sampling is used, how many leapfrog steps are taken? What is the step size? How are the parameters sampled: together (not tractable), or separately, and in what order?

2. In Figure 3D, where authors used their in-house data to show malacoda, what’s the correlation across the 4 methods? It looks like malacoda is not better than the other 3 methods.

3. For rs11865131, in page 14 line 299, what’s the prediction scores from the other methods? Why is the variant not identified by those methods but by malacoda?

4. Page 14 line 306: How many such variants (i.e., rs11865131) are exclusively identified by malacoda but not by other methods? Can you count the number of ENCODE ChIP-seq peaks with which each of the top 100 variants co-localizes and compare these counts with those of the other 3 methods?

5. The authors say their model takes 50,000 MCMC samples; does the model converge in the end? Please show a plot of the joint posterior or Hamiltonian energy (since the authors are using an HMC sampler) as a function of iterations.

6. Regarding more advanced priors, the authors should be aware that there are more advanced supervised learning methods that are trained to predict MPRA signals using sequence features and epigenomic features. See this paper:

a. Li, Y., Shi, A., Tewhey, R., Sabeti, P., Ernst, J., & Kellis, M. (2017). Genome-wide regulatory model from MPRA data predicts functional regions, eQTLs, and GWAS hits. bioRxiv. http://doi.org/10.1101/110171
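The model structure summarized in comment 1a above can be written down compactly. The following is a minimal sketch, not the malacoda implementation: it assumes the mean/dispersion parameterization of the negative binomial (variance = mu + mu^2/phi) and independent gamma priors on each parameter, and the function names, the `prior` dictionary, and all numeric values are hypothetical choices made for illustration.

```python
import math

def nb_logpmf(k, mu, phi):
    """Log-probability of count k under a negative binomial with mean mu
    and dispersion phi (variance = mu + mu**2 / phi)."""
    return (math.lgamma(k + phi) - math.lgamma(phi) - math.lgamma(k + 1)
            + phi * math.log(phi / (mu + phi))
            + k * math.log(mu / (mu + phi)))

def gamma_logpdf(x, shape, rate):
    """Log-density of a gamma distribution in the shape/rate parameterization."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)

def barcode_log_joint(dna, rna, dna_depth, rna_depth,
                      mu_bc, mu_allele, phi_dna, phi_rna, prior):
    """Unnormalized log-posterior for one barcode of one allele.

    DNA counts: NB(depth_s * mu_bc, phi_dna)
    RNA counts: NB(depth_s * mu_bc * mu_allele, phi_rna)
    prior maps each parameter name to a (shape, rate) pair (assumed values).
    """
    params = {"mu_bc": mu_bc, "mu_allele": mu_allele,
              "phi_dna": phi_dna, "phi_rna": phi_rna}
    # Gamma prior contribution of every parameter.
    lp = sum(gamma_logpdf(value, *prior[name]) for name, value in params.items())
    # Likelihood of the DNA counts, one term per DNA sample.
    for count, depth in zip(dna, dna_depth):
        lp += nb_logpmf(count, depth * mu_bc, phi_dna)
    # Likelihood of the RNA counts; mu_bc is shared with the DNA model.
    for count, depth in zip(rna, rna_depth):
        lp += nb_logpmf(count, depth * mu_bc * mu_allele, phi_rna)
    return lp

# Hypothetical hyperparameters and counts, for illustration only.
prior = {"mu_bc": (2.0, 0.02), "mu_allele": (2.0, 2.0),
         "phi_dna": (2.0, 0.5), "phi_rna": (2.0, 0.5)}
example = barcode_log_joint(dna=[120, 95], rna=[210, 260, 180],
                            dna_depth=[1.1, 0.9], rna_depth=[1.0, 1.2, 0.8],
                            mu_bc=100.0, mu_allele=2.0,
                            phi_dna=5.0, phi_rna=5.0, prior=prior)
```

Because \mu_{bc} enters both likelihood terms, a gradient-based sampler or optimizer would update it using information from the DNA and RNA counts jointly, which is the coupling that comment 1d asks about.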

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. 2020 Jul 21;16(7):e1007504. doi: 10.1371/journal.pcbi.1007504.r002

Author response to Decision Letter 0

21 Feb 2020

Attachment

Submitted filename: malacoda_resubmission_letter.docx

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007504.r003

Decision Letter 1

Jian Ma

18 Mar 2020

Dear Dr. Shaw,

Thank you very much for submitting your manuscript "Bayesian modelling of high-throughput sequencing assays with malacoda" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' additional comments. In particular, one of the reviewers thought that the revision has not adequately addressed the concerns.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Jian Ma

Deputy Editor

PLOS Computational Biology

Thomas Lengauer

Methods Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I am impressed with the additional work that the authors have done to incorporate all of the reviewers’ feedback. The new version of the manuscript and the new appendix provide many helpful details that are necessary for fully understanding the presented computational and experimental work. In addition, the more comprehensive comparisons to previous methods and the new luciferase assay make me more convinced that malacoda is more effective than the previous methods described here at identifying alleles with differential ability to activate gene expression.

I have one new major comment:

1. I recently became aware of two relevant papers that the authors did not cite: Ashuach et al., Genome Biology, 2019 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1787-z) and Kalita et al., Bioinformatics, 2018 (https://academic.oup.com/bioinformatics/article/34/5/787/4209990). It would be great if the authors could add an explicit explanation of the key advantages of malacoda over the methods described in these papers or the key differences between the types of problems that malacoda and these methods are capable of solving. If these methods are solving the same problem as malacoda, it would be great if the authors could add comparisons to these methods.

I have a few new minor comments:

Introduction:

1. In line 126, I would replace “wet bench” with “luciferase.”

Appendices:

1. The additional details in Appendix S1 are extremely informative for understanding exactly how malacoda works and how to run different parts of the method. malacoda seems to require the selection of multiple settings, including the number of warm-up samples, the number of samples per chain, and the total number of samples for MCMC. It would be great if the authors could add a description of how the recommended settings were selected.

2. Appendix S2 provides a helpful explanation of why having a prior on φ in the negative binomial distribution has the potential to be beneficial, but it was not clear which part of example 6, if any, was illustrating malacoda’s prior on φ. It would be great if this could be clarified.

Reviewer #2: - It would have been great if the authors had described what they did to address my comments, as opposed to just pointing me to somewhere in the document, especially since the changes are not highlighted relative to the original draft.

- For example, in response to my comment 1 on the distributions of the model, the authors pointed me to lines 174-179, where no distribution is specified; rather, lines 151-156 have some added distributions. For the rest of my comments, the authors point me to “Section 1 of the new S1 Appendix”, expecting me to find the answers myself.

- For the variational Bayesian comments, the authors said “Stan’s variational interface with the R function rstan::vb(), …, former of these automatically transforms the parameters to the space of real numbers before using a Gaussian variational approximation.” How does the Gaussian approximation work on the negative binomial? It may seem that my comments are too harsh. However, because this is a technical methodology paper, the main contribution of this paper is, to me, this detailed modeling.

- If the authors go through this round of the review, I would like to see in their response the model details instead of pointing me to somewhere else in the documents.

- For my comment 2, the correlation for malacoda is worse than the other two competitors (MPRAscore and even simple t-test). Authors said that this is not practically significant. But I thought the reason the authors show this scatter plot is to demonstrate that their method is better than other methods. Authors also said that “the assays shown were not selected to be representative of the most strongly functional variants”. But why not select the “representative of the most strongly functional variants” to test?

- On my comment 4, the authors show a table comparing the consistency between the top 100 variants predicted by each method and ChIP-seq peaks. It seems MPRAscore once again does better than malacoda. Even the t-test has a lower false positive rate and a comparable true positive rate. This does not seem to support malacoda as the method of choice.

- Comment 5 on 50,000 MCMC samples, instead of pointing to appendix, please add the plots in your response *if* the authors pass through this round.
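The author response quoted above says that the parameters are transformed to the space of real numbers before a Gaussian variational approximation is applied. The sketch below illustrates that idea only; it assumes a log transform for a positive parameter such as a dispersion, and the function names and numeric values are hypothetical rather than taken from malacoda or Stan. The point is that the Gaussian is placed over the transformed parameter, not over the negative binomial counts, so the implied approximation on the original scale is log-normal and every draw is positive.

```python
import math
import random

def sample_positive_parameter(mean_log, sd_log, n_draws, seed=0):
    """Draw eta ~ Normal(mean_log, sd_log) on the unconstrained scale and
    map each draw back to the positive scale with theta = exp(eta)."""
    rng = random.Random(seed)
    return [math.exp(rng.gauss(mean_log, sd_log)) for _ in range(n_draws)]

def implied_log_density(theta, mean_log, sd_log):
    """Log-density of theta = exp(eta) when eta is Gaussian.

    Equals the Gaussian log-density evaluated at log(theta) plus the
    log-Jacobian of the inverse transform, which is -log(theta).
    """
    eta = math.log(theta)
    z = (eta - mean_log) / sd_log
    gaussian = -0.5 * z * z - math.log(sd_log) - 0.5 * math.log(2.0 * math.pi)
    return gaussian - eta

# Hypothetical variational parameters for one dispersion parameter.
draws = sample_positive_parameter(mean_log=1.0, sd_log=0.3, n_draws=1000)
```

In a full variational fit, `mean_log` and `sd_log` would be chosen to maximize the evidence lower bound; that optimization is omitted here.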

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. 2020 Jul 21;16(7):e1007504. doi: 10.1371/journal.pcbi.1007504.r004

Author response to Decision Letter 1

20 Apr 2020

Attachment

Submitted filename: malacoda_resubmission_letter_2.docx

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007504.r005

Decision Letter 2

Jian Ma

10 May 2020

Dear Dr. Shaw,

Thank you very much for submitting your manuscript "Bayesian modelling of high-throughput sequencing assays with malacoda" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. Based on the reviewers' feedback, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days, addressing the additional comments from the reviewers. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Jian Ma

Deputy Editor

PLOS Computational Biology

Thomas Lengauer

Methods Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The changes that the authors have made to incorporate my and the other reviewers’ feedback have further improved the manuscript. I think that a major limitation of this manuscript is that there is little evaluation on real data other than in K562, an immortalized cancer cell line that is unlikely to be representative of any cell type in the human body. However, the ideal data for such evaluation does not exist, and the authors stated this limitation very clearly in the Discussion section, so I do not think that this limitation should hold back this manuscript from being published. I have come up with a few ideas that I have listed below that might help further demonstrate the value of malacoda relative to other methods. I also identified a few parts of the manuscript and github page that I would recommend further clarifying. Since the authors have made all of the code publicly available and provided details in their appendices about how to run malacoda, I think that malacoda could become a widely used method for using MPRA data to identify putatively functional variants.

I have one new major comment:

1. I think that the paper would benefit tremendously from comparing malacoda to other methods on an additional real MPRA dataset. I recently became aware of a dataset that might be ideal for such a comparison: the dataset in Tewhey et al., Cell, 2016 (https://www.ncbi.nlm.nih.gov/pubmed/27259153). Since there is substantial eQTL data from lymphoblastoid cell lines from datasets like Geuvadis, I think that additionally showing that the putatively functional variants that malacoda detects in this dataset are more likely to overlap with eQTLs (or SNPs in linkage disequilibrium with eQTLs) than those detected by other methods would make the value of malacoda relative to other methods more apparent.

I have a few new minor comments:

Response to other reviewer:

1. The other reviewer brought up the important point that the putatively functional variants identified by MPRAscore are more likely to overlap ChIP-seq peaks than those identified by malacoda. The authors responded by comparing the DeepSEA DNase annotations between those variants across methods and evaluating the similarity between the annotations with a two-way ANOVA test. I found the description of the two-way ANOVA test a little confusing. My understanding is that the null hypothesis of the two-way ANOVA is that the means of the DeepSEA DNase scores are equal across methods. Thus, my understanding is that not rejecting the null hypothesis does not imply that the null hypothesis is correct; it implies only that the null hypothesis cannot be rejected. If the authors re-did the analysis on LCLs and found more overlap between putatively functional variants identified by malacoda than those identified by other methods, then I would not be as concerned about the ChIP-seq result. Alternatively, the authors could evaluate if the putatively functional variants identified by malacoda tend to be closer to TSS’s or closer to K562 DNase peak summits than those identified by other methods.

Introduction:

1. An exciting application of using MPRAs to identify putative functional variants is to help determine which of multiple variants in linkage disequilibrium that have all been associated with a disease are likely to be causal. I think that mentioning this near the beginning of the introduction might encourage more researchers to read the rest of the paper.

2. In lines 101-102, I think that “transcription binding” should be replaced with “transcription factor binding.”

3. In line 106, I would replace “has been unclear” with “has not been developed” (if that is accurate).

Methods:

1. In line 181, I would replace “learn” with “evaluate.”

Results:

1. The way that the p-values were computed was clear from the Appendix, but the way that the p-values were used to compute AUC and AUPR was not immediately intuitive to me. It would be helpful if this were described in more detail.
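One common way to turn per-variant p-values into AUC and AUPR, which may or may not match what the authors did, is to rank variants from smallest to largest p-value and score that ranking against known labels. The sketch below is a minimal illustration under that assumption; it presumes ground-truth labels are available (as in a simulation), uses average precision as the AUPR estimate, and all function names and example values are hypothetical.

```python
def auc_from_pvalues(p_values, is_functional):
    """ROC AUC: probability that a randomly chosen functional variant has a
    smaller p-value than a randomly chosen non-functional one (ties count 1/2)."""
    pos = [p for p, y in zip(p_values, is_functional) if y]
    neg = [p for p, y in zip(p_values, is_functional) if not y]
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos < p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def average_precision_from_pvalues(p_values, is_functional):
    """Average precision: mean of the precision values obtained at the rank
    of each functional variant when variants are sorted by increasing p-value."""
    ranked = sorted(zip(p_values, is_functional), key=lambda pair: pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits

# Hypothetical p-values and truth labels for four variants.
p_values = [0.01, 0.02, 0.03, 0.04]
labels = [False, True, True, False]
auc = auc_from_pvalues(p_values, labels)                  # 0.5
aupr = average_precision_from_pvalues(p_values, labels)   # (1/2 + 2/3) / 2
```

The pairwise AUC loop is quadratic in the number of variants; a rank-based formula gives the same value more efficiently for large screens.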

Discussion:

1. The Discussion section describes some extensions to malacoda that are available in the software package but are not described in detail or analyzed in this paper. I would recommend removing these and writing another paper about them that describes them in detail. Other researchers might be reluctant to use them if they do not have access to a clear explanation of how they work.

Figures:

1. The color-coding at the bottom of Figure 1 might be a little confusing to some readers because the red RNA could be interpreted as meaning that all of the RNA is coming from the first variant. I might instead make the RNA colors correspond to the variant that produced each RNA, or make them similar to the brown color of the DNA that will get transcribed in the middle of the figure.

Appendices:

1. The code is generally easy to follow; removing the commented-out code would make some parts even easier to follow.

github page:

1. I think that a more detailed description of the inputs would make malacoda easier to use. Specifically, a description of the exact format of mpra_data would be helpful.

2. I had trouble finding the help documentation on the github. It would be great if a link to it could be added in a prominent location.

3. There seem to be some R scripts in the github that are not explained anywhere on the github. It would be great if a brief description of each R script could be added.

Reviewer #2: Authors have addressed the technical concern and clarity comments that I have. Please add legends to Figure 3B and 3C.

Also, Figure 3D median spear at zero is confusing. The authors should just add error bars to the bars displayed in the first three panels for median AUC, AUPR, and non-zero correlation.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. 2020 Jul 21;16(7):e1007504. doi: 10.1371/journal.pcbi.1007504.r006

Author response to Decision Letter 2

3 Jun 2020

Attachment

Submitted filename: malacoda_letter_may.docx

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007504.r007

Decision Letter 3

Jian Ma

9 Jun 2020

Dear Dr. Shaw,

We are pleased to inform you that your manuscript 'Bayesian modelling of high-throughput sequencing assays with malacoda' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Jian Ma

Deputy Editor

PLOS Computational Biology

Thomas Lengauer

Methods Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007504.r008

Acceptance letter

Jian Ma

14 Jul 2020

PCOMPBIOL-D-19-01801R3

Bayesian modelling of high-throughput sequencing assays with malacoda

Dear Dr Shaw,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Laura Mallard

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Appendix. Model description, fitting, and diagnostics.

(PDF)

S2 Appendix. Negative Binomial variance estimation.

(PDF)

S3 Appendix. Simulation details and extended results.

(PDF)

S1 Data. RData file of luciferase and primary MPRA results.

An RData file that loads two objects: luc_results, a table of the luciferase results, and mpra_results, giving the primary data on MPRA counts for the variants tested with luciferase.

(RDATA)

S2 Data. RData file of estimate comparisons and primary MPRA data.

(RDATA)

S1 Fig. Prior comparison plot for rs11865131.

(TIF)

S2 Fig. Luciferase versus MPRA estimates by method.

A scatterplot demonstrates the relationship between luciferase-based estimates of TS against MPRA-based estimates from each MPRA analysis method.

(TIF)

Data Availability Statement

All relevant data are within the manuscript and its Supporting Information files.

[pcbi.1007504.ref001] 1.Auton A, Abecasis GR, Altshuler DM, Durbin RM, Bentley DR, Chakravarti A, et al. A global reference for human genetic variation. Nature [Internet]. 2015;526(7571):68–74. 10.1038/nature15393 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1007504.ref002] 2.Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009;106(23):9362–7. 10.1073/pnas.0903103106 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1007504.ref003] 3.Nishizaki SS, Boyle AP. Mining the Unknown: Assigning Function to Noncoding Single Nucleotide Polymorphisms. Trends Genet [Internet]. 2017;33(1):34–45. 10.1016/j.tig.2016.10.008 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1007504.ref004] 4.Melnikov A, Murugan A, Zhang X, Tesileanu T, Wang L, Rogov P, et al. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat Biotechnol [Internet]. 2012;30(3):271–7. 10.1038/nbt.2137 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1007504.ref005] 5.Ulirsch JC, Nandakumar SK, Wang L, Giani FC, Zhang X, Rogov P, et al. Systematic functional dissection of common genetic variation affecting red blood cell traits. Cell [Internet]. 2016;165(6):1530–45. 10.1016/j.cell.2016.04.048 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1007504.ref006] 6.Tewhey R, Kotliar D, Park DS, Liu B, Winnicki S, Reilly SK, et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell [Internet]. 2016;165(6):1519–29. 10.1016/j.cell.2016.04.027 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1007504.ref007] 7.Shen SQ, Myers CA, Hughes AEO, Byrne LC, Flannery JG, Corbo JC. Massively parallel cis-regulatory analysis in the mammalian central nervous system. Genome Res. 2016;26(2):238–55. 10.1101/gr.193789.115 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1007504.ref008] 8.Myint L, Avramopoulos DG, Goff LA, Hansen KD. Linear models enable powerful differential activity analysis in massively parallel reporter assays. BMC Genomics. 2019;20(1):1–19. 10.1186/s12864-018-5379-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1007504.ref009] 9. Niroula A, Ajore R, Nilsson B. MPRAscore: robust and non-parametric analysis of massively parallel reporter assays. Bioinformatics. 2019;(July):1–3. doi:10.1093/bioinformatics/btz591

[pcbi.1007504.ref010] 10. Kalita CA, Moyerbrailean GA, Brown C, Wen X, Luca F, Pique-Regi R. QuASAR-MPRA: Accurate allele-specific analysis for massively parallel reporter assays. Bioinformatics. 2018;34(5):787–94. doi:10.1093/bioinformatics/btx598

[pcbi.1007504.ref011] 11. Ashuach T, Fischer DS, Kreimer A, Ahituv N, Theis FJ, Yosef N. MPRAnalyze: Statistical framework for massively parallel reporter assays. Genome Biol. 2019;20(1):1–17. doi:10.1186/s13059-018-1612-0

[pcbi.1007504.ref012] 12. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi:10.1038/nature11247

[pcbi.1007504.ref013] 13. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model (DeepSea). Nat Methods [Internet]. 2015;12(10):931–4. doi:10.1038/nmeth.3547

[pcbi.1007504.ref014] 14. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol [Internet]. 2014;15(12):550. doi:10.1186/s13059-014-0550-8

[pcbi.1007504.ref015] 15. Kruschke J. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. 2nd ed. London: Academic Press; c2015. p. 336–40.

[pcbi.1007504.ref016] 16. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd ed. Boca Raton, FL: CRC Press; 2013. p. 51–6, 102–4.

[pcbi.1007504.ref017] 17. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A. Stan: A probabilistic programming language. J Stat Softw. 2017;76(1). doi:10.18637/jss.v076.i01

[pcbi.1007504.ref018] 18. Kucukelbir A, Blei DM, Gelman A, Ranganath R, Tran D. Automatic Differentiation Variational Inference. J Mach Learn Res. 2017;18:1–45. Available from: https://arxiv.org/abs/1603.00788

[pcbi.1007504.ref019] 19. Assaf G, Hannon GJ. FASTX-Toolkit [Internet]. 2010. Available from: http://hannonlab.cshl.edu/fastx_toolkit/index.html

[pcbi.1007504.ref020] 20. Hawkins JA, Jones SK, Finkelstein IJ, Press WH. Indel-correcting DNA barcodes for high-throughput sequencing. Proc Natl Acad Sci [Internet]. 2018;115(27):E6217–26. doi:10.1073/pnas.1802640115

[pcbi.1007504.ref021] 21. Ghazi AR, Chen ES, Henke DM, Madan N, Edelstein LC, Shaw CA. Design tools for MPRA experiments. Bioinformatics. 2018;34(15):2682–3. doi:10.1093/bioinformatics/bty150

[pcbi.1007504.ref022] 22. Shaw RJ. GATORs take a bite out of mTOR. Science. 2013;340(6136):1056–7. doi:10.1126/science.1240315

[pcbi.1007504.ref023] 23. Aslan JE, Tormoen GW, Loren CP, Pang J, McCarty OJT. S6K1 and mTOR regulate Rac1-driven platelet activation and aggregation. Blood. 2011;118(11):3129–36. doi:10.1182/blood-2011-02-331579

[pcbi.1007504.ref024] 24. Yang J, Zhou X, Fan X, Xiao M, Yang D, Liang B, et al. MTORC1 promotes aging-related venous thrombosis in mice via elevation of platelet volume and activation. Blood. 2016;128(5):615–24. doi:10.1182/blood-2015-10-672964

[pcbi.1007504.ref025] 25. Ward LD, Kellis M. HaploReg v4: Systematic mining of putative causal variants, cell types, regulators and target genes for human complex traits and disease. Nucleic Acids Res. 2016;44(D1):D877–81. doi:10.1093/nar/gkv1340

[pcbi.1007504.ref026] 26. Chacon D, Beck D, Perera D, Wong JWH, Pimanda JE. BloodChIP: A database of comparative genome-wide transcription factor binding profiles in human blood cells. Nucleic Acids Res. 2014;42(D1):172–7. doi:10.1093/nar/gkt1036

[pcbi.1007504.ref027] 27. Simon LM, Edelstein LC, Nagalla S, Woodley AB, Chen ES, Kong X, et al. Human platelet microRNA-mRNA networks associated with age and gender revealed by integrated plateletomics. Blood. 2014;123(16):37–45. doi:10.1182/blood-2013-12-544692

[pcbi.1007504.ref028] 28. Edelstein LC, Simon LM, Montoya RT, Holinstat M, Chen ES, Bergeron A, et al. Racial differences in human platelet PAR4 reactivity reflect expression of PCTP and miR-376c. Nat Med [Internet]. 2013;19(12):1609–16. doi:10.1038/nm.3385
