Author manuscript; available in PMC: 2020 Aug 4.
Published in final edited form as: Bioinformatics. 2020 Mar 1;36(5):1492–1500. doi: 10.1093/bioinformatics/btz744

Soft windowing application to improve analysis of high-throughput phenotyping data

Hamed Haselimashhadi 1,*, Jeremy C Mason 1, Violeta Munoz-Fuentes 1, Federico López-Gómez 1, Kolawole Babalola 1, Elif F Acar 2,3,4, Vivek Kumar 5, Jacqui White 5, Ann M Flenniken 6,7, Ruairidh King 8, Ewan Straiton 8, John Richard Seavitt 9, Angelina Gaspero 9, Arturo Garza 9, Audrey E Christianson 9, Chih-Wei Hsu 9, Corey L Reynolds 9, Denise G Lanza 9, Isabel Lorenzo 9, Jennie R Green 9, Juan J Gallegos 9, Ritu Bohat 9, Rodney C Samaco 9, Surabi Veeraragavan 9, Jong Kyoung Kim 10, Gregor Miller 11, Helmult Fuchs 11, Lillian Garrett 11, Lore Becker 11, Yeon Kyung Kang 12, David Clary 13, Soo Young Cho 14, Masaru Tamura 15, Nobuhiko Tanaka 15, Dong Soo Kyung 16, Alexandr Bezginov 2,3, Ghina Bou About 17, Marie-France Champy 17, Laurent Vasseur 17, Sophie Leblanc 17, Hamid Meziane 17, Mohammed Selloum 17, Patrick T Reilly 17, Nadine Spielmann 11, Holger Maier 11, Valerie Gailus-Durner 11, Tania Sorg 17, Masuya Hiroshi 15, Obata Yuichi 15, Jason D Heaney 9, Mary E Dickinson 9, Wurst Wolfgang 18, Glauco P Tocchini-Valentini 19, Kevin C Kent Lloyd 13, Colin McKerlie 2,3, Je Kyung Seong 16, Herault Yann 20, Martin Hrabé de Angelis 11, Steve D M Brown 8, Damian Smedley 21, Paul Flicek 1, Ann-Maries Mallon 8, Helen Parkinson 1, Terrence F Meehan 1
PMCID: PMC7115897  EMSID: EMS88049  PMID: 31591642

Abstract

Motivation

High-throughput phenomic projects generate complex data from small treatment and large control groups that increase the power of the analyses but introduce variation over time. A method is needed to utilize a set of temporally local controls that maximizes analytic power while minimizing noise from unspecified environmental factors.

Results

Here we introduce 'soft windowing', a methodological approach that selects a window of time that includes the most appropriate controls for analysis. Using phenotype data from the International Mouse Phenotyping Consortium (IMPC), adaptive windows were applied such that control data collected proximally to mutants were assigned the maximal weight, while data collected earlier or later had less weight. We applied this method to IMPC data and compared the results with those obtained from a standard non-windowed approach. Validation was performed using a resampling approach in which we demonstrate a 10% reduction of false positives from 2.5 million analyses. We applied the method to our production analysis pipeline that establishes genotype–phenotype associations by comparing mutant versus control data. We report an increase of 30% in significant P-values, as well as linkage to 106 versus 99 disease models via phenotype overlap with the soft-windowed and non-windowed approaches, respectively, from a set of 2082 mutant mouse lines. Our method is generalizable and can benefit large-scale human phenomic projects such as the UK Biobank and the All of Us resources.

Availability and implementation

The method is freely available in the R package SmoothWin on CRAN (http://CRAN.R-project.org/package=SmoothWin).

1. Introduction

High-throughput, large-scale phenotyping studies evaluate variables of an organism's biological systems to examine the contribution of genetic and environmental factors to phenotypes. Standardized phenotyping screens that cover a wide range of biological systems have provided useful insights for identifying new genetic contributors to robust phenotypes when compared with more focussed studies that often target well-characterized genes with varying reproducibility (Begley and Ellis, 2012; Edwards et al., 2011; Freedman et al., 2015; Prinz et al., 2011; Stoeger et al., 2018). Leveraging economies of scale and using standardized procedures, high-throughput phenotyping screens address these challenges and have been applied in biological screening of chemical compound libraries, agricultural evaluation of crop plants, genome-wide CRISPR-based mutagenic cell line screens and multi-centre phenotypic screening of mutated model organisms (Al-Tamimi et al., 2016; Dickinson et al., 2016; Flood et al., 2016; Friggens et al., 2011; Malinowska et al., 2017; Sun et al., 2017; Vitak et al., 2017; Viti et al., 2015). The continuous generation of large volumes of data introduces new challenges affecting automated approaches to statistical analysis that have to scale with increasing data and address the underlying complexity inherent in large projects (Kurbatova et al., 2015; Meyers et al., 2017; Vaas et al., 2013, 2012).

The International Mouse Phenotyping Consortium (IMPC) is a G7 recognized global research infrastructure dedicated to generating and characterizing a knockout mouse line for every protein-coding gene (Bradley et al., 2012; Brown and Moore, 2012; Hrabĕ de Angelis et al., 2015). Currently, the IMPC has phenotyped over 148 000 knockouts and 43 000 control mice (data release 9.2, January 2019) across 12 research centres in 9 countries. These centres adhere to a set of standardized phenotype assays defined in the International Mouse Phenotyping Resource of Standardised Screens (IMPReSS) and designed to measure over 200 parameters on each mouse. As part of these standardized operating procedures, critical factors that can impact data collection, such as reagent type or equipment, are reported as required metadata. Phenotype data are then centrally collected and quality controlled by trained professionals before being released for analysis. All phenotype data are processed by the statistical analysis package PhenStat, a freely available R package that provides a variety of statistical methods for the identification of genotype to phenotype associations by comparing mutant to control data that have the same critical attributes (Kurbatova et al., 2015). For quantitative data, linear mixed models are typically employed with several factors modelled, including genotype, sex, sex–genotype interaction, body weight and batch (i.e. phenotype measures collected on the same day). Mutant mouse lines found to have a significant deviation in phenotype measurements are assigned a phenotype term from the Mammalian Phenotype Ontology (Blake et al., 2017). These associations, as well as the raw data, are disseminated via the web portal (https://www.mousephenotype.org) using application programming interfaces and data downloads.

A challenge with high-throughput phenotyping efforts is the small sample size for the experimental group (i.e. the knockout mice) that is produced to maximize the use of finite resources, considering biological relevance and power analysis (Charan and Kantharia, 2013). All mice generated by the IMPC are on the inbred C57BL/6N strain. To reduce genetic drift, IMPC centres maintain wild-type C57BL/6N production colonies that are periodically rederived using commercial vendors (Dickinson et al., 2016; Kurbatova et al., 2015). Mutant F0 mice are bred with wild-type mice from the production colonies to reduce the confounding effects of any de novo, non-targeted mutations. In addition, the IMPC centres are encouraged to measure these knockout mice in two or more batches, as this improves the false discovery rate by modelling in the random effect of day-to-day variation (Karp et al., 2014). In contrast, large control sample sizes accumulate because controls provide a strong internal control of the pipeline and are typically generated with every experimental batch. Such large control groups represent a unique dataset that increases the power of the subsequent analyses and allows the construction of a robust baseline (Bradley et al., 2012). However, this can lead to the accumulation of heterogeneities including seasonal effects, changes in personnel and unknown time-dependent environmental factors (Karp et al., 2014).

A simple approach to cope with heterogeneity in the data is to set explicit time boundaries (e.g. 1 year) before and after experimental collection dates. This 'hard windowing' approach will capture different time-frames depending on how much time elapses between the first and the last batch of experimental data measured. This approach is unsatisfactory for IMPC data as some mutant lines had enough experimental mice to measure in one batch, while others needed multiple batches over 18 months due to breeding difficulties or other factors. This variation in time-frames can lead to a widely different number of controls being applied to an analysis, making it challenging to explore correlations between mutant lines. Thus, more tuneable approaches were needed.
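The dependence of a hard window on the spread of the mutant batches can be illustrated with a small sketch. The dates, the weekly control schedule and the helper name `hard_window_controls` are hypothetical and serve only the illustration; they are not IMPC data.

```python
from datetime import date, timedelta

def hard_window_controls(control_dates, mutant_dates, margin_days=365):
    """Keep controls from `margin_days` before the first mutant batch
    to `margin_days` after the last mutant batch (a hard window)."""
    start = min(mutant_dates) - timedelta(days=margin_days)
    end = max(mutant_dates) + timedelta(days=margin_days)
    return [d for d in control_dates if start <= d <= end]

# Hypothetical controls: one control batch per week for four years.
controls = [date(2015, 1, 5) + timedelta(weeks=w) for w in range(208)]

# Line A: all mutants measured in a single batch.
line_a = [date(2017, 1, 2)]
# Line B: mutants measured in three batches spread over about 18 months.
line_b = [date(2016, 4, 4), date(2016, 12, 5), date(2017, 10, 2)]

n_a = len(hard_window_controls(controls, line_a))
n_b = len(hard_window_controls(controls, line_b))
print(n_a, n_b)  # line B pulls in many more control batches than line A
```

With the same 1-year margin, the two lines are compared against control sets of very different sizes, which is the behaviour the text describes as unsatisfactory.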

In this study, we address the complexity of the data collected over time by proposing a novel windowing strategy that we call 'soft windowing'. This approach utilizes a weighting function to assign flexible weights, ranging from 0 to 1, to the control data points. Controls that are collected on or near the date of mutants are assigned the maximal weights, whereas controls at earlier or later dates are assigned less weight. In contrast to the hard windowing, the weighting function in the soft windowing allows for different shapes and bandwidths by altering the tuning parameters. In addition, we show how to tune the parameters and demonstrate the implementation of the soft windowing on the IMPC data.

2. System and methods

In high-throughput projects, such as the IMPC, the model parameters may not stay constant over time, which can lead to misleading inferences. For example, Figure 1 illustrates changes to the control group trend and/or variation over time for the Forelimb grip strength normalized against body weight and Mean cell volume. One approach widely used in signal processing (Ford, 2003; Kervrann, 2011; Lima et al., 2009; Poularikas, 2018) is to define a windowing function that includes the appropriate number of data points to capture the effect of interest while minimizing the noise. This is defined by

$$W(x, l_1, l_2) = \begin{cases} f(x) & l_1 \le x \le l_2 \\ 0 & \text{o.w.} \end{cases} \tag{1}$$

where setting f (x) to a constant, e.g. f (x) = 1, leads to hard windowing, while setting it to a smooth function results in the soft windowing. The same approach can be generalized to multiple signals (Huang et al., 2007; Li et al., 2007; Tang et al., 2009) or applied as a rolling window (Harel et al., 2008) in the presence of exogenous variables to account for time dependency in the regression coefficients (Brown et al., 2018). Alternatively, we propose a soft windowing approach for the regression methods by defining a weighting function that applies less weight to the residuals outside the window of interest. This leads to distinct advantages over the hard windowing. First, the entire dataset is included in the analysis in contrast to the limited data points in the hard windowing. Second, the windowing and the parameter estimation are coupled, which is a direct result of using the weighted least squares (WLS). Critically, by bounding the controls in a window, we freeze the analysis and abrogate the need for further analysis assuming no new experimental data are generated within the time window.

Fig. 1.


Examples of longitudinal data from the IMPC selected for high variance in control population. Scatter plot of the Forelimb grip strength normalized against body weight (top) and mean cell volume (bottom) from the IMPC Grip Strength and Haematology procedures, respectively. The dashed black lines represent the overall trend of the controls (dark green). Mutant mice are in orange

3. Algorithm

Our novel windowing strategy explicitly defines the weighting function and proposes a simple but effective set of criteria to estimate the minimal window for the noise-power trade-off.

3.1. Weight generating function

Let $t = (t_1, t_2, \dots, t_n)$ represent a set of $n$ continuous time units, $m = (m_1, m_2, \dots, m_p)$ the time units when the treatments are measured (peaks in the windows), $l = \{(l_{1L}, l_{1R}), (l_{2L}, l_{2R}), \dots, (l_{pL}, l_{pR})\}$ a set of $p$ non-negative left and right bandwidths and $k = \{(k_{1L}, k_{1R}), (k_{2L}, k_{2R}), \dots, (k_{pL}, k_{pR})\}$ a set of $p$ positive left and right shape parameters. We impose the continuity on the time to simplify the definition of a continuous function over the time units, e.g. by converting dates to UNIX timestamps. Furthermore, we introduce a peak generating function (PGF) of the form $c_i = F(t; m_i - l_{iL}, k_{iL})\,(1 - F(t; m_i + l_{iR}, k_{iR}))$, $i = 1, 2, \dots, p$, where $F(x; \mu, \sigma) = \Pr_X(X \le x \mid \mu, \sigma)$ is selected from the family of cumulative distribution functions with location $\mu$ and scale $\sigma$. In this study, we select $F$ from the family of continuous and symmetric distributions (such as the Logistic, Gaussian, Cauchy and Laplace distributions). Then, we propose a weight generating function (WGF) of the form

$$\mathrm{WGF}(t, l, k, m) = \sum_{i=1}^{p} c_i^* + \Big[ -\sum_{i \ne j \in \{1,2,\dots,p\}} c_i^* c_j^* + \sum_{i \ne j \ne h \in \{1,2,\dots,p\}} c_i^* c_j^* c_h^* - \cdots + (-1)^{p+1} c_1^* c_2^* \cdots c_p^* \Big], \quad t, l, m, k \in \mathbb{R}^+ \tag{2}$$

where $c_i^* = c_i / \max c_i$ denotes the normalized PGF. The first term on the right-hand side of Equation (2) produces the individual windows and the second term accounts for merging the intersections amongst the windows. Figure 2 shows the symmetric WGF (SWGF), that is, $l_{iR} = l_{iL}$ and $k_{iR} = k_{iL}$, $i = 1, 2, \dots, p$, for different values of $k \in [0.2, 50]$ coloured from blue ($k = 50$) to red ($k = 0.2$) and for different values of $l = 5, 10, 15$. The vertical black dashed lines show the hard window corresponding to the value of $l$. From this plot, the function is capable of generating a range of windows from hard (blue) to soft (red). Furthermore, the weights lie in the (0, 1] interval for all values of time; however, they may not cover the entire (0, 1] spectrum in a bounded time domain. The weights are therefore normalized to the (0, 1] range before they are inserted into the WGF, as shown by $c_i^*$ in Equation (2). Figure 3 shows the merge capability of the SWGF for the logistic $F$ with $m = 15, 35$ and different values of $k = 0.5, 1.5, 3$ and $l = 6, 8, 10, 12$. From this figure, the function is capable of producing a range of flexible multimodal windows (top) as well as aggregated windows (bottom) if $|m_1 + l| > |m_2 - l|$ for all $m_1 < m_2$, $l \in \mathbb{R}$. In all cases, the weights lie in the (0, 1] interval.
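The construction of the symmetric WGF can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SmoothWin implementation: the function names `logistic_cdf` and `swgf` are hypothetical, the logistic CDF is used for F, and the scale is taken as 1/k so that a larger k gives a harder window, as Figure 2 describes. The merge step uses 1 − ∏(1 − c_i*), which expands to the alternating sum in Equation (2) when the sums are read as running over unordered index sets.

```python
import math

def logistic_cdf(x, loc, scale):
    """Logistic CDF, evaluated in a numerically stable way."""
    z = (x - loc) / scale
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def swgf(times, peaks, l, k):
    """Symmetric weight generating function on a grid of time points.

    Assumption: scale = 1 / k, so larger k means a sharper window edge.
    Each peak is assumed to lie inside the time grid.
    """
    scale = 1.0 / k
    normalised_pgfs = []
    for m in peaks:
        # Peak generating function c_i: rises at m - l, falls at m + l.
        c = [logistic_cdf(t, m - l, scale) * (1.0 - logistic_cdf(t, m + l, scale))
             for t in times]
        c_max = max(c)
        normalised_pgfs.append([v / c_max for v in c])  # c_i* = c_i / max c_i
    weights = []
    for j in range(len(times)):
        # 1 - prod(1 - c_i*) merges overlapping windows (inclusion-exclusion).
        prod = 1.0
        for c_star in normalised_pgfs:
            prod *= 1.0 - c_star[j]
        weights.append(1.0 - prod)
    return weights

times = list(range(1, 61))
w_separate = swgf(times, peaks=[15, 35], l=6, k=1.5)   # two distinct windows
w_merged = swgf(times, peaks=[15, 35], l=12, k=1.5)    # windows overlap and merge
print(round(w_separate[24], 4), round(w_merged[24], 4))  # weights at t = 25
```

At t = 25, halfway between the two peaks, the weight is close to zero for l = 6 and close to one for l = 12, mirroring the multimodal and aggregated windows of Figure 3.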

Fig. 2.


Behaviour of the symmetric weight generating function (SWGF) for a spectrum of values for the shape parameter, k, ranging from k = 50 (blue) to k = 0.2 (red), in intervals of t = 1, 2, … , 70, and for the different values of the bandwidth l = 5, 10, 15 (left to right). The black dashed lines show the hard windows corresponding to l. The grey dotted vertical lines show the window peaks. These plots show the capability of the WGF to generate different forms of the window

Fig. 3.


Merging behaviour of the SWGF for different values of the shape parameter k = 0.5, 1.5, 3 and the bandwidth l = 6, 8, 10, 12 on a sequence of time points t = 1, 2, … , 60. The vertical dashed grey lines show the corresponding hard windows to l. This plot shows the capability of SWGF to generate multimodal windows as well as merging individual windows

3.2. Windowing regression

Let $y = x\beta + e$ denote a linear model, with $y$, $x$, $\beta$ and $e$ representing the response, covariates, unknown parameters and independent random noise $e \sim N(0, \sigma^2)$, $\sigma^2 < \infty$, respectively. Imposing the weights in Equation (2) on the residuals leads to the following WLS:

$$Q(\beta) = \mathrm{WGF}(t, l, k, m)\,\lVert y - x\beta \rVert_2^2 \tag{3}$$

where $\lVert \cdot \rVert_2$ denotes the second norm of a vector. Minimizing $Q(\beta)$ with respect to $\beta$ leads to $\hat{\beta} = (x'wx)^{-1}x'wy$, where $w$ is a diagonal matrix of weights from the WGF and (′) denotes the transpose of a matrix. Weighted linear regression (WLR), in the context of this study, is equivalent to imposing less weight on the off-modal time points with respect to $m$. We illustrate this in Figure 4, where 60 observations are simulated from the following model:

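$$y_t = t\beta_1 I(t \le 20) + t\beta_2 I(20 < t < 40) + t\beta_3 I(t \ge 40) + e,$$

with $t = 1, 2, \dots, 60$, $\beta_1 = 0$, $\beta_2 = 1$, $\beta_3 = 0$, $e \overset{iid}{\sim} N(0, 1)$ and $I$ the indicator function,

$$I(x \in [a, b]) = \begin{cases} 1 & x \in [a, b] \\ 0 & \text{o.w.} \end{cases}$$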

Fig. 4.


(Left) Comparison between the inferences from the windowed linear regression on the simulated data (blue dashed line) and without windowing (dotted black line). (Right) The corresponding weights from WGF centred on m = 30. With windowing, we attempt to model the effective section of the data (blue dots)

In other words, the model is piecewise linear and only significant in the t ∈ (20, 40) interval. Figure 4 (top) shows the global estimation of the linear regression from the entire data (dotted black line) and the WLR by WGF(t, 9, 5, 30) (dashed blue line) as well as weights from the WGF on the bottom. This plot shows that the non-WLR leads to a horizontal line, where no significant gradient is detected, whereas the WLR tends to model the significant section of the data that leads to fitting the true line. Figure 4 compares the effect of windowing versus considering the entire dataset, showing the different conclusions.
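The simulated comparison can be reproduced in outline with the sketch below. It is a minimal illustration under assumptions that are not in the text: a straight line with an intercept is fitted, the logistic CDF with scale 1/k defines the window, the random seed is arbitrary, and the helper names `window_weight` and `weighted_line_fit` are hypothetical. For a single peak the normalization of the weights is omitted because rescaling all weights by a constant does not change the WLS estimate.

```python
import math
import random

def window_weight(t, m, l, k):
    """Single symmetric logistic window centred on m (assumed scale 1 / k)."""
    def cdf(x, loc):
        return 1.0 / (1.0 + math.exp(-(x - loc) * k))
    return cdf(t, m - l) * (1.0 - cdf(t, m + l))

def weighted_line_fit(t, y, w):
    """Closed-form weighted least squares for y = b0 + b1 * t."""
    sw = sum(w)
    t_bar = sum(wi * ti for wi, ti in zip(w, t)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (ti - t_bar) * (yi - y_bar) for wi, ti, yi in zip(w, t, y))
    sxx = sum(wi * (ti - t_bar) ** 2 for wi, ti in zip(w, t))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope

random.seed(1)
beta1, beta2, beta3 = 0.0, 1.0, 0.0
t = list(range(1, 61))
y = []
for ti in t:
    # Piecewise model from the text: only the middle segment has a slope.
    if ti <= 20:
        mean = ti * beta1
    elif ti < 40:
        mean = ti * beta2
    else:
        mean = ti * beta3
    y.append(mean + random.gauss(0.0, 1.0))

# Non-windowed fit: every observation has weight 1.
ols_intercept, ols_slope = weighted_line_fit(t, y, [1.0] * len(t))
# Windowed fit with l = 9, k = 5 and the peak at m = 30.
weights = [window_weight(ti, 30, 9, 5) for ti in t]
wls_intercept, wls_slope = weighted_line_fit(t, y, weights)
print(round(ols_slope, 3), round(wls_slope, 3))
```

The non-windowed slope stays near zero while the windowed slope is close to the true value of 1 in the middle segment, which is the contrast the figure is meant to convey.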

3.3. Selection of the tuning parameters

Selection of the tuning parameters k and l to define the soft window has a strong impact on the final estimations and consequently on the inferences that are made from the statistical results. Indeed, a wide or over-smooth window can lead to the inclusion of too much noise, whereas a small window can result in low power in the analysis. An additional challenge is that the total number of window parameters (l, k) grows linearly with the number of peaks, m, which results in significant growth in the computational complexity of the final fitting. This is because tuning the window in the general form of WLS in Equation (3) requires a search over 2p dimensions for the optimal l and k. To cope with this complexity, we propose to fix l and k so that all windows are symmetric and have the same shape and bandwidth. We then select the tuning parameters by searching the space on the grid of (l, k) values and looking for the most significant change in mean and/or variation of the residuals/predictions. The grid is searched by generating a series of scores from applying a t-test (to detect changes in mean) and an F-test (to detect changes in variation) to the consecutive residuals/predictions at each step of expanding (l → l + λ, λ > 0) and/or reshaping (k → k + α, α > 0) the windows. This technique is based on the assumption that the mean and the variation of the residuals/predictions should remain unchanged in different time periods (St. Laurent, 1994).
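One simplified reading of this grid search is sketched below; it is not the SmoothWin procedure. The assumptions are: residuals come from an initial baseline-only model (the global mean), an observation counts as inside a window when its weight exceeds 0.5, and the scores are the raw Welch t statistic and variance ratio between the residual sets of consecutive bandwidths rather than P-values. The data and all function names are hypothetical.

```python
import math
import random
import statistics

def window_weight(t, m, l, k):
    """Single symmetric logistic window centred on m (assumed scale 1 / k)."""
    def cdf(x, loc):
        return 1.0 / (1.0 + math.exp(-(x - loc) * k))
    return cdf(t, m - l) * (1.0 - cdf(t, m + l))

def welch_t(a, b):
    """Welch t statistic: sensitive to a change in mean between two sets."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def variance_ratio(a, b):
    """F-type statistic: sensitive to a change in variation between two sets."""
    return statistics.variance(b) / statistics.variance(a)

random.seed(7)
times = list(range(1, 101))
peak = 50
# Hypothetical controls: stable for 30 <= t <= 70, shifted and noisier elsewhere.
y = [random.gauss(0.0, 1.0) if 30 <= t <= 70 else random.gauss(4.0, 3.0) for t in times]
baseline = statistics.mean(y)               # initial baseline-only model
residuals = [v - baseline for v in y]

def scores_over_bandwidths(l_grid, k=2.0, cutoff=0.5):
    """Score each expansion step l -> l + lambda against the previous window."""
    out, previous = [], None
    for l in l_grid:
        inside = [r for t, r in zip(times, residuals)
                  if window_weight(t, peak, l, k) > cutoff]
        if previous is not None:
            out.append((l, welch_t(previous, inside), variance_ratio(previous, inside)))
        previous = inside
    return out

l_grid = [5, 10, 15, 20, 25, 30, 35]
scores = scores_over_bandwidths(l_grid)
for l, t_score, f_score in scores:
    print(l, round(t_score, 2), round(f_score, 2))
```

In this toy setting the scores change abruptly once the window grows past the stable period (between l = 20 and l = 25), which is the kind of step the selection rule looks for.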

To gain the necessary power in the analysis, we apply the statistical tests to the values of l that correspond to a minimum of T observations in the windows. One can then define the quantity T(l), the total number of observations included in the hard window corresponding to l. We should stress that the definition of T(l) in the soft windowing can be challenging because the WGF assigns weights to the entire dataset in the final fitting. To address this complexity, we propose the Sum of Weights Score, $\mathrm{SWS}(l, k) = \sum_{i=1}^{n} \mathrm{WGF}(t_i, l, k, m)$, that is, the summation of weights from the WGF for specific l and k. Note that SWS(l, k) ≥ T(l), with equality for sufficiently large k. Because l is generally unknown, a value of T(l) = T independent of l needs to be decided before the analysis. Our experiments, inspired by the z-test minimal sample size (n > 30), show that setting SWS ≥ T with

T{max(35,nπ2)Single peak35pMultiple peaks

provides sufficient statistical power and precision for the analysis of each sex-parameter in IMPC.
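The Sum of Weights Score and its use as a minimum-observation constraint can be sketched as follows. This is an illustration only: a single peak with a logistic window of scale 1/k is assumed, the threshold is passed in as a plain number (the value 35 below is a placeholder, not the rule above), and the function names are hypothetical.

```python
import math

def window_weight(t, m, l, k):
    """Single symmetric logistic window centred on m (assumed scale 1 / k)."""
    def cdf(x, loc):
        return 1.0 / (1.0 + math.exp(-(x - loc) * k))
    return cdf(t, m - l) * (1.0 - cdf(t, m + l))

def sum_of_weights_score(times, m, l, k):
    """SWS(l, k): sum of the normalised window weights over all time points."""
    w = [window_weight(t, m, l, k) for t in times]
    w_max = max(w)
    return sum(v / w_max for v in w)

def smallest_bandwidth(times, m, l_grid, k, t_min):
    """Smallest l on the grid whose SWS reaches the required minimum t_min."""
    for l in sorted(l_grid):
        if sum_of_weights_score(times, m, l, k) >= t_min:
            return l
    return None  # no bandwidth on the grid satisfies the constraint

times = list(range(1, 201))  # hypothetical: one control per time unit
l_star = smallest_bandwidth(times, m=100, l_grid=range(5, 100, 5), k=1.0, t_min=35)
print(l_star, round(sum_of_weights_score(times, 100, l_star, 1.0), 2))
```

Because the score sums fractional weights, it gives a usable notion of "observations in the window" even though every data point receives a non-zero weight.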

Once the bandwidth, l, is selected, the shape parameter, k, can be optimized on a grid of values similar to l.

This algorithm is implemented for a broad range of models in the R package SmoothWin, which is available from https://cran.r-project.org/package=SmoothWin. The main function of the package, SmoothWin(…), takes an initial model as input and, given a range of values for the bandwidth and shape, performs soft windowing on that model. Furthermore, it allows plotting of the results for diagnostics and further inspection. One can also generate the weights from the SWGF using the expWeigh(…) function.

4. Implementation

4.1. Sensitivity analysis

The sensitivity of the soft windowing to the tuning parameters, in particular the minimum number of observations required in the window (T), is tested on the two IMPC examples introduced in Figure 1 for Mean cell volume and Forelimb grip strength normalized against body weight. To this end, the tuning parameters l, k and T are set to

  • l: the total range of the experiment time divided into 500 logarithmically spaced values;

  • k: the values in the [0.5, 10] interval divided into 50 logarithmically spaced values;

  • T: the values from 14 to n divided into 25 logarithmically spaced values,

where n is the total number of observations in the dataset. We should stress that l and k are selected to cover the entire experiment range and to avoid the bias of selecting incomplete ranges. We then study only the effect of T on the final fittings.
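The three grids can be generated with a small helper. This is a sketch: the function name `log_spaced`, the dataset size, the experiment length and the lower bound of 1 for l are assumptions made for the illustration, since the text specifies only the number of values and, for k and T, the end points.

```python
import math

def log_spaced(start, stop, num):
    """Return `num` values from start to stop, equally spaced on the log scale."""
    if num == 1:
        return [float(start)]
    step = (math.log(stop) - math.log(start)) / (num - 1)
    return [math.exp(math.log(start) + i * step) for i in range(num)]

n = 1500                 # hypothetical number of observations in the dataset
time_range_days = 1460   # hypothetical total range of the experiment time

l_grid = log_spaced(1, time_range_days, 500)  # lower bound 1 is an assumption
k_grid = log_spaced(0.5, 10, 50)
t_grid = log_spaced(14, n, 25)
print(len(l_grid), len(k_grid), len(t_grid))
```

Logarithmic spacing places more grid points at small bandwidths and small T, where a one-unit change alters the window proportionally more.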

Figure 5 shows the sensitivity of the P-values to the change in the minimum number of observations required for the soft windowing, T. The left plots show the change in the P-value corresponding to the genotype effect in the linear mixed model (with genotype, sex, genotype–sex interaction and body weight in the fixed effect term and the batch in the random effect) for different values of T. The dashed blue vertical lines show the maximum tolerated value of T before a step-change in the P-values is observed. The right-hand side plots show the final fitting of the windowed model. The weights of the controls (triangles) are colour coded on a green–purple spectrum: inside the window (green), on the border (grey) and outside the window (purple). Figure 5 shows the sensitivity of the soft windowing to T; for instance, selecting a high value for T could lead to including too much noise in the final fitting.

Fig. 5.


The sensitivity analysis of the soft windowing approach to the minimum number of observations required in the window. The left plots show the variation of the final Genotype P-values with different values of T. The vertical dashed blue lines show the maximum value of T tolerated by the algorithm before too much noise is included in the final fittings. The right plots show the optimal soft-windowed linear mixed model fitted to the data. The weights of the controls (triangles) are colour coded from green (inside the windows) to grey (on the window borders) and purple (outside the window). The mutants are shown with the black plus (+) on the plots

4.2. Simulation study

To assess the performance of the soft windowing method, we implemented a resampling approach to construct a sample of artificial mutants from the IMPC control data by relabelling some controls as mutants. We then examined the difference in the number of false positives that were detected by the standard (non-windowed) analysis versus the soft-windowed approach. Since the resampling is only performed on the controls, we expect fewer false positives from the soft-windowed results.

Mutant data in the IMPC have a special structure, resulting from mice being born in the same litters and being phenotyped closely together in time (batch effect), which must be replicated in the resampling approach. We address this by utilizing structured resampling that replaces the mutants with the closest random controls in time. We create artificial mutant groups by randomly sliding the true mutant structure over the time domain of controls, collecting as many controls as there were mutants in the original set and repeating this procedure five times per dataset (Supplementary Fig. S1 shows an illustration of three iterations of the structured resampling on the Bone Mineral Content parameter).
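The structured resampling can be sketched as follows. This is an interpretation of the description above, not the code used in the study: the mutant time structure is shifted by a random offset that keeps it inside the control time domain, and each shifted mutant time is matched to the closest control that has not yet been relabelled. The time values and the function name `structured_resample` are hypothetical.

```python
import random

def structured_resample(control_times, mutant_times, rng):
    """Return indices of controls relabelled as artificial mutants.

    Assumes there are at least as many controls as mutants and that the
    mutant time span fits inside the control time domain.
    """
    lo, hi = min(control_times), max(control_times)
    span = max(mutant_times) - min(mutant_times)
    # Slide the whole mutant structure to a random position among the controls.
    new_start = rng.uniform(lo, hi - span)
    shift = new_start - min(mutant_times)
    available = set(range(len(control_times)))
    chosen = []
    for mt in sorted(mutant_times):
        target = mt + shift
        # Closest unused control in time; ties are broken by index.
        idx = min(available, key=lambda i: (abs(control_times[i] - target), i))
        available.remove(idx)
        chosen.append(idx)
    return chosen

# Hypothetical data: four controls per week, mutants in two batches 21 days apart.
control_times = [day for day in range(0, 1000, 7) for _ in range(4)]
mutant_times = [300, 300, 300, 300, 321, 321, 321]

rng = random.Random(42)
iterations = [structured_resample(control_times, mutant_times, rng) for _ in range(5)]
for chosen in iterations:
    print(sorted(control_times[i] for i in chosen))
```

Each iteration keeps the batch layout of the original mutants (group sizes and spacing) while drawing every artificial mutant from the control population, so any significant genotype effect is a false positive by construction.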

For non-windowed and soft-windowed analyses, the same statistical model is fitted, namely the linear mixed model implemented in the R package PhenStat with genotype, sex, genotype–sex interactions and body weight for the fixed effect terms and the batch in the random effect. This setup implies that the difference in the results is a direct consequence of the control selection strategy by soft windowing. The outcome of the simulation study consists of 18 IMPC procedures across 11 centres and over 2.5 million analyses and P-values. Comparing the results from the IMPC standard and soft-windowed analyses on resampled data, we detect a total of 14 201 and 12 716 false positives (FP), respectively, at the significance level used by the IMPC, 0.0001. This constitutes more than a 10% relative improvement in FPs when the soft-windowed method is applied. Table 1 shows the top 10 IMPC procedures with the significant changes in the FPs. From this table, the procedures Body Composition, Open Field, Urinalysis, Heart Weight, Acoustic Startle and Pre-pulse Inhibition account for the highest relative reduction of 68% in FPs, whereas the Clinical Blood Chemistry, X-Ray, Insulin Blood Levels, Electrocardiogram and Eye Morphology account for the maximum increase of 32% in FPs. Supplementary Figure S2 shows parameters from the Body Composition and Clinical Blood Chemistry procedures that showed the biggest loss and gain in false positives for associated data parameters, respectively. This plot shows an improvement in decreasing FPs in all IMPC_DXA parameters, which contrasts with an increase in the FPs for IMPC_CBC parameters. We further examined the top two IMPC_CBC parameters, Alanine aminotransferase (IMPC_CBC_013) and Aspartate aminotransferase (IMPC_CBC_012) in Supplementary Figure S3, and noted a high level of randomly deviated points from the mean of controls that can bias the outcome of the structured resampling.

Table 1. Top 10 IMPC procedures with the highest change in the total number of false positives.

| Procedure name | No. P-values (a) | NFP (b) | WFP (c) | Relative change (d), % |
| --- | --- | --- | --- | --- |
| Body composition (IMPC_DXA) | 167 789 | 3809 | 2293 | 37.58 |
| Clinical blood chemistry (IMPC_CBC) | 320 949 | 1472 | 2414 | 62.12 |
| Open field (IMPC_OFD) | 182 894 | 1507 | 830 | 35.52 |
| Haematology (IMPC_HEM) | 243 640 | 3125 | 2746 | 46.77 |
| Heart weight (IMPC_HWT) | 16 236 | 553 | 409 | 42.52 |
| Acoustic startle and pre-pulse inhibition (IMPC_ACS) | 73 177 | 352 | 243 | 40.84 |
| X-ray (IMPC_XRY) | 7016 | 27 | 135 | 83.33 |
| Insulin blood level (IMPC_INS) | 9465 | 63 | 164 | 72.25 |
| Electrocardiogram (IMPC_ECG) | 122 257 | 378 | 471 | 55.48 |
| Eye morphology (IMPC_EYE) | 15 739 | 86 | 153 | 64.02 |

(a) Total number of analyses and P-values.

(b) False positives from the non-windowed results.

(c) False positives from the soft-windowed results.

(d) Relative percentage change of the false positives, 100 × WFP/(NFP + WFP).
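The relative change column and the overall reduction quoted in the text can be recomputed directly from the reported counts. The short check below uses only numbers from Table 1 and Section 4.2; the variable and function names are illustrative.

```python
# (NFP, WFP, reported relative change in %) per procedure, copied from Table 1.
table_1 = {
    "IMPC_DXA": (3809, 2293, 37.58),
    "IMPC_CBC": (1472, 2414, 62.12),
    "IMPC_OFD": (1507, 830, 35.52),
    "IMPC_HEM": (3125, 2746, 46.77),
    "IMPC_HWT": (553, 409, 42.52),
    "IMPC_ACS": (352, 243, 40.84),
    "IMPC_XRY": (27, 135, 83.33),
    "IMPC_INS": (63, 164, 72.25),
    "IMPC_ECG": (378, 471, 55.48),
    "IMPC_EYE": (86, 153, 64.02),
}

def relative_change(nfp, wfp):
    """Share of false positives that come from the soft-windowed analysis, in %."""
    return 100.0 * wfp / (nfp + wfp)

for name, (nfp, wfp, reported) in table_1.items():
    print(name, round(relative_change(nfp, wfp), 2), reported)

# Overall false positives reported in the text for all procedures.
total_nfp, total_wfp = 14201, 12716
overall_reduction = 100.0 * (total_nfp - total_wfp) / total_nfp
print(round(overall_reduction, 2))  # relative reduction in %, just above 10
```

Under this definition a value below 50 means that soft windowing produced fewer false positives than the non-windowed analysis for that procedure, and a value above 50 means it produced more.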

4.3. Soft windowing as part of the IMPC statistics pipeline

We next show the performance of the soft windowing approach on IMPC data by integrating it into the standard IMPC statistics pipeline in PhenStat (Kurbatova et al., 2016). To this end, each dataset is processed by PhenStat for the initial estimation of a fully saturated linear mixed model including genotype, sex, genotype–sex interaction and body weight in the fixed effect term and the batch in the random effect. The resulting fit is then passed into the soft windowing algorithm in the R package SmoothWin for the determination of the optimal windowing weights. After determining the optimal weights, the final model is fitted using a weighted linear mixed model and optimized with a backward elimination approach.

Using data release 9.2 (January 2019), we re-analysed 14 million+ data points, of which 10 million+ are from mutant animals, across the range of IMPC phenotyping procedures. The original IMPC standard analysis, which did not apply the soft windowing approach to select the control data, encompassed 403 000+ analyses and P-values. This analysis led to 12 728 significant P-values (<0.0001), compared with 16 415 significant P-values when the soft windowing was applied, an increase of 30% in total significant P-values. The IMPC assigns mouse lines with phenotype terms from the Mammalian Phenotype Ontology (MPO) when a significant deviation from the control data is detected for a given data parameter (Meehan et al., 2017). Our windowing approach led to 17 391 MPO associations gained and 15 996 associations lost. To explore these differences further, we created an online tool that displays the entire control dataset for a given mouse line-parameter assay with the statistical summaries for both the non-windowed methodology and the soft-windowed approach. Users may filter on a number of attributes, arrange filter order, zoom in on data visualization or navigate directly to the results (https://wwwdev.ebi.ac.uk/mi/impc/dev/phenotype-archive/media/images/windowing/).

Figure 6 shows the corresponding visualization on the IMPC website for the complete dataset (including males and females) previously shown for males only in Figure 1 (top) for the Forelimb grip strength normalized against body weight parameter from the IMPC Grip Strength procedure. The soft window is indicated, as well as changes in the total number of controls (here, 1572 fewer after soft windowing, counting soft windowing weights > $10^{-7}$). Furthermore, the P-value corresponding to the genotype effect shows a significant change in magnitude, from $2.05 \times 10^{-4}$ to $6.75 \times 10^{-18}$ after applying the soft windowing. We then tested if our soft-windowed analysis changed our human disease model discovery rate. We have previously described the IMPC Phenodigm translational pipeline that automatically detects phenotypic similarities between the IMPC strains and over 7 000 rare diseases described in the Online Mendelian Inheritance in Man (OMIM), Orphanet and the Deciphering Developmental Disorders (DDD) databases (Meehan et al., 2017). This pipeline generates qualitative scores on how well a mouse line's associated phenotypes overlap with the phenotypes of the human rare disease populations (Akawi et al., 2015; Firth et al., 2009; Meehan et al., 2017; Mungall et al., 2015; OMIM Browser, 2017; Rath et al., 2012). By comparing the disease models resulting from our soft-windowed analysis versus the non-windowed analysis for IMPC data release 9.2, we find a slight increase in the number of disease models (106 versus 99 models using a threshold of 50% phenotype overlap from a set of 2082 mouse lines that contain mutations; see Supplementary File SI).

Fig. 6.


The soft windowing visualization in the IMPC website for the Forelimb grip strength normalized against body weight from the IMPC Grip Strength procedure. The plot shows the response over time as well as the fitted soft windows. The tables underneath show the comparison between the descriptive statistics obtained from the standard (non-windowed) analysis on the left and the soft-windowed approach on the right. The P-values correspond to the genotype effect after applying the statistical analyses taking the corresponding controls based on the non-window and soft-windowed approaches, respectively

5. Discussion

High-throughput phenomics is a powerful tool for the discovery of new genotype–phenotype associations and there is an increasing need for innovative analyses that make effective use of the voluminous data being generated. Batch effects are inevitable when a large amount of data is collected at different times and/or sites and, therefore, need to be accounted for in the statistical analysis. In this study, we developed a novel 'soft windowing' method that selects a window of time to include controls that are locally selected with respect to experimental animals, thus reducing the noise level in the data collected over long periods of time (years). Soft windowing has notable advantages over a more traditional hard windowing approach. In contrast to the limited data points included in the hard windowing method, the entire dataset is considered for the analysis. To this end, we engineered a weighting function to produce weights in the form of a window of time. Control data collected proximally to mutants were assigned the maximal weight, while data collected earlier or later had less weight. This method has the capability of producing individual windows as well as merging intersected ones. Moreover, the method was implemented to automatically select window size and shape.

The performance of the method was demonstrated in a simulated scenario that uses real control data collected by the IMPC high-throughput pipelines to assess the detection of false positives. We also showed the enhancements to the IMPC statistical pipeline, which establishes genotype–phenotype associations by comparing mutant versus control data, when using our soft-windowed approach.

There are two known conditions that affect the method: (i) the WGF can be slow when there are too many (>20) distinct windows; however, we have optimized the algorithm to be fast enough for the typical IMPC number of peaks (≈3 s for 1500 samples and 16 peaks under k = 1 and l = 30); and (ii) our resampling scenario indicated that our soft windowing approach is sensitive to data that have a high level of outliers or random deviation from the mean. This may result from a bias in the design of the resampling but may also indicate that using all available controls is appropriate for cases with extreme variability.

Our soft windowing approach addresses the scaling issues associated with analysing an ever-increasing set of control data in long-term projects by eliminating controls with weights sufficiently close to zero from future analysis. In the case of the IMPC, once a window of control data is determined for a dataset, there would be no further requirement to re-analyse the dataset with each subsequent data release. This will reduce the computational resources needed, with the resulting gene–phenotype associations remaining stable, greatly facilitating data exchange with research groups trying to functionally validate genes and their disease variants. Our findings also have important implications for efforts such as the UK Biobank and the All of Us initiatives, where large cohort sizes coupled with mobile medical sensors are generating phenotype data at an unprecedented rate (Sankar and Parker, 2017; Sudlow et al., 2015). Researchers performing retrospective analyses of exposures for a defined outcome group (e.g. metabolic disease) are challenged by the variability and longitudinal characteristics associated with these datasets. The methods described here can be used with these human health resources to maximize analytical power and help researchers find the genetic and environmental contributors to human diseases.
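The pruning step mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the IMPC implementation: the cut-off of 10⁻⁷ is the value used for counting controls in the grip strength example, while the function names and the toy data are hypothetical.

```python
def retain_controls(values, weights, threshold=1e-7):
    """Keep only controls whose soft windowing weight exceeds `threshold`.

    Returns the retained (value, weight) pairs and the number of controls
    dropped, i.e. those that need not be carried into future analyses.
    """
    kept = [(v, w) for v, w in zip(values, weights) if w > threshold]
    dropped = len(values) - len(kept)
    return kept, dropped

def weighted_mean(pairs):
    """Weighted mean of the retained control values."""
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total

# Toy data: the third control lies far outside the window.
kept, dropped = retain_controls([10.0, 12.0, 50.0], [1.0, 0.5, 1e-12])
control_mean = weighted_mean(kept)
```

Because a dropped control carries a weight of at most the threshold, removing it changes any weighted summary only negligibly, which is why a windowed dataset would not need to be re-analysed as new, temporally distant controls accumulate.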

Supplementary Material

Supplementary information: Supplementary data are available at Bioinformatics online.

Supplementary Figures
Supplemental File 1

Footnotes

Conflict of Interest: none declared.

Funding

This work was supported by [H.H., J.C.M., V.M.-F., F.L.-G., K.B., R.K., E.S., S.D.M.B., D.S., P.F., A.M.M., H.P., T.F.M.—NIH: UM1 HG006370], [E.F.A., A.M.F., A.B., C.M.—NIH: UM1 OD023221; Genome Canada and Ontario Genomics (OGI-051 & 137)], [V.K., J.W.—NIH: UM1OD023222], [D.C., K.C.K.L.—NIH: UM1 OD023221], [J.K.S., A.Gas., A.Gar., A.E.C., C.-W.H., C.L.R., D.G.L., I.L., J.R.G., J.J.G., R.B., R.C.S., S.V., J.D.H., M.E.D.—NIH: UM1 HG006348; U42 OD011174; U54 HG005348], [M.T., N.T., M.H., O.Y.—Management Expenses Grant for RIKEN BioResource Research Center, MEXT], [J.K.K., S.Y.C., Y.K.K., J.K.S.—Korea Mouse Phenotyping Project (2017M3A9D5A01052447) of the Ministry of Science, ICT and Future Planning through the National Research Foundation], [G.B.A., M.-F.C., L.V., S.L., H.M., M.S., P.T.R., T.S., H.Y.—We are grateful to members of the Mouse Clinical Institute (MCI-ICS) for their help and helpful discussion during the project. The project was supported by the French National Centre for Scientific Research (CNRS), the French National Institute of Health and Medical Research (INSERM), the University of Strasbourg and the 'Centre Europeen de Recherche en Biomedecine', and the French state funds through the 'Agence Nationale de la Recherche' under the frame programme Investissements d'Avenir labelled (ANR-10-IDEX-0002-02, ANR-10-LABX-0030-INRT, ANR-10-INBS-07 PHENOMIN)], [G.M., H.F., L.G., L.B., N.S., H.M., V.G.-D.—German Federal Ministry of Education and Research: Infrafrontier [no. 01KX1012] (M.HdA.), the German Center for Diabetes Research (DZD), EU Horizon2020: IPAD-MD [no. 653961] (M.HdA.)], [WW EUCOMM: 'Tools for Functional Annotation of the Mouse Genome' (EUCOMMTOOLS) project - grant agreement no. [FP7-HEALTH-F4-2010-261492]].

References

  1. Akawi N, et al. Discovery of four recessive developmental disorders using probabilistic genotype and phenotype matching among 4,125 families. Nat Genet. 2015;47:1363–1369. doi: 10.1038/ng.3410.
  2. Al-Tamimi N, et al. Salinity tolerance loci revealed in rice using high-throughput non-invasive phenotyping. Nat Commun. 2016;7:13342. doi: 10.1038/ncomms13342.
  3. Begley CG, Ellis LM. Drug development: raise standards for preclinical cancer research. Nature. 2012;483:531–533. doi: 10.1038/483531a.
  4. Blake JA, et al. Mouse genome database (MGD)-2017: community knowledge resource for the laboratory mouse. Nucleic Acids Res. 2017;45:D723–D729. doi: 10.1093/nar/gkw1040.
  5. Bradley A, et al. The mammalian gene function resource: the International Knockout Mouse Consortium. Mamm Genome. 2012;23:580–586. doi: 10.1007/s00335-012-9422-2.
  6. Brown RL, et al. Techniques for testing the constancy of regression relationships over time. J R Stat Soc Ser B. 2018;37:149–163.
  7. Brown SDM, Moore MW. The International Mouse Phenotyping Consortium: past and future perspectives on mouse phenotyping. Mamm Genome. 2012;23:632–640. doi: 10.1007/s00335-012-9427-x.
  8. Charan J, Kantharia N. How to calculate sample size in animal studies? J Pharmacol Pharmacother. 2013;4:303. doi: 10.4103/0976-500X.119726.
  9. Dickinson ME, et al. High-throughput discovery of novel developmental phenotypes. Nature. 2016;537:508–514. doi: 10.1038/nature19356.
  10. Edwards AM, et al. Too many roads not taken. Nature. 2011;470:163–165. doi: 10.1038/470163a.
  11. Firth HV, et al. DECIPHER: database of chromosomal imbalance and phenotype in humans using Ensembl resources. Am J Hum Genet. 2009;84:524–533. doi: 10.1016/j.ajhg.2009.03.010.
  12. Flood PJ, et al. Phenomics for photosynthesis, growth and reflectance in Arabidopsis thaliana reveals circadian and long-term fluctuations in heritability. Plant Methods. 2016;12:14. doi: 10.1186/s13007-016-0113-y.
  13. Ford MS. The illustrated wavelet transform handbook: introductory theory and applications in science. Health Physics. 2003;84:667–668.
  14. Freedman LP, et al. The economics of reproducibility in preclinical research. PLoS Biol. 2015;13:e1002165–e1002169. doi: 10.1371/journal.pbio.1002165.
  15. Friggens NC, et al. Extracting biologically meaningful features from time-series measurements of individual animals: towards quantitative description of animal status. In: Sauvant D, editor. Modelling Nutrient Digestion and Utilisation in Farm Animals. Wageningen Academic Publishers; Wageningen: 2011. pp. 40–48.
  16. Harel A, et al. Modeling web usability diagnostics on the basis of usage statistics. In: Jank W, Shmueli G, editors. Statistical Methods in e-Commerce Research. John Wiley & Sons, Inc; Hoboken, NJ: 2008. pp. 131–172.
  17. Hrabé de Angelis M, et al. Analysis of mammalian gene function through broad-based phenotypic screens across a consortium of mouse clinics. Nat Genet. 2015;47:969–978. doi: 10.1038/ng.3360.
  18. Huang BE, et al. Detecting haplotype effects in genomewide association studies. Genet Epidemiol. 2007;31:803–812. doi: 10.1002/gepi.20242.
  19. Karp NA, et al. Impact of temporal variation on design and analysis of mouse knockout phenotyping studies. PLoS One. 2014;9:e111239. doi: 10.1371/journal.pone.0111239.
  20. Kervrann C. An Adaptive Window Approach for Image Smoothing and Structures Preserving. Springer; Berlin, Heidelberg: 2011. pp. 132–144.
  21. Kurbatova N, et al. PhenStat: statistical analysis of phenotypic data. 2016:1–9. http://bioc.ism.ac.jp/packages/devel/bioc/vignettes/PhenStat/inst/doc/PhenStatUsersGuide.pdf (9 April 2019, date last accessed).
  22. Kurbatova N, et al. PhenStat a tool kit for standardized analysis of high throughput phenotypic data. PLoS One. 2015;10:e0131274. doi: 10.1371/journal.pone.0131274.
  23. Li Y, et al. Association mapping via regularized regression analysis of single-nucleotide-polymorphism haplotypes in variable-sized sliding windows. Am J Hum Genet. 2007;80:705–715. doi: 10.1086/513205.
  24. Lima MFM, et al. Robotic manipulators with vibrations: short time Fourier transform of fractional spectra. In: Machado JAT, editor. Intelligent Engineering Systems and Computational Cybernetics. Springer; 2009. pp. 49–60.
  25. Malinowska M, et al. Phenomics analysis of drought responses in Miscanthus collected from different geographical locations. GCB Bioenergy. 2017;9:78–91.
  26. Meehan TF, et al. Disease model discovery from 3,328 gene knockouts by the International Mouse Phenotyping Consortium. Nat Genet. 2017;49:1231–1238. doi: 10.1038/ng.3901.
  27. Meyers RM, et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat Genet. 2017;49:1779–1784. doi: 10.1038/ng.3984.
  28. Mungall CJ, et al. Use of model organism and disease databases to support matchmaking for human disease gene discovery. Hum Mutat. 2015;36:979–984. doi: 10.1002/humu.22857.
  29. OMIM Browser. Online Mendelian Inheritance in Man - An Online Catalog of Human Genes and Genetic Disorders. 2017. academic.oup.com.
  30. Poularikas AD. Discrete time and discrete Fourier transforms. In: Poularikas AD, editor. The Transforms and Applications Handbook. CRC Press LLC; Boca Raton: 2018.
  31. Prinz F, et al. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov. 2011;10:712–712. doi: 10.1038/nrd3439-c1.
  32. Rath A, et al. Representation of rare diseases in health information systems: the orphanet approach to serve a wide range of end users. Hum Mutat. 2012;33:803–808. doi: 10.1002/humu.22078.
  33. Sankar PL, Parker LS. The precision medicine initiative's All of Us research program: an agenda for research on its ethical, legal, and social issues. Genet Med. 2017;19:743–750. doi: 10.1038/gim.2016.183.
  34. St Laurent RT. Reviewed work: understanding regression assumptions by William D. Berry. Technometrics. 1994;36:321.
  35. Stoeger T, et al. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol. 2018;16:e2006643. doi: 10.1371/journal.pbio.2006643.
  36. Sudlow C, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779.
  37. Sun J, et al. Multitrait, random regression, or simple repeatability model in high-throughput phenotyping data improve genomic prediction for wheat grain yield. Plant Genome. 2017;10. doi: 10.3835/plantgenome2016.11.0111.
  38. Tang R, et al. A variable-sized sliding-window approach for genetic association studies via principal component analysis. Ann Hum Genet. 2009;73:631–637. doi: 10.1111/j.1469-1809.2009.00543.x.
  39. Vaas LAI, et al. Opm: an R package for analysing OmniLog® phenotype microarray data. Bioinformatics. 2013;29:1823–1824. doi: 10.1093/bioinformatics/btt291.
  40. Vaas LAI, et al. Visualization and curve-parameter estimation strategies for efficient exploration of phenotype microarray kinetics. PLoS One. 2012;7:e34846. doi: 10.1371/journal.pone.0034846.
  41. Vitak SA, et al. Sequencing thousands of single-cell genomes with combinatorial indexing. Nat Methods. 2017;14:302–308. doi: 10.1038/nmeth.4154.
  42. Viti C. High-throughput phenomics. In: Mengoni A, et al., editors. Bacterial Pangenomics: Methods and Protocols. Springer; New York, NY: 2015. pp. 99–123.
