PLOS ONE. 2022 May 4;17(5):e0267280. doi: 10.1371/journal.pone.0267280

Image-based cell profiling enhancement via data cleaning methods

Arghavan Rezvani 1,#, Mahtab Bigverdi 1,#, Mohammad Hossein Rohban 1,*
Editor: Thippa Reddy Gadekallu
PMCID: PMC9067647  PMID: 35507559

Abstract

With the advent of high-throughput assays, a large number of biological experiments can be carried out. Image-based assays are among the most accessible and inexpensive technologies for this purpose. Indeed, these assays have proved effective in characterizing unknown functions of genes and small molecules. Image analysis pipelines play a pivotal role in translating the raw images captured in such assays into useful and compact representations, also known as measurements. CellProfiler is a popular and commonly used tool for this purpose; it provides readily available modules for cell/nuclei segmentation and computes various measurements, or features, for each cell/nucleus. Single-cell features are then aggregated for each treatment replicate to form treatment "profiles". However, there may be several sources of error in the CellProfiler quantification pipeline that affect the downstream analysis performed on the profiles. In this work, we examined various preprocessing approaches to improve the profiles. We consider the identification of drug mechanisms of action as the downstream task used to evaluate such preprocessing approaches. Our enhancement steps mainly consist of data cleaning, cell-level outlier detection, toxic drug detection, and regressing out the cell area from all other features, as many of them are strongly affected by the cell area. Our experiments indicate that by performing these time-efficient preprocessing steps, image-based profiles can preserve more meaningful information compared to raw profiles. In the end, we also suggest possible avenues for future research.

Introduction

High-throughput image-based assays have proved to be an effective predictive tool in the early stages of drug discovery through automated microscopy and image analysis, which make quantification of cellular morphological responses possible at a large scale [1]. These experiments often involve growing cells in multi-well plates and then treating the cells in each well with a small molecule or a genetic perturbation. Image-based profiling has diverse and powerful applications, including identification of gene and allele functions and targets, and of the mechanisms of action (MoA) of drugs [2]. Predicting drug MoAs through such assays can reduce drug discovery costs when applied early on. According to previous studies, the MoA of an unknown compound can be predicted by grouping it with already-annotated compounds based on the similarity of their morphological profiles [3–5]. Therefore, high-throughput assays can be helpful in this process.

The typical workflow for analyzing images produced by high-throughput assays includes illumination correction, nuclei/cell segmentation, quality control, morphological feature measurement, batch effect removal, data normalization, feature selection/dimensionality reduction, and finally aggregation of single-cell measurements into image-based profiles per well [2]. The initial steps, such as illumination correction, segmentation, and feature extraction, are not investigated in this work. We instead focus mainly on approaches that preprocess the features extracted by CellProfiler, an open-source software package that aims to automate most of these steps. We conduct a comprehensive study of how such techniques could potentially improve profile quality. Data cleaning is a key step in enhancing image-based profiling, as various artifacts in the staining and imaging process can affect the subsequent steps. Some quality control methods are applied to raw images; for instance, in [6], a cell-level quality control approach based on high-throughput images has been proposed. However, examining raw images can be time-consuming and is not applicable to previously prepared datasets. Hence, in this work, we focus on extracted image-based profiles as the input to the downstream analysis. A significant step in image-based profiling and data cleaning is cell-level quality control. Outlier cells, which do not reflect any valid biological effect, may result from errors in different parts of the pipeline. For instance, an error in the segmentation step may result in overly small or large cells and, as a result, heavily bias the profiles. Detecting and removing outlier cells can therefore greatly improve profile quality.

We can categorize strategies for detecting outlier cells into two categories [2]: model-free and model-based. In model-free strategies, statistical analysis is used to detect outlier cells; for instance, calculating robust estimators such as the median and the median absolute deviation is useful in multivariate settings [7]. Principal component analysis is another model-free strategy for outlier detection [8]. In model-based strategies, a model is trained on normal samples. It can be a linear regression model, in which outliers are detected as data points with a large residual [9]. Alternatively, given samples of outliers, supervised machine learning classifiers can be used [10].

We propose to use an unsupervised outlier detection method in this work and evaluate its effectiveness in the context of MoA prediction. More specifically, histogram-based outlier detection is applied to each plate individually, and by removing the single cells detected as outliers by this method, the final results are improved.

In addition, we found that domain-specific feature preprocessing, as a data cleaning step, could enhance the relevant downstream task. In summary, we found that the cell area is a pivotal contributor to many of the features, and hence the similarity metric is strongly influenced by it. Therefore, we propose to neutralize the effect of cell area on the other features in order to capture more meaningful information in the profiles. The major steps of this work are described in Fig 1. Finally, we investigated various general-purpose methods for dimensionality reduction and representation learning based on the classical features extracted by CellProfiler, and found that, in contrast to the domain-specific preprocessing step, they are not effective in improving the predictive ability of the profiles.

Fig 1. Representative workflow for the enhanced image-based cell profiling via data cleaning.


Six main steps transform the cell measurements of each plate into compound-level profiles. Specifically, in step 3, we remove from each feature the part that is accounted for by the cell area feature. This is achieved by regressing each feature against the cell area and replacing the feature with the residual of the corresponding regression. Furthermore, cells that appear as outliers are identified and removed through a histogram-based approach called HBOS, to make sure that the well-level aggregated profiles are not affected by potential artifacts.

According to previous studies that use the same metric as this work, increasing the enrichment of profiles is not an easy task in the context of high-throughput microscopy [11]. However, we suggest some efficient preprocessing methods that increase this metric significantly, which is our main contribution.

Materials and methods

Dataset

We used a dataset of images and morphological profiles of small-molecule treatments based on the Cell Painting assay [12]. About 50 plates, corresponding to bioactive compounds with known mechanisms of action, were chosen. Because CellProfiler extracts some redundant features, a specific list of features was selected for this experiment. The feature and plate selection follows the method proposed by Rohban et al. [11]. Each plate consists of wells containing cells, and each well is treated with a particular compound. Some wells, known as negative controls, are treated only with the compound solvent (DMSO). The profile of a well is obtained by taking the median of the single-cell measurements in that well. To mitigate batch effects, profiles in each plate were normalized separately using the DMSO wells of that plate: the median of the DMSO well profiles was subtracted from each single-cell profile, and the result was divided by the median absolute deviation (MAD) of the DMSO well profiles [4]. There are several strategies for creating aggregated profiles; in this work, similar to earlier experiments [4, 11], the median is used for both well-level and compound-level aggregation.
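For illustration, the following is a minimal sketch of this per-plate robust normalization, assuming the well-level median profiles of the DMSO wells have already been computed; the variable names here are hypothetical:

```python
import numpy as np
import pandas as pd

def mad_normalize_plate(cells: pd.DataFrame, dmso_wells: pd.DataFrame) -> pd.DataFrame:
    """Per-plate robust normalization of single-cell features against DMSO controls.

    cells:      single-cell feature matrix for one plate (rows = cells).
    dmso_wells: well-level median profiles of that plate's DMSO wells (rows = wells).
    """
    center = dmso_wells.median(axis=0)
    # MAD of the DMSO well profiles, computed feature by feature.
    mad = (dmso_wells - center).abs().median(axis=0)
    mad = mad.replace(0, np.nan)  # guard against constant features
    return (cells - center) / mad
```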

Methodology

Preprocessing

There are several sources of error in extracting raw image-based single-cell profiles that may dramatically influence further steps in the pipeline. Consequently, preprocessing strategies are necessary to obtain more meaningful and stable results. For instance, some of the wells containing DMSO had intensity issues. This may mislead CellProfiler's algorithms and cause them to extract noisy information. A noisy DMSO well can affect all the information of a plate, as it is used for normalization. Figs 2 and 3 show images of some misleading DMSO wells.

Fig 2. Plate 24277, well A13, containing DMSO, is blurry.


Fig 3. Another outlier DMSO well in plate 24293, showing a huge clump in the upper right corner of the left image.


Some wells had an extraordinarily high cell area. Investigation of their raw images made it clear that this was a CellProfiler segmentation error. These images were mostly of toxic wells containing a very small number of cells. For example, plate 24293, well E19, has a cell area of nearly 3000 after normalization (Fig 4).

Fig 4. Plate 24293, well E19, with a cell area of nearly 3000.


The noise in the images, especially in DMSO wells, can be a source of error in the CellProfiler pipeline, which underscores the importance of data cleaning in image-based profiling enhancement. The preprocessing steps remove drugs with unrecognizable effects on the features (outlier drugs) as well as toxic drugs.

Removing outlier drugs

Not all compounds produce a meaningful change in the features extracted by CellProfiler. Therefore, a drug selection method must be used to keep only those drugs with discernible effects on the features [13]. First, a median profile for each well is calculated. Then, the wells are grouped by their "Metadata_broad_sample" column, so that each group contains the profiles of a specific compound. The similarity measure is Pearson's correlation, which reflects the linear relationship between two variables: the more similar the profiles of two wells are, the higher the correlation coefficient. To select drugs with a meaningful effect on the features, a null distribution is required. The null distribution is defined based on the correlations between non-replicates; that is, median correlations among groups of k different compounds constitute samples of the null distribution, where k is the number of technical replicates in the experiment (k = 3 in this work). Compounds whose median replicate correlation is greater than the 95th percentile of the null distribution are selected for further steps [13, 14], as they have a meaningful effect on the cells. We will refer to the drugs removed in this step as "outlier drugs" in the following sections. In this experiment, there were 1551 different compounds in total, and based on this method, only 455 of them were kept.
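A minimal sketch of this selection procedure follows, assuming a table of well-level profiles indexed by compound id with k replicate wells per compound; the function and variable names are hypothetical:

```python
import numpy as np
import pandas as pd

def median_pairwise_corr(mat: np.ndarray) -> float:
    """Median Pearson correlation over all pairs of rows."""
    c = np.corrcoef(mat)
    return float(np.median(c[np.triu_indices_from(c, k=1)]))

def select_reproducible_compounds(profiles: pd.DataFrame, k: int = 3,
                                  n_null: int = 10000, seed: int = 0) -> list:
    """Keep compounds whose median replicate correlation exceeds the 95th
    percentile of a null built from non-replicate (different-compound) groups."""
    rng = np.random.default_rng(seed)
    compounds = profiles.index.unique().to_numpy()

    # Null distribution: median correlation among k wells of k *different* compounds.
    null = np.empty(n_null)
    for i in range(n_null):
        picks = rng.choice(compounds, size=k, replace=False)
        rows = np.stack([
            profiles.loc[[c]].sample(1, random_state=int(rng.integers(1 << 31))).iloc[0].to_numpy()
            for c in picks
        ])
        null[i] = median_pairwise_corr(rows)
    threshold = np.percentile(null, 95)

    return [c for c in compounds
            if median_pairwise_corr(profiles.loc[[c]].to_numpy()) > threshold]
```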

Removing toxic drugs

While investigating the features of each well, we encountered some wells with an extraordinarily low number of cells, fewer than 10 in a whole well, as shown in Fig 5. This issue could have several sources. It could result from poor growth conditions, such as a lack of nutrients in a well, causing cell death. Alternatively, it could result from the drug itself, since some drugs are toxic in nature and cause cell death.

Fig 5. Histogram of the cell count in each well.


A toxic drug may cause a false result in the final odds ratio plot. To find toxic drugs, we computed the median cell count across the wells treated with the same drug and removed the 5% of drugs with the lowest median cell count. The cell count histogram after removing toxic drugs is shown in Fig 6.
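A minimal sketch of this filter, assuming a per-well metadata table with hypothetical column names:

```python
import pandas as pd

def remove_toxic_drugs(well_meta: pd.DataFrame, frac: float = 0.05) -> pd.DataFrame:
    """Drop the `frac` of drugs with the lowest median per-well cell count.

    well_meta: one row per well, with hypothetical columns
               'Metadata_broad_sample' (drug id) and 'cell_count'.
    """
    med = well_meta.groupby("Metadata_broad_sample")["cell_count"].median()
    cutoff = med.quantile(frac)  # 5th percentile of the median cell counts
    toxic = med[med <= cutoff].index
    return well_meta[~well_meta["Metadata_broad_sample"].isin(toxic)]
```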

Fig 6. Histogram of the cell count after removing toxic drugs.


After doing so, the odds ratio decreased. This indicates that toxic drugs have strongly correlated features, so removing them from the experiment lowers the odds ratio and makes the final result more realistic. Fig 7 shows the correlation between profiles of cells affected by toxic drugs, and Fig 8 shows the correlation between toxic and non-toxic drugs. Comparing these figures makes it clear that toxic drugs are highly correlated with one another.

Fig 7. Correlation between toxic drugs.


Fig 8. Correlation between toxic drugs and non-toxic drugs.


Toxic drugs are in rows and non-toxic ones are in columns.

Evaluation metric

Since similarity metrics reveal connections among image-based profiles, choosing a suitable one can enhance the analysis. As in similar studies, Pearson's correlation coefficient was used in this work.

Consider two wells, each treated with a compound that has some particular mechanism of action, or "MoA" for short. The final purpose of this experiment is to determine whether high correlations between the profiles of two wells are associated with similarity in the MoAs of the compounds used to treat the cells in those wells. In other words, the hypothesis is that if the profiles of two compounds are highly correlated, they probably share the same MoA. To test this hypothesis, a one-sided Fisher's exact test is used.

Suppose Fisher's test supports the hypothesis. In that case, this process can be used in drug discovery, as it is possible to run high-throughput experiments and use the correlation of profiles to predict the MoA of unknown compounds. In this experiment, the odds ratio is used to assess how strongly the data support the hypothesis. The contingency table used for Fisher's test in our experiment is shown in Fig 9.

Fig 9. Odds ratio.


Calculation of odds ratio.

We vary the value of "k" in Fig 9 to draw a plot whose y axis is the odds ratio and whose x axis is the value of "k", i.e., the percentage of pairs with the highest correlations.

If the hypothesis is true, we expect a plot with a negative slope; that is, as the correlation condition is tightened, the pairs should increasingly share the same MoA.
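For concreteness, the following is a minimal sketch of this computation with SciPy, assuming boolean arrays that mark, for every compound pair, whether it falls in the top k percent of correlations and whether the two compounds share a MoA; the array names are hypothetical:

```python
import numpy as np
from scipy.stats import fisher_exact

def odds_ratio_top_k(is_top_k: np.ndarray, same_moa: np.ndarray):
    """2x2 contingency table over all compound pairs:
    rows = in top k% of correlations or not, columns = same MoA or not."""
    a = np.sum(is_top_k & same_moa)    # top-k pairs with the same MoA
    b = np.sum(is_top_k & ~same_moa)   # top-k pairs with different MoAs
    c = np.sum(~is_top_k & same_moa)
    d = np.sum(~is_top_k & ~same_moa)
    table = [[a, b], [c, d]]
    odds, p_value = fisher_exact(table, alternative="greater")  # one-sided test
    return odds, p_value
```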

Histogram-based outlier score

By investigating images of some wells and the histograms of multiple features, it was clear that some profiles are outliers. This could be caused by issues in the pipeline or in the imaging process. There are several ways to detect outliers in a dataset; the three main setups for outlier detection are supervised, semi-supervised, and unsupervised. The first two setups require labeled data [15]. Since outlier cells are not labeled in this dataset, unsupervised methods must be adopted. The output of an outlier detection method can be either a score or a label (normal or outlier) assigned to each instance individually.

There are various approaches to unsupervised outlier/anomaly detection. Well-known methods include the Local Outlier Factor (LOF) [16], nearest-neighbor-based algorithms [17], the Clustering-Based Local Outlier Factor (CBLOF) [18], and LDCOF [19]; these are computationally expensive. A fast and efficient unsupervised alternative is the histogram-based approach [20]. This approach assumes feature independence, so outlier detection is most appropriately applied after regressing out, when the features become more independent. The Histogram-Based Outlier Score of an instance is calculated by building a histogram for each feature and aggregating them. The proportion of outliers, i.e., the contamination of the dataset, is a hyperparameter. In this task, histogram-based outlier detection is applied to each plate separately with a contamination of 0.01, and the outlier cells are removed. A useful Python library for this step is PyOD [21]. A sensitivity analysis on the contamination threshold was performed to assess the stability of the results. As shown in Fig 10, the proposed method is robust to slight changes in this hyperparameter.

Fig 10. Sensitivity analysis.


The proportion of outliers in each plate is set to different amounts of 0.009, 0.01, and 0.011 after applying preprocessing and before regressing out.
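In code, a minimal sketch of this per-plate step with PyOD (the function and array names are hypothetical):

```python
import numpy as np
from pyod.models.hbos import HBOS

def remove_outlier_cells(plate_features: np.ndarray,
                         contamination: float = 0.01) -> np.ndarray:
    """Drop cells flagged as outliers by HBOS, fitted separately per plate.

    plate_features: (n_cells, n_features) matrix for a single plate.
    """
    detector = HBOS(contamination=contamination)
    detector.fit(plate_features)
    inliers = detector.labels_ == 0  # PyOD labels: 0 = inlier, 1 = outlier
    return plate_features[inliers]
```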

Regressing-out

The majority of features are affected by a feature named Cells_AreaShape_Area, which expresses the space occupied by each cell. We used the regress-out technique to reduce the impact of this feature on the rest, in order to strengthen the meaning of each feature individually. For this purpose, the relation between each feature and Cells_AreaShape_Area was modeled for all cells in one plate by fitting a linear equation without an intercept. Let x be the Cells_AreaShape_Area feature of all single cells in a plate and y another feature from the feature set of those cells; that is, xi is the area and yi the other feature of the ith cell in that plate. A linear regression without intercept fits a line of the form y = ax. After fitting the model for each feature against the cell area, that feature column was replaced by the residuals of the regression. The residual is the difference between a data point and the regression prediction:

Residual = real value - predicted value (1)

This fitting and replacement must be performed separately for each feature except cell area, and for each plate. Using this technique, significant improvement was observed in the final results and in the odds ratio across different top-k percentages.
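A minimal sketch of this per-plate regress-out follows; the closed-form slope of a no-intercept least-squares fit is a = Σ xi yi / Σ xi², and apart from Cells_AreaShape_Area the names are hypothetical:

```python
import pandas as pd

def regress_out_area(plate: pd.DataFrame,
                     area_col: str = "Cells_AreaShape_Area") -> pd.DataFrame:
    """Replace every feature by its residual from a no-intercept
    regression on the cell area, one plate at a time."""
    out = plate.copy()
    x = plate[area_col].to_numpy()
    for col in plate.columns:
        if col == area_col:
            continue
        y = plate[col].to_numpy()
        a = (x @ y) / (x @ x)    # least-squares slope of y = a * x
        out[col] = y - a * x     # residual = real value - predicted value
    return out
```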

Validation methods

One important step in validating the results of the proposed method thoroughly is to design some experiments in order to assess the significance of the observed enhancements. We designed three experiments to this end.

P-value assessment. As explained in the Methodology section, Fisher's exact test is used to test whether highly correlated compound profiles share the same MoA. The P-value of Fisher's test can be used to determine how significant the results are.

HBOS significance assessment. If HBOS is a significant step in the proposed method, changing its hyperparameter should not change the overall trend of the odds ratio plot, and the resulting curves should still lie above that of the base method.

Per-drug validation. In a drug discovery pipeline, there is an alternative strategy to obtain candidate compound matches. In this strategy, a compound with a known primary MoA is used as a query to find compounds that match its profile and, hopefully, its primary MoA. Note that this setup differs from the earlier one, in which only the top correlated compound pairs in the entire experiment are considered hits and followed up. The difference is that, when following up only the strongest matches, the list of hits may involve only a few compounds, whereas in the alternative strategy all compounds are involved in the test. The alternative strategy is therefore a more difficult test to pass. In it, Fisher's test is applied per drug; in other words, the contingency table is calculated for each drug to check whether, for a specific compound, the top k compounds with the highest correlation share the same MoA. The P-values of the Fisher's tests for all compounds are then adjusted using the Bonferroni correction to account for multiple testing. The number of drugs with P-values smaller than various thresholds is counted for both the base profiles after preprocessing and our proposed method for profile enhancement. Comparing these values indicates how significant the proposed method is.
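A minimal sketch of this per-drug test, assuming a compound-by-compound correlation matrix and a vector of MoA labels (both names hypothetical):

```python
import numpy as np
from scipy.stats import fisher_exact

def per_drug_p_values(corr: np.ndarray, moa: np.ndarray, top_k: int = 10) -> np.ndarray:
    """One Fisher's exact test per query drug: do its top-k correlated
    compounds share its MoA more often than the remaining compounds do?"""
    n = corr.shape[0]
    p_values = np.empty(n)
    for q in range(n):
        others = np.delete(np.arange(n), q)           # every compound but the query
        order = others[np.argsort(-corr[q, others])]  # most correlated first
        hits, rest = order[:top_k], order[top_k:]
        table = [[np.sum(moa[hits] == moa[q]), np.sum(moa[hits] != moa[q])],
                 [np.sum(moa[rest] == moa[q]), np.sum(moa[rest] != moa[q])]]
        _, p_values[q] = fisher_exact(table, alternative="greater")
    return np.minimum(p_values * n, 1.0)              # Bonferroni correction
```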

Deep learning

Recently, deep neural networks have shown great promise in representation learning across various problems. Several methods were studied with the aim of learning more meaningful representations that increase the similarity between embeddings of drugs with the same mechanism of action. Both supervised and unsupervised approaches were tested; a brief explanation of each is given next.

Classification with HSIC loss. A simple model is designed for this task. The input to the network is the single-cell features after feature selection and normalization, as explained previously, and the output the model must predict is the drug with which the input cell was treated. We expect the network to learn a useful representation of cells through this auxiliary task of predicting drug identity from cellular measurements. To prepare the training data, five mechanisms of action were chosen, and about 30 drugs with these MoAs were selected at random. All cells, across various pairs of plates and wells, treated with these 30 compounds were collected to form the training data. An important point in the train-test split is to ensure that cells from the same well/drug do not appear in both the training and the test/validation data.

The purpose of this random selection of compounds, rather than using all of them in training and designing the network, is to assess generalization to unseen data, especially new compounds whose mechanisms and similarities to others are still unknown. Let (X, y) be a data point such that X ∈ ℝ^470 and y ∈ ℝ^30. As discussed earlier, each single-cell profile has about 470 features (slightly fewer than 470) extracted by CellProfiler, and the output is a probability for each of the 30 compounds.

The loss function used for this task is the Hilbert-Schmidt Independence Criterion (HSIC) loss [22], with the aim of training on data with a limited number of MoAs and compounds while generalizing to profiles the model has not seen. This loss function was recently proposed to facilitate out-of-distribution generalization. A few changes to the authors' implementation of the HSIC loss are needed to use it for multi-class classification.

The model consists of three dense layers. ReLU is used as the activation function for all layers, and a softmax is applied to the outputs of the last layer. After training and convergence of the loss on the training data, the trained model can be used to extract a new representation for every single cell: the hidden layer before the class probabilities (the red layer in Fig 11) is our new representation of the input single-cell profile.

Fig 11. Simple classification network architecture.

The new representation from the classification task, which is fed into the rest of the pipeline, is created by passing single-cell profiles through the trained network and taking the red layer.
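For illustration, a minimal sketch of such a three-dense-layer classifier in PyTorch; the hidden widths are assumptions, since the paper does not report them, and the HSIC loss would replace the usual cross-entropy during training:

```python
import torch
import torch.nn as nn

class DrugClassifier(nn.Module):
    """Three dense layers with ReLU; the penultimate activations
    (the "red layer") serve as the learned single-cell representation."""
    def __init__(self, n_features: int = 470, n_drugs: int = 30,
                 hidden: int = 256, code: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, code), nn.ReLU(),
        )
        self.head = nn.Linear(code, n_drugs)

    def forward(self, x):
        z = self.backbone(x)                         # new representation
        probs = torch.softmax(self.head(z), dim=-1)  # class probabilities
        return probs, z
```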

After obtaining the representation of every single cell by passing it through the trained model, all previous steps, such as aggregation, calculating the correlation between compound profiles, and plotting, were repeated. The results from the new representations were not promising compared to raw profiles and did not improve the performance.

Autoencoders. Recently, autoencoders [23] have achieved remarkable success as feature extraction methods. A basic autoencoder has two parts, an encoder and a decoder: the output of the encoder is a reduced representation of the input data (the code), and the decoder reconstructs the original input from this latent code. Later variations of autoencoders were proposed, each addressing a particular issue, such as the variational autoencoder (VAE) [24], denoising autoencoder (DAE) [25], sparse autoencoder (SAE) [26], and relational autoencoder [27]. In this section, preparing the training data is quite similar to the classification task, except that, since this method is unsupervised, labels (compound names) are not needed. The decoder must reconstruct the input of the encoder, which is a single-cell profile. After training and convergence of the mean squared loss on the training data, the trained model can be used to extract a new representation for every single cell: the code layer, i.e., the output of the encoder (the red layer in Fig 12), is the new representation of the input single-cell profile.

Fig 12. Simple autoencoder network architecture.

The new representation from the autoencoder, which is fed into the rest of the pipeline, is created by passing single-cell profiles through the trained network and taking the code (red) layer.
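Analogously, a minimal autoencoder sketch; the depth and code size here are assumptions, as the paper reports trying several configurations:

```python
import torch.nn as nn

class CellAutoencoder(nn.Module):
    """Basic autoencoder over single-cell profiles; the encoder output
    (the code, or "red layer") is the learned representation."""
    def __init__(self, n_features: int = 470, code: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                                     nn.Linear(256, code))
        self.decoder = nn.Sequential(nn.Linear(code, 256), nn.ReLU(),
                                     nn.Linear(256, n_features))

    def forward(self, x):
        z = self.encoder(x)           # code layer: the new representation
        return self.decoder(z), z

# Trained with the mean squared reconstruction error:
#   loss = nn.MSELoss()(model(x)[0], x)
```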

After obtaining the representation of every single cell by passing it through the trained model, all previous steps for calculating the odds ratio were repeated. Basic AE, VAE, SAE, RAE, and DAE with different depths and code sizes were tested, but no significant improvement was observed in the final results. The convergence plot for the SAE is included in S3 Fig.

Mixup. According to [28], one way to improve model generalization is to construct virtual training data:

x̂ = λxi + (1 - λ)xj, where xi, xj are raw input vectors, (2)

ŷ = λyi + (1 - λ)yj, where yi, yj are one-hot label encodings. (3)

As proposed in the paper, mixup is based on the prior knowledge that linear interpolations of feature vectors should lead to linear interpolations of the associated targets. This technique was used in the classification setup covered earlier in the HSIC section. It led to better accuracy on the test set of the auxiliary task, but no significant improvement in the odds ratio was observed.
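A minimal sketch of the mixup augmentation, with λ drawn from a Beta(α, α) distribution as in the original paper (the value of α is an assumption):

```python
import numpy as np

def mixup_batch(x: np.ndarray, y: np.ndarray, alpha: float = 0.2):
    """Mixup: convex combinations of random example pairs and of their
    one-hot labels, with lambda ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    perm = np.random.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]   # Eq. (2)
    y_mix = lam * y + (1.0 - lam) * y[perm]   # Eq. (3)
    return x_mix, y_mix
```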

Experiments and results

Data cleaning

The enrichment, among the top k percent of compound pairs with the highest profile correlation, of pairs having the same MoA is used for evaluation. To measure the enrichment, the ratio of pairs with the same MoA to those with different MoAs is calculated and compared to the same ratio among weakly correlated pairs. This comparison is summarized in a single number, the odds ratio, which is described in Materials and Methods. It essentially indicates how many times more often we expect to find biologically related compound pairs among the top k percent of connections than among the remaining connections.

We first investigated a feature transformation step that aims to remove the cell area information implicitly contained in other features. This is particularly useful given that many morphological features, such as perimeter or texture, are affected by the cell area. We simply fit a linear regression to predict any given feature from the cell area at the single-cell level. The residual, or error, of such a regression represents the information that is uncorrelated with the cell area. This process, which we call "regressing out," helps balance the effect of cell area in the profiles. Fig 13, left, shows that the odds ratio improves slightly as a result of this step. We make comparisons in two cases: in the first, all compounds in the dataset are included, while in the second, only the compounds with a high replicate correlation are retained. This filtering has previously been shown to be a key step in removing treatments that do not exhibit any strong biological effect through the lens of the image-based assay [4]. We call such compounds "outliers" throughout the paper. Notably, the odds ratio jumps to a much higher level, from 9 to above 40, when such compounds are removed. In this case, regressing out the cell area results in a more consistent improvement in the odds ratio across various values of k, as illustrated in Fig 13, right. We next investigated the effect of removing overly toxic compounds, which strongly affect cell viability. We would expect a much smaller cell count in the wells treated with such compounds, and the resulting profiles might be less reliable because of the small cell count. In addition, the profiles of toxic compounds are so distinctive that they artificially inflate the odds ratio. This toxicity-driven increase in the odds ratio is not valuable from a practical perspective, given the limited clinical use of overly toxic compounds and, consequently, their limited value in drug discovery. Therefore, we opted to remove such compounds before making further evaluations. Fig 14, left, shows that regress-out gives a more pronounced improvement in the odds ratio when both outlier and toxic compounds are removed. This could be partly due to the unreliability of regress-out under small sample sizes, which here correspond to small cell counts.

Fig 13. Effect of regressing out the cell area.


The left plot shows the increase in the odds ratio obtained by regressing out the cell area from all other features at different k percentages when no other data cleaning method is applied. The right plot shows that regressing out improves the result when only the outlier drugs, which exhibit weak phenotypes, were removed.

Fig 14. Effect of regressing-out and cell outlier detection.


The left plot shows the increase in the odds ratio obtained by regressing out the cell area from all other features at different k percentages when outlier and toxic drugs were removed. The improvement in the odds ratio becomes more significant when the toxic drugs are excluded from the analysis. The right plot shows that using histogram-based outlier detection improves the result in the same baseline case, where the outlier and toxic drugs were removed.

Finally, we applied outlier detection at the single-cell level through a histogram-based technique called "HBOS." Building on our previous improvements, we further removed outlier cells prior to making the profiles. The odds ratio improves at all thresholds, as shown in Fig 14, right. It turns out that cell outlier detection is more effective at improving overall profile quality, while regress-out mainly improves the very top connections.

To assess the combined effect of regress-out and cell outlier removal, we take as baseline the non-toxic and inlier drugs with one of these methods applied, and then apply the other method. Regress-out seems to provide an additional marginal improvement over cell outlier removal (Fig 15, left). In contrast, cell outlier detection gives a larger additional enhancement over regress-out, as shown in Fig 15, right. We conclude that both methods are valuable in cleaning the profiles and provide independent improvements, but cell outlier removal is somewhat more effective. All techniques are described in more detail in the Materials and Methods section.

Fig 15. The combined effect of outlier detection and regressing-out.


The left plot shows the effect of regressing out when all other preprocessing steps (outlier and toxic drug removal, and cell outlier removal) are applied. The right plot shows the effect of cell outlier removal when all other steps, including outlier and toxic drug removal and regress-out, have been applied. Cell outlier removal seems to provide a more consistent improvement across values of k compared to the regress-out technique.

Validation results

We conducted three different experiments to validate the significance of the odds ratio results.

Fig 16 shows the log(P-value) of the Fisher's tests whose odds ratios appear in the results section. We illustrate two situations: first, base profiles (only outlier and toxic drugs removed), and second, enhanced profiles (cell area regressed out from other features and HBOS applied). According to Fig 16, tightening the condition, i.e., decreasing k, yields lower P-values, indicating more confidence in both situations. The P-values of the proposed method are lower than those of the base method, confirming the significance of our proposed method relative to the base method.

Fig 16. P-value comparison between base profiles and our proposed enhanced profiles in different k percentages.


In another experiment, we tested the effectiveness of the HBOS outlier detection method with different hyperparameters. In all of the curves in Fig 17, all preprocessing steps are applied to the profiles, but the HBOS contamination parameter differs among curves. The green curve corresponds to profiles with no HBOS outlier detection. It is clear that HBOS with any of the tested hyperparameters improves the profiles compared to not applying it.

Fig 17. HBOS significance analysis.


HBOS with different proportions of outlier removal has outperformed the case in which no outliers were removed.

In the last experiment, the effectiveness of the proposed profile enhancement method is tested per drug through the alternative strategy discussed in the Validation methods section. Here, we plot the number of queried compounds whose 10 most correlated compound matches mostly share the queried compound's MoA. In Fig 18, this count is plotted against the various P-value thresholds used to decide significance. As can be seen, the proposed method improves significantly over the baseline.

Fig 18. Comparison of drug counts for each P-value threshold.


Drug counts for each P-value threshold are compared between the base profiles and our proposed method (thresholds in the plot are divided by the total number of drugs, reflecting the Bonferroni correction used in the actual experiments).

Analysis of deep learning methods

Unfortunately, none of the deep representation learning methods described in Materials and Methods led to a better odds ratio (S1 and S2 Figs). There are some explanations for these observations. The most important factor preventing representation learning could be information loss resulting from using CellProfiler's output. CellProfiler does capture important features, but these are mostly features designated as important by biologists, and there could be hidden patterns in the raw images that a deep neural network is capable of learning but that CellProfiler does not capture. For example, the texture information in the CellProfiler features is limited to certain Haralick texture descriptors, while a deep convolutional model could learn richer texture patterns. Another source of information loss in this work is that the raw images contain 16 bits of information per pixel; they could therefore be modeled more accurately if fed directly to a deep model, compared to the CellProfiler features, which are often of lower precision.

Another reason the networks do not improve the results is that CellProfiler's outputs are problematic for some images of low quality or with a small number of cells, as explained in the previous sections. All these observations and hypotheses lead to one conclusion: it is challenging for neural networks to improve upon the CellProfiler output, and they should instead be trained on the raw images to capture high-quality information.

Conclusion

High throughput assays play a vital role in carrying out biomedical experiments. CellProfiler is a common tool used to quantify the data produced by these assays, but many sources of error might affect the data quality in this pipeline. In this work, we mainly focused on data cleaning steps that could enhance image-based profiles extracted from CellProfiler.

Our experiments indicate that data cleaning methods strongly affect the quality of the extracted features for identifying the mechanisms of action of different drugs, resulting in a higher odds ratio. Removing drugs whose intra-correlation is lower than their inter-correlation improves the odds ratio, as it keeps the drugs with more meaningful features. Removing toxic drugs (drugs that cause cell death) decreases the odds ratio but makes the result more realistic and meaningful.

Due to the different sources of error in the imaging and feature extraction pipeline, cell-level outlier detection is an important step in enhancing the profile of each specific compound; HBOS outlier detection was used in this work. Regressing the cell area out of all other features is another helpful data cleaning step, since the cell area widely affects all other features, and by regressing it out, other useful information can be better represented in the resulting profile. Both techniques proved to effectively improve the odds ratio.

Deep representation learning methods applied on top of CellProfiler features could not achieve results better than the raw features, mainly because of information loss in the CellProfiler measurement step. However, such models are well suited to capturing hidden patterns and may be useful if applied to the raw images instead.

The proposed methods of this work are readily applicable to CellProfiler-extracted features, improve the quality of image-based profiles, and can enhance the odds ratio significantly. They can therefore be considered a pivotal part of the profile preprocessing step in typical analysis workflows.

Supporting information

S1 Fig. Odds ratio for profiles that are obtained through representation learning using a denoising autoencoder.

A denoising autoencoder (DAE) with one hidden layer in the encoder and decoder and a code size of 200 was trained. The compound-level profiles used in the last step come from aggregating the DAE representations of cell measurements. The odds ratios at different percentages are too low and fluctuate heavily compared to the cases investigated in the results section.

(TIF)

S2 Fig. Odds ratio for profiles that are obtained through a supervised representation learning using the mixup technique.

A simple fully connected network, regularized by applying the mixup technique, was trained. In the first step, representations extracted from this network are used instead of raw cell-level profiles. Based on this plot, the mixup technique does not effectively improve upon the baseline discussed earlier in the results section.

(TIF)

S3 Fig. Convergence plot of loss in the train and validation sets during training sparse autoencoder to capture the representation of raw profiles.

This figure is a sample of the convergence plots of the deep methods in this work. Deep networks were tried for representation learning but led to no improvement in the odds ratio. The loss clearly decreases during training, and no overfitting can be detected in the plots. This plot specifically shows the sparse autoencoder's training and validation losses.

(TIF)

Data Availability

URL: http://gigadb.org/dataset/view/id/100351; DOI: 10.5524/100351.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Pegoraro G, Misteli T. High-Throughput Imaging for the Discovery of Cellular Mechanisms of Disease. Trends in Genetics. 2017;33(9):604–615. doi: 10.1016/j.tig.2017.06.005
  • 2. Caicedo JC, Cooper S, Heigwer F, Warchal S, Qiu P, Molnar C, et al. Data-analysis strategies for image-based cell profiling. Nature Methods. 2017;14(9):849–863. doi: 10.1038/nmeth.4397
  • 3. Perlman ZE, Slack MD, Feng Y, Mitchison TJ, Wu LF, Altschuler SJ. Multidimensional Drug Profiling By Automated Microscopy. Science. 2004;306(5699):1194–1198. doi: 10.1126/science.1100709
  • 4. Ljosa V, Caie PD, ter Horst R, Sokolnicki KL, Jenkins EL, Daya S, et al. Comparison of Methods for Image-Based Profiling of Cellular Morphological Responses to Small-Molecule Treatment. Journal of Biomolecular Screening. 2013;18(10):1321–1329. doi: 10.1177/1087057113503553
  • 5. Loo LH, Lin HJ, Steininger RJ, Wang Y, Wu LF, Altschuler SJ. An approach for extensibly profiling the molecular states of cellular subpopulations. Nature Methods. 2009;6(10):759–765. doi: 10.1038/nmeth.1375
  • 6. Qiu M, Zhou B, Lo F, Cook S, Chyba J, Quackenbush D, et al. A cell-level quality control workflow for high-throughput image analysis. BMC Bioinformatics. 2020;21(1):280. doi: 10.1186/s12859-020-03603-5
  • 7. Malo N, Hanley JA, Cerquozzi S, Pelletier J, Nadon R. Statistical practice in high-throughput screening data analysis. Nature Biotechnology. 2006;24(2):167–175. doi: 10.1038/nbt1186
  • 8. Prastawa M. A brain tumor segmentation framework based on outlier detection. Medical Image Analysis. 2004;8(3):275–283. doi: 10.1016/j.media.2004.06.007
  • 9. Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons; 1987. Available from: http://doi.wiley.com/10.1002/0471725382
  • 10. Rämö P, Sacher R, Snijder B, Begemann B, Pelkmans L. CellClassifier: supervised learning of cellular phenotypes. Bioinformatics. 2009;25(22):3028–3030. doi: 10.1093/bioinformatics/btp524
  • 11. Rohban MH, Abbasi HS, Singh S, Carpenter AE. Capturing single-cell heterogeneity via data fusion improves image-based profiling. Nature Communications. 2019;10(1):2082. doi: 10.1038/s41467-019-10154-8
  • 12. Bray MA, Gustafsdottir SM, Rohban MH, Singh S, Ljosa V, Sokolnicki KL, et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay. GigaScience. 2017;6(12). doi: 10.1093/gigascience/giw014
  • 13. Rohban MH, Singh S, Wu X, Berthet JB, Bray MA, Shrestha Y, et al. Systematic morphological profiling of human gene and allele function via Cell Painting. eLife. 2017;6:e24060. doi: 10.7554/eLife.24060
  • 14. Koch CM, Chiu SF, Akbarpour M, Bharat A, Ridge KM, Bartom ET, et al. A Beginner's Guide to Analysis of RNA Sequencing Data. American Journal of Respiratory Cell and Molecular Biology. 2018;59(2):145–157. doi: 10.1165/rcmb.2017-0430TR
  • 15. Chandola V, Banerjee A, Kumar V. Anomaly Detection: A Survey. ACM Computing Surveys. 2009;41(3):1–58. doi: 10.1145/1541880.1541882
  • 16. Breunig MM, Kriegel HP, Ng RT, Sander J. LOF: Identifying Density-Based Local Outliers. ACM SIGMOD Record. 2000;29(2):93–104. doi: 10.1145/335191.335388
  • 17. Ramaswamy S, Rastogi R, Shim K. Efficient Algorithms for Mining Outliers from Large Data Sets. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data; 2000.
  • 18. He Z, Xu X, Deng S. Discovering cluster-based local outliers. Pattern Recognition Letters. 2003;24(9):1641–1650. doi: 10.1016/S0167-8655(03)00003-5
  • 19. Amer M. Comparison of Unsupervised Anomaly Detection Techniques; 2016.
  • 20. Goldstein M, Dengel A. Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm. In: KI-2012: Poster and Demo Track; 2012.
  • 21. Zhao Y, Nasrullah Z, Li Z. PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of Machine Learning Research. 2019;20(96):1–7.
  • 22. Greenfeld D, Shalit U. Robust Learning with the Hilbert-Schmidt Independence Criterion. arXiv:1910.00270 [cs, stat]; 2020.
  • 23. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–536. doi: 10.1038/323533a0
  • 24. Kingma DP, Welling M. Auto-Encoding Variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations (ICLR); 2014.
  • 25. Vincent P, Larochelle H, Bengio Y, Manzagol PA. Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning (ICML); 2008. p. 1096–1103. Available from: http://portal.acm.org/citation.cfm?doid=1390156.1390294
  • 26. Deng J, Zhang Z, Marchi E, Schuller B. Sparse Autoencoder-Based Feature Transfer Learning for Speech Emotion Recognition. In: 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction; 2013. p. 511–516.
  • 27. Meng Q, Catchpoole D, Skillicorn D, Kennedy PJ. Relational autoencoder for feature extraction. In: 2017 International Joint Conference on Neural Networks (IJCNN); 2017. p. 364–371.
  • 28. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. mixup: Beyond Empirical Risk Minimization. In: International Conference on Learning Representations (ICLR); 2018. Available from: https://openreview.net/forum?id=r1Ddp1-Rb

Decision Letter 0

Thippa Reddy Gadekallu

10 Dec 2021

PONE-D-21-28958
Data cleaning for image-based profiling enhancement
PLOS ONE

Dear Dr. Rohban,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR: Based on the comments received from the reviewers and my own observation, I recommend major revisions for the article. Authors need not consider the references suggested by the reviewers if they are irrelevant or not required to strengthen the references.

==============================

Please submit your revised manuscript by Jan 24 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Thippa Reddy Gadekallu

Academic Editor

PLOS ONE

Journal requirements:

1. When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

3. Thank you for stating the following in your Competing Interests section: 

“NO authors have competing interests”

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

 This information should be included in your cover letter; we will change the online submission form on your behalf.

4. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

5. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: A Data cleaning based on deep learning approach for image-based profiling enhancement is presented here. Topic is interesting. The manuscript is well-written. I would like to mention some comments to improve the quality of the manuscript.

-Abstract: please provide a conclusion sentence in its last sentence. The following sentence is not proper:

We also examined unsupervised and weakly-supervised deep learning based methods to reduce the feature dimensionality, and finally suggest possible avenues for future research

-Introduction: I would like to see a comprehensive literature review about papers related to the topic.

-Section 2: It is recommended to add a methodology section

----Data used

----Methodology

------deep learning approach

------pre-processing

------validation

------ so on

-Section 3: please insert concepts in methodology section. Here you must mention results of the method.

-If the applied method has unknown parameters, I would like to see a sensitivity analysis. Please see the following paper.

A fuzzy-ga based decision making system for detecting damaged buildings from high-spatial resolution optical images

- Please insert convergence diagram of deep learning approach in the text.

-Validation regarding outputs should be done.

-The number of references is not adequate for a journal paper. Please mention at least 30 papers in the section.

Reviewer #2: The proposed research discusses a data cleaning approach for image-based profiling enhancement. It is a novel research area. However, the following concerns should be addressed.

• The title needs revision, and accordingly abstract and conclusion sections should also be updated.

• The results and discussions about how the proposed approach enhances the state of the art is missing. I recommend authors highlight the contribution of the proposed work separately, along with the limitations of the system.

• A detailed, layered design describing the proposed approach should be included for a better understanding of readers.

• The architecture looks very abstract and misses very important details. I recommend authors elaborate design and experimental setup of the proposed approach. The authors have described the materials and methods section, but I recommend including a detailed experimental setup for a better understanding and interpretation of the proposed work.

• The authors have included unnecessary references, which can be removed, and essential references such as https://www.mdpi.com/1424-8220/20/24/7299 and https://www.mdpi.com/1424-8220/20/20/5780 can be referred to.

• I recommend authors to add a separate section to highlight novelties of the proposed work.

Some more changes are needed:

4. All tables should be symmetrical and should follow a similar formatting style.

5. All the equations should be written using a professional equation editor and should use a similar formatting style and numbering.

6. Check the entire manuscript for grammatical and typo errors.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Sharnil Pandya

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 May 4;17(5):e0267280. doi: 10.1371/journal.pone.0267280.r002

Author response to Decision Letter 0


1 Mar 2022

We thank the reviewers for their critical assessment of our work. In the following, we address their concerns point by point.

Reviewer 1

Reviewer Point P 1.1 — Abstract: please provide a conclusion sentence as its last sentence. The following sentence is not proper: We also examined unsupervised and weakly-supervised deep learning based methods to reduce the feature dimensionality, and finally suggest possible avenues for future research

Reply: We removed the mentioned sentence from the abstract and inserted the following as a concluding sentence:

Our experiments indicate that by performing these time-efficient preprocessing steps, image-based profiles can preserve more meaningful information compared to raw profiles. In the end, we also suggest possible avenues for future research.

Reviewer Point P 1.2 — Introduction: I would like to see a comprehensive literature review of papers related to the topic.

Reply: We investigated several papers related to the topic, specifically those related to cell quality control and outlier detection. A summary of these papers is inserted into the introduction section.

Reviewer Point P 1.3 — Section 2: It is recommended to add a methodology section:

• Data used

• Methodology

– deep learning approach

– pre-processing

– validation

– so on

Section 3: please insert concepts in the methodology section. Here you must mention the results of the method.

Reply: We agree with the reviewer on this important point. The “Materials and methods” section is updated as follows:

• Dataset

• Methodology

– Preprocessing

– Removing outlier drugs

– Removing toxic drugs

– Evaluation metric

– Histogram-based outlier score

– Regressing-out

– Validation methods

∗ P-value assessment

∗ HBOS significance assessment

∗ Per-drug validation

– Deep learning

∗ Classification with HSIC loss

∗ Autoencoders

∗ Mixup

We also changed the order of the “Materials and methods” and “Experiments and results” sections. Now, “Materials and methods” comes before “Experiments and results”.

Reviewer Point P 1.4 — If the applied method has unknown parameters, I would like to see a sensitivity analysis.

Reply: The only unknown hyperparameter in the applied method is the HBOS contamination parameter. A sensitivity analysis of this parameter is added to the “Histogram-based outlier score” subsection in the methodology section.
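For illustration, below is a minimal sketch of such a sensitivity sweep, assuming PyOD’s HBOS implementation and a generic cell-by-feature matrix; the contamination grid, the placeholder matrix, and the variable names are illustrative, not the exact settings used in the paper.

```python
# Minimal sketch: sensitivity sweep over the HBOS contamination parameter.
# Assumes PyOD's HBOS and a generic cell-by-feature matrix X; the value
# grid and placeholder data below are not the paper's exact settings.
import numpy as np
from pyod.models.hbos import HBOS

rng = np.random.default_rng(0)
X = rng.random((10000, 50))  # placeholder single-cell feature matrix

for contamination in [0.01, 0.05, 0.10, 0.15, 0.20]:
    clf = HBOS(contamination=contamination)
    clf.fit(X)
    outlier_mask = clf.labels_ == 1      # 1 marks cells flagged as outliers
    X_clean = X[~outlier_mask]
    print(f"contamination={contamination:.2f}: "
          f"removed {outlier_mask.sum()} of {len(X)} cells")
    # Downstream: aggregate X_clean into per-well profiles and recompute
    # the evaluation metric for each contamination setting.
```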

Reviewer Point P 1.5 — Please insert the convergence diagram of the deep learning approach in the text.

Reply: The convergence diagram of the sparse autoencoder is added to the supplementary materials.

Reviewer Point P 1.6 — Validation regarding outputs should be done.

Reply: For validation, we designed three experiments. The description of the experiments is under the “Validation methods” section in materials and methods, and the results of the validation experiments are under “Validation results” in the experiments and results section. In one of the experiments, we investigated the p-value of the Fisher test for different k values. During this investigation, we found that the p-value for k=0.1 is not as significant as for other k values. Therefore, we updated all of the figures and removed k=0.1 from all comparisons.
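As a rough illustration of this p-value assessment, the sketch below applies SciPy’s Fisher exact test to a 2×2 contingency table for one threshold k; the helper function and the counts are hypothetical, standing in for tallies of same-MoA versus different-MoA pairs among the top-k most-similar profile pairs and the remaining pairs.

```python
# Minimal sketch of the Fisher exact test behind the p-value assessment.
# The helper and counts are hypothetical: in practice they come from
# tallying same-MoA vs. different-MoA pairs among the top-k fraction of
# most-similar profile pairs versus the remaining pairs.
from scipy.stats import fisher_exact

def enrichment_test(same_top, diff_top, same_rest, diff_rest):
    """2x2 contingency table: MoA agreement x (top-k vs. rest)."""
    table = [[same_top, diff_top],
             [same_rest, diff_rest]]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value

# Hypothetical counts for one threshold k:
odds, p = enrichment_test(same_top=40, diff_top=60,
                          same_rest=100, diff_rest=800)
print(f"odds ratio = {odds:.2f}, p = {p:.3g}")
```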

Reviewer Point P 1.7 — The number of references is not adequate for a journal paper. Please cite at least 30 papers in the section.

Reply: Few papers have studied this specific area, so the number of references was initially under 30. However, while updating the literature review section, we increased the number of references.

Reviewer 2

Reviewer Point P 2.1 — The title needs revision, and accordingly abstract and conclusion sections should also be updated.

Reply: We changed the title to “Image-based cell profiling enhancement via data cleaning methods”. We also updated the abstract, substituting the last sentence with one that indicates our contribution. We highlighted our novelty in both the introduction and conclusion sections as well.

Reviewer Point P 2.2 — The results and discussions about how the proposed approach enhances the state of the art are missing. I recommend authors highlight the contribution of the proposed work separately, along with the limitations of the system.

Reply: We changed the first figure of the paper, and we think it highlights the contribution more effectively. Furthermore, we changed several sentences in the introduction and conclusion sections so that the enhancement of our method would be more evident.

Reviewer Point P 2.3 — A detailed, layered design describing the proposed approach should be included for a better understanding of readers.

Reply: We changed the first figure of the paper in the introduction section to make our approach clearer and more understandable to the reader.

Reviewer Point P 2.4 — The architecture looks very abstract and misses very important details. I recommend authors elaborate on the design and experimental setup of the proposed approach. The authors have described the materials and methods section, but I recommend including a detailed experimental setup for a better understanding and interpretation of the proposed work.

Reply: We added a methodology section in materials and methods to describe the steps more accurately. For each step, the hyperparameters and even the libraries used are included.

Reviewer Point P 2.5 — I recommend authors add a separate section to highlight novelties of the proposed work.

Reply: As addressed in the previous concerns, we highlighted our novelty in different parts: the abstract, introduction, and conclusion. Furthermore, the new Figure 1 makes the novelty clearer and more understandable to readers.

Reviewer Point P 2.6 — 1- All tables should be symmetrical and should follow a similar formatting style. 2- All the equations should be written using a professional equation editor and should use a similar formatting style and numbering. 3- Check the entire manuscript for grammatical and typo errors.

Reply: All done.

Attachment

Submitted filename: rebuttal (2).pdf

Decision Letter 1

Thippa Reddy Gadekallu

6 Apr 2022

Image-based cell profiling enhancement via data cleaning methods

PONE-D-21-28958R1

Dear Dr. Rohban,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Thippa Reddy Gadekallu

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors answered the questions carefully. The manuscript is mature for the publication. Please consider it for publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Acceptance letter

Thippa Reddy Gadekallu

25 Apr 2022

PONE-D-21-28958R1

Image-based cell profiling enhancement via data cleaning methods

Dear Dr. Rohban:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Thippa Reddy Gadekallu

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Odds ratio for profiles that are obtained through representation learning using a denoising autoencoder.

    A denoising autoencoder (DAE) with one hidden layer in the encoder and decoder and a code size of 200 is trained. Compound-level profiles used for the last step come from aggregating the DAE representations of cell measurements. The odds ratios at different percentages are too low and fluctuate heavily compared to the cases investigated in the results section.

    (TIF)
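    For readers who want a concrete picture of the architecture described in S1 Fig, below is a minimal PyTorch sketch of a one-hidden-layer denoising autoencoder with code size 200; the input dimensionality, hidden width, noise level, and optimizer settings are assumptions rather than the paper’s exact configuration.

```python
# Minimal PyTorch sketch of a one-hidden-layer denoising autoencoder with
# code size 200, as in S1 Fig. Input size, hidden width, noise level, and
# optimizer settings are assumptions, not the exact configuration.
import torch
import torch.nn as nn

class DAE(nn.Module):
    def __init__(self, n_features=600, hidden=512, code=200):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                     nn.Linear(hidden, code))
        self.decoder = nn.Sequential(nn.Linear(code, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_features))

    def forward(self, x):
        noisy = x + 0.1 * torch.randn_like(x)   # corrupt the input
        code = self.encoder(noisy)
        return self.decoder(code), code

model = DAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(128, 600)                       # placeholder cell features
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)         # reconstruct the clean input
loss.backward()
opt.step()
# The 200-dim `code` replaces raw cell measurements before aggregation.
```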

    S2 Fig. Odds ratio for profiles that are obtained through a supervised representation learning using the mixup technique.

    A simple fully connected network regularized with the mixup technique is trained. In the first step, the representations extracted from the network are used instead of raw cell-level profiles. As the plot shows, the mixup technique does not effectively improve upon the baseline discussed earlier in the results section.

    (TIF)
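    For concreteness, below is a minimal PyTorch sketch of mixup regularization in the style of S2 Fig, i.e., training on convex combinations of input pairs and their one-hot labels; the layer sizes, the beta parameter, and the number of MoA classes are illustrative assumptions.

```python
# Minimal PyTorch sketch of mixup regularization for a fully connected
# classifier, as in S2 Fig. Layer sizes, alpha, and the number of MoA
# classes are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

def mixup_batch(x, y_onehot, alpha=0.2):
    """Convex combination of a batch with a shuffled copy of itself."""
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0))
    return (lam * x + (1 - lam) * x[idx],
            lam * y_onehot + (1 - lam) * y_onehot[idx])

net = nn.Sequential(nn.Linear(600, 256), nn.ReLU(), nn.Linear(256, 20))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.randn(128, 600)                        # placeholder cell features
y = torch.eye(20)[torch.randint(0, 20, (128,))]  # one-hot MoA labels
x_mix, y_mix = mixup_batch(x, y)
loss = -(y_mix * net(x_mix).log_softmax(dim=1)).sum(dim=1).mean()
loss.backward()
opt.step()
```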

    S3 Fig. Convergence plot of the loss on the training and validation sets while training the sparse autoencoder to capture a representation of the raw profiles.

    This figure is a sample of the convergence plots of the deep methods in this work. Deep networks were tried for representation learning, but they led to no improvement in the odds ratio. The loss clearly decreases during training, and no overfitting can be detected in the plots. This plot specifically shows the sparse autoencoder training and validation losses.

    (TIF)
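    A minimal sketch of a sparse autoencoder of this kind follows, implemented as a plain autoencoder with an L1 penalty on the code activations; the layer sizes and the sparsity weight are assumptions.

```python
# Minimal sketch of a sparse autoencoder: a plain autoencoder with an L1
# penalty on the code activations. Layer sizes and the sparsity weight
# are assumptions; logging train/validation loss per epoch yields the
# convergence plot shown in S3 Fig.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(600, 200), nn.ReLU())
decoder = nn.Linear(200, 600)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

x = torch.randn(128, 600)              # placeholder raw profiles
code = encoder(x)
recon = decoder(code)
loss = nn.functional.mse_loss(recon, x) + 1e-4 * code.abs().mean()
loss.backward()
opt.step()
```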


    Data Availability Statement

    URL: http://gigadb.org/dataset/view/id/100351 DOI:10.5524/100351.

