Author manuscript; available in PMC: 2019 Feb 1.
Published in final edited form as: J Chemom. 2017 Oct 13;32(2):e2961. doi: 10.1002/cem.2961

Comparative Chemometric Analysis for Classification of Acids and Bases via a Colorimetric Sensor Array

Michael J Kangas 1, Raychelle M Burks 2, Jordyn Atwater 1, Rachel M Lukowicz 1, Billy Garver 3, Andrea E Holmes 1
PMCID: PMC5962272  NIHMSID: NIHMS909419  PMID: 29795964

Summary

With the increasing availability of digital imaging devices, colorimetric sensor arrays are rapidly becoming a simple, yet effective tool for the identification and quantification of various analytes. Colorimetric arrays combine data from many colorimetric sensors, and the multidimensional nature of the resulting data necessitates chemometric analysis. Herein, an 8 sensor colorimetric array was used to analyze select acidic and basic samples (0.5 – 10 M) to determine which chemometric methods are best suited for classification and quantification of analytes within clusters.

PCA, HCA, and LDA were used to visualize the data set. All three methods showed well-separated clusters for each of the acid or base analytes and moderate separation between analyte concentrations, indicating that the sensor array can be used to identify and quantify samples. Furthermore, PCA could be used to determine which sensors showed the most effective analyte identification. LDA, KNN, and HQI were used for identification of analyte and concentration. HQI and KNN could be used to correctly identify the analytes in all cases, while LDA correctly identified 95 of 96 analytes. Additional studies demonstrated that controlling for solvent and image effects was unnecessary for all chemometric methods utilized in this study.

Keywords: Colorimetric sensor array, image analysis, k nearest neighbor analysis (KNN), hit quality index (HQI), linear discriminant analysis (LDA)

Introduction

The examination of digital images in analytical chemistry has increased by more than 87% from 2005 to 2015, because of the increased availability of imaging devices for the detection of analytes.1 In particular, the detection of analytes using their RGB (red, green, blue) color values has led to state-of-the-art colorimetric and fluorometric sensor arrays, often referred to as opto-electronic noses, which discriminate among analytes through an array of dyes that change color upon interaction with the tested substance, similar to the sense of smell.2

Colorimetric arrays are typically composed of 3–40 sensors that can interact with analytes and change color upon molecular interactions.3–6 The patterns of color changes in colorimetric arrays, when analyzed with chemometric methods including Euclidean distance, binary codes, principal component analysis (PCA), hierarchical cluster analysis (HCA), linear discriminant analysis (LDA), and matrix discriminant analysis, can be used for the identification and quantification of different compounds.1–3,7 Although these methods are well described in the literature,2,8 it is not completely understood which methods are best with regard to data visualization, identification of analytes, and determining analyte concentrations, specifically with respect to colorimetric sensor arrays.

A wide array of color changing sensors have been utilized in arrays including pH indicators, metalloporphyrins, solvatochromic dyes, redox indicators, metal salts, and nanoparticles.2 Some color changes are due to molecular interactions including hydrogen transfer reactions and π-π interactions.9 Other noted interactions between sensors and analytes include Lewis acid/base interactions, hydrogen bonding, and dipole-dipole interactions.10 In designing arrays, sensor selection depends on the specific application and the analytes targeted for detection. For example, if the analytes are known to have acid-base properties, pH indicators are the preferred sensors. If metal ions must be detected, such as mercury, then complexometric sensors are favorable.11 Overall, colorimetric arrays are often customized to include a selection of sensors that provide a multitude of different interaction types to improve sensor array versatility.10 An array is considered to be effective if the following criteria apply: a. high selectivity; b. high sensitivity; c. colorimetric data can be analyzed with a statistical analysis method for analyte identification.2 Colorimetric arrays should also have the ability to detect multiple analytes with the fewest numbers of sensors. Sensor selection can also entail criteria such as solubility, stability, cost, toxicity, and magnitude of the color change.3

Colorimetric sensor arrays are effective detection tools for a diverse range of analytes including ions, various small organic compounds, complex mixtures, metal nanoparticles, and even biological molecules.1,2,12 Acids and bases have also served as analytes for colorimetric sensor arrays, due to their ubiquitous nature and their well-known colorimetric reactions with pH indicators. For these reasons, select acids and bases were chosen for this exploratory study. Gas-phase analytes including HCl, HF, HNO3, propionic acid, and various aliphatic amines have been identified and quantified with colorimetric sensor arrays.10,13–16 In addition to gas-phase analytes, there have been a few reports of colorimetric sensor arrays for organic acidic and basic analytes in solution. Zhang and Suslick used a 36 sensor printed array to identify aliphatic amines, aromatic amines, and carboxylic acids at millimolar (mM) concentrations in water, with cluster analysis showing that the two classes of amines formed distinct, well separated, clustered groups of the analytes in an HCA dendrogram, thus allowing for the identification of the analytes.17 Kitamura, Shabbir, and Anslyn used an indicator displacement array composed of 23 unique combinations of receptors and indicators to identify solutions of seven carboxylic acids.18 Three classes of acids (phenolic acids, α-hydroxy carboxylic acids, and amino acids) showed distinct clustering behaviors when analyzed with PCA.18 In addition to the identification of acidic and basic analytes, sensor arrays have also been applied for pH determination.19–21

While there is a plethora of literature detailing the use of colorimetric arrays for the analysis of acids and bases, most articles focus on one or very few chemometric analysis methods. The work described herein, however, considers 5 different statistical analysis methods and compares them to each other using one dataset. Most articles cite only a combination of PCA and LDA or PCA and HCA, but often do not include other methods that are relatively simple to use within R (or other statistical programs) and that could improve or leverage the results. This dataset is provided as supplemental information to facilitate further chemometric studies that can be performed by other experts in the field and may improve the reported results.

A Specific Colorimetric Array

Our research group has extensively studied a colorimetric array that contains eight sensors, many of which are acid-base indicators. The array can be used to analyze many different substances, including narcotics, cutting agents, pesticides, steroids, and explosives.3,5,22,23

The first generation of the array involved simple visual analysis to determine whether a color change occurred for each sensor in the presence of an analyte.3 An analyte-specific binary code was generated based on whether or not a color change occurred. The second generation used an image analysis ImageJ-based method to automate the generation of unique binary codes based on statistically significant RGB changes.24 Although this method has been shown to be effective for identifying over 100 target analytes, all of the information contained in the colorimetric data set was not utilized to its full potential. For example, all statistically significant changes were given the same weight regardless of the magnitude of the color change quantified by RGB values. The research objective of this work is to determine which chemometric methods are most effective for exploiting the full potential of the RGB data. Therefore, the colorimetric data was examined using five multivariate techniques, namely PCA, LDA, HCA, KNN (k nearest neighbors), and HQI (hit quality index). These statistical methods were also evaluated to determine if they could detect the identity and concentration of the analytes concurrently, which is a new level of analysis for the colorimetric array. While this study only focuses on an eight sensor colorimetric array, our methodology could be adapted for other sensor array and analyte set applications.

Materials and Methods

All reagents were purchased at technical grade or better and were used as received without further purification. A universal pH indicator was prepared using methyl red, methyl orange, phenolphthalein, and bromothymol blue powders with a weight ratio of approximately 1:3:7:8.25 Saturated solutions of Congo red, erythrosin B, alizarin yellow R, crystal violet, eriochrome black T, phenolphthalein, universal indicator, and bromophenol blue were prepared by adding 1.0 g of the indicator to a formulation composed of 70.0 g acetate buffer (0.1 mM, pH 5), 8.0 g ethylene glycol, 5.0 g triethylene glycol monobutyl ether, and 16.0 g glycerol. The sensor solutions were heated (~50°), sonicated for 5 h, and then vacuum filtered twice using Whatman #1 filter paper. NaOH and HCl solutions (0.5, 1, 2, 3, 4, 5, and 10 M) were prepared by diluting concentrated solutions with milli-Q water (18 MΩ-cm).

The sensor array was formed in a 96-well plate by dispensing 100 μL of each sensor in the 8 columns of the well plate using an auto-pipet. To the 12 rows of the well plate, 100 μL of the analytes or controls were added, as shown in Figure 1. For reproducibility, water and analytes were tested 4 times each on a well plate. Two plates of the 1 M NaOH and HCl were evaluated (8 total samples) over time to assure that the data is reproducible. Color images (24-bit, 400 dpi) of the well plate were collected with an Epson Perfection V700 desktop scanner in transparency mode. To minimize interference from background light, the scanner was draped in black cloth. Mean values of the red, green, and blue channels for each well were extracted using an ImageJ macro.24,26
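The per-well averaging step can be illustrated with a minimal Python sketch. This is not the ImageJ macro used in the study: the function name `mean_rgb`, the rectangular region, and the tiny 2 × 2 "image" are assumptions made purely for illustration.

```python
def mean_rgb(image, top, left, height, width):
    """Average each color channel over a rectangular region.

    `image` is a list of rows, each row a list of (R, G, B) tuples.
    The rectangular region is an assumption; the study's macro may
    sample each well differently.
    """
    pixels = [image[r][c]
              for r in range(top, top + height)
              for c in range(left, left + width)]
    n = len(pixels)
    return tuple(sum(p[ch] for p in pixels) / n for ch in range(3))


# Hypothetical 2 x 2 pixel patch standing in for one well.
patch = [[(88, 10, 9), (90, 12, 9)],
         [(86, 8, 9), (88, 10, 9)]]
well_mean = mean_rgb(patch, top=0, left=0, height=2, width=2)
print(well_mean)
```

Repeating this for every well yields one mean (R, G, B) triple per sensor and sample, which is the raw input for the chemometric methods.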

Figure 1.

All well plates included water control samples, and analyte-sensor RGB data was analyzed with and without subtraction of the control sample. As discussed in the results section below, control subtraction did not improve the data analysis and the data reported below does not utilize control subtraction, with the exception of the section that discusses the control subtraction.

All statistical analysis was carried out with the statistical programming language R.27 PCA was performed using the function “prcomp”. The data was mean-centered, but was not scaled to unit variance because all of the data was on a consistent scale of 0 to 255 RGB units. Score plots of the resulting data were constructed using the R library pca3d.28 Hierarchical clustering was conducted in agglomerative mode using Ward’s method based on Euclidean distances, and the package dendextend was used to help visualize and interpret the clustering results.29 LDA was performed using the MASS library,30 and the classification ability of the LDA model was tested by using leave-one-out (all but one) cross validation. For LDA and the other supervised methods, the concentrations of each analyte were treated as separate classes. KNN based on Euclidean distances was performed for k=1 and k=3 using the class package.30 HQI values were calculated using a custom R script. Classification was performed using all but one cross validation, and classifications were assigned based on the library sample with the highest HQI value. To facilitate further chemometric studies, the complete data set of 96 samples is provided in comma delimited format in the supplemental information Table S1.

Results and Discussion

The colorimetric sensor array used in the present study was composed of Congo red, erythrosine B, alizarin yellow R, crystal violet, eriochrome black T, phenolphthalein, universal indicator, and bromophenol blue. These sensors were selected because they are soluble in water, stable under atmospheric conditions, inexpensive, and relatively nontoxic, and on the basis of the magnitude of their color change in the presence of various analytes.3 Notably, with the exception of eriochrome black T, all of the sensors are pH indicators, and can also be used for other applications, such as biological stains, cyanide sensors, and metal ion indicators.31–34

When compared to the control, the addition of 100 μL 0.5 M HCl and NaOH resulted in color changes for some sensors (Figure 1). Some sensors did not exhibit a color change, or the color changes were too subtle to be easily detected by the naked eye. Thus, RGB analysis was used to distinguish color changes, providing a large dataset with 24 variables. Since the color of each sensor is defined by red, green, and blue channels, 24 variables (8 sensors * 3 colors = 24 variables) were used for the identification and quantification of HCl and NaOH.

It is the pattern of color changes, or the lack of changes, that result in a unique colorimetric identity map that facilitates chemometric analysis. For example, as shown in the red box in Figure 1, the addition of 0.5 M HCl resulted in a color change in Congo red from the red RGB value of (88, 10, 9) to (12, 12, 9), which approaches the RGB color value of black (0,0,0).
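As a small worked example, the Congo red shift quoted above can be expressed as a Euclidean distance in RGB space, and the per-sensor triples can be concatenated into the 24-variable vector used for analysis. The helper names `color_shift` and `flatten_wells` are hypothetical, and the placeholder array of eight identical wells is only there to show the vector length.

```python
import math


def color_shift(before, after):
    """Euclidean distance between two (R, G, B) triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))


def flatten_wells(wells):
    """Concatenate per-sensor (R, G, B) means into one feature vector."""
    return [channel for rgb in wells for channel in rgb]


# Congo red values reported in the text: control vs. 0.5 M HCl.
congo_red_control = (88, 10, 9)
congo_red_hcl = (12, 12, 9)
shift = color_shift(congo_red_control, congo_red_hcl)

# Placeholder: eight wells with the same color, just to show the shape.
features = flatten_wells([congo_red_hcl] * 8)
print(round(shift, 2), len(features))
```

Almost all of the shift comes from the red channel (88 to 12), which is why the sensor appears nearly black after the acid is added.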

Analysis of the control samples to assess the variability of the method shows standard deviations (SD) for the various sensor/color channels from 0.17 to 8.7 RGB units. However, the average control SD is only 1.64 RGB units, similar to the average analyte SD, and this variability did not skew the classification results. Variations in the RGB data can be attributed to inherent scanning inconsistencies, which have typical SDs of 1 to 2 RGB units.

Principal Component Analysis

The dataset was first analyzed with PCA because it is one of the more commonly used methods to analyze large data sets, including data from colorimetric and sensor arrays.1,6,13,35 PCA is a statistical algorithm that uses an orthogonal transformation to change a set of observable and possibly related variables into a set of linearly uncorrelated variables, which are called principal components.36 In the case of colorimetric sensor arrays, the principal components are weighted combinations of all color channels from the sensors. However, fewer components are needed to describe the data set, allowing patterns and trends in the data to be easily visualized and analyzed. Although there were 24 variables in the original data set, only four principal components were required to capture 95% of the variance in the data set (SI, Figure S1). The number of components needed to describe the variance in the data set was consistent with observations from other colorimetric sensor array studies.37,38
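The idea of capturing most of the variance with a few components can be sketched in plain Python. This is only an illustration under stated assumptions: it uses power iteration on the covariance matrix to find the first component, whereas R's prcomp relies on a singular value decomposition, and the four three-channel rows are invented numbers rather than data from the study.

```python
import math


def mean_center(rows):
    """Subtract each column mean; no scaling to unit variance."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(p)] for i in range(p)]


def first_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue via power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, cv))
    return v, eigenvalue


# Invented RGB-like rows: two "acid-like" and two "base-like" samples.
rows = [[10, 200, 50], [12, 198, 52], [90, 40, 50], [88, 42, 48]]
cov = covariance(mean_center(rows))
pc1, eigenvalue = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance
print(round(explained, 4))
```

The ratio of the leading eigenvalue to the trace of the covariance matrix is the fraction of variance explained by the first component; summing such ratios over successive components gives the kind of cumulative figure quoted above.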

In PCA, variables that are strongly correlated in the original data set remain closely related in the new components, and data points that are similar in the original data set also remain clustered in the principal component space.36 This can be observed in Figure 2, where water, HCl, and NaOH each form distinct clusters. The distinct groupings indicate that the analytes can easily be identified. In addition, the various concentrations of sodium hydroxide form distinct clusters indicating that the sensor array can be used to determine the concentration of NaOH. For HCl, the low concentrations (<2 M) form well separated clusters, while the higher concentrations overlap. Because the first component is dominated by alizarin yellow (discussed below), the overlap can be ascribed to the observation that at low pH values (2–10 M HCl), alizarin yellow precipitates (Figure 1, green box), which leads to large variations in the measured color values and overlapping results.

Figure 2.

In addition to visualizing the data with the scores plot, PCA can also be used to determine which sensors are responsible for the analyte discrimination by analyzing the loading plots in Figure 3. This information could be used for sensor selection and array optimization for future arrays because the loading plots clearly show which sensors have the biggest contributions for analyte detection, as evidenced by the highest loadings on the y-axis. For example, from the analysis of the loading plots (Figure 3), principal component 1 is dominated by alizarin yellow and the universal indicator. Principal component 2 is dominated by the red channels of erythrosine B, Congo red, and alizarin yellow. It is not surprising that alizarin yellow is one of the sensors that dominates the loading plots. As seen in Figure 1, alizarin yellow not only shows strong visible color changes but also precipitates in the presence of 0.5 M HCl. Thus, the loading plots can be used not only to determine which sensors have the greatest color variances but also to detect solubility problems, and this information could help eliminate ineffective sensors. Moreover, sensors such as eriochrome black T and phenolphthalein do not have significant loadings on any of the first four components (principal components 3 and 4 not shown), indicating that they are minimally interacting with the analytes. This is consistent with the known behavior of eriochrome black T, as this sensor is commonly used to indicate the presence of divalent metal ions rather than pH values. Somewhat surprisingly, phenolphthalein, the common indicator used for titrations with bases, did not have a large contribution in any of the first four components. This is likely because phenolphthalein was colorless except in the presence of 0.5 M NaOH, which is consistent with previous reports.39

Figure 3.

Overall, PCA provides a convenient way to visualize the multivariate data set. The data shows that the analytes are clearly separated from each other. Furthermore, within each analyte cluster, the data shows a linear separation of low to high concentrations. This indicates that the sensors can be used for analyte identification and quantification. In addition, PCA provides a path to improving the performance of the sensor array by indicating which sensors are useful and which ones could be replaced. For future generations of the colorimetric array, the loading plots could provide a method for informed mass screening of sensors for inclusion in a particular array. This method may also be applied for other colorimetric arrays described in the literature.

Hierarchical Cluster Analysis

Like PCA, HCA is an unsupervised multivariate analysis technique that produces clusters of data and is commonly used to analyze colorimetric arrays.1,6,40–42 Agglomerative hierarchical clustering forms increasingly larger clusters by iteratively merging the most similar clusters, based on distances (often Euclidean). Compared to other clustering algorithms, two appealing features of HCA are that it gives a quantitative metric for the (dis)similarity of groups and that it defines clusters in all size scales.36 Figure 4 shows the clustering results from the colorimetric data. Water, NaOH, and HCl each form larger clusters, with the acid samples being more similar to water than the basic samples. The various concentrations of NaOH form distinct clusters. However, among the HCl samples, the 0.5 and 1 M acid samples show distinct clusters, while the higher concentrations (>2 M) are less consistent. Although the algorithms and mathematics involved in PCA and HCA are very different, the HCA results are consistent with those from PCA, where the larger concentrations (>2 M) of HCl overlapped.
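A minimal sketch of agglomerative clustering with Ward's criterion is given here to make the merge order concrete. It assumes the textbook form of Ward's method, in which the pair of clusters whose merger least increases the within-cluster sum of squares is joined at each step; it is not the R code used in the study, and the six two-dimensional points and their labels are invented for illustration.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b merge."""
    centroid_a, size_a, _ = a
    centroid_b, size_b, _ = b
    squared = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * squared


def ward_merges(points, labels):
    """Return the merge history as (member labels, merge cost) pairs."""
    # Each cluster is (centroid, size, member labels).
    clusters = [(list(p), 1, [name]) for p, name in zip(points, labels)]
    history = []
    while len(clusters) > 1:
        pairs = [(ward_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        cost, i, j = min(pairs)  # cheapest merge; ties go to lowest indices
        (ca, na, la), (cb, nb, lb) = clusters[i], clusters[j]
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        merged = (centroid, na + nb, la + lb)
        history.append((la + lb, cost))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Invented 2-D points: water and acid lie close together, base far away.
labels = ["water-1", "water-2", "HCl-1", "HCl-2", "NaOH-1", "NaOH-2"]
points = [(50, 50), (52, 50), (60, 50), (62, 51), (150, 50), (152, 49)]
history = ward_merges(points, labels)
for members, cost in history:
    print(members, round(cost, 2))
```

With these toy numbers the replicate pairs merge first, then the water and HCl groups join, and the NaOH group is added last, which mirrors the qualitative pattern described for the dendrogram.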

Figure 4.

Overall, HCA was useful for illustrating which analytes are the most similar or most different. For example, all the acids are closely clustered together (red tones) and the bases are clustered (blue tones). Also, the two analytes are on opposite sides of the dendrogram and clearly do not overlap with water, illustrating that these analytes are not similar to each other or to water (Figure 4). However, the branches of the dendrogram can be freely rotated, and comparing members of one cluster to members of another cluster can be difficult. For this reason, the dendrogram can lead to some ambiguous interpretations. In Figure 4, for example, one could rotate the acid cluster, and then it is difficult to determine if the acid cluster is more similar to the base cluster or the water cluster. Moreover, HCA does not provide insight into why the classifications occur or how to improve them. As illustrated above in Figure 2, PCA does not suffer from these weaknesses, and the comparisons can readily be carried out with a PCA scores plot.

While HCA has been used and validated abundantly to analyze large colorimetric datasets, this study demonstrates that PCA is advantageous over HCA for the colorimetric array RGB data to identify and quantify both NaOH and HCl. This indicates that other researchers who use predominantly HCA may benefit from testing PCA to determine if their results can be improved.

Linear Discriminant Analysis

LDA has also been applied in the analysis of colorimetric sensor arrays, although not as frequently as PCA or HCA.7,43 Analogous to PCA, LDA generates new variables, called discriminants, that are linear combinations of the original variables. However, LDA maximizes the differences in the group means, and therefore, can often perform better than PCA in the separation of groups.2 A drawback of LDA is the need for more samples than variables and the accuracy of the method potentially suffers when there are a large number of classes.44,45 In addition, LDA requires the data to be grouped, and the results can be affected by how the groups were constructed. Figure 5 shows a plot of the scores from the first and second discriminants from LDA analysis. In the plot, water (grey), NaOH (blue), and HCl (red) samples are distinct, indicating that LDA is effective for the identification of these analytes. For NaOH and HCl, the various analyte concentrations form small groups, which roughly correspond linearly with concentration. This supports the possibility of determining concentration using LDA. In comparison to the PCA score plot where the individual concentrations of bases were very well resolved and the concentrated acids overlapped, the clusters of various concentrations of acids and bases are spread out approximately equally in LDA. This means that for the determination of concentration, LDA is superior to PCA.

Figure 5.

In addition to dimension reduction, LDA can also be used as a quantitative means of classifying unknown samples. Table I shows the classification results from LDA using all but one cross validation. LDA correctly classified 93 of the 96 (97%) samples based on analyte and concentration. Two of the three misclassified samples were among the concentrated HCl samples, which was consistent with the results from PCA and HCA, where the acidic samples resulted in more overlap. The remaining misclassified sample (10 M NaOH) was initially classified as the wrong analyte (1 M HCl). This outlier prompted a reanalysis of the data, and the results demonstrate that the outlying data point cannot be reliably modeled with LDA. Upon further investigation into the RGB data, it was found that, for the universal indicator and bromophenol blue, the RGB intensities of one trial were much higher than those of the other three trials, which could be due to wet chemistry experimental error, scanner inconsistencies, or an image processing error. LDA seems to be the only classification method in which the outlier was not classified as the correct analyte. PCA, HCA, HQI, and KNN were unaffected by the outlier.

Table I.

Summary of LDA sample classification.

Analyte       Total   Correct   Misclassified as
0.5 M HCl     4       4         -
1 M HCl       8       8         -
2 M HCl       4       4         -
3 M HCl       4       3         4 M HCl
4 M HCl       4       3         3 M HCl
5 M HCl       4       4         -
10 M HCl      4       4         -
0.5 M NaOH    4       4         -
1 M NaOH      8       7         2 M NaOH
2 M NaOH      4       4         -
3 M NaOH      4       4         -
4 M NaOH      4       4         -
5 M NaOH      4       4         -
10 M NaOH     4       3         1 sample could not be modeled
Water         32      32        -

Besides the aforementioned misclassification, results indicate that LDA was a very effective method for identifying analytes. These results are especially impressive considering the challenging data set, which includes sensors that show minimal differences between analytes (phenolphthalein and eriochrome black T) and sensors that have large variation and overlap within groups (alizarin yellow).

K Nearest Neighbor

KNN is another chemometric method used to classify unknown samples by comparing them to a library of known samples. The resulting classification is determined by the classes of the most similar samples, which is based on distances between the unknown and known samples.46 Because the classification is based on treating the data sets as points in 24 dimensional space and calculating the distance between them, the implementation of KNN is straightforward.47 Other advantages of KNN include the ability to function in data sets with a low sample-to-variable ratio and a low sensitivity to outliers.48 In addition, KNN can often perform as well as or better than more sophisticated classifiers.46 KNN has previously been used in many classification problems including pH determination using a sensor array21 and melting point estimation of organic compounds.49 KNN results for k = 1 (number of nearest neighbors used for classification) using all but one cross validation are presented in Table II. Overall, KNN could be used to correctly identify 86 of the 96 samples (90%); all of the mislabeled samples involved the correct analyte, but the wrong concentration. Most misclassifications occurred among concentrated HCl samples (3–10 M), in accord with the PCA and HCA results. The number of nearest neighbors (k) has been shown to influence the performance of the method, but the relationship between k and performance can vary.46 Subsequent tests for k=3 resulted in lower classification accuracy, indicating that k=1 is the optimal parameter for the present study (data not shown).
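The k = 1 procedure with all but one (leave-one-out) cross validation can be sketched as follows. This is an illustrative Python version, not the R class package used in the study; the function names are hypothetical, and the three-channel vectors are shortened stand-ins for the 24-variable samples, with invented values apart from the Congo red triples quoted earlier.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_label(query, library):
    """Label of the library sample closest to the query (k = 1)."""
    return min(library, key=lambda item: euclidean(query, item[0]))[1]


def leave_one_out(samples):
    """Classify each sample against all others; return (correct, total)."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        library = samples[:i] + samples[i + 1:]  # hold out sample i
        if nearest_label(vector, library) == label:
            correct += 1
    return correct, len(samples)


# Toy library: two replicates per class, three channels per sample.
samples = [
    ([88, 10, 9], "water"), ([87, 11, 9], "water"),
    ([12, 12, 9], "0.5 M HCl"), ([13, 12, 10], "0.5 M HCl"),
    ([200, 30, 40], "0.5 M NaOH"), ([198, 31, 41], "0.5 M NaOH"),
]
correct, total = leave_one_out(samples)
print(correct, total)
```

Because every variable enters the distance with equal weight, a noisy channel can pull a sample toward the wrong concentration class even when the analyte itself is still identified correctly.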

Table II.

Summary of k nearest neighbor (KNN) sample classifications.

Analyte       Total   Correct   Misclassified as
0.5 M HCl     4       4         -
1 M HCl       8       8         -
2 M HCl       4       4         -
3 M HCl       4       2         2 M HCl, 4 M HCl
4 M HCl       4       2         3 M HCl (×2)
5 M HCl       4       0         3 M HCl (×3), 4 M HCl
10 M HCl      4       2         5 M HCl (×2)
0.5 M NaOH    4       4         -
1 M NaOH      8       8         -
2 M NaOH      4       4         -
3 M NaOH      4       4         -
4 M NaOH      4       4         -
5 M NaOH      4       4         -
10 M NaOH     4       4         -
Water         32      32        -

Comparing the two classification methods, a few more misclassifications occurred with KNN than with LDA. This could potentially be explained by the fact that LDA utilizes input variables with optimal weights to separate the group means, while KNN gives equal significance to all of the variables. Alternative KNN algorithms apply various transformations to the dataset to optimize the accuracy of KNN; however, such methods were outside the scope of the present study.49

Hit Quality Index

HQI is a common method used to compare an unknown spectrum to a large database (samples and variables) of known spectra, particularly for FTIR and Raman.50,51 HQI treats the spectra as vectors, and the similarity between two spectra is calculated with dot products according to the equation $\mathrm{HQI} = \dfrac{(x \cdot y)(x \cdot y)}{(x \cdot x)(y \cdot y)}$, where x and y are the unknown and known spectra, respectively. Classification can then be assigned similarly to KNN, and comparative studies involving HQI have shown similar accuracies to other chemometric methods.50,51
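The HQI formula translates directly into a few lines of code. The sketch below is an illustration in Python rather than the custom R script used in the study; the function names `hqi` and `best_match` and the small library of three-channel vectors are assumptions made for the example.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def hqi(x, y):
    """Hit quality index: squared dot product over the product of norms."""
    return dot(x, y) ** 2 / (dot(x, x) * dot(y, y))


def best_match(unknown, library):
    """Return (label, HQI) of the library entry most similar to unknown."""
    scored = [(hqi(unknown, vector), label) for vector, label in library]
    score, label = max(scored)
    return label, score


# Invented library entries (three channels instead of 24).
library = [
    ([88, 10, 9], "water"),
    ([12, 12, 9], "0.5 M HCl"),
    ([200, 30, 40], "0.5 M NaOH"),
]
label, score = best_match([13, 12, 10], library)
print(label, round(score, 4))
```

Written this way, HQI is the squared cosine of the angle between the two vectors, so it equals 1 for vectors that point in the same direction and is unaffected by rescaling either vector.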

In the present study, the classifications were assigned using all but one cross validation and the best HQI value (k = 1); the classification results are listed in Table III. Overall, the correct classification (analyte and concentration) was observed for 90 of 96 samples (94%), and the six misclassifications involved the correct analyte, but the wrong concentration. This means that even though the concentrations were classified incorrectly, the identity of the substance could still be determined. All of the misclassifications occurred in the more concentrated (≥3 M) HCl samples, which is consistent with the qualitative results from HCA and PCA. This means that the acid sensors could have reached their capacity to discriminate highly acidic samples. HQI involves straightforward mathematics and can easily be implemented, giving similar qualitative results to other methods. Thus, HQI could be used as an alternative or additional method for data analysis of colorimetric arrays.

Table III.

Summary of hit quality index (HQI) sample classifications.

Analyte       Total   Correct   Misclassified as
0.5 M HCl     4       4         -
1 M HCl       8       8         -
2 M HCl       4       4         -
3 M HCl       4       2         2 M HCl, 4 M HCl
4 M HCl       4       2         3 M HCl (×2)
5 M HCl       4       3         10 M HCl
10 M HCl      4       3         5 M HCl
0.5 M NaOH    4       4         -
1 M NaOH      8       8         -
2 M NaOH      4       4         -
3 M NaOH      4       4         -
4 M NaOH      4       4         -
5 M NaOH      4       4         -
10 M NaOH     4       4         -
Water         32      32        -

Control Subtraction

Subtracting a control from an analyte sample is a common method to isolate the impact of the analyte and improve the visualization of color changes in colorimetric sensor arrays.10 In addition, the normalization of colorimetric data against a control minimizes internal variations of color differences in image capture, such as variations in desktop scanners or smart devices. In order to determine if control subtraction would improve the chemometric analysis of the data set, the results of the uncorrected data were compared to the data from which the control was subtracted. PCA, LDA, and KNN showed minimal differences with or without the control subtraction. HCA of the uncorrected data (Figure 4) led to clustering of all 1 M NaOH samples. However, when the control sample was subtracted, half of the 1 M NaOH samples were grouped with the 0.5 M NaOH samples, and the other half were grouped with the 2 M NaOH samples. HQI appeared to be the most affected by the control subtraction. Although the classification of analytes was similar, with 9 misclassifications for unsubtracted data and 8 misclassifications for subtracted data, the HQI (measure of data similarity) was much lower in the subtracted data. For the uncorrected data in HQI, all water samples were correctly classified as water with a similarity > 0.999. However, in the subtracted data, water samples were misclassified and similarities ranged from 0.83 – 0.21. Based on these results, the control subtraction was found to be unnecessary for all chemometric methods, and was not used in the final data analysis for this array.
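One plausible reading of the drop in HQI for water after control subtraction, offered here as an assumption rather than a finding of the study, is that subtracting the control leaves only small noise residuals whose direction is essentially arbitrary. The toy Python sketch below uses invented six-channel values to show the effect.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def hqi(x, y):
    """Hit quality index: squared dot product over the product of norms."""
    return dot(x, y) ** 2 / (dot(x, x) * dot(y, y))


def subtract(sample, control):
    """Channel-wise difference between a sample and the control."""
    return [s - c for s, c in zip(sample, control)]


# Invented values: a control and two water replicates that differ
# from it only by scanner-level noise of about 1 RGB unit.
control = [88, 10, 9, 200, 180, 20]
water_1 = [89, 10, 10, 201, 179, 20]
water_2 = [87, 11, 8, 199, 181, 21]

raw_similarity = hqi(water_1, water_2)
subtracted_similarity = hqi(subtract(water_1, control),
                            subtract(water_2, control))
print(round(raw_similarity, 4), round(subtracted_similarity, 4))
```

In this example the raw vectors are dominated by the large shared baseline and score close to 1, while the residuals after subtraction score far lower even though both replicates are the same substance.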

Conclusions and Future Outlook

Colorimetric sensor arrays are becoming a common tool for the identification and quantification of various analytes. The multidimensional nature of the resulting colorimetric data demands the use of chemometric analysis. There are many chemometric algorithms that could be applied to colorimetric data; however, analysis has mostly been limited to HCA and PCA. Herein, an 8 sensor colorimetric array was used to compare the performance of PCA, HCA, KNN, LDA, and HQI in the identification and quantification of acidic and basic samples.

While the current study demonstrates the effectiveness of the colorimetric sensor array for the identification and quantification of acids and bases, the performance of the array could likely be improved with sensor optimization. Avoiding pH sensors that precipitate would likely decrease the variance within groups and could lead to superior analyte quantification. In addition, selecting pH sensors that span a larger pH range, especially low pH values, could potentially be useful for separating analytes. Incorporating additional sensors into the array could also lead to improved performance by supplying the chemometric methods with additional data. These criteria can be applied to other analytes and applications.

Our colorimetric array data analysis results show that PCA, HCA, and LDA can be used to qualitatively visualize the data and the relationships between the analytes. With all three methods, a clear separation between water and analyte samples was observed. In addition, modest discrimination between the concentrations of the analytes was observed. Our results show that KNN, LDA, and HQI can be used to quantitatively classify the samples in this array. KNN and HQI could be used to correctly identify the analyte in all cases, but LDA had one sample that could not be classified. For identification of analyte concentration, LDA slightly outperformed HQI and KNN, with accuracies of 96%, 94%, and 90%, respectively. For all methods, most of the misclassifications occurred at higher concentrations (> 2 M) of HCl, likely because precipitation caused large variation for some indicators. The slightly improved performance of LDA over KNN and HQI was not unexpected because LDA applies different weightings to optimally separate the groups; however, this trend is not universal in comparative studies of classification accuracy.46,47 Although the classification accuracy of KNN and HQI was slightly lower, these methods seemed more robust to outliers. For the present data set and presented methods, the effectiveness can be arranged as LDA > PCA > HCA for visualization and LDA > HQI > KNN > HCA for classification. Additional studies are underway to determine if these trends are consistent with larger and more complex datasets. Nevertheless, our results regarding chemometric effectiveness in analyte classification will allow for improved sensor array design and analyte analysis.
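As an illustration of how a classification accuracy such as those quoted above can be computed, the following minimal sketch applies k nearest neighbor classification with leave-one-out validation. The validation scheme, the choice of k = 3, the Euclidean distance, the function names, and the RGB values are all assumptions made for this example; they are not the settings or data of this study.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples nearest to query.

    train is a list of (vector, label) pairs; distance is squared Euclidean.
    """
    ranked = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k=3):
    """Fraction of samples classified correctly when each is held out in turn."""
    correct = 0
    for i, (vector, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, vector, k) == label:
            correct += 1
    return correct / len(data)


# Hypothetical single-sensor RGB readings; illustrative only, not study data.
data = [
    ([200, 180, 60], "water"), ([201, 179, 61], "water"),
    ([199, 181, 59], "water"), ([200, 181, 60], "water"),
    ([120, 40, 50], "HCl"), ([121, 41, 49], "HCl"),
    ([119, 39, 51], "HCl"), ([120, 41, 50], "HCl"),
    ([60, 90, 210], "NaOH"), ([61, 89, 211], "NaOH"),
    ([59, 91, 209], "NaOH"), ([60, 91, 210], "NaOH"),
]

accuracy = leave_one_out_accuracy(data, k=3)
print(f"Leave-one-out accuracy: {accuracy:.0%}")
print(knn_predict(data, [118, 42, 52], k=3))
```

Because KNN only stores the training vectors and compares distances at prediction time, it needs no model fitting, which is one reason it is often considered for devices with limited computing power.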

Additional criteria for comparing and selecting chemometric methods, such as the memory and processing power of the mobile device, will also have to be considered. In mobile device applications, a compromise between the accuracy of the chosen chemometric method and the available computing power will be necessary.

Supplementary Material

Supp FigS1

Acknowledgments

This publication was made possible by

  1. US Army W911SR-15-C-0027 SBIR Phase I - Chemical Biological Radiological Nuclear and Explosives (CBRNE) Reconnaissance Sampling Kit (A15-048).

  2. National Institute for General Medical Science (NIGMS) (5P20GM103427), a component of the National Institutes of Health (NIH). (RL INBRE Scholar)

  3. Camille and Henry Dreyfus Foundation (AH-2015 Dreyfus Teacher Scholar Award).

  4. SBIR Phase II - Chemical Biological Radiological Nuclear and Explosives (CBRNE) Reconnaissance Sampling Kit (W911SR-16-C-0051)

References

  • 1. Capitán-Vallvey LF, López-Ruiz N, Martínez-Olmos A, Erena MM, Palma AJ. Recent developments in computer vision-based analytical chemistry: A tutorial review. Anal Chim Acta. 2015;899:23–56. doi: 10.1016/j.aca.2015.10.009.
  • 2. Askim JR, Mahmoudi M, Suslick KS. Optical sensor arrays for chemical sensing: the optoelectronic nose. Chem Soc Rev. 2013;42(22):8649–8682. doi: 10.1039/C3CS60179J.
  • 3. Burks RM, Pacquette SE, Guericke MA, et al. DETECHIP: A Sensor for Drugs of Abuse. J Forensic Sci. 2010;55(3):723–727. doi: 10.1111/j.1556-4029.2010.01323.x.
  • 4. Li Z, Jang M, Askim JR, Suslick KS. Identification of accelerants, fuels and post-combustion residues using a colorimetric sensor array. Analyst. 2015;140(17):5929–5935. doi: 10.1039/C5AN00806A.
  • 5. Lyon M, Groathouse J, Beaber J, et al. DETECHIP®: An Improved Molecular Sensing Array. J Forensic Res. 2011;2(4):1000126. doi: 10.4172/2157-7145.1000126.
  • 6. Salles MO, Meloni GN, de Araujo WR, Paixão TRLC. Explosive colorimetric discrimination using a smartphone, paper device and chemometrical approach. Anal Methods. 2014;6(7):2047–2052. doi: 10.1039/c3ay41727a.
  • 7. Zhang Y, Askim JR, Zhong W, Orlean P, Suslick KS. Identification of pathogenic fungi with an optoelectronic nose. Analyst. 2014;139(8):1922–1928. doi: 10.1039/C3AN02112B.
  • 8. Haswell SJ. Practical Guide to Chemometrics. New York: Marcel Dekker, Inc; 1992.
  • 9. Okuom MO, Holmes AE. Developing a Color-Based Molecular Sensing Device: DETECHIP®. Sens Transducers. 2014;183(12):30–33.
  • 10. Suslick KS. An Optoelectronic Nose: “Seeing” Smells by Means of Colorimetric Sensor Arrays. MRS Bull. 2004;29(10):720–725. doi: 10.1557/mrs2004.209.
  • 11. Ariza-Avidad M, Salinas-Castillo A, Cuéllar MP, Agudo-Acemel M, Pegalajar MC, Capitán-Vallvey LF. Printed Disposable Colorimetric Array for Metal Ion Discrimination. Anal Chem. 2014;86(17):8634–8641. doi: 10.1021/ac501670f.
  • 12. Mahmoudi M, Lohse SE, Murphy CJ, Suslick KS. Identification of Nanoparticles with a Colorimetric Sensor Array. ACS Sens. 2016;1(1):17–21. doi: 10.1021/acssensors.5b00014.
  • 13. Chulvi K, Gaviña P, Costero AM, et al. Discrimination of nerve gases mimics and other organophosphorous derivatives in gas phase using a colorimetric probe array. Chem Commun. 2012;48(81):10105–10107. doi: 10.1039/C2CC34662A.
  • 14. Lim SH, Feng L, Kemling JW, Musto CJ, Suslick KS. An optoelectronic nose for the detection of toxic gases. Nat Chem. 2009;1(7):562–567. doi: 10.1038/nchem.360.
  • 15. Bang JH, Lim SH, Park E, Suslick KS. Chemically Responsive Nanoporous Pigments: Colorimetric Sensor Arrays and the Identification of Aliphatic Amines. Langmuir. 2008;24(22):13168–13172. doi: 10.1021/la802029m.
  • 16. Soga T, Jimbo Y, Suzuki K, Citterio D. Inkjet-Printed Paper-Based Colorimetric Sensor Array for the Discrimination of Volatile Primary Amines. Anal Chem. 2013;85(19):8973–8978. doi: 10.1021/ac402070z.
  • 17. Zhang C, Suslick KS. A Colorimetric Sensor Array for Organics in Water. J Am Chem Soc. 2005;127(33):11548–11549. doi: 10.1021/ja052606z.
  • 18. Kitamura M, Shabbir SH, Anslyn EV. Guidelines for Pattern Recognition Using Differential Receptors and Indicator Displacement Assays. J Org Chem. 2009;74(12):4479–4489. doi: 10.1021/jo900433j.
  • 19. Safavi A, Maleki N, Rostamzadeh A, Maesum S. CCD camera full range pH sensor array. Talanta. 2007;71(1):498–501. doi: 10.1016/j.talanta.2006.04.030.
  • 20. Curto VF, Fay C, Coyle S, et al. Real-time sweat pH monitoring based on a wearable chemical barcode micro-fluidic platform incorporating ionic liquids. Sens Actuators B Chem. 2012;171–172:1327–1334. doi: 10.1016/j.snb.2012.06.048.
  • 21. Capel-Cuevas S, Cuéllar MP, de Orbe-Payá I, Pegalajar MC, Capitán-Vallvey LF. Full-range optical pH sensor based on imaging techniques. Anal Chim Acta. 2010;681(1–2):71–81. doi: 10.1016/j.aca.2010.09.033.
  • 22. Batres G, Jones T, Johnke H, Wilson M, Holmes AE, Sikich S. Reactive Arrays of Colorimetric Sensors for Metabolite and Steroid Identification. J Sens Technol. 2014;4(1):1–6. doi: 10.4236/jst.2014.41001.
  • 23. Johnke H, Batres G, Wilson M, Holmes AE, Sikich S. Detecting Concentration of Analytes with DETECHIP: A Molecular Sensing Array. J Sens Technol. 2013;3(3):94–99. doi: 10.4236/jst.2013.33015.
  • 24. Lyon M, Wilson MV, Rouhier KA, et al. Digital Image Analysis for DETECHIP® Code Determination. Signal Image Process Int J SIPIJ. 2012;3(4):51–63. doi: 10.5121/sipij.2012.3405.
  • 25. Foster LS, Gruntfest IJ. Demonstration experiments using universal indicators. J Chem Educ. 1937;14(6):274–276. doi: 10.1021/ed014p274.
  • 26. Soldat DJ, Barak P, Lepore BJ. Microscale Colorimetric Analysis Using a Desktop Scanner and Automated Digital Image Analysis. J Chem Educ. 2009;86(5):617–620. doi: 10.1021/ed086p617.
  • 27. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2017. http://www.R-project.org/
  • 28. Weiner J. pca3d: Three Dimensional PCA Plots. R Package Version 0.10. 2017. http://CRAN.R-project.org/package=pca3d
  • 29. Galili T. dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. 2015;31(22):3718–3720. doi: 10.1093/bioinformatics/btv428.
  • 30. Venables WN, Ripley BD. Modern Applied Statistics with S. 4th ed. New York: Springer; 2002. http://www.stats.ox.ac.uk/pub/MASS4
  • 31. Lopez-Ruiz N, Curto VF, Erenas MM, et al. Smartphone-Based Simultaneous pH and Nitrite Colorimetric Determination for Paper Microfluidic Devices. Anal Chem. 2014;86(19):9554–9562. doi: 10.1021/ac5019205.
  • 32. Sabnis RW. Handbook of Acid-Base Indicators. Boca Raton, FL: CRC Press; 2008.
  • 33. Afkhami A, Sarlak N. A novel cyanide sensing phase based on immobilization of methyl violet on a triacetylcellulose membrane. Sens Actuators B Chem. 2007;122(2):437–441. doi: 10.1016/j.snb.2006.06.012.
  • 34. Tandon KN. Complexometric determination of mercury(II) using Congo Red as indicator. Talanta. 1966;13(1):161–163. doi: 10.1016/0039-9140(66)80144-3.
  • 35. Feng L, Musto CJ, Suslick KS. A Simple and Highly Sensitive Colorimetric Detection Method for Gaseous Formaldehyde. J Am Chem Soc. 2010;132(12):4046–4047. doi: 10.1021/ja910366p.
  • 36. Graham RC. Data Analysis for the Chemical Sciences: A Guide to Statistical Techniques. New York: VCH Publishers, Inc; 1993.
  • 37. Salinas Y, Ros-Lis JV, Vivancos J-L, et al. A novel colorimetric sensor array for monitoring fresh pork sausages spoilage. Food Control. 2014;35(1):166–176. doi: 10.1016/j.foodcont.2013.06.043.
  • 38. Li Z, Bassett WP, Askim JR, Suslick KS. Differentiation among peroxide explosives with an optoelectronic nose. Chem Commun. 2015;51(83):15312–15315. doi: 10.1039/C5CC06221G.
  • 39. Wittke G. Reactions of phenolphthalein at various pH values. J Chem Educ. 1983;60(3):239–240. doi: 10.1021/ed060p239.
  • 40. Bueno L, Meloni GN, Reddy S, Paixão TRLC. Use of plastic-based analytical device, smartphone and chemometric tools to discriminate amines. RSC Adv. 2015;5(26):20148–20154. doi: 10.1039/c5ra01822f.
  • 41. Feng L, Musto CJ, Kemling JW, Lim SH, Zhong W, Suslick KS. Colorimetric Sensor Array for Determination and Identification of Toxic Industrial Chemicals. Anal Chem. 2010;82(22):9433–9440. doi: 10.1021/ac1020886.
  • 42. Feng L, Musto CJ, Kemling JW, Lim SH, Suslick KS. A colorimetric sensor array for identification of toxic gases below permissible exposure limits. Chem Commun. 2010;46(12):2037–2039. doi: 10.1039/B926848K.
  • 43. Minami T, Esipenko NA, Akdeniz A, Zhang B, Isaacs L, Anzenbacher P Jr. Multianalyte Sensing of Addictive Over-the-Counter (OTC) Drugs. J Am Chem Soc. 2013;135(40):15238–15243. doi: 10.1021/ja407722a.
  • 44. Wold S, Johansson E, Jellum E, Bjørnson I, Nesbakken R. Application of simca multivariate data analysis to the classification of gas chromatographic profiles of human brain tissues. Anal Chim Acta. 1981;133(3):251–259. doi: 10.1016/S0003-2670(01)83199-8.
  • 45. Aeberhard S, Coomans D, de Vel O. The Performance Of Statistical Pattern Recognition Methods In High Dimensional Settings. IEEE Signal Processing Workshop on Higher Order Statistics; Ceasarea: John Wiley; 1994. pp. 14–16.
  • 46. Ma C-M, Yang W-S, Cheng B-W. How the Parameters of K-nearest Neighbor Algorithm Impact on the Best Classification Accuracy: In Case of Parkinson Dataset. J Appl Sci. 2014;14(2):171–176. doi: 10.3923/jas.2014.171.176.
  • 47. Balabin RM, Safieva RZ, Lomakina EI. Gasoline classification using near infrared (NIR) spectroscopy data: Comparison of multivariate techniques. Anal Chim Acta. 2010;671(1–2):27–35. doi: 10.1016/j.aca.2010.05.013.
  • 48. Bhaskar H, Hoyle DC, Singh S. Machine learning in bioinformatics: A brief survey and recommendations for practitioners. Comput Biol Med. 36(10):1104–1125. doi: 10.1016/j.compbiomed.2005.09.002.
  • 49. Nigsch F, Bender A, van Buuren B, Tissen J, Nigsch E, Mitchell JBO. Melting Point Prediction Employing k-Nearest Neighbor Algorithms and Genetic Parameter Optimization. J Chem Inf Model. 2006;46(6):2412–2422. doi: 10.1021/ci060149f.
  • 50. Lee S, Lee H, Chung H. New discrimination method combining hit quality index based spectral matching and voting. Anal Chim Acta. 2013;758:58–65. doi: 10.1016/j.aca.2012.10.058.
  • 51. Gryniewicz-Ruzicka CM, Rodriguez JD, Arzhantsev S, Buhse LF, Kauffman JF. Libraries, classifiers, and quantifiers: A comparison of chemometric methods for the analysis of Raman spectra of contaminated pharmaceutical materials. J Pharm Biomed Anal. 2012;61:191–198. doi: 10.1016/j.jpba.2011.12.002.
