In 2011, Houde, Berkowitz, and Engen published an important foundational work on the methodology of HX-MS for differential measurements in comparability contexts. Essentially, this work was perhaps the first to rigorously address the critical question ‘what is the smallest difference in an HX-MS measurement that we should classify as significant?’ At the end of the work, Houde et al. conclude that if the difference is greater than 0.5 Da in triplicate measurements, it should be considered significant.
In the course of a critical revisiting of this paper in preparation for a forthcoming manuscript, I have discovered that, while the overall premise of this work is sound, there are some errors in the statistical implementation. Overall, it appears that these errors will lead to a substantial overestimate in the magnitude of the significance limits when the methods described in the paper are followed. This Commentary is intended to provide guidance on how to avoid these errors.
It appears, based on my reading of the text, that three different statistical errors were introduced. Two of these errors lead to an overestimate of the significance limit and one leads to an underestimate. The net effect is that the threshold for statistically significant differences is much more stringent than was intended with the use of a purportedly 98% confidence interval. Overestimating the limit for significance has the apparent benefit of decreasing false positives (i.e., type I errors), but the undesirable side-effect of a loss of power due to an increase in the number of false negatives (i.e., type II errors). Thus, there is an increased risk of missing significant differences that appear to be ‘too small’ based on an overestimated significance limit.
Rather than just proclaiming where I think the errors are, I have gone into some detail to explain these statistical issues at a level that should be approachable by anyone who is familiar with mean, standard deviation, and Student’s t-test. Recommendations are provided at the end.
Notation
Since the issues addressed in this commentary hinge on technical matters around the statistical treatment of data, I have recast the equations in notation that is commonly employed in the field of statistics. Houde et al. define the HX difference between the reference (subscript “ref”) and experiment (subscript “exp”) as an “array of differences”, D
(1)  $D(\Delta M_{i,t}) = S_{\mathrm{ref}}(M_{i,t}) - S_{\mathrm{exp}}(M_{i,t})$
where $S_n(M_{i,t})$ is an array of mass increases for peptides indexed by their location in the amino acid sequence, $i$, at HX labeling time $t$, for data set $n$, where $n$ is the label for the reference or the experiment. Since the issues raised here do not concern the identity of the peptide or the HX labeling time, I have dispensed with the labels and recast the equation as
(2)  $F = X - Y$
where $X$ represents the HX data from the reference sample and $Y$ represents the HX data from the experiment sample.
Determination of the standard deviation of replicate differential HX measurements
The first matter that arises is how the mean difference was determined, and on this point, the paper is unclear. Technical replicates were obtained so that we can write $X = (X_1, X_2, \ldots, X_n)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$, where each subscripted symbol represents a single experimental determination of the amount of deuteration in a particular peptide at a specific HX labeling time. In the work of Houde et al., $n = 3$. According to the Supporting Information (pp. 4-5), $F$ was “obtained from the average of three separate H/DX-MS measurements conducted on the same sample.” Hence we have
(3)  $F = \bar{X} - \bar{Y}$
where the overbar represents the arithmetic mean. It is trivial to show that:
(4)  $\bar{X} - \bar{Y} = \frac{1}{n}\sum_{i=1}^{n}(X_i - Y_i) = \bar{F}$
Plainly speaking, the difference of the means of X and Y is equal to the mean of the differences. The first problem in this work, however, arises when one begins to reckon with the variance, or equivalently, the standard deviation. The exact definition of the standard deviation was not stated in the paper, but most experimental scientists would expect the sample standard deviation with the Bessel correction1
(5)  $s_X = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$
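As a quick numerical check of equations (4) and (5), the following sketch uses NumPy with invented triplicate uptake values (not data from Houde et al.). A common pitfall is that `np.std` defaults to the population formula (`ddof=0`), so the Bessel correction must be requested explicitly.

```python
import numpy as np

# Hypothetical triplicate deuterium uptake values (Da) for one peptide/time point
X = np.array([5.12, 5.30, 5.21])  # reference sample
Y = np.array([4.98, 5.05, 5.18])  # experiment sample

# Equation (4): the difference of the means equals the mean of the paired differences
assert np.isclose(X.mean() - Y.mean(), (X - Y).mean())

# Equation (5): sample standard deviation with the Bessel (n - 1) correction;
# NumPy divides by n unless ddof=1 is specified
s_X = np.std(X, ddof=1)
print(f"s_X = {s_X:.3f} Da")  # prints: s_X = 0.090 Da
```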
Since the square root notation is cumbersome, in many contexts it is preferable to refer to the sample variance, $s^2$. It is also worth noting here that the sample variance is a point estimator of the true variance, $\sigma^2$:1,2
(6)  $\hat{\sigma}^2 = s^2$
that is based on a limited number of replicated measurements. Before we form the standard deviation expression for $F$, it is useful to consider the error propagation in $F$:3
(7)  $\sigma_F^2 = \left(\frac{\partial F}{\partial X}\right)^2 \sigma_X^2 + \left(\frac{\partial F}{\partial Y}\right)^2 \sigma_Y^2 + 2\,\frac{\partial F}{\partial X}\frac{\partial F}{\partial Y}\,\mathrm{cov}(X,Y)$
which in the present case becomes
(8)  $\sigma_F^2 = \sigma_X^2 + \sigma_Y^2 - 2\,\mathrm{cov}(X,Y)$
where the function cov is the covariance of $X$ and $Y$:
(9)  $\mathrm{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$
expressed here with the $n-1$ Bessel correction. In the case of random, uncorrelated errors, which it is reasonable to expect in HX-MS measurements, the covariance would be zero, and replacing the variances with their point estimators gives the “common sense” expression for the propagated error in $F$:
(10)  $s_F = \sqrt{s_X^2 + s_Y^2}$
Importantly, equation (10) is valid if and only if $X$ and $Y$ are uncorrelated variables. In the Supporting Information, Houde et al. stated that they “determined SD values for all calculated mean D(ΔMi,t) data points” (p. 4), which I interpret as ‘the sample standard deviation was determined from the HX differences.’ This is a point where the paper lacks clarity: there are two ways to go about this, and looking only at equation (4), it would appear that the choice is unimportant. Based on my reading of the text, it seems that Houde et al. estimated their variance, denoted here as $\tilde{s}_F^2$, using the mean of paired differences:
(11)  $\tilde{s}_F^2 = \frac{1}{n-1}\sum_{i=1}^{n}(F_i - \bar{F})^2$
This approach amounts to computing the differences from three paired measurements, i.e., $F_i = X_i - Y_i$, taking their mean value ($\bar{F}$), and computing a sample standard deviation. However, based on (10) the propagated error would be expected to be
(12)  $s_F^2 = s_X^2 + s_Y^2$
It is clear by inspection that equations (11) and (12) are not equal, i.e., $\tilde{s}_F^2 \neq s_X^2 + s_Y^2$. Thus $\tilde{s}_F^2$, computed as suggested by Houde et al., is a biased estimator1,2 of the variance in the HX difference and should not be used. The problem arises from the artificial correlation introduced when $X_i$ and $Y_i$ are paired to calculate the mean difference, $\bar{F}$, rather than the difference of the means, $\bar{X} - \bar{Y}$. Again, the HX difference, $\bar{F}$, is not affected, as shown by equation (4), but the error estimate based on equation (11) is biased by covariance.
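The relationship between the two estimates can be verified numerically: for any paired data, the sample variance of the differences equals the sum of the sample variances minus twice the sample covariance, i.e., equation (8) applied to sample quantities. A minimal sketch, using arbitrary simulated triplicates:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(5.0, 0.1, size=3)  # simulated reference triplicate (arbitrary values)
Y = rng.normal(4.8, 0.1, size=3)  # simulated experiment triplicate

s2_paired = np.var(X - Y, ddof=1)                      # equation (11)
s2_propagated = np.var(X, ddof=1) + np.var(Y, ddof=1)  # equation (12)
cov_XY = np.cov(X, Y)[0, 1]                            # equation (9), n-1 normalization

# The paired estimate is biased relative to the propagated one by exactly -2*cov(X, Y)
assert np.isclose(s2_paired, s2_propagated - 2.0 * cov_XY)
```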
A simple numerical example can also serve to illustrate this issue. Consider $X = (1,2,3)$, $Y = (1,2,3)$, and $Z = (3,1,2)$, where the only difference between $Y$ and $Z$ is that the order of the results has changed. It can be shown that $\bar{X} = \bar{Y} = \bar{Z} = 2$ and thus that $\bar{X} - \bar{Y} = \bar{X} - \bar{Z} = 0$. Also, $s_X = s_Y = s_Z = 1$. The propagated error, based on equation (10), is $s_F = \sqrt{2} \approx 1.41$. However, in the case of $F = X - Y$, computing the error by (11), in other words, from the mean of the paired differences, gives
$\tilde{s}_F = 0$
While for $F = X - Z$, where $F_i = (-2, 1, 1)$, the result is
$\tilde{s}_F = \sqrt{3} \approx 1.73$
Thus we can see that the error propagated by equation (11) produces inconsistent results that depend on the arbitrary ordering of independent observations. Inclusion of the sample covariance, $\mathrm{cov}(X,Y) = 1$ and $\mathrm{cov}(X,Z) = -\tfrac{1}{2}$ (i.e., equation (8)), corrects for the artificial correlation, resulting in the “common sense” error of $\sqrt{2}$ under both scenarios, $F = X - Y$ and $F = X - Z$. The important imperative here is that for differential HX-MS measurements, the two conditions must either be treated as independent, uncorrelated statistical entities or the error analysis must include the covariance. In other words, propagate the error based on the difference between the means rather than the mean difference. In some cases, neglecting the covariance in paired data will lead to an over-estimate of $s_F$ and in other cases it will lead to an under-estimate, depending on whether the pairing results in a positive or negative covariance. In the limit of very large numbers of technical replicate measurements of uncorrelated variables $X$ and $Y$, the covariance will approach zero, but in small samples, such as triplicate measurements, there will usually be some amount of covariance. By appropriately averaging a large collection of independent determinations, the covariance would cancel out, since it is equally likely to be positive or negative. However, as described in the next section, the method used to determine the average standard deviation must be chosen appropriately.
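The numerical example above can be reproduced directly. This sketch confirms that equation (11) depends on the arbitrary ordering of the observations, while restoring the covariance term recovers the order-independent result:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([1.0, 2.0, 3.0])
Z = np.array([3.0, 1.0, 2.0])  # same values as Y, different order

# Equation (10): propagated error, insensitive to ordering
assert np.isclose(np.sqrt(np.var(X, ddof=1) + np.var(Y, ddof=1)), np.sqrt(2))

# Equation (11): the paired-difference error depends on the arbitrary ordering
print(np.std(X - Y, ddof=1))  # 0.0
print(np.std(X - Z, ddof=1))  # sqrt(3) ~ 1.732

# Restoring the covariance term (equation (8)) recovers sqrt(2) in both cases
for W in (Y, Z):
    corrected = np.sqrt(np.var(X - W, ddof=1) + 2.0 * np.cov(X, W)[0, 1])
    assert np.isclose(corrected, np.sqrt(2))
```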
Estimating the standard deviation from a population of measurements
Houde et al. proposed using the entire collection of sample standard deviations obtained from all of the replicate differential measurements to improve the reliability of the estimate. This is highly commendable because estimates of error based on triplicate measurements are notoriously unreliable. Setting aside the bias introduced by the omission of covariance discussed in the preceding section, there is, however, an additional problem. The collective standard deviation obtained was the “simple average of all these individual experimentally determined SD values” (SI, p. 4). Here, I assume that by “simple average” the authors mean the arithmetic mean. Taking the arithmetic mean of standard deviations is not a statistically accepted method to estimate the population standard deviation, even if the experimental error were estimated in replicates using either equation (5) or equation (10). Instead, a pooled estimate of the standard deviation, $s_p$, should be used:2
(13)  $s_p = \sqrt{\dfrac{\sum_{i=1}^{N}(n_i - 1)\,s_i^2}{\sum_{i=1}^{N}(n_i - 1)}}$
where $s_i$ is a standard deviation obtained from replicate measurements, $n_i$ is the number of observations, three in the work of Houde et al., and $N = 670$, based on 67 peptides observed at five HX labeling times in two conditions, as suggested by equation (4) of the paper. When all $n_i$ are equal, equation (13) reduces to the root-mean-square (RMS) of the standard deviations:
(14)  $s_p = \sqrt{\frac{1}{N}\sum_{i=1}^{N} s_i^2}$
The well-known inequality between the arithmetic mean and the RMS states that:
(15)  $\frac{1}{N}\sum_{i=1}^{N} X_i \leq \sqrt{\frac{1}{N}\sum_{i=1}^{N} X_i^2}$
for $X_i > 0$, with equality only when all the $X_i$ are identical. This inequality indicates that using the arithmetic mean will lead to an underestimate of the experimental error. Thus, the mean value of 0.14 Da reported by Houde et al. underestimates $s_p$, the error of the differential measurements. The pooled estimate could be based on $X$ and $Y$ separately or on the differences of the means, $\bar{X} - \bar{Y}$; however, as discussed in the following section, working with differences invites confusion in the application of the Student’s t-test that Houde et al. used to establish a 98% confidence interval.
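To illustrate the direction of this bias, the following sketch compares the simple average of a collection of standard deviations with the pooled (RMS) estimate of equation (14). The distribution of SDs is invented for illustration, not the actual values of Houde et al.:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical collection of N = 670 sample SDs (Da); the distribution is invented
sds = np.abs(rng.normal(0.14, 0.05, size=670))

mean_sd = sds.mean()               # the "simple average" of the SDs
rms_sd = np.sqrt(np.mean(sds**2))  # equation (14): pooled SD when all n_i are equal

# Equation (15): the arithmetic mean never exceeds the RMS
assert mean_sd <= rms_sd
print(f"mean = {mean_sd:.3f} Da, pooled (RMS) = {rms_sd:.3f} Da")
```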
Setting a threshold for a statistically significant difference using Student’s t-test
Following the averaging of standard deviations, Houde et al. propose a form of the Student’s t-test to set a threshold for a statistically significant HX difference, in other words, to determine whether $\left|\bar{X} - \bar{Y}\right|$ is large enough to be statistically significant. Although the approach was not described in these terms, Houde et al. used a two-sample t-test for comparison of means assuming equal variance. The null hypothesis in this test is that the two samples, $X$ and $Y$, are drawn from the same population. Loosely speaking, the null hypothesis is rejected when $\left|\bar{X} - \bar{Y}\right|$ exceeds a critical value, the conclusion being that the difference between $X$ and $Y$ is statistically significant. To perform a t-test, one chooses a desired $(1-\alpha)\times 100\%$ confidence level. The critical threshold for significance is then defined as
(16)  $\left|\bar{X} - \bar{Y}\right| > t_{\alpha/2:df}\; s_p \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}$
where $t_{\alpha/2:df}$ is the Student’s t value based on $\alpha/2$ and the degrees of freedom, $df$. Houde et al. selected $\alpha = 0.02$ to obtain a 98% confidence level. In the implementation by Houde et al., the following equation was formed
(17)  $\left|\bar{F}\right| > t_{\alpha/2:2}\; \frac{\bar{s}}{\sqrt{3}}$
since $n_X = n_Y = 3$ and the error estimate was based on the difference. The preceding two sections have highlighted the difficulties associated with using the arithmetic mean experimental error, $\bar{s}$. Yet, even putting aside the problems associated with the error estimate, there is an additional problem with the threshold in equation (17): the degrees of freedom are incorrect. In a two-sample t-test with equal variance
(18)  $df = n_X + n_Y - 2$
and thus $df = 4$ for the difference between means from triplicate measurements. The difference here is substantial: $t_{0.01:2} = 6.965$ while $t_{0.01:4} = 3.747$, leading to an almost two-fold overestimate of the magnitude of the significance limit.
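The two critical values are easily checked with SciPy (assuming it is available); `t.ppf` is the quantile function of the Student's t distribution:

```python
from scipy.stats import t

alpha = 0.02  # 98% confidence level
t_df2 = t.ppf(1 - alpha / 2, df=2)  # df = n - 1 = 2, as used by Houde et al.
t_df4 = t.ppf(1 - alpha / 2, df=4)  # df = n_X + n_Y - 2 = 4, per equation (18)

print(f"t(0.01, df=2) = {t_df2:.3f}")  # 6.965
print(f"t(0.01, df=4) = {t_df4:.3f}")  # 3.747
print(f"ratio = {t_df2 / t_df4:.2f}")  # 1.86, an almost two-fold difference
```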
Recommendations
The overall premise of the work by Houde et al. is sound: use a large population of replicate measurements to establish a significance threshold for differential HX measurements. However, three errors in the implementation lead, on balance, to an overestimate of that significance limit. To avoid these errors, three simple corrections are needed:
1. Compute the individual standard deviations based on the sample means rather than the paired differences.
2. Use the pooled standard deviation rather than the mean standard deviation for a global estimate of the experimental error.
3. Determine the number of degrees of freedom based on the total number of measurements in the data, i.e., equation (18).
References
- 1. Barlow RJ. 1989. Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences. 1st ed. Chichester: John Wiley and Sons.
- 2. Freund RJ, Wilson WJ. 2003. Statistical Methods. 2nd ed. Amsterdam: Academic Press.
- 3. Taylor JR. 1997. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. 2nd ed. Sausalito: University Science Books.