Abstract
Estimating the statistical significance of the difference between two spectra or series is a fundamental statistical problem. Multivariate significance tests exist, but their limitations preclude their use in many common cases, e.g., one-sided testing, unequal variance, and few acquired repetitions, all of which are required in magnetic spectroscopy of nanoparticle Brownian motion (MSB). We introduce a test, termed the T-S test, that is powerful and exact (exact type I error). It can be one- or two-sided, and the one-sided version can specify arbitrary regions where each spectrum should be larger. The T-S test takes the one- or two-sided p-value at each frequency and combines them using Stouffer’s method. We evaluated it using simulated spectra and measured MSB spectra. For the one-sided version, a t-test on the mean of the spectrum, termed the A-T test, was used as a reference; the T-S test is as powerful when the variance at each frequency is uniform and outperforms the A-T test when the noise power is not uniform. For the two-sided version, the Hotelling T² two-sided multivariate test was used as a reference; the two-sided T-S test is only slightly less powerful for large numbers of repetitions and outperforms the Hotelling T² test dramatically for small numbers of repetitions. The T-S test was used to estimate the sensitivity of our current MSB spectrometer, showing 1 nanogram sensitivity. Using eight repetitions, the T-S test allowed 15 pM concentrations of mouse IL-6 to be identified, while the mean of the spectra only identified 76 pM.
Index Terms: Multivariate statistical significance testing, One-sided statistical significance testing
Introduction:
Our ability to estimate the sensitivity of spectrometers used for magnetic spectroscopy of Brownian motion (MSB) was limited by the lack of one-sided multivariate statistical significance testing. Further, most MSB experiments require the same one-sided, multivariate statistical significance testing: we need to know when two spectra are significantly different. Adequate methods did not exist, so we developed several. Here we compare them, showing their strengths and weaknesses, and report the sensitivity we are able to achieve with our current MSB spectrometer.
Magnetic particle spectroscopy (MPS) is useful in characterizing the microenvironment when the nanoparticles (NPs) are large enough that the dominant relaxation mechanism is Brownian rather than Néel. When Brownian relaxation is dominant, we call it magnetic spectroscopy of NP Brownian motion (MSB). NP spectra have been used to identify changes in temperature (Weaver et al., 2009; Perreard et al., 2014; Draack et al., 2017), viscosity (Rauwerdink and Weaver, 2010b), chemical binding (Rauwerdink and Weaver, 2010a), pH (Gordon-Wylie et al., 2020), and the stiffness of the matrix to which the NPs are bound (Weaver et al., 2014). The ability to identify binding has been exploited to measure the concentration of molecular biomarkers quantitatively (Zhang et al., 2013; Weaver et al., 2017).
Most of these applications compare a spectrum to a reference spectrum, and one wishes to know whether the difference between the spectra is significant. Experiments using individual frequencies must control all the factors that influence the spectra to isolate the desired effect from the many that might be changing. Spectra, by contrast, allow changes in viscosity or binding to be found even when the temperature is changing (Shi et al., 2020). The use of a spectrum requires a multivariate statistical test.
The spectra produced by magnetic NPs are quite simple in some ways. They are smooth: the measured signal can be represented by Langevin functions over much of the commonly used domain (Reeves et al., 2016). The spectra generally change monotonically with frequency and, more importantly, they do not cross, so the spectrum is either larger or smaller than the reference at all of the frequencies measured. The latter property means that one-sided significance testing is appropriate and will improve the sensitivity.
Estimating the significance of the difference between spectra requires one-sided, multivariate significance testing. Common significance tests exist for single-variable testing, either one-sided or two-sided (the z-test and Student’s t-test), and for multivariate, two-sided testing (the Hotelling T² test; Hotelling, 1992). But there are no commonly used methods for one-sided, multivariate significance testing, although research is continuing (Perlman and Wu, 2006).
Methods:
Proposed One-sided Multivariate Significance Testing Methods
We obtained composite p-values by combining one-sided t-tests at each frequency. We used two methods of combining p-values generally used for meta-analyses (Bailey and Gribskov, 1998; Whitlock, 2005): Fisher’s method (Fisher, 1925) and Stouffer’s method (Stouffer et al., 1949). One-sided p-values at each frequency were obtained using the t-test and the z-test. T-test derived p-values combined with Stouffer’s and Fisher’s methods were termed the T-S and T-F tests respectively. Similarly, z-test derived p-values combined using Stouffer’s and Fisher’s methods were termed the Z-S and Z-F tests respectively. Np is the number of frequencies in the spectra and Nr is the number of repeated measurements.
Because Stouffer’s method transforms the p-values to corresponding z-values for combination, there is a relatively simple closed form solution for the Z-S test. The existence of a closed form solution motivated our inclusion of the z-test despite its well-known errors at low values of Nr.
Each set of measured spectra is acquired in one state and compared to spectra acquired in the reference state to estimate the significance of the difference between them.
Fisher’s Method:
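The p-values from the t-tests were found from the contrast to noise at each frequency. Fisher’s method generates a composite statistic:
$$X^2 = -2 \sum_{i=1}^{N_p} \ln p_i \qquad (1)$$
where $p_i$ is the p-value for the $i$th point. Under the null hypothesis $X^2$ follows a $\chi^2$ distribution with $2N_p$ degrees of freedom, so the composite p-value is:
$$p_F = \frac{\Gamma(N_p,\, X^2/2)}{\Gamma(N_p)} \qquad (2)$$
where $\Gamma(\cdot)$ is the gamma function and $\Gamma(\cdot,\cdot)$ is the upper incomplete gamma function.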
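Fisher’s combination can be illustrated with a minimal Python sketch. It is not the Matlab code used for the paper; the function name `fisher_combine` is hypothetical, and the sketch relies on the fact that the chi-squared survival function has a finite closed form when the number of degrees of freedom is even, which is always the case here.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (illustrative sketch)."""
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one p-value in (0, 1]")
    # Half of Fisher's statistic: X^2 / 2 = -sum(ln p_i).
    half_x2 = -sum(math.log(p) for p in p_values)
    # Chi-squared survival function with 2k degrees of freedom:
    # P(chi2_{2k} > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_x2 / j
        total += term
    return math.exp(-half_x2) * total
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the combination.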
Stouffer’s Method:
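Stouffer’s method of combining p-values maps the p-values onto the normal distribution, where they are averaged to form a composite p-value. Each p-value is converted to its corresponding z-value, the z-values are combined, and the combined z-value is translated back to a composite p-value:
$$z_i = \Phi^{-1}(1 - p_i), \qquad Z = \frac{1}{\sqrt{N_p}} \sum_{i=1}^{N_p} z_i, \qquad p_S = 1 - \Phi(Z) \qquad (3)$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution.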
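Stouffer’s combination can be sketched in a few lines of Python using only the standard library. The function name `stouffer_combine` is hypothetical and the sketch uses equal weights for every p-value; it is an illustration, not the code used for the paper.

```python
import math
from statistics import NormalDist

def stouffer_combine(p_values):
    """Combine independent p-values with Stouffer's method (equal weights)."""
    if not p_values or any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("need at least one p-value strictly between 0 and 1")
    std_normal = NormalDist()
    # z_i = Phi^{-1}(1 - p_i), written as -Phi^{-1}(p_i) to keep precision for small p.
    z_scores = [-std_normal.inv_cdf(p) for p in p_values]
    z_combined = sum(z_scores) / math.sqrt(len(z_scores))
    # Composite p-value: 1 - Phi(Z), written as Phi(-Z) for the same reason.
    return std_normal.cdf(-z_combined)
```

Because the z-values add, several moderately small p-values reinforce each other: four p-values of 0.05 combine to a composite p-value well below 0.05.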
For comparison, we used the single-variate t-test on the sum of each measured spectrum over frequency, termed the A-T test.
Each of these five one-sided, multivariate significance tests has a two-sided version obtained by taking the two-sided t-tests at each frequency and calculating a composite p-value using Stouffer’s or Fisher’s method. We compared the power of the two-sided T-S test with the power of the Hotelling T² test over a range of Nr to evaluate its usefulness in the two-sided setting as well.
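A minimal end-to-end sketch of the T-S test, in its one- and two-sided forms, is given here in Python. Several choices are assumptions rather than details taken from the paper: the per-frequency test is a pooled-variance two-sample t-test, the Student’s t tail probability is computed with a textbook continued-fraction evaluation of the incomplete beta function so that only the standard library is needed, and the names `t_sf` and `ts_test` are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def _betacf(a, b, x, max_iter=300, eps=1e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def _betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def t_sf(t, dof):
    """Upper-tail probability P(T > t) of Student's t distribution."""
    tail = 0.5 * _betainc(0.5 * dof, 0.5, dof / (dof + t * t))
    return tail if t >= 0 else 1.0 - tail

def ts_test(sample, reference, two_sided=False):
    """T-S composite p-value; sample and reference are lists of spectra.

    One-sided alternative: the sample spectrum is larger than the reference.
    """
    n_s, n_r, n_freq = len(sample), len(reference), len(sample[0])
    dof = n_s + n_r - 2
    norm = NormalDist()
    z_sum = 0.0
    for i in range(n_freq):
        a = [spec[i] for spec in sample]
        b = [spec[i] for spec in reference]
        # Assumption: pooled-variance two-sample t-test at each frequency.
        pooled = ((n_s - 1) * variance(a) + (n_r - 1) * variance(b)) / dof
        t = (mean(a) - mean(b)) / math.sqrt(pooled * (1.0 / n_s + 1.0 / n_r))
        p = 2.0 * t_sf(abs(t), dof) if two_sided else t_sf(t, dof)
        p = min(max(p, 1e-300), 1.0 - 1e-16)  # keep p inside inv_cdf's open domain
        z_sum -= norm.inv_cdf(p)  # z_i = Phi^{-1}(1 - p_i)
    return norm.cdf(-z_sum / math.sqrt(n_freq))  # Stouffer composite p-value
```

Calling `ts_test(sample, reference)` with Nr spectra per group, each holding Np amplitudes, returns the one-sided composite p-value; swapping the two arguments tests the opposite direction.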
Evaluating the Significance Tests:
The tests were evaluated using both simulations and measured MSB spectra.
Monte Carlo Simulations:
To measure the power and accuracy of the above one-sided tests, we first tested them with simulations. We used Matlab (Mathworks, Natick, MA) to generate normally distributed random spectra with means that differed by a specified amount. The spectra were generated using Matlab’s “randn” function, with Np = 8 and Nr = 4. We compared two groups of Nr random spectra, one representing the reference spectra. We generated many group pairs for each difference between the means so that we could measure the proportion that were significant for each test at a given difference. This proportion, the fraction of p-values that reach significance for a given difference in means between the spectra and the reference spectra, is the statistical power of the test. A certain proportion of significant results will be obtained even when there is no difference from the reference spectra. For an “exact test” that proportion equals the selected significance level; i.e., the type I error is the selected significance threshold when the null hypothesis is true. For a powerful test, the significant proportion should rise from the selected significance level to one over as small a difference as possible.
We also compared test results for simulations with equal variance at each frequency with simulations where the variance was different at different frequencies, to see which tests would be the most resilient to uneven variance across frequency.
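The Monte Carlo procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the vectorized Matlab code used for the paper: the test evaluated is a Z-S style test in which the z-test plugs the sample variance into the standard error, and the names `z_s_p_value` and `significant_fraction` and the default trial count are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance

def z_s_p_value(sample, reference):
    """One-sided Z-S composite p-value (z-test with estimated variances)."""
    n_s, n_r, n_freq = len(sample), len(reference), len(sample[0])
    z_sum = 0.0
    for i in range(n_freq):
        a = [spec[i] for spec in sample]
        b = [spec[i] for spec in reference]
        # Assumption: the z-test uses the sample variances as if they were known.
        std_err = math.sqrt(variance(a) / n_s + variance(b) / n_r)
        # For a z-test, Phi^{-1}(1 - p_i) is simply the z statistic itself.
        z_sum += (mean(a) - mean(b)) / std_err
    return NormalDist().cdf(-z_sum / math.sqrt(n_freq))

def significant_fraction(delta, n_freq=8, n_rep=4, n_trials=2000,
                         alpha=0.05, seed=0):
    """Fraction of simulated group pairs with p < alpha.

    delta is the difference between the means in units of the noise
    standard deviation; delta = 0 estimates the type I error rate.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        reference = [[rng.gauss(0.0, 1.0) for _ in range(n_freq)]
                     for _ in range(n_rep)]
        sample = [[rng.gauss(delta, 1.0) for _ in range(n_freq)]
                  for _ in range(n_rep)]
        if z_s_p_value(sample, reference) < alpha:
            hits += 1
    return hits / n_trials
```

Evaluating `significant_fraction(0.0)` estimates the type I error; for an exact test it would be close to 0.05, whereas this z-based sketch can be expected to exceed it at Nr = 4. Sweeping `delta` upward traces out a power curve.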
Matlab library routines were used for: z-test, t-test, Hotelling T² test, error function and inverse error function. The vectorized code was reasonably fast: for example, the code generating Figure 1 ran in less than 2.5 minutes on a 2015 MacBook Pro. The 2¹⁸ (half million) spectra achieved an average significant proportion with standard deviations less than 10⁻³; i.e., less than 5% of the mean difference between successive values in the power curves. Fewer spectra, 2¹⁵, were used for the Hotelling T² simulations because that code could not be vectorized so the run times became excessively long: it ran more than 80 hours of wall time on 16 processors on Dartmouth’s Discovery cluster.
Fig. 1:

Power curves for all five one-sided, multivariate statistical tests for Np = 8 and Nr = 4. The difference between the means is in units of the standard deviation of the noise. The power increases as the difference of the means increases. Both tests based on the z-test are not exact, providing too many significant results when no difference exists. Of the three exact tests, the T-S test is more powerful than the A-T test, followed by the T-F test. Note that Z-F and Z-S are not necessarily more powerful than T-S: their power curves lie above that of T-S only because they are not exact and already exceed the 5% significance level when no difference exists (Demidenko, 2019).
Experiments Evaluating the Sensitivity to Nanoparticle Mass and Biomarker Binding:
We compared the A-T and T-S tests using experimental MSB spectra from different amounts of NPs to compare the power of the two tests. Precision MRX iron-oxide NPs (Imagion BioSystems, Inc., Albuquerque, NM) were used, and subsets of the data were tested to explore the effects of uneven variance across the frequencies measured. The NPs had a 43 nm hydrodynamic diameter (Malvern Zetasizer) and were composed of a 25 nm diameter magnetite core and a monolayer of amphiphilic polymer (polyethylene glycol, PEG) with carboxylic acid (COOH) functional groups on the surface. The first 100 μL sample contained 250 ng (250×10⁻⁹ grams) of NPs in deionized water. The reference sample contained 249 ng of NPs in 100 μL, prepared identically. The most powerful test requires the fewest repetitions to demonstrate a significant difference between the two samples.
Thirty MSB spectra were measured for each sample using a 5 mT alternating magnetic field at five frequencies, Np = 5: 334.2, 816.8, 1592.9, 2342, 3369.1 Hz. The static perpendicular magnetic field was 1 mT (Reeves and Weaver, 2014; Shi et al., 2017). Each frequency was averaged over five seconds and the spectrum was repeated thirty times so the entire measurement time was 45 minutes per sample.
We adjusted the range of standard deviations at each frequency by: 1) removing the lowest frequency from each data set so the range of standard deviations was minimized and 2) removing the middle frequency so the range of standard deviations was maximized. Both data sets had Np = 4. The largest standard deviation in the first data set was 47% larger than the mean standard deviation, while the largest standard deviation of the second data set was 133% larger than the mean.
We calculated the number of repetitions needed to reach significance between the two samples for the best significance tests found from the simulations: T-S and A-T. We measured thirty spectra for both the 250 ng and the 249 ng samples. For each value of Nr, we averaged the p-values for 2¹⁶ random combinations of spectra to span the range of measured noise. We calculated the T-S and A-T significance tests using repetitions ranging from two to thirty.
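The resampling step can be sketched as follows in Python. The details are assumptions made for illustration: subsets of the sample and reference spectra are drawn independently and without replacement, `test_fn` stands for any function that returns a composite p-value for two groups of spectra, and the names `average_p_value` and `repetitions_to_significance` are hypothetical.

```python
import random

def average_p_value(sample, reference, n_rep, test_fn, n_draws=1000, seed=0):
    """Average p-value over random subsets of n_rep spectra from each group."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # Assumption: subsets are drawn independently, without replacement.
        total += test_fn(rng.sample(sample, n_rep),
                         rng.sample(reference, n_rep))
    return total / n_draws

def repetitions_to_significance(sample, reference, test_fn,
                                alpha=0.05, n_draws=1000, seed=0):
    """Smallest number of repetitions whose average p-value falls below alpha."""
    max_rep = min(len(sample), len(reference))
    for n_rep in range(2, max_rep + 1):
        p_avg = average_p_value(sample, reference, n_rep, test_fn,
                                n_draws=n_draws, seed=seed)
        if p_avg < alpha:
            return n_rep
    return None  # significance not reached with the available spectra
```

Passing a T-S or A-T implementation as `test_fn` gives the number of repetitions each test needs; the number of draws trades run time against the noise in the averaged p-value.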
We also evaluated the sensitivity of biomarker molecule concentration measurements (Zhang et al., 2013) using the same spectrometer. Samples of 25 micrograms of magnetite NPs (BNF) from Micromod Partikeltechnologie GmbH (Rostock, Germany) derivatized with streptavidin were conjugated with biotinylated antibodies for mouse IL-6 (R&D Biotechne). Mouse IL-6 was added to achieve 7 pM, 15 pM and 76 pM concentrations with constant 100 microliter sample volumes. Spectra were taken at 10 mT AC field, 1 mT DC field at 547, 817, 1218, 1594, and 2343 Hz. Two samples were prepared for each concentration and seven repeated spectra were measured for each sample. An additional seven spectra were measured on the second 7 pM sample. We calculated the average p-value for each number of repetitions using the T-S and A-T methods. For both methods, the significance for the two samples was combined using Stouffer’s method. The significance values for one thousand random combinations of the repetitions were averaged. The same number of reference spectra were used for each concentration.
Results:
The simulations showed the one-sided T-S test was superior to the A-T test in most cases. The two-sided T-S test was superior to the Hotelling T² test for small numbers of repetitions.
Simulations:
Fig. 1 compares the simulation results for the five one-sided significance tests. The two using the z-test, perhaps predictably, were not exact at low contrast; the significance was artificially high. When there is no difference between the spectra, the significant proportion must equal the selected significance level, in our case 0.05, and it was much higher for both tests based on the z-test. The excess persists for far larger Nr than we expected; it was 230% of the 0.05 significance level for Nr = 4 in Fig. 1 but remained at 150% for Nr = 128. This makes the z-test based tests of no value because any significance found is likely a type I error. Fig. 1 shows that the T-S test has greater power than the other two exact tests.
Comparing the one-sided to the two-sided tests shows that the a priori information about which spectra should be larger increases the power for all of the tests. The T-S test power increases from 0.18 for the two-sided to 0.60 for the one-sided when the contrast was a half standard deviation, Np = 8 and Nr = 4.
The power for any of the one-sided tests increases as either Np or Nr increases. The power increases with the Np × Nr product. There is a slight improvement in power if Nr is increased rather than Np. The allocation of measurement time should be made to optimize the power. For example, differentiating small changes in MSB spectra is best done by repeating a measurement at one frequency rather than measuring another frequency. However, when measuring the temperature or viscosity of the microenvironment of the NPs, scaling two spectra is required (Weaver et al., 2009; Weaver and Kuehlert, 2012), so more frequencies are required despite the slight loss in the ability to differentiate the spectra.
The effect of different noise at each frequency on the power of the one-sided tests is shown in Fig. 2. Testing of spectra with equal variance across frequencies is compared to testing of spectra in which the noise at half the frequencies was double, and the noise at the other half was half, that of the equal variance case. The T-S and T-F tests exploit the low noise frequencies to improve the power over the equal variance case. However, the A-T test, which is otherwise almost as powerful as the T-S test, has much lower power for unequal variances.
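A rough way to see why unequal noise helps the T-S test and hurts the A-T test is to compare the expected shift of each test statistic, in standard-error units, when the noise standard deviations are treated as known. This is a large-Nr approximation made for illustration, not the analysis in the paper; it also assumes that "double" and "half" refer to the noise standard deviation, and the function names are hypothetical.

```python
import math

def ts_expected_shift(delta, sigmas, n_rep):
    """Expected Stouffer z for the T-S test with known noise levels."""
    std_err_factor = math.sqrt(2.0 / n_rep)  # two groups of n_rep spectra
    z_per_freq = [delta / (s * std_err_factor) for s in sigmas]
    return sum(z_per_freq) / math.sqrt(len(sigmas))

def at_expected_shift(delta, sigmas, n_rep):
    """Expected z for the A-T test, which sums each spectrum over frequency."""
    std_err_factor = math.sqrt(2.0 / n_rep)
    summed_sigma = math.sqrt(sum(s * s for s in sigmas))
    return len(sigmas) * delta / (summed_sigma * std_err_factor)

equal = [1.0] * 8
unequal = [2.0] * 4 + [0.5] * 4  # half the frequencies doubled, half halved

ts_gain = ts_expected_shift(0.5, unequal, 4) / ts_expected_shift(0.5, equal, 4)
at_gain = at_expected_shift(0.5, unequal, 4) / at_expected_shift(0.5, equal, 4)
```

Under these assumptions the two tests have the same expected shift when the noise is equal, while for the unequal case `ts_gain` is 10/8 = 1.25 and `at_gain` is sqrt(8/17), about 0.69: the per-frequency tests weight each frequency by its own noise, whereas the summed spectrum is dominated by the noisiest frequencies.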
Fig. 2:

When the noise power is different for each frequency, the one-sided T-S and T-F tests are able to use the data at frequencies with lower noise more effectively than the one-sided A-T test which suffers a large loss of power. The three exact tests (A-T, T-F and T-S) are compared for Np = 8. When the variance is identical for all frequencies in the spectra, the results are the same as in Fig. 1, with T-S being the most powerful.
We also found the two-sided version of the T-S test (conducted with two-sided t-tests) to be a useful alternative to the traditionally used Hotelling T² test, particularly in scenarios without enough repetitions to accurately calculate the covariance matrix for the Hotelling T² test. Results of this comparison are shown in Fig. 3. The comparison was made at two values of Nr: 8 and 1024. The Hotelling T² test cannot be used directly if Nr < Np because Nr is too low to estimate the covariance matrix without a priori information. The T-S test functions when Nr < Np so it is clearly useful in that regime. However, even for Nr > Np, the power of the Hotelling T² test continues to suffer from small Nr far more than the T-S test does. Even at large values of Nr, when the covariance matrix can be estimated accurately, the Hotelling T² test is only superior to the T-S test by a small amount.
Fig. 3:

Simulated power comparing the two-sided T-S test and the Hotelling T² two-sided multivariate significance test for Np = 4 and two values of Nr: 8 and 1024. The number of simulated spectra was Ns = 2¹⁵. When Nr is only slightly larger than Np, e.g., Np = 4 and Nr = 8, the T-S test has much higher power. The Hotelling T² test is only slightly superior for large Nr; e.g., Nr = 1024 as shown.
Experimental Magnetic Nanoparticle Spectra:
The Nr required to achieve significance between MSB spectra from samples with a 1 nanogram (10⁻⁹ gram) difference in the mass of NPs was compared for the A-T and T-S tests. Five frequencies were acquired but only four were used in each comparison. For one set of tests, the lowest frequency was removed to achieve the most uniform noise variance between the frequencies. For the other set of tests, the middle frequency was removed to achieve the most unequal noise variance among the frequencies. When the noise was almost equal, the A-T and T-S tests achieved essentially the same power; 7 repetitions were required for both to achieve significance on average. When the noise was unequal, the T-S test performed much better than the A-T test, a result that is congruent with the simulations. The T-S test required 12 repetitions, while the A-T test required 22 repetitions to achieve significance.
For the IL-6 biomarker measurements, the A-T test, which averages the signal over frequency, performed much more poorly than the T-S method in all cases. Using eight repetitions, the A-T method was able to detect 76 pM while the T-S method was able to detect 15 pM. The number of repetitions required to achieve significance also favors the T-S method. For 76 pM, significance was achieved for two repetitions using the T-S method while seven repetitions were required for the A-T method. For 15 pM, significance was achieved for eight repetitions using the T-S method while the A-T method did not achieve significance even with all 14 repetitions. Using 10 repetitions, the 76 pM concentration was successfully identified using the A-T method.
Conclusions:
Five single-sided, multivariate significance tests were evaluated using simulations and experimental data. The most straightforward method, the A-T test, is to perform a single-variate, single-sided t-test on the sum of the signal over all frequencies. This simple method performs adequately when the noise variance is the same for each frequency, but the power is markedly reduced when the variances are unequal. To handle unequal variances, the p-value at each frequency was calculated separately and combined using methods developed for meta-analyses. The best test, the T-S test, is calculated by combining the p-values obtained from a t-test at each frequency using Stouffer’s method. It performed well with equal or unequal variances and was exact at low contrast. All methods using the z-test were inexact, demonstrating artificially high significance at low contrast. The two-sided version of the T-S test performed almost as well as the Hotelling T² test for large numbers of repetitions and outperformed the Hotelling T² test for small numbers of repetitions. The same trends were demonstrated for the experimental comparisons. When the variance was similar, samples that differed by 1 nanogram of NPs were significantly different when 7 repetitions were combined using either the A-T or the T-S test; however, when the variance was unequal and larger, only 12 repetitions were required by the T-S test while 22 were required for the single-variate p-value of the integrated spectrum. Using eight repetitions, the T-S test allowed 15 pM concentrations of mouse IL-6 to be identified, while the mean of the spectra only identified 76 pM. The preferred method is also useful in the two-sided, multivariate case, where the Hotelling T² test provides an extension of Student’s t-test; no such extension exists for the one-sided multivariate case addressed here. The T-S method performed essentially identically to the Hotelling T² test when the number of repetitions was large. However, the T-S method has far better power when the number of repetitions is small.
Fig. 4:

Average MSB spectra with standard deviations at each frequency. Spectra were acquired from 250 ng and from 249 ng of magnetic NPs. Note the variance at low frequency is much higher than at high frequency.
Fig. 5:

Significance of the one-sided A-T test suffers more than that of the one-sided T-S test with non-uniform variance at each frequency in the MSB spectra. Spectra acquired from 250 ng were compared with spectra from 249 ng of magnetic NPs. The average p-value from 2¹⁶ (64k) random combinations of each number of repetitions is plotted vs the number of repetitions. Four-frequency spectra with uniform variance were compared to four-frequency spectra with non-uniform variance.
Fig. 6:

The T-S test allowed lower concentrations of biomarker proteins to be isolated than the A-T test. NPs coated with antibodies for the IL-6 protein were mixed with three concentrations of that protein. The IL-6 protein links the NPs together increasing their relaxation time and reducing their MSB signal. The average significance improved with increasing number of repetitions used.
Acknowledgments:
This work was supported in part by the NIH under grants NIH/NIBIB 1R21EB021456 and NIH/NCI R01 CA262147.
References:
- Bailey TL and Gribskov M 1998. Combining evidence using p-values: application to sequence homology searches Bioinformatics 14 48–54
- Demidenko E 2019. Advanced Statistics with Applications in R (Hoboken, NJ: John Wiley & Sons)
- Draack S, Viereck T, Kuhlmann C, Schilling M and Ludwig F 2017. Temperature-dependent MPS measurements 2017 3
- Fisher RA 1925. Statistical Methods for Research Workers (Edinburgh, UK: Oliver and Boyd)
- Gordon-Wylie SW, Grüttner C, Teller H and Weaver JB 2020. Using Magnetic Nanoparticles and Protein–Protein Interactions to Measure pH at the Nanoscale IEEE Sensors Letters 4 1–335582432
- Hotelling H 1992. Breakthroughs in Statistics (Springer) pp 54–65
- Perlman MD and Wu L 2006. Some improved tests for multivariate one-sided hypotheses Metrika 64 23–39
- Perreard I, Reeves D, Zhang X, Kuehlert E, Forauer E and Weaver J 2014. Temperature of the magnetic nanoparticle microenvironment: estimation from relaxation times Physics in Medicine and Biology 59 1109
- Rauwerdink AM and Weaver JB 2010a. Measurement of molecular binding using the Brownian motion of magnetic nanoparticle probes Applied Physics Letters 96 033702. Also appears in February 1, 2010 issue of Virtual Journal of Biological Physics Research
- Rauwerdink AM and Weaver JB 2010b. Viscous effects on nanoparticle magnetization harmonics Journal of Magnetism and Magnetic Materials 322 609–13
- Reeves DB, Shi Y and Weaver JB 2016. Generalized Scaling and the Master Variable for Brownian Magnetic Nanoparticle Dynamics PloS one 11 e0150856
- Reeves DB and Weaver JB 2014. Magnetic nanoparticle sensing: decoupling the magnetization from the excitation field J. Phys. D: Appl. Phys 47 045002 (8pp)
- Shi Y, Jyoti D, Gordon-Wylie SW and Weaver JB 2020. Quantification of magnetic nanoparticles by compensating for multiple environment changes simultaneously Nanoscale 12 195–200
- Shi Y, Khurshid H, Ness DB and Weaver JB 2017. Harmonic phase angles used for nanoparticle sensing Physics in Medicine & Biology 62 8102
- Stouffer SA, Suchman EA, DeVinney LC, Star SA and Williams RMJ 1949. The American Soldier: Adjustment during Army Life. vol 1 (Princeton: Princeton University Press)
- Weaver JB, Rauwerdink AM and Hansen EW 2009. Magnetic nanoparticle temperature estimation Medical Physics 36 1822–9
- Weaver JB, Rauwerdink KM, Rauwerdink AM and Perreard IM 2014. Magnetic spectroscopy of nanoparticle Brownian motion measurement of microenvironment matrix rigidity Biomedizinische Technik/Biomedical Engineering 58 547–50
- Weaver JB, Shi Y, Ness DB, Khurshid H and Samia ACS 2017. Sensitivity Limits for in vivo ELISA Measurements of Molecular Biomarker Concentrations Journal of Magnetic Particle Imaging 3 706003
- Weaver JB and Kuehlert E 2012. Measurements of Magnetic Nanoparticle Relaxation Times Medical Physics 39 2765–70. Also published in the May 1, 2012 issue of Virtual Journal of Biological Physics Research
- Whitlock M 2005. Combining probability from independent tests: the weighted Z‐method is superior to Fisher’s approach Journal of Evolutionary Biology 18 1368–73
- Zhang X, Reeves DB, Perreard IM, Kett WC, Griswold KE, Gimi B and Weaver JB 2013. Molecular sensing with magnetic nanoparticles using magnetic spectroscopy of nanoparticle Brownian motion Biosensors and Bioelectronics 50 441–6
