Abstract
Reference ranges are indispensable tools in laboratory medicine for the interpretation of laboratory test results. When assessing measurements on a single analyte, univariate reference intervals are required. In many cases, however, medical practitioners need measurements on several analytes to assess more complicated conditions, such as impaired kidney or liver function. For such cases, it is recommended to use multivariate reference regions, which account for the cross-correlations among the analytes. Traditionally, multivariate reference regions (MRRs) have been constructed as ellipsoidal regions. The disadvantage of such regions is that they are unable to detect component-wise outlying measurements. Because of this, rectangular reference regions have recently been put forward in the literature. In this study, we develop methodologies to compute rectangular MRRs that incorporate covariate information, which is often necessary in evaluating laboratory test results. We construct the reference region using tolerance-based criteria so that the resulting region possesses the multiple-use property. Results show that the proposed regions yield coverage probabilities that are accurate and robust to the sample size. Finally, we apply the proposed procedures to a real-life example on the computation of an MRR for three components of the insulin-like growth factor system.
KEYWORDS: Reference intervals, tolerance region, multivariate reference region, parametric bootstrap, laboratory medicine
1. Introduction
1.1. Background of the study
Biochemical and physiological test results of patients, without the necessary information for their interpretation, are limited on their own. This necessitates the use of reference intervals, which act as tools to interpret numerical laboratory results to provide substantial information to medical practitioners. These intervals are constructed based on the test results of a reference population, or a healthy population. The importance of reference intervals has been stressed repeatedly in the laboratory medicine literature, as numerous medical decisions are made based on laboratory test results and their interpretations. Reference intervals can be two-sided (i.e. consisting of both an upper and lower reference limit), or just one-sided (with a given upper or lower reference limit, but not both). Moreover, these reference limits can also be dependent on covariates, such as age, sex, and body mass index (BMI).
Population-based reference intervals are established so that they capture the measurements between the 2.5th and the 97.5th percentiles of the reference population. That is, the reference interval is supposed to contain the central 95% of the measurements of the reference population. In practice, reference ranges are calculated based on a random sample from a given population since the percentiles of reference populations are usually unknown. One may compute the reference range by using sample percentiles, but this is a naïve approach: sample percentiles ignore sampling variability, so the proportion of the population covered by the resulting interval can fall below 95%. Thus, instead of sample percentiles, two alternative approaches to determine the reference interval are prediction intervals and tolerance intervals.
In more complicated diagnoses, such as that of hepatotoxicity (liver damage caused by the use of drugs), multiple analytes need to be assessed. Hyman Zimmerman introduced Hy’s law as a rule of thumb for hepatotoxicity detection. Hy’s law states that hepatotoxicity is present if the alanine transaminase (ALT) level is three times greater than the reference value, and total bilirubin is two times greater than the reference value. This illustrates the use of multiple analyte measurements to diagnose a condition. When multiple analytes are needed in a diagnosis, an inaccurate approach is to use separate univariate reference intervals. As explained in [2], such an approach leads to numerous false-positive results because the cross-correlations among analytes are disregarded. Thus, [2] proposes the use of a multivariate reference region (MRR) for situations requiring interpretation or diagnosis using multiple laboratory test results. The MRR approach accounts for the correlation structure among the analytes. Moreover, it reduces the number of false-positive diagnoses since the region gives exact, rather than liberal, coverage.
Traditionally, MRRs have been constructed as ellipsoidal reference regions. However, reference regions of such shape cannot detect possible outlying univariate measurements. In other words, when a patient’s multivariate result does not fall within the ‘normal’ range of the reference region, it will not be possible to single out the specific analyte that has caused this result. Moreover, reference regions of ellipsoidal shape lack interpretability, possibly leading to test results that are difficult to explain to patients. [20] observes that MRRs have only played ‘a marginal role in the practice of clinical chemistry and laboratory medicine.’ This is largely due to the difficulties associated with ellipsoidal MRRs. Consequently, instead of ellipsoidal MRRs, [20] proposes the construction of rectangular MRRs, which lead to interpretable reference ranges and enable component-wise outlier detection. Following the work of [20], other studies have also proposed various approaches to compute rectangular MRRs. These include the recent works of [14,23].
An important question to ask is the following: Should we use prediction regions or tolerance regions as the criterion to construct MRRs? While the criterion for prediction intervals and regions is very common and has been recommended by [5] and the National Committee for Clinical Laboratory Standards [17], the resulting reference region is meant only for single use, not multiple use (or repeated use). In actual practice, however, reference intervals and regions are often used repeatedly. That is, the same reference interval or region is used to interpret the laboratory test results of multiple patients. Given that the multiple-use aspect of reference regions is not captured by prediction regions, [3] and [12] argue that the criterion for tolerance intervals and regions is the appropriate criterion to compute reference intervals and regions. As we shall see, tolerance regions are amenable to multiple use since such regions capture a specified proportion of the population while quantifying the uncertainty in the estimated region.
This study is also largely concerned with the effect of covariate information in the computation of reference regions. Several examples, such as that of [16], illustrate that reference regions may depend on covariates. Tolerance regions that account for covariate information have been available in the literature for a long time. For example, the studies of [5], [6], and [7] have developed procedures to compute what are called β-expectation tolerance regions (which are different from the tolerance regions discussed in this paper) with various assumptions on the error term. It is known, however, that the criterion for the β-expectation tolerance region is equivalent to that of the prediction region. Thus, such regions may be regarded as simply prediction regions, and not the tolerance regions that are intended for multiple use. For the rest of this paper, whenever ‘tolerance region’ is mentioned, we refer to tolerance regions which are intended for multiple use. In Section 2, such regions will be defined as having a content level and a confidence level.
The studies of [9] and [11] address the problem of developing regression-based (i.e. covariate-dependent) tolerance regions under multivariate normality. Moreover, [3] tackles the same problem and adds a central condition to the population contained by the tolerance region. Finally, [10] develops regression-based multivariate tolerance regions based on multivariate conditional transformation models, with no parametric restriction on the response vectors. The main limitation of these regression-based tolerance regions is that they are of ellipsoidal shape, which, as already mentioned, suffers from a lack of interpretability. Thus, this study brings innovation by presenting a solution to compute rectangular regression-based tolerance regions which can be used as MRRs in a multivariate normal setting. The study’s methodology makes use of some of the ideas in [13], which develops rectangular MRRs through the tolerance region criterion but does not consider the situation where covariates may be needed. For a detailed overview of tolerance intervals and regions, see [8].
2. Methodology
We now discuss the construction of regression-based simultaneous tolerance intervals and regression-based multivariate tolerance regions in developing rectangular MRRs, under the assumption of multivariate normality. The various criteria to be used are discussed in detail in Sections 2.1, 2.2, 2.3, and 2.4. To start, we first define the data as follows.
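Let $\mathbf{Y}_1, \dots, \mathbf{Y}_n$ be the n response vectors, each of which is of dimension $p \times 1$. Let the corresponding fixed covariate vectors be $\mathbf{x}_1, \dots, \mathbf{x}_n$, where each $\mathbf{x}_i$ is of dimension $m \times 1$. Our data set consists of the n ordered pairs $(\mathbf{Y}_i, \mathbf{x}_i)$, $i = 1, \dots, n$. Let $\mathbf{Y} = (\mathbf{Y}_1, \dots, \mathbf{Y}_n)$ be a $p \times n$ matrix, $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_n)$ be a constant $m \times n$ matrix of rank $m$, and let $\mathbf{E} = (\mathbf{e}_1, \dots, \mathbf{e}_n)$ be a $p \times n$ matrix of error terms, where $\mathbf{e}_i$, $i = 1, \dots, n$, are independent and identically distributed as $N_p(\mathbf{0}, \boldsymbol{\Sigma})$, and $\boldsymbol{\Sigma}$ is an unknown positive definite covariance matrix. The multivariate regression model is given by:
| $\mathbf{Y} = \mathbf{B}\mathbf{X} + \mathbf{E}$ | (1) |
where $\mathbf{B}$ is a $p \times m$ matrix of coefficients that are unknown. We assume that $n - m \ge p$. Given that $\mathbf{B}$ and $\boldsymbol{\Sigma}$ are both unknown, we estimate these quantities using their least squares estimators as follows:
| $\hat{\mathbf{B}} = \mathbf{Y}\mathbf{X}'(\mathbf{X}\mathbf{X}')^{-1}$ | (2) |
| $\hat{\boldsymbol{\Sigma}} = \frac{1}{n-m}\,\mathbf{Y}\left(\mathbf{I}_n - \mathbf{X}'(\mathbf{X}\mathbf{X}')^{-1}\mathbf{X}\right)\mathbf{Y}'$ | (3) |
where $\mathbf{I}_n$ is the $n \times n$ identity matrix. The distributions of the quantities above can be described through the following distributional results which hold in a multivariate regression setting with normally distributed error terms:
| $\mathrm{vec}(\hat{\mathbf{B}}) \sim N_{pm}\left(\mathrm{vec}(\mathbf{B}),\ (\mathbf{X}\mathbf{X}')^{-1} \otimes \boldsymbol{\Sigma}\right)$ | (4) |
| $(n-m)\hat{\boldsymbol{\Sigma}} \sim W_p(\boldsymbol{\Sigma}, n-m)$ | (5) |
where $W_p(\boldsymbol{\Sigma}, n-m)$ denotes the distribution of a Wishart random matrix with $n-m$ degrees of freedom and covariance matrix $\boldsymbol{\Sigma}$. These distributional results will be useful in the development of the methodologies in this study.
Let $\mathbf{Y}_0$ be a $p \times 1$ response vector representing a future observation independent of $\mathbf{Y}$, and let $\mathbf{x}_0$ be its corresponding covariate vector. We assume $\mathbf{Y}_0 \sim N_p(\mathbf{B}\mathbf{x}_0, \boldsymbol{\Sigma})$. Our goal is to construct an appropriate regression-based rectangular MRR for $\mathbf{Y}_0$. Note that we can describe the relationship between $\mathbf{Y}_0$ and $\mathbf{x}_0$ as:
| $\mathbf{Y}_0 = \mathbf{B}\mathbf{x}_0 + \mathbf{e}_0$ | (6) |
where $\mathbf{e}_0 \sim N_p(\mathbf{0}, \boldsymbol{\Sigma})$ is a vector of error terms which is independent of $\mathbf{E}$. We now introduce some notation. For any vector $\mathbf{a}$, let $(\mathbf{a})_i$ represent its $i$th component, while for any square matrix $\mathbf{M}$, let $(\mathbf{M})_{ii}$ represent its $i$th diagonal element.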
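The least squares quantities above can be sketched numerically as follows. This is a minimal illustration on synthetic data rather than the authors' implementation (the paper reports using R). The function name `fit_multivariate_regression`, the simulated `B_true` and `Sigma_true`, and the divisor $n - m$ in the covariance estimate are assumptions made for the example.

```python
import numpy as np

def fit_multivariate_regression(Y, X):
    """Least squares fit of Y = B X + E.

    Y is p x n (one column per subject) and X is m x n (one column per
    covariate vector), i.e. the column-wise layout assumed in this sketch.
    """
    p, n = Y.shape
    m = X.shape[0]
    XXt_inv = np.linalg.inv(X @ X.T)
    B_hat = Y @ X.T @ XXt_inv                  # p x m coefficient estimate
    residuals = Y - B_hat @ X
    # Assumption: residual cross-products are divided by n - m.
    Sigma_hat = residuals @ residuals.T / (n - m)
    return B_hat, Sigma_hat, XXt_inv

# Synthetic illustration only; these are not the IGF data.
rng = np.random.default_rng(0)
n = 200
X = np.vstack([np.ones(n), rng.normal(size=(2, n))])
B_true = np.array([[1.0, 0.5, -0.2], [0.0, 1.0, 0.3]])
Sigma_true = np.array([[1.0, 0.4], [0.4, 2.0]])
Y = B_true @ X + np.linalg.cholesky(Sigma_true) @ rng.standard_normal((2, n))

B_hat, Sigma_hat, XXt_inv = fit_multivariate_regression(Y, X)
x0 = np.array([1.0, 0.0, 0.0])
# x0'(XX')^{-1} x0 scales the covariance of the fitted mean B_hat x0.
d2 = float(x0 @ XXt_inv @ x0)
```

The scalar `d2` is convenient later because, under the normal model, only the fitted mean at the covariate of interest and the diagonal of the covariance estimate enter a rectangular region.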
2.1. Simultaneous tolerance intervals
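In this section, we discuss the proposed solution to compute regression-based simultaneous tolerance intervals. We want to find a reference region for the response vector $\mathbf{Y}_0$ corresponding to the covariate vector $\mathbf{x}_0$, of the form
| $\left\{\mathbf{y} : (\hat{\mathbf{B}}\mathbf{x}_0)_i - \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}} \le y_i \le (\hat{\mathbf{B}}\mathbf{x}_0)_i + \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}},\ i = 1, \dots, p\right\}$ | (7) |
such that the criterion for simultaneous tolerance intervals having content $\beta$ and confidence level $\gamma$ is satisfied. This criterion is expressed in (8). We want to find the tolerance factor $\kappa$ that satisfies the following probability condition:
| $P_{\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}}\left\{ P_{\mathbf{Y}_0}\left( \lvert (\mathbf{Y}_0)_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i \rvert \le \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}} \mid \hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}} \right) \ge \beta,\ i = 1, \dots, p \right\} = \gamma$ | (8) |
which is equivalent to
| $P_{\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}}\left\{ \min_{1 \le i \le p} P_{\mathbf{Y}_0}\left( \lvert (\mathbf{Y}_0)_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i \rvert \le \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}} \mid \hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}} \right) \ge \beta \right\} = \gamma$ | (9) |
Setting aside the inner probability from equation (9), and noting that $(\mathbf{Y}_0)_i$ follows a $N\left((\mathbf{B}\mathbf{x}_0)_i, (\boldsymbol{\Sigma})_{ii}\right)$ distribution for $i = 1, \dots, p$, we proceed as follows:
$P_{\mathbf{Y}_0}\left( \lvert (\mathbf{Y}_0)_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i \rvert \le \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}} \mid \hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}} \right) = \Phi\left( \frac{(\hat{\mathbf{B}}\mathbf{x}_0)_i - (\mathbf{B}\mathbf{x}_0)_i + \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}}{\sqrt{(\boldsymbol{\Sigma})_{ii}}} \right) - \Phi\left( \frac{(\hat{\mathbf{B}}\mathbf{x}_0)_i - (\mathbf{B}\mathbf{x}_0)_i - \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}}{\sqrt{(\boldsymbol{\Sigma})_{ii}}} \right)$
where $\Phi$ denotes the cumulative distribution function of the standard normal distribution.
Thus, we want to find a tolerance factor $\kappa$ that satisfies the following constraint:
| $P_{\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}}\left\{ \min_{1 \le i \le p} \left[ \Phi\left( \frac{(\hat{\mathbf{B}}\mathbf{x}_0)_i - (\mathbf{B}\mathbf{x}_0)_i + \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}}{\sqrt{(\boldsymbol{\Sigma})_{ii}}} \right) - \Phi\left( \frac{(\hat{\mathbf{B}}\mathbf{x}_0)_i - (\mathbf{B}\mathbf{x}_0)_i - \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}}{\sqrt{(\boldsymbol{\Sigma})_{ii}}} \right) \right] \ge \beta \right\} = \gamma$ | (10) |
To estimate the unknown value of $\kappa$ satisfying (10), a parametric bootstrap procedure is employed, with equation (10) as its basis. Algorithm 1 shows the proposed procedure to estimate $\kappa$, and thus obtain the covariate-dependent simultaneous tolerance intervals.
| Algorithm 1: Parametric bootstrap procedure to estimate the factor $\kappa$ required to obtain the simultaneous tolerance intervals in a regression setting |
| 1. From the data, compute $\hat{\mathbf{B}}$ and $\hat{\boldsymbol{\Sigma}}$ using (2) and (3), respectively. |
| 2. For $j = 1, \dots, M$, where $M$ is some large number, |
| i. Generate $\hat{\mathbf{B}}^*_j$ and $\hat{\boldsymbol{\Sigma}}^*_j$ from the plug-in distributions $\mathrm{vec}(\hat{\mathbf{B}}^*_j) \sim N_{pm}\left(\mathrm{vec}(\hat{\mathbf{B}}), (\mathbf{X}\mathbf{X}')^{-1} \otimes \hat{\boldsymbol{\Sigma}}\right)$, $\mathbf{A}^*_j \sim W_p(\hat{\boldsymbol{\Sigma}}, n-m)$, and $\hat{\boldsymbol{\Sigma}}^*_j = \mathbf{A}^*_j/(n-m)$, respectively. |
| ii. Through a root-finding process, solve for $\kappa_j$ that satisfies |
| $\min_{1 \le i \le p} \left[ \Phi\left( \frac{(\hat{\mathbf{B}}^*_j\mathbf{x}_0)_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i + \kappa_j\sqrt{(\hat{\boldsymbol{\Sigma}}^*_j)_{ii}}}{\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}} \right) - \Phi\left( \frac{(\hat{\mathbf{B}}^*_j\mathbf{x}_0)_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i - \kappa_j\sqrt{(\hat{\boldsymbol{\Sigma}}^*_j)_{ii}}}{\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}} \right) \right] = \beta$ |
| where $(\hat{\boldsymbol{\Sigma}}^*_j)_{ii}$ and $(\hat{\boldsymbol{\Sigma}})_{ii}$ denote the $i$th diagonal element of $\hat{\boldsymbol{\Sigma}}^*_j$ and $\hat{\boldsymbol{\Sigma}}$, respectively; and $(\hat{\mathbf{B}}^*_j\mathbf{x}_0)_i$ and $(\hat{\mathbf{B}}\mathbf{x}_0)_i$ denote the $i$th component of $\hat{\mathbf{B}}^*_j\mathbf{x}_0$ and $\hat{\mathbf{B}}\mathbf{x}_0$, respectively. |
| 3. Take the $\gamma$-quantile of $\kappa_1, \dots, \kappa_M$ and denote this by $\hat{\kappa}$. The $\beta$-content, $\gamma$-confidence simultaneous tolerance intervals are given by $(\hat{\mathbf{B}}\mathbf{x}_0)_i \pm \hat{\kappa}\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}$, $i = 1, \dots, p$. |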
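The bootstrap in Algorithm 1 can be sketched as below. This is a hedged illustration, not the authors' code. It assumes that the region has half-widths equal to the factor times the estimated marginal standard deviations, and it uses a simplification introduced here: because only the fitted mean at the covariate of interest enters the criterion, the bootstrap deviation of that fitted mean is drawn directly as a normal vector with covariance `d2 * Sigma_hat`, where `d2` stands for the quadratic form of the covariate vector in the inverse of XX'. The helper names, the bisection root-finder, and the default `n_boot` are all hypothetical.

```python
import math
import numpy as np

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def min_marginal_content(kappa, center, scale, mu, sd):
    """Smallest marginal content of the intervals center_i +/- kappa * scale_i
    when component i is N(mu_i, sd_i^2)."""
    return min(
        phi((c + kappa * s - u) / t) - phi((c - kappa * s - u) / t)
        for c, s, u, t in zip(center, scale, mu, sd)
    )

def solve_kappa(center, scale, mu, sd, beta, upper=100.0, tol=1e-10):
    """Bisection: the minimum content is increasing in kappa."""
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if min_marginal_content(mid, center, scale, mu, sd) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simultaneous_tolerance_factor(Sigma_hat, d2, n, m, beta, gamma,
                                  n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    p = Sigma_hat.shape[0]
    L = np.linalg.cholesky(Sigma_hat)
    sd = np.sqrt(np.diag(Sigma_hat))
    mu = np.zeros(p)  # B_hat x0 taken as the origin; only differences matter
    kappas = np.empty(n_boot)
    for j in range(n_boot):
        # Bootstrap deviation of the fitted mean: N(0, d2 * Sigma_hat).
        center = math.sqrt(d2) * (L @ rng.standard_normal(p))
        # Bootstrap covariance estimate: Wishart(Sigma_hat, n - m) / (n - m).
        W = L @ rng.standard_normal((p, n - m))
        scale = np.sqrt(np.diag(W @ W.T) / (n - m))
        kappas[j] = solve_kappa(center, scale, mu, sd, beta)
    return float(np.quantile(kappas, gamma))
```

With a single analyte, no estimation error in the mean, and a known scale, `solve_kappa` reduces to the usual two-sided normal quantile, which gives a quick sanity check of the root-finder.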
2.2. Simultaneous central tolerance intervals
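Now, suppose we want to find regression-based simultaneous central tolerance intervals for the response vector $\mathbf{Y}_0$ corresponding to a covariate vector $\mathbf{x}_0$, such that $\mathbf{Y}_0 \sim N_p(\mathbf{B}\mathbf{x}_0, \boldsymbol{\Sigma})$. In other words, we derive simultaneous tolerance intervals such that each component tolerance interval is a central tolerance interval (i.e. it contains a specified proportion of the central part of the marginal distribution). Our goal now is to find a region of the form of (7) such that the tolerance factor $\kappa$ is computed subject to the condition in (11) for simultaneous central tolerance intervals. In (11) we have used the fact that each $(\mathbf{Y}_0)_i \sim N\left((\mathbf{B}\mathbf{x}_0)_i, (\boldsymbol{\Sigma})_{ii}\right)$.
| $P_{\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}}\left\{ (\hat{\mathbf{B}}\mathbf{x}_0)_i - \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}} \le (\mathbf{B}\mathbf{x}_0)_i - z_{(1+\beta)/2}\sqrt{(\boldsymbol{\Sigma})_{ii}} \ \text{and}\ (\mathbf{B}\mathbf{x}_0)_i + z_{(1+\beta)/2}\sqrt{(\boldsymbol{\Sigma})_{ii}} \le (\hat{\mathbf{B}}\mathbf{x}_0)_i + \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}},\ i = 1, \dots, p \right\} = \gamma$ | (11) |
where $z_{(1+\beta)/2}$ denotes the $100(1+\beta)/2$th percentile of the standard normal distribution, and $\kappa$ is the unknown simultaneous central tolerance factor. Note that (11) ensures that, with probability $\gamma$, the central $100\beta$% of each marginal distribution is contained by the region of the form of (7).
Observe that equation (11) is equivalent to
| $P_{\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}}\left\{ \max_{1 \le i \le p} \frac{\lvert (\hat{\mathbf{B}}\mathbf{x}_0)_i - (\mathbf{B}\mathbf{x}_0)_i \rvert + z_{(1+\beta)/2}\sqrt{(\boldsymbol{\Sigma})_{ii}}}{\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}} \le \kappa \right\} = \gamma$ | (12) |
which expresses $\kappa$ as a quantile and gives us a basis for a parametric bootstrap procedure to estimate $\kappa$. The procedure is described in Algorithm 2.
| Algorithm 2: Parametric bootstrap procedure for estimating the factor $\kappa$ required to obtain the simultaneous central tolerance intervals in a regression setting |
| 1. From the data, compute $\hat{\mathbf{B}}$ and $\hat{\boldsymbol{\Sigma}}$ using (2) and (3), respectively. |
| 2. For $j = 1, \dots, M$, where $M$ is some large number, |
| i. Generate $\hat{\mathbf{B}}^*_j$ and $\hat{\boldsymbol{\Sigma}}^*_j$ from: $\mathrm{vec}(\hat{\mathbf{B}}^*_j) \sim N_{pm}\left(\mathrm{vec}(\hat{\mathbf{B}}), (\mathbf{X}\mathbf{X}')^{-1} \otimes \hat{\boldsymbol{\Sigma}}\right)$, $\mathbf{A}^*_j \sim W_p(\hat{\boldsymbol{\Sigma}}, n-m)$, and $\hat{\boldsymbol{\Sigma}}^*_j = \mathbf{A}^*_j/(n-m)$, respectively. |
| ii. Compute the following: $\kappa_j = \max_{1 \le i \le p} \frac{\lvert (\hat{\mathbf{B}}^*_j\mathbf{x}_0)_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i \rvert + z_{(1+\beta)/2}\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}}{\sqrt{(\hat{\boldsymbol{\Sigma}}^*_j)_{ii}}}$ |
| 3. Take the $\gamma$-quantile of $\kappa_1, \dots, \kappa_M$. We will denote this by $\hat{\kappa}$. The $\beta$-content, $\gamma$-confidence simultaneous central tolerance intervals are given by $(\hat{\mathbf{B}}\mathbf{x}_0)_i \pm \hat{\kappa}\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}$, $i = 1, \dots, p$. |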
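Because the central criterion expresses the factor as a quantile, Algorithm 2 needs no root-finding. The sketch below illustrates this under stated assumptions: the bootstrap statistic is taken to be the largest, over analytes, of the absolute bootstrap deviation of the fitted mean plus the normal quantile times the estimated standard deviation, divided by the bootstrap standard deviation. As before, `d2` (the quadratic form of the covariate vector in the inverse of XX') and the function name are illustrative, not taken from the paper.

```python
import numpy as np
from statistics import NormalDist

def simultaneous_central_factor(Sigma_hat, d2, n, m, beta, gamma,
                                n_boot=5000, seed=0):
    rng = np.random.default_rng(seed)
    p = Sigma_hat.shape[0]
    L = np.linalg.cholesky(Sigma_hat)
    sd = np.sqrt(np.diag(Sigma_hat))
    # Two-sided normal quantile for central content beta.
    z = NormalDist().inv_cdf((1.0 + beta) / 2.0)
    stats = np.empty(n_boot)
    for j in range(n_boot):
        # Deviation of the bootstrap fitted mean from B_hat x0.
        dev = np.sqrt(d2) * (L @ rng.standard_normal(p))
        # Bootstrap marginal standard deviations from a Wishart draw.
        W = L @ rng.standard_normal((p, n - m))
        scale = np.sqrt(np.diag(W @ W.T) / (n - m))
        stats[j] = np.max((np.abs(dev) + z * sd) / scale)
    return float(np.quantile(stats, gamma))
```

The statistic is a ratio of quantities that scale together, so rescaling the covariance estimate leaves the factor unchanged; this is a property of the sketch that is easy to verify numerically.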
2.3. Rectangular multivariate tolerance regions
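In this section, we describe the solution to compute a rectangular multivariate tolerance region for $\mathbf{Y}_0$ corresponding to the known covariate vector $\mathbf{x}_0$. The region is still of the form in (7), but this time $\kappa$ should meet the condition for a multivariate tolerance region. That is, with probability $\gamma$, the region should contain a proportion of at least $\beta$ of the joint distribution of $\mathbf{Y}_0$. This condition is described in (13):
| $P_{\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}}\left\{ P_{\mathbf{Y}_0}\left( \lvert (\mathbf{Y}_0)_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i \rvert \le \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}},\ i = 1, \dots, p \mid \hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}} \right) \ge \beta \right\} = \gamma$ | (13) |
which is equivalent to
| $P_{\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}}\left\{ P_{\mathbf{Y}_0}\left( \max_{1 \le i \le p} \frac{\lvert (\mathbf{Y}_0)_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i \rvert}{\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}} \le \kappa \mid \hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}} \right) \ge \beta \right\} = \gamma$ | (14) |
The condition in (13) is analogous to (8), but we have now adapted it to the joint distribution of $\mathbf{Y}_0$. To estimate the true value of $\kappa$, a parametric bootstrap procedure is employed, with equation (14) as our basis. This procedure is given in Algorithm 3.
| Algorithm 3: Parametric bootstrap procedure for estimating the factor $\kappa$ required to obtain a multivariate tolerance region in a regression setting |
| 1. From the data, compute $\hat{\mathbf{B}}$ and $\hat{\boldsymbol{\Sigma}}$ using (2) and (3), respectively. |
| 2. For $j = 1, \dots, M$, where $M$ is some large number, |
| i. Generate $\hat{\mathbf{B}}^*_j$ and $\hat{\boldsymbol{\Sigma}}^*_j$ from: $\mathrm{vec}(\hat{\mathbf{B}}^*_j) \sim N_{pm}\left(\mathrm{vec}(\hat{\mathbf{B}}), (\mathbf{X}\mathbf{X}')^{-1} \otimes \hat{\boldsymbol{\Sigma}}\right)$, $\mathbf{A}^*_j \sim W_p(\hat{\boldsymbol{\Sigma}}, n-m)$, and $\hat{\boldsymbol{\Sigma}}^*_j = \mathbf{A}^*_j/(n-m)$, respectively. |
| ii. Through a root-finding process, solve for $\kappa_j$ that satisfies $P_{\mathbf{Y}^*}\left( \lvert (\mathbf{Y}^*)_i - (\hat{\mathbf{B}}^*_j\mathbf{x}_0)_i \rvert \le \kappa_j\sqrt{(\hat{\boldsymbol{\Sigma}}^*_j)_{ii}},\ i = 1, \dots, p \right) = \beta$ |
| where $\mathbf{Y}^* \sim N_p(\hat{\mathbf{B}}\mathbf{x}_0, \hat{\boldsymbol{\Sigma}})$. Thus, the probability on the left-hand side is taken with respect to the $N_p(\hat{\mathbf{B}}\mathbf{x}_0, \hat{\boldsymbol{\Sigma}})$ distribution, which is the plug-in distribution. |
| 3. Take the $\gamma$-quantile of $\kappa_1, \dots, \kappa_M$. We will denote this by $\hat{\kappa}$. The $\beta$-content, $\gamma$-confidence multivariate tolerance region is given by $\left\{\mathbf{y} : \lvert y_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i \rvert \le \hat{\kappa}\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}},\ i = 1, \dots, p\right\}$. |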
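A rough sketch of Algorithm 3 follows. It deliberately swaps in a different numerical device from the one the paper describes: instead of root-finding on an exact multivariate normal rectangle probability, it uses the fact that the required root is the content-level quantile of the largest standardized absolute deviation, and estimates that quantile by Monte Carlo from a fixed set of plug-in draws. The function name, `n_mc`, `n_boot`, and the scalar `d2` (the quadratic form of the covariate vector in the inverse of XX') are assumptions of this example.

```python
import numpy as np

def rectangular_tolerance_factor(Sigma_hat, d2, n, m, beta, gamma,
                                 n_boot=1000, n_mc=4000, seed=0):
    rng = np.random.default_rng(seed)
    p = Sigma_hat.shape[0]
    L = np.linalg.cholesky(Sigma_hat)
    # Plug-in draws Y* ~ N(B_hat x0, Sigma_hat), with B_hat x0 as the origin.
    Y_star = (L @ rng.standard_normal((p, n_mc))).T      # n_mc x p
    kappas = np.empty(n_boot)
    for j in range(n_boot):
        # Bootstrap centre (deviation from B_hat x0) and bootstrap scales.
        center = np.sqrt(d2) * (L @ rng.standard_normal(p))
        W = L @ rng.standard_normal((p, n - m))
        scale = np.sqrt(np.diag(W @ W.T) / (n - m))
        # Smallest kappa whose rectangle holds a fraction beta of the draws.
        T = np.max(np.abs(Y_star - center) / scale, axis=1)
        kappas[j] = np.quantile(T, beta)
    return float(np.quantile(kappas, gamma))
```

Reusing the same plug-in draws across bootstrap replicates keeps the cost low, at the price of Monte Carlo error in each replicate's root; an exact rectangle-probability routine would remove that error.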
2.4. Rectangular multivariate central tolerance regions
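In Section 2.3, we described how to obtain a rectangular multivariate tolerance region for $\mathbf{Y}_0$. We now turn to the computation of a rectangular multivariate central tolerance region for the future observation $\mathbf{Y}_0$ corresponding to a specific covariate value $\mathbf{x}_0$, where $\mathbf{Y}_0 \sim N_p(\mathbf{B}\mathbf{x}_0, \boldsymbol{\Sigma})$. As before, let the data consist of the observations $(\mathbf{Y}_i, \mathbf{x}_i)$, $i = 1, \dots, n$. We first define what is meant by the rectangular central part of the distribution of $\mathbf{Y}_0$. The rectangular multivariate central part of the distribution of $\mathbf{Y}_0$ is given by
$C(\mathbf{x}_0) = \left\{\mathbf{y} : (\mathbf{B}\mathbf{x}_0)_i - \delta\sqrt{(\boldsymbol{\Sigma})_{ii}} \le y_i \le (\mathbf{B}\mathbf{x}_0)_i + \delta\sqrt{(\boldsymbol{\Sigma})_{ii}},\ i = 1, \dots, p\right\}$
where $\delta$ satisfies
| $P_{\mathbf{Y}_0}\left(\mathbf{Y}_0 \in C(\mathbf{x}_0)\right) = \beta$ | (15) |
Note that we have represented the region as $C(\mathbf{x}_0)$ to reflect the fact that it depends on the known covariate vector $\mathbf{x}_0$. The quantity $\delta$ is unknown and is a function of the cross-correlations of the components of $\mathbf{Y}_0$. The rectangular multivariate central tolerance region in the regression setting is of the form given by $T(\mathbf{x}_0)$ below. We write the tolerance region as $T(\mathbf{x}_0)$ because it also depends on the known covariate vector $\mathbf{x}_0$.
$T(\mathbf{x}_0) = \left\{\mathbf{y} : (\hat{\mathbf{B}}\mathbf{x}_0)_i - \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}} \le y_i \le (\hat{\mathbf{B}}\mathbf{x}_0)_i + \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}},\ i = 1, \dots, p\right\}$
The tolerance factor $\kappa$ is obtained such that the condition in (16) is satisfied:
| $P_{\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}}\left\{ C(\mathbf{x}_0) \subseteq T(\mathbf{x}_0) \right\} = \gamma$ | (16) |
To find $\kappa$, we estimate the unknown parameter $\delta$ through a parametric bootstrap. Note that (16) can be written as
| $P_{\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}}\left\{ (\hat{\mathbf{B}}\mathbf{x}_0)_i - \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}} \le (\mathbf{B}\mathbf{x}_0)_i - \delta\sqrt{(\boldsymbol{\Sigma})_{ii}} \ \text{and}\ (\mathbf{B}\mathbf{x}_0)_i + \delta\sqrt{(\boldsymbol{\Sigma})_{ii}} \le (\hat{\mathbf{B}}\mathbf{x}_0)_i + \kappa\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}},\ i = 1, \dots, p \right\} = \gamma$ | (17) |
which is equivalent to
| $P_{\hat{\mathbf{B}}, \hat{\boldsymbol{\Sigma}}}\left\{ \max_{1 \le i \le p} \frac{\lvert (\hat{\mathbf{B}}\mathbf{x}_0)_i - (\mathbf{B}\mathbf{x}_0)_i \rvert + \delta\sqrt{(\boldsymbol{\Sigma})_{ii}}}{\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}} \le \kappa \right\} = \gamma$ | (18) |
Equation (18), which expresses the multivariate central tolerance factor as a quantile, gives us a basis for the parametric bootstrap procedure to estimate $\kappa$. The procedure is shown in Algorithm 4. Since $\delta$ is also an unknown quantity, it is estimated in step 2 of Algorithm 4 from the plug-in distribution of $\mathbf{Y}_0$.
| Algorithm 4: Parametric bootstrap procedure for estimating the factor $\kappa$ required to obtain the multivariate central tolerance region in a regression setting |
| 1. From the data, compute $\hat{\mathbf{B}}$ and $\hat{\boldsymbol{\Sigma}}$ using (2) and (3), respectively. |
| 2. Through a root-finding process, solve for $\hat{\delta}$ that satisfies $P_{\mathbf{Y}^*}\left( \lvert (\mathbf{Y}^*)_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i \rvert \le \hat{\delta}\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}},\ i = 1, \dots, p \right) = \beta$ |
| where $\mathbf{Y}^* \sim N_p(\hat{\mathbf{B}}\mathbf{x}_0, \hat{\boldsymbol{\Sigma}})$. Thus, the probability on the left-hand side is taken with respect to the $N_p(\hat{\mathbf{B}}\mathbf{x}_0, \hat{\boldsymbol{\Sigma}})$ distribution, which is the plug-in distribution. |
| 3. For $j = 1, \dots, M$, where $M$ is some large number, |
| i. Generate $\hat{\mathbf{B}}^*_j$ and $\hat{\boldsymbol{\Sigma}}^*_j$ from: $\mathrm{vec}(\hat{\mathbf{B}}^*_j) \sim N_{pm}\left(\mathrm{vec}(\hat{\mathbf{B}}), (\mathbf{X}\mathbf{X}')^{-1} \otimes \hat{\boldsymbol{\Sigma}}\right)$, $\mathbf{A}^*_j \sim W_p(\hat{\boldsymbol{\Sigma}}, n-m)$, and $\hat{\boldsymbol{\Sigma}}^*_j = \mathbf{A}^*_j/(n-m)$, respectively. |
| ii. Compute the following: $\kappa_j = \max_{1 \le i \le p} \frac{\lvert (\hat{\mathbf{B}}^*_j\mathbf{x}_0)_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i \rvert + \hat{\delta}\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}}{\sqrt{(\hat{\boldsymbol{\Sigma}}^*_j)_{ii}}}$ |
| where $\hat{\delta}$ is the solution obtained in Step 2. |
| 4. Take the $\gamma$-quantile of $\kappa_1, \dots, \kappa_M$. We will denote this by $\hat{\kappa}$. The $\beta$-content, $\gamma$-confidence multivariate central tolerance region is given by $\left\{\mathbf{y} : \lvert y_i - (\hat{\mathbf{B}}\mathbf{x}_0)_i \rvert \le \hat{\kappa}\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}},\ i = 1, \dots, p\right\}$. |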
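The two stages of Algorithm 4 can be sketched as follows, again as a hedged illustration rather than the authors' implementation. The plug-in constant defining the rectangular central part is estimated here by a Monte Carlo quantile (a substitute for root-finding on an exact multivariate normal probability), and the bootstrap statistic then has the same closed form as in the simultaneous central case, with that constant replacing the normal quantile. The names `central_region_factor`, `delta_hat`, `n_mc`, and `d2` are assumptions of this example.

```python
import numpy as np

def central_region_factor(Sigma_hat, d2, n, m, beta, gamma,
                          n_boot=2000, n_mc=20000, seed=0):
    rng = np.random.default_rng(seed)
    p = Sigma_hat.shape[0]
    L = np.linalg.cholesky(Sigma_hat)
    sd = np.sqrt(np.diag(Sigma_hat))
    # Stage 1: delta_hat is the beta-quantile of max_i |Y*_i - mean_i| / sd_i
    # under the plug-in distribution (mean taken as the origin).
    Y_star = (L @ rng.standard_normal((p, n_mc))).T
    delta_hat = float(np.quantile(np.max(np.abs(Y_star) / sd, axis=1), beta))
    # Stage 2: bootstrap the quantile form of the central criterion.
    stats = np.empty(n_boot)
    for j in range(n_boot):
        dev = np.sqrt(d2) * (L @ rng.standard_normal(p))
        W = L @ rng.standard_normal((p, n - m))
        scale = np.sqrt(np.diag(W @ W.T) / (n - m))
        stats[j] = np.max((np.abs(dev) + delta_hat * sd) / scale)
    return float(np.quantile(stats, gamma)), delta_hat
```

For a single analyte the central constant should be close to the two-sided normal quantile, and the resulting factor should exceed it, since the factor also has to absorb the estimation error in the mean and the variance.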
2.5. Performance evaluation
To evaluate the performance and accuracy of the proposed algorithms in Sections 2.1, 2.2, 2.3, and 2.4, we shall estimate the associated coverage probabilities, expected tolerance factors, and expected volumes. For a rectangular region of the form in (7), the volume is given by:
| $V = \prod_{i=1}^{p} 2\hat{\kappa}\sqrt{(\hat{\boldsymbol{\Sigma}})_{ii}}$ | (19) |
where $\hat{\kappa}$ is the estimated tolerance factor. The expected volume is estimated by taking the mean of the computed volumes of the simulated samples. We point out that the expected volume of a region is the multivariate counterpart of the expected length of an interval. Algorithms 1a, 2a, 3a, and 4a describe the process to estimate the associated coverage probabilities for the proposed procedures. Specifically, Algorithms 1a, 2a, 3a, and 4a provide Monte Carlo estimates of the probabilities in Equations (10), (12), (14), and (18), respectively, whenever the tolerance factor is estimated through the corresponding proposed procedure.
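As a small worked illustration of the volume of a rectangular region, the sketch below assumes that each side of the rectangle has length twice the estimated tolerance factor times the estimated marginal standard deviation, so the volume is the product of the side lengths. The function name is hypothetical.

```python
import numpy as np

def rectangular_volume(kappa_hat, Sigma_hat):
    """Volume of the rectangle with half-widths kappa_hat * sqrt(diag(Sigma_hat))."""
    side_lengths = 2.0 * kappa_hat * np.sqrt(np.diag(Sigma_hat))
    return float(np.prod(side_lengths))

# Two analytes with variances 1 and 4 and a factor of 2: sides 4 and 8.
volume = rectangular_volume(2.0, np.diag([1.0, 4.0]))
```

Because the factor enters every side, the volume grows with the factor raised to the number of analytes, which is why modest differences in tolerance factors translate into larger differences in volume.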
| Algorithm 1a: Computing the estimated coverage probability associated with the proposed procedure to compute simultaneous tolerance intervals through Algorithm 1 |
| For given values of , , , , and in the simulation settings, |
| 1. Generate = as , where s are independently drawn. |
| 2. Use the sample generated from step 1 to compute , , and the factor via Algorithm 1. |
| 3. Compute the following: |
| 4. Repeat steps 1 to 3 for times. Our estimate for the coverage probability is the proportion of times for which . We shall use . Also compute the average of and the average of the volumes across the simulated samples. |
| Algorithm 2a: Computing the estimated coverage probability associated with the proposed procedure to compute simultaneous central tolerance intervals through Algorithm 2 |
| For given values of , , , , and in the simulation settings, |
| 1. Generate = as , where s are independently drawn. |
| 2. Use the sample generated from step 1 to compute , , and the factor via Algorithm 2. |
| 3. Using , compute the following: |
| 4. Repeat steps 1 to 3 for times. Our estimate for the coverage probability is the proportion of times for which . Also compute the average of and the average of the volumes across the simulated samples. |
| Algorithm 3a: Computing the estimated coverage probability associated with the proposed procedure to compute a rectangular multivariate tolerance region through Algorithm 3 |
| For given values of , , , , and in the simulation settings, |
| 1. Generate = as , where s are independently drawn. |
| 2. Use the sample generated from step 1 to compute , , and the factor via Algorithm 3. |
| 3. Compute the following: |
| where the probability is taken with respect to the distribution. |
| 4. Repeat steps 1 to 3 for times. Our estimate for the coverage probability is the proportion of times for which . We shall use . Also compute the average of and the average of the volumes across the simulated samples. |
| Algorithm 4a: Computing the estimated coverage probability associated with the proposed procedure to compute a rectangular multivariate central tolerance region through Algorithm 4 |
| For given values of , , , , and in the simulation settings, |
| 1. Generate = as , where s are independently drawn. |
| 2. Use the sample generated from step 1 to compute , , and the factor via Algorithm 4. |
| 3. Compute the following: |
| where is obtained by finding the root of the equation: |
| The probability on the left side is taken with respect to the distribution. We shall use . |
| 4. Repeat steps 1 to 3 for times. Our estimate for the coverage probability is the proportion of times for which . Also compute the average of and the average of the volumes across the simulated samples. |
2.6. Comparisons with other methodologies
We shall be comparing some of the proposed methodologies to compute multivariate tolerance regions with corresponding Bonferroni approaches. The comparisons will be based on coverage probabilities, expected tolerance factors, and the expected volumes obtained from both the proposed and Bonferroni approaches. Subsections 2.6.1 and 2.6.2 describe how to apply the Bonferroni correction for the multivariate tolerance region and the multivariate central tolerance region, respectively.
We shall also illustrate what happens to the coverage probability if the rectangular prediction region approach is used as the reference region (i.e. when it is used repeatedly). Thus, we shall perform numerical simulations to evaluate the performance of a prediction region from the perspective of multivariate tolerance regions. The motivation behind such a comparison is that the prediction region fails to account for the repeated use aspect. The procedure to be used in computing the prediction region is that of [14]. The comparisons with other methodologies will only be done for the multivariate tolerance region and the multivariate central tolerance region (i.e. the methodologies described in Sections 2.3 and 2.4), because these appear to be the standard tolerance-based criteria used in computing MRRs (see for instance [2] and [3]).
2.6.1. Bonferroni-type multivariate tolerance region
To compute the Bonferroni-type multivariate tolerance region, we first introduce the Wallis’ Approximation. As discussed in [8], Wallis’ Approximation for the tolerance factor needed to obtain a -content, -confidence two-sided tolerance interval under a univariate linear regression model is given by
| (20) |
where denotes the -quantile of a non-central chi-square distribution with degree of freedom, and non-centrality parameter ; and denotes the -quantile of a central chi-square distribution with degrees of freedom, where is the sample size and , as before, is the dimension of the covariate vectors. This approximation is used when the error terms are assumed to follow a univariate normal distribution [8].
We now discuss the Bonferroni approach to compute the multivariate tolerance region of content and confidence level . As explained in [23], the Bonferroni approach when there are analytes may be implemented by taking the univariate two-sided tolerance factor corresponding to a content of and confidence level . Thus, we can obtain the tolerance factor to compute the Bonferroni multivariate tolerance region using the following quantity:
| (21) |
The Bonferroni multivariate tolerance region is now given by:
where denotes the quantity given in (21). As with the previous approaches, we will compare the magnitude of the expected tolerance factor of our parametric bootstrap approach for a multivariate tolerance region with that of in (21); the expected volumes and coverage probabilities obtained using the parametric bootstrap approach will also be compared with the Bonferroni approach. To compute the coverage probabilities and expected volumes under the Bonferroni approach, we shall use Algorithm 3a, taking note that will be used in place of .
2.6.2. Bonferroni-type multivariate central tolerance region
To compute the Bonferroni-type multivariate central tolerance region, we first revisit the following quantity from (12), which was obtained through a central condition:
When , is now a scalar and the factor simplifies to:
| (22) |
where . Note that we have written in (22) in plain (not bold) text to reflect the fact that it is a scalar. Thus,
| (23) |
From (23), we can characterize the distribution of by multiplying the quantity with a standard normal random variable. That is,
where and the notation means equality in distribution. Moreover, it can be shown that the distribution of the square-root of the diagonal components of can be characterized as:
where denotes a chi-square random variable with degrees of freedom. Thus, taking note of the form of , and using the fact that the Bonferroni approach corresponds to using the univariate two-sided tolerance factor corresponding to a content of and confidence level , we specify the Bonferroni factor to be based on the -quantile of the following quantity:
| (24) |
where and are independent of each other since is independent of . To obtain the factor, we generate 10,000 variates from the and distributions, and then plug these into (24). The required Bonferroni factor is the -quantile of the resulting quantities. We shall call this factor as . The Bonferroni-corrected multivariate central tolerance region is now given by:
Once again, we will be comparing the magnitude of the expected tolerance factor of our parametric bootstrap approach with that of . Moreover, the expected volumes and coverage probabilities obtained using the parametric bootstrap approach will be compared with that of the Bonferroni approach. To compute the expected volumes and coverage probabilities under the Bonferroni approach, we shall use Algorithm 4a, taking note that will be used instead of .
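The simulation that yields the Bonferroni factor can be sketched as below. This is a hedged reading of the construction, not the authors' code: it assumes the simulated quantity has the form of a scaled absolute standard normal variate plus a normal quantile, divided by the square root of a chi-square variate over its degrees of freedom, with the two variates independent. The Bonferroni-adjusted content and confidence levels are left as arguments (`beta_adj`, `gamma_adj`) rather than hard-coded, and `d2` again denotes the quadratic form of the covariate vector in the inverse of XX'. All names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def bonferroni_central_factor(d2, n, m, beta_adj, gamma_adj,
                              n_draws=10000, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: the quantile is taken at (1 + beta_adj) / 2, as for a
    # central interval of content beta_adj.
    z = NormalDist().inv_cdf((1.0 + beta_adj) / 2.0)
    Z = rng.standard_normal(n_draws)            # drives the error in the fitted mean
    U = rng.chisquare(n - m, size=n_draws)      # drives the error in the variance
    stat = (np.sqrt(d2) * np.abs(Z) + z) / np.sqrt(U / (n - m))
    return float(np.quantile(stat, gamma_adj))
```

When the mean is known exactly and the degrees of freedom are very large, the simulated quantity collapses to the normal quantile, which provides a simple check on the sketch.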
3. Results
3.1. Numerical results
In this section, we test the performance of the proposed methodologies. The covariate values to be used are from the study of [16]. The goal of their study is to compute MRRs for serum concentrations of insulin-like growth factor I (IGF-I), insulin-like growth factor-binding protein 2 (IGFBP-2), and insulin-like growth factor-binding protein 3 (IGFBP-3), which are three components of the insulin-like growth factor (IGF) system in adults. An MRR is constructed to take into consideration the possible relationships between these components [16]. Serum concentrations of the components of the IGF system can be affected by covariates such as age, sex, and body mass index (BMI). Consequently, [16] compute reference regions from a multivariate regression setting with these covariates. The subjects are 427 non-fasting German blood donors, of whom 117 are women and 310 are men.
The covariate vector is given in (25),
| (25) |
where values for age and BMI are centered, and the variable sex is coded as 1 for females and 0 for males. The response vector containing serum measurements is given by:
| (26) |
where we see that the actual serum measurements are transformed. Based on the multivariate regression model in (1), below are the least squares estimates of and :
| (27) |
| (28) |
To evaluate the performance of the proposed procedures, we shall assume that the data will be coming from a multivariate normal distribution, setting (27) as the true value for B, and (28) as the true value for . Also, let X be the covariate matrix whose columns are vectors of the form (25) with values taken from the study of [16]. These values will be used to execute Algorithms 1a, 2a, 3a, and 4a, wherein the covariate will be based on three settings: (1) mean age and mean BMI, (2) minimum age and minimum BMI, and (3) maximum age and maximum BMI. The setting for the covariate sex will be constant at 1 (female) for cases involving mean age and mean BMI, and maximum age and maximum BMI. For cases involving minimum age and minimum BMI, the covariate sex will be set to 0 (male). Three sample sizes are considered for this performance evaluation, namely n = 50, 100, 427, where the first n observations are used in the simulations (for instance, for n = 50, we use the first 50 observations). A summary of the covariate values is shown in Table 1. In the numerical simulations, Algorithms 1a and 2a are implemented with , while Algorithms 3a and 4a are implemented with . The R package MASS by [19] is used in generating random multivariate normal variates, the package MCMCpack by [15] is used to generate random Wishart matrices, the package matrixcalc by [18] is used in implementing the ‘vec’ operator, and the package mvtnorm by [4] is used in computing multivariate normal probabilities.
Table 1.
Covariate values for the covariate vector $\mathbf{x}_0$ used to estimate the coverage probability.
| n | Covariate | Mean | Minimum | Maximum |
|---|---|---|---|---|
| 427 | Age | −0.0773 | −27 | 34 |
| 427 | Age² | 0.0060 | 729 | 1156 |
| 427 | Sex | 1 | 0 | 1 |
| 427 | BMI | 0.2804 | −9.2209 | 17.1875 |
| 100 | Age | 2.43 | −25 | 22 |
| 100 | Age² | 5.9049 | 625 | 484 |
| 100 | Sex | 1 | 0 | 1 |
| 100 | BMI | 0.7309 | −5.2945 | 12.1332 |
| 50 | Age | 4.40 | −22 | 22 |
| 50 | Age² | 19.36 | 484 | 484 |
| 50 | Sex | 1 | 0 | 1 |
| 50 | BMI | 1.1212 | −4.6875 | 12.1332 |
Tables 2–5 show the numerical results associated with Algorithms 1a, 2a, 3a, and 4a, respectively. Recall that these algorithms evaluate the performance of the proposed procedures to compute the simultaneous tolerance intervals (Algorithm 1), simultaneous central tolerance intervals (Algorithm 2), multivariate tolerance region (Algorithm 3), and multivariate central tolerance region (Algorithm 4) in a regression setting. The results show that the estimated coverage probabilities for all four proposed procedures are near the 95% nominal coverage, regardless of which setting is used for the covariates, and even for a sample size of 50. The averages of the estimated tolerance factors and of the volumes, which estimate the expected tolerance factor and the expected volume, are also shown in the tables.
Table 2.
Estimated coverage probabilities (CP), expected tolerance factors (Expected ), and expected volumes (EV) of the proposed methodology to compute regression-based simultaneous tolerance intervals for a normal distribution.
| Covariate values evaluated at the | ||||
|---|---|---|---|---|
| Mean | Minimum | Maximum | ||
| CP | 0.9472 | 0.9464 | 0.9474 | |
| Expected | 1.787044 | 1.839362 | 2.035352 | |
| EV | 1.109979 | 1.210383 | 1.640283 | |
| CP | 0.9524 | 0.9514 | 0.9444 | |
| Expected | 2.025536 | 2.474884 | 2.610527 | |
| EV | 1.609161 | 2.943248 | 3.438820 | |
| CP | 0.9514 | 0.9476 | 0.9470 | |
| Expected | 2.425989 | 2.886339 | 3.091317 | |
| EV | 2.762976 | 4.636742 | 5.693718 | |
Table 3.
Estimated coverage probabilities (CP), expected tolerance factors (Expected ), and expected volumes (EV) of the proposed methodology to compute regression-based simultaneous central tolerance intervals for a normal distribution.
| Covariate values evaluated at the | ||||
|---|---|---|---|---|
| Mean | Minimum | Maximum | ||
| CP | 0.9488 | 0.9510 | 0.9464 | |
| Expected | 1.941142 | 2.109102 | 2.382152 | |
| EV | 1.423915 | 1.827380 | 2.633583 | |
| CP | 0.9534 | 0.9462 | 0.9462 | |
| Expected | 2.312953 | 2.860168 | 3.002978 | |
| EV | 2.389920 | 4.552876 | 5.270391 | |
| CP | 0.9468 | 0.9544 | 0.9480 | |
| Expected | 2.816874 | 3.296608 | 3.499398 | |
| EV | 4.283523 | 6.875820 | 8.237914 | |
Table 4.
Estimated coverage probabilities (CP), expected tolerance factors (Expected), and expected volumes (EV) of the approaches to compute a regression-based rectangular multivariate tolerance region for a normal distribution using (1) the proposed methodology; (2) the Bonferroni approach; and (3) the prediction region approach of [14].
| Covariates evaluated at the | Mean (1) | Mean (2) | Mean (3) | Minimum (1) | Minimum (2) | Minimum (3) | Maximum (1) | Maximum (2) | Maximum (3) |
|---|---|---|---|---|---|---|---|---|---|
| CP | 0.9472 | 1.0000 | 0.4104 | 0.9500 | 0.9982 | 0.3244 | 0.9500 | 0.9738 | 0.1856 |
| Expected | 2.1825 | 2.3086 | 2.3668 | 2.2201 | 2.3333 | 2.3689 | 2.3457 | 2.3950 | 2.3684 |
| EV | 2.0204 | 2.3965 | 2.5862 | 2.1254 | 2.4697 | 2.5927 | 2.5103 | 2.6732 | 2.5885 |
| CP | 0.9422 | 0.9988 | 0.3538 | 0.9524 | 0.9700 | 0.1576 | 0.9422 | 0.9532 | 0.1126 |
| Expected | 2.3568 | 2.5766 | 2.3993 | 2.6855 | 2.7607 | 2.4010 | 2.7995 | 2.8206 | 2.3995 |
| EV | 2.5315 | 3.3252 | 2.6848 | 3.7474 | 4.0815 | 2.6902 | 4.2489 | 4.3442 | 2.6816 |
| CP | 0.9518 | 0.9916 | 0.2682 | 0.9472 | 0.9654 | 0.1414 | 0.9550 | 0.9510 | 0.1044 |
| Expected | 2.6643 | 2.9322 | 2.4448 | 3.0379 | 3.1509 | 2.4466 | 3.2158 | 3.2529 | 2.4470 |
| EV | 3.6467 | 4.8548 | 2.8108 | 5.4067 | 6.0083 | 2.8127 | 6.3913 | 6.6079 | 2.8100 |
Table 5.
Estimated coverage probabilities (CP), expected tolerance factors (Expected), and expected volumes (EV) of the approaches to compute a regression-based rectangular multivariate central tolerance region for a normal distribution using (1) the proposed methodology; (2) the Bonferroni approach; and (3) the prediction region approach of [14].
| Covariates evaluated at the | Mean (1) | Mean (2) | Mean (3) | Minimum (1) | Minimum (2) | Minimum (3) | Maximum (1) | Maximum (2) | Maximum (3) |
|---|---|---|---|---|---|---|---|---|---|
| CP | 0.9536 | 0.9744 | 0.0632 | 0.9476 | 0.9758 | 0.0234 | 0.9504 | 0.9666 | 0.0068 |
| Expected | 2.4032 | 2.4481 | 2.3669 | 2.5640 | 2.6153 | 2.3649 | 2.8326 | 2.8849 | 2.3672 |
| EV | 2.7033 | 2.8488 | 2.5910 | 3.2769 | 3.4806 | 2.5845 | 4.4149 | 4.6670 | 2.5910 |
| CP | 0.9486 | 0.9684 | 0.0372 | 0.9478 | 0.9626 | 0.0084 | 0.9472 | 0.9612 | 0.0082 |
| Expected | 2.8046 | 2.8643 | 2.4017 | 3.3329 | 3.4141 | 2.4000 | 3.4730 | 3.5466 | 2.3986 |
| EV | 4.2758 | 4.5573 | 2.6893 | 7.1670 | 7.6968 | 2.6855 | 8.0914 | 8.6033 | 2.6887 |
| CP | 0.9478 | 0.9624 | 0.0320 | 0.9478 | 0.9562 | 0.0114 | 0.9472 | 0.9614 | 0.0092 |
| Expected | 3.3333 | 3.4262 | 2.4469 | 3.7968 | 3.8574 | 2.4467 | 3.9968 | 4.0739 | 2.4450 |
| EV | 7.1119 | 7.7006 | 2.8160 | 10.5881 | 11.0373 | 2.8279 | 12.2454 | 13.0181 | 2.8223 |
On average, as n increases, the estimate of the tolerance factor decreases because larger samples estimate the interval more precisely. It follows that the expected volume of the regions also decreases as the sample size increases. This result is consistent across the four procedures. Comparing Tables 2 and 3, we also see that the expected tolerance factors and expected volumes for the simultaneous central tolerance intervals are larger than their counterparts for the simultaneous tolerance intervals without a central condition. This is as expected, since containing the central γ-proportion (and not just any γ-proportion) of the marginal distribution of each component of the response vector requires wider intervals. The same pattern appears when we compare the columns labeled (1) in Tables 4 and 5, noting that Table 5 shows the results for multivariate tolerance regions with a central condition. Next, comparing the columns labeled (1) in Table 4 with Table 2, the expected tolerance factors and expected volumes for the multivariate tolerance regions are larger than those for the simultaneous tolerance intervals. Similarly, comparing the columns labeled (1) in Table 5 with Table 3, the expected tolerance factors and expected volumes for the multivariate central tolerance regions are larger than those for the simultaneous central tolerance intervals. Thus, with or without a central condition, multivariate tolerance regions are generally larger than simultaneous tolerance intervals. This is to be expected because multivariate tolerance regions must satisfy a stricter condition.
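The reason the central condition forces wider intervals can be written out for a single analyte. The display below is a sketch in generic notation that is assumed here rather than taken from the paper: (L, U) is an interval computed from the sample, Y is a future observation from the reference population, q_a denotes the a-th population quantile, and 1 − α is the confidence level.

```latex
% Assumed generic notation, not the paper's own symbols.
\begin{align*}
\text{tolerance interval:} \quad
  & P_{\text{sample}}\bigl\{\, P_Y(L \le Y \le U \mid \text{sample}) \ge \gamma \,\bigr\} = 1 - \alpha, \\
\text{central tolerance interval:} \quad
  & P_{\text{sample}}\bigl\{\, L \le q_{(1-\gamma)/2} \ \text{and}\ q_{(1+\gamma)/2} \le U \,\bigr\} = 1 - \alpha.
\end{align*}
% Any interval that covers both quantiles contains at least a proportion
% \gamma of the population, so the central event implies the non-central one.
```

Because the central event is contained in the non-central event, an interval calibrated to satisfy the central criterion at confidence 1 − α satisfies the non-central criterion with at least that confidence, so its tolerance factor cannot be smaller.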
Table 4 also compares the proposed multivariate tolerance region with the Bonferroni approach (column (2)) and the prediction region approach of [14] (column (3)). The proposed procedure outperforms both of the other methodologies, which are either conservative (the Bonferroni approach) or too liberal (the prediction region approach). Similar results are seen in Table 5 for the proposed multivariate central tolerance region. Tables 4 and 5 thus show that the proposed methodologies attain the nominal coverage level in a way that the other methodologies cannot.
3.2. Application: insulin-like growth factors
In this subsection, we apply the proposed methodologies to the data from the study of [16], which aims to compute multivariate reference regions for serum concentrations of three components of the insulin-like growth factor (IGF) system in adults. The covariate vector is of the form in (25), while the response vector containing the serum measurements is of the form in (26). Based on the multivariate regression model in (3.1), the least squares estimates of the regression coefficients and the error covariance matrix are given by (27) and (28), respectively. We use these values to compute multivariate reference regions for serum concentrations of IGF-I, IGFBP-2, and IGFBP-3, which are all measured in µg/L. The regions are computed by applying Algorithms 1, 2, 3, and 4 for varying values of age, namely age = 20, 30, …, 70, with the sex and BMI covariates fixed at 0 (males) and 25 kg/m², respectively. The R package mixtools by [1] is used to generate the pairwise plots associated with each result.
Figures 1 and 2 show the computed rectangular reference regions when we apply Algorithm 1 (simultaneous tolerance intervals) and Algorithm 2 (simultaneous central tolerance intervals). Based on the figures, the reference regions change as a function of the variable age. The first plots in Figures 1 and 2 show the pairwise plots of reference regions for the first two components, S-IGF-I and S-IGFBP-2. Both plots show similar patterns for the change in shape and size of the regions as we increase the covariate age. To be more specific, the reference range for S-IGF-I decreases as the age increases, while the reference range for S-IGFBP-2 increases with age. On the other hand, the second plots of said figures show the pairwise plots of reference regions for the first component and the third component, S-IGF-I and S-IGFBP-3. Both plots show that reference ranges for both analytes decrease with age. The difference between the results when using Algorithm 1 compared to Algorithm 2 is that the regions computed using the latter are larger than the regions computed using the former due to the central condition imposed.
Figure 1.
Pairwise plots of rectangular reference regions using simultaneous tolerance intervals for the three components of the insulin-like growth factor system in adults. The reference regions are for healthy males by age, with BMI fixed at 25 kg/m².
Figure 2.
Pairwise plots of rectangular reference regions using simultaneous central tolerance intervals for the three components of the insulin-like growth factor system in adults. The reference regions are for healthy males by age, with BMI fixed at 25 kg/m².
On the other hand, Figures 3 and 4 show the computed rectangular reference regions when we apply Algorithm 3 (multivariate tolerance region) and Algorithm 4 (multivariate central tolerance region), respectively. As with the previous figures, different values of age result in different reference regions. The first plots of both figures show the pairwise plots of reference regions for the first two components, while the second plots show the pairwise plots of reference regions for the first and third components. Note that Figures 3 and 4 share similar patterns with Figures 1 and 2. We also mention that, due to the central condition imposed, the reference regions computed using Algorithm 4 are larger than those computed using Algorithm 3.
Figure 3.
Pairwise plots of rectangular reference regions using the multivariate tolerance region for the three components of the insulin-like growth factor system in adults. The reference regions are for healthy males by age, with BMI fixed at 25 kg/m².
Figure 4.
Pairwise plots of rectangular reference regions using the multivariate central tolerance region for the three components of the insulin-like growth factor system in adults. The reference regions are for healthy males by age, with BMI fixed at 25 kg/m².
3.3. Investigating the robustness of the proposed methodologies
We also examine the performance of the proposed methodologies when the underlying distribution of the response vector is skewed. For each observation, we generate the response vector from the multivariate lognormal distribution with a specified mean vector and covariance matrix on the logarithmic scale. The multivariate lognormal distribution is a skewed distribution, with a long upper tail. We then compute the associated coverage probabilities of the resulting reference regions. In the simulations, we only include the criteria for multivariate tolerance and multivariate central tolerance regions because, as previously mentioned, these are the standard tolerance-based criteria. The results are shown in Table 6. We exclude the expected volume because of the difference in scale. The results show that for large sample sizes, the multivariate tolerance region can produce robust coverage even in the presence of skewness. However, when the sample size is small, the coverage becomes liberal. On the other hand, for the multivariate central tolerance region, the proposed methodology clearly fails when the distribution is skewed.
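The data-generating step of such a robustness check can be sketched as follows. This is an illustration only: the log-scale mean vector and covariance matrix below are hypothetical values chosen for the example, not the settings used in the simulations, and the helper names `rmvlognormal` and `sample_skewness` are introduced here for convenience.

```python
import numpy as np

rng = np.random.default_rng(12345)

# HYPOTHETICAL log-scale parameters for three analytes (illustrative only).
log_mean = np.array([0.0, 0.5, 1.0])
log_cov = np.array([[0.25, 0.10, 0.05],
                    [0.10, 0.25, 0.10],
                    [0.05, 0.10, 0.25]])


def rmvlognormal(n, mean, cov, rng):
    """Draw n multivariate lognormal vectors by exponentiating normal draws."""
    normal_draws = rng.multivariate_normal(mean, cov, size=n)
    return np.exp(normal_draws)


def sample_skewness(x):
    """Column-wise moment estimator of skewness."""
    centered = x - x.mean(axis=0)
    return (centered ** 3).mean(axis=0) / (centered ** 2).mean(axis=0) ** 1.5


y = rmvlognormal(20_000, log_mean, log_cov, rng)
skew_raw = sample_skewness(y)          # clearly positive: long upper tail
skew_log = sample_skewness(np.log(y))  # near zero: normal on the log scale

print("skewness on the original scale:", np.round(skew_raw, 2))
print("skewness on the log scale:     ", np.round(skew_log, 2))
```

A normal-theory region applied to the untransformed draws is then being asked to cover a population with a long upper tail, which is the setting Table 6 examines.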
Table 6.
Estimated coverage probabilities (CP) and expected tolerance factors (Expected) of the proposed methodology to compute a regression-based rectangular tolerance region and rectangular central tolerance region for a normal distribution when the true underlying distribution of the analytes is skewed.
| | Multivariate tolerance | Multivariate central tolerance |
|---|---|---|
| CP | 0.9176 | 0.0260 |
| Expected | 2.1843 | 2.4052 |
| CP | 0.8774 | 0.4560 |
| Expected | 2.3580 | 2.8077 |
| CP | 0.8898 | 0.7076 |
| Expected | 2.6648 | 3.3362 |
4. Discussion
Reference ranges are essential in laboratory medicine for the interpretation of tests and patient care. Though these ranges are often used as bases for conclusions and interpretations of a single test result, there are cases when diagnoses are based on multiple measurements. One example is the diagnosis of hepatotoxicity, which according to Hy’s law is dependent on the results of both ALT and bilirubin levels. Using separate univariate reference ranges for these types of situations poses problems since possible cross-correlations between analytes are not accounted for, and it is possible for the rate of false-positive results to increase. This issue is addressed by multivariate reference regions (MRRs) which factor in the cross-correlations among the analytes.
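The inflation of false positives is easy to quantify in an idealized case. The sketch below assumes three standardized, jointly normal analytes with known parameters, each judged against its own 95% range; the equicorrelation value of 0.8 is an arbitrary illustrative choice. It estimates the share of healthy subjects who fall inside all three ranges at once.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(2024)

# Two-sided 95% limit for a standard normal analyte (about 1.96).
Z_LIMIT = NormalDist().inv_cdf(0.975)


def joint_coverage(corr, n_draws=200_000):
    """Simulated share of healthy subjects whose analytes ALL fall inside
    their own univariate 95% range."""
    dim = corr.shape[0]
    draws = rng.multivariate_normal(np.zeros(dim), corr, size=n_draws)
    return float((np.abs(draws) <= Z_LIMIT).all(axis=1).mean())


independent = np.eye(3)
correlated = np.full((3, 3), 0.8)  # illustrative equicorrelation
np.fill_diagonal(correlated, 1.0)

coverage_independent = joint_coverage(independent)  # close to 0.95 ** 3
coverage_correlated = joint_coverage(correlated)

print(f"independent analytes: {coverage_independent:.3f}")
print(f"correlated analytes:  {coverage_correlated:.3f}")
```

For independent analytes the joint coverage is 0.95³ ≈ 0.857, so roughly one healthy subject in seven is flagged on at least one analyte. Correlation changes the figure, which is why the adjustment needed depends on the joint distribution rather than on the number of analytes alone.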
Customarily, MRRs are constructed as ellipsoidal regions because exact solutions for such regions are available in the multivariate normal setting. However, such regions fail to detect possible outlying univariate measurements. Because of this deficiency of ellipsoidal MRRs, rectangular MRRs have been proposed. The first successful attempt to construct rectangular MRRs is that of [20], which proposes solutions in both the multivariate normal and nonparametric settings. A limitation of the work of [20], as pointed out by [22], is that it only provides a point estimate of the rectangular reference region, obtained by substituting the unknown parameters with their estimators. [13], [14], and [22] also develop methodologies to compute rectangular MRRs under various settings (whether multivariate normal or nonparametric) and using different criteria. [13] proposes to use the criteria for tolerance intervals and tolerance regions to construct the MRRs. These criteria allow for repeated use of the computed interval or region by multiple subjects.
Despite the accuracy of the regions proposed in [13], that work does not consider the case where the reference regions could depend on covariates. To address the need for covariate-dependent reference regions that are amenable to multiple use (or repeated use), this study proposes various approaches to construct rectangular MRRs using the criteria for tolerance intervals and regions in a multivariate normal regression scenario. Because of the assumption of multivariate normality of the observations, it seems natural to posit a rectangular reference region defined through a tolerance factor. However, estimating the value of this tolerance factor such that the correlations among the analytes are properly accounted for is a highly nontrivial problem. This study succeeds in coming up with a suitable parametric bootstrap solution to estimate the tolerance factor under four different tolerance-based criteria.
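To make the idea of a parametric bootstrap for a tolerance factor concrete, here is a deliberately simplified sketch. It is not the paper's Algorithms 1–4: it assumes no covariates, a region of the assumed form (sample mean ± k × sample standard deviation) in each coordinate, and only the non-central content criterion. The function name `bootstrap_tolerance_factor` and all of its parameters are hypothetical.

```python
import numpy as np


def bootstrap_tolerance_factor(x, content=0.95, confidence=0.95,
                               n_boot=500, n_mc=4000, seed=0):
    """Parametric bootstrap estimate of k such that the rectangle
    mean +/- k * sd (per coordinate) contains at least `content` of the
    population with probability `confidence`. Simplified: no covariates."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    mean_hat = x.mean(axis=0)
    cov_hat = np.cov(x, rowvar=False)

    # Draws standing in for the population, treating the fitted normal
    # distribution as the truth; reused for every bootstrap sample.
    future = rng.multivariate_normal(mean_hat, cov_hat, size=n_mc)

    factors = np.empty(n_boot)
    for b in range(n_boot):
        # Bootstrap sample of the same size from the fitted model.
        xb = rng.multivariate_normal(mean_hat, cov_hat, size=n)
        mean_b = xb.mean(axis=0)
        sd_b = xb.std(axis=0, ddof=1)
        # Smallest k for which the rectangle built from this bootstrap
        # sample holds a `content` share of the fitted population.
        scaled_max = np.max(np.abs(future - mean_b) / sd_b, axis=1)
        factors[b] = np.quantile(scaled_max, content)

    # k large enough for a `confidence` share of the bootstrap samples.
    return float(np.quantile(factors, confidence))


data_rng = np.random.default_rng(1)
small_sample = data_rng.multivariate_normal(np.zeros(3), np.eye(3), size=50)
large_sample = data_rng.multivariate_normal(np.zeros(3), np.eye(3), size=500)

k_small = bootstrap_tolerance_factor(small_sample)
k_large = bootstrap_tolerance_factor(large_sample)
print(f"n = 50:  k = {k_small:.3f}")
print(f"n = 500: k = {k_large:.3f}")
```

Because the coordinate-wise maximum is taken under the joint distribution, the correlations among the components enter the factor directly, which a Bonferroni adjustment does not do. In a regression setting the centre and the sampling variability of the region would also depend on the covariate value, which this sketch leaves out.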
In a recent article, [21] argues against the use of tolerance intervals in computing reference intervals, maintaining that reference regions ought to be computed from empirical data by estimating the population quantiles as precisely as possible. While this is certainly a logical approach, given that the population reference interval is enclosed by the 2.5th and 97.5th population percentiles, a drawback of such an approach (which the authors of the commentary article recognize) is that the precision of the estimates is not quantified. We remain of the opinion that the quantification of the uncertainty due to sampling variability in estimating the central part of the population is a critical component of reference interval determination. Nevertheless, the debate on whether reference intervals should be approached by using point estimates or by allowing the uncertainty to be quantified (which is what characterizes the tolerance interval approach) appears to be a philosophical discussion, and we shall abstain from further comment at this point. The approach that balances relevance with practicality in computing reference ranges requires input from laboratory medicine experts.
The simulation results in Section 3.1 show that the proposed methodologies yield accurate results, with coverage probabilities close to the nominal level. The accurate performance of the proposed methodologies remains robust to the size of the sample, as well as the covariate values for which the reference region is computed. In addition, the simulations comparing the proposed methodologies with benchmark procedures indicate the conclusively superior performance of the former. We believe that this study is the first attempt to compute regression-based rectangular tolerance regions with exact coverage, to be used as reference regions in a multivariate normal setting.
This study has presented four approaches to compute the multivariate reference region. Practitioners may opt to compute simultaneous tolerance intervals or a rectangular tolerance region. Moreover, for each of these options, a choice can be made as to whether the central condition should be imposed or not. At this point, we do not take a stand on which of these four approaches is most appropriate in actual practice, as this question also requires the opinion of laboratory practitioners. Nonetheless, it seems that the rectangular central tolerance region is the most cogent choice, as it uses the joint distribution of the response vector in accounting for the correlations, and encloses the central 95% of the distribution at a given confidence level, thereby being consistent with the notion of a reference region as that which contains the central 95% part of the multivariate population.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
1. Benaglia T., Chauveau D., Hunter D.R., and Young D., mixtools: An R package for analyzing finite mixture models. J. Stat. Softw. 32 (2009), pp. 1–29.
2. Boyd J.C. and Lacher D.A., The multivariate reference range: An alternative interpretation of multi-test profiles. Clin. Chem. 28 (1982), pp. 259–265.
3. Dong X. and Mathew T., Central tolerance regions and reference regions for multivariate normal populations. J. Multivar. Anal. 134 (2015), pp. 50–60.
4. Genz A. and Bretz F., Computation of Multivariate Normal and t Probabilities, Springer-Verlag, Heidelberg, 2009.
5. Haq M.S. and Rinco S., Expectation tolerance regions for a generalized multivariate model with normal error variables. J. Multivar. Anal. 6 (1976), pp. 414–421.
6. Khan S. and Haq M.S., Expectation tolerance region for the multilinear model with matrix-t error distribution. Commun. Stat. Theory Methods 23 (1994), pp. 1935–1951.
7. Kibria B.M.G. and Haq M.S., Predictive inference for the elliptical linear model. J. Multivar. Anal. 68 (1999), pp. 235–249.
8. Krishnamoorthy K. and Mathew T., Statistical Tolerance Regions: Theory, Applications, and Computation, Wiley, Hoboken, NJ, 2009.
9. Krishnamoorthy K. and Mondal S., Tolerance factors in multiple and multivariate linear regressions. Commun. Stat. Simul. Comput. 37 (2008), pp. 546–559.
10. Lado-Baleato Ó., Cadarso-Suárez C., Kneib T., and Gude F., Multivariate reference and tolerance regions based on conditional transformation models: Application to glycemic markers. Biom. J. 65 (2023), pp. 2200229.
11. Lee Y.T. and Mathew T., Tolerance regions in multivariate linear regression. J. Stat. Plan. Inference 126 (2004), pp. 253–271.
12. Liu W., Bretz F., and Cortina-Borja M., Reference range: Which statistical intervals to use? Stat. Methods Med. Res. 30 (2021), pp. 523–534.
13. Lucagbo M.D. and Mathew T., Rectangular tolerance regions and multivariate normal reference regions in laboratory medicine. Biom. J. 65 (2023), pp. 2100180.
14. Lucagbo M.D., Mathew T., and Young D.S., Rectangular multivariate normal prediction regions for setting reference regions in laboratory medicine. J. Biopharm. Stat. 33 (2022), pp. 1–19.
15. Martin A.D., Quinn K.M., and Park J.H., MCMCpack: Markov chain Monte Carlo in R. J. Stat. Softw. 42 (2011), pp. 22.
16. Mattsson A., Svensson D., Schuett B., Osterziel K.J., and Ranke M.B., Multidimensional reference regions for IGF-I, IGFBP-2 and IGFBP-3 concentrations in serum of healthy adults. Growth Horm. IGF Res. 18 (2008), pp. 506–516.
17. National Committee for Clinical Laboratory Standards, EP28-A3C: Defining, Establishing, and Verifying Reference Intervals in the Clinical Laboratory; Approved Guideline, 3rd ed., Clinical and Laboratory Standards Institute, Wayne, Pennsylvania, 2010.
18. Novomestky F., matrixcalc: Collection of Functions for Matrix Calculations, 2022.
19. Venables W.N. and Ripley B.D., Modern Applied Statistics with S, Springer, New York, 2002.
20. Wellek S., On easily interpretable multivariate reference regions of rectangular shape. Biom. J. 53 (2011), pp. 491–511.
21. Wellek S. and Jennen Steinmetz C., Why tolerance intervals should not be used. Stat. Methods Med. Res. 30 (2021), pp. 523–534.
22. Young D.S. and Mathew T., Nonparametric hyperrectangular tolerance and prediction regions for setting multivariate reference regions in laboratory medicine. Stat. Methods Med. Res. 29 (2020), pp. 3569–3585.
23. Young D.S. and Mathew T., Supplementary material for nonparametric hyperrectangular tolerance and prediction regions for setting multivariate reference regions in laboratory medicine. SAGE J. Journal contribution. 29 (2020), pp. 3569–3585. doi: 10.25384/SAGE.12588420.v1.




