Abstract
Psychometric growth curve modeling techniques are used to describe a person’s latent ability and how that ability changes over time based on a specific measurement instrument. However, the same instrument cannot always be used over a period of time to measure that latent ability. This is often the case when measuring traits longitudinally in children: measurement tools that are too difficult for young children can become too easy as they age, resulting in floor effects, ceiling effects, or both. We propose a Bayesian hierarchical model for such a scenario. Within the Bayesian model we combine information from multiple instruments used at different age ranges and having different scoring schemes to examine growth in latent ability over time. The model includes between-subject variance and within-subject variance and does not require linking item specific difficulty between the measurement tools. The model’s utility is demonstrated on a study of language ability in children from ages one to ten who are hard of hearing, where measurement tool specific growth and subject-specific growth are shown in addition to a group level latent growth curve comparing the hard of hearing children to children with normal hearing.
KEYWORDS: Bayesian hierarchical models, psychometric modeling, language ability, growth curve modeling, longitudinal analysis
1. Introduction
Many challenges arise when collecting data on test performance over time in the interest of modeling growth of some latent trait or ability. First, study drop-out and missing measurements are common concerns in longitudinal studies [23]. Second, the measurement tool may need to change over time either due to improved scientific knowledge or due to the nature of the latent ability being measured in the population being investigated. For example, when it is of interest to measure a psychological ability in children, the measurement tool may need to become more challenging as the children age, even though the same latent construct, such as intelligence, is being measured. This paper presents a novel Bayesian hierarchical model to estimate growth over time of a latent construct that is measured by multiple measurement tools.
There is currently a vast literature on methods for modeling growth trajectories using longitudinal data analysis techniques. These methods can be classified into two primary approaches, multilevel or hierarchical modeling [2,16] and structural equation modeling (SEM) [11]. Both approaches analyze repeated measures of a dependent variable in the interest of comparing change over time between subjects or groups of subjects. However, these well-established techniques can only be applied to data with a single outcome measurement given to the subjects at every time point. Unfortunately, in some longitudinal studies, the measurement instruments must change over time in order to model growth of their psychological or multi-faceted latent abilities or attitudes, which is a common aim in the field of psychometrics.
A primary psychometric modeling technique designed to measure a latent ability or attitude is item response theory (IRT) modeling [9,11]. In IRT modeling, statistical techniques are used to relate an individual’s performance on a test item with their overall ability that the item was designed to measure. IRT models are commonly used to develop scales, because the resulting model produces a new measure of the latent ability or attitude. Hoffman et al. [8] propose an IRT model to create and compare measures of vocabulary ability by linking common items across multiple test forms. However, they do not use their results to estimate growth curves, and the assumption of common items across forms may not be feasible when different forms are designed to measure different aspects of an ability. Our goal is to measure this latent ability without the need to link specific items from the tests.
There have been a few methods proposed to measure growth curves using multiple measurement scales. An early solution was presented by Bayley [1], where multiple measures on different scales all seeking to capture intelligence from ages 6 to 26 were standardized using z-scores prior to the analysis. This method, while intuitive, has been criticized as a naïve attempt to pool data from conceptually diverse scores. Additionally, standardizing at each age is not ideal when the sample is changing over time, since the interpretation of the standardized score would also change as the sample changes. The approach proposed here is designed for accelerated longitudinal designs in which subjects enroll at various ages and are only tested on measurements applicable for them at those ages. To allow for changing samples over time and measure growth, but still account for the differences in means and variances between measurement tools, the proposed model allows for test-specific means and variances over time using a hierarchical linear model approach, which accounts for the sample changing over time.
McArdle et al. [12] describe several approaches that researchers have taken to analyze data on changing measurements of the same construct and propose a longitudinal growth model based on a longitudinal invariant Rasch test (LIRT) to bring different measures of vocabulary and memory to a common scale. As in Hoffman et al. [8], this approach uses item level data for items on eight different instruments, due to low counts of overlap between the instruments at any one occasion. The study design motivating our model seeks to increase the number of instruments given at any occasion, and our model uses the overall raw scores as opposed to the individual items, which makes it a more straightforward and flexible way to perform a growth analysis for a latent trait.
The model presented in this paper blends ideas from growth modeling and IRT modeling by using a Bayesian hierarchical model. The modeling framework is similar to the development in Oleson, Cavanaugh [14], who use subject-specific effects to borrow strength across two measurements and impute data on each score over time. However, our current study requires development of a model that can combine information from multiple instruments that are all based on one overall measure of some latent ability. Our approach combines scores on multiple continuous measurement instruments that do not have any item overlap, are not scored on the same scale, and are not necessarily measured across the entire time-span of interest. The commonality between the measurement items is that they are all measuring the same latent construct. The modeling framework is designed to have the ability to make population level, or marginal, inference in addition to individual-level inference. Similar to an IRT model, this framework yields a new measurement scale of the latent trait.
We construct our model in the Bayesian paradigm. The Bayesian hierarchical framework is an intuitive approach to implementing complicated modeling, by specifying complicated multidimensional relationships as a series of conditional statements. These conditional statements, laid out in Section 3, allow for the borrowing of information from multiple instruments to estimate a subject-specific latent curve. While the general model could be implemented in the Frequentist paradigm, the ability to leverage information in the Bayesian framework is necessary in situations where not all subjects have measurements on each instrument, as is the case in the application presented here. By leveraging information, Bayesian model estimates are subject to ‘shrinkage,’ making comparisons more conservative and eliminating the need for adjustments for multiple comparisons. As stated in Gelman, Hill [5], ‘rather than correcting for a perceived problem, we just build the multiplicity into the model from the start.’ Additionally, using the posterior distribution results in estimates of the model parameters conditioned on the observed data, which has a natural interpretation in this context.
2. Motivating study
2.1. OCHL and OSACHH studies
The Outcomes of Children with Hearing Loss (OCHL) study [13] and its follow-up study, the Outcomes of School Age Children who are Hard of Hearing (OSACHH) study [15,21,22], investigated the speech, listening, and language development of children from preschool age through elementary school age (roughly ages 2–10). The OCHL study used an accelerated longitudinal design, in which children enrolled in the research between 6 months and 7 years of age and were followed over the length of the grant cycle (maximum of 4 years). Starting in 2013, children who were followed in the OCHL study were then enrolled in the second phase, OSACHH, in which children were tested at second or fourth grade. Children with hearing loss had impairment in the mild to severe range in their better ear and no additional disabilities. An age-matched group of hearing peers was included as a comparison group. Although participants were followed on almost a yearly basis, the test instruments measuring their spoken language ability changed in order to be developmentally appropriate. Tomblin et al. [20] provide demographic data and detailed information about the test instruments.
One of the objectives of these studies was to examine the spoken language abilities of the children who are hard of hearing throughout childhood. Prior to OCHL and OSACHH there had been few studies concerned with the communication outcomes of hard of hearing (HH) children during their preschool years. Furthermore, the small number of studies of the language abilities of these children during the school years were mixed as to whether they differed from normal hearing (NH) children. Recently, the results of the OCHL and OSACHH studies with regard to the preschool years were reported [19]; however, language ability was represented in terms of the average norm referenced standard score of the children at each age for analytical purposes. Since the test instruments utilized differed across ages, and the sample used to norm each test was not constant, there is additional variation across the ages that is not reflective of differences in language ability. In spite of these limitations, the findings indicated that the standard scores increased with age and the degree of hearing ability among these children was associated with their language ability across the preschool years. Even the children with mild hearing loss had poorer language ability than the normal hearing children.
More recently, Tomblin, Oleson [20] examined the language ability of these children at ages eight and ten and found that only the children with more severe hearing loss continued to be poorer in language ability than the normal hearing group, suggesting that the children with mild and moderate loss caught up with their normal hearing peers. In order to fully address this hypothesis, we seek to model the language growth of HH and NH children from their early preschool years through the age of 10. The analytical challenge is that the nature of language ability requires it to be measured with different instruments as children age. In particular, modeling language growth for this longer time-span must combine information from multiple response variables.
2.2. Measurement instruments of spoken language ability
To analyze the language abilities of the children, eight different instruments were used as the children aged. The Vineland Adaptive Behavior Scales-II [18] is a parent-report questionnaire. It examines adaptive behavior, including receptive and expressive language, writing, interpersonal and coping skills, and fine and gross motor skills. For the current analysis, we used the expressive and receptive subtests, which measure language production and comprehension, respectively. The Peabody Picture Vocabulary Test-4 [4] is a standardized measure of receptive vocabulary, in which the examiner says a word that describes one of the pictures on a page, and the participant identifies the correct picture.
The Comprehensive Assessment of Spoken Language [3] is a standardized measure of global language development. Subtests are designed to evaluate receptive and expressive language in areas of lexical/semantic, syntactic, supralinguistic, and pragmatic development. For the purposes of the current analysis, we included scores on the Basic Concepts, Pragmatic Judgment, and Syntax Constructions subtests to measure vocabulary, pragmatics, and expressive morphosyntax scores, respectively. The Vocabulary subtest of the Wechsler Abbreviated Scale of Intelligence [24] was used to measure expressive vocabulary skills. The WASI Vocabulary subtest requires the child to provide an oral definition for a list of target words.
We use the raw scores for each of these instruments rather than their standardized scores. Note that each instrument would be standardized to independent populations so it is not clear how within-subject correlation is impacted through standardized scores. Our goal will be to estimate an underlying latent effect that demonstrates the growth in language ability as a child ages. This latent growth measure must be able to use the information from all eight tests across various ages, where the tests are on different scales, and the tests do not necessarily overlap.
2.3. Study population
The data included 453 subjects, 134 with normal hearing and 319 of which were hard of hearing. The number of measurements that subjects had on any of the eight instruments of interest ranged from one to 17, with 50% of the subjects having between 6 and 13 measurements. 23.6% of children had measurements on all eight instruments. Table 1 summarizes the total number of measurements on each test, the number of measurements per subject on each test, and the range and average age of the children taking that test. The maximum number of measurements per subject on any one instrument was four. Figure 1 shows the mean score on each measurement instrument with error bars within one standard deviation for both NH and HH children. The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.
Table 1. Data summaries for each instrument used in the study.
| Instrument | Total number of measurements | Subjects with 0 measurements | Subjects with 1 | Subjects with 2 | Subjects with 3 | Subjects with 4 | Min age at testing | Mean age at testing | Max age at testing |
|---|---|---|---|---|---|---|---|---|---|
| CASL basic | 398 | 186 | 136 | 131 | 0 | 0 | 3 | 3.54 | 4 |
| CASL pragmatic | 623 | 100 | 153 | 130 | 70 | 0 | 3 | 4.51 | 8 |
| CASL syntax | 756 | 66 | 122 | 162 | 102 | 1 | 3 | 4.77 | 8 |
| CELF WS | 307 | 222 | 155 | 76 | 0 | 0 | 5 | 5.86 | 7 |
| PPVT | 365 | 197 | 149 | 105 | 2 | 0 | 5 | 6.07 | 9 |
| Vineland expressive | 582 | 134 | 136 | 107 | 72 | 4 | 1 | 2.93 | 4 |
| Vineland receptive | 582 | 134 | 136 | 107 | 72 | 4 | 1 | 2.93 | 4 |
| WASI vocab | 492 | 176 | 117 | 115 | 35 | 10 | 6 | 7.08 | 10 |
Figure 1.
Mean score by age on each measurement instrument with standard deviation error bars for NH and HH children.
3. Methods
3.1. Data and process models
Let $Y_{ij}(t)$ denote the score of individual $i$ on test $j$ at time $t$. Each individual may have scores on multiple tests at each time point. Assume a normal distribution as the data model for each outcome variable such that
$$Y_{ij}(t) \sim N\left(\mu_{ij}(t),\, \sigma_j^2\right),$$
and that the $Y_{ij}(t)$ are conditionally independent given $\mu_{ij}(t)$ and $\sigma_j^2$. Therefore, given a subject’s test-specific curve, their deviations from that curve are independent over time. The normal distribution is commonly used in analyses of these measurement instruments, e.g. [19,21]. Other continuous distributions could be used if the observed distributions of measurements were found to be skewed or otherwise non-normal. Distributional assumptions should be assessed using a goodness-of-fit measure, and we describe one such measure in Section 3.2. The model allows the variance $\sigma_j^2$ to differ by measurement instrument, since the instrument scores can be on different scales. Although the variances are allowed to differ by measurement instrument, we assume that the variance within each instrument does not change over time, which is a reasonable assumption based on the motivating data. It would be possible to incorporate different variances within each instrument over time, but this would add significant model complexity. The model could also be extended to have different variances by an individual characteristic such as exposure status, which may be pertinent practically.
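For the process model, we use a linear function, which reasonably fits the trajectories observed in Figure 1; however, it would be possible to fit any parametric curve in this framework. The linear process model is
$$\mu_{ij}(t) = a_{ij} + b_{ij}\, t.$$
The variable $a_{ij}$ denotes the individual intercept for person $i$ on test $j$ and $b_{ij}$ represents the individual slope for person $i$ on test $j$. This indicates that growth is linear for each measurement, but each measurement is allowed to have a unique intercept and slope. These terms are characterized by
$$a_{ij} = \alpha_0 + \alpha_j + \mathbf{x}_i \boldsymbol{\beta} + u_i, \qquad b_{ij} = \gamma_0 + \gamma_j + \mathbf{z}_i \boldsymbol{\delta} + v_i,$$
with the constraints $\sum_j \alpha_j = 0$ and $\sum_j \gamma_j = 0$ for identifiability.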
The latent language growth is denoted by $\alpha_0$ and $\gamma_0$, which represent the overall population intercept and slope, respectively. The $\alpha_j$ and $\gamma_j$ terms combine information across test measures, quantifying the difference between the intercept and slope for measurement $j$ and the population intercept and slope, respectively. The model is also able to include important covariate information that impacts both the intercept and slope terms. $\mathbf{X}$ and $\mathbf{Z}$ denote the $n \times p$ and $n \times q$ design matrices, respectively, representing $p$ and $q$ covariates, where $\mathbf{x}_i$ and $\mathbf{z}_i$ are the row vectors containing the covariate values for the $i$th subject and $\boldsymbol{\beta}$ and $\boldsymbol{\delta}$ are the $p$- and $q$-dimensional vectors, respectively, containing the corresponding coefficients. Covariates can be included in either the intercept term, the slope term, or both, allowing the model to include variables that will test specific hypotheses of clinical relevance. The random subject effects $u_i$ and $v_i$ allow for subject-specific intercepts and slopes. We assume the same random subject effects for each measurement to borrow strength and give predictions for missing scores. The subject-specific effects are analogous to random effects, while the population terms ($\alpha_0$, $\gamma_0$), test-specific terms ($\alpha_j$, $\gamma_j$), and covariate coefficients ($\boldsymbol{\beta}$, $\boldsymbol{\delta}$) are analogous to fixed effects in a frequentist linear mixed-effects model; however, in the Bayesian framework all parameters are considered to be random and are given a probability distribution.
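As a rough illustration of the data and process models in this section, the following Python sketch simulates scores for a few subjects on several tests. Each intercept and slope is built from a population term, a test-specific deviation, and a subject effect shared across tests. All function names and numeric values are hypothetical, covariates are omitted, and the two subject effects are drawn independently for simplicity, whereas the model allows them to be correlated.

```python
import random


def simulate_scores(n_subjects, ages, pop, test_devs, test_sds, subj_sds, seed=0):
    """Simulate (subject, test, age, score) records from a linear growth model.

    pop       -- (population intercept, population slope)
    test_devs -- list of (intercept deviation, slope deviation), one per test;
                 each set of deviations should sum to zero for identifiability
    test_sds  -- residual standard deviation for each test
    subj_sds  -- (sd of subject intercept effect, sd of subject slope effect)
    """
    rng = random.Random(seed)
    records = []
    for i in range(n_subjects):
        # One pair of subject effects is shared by every test for subject i.
        u_i = rng.gauss(0.0, subj_sds[0])
        v_i = rng.gauss(0.0, subj_sds[1])
        for j, (dev_a, dev_b) in enumerate(test_devs):
            intercept = pop[0] + dev_a + u_i
            slope = pop[1] + dev_b + v_i
            for t in ages:
                mean = intercept + slope * t
                # Test-specific noise around the subject's test-specific line.
                records.append((i, j, t, rng.gauss(mean, test_sds[j])))
    return records


# Hypothetical values: two tests on different scales, observed at three ages.
data = simulate_scores(
    n_subjects=4,
    ages=[2, 3, 4],
    pop=(1.0, 8.0),
    test_devs=[(5.0, 1.0), (-5.0, -1.0)],
    test_sds=[3.0, 6.0],
    subj_sds=(2.0, 0.5),
)
print(len(data))  # 4 subjects x 2 tests x 3 ages = 24 records
```

Because the subject effects enter every test's intercept and slope, a subject who scores high on one simulated test tends to score high on the others, which is the mechanism the model uses to borrow strength across instruments.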
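We define an overall ‘W-score’ for a group with some common characteristic defined in $\mathbf{x}$ and $\mathbf{z}$ as
$$W(t) = \left(\alpha_0 + \mathbf{x}\boldsymbol{\beta}\right) + \left(\gamma_0 + \mathbf{z}\boldsymbol{\delta}\right) t. \tag{1}$$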
The W-scores at each time therefore correspond to the latent trait being measured at that time, after averaging over the variation from the different measurement tools. If the individual ability is of interest, we can use the subject effects and the individual covariates to define a ‘U-score’ for each individual, measuring the subject-specific latent ability. For individual i, the U-score is defined as
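$$U_i(t) = \left(\alpha_0 + \mathbf{x}_i\boldsymbol{\beta} + u_i\right) + \left(\gamma_0 + \mathbf{z}_i\boldsymbol{\delta} + v_i\right) t. \tag{2}$$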
U-scores may be used to determine if individuals are above or below average in the latent ability. Since it is not necessary that each measurement is on the same scale, the W-scores and U-scores do not inherently have any direct interpretation. However, the differences in W-scores or between the W-score and U-score are still relevant.
For example, it may be of interest to compare the latent growth ability of two different groups, such as a control group and a treatment group, while controlling for subjects’ sex. It may be reasonable to assume that the control group and treatment group start with the same latent ability, but have different rates of growth after the treatment begins (i.e. the same intercept, but different slopes). We may also assume that the subject’s sex shifts the curve up or down, but does not affect the growth rate. In this case $\mathbf{x}_i$ would indicate the subject’s sex (without loss of generality, we will assume this is an indicator for females), and $\mathbf{z}_i$ would have an indicator for belonging to the treatment group. Then $\beta$ would represent the deviation from the male intercept for all female subjects and $\delta$ would represent the deviation from the control group slope for subjects in the treatment group. The W-score curve for a male in the control group would be written as $W(t) = \alpha_0 + \gamma_0 t$ and the W-score curve for a male in the treatment group would be calculated as $W(t) = \alpha_0 + (\gamma_0 + \delta) t$. Similarly, the W-score curve for a female in the control group would be written as $W(t) = (\alpha_0 + \beta) + \gamma_0 t$ and the W-score curve for a female in the treatment group would be calculated as $W(t) = (\alpha_0 + \beta) + (\gamma_0 + \delta) t$.
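The sex-by-treatment example can be made concrete with a short Python sketch. It evaluates the four group-level curves and one subject-level curve from a population intercept and slope, an intercept shift for females, and a slope shift for the treatment group. The function names and all parameter values are hypothetical and chosen only for illustration.

```python
def w_score(t, intercept, slope, intercept_shift=0.0, slope_shift=0.0):
    """Group-level latent curve: shifted intercept plus shifted slope times t."""
    return (intercept + intercept_shift) + (slope + slope_shift) * t


def u_score(t, intercept, slope, u_i, v_i, intercept_shift=0.0, slope_shift=0.0):
    """Subject-level latent curve: the group curve plus the subject's own effects."""
    return w_score(t, intercept, slope, intercept_shift + u_i, slope_shift + v_i)


# Hypothetical parameter values.
pop_intercept, pop_slope = 2.0, 5.0   # male, control-group intercept and slope
female_shift = 1.5                    # intercept deviation for females
treatment_shift = 0.8                 # slope deviation for the treatment group

groups = {
    "male, control": (0.0, 0.0),
    "male, treatment": (0.0, treatment_shift),
    "female, control": (female_shift, 0.0),
    "female, treatment": (female_shift, treatment_shift),
}
for label, (shift_a, shift_b) in groups.items():
    curve = [w_score(t, pop_intercept, pop_slope, shift_a, shift_b) for t in (0, 1, 2)]
    print(label, curve)

# A female in the treatment group whose own effects put her above her group.
print(u_score(2, pop_intercept, pop_slope, 0.5, 0.2, female_shift, treatment_shift))
```

Differences between two curves, for instance treatment minus control for the same sex, depend only on the slope shift and the time point, which is why such differences remain interpretable even though the latent scale itself has no direct units.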
3.2. Bayesian model specification
In the hierarchical Bayesian framework, priors must be specified for all model parameters and hyperparameters. The population ($\alpha_0$, $\gamma_0$), test-specific ($\alpha_j$, $\gamma_j$), and covariate ($\boldsymbol{\beta}$, $\boldsymbol{\delta}$) parameters are given vague normal priors. The subject-specific effects $u_i$ and $v_i$ are given a multivariate normal prior with variances $\sigma_u^2$ and $\sigma_v^2$ and covariance $\sigma_{uv}$. The unstructured variance-covariance matrix is assigned a vague inverse-Wishart distribution. All priors are vague but proper, reflecting the lack of prior knowledge due to the construction of a new latent score through this modeling process. Prior distributions were chosen to be normal and inverse-Wishart to ensure the model is fully conjugate.
The hierarchical model allows us to specify a complex model in terms of simple conditional relationships, but the model complexity renders numerical computation infeasible. Instead, Markov chain Monte Carlo (MCMC) methods were used to estimate the posterior distribution of the model parameters using a Gibbs sampler implemented in R 3.5.2. Convergence was assessed by examining trace plots and using the Gelman-Rubin statistic [7]. A parameter with a Gelman-Rubin statistic less than 1.1 was considered to have sufficiently reached convergence.
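For readers unfamiliar with the Gelman-Rubin statistic, the sketch below computes its classic form for a single parameter from several equal-length chains. It compares the average within-chain variance with the variance of the chain means. This is an illustrative standard-library Python version, not the R code used in the study, and the choice of the classic (non-split) variant is an assumption.

```python
import math
import statistics


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average of the within-chain sample variances.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B / n: sample variance of the chain means.
    between_over_n = statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between_over_n
    return math.sqrt(pooled / within)


# Two chains exploring the same region versus two chains stuck far apart.
agreeing = [[0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
separated = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]

print(gelman_rubin(agreeing) < 1.1)    # chains agree
print(gelman_rubin(separated) < 1.1)   # chains disagree
```

Values near one arise when the chains agree with each other; a large value signals that at least one chain has not yet reached the same region as the others.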
For both Bayesian and Frequentist modeling approaches, it is important to assess the plausibility of the model and the modeling assumptions. One approach to evaluate model fit in the hierarchical Bayesian framework is the posterior predictive p-value proposed by Gelman et al. [6], which is based on a goodness-of-fit discrepancy measure $T(y, \theta)$,
where $\theta$ represents the model parameters. This measure of model fit compares observed scores, $y$, to the posterior predictive scores from replicated draws from the posterior predictive distribution, $y^{\mathrm{rep}}$. The posterior predictive p-value is defined as $P\left(T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta) \mid y\right)$; posterior predictive p-values that deviate substantially from 0.5 indicate lack of fit.
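The posterior predictive p-value can be approximated from MCMC output by simulating one replicated data set per posterior draw and counting how often the replicated discrepancy is at least as large as the observed one. The Python sketch below assumes a chi-square discrepancy (the sum of squared standardized residuals); that choice, the function names, and the input layout are assumptions made for illustration and may differ from the measure used in the study.

```python
import random


def chi_square_discrepancy(y, means, sds):
    """Sum of squared standardized residuals (an assumed discrepancy measure)."""
    return sum(((yi - m) / s) ** 2 for yi, m, s in zip(y, means, sds))


def posterior_predictive_p(y, draws, seed=0):
    """Share of posterior draws whose replicated data fit no better than the observed data.

    draws -- list of (means, sds) pairs, one pair per posterior draw, where
             means[k] and sds[k] are the fitted mean and sd for observation k
    """
    rng = random.Random(seed)
    exceed = 0
    for means, sds in draws:
        # Replicated scores generated from the same draw of the parameters.
        y_rep = [rng.gauss(m, s) for m, s in zip(means, sds)]
        t_rep = chi_square_discrepancy(y_rep, means, sds)
        t_obs = chi_square_discrepancy(y, means, sds)
        if t_rep >= t_obs:
            exceed += 1
    return exceed / len(draws)


# Hypothetical posterior draws that all place both observations near zero.
draws = [([0.0, 0.0], [1.0, 1.0])] * 200
print(posterior_predictive_p([100.0, 100.0], draws))  # observed data far from the fit
print(posterior_predictive_p([0.3, -1.1], draws))     # observed data typical of the fit
```

A value at either extreme means the observed discrepancy is unlike the replicated ones, while a value in the middle of the unit interval means the observed data look like data the fitted model would generate.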
3.3. Sequential BEST procedure
It is often of interest to test for differences in growth between characteristics of individuals, or groups over time. In the Bayesian hierarchical framework, inference on between group differences can be made using the posterior distribution of the differences between groups, a direct result from the MCMC iterations. Implemented here is an extension of the Bayesian Estimation Supersedes the t Test (BEST) approach [10]. The BEST estimation procedure allows the researcher to define a ‘region of practical equivalence’ (ROPE) for the difference in means between the groups, which is then compared to the 95% Highest Posterior Density Interval (HDI). If the HDI and the ROPE do not overlap, we reject the null hypothesis and conclude that there is a clinically significant difference between the groups. If the HDI is completely contained within the ROPE we accept the null hypothesis and conclude that the groups are clinically equivalent, something that is not possible with standard null hypothesis significance testing.
Since our data are longitudinal, we employ a variant of the BEST approach called the sequential BEST procedure [17]. The sequential BEST procedure uses the BEST estimation approach at each time point. At each time point of interest, we compute the 95% HDI for the difference in means and compare that to the ROPE. A significant difference at a given time point is recorded if the 95% HDI and the ROPE do not overlap at that time point. While the sequential BEST procedure involves 10 comparisons between the 95% HDI and the ROPE, no multiple comparisons adjustments are needed as the Bayesian hierarchical model shifts estimates and intervals in the construction of the posterior, as opposed to widening intervals post-hoc [5].
The BEST procedure offers an advantage over classical hypothesis testing, where the research question is framed in terms of rejecting a ‘null value’ or a difference of zero between groups. Research questions framed in the null hypothesis significance testing way have the potential to result in ‘statistically significant’ findings of a small difference that is not clinically relevant and thus practically is actually ‘insignificant’, especially for large sample sizes. The use of the ROPE in the BEST procedure allows the researcher to specify a difference, which is practically and clinically relevant, and explicitly tests for differences outside that region. As such, the choice of the ROPE should be determined based on clinical expertise and relevance in the field of interest, as opposed to the observed data.
Our modeling framework creates a new and latent measurement scale, so it is not obvious how to define the ROPE as there is no previous information on this scale. However, the model does give a distribution for the scale, which gives information about the ranges of scores. As the distribution of the scale is unimodal and symmetric, we propose using the standard deviation of the W-score, $\sigma_W(t)$, at each time point to inform the ROPE. Whether the ROPE is $\pm 1\,\sigma_W(t)$ or $\pm 2\,\sigma_W(t)$ should depend on what is standard in the field. For this data analysis, we use $\pm 1\,\sigma_W(t)$ to define the ROPE. The use of a one standard deviation difference in the field of speech and language is a well-known and commonly used method to determine clinically meaningful differences.
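Both ingredients of the comparison at a single time point can be computed directly from posterior draws. The Python sketch below finds the narrowest interval holding 95% of the draws of a group difference and builds a ROPE of plus or minus one standard deviation of the W-score draws. The function names and simulated draws are hypothetical, and using the draws of one group's W-score for the standard deviation is an assumption.

```python
import math
import random
import statistics


def hdi(samples, mass=0.95):
    """Narrowest interval containing the requested share of the sorted samples."""
    ordered = sorted(samples)
    n_inside = max(1, math.ceil(mass * len(ordered)))
    starts = range(len(ordered) - n_inside + 1)
    # Slide a window of n_inside points and keep the shortest one.
    best = min(starts, key=lambda i: ordered[i + n_inside - 1] - ordered[i])
    return ordered[best], ordered[best + n_inside - 1]


def rope_from_sd(w_draws, multiplier=1.0):
    """Symmetric ROPE of +/- multiplier times the sd of the W-score draws."""
    half_width = multiplier * statistics.stdev(w_draws)
    return -half_width, half_width


# Hypothetical posterior draws of the W-score for two groups at one age.
rng = random.Random(1)
w_group_a = [rng.gauss(40.0, 0.8) for _ in range(2000)]
w_group_b = [rng.gauss(36.0, 0.8) for _ in range(2000)]
difference = [a - b for a, b in zip(w_group_a, w_group_b)]

print(hdi(difference))
print(rope_from_sd(w_group_a))
```

Repeating the two calls at each age of interest gives the sequence of intervals and regions that the sequential procedure compares.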
4. Data analysis
4.1. Model specification
The purpose of this analysis is to examine the differences in spoken language growth between normal hearing (NH) and hard of hearing (HH) children from preschool age through 10 years of age. Of specific interest in this application was whether there was a change in the trajectory of language growth once children started primary school. To address this research question, a knot was added to the process model for each child at age five. This allows each child to have one slope until age five, and then a different slope after age five, but ensures that the curve is continuous. Age five was chosen as it is the most common age when children start school. Hearing impairment was included in the model as a covariate to obtain population growth curves for NH and HH children. In hierarchical form the model can be written as
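$$Y_{ij}(t) \sim N\left(\mu_{ij}(t),\, \sigma_j^2\right), \qquad \mu_{ij}(t) = a_{ij} + b_{ij}\, t + c_i\, (t - 5)_+,$$
where the spline transformation is
$$(t - 5)_+ = \begin{cases} 0, & t \le 5, \\ t - 5, & t > 5. \end{cases}$$
Thus, the regression equation in each time interval is
$$\mu_{ij}(t) = \begin{cases} a_{ij} + b_{ij}\, t, & t \le 5, \\ \left(a_{ij} - 5 c_i\right) + \left(b_{ij} + c_i\right) t, & t > 5, \end{cases}$$
where
$$a_{ij} = \alpha_0 + \alpha_j + \beta\, \mathrm{HH}_i + u_i, \qquad b_{ij} = \gamma_0 + \gamma_j + \delta\, \mathrm{HH}_i + v_i, \qquad c_i = \lambda_0 + \lambda\, \mathrm{HH}_i + w_i,$$
under the constraints that $\sum_j \alpha_j = 0$ and $\sum_j \gamma_j = 0$, and with $\mathrm{HH}_i$ as an indicator for hearing impairment. Parameters $\alpha_j$ and $\gamma_j$ represent the deviations from the overall average intercept and slope for each of the eight instruments described in Section 2. This model does not include any instrument-specific terms on the random effect for the spline coefficient, $c_i$, because we do not expect the effect of any measure to change at age five, only the group and individual curves. Because the parameters $\beta$, $\delta$, and $\lambda$ do not vary by test, we assume that the test effects are completely removed by the $\alpha_j$ and $\gamma_j$ terms and that the tests themselves do not score differently for NH and HH children, but the difference between groups is only a result of the latent language ability being modeled. An effect for HH, $\lambda$, and a subject-specific effect, $w_i$, are included in the spline coefficient in order to analyze the differences in growth after age five for NH and HH children as well as for each individual separately.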
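The knot at age five can be checked numerically. The Python sketch below writes the mean once with a linear spline term and once as two separate lines, and confirms that the two forms agree and that the curve is continuous at the knot. The argument names `a`, `b`, and `c` stand for a generic intercept, slope, and change in slope, and the numeric values are hypothetical.

```python
def hinge(t, knot=5.0):
    """Linear spline term: zero up to the knot, then t minus the knot."""
    return max(0.0, t - knot)


def mean_spline_form(t, a, b, c, knot=5.0):
    """Mean written with the spline term added to a single line."""
    return a + b * t + c * hinge(t, knot)


def mean_piecewise_form(t, a, b, c, knot=5.0):
    """Same mean written as a separate line on each side of the knot."""
    if t <= knot:
        return a + b * t
    return (a - knot * c) + (b + c) * t


# Hypothetical intercept, slope before the knot, and change in slope after it.
a, b, c = 1.0, 8.0, -0.5
for age in range(1, 11):
    spline = mean_spline_form(age, a, b, c)
    piecewise = mean_piecewise_form(age, a, b, c)
    print(age, spline, piecewise)
```

The slope after the knot is the sum of the original slope and the change in slope, and the adjusted intercept on that side is what keeps the two line segments joined at age five.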
Three chains were run for 50,000 iterations, after a burn-in of 50,000 iterations. All parameters were found to have a Gelman-Rubin statistic below the 1.1 threshold and the trace plots showed convergence for each chain. Trace plots are shown in the Supplemental Material. The posterior predictive p-value for the model was 0.494, indicating no evidence of lack of fit.
4.2. Population-level inference
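First, we analyze the population level growth curves. From Equation (1), the W-score for NH children is
$$W_{\mathrm{NH}}(t) = \alpha_0 + \gamma_0\, t + \lambda_0\, (t - 5)_+$$
and the W-score for HH children is
$$W_{\mathrm{HH}}(t) = \left(\alpha_0 + \beta\right) + \left(\gamma_0 + \delta\right) t + \left(\lambda_0 + \lambda\right) (t - 5)_+,$$
where $(t - 5)_+ = \max(0,\, t - 5)$ is the spline term for the knot at age five.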
Posterior means, standard deviations and 95% credible intervals for the W-score parameters are summarized in Table 2. The credible interval for the NH slope, $\gamma_0$, does not include zero, indicating significant language growth over time for the NH children. Additionally, the estimate of the HH effect on the intercept, $\beta$, is negative and the credible interval does not include zero, indicating that HH children have a significantly lower intercept than the NH children. Both the $\lambda_0$ and $\lambda$ parameters, which govern the change in slope after age five, are small and their credible intervals include zero, meaning that the W-scores have a very small change in slope after age five for both NH and HH children, although the estimated change in language growth for NH children is slightly negative, while the estimated change in language growth for HH children is slightly positive. Using the posterior distribution of the parameters, we can obtain the estimated W-scores for the overall latent language ability for both NH and HH children, and 95% credible intervals over time (Figure 2).
Table 2. Prior distributions and posterior parameter estimates.
| Parameter description | Parameter | Prior | Posterior mean (SD) | 95% Credible interval |
|---|---|---|---|---|
| NH: intercept | $\alpha_0$ | N(0, 1000) | 0.71 (1.17) | (−1.60, 3.02) |
| NH: slope | $\gamma_0$ | N(0, 1000) | 8.40 (0.28) | (7.86, 8.95) |
| NH: change in slope after age five | $\lambda_0$ | N(0, 1000) | −0.30 (0.43) | (−1.13, 0.53) |
| HH: effect on intercept | $\beta$ | N(0, 1000) | −3.83 (0.95) | (−5.69, −1.96) |
| HH: effect on slope | $\delta$ | N(0, 1000) | −0.22 (0.26) | (−0.73, 0.29) |
| HH: effect on change in slope after age five | $\lambda$ | N(0, 1000) | 0.68 (0.44) | (−0.18, 1.54) |
| Test-specific Variances | $\sigma_1^2$ | IG(0.1, 0.1) | 21.70 (1.71) | (18.6, 25.27) |
| | $\sigma_2^2$ | IG(0.1, 0.1) | 15.81 (1.06) | (13.85, 18.01) |
| | $\sigma_3^2$ | IG(0.1, 0.1) | 8.60 (0.60) | (7.47, 9.84) |
| | $\sigma_4^2$ | IG(0.1, 0.1) | 10.80 (1.24) | (8.56, 13.40) |
| | $\sigma_5^2$ | IG(0.1, 0.1) | 367.96 (27.63) | (317.85, 426.00) |
| | $\sigma_6^2$ | IG(0.1, 0.1) | 161.42 (9.81) | (143.27, 181.75) |
| | $\sigma_7^2$ | IG(0.1, 0.1) | 18.62 (1.28) | (16.26, 21.27) |
| | $\sigma_8^2$ | IG(0.1, 0.1) | 17.08 (1.42) | (14.49, 20.04) |
| Variance-Covariance of Random Subject Effects | | IWishart | 14.78 (5.35) | (5.61, 26.37) |
| | | | 1.72 (0.45) | (0.97, 2.70) |
| | | | 3.51 (0.98) | (1.91, 5.70) |
| | | | −2.17 (1.49) | (−5.43, 0.32) |
| | | | 0.92 (2.12) | (−2.44, 5.66) |
| | | | −1.66 (0.62) | (−3.03, −0.66) |
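The posterior means in Table 2 can be turned into approximate population curves with a few lines of Python. The sketch below reads the six NH and HH rows as intercept, slope, and change in slope after age five, which is an assumption about the row order that matches the description of the estimates in the text. Because the W-score is linear in these parameters, plugging in posterior means gives the posterior mean curve, up to rounding of the tabulated values.

```python
# Posterior means copied from Table 2 (row order assumed: intercept, slope,
# change in slope after age five; NH values first, then the HH shifts).
NH = {"intercept": 0.71, "slope": 8.40, "slope_change": -0.30}
HH_SHIFT = {"intercept": -3.83, "slope": -0.22, "slope_change": 0.68}

# HH parameters are the NH parameters plus the HH shifts.
HH = {name: NH[name] + HH_SHIFT[name] for name in NH}


def w_score(age, params, knot=5.0):
    """Latent curve with a change in slope at the knot."""
    return (params["intercept"]
            + params["slope"] * age
            + params["slope_change"] * max(0.0, age - knot))


for age in range(1, 11):
    nh = w_score(age, NH)
    hh = w_score(age, HH)
    print(age, round(nh, 2), round(hh, 2), round(nh - hh, 2))
```

Under these assumptions the gap between the NH and HH curves narrows after age five, since the positive HH shift in the change in slope offsets part of the earlier difference.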
Figure 2.

Population level W-scores and 95% credible intervals of growth in language ability for NH and HH children.
The posterior intercept and slope estimates are informative of the growth trends, but to formally test if there is a clinically meaningful difference between the groups at each time point, we use the sequential BEST procedure described in Section 3.3 to directly compare the language abilities as measured by the W-score between NH and HH children at each age from one to ten. The results of the BEST procedure (Table 3) indicate that HH children have significantly lower language ability than NH children through age eight. However, at ages nine and ten, the ROPE and the 95% HDI begin to overlap. Since the 95% HDI is never completely contained in the ROPE, we do not accept the null hypothesis at any time; we can only conclude that at ages nine and ten there is not a clinically significant difference in the language abilities between the NH and HH children. These results are consistent with the previous findings in Tomblin et al. [19] and Tomblin et al. [20].
Table 3. Sequential BEST results comparing the population curves at each age.
| Age | ROPE | 95% HDI of the difference in W-scores (NH − HH) |
|---|---|---|
| 1 | (−0.940, 0.940) | (2.584, 5.505) |
| 2 | (−0.735, 0.735) | (3.105, 5.401) |
| 3 | (−0.593, 0.593) | (3.478, 5.519) |
| 4 | (−0.561, 0.561) | (3.596, 5.870) |
| 5 | (−0.658, 0.658) | (3.452, 6.329) |
| 6 | (−0.684, 0.684) | (3.051, 5.840) |
| 7 | (−0.825, 0.825) | (2.430, 5.551) |
| 8 | (−1.036, 1.036) | (1.627, 5.414) |
| 9 | (−1.282, 1.282) | (0.761, 5.397) |
| 10 | (−1.546, 1.546) | (−0.138, 5.446) |
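The decision rule can be applied mechanically to the values in Table 3. The Python sketch below labels each age as a clinically significant difference when the HDI lies entirely outside the ROPE, as equivalence when the HDI lies entirely inside it, and as undecided otherwise. The function name and labels are hypothetical; the numbers are copied from the table.

```python
# (age, ROPE half-width, HDI lower bound, HDI upper bound), copied from Table 3.
TABLE_3 = [
    (1, 0.940, 2.584, 5.505),
    (2, 0.735, 3.105, 5.401),
    (3, 0.593, 3.478, 5.519),
    (4, 0.561, 3.596, 5.870),
    (5, 0.658, 3.452, 6.329),
    (6, 0.684, 3.051, 5.840),
    (7, 0.825, 2.430, 5.551),
    (8, 1.036, 1.627, 5.414),
    (9, 1.282, 0.761, 5.397),
    (10, 1.546, -0.138, 5.446),
]


def classify(rope_half_width, hdi_low, hdi_high):
    """Compare one HDI with a ROPE that is symmetric around zero."""
    if hdi_low > rope_half_width or hdi_high < -rope_half_width:
        return "different"    # HDI lies entirely outside the ROPE
    if -rope_half_width <= hdi_low and hdi_high <= rope_half_width:
        return "equivalent"   # HDI lies entirely inside the ROPE
    return "undecided"        # HDI and ROPE partly overlap


results = {age: classify(rope, low, high) for age, rope, low, high in TABLE_3}
for age, label in results.items():
    print(age, label)
```

Ages one through eight are labeled as different and ages nine and ten as undecided, matching the interpretation given in the text; no age is labeled equivalent because the HDI never falls entirely inside the ROPE.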
4.3. Subject-specific inference
The second goal of this analysis was to create subject-specific curves of spoken language ability in order to identify children with faster or slower growth than the population. This is achieved by estimating the subject-specific U-scores by setting the test-effects to zero as in Equation (2). Before presenting the results, we illustrate the effectiveness of the U-score at capturing trends in all instruments with scores present for a child.
Figure 3 depicts the individual scores on each test, the marginal test-specific curves estimated by the model, and the estimated U-curve for four children selected to illustrate various response patterns across the age range in this study. The subjects in the top row are both HH children, and the subjects in the bottom row are both NH children. Subject A has measurements on all eight instruments from ages three to ten. Subject B has measurements only after age five but has responses to five instruments. Subject C was measured only between ages five and seven, on four different instruments. Subject D has measurements only on the Vineland Expressive and Receptive scales at age two. This figure illustrates how test-specific curves can be created, even when a subject does not have any measurements on an instrument, by using a weighted average of subject-specific information and population-level information. Note how the U-curve lies between all the test-specific curves, demonstrating how the U-score combines information from all measurements present for a child.
Figure 3. Subject-specific U-scores of language ability (dashed lines) and test-specific curves (solid lines) for four subjects.
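The weighted-average behaviour visible in Figure 3 can be illustrated with a deliberately simplified sketch. The snippet below is not the model fitted in this paper: it assumes a single normally distributed subject-level mean with known between-subject and within-subject variances, and the function `shrunken_mean` and all numerical values are hypothetical. It shows only the general mechanism by which a hierarchical model pulls a subject's estimate toward the group value when the subject contributes little or no data.

```python
def shrunken_mean(observations, group_mean, between_var, within_var):
    """Posterior mean of a subject-level mean under a normal-normal model.

    Simplifying assumption: variances are known and there is one
    parameter per subject. Returns (estimate, weight_on_subject_data).
    """
    n = len(observations)
    if n == 0:
        # No subject data: the estimate is the group mean.
        return group_mean, 0.0
    data_precision = n / within_var     # information from the subject
    prior_precision = 1.0 / between_var  # information from the group
    weight = data_precision / (data_precision + prior_precision)
    subject_mean = sum(observations) / n
    estimate = weight * subject_mean + (1.0 - weight) * group_mean
    return estimate, weight


# Hypothetical values chosen only for illustration.
group_mean, between_var, within_var = 10.0, 4.0, 4.0

none_est, none_w = shrunken_mean([], group_mean, between_var, within_var)
one_est, one_w = shrunken_mean([14.0], group_mean, between_var, within_var)
many_est, many_w = shrunken_mean([14.0] * 8, group_mean, between_var, within_var)

print(none_est, none_w)
print(one_est, one_w)
print(many_est, many_w)
```

With no observations the estimate equals the group mean, and with more observations the weight on the subject's own data increases. By analogy, a child with scores on only a few instruments, or at only one age, receives curves that lean more heavily on the population-level information.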
To investigate how each subject compares to the population, we compare U-scores to W-scores for each child’s group (NH or HH). Figure 4 depicts the U-scores for the same four subjects as in Figure 3 and the W-scores for both groups. Subject A is estimated to have higher language ability than the average HH child throughout the age range in this study, with faster growth in language ability than both the NH and HH averages before age five and a decreased growth rate after age five. Subject B began with language ability similar to that of the average HH child, but has accelerated growth after age five, resulting in higher language ability than both the NH and HH averages at age ten. Subject C has steady growth in language ability that is faster than average. Subject D had measurements at only one time point, so their estimated growth rate is virtually identical to that of the average NH child.
Figure 4. Population W-scores (solid lines) and subject-specific U-scores (dashed lines) depicting language growth for four subjects compared to the group curves.
Since the subject-specific spline coefficient yields the subject-specific change in the slope after the knot at age five, we can use its posterior distribution for each subject to determine whether the subject has significantly faster or slower growth in language ability than the average child in his or her hearing group (NH or HH). If the 95% credible interval for this coefficient does not contain zero, we take this as evidence that the growth in language ability after age five for individual i differs from the average growth in that group. In this analysis, 13 children were identified as having significantly different language growth after age five than the population average, five with significantly slower language growth and eight with significantly faster language growth. Subject A from Figures 3 and 4 was one of the children identified as having significantly slower language growth after age five. The ability to identify these children can help researchers understand which children benefit most from starting school and which children to target with interventions.
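The screening step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the analysis: the posterior draws are simulated rather than taken from the fitted model, the interval is an equal-tailed 95% credible interval computed from empirical quantiles, and the names `credible_interval` and `classify_growth` are hypothetical.

```python
import random


def credible_interval(draws, mass=0.95):
    """Equal-tailed credible interval from posterior draws (empirical quantiles)."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - mass) / 2.0
    low_index = int(tail * (n - 1))
    high_index = int(round((1.0 - tail) * (n - 1)))
    return ordered[low_index], ordered[high_index]


def classify_growth(draws, mass=0.95):
    """Flag a subject whose post-knot slope change is credibly nonzero."""
    low, high = credible_interval(draws, mass)
    if low > 0:
        return "faster"
    if high < 0:
        return "slower"
    return "not distinguishable"


# Simulated posterior draws for three hypothetical subjects (assumed values).
rng = random.Random(0)
posterior_draws = {
    "subject_1": [rng.gauss(-1.0, 0.3) for _ in range(4000)],
    "subject_2": [rng.gauss(0.1, 0.5) for _ in range(4000)],
    "subject_3": [rng.gauss(1.2, 0.4) for _ in range(4000)],
}

flags = {name: classify_growth(draws) for name, draws in posterior_draws.items()}

for name, flag in flags.items():
    print(name, flag)
```

In practice the draws for each subject would come from the MCMC output for that subject's slope-change parameter, and the same rule would be applied to every child in turn.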
5. Discussion
Changing observational measurement tools in a longitudinal study may be necessary when the underlying construct of interest is being measured in children who may outgrow certain tools or in an older population with rapidly declining cognitive ability. Additionally, there may be more than one instrument used to measure various aspects of a complicated latent trait, and the ability to combine information across these instruments is helpful in providing a complete analysis of the trait. The model developed here tackles both problems in a hierarchical Bayesian model framework that borrows ideas from more classical IRT modeling.
This novel modeling framework allows researchers to use more flexible study designs, such as the accelerated longitudinal design used in the motivating study. It also enhances the ability for researchers to do growth curve modeling over longer time-spans, particularly in children that may outgrow instruments of latent knowledge or abilities very quickly. The model includes between-subject variance and within-subject variance to account for correlated measurements over time. Our method utilizes the strength of Bayesian hierarchical models to leverage information across subjects and across several measurement instruments. In addition, the modeling formulation we have developed is able to combine measurement instruments while keeping their raw scores on the original scale, even if the instruments are on very different scales.
In the application, our findings did not indicate a significant change in language ability after age five for either group of children, as was originally hypothesized. This may indicate that starting school by itself does not impact children’s language growth trajectory, or it may be due to the fact that not all children start school at the same age. It would be possible for each child to have a subject-specific spline term corresponding to when they started school, but those data were not available for this analysis. Another possible approach would be to estimate an inflection point for each child, but since that was not the main research question in this work, a common spline was more appropriate.
The data analysis presented here illustrates the effectiveness and usefulness of this model in identifying the growth curves of language ability for children over a nine-year span. We created a novel W-score representing the latent ability, and although this W-score is on a new scale, it is still useful for answering specific research questions as illustrated with the data analysis. We can use the sequential BEST procedure on the W-scores to test for clinically relevant differences between exposure groups. Since the model borrows information from each measure, it could also be used to impute scores for all subjects on every measurement instrument, allowing us to make predictions about how a child might score on a measurement tool even though they were not actually measured.
Future work entails using this modeling framework to not just compare NH and HH children’s language ability over time, but to compare language growth of HH children by the age at which they received an intervention for their hearing impairment. It has been hypothesized that children receiving interventions earlier are more likely to have faster language growth than children that received an intervention later in their childhood. This modeling framework can be easily extended to answer this research question by defining groups based on early or late interventions and comparing the W-scores between these two groups as well as continuing to compare the intervention groups to the NH control group.
Incorporating information from multiple instruments gives an informative understanding of a latent trait and is especially useful when estimating growth in children that may outgrow tests over time. The model can be used to impute scores on instruments that were not taken by a child, or to impute scores for ages that the instrument was not used. The U-scores provided by this model can be used for further research into what causes certain children to have lower or higher scores than the population average. Modeling subject-specific growth over time can be informative on which children to target with interventions and when those interventions could have the most impact.
It is important to note that care must be taken in choosing which instruments are used in this model framework. This model is essentially averaging over all instruments, so there is an implicit assumption that the instruments are measuring the same latent ability. In our analysis we use several instruments of language ability, each of which quantifies slightly different aspects of spoken language ability. By combining all eight of these instruments, we are creating a very holistic measure of language ability. If the study’s interest were in a more specific aspect of spoken language, a subset of these instruments could be used. For example, if the interest were in comparing the vocabulary ability of NH and HH children, we could use only the CASL Basic Concepts subtest, the WASI Vocabulary subtest, and the PPVT-4.
We believe that the Bayesian hierarchical modeling approach presented here allows for a more flexible approach to latent growth modeling. This flexibility does come with some assumptions. First, it must be assumed that the instruments measure the same latent trait. Second, we assume that subject-specific effects do not vary by instrument, i.e. subjects who score high on one instrument will also score high on the other instruments. This is a common assumption in both Bayesian and frequentist hierarchical linear models, and instrument-specific intercept, slope, and variance terms control for the variability in the scoring of each instrument. Finally, in our data analysis we assume measurement invariance across groups, or in other words that the instruments do not score differently for NH and HH children. This is a strong assumption that could be tested with instrument-group interactions; however, due to model complexity, that would be feasible only in an analysis using fewer measurement instruments. Despite these assumptions, our real data analysis yielded conclusions that concurred with previous analyses of the same data, and we believe that our model is useful for accelerated longitudinal study designs with changing measurement tools in which item linking is not feasible.
Funding Statement
This research was supported by NIH/National Institute on Deafness and Other Communication Disorders [grant numbers R01DC009560 and R01DC013591]. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Centers for Disease Control and Prevention or the Department of Health and Human Services.
Disclosure statement
No potential conflict of interest was reported by the authors.
References
- 1. Bayley N., Individual patterns of development, Child Dev. 27(1) (1956), pp. 45–74.
- 2. Bryk A.S. and Raudenbush S.W., Application of hierarchical linear-models to assessing change, Psychol. Bull. 101(1) (1987), pp. 147–158.
- 3. Carrow-Woolfolk E., CASL: Comprehensive Assessment of Spoken Language, American Guidance Services, Circle Pines, MN, 1999.
- 4. Dunn D. and Dunn L., Peabody Picture Vocabulary Test 4, NCS Pearson Inc, Minneapolis, MN, 2007.
- 5. Gelman A., Hill J., and Yajima M., Why we (usually) don’t have to worry about multiple comparisons, J. Res. Educ. Eff. 5(2) (2012), pp. 189–211.
- 6. Gelman A., Meng X.L., and Stern H., Posterior predictive assessment of model fitness via realized discrepancies, Stat. Sinica 6(4) (1996), pp. 733–760.
- 7. Gelman A. and Rubin D.B., Inference from iterative simulation using multiple sequences, Statist. Sci. 7(4) (1992), pp. 457–472.
- 8. Hoffman L., Templin J., and Rice M.L., Linking outcomes from Peabody Picture Vocabulary Test forms using item response models, J. Speech Lang. Hear. Res. 55(3) (2012), pp. 754–763.
- 9. Kim S.H. and Cohen A.S., A comparison of linking and concurrent calibration under item response theory, Appl. Psych. Meas. 22(2) (1998), pp. 131–143.
- 10. Kruschke J.K., Bayesian estimation supersedes the t test, J. Exp. Psychol. Gen. 142(2) (2013), pp. 573–603.
- 11. McArdle J.J. and Epstein D., Latent growth curves within developmental structural equation models, Child Dev. 58(1) (1987), pp. 110–133.
- 12. McArdle J.J., Grimm K.J., Hamagami F., Bowles R.P., and Meredith W., Modeling life-span growth curves of cognition using longitudinal data with multiple samples and changing scales of measurement, Psychol. Methods 14(2) (2009), pp. 126–149.
- 13. Moeller M.P. and Tomblin J.B., An introduction to the outcomes of children with hearing loss study, Ear Hear. 36(Suppl 1) (2015), pp. 4S–13S.
- 14. Oleson J.J., Cavanaugh J.E., Tomblin J.B., Walker E., and Dunn C., Combining growth curves when a longitudinal study switches measurement tools, Stat. Methods Med. Res. 25(6) (2016), pp. 2925–2938.
- 15. Page T.A., Harrison M., Moeller M.P., Oleson J., Arenas R.M., and Spratford M., Service provision for children who are hard of hearing at preschool and elementary school ages, Lang. Speech Hear. Serv. Sch. 49(4) (2018), pp. 965–981.
- 16. Potthoff R.F. and Roy S.N., Generalized multivariate analysis of variance model useful especially for growth curve problems, Biometrika 51(3-4) (1964), pp. 313–326.
- 17. Pugh M.A.M. and Oleson J.J., A Bayesian approach to detect time-specific group differences between nonlinear temporal curves, University of Iowa, 2016.
- 18. Sparrow S., Cicchetti D., and Balla D., Vineland Adaptive Behavior Scales-II, Pearson, San Antonio, TX, 2005.
- 19. Tomblin J.B., Harrison M., Ambrose S.E., Walker E.A., Oleson J.J., and Moeller M.P., Language outcomes in young children with mild to severe hearing loss, Ear Hear. 41(4) (2020), pp. 775–789.
- 20. Tomblin J.B., Oleson J.J., Ambrose S.E., Walker E.A., McCreery R.W., and Moeller M.P., Aided hearing moderates the academic outcomes of children with mild to severe hearing loss (2019). Manuscript submitted for publication.
- 21. Tomblin J.B., Oleson J., Ambrose S.E., Walker E.A., and Moeller M.P., Early literacy predictors and second-grade outcomes in children who are hard of hearing, Child Dev. 91(1) (2020), pp. e179–e197.
- 22. Walker E.A., Ambrose S.E., Oleson J., and Moeller M.P., False belief development in children who are hard of hearing compared with peers with normal hearing, J. Speech Lang. Hear. Res. 60(12) (2017), pp. 3487–3506.
- 23. Walker E.A., Redfern A., and Oleson J.J., Linear mixed-model analysis to examine longitudinal trajectories in vocabulary depth and breadth in children who are hard of hearing, J. Speech Lang. Hear. Res. 62(3) (2019), pp. 525–542.
- 24. Wechsler D. and Hsiao-pin C., Wechsler Abbreviated Scale of Intelligence, Pearson, San Antonio, TX, 2011.