Author manuscript; available in PMC: 2023 Sep 8.
Published in final edited form as: Mol Genet Metab. 2022 Sep 8;137(1-2):201–209. doi: 10.1016/j.ymgme.2022.09.001

Increasing precision in the measurement of change in pediatric neurodegenerative disease

JB Eisengart a,*, MH Daniel b, HR Adams c, P Williams d, B Kuca e, E Shapiro a,f
PMCID: PMC9879307  NIHMSID: NIHMS1862972  PMID: 36115283

Abstract

Due to the surge in new brain-directed treatments, metrics to detect the alteration in developmental trajectories in cognition and adaptive behavior have become increasingly important. We propose Growth Scale Values (GSVs) as a solution to monitoring children with severe neurologic/neurodegenerative conditions. This report stems from a panel of experts presenting at the Gorlin symposium (WORLD Symposium) and a subsequent open Webinar sponsored by the National MPS Society.

Because norm-referenced scores (Standard Scores or Intelligence Quotient, i.e., IQ) do not yield information about gain, stability, or loss of skills, they are not suitable for natural history studies or clinical trials. Age-equivalent (AE) scores have been the standard metric used in natural history studies. While AEs are familiar and interpretable to clinicians and parents, they are imprecise due to lack of standard deviations, standard errors of measurement, and equal intervals between scores. Raw scores also have unequal intervals and are not comparable between ages or ability levels.

The GSV, a nonlinear transformation of raw scores using item calibration to make an interval scale score, can be used for accurate measures of within-person change. GSVs have been identified as a useful metric for longitudinal measurement of other conditions involving neurodiversity. These growth scores circumvent inaccurate AEs in infants, are not limited by age and can be used for impaired patients who are chronologically above the normative age range. GSVs have interval properties (a given difference between GSV values represents the same difference in ability at all score levels) and each GSV value has a known standard error of measurement (SEM). GSVs are recommended to measure change in cognitive and adaptive behavior in natural history studies and in clinical trials for children with neurologic disease.

Keywords: Measurement, neurocognitive decline; Clinical trials; Growth scale value; Ability score; Clinical outcomes

1. Introduction

Rare pediatric neurodegenerative diseases such as the mucopolysaccharidosis (MPS) disorders, neuronal ceroid lipofuscinoses, and other conditions begin to take effect early in life, but detecting these effects on brain development is difficult. Early signs can be obscured by the forces of development, appearing as mild developmental delays, a slower than expected rate of learning, or seeming developmental stagnation. In many of these diseases, the neurodegeneration eventually reaches a stage of rapid loss of abilities, leading to profound impairments in cognition and adaptive function, with tragically shortened lifespan [1]. Change in later-stage disease, when functioning is in a very low range, is also difficult to detect [2,3].

Efforts have surged in the last decade to develop disease-modifying therapies that target the brain and slow or halt these devastating functional declines [4-6]. Blood tests, brain imaging, and sampling of the cerebrospinal fluid cannot currently reveal functional benefit of brain treatments. It is well recognized that the path toward therapy development requires meaningful clinical outcome assessments that can accurately characterize 1) disease natural history, 2) the safety of a therapy, and 3) whether treatment makes a difference in how patients feel and function (cognitive and adaptive skills) [7-9]. Cognitive and adaptive outcome assessments have therefore been used as inclusion/exclusion criteria for enrollment, and as primary and secondary outcome measures in clinical trials (see clinicaltrials.gov).

An important challenge is that many of the existing measures used to assess cognition and adaptive function in typically developing children, and/or children with non-degenerative conditions, are not equipped to measure meaningful change (up or down) in children with neurodegenerative disease. Among a number of problematic factors are a mismatch between the functional ability of the child and the test difficulty level (e.g. the child is functioning lower than the age range of the test, test content is inappropriate for abilities), and inappropriate metrics to measure change. Highly relevant to this latter point, Farmer and colleagues [10] described the inadequacy of norm-referenced scores for quantifying developmental progress in non-degenerative neurodevelopmental disorders, as these scores can decrease over time if ability increases at a slow rate. The investigators further noted that “the reliability of norm-referenced scores is lower at the tails of the distribution, resulting in floor effects and increased measurement error for people with neurodevelopmental disorders” (p. 475), and they proposed Person Ability Scores, or Growth Scale Values (GSV) as an alternative metric to reflect change in individual ability rather than change in distance from a neurotypical developmental trajectory.

Revealing change in individual ability is the most fundamental goal in analyzing functional outcomes for rare pediatric neurodegenerative disease. The possibility of using Growth Score Values, another name for Growth Scale Value and abbreviated the same way, i.e., GSV, was discussed during two expert panels, the first at the Robert J. Gorlin Symposium at the WORLD meeting in February 2022 [11] and the second at a repeat webinar discussion with the same panelists in March 2022. The use of GSVs was felt to have promise for increasing precision and analytic strength in measuring cognitive and adaptive change. This paper was developed from presentations and discussions at those two meetings; it outlines the state of methodologies that have been used to measure change in pediatric neurodegenerative disease, including standard scores, age equivalents (and their derivative, the Developmental Quotient), and raw scores, and it proposes adding GSVs to the analytic approach.

2. A brief history of measuring change in pediatric neurodegenerative disease

2.1. Early studies of functional change

Early trials of bone marrow transplant in the severe form of MPS I, Hurler syndrome, were among the initial investigations of treatment for pediatric neurodegenerative diseases that included cognitive and adaptive function in evaluating therapeutic response. In these studies, developmental measures included the first and second editions of the Bayley Scales of Infant Development (BSID, BSID-II) [12,13], the Mullen Scales of Early Learning (MSEL) [14], and the first and second editions of the Vineland Adaptive Behavior Scales [15,16]. These tests generally used “Standard Scores,” which are norm-referenced scores benchmarked against typically developing children of the same age. These scores were unsatisfactory because of the limited applicability to children with significant disease-related cognitive impairment. First, the Standard Score range goes down only to a value of approximately 40 or 50 (or similar, depending on the test), which has been criticized in rare disease for truncating the representation of children who were functioning below this floor [17,18]. Second, these scores could not even be derived for children who were chronologically older than the test's designed age range but needed to take the test because they were functionally similar to the test's age range, known as “out-of-level testing,” such as a 6-year-old who was functioning at a 2-year-old level. In such a case, there were no norm-referenced Standard Scores for the child. Third, Standard Scores have been noted to lack sensitivity in determining magnitude and direction of cognitive or adaptive change over time [7,9,10,18,19], because they are yoked to the expected rate of change in typically-developing, same-age peers. That is, Standard Scores decline over time in all of the following scenarios: 1) developing forward at a slower rate than expected, 2) stagnating, or 3) regressing.

2.2. Alternative measures of functional change

2.2.1. Age equivalent scores

In consideration of the limitations associated with Standard Scores, the available alternative was Age Equivalent scores (AEs) which involved determining the age at which a typically developing child would demonstrate the same skills. So, returning to the above example of out-of-level testing, the 6-year-old who is functioning at a 2-year-old level could have this function measured using AEs on the BSID-II or –III, such that points earned on the test would be converted to a developmental level (i.e., age equivalent) of 24 months. This metric has several advantages. First, AEs allow many more children with a diversity of skill levels to be represented in clinical, research, therapeutic and educational settings. Second, AEs are universally interpretable and understandable to families, educators, clinicians, and researchers alike, and create meaning with respect to the child's functional level. Third, these values reveal whether change involves slow gains forward, plateauing, or regression by nature of the increasing, leveling, or decreasing AE numbers, respectively [18]. Charted over successive evaluations, the outcome could be the slope of this developmental trajectory of AEs [20,21].

Plotting AEs against chronological age became a method for identifying and depicting inflection points in the cognitive course of disease, specifying timing of developmental arrest and regression. Clarifying these periods of disease-related change continues to be critical to optimizing the timing of therapeutic intervention. One of the earliest plots of this kind was published by Peters and colleagues in 1998 (Fig. 1) [20]. Bone marrow transplant had been found to halt the cognitive decline in Hurler syndrome, and by comparing subsets of their sample transplanted prior to 18 months versus beyond 24 months, they showed that treatment after 24 months of age resulted in stagnation of development. This determination resulted in recommendations for early treatment. Over the years, many studies have used AEs to measure change in pediatric neurodegenerative natural history studies, intervention studies, and clinical trials; this became the standard approach to examine cognitive and adaptive function in MPS and other disorders [18-31].

Fig. 1. Early use of Age Equivalent scores in measuring outcomes.


Using AEs from the BSID and the Mullen, differences in outcomes were found between early and late transplanted children with Hurler syndrome. In this adapted figure from Peters et al. 1998 [20], a few more longitudinal data points were added to those of the original published paper because the patients continued to be monitored until the original investigative team was convinced that the treatment had a sustained impact [E.S., personal communication]. Median slopes have also been added. In analyses of subsets of the original dataset, patients transplanted before 18 months of age showed developmental trajectories in closer proximity to the range of typical growth (A), whereas patients transplanted after 24 months of age showed trajectories much farther from the range of typical growth, with multiple trajectories involving minimal gains (B).

2.2.2. Developmental Quotient

AEs have also been used widely to derive a Developmental Quotient (DQ) [18], which approximates a child's standing relative to age expectations via a calculation of the ratio of AE to chronological age, multiplied by 100. Before the development of Standard Scores, IQs on the original Stanford Binet intelligence test [32] were measured in the same way as DQs. As Standard Scores also reflect a comparison of a child to age norms, the contemporary use of the DQ has been conceptualized as a solution to reflect a child's distance from the neurotypical developmental trajectory when out-of-level testing has been necessary. This method has been used to increase understanding of rapid progression of MPS Type IIIA (the most common subtype of MPS III) [1,3,30,33,34], with some studies showing a relationship of DQ to brain atrophy [3], or examining possible differences between treatment and control groups [33].
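The DQ calculation described above can be written out directly. The following is a minimal Python sketch; the function name `developmental_quotient` is an illustrative assumption, and the example reuses the 6-year-old functioning at a 24-month level mentioned earlier in the text.

```python
def developmental_quotient(age_equivalent_months: float,
                           chronological_age_months: float) -> float:
    """Return DQ = (age equivalent / chronological age) * 100."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * age_equivalent_months / chronological_age_months


# A 6-year-old (72 months) whose test performance corresponds to an
# age equivalent of 24 months.
dq = developmental_quotient(24, 72)
print(f"DQ = {dq:.1f}")
```

Both ages must be expressed in the same unit (months here) for the ratio to be meaningful.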

The DQ has also been used as a solution when Standard Scores are available but performance is lower than the norm-referenced range, thus obscured by the “floor effect,” in which the lowest Standard Score is given to all values equal to or less than a cutoff score. The DQ was applied to historical BSID-I and -II data in a natural history study of Hurler syndrome, allowing for examination of the spread of scores beneath the floor (Fig. 2) [17]. With the floor effect circumvented, the slope of Hurler-related cognitive decline was calculated to be steeper than previously quantified with the Standard Scores.

Fig. 2. Floor effect: Standard Score vs. Developmental Quotient.


These plots both show cognitive outcomes on the same sample of children with untreated Hurler syndrome. The one on the left demonstrates the “floor effect”: because the lowest possible score on the test is a 50, all performances at or below 50 received a value of 50, creating the cluster of data points along the Y-value of 50. The full range of lower function is not represented. Shapiro, Whitley and Eisengart recalculated the Standard Score to be a Developmental Quotient on the right [17]. The DQ scores enabled representation of data points below 50, which contributed to calculation of a steeper slope of decline, whereas the floor effect from the Standard Scores created an overrepresentation of 50s in the calculations. Figure reproduced under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
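The way a score floor flattens an estimated slope of decline can be illustrated with a small Python sketch. All numbers here are hypothetical and are not taken from the Hurler syndrome dataset; the sketch simply treats the floor as censoring every value below 50 and compares least-squares slopes with and without that censoring.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


FLOOR = 50  # lowest reportable score in this hypothetical example

ages_months = [12, 24, 36, 48, 60]   # hypothetical assessment ages
unfloored = [90, 70, 50, 30, 10]     # hypothetical scores with no floor
floored = [max(score, FLOOR) for score in unfloored]  # floor applied

slope_unfloored = ols_slope(ages_months, unfloored)
slope_floored = ols_slope(ages_months, floored)

print(f"slope without floor: {slope_unfloored:.2f} points/month")
print(f"slope with floor:    {slope_floored:.2f} points/month")
```

In this toy example the floored series yields a shallower slope because every performance below the cutoff is recorded as 50, which is the same direction of bias described for the Standard Scores above.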

2.2.3. Raw score

An alternative to AEs and DQs is raw scores, or the number of item score points a child earns on a test. Because raw scores are the basis of the other score types, they are by nature as sensitive, or more sensitive, to change as all other scores. Raw scores can be obtained regardless of the child's chronological age, and the only floor effect is whether the child can perform the easiest items. Raw scores have the advantage over Standard Scores and DQs in that a change in number (or lack thereof) concretely demonstrates whether performance has improved, declined, or plateaued. For example, when a Standard Score declines, the clinician or researcher often reviews the raw scores over time to clarify whether performance has slowly increased, stayed the same, or decreased.

Raw scores might also offer insights into whether there are disease-related functional “ceilings” of abilities. For example, a recent natural history study of MPS IIIA included a detailed analysis of raw scores from the second edition of the Vineland, which showed, e.g., that most patients never achieved a total of >100 raw score points on Expressive Language [30]. The authors meticulously examined the skills that corresponded to various raw score points and determined that most patients never developed speaking vocabularies beyond 50 words. Therefore, raw scores can offer more detail about direction of change as well as skill thresholds for a neurodegenerative disease. In such a circumstance, a trial outcome in which vocabulary is routinely larger and raw score points surpass 100 might be evidence of benefit on that endpoint.

2.2.4. Comparison of AE, DQ, and raw score

We begin with an evaluation of each score type relative to various desirable features of scores used to measure change in clinical trials, and then cite some empirical findings about the relative effectiveness of the score types in research.

2.2.4.1. Desirable feature 1: Equal-interval scales

Equal-interval scales are valuable when comparing or averaging change across children of different ability levels. On an equal-interval scale, a given size score difference reflects the same amount of difference in underlying ability at all score levels. When this is not the case, a given change in score (up or down) can have different meaning about disease state, disease severity, or therapeutic response, depending on the age or ability level of the person. Therefore, in placebo- or standard-of-care controlled trials, the treatment groups would have to be both ability-matched and in the same stage of disease to draw confident conclusions about differences, which is an unattainable goal for recruitment and enrollment in rare disease trials.

The three score types under discussion are not equal-interval, for different reasons. For raw scores, it is because tests virtually never have an even distribution of item difficulties across their full range. For AEs, the unit of measurement is the typical child's monthly or yearly growth in ability, which varies greatly across ages [9]. This feature of AE also affects DQ, which is based on AE.

Raw scores can approach equal-interval scaling if the test developer selects a set of items that are relatively evenly distributed in difficulty. For AE and DQ, however, test design cannot offset the effect of the diminishing rate of typical growth across the childhood years that is observed for most abilities.

Standard scores, by contrast, can claim an equal-interval property to the degree that ability is normally distributed in the population, because standard scores are usually normalized. This property is useful for assessing the degree to which a child differs from the normal population at a single point in time for age. Therefore, it is a desirable method to establish a disability for diagnostic and service eligibility purposes. However, Standard Scores are not desirable for longitudinal assessment of neurodevelopmentally diverse individuals because they cannot determine whether there has been a change in the skill level of the child over time [10].

2.2.4.2. Desirable feature 2: Indication of direction of change.

Another desirable feature of scores used to measure change is that they indicate whether actual performance has increased, decreased, or stayed the same. As noted above, a Standard Score can drop even when raw performance has improved, albeit slowly. DQs have the same characteristic because, like Standard Scores, they describe the relationship between actual and age-typical performance. By contrast, raw scores and AEs do indicate the direction of change in actual performance.
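A short Python sketch can make the contrast concrete. The child and the values below are hypothetical, chosen only to show a case in which the AE rises slowly while the DQ falls; the helper names `dq` and `direction` are illustrative assumptions.

```python
def dq(ae_months, age_months):
    """Developmental Quotient: ratio of AE to chronological age, times 100."""
    return 100.0 * ae_months / age_months


def direction(before, after):
    """Label the direction of change between two scores."""
    if after > before:
        return "increase"
    if after < before:
        return "decrease"
    return "no change"


# Hypothetical child: gains 3 months of AE over 12 months of age.
age_1, ae_1 = 36, 24
age_2, ae_2 = 48, 27

dq_1, dq_2 = dq(ae_1, age_1), dq(ae_2, age_2)

print("AE:", ae_1, "->", ae_2, direction(ae_1, ae_2))
print("DQ:", round(dq_1, 1), "->", round(dq_2, 1), direction(dq_1, dq_2))
```

Here the AE reports the slow gain in actual performance, whereas the DQ reports only the widening gap from age expectations.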

2.2.4.3. Desirable feature 3: Sensitivity to change.

Sensitivity to change in a person's scores is also important in clinical trials. Because raw scores are the basis of AEs, they are at least as sensitive. In fact, it sometimes happens that several raw score points correspond to a single AE, which can create a stagnant AE even when raw score points have been accrued. This has been noted as a shortcoming [9]. DQs are more likely to fluctuate because they are a ratio of two variables, but the meaning of those changes may not be clear, as noted above.

A statistically significant change is one that is greater than what might plausibly occur by chance as a result of measurement error. Statistical significance is determined using the standard error of measurement (SEM) which is, roughly, the average difference between a person's obtained score and their true score. Statistically significant change can be determined for raw scores, because they have SEMs, although test manuals often do not report them. Because of the inherent non-interval nature of AEs, they do not have SEMs. DQs lack them for the same reason and also because ratios introduce mathematical irregularities [35]. The implications of not having SEMs can be great: It took years to be able to show that earlier bone marrow transplantation for Hurler syndrome had significantly better outcomes, because there was no method to establish what was a statistically significant change in an age-equivalent score. Delays in confirming critical clinical guidance can be life-altering in progressive, irreversible diseases.

2.2.4.4. Desirable feature 4: Capacity to measure in out-of-level testing.

Regarding out-of-level testing of severely low-functioning children, all three score types are better than Standard Scores. AE and DQ can be used at any chronological age as long as the child’s developmental age is within the age range of the test's norms, and raw scores do not have even this limitation.

2.2.4.5. Desirable feature 5: Meaningfulness and comparability across tests.

The three score types vary in terms of meaningfulness and comparability across tests. AE is the most easily understood and has the same meaning on different tests. At the other end of the spectrum, raw scores are test-specific and have no inherent meaning. DQs fall between these extremes: they give a sense of how high or low the child's performance is relative to same-age peers, but unlike Standard Scores they do not correspond to a location in a population distribution of scores. Thus, a 12-month-old performing like a typical 6-month-old, an 8-year-old like a typical 4-year-old, and a 24-year-old like a typical 12-year-old all have a DQ of 50, but they may not be equally extreme relative to their age peers.

2.2.5. Empirical findings about the relative effectiveness of the score types

Research studies have reported relative strengths and weaknesses of the different score types. For example, AEs have been asserted to provide more detailed information than DQs about changes in development due to disease or therapy [18]. A recent study of disease progression in neuronopathic Hunter syndrome noted that AEs were able to show information about developmental gains prior to decline, whereas DQs showed only decline [36].

Two Consensus Conferences were held to discuss best practices in measuring cognitive outcomes for clinical trials in the mucopolysaccharidosis disorders. Among the 12 recommendations from the first meeting in December 2016 was the guidance to use age equivalent scores [37]. Recognizing the limitations of AEs, the second consensus meeting recommended use of multiple endpoints and multiple metrics, specifying that for developmental tests, both AEs and raw scores were appropriate [7].

Nevertheless, raw scores, like AEs, face the consequence of non-interval scales, as illustrated by the following examples applied to a 34-month-old taking the BSID-III Cognitive subtest:

  • If the raw score change is 5 points:
    • A raw score change from 40 to 45 corresponds to an AE increase of 2 months and a Standard Score increase of 5 points.
    • A raw score change from 65 to 70 corresponds to an AE increase of 4 months and a Standard Score increase of 10 points.
  • If the AE change is 3 months:
    • An AE change from 22 to 25 months corresponds to a raw-score increase of 5 points and a Standard Score increase of 10 points.
    • An AE change from 30 to 33 months corresponds to a 3-point raw score increase and a 5-point Standard Score increase.

We can conclude that AEs and raw scores have the same problem of unequal intervals at different score levels. However, they both change in the same direction on an ordinal scale.
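The unequal intervals in the examples above can be expressed as exchange rates between metrics. The Python sketch below uses only the four BSID-III Cognitive examples listed above; the variable names are illustrative assumptions.

```python
# Each entry: (raw-score points gained, AE months gained), from the examples above.
raw_five_point_changes = {
    "raw 40 -> 45": (5, 2),
    "raw 65 -> 70": (5, 4),
}
ae_three_month_changes = {
    "AE 22 -> 25 months": (5, 3),
    "AE 30 -> 33 months": (3, 3),
}

# AE months gained per raw-score point at two raw-score levels.
ae_per_raw_point = {
    label: ae_gain / raw_gain
    for label, (raw_gain, ae_gain) in raw_five_point_changes.items()
}

# Raw-score points gained per AE month at two AE levels.
raw_per_ae_month = {
    label: raw_gain / ae_gain
    for label, (raw_gain, ae_gain) in ae_three_month_changes.items()
}

for label, rate in ae_per_raw_point.items():
    print(f"{label}: {rate:.2f} AE months per raw point")
for label, rate in raw_per_ae_month.items():
    print(f"{label}: {rate:.2f} raw points per AE month")
```

If either scale were equal-interval with respect to the other, the rates within each group would match; in these examples they do not.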

3. A way forward with Growth Scale Value: Applying an existing metric to pediatric neurodegenerative disease

The GSV is a metric that allows a quantitative approach to developmental change. The GSV is found in test-manual lookup tables, using raw scores already gathered for several dozen tests including some editions of the BSID and the Vineland, which allows examination of already collected data as well as prospective trials. A Growth Scale Value Technical Report contains a table listing tests with GSV-like scales [38].

3.1. The history of GSVs

GSV-type scales have been used to score and interpret clinical tests for almost fifty years. Richard Woodcock pioneered their use under the name W-scale, starting with the Woodcock-Johnson Psycho-Educational Battery [39]. At about the same time, the British Ability Scales [40] applied the label “ability score.” More recently, numerous tests including the Stanford-Binet 5 [41] have adopted the name Change-Sensitive Score, and many others have used the name Growth Scale Value. All of these scales are fundamentally the same, so the generic name “GSV” is used here.

3.2. What are GSVs?

Like raw scores and AEs, GSVs reflect the child's “raw” performance, that is, how well the child performed on the test items. The essential difference between raw scores and GSVs is that a raw score is a sum of scores on a particular set of test items, whereas the GSV is an estimate of the level of underlying ability that is most likely to have produced that raw score. Thus, there is a GSV for each raw score, and they are monotonically related—as one goes up, so does the other.

Raw scores reflect the idiosyncrasies of a particular set of items, most critically their distribution of item difficulties. Fig. 3 illustrates this for two tests, where items are represented by circles on the scale of person ability and item difficulty. The test on the left has an even distribution of item difficulties from easy to hard, and raw score increases linearly as ability increases (i.e., raw score is an interval scale). Few published tests have such an even difficulty distribution. On the other test, with an uneven distribution of item difficulties, raw score increases rapidly or slowly with ability depending on how many items are at that difficulty level. On that test, a given change in raw score designates different amounts of ability change at different score levels. On both tests, the GSV increases smoothly with ability because GSV is an estimate of the underlying ability and is, therefore, an equal-interval scale.

Fig. 3. Raw score and GSV as a function of distribution of item difficulties.


These graphs represent two hypothetical tests, one with an even distribution of item difficulties from easy to hard (A), and the other with an uneven distribution of item difficulties (B). On each test, the items are represented by circles on the X-axis scale of person ability and item difficulty, and the raw score increases rapidly or slowly with ability depending on how many items are at that difficulty level. On each test, the raw score and GSV are compared as functions of the distribution of item difficulties. On both tests, the GSV increases smoothly with ability because the GSV is an estimate of the underlying ability and is, therefore, an equal-interval scale.

The scale of ability/difficulty referred to above is generated when a test is calibrated using the Rasch model of item response theory (IRT) [42]. In this model, each person has an ability score, each test item has a difficulty value, and abilities and difficulties are on the same numerical scale. The difference between ability and difficulty determines the probability that the person will pass the item.
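The relationship between ability, item difficulty, and raw score can be sketched in Python under the standard Rasch formulation, in which the probability of passing depends on the logistic function of ability minus difficulty. The two item sets below are hypothetical (they are not the tests in Fig. 3), and the function names are illustrative assumptions.

```python
import math


def p_pass(ability, difficulty):
    """Rasch model: probability of passing an item (both values in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_raw_score(ability, difficulties):
    """Expected raw score = sum of pass probabilities across items."""
    return sum(p_pass(ability, b) for b in difficulties)


# Hypothetical item difficulties (logits).
even_test = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]     # evenly spread
uneven_test = [-3.0, -2.8, -2.6, -2.4, 0.0, 2.5, 3.0]  # clustered at the easy end

print("ability  even  uneven")
for ability in range(-3, 4):
    even = expected_raw_score(ability, even_test)
    uneven = expected_raw_score(ability, uneven_test)
    print(f"{ability:>7}  {even:4.2f}  {uneven:6.2f}")
```

On the hypothetical uneven test, a one-logit gain in ability produces a larger expected raw-score gain where items are clustered than where they are sparse, even though the underlying ability change is the same size.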

The special strength of GSVs is for comparing scores on the same test, such as a child's scores at different points in time. One reason for this strength is that equal-interval scaling means that change scores can be interpreted in the same way for children at different ability levels and can safely be combined for purposes of group analysis. A second reason is that each GSV score has a score-specific (conditional) SEM so that confidence intervals and statistically significant differences can be calculated more precisely. Third, GSV scores and changes in GSV scores can be interpreted in terms of the child's probability of success on individual test items, which gives concrete meaning to change scores.

3.3. GSV in comparison to other score types

The GSV is distinguished from raw scores, AEs and DQs by its desirable equal-interval scaling. Like the raw score, GSV reliably indicates the direction of the change in a child's performance, i.e., improvement, plateau, or decline. Because GSVs generally have a one-to-one relationship to raw scores, they usually have the same degree of sensitivity to change as raw scores.

GSVs and raw scores have SEMs, unlike AEs and DQs. The SEMs for GSVs are more informative because they are specific to the individual score. Typically, GSVs in the broad middle of the score range are more precise (have smaller SEMs) than those near the upper or lower limits, and these differences in precision are communicated to the user through the conditional SEMs. Raw score and Standard Score SEMs are not differentiated by score, but instead are averages across the score range [43].
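One common way to use score-specific SEMs is to compare an observed change with the standard error of the difference between two scores. The Python sketch below assumes independent, approximately normal measurement errors and a two-sided 5% criterion; the GSV and SEM values are hypothetical, and this is not presented as the procedure of any particular test manual.

```python
import math


def change_z(gsv_1, sem_1, gsv_2, sem_2):
    """z statistic for a change score, using each score's own (conditional) SEM."""
    se_difference = math.sqrt(sem_1 ** 2 + sem_2 ** 2)
    return (gsv_2 - gsv_1) / se_difference


def is_significant(z, critical=1.96):
    """Two-sided test at roughly the 5% level (assumed criterion)."""
    return abs(z) > critical


# Hypothetical scores and conditional SEMs.
z_large = change_z(500, 6.0, 520, 6.0)   # 20-point gain
z_small = change_z(500, 6.0, 510, 6.0)   # 10-point gain

print(f"20-point gain: z = {z_large:.2f}, significant = {is_significant(z_large)}")
print(f"10-point gain: z = {z_small:.2f}, significant = {is_significant(z_small)}")
```

Because conditional SEMs are larger near the extremes of the score range, the same observed change can be significant in the middle of the scale and not significant near its limits.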

AEs, DQs, raw scores, and GSVs all support out-of-level testing. Like raw scores, but unlike AEs and DQs, GSVs can be used even when the child's developmental age is above the age range of the test norms. Why does this matter if the test was selected as appropriate for the child's level of ability? It is important to make room for change if, during the course of a clinical trial, the child changes enough that the developmental age moves above or below the normative age range. GSVs and raw scores may continue to be used for tracking change until they approach the maximum or minimum raw score, where they become inaccurate (as indicated by an increase in the GSV's conditional SEM). This extension of range may be quite large; for example, on the BSID-III Cognitive subtest, GSVs and raw scores are accurate up to a developmental age of about 5½ years, two years above the top of the normative age range (i.e., 42 months) and AE maximum. This can be very important in clinical trials because it may avoid or delay the need to transition to a higher-level test. Transitioning does interrupt continuity of measurement and, if done too soon, may present the child with inappropriately difficult test items.

A disadvantage of GSVs that is shared with raw scores and AEs is the lack of values for composite scores, such as Language and Motor composites on the BSID-III, and the domain and composite scores on the Vineland. Constructing GSVs is less straightforward for composites than for subtests, and various methods are under active consideration at the publisher for the BSID and Vineland (Pearson).

GSVs and raw scores do not have inherent meaning, in contrast to the rather straightforward interpretability of AEs and DQs. GSVs and raw scores cannot be compared between tests, and they do not provide normative information. GSVs have a unique interpretive advantage, however, in the context of measuring change. A change in a GSV score can be described in terms of the change in the probability that the child will succeed on a particular task (such as a test item). For example, on many GSV-like scales, a 10-point increase in the GSV score means that the child's success probability has increased from 0.50 to 0.75.
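The probability interpretation can be sketched in Python. The scaling constant is an assumption: it is set so that 10 scale points equal ln 3 logits (about 9.10 points per logit, as in W-type scales), which reproduces the 0.50 to 0.75 example above. The item location of 500 is hypothetical, and other GSV scales may use a different constant.

```python
import math

# Assumed scaling: 10 scale points correspond to ln(3) logits (~9.10 points/logit).
POINTS_PER_LOGIT = 10 / math.log(3)


def success_probability(person_gsv, item_gsv):
    """Rasch probability of success, with both values on the assumed GSV scale."""
    logit = (person_gsv - item_gsv) / POINTS_PER_LOGIT
    return 1.0 / (1.0 + math.exp(-logit))


item_gsv = 500  # hypothetical task located at 500 on the scale
for person_gsv in (500, 510, 520):
    p = success_probability(person_gsv, item_gsv)
    print(f"person GSV {person_gsv}: P(success) = {p:.2f}")
```

Under this assumed scaling, a child whose GSV equals the task's location has an even chance of success, and each further 10-point gain triples the odds of success.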

3.4. Examples of GSV application in rare disease

3.4.1. Infant testing

Due to newborn screening (NBS) in the United States, MPS I [44], and now MPS II [45], are being identified at the beginning of life, and with advancing diagnostic techniques as well as improving therapies, there is the hope that even more disorders will be added to newborn screening panels. For early diagnosed infants, age equivalent scores are problematic in the first months of life because they are instantly restricted by the natural age “floor” of birth. Further, the correspondence of raw score to AE is rather variable, which can create exaggerated deviations from chronological age of the infant. This variability in very young infants further highlights the broader issue with the unequal interval scaling of raw scores. Specifically, a raw score change in a very young infant may have rather different meaning in an older infant even if it is precisely the same magnitude of raw score gain. This is because it is unlikely that test items can be increased in difficulty in linear fashion so that raw score gains always signify the same increase in performance. GSV scores, with equal-interval scaling, increase the confidence that the calculated difference signifies the same change in ability regardless of the age of the child or the segment of the test.

Table 1 illustrates this problem for two hypothetical children: one who represents a child identified by NBS and who could therefore first be tested as a newborn (age 21 days, Child 1) and one who represents a child identified by clinical signs and who would therefore first be tested at 10 months 21 days (Child 2). After about 4.5 months (the interval between the Time 1 and Time 2 ages in Table 1), anchored to their initial score, both children had gains of 16 raw score points. Child 1 had a drop in Scaled Score, reflecting a widening gap from the neurotypical trajectory. On the other hand, Child 2 had a gain in Scaled Score, which might be interpreted as accelerating development aligned with the neurotypical trajectory. However, examining GSV change over time revealed a gain of 100 points for Child 1 and only 54 points for Child 2. Thus, the increase of 16 raw score points for the younger infant represented nearly double the ability gain that the same 16 points represented for the older infant.

Table 1.

Differences in ability associated with the same raw score point change.

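| Child | Time point | Age | Raw Score | Scaled Score | AE | GSV |
|---|---|---|---|---|---|---|
| Child 1 | Time 1 | 0 mos 21 days | 3 | 7 | <16 days | 303 |
| Child 1 | Time 2 | 5 mos 7 days | 19 | 5 | 4 mos 10 days | 403 |
| Child 1 | Difference | | +16 | −2 | | +100 |
| Child 2 | Time 1 | 10 mos 21 days | 34 | 7 | 9 mos | 486 |
| Child 2 | Time 2 | 15 mos 7 days | 50 | 12 | 17 mos | 540 |
| Child 2 | Difference | | +16 | +5 | | +54 |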
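The arithmetic behind Table 1 can be checked directly. The short sketch below transcribes the raw scores and GSVs from the table and computes each child's gains and the GSV gain per raw-score point; the dictionary layout and function name are illustrative choices, not part of any scoring software.

```python
# Raw scores and GSVs transcribed from Table 1 as (Time 1, Time 2) pairs.
table1 = {
    "Child 1": {"raw": (3, 19), "gsv": (303, 403)},
    "Child 2": {"raw": (34, 50), "gsv": (486, 540)},
}


def gains(record):
    """Return (raw gain, GSV gain, GSV gain per raw-score point)."""
    raw_gain = record["raw"][1] - record["raw"][0]
    gsv_gain = record["gsv"][1] - record["gsv"][0]
    return raw_gain, gsv_gain, gsv_gain / raw_gain


summary = {child: gains(record) for child, record in table1.items()}

if __name__ == "__main__":
    for child, (raw_gain, gsv_gain, per_point) in summary.items():
        print(f"{child}: raw +{raw_gain}, GSV +{gsv_gain}, "
              f"{per_point:.2f} GSV points per raw point")
    ratio = summary["Child 1"][1] / summary["Child 2"][1]
    print(f"GSV gain ratio (Child 1 / Child 2): {ratio:.2f}")
```

Both children gain 16 raw points, but each raw point corresponds to 6.25 GSV points for Child 1 and about 3.4 for Child 2, which is the unequal-interval problem in numeric form.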

3.4.2. Older children

An example of natural history data in MPS IIIB was provided by Allievex for a group of 22 untreated children between 1 and 10 years of age. The inclusion criteria included a Vineland “Adaptive Behavior Composite” total Standard Score over 50, as well as an overall AE ≥12 months (averaging AEs across all except Motor AEs). In this group, the BSID-III was administered every 12 weeks. A plot of change in ability, anchored to the first GSV score, clearly illustrates rapid acquisition of abilities (i.e., increased GSVs) within younger children, contrasting with slower gains among older children and losses among the oldest, seen by ever decreasing angles of trajectory (Fig. 4A). When the slopes are plotted, the decreasing acquisition velocity with age is evident, ending in a slope of loss at older ages (Fig. 4B).

Fig. 4. GSV scores in a natural history study of patients with MPS IIIB.

Individual BSID-III Cognitive GSV change scores from baseline to last visit (A) depict rapid acquisition of abilities (i.e., increased GSVs) within younger children, which contrasts with slower gains among older children, and losses among the oldest, seen by ever decreasing angles of trajectory. The same patients' GSV slopes plotted against baseline age (B) depict the decreasing acquisition velocity, ending in a slope of loss at older ages.
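A per-child slope of the kind plotted in Fig. 4B could be obtained in several ways; the source does not state which method was used. The sketch below assumes an ordinary least-squares line fitted to each child's GSVs over visit time. The visit schedule follows the 12-week interval mentioned above, but the GSV values are hypothetical and serve only to show the calculation.

```python
def ols_slope(times, scores):
    """Least-squares slope of scores on times (GSV points per time unit)."""
    n = len(times)
    if n != len(scores) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_t = sum(times) / n
    mean_s = sum(scores) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all visits share the same time point")
    sxy = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, scores))
    return sxy / sxx


# Hypothetical visits every 12 weeks; GSVs are invented for illustration only.
weeks = [0, 12, 24, 36]
younger_child_gsv = [400, 412, 424, 436]  # steady acquisition of skills
older_child_gsv = [450, 447, 444, 441]    # gradual loss of skills

if __name__ == "__main__":
    print(f"younger child: {ols_slope(weeks, younger_child_gsv):+.2f} GSV/week")
    print(f"older child:   {ols_slope(weeks, older_child_gsv):+.2f} GSV/week")
```

Because GSVs are on an interval scale, slopes computed this way can be compared across children of different ages, which is not the case for slopes of raw scores or AEs.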

3.4.3. Adaptive function

Adaptive function can be defined as “The performance of daily activities required for personal and social sufficiency” [46, p. 10]. Measurement of adaptive function is commonly included as a secondary outcome measure for disease-modifying therapies for neurodegenerative disease: over 20 currently active studies listed at clinicaltrials.gov include adaptive behavior as a secondary outcome measure.

GSVs have utility in assessing change in adaptive function over time, where the metric is available. To illustrate we present data from a prospective, longitudinal study of CLN3 Batten disease, an autosomal-recessively inherited, childhood-onset neurodegenerative lysosomal storage disorder. CLN3 Batten disease is slowly progressing, with an approximate 15–20-year disease course from symptom onset to end of life, and symptoms include vision loss, seizures, motor and speech impairment and dementia with both cognitive regression and behavioral and psychiatric symptoms [47-50]. Collectively these clinical features progressively impair adaptive function.

The Vineland can produce subdomain GSVs for individual domains of function. The advantages of GSVs over norm-referenced scores (v-scale scores, vSS) for measurement of adaptive function are apparent in Fig. 5, which presents the v-scale scores (left axis) and GSV scores (right axis) for the Vineland-3 Receptive Communication subdomain for n = 9 children with CLN3 disease. Some of these subjects completed multiple assessments, which permitted measurement of change over time in vSS and GSV scores. For these participants, floor effects do not occur with GSVs, yet begin to emerge with v-scale scores at an early stage of disease progression (i.e., at a young chronological age), even when children still have reasonably intact physical and cognitive abilities to perform activities of daily living and despite having adapted to their vision loss.

Fig. 5. Change in Vineland-3 Receptive Language in children with CLN3 Batten disease:

Data for a subset of N = 9 children from a larger study of CLN3 Batten disease are presented as both v-Scale Score (vSS) and Growth Scale Value (GSV) on the same graph. The vSS range (1–19) is shown on the left y-axis and the GSV range (10–147) is shown on the right y-axis. Longitudinal data for several affected individuals with multiple assessments are shown with closed markers + solid lines (vSS), and open markers + dashed lines (GSVs), respectively. Individuals who completed Vineland-3 assessments at a single time point are shown with closed (vSS) and open (GSV) markers, respectively. Comparison of vSS versus GSV scores reveals a significant proportion of closed markers and/or solid lines gathered at the “floor” for vSS values, but much more movement/difference among the corresponding GSV values.

4. Discussion and future directions

Growth Scale Values appear to be a superior method of measuring change over time, for both natural history studies and clinical trials, in children with cognitive and adaptive impairment due to neurodegenerative disease or who have conditions resulting in stagnant or neurodiverse development. Table 2 summarizes the characteristics of GSVs in comparison to the other metrics that have been used to measure change.

Table 2.

Comparison of the capabilities of various metrics.

| Metric capability | Standard Scores | Raw Scores | Age Equivalent Scores | GSVs |
|---|---|---|---|---|
| Out-of-level testing (chron. age above the age ceiling of the test) | No | Yes | Yes, for AEs ≤ age ceiling | Yes |
| Meaning to clinicians, families, researchers | Yes | No | Yes | No |
| Comparison across tests | Yes | No | Yes | No |
| Measure developmental change qualitatively | No | Yes | Yes | Yes |
| Measure developmental change quantitatively (equal intervals between measurements) | No | No | No | Yes |
| Standard error of measurement available | Yes | Yes | No | Yes |

From a psychometric standpoint, GSVs have advantages over all the other metrics for precisely measuring change. Although, like raw scores, GSVs lack meaning for parents and clinicians, their use in clinical trials is more precise than any of the alternatives. Their equal intervals allow a quantification of change that is similar across developmental and chronological ages. Even for those diseases with a stagnant course, GSVs can track improvements due to treatment for such children. In neurodegenerative disease, response to treatment can be tracked and compared to the natural history of the disease. The statistical significance of observed changes can be evaluated by using known standard errors of measurement to distinguish between measurement error and treatment effect [51,52].
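One common way to use standard errors of measurement for this purpose is to divide the observed change by the standard error of the difference between the two scores. The sketch below assumes independent measurement errors at the two time points and a two-sided 5% criterion (critical value 1.96). The GSVs are Child 2's values from Table 1, but the SEM values of 9 and 12 and the function names are hypothetical; actual SEMs would come from the test publisher's tables for each GSV.

```python
import math


def change_z(score_1, sem_1, score_2, sem_2):
    """Observed change divided by the standard error of the difference.

    Assumes the measurement errors at the two time points are independent.
    """
    se_difference = math.sqrt(sem_1 ** 2 + sem_2 ** 2)
    return (score_2 - score_1) / se_difference


def exceeds_measurement_error(score_1, sem_1, score_2, sem_2, critical_z=1.96):
    """True when the change is larger than expected from measurement error."""
    return abs(change_z(score_1, sem_1, score_2, sem_2)) > critical_z


if __name__ == "__main__":
    # GSVs from Table 1 (Child 2); the SEMs are hypothetical placeholders.
    z = change_z(486, 9.0, 540, 12.0)
    print(f"z = {z:.2f}")
    print("exceeds measurement error:",
          exceeds_measurement_error(486, 9.0, 540, 12.0))
```

Because each GSV has its own known SEM, this kind of check can be applied to an individual child's change, not only to group means.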

Because GSV scores are relatively new to rare disease investigators conducting clinical trials, few studies have been done to demonstrate their potential and advantages. As can be seen in Table 1, comparing GSVs with raw scores in newborns and older children demonstrates that, unlike raw scores, GSV scores increase the assurance that differences accurately reflect ability change across the age spectrum. Fig. 5 demonstrates the advantage of GSVs over standard scores in showing change. The previously discussed lack of meaning of GSVs to parents and clinicians may also have implications in the clinical trial context, as these metrics could be less meaningful to sponsors, funders and regulatory authorities. Further, the lack of published studies using GSVs for rare disease may increase reluctance to accept them for trials.

There are opportunities to address these problems. First, because GSVs and AEs are both based on raw scores, both can be planned for natural history studies and clinical trials to ensure inclusion of accurate metrics of change as well as some more familiar reflections of functional status. Second, it is critical that the rare disease scientific and clinical communities unite to expand the published studies of functional change with GSVs. A comparison of GSVs with AEs can easily be done retrospectively for data already published, provided raw scores are available. We strongly encourage investigators to act on this challenge. The acceptability of the GSV metric in clinical trials depends on further work to demonstrate its superiority over raw and age-equivalent scores. Further development of GSVs for tests like the Wechsler Preschool and Primary Scale of Intelligence [53] and Kaufman Assessment Battery for Children [54] is needed, either for individual subtests or composite scores. The need to develop a metric for change in composite scores is important, given that larger and broader functional domains are the focus when assessing efficacy of brain-directed therapies. Simply put, treatment indications for narrow domains of function are less desirable for devastating neurodegenerative diseases that affect the whole brain and thus broad functional outcomes. The lack of GSVs for composite scores is a limitation that should be tackled through a collaboration between test developers and psychometric experts to develop a suitable method.

In conclusion, we believe GSVs hold promise to provide a precise method of quantifying change in natural history studies and clinical trials, and it is critical that this method be tested scientifically by the rare disease community. We therefore encourage investigators to include GSV metrics as part of all rare disease outcome studies to build a body of knowledge regarding this new method to evaluate change over time.

Acknowledgements

We thank Adam Scheller and Steve Maricich for their initial support of this idea. We thank Louis-Charles Vannier for his consultation on GSVs. We thank the National MPS Society, Terri Klein, and Jennifer Greenberg for their support of the webinar and this paper.

Funding sources

The Webinar launched from the Robert J. Gorlin symposium, and Open Access fees, were supported by the National MPS Society.

Work on CLN3 outcomes is supported by NIH/NINDS: 5U01NS101946-03

Abbreviations:

AE: Age Equivalent score

BSID: Bayley Scales of Infant and Toddler Development

CLN3: neuronal ceroid lipofuscinosis, type 3 (CLN3 Batten Disease)

DQ: Developmental Quotient

GSV: Growth Scale Value

IQ: Intelligence Quotient

IRT: Item Response Theory

MPS: Mucopolysaccharidosis

MSEL: Mullen Scales of Early Learning

SEM: Standard Error of Measurement

vSS: v-Scale score

Footnotes

1. Panelists included all authors as well as Patroula Smpokou, MD (FDA). Louis-Charles Vannier, MSc (Pearson), participated in the second panel discussion via Webinar.

References

  • [1] Shapiro E, Eisengart J, The Natural History of Neurocognition in MPS Disorders: A Review, Mol. Genet. Metab 133 (2021) 8–34, 10.1016/j.ymgme.2021.03.002.
  • [2] Eisengart JB, Esler AN, Ellinwood NM, Hudock RL, King KE, Klein TL, Lee C, Morton J, Stephens K, Ziegler R, O'Neill C, Issues of COVID-19-related distance learning for children with neuronopathic mucopolysaccharidoses, Mol. Genet. Metab 134 (2021) 68–76.
  • [3] Shapiro EG, Nestrasil I, Delaney KA, Rudser K, Kovac V, Nair N, Richard CW, Haslett P, Whitley CB, A prospective natural history study of mucopolysaccharidosis type IIIA, J. Pediatr 170 (2016) 278–287.e274.
  • [4] Leal AF, Espejo-Mojica AJ, Sánchez OF, Ramírez CM, Reyes LH, Cruz JC, Alméciga-Díaz CJ, Lysosomal storage diseases: current therapies and future alternatives, J. Mol. Med 98 (2020) 931–946.
  • [5] Berry SA, Coughlin CR, McCandless S, McCarter R, Seminara J, Yudkoff M, LeMons C, Developing interactions with industry in rare diseases: lessons learned and continuing challenges, Genet. Med 22 (2020) 219–226.
  • [6] Rajan DS, Escolar ML, Evolving therapies in neuronopathic LSDs: opportunities and challenges, Metab. Brain Dis (2022) 1–12.
  • [7] van der Lee JH, Morton J, Adams HR, Clarke L, Eisengart JB, Escolar ML, Giugliani R, Harmatz P, Hogan M, Kearney S, Therapy development for the mucopolysaccharidoses: Updated consensus recommendations for neuropsychological endpoints, Mol. Genet. Metab 131 (2020) 181–196.
  • [8] Shapiro E, Bernstein J, Adams HR, Barbier AJ, Buracchio T, Como P, Delaney KA, Eichler F, Goldsmith JC, Hogan M, Neurocognitive clinical outcome assessments for inborn errors of metabolism and other rare conditions, Mol. Genet. Metab 118 (2016) 65–69.
  • [9] Ghosh A, Shapiro E, Rust S, Delaney K, Parker S, Shaywitz AJ, Morte A, Bubb G, Cleary M, Bo T, Recommendations on clinical trial design for treatment of Mucopolysaccharidosis type III, Orphanet J. Rare Dis 12 (2017) 117.
  • [10] Farmer CA, Kaat AJ, Thurm A, Anselm I, Akshoomoff N, Bennett A, Berry L, Bruchey A, Barshop BA, Berry-Kravis E, Person ability scores as an alternative to norm-referenced scores as outcome measures in studies of neurodevelopmental disorders, Am. J. Intellect. Develop. Disab 125 (2020) 475–480.
  • [11] Shapiro EG, Williams P, Daniel MH, Kuca B, Eisengart JB, Adams H, Precision Metrics for Cognition, Robert J. Gorlin Symposium, WORLD Symposium, San Diego, CA, 2022.
  • [12] Bayley N, Bayley Scales of Infant Development: Manual, Psychological Corporation, 1993.
  • [13] Bayley N, Manual for the Bayley Scales of Infant Development, Psychological Corporation, 1969.
  • [14] Mullen E, Mullen Scales of Early Learning, American Guidance Service, Circle Pines, MN, 1995.
  • [15] Sparrow SS, Balla DA, Cicchetti DV, Harrison PL, Doll EA, Vineland Adaptive Behavior Scales, 1984.
  • [16] Sparrow SS, Cicchetti DV, Balla DA, Vineland Adaptive Behavior Scales: Second Edition (Vineland II), Survey Interview Form/Caregiver Rating Form, Pearson Assessments, Livonia, MN, 2005.
  • [17] Shapiro EG, Whitley CB, Eisengart JB, Beneath the floor: re-analysis of neurodevelopmental outcomes in untreated Hurler syndrome, Orphanet J. Rare Dis 13 (2018) 76.
  • [18] Delaney KA, Rudser KR, Yund BD, Whitley CB, Haslett PA, Shapiro EG, Methods of neurodevelopmental assessment in children with neurodegenerative disease: Sanfilippo syndrome, JIMD Reports-Case and Research Reports, Vol. 13, Springer, 2013, 129–137.
  • [19] Martin HR, Poe MD, Reinhartsen D, Pretzel RE, Roush J, Rosenberg A, Dusing SC, Escolar ML, Methods for assessing neurodevelopment in lysosomal storage diseases and related disorders: a multidisciplinary perspective, Acta Paediatr 97 (2008) 69–75.
  • [20] Peters C, Shapiro E, Anderson J, Henslee-Downey P, Klemperer M, Cowan M, Saunders E, deAlarcon P, Twist C, Nachman J, Hurler syndrome: II. Outcome of HLA-genotypically identical sibling and HLA-haploidentical related donor bone marrow transplantation in fifty-four children, Blood 91 (1998) 2601.
  • [21] Bjoraker KJ, Delaney K, Peters C, Krivit W, Shapiro EG, Long-term outcomes of adaptive functions for children with M…, J. Dev. Behav. Pediatr 27 (2006) 290–296.
  • [22] Peters C, Balthazor M, Shapiro E, King R, Kollman C, Hegland J, H.-.Downey J, Trigg M, Cowan M, Sanders J, Bunin N, Weinstein H, Lenarsky C, Falk P, Harris R, Bowen T, Williams T, Grayson G, Warkentin P, Sender L, Cool V, Crittenden M, Packman S, Kaplan P, Lockman L, Outcome of unrelated donor bone marrow transplantation in 40 children with Hurler syndrome, Blood 87 (1996) 4894–4902.
  • [23] Wraith J, Beck M, Lane R, van der Ploeg A, Shapiro E, Xue Y, Kakkis E, Guffon N, Enzyme replacement therapy in patients who have mucopolysaccharidosis I and are younger than 5 years: results of a multinational study of recombinant human α-L-iduronidase (laronidase), Pediatrics 120 (2007) e37.
  • [24] Holt JB, Poe MD, Escolar ML, Natural progression of neurological disease in mucopolysaccharidosis type II, Pediatrics 127 (2011) e1258–e1265.
  • [25] Kiely BT, Kohler JL, Coletti HY, Poe MD, Escolar ML, Early disease progression of Hurler syndrome, Orphanet J. Rare Dis 12 (2017) 32.
  • [26] Poe MD, Chagnon SL, Escolar ML, Early treatment is associated with improved cognition in Hurler syndrome, Ann. Neurol 76 (2014) 747–753.
  • [27] Cross EM, Grant S, Jones S, Bigger BW, Wraith JE, Mahon LV, Lomax M, Hare DJ, An investigation of the middle and late behavioural phenotypes of Mucopolysaccharidosis type-III, J. Neurodev. Disord 6 (2014) 46.
  • [28] Shapiro E, Ahmed A, Whitley C, Delaney K, Observing the advanced disease course in mucopolysaccharidosis, type IIIA, a case series, Mol. Genet. Metab 123 (2018) 123–126.
  • [29] Wright MD, Poe MD, DeRenzo A, Haldal S, Escolar ML, Developmental outcomes of cord blood transplantation for Krabbe disease: a 15-year study, Neurology 89 (2017) 1365–1372.
  • [30] Wijburg FA, Aiach K, Chakrapani A, Eisengart JB, Giugliani R, Héron B, Muschol N, O'Neill C, Olivier S, Parker S, An observational, prospective, multicenter, natural history study of patients with mucopolysaccharidosis type IIIA, Mol. Genet. Metab 135 (2022) 133–142.
  • [31] Martin HR, Poe MD, Provenzale JM, Kurtzberg J, Mendizabal A, Escolar ML, Neurodevelopmental outcomes of umbilical cord blood transplantation in metachromatic leukodystrophy, Biol. Blood Marrow Transplant 19 (2013) 616–624.
  • [32] Terman LM, Merrill MA, Measuring intelligence: A guide to the administration of the new revised Stanford-Binet tests of intelligence, 1937.
  • [33] Ghosh A, Rust S, Langford-Smith K, Weisberg D, Canal M, Breen C, Hepburn M, Tylee K, Vaz FM, Vail A, High dose genistein in Sanfilippo syndrome: a randomised controlled trial, J. Inherit. Metab. Dis 44 (2021) 1248–1262.
  • [34] Klein K, Krivit W, Whitley CB, Peters C, Cool V, Fuhrman M, De Alarcon P, Klemperer M, Miller L, Nelson R, Poor cognitive outcome of eleven children with Sanfilippo syndrome after bone marrow transplantation and successful engraftment, Bone Marrow Transplant 15 (1995) S176–S181.
  • [35] Gardner MK, Clark E, The psychometric perspective on intellectual development in childhood and adolescence, Intellectual development, 1992, 16–43.
  • [36] Seo J-H, Okuyama T, Shapiro E, Fukuhara Y, Kosuga M, Natural history of cognitive development in neuronopathic mucopolysaccharidosis type II (Hunter syndrome): Contribution of genotype to cognitive developmental course, Mol. Genet. Metabol. Rep 24 (2020) 100630.
  • [37] van der Lee JH, Morton J, Adams HR, Clarke L, Ebbink BJ, Escolar ML, Giugliani R, Harmatz P, Hogan M, Jones S, Cognitive endpoints for therapy development for neuronopathic mucopolysaccharidoses: Results of a consensus procedure, Mol. Genet. Metab 121 (2017) 70–79.
  • [38] Daniel M, Vannier L-C, Growth Scale Value: Theory, Development and Applications, Growth Scale Value Technical Report No. 1, Pearson, 2022. (In Press).
  • [39] Woodcock RW, Woodcock-Johnson Psycho-Educational Battery, Technical Report, 1977.
  • [40] Elliott C, Murray D, Pearson L, British Ability Scales, National Foundation for Educational Research, Windsor, UK, 1978.
  • [41] Roid GH, Barram RA, Essentials of Stanford-Binet Intelligence Scales (SB5) Assessment, John Wiley & Sons, 2004.
  • [42] Bond TG, Yan Z, Heene M, Applying the Rasch Model: Fundamental Measurement in the Human Sciences, Routledge, 2021.
  • [43] Raju NS, Price LR, Oshima T, Nering ML, Standardized conditional SEM: a case for conditional reliability, Appl. Psychol. Meas 31 (2007) 169–180.
  • [44] Grosse SD, Lam WK, Wiggins LD, Kemper AR, Cognitive outcomes and age of detection of severe mucopolysaccharidosis type 1, Genet. Med 19 (2017) 975–982.
  • [45] Kemper AR, Lam K, Letostak T, Grosse SD, Ojodu J, Prosser LA, Ream M, Bocchini JA Jr., Botkin JR, Comeau AM, Evidence-Based Review of Newborn Screening for Mucopolysaccharidosis Type II: Final Report, 2022.
  • [46] Sparrow S, Cicchetti DV, Saulnier CA, Vineland Adaptive Behavior Scale (Vineland-3), Psychological Corporation, San Antonio, 2016.
  • [47] Masten MC, Williams JD, Vermilion J, Adams HR, Vierhile A, Collins A, Marshall FJ, Augustine EF, Mink JW, The CLN3 disease staging system: a new tool for clinical research in Batten disease, Neurology 94 (2020) e2436–e2440.
  • [48] Adams HR, Mink JW, U.o.R.B.C.S. Group, Neurobehavioral features and natural history of juvenile neuronal ceroid lipofuscinosis (Batten disease), J. Child Neurol 28 (2013) 1128–1136.
  • [49] Mole S, Williams R, Goebel H, The Neuronal Ceroid Lipofuscinoses (Batten Disease), Oxford University Press, 2011.
  • [50] Dang Do AN, Thurm AE, Farmer CA, Soldatos AG, Chlebowski CE, O'Reilly JK, Porter FD, Use of the Vineland-3, a measure of adaptive functioning, in CLN3, Am. J. Med. Genet. A 188 (2022) 1056–1064.
  • [51] Norman GR, Sloan JA, Wyrwich KW, Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation, Med. Care (2003) 582–592.
  • [52] Phillips R, Qi G, Collinson SL, Ling A, Feng L, Cheung YB, Ng T-P, The minimum clinically important difference in the repeatable battery for the assessment of neuropsychological status, Clin. Neuropsychol 29 (2015) 905–923.
  • [53] Wechsler D, Wechsler Preschool and Primary Scale of Intelligence—Fourth Edition, Psychological Corporation, San Antonio, TX, 2012.
  • [54] Kaufman AS, KABC-II: Kaufman Assessment Battery for Children, AGS Pub, 2004.
