Abstract
Objective:
We propose several different patient-reported outcomes (PROs) from momentary, real-time collection of symptom data. In addition to the mean of momentary reports of symptoms, other types of summaries can reflect different aspects of the symptom experience.
Methods:
With secondary analyses of two studies of patients with chronic pain assessed with real-time methods, we demonstrate principles for developing outcomes that summarize symptom experience during a 1-week period. These studies focused on pain intensity, which is used to demonstrate methods for creating summary momentary measures.
Results:
Analyses from the first study (Pain 2008;139:146–57) yielded outcome measures based on the mean, median, 90th percentile, maximum, standard deviation, proportion of reports with no pain, proportion of reports with pain more than 50 (on a 0- to 100-point scale), and time-contingent measures. The second study examined the performance of these measures (and the mean) in a longitudinal study, in which some patients changed treatment (n = 78), making pain reduction likely, whereas others had no treatment change (n = 27). The measure that best discriminated the groups was the proportion of momentary reports without pain (effect size = 0.50), closely followed by the mean of all reports (effect size = 0.45). Most measures also correlated with patients’ global impression of their change (between 0.39 and 0.55, except for standard deviation [0.13]).
Conclusions:
These analyses suggest that momentary symptom data can be useful for developing new PROs that reflect symptom experience other than the mean. They highlight knowledge gleaned from real-time studies, which deepens our understanding of symptoms by demonstrating which changes in symptoms are associated with overall perceived change.
Keywords: PROs, ecological momentary assessment, outcome measures
INTRODUCTION
For decades, medical researchers have been assessing patients’ self-reports of their symptoms, behaviors, and well-being. When they serve as outcome variables in clinical trials, these reports are referred to as patient-reported outcomes (PROs) (1). Patient-reported outcomes are essential for evaluating new treatments and for tracking the course of disease over time, in part because the only practical way to assess symptoms and states that cannot be measured objectively is through patients’ own reports. Although objective, physiologically based measures such as blood pressure, glucose level, sedimentation rates, and white blood cell counts provide information about the biologic state of disease, they do not capture patients’ experiences of disease: what symptoms are felt and how symptoms affect daily functioning.
In this article, we consider several different possibilities for describing the intensity or severity of symptoms experienced by patients over a given reporting period using data collected with momentary assessments of PROs. Our goal was to synthesize observations from recent studies that used a fine-grained approach to symptom measurement with momentary reports of symptoms, behaviors, and well-being (2-4). Such approaches have become more common in recent years, as investigations have shed light on memory error and distortion that may accompany recall of symptoms and patients’ summaries of symptoms for relatively longer periods (e.g., see the US Food and Drug Administration’s recent guidance on recommendations for PRO measurement (5)). Relative to other approaches to collecting data, such as recall of symptoms for longer periods, momentary data provide a much higher resolution, revealing variability and patterns of symptoms within a reporting period.
The approaches described here come from our experience working with momentary and daily symptom, behavior, and affect reports during the last 3 decades and from the discoveries of others in the momentary data collection field (e.g., see references (3,4,6-10)). We have found that the perspective of patient experience afforded by momentary methods suggests many ways to characterize that experience and to create new measures—most of which are rarely used in clinical trials. Yet, we also have come to a more general realization about outcome assessment: it is not always obvious what we actually want to or need to measure in our PROs (11). That is, the aspects of symptom experience that are most relevant to the purpose of a particular study are not always measured, and there can be important nuances in how outcomes are constructed.
The starting point for the discussion is the observation that symptoms often possess considerable moment-to-moment and day-to-day variability (12), probably more than most scientists and clinicians recognize (for ease of presentation, we use the word symptoms to refer to symptoms, behaviors, and well-being constructs). If there were no variability in symptoms over a reporting period, then there would be no issue about how to create a summary measure: a single symptom report would work well and would yield exactly the same information as taking the mean over many observations. However, in the face of symptom variability, a single point is not likely to be a good indicator of symptom level (13), and many different ways of creating summary measures of symptom experience for a period are available. Using the mean of all moments assessed in a reporting period is the most obvious strategy and, indeed, is the most common approach for summarizing momentary data. However, other measure creation strategies, which are discussed at length below, also come to mind. For example, a measure of symptom fluctuation could be of interest (e.g., the standard deviation [SD] of the momentary reports) or perhaps the proportion of assessments that exceed some prespecified threshold of symptom intensity (e.g., values >50 on a 0- to 100-point scale). Or perhaps a useful summary measure would be the proportion of assessments at which the patient indicated not experiencing the symptom at all.
The reader may ask, “Why are summary measures other than the average necessary or desirable?” Currently, the argument for alternative summary measures is not persuasive from an empirical perspective because alternative measures have rarely been investigated. From a conceptual point of view, however, there is reason to think that the symptom experience may not be fully captured by the mean. Moreover, we have little information about what patients consider the most important features of their symptoms or what features they would most like to change. Perhaps, symptom intensities over a level of 50 (on a 100-point scale) have the greatest impact on individuals’ ability to perform their activities of daily living. Or perhaps, what patients most desire from treatment is to increase the amount of time they have no or very few symptoms. Reducing the roller coaster of symptom level variability could be another important goal for patients. The mean, as a summary measure, is not likely to be an effective method for assessing these experiences.
In this article, we propose several PRO summary measures created from momentary reports of pain reported during a week. We focus on pain as an example of a common symptom, but the approach could be applied to other symptoms. We are aware that pain intensity in patients with chronic pain is likely to have several important qualitative differences compared with other symptoms in patients with other diseases (e.g., its duration, intensity, location); thus, the particular results shown here are only illustrative. We chose 1 week for the reporting period because it is commonly used in retrospective assessments of PROs, but the methods described here are applicable to shorter or longer reporting periods.
Our analyses use data from two studies of patients with rheumatic disease, which collected momentary pain intensity data using an electronic diary protocol. Using data from Study 1, we developed alternative PRO measures. This study had patients report their pain several times a day for 28 consecutive days. Using data from Study 2, we compare the proposed measures to determine whether they differ in their sensitivity to change and to patients’ impression of their change. This second study also used an electronic diary procedure, but its design was 7 days of momentary data collection followed 3 months later by another 7 days of data collection. In most patients in Study 2, a change in their treatment took place after the first week of assessment, and we expected that pain levels would show a decline at the second assessment 3 months later. About a quarter of the patients (27 of 105) did not change their treatment, and we consider them a quasi “control” group for the study.
Our goal was to show the possibilities for obtaining a richer understanding of symptoms by expanding the types of measures used. We believe that some of the measures will be of interest to health researchers and clinicians and that their usage could contribute to a more complete understanding of treatment effects.
STUDY 1
As stated earlier, without considerable fluctuation in momentary reports within patients over time, there would be little point to developing novel measures. Thus, we first examine the extent of symptom fluctuations in the patients from Study 1. We focus on an interval of 1 week and examine the variability of momentary assessments during this interval.
Participants and Procedures
Study 1 monitored the pain and fatigue of 106 individuals with rheumatic disease (rheumatoid arthritis, osteoarthritis, lupus, or fibromyalgia) during a 28-day period. Each day, participants were signaled at six to eight randomly selected times and completed assessments of their immediate symptoms on an electronic diary. Pain intensity was assessed with two questions: “Are you in pain right now?” If yes, “How much pain are you in right now?” which was rated on a 0- to 100-point visual analog scale. A score of 0 was assigned to the pain variable if the first question was answered “no.” Details of the methodology are described in a previous article (14). Space restrictions do not permit a full description of the study design or results. Compliance with the protocol, an important factor for interpreting the results presented here, was excellent: for Study 1, compliance was more than 94%.
Results
Variability in random reports of pain during a week
Without within-subject variability in momentary reports, the indices described in this article would be uninteresting; for example, the mean would equal the median, and the SD would be 0 for all participants. For this reason, it is helpful to review visual representations of variability over time. Although statistics characterizing variability within persons and across persons are reported later, in Figure 1, we also present histograms of momentary pain reports for a random subset of nine participants to explicitly show the variability for some participants. The striking point about these histograms is that, although pain reports seem to be quite variable within most participants, the patients show distinctively different patterns in their distribution of pain intensity scores. Participant 103, for instance, displayed a fairly normal distribution of ratings, with an M (SD) of 51 (20.2), which means that ratings were spread over almost the entire range of the pain scale. Participants 132 and 178 had the lowest SD of these nine people (both <10), and the histogram confirms relatively little variation in their scores. Across all participants in the study, the range in SD was from 3.2 to 34.1 with a mean of 15.3. The distributions of responses also differ in shape between respondents. For some participants (e.g., 133, 162), the distribution of responses showed a notable positive skew with many ratings at the lower end of the scale, whereas for others (e.g., 187), we find a negative skew with more ratings at the higher end of the scale. These individual differences in within-subject variability are the basis for creating the measures described in the next section; for example, within-subject variability may not influence the mean very much but is directly measured by the SD.
Figure 1.

Histograms of nine randomly chosen participants showing the distribution of their random assessments of pain intensity during the first week of participation, Study 1.
Defining summary measures
Mean. The mean for the week is a common method for assessing pain for a 7-day period. When a patient reports a “moderate” amount of pain last week, we generally interpret that as meaning the mean of all of their pain experiences for the week (although this point is rarely explicitly discussed in the literature). Thus, momentary methods usually average the ratings across the entire week. Although the mean is easily understood, there are questions about the way we may think about the mean. Do we imagine a mean based only on times when pain is present? Or do we imagine a mean taken over all moments, including both times with no pain and times with pain? These could yield very different results (Table 1).
Median. This nonparametric measure of central tendency is less sensitive to outlying scores than the mean; as such, it may be reasonable for characterizing “typical” pain levels, especially if a researcher is concerned that the mean is too influenced by extreme scores (e.g., pain flares). When the distribution of pain is approximately normal or otherwise symmetric, the mean and median will yield very similar scores, as observed in Study 1 (Table 1).
90th Percentile. This measure provides a good indication of the upper levels of pain experienced, indicating the point where 10% of pain scores were above the statistic. In a clinical trial, investigators might be most interested in monitoring the reduction in this statistic because this is highly sensitive to the patient’s worst levels/episodes of pain, which may, in turn, be most closely associated with pain-related suffering and decrements in functioning.
Maximum. This is an easily interpreted measure, but it can have significant drawbacks. It is defined by the single highest rating, also known as “worst,” so it is not likely to be very reliable (unlike the mean, for example, which is based on many ratings and reduces the impact of random error). Ceiling effects are also likely with this measure, and that may reduce its sensitivity to change. Furthermore, the maximum is likely to be influenced by the number of random momentary assessments taken per day: the more assessments, the more likely it is to capture extreme scores. So, using the maximum could introduce error into trials that use different sampling frequencies for different assessments. Nevertheless, the maximum may be very clinically meaningful. Take Participant 162 whose mean pain was 26 and median was 11, yet this patient experienced some episodes of extreme pain (maximum = 96). From an intervention perspective, reducing episodes of extreme pain may be more important than reducing the usual levels of pain.
Standard deviation. This is a measure of how much the symptoms fluctuate over time within a person; it has intuitive appeal and a straightforward interpretation in the context of normally distributed pain scores. However, our data suggest that the distributions of these data are not normal. Alternative indices of fluctuation are suggested in the literature, including the root mean square of successive differences and the probability of acute change. Whereas the SD measures the overall dispersion of scores around a person’s mean score, regardless of the temporal pattern of scores, the root mean square of successive differences and probability of acute change take the temporal pattern into account more specifically in that they express how much a person’s ratings change (i.e., fluctuate) from any given time point to the next. These measures may be preferable if the research question focuses specifically on how symptoms exacerbate or abate over time. It is plausible that treatments could affect the variability of pain, and that could occur either with or without a change in central tendency (mean/median). The amount of symptom variability in individuals has been found to be of diagnostic value. For example, fibromyalgia patients have been found to experience greater variability in daily fatigue than patients with other rheumatic diseases (15), and patients with borderline personality disorder and suicide ideation have been shown to experience greater variability in daily mood (16,17). There may also be important individual patient differences in symptom variability (18).
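The contrast between the SD and the root mean square of successive differences can be illustrated with a minimal sketch. The two series below are hypothetical (not study data) and contain the same ratings in a different order; the function name `rmssd` and the use of the sample SD (n − 1 denominator) are assumptions made for illustration.

```python
import math
import statistics


def rmssd(ratings):
    """Root mean square of successive differences for time-ordered ratings."""
    diffs = [later - earlier for earlier, later in zip(ratings, ratings[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical series: identical ratings, different temporal order.
alternating = [20, 60, 20, 60, 20, 60]  # pain swings at every assessment
drifting = [20, 20, 20, 60, 60, 60]     # a single shift midweek

# The SD ignores order, so it is the same for both series.
sd_alternating = statistics.stdev(alternating)
sd_drifting = statistics.stdev(drifting)

# The RMSSD reflects moment-to-moment change, so it differs.
rmssd_alternating = rmssd(alternating)
rmssd_drifting = rmssd(drifting)
```

In this example the two series have the same SD, but the alternating series has a much larger RMSSD, which is the sense in which the RMSSD takes the temporal pattern into account.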
Proportion of ratings of 0 (or below some minimal threshold). This measure is defined as the proportion of ratings where no pain was reported. It may be a compelling measure for clinical research, although it may also have limitations. A significant increase in the measure may be a very meaningful outcome from the patient’s perspective. We note that using a criterion of zero pain is arbitrary, although it has obvious intuitive value, and variants of this measure could be used based on other thresholds; for example, the proportion of scores lower than 10 or lower than 20 on a 100-point symptom scale.
Proportion of scores 50 or higher. This is the proportion of ratings when a participant’s symptom score reached or exceeded a threshold value of 50. It may be useful if the goal is to examine how often participants had symptoms at a level judged as “moderate-severe” by the researcher or clinician. The definition of “moderate-severe” symptom levels may vary according to the sample studied and the goals of a trial, such that values other than 50 could be chosen as appropriate.
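As a minimal sketch, the location, variability, and proportion measures described so far could be computed from one participant's ratings for a week as follows. The function name `summarize_week`, the example ratings, the interpolated percentile method, and the sample SD are all assumptions for illustration; the studies' own computational choices are not specified here.

```python
import statistics


def summarize_week(ratings, threshold=50):
    """Summarize one participant's momentary pain ratings (0-100) for a week."""
    n = len(ratings)
    return {
        "mean": statistics.mean(ratings),
        "median": statistics.median(ratings),
        # Ninth of the nine decile cut points, i.e., the 90th percentile
        # (linear interpolation; other percentile definitions exist).
        "p90": statistics.quantiles(ratings, n=10, method="inclusive")[8],
        "maximum": max(ratings),
        "sd": statistics.stdev(ratings),  # sample SD (n - 1): an assumption
        "prop_zero": sum(r == 0 for r in ratings) / n,
        "prop_high": sum(r >= threshold for r in ratings) / n,
    }


# Hypothetical ratings, not study data: mostly low pain with a few flares.
week = [0, 0, 5, 10, 10, 15, 20, 60, 90, 90]
summary = summarize_week(week)
```

For these hypothetical ratings the mean (30) is well above the median (12.5) because of the three high ratings, while the 90th percentile and maximum (both 90) describe the flares directly.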
Activity or time contingent. This family of measures takes advantage of other data collected during a momentary assessment that may be used to classify pain throughout the day. Activity-contingent measures are defined by computing the mean level of pain when a predefined condition is either present or absent. For example, one could examine pain in the presence of other people, when participants are alone, when participants were physically active, or when they were sedentary. One drawback of the activity measure is that it can only be computed for those patients who are engaging in a sufficient amount of the particular activity (e.g., physically active or sedentary). In many circumstances, an investigator might want to require that the mean be based on a minimum number of observations in that state/activity. A second type of measure is time contingent, which is defined by a period or time of day (e.g., morning, afternoon, or evening). The simple version of these measures is based on moments meeting a single criterion (e.g., happening with a certain activity). However, for both activity- and time-contingent measures, it is also possible to compute a more complex measure that is based on the mean pain for contrasting states (e.g., morning versus evening) and to subtract the second from the first yielding a single difference score. These measures may be particularly useful when an investigator hypothesizes a moderating effect of an independent variable such as treatment group on activity- or time-linked events. An example is presented in Study 2, which examines pain during the morning hours versus the evening hours.
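A contingent difference score with a minimum-observation requirement might be sketched as below. The state labels, the `min_obs` default, and the function name `contingent_difference` are hypothetical; how each report is classified as morning or evening is assumed to have been decided beforehand.

```python
from statistics import mean


def contingent_difference(reports, first="morning", second="evening", min_obs=3):
    """Mean pain in the `first` state minus mean pain in the `second` state.

    `reports` is a list of (state, rating) pairs. Returns None when either
    state has fewer than `min_obs` reports.
    """
    first_ratings = [rating for state, rating in reports if state == first]
    second_ratings = [rating for state, rating in reports if state == second]
    if len(first_ratings) < min_obs or len(second_ratings) < min_obs:
        return None
    return mean(first_ratings) - mean(second_ratings)


# Hypothetical reports, not study data.
reports = [
    ("morning", 30), ("morning", 40), ("morning", 35),
    ("afternoon", 45),
    ("evening", 50), ("evening", 55), ("evening", 45),
]
diff = contingent_difference(reports)              # morning minus evening
too_few = contingent_difference(reports, min_obs=4)  # not enough reports per state
```

A negative value indicates higher pain in the evening than in the morning. Returning `None` rather than a mean based on very few reports reflects the minimum-observation requirement mentioned above.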
TABLE 1.
Description of Summary Measures
| Name | Description | Strengths and Weaknesses |
|---|---|---|
| Location-based level | | |
| Mean | The mean of all momentary assessments | Easily understood; does not reflect the variability of outcomes; susceptible to influence by extreme scores |
| Median | The median of all momentary assessments | A nonparametric measure of central tendency; less susceptible to extremes |
| 90th percentile | Threshold in distribution above which 10% of the individual’s scores occur | A nonparametric measure of high levels of the outcome; less susceptible to extreme outliers |
| Maximum | Highest rating of momentary assessments | A measure of highest level of outcome; may have problem with ceiling effect |
| Distribution-based variability | | |
| Standard deviation | Standard deviation of all momentary assessments | An easily understood measure of variability; is relatively sensitive to outliers |
| Proportion-based | | |
| Proportion rated 0 | Proportion of momentary assessments with a score of 0 | Easily understood; not influenced by outliers |
| Proportion >50 | Proportion of scores above a selected threshold (here 50 on a 101-point scale) | Useful for understanding how many of the assessments are at or above a specified level; not influenced by outliers |
| Contingent-based level | | |
| Morning versus evening | Mean of outcome during preselected intervals during the day or contingent on status on another variable; difference scores contrasting two contingent means can also be computed | Allows testing of specific hypotheses concerning time-of-day or levels of outcome contingent on the presence of other events or conditions |
Table 2 shows statistics for the summary measures described above for each of the nine randomly selected participants and for all 106 participants. Histograms of the distributions of these various summary scores for the 106 participants are shown in Figure 2. Mean, median, 90th percentile, SD, and morning-versus-evening pain (the difference between the mean in the morning and the mean in the evening) have relatively normal distributions. As expected, maximum pain is negatively skewed, whereas the proportion of pain at 0 has most participants at 0 with a positive skew. Proportion of pain at 50 or higher has yet another distribution pattern: apart from many participants at 0, the remainder of the distribution is fairly flat. Correlation coefficients among the summary measures are presented in Table 3. Some of the summary indices were quite highly intercorrelated in this sample, whereas others showed little or no association with each other.
TABLE 2.
Momentary Measures for Selected Respondents and for the Full Sample: Pain for Week 1 in Study 1
| Participant No. | Mean | Median | 90th Percentile | Maximum | SD | Proportion Rated 0 (%) | Proportion ≥50 (%) | Morning Versus Evening (Difference Score) |
|---|---|---|---|---|---|---|---|---|
| 103 | 51.2 | 54 | 76 | 100 | 20.2 | 0 | 53 | 5.3 |
| 116 | 37.4 | 34.5 | 54 | 67 | 10.8 | 0 | 13 | −2.3 |
| 132 | 13.5 | 13 | 25 | 41 | 9.2 | 7 | 0 | −6.6 |
| 133 | 12.8 | 8.5 | 35 | 66 | 17.0 | 15 | 8 | 4.2 |
| 135 | 29.5 | 28 | 48 | 82 | 15.3 | 0 | 7 | −6.8 |
| 162 | 26.1 | 11 | 70 | 96 | 28.9 | 23 | 28 | −5.3 |
| 178 | 43.5 | 45 | 54 | 64 | 9.9 | 0 | 24 | −10.2 |
| 187 | 52.4 | 54 | 71 | 76 | 15.1 | 0 | 53 | −8.7 |
| 191 | 35.1 | 32 | 53 | 93 | 19.4 | 0 | 14 | 35.6 |
| Full sample (n = 106), M (SD) | 43.8 (18.2) | 43.0 (20.1) | 64.0 (19.5) | 77.0 (17.5) | 15.3 (5.5) | 2 (7) | 43 (32) | −4.5 (12.0) |
SD = standard deviation; M = mean.
Figure 2.

Histograms for n = 106 patients in Study 1: summary measures of pain.
TABLE 3.
Week 1 Correlations Among Summary Measures in Study 1
| | Mean | Median | 90th Percentile | Maximum | SD | Proportion Rated 0 | Proportion ≥50 |
|---|---|---|---|---|---|---|---|
| Median | 0.99 | | | | | | |
| 90th percentile | 0.88 | 0.83 | | | | | |
| Maximum | 0.70 | 0.63 | 0.86 | | | | |
| SD | −0.05 | −0.10 | 0.40 | 0.51 | | | |
| Proportion rated 0 | −0.42 | −0.46 | −0.46 | −0.08 | 0.14 | | |
| Proportion ≥50 | 0.95 | 0.93 | 0.82 | 0.67 | −0.06 | −0.29 | |
| Morning versus evening | −0.06 | −0.07 | −0.09 | −0.04 | −0.02 | 0.11 | −0.06 |
SD = standard deviation.
n = 106.
Discussion
The objective of the data analysis for Study 1 was to provide preliminary statistics for several measures characterizing pain assessed over the course of a week. We discussed how the measures could provide novel information about the pain experience and suggested how each of these measures might be relevant for understanding the efficacy of treatments. Whereas the mean level of pain for a week is usually the PRO in clinical trials using momentary data, researchers and clinicians could be interested in other facets of pain. To date, these facets have rarely been examined, so little evidence of their usefulness or empirical data on their psychometric properties is available.
In Study 1, momentary pain reports by patients with a chronic pain disorder were highly variable over the course of a week, and there were individual differences in variability. It is notable that even the most consistent (low variability) individuals had variability in their momentary reports. We recognize the obvious limitation to this particular empirical presentation; other pain disorders are likely to yield different statistical patterns. For instance, conditions associated with episodic pain will certainly yield different pain levels and variability during a 1-week period. Yet, it seems unlikely that symptoms in patients with other conditions are much less variable than those in people with rheumatic disorders, although there is not much research to support this contention, and hopefully future momentary investigations will address this assumption.
The considerable variability in momentary pain reports opened the possibility of creating a number of measures that characterize several aspects of the week’s experience with pain. Distributions of these additional measures highlighted potential limitations to their usage; nevertheless, some of the measures demonstrated moderately normal distributions. Some of these measures were highly correlated with one another (the mean, median, 90th percentile, and proportion ≥50), yet others were only modestly correlated (such as the SD with most other indices). These associations show that the measures are not all measuring the same construct and could potentially respond differently to treatments in a clinical trial.
What is most important is the idea that at least some of the measures contain an alternative and interesting perspective on the experience of pain and that there may be utility in using one or more of these alternative indices to more fully capture the characteristics and pattern of symptom levels in observational and treatment studies.
STUDY 2
In Study 2, we examine these summary measures in a secondary analysis of a study that was designed to emulate a (nonrandomized) clinical trial. The design was based on two naturally occurring groups of patients: one with expected change in symptoms over time and one without expected change in symptoms. These groups were identified while recruiting patients from a large, community rheumatology practice, which had some patients who were prescribed a change in treatment to reduce their symptoms. Other patients maintained their current treatment regimen. We call the first group “Tx Change” and the second group “Comparison.” All of the individuals in the Tx Change group agreed to record their pain for a 1-week period before changing their treatment, which allowed us to have a baseline level of pain; both groups began recording baseline data within a few days after their physician visit. About 3 months later, all patients recorded their pain for a second week. We therefore had both baseline and follow-up assessments and could examine 3-month changes in pain. Unlike most clinical trials, we note that patients were neither blind to their condition nor were they randomized into conditions. Not surprisingly, it is also the case that the Tx Change group had higher baseline levels of pain (as measured by momentary assessments) than the Comparison group, which would not be the case in a clinical trial. A description of this study is presented in (19). Compliance was greater than 87% (mean during both weeks) for the analysis sample.
Participants
Recruitment for this study was conducted at a large community rheumatology practice. Two types of patients were recruited by the study rheumatologist: those who were about to start a change in their treatment to improve control of their pain and those not needing a change. The reason for identifying patients about to change their treatment was to ensure sufficient variability in change in pain to conduct the longitudinal analyses. If we had exclusively selected patients who were not modifying their treatments, then the only meaningful change in pain during the 3-month study period that we would expect to observe would be due to the natural course of the illness, and much of the observed variability in pain might simply reflect random fluctuations.
Treatment regimen change was defined as starting a new treatment, adding a new treatment to the current regimen, switching to a different treatment, or increasing the current treatment dose. Those patients changing treatment had to be willing to postpone the new treatment for at least a week to collect momentary and recall data before the change. Patients were informed by their rheumatologist that their decision to participate in the study would have no effect on the treatment they received, and patients completed informed consent with the research staff. The protocol was reviewed and approved by the Stony Brook University Institutional Review Board.
Patients were approached about participation by their rheumatologist; those who were interested in learning more about the study met with a research staff member, who briefly described the study and answered questions. Three hundred thirty-nine patients agreed to participate in a telephone screen, 294 were successfully contacted and completed the interview, and 220 were eligible. Inclusion criteria included having a chronic pain disorder diagnosis (fibromyalgia, rheumatoid arthritis, and/or osteoarthritis), feeling pain for more than 6 months, pain for 3 days or longer per week, pain 3 hours or more per day, and mean level of pain greater than 3 (on a 0- to 10-point rating scale, with 0 = no pain and 10 = excruciating pain). Other eligibility criteria were being between the ages of 18 and 80 years, having no sight or hearing problems, being fluent in English, having no difficulty holding a pen or writing, waking up by 10:00 AM and going to bed no earlier than 7:00 PM, having no serious psychiatric impairment, no alcohol and/or drug problem, not planning any major surgery while participating in the study, and not having participated in an electronic diary study within the last 5 years. Of the 220 candidates who were eligible, 173 (79%) agreed to participate and 129 (59%) came to the research office to begin the protocol. Ultimately, 105 patients provided data at both weeks, and they comprise the sample for both the cross-sectional and the longitudinal analyses. Of these, 78 patients entered the study planning to change their treatment at the end of the first week of data collection, and 27 patients entered the study planning no change in their treatment.
Materials and Procedures
We examined how the momentary summary measures, described above in Study 1, performed in this longitudinal study. Of particular interest is the ability to examine the properties of change scores based on the summary measures because change is central in the evaluation of interventions. We present the estimated effect size for between-group differences in change for each of the summary measures. These were considered exploratory analyses because we did not have a basis to articulate hypotheses about the performance of the summary measures. For example, we did not speculate that one of the measures would yield a greater treatment effect size than the others. In an actual treatment trial, such hypotheses would be informed by expectations about the nature of the intervention studied, that is, what sort of change in pain was expected from treatment.
Another opportunity to gain information about each summary measure’s performance presented itself in the analyses of a Global Impression of Change (GIC) question that was also included at the 3-month assessment. The GIC measures are intended to provide patients’ perspectives about the direction (improvement, no change, or worsening) and magnitude of change in symptoms from baseline to follow-up. Although there has been considerable criticism of this measure that focuses on the ability of patients to accurately recall the baseline symptoms (20,21), the measure, nevertheless, is an integral part of the anchor-based approach to determining minimally important differences (21). For the purpose of this article, we correlated summary measure change scores with a GIC measure for change in pain obtained at the 3-month assessment. The wording of the GIC question was “About 3 months ago, when you first participated in our study (at that time your physician prescribed a change in your treatment), you rated your pain for a week. Compared to then, how is your pain now?” with response options very much worse, much worse, minimally worse, unchanged, minimally better, much better, and very much better.
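To make the GIC analysis concrete, the following minimal sketch codes the seven response options numerically and correlates them with change scores. The −3 to +3 coding, the use of a Pearson coefficient, and the example values are assumptions for illustration; the coding and type of correlation used in the study are not stated here.

```python
import math

# Assumed numeric coding of the seven GIC response options.
GIC_CODES = {
    "very much worse": -3, "much worse": -2, "minimally worse": -1,
    "unchanged": 0,
    "minimally better": 1, "much better": 2, "very much better": 3,
}


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical patients, not study data. Change = baseline minus follow-up,
# so positive values indicate a reduction in mean pain.
mean_pain_change = [12.0, 3.0, -4.0, 20.0, 0.0]
gic_responses = ["much better", "minimally better", "minimally worse",
                 "very much better", "unchanged"]

r = pearson_r(mean_pain_change, [GIC_CODES[g] for g in gic_responses])
```

Because the GIC is an ordinal scale, a rank-based coefficient would be a reasonable alternative to the Pearson coefficient used in this sketch.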
Results
Means of the summary measures at baseline and at follow-up are shown for the two groups (Table 4). We estimated the effect sizes (d) for the group difference in mean change from baseline to follow-up (computed by dividing the difference in the group change scores by the SD of the overall change score); these are presented in Table 4, and histograms of the change scores are shown in Figure 3. All of the measures, with the exception of the time-contingent measure, showed change over time consistent with the interpretation of greater improvement in pain for the Tx Change group relative to the Comparison group (Table 4). It was not clear how “improvement” would relate to the difference between pain levels in the morning versus the evening, but we included the time-contingent measure for the sake of providing a complete picture of how the summary measures performed in this study.
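The effect size computation described above can be sketched in Python as follows. This is a minimal illustration that assumes the "SD of the overall change score" is the sample SD of the change scores from both groups combined; the function name and the example values are hypothetical.

```python
from statistics import mean, stdev


def change_effect_size(change_tx, change_comparison):
    """Difference in mean change between groups divided by the SD of all change scores."""
    all_changes = list(change_tx) + list(change_comparison)
    return (mean(change_tx) - mean(change_comparison)) / stdev(all_changes)


# Hypothetical baseline-minus-follow-up change scores, not study data.
d = change_effect_size([10, 12, 8, 10], [2, 4, 0, 2])
```

Because the denominator is the SD of the change scores rather than of the baseline scores, the resulting d describes the group difference relative to how much patients' changes vary overall.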
TABLE 4.
Baseline and Follow-up Summary Measures of Momentary Pain in Study 2
| Measure | Comparison Group: Baseline | Comparison Group: FU | Tx Change Group: Baseline | Tx Change Group: FU | Effect Size (d) for Baseline – FU Change Score Between Groups |
|---|---|---|---|---|---|
| Mean | 33.1 | 29.7 | 40.1 | 29.9 | 0.45* |
| Median | 28.3 | 25.6 | 37.5 | 26.3 | 0.39† |
| 90th percentile | 60.5 | 52.9 | 64.4 | 50.2 | 0.30 |
| Maximum | 73.3 | 71.6 | 77.5 | 67.2 | 0.40† |
| SD | 19.8 | 18.1 | 20.2 | 16.4 | 0.25 |
| Proportion rated 0 | 0.26 | 0.30 | 0.22 | 0.36 | 0.50* |
| Proportion ≥50 | 0.34 | 0.26 | 0.45 | 0.30 | 0.37 |
| Morning versus evening | 35.1 | 30.3 | 43.4 | 30.0 | 0.30† |
Comparison Group = group without expected change; Tx Change Group = group with expected change; FU = follow-up; SD = standard deviation.
n = 105.
* p < .05.
† p < .10.
Figure 3.

Histograms of change scores based on summary measures in Study 2. Positive values indicated reductions in pain.
Not all of these measures yielded statistically significant between-group differences. However, we are less concerned about significance levels considering that the statistical power of this study to detect group differences in change for these measures was not considered a priori. What is important for the purpose of evaluating the utility of the measures is the magnitude of the effect sizes. The largest effect size was for the percentage of zero ratings of pain. This measure increased by 4 percentage points in the Comparison group (0.26 to 0.30) versus 14 percentage points in the Tx Change group (0.22 to 0.36), yielding an effect size of d = 0.50, which is often characterized as “moderate.” A slightly lower effect size (d = 0.45) was found for change in mean pain intensity, reflecting a 3.4-point reduction for the Comparison group and a 10.2-point reduction for the Tx Change group. The maximum measure yielded an effect size of d = 0.40, based on a reduction of only 1.7 points in the Comparison group and a 10.3-point reduction in the Tx Change group. Small effects were observed for the time-contingent measure (morning versus evening), for the SD, and for the 90th percentile measure.
A second set of analyses examined associations between change scores on the summary measures and all patients’ GIC ratings; these correlations are shown in Table 5. The strongest associations with GIC were observed in the expected directions for changes in mean (r = 0.55), percentage of ratings with no pain (r = −0.50), and the 90th percentile (r = 0.50). Change in the SD was not associated with GIC. The means of the 3-month change in the mean measure decreased monotonically with each category of the GIC levels. With minor exceptions (ratings of “minimally worse” and “unchanged”), the change in the percentage of ratings with no pain and 90th percentile increased monotonically with each increasing level of GIC.
TABLE 5.
Correlations of Change in Momentary Summary Measures With Global Impression of Change in Study 2
| Measure | Change in Measure With Global Impression of Change |
|---|---|
| Mean | 0.55**** |
| Median | 0.39**** |
| 90th percentile | 0.50**** |
| Maximum | 0.40**** |
| SD | 0.13 |
| Proportion rated 0 | −0.50**** |
| Proportion ≥50 | 0.40**** |
| Morning versus evening | 0.41**** |
SD = standard deviation.
A positive correlation means that a reduction in the measure from Week 1 to Week 2 is associated with more improvement in perceived change.
n = 105.
**** p < .0001.
Discussion
The goal of Study 2 was to use the summary measures in a longitudinal study of pain where one group was expected to have declines in pain over time due to revised treatment, whereas the other group was expected to be fairly stable over time. There are similarities between the first and the second studies: they had the same momentary measurement strategy and similar populations of arthritis patients. Distributions of the summary measures in both studies were visually compared and were considered to be quite comparable (data not shown here).
An important observation was that change in summary measures was approximately normally distributed, even for those measures with severely skewed cross-sectional distributions in the first study. Difference scores are known to generate normal distributions even from highly nonnormal component scores (22), and this makes them good candidates for typical parametric statistical procedures.
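The tendency of difference scores to be more symmetric than their components can be illustrated with a small simulation. The Python sketch below uses synthetic, independent draws from a right-skewed exponential distribution; it is not based on the study data, and the distribution, sample size, and seed are arbitrary assumptions. It shows only that skewness shrinks toward zero for the differences, not that the differences are exactly normal.

```python
import random
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def simulate_skewness(n=5000, seed=0):
    """Return skewness of a skewed baseline score and of baseline-minus-follow-up differences."""
    rng = random.Random(seed)
    # Synthetic right-skewed scores with mean 20 (assumption, not study data).
    baseline = [rng.expovariate(1 / 20) for _ in range(n)]
    follow_up = [rng.expovariate(1 / 20) for _ in range(n)]
    differences = [b - f for b, f in zip(baseline, follow_up)]
    return skewness(baseline), skewness(differences)


baseline_skew, difference_skew = simulate_skewness()
```

In real longitudinal data the two assessments are correlated and may differ in distribution, so the degree of symmetry in the change scores should still be checked empirically, as was done with the histograms in Figure 3.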
Substantively, it was notable that moderate effects were found for two of the eight summary measures: the mean of all moments and the percentage of ratings with zero pain. Change scores based on these measures yielded the two largest standardized differences between groups and the strongest associations with GIC. The 90th percentile measure also performed well with respect to its association with GIC. We do not interpret these results as meaning that these measures are the “best” of the group, given that the primary goal was to demonstrate the feasibility of the new summary measures. Nevertheless, we consider this a compelling story because the percentage of ratings with zero pain is a measure that may be clinically appealing. We can imagine the results of a treatment for pain being discussed as reducing the percentage of ratings (or the amount of time, to generalize from a random momentary assessment protocol) that the treatment group had any pain from 50% to 20%, whereas the reduction for a control group was only from 50% to 40%. It was also impressive that patients’ GIC was strongly related to the change in the percentage of ratings of no pain. Although not as strong a measure relative to the others, the change in maximum pain yielded an effect size of 0.40 and was significantly correlated (r = 0.40) with GIC. This measure could complement the clinical profile of patient change in a trial.
Conversely, the median seemed to confer no advantage over the mean: it had a smaller effect size for change and a smaller association with GIC than did the mean summary measure. The 90th percentile, however, had very similar associations when compared with the mean. The summary measure with the “worst” performance was the SD. Although reducing pain fluctuations could be conceptualized as a positive outcome of treatment, and it was observed in the Tx Change group, it was not associated with patients’ GIC.
In practice, multiple outcomes based on momentary data could be used to characterize symptoms. For example, a study team could be interested in both the reduction of mean symptom levels and in the percentage of time patients were symptom free. In Study 1, 46% of the variance was shared by these two measures (mean and percentage of ratings of zero pain), so it seems quite reasonable to take a multivariate approach when justified. However, for summary measures that were highly correlated in a particular sample, little will be gained by the inclusion of more than a single summary measure.
GENERAL DISCUSSION
The summary measures presented in this article have brought to light several of the many alternative ways in which PROs can be characterized from momentary assessment protocols. The traditional standard of using the mean over the momentary reporting period has been reinforced by our data. We have also shown, in a preliminary manner, that it may be reasonable to create measures other than the mean PRO experience. Of course, the decision to create new measures will depend on the conceptual questions that are posed at the outset of a study. For some or even many hypotheses, using the mean of all momentary assessments may be sufficient for answering the question. However, other hypotheses may suggest or be enhanced by considering some of the other indices. For example, it is perfectly reasonable to document the effectiveness of an intervention by reporting the change in mean symptom levels for the treatment and comparison groups. Yet, it may also be informative to know that the treatment increased the percentage of symptom-free time compared to the comparison group, a difference that will not be evident by examining the mean differences. And, it may be interesting to know that the variability of symptoms in the treatment group was much lower after treatment (relative to the comparison group). These facts have the potential to provide a deeper understanding of the treatment that goes beyond what is revealed from the means.
The two data sets used for this article have demonstrated that there can be considerable complexity in the momentary experience of PROs as shown with this example of pain intensity. Evaluation of clinical outcomes could be potentially more informative and refined by adopting more than one way of characterizing change. Although these various measures have obvious face validity, they will need to be evaluated in clinical trials. It was of considerable interest that most of the summary measures were associated with patients’ perceptions of global change in Study 2, adding to the face validity of the summary measures.
This article has focused exclusively on pain intensity as an outcome, but we believe that the measurement conceptualizations presented here are likely to generalize to many types of PROs that have a quantitative dimension such as intensity or severity. Variables such as affective states, fatigue, other medical symptoms, well-being, and appraisals of circumstances could all use the analytic methods described here for pain.
We emphasize that all of the associations reported here are limited by the content tested (pain), the population studied (patients with rheumatic disease) and by the “treatment” analog design used. Thus, we are not advocating, endorsing, or eliminating any of the summary measures based on the observed associations. The choice of measures for a trial should be guided—to the extent possible—by the conceptual goals of the study; that is, by the theorized changes in the outcome that would allow one to conclude that an intervention fulfilled the aspirations of its developers. For some trials, this may be best represented by a change in the highest levels of pain (reducing the peaks) and in others by increasing the amount of pain-free times. And yet in other trials, the overall mean of daily pain may be the best outcome. Comparative effectiveness trials may especially benefit from including multiple types of outcome as different aspects of the symptom experience may be differentially affected by different treatments. When the goal is to evaluate treatment efficacy, we encourage researchers and clinicians to make decisions a priori about which summary measures are most important conceptually, and we certainly discourage “fishing expeditions” in which multiple measures are created a posteriori to explore which yields the greatest treatment effect.
Acknowledgments
This research was supported by grants from the National Institute of Arthritis and Musculoskeletal Diseases (U01-AR052170 to Dr. Stone (principal investigator) and U01-AR057948 to Drs. Broderick and Stone (principal investigators)).
Glossary
- PROs
patient-reported outcomes
- GIC
Global Impression of Change
Footnotes
Drs. Stone and Broderick have financial interests in Invivodata, Inc, and Dr. Stone is a senior scientist for the Gallup Organization and a consultant for Wellness & Prevention, Inc.
REFERENCES
- 1. Rothman M, Burke L, Erickson P, Leidy NK, Patrick DL, Petrie CD. Use of existing patient-reported outcome (PRO) instruments and their modification: the ISPOR good research practices for evaluating and documenting content validity for the use of existing instruments and their modification PRO Task Force Report. Value Health 2009;12:1075–83.
- 2. Shiffman S, Stone AA, Hufford MR. Ecological momentary assessment. Annu Rev Clin Psychol 2008;4:1–32.
- 3. Stone AA, Broderick JE. Real-time data collection for pain: appraisal and current status. Pain Med 2007;8(Suppl 3):S85–93.
- 4. Stone AA, Shiffman S, Atienza A, Nebling L, editors. The Science of Real-time Data Capture: Self-reports in Health Research. New York, NY: Oxford University Press; 2007.
- 5. Guidance for Industry. Patient-reported outcome measures: Use in medical product development to support labeling claims. Silver Spring, MD: US Department of Health and Human Services; 2009. Available at: http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM193282.pdf. Accessed April 21, 2012.
- 6. Curran SL, Beacham AO, Andrykowski MA. Ecological momentary assessment of fatigue following breast cancer treatment. J Behav Med 2004;27:425–44.
- 7. Myin-Germeys I, Oorschot M, Collip D, Lataster J, Delespaul P, van Os J. Experience sampling research in psychopathology: opening the black box of daily life [review]. Psychol Med 2009;39:1533–47.
- 8. Shiffman S, Gwaltney CJ, Balabanis M, Liu KS, Paty JA, Kassel JD, Hickcox M, Gnys M. Immediate antecedents of cigarette smoking: an analysis from ecological momentary assessment. J Abnorm Psychol 2002;111:531–45.
- 9. Steptoe A, Gibson EL, Hamer M, Wardle J. Neuroendocrine and cardiovascular correlates of positive affect measured by ecological momentary assessment and by questionnaire. Psychoneuroendocrinology 2007;32:56–64.
- 10. Gendreau M, Hufford MR, Stone AA. Measuring clinical pain in chronic widespread pain: selected methodological issues. Best Pract Res Clin Rheumatol 2003;17:575–92.
- 11. Broderick JE, Stone AA, Calvanese P, Schwartz JE, Turk DC. Recalled pain ratings: a complex and poorly defined task. J Pain 2006;7:142–9.
- 12. Stone AA, Broderick JE, Porter L, Kaell AT. The experience of rheumatoid arthritis pain and fatigue: examining momentary reports and correlates over one week. Arthritis Care Res 1997;10:185–93.
- 13. Stone AA, Broderick JE, Kaell AT. Single momentary assessments are not reliable outcomes for clinical trials. Contemp Clin Trials 2010;31:466–72.
- 14. Broderick JE, Schwartz JE, Vikingstad G, Pribbernow M, Grossman S, Stone AA. The accuracy of pain and fatigue items across different reporting periods. Pain 2008;139:146–57.
- 15. Zautra A, Fasman R, Parish BP, Davis MC. Daily fatigue in women with osteoarthritis, rheumatoid arthritis, and fibromyalgia. Pain 2007;128:128–35.
- 16. Nisenbaum R, Links PS, Eynan R, Heisel MJ. Variability and predictors of negative mood intensity in patients with borderline personality disorder and recurrent suicidal behavior: multilevel analyses applied to experience sampling methodology. J Abnorm Psychol 2010;119:433–9.
- 17. Russell JJ, Moskowitz DS, Zuroff DC, Sookman D, Paris J. Stability and variability of affective experience and interpersonal behavior in borderline personality disorder. J Abnorm Psychol 2007;116:578–88.
- 18. Schneider S, Junghaenel DU, Schwartz JE, Stone AA, Keefe FJ, Broderick JE. Individual differences in the day-to-day variability of pain, fatigue, and well-being in patients with rheumatic disease: associations with psychological variables. Pain 2012;153:813–22.
- 19. Stone AA, Broderick JE, Schwartz JE. Validity of average, minimum, and maximum end-of-day recall assessments of pain and fatigue. Contemp Clin Trials 2010;31:483–90.
- 20. Farrar JT, Young JT, LaMoreaux L, Werth JL, Poole RM. Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain scale. Pain 2001;94:149–58.
- 21. Yost KJ, Eton DT. Combining distribution- and anchor-based approaches to determine minimally important differences: the FACIT experience. Eval Health Prof 2005;28:172–91.
- 22. Vickers AJ. Parametric versus non-parametric statistics in the analysis of randomized trials with non-normally distributed data. BMC Med Res Methodol 2005;5:1–12.
