Author manuscript; available in PMC: 2015 Dec 1.
Published in final edited form as: Neurogastroenterol Motil. 2014 Dec;26(12):1802–1811. doi: 10.1111/nmo.12466

The accuracy of patient reported measures for GI symptoms: A comparison of real time and retrospective reports

Jeffrey M Lackner 1, James Jaccard 2, Laurie Keefer 3, Rebecca Firth 1, Ann Marie Carosella 1, Michael Sitrin 1, Darren Brenner 3; Representing the IBSOS Research Group
PMCID: PMC4247164  NIHMSID: NIHMS633096  PMID: 25424582

Abstract

Background

Obtaining accurate information about GI symptoms is critical to achieving the goals of clinical research and practice. The accuracy of patient data is especially important for functional GI disorders (e.g., IBS) whose symptoms lack a biomarker and index illness severity and treatment response. Retrospective patient reported data are vulnerable to forgetting and various cognitive biases whose impact has not been systematically studied in patients with GI disorders.

Aim

To document the accuracy of patient reported GI symptoms over a reporting period (1 week) most representative of the time frame used in research and clinical care (1).

Methods

Subjects were 273 Rome III-diagnosed IBS patients (M age = 39 yrs., 89% F) who completed end of day GI symptom ratings for 7 days using an electronic diary. On Day 8, Ss recalled the frequency and/or intensity of IBS symptoms over the past 7 days. Reports were then compared against a validation criterion based on aggregated end of day ratings.

Key Results

At the group level, subjects recalled most accurately abdominal pain and urgency intensity at their worst, urgency days, and stool frequency. When data were analyzed at the individual level, a subgroup of subjects had difficulty recalling accurately symptoms that showed convergence between recall and real time reports at the group level.

Conclusions and Inferences

Although many patients’ recollection for specific GI symptoms (e.g., worst pain, stool frequency) is reasonably accurate, a non-trivial number of other symptoms (e.g. typical pain) are vulnerable to distortion from recall biases that can reduce sensitivity of detecting treatment effects in clinical and research settings.

Keywords: daily diary, pain, recall, symptom severity, randomized trial, pain measurement, irritable bowel syndrome


Self-report is one of the most common, inexpensive, and efficient methods for collecting patient information in both clinical and research settings. Clinically, patients’ reports of their symptoms inform clinical decision-making, particularly for functional gastrointestinal (GI) disorders (e.g., IBS) whose symptoms lack a reliable biomarker that gauges illness severity. The day to day burden of many common GI disorders is oftentimes best explained by subjective variables (e.g., pain, fatigue, quality of life) that can only be assessed through self-report or by concepts that are not readily observed (e.g., stool consistency or frequency). In research settings, patient reported outcomes are increasingly acknowledged as the primary source of data for documenting therapeutic benefit (improvement in symptoms or functional capacity) that is detectable by the patient.

To this end, the FDA (2) issued its Guidance on Patient Reported Outcomes (PROs) in 2009. The Guidance provides a road map for developing and testing endpoints that focus on concepts that are most important and meaningful to patients. While the FDA's Guidance underscores the importance of subjective data, it acknowledges problems inherent to self-report. When patients are asked to recall their symptoms over a discrete time period, such as the last week, their responses are subject to memory failure and biases that arise from mental shortcuts (heuristics) used to make judgments quickly and efficiently about a stream of symptom experiences on varying dimensions (e.g., duration, intensity, severity). For example, rather than mentally keeping an exact running total of their experience, people may rely on how they felt most recently and at their most extreme to summarize symptom levels (3–5). Recall heuristics streamline information processing of complex mental tasks but come at the cost of potentially reduced accuracy, such that the symptoms patients experience in real time can differ from the memory of the symptom experience they reconstruct and convey to a physician or researcher.

In spite of the potential problems inherent in a source of data that are critical to clinical and research activities, efforts to document the accuracy of retrospective recall of GI symptoms are limited (6, 7). No known study has validated recall of GI symptoms that have occurred over the past week, which is the time frame most commonly used in research and clinical care (1, 8, 9). In the present study, we estimated the accuracy of recalled GI symptom data over 7 days relative to data collected in real time using electronic diaries during each of those 7 days. The daily diary data capture the symptom experience close to the time of experience, thereby limiting distortion from forgetting and recall bias, and yielding reasonably accurate data. We used electronic diaries because of known data quality and compliance difficulties with paper diary methods. Studies have shown, for example, that patients do not reliably complete paper diaries in real time but instead frequently rely on “back filling” and even “forward filling” of the diary. In one study, researchers documented that respondents’ compliance with paper diaries was as little as 11%; in other words, 89% of paper diaries were not completed as prescribed (10). The electronic diary methods used in the present study addressed such problems by time stamping when responses were entered and preventing entry of responses after a specified time period.

The electronic diary data were aggregated across seven days and served as a gold standard against which to compare end-of-period self-reported recall of symptoms over the seven day time frame. We validated the accuracy of GI symptom recall at both the (aggregate) group level and the individual level. Group level accuracy refers to whether a group mean level of symptom severity based on recall accurately reflects the true mean level of symptom severity for that group of individuals as reflected by a “gold standard.” It is of interest, for example, when evaluating the overall effects of a treatment on a population of individuals enrolled in a clinical trial. Individual level accuracy refers to whether the level of symptom severity recalled by an individual accurately reflects the true level of symptom severity experienced by that individual. It is of interest in clinical settings when a patient reports symptoms to a physician and in research that seeks to identify correlates of symptom severity. Although not commonly recognized, recall can be accurate at the group level but not at the individual level. For example, if some individuals overestimate symptom severity relative to their true symptoms and other individuals underestimate it, then, when aggregated, the overestimations cancel the underestimations and the group level (mean) recall will show good agreement despite (potentially substantial) error at the individual level. The unique perspectives that this information reveals about how persons (both as members of a comparison group in a clinical trial and as individual patients treated by a gastroenterologist in an office setting) remember GI symptoms have important implications for clinicians and researchers in carrying out their respective tasks.

Materials and Methods

Participants

Participants included 273 individuals between the ages of 18 and 70 (inclusive) years who were recruited to an NIH funded IBS trial through a variety of sources including medical professionals (e.g., gastroenterologists, obstetricians, gynecologists, primary care physicians), media coverage, word of mouth, and advertisements. Individuals who passed a brief telephone screening were scheduled for formal medical and psychological evaluations to determine their eligibility. Inclusion criteria included Rome III IBS diagnosis (11) confirmed during a medical examination by a board certified gastroenterologist; IBS symptoms (i.e., pain and defecatory symptoms) of at least moderate severity (occurring an average of two or more days per week with life interference, e.g., 12, 13); ability to provide written consent; and a minimum 6th grade reading level. Exclusion criteria were: presence of a comorbid organic GI disease (e.g., IBD) that would adequately explain GI symptoms; mental retardation; current or past diagnosis of schizophrenia or other psychotic disorders; current diagnosis of unipolar depression with suicidal ideation; and current diagnosis of psychoactive substance abuse. Patients’ predominant bowel habit was determined after medical examination using Rome III guidelines (11). Subject breakdown and other demographic data are presented in Table 1.

Table 1.

Descriptive Statistics for Total Sample

| Variable | Percent/Median |
|---|---|
| Percent white | 89.7 |
| Percent female | 78.7 |
| Median age (in years) | 39.0 |
| Percent college degree | 31.2 |
| Median family income (US dollars) | 62,325 |
| Percent type IBS-C | 31.8 |
| Percent type IBS-D | 41.4 |
| Percent type IBS-M | 22.6 |
| Percent type IBS-U | 4.2 |
| Median IBS duration (in years) | 12.0 |

Note: Sample size = 273.

Procedure

This study was carried out as part of the Irritable Bowel Syndrome Outcome Study (IBSOS), a multisite clinical trial the details of which can be found elsewhere (14). The assessment battery and the experimental procedures of the study were approved by the Health Sciences IRBs of both clinical sites of the IBSOS (University at Buffalo, Northwestern University). Informed consent was obtained from all subjects before participation. Subjects were compensated $50 for completing the baseline assessment battery from which data for the current study were drawn. After meeting eligibility criteria, subjects were trained to enter symptom ratings over 28 days using either an Asus Memo Tablet (Model #K0WME172V) or Dell Mini Notebook (Model # 1012P04T) between 8 pm and 12 am at the end of each day. End of day ratings took approximately 3 minutes to complete and were time-stamped to verify when they were entered. The last seven days of the 28 day period comprised the monitoring phase for the present study. On the 8th day, the subjects returned to the clinic where they completed a 7-day recall measure. A 7 day recall period was used because it typifies the reporting period used in clinical practice and research settings (8).

End of Day Measures

Patients completed the following measures at the end of the day during the seven day monitoring phase.

Abdominal Pain

Abdominal pain intensity was measured with an 11-point pain intensity numerical rating scale, where 0=no pain and 10=worst possible pain (15). Patients chose a number from 0–10 that best described their abdominal pain intensity during the day. Patients rated the average and worst abdominal pain they experienced in the last 24 hours.

Stool Consistency

Stool consistency was measured using the Bristol Stool Form Scale (BSFS; 16). The BSFS consists of an ordered seven point scale that describes variation in the consistency or form of stool ranging from “Type 1: Separate hard lumps like nuts (difficult to pass)” to “Type 7: Watery, no solid pieces (entirely liquid)”. Types 1 and 2 are representative of constipation, while Types 6 and 7 are representative of diarrhea. Because it was impractical and burdensome to assess the consistency of every stool a patient passed, patients were asked to describe the consistency of their typical stool for the day. We arrived at this decision after pilot testing this approach and finding that patients were able to identify a single stool type that characterized typical bowel consistency for a given day.

Stool Frequency

Patients were asked how many bowel movements they had passed in the last 24 hours.

Urgency

We adopted the practice (17) of using an 11 point numerical rating scale (0 = none, 10 = worst) to measure average and worst urgency intensity. Urgency was described as “any sudden urge to rush to the toilet in order to move your bowels”.

End of Week Recall Measures

At the completion of the 7 day monitoring period, patients returned to their respective clinic (day 8) where they completed a 7 day recall questionnaire assessing GI symptoms. To minimize paperwork burden and the number of analyses, we limited assessment of stool consistency to the four most extreme (disordered) stool types (Types 1–2, 6–7) of the Bristol Stool Form Scale. Stool frequency was measured by asking patients to report how many bowel movements they had, in total, during the past 7 days. Using the 11 point numerical rating scale described above (e.g., 0 = none, 10 = worst possible), patients characterized the average and worst intensity of abdominal pain and urgency over the past 7 days. Because IBS patients frequently report “good” and “bad” days, two additional items asked patients to characterize the number of days they experienced pain and urgency, respectively. In total, the recall questionnaire comprised 11 items assessing 4 discrete symptoms (abdominal pain, urgency, stool frequency, stool consistency).

Analytic Strategy

For each individual, a diary-based index of symptoms was derived for each symptom separately by aggregating entries across the past seven days for that symptom (treatment of missing data is described in the Results section). For example, abdominal pain was rated daily on an 11 point scale, where 0 is labeled “none” and 10 is “worst possible.” The number of days on which a rating of >2 was recorded across the seven days was calculated for that person. We used a rating of >2 as a criterion for daily pain because it was judged that values at or below that level are of questionable clinical significance and unlikely to be recalled. Our rationale is consistent with FDA interim endpoints for IBS, which regard pain intensity ratings below 3 on an 11 point scale as not clinically meaningful (18). As another example, a diary-based symptom index of the worst abdominal pain experienced in the past week was defined as the highest abdominal pain severity rating across the seven days for the individual. We refer to the end-of-day diary indices as “gold standards,” a practice consistent with symptom recall research more broadly, which indicates that patients can accurately recall symptoms such as pain that they experience during the same day if they are motivated to report such symptoms accurately when queried about them (19–24).
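The aggregation described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `diary_indices`, the dictionary keys, and the sample week are hypothetical, and treating the weekly "average" index as the mean of the seven daily ratings is an assumption, since the text spells out only the symptom-day and worst-rating definitions.

```python
from statistics import mean

# Ratings strictly greater than 2 count as a symptom day (criterion from the text).
SYMPTOM_DAY_THRESHOLD = 2


def diary_indices(daily_ratings):
    """Aggregate seven end-of-day 0-10 ratings into weekly diary indices."""
    if len(daily_ratings) != 7:
        raise ValueError("expected seven daily ratings")
    return {
        # Number of days with a rating above the threshold.
        "symptom_days": sum(1 for r in daily_ratings if r > SYMPTOM_DAY_THRESHOLD),
        # Worst symptom of the week = highest daily rating.
        "worst": max(daily_ratings),
        # Assumption: weekly average = mean of the daily ratings.
        "average": mean(daily_ratings),
    }


week = [0, 3, 6, 2, 1, 4, 5]  # hypothetical ratings for one patient
indices = diary_indices(week)
print(indices)
```

Note that a rating of exactly 2 does not count as a symptom day under the >2 criterion.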

Group level analyses compared the group mean for the 7 day recall measure with the group mean for the gold standard, testing for mean differences using standard single degree of freedom contrast strategies for correlated-groups (25). Individual level accuracy analyses regressed the measured 7 day recall for each individual (using traditional ordinary least squares regression) onto the gold standard index for that individual under the hypotheses that (a) the intercept should be zero and (b) the slope should be 1.0. These hypotheses reflect presumed perfect correspondence between the gold standard and the recall score such that (a) when the gold standard equals zero, the recall measure also should equal zero (the intercept), and (b) for every one unit that the gold standard increases, the recall measure, on average, should also increase by one unit (the slope). Also of interest is the squared correlation in the analysis as it reflects the proportion of variability in the recall measure that is accounted for by true symptom variability. When data for a given variable showed absolute skewness or kurtosis values greater than 2.0, standard errors and significance tests were based on bootstrapping to accommodate non-normality (26). Routine analyses for outliers revealed none.
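The two analyses can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names and the six-patient data set are hypothetical, a paired t statistic stands in for the single degree of freedom correlated-groups contrast, and the bootstrap variant used for skewed variables is omitted.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(recall, gold):
    """Group level: t statistic for the mean of (recall - gold)."""
    diffs = [r - g for r, g in zip(recall, gold)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def recall_on_gold(recall, gold):
    """Individual level: OLS of recall on gold, testing a = 0 and b = 1."""
    n = len(gold)
    gx, ry = mean(gold), mean(recall)
    sxx = sum((g - gx) ** 2 for g in gold)
    sxy = sum((g - gx) * (r - ry) for g, r in zip(gold, recall))
    slope = sxy / sxx
    intercept = ry - slope * gx
    sse = sum((r - intercept - slope * g) ** 2 for g, r in zip(gold, recall))
    sst = sum((r - ry) ** 2 for r in recall)
    s2 = sse / (n - 2)  # residual variance
    se_slope = sqrt(s2 / sxx)
    se_intercept = sqrt(s2 * (1 / n + gx ** 2 / sxx))
    return {
        "intercept": intercept,
        "slope": slope,
        "r2": 1 - sse / sst,
        "t_intercept_vs_0": intercept / se_intercept,
        "t_slope_vs_1": (slope - 1.0) / se_slope,
    }


# Hypothetical data: low scorers over-report, high scorers under-report.
gold = [0, 2, 4, 6, 8, 10]
recall = [2, 3, 4, 6, 7, 8]
group_t = paired_t(recall, gold)
fit = recall_on_gold(recall, gold)
print(group_t, fit)
```

In this toy data set the group means are identical, so the group level contrast is zero, yet the fitted slope is well below 1 and the intercept is above 0. This is the pattern of group level agreement masking individual level error that the analysis is designed to detect.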

Results

Preliminary Analyses

Missing Data

Missing data for a given variable on a given day for the diary data was typically about 25% of the cases, which is comparable to similar electronic diary studies (6). The analytic strategy requires an aggregated index across seven days for each individual. We retained for analysis only individuals who reported data for a minimum of five of the seven days, which represented about 65% of the total sample (N = 177). For these individuals, missing data on a given day were imputed using the individual’s rating of the day before and the day after the missing rating (averaging the two ratings). If only one of the two surrounding days had data, the value for the one day was imputed. The average multiple correlation between a rating on a given day as predicted from the day before and the day after it was 0.80, suggesting that the imputation strategy is reasonable. We performed sensitivity analyses to determine if conclusions changed as a function of using alternative methods for treating missing data (e.g., Markov chain Monte Carlo multiple imputation) and did not find this to be the case.
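The retention and single-imputation rule lends itself to a short sketch. The function name `impute_week` and the use of `None` for a missing day are assumptions made for illustration. The text does not say how a missing day with no observed neighbor was handled, so this sketch simply leaves such a day missing.

```python
def impute_week(ratings, min_days=5):
    """Fill missing days (None) from the adjacent days.

    Returns None when fewer than `min_days` ratings were observed,
    mirroring the five-of-seven retention rule in the text.
    """
    observed = sum(r is not None for r in ratings)
    if observed < min_days:
        return None
    filled = list(ratings)
    for i, r in enumerate(ratings):
        if r is not None:
            continue
        before = ratings[i - 1] if i > 0 else None
        after = ratings[i + 1] if i < len(ratings) - 1 else None
        neighbors = [v for v in (before, after) if v is not None]
        if neighbors:
            # Two neighbors: their average. One neighbor: its value.
            filled[i] = sum(neighbors) / len(neighbors)
        # Assumption: with no observed neighbor the day stays missing.
    return filled


print(impute_week([3, None, 5, 4, 6, None, 2]))
print(impute_week([3, None, 5, 4, None, None, 2]))
```

Imputed values are computed from the originally observed ratings only, so one imputed day never feeds into another.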

We tested whether missingness for more than two days of the target week was associated with individual difference variables, including demographic variables (gender, age, ethnicity, education, income) assessed at baseline. Only one variable, ethnicity (scored as 1 = White, 0 = non-White), yielded a consistent statistically significant association (average r = −0.13, p < 0.05). The association was weak and readily attributed to chance given the number of correlations examined. There does not appear to be significant bias in missing data, but care must still be taken in generalizing results.

Monitoring Bias

The act of monitoring symptoms on a daily basis through diaries can create bias. One form of bias is that it sensitizes individuals to symptoms and thereby alters how they experience those symptoms. We tested for this form of bias by comparing mean reported symptom levels as recalled for the past seven days for study participants who had completed the diaries with a group of IBS patients who had not kept diaries (N=75). On the 11 indices of recalled 7 day symptoms referenced above, none of the means were statistically significantly different for the two groups.

Analyses of Recall Accuracy at the Group Level

Table 2 presents the estimated means for each symptom averaging across individuals (hereafter referred to as “group level means”) for the one-week gold standard and for its corresponding one-week recall measure. For example, the average worst abdominal pain that respondents experienced across the 7 days using the gold standard was 6.69. By comparison, the average recalled worst abdominal pain was 6.66, a difference that was not statistically significant (p > 0.05). Generally speaking, there was some tendency towards over-reporting symptom levels when examining group level means, with statistically significant mean disparities being observed for 7 of the 11 GI symptoms: intensities of pain and urgency on average, the four disordered stool types (Types 1–2, 6–7), and days with abdominal pain.

Table 2.

Gold Standard and Recall Means for Different Symptoms

| Symptom | Gold Standard, mean (SD) | Recall, mean (SD) | Difference (gold standard − recall) | Estimate |
|---|---|---|---|---|
| Days with abdominal pain | 4.35 (2.6) | 5.35 (1.8) | −1.00* | Over |
| Days with urgency | 3.44 (2.5) | 3.43 (2.3) | 0.01 | ns |
| Number of stools | 15.06 (9.8) | 15.14 (10.4) | −0.07 | ns |
| Days with hard, lumpy stools | 0.86 (1.7) | 1.53 (1.9) | −0.67* | Over |
| Days with sausage-shaped stools | 0.43 (0.9) | 1.91 (2.0) | −1.48* | Over |
| Days with fluffy stools | 1.67 (2.0) | 3.17 (2.4) | −1.49* | Over |
| Days with watery stools | 0.34 (0.8) | 1.00 (1.5) | −0.66* | Over |
| Worst abdominal pain | 6.69 (2.2) | 6.66 (2.3) | 0.03 | ns |
| Average abdominal pain | 3.50 (1.8) | 4.61 (2.0) | −1.11* | Over |
| Worst urgency | 5.66 (3.0) | 5.58 (3.3) | 0.08 | ns |
| Average urgency | 2.99 (2.2) | 3.67 (2.7) | −0.67* | Over |

Notes: N ≈ 175, except for those involving stool type, where N ≈ 135; values in parentheses are standard deviations; * p < 0.05; ns = nonsignificant difference between recall and gold standard.

Analyses of Recall Accuracy at the Individual Level

While group level analyses provide important information, they are not sensitive to lack of concordance at the individual level. When averaging across subjects, individuals who overestimate their GI symptoms will offset those who underestimate their symptoms, with the result being an accurate aggregate level index. For this reason, we performed analyses that permit exploration of symptom level variation that is not apparent through group level analyses and that respect individual variation in over- and under-estimation. Table 3 presents the results of several analyses. The first analysis presents the estimated intercept and slope when the recall score is regressed onto the gold standard. If there is perfect correspondence between the gold standard and the recall score, the slope should be 1.0 and the intercept should be zero. The slope indicates how much the recall measure increases, on average, for every one unit that the gold standard increases. The intercept indicates what the recall measure is, on average, when the gold standard is zero. For example, consider the symptom worst abdominal pain experienced in the past week. The intercept (1.28) was statistically significantly different from 0 and the slope (0.80) was statistically significantly different from 1.00 (p < 0.05; see Table 3). These values suggest a lack of correspondence between the recall measure and the gold standard. When the gold standard was 0, the typical (i.e., average) worst abdominal pain reported was 1.28. Also, for every one unit that the gold standard increased, the recalled worst abdominal pain increased by only 0.80 units, on average.

Table 3.

Individual Level Accuracy Analyses for Different GI Symptoms

| Symptom | a | b | r² | Mean Disparity | Disparity ≥ 2 | Disparity ≤ −2 |
|---|---|---|---|---|---|---|
| Days abdominal pain | 0.57 | 0.71* | 0.25* | 1.74* | 34% | 10% |
| Days urgency | 0.96* | 0.72* | 0.44* | 1.31* | 17% | 19% |
| Number of stools | 0.86 | 0.95 | 0.79* | 3.39* | 30% | 31% |
| Days with hard, lumpy stools | 0.80* | 0.85* | 0.53* | 0.85* | 21% | 1% |
| Days with sausage stools | 1.87* | 0.09* | 0.00 | 1.77* | 39% | 4% |
| Days with fluffy stools | 2.00* | 0.70* | 0.31* | 1.89* | 49% | 5% |
| Days with watery stools | 0.69* | 0.90 | 0.22* | 0.77* | 17% | 1% |
| Worst abdominal pain | 1.28* | 0.80* | 0.61* | 1.15* | 25% | 2% |
| Average abdominal pain | 1.75* | 0.82* | 0.56* | 1.20* | 22% | 0% |
| Worst urgency | 0.46 | 0.90 | 0.72* | 1.29* | 16% | 19% |
| Average urgency | 0.99* | 0.90 | 0.53* | 1.50* | 18% | 2% |

Notes: N ≈ 176; a and b are the intercept and slope for the regression of the recall score onto the gold standard; r² is the squared correlation between gold standard and recall measures; Mean Disparity is the average absolute disparity between the gold standard and the recall score; Disparity ≥ 2 is the percentage of people whose recall was at least 2 units higher than the gold standard on the metric in question (e.g., at least 2 days more for the day metric; at least 2 rating points more on the rating scale); Disparity ≤ −2 is the corresponding percentage whose recall was at least 2 units lower; * p < 0.05 for the test using the null hypothesis that ρ² = 0, or for the null hypothesis that the intercept = 0, or for the null hypothesis that β1 = 1.00, or for the null hypothesis that the mean absolute disparity = 0.

Table 3 also presents the squared correlation between ratings based on the gold standard and recall measures. This represents the proportion of variation in the recall measure that is accounted for by the gold standard. The proportion of explained (accounted for) variance is an index of how strong the association is between the gold standard and the recall score. The index ranges from zero (no predictive value) to one (perfect correspondence, except for constant error that is reflected in the intercept). In general, the gold standard accounts for about 50% of the variance in the recall measure (and vice versa), with a few notable exceptions. For example, for the recall of the number of days on which pain was experienced, only about 25% of the variation could be explained by the respective gold standard (r² = 0.25 in Table 3), with about 75% of the variation in the recall measure reflecting factors other than the true symptom state.

Finally, Table 3 presents the average absolute disparity, that is, the absolute difference between the gold standard and the recall score as calculated for each individual and then averaged. To illustrate, recalled reports of the worst abdominal pain experienced in the past week were typically (on average) “off” from the true worst pain in the past week by about 1.15 rating scale units (see Table 3). The final columns of Table 3 report the percentage of individuals whose recall was at least 2 units higher or lower, respectively, than the gold standard (2 days for the days metric and 2 rating scale points for the rating scale). The number of stools recalled operates on a very different metric than intensity-based rating scales, so for this outcome we also calculated the percent of individuals who reported overestimates of 2 or more, 4 or more, and 5 or more stools per week. These were 30%, 18%, and 14%, respectively. The percent of individuals who reported underestimates of 2 or more, 4 or more, or 5 or more stools per week was 31%, 16%, and 10%, respectively. These data are diagrammatically presented in Figure 1.
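As a rough illustration of these disparity indices, the sketch below computes them for a hypothetical four-patient sample. The function name `disparity_summary`, its keys, and the data are assumptions; the signed mean is included only to contrast it with the absolute disparity.

```python
def disparity_summary(recall, gold, threshold=2):
    """Summarize per-person recall error relative to the diary index."""
    diffs = [r - g for r, g in zip(recall, gold)]
    n = len(diffs)
    return {
        # Signed mean: over- and underestimates can cancel here.
        "mean_signed": sum(diffs) / n,
        # Mean absolute disparity: errors cannot cancel.
        "mean_absolute": sum(abs(d) for d in diffs) / n,
        # Percent recalling at least `threshold` units above the diary index.
        "pct_over": 100 * sum(d >= threshold for d in diffs) / n,
        # Percent recalling at least `threshold` units below the diary index.
        "pct_under": 100 * sum(d <= -threshold for d in diffs) / n,
    }


gold = [5, 5, 5, 5]    # hypothetical diary-based worst pain
recall = [3, 7, 4, 6]  # hypothetical recalled worst pain
summary = disparity_summary(recall, gold)
print(summary)
```

Here the signed mean is zero although every patient misreports by at least one unit, which is why the absolute disparity and the two percentage columns are reported alongside the group means.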

Figure 1.

Discussion

This study sought to assess the accuracy of retrospective recall of GI symptoms in patients with more severe IBS. At the group level, we found that patients tended to recall accurately certain (but not all) GI symptoms when retrospective recall of symptoms at the week’s end is compared to multiple end of day ratings (our gold standard) captured for 1 week. We found relatively good agreement between multiple end of day and end of week reports for stool frequency, worst intensity of pain and urgency, and urgency days. However, an overestimation bias for group level recall was observed for average intensity of pain and urgency as well as frequency of disordered stool form (Types 1–2, 6–7) and clinical pain (>2 on 11 point pain scale) days. A possible explanation for this overestimation bias is that patients attend to the more extreme episodes and disregard low-intensity, symptom-free episodes, or in the case of stool consistency, normal bowel events when they are asked to summarize somatic experiences. In other words, more extreme symptom experiences are easier to remember and recall over 1 week than symptom free or low intensity events.

Although the relative group-level accuracy of worst pain intensity lends support for its use as an endpoint in clinical trials for IBS, we are reluctant to recommend it. Because it focuses selectively on extreme symptom experiences and neglects those at lower or non-existent intensities, a maximum rating does not fully capture a symptom experience that is representative of how patients feel. In this respect, it misses the mark of an optimally valid and meaningful PRO (2). If ratings of symptoms at their worst ignore episodes at lower magnitudes, then their sensitivity to detect genuine treatment changes may be blunted. Such ratings run the risk of handicapping the efficacy profile of treatments that effectively reduce but do not eliminate symptoms (an accepted goal for therapies of chronic diseases for which there are no cures), because more extreme residual symptoms retain greater salience than those that subside in intensity or resolve completely and fade from memory when patients recall or summarize symptom information. We believe it may be wiser to aggregate multiple end of day symptom ratings as collected through electronic diaries (i.e., our gold standard), as long as the advantages of such assessments (efficiency, affordability, negligible distortion from forgetting and recall biases) are retained and the limitations that come with diary data collection methods (e.g., compliance) are minimized. A clinical advantage of focusing on multiple ratings of average intensity across 7 days is that it corresponds with the treatment goal of physicians, which is controlling symptom intensity on average (20).

Although we observed a group level tendency to overestimate average pain with recall measures (see Table 2), such bias is not necessarily problematic, depending on one’s purposes. For example, in an RCT, if the group level tendency to overestimate typical pain operates to the same extent in both the treatment and control groups, then the true mean difference between the groups on typical pain (which indexes treatment efficacy) will be preserved with recall measures because the error represents constant error that is differenced out when group differences are calculated. Consider, for example, clinical researchers interested in the analgesic properties of X and Y treatments by comparing a shift in the mean typical pain ratings in patients as a function of treatment X versus Y. If patients overestimate the typical pain intensity by 30%, this bias will be operative for both groups and should not affect the estimate of the mean group difference comparing X and Y (21, 28). The operation of recall bias does not necessarily invalidate conclusions regarding the relative efficacy profile of two treatments. On the other hand, if researchers or clinicians seek to make absolute rather than relative statements about aspects of GI symptoms (e.g., the actual number of stools passed per week or the actual number of pain days that individuals in a given treatment group achieve), then overestimation bias is more problematic and can lead to erroneous conclusions.
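The differencing argument can be written out explicitly. In the sketch below the symbols are introduced purely for illustration and do not appear in the original analysis: $\mu_X$ and $\mu_Y$ are the true mean typical pain levels under treatments X and Y, $\bar{R}_X$ and $\bar{R}_Y$ are the corresponding recalled means, $c$ is an additive bias, and $k$ is a proportional bias, each assumed identical in both groups.

```latex
\begin{align*}
\bar{R}_X - \bar{R}_Y &= (\mu_X + c) - (\mu_Y + c) = \mu_X - \mu_Y
  && \text{(additive bias } c \text{ cancels)} \\
\bar{R}_X - \bar{R}_Y &= (1 + k)\,\mu_X - (1 + k)\,\mu_Y = (1 + k)(\mu_X - \mu_Y)
  && \text{(proportional bias } k \text{ rescales)}
\end{align*}
```

Under these assumptions an additive overestimate leaves the group difference unchanged, whereas a strictly proportional overestimate (for example $k = 0.30$) would preserve the direction of the difference but scale its magnitude by $1 + k$.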

As noted, it is possible that good group level agreement between the gold standard and recall conceals recall bias at the individual level, as overestimators cancel out underestimators. This phenomenon was observed for all 4 symptoms (urgency days, stool frequency, worst pain, worst urgency) that showed relatively good group level agreement between the gold standard and retrospective symptom reports in Table 2. For example, group level data showed a non-significant difference between means for the recall and the gold standard for urgency days (3.43 vs 3.44 days). Analysis of individual-level data, however, suggests that recall bias was operative in this outcome for segments of patients. Specifically, thirty-five percent of patients reported a disparity of +/− 2 days or more when they recalled the number of urgency days. Similar patterns of data characterized recall of worst abdominal pain ratings; while aggregate data suggest good agreement between gold standard and recalled data (6.69 vs 6.66), individual level data indicate that 27% of patients recalled intensity of worst pain that deviated at least 2 units from end of day ratings.

As with the group level analyses, the clinical implications of such results depend on whether a relative or absolute clinical judgment is being made by the clinician based on patient recall. If a clinician (or researcher) relies on symptom reduction as a way to understand improvement and guide decision making, then recall bias that is operating at one point in time (e.g., prior to treatment) likely will operate at another point in time (after treatment) and the true improvement will be unaffected by bias because it is differenced out when comparisons are made. By contrast, if clinical decisions are based on absolute levels of a symptom (rather than improvement per se), then patient level recall bias is consequential. Individual level bias would make it difficult, for example, for clinicians to determine the absolute number of “bad” vs “good” pain days a patient has experienced because, in the case of overestimation, the bias distorts reports upwards (i.e., there is an over-reporting of the number of bad days). As seen in Table 3, for a sizable proportion of patients, retrospective evaluation yields overestimation of pain levels that equals or exceeds what is considered a clinically significant effect of an analgesic agent (two points or a 30% reduction on an 11 point pain scale, 29). This may be particularly problematic at the low end of intensity scales, where an overestimation bias can elevate objectively mild symptom levels to within clinical range (e.g., from 2/10 to 4/10 on the NRS).

Individual level analyses indicated that equal proportions of patients (30% each) under- or overestimated the number of bowel movements they passed over a week (+/− 2 per week). Together, a majority (60%) of IBS subjects were unable to recall accurately the number of stools they passed per week. While a 2 unit deviation per week in stool frequency may seem trivial, an increase from baseline of one complete spontaneous bowel movement per week (a stool not induced by rescue medication and passed with a sensation of complete evacuation) registers as a clinically significant change for constipation predominant IBS patients (18).

Our data raise some concerns about relying on 1 week retrospective recall of average abdominal pain intensity to gauge usual pain experience in clinical or research settings. This is at odds with findings of other researchers who have found relative convergence between end-of-day and end of week typical pain ratings (19, 20) in patients with other pain disorders (e.g., low back pain). It is unclear whether the apparent agreement between recall and real time symptom level data in prior research is disorder-specific (e.g., back pain patients are more likely to recall accurately usual intensity of back pain than IBS patients are of abdominal pain). It is also possible that an unexplored recall bias due to some patients overestimating and others underestimating typical pain intensity underlies the agreement at the group level in prior research.

One area for further investigation is to determine whether a recall period other than 7 days is optimal. One might expect shorter reporting periods to yield less recall bias, but shorter periods are also more sensitive to the influence of random events or noise (e.g., hormonal fluctuations, occasional lapses of dietary restraint around food triggers) that would be averaged out across longer time periods. The challenge is to choose a reporting period whose length captures stable and meaningful tendencies in symptoms while at the same time maximizing recall accuracy. We selected a 7 day period because it has been identified as an optimal reporting period for IBS symptoms (30) and one that typifies the frame of reference used in clinical practice and many clinical trials (8, 9). Research is needed to explore other possibilities in light of our results. A reporting period of several days could, if validated against random noise influences, be useful in clinical settings because it might circumvent the recall bias that arises when longer time periods are used. One could envision patients using a smartphone application to track GI symptoms over several days before an office appointment with the goal of sharing these data with their gastroenterologist. However, practical constraints of this approach would also need to be addressed (e.g., cost, accommodating patients without smartphones, privacy, security).
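The noise side of this trade-off can be made concrete with a rough sketch. It assumes that day-to-day fluctuations around a patient's typical level are independent with a common standard deviation, in which case the standard error of the period mean shrinks with the square root of the number of days. That independence assumption is strong (symptom days are probably correlated), the value of `daily_sd` is hypothetical, and the sketch says nothing about recall bias, which is expected to move in the opposite direction as the period lengthens.

```python
import math

def se_of_period_mean(daily_sd, n_days):
    """Standard error of the mean of n_days daily ratings.

    Assumes independent day-to-day noise with standard deviation daily_sd
    (an illustrative simplification, not a result from the study).
    """
    if n_days < 1:
        raise ValueError("n_days must be at least 1")
    return daily_sd / math.sqrt(n_days)

daily_sd = 2.0  # hypothetical day-to-day SD on a 0-10 rating scale
for n_days in (1, 3, 7, 14):
    se = se_of_period_mean(daily_sd, n_days)
    print(f"{n_days:>2} days: SE of period mean = {se:.2f}")
```

Under these assumptions, a several-day diary already removes a good share of the random noise present in a single day's rating, which is part of why a shorter, prospectively recorded period may be worth validating.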

The percent of variance that actual symptoms accounted for in recall measures tended to be, in general, somewhat low (see Table 3). Real time data accounted for as little as 22% of the variance in some recalled symptoms, particularly abdominal pain, urgency and disordered stool consistency. Additional research is needed to understand the processes, other than true symptom variability, that cause variability in retrospective symptom reports. Identifying the factors that more fully explain the judgments patients make about their symptoms (e.g., motivational states, personality traits, pain, or mood at the time of diary completion) can help clinicians and researchers make more informed judgments about treatment outcomes. In clinical trials, measuring these processes and covarying them out during statistical modeling will increase the statistical power associated with the analysis.
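One common way to express "variance accounted for" is the squared Pearson correlation between the real time criterion and the recalled value. The sketch below uses that definition as an assumption; the estimator actually used for Table 3 may differ, and the function name `variance_explained` and the sample ratings are hypothetical.

```python
from statistics import mean

def variance_explained(actual, recalled):
    """Squared Pearson correlation between criterion and recalled values."""
    if len(actual) != len(recalled) or len(actual) < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    mean_a, mean_r = mean(actual), mean(recalled)
    sxy = sum((a - mean_a) * (r - mean_r) for a, r in zip(actual, recalled))
    sxx = sum((a - mean_a) ** 2 for a in actual)
    syy = sum((r - mean_r) ** 2 for r in recalled)
    if sxx == 0 or syy == 0:
        raise ValueError("variance explained is undefined for constant data")
    return sxy ** 2 / (sxx * syy)

# Hypothetical ratings for 4 patients (illustration only).
diary_mean_pain = [1, 2, 3, 4]
recalled_pain = [2, 1, 4, 3]

r_squared = variance_explained(diary_mean_pain, recalled_pain)
print(f"variance in recall accounted for by diary ratings: {r_squared:.0%}")
```

The remainder, one minus this proportion, is the share of recall variability that would have to be explained by factors other than the symptoms themselves, such as mood at the time of reporting or measurement error.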

There are several limitations to this study which warrant discussion. First, this is a select, small scale study (see Table 1), so the results may not generalize to other populations and settings. Second, there were missing data that had to be handled by making assumptions about the underlying missing data mechanisms, which, as in all research with missing data, is of concern. We did, however, find minimal bias in terms of the type of people who completed at least five days of data versus those who did not. Third, we relied on end-of-day symptom ratings as our gold standard. It could be argued that there is some error in end-of-day measures as well. However, the broader literature on recall (21–24, 31) suggests that end-of-day recall is not susceptible to the major recall biases that afflict longer reporting periods. It remains to be seen whether intensive monitoring of GI symptoms multiple times a day (6) confers any incremental validity over aggregated end-of-day ratings without imposing undue participant burden and cost. Intensive monitoring may be better suited for symptoms that fluctuate continuously (e.g., chronic low back pain, fibromyalgia, arthritis, chronic fatigue) than for episodic ones like those of IBS. Fourth, we focused on a select number of GI symptoms that have been proposed as endpoints (pain, urgency, stool frequency, stool consistency) in IBS trials (18, 32) to the exclusion of others (e.g., bloating, incomplete evacuation, straining) that are important aspects of the IBS experience. Despite these limitations, we believe the results of the present study are intriguing, that they have both research and clinical implications, and that they set the stage for what we think is much needed future research on PROs for IBS and other GI disorders whose management depends on patient self-report.

Key Message.

While clinicians and researchers rely on retrospective recall of GI symptoms, its accuracy has not been validated. We found that, as a group, patients tend to recall accurately certain (but not all) GI symptoms when retrospective recall of symptoms at the week’s end is compared to multiple end of day ratings captured electronically for 1 week. Agreement between multiple end of day and end of week reports was good for stool frequency, worst intensity of pain and urgency, and urgency days. When data were analyzed at an individual level representative of clinical practice, subgroups of subjects had difficulty recalling symptoms accurately even though the group level data suggested convergence between recall and real time reports.

Acknowledgments

This study was funded by NIH Grant DK77738 and ARRA Supplement DK53317. We are grateful to Mark Schneggenburger and Dr Raymond Dannenhoffer of the UB Office of Medical Computing for their assistance and technical support developing symptom tracking for electronic diaries. We would like to thank Dr Charles Baum for his constructive comments on a previous draft of this manuscript as well as members of the IBSOS Research Group (Jason Bratton, Gregory Gudleski, Leonard Katz, Susan Krasner, Chang-Xing Ma, Sarah Quinton, Christopher Radziwon) for support of various aspects of the research reported in this manuscript. We are indebted to Dr. Frank Hamilton, NIDDK Project Scientist for the IBSOS, for his encouragement and support of this study.

Abbreviations

GI: gastrointestinal
BSFS: Bristol Stool Form Scale
RCT: randomized clinical trial
FDA: Food and Drug Administration
IBSOS: Irritable Bowel Syndrome Outcome Study
PRO: patient reported outcome
IBD: inflammatory bowel disease
NRS: numerical rating scale

Footnotes

Specific author contributions: Jeffrey Lackner participated in study design, data collection, data analysis, and manuscript preparation; Jim Jaccard, study design, data analysis, and manuscript preparation; Rebecca Firth, study design, data collection, and manuscript preparation; Laurie Keefer, manuscript preparation and data collection; Darren Brenner, manuscript preparation and data collection; Ann Marie Carosella, data analysis; Michael Sitrin, manuscript preparation. Jeffrey Lackner supervised the study.

Guarantor of the article: Jeffrey Lackner

Potential competing interests: None

References

1. Stone AA, Shiffman S. Capturing momentary, self-report data: a proposal for reporting guidelines. Ann Behav Med. 2002;24:236–43. doi: 10.1207/S15324796ABM2403_09.
2. U.S. Department of Health and Human Services Food and Drug Administration. Guidance for Industry: Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. 2009. Available from: http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM193282.pdf.
3. Redelmeier DA, Katz J, Kahneman D. Memories of colonoscopy: a randomized trial. Pain. 2003;104:187–94. doi: 10.1016/s0304-3959(03)00003-4.
4. Redelmeier DA, Kahneman D. Patients’ memories of painful medical treatments: real-time and retrospective evaluations of two minimally invasive procedures. Pain. 1996;66:3–8. doi: 10.1016/0304-3959(96)02994-6.
5. Bradburn NM, Rips LJ, Shevell SK. Answering autobiographical questions: the impact of memory and inference on surveys. Science. 1987;236:157–61. doi: 10.1126/science.3563494.
6. Weinland SR, Morris CB, Hu Y, et al. Characterization of episodes of irritable bowel syndrome using ecological momentary assessment. Am J Gastroenterol. 2011;106:1813–20. doi: 10.1038/ajg.2011.170.
7. Rey E, Locke GR 3rd, Jung HK, et al. Measurement of abdominal symptoms by validated questionnaire: a 3-month recall timeframe as recommended by Rome III is not superior to a 1-year recall timeframe. Aliment Pharmacol Ther. 2010;31:1237–47. doi: 10.1111/j.1365-2036.2010.04288.x.
8. Stone AA, Broderick JE, Kaell AT, et al. Does the peak-end phenomenon observed in laboratory pain studies apply to real-world pain in rheumatoid arthritics? The journal of pain: official journal of the American Pain Society. 2000;1:212–217. doi: 10.1054/jpai.2000.7568.
9. Salovey P, Sieber W, Jobe J, et al. The recall of physical pain. In: Schwartz N, Sudman S, editors. Autobiographical Memory and the Validity of Retrospective Reports. New York, NY: Springer-Verlag; 1993. pp. 89–106.
10. Stone AA, Shiffman S, Schwartz JE, et al. Patient non-compliance with paper diaries. BMJ. 2002;324:1193–4. doi: 10.1136/bmj.324.7347.1193.
11. Drossman DA, Corazziari E, Talley NJ, et al. The functional gastrointestinal disorders: Diagnosis, pathophysiology and treatment: A multinational consensus. 2. McLean, VA: Degnon Associates; 2006. Rome III.
12. Drossman DA, Toner BB, Whitehead WE, et al. Cognitive-behavioral therapy versus education and desipramine versus placebo for moderate to severe functional bowel disorders. Gastroenterology. 2003;125:19–31. doi: 10.1016/s0016-5085(03)00669-3.
13. Lackner JM, Jaccard J, Krasner SS, et al. Self-administered cognitive behavior therapy for moderate to severe irritable bowel syndrome: clinical efficacy, tolerability, feasibility. Clin Gastroenterol Hepatol. 2008;6:899–906. doi: 10.1016/j.cgh.2008.03.004.
14. Lackner JM, Keefer L, Jaccard J, et al. The Irritable Bowel Syndrome Outcome Study (IBSOS): rationale and design of a randomized, placebo-controlled trial with 12 month follow up of self- versus clinician-administered CBT for moderate to severe irritable bowel syndrome. Contemp Clin Trials. 2012;33:1293–310. doi: 10.1016/j.cct.2012.07.013.
15. Turk DC, Dworkin RH, Burke LB, et al. Developing patient-reported outcome measures for pain clinical trials: IMMPACT recommendations. Pain. 2006;125:208–15. doi: 10.1016/j.pain.2006.09.028.
16. Lewis SJ, Heaton KW. Stool form scale as a useful guide to intestinal transit time. Scand J Gastroenterol. 1997;32:920–4. doi: 10.3109/00365529709011203.
17. Diggs C, Meyer WA, Langenberg P, et al. Assessing Urgency in Interstitial Cystitis/Painful Bladder Syndrome. Urology. 2007;69:210–214. doi: 10.1016/j.urology.2006.09.053.
18. Food and Drug Administration; U.S. Department of Health and Human Services, editor. Guidance for Industry on Irritable Bowel Syndrome — Clinical Evaluation of Drugs for Treatment. May, 2012.
19. Bolton JE. Accuracy of recall of usual pain intensity in back pain patients. Pain. 1999;83:533–9. doi: 10.1016/S0304-3959(99)00161-X.
20. Jensen MP, McFarland CA. Increasing the reliability and validity of pain intensity measurement in chronic pain patients. Pain. 1993;55:195–203. doi: 10.1016/0304-3959(93)90148-I.
21. Jaccard J, McDonald R, Wan CK, et al. Recalling sexual partners: the accuracy of self-reports. J Health Psychol. 2004;9:699–712. doi: 10.1177/1359105304045354.
22. Jaccard J, Helbig DW, Wan CK, et al. The prediction of accurate contraceptive use from attitudes and knowledge. Health Educ Q. 1996;23:17–33. doi: 10.1177/109019819602300102.
23. Jaccard J, McDonald R, Wan CK, et al. The Accuracy of Self-Reports of Condom Use and Sexual Behavior. Journal of Applied Social Psychology. 2002;32:1863–1905.
24. Stone AA, Broderick JE, Schwartz JE. Validity of average, minimum, and maximum end-of-day recall assessments of pain and fatigue. Contemp Clin Trials. 2010;31:483–90. doi: 10.1016/j.cct.2010.06.004.
25. Maxwell SE, Delaney HD. Designing experiments and analyzing data: A model comparison perspective. 2. Mahwah, N.J: Lawrence Erlbaum Associates; 2004.
26. Wilcox R. Introduction to robust estimation and hypothesis testing. 3. San Diego: Academic Press; 2012.
27. Irvine EJ, Whitehead WE, Chey WD, et al. Design of Treatment Trials for Functional Gastrointestinal Disorders. Gastroenterology. 2006;130:1538–1551. doi: 10.1053/j.gastro.2005.11.058.
28. Stull DE, Leidy NK, Parasuraman B, et al. Optimal recall periods for patient-reported outcomes: challenges and potential solutions. Curr Med Res Opin. 2009;25:929–42. doi: 10.1185/03007990902774765.
29. Farrar JT, Portenoy RK, Berlin JA, et al. Defining the clinically important difference in pain outcome measures. Pain. 2000;88:287–294. doi: 10.1016/S0304-3959(00)00339-0.
30. Norquist JM, Girman C, Fehnel S, et al. Choice of recall period for patient-reported outcome (PRO) measures: criteria for consideration. Qual Life Res. 2012;21:1013–20. doi: 10.1007/s11136-011-0003-8.
31. Broderick JE, Schwartz JE, Vikingstad G, et al. The accuracy of pain and fatigue items across different reporting periods. Pain. 2008;139:146–57. doi: 10.1016/j.pain.2008.03.024.
32. Mangel A, Wang J, Sherrill B, et al. Urgency as an Endpoint in Irritable Bowel Syndrome. Gastroenterology Research. 2011;4:9–12. doi: 10.4021/gr283e.
