Author manuscript; available in PMC: 2017 Aug 24.
Published in final edited form as: Gastrointest Endosc. 2016 Jan 6;84(1):115–125.e4. doi: 10.1016/j.gie.2015.12.029

A proposed staging system and stage-specific interventions for familial adenomatous polyposis

Patrick M Lynch 1, Jeffrey S Morris 2, Sijin Wen 3, Shailesh M Advani 1, William Ross 1, George J Chang 4, Miguel Rodriguez-Bigas 4, Gottumukkala S Raju 1, Luigi Ricciardiello 5, Takeo Iwama 6, Benedito M Rossi 7, Maria Pellise 8, Elena Stoffel 9, Paul E Wise 10, Lucio Bertario 11, Brian Saunders 12, Randall Burt 13, Andrea Belluzzi 14, Dennis Ahnen 15, Nagahide Matsubara 16, Steffen Bülow 17, Niels Jespersen 17, Susan K Clark 18, Steven Erdman 19, Arnold J Markowitz 20, Inge Bernstein 21, Niels De Haas 21, Sapna Syngal 22, Gabriela Moeslein 23
PMCID: PMC5570515  NIHMSID: NIHMS878056  PMID: 26769407

Abstract

Background

It is not possible to accurately count adenomas in many patients with familial adenomatous polyposis (FAP), yet polyp counts are critical in evaluating each patient’s response to interventions. Moreover, the U.S. Food and Drug Administration no longer recognizes a decrease in polyp burden as a sufficient endpoint for chemoprevention trials and instead requires a measure of “clinical benefit.” To develop endpoints for future industry-sponsored chemoprevention trials, the International Society for Gastrointestinal Hereditary Tumors (InSiGHT) developed an FAP staging and intervention classification scheme for lower GI tract polyposis.

Methods

Twenty-four colonoscopy or sigmoidoscopy videos were reviewed by 26 clinicians familiar with the diagnosis and treatment of FAP. The reviewers independently assigned a stage to each case using the proposed system and chose a stage-specific intervention for each case. Our endpoint was the degree of concordance among reviewers’ staging and intervention assessments.

Results

The stage and intervention ratings of the 26 reviewers were highly concordant (ρ= 0.710; 95% credible interval, 0.651–0.759). Sixty-two percent of reviewers agreed on FAP stage, and 90% of scores were within ±1 stage of the mode. Sixty percent agreed on the intervention, and 86% chose an intervention within ±1 level of the mode.

Conclusions

The proposed FAP colon polyposis staging system and stage-specific intervention is based on a high degree of agreement on the part of experts in the review of individual cases of polyposis. Therefore, reliable and clinically relevant means for measuring trial outcomes can be developed. Outlier cases showing wide scatter in stage assignment call for individualized attention and may be inappropriate for enrollment in clinical trials for this reason.

Keywords: Neoplasm Staging, familial adenomatous polyposis, chemoprevention, classification

Introduction

It is virtually impossible to accurately count adenomas during endoscopy in many patients with familial adenomatous polyposis (FAP). Nevertheless, polyp counts are critical in evaluating a patient’s response to chemopreventive agents. In addition, there has been virtually no guidance for endoscopists and surgeons in determining when surgery should be performed. More pointedly, the determination by the U.S. Food and Drug Administration (FDA) that new chemopreventive agents must meet a higher standard of clinical benefit to gain approval has left the FAP community speculating as to what such a standard really calls for. Members of the International Society for Gastrointestinal Hereditary Tumors (InSiGHT) undertook the described study to develop a staging and staged intervention system that would provide an acceptable measure of clinical benefit in future industry-sponsored chemoprevention trials and other interventions in FAP.

In 1989, Spigelman et al proposed a staging system for duodenal adenomas in patients with FAP.1 This system has enabled clinicians to monitor patients more effectively and has guided clinical interventions. Unfortunately, no corresponding staging system exists for adenomas in the colon and rectum in either the pre- or postoperative setting, perhaps because some clinicians perform colectomy or proctocolectomy soon after diagnosis of colorectal adenomas, regardless of severity. Many clinicians, however, use extent of “polyp burden” and clinical judgment to determine the timing of colectomy. Both are subjective and individual-based, which indicates a need for standardization.

A diagnosis of FAP is typically established on the basis of adenomatous polyposis coli (APC) gene testing, and adenomas can be found in patients as young as age 10 or 12.2, 3 Although it is normal practice to operate at an early point in the evolution of FAP, there has been a tendency to defer surgery in these young patients. Improvements in endoscopes and better, safer anesthesia for pediatric use have made full colonoscopy a very acceptable procedure in children. There is also value in waiting for the rectum to “declare itself” insofar as the development of adenoma burden is concerned, so that surgeons can better select the appropriate operation: colectomy or proctocolectomy.4 Conversely, much older patients with attenuated FAP and mutY homolog (MUTYH)-associated polyposis may initially be diagnosed with a very mild adenoma burden at age 50 or later.5, 6 An unknown but small fraction of such patients can be managed conservatively, with periodic multiple polypectomies without surgery.

This emerging diversity in FAP presentation, diagnosis, and treatment has not, of itself, been enough to stimulate the development of a colorectal polyposis staging system. However, in 2011, the U.S. Food and Drug Administration (FDA) stated that it would no longer approve, much less accelerate approval of, chemopreventive agents for the treatment of premalignant conditions such as FAP on the basis of reduction in polyp number and size alone; a clearer demonstration of “clinical benefit” would be required.7, 8 At the 2011 meeting of the International Society for Gastrointestinal Hereditary Tumors (InSiGHT), a group of FAP experts met with pharmaceutical leaders interested in responding to the FDA’s “clinical benefit” challenge. The experts agreed that demonstrating clinical benefit would require development of clinically relevant signposts of FAP progression that would also serve as primary endpoints for clinical trials of chemopreventive therapies. Also, treatment response or progression would have to be couched in oncologically meaningful terms, despite the fact that FAP-related deaths are uncommon because of current intensive endoscopic surveillance and surgical prophylaxis. To be clinically meaningful, progressive disease stage would need to be linked to progressively more aggressive interventions. A staging system for colorectal polyposis akin to the Spigelman et al staging system for duodenal polyposis might, thus, provide objective and clinically relevant measures of time to disease progression as well as disease regression. As a subgroup of the FAP experts who met in 2011, we undertook the development and testing of such a staging system.

As detailed below, we created a scale, the InSiGHT polyposis staging system (IPSS), that divides colorectal polyposis into 5 progressive stages based on adenoma number and size. Degree of dysplasia, age, and desmoid disease were not considered in developing the IPSS. We then created a corresponding scale specifying the endoscopic, surgical, and/or chemopreventive interventions considered appropriate to the adenoma burden. Recognizing that clinical staging and interventions are based on expert opinion, we convened a panel of experts (endoscopists and surgeons) to review edited videos of colonoscopies or sigmoidoscopies (the latter in cases of prior colectomy or proctocolectomy). Our endpoint was the degree of agreement among the experts in assigning a given video to one of the 5 predefined IPSS stages and, further, in proposing appropriate interventions for the stages they assigned.

Methods

Development of IPSS Staging System: At the 2011 annual InSiGHT meeting, the need for a staging system for colorectal polyposis was recognized, in response to the FDA position requiring a measure of “clinical benefit” for new drug approval. Therefore, we developed an arbitrary classification system for progressive categories of colorectal polyposis severity and a means for validating that classification. The categories were developed by the primary author (P.M.L.) with the expectation that a given range of severity should lend itself to interventions appropriate to that degree of severity. Delay in progression from one stage to a higher stage, or regression to a lower stage, should translate into a change in the necessary intervention and thus constitute a worthwhile measure of clinical benefit. The proposed classification is shown in Figures 1 and 2. The initial test of suitability of this staging system was to be based on review of a large number of videos of FAP cases by a large panel of clinicians, most of whom are recognized experts in FAP management. The system was developed to represent broad groupings of polyp burden (in both number and size) so that a given case could be assigned to a broad category with reasonable confidence but without the need to attempt an accurate polyp count. A 5-point numbered scale (0–4) was used for the staging system, and a 5-point letter scale (A–E) was used for classifying stage-specific interventions. The system was developed so that each intervention level (A–E) would correspond to the stage identified. However, the reviewers were not notified about this correspondence, to avoid biasing them or directing them toward a specific intervention.

Fig. 1. Proposed InSiGHT staging system classification and clinical interventions for colonic polyposis

Fig. 2. Proposed InSiGHT staging system classification and clinical interventions for post-colectomy cases with ileorectal anastomosis.

Data Collection

We contacted InSiGHT members who are experts in the field of FAP. These members were emailed a detailed description of the study and were asked to respond via email if they were interested in participating. Those who agreed to participate formed our list of reviewers. Participation in the study implied informed consent. Because this study did not pose any harm or risk to the reviewers, no signed informed consent was needed. A total of 29 experts agreed to participate in the review of 24 videos; 26 (90%) completed the study. The study was approved by the Institutional Review Board at The University of Texas MD Anderson Cancer Center with a waiver of consent for use of the previously obtained and de-identified videos that comprised the study material.

We collected archived and de-identified videos of colonoscopies and sigmoidoscopies performed during earlier multicenter chemoprevention trials conducted at MD Anderson Cancer Center, the Cleveland Clinic, and St. Mark’s Hospital.9 One of the authors (P.M.L.) selected 24 videos from the archive to represent a range of FAP severity. These videos represent the typical distribution of FAP cases that we experience in clinical settings. They were taken from patients who met the criteria for FAP and had participated in previous chemoprevention trials.12 The videos were loaded into a video editing program (Corel VideoStudio Pro X7) and edited to capture total adenoma burden while preventing reviewer fatigue by eliminating extraneous footage (all videos ran <2 minutes). The de-identified, sequentially numbered videos were then transferred to USB thumb drives and mailed to reviewers. The USB drives also included an instruction page with a link to the data-recording site in SurveyMonkey. Raters were provided with tables displaying the proposed IPSS guidelines for staging FAP of the colon or FAP of the rectum only (for post-colectomy cases) and the proposed stage-specific interventions, which were arbitrarily chosen for the purposes of this study (Figures 1 and 2).

The reviewers were InSiGHT members known to be experienced with FAP and other institutional colleagues recommended by these members. Demographic characteristics of the reviewers are summarized in Table 1. Reviewers received nominal reimbursement for their participation in the video review and scoring process. (Authors did not report any conflicts of interest related to this reimbursement. Additional disclosures can be found in the disclosures section.) In addition to assigning a stage and choosing a recommended level of intervention for each FAP case depicted in the videos, reviewers were asked to provide comments after scoring each video. Reviewers were also required to designate themselves as either surgeons or endoscopists and to record their annual FAP patient volume. Having scored the videos and assigned a recommended intervention, the reviewers were then asked to rate the utility of the IPSS and of the stage-specific interventions using a 5-point visual analogue scale ranging from “strongly agree” to “strongly disagree.”

Table 1. Summary characteristics of the 26 reviewers for the IPSS (N=26)

Characteristic                          N (%)

Gender
 Male                                   20 (77)
 Female                                 6 (23)

Specialty
 Colorectal surgery                     13 (50)
 Gastroenterology                       13 (50)

Clinical category
 Endoscopist                            14 (54)
 Surgeon                                12 (46)

Workplace setting
 Clinical                               5 (19)
 Academic                               21 (81)

No. of FAP patients seen every year
 0–5                                    3 (11)
 6–10                                   4 (15)
 11–20                                  6 (24)
 ≥21                                    13 (50)

Statistical Design

For each video, we provided reviewers with electronic scoring sheets consisting of 2 ordinal 5-point scales as discussed above, and our goal was to assess multi-rater concordance based on these ordinal scales. Because there are no standard measures of concordance that apply to our setting with an ordinal (5-point) scale and multiple raters, we used a Bayesian multiple-rater model to assess the concordance of the ordinal data across raters.10, 11 This method allowed us to estimate the rater variation and overall variation and, therefore, to obtain a model-based intraclass correlation coefficient (ICC), ρ, as our measure of concordance. The details of this measure and its calculation are provided in the supplemental materials. Briefly, we assumed that the variability across ratings had two components: rater-to-rater variability and video-to-video variability. The measure ρ indicates the proportion of total variability attributed to the video-to-video component and is constrained to be between 0 and 1. Thus, higher ρ indicates greater concordance, with ρ=1 indicating that all raters gave the same rating to each video, and ρ=0.5 indicating that the variability across raters was equal in magnitude to the variability across videos. We plotted the data in heat maps, which are graphical representations of tables that use colors to represent numbers. In the heat maps (Figures 3–5), the darker the boxes, the higher the proportion of reviewers in agreement.
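As an illustrative sketch of the two-component description above, ρ can be written as a ratio of variance components. The symbols σ²_video and σ²_rater are notation assumed here for the video-to-video and rater-to-rater components. The exact model-based definition is the one in the supplemental materials and may differ in detail.

```latex
\rho \;=\; \frac{\sigma_{\mathrm{video}}^{2}}{\sigma_{\mathrm{video}}^{2} + \sigma_{\mathrm{rater}}^{2}},
\qquad 0 \le \rho \le 1 .
```

Under this form, equal components give ρ = 0.5, and ρ approaches 1 as the rater-to-rater component shrinks toward zero, matching the interpretation given in the text.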

Fig. 3.

Fig. 3a. Heat map displaying proportion of IPSS scores by video, with videos ordered from lowest average stage (video 17) to highest average stage (video 10). IPSS: InSiGHT polyposis staging system

Fig. 3b. Heat map displaying InSiGHT polyposis staging system scores from 14 endoscopists by video

Fig. 3c. Heat map displaying InSiGHT polyposis staging system scores from 12 surgeons by video

Fig. 5. Heat maps displaying differences between recommended interventions and IPSS scores (intervention minus IPSS score) for each reviewer. Positive values indicate the reviewer recommended a higher intervention level than that corresponding to the assigned stage; negative values indicate the reviewer recommended a lower intervention level. IPSS: InSiGHT polyposis staging system

Fig. 5a. Heat map displaying proportion of videos with each difference value by rater

Fig. 5b. Heat map displaying proportion of raters with each difference value by video

Fig. 5c. Heat map displaying proportion of endoscopists with each difference value by video

Fig. 5d. Heat map displaying proportion of surgeons with each difference value by video

The multiple-rater model was fit using a Markov chain Monte Carlo algorithm with 10,000 samples after a burn-in of 5,000 used for inference. From these samples, we computed the posterior mean, standard error, and 95% credible interval for ρ and tested the null hypothesis that ρ≤0.5 by computing p=Prob(ρ≤0.5|data), rejecting the null hypothesis if p<0.05.
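The posterior summaries described above can be sketched as follows, assuming the retained MCMC draws of ρ are available as a plain list. The function name `summarize_posterior`, the order-statistic interval, and the synthetic draws are illustrative assumptions, not the authors' R code or data. The posterior standard deviation plays the role of the reported standard error.

```python
import math
import random
import statistics


def summarize_posterior(samples, threshold=0.5, level=0.95):
    """Summarize posterior draws of rho (hypothetical helper).

    Returns the posterior mean, posterior SD, an equal-tailed credible
    interval taken from order statistics, and the posterior probability
    that rho <= threshold (the 'p' compared against 0.05 in the text).
    """
    draws = sorted(samples)
    n = len(draws)
    tail = (1.0 - level) / 2.0
    # Floor/ceil make the interval slightly conservative (never too narrow).
    lo_idx = math.floor(tail * (n - 1))
    hi_idx = math.ceil((1.0 - tail) * (n - 1))
    return {
        "mean": statistics.fmean(draws),
        "sd": statistics.stdev(draws),
        "interval": (draws[lo_idx], draws[hi_idx]),
        "p_null": sum(d <= threshold for d in draws) / n,
    }


# Synthetic stand-in draws, NOT the study's MCMC output.
rng = random.Random(0)
synthetic_draws = [rng.betavariate(200, 82) for _ in range(10_000)]
summary = summarize_posterior(synthetic_draws)
print(summary)
print("reject rho <= 0.5:", summary["p_null"] < 0.05)
```

Burn-in draws would be discarded before calling the helper, so only the draws used for inference are summarized.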

Because there is no standard sample size software for this multiple rater ordinal measure of concordance, we performed simulations to determine a sample size that would provide sufficient power to detect a strong concordance. We simulated 100 trials with 24 raters and 24 videos with 5 different scenarios: (1) the rater variation is the same as the video variation (ρ= 0.5); (2) the rater variation is 1/2 of the video variation (ρ= 0.67); (3) the rater variation is 3/7 of the video variation (ρ= 0.70); (4) the rater variation is 1/3 of the video variation (ρ= 0.75); and (5) the rater variation is 1/4 of the video variation (ρ= 0.80). We concluded that there was significant agreement between raters if the chance of ρ being < 0.5 was very small (<0.05).
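The ρ values attached to the five scenarios follow from the two-component description of ρ: if the rater variation is a fraction r of the video variation, then ρ = 1/(1 + r). The short sketch below checks that arithmetic. The function name `rho_from_ratio` is hypothetical, and "variation" is assumed to mean the variance component.

```python
from fractions import Fraction


def rho_from_ratio(rater_to_video_ratio):
    """ICC implied when rater variation = ratio * video variation.

    rho = video / (video + rater) = 1 / (1 + ratio)
    """
    return 1 / (1 + rater_to_video_ratio)


# The five simulation scenarios listed in the text, as rater/video ratios.
scenarios = [Fraction(1), Fraction(1, 2), Fraction(3, 7), Fraction(1, 3), Fraction(1, 4)]
for ratio in scenarios:
    print(f"rater/video = {ratio}: rho = {float(rho_from_ratio(ratio)):.2f}")
```

Exact fractions are used so that, for example, the 3/7 scenario gives exactly 7/10 and is not affected by floating-point rounding.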

Our simulation showed that a sample size of 24 raters and 24 videos would have at least 83% power to show a concordance of ρ=0.70 (ie, the rater variation is 3/7 of the video variation). More results from the simulation are shown in Supplemental Tables 1 and 2. A weighted Cohen’s kappa was used to assess concordance of the assigned stages and interventions for each rater.12 R (Version 2.15.2) was used to conduct statistical analysis.

Results

Our study’s key finding was the demonstration of strong agreement in scoring across the 24 videos by the 26 reviewers. There was high concordance across raters, with a mean ρ of 0.710 (standard error 0.027; 95% credible interval, 0.651–0.759). From this, we rejected the null hypothesis that ρ≤0.5 (p<0.0001). A histogram of the posterior distribution of ρ is given in Supplemental Figure 1. We observed a statistically significant degree of agreement in both the staging of polyp burden and the selected interventions for a given stage. This agreement is important because clinical decision-making in the abstract is often quite different from that applied in real cases.

Heat maps of reviewer staging by video show that reviewers reached a high degree of agreement at the extremes of IPSS stage. Figure 3A shows, for each video, the proportion of raters who assigned each stage. The data reveal that at the highest and lowest levels of adenoma burden (stage 4 and stage 0), there was near-perfect concordance between observers in assigning a given video to a stage. Not surprisingly, perhaps, in the approximate midrange of adenoma burden there was greater scatter, though overall concordance remained high. We discerned wide agreement on most videos, with most scores falling within one stage “worse” or “better” than the modal value. In only 3 of the 24 videos did scores vary by more than one stage above or below the mode. Figures 3B and 3C present the raw ratings for each video for endoscopists and surgeons, respectively. Agreement between the surgeons and endoscopists was greatest at the extremes of severity, and the overall level of concordance was similar between the two groups.
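A minimal sketch of the "agreement with the mode" and "within one stage of the mode" summaries for a single video follows. The function name, the tie-breaking rule, and the example ratings are assumptions made for illustration. The study may have aggregated these proportions differently (for example, across all scores rather than per video).

```python
from collections import Counter


def modal_agreement(scores, tolerance=1):
    """Return (mode, share equal to mode, share within +/- tolerance of mode).

    Ties for the mode are broken toward the lowest stage; this is an
    arbitrary choice for the sketch, not a rule taken from the study.
    """
    counts = Counter(scores)
    top = max(counts.values())
    mode = min(stage for stage, count in counts.items() if count == top)
    exact = counts[mode] / len(scores)
    near = sum(abs(s - mode) <= tolerance for s in scores) / len(scores)
    return mode, exact, near


# Hypothetical stage ratings (0-4) from ten reviewers for one video.
example_scores = [2, 2, 2, 3, 1, 2, 4, 2, 3, 2]
mode, exact, near = modal_agreement(example_scores)
print(f"mode={mode}, exact={exact:.0%}, within one stage={near:.0%}")
```

A video whose "within one stage" share is low would be a candidate outlier of the kind discussed later in the text.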

Greater scatter was seen with respect to interventions (Figure 4). In general, however, reviewers either agreed with the proposed intervention for the stage to which they had assigned a given video or recommended an intervention within one incremental level of the proposed one. Heat maps (Figures 5A–D) were also constructed to demonstrate individual reviewers’ tendencies to recommend a more aggressive versus less aggressive intervention relative to a particular polyp burden. We derived a score by subtracting the numeric value of each assigned stage (0–4) from the numeric value of the stage corresponding to the assigned intervention (0–4, corresponding to A–E). A difference of zero indicated agreement with the stage-specific intervention proposed in Figure 1 or 2. A positive score indicated that the reviewer preferred a more aggressive intervention than that proposed by the researchers. Conversely, a negative score indicated that the reviewer preferred a less aggressive intervention for the polyposis burden shown. In most cases (20 of 26), the reviewers agreed with the stage-specific interventions provided by the researchers, and in 92% of cases reviewers chose an intervention within one level more or less aggressive than that provided in our proposed system. We also observed a modest difference by specialty with respect to the aggressiveness of the assigned intervention. Endoscopists were slightly more likely to have a positive treatment aggressiveness score (22.1% of cases) than were surgeons (15.7% of cases), indicating endoscopists were more likely to recommend more aggressive treatment. To assess concordance of the IPSS scores and interventions, Cohen kappa coefficients were computed. The kappa with squared weights was calculated between interventions and IPSS scores for each rater. The mean Cohen kappa from the 26 raters was 0.793, with a standard deviation of 0.188, demonstrating that raters predominantly tended to propose interventions coinciding with their IPSS staging for that patient. In addition, we ran a similar analysis to compare scoring based on reviewers’ gender and the number of FAP patients seen every year (≥11 vs ≤10 patients per year). There were no significant differences between these groups (see Supplemental Table 3).
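The difference score and the squared-weight kappa for one rater can be sketched as below, using the textbook definition of Cohen's kappa with disagreement weights (i − j)²/(k − 1)². The function names and example ratings are hypothetical, and the R routine actually used in the study may differ in detail.

```python
def intervention_minus_stage(stages, interventions):
    """Difference score: intervention level (A-E mapped to 0-4) minus stage (0-4)."""
    return ["ABCDE".index(letter) - stage for stage, letter in zip(stages, interventions)]


def quadratic_weighted_kappa(a, b, n_levels=5):
    """Cohen's kappa with squared (quadratic) disagreement weights.

    a and b are equal-length lists of integer ratings in 0..n_levels-1.
    """
    n = len(a)
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_levels)]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]
    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when both ratings use one identical level")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings from one reviewer across six videos.
stages = [0, 1, 2, 3, 4, 2]
interventions = ["A", "B", "C", "D", "E", "D"]
levels = ["ABCDE".index(letter) for letter in interventions]
print(intervention_minus_stage(stages, interventions))
print(round(quadratic_weighted_kappa(stages, levels), 3))
```

Squared weights penalize a two-level disagreement four times as heavily as a one-level disagreement, which suits ordinal scales such as stages 0–4 and interventions A–E.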

Fig. 4. Heat map displaying the proportion of raters (N=26) who assigned each intervention to each video, with rows representing reviewer-selected intervention scores ranging from A to E, and videos ordered from lowest to highest average scores

Of the 26 reviewers, 25 strongly agreed (17) or agreed (8) that “the development of a staging system for colorectal polyposis will be helpful in communicating with colleagues regarding patient status.” On the same scale, 22 of 26 strongly agreed (18) or agreed (4) that “the development of a staging system for colorectal polyposis will be helpful in evaluating endpoints in clinical chemoprevention trials.” There was also considerable support for the proposed staging system. Of 25 responses, 23 indicated that the reviewer agreed (21) or strongly agreed (2) with the proposition, “Subject to my specific comments in the scoring sheet above, I am in general agreement with the present proposed IPSS.” When asked to rate their general agreement with the proposed stage-specific intervention scale (subject, as above, to comments offered in the scoring sheet), reviewers expressed generally supportive, although more qualified, responses. Sixteen of the 26 reviewers agreed with the proposed interventions, whereas 8 were neutral and 2 disagreed with them. (See Supplemental Table 4 for these ratings.)

Discussion

Our study’s key finding was the demonstration of strong agreement in scoring across the 24 videos by the 26 reviewers.

We have developed a staging system for severity of colorectal polyposis in FAP for use in future industry-sponsored clinical trials. We sought to determine whether experts could reach consensus as to the appropriate stage assignment and intervention for a given case. However, surgeons and endoscopists who deal with FAP may assign the same video images to different stages. We had reason to anticipate some such variability, based on our earlier quantitative video-based study of adenoma burden, polyp number, and diameter.13 In that study, experts scored more than 20 colonoscopy videos using an “electronic abacus,” placing polyps into “bins” corresponding to three diametrical ranges. The measure of scatter led us to conclude that colorectal FAP could and should be classified into broad categories if a high degree of concordance needs to be achieved. We wanted to include enough reviewers to provide more statistical clarity than was possible in our earlier study. Polyposis staging was considered the straightforward part of this exercise. We expected that observers reviewing the same endoscopic video clip would agree within the broad ranges we provided. More challenging was the proposition that an arbitrary set of interventions would be agreed upon as well. Arguably, a reviewer could modify staging for a video to bring it into line with a desired intervention. In other words, reviewers might consciously or unconsciously “upstage” or “downstage” to arrive at a preferred intervention. Conversely, if all of a reviewer’s recommended interventions were the same (eg, “needs surgery”) regardless of stage, then our attention to staging would prove irrelevant. Reviewers’ comments did not show a substantial amount of such cognitive dissonance or compensatory reasoning.

Our study has limitations. For one, the proposed classifications by polyp count and diameter were based on expert opinion without previous validation. True validation awaits longer-term outcome measures, including need for surgical intervention or development of cancer. Staging categories also did not account for key factors that could affect interventions and their timing; for example, neither age nor risk of desmoid disease was factored into the IPSS. Young patients with modest polyp burdens might be managed expectantly to allow for completion of their pubescent growth spurts before colectomy, whereas older patients with identical polyp burdens might be recommended for surgery straightaway. A patient with known desmoid disease might, appropriately, choose to wait as long as possible for surgery. No feature of a colonoscopy video can properly inform clinicians on such issues as age or desmoid status. Neither does the IPSS account for degree of dysplasia in adenomas, though any adenoma with high-grade dysplasia should probably be placed in stage IV. Attempting to stratify according to such criteria would have, we think, needlessly complicated the staging system.

Because FAP is rare, its care is commonly given over to experts who have extensive experience with hereditary colorectal cancer syndromes. Determining whether a broad consensus could be reached was viewed as the first step toward establishing a staging system for colorectal polyposis. Hence, for our study, we recruited FAP reviewers who are experts in the field. Although they may share certain biases, at least such biases reflect expertise in assessing severity of actual FAP cases. More important than the use of experts as such was the general agreement in stage and intervention assignment across the panel of reviewers. An appropriate next undertaking, but one beyond the immediate scope of this study, would be to engage trainees and other non-experts to determine whether they could intuitively or with minimal training come to use the IPSS in a fashion similar to our experts.

We further examined results for the videos showing the greatest scatter in stage assignment. In video 13, for instance, there was one very large confluent adenoma in the right colon, but very few other adenomas, all limited to the right colon. Some scorers dismissed the confluent adenoma and assigned a low stage because of the low overall polyp count, whereas others were strongly influenced by the one large confluent adenoma and assigned a high stage. Such discordant findings were not really anticipated in the IPSS as it was initially conceived. These kinds of difficult, outlying cases defy ready classification and require individual attention. It can be argued, at least, that these cases will be appropriately “flagged” by demonstration of discordant staging upon review by multiple experts. Such cases might well be excluded from clinical chemoprevention trials for this reason. (See Supplemental Table 5 for some of the reviewers’ comments on the 3 videos with outlier cases.) After reviewing our results, we contacted our reviewers to assess whether high-grade dysplasia should automatically classify a patient as Stage IV. Fifteen of the 26 reviewers replied and agreed with the statement; they also stated that these atypical cases require discussion and consensus in multispecialty conference settings.

Technical considerations in endoscopy, including colon preparation and withdrawal, can affect the performance of the IPSS or any other staging system. There are additional measures that could reduce disagreement in staging. Cases of greatest scatter could be reviewed jointly to determine whether outliers are due to errors in the identification and classification of polyps (lymphoid polyps can be a problem), polyp counts, or polyp size. Joint review of difficult cases might also lead to modification of the staging designation when disagreement exists as to the “bins” to which such cases are assigned. Different count or dimension thresholds might lead to different criteria for staging any given polyp burden. To address these issues, tutorials similar to those used by pathologists to standardize interpretation of pathologic specimens could be designed.

Because our reviewers were FAP experts, we did not attempt to control for the amount of time each reviewer spent scoring a particular video. One might argue that less-experienced participants would tend to spend more time on difficult cases than more experienced ones, and perhaps reach a better score/rating than if less time were spent reviewing a case. Because we did not monitor the number of times a video was reviewed or the time spent in doing so, we cannot say what the effect of this might have been on inter-rater concordance. This will be addressed in future studies.

In this study we used de-identified videos that had been used in previous chemoprevention trials. Although these represent a range of FAP cases that normally occur in clinical settings, they did, by definition, have to fall within a range of severity that would be amenable to clinical trial inclusion. Thus, cases with obviously invasive cancer on the one hand, and a totally normal colon on the other, tended not to be included. Apart from this restriction, there does not appear to have been any relevant selection bias. Also, these cases did not have a predetermined stage, as assigning a stage was a part of the exercise. As seen in the heat maps, the majority of reviewers assigned a particular stage or were within ±1 stage of it. We calculated interobserver variation and found it to be low. However, we did not assess intra-observer variation. One would expect intra-observer variability to be smaller in magnitude than inter-reviewer variability. Because our current study shows that inter-reviewer variability is small relative to inter-video variability, our conclusions about the reliability of the proposed scoring system would probably not change even if intra-reviewer variability were measured and assessed. Assessing it would have been an interesting exercise, but submission of replicate videos would have substantially increased the number of videos that would have to be scored, while still providing for the range of cases that were included. The statistical modeling that was performed provided a “best fit” for the approximate number of reviewers and videos that were used and indicated 24 videos and 24 reviewers as the best sample size. Hence, addition of replicate videos would mean reducing the number of independent videos. In addition, re-review of only “difficult cases” is not advisable because cases that had higher inter-reviewer variability would also be likely to have greater intra-reviewer variability, and thus this would give a skewed assessment of the level of intra-reviewer variability. Hence, future studies will address this as part of future validation steps. Our study shows that rater-to-rater variability is low, which is the key initial validation of the IPSS. This system can now be implemented to assign polyposis staging in clinical and clinical trial settings. Indeed, the prospective clinical trial setting will be the ideal circumstance in which to more appropriately validate the IPSS, also enabling trialists to address some of the issues left unresolved here, including the potential importance of (1) intra-observer variation; (2) amount of time spent reviewing videos; and (3) reconciliation processes for cases in which disagreement exists. Finally, given some of the scatter observed in the more challenging cases, one can only assume that a given reviewer would, on a blinded re-review, up- or down-stage a given case in a fashion similar to that seen in the pool of independent reviewers. This would be an interesting phenomenon to assess and in fact will be incorporated into the design of upcoming clinical trials in which the IPSS is used.

Depending upon how such issues are resolved, it may be feasible to delegate scoring to non-expert reviewers, after a training period, using the values found here as a benchmark and framework for analysis. Now that we have a framework for analysis, we can also readily see that use of the IPSS enables the detection of “outlier” cases in which disagreement in stage assignment exists. These lend themselves well to the development of a reconciliation or adjudication process. Such a process must recognize the absence of perfect fidelity in severity scoring, and must provide measures, however arbitrary, for resolving such disagreements. In doing so, opportunities to further refine the IPSS should emerge.

We did not find differences in scoring by characteristics of the scorers, such as FAP patient load per year. We could have collected additional information such as total years in FAP-focused practice, perhaps a better reflection of cumulative FAP experience. However, because most reviewers have been InSiGHT participants for many years, this was not likely an important factor. In addition, there were no real patterns of scoring to indicate that any particular reviewer consistently over- or under-staged relative to the average values provided.

How might the prospective IPSS be applied clinically or in clinical trials, and how would we evaluate its effectiveness? Our initial goal was to develop a system that would establish clinical trial endpoints that closely correspond to the FDA’s requirement of “clinical benefit.” If surgical resection can be delayed because of chemoprevention, with or without polypectomy, then a clinical benefit will have been rendered. Such trials are in preparation, with the IPSS incorporated as a primary endpoint. These trials may also lead to modifications of the IPSS, depending on the results of validation studies using hard clinical outcomes—surgery, cancer, death.

Another consequence of the use of the IPSS may be to modify the prevailing standard of care with respect to polyposis intervention; that is, a new staging system may prompt changes in the recommended surveillance intervals or criteria for surgical resection. If one considers the interventions recommended by our panel of experts in the context of actual cases, there is a comfort level with nonsurgical intervention at the earlier stages of adenoma involvement and differing surgical interventions for later stages.

The survey data showed general satisfaction with the IPSS. They suggest that clinicians will develop confidence in the system and that it is easy to use. In addition, the proposed IPSS could affect the criteria for inclusion in trials of chemoprevention in FAP patients. It is likely that future clinical trial attention will be devoted to patients with intact colons, although historically most chemoprevention trials have limited enrollment to patients who have already undergone colectomy and who have recurrent rectal adenomas.

This is the first step toward developing a staging system for colorectal polyposis. No system is perfect, and any system improves only with time. We believe this proposal and testing of a colorectal polyposis staging system with stage-specific interventions will enable more reliable measures of patients’ response to nonsurgical treatments. In addition, these measures should satisfy the need to determine treatment endpoints that meet the FDA’s new requirement of “clinical benefit.” Further validation of this scoring system can be expected in the course of prospective clinical chemoprevention trials currently in development.

Acknowledgments

Grant Support: Support for this study was provided by SLA Pharmaceutical, UK. The protocol number for this study at MD Anderson Cancer Center is Protocol #PA11-0926.

We would like to acknowledge all our reviewers for their time and contribution toward this study, SLA Pharmaceuticals for their funding support, and the entire team who made this study possible.

Abbreviations

FAP: familial adenomatous polyposis

FDA: Food and Drug Administration

InSiGHT: International Society for Gastrointestinal Hereditary Tumors

IPSS: InSiGHT polyposis staging system

ICC: intraclass correlation coefficient

APC: adenomatous polyposis coli

Footnotes

Disclosures: Each video reviewer received compensation for the time devoted to review of endoscopic videos and for completing the scoring forms. Dr. Burt is a consultant for Myriad Genetics. No author reports any additional conflicts of interest.

Transcript Profiling: Not applicable

Writing Assistance: No outside writing assistance was provided for this manuscript.

Author Contributions:
  1. Dr. Patrick Lynch: Study concept and design; funding; manuscript writing; study supervision; drafting of the article; critical revision of the article for important intellectual content; final approval of the article.
  2. Dr. Jeffrey Morris: Statistical design, analysis, interpretation; manuscript writing; critical revision of the article for important intellectual content.
  3. Dr. Sijin Wen: Statistical design, analysis, interpretation; manuscript writing; critical revision of the article for important intellectual content.
  4. Dr. Shailesh Advani: Administrative; study supervision; data handling; database management; manuscript writing; acquisition of data; critical revision of the article for important intellectual content.
  5. IPSS Group (all other authors): Review and scoring of videos; data acquisition and analysis; drafting and critical revision of the manuscript; final approval of the published version; accountability for the content of the work.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
