Journal of Registry Management. 2021 Dec 1;48(4):161–167.

Melanoma Tumor Depth Quality Audit: A Nonmatch Analysis

Pamela Sanchez a, Margaret (Peggy) Adamo a, Clara JK Lam a, Jennifer Steven b, Ariel Brest b, Serban Negoita a
PMCID: PMC10198396  PMID: 37260866

Abstract

Background:

The National Cancer Institute's Surveillance Research Program (SRP) received reports from cancer registries in the Surveillance, Epidemiology, and End Results (SEER) Program concerning the coding of melanoma tumor depth. To address these concerns, SRP developed an algorithm to identify melanoma depth measurement values and conducted a nonmatch analysis.

Methods:

A nonmatch analysis was conducted on 1,117 cases diagnosed between 2010 and 2017. With the help of Information Management Services, a natural language processing algorithm was developed to identify melanoma tumor depth values, and certified tumor registrars established a gold standard for comparison. In a randomly sampled data set, the algorithm-generated and gold standard values were compared with the originally reported values and analyzed using SAS software version 9.4. Analyses estimated the distribution of nonmatches by demographics and by the derived T variable defined in the 7th edition of the American Joint Committee on Cancer (AJCC) Cancer Staging Manual.

Results:

Of the 1,117 cases, 849 cases (76%) were a match between the originally reported values and the gold standard. The majority of cases were found to be in male patients (60%) and non-Hispanic White patients (93%). When comparing derived AJCC-7 T based on the originally reported value to the gold standard, 16% of the original derived AJCC-7 T values were incorrect, with most of the nonmatches resulting in incorrectly coding a case as TX instead of T1.

Conclusion:

In total, 24% of cases were found to have a discrepancy in the originally recorded values. Decimal errors made up 3% of all cases in this nonmatch analysis. This algorithm may prove to be an essential tool in optimizing registry resources by flagging inconsistencies via automated text review for adjudication by registrars, thereby improving the quality of their data as needed.

Keywords: algorithm, Breslow, data quality, melanoma tumor depth, Surveillance, Epidemiology, End Results (SEER) Program

Introduction

Cutaneous melanoma is the fifth-most-common cancer in the United States, and its incidence has been rising steadily by an average of 1.5% each year.1–3 The Breslow depth of invasion is the preferred measure of melanoma invasion because it provides a standardized method of measuring melanoma depth.4 It measures the depth of the lesion vertically, in millimeters, from the top of the granular layer of the epidermis to the deepest point of tumor involvement. Tumor thickness defined by the Breslow depth of invasion is one of the most important determinants of prognosis for melanoma.5–7 The tumor, node, and metastasis (TNM) staging system, a comprehensive classification system, uses tumor thickness as one component to predict overall prognosis. An increase in tumor thickness is associated with metastasis and poorer prognosis.4 Thus, the accuracy of recorded tumor thickness has important implications for staging and cancer treatment.

Collaborative Stage Site-Specific Factor (SSF) 1 melanoma tumor depth is reported using the Breslow depth of invasion.8 The National Cancer Institute (NCI)'s Surveillance Research Program (SRP) received reports from cancer registries in the Surveillance, Epidemiology, and End Results (SEER) Program concerning the coding of melanoma tumor depth. Some potential problems that were identified included errors in converting centimeters to millimeters (the “implied” decimal point errors), transcription errors, possible miscoding of tumor size for tumor depth, and incomplete information. To understand the potential impact of these problems and to ensure the highest quality of data in the SEER database, SRP conducted an audit of melanoma tumor depth values. The 4 main goals of this audit were as follows:

  1. To develop and test a natural language processing algorithm to accurately identify melanoma depth measurement values

  2. To assess nonmatch distribution and its effect on staging

  3. To provide registries with a set of flagged cases with a high probability of inaccurate depth measurement values for review

  4. To provide SEER central registries with a method for automatic correction of these depth measurement values

Methods

Three SEER central registries participated in this retrospective audit. The analysis was conducted on 157 cases from registry A diagnosed between January 1, 2010, and December 31, 2014, and 480 cases each from registries B and C diagnosed between January 1, 2010, and December 31, 2017, for a total sample size of 1,117 cases. All cases were analyzed using the same algorithm techniques so that the results would be comparable. Participating SEER central registries gave approval for the investigators to access their data through the SEER Data Management System (SEER*DMS) to select cases, abstracts, and electronic pathology (e-path) reports. SEER*DMS is a centralized system that supports all essential central cancer registry functions.9 Data use agreements were signed and confirmed with each individual who would be accessing sensitive information, and separate data use agreements were signed with each participating registry. Selection criteria included:

  1. Year of diagnosis between 2010–2014 or 2010–2017

  2. Behavior code = 3 (invasive cancers)

  3. Primary site = C44.x

  4. Histology codes = 8720–8790

  5. Cases reportable to SEER only

Death-certificate-only cases were excluded (reporting source = 7).

This audit included all North American Association of Central Cancer Registries (NAACCR) abstracts and e-path reports linked to consolidated tumor cases defined in the selection criteria. Pathology reports scanned in PDF format were excluded, as well as reports dated before diagnosis or ≥120 days after diagnosis. Consolidated tumor cases that did not have at least an abstract or an e-path report were also excluded.

Data Extraction and Algorithm Development

Information Management Services (IMS) and NCI-SRP collaborated on the data extraction and development of the query algorithm. Subject matter experts from NCI-SRP assisted IMS with clinical, pathology, and registry standards expertise. Using SAS software, IMS employed an ensemble of text-mining methods to identify melanoma tumor depth measurement values by detecting numeric values corresponding to related terminology. The natural language processing algorithm was tested on over 22,000 cases: 4,802 from registry A, 5,724 cases from registry B, and 12,056 cases from registry C. The outline below was used to develop, test, and refine the final algorithm used to capture melanoma tumor depth measurements.

The following steps outline the ordered logic that the query followed to extract relevant melanoma tumor depth values in the raw text (ie, text documentation) for comparison with abstracted values:

  1. Query for text relevancy to melanoma tumor depth
    • Step 1: Identify anchor terms (Breslow, depth, thickness, invade, ulcer)
    • Step 2: Identify unit(s) of measurement (millimeters, centimeters, mm, cm; ignoring mm2, cm2)
    • Step 3: Identify number(s) directly adjacent to a unit of measurement
    • The algorithm excluded ranges, dimensional measurements (measurements with an x in between), and measurements near the term excise
  2. Select measurement
    • Preference is given to measurements in the pathology report over the abstract
    • Select the largest measurement in millimeters
    • Select the SSF1 value coded if no measurement is available on the pathology report or text of the abstract
    • Note 1: If more than 1 pathology report or abstract was available for a case, the algorithm would select the largest measurement.
    • Note 2: A measurement in the pathology report described in millimeters took precedence over a measurement in the abstract described in centimeters.
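The query steps above can be sketched in Python with regular expressions. This is a minimal illustration under stated assumptions, not the SAS text-mining ensemble that IMS built: the anchor stems, the unit pattern, the sentence splitting, the reading of "near the term excise" as "in the same sentence", and the function names `extract_depths_mm` and `select_depth_mm` are all hypothetical choices made for this example.

```python
from __future__ import annotations

import re

# Assumed patterns approximating Steps 1-3; the audit's actual term lists are not published here.
ANCHOR = re.compile(r"breslow|depth|thick|invad|ulcer", re.IGNORECASE)
EXCISE = re.compile(r"excis", re.IGNORECASE)
# Ranges ("0.8-1.0 mm") and dimensional measurements ("1.0 x 0.5 cm") are removed before extraction.
RANGE_OR_DIMENSION = re.compile(
    r"\d*\.?\d+\s*(?:mm|cm)?\s*(?:x|×|-|–|to)\s*\d*\.?\d+\s*(?:mm|cm)?", re.IGNORECASE
)
# A number directly followed by a unit; the lookahead skips area units such as "mm2".
MEASURE = re.compile(r"(\d*\.?\d+)\s*(millimeters?|centimeters?|mm|cm)(?![a-z0-9²])", re.IGNORECASE)


def extract_depths_mm(text: str) -> list[float]:
    """Return candidate depths in mm from sentences that contain an anchor term."""
    depths: list[float] = []
    for sentence in re.split(r"(?<=[.;])\s+|\n+", text):
        # Step 1: keep sentences with an anchor term; skip those mentioning excision.
        if not ANCHOR.search(sentence) or EXCISE.search(sentence):
            continue
        cleaned = RANGE_OR_DIMENSION.sub(" ", sentence)
        # Steps 2-3: numbers adjacent to a unit, normalized to millimeters.
        for number, unit in MEASURE.findall(cleaned):
            factor = 10.0 if unit.lower().startswith("c") else 1.0
            depths.append(float(number) * factor)
    return depths


def select_depth_mm(pathology_reports: list[str], abstract_texts: list[str], ssf1_code: int | None) -> float | None:
    """Prefer pathology reports, then abstract text, then a known coded SSF1 value."""
    for documents in (pathology_reports, abstract_texts):
        candidates = [d for doc in documents for d in extract_depths_mm(doc)]
        if candidates:
            return max(candidates)  # largest measurement across documents of this type
    if ssf1_code is not None and 0 <= ssf1_code <= 980:
        return ssf1_code / 100  # implied decimal: code 120 -> 1.20 mm
    return None  # no usable value; would be coded 999 (unknown)
```

Because pathology reports are searched first, a millimeter value in a pathology report wins over a centimeter value in the abstract, which mirrors Note 2 without a separate rule.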

The algorithm was designed to select the best source document and then the largest measurement within it. The best source document was defined as the pathology report; if no pathology report was available, the algorithm selected the largest measurement in the text of the NAACCR abstract. If there were multiple source documents, the algorithm selected the largest measurement, consistent with Collaborative Stage instructions.10 If no measurement was found in the text of the abstract, the known (000–980) value coded on the abstract for SSF1 was selected. Thus, the algorithm-generated value is the value derived by the algorithm from the best available source document. On the recommendation of experts in the field, the algorithm was also instructed to prioritize a measurement in the pathology report described in millimeters over a measurement in the abstract described in centimeters.

At the time the data reviewed in this audit were collected, the tumor depth value was a 3-digit field with an implied decimal point between the first and second digits (X.XX). For coding purposes, the actual measurement from the pathology report was recorded in hundredths of a millimeter. For example, a tumor depth measurement of 2.0 mm would have been coded as 200. If no value was recorded in SSF1, the algorithm coded 999 (unknown). Table 1 further describes how melanoma tumor depth was coded.

Once the algorithm performed at a satisfactory level, defined as an F1 score above 0.8, the query results were combined with registry data to form a consolidated data set. The F1 score, the harmonic mean of precision and recall, is commonly used in machine learning to evaluate a model's accuracy;11 it ranges from 0 (lowest) to 1.0 (perfect precision and recall). From this consolidated data set, a randomly sampled stratified data set was created and delivered to each registry in a form view, to be used to build a gold standard at each participating registry.
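For readers unfamiliar with the metric, the F1 threshold described above can be illustrated with a short Python sketch. The function computes F1 as the harmonic mean of precision and recall from true-positive, false-positive, and false-negative counts. The counts in the usage line are hypothetical, since the audit does not report the counts behind its F1 evaluation, and how a "true positive" was defined for depth extraction is likewise an assumption.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = 2PR / (P + R); returns 0.0 when precision and recall are both 0."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


SATISFACTORY_F1 = 0.8  # threshold stated in the Methods ("above 0.8")

# Hypothetical counts: 90 correct extractions, 10 spurious, 20 missed.
score = f1_score(90, 10, 20)
print(f"F1 = {score:.3f}, satisfactory: {score > SATISFACTORY_F1}")  # F1 = 0.857, satisfactory: True
```

Because F1 is a harmonic mean, a model cannot reach the 0.8 threshold on high precision alone if it misses many depth values (low recall), or vice versa.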

Table 1.

Collaborative Stage Site-Specific Factor 1 Measured Thickness (Depth) Coding Description

| Code | Description |
|---|---|
| 000 | No mass/tumor found |
| 001–979 | 0.01–9.79 mm (code exact measurement in hundredths of a millimeter). Examples: 001 = 0.01 mm; 002 = 0.02 mm; 010 = 0.1 mm; 074 = 0.74 mm; 100 = 1 mm; 105 = 1.05 mm; 979 = 9.79 mm |
| 980 | 9.80 mm or larger (includes cases converted from codes 981–989 during conversion in V0200) |
| 999 | Microinvasion; microscopic focus or foci only and no depth given. Not documented in patient record. Unknown; depth not stated |
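The implied-decimal convention in Table 1 can be made concrete with a small encoder and decoder. This is a hypothetical Python sketch (`encode_ssf1` and `decode_ssf1` are illustrative names, not SEER*DMS functions); it assumes measurements are rounded half-up to the nearest hundredth of a millimeter and treats only codes 000–980 and 999 as valid.

```python
from __future__ import annotations

from decimal import ROUND_HALF_UP, Decimal

UNKNOWN_CODE = "999"


def encode_ssf1(depth_mm: str | None) -> str:
    """Convert a depth in millimeters (e.g. "2.0") to a 3-digit SSF1 code."""
    if depth_mm is None:
        return UNKNOWN_CODE
    # Hundredths of a millimeter; rounding rule is an assumption of this sketch.
    hundredths = int((Decimal(depth_mm) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    return "980" if hundredths >= 980 else f"{hundredths:03d}"


def decode_ssf1(code: str) -> Decimal | None:
    """Convert a 3-digit SSF1 code back to millimeters; None means unknown."""
    if code == UNKNOWN_CODE:
        return None
    value = int(code)
    if 0 <= value <= 980:
        return Decimal(value) / 100  # implied decimal point: X.XX
    raise ValueError(f"code {code} is outside the ranges sketched here")


# A 0.2 cm depth converted correctly (2.0 mm) vs. entered as if it were 0.2 mm:
print(encode_ssf1("2.0"), encode_ssf1("0.2"))  # 200 020
```

The last line shows how a missed centimeter-to-millimeter conversion produces a code that is off by a factor of 10, the "implied" decimal point error described in the Introduction.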

Creating the Gold Standard

A randomly sampled stratified data set was created and sent back to the participating registries to conduct a consolidated tumor case–level registry evaluation of the measurement values. NCI-SRP assisted in defining and setting up the criteria for the gold standard measurement value. Two certified tumor registrars (CTRs) from each participating registry used the available data sources to determine a measurement value for each consolidated tumor case in the review. If the 2 CTRs did not agree on a case, NCI-SRP provided CTR guidance and adjudication. This adjudicated CTR-reported value was the gold standard value.

Data Analysis

Once the CTR review of the randomly sampled data set was completed, the algorithm-generated and gold standard values were compared with the originally reported values and analyzed using SAS software version 9.4. Analyses were conducted to determine the distribution of nonmatches by demographics and to estimate the distribution of nonmatches by the derived T variable according to the 7th edition of the American Joint Committee on Cancer (AJCC)'s AJCC Cancer Staging Manual. Derived AJCC-7 T is a numeric representation of the AJCC 7th edition T descriptor (from TNM staging) and is derived by the Collaborative Stage algorithm from Collaborative Stage coded fields.12 Analyses were also conducted on a subset of cases to compare derived AJCC-7 T and AJCC-7 stage group based on the originally reported values with those based on the gold standard. This analysis could not be performed on all cases because some participating registries had not collected the data elements needed for TNM rederivation in all diagnosis years included in this audit. A few cases were also removed from the comparison of derived AJCC-7 T because of errors in rederivation.
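To illustrate how an SSF1 depth error propagates into derived T, the sketch below maps a depth code plus ulceration and mitotic-rate status to an AJCC 7th edition T category, using the thickness, ulceration, and mitosis boundaries listed in Table 3. It is a hypothetical simplification in Python, not the Collaborative Stage algorithm, which draws on additional coded fields; mapping code 000 to T0 and reporting missing ulceration or mitosis information as NOS are assumptions of this example.

```python
from __future__ import annotations


def derive_t_category(ssf1_code: str, ulceration: bool | None, mitoses_ge_1: bool | None = None) -> str:
    """Return a simplified AJCC-7 T category such as "T1a", "T3 NOS", or "TX"."""
    code = int(ssf1_code)
    if code == 999:
        return "TX"  # depth unknown
    if code == 0:
        return "T0"  # assumption: "no mass/tumor found" treated as T0
    # Thickness boundaries in hundredths of a mm: <=1.00, 1.01-2.00, 2.01-4.00, >4.00.
    if code <= 100:
        base = "T1"
    elif code <= 200:
        base = "T2"
    elif code <= 400:
        base = "T3"
    else:
        base = "T4"
    if base == "T1":
        # T1b: ulceration or mitoses >= 1/mm2; T1a: neither; otherwise not enough information.
        if ulceration or mitoses_ge_1:
            return "T1b"
        if ulceration is False and mitoses_ge_1 is False:
            return "T1a"
        return "T1 NOS"
    if ulceration is None:
        return f"{base} NOS"
    return f"{base}b" if ulceration else f"{base}a"


# A decimal error (020 instead of 200) moves a nonulcerated tumor from T2a to T1a.
print(derive_t_category("200", False), derive_t_category("020", False, False))  # T2a T1a
```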

Classification of Nonmatches

For this analysis, a match was defined as agreement between the algorithm-generated value (or the originally reported value) and the gold standard; any disagreement with the gold standard was considered a nonmatch. Thus, when the algorithm selected the melanoma tumor depth value recorded in the gold standard, this was a match. Nonmatches were classified into the following 6 categories to determine the nonmatch distribution:

  1. Decimal error

  2. The gold standard and the originally reported (or algorithm-generated) value both had depth values ≤ 9.8 mm that do not match

  3. The gold standard had a known value between 1 and 980, and the algorithm-generated or originally reported value was coded unknown

  4. The gold standard was coded unknown, and the algorithm-generated or originally reported value had a known value between 1 and 980

  5. The gold standard had a value < 980, and the algorithm-generated or originally reported value had a value > 980

  6. Other nonmatches

Categories 5 and 6 had 0 cases throughout our analysis and are therefore not presented in the tables.
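The category rules above can be expressed as a small classifier over 3-digit SSF1 codes. The audit does not spell out how decimal errors were detected, so this Python sketch assumes a decimal error is a pair of known values differing by a factor of exactly 10 or 100 (for example, 020 vs 200); the function names `is_decimal_shift` and `classify_nonmatch` and that rule are hypothetical.

```python
UNKNOWN = 999

CATEGORY_LABELS = {
    0: "match",
    1: "decimal error",
    2: "depth values that do not match",
    3: "gold standard known, compared value unknown",
    4: "gold standard unknown, compared value known",
    5: "gold standard < 980, compared value > 980",
    6: "other nonmatch",
}


def is_decimal_shift(a: int, b: int) -> bool:
    """Assumed rule: one code is 10x or 100x the other (a shifted decimal point)."""
    low, high = sorted((a, b))
    return low > 0 and high in (low * 10, low * 100)


def classify_nonmatch(gold: int, compared: int) -> int:
    """Compare an originally reported or algorithm-generated code with the gold standard."""
    if gold == compared:
        return 0
    gold_known = 1 <= gold <= 980
    compared_known = 1 <= compared <= 980
    if gold_known and compared_known:
        return 1 if is_decimal_shift(gold, compared) else 2
    if gold_known and compared == UNKNOWN:
        return 3
    if gold == UNKNOWN and compared_known:
        return 4
    if gold < 980 and 980 < compared < UNKNOWN:
        return 5
    return 6
```

Checking for a decimal shift before falling back to category 2 keeps decimal errors from being counted as ordinary depth mismatches, since both involve 2 known values.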

Results

Of the 1,117 cases, 849 (76%) were a match between the originally recorded value and the gold standard (Table 2). Most nonmatched cases were in category 2 (depth values that do not match), with 115 cases (10%), followed by category 4 (gold standard coded unknown, originally reported value known) with 68 cases (6%) and category 3 (gold standard value known, originally reported value coded unknown) with 51 cases (5%). Category 1 (decimal error) had the fewest, with 34 cases (3%) (Table 2). These findings indicate that implied decimal errors accounted for only a small share of the originally recorded values: 3% of all cases, or 34 of the 268 nonmatches.

The majority of cases were in male patients (60%) and non-Hispanic White patients (93%). About half of the cases (50%) were in patients diagnosed with melanoma at age 65 years or older. The nonmatch distribution was similar across age and sex, with match percentages between 74% and 78%. Registry B had the highest match percentage (83%), followed by registry A (74%) and registry C (70%).

When the distribution of nonmatches from the algorithm was compared with that of the originally reported values, the largest difference was in category 2 (depth values that do not match; 15% vs 10%), followed by category 4 (gold standard coded unknown, algorithm-generated or originally reported value known; 3% vs 6%) and category 1 (decimal error; 1% vs 3%).

Table 2.

Match vs Nonmatch by Demographics (Originally Reported vs Gold Standard)

| Characteristic | Total, N | Match, n (%) | Nonmatch 1: decimal error, n (%)* | Nonmatch 2: depth values that do not match, n (%) | Nonmatch 3: gold standard value known, originally reported value unknown, n (%) | Nonmatch 4: gold standard value unknown, originally reported value known, n (%) |
|---|---|---|---|---|---|---|
| Total consolidated tumor cases | 1117 | 849 (76) | 34 (3) | 115 (10) | 51 (5) | 68 (6) |
| Age at diagnosis (y) | | | | | | |
| >65 | 549 | 427 (78) | 19 (3) | 52 (9) | 25 (5) | 26 (5) |
| 65≤ | 568 | 422 (74) | 15 (3) | 63 (11) | 26 (5) | 42 (7) |
| Sex | | | | | | |
| Male | 675 | 511 (76) | 23 (3) | 75 (11) | 25 (4) | 41 (6) |
| Female | 442 | 338 (77) | 11 (2) | 40 (9) | 26 (6) | 27 (6) |
| Registry | | | | | | |
| A | 157 | 116 (74) | 9 (6) | 12 (7) | 5 (3) | 15 (10) |
| B | 480 | 398 (83) | 4 (1) | 44 (9) | 17 (3.5) | 17 (3.5) |
| C | 480 | 335 (70) | 21 (4) | 59 (12) | 29 (6) | 36 (8) |
| Race/ethnicity | | | | | | |
| Non-Hispanic White | 1042 | 783 (75) | 31 (3) | 113 (11) | 48 (5) | 67 (6) |
| Non-Hispanic Black | 13 | 10 (77) | 0 | 1 (7.5) | 1 (7.5) | 1 (7.5) |
| Non-Hispanic American Indian | 2 | 2 (100) | 0 | 0 | 0 | 0 |
| Asian/Pacific Islander | 3 | 2 (67) | 1 (33) | 0 | 0 | 0 |
| Hispanic | 15 | 13 (87) | 1 (6.5) | 1 (6.5) | 0 | 0 |
| Other/unknown | 42 | 39 (93) | 1 (2) | 0 | 2 (5) | 0 |

*Nonmatch categories 5 (gold standard < 980, originally reported > 980) and 6 (other nonmatches) had 0 cases and are not shown.

Table 3 shows the distribution of nonmatches from the originally reported values by derived AJCC-7 T. Of the 858 cases with a derived T value, the most common T categories were T2a (172 cases; 20%), TX (153 cases; 18%), and T1a (141 cases; 16%). T4b had the highest match percentage (91%). The lowest match percentages occurred in very small categories: T0 and T2 (not otherwise specified; NOS) each had 33%, and neither of the 2 T3 NOS cases matched. Table 4 shows the impact of all nonmatches combined on derived AJCC-7 T from the originally reported values. When derived AJCC-7 T based on the originally reported value was compared with the gold standard, 16% of the original derived AJCC-7 T values were incorrect. The most common discrepancy was a case coded TX when the gold standard supported T1, and 11% of cases would have been staged incorrectly. When derived AJCC-7 T was based on the algorithm's selection, 13% of values were incorrect according to the gold standard, with a smaller impact on stage: 9% of cases would have been staged incorrectly.

Table 3.

Derived AJCC-7 T vs OCTC and GS

| Derived AJCC-7 T | Thickness | Ulceration | Total, N | Match, n (%) | Nonmatch 1: decimal error, n (%) | Nonmatch 2: depth values that do not match, n (%) | Nonmatch 3: gold standard value known, originally reported value unknown, n (%) | Nonmatch 4: gold standard value unknown, originally reported value known, n (%) |
|---|---|---|---|---|---|---|---|---|
| Total | | | 858* | 644 (75) | 25 (3) | 92 (11) | 46 (5) | 51 (6) |
| T0 | NA | NA | 6 | 2 (33) | 0 | 0 | 0 | 4 (67) |
| T1a | ≤1.0 mm | Without ulceration and mitosis <1/mm2 | 141 | 116 (82) | 8 (6) | 9 (6) | 1 (1) | 7 (5) |
| T1b | ≤1.0 mm | With ulceration or mitosis ≥1/mm2 | 68 | 44 (65) | 10 (14.5) | 10 (14.5) | 0 | 4 (6) |
| T1 NOS | ≤1.0 mm | | 65 | 42 (65) | 5 (8) | 6 (9) | 0 | 12 (18) |
| T2a | 1.01–2.0 mm | Without ulceration | 172 | 133 (77) | 0 | 30 (18) | 0 | 9 (5) |
| T2b | 1.01–2.0 mm | With ulceration | 43 | 32 (74) | 0 | 9 (21) | 0 | 2 (5) |
| T2 NOS | 1.01–2.0 mm | | 3 | 1 (33) | 0 | 0 | 0 | 2 (67) |
| T3a | 2.01–4.0 mm | Without ulceration | 73 | 53 (73) | 1 (1) | 16 (22) | 0 | 3 (4) |
| T3b | 2.01–4.0 mm | With ulceration | 68 | 58 (85) | 0 | 8 (12) | 0 | 2 (3) |
| T3 NOS | 2.01–4.0 mm | | 2 | 0 | 0 | 0 | 0 | 2 (100) |
| T4a | >4.0 mm | Without ulceration | 28 | 23 (82) | 1 (4) | 2 (7) | 0 | 2 (7) |
| T4b | >4.0 mm | With ulceration | 34 | 31 (91) | 0 | 1 (3) | 0 | 2 (6) |
| T4 NOS | >4.0 mm | | 2 | 1 (50) | 0 | 1 (50) | 0 | 0 |
| TX | NA | NA | 153 | 108 (71) | 0 | 0 | 45 (29) | 0 |

NA, not applicable; NOS, not otherwise specified.

*Missing 259 values (derived AJCC_7_T = blank).

Table 4.

Originally Derived AJCC-7 T vs Gold Standard*

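Rows: originally derived AJCC-7 T. Columns: AJCC-7 T based on gold standard.

| Original T | T0 | T1 | T2 | T3 | T4 | TX | Total |
|---|---|---|---|---|---|---|---|
| T0 | 6 | 0 | 0 | 0 | 0 | 0 | 6 |
| T1 | 0 | 232 | 11 | 3 | 5 | 23 | 274 |
| T2 | 0 | 5 | 194 | 2 | 7 | 10 | 218 |
| T3 | 0 | 1 | 4 | 127 | 3 | 7 | 142 |
| T4 | 0 | 1 | 0 | 0 | 60 | 3 | 64 |
| TX | 0 | 26 | 6 | 6 | 5 | 108 | 151 |
| Total | 6 | 265 | 215 | 138 | 80 | 151 | 855* |

*Missing 259 values (derived AJCC_7_T = blank); 3 cases removed due to error in rederivation.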
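The overall share of incorrect derived T values and the most frequent discrepancy can be read directly off a cross-tabulation like Table 4. The Python sketch below hard-codes the Table 4 counts as printed (rows: T derived from the originally reported values; columns: T based on the gold standard) and computes the off-diagonal total and the largest off-diagonal cell; the variable and function names are illustrative.

```python
from __future__ import annotations

T_LABELS = ["T0", "T1", "T2", "T3", "T4", "TX"]

# Table 4 counts: rows = originally derived T, columns = T based on the gold standard.
TABLE_4 = [
    [6, 0, 0, 0, 0, 0],
    [0, 232, 11, 3, 5, 23],
    [0, 5, 194, 2, 7, 10],
    [0, 1, 4, 127, 3, 7],
    [0, 1, 0, 0, 60, 3],
    [0, 26, 6, 6, 5, 108],
]


def summarize_disagreement(matrix: list[list[int]], labels: list[str]) -> tuple[int, int, tuple[int, str, str]]:
    """Return (total, off-diagonal count, (count, original T, gold standard T) of the largest off-diagonal cell)."""
    total = sum(sum(row) for row in matrix)
    agree = sum(matrix[i][i] for i in range(len(labels)))
    off_diagonal = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j
    ]
    return total, total - agree, max(off_diagonal)


total, discordant, largest = summarize_disagreement(TABLE_4, T_LABELS)
print(f"{discordant}/{total} = {discordant / total:.1%}; largest: {largest}")
```

Applied to the counts as printed, this gives 128 discordant cases out of 855 (15.0%), with TX coded where the gold standard supports T1 (26 cases) as the largest single discrepancy.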

When the algorithm-generated values, originally reported values, and gold standard were compared, 659 of the 1,117 cases agreed across all three (Table 5); that is, the value selected by the algorithm and the originally reported value were both the gold standard value. There were 192 cases in which the gold standard and the algorithm-generated value matched but did not agree with the originally reported value. Conversely, there were 190 cases in which the gold standard matched the originally reported value but disagreed with the algorithm-generated value. We also observed 76 cases in which both the algorithm-generated and the originally reported values disagreed with the gold standard (Table 5).
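The 2-by-2 layout of Table 5 follows from checking, for each case, whether the algorithm-generated value and the originally reported value each equal the gold standard. A minimal Python sketch, with a hypothetical function name and made-up example cases (1 per cell):

```python
from __future__ import annotations

from collections import Counter


def agreement_table(cases: list[tuple[int, int, int]]) -> Counter:
    """Count cases by (algorithm agrees with gold, original agrees with gold).

    Each case is (originally_reported_code, algorithm_code, gold_standard_code).
    """
    return Counter((algorithm == gold, original == gold) for original, algorithm, gold in cases)


# Hypothetical cases, one for each cell of Table 5.
cases = [
    (200, 200, 200),  # both agree with gold (Table 5: 659 cases)
    (20, 200, 200),   # only the algorithm agrees (192)
    (150, 999, 150),  # only the original agrees (190)
    (999, 300, 150),  # neither agrees (76)
]
print(agreement_table(cases))
```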

Table 5.

Agreement vs Disagreement

| | A: Agreement between originally reported and gold standard values | B: Disagreement between originally reported and gold standard values | Total (%) |
|---|---|---|---|
| Agreement between algorithm-generated and gold standard values | 659 | 192 | 851 (76.2%) |
| Disagreement between algorithm-generated and gold standard values | 190 | 76 | 266 (23.8%) |
| Total | 849 (76.0%) | 268 (24.0%) | 1,117 |

Discussion

When tested in 3 SEER central registries, the algorithm selected the best melanoma tumor depth value for about 76% of cases, the same level of accuracy found in the originally reported values. In the distribution of nonmatched cases from the originally reported values, the highest number of nonmatches was in category 2 (depth values that did not match), with 115 cases (Table 2), or 10% of all cases included in this audit. Category 3 comprises cases for which the originally reported value was coded unknown and the gold standard had a value between 1 and 980 (5% of cases). Category 4 includes cases for which the gold standard value was coded unknown and the originally reported value was between 1 and 980 (6% of cases). Category 1, decimal errors, was the smallest nonmatch category (3% of cases). This likely reflects the ongoing and rigorous quality control and visual editing processes in place at central registries.

The distribution of nonmatches based on the algorithm's selection was similar to that observed in the originally reported values. The highest number of nonmatches was again in category 2 (depth values that did not match), with 170 cases (15%). On additional review, there were several cases in which the algorithm made the correct selection, but registry policies required the gold standard value to come from the abstract rather than the pathology report. This conflicted with how the algorithm was designed to select the best melanoma tumor depth value and lowered its match percentage. Based on the algorithm's selection, 5% of cases fell under category 3 (gold standard had a known value, the algorithm coded unknown), while 3% fell under category 4 (the algorithm had a known value, gold standard coded unknown). These nonmatches were likely due to the algorithm having difficulty distinguishing tumor depth from tumor size, specimen size, or ulceration size. The algorithm also could not select the best value when more than 1 melanoma was reported on the same pathology report.

An objective of this audit was to provide registries with a method for automatic error correction using the algorithm. This required comparing the algorithm-generated values with the gold standard and the originally reported values. With an accuracy of about 76%, the algorithm was judged better suited as a tool for identifying discrepancies than for automatic correction. The results in Table 5 support the algorithm's ability to identify cases with a high probability of a nonmatch or discrepancy. By narrowing the set of cases likely to be nonmatches, the algorithm allows registrars to focus their quality improvement efforts where needed. For instance, the 76 cases (7%) in which both the originally reported value and the algorithm-generated value disagree with the gold standard very likely have discrepancies. Additionally, the 192 cases in which the originally reported value disagrees with both the algorithm-generated value and the gold standard could benefit from additional review. Through such review, registries can make the necessary corrections and improve their data quality as needed. If incorporated into registry operations, this tool would let registrars focus their time and effort on cases with a high likelihood of being a nonmatch rather than reviewing all melanoma cases.
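In routine operations no gold standard is available, so a review queue would have to be driven by disagreement between the originally reported code and the algorithm-generated code, which is the comparison the validation study described below examines. The hypothetical Python sketch below builds such a queue (the `CaseValues` record and `review_queue` function are illustrative names). Note that it cannot surface cases in which the originally reported and algorithm-generated values agree with each other but are both wrong.

```python
from __future__ import annotations

from typing import NamedTuple


class CaseValues(NamedTuple):
    case_id: str         # hypothetical identifier
    original_code: int   # SSF1 code as originally reported
    algorithm_code: int  # SSF1 code selected by the algorithm


def review_queue(cases: list[CaseValues]) -> list[str]:
    """Return IDs of cases whose original and algorithm-generated codes disagree."""
    return [c.case_id for c in cases if c.original_code != c.algorithm_code]


cases = [
    CaseValues("case-001", 200, 200),
    CaseValues("case-002", 20, 200),  # possible decimal error
    CaseValues("case-003", 999, 74),  # coded unknown, depth found in text
]
print(review_queue(cases))  # ['case-002', 'case-003']
```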

This study is among the first to apply text mining, a type of artificial intelligence technology that uses natural language processing, to quality control activities supporting the cancer registry community. Strengths of this study include having experienced quality control registrars manually review the random sample of cases to develop the gold standard while blinded to the originally recorded values, reducing the risk of bias. The best value is consistently found among the values identified by the algorithm, but selecting the single value that matches the gold standard remains limited for the reasons described above.

SRP is currently conducting a validation study in 3 SEER central registries to evaluate how effective this tool would be in registry operations. The registries are reviewing cases for which the algorithm disagrees with the originally reported value (considered a nonmatch) to confirm that these cases were coded incorrectly. Additionally, registries are reviewing cases for which the algorithm agrees with the originally reported value (considered a match) to confirm that these cases were coded correctly. Once this validation study is complete and modifications to the algorithm are made if needed, SRP plans to make the algorithm available to all SEER central registries as a quality improvement tool to ensure that the SEER database continues to offer a high-quality data set for researchers.

Conclusion

We sought to estimate the extent of nonmatches in the SEER database, understand their impact, and address them through the development of an algorithm that can accurately identify melanoma depth measurement values. Decimal errors made up 3% of all cases in this nonmatch analysis. The largest nonmatch category was category 2 (depth values that did not match), at 10% of all cases. In total, 24% of cases were found to have a discrepancy in the originally recorded values. Although the algorithm is limited in selecting the single best melanoma tumor depth value, this limitation does not prevent it from flagging potentially incorrect values. This algorithm may prove to be an essential tool in optimizing registry resources by flagging inconsistencies via automated text review. Inconsistencies will be adjudicated by registrars, thus improving the quality of their data as needed. SRP is currently conducting a validation study to evaluate the extent to which the algorithm can accurately identify discrepant cases. This is a quality initiative to ensure that the SEER database will continue to offer the largest, most accurate data set of population-based melanoma tumor depth values for research.

Acknowledgements

The authors gratefully acknowledge the contributions of the staff from the Louisiana, Detroit, and New Jersey cancer registries, not only for their work in collecting the data used in this audit but also for the role they played in developing the gold standard. We would also like to thank Glenn Abastillas and Rebecca Ehrenkranz for their contributions during the early development of this audit.

Footnotes

Supported by Surveillance Research Program of the Division of Cancer Control and Population Sciences at the National Cancer Institute/National Institutes of Health contracts with the Surveillance, Epidemiology, and End Results registries.

References

