Landsc Ecol. Author manuscript; available in PMC: 2021 Jun 1.
Published in final edited form as: Landsc Ecol. 2020 Jun 1;35:1263–1267. doi: 10.1007/s10980-020-01029-1

A guide for evaluating and reporting map data quality: Affirming Shao et al. “Overselling overall map accuracy misinforms about research reliability”

SV Stehman 1, J Wickham 2,*
PMCID: PMC7970508  NIHMSID: NIHMS1668055  PMID: 33746360

Abstract

Context

Landscape ecologists often use thematic map data in their research. Greater familiarity with thematic map accuracy assessment protocols will enhance appropriate use and interpretation of map quality data.

Objectives

Provide an overview of thematic map accuracy assessment protocols and simple, non-quantitative guidelines to assess the quality of the thematic map data that landscape ecologists use in their research.

Methods

Synthesis and interpretation of salient literature on map accuracy assessment.

Conclusions

Landscape ecologists can adopt three simple rules to improve their use and interpretation of map data: 1) use the map quality data only if the accuracy assessment protocols adhere to rigorous, well-established standards for the sampling design, response design, and analysis; 2) focus on class-specific accuracy via user’s and producer’s accuracies (or the complementary commission and omission error rates); and 3) treat the reporting of class-specific accuracies accompanied by standard errors as a strong indicator of a rigorous assessment.

Use and Interpretation of Map Accuracy Information

We appreciate and thank Shao et al. (2019) for their thought-provoking editorial reminding the readers of Landscape Ecology to consider the quality of the remotely sensed land cover and other thematic data they use in their work. Shao et al.’s (2019) caution against overselling overall map accuracy provides an opportune setting to revisit other important issues in the reporting and interpretation of map accuracy results as they pertain to the use of spatial products in landscape ecology.

Protocols for rigorous accuracy assessment of remotely sensed land cover data were established more than 20 years ago (Stehman and Czaplewski 1998), and they have been further developed and articulated in more recent publications (Olofsson et al. 2014; Stehman and Foody 2019). The three main elements of a rigorous accuracy assessment are:

  1. sampling design – the rules used to select the sample units for which reference class labels will be acquired (e.g., simple random, stratified random, cluster, systematic);

  2. response design – the protocols used to collect the reference data; and

  3. analysis – the methods used to quantify agreement between map and reference data.

There are important details for each of the three accuracy assessment elements (Table 1), and it would be worthwhile for landscape ecologists to consider these details as part of an overall assessment of map data quality. A simple yet effective means for incorporating map data quality into the work of landscape ecologists is to report whether or not the producers of the data provided accuracy information that was based on established protocols.

Table 1 – Details of a rigorous thematic accuracy assessment

Objectives
All accuracy assessments should start with a clear set of objectives. The desired objectives inform the decisions made regarding aspects of the three main elements of an accuracy assessment. Common objectives are to estimate overall accuracy, per-class accuracy, and precision (i.e., standard error). Cost is often a constraint on objectives. Stehman et al. (2008) discuss how accuracy assessment objectives can become complex and how that complexity informs and affects accuracy assessment planning.
Sampling design
Protocols used to define the population (the sampling frame) and select the sample. Probability-based options for sample selection are recommended (Stehman and Czaplewski 1998). Stratified random sampling is often an efficient (i.e., cost-effective) probability-based option for thematic map accuracy assessment because some classes are likely to be common while others are rare. Stratification is often motivated specifically by the goal of enhancing the precision of the user’s accuracy estimates for rare classes. Each sample unit can belong to only one stratum when stratified sampling is used.
Response design
Response design refers to the protocols used to determine the reference classification. There are many details to be considered that affect the quality of the reference classification and hence map-reference agreement (Olofsson et al. 2014; Stehman and Czaplewski 1998; Stehman and Foody 2019). We identify four elements of the response design protocol that we consider essential. (1) The reference medium should be of higher quality than the medium used to produce the map (Khorram et al. 1999; Olofsson et al. 2014). Practically, higher quality is most often realized by use of a reference medium with higher spatial resolution. (2) The spatial unit on which the map-reference comparison is based is another essential element. A pixel, at the native resolution of the raster map source, is perhaps the most common unit, but there are other options (e.g., polygons). Stehman and Wickham (2011) discuss how use of a spatial unit other than a pixel affects each of the three main elements of an accuracy assessment. (3) Reference label assignment should be blind to the map classification. (4) Reference label assignment should be consistent. Protocols should be established to ensure that each person assigns the same reference label to the same sample unit when teams are used to collect reference data. Consistency can be promoted through pilot (i.e., training) efforts, assignment of common points to multiple interpreters, and periodic meetings among those collecting reference data (e.g., Wickham et al. 2017).
Analysis
The measures used to quantify agreement between map and reference data. The formulas used to quantify agreement should be based on the sampling design implemented and should account for the inclusion probabilities. Cell entries in the error matrix should represent area, not frequency counts (a short numerical sketch follows the table). Documentation of the analysis component should include the equations used for variance estimation, and the accuracy estimates derived from the error matrix should be accompanied by standard errors.
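To make the analysis prescriptions in Table 1 concrete, the short Python sketch below (hypothetical sample counts and stratum weights, assuming the common design in which the map classes serve as the strata) converts a frequency-count error matrix into an estimated area-proportion error matrix via p̂_ij = W_i n_ij / n_i·, so that cell entries represent area rather than counts.

```python
import numpy as np

# Hypothetical example: stratified random sample with the map classes as strata.
# Rows = map class, columns = reference class; cell values are sample counts.
counts = np.array([[66,  0,   5,   4],
                   [ 0, 55,   8,  12],
                   [ 1,  0, 153,  11],
                   [ 2,  1,   9, 313]], dtype=float)

# Stratum weights W_i: proportion of total map area occupied by each map class.
W = np.array([0.022, 0.015, 0.320, 0.643])

n_i = counts.sum(axis=1)               # sample size per stratum, n_i.
p_hat = (W / n_i)[:, None] * counts    # p_ij = W_i * n_ij / n_i.

# p_hat is the estimated area-proportion error matrix; its cells sum to 1.
assert np.isclose(p_hat.sum(), 1.0)
```

The weights and counts here are invented for illustration; for other probability sampling designs the conversion must use the inclusion probabilities appropriate to that design.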

The well-known cross-tabulation or K-x-K error matrix (K = number of classes) (Story and Congalton 1986) of map-reference agreement and the accuracy measures derived from an error matrix (e.g., overall accuracy [OA], producer’s accuracy [PA] and user’s accuracy [UA]) were familiar to most landscape ecologists at the initial stages of the discipline’s development (Hess and Bay 1997; Shao and Wu 2008; Smith et al. 2002, 2003; Wickham et al. 1997). Many of these studies evaluated the sensitivity of landscape metrics to class-specific error (e.g., UA, PA), not just OA. Visual inspection of the K-x-K cross-tabulation matrix in fact provides useful information on the quality of a map accuracy assessment.

To improve both reporting and use of map accuracy information, we also recommend that the error matrix include: 1) UA and PA, 2) row and column marginal totals, and 3) the standard errors for the UA and PA estimates. In the usual presentation of an error matrix, the rows represent the map classification and the columns the reference classification, so UA and PA are associated with the rows and columns, respectively, in the error matrix. The row and column totals provide the area or proportion of area of each class determined from the map (rows) or the reference classification (columns). For each thematic class, the difference between the row total and the column total represents non-site-specific accuracy, also called quantity disagreement (Pontius and Millones 2011). Olofsson et al. (2014) point out that this difference also characterizes the bias associated with using pixel counting to estimate area from a map, where pixel counting is defined as summing all pixels of a class in the map and multiplying by the area of a pixel.
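A minimal continuation of the sketch above (reusing p_hat) shows how the recommended quantities follow directly from the area-proportion error matrix: the row and column marginals, OA, UA, PA, and per-class quantity disagreement are each one-line computations.

```python
# Continues the sketch above (numpy imported; counts, W, n_i, p_hat defined).
row_tot = p_hat.sum(axis=1)     # mapped proportion of area for each class
col_tot = p_hat.sum(axis=0)     # reference-based proportion of area for each class

oa = np.trace(p_hat)            # overall accuracy: proportion of area correctly classified
ua = np.diag(p_hat) / row_tot   # user's accuracy (1 - commission error rate)
pa = np.diag(p_hat) / col_tot   # producer's accuracy (1 - omission error rate)

# Quantity disagreement (non-site-specific accuracy) for each class.
quantity_disagreement = np.abs(row_tot - col_tot)
```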

The standard errors of the accuracy estimates are the most overlooked of these three critical components. Reporting of standard errors communicates that: 1) at least two of the three main elements of a well-done accuracy assessment (sampling design and analysis) were incorporated and connected at the effort’s planning stage (Stehman 2001); 2) the accuracy assessment was likely initiated with a clear articulation of objectives; 3) the producers of the map recognized the importance of statistical inference to rigorous documentation of data quality; and 4) users of the data can more confidently assume that a well-thought-out response design was also incorporated into the effort. Accuracy assessments that do not report class-specific standard errors are often also missing fundamental information, which may indicate that other basic elements of accuracy assessment were not rigorously implemented. One common example we have encountered is an assessment that implements a stratified random sampling design but reports only frequency counts in the error matrix. Such reporting ignores the inclusion probabilities needed to produce unbiased estimators of class-specific accuracies, as demonstrated by Stehman and Foody (2019, Sec. 4.4).
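For readers who want to see what reporting these standard errors entails, the continuation below sketches the design-based estimators for stratified random sampling with map classes as strata, written in proportion form in the spirit of Olofsson et al. (2014). The point is that the stratum weights, not the raw counts alone, drive both the estimates and their standard errors.

```python
# Continues the sketch above.
se_ua = np.sqrt(ua * (1 - ua) / (n_i - 1))                 # SE of user's accuracy
se_oa = np.sqrt(np.sum(W**2 * ua * (1 - ua) / (n_i - 1)))  # SE of overall accuracy

# SE of producer's accuracy (after Olofsson et al. 2014, their Eq. 7), with the
# stratum weight W_i standing in for the mapped area of stratum i.
p_cond = counts / n_i[:, None]          # n_ij / n_i. within each stratum
se_pa = np.zeros_like(pa)
for j in range(len(pa)):
    others = [i for i in range(len(pa)) if i != j]
    term1 = W[j]**2 * (1 - pa[j])**2 * ua[j] * (1 - ua[j]) / (n_i[j] - 1)
    term2 = pa[j]**2 * np.sum(W[others]**2 * p_cond[others, j]
                              * (1 - p_cond[others, j]) / (n_i[others] - 1))
    se_pa[j] = np.sqrt(term1 + term2) / col_tot[j]
```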

The four critiques of OA offered by Shao et al. (2019) provide insights that merit expanded discussion to address further details of interpretation of map accuracy results. Shao et al.’s (2019) first critique is that OA does not supply uncertainty for map derivatives, as illustrated in their example in which OA is 90% and the map derivative is area of forest. However, other features of an accuracy assessment are informative for quantifying the uncertainty of forest area. For example, as noted previously, the bias in the proportion of area of a class derived from pixel counting is the difference between the row total and the column total of that category in the error matrix. Instead of counting pixels to estimate area, Olofsson et al. (2014) recommend estimating area using the reference classification of the accuracy assessment sample. Based on this recommendation, the estimated proportion of area of each class is provided by the column total of the error matrix when the cells of the error matrix are presented in terms of percent or proportion of area (Stehman and Foody 2019, Tables 3 and 5). These sample-based area estimates should be accompanied by their standard errors (or confidence intervals) to capture the uncertainty attributable to sampling variation.
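Continuing the sketch, the stratified estimate of each class’s proportion of area based on the reference classification is simply the column total of the area-proportion error matrix; its standard error (in the form of Olofsson et al. 2014, their Eq. 10, with stratum weights in place of mapped pixel totals) supports the confidence intervals recommended here.

```python
# Continues the sketch above.
area_prop = col_tot     # reference-based estimate of each class's proportion of area

# Standard error of the stratified area-proportion estimator, then an
# approximate 95% confidence interval for each class.
se_area = np.sqrt(np.sum(W[:, None]**2 * p_cond * (1 - p_cond)
                         / (n_i[:, None] - 1), axis=0))
ci_lower = area_prop - 1.96 * se_area
ci_upper = area_prop + 1.96 * se_area
```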

Shao et al.’s (2019) second and third critiques of OA highlight the need to exercise caution when comparing maps with different numbers of classes or maps with the same classes but differing class proportions. A blanket recommendation that maps with different numbers of classes (second critique) are not comparable overlooks an important exception. Map classifications, especially for land cover, are often hierarchical, perhaps because of the influence of Anderson et al. (1976). Because of the inherent nesting of classes in hierarchical classifications, it is common to compare OA for different levels of class nesting. Comparing OA for disaggregated and aggregated levels of the classification hierarchy yields important information on how error is distributed among the classes. A large increase in OA arising from class aggregation indicates that a substantial portion of map-reference disagreement occurs among the detailed classes that comprise a more generalized class.
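To illustrate the hierarchical comparison, the continuation below collapses the detailed-class area-proportion matrix into a hypothetical two-class generalization (the grouping is invented for illustration, e.g., Anderson Level II rolled up to Level I) and compares OA before and after aggregation.

```python
# Continues the sketch above.  Hypothetical hierarchy: detailed classes 0-3
# roll up into two generalized classes.
groups = np.array([0, 0, 1, 1])     # generalized class index of each detailed class

k_agg = groups.max() + 1
p_agg = np.zeros((k_agg, k_agg))
for i in range(p_hat.shape[0]):
    for j in range(p_hat.shape[1]):
        p_agg[groups[i], groups[j]] += p_hat[i, j]

oa_detailed = np.trace(p_hat)
oa_aggregated = np.trace(p_agg)
# A large jump from oa_detailed to oa_aggregated indicates that much of the
# disagreement occurs among detailed classes within the same generalized class.
```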

In cases where the class areal proportions differ between two maps being compared (third critique), the issues become more involved and depend on the objectives of the comparison. For example, if the objective is to monitor improvement in the accuracy of map products over time in an ongoing program such as the U.S. National Land Cover Database (NLCD) (www.mrlc.gov), accuracy assessments that compare changes in OA, UA, and PA across NLCD eras (e.g., Wickham et al. 2017) are justifiable even though class proportions are changing over time (e.g., urbanization). Comparisons of changes in OA, UA, and PA between NLCD 2006 and NLCD 2011, for example, are statistically valid and provide a useful, quantitative indicator of whether the ever-evolving NLCD mapping methods are leading to improved product quality. In the case of NLCD, land-cover proportions are not changing substantially over time, so confounding of accuracy gains achieved through adjustments to classification methods with changes in land cover is not a strong concern. In other situations, for example when comparing classification algorithms, confounding of land cover areal distributions (e.g., by not using a common set of test sites) with differences among classification algorithms may critically impair the ability to generalize from the results. We agree with the recommendation by Shao et al. (2019) that important details need to be considered when comparing OA between two different maps, and this concern also applies to UA and PA.

Shao et al.’s (2019) fourth critique is that a higher OA is not necessarily an indicator of a more realistic map than a different map of the same area with a lower OA. They demonstrate this critique by showing that OA can be increased by increasing the mapped area of the majority category (within a fixed-area region of interest) or by extending the mapped area. Although it is true that increasing the mapped area of the majority class will likely increase OA, an important characteristic of the error matrix is the interdependence among the classes. Increasing the mapped area of the dominant class will surely increase the omission errors of the rare classes. The recommended practice of reporting class-specific accuracies would detect the changes in accuracy resulting from attempts to tailor the map to maximize OA. Expanding the extent of the mapping area is tantamount to changing the population (Stehman 2001). An accuracy assessment conducted prior to map expansion does not include in its population the pixels in the expanded area and therefore provides no information on the accuracy for that area. Expanding the extent of the mapping area, and thereby changing the population represented by the accuracy estimates, would affect all accuracy metrics, not just OA. Typically, defining the relevant study area is a critical first step in an accuracy assessment, so expanding the study area should always raise concerns, and map producers should be required to provide a convincing justification for such an expansion of the region of interest.
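A small, entirely hypothetical numerical example (not taken from Shao et al. 2019) illustrates the interdependence described above: inflating the majority class can raise OA while sharply increasing the omission error of a rare class, and class-specific accuracies make that trade-off visible.

```python
import numpy as np

# Two hypothetical maps of the same area, expressed as area-proportion error
# matrices (rows = map class, columns = reference class).  The reference
# distribution is identical for both: 90% majority class, 10% rare class.
map_a = np.array([[0.840, 0.010],
                  [0.060, 0.090]])
map_b = np.array([[0.895, 0.055],   # same area, mapped with a bias toward the majority class
                  [0.005, 0.045]])

for name, p in [("A", map_a), ("B", map_b)]:
    oa = np.trace(p)
    pa_rare = p[1, 1] / p[:, 1].sum()
    print(f"Map {name}: OA = {oa:.2f}, rare-class omission error = {1 - pa_rare:.2f}")
# Map A: OA = 0.93, omission error = 0.10; Map B: OA = 0.94, omission error = 0.55.
# Map B's higher OA masks its severe omission of the rare class.
```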

OA is a legitimate, albeit coarse, summary measure of map quality. The overselling of OA can be attributed to efforts to use overall accuracy for purposes it was never intended to serve. OA represents the proportion of area that is correctly classified. For that intended purpose, OA is properly influenced by the distribution of land cover in the study area because classes that are most common in the region of interest should more strongly influence OA. To do otherwise would violate the “map relevant” criterion for accuracy assessment (Stehman and Foody 2019), which states that the accuracy assessment must produce results consistent with the land cover areal distribution in the study area. As Shao et al. (2019) note, OA cannot represent the accuracy of every category; OA is not intended for that purpose. OA is a useful measure that provides some, but not all, of the information needed to support informed use of the data. Given the intended purpose of OA, the class-specific UA and PA estimates and their associated standard errors are essential results to report from an accuracy assessment.

We appreciate the important and timely reminder offered by Shao et al. (2019) that landscape ecologists should take stock of map quality and that assessment of map quality should extend well beyond OA. Of course, the initial check of map quality is whether the map producers provided accuracy information at all. If accuracy information is missing, users of the map need to report that fact when discussing how map quality affects the results of their application. In addition, we offer two easy-to-implement, non-quantitative questions that landscape ecologists can use to evaluate the veracity of the accuracy assessment protocols and the resulting map data quality: 1) Did the producers of the mapped data follow established, rigorous protocols (sampling design, response design, and analysis) for the collection and analysis of map accuracy data? 2) Did the map producers report class-specific UA and PA estimates with accompanying standard errors? To supplement Shao et al.’s (2019) advice, a worthwhile standard of practice (sensu Olofsson et al. 2014; Stehman and Foody 2019) for landscape ecologists to adopt would be to require affirmative responses to both questions before incorporating the map accuracy data into their analyses. The ability to provide this affirmation in publications and reports is the joint responsibility of map users and producers, achieved through proper documentation of thematic accuracy methods and results.

Acknowledgements

The commentary described in this paper has been funded by the U.S. Environmental Protection Agency. We thank the anonymous reviewers and Maliha Nash (US EPA) for their valuable comments on earlier versions of the paper. The paper has been subjected to Agency review and has been approved for publication. The views expressed in this journal article are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency. Mention of trade names or commercial products does not constitute endorsement or recommendation for use. S. Stehman’s participation was underwritten by contract G12AC20221 between SUNY-ESF and the U.S. Dept. of Interior, Geological Survey.

References

  1. Anderson JR, Hardy EE, Roach JT, Witmer RE (1976) A Land Use and Land Cover Classification System for Use with Remote Sensor Data. U.S. Geological Survey, Professional Paper 964. doi: 10.3133/pp964; https://pubs.er.usgs.gov/publication/pp964
  2. Hess GR, Bay JM (1997) Generating confidence intervals for composition-based landscape indexes. Landsc Ecol 12:309–320
  3. Khorram S, Biging GS, Chrisman NR, Colby DR, Congalton RG, Dobson JF, Ferguson RL, Goodchild MF, Jensen JR, Mace TH (1999) Accuracy Assessment of Remote Sensing-Derived Change Detection. ASPRS Monograph Series, American Society for Photogrammetry and Remote Sensing (ASPRS), Bethesda, Maryland, USA
  4. Olofsson P, Foody GM, Herold M, Stehman SV, Woodcock CE, Wulder MA (2014) Good practices for estimating area and assessing accuracy of land change. Remote Sens Environ 148:42–57
  5. Pontius RG Jr, Millones M (2011) Death to Kappa: birth of quantity disagreement and allocation disagreement for accuracy assessment. Int J Remote Sens 32:4407–4429
  6. Shao G, Tang L, Liao J (2019) Overselling overall map accuracy misinforms about research reliability. Landsc Ecol 34:2487–2492
  7. Shao G, Wu J (2008) On the accuracy of landscape pattern analysis using remote sensing data. Landsc Ecol 23:505–511
  8. Smith JH, Stehman SV, Wickham JD, Yang L (2003) Effects of landscape characteristics on land-cover class accuracy. Remote Sens Environ 84:342–349
  9. Smith JH, Wickham JD, Stehman SV, Yang L (2002) Impacts of patch size and land cover heterogeneity on thematic image classification accuracy. Photogramm Eng Remote Sens 68:65–70
  10. Stehman SV (2001) Statistical rigor and practical utility in thematic map accuracy assessment. Photogramm Eng Remote Sens 67:727–734
  11. Stehman SV, Czaplewski RL (1998) Design and analysis of thematic map accuracy assessment: fundamental principles. Remote Sens Environ 64:331–344
  12. Stehman SV, Foody GM (2019) Key issues in rigorous accuracy assessment of land cover products. Remote Sens Environ 231:111199
  13. Stehman SV, Wickham J (2011) Pixels, blocks of pixels, and polygons: choosing a spatial unit for thematic accuracy assessment. Remote Sens Environ 115:3044–3055
  14. Stehman SV, Wickham J, Wade TG, Smith JH (2008) Designing a multi-objective, multi-support accuracy assessment of the 2001 National Land Cover Data (NLCD 2001) of the conterminous United States. Photogramm Eng Remote Sens 74:1561–1571
  15. Story M, Congalton RG (1986) Accuracy assessment: a user’s perspective. Photogramm Eng Remote Sens 52:397–399
  16. Wickham JD, O’Neill RV, Riitters KH, Wade TG, Jones KB (1997) Sensitivity of selected landscape pattern metrics to land-cover misclassification and differences in land cover composition. Photogramm Eng Remote Sens 63:397–402
  17. Wickham J, Stehman SV, Gass L, Dewitz JA, Sorenson DG, Granneman BJ, Poss RV, Baer LA (2017) Thematic accuracy assessment of the 2011 National Land Cover Database (NLCD). Remote Sens Environ 191:328–341
