Abstract
The Clock Drawing Test is a cognitive screening tool gaining popularity in the perioperative setting. We compared three common scoring systems: 1) the Montreal Cognitive Assessment, 2) the Mini-Cog, and 3) Libon et al. (1996). Three novice raters established inter- and intra-rater reliability for each scoring system and then scored 738 preoperative clock drawings with each system. Final scores correlated with each other but with notable discrepancies, indicating the need to attend to inter- and intra-rater reliability when implementing any scoring approach in a clinical setting.
INTRODUCTION
Preoperative cognitive impairment in older adults is a known risk factor for post-operative delirium and worse outcomes after surgical procedures with general anesthesia.1 As a result, hospitals are looking for ways to incorporate preoperative cognitive screening into their protocols.
The Clock Drawing Test (CDT) is one cognitive screening tool suited to fast-paced perioperative clinical settings. The CDT is particularly appealing to clinicians because it is easy to administer and provides a wealth of behavioral information.2 Yet, because of the CDT's extensive history within the medical literature and its popularity, there are many clock drawing scoring systems. The traditional neuropsychological approach involves examining clock drawings for errors and error type (e.g., executive, motor, and mental planning deficits).2–4 Other popular approaches, such as the Montreal Cognitive Assessment (MoCA)5 and the Mini-Cog,6 incorporate clock drawing as a subcomponent of a larger test and score the clock quantitatively for overall accuracy.
CDT scoring technique may pose reliability challenges, particularly for novice raters such as medical students or nursing staff. The Libon3 scale's attention to errors and error type may prove useful for inter- and intra-rater reliability but has drawbacks due to time constraints. The MoCA system is known to require training to achieve high rater reliability.4 The Mini-Cog has established reliability for the test as a whole, but its isolated clock drawing component has not been examined separately for rater reliability.6
We examined these three CDT scoring systems for inter- and intra-rater reliability among three novice medical student raters. Score discrepancy was then examined for the three CDT scoring systems through review of a large set of clock drawings from patients attending a preoperative anesthesia clinic.
METHODS
The University of Florida’s Institutional Review Board/Research Ethics Committee approved this investigation with a waiver of consent. This article adheres to the applicable GRRAS EQUATOR guidelines.
Preoperative Anesthesia Clinic Clock Drawings
Data were retrospectively acquired from two sources: 1) a preoperative hospital-wide screening investigation from May 2nd to July 1st in which patients aged 65 or older attending UF Health’s preoperative anesthesia clinic completed the CDT (n = 671), and 2) a separate federally funded prospective study examining cognition in older adults prior to orthopedic surgery (n = 67; final total sample = 738 participants). Patients did not complete a clock drawing if they had a history of a major learning disability, an inability to effectively hold a pen, significant loss of hearing, or a first language other than English. CDTs were administered by a trained administrator in a private and quiet room. Two CDT conditions were administered. First, individuals completed a command CDT condition in which they were told to “Draw the face of a clock, put in all the numbers, and set the hands to ten after eleven.” This command condition was then followed by a copy condition in which the patients copied a clock drawing. For the purposes of the current investigation, we examined rater scores only for clocks acquired from the command condition.
CDT Scoring Systems
MoCA5:
Within this test, clock drawing accounts for 3 of the 30 total possible points. The clock is scored for the contour of the clock face, the placement and length of the hands, and the placement of the numbers, with one point given for each correct component (full details at www.mocatest.org). For the CDT component, individuals with dementia score on average 1.64 of 3 points.4
Mini-Cog6:
In the Mini-Cog assessment, clock drawing is combined with a three-word recall (worth up to 3 points). The circle of the clock face is provided to the participant as part of the test with the final clock drawing scored as normal (2 points) or abnormal (0 points) based on hand and number placement (full details at mini-cog.com). Mini-Cog scores below a 4 of 5 are indicative of cognitive impairment and/or the need for further dementia screening.7 Raters disregarded clock face contour for the final score, as our CDT administration required patients to draw the face of a clock as well as the numbers and hands.
Libon3:
This scoring approach is based on traditional analysis of process and errors. Errors are summated based on three qualitative categories: graphomotor/clock face errors, errors in hand/number placement, and executive control errors. The total number of errors possible is 10. On the command condition of clock drawing, individuals with Alzheimer’s disease produce an average of 2.1 ± 1.2 errors, and individuals with small vessel vascular disease produce 2.9 ± 1.3 errors.
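As a minimal sketch of the summation just described (the category names and per-clock error counts below are hypothetical, and the cap of 10 reflects the maximum possible total stated above):

```python
def libon_total(errors):
    """Total Libon error score: the sum of errors across the three
    qualitative categories (graphomotor/clock face, hand/number
    placement, executive control). The total cannot exceed 10."""
    total = sum(errors.values())
    return min(total, 10)

# Hypothetical error counts for one clock drawing
clock_errors = {"graphomotor": 1, "hand_number": 1, "executive": 0}
print(libon_total(clock_errors))  # 2
```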
Measuring Intra- and Inter-rater Reliability
Three novice medical students (authors BF, KW, and MZ; referred to as raters B, K, and M) independently studied each of the CDT scoring systems described above. To establish intra-rater reliability, 40 de-identified clock drawings from external federally funded IRB-approved investigations were selected for scoring and randomized into different sets (Set A, Set B, etc.). These clocks represented a range of cognitive impairment. Raters independently scored the clock drawings with each CDT system (i.e., MoCA, Mini-Cog, and Libon). Raters first used Libon Scale 1 (1996) for the first 20 clock drawings, but after phone consultation with Dr. David Libon, it was recommended that the raters use Libon Scale 2 (1996). For this reason, intra-rater reliability analyses are based on 40 clock drawings for the MoCA and Mini-Cog systems, and 20 clock drawings for the Libon system.
Determination of Clock Score by System
Estimates of intra- and inter-rater reliability were deemed adequate as described by Landis and Koch.8 After the intra- and inter-rater reliability phase of the project was completed, each rater independently scored 738 clock drawings with each of the three scoring systems (for a total of 2,214 clocks each), and scoring was compared when discrepancies occurred. For the MoCA and Mini-Cog systems, discrepancies were discussed; if any rater disagreed on a clock’s score by one point or more, then all raters systematically re-reviewed the rating criteria together to come to a consensus. If agreement could not be reached for any reason, a resident expert (CP) was consulted. A majority rules approach9 was then used for the final MoCA5 and Mini-Cog6 scores. Since the range of scores was much greater for the Libon system (i.e., scores could range from 0 to 10), the mean Libon score of the three raters was calculated to determine the final score.
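The final-score logic above can be sketched as follows (a simplified illustration, not the authors' code; the example scores are hypothetical):

```python
from collections import Counter
from statistics import mean

def majority_rules(scores):
    """Final MoCA/Mini-Cog score: the value given by most raters.
    Returns None when all three raters disagree, signalling the need
    for consensus review (and expert consultation if unresolved)."""
    value, count = Counter(scores).most_common(1)[0]
    return value if count > 1 else None

def libon_final(scores):
    """Final Libon score: the mean of the three raters' error counts."""
    return mean(scores)

# Hypothetical rater scores (raters B, K, M) for one clock drawing
print(majority_rules([2, 2, 3]))  # 2 (two raters agree)
print(majority_rules([1, 2, 3]))  # None (no majority -> consensus review)
print(libon_final([2, 3, 4]))     # 3
```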
Statistical Analysis
SAS version 9.4 (Cary, NC) was used for all analyses. Kappa statistics (chance-corrected indices of agreement) and 95% confidence intervals (CI) were computed to assess intra- and inter-rater reliabilities for the Mini-Cog. SAS default weighted kappa statistics and 95% CIs were computed to assess reliabilities for the MoCA and Libon scoring methods. Inter-rater reliabilities were then computed for each pair of raters and averaged. To assess variability in scoring over the entire sample (n = 738), the absolute value of the difference in scores was computed for each rater pair and for each scoring system. Spearman correlations examined associations between the final scores for: 1) Libon and MoCA, 2) Libon and Mini-Cog, and 3) MoCA and Mini-Cog. Summary statistics (e.g., means, frequencies) were calculated to describe the study sample. The level of significance was set at .05 for correlational testing.
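For readers unfamiliar with the kappa statistic, a minimal pure-Python sketch of the unweighted version (as applied to the two-category Mini-Cog scores; the MoCA and Libon analyses used a weighted variant) is shown below. The rater score lists are hypothetical:

```python
from collections import Counter

def cohen_kappa(r1, r2):
    """Unweighted Cohen's kappa: chance-corrected agreement between
    two raters scoring the same items. Observed agreement is compared
    to the agreement expected from each rater's marginal frequencies."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[cat] * c2[cat] for cat in set(r1) | set(r2)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail (2 or 0) Mini-Cog clock scores from two raters
rater_b = [2, 2, 0, 0, 2, 2, 0, 2]
rater_k = [2, 2, 0, 2, 2, 0, 0, 2]
print(round(cohen_kappa(rater_b, rater_k), 3))  # 0.467
```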
RESULTS
Establishing Rater Reliabilities on Training Sets
Intra-rater reliabilities:
For the MoCA, the estimated intra-rater reliabilities were .89 (95% CI = [.76, 1.00]), .89 (95% CI = [.78, .99]), and .94 (95% CI = [.86, 1.00]). Each rater had perfect estimated intra-rater agreement for the Mini-Cog. For the Libon, the estimated intra-rater reliabilities were .93 (95% CI = [.83, 1.00]), .95 (95% CI = [.85, 1.00]), and .90 (95% CI = [.80, 1.00]).
Inter-rater reliabilities:
For MoCA, estimated inter-rater reliabilities (95% CI) between raters B and K, raters B and M, raters K and M were .74 (95% CI = [.56, .93]), .48 (95% CI = [.28, .69]), and .59 (95% CI = [.42, .76]), respectively. The average of the estimated inter-rater reliability for the MoCA was .60. For MiniCog, estimated inter-rater reliabilities (95% CI) between raters B and K, raters B and M, raters K and M were .78 (95% CI = [.58, .98]), .60 (95% CI = [.34, .86]), and .82 (95% CI = [.62, 1.00]), respectively. The average of the estimated inter-rater reliability for the MiniCog was .73. For Libon, estimated inter-rater reliabilities (95% CI) between raters B and K, raters B and M, raters K and M were .55 (95% CI = [.34, .75]), .50 (95% CI = [.29, .70]), and .54 (95% CI = [.32, .77]), respectively. The average of the estimated inter-rater reliability for the Libon was .53.
Preoperative CDT Dataset Participant Description:
The sample’s mean age was 73.3 years (SD = 6.1); 47% were female, 89% were white, 7% were black, and 4% were of other race/ethnicity. The mean (SD) number of years of education was 14.2 (2.9). Despite only moderate initial inter-rater reliabilities on the training sets of clock drawings, discrepancies in raters’ scores across the entire sample (n = 738) were minimal (see Table 1). For the MoCA and Libon scoring systems, rater pairs differed by no more than one point about 99% and 98% of the time, respectively.
Table 1:
Percentages of Scoring Discrepancies by CDT Rating System and Rater Pairs for 738 Clock Drawings (columns show the absolute value of the point discrepancy)

| Rating System and Rater Pair | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| MoCA, B and K | 69.1% | 30.2% | .7% | 0% | – | – |
| MoCA, B and M | 63.7% | 35.9% | .4% | 0% | – | – |
| MoCA, K and M | 74.3% | 24.4% | 1.4% | 0% | – | – |
| MiniCog, B and K | 96.1% | 3.9% | – | – | – | – |
| MiniCog, B and M | 98.5% | 1.5% | – | – | – | – |
| MiniCog, K and M | 96.5% | 3.5% | – | – | – | – |
| Libon, B and K | 57.7% | 39.7% | 2.2% | 0.3% | 0.1% | 0% |
| Libon, B and M | 61.9% | 36.6% | 1.2% | 0% | 0.1% | 0.1% |
| Libon, K and M | 55.6% | 43.0% | 1.4% | 0% | 0% | 0.1% |
Final CDT scores from the three scoring systems significantly correlated with each other: 1) MoCA and Mini-Cog rs = .60 (95% CI = [.55, .65]), p < .0001; 2) MoCA and Libon rs = −.76 (95% CI = [−.79, −.73]), p < .0001; and 3) Libon and Mini-Cog rs = −.60 (95% CI = [−.64, −.54]), p < .0001. The cohort level of impairment was also comparable by scoring system: on the Mini-Cog, 42.1% were scored zero (0); on the MoCA, 34.6% were scored 1 or below; and on the Libon, 62.3% of the sample had two or more errors, and 30.4% had three or more errors.
Although the final CDT system scores correlated with each other, there were notable divergent clocks within each comparison group (see Table 2 and Figure 1), such that discrepancies were identified for one-fifth of the sample.
Table 2:
Percentages Showing Degree of Correspondence by CDT Rating System(1)

| Libon Mean Score | MoCA Score 0 | MoCA Score 1 | MoCA Score 2 | MoCA Score 3 | Mini-Cog Score 0 | Mini-Cog Score 2 |
|---|---|---|---|---|---|---|
| 0 | 0% | 0% | 0% | 100% | 0% | 100% |
| 1 | 0% | 2.0% | 61.1% | 37.0% | 13.3% | 86.7% |
| 2 | .4% | 33.9% | 57.6% | 8.1% | 46.2% | 53.8% |
| 3 | 18.4% | 51.7% | 27.9% | 2.0% | 68.0% | 32.0% |
| 4 | 42.1% | 42.1% | 15.8% | 0% | 96.5% | 3.5% |
| 5 | 43.8% | 50.0% | 6.3% | 0% | 93.8% | 6.3% |
| 6 | 75.0% | 25.0% | 0% | 0% | 100% | 0% |

| Mini-Cog Majority Rules Score | MoCA Score 0 | MoCA Score 1 | MoCA Score 2 | MoCA Score 3 |
|---|---|---|---|---|
| 0 | 19.7% | 45.8% | 31.9% | 2.6% |
| 2 | .2% | 11.92% | 49.5% | 38.3% |

(1) Rows may not add to 100% due to rounding.
Figure 1.
(A) Top, scatter plot of the relationship between Mini-Cog and MoCA scores; bottom left, clock with a failed Mini-Cog score and a MoCA score of 2; bottom right, clock with a passing score on Mini-Cog and a MoCA score of 1. (B) Top, scatter plot of the relationship between Mini-Cog and Libon scores; bottom left, clock with a failing Mini-Cog score and a Libon score of 1; bottom right, clock with a passing Mini-Cog score and a Libon score of 4. (C) Top, scatter plot of the relationship between MoCA and Libon scores; bottom, clocks both with a MoCA score of 1, but the left has a Libon score of 2 while the right has a Libon score of 4.
DISCUSSION
This study highlights the necessity for healthcare systems to consider intra- and inter-rater reliability training for CDT scoring regardless of the scoring system chosen. We found novice raters could achieve high intra-rater and moderate inter-rater reliability on the clock training sets. Despite this established skillset, the three raters produced score discrepancies for at least 20% of the full preoperative sample. For example, Figure 1 shows how a difference of 1 point on either the MoCA or Mini-Cog changes the label from “intact” to “impaired” or vice versa. This finding has clinical relevance: patients may show one-point score discrepancies from pre- to post-operative testing, and this difference could simply reflect rater bias rather than true change. This is a topic that needs consideration and additional research. Based on these collective findings, we conclude that while any one of the reviewed CDT scoring approaches appears appropriate for implementation, the administrative system needs to remain vigilant about adequate training and rater reliability assessment.
An additional consideration is that the choice of one CDT scoring approach does not exclude future application of another scoring approach for clinical care. A distinct advantage of the CDT is that many different scoring systems can be applied as long as the visual figure of the clock is preserved. The CDT can be administered and scored in the perioperative setting,10 with the visual image saved to medical files and provided to neurocognitive specialists for clinical re-review with more comprehensive scoring protocols. Since the CDT is used clinically in neuropsychological clinics across the country, a preoperative clock drawing image provides a gateway for cognitive monitoring over time.2
We recognize study limitations. Due to constraints of the preoperative setting and reliance on the clocks available for review, we lacked a sample size justification to achieve an a priori precision of reliability measures. Additionally, we chose to report averages for estimates of multi-rater reliability, but note that other multi-rater estimates are also available. Future work needs to examine rater reliability across a wider range of neurocognitive pathologies within the perioperative setting. Perioperative scoring approaches should also be compared to more modern CDT approaches such as the digital clock drawing task, which collects well over 10,000 variables per patient.11
Conclusion
When selecting a CDT scoring system to evaluate patients, practitioners are encouraged to establish a plan for acquiring and maintaining rater reliability, and to discuss the pros and cons of each clock scoring approach before final administration in their clinical setting.
Acknowledgments
Funding: This work was supported by The Lawrence M. Goodman Research Award (MSARF; to BF, KW, and MZ), the National Institutes of Health (grant nos. R01 NR014181 to CP, IIS-1404494 to CP, R01 AG055337 to CP and PT, and P50AG047266), and by the I. Heermann Anesthesia Foundation (to CG and CP).
Footnotes
Conflicts of Interest: NONE
REFERENCES
1. Oresanya LB, Lyons WL, Finlayson E. Preoperative assessment of the older patient: narrative review. JAMA 2014;311:2110–2120.
2. Kaplan E. The process approach to neuropsychological assessment of psychiatric patients. J Neuropsychiatry Clin Neurosci 1990;2:72–87.
3. Libon DJ, Malamut BL, Swenson R, Sands LP, Cloud BS. Further analyses of clock drawings among demented and nondemented older subjects. Arch Clin Neuropsychol 1996;11:193–205.
4. Price CC, Cunningham H, Coronado N, et al. Clock drawing in the Montreal Cognitive Assessment: recommendations for dementia assessment. Dement Geriatr Cogn Disord 2011;31:179–187.
5. Nasreddine ZS, Phillips NA, Bédirian V, et al. The Montreal Cognitive Assessment, MoCA: a brief screening tool for mild cognitive impairment. J Am Geriatr Soc 2005;53:695–699.
6. Borson S, Scanlan JM, Chen P, Ganguli M. The Mini-Cog as a screen for dementia: validation in a population-based sample. J Am Geriatr Soc 2003;51(10):1451–1454.
7. McCarten JR, Anderson P, Kuskowski MA, McPherson SE, Borson S. Screening for cognitive impairment in an elderly veteran population: acceptability and results using different versions of the Mini-Cog. J Am Geriatr Soc 2011;59:309–313.
8. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.
9. Colon-Perez LM, Triplett W, Bohsali A, et al. A majority rule approach for region-of-interest-guided streamline fiber tractography. Brain Imaging Behav 2016;10:1137–1147.
10. Culley DJ, Flaherty D, Reddy S, et al. Preoperative cognitive stratification of older elective surgical patients: a cross-sectional study. Anesth Analg 2016;123:186.
11. Hizel LP, Warner ED, Wiggins ME, et al. Clock drawing performance slows for older adults after total knee replacement surgery. Anesth Analg 2018.

