Abstract
This paper reports on a study that aimed to assess the inter-rater agreement of observable neurological signs in the upper and lower limbs (eg inspection, gait, cerebellar tests and coordination) and elicitable signs (eg tone, strength, reflexes and sensation). Thirty patients were examined by two neurology doctors, at least one of whom was a consultant. The doctors’ findings were recorded on a standardised pro forma. Inter-rater agreement was assessed using the kappa (κ) statistic, which is chance corrected. There was significantly better agreement between the two doctors for observable than for elicitable signs (mean ± standard deviation [SD] κ, 0.70 ± 0.17 vs 0.41 ± 0.22, p = 0.002). Almost perfect agreement was seen for cerebellar signs and inspection (a combination of speed of movement, muscle bulk, wasting and tremor); substantial agreement for strength, gait and coordination; moderate agreement for tone and reflexes; and only fair agreement for sensation. The inter-rater agreement is therefore better for observable neurological signs than for elicitable signs, which may be explained by the additional skill and cooperation required to elicit rather than just observe clinical signs. These findings have implications for clinical practice, particularly in telemedicine, and highlight the need for standardisation of the neurological examination.
KEYWORDS: Elicitable signs, inter-rater reliability, neurological examination, observable signs, telemedicine
Introduction
A neurological consultation comprises a verifiable history, a reliable examination, appropriate investigations and their subsequent interpretation. When a specialist opinion is sought using telemedicine, the remote clinician relies on another doctor's neurological examination. Some neurological signs have to be elicited by the examining physician, eg tone, strength and sensory deficits, but other valuable signs can be seen and heard by both the remote and the examining physician, eg walking, the speed of finger movements and the maintenance of the outstretched arm in a particular posture.
The inter-rater reliability of the National Institutes of Health Stroke Scale (NIHSS; Table 1), which grades the motor assessment into five categories – no drift (0), drift before 10 seconds (1), falls before 10 seconds (2), no effort against gravity (3) and no movement (4) – and of the traditional neurological examination has been investigated before (Table 2). However, none of these studies analysed their data according to whether the clinical signs could be observed from the end of the bed or had to be elicited.
Table 1.
Table 2.
Telemedicine has been used to provide an out-of-hours stroke thrombolysis service to hospitals in south-east Wales since April 2012. We therefore investigated the inter-rater agreement of some elicitable and observable neurological signs in the upper and lower limbs to inform an assessment of their utility in the clinical examination performed using telemedicine.
Methods
Thirty patients (mean ± standard deviation [SD] age 55 ± 15 years) recruited over a 4-week period in a routine neurology outpatient clinic gave written consent to be examined by a consultant and, in the same clinic session, one other neurology doctor (foundation year 2, core medical trainee, specialty registrar or consultant). The second examiner, blinded to the findings of the first, repeated the examination of the upper and lower limbs. Examiners were asked to record their findings immediately on a standardised pro forma (Table 3) by selecting from binary options (eg present/absent for clonus) and categorical options (eg absent, depressed, normal or brisk for reflexes and Medical Research Council grades 0–5 for strength). Clinicians did not undertake any special training or instruction in clinical examination as part of this study and were asked to examine patients in accordance with their usual clinical practice, with appropriate equipment provided.
Table 3.
Inter-rater agreement was assessed using the κ statistic, which makes no assumptions about which doctor is correct – only whether they agree. The κ benchmarks used in this paper were those of Landis and Koch: <0 represents poor agreement, 0–0.20 slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement and 0.81–1.00 almost perfect agreement.15 A significant difference in agreement between individual signs was taken to be present if there was no overlap in the 95% confidence intervals for their κ values. The significance of differences between grouped data was assessed by comparing mean κ values using a t-test. The analysis was performed using Microsoft Excel 2007 spreadsheet software.
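For readers unfamiliar with the statistic, the following minimal Python sketch shows how observed agreement between two raters is corrected for the agreement expected by chance. It is illustrative only: the study's calculations were performed in Excel, and the reflex gradings shown are invented.

```python
# Minimal sketch of the kappa statistic for two raters assessing one sign.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of
# agreement and p_e is the agreement expected by chance from each rater's
# marginal frequencies. Data below are invented for illustration.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed agreement: proportion of patients on whom the two doctors agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over categories of the product of marginal proportions.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

doctor_1 = ["normal", "brisk", "normal", "absent", "normal", "brisk"]
doctor_2 = ["normal", "normal", "normal", "absent", "normal", "brisk"]
print(round(cohens_kappa(doctor_1, doctor_2), 2))  # ≈ 0.71
```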
The study was part of a medical student placement and was approved by the North Wales Research Ethics (Central & East) Proportionate Review Sub-Committee (11-WA-0311) and the Cardiff and Vale Research and Development Department (11/CMC/5212).
Results
The results are summarised in Fig 1 and Table 4. The inter-rater reliability for observable signs was better than for elicitable signs (mean ± SD κ value 0.70 ± 0.17 vs 0.41 ± 0.22, p = 0.002). We considered whether the difference between observable and elicitable signs was a consequence of the variable number of available options – for example, reflexes could be normal, brisk, reduced or absent but speed of movement could only be normal or slow. We therefore recalculated the inter-rater agreement for all data using a binary grouping – for example, reflexes could be abnormal (brisk, reduced or absent) or normal and strength could be abnormal (any grading ≤4) or normal (grade 5). The difference in the inter-rater agreement between observable and elicitable signs was still significant (mean ± SD κ value 0.76 ± 0.09 vs 0.46 ± 0.21, p = 0.014).
Table 4.
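As an illustration of the binary regrouping described above, the following hypothetical sketch collapses MRC strength grades (and, analogously, reflex gradings) to normal/abnormal before recalculating κ. It assumes scikit-learn's cohen_kappa_score is available and uses invented data; it is not the analysis performed in the study.

```python
# Sketch of the binary regrouping used in the sensitivity analysis: collapse
# multi-category gradings to normal/abnormal, then recompute kappa.
# Requires scikit-learn; the data are invented for illustration.
from sklearn.metrics import cohen_kappa_score

def binary_strength(mrc_grade):
    # MRC grade 5 counts as normal; any grade of 4 or below counts as abnormal.
    return "normal" if mrc_grade == 5 else "abnormal"

def binary_reflex(grade):
    # Brisk, depressed or absent reflexes all count as abnormal.
    return "normal" if grade == "normal" else "abnormal"

# Hypothetical MRC strength grades recorded by the two doctors for five patients.
doctor_1 = [5, 4, 5, 3, 5]
doctor_2 = [5, 5, 5, 3, 5]

print(cohen_kappa_score([binary_strength(g) for g in doctor_1],
                        [binary_strength(g) for g in doctor_2]))  # ≈ 0.55
```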
Discussion
Signs that have to be elicited involve skill on the part of the examiner, the cooperation of the patient and then interpretation – for example, to test tone, the patient must be relaxed and comfortable and the examining doctor must understand the actions required to elicit the clinical features of spasticity and rigidity. Informal observation of the techniques used by different doctors in this study suggested marked variation in both technique and interpretation, which may explain the poorer inter-rater agreement for elicitable signs. By comparison, it is more straightforward to observe patients at rest or performing actions such as tapping movements of the finger and thumb (to assess speed of movement) or walking, which may explain the better agreement seen for these observable signs. Miller and Johnston16 found foot tapping to be more reliable for detecting upper motor neurone weakness (κ = 0.73; sensitivity 86%, specificity 84%) than testing of the plantar (Babinski) reflex (κ = 0.30; sensitivity 35%, specificity 77%).
The previous literature (see Tables 1 and 2) shows wide variation in the agreement reported for elicitable signs, with κ values ranging from 0.29 to 1.00 (mean 0.65) for strength and from 0.15 to 1.00 (mean 0.46) for sensation. This variation in the reliability of the peripheral neurological examination, together with the results of this study, highlights that relying on another doctor's assessment may affect diagnosis and management.
One of the concerns of clinicians providing opinions about patients they are unable to examine in person is that their clinical assessment is impoverished by the lack of direct patient contact. However, this study suggests that signs that require elicitation have poorer inter-rater reliability than ‘end-of-the-bed’ signs, which can be observed by both the attending physician and the remote physician using telemedicine equipment. The importance of being a good noticer17 is as relevant today as it ever was; rather than compromising clinical skills, the technology of telemedicine may prompt clinicians to review which parts of the clinical examination are most reliable.
Conclusion
Observable neurological signs have significantly better inter-rater agreement than elicitable signs. These findings have implications for clinical practice, including telemedicine.
Acknowledgements
We would like to thank clinical colleagues in the department of neurology for their help and support. This work was first presented at the All Wales Stroke Meeting (AWSM) video conference.
References
1. Anderson ER, Smith B, Ido M, Frankel M. Remote assessment of stroke using iPhone 4. J Stroke Cerebrovasc Dis 2013;22:340–4. doi:10.1016/j.jstrokecerebrovasdis.2011.09.013
2. Demaerschalk BM, Vegunta S, Vargas BB, et al. Reliability of real-time video smartphone for assessing National Institutes of Health Stroke Scale scores in acute stroke patients. Stroke 2012;43:3271–7. doi:10.1161/STROKEAHA.112.669150
3. Gonzalez MA, Hanna N, Rodrigo ME, et al. Reliability of prehospital real-time cellular video phone in assessing the simplified National Institutes of Health Stroke Scale in patients with acute stroke: a novel telemedicine technology. Stroke 2011;42:1522–7. doi:10.1161/STROKEAHA.110.600296
4. Meyer BC, Lyden PD, Al-Khoury L, et al. Prospective reliability of the STRokE DOC wireless/site independent telemedicine system. Neurology 2005;64:1058–60. doi:10.1212/01.WNL.0000154601.26653.E7
5. Handschu R, Littmann R, Reulbach U, et al. Telemedicine in emergency evaluation of acute stroke: interrater agreement in remote video examination with a novel multimedia system. Stroke 2003;34:2842–6. doi:10.1161/01.STR.0000102043.70312.E9
6. Meyer BC, Hemmen TM, Jackson CM, Lyden PD. Modified National Institutes of Health Stroke Scale for use in stroke clinical trials: prospective reliability and validity. Stroke 2002;33:1261–6. doi:10.1161/01.STR.0000015625.87603.A7
7. Shafqat S, Kvedar JC, Guanci MM, et al. Role for telemedicine in acute stroke: feasibility and reliability of remote administration of the NIH stroke scale. Stroke 1999;30:2141–5. doi:10.1161/01.STR.30.10.2141
8. Brott T, Adams HP Jr, Olinger CP, et al. Measurements of acute cerebral infarction: a clinical examination scale. Stroke 1989;20:864–70. doi:10.1161/01.STR.20.7.864
9. Goldstein L, Bertels C, Davis J. Interrater reliability of the NIH stroke scale. Arch Neurol 1989;46:660–2. doi:10.1001/archneur.1989.00520420080026
10. Carswell C, Rañopa M, Pal S, et al. Video rating in neurodegenerative disease clinical trials: the experience of PRION-1. Dement Geriatr Cogn Dis Extra 2012;2:286–97. doi:10.1159/000339730
11. Hand P, Haisma JA, Kwan J, et al. Interobserver agreement for the bedside clinical assessment of suspected stroke. Stroke 2006;37:776–80. doi:10.1161/01.STR.0000204042.41695.a1
12. Jepsen J, Laursen LH, Hagert CG, et al. Diagnostic accuracy of the neurological upper limb examination I: inter-rater reproducibility of selected findings and patterns. BMC Neurol 2006;6:8. doi:10.1186/1471-2377-6-8
13. Lindley RI, Warlow CP, Wardlaw JM, et al. Interobserver reliability of a clinical classification of acute cerebral infarction. Stroke 1993;24:1801–4. doi:10.1161/01.STR.24.12.1801
14. Shinar D, Gross CR, Mohr JP, et al. Interobserver variability in the assessment of neurologic history and examination in the Stroke Data Bank. Arch Neurol 1985;42:557–65. doi:10.1001/archneur.1985.04060060059010
15. Landis J, Koch G. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74. doi:10.2307/2529310
16. Miller T, Johnston SC. Should the Babinski sign be part of the routine neurologic examination? Neurology 2005;65:1165–8. doi:10.1212/01.wnl.0000180608.76190.10
17. Asher R. Clinical sense. BMJ 1960;1:985–93. doi:10.1136/bmj.1.5178.985