Abstract
Excessive paperwork places a considerable additional burden on health-care professionals. In Thai health-care systems, physicians manually review medical records to select an appropriate principal diagnosis and other co-morbidities and convert them into ICD-10 codes to claim financial support from the government. At Maharaj Nakorn Chiang Mai hospital, physicians document about 160,000 ICD-10 codes and 46,000 in-patient discharge summaries each year. To decrease physicians' burden of manual paperwork, we created a new approach that automatically analyses discharge summary notes and maps the diagnoses to ICD-10s. The approach combines SNOMED-CT and natural language processing techniques in 3 steps: cleaning data; extracting keywords from discharge summary notes; and matching keywords to ICD-10. In this paper, we show that mapping clinical documents by using approximate matching and SNOMED-CT has the potential to automate the ICD-10 mapping process.
Introduction
ICD-10 stands for the 10th revision of the International Classification of Diseases and Related Health Problems, documented by the World Health Organisation (WHO) [1]. The ICD-10 contains international identification codes for terms such as signs and symptoms, disease names, procedures, and abnormal findings. These codes serve as a global health information standard for mortality and morbidity statistics, clinical care, and research, and are used to analyse diseases and study disease patterns, as well as to manage health-care, monitor outcomes, and allocate resources [2-4]. In Thai health-care systems, physicians review medical records to select an appropriate principal diagnosis and other co-morbidities, and convert them into ICD-10s to claim financial support from the government. Errors in this process, for instance missing ICD-10 inputs or incorrect interpretation, affect the correctness and completeness of the records. They result in challenges such as intensive labour for rechecking the ICD-10 inputs, incorrect health epidemiological reports, and insufficient reimbursement for managing a hospital [2,5].
Paperwork overload is a major problem and places a heavy burden on health-care professionals [6]. On average, each year physicians document 160,000 ICD-10 codes and inpatient departments generate 46,000 discharge summaries at the Maharaj Nakorn Chiang Mai hospital. This massive workload increases the risk of human error during the processing of ICD-10 documentation.
To address the challenges involved in human ICD-10 processing, computer technologies could help to assist or automate the ICD-10 mapping process. In 2008, a study investigated reverse mapping of ICD-10-CA (the Canadian version of the 10th revision of the International Statistical Classification of Diseases) to SNOMED-CT by applying an exact and partial mapping algorithm [7]. However, this method had limitations in mapping abstract terms. Because of these limitations, the algorithms were only recommended for semi-automating the ICD-10 mapping process and supporting humans in completing the documentation. Subsequently, in 2009, natural language processing techniques were applied to increase the efficacy of the ICD-10 mapping process, producing more precise results using acronym processing and noun phrase extraction [8].
To address the ICD-10 mapping problem, we developed a new approach that combines SNOMED-CT and natural language processing techniques to automatically map electronic discharge summary notes recorded by physicians in Maharaj Nakorn Chiang Mai hospital to ICD-10. Our goal is for this approach to decrease the burden of manual ICD-10 mapping for healthcare professionals, provide more complete documents with fewer errors, and help hospital administrators to fully and accurately claim financial support from the government.
Materials and Methods
Samples
We randomly collected 560 discharge summaries at Maharaj Nakorn Chiang Mai hospital from 2006 to 2016 to test our approach. The data were retrospectively retrieved under ethical approval by the Research Ethics Committee of the Faculty of Medicine, Chiang Mai University (Study code: PHY-2562-06152). All methods were carried out with exemption criteria (waiver of informed consent) in accordance with the Faculty of Medicine, Chiang Mai University regulations.
SNOMED-CT ICD-10 dictionary
We used the SNOMED-CT dictionary (version 45_THIS_SNOMED_CT_US_201903_Full Excel [9]) provided by the Thai Health Information Standards Development Centre (THIS) [10]. The dictionary includes the SNOMED-CT terms for abnormalities, disorders, findings, and procedures that map to ICD-10, and it provides a complete map (representing relationship and hierarchy by concept id) between medical terms and ICD-10. SNOMED-CT, or Systematized Nomenclature of Medicine Clinical Terms, is an organised collection of clinical terms arranged in a comprehensive manner, created by the International Health Terminology Standards Development Organisation (IHTSDO) in 2007 [11]. The library consists of core general terminology for electronic hospital documentation. It covers a wide variety of medical categories such as clinical signs and symptoms, findings, diagnoses, organisms, procedures, pharmaceuticals, and medical devices. A collaboration between the IHTSDO and the WHO in 2012 linked SNOMED-CT nomenclature to ICD-10 terminology using concept ids [12]. We used this relationship to link medical terms in the discharge summary notes to ICD-10.
Map discharge summaries to ICD-10
The discharge summaries were processed and mapped to ICD-10 in the following 3 steps:
Cleaning data
Extracting keywords from discharge summary notes
Matching keywords to ICD-10
Because the discharge summary notes and the ICD-10 description terms of SNOMED-CT were free-form text, we could not map the contents to ICD-10 directly. The contents first had to be tokenised and pre-processed to remove irrelevant words and characters. Regular expressions were used to convert the text to lowercase and to remove stop words and any characters other than English letters and digits. The remaining contents were treated as keywords. The keywords contained patient information, findings, abnormalities, and words or phrases relevant to ICD-10s.
Mapping keywords from the notes directly to ICD-10 is unlikely to yield matches, because ICD-10 terms were not the only terms recorded in the notes. We therefore needed a bridge between the keywords and ICD-10. The SNOMED-CT dictionary links medical terms to ICD-10 by concept id, which allowed us to match the keywords in the notes to a concept id and then link that id to the related ICD-10s.
Because the keywords in the notes vary in spelling and wording, Levenshtein distance was used to match keywords to SNOMED-CT terms. The Levenshtein distance was introduced by Vladimir Levenshtein in 1965 [13]. It measures the dissimilarity between two strings as the minimum number of single-character insertions, deletions, or substitutions needed to change one string into the other. For example, the Levenshtein distance between "Oedema" and "Edema" is 1 (1 deletion of "O"), and the Levenshtein distance between "Mycoplasma infection" and "Mycoplasma pneumonia" is 7 (for example, changing "infection" into "pneumonia" takes 3 substitutions, 2 deletions of "f" and "i", and 2 insertions of "i" and "a"). The greater the distance, the more the two strings differ.
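The distance described above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch for illustration only, not the implementation used in the study; the function name `levenshtein` is ours, and the inputs are lowercased as they would be after cleaning.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("oedema", "edema"))                               # 1
print(levenshtein("mycoplasma infection", "mycoplasma pneumonia"))  # 7
```

Only two rows of the table are kept in memory, so the sketch needs space proportional to the shorter dimension while still visiting every pair of characters.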
We measured text similarity with the FuzzyWuzzy package [14], which computes the standard Levenshtein distance and derives similarity ratios (presented as percentages) between the words of two sentences. We used the token set ratio function, which intersects the word sets of the two strings, pairs the intersection with the remaining non-intersecting words of each string, calculates the ratio for each pairing, and selects the highest ratio. The set operation allows unsorted words and sentences of different sizes to be compared flexibly. The higher the similarity ratio, the more similar the two sequences. The performance of the token set ratio function was compared with two baseline algorithms: 1) character set ratio and 2) Jaccard similarity. The character set ratio is the number of unique characters shared by two strings divided by the number of unique characters in their union. Jaccard similarity [15] is calculated in the same way as the character set ratio but is applied at the token level. All tokens of the discharge summary notes were compared with SNOMED-CT tokens. When one ICD-10 was linked to multiple SNOMED-CT terms, the term with the highest similarity was selected. Finally, we ranked the ICD-10s by similarity and rescaled the rank index to a percentage scale from 100 (most similar) to 0 (least similar).
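The three similarity measures can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the code used in the study: `difflib.SequenceMatcher` stands in for the Levenshtein-based ratio, the token set logic follows our reading of the description above, spaces are ignored in the character set ratio, and all function names are ours.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> float:
    """Similarity of two strings as a percentage (0-100).
    Assumption: SequenceMatcher replaces the Levenshtein-based ratio."""
    if not a or not b:
        return 0.0
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_ratio(s1: str, s2: str) -> float:
    """Order-insensitive similarity built from shared and unshared tokens."""
    t1, t2 = set(s1.split()), set(s2.split())
    if not t1 or not t2:
        return 0.0
    shared = " ".join(sorted(t1 & t2))
    only1 = " ".join(sorted(t1 - t2))
    only2 = " ".join(sorted(t2 - t1))
    combined1 = f"{shared} {only1}".strip()
    combined2 = f"{shared} {only2}".strip()
    # Compare each pairing and keep the highest ratio.
    return max(_ratio(shared, combined1),
               _ratio(shared, combined2),
               _ratio(combined1, combined2))


def character_set_ratio(s1: str, s2: str) -> float:
    """Shared unique characters over all unique characters (spaces ignored)."""
    c1, c2 = set(s1.replace(" ", "")), set(s2.replace(" ", ""))
    union = c1 | c2
    return 100.0 * len(c1 & c2) / len(union) if union else 0.0


def jaccard_similarity(s1: str, s2: str) -> float:
    """Shared unique tokens over all unique tokens."""
    t1, t2 = set(s1.split()), set(s2.split())
    union = t1 | t2
    return 100.0 * len(t1 & t2) / len(union) if union else 0.0


print(token_set_ratio("mycoplasma pneumonia", "pneumonia due to mycoplasma"))
print(jaccard_similarity("lung infection", "infection toe"))
print(character_set_ratio("abc", "bcd"))
```

In the first call every token of the shorter string appears in the longer one, so the token set ratio reaches its maximum even though the word order and length differ. That is the flexibility the method relies on, and it is also why extra words around a term do not lower the score.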
We evaluated performance by calculating the minimum, maximum, median, average, and standard deviation of the ranks of the actual ICD-10s and compared these across the three algorithms. A higher rank for an actual ICD-10 represents a higher-quality mapping. We were interested in how many actual ICD-10s were correctly detected (given a high rank) and how many were missed by the algorithms (given a low rank). In addition, we explored falsely predicted ICD-10s that were ranked higher than the actual ICD-10s. True and false positives were defined as follows. A true positive was an actual ICD-10 ranked higher than 50%. A false positive was a falsely predicted ICD-10 ranked higher than the actual ICD-10s. A false negative was an actual ICD-10 ranked lower than 50%. True negatives were not evaluated in this study. Finally, we manually evaluated and discussed the patterns of true positives, false positives, and false negatives.
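The ranking and evaluation rules above can be sketched as follows. This is a hedged illustration: the function names, the similarity values in the example, the use of the population standard deviation, and the handling of a rank of exactly 50 (not specified in the text) are all our assumptions.

```python
from statistics import mean, median, pstdev


def best_similarity_per_code(matches):
    """Keep the highest similarity when several SNOMED-CT terms share an ICD-10."""
    best = {}
    for code, similarity in matches:
        if similarity > best.get(code, float("-inf")):
            best[code] = similarity
    return best


def percent_ranks(similarity_by_code):
    """Rank codes by similarity and rescale so best = 100 and worst = 0."""
    ordered = sorted(similarity_by_code, key=similarity_by_code.get)  # ascending
    if len(ordered) == 1:
        return {ordered[0]: 100.0}
    step = 100.0 / (len(ordered) - 1)
    return {code: position * step for position, code in enumerate(ordered)}


def classify_actual(rank, threshold=50.0):
    """Label the rank of an actual ICD-10.
    Assumption: a rank of exactly 50 counts as a false negative."""
    return "true positive" if rank > threshold else "false negative"


def summarise(ranks):
    """Summary statistics of the kind reported per algorithm.
    Assumption: population standard deviation."""
    return {"min": min(ranks), "max": max(ranks), "median": median(ranks),
            "average": mean(ranks), "std": pstdev(ranks)}


# Hypothetical (ICD-10, similarity) pairs for one note; the values are made up.
matches = [("J189", 100.0), ("J189", 80.0), ("A493", 90.0),
           ("J157", 95.0), ("G824", 60.0), ("S999", 40.0)]
ranks = percent_ranks(best_similarity_per_code(matches))
actual = ["J157", "G824"]  # codes recorded by the physician in this example
for code in actual:
    print(code, ranks[code], classify_actual(ranks[code]))
print(summarise([ranks[code] for code in actual]))
```

With these made-up values, J157 receives a rank of 75 and counts as a true positive, while G824 receives a rank of 25 and counts as a false negative.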
Results
We first illustrate how a raw discharge summary note was processed through to the mapping of relevant ICD-10s. The raw notes were cleaned by using regular expressions to remove stop words and any characters other than English letters and digits. Figure 1 shows the text of a discharge summary before and after cleaning. SNOMED-CT terms were cleaned by the same process (data not shown).
Figure 1:
Comparison of text before (raw text, left) and after (clean text, right) data pre-processing
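The cleaning step can be sketched with a single regular expression. This is a minimal illustration, not the study's code: the function name `clean_note` and the short stop-word list are hypothetical, and we read the description as keeping only English letters and digits.

```python
import re

# Hypothetical stop-word list for illustration; the list used in the
# study is not given in the text.
STOP_WORDS = frozenset({"a", "an", "and", "has", "in", "is", "of",
                        "the", "to", "was", "with"})


def clean_note(text: str) -> list[str]:
    """Lowercase the note, drop every character that is not an English
    letter, a digit, or whitespace, then remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


print(clean_note("The patient has Oedema & wheezing!!"))
```

The returned tokens are the keywords that are then compared with the cleaned SNOMED-CT terms.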
We applied the Levenshtein distance using the FuzzyWuzzy package to compare keywords (from the cleaned discharge summary notes, as presented in Figure 1) with the terms in the SNOMED-CT dictionary and to calculate pairwise similarity ratios. We compared all keywords, ranked the similarity ratios, and linked them to ICD-10s using the concept id. Table 1 shows the top thirteen predicted ICD-10 mapping results. This case was diagnosed with pneumonia due to Mycoplasma pneumoniae (J157) and spastic tetraplegia (G824), which were matched in seventh and thirteenth position, respectively; the related term mycoplasma infection (A493) was matched in fourth position.
Table 1: SNOMED-CT terms mapped to ICD-10.
Concept id | SNOMED-CT terms | ICD-10s | ICD-10 terms |
25797006 | blood aspiration | P242 | Neonatal aspiration of blood |
56018004 | wheezing | R062 | Wheezing |
233604007 | pneumonia | J189 | Pneumonia, unspecified |
186464008 | mycoplasma infection | A493 | Mycoplasma infection, unspecified |
125591009 | injury pharynx | S198 | Other specified injuries of neck |
238159008 | desaturation blood | P84 | Other problems with newborn |
46970008 | mycoplasma pneumonia | J157 | Pneumonia due to Mycoplasma pneumoniae |
65520001 | ph | E748 | Other specified disorders of carbohydrate metabolism |
299989006 | infection toe | L089 | Local infection of the skin and subcutaneous tissue, unspecified |
722435003 | dystonia 16 | G248 | Other dystonia |
128601007 | lung infection | J189 | Pneumonia, unspecified |
282776008 | injury toe | S999 | Unspecified injury of ankle and foot |
192965001 | spastic tetraplegia | G824 | Spastic tetraplegia |
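The concept-id bridge that Table 1 illustrates can be sketched as two lookups: from a cleaned SNOMED-CT term to its concept id, and from the concept id to an ICD-10. The rows below are copied from Table 1; the dictionary layout and the function name `icd10_for_term` are our assumptions, and exact matching is used here only for brevity in place of approximate matching.

```python
# concept id -> cleaned SNOMED-CT term (rows copied from Table 1)
SNOMED_TERMS = {
    "233604007": "pneumonia",
    "46970008": "mycoplasma pneumonia",
    "128601007": "lung infection",
    "192965001": "spastic tetraplegia",
}

# concept id -> (ICD-10 code, ICD-10 term) (rows copied from Table 1)
ICD10_MAP = {
    "233604007": ("J189", "Pneumonia, unspecified"),
    "46970008": ("J157", "Pneumonia due to Mycoplasma pneumoniae"),
    "128601007": ("J189", "Pneumonia, unspecified"),
    "192965001": ("G824", "Spastic tetraplegia"),
}


def icd10_for_term(term: str):
    """Return the ICD-10 linked to a SNOMED-CT term via its concept id,
    or None when the term is not in the dictionary."""
    for concept_id, snomed_term in SNOMED_TERMS.items():
        if snomed_term == term:
            return ICD10_MAP.get(concept_id)
    return None


print(icd10_for_term("mycoplasma pneumonia"))
print(icd10_for_term("lung infection"))
```

Two different SNOMED-CT terms ("pneumonia" and "lung infection") lead to the same ICD-10 (J189), which is why one ICD-10 can appear more than once in a list of matched terms.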
Overall, there were 498 actual unique ICD-10s recorded by humans in the 560 discharge summaries. Figure 2 shows the frequency of ICD-10s grouped by ICD-10 categories, sorted from A to Z. The three most common ICD-10 categories are: Diseases of the circulatory system (I), Neoplasms (C), and Endocrine, nutritional and metabolic diseases (E).
Figure 2:
Frequency of ICD-10 terms in each category
Table 2 shows the distribution of the rank of the actual ICD-10s predicted by character set ratio, Jaccard similarity, and token set ratio. Token set ratio performed best: its median rank of 98.72 means that half of the actual ICD-10s were ranked within the top two percent of the distribution.
Table 2: Statistical analysis of the prediction of actual ICD-10s.
Algorithms | Median | Average | Max | Min | Std |
Character set ratio | 93.28 | 87.57 | 99.99 | 15.97 | 14.08 |
Jaccard similarity | 97.05 | 86.41 | 100.00 | 1.12 | 23.57 |
Token set ratio | 98.72 | 92.31 | 100.00 | 9.33 | 13.92 |
Figure 3 shows the distribution of predicted ICD-10 rank grouped by ICD-10 categories. Token set ratio performs best across all categories compared with the other algorithms. Jaccard similarity generally performs better than character set ratio, except in category Z. The poor performance of Jaccard similarity in category Z lowers its overall performance (as reflected in the average in Table 2).
Figure 3:
Distribution of predicted ICD-10 rank grouped by ICD-10 categories
Figure 4 shows the rank distribution of actual ICD-10s in each category predicted by token set ratio (x-axis). The 485 ICD-10s (97.39%) with an average rank above 50% represent true positives, and the 13 ICD-10s (2.61%) with an average rank below 50% represent false negatives (annotated with their ICD-10s in the figure). The false negatives are actual ICD-10s for which the algorithm did not recognise any related terms in the discharge summary notes.
Figure 4:
Rank distribution of actual ICD-10s in each category predicted by the token set ratio
Discussion
Mapping clinical documents to SNOMED-CT or ICD-10 is a challenging task, and we have explored how computerised techniques can improve the process and reduce human workload. Our study showed that combining the FuzzyWuzzy approximate matching algorithm with SNOMED-CT gave the flexibility needed to map clinical documents to ICD-10, which helped to discover relevant ICD-10s in the document that were not recorded by humans. SNOMED-CT terms cover the major medical terms that commonly appear in clinical notes, and the dictionary helped us to link the terms to related ICD-10s. With approximate matching, the target words or phrases did not need to appear in the same order, length, or spelling in order to be matched. For example, Mycoplasma pneumonia was detected although this exact phrase was not written in the notes. The main advantage of this flexibility is that the automated system is more sensitive than a physician in matching related ICD-10s from the discharge summary notes. Typically, a physician selects only a few important ICD-10s because of time limitations, while our process is able to explore all relevant ICD-10s. This is very helpful for financial reimbursement because it could produce more complete ICD-10 coding to report to the government, especially in a case presenting with many complications, such as ectopic pregnancy.
In the example shown in Figure 5, a physician using manual coding recorded one actual ICD-10, "Other ectopic pregnancy", for this case, while the algorithm detected the correct ICD-10 of "tubal pregnancy" and additionally identified other missing relevant ICD-10s from the discharge summary, including pallor, abnormal bleeding, present pain, ovary tender, and history of doxycycline allergy. However, the advantage of flexibility also creates a major disadvantage of recruiting unrelated ICD-10s, so an optimal system needs to be able to distinguish sensible from irrelevant responses. In this case, irrelevant or incorrect ICD-10s such as murmur, single pregnancy, edema, small uterus, normal pregnancy, heart irregular, small ovary, conjunctiva closed, and lung cyst were also matched.
We manually analysed the mapping results from all notes, in particular the 13 false negatives. We found that the mapping process worked well when SNOMED-CT terms were present in the discharge summary (but not necessarily in the same order or as an exact match). However, there are several major flaws that could be addressed in future studies. First, the algorithms were unable to comprehend abbreviations that were not internationally used and not pre-documented in the mapped SNOMED-CT. For example, for a discharge summary coded as O820 (Delivery by elective caesarean section), the algorithm matched an incorrect ICD-10 code because the physician wrote "C/S" in the discharge summary instead of spelling out "caesarean section". Second, the algorithm did not weigh the importance of term sequences. If a term placed at the beginning of a paragraph could meaningfully match terms from later in the document, the algorithm combined those terms to form an ICD-10 match. For example, Figure 6 shows a document with an incorrect ICD-10 mapping of IUD contraception instead of malignant neoplasm of the endocervix.
Figure 5:
Ectopic pregnancy case
Figure 6:
Malignant neoplasm of endocervix case
The algorithm combined "Contraception" with "iub", converted the "b" in "iub" to "D" according to the Levenshtein distance, and considered the two words to be similar. The correct ICD-10, malignant neoplasm of endocervix (C530), was not detected in a top-ranking position in this case because of the abbreviation problem. HSIL stands for High grade squamous intraepithelial lesion (R87), a type of malignant neoplasm of the endocervix which was not fully named in the document. To address the abbreviation problem, we tried applying Allie, a database of abbreviations [16], to convert abbreviations in the discharge summary notes to full text. We found that applying the database to map abbreviations directly to full text created many incorrect mappings. For example, "PE" was commonly noted in the documents for "Physical Examination", but "PE" can also mean "Pulmonary Embolism". Physical examination carries little meaning for ICD-10 mapping, in contrast to pulmonary embolism. The automatic direct conversion therefore tended to convert PE to pulmonary embolism, which was mostly incorrect. A physician is able to distinguish between the two terms by considering the surrounding context, so we might need to take context into account to achieve better conversion performance in future work.
A third challenge was that negative findings are commonly reported in a physician's discharge summary notes. Typical negative findings such as no palpable mass, no pale conjunctiva, no jaundice, and no hepatosplenomegaly were frequently found in the notes, and the algorithms were not sophisticated enough to comprehend the negated phrases. Negative findings relating to murmur, single pregnancy, edema, small uterus, normal pregnancy, heart irregular, small ovary, conjunctiva closed, and lung cyst in the ectopic pregnancy case were all detected as false positives. The same pattern of false positives was also found in the malignant neoplasm of endocervix case, where "Bleeding" was identified while the document clearly noted "inactive bleeding".
Lastly, the algorithms were not able to distinguish past, present, and future events noted in the discharge summary. For example, a patient presented at a hospital with dyspnea on exertion and had a history of acute myocardial infarction and ruptured appendicitis 10 years earlier. Although the appendicitis and myocardial infarction were past events, the algorithm still treated both diseases as present problems, which leads to ICD-10 mismatching.
In conclusion, the results of this study show that mapping clinical documents by using approximate matching and SNOMED-CT has the potential to be used to screen keywords to map relevant ICD-10s, although there are some clear challenges still to be addressed. Prior research has shown that advanced machine learning methods significantly improve the mapping of ICD-10s from clinical text [17]. In future work, we aim to apply such machine learning methods and compare the results with the method in this paper.
Conclusion
Using approximate matching and the SNOMED-CT dictionary to map discharge summary notes to ICD-10 shows potential for reducing the intensive documentation workload in a health-care system. Although this approach still has limitations, further research could improve it by pre-processing common abbreviations, utilising term frequency-inverse document frequency (TF-IDF) to prioritise the importance of keywords, weighing the importance of word sequencing to minimise the effects of negative findings, and applying advanced machine learning methods.
Acknowledgement
This work was supported by the Student Affair, and IT Department, Faculty of Medicine, Chiang Mai University. We would like to thank the Tilley family for assistance in editing and reviewing the manuscript.
References
- [1].WHO — International Classification of Diseases (ICD) Information Sheet
- [2].Horsky Jan, Drucker Elizabeth A., Ramelson Harley Z. Accuracy and Completeness of Clinical Coding Using ICD-10 for Ambulatory Visits. AMIA … Annual Symposium proceedings. AMIA Symposium. 2017;2017:912–920.
- [3].Liang Fu Wen, Wang Liang Yi, Liu Lin Yi, Li Chung Yi, Lu Tsung Hsueh. Physician code creep after the initiation of outpatient volume control program and implications for appropriate ICD-10-CM coding. BMC Health Services Research. 2020;20(1):1–7. doi: 10.1186/s12913-020-5001-5.
- [4].Tsopra Rosy, Peckham Daniel, Beirne Paul, Rodger Kirsty, Callister Matthew, White Helen, Jais Jean Philippe, Ghosh Dipansu, Whitaker Paul, Clifton Ian J., Wyatt Jeremy C. The impact of three discharge coding methods on the accuracy of diagnostic coding and hospital reimbursement for inpatient medical care. International Journal of Medical Informatics. 2018;115:35–42. doi: 10.1016/j.ijmedinf.2018.03.015.
- [5].Patel Rikinkumar S., Bachu Ramya, Adikey Archana, Malik Meryem, Shah Mansi. Factors related to physician burnout and its consequences: A review. 2018.
- [6].McRae Shelagh, Hamilton Robert. The burden of paperwork. Canadian family physician Medecin de famille canadien. 2006;52(5):586–588.
- [7].Lee Dennis, Lau Francis. Exploratory reverse mapping of ICD-10-CA to SNOMED CT. 2008;volume 410:01.
- [8].Stenzhorn Holger, Pacheco Edson José, Nohama Percy, Schulz Stefan. Studies in Health Technology and Informatics. volume 150. IOS Press; 2009. Automatic mapping of clinical documentation to SNOMED CT; pp. 228–232.
- [9].Thai health information standards development center (this) snomed ct download
- [10].Thai health information standards development center (this)
- [11].SNOMED - Home — SNOMED International
- [12].Giannangelo Kathy, Millar Jane. Studies in Health Technology and Informatics. volume 180. IOS Press; 2012. Mapping SNOMED CT to ICD-10; pp. 83–87.
- [13].Levenshtein V. I. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady. 1966;10:707.
- [14].GitHub - seatgeek/fuzzywuzzy: Fuzzy String Matching in Python
- [15].Jaccard Paul. The distribution of the flora in the alpine zone. New Phytologist. 1912;11(2):37–50.
- [16].Yamamoto Y., Yamaguchi A., Bono H., Takagi T. Allie: a database and a search service of abbreviations and long forms. Database: The Journal of Biological Databases and Curation. 2011;2011. doi: 10.1093/database/bar013.
- [17].Mullenbach James, Wiegreffe Sarah, Duke Jon, Sun Jimeng, Eisenstein Jacob. Explainable prediction of medical codes from clinical text; Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); New Orleans, Louisiana. Association for Computational Linguistics; June 2018. pp. 1101–1111.