Ophthalmic Research. 2023 May 11;66(1):928–939. doi: 10.1159/000530954

Ophthalmology Operation Note Encoding with Open-Source Machine Learning and Natural Language Processing

Yong Min Lee a,b, Stephen Bacchi a,b, Carmelo Macri a,b, Yiran Tan a,b, Robert J Casson a,b, Weng Onn Chan a,b
PMCID: PMC10308528  PMID: 37231984

Abstract

Introduction

Accurate assignment of procedural codes serves important medico-legal, academic, and economic purposes for healthcare providers. Procedural coding requires accurate documentation and exhaustive manual labour to interpret complex operation notes. Ophthalmology operation notes are highly specialised, making the process time-consuming and challenging to implement. This study aimed to develop natural language processing (NLP) models, trained by medical professionals, to assign procedural codes based on the surgical report. The automation and accuracy of these models can reduce the burden on healthcare providers and generate reimbursements that reflect the operation performed.

Methods

A retrospective analysis of ophthalmological operation notes from two metropolitan hospitals over a 12-month period was conducted. Procedural codes according to the Medicare Benefits Schedule (MBS) were applied. XGBoost, random forest, Bidirectional Encoder Representations from Transformers (BERT), and logistic regression models were developed for classification experiments. Experiments involved both multi-label and binary classification, and the best performing model was used on the holdout test dataset.

Results

There were 1,000 operation notes included in the study. Following manual review, the five most common procedures were cataract surgery (374 cases), vitrectomy (298 cases), laser therapy (149 cases), trabeculectomy (56 cases), and intravitreal injections (49 cases). Across the entire dataset, current coding was correct in 53.9% of cases. The XGBoost model had the highest classification accuracy (88.0%) in the multi-label classification of these five procedures. The total reimbursement achieved by the machine learning algorithm was $184,689.45 ($923.45 per case) compared with the gold standard of $214,527.50 ($1,072.64 per case).

Conclusion

Our study demonstrates accurate classification of ophthalmic operation notes into MBS coding categories with NLP technology. A combined human and machine-led approach would use NLP to screen operation notes and code procedures, with human review for further scrutiny. This technology can allow MBS codes to be assigned with greater accuracy. Further research and application in this area can facilitate accurate logging of unit activity, leading to appropriate reimbursements for healthcare providers. Increased accuracy of procedural coding can also play an important role in training and education, the study of disease epidemiology, and research aimed at optimising patient outcomes.

Keywords: Machine learning, Natural language processing, Operation note, Procedural coding, Electronic medical records

Introduction

Operation notes are medico-legal documents that serve as permanent clinical and administrative records. In Australia, each procedure can be encoded using a Medicare Benefits Schedule (MBS) item number, which guides reimbursement to the relevant department [1]. In the USA, Current Procedural Terminology (CPT®) codes are commonly used as procedural codes. Procedural codes are often manually entered by medical officers or administrators when booking patients for surgery. Current coding methods are susceptible to error, which can result in a discrepancy between the preoperative codes and the surgical activity that was performed. This is particularly evident in complex cases, which often involve multiple item numbers. Uncoded activity is a major concern in the public health sector as it inaccurately portrays departmental activity, leading to under- or overestimation of financial funding and resource distribution for the clinical unit.

Postoperatively, medical officers can accurately identify procedural codes by conducting a retrospective review of the operation notes. However, this process can be time-consuming and resource-intensive. We hypothesized that artificial intelligence could serve as a tool to detect mismatches between the preoperative code and completed operation notes.

The introduction of a variety of language datasets and the development of deep learning techniques have led to significant progress within the field of natural language processing [2]. Word representation methods have played a key role in this progress, enabling larger scale and more accurate analysis of human language. Early approaches, such as one-hot encoding of dictionary words, had major limitations in establishing relationships or similarities between vectors representing similar words. Vectors were unnecessarily large and high-dimensional, making it difficult to identify words and phrases with similar semantics [2]. Mikolov et al. [3] developed word embedding techniques that allowed phrases and words with similar meanings to be related through low-dimensional word representations. These methods were still challenged by larger scale text representation, leading to the development of the Bag-of-Words (BOW) method, which represents documents and groups of words as a high-dimensional feature vector. Although useful for spam filtering and document classification, BOW methods did not capture the meaning of a group of words effectively, leading to inaccuracies when analysing text [4].
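To make the BOW limitation concrete, the following minimal sketch (illustrative only, and not taken from this study; the example notes are hypothetical) shows that count-vectorised notes describing the same procedure in different words are scored as nearly unrelated:

```python
# Minimal sketch: bag-of-words vectors carry no notion of semantic
# similarity between distinct tokens. Example notes are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

notes = [
    "phacoemulsification of cataract and insertion of intraocular lens",
    "phaco and IOL insertion",  # same procedure, different tokens
    "pars plana vitrectomy with endolaser",
]
bow = CountVectorizer().fit_transform(notes)

# The first two notes share almost no tokens, so BOW treats them as
# nearly orthogonal despite describing the same operation.
print(cosine_similarity(bow[0], bow[1])[0, 0])  # low similarity
print(cosine_similarity(bow[0], bow[2])[0, 0])  # also low
```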

Deep learning is an important branch of artificial intelligence that can process large amounts of data and has diverse applications. Integration of deep learning into computer vision has enabled effective prediction of missing regions in incomplete images [5] and accurate reconstruction of low-resolution images [6]. In NLP, its progress has been marked by sentence-level text representation. Unsupervised approaches involve training artificial neural networks (ANNs) to identify the probability distribution of words over a large, unlabelled database, allowing for the statistical estimation of corresponding outputs. Supervised approaches were mostly reliant on recurrent neural networks (RNNs) or convolutional neural networks (CNNs) to process labelled word vector inputs through pre-trained multilayered networks [7]. The application of this technology in medicine demonstrated effective automated classification of medical data but still has limitations in the magnitude of processable text [2].

Pre-trained transformer models represent the latest development in natural language processing, with the capability of processing large amounts of text. These models are able to detect relationships between input and output variables, providing greater flexibility and efficiency than neural network models [8]. Bidirectional Encoder Representations from Transformers (BERT) is one form of pre-trained unsupervised transformer model that incorporates Masked Language Modelling (MLM), allowing it to predict missing words in a sentence, and Next Sentence Prediction (NSP), which assesses the sequential relationship between sentences [9]. Despite being trainable significantly faster than RNNs or CNNs, it still has limitations when processing long text sequences. Other state-of-the-art models for discriminative NLP tasks include logistic regression and the support vector machine (SVM), which label the output by assessing the conditional probability distribution of the input training data. These methods contrast with generative models, such as Naïve Bayes classifiers or hidden Markov models (HMMs), which aim to learn how the data were labelled and generated before producing an output label. The performance of these models varies depending on the task at hand. For example, Dhola and Saradva [10] achieved the best results in sentiment analysis with 85.4% accuracy using the BERT model, compared with 76.9% for the Naïve Bayes classifier. Hybrid models combining RoBERTa (an altered form of BERT) with recurrent neural networks demonstrated outstanding results in sentiment analysis of the IMDb, Twitter US Airline Sentiment, and Sentiment140 datasets, with F1 scores up to 93% [11].
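As a brief illustration of the MLM behaviour described above, the sketch below uses the publicly available bert-base-uncased checkpoint via the Hugging Face transformers library; this is an assumption made for illustration and is not the model configuration used in this study:

```python
# Fill-mask sketch: BERT's masked language modelling predicts a hidden
# token from bidirectional context. Requires: pip install transformers torch
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The patient underwent [MASK] surgery on the left eye."):
    # Each prediction carries the proposed token and its probability.
    print(f"{pred['token_str']!r}: {pred['score']:.3f}")
```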

While NLP technology has been used to process electronic medical records (EMR), it has yet to be explored for analysing operation notes using an open-source approach. A machine-led approach allows algorithms to label surgeries with procedural codes which can be reviewed by the surgeon. Other potential ways to facilitate a combined machine learning and human approach to coding procedures may include manual coding during admission and subsequently utilising machine learning to highlight potential errors, further increasing the accuracy of coding. The aims of this study were to develop and evaluate machine learning NLP models to aid with the coding of procedures based upon ophthalmology operation notes.

Materials and Methods

Participant Recruitment

Individuals included in this study were patients who underwent ophthalmic surgery in an operating theatre within the Central Adelaide Local Health Network, comprising the Royal Adelaide Hospital and The Queen Elizabeth Hospital, from March 1, 2020, until March 1, 2021. Completed operation notes written by the surgeons present during the procedure were identified from existing departmental registries and extracted. Operation notes that were incorrectly classified under ophthalmology were not included. After the identification of ophthalmology cases with MBS coding available, incomplete operation notes were excluded, and the first 1,000 cases were selected for analysis (see Fig. 1).

Fig. 1. Flowchart describing case selection.

Procedure Encoding and Reimbursement Allocation

Following case identification, procedures were encoded (see Fig. 2). All procedures had previously been encoded as per standard hospital procedures (generally performed by the booking medical officer or administrative staff). Manual review of these labels was undertaken by multiple investigators (Y.M.L., C.M.) employing the criteria in the MBS [1]. Discrepant codes were reviewed and resolved by W.O.C., and the resulting labels served as the ground truth for classification experiments. The allocated reimbursement for each of the procedures was also determined through review of the MBS.

Fig. 2. Flowchart describing analytical approach.

Machine Learning Analysis

Cases with missing or blank operation notes were excluded from the analysis during the case identification stage. For the BERT experiments, pre-processing was undertaken with the BERT library prior to analysis. Otherwise, pre-processing involved negation detection, stopword removal, word stemming, and punctuation removal. Negation detection was performed with the negation detection utility from the sentiment analysis module of the Natural Language Toolkit (NLTK) library [12]. This tool appends a negating suffix (“_NEG”) to any negated terms, which were then included in subsequent analyses in the same way as non-negated terms. The text then underwent count vectorisation (including n-grams 1–3 stems in length) and was transformed into a term frequency-inverse document frequency (TF-IDF) array. Data were then randomly split into a training dataset (80%) and a test dataset (20%).
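A minimal sketch of this pre-processing pipeline is shown below, assuming NLTK and scikit-learn; the exact tokenisation and filtering choices are assumptions for illustration rather than the authors' code, and operation_notes and code_labels are hypothetical variables:

```python
# Sketch of the described pipeline: negation tagging ("_NEG"), stopword
# and punctuation removal, stemming, 1-3-gram TF-IDF, and an 80/20 split.
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.util import mark_negation
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

nltk.download("punkt")
nltk.download("stopwords")
STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(note: str) -> str:
    # mark_negation suffixes "_NEG" onto tokens following a negation cue.
    tokens = mark_negation(nltk.word_tokenize(note.lower()))
    # Drop punctuation and stopwords, keep negated tokens, then stem.
    kept = [stemmer.stem(t) for t in tokens
            if (t.isalnum() or t.endswith("_NEG")) and t not in STOP]
    return " ".join(kept)

texts = [preprocess(n) for n in operation_notes]  # operation_notes: list[str]
vectoriser = TfidfVectorizer(ngram_range=(1, 3))  # n-grams of 1-3 stems
X = vectoriser.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, code_labels, test_size=0.2, random_state=0)  # 80/20 split
```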

Models were developed on the training dataset using 5-fold cross-validation for the classification of the five most commonly coded procedures. The models developed included logistic regression, random forest, XGBoost, and BERT algorithms. Hyperparameters were tuned on the training dataset, and the models were then tested on the holdout test dataset (primary outcome). The best performing model on the first task was the XGBoost model, which comprised 200 estimators, a maximum tree depth of 6, a minimum child weight of 1, a uniform sampling method, and a learning rate of 0.3. This model was then applied to the task of classifying all procedures that had five or more cases (secondary outcome).
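A hedged sketch of this training step follows, using the reported hyperparameters; wrapping one binary XGBoost classifier per MBS code in MultiOutputClassifier is an assumption about the multi-label setup, not a detail confirmed by the paper:

```python
# Multi-label training sketch with the reported XGBoost settings.
from sklearn.model_selection import cross_val_score
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from xgboost import XGBClassifier

# Turn per-case code lists, e.g. [["42702"], ["42725", "42740"]], into a
# binary indicator matrix with one column per MBS code.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)

base = XGBClassifier(
    n_estimators=200,            # 200 estimators
    max_depth=6,                 # maximum tree depth of 6
    min_child_weight=1,          # minimum child weight of 1
    learning_rate=0.3,           # learning rate of 0.3
    sampling_method="uniform",   # uniform sampling method
)
model = MultiOutputClassifier(base)

# 5-fold cross-validation on the training set; "accuracy" on a
# multi-label indicator matrix is the exact-match (subset) accuracy.
scores = cross_val_score(model, X_train, Y_train, cv=5, scoring="accuracy")
print(scores.mean())
model.fit(X_train, Y_train)
```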

The primary outcome was the classification accuracy in the categorisation of the five most commonly coded procedures in the holdout test dataset (as a multi-label classification task). As a secondary outcome, the best performing model for this task was then applied to the classification of all procedures for which there were five or more cases (multi-label classification task).
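The sketch below illustrates how these outcomes could be computed on the holdout set, reusing model and mlb from the previous sketch; the variable names are assumptions, and the per-code counts mirror the layout later reported in Table 3:

```python
# Holdout evaluation sketch: exact-match (subset) accuracy for the
# multi-label task, plus per-code TN/FP/FN/TP counts as in Table 3.
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix

Y_test = mlb.transform(y_test)
Y_pred = model.predict(X_test)
print("subset accuracy:", accuracy_score(Y_test, Y_pred))

for code, cm in zip(mlb.classes_, multilabel_confusion_matrix(Y_test, Y_pred)):
    tn, fp, fn, tp = cm.ravel()  # each 2x2 matrix is [[tn, fp], [fn, tp]]
    print(f"{code}: TN={tn} FP={fp} FN={fn} TP={tp}")
```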

Exploratory Financial Analysis

In the test dataset, the total reimbursement was calculated for the manually labelled codes, the coding that was claimed (as determined from the hospital theatre system, ORMIS), and the correctly labelled machine learning classifications. These values enabled the calculation of differences in reimbursement between the three approaches.
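A minimal sketch of this comparison follows; MBS_FEE is a hypothetical fee lookup with illustrative values only (real scheduled fees come from the MBS), and the three per-case code lists are assumed variables:

```python
# Exploratory financial comparison: total and per-case reimbursement
# under each coding approach. Fee values below are illustrative only.
MBS_FEE = {"42702": 774.45, "42725": 1175.05, "42809": 395.10}

def total_reimbursement(cases: list[set[str]]) -> float:
    """Sum scheduled fees over per-case sets of MBS codes."""
    return sum(MBS_FEE.get(code, 0.0) for codes in cases for code in codes)

for name, cases in [("manual labels", manual_codes),
                    ("claimed (ORMIS)", claimed_codes),
                    ("ML predictions", predicted_codes)]:
    total = total_reimbursement(cases)
    print(f"{name}: ${total:,.2f} (${total / len(cases):,.2f} per case)")
```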

Results

Patient Characteristics

The mean age of the cohort was 65.5 years (SD 16.0), and 482 patients were female (48.2%). The five most common procedures were “lens extraction and insertion of intraocular lens” (code 42702) (374 cases), “vitrectomy via pars plana sclerotomy” (code 42725) (298 cases), “retina, photocoagulation of” (code 42809) (149 cases), “glaucoma, filtering operation for” (code 42746) (56 cases), and “intravitreal injection” (code 42740) (49 cases). The list of all procedures that occurred in five or more instances is detailed in Table 1.

Table 1.

Procedures included in dataset, which occurred in 5 or more cases

Procedure code Procedure description Total number of occurrences in dataset
42702 Cataract surgery 374
42725 Vitrectomy 298
42809 Laser photocoagulation 149
42746 Trabeculectomy 56
42740 Intravitreal injection 49
45623 Ptosis repair 29
42653 Corneal transplantation 25
42776 Scleral buckling 24
31356 Malignant skin excision 22
42815 Silicone oil or liquid removal 21
42641 Auto-conjunctival transplant 17
42698 Lens extraction 16
42686 Removal of pterygium 15
45617 Blepharoplasty 14
72855 Frozen section examination (1 section) 13
42801 Radioactive plaque insertion 13
42503 Ophthalmological examination under anaesthesia 12
42623 Dacryocystorhinostomy 12
42738 Paracentesis of anterior chamber or vitreous cavity 11
42719 Removal of vitreous or capsular or lens material 10
42802 Radioactive plaque removal 10
42833 Squint operation 9
42551 Repair of penetrating wound or rupture of eye 8
42731 Lensectomy combined with vitrectomy 8
42701 Intraocular lens insertion 7
42704 Intraocular lens removal or repositioning 7
45451 Full thickness skin grafts 7
45626 Correction of ectropion or entropion 7
42509 Enucleation of eye 7
42710 Removal of intraocular lens and replacement with posterior chamber lens or scleral fixation 6
42533 Exploration of orbit with drainage or biopsy 6
42818 Cryotherapy 6
72856 Frozen section examination (2–4 sections) 5
42584 Repair of rupture extraocular muscle or medial palpebral ligament 5

Machine Learning Performance for Operation Note Coding

The XGBoost model achieved the highest classification accuracy (88.0%) in the multi-label classification of the five most common procedures (primary outcome) in the test dataset (of 200 individuals). The random forest and logistic regression models returned accuracies of 85.5% and 77.0%, respectively. With respect to the secondary outcome, when XGBoost was applied as a multi-label classification task across all procedure types with five or more cases, it achieved an accuracy of 75.5%. Examples of misclassifications are outlined in Table 2, while procedure-by-procedure classification performance is detailed in Table 3.

Table 2.

Examples of misclassifications

Procedure to be classified as present or absent Type of misclassification Operation note Gold-standard coding
42702 False positive Right IOL rotation to 1 degree
Prep drape
Mendez ring used to assess the current IOL axis position
Main incision from previous surgery at 35 degrees
Main incision reopened using keratome
Viscoelastic used to open the capsular bag, no issues
IOL rotated to a new position at 1 degree
IOL axis confirmed with Mendez ring
Aspiration irrigation of viscoelastic
Incision hydrated
Cefazolin 1 mg/0.1 mL into the AC.
Incision stable
Chlorsig and maxidex to the right eye
Pad and shield
42704
42725 False positive LE Vitritis? PCNSL – LE Phaco/IOL/25GPPV/Vitreous Biopsy
LE Routine Phaco + IOL (small pupil)
3 × 25 G ports in vitreous biopsy
PVD present, checked with triamcinolone
Int search, ports out SC Cef and Dex
Pad
42702
42738
42740
42809 False positive Right Cataract and Diabetic VH-RE Phaco/IOL/PPV/Laser/Avastin
Eye cleaned and draped
Routine phaco + IOL
3 × 25 G ports in
PPV. pvd present
Internal search
Fill in PRP
Partial fax
IVI Avastin
Ports out. SC Cef and Dex
Pad
Intraop findings: VH, no areas of traction or obvious NVD/NVE
42702
42725
42740
42809
42740 False positive R phaco + IOL with Iris hooks without complications
Cefazolin IC
0.05 kenacort intravitreal
42702
42740
42746 False positive Nil N/A
42702 False negative LE cataract + SiO filled eye post RD + CMO – LEPhaco/IOL/Removal of SiO/IVTA
Eye prepped with Betadine and draped
2 corneal incisions made, synechiolysis with cannula and viscoelastic
Anterior capsule stained with brilliant blue, 5x iris hooks
Capsulorrhexis, hydrodissection
Phacoemulsification of nucleus, removal of cortex with IA probe
IOL into bag
3 × 25g ports in, removal of SiO, multiple FAX, IVTA 0.1 mL
Removal of iris hooks, removal of Viscoelastic, sclerotomy closed with Vicyl 7/0, IC Cef
Hydration of self-sealing Corneal wounds, SC Cef and Dex
Eye cleaned, pad and shield
42702
42725
42725 False negative Eye prep with Betadine and draped
360 conjunctival peritomy, recti muscles isolated and slinged
Scleral buckle applied and sutured to sclera with Nylon 5/0
Corneal sections, 4 x iris hooks, phacoemulsification, no lens injected
3 × 25 G 6 mm ports inserted
Funnel retinal detachment, with giant retinal tear temporally
Vitrectomy with base dissection with the assistance of triamcinolone stain, membranes stained with membrane blue and peeled
360 retinectomy done close to the arcade vessels after 360 endodiathermy
Dislocation of macula, tried repositioning with tano brush, PFCL to flatten the retina, FAX, Endolaser 360, Reformation of AC with BSS, iris hooks removed
Silicone oil 5500 cst injected, sclerotomy closed with Vicryl 7/0
Scleral buckle removed
Conj closed with Vicryl 7/0
Subtenon ropivacaine 0.5% 5 mLs
RE examined with BIO with indentation, 2 areas of retinal tear, lasered from before. no new breaks
42776
42698
42725
42740
42809
42809 False negative Left PDR with VH and TRD-LE PPV/Laser/Segmentation/Avastin/SiO 1,300 cst
Eye prep with Betadine and draped
3 × 27 G ports inserted, PVD not present
Vitrectomy, segmentation of tractional membranes
Intraop findings: solitary tractional membrane nasal to disc, broad tractional membrane superiorly with massive exudations subretinally extending to 1/2 disc diameter from fovea. No breaks
PRP laser
Fluid-air exchange, intravitreal avastin, Silicone Oil 1300 cst injected
Ports out, no leak from sclerotomy
Subconj Cef and Dex
Eye cleaned, pad and shield applied
42725
42809
42740 False negative R eye 25g PPV + laser +air for VH secondary to PDR
3 ports in PPV + PVD checked with triamcinolon
Tag on superior arcade and nasally to a fibrovascular old pannus not bleeding
360 search laser top up inferiorly and temp
FAX
AVASTIN ic
3 ports out/sealed
Cef and Dex subconj
42725
42740
42746 False negative RIGHT trabeculectomy with mitomycin C 0.02%
Betadine prep, drape, and speculum (Ong)
Superior peritomy, conjunctiva and tenon's capsule undermined
MMC soaked sponge applied for 2 min
Saline irrigation to area
Cautery to scleral vessels
4 × 3 mm half thickness square scleral flap
Paracentesis
Trabeculectomy with Kelly punch.
Peripheral iridectomy with de Wecker scissors
10-0 nylon to flap
BSS to reform AC – flap filtering.
10-0 nylon to close conjunctiva – wound secure
Subconj dexamethasone
Atropine
Pad and shield
42746

Table 3.

Procedure-wise classification performance when the XGBoost model was applied to the classification of the test set for all procedures that occurred on five or more occasions

Procedure True negative False positive False negative True positive
42702 125 1 1 73
42725 141 6 2 51
42809 169 4 6 21
42746 183 0 3 14
42740 189 3 3 5
45623 196 0 1 3
42653 193 0 1 6
42776 196 0 1 3
31356 197 0 0 3
42815 199 0 0 1
42641 196 0 0 4
42698 195 0 4 1
42686 196 0 0 4
45617 198 0 2 0
72855 197 0 1 2
42801 196 0 2 2
42503 200 0 0 0
42623 194 0 1 5
42738 196 0 4 0
42719 200 0 0 0
42802 197 0 2 1
42833 197 0 1 2
42551 198 0 1 1
42731 199 0 1 0
42701 197 1 2 0
42704 199 0 1 0
45451 198 0 2 0
45626 199 0 0 1
42509 198 0 1 1
42710 198 0 1 1
42533 199 0 1 0
42818 199 0 1 0
72856 200 0 0 0
42584 198 1 0 1

The MBS coding that was entered into the hospital theatre system (ORMIS), reflecting the claims made by clinicians and administrative teams, was examined for the entire dataset. Among 1,000 operation notes, only 539 (53.9%) cases were coded completely correctly.

Financial Analysis of Machine Learning Application to Operation Note Coding

In the test dataset (of 200 individuals), the total MBS reimbursement for the procedures, calculated using the manual labels, was $214,527.50 ($1,072.64 per case). In contrast, the MBS reimbursement based on the previously entered coding was $199,498.30 ($997.49 per case) which includes incorrect coding that could falsely increase the reimbursement. The total MBS reimbursement for procedures accurately labelled by machine learning was $184,689.45 ($923.45 per case).

Discussion

The documentation of operation notes as electronic records opens new possibilities for research with the use of artificial intelligence. Application of NLP and machine learning can assist clinicians and coders in categorising operations into their dedicated coding category. The current coding system has inherent bias and error due to its reliance on human input and interpretation. The XGBoost model was the most accurate model, with 88.0% classification accuracy for the five most common ophthalmology procedures: cataract surgery, vitrectomy, laser coagulation, trabeculectomy, and intravitreal injections. A cost analysis of the generated coding demonstrated that the algorithm was able to generate $923.45 per case, compared with $1,072.64 per case when the process was undertaken manually. Our findings suggest that incorporating NLP and machine learning technology into clinical coding has the potential to improve accuracy, save time, and generate appropriate reimbursements, creating a high-quality database in healthcare.

NLP technology has been applied in diverse ways within the field of ophthalmology, including the extraction of microbial keratitis measurements [13], the identification of open globe injury [14], and early detection of multiple sclerosis [15]. It has also been used to develop predictive models of cataract surgery complications by associating risk factors that were described in the EMR [16], as well as to triage ophthalmology outpatient referrals [17]. NLP has also been applied to aspects of clinical coding [18, 19], although this has thus far been limited to proprietary software or for detection of pathology through diagnostic codes. There has been limited research regarding open-source mechanisms for performing this task.

Recent studies have demonstrated the utility of NLP in extracting specific information from surgical notes. Wyles et al. [20] developed algorithms to identify three common data elements in total hip arthroplasty: operative approach, fixation method, and bearing surface. A separate algorithm was devised for each variable. The training datasets comprised 250, 467, and 300 notes, while the test datasets included 250, 291, and 284 notes for the operative approach, fixation method, and bearing surface, respectively. The algorithms achieved accuracies of 99.2%, 90.7%, and 95.8%, respectively. The system underwent external validation with 422 operative notes from other hospital systems, of which 242 were used for refinement of the existing algorithms. The final performance was measured on the remaining 180 notes, achieving accuracies of 94.4%, 95.6%, and 98% for identifying the operative approach, fixation method, and bearing surface. Liu et al. [21] used NLP to ascertain key variables, such as intracameral antibiotic injections and posterior capsular rupture (PCR), in cataract surgeries. The NLP tool achieved positive and negative predictive values exceeding 99% for operation notes involving intracameral antibiotic injections and greater than 94% for notes involving PCR, demonstrating the feasibility of NLP in detecting key features within operation notes.

Our study demonstrated that only 53.9% of MBS codes were entered correctly, most of which were uncomplicated single-coded procedures such as isolated cataract surgery or trabeculectomy. Coding of more complex procedures, such as vitrectomy, was challenging due to the difficulty of multi-labelling and interpreting complex ophthalmology operation notes. These factors contribute to inaccuracies and misclassifications in coding, which can lead to decreased funding allocation and underestimation of unit activity. Our data revealed that the average reimbursement was $997.49 per case, compared with the gold standard of $1,072.64 per case. This discrepancy is largely due to under-coded activity, and the figure already accounts for incorrectly applied codes that inflate funding. Accurate unit activity is essential for government funding policies, distribution of workforce and funding, and understanding disease epidemiology, as well as for research and training purposes.

A systematic review by Nouraei et al. [22] of 30,127 patients across multiple surgical disciplines found that 51% of patients required at least one change to their original coding after audit review. In 12% of cases, the recorded procedures required a change, and 17% of cases had their reimbursement category changed, amounting to a financial difference of £3,974,544. The highest proportion of changes in this study was seen in ophthalmology, in which 22% of pre-audit coding required a change post-audit. Clinical coding of oculoplastic procedures has been shown to have an accuracy of 30.7%, with errors attributed to clinician factors including a lack of awareness of coding issues, lack of training and minimal exposure to clinical coding, diagnostic uncertainty, and illegible handwriting. Coder factors that contribute to inaccurate coding include dependence on accurate documentation and difficulty in interpreting individual abbreviations and speciality-specific terminology [23].

Despite more hospitals emphasising the importance of clinical coding and delivering educational sessions, clinical coding remains an issue in tertiary hospitals. Although there is evidence to suggest that medical team input can generate more accurate clinical coding, the reality is that medical teams are overburdened, with long wait times for patients [24], and additional coding duties negatively impact physician productivity. In addition, there are greater incentives, both professional and moral, to prioritise delivering patient care, and clinicians may be unwilling to assume data entry roles unless there are immediate returns. Individual factors, such as limited technological skills and a lack of interest in academia or data quality, are barriers to clinicians producing accurate coding. Medical documentation also plays a significant role in clinician-to-clinician handover of patients, leading to specialised notes and complex clinical descriptions that are difficult for coders to categorise.

Previous research investigating the use of machine learning in medicine has shown its diverse potential to automate and improve healthcare. Research by Mahendra et al. [25] investigated the use of random forest and neural network models to predict inpatient mortality in the intensive care unit (ICU) based on medical documentation. This research highlights the issue of creating machine learning algorithms from limited institutional data, as this impacts the external generalisability of the model and its performance in analysing notes from other ICU departments [25]. After selection of the training dataset, standardisation and labelling of data can vary between individuals and require specialised medical knowledge, which can limit the development of accurate algorithms. Lu et al. [26] also highlighted the effectiveness of using language models to extract key information from medical documentation but criticised the inadequate semantic interpretation of the extracted data. Although its applications vary, the cross-sectional nature of training data can limit its use in predicting the entire medical journey of each individual. Following the development of algorithms, the cost of implementation and integration into hospital documentation systems is a major barrier to utilising artificial intelligence to optimise healthcare. The universal application of successful algorithms is essential for positive outcomes in healthcare, even when experimental results are promising [27].

Integration of artificial intelligence to generate clinical coding is an area of interest among current researchers. Various studies have been conducted, ranging across different architectures within RNNs, CNNs, and machine learning on various datasets, with the main objective of improving coding accuracy while reducing the required time. Zhou et al. [28] experimented with regular expression (regexp) techniques on discharge summaries to generate codes in accordance with the International Classification of Disease 10th revision (ICD-10). Although their recall rates were between 23.67% and 27.90% and overall accuracy was approximately 41.19%, the time taken to complete the task was 2.1–2.4 s versus 213.3–272.2 s of manual coding for every ten discharge summaries. Other studies, summarised by Teng et al. [29], used CNN and autoencoder techniques to code according to the Swiss Operation Classification System (CHOP), another complex coding system consisting of 18 categories and over 14,000 codes; a CNN with embedding techniques recorded the highest F1 score of 60.86%. Deep learning models trained on publicly available datasets, such as MIMIC-III, recorded variable levels of success in applying ICD-10 codes [30–33], although the results did not validate AI use for persistent performance. BERT models have been successful in predicting clinical codes, with variant models trained on medical content (BlueBERT) achieving the greatest AUCs of 89.4–92.0 [34].

There are multiple components that complicate the automation of clinical coding and need to be addressed for future research to succeed. The initial selection of data to train algorithms is often inefficient, with limited availability of gold-standard coded hospital data [35]. Algorithms trained on publicly accessible datasets (such as MIMIC-III) are prone to error, which limits the success of the trained algorithm. Furthermore, clinical documentation that is selected for training purposes varies in structure between clinicians, can be incomplete, and includes personalised notations that create semantic ambiguities challenging the architecture that is employed [28]. Complex and dynamic classification systems, such as the ICD-10, often cause difficulties for algorithms, and even for professionally trained medical coders. Approximately 70,000 codes for ICD-10 and 1.6 million diagnostic codes for ICD-11 can be applied in multiple ways to describe a clinical situation [36]. There are also limitations to the architecture employed: BERT can only process up to 512 tokens, while medical discharge summaries average approximately 1,500 tokens, which limits its effectiveness in clinical coding [35]. Implementation in practice is another major limitation, as it involves significant deployment costs, novel interactions between coders and AI-based systems, and the risk of negligence due to overreliance on computational coding, which may lead to errors and omissions.
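The 512-token constraint can be seen directly with a BERT tokeniser; the sketch below assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint (not the exact setup of any study discussed here), with long_note as a hypothetical discharge summary string:

```python
# Sketch: BERT-style tokenisers hard-cap sequences at 512 tokens, so
# anything beyond that is silently discarded when truncation is enabled.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tok(long_note, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # <= 512, regardless of the note's length
```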

Our research investigated the use of NLP and machine learning algorithms in the interpretation of operation notes to assign procedural MBS codes. Operation notes are often abbreviated and challenging to decipher. Clinicians can analyse these notes, and their brevity compared with longer discharge summaries makes them a suitable dataset for the successful application of NLP and machine learning. The success of our NLP model training was directly correlated with the frequency of each procedure in the dataset. When the model was applied to all procedures that occurred five or more times, the accuracy decreased to 75.5%. This discrepancy could be accounted for by the complexity of the NLP task, in which the model struggled to differentiate between less frequently occurring procedures.

With many clinicians unaware of the significance of uncoded unit activity, the reality is that current clinical coding has major room for improvement. This is where we suggest the utility of machine learning and NLP to strengthen the accuracy of clinical coding. Although the complete removal of error from clinical coding will be difficult due to diagnostic uncertainty and variation in the interpretation of clinical context, implementation of a combined human and machine learning approach could further improve the portrayal of unit activity. The reduction of under-coded activity allows for more accurate resource and finance distribution to the relevant department. This additional funding can allow departments to equip themselves with higher quality technology, staffing, and education to ultimately improve the delivery of patient care.

Conclusion

NLP has been successful in classifying ophthalmic operation notes into MBS coding categories. The use of open-source mechanisms for performing this task has received limited research attention. Our XGBoost model demonstrated an accuracy of 88.0% for the top five procedures: cataract surgery, vitrectomy, laser coagulation, trabeculectomy, and intravitreal injections. Our results demonstrate the potential for surgeons and coders to rely on NLP technology to assign the correct billing codes. We propose applying this algorithm to completed operation notes to prompt the generation of MBS codes for surgeons and the coding team to subsequently review. This algorithm could also be applied to operations processed by coding teams to highlight potentially missed codes caused by a lack of understanding of speciality-specific terminology and procedures. Combining human and machine-led approaches can expedite procedural coding with greater accuracy, leading to more accurate logging of unit activity for funding and research purposes. Surgeons will be able to prioritise patient care and invest time in more detailed operation notes, while further funding for more advanced technology and staffing will also optimise patient care.

A limitation of the study that impacts the accuracy of the algorithm is the inclusion of operation notes from only two sites. Variation in individual surgeons’ operation notes, as well as institutional report structures, can affect the structure of the EMR documentation, potentially leading to discrepancies in the algorithm’s output. The use of variable abbreviations and misspellings may also impact the algorithm’s accuracy, as the success of NLP is heavily dependent on accurate operation notes. Surgeon preferences may also lead to discrepancies in coding, as in cases where “limbal or pars plana lensectomy combined with vitrectomy” (code 42731) may be coded as two separate operations: “lens extraction and insertion of intraocular lens” (code 42702) and “vitrectomy via pars plana sclerotomy” (code 42725). Our study did not consider the specific surgeon who completed each operation note. There are also limitations to the deployment of this algorithm into clinical practice, including costs and feasibility, such as the training of staff members and ensuring the necessary infrastructure to support its use. The NLP model produced is specific to the English language within the MBS coding system.

Further advancement of this technology would involve studies that focus on training and coding of rarer procedures and expansion of the dataset to include multiple ophthalmology departments across Australia. This would increase the generalisability of the model, possibly at the expense of accuracy. Ongoing algorithm refinement and external validation will be essential to improve the current model and create an automated, streamlined process. Future studies could also consider input data from individual surgeons to create a user-specific NLP algorithm that adapts to the language and reporting style of the surgeon. International institutions will require local expertise to implement NLP appropriate to the coding services provided in their preferred language. Expanding the application of this technology to other medical specialities could lead to a more automated and streamlined coding process for healthcare providers.

Statement of Ethics

The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee(s) and with the Helsinki Declaration (as revised in 2013). This study protocol was reviewed and approved by the committee of Central Adelaide Local Health Network (CALHN) Research Services and Royal Adelaide Hospital, approval number 14372. Consent is not required for this study in accordance with local or national guidelines.

Conflict of Interest Statement

The authors have no conflicts of interest to declare.

Funding Sources

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author Contributions

For this paper, the main contributions are as follows: (1) Yong Min Lee was involved in data collection, analysis, and manuscript production; (2) Stephen Bacchi was involved in data analysis and processing and manuscript production; (3) Carmelo Macri was involved in data collection and manuscript production; (4) Yiran Tan and Robert J. Casson were involved in project supervision and manuscript production; (5) Weng Onn Chan was involved in data collection, project supervision, and manuscript production.


Data Availability Statement

Data for this project are secured on a hospital network that requires authorised access to maintain confidentiality of all patients involved in the study. Data are not publicly available due to ethical reasons. Further enquiries can be directed to the corresponding author.

References

  • 1. MBS Online: Medicare Benefits Schedule. Australian Government Department of Health. Available from: http://www9.health.gov.au/mbs/search.cfm.
  • 2. Lauriola I, Lavelli A, Aiolli F. An introduction to deep learning in Natural Language processing: models, techniques, and tools. Neurocomputing. 2022;470:443–56. 10.1016/j.neucom.2021.05.103. [DOI] [Google Scholar]
  • 3. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their Compositionality. 2013 Oct 1. [arXiv:1310.4546 p.]. Available from: https://ui.adsabs.harvard.edu/abs/2013arXiv1310.4546M. [Google Scholar]
  • 4. Cormack G, Gomez Hidalgo J, Sanz E. Spam filtering for short messages. 2007. p. 313–20.
  • 5. Chen Y, Xia R, Zou K, Yang K. FFTI: image inpainting algorithm via features fusion and two-steps inpainting. J Vis Commun Image Representation. 2023;91:103776. 10.1016/j.jvcir.2023.103776. [DOI] [Google Scholar]
  • 6. Chen Y, Liu L, Phonevilay V, Gu K, Xia R, Xie J, et al. Image super-resolution reconstruction based on feature map attention mechanism. Appl Intell. 2021;51(7):4367–80. 10.1007/s10489-020-02116-1. [DOI] [Google Scholar]
  • 7. Banerjee I, Ling Y, Chen MC, Hasan SA, Langlotz CP, Moradzadeh N, et al. Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification. Artif Intell Med. 2019;97:79–88. 10.1016/j.artmed.2018.11.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Tanaka H, Shinnou H, Cao R, Bai J, Ma W. Document classification by word embeddings of BERT. Comput Lings. Singapore: Springer; 2020. [Google Scholar]
  • 9. Khurana D, Koli A, Khatter K, Singh S. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl. 2023;82(3):3713–44. 10.1007/s11042-022-13428-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Dhola K, Saradva M, editors. A comparative evaluation of traditional machine learning and deep learning classification techniques for sentiment analysis. 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence); 2021 Jan 28–29. [Google Scholar]
  • 11. Tan KL, Lee CP, Anbananthen KSM, Lim KM. RoBERTa-LSTM: a hybrid model for sentiment analysis with transformer and recurrent neural network. IEEE Access. 2022;10:21517–25. 10.1109/access.2022.3152828. [DOI] [Google Scholar]
  • 12. Bird S, Klein E, Loper E. Natural Language Processing with Python. O’Reilly Media Inc.; 2009. [Google Scholar]
  • 13. Maganti N, Tan H, Niziol LM, Amin S, Hou A, Singh K, et al. Natural Language processing to quantify microbial keratitis measurements. Ophthalmology. 2019;126(12):1722–4. 10.1016/j.ophtha.2019.06.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Apostolova E, White HA, Morris PA, Eliason DA, Velez T. Open globe injury patient identification in warfare clinical notes. AMIA Annu Symp Proc. 2017;2017:403–10. [PMC free article] [PubMed] [Google Scholar]
  • 15. Chase HS, Mitrani LR, Lu GG, Fulgieri DJ. Early recognition of multiple sclerosis using natural language processing of the electronic health record. BMC Med Inform Decis Mak. 2017;17(1):24. 10.1186/s12911-017-0418-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Gaskin GL, Pershing S, Cole TS, Shah NH. Predictive modeling of risk factors and complications of cataract surgery. Eur J Ophthalmol. 2016;26(4):328–37. 10.5301/ejo.5000706. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Tan Y, Bacchi S, Casson RJ, Selva D, Chan W. Triaging ophthalmology outpatient referrals with machine learning: a pilot study. Clin Exp Ophthalmol. 2020;48(2):169–73. 10.1111/ceo.13666. [DOI] [PubMed] [Google Scholar]
  • 18. Wadia R, Akgun K, Brandt C, Fenton BT, Levin W, Marple AH, et al. Comparison of Natural Language processing and manual coding for the identification of cross-sectional imaging reports suspicious for lung cancer. JCO Clin Cancer Inform. 2018;2:1–7. 10.1200/CCI.17.00069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Banerji A, Lai KH, Li Y, Saff RR, Camargo CA Jr, Blumenthal KG, et al. Natural Language processing combined with ICD-9-CM codes as a novel method to study the epidemiology of allergic drug reactions. J Allergy Clin Immunol Pract. 2020;8(3):1032–8.e1. 10.1016/j.jaip.2019.12.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Wyles CC, Tibbo ME, Fu S, Wang Y, Sohn S, Kremers WK, et al. Use of Natural Language processing algorithms to identify common data elements in operative notes for total hip arthroplasty. J Bone Joint Surg Am. 2019;101(21):1931–8. 10.2106/JBJS.19.00071. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Liu L, Shorstein NH, Amsden LB, Herrinton LJ. Natural language processing to ascertain two key variables from operative reports in ophthalmology. Pharmacoepidemiol Drug Saf. 2017;26(4):378–85. 10.1002/pds.4149. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Nouraei SA, Hudovsky A, Frampton AE, Mufti U, White NB, Wathen CG, et al. A study of clinical coding accuracy in surgery: implications for the use of administrative big data for outcomes management. Ann Surg. 2015;261(6):1096–107. 10.1097/SLA.0000000000000851. [DOI] [PubMed] [Google Scholar]
  • 23. Juniat V, Athwal S, Khandwala M. Clinical coding and data quality in oculoplastic procedures. Eye. 2019;33(11):1733–40. 10.1038/s41433-019-0475-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Mahbubani K, Georgiades F, Goh EL, Chidambaram S, Sivakumaran P, Rawson T, et al. Clinician-directed improvement in the accuracy of hospital clinical coding. Future Healthc J. 2018;5(1):47–51. 10.7861/futurehosp.5-1-47. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Mahendra M, Luo Y, Mills H, Schenk G, Butte AJ, Dudley RA. Impact of different approaches to preparing notes for analysis with Natural Language processing on the performance of prediction models in intensive care. Crit Care Explor. 2021;3(6):e0450. 10.1097/CCE.0000000000000450. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Lu Z, Sim JA, Wang JX, Forrest CB, Krull KR, Srivastava D, et al. Natural Language processing and machine learning methods to characterize unstructured patient-reported outcomes: validation study. J Med Internet Res. 2021;23(11):e26777. 10.2196/26777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Zhang A, Xing L, Zou J, Wu JC. Shifting machine learning for healthcare from development to deployment and from models to data. Nat Biomed Eng. 2022;6(12):1330–45. 10.1038/s41551-022-00898-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Zhou L, Cheng C, Ou D, Huang H. Construction of a semi-automatic ICD-10 coding system. BMC Med Inform Decis Mak. 2020;20(1):67. 10.1186/s12911-020-1085-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Teng F, Liu Y, Li T, Zhang Y, Li S, Zhao Y. A review on deep neural networks for ICD coding. IEEE Trans Knowl Data Eng. 2022:1. 10.1109/tkde.2022.3148267. [DOI] [Google Scholar]
  • 30. Xie X, Xiong Y, Yu PS, Zhu Y. EHR coding with multi-scale feature attention and structured knowledge graph propagation. Proceedings of the 28th ACM International Conference on information and knowledge management. Beijing, China: Association for Computing Machinery; 2019. p. 649–58. [Google Scholar]
  • 31. Huang J, Osorio C, Wicent Sy L. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical Notes. 2018 Feb 1. [arXiv:1802.02311 p.]. Available from: https://ui.adsabs.harvard.edu/abs/2018arXiv180202311H. [DOI] [PubMed] [Google Scholar]
  • 32. Li F, Yu H. ICD coding from clinical text using multi-filter residual convolutional neural Network. 2019 Nov 1. [arXiv:1912.00862 p.]. Available from: https://ui.adsabs.harvard.edu/abs/2019arXiv191200862L. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Yogarajan V, Montiel J, Smith T, Pfahringer B. Seeing the whole patient: using multi-label medical text classification techniques to enhance predictions of medical Codes. 2020 Mar 1. [arXiv:2004.00430 p.]. Available from: https://ui.adsabs.harvard.edu/abs/2020arXiv200400430Y. [Google Scholar]
  • 34. Ji S, Holtta M, Marttinen P. Does the magic of BERT apply to medical code assignment? A quantitative study. Comput Biol Med. 2021;139:104998. 10.1016/j.compbiomed.2021.104998. [DOI] [PubMed] [Google Scholar]
  • 35. Dong H, Falis M, Whiteley W, Alex B, Matterson J, Ji S, et al. Automated clinical coding: what, why, and where we are? NPJ Digit Med. 2022;5(1):159. 10.1038/s41746-022-00705-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. ICD-11: International classification of disease 11th revision. World Health Organisation; 2023. Available from: https://icd.who.int/en. [Google Scholar]


