Abstract
The clinical competency of residents at teaching hospitals is always under scrutiny. Ideally, assessment should reflect competency on the job, under realistic circumstances, and include an evaluation of residents’ medical reports. Currently, attending physicians perform this assessment manually, which adds to their cognitive load. In this study, we developed an automated system for assessing medical residents’ pathology reports. Our system used natural language processing (NLP) techniques to compute lexical and semantic similarity scores at the sentence level as well as the chunk level. We then used supervised learning to classify the reports into three categories: No Change (NC), Minor Changes (MiC), and Major Changes (MaC), reflecting how much the attending physician’s report differs from that of the resident. Our system classified the reports with an accuracy of 73.6%. Although only moderately successful, our work shows the potential of automated assessment systems in the biomedical domain.
1. Introduction and Background
One of the primary goals of medical education is to train medical professionals who can provide better care for patients. With the accelerating pace of scientific discoveries and advancements, new medical treatments seem to be emerging far more rapidly than in previous decades. Such fast-paced progress often introduces complexities in clinical decision making, underscoring the need for continuous training and evaluation of residents to produce better-prepared medical professionals.
Evaluation in medical residency programs is competency-based: each medical resident is evaluated on six core competencies defined by the American Board of Medical Specialties (ABMS) and the Accreditation Council for Graduate Medical Education (ACGME). Milestones track the learning curve of the residents as well as significant developmental points1 and are reported at regular intervals. These milestones are based on on-the-job assessment as well as periodic medical examinations, which assess the resident’s knowledge and skills in the six areas.
One of the major on-the-job assessment tasks is the evaluation of patient reports documented by residents, which can range from diagnostic reports to discharge summaries and operative reports. Attending physicians have to review these reports manually for two reasons. First, the review helps to assess the knowledge and competency of the residents. Second, and most important, the reports are part of the patient’s medical record, which is used for billing, legal purposes, and effective patient care.
A large portion of the research focuses either on improving the evaluation process and metrics of medical education programs with quality improvement strategies7,8 or on creating and improving reporting formats, including effectiveness measurements2–6. Even though the field is shifting towards e-learning and electronic systems, automated assessment of free-text reports in medicine remains rare. Most current methods for evaluating these textual reports involve manual grading9–12. Manual assessment adds to the existing cognitive load of the attending physician and can reduce the quality of care. A better solution would be to automate this assessment process.
The concept of automated assessment in general is not new. It has been used in one form or another since Page’s 1966 proposal13, and a plethora of automated essay grading systems exist today14–17. While the primary focus of these systems is content assessment, their application to medical education is scarce. Zhang et al.20 created a system for the automatic identification of sentiments, opinions, and topics from evaluation review comments about medical trainees.
Chen et al.19 and Joshua et al.18 developed systems to automatically assess residents’ competency in the clinical domain by identifying mentions of UMLS Concept Unique Identifiers (CUIs) using natural language processing (NLP) techniques and comparing the extracted codes to manually annotated gold standard clinical notes. Using these as features, they applied supervised learning to grade the relevance scores of resident clinical notes. These systems have two major limitations. First, they do not consider semantic similarities between concepts and notes. For example, “acute prostatitis” and “prostatic tissue with acute inflammation” are semantically similar, yet these systems treat them as unrelated. Second, they look for an exact match of the concepts present in UMLS. Exact matching of CUIs is unreliable because writing style differs from person to person. For example, if we were to look for the CUI condition “Left shoulder injury” and the report says “Injury in left shoulder”, there will not be a lexical match even though they represent the same concept.
The main goal of this paper was to perform a preliminary assessment on applications of NLP in automated evaluation of medical reports. Our major contributions in this paper are:
- We proposed a novel system for automated assessment of resident reports by measuring and comparing lexical and semantic similarities with physicians’ reports.
- We proposed the use of chunks along with sentences for assessing the reports.
- We developed a supervised classification system that classifies the difference between resident reports and physician reports into three classes: No Change (NC), Minor Change (MiC), and Major Change (MaC).
2. Methods and Materials
Figure 1 shows the architecture of our automated assessment system. The system has three phases: a preprocessing phase, a feature extraction phase, and a supervised learning phase. Each phase is explained in the rest of this section.
Figure 1:
Proposed system for automated report assessment.
2.1. Data Collection
For this study, we used diagnostic reports created by six medical residents, along with the corresponding attending physicians’ reports, from the pathology department at the Hospital of the University of Pennsylvania. Although there are only six medical residents, we collected multiple reports created by each of them.
Table 1 shows examples of reports. Each report has a medical test and the comment and analysis for that test. We noticed that the medical test was the same in both reports and hence excluded it from our analysis.
Table 1:
Examples of sample reports
| No | Resident report | attending physician report |
|---|---|---|
| 1 | Right breast mass, ultrasound guided core biopsy: Adenosis tumor; see note. | Right breast mass, ultrasound guided core biopsy: Fibroadenoma. Pending reticulin |
| 2 | Left breast, core biopsy: Fibroadenoma; see note. Sclerosing adenosis and florid duct hyperplasia, usual type. | Left breast, core biopsy: Fibroadenoma. |
2.2. Preprocessing Module
Our preprocessing module involves three steps: data cleaning, sentence segmentation, and chunking.
Data cleaning: This step involves lower case conversion, tokenization, and removal of stop words.
Sentence segmentation: Once the report is cleaned and tokenized, the next step splits the document into sentences. Part-of-speech (POS) tagging is then applied to each sentence using Stanford Parser28.
Chunking: Chunking involves extracting phrases from a given sentence. We developed a simple regular-expression-based chunking system that splits sentences at tokens carrying one of two POS tags: conjunctions and prepositions.
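The chunking step can be illustrated with a minimal sketch. It assumes Penn Treebank tags (`CC` for conjunctions, `IN` for prepositions), a `token/TAG` text encoding, and that boundary tokens are dropped from the chunks. None of these details are stated in the paper, and the function name `chunk_by_pos` is hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: a token tagged CC (conjunction) or IN (preposition) marks a
# chunk boundary and is not kept in either neighbouring chunk.
BOUNDARY = re.compile(r"\S+/(?:CC|IN)(?=\s|$)")

def chunk_by_pos(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Split a POS-tagged sentence into chunks at conjunctions/prepositions."""
    text = " ".join(f"{token}/{tag}" for token, tag in tagged)
    chunks = []
    for piece in BOUNDARY.split(text):
        # Strip the "/TAG" suffix again to recover the plain tokens.
        tokens = [item.rsplit("/", 1)[0] for item in piece.split()]
        if tokens:
            chunks.append(tokens)
    return chunks

tagged_sentence = [
    ("benign", "JJ"), ("fatty", "JJ"), ("breast", "NN"), ("tissue", "NN"),
    ("with", "IN"),
    ("mild", "JJ"), ("chronic", "JJ"), ("inflammation", "NN"),
]
chunks = chunk_by_pos(tagged_sentence)
print(chunks)
```

On this input the sketch yields the two chunks "benign fatty breast tissue" and "mild chronic inflammation".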
2.3. Feature Extraction Module
We compute five different similarity scores on the sentences and chunks from the preprocessing module: three scores for lexical similarity and two for semantic similarity. For every report, we compute the similarity scores at the sentence level and the chunk level. We compute the chunk-level scores for the following reasons. First, the differences between the two texts are easier to explain because we can pinpoint where they occur. Second, concept identification and Named Entity Recognition (NER) become easier if needed. Finally, semantic similarity measures work better on phrases than on long sentences.
2.3.1. Lexical Features
We extract lexical-level features from three similarity methods. These methods depend heavily on the surface form of the words. For each method, we compute three scores, one each on the raw text, the lemmatized text, and the stemmed text, and keep the maximum as the final similarity score for that method, as given by equation 1. Let di and dj be two documents.
$$\mathrm{sim}(d_i, d_j) = \max\bigl(\mathrm{sim}_{\mathrm{raw}}(d_i, d_j),\ \mathrm{sim}_{\mathrm{lemma}}(d_i, d_j),\ \mathrm{sim}_{\mathrm{stem}}(d_i, d_j)\bigr) \tag{1}$$
We use three lexical similarity methods and feed their scores to our model as features.
1. Jaccard Similarity: Jaccard similarity21 between two documents is the number of words shared by both documents divided by the total number of unique words in the two documents. Treating each document as its set of unique words, Jaccard similarity can be computed using equation 2.
$$J(d_i, d_j) = \frac{|d_i \cap d_j|}{|d_i \cup d_j|} \tag{2}$$
Jaccard Similarity ranges between 0 and 1, where 1 denotes the two documents to be exactly the same and 0 denotes them being completely different. Jaccard similarity looks at individual words and does not consider the context and structure of the sentence.
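A minimal sketch of the Jaccard score and of the maximum over text variants described for equation 1. Whitespace tokenization is an assumption. `crude_stem` is a toy stand-in for a real stemmer, and `best_variant_score` is a hypothetical helper name; neither comes from the paper.

```python
from typing import Callable, Sequence

def jaccard_similarity(doc_i: str, doc_j: str) -> float:
    """Shared unique words divided by all unique words of both documents."""
    words_i, words_j = set(doc_i.lower().split()), set(doc_j.lower().split())
    union = words_i | words_j
    if not union:
        return 1.0  # two empty documents are treated as identical
    return len(words_i & words_j) / len(union)

def crude_stem(text: str) -> str:
    """Toy normalizer (assumption): strip a trailing 's' from longer words."""
    return " ".join(w[:-1] if len(w) > 3 and w.endswith("s") else w
                    for w in text.split())

def best_variant_score(sim: Callable[[str, str], float], doc_i: str, doc_j: str,
                       normalizers: Sequence[Callable[[str], str]]) -> float:
    """Score each normalized variant of the pair and keep the maximum."""
    return max(sim(norm(doc_i), norm(doc_j)) for norm in normalizers)

def raw(text: str) -> str:
    return text

score = jaccard_similarity("left shoulder injury", "injury in left shoulder")
best = best_variant_score(jaccard_similarity, "fibroadenomas seen",
                          "fibroadenoma seen", [raw, crude_stem])
print(score, best)  # 0.75 1.0
```

The second call shows why the maximum helps: the raw texts share only one of three unique words, while the normalized variants match exactly.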
2. Levenshtein Similarity: Levenshtein distance24,25 is a distance measure between two strings, defined as the number of edit operations needed to transform one string into the other. The edit operations are insertion, deletion, and substitution. This technique is mainly used in spell checking, where it operates at the character level. We instead use it to compute the distance between two sentences at the word level. We then convert the distance into a similarity score, as given by equation 3.
| (3) |
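A word-level edit distance can be sketched as follows. The conversion to a similarity assumes the common normalization 1 − distance / max(length), which may differ from the paper's equation 3. The function names are illustrative.

```python
from typing import Sequence

def word_levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of word insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def levenshtein_similarity(sent_i: str, sent_j: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer sentence."""
    a, b = sent_i.lower().split(), sent_j.lower().split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - word_levenshtein(a, b) / longest

distance = word_levenshtein("left shoulder injury".split(),
                            "injury in left shoulder".split())
similarity = levenshtein_similarity("left shoulder injury",
                                    "injury in left shoulder")
print(distance, similarity)  # 3 0.25
```

Because the measure is order-sensitive, a reordered phrase scores much lower here than under the Jaccard measure.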
3. Cosine Similarity: Cosine similarity is used in the vector space models. Documents are represented as vectors of words and the similarity is measured by the cosine of the angle between these vectors as given by equation 4.
$$\cos(\mathbf{d}_i, \mathbf{d}_j) = \frac{\mathbf{d}_i \cdot \mathbf{d}_j}{\lVert \mathbf{d}_i \rVert \, \lVert \mathbf{d}_j \rVert} \tag{4}$$
Cosine similarity is one of the most widely used measures of similarity between documents. Although it has been very successful, it has limitations. A major one is that it does not consider the semantic relations between words.
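A minimal bag-of-words sketch of the cosine measure. Raw term counts are an assumption; the paper does not state the term weighting it used, and a TF-IDF weighting would fit the same formula.

```python
import math
from collections import Counter

def cosine_similarity(doc_i: str, doc_j: str) -> float:
    """Cosine of the angle between raw term-count vectors of two documents."""
    vec_i = Counter(doc_i.lower().split())
    vec_j = Counter(doc_j.lower().split())
    dot = sum(vec_i[w] * vec_j[w] for w in vec_i.keys() & vec_j.keys())
    norm_i = math.sqrt(sum(c * c for c in vec_i.values()))
    norm_j = math.sqrt(sum(c * c for c in vec_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an empty document has no direction
    return dot / (norm_i * norm_j)

cos_score = cosine_similarity("left shoulder injury", "injury in left shoulder")
print(round(cos_score, 3))  # 0.866
```

The score ignores word order, so the reordered phrase scores high, but it still gives 0 to synonyms that share no surface word.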
2.3.2. Semantic Similarity
As the lexical similarity measures do not consider the semantic relationship between words and concepts, we compute two different semantic similarity scores.
1. Wordnet-based similarity: Li et al.22 proposed a method to measure the similarity between two short texts T1 and T2 using a combination of the semantic and syntactic similarity between the two texts, as given by equation 5.
$$S(T_1, T_2) = \delta\, S_s + (1 - \delta)\, S_r \tag{5}$$
where Ss denotes semantic similarity, Sr represents word order similarity, and δ ∈ [0, 1] weights the relative contribution of the two components.
The word order similarity, which represents the syntactic similarity, is based on the total number of words and the order in which the word pairs appear. The semantic similarity is computed using Wordnet23. Wordnet is a lexical semantic database composed of synsets. A synset is a set of synonyms, that is, words that share a similar meaning. The relationships between synsets are defined in a hierarchical tree structure. Each sentence is converted into a semantic vector using synsets, and cosine similarity is computed on the semantic vectors.
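The combination in equation 5 can be sketched without WordNet by passing the semantic score in as a number. The word order component below is a simplification: it matches words exactly, whereas Li et al. also map a missing word to its most similar counterpart. The default `delta=0.85` is an illustrative assumption, not a value reported in this paper.

```python
import math
from typing import List

def word_order_similarity(sent_1: List[str], sent_2: List[str]) -> float:
    """S_r = 1 - ||r1 - r2|| / ||r1 + r2|| over the joint word set.

    Simplification (assumption): a word absent from a sentence gets index 0.
    """
    joint = list(dict.fromkeys(sent_1 + sent_2))  # unique words, order kept
    r1 = [sent_1.index(w) + 1 if w in sent_1 else 0 for w in joint]
    r2 = [sent_2.index(w) + 1 if w in sent_2 else 0 for w in joint]
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 if total == 0 else 1.0 - diff / total

def combined_similarity(semantic: float, word_order: float,
                        delta: float = 0.85) -> float:
    """Weighted sum of semantic and word order similarity (equation 5)."""
    return delta * semantic + (1.0 - delta) * word_order

s_r = word_order_similarity(["dog", "bites", "man"], ["man", "bites", "dog"])
overall = combined_similarity(semantic=1.0, word_order=s_r)
print(round(s_r, 3), round(overall, 3))
```

The two sentences contain the same words, so a purely lexical set measure would call them identical; the word order term lowers the combined score.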
2. Word Mover’s Distance: Word embeddings are gaining popularity in the biomedical NLP community, as they have been shown to provide better results. Their major applications include named entity recognition, document retrieval, text categorization, and finding semantic relationships between concepts. Word embeddings map a given text into a vector space in which each word becomes a point. The idea behind word embeddings is that similar words lie closer to each other in the vector space.
Word Mover’s Distance (WMD)26 is a distance measure that captures the cumulative cost of transporting the words of one document to match the words of another document. WMD is a dissimilarity score, calculated by minimizing the total transport cost over all pairs of words. WMD depends heavily on the word embeddings. Since we are dealing with the biomedical domain, our system uses the Wiki-PubMed-PMC pre-trained word embedding model created by Pyysalo et al.27, from which we use 200-dimensional vectors.
WMD ranges over [0, ∞), with 0 for identical documents. We convert these distances into similarity scores using equation 6.
| (6) |
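The sketch below is not the full WMD, which requires solving an optimal transport problem. It implements the relaxed lower bound from Kusner et al., in which every word moves all of its weight to its nearest counterpart. The 2-dimensional vectors are made up for illustration. The conversion `1 / (1 + distance)` is an assumed form that sends 0 to 1 and large distances towards 0; the paper's own equation 6 may differ.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy embeddings; a real system would load pre-trained vectors.
TOY_VECTORS: Dict[str, Tuple[float, float]] = {
    "benign": (0.0, 1.0),
    "tissue": (0.1, 0.8),
    "tumor": (1.0, 0.0),
    "carcinoma": (0.9, 0.1),
}

def _one_sided_cost(src: List[str], dst: List[str]) -> float:
    """Each source word sends its normalized weight to the nearest target word."""
    weights = Counter(src)
    total = sum(weights.values())
    return sum(
        (count / total)
        * min(math.dist(TOY_VECTORS[word], TOY_VECTORS[other]) for other in dst)
        for word, count in weights.items()
    )

def relaxed_wmd(doc_i: List[str], doc_j: List[str]) -> float:
    """Relaxed WMD: the tighter of the two one-sided lower bounds.

    Assumes both documents are non-empty and fully in-vocabulary.
    """
    return max(_one_sided_cost(doc_i, doc_j), _one_sided_cost(doc_j, doc_i))

def distance_to_similarity(distance: float) -> float:
    """Assumed conversion from a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)

near = relaxed_wmd(["tumor"], ["carcinoma"])
far = relaxed_wmd(["tumor"], ["benign", "tissue"])
print(distance_to_similarity(near), distance_to_similarity(far))
```

Unlike the lexical measures, "tumor" and "carcinoma" score as close even though they share no characters, because their vectors are neighbours.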
Algorithm 1: Sentence pair matching algorithm
Data: report1, report2
Result: matched_pairs
sent_list_1 = split_sentences(report1);
sent_list_2 = split_sentences(report2);
for each of the 5 similarity methods m do
    for i, sent_1 in sent_list_1 do
        for j, sent_2 in sent_list_2 do
            sent_sim[i][j] = get_similarity(m, sent_1, sent_2)
        end
    end
    sent_pair_indx[m] = index of the max value in each row of sent_sim
end
for i in sent_list_1 do
    j = index chosen for row i by the most methods;
    if (i, j) is chosen by at least 3 of the 5 methods then
        matched_pairs.append(sent_list_1[i], sent_list_2[j]);
    end
end
2.3.3. Pair Matching
As reports can contain multiple sentences and the order of sentences may vary between reports, they require matching. Algorithm 1 shows the process of matching sentence pairs. We first compute five different scores for each sentence pair between the two reports. We then match sentences between the reports if the two sentences have maximum value for three of the five computed scores.
For example, consider two reports:
“Benign breast tissue, no tumor seen. Correlation with the specimen radiograph was performed. Infiltrating and in situ moderately differentiated mammary carcinoma with ductal and lobular features”
“Infiltrating moderately differentiated mammary carcinoma with ductal and lobular features. Benign breast tissue, no tumor seen.”
The order of sentences differs between the two reports, so they require matching. Table 2 shows the five similarity scores (cosine, Jaccard, Levenshtein, Wordnet, and WMD) for each sentence pair. For each row, we find the maximum score under each of the five similarity methods. If the same pair has the highest score under at least three of the five methods, we match these two sentences. Multiple sentences in one report might be matched with a single sentence from the other report. In these cases, the sentence pair with the higher scores is matched.
Table 2:
Sentence pair matching example
| | Infiltrating moderately differentiated mammary carcinoma with ductal and lobular features | Benign breast tissue, no tumor seen. |
|---|---|---|
| Benign breast tissue, no tumor seen. | 0.71, 0.39, 0.39, 0.60, 0.55 | 1.0, 1.0, 1.0, 1.0, 1.0 |
| Correlation with the specimen radiograph was performed. | 0.0, 0.04, 0.05, 0.08, 0.45 | 0.0, 0.0, 0.0, 0.16, 0.38 |
| Infiltrating and in situ moderately differentiated mammary carcinoma with ductal and lobular features | 0.89, 0.88, 0.77, 0.89, 0.85 | 0.0, 0.0, 0.0, 0.0, 0.46 |
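The matching rule can be traced on the Table 2 scores with a short sketch. Two details are assumptions because the text does not specify them: ties within a method go to the first column, and when two rows claim the same column the pair with the higher mean score wins. The name `match_pairs` is illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple

# scores[i][j] holds the five scores (cosine, Jaccard, Levenshtein, Wordnet,
# WMD) between sentence i of report 1 and sentence j of report 2 (Table 2).
TABLE_2 = [
    [[0.71, 0.39, 0.39, 0.60, 0.55], [1.0, 1.0, 1.0, 1.0, 1.0]],
    [[0.0, 0.04, 0.05, 0.08, 0.45], [0.0, 0.0, 0.0, 0.16, 0.38]],
    [[0.89, 0.88, 0.77, 0.89, 0.85], [0.0, 0.0, 0.0, 0.0, 0.46]],
]

def match_pairs(scores: List[List[List[float]]],
                min_votes: int = 3) -> List[Tuple[int, int]]:
    """Match row i to column j when at least min_votes methods rank j highest."""
    best: Dict[int, Tuple[float, int]] = {}  # column -> (mean score, row)
    for i, row in enumerate(scores):
        n_methods = len(row[0])
        # One vote per method for the column with that method's highest score.
        votes = Counter(
            max(range(len(row)), key=lambda j: row[j][m])
            for m in range(n_methods)
        )
        j, count = votes.most_common(1)[0]
        if count < min_votes:
            continue
        mean_score = sum(row[j]) / n_methods
        # Assumption: a contested column goes to the higher-scoring pair.
        if j not in best or mean_score > best[j][0]:
            best[j] = (mean_score, i)
    return sorted((i, j) for j, (_, i) in best.items())

matched = match_pairs(TABLE_2)
print(matched)  # [(0, 1), (2, 0)]
```

The "Correlation with the specimen radiograph" sentence also votes for the first column, but loses it to the much higher-scoring "Infiltrating and in situ" sentence and is left unmatched.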
Matching is required for chunks as well, and we use the same algorithm. Once the chunks are matched for each sentence pair, we compute a consolidated chunk score for each sentence by averaging the scores of its individual chunk pairs. Once we obtain the sentence pairs for each report, we compute a consolidated similarity score for each of the five methods by averaging the scores of all sentence pairs. Thus, for every report pair, we obtain ten similarity scores as features from this module: five for sentences and five for chunks.
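A small sketch of how the ten features might be assembled. The sentence scores reuse the two matched pairs from Table 2. The chunk scores are made-up placeholders, and averaging chunk scores first per sentence and then across sentences is an assumption about the aggregation order.

```python
from typing import List

def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally long score vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def report_features(sentence_scores: List[List[float]],
                    chunk_scores: List[List[List[float]]]) -> List[float]:
    """Five averaged sentence scores followed by five averaged chunk scores."""
    sentence_part = mean_vector(sentence_scores)
    per_sentence_chunks = [mean_vector(pairs) for pairs in chunk_scores]
    return sentence_part + mean_vector(per_sentence_chunks)

# One row of five scores per matched sentence pair (values from Table 2).
sentence_scores = [
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [0.89, 0.88, 0.77, 0.89, 0.85],
]
# Hypothetical chunk-pair scores, grouped by the sentence pair they belong to.
chunk_scores = [
    [[1.0, 1.0, 1.0, 1.0, 1.0]],
    [[0.9, 0.8, 0.7, 0.9, 0.8], [0.5, 0.4, 0.3, 0.6, 0.7]],
]
features = report_features(sentence_scores, chunk_scores)
print(len(features))  # 10
```

The resulting vector is what the classifiers in the next section would receive for one report pair.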
2.4. Supervised Learning
We used supervised classification to predict relevance scores. The ten similarity scores from the feature extraction step serve as the input to the classifiers. We applied four classifiers: logistic regression (LR), naive Bayes (NB), random forest (RF), and support vector machines (SVMs). We also experimented with an ensemble method that classifies samples by voting among different classifiers. We used all four classifiers for voting.
We used an 80-20 split for the training and testing sets. We performed ten-fold cross-validation to pick the best model and used the held-out test set for prediction. We computed precision, recall, and F-scores. Since there was no existing baseline for our analysis, we used the ZeroR classifier as our baseline. ZeroR predicts the majority class label without considering the features.
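The ZeroR baseline is simple enough to sketch directly. The label counts below are the class totals reported in Section 3.1, so the resulting accuracy is computed on the full dataset rather than under cross-validation and differs slightly from the figure in Table 3. The function names are illustrative.

```python
from collections import Counter
from typing import List

def zero_r_fit(labels: List[str]) -> str:
    """ZeroR ignores all features and memorizes the most frequent label."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(predictions: List[str], truth: List[str]) -> float:
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

labels = ["NC"] * 171 + ["MiC"] * 40 + ["MaC"] * 54
majority = zero_r_fit(labels)
baseline_accuracy = accuracy([majority] * len(labels), labels)
print(majority, round(100 * baseline_accuracy, 1))  # NC 64.5
```

Any useful classifier has to beat this number, which here simply equals the share of NC reports.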
3. Results
3.1. Dataset and Annotation
We had a total of 265 reports for this study. Relevance scores for these reports were annotated by one of the physicians. The relevance score for each report pair fell into one of two classes (NC, C). NC meant that the two reports were similar and no change was required. C meant that changes were required, and it was further divided into two classes (MiC, MaC), with MiC denoting minor changes and MaC denoting major changes. Out of the 265 reports, there were 171 NC reports (64.5%) and 94 C reports (35.5%). Within the 94 C reports, 40 reports (42.6%) belonged to the MiC class and 54 reports (57.4%) fell into the MaC class.
3.2. Classification
Table 3 shows the ten-fold cross-validation performance of the various classifiers, with Table 3a presenting binary classification and Table 3b showing 3-way classification. Logistic regression (both binary and multinomial) has the highest accuracy of all classifiers and improves on the ZeroR baseline by about 5 to 8 percentage points.
Table 3:
Ten-fold cross validation performances by different classifiers
(a) Binary Classification

| Classifier | NC (F) | C (F) | Accuracy (%) |
|---|---|---|---|
| ZeroR | 0.786 | 0.0 | 64.7 |
| LR | 0.779 | 0.479 | 70.0 |
| SVM | 0.775 | 0.426 | 68.5 |
| NB | 0.654 | 0.581 | 63.6 |
| RF | 0.774 | 0.445 | 68.1 |
| Ensemble | 0.752 | 0.463 | 68.5 |

(b) 3-way Classification (NC vs MiC vs MaC)

| Classifier | NC (F) | MiC (F) | MaC (F) | Accuracy (%) |
|---|---|---|---|---|
| ZeroR | 0.786 | 0.0 | 0.0 | 64.7 |
| LR | 0.812 | 0.0 | 0.631 | 72.4 |
| RF | 0.767 | 0.017 | 0.394 | 67.0 |
| NB | 0.620 | 0.248 | 0.449 | 53.1 |
| SVM | 0.776 | 0.0 | 0.065 | 64.7 |
| Ensemble | 0.767 | 0.02 | 0.429 | 68.1 |

Although the NB classifier handled the class imbalance well, its performance did not match that of LR. The voting-based ensemble method does not improve performance. The 3-way classification accuracy of LR is also slightly better than its binary classification accuracy.
As the addition of chunk level scores was novel to our system, we wanted to see if it had any effect on the performance. Table 4 shows precision, recall, and f-scores for each of the three classes (NC, MiC, MaC) on ten-fold cross-validation.
Table 4:
Ten-fold cross validation performance scores for different feature sets
| Features | NC P | NC R | NC F | MiC P | MiC R | MiC F | MaC P | MaC R | MaC F | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Sent lexical | 0.671 | 1.000 | 0.799 | 0.000 | 0.000 | 0.000 | 0.600 | 0.207 | 0.299 | 0.682 |
| Sent semantic | 0.678 | 0.976 | 0.792 | 0.000 | 0.000 | 0.000 | 0.650 | 0.184 | 0.267 | 0.677 |
| Chunk lexical | 0.661 | 1.000 | 0.789 | 0.000 | 0.000 | 0.000 | 0.100 | 0.050 | 0.067 | 0.663 |
| Chunk semantic | 0.698 | 0.981 | 0.809 | 0.000 | 0.000 | 0.000 | 0.700 | 0.286 | 0.396 | 0.706 |
| Sent lexical + Chunk lexical | 0.684 | 0.992 | 0.807 | 0.000 | 0.000 | 0.000 | 0.767 | 0.280 | 0.393 | 0.701 |
| Sent semantic + Chunk semantic | 0.687 | 0.985 | 0.805 | 0.000 | 0.000 | 0.000 | 0.717 | 0.337 | 0.441 | 0.696 |
| Sent combined | 0.668 | 0.958 | 0.779 | 0.000 | 0.000 | 0.000 | 0.658 | 0.394 | 0.455 | 0.673 |
| Chunk combined | 0.684 | 0.957 | 0.792 | 0.000 | 0.000 | 0.000 | 0.742 | 0.469 | 0.555 | 0.696 |
| Sent combined + Chunk combined | 0.715 | 0.949 | 0.807 | 0.000 | 0.000 | 0.000 | 0.692 | 0.454 | 0.514 | 0.710 |
The table shows that combining chunk_combined and sent_combined improves performance and gives the best cross-validation accuracy. Although lexical features alone and semantic features alone perform similarly, we decided to use both, because semantic measures sometimes find similarity between sentences that are in fact unrelated. This can be seen in Table 2. The two sentences “Correlation with the specimen radiograph was performed” and “Benign breast tissue, no tumor seen.” do not share any similarity, yet the WMD measure gives a score of 0.38. We expect the lexical similarities to help in such cases.
This model was then used on the held-out test set. It achieved an overall accuracy of 73.6%. Precision, recall, and f-scores for this held-out set can be seen in Table 5.
Table 5:
Performance scores on the held-out test set
| Class | Precision | Recall | F-measure |
|---|---|---|---|
| NC | 0.708 | 1.000 | 0.829 |
| MiC | 0.000 | 0.000 | 0.000 |
| MaC | 1.000 | 0.625 | 0.769 |
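The per-class scores in Table 5 follow from the usual definitions of precision, recall, and F-measure. The confusion matrix below is not reported in the paper. It is a reconstruction that reproduces the Table 5 figures and the 73.6% accuracy under the assumption that the held-out set holds 53 reports (20% of 265).

```python
from typing import Dict, Tuple

LABELS = ["NC", "MiC", "MaC"]

# Assumed confusion matrix (rows: true class, columns: predicted class).
CONFUSION: Dict[str, Dict[str, int]] = {
    "NC": {"NC": 34, "MiC": 0, "MaC": 0},
    "MiC": {"NC": 11, "MiC": 0, "MaC": 0},
    "MaC": {"NC": 3, "MiC": 0, "MaC": 5},
}

def precision_recall_f(label: str) -> Tuple[float, float, float]:
    """Per-class precision, recall and F-measure; empty denominators give 0."""
    true_positive = CONFUSION[label][label]
    predicted = sum(CONFUSION[true][label] for true in LABELS)
    actual = sum(CONFUSION[label].values())
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

correct = sum(CONFUSION[label][label] for label in LABELS)
overall_accuracy = correct / sum(sum(row.values()) for row in CONFUSION.values())

for label in LABELS:
    print(label, [round(value, 3) for value in precision_recall_f(label)])
print(round(overall_accuracy, 3))
```

Under this reconstruction, every MiC report and three MaC reports are predicted as NC, which is why NC recall is perfect while its precision is only about 0.71.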
4. Discussion
Our major goal in this study was to build a baseline system for automated assessment and comparison of medical reports and thereby help reduce the cognitive load of attending physicians. Our results show that the system is capable of assessing the differences between reports and grading them.
We analyzed the predictions to understand the reasons for the errors. Table 6 shows a few example reports along with their similarity scores, chunk pairs, and predictions. Examples 1 and 2 show correct predictions. In example 1, the overall similarity scores are high, with a few exceptions, and the pair is correctly classified as NC. Similarly, the similarity scores are consistently low for example 2, which our system correctly classifies as MaC. In example 3, the correct class is NC, meaning the two reports are similar. However, except for the moderately high Chunk_Wmd and Sent_Wmd scores, all other similarity scores are very low. Only the word mover’s distance measure, which finds semantic similarity using word embeddings, rates the two reports as more similar. In example 4, the sentence-level semantic scores are higher than the chunk-level similarity scores, which might be the cause of the misclassification.
Table 6:
Sample reports with similarity scores and predictions
| No | ResReport | PhyReport | Scores | Chunk pairs | Truth | Prediction |
|---|---|---|---|---|---|---|
| 1 | Invasive mucinous carcinoma (colloid carcinoma), moderately differentiated. | Invasive moderately differentiated mucinous carcinoma. Note- We will addend results of predictive marker stains (block 1A). | Sent Cos: 0.85 Sent Lev: 0.273 Sent Jac: 0.60 Sent Sem: 0.729 Sent Wmd: 0.666 Chunk Cos: 0.537 Chunk Lev: 0.5 Chunk Jac: 0.4 Chunk Sem: 0.64 Chunk Wmd: 0.519 | Invasive mucinous carcinoma colloid carcinoma : Invasive moderately differentiated mucinous carcinoma = 0.64 | NC | NC |
| 2 | High grade prostatic intraepithelial neoplasia | Prostatic adenocarcinoma, Gleason score 3 + 4 = 7 (30% grade 4) (Grade Group2), involving 5 mm (40%) of one out of one core | Sent Cos: 0.223 Sent Lev: 0.029 Sent Jac: 0.074 Sent Sem: 0.109 Sent Wmd: 0.377 Chunk Cos: 0.078 Chunk Lev: 0.032 Chunk Jac: 0.015 Chunk Sem: 0.041 Chunk Wmd: 0.336 | High grade prostatic intraepithelial neoplasia : Prostatic adenocarcinoma , Gleason score 3 + 4 = 7 30%grade 4 Grade Group2 , involving 5 mm 40 % one one core = 0.402 | MaC | MaC |
| 3 | Papilloma with surrounding sclerosing adenosis | Fibroadenoma containing sclerosing adenosis, duct hyperplasia, and cyst (complex fibroadenoma) | Sent Cos: 0.184 Sent Lev: 0.071 Sent Jac: 0.133 Sent Sem: 0.228 Sent Wmd: 0.486 Chunk Cos: 0.137 Chunk Lev: 0.133 Chunk Jac: 0.167 Chunk Sem: 0.189 Chunk Wmd: 0.432 | surrounding sclerosing adenosis : Fibroadenoma containing sclerosing adenosis = 0.566 | NC | MaC |
| 4 | In situ lobular carcinoma, intermediate grade, invovling papilloma. | Intraductal carcinoma (micropapillary and solid type, EORTC intermediate grade) involving sclerosing duct papilloma. Carcinoma show extensive involvement of lobules. Correlation with the specimen radiograph was performed. | Sent Cos: 0.357 Sent Lev: 0.294 Sent Jac: 0.286 Sent Sem: 0.445 Sent Wmd: 0.642 Chunk Cos: 0.205 Chunk Lev: 0.143 Chunk Jac: 0.143 Chunk Sem: 0.258 Chunk Wmd: 0.359 | In situ lobular carcinoma : invovling papilloma : Intraductal carcinoma micropapillary = 0.531; invovling papilloma : solid type = 0.348; intermediate grade : EORTC intermediate grade involving sclerosing duct papilloma = 0.516 | MaC | NC |
| 5 | Benign fatty breast tissue with mild chronic inflammation. | Benign fatty breast tissue with fibrocystic changes and minimal chronic inflammatory infiltrates. No definitive lymph node architecture is identified. | Sent Cos: 0.477 Sent Lev: 0.538 Sent Jac: 0.467 Sent Sem: 0.627 Sent Wmd: 0.754 Chunk Cos: 0.39 Chunk Lev: 0.389 Chunk Jac: 0.417 Chunk Sem: 0.415 Chunk Wmd: 0.693 | Benign fatty breast tissue : Benign fatty breast tissue = 1.0; mild chronic inflammation : fibrocystic changes minimal chronic inflammatory infiltrates = 0.58 | MiC | NC |
One of the major contributions of this paper is chunk-level analysis, and we have shown that chunks, used along with sentences, help in getting better predictions and in understanding errors. Furthermore, the “Chunk pairs” column breaks down the scores for each chunk pair identified. Missing chunks represent zero or very low similarity.
Through error analysis, we identified two major limitations. First, we have a small sample with a large imbalance among the three classes. Second, the lack of embeddings based on medical terminology and reports limits our ability to capture the true semantic relationships between reports. Most misclassification errors occur in the Minor Changes (MiC) category. Our classifiers failed to correctly classify even a single MiC report, as seen in Table 4. Example 5 illustrates this misclassification: although its similarity scores are only moderately high, the pair was misclassified as NC. Report pairs in the NC category still differ, but the difference is minute. Our algorithm could not distinguish the level of difference in the NC category from that in the MiC category. This can be attributed to the very few training samples from this category (only 15% of the 265 reports).
Error analysis suggests several ways to improve this preliminary work. First and foremost, we intend to obtain more annotated training data. We have seen that the word mover’s distance measure, which uses embeddings trained on PubMed articles, gives better and more accurate semantic scores. Embeddings trained on large volumes of hospital reports would likely improve performance further. Mapping the identified chunks to a knowledge base like UMLS could also improve performance, and we intend to add such mappings as features to our system.
Although our system is focused on automated assessment of pathology reports, such a system can be used in several other ways. One major extension we intend to pursue is the automated summary of differences in patient reports between multiple hospital visits. We can also extend our system to other medical domains (operative reports, scan reports, or patient visit notes).
5. Conclusion
In this paper, we presented a new approach for automated assessment of residents’ clinical reports. We proposed a system that uses NLP techniques to compute various lexical and semantic similarities at the sentence level as well as at the much simpler phrase level. We presented a baseline supervised machine learning system that uses the similarity scores as features to grade the reports. Our system classified reports with an accuracy of 73.6%, and we intend to improve the performance of our classifiers. In addition, we showed that adding phrase-level information helps improve performance. We believe that our system is an important step toward automated assessment systems in the medical domain.
References
- 1.Holmboe ES, Edgar L, Hamstra S. The Milestones Guidebook. 2017. http://www.acgme.org/Portals/0/PDFs/Milestones/MilestonesGuidebookforResidentsFellows.pdf (Accessed July 25, 2018)
- 2.Hyde Glendon A, Michael D. Biderman, Eric C Nelson. “Resident Operative Reports before and after Structured Education.” The American Surgeon. 2018;84(6):987–990.
- 3.Tersteeg Janneke JC, Paul D. Gobardhan, Rogier MPH Crolla, Peter AM Kint, Ilse Niers-Stobbe, Leandra JM Boonman-de Winter, Jennifer MJ Schreinemakers. “Improving the Quality of MRI Reports of Preoperative Patients With Rectal Cancer: Effect of National Guidelines and Structured Reporting.” American Journal of Roentgenology. 2018; pp. 1240–1244.
- 4.Stogryn S.E, Hardy K, Mullan M.J, Park J, Andrew C, Vergis A. Synoptic operative reporting: assessing the completeness, accuracy, reliability, and efficiency of synoptic reporting for Roux-en-Y gastric bypass. Surgical Endoscopy. 2018;32(4):1729–1739. doi: 10.1007/s00464-017-5855-8.
- 5.Gur I, Gur D, Recabaren J.A. The computerized synoptic operative report: a novel tool in surgical residency education. Archives of Surgery. 2012;147(1):71–74. doi: 10.1001/archsurg.2011.228.
- 6.Johnson Tucker F, Waleed Brinjikji, Derrick A. Doolittle, Alex A. Nagelschneider, Brian T. Welch, Amy L. Kotsenas. “Structured Head and Neck CT Angiography Reporting Reduces Resident Revision Rates.” Current Problems in Diagnostic Radiology. 2018.
- 7.Singh G, Harvey R, Dyne A, Said A, Scott I. “Hospital discharge summary scorecard: a quality improvement tool used in a tertiary hospital general medicine service.” Internal Medicine Journal. 2015;45(12):1302–1305. doi: 10.1111/imj.12924.
- 8.Triola Marc M, Richard E. Hawkins, Susan E. Skochelak. “The time is now: Using graduates practice data to drive medical education reform.” Academic Medicine. 2018;93(6):826–828. doi: 10.1097/ACM.0000000000002176.
- 9.Williamson Kenneth B, Jennifer L. Steele, Richard B. Gunderman, Terrence D. Wilkin, Robert D. Tarver, Valerie P. Jackson, Donald L. Kreipke. “Assessing radiology resident reporting skills.” Radiology. 2002;225(3):719–722. doi: 10.1148/radiol.2253011335.
- 10.Legault Kimberly, Jacqueline Ostro, Zahira Khalid, Parveen Wasi, John J. You. “Quality of discharge summaries prepared by first year internal medicine residents.” BMC Medical Education. 2012;12(1):77. doi: 10.1186/1472-6920-12-77.
- 11.Novitsky Yuri W, Ronald F. Sing, Kent W. Kercher, Martha L. Griffo, Brent D. Matthews, B. Todd Heniford. “Prospective, blinded evaluation of accuracy of operative reports dictated by surgical residents.” The American Surgeon. 2005;71(8):627–632.
- 12.Wauben Linda SGL, Richard HM Goossens, Johan F. Lange. “Differences between attendings and residents operative notes for laparoscopic cholecystectomy.” World Journal of Surgery. 2013;37(8):1841–1850. doi: 10.1007/s00268-013-2050-5.
- 13.Page Ellis B. “The imminence of... grading essays by computer.” The Phi Delta Kappan. 1966;47(5):238–243.
- 14.Westera Wim, Mihai Dascalu, Hub Kurvers, Stefan Ruseti, Stefan Trausan-Matu. “Automated essay scoring in applied games: Reducing the teacher bandwidth problem in online training.” Computers & Education. 2018;123:212–224.
- 15.Boulanger David, Vivekanandan Kumar. “Deep Learning in Automated Essay Scoring.” In International Conference on Intelligent Tutoring Systems. Cham: Springer; 2018. pp. 294–299.
- 16.Attali Yigal, Jill Burstein. “Automated essay scoring with e-rater V. 2.” The Journal of Technology, Learning and Assessment. 2006;4(3).
- 17.Jin Cancan, Ben He, Kai Hui, Le Sun. “TDNN: A Two-stage Deep Neural Network for Prompt-independent Automated Essay Scoring.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2018. pp. 1088–1097.
- 18.Burrows Steven, Iryna Gurevych, Benno Stein. “The eras and trends of automatic short answer grading.” International Journal of Artificial Intelligence in Education. 2015;25(1):60–117.
- 19.Chen Yukun, Jesse Wrenn, Hua Xu, Anderson Spickard III, Ralf Habermann, James Powers, Joshua C. Denny. “Automated Assessment of Medical Students Clinical Exposures according to AAMC Geriatric Competencies.” In AMIA Annual Symposium Proceedings; American Medical Informatics Association; 2014. p. 375.
- 20.Zhang Rui, Serguei Pakhomov, Sophia Gladding, Michael Aylward, Emily Borman-Shoap, Genevieve B. Melton. “Automated assessment of medical training evaluation text.” In AMIA Annual Symposium Proceedings; American Medical Informatics Association; 2012. p. 1459.
- 21.Jaccard Paul. “Étude comparative de la distribution florale dans une portion des Alpes et des Jura.” Bull Soc Vaudoise Sci Nat. 1901;37:547–579.
- 22.Li Yuhua, David McLean, Zuhair A. Bandar, Keeley Crockett. “Sentence similarity based on semantic nets and corpus statistics.” IEEE Transactions on Knowledge & Data Engineering; 2006. pp. 1138–1150.
- 23.Miller George A. “WordNet: a lexical database for English.” Communications of the ACM. 1995;38(11):39–41.
- 24.Levenshtein Vladimir I. “Binary codes capable of correcting deletions, insertions, and reversals.” In Soviet Physics Doklady. 1966;10(8):707–710.
- 25.Damerau Fred J. “A technique for computer detection and correction of spelling errors.” Communications of the ACM. 1964;7(3):171–176.
- 26.Kusner Matt, Yu Sun, Nicholas Kolkin, Kilian Weinberger. “From word embeddings to document distances.” In International Conference on Machine Learning; 2015. pp. 957–966.
- 27.Pyysalo Sampo, Filip Ginter, Hans Moen, Tapio Salakoski, Sophia Ananiadou. “Distributional semantics resources for biomedical text processing.” In Proceedings of the 5th International Symposium on Languages in Biology and Medicine; Tokyo, Japan: 2013. pp. 39–43.
- 28.Kristina Toutanova, Dan Klein, Christopher Manning, Yoram Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003. 2003. pp. 252–259.

