2014 Oct 21;22(e1):e48–e66. doi: 10.1136/amiajnl-2014-002868

Table 5: Experimented features and template

Compilation	Feature type	Software
The best system	1 Original word; 2 Lemma (base form, eg, patient for patients or extract for extracted); 3 Named entity recognition (NER for person, location, organization, other proper name, date, time, money, and number; eg, number for 5); 4 Part of speech (POS, eg, noun, verb, adjective, etc.); 5 Parse tree, including the ancestors from the root to the node (eg, root, noun phrase, common noun for 5 in In bed 5, we have…); 6 Basic dependents (eg, 5 that refers to the bed ID for bed in In bed 5, we have…); 7 Basic governors (eg, preposition in and subject we for have in In bed 5, we have…); 8 Phrase context (eg, In bed 5 for bed in In bed 5); 9 Top 5 candidates (eg, BP may refer to Bachelor of Pharmacy, bedpan, before present, birthplace, or blood pressure, among others); 10 Top mapping (eg, pneumonia is a type of respiratory tract infection); 11 Location percentage (within the [0, 10%], (10%, 20%], (20%, 30%], …, (90%, 100%] of the report) on a ten-point scale; 12 Medication score (1 if the word belongs to words annotated as a medicine in an independent set of 100 synthetic handover reports after removing tailored generic stopwords [ie, the following 24 words: 2, 4, a, all, and, as, chest, effect, for, from, i, in, no, of, on, or, other, pain, the, to, two, used, very, with], 0 otherwise)⁴⁶; 13 Systematized Nomenclature of Medicine Clinical Terms, Australian Release identifiers (SNOMED-CT-AU IDs, 0 if none found): first, the aforementioned MetaMap was used to extract all Unified Medical Language System (UMLS) concept IDs that are present in the SNOMED-CT subset of UMLS, and second, the UMLS concept IDs were transformed to the respective SNOMED-CT IDs through a lookup table derived from UMLS by using MetamorphoSys, the UMLS Installation and Customization Program	Stanford CoreNLP; Stanford CoreNLP; Stanford CoreNLP; Stanford CoreNLP; Stanford CoreNLP; Stanford CoreNLP; MetaMap 2011; MetaMap 2011; MetaMap 2011; NICTA; NICTA; SNOMED-CT-AU, MetamorphoSys, and NICTA
Other experimented features	14 Normalized term frequency (ie, the number of times a given term occurs in a handover report divided by the maximum term frequency over all terms in this report). 15 SNOMED-CT-AU abstraction IDs (0 if none found): the SNOMED-CT-AU concepts found in records were generalized by using the SNOMED-CT-AU reference sets; that is, for each concept, a check was performed to identify all the reference sets in which the concept was present. This allowed assigning a pseudo semantic class (or abstraction) to each concept. This pseudo semantic class, defined as the list of reference sets in which the concept was present, was then used as a word-level feature (0 if the word is not part of a SNOMED-CT-AU concept or is not present in any of the reference sets). 16 Australian Medicines Terminology (AMT) IDs (0 if none found): first, a dictionary-based matcher was used to identify medicines (both the trade names and their generic versions [eg, oxycodone for both Roxicodone and Oxycontin]) in the texts; second, the medicines were transformed to their generic versions by using the trade products and medicinal products columns in AMT; and third, in order to capture the more general semantic properties carried by the medicinal products, SNOMED-CT-AU was used to abstract various medicines to their common parent concept (eg, opiate or central nervous system (CNS) drug for both oxycodone and codeine). AMT contained 5066 trade products and 1821 medicinal products. Since there was no existing mapping between medicinal products in AMT and their respective products in SNOMED-CT-AU, string matching was used to find the SNOMED-CT-AU IDs corresponding to the medicinal product IDs in AMT. This approach mapped 842 (46.2%) of the 1821 medicinal products in AMT to their respective concepts in SNOMED-CT-AU. Although a substance-level mapping exists between AMT and SNOMED-CT-AU, this approach was discarded because multiple substances can correspond to a single medicinal product and because of further complications in transforming concepts to their common abstractions (the structure of SNOMED-CT allows multiple paths between concept nodes). 17 Corpus check: since some of the Australian trade products have homonymous names (eg, Pain, Ice, and Peg), a binary feature was introduced to determine if the medicine name is present in the fiction subset of the Brown University Standard Corpus of Present-Day American English. 18 Expression type, position, and temporality: our hypothesis was that temporal expressions are not distributed homogeneously in the report. For example, iCcco (clinical history/presentation) should (in theory) contain temporal expressions referencing past events, whereas iccCo (care plan) or icccO (outcomes of care and reminders) should rather contain expressions referencing future events. From the Med-TTK output, three separate word-level features were created: the first identified the type of the expression (date, duration, etc.; 0 for words not part of a temporal expression), the second identified the relative position of the reference on the timeline (before, after, etc.; 0 for words not part of a temporal expression), and the third was a binary feature determining if the word is part of a temporal expression (0 if not).	NICTA; SNOMED-CT-AU and NICTA; AMT Sep 2013, Ontoserver, and NICTA; AMT Sep 2013, Brown Corpus, and NICTA; Med-TTK,⁴⁵ a medical modification of the Temporal Awareness and Reasoning Systems for Question Interpretation (TARSQI) Toolkit (TTK), and NICTA
Other experimented features	19 Sentiment carrying and polarity: our hypothesis was that, similarly to temporal expressions, sentiment-carrying words are not distributed homogeneously in the reports. For example, Iccco (identification of the patient) should not contain such words at all, whereas icccO (outcomes of care and reminders) should contain a great deal of sentiment (eg, patient getting better, condition improved). The approach used exact string matching to map the transcripts to the list of positive and negative opinion words or sentiment words. As this approach was not tailored for clinical language and word sense disambiguation was not used in this task, common ambiguous stopwords like patient and like were removed from the list based on our initial error analysis. Additionally, the reports contained the words [unclear] and [inaudible] for identifying incomprehensible portions of text in the transcription, which we removed. The approach resulted in two separate word-level features: the first identified if the word is carrying any sentiment (1 for yes and 0 for no) and the second identified the sentiment polarity (positive, negative, or 0). 20 Rule-based post-processing containing seven rules for changing the label predicted by CRF++ based on analyzing the labels of previous and following lemmas and their POS tags. The rules were used to change the labels of interjections, modals, personal pronouns, adverbs, and existential theres to NA, because these tokens have no useful meaning in the current context and thus should be discarded. Additionally, the labels of determiners were changed to NA if the previous and next tokens were labeled as NA. Similarly, the labels of conjunctions were changed to match the label of the next token if both the previous and next tokens were labeled the same.	Positive and negative opinion words or sentiment words for English (approximately 6800 words with their respective polarities) supplemented with discharge and discharged (both were marked with positive polarity) at NICTA; NICTA
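
As a rough illustration of how two of the simpler word-level features in Table 5 could be computed, the sketch below derives feature 11 (location percentage on a ten-point scale) and feature 14 (normalized term frequency) from a tokenized report. The function names, the 1-based position convention, and the pre-tokenized input are assumptions made for this example; the table does not specify them.

```python
from collections import Counter
from typing import Dict, List


def location_decile(index: int, n_tokens: int) -> int:
    """Feature 11 sketch: map a 0-based token index to a bin from 1 to 10.

    Assumption: a token's relative position is (index + 1) / n_tokens, so
    bin 1 covers [0, 10%], bin 2 covers (10%, 20%], and so on up to bin 10.
    Integer arithmetic is used to avoid floating-point rounding at bin edges.
    """
    if n_tokens <= 0 or not 0 <= index < n_tokens:
        raise ValueError("index must lie within a non-empty report")
    # Ceiling of 10 * (index + 1) / n_tokens, computed with integers only.
    return (10 * (index + 1) + n_tokens - 1) // n_tokens


def normalized_term_frequency(tokens: List[str]) -> Dict[str, float]:
    """Feature 14 sketch: term count divided by the largest term count."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


if __name__ == "__main__":
    report = ["in", "bed", "5", "we", "have", "bed", "rest"]
    print([location_decile(i, len(report)) for i in range(len(report))])
    print(normalized_term_frequency(report))
```

Both functions work on one report at a time, which matches the per-report definitions given in the table.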

In the CRF++ template, which controls the usage of features in training and testing, the unigram part uses all features of the current location alone, the previous location alone, and the next location alone; the pairwise correlations of the previous–current location over all features; the pairwise correlations of the current–next location over all features; and the correlation/combination of all features in the current location. In the bigram part of the template, we used only one bigram template, which combines the predicted category for the previous location and the features of the current location as a feature. To download the cited software, please visit http://nlp.stanford.edu/software/corenlp.shtml for Stanford CoreNLP, http://metamap.nlm.nih.gov/ for MetaMap, http://www.nehta.gov.au/our-work/clinical-terminology/snomed-clinical-terms for SNOMED-CT-AU, http://www.ncbi.nlm.nih.gov/books/NBK9683/ for MetamorphoSys, http://www.nehta.gov.au/our-work/clinical-terminology/australian-medicines-terminology for AMT, http://ontoserver.csiro.au:8080/ for Ontoserver, http://khnt.hit.uib.no/icame/manuals/brown/index.htm for Brown Corpus, and http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar for Positive and negative opinion words or sentiment words for English. For NICTA software, please email hanna.suominen@nicta.com.au.
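
The template layout described in this footnote might be generated along the following lines. This is a minimal sketch, not the authors' actual template file. It assumes that the features occupy columns 0 to n-1 of the CRF++ training file, that the "pairwise correlations" pair the same feature column at two adjacent locations, and that the single bigram template combines all current-location features. The function name `build_crfpp_template` is hypothetical.

```python
def build_crfpp_template(n_features: int) -> str:
    """Return CRF++ template text for n_features feature columns.

    CRF++ macros have the form %x[row,col], where row is the offset from
    the current token and col is the feature column. Lines starting with
    U are unigram templates and lines starting with B are bigram templates.
    """
    cols = range(n_features)
    macros = []

    # Each feature alone at the current, previous, and next location.
    for offset in (0, -1, 1):
        macros += [f"%x[{offset},{c}]" for c in cols]

    # Assumed reading of the pairwise correlations: the same feature
    # column at previous/current and at current/next locations.
    for left, right in ((-1, 0), (0, 1)):
        macros += [f"%x[{left},{c}]/%x[{right},{c}]" for c in cols]

    # Combination of all features at the current location.
    all_current = "/".join(f"%x[0,{c}]" for c in cols)
    macros.append(all_current)

    lines = ["# Unigram"]
    lines += [f"U{i:03d}:{macro}" for i, macro in enumerate(macros)]
    # One bigram template: CRF++ combines it with the previous and
    # current output labels automatically.
    lines += ["", "# Bigram", f"B00:{all_current}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_crfpp_template(2))
```

With 13 feature columns, as in the best system, this layout yields 66 unigram templates (39 single-location, 26 pairwise, and 1 combined) plus the single bigram template.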