1. Filter unstructured notes by type |
Medical notes are often characterized by their type (e.g., Discharge Summary, Surgical History). Based on manual review of a sample of our training set, the study team judged certain note types to have a reduced probability of containing free-text documentation of vaccinations, and those note types were filtered out. Examples include notes populated with semi-structured interview questions, such as “Received Flu Vaccine: No”, and patient education notes that discuss vaccinations hypothetically and might read “...after this procedure, do not receive a flu shot for at least a month...”. |
2. Tokenize filtered set of notes |
To process the filtered set of notes, we used a simple tokenization algorithm (SpaCy)1 to segment the text of the filtered notes into single words. |
3. Create simple part of speech tagging to identify presence of vaccine administration |
Using the list of words produced by the tokenizer, we tagged verbs that indicated vaccine administration. Identifying a past-tense verb (“got,” “received,” “given,” or “had”) helped distinguish true instances of vaccination from vaccine education materials. The full list of verbs can be found in the Supplementary Material. |
4. Using NLP rule-based matching to search for vaccine derivative in vicinity of verb |
If a desired verb was found, the algorithm searched for evidence of a vaccination (e.g., vaccine, shot, vaccination) within five tokens of the verb, where a token is a continuous string of characters delimited by spaces or punctuation marks. |
5. Using NLP rule-based matching to search for and identify vaccine type |
The algorithm used the four tokens preceding the vaccination term to search for the vaccine type (e.g., influenza, flu, hep b, hepatitis b). It then selected the mapped term that most completely matched those four tokens (e.g., “pneumococcal 13” maps to “pneumococcal 13-valent” rather than the more generic “pneumococcal”). The table of mappings to vaccine types was developed from an initial list provided by clinician SMEs, augmented by potential alternative names found in a manual review of a sample of training cases. The final table can be reviewed in the Supplementary Material. If no name was found, the vaccine was recorded as “vaccine” with no specified type (e.g., if the note read “patient received vaccinations today”). |
6. Find or derive date of vaccine administration |
The algorithm searched for an absolute date (e.g., 1/12/19, 1/19) or a relative date (e.g., yesterday, last week, today) within five tokens of the vaccination term. A table of the date formats and relative date tokens used can be found in the Supplementary Material. We built the list of date formats and selected five tokens as the window around the vaccination term based on a developer's manual review of a sample of cases from our training dataset. Relative dates were resolved against the date of the note entry. Vaccinations were included only when an associated absolute or relative date was found, because manual reviews demonstrated that, without such a date, mentions of vaccinations were much less likely to represent actual vaccination events. As this was a proof-of-concept (POC) algorithm, it is limited in the permutations of date formats it can identify and could be improved by the ability to recognize phrases such as “3 days ago” or “3 weeks ago”, among others. |
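
Steps 2 through 6 can be illustrated with a minimal Python sketch. This is not the study's implementation. It rests on several assumptions: a regular-expression tokenizer stands in for SpaCy, the word lists and the `VACCINE_TYPES` mapping are small hypothetical stand-ins for the tables in the Supplementary Material, windows are applied on both sides of the anchor token, and only full `m/d/yy` dates and the single-token relative dates "today" and "yesterday" are handled. All function and constant names are illustrative.

```python
import re
from datetime import date, timedelta

# Hypothetical stand-ins for the Supplementary Material tables.
ADMIN_VERBS = {"got", "received", "given", "had"}
VACCINE_TERMS = {"vaccine", "vaccination", "vaccinations", "shot"}
VACCINE_TYPES = {
    ("flu",): "influenza",
    ("influenza",): "influenza",
    ("hep", "b"): "hepatitis B",
    ("hepatitis", "b"): "hepatitis B",
    ("pneumococcal",): "pneumococcal",
    ("pneumococcal", "13"): "pneumococcal 13-valent",
}
RELATIVE_DAYS = {"today": 0, "yesterday": 1}
WINDOW = 5       # tokens searched around a verb or vaccination term
TYPE_WINDOW = 4  # tokens searched before the vaccination term


def tokenize(text):
    """Regex tokenizer (a stand-in for SpaCy) that keeps slash dates whole."""
    return re.findall(r"\d{1,2}/\d{1,2}(?:/\d{2,4})?|\w+", text.lower())


def parse_date(token, note_date):
    """Resolve an absolute m/d/yy date or a relative date token, else None."""
    if token in RELATIVE_DAYS:
        return note_date - timedelta(days=RELATIVE_DAYS[token])
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})", token)
    if not m:
        return None
    month, day, year = (int(g) for g in m.groups())
    if year < 100:
        year += 2000  # assumption: two-digit years fall in the 2000s
    try:
        return date(year, month, day)
    except ValueError:
        return None


def find_type(preceding):
    """Return the mapping for the longest matching token run, else 'vaccine'."""
    for n in range(len(preceding), 0, -1):
        for start in range(len(preceding) - n + 1):
            key = tuple(preceding[start:start + n])
            if key in VACCINE_TYPES:
                return VACCINE_TYPES[key]
    return "vaccine"


def extract_vaccinations(text, note_date):
    """Return (vaccine_type, date) pairs for dated vaccination mentions."""
    tokens = tokenize(text)
    events, seen = [], set()
    for i, tok in enumerate(tokens):
        if tok not in ADMIN_VERBS:
            continue
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if tokens[j] not in VACCINE_TERMS or j in seen:
                continue
            seen.add(j)  # avoid double counting when two verbs share a term
            vaccine_type = find_type(tokens[max(0, j - TYPE_WINDOW):j])
            nearby = tokens[max(0, j - WINDOW):j + WINDOW + 1]
            dates = [d for d in (parse_date(t, note_date) for t in nearby) if d]
            if dates:  # mentions without a date are dropped
                events.append((vaccine_type, dates[0]))
    return events
```

For example, `extract_vaccinations("Pt got flu shot yesterday.", date(2020, 3, 10))` returns `[("influenza", date(2020, 3, 9))]`, while the patient education sentence "do not receive a flu shot for at least a month" yields no events because "receive" is not in the past-tense verb list. Trying the longest token run first in `find_type` is one simple way to prefer the most complete match, so that "pneumococcal 13" is not collapsed into the generic "pneumococcal".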