Proceedings of the AMIA Annual Fall Symposium. 1997:580–584.

Assessing the feasibility of large-scale natural language processing in a corpus of ordinary medical records: a lexical analysis.

W R Hersh 1, E M Campbell 1, S E Malveau 1
PMCID: PMC2233467  PMID: 9357692

Abstract

OBJECTIVE: Identify the lexical content of a large corpus of ordinary medical records to assess the feasibility of large-scale natural language processing.

METHODS: A corpus of 560 megabytes of medical record text from an academic medical center was broken into individual words and compared with the words in six medical vocabularies, a common word list, and a database of patient names. Unrecognized words were assessed for algorithmic and contextual approaches to identifying more words, while the remainder were analyzed for spelling correctness.

RESULTS: About 60% of the words occurred in the medical vocabularies, common word list, or names database. Of the remainder, one-third were recognizable by other means. Of the remaining unrecognizable words, over three-fourths represented correctly spelled real words and the rest were misspellings.

CONCLUSIONS: Large-scale generalized natural language processing methods for the medical record will require expansion of existing vocabularies, spelling error correction, and other algorithmic approaches to map words into those from clinical vocabularies.
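The word-matching step described in METHODS can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the tokenization rule, the names of the word lists, or whether coverage was counted per distinct word or per occurrence, so the regular expression, the `lexical_coverage` function, and the toy word lists below are all assumptions made for illustration. The sketch reports both measures.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic words (an assumed, simplistic rule)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_coverage(text, word_lists):
    """Compare the words in `text` against the union of several word lists.

    `word_lists` maps a hypothetical list name to a set of lowercase words.
    Returns coverage by distinct word (type) and by occurrence (token),
    plus a Counter of the unrecognized words.
    """
    counts = Counter(tokenize(text))
    known = set().union(*word_lists.values())
    recognized = {w: n for w, n in counts.items() if w in known}
    unrecognized = Counter({w: n for w, n in counts.items() if w not in known})
    total_tokens = sum(counts.values())
    return {
        "type_coverage": len(recognized) / len(counts) if counts else 0.0,
        "token_coverage": sum(recognized.values()) / total_tokens if total_tokens else 0.0,
        "unrecognized": unrecognized,
    }


# Toy data, invented for this example only.
word_lists = {
    "medical": {"chest", "pain", "hypertension"},
    "common": {"patient", "denies", "has", "of", "and"},
    "names": {"smith"},
}
note = "Patient Smith denies chest pain. Pt has hx of hypertenshun and chest pain."
result = lexical_coverage(note, word_lists)
print(sorted(result["unrecognized"]))
```

In this toy note the unrecognized words are two abbreviations ("pt", "hx") and one misspelling ("hypertenshun"), which mirrors the kinds of residue the abstract says would need algorithmic, contextual, or spelling-correction handling.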

Articles from Proceedings of the AMIA Annual Fall Symposium are provided here courtesy of American Medical Informatics Association
