Abstract
De-identification of a patient's personal data in medical records is a legal safeguard required before medical documents can be used for research or transferred to other parties (e.g., teachers, students, tele-consultations). The de-identification process is tedious when performed manually and is known to be error-prone when it relies on direct search-and-replace strategies [9]. In this paper, we report on the identification step of this process. The proposed algorithm estimates how well candidate patient name references fit a set of semantic selectional restrictions. These restrictions place tight contextual requirements on candidate words in the report text and are determined automatically from a manually tagged corpus of training reports. Maximum entropy classifiers provide a probabilistic measure of the belief that a given candidate token satisfies a given semantic restriction. We report on the design and preliminary evaluation of the system within the domain of pediatric urology.
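As an illustration of the classification step described above, the following is a minimal sketch rather than the system's actual implementation. It assumes a binary maximum entropy model (equivalent to logistic regression) over a few hypothetical contextual features (previous token, next token, capitalization). The feature set, the toy training sentences, and the names `context_features`, `train`, and `predict` are assumptions introduced for illustration and do not come from the paper.

```python
import math
from collections import defaultdict

def context_features(tokens, i):
    """Hypothetical contextual features for the candidate token at position i."""
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return {
        "bias": 1.0,
        "prev=" + prev_tok: 1.0,
        "next=" + next_tok: 1.0,
        "capitalized": 1.0 if tokens[i][:1].isupper() else 0.0,
    }

def predict(weights, feats):
    """Probability that the candidate is a patient name reference."""
    z = sum(weights[f] * v for f, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5, l2=0.01):
    """Fit a binary maxent model by stochastic gradient ascent with L2 decay."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            p = predict(weights, feats)
            for f, v in feats.items():
                weights[f] += lr * ((label - p) * v - l2 * weights[f])
    return weights

# Toy manually tagged corpus (invented): (tokens, candidate index, is_patient_name)
tagged = [
    ("Patient John was seen today".split(), 1, 1),
    ("Mr. Smith presents with reflux".split(), 1, 1),
    ("Patient Mary returns for follow-up".split(), 1, 1),
    ("The Foley catheter was removed".split(), 1, 0),
    ("Renal ultrasound was normal".split(), 0, 0),
    ("The bladder was seen today".split(), 1, 0),
]
examples = [(context_features(t, i), y) for t, i, y in tagged]
weights = train(examples)

# Score unseen candidates by their context
p_name = predict(weights, context_features("Patient Alice was seen".split(), 1))
p_other = predict(weights, context_features("The kidney was normal".split(), 1))
print(f"P(name | 'Patient Alice ...') = {p_name:.2f}")
print(f"P(name | 'The kidney ...')    = {p_other:.2f}")
```

In this sketch the learned weights play the role of the contextual requirements: a candidate is scored by the words around it, so a name that never appeared in training can still receive a high probability when it occurs in a name-like context.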
Selected References
These references are in PubMed. This may not be the complete list of references from this article.
- Bui Alex A. T., Dionisio John David N., Morioka Craig A., Sinha Usha, Taira Ricky K., Kangarloo Hooshang. DataServer: an infrastructure to support evidence-based radiology. Acad Radiol. 2002 Jun;9(6):670–678. doi: 10.1016/s1076-6332(03)80312-4.
- Metz C. E. Basic principles of ROC analysis. Semin Nucl Med. 1978 Oct;8(4):283–298. doi: 10.1016/s0001-2998(78)80014-2.
- Quantin C., Bouzelat H., Allaert F. A., Benhamiche A. M., Faivre J., Dusserre L. Automatic record hash coding and linkage for epidemiological follow-up data confidentiality. Methods Inf Med. 1998 Sep;37(3):271–277.