Table 3.
Comparison of the characteristics of EHR data versus natural language data.
| Criteria | Natural language | EHR |
|---|---|---|
| Token granularity | The basic token is a word, a compact semantic unit that expresses some basic meaning. In many cases, however, a complete semantic unit (e.g., a named entity or a prepositional phrase) requires combining multiple tokens. | The basic token is a clinical code, which can by itself represent a complete semantic unit, e.g., a disease description, a drug, or a procedure. |
| Syntactic: Hierarchical structure | A paragraph (document) contains multiple sentences, and a sentence contains multiple words. | More complex: a patient’s record contains multiple visits, and each visit contains multiple codes of different categories. |
| Syntactic: Sequential order | Simple and clear. | Visits are ordered by time, but the codes within a visit may be unordered or may follow a priority-based order. |
| Semantic | Dependency relations among sentences (e.g., discourse relations) and among words within each sentence (e.g., syntactic dependency, semantic roles) are clear. | Dependency relations are not always clear, e.g., adjacent visits may have little relevance to each other when separated by large time intervals. |
| Time interval | Regular: an interval of one position between adjacent words. | Usually no explicit intervals between codes, and irregular intervals between adjacent visits. |
| Data completeness | Relatively complete for regular texts such as written language. | Usually incomplete and sometimes erroneous due to the nature of EHR. |
| Sequence length | Within a relatively narrow range: a sentence rarely reaches a hundred words. | More variable: a patient’s medical record can include anywhere from one to hundreds of visits, and a single visit can contain hundreds of medical codes. |
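
The hierarchy and timing properties in the table can be made concrete with a small data-structure sketch. This is only an illustration under stated assumptions, not a prescribed representation: the class names `Patient` and `Visit`, the category labels, the `[SEP]` visit separator, and the choice to sort codes alphabetically within a visit are all hypothetical, and the codes and dates are made-up example values.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class Visit:
    """One visit: a date plus codes grouped by category (hypothetical layout)."""
    when: date
    # Codes within a category form a set-like group; their order may carry no meaning.
    codes: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Patient:
    """A patient record: a collection of visits, each holding several codes."""
    patient_id: str
    visits: List[Visit] = field(default_factory=list)

    def sorted_visits(self) -> List[Visit]:
        # Visits are ordered by time, the one ordering the data clearly provides.
        return sorted(self.visits, key=lambda v: v.when)

    def visit_gaps_in_days(self) -> List[int]:
        # Intervals between adjacent visits are irregular, unlike word positions.
        ordered = self.sorted_visits()
        return [(b.when - a.when).days for a, b in zip(ordered, ordered[1:])]

    def flatten(self) -> List[str]:
        # Assumption: impose a canonical order (category name, then code) because
        # codes inside a visit have no inherent order; "[SEP]" marks a visit boundary.
        tokens: List[str] = []
        for visit in self.sorted_visits():
            for category in sorted(visit.codes):
                tokens.extend(sorted(visit.codes[category]))
            tokens.append("[SEP]")
        return tokens


# Illustrative record with made-up values; visits are deliberately out of order.
patient = Patient(
    patient_id="p001",
    visits=[
        Visit(date(2020, 3, 1), {"diagnosis": ["E11.9"], "medication": ["metformin"]}),
        Visit(date(2020, 1, 10), {"diagnosis": ["I10", "E11.9"]}),
        Visit(date(2021, 6, 15), {"procedure": ["ecg"]}),
    ],
)

gaps = patient.visit_gaps_in_days()
tokens = patient.flatten()
print(gaps)
print(tokens)
```

Flattening loses the gap information unless it is carried separately, which is why the sketch returns the inter-visit intervals alongside the token sequence rather than relying on token position alone.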