2021 May 20;4:86. doi: 10.1038/s41746-021-00455-y

Table 3.

Comparison of characteristics of EHR data versus natural language data.

Criteria	Natural language	EHR
Token granularity	The basic token is a word, which is a compressed semantic unit in language and can express some basic meaning. But in many cases, an integrated semantic unit (e.g., a named entity or a prepositional phrase) requires the combination of multiple tokens.	The basic token is a clinical code, which can represent an integrated semantic unit, e.g., a disease description, a drug, or a procedure.
Syntactic: Hierarchical structure	A paragraph (document) contains multiple sentences, and a sentence contains multiple words.	More complex: a patient’s record contains multiple visits, and each visit contains multiple codes of different categories.
Syntactic: Sequential order	Simple and clear.	Visits are ordered by time, but the codes within a visit may be unordered or may follow a priority order.
Semantic	Dependency relations among sentences (e.g., discourse relations) as well as words within each sentence (e.g., syntactic dependency, semantic roles) are clear.	Dependency relationships are not always clear, e.g., adjacent visits may be of little relevance owing to large time intervals.
Time interval	Regular: one position between adjacent words.	Usually no explicit intervals between codes within a visit, and irregular intervals between adjacent visits.
Data completeness	Relatively complete for regular texts such as written language.	Usually incomplete and sometimes erroneous due to the nature of EHR.
Sequence length	Within a relatively narrow range: a sentence rarely reaches a hundred words.	More variable: a patient’s medical record can include anywhere from one to hundreds of visits, and a single visit can contain hundreds of medical codes.
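
The hierarchy and timing properties in the EHR column can be made concrete with a small data-structure sketch. This is a minimal illustration, not a representation taken from the article: the class names (`Visit`, `Patient`), the helper functions, and the code strings such as `DX_A` are hypothetical placeholders. It shows a patient holding time-ordered visits, each visit holding an unordered set of codes, and the irregular gaps between visits that a sentence of words does not have.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Set, Tuple


@dataclass
class Visit:
    """One encounter: a date plus codes with no inherent order (hypothetical)."""
    visit_date: date
    codes: Set[str] = field(default_factory=set)


@dataclass
class Patient:
    """A patient record: one to many visits (hypothetical)."""
    patient_id: str
    visits: List[Visit] = field(default_factory=list)


def ordered_visits(patient: Patient) -> List[Visit]:
    """Return visits sorted by date, the only reliable order in the record."""
    return sorted(patient.visits, key=lambda v: v.visit_date)


def visit_gaps_in_days(patient: Patient) -> List[int]:
    """Days between adjacent visits; unlike word positions, these are irregular."""
    visits = ordered_visits(patient)
    return [(b.visit_date - a.visit_date).days for a, b in zip(visits, visits[1:])]


def flatten(patient: Patient) -> List[Tuple[int, str]]:
    """Flatten to (visit_index, code) pairs.

    Codes inside a visit are sorted alphabetically only to make the output
    deterministic; that order is an arbitrary choice, not a clinical one.
    """
    return [
        (i, code)
        for i, visit in enumerate(ordered_visits(patient))
        for code in sorted(visit.codes)
    ]


# Placeholder codes: DX_ = diagnosis, RX_ = medication, PX_ = procedure.
example = Patient(
    patient_id="p001",
    visits=[
        Visit(date(2021, 3, 1), {"DX_A"}),
        Visit(date(2021, 1, 1), {"DX_A", "RX_B"}),
        Visit(date(2021, 1, 8), {"PX_C", "RX_B"}),
    ],
)

print(visit_gaps_in_days(example))
print(flatten(example))
```

Keeping the visit index alongside each code preserves the two-level structure after flattening, and the gap list is one way to carry the irregular time intervals forward as an explicit feature.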