Table 4.
Domain | Challenges to Phenotyping Acute Patient Cohorts | Description | Potential Solutions |
---|---|---|---|
Data | Data Availability (Completeness) | Longitudinal records may not be available for all patients. | Anticipate data quality issues with available data types, including electronic health record (EHR) metadata. Consider sources of metadata indirectly related to EHR data types (e.g., geospatial and Census Block data or other “community vital signs”[26]) to interpolate systems processes not captured in the clinical record. |
Data | Data Management (Timeliness) | Discordant temporality of data streams (e.g., from operational to structured data). | Evaluate time from event to data pull; create automated systems to accommodate differences. Exploit missingness to interpolate time between variables.[27] |
Data | Data Validation (Correctness) | Patient histories may rely on data from a limited number of visits and visit types. | Evaluate how data are gathered and recorded in your healthcare system.[25] Identify essential population and database characteristics,[28] including the degree to which a given variable tends to over- or underestimate a feature or to change over time. Target novel data sources and note types (e.g., clinical communications) to validate narrative or structured elements.[9], [29] |
Authoring | Defined Cohorts | No reliable billing code available to identify cohorts. | Validate local testing practices (i.e., presence of laboratory testing). Assign probability of known disease;[30] evaluate data-driven selection of cases or controls, such as a maximum likelihood approach.[31] |
Authoring | Defined Logic | Data use requires knowledge of data cleaning processes. | Build a data dictionary documenting representation of data elements (e.g., Boolean, temporal) as well as cleaning methods. |
Elements of the table were adapted with permission from Rasmussen et al., 2019.[19]
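
To illustrate the data dictionary suggested in the Defined Logic row, the following is a minimal Python sketch. The `DataElement` class, its field names, the `build_dictionary` helper, and the example entries are all hypothetical and are not drawn from the source; a real dictionary would follow the conventions of the local database and cleaning pipeline.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List


# Hypothetical structure: the field names are illustrative assumptions,
# not definitions taken from the table or its cited sources.
@dataclass(frozen=True)
class DataElement:
    name: str            # variable name as used in the analytic dataset
    representation: str  # e.g., "boolean" or "temporal"
    source: str          # originating data stream (illustrative label)
    cleaning: str        # plain-language description of the cleaning applied


def build_dictionary(elements: List[DataElement]) -> Dict[str, dict]:
    """Index elements by name and reject duplicate definitions."""
    dictionary: Dict[str, dict] = {}
    for element in elements:
        if element.name in dictionary:
            raise ValueError(f"duplicate data element: {element.name}")
        dictionary[element.name] = asdict(element)
    return dictionary


# Example entries, invented for illustration only.
EXAMPLE_ELEMENTS = [
    DataElement(
        name="lab_test_ordered",
        representation="boolean",
        source="structured EHR",
        cleaning="absence of an order record coded as False",
    ),
    DataElement(
        name="admission_time",
        representation="temporal",
        source="operational feed",
        cleaning="timestamps normalized to a single time zone",
    ),
]

data_dictionary = build_dictionary(EXAMPLE_ELEMENTS)
```

Recording the cleaning method next to each element's representation keeps both pieces of information together, so a later user of the cohort definition does not need separate knowledge of the cleaning process.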