2022 Jul 11;13:883433. doi: 10.3389/fphar.2022.883433

TABLE 1.

Examples of human and programmatic evaluation of variable clusters. The table includes the relatedness category from manual review (Category), the working definition of the category used by reviewers (Definition), examples of the types of relationships in the category (General Examples), examples of ARIC variables that fit each relationship type (Study Variables), a cluster identifier (Cluster Number), and the relatedness score calculated by the NLP analysis (Score). The scoring process is described in further detail in the Methods section. Examples were selected to demonstrate the different types of variable relationships that exist among ARIC variables and the associated relatedness category. See Supplemental Table S1 for all clusters.

| Category | Definition | General examples | Study variables | Cluster Number | Score |
| --- | --- | --- | --- | --- | --- |
| Unrelated | Clusters where a human reviewer would not expect correlation between the variables in the cluster. | Clusters whose variables relate to a common topic, such as MRI exclusion criteria, but are disparate and would not be expected to correlate | “Do you have a cardiac pacemaker or a heart valve prosthesis?” and “Do you have metal fragments in your eyes, brain, or spinal cord?” | 269 | 8.5 |
| | | | “Enter code and specify brand and form below” and “What kind of fat do you usually use for baking?” | 213 | 7.9 |
| Related | Clusters where a human reviewer would expect the variables to be correlated, but not highly. | Clusters where the variables all relate to the same broad topic, such as history of cardiovascular disease | “Medications which secondarily affect cholesterol,” “Average mean arterial blood pressure,” and “Carotid Distensibility” | 1 | 10.5 |
| | | Clusters relating dietary intake of a nutrient and blood level of that nutrient | “In the past year, how often on average did you consume... Dark meat fish, such as salmon, mackerel, swordfish, sardines, bluefish” and “Omega fatty acid W20:5 and W22:6 [g]” | 383 | 11.6 |
| Highly Related | Clusters where a human reviewer would expect a high degree of correlation between the variables. | Clusters where one variable depends on the other | “Ever had emphysema” and “Age emphysema started” | 16 | 35.1 |
| | | Clusters where the variables all relate to the same narrow topic, such as consumption of alcoholic beverages or a history of wheezing | “How many drinks of hard liquor do you usually have per week?,” “How many days in a week do you usually drink beer?” and “Alcohol intake [g] per day” | 46 | 17.7 |
| | | | “[Wheezing]. Ever have to stop for breath when walking at our own pace on the level?” and “[Wheezing]. Ever stop for breath after walking about 100 yards (or after a few minutes) on the level?” | 248 | 40.5 |
| Exact | Clusters where a human reviewer would expect almost complete correlation between the variables. | Clusters with variables that are repeat measurements during the same exam | First, second, and third sitting blood pressure measurements at exam 2 | 58 | 44.2 |
| | | Clusters with variables that ask the same question, potentially in different ways | “I have a fiery temper,” “I am hotheaded,” and “I am quick tempered” | 86 | 32.2 |
| | | | “Have you ever been diagnosed by a doctor as having a polyp or noncancerous tumor of the colon or rectum?” and “Has a doctor ever told you that you had adenoma or polyp of the colon (large intestine)?” | 175 | 32.2 |
| | | Clusters with variables that are the same measurement at different time points | White blood cell count at exam 3 and white blood cell count at exam 4 | 226 | 47.2 |
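
To make it easier to compare the human-assigned categories with the programmatic scores, the sketch below averages the Score column by Category. It uses only the eleven example clusters listed in this table, not the full set in Supplemental Table S1. The names `ROWS` and `mean_score_by_category` are hypothetical and chosen for illustration. It summarizes the table and does not reproduce the scoring method itself.

```python
from collections import defaultdict
from statistics import mean

# (category, cluster number, score) copied from the example rows of Table 1.
# These are hand-picked examples, so the averages are illustrative only.
ROWS = [
    ("Unrelated", 269, 8.5),
    ("Unrelated", 213, 7.9),
    ("Related", 1, 10.5),
    ("Related", 383, 11.6),
    ("Highly Related", 16, 35.1),
    ("Highly Related", 46, 17.7),
    ("Highly Related", 248, 40.5),
    ("Exact", 58, 44.2),
    ("Exact", 86, 32.2),
    ("Exact", 175, 32.2),
    ("Exact", 226, 47.2),
]


def mean_score_by_category(rows):
    """Group scores by relatedness category and return the mean of each group."""
    grouped = defaultdict(list)
    for category, _cluster, score in rows:
        grouped[category].append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


summary = mean_score_by_category(ROWS)
for category, value in summary.items():
    print(f"{category}: {value:.2f}")
```

Within these examples, the mean score rises from Unrelated to Exact, but individual clusters overlap across categories: cluster 248 (Highly Related, 40.5) scores higher than clusters 86 and 175 (Exact, 32.2 each).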