Abstract
The Pointwise Mutual Information statistic (PMI), which measures how much more often two words occur together in a document corpus than would be expected by chance, is a cornerstone of recently proposed popular natural language processing algorithms such as word2vec. PMI and word2vec reveal semantic relationships between words and can be helpful in a range of applications such as document indexing, topic analysis, or document categorization. We use probability theory to demonstrate the relationship between PMI and word2vec. We use the theoretical results to demonstrate how the PMI can be modeled and estimated in a simple and straightforward manner. We further describe how one can obtain standard error estimates that account for within-patient clustering that arises from patterns of repeated words within a patient’s health record due to a unique health history. We then demonstrate the usefulness of PMI on the problem of predictive identification of disease from free text notes of electronic health records. Specifically, we use our methods to distinguish those with and without type 2 diabetes mellitus in electronic health record free text data using over 400,000 clinical notes from an academic medical center.
Keywords: cluster-corrected standard errors, electronic health records, natural language processing, probability models, word2vec
1. Introduction
Natural Language Processing (NLP) has many potential uses in health research and clinical settings, particularly to process and analyze free text data in electronic health records. Natural language processing techniques have long been developed in the fields of biostatistics and biometrics (Williams, 1975, Yule, 1939), but during the last few decades most of the research activity in the field has come from the computer science, linguistics, and medical informatics communities (for example, see Nadkarni et al., 2011). In recent years, much of the advance in NLP has been fueled by the emerging field of deep learning (LeCun et al., 2015). Among the NLP algorithms based on deep learning ideas, probably the most influential is the word2vec algorithm (Mikolov et al., 2013) (the paper has been cited over 15,000 times as of May 2020 according to Google Scholar). As Mikolov et al. (2013) describe, word2vec can analyze text to reveal country-city relationships (e.g. “Paris” is in “France”, and “Tokyo” is in “Japan”), or to extract information about individuals (e.g. Steve “Ballmer” is affiliated with “Microsoft”). Heuristically, word2vec can be used to group similar words in a book or electronic health record together. For example, in a health record “diabetes” and “insulin” might be mentioned close to each other in a document, and hence might be considered related to each other. Conversely, the method might discriminate among other items that appear in dissimilar contexts from “diabetes” and “insulin”, such as “cancer” or “prostatectomy”.
The Pointwise Mutual Information statistic (PMI) is at the heart of word2vec, and is used to represent closeness of words in documents, as will be discussed in the next Section. It has been demonstrated that the core algorithm in word2vec, called skip-gram with negative sampling, is directly related to the PMI (Levy and Goldberg, 2014a). The PMI and its relationship with word2vec can open some interesting research questions for the statistics community. However, the notation and the proofs in Levy and Goldberg (2014a) and popular unpublished tutorials such as Meyer (2016) might make the theory behind the PMI and word2vec difficult for statisticians to follow.
Here, we use probabilistic arguments to clarify the relationship between word2vec’s skip-gram with negative sampling algorithm and the PMI. After presenting the probability theory underlying the word2vec method, we discuss how one can obtain standard errors for the PMI. Finally, we demonstrate how the PMI can be used to distinguish patients with and without type 2 diabetes mellitus based on the free text notes from their electronic health records.
2. Pointwise Mutual Information
In this paper, we are interested in identifying which words are related to each other in an electronic health record (EHR). Later in this paper we demonstrate how we can use word relationships to discriminate between patients with and without type 2 diabetes mellitus. To measure how strongly two words are related, we calculate how often pairs of words occur in each other’s neighborhood, where neighborhood is defined as follows. Let the subscripts i and j separately index the location of all words in the EHR. Let Wi and W′j be multinomial random variables representing the words observed at locations i ∈ {1…J} and j ∈ {1…J}, where J is the number of words in the EHR. We use the prime symbol (′) to distinguish between the center word (Wi) and its potential neighborhood words (W′j). Let Zij be a random variable that takes the value 1 if a pre-specified rule for determining whether Wi and W′j are sufficiently close is met, and 0 otherwise. A simple rule might be that Zij = 1 if the two words are next to each other. As an example, consider the phrase “the patient has history of diabetes.” Let W2 map to the word “patient.” Here, W′3 maps to “has” and W′1 maps to “the”, thus satisfying the conditions that Z23 = 1 and Z21 = 1. For the other pairings, Z24 = Z25 = Z26 = 0.
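The adjacency rule above can be sketched in a few lines of code (Python here as a stand-in for the R used later in the paper; the function name `z_adjacent` is ours, not the authors'):

```python
# Sketch of the adjacency rule Z_ij: Z_ij = 1 when positions i and j are
# next to each other, 0 otherwise (1-indexed positions, as in the text).
def z_adjacent(i, j):
    return 1 if abs(i - j) == 1 else 0

words = "the patient has history of diabetes".split()
# Center word W_2 = "patient".
center = 2
z = {j: z_adjacent(center, j) for j in range(1, len(words) + 1) if j != center}
print(z)  # {1: 1, 3: 1, 4: 0, 5: 0, 6: 0}
```

This reproduces Z21 = Z23 = 1 and Z24 = Z25 = Z26 = 0 from the example phrase.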
In the ideal world of unlimited computational power, one would extract all pairs of words, thereby detaching the word pairs from their order in a document. Doing so would require another variable to indicate whether each word pairing satisfied the rule for being close enough. Here, we map all pairwise i and j to the integers r ∈ {1…R}. In this case, R = (J − 1) * J. For each r, we let Cr = Wi and C′r = W′j for the mapped pairings of i and j. Let c and c′ represent their observed values. We further let Dr = Zij, again after mapping i and j to r. We process the corpus of the EHR into a matrix for analysis with row r consisting of (Cr, C′r, Dr), as these variables are sufficient for identifying our statistics of interest. At this point, we can omit the r subscripts for ease of presentation.
In order to characterize the closeness of words, we use the PMI statistic (Church and Hanks, 1990). The PMI examines the log of the ratio of the joint probability, conditional on meeting the rule D = 1, that the center word C = c is observed with a neighborhood word C′ = c′, divided by the probability of observing the words under independence. Using conditional probability notation, the PMI is defined as:
PMI(c, c′) = log[ P(C = c, C′ = c′ | D = 1) / {P(C = c | D = 1) P(C′ = c′ | D = 1)} ]  (1)

PMI(c, c′) = log[ {P(C = c | C′ = c′, D = 1) P(C′ = c′ | D = 1)} / {P(C = c | D = 1) P(C′ = c′ | D = 1)} ]  (2)

PMI(c, c′) = log[ P(C = c | C′ = c′, D = 1) / P(C = c | D = 1) ]  (3)
In this work, we will focus on the formulation in equation (3). Note, if we take the square root of the denominator in equation (1), we have the log of the expected value of the cosine similarity statistic for random vectors of indicator (i.e. binary 0/1) variables (Huang, 2008). We prefer the conditional probability representation in this paper, as it provides insight into how statistical models may be used to estimate the PMI, as we detail in Section 4 below.
The PMI statistic can take any real value and is unbounded. Under independence, P(C = c|C′ = c′, D = 1) would equal P(C = c|D = 1), and the PMI would equal log(1) = 0. If two words are commonly observed together, then P(C = c|C′ = c′, D = 1) > P(C = c|D = 1), the ratio will exceed one, and the PMI will be greater than 0. Values of the PMI at or below zero indicate that two words are found close to each other no more often than expected under independence. Heuristically, the scale is akin to a log relative risk, such that PMIs larger than log(2) might be considered notable, although this is not a hard rule.
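A toy numeric illustration of these cases (the probabilities below are invented for illustration only):

```python
import math

# Invented conditional probabilities for one word pair.
p_given_neighbor = 0.20  # P(C = c | C' = c', D = 1)
p_marginal = 0.10        # P(C = c | D = 1)

# The pair co-occurs twice as often as expected under independence.
pmi = math.log(p_given_neighbor / p_marginal)
print(round(pmi, 4))  # 0.6931, i.e. log(2), right at the heuristic threshold

# Under independence the two probabilities coincide and the PMI is 0.
assert math.isclose(math.log(p_marginal / p_marginal), 0.0)
```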
In the above example, we used a simple rule that two words had to be next to each other for D = 1. However, more general rules could be developed to define “close,” such as two words being found in the same paragraph.
3. Relationship of Skip-Gram with Negative Sampling to the PMI
The PMI is the target of estimation for many natural language processing algorithms. Here we describe how the PMI underlies the popular word2vec skip-gram with negative sampling algorithm. While there are many word2vec algorithms, the skip-gram with negative sampling algorithm has been one of the most highly cited (e.g. Levy and Goldberg, 2014b). In practice, documents have a large number of words. In a book with 10,000 words, for example, there could be up to 49,995,000 (= 10,000 choose 2) pairings of words to evaluate. In such a context, extracting all possible pairs would be computationally taxing. To reduce the burden, the skip-gram with negative sampling technique was developed, in which all of the pairings where D = 1 are retained, and the components C and C′ are then resampled under independence assumptions to generate a pseudo-sample for D = 0 (Levy and Goldberg, 2014b, proposed 15 resamples).
In the next few paragraphs, we give a notational example of the negative sampling algorithm, and then an actual example. We feel it is important to give both the notational example and the word example as both formats are sometimes used in the computer science literature. We use the term “words of interest” to denote words that are sometimes referred to as “input” words in the computer science field (e.g. Meyer, 2016). We use the term “related words” to denote words found in a neighborhood of the input word of interest; these related words are sometimes referred to as “contexts” (e.g. Levy and Goldberg, 2014b).
In terms of notation, let upper case letters represent words as random variables (i.e. C and C′), and lower case letters represent their observed values. For ease of presentation, assume that we are interested in evaluating a short clinical note that states “history of diabetes.” As input words, let c1 represent “history”, c2 represent “of”, and c3 represent “diabetes”. The set of possible context words includes the same words, represented by c′1, c′2, and c′3. Let the definition of a close neighborhood be two words appearing next to each other. The neighborhood pairs satisfying D = 1 form the set D1. We then generate a negative sample to represent the null distribution. To do this, for each observed word of interest (ci), we randomly sample K values of c′ from the observed neighborhood pairs. We illustrate this for K = 2, which is generally a low value of K but sufficient for didactic purposes. We used the sample() function in R to generate the negative samples.
We refer to the word pairs in D1 as matched pairs within neighborhoods, and see that c1 was in the neighborhood of c′2. Further, c2 was in the neighborhood of c′1 and c′3, while c3 was close to c′2. To form the set D0, in which D = 0, we used a discrete uniform probability distribution to obtain eight random samples of the four c′ values in the D1 set. That is, for each ci, the probability of choosing a value for the random match was 1/4 for c′1, 1/2 for c′2, and 1/4 for c′3. Probability distributions other than the uniform distribution have been proposed for sampling (e.g. Mikolov et al., 2013).
Expanding our phrase to “history of diabetes with kidney disease,” we present an alternative version of the process using the actual words rather than the random variable notation. Words are more typically used in the computer science literature. With the definition of close being words next to each other, we might obtain the following observed (i.e. D1) and negative sampling (i.e. D0) results:
From this we see that the words at the beginning and end of a sentence may be undersampled in generating the pseudo-sample D0. However, as the number of words in a document increases, the bias becomes smaller.
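The D1/D0 construction just described can be sketched as follows (a Python analogue of the R sample() step; the function names and the K = 2 choice are ours, and the D0 draws are random, so this is illustrative rather than a reproduction of the paper's matrices):

```python
import random

def skipgram_pairs(tokens, window=1):
    """All (center, context) pairs satisfying D = 1 under the adjacency rule."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def negative_sample(pairs, k=2, seed=0):
    """For each D = 1 pair, draw k pseudo-contexts (D = 0) uniformly from
    the multiset of observed contexts, mimicking the resampling step."""
    rng = random.Random(seed)
    contexts = [ctx for _, ctx in pairs]
    return [(c, rng.choice(contexts)) for c, _ in pairs for _ in range(k)]

tokens = "history of diabetes with kidney disease".split()
d1 = skipgram_pairs(tokens, window=1)
d0 = negative_sample(d1, k=2)
print(len(d1), len(d0))  # 10 20
```

Note that, as in the text, end-of-sentence words contribute fewer D1 pairs, so they are undersampled as contexts in D0.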
We can represent the process described above using standard conditional probability notation. Here, we present nonparametric identification of the PMI using the skip-gram with negative sampling technique, as described above. This formalizes with conditional probability notation what others have described (e.g. Levy and Goldberg, 2014b).
Let C and C′ denote the multinomial random variables representing the word of interest (C) and related word (C′). Similar to above, c and c′ represent observed values of C and C′. Let h() represent a function of C and C′. It could be either parametric or nonparametric. We demonstrate here how h() is used by the word2vec method to model the PMI statistic. We do not dwell on the functional form of h() used by others, as we later use the following results to motivate a framework for obtaining inferences from the PMI statistic. Again, we note that,

PMI(c, c′) = log[ P(C = c | C′ = c′, D = 1) / P(C = c | D = 1) ].
Next, we reframe the proof of Levy and Goldberg (2014b) and clarify the assumptions used by skip-gram with negative sampling to identify the PMI. According to their proof, word2vec uses the following functional form to express the conditional probability of D:
P(D = 1 | C = c, C′ = c′) = exp{h(C = c, C′ = c′)} / [1 + exp{h(C = c, C′ = c′)}]  (4)
Equation (4) is the form of the logistic regression model that serves as the basis of this skip-gram with negative sampling technique. Solving for h(C = c, C′ = c′),

h(C = c, C′ = c′) = log[ P(D = 1 | C = c, C′ = c′) / P(D = 0 | C = c, C′ = c′) ].
Using Bayes theorem,

h(C = c, C′ = c′) = log[ {P(C = c, C′ = c′ | D = 1) P(D = 1)} / {P(C = c, C′ = c′ | D = 0) P(D = 0)} ].
Now, simply by construction of our negative sampling distribution (the distribution in which D = 0), we have that:
P(C = c, C′ = c′ | D = 0) = P(C = c | D = 1) P(C′ = c′ | D = 1)  (5)
Equation (5) is a key assumption for identification of the PMI using the word2vec skip-gram with negative sampling algorithm. Note that P(C = c, C′ = c′|D = 0) is equal to a product of two probabilities conditional on D = 1. This suggests that negative sampling is not necessary, since inferences rely only on the D = 1 group. Then:

h(C = c, C′ = c′) = log[ P(C = c, C′ = c′ | D = 1) / {P(C = c | D = 1) P(C′ = c′ | D = 1)} ] + log{P(D = 1)/P(D = 0)} = PMI(c, c′) + log{P(D = 1)/P(D = 0)}.
By taking K negative samples for each pair in the D = 1 set, P(D = 0)/P(D = 1) = K. Thus,

h(C = c, C′ = c′) = PMI(c, c′) − log(K).
PMI statistics can be placed in a symmetric matrix that characterizes the similarity of words with each other. It is possible to use Pearson’s correlation or the closely related cosine similarity statistic (Egghe and Leydesdorff, 2009) for any two rows to see how closely related two words (that index the rows) are to each other based on the PMI. The diagonal values in the matrix are informative as measures of how likely the same word is repeated twice under the same definition of closeness, D = 1. It is also possible to use linear combinations of model parameters of h() to estimate the cosine similarity between C and C′, partly justified by the relationship of PMI to cosine similarity demonstrated in Equation (1).
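Comparing two rows of such a PMI matrix might look like the following sketch (the 3-word matrix and its values are made up for illustration):

```python
import math

# Made-up symmetric PMI matrix for three words (values are illustrative).
vocab = ["diabetes", "insulin", "cancer"]
pmi = [
    [1.2, 0.9, -0.4],   # diabetes
    [0.9, 1.0, -0.6],   # insulin
    [-0.4, -0.6, 1.5],  # cancer
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# The "diabetes" and "insulin" rows point in a similar direction...
sim_di = cosine(pmi[0], pmi[1])
# ...while the "diabetes" and "cancer" rows do not.
sim_dc = cosine(pmi[0], pmi[2])
print(sim_di > sim_dc)  # True
```

Pearson's correlation between rows could be substituted for cosine() with the same intent.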
4. Estimation and Inference for the PMI Statistic
In this section we describe a novel method of estimating the PMI statistic and obtaining standard errors. The standard errors would allow for hypothesis testing; based on our literature search, there seems to be a paucity of papers describing estimation of standard errors.
Our examination of the probability theory underpinning the word2vec algorithm in the previous Section provided insight into how we could model the PMI statistic. We note that the numerator and denominator of the PMI are both conditional on D = 1, so creation of the negative sample is not needed. We can simply use a cross tabulation of the D = 1 pairs that we find in a document to estimate P(C = c, C′ = c′|D = 1), P(C = c|D = 1), and P(C′ = c′|D = 1). For ease of notation, let “C ← word” represent the value that the random variable C takes from the mapping of the word to the relevant integer value. For example, if we simply had two words in our electronic health record sample space, such as “diabetes” and “mellitus,” our cross tabulation for the set in which D = 1 would look like this:
| | C′ ← Diabetes | C′ ← Mellitus |
|---|---|---|
| C ← Diabetes | n1 | n2 |
| C ← Mellitus | n3 | n4 |
The values n1 and n4 represent the number of times that diabetes is repeated twice or mellitus is repeated twice in a document. Let P̂ and PMÎ indicate an estimated probability and an estimated PMI, respectively. In this case, with n = n1 + n2 + n3 + n4, the estimate for the pair (C ← diabetes, C′ ← mellitus) is

PMÎ(diabetes, mellitus) = log[ (n2/n) / {((n1 + n2)/n) ((n2 + n4)/n)} ].
We give an example of this 2x2 table approach for a short document in our Supporting Information and text-data example; we also demonstrate it using a larger contingency table.
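The plug-in calculation from the 2×2 table can be sketched as follows (the cell counts are hypothetical, chosen only to exercise the formula):

```python
import math

# Hypothetical cell counts from the D = 1 cross-tabulation:
# rows index C (diabetes, mellitus), columns index C' (diabetes, mellitus).
n1, n2, n3, n4 = 5, 80, 70, 10
n = n1 + n2 + n3 + n4

# Plug-in probabilities for the pair (C <- diabetes, C' <- mellitus).
p_joint = n2 / n             # P-hat(C = diabetes, C' = mellitus | D = 1)
p_c = (n1 + n2) / n          # P-hat(C = diabetes | D = 1)
p_cprime = (n2 + n4) / n     # P-hat(C' = mellitus | D = 1)

pmi_hat = math.log(p_joint / (p_c * p_cprime))
print(round(pmi_hat, 3))
```

A positive value here indicates the two stems co-occur more often than independence would predict.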
For simple rules represented by D = 1, we have found that estimating PMI values using counts is faster than using logistic regression, and can be implemented via table-creation commands in software such as R. However, the counts alone do not provide standard errors, which should account for correlation (i.e. clustering) of word patterns within individual patients’ electronic health records due to patients’ unique health histories. To account for the correlation, we can estimate the probabilities using logistic regressions with cluster-corrected standard errors. Logistic regression may also be unavoidable for more complex definitions of what constitutes closely neighboring words in a document, again represented by the value D takes.
Next, we describe estimation of P(C = c|C′ = c′, D = 1) (the numerator of the PMI) for a specific pair of words using logistic regression. Let I(·) be the indicator function that takes the value 1 when the expression within the parentheses is true, and 0 otherwise. Let X be the design matrix for a logistic model with an intercept and a single covariate, with row r given by Xr = (1, I(C′r = c′)) for r ∈ {1…R}, where R is the number of matched word pairs in the D = 1 set. Let β = [β0 β1]⊺ parameterize the intercept and slope of the model.
For P(C = c|D = 1) (the denominator of the PMI), we specify an intercept-only model parameterized by α. In this case, the design matrix contains only the intercept, with Xr = (1).
Let Ψ̂ denote the estimates of Ψ = (β⊺ α)⊺. The probability form of the logit model, the log likelihood log L(·), and the first and second derivatives for each logistic model, under an independence assumption, are:

P(Cr = c | C′r = c′, D = 1) = exp(Xrβ) / {1 + exp(Xrβ)}  (6)

log L(β) = Σr [ I(Cr = c) Xrβ − log{1 + exp(Xrβ)} ]  (7)

∂ log L(β)/∂β = Σr Xr⊺ [ I(Cr = c) − exp(Xrβ)/{1 + exp(Xrβ)} ]  (8)

∂² log L(β)/∂β∂β⊺ = −Σr Xr⊺Xr exp(Xrβ)/{1 + exp(Xrβ)}²  (9)

P(Cr = c | D = 1) = exp(α) / {1 + exp(α)}  (10)

log L(α) = Σr [ I(Cr = c) α − log{1 + exp(α)} ]  (11)

∂ log L(α)/∂α = Σr [ I(Cr = c) − exp(α)/{1 + exp(α)} ]  (12)

∂² log L(α)/∂α² = −Σr exp(α)/{1 + exp(α)}²  (13)
Note that the above likelihoods assume that the probabilities are independent across the entire sample; one might reasonably assume instead that there is clustering of word pairs within patients due to unique health histories. Assume now that we have electronic health records from k = 1…K unique patients, each with r = 1…Rk word pairs. Let E(·) represent the expectation with respect to the data. To account for correlated word data within an individual, we can use the robust standard errors described by Huber (1967) with the cluster correction described by Williams (2000). In this case, we further subscript the log likelihood by individual and word pair such that

log L(Ψ) = Σk Σr log Lkr(Ψ).
Under the conditions of Huber (1967) and Williams (2000), our parameter estimates converge in distribution as follows:

√K (Ψ̂ − Ψ) →d N(0, ΣΨ),

where

ΣΨ = A⁻¹ B A⁻¹, with A = E{−∂² log Lk(Ψ)/∂Ψ∂Ψ⊺}, B = E{(∂ log Lk(Ψ)/∂Ψ)(∂ log Lk(Ψ)/∂Ψ)⊺}, and log Lk(Ψ) = Σr log Lkr(Ψ) the patient-level log likelihood contribution.
Let Σ̂Ψ denote the estimate of ΣΨ evaluated at Ψ̂. One can estimate the β and α components of Ψ using logistic regressions. Then β̂ and α̂ can be plugged into the estimating equations (7 – 9) and (11 – 13), and matrix multiplication can be used to solve for Σ̂Ψ. We can use β̂ and α̂ to similarly estimate the probabilities in equations (6) and (10).
By the multivariate delta method, the asymptotic variance of equation (6)’s log P̂(C = c|C′ = c′, D = 1) can be estimated with the linear combination vector L1 = (1 1 0)⊺ by:

Var̂{log P̂(C = c|C′ = c′, D = 1)} = {1 − P̂(C = c|C′ = c′, D = 1)}² L1⊺ Σ̂Ψ L1.
Similarly, the asymptotic variance of equation (10)’s log P̂(C = c|D = 1), which comes from the intercept-only model, can be estimated with the linear combination vector L2 = (0 0 1)⊺ by:

Var̂{log P̂(C = c|D = 1)} = {1 − P̂(C = c|D = 1)}² L2⊺ Σ̂Ψ L2.
Again using the delta method, the estimated PMI and its estimated variance become

PMÎ(c, c′) = log P̂(C = c|C′ = c′, D = 1) − log P̂(C = c|D = 1), Var̂{PMÎ(c, c′)} = L⊺ Σ̂Ψ L,

where

L = {1 − P̂(C = c|C′ = c′, D = 1)} L1 − {1 − P̂(C = c|D = 1)} L2.
Note that above we specified a single set of parameters Ψ for our specific choice of c and c′. To generalize to all pairs of c and c′, we would allow for separate Ψcc′ parameters for each pairwise comparison. If there are N words, then we would have N(N + 1)/2 comparisons, which include examination of repeated words. We provide a program that performs the calculations with a text file as Supporting Information.
5. Prediction of type 2 diabetes in electronic health records
We examined the ability of PMI statistics to identify type 2 diabetes in free text of electronic health records. To do this, we downloaded electronic inpatient progress notes and outpatient notes for 1,000 patients from an academic cancer center, and stored the records on a secure server. We chose these notes as they contained many free text unstructured fields. We identified 500 patients with and 500 without structured ICD-10 diagnosis codes (E11 sequence of codes) in the electronic health record indicating type 2 diabetes.
The patient sample came from those who had electronic health record ICD-10 billing codes generated from September, 2015, to September, 2017, although many had received recurring care for some time before then. To reduce Health Insurance Portability and Accountability Act (HIPAA) concerns about sharing electronic health record data, preference was given to releasing the data of patients who died. Hence, a random sample was taken of patients without diabetes who had died. Since not all of those with Type 2 diabetes had died, all 384 patients with Type 2 diabetes who had died were sampled, and 116 patients with Type 2 diabetes who had not died were randomly sampled. The sample was chosen by a database query (ST and ML). The work was approved under Fox Chase Cancer Center IRB Number 17-9027.
We preprocessed our data using the tm (Feinerer et al., 2008) and tokenizers libraries in R (R Foundation for Statistical Computing, Vienna, Austria), which converted letters to lower case, removed all numbers, removed punctuation, and stripped white space. We further removed generally non-informative common English words as defined by R’s stopwords function (Benoit et al., 2017), with the exception of “no”, “not”, and “nor”, which could indicate informative negative histories. We stemmed the remaining words (i.e. truncated words to their root form). In total, there were 61,489 unique stems in the clinical notes, of which 16,140 were used a single time. To reduce the dimensionality of the problem, we retained the 1,500 most common stems in the 1,000 patients’ electronic health records. These 1,500 stems had a usage frequency ranging from 3,100 to 627,270 in the notes.
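An equivalent preprocessing pipeline can be sketched in Python (the paper used R's tm and tokenizers; nltk's Snowball stemmer and the short stopword set below are stand-ins):

```python
import re
from nltk.stem.snowball import SnowballStemmer

# Illustrative stopword set; "no", "not", and "nor" are deliberately kept
# out of it because they can signal informative negative histories.
STOPWORDS = {"the", "of", "a", "an", "and", "has", "is"}
stemmer = SnowballStemmer("english")

def preprocess(note):
    note = note.lower()                      # lower-case
    note = re.sub(r"[0-9]", "", note)        # remove numbers
    note = re.sub(r"[^a-z\s]", " ", note)    # remove punctuation
    tokens = note.split()                    # also strips extra whitespace
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The patient has a history of Diabetes Mellitus, not Type 1."))
```

Snowball stemming maps “diabetes” to the stem “diabet” used throughout the analysis.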
Although the sample size of 1,000 might seem relatively modest, the retained number of notes was substantial, including 415,117 notes (including notes from visits prior to 2015), ranging in size from 0 to 1,381 word stems after preprocessing (mean=102 word stems, standard deviation=62, median=127). For ease of presentation, we imply word stems when we use the term word in the following description of our methods.
We then divided the sample into a training set of 600 patients, and a test (i.e. validation) set of 400 patients. We used a window of two words before and two words after each word in the reduced notes to create matched pairs of words for our D = 1 sample (i.e. the closest four words to each given word); this generated over 99 million matched pairs of word stems. Sensitivity analyses found that larger windows did not substantially improve predictive ability, and a window of size one gave similar predictive ability. Words at the beginning or end of the note had fewer than two word matches before or after, respectively. In order to better discriminate between those with and without type 2 diabetes (i.e. a supervised approach), we calculated the PMI statistics separately for those with and without type 2 diabetes indications in the training set. We then calculated four PMI index statistics to use for prediction of type 2 diabetes. The first two (i.e. estimated separately by type 2 diabetes indication status) used the within-note average of PMI values describing joint associations of the stem “diabet” (for diabetes) with every word in a single note, and then averaged these within patients. The next two used the within-note average PMI for all words (not just those co-occurring with the stem “diabet”), and then averaged within patients. We calculated the two indices estimated within each cohort for the whole sample. That is, we used the estimates from the group without type 2 diabetes to calculate a non-type 2 diabetes score among those with type 2 diabetes indications, and vice versa.
In estimation, we must index the words within a note, and the notes within an individual. Adding to the definitions of the subscripts from above, we again index person by k ∈ {1…K}, note within person by v ∈ {1…Vk}, word within note by w ∈ {1…Wvk}, and all words within our vocabulary by c and c′ ∈ {1…N}. In the current example, N = 1,500 by design. For PMI values, let PMId(c, c′) represent the PMI for words indexed by c and c′ as estimated using the type 2 diabetes group (but calculated for the whole sample), and let PMIxd(c, c′) be the same for the non-diabetes group. Let PMId(diabet, c′) and PMIxd(diabet, c′) be the same, but only using the PMI values for the word combinations in which C references “diabet”. Finally, we construct measures based on the exponential form of the PMI (i.e. the ratios before taking the log) to give greater weight to ratios greater than one than to ratios less than one. This also allows us to retain ratios with zero numerators (ratio = 0), which are not defined on the log scale. Others have similarly focused on the ratio rather than the log statistic (e.g. Budiu et al., 2007). Our four summary measures, denoted with the prefix mk (for mean of person k), are hence,
mkPMIxd,diab = (1/Vk) Σv (1/Wvk) Σw exp{PMIxd(diabet, ckvw)},  (14)

with mkPMId,diab defined analogously using PMId(diabet, ckvw), and with mkPMId and mkPMIxd defined by the same double average applied to exp{PMId(c, c′)} and exp{PMIxd(c, c′)} over all matched word pairs within each note.
Once we had the PMI-derived score statistics, we used two separate logistic regressions of the diabetes status indicator: 1) with the logs of the two “diabet”-anchored scores entered, and 2) with log(mkPMId) and log(mkPMIxd) entered. We also compared the predictive ability of summary statistics based on the PMI without exponentiating (e.g. substituting the PMI for its exponentiated form in Equation 14 above).
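The scoring-and-classification step might be sketched as follows (the per-patient scores are simulated stand-ins for the Equation (14) summary measures, and sklearn is our choice of tooling, not the authors’):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Simulated per-patient summary scores standing in for the two
# "diabet"-anchored exp(PMI) averages; separations are invented.
n = 400
y = rng.integers(0, 2, n)                      # type 2 diabetes indicator
score_d = rng.normal(0.8 + 0.15 * y, 0.2, n)   # PMIs from the T2D group
score_xd = rng.normal(0.9 + 0.9 * y, 0.5, n)   # PMIs from the non-T2D group

# Logistic regression on the log scores (clipped to stay positive).
X = np.column_stack([np.log(np.clip(score_d, 1e-6, None)),
                     np.log(np.clip(score_xd, 1e-6, None))])
model = LogisticRegression().fit(X, y)
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
print(round(auc, 2))
```

This mirrors the paper's workflow of entering log summary scores as covariates and judging them by AUC (here in-sample, for brevity).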
We compared the results with patient demographics extracted from the notes including patients’ mean Body Mass Index (BMI) readings (mean of 28.3 of patient-level averages over several visits, standard deviation [SD]=6.4), mean systolic blood pressure readings (130.0, SD=12.1), mean diastolic readings (73.6, SD=7.1), birth year (mean 1945, SD=11.1), and sex (52% male, 48% female).
5.1. Prediction results
As we will elaborate on shortly, we found that the “diabet”-anchored score estimated in the non-type 2 diabetes group had the best predictive value based on the area under the curve (AUC) (see Table 1). Figure 1 provides some explanation for this finding. In the figure, we present the distribution of this score in those with and without diabetes indications. Those without a type 2 diabetes indication tend to have words that are less likely to be found with the stem “diabet”. Conversely, those with a type 2 diabetes indication have notes that contain words that are more likely to be located close to the “diabet” stem. The shape of the figures might help explain why PMIs estimated from the non-type 2 diabetes group are better at predicting who has type 2 diabetes. The log within-person average statistics have a tighter distribution around negative values for those with no type 2 diabetes indication. This suggests that the PMIs estimated in the non-type 2 diabetes group might be better at predicting who does not have type 2 diabetes (i.e. specificity) than at predicting who does have type 2 diabetes (i.e. sensitivity).
Table 1:
Summary statistic results by diabetes indicator status. The AUC refers to the area under the curve from logistic models with the log summary statistic values added as covariates. SD = Standard Deviation
| | | No Type 2 Diabetes ICD-10 Code | | Type 2 Diabetes ICD-10 Code | | |
|---|---|---|---|---|---|---|
| Set | Statistic | Mean | SD | Mean | SD | AUC |
| Training | “diabet” score, T2D-group PMIs | 0.72 | 0.13 | 0.88 | 0.21 | 0.78 |
| Test | “diabet” score, T2D-group PMIs | 0.72 | 0.16 | 0.86 | 0.20 | 0.73 |
| Training | “diabet” score, non-T2D-group PMIs | 0.92 | 0.49 | 1.87 | 0.84 | 0.89 |
| Test | “diabet” score, non-T2D-group PMIs | 0.88 | 0.40 | 1.80 | 0.83 | 0.88 |
| Training | mkPMId | 1.07 | 0.03 | 1.06 | 0.03 | 0.54 |
| Test | mkPMId | 1.07 | 0.03 | 1.06 | 0.03 | 0.55 |
| Training | mkPMIxd | 1.06 | 0.03 | 1.07 | 0.03 | 0.59 |
| Test | mkPMIxd | 1.07 | 0.04 | 1.07 | 0.04 | 0.57 |
| Multivariable model results | | | | | | |
| Training | Both “diabet” scores | | | | | 0.91 |
| Test | Both “diabet” scores | | | | | 0.92 |
| Training | mkPMId + mkPMIxd model | | | | | 0.83 |
| Test | mkPMId + mkPMIxd model | | | | | 0.81 |
| Training | | | | | | 0.92 |
| Test | | | | | | 0.93 |
| Training | Demographic and Clinical Variables | | | | | 0.71 |
| Test | Demographic and Clinical Variables | | | | | 0.70 |
| Training | Demographics + both “diabet” scores | | | | | 0.92 |
| Test | Demographics + both “diabet” scores | | | | | 0.93 |
Note: Demographic and clinical variables include patient specific average BMI, systolic and diastolic blood pressures, sex (male versus female), and birth year.
Figure 1:

Histograms of the log “diabet”-anchored summary statistic estimated in the non-type 2 diabetes group, by type 2 diabetes indication status. This statistic is the within-person mean over notes of each word’s exponentiated PMI with “diabet”, as estimated in the sample without an ICD-10 code indicating type 2 diabetes, but evaluated in the whole sample. The arrows point to the log values of the means from the training set in Table 1.
In Table 1, we present the characteristics of our indices by type 2 diabetes indication status. As described previously, we estimated the PMI statistics separately in those with and without type 2 diabetes ICD-10 indications, but then calculated the values of the summary statistics in both samples using the relevant estimated PMIs. In our sample, many of those without a type 2 diabetes indication still have the term diabetes in their electronic health record. Generally, this takes the form of phrases akin to “no evidence of diabetes” or more simply “no diabetes”, as well as phrases such as “history of diabetes”. Those with a history of diabetes sometimes do not have a type 2 diabetes mellitus ICD-10 diagnosis indicating as such. This could be for several reasons. One is that the noted history reflects diabetes now in remission due to successful diet and lifestyle management. Another is that the patient had type 1 rather than type 2 diabetes. Finally, the term could simply have been missed by medical coders.
Overall, the average PMI-related summary statistic describing the association of the “diabet” stem with other stems works well at distinguishing those with and without type 2 diabetes. Our test set AUCs are 0.73 for the score based on PMIs estimated in the type 2 diabetes group and 0.88 for the score based on PMIs estimated in the non-type 2 diabetes group. It is noteworthy that the average statistic is better at discriminating type 2 diabetes status when using the values estimated from those without a type 2 diabetes indication.
The averages of all of the pairwise exponentiated PMIs within the two indication groups were not as good at prediction. This suggests that, in general, the word neighborhood pairings between the two type 2 diabetes indication groups did not differ substantially, but the pairings with the stem “diabet” did. This demonstrates the utility of identifying salient words for prediction purposes. Many of the 1,500 stems used may not have strong associations with “diabet.” Hence, using all stems in prediction, rather than relevant stems, seems to add too much noise for the PMI statistics to be useful.
Demographic and clinical characteristics used alone in the model gave an AUC of 0.70. Demographic characteristics used together with the two “diabet”-anchored scores gave an AUC of 0.93.
As a sensitivity analysis, we investigated the predictive ability of the average of the PMI statistics themselves (i.e. the log ratio of conditional probabilities). We found this to be a substantially worse predictor than the average of the ratio of the probabilities (i.e. the exponentiated PMI statistics). Using the untransformed PMI statistics, the test set AUC was 0.54 for the recalculated score based on the type 2 diabetes group and 0.63 for the score based on the non-type 2 diabetes group. Further, we examined whether the inclusion of some patients who had not yet died influenced our predictive ability. We found that restricting the sample only to those who had died did not substantially change the predictive ability of our algorithm (test set AUC of 0.87 for the non-type 2 diabetes “diabet” score alone and 0.91 for the two “diabet” scores jointly).
6. Conditional Probabilities of Word Pairings and PMI statistics
In order to understand why the PMIs with the stem “diabet” were able to distinguish those with and without type 2 diabetes so well, we examined the characteristics of the top and bottom ten PMIs in the training sample. In Table 2, we see that there was some overlap in the words commonly seen in neighborhoods of the “diabet” stem among both those with and without diabetes. “Mellitus” was the most commonly found stem in both groups, as were “depend” and “dm” (an abbreviation for diabetes mellitus). However, there were some key differences in the top 10 lists that could help explain why different word patterns can distinguish those with and without type 2 diabetes indications. In those without type 2 diabetes, the stem “neg” was the eighth most common word to find in the vicinity of “diabet”. This suggests that negative terminology found close to diabetes is more likely to indicate that someone does not have diabetes. While the “neg” stem has a PMI value of 1.65 {se=0.50, exp(PMI)=5.20} in the type 2 diabetes group (and hence does not make the list in Table 2), this is much smaller than the PMI value of 4.10 {se=0.77, exp(PMI)=60.34} in the non-type 2 diabetes group. The abbreviation for history, “hx”, was one of the top words associated with “diabet” among those without type 2 diabetes, but was not among the top words among those with type 2 diabetes. As with the “neg” stem, some patients had a history of diabetes noted, perhaps based on self report, but did not have an indication that they currently had diabetic symptoms. Hence, the history by itself did not warrant an ICD-10 notation in the health record.
Table 2:
Smallest and largest PMIs for the relationship of “diabet” with other stems. Here, M is the event that C indicates “diabet” and M′ is the event that C′ indicates the stem in the relevant column of the table, conditional on D = 1. P̂(M|M′) denotes the estimated conditional probability of M given M′. To protect confidentiality, we do not report very small percentages. se = standard error.
| No Type 2 Diabetes ICD-10 Code | | | Type 2 Diabetes ICD-10 Code | | |
|---|---|---|---|---|---|
| Stem | P̂(M\|M′) (se) | PMI (se) | Stem | P̂(M\|M′) (se) | PMI (se) |
| mg | < 0.01 | −3.95(0.77) | mouth | < 0.01 | −5.89(0.71) |
| bilater | < 0.01 | −3.45(1.00) | measur | < 0.01 | −5.14(0.99) |
| abdomen | < 0.01 | −3.45(1.02) | size | < 0.01 | −5.02(1.00) |
| system | < 0.01 | −3.31(1.01) | musculoskelet | < 0.01 | −4.66(1.00) |
| dose | < 0.01 | −3.27(1.01) | rash | < 0.01 | −4.61(1.00) |
| valu | < 0.01 | −3.26(1.01) | cycl | < 0.01 | −4.47(1.01) |
| bowel | < 0.01 | −3.22(1.00) | appear | < 0.01 | −4.41(0.70) |
| tablet | < 0.01 | −3.18(0.73) | topic | < 0.01 | −4.32(1.00) |
| complet | < 0.01 | −3.15(1.01) | interv | < 0.01 | −4.31(1.00) |
| lb | < 0.01 | −3.02(0.92) | pelvic | < 0.01 | −4.27(1.01) |
| sister | 0.01(0.01) | 3.88(0.62) | patern | 0.05(0.02) | 3.22(0.44) |
| hx | 0.01(0.01) | 3.88(0.63) | hyperlipidemia | 0.05(0.01) | 3.25(0.23) |
| neg | 0.02(0.01) | 4.10(0.77) | hypertens | 0.06(0.01) | 3.43(0.10) |
| mother | 0.02(0.01) | 4.37(0.40) | dm | 0.06(0.02) | 3.45(0.26) |
| diabet | 0.02(0.01) | 4.42(0.54) | father | 0.06(0.01) | 3.46(0.16) |
| attack | 0.03(0.02) | 4.64(0.55) | mother | 0.06(0.01) | 3.53(0.16) |
| insulin | 0.03(0.01) | 4.65(0.53) | asthma | 0.07(0.01) | 3.64(0.21) |
| depend | 0.04(0.01) | 4.93(0.30) | type | 0.07(0.01) | 3.66(0.07) |
| dm | 0.07(0.04) | 5.51(0.68) | depend | 0.11(0.02) | 4.03(0.16) |
| mellitus | 0.26(0.01) | 6.88(0.17) | mellitus | 0.26(0.002) | 4.90(0.05) |
Another key difference was that “insulin” had the fourth highest PMI among those with no type 2 diabetes indications. This might be due to documentation of insulin-dependent type 1 diabetes that would not warrant a relevant type 2 code. Upon inspection of the note text, there was also evidence that there may have been general discussions about insulin-dependent diabetes between clinicians and those with pancreatic cancer. These discussions may not have resulted in a type 2 diabetes diagnosis, but their occurrence may be due to patient or provider concerns about the relationship between pancreatic cancer and diabetes (Pannala et al., 2009).
“Hypertens” and “hyperlipidemia” stems were likely to be found in the neighborhood of “diabet” among those with type 2 diabetes indications, but not among those without an indication. This is consistent with type 2 diabetes being related to metabolic syndrome, of which hypertension and hyperlipidemia are two components (Punthakee et al., 2018).
Also of note are the standard errors in Table 2. The standard errors are generally much larger after accounting for within-patient cluster-correlation of word pairs, as we did, than if one assumes independence. For example, the standard error of the log PMI describing the relationship of “diabet” with “mellitus” in those without a diabetes indication is 0.02 if one does not account for clustering, but 0.17 if one does. This is likely because there were only 600 patients in the training sample, which is the effective number of independent samples. In contrast, the training set clinical notes were so detailed that there were over 99 million word pairs in neighborhoods of each other. Accounting for clustering via robust standard errors, as we did, more appropriately reflects the effective sample size.
7. Discussion
We have described how the PMI statistic underlies one of the most commonly cited natural language processing algorithms. We also proposed an efficient way of estimating the PMI. On our personal computer, R can calculate all PMI statistics in the 1,500 × 1,500 symmetric PMI matrix in 0.36 seconds using cross-tabulation, albeit without standard error calculations. This compares with 4.17 minutes to estimate a single PMI statistic using the two logistic regressions in Equations 6 and 10. One could use the cross-tabulation for exploratory purposes, and estimate PMI standard errors only for the comparisons of most interest.
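The cross-tabulation shortcut can be sketched as follows. This is an illustrative Python version (the paper's implementation is in R), with a hypothetical corpus and neighborhood window; it computes every pairwise PMI from one pass of co-occurrence counting rather than one regression per word pair.

```python
import numpy as np

def pmi_matrix(docs, window=2):
    """Estimate all pairwise PMIs by cross-tabulating co-occurrence counts.

    PMI(w, w') = log[ P(w, w') / (P(w) P(w')) ], with probabilities
    estimated from counts of word pairs falling within `window`
    positions of each other in a document."""
    vocab = sorted({w for doc in docs for w in doc})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        for i, w in enumerate(doc):
            # count each in-window pair once in each direction
            for j in range(i + 1, min(i + 1 + window, len(doc))):
                counts[idx[w], idx[doc[j]]] += 1
                counts[idx[doc[j]], idx[w]] += 1
    total = counts.sum()
    marginal = counts.sum(axis=1)
    # pairs never observed together get PMI = -inf (log of zero)
    with np.errstate(divide="ignore"):
        pmi = np.log(counts * total) - np.log(np.outer(marginal, marginal))
    return vocab, pmi
```

Because counting scales with corpus size rather than vocabulary size squared, this is what makes the full symmetric PMI matrix cheap to compute for exploratory work.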
Of note, we estimated the PMI without taking a fixed number of negative samples, as some commonly used natural language processing algorithms discussed in Section 3 might suggest is necessary. In the context of our electronic health record example, this kept the dimensionality of our training set design matrix to just the 99 million word pairs in neighborhoods of each other. Using 15 negative random samples per matched pair has been proposed (Levy and Goldberg, 2014b). In our context, this would have increased the number of design matrix rows by almost 1.5 billion.
Our approach is novel in that we have presented a method for obtaining standard errors for our estimators. A literature search suggests that many computer science papers detailing natural language processing algorithms do not propose methods for quantifying estimation uncertainty. We used robust cluster-corrected standard errors to account for within-patient correlation of word patterns within notes. Adjusting for within-patient clustering of word pairs can have a substantial impact on standard errors, making them many times larger. Larger standard errors are appropriate, as there were only 1,000 patients in the sample. Failure to account for clustering of words within patients would make the sample size seem unrealistically large, and hence produce standard errors that are unrealistically small. Future work can examine the use of other methods, such as regressions estimated by generalized estimating equations (GEE), to similarly account for correlation. One difficulty with GEE estimation in this context is that the amount of data might strain available computational resources.
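The intuition behind the cluster correction can be conveyed with a patient-level (cluster) bootstrap; this is not the sandwich estimator we used (Huber, 1967; Williams, 2000), but a simpler sketch in the same spirit, with hypothetical data and function names. Resampling whole patients, rather than individual word pairs, makes the uncertainty reflect the number of patients instead of the number of word pairs.

```python
import numpy as np

def cluster_bootstrap_se(pair_indicator, patient_id, n_boot=2000, seed=0):
    """Cluster bootstrap standard error for a proportion estimated from
    word-pair indicators that are correlated within patients.

    Whole patients (clusters) are resampled with replacement, so the
    effective sample size is the number of patients, not the number of
    word pairs contributed by their notes."""
    rng = np.random.default_rng(seed)
    ids = np.unique(patient_id)
    groups = [pair_indicator[patient_id == i] for i in ids]
    stats = []
    for _ in range(n_boot):
        pick = rng.integers(0, len(groups), size=len(groups))
        sample = np.concatenate([groups[k] for k in pick])
        stats.append(sample.mean())
    return float(np.std(stats, ddof=1))
```

With strongly correlated word pairs within a patient, this standard error can be an order of magnitude larger than the naive independence-based one, mirroring the 0.02 versus 0.17 contrast noted for the “diabet”/“mellitus” PMI.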
We were able to perform this work because we first gained understanding of the probability model that underlies the word2vec algorithm. Defining the role of statistics in natural language processing and data science has taken on increasing importance, as evidenced by the position statement of the American Statistical Association on “The Role of Statistics in Data Science” (American Statistical Association, 2015). Many of the natural language processing methods developed in computer science start from an algorithmic approach, rather than from a probabilistic and statistical-theory-driven approach (Gacs et al., 2001). This paper provides a framework by which the statistical properties of different parameterizations for Equations (6) and (10) can also be considered (e.g. the embeddings parameterizations of Levy and Goldberg, 2014b). As demonstrated here, it is possible to show the probabilistic characteristics of many algorithms, and use these characteristics to make them more efficient and statistically rigorous.
Supplementary Material
Acknowledgements
The corresponding author is Brian L. Egleston. This was funded in part by NIH/NCI grants R21CA202130 (PIs Egleston/Vucetic), P30CA006927 (Fox Chase Cancer Center Support Grant). We thank Drs. Elizabeth A. Handorf and Samuel Litwin for their comments on a draft.
Data Availability Statement
Research data are not shared. Our data include the free text clinical notes of electronic health records. These notes contain patients’ protected health information, and this precludes direct sharing of the data.
R code that demonstrates the method using the text from a Latex draft document of this paper is available with this paper at the Biometrics website on Wiley Online Library.
Contributor Information
Brian L. Egleston, Biostatistics and Bioinformatics Facility, Fox Chase Cancer Center, Temple University Health System, Philadelphia, Pennsylvania, 19111, U.S.A.
Tian Bai, Department of Computer and Information Sciences, Temple University.
Richard J. Bleicher, Department of Surgical Oncology, Fox Chase Cancer Center
Stanford J. Taylor, Population Studies Facility, Fox Chase Cancer Center
Michael H. Lutz, Population Studies Facility, Fox Chase Cancer Center
Slobodan Vucetic, Department of Computer and Information Sciences, Temple University.
References
- American Statistical Association (2015). “ASA statement on the role of statistics in data science”. https://community.amstat.org/blogs/ronaldwasserstein/2015/10/01/the-role-of-statistics-in-datascience-an-asa-statement. Accessed June 23, 2020.
- Benoit K, Muhr D, and Watanabe K (2017). Stopwords: one-stop shopping for stopwords in R. https://cran.r-project.org/web/packages/stopwords/stopwords.pdf. Accessed June 23, 2020.
- Budiu R, Royer C, and Pirolli P (2007). Modeling information scent: A comparison of LSA, PMI and GLSA similarity measures on common tests and corpora. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), RIAO’07, pages 314–332, Paris, FRA. Le Centre de Hautes Etudes Internationales D’Informatique Documentaire.
- Church K and Hanks P (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
- Egghe L and Leydesdorff L (2009). The relation between Pearson’s correlation coefficient r and Salton’s cosine measure. Journal of the American Society for Information Science and Technology, 60(5):1027–1036.
- Feinerer I, Hornik K, and Meyer D (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5):1–54.
- Gacs P, Tromp JT, and Vitanyi PMB (2001). Algorithmic statistics. IEEE Transactions on Information Theory, 47(6):2443–2463.
- Huang A (2008). Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, volume 4, pages 9–56.
- Huber P (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 221–233, Berkeley, CA. University of California Press.
- LeCun Y, Bengio Y, and Hinton G (2015). Deep learning. Nature, 521:436–444.
- Levy O and Goldberg Y (2014a). Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308.
- Levy O and Goldberg Y (2014b). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.
- Meyer D (2016). How exactly does word2vec work? https://pdfs.semanticscholar.org/49ed/be35390224dc0c19aefe4eb28312e70b7e79.pdf. Accessed June 23, 2020.
- Mikolov T, Sutskever I, Chen K, Corrado G, and Dean J (2013). Distributed representations of words and phrases and their compositionality. In Burges CJC, Bottou L, Welling M, Ghahramani Z, and Weinberger KQ, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
- Nadkarni PM, Ohno-Machado L, and Chapman WW (2011). Natural language processing: An introduction. Journal of the American Medical Informatics Association, 18(5):544–551.
- Pannala R, Leirness JB, Bamlet WR, Basu A, Petersen GM, and Chari ST (2009). Prevalence and clinical profile of pancreatic cancer-associated diabetes mellitus. Gastroenterology, 134(4):981–987.
- Punthakee Z, Goldenberg R, and Katz P (2018). Definition, classification and diagnosis of diabetes, prediabetes and metabolic syndrome. Canadian Journal of Diabetes, 42:S10–S15.
- Williams CB (1975). Mendenhall’s studies of word-length distribution in the works of Shakespeare and Bacon. Biometrika, 62(1):207–212.
- Williams RL (2000). A note on robust variance estimation for cluster-correlated data. Biometrics, 56:645–646.
- Yule GU (1939). On sentence-length as a statistical characteristic of style in prose: With application to two cases of disputed authorship. Biometrika, 30(3/4):363–390.