Table 2.
Unsafe features.
Variable | Meaning |
---|---|
Total | Total number of PubMed ID candidates retrieved from Elasticsearch when querying for all gene synonyms of a given gene symbol |
Contribution | The percentage of PubMed IDs that a given gene synonym contributes to the total for a particular gene symbol |
Number of characters | The length of the gene synonym in characters |
Bits | The sum of the bits of information of every character in a gene synonym based on the frequencies of each character in PubMed’s corpus of titles and abstracts |
Number of nested | The number of other gene synonyms that contain the gene synonym. For example: “Insulin” is part of “Insulin Receptor” |
Prob. of the synonym given an alternative | The conditional probability of finding the gene synonym given that an alternative synonym for the same gene symbol also appears in the text |
Prob. of an alternative given the synonym | The conditional probability of finding alternative gene synonyms given that the synonym appears in the text |
Is gene symbol | Whether the synonym is also an accepted gene symbol |
Engineered features used to estimate the probability that a given gene symbol is ambiguous (unsafe).
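
As an illustration of how several of the features in Table 2 could be computed, the following is a minimal Python sketch rather than the implementation used in this work. The function names (`char_bits`, `number_of_nested`, `conditional_probs`) and the toy inputs are hypothetical. It assumes that character counts over the corpus are available as a dictionary, that each synonym's hits are available as a set of PubMed IDs, that nesting is checked case-insensitively, and that characters absent from the corpus are given a count of 1.

```python
import math


def char_bits(synonym, char_counts):
    """Sum of the information content, -log2 p(c), over every character.

    char_counts maps a character to its count in the corpus.
    Assumption: unseen characters are treated as having count 1.
    """
    total = sum(char_counts.values())
    return sum(-math.log2(char_counts.get(c, 1) / total) for c in synonym)


def number_of_nested(synonym, all_synonyms):
    """Count the other synonyms that contain this synonym as a substring.

    Assumption: the comparison is case-insensitive.
    """
    s = synonym.lower()
    return sum(
        1 for other in all_synonyms
        if other.lower() != s and s in other.lower()
    )


def conditional_probs(syn_ids, alt_ids):
    """Return (P(synonym | alternative), P(alternative | synonym)).

    syn_ids: set of PubMed IDs whose text mentions the synonym.
    alt_ids: set of PubMed IDs whose text mentions any alternative synonym
             of the same gene symbol.
    """
    both = len(syn_ids & alt_ids)
    p_syn_given_alt = both / len(alt_ids) if alt_ids else 0.0
    p_alt_given_syn = both / len(syn_ids) if syn_ids else 0.0
    return p_syn_given_alt, p_alt_given_syn


# Toy data, for illustration only.
bits = char_bits("ab", {"a": 1, "b": 1})
nested = number_of_nested("Insulin", ["Insulin", "Insulin Receptor", "INSR"])
p_syn_alt, p_alt_syn = conditional_probs({1, 2, 3, 4}, {3, 4, 5})
```

Under these assumptions, a short synonym made of common characters yields few bits, and a synonym that rarely co-occurs with its alternatives yields low conditional probabilities. Both patterns are the kind of signal the features in Table 2 are meant to capture.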