2021 Aug 3;11:15747. doi: 10.1038/s41598-021-94897-9

Table 2.

Unsafe features.

| Variable | Meaning |
| --- | --- |
| Total | Total number of candidate PubMed IDs retrieved from ElasticSearch when querying for all gene synonyms of a given gene symbol |
| Contribution | The percentage of PubMed IDs that a given gene synonym contributes to the total for a particular gene symbol |
| Number of characters | The length of the gene synonym in characters |
| Bits | The sum of the bits of information of every character in a gene synonym, based on the frequency of each character in PubMed's corpus of titles and abstracts |
| Number of nested | The number of other gene synonyms that contain the gene synonym. For example, "Insulin" is part of "Insulin Receptor" |
| Prob. of the synonym given an alternative | The conditional probability of finding the gene synonym given that an alternative synonym for the same gene symbol also appears in the text |
| Prob. of an alternative given the synonym | The conditional probability of finding alternative gene synonyms given that the synonym appears in the text |
| Is gene symbol | Whether the synonym is also an accepted gene symbol |

Engineered features used to estimate the probability that a given gene symbol is ambiguous (unsafe).
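To make the definitions in Table 2 concrete, the following is a minimal Python sketch of how several of these features could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names, the case-insensitive substring test for nesting, the use of the union of distinct PubMed IDs as the total, and the estimation of the conditional probabilities from document co-occurrence are all assumptions, and the data are made-up toy values.

```python
import math
from collections import Counter


def char_frequencies(texts):
    """Relative frequency of each character across a corpus of strings."""
    counts = Counter(ch for text in texts for ch in text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def bits(synonym, freq):
    """Sum of per-character information, -log2 p(ch), over the synonym.

    Assumes every character of the synonym occurs in `freq`.
    """
    return sum(-math.log2(freq[ch]) for ch in synonym)


def number_of_nested(synonym, synonyms):
    """Count other synonyms containing `synonym` (case-insensitive: an assumption)."""
    s = synonym.lower()
    return sum(1 for other in synonyms if other.lower() != s and s in other.lower())


def contribution(synonym, pmids_by_synonym):
    """Percentage of the gene's distinct PubMed IDs that mention `synonym`.

    Assumption: "Total" is the union of IDs over all synonyms of the gene.
    """
    total = set().union(*pmids_by_synonym.values())
    return 100.0 * len(pmids_by_synonym[synonym]) / len(total)


def conditional_probs(synonym, pmids_by_synonym):
    """Return (P(synonym | alternative), P(alternative | synonym)).

    Assumption: both are estimated from document-level co-occurrence.
    """
    own = pmids_by_synonym[synonym]
    alt = set().union(*(p for s, p in pmids_by_synonym.items() if s != synonym))
    both = own & alt
    p_syn_given_alt = len(both) / len(alt) if alt else 0.0
    p_alt_given_syn = len(both) / len(own) if own else 0.0
    return p_syn_given_alt, p_alt_given_syn


# Toy data (hypothetical): PubMed IDs in which each synonym of one gene appears.
pmids = {"INS": {1, 2, 3, 4}, "insulin": {3, 4, 5, 6, 7, 8}, "IDDM2": {8}}
freq = char_frequencies(["aab", "c"])  # stand-in for PubMed titles and abstracts

print(len("INS"))                       # Number of characters
print(bits("ab", freq))                 # Bits
print(number_of_nested("Insulin", ["Insulin", "Insulin Receptor", "INSR"]))
print(contribution("INS", pmids))       # Contribution (%)
print(conditional_probs("INS", pmids))  # the two conditional probabilities
```

In this sketch a rare character contributes more bits than a common one, so a synonym built from unusual characters scores higher on Bits than an equally long synonym built from common ones.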