Journal of the American Medical Informatics Association : JAMIA. 2020 Jun 17;27(7):1110–1115. doi: 10.1093/jamia/ocaa061

Development of the Gender, Sex, and Sexual Orientation ontology: Evaluation and workflow

Clair A Kronk, Judith W Dexheimer
PMCID: PMC7647319  PMID: 32548638

Abstract

Objective

The study sought to create an integrated vocabulary system that addresses the lack of standardized health terminology in gender and sexual orientation.

Materials and Methods

We evaluated computational efficiency, coverage, query-based term tagging, randomly selected term tagging, and mappings to existing terminology systems, including the International Classification of Diseases (ICD), Diagnostic and Statistical Manual of Mental Disorders (DSM), Systematized Nomenclature of Medicine (SNOMED), Medical Subject Headings (MeSH), and National Cancer Institute Thesaurus.

Results

We published version 2 of the Gender, Sex, and Sexual Orientation (GSSO) ontology with over 10 000 entries with definitions, a readable hierarchy system, and over 14 000 database mappings. Over 70% of terms had no mapping in any other available ontology.

Discussion

We created the GSSO and made it publicly available on the National Center for Biomedical Ontology BioPortal and on GitHub. It includes clarifications on over 200 slang terms, 190 pronouns with linked example usages, and over 200 nonbinary and culturally specific gender identities.

Conclusions

Gender and sexual orientation continue to represent crucial areas of medical practice and research with evolving terminology. The GSSO helps address this gap by providing a centralized data resource.

Keywords: biological ontologies, gender and sexual minorities, sex, gender identity, sexual behavior

INTRODUCTION

Gender, sex, and sexual orientation have expanded significantly as areas of research in the last 2 decades.1–3 This literature features not only volume, but also heterogeneity in terminology.2,4 These issues make knowledge discovery and understanding difficult without external modeling.5–7 The Gender, Sex, and Sexual Orientation (GSSO) ontology was designed as a model to facilitate communication and assist in organizing this literature.8 The GSSO includes more than 1400 external biomedical ontology references, 1000 definitions, and 150 fully cited reference materials. Approximately one-third of its entries are novel, having no mapping to any of the over 700 ontologies indexed in the open biomedical ontology repository, the National Center for Biomedical Ontology (NCBO) BioPortal, in 2019.9,10

An ontology was chosen as the GSSO format to facilitate domain knowledge reuse, make analysis of that knowledge computationally efficient, and create explicitness within the domains of gender, sex, and sexual orientation.11

Since its inception, the GSSO has garnered significant interest, being in the top 5% of all ontologies visited on the NCBO BioPortal’s website as of March 2020. As its use cases expand, integration and interoperability have become essential to parse and tag incoming information. Despite its size, many entries were incomplete, lacking external linking, references, or definitions, and its original classification system was created in a more human- than machine-readable format.

To address the literature and traffic volumes, we sought to increase the ontology’s scope, referencing, internal and external linking, and verification methodologies. We focused on search capabilities given the diversity of terminology and the difficulty of terminology matching in the fields of gender, sex, and sexual orientation.1,12–14 Our goal was to improve the functionality and accessibility of the GSSO through scalable, iterative improvements.

MATERIALS AND METHODS

Ontology construction

The first version of the GSSO included 6250 terms covering diverse topics from abstinence to zygosity with synonyms, mappings to other ontologies, and select translations to additional non-English languages. It is available via GitHub (https://github.com/Superraptor/GSSO), and a demonstration website with simple search functionality was made available at our institution (https://homepages.uc.edu/~kronkcj/gsso/). We constructed the ontology using the Protégé software.15 Despite its initial wide coverage, only 17% of initial terms included definitions and referencing, compared with about 11% of Medical Subject Headings (MeSH) terms or 77% of National Cancer Institute Thesaurus terms.10

We iteratively scoped all relevant literature for completeness.16 We identified 32 seed terms from the most recent Human Rights Campaign (HRC) glossary to search titles from the 2019 release of Medline. HRC is the largest LGBTQIA+ advocacy group in the United States,17 claiming more than 3 million members,18 about one-third of the estimated American LGBTQIA+ population.19 These search results were then used to identify the most common n-grams. The top 200 of 1-, 2-, 3-, 4-, and 5-grams were found and considered for addition to the GSSO. The 32 seed terms were ally, androgynous, asexual, biphobia, bisexual, cisgender, closeted, coming out, gay, gender dysphoria, gender-expansive, gender expression, gender-fluid, gender identity, gender nonconforming, genderqueer, gender transition, homophobia, intersex, lesbian, LGBTQ, living openly, nonbinary, outing, pansexual, queer, questioning, same-gender loving, sex assigned at birth, sexual orientation, transgender, and transphobia.
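The n-gram step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the tokenizer, the substring matching of seed terms against titles, and the function names (tokenize, ngrams, top_ngrams) are all assumptions, and the titles are invented toy data rather than Medline records.

```python
from collections import Counter
import re


def tokenize(title):
    """Lowercase a title and split it into word tokens (inner hyphens and apostrophes kept)."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", title.lower())


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def top_ngrams(titles, seed_terms, max_n=5, top_k=200):
    """Return {n: [(ngram, count), ...]} computed over titles that contain a seed term."""
    seeds = [s.lower() for s in seed_terms]
    # Assumption: plain substring matching on lowercased titles. The article
    # does not state how seed terms were matched to titles.
    matched = [t for t in titles if any(s in t.lower() for s in seeds)]
    result = {}
    for n in range(1, max_n + 1):
        counts = Counter()
        for title in matched:
            counts.update(ngrams(tokenize(title), n))
        result[n] = counts.most_common(top_k)
    return result


# Toy titles, invented for illustration only.
titles = [
    "Gender identity and mental health in transgender youth",
    "Gender identity development in adolescents",
    "Crop rotation effects on soil nitrogen",
]
top = top_ngrams(titles, ["gender identity", "transgender"], top_k=3)
print(top[2][0])  # most frequent 2-gram among the matched titles
```

Substring matching is the simplest choice but would also match short seeds inside longer words (for example, "ally" inside "generally"), so a word-boundary match may be preferable in practice.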

We searched “LGBTQ terminology” and “LGBTQ slang” on Google in August 2019, taking the first 2 pages of relevant results and adding all terms from online glossaries. “LGBTQ” was chosen as the queried initialism because of its preferred usage in current style guides.20–23 We evaluated print encyclopedias and vocabularies found via Google Books, Google Scholar, and the HathiTrust Digital Library from September 2019 to January 2020. We incorporated feedback on terms from the GLBT Museum and Archives listserv and the Trans PhD Network on Facebook for additional terms and usage notes.

To complete the ontology, we parsed each of the original terms and updated them in a piecewise manner, with reclassifications based on superclass or subclass and class or individual for computational readability. For instance, journal articles were moved to instances of “scholarly article” rather than subclasses of it to facilitate more computationally efficient queries. We added definitions and references, either a piece of literature or an external database identifier, where missing.
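The subclass-to-instance reclassification can be illustrated with a small sketch. It assumes the ontology is held as a set of (subject, predicate, object) triples, with the strings "subClassOf" and "type" standing in for rdfs:subClassOf and rdf:type; the helper names and the two example articles are hypothetical and are not taken from the GSSO itself.

```python
SUBCLASS_OF = "subClassOf"  # stands in for rdfs:subClassOf
TYPE = "type"               # stands in for rdf:type


def demote_to_instances(triples, parent):
    """Rewrite every direct 'X subClassOf parent' triple as 'X type parent'."""
    rewritten = set()
    for s, p, o in triples:
        if p == SUBCLASS_OF and o == parent:
            rewritten.add((s, TYPE, parent))
        else:
            rewritten.add((s, p, o))
    return rewritten


def instances_of(triples, cls):
    """Return all subjects asserted to be instances of cls."""
    return {s for s, p, o in triples if p == TYPE and o == cls}


# Hypothetical toy triples, for illustration only.
triples = {
    ("scholarly article", SUBCLASS_OF, "publication"),
    ("example article A", SUBCLASS_OF, "scholarly article"),
    ("example article B", SUBCLASS_OF, "scholarly article"),
}
updated = demote_to_instances(triples, "scholarly article")
print(sorted(instances_of(updated, "scholarly article")))
```

After the rewrite, listing all articles is a single lookup of instances rather than a traversal of the class hierarchy, which is one plausible reading of the efficiency gain described above.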

A full mapping was made to the second version of the Homosaurus vocabulary24 and to the Library of Congress’ LGBTQIA+ related vocabulary.25

Preliminary ontology evaluation

We used Medline as our evaluation set to evaluate completeness of mappings. We analyzed recall and precision on chosen queries revolving around the HRC glossary terms. We tested usage with plaintext title matching, then added MeSH terms, and finally added GSSO terms.

We created 2 subgroupings of Medline for testing. The first contained a random selection of 1 217 621 entries (of 29 138 438 [4.2%]). In the second, we preselected those entries with plaintext title/abstract matches to the HRC glossary terms.

GSSO term matching for titles and abstracts was done piecewise to decrease false positives.26,27 Pronouns, stop words, and other terms 3 characters or shorter were not matched. Matching was done in a 2-step process. In the first step, a mapping to a label, alternate name, synonym, exact synonym, broad synonym, narrow synonym, or demonym was considered a definite match and returned. In the second step, a mapping to a shortened name, related synonym, or obsoleted term was considered a probable match. Then, the descendants and instances of that term were searched. If there was a match with any of the descendants or instances, the probable match was deemed a match. Otherwise, it was deemed a nonmatch.
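The 2-step matching logic can be sketched as below. This is an illustration under stated assumptions rather than the study's implementation: the Term structure, the whole-word regular-expression matching, the abbreviated stop-word and pronoun lists, and the rule that a descendant confirms a probable match only through its own definite names are all assumptions. The example terms are hypothetical and do not reproduce actual GSSO entries.

```python
import re
from dataclasses import dataclass, field

STOP_WORDS = {"with", "from", "that", "this"}  # illustrative subset only
PRONOUNS = {"they", "them", "hers"}            # illustrative subset only


@dataclass
class Term:
    label: str
    definite: set = field(default_factory=set)    # synonyms, alternate names, demonyms
    probable: set = field(default_factory=set)    # shortened names, related synonyms, obsoleted terms
    children: list = field(default_factory=list)  # direct subclasses and instances


def eligible(name):
    """Names of 3 characters or fewer, stop words, and pronouns are never matched."""
    n = name.lower()
    return len(n) > 3 and n not in STOP_WORDS and n not in PRONOUNS


def found(name, text):
    """Whole-word, case-insensitive search for an eligible name in the text."""
    if not eligible(name):
        return False
    pattern = r"\b" + re.escape(name.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def descendants(term):
    """Yield all subclasses and instances below a term, depth first."""
    for child in term.children:
        yield child
        yield from descendants(child)


def is_definite(term, text):
    """Step 1: the label or any definite name occurs in the text."""
    return any(found(n, text) for n in {term.label} | term.definite)


def matches(term, text):
    if is_definite(term, text):
        return True
    # Step 2: a probable name counts only if a descendant or instance also matches.
    if any(found(n, text) for n in term.probable):
        return any(is_definite(d, text) for d in descendants(term))
    return False


# Hypothetical toy terms, for illustration only.
child = Term("transgender woman", definite={"trans woman"})
parent = Term("transgender person", definite={"transgender people"},
              probable={"trans"}, children=[child])

print(matches(parent, "Health outcomes for transgender people"))  # definite name present
print(matches(parent, "Trans fatty acid intake and heart disease"))  # probable only, unconfirmed
print(matches(parent, "Interview with a trans woman"))  # probable, confirmed by a descendant
```

The second example shows the intended effect of the confirmation step: an ambiguous short form on its own is rejected unless a more specific term from the same branch of the hierarchy also appears.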

Recall, precision, and F-scores were tested in the random Medline subset and the queried Medline subset using plaintext title mapping, MeSH terms, and GSSO terms. Terms had to be present in all categories and have little to no ambiguity to be considered for matching. A term was treated as unambiguous if it had only 1 definition on the online dictionary Wiktionary as of February 2020, or if all of its definitions pertained to the same idea. Terms fitting these criteria included lesbian, transgender, sexual orientation, gender identity, gender dysphoria, homophobia, LGBTQ, cisgender, gender expression, transphobia, gender nonconforming, genderqueer, biphobia, same-gender loving, gender-expansive, gender-fluid, and sex assigned at birth. MeSH terms were considered the gold standard. Precision above 0.265, recall above 0.216, and F-score above 0.198 were considered satisfactory based on other work describing ontology-mapped query systems28; likewise, precision values above 0.56, recall values above 0.42, and F-scores above 0.48 were considered excellent.29
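The evaluation criteria can be expressed as a short sketch. Two assumptions are made here: the F-score is taken to be the balanced F1 (the harmonic mean of precision and recall), and a result is assigned to a tier only when all 3 values exceed that tier's cutoffs. The function names and the toy entry IDs are hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(predicted, gold):
    """Precision and recall of a set of tagged entry IDs against a gold-standard set."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Cutoffs quoted in the text, ordered as (precision, recall, F-score).
SATISFACTORY = (0.265, 0.216, 0.198)
EXCELLENT = (0.56, 0.42, 0.48)


def tier(precision, recall):
    """Classify a result; assumes all 3 values must exceed a tier's cutoffs."""
    scores = (precision, recall, f_score(precision, recall))
    if all(s > c for s, c in zip(scores, EXCELLENT)):
        return "excellent"
    if all(s > c for s, c in zip(scores, SATISFACTORY)):
        return "satisfactory"
    return "below threshold"


# Toy entry IDs, invented for illustration: entries tagged via MeSH (gold) vs via GSSO.
gold = {1, 2, 3, 4}
predicted = {2, 3, 4, 5, 6}
p, r = precision_recall(predicted, gold)
print(p, r, tier(p, r))
```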

RESULTS

Version 2 of the GSSO features 10 060 entries expanded from 6250 and is available at the NCBO BioPortal (https://bioportal.bioontology.org/ontologies/GSSO) and at GitHub (https://github.com/Superraptor/GSSO). Each class in version 2 (7121 entries) contains a human-readable definition and is placed in a computer-readable hierarchy. The number of external database mappings has been increased from 1416 to 14 193 (Table 1). A total of 5273 (74.0%) classes had no mapping to any class in any of the over 800 ontologies at the NCBO BioPortal. The average number of annotations per class increased from 2.6 to 7.4. Sample entries for version 1 and 2 are shown in Figure 1. The number of sources increased from 264 to 823, including 325 scholarly articles and 91 books. Among the books, The Complete Dictionary of Sexology, The Wiley Blackwell Encyclopedia of Gender and Sexuality Studies, and LGBTQ America Today: An Encyclopedia were cited in 371, 159, and 121 entries, respectively.30–32 Online sources of note included the glbtq Encyclopedia Project and the Encyclopedia of Homosexuality, which provided 125 and 134 citations, respectively.33,34 A total of 310 usage notes were created.

Table 1. External database mappings from version 1 to version 2

| Source | Mappings in version 1 (n = 1416) | Mappings in version 2 (n = 14 193) |
|---|---|---|
| ATC | NM | 20^a |
| BFO | 19 | 0 |
| ChEBI | 4 | 213^a |
| DO | 62 | 193^a |
| DSM | NM | 21^a |
| EFO | 35 | 0 |
| FMA | 43 | 327^a |
| GO | 32 | 152^a |
| GOLD | 13 | 0 |
| Homosaurus | NM | 430^a |
| HPO | 30 | 137^a |
| ICD-9-CM | 30 | 101^a |
| ICD-10-CM | 28 | 261^a |
| LCC | NM | 527^a |
| LCSH | NM | 749^a |
| MedDRA | 129 | 595^a |
| MeSH | 261 | 904^a |
| NCBI Taxon | 11 | 87^a |
| NCIT | 261 | 1034^a |
| SCTID | 241 | 1084^a |
| SIO | 116 | 3 |
| STY | 16 | 1 |
| TA | 3 | 91^a |
| TE | 30 | 38 |
| UBERON | 22 | 169^a |
| Wikidata | NM | 2270^a |
| Wikipedia | NM | 4755^a |

Only databases with more than 10 mappings are shown. Mappings to Wikidata and Wikipedia were made in version 1 but were not separately tabulated (they brought the total number of mappings to approximately 3300, which is still significantly less than 14 193).

ATC: Anatomical Therapeutic Chemical Classification System; BFO: Basic Formal Ontology; ChEBI: Chemical Entities of Biological Interest; DO: Disease Ontology; DSM: Diagnostic and Statistical Manual of Mental Disorders; EFO: Experimental Factor Ontology; FMA: Foundational Model of Anatomy; GO: Gene Ontology; GOLD: General Ontology for Linguistic Description; HPO: Human Phenotype Ontology; ICD-9-CM: International Classification of Diseases–Ninth Revision–Clinical Modification; ICD-10-CM: International Classification of Diseases–Tenth Revision–Clinical Modification; LCC: Library of Congress Classification; LCSH: Library of Congress Subject Headings; MeSH: Medical Subject Headings; NCBI: National Center for Biotechnology Information; NCIT: National Cancer Institute Thesaurus; NM: no mapping; SIO: Semanticscience Integrated Ontology; STY: Semantic Types Ontology; TA: Terminologia Anatomica; TE: Terminologia Embryologica.

^a Significant increase.

Figure 1. Sample entries from version 2 (left) and version 1 (right).

The GSSO also includes 204 slang terms with definitions and references, 190 pronouns with linked example usages, and 223 terms related to nonbinary and culturally specific gender identities ranging from hijra on the Indian subcontinent to ashtime in the Maale community of Ethiopia.

With plaintext title searching using HRC terms, we identified 14 019 Medline entries. Tagging a Medline entry with GSSO terms took 7.7 seconds on average (range 1.8-9.9 seconds) when run on a local machine. Tagging was performed for titles, abstracts, and journal titles for each entry.

We tagged 13 998 (99.85%) entries with GSSO terms, compared with the 82.62% covered with MeSH terms. Comparisons between MeSH headings, Medline keywords, and GSSO terms for the HRC terms are shown in Table 2. A total of 7022 MeSH terms (2.5% of its corpus) were present in our dataset vs 8833 keywords (no defined corpus) and 2911 GSSO terms (28.9% of its corpus).

Table 2. Term frequencies for most common 32 terms across main terminologies for the HRC subset

| Word | HRC count | MeSH count | Keyword count | GSSO count |
|---|---|---|---|---|
| ally | 289 | NM | 2 | 304 |
| androgynous | 23 | NM | 2 | 39 |
| asexual | 1288 | NM | 5 | 1307 |
| biphobia | 3 | NM | 2 | 14 |
| bisexual | 2477 | 2021 | 183 | 3240 |
| cisgender | 39 | NM | 8 | 176 |
| closeted | 3 | NM | 1 | 24 |
| coming out | 192 | NM | 21 | 307 |
| gay | 4426 | 3257 | 208 | 2356 |
| gender dysphoria | 277 | 156 | 143 | 394 |
| gender expression | 24 | NM | 10 | 61 |
| gender identity | 725 | 1644 | 164 | 1078 |
| gender nonconforming | 11 | NM | 9 | 129 |
| gender transition | 10 | NM | 14 | 48 |
| gender-expansive | 0 | NM | 1 | 132 |
| gender-fluid | 0 | NM | 1 | 1 |
| genderqueer | 8 | NM | 8 | 19 |
| homophobia | 276 | 197 | 58 | 470 |
| intersex | 598 | 561 | 44 | 622 |
| lesbian | 2195 | 1891 | 246 | 3182 |
| LGBTQ | 196 | NM | 58 | 271 |
| living openly | 0 | NM | 0 | 0 |
| nonbinary | 31 | NM | 3 | 12 |
| outing | 33 | NM | 0 | 0 |
| pansexual | 3 | NM | 1 | 13 |
| queer | 369 | NM | 49 | 594 |
| questioning | 1219 | NM | 12 | 1301 |
| same-gender loving | 2 | NM | 0 | 2 |
| sex assigned at birth | 0 | NM | 0 | 21 |
| sexual orientation | 1393 | NM | 358 | 2044 |
| transgender | 2102 | 2078 | 529 | 2431 |
| transphobia | 17 | NM | 10 | 50 |

Note that because not all terms matched perfectly, the following mappings were made for the GSSO (androgynous → androgynous gender expression; gender nonconforming → gender nonconforming; gender-expansive → gender variance; sex assigned at birth → sex at birth; gay → gay man; nonbinary → gender nonbinary; gender-fluid → fluctuating gender identity; same-gender loving → same gender loving), for MeSH (bisexual → bisexuality; gay → homosexuality, male; intersex → disorders of sex development; lesbian → homosexuality, female; transgender → transgender persons + transsexualism), and for keywords (androgynous → androgyny). Mappings were made to prioritize MeSH-GSSO comparisons (eg, “gay” to “homosexuality, male” and “gay man”).

GSSO: Gender, Sex, and Sexual Orientation; HRC: Human Rights Campaign; MeSH: Medical Subject Headings; NM: no mapping.

Of the 17 terms considered nonambiguous, calculating precision, recall, and F-scores was only possible for 5 that were mappable to MeSH: lesbian (precision = 0.53, recall = 0.89, F-score = 0.664), transgender (precision = 0.58, recall = 0.76, F-score = 0.658), gender identity (precision = 0.60, recall = 0.39, F-score = 0.473), gender dysphoria (precision = 0.30, recall = 0.76, F-score = 0.430), and homophobia (precision = 0.19, recall = 0.46, F-score = 0.269). Using our criteria, 1 of our results was considered excellent (“transgender”) and 3 were considered satisfactory (“lesbian,” “gender identity,” and “gender dysphoria”).

With the randomly selected subset, we uploaded 1 217 621 Medline entries, of which 628 979 (50.66%) were tagged with GSSO terms. A total of 1 058 065 (86.89%) had MeSH terms and 201 096 (16.52%) had keywords. A total of 27 813 MeSH terms (10.0% of its corpus) were present in our dataset vs 306 728 keywords and 2579 GSSO terms (25.6% of its corpus).

A total of 3733 unique GSSO terms (37.11% of its corpus) were used across the 2 subsets combined, with an intersection of 1757 terms between sets. In comparison, MeSH terms had an intersection of 7006 terms.
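The combined count follows from inclusion-exclusion over the per-subset GSSO term counts reported above (2911 in the HRC query subset and 2579 in the random subset), as this short check shows. The variable names are illustrative.

```python
hrc_terms = 2911       # unique GSSO terms in the HRC query subset
random_terms = 2579    # unique GSSO terms in the random subset
shared_terms = 1757    # terms present in both subsets
gsso_entries = 10_060  # entries in version 2 of the GSSO

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
union_terms = hrc_terms + random_terms - shared_terms
share = 100 * union_terms / gsso_entries
print(union_terms, f"{share:.2f}%")  # 3733 37.11%
```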

DISCUSSION

The GSSO clarifies thousands of terms related to gender, sex, and sexual orientation not present in any other biomedical ontology. These areas continue to be controversial areas of research,2 with a constantly evolving terminology. As writer Kyle Taylor Shaughnessy noted in The Remedy: Queer and Trans Voices on Health and Health Care: “While the constantly evolving language and concepts of gender and sexual identity… can be overwhelming at times, if we don’t keep up we lose the ability to connect and therefore to do effective work.”35 The inclusion of source materials with date information and usage notes helps provide context for terminology (including derogatory terms) and makes the GSSO easily updatable.

The lack of gender, sex, and sexual orientation terminology in MeSH is significant, with only 8 of 32 common terms being directly mappable to it. Additionally, there was no mapping for intersex, with only the derogatory term “Disorders of Sex Development” being available. Assessing the reliability of what was considered a true positive in our HRC query subset was challenging for this reason. As a next step, we plan to manually curate a dataset and calculate precision and recall values against that grouping as gold standard. Manual curation will help us better understand the mapping process.

The GSSO’s tagging system is fast, extensible, and reliable in situations in which terms are unambiguous. Despite its performance, some GSSO term matches were missed or imprecise. For example, gay was not matched at all, owing to it being only 3 characters in length; likewise, asexual matched most articles relating to asexual reproduction in nonhuman species, instead of asexuality as a sexual orientation. Association rule mining or contextual tagging could help minimize these problems moving forward.

In comparison with MeSH, the GSSO was more specific, matching more Medline entries in the HRC query subset with a larger number of terms. This specificity carried over to the randomly selected subset, in which it matched fewer abstracts but had a much higher overlap of matched terms with the HRC query subset.

Future directions include the release of the GSSO website currently in development (Figure 2) and evaluation and validation with additional datasets. The website will support GSSO tagging within text documents and will include a survey component to help track changes and shifts in terminology usage over time.

Figure 2. Screenshots from mock-ups for web applications showcasing the GSSO with version 1 (left) and version 2 (right).

The GSSO’s extensibility allows it to be used in a number of applications including health surveillance of LGBTQIA+ social media data, which is heavy in slang use; electronic health record integration for identification of LGBTQIA+ patient groups; usage in clinical training programs as a comprehensive resource for inclusiveness; and semiautomated LGBTQIA+ literature review.

AUTHOR CONTRIBUTIONS

CAK conceived the research idea, designed and completed the ontology, performed ontology analytics, and was the primary author of the manuscript. JWD oversaw the project, contributed several critical revisions, and provided feedback on analytical systems crucial to interpreting results. Both authors agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

ACKNOWLEDGMENTS

This document adheres to the principles outlined in the Declaration of Helsinki. Special thanks to the following individuals for their assistance in this work: Giao Q. Tran, Piper Ranallo, Charles Macquarie, Patti Brennan, K. J. Rawson, Isaac Fellman, Florence Paré, Amber Billey, Marika Cifor, Walter Walker, Jack van der Wel, and Brian M. Watson.

CONFLICT OF INTEREST STATEMENT

None declared.

REFERENCES


Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
