Abstract
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a large and highly integrated public chemical database resource at NIH. In the past two years, significant updates were made to PubChem. With additions from over 130 new sources, PubChem contains >1000 data sources, 119 million compounds, 322 million substances and 295 million bioactivities. New interfaces, such as the consolidated literature panel and the patent knowledge panel, were developed. The consolidated literature panel combines all references about a compound into a single list, allowing users to easily find, sort, and export all relevant articles for a chemical in one place. The patent knowledge panels for a given query chemical or gene display chemicals, genes, and diseases co-mentioned with the query in patent documents, helping users to explore relationships between co-occurring entities within patent documents. PubChemRDF was expanded to include the co-occurrence data underlying the literature knowledge panel, enabling users to exploit semantic web technologies to explore entity relationships based on the co-occurrences in the scientific literature. The usability and accessibility of information on chemicals with non-discrete structures (e.g. biologics, minerals, polymers, UVCBs and glycans) were greatly improved with dedicated web pages that provide a comprehensive view of all available information in PubChem for these chemicals.
Graphical Abstract
Introduction
PubChem (1,2) is a public database that provides comprehensive information on chemicals and their biological activities. Visited by millions of users every month (see Figure 1), it serves as a key information resource for the scientific community as well as the general public. PubChem data are collected from over a thousand data sources and organized into multiple data collections (3,4), including Substance, Compound, BioAssay, Protein, Gene, Pathway, Cell Line, Taxonomy and Patent. Substance archives chemical descriptions submitted by individual data depositors and Compound stores unique chemical structures extracted from Substance through chemical structure standardization (5). BioAssay holds the descriptions and test results of biological assay experiments. Protein, Gene, Pathway, Cell Line and Taxonomy provide a target-centric view of PubChem data, with the target being a protein, gene, pathway, cell line, and taxon (organism), respectively (4). Each record in the latter five collections contains information on the target and chemicals related to it, making it easier to analyze and interpret the biological activity of chemicals against the target. Patent contains information on patent documents related to chemicals. Currently (as of 12 September 2024), PubChem contains 322 million substances and 119 million compounds. It also has 295 million bioactivities from 1.67 million biological assay experiments. The record counts for PubChem data collections are summarized in Table 1. The up-to-date counts are available on the PubChem Statistics page (https://pubchem.ncbi.nlm.nih.gov/docs/statistics).
Figure 1.
Number of monthly unique users who visited PubChem. The statistics in this figure include both interactive and programmatic users.
Table 1.
PubChem record counts (as of 12 September 2024). The up-to-date record counts are available at the PubChem Statistics page (https://pubchem.ncbi.nlm.nih.gov/docs/statistics).
| Type | Count | Description |
|---|---|---|
| Substances | 322 395 335 | Chemical descriptions provided by PubChem contributors. They can be non-discrete structures (e.g. polymers) or even structureless (e.g. natural product extracts or materials). They are often redundant |
| Compounds | 118 596 691 | Unique chemical structures extracted from contributed PubChem Substance records |
| BioAssays | 1 671 325 | Biological experiments provided by PubChem contributors |
| Bioactivities | 295 360 133 | Biological activity data points reported in PubChem BioAssays |
| Proteins | 248 298 | Protein targets tested in PubChem BioAssays and/or those involved in PubChem Pathways |
| Genes | 113 242 | Gene targets tested in PubChem BioAssays and/or those involved in PubChem Pathways |
| Pathways | 241 163 | Groups of chemicals, genes, and proteins that interact with each other to lead to a certain product or a change in a cell |
| Cell Lines | 2007 | Cell lines tested in PubChem BioAssays |
| Taxonomy | 107 617 | Organisms of targets tested in PubChem BioAssays and those involved in PubChem Pathways |
| Literature | 41 558 769 | Scientific publications with links in PubChem |
| Patents | 50 836 952 | Patents with links in PubChem |
| Data Classifications | 73 | Classification trees that have information on the distribution of PubChem data among nodes in the hierarchy of interest |
| Data Sources | 1006 | Organizations contributing data to PubChem |
The development and evolution of PubChem data, tools, and services were previously documented in multiple papers published in the annual Database Issues (1,3,6,7) and Web Server Issues (8,9) of Nucleic Acids Research. The present paper describes major updates to PubChem for the past two years, including the addition of data content from 130+ new data sources, the development of the consolidated literature panel and the patent knowledge panel, the inclusion of literature co-occurrence data in PubChemRDF, and the introduction of summary pages dedicated to chemicals without discrete structures.
New data content
In the last two years, PubChem has significantly expanded its data coverage by incorporating information from over 130 new sources. This integration has enabled PubChem to provide information on 119 million compounds sourced from >1000 data sources. This section describes important data content added to PubChem.
Information for drug discovery
Information on approved drugs from Drugs@FDA and the Japan Pharmaceuticals and Medical Devices Agency (JPMDA) was integrated with PubChem, complementing the existing drug information from the U.S. Food and Drug Administration (FDA) color books (Orange Book for small molecule drugs, Purple Book for biologics, and Green Book for animal drugs) and the European Medicines Agency (EMA). In addition, information on the benefits and risks of exposure to medications and other chemicals during pregnancy or breastfeeding was integrated from the MotherToBaby Fact Sheets (https://mothertobaby.org/fact-sheets/).
Many approved drugs are known to be natural products or their derivatives and they continue to play a pivotal role in modern drug discovery. A recent addition to PubChem in this area is information on natural products found in various species, integrated from the Natural Product Activity and Species Source Database (NPASS) (https://bidd.group/NPASS/) (10).
PubChem also integrated data from various resources for drug discovery. Data from MarkerDB (https://markerdb.ca/) (11) was added to PubChem to provide information on chemicals, proteins, and genetic biomarkers, including the biomarker concentration in body fluids (e.g. blood, serum, urine) for normal and various disease conditions. The IUPAC digitized pKa data set (https://github.com/IUPAC/Dissociation-Constants), derived from three reference works (12–14), was integrated into PubChem. This data set contains ‘high-confidence’ pKa values for more than 11 thousand compounds. Other notable new data relevant to drug discovery include target development levels from Pharos (https://pharos.nih.gov/) (15), chemical-macromolecule interactions from the Toxin and Toxin Target Database (T3DB) (http://www.t3db.ca/) (16), and genomic information on drug sensitivity in cancer cells from the Genomics of Drug Sensitivity in Cancer (GDSC) project (https://www.cancerrxgene.org/) (17).
Information on health hazards of chemicals
PubChem now provides health hazard information for chemicals found in the environment, integrated from the U.S. Environmental Protection Agency (USEPA) Integrated Risk Information System (IRIS) (https://www.epa.gov/iris) and Provisional Peer-Reviewed Toxicity Values (PPRTV) program (https://www.epa.gov/pprtv). The data from IRIS includes the reference concentration (RfC) or reference dose (RfD), which are estimates of exposure to the human population that are not likely to pose an appreciable risk of deleterious effects during a lifetime. The PPRTV data includes provisional toxicity values (e.g. provisional RfC or RfD) for health effects of chronic or subchronic exposure to chemicals of concern to the USEPA’s Superfund program. These values are derived from scientific literature and USEPA’s methodologies, practices, and guidance, and are peer-reviewed by USEPA scientists and external scientific experts. In addition, carcinogenicity and reproductive toxicity information for the chemicals on the list maintained under the Safe Drinking Water and Toxic Enforcement Act of 1986 (also known as Proposition 65) was integrated from the California Office of Environmental Health Hazard Assessment (CA-OEHHA) (https://oehha.ca.gov/proposition-65).
Information for chemical health and safety was also integrated from other authoritative sources. For example, regulatory and safety information for chemicals was added to PubChem from the New Zealand Environmental Protection Authority (NZEPA) (https://www.epa.govt.nz/) and the Australian Industrial Chemicals Introduction Scheme (AICIS) (https://www.industrialchemicals.gov.au/). In addition, data relevant to chemical safety was also integrated from the U.S. Chemical Safety and Hazard Investigation Board (CSB) (https://www.csb.gov/). Moreover, the list of chemicals of interest from the Chemical Facility Anti-Terrorism Standards (CFATS) program of the U.S. Department of Homeland Security (DHS) (https://www.cisa.gov/resources-tools/resources/chemical-facility-anti-terrorism-standards-cfats-appendix-chemicals-interest-coi-list) was also added to PubChem.
Information on chemical exposure
Information on chemical exposure was also integrated into PubChem. An example is the data from the USEPA Chemical Data Reporting (CDR) database (https://www.epa.gov/chemical-data-reporting), which contains basic exposure-related information on certain chemicals in commerce that are listed on the Toxic Substances Control Act (TSCA) Chemical Substance Inventory. The CDR data includes physical description, general manufacturing, consumer and industrial uses, and U.S. production and is used by USEPA to screen and prioritize chemicals for identification of potential human health and environmental effects.
Data from the Global Substance Registration System (GSRS) of the U.S. Food and Drug Administration (FDA) was added to PubChem, making it easier to access information on chemical ingredients in FDA-regulated products (such as drugs, food, cosmetics and tobacco products). Moreover, information on chemical ingredients in cosmetics and fragrances can be found in PubChem, thanks to data integration with the Cosmetic Ingredient Review (CIR) (https://cir-safety.org/) and the International Fragrance Association (IFRA) (https://ifrafragrance.org/).
Information for metabolomics
Information on metabolites found in various organisms was integrated from the KNApSAcK Species-Metabolite Database (18) and the Yeast Metabolome Database (YMDB) (19). In addition, thanks to data contribution from the Baker lab at the University of North Carolina at Chapel Hill, among others, PubChem now provides collision cross-section (CCS) values for lipids (20) and per- and polyfluoroalkyl substances (PFAS) (21), experimentally determined through ion mobility spectrometry (IMS).
Information on biological targets of chemicals
As explained in the Introduction, PubChem has data collections that provide target-centric chemical information for a given protein, gene, pathway, cell line, and organism. To assist users in better understanding the information contained in these collections, annotations about biological targets of chemicals were collected from authoritative sources and integrated into PubChem. This includes protein domain information from InterPro (17) as well as protein and genetic interactions in multiple species from the Biological General Repository for Interaction Datasets (BioGRID) (https://thebiogrid.org/) (22) and the Database of Interacting Proteins (DIP) (https://dip.doe-mbi.ucla.edu/) (23). Cell Lines were also annotated with data from Dependency Map (DepMap) (https://depmap.org/) (24) and the GDSC project (17).
Protein, gene, cell line and organism records in PubChem were mapped with corresponding records in popular information resources, including the Human Protein Atlas (https://www.proteinatlas.org/) (25), neXtProt (https://www.nextprot.org/) (26), IntAct Molecular Interaction Database (https://www.ebi.ac.uk/intact/) (27), Gene Curation Coalition Database (GenCC-DB) (https://thegencc.org/) (28), Bgee (https://www.bgee.org/) (29), Open Targets (https://www.opentargets.org/) (30), DepMap, GDSC and Encyclopedia of Life (EOL) (https://eol.org/) (31). This mapping allows users to quickly locate desired information on a given target from the external databases.
Data integration with resources for standard model organisms used in life and biomedical sciences was another recent focus area. Gene records in PubChem were annotated with identifiers used in popular databases for model organisms, including Mouse Genome Informatics (MGI) (32), Rat Genome Database (RGD) (33), the Zebrafish Information Network (ZFIN) (34), FlyBase (for fruit fly) (35), Saccharomyces Genome Database (SGD) (for yeast) (36), PomBase (for fission yeast) (37), WormBase (for Caenorhabditis elegans and related nematodes) (38), and Xenbase (for Xenopus frogs) (39). In addition, the NCBI Pathogen Detection Reference Gene Catalog (https://www.ncbi.nlm.nih.gov/pathogens/refgene/) and the Eukaryotic Pathogen, Vector and Host Informatics Resource (VEuPathDB) (https://veupathdb.org/) were mapped to PubChem Protein and Gene records. This allows users to readily move from PubChem to external resources to find additional information on the organism-specific genes.
Non-small molecule information
While the majority of chemicals contained in PubChem are small molecules, it also has other types of chemicals, including miRNA, siRNA, peptides, lipids, and carbohydrates. Over the past 2 years, PubChem has continued to expand the scope of its chemical coverage. An example is the recent addition of mineral information from the Athena Mineral Database (https://athena.unige.ch/athena/mineral/mineral.html), the RRUFF database (https://rruff.info/), and the Mineralogical Society of America Handbook of Mineralogy (https://handbookofmineralogy.org/). Another example is information on unknown or variable composition, complex reaction products and biological materials (UVCBs) from USEPA TSCA Chemical Substance Inventory, NORMAN Suspect List Exchange (SLE), and Health and Environmental Sciences Institute (HESI) (https://hesiglobal.org/uvcb/), among others. In addition, as explained in our recent paper (40), there have been ongoing efforts to make glycoscience data available in PubChem, exemplified by the addition of glycan-related information from the Symbol Notations for Glycans (SNFG) discussion group (https://www.ncbi.nlm.nih.gov/glycans/snfg.html) (41–43) and GlyConnect (https://glyconnect.expasy.org/) (44).
Consolidated literature panel
More than two million compounds in PubChem have information on scientific articles about them (45). This information is from various sources, including publishers, journals, PubChem substance and assay data depositors, curated data collections, and external databases (e.g. DrugBank (46), the Human Metabolome Database (HMDB) (47) and LactMed (https://www.ncbi.nlm.nih.gov/books/NBK501922/)) (see Table 2). In addition, PubChem computationally generates links between PubChem compound records and PubMed articles that share the same chemical substance Medical Subject Heading (MeSH) annotations. Furthermore, scientific articles (in the form of citations, digital object identifiers (DOIs), PubMed identifiers (PMIDs), and/or PubMed Central identifiers (PMC IDs)) are provided as evidence for the content on a given PubChem record page. The scientific articles about a compound are displayed in the Literature section of its compound summary page.
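As an illustration of how PMIDs linked to a compound might be collected by a script, the following is a minimal sketch that uses PubChem's PUG-REST interface, which is not described in this paper. It assumes the `xrefs/PubMedID` operation and the `InformationList`/`Information` response layout; the identifiers returned this way may be only a subset of what the Literature section displays. The function names are hypothetical, and the sample payload contains placeholder PMIDs rather than real data.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubmed_xref_url(cid):
    """Build the PUG-REST URL that lists PubMed IDs cross-referenced to a CID."""
    return f"{PUG_REST}/compound/cid/{cid}/xrefs/PubMedID/JSON"


def extract_pmids(payload):
    """Collect unique PMIDs from a parsed response (layout is an assumption)."""
    pmids = []
    for info in payload.get("InformationList", {}).get("Information", []):
        pmids.extend(info.get("PubMedID", []))
    return sorted(set(pmids))


def fetch_pmids(cid, timeout=30):
    """Download and parse the PMID list for one compound (needs network access)."""
    with urlopen(pubmed_xref_url(cid), timeout=timeout) as resp:
        return extract_pmids(json.load(resp))


# Offline demonstration with a placeholder payload (not real PMIDs).
sample = {"InformationList": {"Information": [{"CID": 5360696, "PubMedID": [3, 1, 3]}]}}
print(pubmed_xref_url(5360696))
print(extract_pmids(sample))
```

Keeping the URL construction and the parsing separate from the network call lets the parsing logic be checked without contacting the server.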
Table 2.
Literature data sources included in the consolidated literature panel
| Literature data source | Source type |
|---|---|
| Springer Nature | Publisher |
| Thieme Chemistry | Publisher |
| Wiley | Publisher |
| Nature Chemistry | Journal |
| Nature Chemical Biology | Journal |
| Nature Synthesis | Journal |
| Nature Portfolio Journals | Journal |
| Nature Catalysis | Journal |
| Nature Communications | Journal |
| NLM-curated PubMed citations (MeSH) | PubChem |
| Depositor-provided PubMed citations | PubChem |
| PubChem BioAssay | PubChem |
| PubChem Pathways | PubChem |
| HMDB | External database |
| DrugBank | External database |
| LactMed | External database |
It should be emphasized that PubChem does not select specific journals or publishers to include in the Literature section. The journals and publishers listed in Table 2 are those who voluntarily provided PubChem with information on chemicals appearing in their publications to improve the accessibility of scientific information through public databases. The contributed chemical-article association data is presented in the source-specific subsection of the Literature section. PubChem does not add any papers to, or remove any papers from, the source-provided list of articles shown in the source-specific subsections. It is also worth mentioning that the chemical-article association data from journals and publishers complement the data generated from PubMed records through the automated pipeline described in our previous paper (45). This pipeline does not cover chemicals appearing in full-text articles because PubMed records contain the titles and abstracts of articles, but not the full texts. In addition, many chemistry journals are not indexed in PubMed, which is focused on life and biomedical sciences. Therefore, the data integration from journals and publishers to PubChem greatly improves the scope and coverage of the literature data in PubChem, making chemical data more FAIR-compliant.
The articles about a given chemical are grouped by source and displayed in the source-specific subsections of the Literature section. In this approach, the citations for a given chemical are scattered across many tables, making it difficult to get the complete list of articles about the chemical. To address this issue, we introduced a consolidated literature panel, which contains literature data from all sources. This panel is displayed under the ‘Consolidated References’ subsection of the Literature section, as shown in this example (for dextromethorphan (CID 5360696)): https://pubchem.ncbi.nlm.nih.gov/compound/5360696#section=Consolidated-References
Figure 2 shows a partial screenshot of the consolidated literature panel for dextromethorphan. The article list shown in the panel can be searched using a keyword query or sorted by publication date, publication name, PMID, PMCID, or DOI. The panel also allows users to download the list of articles with their metadata in CSV, JSON or XML format. Each article in the list has links to the corresponding PubMed abstract and the article at the publisher's website. When available, a link to the free full-text article in PubMed Central is also presented. This consolidated literature panel allows users to search, sort, and download the data in a single location. The content in the consolidated literature panel is updated weekly.
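The keyword search and date sorting offered by the panel can also be reproduced offline on an exported file. The following is a minimal sketch, assuming a CSV export with the hypothetical column names `pmid`, `doi`, `title` and `publication_date`; the actual column names in a PubChem download may differ, and the rows below are placeholders rather than real articles.

```python
import csv
import io
from datetime import date

# Placeholder data standing in for a downloaded CSV file.
# Column names are assumptions, not PubChem's documented export schema.
SAMPLE_CSV = """pmid,doi,title,publication_date
111,10.0000/example.a,Placeholder article A,2021-05-01
222,10.0000/example.b,Placeholder article B on metabolism,2023-11-15
333,10.0000/example.c,Placeholder article C,2019-02-20
"""


def load_articles(text):
    """Parse CSV text into dicts, converting the date column to date objects."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["publication_date"] = date.fromisoformat(row["publication_date"])
    return rows


def newest_first(rows):
    """Return the articles sorted by publication date, most recent first."""
    return sorted(rows, key=lambda r: r["publication_date"], reverse=True)


def matching(rows, keyword):
    """Return the articles whose title contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return [r for r in rows if kw in r["title"].lower()]


articles = load_articles(SAMPLE_CSV)
print([a["pmid"] for a in newest_first(articles)])
print([a["pmid"] for a in matching(articles, "metabolism")])
```

For a real export, `SAMPLE_CSV` would be replaced by the contents of the downloaded file and the column names adjusted to match its header row.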
Figure 2.
Consolidated References section of the Compound Summary page for dextromethorphan (CID 5360696), accessible at https://pubchem.ncbi.nlm.nih.gov/compound/5360696#section=Consolidated-References.
Patent knowledge panel
The literature knowledge panels in PubChem (48) help users to uncover and summarize important relationships between chemicals, genes, and diseases by analyzing co-occurrences of terms in PubMed records. The analysis and summarization techniques used in the literature knowledge panels were extended to patent documents to develop the patent knowledge panels, which show the entity relationships based on their co-occurrences in patents. The patent knowledge panels display a list of a few selected chemicals, genes, or diseases commonly co-mentioned with a given chemical or gene in granted patents. They can be found in the Patents section of the summary page of a compound or gene record. For example, the Compound summary page of rivaroxaban (CID 9875401) has three patent knowledge panels:
Chemical-chemical co-occurrences: https://pubchem.ncbi.nlm.nih.gov/compound/9875401#section=Chemical-Co-Occurrences-in-Patents
Chemical-gene co-occurrences: https://pubchem.ncbi.nlm.nih.gov/compound/9875401#section=Chemical-Gene-Co-Occurrences-in-Patents
Chemical-disease co-occurrences: https://pubchem.ncbi.nlm.nih.gov/compound/9875401#section=Chemical-Disease-Co-Occurrences-in-Patents
As another example, the following three knowledge panels are displayed on the Gene summary page of the tachykinin receptor 3 (TACR3) gene (NCBI Gene ID 6870):
Gene-chemical co-occurrences: https://pubchem.ncbi.nlm.nih.gov/gene/6870#section=Gene-Chemical-Co-Occurrences-in-Patents
Gene-gene co-occurrences: https://pubchem.ncbi.nlm.nih.gov/gene/6870#section=Gene-Gene-Co-Occurrences-in-Patents
Gene-disease co-occurrences: https://pubchem.ncbi.nlm.nih.gov/gene/6870#section=Gene-Disease-Co-Occurrences-in-Patents
The co-occurrence data underlying the patent knowledge panel can be downloaded interactively or accessed programmatically for further exploration and analysis.
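The six example links above follow a regular pattern, which makes it straightforward to generate the address of a patent knowledge panel for any compound or gene record. The following is a minimal sketch built only from the section names shown in those examples; the function name and the lookup table are hypothetical conveniences. It produces links to the interactive panels and is not a data-retrieval interface.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov"

# Section names copied from the example URLs for rivaroxaban and TACR3.
# Keys are (record type, type of co-occurring entity).
SECTIONS = {
    ("compound", "chemical"): "Chemical-Co-Occurrences-in-Patents",
    ("compound", "gene"): "Chemical-Gene-Co-Occurrences-in-Patents",
    ("compound", "disease"): "Chemical-Disease-Co-Occurrences-in-Patents",
    ("gene", "chemical"): "Gene-Chemical-Co-Occurrences-in-Patents",
    ("gene", "gene"): "Gene-Gene-Co-Occurrences-in-Patents",
    ("gene", "disease"): "Gene-Disease-Co-Occurrences-in-Patents",
}


def patent_panel_url(record_type, record_id, partner):
    """Return the summary-page URL that opens a patent knowledge panel."""
    try:
        section = SECTIONS[(record_type, partner)]
    except KeyError:
        raise ValueError(f"no patent panel for {record_type!r} and {partner!r}")
    return f"{BASE}/{record_type}/{record_id}#section={section}"


print(patent_panel_url("compound", 9875401, "gene"))
print(patent_panel_url("gene", 6870, "disease"))
```

An explicit lookup table is used instead of string concatenation because the compound-to-chemical section name does not repeat the word "Chemical", so the naming is not fully regular.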
It is noteworthy that, while the co-occurrence data for the literature knowledge panels is derived from the titles and abstracts of PubMed records, the data for the patent knowledge panels comes not only from the titles and abstracts of patents but also from other parts such as the claims and descriptions. In addition, whereas PubMed abstracts are typically concise and written for scientific clarity, patents typically prioritize commercial interests and deliberately bury important information deep in the lengthy claims and other sections to make it difficult for competitors to replicate the invention. As a result, the entities listed in the patent knowledge panels differ from those in the literature knowledge panels. As an example, Figure 3 compares the top-3 chemicals listed in the patent and literature knowledge panels for rivaroxaban (CID 9875401), which is a blood thinner used to treat and prevent deep vein thrombosis (DVT) and pulmonary embolism (PE). Note that only one compound (apixaban; CID 10182969) appears in both lists.
Figure 3.
Comparison of the top-3 chemicals listed in the patent knowledge panel (https://pubchem.ncbi.nlm.nih.gov/compound/9875401#section=Chemical-Co-Occurrences-in-Patents) and the literature knowledge panel (https://pubchem.ncbi.nlm.nih.gov/compound/9875401#section=Chemical-Co-Occurrences-in-Literature) for rivaroxaban (CID 9875401). The data were retrieved on 1 September 2024.
Updated PubChemRDF
PubChem provides its data in a machine-readable format, using the resource description framework (RDF) (https://www.w3.org/RDF/). This machine-readable data, also known as PubChemRDF (49), can be downloaded from the PubChem FTP site and imported into triplestores (e.g. Apache Jena TDB or OpenLink Virtuoso) or RDF-aware graph databases (e.g. Neo4j) to enable schema-less data access and querying. PubChemRDF helps researchers work with PubChem data on local computing resources using semantic web technologies.
Over the past 2 years, new subdomains (author, book, journal, grant, and organization) for the metadata of scientific articles were added. In addition, a data model was developed to express the co-occurrence relationships among chemicals, genes, and diseases in RDF. Along with the new subdomains for literature metadata, this data model was used to augment the existing PubChemRDF data with the co-occurrence data used in the literature knowledge panels. The semantic relationships between entities in the co-occurrence data model were defined using relevant terms from the PubChem-specific vocabulary as well as several external ontologies, including the Funding, Research Administration and Projects Ontology (FRAPO) (https://sparontologies.github.io/frapo/current/frapo.html), Dublin Core Metadata Initiative (DCMI) metadata terms (https://www.dublincore.org/specifications/dublin-core/dcmi-terms/), Semanticscience Integrated Ontology (SIO) (50), Publishing Requirements for Industry Standard Metadata (PRISM) vocabularies (https://www.w3.org/submissions/prism/) and EMBRACE Data And Methods (EDAM) ontology (51).
Thanks to the addition of the co-occurrence data to PubChemRDF, users now can access a large set of co-occurring entities (up to 1 000) for a given entity and automate this data retrieval using a computer program or script. This enhancement to PubChemRDF may facilitate the semantic exploration of biomedical knowledge and foster scientific discoveries. Importantly, it seamlessly connects with linked data resources from various scientific communities, enhancing biomedical data accessibility and usability.
Summary pages for records without discrete chemical structures
Each record in the PubChem Compound data collection has a Compound Summary page, which provides a comprehensive view of all information available in PubChem for a chemical. As mentioned in the Introduction, the Compound collection contains unique chemical structures extracted from the Substance collection, through chemical structure standardization (3,5). However, because some records in the Substance collection (e.g. biologics, minerals, UVCBs, polymers and glycans) do not have discrete chemical structures to standardize, they cannot be entered into the Compound collection. As a result, although those without discrete chemical structures often have valuable information within PubChem and other information resources, it was not possible in the past to deliver this information to PubChem users. To address this issue, an internal data collection containing chemicals without discrete structures was created and the information in this collection was used to generate dedicated web pages for them, as in the following examples:
Biologics (e.g. basiliximab): https://pubchem.ncbi.nlm.nih.gov/compound/Basiliximab
Minerals (e.g. asbestos): https://pubchem.ncbi.nlm.nih.gov/compound/Asbestos
UVCBs (e.g. naphtha): https://pubchem.ncbi.nlm.nih.gov/compound/Naphtha
Polymers (e.g. polyethylene glycol (PEG)): https://pubchem.ncbi.nlm.nih.gov/compound/Polyethylene%20Glycol
Glycans (e.g. chitin): https://pubchem.ncbi.nlm.nih.gov/compound/Chitin
It is noteworthy that these pages contain a substantial amount of information, collected from various authoritative sources. It is also worth mentioning that, as indicated at the top of these pages (see Figure 4), chemicals without discrete structures do not have Compound identifiers (CID) because they are not in the Compound collection.
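Because these records have no CID, the example pages listed above are addressed by chemical name, with characters such as spaces percent-encoded (as in `Polyethylene%20Glycol`). The following is a minimal sketch of building such a link; the function name is hypothetical, and whether a page exists for an arbitrary name is not guaranteed.

```python
from urllib.parse import quote

COMPOUND_BASE = "https://pubchem.ncbi.nlm.nih.gov/compound/"


def summary_page_url(name):
    """Build a name-based summary page URL, percent-encoding the name."""
    # safe="" ensures characters such as "/" in a name are encoded as well.
    return COMPOUND_BASE + quote(name, safe="")


for name in ["Basiliximab", "Asbestos", "Naphtha", "Polyethylene Glycol", "Chitin"]:
    print(summary_page_url(name))
```

The five names used here are those given in the examples above; other names would need to be checked against PubChem individually.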
Figure 4.
Partial screenshot of the summary page for naphtha, available at https://pubchem.ncbi.nlm.nih.gov/compound/Naphtha (accessed on 1 September 2024).
Summary
In the last 2 years, there have been significant updates to PubChem. Notably, data integration with >130 new data sources has enabled PubChem to provide information on over 119 million compounds from >1000 data sources. Importantly, information useful for drug discovery has been added to PubChem from various sources. This includes information on approved drugs (from Drugs@FDA and JPMDA), natural products (from NPASS), biomarkers (from MarkerDB), and toxins (from T3DB). Also notable are annotations about effects of exposure to medications during pregnancy and breastfeeding (from MotherToBaby fact sheets), target development levels (from Pharos), genomic information on drug sensitivity in cancer cells (from the GDSC project), and pKa values for over 11 000 compounds (from IUPAC). In addition, PubChem has significantly expanded its data content to provide comprehensive information on safety, health hazards, and environmental effects of chemical exposure, by integrating data from authoritative sources such as USEPA, NZEPA, FDA, CSB, CFATS, and CA-OEHHA. This expansion covers a wide range of chemicals from those found in the environment to those used in consumer products.
Annotations about biological targets (e.g. protein and genetic interactions and protein domain) from multiple sources such as InterPro, BioGRID, DIP, DepMap and GDSC were added to help users better understand how chemicals interact with biological targets. Proteins, genes, cell lines, and organisms in PubChem were linked to relevant entries in popular external resources, including databases focused on popular model organisms (e.g. mice, rats, fruit flies, yeast, zebrafish, frogs and C. elegans). This mapping allows researchers to seamlessly switch between PubChem and these external resources to find more detailed information on the specific target of interest. Additionally, PubChem was integrated with the NCBI Pathogen Detection Reference Gene Catalog and VEuPathDB, making it easier to explore information on proteins and genes relevant to infectious diseases.
PubChem has information on scientific articles about a given chemical. This information comes from various sources. In the past, this information was scattered across different parts of a compound summary page. To improve the usability of this data, PubChem has introduced a ‘consolidated literature panel’ that combines all references into one searchable and downloadable list. This allows researchers to easily find, sort, and export all relevant articles for a chemical in one place. The data underlying this panel is updated weekly.
The patent knowledge panels were developed using the analysis and summarization techniques used for the literature knowledge panels. The patent knowledge panels for a given chemical or gene display entities (i.e. chemicals, genes, diseases) that are co-mentioned with that chemical or gene in patent documents. These panels allow users to explore potential relationships between co-occurring entities in patents. The underlying co-occurrence data can be downloaded interactively or accessed programmatically for further exploration and analysis.
PubChemRDF has been expanded to include the co-occurrence data used in the literature knowledge panel. In conjunction with new subdomains added to encode metadata of scientific articles in RDF, the co-occurrence data model allows researchers to use semantic web technologies to explore the relationship between chemicals, genes, and diseases based on their co-occurrences in scientific literature, potentially accelerating scientific discoveries.
PubChem has expanded the scope of its data coverage to chemicals with non-discrete structures (e.g. biologics, minerals, polymers, UVCBs, and glycans), by providing a dedicated summary web page for each of them. This greatly improved the usability and accessibility of the data available for these chemical entities.
Acknowledgements
We greatly appreciate the hundreds of data contributors for making their data openly accessible within PubChem, the many extensive efforts of our scientific community collaborators, the steady stream of feedback from users, and the many key community efforts that make the open data science ecosystem possible. Special thanks go to the entire NCBI staff (especially to the help desk and systems support teams). This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health.
Contributor Information
Sunghwan Kim, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Jie Chen, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Tiejun Cheng, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Asta Gindulyte, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Jia He, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Siqian He, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Qingliang Li, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Benjamin A Shoemaker, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Paul A Thiessen, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Bo Yu, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Leonid Zaslavsky, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Jian Zhang, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Evan E Bolton, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Data availability
All PubChem data, tools, and services are provided to the public free of charge. They are accessible from the PubChem homepage (https://pubchem.ncbi.nlm.nih.gov).
Funding
National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health. Funding for open access charge: National Center for Biotechnology Information, National Library of Medicine.
Conflict of interest statement. None declared.
References
- 1. Kim S., Chen J., Cheng T., Gindulyte A., He J., He S., Li Q., Shoemaker B.A., Thiessen P.A., Yu B. et al. PubChem 2023 update. Nucleic Acids Res. 2023; 51:D1373–D1380.
- 2. Kim S. Exploring chemical information in PubChem. Current Protocols. 2021; 1:e217.
- 3. Kim S., Thiessen P.A., Bolton E.E., Chen J., Fu G., Gindulyte A., Han L.Y., He J.E., He S.Q., Shoemaker B.A. et al. PubChem substance and compound databases. Nucleic Acids Res. 2016; 44:D1202–D1213.
- 4. Kim S., Cheng T.J., He S.Q., Thiessen P.A., Li Q.L., Gindulyte A., Bolton E.E. PubChem protein, gene, pathway, and taxonomy data collections: bridging biology and chemistry through target-centric views of PubChem data. J. Mol. Biol. 2022; 434:167514.
- 5. Hähnke V.D., Kim S., Bolton E.E. PubChem chemical structure standardization. J. Cheminformatics. 2018; 10:36.
- 6. Kim S., Chen J., Cheng T.J., Gindulyte A., He J., He S.Q., Li Q.L., Shoemaker B.A., Thiessen P.A., Yu B. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019; 47:D1102–D1109.
- 7. Kim S., Chen J., Cheng T.J., Gindulyte A., He J., He S.Q., Li Q.L., Shoemaker B.A., Thiessen P.A., Yu B. et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. 2021; 49:D1388–D1395.
- 8. Kim S., Thiessen P.A., Cheng T.J., Yu B., Bolton E.E. An update on PUG-REST: RESTful interface for programmatic access to PubChem. Nucleic Acids Res. 2018; 46:W563–W570.
- 9. Kim S., Thiessen P.A., Bolton E.E., Bryant S.H. PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic Acids Res. 2015; 43:W605–W611.
- 10. Zhao H., Yang Y., Wang S., Yang X., Zhou K., Xu C., Zhang X., Fan J., Hou D., Li X. et al. NPASS database update 2023: quantitative natural product activity and species source database for biomedical research. Nucleic Acids Res. 2023; 51:D621–D628.
- 11. Wishart D.S., Bartok B., Oler E., Liang K.Y.H., Budinski Z., Berjanskii M., Guo A., Cao X., Wilson M. MarkerDB: an online database of molecular biomarkers. Nucleic Acids Res. 2021; 49:D1259–D1267.
- 12. International Union of Pure and Applied Chemistry, Serjeant E.P., Dempsey B. Ionisation Constants of Organic Acids in Aqueous Solution. 1979; Oxford: Pergamon.
- 13. International Union of Pure and Applied Chemistry, Perrin D.D. Dissociation Constants of Organic Bases in Aqueous Solution. 1965; London.
- 14. International Union of Pure and Applied Chemistry, Perrin D.D. Dissociation Constants of Organic Bases in Aqueous Solution, Supplement. 1972; London: Butterworths.
- 15. Kelleher K.J., Sheils T.K., Mathias S.L., Yang J.J., Metzger V.T., Siramshetty V.B., Nguyen D.-T., Jensen L.J., Vidović D., Schürer S.C. et al. Pharos 2023: an integrated resource for the understudied human proteome. Nucleic Acids Res. 2023; 51:D1405–D1416.
- 16. Wishart D., Arndt D., Pon A., Sajed T., Guo A.C., Djoumbou Y., Knox C., Wilson M., Liang Y., Grant J. et al. T3DB: the toxic exposome database. Nucleic Acids Res. 2015; 43:D928–D934.
- 17. Yang W., Soares J., Greninger P., Edelman E.J., Lightfoot H., Forbes S., Bindal N., Beare D., Smith J.A., Thompson I.R. et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 2013; 41:D955–D961.
- 18. Shinbo Y., Nakamura Y., Altaf-Ul-Amin M., Asahi H., Kurokawa K., Arita M., Saito K., Ohta D., Shibata D., Kanaya S. KNApSAcK: A Comprehensive Species-Metabolite Relationship Database. In: Saito K., Dixon R.A., Willmitzer L. (eds). Plant Metabolomics. 2006; 57: Berlin: Springer-Verlag Berlin; 165–181.
- 19. Ramirez-Gaona M., Marcu A., Pon A., Guo A.C., Sajed T., Wishart N.A., Karu N., Djoumbou Feunang Y., Arndt D., Wishart D.S. YMDB 2.0: A significantly expanded version of the yeast metabolome database. Nucleic Acids Res. 2017; 45:D440–D445.
- 20. Kirkwood K.I., Christopher M.W., Burgess J.L., Littau S.R., Foster K., Richey K., Pratt B.S., Shulman N., Tamura K., MacCoss M.J. et al. Development and application of multidimensional lipid libraries to investigate lipidomic dysregulation related to smoke inhalation injury severity. J. Proteome Res. 2022; 21:232–242.
- 21. Foster M., Rainey M., Watson C., Dodds J.N., Kirkwood K.I., Fernández F.M., Baker E.S. Uncovering PFAS and other xenobiotics in the dark metabolome using ion mobility spectrometry, mass defect analysis, and machine learning. Environ. Sci. Technol. 2022; 56:9133–9143.
- 22. Oughtred R., Rust J., Chang C., Breitkreutz B.J., Stark C., Willems A., Boucher L., Leung G., Kolas N., Zhang F. et al. The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 2021; 30:187–200.
- 23. Salwinski L., Miller C.S., Smith A.J., Pettit F.K., Bowie J.U., Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004; 32:D449–D451.
- 24. Tsherniak A., Vazquez F., Montgomery P.G., Weir B.A., Kryukov G., Cowley G.S., Gill S., Harrington W.F., Pantel S., Krill-Burger J.M. et al. Defining a cancer dependency map. Cell. 2017; 170:564–576.
- 25. Pontén F., Jirström K., Uhlen M. The Human Protein Atlas - a tool for pathology. J. Pathol. 2008; 216:387–393.
- 26. Zahn-Zabal M., Michel P.-A., Gateau A., Nikitin F., Schaeffer M., Audot E., Gaudet P., Duek P.D., Teixeira D., Rech de Laval V. et al. The neXtProt knowledgebase in 2020: data, tools and usability improvements. Nucleic Acids Res. 2020; 48:D328–D334.
- 27. del Toro N., Shrivastava A., Ragueneau E., Meldal B., Combe C., Barrera E., Perfetto L., How K., Ratan P., Shirodkar G. et al. The IntAct database: efficient access to fine-grained molecular interaction data. Nucleic Acids Res. 2022; 50:D648–D653.
- 28. DiStefano M.T., Goehringer S., Babb L., Alkuraya F.S., Amberger J., Amin M., Austin-Tse C., Balzotti M., Berg J.S., Birney E. et al. The gene curation coalition: a global effort to harmonize gene–disease evidence resources. Genet. Med. 2022; 24:1732–1742.
- 29. Bastian F.B., Roux J., Niknejad A., Comte A., Fonseca Costa S.S., de Farias T.M., Moretti S., Parmentier G., de Laval V.R., Rosikiewicz M. et al. The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals. Nucleic Acids Res. 2021; 49:D831–D847.
- 30. Ochoa D., Hercules A., Carmona M., Suveges D., Baker J., Malangone C., Lopez I., Miranda A., Cruz-Castillo C., Fumis L. et al. The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Res. 2023; 51:D1353–D1359.
- 31. Parr C.S., Wilson N., Leary P., Schulz K.S., Lans K., Walley L., Hammock J.A., Goddard A., Rice J., Studer M. et al. The Encyclopedia of Life v2: providing global access to knowledge about life on Earth. Biodiversity Data Journal. 2014; 2:e1079.
- 32. Baldarelli R.M., Smith C.L., Ringwald M., Richardson J.E., Bult C.J., Group M.G.I. Mouse Genome Informatics: an integrated knowledgebase system for the laboratory mouse. Genetics. 2024; 227:iyae031.
- 33. Vedi M., Smith J.R., Thomas Hayman G., Tutaj M., Brodie K.C., De Pons J.L., Demos W.M., Gibson A.C., Kaldunski M.L., Lamers L. et al. 2022 updates to the Rat Genome Database: A Findable, Accessible, Interoperable, and Reusable (FAIR) resource. Genetics. 2023; 224:iyad042.
- 34. Bradford Y.M., Van Slyke C.E., Ruzicka L., Singer A., Eagle A., Fashena D., Howe D.G., Frazer K., Martin R., Paddock H. et al. Zebrafish information network, the knowledgebase for Danio rerio research. Genetics. 2022; 220:iyac016.
- 35. Öztürk-Çolak A., Marygold S.J., Antonazzo G., Attrill H., Goutte-Gattat D., Jenkins V.K., Matthews B.B., Millburn G., dos Santos G., Tabone C.J. et al. FlyBase: updates to the Drosophila genes and genomes database. Genetics. 2024; 227:iyad211.
- 36. Wong E.D., Miyasato S.R., Aleksander S., Karra K., Nash R.S., Skrzypek M.S., Weng S., Engel S.R., Cherry J.M. Saccharomyces genome database update: server architecture, pan-genome nomenclature, and external resources. Genetics. 2023; 224:iyac191.
- 37. Rutherford K.M., Lera-Ramírez M., Wood V. PomBase: a global core biodata resource—growth, collaboration, and sustainability. Genetics. 2024; 227:iyae007.
- 38. Sternberg P.W., Van Auken K., Wang Q., Wright A., Yook K., Zarowiecki M., Arnaboldi V., Becerra A., Brown S., Cain S. et al. WormBase 2024: status and transitioning to alliance infrastructure. Genetics. 2024; 227:iyae050.
- 39. Fisher M., James-Zorn C., Ponferrada V., Bell A.J., Sundararaj N., Segerdell E., Chaturvedi P., Bayyari N., Chu S., Pells T. et al. Xenbase: key features and resources of the Xenopus model organism knowledgebase. Genetics. 2023; 224:iyad018.
- 40. Kim S., Zhang J., Cheng T., Li Q., Bolton E.E. Glycoscience data content in the NCBI Glycans and PubChem. Anal. Bioanal. Chem. 2024; doi:10.1007/s00216-024-05459-7.
- 41. Varki A., Cummings R.D., Aebi M., Packer N.H., Seeberger P.H., Esko J.D., Stanley P., Hart G., Darvill A., Kinoshita T. et al. Symbol nomenclature for graphical representations of glycans. Glycobiology. 2015; 25:1323–1324.
- 42. Neelamegham S., Aoki-Kinoshita K., Bolton E., Frank M., Lisacek F., Lütteke T., O’Boyle N., Packer N.H., Stanley P., Toukach P. et al. Updates to the symbol nomenclature for glycans guidelines. Glycobiology. 2019; 29:620–624.
- 43. Lewis A.L., Toukach P., Bolton E., Chen X., Frank M., Lütteke T., Knirel Y., Schoenhofen I., Varki A., Vinogradov E. et al. Cataloging natural sialic acids and other nonulosonic acids (NulOs), and their representation using the symbol nomenclature for glycans. Glycobiology. 2023; 33:99–103.
- 44. Alocci D., Mariethoz J., Gastaldello A., Gasteiger E., Karlsson N.G., Kolarich D., Packer N.H., Lisacek F. GlyConnect: glycoproteomics goes visual, interactive, and analytical. J. Proteome Res. 2019; 18:664–677.
- 45. Kim S., Thiessen P.A., Cheng T., Yu B., Shoemaker B.A., Wang J., Bolton E.E., Wang Y., Bryant S.H. Literature information in PubChem: associations between PubChem records and scientific articles. J. Cheminformatics. 2016; 8:32.
- 46. Knox C., Wilson M., Klinger C.M., Franklin M., Oler E., Wilson A., Pon A., Cox J., Chin N.E., Strawbridge S.A. et al. DrugBank 6.0: the DrugBank Knowledgebase for 2024. Nucleic Acids Res. 2024; 52:D1265–D1275.
- 47. Wishart D.S., Guo A., Oler E., Wang F., Anjum A., Peters H., Dizon R., Sayeeda Z., Tian S., Lee B.L. et al. HMDB 5.0: the Human Metabolome Database for 2022. Nucleic Acids Res. 2022; 50:D622–D631.
- 48. Zaslavsky L., Cheng T., Gindulyte A., He S., Kim S., Li Q., Thiessen P., Yu B., Bolton E.E. Discovering and summarizing relationships between chemicals, genes, proteins, and diseases in PubChem. Front. Res. Metrics Analyt. 2021; 6:689059.
- 49. Fu G., Batchelor C., Dumontier M., Hastings J., Willighagen E., Bolton E. PubChemRDF: towards the semantic annotation of PubChem compound and substance databases. J. Cheminformatics. 2015; 7:34.
- 50. Dumontier M., Baker C.J.O., Baran J., Callahan A., Chepelev L., Cruz-Toledo J., Del Rio N.R., Duck G., Furlong L.I., Keath N. et al. The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery. J. Biomed. Semantics. 2014; 5:14.
- 51. Ison J., Kalaš M., Jonassen I., Bolser D., Uludag M., McWilliam H., Malone J., Lopez R., Pettifer S., Rice P. EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats. Bioinformatics. 2013; 29:1325–1332.