PMC Article Datasets
Interested in automated retrieval of articles in machine-readable formats in PubMed Central (PMC)? PMC and the NCBI Bookshelf offer several large datasets of journal articles and other scientific publications made available for retrieval under license terms that generally allow for more liberal redistribution and reuse than a traditional copyrighted work (e.g., Creative Commons licenses).
NLM provides cloud service access to the PMC Open Access Subset and the PMC Author Manuscript Dataset for faster retrieval. As part of this service, content from these datasets is accessible to users on Amazon Web Services (AWS), without charge, through either an HTTPS or S3 URL, and without any log-in requirement for retrieval. Cloud Service documentation is available on the PMC Cloud Service and Accessing PMC Article Datasets Using AWS pages.
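As a minimal sketch of how the two URL forms relate: an S3 URL of the form s3://bucket/key generally corresponds to a virtual-hosted-style HTTPS URL on AWS. The bucket name, object key, and helper function names below are hypothetical placeholders, not the actual PMC locations; the real paths are given in the Cloud Service documentation.

```python
from urllib.parse import quote, urlparse
from urllib.request import urlopen


def s3_to_https(s3_url: str) -> str:
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {s3_url!r}")
    key = parsed.path.lstrip("/")
    # quote() percent-encodes unsafe characters but leaves "/" intact.
    return f"https://{parsed.netloc}.s3.amazonaws.com/{quote(key)}"


def fetch(https_url: str) -> bytes:
    """Download one object anonymously over HTTPS (no log-in needed)."""
    with urlopen(https_url, timeout=30) as response:
        return response.read()


# Hypothetical bucket and key, for illustration only.
EXAMPLE_S3_URL = "s3://example-pmc-bucket/some/prefix/PMC0000000.xml"
print(s3_to_https(EXAMPLE_S3_URL))
```

The fetch helper is defined but not called here, since the example URL does not point at a real object.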
- Not all articles in PMC are available for text mining and other reuse.
- The PMC Cloud Service, PMC OAI-PMH Service, PMC FTP Service, E-Utilities, and BioC API are the only services that may be used for automated retrieval of PMC content. Systematic retrieval (or bulk retrieval) of articles through any other automated process is prohibited.
- License terms vary. Please refer to the license statement in each article for specific terms of use.
- Users of this dataset are directly and solely responsible for compliance with copyright restrictions and are expected to adhere to the terms and conditions defined by the copyright holder (see the PMC Copyright Notice).
About the Datasets
Dataset | Content | License Terms
---|---|---
PMC Open Access Subset | The PMC Open Access Subset (or PMC OA Subset) contains millions of full-text open access article files made available under Creative Commons or similar license terms, or with publisher permission. This dataset includes retractions, corrections, and expressions of concern*. Also included are select articles from the PMC COVID-19 Collection that continue to be made available under terms that allow for secondary analysis and reuse. | Broken down by license type. Commercial use allowed: CC0, CC BY, CC BY-SA, CC BY-ND. Non-commercial use only: CC BY-NC, CC BY-NC-SA, CC BY-NC-ND. Other: no machine-readable Creative Commons license, no license tagged, or a custom license.
Author Manuscript Dataset | The Author Manuscript Dataset consists of full-text files of hundreds of thousands of accepted author manuscripts (AAMs) that have been made available in PMC under a partner funder's policy. This dataset includes retractions, corrections, and expressions of concern*. | Default license: "This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law." AAMs that include a Creative Commons license are also available via the Open Access Subset.
Historical OCR Dataset | Full-text files of OCR'd text from articles published in the 18th, 19th, and 20th centuries, added to PMC as part of an NLM Digitization Project. | Files are generally made available for text mining. Articles added more recently may also include a Creative Commons license and therefore will also be available via the Open Access Subset.
LitArch Open Access Subset | The LitArch Open Access Subset contains the full text of thousands of books and documents in the NLM Literature Archive. | Creative Commons or similar license
* Retractions, corrections, and expressions of concern can be identified in the downloadable XML files by the article-type attribute of the <article> element, which is set to "retraction", "correction", or "expression-of-concern". In plain text files, look for Retraction, Correction, or Expression of Concern in the Front section. They can also be found using the search filters retraction[filter], correction[filter], and expression of concern[filter], respectively.
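The XML check described in the note above could be sketched as follows. This is an illustrative example only: the function name and the inline sample documents are made up for the demonstration, and real PMC files are much larger than these stubs.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The article-type values named in the note above.
NOTICE_TYPES = {"retraction", "correction", "expression-of-concern"}


def notice_type(xml_text: str) -> Optional[str]:
    """Return the article-type if the document is a retraction,
    correction, or expression of concern; otherwise return None."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so this works whether
    # <article> is the root element or nested in a wrapper.
    article = next(root.iter("article"), None)
    if article is None:
        return None
    value = article.get("article-type")
    return value if value in NOTICE_TYPES else None


# Made-up sample documents, for illustration only.
SAMPLE_RETRACTION = '<article article-type="retraction"><front/></article>'
SAMPLE_RESEARCH = '<article article-type="research-article"><front/></article>'

print(notice_type(SAMPLE_RETRACTION))
print(notice_type(SAMPLE_RESEARCH))
```

Returning the matched value rather than a boolean lets a caller group files by notice type in one pass.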