PMC Article Datasets
Interested in automated retrieval of PubMed Central (PMC) articles in machine-readable formats? PMC and the NCBI Bookshelf offer several large datasets of journal articles and other scientific publications made available for retrieval under license terms (e.g., Creative Commons licenses) that generally allow for more liberal redistribution and reuse than a traditional copyrighted work.
April 13, 2026: Update on PMC Article Dataset Distribution Changes
As announced on February 12, major changes to PMC's Article Dataset Distribution Services are underway.
On April 13, all legacy files for the PMC Article Datasets were moved to new temporary directories and prefixes on the PMC FTP and Cloud Services.
- FTP Service: all legacy files were moved to a new directory named "deprecated."
- Cloud Service: all legacy prefixes were updated to add "deprecated" to the prefix. Prefixes for legacy files now begin with //pmc-oa-opendata/deprecated/.
This intentional disruption alerts users to the upcoming changes to the PMC Cloud Service on AWS, while allowing for easy updates to keep existing automated workflows running. We encourage users of the legacy PMC FTP and PMC Cloud Services to begin working with the updated PMC Cloud Service structure and to adjust existing workflows.
All legacy files on the FTP and Cloud Services will be removed in August 2026.
For complete details about this transition, please see the NCBI Insights blog post and our documentation on Accessing PMC Article Datasets Using Amazon Web Services.
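For workflows that still read legacy files from the Cloud Service, the prefix change described above amounts to placing "deprecated/" in front of the old key. The Python sketch below illustrates one way to rewrite such paths. It assumes S3-style URIs of the form s3://pmc-oa-opendata/... and that the rest of each key is unchanged after the move; neither is confirmed by the announcement. The function name and the example key are hypothetical.

```python
from urllib.parse import urlparse

# Bucket name taken from the prefix quoted in the announcement.
BUCKET = "pmc-oa-opendata"
# Assumption: legacy keys were moved unchanged under this prefix.
DEPRECATED_PREFIX = "deprecated/"


def to_deprecated_uri(legacy_uri: str) -> str:
    """Return the temporary 'deprecated' location for a legacy URI (hypothetical helper)."""
    parsed = urlparse(legacy_uri)
    if parsed.scheme != "s3" or parsed.netloc != BUCKET:
        raise ValueError(f"not a {BUCKET} S3 URI: {legacy_uri!r}")
    key = parsed.path.lstrip("/")
    # Leave keys that already sit under the deprecated prefix untouched.
    if not key.startswith(DEPRECATED_PREFIX):
        key = DEPRECATED_PREFIX + key
    return f"s3://{BUCKET}/{key}"


# Hypothetical legacy key, for illustration only.
print(to_deprecated_uri("s3://pmc-oa-opendata/example/legacy/file.xml"))
```

Because the legacy files are scheduled for removal in August 2026, a rewrite like this is at most a stopgap while a workflow is moved to the updated Cloud Service structure.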
- Not all articles in PMC are available for text mining and other reuse.
- The PMC Cloud Service, PMC OAI-PMH Service, PMC FTP Service, E-Utilities and BioC API are the only services that may be used for automated retrieval of PMC content. Systematic retrieval (or bulk retrieval) of articles through any other automated process is prohibited.
- License terms vary. Please refer to the license statement in each article for specific terms of use.
- Users of these datasets are directly and solely responsible for compliance with copyright restrictions and are expected to adhere to the terms and conditions defined by the copyright holder (see the PMC Copyright Notice).
About the Datasets
| Dataset | Content | License Terms |
|---|---|---|
| In August 2026, PMC Article Datasets will be removed from the current PMC FTP Service. | | |
| PMC Open Access Subset | The PMC Open Access Subset (or PMC OA Subset) contains millions of full-text open access article files made available under Creative Commons or similar license terms or with publisher permission. This dataset includes retractions, corrections, and expressions of concern*. Also included are select articles from the PMC COVID-19 Collection that continue to be made available under terms that allow for secondary analysis and reuse. | Broken down by license type: Commercial use allowed: CC0, CC BY, CC BY-SA, CC BY-ND; Non-commercial use only: CC BY-NC, CC BY-NC-SA, CC BY-NC-ND; Other: no machine-readable Creative Commons license, no license tagged, or a custom license |
| Author Manuscript Dataset | The Author Manuscript Dataset consists of full-text files of hundreds of thousands of Author Accepted Manuscripts (AAMs) that have been made available in PMC under a partner funder's policy. This dataset includes retractions, corrections, and expressions of concern*. | Default license: "This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law." AAMs that include a Creative Commons license are also available via the PMC Open Access Subset. |
| Historical OCR Dataset | Full-text files of OCR'd text from articles with Creative Commons licenses published in the 18th, 19th, and 20th centuries, added to PMC as part of an NLM Digitization Project. | Articles available for reuse all have Creative Commons licenses and are therefore also part of the PMC Open Access Subset. |
| There are no changes being made to the LitArch Open Access Subset. | | |
| LitArch Open Access Subset | The LitArch Open Access Subset contains the full text of thousands of the books and documents in the NLM Literature Archive. | Creative Commons or similar license |

[The original table's How to Access, XML, TXT, and Media & Supplemental Files columns contained icons that are not reproduced here. In the Author Manuscript Dataset row, two of these icons carried the note "for manuscripts with Creative Commons licenses."]

* Retractions, corrections, and expressions of concern can be identified in the downloadable XML files by checking the article-type attribute of the <article> element for the value "retraction", "correction", or "expression-of-concern". In plain text files, look for Retraction, Correction, or Expression of Concern in the Front section. They can also be found using a search query of "articletyperetraction", "articletypecorrection", or "articletypeexpressionofconcern", respectively.
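
The XML check described in the note above can be sketched with Python's standard library. This is a minimal illustration, not an official PMC tool: the helper name is hypothetical, the sample documents are simplified, made-up snippets, and real article files may need a more tolerant parser (for example, if they rely on entities defined in an external DTD).

```python
import xml.etree.ElementTree as ET

# Attribute values named in the note above.
NOTICE_TYPES = {"retraction", "correction", "expression-of-concern"}


def notice_type(xml_text: str):
    """Return the notice type of an article XML string, or None (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        tag = element.tag
        # Strip a namespace prefix such as "{uri}article" if one is present.
        if isinstance(tag, str) and tag.rsplit("}", 1)[-1] == "article":
            value = element.get("article-type")
            return value if value in NOTICE_TYPES else None
    return None


# Simplified, made-up documents for illustration only.
retraction = '<article article-type="retraction"><front/></article>'
research = '<article article-type="research-article"><front/></article>'

print(notice_type(retraction))
print(notice_type(research))
```

The function returns the matching attribute value for a retraction, correction, or expression of concern, and None for any other article type or when no <article> element is found.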
