PMC Article Datasets
Interested in automated retrieval of articles in machine-readable formats from PubMed Central (PMC)? PMC and the NCBI Bookshelf offer several large datasets of journal articles and other scientific publications. These are made available for retrieval under license terms (e.g., Creative Commons licenses) that generally allow for more liberal redistribution and reuse than a traditional copyrighted work.
February 12, 2026: Changes to PMC Article Datasets Distribution Services Coming in 2026
PMC will make major changes to our Article Datasets Distribution Services in 2026. Starting in August 2026, you will need to access full-text article data files through the PMC Cloud Service instead of the PMC FTP Service. This change will provide you with more reliable performance, faster retrieval times, and greater flexibility in retrieving only the types and number of files you wish to work with.
Since this may impact operational workflows, we are providing a transition period from February to August. During this time, the FTP Service, OA Web Service API, and the current PMC Cloud Service files will remain available concurrently with the updated PMC Cloud Service on AWS.
For complete details about this transition, please see the NCBI Insights blog post and our documentation on Accessing PMC Article Datasets Using Amazon Web Services.
- Not all articles in PMC are available for text mining and other reuse.
- The PMC Cloud Service, PMC OAI-PMH Service, PMC FTP Service, E-Utilities and BioC API are the only services that may be used for automated retrieval of PMC content. Systematic retrieval (or bulk retrieval) of articles through any other automated process is prohibited.
- License terms vary. Please refer to the license statement in each article for specific terms of use.
- Users of this dataset are directly and solely responsible for compliance with copyright restrictions and are expected to adhere to the terms and conditions defined by the copyright holder (see the PMC Copyright Notice).
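E-Utilities is one of the permitted services named in the list above. As a rough illustration of what a single-article request might look like, the sketch below builds an EFetch URL for one PMC record. The endpoint and the `db`, `id`, `tool`, and `email` parameters come from NCBI's public E-utilities documentation rather than from this page, and the function name and default values are hypothetical placeholders. The code only constructs the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities EFetch endpoint (not named on this page).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_pmc_efetch_url(pmcid, tool="example_tool", email="user@example.org"):
    """Return an EFetch URL for one PMC record (hypothetical helper).

    `tool` and `email` are placeholder values; replace them with your own.
    """
    # Accept either "PMC1234567" or "1234567" and keep only the numeric part.
    numeric_id = str(pmcid).upper().removeprefix("PMC")
    if not numeric_id.isdigit():
        raise ValueError(f"not a PMCID: {pmcid!r}")
    params = {"db": "pmc", "id": numeric_id, "tool": tool, "email": email}
    return f"{EFETCH_URL}?{urlencode(params)}"
```

Whether a given record can be retrieved and reused still depends on the license statement in the article itself.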
About the Datasets
| Content | License Terms | How to Access | XML | TXT | Media & Supplemental Files | ||
|---|---|---|---|---|---|---|---|
| In August 2026, PMC Article Datasets will be removed from the current PMC FTP Service. Learn more. | |||||||
| PMC Open Access Subset | The PMC Open Access Subset (or PMC OA Subset) contains millions of full-text open access article files made available under Creative Commons or similar license terms or with publisher permission. This dataset includes retractions, corrections, and expressions of concern*. Also included are select articles from the PMC COVID-19 Collection that continue to be made available under terms that allow for secondary analysis and reuse. | Broken down by license type: Commercial use allowed (CC0, CC BY, CC BY-SA, CC BY-ND); Non-commercial use only (CC BY-NC, CC BY-NC-SA, CC BY-NC-ND); Other (no machine-readable Creative Commons license, no license tagged, or a custom license) | ![]() | ![]() | ![]() | ![]() | |
| Author Manuscript Dataset | The Author Manuscript Dataset consists of full-text files of hundreds of thousands of Author Accepted Manuscripts (AAMs) that have been made available in PMC under a partner funder's policy. This dataset includes retractions, corrections, and expressions of concern*. | Default license: "This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law." AAMs that include a Creative Commons license are also available via the PMC Open Access Subset. | ![]() | ![]() | ![]() for manuscripts with Creative Commons licenses | ![]() for manuscripts with Creative Commons licenses | |
| Historical OCR Dataset | Full-text files of OCR'd text from articles with Creative Commons licenses published in the 18th, 19th, and 20th centuries, added to PMC as part of an NLM Digitization Project. | Articles available for reuse all have Creative Commons licenses and are therefore also part of the PMC Open Access Subset. | ![]() | ![]() | ![]() | ![]() | |
| There are no changes being made to the LitArch Open Access Subset. | |||||||
| LitArch Open Access Subset | The LitArch Open Access Subset contains the full text of thousands of books and documents in the NLM Literature Archive. | Creative Commons or similar license | ![]() | | | | |
* Retractions, corrections, and expressions of concern can be identified in the downloadable XML files by looking for the attribute article-type="retraction", article-type="correction", or article-type="expression-of-concern" in the <article> element. In plain text files, look for Retraction, Correction, or Expression of Concern in the Front section. Retractions, corrections, or expressions of concern can also be found using a search query of "articletyperetraction", "articletypecorrection", or "articletypeexpressionofconcern", respectively.
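The XML check described in the note above can be sketched in a few lines of Python. This is a minimal illustration using the standard library's `xml.etree.ElementTree`, and the function name `find_notice_type` is hypothetical. It assumes each file contains a single `<article>` element, either as the root or nested inside a wrapper, and it ignores any XML namespace prefix on the tag.

```python
import xml.etree.ElementTree as ET

# The three article-type values named in the note above.
NOTICE_TYPES = {"retraction", "correction", "expression-of-concern"}


def find_notice_type(xml_text):
    """Return the article-type if it marks a notice, otherwise None."""
    root = ET.fromstring(xml_text)
    # root.iter() yields the root first, then its descendants.
    for element in root.iter():
        # Drop a "{namespace}" prefix, if any, to compare local names.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "article":
            article_type = element.get("article-type")
            return article_type if article_type in NOTICE_TYPES else None
    return None


sample = '<article article-type="retraction"><front/></article>'
print(find_notice_type(sample))
```

For very large files, a streaming parser such as `ET.iterparse` may be a better fit, since the attribute sits on the opening `<article>` tag and the rest of the file need not be loaded.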

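The license breakdown given for the PMC Open Access Subset (commercial use allowed, non-commercial use only, other) can be expressed as a small lookup. The sketch below is illustrative only: the function name is hypothetical, and it assumes the license has already been reduced to a short label such as "CC BY-NC". How a license is actually recorded in a given file is not specified here, so a real pipeline would need its own step to derive that label.

```python
# Groups taken from the license breakdown for the PMC Open Access Subset.
COMMERCIAL_USE_ALLOWED = {"CC0", "CC BY", "CC BY-SA", "CC BY-ND"}
NON_COMMERCIAL_ONLY = {"CC BY-NC", "CC BY-NC-SA", "CC BY-NC-ND"}


def license_group(label):
    """Map a short license label to one of the three groups (hypothetical helper).

    Anything unrecognized, including a missing label, falls into "other",
    mirroring the "no license tagged, or a custom license" case.
    """
    # Normalize case and whitespace so "cc  by-nc" matches "CC BY-NC".
    normalized = " ".join((label or "").upper().split())
    if normalized in COMMERCIAL_USE_ALLOWED:
        return "commercial"
    if normalized in NON_COMMERCIAL_ONLY:
        return "noncommercial"
    return "other"
```

A grouping like this is a convenience for sorting files, not a substitute for reading the license statement in each article.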