PMC Article Datasets

Interested in automated retrieval of articles in machine-readable formats in PubMed Central (PMC)? PMC and the NCBI Bookshelf offer several large datasets of journal articles and other scientific publications. These are made available for retrieval under license terms (e.g., Creative Commons licenses) that generally allow for more liberal redistribution and reuse than a traditional copyrighted work.

February 12, 2026: Changes to PMC Article Datasets Distribution Services Coming in 2026

PMC will make major changes to our Article Dataset Distribution Services in 2026. In August 2026, you will need to access full text article data files through the PMC Cloud Service instead of the PMC FTP Service. This change will provide you with more reliable performance, faster retrieval times, and greater flexibility in retrieving only the types and number of files you wish to work with.

Since this may impact operational workflows, we are providing a transition period from February to August. During this time, the FTP Service, OA Web Service API, and the current PMC Cloud Service files will remain available concurrently with the updated PMC Cloud Service on AWS.

For complete details about this transition, please see the NCBI Insights blog post and our documentation on Accessing PMC Article Datasets Using Amazon Web Services.

Not all articles in PMC are available for text mining and other reuse.
The PMC Cloud Service, PMC OAI-PMH Service, PMC FTP Service, E-Utilities and BioC API are the only services that may be used for automated retrieval of PMC content. Systematic retrieval (or bulk retrieval) of articles through any other automated process is prohibited.
License terms vary. Please refer to the license statement in each article for specific terms of use.
Users of this dataset are directly and solely responsible for compliance with copyright restrictions and are expected to adhere to the terms and conditions defined by the copyright holder (see the PMC Copyright Notice).

About the Datasets

Dataset	Content	License Terms	How to Access	PDF	Media & Supplemental Files
In August 2026, PMC Article Datasets will be removed from the current PMC FTP Service. Learn more.
PMC Open Access Subset	The PMC Open Access Subset (or PMC OA Subset) contains millions of full-text open access article files made available under Creative Commons or similar license terms, or with publisher permission. This dataset includes retractions, corrections, and expressions of concern*. Also included are select articles from the PMC COVID-19 Collection that continue to be made available under terms that allow for secondary analysis and reuse.	Broken down by license type: Commercial use allowed: CC0, CC BY, CC BY-SA, CC BY-ND; Non-commercial use only: CC BY-NC, CC BY-NC-SA, CC BY-NC-ND; Other: no machine-readable Creative Commons license, no license tagged, or a custom license	Cloud Service, FTP Service, PMC OAI-PMH Service, OA Web Service API, E-Utilities, BioC API
Author Manuscript Dataset	The Author Manuscript Dataset consists of full-text files of hundreds of thousands of Author Accepted Manuscripts (AAMs) that have been made available in PMC under a partner funder's policy. This dataset includes retractions, corrections, and expressions of concern*.	Default license: "This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law." AAMs that include a Creative Commons license are also available via the PMC Open Access Subset.	Cloud Service FTP Service PMC OAI-PMH Service BioC API	for manuscripts with Creative Commons licenses	for manuscripts with Creative Commons licenses
Historical OCR Dataset	Full-text files of OCR'd text from articles with Creative Commons licenses published in the 18th, 19th, and 20th centuries and added to PMC as part of an NLM Digitization Project.	Articles available for reuse all have Creative Commons licenses and are therefore also part of the PMC Open Access Subset.	Cloud Service, FTP Service, PMC OAI-PMH Service, BioC API
No changes are being made to the LitArch Open Access Subset.
LitArch Open Access Subset	The LitArch Open Access Subset contains the full text of thousands of books and documents in the NLM Literature Archive.	Creative Commons or similar license	FTP Service, Bookshelf OAI-PMH Service
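
The license groupings listed for the PMC Open Access Subset can be expressed as a small lookup. The sketch below is illustrative only: the function name `license_group`, the group labels it returns, and the assumption that a license arrives as a short code string such as "CC BY-NC" are not part of the PMC documentation. The license statement in each article remains the authoritative source for terms of use.

```python
# License codes as grouped in the table above.
COMMERCIAL_USE_ALLOWED = {"CC0", "CC BY", "CC BY-SA", "CC BY-ND"}
NON_COMMERCIAL_USE_ONLY = {"CC BY-NC", "CC BY-NC-SA", "CC BY-NC-ND"}


def license_group(code):
    """Map a license code to one of three illustrative group labels.

    The labels ("commercial", "noncommercial", "other") are hypothetical
    names chosen for this sketch, not values defined by PMC.
    """
    # Normalize case and collapse runs of whitespace; treat None as empty.
    normalized = " ".join(code.upper().split()) if code else ""
    if normalized in COMMERCIAL_USE_ALLOWED:
        return "commercial"
    if normalized in NON_COMMERCIAL_USE_ONLY:
        return "noncommercial"
    # No machine-readable CC license, no license tagged, or a custom license.
    return "other"


if __name__ == "__main__":
    for code in ["CC BY", "cc by-nc-nd", "custom publisher license", None]:
        print(repr(code), "->", license_group(code))
```

A lookup like this may help when filtering a local copy of the dataset by permitted use, but it does not replace checking the license statement in each article.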

* Retractions, corrections, and expressions of concern can be identified in the downloadable XML files by checking the article-type attribute of the <article> element for the value "retraction", "correction", or "expression-of-concern". In plain text files, look for Retraction, Correction, or Expression of Concern in the Front section. Retractions, corrections, or expressions of concern can also be found using a search query of "articletyperetraction", "articletypecorrection" or "articletypeexpressionofconcern" respectively.
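
The attribute check described in the footnote can be sketched in Python with the standard library. This is a minimal illustration, not an official PMC tool: the function name `notice_type` and the inline sample document are assumptions made for the example, and real files may need extra handling (for instance, DOCTYPE declarations or wrapper elements around `<article>`).

```python
import xml.etree.ElementTree as ET

# article-type values named in the footnote above.
NOTICE_TYPES = {"retraction", "correction", "expression-of-concern"}


def notice_type(xml_text):
    """Return the article-type if it marks a retraction, correction, or
    expression of concern; otherwise return None.

    `notice_type` is a hypothetical helper written for this sketch.
    """
    root = ET.fromstring(xml_text)
    # Walk the tree (root included) and use the first <article> element,
    # ignoring any namespace prefix on the tag name.
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "article":
            value = elem.get("article-type")
            return value if value in NOTICE_TYPES else None
    return None


if __name__ == "__main__":
    # Tiny stand-in document; real PMC XML files are far larger.
    sample = '<article article-type="retraction"><front/></article>'
    print(notice_type(sample))
```

For a file on disk, the same check could be applied to the text read from that file. Articles of any other type, such as research articles, return None.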

Last modified: Fri March 20 2026