FTP Service

NLM provides cloud service access to the PMC Open Access Subset and the PMC Author Manuscript Dataset for faster retrieval. As part of this service, content from these datasets is accessible to users on Amazon Web Services (AWS), without charge, through either an HTTPS or S3 URL, and without any log-in requirement for retrieval. Cloud Service documentation is available on the PMC Cloud Service and Accessing PMC Article Datasets Using AWS pages.

The PMC File Transfer Protocol (FTP) Service supports usage of the PMC Article Datasets with the following services:

Bulk download

Individual article download

  • Available for: PMC Open Access Subset only
  • Packages include: XML, PDF (if present), media files, and supplementary materials for a single article

PDF download

  • Available for: PMC Open Access Subset only
  • Individual PDFs of articles: only available for non-commercial use licensed articles

PMC ID Cross-referencing

  • Cross reference any PMC article ID with identifiers such as PubMed IDs, DOIs, and Author Manuscript IDs
  • File: PMC-ids.csv.gz, a file in the top-level FTP directory
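
A minimal sketch of such a cross-reference lookup in Python is given here. It assumes the file has a header row and that its columns include names such as PMCID, PMID, and DOI; those column names and the helper name find_ids are assumptions for illustration, so check the header of the real file first. The file is streamed rather than loaded whole because it can be large.

```python
import csv
import gzip


def find_ids(source, wanted, key="PMCID"):
    """Return {key value: row dict} for the requested identifiers.

    `source` is a path or a binary file object holding gzip-compressed CSV.
    `key` is the column to match on; the default "PMCID" is an assumption
    about the header of PMC-ids.csv.gz.
    """
    wanted = set(wanted)
    found = {}
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            if row.get(key) in wanted:
                found[row[key]] = row
    return found
```

Calling find_ids("PMC-ids.csv.gz", ["PMC555938"]) on a downloaded copy would return the matching row as a dictionary keyed by column name, provided the assumed column exists.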

Base FTP URL: https://ftp.ncbi.nlm.nih.gov/pub/pmc

*Tip* If you are having difficulties with FTP, please consider trying the HTTPS protocol instead, e.g. https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/oa_comm_xml.incr.2021-09-17.filelist.csv. NCBI also supports secure FTP via SFTP.
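
As a rough illustration of retrieval over HTTPS, the following Python sketch streams a URL to a local file using only the standard library. The function name fetch and the 60-second timeout are illustrative choices, not part of the service.

```python
import shutil
import urllib.request


def fetch(url, dest_path, timeout=60):
    """Stream `url` to `dest_path` without holding the whole file in memory."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)
    return dest_path
```

For example, fetch("https://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz", "PMC-ids.csv.gz") would retrieve the cross-reference file, assuming it sits directly under the base URL as described above. Dated files such as the incremental file list in the tip may no longer exist once a new baseline has been created.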

If you have questions or comments about the PMC FTP Service, please write to the PMC help desk. Further information on retrieving full text and other common developer queries can be found on the Developer Resources page.

Bulk Download

If you are interested only in the metadata and text of an article or author manuscript, bulk download is likely the right option. Bulk packages group hundreds of thousands of articles, in XML or plain text format, into compressed packages (note: the Historical OCR Dataset is available only in plain text format). If you are also interested in media files, supplementary materials, or PDFs, please see the sections on Individual Article Download and PDF Download.

Baseline Packages Update Schedule

New baseline packages will be created at least two times per year. Previous baseline and incremental packages and the accompanying file lists will be deleted whenever a new baseline is created.

New baselines will be created:

  • mid-June
  • mid-December
  • as needed*

*PMC is sometimes required to suppress an article from public view for legal reasons if the case involves a legal injunction or a breach of patient privacy. In such cases, a new set of baseline packages will be created for the impacted dataset. This is not a frequent occurrence.

Directories Organized by Dataset, License Terms, and File Content Type

Bulk downloads are available on the FTP Service by dataset:

  • PMC Open Access Subset - Bulk
  • Author Manuscript Dataset - Bulk
  • Historical OCR Dataset - Bulk

We have further divided the PMC Open Access Subset bulk packages into three groups based on available license terms:

  • Commercial Use Allowed - CC0, CC BY, CC BY-SA, and CC BY-ND licenses
  • Non-Commercial Use Only - CC BY-NC, CC BY-NC-SA, and CC BY-NC-ND licenses
  • Other - no machine-readable license, no license, or a custom license

  • PMC OA Subset - Commercial Use
  • PMC OA Subset - Non-Commercial Use Only
  • PMC OA Subset - Other

To access the complete PMC OA Subset you will need to retrieve ALL of the OA Subset packages. These groups are complementary rather than duplicative.
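
The license grouping described above can be sketched as a small Python function that maps a license string to one of the three OA Subset groups. The function name oa_group and the whitespace and case normalization are assumptions for illustration; the exact spelling of license values in any given file should be checked against the data.

```python
# License sets taken from the three groups listed above.
COMMERCIAL = {"CC0", "CC BY", "CC BY-SA", "CC BY-ND"}
NONCOMMERCIAL = {"CC BY-NC", "CC BY-NC-SA", "CC BY-NC-ND"}


def oa_group(license_type):
    """Return 'oa_comm', 'oa_noncomm', or 'oa_other' for a license string."""
    # Upper-case and collapse runs of whitespace before comparing.
    normalized = " ".join(license_type.upper().split())
    if normalized in COMMERCIAL:
        return "oa_comm"
    if normalized in NONCOMMERCIAL:
        return "oa_noncomm"
    # Missing, custom, or non-machine-readable licenses fall through here.
    return "oa_other"
```

Because the three groups do not overlap, each article maps to exactly one group, which is why all three must be retrieved to cover the complete OA Subset.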

Each of these datasets or groupings is divided into separate directories by file content type: XML (xml/) and plain text (txt/). The baseline packages for each of these OA Subset groups and for the Author Manuscript Dataset are divided by PMCID range (e.g., PMC004XXXXXX) in order to keep package sizes reasonable.

The result is the following directory structure:

|_ manuscript/
|___ txt/
|___ xml/
|_ oa_bulk/
|___ oa_comm/
|_____ txt/
|_____ xml/
|___ oa_noncomm/
|_____ txt/
|_____ xml/
|___ oa_other/
|_____ txt/
|_____ xml/
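
As an illustration, the directory tree above can be turned into URLs with a small Python helper. The names DATASET_DIRS and bulk_dir_url are hypothetical, and the sketch covers only the directories shown in the tree.

```python
BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc"

# Relative directories as shown in the tree above.
DATASET_DIRS = {
    "manuscript": "manuscript",
    "oa_comm": "oa_bulk/oa_comm",
    "oa_noncomm": "oa_bulk/oa_noncomm",
    "oa_other": "oa_bulk/oa_other",
}


def bulk_dir_url(dataset, content_type):
    """Build the URL of a bulk directory, e.g. ('oa_comm', 'xml')."""
    if dataset not in DATASET_DIRS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if content_type not in ("xml", "txt"):
        raise ValueError(f"unknown content type: {content_type!r}")
    return f"{BASE_URL}/{DATASET_DIRS[dataset]}/{content_type}/"
```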

File Lists

File lists in CSV and txt formats are available for each package.

Note: Author manuscripts have different metadata information available than PMC OA Subset articles, so do not assume the same structure for the file lists for these two different datasets.

Sample Bulk File Names

  • Baseline file list: oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.filelist.csv
  • Baseline: oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.tar.gz
  • Incremental file list: oa_comm_xml.incr.2021-09-17.filelist.csv
  • Incremental update: oa_comm_xml.incr.2021-09-17.tar.gz
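
The naming pattern in these samples can be parsed with a regular expression. The following Python sketch is inferred only from the four sample names above, so the pattern and the names BulkName and parse_bulk_name are assumptions rather than a published specification.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample file names; not an official grammar.
_NAME = re.compile(
    r"^(?P<group>[a-z_]+)_(?P<content>xml|txt)"
    r"\.(?:(?P<pmcid_range>PMC\d{3}X{6})\.baseline|incr)"
    r"\.(?P<date>\d{4}-\d{2}-\d{2})"
    r"\.(?P<suffix>tar\.gz|filelist\.csv|filelist\.txt)$"
)


class BulkName(NamedTuple):
    group: str                   # e.g. "oa_comm"
    content: str                 # "xml" or "txt"
    pmcid_range: Optional[str]   # None for incremental files
    date: str                    # "YYYY-MM-DD"
    suffix: str                  # "tar.gz", "filelist.csv", or "filelist.txt"


def parse_bulk_name(name):
    """Split a bulk package or file list name into its parts."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized bulk file name: {name!r}")
    return BulkName(**match.groupdict())
```

A result whose pmcid_range is None is an incremental file; any other value indicates a baseline file.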

In each of the sample file names above you can substitute various parts to get to the files you want, e.g.

  • Replace oa_comm with oa_noncomm to get PMC OA Subset non-commercial use articles, or with oa_other to get PMC OA Subset articles without explicitly tagged Creative Commons licenses. Replace it with author_manuscript to get author manuscripts.
  • Replace _xml with _txt to get plain text files vs. XML files
  • Replace baseline with incr to switch from a baseline file to one of the daily incremental files; be sure to update the date and remove the PMC00#XXXXXX from the file name
  • Replace PMC003XXXXXX with PMC008XXXXXX in baseline file names to get the articles in the specified grouping with PMCIDs in the range from PMC8000000 to PMC8999999; to get all articles you must retrieve all the PMCID ranges
  • Replace the date (e.g. 2021-09-16) with the new baseline date if the baseline has been updated since this documentation was written; replace the date for incremental files with the date you want to retrieve
  • Replace .csv with .txt as the file extension for the file list to get a tab separated plain text version of the file list
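
The substitutions above can be sketched as a Python helper that assembles a file name from its parts. The function and parameter names are hypothetical, and the zero-padded range label follows the PMC008XXXXXX example (PMC8000000 to PMC8999999) given in the list.

```python
def pmcid_range_for(pmcid_number):
    """Return the baseline range label for a numeric PMCID.

    Follows the example above: 8123456 falls in PMC008XXXXXX.
    """
    return f"PMC{pmcid_number // 1_000_000:03d}XXXXXX"


def bulk_file_name(group, content, date, pmcid_range=None,
                   file_list=False, list_ext="csv"):
    """Assemble a bulk file name.

    Pass `pmcid_range` for a baseline file; omit it for an incremental file.
    Set `file_list=True` for the accompanying file list (csv or txt).
    """
    kind = f"{pmcid_range}.baseline" if pmcid_range else "incr"
    suffix = f"filelist.{list_ext}" if file_list else "tar.gz"
    return f"{group}_{content}.{kind}.{date}.{suffix}"
```

For instance, bulk_file_name("oa_comm", "xml", "2021-09-16", "PMC003XXXXXX") reproduces the sample baseline name, and dropping the range argument yields the incremental form.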

Individual Article Download (PMC Open Access Subset Only)

PMC Open Access Subset Individual Article Packages

If you only want to download some of the PMC OA Subset based on search criteria or if you want to download complete packages for articles that include XML, PDF, media, and supplementary materials, you will need to use the individual article download packages. To keep directories from getting too large, the packages have been randomly distributed into a two-level-deep directory structure. You can use the file lists in CSV or txt format to search for the location of specific files or you can use the OA Web Service API. The file lists and OA Web Service API also provide basic article metadata.

  • Filenames: PMCXXXXXXX.tar.gz where the X's represent a specific PMCID
  • File lists: oa_file_list.csv or oa_file_list.txt (located one level up, in the top-level PMC FTP directory)

The first line of each file list is the timestamp the file was written. Subsequent rows contain metadata for each article.

Each row is divided into 7 metadata fields, delimited by comma or tab characters, for example:

oa_package/66/8b/PMC555938.tar.gz   BMC Bioinformatics. 2005 Mar 7; 6:44   PMC555938   2023-06-11 23:35:18   15748298   CC BY   no

The fields in the files are:

  • The fully qualified name of the .tar.gz file for an article
  • The article citation, comprising the journal title abbreviation, publication date, volume, issue, and the page range or elocation ID
  • PMC accession number (PMCID)
  • Last updated timestamp
  • PubMed ID (PMID)
  • License type - The value for "license type" can be any of the standard Creative Commons license variants (e.g., "CC BY"; "CC BY-NC"; "CC BY-NC-ND") or "NO-CC CODE". "NO-CC CODE" appears when the license is missing, has custom terms (i.e., not a Creative Commons license), or is not machine decodable.
  • Retracted - The value for "Retracted" can be either "yes" or "no" to indicate whether this article is known by NLM to be retracted.
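
A minimal Python sketch of reading such a file list follows, under a few assumptions: the tab-delimited txt variant is used by default, the first line holds only the timestamp, and the path in the first field is relative to the base FTP URL. The names OaRecord, read_oa_file_list, and package_url are hypothetical.

```python
import csv
from dataclasses import dataclass

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc"


@dataclass
class OaRecord:
    path: str
    citation: str
    pmcid: str
    last_updated: str
    pmid: str
    license_type: str
    retracted: bool


def read_oa_file_list(lines, delimiter="\t"):
    """Return (timestamp, records) from an iterable of file list lines."""
    rows = csv.reader(lines, delimiter=delimiter)
    header = next(rows, [])
    timestamp = delimiter.join(header)  # first line: when the file was written
    records = [
        OaRecord(*row[:6], retracted=row[6].strip().lower() == "yes")
        for row in rows
        if len(row) >= 7  # skip blank or short rows
    ]
    return timestamp, records


def package_url(record):
    """Assumes the path field is relative to the base FTP URL."""
    return f"{BASE_URL}/{record.path}"
```

Passing an open text file (for example, open("oa_file_list.txt", encoding="utf-8", newline="")) as lines processes the list row by row; use delimiter="," for the CSV variant.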

PDF Download (PMC Open Access Subset Only)

PMC Open Access Subset PDF Files

Individual article PDF downloads are only available for non-commercial use licensed articles. To keep directories from getting too large, the article PDFs have been randomly distributed into a two-level-deep directory structure. You can use the oa_non_comm_use_pdf file lists in CSV or txt format to search for the location of specific files, or you can use the OA Web Service API. The file lists and OA Web Service API also provide basic article citation and license information, as well as the date the article was last updated in PMC.

  • Filenames: filename.PMCXXXXXXX.pdf where filename is the original name of the source file and the X's represent a specific PMCID
  • File lists: oa_non_comm_use_pdf.csv or oa_non_comm_use_pdf.txt (Located in the top level PMC FTP directory)
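
Given the filename.PMCXXXXXXX.pdf convention above, the PMCID can be recovered from a PDF path with a short regular expression. This Python sketch assumes the PMCID is the last dot-separated component before the .pdf extension; the function name split_pdf_name is hypothetical.

```python
import re

# Source name, then ".PMC<digits>.pdf" at the end of the file name.
_PDF_NAME = re.compile(r"^(?P<source_name>.+)\.(?P<pmcid>PMC\d+)\.pdf$")


def split_pdf_name(path):
    """Return (original source file name, PMCID) for a PDF path or file name."""
    base = path.rsplit("/", 1)[-1]  # drop any leading directories
    match = _PDF_NAME.match(base)
    if match is None:
        raise ValueError(f"unrecognized PDF file name: {path!r}")
    return match.group("source_name"), match.group("pmcid")
```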

License

Articles in these datasets are made available consistent with either the terms of applicable article-level license statements or the funder’s policy. See PMC Copyright for more information.

Contact

pubmedcentral@ncbi.nlm.nih.gov

How to Cite

See the individual dataset pages for how to cite the PMC Open Access Subset and PMC Author Manuscript Dataset.