Accessing PMC Article Datasets Using Amazon Web Services
As part of our Cloud Service, PMC makes the datasets described below freely accessible on Amazon Web Services (AWS), without charge, through either an HTTPS or S3 URL, and without any log-in requirement for retrieval. The National Library of Medicine works with the AWS Open Data Sponsorship Program to provide this access. Read on to learn why and how you may access these datasets from our AWS Cloud Service.
February 12, 2026: Changes to PMC Article Datasets Distribution Services Coming in 2026
PMC will make major changes to our Article Dataset Distribution Services in 2026. In August 2026, you will need to access full text article data files through the PMC Cloud Service instead of the PMC FTP Service. This change will provide you with more reliable performance, faster retrieval times, and greater flexibility in retrieving only the types and number of files you wish to work with.
Since this may impact operational workflows, we are providing a transition period from February to August. During this time, the FTP Service, OA Web Service API, and the current PMC Cloud Service files will remain available concurrently with the updated PMC Cloud Service on AWS.
For complete details about this transition, please see the NCBI Insights blog post and our documentation on Accessing PMC Article Datasets Using Amazon Web Services.
Description and Location of PMC Article Datasets on AWS
Resource type: S3 Bucket, world-readable
Amazon Resource Name (ARN): arn:aws:s3:::pmc-oa-opendata
AWS Region: us-east-1
AWS CLI (Command Line Interface) Access (No AWS account required if you use --no-sign-request)
aws s3 ls s3://pmc-oa-opendata --no-sign-request
Updated Structure of the PMC Article Datasets on AWS
An article in PMC may have more than one version associated with the same Accession ID (PMCID), for example an author manuscript and a published version. The updated PMC Article Datasets on AWS are organized by article version. They include the following components:
Article Objects
Objects for each of the roughly 8 million PMC article versions are collected under a prefix named by the PMC Accession ID and the article version number.
Sample prefix: PMC12855588.1
The files located under this prefix include:
- XML: an XML file for the article, encoded according to the latest version of ANSI/NISO Z39.96-2015 JATS, and named corresponding to the parent prefix.
  Sample filename: PMC12855588.1.xml
- TXT: a plain text version of the article, extracted from the XML and also named by article version.
  Sample filename: PMC12855588.1.txt
- PDF: a PDF file for the article, named corresponding to the parent prefix.
  Sample filename: PMC12855588.1.pdf
- JSON metadata: a JSON object listing core metadata (see below), named by the parent prefix.
  Sample filename: PMC12855588.1.json
- Media and Supplementary Files: additional media and supplementary files, when permissible by the publishers' licenses, namely:
  - the PDF file of the article.
    Sample filename: PMC12855588.1.pdf
  - media and supplementary data files.
    Sample filenames: gr1.jpg, gr2.jpg, mmc1.pdf
JSON Metadata Objects
The JSON metadata objects for each article version are additionally collected under a metadata prefix. The JSON includes the following properties:
- pmcid: the PubMed Central Accession ID
- version: the article version number
- pmid: the PubMed ID (PMID), if one is associated with the article
- doi: the Digital Object Identifier
- title: the article title
- citation: the journal citation
- is_pmc_openaccess: whether the article version is part of the PMC Open Access Subset
- is_manuscript: whether the article version is an author manuscript
- is_historical_ocr: whether the article version is part of Historical OCR Dataset
- is_retracted: whether the article version has been retracted
- license_code: a code for the license. This value can be:
  - a code corresponding to any of the Creative Commons licenses.
    Samples: CC BY, CC BY-NC, CC BY-NC-ND
  - 'TDM', which stands for text and data mining, for author manuscripts without Creative Commons licenses where the full text is available for text mining, and where the full text may also be used consistent with the principles of fair use under the copyright law.
  - null, when an article does not have a machine-readable Creative Commons license and isn't in the Author Manuscript Dataset. The article has been indicated to be open access, but there isn’t a machine-readable Creative Commons license.
  NOTE:
  License terms vary. Please refer to the license statement in each article for specific terms of use.
  Users of these datasets are directly and solely responsible for compliance with copyright restrictions and are expected to adhere to the terms and conditions defined by the copyright holder (see the PMC Copyright Notice).
- xml_url: the S3 URL to the article XML
- pdf_url: the S3 URL to the article PDF, if available
- media_urls: a list of S3 URLs to images and supplementary data files, if available
- text_url: the S3 URL to the plain text file
NOTE: All S3 URLs in the JSON object include the MD5 digest of the object in the form of a URL parameter, md5.
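As a minimal sketch of working with these URLs, the shell functions below split an S3 URL from the JSON metadata into its object key and md5 value, build a matching HTTPS URL, and check a downloaded file against the digest. The function names are hypothetical. The sketch assumes the URL form s3://pmc-oa-opendata/<key>?md5=<digest>, assumes that the HTTPS URL is the bucket endpoint followed by the object key, and assumes md5sum (GNU coreutils) is installed.

```shell
# Hypothetical helpers; they only define functions and make no network calls.

s3_url_key() {
  # Drop the "s3://<bucket>/" part, then any "?md5=..." suffix.
  printf '%s\n' "$1" | sed -e 's|^s3://[^/]*/||' -e 's|?.*$||'
}

s3_url_md5() {
  # Print the value of the md5 URL parameter (prints nothing if it is absent).
  printf '%s\n' "$1" | sed -n 's|^.*[?&]md5=\([0-9a-fA-F]*\).*$|\1|p'
}

s3_url_to_https() {
  # Assumption: HTTPS URLs are the bucket endpoint followed by the object key.
  printf 'https://pmc-oa-opendata.s3.amazonaws.com/%s\n' "$(s3_url_key "$1")"
}

md5_matches() (
  # Succeeds when the MD5 digest of file $1 equals the expected digest $2.
  # Uses md5sum; on macOS, `md5 -q` is the usual substitute.
  actual=$(md5sum "$1" | awk '{ print $1 }')
  [ "$actual" = "$2" ]
)

# Possible usage (requires network access):
#   url='s3://pmc-oa-opendata/PMC12855588.1/PMC12855588.1.xml?md5=99210445a56f3315af7e95797556bd6b'
#   curl -s -O "$(s3_url_to_https "$url")"
#   md5_matches PMC12855588.1.xml "$(s3_url_md5 "$url")" && echo "checksum OK"
```

The AWS CLI commands in this document take the S3 form without the md5 parameter, so s3_url_key is also a convenient way to get a key for aws s3 cp.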
Sample JSON Object
{"pmcid": "PMC12855588", "version": 1, "pmid": 41623473, "doi": "10.1016/j.isci.2025.114581", "mid": null,
"title": "Study on sodium ion supplementation performance of CNT-coated sodium oxalate in sodium ion batteries",
"citation": "iScience. 2025 Dec 31;29(2):114581. doi: 10.1016/j.isci.2025.114581", "is_pmc_openaccess": true,
"is_manuscript": false, "is_historical_ocr": false, "is_retracted": false, "license_code": "CC BY",
"pdf_url": "s3://pmc-oa-opendata/PMC12855588.1/PMC12855588.1.pdf?md5=8546223bd7ec0f01313ec7c4903fb9dc",
"xml_url": "s3://pmc-oa-opendata/PMC12855588.1/PMC12855588.1.xml?md5=99210445a56f3315af7e95797556bd6b",
"text_url": "s3://pmc-oa-opendata/PMC12855588.1/PMC12855588.1.txt?md5=95d78e2bc6318fb728abb7137ad0049e",
"media_urls": ["s3://pmc-oa-opendata/PMC12855588.1/fx1.jpg?md5=7e7e95a91dedf32760eda1032d632d24",
"s3://pmc-oa-opendata/PMC12855588.1/gr1.jpg?md5=39209b49c555fa939d5108a4b6c4fe92",
"s3://pmc-oa-opendata/PMC12855588.1/gr2.jpg?md5=2bd4128fe00c8d6c823e048c4b31948c",
…
"s3://pmc-oa-opendata/PMC12855588.1/mmc1.pdf?md5=e3b8ac93beae30fc40ae7e01421ab566"]}
Amazon S3 Inventory File
An Amazon S3 inventory in CSV format is located at s3://pmc-oa-opendata/inventory-reports/pmc-oa-opendata/ and is updated once a day.
Inventory CSV files contain the following fields:
- Bucket name - always pmc-oa-opendata
- JSON metadata object name - e.g. metadata/PMC10009416.1.json
- Last modified date - the object creation date or the last modified date of the JSON metadata file, whichever is later.
  Sample: "2026-01-21T19:03:55.000Z"
- ETag - the entity tag or checksum of the JSON metadata object. The JSON contains the MD5 checksum of each object belonging to the article version. Therefore, a change to any of the objects is reflected in a changed ETag.
  Sample: "79d91888d335940aa62371f41b9fe2f7"
See the section below titled Accessing the Inventory for information on finding the latest version of the inventory files.
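To illustrate how the four fields can be used once an inventory CSV has been downloaded and uncompressed, here is a small sketch with awk. It assumes every field is double-quoted and comma-separated, as in the samples above, and the function names are hypothetical.

```shell
# Hypothetical helpers for uncompressed inventory CSV rows of the form
#   "bucket","metadata/PMCxxxx.N.json","last-modified","etag"
# Both read the files given as arguments, or standard input if none are given.

inventory_article_versions() {
  # Field 2 is the JSON metadata object name; strip the "metadata/" prefix
  # and the ".json" suffix to get the article version, e.g. PMC8267076.1
  awk -F'","' '{ key = $2
                 sub(/^metadata\//, "", key)
                 sub(/\.json$/, "", key)
                 print key }' "$@"
}

inventory_modified_since() (
  # Keep rows whose last modified date (field 3) is on or after $1.
  # ISO 8601 timestamps sort correctly as plain strings.
  since=$1
  shift
  awk -F'","' -v since="$since" '($3 "") >= (since "")' "$@"
)

# Possible usage on a downloaded file:
#   inventory_modified_since 2026-02-22 inventory.csv | inventory_article_versions
```

Each printed value is also the name of the bucket prefix that holds the files for that article version.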
Versioning and Update Frequency
Article versions are updated continuously. Updates include:
- addition of new article versions
- the update of one or all objects belonging to an existing article version
- in rare cases, the removal of an article version
The Amazon inventory is updated on a daily basis. This means the inventory lags the bucket state and may not cover the most recent changes.
S3 Bucket Schematic Overview
Below is a schematic overview of the S3 Bucket with two article versions. The old prefixes and files used in our original cloud distribution will be removed in August 2026 as indicated in the schematic below.
s3://pmc-oa-opendata/
|-- PMC10009416.1
| |-- NPR2-43-85-g001.jpg
| |-- NPR2-43-85-s001.xlsx
| |-- PMC10009416.1.json
| |-- PMC10009416.1.pdf
| |-- PMC10009416.1.txt
| `-- PMC10009416.1.xml
|-- PMC12788873.1
| |-- PMC12788873.1.json
| |-- PMC12788873.1.txt
| `-- PMC12788873.1.xml
|-- author_manuscript # old organization, available until Aug 2026
|-- metadata
| |-- PMC10009416.1.json
| `-- PMC12788873.1.json
|-- oa_comm # old organization, available until Aug 2026
|-- oa_noncomm # old organization, available until Aug 2026
|-- phe_timebound # old organization, available until Aug 2026
|-- inventory-reports/
`-- README.txt
AWS Command Line Interface (CLI) Data Access
The following examples demonstrate how to access the updated PMC Article Datasets using the AWS command line interface (CLI).
For anonymous access, add --no-sign-request to all aws commands.
Accessing Article Data
List the contents of the bucket
Caution: This will print about 8 million prefixes. It is better to get the inventory CSV files (see Accessing the Inventory below).
$ aws s3 ls s3://pmc-oa-opendata/
List the objects belonging to a specific article version
$ aws s3 ls s3://pmc-oa-opendata/PMC10009402.1/
Download a specific object
$ aws s3 cp s3://pmc-oa-opendata/PMC10009402.1/PMC10009402.1.xml .
Create a new directory and download all objects belonging to a specific article version into it
$ mkdir PMC10009402.1/
$ aws s3 cp --recursive s3://pmc-oa-opendata/PMC10009402.1/ PMC10009402.1/
List all versions belonging to a PMCID
$ aws s3api list-objects-v2 --bucket pmc-oa-opendata --prefix "PMC11370360." --delimiter "/" --query "CommonPrefixes[].Prefix" --output "text"
outputs:
PMC11370360.1/
PMC11370360.2/
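If you only need the most recent version of an article, the listing can be reduced to a single prefix. The sketch below assumes that the highest version number is the most recent version, and the function name is hypothetical. It sorts numerically so that version 10 ranks above version 2.

```shell
latest_version_prefix() {
  # Read prefixes such as "PMC11370360.1/" on standard input, separated by
  # tabs, spaces, or newlines, and print the one with the highest version.
  tr -s ' \t' '\n\n' | sed '/^$/d' | sort -t. -k2,2n | tail -n 1
}

# Possible usage (requires network access):
#   aws s3api list-objects-v2 --bucket pmc-oa-opendata --prefix "PMC11370360." \
#     --delimiter "/" --query "CommonPrefixes[].Prefix" --output "text" \
#     --no-sign-request | latest_version_prefix
```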
Accessing the Inventory
The Amazon S3 inventory, in CSV format, is regenerated once a day. Many versions of the inventory are available. Instructions below help you locate, retrieve and view the latest day's version of the inventory CSV files.
Find the latest version
$ aws s3 ls s3://pmc-oa-opendata/inventory-reports/pmc-oa-opendata/metadata/ | awk '{print $2}' | grep -v hive | grep -v data | sort | tail -1
Sample output:
2026-02-23T01-00Z/
Retrieve the CSV path for a specific inventory file by reading the manifest
The example below retrieves the CSV paths for the inventory version you choose (e.g. 2026-02-23T01-00Z/, as found with the "Find the latest version" command above). Replace that value with the inventory version you wish to retrieve.
$ aws s3 cp s3://pmc-oa-opendata/inventory-reports/pmc-oa-opendata/metadata/2026-02-23T01-00Z/manifest.json - | jq '.files[].key'
outputs:
"inventory-reports/pmc-oa-opendata/metadata/data/6f7628e4-fdb9-4f91-98fc-269996ed5c79.csv.gz"
"inventory-reports/pmc-oa-opendata/metadata/data/d1e76b0a-3b0a-4c21-8136-2fe6d38e093a.csv.gz"
"inventory-reports/pmc-oa-opendata/metadata/data/6db257f7-f9f2-461f-bd4b-f624610dea73.csv.gz"
"inventory-reports/pmc-oa-opendata/metadata/data/231a00df-6145-4e68-9600-3518ed2dced9.csv.gz"
Note: The inventory will be split into multiple large CSV files that are each compressed with gzip.
Retrieve one of the gzipped inventory files to your local directory
$ aws s3 cp s3://pmc-oa-opendata/inventory-reports/pmc-oa-opendata/metadata/data/6f7628e4-fdb9-4f91-98fc-269996ed5c79.csv.gz .
Uncompress an inventory file
$ gunzip 6f7628e4-fdb9-4f91-98fc-269996ed5c79.csv.gz
outputs the uncompressed CSV file:
6f7628e4-fdb9-4f91-98fc-269996ed5c79.csv
Print the first 3 rows of an inventory file to the screen
$ head -3 6f7628e4-fdb9-4f91-98fc-269996ed5c79.csv
Sample output:
"pmc-oa-opendata","metadata/PMC8267076.1.json","2026-02-22T09:48:18.000Z","0323f9d4d6ce512e31078abcfd8b35dc"
"pmc-oa-opendata","metadata/PMC8267077.1.json","2026-02-20T04:16:10.000Z","c30b4496596d9c6c534300cfea76b2d2"
"pmc-oa-opendata","metadata/PMC8267079.1.json","2026-02-22T09:48:21.000Z","1e46cc6a4b6ef240f7a85f931796f485"
Using an eSearch-S3 Pipeline
The "ESearch" Entrez Programming Utility with db=pmc responds to a text search query with the list of matching PMC identifiers, in integer form. Refer to https://www.ncbi.nlm.nih.gov/books/NBK25497/ for API documentation.
Appending the filters "(open_access[filter] OR author_manuscript[filter])" to the query will limit results to articles available in these datasets.
The articles can then be mapped to S3 prefixes and retrieved.
Example: Retrieve all the files associated with articles in PMC that match a search for alzheimers and allow reuse.
- Run eSearch on the command line, for example for the term "alzheimers"
Note how the command below appends the filters "(open_access[filter] OR author_manuscript[filter])" which limits the results to articles available in these datasets. If you do not use that, your query may find PMC identifiers that are not available for retrieval.
$ curl -s -g "https://eutils.ncbi.nlm.nih.gov/eutils/esearch.fcgi?db=pmc&term=alzheimers+AND+(open_access[filter]+OR+author_manuscript[filter])&format=json" \
| jq '.esearchresult.idlist'
This returns an idlist:
[
"12810641",
"12810747",
...
]
- For each ID, find all article version(s) associated with a PMCID
For each item in the idlist you retrieved in step 1, run the s3api list-objects-v2 command shown below. Be sure to add the prefix 'PMC' before the ID value to create the PMCID, and to append a period ('.') after the ID value so that only versions of that exact PMCID are matched.
$ aws s3api list-objects-v2 --bucket pmc-oa-opendata --prefix "PMC12810641." \
--delimiter "/" --query "CommonPrefixes[].Prefix" --output "text"
For this id, the command returns:
PMC12810641.1/
This result means that there is only one version of this article available for retrieval, and it is version 1. The majority of articles in PMC have a single version and it is version 1.
- For each article version, retrieve all the article files associated with that version
$ aws s3 cp --recursive s3://pmc-oa-opendata/PMC12810641.1/ PMC12810641.1/
This command will create a directory named PMC12810641.1 and place a copy of all the available files for the article version PMC12810641.1 in that directory, e.g. XML, TXT, the JSON metadata object, article PDF, all image or media files and any supplemental files.
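The three steps above can be combined into one script. The following is a sketch rather than a definitive implementation: the function names are hypothetical, it needs curl, jq, and the AWS CLI, and it uses the base URL and the retmode, retstart, and retmax parameters from the E-utilities documentation instead of the exact URL shown in step 1. ESearch returns only 20 IDs per request by default, so larger result sets need paging.

```shell
BUCKET="pmc-oa-opendata"
ESEARCH="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

pmcid_prefix() {
  # ESearch returns bare integers; the S3 prefix is "PMC" + ID + ".".
  printf 'PMC%s.\n' "$1"
}

page_offsets() {
  # Print the retstart values needed to page through $1 results, $2 at a time.
  awk -v count="$1" -v size="$2" 'BEGIN { for (s = 0; s < count; s += size) print s }'
}

esearch_ids() {
  # $1 = query, $2 = retstart, $3 = retmax. Prints one integer ID per line.
  # --data-urlencode takes care of spaces, brackets, and parentheses.
  curl -s -G "$ESEARCH" \
    --data-urlencode "db=pmc" \
    --data-urlencode "term=$1" \
    --data-urlencode "retmode=json" \
    --data-urlencode "retstart=$2" \
    --data-urlencode "retmax=$3" \
  | jq -r '.esearchresult.idlist[]'
}

download_all_versions() {
  # $1 = integer ID from ESearch. Copies every version into ./<prefix>/.
  aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$(pmcid_prefix "$1")" \
      --delimiter "/" --query "CommonPrefixes[].Prefix" --output text \
      --no-sign-request \
  | tr -s '\t' '\n' \
  | while read -r prefix; do
      # The CLI prints "None" when no version of this PMCID is in the bucket.
      if [ -z "$prefix" ] || [ "$prefix" = "None" ]; then continue; fi
      aws s3 cp --recursive "s3://$BUCKET/$prefix" "$prefix" --no-sign-request
    done
}

# Possible usage (requires network access): first 100 matches only.
#   query='alzheimers AND (open_access[filter] OR author_manuscript[filter])'
#   esearch_ids "$query" 0 100 | while read -r id; do download_all_versions "$id"; done
```

For a complete result set, page_offsets can drive repeated esearch_ids calls, for example page_offsets 250 100 yields the retstart values 0, 100, and 200.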
PMC Search Fields and Example Queries
For a complete list of PMC search fields, see https://pmc.ncbi.nlm.nih.gov/about/userguide/#usetags.
Example Search Queries
Find articles added on a specific day
2026/01/18[pmcrdat] AND (open_access[filter] OR author_manuscript[filter])
Find articles added during a specific date range
2026/1/14:2026/1/31[pmcrdat] AND (open_access[filter] OR author_manuscript[filter])
Find articles that permit commercial reuse
(cc0_license[filter] OR cc_by_license[filter] OR cc_by-sa_license[filter] OR
cc_by-nd_license[filter] ) AND (open_access[filter] OR author_manuscript[filter])
Find articles that only permit non-commercial reuse
(cc_by-nc_license[filter] OR cc_by-nc-nd_license[filter] OR cc_by-nc-sa_license[filter]) AND (open_access[filter] OR author_manuscript[filter])
Find all articles that permit non-commercial reuse, including those that also permit commercial reuse (e.g. if you are using them for non-commercial purposes)
(open_access[filter] OR author_manuscript[filter])
See complete documentation of license filters in PMC.
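Queries like the ones above can be assembled in a script so that the dataset filter is never forgotten. This sketch uses hypothetical function names. The esearch_count function assumes curl and jq are installed and uses the ESearch base URL and retmode parameter from the E-utilities documentation.

```shell
DATASET_FILTER='(open_access[filter] OR author_manuscript[filter])'

in_datasets() {
  # Wrap a query in parentheses and AND it with the dataset filter.
  printf '(%s) AND %s\n' "$1" "$DATASET_FILTER"
}

added_between() {
  # Articles added from $1 through $2 (YYYY/MM/DD), using the pmcrdat field.
  in_datasets "$1:$2[pmcrdat]"
}

esearch_count() {
  # Print the number of PMC records matching query $1 (requires network).
  curl -s -G "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" \
    --data-urlencode "db=pmc" \
    --data-urlencode "term=$1" \
    --data-urlencode "retmode=json" \
  | jq -r '.esearchresult.count'
}

# Possible usage:
#   esearch_count "$(added_between 2026/01/14 2026/01/31)"
```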
FAQs for the Transition
Can I still get the files for all the articles that allow commercial re-use easily?
Yes. PMC Search has filters that allow you to easily find articles by Creative Commons license type. PMC searches can be conducted with the eUtilities ESearch API, so you can build an eSearch-S3 pipeline to retrieve all files for the articles with commercial licenses.
Find all articles in PMC that permit commercial reuse
(cc0_license[filter] OR cc_by_license[filter] OR cc_by-sa_license[filter] OR
cc_by-nd_license[filter]) AND (open_access[filter] OR author_manuscript[filter])
See the PMC User Guide section on searching by license.
Will there be a way to get all the PDFs easily?
Yes, however, there will no longer be a tar.gz package that contains all the PDF files for articles licensed for non-commercial use.
If you want all PDFs for articles that allow non-commercial use, you can conduct a search for all articles with PDFs that are part of the PMC Open Access Subset, and build that search into a pipeline (such as the eUtilities eSearch-S3 pipeline documented above limited to PDF file retrieval) to retrieve all the PDFs.
Find all articles with article PDF files that can be used for non-commercial purposes
(open_access[filter] AND has_pdf[filter])
Where do I find the baseline and incremental files?
There will no longer be baseline bulk file packages that contain the XML or TXT for millions of articles.
As there are no baseline files, there are also no incremental files. Instead, each individual file will have a timestamp and ETag that will be updated if the file changes. The bucket inventory includes S3 modification dates as well as ETag data.
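One way to approximate incremental updates is to compare the inventory from two different days. The sketch below is an assumption about how you might do that, not an NCBI-provided tool. The function name is hypothetical, and it expects two uncompressed CSV files with the four quoted fields of the inventory (bucket, metadata object name, last modified date, ETag). Because each day's inventory is split across several files, concatenate them into one file per day first.

```shell
inventory_changes() {
  # $1 = older inventory CSV, $2 = newer inventory CSV.
  # Prints "added", "changed", or "removed" followed by the metadata object
  # name. A changed ETag means at least one object of that version changed.
  awk -F'","' '
    { etag = $4; sub(/"$/, "", etag) }      # drop the closing quote
    NR == FNR { old[$2] = etag; next }      # first file: remember each ETag
    { seen[$2] = 1
      if (!($2 in old)) print "added", $2
      else if (old[$2] != etag) print "changed", $2 }
    END { for (k in old) if (!(k in seen)) print "removed", k }
  ' "$1" "$2"
}

# Possible usage:
#   inventory_changes inventory-2026-02-22.csv inventory-2026-02-23.csv | sort
```

The order of the "removed" lines is not defined by awk, so pipe the output through sort when a stable order matters.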
If you want to retrieve all the articles that have been added on a specific date or in a specific date range, you can use these queries with eSearch combined into an eSearch-S3 pipeline.
Find articles added on a specific day
2026/02/26[pmcrdat] AND (open_access[filter] OR author_manuscript[filter])
This search query will find all articles added on 2026/02/26 that are available for automated retrieval.
Find articles added during a specific date range
2026/02/26:2026/02/28[pmcrdat] AND (open_access[filter] OR author_manuscript[filter])
This search query will find all articles added from 2026/02/26 through 2026/02/28 that are available for automated retrieval.
Do I need an account to retrieve the files from the AWS Cloud?
No account is necessary to retrieve files from the AWS Cloud to your local storage. You will need to add --no-sign-request to all of your command line statements while using the AWS CLI in order to do this. Access is also available without login using a URL in a browser or in a curl command. The URLs all start with https://pmc-oa-opendata.s3.amazonaws.com/.
Will retrieving files from the updated PMC Cloud Service cost me anything?
There is no charge for accessing the files for retrieval to your own Cloud storage or to download them to a local storage area. If you use AWS Cloud computing resources to work with the files in the Cloud you will have to pay for their computing services. If you copy the files to your own Cloud storage area, you will have to pay for storage.
PMC Open Access (OA) Subset on AWS
New structure required beginning in August 2026 - learn more
The PMC Open Access Subset in PMC's S3 bucket on AWS is divided into three top-level directories: oa_comm, oa_noncomm, and phe_timebound. For commercial usage, you are limited to the articles in the oa_comm directory which includes articles licensed under CC BY and CC0 licenses and to articles in the phe_timebound directory all of which have a PMC COVID-19 Collection timebound license statement. For non-commercial usage, you may access articles in the oa_noncomm (which contains articles licensed under all Creative Commons license types with the exclusion of CC BY and CC0), oa_comm, and phe_timebound directories.
Some articles in the PMC COVID-19 Collection were made available through the PMC Open Access Subset under license terms that expired at the end of the public health emergency declaration and are no longer available in the phe_timebound directory. To download a list of PMCIDs that are no longer available under license terms allowing for re-use, see the FAQ item "Where can I find a list of articles removed from PMC or the PMC Open Access subset at the end of the Public Health Emergency?".
The license terms on articles are not all identical. Please refer to the license statement in each article for specific terms of use. The oa_comm/, oa_noncomm/, and phe_timebound/ directories follow similar structures:
|_ txt/
   |_ all/
      individual plain text files for each article, named
      PMC[accession_id].txt, e.g. PMC1043859.txt
   |_ metadata/
      |_ csv/
         [oa_comm or oa_noncomm or phe_timebound].filelist.csv
      |_ txt/
         [oa_comm or oa_noncomm or phe_timebound].filelist.txt
|_ xml/
   |_ all/
      individual XML files for each article, named
      PMC[accession_id].xml, e.g. PMC1043859.xml
   |_ metadata/
      |_ csv/
         [oa_comm or oa_noncomm or phe_timebound].filelist.csv
      |_ txt/
         [oa_comm or oa_noncomm or phe_timebound].filelist.txt
Note that on AWS, we are limiting distribution of open access articles to those that have a machine-readable Creative Commons license. Those articles that have been identified by the publishers as open access, but that do not have machine-readable Creative Commons licenses tagged are available via the PMC FTP Service.
File lists are updated daily. Each contains a row per article with a number of metadata fields. Below is a sample header and sample row for the CSV formatted file list. The plain text file list uses tabs to separate the fields.
Key,ETag,Article Citation,AccessionID,Last Updated UTC (YYYY-MM-DD HH:MM:SS),PMID,License,Retracted
oa_comm/xml/all/PMC1043859.xml,801ba4a4c2d48ad98149e4e481a55b06,PLoS Biol. 2005 Apr 22; 3(4):e60,PMC1043859,2021-06-17 18:35:10,15736975,CC BY,no
Header Definitions
- Key - the object key (or key name) uniquely identifies the object in an Amazon S3 bucket
- ETag - this is the AWS entity tag and its value represents a specific version of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data.
- Article Citation - A brief format of basic article citation information
- AccessionID - the PMC accession ID in the PMC standard format of PMC#######, where the number of digits can vary
- Last Updated UTC (YYYY-MM-DD HH:MM:SS) - the date and time the file (or object) was last updated in the bucket
- PMID - the PubMed accession ID in the PubMed standard format of ######### if a PMID is associated with the article, where the number of digits can vary
- License - an indicator of the Creative Commons license type, e.g. CC BY, CC BY-NC
- Retracted - a value of yes means that the article has been retracted, a value of no means that the article has not been retracted
PMC Author Manuscript Dataset on AWS
New structure required beginning in August 2026 - learn more
The PMC Author Manuscript Dataset in PMC's S3 bucket on AWS is found in the author_manuscript directory. Articles in this directory are accepted author manuscripts that have been collected under a funder policy in PMC. They are available in XML and plain text for text mining purposes.
The author_manuscript/ directory is organized as follows:
|_ txt/
   |_ all/
      individual plain text files for each author manuscript,
      named PMC[accession_id].txt, e.g. PMC1249490.txt
   |_ metadata/
      |_ csv/
         author_manuscript.filelist.csv
      |_ txt/
         author_manuscript.filelist.txt
|_ xml/
   |_ all/
      individual XML files for each author manuscript,
      named PMC[accession_id].xml, e.g. PMC1043859.xml
   |_ metadata/
      |_ csv/
         author_manuscript.filelist.csv
      |_ txt/
         author_manuscript.filelist.txt
File lists are updated daily. Each contains a row per manuscript with a number of metadata fields. Below is a sample header and sample row for the CSV formatted file list. The plain text file list uses tabs to separate the fields.
Key,ETag,AccessionID,Last Updated UTC (YYYY-MM-DD HH:MM:SS),PMID,MID
author_manuscript/xml/all/PMC8218989.xml,c9090970ef2d0ab762ef473a18eac2ef,PMC8218989,2021-06-24 07:31:23,32914184,NIHMS1703867
Header Definitions
- Key - the object key (or key name) uniquely identifies the object in an Amazon S3 bucket
- ETag - this is the AWS entity tag and its value represents a specific version of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data.
- AccessionID - the PMC accession ID in the PMC standard format of PMC#######, where the number of digits can vary
- Last Updated UTC (YYYY-MM-DD HH:MM:SS) - the date and time the file (or object) was last updated in the bucket
- PMID - the PubMed accession ID in the PubMed standard format of #########, if a PMID is associated with the manuscript, where the number of digits can vary
- MID - the manuscript ID associated with the author manuscript, e.g. NIHMS1703867
Retrieval from AWS
Retrieving files from PMC's S3 bucket on AWS does not require an AWS account. In addition, there are no transfer fees to users for downloading or transferring files, because these costs are covered through PMC's participation in the AWS Open Data Sponsorship Program. There are several methods available to retrieve files as described in the Downloading an object documentation from AWS.
AWS Command Line Interface (CLI)
First, download the AWS Command Line Interface (CLI) following these instructions.
Because the PMC S3 bucket is world-readable, you do not need an AWS account ID to read or download these files; however, if you choose to access the data anonymously, you will need to include a --no-sign-request option on any of the below examples. If, however, you wish to copy these data into your own S3 bucket or use AWS services like AWS Elastic Compute Cloud or Amazon Athena on these data, you will need an AWS account and you will need to input your AWS credentials.
The following examples take advantage of the bucket-, prefix-, and object-level s3 commands. Read more about s3 commands.
Using AWS CLI to access and retrieve objects: Examples
There are several methods available to download files as described in the AWS Downloading an object documentation.
Download everything in a directory using sync
Let's say you want to download everything living under the prefix oa_comm/xml/all/. In this example, we've already generated a directory called pmc-test that we want all these objects to be copied into. aws s3 sync syncs everything in a source bucket into your designated directory. Note that sync does not have a --prefix option as list-objects-v2 does. However, since the key to any object includes prefixes, you can use --include and --exclude filters to designate what prefixes you want to sync.
Note that filters can accommodate a number of patterns, and have a precedence hierarchy! Read more about include and exclude filters.
Example no longer relevant beginning in August 2026 - learn more
aws s3 sync s3://pmc-oa-opendata ./pmc-test/ --exclude "*" --include "oa_comm/xml/all/*"
Download new or updated files in a directory using sync
A common use case is that you will want to only download new or updated data. Per sync documentation, "a s3 object will require downloading if the size of the s3 object differs from the size of the local file, the last modified time of the s3 object is newer than the last modified time of the local file, or the s3 object does not exist in the local directory". So after you've used sync once to get everything, you can continue to use it whenever you want to retrieve only new or updated files.
Read the official aws s3 sync documentation.
Download a subset using cp
If you only want a subset of data to work with and don't want to keep the entirety of a bucket in your own storage, you can also use aws s3 cp. By default, cp copies a single object, so if you want it to copy everything under a bucket or prefix, you'll want to add the --recursive flag.
Copy all files
aws s3 cp s3://pmc-oa-opendata ./pmc-test/ --recursive
Copy files within a certain prefix
This example also downloads data, but it adds --exclude and --include filters to limit the cp to files under a certain prefix.
Example no longer relevant beginning in August 2026 - learn more
aws s3 cp s3://pmc-oa-opendata ./pmc-test/ --exclude "*" --include "oa_comm/xml/all/*" --recursive
Explore the official aws s3 cp documentation.
Engage
NCBI wants your feedback on accessing PMC Article Datasets using AWS. Contact pubmedcentral@ncbi.nlm.nih.gov with feedback and questions.