Accessing PMC Article Datasets Using Amazon Web Services
As part of our Cloud Service, PMC makes the datasets described below freely accessible on Amazon Web Services (AWS), without charge, through either an HTTPS or S3 URL, and without any log-in requirement for retrieval (see Access Using the Command Line Interface). The National Library of Medicine works with the AWS Open Data Sponsorship Program to provide this access. Read on to learn why and how you may access these datasets from our AWS Cloud Service.
Note:
- The files that PMC distributes via our AWS cloud service include individual articles in NISO Z39.96-2015 JATS XML format as well as in plain text as extracted from the XML. Also included are file lists that include metadata for articles in each dataset.
- Media files and supplementary materials associated with Open Access Subset articles can be retrieved in individual article packages using the PMC FTP Service.
- Media files and supplementary materials for author manuscripts are not made available as part of the PMC Article Datasets.
Description and Location of PMC Article Datasets on AWS
Resource type: S3 Bucket, world-readable
Amazon Resource Name (ARN): arn:aws:s3:::pmc-oa-opendata
AWS Region: us-east-1
AWS CLI (Command Line Interface) Access (No AWS account required if you use --no-sign-request)
aws s3 ls s3://pmc-oa-opendata --no-sign-request
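Since the datasets can be reached through either an S3 URL or an HTTPS URL, it may help to see how a single object key maps to both forms. The following is a minimal sketch, not an official PMC tool: the function names are hypothetical, and the HTTPS form assumes the standard S3 virtual-hosted-style endpoint pattern (bucket name followed by s3.<region>.amazonaws.com).

```shell
# Bucket name and region as listed in the description above.
PMC_BUCKET="pmc-oa-opendata"
PMC_REGION="us-east-1"

# Print the S3 URL for an object key (the form the AWS CLI expects).
pmc_s3_url() {
  printf 's3://%s/%s\n' "$PMC_BUCKET" "$1"
}

# Print an HTTPS URL for the same key.
# Assumption: the standard virtual-hosted-style S3 endpoint pattern applies.
pmc_https_url() {
  printf 'https://%s.s3.%s.amazonaws.com/%s\n' "$PMC_BUCKET" "$PMC_REGION" "$1"
}
```

The HTTPS URL printed by pmc_https_url can then be passed to any HTTP client, while the S3 URL is what the aws s3 commands take as a source.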
PMC Open Access (OA) Subset on AWS
The PMC Open Access Subset in PMC's S3 bucket on AWS is divided into three top-level directories: oa_comm, oa_noncomm, and phe_timebound. For commercial usage, you are limited to the articles in the oa_comm directory, which includes articles licensed under CC BY and CC0 licenses, and to the articles in the phe_timebound directory, all of which have a PMC COVID-19 Collection timebound license statement. For non-commercial usage, you may access articles in the oa_noncomm directory (which contains articles licensed under all Creative Commons license types with the exclusion of CC BY and CC0) as well as the oa_comm and phe_timebound directories.
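The directory rules above can be written down as a small lookup. This is only an illustrative sketch: the function name and the labels "commercial" and "noncommercial" are hypothetical, and the license statement in each article still governs the specific terms of use.

```shell
# Print the top-level directories usable for a given purpose, one per line.
# $1 = "commercial" or "noncommercial" (labels invented for this sketch).
pmc_dirs_for_use() {
  case "$1" in
    commercial)
      printf '%s\n' oa_comm phe_timebound ;;
    noncommercial)
      printf '%s\n' oa_comm oa_noncomm phe_timebound ;;
    *)
      echo "usage: pmc_dirs_for_use commercial|noncommercial" >&2
      return 1 ;;
  esac
}
```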
Some articles in the PMC COVID-19 Collection were made available through the PMC Open Access Subset under license terms that expired at the end of the public health emergency declaration and are no longer available in the phe_timebound directory. To download a list of PMCIDs that are no longer available under license terms allowing for re-use, see the FAQ item “Where can I find a list of articles removed from PMC or the PMC Open Access subset at the end of the Public Health Emergency?”.
The license terms on articles are not all identical. Please refer to the license statement in each article for specific terms of use. The oa_comm/, oa_noncomm/, and phe_timebound/ directories follow similar structures:
|_ txt/
   |_ all/
      individual plain text files for each article, named
      PMC[accession_id].txt, e.g. PMC1043859.txt
   |_ metadata/
      |_ csv/
         [oa_comm or oa_noncomm or phe_timebound].filelist.csv
      |_ txt/
         [oa_comm or oa_noncomm or phe_timebound].filelist.txt
|_ xml/
   |_ all/
      individual XML files for each article, named
      PMC[accession_id].xml, e.g. PMC1043859.xml
   |_ metadata/
      |_ csv/
         [oa_comm or oa_noncomm or phe_timebound].filelist.csv
      |_ txt/
         [oa_comm or oa_noncomm or phe_timebound].filelist.txt
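Given this layout, the key of an article file or of a file list can be assembled from its parts. The sketch below uses hypothetical helper names and simply follows the naming pattern shown in the tree; it does not check that the resulting object exists.

```shell
# Build the key of an article file.
# $1 = top-level directory (oa_comm, oa_noncomm, or phe_timebound)
# $2 = article format (xml or txt)
# $3 = accession ID, e.g. PMC1043859
pmc_article_key() {
  printf '%s/%s/all/%s.%s\n' "$1" "$2" "$3" "$2"
}

# Build the key of a file list.
# $1 = top-level directory
# $2 = article format the list describes (xml or txt)
# $3 = file list format (csv or txt)
pmc_filelist_key() {
  printf '%s/%s/metadata/%s/%s.filelist.%s\n' "$1" "$2" "$3" "$1" "$3"
}
```

For example, pmc_article_key oa_comm xml PMC1043859 prints oa_comm/xml/all/PMC1043859.xml.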
Note that on AWS, we are limiting distribution of open access articles to those that have a machine-readable Creative Commons license. Those articles that have been identified by the publishers as open access, but that do not have machine-readable Creative Commons licenses tagged are available via the PMC FTP Service.
File lists are updated daily. Each contains a row per article with a number of metadata fields. Below is a sample header and sample row for the CSV formatted file list. The plain text file list uses tabs to separate the fields.
Key,ETag,Article Citation,AccessionID,Last Updated UTC (YYYY-MM-DD HH:MM:SS),PMID,License,Retracted
oa_comm/xml/all/PMC1043859.xml,801ba4a4c2d48ad98149e4e481a55b06,PLoS Biol. 2005 Apr 22; 3(4):e60,PMC1043859,2021-06-17 18:35:10,15736975,CC BY,no
Header Definitions
- Key - the object key (or key name) uniquely identifies the object in an Amazon S3 bucket
- ETag - this is the AWS entity tag and its value represents a specific version of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data.
- Article Citation - A brief format of basic article citation information
- AccessionID - the PMC accession ID in the PMC standard format of PMC#######, where the number of digits can vary
- Last Updated UTC (YYYY-MM-DD HH:MM:SS) - the date and time the file (or object) was last updated in the bucket
- PMID - the PubMed accession ID in the PubMed standard format of ######### if a PMID is associated with the article, where the number of digits can vary
- License - an indicator of the Creative Commons license type, e.g. CC BY, CC BY-NC
- Retracted - a value of yes means that the article has been retracted, a value of no means that the article has not been retracted
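The fields defined above make it possible to select articles before downloading anything. The following sketch reads the tab-separated (plain text) file list, which avoids CSV quoting issues, and prints the keys of non-retracted articles with a given license. It assumes the plain text list has a header row and the same column order as the CSV header shown above; the function name is hypothetical.

```shell
# Read a tab-separated file list on stdin and print the Key of every
# non-retracted article whose License column equals $1 (e.g. "CC BY").
# Assumption: columns follow the CSV header order
# (Key = 1, License = 7, Retracted = 8) and row 1 is the header.
pmc_keys_by_license() {
  awk -F '\t' -v lic="$1" '
    { sub(/\r$/, "") }                      # tolerate CRLF line endings
    NR > 1 && $7 == lic && $8 == "no" { print $1 }
  '
}
```

A typical use would be to pipe a downloaded file list into the function, for example: pmc_keys_by_license "CC BY" < oa_comm.filelist.txt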
PMC Author Manuscript Dataset on AWS
The PMC Author Manuscript Dataset in PMC's S3 bucket on AWS is found in the author_manuscript directory. Articles in this directory are accepted author manuscripts that have been collected under a funder policy in PMC. They are available in XML and plain text for text mining purposes.
The author_manuscript/ directory is organized as follows:
|_ txt/
   |_ all/
      individual plain text files for each author manuscript,
      named PMC[accession_id].txt, e.g. PMC1249490.txt
   |_ metadata/
      |_ csv/
         author_manuscript.filelist.csv
      |_ txt/
         author_manuscript.filelist.txt
|_ xml/
   |_ all/
      individual XML files for each author manuscript,
      named PMC[accession_id].xml, e.g. PMC1043859.xml
   |_ metadata/
      |_ csv/
         author_manuscript.filelist.csv
      |_ txt/
         author_manuscript.filelist.txt
File lists are updated daily. Each contains a row per manuscript with a number of metadata fields. Below is a sample header and sample row for the CSV formatted file list. The plain text file list uses tabs to separate the fields.
Key,ETag,AccessionID,Last Updated UTC (YYYY-MM-DD HH:MM:SS),PMID,MID
author_manuscript/xml/all/PMC8218989.xml,c9090970ef2d0ab762ef473a18eac2ef,PMC8218989,2021-06-24 07:31:23,32914184,NIHMS1703867
Header Definitions
- Key - the object key (or key name) uniquely identifies the object in an Amazon S3 bucket
- ETag - this is the AWS entity tag and its value represents a specific version of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data.
- AccessionID - the PMC accession ID in the PMC standard format of PMC#######, where the number of digits can vary
- Last Updated UTC (YYYY-MM-DD HH:MM:SS) - the date and time the file (or object) was last updated in the bucket
- PMID - the PubMed accession ID in the PubMed standard format of ######### if a PMID is associated with the manuscript, where the number of digits can vary
- MID - the manuscript ID associated with the author manuscript, e.g. NIHMS1703867
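Because the ETag may or may not be an MD5 digest, it can serve only as a best-effort check on a downloaded file. The sketch below compares a local file's MD5 with the ETag value from the file list. It assumes the md5sum utility (GNU coreutils) is available, and it treats any ETag containing a hyphen as not comparable, since S3 uses that form for multipart uploads. The function name is hypothetical, and a "no-match" result is not proof of corruption.

```shell
# Compare a local file's MD5 digest with an ETag from the file list.
# Prints "match", "no-match", or "unknown".
# $1 = path to the local file, $2 = ETag value
# Assumption: md5sum is installed (on macOS, `md5 -q` is the usual equivalent).
pmc_check_etag() {
  file=$1
  etag=$2
  # A hyphenated ETag comes from a multipart upload and is not a plain MD5.
  case "$etag" in
    *-*) echo "unknown"; return 0 ;;
  esac
  digest=$(md5sum "$file" | cut -d ' ' -f 1)
  if [ "$digest" = "$etag" ]; then
    echo "match"
  else
    echo "no-match"
  fi
}
```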
Retrieval from AWS
Retrieving files from PMC's S3 bucket on AWS does not require an AWS account. In addition, there are no transfer fees to users for downloading or transferring files, because these costs are covered through PMC's participation in the AWS Open Data Sponsorship Program. There are several methods available to retrieve files as described in the Downloading an object documentation from AWS.
AWS Command Line Interface (CLI)
First, download the AWS Command Line Interface (CLI) following these instructions.
Because the PMC S3 bucket is world-readable, you do not need an AWS account ID to read or download these files; however, if you choose to access the data anonymously, you will need to include a --no-sign-request option on any of the below examples. If, however, you wish to copy these data into your own S3 bucket or use AWS services like AWS Elastic Compute Cloud or Amazon Athena on these data, you will need an AWS account and you will need to input your AWS credentials.
The following examples take advantage of the bucket-, prefix-, and object-level s3 commands. Read more about s3 commands.
Using AWS CLI to access and retrieve objects: Examples
Download everything in a directory using sync
Let's say you want to download everything under the prefix oa_comm/xml/all/. In this example, we've already created a directory called pmc-test that we want all these objects to be copied into. aws s3 sync syncs everything in a source bucket into your designated directory. Note that sync does not have a --prefix option as list-objects-v2 does. However, since the key of any object includes its prefixes, you can use --include and --exclude filters to designate which prefixes you want to sync.
Note that filters can accommodate a number of patterns, and have a precedence hierarchy! Read more about include and exclude filters.
aws s3 sync s3://pmc-oa-opendata ./pmc-test/ --exclude "*" --include "oa_comm/xml/all/*"
Download new or updated files in a directory using sync
A common use case is downloading only new or updated data. Per the sync documentation, "a s3 object will require downloading if the size of the s3 object differs from the size of the local file, the last modified time of the s3 object is newer than the last modified time of the local file, or the s3 object does not exist in the local directory". So after you've used sync once to get everything, you can continue to use it whenever you want to retrieve only new or updated files.
Read the official aws s3 sync documentation.
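As an alternative to filters, sync also accepts a prefix directly in the S3 URI. The following is a minimal sketch assuming anonymous access; pmc_sync_prefix is a hypothetical helper, and AWS_BIN is a hypothetical override that lets the command be swapped out (for example, for testing). Note that with this form the files land directly in the destination directory rather than under oa_comm/xml/all/ inside it.

```shell
# Mirror one prefix of the PMC bucket into a local directory.
# $1 = prefix (e.g. oa_comm/xml/all/), $2 = local directory.
# Any further arguments (e.g. --dryrun) are passed through to the CLI.
# AWS_BIN is a hypothetical override; it defaults to the aws executable.
pmc_sync_prefix() {
  prefix=$1
  dest=$2
  shift 2
  "${AWS_BIN:-aws}" s3 sync "s3://pmc-oa-opendata/$prefix" "$dest" \
    --no-sign-request "$@"
}
```

Running pmc_sync_prefix oa_comm/xml/all/ ./pmc-test/ --dryrun previews what would be transferred without downloading anything; repeating the call without --dryrun retrieves only new or updated files.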
Download a subset using cp
If you only want a subset of the data to work with and don't want to keep the entirety of the bucket in your own storage, you can also use aws s3 cp. By default, cp copies a single object, so if you want it to traverse the entire bucket or everything under a prefix, you'll need to add the --recursive option.
Copy all files
aws s3 cp s3://pmc-oa-opendata ./pmc-test/ --recursive
Copy files within a certain prefix
This example also downloads data, but it adds --exclude and --include filters to limit the cp to files under a certain prefix.
aws s3 cp s3://pmc-oa-opendata ./pmc-test/ --exclude "*" --include "oa_comm/xml/all/*" --recursive
Explore the official aws s3 cp documentation.
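When the subset you want is a specific list of articles rather than a whole prefix, one option is to copy objects one at a time from a list of keys, such as keys taken from the Key column of a file list. This is only a sketch under that assumption: pmc_fetch_keys is a hypothetical helper, AWS_BIN is a hypothetical override for the aws executable, and all files are copied flat into the destination directory.

```shell
# Read object keys on stdin (one per line) and copy each object into
# the directory given as $1 (default: current directory).
# AWS_BIN is a hypothetical override; it defaults to the aws executable.
pmc_fetch_keys() {
  dest=${1:-.}
  while IFS= read -r key; do
    # Skip blank lines.
    [ -n "$key" ] || continue
    # Redirect stdin so the CLI cannot consume the remaining keys.
    "${AWS_BIN:-aws}" s3 cp "s3://pmc-oa-opendata/$key" "$dest/" \
      --no-sign-request </dev/null
  done
}
```

For example, printf 'oa_comm/xml/all/PMC1043859.xml\n' | pmc_fetch_keys ./pmc-test would copy that single article into ./pmc-test/.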
Engage
NCBI wants your feedback on accessing PMC Article Datasets using AWS. Contact pubmedcentral@ncbi.nlm.nih.gov with feedback and questions.