Digitized Journals in PMC

Image: Scanned full cover image from the Proceedings of the National Academy of Sciences. January 1915, Volume 1. Image shows stamp saying: Library Received March 25, 1915. US Department of Agriculture and lists the names of the members of the Editorial Board.

Since 2004, the National Library of Medicine (NLM) has worked to digitize journals that were previously available only in print. The work began with a collaboration among NLM, the Wellcome Trust, and the U.K. Joint Information Systems Committee (JISC). That collaboration, the Journal Backfiles Digitization Project, focused on digitizing the back content of journals that were participating in PMC and whose content was not already available in electronic format. Work on that project concluded in 2010.

In 2014, NLM once again partnered with the Wellcome Trust, signing a Memorandum of Understanding (MOU) to work together to make thousands of complete back issues of historically significant biomedical journals freely available in PMC. The digitization of those titles was completed in 2019. Since then, NLM has continued the project to digitize titles from its collection, focusing on titles that are in the public domain.

[Image: Scanned full cover image from Proc Natl Acad Sci U S A. 1915 Jan; 1(1).]

The projects are described in more detail below. While the project details and specifications have evolved through the years, the output of each is largely the same:

  • High resolution page scans for all pages in a journal issue
  • Article-level metadata XML files
  • Article-level PDF files with embedded text generated from optical character recognition (OCR)
  • Advertising and administrative content PDF files
  • OCR text files for all articles and administrative content

Access to the Historical OCR Dataset Files

Full-text files of OCR text are available for bulk download from the PMC FTP Service. Most are available for text mining. Beginning in 2014, the digitized articles added to PMC include a Creative Commons license and therefore are also available via the PMC Open Access Subset. Learn more on the PMC Article Datasets page.

Tip: Find articles in PMC from digitized journals by using the search filter "is scanned"[filter].
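
The same filter can be used in a programmatic search. The sketch below assumes the NCBI E-utilities ESearch endpoint with `db=pmc`, which this page does not itself mention; the helper name `build_scanned_search_url` and its parameters are illustrative only. It builds the request URL without sending it.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (an assumption; not named on this page).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The search filter given in the tip above.
SCANNED_FILTER = '"is scanned"[filter]'


def build_scanned_search_url(extra_term: str = "", retmax: int = 20) -> str:
    """Build an ESearch URL that limits a PMC search to scanned articles.

    extra_term is an optional additional query (hypothetical parameter),
    combined with the filter using AND.
    """
    term = SCANNED_FILTER
    if extra_term:
        term = f"({extra_term}) AND {SCANNED_FILTER}"
    query = urlencode(
        {"db": "pmc", "term": term, "retmax": retmax, "retmode": "json"}
    )
    return f"{ESEARCH_URL}?{query}"


if __name__ == "__main__":
    # Print the URL only; fetching it is left to the reader.
    print(build_scanned_search_url("smallpox"))
```

Fetching the printed URL with any HTTP client should return a JSON result list of matching PMC records, subject to NCBI's usage guidelines.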

Journal Backfiles Digitization (2004-2010)

A number of journals that joined PMC prior to 2008 participated in NLM's backfiles digitization project, which was offered to publishers whose archival content was not yet available in electronic form. By scanning back issues that were previously available only in print, NLM helped create a complete digital archive of these journals in PMC. The original project announcement provides information about the collaboration among NLM, the Wellcome Trust, and the U.K. Joint Information Systems Committee (JISC).

Participating journals granted NLM permanent rights to archive the scanned material and make it freely available to the public through PMC, subject to fair use provisions of copyright law. Copyright for the material scanned during this project remains with the publisher or the individual authors, as applicable.

Each issue of the journals was scanned cover to cover and produced:

  • Individual page images, ranging from 300 dpi to 600 dpi, depending on the source material;
  • Metadata XML for each article, cover, issue administrative material, and advertising in an issue;
  • Grayscale and color graphics of all figures in the articles;
  • Individual PDF files for each item with embedded figures and OCR text;
  • OCR text files for each article and all administrative material in an issue.

The XML files created for the project included abstracts as they appeared in the source material, and where the abstracts were not already part of the PubMed database, PMC added them. PubMed records were also created for the scanned articles that had not already been indexed in PubMed.

Biomedical Journal Digitization (2014- )

Image: Scanned portion of masthead from the Annals of Medicine for the year 1796. Includes the text, Exhibiting a concise view of the latest and most important discoveries in medicine and medical philosophy. Also has a handwritten note in pencil in the upper right corner that says 103676 Mar 77.
[Image: Scanned masthead image from Ann Med (Edinb) 1796; 1.]

Initial Phase (2014-2019)

The historical material covered by the Memorandum of Understanding signed in 2014 between NLM and Wellcome Trust includes journals that fall under one of three categories for clearances and permissions:

  1. Material currently under copyright for which the publisher has granted NLM permission to digitize and include in PMC. This material is made available with a Creative Commons license chosen by the publisher.
  2. Material that is in the public domain. This material falls under the Creative Commons Public Domain Mark and is free of known copyright restrictions:
    • Titles published in their entirety in the United States prior to Jan 1, 1923.
    • Titles published in their entirety outside of the United States prior to Jan 1, 1877.
  3. Material identified by the Wellcome Trust as an Orphan Work following a diligent search to ascertain the rights holder. This material is made available with a Creative Commons Attribution-NonCommercial 4.0 International License per the MOU.
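
The public domain rule in category 2 can be expressed as a simple date check. The following is a minimal sketch, not NLM's actual tooling: the function name `is_public_domain` and its parameters are hypothetical, and it assumes that "published in their entirety" means the title's final issue appeared before the cutoff.

```python
from datetime import date

# Cutoff dates as stated in category 2 above.
US_CUTOFF = date(1923, 1, 1)
NON_US_CUTOFF = date(1877, 1, 1)


def is_public_domain(last_issue_date: date, published_in_us: bool) -> bool:
    """Return True if a title's entire run ended before the relevant cutoff.

    last_issue_date: publication date of the title's final issue (assumption:
    this is how "published in their entirety" would be tested).
    published_in_us: True if the title was published in the United States.
    """
    cutoff = US_CUTOFF if published_in_us else NON_US_CUTOFF
    return last_issue_date < cutoff


if __name__ == "__main__":
    # A U.S. title that ceased in 1915 qualifies; a non-U.S. one does not.
    print(is_public_domain(date(1915, 1, 1), published_in_us=True))
    print(is_public_domain(date(1915, 1, 1), published_in_us=False))
```

Titles that fail this check are not necessarily excluded; under the list above they may still qualify through publisher permission (category 1) or as an Orphan Work (category 3).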

Project Output

For each article, available issue cover, available table of contents, and the administrative material of an issue, the following output was produced:

  • NISO JATS XML metadata record 
  • 400-dpi 24-bit color LZW TIFF images of all pages 
  • PDF/A-2b of the article 
  • OCR text of the article. OCR text for some of these journals is available for download via the Historical OCR Collection described on the PMC Article Datasets page. 

Advertising in issues was also captured and preserved in PMC. The advertising page images have the same technical specifications listed above and are available as PDFs. OCR was not performed on the advertisement pages.

Citation records for articles were also created for PubMed.

Defining and Identifying Orphan Works

The Wellcome Trust chose to include certain journals that it has determined to be Orphan Works, as described below. Per the terms of the MOU, articles from the orphan works will appear in PMC under the Creative Commons Attribution-NonCommercial 4.0 International License.

Using the definition created by the European Commission, a work shall be considered an orphan work if:

  • none of the rights holders in that work is identified, or
  • none of the rights holders is located despite a diligent search for the rights holders having been carried out and recorded.

For this project, the Wellcome Library assumed that the publisher owns the rights to all the content it has published. No attempt was made to trace individual authors (of articles) or any third parties who may have content embedded within the journal. The diligent search proceeded as follows:

  1. Using the NLM Locator Plus service, staff at the Wellcome Library identified the full name of the journal and all of its variants, and recorded these details along with the name(s) of the publisher.
  2. Wellcome Library staff searched Ulrich’s Periodicals Directory to try to identify the contact details of the publisher. Where contact details were found, these were recorded. The search within Ulrich's began by using the journal title that was last used when it was actively published. Other “previously known” journal titles were searched if the initial attempt was unsuccessful.
  3. If the journal was not listed in Ulrich's, the search was expanded to a more general Internet search using Google. Where contact details were found, these were recorded.
  4. If a publisher address could not be found in Ulrich's or on the Internet, the Wellcome Library contacted the Publishers Licensing Society (PLS) in the UK to see whether it had any contact details for the publisher. Where contact details were found, these were recorded.
  5. In cases where a search of Ulrich's, the Internet, and the PLS failed to identify an address for the publisher, these works were considered orphans.
  6. In all cases where an address for the rights holder was found, the Wellcome Library contacted the rights holder to seek permission to digitize. Where a publisher did not give permission, the works were deemed out of scope.
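
The search procedure above amounts to an ordered fallback across sources, followed by a permission request. The sketch below is illustrative only: the names `Outcome`, `Lookup`, and `diligent_search` are hypothetical, each lookup stands in for a manual search (Ulrich's, a general Internet search, the PLS), and it assumes every title variant is tried against every source, which the steps state explicitly only for Ulrich's.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class Outcome(Enum):
    ORPHAN = "orphan work"                 # step 5: no publisher address found
    PERMISSION_GRANTED = "permission granted"
    OUT_OF_SCOPE = "out of scope"          # step 6: permission not given


# A lookup takes a journal title and returns contact details, or None.
Lookup = Callable[[str], Optional[str]]


def diligent_search(
    titles: Sequence[str],
    sources: Sequence[Lookup],
    ask_permission: Callable[[str], bool],
) -> Outcome:
    """Walk the sources in order and stop at the first contact found.

    titles: the last-used journal title first, then earlier titles (step 2).
    sources: ordered lookups, e.g. Ulrich's, Internet search, PLS (steps 2-4).
    ask_permission: returns True if the contacted publisher grants permission.
    """
    for source in sources:
        for title in titles:
            contact = source(title)
            if contact is not None:
                if ask_permission(contact):
                    return Outcome.PERMISSION_GRANTED
                return Outcome.OUT_OF_SCOPE
    return Outcome.ORPHAN
```

A work is labeled an orphan only after every source has been exhausted, which mirrors the requirement that the search be carried out and recorded before orphan status is assigned.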

Continuing Phases (2020- )

NLM continues to digitize historically significant biomedical journals from its collection for inclusion in PMC, using the same technical specifications. Material selected for digitization falls under the Creative Commons Public Domain Mark and is free of known copyright restrictions.

Project Output

For each article, available issue cover, available table of contents, and the administrative material of an issue, the following output is produced: 

  • NISO JATS XML metadata record 
  • 400-dpi 24-bit color LZW TIFF images of all pages
  • PDF/A-2b of the article 
  • OCR text of the article. OCR text for these journals is available for download via the Historical OCR Collection described on the PMC Article Datasets page. 

Advertising in issues is also preserved in PMC. The advertising page images have the same technical specifications listed above and are available as PDFs. OCR is not performed on the advertisement pages.

Citation records for articles are created for PubMed.