MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format

Zeeshan Ahmed; Thomas Dandekar

doi:10.12688/f1000research.7329.2

F1000Research. 2017 Apr 12;4:1453. Originally published 2015 Dec 16. [Version 2] doi: 10.12688/f1000research.7329.2


PMCID: PMC5897790 PMID: 29721305

Version Changes

Revised. Amendments from Version 1

We present here a revised manuscript striving for more clarity and better presentation of the results, including language and style. The conclusion is revised to include a discussion of some of the available image based databases, which can directly profit from MSL through fast, automatic separation of text from the text describing the images.

Abstract

Published scientific literature contains millions of figures, including information about the results obtained from different scientific experiments, e.g. PCR-ELISA data, microarray analysis, gel electrophoresis, mass spectrometry data, DNA/RNA sequencing, diagnostic imaging (CT/MRI and ultrasound scans), and medical imaging such as electroencephalography (EEG), magnetoencephalography (MEG), echocardiography (ECG) and positron-emission tomography (PET) images. The importance of biomedical figures has been widely recognized in the scientific and medical communities, as they play a vital role in providing major original data and experimental and computational results in concise form. One major challenge for implementing a system for scientific literature analysis is extracting and analyzing text and figures from published PDF files by physical and logical document analysis. Here we present a product line architecture based bioinformatics tool, ‘Mining Scientific Literature (MSL)’, which supports the extraction of text and images by interpreting all kinds of published PDF files using advanced data mining and image processing techniques. It provides modules for the marginalization of extracted text based on different coordinates and keywords, visualization of extracted figures and extraction of embedded text from all kinds of biological and biomedical figures using Optical Character Recognition (OCR). Moreover, for further analysis and usage, it generates the system’s output in different formats including text, PDF, XML and image files. Hence, MSL is an easy to install and use analysis tool to interpret published scientific literature in PDF format.

Keywords: Bioinformatics, Data mining, Images, Scientific literature, Text, OCR, PDF, Biomedical

Introduction

There has been an enormous increase in the amount of scientific literature in recent decades ¹. The importance of information retrieval in the scientific community is well known; it plays a vital role in analyzing published data. Most published scientific literature is available in Portable Document Format (PDF), a very common way of exchanging printable documents. To implement an efficient Natural Language Processing (NLP) based search application, it is therefore essential to extract text and figures from PDF files. Unfortunately, PDF is well suited to displaying and printing but requires explicit effort to extract information, which significantly impacts search and retrieval capabilities ². For this reason, several document analysis tools have been developed for physical and logical document structure analysis of this file type.

The basic information retrieval (IR) system recently provided by PubMed is efficient in extracting literature based on published text (titles, authors, abstracts, introduction etc.), with the application of automatic term mapping and Boolean operators ³. A successful NLP query normally returns a maximum of 20 relevant results per page; however, users can improve the search by customizing the query using the provided advanced options. So far, the current PubMed system, as well as many other related orthodox NLP approaches, are unable to completely implement an efficient information retrieval system capable of extracting both text and figures from published PDF files. One of the major technical challenges is the availability of structured text and figures. To the best of our knowledge, there is still no single tool available which can efficiently perform both physical and logical structure analysis of all kinds of PDF files and can extract and classify all kinds of information (including embedded text from all kinds of biological and scientific published figures). Different commercial and freely downloadable software applications provide support for extracting text and images from PDF files:

A-PDF (http://www.a-pdf.com/image-extractor/),

PDF Merge Split Extract (http://www.pdf-technologies.com/pdf-library-merge-split.aspx),

BePDF (http://haikuarchives.github.io/BePDF/),

KPDF (https://kpdf.kde.org),

MuPDF (http://mupdf.com),

Xpdf tool (http://www.foolabs.com/xpdf/),

Power PDF (http://www.nuance.com/for-business/imaging-solutions/document-conversion/power-pdf-converter/index.htm)

However, these software applications do not provide text and images in a form suitable for further logical analysis, e.g. mining text in reading order from double- or multi-column documents (the text of the first column followed by the text of the second column, and so on), searching marginal text using keywords, removing irrelevant graphics (e.g. journal and publisher’s logos and header-footer images embedded inside the document) and extracting embedded text from single and multi-panel complex biological images.

So far, the current PubMed system, as well as many other related orthodox NLP approaches, e.g. ⁴–¹³, are unable to completely implement an efficient information retrieval system capable of extracting both text and figures from published PDF files.

To meet the technological objectives of this challenge, we took a step forward in the development of a new user friendly, modular and client based system, MSL (the software acronym denotes “Mining Scientific Literature”), for the extraction of full and marginal text from PDF files based on keywords and coordinates (Figure 1). It was built with a product line architecture. MSL also provides a module for the extraction of figures from PDF files and applies Optical Character Recognition (OCR) to extract text from all kinds of biomedical and biological images. MSL comprises three modules working in a product-line architecture: Text, Image and OCR (Figure 2). Each module performs its task independently and its output is used as an input for the next module. When a PDF file is input to MSL, first the full and marginal text is automatically extracted, and then images are automatically extracted and placed in the same directory where the PDF file is located. The user then selects from the extracted images visible in the image view and applies OCR to a particular image to extract its text.

Figure 1. — There are three main components (left bottom square): Text, Image and OCR. A PDF document ³⁵ is input and processed by MSL. (left upper square): The text module provides extracted, searched and marginalized text in reading order, and file attributes. The image component provides the preview of extracted images from the document. OCR component provides extracted text from selected and processed image. The output is shown in upper right square and GUI and user options are indicated in the lower right square.

Figure 2. — There are three main components: Text, Image and OCR, and nine sub-components (rectangles): Text File, Image File, Visualize Image, PDF File, LEADTOOLS, XML File, iTEXTSharp, Bytescout, Spire. The text component applies iTEXTSharp, Bytescout and Spire to extract the text from the PDF document and write the output to an XML file. The image component applies Spire to extract images from the PDF document and visualizes them using Visualize Image. The OCR component applies LEADTOOLS to extract text from images and export it to PDF format. Colored arrows denote the information processing flow. The hexagon at the top ….

Methods

MSL extracts text and figures from the published scientific literature and helps in analyzing embedded text inside figures. The overall methodological implementation and workflow of the MSL is divided into two processes: (I) Text mining and (II) Image analysis. MSL is a desktop application, designed and developed following the scientific software engineering principles of three-layered Butterfly ¹⁴ software development model.

Text mining

Physical and logical document analysis remains an open challenge. To the best of the authors’ knowledge, there is no solution available which can perform efficient physical and logical structural analysis of PDF files, implement completely correct rendering order and classify text in all possible categories, e.g. Title, Abstract, Headings, Figure Captions, Table Captions, Equations, References, Headers, Footers etc.

However, some available tools help in this regard, e.g. PDF2HTML for contextual modeling of logical labelling ¹⁵, PDF-Analyzer for object level document analysis ¹⁶, XED for hidden structure analysis ², Dolores for logical structure analysis and recovery ¹⁷, and automatic conversion from PDF to XML ¹⁸ and from PDF to HTML ¹⁹.

MSL has enhanced capabilities compared to these tools, including Dolores (see comparison below). We developed MSL’s Text module to process PDF files with single, double or multiple columns. It divides the system’s text based output into four sub-modules: full text, marginal text, keyword based extracted text and file attributes. Full text gives the complete text of the PDF file. Marginal text allows the user to give coordinates (Lower Left X, Lower Left Y, Upper Right X and Upper Right Y) and extract the desired portion of the text from the PDF file. Keyword based text allows the user to extract information from the PDF file based on keywords and respective coordinates (Left, Top, Width, Height); this kind of search is helpful if, for example, a user is only interested in the figure captions or references. The last sub-module, file attributes, gives information about the input file, including title, author, creator, producer, subject, creation date, keywords, modification date, number of pages and number of figures.
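The marginal and keyword based sub-modules described above can be illustrated with a minimal Python sketch. This is not MSL's implementation (MSL is written in C-Sharp on top of third-party libraries); the `TextChunk` type, both function names and the sample page are hypothetical. The sketch assumes that each extracted piece of text carries the position of its lower-left corner in PDF points, with the origin at the bottom-left of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextChunk:
    """A piece of extracted text and the position of its lower-left corner."""
    text: str
    x: float
    y: float


def marginal_text(chunks, llx, lly, urx, ury):
    """Keep chunks anchored inside the rectangle (llx, lly)-(urx, ury)."""
    return [c for c in chunks if llx <= c.x <= urx and lly <= c.y <= ury]


def keyword_text(chunks, keyword):
    """Keep chunks that contain the keyword, ignoring case."""
    needle = keyword.lower()
    return [c for c in chunks if needle in c.text.lower()]


# Hypothetical chunks from one page of 612 x 792 points.
page = [
    TextChunk("Introduction", 72, 700),
    TextChunk("Figure 1. Workflow of the tool.", 72, 120),
    TextChunk("Page 3", 300, 30),
]

# A band near the bottom of the page that excludes the footer line.
print([c.text for c in marginal_text(page, 50, 100, 560, 200)])
print([c.text for c in keyword_text(page, "figure")])
```

Combining the two filters (a keyword search restricted to a rectangle) would correspond to the keyword based extraction with coordinates mentioned in the text.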

While implementing the Text module, we researched and tried different commercial and freely downloadable libraries with a focus on full text extraction, marginal text extraction, keyword based text extraction and text extraction from embedded images in PDF files. We tried different implemented systems and libraries (Table 1), e.g. iTextSharp, Bytescout, Spire PDF, Sautinsoft PDF Focus, Dynamic PDF, PDFBox, iText PDF, QPDF, PoDoFo, Haru PDF Library, JPedal, SVG Imprint, Glance PDF Tool Kit, BCL, SharpPDF etc.

Table 1. Systems and Libraries tested for MSL ¹.

Library Name	Weblink
iTextSharp	http://sourceforge.net/projects/itextsharp/
Bytescout	https://bytescout.com
Spire PDF	http://www.e-iceblue.com/Introduce/pdf-for-net-introduce.html
Sautinsoft PDF Focus	http://www.sautinsoft.com/products/pdf-focus/
Dynamic PDF	https://www.dynamicpdf.com
PDFBox	https://pdfbox.apache.org
iText PDF	http://itextpdf.com
QPDF	http://qpdf.sourceforge.net
PoDoFo	http://podofo.sourceforge.net
Haru PDF Library	http://libharu.sourceforge.net
JPedal	https://www.idrsolutions.com/jpedal/
SVG Imprint	http://svgimprint-windows.software.informer.com
Glance PDF Tool Kit	http://www.planetpdf.com/forumarchive/53545.asp
BCL	http://www.pdfonline.com/corporate/
SharpPDF	http://sharppdf.sourceforge.net


¹This table lists the different systems and libraries that were tested for the extraction of text from PDF files.

One of the common problems in almost all libraries is the merging and mixing of text when documents use double or multiple columns. Our system combines different libraries, each useful for a different purpose. We used Spire PDF to remove bookmarks, iTextSharp for the extraction of full and marginal text, and Bytescout for the keyword based marginalized text search and for producing output in the form of an XML file (Figure 2). The generated XML file contains structured (tagged) text along with information about its coordinates (placement in the file), font (bold, italic etc.) and size, which can be used for mapping and pattern recognition tasks.
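To show how tagged text with coordinates, font and size can support later mapping tasks, the following Python sketch builds and queries a small XML fragment. The element and attribute names (`page`, `text`, `x`, `y`, `font`, `size`, `bold`, `italic`) are assumptions made for illustration; the actual schema of the XML file written by MSL is not described here and may differ.

```python
import xml.etree.ElementTree as ET


def chunk_to_element(text, x, y, font, size, bold=False, italic=False):
    """Wrap one text chunk in a <text> element (hypothetical schema)."""
    element = ET.Element("text", {
        "x": str(x),
        "y": str(y),
        "font": font,
        "size": str(size),
        "bold": str(bold).lower(),
        "italic": str(italic).lower(),
    })
    element.text = text
    return element


def build_page(number, chunks):
    """Collect the chunks of one page under a <page> element."""
    page = ET.Element("page", {"number": str(number)})
    for chunk in chunks:
        page.append(chunk_to_element(**chunk))
    return page


def bold_text(page):
    """Return the text of all bold chunks, e.g. candidate headings."""
    return [el.text for el in page.iter("text") if el.get("bold") == "true"]


page = build_page(1, [
    {"text": "Abstract", "x": 72, "y": 700, "font": "Times", "size": 12, "bold": True},
    {"text": "Body text of the abstract.", "x": 72, "y": 680, "font": "Times", "size": 10},
])
xml_string = ET.tostring(page, encoding="unicode")
print(xml_string)
print(bold_text(ET.fromstring(xml_string)))
```

Selecting chunks by font attributes, as `bold_text` does, is one simple way such a file could be used to locate headings or captions.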

Image processing

Image-based analysis is a versatile and inherently multiplexed approach, as it can quantitatively measure biological images to detect features which are not easily detectable by the human eye. Millions of figures have been published in the scientific literature, including information about results obtained from different biological and medical experiments. Several data and image mining solutions have been implemented, published and used over the last 15 years ²⁰. Some of the mainstream approaches target the analysis of all kinds of images (flow charts, experimental images, models, geometrical shapes, graphs, images of things or objects, mixed etc.). Few approaches have been proposed for specific kinds of image analysis, e.g. the identification and quantification of cell phenotypes ²¹, prediction of the subcellular localization of proteins in various organisms ²², analysis of gel diagrams ²³, and mining and integration of pathway diagrams ²⁴.

While implementing a new data-mining tool, one of our goals was to extract images from published scientific literature and to extract their embedded text as well. We analyzed different freely available and commercial OCR systems and libraries, including Aspose, PUMA, Microsoft OCR, Tesseract, LEADTOOLS, Nicomsoft OCR, MeOCR OCR, OmniPage, ABBYY and Bytescout, which claim to be able to extract embedded text from figures. During our research we found LEADTOOLS (Figure 2) to be one of the best available solutions for this purpose. MSL automatically extracts images from PDF files and allows the user to apply OCR to any extracted image after clicking and enlarging it for a better view (using the Windows default image viewer).

Results and Discussion

MSL mining performance tested on different literature sources

We tested MSL with similar parameters on randomly selected scientific manuscripts (ten PDF files) from different open access (F1000Research, Frontiers, PLOS, Hindawi, PeerJ, BMC) and restricted access (Oxford University Press, Springer, Emerald, Bentham Science, ACM) publishers, including some of the authors’ published papers; details are given in Table 2. While testing MSL on the selected manuscripts, we observed the best overall performance for manuscripts ²⁵–³⁰, with satisfactory results from almost all publishers (including Oxford University Press, BMC, Frontiers, PeerJ, Bentham Science, ACM) in terms of both extracting text in reading order and extracting images. Poorer performance was observed for manuscripts from PLOS ³¹, Hindawi ³², F1000Research ³³ and IEEE ³⁴. In the case of text extraction, the text was in reading order for the manuscripts from F1000Research and IEEE, but the text was without spaces in the manuscript from PLOS and had additional lines and extra spaces in the manuscript from Hindawi. In the case of figure extraction, we observed one common problem among the four manuscripts from these publishers: along with the manuscript images (Figures), embedded journal or publishers’ logos and images were also extracted. Additionally, while analyzing the manuscript from F1000Research, we observed that the images were broken into many pieces and it was not possible to find one single complete image. As we did not test all manuscripts from the mentioned publishers, we cannot claim that the results will be the same for all papers from a publisher, as the output may vary between papers. Our observed results using MSL are given in the attached supplementary material (Supplementary Table S1 and Dataset 1).

Extracted images and text from papers tested using MSL

The raw dataset attached to this manuscript provides, by category, all images and the text (in XML format) extracted using MSL from the tested manuscripts; the publisher is included in each file name ⁴⁴.

Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).


Table 2. Papers (PDF files) tested using MSL ¹.

Publishers	Manuscript
F1000Research	Ant-App-DB: a smart solution for monitoring arthropods activities, experimental data management and solar calculations without GPS in behavioral field studies ³³.
PLOS	The Genomic Aftermath of Hybridization in the Opportunistic Pathogen Candida metapsilosis ³¹.
Hindawi	Mathematical Properties of the Hyperbolicity of Circulant Networks ³².
IEEE	Design implementation of I-SOAS IPM for advanced product data management ³⁴.
BMC	Software LS-MIDA for efficient mass isotopomer distribution analysis in metabolic modeling ²⁶.
PeerJ	Anvi’o: an advanced analysis and visualization platform for ‘omics data ²⁷.
Frontiers	Ontology-based approach for in vivo human connectomics: the medial Brodmann area 6 case study ²⁸.
ACM	Intelligent semantic oriented agent based search (I-SOAS) ²⁹.
Bentham Science	DroLIGHT-2: Real Time Embedded and Data Management System for Synchronizing Circadian Clock to the Light-Dark Cycles ³⁰.
Oxford University Press	Bioimaging-based detection of mislocalized proteins in human cancers by semi-supervised learning ²⁵.


¹We selected these ten manuscripts from different publishers for testing and validating the MSL application.

To apply MSL, published scientific literature has first to be downloaded in the form of a PDF file, from any published source. The validation process using MSL consists of three major steps: 1) Text mining, 2) Image extraction, and 3) Application of OCR to extract text from selected images as shown in Figure 1, following the implemented workflow as shown in Figure 2. Example results and graphics are shown in Figure 1, Figure 3 and Figure 4. Representation includes the extraction of text and images from one of the randomly selected papers ³⁵, and application of OCR to one of the extracted images from another randomly picked publication ²⁵.

Figure 3. — A figure (shown as three panels; including two charts, one image and a table) is analyzed (example from ref. 25). OCR (LEADTOOLS) is applied to extract and report the text from the figure in two ways (red stippled lines): simple text form (symbolized by …..section: Extracted Text from Figure) as well as in PDF file (rectangle) with similar margins to the original figure (section: Exported text in PDF format). Steps involved are document image analysis, text extraction and PDF conversion.

Figure 4. — The scanned image based page of one of the randomly selected papers ³⁵ is processed using OCR (LEADTOOLS; blue arrows): Text is extracted from the image and a new PDF is generated (rectangles), which is based on the text, placed with similar margins to the image file. Steps involved are again document image analysis, text extraction and PDF conversion.

Figure 1 shows that the PDF file of one randomly selected published article ³⁵ is input to MSL’s Text module, and the extracted text is divided into three categories: (i) complete text in excellent rendering order, (ii) marginalized text and (iii) keyword based searched text. Two figures (Figure 1 and Figure 2) are extracted and displayed in the image section, and one of them is selected for OCR. The applied OCR extracts textual information, which is displayed and can be exported to a PDF file.

MSL validation and comparison to other tools

To further validate the application of OCR and discuss different results, Figure 3 shows another example of embedded text extraction from a complex figure ²⁵, which includes three panels: (i) colorful pie and circle charts, (ii) biological images and (iii) tabular information. As in our prior application of OCR, results are displayed in textual form as well as in a generated PDF file of the extracted text. A noticeable difference between the two outputs is that the textual information is presented line by line, whereas in the PDF file the information is placed with margins corresponding to the original image.

The last example validates MSL by extracting textual information from image based PDF files. We produced an image form of one of the randomly selected articles ²⁵ and then processed one of its pages. As Figure 4 shows, the obtained results were comprehensive in both the textual and the PDF form. This kind of textual extraction can be very helpful, especially when the literature is available only as images, e.g. in the case of old literature published in print only but electronically available in scanned form. MSL produces several files as system output in the parent folder of the input files. These files are: XML files (which include structured or tagged information), image files (extracted from the PDF file) and PDF files for all images analyzed using OCR.

We mentioned earlier that we tried and implemented different libraries for text and image extraction and analysis. The best text based outcome was observed using iTextSharp, better image extraction was observed using Spire, and OCR from LEADTOOLS was the most promising. While validating the implemented solution, besides the expected results (text and images), we observed some limitations in the libraries used. Irrelevant images are also automatically extracted, e.g. journal logos, publisher’s logos and header-footer images embedded inside the document (e.g. images added by the publishers to provide publishing details). However, these images are easily recognized by the user and, because the same logo typically recurs, can also be removed automatically if desired.
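The idea that recurring logos can be removed automatically can be sketched as follows. This is an assumed approach, not a documented MSL feature: the function name, the file names and the byte strings standing in for image data are all hypothetical. The sketch drops every extracted image whose exact content occurs more than once in the document, on the assumption that genuine figures appear only once while logos repeat.

```python
import hashlib
from collections import Counter


def drop_repeated_images(images, max_occurrences=1):
    """Return names of images whose content occurs at most max_occurrences times.

    images is a list of (name, raw_bytes) pairs in extraction order.
    """
    digests = [hashlib.sha256(data).hexdigest() for _, data in images]
    counts = Counter(digests)
    return [
        name
        for (name, _), digest in zip(images, digests)
        if counts[digest] <= max_occurrences
    ]


# Placeholder byte strings stand in for real image data.
extracted = [
    ("page1_img1.png", b"publisher-logo"),
    ("page1_img2.png", b"figure-1-pixels"),
    ("page2_img1.png", b"publisher-logo"),
    ("page2_img2.png", b"figure-2-pixels"),
    ("page3_img1.png", b"publisher-logo"),
]
kept = drop_repeated_images(extracted)
print(kept)
```

Exact hashing only catches byte-identical copies; a logo that is re-encoded or rescaled on some pages would need a perceptual comparison instead.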

Furthermore, text was not always in good rendering order, especially when there were text-based mathematical equations with superscripts and subscripts; in the case of double or multicolumn PDF files, the rendering order of most of the libraries is not correct. While extracting text, we found that some important symbols were missed and that spurious spaces were generated in some paragraphs. We found that it was not possible to extract, as a whole, images that are created in the manuscript as a combination of different sub-images and text objects. In these cases, the text appears in the extracted text area and the sub-images appear separately in the image section, with the possibility that some sub-images are missed. Moreover, when we applied OCR to different images (extracted or loaded), we found that its performance varies with the complexity of the input images. For special characters (e.g. Greek delta, alpha, beta etc.), it does not perform well unless these are hard-wired in the software.

In comparison to the earlier mentioned tools, MSL possesses some advantages as well as limitations. For instance, Dolores helps the user add custom tags to the PDF document and create a semantic model associated with the processed class of documents; PDF2HTML implements a conditional random fields (CRF) based model to learn semantics from the content of processed PDF pages; PDF-Analyzer devised a model based on rectangular objects for the analysis of PDF documents; and XED combines PDF symbol analysis with traditional document image processing techniques. MSL does not apply any of these methods or support such features. However, beyond the capabilities of the above tools, MSL supports marginalization of text, provides text in correct reading order, enables keyword based search and extracts embedded text from figures (using OCR), which none of these tools does. To enhance the functionality of the MSL program (e.g. our standard version available for download), we give a table of the special symbols most often used in biomedical literature (Table 3). Depending on the intended application, users can extend the MSL parser to also consider the special characters that occur often in their texts.

Table 3. Special symbols found in biomedical literature ¹.

Number	Special Symbols	Name
1	Δ	Delta
2	α	Alpha
3	β	Beta
4	ϕ	Phi


¹Incorporating the special characters that occur most often in the texts of interest into the parser further enhances MSL’s capabilities. This is, however, a text-dependent additional modification of the MSL program.
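One way to picture such a text-dependent extension is a lookup table built from Table 3. The Python sketch below is only an illustration of the idea and does not reflect how the MSL parser is actually extended; the dictionary and function names are hypothetical. Both Unicode forms of phi are included because extracted text may contain either one.

```python
# Symbols from Table 3, written as Unicode escapes to avoid encoding ambiguity.
SPECIAL_SYMBOLS = {
    "\u0394": "Delta",  # Greek capital letter delta
    "\u03b1": "Alpha",  # Greek small letter alpha
    "\u03b2": "Beta",   # Greek small letter beta
    "\u03c6": "Phi",    # Greek small letter phi
    "\u03d5": "Phi",    # Greek phi symbol (variant form)
}


def find_symbols(text):
    """Return the sorted names of the special symbols present in the text."""
    return sorted({name for symbol, name in SPECIAL_SYMBOLS.items() if symbol in text})


def spell_out(text):
    """Replace each special symbol with its name, e.g. for keyword search."""
    return text.translate(str.maketrans(SPECIAL_SYMBOLS))


sample = "TNF-\u03b1 and \u0394G"
print(find_symbols(sample))
print(spell_out(sample))
```

Adding a further symbol for a specific corpus only requires one more dictionary entry.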

MSL Implementation & operation

MSL architecture is based on the Product Line Architecture (PLA) and Multi-Document Interface (MDI) developmental principles, and it is designed and developed (using the C-Sharp programming language and the Microsoft Dot NET Framework) following the key principles of the Butterfly paradigm ¹⁴, ³⁶. The workflow of MSL is divided into two processes: (I) extraction and marginalization of text with respect to the division and placement of text in the PDF file, and keyword based search, using the iTextSharp, Bytescout and Spire PDF libraries; and (II) extraction and analysis of figures using the Spire PDF library and LEADTOOLS OCR.

It takes Portable Document Format (PDF) based literature files as input, performs partial physical structure analysis, and exports output in different formats, e.g. text, images and XML files. It allows the user to extract text based on keywords and marginal (X and Y coordinate) information, to view the PDF file’s metadata (title, author, creator, producer, subject, creation date, keywords, modification date, number of pages and number of figures) and to save the extracted full and marginal text in text files.

Biomedical image extraction and analysis is one of the most complex tasks in the fields of computer science and image analysis. Some of the mainstream approaches ³⁷–⁴² have been proposed for the analysis of all kinds of images (e.g. flow charts, experimental images, models, geometrical shapes, graphs, images of things, mixed etc.). MSL allows the user to automatically extract images from PDF files, view any selected image in the Windows default image viewer and apply the implemented OCR. Besides extracting images from a PDF file, MSL allows the user to load any image, apply OCR and export the output to a readable PDF file.

MSL produces several output files in the parent folder, including XML files (which include structured or tagged information), image files (extracted from the PDF file) and PDF files for all images analyzed using OCR (Figure 5).

Figure 5. — This figure shows different files generated during analysis of PDF document. PDF file (top, left) is the actual document, XML file is the structured (tagged) form of extracted text (top, middle), a second PDF file (top, right) is the extracted text from image (see Figure 3) and all other files are extracted image from the original PDF document.

The MSL application is very simple to install and use. It was tested and can be configured on the Microsoft Windows platform (preferred OS version: 7). MSL follows a simple six-step installation process (Figure 6). After installation, it can be run either by clicking on the installed application’s icon on the desktop or by following the sequence: Start → All Programs → MSL 1.0.0 → MSL.

Figure 6. — Squares indicate steps, the blue stippled line the process.

When using the MSL application, one important point to remember is that it is based on different PDF text extraction, marginalization and figure extraction libraries, which are automatically configured during installation. The OCR library used, LEADTOOLS, is not freely available; we used it under a (free) academic research license. The OCR library is also automatically configured during installation, but its performance on other (non-licensed) machines is not confirmed. Moreover, the recommended display resolution is 1680×1050 in landscape orientation.

Conclusions

The development of a virtual research environment to store and link molecular data can be well achieved if the mixture of text, protocols and omics data is first properly separated from images, figures and figure legends, a task for which our tool is well suited. A number of databases (e.g. Alzheimer’s Disease Neuroimaging Initiative (ADNI); Breast Cancer Digital Repository (BCDR); BiMed; Public Image Databases; Cancer Image Database (caIMAGE); COllaborative Informatics and Neuroimaging Suite (COINS); DrumPID; Digital Database for Screening Mammography (DDSM); Electron Microscopy Data Bank (EMDB); LONI image data archive; Mammography Image Databases (MID); New Database Provides Millions of Biomedical Images; Open Access Series of Imaging Studies (OASIS); Stanford Tissue Microarray Database (TMA); STRING; The Cancer Imaging Archive (TCIA); Whitney Imaging Center etc.) can directly profit from MSL through fast, automatic separation of text and text descriptions from images; separating out the figure legends that describe the images is important for further improvement of a database and its content. One in-house example is the DrumPID database ⁴³, in which we warehouse different types of data and images, and where an improved separation and retrieval of text versus figure legends, image descriptions etc. is highly useful and currently applied.

The latest available and easy to use version of MSL has been tested and validated in-house. The advancements in information retrieval techniques for text and figure analysis combined with this sophisticated computational tool can support various studies.

Data availability

Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication). http://creativecommons.org/publicdomain/zero/1.0/

F1000Research: Dataset 1. Extracted Images and Text from Papers tested using MSL, 10.5256/f1000research.7329.d108739 ⁴⁴

Software availability

Software access

The software executable is freely available at the following web link: https://zenodo.org/record/30941#.Vi0PtmC5LHM

The software download section provides one executable: MSL, setup to be installed on the Microsoft Windows platform.

MSL has NOT been developed for any commercial purpose but as a non-commercial prototype application for academic research, analysis and development purposes.

Archived software files as at the time of publication

Mining Scientific Literature (MSL) Ver 1.0.0 (DOI: 10.5281/zenodo.30941).

License

All associated files are licensed under the Academic Free License 3.0 (AFL 3.0).

Acknowledgments

We thank the German Research Foundation (DFG-TR34/Z1) for support. We would like to thank Dr. Chunguang Liang (University of Wuerzburg, Germany) for his help in testing MSL and all interested colleagues for critical community input on the approach and anonymous reviewers for their helpful comments.

We would like to thank all the open source, licensed and commercial library providers, for their help in this non-commercial and academic research and software development.

Funding Statement

This work was supported by a German Research Foundation grant (DFG-TR34/Z1) to TD.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 2; referees: 4 approved with reservations]

Supplementary material

Supplementary Table S1. List of Papers (PDF files) tested using MSL.

Supplementary table listing some of the manuscripts from different publishers (F1000Research, PLOS, Hindawi, IEEE, BMC, PeerJ, Frontiers, ACM, Bentham Science and Oxford University Press) that were used for testing and validating the MSL application. The attached table provides information about some of the extracted images and the observed full and marginal text.


References

1. Hunter L, Cohen KB: Biomedical language processing: what’s beyond PubMed? Mol Cell. 2006;21(5):589–594. 10.1016/j.molcel.2006.02.012
2. Hadjar K, Rigamonti M, Lalanne D, et al.: Xed: A New Tool for Extracting Hidden Structures from Electronic Documents. In International Workshop on Document Image Analysis for Libraries. 2004;221–224. 10.1109/DIAL.2004.1263250
3. Sayers EW, Barrett T, Benson DA, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010;38(Database issue):D5–16. 10.1093/nar/gkp967
4. States DJ, Ade AS, Wright ZC, et al.: MiSearch adaptive pubMed search tool. Bioinformatics. 2009;25(7):974–76. 10.1093/bioinformatics/btn033
5. Poulter GL, Rubin DL, Altman RB, et al.: MScanner: a classifier for retrieving Medline citations. BMC Bioinformatics. 2008;9(1):108. 10.1186/1471-2105-9-108
6. Plikus MV, Zhang Z, Chuong CM: PubFocus: semantic MEDLINE/PubMed citations analytics through integration of controlled biomedical dictionaries and ranking algorithm. BMC Bioinformatics. 2006;7:424. 10.1186/1471-2105-7-424
7. Smalheiser NR, Zhou W, Torvik VI, et al.: Anne O’Tate: A tool to support user-driven summarization, drill-down and browsing of PubMed search results. J Biomed Discov Collab. 2008;3:2. 10.1186/1747-5333-3-2
8. Doms A, Schroeder M: GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res. 2005;33(Web Server issue):W783–86. 10.1093/nar/gki470
9. Kim JJ, Pezik P, Rebholz-Schuhmann D: MedEvi: retrieving textual evidence of relations between biomedical concepts from Medline. Bioinformatics. 2008;24(11):1410–12. 10.1093/bioinformatics/btn117
10. Rebholz-Schuhmann D, Kirsch H, Arregui M, et al.: EBIMed--text crunching to gather facts for proteins from Medline. Bioinformatics. 2007;23(2):e237–44. 10.1093/bioinformatics/btl302
11. Douglas SM, Montelione GT, Gerstein M, et al.: PubNet: a flexible system for visualizing literature derived networks. Genome Biol. 2005;6(9):R80. 10.1186/gb-2005-6-9-r80
12. Eaton AD: HubMed: a web-based biomedical literature search interface. Nucleic Acids Res. 2006;34(Web Server issue):W745–47. 10.1093/nar/gkl037
13. Hearst MA, Divoli A, Guturu H, et al.: BioText Search Engine: beyond abstract search. Bioinformatics. 2007;23(16):2196–97. 10.1093/bioinformatics/btm301
14. Ahmed Z, Zeeshan S, Dandekar T: Developing sustainable software solutions for bioinformatics by the “Butterfly” paradigm [version 1; referees: 2 approved with reservations]. F1000Res. 2014;3:71. 10.12688/f1000research.3681.2
15. Tao X, Tang Z, Xu C: Contextual Modeling for Logical Labeling of PDF Documents. Comput Electr Eng. 2014;40(4):1363–75. 10.1016/j.compeleceng.2014.01.005
16. Hassan T: Object-Level Document Analysis of PDF Files. In Proceedings of the 9th ACM symposium on Document engineering. 2009;47–55. 10.1145/1600193.1600206
17. Bloechle JL, Rigamonti M, Ingold R: OCD Dolores - Recovering Logical Structures for Dummies. In 10th IAPR International Workshop on Document Analysis Systems (DAS). 2012;245–249. 10.1109/DAS.2012.58
18. Déjean H, Meunier JL: A System for Converting PDF Documents into Structured XML Format. In Proceedings of the 7th international conference on Document Analysis Systems. 2006;129–140. 10.1007/11669487_12
19. Rahman F, Alam H: Conversion of PDF Documents into HTML: A Case Study of Document Image Analysis. In Proceedings of Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers. 2003;1:87–91. 10.1109/ACSSC.2003.1291873
20. Zweigenbaum P, Demner-Fushman D, Yu H, et al.: Frontiers of biomedical text mining: current progress. Brief Bioinform. 2007;8(5):358–375. 10.1093/bib/bbm045
21. Carpenter AE, Jones TR, Lamprecht MR, et al.: CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 2006;7(10):R100. 10.1186/gb-2006-7-10-r100
22. Chou KC, Shen HB: Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat Protoc. 2008;3(2):153–162. 10.1038/nprot.2007.494
23. Kuhn T, Nagy ML, Luong T, et al.: Mining images in biomedical publications: Detection and analysis of gel diagrams. J Biomed Semantics. 2014;5(1):10. 10.1186/2041-1480-5-10
24. Kozhenkov S, Baitaluk M: Mining and integration of pathway diagrams from imaging data. Bioinformatics. 2012;28(5):739–742. 10.1093/bioinformatics/bts018
25. Xu YY, Yang F, Zhang Y, et al.: Bioimaging-based detection of mislocalized proteins in human cancers by semi-supervised learning. Bioinformatics. 2015;31(7):1111–9. 10.1093/bioinformatics/btu772
26. Ahmed Z, Zeeshan S, Huber C, et al.: Software LS-MIDA for efficient mass isotopomer distribution analysis in metabolic modelling. BMC Bioinformatics. 2013;14:218. 10.1186/1471-2105-14-218
27. Eren AM, Esen ÖC, Quince C, et al.: Anvi'o: an advanced analysis and visualization platform for 'omics data. PeerJ. 2015;3:e1319. 10.7717/peerj.1319
28. Moreau T, Gibaud B: Ontology-based approach for in vivo human connectomics: the medial Brodmann area 6 case study. Front Neuroinform. 2015;9:9. 10.3389/fninf.2015.00009
29. Ahmed Z: Intelligent semantic oriented agent based search (I-SOAS). In Proceedings of the 7th International Conference on Frontiers of Information Technology. 2009. 10.1145/1838002.1838065
30. Ahmed Z, Helfrich-Förster C: DroLIGHT-2: Real Time Embedded and Data Management System for Synchronizing Circadian Clock to the Light-Dark Cycles. Recent Patents on Computer Sci. 2013;6(3):191–205. 10.2174/2213275906666131108211241
31. Pryszcz LP, Németh T, Saus E, et al.: The Genomic Aftermath of Hybridization in the Opportunistic Pathogen Candida metapsilosis. PLoS Genet. 2015;11(10):e1005626. 10.1371/journal.pgen.1005626
32. Hernández JC, Rodríguez JM, Sigarreta JM: Mathematical Properties of the Hyperbolicity of Circulant Networks. Adv Math Phys. 2015;2015:723451. 10.1155/2015/723451
33. Ahmed Z, Zeeshan S, Fleischmann P, et al.: Ant-App-DB: a smart solution for monitoring arthropods activities, experimental data management and solar calculations without GPS in behavioral field studies [version 2; referees: 1 approved, 2 approved with reservations]. F1000Res. 2015;3:311. 10.12688/f1000research.5931.3
34. Zeeshan A, Detlef G: Design implementation of I-SOAS IPM for advanced product data management. In IEEE 2nd International Conference on Computer, Control and Communication. 2009;1–5. 10.1109/IC4.2009.4909215
35. Ahmed Z, Mayr M, Zeeshan S, et al.: Lipid-Pro: a computational lipid identification solution for untargeted lipidomics on data-independent acquisition tandem mass spectrometry platforms. Bioinformatics. 2015;31(7):1150–1153. 10.1093/bioinformatics/btu796
36. Ahmed Z, Zeeshan S: Cultivating Software Solutions Development in the Scientific Academia. Recent Patents on Computer Sci. 2014;7(1):54–66. 10.2174/2213275907666140612210552
37. Schindelin J, Arganda-Carreras I, Frise E, et al.: Fiji: an open-source platform for biological-image analysis. Nat Methods. 2012;9(7):676–82. 10.1038/nmeth.2019
38. Schmid B, Schindelin J, Cardona A, et al.: A high-level 3D visualization API for Java and ImageJ. BMC Bioinformatics. 2010;11:274. 10.1186/1471-2105-11-274
39. Schneider CA, Rasband WS, Eliceiri KW: NIH Image to ImageJ: 25 years of image analysis. Nat Methods. 2012;9(7):671–75. 10.1038/nmeth.2089
40. Peng H, Ruan Z, Long F, et al.: V3D enables real-time 3D visualization and quantitative analysis of large-scale biological image data sets. Nat Biotechnol. 2010;28(4):348–53. 10.1038/nbt.1612
41. Lopez LD, Yu J, Arighi C, et al.: A framework for biomedical figure segmentation towards image-based document retrieval. BMC Syst Biol. 2013;7(Suppl 4):S8. 10.1186/1752-0509-7-S4-S8
42. Sheng J, Xu S, Deng W, et al.: Novel Image Features for Categorizing Biomedical Images. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2012. 10.1109/BIBM.2012.6392689
43. Kunz M, Liang C, Nilla S, et al.: The drug-minded protein interaction database (DrumPID) for efficient target analysis and drug development. Database (Oxford). 2016;2016:baw041. 10.1093/database/baw041
44. Ahmed Z, Dandekar T: Dataset 1 in: MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format. F1000Research. 2015. 10.5256/f1000research.7329.d108739

F1000Res. 2017 Sep 22. doi: 10.5256/f1000research.12318.r24964

Referee response for version 2

Karin Verspoor

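This manuscript addresses the analysis of scientific articles published in PDF form, and introduces a software tool that essentially wraps a number of other tools to build a single tool that integrates a number of features. There is no novel methodological contribution in the tool itself that I could determine; the value is primarily in integrating the other tools into a user-facing tool. However, the value of a user-facing tool (as opposed to an automated tool that proceeds without human input) is not made clear or evaluated (are users happy to do the work expected of them?).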

As a very high level point on motivation, many scientific articles -- particularly in the open access literature which forms the majority of the studied papers -- are already available as raw XML and it does not make sense to me to try to parse the PDF to infer structure (in a way that first requires user interaction, and second may introduce errors) that can be read directly from the publisher-produced XML. The PubMed Central repository is a repository of full-text, available online as structured HTML, and the open access collection which is downloadable in its entirety uses XML (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). So in my opinion several of the key objectives of this tool are not needed for a large and growing proportion of the scientific literature. The argument could be made that older articles and articles from certain publishers are not available as "raw" XML, but the authors neither make this argument nor specifically test this question. In addition, this XML resource provides a fantastic opportunity to automatically test (some aspects of) the performance of the authors' tool. This is not done. The authors should consider the contribution of their tool in the context of such resources. The Introduction focuses on PubMed, but PubMed does not claim to search over the full-text literature (it is limited to abstracts by design), and PubMed Central is not mentioned here.

There is one more tool that I am aware of -- and I suspect there are others -- that the authors do not mention; the Layout-Aware PDF tool ¹. In general, I'd be interested to understand more deeply how the authors' tool differs from the various tools they mention, and why they selected the tools they did for their final system.

The Results presented by the authors are not framed in any well-defined evaluation framework. While they mention "best overall performance", no indication of either specific quantitative or qualitative criteria used to assess the tool has been provided. What were the criteria that were used? Were there guidelines for how to "score" systems on various criteria? What is presented is largely ad hoc qualitative observations. On what basis was performance of the tool ranked?

Comments on presentation:

It is poor practice to cite papers that are purely studied as artifacts -- a citation implies that you are referencing scientific content which you are not. Please remove citations to the papers that were used to test the system; they should be listed (preferably with DOIs) in Table 2, with a paper ID, and the paper IDs should be referenced in the article.

There are a number of phrases in the manuscript that are awkward. "marginalization" does not mean "to find margins". Queries in PubMed are not "NLP quer[ies]" but rather user queries processed by an IR system (NB: arguably, IR systems don't even use NLP). The authors talk about "orthodox NLP approaches"; I have no idea what makes an NLP approach "orthodox". I suspect the authors mean "pipeline" rather than "product line". What is an "inherently multiplexed approach"?

The long list of databases in the Conclusions doesn't add much to the manuscript; perhaps the example of the DrumPID should be pulled into a discussion section, and elaborated to clearly demonstrate the practical application and value of the tool.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

References

1. Ramakrishnan C, Patnia A, Hovy E, Burns GA: Layout-aware text extraction from full-text PDF of scientific articles. Source Code Biol Med. 2012;7(1):7. 10.1186/1751-0473-7-7

F1000Res. 2018 Mar 29.

Zeeshan Ahmed

Reply: Thank you so much for your recommendations.

This manuscript addresses the analysis of scientific articles published in PDF form, and introduces a software tool that essentially wraps a number of other tools to build a single tool that integrates a number of features. There is no novel methodological contribution in the tool itself that I could determine; the value is primarily in integrating the other tools into a user-facing tool. However, the value of a user-facing tool (as opposed to an automated tool that proceeds without human input) is not made clear or evaluated (are users happy to do the work expected of them?). As a very high level point on motivation, many scientific articles -- particularly in the open access literature which forms the majority of the studied papers -- are already available as raw XML and it does not make sense to me to try to parse the PDF to infer structure (in a way that first requires user interaction, and second may introduce errors) that can be read directly from the publisher-produced XML. The PubMed Central repository is a repository of full-text, available online as structured HTML, and the open access collection which is downloadable in its entirety uses XML (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/).

Reply: Thank you so much for your views.

So in my opinion several of the key objectives of this tool are not needed for a large and growing proportion of the scientific literature. The argument could be made that older articles and articles from certain publishers are not available as "raw" XML, but the authors neither make this argument nor specifically test this question. In addition, this XML resource provides a fantastic opportunity to automatically test (some aspects of) the performance of the authors' tool. This is not done. The authors should consider the contribution of their tool in the context of such resources. The Introduction focuses on PubMed, but PubMed does not claim to search over the full-text literature (it is limited to abstracts by design), and PubMed Central is not mentioned here.

Reply: We thank you so much for the nice suggestion and have tried to revise the manuscript accordingly.

There is one more tool that I am aware of -- and I suspect there are others -- that the authors do not mention; the Layout-Aware PDF tool ¹. In general, I'd be interested to understand more deeply how the authors' tool differs from the various tools they mention, and why they selected the tools they did for their final system.

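Reply: We thank you so much for the nice suggestion and have tried to revise the manuscript accordingly.

The Results presented by the authors are not framed in any well-defined evaluation framework. While they mention "best overall performance", no indication of either specific quantitative or qualitative criteria used to assess the tool has been provided. What were the criteria that were used? Were there guidelines for how to "score" systems on various criteria? What is presented is largely ad hoc qualitative observations. On what basis was performance of the tool ranked?

Reply: We thank you so much for the nice suggestion and have tried to answer the question in the revision.

It is poor practice to cite papers that are purely studied as artifacts -- a citation implies that you are referencing scientific content which you are not. Please remove citations to the papers that were used to test the system; they should be listed (preferably with DOIs) in Table 2, with a paper ID, and the paper IDs should be referenced in the article.

Reply: We thank you so much for the nice suggestion and have added DOIs.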

There are a number of phrases in the manuscript that are awkward. "marginalization" does not mean "to find margins". Queries in PubMed are not "NLP quer[ies]" but rather user queries processed by an IR system (NB: arguably, IR systems don't even use NLP). The authors talk about "orthodox NLP approaches"; I have no idea what makes an NLP approach "orthodox". I suspect the authors mean "pipeline" rather than "product line". What is an "inherently multiplexed approach"?

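Reply: We thank you so much for the nice suggestion and have tried to revise the manuscript accordingly.

The long list of databases in the Conclusions doesn't add much to the manuscript; perhaps the example of the DrumPID should be pulled into a discussion section, and elaborated to clearly demonstrate the practical application and value of the tool.

Reply: We thank you so much for the nice suggestion and have added more details.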

We thank you so much for your time and excellent suggestions, which have helped us in improving the manuscript.

With best wishes,

Authors

F1000Res. 2017 Sep 6. doi: 10.5256/f1000research.12318.r24952

Referee response for version 2

Florencio Pazos

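The authors present a system for extracting the main text, the images, and the text within those (labels, etc), from scientific papers in PDF format. The system is implemented in an interactive desktop application for Windows and tested in 10 papers. Extracting the large amounts of scientific and medical information stored in an unstructured way is an important challenge. Any effort in that direction, such as that presented in this work, is of potential interest.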

My main concern with this work is that most of the features described are already available in existing software, even those described as “new”. For example, tools like “pdftotext” can parse multi-column PDFs, “pdfimages” extracts the images within a PDF file, and “OCRFeeder” detects the image elements and extracts the texts, tables, etc. On the contrary, some potentially newer features of MSL, such as recognizing the article parts (Abstract, References, … -page 3-) mentioned in Methods, are not described later (Results). Problems of other tools mentioned in the Introduction, such as “removing irrelevant graphics”, are not solved by MSL either.

I think the comparison with other tools should be presented in terms of what these fail to detect while MSL does not, and vice versa, e.g. for a particular article “pdftotext” was not able to recognize the columns while MSL was.

In summary, the system should be better put into the context of existing ones.

The system is implemented as an interactive tool intended for analyzing a single article at a time, even presenting some of the final results (e.g. text extracted from images) back as PDF. For a single article (or a small number) human inspection will be better than any automated system. I guess the real potential of this system is in the automated parsing of large collections of articles. In my opinion the authors should focus the manuscript more on that.

F1000Res. 2018 Mar 29.

Zeeshan Ahmed

Reply: Thank you so much for your recommendations.

The authors present a system for extracting the main text, the images, and the text within those (labels, etc), from scientific papers in PDF format. The system is implemented in an interactive desktop application for Windows and tested in 10 papers. Extracting the large amounts of scientific and medical information stored in an unstructured way is an important challenge. Any effort in that direction, such as that presented in this work, is of potential interest.

Reply: Thank you so much for your views and we agree with you.

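My main concern with this work is that most of the features described are already available in existing software, even those described as “new”. For example, tools like “pdftotext” can parse multi-column PDFs, “pdfimages” extracts the images within a PDF file, and “OCRFeeder” detects the image elements and extracts the texts, tables, etc.

Reply: Thank you so much. We respect your point that many competing applications exist, but that does not negatively impact our contributions.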

On the contrary, some potentially newer features of MSL, such as recognizing the article parts (Abstract, References, … -page 3-) mentioned in Methods are not described later (Results).

Reply: We thank you for the nice suggestion and have tried to revise the manuscript accordingly.

Problems of other tools mentioned in the Introduction, such as “removing irrelevant graphics” are not solved by MSL either.

Reply: Thank you so much. We respect your point that MSL is not a complete solution and that further research and development is needed, but it still helps to a useful degree.

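I think the comparison with other tools should be presented in terms of what these fail to detect while MSL does not, and vice versa, e.g. for a particular article “pdftotext” was not able to recognize the columns while MSL was.

Reply: We thank you for the nice suggestion and have added a comparison.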

In summary, the system should be better put into the context of existing ones.

Reply: We thank you for the nice suggestion and have tried to revise the manuscript accordingly.

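The system is implemented as an interactive tool intended for analyzing a single article at a time, even presenting some of the final results (e.g. text extracted from images) back as PDF. For a single article (or a small number) human inspection will be better than any automated system. I guess the real potential of this system is in the automated parsing of large collections of articles. In my opinion the authors should focus the manuscript more on that.

Reply: We agree with you and thank you so much for the nice suggestion. We mainly tried to focus on that part, but we have included this as one of the examples to show the strength of the system.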

We thank you so much for your time and excellent suggestions, which have helped us in improving the manuscript.

With best wishes,

Authors

F1000Res. 2016 Aug 9. doi: 10.5256/f1000research.7898.r14625

Referee response for version 1

M Julius Hossain

In this manuscript, the authors present a computational tool that extracts text and images from PDF files. In general, the manuscript is interesting considering that the tool can analyze various types of PDF files from different scientific areas based on keywords and coordinates. However, it lacks technical novelty over the published literature and needs additional input on the image analysis section before indexing.

Extraction of text and images from scientific publications has been presented in various domains: computer science ¹, biomedical ²⁻⁴, chemistry ⁵, proteomics ⁶ and so on. The manuscript by Zeeshan Ahmed and Thomas Dandekar presents an incremental innovation without providing a clear technological advancement in the field. The objective of performing both physical and logical structure analysis of all kinds of PDF files, as mentioned in the manuscript, has not been sufficiently supported by the technological contribution described in the Methods section.

The image processing section of the manuscript is very brief. It does not provide any advanced image analysis technique as mentioned in the abstract. The authors should mention how exactly the segmentation of figures and labels is performed and how they are represented to make logical connections between different entities in order to perform further analysis and customized visualization.

The framework has been tested with a very small set of PDF files, and no qualitative/quantitative result reporting the accuracy with respect to manually annotated files was presented. It would be good to increase the number of test files and include the results of qualitative/quantitative analysis.

The details of some of the figures (Figures 1, 3, 4 and 6) in the manuscript are hard to see in both online and print formats. These figures could be reformatted.

References

1. Clark C, Divvala S: Looking beyond text: Extracting figures, tables and captions from computer science papers. AAAI 2015 Workshop on Scholarly Big Data. 2015.
2. Lopez LD, Yu J, Arighi CN, Huang H, Shatkay H, Wu C: An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents. Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference on. 2011;578–581. 10.1109/BIBM.2011.26
3. Lopez LD, Yu J, Tudor CO, Arighi CN, Hongzhan H, Vijay-Shanker K, Wu K: Robust segmentation of biomedical figures for image-based document retrieval. Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on. 2012;1–6. 10.1109/BIBM.2012.6392706
4. Lopez LD, Yu J, Arighi C, Tudor CO, Torii M, Huang H, Vijay-Shanker K, Wu C: A framework for biomedical figure segmentation towards image-based document retrieval. BMC Syst Biol. 2013;7(Suppl 4):S8. 10.1186/1752-0509-7-S4-S8
5. Choudhury SR, Mitra P, Kirk A, Szep S, Pellegrino D, Jones S, Giles CL: Figure Metadata Extraction from Digital Documents. 2013 12th International Conference on Document Analysis and Recognition. 2013;135–139. 10.1109/ICDAR.2013.34
6. Kou Z, Cohen WW, Murphy RF: Extracting information from text and images for location proteomics. Proceedings of the 3rd ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD 2003). 2003.

F1000Res. 2016 Jan 13. doi: 10.5256/f1000research.7898.r11637

Referee response for version 1

Juilee Thakar

The manuscript titled “MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format” addresses an important issue of extracting information from published manuscripts. However, the following issues must be clarified before indexing.

In the text mining section, the authors say that there is no tool to perform physical and logical structural analysis of PDF files. However, in the next paragraph they describe “Dolores” for logical structure analysis. The authors should describe how their method differs from Dolores.

Legends of all the figures should be more descriptive so that figures are understandable on their own. Each component of the figure should be described in the legend.

The results section is missing. Is it integrated in the discussion section? It is unclear what exactly the results were.

The article will be much clearer if all the libraries (described on page 4, second paragraph) are presented in the form of a table.

Authors should include a clear metric to estimate performance of the algorithm. This can be achieved by comparison with existing tools or through comparative analysis. A clear example showing the information extracted from several PDF files to address a biologically relevant example will be useful.

It is not clear whether the text extracted from the PDF files is actually coming from figure legends or related to the main body of the manuscript. Also, how is this text organized?

The authors mention that unexpected and irrelevant images were extracted. It is not clear how authors address that. It is absolutely essential to address that.

Minor corrections:

Page 2, second column: The definition of MSL is not the same as that described in the abstract.

F1000Res. 2018 Apr 11.

Juilee Thakar

Thanks for responding to my suggestions.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Extracted images and text from papers tested using MSL

Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).



[ref-1] 1. Hunter L, Cohen KB: Biomedical language processing: what’s beyond PubMed? Mol Cell. 2006;21(5):589–594. 10.1016/j.molcel.2006.02.012 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-2] 2. Hadjar K, Rigamonti M, Lalanne D, et al. : Xed: A New Tool for Extracting Hidden Structures from Electronic Documents. In International Workshop on Document Image Analysis for Libraries. 2004;221–224. 10.1109/DIAL.2004.1263250 [DOI] [Google Scholar]

[ref-3] 3. Sayers EW, Barrett T, Benson DA, et al. : Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010;38(Database issue):D5–16. 10.1093/nar/gkp967 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-4] 4. States DJ, Ade AS, Wright ZC, et al. : MiSearch adaptive pubMed search tool. Bioinformatics. 2009;25(7):974–76. 10.1093/bioinformatics/btn033 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-5] 5. Poulter GL, Rubin DL, Altman RB, et al. : MScanner: a classifier for retrieving Medline citations. BMC Bioinformatics. 2008;9(1):108. 10.1186/1471-2105-9-108 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-6] 6. Plikus MV, Zhang Z, Chuong CM: PubFocus: semantic MEDLINE/PubMed citations analytics through integration of controlled biomedical dictionaries and ranking algorithm. BMC Bioinformatics. 2006;7:424. 10.1186/1471-2105-7-424 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-7] 7. Smalheiser NR, Zhou W, Torvik VI, et al. : Anne O’Tate: A tool to support user-driven summarization, drill-down and browsing of PubMed search results. J Biomed Discov Collab. 2008;3:2. 10.1186/1747-5333-3-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-8] 8. Doms A, Schroeder M: GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res. 2005;33(Web Server issue):W783–86. 10.1093/nar/gki470 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-9] 9. Kim JJ, Pezik P, Rebholz-Schuhmann D: MedEvi: retrieving textual evidence of relations between biomedical concepts from Medline. Bioinformatics. 2008;24(11):1410–12. 10.1093/bioinformatics/btn117 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-10] 10. Rebholz-Schuhmann D, Kirsch H, Arregui M, et al. : EBIMed--text crunching to gather facts for proteins from Medline. Bioinformatics. 2007;23(2):e237–44. 10.1093/bioinformatics/btl302 [DOI] [PubMed] [Google Scholar]

[ref-11] 11. Douglas SM, Montelione GT, Gerstein M, et al. : PubNet: a flexible system for visualizing literature derived networks. Genome Biol. 2005;6(9):R80. 10.1186/gb-2005-6-9-r80 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-12] 12. Eaton AD: HubMed: a web-based biomedical literature search interface. Nucleic Acids Res. 2006;34(Web Server issue):W745–47. 10.1093/nar/gkl037 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-13] 13. Hearst MA, Divoli A, Guturu H, et al. : BioText Search Engine: beyond abstract search. Bioinformatics. 2007;23(16):2196–97. 10.1093/bioinformatics/btm301 [DOI] [PubMed] [Google Scholar]

[ref-14] 14. Ahmed Z, Zeeshan S, Dandekar T: Developing sustainable software solutions for bioinformatics by the “Butterfly” paradigm [version 1; referees: 2 approved with reservations]. F1000Res. 2014;3:71. 10.12688/f1000research.3681.2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-15] 15. Tao X, Tang Z, Xu C: Contextual Modeling for Logical Labeling of PDF Documents. Comput Electr Eng. 2014;40(4):1363–75. 10.1016/j.compeleceng.2014.01.005 [DOI] [Google Scholar]

[ref-16] 16. Hassan T: Object-Level Document Analysis of PDF Files. In Proceedings of the 9th ACM symposium on Document engineering. 2009;47–55. 10.1145/1600193.1600206 [DOI] [Google Scholar]

[ref-17] 17. Bloechle JL, Rigamonti M, Ingold R: OCD Dolores - Recovering Logical Structures for Dummies. In 10th IAPR International Workshop on Document Analysis Systems (DAS). 2012;245–249. 10.1109/DAS.2012.58 [DOI] [Google Scholar]

[ref-18] 18. Déjean H, Meunier JL: A System for Converting PDF Documents into Structured XML Format. In Proceedings of the 7th international conference on Document Analysis Systems. 2006;129–140. 10.1007/11669487_12 [DOI] [Google Scholar]

[ref-19] 19. Rahman F, Alam H: Conversion of PDF Documents into HTML: A Case Study of Document Image Analysis. In Proceedings of Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers. 2003;1:87–91. doi:10.1109/ACSSC.2003.1291873

[ref-20] 20. Zweigenbaum P, Demner-Fushman D, Yu H, et al.: Frontiers of biomedical text mining: current progress. Brief Bioinform. 2007;8(5):358–375. doi:10.1093/bib/bbm045

[ref-21] 21. Carpenter AE, Jones TR, Lamprecht MR, et al.: CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 2006;7(10):R100. doi:10.1186/gb-2006-7-10-r100

[ref-22] 22. Chou KC, Shen HB: Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat Protoc. 2008;3(2):153–162. doi:10.1038/nprot.2007.494

[ref-23] 23. Kuhn T, Nagy ML, Luong T, et al.: Mining images in biomedical publications: Detection and analysis of gel diagrams. J Biomed Semantics. 2014;5(1):10. doi:10.1186/2041-1480-5-10

[ref-24] 24. Kozhenkov S, Baitaluk M: Mining and integration of pathway diagrams from imaging data. Bioinformatics. 2012;28(5):739–742. doi:10.1093/bioinformatics/bts018

[ref-25] 25. Xu YY, Yang F, Zhang Y, et al.: Bioimaging-based detection of mislocalized proteins in human cancers by semi-supervised learning. Bioinformatics. 2015;31(7):1111–9. doi:10.1093/bioinformatics/btu772

[ref-26] 26. Ahmed Z, Zeeshan S, Huber C, et al.: Software LS-MIDA for efficient mass isotopomer distribution analysis in metabolic modelling. BMC Bioinformatics. 2013;14:218. doi:10.1186/1471-2105-14-218

[ref-27] 27. Eren AM, Esen ÖC, Quince C, et al.: Anvi'o: an advanced analysis and visualization platform for 'omics data. PeerJ. 2015;3:e1319. doi:10.7717/peerj.1319

[ref-28] 28. Moreau T, Gibaud B: Ontology-based approach for in vivo human connectomics: the medial Brodmann area 6 case study. Front Neuroinform. 2015;9:9. doi:10.3389/fninf.2015.00009

[ref-29] 29. Ahmed Z: Intelligent semantic oriented agent based search (I-SOAS). In Proceedings of the 7th International Conference on Frontiers of Information Technology. 2009. doi:10.1145/1838002.1838065

[ref-30] 30. Ahmed Z, Helfrich-Förster C: DroLIGHT-2: Real Time Embedded and Data Management System for Synchronizing Circadian Clock to the Light-Dark Cycles. Recent Patents on Computer Sci. 2013;6(3):191–205. doi:10.2174/2213275906666131108211241

[ref-31] 31. Pryszcz LP, Németh T, Saus E, et al.: The Genomic Aftermath of Hybridization in the Opportunistic Pathogen Candida metapsilosis. PLoS Genet. 2015;11(10):e1005626. doi:10.1371/journal.pgen.1005626

[ref-32] 32. Hernández JC, Rodríguez JM, Sigarreta JM: Mathematical Properties of the Hyperbolicity of Circulant Networks. Adv Math Phys. 2015;2015:723451. doi:10.1155/2015/723451

[ref-33] 33. Ahmed Z, Zeeshan S, Fleischmann P, et al.: Ant-App-DB: a smart solution for monitoring arthropods activities, experimental data management and solar calculations without GPS in behavioral field studies [version 2; referees: 1 approved, 2 approved with reservations]. F1000Res. 2015;3:311. doi:10.12688/f1000research.5931.3

[ref-34] 34. Zeeshan A, Detlef G: Design implementation of I-SOAS IPM for advanced product data management. In IEEE 2nd International Conference on Computer, Control and Communication. 2009;1–5. doi:10.1109/IC4.2009.4909215

[ref-35] 35. Ahmed Z, Mayr M, Zeeshan S, et al.: Lipid-Pro: a computational lipid identification solution for untargeted lipidomics on data-independent acquisition tandem mass spectrometry platforms. Bioinformatics. 2015;31(7):1150–1153. doi:10.1093/bioinformatics/btu796

[ref-36] 36. Ahmed Z, Zeeshan S: Cultivating Software Solutions Development in the Scientific Academia. Recent Patents on Computer Sci. 2014;7(1):54–66. doi:10.2174/2213275907666140612210552

[ref-37] 37. Schindelin J, Arganda-Carreras I, Frise E, et al.: Fiji: an open-source platform for biological-image analysis. Nat Methods. 2012;9(7):676–82. doi:10.1038/nmeth.2019

[ref-38] 38. Schmid B, Schindelin J, Cardona A, et al.: A high-level 3D visualization API for Java and ImageJ. BMC Bioinformatics. 2010;11:274. doi:10.1186/1471-2105-11-274

[ref-39] 39. Schneider CA, Rasband WS, Eliceiri KW: NIH Image to ImageJ: 25 years of image analysis. Nat Methods. 2012;9(7):671–75. doi:10.1038/nmeth.2089

[ref-40] 40. Peng H, Ruan Z, Long F, et al.: V3D enables real-time 3D visualization and quantitative analysis of large-scale biological image data sets. Nat Biotechnol. 2010;28(4):348–53. doi:10.1038/nbt.1612

[ref-41] 41. Lopez LD, Yu J, Arighi C, et al.: A framework for biomedical figure segmentation towards image-based document retrieval. BMC Syst Biol. 2013;7(Suppl 4):S8. doi:10.1186/1752-0509-7-S4-S8

[ref-42] 42. Sheng J, Xu S, Deng W, et al.: Novel Image Features for Categorizing Biomedical Images. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2012. doi:10.1109/BIBM.2012.6392689

[ref-43] 43. Kunz M, Liang C, Nilla S, et al.: The drug-minded protein interaction database (DrumPID) for efficient target analysis and drug development. Database (Oxford). 2016;2016:baw041. doi:10.1093/database/baw041

[ref-44] 44. Ahmed Z, Dandekar T: Dataset 1 in: MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format. F1000Research. 2015. Data Source



