MolMiner: You Only Look Once for Chemical Structure Recognition

Youjun Xu; Jinchuan Xiao; Chia-Han Chou; Jianhang Zhang; Jintao Zhu; Qiwan Hu; Hemin Li; Ningsheng Han; Bingyu Liu; Shuaipeng Zhang; Jinyu Han; Zhen Zhang; Shuhao Zhang; Weilin Zhang; Luhua Lai; Jianfeng Pei

doi:10.1021/acs.jcim.2c00733

J. Chem. Inf. Model. 2022 Sep 15;62(22):5321–5328. doi: 10.1021/acs.jcim.2c00733

PMCID: PMC9710516 PMID: 36108142

Abstract

Molecular structures are commonly depicted in 2D printed form in scientific documents such as journal papers and patents. However, these 2D depictions are not machine readable. Because of a backlog of decades and a growing volume of printed literature, there is high demand for translating printed depictions into machine-readable formats, a task known as Optical Chemical Structure Recognition (OCSR). Most OCSR systems developed over the last three decades use a rule-based approach, which vectorizes the depiction and interprets the resulting vectors and nodes as bonds and atoms. Here, we present a practical software package called MolMiner, which is primarily built on deep neural networks originally developed for semantic segmentation and object detection and uses them to recognize atom and bond elements in documents. These recognized elements can be easily connected into a molecular graph with a distance-based construction algorithm. MolMiner gave state-of-the-art performance on four benchmark data sets and on a self-collected external data set from scientific papers. As MolMiner performed similarly well in real-world OCSR tasks and offers a user-friendly interface, it is a useful and valuable tool for daily applications. Free download links for the Mac and Windows versions are available at https://github.com/iipharma/pharmamind-molminer.

Introduction

Chemical products represent a vast store of value and continue to improve our lives and health. Much effort in chemical research and development has been made and published as primary scientific literature. During the past decade, researchers have developed various machine learning and deep learning models to solve a series of predictive and generative tasks in the fields of chemistry and biology.¹⁻³ Well-performing computational models clearly depend on accumulated data, especially experimental data such as chemical reaction data and biological activity data.

Several well-known databases have played vital roles in scientific research. For example, the Protein Data Bank makes protein crystal structure data publicly available, which has greatly facilitated research on protein structure–function relationships and structure prediction.⁴ Several large, comprehensive biomedical databases such as ChEMBL have been constructed and are regularly updated. These data sets offer the necessary basis and opportunity to develop various practical advanced technologies.⁵ Recently, there has been renewed interest in structurally collating experimental data sets and building their interrelationships and intrarelationships to enhance various downstream predictions and recommendations. The Open Reaction Database aims to collect and share chemical reaction data from journal articles, patents, and even electronic laboratory notebooks.⁶ Because the number of literature resources is increasing rapidly, integrating diverse kinds of experimental data into a comprehensive and professional knowledge database has become both laborious and time consuming.

Automatic computational methods provide a potential option for handling various forms of valuable chemical and biological information. Named Entity Recognition tools have been applied to extract chemical textual information from the literature to create structured data.⁷ In addition to text-like objects, researchers have also developed Optical Chemical Structure Recognition (OCSR) tools with the intention of decoding a graphical chemical depiction into a machine-readable molecular format. However, it remains challenging to accurately recognize chemical structures from 2D printed images.

Since 1990, several commercial and open-source OCSR systems have been established based on similar rule-based implementations involving image vectorization, image thinning, line enhancement, text-based Optical Character Recognition (OCR), and graph reconstruction. A representative system, Chemical Literature Data Extraction (CLiDE), is a commercial OCSR toolkit developed by Keymodule,⁸ which has been integrated into the ChemAxon software.⁹ Generally, most of the commercial OCSR systems were unavailable to academic researchers. In 2009, Filippov and Nicklaus published the first open-source system called Optical Structure Recognition Application (OSRA),¹⁰ which is kept active and upgraded for improved recognition. Imago and MolVec have also been developed as open-source systems to offer researchers optional tools for molecular structure recognition.¹¹,¹² A recent review summarized the currently available OCSR systems.¹³

Deep learning (DL) frameworks and hardware have advanced dramatically in recent years, and image recognition technologies have advanced with them. Bristol-Myers Squibb held a competition for molecular translation on Kaggle.¹⁴ The CNN-Transformer architecture plays an essential role in translating chemical images to InChI strings.¹⁵ Based on this, DECIMER has been reported for translating various chemical images to SELFIES strings with acceptable precision.¹⁶ Similarly, Bayer researchers have developed another translation method called Img2Mol, which shows a potential ability to recognize hand-drawn molecules.¹⁷ Inspired by image segmentation technologies, ChemGrapher uses atom-based, bond-based, and charge-based segmentation neural networks to predict per-pixel probabilities for a chemical image and then constructs a chemical graph.¹⁸ Following this work, ABC-Net applied a divide-and-conquer segmentation strategy to significantly improve recognition performance.¹⁹ Although these DL-based works show promising application value, they lack rigorous evaluation on benchmark data sets and especially on real-world tasks.

Taking the substantial strengths of DL into consideration,²⁰ we developed a practical DL-based system, MolMiner, for real-world OCSR tasks. MolMiner can rapidly and accurately extract chemical images and recognize chemical structures from PDF-format documents. MolMiner performed better than the three existing open-source OCSR systems on the benchmark data sets. To demonstrate its practicability, we further collected 3040 images from scientific papers. MolMiner performed similarly well in this real-world OCSR task, while the other methods did not work well. We also integrated functionalities such as real-time correction and screenshot recognition into MolMiner to provide a user-friendly interface. MolMiner is freely available, with a daily usage allowance, to all registered users on Windows and Mac; the download links are given in the Data and Software Availability section.

MolMiner Recognition System

Module Implementation

MolMiner is a rule-free learning system. It transforms the vectorization problem into object detection tasks: it extracts chemical elements in an object detection manner after training on well-labeled data sets with atom and bond annotations. It is implemented as five main modules:

The data generation and annotation module automatically generates various styles of well-annotated chemical images using the RDKit v2021.09.1 toolkit.²¹ It supports several augmentation operations, such as rotation, line thinning, line thickening, noise, and supergroup abbreviation. More augmentation operations (e.g., hand-drawn lines) will be added to this module in the future so that it suits a wider range of real application scenarios.
The chemical image detection (MolMiner-ImgDet) module is a fully DL-driven image segmentation module implemented with the lightweight MobileNetV2 model.²² It is used to extract printed chemical representations from PDF-format documents. Labeled data are generated by the first module with several predefined templates such as journal style and patent style. The model is trained on two-class annotations (background and compound) with a cross-entropy loss. Its recall, the key performance metric, is 95.5%.
The chemical image recognition (MolMiner-ImgRec) module is the key module for rapid and accurate chemical structure recognition. It is implemented based on the popular one-stage YOLOv5 architecture.²³ After a series of evaluations on MaskRCNN,²⁴ FastRCNN,²⁵ and EfficientDet,²⁶ we empirically found that the YOLOv5 model significantly outperformed the other object detection architectures on this recognition task. The atom labels include “Si”, “N”, “Br”, “S”, “I”, “Cl”, “H”, “P”, “O”, “C”, “B”, “F”, and “Text”. The bond labels include “Single”, “Double”, “Triple”, “Wedge”, “Dash”, and “Wavy”. The overall mAP@.5 is 97.5%. Labeled data are automatically generated by the first module with several image-level augmentation operations.
The text-based OCR (optical character recognition) (MolMiner-TextOCR) module is used to recognize chemistry text images containing atom characters and supergroups. This OCR model is implemented by fine-tuning a pretrained EasyOCR model²⁷ on chemical text crops produced by the first module. Its accuracy is about 96.4%.
The chemical element construction and evaluation module contains a distance-based graph construction algorithm accompanied by a supergroup parser and an automatic evaluation module for performing fair comparisons on benchmark data sets. The supergroup dictionary is collected from the RDKit toolkit, ChemAxon, OSRA, and scientific journals. The evaluation results are discussed in the Benchmark Evaluation section.

Details of the five module implementations are described in the Supporting Information. In comparison to rule-based systems, MolMiner has several advantages: (i) Batch GPU-based inference makes element recognition faster than rule-based algorithms. (ii) It implicitly learns the rules from a large amount of automatically generated data or other manually annotated data, without explicitly designed rules, so it can be extended to new scenarios simply by adding data. (iii) It supports rapid synchronous recognition of multiple large-sized images. (iv) It provides additional user-friendly interfaces for ease of use.
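To make the idea of distance-based graph construction concrete, the following is a minimal Python sketch, not the MolMiner implementation (whose details are in the Supporting Information). It assumes, hypothetically, that each detected atom is reduced to a box center and that each detected bond is an axis-aligned bounding box whose line runs along one of the two box diagonals; the names `Atom`, `Bond`, `build_graph`, and the `max_dist` threshold are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass
class Atom:
    symbol: str  # detected atom label, e.g. "C", "N", "O"
    x: float     # centre of the detected box, in pixels
    y: float


@dataclass
class Bond:
    order: str   # detected bond label, e.g. "Single", "Double"
    x1: float    # corners of the detected bounding box, in pixels
    y1: float
    x2: float
    y2: float


def nearest_atom(point, atoms):
    """Return (index, distance) of the atom closest to the given point."""
    distances = [math.dist(point, (a.x, a.y)) for a in atoms]
    index = min(range(len(atoms)), key=distances.__getitem__)
    return index, distances[index]


def build_graph(atoms, bonds, max_dist=20.0):
    """Attach every bond box to the two atoms nearest to its end points.

    Both diagonals of the box are tried (assumption: the bond line follows
    one of them) and the diagonal with the smaller total distance wins.
    """
    edges = []
    for b in bonds:
        diagonals = [((b.x1, b.y1), (b.x2, b.y2)),
                     ((b.x1, b.y2), (b.x2, b.y1))]
        best = None
        for p, q in diagonals:
            i, di = nearest_atom(p, atoms)
            j, dj = nearest_atom(q, atoms)
            # reject self-loops and end points that are far from any atom
            if i != j and max(di, dj) <= max_dist:
                if best is None or di + dj < best[0]:
                    best = (di + dj, i, j)
        if best is not None:
            edges.append((min(best[1], best[2]), max(best[1], best[2]), b.order))
    return edges


# Toy detections for a bent C-C-O chain (hypothetical coordinates).
atoms = [Atom("C", 0, 0), Atom("C", 30, 20), Atom("O", 60, 0)]
bonds = [Bond("Single", 3, 2, 27, 18), Bond("Single", 33, 2, 57, 18)]
edges = build_graph(atoms, bonds)
print(edges)
```

In this toy case each bond is assigned to the atom pair lying on the correct diagonal of its box. A production system would additionally need to handle supergroup text boxes, stereo bonds, and crossing bonds, which this sketch ignores.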

MolMiner User Interface

The user interface integrates several frequently used functions, including screenshot recognition, batch PDF recognition, real-time molecular editing, molecule collection, and SDF/XLSX file download. Figure 1 presents an overview of the real-time molecular editing interface. Using this interface, users can check and correct recognized molecules one by one to reach 100% accuracy. In a simple test on 1000 images of non-Markush organic medicinal molecules, a person with basic training needed about 5–10 s to process one molecule with a series of operations (such as check, correct, and save). Other functional modules can be easily found in the MolMiner software. For details, please refer to the MolMiner User Manual (https://github.com/iipharma/pharmamind-molminer/tree/main/docs).

Figure 1. Editing interface with five rectangular parts (in blue): (A) User’s account information. (B) MolMiner module. (C) PDF viewer (supported by PDF.js v2.14²⁸). (D) Cropped molecular image and (D1) three buttons: “Add” adds another cropped image of interest, “Adjust Cropping” adjusts the boundaries of the cropped images, and “Delete” removes the selected cropped image. (E) Real-time molecular editing (supported by Ketcher v2.4²⁹). (E1) The four basic functions of Copy, Save, Favorites, and Next.

MolMiner Evaluation

Metrics

We used accuracies based on the IUPAC International Chemical Identifier (InChI) and on the maximum common substructure (MCS) to evaluate recognition performance. An InChI string is a unified string with a multilayer representation. InChI-based accuracy measures whether the two InChI strings derived from the predicted and reference molecular structures are identical. MCS-based accuracy measures identity at both the atom level and the bond level using graph matching algorithms. Compared to the InChI-based accuracy, the MCS-based accuracy offers a point-to-point comparison of the recognized molecular structures and the ground truth structures. The derived InChI string often also encodes chemical information such as cis/trans/either double bonds and chirality, which makes the strings from the predicted molecular graphs slightly different from the target strings. Two examples from the Supporting Information illustrate these differences (Figure S1 and Figure S2). In our opinion, graph-level identity is sufficient for evaluating the recognition accuracy of chemical image data.
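The two metrics can be illustrated with a small, dependency-free Python sketch. It is not the evaluation code used in this work: string-level InChI accuracy is shown as an exact match, and the MCS-based check is replaced by its limiting case, a strict test that two atom- and bond-labeled graphs are identical up to atom renumbering (that is, the common substructure covers every atom and bond). The helper names `accuracy` and `same_graph`, the graph encoding, and the toy molecules are assumptions made for illustration; a failed recognition is represented by `None` and counted as a miss.

```python
def accuracy(predictions, references, is_match):
    """Fraction of references whose prediction exists and matches."""
    hits = sum(
        1 for pred, ref in zip(predictions, references)
        if pred is not None and is_match(pred, ref)
    )
    return hits / len(references)


def same_graph(mol_a, mol_b):
    """True if two (atoms, bonds) graphs are identical up to renumbering.

    atoms: list of element symbols; bonds: {frozenset({i, j}): bond label}.
    A plain backtracking search, adequate only for small toy graphs.
    """
    atoms_a, bonds_a = mol_a
    atoms_b, bonds_b = mol_b
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    n = len(atoms_a)
    mapping = {}  # atom index in A -> atom index in B
    used = set()

    def consistent(i, j):
        # bond labels (or absence of a bond) must agree with all mapped atoms
        return all(
            bonds_a.get(frozenset((i, k))) == bonds_b.get(frozenset((j, m)))
            for k, m in mapping.items()
        )

    def extend(i):
        if i == n:
            return True
        for j in range(n):
            if j not in used and atoms_a[i] == atoms_b[j] and consistent(i, j):
                mapping[i] = j
                used.add(j)
                if extend(i + 1):
                    return True
                del mapping[i]
                used.discard(j)
        return False

    return extend(0)


# String-level metric: methane and water are matched, ethane was not recognized.
refs = ["InChI=1S/CH4/h1H4", "InChI=1S/C2H6/c1-2/h1-2H3", "InChI=1S/H2O/h1H2"]
preds = ["InChI=1S/CH4/h1H4", None, "InChI=1S/H2O/h1H2"]
inchi_acc = accuracy(preds, refs, lambda p, r: p == r)

# Graph-level metric on heavy-atom graphs (hydrogens left implicit).
single = "Single"
ethanol = (["C", "C", "O"], {frozenset((0, 1)): single, frozenset((1, 2)): single})
ethanol_renumbered = (["O", "C", "C"], {frozenset((0, 1)): single, frozenset((1, 2)): single})
dimethyl_ether = (["C", "O", "C"], {frozenset((0, 1)): single, frozenset((1, 2)): single})
graph_acc = accuracy([ethanol_renumbered, dimethyl_ether], [ethanol, ethanol], same_graph)
print(inchi_acc, graph_acc)
```

The graph check accepts the renumbered ethanol but rejects dimethyl ether, even though both share the same atom and bond counts, which is the kind of point-to-point comparison a string match cannot localize.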

Benchmark Evaluation

We tested four benchmark data sets (USPTO, UOB, CLEF2012, and JPO) referred to by Rajan et al.,¹³ shown in Table 1. The four benchmark data sets can be downloaded at https://github.com/Kohulan/OCSR_Review/tree/master/assets/images. These raw images were directly recognized by the MolMiner-ImgRec API service, for which we designed a general strategy to preprocess the images. Large images with max(width, height) > 2560 and small images with max(width, height) < 640 are resized to max(width, height) = 2560 and max(width, height) = 640, respectively. Images with max(width, height) in the ranges (640, 1280], (1280, 1920], and (1920, 2560] are padded to the corresponding upper bound with an RGB value of (255, 255, 255). For example, a 1000 × 2000 image is padded to 2560 × 2560. For images with thin lines, we implemented a pixel-based algorithm to roughly estimate the median line width of each image and then applied the dilation function from OpenCV v4.5.5³⁰ (cv2.dilate) with a 3 × 3 or 2 × 2 kernel, depending on the estimated line width. These general preprocessing steps aim to bring the input image styles into the application domain of MolMiner-ImgRec.
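The size rules in this paragraph can be summarized in a short Python sketch. It only plans the target canvas and does not touch pixel data; the function name `plan_canvas` is hypothetical, and two details are assumptions not stated in the text: resizing preserves the aspect ratio, and an image whose longer side is exactly 640 is padded to a 640 × 640 canvas.

```python
def plan_canvas(width, height):
    """Return (action, (target_width, target_height)) for an input image.

    "resize": scale so the longer side becomes 640 or 2560
              (assumption: aspect ratio is preserved).
    "pad":    place the image on a white (255, 255, 255) square canvas
              whose side is the next bound among 640, 1280, 1920, 2560.
    """
    longest = max(width, height)
    if longest > 2560 or longest < 640:
        side = 2560 if longest > 2560 else 640
        return "resize", (round(width * side / longest),
                          round(height * side / longest))
    for side in (640, 1280, 1920, 2560):
        if longest <= side:
            return "pad", (side, side)


print(plan_canvas(1000, 2000))  # the 1000 x 2000 example from the text
print(plan_canvas(3000, 2068))  # an oversized image is scaled down
print(plan_canvas(320, 200))    # a small image is scaled up
```

The first call reproduces the example given in the text, where a 1000 × 2000 image ends up on a 2560 × 2560 canvas.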

Table 1. Summary of Benchmark Evaluation on Runtime, InChI-Based, and MCS-Based Accuracies (acc.) with Data Set Sizes (s) and Molecular Weight Averages (μ) and Standard Deviations (σ).

Data set	s	μ	σ	Metric	MolVec	Imago	OSRA	MolMiner
USPTO	5719	440.3	160.8	InChI acc. ↑	88.6%	88.9%	87.9%	89.9%
USPTO	5719	440.3	160.8	MCS acc. ↑	88.9%	88.8%	16.9%	93.3%
USPTO	5719	440.3	160.8	Runtime ↓	29 min	73 min	148 min	7 min
UOB	5740	213.5	57.3	InChI acc. ↑	88.0%	66.2%	87.4%	90.0%
UOB	5740	213.5	57.3	MCS acc. ↑	49.5%	42.7%	27.0%	62.7%
UOB	5740	213.5	57.3	Runtime ↓	28 min	153 min	126 min	6 min
CLEF2012	992	400.9	144.0	InChI acc. ↑	78.1%	59.7%	84.6%	84.6%
CLEF2012	992	400.9	144.0	MCS acc. ↑	77.3%	59.7%	20.0%	86.5%
CLEF2012	992	400.9	144.0	Runtime ↓	4 min	16 min	21 min	1 min
JPO	450	360.3	185.0	InChI acc. ↑	66.9%	48.6%	67.1%	72.2%
JPO	450	360.3	185.0	MCS acc. ↑	31.8%	26.2%	19.8%	34.8%
JPO	450	360.3	185.0	Runtime ↓	8 min	23 min	17 min	<1 min

We compared the runtime and accuracy of MolMiner-ImgRec with those of existing open-source OCSR tools, namely MolVec v0.9.8,¹² OSRA v2.1.0,¹⁰ and Imago v2.0,¹¹ on these four data sets. We reviewed the InChI string consistency used in Rajan et al.¹³ Considering the atom-level evaluation method from Imago,³¹ we found that the atom-level and bond-level consistency indexes of MCS are appropriate metrics for this task, because together they comprehensively measure accuracy at the atom level and the bond level. In Table 1, we compared InChI-based accuracy and MCS-based accuracy and found some significant differences caused by the count of “None” values, bond misplacement in aromatic rings, chirality reassignment, and RDKit-based InChI export. Here, MCS-based accuracy is deemed a more rigorous indicator for monitoring the fine-grained accuracy of atoms and bonds. Subsequently, we analyzed the runtime (Ubuntu 20.04, Intel Xeon(R) Gold 6230R CPU@2.10 GHz) and the MCS-based accuracy (both atom level and bond level) between our recognized molecules and the ground truth molecules. The results are summarized in Table 1 and show that MolMiner performs better than the open-source tools in both runtime and MCS-based accuracy on the four data sets (USPTO, UOB, CLEF2012, and JPO). We also tested the five representative failure cases from the MolVec GitHub issue (https://github.com/ncats/molvec/issues/18), and MolMiner gave a much better performance, as illustrated in Figure S3. These comparisons suggest that MolMiner has a better capacity to deal with diverse-styled images.

CLiDE Pro is a popular, professional commercial program that uses the rule-based approach. Unfortunately, we do not have access to it and could not perform a fair comparison. According to the USPTO results reported by ChemAxon,³² the reported accuracy (93.8%) is slightly better than our MCS-based accuracy (93.3%). One possible reason is the issue of crossing bonds. Because MolMiner is based on deep learning object detection models, it is difficult for crossing bonds to be recognized well from simple atom and bond annotations alone. More elegant designs can be integrated into our models in the future. We are also trying to solve this issue with a small number of unavoidable rules or coarse-grained annotations. Currently, manual correction in the interactive Ketcher plugin is recommended.²⁹

Real-World Data Set Collection and Evaluation

To further validate the generalization capability of MolMiner, we collected a real-world external test set of 3040 images from 239 scientific papers published in various journals. The new data set (MW avg., 496.8; MW std., 280.1) can be downloaded at https://zenodo.org/record/6973361. We tested all four tools on this data set, and the results are summarized in Table 2. MolMiner gave recognition performance comparable to its performance on the four benchmark data sets, while the other three methods performed much worse, demonstrating the applicability of MolMiner in real-world OCSR tasks. Although there is still room for improvement before reaching 100% recognition of real-world chemical images, MolMiner provides a useful and robust tool for OCSR applications. In future work, a large-scale data set could be constructed to perform comprehensive, fair assessments and comparisons for real-world application scenarios and to serve as the foundation for ongoing progress.

Table 2. Performance on a Self-Collected Real-World Data Set (3040 images) from Different Scientific Journals.

Tool	InChI acc. (%)	MCS acc. (%)
MolVec	62.6	50.1
Imago	10.8	10.3
OSRA	64.5	8.9
MolMiner	88.9	87.8

Application Case Illustration

The current version of MolMiner focuses on processing organic medicinal molecules (non-Markush structures) from scientific documents. As the recognized molecules are not 100% correct (errors arise from crossing bonds, noise, and supergroup parsing), minor manual corrections are necessary. MolMiner provides an interactive Ketcher plugin for this purpose. In the following, we illustrate how MolMiner works using three application cases.

We took a screenshot of a large-sized image (size, 3000 × 2068; dpi, 300) of palytoxin to make a test. It took approximately 3 s to return one well-recognized molecular structure without any manual corrections in the Ketcher plugin (shown in Figure 2).
A scanned PDF page from a scientific journal³³ (machine, HP M1210 MFP; dpi, 300; brightness, 128; contrast, 124) was tested by MolMiner, as shown in Figure 3. The recognized results are satisfactory and achieve 100% accuracy under this set of appropriate scanning parameters. We also tried other sets of scanning parameters, which influenced the recognized results to varying degrees.
We tested a challenging case of hand-drawn images from Clevert et al.¹⁷ In Figure 4, although there are still some errors including “N”, “Cl”, wedge bonds, and aromatic rings, without training on any similar data, MolMiner could recognize simple hand-drawn images (Figure 4 (right)), and main skeletons and their positions of complex images (Figure 4 (left)). The MolMiner recognition results of complex images serve as a good starting point for users to check and make necessary corrections easily.

Figure 2. Large-sized (3000 × 2068) case of palytoxin.

Figure 3. Case of one scanned journal page from Wang.³³

Figure 4. Case for recognizing hand-drawn images from Clevert et al.¹⁷ Some errors are highlighted in red, including “N”, “Cl”, wedge bonds, and aromatic rings.

Limitations

The current version of MolMiner is weak at dealing with crowded layout segmentation, colorful backgrounds, crossing and irregular bonds, Markush structures or polymers, blurred images, and extremely long supergroups. MolMiner provides an interactive plugin of Ketcher for users to carry out necessary manual corrections.

Conclusion

We have developed a practical OCSR software package, MolMiner, based on advanced deep learning technologies including MobileNetV2, YOLOv5, and EasyOCR v1.4. We also developed an automatic data generation module to satisfy the data volume requirements of DL models. The benchmark and external evaluations suggest that MolMiner outperforms the open-source OCSR tools (MolVec, OSRA, and Imago) in both accuracy and runtime. Although there is still room for improvement before reaching 100% recognition of real-world chemical images, MolMiner provides a useful tool for OCSR tasks. Currently, MolMiner supports several frequently used functionalities such as batch PDF recognition, screenshot recognition, and real-time molecular editing. More application extensions will be integrated into MolMiner in the future. Mac and Windows versions are freely available here.³⁴ All registered users receive free access with a daily usage allowance, and we also provide additional business channels for unlimited access.

Data and Software Availability

The benchmark data sets (USPTO, UOB, CLEF2012, JPO) were obtained from https://github.com/Kohulan/OCSR_Review. The original molecular data were randomly selected from the ChEMBL29 data set.⁵ All the data sets used to construct the deep learning models (MolMiner-ImgDet, MolMiner-ImgRec, MolMiner-TextOCR) were automatically generated with open-source toolkits. The annotations of elements and supergroups were automatically generated using the MolDraw2DSVG and CondenseMolAbbreviations functions of RDKit v2021.09.1.²¹ The MolMiner-ImgDet data were automatically generated using ReportLab Open Source v3.5.0.³⁵ More details are described in the Supporting Information, Supplementary Datasets and Methods. The Mac and Windows download links for the latest MolMiner are freely available at https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/mac/PharmaMind-mac-latest-setup.dmg and https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/win/PharmaMind-win-latest-setup.exe, respectively.

Acknowledgments

This work was supported by funding from Infinite Intelligence Pharma, Ltd.

Glossary

Acronyms

YOLO: You only look once
OCSR: Optical chemical structure recognition
OCR: Optical character recognition
CLiDE: Chemical literature data extraction

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.2c00733.

Data sets and methods. Figure S1: Case about chirality check. Figure S2: Case about double bonds with trans(E), cis(Z), and either(E/Z). Figure S3: Representative failure cases from MolVec GitHub (https://github.com/ncats/molvec/issues/18). (PDF)

Author Contributions

Y. Xu, J. Xiao, and C.-H. Chou contributed equally.

The authors declare no competing financial interest.

Supplementary Material

ci2c00733_si_001.pdf (526.1 KB, PDF)

References

Xu Y.; Lin K.; Wang S.; Wang L.; Cai C.; Song C.; Lai L.; Pei J. Deep learning for molecular generation. Future Medicinal Chemistry 2019, 11, 567–597. 10.4155/fmc-2018-0358.
Lavecchia A. Deep learning in drug discovery: opportunities, challenges and future prospects. Drug Discovery Today 2019, 24, 2017–2032. 10.1016/j.drudis.2019.07.006.
Tang B.; Pan Z.; Yin K.; Khateeb A. Recent advances of deep learning in bioinformatics and computational biology. Frontiers in Genetics 2019, 10, 214. 10.3389/fgene.2019.00214.
Burley S. K.; Bhikadiya C.; Bi C.; Bittrich S.; Chen L.; Crichlow G. V.; Christie C. H.; Dalenberg K.; Di Costanzo L.; Duarte J. M.; et al. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Res. 2021, 49, D437–D451. 10.1093/nar/gkaa1038.
Mendez D.; Gaulton A.; Bento A. P.; Chambers J.; De Veij M.; Félix E.; Magariños M. P.; Mosquera J. F.; Mutowo P.; Nowotka M.; et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 2019, 47, D930–D940. 10.1093/nar/gky1075.
Kearnes S. M.; Maser M. R.; Wleklinski M.; Kast A.; Doyle A. G.; Dreher S. D.; Hawkins J. M.; Jensen K. F.; Coley C. W. The open reaction database. J. Am. Chem. Soc. 2021, 143, 18820–18826. 10.1021/jacs.1c09820.
Eltyeb S.; Salim N. Chemical named entities recognition: a review on approaches and applications. Journal of Cheminformatics 2014, 6, 1–12. 10.1186/1758-2946-6-17.
Valko A. T.; Johnson A. P. CLiDE Pro: the latest generation of CLiDE, a tool for optical chemical structure recognition. J. Chem. Inf. Model. 2009, 49, 780–787. 10.1021/ci800449t.
ChemAxon and Keymodule Announce Integration of CLiDE OCSR and Commercial Relationship. https://www.prnewswire.co.uk/news-releases/chemaxon-and-keymodule-announce-integration-of-clide-ocsr-and-commercial-relationship-258109351.html (accessed 2022-05-18).
Filippov I. V.; Nicklaus M. C. Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. J. Chem. Inf. Model. 2009, 49, 740–743. 10.1021/ci800067r.
Imago OCR. https://lifescience.opensource.epam.com/imago/index.html (accessed 2022-09-15).
MolVec OCR. https://github.com/ncats/molvec (accessed 2022-05-18).
Rajan K.; Brinkhaus H. O.; Zielesny A.; Steinbeck C. A review of optical chemical structure recognition tools. Journal of Cheminformatics 2020, 12, 60. 10.1186/s13321-020-00465-0.
Molecular translation competition. https://www.kaggle.com/c/bms-molecular-translation (accessed 2022-05-18).
Heller S.; McNaught A.; Stein S.; Tchekhovskoi D.; Pletnev I. InChI - the worldwide chemical structure identifier standard. Journal of Cheminformatics 2013, 5, 1–9. 10.1186/1758-2946-5-7.
Rajan K.; Zielesny A.; Steinbeck C. DECIMER 1.0: deep learning for chemical image recognition using transformers. Journal of Cheminformatics 2021, 13, 1–16. 10.1186/s13321-021-00538-8.
Clevert D.-A.; Le T.; Winter R.; Montanari F. Img2Mol - accurate SMILES recognition from molecular graphical depictions. Chemical Science 2021, 12, 14174–14181. 10.1039/D1SC01839F.
Oldenhof M.; Arany A.; Moreau Y.; Simm J. ChemGrapher: optical graph recognition of chemical compounds by deep learning. J. Chem. Inf. Model. 2020, 60, 4506–4517. 10.1021/acs.jcim.0c00459.
Zhang X.-C.; Yi J.-C.; Yang G.-P.; Wu C.-K.; Hou T.-J.; Cao D.-S. ABC-Net: a divide-and-conquer based deep learning architecture for SMILES recognition from molecular images. Briefings in Bioinformatics 2022, 23, bbac033. 10.1093/bib/bbac033.
LeCun Y.; Bengio Y.; Hinton G. Deep learning. Nature 2015, 521, 436–444. 10.1038/nature14539.
RDKit: Open-source cheminformatics. https://www.rdkit.org (accessed 2021-10-01).
Sandler M.; Howard A.; Zhu M.; Zhmoginov A.; Chen L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018; pp 4510–4520.
Jocher G.; et al. ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations; Zenodo, 2021.
He K.; Gkioxari G.; Dollár P.; Girshick R. Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision, 2017; pp 2961–2969.
Girshick R. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision, 2015; pp 1440–1448.
Tan M.; Pang R.; Le Q. V. EfficientDet: Scalable and efficient object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020; pp 10781–10790.
EasyOCR. https://github.com/JaidedAI/EasyOCR (accessed 2022-05-18).
PDF.js. https://github.com/mozilla/pdf.js/ (accessed 2022-05-18).
Ketcher. https://github.com/epam/ketcher (accessed 2022-05-18).
OpenCV. https://github.com/opencv/opencv/releases/tag/4.5.5 (accessed 2022-05-18).
Imago report. https://lifescience.opensource.epam.com/downloads/imago-2.0.0/report.zip (accessed 2022-09-15).
ChemAxon & Keymodule report. https://chemaxon.com/app/uploads/2013/06/keymodule.pdf (accessed 2022-05-18).
Wang Y.-H. Traditional uses and pharmacologically active constituents of Dendrobium plants for dermatological disorders: a review. Natural Products and Bioprospecting 2021, 11, 465–487. 10.1007/s13659-021-00305-0.
MolMiner download link. Mac https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/mac/PharmaMind-mac-latest-setup.dmg; Windows https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/win/PharmaMind-win-latest-setup.exe; GitHub: https://github.com/iipharma/pharmamind-molminer (accessed 2022-09-15).
ReportLab Open Source. https://hg.reportlab.com/hg-public/reportlab (accessed 2022-05-18).

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

ci2c00733_si_001.pdf^{(526.1KB, pdf)}

Data Availability Statement
