OpticalBERT and OpticalTable-SQA: Text- and Table-Based Language Models for the Optical-Materials Domain

Jiuyang Zhao; Shu Huang; Jacqueline M Cole

doi:10.1021/acs.jcim.2c01259

2023 Mar 20;63(7):1961–1981. doi: 10.1021/acs.jcim.2c01259

PMCID: PMC10091421 PMID: 36940385

Abstract

Text mining in the optical-materials domain is becoming increasingly important as the number of scientific publications in this area grows rapidly. Language models such as Bidirectional Encoder Representations from Transformers (BERT) have opened up a new era and brought a significant boost to state-of-the-art natural-language-processing (NLP) tasks. In this paper, we present two “materials-aware” text-based language models for optical research, OpticalBERT and OpticalPureBERT, which are trained on a large corpus of scientific literature in the optical-materials domain. These two models outperform BERT and previous state-of-the-art models in a variety of text-mining tasks about optical materials. We also release the first “materials-aware” table-based language model, OpticalTable-SQA. This is a querying facility that solicits answers to questions about optical materials using tabular information that pertains to this scientific domain. The OpticalTable-SQA model was realized by fine-tuning the Tapas-SQA model using a manually annotated OpticalTableQA data set which was curated specifically for this work. While preserving its sequential question-answering performance on general tables, the OpticalTable-SQA model significantly outperforms Tapas-SQA on optical-materials-related tables. All models and data sets are available to the optical-materials-science community.

Introduction

Modern optical devices rely on optical materials. Different materials are chosen for different optical applications according to how they interact with electromagnetic waves. For example, windscreens and optical lenses are often made of glassy materials because these transmit light well, while materials with particular light-absorption characteristics can be used to make optical filters.

Significant effort is being made in this field to accelerate novel materials development, in light of the growing demand for advanced optical devices of this nature.¹⁻³ In recent years, natural-language processing (NLP) and its downstream tasks have been employed as powerful tools to extract materials-science information from textbooks, scientific publications, reports, and handbooks, to speed up the discovery of new materials. For instance, the “chemical-aware” text-mining tool ChemDataExtractor⁴,⁵ has been developed to autogenerate custom databases for materials-science applications. ChemDataExtractor has already been deployed to autogenerate bespoke databases for battery materials,⁶ Curie and Néel temperatures of magnetic materials,⁷ refractive indices and dielectric constants of optical materials,⁸ UV/vis absorption spectral attributes of materials,⁹ photovoltaic materials and their cognate device-performance characteristics,¹⁰ the band gap of semiconductors,¹¹ and materials-engineering properties.¹² These rule-based and machine-learning approaches have been complemented by several studies which have shown that unsupervised pretraining of language models on large corpora can significantly improve performance on many generic, i.e., nonscientific, NLP tasks. These language models have been created using Long Short-Term Memory (LSTM)-based architectures, such as Embeddings from Language Models (ELMo),¹³ or transformer-based architectures, such as Generative Pretrained Transformer (GPT)¹⁴ and Bidirectional Encoder Representations from Transformers (BERT).¹⁵ However, directly applying these NLP methodologies to the optical-materials-science domain is not viable, since the vocabularies of general corpora (e.g., Wikipedia) and of scientific publications about optical materials are quite different. Researchers have developed specialist BERT-based language models for the general scientific fields of biology (BioBERT¹⁶) and materials sciences (MatSciBERT¹⁷ and MatBERT¹⁸). Moreover, a BERT-based language model that has been designed for a specific area of biology or materials science will naturally be more powerful than a general scientific BERT-based language model, as has recently been demonstrated through the presentation of BatteryBERT.¹⁹ We further advocate that property-specific BERT-based language models in materials science are needed as information sources for data-driven materials discovery, rather than general BERT-based models. Given that data-driven materials discovery is one of our key goals and very much on the agenda of the Materials Genome Initiative,²⁰ we herein present a property-specific BERT-based language model for optical materials.

Furthermore, all aforementioned BERT-based language models have been exclusively trained on text, while previous work has emphasized the richness of information on optical properties in tables.⁸ NLP efforts that aim to extract information from tabular data include rule-based approaches such as TableDataExtractor⁵ and neural-network-based approaches such as TableQA (Table Question Answering),²¹ Tapas (Weakly Supervised Table Parsing via Pretraining),²² and T3QA (Topic Transferable Table Question Answering).²³ The Tapas model developed by Google Research is worthy of particular mention since it significantly improved the state-of-the-art performance on three open table-based question-answering data sets.²² Nevertheless, these models were trained on general corpora (e.g., WikiTable), and applying such pretrained models to the optical-materials-science domain does not give satisfactory results, for two reasons. First, a large number of tables in publications about optical-materials science use symbols in the table header to represent certain optical properties. For example, “nD” refers to the refractive index, and “ϵ” denotes the dielectric constant. None of the aforementioned neural-network-based models can correctly understand these symbols. Second, most of the contents of table cells within general corpora (e.g., WikiTable) are text,²² while most of the contents of table cells within optical-materials-science publications are numbers, which can be a problem for developing a question-answering system that engages with tabular information from papers about optical materials.

Accordingly, this study seeks to realize new BERT-based models and table-based data-extraction capabilities that serve optical-materials research communities. Specifically, we develop OpticalBERT and OpticalPureBERT language models that are based on the BERT architecture but are trained on a large corpus of publications about optical materials. We examine the performance of these new models extensively by comparing their test results with those of the models that are designed for general use on three downstream tasks: abstract classification, question-answering, and chemical-named-entity recognition. We also develop the OpticalTable-SQA language model which serves as a question-answering tool for tabular data sources. This OpticalTable-SQA model has been realized by fine-tuning the Tapas model²² using a manually annotated data set of more than 4,000 question-answering pairs that pertain to the optical-materials domain. The OpticalTable-SQA tool demonstrates a significant improvement in understanding symbols of common optical properties, without a loss in the question-answering precision of the Tapas model on general table data sets.

Methods

Background

The BERT¹⁵ model has a transformer-based²⁴ architecture. Instead of the traditional left-to-right language-modeling procedure, BERT achieves bidirectional information propagation by predicting randomly masked tokens in sentences and predicting whether or not two sentences follow each other (Next Sentence Prediction, NSP). In this study, we only used the former part of the BERT model architecture, i.e., its masked language model, since most downstream tasks that can make use of language models in the materials-science domain do not rely on NSP.

This work also employed the Tapas model,²² which uses the same structure as BERT for masked language modeling. Before being fed to the language model, its token embeddings are combined with four additional table-aware structural embeddings: segment embeddings, column embeddings, row embeddings, and rank embeddings. These additional embeddings encode the structure of the table and, once the model has been fine-tuned on a question-answering task, enable it to select the table cells that contain the information sought.
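As an informal illustration of what the four structural embeddings index, the following sketch assigns a segment, column, row, and rank id to every token of a question and a small table. It is a simplified assumption-laden sketch, not the Tapas implementation: whitespace tokenization, the numbering conventions, the tie handling in the ranking, and the function name table_structure_ids are all hypothetical choices made here for clarity.

```python
from typing import Dict, List, Tuple


def table_structure_ids(
    question: List[str], table: List[List[str]]
) -> List[Tuple[str, int, int, int, int]]:
    """Return (token, segment_id, column_id, row_id, rank_id) per token.

    Assumptions (illustrative only): table[0] is the header with row_id 0,
    columns are numbered from 1, and question tokens get 0 for every id.
    """
    out = [(tok, 0, 0, 0, 0) for tok in question]

    # Rank the numeric body cells within each column (1 = smallest value);
    # non-numeric cells keep rank 0.
    ranks: Dict[Tuple[int, int], int] = {}
    for c in range(len(table[0])):
        numeric = []
        for r in range(1, len(table)):
            try:
                numeric.append((float(table[r][c]), r))
            except ValueError:
                pass
        for rank, (_, r) in enumerate(sorted(numeric), start=1):
            ranks[(r, c)] = rank

    # Flatten the table row by row; every token inherits its cell's ids.
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            for tok in cell.split():
                out.append((tok, 1, c + 1, r, ranks.get((r, c), 0)))
    return out


question = ["what", "is", "n", "of", "TiO2", "?"]
table = [["Compound", "n"], ["SiO2", "1.46"], ["TiO2", "2.61"]]
ids = table_structure_ids(question, table)
for entry in ids:
    print(entry)
```

In a real model, each id would select a learned embedding vector that is added to the token embedding; here the ids are only printed so that the table structure they encode can be inspected.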

Corpus

A total of 668,188 papers were downloaded directly from the Royal Society of Chemistry (RSC) and from Elsevier ScienceDirect via its sanctioned Application Programming Interface (API), using the query keyword “optical material”. More details of the article-retrieval process can be found in a previous work reported by Zhao and Cole.⁸ The average paper contained 4,374 tokens, which is significantly more than the average of 2,769 tokens in the papers that provided the data source for SciBERT.²⁵ The overall corpus size used to create our language models for optical materials was 2.92B tokens, similar to the 3.3B tokens on which the originally crafted BERT model, BERT-base, was trained for generic-language (i.e., nonscientific) text, and similar to the 3.17B tokens on which SciBERT was trained. A word cloud that was generated from the abstracts of a random sample of 1,000 papers in our corpus is shown in Figure 1.

Vocabulary

We constructed OpticalVocab, a new WordPiece vocabulary, from our corpus about optical-materials science using the BertTokenizerFast library.²⁶ The library uses the WordPiece tokenizer²⁷ to collect the most frequently used words or subword units. Compared with a full-word dictionary or a character-level vocabulary, this subword tokenization method achieves an optimal balance between the size and the expression capability of the vocabulary.²⁷ We used OpticalVocab in the creation of our OpticalPureBERT model, while the vocabulary file used to generate our OpticalBERT model was the same as that used to create the BERT-base model.

The quality of our vocabulary and tokenizer was evaluated by plotting Venn diagrams²⁸ of vocabularies that pertain to the BERT-base model, the SciBERT model, and our OpticalPureBERT model, as shown in Figure 2.

Figure 2. Comparison of vocabularies for the BERT-base, SciBERT, and OpticalPureBERT models. The digits represent the numbers of tokens in the corresponding vocabularies.

The largest token overlap between OpticalVocab and the other two vocabularies is 66.4% (cf. the intersection between the cased OpticalPureBERT and SciBERT models), which illustrates a substantial difference in frequently used words between text in papers about optical-materials science and about general science topics. We also show the subword fertility²⁹ and the unbroken ratio³⁰ of these three tokenizers in Figure 3. The subword fertility measures the average number of subwords that have been created after a word has been tokenized. The unbroken ratio counts the fraction of words whose completeness is preserved after tokenization.

Figure 3. Subword fertility (lower is better) and unbroken ratio (higher is better).

Figure 3 shows that the OpticalVocab used by OpticalPureBERT reduces the splitting of words when compared with the vocabularies of the other two models. This suggests that OpticalVocab is better suited for downstream tasks on our optical-materials-science corpus.
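The two tokenizer metrics defined above can be computed in a few lines. The sketch below uses a toy greedy longest-match tokenizer and two tiny hand-made vocabularies, all of which are hypothetical stand-ins chosen for illustration; they are not OpticalVocab or the tokenizers evaluated in this work.

```python
from typing import Callable, List, Set


def make_greedy_wordpiece(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Toy longest-match-first subword tokenizer (illustrative only)."""

    def tokenize(word: str) -> List[str]:
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            while end > start:
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    break
                end -= 1
            if end == start:  # no subword matched at this position
                return ["[UNK]"]
            pieces.append(piece)
            start = end
        return pieces

    return tokenize


def subword_fertility(words: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subwords produced per word (lower is better)."""
    return sum(len(tokenize(w)) for w in words) / len(words)


def unbroken_ratio(words: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Fraction of words that survive tokenization as a single token."""
    return sum(1 for w in words if tokenize(w) == [w]) / len(words)


words = ["photoluminescence", "band", "gap", "refractive"]
general = make_greedy_wordpiece(
    {"photo", "##lum", "##ines", "##cence", "band", "gap", "re", "##fra", "##ctive"}
)
domain = make_greedy_wordpiece({"photoluminescence", "band", "gap", "refractive"})

print(subword_fertility(words, general), unbroken_ratio(words, general))  # 2.25 0.5
print(subword_fertility(words, domain), unbroken_ratio(words, domain))    # 1.0 1.0
```

A vocabulary that already contains the domain terms leaves them intact, which is the effect that Figure 3 reports for OpticalVocab on the optical-materials corpus.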

Pretraining

Figure 4 shows the four overarching stages in which the two optical-materials-related BERT models were developed. In the pretraining stage, the OpticalBERT model was trained on our optical-materials corpus after initializing weights from the BERT-base model, while the OpticalPureBERT model was trained from scratch on the same corpus using the same architecture as the BERT-base model. We trained two different versions of the OpticalBERT and OpticalPureBERT models: cased and uncased. The cased models were pretrained using the raw corpus, while the uncased models were pretrained using the corpus after all characters had been converted to lowercase. Masked-language modeling (MLM) was used as the primary training phase of the two models, in which 15% of words in the employed corpus were masked and the model was trained to predict the masked words. We trained all of our models with a batch size of 256 sequences and a maximum sequence length of 512 tokens. The required training time was 8 days for the OpticalBERT models (further trained for 35 epochs from BERT weights) and 10 days for the OpticalPureBERT models (trained for 40 epochs), using eight NVIDIA DGX A100 GPUs on the ThetaGPU cluster at the Argonne Leadership Computing Facility (ALCF). Details of the pretraining hyperparameters can be found in the Supporting Information. All of our models were implemented in PyTorch³¹ using transformers.²⁶
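The masking step of MLM can be sketched as follows. This is a deliberately simplified illustration: it replaces every selected token with a "[MASK]" placeholder, whereas the original BERT recipe also leaves some selected tokens unchanged or swaps them for random tokens. The function name mask_tokens, the seed, and the example sentence are assumptions introduced here, not details of the pretraining pipeline used in this work.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(
    tokens: List[str], mask_rate: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask a fixed fraction of tokens and return (inputs, labels).

    labels holds the original token at masked positions and None elsewhere,
    so a training loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return inputs, labels


sentence = (
    "the refractive index of the thin film was measured at a wavelength "
    "of 589 nm using a spectroscopic ellipsometer at room temperature"
).split()
inputs, labels = mask_tokens(sentence)
print(" ".join(inputs))
print([lab for lab in labels if lab is not None])
```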

Fine-Tuning

Our optical-materials-related BERT models can be applied to various downstream tasks with minimal changes to the BERT architecture. Accordingly, we fine-tuned our OpticalBERT and OpticalPureBERT models on the following three text-mining tasks that are relevant to materials science: Abstract Classification, Question Answering (QA), and Chemical-Named-Entity Recognition (CNER).

Abstract Classification

Abstract/document classification refers to the task of classifying whether or not the abstract of a given research paper is relevant to a given field. In this study, we fine-tuned the pretrained BERT-based models by adding a new sequence-classification layer on top of them. The output of this layer is either 1, i.e., this abstract or paper is relevant to the research field of optical-materials science, or 0, i.e., this abstract or paper focuses on other subjects. As the search through papers in the scraping process was undertaken by simply finding the phrase “optical material” within a given paper, this corpus of papers will inherently include publications that are unrelated to optical-materials research. For example, the optical property of a lens could be mentioned in a paper that focuses on biomedical surgical operations, where the essence of that research is not about optical materials. Successfully classifying these papers as relevant or irrelevant to the field of optical-materials research not only helps to improve the accuracy of data-extraction tasks that employ this corpus, by passing through only papers of true interest, but also saves considerable time in a high-throughput text-mining study.

Annotated data are difficult and costly to collect for a specific scientific field owing to the domain expertise that is required for high-quality annotation.²⁵ However, we were able to build a training data set for the abstract-classification task by selecting papers based on their journal names. Half of the data set contained papers that have “optic” in their journal names, such as “Optical Fiber Technology” and “Optical Materials”. The other half of the data set contained papers that were clearly not about optical materials. We selected this latter category by excluding papers for which it was difficult to determine, purely from the journal name, whether they concern optical materials. A few examples of these “obscure” journal names are “Journal of Nanoparticle Research”, “Results in Chemistry”, and “Tetrahedron”. The resulting data set contains 17,748 abstracts that are believed to be relevant to optical-materials research and 17,748 abstracts that are believed to be irrelevant to optical-materials research, based solely on the journal name. The quality of this data set was validated by randomly sampling 300 abstracts and manually labeling whether or not they are correctly classified. A consistency ratio of 92.5% between manual labeling and “journal name labeling” suggests that our data set is of a sufficiently high quality to be used in an abstract-classification task. This small validation data set can be found in the Supporting Information.
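The journal-name labeling heuristic and the consistency check can be sketched as below. The set CLEARLY_UNRELATED, the helper names, and the toy label lists are hypothetical placeholders; the text does not list which journals were treated as clearly unrelated, so the entry used here is for illustration only.

```python
from typing import List, Optional

# Hypothetical placeholder: journals assumed here to be clearly unrelated
# to optical materials. The real list is not given in the text.
CLEARLY_UNRELATED = {"journal of dairy science"}


def label_by_journal(journal: str) -> Optional[int]:
    """Return 1 (relevant), 0 (irrelevant), or None ("obscure", excluded)."""
    name = journal.lower()
    if "optic" in name:
        return 1
    if name in CLEARLY_UNRELATED:
        return 0
    return None


def consistency_ratio(manual: List[int], automatic: List[int]) -> float:
    """Fraction of sampled abstracts where both labeling methods agree."""
    agree = sum(m == a for m, a in zip(manual, automatic))
    return agree / len(manual)


for journal in ["Optical Materials", "Journal of Dairy Science", "Tetrahedron"]:
    print(journal, "->", label_by_journal(journal))

# Toy labels for four abstracts; three of the four agree.
print(consistency_ratio([1, 0, 1, 1], [1, 0, 0, 1]))
```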

The data set was divided according to an 80:20 split of a training set and a development set. We manually annotated another randomly sampled out-of-sample test data set of 315 arbitrary abstracts, i.e., abstracts whose journal name could belong to the “obscure” category, in order to evaluate the real-world performance of our models. We fine-tuned the BERT-base model and the SciBERT model so that we could make a fair comparison of their performance against that of our new models. We also fine-tuned the MatBERT model¹⁸ and the MatSciBERT model¹⁷ on the same training set and used their performance on the same test sets as baselines, as they are existing BERT-based models that were also pretrained on materials-science corpora. We likewise trained a logistic regression (LR)-based binary classification model as another baseline, so that we could compare the performance of our models against that of other techniques. Additionally, we tested the performance of zero-shot prompt learning³² on the test set.

Question-Answering Tasks on Text

In an (extractive) Question Answering (QA) process, questions are posed in natural language to a paragraph of a given research paper, and the algorithmic routines of the QA process aim to extract the correct answer to that question from the paragraph. QA tasks can be applied to BERT-based models by adding a linear layer to the BERT architecture at the top of its output, i.e., the sequence embedding. This linear layer generates two one-dimensional logits whose length corresponds to that of the sequence length: 1) the “start logits”, which are used to compute the probability of a given token being the start of the predicted answer, and 2) the “end logits”, which are used to compute the probability of a given token being the end of the predicted answer. The confidence score, S, that a text slice starting at token i and ending at token j is the answer is given as

$$S = \mathrm{start\_logit}_i + \mathrm{end\_logit}_j$$

This can also be written as

$$S = W_s \cdot q_i + W_e \cdot q_j$$

where q_i, q_j, W_s, and W_e are, respectively, the token embedding of the start token, the token embedding of the end token, and the weights of the linear layers for the start and end logits. In this way, the piece of text that starts at token i and ends at token j with the highest confidence score will be identified as the model output.
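The span-selection rule described above can be sketched as a search over all valid (start, end) token pairs for the one whose summed start and end logits is largest. The function name best_span, the max_answer_len cap, and the example logits are assumptions made for this illustration; they are not taken from the models in this work.

```python
from typing import List, Tuple


def best_span(
    start_logits: List[float], end_logits: List[float], max_answer_len: int = 30
) -> Tuple[float, int, int]:
    """Return (score, i, j) for the highest-scoring span with i <= j.

    The score of a span is start_logits[i] + end_logits[j]. max_answer_len
    is a hypothetical cap on the span length in tokens.
    """
    best = (float("-inf"), 0, 0)
    for i, start_score in enumerate(start_logits):
        last = min(i + max_answer_len, len(end_logits))
        for j in range(i, last):
            score = start_score + end_logits[j]
            if score > best[0]:
                best = (score, i, j)
    return best


tokens = ["the", "band", "gap", "is"]
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [0.0, 0.5, 3.0, 1.0]
score, i, j = best_span(start_logits, end_logits)
print(score, tokens[i : j + 1])  # 5.0 ['band', 'gap']
```

Restricting the search to j >= i rules out spans that end before they start, which an independent argmax over the two logit vectors would not guarantee.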

The SQuAD v1.1 data set³³ was used to fine-tune our BERT-based models for QA applications. It is worth mentioning that the SQuAD v2.0 data set was not used in this fine-tuning process despite being released more recently than the SQuAD v1.1 data set; this is because it contains unanswerable questions, which makes it less applicable to the science domain.¹⁶,¹⁹,³⁴ The SQuAD v1.1 data set contains about 100,000 question-answer pairs for machine comprehension in the general text domain. The data set was split in a 90:10 ratio for the purpose of its training and development, respectively.

To evaluate the real-world performance of the fine-tuned models, we created two manually annotated out-of-sample test sets. The first test set contains 301 arbitrary question-answer pairs from a randomly sampled set of 162 paragraphs which are not in our pretraining corpus. The second test set contains 317 numerical question-answer pairs from the corresponding 305 paragraphs that had been sampled from an existing database about optical materials that had been previously autogenerated by ChemDataExtractor.⁸ Example constructs of data entries for these evaluation test sets are shown in Figure 5. A comprehensive horizontal comparison is performed between the fine-tuned OpticalPureBERT model, the OpticalBERT model, the BERT-base model, and the SciBERT model, as well as the MatBERT model¹⁸ and the MatSciBERT model.¹⁷ We also compared the test predictions of the numerical data set with the original records in the ChemDataExtractor-generated database,⁸ from which this numerical data set was constructed.

Figure 5. Example constructs of data entries for the evaluation question-answering test data sets. Top: arbitrary QA example. Bottom: numerical QA example. Annotated correct answers are highlighted.

Chemical-Named-Entity Recognition (CNER)

Chemical-named-entity recognition is one of the most fundamental steps of the text-mining process when extracting data from the physics- or chemistry-related materials domain. This is because the primary identifier of the data sought is a chemical-named entity that needs to be recognized and hence extracted from a sentence of text. Conventional approaches to CNER have often focused on a hybrid combination of rule-based methods and LSTMs.⁴,³⁵,³⁶ Recent studies have significantly improved the performance of CNER on existing data sets by fine-tuning BERT-related models.¹⁶,¹⁷,³⁷ By adding a single-output layer that is based on the word embedding of its last layer to the BERT architecture, BERT is able to compute its token-level probabilities within the BIO scheme.³⁸ The BIO scheme assigns one of three labels to each token of a sentence: “B-MAT” indicates that the token is the beginning of a chemical name, “I-MAT” indicates that the token is part of a chemical name but is not its starting token, and “O” marks all other tokens. It is necessary to include both “B-MAT” and “I-MAT” labels, as words will be split into subword tokens by the BERT tokenizer.
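As a small illustration of the BIO scheme, the sketch below converts annotated entity spans into per-token labels. The function name bio_tags, the span convention (start inclusive, end exclusive), and the example sentence are assumptions made here; real training data would be labeled at the level of the tokenizer's subword tokens.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], entity_spans: List[Tuple[int, int]]) -> List[str]:
    """Assign B-MAT / I-MAT / O labels given (start, end) token spans.

    Each span marks one chemical-named entity; end is exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B-MAT"            # first token of the chemical name
        for k in range(start + 1, end):  # remaining tokens of the same name
            tags[k] = "I-MAT"
    return tags


tokens = ["The", "refractive", "index", "of", "titanium", "dioxide", "is", "2.61"]
tags = bio_tags(tokens, [(4, 6)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```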

The data sets that were used to fine-tune our models are a combination of the BioCreative IV CHEMDNER data set³⁹ and the Matscholar data set.⁴⁰ The CHEMDNER data set contains 84,355 manually annotated chemical-named entities that span 10,000 abstracts, and it has an interannotator agreement of 91%. The majority of the chemical names in the CHEMDNER data set are organic, owing to the disciplines from which its source papers were selected.³⁷ We thus included the Matscholar data set in our training data set together with the CHEMDNER data, in order to enhance the capability of our model, so that it can identify inorganic materials as well as organic ones. Although the 7,360 chemicals that have been annotated in the Matscholar data set are far fewer than those of the CHEMDNER data set (84,355), the Matscholar data set focuses on materials science, which better suits our objectives. The combined data set was divided into training and development sets using an 80:20 split ratio, in common with the data-splitting procedures that were applied to the other downstream tasks. A manually annotated out-of-sample test set was constructed to include 411 randomly sampled examples of chemical-named entities. The performance of our models on the development set was compared with that of other BERT-based models, while the model performance on the test set was compared additionally with traditional NLP models that have been made via ChemDataExtractor.⁵

Question-Answering Tasks on Tables

We enabled question-answering capabilities for tabular data by focusing on teaching the model to understand various symbols of optical properties that reside in the header of tables which are contained within papers that make up the optical-materials-science corpus. This strategy was adopted because an optical property is commonly not written out in complete English words within a table; rather, it is represented by a symbol. The task of selecting the correct cell from a table when the model is asked, in natural language, for a specific optical property of a certain compound or chemical is known as “single-cell extraction”. The publicly available table-parsing sequential question-answering model for general text, Tapas-SQA,²² provided a generic baseline model for our requirements which we could adapt for our scientific application. The Tapas-SQA model had been fine-tuned on the Sequential Question Answering (SQA)⁴¹ data set, which consists of 17,553 generic (i.e., nonscientific) question-answering pairs that had been annotated from tabular data on Wikipedia. The SQA data set is called a sequential question-answering data set because its annotation allows for a configuration where several questions might be asked of the data in series by a single enquirer. For example, a sequential set of questions asked by one enquirer could be

1. who are the players?
2. of those, who is from the USA?

The Tapas-SQA model is required to answer these questions in series, by taking the output of the first question as an input to answer the second question. The SQA data set⁴¹ has a preidentified test set, which enables direct comparisons between models that have been developed using different approaches. This test set contains 1,025 first questions, 1,024 second questions, 683 third questions, and 280 questions of higher orders. Thus, the ability of the Tapas-SQA model to correctly answer the first or first two questions plays a crucial role in real-world applications, where people often ask a series of interrelated questions in order to tackle a problem or to understand a situation.

Both the original corpus that had been used to train the Tapas-SQA model and this SQA data set⁴¹ contain a very limited amount of information about the optical-materials-science domain. Thus, we sought to adapt the Tapas-SQA model to suit our application area, by augmenting it with optical information from tabular data. To that end, we first created the OpticalTableQA data set, which contains 4,534 manually annotated single question-answering pairs on tabular data about optical materials from more than 90 tables, where “single” refers to the fact that all question-answering pairs are first questions. These tables were taken from the papers that make up the aforementioned corpus that was used to train the OpticalBERT model, and they were filtered based on their structure, length, and content, in order to meet the requirements for fine-tuning the Tapas-SQA model.²² We designed eight question types that can be classified into two categories: 1) “what-questions”, which ask about the property value of a chemical or compound, and 2) “which-questions”, which require the Tapas-SQA model to select the correct chemical or compound given a property value. The eight question types and the annotation process are exemplified in detail in Figure 6. To enhance model generalizability, this question-answering annotation was designed to cover a wide range of optical properties including, but not limited to, the refractive index, dielectric constant, absorption/fluorescence maximum, band gap, and dipole moment. The percentage composition of physical properties that are featured in this OpticalTableQA data set is given in Figure 7.

Figure 6. A table (top) with corresponding example question-answering annotations and corresponding question types (bottom).

Figure 7. Percentage compositions of different (optical) properties that feature in our OpticalTableQA data set.

We then fine-tuned the Tapas-SQA model on the OpticalTableQA data set to 1) build a model that is able to understand special symbols and certain table structures that pertain to the optical-materials-science domain and 2) preserve the ability to answer general questions of the adapted Tapas-SQA model as far as possible. The OpticalTableQA data set was shuffled and divided into training and development subsets whose proportioning carried an 80:20 ratio, respectively. This dividing procedure was implemented five times whereby the data set was split randomly in each case. Thus, five pairs of training and development subsets of the OpticalTableQA data set were created. For each of those five pairs of data subsets, the Tapas-SQA model was fine-tuned on the training set, and the performance of the fine-tuned model was evaluated using the development set. Meanwhile, the performance of the fine-tuned model on the SQA test set was compared with that of the original Tapas-SQA model.
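The repeated random splitting described above can be sketched as follows. The function name repeated_splits, the fixed seed, and the use of integer indices as stand-ins for the annotated question-answering pairs are assumptions made for this illustration; fine-tuning and evaluation on each pair of subsets are omitted.

```python
import random
from typing import List, Sequence, Tuple


def repeated_splits(
    examples: Sequence[int], n_repeats: int = 5, train_frac: float = 0.8, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Shuffle and split the data n_repeats times into (train, dev) pairs."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        shuffled = list(examples)
        rng.shuffle(shuffled)  # a fresh random order for every repeat
        cut = int(len(shuffled) * train_frac)
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits


# 4,534 placeholder indices, one per question-answering pair.
splits = repeated_splits(range(4534))
for train, dev in splits:
    print(len(train), len(dev))
```

Averaging the development-set scores over the five pairs gives a performance estimate that depends less on any single random partition of the data.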