Patterns. 2022 Apr 8;3(4):100488. doi: 10.1016/j.patter.2022.100488

Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science

Amalie Trewartha 2,5, Nicholas Walker 1,5,7,, Haoyan Huo 2,4,5, Sanghoon Lee 1,4,5, Kevin Cruse 2,4,5, John Dagdelen 1,4,5, Alexander Dunn 1,4,5, Kristin A Persson 3,4,6, Gerbrand Ceder 2,4,6, Anubhav Jain 1,6,∗∗
PMCID: PMC9024010  PMID: 35465225

Summary

A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to an increase in publications. This problem may be addressed by using named entity recognition (NER) to extract structured summary-level data from unstructured materials science text. We compare the performance of four NER models on three materials science datasets. The four models include a bidirectional long short-term memory (BiLSTM) and three transformer models (BERT, SciBERT, and MatBERT) with increasing degrees of domain-specific materials science pre-training. MatBERT improves over the other two BERTBASE-based models by 1%∼12%, implying that domain-specific pre-training provides measurable advantages. Despite relative architectural simplicity, the BiLSTM model consistently outperforms BERT, perhaps due to its domain-specific pre-trained word embeddings. Furthermore, MatBERT and SciBERT models outperform the original BERT model to a greater extent in the small data limit. MatBERT’s higher-quality predictions should accelerate the extraction of structured data from materials science literature.

Keywords: NLP, NER, BERT, transformer, language model, pre-train, materials science, solid-state, doping, gold nanoparticles
PACS: 07.00.00, 81.00.00

Highlights

  • Efficient extraction of information from materials science literature is needed

  • Domain-specific materials science pre-training improves results

  • Even simpler domain-specific models can outperform more complex general models

The bigger picture

A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to a massive increase in publications. Four different language models are trained to automatically collect important information from materials science articles. We compare a simple model (BiLSTM) with materials science knowledge to three variants of a more complex model: one with general knowledge (BERT), one with general scientific knowledge (SciBERT), and one with materials science knowledge (MatBERT). We find that MatBERT performs the best overall. This implies that language models with greater extents of materials science knowledge will perform better on materials science-related tasks. The simpler model even consistently outperforms BERT. Furthermore, the performance gaps grow when the models are given fewer examples of information extraction to learn from. MatBERT’s higher-quality results should accelerate the collection of information from materials science literature.


Efficient automated extraction of information from materials science literature is needed due to an increasingly unwieldy number of publications. For a selection of materials science NER tasks, Trewartha et al. find that language models pre-trained on materials science literature provide measurable advantages over language models pre-trained on general literature or even scientific literature from multiple fields. This provides an opportunity to produce higher-quality results that require less training data in order to address this problem.

Introduction

Recently, the number of publications in the field of materials science has grown exponentially.1 As a result, it has become increasingly difficult for researchers to follow research progress as it emerges, even within relatively restricted sub-domains. The size of the materials science literature means that even relatively simple questions, such as which material candidates have previously been studied for a particular application, can be difficult or impossible to comprehensively answer. This has created a need for new, more efficient ways to engage with the literature and extract the relevant information therein.

Natural language processing (NLP), the analysis of unstructured text using computers, provides a natural candidate for such an alternative approach. NLP has successfully been applied to a number of materials science applications and is the topic of several recent investigations in materials informatics.2, 3, 4, 5 Additionally, work has been done to develop meta-learning strategies for NER.6, 7, 8 Recently, the advent of transformer ML architectures such as BERT9 has revolutionized NLP; leading benchmarks such as GLUE10 are now dominated by models that utilize attention-based encoder-decoder architectures called transformers11 and that perform comparably to humans on some tasks. Transformer models have ushered in a new NLP paradigm where large and general NLP models are “pre-trained” on semi-supervised tasks before being fine-tuned for downstream tasks.9,12, 13, 14, 15, 16, 17 The pre-training approach allows for task-specific models to be trained using relatively few hand-annotated examples; this is a useful feature for practical applications of NLP that are bottlenecked by annotation, such as scientific tasks that contain technical text and esoteric vocabulary.

Although a single pre-trained model may address multiple NLP tasks (e.g., question answering, named entity recognition, next sentence prediction), the success of models with domain-specific pre-training such as BioBERT,18 CaseHOLD,19 and FinBERT20 begs the question: can transformer models be further improved with even more domain-specific pre-training? We hypothesize that the measurable advantages previously shown with domain-specific pre-training—for example, of SciBERT over BERT21—can again be extended to models specific to narrower scientific disciplines such as materials science. Improved domain-specific model performance implies improved ability for automated knowledge extraction from even the most complex and vexing (from the perspective of NLP models) scientific domains. Exploring this problem in-depth presents an opportunity for the collation and synthesis of massive numbers of highly complex scientific publications into otherwise inaccessible structured databases and models for knowledge generation.

In this work, we apply transformer models to the task of named entity recognition (NER)22 to extract and label important scientific entities relevant to materials chemistry from unstructured text. A well-trained NER model will be capable of automatically mapping the unstructured text of materials science publications to a queryable database of key terms. Historically, NER has been used to extract information such as names and locations from various articles, though recently it has been employed in the chemical, medical, and materials sciences as well.1, 2, 3, 4,23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39 For materials science, this may include terms that refer to materials and their geometries, properties, syntheses, methods of characterization, and downstream applications. Strongly related work in text mining and language modeling has also been employed in the same fields.5,40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60 BERT has additionally found use in biology, medicine, and materials science.18,61,62

Specific to the field of materials science, there have been significant efforts to apply NER to the extraction of materials synthesis recipes, including using BERT.2,28,29,57,62 In the past, these have employed a combination of the aforementioned work in the chemical sciences to extract inorganic material entities with syntax trees and lookup tables to extract properties and processing conditions. The recently developed transformer-based models have been shown to offer significant performance improvements on NLP tasks.9 This provides an excellent opportunity to evaluate the performance of these new models on NER tasks specific to materials science.

In this work, we apply four different NER models to three different materials science datasets and analyze their performance. The simplest model considered is a bidirectional long short-term memory (BiLSTM) recurrent neural network. The other three models, variants of the popular transformer-based BERTBASE neural network structure,9 have identical model structures but use pre-training corpora of varying domain specificity. The considered datasets consist of one general-purpose materials science dataset (referred to as the solid-state dataset) and two topic-specific datasets that respectively focus on doping and gold nanoparticle synthesis. We use the results of NER on these materials science datasets to relate the domain specificity of the pre-training corpus to measurable performance differences in extracting named entities.

Datasets

Here we consider three different NER datasets, chosen to represent a diversity of text sources and problems relevant to materials science: a set of solid-state materials science abstracts with entities of broad interest,28 a set of abstracts with inorganic doping information, and a set of methods/results sections relevant to gold nanoparticle synthesis. Each of these is described in detail below. The solid-state dataset is publicly available,63 though only the DOIs and annotated entities are available for the other two.64

Solid-state dataset

The solid-state dataset discussed in this work consists of 800 annotated abstracts from solid-state materials publications collected using Elsevier’s Scopus/ScienceDirect65 and Springer-Nature66 APIs as well as web scraping for journals published by the Royal Society of Chemistry67 and the Electrochemical Society.68 Abstracts are considered relevant if they mention at least one inorganic material and at least one synthesis or characterization method for inorganic materials. The entity labels are chosen to represent a broad domain of materials science knowledge with seven different labeled entity types: inorganic materials (MAT), symmetry/phase labels (SPL), sample descriptors (DSC), material properties (PRO), material applications (APL), synthesis methods (SMT), and characterization methods (CMT). Details of the collection and pre-processing of these abstracts and detailed definitions of the labels are available in Weston et al.28

A condensed example is shown in Figure 1.69 This dataset is intended to provide a “catch-all” of relevant information without focusing on any specific facet of solid-state materials. Due to the broad definitions of the entities, the solid-state dataset generally contains more entities per paragraph than the other datasets. Additionally, an inter-annotator agreement of 87.4% was evaluated utilizing 25 annotations from a second annotator.28

Figure 1. Solid-state annotation example

An example of the solid-state annotation scheme condensed from an example abstract in the solid-state dataset.

Doping dataset

The properties of doped materials used for applications requiring semiconductors are determined by critical pieces of information such as the base material (BASEMAT), the doping agent (DOPANT), and quantities associated with the doped material such as the doping density or the charge carrier density (DOPMODQ). The intention of this dataset is to capture the information relevant to the doping of a material and any other relevant quantitative measurements. Abstracts that specifically contain information about doping, i.e., those containing regular expressions matching “dop∗” (such as “dopant,” “doped,” and “co-doping”) or “n-type” or “p-type,” were queried from the Matscholar database of materials science abstracts.70 A set of 500 abstracts was randomly sampled from the queried set, from which 455 abstracts were identified by human annotators as relevant to inorganic materials science and were annotated by three annotators.

A condensed example is shown in Figure 2.71 As opposed to the solid-state and gold nanoparticle datasets, tokens were annotated one sentence at a time (one sample = one sentence). Sentences were annotated only when they contain specific and direct information about the doping of solid-state materials, e.g., “X was doped with Y,” “X:Y,” or “Y doping.” Sentences describing byproducts or targeted properties (e.g., magnetization) without direct reference to a dopant or a host material (e.g., “The layered TiO2 phase did not incorporate the dopant specie and had an anatase structure with measured lattice parameters of a=3.61Å, c=9.45Å.”) were not annotated.

Figure 2. Doping annotation example

An example of the doping annotation scheme condensed from an example abstract in the doping dataset. Note that sentence-level annotation was conducted for doping annotations.

Gold nanoparticle dataset

Gold nanoparticles (AuNPs) are used widely in biomedicine (e.g., in vitro diagnostics), semiconductor technology, and cosmetics.72, 73, 74, 75, 76 Despite the strong reliance of AuNP properties on size and shape,77,78 only recently have synthesis methods been able to control AuNP morphology, particularly anisotropic nanorods. This dataset aims to capture AuNP morphologies and descriptions from relevant sections of the full text of AuNP synthesis literature. A single annotator annotated a set of 85 characterization paragraphs from 73 articles on AuNP synthesis.

A condensed example is shown in Figure 3.79 The entities for this model include general shape-based morphological information for the synthesized AuNPs, including noun-based morphological entities (MOR) and adjective-based, descriptive entities (DES). Generic terms like “particle” or “AuNP” were annotated as MOR entities so that size information could later be attributed to at least some target, since many nanoparticle articles refer to the particles only as the less descriptive “nanoparticle” or “NP.” Note that other aspects such as the dimensions of particles were not included due to very low levels of support for such labels in the original data. This is similar to past work on information extraction from nanomaterial synthesis literature.57 Furthermore, limiting the number of labels will tend to provide better performance, particularly for smaller datasets.

Figure 3. Gold nanoparticle annotation example

An example of the gold nanoparticle annotation scheme condensed from an example paragraph in the gold nanoparticle dataset.

Methods

Four different models are trained and evaluated on each dataset: a BiLSTM and three networks that use the bidirectional encoder representations from transformers (BERT) structure. The three BERT networks considered are BERTBASE (uncased), SciBERT (uncased),21 and a pre-trained model introduced with this work, MatBERT (uncased). Each model, when given an abstract for a materials science publication in the form of a sequence of tokens, learns to classify each token into pre-defined categories. The token categories correspond to combinations of token position and entity type, e.g., B-MAT for the beginning token of a material entity. In this way, the NER models described here can be understood as sequence-to-sequence models (Seq2Seq) that transform a sequence of words into a sequence of labels. Unless otherwise specified, for each experiment, 80% of the data was used for training, 10% for validation, and 10% for testing. Sixteen different seeds (the integer powers of two with exponents from 0 to 15) were used to determine the order of the training data as well as the model weight initialization.
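The data protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the `split_seed` parameter are hypothetical, and the text does not say how the 80/10/10 split itself was drawn. The sketch keeps the split fixed and lets each of the sixteen seeds control only the order of the training samples.

```python
import random

# Seeds as described in the text: powers of two with exponents 0 through 15.
SEEDS = [2 ** k for k in range(16)]

def split_dataset(samples, split_seed=0):
    """Cut samples into fixed 80/10/10 train/validation/test parts.

    split_seed is a hypothetical parameter; how the split was drawn
    is not stated in the text.
    """
    shuffled = list(samples)
    random.Random(split_seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

def training_order(train, seed):
    """Return the training samples in the order determined by one seed."""
    ordered = list(train)
    random.Random(seed).shuffle(ordered)
    return ordered

# Example with 800 sample indices, the size of the solid-state dataset.
train, valid, test = split_dataset(range(800))
orders = [training_order(train, seed) for seed in SEEDS]
```

In a full training run the same seed would also be passed to the deep learning framework so that the weight initialization is reproducible.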

Tokenizers

The Materials Tokenizer was used with the BiLSTM model.28 First, the tokenization step is carried out using ChemDataExtractor with additional pre-processing to split tokens that are either composed of a number and a unit or an element and a valence state.80 Processing the tokens then consists of replacing numbers with <nUm> since they are often not tokenized correctly with ChemDataExtractor, normalizing simple chemical formulas so the order of the elements is standardized, lowercasing tokens with only the first letter capitalized that are not elements or chemical formulas, and removing accents.

BERT models, however, use the WordPiece subword tokenization algorithm, which is very similar to byte-pair encoding (BPE).81,82 BPE relies on a pre-tokenizer that splits the training data into words. After determining the unique words in the training data and their frequencies, BPE constructs a base vocabulary consisting of all symbols that occur in the words and is trained to learn merging rules, so two symbols from the base vocabulary can be combined to form a new symbol until the vocabulary has grown to the desired size. The learned merging rules can then be applied to new words as long as they are composed of symbols from the base vocabulary. In contrast to BPE, WordPiece learns symbol pairs that maximize the likelihood of the training data rather than the most frequent symbol pairs.
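The BPE training loop described above can be illustrated with a minimal sketch. It implements the frequency-based merge rule of BPE, not the likelihood-based criterion of WordPiece, and it is not the tokenizer code used in this work; the function names and the toy word counts are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from pre-tokenized word frequencies."""
    # The base vocabulary is the set of single characters in the words.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += count
        if not pair_counts:
            break
        # BPE picks the most frequent adjacent pair of symbols.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges

def apply_merges(word, merges):
    """Segment a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Toy word frequencies (hypothetical, for illustration only).
merges = learn_bpe_merges({"doped": 5, "dopant": 3, "doping": 2}, num_merges=2)
segments = apply_merges("dopants", merges)
```

A new word such as "dopants" can be segmented because all of its characters belong to the base vocabulary, which is the condition stated in the text.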

Tagging schemes

This work uses the IOBES tagging scheme.83 With this scheme, any token that does not correspond to an entity (or part of an entity) is labeled with O, denoting an “outside” classification. Single-token entities are labeled S-X, where the S prefix denotes a “single” token entity and X is the entity type. For multi-token entities, the prefix B is used to denote the “beginning” token, E for the “end” token, and I for the tokens “inside” the span of the beginning and end tokens. The IOBES tagging scheme has been shown to provide higher F-scores than other similar tagging schemes while retaining the ability to identify consecutive entities.84
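As an illustration of the scheme, the sketch below converts entity spans into IOBES tags. The function name, the span format (start index, exclusive end index, entity type), the hyphen between prefix and entity type, and the example sentence are assumptions made for this example and are not taken from the annotation tooling used in this work.

```python
def spans_to_iobes(num_tokens, spans):
    """Convert (start, end_exclusive, entity_type) spans into one IOBES tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags

# Hypothetical sentence with three annotated entities.
tokens = ["Thin", "films", "of", "lithium", "cobalt", "oxide",
          "were", "characterized", "by", "XRD"]
spans = [(0, 2, "DSC"), (3, 6, "MAT"), (9, 10, "CMT")]
tags = spans_to_iobes(len(tokens), spans)
```

Because the B, E, and S prefixes mark entity boundaries explicitly, two adjacent entities of the same type remain distinguishable.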

Conditional random field

For all of the models considered, a conditional random field (CRF) is utilized for decoding sequences in addition to calculating the training and validation loss, taking the classification layer output logits as inputs.85, 86, 87, 88 As opposed to a classification layer that outputs logits to predict labels without the consideration of neighboring labels, a CRF layer is capable of taking context from these neighboring labels into account when making predictions. Invalid transitions as defined by the tagging scheme (such as I-X being followed by B-X) are initialized to incur large loss penalties.
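One way to encode which tag transitions the IOBES scheme allows is sketched below. The function names, the hyphenated tag format, and the penalty value of -10000.0 are assumptions made for illustration; the text states only that invalid transitions are initialized with large penalties, not how this is implemented.

```python
def is_valid_transition(prev_tag, next_tag):
    """Return True if next_tag may directly follow prev_tag under IOBES."""
    prev_prefix, _, prev_type = prev_tag.partition("-")
    next_prefix, _, next_type = next_tag.partition("-")
    if prev_prefix in ("B", "I"):
        # An open entity must continue or end with the same entity type.
        return next_prefix in ("I", "E") and next_type == prev_type
    # After O, E, or S no entity is open, so only O, B, or S may follow.
    return next_prefix in ("O", "B", "S")

def build_transition_penalties(tags, penalty=-10000.0):
    """Build a matrix of initial transition scores: 0.0 if valid, `penalty` if invalid."""
    return [[0.0 if is_valid_transition(p, n) else penalty for n in tags]
            for p in tags]

tags = ["O", "B-MAT", "I-MAT", "E-MAT", "S-MAT"]
penalties = build_transition_penalties(tags)
```

Rows index the previous tag and columns the next tag, so `penalties[2][1]` holds the score for the invalid I-MAT to B-MAT transition.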

BiLSTM model

The BiLSTM network is an example of a gated recurrent neural network in which the connections between the nodes in the LSTM layers compose a directed graph along a temporal sequence, in this case, a sequence of words. This allows the network to track arbitrarily long-term dependencies in the input sequence, demonstrating temporal dynamic behavior. The bidirectional implementation allows for the LSTM layers to consider both the forward and backward directions of the sequence. Multi-head attention is also used to allow the network to attend to different parts of the sequence differently, i.e., responding to longer-term versus shorter-term dependencies.11 These dependency-sensitive representations of the tokens in the sequence can then be used for the downstream classification task via a classification layer. In this work, the word embeddings are initialized using pre-trained Mat2Vec embeddings with a vocabulary size of 529,688.89 During training, additional word features are learned using character-level convolutions. These features are then concatenated with the pre-trained Mat2Vec embeddings before being fed into the LSTM layers. The character-level convolutions can aid in improving embeddings for infrequent or even out-of-vocabulary words and have been shown to be useful on relatively small benchmark datasets.90

Table 1 summarizes the parameters used to construct the BiLSTM model. The only change in comparison to the BiLSTM model used in past work is the use of convolutional layers instead of BiLSTM layers for the character fields.28 For training the BiLSTM model with CRF output and loss, the pre-trained Mat2Vec embeddings were held constant by convention. The RangerLARS optimizer (also known as Over9000)91 was used for all experiments; it combines rectified adaptive moment estimation (RAdam)92 and Lookahead,93 which together form the Ranger optimizer,94 with layer-wise adaptive rate scaling (LARS).95 A learning rate schedule called “flat and anneal” was utilized, which consists of a constant learning rate for 72% of the training epochs followed by cosine annealing to decay the learning rate to zero.91 An initial learning rate of 4×10⁻² was used alongside gradient clipping with a maximum norm of 1.0 to prevent exploding gradients. The training was conducted for 64 epochs.
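The “flat and anneal” schedule can be written as a short function. This is a sketch rather than the optimizer code used in this work: the function name is hypothetical, and applying the schedule per epoch rather than per gradient step is an assumption.

```python
import math

def flat_and_anneal(step, total_steps, base_lr, flat_fraction=0.72):
    """Constant learning rate for the first part of training, then cosine decay to zero."""
    flat_steps = int(flat_fraction * total_steps)
    if step < flat_steps:
        return base_lr
    # Progress through the annealing phase, from 0.0 to 1.0.
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Relative learning rate at each of 64 epochs, with a base value of 1.0.
schedule = [flat_and_anneal(epoch, 64, base_lr=1.0) for epoch in range(64)]
```

With 64 epochs, the learning rate stays constant for the first 46 epochs and then follows a half cosine down toward zero.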

Table 1.

BiLSTM parameters: A table of parameters for the BiLSTM model

Component             Parameter         Value
Word embedding        dimension         200
Word embedding        dropout           0.5
Character embedding   dimension         38
Character embedding   dropout           0.5
LSTM                  layers            2
LSTM                  hidden dimension  64
LSTM                  dropout           0.1
Multi-head attention  heads             16
Multi-head attention  dropout           0.25

BERT models

The three BERT models we investigate share the same BERTBASE network structure as well as the same tokenizer algorithm with a maximum vocabulary size of 30,522 tokens. Input sequences are limited to a maximum of 512 tokens. Refer to the original BERT paper for details on its architecture.9

Table 2 summarizes the parameters used to construct the BERTBASE model. The three BERT models considered in this work differ only in pre-training, which is largely determined by the corpora on which they are trained. Before training the actual BERT model parameters can take place, the WordPiece tokenizer must be trained on the corpora in order to establish the vocabulary of the model. After the tokenizer is trained, the corresponding BERT model is pre-trained on the same corpora. This consists of two tasks: masked language modeling (MLM) and next sentence prediction (NSP).9 The MLM task requires that the BERT model predicts missing words in input sequences where 15% of the words are masked. The NSP task requires that given two sequences, the BERT model predicts the likelihood that one follows the other. It has been shown that pre-training on different corpora can lead to different performances.21 This is of particular interest in technical fields where commonly used words and phrases may not be well-represented or even carry the same meaning in other contexts.
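The masking step of the MLM task can be sketched as below. This is a simplified illustration with hypothetical names: it replaces every selected token with a mask symbol, whereas the original BERT procedure replaces only 80% of the selected tokens with the mask symbol, swaps 10% for random tokens, and leaves 10% unchanged.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Mask a fraction of tokens and return the masked sequence and the prediction targets."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        # The model is trained to recover the original token at each masked position.
        targets[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, targets

# Hypothetical 20-token input sequence.
tokens = ("the film was doped with nb and annealed at 500 c in air "
          "for two hours before xrd and sem").split()
masked, targets = mask_tokens(tokens)
```

The `targets` dictionary maps each masked position to the token that the model must predict.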

Table 2.

BERTBASE parameters: A table of parameters for the BERTBASE model

Hidden layers (12)                    Embeddings
attention heads        12             hidden dimension        768
dropout                0.1            intermediate dimension  3,072
activation function    GELU           positions               512
layer normalization ϵ  1×10⁻¹²        token types             2

The original BERT model was trained on the BooksCorpus (800 million tokens) and English Wikipedia (2.5 billion tokens).9 By contrast, SciBERT was trained on 1.14 million scientific papers from Semantic Scholar (3.1 billion tokens) across a variety of fields.21 SciBERT was shown to outperform BERT on scientific tasks as a result.

Building on this, we present MatBERT as a BERT model trained using scientific papers specifically from the field of materials science. For training MatBERT, we randomly sampled two million papers, or around 61 million paragraphs, from a corpus mostly consisting of peer-reviewed materials science journal articles.2 To optimize MatBERT models for materials science terminologies, two WordPiece tokenizers (cased and uncased) were trained using these paragraphs with no additional pre-processing. Following BERT practices, the vocabulary sizes for the tokenizers are both 30,522. After tokenization, paragraphs with fewer than 20 or more than 510 tokens were removed, leaving a pre-training corpus consisting of around 50 million paragraphs (8.8 billion tokens). The two variants were trained using only the MLM task. An AdamW optimizer was used with a weight decay of 0.01 and a learning rate of 5×10⁻⁵ that decayed linearly to zero over five training epochs. A batch size of 192 paragraphs per gradient update step was used. The convergence of the MLM loss versus training steps can be found in the supplemental information. Each model was trained on eight NVIDIA V100 GPUs and took about 1 month to complete. The pre-training code and pre-trained MatBERT model weights are publicly available.96,97 In this work, the uncased version is used for all BERT variants.

For training of the BERT models (MatBERT, SciBERT, and BERT) with CRF output and loss, the pre-trained model parameters were fine-tuned. The model structures as well as the BERT pre-trained parameters were provided by the “transformers” library.98 The SciBERT pre-trained parameters compatible with this library were acquired using the SciBERT AllenAI repository.21 All experiments were performed using the PyTorch library.99 The LAMB optimizer was used for all experiments.100 Different initial learning rates for the BERT embeddings, BERT encoders, and the classification layers (the linear and CRF layers) were employed to reach optimum results. They were respectively chosen as 1×10⁻⁴, 2×10⁻³, and 1×10⁻². For the first epoch, only the classification layers are trained, after which the BERT layers are fine-tuned alongside the classification layers for four epochs. For the learning rate schedule, all learning rates are subjected to exponential decay to 10% of the initial value at the final epoch, starting at the end of the second epoch. Gradient clipping with a maximum norm of 1.0 was employed to prevent exploding gradients. For the BERT models (MatBERT, SciBERT, and BERT), the WordPiece tokenizer will often split up words into multiple subtokens. For label predictions, only the embedding of the first subtoken of each word is used for classification. This is consistent with conventional usage.9 The code used for training the BERT models on the NER tasks is publicly available.101
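The first-subtoken convention can be illustrated with a toy example. The tokenizer below is a hypothetical greedy longest-match segmenter over a tiny made-up vocabulary, standing in for a real WordPiece tokenizer; only the masking logic reflects the convention described in the text.

```python
def toy_wordpiece(word, vocab):
    """Greedy longest-match-first segmentation; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches, so the whole word maps to the unknown token.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def first_subtoken_mask(words, tokenize):
    """Tokenize each word and flag which subtokens are the first piece of a word."""
    subtokens, is_first = [], []
    for word in words:
        pieces = tokenize(word)
        subtokens.extend(pieces)
        is_first.extend([True] + [False] * (len(pieces) - 1))
    return subtokens, is_first

# Hypothetical vocabulary and input words.
vocab = {"nb", "dop", "##ed", "ti", "##o", "##2"}
words = ["nb", "doped", "tio2"]
subtokens, is_first = first_subtoken_mask(words, lambda w: toy_wordpiece(w, vocab))

# Keep one prediction per word by selecting the flagged positions.
subtoken_predictions = ["S-DOPANT", "O", "O", "S-BASEMAT", "O", "O"]
word_predictions = [p for p, first in zip(subtoken_predictions, is_first) if first]
```

Selecting only the flagged positions yields exactly one label per original word, which keeps the predictions aligned with the word-level annotations.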

Results

In this section, model performances on the aforementioned datasets are reported along with model performance as a function of dataset size. An input sample consists of an entire paragraph from the dataset. The model classification performances are judged according to their achieved precision, recall, and F1-scores using the “micro” averaging scheme to accurately reflect the class imbalances in the datasets. In all experiments, the set of parameters at the end of an epoch that results in the best validation F1-score are evaluated on the test set. In all experiments, training was carried out for 64 epochs for the BiLSTM model and five epochs for the BERT, SciBERT, and MatBERT models. We reiterate that the only difference between the BERT models considered here is the choice of pre-training corpus.

In Figure 4, the performances of the models on the considered datasets are shown. Each point on the scatterplot depicts the 95% CI (assuming a normal distribution) across 16 seeds for the chosen metric, model, and dataset. The precision is the ratio of correctly predicted entities to all predicted entities, and the recall is the ratio of correctly predicted entities to all true entities. The F1-score is the harmonic mean of the precision and recall.
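The metric definitions above can be made concrete with a small sketch. Representing entities as (start, end, type) tuples and counting a prediction as correct only on an exact match of all three fields are assumptions made for this example; the function name and the toy data are hypothetical.

```python
def micro_prf(true_entities, pred_entities):
    """Micro-averaged precision, recall, and F1 over per-sample sets of entities."""
    tp = fp = fn = 0
    for true_set, pred_set in zip(true_entities, pred_entities):
        tp += len(true_set & pred_set)   # correctly predicted entities
        fp += len(pred_set - true_set)   # predicted entities that are not true
        fn += len(true_set - pred_set)   # true entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Two hypothetical samples, each a set of (start, end, type) entities.
true_entities = [{(0, 2, "MAT"), (5, 6, "PRO")}, {(1, 2, "MAT")}]
pred_entities = [{(0, 2, "MAT"), (4, 6, "PRO")}, {(1, 2, "MAT"), (3, 4, "APL")}]
precision, recall, f1 = micro_prf(true_entities, pred_entities)
```

Here two of the four predicted entities are correct and two of the three true entities are recovered, so the precision is 1/2, the recall is 2/3, and their harmonic mean is 4/7. Micro averaging pools the counts over all entity types before the ratios are taken, so frequent entity types weigh more heavily than rare ones.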

Figure 4. NER model precisions, recalls, and F1-scores

Scatterplot summaries of the precisions, recalls, and F1-scores achieved by BiLSTM, BERT, SciBERT, and MatBERT model predictions with respect to the true labels on the solid-state dataset (A), doping dataset (B), and gold nanoparticle dataset (C).

In Figure 4A, it is shown that the MatBERT and SciBERT models perform better than the BERT and BiLSTM models (within statistical error as shown by the CIs) on the solid-state set as determined by the F1-score. For precision, recall, and F1-score, the MatBERT model performs slightly better than the SciBERT model. Interestingly, although the BERT and BiLSTM models achieve very similar F1-scores, there is actually a trade-off between the precision and recall with the models, as the BiLSTM model achieves higher precision, whereas the BERT model achieves higher recall. This means that the BiLSTM model is less susceptible to predicting false positives, while the BERT model is less susceptible to predicting false negatives. The precision and recall are much closer in value for the BERT model than for the BiLSTM model.

Furthermore, in Figure 4B, the same metrics for the doping dataset are shown. Once again, the MatBERT and SciBERT models perform better than the BERT and BiLSTM models. Additionally, the MatBERT model once again demonstrates better performance than the SciBERT model for precision, recall, and F1-score. Compared to the BERT model, the BiLSTM model achieves slightly higher precision (0.71±0.03 versus 0.70±0.02). The respective performances are nearly identical for the recall (0.68±0.03) and F1-score (0.69±0.02). However, the CIs are slightly wider for the BiLSTM model.

Finally, in Figure 4C, the same metrics are once again shown for the gold nanoparticle dataset. The MatBERT model again achieves a higher F1-score than the other models, but for this dataset, the BiLSTM model and the SciBERT model achieve a similar F1-score with the BERT model trailing behind. For the recall, it can be seen that the BERT model performs significantly worse than the other models, with the MatBERT model achieving the best performance followed by the BiLSTM model and then the SciBERT model in turn. For the precision, all of the models perform similarly, with the BERT model actually achieving the best performance, followed by the MatBERT model and then the SciBERT model with the BiLSTM model trailing.

Figure 5 shows a heatmap of the entity-wise average F1-scores attained for each model across the datasets. The highest score for each entity is in bold. MatBERT claims the best performance for all entities except for one, DSC, where it only slightly lags behind SciBERT. SciBERT then claims the second-best performance for the rest of the entities aside from DES, which the BiLSTM instead claims. Between the BiLSTM and the original BERT, the BiLSTM generally performs better across the entities, only performing much worse compared to BERT for DOPMODQ, slightly trailing behind BERT for the APL, PRO, SMT, and DOPANT entities and performing much better for the solid-state SPL, doping BASEMAT, DES, and MOR entities. Of particular interest is the very poor score of zero obtained by BERT on the DES entity, which was caused by the failure to predict any entities. Since SciBERT also scored poorly on the DES entity (0.29), with the BiLSTM (0.53) and MatBERT (0.67) models significantly outperforming BERT and SciBERT, this would suggest that the domain-specific pre-training is important to DES entity recognition performance.

Figure 5. NER model entity score heatmap

A heatmap of entity-wise average F1-scores with the best score for each entity in bold.

Generally, the models tend to consistently perform better or worse on the same entities. All of the models tended to perform the poorest on the doping BASEMAT, DOPMODQ, and DES entities and the best on the DSC and MAT entities. There are some exceptions, however, with BERT performing relatively poorly on the SPL and MOR entities despite very good performances from the other models. The model performances on the DES entity vary far more than on the other entities, with very large performance gaps between the models.

To study the effect of the number of training examples on model performance, we plot learning curves for each model on each dataset in Figure 6. Curating and annotating even modestly sized datasets can entail considerable effort from domain experts in physics, chemistry, and materials science due to the highly technical nature of many publications in those fields. This is in contrast to canonical NER tasks such as CoNLL-2003102 (a NER set used in the original BERT publication9) that aim to identify less technical entities such as organizations, people, or places. Thus, models that can perform well on small training datasets will be of interest to domain experts looking to create structured technical datasets from text using NER.

Figure 6. NER model learning curves

Learning curves for fine-tuning NER models on the solid-state dataset (A), doping dataset (B), and gold nanoparticle dataset (C). The micro-averaged F1-score on the test set (which is always the same 10% of the total data) is depicted. The smallest training set is 10% of the total data, and the training set size is incremented in steps of 5% up to 80%.

In Figure 6, we observe that MatBERT and SciBERT exhibit large performance improvements over BERT at low numbers of training samples, in particular with fewer than 200 samples for the solid-state dataset and fewer than 50 samples for the gold nanoparticle dataset. The BiLSTM model exhibits the best performance as the training set size approaches zero but asymptotically approaches a lower limit than the SciBERT and MatBERT models as the number of training samples increases. On the solid-state dataset, the larger number of annotated examples allows BERT to close the gap in F1-score, so that the CIs overlap at 400 samples and are indistinguishable at 600 samples. Unlike the SciBERT and MatBERT models, however, BERT does not exceed the BiLSTM performance at any of the training sample intervals for any task. This is not to imply that BERT is approaching the same limit as the BiLSTM; rather, we expect that as the number of training samples is further increased, the general BERT model will reach or exceed the BiLSTM due to its much more complex architecture, as seen with the solid-state dataset (though this is less clear for the two smaller datasets). Determining whether adding more NER training data for any one task will outweigh the effects of domain-specific pre-training (that is, whether the general BERT model will overlap with SciBERT or MatBERT) requires further investigation with larger numbers of annotated technical text samples. Generally, we observe that more specific pre-training results in increased performance for BERT-derived models at every training set size, particularly at small training set sizes, and by substantial margins (e.g., a 0.05 micro F1-score improvement of MatBERT over general BERT at 320 solid-state training samples).
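The sampling scheme behind such learning curves (a fixed 10% test set and training sets growing from 10% to 80% of the total data in 5% steps) can be sketched as follows. This is an illustration only: the function name, the seeded shuffle, the choice to make each training set a superset of the previous one, and the dataset size of 800 are assumptions and are not taken from the training code used in this work.

```python
import random

def learning_curve_splits(n_samples, test_pct=10, train_pcts=range(10, 85, 5), seed=0):
    """Return a fixed test split and a list of (percent, training indices) pairs.

    Percentages are relative to the total number of samples.
    Nested training subsets and the seeded shuffle are assumptions of this sketch.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = n_samples * test_pct // 100
    test_idx = indices[:n_test]   # held out once and reused at every training size
    pool = indices[n_test:]       # everything else is available for training
    splits = []
    for pct in train_pcts:
        n_train = n_samples * pct // 100
        splits.append((pct, pool[:n_train]))
    return test_idx, splits

# Hypothetical dataset size, chosen only for illustration
test_idx, splits = learning_curve_splits(800)
sizes = [(pct, len(train_idx)) for pct, train_idx in splits]
```

Each model would then be fine-tuned once per entry in `splits` and scored on the same `test_idx`, so that differences between points on a curve reflect only the amount of training data.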

Another contributing factor to the difference in performance is class support (the number of labels in the testing dataset for a given class). Figure 7 illustrates the disparity among entities’ F1-scores by class support for each of the three datasets. As expected, classes with higher support generally have higher F1-scores, and classes with low support stratify according to the level of pre-training. We would intuitively expect MatBERT to perform much better on rarely mentioned entities than BERT given its higher exposure to materials-related text during pre-training. This can be readily seen with the DES entity and DOPMODQ entity, in which model performances likely suffer from very low support (respectively 10 and 20). For the DES entity, which has the lowest support, the models pre-trained on materials-related text perform significantly better than those pre-trained on general scientific text or just general text. However, the large degree of stratification among BERT models for entities with higher support is of note. Particularly for the PRO entity (e.g., “Voigt-Reuss-Hill average bulk moduli”) with a relatively large level of support (700 samples), MatBERT and SciBERT make substantial 0.03 and 0.04 F1-score improvements over BERT. This improvement may imply that highly specialized entities, such as materials science properties that do not appear frequently in general corpora but appear frequently in domain-specific corpora, benefit the most from more specialized pre-training even when there are relatively many samples for fine-tuning. For entities that are more commonly mentioned in general text corpora, such as MOR (e.g., “particles,” “rods,” “spheres”), DOPMODQ (e.g., “3%”), and DSC (e.g., “crystalline,” “amorphous,” “powder”), the level of pre-training appears less important at every level of support.
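As a simple arithmetic illustration of why scores on low-support classes are volatile, the sketch below counts class support from a list of gold labels and computes how much recall changes when a single additional entity is missed. The helper names and the toy label list are hypothetical; the three support values are the approximate ones quoted in the text.

```python
from collections import Counter

def class_support(gold_labels):
    """Count how many gold entities carry each class label."""
    return Counter(gold_labels)

def recall_step(support):
    """Change in recall caused by one additional missed entity of a class."""
    return 1.0 / support

# Toy gold labels, invented for illustration only
toy_support = class_support(["MAT", "MAT", "PRO", "DES", "MAT"])

# Approximate supports quoted in the text for three entity classes
quoted_support = {"DES": 10, "DOPMODQ": 20, "PRO": 700}
steps = {label: recall_step(n) for label, n in quoted_support.items()}
```

With a support of 10, each missed entity costs 0.1 in recall, whereas with a support of 700 the cost is below 0.002, so a handful of errors can separate models widely on the rarest classes.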

Figure 7. NER model entity scores as a function of support

Entity score stratified by label count (support) for each of the datasets. Support varies from model to model due to tokenizer differences that result in different truncations of the input, possibly cutting off some entities. The BiLSTM model imposes no token restriction, while the BERT models are restricted to 512 tokens, with the rest being truncated. Furthermore, different BERT tokenizers can result in a different token count, changing the truncation from model to model.
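The effect of truncation on support can be illustrated with a toy calculation. In the sketch below, the function name, the word-level labels, the reservation of two positions for special tokens, and the two hypothetical tokenizers (one producing one subword piece per word, the other two) are all assumptions made for illustration; no real tokenizer is used.

```python
def surviving_support(labels, pieces_per_word, max_len=512, n_special=2):
    """Count labeled words that still fit after truncation to max_len tokens.

    labels: one label per word ("O" marks a non-entity word).
    pieces_per_word: number of subword pieces each word is split into
        (a stand-in for the output of a real tokenizer).
    n_special: positions reserved for special tokens (assumed to be 2 here).
    """
    budget = max_len - n_special
    used = 0
    kept = 0
    for label, n_pieces in zip(labels, pieces_per_word):
        used += n_pieces
        if used > budget:
            break  # this word and everything after it is truncated
        if label != "O":
            kept += 1
    return kept

# Toy document of 300 words in which every 30th word carries an entity label
labels = ["DES" if i % 30 == 0 else "O" for i in range(300)]

coarse = surviving_support(labels, [1] * 300)  # hypothetical tokenizer: 1 piece per word
fine = surviving_support(labels, [2] * 300)    # hypothetical tokenizer: 2 pieces per word
```

Here the coarser tokenizer keeps all 10 labeled words, while the finer one exhausts the token budget earlier and keeps only 9, which is the kind of model-to-model difference in support described in the caption.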

Discussion

Whether domain-specific pre-training is needed for large transformer models remains an open question in the field of NLP. Although large models trained on massive general-purpose corpora are complex enough to be fine-tuned for various downstream tasks (question/answer, NSP, NER) rather than retrained from scratch at great expense, our results show evidence that domain-specific pre-training can measurably improve F1-score performance in the domain of materials science. The overall best performance of MatBERT across the three materials science datasets corroborates a growing body of evidence that domain-specific pre-training is not merely a trivial improvement over generally pre-trained models but is indeed worth the effort of retraining large models like BERT. For instance, BioBERT [18] demonstrated as much as a 2.8% F1-score improvement over BERT in the biomedical domain; similarly, both CaseHOLD [19] (legal corpora) and FinBERT [20] (financial corpora) yield improvements over base BERT in their respective domains’ downstream tasks. The word distribution shift from a general-purpose corpus to an exclusively technical corpus is large enough to encourage full retraining of large transformer models.

Our results raise the question: How specialized should a pre-training corpus be so that it is both highly performant within a domain of knowledge and general enough to address a variety of NER problems within that domain? Although MatBERT improves on BiLSTM, SciBERT, and BERT for all but the smallest training set sizes, the MatBERT model we introduce is limited by the distribution of pre-training data. As detailed in methods, pre-training data were taken from a general materials science corpus [2]. However, as shown by the most frequent title keywords in Figure 8, this corpus is designed to be biased toward trending materials science topics describing experimental syntheses. For example, paragraphs from full texts tend to favor popular compounds (such as oxides, energy materials, and magnetic materials) or synthesis techniques (such as conventional solid-state or hydrothermal synthesis). The MatBERT pre-training corpus, therefore, puts less weight on computational papers containing density functional theory results, theoretical but yet-to-be-synthesized stoichiometries, and unusual but important phase labels. Thus, MatBERT may be improved by expanding the pre-training corpus beyond the set compiled in Kononova et al. [2]. The goal in selecting a pre-training corpus should be to strike a balance between the specificity needed to capture particular facets of materials science and transferability between disparate fields within materials science. Exploring other methods to sample the materials science literature for the purposes of model training is one possible avenue for future work.

Figure 8. MatBERT keywords

Most frequent keywords appearing in the titles of the pre-training corpus of the MatBERT model.

Conclusions

As seen in the presented results and ensuing discussion, the MatBERT model achieves the best overall performance out of the considered models. The 1%∼4% F1-score improvement over SciBERT demonstrates that domain-specific pre-training provides a measurable advantage for NER in materials science. Furthermore, SciBERT improving upon BERT by 3%∼9% F1-score reinforces the importance of scientific pre-training in general for materials science text. Interestingly, it was even found that a comparatively simple BiLSTM model enhanced with embeddings pre-trained on materials science text provides better overall performance than the original BERT model. This suggests that pre-training on a domain-specific corpus can be more impactful on performance than employing modern large transformer-based models. Learning curves additionally show that in the low data limit, the BiLSTM outperforms the BERT models, albeit still with poor overall performance due to the lack of data. For larger datasets, though, MatBERT provides a definitive improvement in NER predictions that can be expected to accelerate the construction of structured materials science datasets.

Experimental procedures

Resource availability

Lead contact

Requests for additional information should be directed to the lead contact, Nicholas Walker (walkernr@lbl.gov).

Materials availability

This study did not generate physical materials.

Data and code availability

The pre-trained MatBERT model as well as the trained MatBERT NER models are publicly available at https://figshare.com/articles/software/MatBERT-NER_models/15087276 [97]. The code used to pre-train MatBERT is publicly available at https://github.com/lbnlp/MatBERT [96]. The code used to train MatBERT NER is publicly available at https://github.com/CederGroupHub/MatBERT_NER [101]. The DOIs of the articles used for the new datasets alongside the associated extracted entities are publicly available at https://figshare.com/articles/dataset/NER_Datasets_DOIs_and_Entities_Doping_and_AuNP_/16864357 [103].

Acknowledgments

This work was funded by Toyota Research Institute through the Accelerated Materials Design and Discovery program. Secondary funding to develop the gold nanoparticle dataset as well as MatBERT pre-training was provided for this work by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division under Contract No. DE-AC02-05CH11231 (D2S2 program KCD2S2). This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231. This work also used the Extreme Science and Engineering Discovery Environment (XSEDE) GPU resources, specifically the Bridges-2 supercomputer at the Pittsburgh Supercomputing Center, through allocation TG-DMR970008S.

Author contributions

A.J., G.C., and K.A.P. supervised the research. J.D. wrote the data collection infrastructure and performed the data collection. S.L., N.W., A.D., and J.D. annotated the doping dataset. K.C. annotated the gold nanoparticle dataset. H.H. wrote the MatBERT pre-training code and performed the pre-training. N.W., A.T., K.C., S.L., and A.D. wrote the MatBERT NER training code. N.W. wrote the BiLSTM NER training code. N.W. performed the NER experiments and prepared the results. N.W., A.D., and H.H. prepared the figures. All authors contributed to the discussion and writing of the manuscript.

Declaration of interests

The authors declare no competing interests.

Published: April 8, 2022

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.patter.2022.100488.

Contributor Information

Nicholas Walker, Email: walkernr@lbl.gov.

Anubhav Jain, Email: ajain@lbl.gov.

Supplemental information

Document S1. Supplemental experimental procedures and Figures S1–S5
Document S2. Article plus supplemental information

References

  • 1. Kononova O., He T., Huo H., Trewartha A., Olivetti E.A., Ceder G. Opportunities and challenges of text mining in materials research. iScience. 2021;24:102155. doi: 10.1016/j.isci.2021.102155.
  • 2. Kononova O., Huo H., He T., Rong Z., Botari T., Sun W., Tshitoyan V., Ceder G. Text-mined dataset of inorganic materials synthesis recipes. Sci. Data. 2019;6:203. doi: 10.1038/s41597-019-0224-1.
  • 3. Olivetti E.A., Cole J.M., Kim E., Kononova O., Ceder G., Han T.Y.-J., Hiszpanski A.M. Data-driven materials research enabled by natural language processing and information extraction. Appl. Phys. Rev. 2020;7:041317. doi: 10.1063/5.0021106.
  • 4. Krallinger M., Rabal O., Leitner F., Vazquez M., Salgado D., Lu Z., Leaman R., Lu Y., Ji D., Lowe D.M., et al. The chemdner corpus of chemicals and drugs and its annotation principles. J. Cheminf. 2015;7:S2. doi: 10.1186/1758-2946-7-S1-S2.
  • 5. Gurulingappa H., Mudi A., Toldo L., Hofmann-Apitius M., Bhate J. Challenges in mining the literature for chemical information. RSC Adv. 2013;3:16194. doi: 10.1039/c3ra40787j.
  • 6. Li J., Han P., Ren X., Hu J., Chen L., Shang S. Sequence labeling with meta-learning. IEEE Trans. Knowl. Data Eng. 2021:1. doi: 10.1109/TKDE.2021.3118469.
  • 7. Li J., Chiu B., Feng S., Wang H. Few-shot named entity recognition via meta-learning. IEEE Trans. Knowl. Data Eng. 2020:1. doi: 10.1109/TKDE.2020.3038670.
  • 8. Li J., Shang S., Chen L. Domain generalization for named entity boundary detection via metalearning. IEEE Trans. Neural Networks Learn. Syst. 2021;32:3819–3830. doi: 10.1109/TNNLS.2020.3015912.
  • 9. Devlin J., Chang M.-W., Lee K., Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. Preprint at arXiv. 2019 1810.04805.
  • 10. Wang A., Singh A., Michael J., Hill F., Levy O., Bowman S.R. GLUE: a multi-task benchmark and analysis platform for natural language understanding. Preprint at arXiv. 2019 1804.07461.
  • 11. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser L., Polosukhin I. Attention Is All You Need. Preprint at arXiv. 2017 1706.03762.
  • 12. Howard J., Ruder S. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Association for Computational Linguistics; Melbourne, Australia: 2018. Universal language model fine-tuning for text classification; pp. 328–339.
  • 13. Peters M.E., Neumann M., Iyyer M., Gardner M., Clark C., Lee K., Zettlemoyer L. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) Association for Computational Linguistics; New Orleans, Louisiana: 2018. Deep contextualized word representations; pp. 2227–2237.
  • 14. McCann B., Bradbury J., Xiong C., Socher R. Learned in translation: contextualized word vectors. Preprint at arXiv. 2018 1708.00107.
  • 15. Conneau A., Kiela D., Schwenk H., Barrault L., Bordes A. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; Copenhagen, Denmark: 2017. Supervised learning of universal sentence representations from natural language inference data; pp. 670–680.
  • 16. Zhang K., Bowman S. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics; Brussels, Belgium: 2018. Language modeling teaches you more than translation does: lessons learned through auxiliary syntactic task analysis; pp. 359–361.
  • 17. Brown T.B., Mann B., Ryder N., Subbiah M., Kaplan J., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A., et al. Language models are few-shot learners. Preprint at arXiv. 2020 2005.14165.
  • 18. Lee J., Yoon W., Kim S., Kim D., Kim S., So C.H., Kang J. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2019;36:4. doi: 10.1093/bioinformatics/btz682.
  • 19. Zheng L., Guha N., Anderson B.R., Henderson P., Ho D.E. When does pretraining help? assessing self-supervised learning for law and the casehold dataset. Preprint at arXiv. 2021 2104.08671.
  • 20. Araci D. Finbert: financial sentiment analysis with pre-trained language models. Preprint at arXiv. 2019 1908.10063.
  • 21. Beltagy I., Lo K., Cohan A. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) Association for Computational Linguistics; Hong Kong, China: 2019. SciBERT: a pretrained language model for scientific text; pp. 3615–3620.
  • 22. Li J., Sun A., Han J., Li C. A survey on deep learning for named entity recognition. Preprint at arXiv. 2020 1812.09449.
  • 23. Eltyeb S., Salim N. Chemical named entities recognition: a review on approaches and applications. J. Cheminf. 2014;6:17. doi: 10.1186/1758-2946-6-17.
  • 24. Corbett P., Boyle J. Chemlistem: chemical named entity recognition using recurrent neural networks. J. Cheminf. 2018;10:59. doi: 10.1186/s13321-018-0313-8.
  • 25. Liang Z., Chen J., Xu Z., Chen Y., Hao T. A pattern-based method for medical entity recognition from Chinese diagnostic imaging text. Front. Artif. Intelligence. 2019;2:1. doi: 10.3389/frai.2019.00001.
  • 26. Sniegula A., Poniszewska-Maranda A., Chomatek L. Study of named entity recognition methods in biomedical field. Proced. Comp. Sci. 2019;160:260–265. doi: 10.1016/j.procs.2019.09.466.
  • 27. Kanakarajan K.r., Kundumani B., Sankarasubbu M. Proceedings of the 20th Workshop on Biomedical Language Processing. Association for Computational Linguistics; 2021. BioELECTRA:pretrained biomedical text encoder using discriminators; pp. 143–154.
  • 28. Weston L., Tshitoyan V., Dagdelen J., Kononova O., Trewartha A., Persson K., Ceder G., Jain A. Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J. Chem. Inf. Model. 2019;59:3692–3702. doi: 10.1021/acs.jcim.9b00470.
  • 29. He T., Sun W., Huo H., Kononova O., Rong Z., Tshitoyan V., Botari T., Ceder G. Similarity of precursors in solid-state synthesis as text-mined from scientific literature. Chem. Mater. 2020;32:7861–7873. doi: 10.1021/acs.chemmater.0c02553.
  • 30. Hatakeyama-Sato K., Oyaizu K. Integrating multiple materials science projects in a single neural network. Commun. Mater. 2020;1:49. doi: 10.1038/s43246-020-00052-8.
  • 31. Dieb T., Yoshioka M., Hara S., Newton M. Framework for automatic information extraction from research papers on nanocrystal devices. Beilstein J. Nanotechnol. 2015;6:1872–1882. doi: 10.3762/bjnano.6.190.
  • 32. Gaultois M., Sparks T., Borg C., Seshadri R., Bonificio W., Clarke D. Data-driven review of thermoelectric materials: performance and resource considerations. Chem. Mater. 2013;25:2911–2920. doi: 10.1021/cm400893e.
  • 33. Pang N., Qian L., Lyu W., Yang J.-D. Transfer learning for scientific data chain extraction in small chemical corpus with bert-crf model. Preprint at arXiv. 2019 1905.05615.
  • 34. Corbett P., Copestake A. Cascaded classifiers for confidence-based chemical named entity recognition. BMC Bioinf. 2008;9:S4. doi: 10.1186/1471-2105-9-S11-S4.
  • 35. Krallinger M., Rabal O., Lourenço A., Oyarzabal J., Valencia A. Information retrieval and text mining technologies for chemistry. Chem. Rev. 2017;117:7673–7761. doi: 10.1021/acs.chemrev.6b00851.
  • 36. Rocktäschel T., Weidlich M., Leser U. Chemspot: a hybrid system for chemical named entity recognition. Bioinformatics. 2012;28:1633–1640. doi: 10.1093/bioinformatics/bts183.
  • 37. Leaman R., Wei C.-H., Lu Z. tmchem: a high performance approach for chemical named entity recognition and normalization. J. Cheminf. 2015;7:S3. doi: 10.1186/1758-2946-7-S1-S3.
  • 38. Korvigo I., Holmatov M., Zaikovskii A., Skoblov M. Putting hands to rest: efficient deep cnn-rnn architecture for chemical named entity recognition with no hand-crafted rules. J. Cheminf. 2018;10:28. doi: 10.1186/s13321-018-0280-0.
  • 39. García-Remesal M., García-Ruiz A., Pérez-Rey D., De La Iglesia D., Maojo V. Using nanoinformatics methods for automatically identifying relevant nanotoxicology entities from the literature. Biomed. Res. Int. 2013;2013:410294. doi: 10.1155/2013/410294.
  • 40. Kononova O., He T., Huo H., Trewartha A., Olivetti E.A., Ceder G. Opportunities and challenges of text mining in materials research. iScience. 2021;24:102155. doi: 10.1016/j.isci.2021.102155.
  • 41. Fischer C.C., Tibbetts K.J., Morgan D., Ceder G. Predicting crystal structure by merging data mining with quantum mechanics. Nat. Mater. 2006;5:641–646. doi: 10.1038/nmat1691.
  • 42. Young S.R., Maksov A., Ziatdinov M., Cao Y., Burch M., Balachandran J., Li L., Somnath S., Patton R.M., Kalinin S.V., et al. Data mining for better material synthesis: the case of pulsed laser deposition of complex oxides. J. Appl. Phys. 2018;123:115303. doi: 10.1063/1.5009942.
  • 43. Alperin B., Kuzmin A., Ilina L., Gusev V., Salomatina N., Parmon V. Terminology spectrum analysis of natural-language chemical documents: term-like phrases retrieval routine. J. Cheminf. 2016;8:22. doi: 10.1186/s13321-016-0136-4.
  • 44. Court C., Cole J.M. Auto-generated materials database of curie and néel temperatures via semi-supervised relationship extraction. Sci. Data. 2018;5:180111. doi: 10.1038/sdata.2018.111.
  • 45. Court C., Cole J. Magnetic and superconducting phase diagrams and transition temperatures predicted using text mining and machine learning. Npj Comput. Mater. 2020;6:1–9. doi: 10.1038/s41524-020-0287-8.
  • 46. Jessop D.M., Adams S.E., Willighagen E.L., Hawizy L., Murray-Rust P. Oscar4: a flexible architecture for chemical text-mining. J. Cheminf. 2011;3:41. doi: 10.1186/1758-2946-3-41.
  • 47. Hawizy L., Jessop D.M., Adams N., Murray-Rust P. Chemicaltagger: a tool for semantic text-mining in chemistry. J. Cheminf. 2011;3:1–13. doi: 10.1186/1758-2946-3-17.
  • 48. Kolářik C., Klinger R., Friedrich C.M., Hofmann-Apitius M., Fluck J. Workshop on Building and evaluating resources for biomedical text mining. 2008. Chemical names: terminological resources and corpora annotation; pp. 51–58.
  • 49. Mysore S., Jensen Z., Kim E., Huang K., Chang H.-S., Strubell E., Flanigan J., McCallum A., Olivetti E. The materials science procedural text corpus: annotating materials synthesis procedures with shallow semantic structures, LAW 2019 - 13th Linguistic Annotation Workshop. Proc. Workshop. 2019:56–64. arXiv:1905.06939.
  • 50. Kuniyoshi F., Makino K., Ozawa J., Miwa M. Annotating and extracting synthesis process of all-solid-state batteries from scientific literature. Preprint at arXiv. 2020 2002.07339.
  • 51. Jensen Z., Kim E., Kwon S., Gani T., Roman-Leshkov Y., Moliner M., Corma A., Olivetti E. A machine learning approach to zeolite synthesis enabled by automatic literature data extraction. ACS Cent. Sci. 2019;5:892–899. doi: 10.1021/acscentsci.9b00193.
  • 52. Kim E., Huang K., Saunders A., McCallum A., Ceder G., Olivetti E. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 2017;29:9436–9444. doi: 10.1021/acs.chemmater.7b03500.
  • 53. Kim E., Jensen Z., van Grootel A., Huang K., Staib M., Mysore S., Chang H.S., Strubell E., McCallum A., Jegelka S. Inorganic materials synthesis planning with literature-trained neural networks. J. Chem. Inf. Model. 2020;60:1194–1201. doi: 10.1021/acs.jcim.9b00995.
  • 54. Mysore S., Kim E., Strubell E., Liu A., Chang H.-S., Kompella S., Huang K., McCallum A., Olivetti E. Automatically extracting action graphs from materials science synthesis procedures. Preprint at arXiv. 2017 1711.06872.
  • 55. Vaucher A., Zipoli F., Geluykens J., Nair V., Schwaller P., Laino T. Automated extraction of chemical synthesis actions from experimental procedures. Nat. Commun. 2020;11:3601. doi: 10.1038/s41467-020-17266-6.
  • 56. Tehseen I., Tahir G., Shakeel K., Ali M. In: Artificial Intelligence Applications and Innovations. Iliadis L., Maglogiannis I., Plagianakos V., editors. Springer International Publishing; Cham: 2018. Corpus based machine translation for scientific text; pp. 196–206.
  • 57. Hiszpanski A., Gallagher B., Chellappan K., Li P., Liu S., Kim H., Kailkhura B., Han J., Buttler D., Han T.-J. Nanomaterials synthesis insights from machine learning of scientific articles by extracting, structuring, and visualizing knowledge. J. Chem. Inf. Model. 2020;60:2876–2887. doi: 10.1021/acs.jcim.0c00199.
  • 58. Kim J.-D., Ohta T., Tateisi Y., Tsujii J. Genia corpus – a semantically annotated corpus for bio-textmining. Bioinformatics. 2003;19:i180–i182. doi: 10.1093/bioinformatics/btg1023.
  • 59. Milosevic N., Gregson C., Hernandez R., Nenadic G. A framework for information extraction from tables in biomedical literature. IJDAR. 2019;22:55–78. doi: 10.1007/s10032-019-00317-0.
  • 60. Huo H., Rong Z., Kononova O., Sun W., Botari T., He T., Tshitoyan V., Ceder G. Semi-supervised machine-learning classification of materials synthesis procedures. Npj Comput. Mater. 2019;5:1–7. doi: 10.1038/s41524-019-0204-1.
  • 61. Rasmy L., Xiang Y., Xie Z., Tao C., Zhi D. Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction. Preprint at arXiv. 2020 doi: 10.1038/s41746-021-00455-y. 2005.12833.
  • 62. Friedrich A., Adel H., Tomazic F., Hingerl J., Benteau R., Marusczyk A., Lange L. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020. The SOFC-exp corpus and neural approaches to information extraction in the materials science domain; pp. 1255–1268.
  • 63. Solid State Abstract Annotations. 2019. https://figshare.com/articles/dataset/Materials_Science_Named_Entity_Recognition_train_development_test_sets/8184428
  • 64. Doping and AuNP NER DOIs and Entities. 2022. https://figshare.com/articles/dataset/NER_Datasets_DOIs_and_Entities_Doping_and_AuNP_/16864357
  • 65. Elsevier scopus. 2022. https://dev.elsevier.com/
  • 66. Springer-nature. 2022. https://dev.springernature.com/
  • 67. Royal society of chemistry. 2022. https://rsc.org/
  • 68. Electrochemical society. 2022. https://electrochem.org/
  • 69. Baek M., Park S., Choi D. Synthesis of zirconia (zro2) nanowires via chemical vapor deposition. J. Cryst. Growth. 2017;459:198–202. doi: 10.1016/j.jcrysgro.2016.12.033.
  • 70. Matscholar. 2022. https://matscholar.com/
  • 71. Tang T.-P., Yang M.-R., Chen K.-S. Photoluminescence of zns: Sm phosphor prepared in a reductive atmosphere. Ceramics Int. 2000;26:153–158. doi: 10.1016/S0272-8842(99)00034-6.
  • 72. Dykman L.A., Khlebtsov N.G. Gold nanoparticles in biology and medicine: recent advances and prospects. Acta Naturae. 2011;3:34–55. https://pubmed.ncbi.nlm.nih.gov/22649683
  • 73. Huang X., El-Sayed M.A. Gold nanoparticles: optical properties and implementations in cancer diagnosis and photothermal therapy. J. Adv. Res. 2010;1:13–28. doi: 10.1016/j.jare.2010.02.002.
  • 74. Sandeep K., Manoj B., Thomas K.G. Gold nanoparticle on semiconductor quantum dot: do surface ligands influence fermi level equilibration. J. Chem. Phys. 2020;152:044710. doi: 10.1063/1.5138216.
  • 75. Lau M., Ziefuss A., Komossa T., Barcikowski S. Inclusion of supported gold nanoparticles into their semiconductor support. Phys. Chem. Chem. Phys. 2015;17:29311–29318. doi: 10.1039/C5CP04296H.
  • 76. Kaul S., Gulati N., Verma D., Mukherjee S., Nagaich U. Role of nanotechnology in cosmeceuticals: a review of recent advances. J. Pharm. 2018;2018:3420204. doi: 10.1155/2018/3420204.
  • 77. Dong Y.C., Hajfathalian M., Maidment P.S.N., Hsu J.C., Naha P.C., Si-Mohamed S., Breuilly M., Kim J., Chhour P., Douek P., et al. Effect of gold nanoparticle size on their properties as contrast agents for computed tomography. Sci. Rep. 2019;9:14912. doi: 10.1038/s41598-019-50332-8.
  • 78. Ng S.A., Razak K.A., Aziz A.A., Cheong K.Y. The effect of size and shape of gold nanoparticles on thin film properties. J. Exp. Nanosci. 2014;9:64–77. doi: 10.1080/17458080.2013.813651.
  • 79. Kaur R., Pal B. Physicochemical and catalytic properties of au nanorods micro-assembled in solvents of varying dipole moment and refractive index. Mater. Res. Bull. 2015;62:11–18. doi: 10.1016/j.materresbull.2014.11.012.
  • 80. Swain M.C., Cole J.M. Chemdataextractor: a toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 2016;56:1894–1904. doi: 10.1021/acs.jcim.6b00207.
  • 81.Schuster M., Nakajima K. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2012. Japanese and Korean voice search; pp. 5149–5152. [DOI] [Google Scholar]
  • 82.Sennrich R., Haddow B., Birch A. Neural machine translation of rare words with subword units. Preprint at arXiv. 2016 1508.07909. [Google Scholar]
  • 83.Krishnan V., Ganapathy V. 2005. Named Entity Recognition. [Google Scholar]
  • 84.Alshammari N., Alanazi S. The impact of using different annotation schemes on named entity recognition. Egypt. Inform. J. 2020;22:295–302. doi: 10.1016/j.eij.2020.10.004. [DOI] [Google Scholar]
  • 85.Lafferty J., McCallum A., Pereira F. Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01. Morgan Kaufmann Publishers Inc.; San Francisco, CA, USA: 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data; pp. 282–289. [Google Scholar]
  • 86.Lample G., Ballesteros M., Subramanian S., Kawakami K., Dyer C. Neural architectures for named entity recognition. Preprint at arXiv. 2016 1603.01360. [Google Scholar]
  • 87.Huang W., Cheng X., Wang T., Chu W. Bert-based multi-head selection for joint entity-relation extraction. Preprint at arXiv. 2019 1908.05908. [Google Scholar]
  • 88.Souza F., Nogueira R., Lotufo R. Portuguese named entity recognition using bert-crf. Preprint at arXiv. 2020 1909.10649. [Google Scholar]
  • 89.Tshitoyan V., Dagdelen J., Weston L., Dunn A., Rong Z., Kononova O., Persson K.A., Ceder G., Jain A. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature. 2019;571:95–98. doi: 10.1038/s41586-019-1335-8. [DOI] [PubMed] [Google Scholar]
  • 90.Jozefowicz R., Vinyals O., Schuster M., Shazeer N., Wu Y. Exploring the limits of language modeling. Preprint at arXiv. 2016 1602.02410. [Google Scholar]
  • 91.Grankin M. over9000. 2019. https://github.com/mgrankin/over9000
  • 92.Liu L., Jiang H., He P., Chen W., Liu X., Gao J., Han J. On the variance of the adaptive learning rate and beyond. Preprint at arXiv. 2020 1908.03265. [Google Scholar]
  • 93.Zhang M.R., Lucas J., Hinton G., Ba J. Lookahead optimizer: k steps forward, 1 step back. Preprint at arXiv. 2019 1907.08610. [Google Scholar]
  • 94.Wright L. New Deep Learning Optimizer, Ranger: Synergistic Combination of RAdam + LookAhead for the Best of Both. 2019. https://lessw.medium.com/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d
  • 95.Efron B., Hastie T., Johnstone I., Tibshirani R. Least angle regression. Ann. Stat. 2004;32:407–499. doi: 10.1214/009053604000000067. [DOI] [Google Scholar]
  • 96.MatBERT. 2021. https://github.com/lbnlp/MatBERT
  • 97.MatBERT weights. 2022. https://figshare.com/articles/software/MatBERT-NER_models/15087276
  • 98.Wolf T., Debut L., Sanh V., Chaumond J., Delangue C., Moi A., Cistac P., Rault T., Louf R., Funtowicz M., et al. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics; 2020. Transformers: state-of-the-art natural language processing; pp. 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6 [Google Scholar]
  • 99.Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., et al. In: Wallach H., Larochelle H., Beygelzimer A., d'Alché-Buc F., Fox E., Garnett R., editors. Vol. 32. Curran Associates, Inc.; 2019. PyTorch: an imperative style, high-performance deep learning library; pp. 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf (Advances in Neural Information Processing Systems). [Google Scholar]
  • 100.You Y., Li J., Reddi S., Hseu J., Kumar S., Bhojanapalli S., Song X., Demmel J., Keutzer K., Hsieh C.-J. Large batch optimization for deep learning: training BERT in 76 minutes. Preprint at arXiv. 2020 1904.00962. [Google Scholar]
  • 101.MatBERT NER. 2022. https://zenodo.org/badge/latestdoi/315418846
  • 102.Tjong Kim Sang E.F., De Meulder F. 2003. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition, CoRR cs.CL/0306050. http://arxiv.org/abs/cs/0306050 [Google Scholar]
  • 103.Doping and AuNP NER DOIs. 2022. https://figshare.com/articles/dataset/NER_Datasets_DOIs/16569567

Associated Data

Supplementary Materials

Document S1. Supplemental experimental procedures and Figures S1–S5
mmc1.pdf (340.5KB, pdf)
Document S2. Article plus supplemental information
mmc2.pdf (2.6MB, pdf)

Data Availability Statement

The pre-trained MatBERT model and the trained MatBERT NER models are publicly available at https://figshare.com/articles/software/MatBERT-NER_models/15087276 (ref. 97). The code used to pre-train MatBERT is publicly available at https://github.com/lbnlp/MatBERT (ref. 96). The code used to train MatBERT NER is publicly available at https://github.com/CederGroupHub/MatBERT_NER (ref. 101). The DOIs of the articles used for the new datasets, together with the associated extracted entities, are publicly available as NER Datasets at https://figshare.com/articles/dataset/NER_Datasets_DOIs_and_Entities_Doping_and_AuNP_/16864357 (ref. 103).


Articles from Patterns are provided here courtesy of Elsevier