PLOS ONE. 2024 Oct 10;19(10):e0311161. doi: 10.1371/journal.pone.0311161

InstructNet: A novel approach for multi-label instruction classification through advanced deep learning

Tanjim Taharat Aurpa 1,2,*, Md Shoaib Ahmed 1,3, Md Mahbubur Rahman 1,4, Md Golam Moazzam 1
Editor: Iftikhar Ahmed Khan
PMCID: PMC11469596  PMID: 39388407

Abstract

People use search engines for various topics and items, from daily essentials to more aspirational and specialized objects, and search engines have therefore become people's preferred resource. The "How To" prefix has become a familiar and widely used search pattern for finding solutions to particular problems. Such a search lets people find sequential instructions that provide detailed guidelines for accomplishing specific tasks. Categorizing instructional text is also essential for task-oriented learning and for creating knowledge bases. This study uses "How To" articles to determine multi-label instruction categories. We work with a dataset comprising 11,121 observations from wikiHow, where each record has multiple categories. To predict the multi-label categories, we employ transformer-based deep neural architectures such as Generalized Autoregressive Pretraining for Language Understanding (XLNet) and Bidirectional Encoder Representations from Transformers (BERT). In our multi-label instruction classification process, we evaluate the proposed architectures using accuracy and macro F1-score as the performance metrics. This thorough evaluation revealed much about our strategy's strengths and drawbacks. Specifically, our implementation of the XLNet architecture demonstrated unprecedented performance, achieving an accuracy of 97.30% and micro and macro average F1 scores of 89.02% and 93%, a noteworthy accomplishment in multi-label classification. This high level of accuracy and macro average score is a testament to the effectiveness of the XLNet architecture in our proposed 'InstructNet' approach. By employing a multi-level strategy in our evaluation process, we have gained a more comprehensive understanding of the effectiveness of our proposed architectures and identified areas for future improvement and refinement.

1 Introduction

Instructions and guidelines are essential for any well-defined task, as they help anyone execute the work independently. Instructions or guidelines are generalized lists of steps that enable anyone to understand the working principle. Instead of asking others about work procedures, people now prefer to search for instructions on Google. Categorizing instructions efficiently is therefore necessary to ensure an efficient search experience. Furthermore, one instruction can belong to multiple categories. Implementing systems that can categorize instructions automatically could be revolutionary for search. Additionally, wikiHow articles are used for task-oriented learning to create knowledge bases, intelligent agents, chatbots, graphs, etc., where step-by-step guidelines are learned by these systems. During these types of automated intelligent implementations, classifying wikiHow articles and tagging them with multiple appropriate tags can be very beneficial. It can help answer questions such as "To which categories does the instructional text learned by the intelligent system belong?" or "Can one learned instruction be used in multiple contexts?" Nevertheless, no significant work has been conducted on multilabel instructional text classification, a gap that calls for substantial solutions.

This area has received little research attention, even though platforms such as wikiHow (https://www.wikihow.com/Main-Page) generate new instructions daily. WikiHow contains instructions in the "How to" style for a wide variety of tasks. The database is enriched with visual content for the instructions, all of which has gone through expert verification. This wiki-style website has 2.5 million registered users and had already featured more than 235,000 how-to articles as of December 2021. These articles are categorized into different classes, and some can even be labeled under multiple categories or tags. Automatically identifying those tags can make searching for guidelines more efficient. Research on wikiHow instructions has primarily focused on summarization tasks. Moreover, these instructions have been shown to contribute to procedure-oriented task learning and to creating knowledge bases, and many datasets have been created from wikiHow articles. Besides learning task-oriented steps and creating knowledge bases, it is important to learn appropriate tags for instructional texts. Classifying them is also necessary because people have limited time to accomplish any task: they google the procedure for the required task and follow it. Again, a single instruction can be classified into multiple categories.

Transformer architectures are now prevalent in natural language processing research. These models use an encoder-decoder approach, where the encoder encodes the input into a context vector and the decoder maps this context back to an output sequence. The self-attention mechanism in the transformer provides significant results and has contributed to the development of various Large Language Models (LLMs), such as BERT, ELECTRA, XLNet, etc. All of these models are pre-trained in a self-supervised fashion on a large amount of raw text; as a result, they show remarkable performance on downstream tasks. The transformer-based architecture BERT (Bidirectional Encoder Representations from Transformers) brought a revolution to research related to natural language processing. It trains the transformer bidirectionally and uses positional encoding for sequences. This model works in two steps: pretraining and fine-tuning. BERT uses two pretraining tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), and is known as an autoencoding model. These pre-trained models can then be fine-tuned for different downstream tasks such as classification, question answering [1, 2], etc. BERT has provided noteworthy performance in various NLP tasks such as text classification [3–5], offensive language detection [6], entity recognition and relation extraction [7, 8], and translation quality estimation [9, 10].

XLNet is a popular generalized autoregressive transformer that has shown strong performance in various NLP tasks, such as named entity recognition [11], sentiment analysis [12], and emotion detection [13, 14]. XLNet is an extended version of Transformer-XL, which removes the transformer's limitation on text sequence length. XLNet combines the concepts of autoencoding and autoregressive language models while avoiding their drawbacks. Bidirectional context and positional encoding are both maintained in the XLNet architecture. It uses a different attention mechanism, namely two-stream self-attention. Like BERT, XLNet also has two steps: pretraining and fine-tuning. It uses Permutation Language Modeling (PLM) as the pretraining task and considers all permutations of a sequence. It overcomes the limitations of BERT and shows significant performance on downstream tasks. XLNet outperformed BERT on twenty different natural language processing tasks.

Besides these, BERT variants such as RoBERTa, ALBERT, and DistilBERT are widely utilized in NLP tasks nowadays. Another transformer architecture, ELECTRA, has also been attracting researchers recently, in work such as sentiment/emotion analysis (e.g., [15, 16]), text mining (e.g., [17]), and fake news analysis (e.g., [18]). All of these transformer models are closely related to XLNet and BERT. We also apply them to our data to justify the state-of-the-art performance of our proposed method.

1.1 Research objective

In this research, we propose an autoregressive transformer-based architecture for multilabel instruction classification. To the best of our knowledge, this is the first work on multilabel instruction classification using wikiHow articles. Previously, we handled only single-label multiclass instructions, where each instruction could be assigned to exactly one category [5].

The main contributions of our work are as follows:

  • Our work introduces a groundbreaking transformer-based architecture, leveraging XLNet, to effectively categorize wikiHow articles for multilabel instructional text classification. This is the first-ever endeavor in multilabel instruction classification, marking a significant advancement in the field.

  • Our research also addresses the crucial aspect of data preparation for multilabel classification. We propose an algorithm that efficiently filters labels based on a given score, enhancing the practicality and applicability of our methodology.

  • Our methodology is rigorously evaluated using appropriate metrics, namely Accuracy and Macro F1 Score. This comprehensive evaluation provides robust evidence for the state-of-the-art performance of our proposed methods.

  • We create a comparison scenario with other existing Large Language Models and visualize the noteworthy performance of our approach.

We named the approach ‘InstructNet’ because we are emphasizing Instructional Text here. In the proposed approach, the primary classifier is XLNet. Therefore, the term ‘InstructNet’ is created by merging Instruction and XLNet.

Section 2, named Related Work, contains the required literature review for this work. We provide the preliminary concepts and an overview of the proposed method in Section 3. In Section 4, we describe the experimental setup. Then, Section 5 contains all of our findings and results. Lastly, Section 6 discusses our research, and Section 7 concludes the paper with future directions.

2 Related work

wikiHow instructions have been utilized as research data in several deep-learning works; however, most of these are summarization tasks. Jadeja et al. [19] used two models, PEGASUS and T5, for text summarization on the wikiHow dataset. They compared the proposed methods based on evaluation metrics (BLEU and ROUGE) and human opinion. For lengthy text summarization, a sentence filtering technique is proposed in [20]; it removes low-quality sentences and preserves the highly informative ones. Authors in [21] proposed an unsupervised topic modeling approach for text summarization, combining Latent Dirichlet Allocation (LDA) and K-Medoids clustering. Other tasks on wikiHow instructions involve learning different kinds of knowledge. Zhou showed a procedural knowledge-acquisition method from household instructions [22], and Zhang proposed a technique for reasoning tasks relating goals and steps. In [23], wikiHow articles are used for both text summarization and categorization.

wikiHow instructions have been less utilized for text classification tasks, and even the existing classification problems are only partially focused on text. Lin et al. [24] proposed a method to classify multistep activities from lengthy videos. They used a distant supervision technique that recognizes the procedure steps from wikiHow instructional videos. Multi-task classification on wikiHow articles was carried out in [25], where the authors used different transformer models and ordinal regression; DeBERTaV3-large provided the best result with an accuracy of 64.12%. Another work [26] on wikiHow articles determined whether an article needs revision or not, achieving the highest accuracy of 68.84% using XLNet. Another work [27] used few-shot classification models with Label Semantic Aware Pre-training (LASP), which is an important way to improve text classification; to supplement their data, they added the wikiHow intent dataset [28]. Aurpa et al. [5] use the BERT model for wikiHow summary classification. None of these works considered multilabel classification for wikiHow text.

XLNet has been used broadly in text classification tasks. Wang et al. [29] proposed an improved version of XLNet for text classification, reporting approximately 94.57% accuracy and a loss of 0.3133.

Another contribution of XLNet can be seen in [30], where the authors use this model for auto-labeling in text classification; they achieved an accuracy of almost 88.68%. Authors in [31] used XLNet for personality classification from textual data and reported micro and macro F1 scores for the Big Five personality traits. Liu proposed an XLNet-based hospitality and tourism-related text classification method in [32], where XLNet performed best with 72.2% accuracy on the Yelp dataset. XLNet scored 96% and performed best among different classifiers, including transformer models, for text classification in [33]. The use of BERT is also prevalent in text classification tasks. Authors in [34] propose a BERT-CNN-based model and improve the accuracy of existing works. Another text classification approach is proposed in [35], where BERT and BiGRU models are combined for Chinese text classification; this method achieved accuracy and F1 scores above 0.9. Another work [36] combined BERT with CNN for long Chinese text classification and provided improved accuracy.

XLNet has also shown strong results in multilabel text classification. Roudsari et al. introduced PatentNet in [37] for the multilabel classification of patent documents with various deep-learning models, and they likewise achieved the best performance using XLNet. Multilabel emotion classification was performed in [38], reaching the highest accuracy of 45.6% with XLNet-MA. Like XLNet, BERT has also been popular in multilabel text classification. Chalkidis et al. [39] used BERT for large-scale multilabel classification. Another work [40] conducted multilabel text classification on biomedical texts, using BERT in their proposed method to recognize aspect categories. A combined BERT model for tagging documents is proposed in [41], which utilizes not only the label semantics but also the fine-grained information of the text. These works are undoubtedly excellent research. However, the use of XLNet on wikiHow data for multiple tag recommendation (multilabel classification) has yet to be considered, and this is the primary intention of our research. Powerful large language models like BERT and XLNet have yet to be used for the classification or tagging of instructional text such as wikiHow articles.

3 Preliminary and proposed framework

3.1 Preliminary concept

3.1.1 Transformers and transformer-XL model

The popular natural language processing architecture, the Transformer [42], also called an encoder-decoder model, gained popularity for its multi-head attention mechanism. The encoder in transformer models deals with the sequential input and maps the input (x1, …, xn) into a context representation z = (z1, …, zn). The decoder uses this context to generate the output representation (y1, …, ym). The encoder and decoder stacks of the transformer each contain N = 6 layers with two sublayers: a multi-head self-attention layer and a fully connected position-wise feed-forward layer. Each sublayer produces an output LayerNorm(x + Sublayer(x)) of dimension dmodel = 512.
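To make these dimensions concrete, the following minimal PyTorch sketch (illustrative only, not the architecture trained in this paper) instantiates an encoder stack with the quoted N = 6 layers and dmodel = 512; the number of attention heads and feed-forward width are assumptions:

```python
import torch
import torch.nn as nn

# Illustrative only: an encoder stack with the dimensions quoted above
# (d_model = 512, N = 6 layers); multi-head self-attention and the
# position-wise feed-forward network are the two sublayers of each layer.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                           dim_feedforward=2048,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

x = torch.randn(2, 10, 512)   # (batch, sequence length, d_model)
z = encoder(x)                # context representation z = (z_1, ..., z_n)
print(z.shape)                # torch.Size([2, 10, 512])
```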

Transformer-XL is a neural architecture that brings recurrence into Transformers to improve language modeling. It proposes a segment-level recurrence mechanism that uses hidden states from earlier segments as extended context for the present segment, as shown in Fig 1. This allows learned dependencies to extend beyond a fixed length without breaking temporal coherence. As a result, it learns dependencies that are about 80% longer than RNNs and 450% longer than vanilla Transformers on diverse word-level and character-level datasets, and it can generate coherent long text articles. The ability to model long-term dependency can be measured through a new metric called Relative Effective Context Length (RECL) [43].

Fig 1. Illustration of how transformer-XL works with long sequences.


(If we compare the figures of the transformer and Transformer-XL, we can observe that the context representation is limited in transformers, whereas Transformer-XL is able to represent long sequences more efficiently).

Fig 1(a) and 1(b) show the training and evaluation phases of vanilla transformers, while Fig 1(c) and 1(d) show the training and evaluation phases of Transformer-XL.

3.1.2 Bidirectional Encoder Representation from Transformers (BERT)

According to Devlin [44], BERT is a multilayered bidirectional transformer encoder. A token sequence that may incorporate one or more sentences is used as the input for BERT. BERT consists of the following two steps:

  • Pretraining BERT: Two unsupervised tasks, Masked LM (Language Model) and NSP (Next Sentence Prediction), are utilized to pretrain BERT. To obtain the pre-trained bidirectional model, masked LM masks a set of random tokens and predicts them: 15% of input tokens are randomly masked, and the model attempts to regenerate the original inputs while learning its weights. The goal of NSP is to predict whether the second sentence of a sentence pair actually follows the first. BERT has been pre-trained using English Wikipedia's text passages (not lists, headings, or tables) (2,500M words) and BooksCorpus (800M words) [45]. This is how BERT exploits the self-supervised mechanism. Fig 2 represents the pretraining tasks of BERT.

  • Fine-tuning BERT: BERT can be adapted to several downstream tasks and accepts inputs that are either single sentences or sentence pairs. It starts with the pre-trained parameters, which are then adjusted on labeled data for the downstream task. The pre-trained BERT model shows noteworthy performance in popular downstream tasks such as question answering, text classification, named entity recognition, machine translation, etc. In this work, we fine-tuned BERT for multilabel text classification.

Fig 2. The training phase of the BERT model.


The wikiHow instructions are tokenized after adding BERT's special tokens, such as [CLS].
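As a small illustration of this step (a sketch using the Hugging Face transformers library; the paper does not state its exact tooling), the BERT tokenizer prepends [CLS], appends [SEP], and pads the sequence with [PAD] tokens:

```python
from transformers import BertTokenizer

# A wikiHow-style sentence used only for illustration.
text = "Don't be in a rush to get up after a fainting spell."

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(text, padding="max_length", truncation=True, max_length=24)

# The sequence starts with [CLS], ends with [SEP], and is padded with [PAD].
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["attention_mask"])
```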

3.1.3 XLNet

XLNet [46] is an autoregressive transformer model that outperforms BERT in various NLP tasks. It utilizes the advantages of autoregressive (AR) models while preserving bidirectional context like an autoencoding (AE) model. It is a generalized version of the Transformer-XL model: it builds on Transformer-XL's pretraining process while preserving the advantages of AE language models.

For any given text sequence x = (x1, x2, …, xT), autoregressive models are pre-trained with the forward factorization in Eq 1:

$$\max_\theta \; \log p_\theta(x) = \sum_{t=1}^{T}\log p_\theta(x_t \mid x_{<t}) = \sum_{t=1}^{T}\log \frac{\exp\big(h_\theta(x_{1:t-1})^{\top} e(x_t)\big)}{\sum_{x'}\exp\big(h_\theta(x_{1:t-1})^{\top} e(x')\big)} \tag{1}$$

Here $h_\theta(x_{1:t-1})$ is the neural model's context representation, and $e(x)$ is the embedding of $x$.

For a given sequence x, AE language models such as BERT create a corrupted sequence $\hat{x}$ in which 15% of the tokens, selected at random, are replaced with the artificial token [MASK]. The model is then trained to reconstruct the masked tokens $\bar{x}$ from $\hat{x}$, as in Eq 2:

$$\max_\theta \; \log p_\theta(\bar{x}\mid\hat{x}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{x}) = \sum_{t=1}^{T} m_t \log \frac{\exp\big(H_\theta(\hat{x})_t^{\top} e(x_t)\big)}{\sum_{x'}\exp\big(H_\theta(\hat{x})_t^{\top} e(x')\big)} \tag{2}$$

Both of these models have pros and cons. For example, BERT assumes the predicted tokens are independent of each other and introduces noise by replacing inputs with an artificial symbol such as [MASK], while AR models lack contextual access to both sides of a token. Addressing all of these issues, XLNet builds a permutation language model that pre-trains on the data like an AE model while combining the advantages of AR models. The objective can be expressed using Eq 3.

$$\max_\theta \; \mathbb{E}_{z\sim\mathcal{Z}_T}\left[\sum_{t=1}^{T}\log p_\theta(x_{z_t}\mid x_{z_{<t}})\right] \tag{3}$$

Here x is the text sequence and z is a factorization order sampled from the set $\mathcal{Z}_T$ of all permutations of length T; the likelihood $p_\theta(x)$ is maximized under the factorization given by z.

XLNet uses two-stream self-attention with two different types of hidden representations: the content stream $h_\theta(x_{z_{\le t}})$ and the query stream $g_\theta(x_{z_{<t}}, z_t)$.

Fig 3 gives an overview of the permutation language model with two-stream self-attention; the two streams are updated schematically as follows:

Fig 3. Permutation language model for predicting token x3 for a given factorization order.


$$g_{z_t}^{(m)} \leftarrow \text{Attention}\big(Q = g_{z_t}^{(m-1)},\; KV = h_{z_{<t}}^{(m-1)};\; \theta\big)$$

$$h_{z_t}^{(m)} \leftarrow \text{Attention}\big(Q = h_{z_t}^{(m-1)},\; KV = h_{z_{\le t}}^{(m-1)};\; \theta\big)$$

Initially, $h_i^{(0)} = e(x_i)$ and $g_i^{(0)} = w$, where Q, K, and V are the Query, Key, and Value, respectively.

3.1.4 HowSumm dataset

HowSumm [47] is a large-scale dataset based on wikiHow articles. The dataset has two parts: steps and methods. The observations in this dataset contain information such as the article's URL, title, a target summary for summarization tasks, the method name, steps, the source, and a list of categories. In this research, we focus on the step part, which contains 11,121 observations and more than 6,000 unique labels for multilabel classification tasks. The challenge with this dataset is that many labels do not have enough observations. Therefore, before using the dataset, it was necessary to prepare it for the deep learning models.

3.2 Proposed methodology

The proposed methodology of this research has been described here. We can divide the methodology into three parts: data preparation, fine-tuning the XLNet model, and evaluating the model. Fig 4 illustrates the working flow of our research work.

Fig 4. The system architecture of our proposed methodology.


It indicates the system’s workflow and is divided into three parts- Data Preparation and Preprocessing (Workflow given inside the green box of the figure), Model Training (The yellow box of the figure) and Validation (The blue box of the figure).

3.2.1 Data preparation and preprocessing phase

This phase of our methodology can be divided into more parts: label filtering, text preprocessing, tokenization, and label encoding. These steps are described below:

  • Label Filtering: We preprocess the text using different preprocessing techniques, but preparing the labels presents several challenges. The dataset contains more than 6,000 labels, and most of these labels do not have enough observations, which imbalances the dataset and hurts the classifier's performance. Therefore, we apply the proposed Algorithm 1 to prepare our data for this research. In this dataset, the smallest number of observations for a single label is only three. To tackle this imbalance problem, we take a methodical approach: we remove the labels with too few observations and retain only those labels that have at least 500 observations, a strategy that effectively addresses the imbalance. In this way, every label used in this paper has a sufficient number of observations, preventing the label imbalance problem. The algorithm takes a dataset with texts and labels and returns the dataset after filtering the labels. In Line 1, we obtain the list of selected labels using the procedure Label_Selection, which is implemented in Lines 12-26. Here, we compute a selection score for each label using a threshold value, which is 500 for this research, and return a list of the labels that score at least 1. The procedure uses a priority queue named ListSelectedLabel to store the filtered labels. Increasing the threshold value would leave fewer labels for classification, whereas decreasing it reintroduces the class imbalance problem in our experimental results. Line 21 calculates the selection score for a particular label. Table 1 shows the list of 67 labels selected by the procedure Label_Selection. In Lines 2-11, we relabel our texts using these chosen labels. Here, a while loop iterates over the text data; in each iteration, the algorithm checks whether each label of the text is present in ListSelectedLabel. If it is present, the algorithm keeps the label for that text; otherwise, the label is dropped. In this manner, all the labels attached to each text are checked. After filtering, each text is labeled with one or more of the selected labels, which sets up the multilabel classification problem in this research.

  • Text Tokenization: We preprocessed the raw text, which will be discussed later in the paper with proper examples. After that, we passed our text data to the XLNet and BERT tokenizers, as raw text cannot be consumed directly by either model. Both tokenizers tokenize the data, add special tokens such as [CLS] and [PAD], and prepare the input sequences for the corresponding classifiers. We get three types of sequences from the tokenizer: the input sequence, the attention mask, and the segment IDs (a sketch is given after this list).

  • Label encoding: For encoding our labels, we selected a binary approach. For each text, we create a binary list whose length equals the total number of selected labels. Each index of this list corresponds to one label: if that label is assigned to the text, the value at its index is 1; otherwise, the value is 0.
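The sketch below illustrates the tokenization and binary (multi-hot) label encoding described in this list. It is a minimal example assuming the Hugging Face XLNetTokenizer and a hypothetical three-label subset standing in for the 67 labels of Table 1; the authors' actual code may differ.

```python
import numpy as np
from transformers import XLNetTokenizer

# Hypothetical stand-ins: `selected_labels` represents the 67 labels of Table 1.
selected_labels = ["Health", "Cardiovascular System Health", "Education and Communications"]
label_index = {label: i for i, label in enumerate(selected_labels)}

text = "Don't be in a rush to get up after a fainting spell."
text_labels = ["Health", "Cardiovascular System Health"]

# Tokenization: input ids, attention mask, and segment (token type) ids.
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
encoded = tokenizer(text, padding="max_length", truncation=True, max_length=512)
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]
token_type_ids = encoded["token_type_ids"]

# Binary (multi-hot) label encoding: one position per selected label.
y = np.zeros(len(selected_labels), dtype=np.float32)
for label in text_labels:
    y[label_index[label]] = 1.0
print(y)   # [1. 1. 0.]
```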

Table 1. Filtered Labels after applying the data preparation algorithms.

All 67 of these labels score at least 1 according to the algorithm; only these labels have at least 500 observations in the HowSumm dataset.

College University & Postgraduate Dog Training Birds
Dog Behavior Youth Work World
Endocrine System Health Recipes Sports and Fitness
Personal Care and Style Cars Outdoor Recreation
Arts and Entertainment Travel Pets and Animals
First-Aid & Emergency Health Care Crafts Featured Articles
Individual Sports Cats Home Maintenance
Musculoskeletal System Health Government Youth Dating
Psychological Health Cosmetics Emotions & Feelings
Banks and Financial Institutions Fashion Babies and Infants
Infectious Diseases Subjects Gardening
Nutrition and Food Health Women’s Health Horses
Cars & Other Vehicles Legal Matters Dating
Communication Skills Skin Care Music
Digestive System Health Dogs Development Stages
Education and Communications Relationships Home and Garden
Endocrine System Health Sleep Health Parenting
Computers and Electronics Hair Care Finance & Business
Psychological Disorders Family Life Hobbies and Crafts
Integumentary System Health Cleaning Studying
Coping with Illness Anxiety Health
Food and Entertaining Housekeeping Work World
Cardiovascular System Health Business Sports and Fitness
Outdoor Recreation Team Sports

The workflow of this part is mentioned in the “Data Preparation and Preprocessing” phase of Fig 4.

Algorithm 1 Algorithm for Data Preparation

Input: Data[Text, Labels]

Output: Data[Text, LabelsSelected]

 1: ListSelectedLabel ← Label_Selection(Data[Text, Labels])
 2: i ← 0
 3: while i < Data[Labels].length do
 4:  for each xlabel in Data[Labels][i] do
 5:   if xlabel in ListSelectedLabel then
 6:    Keep xlabel
 7:   else
 8:    Drop xlabel
 9:   end if
 10:  end for; i ← i + 1
 11: end while
 12: procedure Label_Selection(Data[Text, Labels])
 13:  ListSelectedLabel ← new PriorityQueue
 14:  Threshold ← 500
 15:  j ← 0
 16:  while j < Data[Labels].length do
 17:   add the unique labels of Data[Labels][j] to ListUniqueLabel; j ← j + 1
 18:  end while
 19:  for each ylabel in ListUniqueLabel do
 20:   total_ylabel ← the total number of texts tagged with ylabel
 21:   Selection_Score ← total_ylabel / Threshold
 22:   if Selection_Score ≥ 1 then
 23:    ListSelectedLabel.add(ylabel)
 24:   end if
 25:  end for
 26:  return ListSelectedLabel
 27: end procedure
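A minimal Python sketch of this label-filtering step is given below. It assumes the labels are stored as lists in a pandas DataFrame and uses a plain set in place of the priority queue; it is an approximation of Algorithm 1, not the authors' original code.

```python
import pandas as pd

def label_selection(label_lists, threshold=500):
    """Return the labels whose selection score (count / threshold) is at least 1."""
    counts = {}
    for labels in label_lists:
        for label in set(labels):
            counts[label] = counts.get(label, 0) + 1
    return {label for label, total in counts.items() if total / threshold >= 1}

def filter_labels(df, threshold=500):
    """Keep only the selected labels for every text and drop the rest."""
    selected = label_selection(df["labels"], threshold)
    out = df.copy()
    out["labels_selected"] = out["labels"].apply(
        lambda labels: [l for l in labels if l in selected])
    # Texts left without any selected label can be excluded from the experiments.
    return out[out["labels_selected"].map(len) > 0]

# Toy data for illustration; the real HowSumm step split has 11,121 rows.
df = pd.DataFrame({"text": ["a", "b", "c"],
                   "labels": [["Health", "Rare"], ["Health"], ["Health"]]})
print(filter_labels(df, threshold=2))   # "Rare" (1 observation) is dropped
```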

3.2.2 Fine tuning the model

The “Model Training” phase of Fig 4 represents this part of our methodology. Here, we fine-tuned the BERT and XLNet models for multi-label classification. Both models have three input layers: one for the input sequence, one for the attention mask, and one for the segment IDs. In the XLNet classifier, the inputs are passed from the input layers to the XLNet layer; in the BERT architecture, the layer after the input layers is the BERT layer. A fully connected layer follows, and finally an activation layer predicts the labels. We use the 'Sigmoid' activation function in both models.
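A condensed PyTorch sketch of such a classifier head is shown below. The transformer body comes from the Hugging Face transformers library; the mean pooling over token representations is an assumption, since the paper does not state its pooling strategy, and the authors' training setup may differ.

```python
import torch
import torch.nn as nn
from transformers import XLNetModel

class MultiLabelXLNet(nn.Module):
    """XLNet body + one fully connected layer + sigmoid over 67 labels (a sketch)."""
    def __init__(self, num_labels=67):
        super().__init__()
        self.xlnet = XLNetModel.from_pretrained("xlnet-base-cased")
        self.classifier = nn.Linear(self.xlnet.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.xlnet(input_ids=input_ids,
                         attention_mask=attention_mask,
                         token_type_ids=token_type_ids)
        # Mean-pool the token representations (an assumed pooling choice,
        # not stated in the paper), then map to one score per label.
        pooled = out.last_hidden_state.mean(dim=1)
        return torch.sigmoid(self.classifier(pooled))

model = MultiLabelXLNet()
loss_fn = nn.BCELoss()   # binary cross-entropy over the sigmoid outputs
```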

We trained the models with the tokenized text and binarized labels. On the test data, we tracked two different evaluation measures: binary cross-entropy loss and binary accuracy. Binary accuracy indicates the proportion of correctly predicted positive and negative values among all predictions. Eq 4 is used to calculate binary accuracy.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

Here, TP is the True Positive count, i.e., the total number of correct predictions of true labels for each text. FP, the False Positive count, is the number of wrong predictions of true labels. TN, the True Negative count, and FN, the False Negative count, are respectively the numbers of correct and incorrect predictions of the 0 class.

Binary cross-entropy loss works by averaging the log probability of correct predictions. Based on this loss, we backpropagate through the models and learn the weights needed to predict unseen data. It measures the discrepancy between the probability distribution of the predictions and the ground-truth labels. Eq 5 is used to calculate the binary cross-entropy loss.

$$L_{\text{Binary}} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \cdot \log\big(p(y_i)\big) + (1-y_i)\cdot\log\big(1-p(y_i)\big)\Big] \tag{5}$$

Here, $p(y_i)$ is the predicted probability that label $y_i$ is 1, and $1 - p(y_i)$ is the predicted probability that it is 0.

Hyperparameters play a significant role in model fine-tuning, as they control the learning process and the parameter values of a model. Therefore, we tuned the hyperparameters and compared the accuracy of the resulting configurations for both the XLNet and BERT models. Doing so helped us identify the best values for our proposed models in this research.

3.2.3 Model validation

Finally, the trained XLNet and BERT models are used for validation. These trained models hold the weights necessary for prediction. We sent unseen text to both models after processing and tokenizing it, just as for the training data.

Finally, the model predicts multiple labels for the given input. This part is illustrated in the model validation phase of Fig 4. Alongside binary accuracy and binary cross-entropy loss, we employ other metrics such as the Macro Average F1 Score, Micro Average F1 Score, Macro Average Precision, and Macro Average Recall. The Macro Average F1 Score and Micro Average F1 Score are calculated with Eqs 6 and 7.

$$\text{Macro Average F1 Score} = \frac{2 \times P_{MA} \times R_{MA}}{P_{MA} + R_{MA}} \tag{6}$$

$$\text{Micro Average F1 Score} = \frac{\overline{TP}}{\overline{TP} + \frac{1}{2}\big(\overline{FP} + \overline{FN}\big)} \tag{7}$$

Here, $P_{MA}$ and $R_{MA}$ are the macro average precision and recall, calculated by averaging the precision and recall values of each class. $\overline{TP}$, $\overline{FP}$, and $\overline{FN}$ are the total True Positive, False Positive, and False Negative counts summed over the confusion matrices of all classes.
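A hedged sketch of this validation step is given below, assuming scikit-learn metrics, multi-hot ground-truth arrays, and a 0.5 decision threshold on the sigmoid outputs (the threshold is not stated in the paper):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# y_true: ground-truth multi-hot labels; y_prob: sigmoid outputs of the model,
# both of shape (num_samples, num_labels). Toy values shown for illustration.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]])
y_pred = (y_prob >= 0.5).astype(int)   # assumed 0.5 decision threshold

print("Macro precision:", precision_score(y_true, y_pred, average="macro"))
print("Macro recall:   ", recall_score(y_true, y_pred, average="macro"))
print("Macro F1:       ", f1_score(y_true, y_pred, average="macro"))
print("Micro F1:       ", f1_score(y_true, y_pred, average="micro"))
```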

4 Experimental setup

4.1 Experimental environment

We use Google Colab (https://colab.research.google.com/) to train our classification models, which require the high computing power that a GPU (Graphics Processing Unit) can offer. However, GPU installation is costly and needs extra hardware to support the computation. Google Colab provides a high-end GPU in the cloud, along with all the essential packages for the training process, so we do not have to install packages or worry about storage space [48]. The specifications of Google Colab include an NVIDIA K80 GPU, 12 GB of GPU memory, up to 2.91 teraflops of double-precision performance, and 358 GB of disk space. These specifications create a robust computation environment for training deep learning models.

4.2 Hyperparameter tuning

To bring out the best performance of a deep learning model, its hyperparameters should be tuned. We train the models with five different learning rates (1e-04, 2e-04, 3e-04, 4e-04, and 5e-04) and two different maximum text sequence lengths (484 and 512). We use the AdamW [49] optimizer in this work. The most suitable parameters for this research are listed in Table 2. We considered the following hyperparameters for our models:

Table 2. Hyperparameters for the proposed models.

For these values, the models provide the highest accuracy. (The table columns are the hyperparameter name and the best-performing hyperparameter value for XLNet and BERT).

Hyperparameters XLNet BERT
Learning Rate 4e-04 5e-04
Batch size 48 48
Maximum Length 512 512
Optimizer AdamW AdamW
  • Learning Rate: A significant hyperparameter that controls how a model will adjust the updates for parameters. BERT and XLNet provide the best performance for learning rates 5e-04 and 4e-04, respectively.

  • Batch Size: It indicates the number of samples the model considers before updating the parameters during the model propagation. We consider only one batch size, which is 48 here for both models.

  • Maximum Length: Models usually work with fixed-length sequences, and the maximum length determines that length for a model. The maximum length we consider here is 512. BERT can take a sequence of at most 512 tokens, whereas XLNet has no limitation on sequence length and can take sequences of any length. Still, we use the same two maximum lengths, 484 and 512, for both models so that we can compare the transformer models. Both models performed best with a maximum length of 512.

  • Optimizer: It is the algorithm or function that updates the model attributes to reduce the loss and improve performance. Here we use the AdamW optimizer for both models; it is a stochastic gradient descent method that uses both first- and second-order moments (a configuration sketch follows this list).
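The Table 2 values for XLNet could be wired up as follows (a brief sketch with PyTorch's AdamW; the paper does not show its training code, so this is only an illustration of the settings):

```python
import torch
from transformers import XLNetModel

# Hyperparameter values taken from Table 2 (XLNet column).
LEARNING_RATE = 4e-4
BATCH_SIZE = 48
MAX_LENGTH = 512

body = XLNetModel.from_pretrained("xlnet-base-cased")
optimizer = torch.optim.AdamW(body.parameters(), lr=LEARNING_RATE)
print(optimizer.defaults["lr"], BATCH_SIZE, MAX_LENGTH)
```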

4.3 Data preprocessing

Before tokenizing our text and applying the data preparation algorithm, we preprocessed the data. Table 3 shows some samples of the data after applying Algorithm 1; we present several samples in this table, and multiple labels are given for each text. On these data, we used the following preprocessing methods:

Table 3. Dataset state after applying our data preparation algorithm.

Text Selected Label
Type the author’s last name first, followed by a comma. Then type the author’s first initial. Add their middle initial, if given. Type a space after the period, then type the date of publication in parentheses. Include the year first, followed by a comma, then the month and day (if provided). Place a period after the closing parentheses. Example: Will, G. F. (2004, July 5). If there are multiple authors, separate their names with commas. Use an ampersand (&) before the last author’s name. Education and Communications, College University and Postgraduate
Don’t be in a rush to get up after a fainting spell. Your body and mind need time to recover. You should stay in your current position on the ground for at least 10-15 minutes. If you get up too soon you risk triggering another episode. Health, Cardiovascular System Health
Keep track of how many hands it takes to go from the bottom of the sun to the horizon. The number of hands it takes is the number of daylight hours remaining, or the hours left until sunset. For example, if you count 5 hands, then there are 5 hours remaining in the day or 5 hours until sunset. Home and Garden, Featured Articles’
  • Raw text has lots of special characters (.,%& # etc). First, we remove all the special characters from our data. These characters are completely unnecessary and have no contribution to the model’s performance.

  • Another component of any text that has no relevance to the model’s performance is the stop word. Therefore, we eliminated all the stop words from our text.

  • For more efficiency, we have lemmatized our text and used the lemmas of different complex words.

  • Raw text is a mixture of both capital and lowercase letters. Finally, we convert all the text into lowercase, which is beneficial for the performance of both classifiers.

We show our raw and preprocessed text below to visualize the changes in the data [50]; a sketch of the preprocessing pipeline follows the example.

  • Raw Data

    Type the author’s last name first, followed by a comma. Then, type the author’s first initial. Add their middle initial, if given. Type a space after the period, then type the date of publication in parentheses. Include the year first, followed by a comma, then the month and day (if provided). Place a period after the closing parentheses. Example: Will, G. F. (2004, July 5). If there are multiple authors, separate their names with commas. Use an ampersand (&) before the last author’s name.

  • Preprocessed Data

    type author last name first follow comma then type author first initial add middle initial give type space period type date publication parentheses include year first follow comma month day provide place period closing parentheses example will g f 2004 July five if multiple author separate name comma use ampersand last author-name
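A minimal sketch of this preprocessing pipeline, assuming NLTK for stop-word removal and lemmatization (the paper does not name its tooling), which approximates the transformation shown above:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)               # drop special characters
    tokens = [t for t in text.lower().split()                 # lowercase
              if t not in stop_words]                          # remove stop words
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)   # lemmatize

raw = "Type the author's last name first, followed by a comma."
print(preprocess(raw))
```

The exact stop-word list and lemmatizer may differ from the authors' choices, so the output is only an approximation of the preprocessed sample shown above.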

5 Experimental results and its comparison

5.1 Model performance

We observed the models' performance with different measures during model training and then applied the models to the unseen validation data. For hyperparameter tuning, we determine the accuracy for different hyperparameter settings; the settings given in Table 2 provide the highest performance. For both BERT and XLNet, we determine accuracy scores for different hyperparameter values, and Table 4 shows the resulting accuracies. The highest accuracy for XLNet is 97.30%, and for the BERT model it is 97.17%; both of these values are marked in bold in the table.

Table 4. Different binary accuracy score of the proposed XLNet and BERT models for different hyperparameter values.

(The table's columns are Learning Rate, Maximum Length, XLNet Accuracy, and BERT Accuracy).

Learning rate Max Length XLNet Accuracy(%) BERT Accuracy(%)
1e-04 484 96.50 96.03
1e-04 512 96.69 96.22
2e-04 484 95.90 96.91
2e-04 512 96.01 96.22
3e-04 484 97.06 95.60
3e-04 512 97.13 97.00
4e-04 484 96.98 95.54
4e-04 512 97.30 96.78
5e-04 484 97.00 96.03
5e-04 512 97.21 97.17

We track the model's loss and accuracy over 40 epochs, which helps us understand how well the model is trained over the epochs. Here, we used 'Binary Accuracy' as the accuracy metric and 'Binary Cross Entropy' as the loss. Binary accuracy calculates the percentage of correctly predicted classes among all predictions over the binarized labels.

Our decision to plot the Training and Testing accuracy of our models over 40 epochs has proven to be a valuable strategy. This visualization of the learning process offers a clear picture of how the model has evolved. Most importantly, it allows us to detect any underfitting or overfitting, ensuring the model’s performance is optimal.

Fig 5 presents a visual representation of the training and testing accuracy of the XLNet model over 40 epochs. The blue curve represents the testing accuracy, while the red curve depicts the training accuracy. The highest training and testing accuracies, 99.47% and 97.31% respectively, serve as clear benchmarks for the model's performance. Similarly, the BERT model's training and testing accuracy is given in Fig 6, where the highest training and testing accuracies are 98.76% and 97.14% over 40 epochs. After observing these two curves, we can state that our models are neither underfitted nor overfitted.

Fig 5. Training accuracy and testing accuracy of the XLNet model over epochs.


(The curve is smoothened using the Gaussian filter).

Fig 6. Training accuracy and testing accuracy of the BERT model over epochs.


(The curve is smoothened using the Gaussian filter).

We have also presented the losses over epochs for our models. The training loss is a key tool for detecting how the training data has been adjusted in the model, and it is used to gauge the model’s performance in the validation dataset. Importantly, these losses can also help us identify potential underfitting and overfitting in our models, making us aware of their limitations.

Fig 7 provides a clear view of the training and testing losses for XLNet. The blue and red curves represent the testing and training loss over epochs, respectively. The lowest training loss is 0.1000, and the lowest testing loss is 0.1052. These losses are crucial indicators of the model's performance, providing valuable insights into its accuracy and fit.

Fig 7. Training loss and testing loss of the XLNet model over epochs.


(The curve is smoothened using the Gaussian filter).

Similarly, we have plotted the losses for the BERT model in Fig 8. The BERT model demonstrated a training loss of 0.1000 and a testing loss of 0.1103. These values, along with the loss curves, indicate that the training data have been effectively fitted by our models. Furthermore, their performance on the validation data is commendable, providing a strong basis for comparison with XLNet.

Fig 8. Training loss and testing loss of the BERT model over epochs.


(The curve is smoothened using the Gaussian filter).

For smoothing the curves in Figs 5–8, we use a Gaussian filter.

Our work goes beyond mere accuracy: we have determined a range of evaluation metrics, each playing a crucial role in our model evaluation. In Fig 9, we present a bar chart whose bars represent the Macro Average Precision, Recall, and F1 Score, and the Micro Average F1 Score for our selected classifiers, XLNet and BERT.

Fig 9. Evaluation metrics for HowSumm dataset.


5.2 Comparison with other models and dataset

To compare with the proposed method, we implemented other deep learning models: BERT, ELECTRA, RoBERTa, DistilBERT, DeBERTa, GPT-4, and LSTM. Among all of these models, XLNet outperforms the others. Here, we use binary accuracy and the macro and micro average F1 scores. For each transformer model, we use its corresponding tokenizer to tokenize the sequences before applying the model.
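Because the open-source comparison models share a common interface in the Hugging Face library, the comparison can be set up by swapping checkpoint names. The checkpoints below are assumptions, since the paper does not state which variants were used, and GPT-4, which is accessed through a separate API, is omitted from the sketch:

```python
from transformers import AutoModel, AutoTokenizer

# Assumed base checkpoints for the open-source comparison models; each model
# is paired with its own tokenizer, as described in the text.
checkpoints = {
    "RoBERTa": "roberta-base",
    "ELECTRA": "google/electra-base-discriminator",
    "DistilBERT": "distilbert-base-uncased",
    "DeBERTa": "microsoft/deberta-v3-base",
}

for name, ckpt in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    body = AutoModel.from_pretrained(ckpt)
    print(name, body.config.hidden_size)
```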

For comparison, we first implemented RoBERTa [51]. It is a variation of BERT in which pretraining is performed on a larger dataset and the Next Sentence Prediction (NSP) task is removed from the pretraining phase; the authors claimed that removing NSP can improve performance. On our data, RoBERTa provided 96.90% accuracy, a macro average F1 score of 89.01%, and a micro average score of 87.98%.

The next transformer architecture is ELECTRA [52]. It is very similar to BERT; however, unlike BERT, it uses a generator and a discriminator during the pretraining phase. The generator acts like a masked language model, and the discriminator aims to distinguish original tokens from replaced ones. Here, ELECTRA showed 95.98% accuracy, an 87.56% macro average F1 score, and an 86% micro average score.

Another transformer-based model, DistilBERT [53], performed with 97.02% accuracy, a 90.2% macro average F1 score, and an 87.43% micro average F1 score. Compared to BERT, it is cheaper, smaller, and faster, and it uses fewer parameters than BERT.

We also applied DeBERTa [54] and GPT-4 to the instructional text. These transformer models were introduced recently and have performed well on different tasks. Here, we can see that these two models performed very closely to XLNet; DeBERTa even provided the highest accuracy but not the highest F1 scores. The accuracies of DeBERTa and GPT-4 are 97.32% and 9.20%, their macro average F1 scores are 90.34% and 91%, and their micro average F1 scores are 88.01% and 89.56%.

We also applied LSTM (Long Short-Term Memory), an autoregressive model. LSTM handles shorter sequences than transformers and uses different text vectorization techniques: we first computed word embeddings for our text and then fed them to the LSTM. The accuracy, macro average F1 score, and micro average F1 score are 95.8%, 85.09%, and 82.09%, respectively.

Fig 10 represents the summary of binary accuracy, Macro and micro F1 score of the data for different deep learning models. We can observe that our proposed models, XLNet and BERT, outperform all other models.

Fig 10. Accuracy, macro and micro F1 Score of different deep learning models.


We also applied our methodology to other multilabel classification datasets, namely the ‘arXiv Paper Abstracts’ dataset (https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts), Twitter Financial News (https://www.kaggle.com/datasets/sulphatet/twitter-financial-news), and Toxic Comment Classification (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data). Table 5 reports the evaluation metric values of our classifiers on these datasets. Our methodology performed well on these datasets, as is clearly visible in Table 5.

Table 5. Evaluation metrics of three different multi-label datasets for BERT and XLNet.

Dataset→ arXiv Paper Abstracts Twitter Financial News Toxic Comment Classification
Metrics ↓ XLNet BERT XLNet BERT XLNet BERT
Accuracy 99.59 99.46 98.45 99.01 90.32 91.94
Macro Average Precision 98.02 98.11 94.78 97.23 71.34 79.34
Macro Average Recall 96.48 95.76 96.00 98.29 75.31 81.29
Macro Average F1 Score 97.24 96.92 93.10 97.05 72.03 77.33
Micro Average F1 Score 93.22 94.89 92.21 96.07 70.45 77.03

6 Discussion

This research introduces an impactful methodology that can be instrumental in the development of multilabel instructional text classification systems. It leverages popular deep learning models such as BERT and XLNet, marking a significant advancement in this field. To the best of our knowledge, this is the first attempt at multilabel instruction classification. This research’s significant contribution lies in its ability to determine the appropriate tags for different instructions. This can greatly enhance instruction-searching experiences. Moreover, it paves the way for task-oriented learning of procedural text, empowering intelligent systems to understand the categories of the instruction.

This research uses one of the latest wikiHow-based datasets, HowSumm. Though the dataset was prepared with summarization tasks in mind, we utilize it for classification tasks. For that, we introduce a novel and simple algorithm that filters the labels of the HowSumm data and removes its class imbalance problem. After scoring all the labels, this algorithm provided us with 67 significant labels, which we used for multilabel classification. In the future, we want to share this filtered dataset publicly. The filtering helped us avoid the data imbalance problem in the HowSumm dataset, which in turn helped the classifiers improve their performance. Sharing this filtered data may create new possibilities for using wikiHow articles in multilabel text classification.

For the sake of simplicity and ease of classification, we opted for the binary label encoding technique. This technique converts the labels of a single text observation into a 67-element array in which every element is either 0 or 1. Accordingly, we used binary accuracy and binary cross-entropy loss in our proposed methodology, and the activation function used in the models is Sigmoid.

As this is the first work on multilabel instructional text classification, we used the most significant models and selected the best two transformers in our search. The large language model XLNet has been employed to bring out the best performance on our data. During model building, we traced the models' training and testing measures (loss and accuracy) and plotted these values; Figs 5 and 7 show how the XLNet model's accuracy and loss evolve over 40 epochs. BERT is also a strong performer; its accuracy and loss graphs are given in Figs 6 and 8.

They provided significant scores on the data and outperformed other transformer models such as ELECTRA, RoBERTa, and DistilBERT, as well as another deep learning model, Long Short-Term Memory (LSTM). We employ two evaluation techniques to justify the proposed methodology: accuracy and the macro and micro average F1 scores. Fig 10 shows how XLNet outperforms the other models with an accuracy of 97.30%. The best performer, XLNet, is a generalized autoregressive transformer that combines the strengths of autoencoding and autoregressive models while avoiding their limitations. For comparison, we implemented BERT, an autoencoding large language model, and LSTM, an autoregressive model; XLNet outperforms both of them with higher accuracy and macro F1 score. Although XLNet has no inherent sequence length limit, it still outperformed the other models at the fixed length of 512, and because instructional texts are long, XLNet performed better than BERT.

This research marks the beginning of a new era in instructional text classification. By harnessing the power of the most advanced Large Language Model, we have kept the overall process straightforward. The use of binary encoding for labels, which tends to use sparse data, has led to a remarkably high-performing model. These practical implications underscore the potential of our proposed architecture in real-world scenarios.

7 Conclusion and future work

In our research, we aim to address a notable problem that needs more attention. Specifically, we tackle the challenge of multilabel classification for wikiHow instructions. Our approach is relatively straightforward, using a binary label encoding technique. Our proposed methodology has demonstrated impressive performance, surpassing other machine and deep learning-based research work. We have rigorously evaluated the effectiveness of our approach using two different metrics: Binary Accuracy and Macro F1 score.

There are some areas in this work that could be improved upon. In the future, we will address these issues by implementing a more efficient label encoding technique and reducing the amount of sparse data in a manner similar to what was done in this work. Additionally, we aim to optimize the method section of the HowSumm data, which contains more extensive sequences than the step data.

Data Availability

https://github.com/odelliab/HowSumm.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Aurpa TT, Ahmed MS, Rifat RK, Anwar MM, Ali AS. UDDIPOK: A reading comprehension based question answering dataset in Bangla language. Data in Brief. 2023;47:108933. doi: 10.1016/j.dib.2023.108933
  • 2. Aurpa TT, Rifat RK, Ahmed MS, Anwar MM, Ali AS. Reading comprehension based question answering system in Bangla language with transformer-based learning. Heliyon. 2022;8(10). doi: 10.1016/j.heliyon.2022.e11052
  • 3. Krishnan J, Anastasopoulos A, Purohit H, Rangwala H. Cross-lingual text classification of transliterated Hindi and Malayalam. In: 2022 IEEE International Conference on Big Data (Big Data). IEEE; 2022. p. 1850–1857.
  • 4. Kulkarni A, Mandhane M, Likhitkar M, Kshirsagar G, Jagdale J, Joshi R. Experimental evaluation of deep learning models for marathi text classification. In: Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications: ICMISC 2021. Springer; 2022. p. 605–613.
  • 5. Aurpa TT, Ahmed MS, Sadik R, Anwar S, Adnan MA, Anwar MM. Progressive guidance categorization using transformer-based deep neural network architecture. In: Hybrid Intelligent Systems: 21st International Conference on Hybrid Intelligent Systems (HIS 2021), December 14-16, 2021. Springer; 2022. p. 344–353.
  • 6. Colla D, Caselli T, Basile V, Mitrović J, Granitzer M. Grupato at semeval-2020 task 12: Retraining mbert on social media and fine-tuned offensive language models. In: Proceedings of the Fourteenth Workshop on Semantic Evaluation. 2020. p. 1546–1554.
  • 7. Aurpa TT, Ahmed MS. An ensemble novel architecture for Bangla Mathematical Entity Recognition (MER) using transformer based learning. Heliyon. 2024 Feb 15;10(3). doi: 10.1016/j.heliyon.2024.e25467
  • 8. Xue K, Zhou Y, Ma Z, Ruan T, Zhang H, He P. Fine-tuning BERT for joint entity and relation extraction in Chinese medical text. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2019. p. 892–897.
  • 9. Gonen H, Ravfogel S, Elazar Y, Goldberg Y. It’s not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT. In: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. 2020. p. 45–56.
  • 10. Chowdhury S, Baili N, Vannah B. Ensemble Fine-tuned mBERT for Translation Quality Estimation. In: Proceedings of the Sixth Conference on Machine Translation. 2021. p. 897–903.
  • 11. Yan R, Jiang X, Dang D. Named entity recognition by using XLNet-BiLSTM-CRF. Neural Processing Letters. 2021;53(5):3339–3356. doi: 10.1007/s11063-021-10547-1
  • 12. Sweidan AH, El-Bendary N, Al-Feel H. Sentence-level aspect-based sentiment analysis for classifying adverse drug reactions (ADRs) using hybrid ontology-XLNet transfer learning. IEEE Access. 2021;9:90828–90846. doi: 10.1109/ACCESS.2021.3091394
  • 13. Shen W, Chen J, Quan X, Xie Z. Dialogxl: All-in-one xlnet for multi-party conversation emotion recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2021;35(15):13789–13797.
  • 14. Adoma AF, Nunoo-Mensah H, Chen W. Comparative analyses of bert, roberta, distilbert, and xlnet for text-based emotion recognition. In: 2020 17th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP). IEEE; 2020. p. 117–121.
  • 15. Xu H, Liu B. DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020.
  • 16. Al-Twairesh N. The evolution of language models applied to emotion analysis of Arabic tweets. Information. 2021;12(2):84. doi: 10.3390/info12020084
  • 17. Ozyurt IB. On the effectiveness of small, discriminatively pre-trained language representation models for biomedical text mining. bioRxiv. 2020.
  • 18. Das KA, Baruah A, Barbhuiya FA, Dey K. Ensemble of ELECTRA for Profiling Fake News Spreaders. In: CLEF (Working Notes). 2020.
  • 19. Jadeja D, Khetri A, Mittal A, Vishwakarma DK. Comparative Analysis of Transformer Models on WikiHow Dataset. In: 2022 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS). IEEE; 2022. p. 655–658.
  • 20. Mei A, Kabir A, Bapat R, Judge J, Sun T, Wang WY. Learning to Prioritize: Precision-Driven Sentence Filtering for Long Text Summarization. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022. p. 313–318.
  • 21. Srivastava R, Singh P, Rana K, Kumar V. A topic modeled unsupervised approach to single document extractive text summarization. Knowledge-Based Systems. 2022;246:108636. doi: 10.1016/j.knosys.2022.108636
  • 22. Zhou Y, Shah J, Schockaert S. Learning Household Task Knowledge from WikiHow Descriptions. In: Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5). 2019. p. 50–56.
  • 23. Devi SS, Sneha S, Gururajan S, Siva G. Text Categorization and Summarization. International Journal of Recent Advances in Multidisciplinary Topics. 2023;4(3):73–77.
  • 24. Lin X, Petroni F, Bertasius G, Rohrbach M, Chang SF, Torresani L. Learning to recognize procedural activities with distant supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 13853–13863.
  • 25. Nouriborji M, Rohanian O, Clifton D. Nowruz at SemEval-2022 Task 7: Tackling Cloze Tests with Transformers and Ordinal Regression. In: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). 2022. p. 1071–1077.
  • 26. Wiriyathammabhum P. TTCB System description to a shared task on implicit and underspecified language 2021. In: Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language. 2021. p. 64–70.
  • 27. Mueller A, Krone J, Romeo S, Mansour S, Mansimov E, Zhang Y, et al. Label Semantic Aware Pre-training for Few-shot Text Classification. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. p. 8318–8334.
  • 28. Zhang L, Lyu Q, Callison-Burch C. Intent Detection with WikiHow. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. 2020. p. 328–333.
  • 29. Wang C, Zhang F. The performance of improved XLNet on text classification. In: Third International Conference on Artificial Intelligence and Electromechanical Automation (AIEA 2022). SPIE; 2022. p. 154–159.
  • 30. Salma TD, Saptawati GAP, Rusmawati Y. Text Classification Using XLNet with Infomap Automatic Labeling Process. In: 2021 8th International Conference on Advanced Informatics: Concepts, Theory and Applications (ICAICTA). IEEE; 2021. p. 1–6.
  • 31. Wang Y, Zheng J, Li Q, Wang C, Zhang H, Gong J. XLNet-caps: personality classification from textual posts. Electronics. 2021;10(11):1360. doi: 10.3390/electronics10111360
  • 32. Liu J, Hu S, Mehraliyev F, Liu H. Text classification in tourism and hospitality–a deep learning perspective. International Journal of Contemporary Hospitality Management. 2023. doi: 10.1108/IJCHM-07-2022-0913
  • 33. Arabadzhieva-Kalcheva N, Kovachev I. Comparison of BERT and XLNet accuracy with classical methods and algorithms in text classification. In: 2021 International Conference on Biomedical Innovations and Applications (BIA). IEEE; 2022. p. 74–76.
  • 34. Li W, Gao S, Zhou H, Huang Z, Zhang K, Li W. The automatic text classification method based on bert and feature union. In: 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS). IEEE; 2019. p. 774–777.
  • 35. Yu Q, Wang Z, Jiang K. Research on text classification based on bert-bigru model. Journal of Physics: Conference Series. 2021;1746(1):012019.
  • 36. Chen X, Cong P, Lv S. A long-text classification method of Chinese news based on BERT and CNN. IEEE Access. 2022;10:34046–34057. doi: 10.1109/ACCESS.2022.3162614
  • 37. Haghighian Roudsari A, Afshar J, Lee W, Lee S. PatentNet: multi-label classification of patent documents using deep learning based language understanding. Scientometrics. 2022;1–25.
  • 38. Ameer I, Bölücü N, Siddiqui MHF, Can B, Sidorov G, Gelbukh A. Multi-label emotion classification in texts using transfer learning. Expert Systems with Applications. 2023;213:118534. doi: 10.1016/j.eswa.2022.118534
  • 39. Chalkidis I, Fergadiotis E, Malakasiotis P, Androutsopoulos I. Large-Scale Multi-Label Text Classification on EU Legislation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. p. 6314–6322.
  • 40. Zhang X, Song X, Feng A, Gao Z. Multi-self-attention for aspect category detection and biomedical multilabel text classification with bert. Mathematical Problems in Engineering. 2021;2021:1–6. doi: 10.1155/2021/9628251
  • 41. Cai L, Song Y, Liu T, Zhang K. A hybrid BERT model that incorporates label semantics via adjustive attention for multi-label text classification. IEEE Access. 2020;8:152183–152192. doi: 10.1109/ACCESS.2020.3017382
  • 42. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in neural information processing systems. 2017;30.
  • 43. Dai Z, Yang Z, Yang Y, Carbonell JG, Le Q, Salakhutdinov R. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. p. 2978–2988.
  • 44. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. p. 4171–4186.
  • 45. Zhu Y, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A, et al. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE international conference on computer vision. 2015. p. 19–27.
  • 46. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV. XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc.; 2019. p. 517.
  • 47. Boni O, Feigenblat G, Lev G, Shmueli-Scheuer M, Sznajder B, Konopnicki D. HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow Articles. arXiv. 2021;2110.03179.
  • 48. Bisong E. Google Colaboratory. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform. Springer; 2019. p. 59–64.
  • 49. Kingma D, Ba J. Adam: A Method for Stochastic Optimization. In: International Conference on Learning Representations (ICLR). 2015. p. 12.
  • 50. Ahmed MS, Aurpa TT, Anwar MM. Online topical clusters detection for top-k trending topics in twitter. In: 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE; 2020. p. 573–577.
  • 51. Liu Z, Lin W, Shi Y, Zhao J. A Robustly Optimized BERT Pre-training Approach with Post-training. In: Li S, Sun M, Liu Y, Wu H, Liu K, Che W, He S, Rao G, editors. Proceedings of the 20th Chinese National Conference on Computational Linguistics. Chinese Information Processing Society of China; 2021. p. 1218–1227.
  • 52. Clark K, Luong MT, Le QV, Manning CD. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In: Proceedings of the 8th International Conference on Learning Representations. 2020.
  • 53. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: Proceedings of the 32nd Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019. 2019.
  • 54. He P, Liu X, Gao J, Chen W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. In: International Conference on Learning Representations. 2021.
