Skip to main content
Springer Nature - PMC COVID-19 Collection logoLink to Springer Nature - PMC COVID-19 Collection
. 2020 Mar 24;12036:174–181. doi: 10.1007/978-3-030-45442-5_22

Neural Query-Biased Abstractive Summarization Using Copying Mechanism

Tatsuya Ishigaki 15,, Hen-Hsen Huang 17, Hiroya Takamura 15,18, Hsin-Hsi Chen 16, Manabu Okumura 15
Editors: Joemon M Jose8, Emine Yilmaz9, João Magalhães10, Pablo Castells11, Nicola Ferro12, Mário J Silva13, Flávio Martins14
PMCID: PMC7148071

Abstract

This paper deals with the query-biased summarization task. Conventional non-neural network-based approaches have achieved better performance by primarily including the words overlapping between the source and the query in the summary. However, recurrent neural network (RNN)-based approaches do not explicitly model this phenomenon. Therefore, we model an RNN-based query-biased summarizer to primarily include the overlapping words in the summary, using a copying mechanism. Experimental results, in terms of both automatic evaluation with ROUGE and manual evaluation, show that the strategy to include the overlapping words also works well for neural query-biased summarizers.

Keywords: Abstractive summarization, Query-biased summarization

Introduction

A query-biased summarizer takes a query in addition to a source document as an input, and outputs a summary with respect to the query, as in Table 1. The generated summaries are intended to be used, for example, for snippets as the results of search engines. Query-biased summarization has been studied for decades [3, 4, 16, 18]. Conventional approaches are mostly extractive, and often use the overlapping words as cues to calculate the salience score of a sentence [14, 16, 18].

Table 1.

Example of a source document, a query, the gold summary. The words overlapping between the source and query are shown in bold.

Source: Vigilanteism simply causes more problems and will not fix the original problem. one should restore law and order rather than implementing disorder
Query: Will vigilanteism restore law and order?
Gold Summary: Vigilanteism merely instigates chaos

On the other hand, recurrent neural network (RNN)-based approaches have enabled summarizers to generate fluent abstractive summaries [1, 7, 13], but do not explicitly model the strategy to primarily include the overlapping words. In this paper, therefore, we incorporate this strategy into RNN-based summarizers using copying mechanisms.

A copying mechanism is a network to primarily include the words in the source document in the summary [5, 6, 17]. To achieve this, the copying mechanism increases the probability of including the words in the source document. A copying mechanism can be seen as an extension of the pointer-network [19], which only copies words in the input and does not output words other than in the input. Gu et al. [5], Gulcehre et al. [6], and Miao et al. [12] extended the pointer-network to copying mechanisms by using a function to balance copying and generation. See et al. [17] and Chen and Lapata [2] applied the copying mechanism to single-document summarization tasks without a query. We came up with an idea of using copying mechanisms to include the overlapping words in the summary. However, the copying mechanisms were originally designed for the settings without the query information, and it is not necessarily clear how we can integrate the mechanisms into a query-biased summarizer.

Encoder-decoders for the query-biased setting have been proposed. Hasselqvist et al. [7] proposed an architecture being able to copy the words in the source document, while our copying mechanisms copy the overlapping words and their surroundings explicitly. Nema et al. [13] presented a dataset extracted from Debatepedia. They proposed a method to gain the diversity of the summary, while we focus on copying mechanisms.

We propose three copying mechanisms designed for query-biased summarizers: copying from the source, copying the overlapping words, and copying the overlapping words and their surroundings. We empirically show that the models copying the overlapping words perform better. These results support the fact that the strategy to include the overlapping words, which was shown useful for conventional query-biased summarizers, also works well for neural network-based query-biased summarizers.

Base Model

We first explain a base query-biased neural abstractive summarizer proposed by Nema et al. [13], into which we integrate our copying mechanisms in the next section.

Encoders: The base model has two bi-directional Long Short-term Memory (LSTM) [8]-based encoders; one is for the query Inline graphic and another is for the source document Inline graphic. In each encoder, the outputs of the forward and the backward LSTM are concatenated into a vector. We refer to the generated vector for the i-th word in the query as Inline graphic, and the j-th word in the source document as Inline graphic.

Decoder with Query- and Source Document- Attentions: The decoder outputs the summary. The final state Inline graphic is used to initialize the first state of the LSTM in the decoder. In each time step t, the decoder calculates the attention weights Inline graphic for every word in the query. Inline graphic are weight matrices, and Inline graphic is a Inline graphicdimensional weight vector, where each element is automatically learned from the training data. Inline graphic is the output of the LSTM in the decoder. Here, Inline graphic is the embedding vector of the previously generated word and Inline graphic is a document representation explained later. The weights are converted into probabilities Inline graphic. We now obtain a query vector: Inline graphic

The source document attention mechanism further calculates the attention weights Inline graphic for every word in the source document and converts them into probabilities Inline graphic:

graphic file with name 498007_1_En_22_Equ1_HTML.gif 1

Inline graphic and Inline graphic are learnable parameters. Note that Eq. (1) contains Inline graphic, which means that the weights are calculated by considering the query. We then take the weighted average to obtain a document vector: Inline graphic.

Finally, the score of generating the word n in the pre-defined dictionary N is calculated as Inline graphic. The scores are converted into a probability distribution:

graphic file with name M22.gif 2

where Inline graphic, Inline graphic and Inline graphic are learnable parameter matrices. Inline graphic is a one-hot vector where the element corresponding to the word n is 1, or 0 otherwise. Thus, the dot product of Inline graphic and Inline graphic calculates the score of generating the word n. Equation (2) converts the score into a probability distribution.

Objective Function: All learnable matrices are tuned to minimize the negative likelihood for the reference summaries y in the training data D: Inline graphic.

Copying Mechanisms for Query-Biased Summarizers

We discuss the copying mechanisms for query-biased summarizers. Figure 1 shows the overview of a query-biased summarizer with a copying mechanism. In the following subsections, we present three mechanisms; SOURCE, OVERLAPand OVERLAP-WIND.

Fig. 1.

Fig. 1.

Overview of a query-biased summarizer with a copying mechanism.

SOURCE: We explain SOURCE, which copies the words from the source document. The strategy is a straightforward extension of the existing copying mechanisms [5, 6, 17]. The neural query-biased summarizer proposed by Hasselqvist et al. [7] also adopted this strategy, but they did not report its impact. We further extend this mechanism in the following subsections. In this strategy, the output layer calculates the probability distribution over the set Inline graphic, where N is the set of words in the pre-defined dictionary and M is the set of words in the source document. Thus, Inline graphic in Eq. (2) is modified to consider the extended vocabulary as follows:

graphic file with name M32.gif 3

In the copying mechanism, we consider two different probabilities for the word n in the vocabulary; the generation probability Inline graphic and the copying probability Inline graphic. The switching probability Inline graphic balances those probabilities as: Inline graphic. Here, Inline graphic and Inline graphic. Inline graphic represents a sigmoid function. Inline graphic is the dimension size of a word embedding. Inline graphic is calculated as follows:

graphic file with name M42.gif 4

Inline graphic is a function to return the position of the word n in the source document. The attention weight Inline graphic provided by the source document attention module is used as the score for outputting the word n in the source document.

OVERLAP: We propose OVERLAP, the model to copy the overlapping words. This model calculates the probability distribution over the set Inline graphic in the same way as in SOURCE. This model increases the scores for the overlapping words as follows:

graphic file with name M46.gif 5

The scores are converted into a probability distribution: Inline graphic.

In the equations above, Q refers to the set of content words1 in the query. Thus, Inline graphic represents the overlapping words. Inline graphic is a hyperparameter that controls the importance of the overlapping words. By using Inline graphic, this model can assign a relatively high probability for the overlapping words.Thus, the overlapping words are more likely to be included in the summary. Inline graphic is tuned on validation data.

OVERLAP-WINDOW: We finally explain OVERLAP-WIND, the model that copies the overlapping words and their surrounding words. We assume that the surroundings of overlapping words might also be important. This model calculates the scores for the overlapping words and their surroundings as follows:

graphic file with name M52.gif 6

where Inline graphic is the set that contains Inline graphic words around the overlapping word in addition to the overlapping word itself. Then, the scores are converted into a probability distribution by using the softmax function: Inline graphic.

Experiments

We used the publicly available dataset2 provided by Nema et al. [13]. The data contains the tuples of a source document, a query and a summary, extracted from Debatepedia3. We used 80% of the data for training, and the remaining was equally split for parameter tuning and testing.

We used Adam [9] for the optimizer with Inline graphic and Inline graphic. The initial learning rate was set to 0.0004. The word embeddings for both the query and the source document were initialized by GloVe [15] and further tuned during training the models. We selected the best-performing value from 200, 300 and 400 for the dimension size for LSTM by using the data for tuning. We used 32 for the batch size. We used all the vocabulary in the training data as the pre-defined dictionary.

We prepared three baselines without any copying mechanisms. The first, ENC-DEC, was a simple encoder-decoder based summarizer without the query encoder. This model uses Eq. (1) without Inline graphic. The second, ENC-DEC QUERY, was the query-aware encoder-decoder explained in Sect. 2. The third, DIVERSE, was the state-of-the-art model proposed by Nema et al. [13]. We adopted full-length ROUGE [11] for the automatic evaluation metric. In addition, we conducted manual evaluation by human judges. 55 randomly selected document/query pairs and their summaries generated by DIVERSE, SOURCE, and OVERLAP-WINDwere shown to crowdworkers on Amazon Mechanical Turk. We assigned 10 workers for each set of document/query/summaries and asked them to rank the summaries. We adopted readability and responsiveness as the manual evaluation criteria, following the evaluation metric in DUC20074. The workers were allowed to give the same rank to multiple summaries.

Results

We show the ROUGE scores5 and the averaged rankings from human judges in Table 2.

Table 2.

The full-length ROUGE-1, ROUGE-2, ROUGE-L (higher is better) and the averaged rankings (lower is better) from human judges. The best performing model is in bold.

ROUGE-1 ROUGE-2 ROUGE-L Readability Responsiveness
Reference 1.50 1.55
ENC-DEC 13.73 2.06 12.84
ENC-DEC QUERY 29.28 10.24 28.21
DIVERSE [13] 41.02 26.44 40.78 3.36 3.39
SOURCE 43.32 29.12 42.96 1.99 1.93
OVERLAP 43.47 29.68 43.26
OVERLAP-WIND Inline graphic 44.41Inline graphic 30.48Inline graphic 44.20Inline graphic 1.83Inline graphic 1.85Inline graphic
OVERLAP-WIND Inline graphic 43.16 29.15 42.90
OVERLAP-WIND Inline graphic 44.03 29.78 43.77

ROUGE Scores: ENC-DEC, without query information, achieved very low performance. Adding the query encoder (ENC-DEC QUERY) improved the score. Furthermore, copying the words in the source document (SOURCE) achieved better scores than those of the best-performing model without a copying mechanism (DIVERSE). Thus, integrating the copying mechanism improved the performance even in the query-biased setting. Among our models, OVERLAP  and OVERLAP-WIND  (Inline graphic)  achieved the better performances than SOURCE. The dagger (Inline graphic) indicates that the differences between the scores of SOURCE, OVERLAP  and those of our best-performing model (OVERLAP-WIND(Inline graphic)) are statistically significant with the paired bootstrap resampling test used in Koehn et al. [10] (Inline graphic). This supports our assumption that the strategy to copy the overlapping words is shown effective even for RNN-based summarizers.

Rankings by Human Judges: Our best model OVERLAP-WIND  (Inline graphic) is ranked higher than the state-of-the-art DIVERSE and SOURCE. The differences between OVERLAP-WIND  (Inline graphic) and SOURCE are statistically significant (Inline graphic) with the paired bootstrap resampling test [10]. The results of manual evaluation also support our assumption.

Conclusion

We proposed the copying mechanisms designed for query-biased summarizers to primarily include the words overlapping between the source document and the query. Our experimental results showed that the mechanisms to primarily include the overlapping words between the source document and the query achieved the better performances in terms of both ROUGE and rankings by human judges. The results suggested that the strategy to include the overlapping words, which has been shown useful for conventional non-neural summarizers, also works well for RNN-based summarizers.

Footnotes

1

We used the list of stop words defined in the nltk library for filtering to obtain content words.

5

The option for ROUGE is -a -n 2 -s.

Contributor Information

Joemon M. Jose, Email: joemon.jose@glasgow.ac.uk

Emine Yilmaz, Email: emine.yilmaz@ucl.ac.uk.

João Magalhães, Email: jm.magalhaes@fct.unl.pt.

Pablo Castells, Email: pablo.castells@uam.es.

Nicola Ferro, Email: ferro@dei.unipd.it.

Mário J. Silva, Email: mjs@inesc-id.pt

Flávio Martins, Email: flaviomartins@acm.org.

Tatsuya Ishigaki, Email: ishigaki@lr.pi.titech.ac.jp.

Hen-Hsen Huang, Email: hhhuang@nccu.edu.tw.

Hiroya Takamura, Email: takamura@pi.titech.ac.jp.

Hsin-Hsi Chen, Email: hhchen@ntu.edu.tw.

Manabu Okumura, Email: oku@pi.titech.ac.jp.

References

  • 1.Baumel, T., Eyal, M., Elhadad, M.: Query focused abstractive summarization: incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. arXiv preprint arXiv:1801.07704 (2018)
  • 2.Cheng, J., Lapata, M.: Neural summarization by extracting sentences and words. In: Proceedings of The 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Berlin, Germany, pp. 484–49 (2016)
  • 3.Dang, H.T.: Overview of DUC 2005. In: Proceedings of 2005 Document Understanding Conferences, DUC 2005, pp. 1–12. Citeseer (2005)
  • 4.Daumé III, H., Marcu, D.: Bayesian query-focused summarization. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2006, pp. 305–312 (2006)
  • 5.Gu, J., Lu, Z., Li, H., Li, V.O.: Incorporating copying mechanism in sequence-to-sequence learning. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016. pp. 1631–1640 (2016)
  • 6.Gulcehre, C., Ahn, S., Nallapati, R., Zhou, B., Bengio, Y.: Pointing the unknown words. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, vol. 1, pp. 140–149 (2016)
  • 7.Hasselqvist, J., Helmertz, N., Kågebäck, M.: Query-based abstractive summarization using neural networks. arXiv preprint arXiv:1712.06100 (2017)
  • 8.Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–1780. doi: 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]
  • 9.Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of 3rd International Conference on Learning Representations, ICLR 2015 (2015)
  • 10.Koehn, P.: Statistical significance tests for machine translation evaluation. In: Proceedings of the 2014 Conference on Empirical Methods on Natural Language Processing, EMNLP 2014, pp. 388–395 (2004)
  • 11.Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: Proceedings of ACL2004 Workshop, pp. 74–81 (2004)
  • 12.Miao, Y., Blunsom, P.: Language as a latent variable: discrete generative models for sentence compression. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pp. 319–328 (2016)
  • 13.Nema, P., Khapra, M.M., Laha, A., Ravindran, B.: Diversity driven attention model for query-based abstractive summarization. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2017, pp. 1063–1072, July 2017
  • 14.Otterbacher J, Erkan G, Radev DR. Biased LexRank: passage retrieval using random walks with question-based priors. Inf. Process. Manag. 2009;45(1):42–54. doi: 10.1016/j.ipm.2008.06.004. [DOI] [Google Scholar]
  • 15.Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pp. 1532–1543 (2014)
  • 16.Schilder, F., Kondadadi, R.: FastSum: fast and accurate query-based multi-document summarization. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL 2008, pp. 205–208 (2008)
  • 17.See, A., Liu, P.J., Manning, C.D.: Get to the point: summarization with pointer-generator networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, vol. 1, pp. 1073–1083 (2017)
  • 18.Tombros, A., Sanderson, M.: Advantages of query biased summaries in information retrieval. In: Proceedings of 21st Annual ACM/SIGIR International Conference on Research and Development in Information Retrieval, SIGIR 1998, pp. 2–10. ACM (1998)
  • 19.Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Proceedings of Twenty-Ninth Conference on Neural Information Processing Systems, NIPS 2015, pp. 2692–2700 (2015)

Articles from Advances in Information Retrieval are provided here courtesy of Nature Publishing Group

RESOURCES