Variational Bayesian Semi-supervised Keyword Extraction

Yaofang Hu; Yichen Cheng; Yusen Xia; Xinlei Wang

doi:10.1109/TPAMI.2025.3630577

. Author manuscript; available in PMC: 2026 Mar 7.

Published in final edited form as: IEEE Trans Pattern Anal Mach Intell. 2026 Mar;48(3):2790–2802. doi: 10.1109/TPAMI.2025.3630577

Variational Bayesian Semi-supervised Keyword Extraction

Yaofang Hu ¹, Yichen Cheng ², Yusen Xia ³, Xinlei Wang ⁴

PMCID: PMC12964378 NIHMSID: NIHMS2145363 PMID: 41201949

Abstract

The expansion of textual data, stemming from various sources such as online product reviews and scholarly publications on scientific discoveries, has created a significant demand for the extraction of succinct yet comprehensive information. While many methods have been proposed for automatic keyword extraction in unsupervised and fully supervised settings, effectively leveraging a partial list of known keywords, such as author-specified keywords or Twitter hashtags, remains underexplored. This work aims to enhance both the effectiveness and scalability of semi-supervised keyword extraction. We propose a novel variational Bayesian semi-supervised (VBSS) method that builds upon recent Bayesian advancement in the field, replacing computationally expensive posterior sampling with variational inference and data augmentation. This leads to closed-form updates and substantial speedups, particularly for long texts. Our numerical results show that the VBSS method not only improves performance on longer texts but also offers better control over false discovery rates compared to state-of-the-art keyword extraction techniques.

Index Terms—: Keyword extraction, variational inference, semi-supervised learning, Bayesian approach

I. Introduction

The rapid growth of big data in recent years has resulted in an influx of overwhelming information, leaving individuals susceptible to its volume. Consequently, there arises a need to distill the essence of this information, with one key approach being identification of keywords to efficiently capture the core concepts conveyed within the text. This problem has attracted efforts from researchers, partially due to its practical value. For instance, extracting keywords from texts in platforms such as TripAdvisor and Airbnb enhances recommendation accuracy while capturing significant words from a news article enables readers to quickly decide whether to read further.

Various approaches have been developed for extracting keywords from textual data, which can be grouped into three main categories based on the availability of labeled data: supervised, unsupervised, and semi-supervised methods [1]. Supervised methods, such as those described by [2], [3], and [4], rely on a labeled corpus of articles to train algorithms. These approaches can achieve high accuracy, due to the use of high-quality labeled data; however, acquiring such data requires significant human effort. As a result, the practical applicability of supervised methods is limited by the need for extensive data labeling. In contrast, unsupervised methods do not require training data and can be further divided into four branches: graph-based, statistic-based, linguistic-based, and entropy-based.

Graph-based techniques represent documents as graphs, with nodes representing words, and edges representing some relationship between nodes such as co-occurrence, syntax, or semantics [5]. TextRank (TR, 6) is the first graph-based algorithm. The key idea of TR is to transform a document into a graph and compute the importance scores $θ = {(θ_{1}, \dots, θ_{n})}^{T}$ for the $n$ candidate words. TR is motivated by the idea of PageRank [7], Google’s famous algorithm on ranking webpages. Intuitively, TR assumes words that have frequent occurrences or co-occur with other important words are more likely to be important. Formally, TR finds the importance scores that solve the equation $θ = (1 - d) 1_{n} + d G^{T} θ$ for $θ$ , where $d$ is a damping factor (typically set to be 0.85), $1_{n}$ is a vector of $n$ 1’s, $G = D^{- 1} A$ is the normalized adjacency matrix, $A$ is the weighted adjacency matrix whose $(i, j)$ -th entry represents the relation between the $i$ -th and $j$ -th words, and $D$ is the degree matrix whose diagonals equal the row sums of $A$ and off-diagonals are 0’s. For instance, the ( $i, j$ )-th entry of $A$ derived via co-occurrence rules is the number of co-occurrences of the $i$ -th and $j$ -th words within a fixed-width window. Following the success of TR, several variants have emerged, altering how the graph is generated. For example, TopicRank (TpR, 8) represents topics as nodes and semantic relations as edges. PositionRank (PosR, 9) incorporates the position of words into importance scores, and MultipartiteRank (MR, 10) combines both topic and position information.

Statistic-based methods rank words based on statistical measures. Examples include TF-IDF [11], which combines term frequency and inverse document frequency; KP-Miner [12], which combines TF-IDF with factors like word position; YAKE [13], which relies on various factors like word frequencies and position; and keyBERT [14] which leverages BERT embeddings and cosine similarity for keyword extraction. Linguistics-based methods detect keywords by exploiting linguistic features such as lexical analysis [15]. Entropy-based methods rank keywords by quantifying their information content based on the spatial distribution of word occurrences within the text. Examples include [16] and [17]. A recent development along this line is BRYT [18], a hybrid unsupervised method that integrates outputs from BERT [19], RAKE [20], YAKE, and TextRank, and rescoring the candidates based on cosine similarity. Unlike most approaches aimed at unstructured natural language documents, BRYT targets open data platforms and operates on short, structured metadata such as titles and descriptions. Unsupervised methods offer greater applicability and flexibility in real-world situations as they do not rely on expensive labeled data. However, they are more susceptible to noise and can produce less desirable results than supervised techniques.

To strike a balance between the cost of labeling texts and the accuracy of keyword identification, researchers have resorted to semi-supervised methods. These approaches often assume that a small subset of keywords is known in advance, and the challenge lies in incorporating this partial information effectively. In practice, such subsets can come from various sources, such as hashtags in Twitter posts or a limited number of keywords specified by authors in academic papers. While research on semi-supervised methods has gained attention in recent years, it is still relatively limited compared to supervised and unsupervised methods. Among them, [21] developed an interesting method (labeled SS) that integrates the partial label information into the calculation of importance scores while preserving the so-called “local consistency” [22] by solving $θ = (1 - d) y + d Q^{T} θ$ for $θ$ (equivalently, by minimizing $\sum_{i, j} A_{i j} {(\frac{1}{\sqrt{D_{i i}}} θ_{i} - \frac{1}{\sqrt{D_{j j}}} θ_{j})}^{2} + (1 - d) / d ‖ θ - y ‖^{2})$ . Here, $y = {(y_{1}, \dots, y_{n})}^{T}$ is a vector of observed labels, with $y_{i} = 1$ if the $i$ -th word is observed to be a keyword and 0 otherwise; $Q = D^{- 1 / 2} A D^{- 1 / 2}$ is another version of the normalized adjacency matrix, with $D$ and $A$ defined in the previous paragraph; and $d$ is still the damping factor. A penalty is placed on the distance between observed label $y$ and the importance scores $θ$ to make sure other words learn the importance scores from the observed words. As another state-of-the-art semi-supervised keyword extraction approach, the Bayesian semi-supervised (BSS) method [1] integrates the importance scores $θ$ from TextRank into a Bayesian logistic regression model and uses the partial label information $y$ to formulate the likelihood function. [23] and [24] also fall into the category of semi-supervised methods, but they differ significantly from SS and BSS approaches and assume that within a collection of documents, a small proportion of the documents are fully labeled while the remainder are unlabeled. In this paper, like SS and BSS, we assume for each document, a subset of the keywords is known. Thus, we exclude these two methods from our comparative study. We also refer readers to [1, 25,–28] for more in-depth information on the selected benchmark methods.

Semi-supervised methods offer advantages over both supervised and unsupervised ones. However, the aforementioned approaches have their own limitations. For example, the SS method requires manual threshold determination for selecting words with top importance scores. In contrast, the BSS method selects the threshold according to a specified false discovery rate (FDR), but it is computationally intensive and less suitable for long articles. These limitations motivate the need for a method that is both flexible and computationally efficient.

To this end, we propose a novel method, Variational Bayesian Semi-supervised (VBSS) Keyword Extraction. Built on BSS, the VBSS introduces key distinctions in both computation and model setup. First, in terms of computation, the VBSS algorithm employs mean-field variational inference (VI) [29, 30] for posterior approximation, combined with the statistical technique of data augmentation for binary outcomes [31], to ensure closed-form solutions for all parameters. This significantly enhances computational efficiency compared to BSS, which relies on computationally intensive Markov chain Monte Carlo (MCMC) sampling. Second, regarding model structure, the VBSS model incorporates a probit link function to connect importance scores, $θ$ , with keyword probabilities, denoted as $p = {(p_{1}, \dots, p_{n})}^{T}$ . In contrast, the BSS model uses a logit link function, which does not yield a closed-form VI solution for updating $θ$ . Additionally, extra parameters are introduced to increase model flexibility, thereby improving accuracy. Finally, in terms of performance, our numerical experiments show that the VBSS method consistently ranks among the top performers, frequently surpassing state-of-the-art methods in precision, recall and F-measure, while also significantly improving computational efficiency for both short and long articles over BSS.

The remainder of this paper is organized as follows. Section II introduces the construction of the Bayesian model and a detailed description of the VBSS algorithm, including computation and the pseudo code. We compare our VBSS and existing methods using real-world data sets in Section III. Section IV summarizes our work and discusses future directions.

II. Proposed Variational Bayesian Semi-Supervised Keyword Extraction Method

A. Bayesian Hierarchical Modeling

In our semi-supervised model setting, we assume that only a subset of the actual keywords in an article is known (or observed), meaning each candidate word (say, the $i$ -th) is associated with two labels: the observed label $y_{i}$ and the true label $y_{i}^{*}, i = 1, \dots, n$ . The observed label $y_{i} = 1$ indicates that the $i$ -th term is known to be a keyword while $y_{i} = 0$ means it is uncertain whether the term is a keyword; and the true label $y_{i}^{*}$ equals 1 if the $i$ -th term is indeed a keyword and 0 otherwise. Here, only true keywords can be observed; that is, $y_{i} = 1$ implies $y_{i}^{*} = 1$ (i.e., observed positives must be true positives). On the other hand, there is a (large) possibility that true keywords are not observed; that is, given $y_{i} = 0, y_{i}^{*}$ can take values of either 0 or 1 (i.e., observed negatives are not necessarily true negatives). Let $α_{i}$ denote the conditional probability that the $i$ -th candidate word is not observed to be a keyword given it actually is, and $α = {(α_{1}, \dots, α_{n})}^{T}$ denote the vector of these conditional probabilities. Thus,

\{\begin{array}{l} P (y_{i} = 0 ∣ y_{i}^{*} = 1) = α_{i} \\ P (y_{i} = 1 ∣ y_{i}^{*} = 1) = 1 - α_{i} \\ P (y_{i} = 0 ∣ y_{i}^{*} = 0) = 1 \\ P (y_{i} = 1 ∣ y_{i}^{*} = 0) = 0 \end{array}

(1)

For word $i$ , to link its probability of being a keyword to its importance score, we assume $p_{i} = P (y_{i}^{*} = 1) = Φ (a + b θ_{i})$ , where $Φ (\cdot)$ is the cumulative distribution function of $𝒩 (0,1)$ , and $a$ and $b$ represent the intercept and slope parameters in this generalized linear model using the probit link function $Φ^{- 1} (\cdot)$ . The likelihood function is

p (y ∣ a, b, θ, α) = \prod_{i = 1}^{n} {{[(1 - α_{i}) Φ (a + b θ_{i})]}^{y_{i}} \cdot {[1 - (1 - α_{i}) Φ (a + b θ_{i})]}^{1 - y_{i}}} .

(2)

It is worth noting that the BSS model assumes $p_{i} = {l o g i t}^{- 1} (θ_{i})$ instead. With two added parameters $a$ and $b$ , our proposed model gains increased flexibility. Furthermore, when switching from the logit link to the probit link, it facilitates efficient algorithm design by providing a closed-form solution at each step, as will be shown in Section II-B3. We point out that $θ$ represents the importance scores derived via different methods. Since methods can represent documents in graphs and calculate the importance scores in different ways, the scales of different $θ$ might vary. However, they carry the same conceptual meaning, which is the relative importance of candidate words. To incorporate the graph structure of the article and the observed label information, we consider a multivariate normal prior on $θ = {(θ_{1}, \dots, θ_{n})}^{T}$ :

π (θ ∣ σ^{2}) = 𝒩 (θ_{0}, B^{- 1} {(B^{- 1})}^{T} σ^{2}),

(3)

where $θ_{0} = (1 - d) B^{- 1} y$ (i.e., the solution from [21]) and $B = I - d Q^{T}$ , with $Q$ and $d$ defined in the introduction and $I$ being the $n \times n$ identity matrix. For simplicity, we denote $B^{- 1} {(B^{- 1})}^{T}$ by $U$ . We use $θ_{0 i}$ to denote the $i$ -th element of $θ_{0}$ , and $U_{i i}$ to denote the $i$ -th diagonal element of $U$ .

We set the priors of $a$ and $b$ to be $𝒩 (0, σ_{a}^{2})$ and $𝒩 (0, σ_{b}^{2})$ , respectively. We choose an inverse gamma prior $I G (τ, τ)$ for $σ^{2}$ , and a uniform prior $π (α_{i}) = 1$ for $α_{i}, α_{i} \in [0, 1]$ . Note that $σ_{a}^{2}, σ_{b}^{2}$ and $τ$ are user-defined hyperparameters. We suggest using diffuse or vague priors to reflect common situations where no meaningful prior knowledge exists about the hyperparameters. As such, we set $σ_{a}^{2} = σ_{b}^{2} = 10$ and $τ = 0.1$ . This choice ensures that the priors minimally influence the posterior distribution, in line with the principle of “letting the data speak” (i.e., objective inference).

B. A Variational Bayesian Approach

1). preliminaries for variational inference:

The gist of VI is to find a joint probability density function (within a candidate approximation density family) that best approximates the posterior distributions of parameters of interest in terms of Kullback-Leibler ( $𝒦 ℒ$ ) divergence. Without loss of generality, for this subsection only, suppose we have parameters $θ = {(θ_{1}, \dots, θ_{m})}^{T}$ and data $y = {(y_{1}, \dots, y_{n})}^{T}$ , where $m$ is the number of parameters and $n$ is the number of observations (in our specific context, $n$ is the number of candidate words and $m = n$ ). We use $𝒬$ to denote the family of the candidate approximation densities ( $q (θ)$ ) to the exact posterior distribution of $θ (p (θ ∣ y))$ . The “best” candidate $q^{*} (θ)$ is the one that minimizes its $𝒦 ℒ$ divergence to the posterior $p (θ ∣ y)$ . That is, $q^{*} (θ) = a r g {m i n}_{q (θ) \in 𝒬} 𝒦 ℒ (q (θ) ‖ p (θ ∣ y))$ . It turns out that the $𝒦 ℒ$ divergence is intractable since

𝒦 ℒ (q (θ) ‖ p (θ ∣ y)) = - \int q (θ) ln (\frac{p (θ ∣ y)}{q (θ)}) d θ = 𝔼_{q (θ)} [ln q (θ)] - 𝔼_{q (θ)} [ln p (θ ∣ y)] = 𝔼_{q (θ)} [ln q (θ)] - 𝔼_{q (θ)} [ln p (θ, y)] + ln p (y),

(4)

which involves the term $l n p (y)$ [29]. Here, $𝔼_{q (θ)} [\cdot]$ denotes taking the expectation with respect to $q (θ)$ . Since $l n p (y)$ does not depend on $q$ and can be treated as a constant, we can maximize an alternative quantity $𝔼_{q (θ)} [l n p (θ, y) - l n q (θ)]$ , known as the evidence lower bound (ELBO), which equals to a constant minus $𝒦 ℒ (q (θ) ‖ p (θ ∣ y))$ . We follow the mean-field VI machinery, and assume the variational distribution over $θ$ can be factorized as $q (θ_{1}, \dots, θ_{n}) = \prod_{i = 1}^{n} q (θ_{i})$ . We note that the mean-field VI is closely related to the approximation framework of the mean field theory [32] in physics. The coordinate ascent algorithm is often used to maximize the ELBO by iteratively updating the variational distribution $q (θ_{i})$ while holding $q (θ_{- i})$ (i.e., all other variational distributions) fixed. In each iteration, it can be shown that $q^{*} (θ_{i}) \propto e x p \{𝔼_{- θ_{i}} [l o g (p (θ_{i} ∣ θ_{- i}, y))]\}$ , which is obviously reminiscent of Gibbs sampling [33]. This suggests that the “best” candidate can be identified by sequentially updating the variational distribution for each coordinate and repeating the process until convergence. In practice, the number of iterations needed is usually much smaller than that of MCMC sampling. Therefore, variational inference achieves computational efficiency by transforming a sampling problem into an optimization problem.

2). Data Augmentation:

With the current model setup, not all variational distributions have closed-form updates in the mean-field VI stage, making the computation difficult. To solve this problem, we adopt the data augmentation technique from [31], commonly used in statistical literature [34, 35], where ‘data augmentation’ refers to methods that introduce unobserved data or latent variables to facilitate iterative optimization or sampling algorithms [35]. To do so, we introduce latent variables $z = {(z_{1}, \dots z_{n})}^{T}$ , whose signs determine the values of the true labels $y^{*}$ ; that is, $y_{i}^{*} = 1$ if $z_{i} > 0$ and 0 otherwise. Therefore, the true label $y_{i}^{*}$ may be regarded as the dichotomized form of the latent variable $z_{i}$ and $z_{i}$ can be viewed as the continuous counterpart of $y_{i}^{*}$ . Because of the probit link function used in Section [II], we can show $z_{i} \sim 𝒩 (a + b θ_{i}, 1)$ . Then we have

P (y_{i} ∣ z_{i}, α_{i}) = P (y_{i}, y_{i}^{*} = 1 ∣ z_{i}, α_{i}) + P (y_{i}, y_{i}^{*} = 0 ∣ z_{i}, α_{i}) = {\begin{array}{l} (1 - α_{i}) \cdot P (y_{i}^{*} = 1 ∣ z_{i}) & if y_{i} = 1 \\ α_{i} \cdot P (y_{i}^{*} = 1 ∣ z_{i}) + P (y_{i}^{*} = 0 ∣ z_{i}) & if y_{i} = 0 . \end{array} = {[(1 - α_{i}) \cdot 𝟙 (z_{i} > 0)]}^{y_{i}} \cdot {[α_{i} \cdot 𝟙 (z_{i} > 0) + 𝟙 (z_{i} \leq 0)]}^{1 - y_{i}}

(5)

A diagram is shown in Fig. 1 to illustrate the hierarchical structure of our proposed VBSS model, where the auxiliary variable z_i is shown in a dashed circle to indicate it is introduced to facilitate computation.

3). Variational Bayesian Semi-supervised Keyword Extraction:

With the introduction of the latent variables $z = {(z_{1}, \dots z_{n})}^{T}$ , the joint posterior distribution becomes

p (z, θ, a, b, σ^{2}, α ∣ y) \propto p (y ∣ z, α) p (z ∣ θ, a, b) π (θ ∣ σ^{2}) π (a) \cdot π (b) π (σ^{2}) π (α) = [\prod_{i = 1}^{n} p (y_{i} ∣ z_{i}, α_{i}) p (z_{i} ∣ θ_{i}, a, b) π (α_{i})] \cdot π (θ ∣ σ^{2}) π (a) π (b) π (σ^{2}) = \prod_{i = 1}^{n} {{[(1 - α_{i}) \cdot 𝟙 (z_{i} > 0)]}^{y_{i}} \cdot {[α_{i} \cdot 𝟙 (z_{i} > 0) + 𝟙 (z_{i} \leq 0)]}^{1 - y_{i}} \cdot 𝒩 (a + b θ_{i}, 1)} \cdot 𝒩 (θ_{0}, U σ^{2}) \cdot 𝒩 (0, σ_{a}^{2}) \cdot 𝒩 (0, σ_{b}^{2}) \cdot I G (τ, τ)

(6)

Now, our task is to find the variational distributions such that

p (z, θ, a, b, σ^{2}, α ∣ y) \approx [\prod_{i = 1}^{n} q (z_{i})] [\prod_{i = 1}^{n} q (θ_{i})] \cdot q (a) q (b) q (σ^{2}) [\prod_{i = 1}^{n} q (α_{i})] .

(7)

Following the standard literature on mean-field variational inference ([29]; see also Section II-B1), the optimal variational distribution for $θ_{i}$ satisfies $l n q^{*} (θ_{i}) \propto 𝔼 [l o g (p (θ_{i} ∣ θ_{- i}, y))]$ . Thus, we have:

ln q^{*} (θ_{i}) = 𝔼_{q (- θ_{i})} [ln [p (y ∣ z, α) p (z ∣ θ, a, b) π (θ ∣ σ^{2}) π (a) π (b) π (σ^{2}) π (α)]] + const = 𝔼_{q (z_{i}, a, b)} [ln p (z_{i} ∣ θ_{i}, a, b)] + 𝔼_{q (σ^{2})} [ln π (θ_{i} ∣ σ^{2})] + const = 𝔼_{q (z_{i}, a, b)} [- \frac{1}{2} {[z_{i} - (a + b θ_{i})]}^{2}] + 𝔼_{q (σ^{2})} [- \frac{1}{2 U_{i i} σ^{2}} {(θ_{i} - θ_{0 i})}^{2}] + const = - \frac{1}{2} (𝔼_{q (b)} [b^{2}] + \frac{1}{U_{i i}} 𝔼_{q (σ^{2})} [\frac{1}{σ^{2}}]) \cdot θ_{i}^{2} + (𝔼_{q (b)} [b] 𝔼_{q (z_{i})} [z_{i}] - 𝔼_{q (a)} [a] 𝔼_{q (b)} [b] + \frac{1}{U_{i i}} 𝔼_{q (σ^{2})} [\frac{1}{σ^{2}}] θ_{0 i}) \cdot θ_{i} + const,

(8)

which is the log of the probability density function of a normal distribution. By completing the square term, we obtain

q (θ_{i}) = 𝒩 (μ_{θ_{i}}, S_{θ_{i}}) μ_{θ_{i}} = S_{θ_{i}} \cdot (𝔼_{q (b)} [b] 𝔼_{q (z_{i})} [z_{i}] - 𝔼_{q (a)} [a] 𝔼_{q (b)} [b] + \frac{1}{U_{i i}} 𝔼_{q (σ^{2})} [\frac{1}{σ^{2}}] θ_{0 i}) S_{θ_{i}} = {(𝔼_{q (b)} [b^{2}] + \frac{1}{U_{i i}} 𝔼_{q (σ^{2})} [\frac{1}{σ^{2}}])}^{- 1}

(9)

Similarly, we derive the variational distributions of $z_{i}, a, b$ , $σ^{2}$ , and $α_{i}$ , respectively. The details are provided in Supplementary Material. As shown in the derivation for $θ_{i}$ above, the optimized variational distribution of a certain parameter involves the expectations of other parameters, which can be obtained straightforwardly based on each parameter’s variational distribution.

With the expectations listed in Section S2 in the Supplementary Material, we can update the variational posterior distributions iteratively. At convergence, the final output is a vector of estimated probability of each word being a keyword $\hat{p} = ({\hat{p}}_{1}, \dots, {\hat{p}}_{n})$ , where ${\hat{p}}_{i} = Φ ({\hat{m}}_{i})$ and ${\hat{m}}_{i}$ is the last update of $m_{i} = 𝔼_{q (a)} [a] + 𝔼_{q (b)} [b] \cdot 𝔼_{q (θ_{i})} [θ_{i}]$ .

To make the final decision of whether a candidate word is a keyword or not, following BSS, we adopt an FDR control machinery [36] to set the threshold, and words with probabilities larger than that will be selected as keywords. Given a probability threshold $h$ , the estimated FDR is

\hat{F D R} (h) = \frac{\sum_{i = 1}^{n} [(1 - {\hat{p}}_{i}) 𝟙 ({\hat{p}}_{i} \geq h)]}{\sum_{i = 1}^{n} 𝟙 ({\hat{p}}_{i} \geq h)},

(10)

where $𝟙 (\cdot)$ denotes the indicator function, equal 1 if the condition inside holds, and 0 otherwise. With a pre-specified FDR cutoff $γ$ such as 0.05, 0.1, or 0.15, we select the largest $h$ such that $\hat{F D R} (h) \leq γ$ .

Based on the output $\hat{p} = ({\hat{p}}_{1}, \dots, {\hat{p}}_{n})$ of VBSS, we proceed to calculate the estimated FDR with $h$ being each possible ${\hat{p}}_{i}$ . Then, for a pre-specified $γ$ , we aim to identify the smallest ${\hat{p}}_{i}$ (say, ${\hat{p}}_{k}$ ) such that $\hat{F D R} (h) \leq γ$ is satisfied. Consequently, any candidate word with a probability larger than ${\hat{p}}_{k}$ is considered as a keyword. We emphasize that this FDR-based identification approach offers the flexibility to select a varying number or proportion of keywords from different articles. This sets both BSS and VBSS apart from many existing methods that rely on a fixed threshold on the number or proportion of keywords [8, 12, 21, 37]. These conventional methods inherently make the unrealistic assumption of a uniform proportion of keywords across all documents.

4). Algorithm:

Based on the techniques and derivations discussed in the previous subsections, the pseudocode for VBSS is outlined in Algorithm 1. The algorithm takes as input the hyperparameters for the priors, $σ_{a}^{2}, σ_{b}^{2}$ and $τ$ , as well as a convergence threshold and the maximum number of iterations allowed. The initial step involves constructing an undirected weighted graph from the input document, following the graph construction method recommended by [6]. Specifically, the edge weight between any two words is set to be their number of co-occurrences within a window of two words.

To initiate the iterative computation of expectations of $z, θ, a$ , $b, α$ , and $σ^{2}$ , appropriate initial values of these expectations need to be determined. We use $θ_{0 i}$ (the solution from SS) as the initial point of $𝔼_{q (θ_{i})} [θ_{i}]$ ; and 0, 1 and 1 as the initial values of $𝔼_{q (a)} [a], 𝔼_{q (b)} [b]$ and $𝔼_{q (σ^{2})} [1 / σ^{2}]$ , respectively; and to initialize $𝔼_{q (α_{i})} [α_{i}]$ , we suggest using a roughly estimated proportion for unobserved keywords in the article, typically between 0.5 and 0.7. Then the VBSS algorithm will iteratively update the expectations based on what we obtained in Section S2 in the Supplementary Material, and check the convergence at each iteration. Based on our model, in the $j$ -th iteration, the probability of being a keyword for the $i$ -th candidate is estimated to be ${\hat{p}}_{i}^{(j)} = Φ ({\hat{m}}_{i}^{(j)}), j \in (1, \dots, miter)$ and $i \in (1, \dots, n)$ . The algorithm stops when the probability vectors of two successive iterations are sufficiently close. The final output of VBSS is a vector of probabilities: $\hat{p} = ({\hat{p}}_{1}, \dots, {\hat{p}}_{n})$ , the $i$ -th element of which represents the possibility of the $i$ -th candidate being a keyword. The per-iteration cost of updating the variables $θ, z, α$ is $𝒪 (n)$ . The FDR-based selection step adds $𝒪 (n l o g n)$ . Thus, the overall cost is $𝒪 (T n + n l o g n)$ where $T$ denotes the number of iterations. In practice, $T$ is typically small since variational inference converges rapidly, so the runtime scales essentially linearly in $n$ .

III. Experiments and Results

A. Experimental Setup

We compare the proposed VBSS with state-of-the-art methods that have open-source implementations, evaluating their performance in keyword identification and time efficiency across various real-world datasets with known labels. Supervised approaches are excluded from the comparison due to their need for extensive fully labeled data for training. Additionally, we omit unsupervised and semi-supervised methods that rely on external information, such as topic modeling results. The methods compared include three semi-supervised approaches (SS, BSS, and VBSS) and eight unsupervised methods (TR, TpR, PosR, MR, TF-IDF, KPMiner, YAKE and keyBERT). Among these, VBSS and BSS are Bayesian methods. For the implementation of the TR, SS and BSS methods, we utilize the R [38] code provided in [1]. TpR, PosR, MR, TF-IDF, KPMiner, and YAKE are implemented using the Python toolkit pke (python keyphrase extraction) developed by [39]. keyBERT is implemented using the Python code provided by [14]. We develop our own R code to implement VBSS, which is publicly available at github.com/YaofangHuYaofang/VBSS. All experiments were conducted on a desktop computer running Windows 11, with a 13th Gen Intel^® Core^™ i9-13950HX CPU at 2.20 GHz and 32 GB of RAM. No GPU acceleration was used. For simplicity, we set α_i = α and put a uniform prior on α (see Section S3 of the Supplementary Material for detail). The key settings for VBSS are summarized in Table S1 therein. Default values are used for the parameters of the other algorithms. Note that we use these settings across all numerical experiments. Due to the high computational demand of BSS, which uses Metropolis-Hastings MCMC sampling, it is excluded from the comparison of long-document datasets in Section III-B1. However, it is included in the time consumption analysis on short documents in Section III-B4

Algorithm 1.

Variational Bayesian Semi-supervised Keyword Extraction (VBSS)

Input	A document and a list of observed keywords; hyperparameters $τ, σ_{a}^{2}$ and $σ_{b}^{2}$ ; the damping factor $d$ ; the convergence threshold $ϵ$ ; the maximum number of iterations miter.

Step 1	Construct an undirected weighted graph from the input document.
Step 2	Specify initial values for expectations of $z, θ, a, b, σ^{2}$ and $α$ at $j = 0$ .
Step 3	For $j \in (1, \dots miter)$ : update the expectations of $z, θ, a, b, σ^{2}$ and $α$ , based on expectations listed in Section S2 in the Supplementary Material. compute ${\hat{p}}_{i}^{(j)} = Φ ({\hat{m}}_{i}^{(j)})$ for $i \in (1, \dots, n)$ . If the average probability over candidate words between the $j$ -th and $(j - 1)$ -th iteration. $\sum_{i = 1}^{n} \|{\hat{p}}_{i}^{(j)} - {\hat{p}}_{i}^{(j - 1)}\| / n < ϵ$ , break else continue end if End for

Output	A vector of probabilities $\hat{p} = ({\hat{p}}_{1}, \dots, {\hat{p}}_{n})$ , the $i$ -th element of which represents the possibility of the $i$ -th candidate being a keyword.

Open in a new tab

Since VBSS and BSS require a partial list of keywords from each document to infer the remaining keywords, we exclude datasets with limited ground-truth keyword lists, such as KDD (average 5.07 keywords) [40] and www (average 5.80 keywords) [40]. Instead, we use ten benchmark datasets of long English articles from various domains (bioinformatics, food and agriculture, and computer science) and document types (journal articles, conference papers, theses, and technical reports). The datasets are: citeulike180 [41], fao30 and fao780 [42], Krapivin2009 [43], Nguyen2007 [44], PubMed [45], SemEval2010 [46], Schutz2008 [47], theses100, and wiki20 [48]. Within each collection, we process the articles and keywords by following standard natural language processing (NLP) steps, including tokenization, part-of-speech tagging (POS-tagging), stop word removal and stemming, through R packages “tm”, “dplyr”, “udpipe”, “SnowballC” and “textstat”. Sample preprocessing code is available on GitHub. Tokenization splits text into individual words, which become the basic units for analysis. POS tagging assigns words to grammatical categories such as nouns or verbs. Stop words, which do not add much meaning to a sentence, such as prepositions and conjunctions, are removed, and stemming is applied to reduce words to their root forms (e.g., both “confidence” and “confident” become “confid”). Finally, words that appear only once or twice are also excluded, as they are unlikely to be keywords. After these steps, we retain documents with at least 11 keywords and randomly select 5 as the “observed partial list of keywords.” Table I summarizes the characteristics of the preprocessed datasets, including the number of keywords, candidate words, and keyword proportions. Further details can be found in the dataset references and [13].

TABLE I:

The description and summary statistics of the benchmark datasets. All datasets have been preprocessed to remove stop words. We convert all words into their stems and removed words appearing less than three times for each dataset. We only keep documents with at least 11 keywords for all ten datasets. The numbers of the remaining documents are listed in the “Description” column.

Dataset ID	Name	Description		Min	Q ₁	Q ₂	Mean	Q ₃	Max
1	citeulike180 [41]	54 papers in bioinformatics	#keywords	11	12.25	14	13.9	15	17
			#candidates	139	202.2	225	217.2	235.5	255
			%keywords	4.74	5.59	6.10	6.57	7.19	10.07

2	fao30 [42]	30 documents randomly selected from Food and Agriculture Organization (FAO) repository	#keywords	14	20.25	24	23.97	26	37
			#candidates	141	215.2	246	231.3	256.8	276
			%keywords	6.2	8.94	9.56	10.67	10.93	20.22

3	fao780 [42]	172 documents randomly selected from FAO repository	#keywords	11	12	13	13.12	14	20
			#candidates	74	244	254	246.9	265	289
			%keywords	3.83	4.56	5.12	5.60	5.72	20.27

4	Krapivin2009 [43]	172 computer science papers published by ACM	#keywords	11	12	14	15.66	18	36
			#candidates	158	248.8	299	307.3	357.8	542
			%keywords	2.46	3.71	4.81	5.39	6.36	14.62

5	Nguyen2007 [44]	129 science conference papers	#keywords	11	14	16	18.21	22	49
			#candidates	108	205	230	223.6	248	293
			%keywords	4.03	6.05	7.76	8.32	9.65	22.39

6	PubMed [45]	319 full papers from a library of biomedical papers	#keywords	11	11	14.5	15.2	17.8	25
			#candidates	73	179	220.5	202.8	242.5	262
			%keywords	4.9	6.3	6.9	8.0	8.3	17.8

7	SemEval2010 [46]	239 full papers from ACM digital library	#keywords	11	17	20	20.82	24	39
			#candidates	128	272.5	305	308.1	334.5	550
			%keywords	2.7	5.5	6.8	6.9	8.1	15.2

8	Schutz2008 [47]	1192 papers from PubMed central	#keywords	11	28	34	35.26	42	103
			#candidates	28	114	151	162.4	195	964
			%keywords	3.15	19.09	23.05	23.58	27.27	51.72

9	theses100	10 full master and Ph.D. theses from various domain	#keywords	11	11	12.50	12.92	13.25	21.00
			#candidates	632	726	795	809	891.2	1049
			%keywords	4.4	4.59	5.00	5.49	6.34	7.79

10	wiki20 [48]	20 technical research reports in computer science	#keywords	20	25.75	29.5	29.55	33.25	39
			#candidates	170	211.2	241	266.3	283.8	506
			%keywords	4.55	10.05	11.72	11.96	13.79	18.29

Open in a new tab

As mentioned before, the VBSS approach utilizes an FDR control procedure to select keywords, with the number of selected keywords determined by the pre-specified threshold γ, similar to BSS. In contrast, all the other methods output importance scores and require manual selection of a cutoff to determine the total number of identified keywords in each document. To ensure a fair comparison, we control the total number of keywords identified for those methods that require manual selection to be the same as the total number of keywords identified by VBSS. To achieve this, given a corpus consisting of multiple documents, we calculate the proportion of keywords identified by VBSS across all documents among all candidates (say, r%). Thereafter, for other methods that require manual selection, we select the r% (rounded) top-ranked words for each article. Each method’s performance is evaluated using precision, recall, and F-measure, where precision = TP/ (TP + FP), recall = TP/ (TP + FN), and F-measure = 2 × precision ×recall/ (precision + recall), where TP stands for true positives, FP for false positives, TN for true negatives, and FN for false negatives. F-measure can be viewed as an overall performance measure that combines both precision and recall.

B. Results

In this subsection, we present the experimental results comparing VBSS with the state-of-the-art methods. Our analysis focuses on four key objectives: (1) evaluating the overall performance of various methods in keyword extraction, (2) examining factors influencing VBSS’s performance, including the number of observed keywords k, choices of observed keywords for a fixed k, the damping factor d, and the proportion of keywords, (3) presenting a downstream use-case of VBSS in a document filtering scenario, and (4) comparing the computational efficiency of the methods, with a particular focus on the comparison between VBSS and BSS, the two Bayesian approaches.

1). The overall performance in keyword extraction:

In comparison to semi-supervised methods, which leverage a partial list of observed keywords, unsupervised methods face a disadvantage as they lack access to the label information. To ensure a fair comparison, we adjust the unsupervised methods by forcing the observed keywords to be predicted as positive even if they are predicted as negatives. More specifically, for a document of n candidate words, 5 observed keywords are automatically set to be true keywords, among other n − 5 words, the top n × r% − 5 (rounded) candidate words, based on importance scores, are selected as the predicted keywords (along with the 5 known keywords). In our pre-analysis, this adjustment led to significant improvements in precision, recall, and F-measure for the unsupervised methods while SS showed much less improvement and VBSS showed no change. For each dataset, we plot the precision, recall, and F-measure against different values of FDR cutoff γ for each method. As illustrated in Fig. 2, the VBSS method is consistently among the top performers on all datasets. It outperforms other methods on many of the datasets (e.g., citeulike180, fao780, Nguyen2007, PubMed, and SemEval2010), achieving the highest precision and F-measure across most γ values (except for γ = 0.05). Among the competing methods, the semi-supervised SS and unsupervised KPMiner perform well, occasionally surpassing VBSS under certain conditions. TF-IDF ranks next in overall performance, followed by TR.

Fig. 2: — The precisions, recalls and F-measures v.s. different FDR control levels of γ for various keyword extraction approaches for datasets in Table I.

An analysis of how often each method achieves the highest F-measure across the 10 datasets reveals insightful patterns. We present these results in pie charts, with each chart representing a different FDR cutoff value (γ). Note that for certain datasets, multiple methods may achieve the same highest F-measure. Fig. 3 summarizes these findings. When γ = 0.05, SS emerges as the most frequent top performer, achieving the best F-measure in half of the datasets. However, as γ increases to 0.1, VBSS surpasses SS, demonstrating superior performance in over a quarter of the datasets. For larger γ values (0.15 to 0.3), VBSS consistently leads in over half of the datasets, with the exception of γ = 0.2, where its proportion of being the best is slightly below 50%. While VBSS generally dominates, KPMiner and TF-IDF also exhibit strong performance at certain γ values, indicating their competitiveness under specific conditions.

Fig. 3: — Pie charts showing the proportion of times each method led in performance (in terms of F-measure) across 10 datasets for different γ values.

2). Factors that may influence the performance:

Using the SemEval2010 dataset, we explore different factors that may influence the performance of VBSS and other methods. The first factor we examine is the number of observed keywords k. For each article, we run three individual analyses by setting k = 3, 5, 7. The overall F-measure for each method v.s. the number of observed keywords k is shown in Fig. 4. As expected, the overall performance of all methods improves as k increases from 3 to 7, since more information becomes available (recall that for unsupervised approaches, we force all observed keywords to be positive). Specifically, when k = 3, the overall F-measures of the various methods are more comparable. However, as k increases to 7, the advantage of VBSS becomes more significant, surpassing the other methods when γ > 0.05. This illustrates the robustness of VBSS, and highlights the fact that VBSS makes use of the information more efficiently.

Fig. 4: — SemEval2010 data: the F-measures v.s. different number of observed keywords for various keyword extraction approaches with different FDR control γ.

Next, we investigate the performance variation of VBSS due to differently selected observed keywords. To do so, we randomly select an article from the SemEval2010 dataset titled “Feature Representation for Effective Action-Item Detection” [49]. After preprocessing, the article contains 331 unique candidate words and 24 true keywords. The VBSS method is applied 200 times in this article, with five randomly selected keywords serving as the observed information in each iteration. Table II presents the mean and standard deviation (SD, in parentheses) of the precision, recall, and F-measure across the 200 repetitions for different FDR control γ values. The results indicate that while the standard deviations increase with larger γ values, they remain relatively small, suggesting that the effect of different choices of observed keywords is minimal.

TABLE II:

Mean precision, recall, and F-measure with their standard deviations (in parentheses) across 200 repetitions of the VBSS algorithm implemented on a randomly selected document from the SemEval2010 dataset, each using five randomly selected keywords as observed ones, for varying γ values.

	FDR Control γ
	0.05	0.1	0.15	0.2	0.25	0.3
Precision	1 (0)	0.996 (0.026)	0.986 (0.045)	0.966 (0.061)	0.919 (0.078)	0.832 (0.086)
Recall	0.208 (0)	0.249 (0.007)	0.288 (0.013)	0.322 (0.021)	0.386 (0.031)	0.453 (0.043)
F-measure	0.345 (0)	0.398 (0.010)	0.445 (0.020)	0.484 (0.031)	0.543 (0.044)	0.586 (0.056)

Open in a new tab

We also examine the effect of the damping factor d on the performance of VBSS, which controls the probability of jumping from one vertex to another random vertex in the graph. The damping factor ranges from 0 to 1, with a default value of d = 0.85 as recommended by [7] and adopted by many subsequent studies. In the following analysis, we evaluate how VBSS’s performance changes on the SemEval2010 dataset as d varies from 0.7 to 0.95 in increments of 0.05, while keeping the selection of the five observed keywords unchanged. Fig. 5 presents the precision, recall, and F-measure as a function of d for different γ values. Evidently, precision gradually decreases as d increases from 0.7 to 0.9, but with a sharp drop at d = 0.95, while recall shows the opposite trend, gradually increasing and then spiking at d = 0.95. F-measure responds differently depending on γ: for γ ≤ 0.2, F-measure shows modest improvements with increasing d. However, for larger γ values (γ = 0.25, 0.3), an up-down pattern emerges, with performance dropping significantly at d = 0.95. Additionally, we observe that larger γ values result in greater sensitivity to changes in d, whereas smaller γ values exhibit less pronounced variation in performance. These trends are supported by ANOVA tests (see Table S2 in the Supplementary Material), which confirm that d has a statistically significant effect on precision and recall across all γ values, and on F-measure for nearly all γ values. These results confirm that the default choice of d = 0.85 is reasonable, generally yielding a high F-measure, which effectively balances precision and recall and is a crucial consideration for keyword extraction tasks.

Fig. 5: — SemEval2010 data: the relationship between the precision, recall, F-measure and damping factor d across different FDR control γ values.

A complementary analysis of how performance varies with the proportion of keywords in a document is provided in Section S6 of the Supplementary Material. All methods show a consistent decrease in F-measure as the proportion of keywords increases, though this trend becomes less pronounced with larger FDR cutoff values (γ). VBSS demonstrates strong performance, and as γ increases, it is often the best among the methods being compared.

3). Improved Filtering Demo:

To demonstrate the practical utility of VBSS in a realistic information retrieval (document filtering) setting, we develop a small-scale demo using the SemEval2010 dataset of scientific abstracts. This demo simulates how users might retrieve articles on specific topics using extracted keywords. We select five query terms: “autom,” “bayesian,” “combinatori,” “databas,” and “equilibrium,” representing commonly studied topics in areas such as probabilistic modeling, algorithm design, and data systems. These terms are chosen for their sufficient frequency in ground-truth keywords, ensuring meaningful evaluation.

For each query, documents whose ground-truth keyword set includes the term are designated as true relevant documents for retrieval performance assessment. A document is considered successfully retrieved if the query term is present in its extracted keywords. This then yields a relevance score for each method and query, defined as the proportion of true relevant documents where the query term is found among the keywords extracted by the method.

Here we compare VBSS with TR, TF-IDF, KPMiner, and SS due to their strong performance in Section III-B1. To ensure a fair comparison, we set the VBSS FDR threshold to γ = 0.3 and match the total number of keywords extracted by all baseline methods accordingly (as described in Section III-A). Table III presents the relevance scores for all methods across the five selected queries. While the queries do not always favor VBSS, it remains competitive overall, highlighting its robustness in surfacing topic-relevant keywords for effective downstream filtering or retrieval tasks driven by user-defined interests.

TABLE III:

SemEval2010 data: relevance scores for five query terms across different keyword extraction methods. A document is considered true relevant if its ground-truth keyword set contains the query term. A relevance score is defined as the proportion of true relevant documents in which the query appears among the extracted keywords. VBSS uses an FDR threshold of γ = 0.3. For other methods, the number of extracted keywords per document is adjusted to match that of VBSS, following the procedure described in Section III-A. Bold values indicate the highest relevance score for each query.

	“autom”	“bayesian”	“combinatori”	“databas”	“equilibrium ”
TR	0.222	0.333	0.154	0.667	0.381
TF-DF	0.222	0.333	0.231	1	0.619
KPMiner	0.111	0.25	0.269	0.5	0.262
SS	0.222	0.333	0.154	0.333	0.095
VBSS	0.222	0.5	0.308	1	0.476

Open in a new tab

4). Computation Efficiency:

Building on BSS, the primary motivation of the VBSS method is to enhance the efficiency of Bayesian computation through the VI mechanism. To highlight the advantages of adopting VI, we compare the running time of VBSS and BSS using the Hulth dataset [2], which contains over 2,000 abstracts from computer science papers published between 1998 and 2002. A subset of this dataset is used to benchmark BSS against existing methods in [1]. Since the dataset mainly consists of short articles, the BSS algorithm can be finished within a reasonable time frame on a system running Windows 11, equipped with a 13th Gen Intel^® Core^™ i9-13950HX CPU at 2.20 GHz.

Following the preprocessing steps in [1], after removing texts with fewer than 11 keywords, we are left with 1,459 abstracts. We do not aim to beat BSS which has demonstrated high accuracy of keyword identification for short documents. Instead, we illustrate the substantial improvement in computation efficiency. In Fig. 6, we present how much time VBSS and BSS take as the number of candidate words increases. In the left figure, the time of both methods (in the log scale) exhibits an approximate quadratic relationship to the number of candidate words n, although the trend with VBSS is harder to identify as it takes much shorter time than BSS. The boxplots on the right clearly demonstrate how much faster VBSS is than BSS. Overall, it takes VBSS 1.93 minutes to complete the identification across 1,459 abstracts while BSS spends 47.2 hours (~ 1467 times longer!). This comparison highlights the significant improvement in terms of the computation efficiency using VI, thus making VBSS an attractive option for keyword identification tasks.

Fig. 6: — Hulth abstract data: the plot of running time in seconds (log scale) of VBSS and BSS versus the number of candidate words n (top) and the boxplots of the time consumption of VBSS and BSS. BSS is visualized in grey and VBSS is in black.

As a side note, the precision, recall, and F-measure of VBSS and BSS across different FDR cutoff values are displayed in Table S3 in Supplementary Material for interested readers. VBSS achieves higher precision, while BSS delivers better recall, regardless of the chosen γ values. For the overall performance (F-measure), VBSS outperforms BSS for large γ values (i.e., 0.25 and 0.3) while BSS method is better for smaller γ values.

Finally, we evaluate the total running time of various methods across all documents within each dataset listed in Table I that contains long articles. As shown in Fig. 7, the VBSS method demonstrates impressive computational efficiency despite its Bayesian nature. Its performance is comparable to most non-Bayesian approaches and even surpasses them on certain datasets. Note that the BSS method was excluded from this analysis due to its prohibitive computational cost. For example, BSS required over 16 hours to process the Wiki20 dataset (20 articles), while Yake, the slowest among the remaining methods, completed the same task in approximately 13 minutes and VBSS completed it in 0.42 minutes.

Fig. 7: — Heatmap comparing the running time (log scale) of the different methods for extracting keywords from long English articles across 10 benchmark datasets. The color scale represents running time, ranging from yellow (fastest) to purple (slowest). Due to the wide range of the running time, a logarithmic scale is used to visualize the relative performance of each method. Individual cell values are displayed in minutes.

IV. Discussion and Future Directions

The keyword identification problem has attracted significant attention and research efforts, yet the semi-supervised setting, which assumes a subset of keywords is known, remains under-explored. Recently, [1] proposed a semi-supervised Bayesian keyword identification approach that show superior performance on short articles. However, the computational burden of their proposed method has prevented it from being applied effectively on longer articles. To address this challenge, we propose a novel method called variational Bayesian semi-supervised keyword extraction (VBSS). Our approach employs variational inference to approximate the joint posterior distribution, leading to a significant reduction in computational time, thereby enabling its application to longer articles. In addition, we introduce additional parameters to allow greater model flexibility. As a result, on long articles, VBSS algorithm exhibits remarkable performance particularly with larger γ values in comparison to a broad spectrum of existing methods. In addition, although VBSS is developed mainly for long articles, our method still exhibits impressive computational efficiency in short articles while preserving competitive performance.

For practical implementation of our VBSS method, the users can specify the FDR value γ to select the number of identified keywords. We suggest using a relatively larger γ as it is usually associated with better performance as shown in real-world examples. In our VBSS algorithm, we employ a co-occurrence-based approach to construct an undirected weighted graph from the input document. While this approach has been shown to be effective [6], future work could explore alternative graph construction techniques, such as those presented in [28], to potentially further enhance performance. In particular, our model structure is compatible with any document-level graph described by a weighted adjacency matrix. For example, one may construct a similarity graph using BERT embeddings in place of co-occurrence, and directly plug in the resulting adjacency matrix as the prior. This substitution will not affect the downstream calculation of the variational distributions, or the optimization of the variational objective. Furthermore, in our preliminary experiment (results not shown for conciseness), we notice that the choice of priors on θ has an effect on the results. For a particular dataset, the user might want to try different priors such as the prior used in [1] and other reasonable alternatives that carry information about the words’ importance.

We point out several directions for future research. In our model, to ensure an analytical form of the solution in each optimization step of our variational inference, we have assumed a probit model for the keyword probability and introduced additional latent variables to augment the model. A possible alternative to our model choice is to use the well-known logit link function instead. Similar to our data augmentation idea, to circumvent the need of numerical integration, fast computation could potentially be achieved through the idea of Pó lya-Gamma augmentation [50]. In this paper, as proof of the concept, we use the mean-field variational inference which assumes independence among the parameters. We adopt the coordinate ascent algorithm to optimize the variational objective function. Another choice is stochastic variational inference, whose objective function is optimized through stochastic optimization with noisy natural gradients [51]. In addition, one could consider a more relaxed group of variational distributions that accounts for possible dependencies among the parameters, for example, through some reparameterization scheme [52–55]. Furthermore, alternative divergence beyond KL divergence, such as f-divergence [56], might be considered for tighter bounds [57]. In parallel, incorporating user-oriented evaluations, where human feedback could provide valuable practical insights into the quality and usability of the extracted keywords, is an important direction for future work. Finally, while our model assumes observed keywords are accurate, practical applications might encounter mislabeling due to noise or annotation errors. Should auxiliary information, such as confidence scores or error patterns, become available, our hierarchical framework could be extended to integrate an additional probabilistic layer for robust label reliability modeling.

Supplementary Material

supp1-3630577

NIHMS2145363-supplement-supp1-3630577.pdf^{(1.3MB, pdf)}

Biographies

graphic file with name nihms-2145363-b0008.gif

Yaofang Hu is an Assistant Professor of Applied Statistics with the Department of Information Systems, Statistics, and Management Science, University of Alabama.

graphic file with name nihms-2145363-b0009.gif

Yichen Cheng is an Associate Professor of Business Analytics with the Institute for Insight, J. Mack Robinson College of Business, Georgia State University.

graphic file with name nihms-2145363-b0010.gif

Yusen Xia is the Anne and Michael D. Easterly Distinguished Professor in Analytics at the Robinson College of Business, Georgia State University.

graphic file with name nihms-2145363-b0011.gif

Xinlei Wang is Jenkins-Garrett Endowed Professor of Statistics and Data Science in the Department of Mathematics and Director for Research, Division of Data Science, College of Science, University of Texas at Arlington.

Contributor Information

Yaofang Hu, Department of Information Systems, Statistics, and Management Science, The University of Alabama, Tuscaloosa, AL, 35487 USA.

Yichen Cheng, Institute for Insight, Robinson College of Business, Georgia State University, Atlanta, GA, 30303 USA.

Yusen Xia, Institute for Insight, Robinson College of Business, Georgia State University, Atlanta, GA, 30303 USA.

Xinlei Wang, Department of Mathematics, the University of Texas at Arlington, Arlington, TX, 76019 USA; Division of Data Science, College of Science, University of Texas at Arlington, Arlington, TX, 76019 USA.

References

[1].Wang G, Cheng Y, Xia Y, Ling Q, and Wang X, “A Bayesian semisupervised approach to keyword extraction with only positive and unlabeled data,” INFORMS Journal on Computing, vol. 35, no. 3, pp. 675–691, 2023. [Google Scholar]
[2].Hulth A, “Improved automatic keyword extraction given more linguistic knowledge,” in Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 2003, pp. 216–223. [Google Scholar]
[3].Caragea C, Bulgarov F, Godea A, and Gollapalli SD, “Citation-enhanced keyphrase extraction from research papers: A supervised approach,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1435–1446. [Google Scholar]
[4].Bordoloi M, Chatterjee PC, Biswas SK, and Purkayastha B, “Keyword extraction using supervised cumulative textrank,” Multimedia Tools and Applications, vol. 79, no. 41-42, pp. 31467–31496, 2020. [Google Scholar]
[5].Beliga S, Meštrović A, and Martinčić-Ipšić S, “An overview of graph-based keyword extraction methods and approaches,” Journal of Information and Organizational Sciences, vol. 39, no. 1, pp. 1–20, 2015. [Google Scholar]
[6].Mihalcea R and Tarau P, “Textrank: Bringing order into text,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 404–411. [Google Scholar]
[7].Brin S and Page L, “The anatomy of a large-scale hypertextual web search engine,” Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107–117, 1998. [Google Scholar]
[8].Bougouin A, Boudin F, and Daille B, “Topicrank: Graph-based topic ranking for keyphrase extraction,” in Proceedings of the Sixth International Joint Conference on Natural Language Processing, 2013, pp. 543–551. [Google Scholar]
[9].Florescu C and Caragea C, “A position-biased pagerank algorithm for keyphrase extraction,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017. [Google Scholar]
[10].Boudin F, “Unsupervised keyphrase extraction with multipartite graphs,” in 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2018, pp. 667–672. [Google Scholar]
[11].Sparck Jones K, “A statistical interpretation of term specificity and its application in retrieval,” Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972. [Google Scholar]
[12].El-Beltagy SR and Rafea A, “KP-Miner: A keyphrase extraction system for English and Arabic documents,” Information Systems, vol. 34, no. 1, pp. 132–144, 2009. [Google Scholar]
[13].Campos R, Mangaravite V, Pasquali A, Jorge AM, Nunes C, and Jatowt A, “A text feature based automatic keyword extraction method for single documents,” in European Conference on Information Retrieval. Springer, 2018, pp. 684–691. [Google Scholar]
[14].Grootendorst M, “Keybert: Minimal keyword extraction with bert.” 2020. [Online]. Available: 10.5281/zenodo.4461265 [DOI] [Google Scholar]
[15].Barzilay R and Elhadad M, “Using lexical chains for text summarization,” 1997. [Google Scholar]
[16].Herrera JP and Pury PA, “Statistical keyword detection in literary corpora,” The European Physical Journal B, vol. 63, pp. 135–146, 2008. [Google Scholar]
[17].Mehri A and Darooneh AH, “The role of entropy in word ranking,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 18-19, pp. 3157–3163, 2011. [Google Scholar]
[18].Ahmed U, Alexopoulos C, Piangerelli M, and Polini A, “Bryt: Automated keyword extraction for open datasets,” Intelligent Systems with Applications, vol. 23, p. 200421, 2024. [Google Scholar]
[19].Devlin J, Chang M-W, Lee K, and Toutanova K, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186. [Google Scholar]
[20].Campobello G, Segreto A, Zanafi S, and Serrano S, “Rake: A simple and efficient lossless compression algorithm for the internet of things,” in 2017 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 2581–2585. [Google Scholar]
[21].Li D, Li S, Li W, Wang W, and Qu W, “A semi-supervised key phrase extraction approach: Learning from title phrases through a document semantic network,” in Proceedings of the ACL 2010 Conference Short Papers, 2010, pp. 296–300. [Google Scholar]
[22].Zhou D, Bousquet O, Lal T, Weston J, and Scholkopf B, “Learning with local and global consistency,” Advances in Neural Information Processing Systems, vol. 16, 2003. [Google Scholar]
[23].Ye H and Wang L, “Semi-supervised learning for neural keyphrase generation,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4142–4153. [Google Scholar]
[24].Jonathan FC and Karnalim O, “Semi-supervised keyphrase extraction on scientific article using fact-based sentiment,” TELKOMNIKA (Telecommunication Computing Electronics and Control), vol. 16, no. 4, pp. 1771–1778, 2018. [Google Scholar]
[25].Firoozeh N, Nazarenko A, Alizon F, and Daille B, “Keyword extraction: Issues and methods,” Natural Language Engineering, vol. 26, no. 3, pp. 259–291, 2020. [Google Scholar]
[26].Bharti SK and Babu KS, “Automatic keyword extraction for text summarization: A survey,” arXiv preprint arXiv:1704.03242, 2017. [Google Scholar]
[27].Siddiqi S and Sharan A, “Keyword and keyphrase extraction techniques: a literature review,” International Journal of Computer Applications, vol. 109, no. 2, 2015. [Google Scholar]
[28].Duari S and Bhatnagar V, “scake: semantic connectivity aware keyword extraction,” Information Sciences, vol. 477, pp. 100–117, 2019. [Google Scholar]
[29].Bishop CM, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006. [Google Scholar]
[30].Blei DM, Kucukelbir A, and McAuliffe JD, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017. [Google Scholar]
[31].Albert JH and Chib S, “Bayesian analysis of binary and polychotomous response data,” Journal of the American Statistical Association, vol. 88, no. 422, pp. 669–679, 1993. [Google Scholar]
[32].Parisi G, Statistical Field Theory. Boston, MA, USA: Addison-Wesley, 1988. [Google Scholar]
[33].Blei DM and Jordan MI, “Variational inference for Dirichlet process mixtures,” Bayesian Analysis, vol. 1, no. 1, pp. 121–144, 2006. [Google Scholar]
[34].Tanner MA and Wong WH, “The calculation of posterior distributions by data augmentation,” Journal of the American statistical Association, vol. 82, no. 398, pp. 528–540, 1987. [Google Scholar]
[35].Van Dyk DA and Meng X-L, “The art of data augmentation,” Journal of Computational and Graphical Statistics, vol. 10, no. 1, pp. 1–50, 2001. [Google Scholar]
[36].Newton MA, Noueiry A, Sarkar D, and Ahlquist P, “Detecting differential gene expression with a semiparametric hierarchical mixture method,” Biostatistics, vol. 5, no. 2, pp. 155–176, 2004. [DOI] [PubMed] [Google Scholar]
[37].Lynn HM, Lee E, Choi C, and Kim P, “Swiftrank: an unsupervised statistical approach of keyword and salient sentence extraction for individual documents,” Procedia computer science, vol. 113, pp. 472–477, 2017. [Google Scholar]
[38].R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2022. [Online]. Available: https://www.R-project.org/ [Google Scholar]
[39].Boudin F, “PKE: an open source Python-based keyphrase extraction toolkit,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 2016, pp. 69–73. [Google Scholar]
[40].Gollapalli SD and Caragea C, “Extracting keyphrases from research papers using citation networks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 28, no. 1, 2014. [Google Scholar]
[41].Medelyan O, Frank E, and Witten IH, “Human-competitive tagging using automatic keyphrase extraction.” Association for Computational Linguistics, 2009. [Google Scholar]
[42].Medelyan O and Witten IH, “Domain-independent automatic keyphrase indexing with small training sets,” Journal of the American Society for Information Science and Technology, vol. 59, no. 7, pp. 1026–1040, 2008. [Google Scholar]
[43].Krapivin M, Autaeu A, Marchese M et al. “Large dataset for keyphrases extraction,” 2009. [Google Scholar]
[44].Nguyen TD and Kan M-Y, “Keyphrase extraction in scientific publications,” in International conference on Asian digital libraries. Springer, 2007, pp. 317–326. [Google Scholar]
[45].Aronson AR, Bodenreider O, Chang HF, Humphrey SM, Mork JG, Nelson SJ, Rindflesch TC, and Wilbur WJ, “The nlm indexing initiative.” in Proceedings of the AMIA Symposium. American Medical Informatics Association, 2000, p. 17. [PMC free article] [PubMed] [Google Scholar]
[46].Kim SN, Medelyan O, Kan M-Y, and Baldwin T, “SemEval-2010 Task 5 : Automatic keyphrase extraction from scientific articles,” in Proceedings of the 5th International Workshop on Semantic Evaluation, Erk K and Strapparava C, Eds. Uppsala, Sweden: Association for Computational Linguistics, Jul. 2010, pp. 21–26. [Online]. Available: https://aclanthology.org/S10-1004 [Google Scholar]
[47].Schutz AT et al. “Keyphrase extraction from single documents in the open domain exploiting linguistic and statistical methods,” M. App. Sc Thesis, 2008. [Google Scholar]
[48].Medelyan O, Witten IH, and Milne D, “Topic indexing with wikipedia,” in Proceedings of the AAAI WikiAI workshop, vol. 1, 2008, pp. 19–24. [Google Scholar]
[49].Bennett PN and Carbonell J, “Feature representation for effective action-item detection,” ACM SIGIR Special Interest Group on Information Retrival, 2005. [Google Scholar]
[50].Polson NG, Scott JG, and Windle J, “Bayesian inference for logistic models using Pólya–Gamma latent variables,” Journal of the American Statistical Association, vol. 108, no. 504, pp. 1339–1349, 2013. [Google Scholar]
[51].Hoffman MD, Blei DM, Wang C, and Paisley J, “Stochastic variational inference,” Journal of Machine Learning Research, vol. 14, pp. 1303–1347, 2013. [Google Scholar]
[52].Bernardo J, Bayarri M, Berger J, Dawid A, Heckerman D, Smith A, and West M, “Non-centered parameterisations for hierarchical models and data augmentation,” in Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, vol. 307. Oxford University Press, USA, 2003. [Google Scholar]
[53].Papaspiliopoulos O, Roberts GO, and Sköld M, “A general framework for the parametrization of hierarchical models,” Statistical Science, pp. 59–73, 2007. [Google Scholar]
[54].Tan LS, “Use of model reparametrization to improve variational Bayes,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 83, no. 1, pp. 30–57, 2021. [Google Scholar]
[55].Tan LS and Nott DJ, “Variational inference for generalized linear mixed models using partially noncentered parametrizations,” Statistical Science, vol. 28, no. 2, pp. 168–188, 2013. [Google Scholar]
[56].Ali SM and Silvey SD, “A general class of coefficients of divergence of one distribution from another,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 28, no. 1, pp. 131–142, 1966. [Google Scholar]
[57].Bamler R, Zhang C, Opper M, and Mandt S, “Perturbative black box variational inference,” Advances in Neural Information Processing Systems, vol. 30, 2017. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supp1-3630577

NIHMS2145363-supplement-supp1-3630577.pdf^{(1.3MB, pdf)}

[R1] [1].Wang G, Cheng Y, Xia Y, Ling Q, and Wang X, “A Bayesian semisupervised approach to keyword extraction with only positive and unlabeled data,” INFORMS Journal on Computing, vol. 35, no. 3, pp. 675–691, 2023. [Google Scholar]

[R2] [2].Hulth A, “Improved automatic keyword extraction given more linguistic knowledge,” in Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 2003, pp. 216–223. [Google Scholar]

[R3] [3].Caragea C, Bulgarov F, Godea A, and Gollapalli SD, “Citation-enhanced keyphrase extraction from research papers: A supervised approach,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1435–1446. [Google Scholar]

[R4] [4].Bordoloi M, Chatterjee PC, Biswas SK, and Purkayastha B, “Keyword extraction using supervised cumulative textrank,” Multimedia Tools and Applications, vol. 79, no. 41-42, pp. 31467–31496, 2020. [Google Scholar]

[R5] [5].Beliga S, Meštrović A, and Martinčić-Ipšić S, “An overview of graph-based keyword extraction methods and approaches,” Journal of Information and Organizational Sciences, vol. 39, no. 1, pp. 1–20, 2015. [Google Scholar]

[R6] [6].Mihalcea R and Tarau P, “Textrank: Bringing order into text,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 404–411. [Google Scholar]

[R7] [7].Brin S and Page L, “The anatomy of a large-scale hypertextual web search engine,” Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107–117, 1998. [Google Scholar]

[R8] [8].Bougouin A, Boudin F, and Daille B, “Topicrank: Graph-based topic ranking for keyphrase extraction,” in Proceedings of the Sixth International Joint Conference on Natural Language Processing, 2013, pp. 543–551. [Google Scholar]

[R9] [9].Florescu C and Caragea C, “A position-biased pagerank algorithm for keyphrase extraction,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017. [Google Scholar]

[R10] [10].Boudin F, “Unsupervised keyphrase extraction with multipartite graphs,” in 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2018, pp. 667–672. [Google Scholar]

[R11] [11].Sparck Jones K, “A statistical interpretation of term specificity and its application in retrieval,” Journal of Documentation, vol. 28, no. 1, pp. 11–21, 1972. [Google Scholar]

[R12] [12].El-Beltagy SR and Rafea A, “KP-Miner: A keyphrase extraction system for English and Arabic documents,” Information Systems, vol. 34, no. 1, pp. 132–144, 2009. [Google Scholar]

[R13] [13].Campos R, Mangaravite V, Pasquali A, Jorge AM, Nunes C, and Jatowt A, “A text feature based automatic keyword extraction method for single documents,” in European Conference on Information Retrieval. Springer, 2018, pp. 684–691. [Google Scholar]

[R14] [14].Grootendorst M, “Keybert: Minimal keyword extraction with bert.” 2020. [Online]. Available: 10.5281/zenodo.4461265 [DOI] [Google Scholar]

[R15] [15].Barzilay R and Elhadad M, “Using lexical chains for text summarization,” 1997. [Google Scholar]

[R16] [16].Herrera JP and Pury PA, “Statistical keyword detection in literary corpora,” The European Physical Journal B, vol. 63, pp. 135–146, 2008. [Google Scholar]

[R17] [17].Mehri A and Darooneh AH, “The role of entropy in word ranking,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 18-19, pp. 3157–3163, 2011. [Google Scholar]

[R18] [18].Ahmed U, Alexopoulos C, Piangerelli M, and Polini A, “Bryt: Automated keyword extraction for open datasets,” Intelligent Systems with Applications, vol. 23, p. 200421, 2024. [Google Scholar]

[R19] [19].Devlin J, Chang M-W, Lee K, and Toutanova K, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186. [Google Scholar]

[R20] [20].Campobello G, Segreto A, Zanafi S, and Serrano S, “Rake: A simple and efficient lossless compression algorithm for the internet of things,” in 2017 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 2581–2585. [Google Scholar]

[R21] [21].Li D, Li S, Li W, Wang W, and Qu W, “A semi-supervised key phrase extraction approach: Learning from title phrases through a document semantic network,” in Proceedings of the ACL 2010 Conference Short Papers, 2010, pp. 296–300. [Google Scholar]

[R22] [22].Zhou D, Bousquet O, Lal T, Weston J, and Scholkopf B, “Learning with local and global consistency,” Advances in Neural Information Processing Systems, vol. 16, 2003. [Google Scholar]

[R23] [23].Ye H and Wang L, “Semi-supervised learning for neural keyphrase generation,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4142–4153. [Google Scholar]

[R24] [24].Jonathan FC and Karnalim O, “Semi-supervised keyphrase extraction on scientific article using fact-based sentiment,” TELKOMNIKA (Telecommunication Computing Electronics and Control), vol. 16, no. 4, pp. 1771–1778, 2018. [Google Scholar]

[R25] [25].Firoozeh N, Nazarenko A, Alizon F, and Daille B, “Keyword extraction: Issues and methods,” Natural Language Engineering, vol. 26, no. 3, pp. 259–291, 2020. [Google Scholar]

[R26] [26].Bharti SK and Babu KS, “Automatic keyword extraction for text summarization: A survey,” arXiv preprint arXiv:1704.03242, 2017. [Google Scholar]

[R27] [27].Siddiqi S and Sharan A, “Keyword and keyphrase extraction techniques: a literature review,” International Journal of Computer Applications, vol. 109, no. 2, 2015. [Google Scholar]

[R28] [28].Duari S and Bhatnagar V, “scake: semantic connectivity aware keyword extraction,” Information Sciences, vol. 477, pp. 100–117, 2019. [Google Scholar]

[R29] [29].Bishop CM, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006. [Google Scholar]

[R30] [30].Blei DM, Kucukelbir A, and McAuliffe JD, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017. [Google Scholar]

[R31] [31].Albert JH and Chib S, “Bayesian analysis of binary and polychotomous response data,” Journal of the American Statistical Association, vol. 88, no. 422, pp. 669–679, 1993. [Google Scholar]

[R32] [32].Parisi G, Statistical Field Theory. Boston, MA, USA: Addison-Wesley, 1988. [Google Scholar]

[R33] [33].Blei DM and Jordan MI, “Variational inference for Dirichlet process mixtures,” Bayesian Analysis, vol. 1, no. 1, pp. 121–144, 2006. [Google Scholar]

[R34] [34].Tanner MA and Wong WH, “The calculation of posterior distributions by data augmentation,” Journal of the American statistical Association, vol. 82, no. 398, pp. 528–540, 1987. [Google Scholar]

[R35] [35].Van Dyk DA and Meng X-L, “The art of data augmentation,” Journal of Computational and Graphical Statistics, vol. 10, no. 1, pp. 1–50, 2001. [Google Scholar]

[R36] [36].Newton MA, Noueiry A, Sarkar D, and Ahlquist P, “Detecting differential gene expression with a semiparametric hierarchical mixture method,” Biostatistics, vol. 5, no. 2, pp. 155–176, 2004. [DOI] [PubMed] [Google Scholar]

[R37] [37].Lynn HM, Lee E, Choi C, and Kim P, “Swiftrank: an unsupervised statistical approach of keyword and salient sentence extraction for individual documents,” Procedia computer science, vol. 113, pp. 472–477, 2017. [Google Scholar]

[R38] [38].R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2022. [Online]. Available: https://www.R-project.org/ [Google Scholar]

[R39] [39].Boudin F, “PKE: an open source Python-based keyphrase extraction toolkit,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 2016, pp. 69–73. [Google Scholar]

[R40] [40].Gollapalli SD and Caragea C, “Extracting keyphrases from research papers using citation networks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 28, no. 1, 2014. [Google Scholar]

[R41] [41].Medelyan O, Frank E, and Witten IH, “Human-competitive tagging using automatic keyphrase extraction.” Association for Computational Linguistics, 2009. [Google Scholar]

[R42] [42].Medelyan O and Witten IH, “Domain-independent automatic keyphrase indexing with small training sets,” Journal of the American Society for Information Science and Technology, vol. 59, no. 7, pp. 1026–1040, 2008. [Google Scholar]

[R43] [43].Krapivin M, Autaeu A, Marchese M et al. “Large dataset for keyphrases extraction,” 2009. [Google Scholar]

[R44] [44].Nguyen TD and Kan M-Y, “Keyphrase extraction in scientific publications,” in International conference on Asian digital libraries. Springer, 2007, pp. 317–326. [Google Scholar]

[R45] [45].Aronson AR, Bodenreider O, Chang HF, Humphrey SM, Mork JG, Nelson SJ, Rindflesch TC, and Wilbur WJ, “The nlm indexing initiative.” in Proceedings of the AMIA Symposium. American Medical Informatics Association, 2000, p. 17. [PMC free article] [PubMed] [Google Scholar]

[R46] [46].Kim SN, Medelyan O, Kan M-Y, and Baldwin T, “SemEval-2010 Task 5 : Automatic keyphrase extraction from scientific articles,” in Proceedings of the 5th International Workshop on Semantic Evaluation, Erk K and Strapparava C, Eds. Uppsala, Sweden: Association for Computational Linguistics, Jul. 2010, pp. 21–26. [Online]. Available: https://aclanthology.org/S10-1004 [Google Scholar]

[R47] [47].Schutz AT et al. “Keyphrase extraction from single documents in the open domain exploiting linguistic and statistical methods,” M. App. Sc Thesis, 2008. [Google Scholar]

[R48] [48].Medelyan O, Witten IH, and Milne D, “Topic indexing with wikipedia,” in Proceedings of the AAAI WikiAI workshop, vol. 1, 2008, pp. 19–24. [Google Scholar]

[R49] [49].Bennett PN and Carbonell J, “Feature representation for effective action-item detection,” ACM SIGIR Special Interest Group on Information Retrival, 2005. [Google Scholar]

[R50] [50].Polson NG, Scott JG, and Windle J, “Bayesian inference for logistic models using Pólya–Gamma latent variables,” Journal of the American Statistical Association, vol. 108, no. 504, pp. 1339–1349, 2013. [Google Scholar]

[R51] [51].Hoffman MD, Blei DM, Wang C, and Paisley J, “Stochastic variational inference,” Journal of Machine Learning Research, vol. 14, pp. 1303–1347, 2013. [Google Scholar]

[R52] [52].Bernardo J, Bayarri M, Berger J, Dawid A, Heckerman D, Smith A, and West M, “Non-centered parameterisations for hierarchical models and data augmentation,” in Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, vol. 307. Oxford University Press, USA, 2003. [Google Scholar]

[R53] [53].Papaspiliopoulos O, Roberts GO, and Sköld M, “A general framework for the parametrization of hierarchical models,” Statistical Science, pp. 59–73, 2007. [Google Scholar]

[R54] [54].Tan LS, “Use of model reparametrization to improve variational Bayes,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 83, no. 1, pp. 30–57, 2021. [Google Scholar]

[R55] [55].Tan LS and Nott DJ, “Variational inference for generalized linear mixed models using partially noncentered parametrizations,” Statistical Science, vol. 28, no. 2, pp. 168–188, 2013. [Google Scholar]

[R56] [56].Ali SM and Silvey SD, “A general class of coefficients of divergence of one distribution from another,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 28, no. 1, pp. 131–142, 1966. [Google Scholar]

[R57] [57].Bamler R, Zhang C, Opper M, and Mandt S, “Perturbative black box variational inference,” Advances in Neural Information Processing Systems, vol. 30, 2017. [Google Scholar]

PERMALINK

Variational Bayesian Semi-supervised Keyword Extraction

Yaofang Hu

Yichen Cheng

Yusen Xia

Xinlei Wang

Abstract

I. Introduction

II. Proposed Variational Bayesian Semi-Supervised Keyword Extraction Method

A. Bayesian Hierarchical Modeling

B. A Variational Bayesian Approach

1). preliminaries for variational inference:

2). Data Augmentation:

Fig. 1:

3). Variational Bayesian Semi-supervised Keyword Extraction:

4). Algorithm:

III. Experiments and Results

A. Experimental Setup

Algorithm 1.

TABLE I:

B. Results

1). The overall performance in keyword extraction:

Fig. 2:

Fig. 3:

2). Factors that may influence the performance:

Fig. 4:

TABLE II:

Fig. 5:

3). Improved Filtering Demo:

TABLE III:

4). Computation Efficiency:

Fig. 6:

Fig. 7:

IV. Discussion and Future Directions

Supplementary Material

Biographies

Contributor Information

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases