Author manuscript; available in PMC: 2023 Jan 1.
Published in final edited form as: J Am Stat Assoc. 2021 Jan 4;117(539):1243–1253. doi: 10.1080/01621459.2020.1844719

Coupled generation

Ben Dai 1, Xiaotong Shen 1, Wing Wong 2
PMCID: PMC9718422  NIHMSID: NIHMS1774499  PMID: 36465716

Abstract

Instance generation creates representative examples to interpret a learning model, as in regression and classification. For example, representative sentences of a topic of interest describe the topic specifically for sentence categorization. In such a situation, a large number of unlabeled observations may be available in addition to labeled data; for example, many unclassified text corpora (unlabeled instances) are available with only a few classified sentences (labeled instances). In this article, we introduce a novel generative method, called a coupled generator, producing instances given a specific learning outcome, based on indirect and direct generators. The indirect generator uses the inverse principle to yield the corresponding inverse probability, enabling instance generation that leverages unlabeled data. The direct generator learns the distribution of an instance given its learning outcome. Then, the coupled generator seeks the best one from the indirect and direct generators, which is designed to enjoy the benefits of both and deliver higher generation accuracy. For sentence generation given a topic, we develop an embedding-based regression/classification in conjunction with an unconditional recurrent neural network for the indirect generator, whereas a conditional recurrent neural network is natural for the corresponding direct generator. Moreover, we derive finite-sample generation error bounds for the indirect and direct generators to reveal the generative aspects of both methods, thus explaining the benefits of the coupled generator. Finally, we apply the proposed methods to a real benchmark of abstract classification and demonstrate that the coupled generator composes reasonably good sentences from a dictionary to describe a specific topic of interest.

Keywords: Classification, Natural language processing, Numerical embeddings, Semisupervised generation, Unstructured data

1. Introduction

Generating an essay or a text for given structured information is an important Artificial Intelligence (AI) problem, which automatically imitates a certain style of writing. Whereas solving this AI problem is rather challenging, we tackle its simpler version in this article, which we call instance (example) generation, that is, generation of representative instances given a specific outcome to describe and interpret the corresponding learning model, for instance, classification and regression.

The use of black-box predictive models such as deep neural networks has delivered a high empirical learning accuracy in many real-life applications [14, 15]. Yet, it is difficult to make sense of such a learning model. From the generative perspective, instance generation can describe the relationship between an instance and an outcome retrospectively. Its applications include a topic description of sentence categorization, abstractive text summarization [12], and image captioning [25], where generated sentences render descriptive examples of topics, texts, and images. In such situations, sentence generation allows us to compose a novel essay or an image caption when structured information is supplied. For example, the UCI abstract categorization benchmark1 consists of sentences from abstracts of articles, which are labeled with one of five topic categories. The goal here is learning a sentence generation mechanism to compose a novel abstract given a specific topic, in which the generation performance is measured by the cross-entropy error based on a test sample.

In the literature, instance generation, despite its many important applications in AI, remains largely unexplored, although some approaches have been suggested for sentence generation. For example, a computational linguistics approach represents words/phrases as trees to model linguistic dependencies [20], while a learning approach uses a large text corpus to learn a sentence's structure without any access to linguistic annotation [5]. In [26], a sentence generating model is proposed to produce a document by sampling the latent topic of a sentence and then the words of the sentence using a recurrent neural network (RNN). In [37, 17], image captioning links the image content to a language model through an interplay between a convolutional neural network (CNN) and an RNN. Yet, there is a paucity of work on instance generation given structured information that incorporates both labeled and unlabeled data.

One of the primary characteristics of topic-instance data is that the amount of unlabeled data may be significantly larger than that of labeled data. For example, in sentence generation, uncategorized sentences are about ten times more numerous than categorized ones. This parallels semisupervised learning, which focuses on leveraging unlabeled data to enhance the predictive accuracy of supervised learning [42, 18], in contrast to our generation objective given a learning outcome.

Our main contribution lies in the development of a new semisupervised generation framework for producing instances given an outcome. On this ground, we propose three generative methods–indirect, direct, and coupled generators. The indirect generator uses the principle of inverse learning to estimate the conditional probability distribution of an outcome given an instance, enabling it to leverage unlabeled data, if available. On the other hand, the direct generator estimates the corresponding conditional probability of an instance given an outcome in a supervised manner. Then, the coupled generator is designed to enjoy the benefits of both generators. The proposed generators are illustrated in sentence generation, where we generate a sentence through sequential next-word-prediction. Specifically, we develop regularized embedding-based regression/classification in conjunction with an unconditional RNN for the indirect generator, whereas we use a conditional RNN for the direct generator.

To shed light on the generative performance of the three generators, we derive finite-sample generation error bounds for each method. Interestingly, the generation error of the indirect generator is governed by the complexity of the parameter space of the conditional densities of an outcome given an instance and that of the marginal densities. Similarly, that of the direct generator is determined by the conditional densities of an instance given an outcome. As a result, the indirect and direct generators have their own advantages with respect to generation, particularly when the unlabeled data is large, and importantly the coupled generator enjoys the benefits of both in terms of generation accuracy. This, together with a real benchmark of sentence categorization, demonstrates the utility of the coupled generation for composing reasonably good sentences to describe a specific topic. Numerically, the proposed method outperforms a separate RNN method, and the indirect generator can leverage additional unlabeled data to further enhance the performance.

This paper is organized as follows. Section 2 introduces the framework of coupled generation based on indirect and direct generations. Section 3 develops a theory of the generation performance of the proposed methods. Section 4 is devoted to the development of a novel sentence generative method given a topic of interest through sequential next-word prediction. Section 5 investigates the operating characteristics of the coupled generator and compares it with the direct and indirect generators as well as one competitor. The Appendix contains technical proofs.

2. Methods

Consider a generative model in which the goal is to generate an instance $X$ given an outcome $Y$, where $X$ and $Y$ represent instance and response variables, which can be numerical or unstructured, such as texts and documents that cannot be expressed in a predefined manner. In this article, we focus on instance generation under a generative model, based on the conditional distribution $p_{X|Y}$ of $X$ given an outcome of $Y$. As an example, in sentence generation [26], instance generation produces representative examples of $X$ given a specific topic of $Y$, where $X$ and $Y$ represent a sentence and its associated topic.

For instance generation, a labeled training sample $(x_i, y_i)_{i=1}^n$ is available as well as an instance-only sample $(x_j)_{j=1}^{\tilde n}$, whose sample size $\tilde n$ may greatly exceed or be smaller than the sample size $n$. In our context, we leverage the unlabeled sample to enhance the generative accuracy of instance generation.

Indirect generator.

An indirect generator produces instances using an estimate of $p_{X|Y}$ through the inverse relation (1): an estimate of $p_{Y|X}$ based on $(x_i, y_i)_{i=1}^n$ in (2) and the marginal density $p_X$ based on the combined data $(x_i)_{i=1}^n$ and $(x_j)_{j=1}^{\tilde n}$ in (3). That is,

$\text{Indirect:}\quad \hat{p}^b_{X|Y}(x \mid y) = \dfrac{\hat{p}_{Y|X}(y \mid x)\,\hat{p}_X(x)}{\int_{x \in \mathcal{X}} \hat{p}_{Y|X}(y \mid x)\,\hat{p}_X(x)\,dx},$ (1)

$\hat{p}_{Y|X} = \operatorname*{argmin}_{p_{Y|X} \in \mathcal{F}_b}\; -n^{-1}\sum_{i=1}^n \log\big(p_{Y|X}(y_i \mid x_i)\big) + \lambda_b J_b(p_{Y|X}),$ (2)

$\hat{p}_X = \operatorname*{argmin}_{p_X \in \mathcal{F}_m}\; -(n+\tilde{n})^{-1}\Big(\sum_{i=1}^n \log\big(p_X(x_i)\big) + \sum_{j=1}^{\tilde{n}} \log\big(p_X(x_j)\big)\Big) + \lambda_m J_m(p_X),$ (3)

where $\hat{p}_{Y|X}$ and $\hat{p}_X$ in (1) are regularized maximum likelihood estimates of $p_{Y|X}$ and $p_X$, $J_b$ and $J_m$ are regularizers, for example, $L_1$- or $L_2$-regularization in a neural network model, $\lambda_b \ge 0$ and $\lambda_m \ge 0$ are tuning parameters controlling the weights of regularization, and $\mathcal{F}_b$ in (2) and $\mathcal{F}_m$ in (3) are parameter spaces of $p_{Y|X}$ and $p_X$, respectively. Note that $\int_{x \in \mathcal{X}} \hat{p}_{Y|X}(y \mid x)\,\hat{p}_X(x)\,dx$ in (1) normalizes $\hat{p}^b_{X|Y}$ to become a probability density, although normalization is unnecessary when only some aspects of the distribution such as the modes or percentiles are of concern, as opposed to the distribution itself. Importantly, the indirect generator leverages the instance-only (unlabeled) data $(x_j)_{j=1}^{\tilde n}$, but any potential bias in the estimation of $p_X$ based on $(x_j)_{j=1}^{\tilde n}$ could translate into bias of $\hat{p}^b_{X|Y}$.
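To make the inversion in (1) concrete, here is a minimal sketch (an illustration, not the authors' implementation) for a finite instance space: given a fitted inverse model for $p_{Y|X}$ and a marginal estimate of $p_X$ pooled over labeled and unlabeled instances, it returns the normalized conditional $\hat{p}^b_{X|Y}(\cdot \mid y)$. All arrays below are toy placeholders.

```python
import numpy as np

def indirect_generator(p_y_given_x, p_x, y):
    """Indirect generator (1) for a finite instance space.

    p_y_given_x : array (n_instances, n_outcomes), rows are p_hat(Y | X = x).
    p_x         : array (n_instances,), marginal p_hat(X) from labeled + unlabeled data.
    y           : outcome index to condition on.
    Returns p_hat^b(X | Y = y), normalized over the instance space.
    """
    unnormalized = p_y_given_x[:, y] * p_x          # numerator of (1)
    return unnormalized / unnormalized.sum()        # denominator normalizes to a density

# Toy usage: 4 instances, 2 outcomes.
p_y_given_x = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])
p_x = np.array([0.4, 0.3, 0.2, 0.1])                # estimated from labeled + unlabeled data
print(indirect_generator(p_y_given_x, p_x, y=1))    # conditional distribution of X given Y = 1
```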

Direct generator.

A direct generator uses pX|Y to generate instances, estimated by minimizing the negative regularized conditional likelihood of X given Y based on (xi,yi)i=1n:

$\text{Direct:}\quad \hat{p}^f_{X|Y}(x \mid y) = \hat{p}_{X|Y}(x \mid y), \qquad \hat{p}_{X|Y} = \operatorname*{argmin}_{p_{X|Y} \in \mathcal{F}_f}\; -n^{-1}\sum_{i=1}^n \log\big(p_{X|Y}(x_i \mid y_i)\big) + \lambda_f J_f(p_{X|Y}),$ (4)

where $\mathcal{F}_f$ is a parameter space of $p_{X|Y}$, $J_f$ is a regularizer, and $\lambda_f \ge 0$ is a tuning parameter controlling the weight of regularization.

It appears that (4) can be extended to leverage additional unlabeled data through the conditional likelihood of $p_{X|Y}$ and a mixture relation $\int_y p_{X|Y}(x \mid y)\,p_Y(y)\,dy = p_X(x)$. Unfortunately, however, the mixture approach may suffer from an asymptotic bias when additional unlabeled data is included, thus degrading the estimation performance of $p_{X|Y}$ [8, 9, 39]. This is because the aforementioned mixture relation may not hold when $\mathcal{F}_f$ is misspecified, and moreover its impact could be minimal even if it holds, especially when the support of $Y$ is large. As suggested by the theorem in Section 4 of [39], the supervised and semisupervised maximum likelihood estimates may converge to different values, and thus more unlabeled data produces a larger estimation bias as measured by the Kullback-Leibler divergence, when the model is misspecified in that $p^0_X$ does not belong to the induced parameter space $\{p_X(x) = \int_y p_{X|Y}(x \mid y)\,p_Y(y)\,dy:\ p_{X|Y} \in \mathcal{F}_f\}$ or the mixture relation is not satisfied. Furthermore, as demonstrated by Figures 1 and 2 of [8] and Figure 4.1 of [7], empirical studies indicate that an EM algorithm based on both labeled and unlabeled data tends to degrade the performance of that based solely on the labeled data when the size of labeled data exceeds 30 in the SecStr dataset. As a result, $p_{X|Y}$ estimated from labeled data renders a better performance than that estimated from labeled and unlabeled data.

In summary, how to leverage unlabeled data to enhance the generation performance remains an open question, which depends on model assumptions that may not be verifiable in practice. It is worth mentioning that (4) is a general formulation that imposes no specific assumption on how $p_X$ is related to $\mathcal{F}_f$. However, if such an assumption becomes available in practice, (4) can be generalized based on it to incorporate unlabeled data for improvement. At present, we shall not pursue this aspect, as the indirect method can benefit from additional unlabeled data, as suggested by Theorem 1 in Section 3.

Coupled generator.

The level of difficulty of estimating $\hat{p}^f_{X|Y}$ and that of $\hat{p}^b_{X|Y}$ may differ, particularly when $p_X$ can be well estimated from the combined labeled and unlabeled data. Depending on the situation, the former may be more difficult than the latter, and vice versa. Some theoretical results for this aspect are illustrated in Section 4.5. Then we propose coupled generation by choosing, between the two, the one maximizing a predictive log-likelihood, or minimizing a negative log-likelihood, such as (23) in the sentence generation example. In particular, a coupled generator is defined as

$\hat{p}^c_{X|Y} = \begin{cases} \hat{p}^f_{X|Y}, & \text{if } \hat{p}^f_{X|Y} \text{ has a higher log-likelihood value on a validation set than } \hat{p}^b_{X|Y},\\ \hat{p}^b_{X|Y}, & \text{otherwise.} \end{cases}$ (5)

The probability density $\hat{p}^c_{X|Y}$ gives the whole spectrum of values of $X$ given $Y$. First, we may generate representative instances using the mode of $p_{X|Y}$ to give one representation, or sampling based on $p_{X|Y}$ for multiple representations. Second, discriminative features of $X$ with respect to $Y$ can be extracted by comparing $\hat{p}^c_{X|Y}$ at different $Y$-values retrospectively. For example, in classification with $Y = \pm 1$, a comparison of $\hat{p}^c_{X|Y=1}$ and $\hat{p}^c_{X|Y=-1}$ leads to discriminative features. This aspect will be further investigated elsewhere.

Coupled learning has its own distinct characteristics, although it appears remotely related to semisupervised variational auto-encoders [18] and inverse autoregressive flows [19]. In particular, [19] uses a generative model $p_{X|Y}$ and $p_X$ to enhance a discriminative model $p_{Y|X}$, regarding the marginal distribution as a mixture of conditional distributions, whereas the proposed indirect generator integrates the unlabeled data to separately estimate the marginal distribution. Furthermore, [18] estimates the marginal density $p_X$ of $X$ via a chain of latent factors and invertible transformations of autoregressive neural networks and connects blocks by invertible relations. Yet, the proposed method links two conditional densities by Bayes' law. Finally, the theoretical justification of [19] and [18] remains unknown.

3. Theory

This section develops a learning theory to investigate the generation errors of the direct, indirect, and coupled generators. In particular, we derive finite-sample generation error bounds for the estimators $\hat{p}^b_{X|Y}$, $\hat{p}^f_{X|Y}$, and $\hat{p}^c_{X|Y}$ of (1), (4), and (5).

The generation error for generating X given Y is defined as the expected Hellinger-distance between two conditional densities pX|Y and qX|Y with respect to Y :

$d(p_{X|Y}, q_{X|Y}) = \big(E_Y h^2(p_{X|Y}, q_{X|Y})\big)^{1/2} \equiv \Big(E_Y \int \big(\sqrt{p_{X|Y}} - \sqrt{q_{X|Y}}\big)^2\,d\mu\Big)^{1/2},$

where $\mu$ is the Lebesgue measure on $\mathcal{X}$, and $E_Y$ is the expectation with respect to $Y$.

Three parameter spaces $\mathcal{F}_b$, $\mathcal{F}_m$, and $\mathcal{F}_f$ are defined for estimating $p_{Y|X}$ in (2), $p_X$ in (3), and $p_{X|Y}$ in (4), each of which is allowed to depend on the corresponding sample size. Then their regularized parameter spaces are given as follows: $\mathcal{F}_{b,k} = \{p_{Y|X} \in \mathcal{F}_b : J_b(p_{Y|X}) \le k\}$ for (2), $\mathcal{F}_{m,k} = \{p_X \in \mathcal{F}_m : J_m(p_X) \le k\}$ for (3), and $\mathcal{F}_{f,k} = \{p_{X|Y} \in \mathcal{F}_f : J_f(p_{X|Y}) \le k\}$ for (4). On this ground, we define the metric entropy to measure their complexities to be used for our theory.

The $u$-bracketing metric entropy $H(u, \mathcal{F})$ of a space $\mathcal{F}$ with respect to a distance $d$ is defined as the logarithm of the cardinality of the $u$-bracketing of $\mathcal{F}$ of the smallest size. A $u$-bracketing of $\mathcal{F}$ is a finite set of pairs of functions $\{(p_j^L, p_j^U),\ j = 1, \ldots, N\}$ such that for any $p \in \mathcal{F}$, there is a $j$ such that $p_j^L \le p \le p_j^U$ with $d(p_j^L, p_j^U) \le u$; $j = 1, \ldots, N$. Note that the distance is taken to be $d^2(p_{Y|X}, q_{Y|X}) = E_X\big(h^2(p_{Y|X}, q_{Y|X})\big)$, $h^2(p_X, q_X)$, and $E_Y h^2(p_{X|Y}, q_{X|Y})$, respectively, for $\mathcal{F}_{b,k}$, $\mathcal{F}_{m,k}$, and $\mathcal{F}_{f,k}$.

To quantify the degree of approximation of the true density $p^0_{Y|X}$ by $\mathcal{F}_b$, we introduce a distance $\rho_b(p^0_{Y|X}, p_{Y|X}) = E_X E_{Y|X}\, g_\alpha(p^0_{Y|X}/p_{Y|X})$, where $g_\alpha(x) = \alpha^{-1}(x^\alpha - 1)$ for $\alpha \in (0, 1)$. As suggested in Section 4 of [38], this distance is stronger than the corresponding Hellinger distance. Similarly, $\rho_m(p^0_X, p_X) = E_X\, g_\alpha(p^0_X/p_X)$ and $\rho_f(p^0_{X|Y}, p_{X|Y}) = E_Y E_{X|Y}\, g_\alpha(p^0_{X|Y}/p_{X|Y})$ are defined for approximating the true densities $p^0_X$ and $p^0_{X|Y}$ by $\mathcal{F}_m$ and $\mathcal{F}_f$, respectively.

Let $p^*_{Y|X} \in \mathcal{F}_b$ and $p^*_X \in \mathcal{F}_m$ be two approximating points of $p^0_{Y|X}$ and $p^0_X$ in that $\rho_b(p^0_{Y|X}, p^*_{Y|X}) \le \gamma_b$ and $\rho_m(p^0_X, p^*_X) \le \gamma_m$ for some sequences $\gamma_b \ge 0$ and $\gamma_m \ge 0$. Of course, $\gamma_b = 0$ when $p^0_{Y|X} \in \mathcal{F}_b$ and $\gamma_m = 0$ when $p^0_X \in \mathcal{F}_m$.

Theorem 1 (Indirect generator).

Suppose there exist some positive constants $c_1$–$c_6$ such that, for any $\epsilon_b > 0$ and $\lambda_b \ge 0$,

$\sup_{k \ge 1} \int_{2^{-8}L_k}^{2^{1/2}L_k^{1/2}} H^{1/2}(u/c_3, \mathcal{F}_{b,k})\,du \big/ L_k \le c_2 n^{1/2}, \qquad L_k = c_1\epsilon_b^2 + \lambda_b(k-1),$ (6)

and, for any $\epsilon_m > 0$ and $\lambda_m \ge 0$,

$\sup_{k \ge 1} \int_{2^{-8}L_k}^{2^{1/2}L_k^{1/2}} H^{1/2}(u/c_6, \mathcal{F}_{m,k})\,du \big/ L_k \le c_5 (n+\tilde{n})^{1/2}, \qquad L_k = c_4\epsilon_m^2 + \lambda_m(k-1),$ (7)

then

$P\Big(d\big(\hat{p}^b_{X|Y}, p^0_{X|Y}\big) \ge 2(\eta_b + \eta_m)\Big) \le 8\exp\big(-c_7 n\eta_b^2\big) + 8\exp\big(-c_8(n+\tilde{n})\eta_m^2\big), \qquad \eta_b = \max\big(\epsilon_b, \gamma_b^{1/2}\big),\ \eta_m = \max\big(\epsilon_m, \gamma_m^{1/2}\big),$ (8)

provided that $\lambda_b \max\big(J_b(p^*_{Y|X}), J_b(p^0_{Y|X}), 1\big) \le c_9\eta_b^2$ and $\lambda_m \max\big(J_m(p^*_X), J_m(p^0_X), 1\big) \le c_9\eta_m^2$, where $c_7$–$c_9$ are some positive constants. Consequently, $d\big(\hat{p}^b_{X|Y}, p^0_{X|Y}\big) = O_p(\eta_b + \eta_m)$ as $n, \tilde{n} \to \infty$ under $p^0_{X,Y}$.

Theorem 1 indicates that the generation error of the indirect generator is governed by the estimation errors $\epsilon_b$ and $\epsilon_m$ from (2) and (3), and the approximation errors $\gamma_b$ and $\gamma_m$, where $\epsilon_b$ and $\epsilon_m$ can be obtained by solving the entropy integral equations in (6) and (7). Moreover, optimal rates of convergence can be obtained through tuning of $\lambda_b$ and $\lambda_m$. Note that $\eta_m$ could be tuned to be of a smaller order than $\eta_b$ when the size of unlabeled data greatly exceeds the size of labeled data. Then the generation error of the indirect method is governed primarily by $\eta_b$. In other words, the indirect generator's performance is mainly determined by the estimation of $p_{Y|X}$.

For the direct generator, let $p^*_{X|Y} \in \mathcal{F}_f$ be an approximation of $p^0_{X|Y}$ in that $\rho_f(p^0_{X|Y}, p^*_{X|Y}) \le \gamma_f$ for some $\gamma_f \ge 0$.

Theorem 2 (Direct generator).

Suppose there exist some positive constants $c_{10}$–$c_{12}$ such that, for any $\epsilon_f > 0$ and $\lambda_f \ge 0$,

$\sup_{k \ge 1} \int_{2^{-8}L_k}^{2^{1/2}L_k^{1/2}} H^{1/2}(u/c_{12}, \mathcal{F}_{f,k})\,du \big/ L_k \le c_{11} n^{1/2}, \qquad L_k = c_{10}\epsilon_f^2 + \lambda_f(k-1),$ (9)

then

$P\Big(d\big(\hat{p}^f_{X|Y}, p^0_{X|Y}\big) \ge \eta_f\Big) \le 8\exp\big(-c_{13} n\eta_f^2\big), \qquad \eta_f = \max\big(\epsilon_f, \gamma_f^{1/2}\big),$ (10)

provided that $\lambda_f \max\big(J_f(p^*_{X|Y}), J_f(p^0_{X|Y}), 1\big) \le c_9\eta_f^2$, where $c_{13} > 0$ is a constant. Consequently, $d\big(\hat{p}^f_{X|Y}, p^0_{X|Y}\big) = O_p(\eta_f)$ as $n \to \infty$ under $p^0_{X,Y}$.

In contrast to the indirect generation, the generation error ηf of the direct generation could be much larger or smaller than that of the indirect generation ηb depending on the complexities of Ff, Fb, and the corresponding approximation errors γf and γb, when pX can be sufficiently well estimated. This suggests that either may outperform the other depending on the model assumptions.

Note that $\gamma_f$, $\gamma_b$, and $\gamma_m$ are the approximation errors determined by the approximation capabilities of the function spaces $\mathcal{F}_f$, $\mathcal{F}_b$, and $\mathcal{F}_m$ [35, 40]. In particular, when the function space is defined by a ReLU deep neural network approximating a function in a Sobolev space, the approximation error is available and related to the scale of the neural network [40].

Theorem 3 says that the coupled generator performs no worse than the indirect and direct generators when (5) is used to select between them based on an independent cross-validation sample of size $N$.

Theorem 3 (Coupled generation).

Under $p^0_{X,Y}$, as $N \to \infty$, the coupled generator defined in (5) satisfies $K\big(p^0_{X|Y}, \hat{p}^c_{X|Y}\big) \le \min\Big(K\big(p^0_{X|Y}, \hat{p}^b_{X|Y}\big), K\big(p^0_{X|Y}, \hat{p}^f_{X|Y}\big)\Big)$, where $K(p_{X|Y}, q_{X|Y})$ is the Kullback-Leibler divergence between $p_{X|Y}$ and $q_{X|Y}$.

Remarks:

In Theorem 3, if $K\big(p^0_{X|Y}, p_{X|Y}\big) \le c_{14}^2\, d^2\big(p^0_{X|Y}, p_{X|Y}\big)$ for some constant $c_{14} > 0$, then $d\big(p^0_{X|Y}, \hat{p}^c_{X|Y}\big) \le c_{15}\min\Big(d\big(p^0_{X|Y}, \hat{p}^b_{X|Y}\big), d\big(p^0_{X|Y}, \hat{p}^f_{X|Y}\big)\Big)$, which occurs when the likelihood ratio is bounded.

4. Sentence generation given a topic

This section derives generative methods for sentence generation, which integrate the likelihood methods developed previously with language models to compose a sentence. As a result, a new sentence can be generated, which may not appear in the training data; see Table 3 for an example.

Table 3.

An abstract generated by the coupled generator based on one random partition of the UCI benchmark text corpus for sentence categorization. Here the five sentences (1)-(5) correspond to the five categories AIM, OWN, CONTRAST, BASIS, and MISC, with the first five words of each sentence prespecified. All sentences are grammatically legitimate, except that "improves" in (4) suffers from an error, and "kolmogorov" in (2) and "israelis" in (3) should be capitalized. These errors are correctable by a grammar checker.

(1) The paper extends research on the theory of choice rules. (2) We test our predictions using the ideas and the notion of kolmogorov complexity bound on the number of examples of the data sets. (3) The results demonstrate that israelis models can be used to provide new results for classification accuracy. (4) We show that implementation concerns the performance of the learning algorithm for improves the optimal predictor of the prediction. (5) The effect of balance is described by a high level of events and the objects that are shared.

A complete sentence is represented by a word vector $X_{1:T} = (X_1, \ldots, X_T)'$, where $X_t$ is the $t$-th word, $T$ is a sentence-specific length, and $'$ denotes the transpose of a vector. For convenience, we write $X_1 = $ "START" and $X_{T+1} = $ "END" as the null words marking the beginning and end of a sentence, respectively. For example, $X_1 = $ "START", $X_2 = $ "Football", $X_3 = $ "is", $X_4 = $ "a", $X_5 = $ "popular", $X_6 = $ "sport", and $X_7 = $ "END". Together with $X_{1:T}$, its associated topic category $Y = (Y_1, \ldots, Y_K)'$ is available, where $Y_j \in \{0, 1\}$ or $Y_j \in \mathbb{R}$. Finally, we construct a dictionary $\mathcal{D} = (w_1, \ldots, w_{|\mathcal{D}|})$ to contain all composing words, that is, $X_t \in \mathcal{D}$; $t = 1, \ldots, T$, with $|\mathcal{D}|$ denoting $\mathcal{D}$'s size.

For simplicity, we consider the case of a fixed T, where sentences of different lengths can be processed with a fixed length, as illustrated in Table 1. Sentence generation given a topic Y generates a sentence X1:T+1 using the conditional probability P(X1:T+1 = x1:T+1 |Y = y). However, estimation of this probability at the sentence level is infeasible. Therefore, we decompose it at the word level by the probability chain rule:

$\log\big(p(X_{1:T+1} = x_{1:T+1} \mid Y = y)\big) = \sum_{t=1}^{T} \log\big(p(X_{t+1} = x_{t+1} \mid X_{1:t} = x_{1:t}, Y = y)\big).$ (11)

This decomposition (11) permits sequential generation of a sentence through next-word-prediction given existing words by learning p(Xt+1 = xt+1 | x1:t, y) from data; t =1, …,T.

Table 1.

Eleven next-word-prediction sequences associated with one sentence.

Topic: MISC. Sentence: The loss bound of SYMBOL implies convergence in probability.
1. Null Null Null Null Null Null Null Null Null Null Null START → The
2. Null Null Null Null Null Null Null Null Null START The → loss
3. Null Null Null Null Null Null Null Null START The loss → bound
4. Null Null Null Null Null Null Null START The loss bound → of
5. Null Null Null Null Null Null START The loss bound of → SYMBOL
6. Null Null Null Null Null START The loss bound of SYMBOL → implies
7. Null Null Null Null START The loss bound of SYMBOL implies → convergence
8. Null Null Null START The loss bound of SYMBOL implies convergence → in
9. Null Null START The loss bound of SYMBOL implies convergence in → probability
10. Null START The loss bound of SYMBOL implies only convergence in probability →.
11. START The loss bound of SYMBOL implies only convergence in probability. → END

Yet, estimation of p(Xt+1 = xt+1 | x1:t, y) remains challenging for unstructured X1:t because of a lack of observations in any conditioning event of X1:t given Y even with large training data. Furthermore, it is difficult to utilize unlabeled data to estimate p(Xt+1 = xt+1 | x1:t, y).

4.1. Indirect generator

In this context, we derive a version of (2) and that of (3) through (11) to estimate the inverse probabilities. Specifically, p(xt+1 | x1:t, y) can be written as

$p(x_{t+1} \mid x_{1:t}, y) = \dfrac{p(y \mid x_{1:t+1})\, p(x_{t+1} \mid x_{1:t})}{\sum_{x_{t+1} \in \mathcal{D}} p(y \mid x_{1:t}, x_{t+1})\, p(x_{t+1} \mid x_{1:t})};$ (12)

for $t = 1, \ldots, T$. Then, we estimate the inverse probability $p(y \mid x_{1:t+1})$ based on the labeled data $(x^i_{1:t}, y^i)_{i=1}^n$ and estimate $p(x_{t+1} \mid x_{1:t})$ based on $(x^i_{1:t})_{i=1}^n$ for $t = 1, \ldots, T_i$, and the unlabeled data $(x^j_{1:t})_{j=1}^{\tilde n}$ for $t = 1, \ldots, T_j$.

Estimation of $p(y \mid x_{1:t})$ may proceed with unstructured predictors $x_{1:t}$. To proceed, we map a sentence $x_{1:t}$ to a numerical vector $E(x_{1:t}) \in \mathbb{R}^p$, known as a numerical embedding of size $p$, via a pre-trained embedding model such as Doc2Vec [23, 24] and BERT [11]. If a pre-trained embedding model is sufficient in that $p(y \mid X_{1:t} = x_{1:t}) = p(y \mid E(X_{1:t}) = E(x_{1:t}))$ [10], the numerical embedding $E(x_{1:t})$ captures word-to-word relations expressed in terms of co-occurrence of words, which may raise the level of predictability of the unstructured predictors $X_{1:t}$. Next, we model $p(y \mid x_{1:t})$ through $p(y \mid E(x_{1:t}))$ when $Y \in \{0, 1\}^K$ is categorical or $Y \in \mathbb{R}^K$ is continuous with an embedded label $Y$:

$p(y \mid x_{1:t}) = \begin{cases} y^\top \sigma\big(f(E(x_{1:t}))\big), & \text{if } y \in \{0, 1\}^K,\\ (2\pi)^{-K/2}\exp\big(-\tfrac{1}{2}\|y - f(E(x_{1:t}))\|_2^2\big), & \text{if } y \in \mathbb{R}^K, \end{cases}$ (13)

where $K$ is the dimension of $Y$, $\sigma(\cdot)$ is the softmax function [1], and $f$ is a nonparametric classification or regression function, such as a random forest [3], or a linear function $f(E(x_{1:t})) = \theta_b E(x_{1:t})$ with $\theta_b \in \mathbb{R}^{K \times p}$. For illustration, we use the linear representation $f(E(x_{1:t})) = \theta_b E(x_{1:t})$ in (13) subsequently. Now the cost function $L_b(\theta_b)$ in (2) becomes

$L_b(\theta_b) = -\dfrac{1}{n}\sum_{i=1}^n (T_i)^{-1}\sum_{t=1}^{T_i}\log\Big(p\big(y^i \mid E(x^i_{1:t})\big)\Big) + \lambda_b J_b(f),$ (14)

where $\lambda_b \ge 0$ is a tuning parameter and $J_b(f) \ge 0$ is a regularizer, for example, $J_b(f) = \|\theta_b\|_F^2$ if $f(E(x_{1:t})) = \theta_b E(x_{1:t})$, where $\|\cdot\|_F$ is the Frobenius norm of a matrix.
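As an illustration of the embedding-based inverse model in (13)-(14), the following sketch assumes gensim's Doc2Vec for the embedding $E(\cdot)$ and sklearn's regularized multinomial logistic regression for the linear representation; the tiny corpus, labels, and hyperparameter values are placeholders rather than the settings of Section 5.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Toy labeled next-word-prediction prefixes and their topic labels (placeholders).
prefixes = [["the", "loss", "bound"], ["we", "test", "our"], ["the", "paper", "extends"]]
topics = [4, 1, 0]  # e.g., indices into {AIM, OWN, CONTRAST, BASIS, MISC}

# Embedding E(x_{1:t}): here a small Doc2Vec fitted on the same prefixes.
tagged = [TaggedDocument(words=p, tags=[i]) for i, p in enumerate(prefixes)]
embedder = Doc2Vec(tagged, vector_size=128, min_count=1, epochs=50)
E = [embedder.infer_vector(p) for p in prefixes]

# Regularized multinomial logistic regression for p(y | E(x_{1:t})); C plays the role of 1/lambda_b.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(E, topics)

# Inverse probabilities p_hat(y | x_{1:t}) used in (12).
print(clf.predict_proba([embedder.infer_vector(["the", "loss"])]))
```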

On the other hand, the next-word probability is estimated by an RNN of the form

$p(x_{t+1} \mid x_{1:t}) = o_{[x_{t+1}]}(x_t, h_t; \theta_m), \quad \text{with } h_t = h(x_t, h_{t-1}; \theta_m),\ h_0 = 0,$ (15)

where $[x_{t+1}] = \{j : w_j = x_{t+1}\}$, $o_j(x_t, h_t, \theta_m)$ is the probability of occurrence of the $j$-th word in $\mathcal{D}$, $h(x_t, h_{t-1}, \theta_m)$ is a hidden state function, such as a long short-term memory unit (LSTM) [16], a bidirectional unit [32], a gated recurrent unit (GRU) [6], and GPT2 [30], and $\theta_m$ is the parameter of a specific RNN model, for example, $\theta_m = (W_m^o, W_m^x, W_m^h)$ in a basic RNN,

$o(x_t, h_t, \theta_m) = \sigma(W_m^o h_t), \quad h_t = \phi\big(W_m^x 1_{[x_t]} + W_m^h h_{t-1}\big), \quad \text{and } h_0 = 0,$ (16)

where $\sigma(\cdot)$, as defined before, is the softmax function and $\phi$ is an activation function such as the ReLU function [1], $W_m^o \in \mathbb{R}^{r_m \times |\mathcal{D}|}$, $W_m^x \in \mathbb{R}^{|\mathcal{D}| \times r_m}$, and $W_m^h \in \mathbb{R}^{r_m \times r_m}$, and $r_m$ is the number of latent factors of the RNN. See Figure 1 for a display of the architecture of a basic RNN.

Fig. 1.

A sentence generated by the indirect and direct RNN generators in (15) and (20), with the RNN architecture displayed: the sentence "The instantaneous loss bound of SYMBOL implies only convergence in probability" with topic "MISC" is generated word by word, $h_t$ is the hidden node of the RNNs in (15) and (20), and $h_0$ is the initial hidden state, which is zero under (15) and "MISC" under (20).

On the ground of (15), the cost function Lm(θm) in (3) becomes

$L_m(\theta_m) = -(n+\tilde{n})^{-1}\sum_{i=1}^n (T_i)^{-1}\sum_{t=1}^{T_i}\log\Big(o_{[x^i_{t+1}]}(x^i_t, h^i_t, \theta_m)\Big) - (n+\tilde{n})^{-1}\sum_{j=1}^{\tilde{n}} (T_j)^{-1}\sum_{t=1}^{T_j}\log\Big(o_{[x^j_{t+1}]}(x^j_t, h^j_t, \theta_m)\Big) + \lambda_m J_m(\theta_m),$ (17)

where $\lambda_m \ge 0$ is a tuning parameter and $J_m(\theta_m)$ is a regularizer on the weight matrices and the activation layer [22].
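A minimal Keras sketch of the unconditional next-word model in (15)-(17) may look as follows, loosely mirroring the architecture used in Section 5 (an embedding layer, an LSTM layer with 128 latent factors, and a dense softmax layer over the dictionary); the vocabulary size, sequence length, and the randomly generated padded prefixes are placeholders.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

vocab_size, max_len = 5369, 40            # |D| and padded prefix length (placeholders)

# Unconditional RNN for p(x_{t+1} | x_{1:t}): embedding -> LSTM -> softmax over D.
model = keras.Sequential([
    layers.Embedding(vocab_size, 128,
                     embeddings_regularizer=regularizers.l2(1e-4)),  # lambda_m-type penalty
    layers.LSTM(128),                      # r_m = 128 latent factors
    layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Labeled and unlabeled padded prefixes are pooled, as in (17); here random placeholders.
X_prefix = np.random.randint(0, vocab_size, size=(1000, max_len))
y_next = np.random.randint(0, vocab_size, size=(1000,))
model.fit(X_prefix, y_next, batch_size=200, epochs=2, verbose=0)
next_word_probs = model.predict(X_prefix[:1])   # p_hat(x_{t+1} | x_{1:t}; theta_m)
```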

Minimizing (14) and (17) yields estimators $\hat\theta_b$ and $\hat\theta_m$, respectively. Then the conditional probabilities are estimated as $\hat{p}(y \mid x_{1:t+1}; \hat\theta_b) = \sigma\big(\hat\theta_b E(x_{1:t+1})\big)$ and $\hat{p}(x_{t+1} \mid x_{1:t}; \hat\theta_m) = o_{[x_{t+1}]}(x_t, h_{t-1}, \hat\theta_m)$. Plugging these estimates into (12), we obtain the estimated probability, and the process is summarized as

$\hat{p}^b(X_{t+1} = x \mid x_{1:t}, y) = \dfrac{\hat{p}(y \mid x_{1:t}, x; \hat\theta_b)\,\hat{p}(X_{t+1} = x \mid x_{1:t}; \hat\theta_m)}{\sum_{x \in \mathcal{D}}\hat{p}(y \mid x_{1:t}, x; \hat\theta_b)\,\hat{p}(X_{t+1} = x \mid x_{1:t}; \hat\theta_m)}, \qquad \hat\theta_b = \operatorname*{argmin}_{\theta_b} L_b(\theta_b), \quad \hat\theta_m = \operatorname*{argmin}_{\theta_m} L_m(\theta_m).$ (18)

Then, a sentence is sequentially generated as follows:

$\hat{x}_{t+1} = \operatorname*{argmax}_{x \in \mathcal{D}}\ \hat{p}^b(X_{t+1} = x \mid X_{1:t} = x_{1:t}, Y = y); \quad t = 1, \ldots, \hat{T}.$ (19)

This generation process begins with $x_1 = $ "START" or pre-specified $t_0$ words $x_{1:t_0}$ and proceeds until $x_{\hat{T}} = $ "END" is reached, where $\hat{T}$ is the index at termination. It is worth mentioning that the denominator in (18) normalizes the probability but need not be computed when only the maximizer of (18) is desired in (19).
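The sequential step in (18)-(19) can be sketched as below; `rnn_next_word_probs` and `classifier_topic_probs` stand for the two fitted components above and are assumed interfaces, not part of the original implementation.

```python
import numpy as np

def generate_indirect(prefix, topic, dictionary, rnn_next_word_probs,
                      classifier_topic_probs, max_len=30):
    """Greedy sequential generation via (18)-(19).

    rnn_next_word_probs(prefix)          -> array over the dictionary, p_hat(x_{t+1} | x_{1:t}).
    classifier_topic_probs(prefix, word) -> scalar p_hat(y = topic | x_{1:t}, word).
    """
    sentence = list(prefix)
    while len(sentence) < max_len:
        p_next = rnn_next_word_probs(sentence)
        # Numerator of (18); the denominator is constant over candidates and can be
        # skipped because only the argmax in (19) is needed.
        scores = [classifier_topic_probs(sentence, w) * p_next[j]
                  for j, w in enumerate(dictionary)]
        nxt = dictionary[int(np.argmax(scores))]
        sentence.append(nxt)
        if nxt == "END":
            break
    return sentence
```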

4.2. Direct generator

The direct generator is inspired by a conditional RNN (C-RNN; [37, 17]), estimating

$p(x_{t+1} \mid x_{1:t}, y) = o_{[x_{t+1}]}(x_t, h_{t-1}, y, \theta_f), \quad \text{with } h_t = h(x_t, h_{t-1}, \theta_f),\ h_0 = h_0(y, \theta_f),$ (20)

where $\theta_f$ represents the parameters of an RNN, and $h_0$ is built on the label information, as opposed to $h_0 = 0$ in (16). As in (16), the direct generator requires an additional parameter $W_f^y \in \mathbb{R}^{r_f \times K}$ in $\theta_f = (W_f^o, W_f^x, W_f^h, W_f^y)$ to model the effect of $y$ as follows:

$o(x_t, h_t, y, \theta_f) = \sigma(W_f^o h_t), \quad h_t = \phi\big(W_f^x 1_{[x_t]} + W_f^h h_{t-1}\big), \quad \text{and } h_0 = \phi(W_f^y y),$ (21)

where $W_f^o \in \mathbb{R}^{r_f \times |\mathcal{D}|}$, $W_f^x \in \mathbb{R}^{|\mathcal{D}| \times r_f}$, $W_f^h \in \mathbb{R}^{r_f \times r_f}$, and $r_f$ is the number of latent factors of the RNN. On this ground, the cost function in (4) becomes

$L_f(\theta_f) = -n^{-1}\sum_{i=1}^n (T_i)^{-1}\sum_{t=1}^{T_i}\log\Big(o_{[x^i_{t+1}]}(x^i_t, h^i_{t-1}, y^i, \theta_f)\Big) + \lambda_f J_f(\theta_f),$ (22)

where $\lambda_f \ge 0$ is a tuning parameter and $J_f(\theta_f)$ is a nonnegative regularizer. Minimizing (22) in $\theta_f$ yields an estimate $\hat\theta_f$, and thus the estimated probability $\hat{p}^f(x \mid x_{1:t}, y) = o_{[x]}(x_t, h_{t-1}, y, \hat\theta_f)$ from (20). Then, sentence generation proceeds as in (19).
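A minimal Keras functional-API sketch of the conditional RNN in (20)-(21) is given below, with the topic label $y$ initializing the hidden state $h_0 = \phi(W_f^y y)$; the layer sizes are placeholders, and an LSTM cell state is initialized alongside $h_0$ as an implementation detail not specified in (21).

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, max_len, n_topics, r_f = 5369, 40, 5, 128   # placeholders

word_ids = keras.Input(shape=(max_len,), dtype="int32")
topic = keras.Input(shape=(n_topics,))                  # one-hot label y

h0 = layers.Dense(r_f, activation="relu")(topic)        # h_0 = phi(W_f^y y) in (21)
c0 = layers.Dense(r_f, activation="relu")(topic)        # LSTM also carries a cell state
emb = layers.Embedding(vocab_size, 128)(word_ids)
h = layers.LSTM(r_f)(emb, initial_state=[h0, c0])       # hidden recursion in (20)
out = layers.Dense(vocab_size, activation="softmax")(h) # o(x_t, h_t, y, theta_f)

direct_rnn = keras.Model([word_ids, topic], out)
direct_rnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```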

It is worth noting that the direct and indirect generators can be implemented using different RNN models, for example, GPT2 for the direct RNN in (20) and LSTM for the indirect RNN in (15). Moreover, different RNN model architectures may yield different empirical results. This aspect is illustrated in Section 5.

4.3. Coupled generator

Given the estimated probabilities $\hat{p}^f(x_{t+1} \mid x_{1:t}, y)$ and $\hat{p}^b(x_{t+1} \mid x_{1:t}, y)$, the coupled generator chooses one between $\hat{p}^f$ and $\hat{p}^b$ on a validation set by minimizing an empirical version of the log-likelihood loss,

$\mathrm{Ent}(\hat{p}) = -T^{-1}\sum_{t=1}^{T}\log \hat{p}(X_{t+1} = x_{t+1} \mid X_{1:t} = x_{1:t}, Y = y).$ (23)
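The selection rule of this subsection can be sketched as follows: compute the empirical entropy loss (23) of each fitted generator on a validation set and keep the smaller one; the probability interfaces and the validation set are assumptions of the sketch.

```python
import numpy as np

def entropy_loss(p_hat, sentence, topic):
    """Empirical version of (23) for one validation sentence."""
    T = len(sentence) - 1
    logs = [np.log(p_hat(sentence[:t + 1], topic)[sentence[t + 1]]) for t in range(T)]
    return -np.mean(logs)

def couple(p_hat_direct, p_hat_indirect, validation_set):
    """Choose between the direct and indirect generators as in (5)."""
    loss_f = np.mean([entropy_loss(p_hat_direct, s, y) for s, y in validation_set])
    loss_b = np.mean([entropy_loss(p_hat_indirect, s, y) for s, y in validation_set])
    return p_hat_direct if loss_f <= loss_b else p_hat_indirect
```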

4.4. Large-scale computation

This section develops a computational scheme for the indirect generator in (14)–(17); the direct generator in (22) can be treated by a standard RNN implementation as in [36, 29]. In particular, when the stochastic backpropagation-through-time gradient method is used, the computational complexity is of the order of the number of parameters per time step [27].

In what follows, we apply gradient descent [41] or stochastic gradient descent [31] to solve (14). For (17), we apply a classical back-propagation algorithm. In each case, we use an analytic gradient expression for the updates.

Gradient for indirect generation.

The gradient expression for $\theta_m$ in (17) is given in [29], while that for $\theta_b$ in (14) is computed as

$\dfrac{\partial L_b}{\partial \theta_{b,k}} = \begin{cases} \lambda_b\theta_{b,k} - n^{-1}\sum_{i=1}^n (T_i)^{-1}\sum_{t=1}^{T_i}\big(y^i_k - \sigma_k(\theta_b E(x^i_{1:t}))\big)E(x^i_{1:t}), & y^i \in \{0, 1\}^K,\\ \lambda_b\theta_{b,k} - n^{-1}\sum_{i=1}^n (T_i)^{-1}\sum_{t=1}^{T_i}\big(y^i_k - \theta_{b,k}^\top E(x^i_{1:t})\big)E(x^i_{1:t}), & y^i \in \mathbb{R}^K, \end{cases}$ (24)

where $\theta_{b,k}$ denotes the $k$-th column of $\theta_b$.

The details of the gradient descent scheme for the indirect generator are summarized in Algorithm 1 below.

Algorithm 1 (gradient descent for the indirect generator): update $\theta_b$ using the gradient of (14) in (24) and $\theta_m$ using the gradient of (17) until convergence.
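A sketch of the gradient-descent update in Algorithm 1 for the embedded linear model is given below, using the analytic gradient (24) in the continuous-label case; the embeddings, labels, step size, and $\lambda_b$ are placeholders, and each embedded prefix is treated as one training case.

```python
import numpy as np

def fit_theta_b(E_x, y, lam_b=1e-3, step=0.1, n_iter=500):
    """Gradient descent for (14) with f(E(x)) = theta_b E(x) and continuous y (see (24)).

    E_x : (n, p) array of sentence embeddings E(x_{1:t}^i), one row per prefix.
    y   : (n, K) array of embedded labels y^i.
    """
    n, p = E_x.shape
    K = y.shape[1]
    theta_b = np.zeros((K, p))
    for _ in range(n_iter):
        residual = y - E_x @ theta_b.T                   # y^i - theta_b E(x_{1:t}^i)
        grad = lam_b * theta_b - (residual.T @ E_x) / n  # gradient (24), regression case
        theta_b -= step * grad
    return theta_b

# Toy usage with random placeholders.
theta_hat = fit_theta_b(np.random.randn(50, 128), np.random.randn(50, 4))
```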

Algorithm 1 can be updated by a stochastic gradient scheme [2]. Lemma 1 describes computational properties of Algorithm 1.

Lemma 1.

If the cost functions $L_b$ in (14) and $L_m$ in (17) are continuously twice differentiable, and the probability measure of the random initialization is absolutely continuous with respect to the Lebesgue measure, then $\hat\theta_b$ is a global minimizer of (14), while $\hat\theta_m$ is a local minimizer of (17) almost surely, provided that the step size in Algorithm 1 is sufficiently small.

4.5. Theory for sentence generation

This section generalizes the theoretical result of Section 3 to the problem of next-word-prediction.

Now we use $p_{X|Y}$, $p_{Y|X}$, and $p_X$ to represent $\{p_{X_{t+1} \mid X_{1:t}, Y}\}_{t=1}^T$, $\{p_{Y \mid X_{1:t}}\}_{t=1}^T$, and $\{p_{X_{t+1} \mid X_{1:t}}\}_{t=1}^T$, respectively. The expected squared Hellinger-distance for next-word-prediction is

$d(p_{X|Y}, q_{X|Y}) = \Big(\bar{E}\, h^2\big(p_{X_{t+1} \mid X_{1:t}, Y}, q_{X_{t+1} \mid X_{1:t}, Y}\big)\Big)^{\frac{1}{2}},$ (25)

where $\bar{E}(\cdot) = T^{-1}\sum_{t=1}^T E_{X_{1:t}, Y}(\cdot)$.

The metric entropy of $\mathcal{F}_{b,k}$ is defined with respect to the distance $\kappa^2(p_{Y|X}, q_{Y|X}) = \bar{E}\, h^2\big(p_{Y \mid X_{1:t}}, q_{Y \mid X_{1:t}}\big)$. Similarly, $\kappa^2(p_X, q_X) = \bar{E}\, h^2\big(p_{X_{t+1} \mid X_{1:t}}, q_{X_{t+1} \mid X_{1:t}}\big)$ and $d^2(p_{X|Y}, q_{X|Y})$ are used for $\mathcal{F}_{m,k}$ and $\mathcal{F}_{f,k}$, respectively.

The approximation error for $p^0_{Y|X}$ is measured by $\rho_b(p^0_{Y|X}, p_{Y|X}) = \bar{E}\, g_\alpha\big(p^0(Y \mid X_{1:t+1})/p(Y \mid X_{1:t+1})\big)$. Similarly, the approximation errors $\rho_m(p^0_X, p_X) = \bar{E}\, g_\alpha\big(p^0(X_{t+1} \mid X_{1:t})/p(X_{t+1} \mid X_{1:t})\big)$ and $\rho_f(p^0_{X|Y}, p_{X|Y}) = \bar{E}\, g_\alpha\big(p^0(X_{t+1} \mid X_{1:t}, Y)/p(X_{t+1} \mid X_{1:t}, Y)\big)$ are used for $p^0_X$ and $p^0_{X|Y}$.

Corollary 1 (Sequential generation).

All the results in Theorems 1 and 2 continue to hold with the distance d(·,·) defined in (25).

Next we provide a theoretical example to illustrate Corollary 1.

Theoretical example.

Suppose that the RNN in (15) is a basic recurrent network with $\theta_m = (W_m^o, W_m^x, W_m^h)$, that is, $o(x_t, h_{t-1}, \theta_m) = \sigma(W_m^o h_{t-1})$, $h_t = \phi\big(W_m^x 1_{[x_t]} + W_m^h h_{t-1}\big)$, and $h_0 = 0_{r_m}$, where $W_m^o \in \mathbb{R}^{r_m \times |\mathcal{D}|}$, $W_m^x \in \mathbb{R}^{|\mathcal{D}| \times r_m}$, and $W_m^h \in \mathbb{R}^{r_m \times r_m}$, $r_m$ is the number of latent factors of the RNN, and $\phi(z)$ is an activation function, such as the sigmoid function $\phi(z) = 1/(1 + \exp(-z))$, the tanh function $\phi(z) = \tanh(z)$, and the rectified linear unit (ReLU) $\phi(z) = z_+$. For illustration, we focus on the sigmoid function.

The RNN in (20) is specified as $o(x_t, h_{t-1}, \theta_f) = \sigma(W_f^o h_{t-1})$, $h_t = \phi\big(W_f^x 1_{[x_t]} + W_f^h h_{t-1}\big)$, and $h_0 = \phi(W_f^y y)$. The network parameters are $\theta_f = (W_f^o, W_f^x, W_f^h, W_f^y)$, where $W_f^o \in \mathbb{R}^{r_f \times |\mathcal{D}|}$, $W_f^x \in \mathbb{R}^{|\mathcal{D}| \times r_f}$, $W_f^h \in \mathbb{R}^{r_f \times r_f}$, and $W_f^y \in \mathbb{R}^{r_f \times K}$, and $r_f$ is the number of latent factors of the RNN in the direct generation.

Corollary 2 gives the generation errors of the direct and indirect generators.

Corollary 2 (Theoretical example).

For the estimated next-word probabilities $\hat{p}^f_{X|Y}$ by the direct generator in (22), we have that $d(\hat{p}^f_{X|Y}, p^0_{X|Y}) = O_p(\eta_f)$, where

$\eta_f = \max\Big\{\Big(\dfrac{\Lambda_f}{n}\log\Big(\dfrac{n\max(r_f, 2c_{15})\,2^T}{T^{1/2}\Lambda_f}\Big)\Big)^{\frac{1}{2}},\ \gamma_f^{\frac{1}{2}}\Big\},$

$\Lambda_f = r_f(2|\mathcal{D}| + r_f + K)$, $\lambda_f = c_{17}\eta_f^2$, and $c_{15} > 0$ and $c_{17} > 0$ are constants with $E_Y\|Y\|_2^2 \le c_{15}$. Similarly, the estimated next-word probabilities $\hat{p}^b_{X|Y}$ by the indirect generator in (14) and (17) satisfy $d(\hat{p}^b_{X|Y}, p^0_{X|Y}) = O_p(\eta_b + \eta_m)$, where

$\eta_b = \max\Big\{\Big(\dfrac{\Lambda_b}{n}\log\Big(\dfrac{c_{16}n}{\Lambda_b}\Big)\Big)^{\frac{1}{2}},\ \gamma_b^{\frac{1}{2}}\Big\}, \qquad \eta_m = \max\Big\{\Big(\dfrac{\Lambda_m}{n+\tilde{n}}\log\Big(\dfrac{r_m(n+\tilde{n})\,2^T}{T^{1/2}\Lambda_m}\Big)\Big)^{\frac{1}{2}},\ \gamma_m^{\frac{1}{2}}\Big\},$

$\Lambda_b = Kp$, $\Lambda_m = r_m(2|\mathcal{D}| + r_m)$, $\lambda_b = c_{18}\eta_b^2$, $\lambda_m = c_{18}\eta_m^2$, and $c_{16} > 0$ and $c_{18} > 0$ are constants with $\bar{E}\|E(X_{1:t})\|_2^2 \le c_{16}$.

Corollary 2 says that the generation error of the indirect generator in (1) becomes $d(\hat{p}^b_{X|Y}, p^0_{X|Y}) = O_p\Big(\big(\tfrac{\Lambda_b}{n}\log(\tfrac{n}{\Lambda_b})\big)^{\frac{1}{2}}\Big)$ when $(\tilde{n}+n)/n = O\Big(\tfrac{\Lambda_m\log(r_m n/\Lambda_m)}{\Lambda_b\log(c_{16}n/\Lambda_b)}\Big)$ and $\gamma_b = \gamma_m = 0$. In fact, the generation error is primarily dominated by the estimation error of $p(Y \mid X_{1:t})$, because $p(X_{t+1} \mid X_{1:t})$ can be well estimated with the help of large unlabeled data with $\tilde{n} \gg n$. In this situation, the indirect method outperforms the direct method, particularly when $\Lambda_b < \Lambda_f$, suggesting that the estimation complexity of the indirect method is less than that of the direct method. Interestingly, the generation error of the direct generator agrees with that of the maximum likelihood estimates under the Hellinger-distance [34, 38]. With respect to tuning, a large value of $\Lambda_b$, $\Lambda_m$, and $\Lambda_f$ increases the complexity of the corresponding functional space for probability estimation, thus reducing the generation errors. Consequently, the generation errors of the direct and indirect generators indeed depend on the model complexity of the parameter spaces $\mathcal{F}_b$ and $\mathcal{F}_f$.

To illustrate the synergy of the indirect and direct generators' respective strengths, we consider two situations. First, $d(\hat{p}^f_{X|Y}, p^0_{X|Y}) = o_p(1)$ but $d(\hat{p}^b_{X|Y}, p^0_{X|Y})$ is bounded away from zero if the unlabeled sample $(X^j_{1:T_j})_{j=1}^{\tilde{n}}$ follows a different marginal distribution from the labeled sample $(X^i_{1:T_i})_{i=1}^n$. Second, $d(\hat{p}^b_{X|Y}, p^0_{X|Y}) = o_p(1)$ but $d(\hat{p}^f_{X|Y}, p^0_{X|Y})$ is bounded away from zero in the presence of a new word appearing in the unlabeled but not the labeled sample. However, in both situations, $d(\hat{p}^c_{X|Y}, p^0_{X|Y}) = o_p(1)$ when the Kullback-Leibler divergence is equivalent to the Hellinger-distance. In other words, only the coupled generator has a generation error tending to zero in both situations.

5. Benchmark

This section examines the performance of the coupled, indirect, and direct generators in one benchmark example and compares them with a baseline method, "Separate RNN", which fits a separate RNN for each topic as in [36]. The benchmark concerns sentence categorization based on a text corpus in the UCI machine learning repository2. This corpus contains 1,039 labeled sentences collected from abstracts and introductions of 30 articles, in which the five topic categories are AIM (a specific aim of the present paper), OWN (description of own work presented in the present paper), CONTRAST (comparison statements with other works, including strengths and weaknesses), BASIS (statements of agreement with other works or continuation of other works), and MISC (generally accepted scientific background or description of other works). These labels originate from three scientific domains: computational biology (PLOS), the machine learning repository on arXiv (ARXIV), and the psychology journal Judgment and Decision Making (JDM). For example, a typical sentence such as "The instantaneous loss bound of SYMBOL implies only convergence in probability." is labeled as "MISC" according to scientific topic classification. In addition to the aforementioned labeled sentences, this corpus contains 34,481 unlabeled sentences from 300 articles in PLOS, ARXIV, and JDM.

Before proceeding, we pre-process the text corpus to filter out each sentence's redundant components so that numerical embeddings can be applied for the indirect generator. First, we replace all numerical values, symbolic values, and citations by "NUMBER", "SYMBOL", and "CITATION", respectively, and remove all standalone punctuation marks except commas, periods, and semicolons. For unlabeled sentences, we remove words appearing fewer than 20 times in the corpus, which leads to an unlabeled corpus of 8,286 sentences. On this ground, a dictionary is constructed, consisting of 5,369 words extracted from labeled and unlabeled sentences.

For training, we generate word strings for next-word-prediction based on the maximum length of all sentences in the dataset, so that all the previous tokens in a sentence contribute to predicting the next word. Specifically, we create next-word-prediction sequences consisting of consecutive previous words and pad all word strings to the same length with the null word "Null". An example of such next-word-prediction sequences is displayed in Table 1; a construction sketch is given below. In this fashion, we gather enough training sequences, as the null words do not impact our learning process. Now, 28,180 labeled next-word-prediction sequences are generated from the original 1,039 labeled sentences, together with 174,355 unlabeled sequences from the original 8,286 unlabeled sentences.
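The construction of padded next-word-prediction sequences can be sketched as follows; the special tokens and padding follow the description above, and the padding length is a placeholder.

```python
def next_word_sequences(sentence_tokens, max_len):
    """Turn one tokenized sentence into (padded prefix, next word) training pairs.

    The prefix consists of "START" followed by the words seen so far and is
    left-padded with "Null" to a fixed length, mirroring Table 1.
    """
    tokens = ["START"] + sentence_tokens + ["END"]
    pairs = []
    for t in range(1, len(tokens)):
        prefix = tokens[:t]
        padded = ["Null"] * (max_len - len(prefix)) + prefix
        pairs.append((padded, tokens[t]))
    return pairs

# Toy usage: eleven prediction pairs for the example sentence of Table 1.
example = "The loss bound of SYMBOL implies convergence in probability .".split()
for prefix, nxt in next_word_sequences(example, max_len=12):
    print(" ".join(prefix), "->", nxt)
```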

The generation performance is measured by two commonly used metrics, namely, the next-word entropy loss and the Bi-Lingual Evaluation Understudy (BLEU) loss [28] over a test sample, approximating the predicted Kullback-Leibler divergence and the Jaccard distance [13], respectively. Given generated sentences $(x^i_{1:\hat{T}_i})_{i=1}^{n_{\text{test}}}$ from $p$ and their reference sentences $(x^i_{1:T_i})_{i=1}^{n_{\text{test}}}$ given a topic $y$, the entropy loss is defined as the empirical version of (23), while the BLEU$_l$-loss ($l = 1, \ldots, 4$) can be written as

$\mathrm{BLEU}_l\text{-loss}(p) = 1 - n_{\text{test}}^{-1}\sum_{i=1}^{n_{\text{test}}} \exp\Big(\min\Big(1 - \dfrac{\hat{T}_i}{T_i}, 0\Big)\Big)\dfrac{\big|\mathrm{gram}_l(x^i_{1:T_i}) \cap \mathrm{gram}_l(x^i_{1:\hat{T}_i})\big|}{\big|\mathrm{gram}_l(x^i_{1:\hat{T}_i})\big|},$

where $n_{\text{test}}$ is the number of sentences in the testing set, $|\cdot|$ denotes a set's size, and $\mathrm{gram}_l(\cdot)$ is the $l$-gram set of a sentence. For the sentence "the cat in the hat", its 1-gram set is {"the", "cat", "in", "the", "hat"}, its 2-gram set is {"the cat", "cat in", "in the", "the hat"}, and its 3-gram set is {"the cat in", "cat in the", "in the hat"}. The BLEU$_l$-loss can be computed using the NLTK library in Python. Whereas the entropy loss measures the occurrence probability of the reference sentences, the BLEU$_l$-loss focuses on exact matching between $l$ consecutive words of two sentences. Moreover, we also consider the SF-BLEU$_l$-loss to evaluate the diversity of generated sentences [43], defined as

$\mathrm{SF\text{-}BLEU}_l\text{-loss}(p) = 1 - n_{\text{test}}^{-1}\sum_{i=1}^{n_{\text{test}}} \exp\Big(\min\Big(1 - \dfrac{\hat{T}_i}{\hat{T}^i_{\min}}, 0\Big)\Big)\max_{j \ne i}\dfrac{\big|\mathrm{gram}_l(x^i_{1:\hat{T}_i}) \cap \mathrm{gram}_l(x^j_{1:\hat{T}_j})\big|}{\big|\mathrm{gram}_l(x^i_{1:\hat{T}_i})\big|},$

where $\hat{T}^i_{\min} = \operatorname*{argmin}_{\hat{T}_j;\, j \ne i}|\hat{T}_j - \hat{T}_i|$, and a high SF-BLEU$_l$-loss score indicates more diverse generated sentences.
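A sketch of the BLEU$_l$-loss computation above, using $l$-gram multisets, is given below; the brevity-penalty term follows the displayed formula, and the toy sentences are placeholders (in practice the NLTK routine mentioned above can be used).

```python
import math
from collections import Counter

def grams(tokens, l):
    """Multiset of l-grams of a tokenized sentence."""
    return Counter(tuple(tokens[i:i + l]) for i in range(len(tokens) - l + 1))

def bleu_l_loss(generated, references, l):
    """BLEU_l-loss over a test set of (generated, reference) sentence pairs."""
    total = 0.0
    for gen, ref in zip(generated, references):
        overlap = sum((grams(gen, l) & grams(ref, l)).values())     # |gram_l(ref) ∩ gram_l(gen)|
        precision = overlap / max(sum(grams(gen, l).values()), 1)   # / |gram_l(gen)|
        brevity = math.exp(min(1 - len(gen) / len(ref), 0))         # length penalty
        total += brevity * precision
    return 1 - total / len(generated)

# Toy usage with placeholder sentences.
gen = [["the", "cat", "in", "the", "hat"]]
ref = [["the", "cat", "sat", "on", "the", "hat"]]
print(bleu_l_loss(gen, ref, l=2))
```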

For training, validation, and testing, we randomly split all the labeled articles into three sets with a partition ratio of 60%, 20%, and 20%, respectively. Moreover, for a sentence x1:T and its associated topic y in a testing set, five starting words x1:5 as opposed to the null word are given to predict the rest of a sentence.

Consider two situations for the semantic label: (1) $Y \in \{0, 1\}^K$ is categorical and is coded as a 0–1 vector using the one-hot encoding of the topic category; (2) $Y \in \mathbb{R}^K$ is continuous with each topic represented as a $p = 128$-dimensional vector based on Doc2Vec. In (2), each topic is represented by the averaged embedding of all the sentences of this topic category in the training data.

In the case of $Y \in \{0, 1\}^K$, the indirect generator involves (14) and (17). For (14), we perform regularized multinomial logistic regression using the Python library sklearn3 on the embedded next-word-prediction sequence training samples $(E(x^i_{1:t}), y^i)$, where $E(x^i_{1:t})$ is the numerical embedding of Doc2Vec4 of size $p = 128$, and the optimal $\lambda_b$ is obtained by minimizing the entropy loss on validation data over the grid {.0001, .001, .01, .1, 1, 10, 100}. For (17), the indirect RNN is trained on both labeled and unlabeled next-word-prediction sequences in the training data. The indirect RNN model in (17) is structured in four layers, including an embedding layer consisting of 5,369 nodes with each node corresponding to one word in the dictionary $\mathcal{D}$, an LSTM layer with 128 latent factors, and a dense layer with output dimension 5,369. Note that the tuning parameter in (17) is fixed as $\lambda_m = .0001$ in the embedding layer to regularize words absent from a training set. Similarly, the direct generator trains the RNN model in (22), which has the same configuration as the indirect RNN except that the input dimension is $|\mathcal{D}| + K = 5{,}374$ in its embedding layer. Moreover, Separate RNN has the same structure as the indirect RNN for each topic.

As discussed in Section 4.2, different RNN model architectures may yield different empirical performance. Toward this end, we compare the LSTM architecture with the GPT2 architecture for the direct RNN. In particular, we consider the base GPT2 with 12 layers and 117M parameters [30] for the direct method, denoted as Direct-GPT2. One key difference between LSTM and GPT2 lies in GPT2's masked self-attention layer, which masks future tokens and passes the attention information through tokens positioned to the left of the current position.

In the case of continuous $Y \in \mathbb{R}^p$ after numerical embedding, the indirect generation proceeds as in the categorical case except that linear regression, as opposed to multinomial logistic regression, in (14) is performed using sklearn on the labeled embedded next-word-prediction sequences in the training data $(E(x^i_{1:t}), y^i)$, where each $y^i$ is a 128-dimensional embedding vector.

All RNN models are trained using Keras5 with a batch size of 200 and 100 epochs, with Adam as the optimizer, and over-fitting is prevented by early termination [4] with a patience of 20. Moreover, the coupled generator is tuned as in (5).

As indicated in Table 2, when only labeled data is available, the coupled generator delivers higher accuracy than the direct and indirect generators, which suggests the advantage of the proposed method. When combining with unlabeled data, the coupled generator, which selects the indirect generator in this case, outperforms the direct generator and the separate RNN for both categorical and continuous labels. With respect to the entropy loss, the amounts of improvement of the indirect generator over the separate RNN method and the direct generator are 20.3% and 14.5% for the categorical case and 29.1% and 16.1% for the continuous case. With respect to the BLEU1–BLEU4 losses, a similar situation occurs, with varying amounts of improvement and the best improvement around 15.6%. Concerning unlabeled data, a comparison between the indirect generator with and without unlabeled data suggests that unlabeled data indeed help to improve the performance of the indirect generation by over 14.5%. Interestingly, in terms of the entropy loss, the direct generator based on fine-tuned GPT2 outperforms the direct and indirect generators based on LSTM without unlabeled data, while the coupled generator achieves the best performance among them. However, they perform similarly in terms of the BLEUl scores. In view of the SF-BLEUl scores, sentences generated by the direct and indirect generators have a high degree of diversity. Moreover, the semantic label Y after the Doc2Vec sentence embedding performs slightly worse than its categorical counterpart for the indirect and direct generations, indicating that semantic relations or linguistic dependencies, as captured by the sentence embeddings, may not have an impact given that there are only five categories. Finally, as suggested in Table 3, an abstract generated based on the five categories is reasonable except for three grammatical errors that are correctable by a grammar checker6.

Table 2.

Test errors in the loss functions–Entropy, BLEUl, and SF-BLEUl (standard errors in parentheses)–of various generators based on 20 random partitions of the UCI sentence categorization text corpus. Here "Separate RNN", "Indirect", and "Direct" denote the separate RNN, indirect, and direct generators based on the RNN-LSTM architecture, "Direct-GPT2" denotes the direct generator based on the GPT2 architecture, and "Coupled" denotes the coupled generator, while Indirect-label or Coupled-label refers to the corresponding generation without unlabeled data.

Y: categorical label
Method Entropy BLEU1-loss BLEU2-loss BLEU3-loss BLEU4-loss
Separate RNN 9.317(.040) 0.895(.010) 0.926(.008) 0.954(.007) 0.971(.005)
Indirect 7.424(.049) 0.768(.003) 0.854(.002) 0.885(.002) 0.914(.002)
Indirect-label 8.839(.060) 0.831(.008) 0.878(.005) 0.899(.004) 0.923(.003)
Direct 9.537(.054) 0.823(.008) 0.872(.005) 0.895(.005) 0.919(.004)
Direct-GPT2 8.684(.051) 0.900(.006) 0.954(.002) 0.970(.001) 0.981(.001)
Coupled 7.424(.049) 0.768(.003) 0.854(.002) 0.885(.002) 0.914(.002)
Coupled-label 8.644(.050) 0.880(.008) 0.932(.008) 0.949(.007) 0.963(.006)
Method SF-BLEU1-loss SF-BLEU2-loss SF-BLEU3-loss SF-BLEU4-loss
Separate RNN 0.076(.010) 0.208(.027) 0.271(.036) 0.303(.043)
Indirect 0.105(.006) 0.296(.009) 0.416(.012) 0.502(.013)
Indirect-label 0.138(.008) 0.363(.022) 0.472(.029) 0.545(.036)
Direct 0.139(.006) 0.372(.019) 0.487(.026) 0.561(.032)
Direct-GPT2 0.053(.006) 0.159(.019) 0.255(.031) 0.320(.040)
Coupled 0.105(.006) 0.296(.009) 0.416(.012) 0.502(.013)
Coupled-label 0.082(.011) 0.233(.028) 0.342(.038) 0.417(.045)
Y: continuous label based on Doc2Vec [23, 24]
Method Entropy BLEU1-loss BLEU2-loss BLEU3-loss BLEU4-loss
Indirect 7.641(.036) 0.768(.005) 0.851(.003) 0.883(.003) 0.912(.003)
Indirect-label 8.512(.041) 0.912(.010) 0.937(.008) 0.949(.007) 0.960(.005)
Direct 9.102(.050) 0.916(.010) 0.939(.007) 0.950(.005) 0.961(.004)
Coupled 7.641(.036) 0.768(.005) 0.851(.003) 0.883(.003) 0.912(.003)
Coupled-label 8.512(.041) 0.912(.010) 0.937(.008) 0.949(.007) 0.960(.005)
Method SF-BLEU1-loss SF-BLEU2-loss SF-BLEU3-loss SF-BLEU4-loss
Indirect 0.097(.005) 0.261(.008) 0.361(.010) 0.440(.012)
Indirect-label 0.064(.010) 0.165(.026) 0.211(.035) 0.232(.040)
Direct 0.079(.014) 0.202(.037) 0.252(.046) 0.271(.050)
Coupled 0.097(.005) 0.261(.008) 0.361(.010) 0.440(.012)
Coupled-label 0.064(.010) 0.165(.026) 0.211(.035) 0.232(.040)

Supplementary Material


Acknowledgments

The authors thank the editors, the associate editor, and two anonymous referees for helpful comments and suggestions.

Research supported in part by NSF grants DMS-1712564, DMS-1721216, DMS-1952539, DMS-1952386, and NIH grants 1R01GM126002, R01HL105397, and R01AG065636.

Appendix

Proof of Lemma 1.

Note that $L_b(\theta_b)$ in (14) is convex in $\theta_b$, and $L_b(\theta_b)$ and $L_m(\theta_m)$ in (17) are continuously twice differentiable. Then the result follows from Theorem 4 of [21]. This completes the proof. □

Proof of Theorem 1.

Note that $\hat{p}^b_Y(y) = \int \hat{p}^b_{Y|X}(y \mid x)\,\hat{p}^b_X(x)\,dx$ and

$d^2\big(\hat{p}^b_{X|Y}, p^0_{X|Y}\big) = \displaystyle\int\!\!\int\Big(\Big(\dfrac{p^0_Y(y)\,\hat{p}^b_{Y|X}(y \mid x)\,\hat{p}^b_X(x)}{\hat{p}^b_Y(y)}\Big)^{1/2} - \big(p^0_Y(y)\,p^0_{X|Y}(x \mid y)\big)^{1/2}\Big)^2 dx\,dy.$

Furthermore, $\int \hat{p}^b_{Y|X}(y \mid x)\,dy = 1$. It follows from the triangle inequality that

$d\big(\hat{p}^b_{X|Y}, p^0_{X|Y}\big) \le \Big(\displaystyle\int\!\!\int\Big(\Big(\dfrac{p^0_Y(y)\,\hat{p}^b_{Y|X}(y \mid x)\,\hat{p}^b_X(x)}{\hat{p}^b_Y(y)}\Big)^{1/2} - \big(\hat{p}^b_{Y|X}(y \mid x)\,\hat{p}^b_X(x)\big)^{1/2}\Big)^2 dx\,dy\Big)^{1/2} + \Big(\displaystyle\int\!\!\int\Big(\big(\hat{p}^b_{Y|X}(y \mid x)\,\hat{p}^b_X(x)\big)^{1/2} - \big(\hat{p}^b_{Y|X}(y \mid x)\,p^0_X(x)\big)^{1/2}\Big)^2 dx\,dy\Big)^{1/2} + \Big(\displaystyle\int\!\!\int\Big(\big(\hat{p}^b_{Y|X}(y \mid x)\,p^0_X(x)\big)^{1/2} - \big(p^0_{Y|X}(y \mid x)\,p^0_X(x)\big)^{1/2}\Big)^2 dx\,dy\Big)^{1/2} = h\big(p^0_Y, \hat{p}^b_Y\big) + h\big(p^0_X, \hat{p}^b_X\big) + \Big(E\big(h^2(\hat{p}^b_{Y|X}, p^0_{Y|X})\big)\Big)^{1/2}.$ (26)

Note that $\hat{p}^b_{X,Y}(x, y) = \hat{p}^b_{Y|X}(y \mid x)\,\hat{p}^b_X(x)$. By the triangle inequality,

$h\big(p^0_Y, \hat{p}^b_Y\big) = \Big(\displaystyle\int\Big(\Big(\int p^0_{X,Y}(x, y)\,dx\Big)^{\frac{1}{2}} - \Big(\int \hat{p}^b_{X,Y}(x, y)\,dx\Big)^{\frac{1}{2}}\Big)^2 dy\Big)^{\frac{1}{2}} \le \Big(\displaystyle\int\!\!\int\Big(\big(p^0_{X,Y}(x, y)\big)^{\frac{1}{2}} - \big(\hat{p}^b_{X,Y}(x, y)\big)^{\frac{1}{2}}\Big)^2 dx\,dy\Big)^{\frac{1}{2}} \le \Big(\displaystyle\int\!\!\int\Big(\big(p^0_{X,Y}(x, y)\big)^{\frac{1}{2}} - \big(p^0_X(x)\,\hat{p}^b_{Y|X}(y \mid x)\big)^{\frac{1}{2}}\Big)^2 dx\,dy\Big)^{\frac{1}{2}} + \Big(\displaystyle\int\Big(\big(\hat{p}^b_X(x)\big)^{\frac{1}{2}} - \big(p^0_X(x)\big)^{\frac{1}{2}}\Big)^2 dx\Big)^{\frac{1}{2}} \le h\big(p^0_X, \hat{p}^b_X\big) + \Big(E\big(h^2(p^0_{Y|X}, \hat{p}^b_{Y|X})\big)\Big)^{\frac{1}{2}}.$

Hence, $d\big(\hat{p}^b_{X|Y}, p^0_{X|Y}\big) \le 2\Big(h\big(\hat{p}^b_X, p^0_X\big) + \big(E(h^2(\hat{p}^b_{Y|X}, p^0_{Y|X}))\big)^{1/2}\Big)$. Consequently,

$P\Big(d\big(\hat{p}^b_{X|Y}, p^0_{X|Y}\big) \ge 2(\eta_b + \eta_m)\Big) \le P\Big(h\big(p^0_X, \hat{p}_X\big) \ge \eta_m\Big) + P\Big(\big(E_X(h^2(p^0_{Y|X}, \hat{p}_{Y|X}))\big)^{\frac{1}{2}} \ge \eta_b\Big) \equiv I_1 + I_2.$

To bound I1, let

$I_3 = P\Big((n+\tilde{n})^{-1}\sum_{i=1}^{n+\tilde{n}}\big(\log(p^0_X(X_i)) - \log(p^*_X(X_i))\big) - \lambda_m J(p^0_X) + \lambda_m J(p^*_X) \ge c_9\eta_m^2/4\Big);$
$I_4 = P\Big(\sup_{d_m(p, p^0) \ge \eta_m}(n+\tilde{n})^{-1}\sum_{i=1}^{n+\tilde{n}}\log\Big(\dfrac{p_X(X_i)}{p^0_X(X_i)}\Big) - \lambda_m J(p_X) + \lambda_m J(p^0_X) \ge -c_9\eta_m^2/4\Big),$

where $c_9 \equiv 1 - 2\exp(-\tau/2)/(1 - \exp(-\tau/2))^2 > 0$ is a constant defined by the truncation constant $\tau > 0$. Then $I_1$ is upper bounded by

$P\Big(\sup_{d_m(p, p^0) \ge \eta_m}(n+\tilde{n})^{-1}\sum_{i=1}^{n+\tilde{n}}\log\big(p_X(X_i)/p^*_X(X_i)\big) - \lambda_m J(p_X) + \lambda_m J(p^*_X) \ge 0\Big) \le I_3 + I_4.$

By the Markov inequality,

$I_3 \le P\Big((n+\tilde{n})^{-1}\sum_{i=1}^{n+\tilde{n}}\big(\log(p^0_X(X_i)) - \log(p^*_X(X_i))\big) \ge c_9\eta_m^2/4 - \lambda_m J(p^*_X)\Big) \le \dfrac{\prod_{i=1}^{n+\tilde{n}}E_X\Big(\dfrac{p^0_X(X_i)}{p^*_X(X_i)}\Big)^{\alpha}}{\exp\big(\frac{c_9\alpha}{8}(n+\tilde{n})\eta_m^2\big)} \le \dfrac{(1+\alpha\gamma_m)^{n+\tilde{n}}}{\exp\big(\frac{c_9\alpha}{8}(n+\tilde{n})\eta_m^2\big)} \le \exp\Big(-\dfrac{c_9\alpha}{8}(n+\tilde{n})\eta_m^2 + (n+\tilde{n})\alpha\gamma_m\Big).$

By Corollary 1 of [33], $I_4 \le 7\exp\big(-c_8(n+\tilde{n})\eta_m^2/2\big)$, implying $I_1 \le 7\exp\big(-c_7(n+\tilde{n})\eta_m^2\big) + \exp\big(-\frac{c_9\alpha}{8}(n+\tilde{n})\eta_m^2 + (n+\tilde{n})\alpha\gamma_m\big)$ for some constant $c_7 > 0$. For $I_2$, a similar probabilistic bound can be established by applying the same argument as in Theorem 2 and switching the roles of $X$ and $Y$. This leads to $I_2 \le 7\exp\big(-c_8 n\eta_b^2\big) + \exp\big(-\frac{c_9\alpha}{8}n\eta_b^2 + n\alpha\gamma_b\big)$ for some constant $c_8 > 0$. The desired result then follows. □

Proof of Theorem 2.

Denote

$I_5 = P\Big(n^{-1}\sum_{i=1}^n\log\Big(\dfrac{p^0_{X|Y}(X_i \mid Y_i)}{p^*_{X|Y}(X_i \mid Y_i)}\Big) - \lambda J(p^0_{X|Y}) + \lambda J(p^*_{X|Y}) \ge c_9\eta_f^2/4\Big),$
$I_6 = P\Big(\sup_{d(p, p^0) \ge \eta_f}\Big(n^{-1}\sum_{i=1}^n\log\Big(\dfrac{p^{(\tau)}_{X|Y}(X_i \mid Y_i)}{p^*_{X|Y}(X_i \mid Y_i)}\Big) - \lambda J(p_{X|Y}) + \lambda J(p^*_{X|Y})\Big) \ge 0\Big).$

By the definition of a minimizer, for any $\eta_f > 0$,

$P\Big(d\big(\hat{p}^f_{X|Y}, p^0_{X|Y}\big) \ge \eta_f\Big) \le I_5 + I_6,$

where $\log(F_i^{(\tau)}) = \log\big(p^{(\tau)}_{X|Y}(X_i \mid Y_i)\big) - \log\big(p^0_{X|Y}(X_i \mid Y_i)\big)$ and

$p^{(\tau)}_{X|Y}(x \mid y) = \begin{cases}\exp(-\tau)\,p^*_{X|Y}(x \mid y), & \text{if } p_{X|Y}(x \mid y) < \exp(-\tau)\,p^*_{X|Y}(x \mid y),\\ p_{X|Y}(x \mid y), & \text{otherwise,}\end{cases}$

is the left truncation of $p_{X|Y}(x \mid y)$.

Next, we bound I5 and I6 separately. An application of the same argument as in [38] yields that

$I_5 \le P\Big(n^{-1}\sum_{i=1}^n\big(\log(p^0_{X|Y}(X_i \mid Y_i)) - \log(p^*_{X|Y}(X_i \mid Y_i))\big) \ge c_9\eta_f^2/4 - \lambda_f J(p^*_{X|Y})\Big) \le P\Big(\prod_{i=1}^n\Big(\dfrac{p^0_{X|Y}(X_i \mid Y_i)}{p^*_{X|Y}(X_i \mid Y_i)}\Big)^\alpha \ge \exp\Big(\dfrac{c_9\alpha}{8}n\eta_f^2\Big)\Big) \le \dfrac{\prod_{i=1}^n E_Y E_{X|Y}\Big(\dfrac{p^0_{X|Y}(X_i \mid Y_i)}{p^*_{X|Y}(X_i \mid Y_i)}\Big)^\alpha}{\exp\big(\frac{c_9\alpha}{8}n\eta_f^2\big)} \le \dfrac{(1+\alpha\gamma_f)^n}{\exp\big(\frac{c_9\alpha}{8}n\eta_f^2\big)} \le \exp\Big(-\dfrac{\alpha}{8}c_9 n\eta_f^2 + n\log(1+\alpha\gamma_f)\Big) \le \exp\Big(-\dfrac{\alpha}{8}c_9 n\eta_f^2 + n\alpha\gamma_f\Big),$ (27)

where the second inequality follows from $\lambda_f J(p^*_{X|Y}) \le c_9\eta_f^2/8$ and the third inequality follows from Markov's inequality.

Our treatment of bounding $I_6$ relies on a chaining argument over a suitable partition of $\mathcal{F}_f$ and the left-truncation of likelihood ratios as in [38, 33]. Now, consider a partition of $\mathcal{F}_f = \cup_{k=1}^{\infty}\cup_{j=0}^{\infty}\mathcal{F}_{kj}$:

$\mathcal{F}_{kj} = \big\{p \in \mathcal{F}_f: 2^{k-1}\eta_n^2 \le d^2(p^0, p) \le 2^{k}\eta_n^2,\ 2^{j-1}J(p^0) \le J(p) \le 2^{j}J(p^0)\big\},$
$\mathcal{F}_{k0} = \big\{p \in \mathcal{F}_f: 2^{k-1}\eta_n^2 \le d^2(p^0, p) \le 2^{k}\eta_n^2,\ J(p) \le J(p^0)\big\}; \quad k = 1, \ldots,\ j = 0, \ldots,$

where $\log(F_i^{(\tau)}) = \log\big(p^{(\tau)}_{X|Y}(X_i \mid Y_i)\big) - \log\big(p^0_{X|Y}(X_i \mid Y_i)\big)$. Then, for any $\eta_f > 0$,

$I_6 \le P\Big(\sup_{d(p, p^0) \ge \eta_f}\Big(n^{-1}\sum_{i=1}^n\log(F_i^{(\tau)}) - \lambda_f J(p_{X|Y}) + \lambda_f J(p^0_{X|Y})\Big) \ge -c_9\eta_f^2/4\Big) \le \sum_{k=1}^{\infty}\sum_{j=0}^{\infty}P\Big(\sup_{p \in \mathcal{F}_{kj}}\Big(n^{-1}\sum_{i=1}^n\log(F_i^{(\tau)}) - \lambda_f J(p_{X|Y}) + \lambda_f J(p^0_{X|Y})\Big) \ge -c_9\eta_f^2/4\Big) \equiv \sum_{k=1}^{\infty}\sum_{j=0}^{\infty}I_{kj},$ (28)

where $I_{kj} = P\big(\sup_{p \in \mathcal{F}_{kj}}\big(n^{-1}\sum_{i=1}^n\log(F_i^{(\tau)}) - \lambda_f J(p_{X|Y}) + \lambda_f J(p^0_{X|Y})\big) \ge -c_9\eta_f^2/4\big)$. To treat $I_{kj}$, we control the mean and variance of $\log(F^{(\tau)})$. By Lemma 4 of [38],

$\sup_{p \in \mathcal{F}_{kj}}E\big(\log(F^{(\tau)})\big) = \sup_{p \in \mathcal{F}_{kj}}E_Y\big(E_{X|Y}(\log(F^{(\tau)}))\big) \le -c_9\inf_{p \in \mathcal{F}_{kj}}d^2(p, p^*) \le -c_9\,2^{k-1}\eta_f^2,$ (29)

and the variance is bounded by

$\sup_{p \in \mathcal{F}_{kj}}\mathrm{Var}\big(\log(F^{(\tau)})\big) \le \sup_{p \in \mathcal{F}_{kj}}E_Y\big(E_{X|Y}(\log(F^{(\tau)})^2)\big) \le 4\exp(\tau)\sup_{p \in \mathcal{F}_{kj}}E_Y h^2\big(p^0_{X|Y}, p_{X|Y}\big) \le 4\exp(\tau)\,2^{k}\eta_f^2 \le 8\exp(\tau)\,\delta_{kj}/c_9,$ (30)

where the second inequality follows from Lemma 3 of [38]. Then, Ikj is upper-bounded by

$I_{kj} \le P\Big(\sup_{p \in \mathcal{F}_{kj}}\Big(n^{-1}\sum_{i=1}^n\log(F_i^{(\tau)}) - E\log(F^{(\tau)})\Big) \ge -\sup_{p \in \mathcal{F}_{kj}}\Big(E\log(F^{(\tau)}) + \lambda\big(J(p^0_{X|Y}) - J(p_{X|Y})\big)\Big) - c_9\eta_f^2/4\Big) \le P\Big(\sup_{p \in \mathcal{F}_{kj}}\Big(n^{-1}\sum_{i=1}^n\log(F_i^{(\tau)}) - E\log(F^{(\tau)})\Big) \ge \delta_{kj}\Big) \le 3\exp\big(-a_3 n\delta_{kj}\big),$ (31)

where $a_3 > 0$ is a constant, $\delta_{kj} = c_9 2^{k-1}\eta_n^2/2 + \lambda(2^{j-1}-1)J(p^0_{X|Y})$, $\delta_{k0} = c_9 2^{k-2}\eta_n^2/2$, the second inequality follows from the assumption that $\lambda J(p^0_{X|Y}) \le c_9\eta_f^2/4$ and (29), and the last inequality follows from Lemma 2 and the fact that the $j$-th ($j \ge 2$) moment $E\big(|\log(p^{(\tau)}_{X|Y}(X \mid Y)/p^0_{X|Y}(X \mid Y))|^j\big)$ is bounded by

$j!\,2^j\,E_Y E_{X|Y}\Big(\exp\big(\tfrac{1}{2}\big|\log\big(p^{(\tau)}_{X|Y}(X \mid Y)/p^0_{X|Y}(X \mid Y)\big)\big|\big) - 1 - \tfrac{1}{2}\big|\log\big(p^{(\tau)}_{X|Y}(X \mid Y)/p^0_{X|Y}(X \mid Y)\big)\big|\Big) \le j!\,2^j a_1\,E_Y\big\|\big(p_{X|Y}\big)^{1/2} - \big(p^0_{X|Y}\big)^{1/2}\big\|_2^2,$

where $a_1 = \big(\exp(\tau/2) - 1 - \tau/2\big)/\big(1 - \exp(-\tau/2)\big)^2 > 0$ is a constant and the last inequality follows from Lemma 5 in [34]. It suffices to verify condition (2.4) of [38]. A combination of (28) and (31) yields that $I_6 \le \sum_{k=1}^{\infty}\sum_{j=0}^{\infty}3\exp\big(-c_{13}n\delta_{kj}^2\big) \le 7\exp\big(-c_{13}n\eta_f^2\big)$, which, together with (27), yields that $P\big(d(\hat{p}^f_{X|Y}, p^0_{X|Y}) \ge \eta_f\big) \le I_5 + I_6 \le 7\exp\big(-c_{13}n\eta_f^2\big) + \exp\big(-\tfrac{\alpha}{8}c_9 n\eta_f^2 + n\alpha\gamma_f\big)$. The desired result then follows. □

Proof of Theorem 3.

Let $(X_i, Y_i)_{i=1}^N$ be a cross-validation sample. By (5),

$-N^{-1}\sum_{i=1}^N\log \hat{p}^c_{X|Y}(X_i \mid Y_i) \le \min\Big(-N^{-1}\sum_{i=1}^N\log \hat{p}^f_{X|Y}(X_i \mid Y_i),\ -N^{-1}\sum_{i=1}^N\log \hat{p}^b_{X|Y}(X_i \mid Y_i)\Big),$

and the desired result follows from the law of large numbers by taking the limit of both sides as $N \to \infty$. This completes the proof. □

Proof of Corollary 1.

For the direct sequential generation, we apply the same argument as in the proof of Theorem 2. Denote

$I_7 = P\Big(\sup_{d(p, p^0) \ge \eta_f}\sum_{i=1}^n\dfrac{1}{nT}\sum_{t=1}^T\log\Big(\dfrac{p^0(X^i_{t+1} \mid X^i_{1:t}, Y^i)}{p^*(X^i_{t+1} \mid X^i_{1:t}, Y^i)}\Big) - \lambda_f J_f(p^0_{X|Y}) + \lambda_f J_f(p^*_{X|Y}) \ge \dfrac{c_9\eta_f^2}{4}\Big),$
$I_8 = P\Big(\sup_{d(p, p^0) \ge \eta_f}\sum_{i=1}^n\dfrac{1}{nT}\sum_{t=1}^T\log\Big(\dfrac{p^{(\tau)}(X^i_{t+1} \mid X^i_{1:t}, Y^i)}{p^0(X^i_{t+1} \mid X^i_{1:t}, Y^i)}\Big) - \lambda_f J_f(p_{X|Y}) + \lambda_f J_f(p^0_{X|Y}) \ge -\dfrac{c_9\eta_f^2}{4}\Big).$

Then $P\big(d(\hat{p}^f_{X|Y}, p^0_{X|Y}) \ge \eta_f\big) \le I_7 + I_8$, where $p^{(\tau)}(X_{t+1} \mid X_{1:t}, Y)$ is the left truncation of $p(X_{t+1} \mid X_{1:t}, Y)$ as defined in the proof of Theorem 2.

For I7,

$I_7 \le P\Big(\prod_{i=1}^n\Big(\prod_{t=1}^T\dfrac{p^0(X^i_{t+1} \mid X^i_{1:t}, Y^i)}{p^*(X^i_{t+1} \mid X^i_{1:t}, Y^i)}\Big)^{\alpha T^{-1}} \ge \exp\Big(\dfrac{c_9\alpha}{8}n\eta_f^2\Big)\Big) \le \dfrac{\prod_{i=1}^n E\Big(\prod_{t=1}^T\dfrac{p^0(X^i_{t+1} \mid X^i_{1:t}, Y^i)}{p^*(X^i_{t+1} \mid X^i_{1:t}, Y^i)}\Big)^{\alpha T^{-1}}}{\exp\big(\frac{c_9\alpha}{8}n\eta_f^2\big)} \le \dfrac{\prod_{i=1}^n \bar{E}\Big(\Big(\dfrac{p^0(X^i_{t+1} \mid X^i_{1:t}, Y^i)}{p^*(X^i_{t+1} \mid X^i_{1:t}, Y^i)}\Big)^{\alpha}\Big)}{\exp\big(\frac{c_9\alpha}{8}n\eta_f^2\big)} \le \dfrac{(1+\alpha\gamma_f)^n}{\exp\big(\frac{c_9\alpha}{8}n\eta_f^2\big)} \le \exp\Big(-\dfrac{\alpha}{8}c_9 n\eta_f^2 + n\alpha\gamma_f\Big).$

For $I_8$, let $F_t^{(\tau)} = \log\big(p^{(\tau)}(X_{t+1} \mid X_{1:t}, Y)/p^0(X_{t+1} \mid X_{1:t}, Y)\big)$. For the first moment,

$E\Big(T^{-1}\sum_{t=1}^T F_t^{(\tau)}\Big) = T^{-1}\sum_{t=1}^T E_{X_{1:t}, Y}\Big(E_{X_{t+1}}\big(F_t^{(\tau)} \mid X_{1:t}, Y\big)\Big) \le -c_9 T^{-1}\sum_{t=1}^T E_{X_{1:t}, Y}\Big(\big\|\big(p^{(\tau)}_{X_{t+1} \mid X_{1:t}, Y}\big)^{\frac{1}{2}} - \big(p^0_{X_{t+1} \mid X_{1:t}, Y}\big)^{\frac{1}{2}}\big\|_2^2\Big) = -c_9\,d^2\big(p^{(\tau)}_{X|Y}, p^0_{X|Y}\big).$

For the $j$-th moment with $j \ge 2$,

$E\Big|T^{-1}\sum_{t=1}^T F_t^{(\tau)}\Big|^j \le T^{-1}\sum_{t=1}^T E_{X_{1:t}, Y}\Big(E_{X_{t+1}}\big(|F_t^{(\tau)}|^j \mid X_{1:t}, Y\big)\Big) \le j!\,2^j\,T^{-1}\sum_{t=1}^T E_{X_{1:t}, Y}\Big(E_{X_{t+1}}\big(\big(\exp(|F_t^{(\tau)}|/2) - 1 - |F_t^{(\tau)}|/2\big) \mid X_{1:t}, Y\big)\Big) \le j!\,2^j a_3\Big(T^{-1}\sum_{t=1}^T E_{X_{1:t}, Y}\Big(\big\|\big(p^{(\tau)}_{X_{t+1} \mid X_{1:t}, Y}\big)^{\frac{1}{2}} - \big(p^0_{X_{t+1} \mid X_{1:t}, Y}\big)^{\frac{1}{2}}\big\|_2^2\Big)\Big) \le j!\,2^j a_1\,d^2\big(p^{(\tau)}_{X|Y}, p^0_{X|Y}\big),$

where the first inequality follows from Jensen's inequality. Then

$P\big(d(\hat{p}^f_{X|Y}, p^0_{X|Y}) \ge \eta_f\big) \le 6\exp\big(-c_7 n\eta_f^2\big) + \exp\Big(-\dfrac{\alpha}{8}c_9 n\eta_f^2 + n\alpha\gamma_f\Big),$ (32)

which follows from the same arguments as in the proof of Theorem 2.

For the indirect generation, let $p_t(\cdot) = p(\cdot \mid X_{1:t})$ and $E_t(\cdot) = E(\cdot \mid X_{1:t})$. Then,

$d\big(\hat{p}^b_{X|Y}, p^0_{X|Y}\big) = \Big(T^{-1}\sum_{t=1}^T E_{X_{1:t}}E_{Y \mid X_{1:t}}h^2\big(\hat{p}^b_{X_{t+1} \mid X_{1:t}, Y}, p^0_{X_{t+1} \mid X_{1:t}, Y}\big)\Big)^{\frac{1}{2}} \le 2\Big(T^{-1}\sum_{t=1}^T E\Big(h\big(\hat{p}_{X_{t+1} \mid X_{1:t}}, p^0_{X_{t+1} \mid X_{1:t}}\big) + \big(E_t h^2(\hat{p}_{Y \mid X_{1:t+1}}, p^0_{Y \mid X_{1:t+1}})\big)^{\frac{1}{2}}\Big)^2\Big)^{\frac{1}{2}} \le 2\Big(2T^{-1}\sum_{t=1}^T\Big(E h^2\big(\hat{p}_{X_{t+1} \mid X_{1:t}}, p^0_{X_{t+1} \mid X_{1:t}}\big) + E h^2\big(\hat{p}_{Y \mid X_{1:t+1}}, p^0_{Y \mid X_{1:t+1}}\big)\Big)\Big)^{\frac{1}{2}} \le 2\sqrt{2}\Big(d\big(\hat{p}_X, p^0_X\big) + d\big(\hat{p}_{Y|X}, p^0_{Y|X}\big)\Big),$

where the first inequality follows from (26) by replacing $p(\cdot)$ with $p_t(\cdot)$, and $d^2(\hat{p}_X, p^0_X) = \bar{E}h^2\big(\hat{p}_{X_{t+1} \mid X_{1:t}}, p^0_{X_{t+1} \mid X_{1:t}}\big)$, $d^2(\hat{p}_{Y|X}, p^0_{Y|X}) = \bar{E}h^2\big(\hat{p}_{Y \mid X_{1:t+1}}, p^0_{Y \mid X_{1:t+1}}\big)$. Therefore,

$P\Big(d\big(\hat{p}^b_{X|Y}, p^0_{X|Y}\big) \ge 2\sqrt{2}(\eta_b + \eta_m)\Big) \le P\big(d(\hat{p}_X, p^0_X) \ge \eta_m\big) + P\big(d(\hat{p}_{Y|X}, p^0_{Y|X}) \ge \eta_b\big) \le 7\exp\big(-c_7(n+\tilde{n})\eta_m^2\big) + \exp\Big(-\dfrac{\alpha c_9}{8}(n+\tilde{n})\eta_m^2 + \alpha(n+\tilde{n})\gamma_m\Big) + 7\exp\big(-c_8 n\eta_b^2\big) + \exp\Big(-\dfrac{\alpha c_9}{8}n\eta_b^2 + \alpha n\gamma_b\Big),$

where the last inequality follows from (32). Similarly, bounds for $P\big(d(\hat{p}_X, p^0_X) \ge \eta_m\big)$ and $P\big(d(\hat{p}_{Y|X}, p^0_{Y|X}) \ge \eta_b\big)$ can be established. The desired result then follows. □

Proof of Corollary 2.

It suffices to verify the entropy conditions in Corollary 1. For the direct generation, let $p_{X|Y} = \{p(X_{t+1} \mid X_{1:t}, Y; \theta_f)\}_{t=1}^T$ and $\bar{p}_{X|Y} = \{p(X_{t+1} \mid X_{1:t}, Y; \bar{\theta}_f)\}_{t=1}^T$ in $\mathcal{F}_{f,k}$. Then

$\kappa^2(p_{X|Y}, \bar{p}_{X|Y}) \le \bar{E}\big\|\sigma^{\frac{1}{2}}(W_f^o h_{t-1}) - \sigma^{\frac{1}{2}}(\bar{W}_f^o \bar{h}_{t-1})\big\|_2^2 \le \dfrac{1}{2}\bar{E}\big\|W_f^o h_{t-1} - \bar{W}_f^o \bar{h}_{t-1}\big\|_2^2 \le \bar{E}\Big(\big\|(W_f^o - \bar{W}_f^o)h_{t-1}\big\|_2^2 + \big\|\bar{W}_f^o(h_{t-1} - \bar{h}_{t-1})\big\|_2^2\Big) \le \bar{E}\Big(\|W_f^o - \bar{W}_f^o\|_F^2\|h_{t-1}\|_2^2 + \|\bar{W}_f^o\|_F^2\|h_{t-1} - \bar{h}_{t-1}\|_2^2\Big) \le \Big(\dfrac{2k(4k)^{T+1}}{T(4k-1)^2}\Big)\Big(\|W_f^x - \bar{W}_f^x\|_F^2 + 2r_f\|W_f^h - \bar{W}_f^h\|_F^2 + 4kc_{15}\|W_f^y - \bar{W}_f^y\|_F^2\Big) + r_f\|W_f^o - \bar{W}_f^o\|_F^2 \le T^{-1}\max(2r_f, 4kc_{15})^2(4k)^T\|\theta_f - \bar{\theta}_f\|_2^2,$

where the last inequality uses the fact that

$\|h_t - \bar{h}_t\|_2^2 \le \dfrac{(4k)^t - 1}{4k-1}\Big(2\|W_f^x - \bar{W}_f^x\|_F^2 + 4r_f\|W_f^h - \bar{W}_f^h\|_F^2\Big) + (4k)^t\|W_f^y - \bar{W}_f^y\|_F^2\|Y\|_2^2,$

which uses the fact that

$\|h_t - \bar{h}_t\|_2^2 \le \big\|(W_f^x - \bar{W}_f^x)1_{[X_t]} + W_f^h h_{t-1} - \bar{W}_f^h \bar{h}_{t-1}\big\|_2^2 \le 2\|W_f^x - \bar{W}_f^x\|_F^2 + 4r_f\|W_f^h - \bar{W}_f^h\|_F^2 + 4k\|h_{t-1} - \bar{h}_{t-1}\|_2^2,$

and $\|h_0 - \bar{h}_0\|_2^2 \le \|W_f^y - \bar{W}_f^y\|_F^2\|Y\|_2^2$.

Hence, $H(u, \mathcal{F}_{f,k}) \le \Lambda_f\log\big(3\max(2r_f, 4kc_{15})(4k)^{(T+1)/2}/(T^{1/2}u)\big)$ and the entropy condition is met by setting $\epsilon_f = \Big(\dfrac{\Lambda_f}{n}\log\Big(\dfrac{n\max(r_f, 2c_{15})2^T}{T^{1/2}\Lambda_f}\Big)\Big)^{1/2}$.

For the indirect generation, it suffices to verify the entropy conditions in Corollary 1. Let $p_X = \{p_X(X_{t+1} \mid X_{1:t}; \theta_m)\}_{t=1}^T$ and $\bar{p}_X = \{p_X(X_{t+1} \mid X_{1:t}; \bar{\theta}_m)\}_{t=1}^T$. Note that $h_0 = \bar{h}_0 = 0_{r_m}$ and

$\kappa^2(p_X, \bar{p}_X) \le \bar{E}\Big(\|W_m^o - \bar{W}_m^o\|_F^2\|h_{t-1}\|_2^2 + \|\bar{W}_m^o\|_F^2\|h_{t-1} - \bar{h}_{t-1}\|_2^2\Big) \le \dfrac{2k(4k)^{T+1}}{T(4k-1)^2}\Big(\|W_m^x - \bar{W}_m^x\|_F^2 + 2r_m\|W_m^h - \bar{W}_m^h\|_F^2\Big) + r_m\|W_m^o - \bar{W}_m^o\|_F^2 \le 2r_m T^{-1}(4k)^T\|\theta_m - \bar{\theta}_m\|_2^2.$

Then, $H(u, \mathcal{F}_{m,k}) \le \Lambda_m\log\big(3r_m(4k)^{(T+1)/2}/(T^{1/2}u)\big)$ and the entropy condition is met by setting $\epsilon_m = O_p\Big(\Big(\dfrac{\Lambda_m}{n+\tilde{n}}\log\Big(\dfrac{r_m(n+\tilde{n})2^T}{T^{1/2}\Lambda_m}\Big)\Big)^{1/2}\Big)$.

Moreover, if $Y \in \{0, 1\}^K$,

$\kappa^2(p_{Y|X}, \bar{p}_{Y|X}) \le \bar{E}\big\|\big(\sigma(\theta_b E(X_{1:t}))\big)^{\frac{1}{2}} - \big(\sigma(\bar{\theta}_b E(X_{1:t}))\big)^{\frac{1}{2}}\big\|_2^2 = \bar{E}\big\|\sigma^{\frac{1}{2}}(\theta_b E(X_{1:t})) - \sigma^{\frac{1}{2}}(\bar{\theta}_b E(X_{1:t}))\big\|_2^2 \le \dfrac{1}{2}\bar{E}\big\|(\theta_b - \bar{\theta}_b)E(X_{1:t})\big\|_2^2 \le \dfrac{1}{2}\|\theta_b - \bar{\theta}_b\|_F^2\,\bar{E}\|E(X_{1:t})\|_2^2.$

Similarly, $H(u, \mathcal{F}_{b,k}) \le \Lambda_b\log\big(3kc_{16}/u^2\big)$ and the entropy condition is met by setting $\epsilon_b = O_p\Big(\Big(\dfrac{\Lambda_b}{n}\log\Big(\dfrac{c_{16}n}{\Lambda_b}\Big)\Big)^{1/2}\Big)$.

If $y \in \mathbb{R}^K$, then

$\kappa^2(p_{Y|X}, \bar{p}_{Y|X}) = \bar{E}\Big(1 - \exp\Big(-\dfrac{1}{8}\big\|(\theta_b - \bar{\theta}_b)E(X_{1:t})\big\|_2^2\Big)\Big) \le \dfrac{1}{8}\bar{E}\big\|(\theta_b - \bar{\theta}_b)E(X_{1:t})\big\|_2^2 \le \dfrac{1}{8}\|\theta_b - \bar{\theta}_b\|_F^2\,\bar{E}\|E(X_{1:t})\|_2^2,$ (33)

implying that $H(u, \mathcal{F}_{b,k}) \le \Lambda_b\log\big(3kc_{16}/(2\sqrt{2}u)\big)$, and the entropy condition holds when $\epsilon_b = O_p\Big(\Big(\dfrac{\Lambda_b}{n}\log\Big(\dfrac{c_{16}n}{\Lambda_b}\Big)\Big)^{1/2}\Big)$. This completes the proof. □

Lemma 2.

Let $v_n(f) = n^{-1/2}\sum_{i=1}^n\big(v(f(X_i), f^0(X_i)) - Ev(f(X_i), f^0(X_i))\big)$, and assume that there exist some generic constants $a_2 > 0$ and $a_3 > 0$ such that, for $j \ge 2$,

$E\big|v(f(X), f^0(X))\big|^j \le a_2\, j!\, 2^j\, d^2(f, f^0),$

and, for any $\delta > 0$,

$\int_{\delta/2^8}^{2^{1/2}\delta^{1/2}} H^{1/2}(u, \mathcal{V}_\delta)\,du \le a_3 n^{1/2}\delta,$

where $\mathcal{V}_\delta = \{v(f, f^0): d^2(f, f^0) \le \delta,\ f \in \mathcal{F}\}$. Then there exist some constants $a_4 > 0$ and $a_5 > 0$ depending on $a_2$ and $a_3$ such that

$P^*\Big(\sup_{d^2(f, f^0) \le \delta;\, f \in \mathcal{F}} v_n(f) \ge a_4 n^{1/2}\delta\Big) \le 3\exp\big(-a_5 n\delta\big),$ (34)

where $P^*$ is the outer probability measure corresponding to $p^0_X$.

Proof of Lemma 2.

The result follows from Lemma 5 and Lemma 7 in [38], by replacing the Hellinger distance in Lemma 5 with a generic distance $d(\cdot, \cdot)$. □

Footnotes

References

  • [1]. Bishop CM. Pattern Recognition and Machine Learning. Springer, 2006.
  • [2]. Bottou L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
  • [3]. Breiman L. Random forests. Machine Learning, 45(1):5–32, 2001.
  • [4]. Caruana R, Lawrence S, and Giles CL. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402–408, 2001.
  • [5]. Cheng J and Lapata M. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252, 2016.
  • [6]. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, and Bengio Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [7]. Cozman F and Cohen I. Risks of semi-supervised learning. In Semi-Supervised Learning, pages 56–72, 2006.
  • [8]. Cozman FG, Cohen I, and Cirelo M. Unlabeled data can degrade classification performance of generative classifiers. In FLAIRS Conference, pages 327–331, 2002.
  • [9]. Cozman FG, Cohen I, and Cirelo MC. Semi-supervised learning of mixture models. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 99–106, 2003.
  • [10]. Dai B, Shen X, and Wang J. Embedding learning. Journal of the American Statistical Association, (in press), 2020.
  • [11]. Devlin J, Chang M-W, Lee K, and Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2018.
  • [12]. Dong L, Yang N, Wang W, Wei F, Liu X, Wang Y, Gao J, Zhou M, and Hon H-W. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054, 2019.
  • [13]. Gjorgjioski V, Kocev D, and Džeroski S. Comparison of distances for multi-label classification with PCTs. In Proceedings of the Slovenian KDD Conference on Data Mining and Data Warehouses (SiKDD'11), 2011.
  • [14]. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, and Bengio Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [15]. He D, Xia Y, Qin T, Wang L, Yu N, Liu T-Y, and Ma W-Y. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.
  • [16]. Hochreiter S and Schmidhuber J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [17]. Karpathy A and Fei-Fei L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
  • [18]. Kingma DP, Mohamed S, Rezende DJ, and Welling M. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
  • [19]. Kingma DP, Salimans T, Jozefowicz R, Chen X, Sutskever I, and Welling M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
  • [20]. Langkilde I. Forest-based statistical sentence generation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 170–177. Association for Computational Linguistics, 2000.
  • [21]. Lee JD, Simchowitz M, Jordan MI, and Recht B. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
  • [22]. Merity S, Keskar NS, and Socher R. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018.
  • [23]. Mikolov T, Sutskever I, Chen K, Corrado GS, and Dean J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
  • [24]. Mikolov T, Yih W-T, and Zweig G. Linguistic regularities in continuous space word representations. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.
  • [25]. Mullachery V and Motwani V. Image captioning. arXiv preprint arXiv:1805.09137, 2018.
  • [26]. Nallapati R, Melnyk I, Kumar A, and Zhou B. SenGen: Sentence generating neural variational topic model. arXiv preprint arXiv:1708.00308, 2017.
  • [27]. Ollivier Y, Tallec C, and Charpiat G. Training recurrent networks online without backtracking. arXiv preprint arXiv:1507.07680, 2015.
  • [28]. Papineni K, Roukos S, Ward T, and Zhu W-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  • [29]. Pascanu R, Mikolov T, and Bengio Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
  • [30]. Radford A, Wu J, Child R, Luan D, Amodei D, and Sutskever I. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
  • [31]. Schmidt M, Le Roux N, and Bach F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1–2):83–112, 2017.
  • [32]. Schuster M and Paliwal KK. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
  • [33]. Shen X. On the method of penalization. Statistica Sinica, 8(2):337–357, 1998.
  • [34]. Shen X and Wong WH. Convergence rate of sieve estimates. Annals of Statistics, pages 580–615, 1994.
  • [35]. Smale S and Zhou D-X. Estimating the approximation error in learning theory. Analysis and Applications, 1(1):17–41, 2003.
  • [36]. Sutskever I, Martens J, and Hinton GE. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
  • [37]. Vinyals O, Toshev A, Bengio S, and Erhan D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
  • [38]. Wong WH and Shen X. Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Annals of Statistics, 23(2):339–362, 1995.
  • [39]. Yang T and Priebe CE. The effect of model misspecification on semi-supervised classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10):2093–2103, 2011.
  • [40]. Yarotsky D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
  • [41]. Yu H-F, Huang F-L, and Lin C-J. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1–2):41–75, 2011.
  • [42]. Zhou D, Hofmann T, and Schölkopf B. Semi-supervised learning on directed graphs. In Advances in Neural Information Processing Systems, pages 1633–1640, 2004.
  • [43]. Zhu Y, Lu S, Zheng L, Guo J, Zhang W, Wang J, and Yu Y. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1097–1100, 2018.


Supplementary Materials

Supp 1
