Entropy. 2018 Nov 2;20(11):839. doi: 10.3390/e20110839

Cross Entropy of Neural Language Models at Infinity—A New Bound of the Entropy Rate

Shuntaro Takahashi 1,*, Kumiko Tanaka-Ishii 2
PMCID: PMC7512401  PMID: 33266563

Abstract

Neural language models have drawn a lot of attention for their strong ability to predict natural language text. In this paper, we estimate the entropy rate of natural language with state-of-the-art neural language models. To obtain the estimate, we consider the cross entropy, a measure of the prediction accuracy of neural language models, under the theoretically ideal conditions that they are trained with an infinitely large dataset and receive an infinitely long context for prediction. We empirically verify that the effects of the two parameters, the training data size and context length, on the cross entropy consistently obey a power-law decay with a positive constant for two different state-of-the-art neural language models with different language datasets. Based on the verification, we obtained 1.12 bits per character for English by extrapolating the two parameters to infinity. This result suggests that the upper bound of the entropy rate of natural language is potentially smaller than the previously reported values.

Keywords: entropy rate, natural language, language model, neural language model

1. Introduction

In recent years, a new type of computational model for natural language has emerged: the neural language model. The aggregation of technological advances in deep learning [1,2,3] has led to a series of substantial improvements to neural language models [4,5,6,7,8,9], and these models now significantly surpass n-gram language models in both character- and word-level prediction. Han et al. [10] further suggested that the prediction error of neural language models is potentially even lower than currently reported values. These models use neural networks with millions of parameters to best predict the next character from a context, and their performance is quantified by the cross entropy, a measure of prediction accuracy.

Language modeling is a central task in natural language processing, and language models have thus been developed and investigated for engineering purposes [11,12]. The majority of studies on neural language models have focused on decreasing the prediction error of the models in a fixed setting. We are naturally interested, however, in situating models under extreme conditions in which an infinite amount of computational resources is available. Such a model's prediction accuracy would be best if it were trained with an infinitely large dataset and made predictions by using an infinitely long context. The resulting value would, in fact, be a bound on the entropy rate.

The entropy rate of natural language is the average amount of information per character in a text of infinite length, and it characterizes the complexity of natural language. The entropy rate has been used to quantify the randomness of a process in many fields [13] and, specifically in computational linguistics, to compare languages worldwide [14]. Because the true probability distribution of natural language is inaccessible, various proposals have been made for estimating the entropy rate experimentally. These approaches rely on the predictive power of humans [15,16] or of computational models such as n-gram language models and compression algorithms [17,18,19]. Shannon [15] conducted a cognitive experiment in which a subject was asked to predict the next character from the preceding context. He reported that the upper bound of the entropy rate of English is approximately 1.3 bits per character (bpc). Brown et al. [17] collected over 500 million words of text and constructed a word-level tri-gram language model to estimate the entropy rate of English characters. They reported an estimated upper bound of 1.75 bpc for English. They obtained this upper bound in a specific setting, i.e., with a fixed dataset size and context length.

In contrast to those studies, some studies have tried to estimate the entropy rate by extrapolating parameters to infinity. Hilberg [20], on the basis of the experimental results of Shannon [15], argued that the estimated value of the entropy rate would reach zero if a person received a text of infinite length to predict. While the majority of the studies [15,16,17,18,19] on this topic disagree with this argument, it has motivated theoretical analyses [21,22,23,24] of entropy rate estimation. Schürmann and Grassberger [18] and Takahira et al. [19] sampled the encoding rate of a compression algorithm at different dataset sizes and extrapolated it to estimate the encoding rate for an infinitely large dataset. Takahira et al. [19] observed that the prediction accuracy follows relatively simple functions such as a power-law decay with a positive constant.

In this paper, we situate modern character-level neural language models in the problem of entropy rate estimation. We first study the effects of the model parameters, namely, the context length, training data size, and parameter size. Consistent with [19,25,26], we observe that the effects of the context length and training data size on the cross entropy consistently obey a power-law decay with a positive constant for two different neural language models with two different language datasets. On the basis of this observation, we extrapolate the parameters to infinity to obtain an estimate of the minimal cross entropy, which is a new bound of the entropy rate. We finally obtain an estimated entropy rate value of 1.12 bpc for English characters. This result suggests that the entropy rate of natural language is smaller than the previously reported values [15,16,17,19].

2. Entropy Rate Estimation of Natural Language

In this section, we introduce the terminology and concepts used in estimating the entropy rate of natural language with a language model.

Definition 1 (Shannon entropy).

Let $X$ be a stationary and ergodic stochastic process $\{X_t\}_{t=1}^{\infty}$, where each element takes values in a finite alphabet $\mathcal{X}$, and let $P(x_1^m)$ denote the joint probability distribution $P(X_1 = x_1, \ldots, X_m = x_m)$. The Shannon entropy $H(X_1^m)$ of the stochastic process is defined as

$H(X_1^m) = -\sum_{x_1^m \in \mathcal{X}^m} P(x_1^m) \log P(x_1^m)$. (1)

Definition 2 (Entropy rate).

The entropy rate h of a stochastic process X is defined as

$h = \lim_{m \to \infty} \frac{1}{m} H(X_1^m)$. (2)

The entropy rate $h$ is the average amount of information per element of an infinite-length sequence.

The Shannon–McMillan–Breiman theorem states that almost surely

$h = \lim_{m \to \infty} -\frac{1}{m} \log P(X_1^m)$. (3)

The entropy rate $h$ can therefore be obtained as the average negative log-likelihood of a single sample $X_1^m$ as its length $m$ tends to infinity.

Definition 3 (Cross entropy and Kullback–Leibler divergence).

The cross entropy $L(P, Q)$ and the Kullback–Leibler divergence $\mathrm{KL}(P \| Q)$ between the probability distribution $P$ of the stochastic process $X$ and the probability distribution $Q$ of a model are defined as follows.

$L(P, Q) = -\sum_{x_1^m \in \mathcal{X}^m} P(x_1^m) \log Q(x_1^m) = H(X_1^m) + \mathrm{KL}(P \| Q)$ (4)
$\mathrm{KL}(P \| Q) = \sum_{x_1^m \in \mathcal{X}^m} P(x_1^m) \left( \log P(x_1^m) - \log Q(x_1^m) \right)$ (5)

One important property of the Kullback–Leibler divergence is Gibbs’ inequality, which states that

$\mathrm{KL}(P \| Q) \geq 0$, with $\mathrm{KL}(P \| Q) = 0$ if and only if $P = Q$.

From this property, the cross entropy is always greater than or equal to the entropy of the stochastic process $X$, with equality if and only if the two probability distributions $P$ and $Q$ are identical. Therefore, the cross entropy $L(P, Q)$ is an upper bound of the entropy of $X$. Similarly to Equation (3), we obtain Equation (6).

$L(P, Q) = \lim_{m \to \infty} -\frac{1}{m} \log Q(x_1^m)$. (6)

We therefore obtain an upper bound of the entropy rate from the probability distribution of the model, Q.
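To make Gibbs' inequality concrete, the following Python sketch computes the Shannon entropy, the cross entropy, and their gap for a toy three-symbol distribution. The distributions and function names are our own illustration and are not taken from the experiments in this paper.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy_bits(p, q):
    """Cross entropy L(P, Q) = -sum_x P(x) log2 Q(x), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

# Toy symbol distribution P and two candidate model distributions Q.
p       = [0.5, 0.3, 0.2]
q_exact = [0.5, 0.3, 0.2]   # Q = P
q_off   = [0.4, 0.4, 0.2]   # Q != P

print(entropy_bits(p))                  # H(P) ~ 1.485 bits
print(cross_entropy_bits(p, q_exact))   # equals H(P), since Q = P
print(cross_entropy_bits(p, q_off))     # exceeds H(P) by KL(P || Q) > 0
```

In both cases the cross entropy is at least the entropy, so any model distribution $Q$ yields an upper bound of the entropy.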

Definition 4 (Language model).

A language model with context length n outputs a conditional probability distribution Q:

$Q(x_{i+1} \mid x_{i-n+1}^{i})$ (7)

A language model with context length $n$ computes the probability $Q(x_1^m)$ appearing in the cross entropy as

$Q(x_1^m) = Q(x_1, x_2, \ldots, x_{n-1}) \prod_{i=n}^{m} Q(x_i \mid x_{i-n+1}, x_{i-n+2}, \ldots, x_{i-1})$. (8)

We still cannot compute $L(P, Q)$ exactly, as the definition requires an infinitely long test dataset with $m \to \infty$. There are three parameters, the training data size $k$, the context length $n$, and the test data size $m$, that would all have to be infinite, which is not practically computable. If a language model is trained with a longer context length and a larger training dataset, however, it better approximates the stochastic process of natural language, and it would perform best when trained with an infinite context length and an infinitely large training dataset. In Section 4 and Section 6, we discuss the effects of these parameters on estimating the upper bound of the entropy rate $h$.
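To illustrate how Equations (6)–(8) are evaluated on a finite test string, the following Python sketch computes the per-character cross entropy (in bits per character) of a language model with context length $n$. The model_prob interface and the uniform toy model are our own placeholders, not the models used in this paper.

```python
import math

def cross_entropy_bpc(model_prob, text, n):
    """
    Per-character cross entropy (bits per character) of a language model
    on a finite test string, approximating Equation (6).

    model_prob(next_char, context) must return Q(next_char | context),
    where context is the string of (at most) n preceding characters.
    This interface is a placeholder, not the API of any particular model.
    """
    total_log_prob = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - n):i]   # at most n preceding characters
        q = model_prob(ch, context)       # conditional probability Q(x_i | context)
        total_log_prob += math.log2(q)
    return -total_log_prob / len(text)

# A (hypothetical) uniform model over 139 unique characters, roughly the
# OB character inventory: every prediction costs log2(139) bits.
uniform = lambda ch, ctx: 1.0 / 139
print(cross_entropy_bpc(uniform, "example test text", n=50))   # ~7.12 bpc
```

A trained model assigns higher probability to the correct next character than the uniform model does, which is exactly what lowers the measured bits per character.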

3. Model

In this section, we introduce two neural language models and an n-gram language model, which are used in the experiment to estimate the entropy rate.

3.1. Neural Language Model

Neural language models are parameterized by neural networks. These models receive a string of characters of length $n$ and output a probability distribution over the next character. They are optimized by minimizing the cross entropy. In this paper, we use two state-of-the-art neural language models: the Recurrent Highway Network (RHN) [5] for English characters and the Averaged Stochastic Gradient Descent Weight-Dropped LSTM with Mixture of Softmaxes (AWD-LSTM-MoS) [7,8] for Chinese characters. We chose these models because they are the best-performing neural language models. The parameter sizes of the models are 46 million for RHN and 35 million for AWD-LSTM-MoS. In Section 5, we test the effect of the parameter size and confirm that a scale of tens of millions of parameters is sufficiently large for these models to achieve good performance. Training one of these models to convergence on the datasets that we use requires more than a few months. These neural language models use embedding representations and enriched architectures of recurrent neural networks (RNNs) and require proper training strategies to achieve state-of-the-art performance. We provide detailed explanations of the components and experimental setups of the neural language models in Appendix A.

3.2. n-Gram Language Model

To highlight the performance of the neural language models, we compare their results with those of n-gram language models, which are $(n-1)$-order Markov models. An n-gram model is often referred to as a count-based approach: the probability distribution is determined by counts of occurrences in a training dataset.

In the experiment, we used a smoothing technique to enable longer n-grams and deal with data sparsity. Here, data sparsity refers to the fact that a large number of long n-grams in the test dataset never appeared in the training dataset. The smoothing technique calculates the weighted average of the probabilities of n-gram language models with different context lengths. The resulting model is a variant of the Katz backoff model [27]:

$P_n(x_{i+1} \mid x_{i-n+1}^{i}) = \begin{cases} l_n \dfrac{c(x_{i+1} \mid x_{i-n+1}^{i})}{c(x_{i-n+1}^{i})} & \text{if } c(x_{i+1} \mid x_{i-n+1}^{i}) > 0 \\ P_{n-1}(x_{i+1} \mid x_{i-n+2}^{i}) & \text{if } c(x_{i+1} \mid x_{i-n+1}^{i}) = 0 \end{cases}$

Here, the context $x_{i-n+1}^{i}$ denotes the elements of the sequence $x$ between the $(i-n+1)$-th and $i$-th positions, $c(x \mid *)$ is the count of $x$ under condition $*$, and $l_n$ is the weight term of the $n$-gram model, with the weights satisfying the normalization condition $\sum_{i=1}^{n} l_i = 1$. Each weight is obtained by counting the number of times that $c(x_{i+1} \mid x_{i-n+1}^{i}) > 0$ is satisfied in a validation dataset separated from both the training and test datasets. The weight terms $l_i$ allow the n-gram language model to balance the use of short and long contexts to best predict the next character. We do not use other smoothing techniques such as Kneser–Ney smoothing [28], because they are specialized for word-level modeling. A simplified sketch of this backoff scheme is given below.
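The following Python sketch implements a simplified count-based backoff model of this kind. The class name, the fixed example weights, and the toy training string are our own illustration (the paper estimates the weights on a separate validation set), not the implementation used in the experiments.

```python
from collections import defaultdict

class BackoffCharNgram:
    """Minimal sketch of a count-based backoff character model."""

    def __init__(self, n, weights):
        self.n = n                               # longest context length
        self.weights = weights                   # weights[k] plays the role of l_k
        self.cont_counts = defaultdict(int)      # c(next_char | context)
        self.ctx_counts = defaultdict(int)       # c(context)
        self.alphabet = set()

    def train(self, text):
        self.alphabet.update(text)
        for i in range(1, len(text)):
            for k in range(1, self.n + 1):       # contexts of length 1..n
                if i - k < 0:
                    break
                ctx = text[i - k:i]
                self.cont_counts[(ctx, text[i])] += 1
                self.ctx_counts[ctx] += 1

    def prob(self, next_char, context, k=None):
        """P_k(next_char | context), backing off to shorter contexts on zero counts."""
        if k is None:
            k = self.n
        if k == 0:
            return 1.0 / max(len(self.alphabet), 1)   # uniform base case
        ctx = context[-k:]
        c_full = self.cont_counts[(ctx, next_char)]
        if c_full > 0:
            return self.weights[k] * c_full / self.ctx_counts[ctx]
        return self.prob(next_char, context, k - 1)   # back off to a shorter context

model = BackoffCharNgram(n=3, weights={1: 0.2, 2: 0.3, 3: 0.5})
model.train("abracadabra abracadabra")
print(model.prob("b", "bra a"))   # uses the longest context with a nonzero count
```

Note that this toy version does not renormalize after backing off; a production implementation would, but the counting-and-backoff logic is the part relevant here.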

3.3. Dataset

We used two substantially large datasets: the One Billion (OB) dataset [29] and the Central News Agency (CNA) corpus. The OB dataset is a collection of crawled news text in English. The CNA dataset is a similar collection of news text written in Chinese. The statistics of the datasets are listed in Table 1. To the best of our knowledge, these datasets are among the largest available.

Table 1.

Statistics of the datasets used in the experiment.

Dataset | Character Tokens | Word Tokens | Vocabulary | Unique Characters | Test Character Tokens
One Billion (English) | 4,091,497,941 | 768,648,885 | 81,198 | 139 | 37,811,222
Central News Agency (Chinese) | 743,444,075 | 415,721,153 | 1,694,159 | 9171 | 493,376

4. Estimating Entropy Rate Through Extrapolation

We are interested in the entropy rate of natural language, but we can only practically obtain an upper bound, the cross entropy L(P,Q), as mentioned in Section 2. Empirical studies on neural language models [25,26] have investigated the effects of experimental parameters on the cross entropy of models. Hestness et al. [25] investigated how the training data size affects the cross entropy of a neural language model. Likewise, Khandelwal et al. [26] investigated the effect of context length on the cross entropy of a model. These works demonstrate that the cross entropy of neural language models monotonically decreases as the context length and training data size increase.

The functional form of these effects has been studied empirically. In the context of entropy rate estimation, Hilberg [20] originally introduced a power-law function $f_0$ to fit Shannon's figure in [15], which plots the entropy obtained from a cognitive experiment against the context length $n$. Hilberg claimed that $f_0$ with $\beta \approx 0.5$ fit the plot well and hypothesized that the entropy rate of English characters is equal to zero. He defined $f_0$ as

$f_0(x; A, \beta) = A x^{\beta - 1}$ (9)

The power-law function f0 is readily generalized to f1:

$f_1(x; A, \beta, h) = A x^{\beta - 1} + h$. (10)

The modified Hilberg function f1 characterizes the cross entropy of a language model as a power-law decay with a positive constant. If h=0, then it is equivalent to f0.

Hestness et al. [25] conducted empirical studies on the relationship between the training data size and a model's performance in terms of three regions, called the “small-data region,” “power-law region,” and “irreducible-error region.” In the small-data region, neural network models only behave as random predictors. In the power-law region, the generalization error of a model decreases according to a power law. In the irreducible-error region, a model cannot reduce its error any further with a larger data size. Except for the small-data region, this characterization is quite similar to the modified Hilberg function $f_1$, because the terms $A x^{\beta - 1}$ and $h$ stand for the power-law decay and the irreducible error, respectively. These empirical observations could potentially be explained by statistical learning theory [30], although the settings of existing works [31] are largely different from language modeling.

With respect to these previous findings, in the rest of this paper, we first validate whether the effects of the context length and training data size on language models indeed follow the modified Hilberg function $f_1$ and the observations of Hestness et al. [25]. Then, we extrapolate the experimental results to estimate the cross entropies of language models at an infinite context length and an infinite training data size. Possible functional forms were discussed in [19], which found $f_0$ and $f_1$ to be the most appropriate. This paper therefore focuses on $f_0$ and $f_1$ and excludes the other fitting functions proposed in [18,19]. The parameters of $f_0$ and $f_1$ are obtained by minimizing the root-mean-square error $\epsilon$:

$\epsilon = \sqrt{\frac{1}{s} \sum_{i=1}^{s} \left( f(x_i) - y_i \right)^2}$ (11)

where $x_i$ is the $i$-th sampled value of the training data size or context length, $f(x_i)$ is the cross entropy predicted by the function $f$ at $x_i$, and $y_i$ is the cross entropy measured for a model under the condition $x_i$.
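As a sketch of this fitting procedure, the following Python code fits $f_1$ by non-linear least squares with scipy.optimize.curve_fit and reports the error of Equation (11). The data points are synthetic values generated near the fitted RHN parameters in Table 3, not the measured cross entropies.

```python
import numpy as np
from scipy.optimize import curve_fit

def f1(x, A, beta, h):
    """Modified Hilberg function f1(x) = A * x**(beta - 1) + h (Equation (10))."""
    return A * x ** (beta - 1) + h

# Synthetic (training data size, cross entropy) samples after the drop point;
# the real measurements are the ones plotted in Figure 3.
x = np.array([2.0 ** k for k in range(16, 30)])
y = f1(x, 350.0, 0.57, 1.14) + np.random.normal(0.0, 0.01, size=x.size)

params, _ = curve_fit(f1, x, y, p0=(100.0, 0.5, 1.0), maxfev=10000)
A_hat, beta_hat, h_hat = params
rmse = np.sqrt(np.mean((f1(x, *params) - y) ** 2))   # error epsilon of Equation (11)
print(f"A={A_hat:.3f}, beta={beta_hat:.3f}, h={h_hat:.3f}, eps={rmse:.4f}")
# h_hat is the extrapolated cross entropy at an infinite training data size,
# i.e., the estimated upper bound of the entropy rate.
```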

5. Effect of Parameters

5.1. Effect of Context Length n

We first investigated the effect of the context length on the cross entropy of the language models. We trained the models with different context lengths and sampled the cross entropy for each model. Figure 1 shows the relationship between the context length and the cross entropy for RHN on the OB dataset and AWD-LSTM-MoS on the CNA dataset. For RHN on the OB dataset, we trained with different context lengths (2, 4, 6, 8, 10, 20, 30, 40, 50) and 16,777,216 characters of the OB dataset. For AWD-LSTM-MoS, we conducted the same experiment with different context lengths (5, 10, ..., 85) and 8,388,608 characters of the CNA dataset. We used shorter context lengths for RHN because it could not run on a single GPU with context lengths over 50. Both neural language models improved their cross entropy by using context lengths on the order of several tens of characters.

Figure 1. Effect of context length on the cross entropy of (a) RHN trained on 16,777,216 characters of the OB dataset and (b) AWD-LSTM-MoS trained on 8,388,608 characters of the CNA dataset. Both neural language models improved their prediction accuracy with longer context and were well characterized by the modified Hilberg function $f_1$.

In contrast, the n-gram language models did not benefit from longer contexts to improve their prediction accuracy. Figure 2 shows a scatter plot of the context length and the cross entropy for different training data sizes. With larger training data, the best-performing value of $n$ increased: with 65,536 characters, $n = 3$ performed best, whereas with 33,554,432 characters, $n = 6$ performed best. Even with the largest dataset, however, the n-gram models did not benefit from contexts longer than about $n = 10$ to improve their cross entropy. This difference characterizes the advantage of neural language models over n-gram language models.

Figure 2. Effect of context length on the cross entropy of n-gram (1–15-gram) models with different training data sizes of 65,536, 524,288, 4,194,304, and 33,554,432 characters. A larger training data size allowed the n-gram language models to improve their prediction accuracy with respect to the context length. Even with such large training datasets, however, the best-performing n-gram model had $n \leq 10$.

5.2. Effect of Training Data Size k

We next conducted an experiment to examine the relationship between the training data size and the cross entropy of the neural and n-gram language models. Figure 3 shows this relationship for the various models with different training data sizes. For the neural language models, we recorded the cross entropy at different dataset sizes $(2^{10}, 2^{11}, \ldots)$. The context lengths were fixed to sufficiently large values of 50 for RHN on the OB dataset and 70 for AWD-LSTM-MoS on the CNA dataset. For the n-gram language models, we obtained the cross entropy for $n = 1, \ldots, 8$ at the different training data sizes and selected the lowest cross entropy for each size.

Figure 3. Cross entropy of (a) RHN on the OB dataset, (b) AWD-LSTM-MoS on the CNA dataset, and (c) n-gram language models on the OB dataset with different training data sizes. The fitting function $f_1$ was applied to the data in the regions of k ≥ 65,536 for RHN, k ≥ 16,384 for AWD-LSTM-MoS, and k ≥ 8192 for the n-gram language models. The effect of the training data size was well characterized by $f_1$, as with the context length in Figure 1.

For the results shown in Figure 3, because each model exhibited a small-data region in which the cross entropy did not decrease or decayed only slowly, we defined drop points at which the small-data region ended and the power-law region began. These values were determined heuristically. We then computed the parameters of $f_1$ from the data points sampled after the drop points.

Our results confirm that the modified Hilberg function well characterizes the effect of the training data size on the cross entropy. RHN only made random predictions up to a training data size of about $10^5$ characters. The n-gram language models also had a small-data region, up to about $10^4$ characters. Beyond these sizes, the models improved their cross entropies monotonically with increasing training data size. This relationship was consistent across the different models and languages, as seen in Figure 3.

One notable difference was the behavior in the small-data region. While RHN on the OB dataset showed a plateau, AWD-LSTM-MoS on the CNA dataset had a slower but monotonic decay even in this region. One reason for this difference is the gap between the numbers of unique characters in English (139) and Chinese (9171). In the case of Chinese, if the training data size is very small, the model is trained on only a small portion of the unique characters. As the size increases, the model processes a larger fraction of the unique characters, which leads to improved prediction accuracy. Therefore, there was a slight slope in the small-data region for AWD-LSTM-MoS on the CNA dataset.

5.3. Effect of Test Data Size m

By definition, the cross entropy $L(P, Q)$ requires an infinitely long text, but we had only texts of finite length with which to obtain cross entropy values. In general, it is difficult to construct confidence bounds for values obtained from neural networks, so it is uncertain whether the obtained values are reliable estimates. Therefore, we investigated the effect of the test data size $m$ on the values obtained.

To cover different datasets and models, we examined the effect of the test data size on a 6-gram language model on the OB dataset and AWD-LSTM-MoS on the CNA dataset. The models were trained with 33,554,432 characters of the corresponding datasets. Figure 4 shows the values sampled at different test data sizes $(2^{10}, 2^{11}, \ldots)$. The orange lines in the figures represent the values obtained with the largest test data sizes. When the test data size was too small, up to about $m = 10^5$, the obtained values fluctuated for both the 6-gram language model and AWD-LSTM-MoS, making the results unreliable. The values then seemed to converge when the test data size was sufficiently large, over $10^5$ characters. Although this empirical analysis does not guarantee convergence, in practice, we can fairly compute the cross entropy $L(P, Q)$ with $m > 10^5$ characters of text.

Figure 4. Effect of test data size on the cross entropy values. The models were trained with 33,554,432 characters of the corresponding datasets and evaluated with different test data sizes. The orange lines indicate the cross entropies obtained with the test datasets of maximum length. For both models (a,b), the values seemed to converge when the test data size was sufficiently large ($m > 10^5$).

5.4. Effect of Parameter Size

The parameter size of neural language models (and of deep learning models in general) has a strong effect on their performance. We empirically analyzed this effect for AWD-LSTM-MoS with 4,194,304 characters of the CNA dataset. We varied the number of parameters by changing the numbers of dimensions of the LSTM layers and trained the models until convergence. Figure 5 shows a scatter plot of the number of parameters and the cross entropy. Similarly to the observations for the context length and training data size, the performance improved monotonically as the number of parameters increased toward approximately $10^8$, but the values seemed to converge around that parameter scale. Therefore, we used a model with a fixed number of parameters for the experiment. Note that we excluded RHN here because RHN with a larger parameter size could not run on a single GPU.

Figure 5. Effect of parameter size on the cross entropy for AWD-LSTM-MoS with the CNA dataset. The parameter size of AWD-LSTM-MoS was controlled by changing the numbers of dimensions of the LSTM layers. Even the minimum parameter size exceeded $10^6$, because the large vocabulary of Chinese makes the softmax layer of AWD-LSTM-MoS large. The cross entropy reached a plateau at a scale of $10^7$ parameters.

6. Entropy Rate Estimated with Neural Language Models

From the above observations, we considered three approaches for estimating the entropy rate with the neural language models. The first approach simply uses the best, smallest cross entropy recorded by a model. This is the standard method for measuring the performance of language models, because the cross entropy is the upper bound of the entropy rate. Table 2 summarizes the smallest cross entropies recorded by the three models, as well as by the PPM-d algorithm [19]. We obtained 1.21 bpc for English and 4.43 bpc for Chinese from the respective RHN and AWD-LSTM-MoS models. These values were significantly smaller than those for the n-gram language models and the PPM-d algorithm, especially for the CNA dataset in Chinese.

Table 2.

Smallest cross entropy values obtained by the neural language models, n-gram language models, and PPM-d algorithm. The values were obtained from the models trained with the largest training data size and longest context length in the experiment. The neural language models consistently outperformed the n-gram language models and the PPM-d algorithm.

Dataset | Model | Smallest Cross Entropy
OB | RHN | 1.210
OB | n-gram model | 1.451
OB | PPM-d | 1.442
CNA | AWD-LSTM-MoS | 4.429
CNA | n-gram model | 4.678
CNA | PPM-d | 4.753

Note: The smallest cross entropies for the two datasets are in bold.

The second approach uses extrapolation over the training data size with a fixed context length and the function $f_1$. We prioritized the training data size over the context length because increasing the context length beyond n = 50 for RHN and n = 70 for AWD-LSTM-MoS had only a small impact on the cross entropy, as seen in Figure 1. As summarized in Table 3, we obtained 1.14 bpc for the OB dataset (English) and 3.96 bpc for the CNA dataset (Chinese), which are smaller cross entropy values than those of the first approach.

Table 3.

Estimated parameters and error of the fitting function f1 for the language models (including the PPM-d algorithm [19]) with different training data sizes. This fitting corresponds to the second approach described in Section 6. The drop point indicates the minimum training data size for fitting with the modified Hilberg function f1.

Dataset | Model | A | β | h | ϵ | Drop Point
OB | RHN | 358.997 | 0.570 | 1.144 | 0.008 | $2^{16}$
OB | n-gram model | 8.393 | 0.840 | 1.210 | 0.004 | $2^{13}$
OB | PPM-d | 12.551 | 0.791 | 1.302 | 0.015 | -
CNA | AWD-LSTM-MoS | 82.006 | 0.723 | 3.960 | 0.025 | $2^{14}$
CNA | n-gram model | 388.338 | 0.647 | 4.409 | 0.018 | $2^{22}$
CNA | PPM-d | 35.602 | 0.766 | 4.295 | 0.078 | -

Note: The smallest cross entropies for the OB and CNA datasets are in bold.

The third approach extends the modified Hilberg function f1 to a bivariate function g to take both the context length and dataset size into account:

$g(x_1, x_2; A_1, A_2, \beta_1, \beta_2, h) = A_1 x_1^{\beta_1 - 1} + A_2 x_2^{\beta_2 - 1} + h$ (12)

where the variable $x_1$ is the training data size and $x_2$ is the context length. This extension from $f_1$ to $g$ allows us to estimate the entropy rate $h$ from the power-law decay of the cross entropy with respect to both the context length and the dataset size. It is a natural extension of the second approach, as we found that $f_1$ well described the cross entropy of the models with respect to those two parameters. Because of the substantial increase in the number of sample points and the heavy computational cost of training the models, we focused on RHN with the OB dataset for this approach. For this bivariate regression, we sampled the cross entropy of RHN at all combinations of the context lengths (2, 4, 6, 8, 10, 20, 30, 40) and training data sizes $(2^{20}, \ldots, 2^{25})$, and additionally at a context length of 50 with dataset sizes $(2^{20}, \ldots, 2^{29})$. A sketch of the regression is given after this paragraph.
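A minimal sketch of this bivariate regression, again using scipy.optimize.curve_fit, is shown below. The cross entropy values here are synthetic numbers generated near the parameters in Table 4, since the measured values are those plotted in Figure 6, and the exponent form follows our reading of Equation (12).

```python
import numpy as np
from scipy.optimize import curve_fit

def g(X, A1, A2, beta1, beta2, h):
    """Bivariate extension g of f1 (Equation (12)); X stacks (k, n) samples."""
    k, n = X
    return A1 * k ** (beta1 - 1) + A2 * n ** (beta2 - 1) + h

# Hypothetical grid of (training data size, context length) sample points;
# the actual cross entropy values are those plotted in Figure 6.
ks = np.array([2.0 ** e for e in range(20, 26)])
ns = np.array([2, 4, 6, 8, 10, 20, 30, 40], dtype=float)
K, N = np.meshgrid(ks, ns)
X = np.vstack([K.ravel(), N.ravel()])
y = g(X, 90.0, 0.3, 0.66, 0.29, 1.12) + np.random.normal(0.0, 0.005, X.shape[1])

params, _ = curve_fit(g, X, y, p0=(10.0, 1.0, 0.5, 0.5, 1.0), maxfev=20000)
print("estimated entropy rate h =", round(params[-1], 3))   # close to 1.12
```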

Figure 6 shows a 3D scatter plot of the cross entropy of RHN with different context lengths and training data sizes on the OB dataset. The fitted parameters of $g$ are listed in Table 4. With this setting, we obtained 1.12 bpc for the OB dataset in English. This value is smaller than those obtained in previous reports and with the above two estimation approaches. Note, however, that this parameter fitting is more sensitive to missing data or changes in the sampling range than the previous two approaches.

Figure 6. Cross entropy of RHN (z-axis) with different context lengths (x-axis) and training data sizes (y-axis) on the OB dataset. The data were sampled at context lengths (2, 4, 6, 8, 10, 20, 30, 40) with training data sizes $(2^{20}, \ldots, 2^{25})$, and at a context length of 50 with dataset sizes $(2^{20}, \ldots, 2^{29})$. The data were fitted by the function $g$ with the parameters in Table 4.

Table 4.

Estimated parameters and error of fitting function g. This fitting corresponds to the third approach described in Section 6.

Dataset | Model | $A_1$ | $\beta_1$ | $A_2$ | $\beta_2$ | h | ϵ
OB | RHN | 89.609 | 0.661 | 0.324 | 0.294 | 1.121 | 0.006

Note: The estimated value of the entropy rate h with the bivariate regression is marked in bold.

7. Conclusions

We explored the estimation of the entropy rate of natural language with neural language models. We investigated the effects of four key parameters: the context length, training data size, test data size, and parameter size. With respect to entropy rate estimation, we verified empirically that the cross entropy of the neural language models decreased with a longer context length and a larger training data size, and that these decreases were well characterized by the modified Hilberg function $f_1$. These observations characterize how a machine learning model improves its performance as it receives more data for training. We then empirically explored the asymptotic behavior, finding that a test data size of $m > 10^5$ characters suffices in practice. We also tested the effect of varying the number of parameters. Through regression with the results of these investigations, we finally obtained entropy rate upper bounds of 1.12 bpc for English and 3.96 bpc for Chinese. These values are smaller than the previously reported values obtained with other computational models.

Appendix A. Neural Language Models

In this Appendix, we explain the general structures of modern neural language models and the specifications of Recurrent Highway Networks (RHNs) [5] and the AWD-LSTM-MoS model [8], which we used in the experiment. Modern neural language models rely on three components: an embedding representation, enriched architectures of recurrent neural networks (RNNs), and training strategies. In the following equations, bold capital letters such as W and U are matrices, bold lowercase letters such as h and b are one-dimensional vectors, and normal letters are real values.

Appendix A.1. Embedding Representation

In machine learning, discrete symbols are typically represented by one-hot vectors. In the case of the character language models that we used, “a” is represented as (1, 0, 0, ..., 0), “A” as (0, 1, 0, ..., 0), and so on; in this representation, “a” and “A” are treated as mutually independent, and their similarity cannot be expressed. An embedding layer projects the discrete one-hot representation into a $d_{\mathrm{emb}}$-dimensional continuous space, in which the vector of each symbol quantitatively describes its similarities to and differences from the other symbols. Embedding representation is a key to many successful applications of deep learning in natural language processing. In language modeling, the vectors of the embedding layer are initialized randomly and updated by a stochastic gradient algorithm together with the other layers of the neural language model.
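A minimal PyTorch sketch of such an embedding lookup is shown below; the toy vocabulary, dimensions, and variable names are illustrative and do not correspond to the experimental settings.

```python
import torch
import torch.nn as nn

vocab = {"a": 0, "A": 1, "b": 2}   # character -> one-hot index
d_emb = 300                        # embedding dimension

# The embedding layer is equivalent to multiplying a one-hot vector by a
# trainable matrix, projecting each character into a continuous space.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_emb)

ids = torch.tensor([vocab["a"], vocab["A"]])
vectors = embedding(ids)           # shape: (2, d_emb)

# Cosine similarity between the vectors of "a" and "A"; after training,
# characters used in similar contexts end up with similar vectors.
print(torch.cosine_similarity(vectors[0], vectors[1], dim=0).item())
```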

Appendix A.2. Enriched Architectures of Recurrent Neural Networks

An RNN is a neural network architecture suited to sequential data. It takes the element at time step $t$, $\mathbf{x}_t$ (the embedding representation of a character), and a hidden vector $\mathbf{h}_{t-1}$ as inputs to generate an output $\mathbf{h}_t$:

$\mathbf{h}_t = f(\mathbf{U} \mathbf{x}_t + \mathbf{W} \mathbf{h}_{t-1} + \mathbf{b})$ (A1)

where $f$ is a non-linear activation function, $\mathbf{U}$ and $\mathbf{W}$ are weight matrices, and $\mathbf{b}$ is a constant term. Plain RNNs, however, are difficult to train. Modern studies with RNNs have used enriched versions of the architecture, such as long short-term memory (LSTM), which have been widely applied to many tasks, including language modeling. In contrast to a plain RNN, an LSTM has a cell $\mathbf{c}_t$ and gating mechanisms. The computation process of the LSTM is written as follows.

$\mathbf{i}_t = \sigma(\mathbf{U}_i \mathbf{x}_t + \mathbf{W}_i \mathbf{h}_{t-1} + \mathbf{b}_i)$ (A2)
$\mathbf{f}_t = \sigma(\mathbf{U}_f \mathbf{x}_t + \mathbf{W}_f \mathbf{h}_{t-1} + \mathbf{b}_f)$ (A3)
$\mathbf{o}_t = \sigma(\mathbf{U}_o \mathbf{x}_t + \mathbf{W}_o \mathbf{h}_{t-1} + \mathbf{b}_o)$ (A4)
$\tilde{\mathbf{c}}_t = \tanh(\mathbf{U}_{\tilde{c}} \mathbf{x}_t + \mathbf{W}_{\tilde{c}} \mathbf{h}_{t-1} + \mathbf{b}_{\tilde{c}})$ (A5)
$\mathbf{c}_t = \mathbf{f}_t * \mathbf{c}_{t-1} + \mathbf{i}_t * \tilde{\mathbf{c}}_t$ (A6)
$\mathbf{h}_t = \tanh(\mathbf{c}_t) * \mathbf{o}_t$. (A7)

where $*$ denotes element-wise multiplication, $\sigma$ is a sigmoid function, and $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ correspond to an input gate, forget gate, and output gate, respectively. In contrast to plain RNNs, as described in Equations (A5) and (A6), an LSTM can balance the use of contextual information from the previous time step, $\mathbf{c}_{t-1}$, against the standard RNN computation, $\tilde{\mathbf{c}}_t$. This is called a gating mechanism.
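The following NumPy sketch implements one LSTM step exactly as in Equations (A2)–(A7), with randomly initialized weights purely for illustration; it is not the implementation used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following Equations (A2)-(A7)."""
    U, W, b = params["U"], params["W"], params["b"]
    i_t = sigmoid(U["i"] @ x_t + W["i"] @ h_prev + b["i"])        # input gate   (A2)
    f_t = sigmoid(U["f"] @ x_t + W["f"] @ h_prev + b["f"])        # forget gate  (A3)
    o_t = sigmoid(U["o"] @ x_t + W["o"] @ h_prev + b["o"])        # output gate  (A4)
    c_tilde = np.tanh(U["c"] @ x_t + W["c"] @ h_prev + b["c"])    # candidate    (A5)
    c_t = f_t * c_prev + i_t * c_tilde                            # cell update  (A6)
    h_t = np.tanh(c_t) * o_t                                      # hidden state (A7)
    return h_t, c_t

d_in, d_hid = 8, 16
rng = np.random.default_rng(0)
params = {
    "U": {k: rng.normal(size=(d_hid, d_in)) for k in "ifoc"},
    "W": {k: rng.normal(size=(d_hid, d_hid)) for k in "ifoc"},
    "b": {k: np.zeros(d_hid) for k in "ifoc"},
}
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.normal(size=d_in), h, c, params)   # one step on a random input
```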

Appendix A.3. Training Strategy

Neural language models require a proper training strategy to achieve state-of-the-art performance. Here, we introduce “dropout” and a “scheduled learning rate” as training techniques, because they are among the most impactful factors for improving the prediction performance of neural language models [9]. Dropout [3] is the most popular and effective regularization technique for neural networks. The idea is to randomly drop units from a network during training. Dropout prevents neural networks from overfitting and improves their generalization performance. In the case of RNNs, a suitably modified variant called variational dropout [32] is used. A scheduled learning rate is another key to training neural networks effectively. In many successful deep learning models, the learning rate is set to a large value at the early stage of training and scheduled to decay as training progresses. A small sketch of these ingredients follows.
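The sketch below uses standard PyTorch utilities; the dropout rate, learning rate, and decay schedule are illustrative values rather than the exact settings of the experiments (those are given in Appendix A.4 and Appendix A.5), and standard dropout stands in for the variational variant for brevity.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=300, hidden_size=512)   # stand-in for a language model
dropout = nn.Dropout(p=0.5)   # standard dropout; variational dropout would reuse
                              # one dropout mask across all time steps of a sequence

optimizer = torch.optim.SGD(model.parameters(), lr=0.2, momentum=0.9)
# Decay the learning rate by a constant factor every 5 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=1 / 1.03)

for epoch in range(20):
    # ... forward pass, loss computation, and loss.backward() would go here ...
    optimizer.step()      # no-op in this sketch, since no gradients were computed
    scheduler.step()      # apply the scheduled learning-rate decay
```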

Appendix A.4. Recurrent Highway Networks

We used the official RHN implementation coded in tensorflow 0.12 [33] and available at https://github.com/julian121266/RecurrentHighwayNetworks. The RHN is an enriched RNN architecture inspired by highway networks [34]. It introduces deep computation within one time step of an RNN and has L layers to compute a hidden vector at the next time step. Following the notation in the original paper, the RHN computation is defined as

$\mathbf{h}_{l,t} = \tanh(\mathbf{U}_h \mathbf{x}_t \mathbb{1}\{l = 1\} + \mathbf{W}_{h_l} \mathbf{s}_{l-1,t} + \mathbf{b}_{h_l})$ (A8)
$\mathbf{t}_{l,t} = \sigma(\mathbf{U}_t \mathbf{x}_t \mathbb{1}\{l = 1\} + \mathbf{W}_{t_l} \mathbf{s}_{l-1,t} + \mathbf{b}_{t_l})$ (A9)
$\mathbf{c}_{l,t} = \sigma(\mathbf{U}_c \mathbf{x}_t \mathbb{1}\{l = 1\} + \mathbf{W}_{c_l} \mathbf{s}_{l-1,t} + \mathbf{b}_{c_l})$ (A10)
$\mathbf{s}_{l,t} = \mathbf{h}_{l,t} * \mathbf{t}_{l,t} + \mathbf{s}_{l-1,t} * \mathbf{c}_{l,t}$ (A11)

where $\mathbb{1}\{\cdot\}$ is an indicator function, so the input $\mathbf{x}_t$ enters only the first layer. The first RHN layer takes $\mathbf{x}_t$ and $\mathbf{s}_{L,t-1}$ as inputs to produce its output. The following layers take $\mathbf{s}_{l,t}$ as input and output $\mathbf{s}_{l+1,t}$. The RHN layers finally compute $\mathbf{s}_{L,t}$, which plays the role of the hidden vector $\mathbf{h}_t$ in an LSTM. Here, $\mathbf{h}_{l,t}$ corresponds to the computation in a standard RNN, and $\mathbf{t}_{l,t}$ and $\mathbf{c}_{l,t}$ function as gating mechanisms that control the use of $\mathbf{h}_{l,t}$ and $\mathbf{s}_{l-1,t}$ in computing $\mathbf{s}_{l,t}$. Therefore, the RHN can be regarded as an extension of the LSTM.
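The following NumPy sketch computes one RHN time step following Equations (A8)–(A11); the layer count, dimensions, and random weights are placeholders for illustration, not the configuration described in this appendix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rhn_step(x_t, s_prev, layers):
    """One RHN time step following Equations (A8)-(A11)."""
    s = s_prev                                              # s_{0,t} = s_{L,t-1}
    for l, p in enumerate(layers):
        x_in = x_t if l == 0 else np.zeros_like(x_t)        # indicator: x_t enters layer 1 only
        h = np.tanh(p["Uh"] @ x_in + p["Wh"] @ s + p["bh"])     # (A8)
        t = sigmoid(p["Ut"] @ x_in + p["Wt"] @ s + p["bt"])     # (A9)
        c = sigmoid(p["Uc"] @ x_in + p["Wc"] @ s + p["bc"])     # (A10)
        s = h * t + s * c                                       # (A11)
    return s                                                # s_{L,t}, the new hidden state

d_in, d_hid, L = 8, 16, 3
rng = np.random.default_rng(1)
layers = [{
    "Uh": rng.normal(size=(d_hid, d_in)), "Wh": rng.normal(size=(d_hid, d_hid)), "bh": np.zeros(d_hid),
    "Ut": rng.normal(size=(d_hid, d_in)), "Wt": rng.normal(size=(d_hid, d_hid)), "bt": np.zeros(d_hid),
    "Uc": rng.normal(size=(d_hid, d_in)), "Wc": rng.normal(size=(d_hid, d_hid)), "bc": np.zeros(d_hid),
} for _ in range(L)]
s = rhn_step(rng.normal(size=d_in), np.zeros(d_hid), layers)
```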

In the experiment, the number of layers was set to 10. The number of RHN units was set to 1500, and the total number of parameters reached 46 million. Stochastic gradient descent with momentum [35] was used as the optimization algorithm. The initial learning rate was set to 0.2 and was decayed by a factor of 1.03 every five epochs. A weight decay of $10^{-7}$ was applied to regularize the parameters. Variational dropout was also applied, as well as another technique called gradient clipping, in which the norm of a gradient was rescaled to 10 whenever it exceeded 10.

Appendix A.5. AWD-LSTM-MoS

The AWD-LSTM-MoS model is an extension of AWD-LSTM, implemented in PyTorch 0.4 [36]. The official implementation is available at https://github.com/zihangdai/mos. In contrast to the RHN, AWD-LSTM keeps the model ingredients simple (embedding layer + LSTM layers + softmax layer) and focuses on the training strategy to improve performance. AWD-LSTM-MoS modifies the softmax layer. The standard softmax layer receives $\mathbf{h}_t$ and outputs the probability of the next character as

$P(y = w \mid \mathbf{h}_t) = \dfrac{\exp(\mathbf{h}_t^{\top} \mathbf{w}_w)}{\sum_{w'} \exp(\mathbf{h}_t^{\top} \mathbf{w}_{w'})}$. (A12)

The numerator represents the likelihood that $y = w$, and the denominator is a normalization constant that ensures $\sum_{w} P(y = w \mid \mathbf{h}_t) = 1$. A softmax layer is the standard choice for the output layer of a classification model. Yang et al. [8] proposed the mixture of softmaxes (MoS). As the name suggests, an MoS layer computes the probability of the next character from a mixture of $K$ different softmax computations as follows.

$\mathbf{h}_{k,t} = \tanh(\mathbf{W}_k \mathbf{h}_t)$ (A13)
$\pi_k = \dfrac{\exp(\mathbf{h}_t^{\top} \mathbf{w}_{\pi,k})}{\sum_{k'=1}^{K} \exp(\mathbf{h}_t^{\top} \mathbf{w}_{\pi,k'})}$ (A14)
$P(y = w \mid \mathbf{h}_t) = \sum_{k=1}^{K} \pi_k \dfrac{\exp(\mathbf{h}_{k,t}^{\top} \mathbf{w}_w)}{\sum_{w'} \exp(\mathbf{h}_{k,t}^{\top} \mathbf{w}_{w'})}$ (A15)

The authors argued that MoS improves the expressiveness of the softmax layer, and their experimental results demonstrated that it leads to improved performance in language modeling. A small sketch of the MoS computation is given below.
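The following NumPy sketch computes an MoS output distribution following Equations (A13)–(A15); the dimensions and random weights are placeholders for illustration, not the settings of the actual model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mos_probs(h_t, Ws, W_pi, W_out):
    """
    Mixture-of-softmaxes output (Equations (A13)-(A15)).  Ws[k] maps h_t to the
    k-th mixture context h_{k,t}, W_pi produces the mixture weights pi_k, and
    the rows of W_out are the output character vectors w_w.
    """
    pis = softmax(W_pi @ h_t)                       # mixture weights      (A14)
    probs = np.zeros(W_out.shape[0])
    for k, W_k in enumerate(Ws):
        h_k = np.tanh(W_k @ h_t)                    # mixture context      (A13)
        probs += pis[k] * softmax(W_out @ h_k)      # weighted softmaxes   (A15)
    return probs                                    # sums to 1

d_hid, vocab_size, K = 16, 10, 3
rng = np.random.default_rng(2)
p = mos_probs(rng.normal(size=d_hid),
              [rng.normal(size=(d_hid, d_hid)) for _ in range(K)],
              rng.normal(size=(K, d_hid)),
              rng.normal(size=(vocab_size, d_hid)))
print(p.sum())   # 1.0: a valid distribution over the vocabulary
```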

In the experiment, $K = 15$ was chosen. For the remaining hyperparameters, we set the dimension of the embedding layer $d_{\mathrm{emb}}$ to 300, and the three LSTM layers had 1150, 1150, and 650 dimensions. The total number of parameters reached 35 million. Gradient clipping was applied to gradients with norms larger than 0.25. Averaged stochastic gradient descent (ASGD) [37] with an initial learning rate of 15.0 was used to update the weights, with a batch size of 15. Variational dropout was also applied to the layers of this model.

Author Contributions

Conceptualization, K.T.-I. and S.T.; Methodology, S.T. and K.T.-I.; Software, S.T.; Validation, S.T. and K.T.-I.; Formal Analysis, S.T. and K.T.-I.; Investigation, S.T. and K.T.-I.; Resources, S.T. and K.T.-I.; Data Curation, S.T. and K.T.-I.; Writing—Original Draft Preparation, S.T.; Writing—Review & Editing, S.T. and K.T.-I.; Visualization, S.T.; Supervision, K.T.-I.; Project Administration, K.T.-I.; Funding Acquisition, K.T.-I.

Funding

This research was funded by JST Presto Grant Number JPMJPR14E5 and HITE funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1.Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]
  • 2.Pascanu R., Mikolov T., Bengio Y. On the Difficulty of Training Recurrent Neural Networks; Proceedings of the 30th International Conference on Machine Learning; Atlanta, GA, USA. 16–21 June 2013; pp. III1310–III1318. [Google Scholar]
  • 3.Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014;15:1929–1958. [Google Scholar]
  • 4.Mikolov T., Karafiát M., Burget L., Cernocký J., Khudanpur S. Recurrent neural network based language model; Proceedings of the 11th Annual Conference of the International Speech Communication Association; Chiba, Japan. 26–30 September 2010; pp. 1045–1048. [Google Scholar]
  • 5.Zilly J.G., Srivastava R.K., Koutník J., Schmidhuber J. Recurrent highway networks. arXiv. 2016. 1607.03474
  • 6.Melis G., Dyer C., Blunsom P. On the state of the art of evaluation in neural language models; Proceedings of the 6th International Conference on Learning Representations; Vancouver, BC, Canada. 30 April–3 May 2018. [Google Scholar]
  • 7.Merity S., Keskar N.S., Socher R. Regularizing and optimizing LSTM language models; Proceedings of the 6th International Conference on Learning Representations; Vancouver, BC, Canada. 30 April–3 May 2018. [Google Scholar]
  • 8.Yang Z., Dai Z., Salakhutdinov R., Cohen W.W. Breaking the softmax bottleneck: A high-rank RNN language model; Proceedings of the 6th International Conference on Learning Representations; Vancouver, BC, Canada. 30 April–3 May 2018. [Google Scholar]
  • 9.Merity S., Keskar N., Socher R. An Analysis of Neural Language Modeling at Multiple Scales. arXiv. 2018. 1803.08240
  • 10.Han Y., Jiao J., Lee C.Z., Weissman T., Wu Y., Yu T. Entropy Rate Estimation for Markov Chains with Large State Space. arXiv. 2018. 1802.07889
  • 11.Manning C.D. Computational linguistics and deep learning. Comput. Linguist. 2015;41:701–707. doi: 10.1162/COLI_a_00239. [DOI] [Google Scholar]
  • 12.Devlin J., Chang M.W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv. 2018. 1810.04805
  • 13.Darmon D. Specific Differential Entropy Rate Estimation for Continuous-Valued Time Series. Entropy. 2016;18:190. doi: 10.3390/e18050190. [DOI] [Google Scholar]
  • 14.Bentz C., Alikaniotis D., Cysouw M., Ferrer-i Cancho R. The Entropy of Words—Learnability and Expressivity across More than 1000 Languages. Entropy. 2017;19:275. doi: 10.3390/e19060275. [DOI] [Google Scholar]
  • 15.Shannon C. Prediction and entropy of printed English. Bell Syst. Tech. J. 1951;30:50–64. doi: 10.1002/j.1538-7305.1951.tb01366.x. [DOI] [Google Scholar]
  • 16.Cover T.M., King R.C. A Convergent Gambling Estimate of the Entropy of English. IEEE Trans. Inf. Theory. 1978;24:413–421. doi: 10.1109/TIT.1978.1055912. [DOI] [Google Scholar]
  • 17.Brown P.F., Pietra S.A.D., Pietra V.J.D., Lai J.C., Mercer R.L. An Estimate of an Upper Bound for the Entropy of English. Comput. Linguist. 1992;18:31–40. [Google Scholar]
  • 18.Schürmann T., Grassberger P. Entropy estimation of symbol sequences. Chaos. 1996;6:414–427. doi: 10.1063/1.166191. [DOI] [PubMed] [Google Scholar]
  • 19.Takahira R., Tanaka-Ishii K., Dębowski Ł. Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora. Entropy. 2016;18:364. doi: 10.3390/e18100364. [DOI] [Google Scholar]
  • 20.Hilberg W. Der bekannte Grenzwert der redundanzfreien Information in Texten—eine Fehlinterpretation der Shannonschen Experimente? Frequenz. 1990;44:243–248. doi: 10.1515/FREQ.1990.44.9-10.243. [DOI] [Google Scholar]
  • 21.Dębowski Ł. Maximal Repetitions in Written Texts: Finite Energy Hypothesis vs. Strong Hilberg Conjecture. Entropy. 2015;17:5903–5919. [Google Scholar]
  • 22.Dębowski Ł. A general definition of conditional information and its application to ergodic decomposition. Stat. Probabil. Lett. 2009;79:1260–1268. [Google Scholar]
  • 23.Ornstein D., Weiss B. Entropy and data compression schemes. IEEE Trans. Inf. Theory. 1993;39:78–83. doi: 10.1109/18.179344. [DOI] [Google Scholar]
  • 24.Ebeling W., Nicolis G. Entropy of Symbolic Sequences: The Role of Correlations. Europhys Lett. 1991;14:191–196. doi: 10.1209/0295-5075/14/3/001. [DOI] [Google Scholar]
  • 25.Hestness J., Narang S., Ardalani N., Diamos G., Jun H., Kianinejad H., Patwary M., Yang Y., Zhou Y. Deep Learning Scaling is Predictable, Empirically. arXiv. 2017. 1712.00409
  • 26.Khandelwal U., He H., Qi P., Jurafsky D. Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context; Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics; Melbourne, Australia. 15–20 July 2018; pp. 284–294. [Google Scholar]
  • 27.Katz S. Estimation of probabilities from sparse data for the language model component of a speech recognizer. Trans. Acoust. Speech Signal Process. 1987;35:400–401. doi: 10.1109/TASSP.1987.1165125. [DOI] [Google Scholar]
  • 28.Kneser R., Ney H. Improved backing-off for m-gram language modeling; Proceedings of the International Conference on Acoustics, Speech, and Signal Processing; Detroit, MI, USA. 9–12 May 1995. [Google Scholar]
  • 29.Chelba C., Mikolov T., Schuster M., Ge Q., Brants T., Koehn P., Robinson T. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv. 2013. 1312.3005
  • 30.Vapnik V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999;10:988–999. doi: 10.1109/72.788640. [DOI] [PubMed] [Google Scholar]
  • 31.Amari S.-i., Fujita N., Shinomoto S. Four Types of Learning Curves. Neural Comput. 1992;4:605–618. doi: 10.1162/neco.1992.4.4.605. [DOI] [Google Scholar]
  • 32.Gal Y., Ghahramani Z. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks; Proceedings of the 30th International Conference on Neural Information Processing Systems; Barcelona, Spain. 5–10 December 2016; pp. 1027–1035. [Google Scholar]
  • 33.Abadi M., Agarwal A., Barham P., Brevdo E., Chen Z., Citro C., Corrado G.S., Davis A., Dean J., Devin M., et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. [(accessed on 24 October 2018)]; Available online: http://download.tensorflow.org/paper/whitepaper2015.pdf.
  • 34.Srivastava R.K., Greff K., Schmidhuber J. Training Very Deep Networks; Proceedings of the 28th International Conference on Neural Information Processing Systems; Montreal, QC, Canada. 7–12 December 2015; pp. 2377–2385. [Google Scholar]
  • 35.Sutskever I., Martens J., Dahl G.E., Hinton G.E. On the importance of initialization and momentum in deep learning; Proceedings of the 30th International Conference on Machine Learning; Atlanta, GA, USA. 16–21 June 2013; pp. III1139–III1147. [Google Scholar]
  • 36.Paszke A., Gross S., Chintala S., Chanan G., Yang E., DeVito Z., Lin Z., Desmaison A., Antiga L., Lerer A. Automatic differentiation in PyTorch; Proceedings of the Neural Information Processing Systems Autodiff Workshop; Long Beach, CA, USA. 9 December 2017; Dec 20, [Google Scholar]
  • 37.Polyak B.T., Juditsky A.B. Acceleration of Stochastic Approximation by Averaging. SIAM J. Control Optim. 1992;30:838–855. doi: 10.1137/0330046. [DOI] [Google Scholar]
