Sci Rep. 2012 Dec 10;2:943. doi: 10.1038/srep00943

Figure 7. Growth fluctuations of word use scale with the size of the corpora.

(A) The quantitative relation in Eq. (8) between σr(t|fc) and the corpus size Nu(t|fc). We calculate σr(t|fc) using only the relatively common words, namely those with lifetime Ti ≥ 10 years whose average word use 〈fi〉 over the entire word history exceeds the threshold fc ≡ 10/Min[Nu(t)] (see Table I). Each panel shows the language-dependent scaling exponent, β ≈ 0.08–0.35, obtained as the ordinary least squares (OLS) best fit, with the standard error in parentheses. (B) Summary of the β(Uc) exponents calculated using a use threshold Uc instead of the frequency threshold fc used in (A). Error bars indicate the standard error of the OLS regression. We perform this additional analysis to provide alternative insight into the role of extremely rare words. As Uc increases, the β(Uc) value for each corpus increases from β ≈ 0.05 to β < 0.25. This language pruning method quantifies the role of new rare words (including OCR errors, spelling variants, and other orthographic variants), which are significant components of language volatility.
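The OLS estimate of β and its standard error described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that Eq. (8), which is not reproduced in this caption, is a power law of the form σr ∝ Nu^(−β), so that β is the negative slope of a straight-line fit in log-log coordinates. The function name `fit_scaling_exponent` and the synthetic data are hypothetical.

```python
import numpy as np


def fit_scaling_exponent(corpus_size, sigma_r):
    """Estimate beta and its standard error by OLS on log-log data.

    Assumes a power law sigma_r ~ corpus_size**(-beta); this functional
    form is an assumption, since Eq. (8) is not shown in the caption.
    """
    x = np.log10(np.asarray(corpus_size, dtype=float))
    y = np.log10(np.asarray(sigma_r, dtype=float))
    n = x.size

    # Closed-form OLS slope and intercept for y = intercept + slope * x.
    x_mean, y_mean = x.mean(), y.mean()
    sxx = np.sum((x - x_mean) ** 2)
    slope = np.sum((x - x_mean) * (y - y_mean)) / sxx
    intercept = y_mean - slope * x_mean

    # Standard error of the slope from the residual variance (n - 2 dof).
    residuals = y - (intercept + slope * x)
    slope_se = np.sqrt(np.sum(residuals ** 2) / (n - 2) / sxx)

    # beta is the negative of the fitted slope; the standard error is unchanged.
    return -slope, slope_se


# Synthetic example (made-up data): true beta = 0.2 with small log-normal noise.
rng = np.random.default_rng(0)
n_u = np.logspace(6, 9, 50)  # hypothetical corpus sizes
sigma = 2.0 * n_u ** -0.2 * 10 ** rng.normal(0.0, 0.01, n_u.size)

beta, beta_se = fit_scaling_exponent(n_u, sigma)
print(f"beta = {beta:.3f} ({beta_se:.3f})")  # same "value (standard error)" style as the panels
```

Repeating the fit after discarding words below a chosen use threshold Uc would give one β(Uc) point per threshold, in the spirit of panel (B); the pruning step itself depends on the word-level data and is not sketched here.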