Table 2. Power-law fitting results for words and lemmas, denoted respectively by subindices w and l.
$V$ is the number of types (vocabulary size), $n_m$ is the maximum frequency of the distribution, $a$ is the minimum value for which the power-law fit holds, $N_a$ is the number of types in the power-law tail, i.e., with $n \ge a$, and $\gamma$ and $\sigma$ are the power-law exponent and its standard deviation, respectively. $\sigma_d$ is the standard deviation of $\gamma_l - \gamma_w$ assuming independence, which is $\sigma_d = \sqrt{\sigma_l^2 + \sigma_w^2}$; twice this value, $2\sigma_d$, is also given. The last column provides $\ell_1$, the number of lemmas associated with only one word form. Notice that the lemma exponent is very close to the one found in Ref. [29] for the tail of a double power-law fitting, except for Moby-Dick and Ulysses.
Title | $V_w$ | $n_{mw}$ | $N_{aw}$ | $a_w$ | $\gamma_w \pm \sigma_w$ | $V_l$ | $n_{ml}$ | $N_{al}$ | $a_l$ | $\gamma_l \pm \sigma_l$ | $2\sigma_d$ | $\ell_1$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Clarissa | 20492 | 38632 | 1514 | 51 | 1.83±0.02 | 9041 | 41679 | 838 | 101 | 1.83±0.03 | 0.07 | 5750 |
Moby-Dick | 18516 | 14438 | 2658 | 8 | 1.97±0.02 | 9141 | 14438 | 1548 | 13 | 1.90±0.02 | 0.06 | 6157 |
Ulysses | 29450 | 14934 | 4377 | 6 | 1.95±0.01 | 12469 | 14934 | 1024 | 26 | 1.97±0.03 | 0.07 | 8670 |
Don Quijote | 21180 | 20704 | 939 | 40 | 1.93±0.03 | 7432 | 31521 | 936 | 32 | 1.83±0.03 | 0.08 | 3812 |
La Regenta | 21871 | 19596 | 1196 | 26 | 2.01±0.03 | 9900 | 32300 | 993 | 32 | 2.00±0.03 | 0.08 | 5308 |
Artamène | 25161 | 88490 | 936 | 200 | 1.86±0.03 | 5008 | 119016 | 641 | 200 | 1.79±0.03 | 0.08 | 2178 |
Bragelonne | 25775 | 26848 | 3173 | 16 | 1.84±0.02 | 10744 | 45577 | 1382 | 40 | 1.84±0.02 | 0.06 | 5391 |
Seitsemän | 22035 | 4247 | 22035 | 1 | 2.13±0.01 | 7658 | 4247 | 474 | 26 | 2.13±0.05 | 0.10 | 4246 |
Kevät ja | 25071 | 5042 | 8660 | 2 | 2.05±0.01 | 8898 | 6886 | 699 | 20 | 1.96±0.04 | 0.07 | 5060 |
Vanhempieni | 35931 | 5254 | 6523 | 3 | 2.09±0.01 | 13510 | 7526 | 571 | 32 | 2.05±0.04 | 0.09 | 7837 |
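
As a reading aid for the 2σ d column, the following minimal sketch computes the standard deviation of the exponent difference under the independence assumption (the square root of the sum of the two squared standard deviations) and checks whether the word and lemma exponents differ by more than twice that value. The function names and the use of 2σ d as a compatibility threshold are assumptions made here for illustration, not something stated in the caption. Because the inputs are the rounded values printed in the table, the computed 2σ d may differ from the tabulated one in the last digit.

```python
import math


def sigma_d(sigma_w: float, sigma_l: float) -> float:
    """Standard deviation of gamma_l - gamma_w, assuming independent estimates."""
    return math.hypot(sigma_w, sigma_l)  # sqrt(sigma_w**2 + sigma_l**2)


def exponents_compatible(gamma_w: float, sigma_w: float,
                         gamma_l: float, sigma_l: float) -> bool:
    """True if |gamma_l - gamma_w| does not exceed 2 * sigma_d.

    Using 2 * sigma_d as the threshold is an illustrative assumption.
    """
    return abs(gamma_l - gamma_w) <= 2.0 * sigma_d(sigma_w, sigma_l)


# (gamma_w, sigma_w, gamma_l, sigma_l), copied from three rows of the table.
rows = {
    "Clarissa": (1.83, 0.02, 1.83, 0.03),
    "Moby-Dick": (1.97, 0.02, 1.90, 0.02),
    "Artamène": (1.86, 0.03, 1.79, 0.03),
}

for title, (gw, sw, gl, sl) in rows.items():
    two_sd = 2.0 * sigma_d(sw, sl)
    verdict = "within" if exponents_compatible(gw, sw, gl, sl) else "outside"
    print(f"{title}: gamma_l - gamma_w = {gl - gw:+.2f}, "
          f"2*sigma_d = {two_sd:.3f} ({verdict} 2*sigma_d)")
```

With these rounded inputs, the Clarissa and Artamène differences fall within 2σ d, whereas the Moby-Dick difference exceeds it.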