Figure 5.
Selecting an optimal proportion of pseudocounts using the MDL principle. For n = 500 and the observed frequencies f listed in Table 3, we apply pseudocounts as implied by the BLOSUM-62 substitution matrix. For α between 0 and 0.1, we use Equation (5) to compute the change in the description length of the data relative to its value at a reference α. The dot-dashed curve (in red) shows the increase in the description length of the data. The dashed curve (in blue) shows the decrease in the description length of the model. The total decrease in the description length, shown by the solid curve (in black), is maximized at a value of α that corresponds to 19.4 pseudocounts.
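
The selection step described in this caption can be illustrated with a minimal sketch: given the two curves sampled on a grid of α, subtract the increase in the description length of the data from the decrease in the description length of the model and take the α at which the difference is largest. The function name `argmax_total_decrease` and the two toy curves are assumptions made for illustration only; they stand in for Equation (5) and the model term and do not reproduce the values plotted in the figure.

```python
def argmax_total_decrease(alphas, data_increase, model_decrease):
    """Return (alpha, total) at the grid point where
    total = model_decrease - data_increase is largest."""
    totals = [m - d for d, m in zip(data_increase, model_decrease)]
    best = max(range(len(alphas)), key=totals.__getitem__)
    return alphas[best], totals[best]


# Grid of alpha values from 0 to 0.1 in steps of 0.001.
alphas = [i / 1000 for i in range(101)]

# Hypothetical placeholder curves (NOT Equation (5) or the figure's data):
# the data cost grows quadratically, the model saving grows linearly.
data_increase = [50.0 * a * a for a in alphas]
model_decrease = [2.0 * a for a in alphas]

best_alpha, best_total = argmax_total_decrease(alphas, data_increase, model_decrease)
print(best_alpha, best_total)
```

With real curves, the maximizing α would then be converted to a number of pseudocounts using the definition of α given in the main text.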