2008 Dec 16;37(3):815–824. doi: 10.1093/nar/gkn981

Figure 7.


The relationship between the pseudocount proportion α implied by the MDL principle and column relative entropy. Each point represents a multiple alignment column constructed by PSI-BLAST from the aravind103 query set (18,33) run against SWISS-PROT (34). Only columns with Inline graphic independent observations are considered. The x-axis represents the relative entropy D(f′ || p), where f′ is the observed frequency vector of the column after the addition of m0 = 5.5 pseudocounts, and p is the background amino acid frequency vector implicit in BLOSUM-62. The y-axis represents the pseudocount proportion α calculated from the MDL theory. The upper diagonal line (shown in red) represents the best power-law fit to the data. The lower diagonal line (shown in green) represents the power-law relationship of α to D(f′ || p) that empirically yields the optimal retrieval on the training set. Note that the background frequency vector p is the fixed point of the linear transformation M. Therefore, if f′ = p, the increase in the description length of the data is identically zero for all α, implying that the MDL is optimized at α = 1. For any finite n, vectors f′ close enough to p also imply an optimal α of 1. A small number of such points are seen at the upper left of this graph.
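The two quantities plotted here can be illustrated with a minimal sketch. It is not the authors' code, and it rests on several assumptions that the caption does not state: the m0 pseudocounts are spread in proportion to the background p, the relative entropy is measured in bits, and the power law α = a · D^b is fitted by ordinary least squares in log-log space. The function names and the four-letter toy alphabet are hypothetical, chosen only for illustration.

```python
import math

def add_pseudocounts(counts, p, m0=5.5):
    """Return f', the column frequencies after adding m0 pseudocounts.

    Assumption: pseudocounts are distributed in proportion to the
    background frequencies p. The caption does not specify this.
    """
    n = sum(counts)
    return [(c + m0 * pi) / (n + m0) for c, pi in zip(counts, p)]

def relative_entropy(f, p):
    """D(f || p) in bits (assumed unit); zero-frequency terms contribute 0."""
    return sum(fi * math.log2(fi / pi) for fi, pi in zip(f, p) if fi > 0)

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y); return (a, b).

    Assumption: a log-log linear regression. The caption only says
    "best power-law fit" without naming the method.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Toy 4-letter alphabet with a uniform background (hypothetical data).
background = [0.25, 0.25, 0.25, 0.25]
column_counts = [10, 0, 0, 0]
f_prime = add_pseudocounts(column_counts, background)
d_column = relative_entropy(f_prime, background)

# Synthetic points lying exactly on alpha = 2 * D**-0.5, to check the fit.
ds = [0.5, 1.0, 2.0, 4.0]
alphas = [2.0 * d ** -0.5 for d in ds]
a_fit, b_fit = fit_power_law(ds, alphas)
```

With real data, each (D, α) pair would come from one alignment column, and points with D equal to zero would have to be excluded before taking logarithms.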
