function [H,bias,CL] = MIST_ApproxEntropy(D,approxOrder,nShuffle)
% [H,bias,CL] = MIST_ApproxEntropy(D,[approxOrder],[nShuffle])
%
% Estimates the joint entropy of the columns in D using the MIST
% approximation of order APPROXORDER. A bias for the estimate is also
% returned using NSHUFFLE shuffling iterations.
%
% Inputs:
%   D - Data matrix in which each column is a variable and each row is a
%       multivariate sample of the variables. All variables are assumed to
%       be discrete, so continuous data should be binned prior to use.
%
%   approxOrder - Order of the MIST approximation to use for entropy
%       estimation. Valid values are 2..size(D,2). Default = 2.
%
%   nShuffle - Number of independent shuffling iterations to be used in the
%       estimation of the bias. A value of 0 indicates that no bias
%       estimate should be made. Default = 0.
%
% Outputs:
%   H - Estimate of the joint entropy of all columns in the data matrix D
%       using the MIST approximation. See King and Tidor for more
%       information.
%
%   bias - Estimated bias of the approximation. Adding the bias to H gives
%       a bias-adjusted approximation that generally performs better for
%       small sample sizes.
%
%   CL - Confidence limit on the bias, at the p = 1/NSHUFFLE level. Adding
%       this value to H gives a value that has a probability >= 1-p of
%       being greater than the approximation using the exact low-order
%       entropies.
%
% Example usage:
%   nOpts = [10,50,100,500,1e3];
%   D = ceil(rand(max(nOpts),5)*3);
%   Hanalytic = 5*log(3);
%   for i=1:length(nOpts)
%       [H(i),bias(i),CL(i)] = MIST_ApproxEntropy(D(1:nOpts(i),:),2,100);
%   end
%   semilogx(nOpts,[H;H+bias;H+CL],'x-',nOpts,zeros(size(nOpts))+Hanalytic,'k--')
%   legend('MIST_2','BA-MIST','CL-MIST')
%   title('MIST approximation of joint entropy of 5 independent 3-state variables')
%   xlabel('samples')
%   ylabel('Joint Entropy (nats)')
%
% This code was developed and tested with MATLAB version 7.3.
%
% See King and Tidor (submitted) for details of the MIST approximation and
% bias estimation.

% check default inputs
if ~exist('approxOrder','var') || isempty(approxOrder)
    approxOrder = 2;
end
if ~exist('nShuffle','var') || isempty(nShuffle)
    nShuffle = 0;
end
bias = NaN;
CL = NaN;

% estimate low-order entropies directly
Hlow = estimate_joints(D,approxOrder);

% apply the MIST approximation to estimate the joint entropy of D
[H,MIterms] = H_ubn(Hlow,approxOrder,1:size(D,2));

% estimate the bias
if nShuffle
    if approxOrder ~= 2
        bias = NaN;
        CL = NaN;
        warning('Bias estimation is only implemented for 2nd order')
        return
    else
        [bias,CL] = estimateBias(D,MIterms{1},nShuffle);
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function H = estimate_joints(D,max_order)
%H = estimate_joints(D,max_order)
% Estimate all joint entropies of the binned data matrix D for orders
% 1..max_order. The returned cluster is a cell with entries 1..max_order,
% each indexed by the set of columns included in the calculation.

N = size(D,2);
[H_tot,c] = Hd_discrete(D);
for i=1:max_order
    list = nchoosek(1:N,i);
    H{i} = [Hd_discrete_from_c(c,list),list];
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [ub,picked] = H_ubn(clusters,order,list)
%[ub,picked] = H_ubn(clusters,order,list)
% A greedy implementation of the upper bound is used to apply the joint-H
% approximation of arbitrary order to each row of indices in LIST. CLUSTERS
% must have the 1:ORDER cell entries defined (these should all be estimated
% directly from the data). PICKED is a cell where the ith entry gives the
% ORDER-sized information terms used in the approximation for the ith row
% in LIST.
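
% A rough sketch of the bound implemented below (see King and Tidor for
% the formal statement): in the 2nd-order case the approximation takes the
% form
%
%   H(X_1,...,X_n) <= sum_i H(X_i) - sum_{(i,j) in T} I(X_i;X_j)
%
% where T is a spanning tree over the variables. The loop below builds T
% greedily, starting from the pair with the largest mutual information and
% repeatedly adding the available variable whose information about the
% already-chosen set is largest, subtracting that information from the
% running total. For higher orders the pairwise I(X_i;X_j) terms are
% replaced by I(X_i; X_j,...,X_k) terms over ORDER variables.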
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% check the dimensions of the inputs %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% not enough info in clusters -> use a lower-order approximation
if order > length(clusters)
    order = length(clusters);
    warning(['\nThe inputted cluster does not contain high enough order information. \n',...
        'An approximation of order ',num2str(order),' will be used instead.'],[]);
end

% if the order of the approximation is at least the number of elements,
% just do the lookup and give a warning
if size(list,2) <= order
    warning(['\nNo approximation is necessary. CLUSTERS contains sufficient information\n',...
        'to estimate entropies of order ',num2str(size(list,2)),' directly'],[]);
    ub = query_clusters(clusters,list,1);
    picked = {};
    return
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Initialize the calculation %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% convert the clusters info to MI info
sort_on = 1;
MIn = MIn_from_clusters(clusters,order,sort_on);

% check and make sure that MIn contains some information
if maxd(MIn) <= 0
    error(['\nThe inputted cluster has no information of order ',num2str(order),'.\n',...
        'Try using a lower order approximation'])
end

% figure out the problem dimensions
n = length(MIn);
k = size(list,2);
ub = zeros(size(list,1),1);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% compute the upper bound for each mask %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for i=1:size(list,1)
    % pull out the appropriate subsets
    clust = list(i,:);
    for j=1:order
        clust_cell{j} = clust;
    end
    MIn_i = MIn(clust_cell{:});

    % find the maximum-information starting set
    [I_1_n,ind] = max(reshape(MIn_i,prod(size(MIn_i)),1));
    max_ijk = ind2sub_vec(size(MIn_i),ind);

    % if nothing has any information, just add up the selfs and continue
    if (I_1_n <= 0)
        ub_temp = sum(query_clusters(clusters,clust',1));
        picked_t = clust;
    else
        % add in the entropy of the first ORDER things chosen and the self
        % entropy of all other nodes
        H_ijk = query_clusters(clusters,clust(max_ijk));
        H_ijk = H_ijk(1,sort_on);
        H_other = sum(query_clusters(clusters,clust(setdiff(1:k,max_ijk))'),1);
        H_other = H_other(1,sort_on);
        ub_temp = H_ijk + H_other;

        % record the first set of selected nodes
        clear picked_t
        picked_t(1,:) = clust(max_ijk);
        chosen = max_ijk;
        avail = setdiff(1:k,chosen);

        % select the remaining nodes to have max MI
        while length(avail) > 0
            % pick out the allowable I to use
            for j=1:order-1
                chosen_cell{j} = chosen;
            end
            MIn_sub = MIn_i(avail,chosen_cell{:});

            % find the best (largest) I to remove
            [I_add,ind] = max(reshape(MIn_sub,prod(size(MIn_sub)),1));
            max_ijk_t = ind2sub_vec(size(MIn_sub),ind);

            % update the ub and the lists
            ub_temp = ub_temp - I_add;
            picked_t(end+1,:) = clust([avail(max_ijk_t(1)),chosen(max_ijk_t(2:end))]);
            chosen = [chosen,avail(max_ijk_t(1))];
            avail = avail([1:(max_ijk_t(1)-1),(max_ijk_t(1)+1):end]);
        end
    end
    ub(i) = ub_temp;
    picked{i} = picked_t;
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [bias,CL] = estimateBias(D,MIterms,n_reps)
% Computes an upper bound to the error of estimating the joint entropy of
% all columns in D using the 2nd-order approximation, by propagating the
% pairwise errors from error_Hub.

pairs_err = zeros(n_reps,size(D,2)-1);
for i=1:size(MIterms,1)
    pair = MIterms(i,:);
    pairs_err(:,i) = error_Hub(D(:,pair),size(D,1),n_reps);
end
bias = sum(mean(pairs_err));
CL = sum(max(pairs_err));

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
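% The next subfunction, Hd_discrete, is the plug-in entropy estimator used
% throughout: joint states are counted, the counts are normalized to
% frequencies p, and H = -sum(p.*log(p)) is returned in nats. As a small
% illustrative example (not part of the original code), four equally
% frequent joint states give
%
%   Hd_discrete([1 1; 1 2; 2 1; 2 2])   % = log(4) ~= 1.3863 nats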
function [H,c] = Hd_discrete(D)
%[H,c] = Hd_discrete(D)
% Computes the joint entropy (H) and joint probability counts (c) of the d
% inputted columns in D. D is assumed to be an Nxd matrix that has already
% been binned.

if min(size(D)) == 1
    D = reshape(D,length(D),1);
end

% remove all NaN rows
keep_row = find(any(isnan(D),2)==0);
D = D(keep_row,:);
[N,d] = size(D);
D = renumber_bins(D);
for i=1:d
    bins{i} = unique(D(:,i));
    b(i) = length(bins{i});
end
if any(b <= 1)
    warning('at least one of the columns of data has only one state')
end

% warn the user if there are a large number of bins
if any(b > N/2)
    warning(['The inputted data has as many as ',num2str(max(b)),' bins for ',...
        num2str(N),' entries. You should probably be using fewer bins'])
end

% loop through all the entries and compute the counts
c = zeros(b);
for i=1:N
    entry = sub2ind_vec(b,D(i,:));
    c(entry) = c(entry) + 1;
end
p_vec = c(find(c>0));           % remove zero probability entries
norm = sum(p_vec);
p_vec = p_vec/norm;
H = -sum(p_vec.*log(p_vec));    % compute the entropy

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function H = Hd_discrete_from_c(c,list)
%H = Hd_discrete_from_c(c,list)
% Computes the joint entropy of each row of column indices passed in LIST
% using the full state counts matrix c.

dim = length(size(c));
if size(list,2) > dim
    error('the dimension of list is larger than the dimension of c')
end
if dim == size(list,2)
    if size(list,1) ~= 1
        warning('there are redundant entries in list')
    end
    p = c(find(c./sumd(c)));
    p = p./sum(p);
    H = -sum(p.*log(p));
    return
end
H = zeros(size(list,1),1);
for i=1:size(list,1)
    set = list(i,:);
    marg = setdiff(1:dim,set);
    c_temp = sumd(c,marg);              % marginalize through all other dimensions
    p_vec = c_temp(find(c_temp>0));     % remove zero probability entries
    norm = sum(p_vec);
    p_vec = p_vec/norm;
    H(i) = -sum(p_vec.*log(p_vec));     % compute the entropy
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [err,H_ind] = error_Hub(D,N,n_reps)
% Computes an upper bound to the error of estimating the joint entropy of
% all columns in D, by enforcing the columns to be independent, sampling
% from the independent distribution, and comparing to the sum of the
% individual entropies (which is assumed to be fairly well converged).
% The procedure is repeated n_reps times and the vector of all errors is
% returned.

% compute the independent joint H
H_ind = 0;
for i=1:size(D,2)
    H_ind = H_ind + Hd_discrete(D(:,i));
end

H_est = zeros(n_reps,1);
for i=1:n_reps
    D_sub = zeros(N,size(D,2));
    for j=1:size(D,2)
        D_sub(:,j) = randsample(D(:,j),N,1);
    end
    H_est(i) = Hd_discrete(D_sub);
end
err = H_ind - H_est;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
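% The next subfunction tabulates the information terms from the entropies
% computed above using the identity
%
%   I(X_i; X_j,...,X_k) = H(X_i) + H(X_j,...,X_k) - H(X_i,X_j,...,X_k)
%
% which is what the assignment "val = H1(i1) + H_oth - joint" evaluates.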
function I = MIn_from_clusters(H,order,sort_on)
%I = MIn_from_clusters(H,order,[sort_on])
% Generates an NxNx...xN matrix containing the MI of all singles about all
% (ORDER-1) groups. The (i,j,k,...) entry is I(i;j,k,...). Only cells with
% all-unique indices are filled; all other terms are zero by default.
% For example, I(1,2,2) = 0 (not I(1,2), as it "should" be).
% H is assumed to be a cell of at least ORDER cells, one holding the
% entropies of all sets of size 1,2,3,... indexed by 1:N. Any number of
% score columns may exist, but all calculations will use only column
% SORT_ON. SORT_ON defaults to 1.
%
% Note: only the upper triangle of this matrix will be filled to avoid
% redundancy.

sort_on_default = 1;
if ~exist('sort_on') || isempty(sort_on)
    sort_on = sort_on_default;
end
if order < 2
    error('The order of MI must be >= 2')
elseif order > length(H)
    error('There is not sufficient information in H to compute this order I');
end

% determine where the indices start
start_i = size(H{1},2);
H1 = sortrows(H{1},start_i);
Hoth = H{order-1};
Hall = H{order};

% figure out how many species there are
N = size(H1,1);

% fill the I array. The (i,j,k,...) entry in I is MI(i;j,k,...)
I = zeros(N+zeros(1,order));
for i=1:size(Hall,1)
    joint = Hall(i,sort_on);
    opts = Hall(i,start_i:end);
    % for each of the options compute the MI between it and the other
    % set (plus two entries per pair for 6 total entries)
    for j=1:length(opts)
        i1 = opts(j);
        ioth = opts([1:(j-1),(j+1):end]);
        ioth_cell = num2cell(ioth);
        H_oth = query_clusters(H,ioth);
        H_oth = H_oth(sort_on);
        val = H1(i1) + H_oth - joint;
        I(i1,ioth_cell{:}) = val;
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function sub = ind2sub_vec(siz,ndx)
%IND2SUB_VEC Multiple subscripts from linear index.
%   IND2SUB_VEC is used to determine the equivalent subscript values
%   corresponding to a given single index into an array.
%
%   V = IND2SUB_VEC(SIZ,IND) returns the vector V containing the
%   equivalent subscripts corresponding to the index IND for an array of
%   size SIZ.
%
%   Class support for input IND:
%      float: double, single
%
%   See also IND2SUB, SUB2IND, SUB2IND_VEC, FIND.
%
%   Copyright 1984-2005 The MathWorks, Inc.
%   $Revision: 1.13.4.3 $ $Date: 2005/03/23 20:24:04 $
%
%   Modified from the above by BMK 1/2/07

siz = double(siz);
if ndx > prod(siz)
    error('the index exceeds the dimension of size')
end

n = length(siz);
k = [1 cumprod(siz(1:end-1))];
for i = n:-1:1
    vi = rem(ndx-1, k(i)) + 1;
    vj = (ndx - vi)/k(i) + 1;
    sub(i) = vj;
    ndx = vi;
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function M = maxd(X,dim)
%MAXD Max of elements.
%   Wrapper around MAX that allows DIM to be a vector of dimensions
%   over which to take the maximum.
%
%   M = MAXD(X,DIM) takes the max along the dimensions in the vector DIM.
%
%   See also MAX.

if nargin < 2 || isempty(dim)
    dim = 1:length(size(X));
end
M = max(X,[],dim(1));
for d = dim(2:end)
    M = max(M,[],d);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function sub_cluster = query_clusters(clusters,mask,no_indeces)
%sub_cluster = query_clusters(clusters,mask,[no_indeces])
% Returns the subset of the inputted cluster that matches MASK, where the
% number of columns of MASK gives the number of species in each row.
% Set NO_INDECES to 1 to return just the 1st (score) column.

no_indeces_default = 0;
if nargin < 3 || isempty(no_indeces)
    no_indeces = no_indeces_default;
end
for i=1:size(mask,1)
    mask(i,:) = sort(mask(i,:));
end
k = size(mask,2);
cluster = clusters{k};
max_i = max(max(cluster(:,(end-k+1):end)));
base_vec = max_i.^(0:k-1)';
clust_single_i = cluster(:,(end-k+1):end) * base_vec;
mask_single_i = mask * base_vec;
keep_i = find(ismember(clust_single_i,mask_single_i));
sub_cluster = cluster(keep_i,:);
if no_indeces
    sub_cluster = sub_cluster(:,1);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
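% The helper below maps arbitrary bin labels onto consecutive integers.
% As an illustrative example (values not from the original code), a column
% containing [10;10;30;7] is relabeled to [2;2;3;1], since unique() orders
% the observed values as [7;10;30].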
function D_new = renumber_bins(D)
%D_new = renumber_bins(D)
% Renumber binned data so that each column uses bins 1..b.

[N,d] = size(D);
for i=1:d
    bins{i} = unique(D(:,i));
    b(i) = length(bins{i});
end

% warn the user if there are a large number of bins
if any(b > N/2)
    warning(['The inputted data has as many as ',num2str(max(b)),' bins for ',...
        num2str(N),' entries. You should probably be using fewer bins'])
end

% rename all of the bins to be 1..b
D_new = zeros(size(D));
for i=1:d
    for j=1:b(i)
        D_new(find(D(:,i) == bins{i}(j)),i) = j;
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ndx = sub2ind_vec(siz,vec)
%SUB2IND_VEC Linear index from multiple subscripts.
%   SUB2IND_VEC is used to determine the equivalent single index
%   corresponding to a given set of subscript values. The subscript values
%   are passed in as a single vector, rather than as separate inputs as in
%   SUB2IND.
%
%   IND = SUB2IND_VEC(SIZ,[I,J]) returns the linear index equivalent to the
%   row and column subscripts I and J for a matrix of size SIZ.
%
%   IND = SUB2IND_VEC(SIZ,[I1,I2,...,IN]) returns the linear index
%   equivalent to the N subscripts I1,I2,...,IN for an array of size SIZ.
%
%   Class support for inputs I,J:
%      float: double, single
%
%   See also SUB2IND, IND2SUB.
%
%   Copyright 1984-2005 The MathWorks, Inc.
%   $Revision: 1.14.4.6 $ $Date: 2005/03/23 20:24:05 $

narg = length(vec)+1;
siz = double(siz);

if length(siz) < 2
    ndx = vec;
    return;
end

% Adjust input
if length(siz) <= narg-1
    % Adjust for trailing singleton dimensions
    siz = [siz ones(1,narg-length(siz)-1)];
else
    % Adjust for linear indexing on last element
    siz = [siz(1:narg-2) prod(siz(narg-1:end))];
end

% Compute linear indices
k = [1 cumprod(siz(1:end-1))];
ndx = 1;
for i = 1:length(siz)
    v = vec(i);
    if (v < 1) || (v > siz(i))
        % Verify subscripts are within range
        error('MATLAB:sub2ind:IndexOutOfRange','Out of range subscript.');
    end
    ndx = ndx + (v-1)*k(i);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function S = sumd(X,dim)
%SUMD Sum of elements.
%   Wrapper around SUM that allows DIM to be a vector of dimensions
%   over which to sum.
%
%   S = SUMD(X) sums along all dimensions.
%
%   S = SUMD(X,DIM) sums along the dimensions in the vector DIM.
%
%   See also SUM.

if nargin < 2 || isempty(dim)
    dim = 1:length(size(X));
end
S = sum(X,dim(1));
for d = dim(2:end)
    S = sum(S,d);
end
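
% Illustrative example of sumd (not part of the original code):
%   sumd(ones(2,3,4),[1 3])   % sums over dimensions 1 and 3 -> [8 8 8]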