Author manuscript; available in PMC: 2021 Sep 1.
Published in final edited form as: Ann Appl Stat. 2020 Sep 18;14(3):1557–1580. doi: 10.1214/20-aoas1361

Inferring a consensus problem list using penalized multistage models for ordered data

Philip S Boonstra 1,*, John C Krauss 2

Abstract

A patient’s medical problem list describes his or her current health status and aids in the coordination and transfer of care between providers. Because a problem list is generated once and then subsequently modified or updated, what is not usually observable is the provider effect: that is, to what extent does a patient’s problem list in the electronic medical record actually reflect a consensus communication of that patient’s current health status? To that end, we report on and analyze a unique interview-based design in which multiple medical providers independently generate problem lists for each of three patient case abstracts of varying clinical difficulty. Due to the uniqueness of both our data and the scientific objectives of our analysis, we apply and extend so-called multistage models for ordered lists and equip the models with variable selection penalties to induce sparsity. Each problem has a corresponding non-negative parameter estimate, interpreted as a relative log-odds ratio, with larger values suggesting greater importance and zero values suggesting unimportant problems. We use these fitted penalized models to quantify and report the extent of consensus. We conduct a simulation study to evaluate the performance of our methodology and then analyze the motivating problem list data. For the three case abstracts, the proportions of problems with model-estimated non-zero log-odds ratios were 10/28, 16/47, and 13/30. Physicians exhibited consensus on the highest-ranked problems in the first and last case abstracts, but agreement quickly deteriorated; in contrast, physicians broadly disagreed on the relevant problems for the middle – and most difficult – case abstract.

Keywords: conditional multinomial, L0 penalty, ranked lists, variable selection

1. Introduction

A patient’s medical problem list is defined as the minimal number of diagnoses that describe that patient’s current health status and risks to future health (Krauss et al., 2016). It serves as a “dynamic ‘table of contents’ ” (Weed, 1968) for the patient, which is useful for coordination of care between providers and care environments (Krauss et al., 2016). All providers of care for a patient work from the same problem list and update it at each encounter, but little is known about how much consensus there is between each provider’s individually generated problem list. There is clinical interest in having the problem list accurately reflect the patient’s current health. In other words, to what extent does a patient’s problem list in the electronic medical record reflect a consensus communication of that patient’s current health status? The statistical methodology developed in this paper is directly motivated by the idiosyncrasies of this ranked data context, as elucidated below.

The data upon which our methodology is based were collected via a series of interviews of faculty physicians at the University of Michigan (Ann Arbor, MI) conducted by the second author (JCK, the interviewer) between May 2013 and July 2014. All faculty members in the Department of General Medicine and the Department of Family Medicine (approximately 150 in total) were electronically invited to participate, and thirty-eight consented. Each interview consisted of the participating physician reviewing three real, previously reported patient case abstracts (labeled A, B, and C) that have been specifically developed for physician training in clinical reasoning (Meyer et al., 2013). For each case, the physician was asked to write down what would be her problem list for that patient as if she were the provider of care. The first six interviews were used as training for the interviewer to standardize the process as well as to develop a written vocabulary of expected problems for that case. The subsequent 32 interviews comprised the study data. For any novel problems encountered in this second round of interviews that were, in the opinion of the interviewer, similar to an existing problem already in the vocabulary, the interviewer noted this similarity and asked whether the subject would consider these equivalent or not. If the subject said ‘no’, then the novel problem was left as is. The cases were presented in the same order (alphabetical by label) for all interviews, based on an assumption that the most complex clinical case would be B and the simplest clinical case would be C. The interview results are summarized in Figure 1. See Krauss et al. (2016) for more details on the study design and case abstracts. The data obtained in this study – 32 de novo problem lists generated for the same patient at a single point in time – do not naturally occur in a medical chart. Therefore, this study provides a unique opportunity to measure physician agreement and the degree to which a newly generated problem list is consistently generated. In other words, to what extent can a physician expect the accompanying problem list she receives with a patient to be the same problem list she herself would generate for that patient?

Figure 1: Counts of the frequency that each problem was listed on any of the n = 32 generated lists for each of three case abstracts, with shading and shape used to indicate the rank of that problem. For brevity, only those problems listed by at least two physicians are shown.

Similar questions arise in other diverse ranked data contexts, including election polling (Gormley and Murphy, 2008; Gormley et al., 2008), sorting genomic features (DeConde et al., 2006; Boulesteix and Slawski, 2009; Lin and Ding, 2009; Lin, 2010; Li et al., 2017, 2018), identifying bovine feeding preferences (Nombekela et al., 1994), handicapping horse races (Plackett, 1975; Benter et al., 2008), ranking basketball teams (Deng et al., 2014), or indexing search engine results (Webber et al., 2010). However, once the data are in hand, the subsequent analysis typically converges on a common goal, namely that of measuring agreement between the rankers.

So-called ‘multistage models’, which are essentially a sequence of conditional multinomial distributions, can be used for aggregating and modeling a set of ordered lists such as what we analyze here (Plackett, 1975; Luce, 1959; Benter et al., 2008; Gormley and Murphy, 2008; Mollica and Tardella, 2017). However, multiple idiosyncrasies, both with respect to the underlying nature of the data and with respect to our scientific objectives, require novel extensions to this multistage, model-based approach. We propose three such extensions in this manuscript. First, we adjust multistage models to handle so-called ragged lists, which can have different lengths because the ranker chooses to stop ranking. The length of each list becomes informative in this case, and we model the fatigue process of rankers. Second, we equip the likelihood with a modified L0-type variable selection penalty to induce sparsity among the maximum penalized likelihood estimates. Sparsity is particularly desirable here because many aggregators will rank all items, requiring a post-hoc determination as to whether and where to truncate the consensus list. Although such penalties have been widely used, to our knowledge they have not yet been applied to models for ranked data, and thus this work represents a novel amalgamation of classical statistical models for ordered data with modern penalized regression techniques warranted by the motivating context. Third, we provide a computational framework in the R statistical environment for fitting these penalized models, including a coordinate ascent algorithm and tuning parameter selection based upon information-theoretic criteria to select the appropriate amount of penalization. The remainder of this manuscript provides technical background (Section 2), describes in detail each of our proposed extensions (Section 3), presents a simulation study evaluating our methodology against eight possible comparators (Section 4), and then finally illustrates their application to the problem list data of interest (Section 5) and the NBA team rankings data from Deng et al. (2014) (Section 6). We conclude with a discussion in Section 7.

2. Technical background

Assume that each ranker is ordering items from a common set, where each item is unambiguously mapped to an integer label in {1,...,v}. As noted in the introduction, a ‘ranker’ could be anyone or anything from a person to a search engine to a cow to a case-control study; however, in our motivating context and therefore our methodology, the ranker is sentient and free to stop ranking at any point. Lists from multiple rankers are available, and we model the process of constructing these lists. Such models usually require that the data be formulated as either ranked or ordered lists (Marden, 1996). Both data types convey equivalent information, and both take the set of all permutations of the v integers as their support. However, whereas a ranked list gives the ranks of the v items, an ordered list permutes the v items themselves based upon their ranking. Specifically, the sth entry of a ranked list is the rank assigned to the item having integer label s (lower numbers indicate higher ranks), and the sth entry of an ordered list is the integer label of the item that is ranked sth (items appearing early in the list are ranked higher). The data in this paper are formulated as ordered lists, but we will refer to items that are ordered first as being ‘highly ranked’.

The orderings may be incomplete. For example, top-ranked lists of genes based upon phenotypic association do not include every single gene but are always truncated, e.g. to the top 25 genes (DeConde et al., 2006). The New York Times Hardcover Nonfiction Best Seller List (https://www.nytimes.com/books/best-sellers/hardcover-nonfiction, accessed 15-Mar-19) publishes a weekly list of the 15 best-selling hardcover non-fiction books, and extant but unpublished are the number 16, 17, etc. best-sellers. When such lists are also top-weighted, meaning that disagreement between two lists at higher ranks is more important than at lower ranks, Webber et al. (2010) call them ‘indefinite’. Not considering the top-weighted characteristic, such lists are called ‘top-k lists’ by Dwork et al. (2001) when they have been uniformly truncated to the first k items and ‘partial lists’ by Deng et al. (2014) when the point of truncation differs from list to list. Importantly, partial lists could be longer but have been artificially truncated, beyond the purview of the ranker and external to the ordering process. Distinct from these are what we call ragged lists, which we define as ordered lists arising from rankers who are free to stop ranking. Subtly, a ragged list may be complete, if the ranker chooses to order all items. If there are unranked items, one can infer that they are ranked below all of the ranked items. The problem list data we study here are ragged, since each physician was free to select as many or as few problems as desired. Although some existing rank aggregation methodologies can analyze ragged lists, to our knowledge there are none explicitly designed to model the stopping process or to induce sparsity in the aggregated list.

Figure 1 plots the frequency with which each item (problem) was ranked by a physician for each of the three patient abstracts. Focusing on case A, thirty physicians ranked diabetes mellitus somewhere in their constructed problem list for this patient, whereas eight problems were ranked by just one physician (not necessarily the same person). 23/32 physicians ranked osteoarthritis somewhere on their list, but only in one physician’s list was it in the top 4. In contrast, 26/32 physicians ranked pneumonia first on their list. Less overall agreement was observed on case B, with no problems being ranked first by more than eight physicians, and 18 problems appearing on exactly one list.

We now introduce some notation. For $i = 1,\ldots,n$, the $i$th ranker's ordered list of $l_i$ items is denoted by $x_i = \{x_{i1}, x_{i2}, \ldots, x_{il_i}\}$, with $x_{is} \in \{1,\ldots,v\}$ and $s = 1,\ldots,l_i$ indexing each stage. If the lists are complete, then $l_i \equiv v$ for all lists; if they are partial, then $l_i \equiv l < v$ for all $i$, where $l$ is artificially chosen and external to the modeling process; if they are ragged, then $l_i \leq v$ for each $i$, with potentially different values of $l_i$ for each $i$. We describe two broad approaches for analyzing ordered lists.

2.1. Pairwise similarities

One approach for quantifying agreement is to measure a distance or similarity between any pair of lists $x_{i_1}$ and $x_{i_2}$. For complete ordered lists, Kendall’s τ (Kendall, 1948) or Spearman’s ρ (Spearman, 1904) could be used. Lin and Ding (2009) proposed the Cross Entropy Monte Carlo (CEMC) algorithms, which approximate an aggregated list that is, on average, closest to all observed lists with respect to one of these correlations. As we discuss below, these pairwise similarities may not always be appropriate for ordered lists. The rank-biased overlap (RBO, Webber et al., 2010) is a more recent example specifically designed for ordered lists. Given a user-specified parameter ψ ∈ (0, 1), the RBO between two lists $x_{i_1}$ and $x_{i_2}$ is

$$\mathrm{RBO}_\psi(x_{i_1}, x_{i_2}) = \frac{1-\psi}{\psi} \sum_{d=1}^{\infty} \psi^d \left| x_{i_1,1:d} \cap x_{i_2,1:d} \right| / d,$$

where the expression $\left| x_{i_1,1:d} \cap x_{i_2,1:d} \right| / d$ denotes the size of the intersection of the first $d$ elements of each list, divided by $d$. This proportion of the first $d$ elements that are shared is the so-called agreement at depth $d$. Agreements across all possible depths are then averaged using the convergent series of weights $\{\psi^d\}_{d=1}^{\infty}$. Values of this similarity measure fall in the interval [0,1], where 1 indicates perfect overlap at all depths, and 0 indicates no overlap at any depth. The RBO assumes that each list is long enough so as to be effectively infinite, and so the exact value can only be calculated by examining an infinite number of depths. However, by truncating the calculation at some finite depth and determining the smallest and largest possible contribution beyond this depth, one can construct a window within which the true RBO must lie; the width of this window decreases as the depth of truncation increases, because the series of weights is convergent.

Krauss et al. (2016) proposed a length-dependent version of the RBO (LDRBO), specifically for measuring the similarity between two finite ragged lists:

$$\mathrm{LDRBO}_\psi(x_{i_1}, x_{i_2}) = \frac{\sum_{d=1}^{\max\{l_{i_1}, l_{i_2}\}} \psi^d \left| x_{i_1,1:d} \cap x_{i_2,1:d} \right| / d}{\sum_{d=1}^{\max\{l_{i_1}, l_{i_2}\}} \psi^d}.$$

LDRBO measures average agreement, like RBO. It differs in that the maximum depth evaluated is always the length of the longer of the two lists. Also contrasting with RBO, ψ can be set to 1 for LDRBO, in which case LDRBO simplifies to the average agreement across all depths. LDRBO and RBO will become similar as $\min\{l_{i_1}, l_{i_2}\}$ increases. It is also noteworthy that rank-based similarity measures such as (LD)RBO yield qualitatively different interpretations than standard correlation measures like Spearman’s ρ, even for very simple cases. For example, $x_{i_1} = \{1, 2, 3, 4\}$ and $x_{i_2} = \{4, 3, 2, 1\}$ have perfect negative correlation (ρ = −1); in contrast, with ψ = 1, LDRBO evaluates to a middling value of (0/1 + 0/2 + 2/3 + 4/4)/4 ≈ 0.42. LDRBO can only be zero between lists having no intersection, such as $x_{i_1} = \{1, 2, 3, 4\}$ and $x_{i_3} = \{5, 6, 7, 8\}$, which, coincidentally, have a perfect positive correlation (ρ = 1). Thus, in the context of ordered lists, (LD)RBO better reflects the intuition that the exemplar pair $\{x_{i_1}, x_{i_2}\}$ shares more in common than $\{x_{i_1}, x_{i_3}\}$.
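To make the calculation concrete, the following R sketch (our own illustrative code with the hypothetical function name ldrbo, not taken from the authors' released software) computes LDRBO between two ordered lists and reproduces the worked example above.

```r
# A sketch of LDRBO between two ordered lists of item labels; psi in (0, 1].
ldrbo <- function(x1, x2, psi = 1) {
  max_depth <- max(length(x1), length(x2))
  # Agreement at depth d: shared items among the first d of each list, over d;
  # head() simply returns the whole list once d exceeds its length.
  agree <- sapply(seq_len(max_depth), function(d) {
    length(intersect(head(x1, d), head(x2, d))) / d
  })
  weights <- psi^seq_len(max_depth)
  sum(weights * agree) / sum(weights)
}

ldrbo(c(1, 2, 3, 4), c(4, 3, 2, 1))  # (0/1 + 0/2 + 2/3 + 4/4)/4 = 0.4167
ldrbo(c(1, 2, 3, 4), c(5, 6, 7, 8))  # 0: no intersection at any depth
```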

Using the motivating problem list data and setting ψ = 1, Krauss et al. (2016) used numerical methods to identify a theoretical ‘consensus problem list’ having the largest median value of LDRBO across all 32 physicians’ constructed lists. For Case A in Figure 1, the estimated consensus problem list was {PNEUMONIA, DIABETES MELLITUS, ANEMIA, SPLENOMEGALY, DEPRESSION WITH ANXIETY, OSTEOARTHRITIS, RENAL FAILURE, HYPOXIA}, having a median LDRBO of 0.683.

The findings of Krauss et al. (2016) notwithstanding, insofar as the objective is to measure consensus and calculate an aggregated consensus list, similarity-based approaches such as the LDRBO may fall short. For example, the above methods can be used to calculate a consensus problem list for any group of lists, no matter how disparate the data are. Further, there is no obvious mathematical rationale to suggest that maximizing the median pairwise LDRBO – as opposed to the mean, minimum, or maximum LDRBO – results in the ‘right’ consensus list, nor that ψ = 1 is the right choice. Finally, even with these relatively small datasets, there are practical computational challenges to this approach: with 28 unique problems in Figure 1, there are $28! \approx 10^{29.5}$ permutations of length 28 to search across as possible consensus lists, plus all candidate lists of length less than 28. Krauss et al. (2016) used an approximate ‘branch and bound’ algorithm to substantially limit the scope of the search.

3. Model-based approaches

These reasons provide compelling rationale to consider instead model-based approaches for the analysis of our problem list data and ordered lists in general.

One model-based approach assumes that the ranking process is a Markov process. Treating the set of v items as the state space of a Markov chain, one can calculate the stationary distribution of the corresponding transition matrix, where larger probabilities correspond to higher-ranked items (Lin, 2010; Dwork et al., 2001; DeConde et al., 2006). Depending on how the transition matrix is determined, there are several such approaches, which Lin (2010) labels MC1, MC2, and MC3. Readers are referred to these references for more details. These approaches are amenable to the analysis of ragged lists, but, without further pruning, the resulting consensus list is always an ordering of all items listed by at least one ranker.

A second approach, the Mallows model, posits that each ranked list differs from some unknown consensus ranking, i.e. a parameter vector to be estimated, according to a probabilistic model based upon a distance, e.g. Spearman’s ρ or Kendall’s τ, between the two lists (Mallows, 1957). Besides the consensus ranking, the other parameter to be estimated in a Mallows-type model is the dispersion φ ∈ [0,1], where φ = 0 means that all rankers exactly recapitulate the consensus ranking, and φ = 1 means that rankers choose items uniformly at random, without regard to the consensus ranking. Fligner and Verducci (1988) and Li et al. (2019) have both extended the Mallows model, allowing φ to depend upon the stage s, such that φ(s) can be small at earlier stages, where agreement is more important, and large at later stages, where agreement is usually less important.

A third approach, and the one we extend in this paper, is the multistage model, which explicitly formulates the list-generating process (Plackett, 1975; Luce, 1959). The ith ranker generates an ordered list of length v from among a pre-specified, fixed-length set of items, starting with his/her/its most-preferred item. Define $O_{is}$ to be the set of items yet to be ranked just before the sth stage:

$$O_{is} = \begin{cases} \{1, \ldots, v\}, & s = 1 \\ \{k : k \notin \{x_{is'}\}_{s' < s}\}, & s > 1, \end{cases} \quad (1)$$

and let $1[X]$ be 1 when the statement $X$ is true and 0 otherwise. The Plackett-Luce (PL) probability that item $k \in \{1,\ldots,v\}$ is ordered $s$th is $\Pr(x_{is} = k \mid O_{is}) = 1[k \in O_{is}]\exp(\theta_k)/\sum_{j \in O_{is}} \exp(\theta_j)$, i.e. proportional to $\exp(\theta_k)$ until it gets ordered, and zero afterwards. There are $v$ parameters, $\Theta = \{\theta_1, \theta_2, \ldots, \theta_v\}$. Of these, $v - 1$ are identified, and without loss of generality, we may assume that $\min_j\{\theta_j\} \equiv 0$. See Section 5.6 of Marden (1996) for an overview of classical multistage models. In contrast to the Mallows model, there is no explicit consensus list in this model; however, the set of numeric weights $\theta_k$ gives, both in an absolute and relative sense, the order of preference across all items.

Analogous to the extended Mallows model proposed by Li et al. (2019), Benter added a dampening effect to PL models to allow for the relative preference between items to depend on the stage (Benter et al., 2008; Gormley and Murphy, 2008). Let a dampening function δ(s) map the set of integers s ∈ {1,..., v − 1} to the interval (0,1], with δ(1) ≡ 1 for identifiability. When δ(s) is small, the distinction between items decreases, and so, assuming that preferences are always strongest at early stages, it is reasonable to constrain δ(s) to be non-increasing with s. At the final stage, δ(v) may take any value, since there is no choice remaining. When δ(·) is limited to the set of non-increasing functions, this dampening function serves analogously to the ψ parameter in the (LD)RBO measures and the φ-function in the extended Mallows models, namely to reflect that agreement at higher ranks is relatively more important than at lower ranks.

Thus, the Benter-Plackett-Luce (BPL) model for the probability of selecting item k at the sth stage conditional on the choices from the previous s − 1 stages is $\Pr(x_{is} = k \mid O_{is}) = 1[k \in O_{is}]\exp(\theta_k\delta(s))/\sum_{j \in O_{is}}\exp(\theta_j\delta(s))$, for $k = 1,\ldots,v$ and $s = 1,\ldots,l_i$. To be estimated are the v − 1 identified parameters in Θ plus the number of parameters in the chosen functional form of δ(·), which we discuss in Section 3.2 below. At stage s, the log-odds of ordering item $k_1$ over $k_2$, conditional on neither having yet been ordered, are $\delta(s)[\theta_{k_1} - \theta_{k_2}]$. We now propose two novel extensions based upon the BPL model – one to the model itself and one to its estimation – tailored to the objectives of the problem list analysis.
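To illustrate the generating process just described, the following R sketch (illustrative only; sample_bpl is our own hypothetical function, not the authors' implementation) draws one complete ordered list from a BPL model given item weights and a dampening function.

```r
# A sketch of the BPL generating process: at stage s, item k in O_is is chosen
# with probability proportional to exp(theta[k] * delta(s)).
sample_bpl <- function(theta, delta) {
  v <- length(theta)
  remaining <- seq_len(v)   # O_is: items not yet ordered
  ordering <- integer(v)
  for (s in seq_len(v)) {
    probs <- exp(delta(s) * theta[remaining])  # unnormalized; sample() rescales
    pick <- sample.int(length(remaining), size = 1, prob = probs)
    ordering[s] <- remaining[pick]
    remaining <- remaining[-pick]
  }
  ordering
}

# Three items with weights {0, log(2.9), log(1.1)} and no dampening:
set.seed(1)
sample_bpl(theta = c(0, log(2.9), log(1.1)), delta = function(s) 1)
```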

3.1. Ranker Fatigue

In some contexts, a ranker’s list is a purposefully incomplete ordering of a subset of all possible items. In our case study, physicians stopped listing problems upon having decided that the already-listed problems adequately described the case abstracts. It is sensible therefore to model not just the ordering process but also the terminating process. This contrasts with standard PL/BPL models, which assume that $l_i \equiv v$. Notationally, this can be indicated by artificially extending the length of each ragged list $x_i$ by one and filling in this additional item with 0, i.e. $x_{il_i} \equiv 0$; this is not an actual item but rather indicates the list’s termination.

Now the probability of selecting item $k = 0,\ldots,v$ at the sth stage, $s = 1,\ldots,l_i$, conditional on the previous s − 1 stages is written as

$$\Pr(x_{is} = k \mid O_{is}) = \frac{1[k \in O_{is}]\exp(\delta(s)\theta_k) + 1[k = 0 \cap s > 1]\exp(\theta_0)}{\sum_{j \in O_{is}} \exp(\delta(s)\theta_j) + 1[s > 1]\exp(\theta_0)}. \quad (2)$$

Like the standard BPL model, this assumes that there is a finite number, $v$, of items to be ranked; however, it is not assumed that all rankers will rank all items. Rather, a new parameter $\theta_0$ measures the ‘fatigue’ of ranker i beyond the first stage, and $\Theta = \{\theta_0, \theta_1, \ldots, \theta_v\}$ has length v + 1. The number of identified elements, not counting the dampening function δ(s), is one less than the length of Θ, and we set $\min_{j: j>0}\{\theta_j\} \equiv 0$ to identify the model. At stage s > 1, ranker i will stop ordering items with probability $\exp(\theta_0)/\big(\sum_{j \in O_{is}}\exp(\delta(s)\theta_j) + \exp(\theta_0)\big)$. The probability of stopping increases with s as well as with the total weight of the items previously ordered. Let β = {Θ, δ(·)} denote all parameters in the model. The log-likelihood of list $x_i$ is the logarithm of its joint density:

$$\log f_i(\beta) = \sum_{s=1}^{l_i} \log \Pr(x_{is} \mid O_{is}) = \theta_0 + \sum_{s=1}^{l_i - 1} \delta(s)\theta_{x_{is}} - \sum_{s=1}^{l_i} \log\left(\sum_{j \in O_{is}} \exp(\delta(s)\theta_j) + 1[s > 1]\exp(\theta_0)\right), \quad (3)$$

where $O_{is}$ is as defined in Equation (1). This is the model we will use in our simulation study and analysis of the problem list data. We discuss the choice of dampening function δ(·) in Section 3.2 and consider strategies for inducing sparsity in the fitted model in Section 3.3.
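For concreteness, a minimal R sketch of Equation (3) for a single ragged list is given below (illustrative only; the authors' released implementation is described in Section 3.4). It assumes the list has already been extended with the terminal item 0.

```r
# A sketch of equation (3) for one ragged list x whose last element is the
# terminal item 0; theta holds the v item weights and theta0 the fatigue.
loglik_bpl_fatigue <- function(x, theta, theta0, delta) {
  remaining <- seq_along(theta)  # O_is just before stage s
  ll <- 0
  for (s in seq_along(x)) {
    # Stopping (item 0) is only available after the first stage.
    denom <- sum(exp(delta(s) * theta[remaining])) + (s > 1) * exp(theta0)
    numer <- if (x[s] == 0) exp(theta0) else exp(delta(s) * theta[x[s]])
    ll <- ll + log(numer) - log(denom)
    remaining <- setdiff(remaining, x[s])
  }
  ll
}

# A ranker orders items 2 then 1 out of v = 3, then stops:
loglik_bpl_fatigue(x = c(2, 1, 0), theta = c(0, log(2.9), log(1.1)),
                   theta0 = 0.5, delta = function(s) 1)
```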

Remark 1

With the added fatigue parameter, the assumed minimum list length for any ranker is $l_i = 2$, corresponding to exactly one actual item ranked ($x_{i1} \in \{1,\ldots,v\}$) followed by a decision to stop ($x_{i2} = 0$). This is the rationale for having $1[k = 0 \cap s > 1]$ (as opposed to $1[k = 0]$) in the numerator of Equation (2), which gives that $\Pr(x_{i1} = 0 \mid O_{i1}) \equiv 0$ for any β.

Remark 2

A reviewer observed that in order to call equation (3) a regular likelihood, we must make an implicit conditional independence assumption, namely that at each stage, a ranker’s choice is conditionally independent of all previous choices given $O_{is}$. We do so here, similar to previous uses of the BPL model, e.g. Gormley and Murphy (2008).

3.2. Parameterization of δ(·)

In their choice of δ(·) for the analysis of Irish presidential poll data, Gormley and Murphy (2008) placed no restrictions on δ(·) apart from requiring 0 ≤ δ(s) ≤ 1 for all s, resulting in v − 2 parameters to be estimated. The context of our analysis suggests that, at a minimum, δ(·) should be non-increasing in its argument to reflect that strength of preference is non-increasing with stage, i.e. rank. For this reason, and also being cognizant of the statistical cost of estimating many additional parameters, we constructed a two-parameter dampening function: $\delta(s) = \delta_2\delta_1^{s-1} + (1-\delta_2)^{2s-1}$, with scalar parameters $\delta_1, \delta_2 \in [0,1]$. This family contains dampening functions ranging from constant strength of preference ($\delta_1 = \delta_2 = 1$), to strength of preference decreasing to a non-zero asymptote ($\delta_1 = 1$; $\delta_2 < 1$), to strength of preference decreasing to total lack of preference at lower ranks ($\delta_1 < 1$).
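A short R sketch (the function name dampen is ours) evaluates this family and illustrates its three regimes; note that δ(1) = δ2 + (1 − δ2) = 1 automatically, as required for identifiability.

```r
# The two-parameter family delta(s) = delta2 * delta1^(s-1) + (1 - delta2)^(2s-1).
dampen <- function(s, delta1, delta2) {
  delta2 * delta1^(s - 1) + (1 - delta2)^(2 * s - 1)
}

s <- 1:6
dampen(s, delta1 = 1.0, delta2 = 1.00)  # constant preference: all 1
dampen(s, delta1 = 1.0, delta2 = 0.62)  # decreases toward the asymptote 0.62
dampen(s, delta1 = 0.5, delta2 = 0.62)  # decreases toward 0
```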

3.3. Estimating a Consensus Ordering

A standard maximum likelihood estimate (MLE) approach for estimating β = {Θ, δ1, δ2} would calculate $\hat\beta_{\mathrm{MLE}} = \arg\max_\beta \sum_{i=1}^n \log f_i(\beta)$ subject to the constraint that $\min_{j: j>0}\{\hat\theta_j\} \equiv 0$, so as to identify the model. However, even with this constraint, some of the parameters will still only be weakly identified, e.g. those corresponding to items appearing on only one observation’s list, and their estimates will be close to zero. An ideal model estimation process would adaptively recognize these weakly identified parameters and set them all exactly equal to zero. Note that this is a different type of variable selection problem than is typical: setting $\hat\theta_k$ equal to zero in a BPL-type model does not remove the item from the fitted model but rather minimizes its relative weight. No item that has been ranked at least once can ever be removed entirely from the fitted model, i.e. by forcing $\hat\theta_k = -\infty$ or $\exp(\hat\theta_k) = 0$, without resulting in a zero-valued likelihood function. Rather, this variable selection problem is one of identifying the set of items whose corresponding parameter estimates should be smallest and co-equal. Keeping in mind the scientific objective of constructing a consensus ordered list, a natural definition is then the set of non-zero $\hat\theta_k$’s, sorted in decreasing order. If the data are disparate enough to suggest that rankers are effectively ordering items at random, then the consensus list may be small or even the empty set, i.e. no consensus.

A common technique for dimension reduction in a maximum likelihood framework is to subtract from the log-likelihood function a penalty function on the item weights, g(Θ, λ). For a given value of λ, we would then calculate the penalized MLE (PMLE), defined as $\hat\beta(\lambda) = \arg\max_\beta \{\sum_{i=1}^n \log f_i(\beta) - g(\beta, \lambda)\}$. Assuming the model is not to be penalized for estimating θ0, the simplest possible BPL model would be θk ≡ 0 and δ1 = δ2 = 1, and a LASSO-type penalty (Tibshirani, 1996) applied to a BPL model would take the form $g(\beta, \lambda) = \lambda(\sum_k \theta_k + |\log\delta_1| + |\log\delta_2|)$ (if each θk were not non-negative by design, we would need |θk| in place of θk). Relative to standard maximum likelihood estimation, this penalty would shrink each θk down towards zero and δ1 and δ2 up towards 1, more so for larger values of λ; some elements may be shrunk all the way to these limits. This latter characteristic makes the LASSO a variable selection penalty. As noted in the first paragraph of this section, variable selection is a crucial feature in our context, but it is less evident that shrinkage of the item weights is required or even desirable. Because each θk is relatively defined, if a parameter estimate $\hat\theta_k$ is set to zero, any larger parameter estimates will also need to be decreased in order to maintain the same implied probabilities. For example, consider a BPL model with three items, where the current parameter estimates are $\{\hat\theta_1, \hat\theta_2, \hat\theta_3\} = \{0, \log(2.9), \log(1.1)\}$. The estimated probability of selecting item 2 at stage 1 is 2.9/(1 + 2.9 + 1.1) = 0.58. If $\hat\theta_3$ is to be set to zero to reflect that items 1 and 3 are equally least important, then the corresponding estimate $\hat\theta_2$ must also be changed, to approximately log(2.76), in order to maintain this probability: 2.76/(1 + 2.76 + 1) ≈ 0.58. That is, in order to change $\hat\theta_3$ from log(1.1) to 0 while maintaining the relative importance of item 2, $\hat\theta_2$ must also be decreased. A LASSO-type penalty would induce additional shrinkage, beyond what was just described, and therefore may result in underfitting the model, i.e. not describing enough variability.

Variable selection without this additional shrinkage can be achieved with the L0 penalty $g(\beta, \lambda) = \lambda(\sum_{k=1}^v 1[\theta_k \neq 0] + 1[\delta_1 \neq 1] + 1[\delta_2 \neq 1])$. This penalizes the log-likelihood by an amount λ for each additional parameter estimate that takes on a “non-simple” value, but the actual estimate does not further affect the penalty, i.e. there is no shrinkage. A computationally driven modification is called for here because this L0 penalty is a multivariate discontinuous function and therefore numerically difficult to use within a penalized likelihood framework. Dicker et al. (2013) created a continuous version, called the seamless L0 penalty. Applied to our scenario, it is given by

$$g(\beta, \lambda, \tau) = \lambda \sum_{k=1}^{v} \log_2\left(\frac{\theta_k}{\theta_k + \tau} + 1\right) + \lambda \log_2\left(\frac{|\log \delta_1|}{|\log \delta_1| + \tau} + 1\right) + \lambda \log_2\left(\frac{|\log \delta_2|}{|\log \delta_2| + \tau} + 1\right), \quad (4)$$

where τ > 0 is an additional fixed constant. In contrast to the discrete-valued L0 penalty, which for each parameter equals either 0 (when θk = 0 or δj = 1) or λ (when θk > 0 or δj < 1), the seamless L0 penalty continuously transitions from 0 to λ, as illustrated in Figure 2. It becomes increasingly similar to the discontinuous L0 penalty as τ approaches 0.
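The following R sketch (names are ours) evaluates the penalty in Equation (4); a parameter at its "simple" value (θk = 0 or δj = 1) contributes nothing, while a parameter far from its simple value contributes approximately λ.

```r
# A sketch of the seamless L0 penalty in equation (4): each parameter's
# contribution rises smoothly from 0 to lambda, with sharpness governed by tau.
seamless_l0 <- function(theta, delta1, delta2, lambda, tau = 0.001) {
  piece <- function(x) log2(x / (x + tau) + 1)
  lambda * (sum(piece(theta)) +
              piece(abs(log(delta1))) + piece(abs(log(delta2))))
}

# theta = 0 and delta = 1 cost nothing; theta well away from 0 costs ~ lambda:
seamless_l0(theta = c(0, 2.5), delta1 = 1, delta2 = 0.62, lambda = 3)
```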

Figure 2: Seamless L0 penalty under the default choice of τ = 0.001 and different values of the penalty parameter (and asymptote) λ.

3.4. Computational Implementation

We describe here our computational approach for fitting penalized BPL models using seamless L0 penalties. All code was written in the R statistical environment (R Core Team, 2018; Wickham, 2017; Neuwirth, 2014; Li et al., 2018; Schimek et al., 2015) and is freely available via github (https://github.com/psboonstra/RankModeling). When g is an L0-type penalty, maximizing $\sum_{i=1}^n \log f_i(\beta) - g(\beta, \lambda)$ is a non-convex optimization problem that is both computationally difficult and admits the possibility of identifying local optima. These are the main challenges our algorithm must overcome.

As is typical in penalized estimation, we calculate the solution path for β under a grid of candidate values for λ. We apply a numerical coordinate ascent algorithm that iteratively cycles through all elements of β on a univariate basis, changing a given parameter estimate from its current value if doing so increases the penalized log-likelihood. After satisfying a specified convergence criterion for the estimate of β at the smallest value of λ, we use this estimate as a warm start for the next-largest value of λ in the grid, and so forth. The algorithm returns the entire solution path for β as a function of λ.

In more detail, suppose the current estimated value of β = {Θ, δ1, δ2} at iteration m of the algorithm is denoted by $\hat\Theta^{(m)} = \{\hat\theta_0^{(m)}, \hat\theta_1^{(m)}, \ldots, \hat\theta_v^{(m)}\}$, $\hat\delta_1^{(m)}$, and $\hat\delta_2^{(m)}$. Given these values and λ, we calculate the penalized log-likelihood values when incrementing one parameter estimate by each value in a proposal sequence $\Gamma = \{\gamma_{-t}, \gamma_{-t+1}, \ldots, \gamma_{-1}, \gamma_0 \equiv 0, \gamma_1, \ldots, \gamma_{t-1}, \gamma_t\}$, where $\gamma_{-j} = -\gamma_j$ for all $j = 1,\ldots,t$. The inclusion of $\gamma_0 \equiv 0$ means that one proposal is to not change any values. If the parameter to be updated corresponds to an item, i.e. one of $\theta_1, \ldots, \theta_v$, then $\gamma_{\tilde 0} = -\hat\theta_j^{(m)}$ is also added to Γ, so that every iteration includes proposing to set each item’s parameter estimate to zero. Any proposals that would violate identifiability or model constraints, i.e. $\theta_k < 0$ or $\delta_1, \delta_2 \notin [0,1]$, are truncated at the boundary of the constraint.

This step results in up to 2t + 2 penalized log-likelihood calculations, and we identify $t_{\max} \in \{-t, -t+1, \ldots, -1, 0, \tilde 0, 1, \ldots, t-1, t\}$, the index of Γ yielding the largest penalized log-likelihood (note that $\tilde 0$ does not exist when updating θ0, δ1, or δ2). We then set $\hat\theta_k^{(m+1)} \leftarrow \hat\theta_k^{(m)} + \gamma_{t_{\max}}$ (or $\hat\delta_j^{(m+1)} \leftarrow \hat\delta_j^{(m)} + \gamma_{t_{\max}}$) and repeat the step for another parameter. Each cycle consists of proceeding through a random permutation of all elements of $\hat\Theta^{(m)}$, $\hat\delta_1^{(m)}$, and $\hat\delta_2^{(m)}$, and the process repeats until a certain minimum number of consecutive cycles changes all parameter estimates by less than some convergence criterion. We discuss the choice of Γ and all other required inputs at the end of this section.

The relative relationship between the parameters warrants also considering multivariate proposals, to speed convergence and discourage the algorithm from getting stuck in local optima. We incorporated such proposals in our implementation. One proposal adds a single negative constant randomly taken from $\{\gamma_{-t}, \gamma_{-t+1}, \ldots, \gamma_{-1}\}$ to all $\hat\theta_k$’s, shifting them towards, but never below, zero. A second multivariate proposal adds a single positive constant randomly taken from $\{\gamma_1, \gamma_2, \ldots, \gamma_t\}$ to all non-zero $\hat\theta_k$’s as well as to one randomly selected zero-valued $\hat\theta_k$, if there is more than one such zero-valued $\hat\theta_k$. A third proposal considers the current estimated item weights $\hat\Theta^{(m)}$ in increasing order and, with probability 1/16 = 0.0625, exchanges the index of each pair of neighboring parameter estimates. Note that this proposal swaps parameter estimates based on their values, not their labels. For example, if $\{\hat\theta_1^{(m)}, \hat\theta_2^{(m)}, \hat\theta_3^{(m)}\} = \{0, \log(2.9), \log(1.1)\}$, the proposal would swap $\hat\theta_1^{(m)}$ and $\hat\theta_3^{(m)}$, i.e. $\{\log(1.1), \log(2.9), 0\}$. This probability of swapping is a tuning parameter, the specific value of which was arbitrarily selected. We also considered a fourth multivariate proposal for the dampening function when $\hat\delta_2^{(m)} < 1$. The proposal is $\tilde\delta_1 = \hat\delta_2^{(m)}\hat\delta_1^{(m)} + (1 - \hat\delta_2^{(m)})^3$ and $\tilde\delta_2 = 1$. The rationale is that the proposed dampening function is identical to the current dampening function at the first two (and most important) stages, but with a less complex formulation, since $\tilde\delta_2 = 1$. Each of these proposals is ad hoc and, except for the last, stochastically made, but we emphasize that they are only ever accepted if doing so results in a larger penalized log-likelihood.

3.5. Default values

Our algorithm for approximating the maximized penalized log-likelihood requires choosing input values, most important being the grid of values of λ, the constant τ in equation (4), the proposal sequence Γ, and the convergence criterion ϵ. In our analyses, we used the default values described below, so that at a minimum a user need only provide the data, comprising a set of ordered lists.

The default choice of convergence criterion is $\epsilon_{\mathrm{def}} = 0.001$, which also means that the number of significant digits retained by the algorithm, generally equal to $[\log_{10}(1/\epsilon)]$, is by default 3. For the default proposal sequence Γ, used in both the univariate and multivariate proposals, the algorithm calculates the evenly spaced sequence of t values between log(ϵ) and 0 and exponentiates it, setting the positive half, $\gamma_{1,\mathrm{def}}, \gamma_{2,\mathrm{def}}, \ldots, \gamma_{t,\mathrm{def}}$, equal to the result (with the lower half being the symmetric values $-\gamma_{t,\mathrm{def}}, \ldots, -\gamma_{1,\mathrm{def}}$). The default choice of t, when not provided, is $t_{\mathrm{def}} = [\log_{10}(1/\epsilon)]$, yielding $\Gamma_{\mathrm{def}} = \{-1, -0.032, -0.001, 0, 0.001, 0.032, 1\}$. The default choice of τ is $\tau_{\mathrm{def}} = \epsilon$. Finally, a default grid of λ values is calculated with an initial run of the algorithm that identifies the smallest λ yielding the most parsimonious possible model, say $\lambda_{\max}$, and then calculates 200 evenly spaced values (on the log scale) between $10^{-5}\lambda_{\max}$ and $\lambda_{\max}$.
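These defaults can be reproduced in a few lines of R (a sketch under the assumptions just stated, using natural-log spacing):

```r
# A sketch of the default proposal sequence: t log-evenly spaced positive
# values between epsilon and 1, mirrored about zero.
epsilon <- 0.001
t <- round(log10(1 / epsilon))                    # t_def = 3
pos_half <- exp(seq(log(epsilon), 0, length.out = t))
Gamma <- c(-rev(pos_half), 0, pos_half)
round(Gamma, 3)  # -1.000 -0.032 -0.001  0.000  0.001  0.032  1.000
```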

Our implementation also allows the user to specify multiple sets of initial parameter values, $\beta^{(0)}$, or to request multiple randomly generated sets of initial values. The algorithm is independently run for each set of initial values, and the result of each separate run is reported. This allows for a straightforward assessment of the impact of starting values on the final converged parameter estimates. We used three sets of initial values in both our simulation study and our data analyses.

3.6. Model Selection

We consider two information criteria to select λ > 0. The small-sample Akaike Information Criterion (AIC, Akaike, 1973; Hurvich and Tsai, 1989) and the Bayesian Information Criterion (BIC, Schwarz, 1978) both take the form of a “model fit + model complexity” tradeoff. Letting $\tilde p_\lambda = 1 + \sum_{k=1}^v 1[\hat\theta_k \neq 0] + 1[\hat\delta_1 \neq 1] + 1[\hat\delta_2 \neq 1]$ denote the number of parameters in a fitted model under a given λ (the constant 1 is for θ0) and $\hat\beta_\lambda$ denote all BPL parameter estimates under a given λ, both criteria are given by $-2\sum_{i=1}^n \log f_i(\hat\beta_\lambda) + 2h(\tilde p_\lambda)$, where $h(\tilde p_\lambda) = \tilde p_\lambda n / (n - \tilde p_\lambda - 1)_+$ for the small-sample AIC, with $(\cdot)_+$ the positive-part function, and $h(\tilde p_\lambda) = \log(n)\tilde p_\lambda / 2$ for the BIC.
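Both criteria are simple functions of the fitted log-likelihood, p̃λ, and n; the following R sketch (the function name ic_bpl is ours) encodes the two complexity terms.

```r
# A sketch of the two criteria: each is -2 * loglik + 2 * h(p), where p is the
# number of non-simple parameters and n the number of rankers. A zero
# denominator (n <= p + 1) yields an infinite small-sample AIC, as intended
# by the positive-part function.
ic_bpl <- function(loglik, p, n, type = c("aicc", "bic")) {
  h <- switch(match.arg(type),
              aicc = p * n / max(n - p - 1, 0),
              bic  = log(n) * p / 2)
  -2 * loglik + 2 * h
}

ic_bpl(loglik = -250, p = 12, n = 32, type = "aicc")
ic_bpl(loglik = -250, p = 12, n = 32, type = "bic")
```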

4. Simulation study

To evaluate the finite sample performance of the penalized BPL models, including our proposed estimation procedure, we conducted a simulation study. Our two main objectives were to (i) compare the ranking performance of our penalized BPL models against possible comparators and (ii) evaluate the ability of our penalized BPL models to estimate the true, unknown item weights and distinguish between non-trivial and trivial items. These are distinct objectives because some rank aggregation methods only result in a fully ranked list of all items, whereas we are additionally interested in demonstrating that our penalized BPL model results in better estimation of the underlying weights than the unpenalized counterpart and that unimportant items can be identified as such.

We generated 1000 simulated datasets for each of 36 scenarios: 12 models described in Table 1 and three dataset sizes (30, 100, or 500 rankers). Collectively, these scenarios cover a range of characteristics: number of items (v), number of rankers (n), degree of raggedness (θ0), and the typical size and variation of item weights (θks). For each simulated dataset and each scenario, we fit an unpenalized BPL model (λ = 0) and penalized BPL models using AIC and BIC. For comparison, we evaluated our previous pairwise-LDRBO maximization (Krauss et al., 2016); the sample arithmetic mean (‘AMean’), geometric mean (‘GMean’), and median of the corresponding ranked lists; MC1, MC2, and MC3 (Lin, 2010; Schimek et al., 2015); the CEMC algorithms that identify the aggregated list that is, on average, closest to all observed lists (Lin and Ding, 2009; Schimek et al., 2015) using Spearman’s ρ (labeled CEMCρ) or Kendall’s τ (CEMCτ); and both the standard Mallows model (MM) and extended Mallows model (EMM) as implemented in Li et al. (2019). For calculating the sample means and medians, we assumed that each unranked item had rank equal to the average of the unused ranks. So, for example, when a list ranks seven items out of a possible v = 15, then the eight unranked items each get assigned a rank of 11.5. This filling-in of missing ranks was only used for calculation of the sample mean and median ranks.

Table 1:

Description of twelve true BPL models for generating ragged, ordered lists used in the simulation study. The last column gives the first, second, and third quartiles from the distribution of actual list length under these models.

Label θ0 $\{\theta_k : \theta_k > 0, k > 0\}$ $\#\{\theta_k : \theta_k = 0, k > 0\}$ $\#\{\theta_k : k > 0\}$ $l_i$: {Q1, Q2, Q3}
1 −1.0 $\{0.09 - 0.01(k-1)\}_{k=1}^{5}$ 5 10 {6, 9, 9}
2 0.5 $\{0.09 - 0.01(k-1)\}_{k=1}^{5}$ 5 10 {2, 4, 6}
3 −1.0 $\{1.5 - 0.1(k-1)\}_{k=1}^{9}$ 1 10 {8, 9, 9}
4 0.5 $\{1.5 - 0.1(k-1)\}_{k=1}^{9}$ 1 10 {3, 6, 8}
5 −1.0 $\{1.5\}_{k=1}^{5}$ 5 10 {7, 9, 9}
6 0.5 $\{1.5\}_{k=1}^{5}$ 5 10 {3, 5, 8}
7 −1.0 $\{0.09 - 0.01(k-1)\}_{k=1}^{9}$ 11 20 {11, 17, 19}
8 0.5 $\{0.09 - 0.01(k-1)\}_{k=1}^{9}$ 11 20 {4, 7, 12}
9 −1.0 $\{1.5 - 0.1(k-1)\}_{k=1}^{15}$ 5 20 {13, 18, 19}
10 0.5 $\{1.5 - 0.1(k-1)\}_{k=1}^{15}$ 5 20 {5, 10, 14}
11 −1.0 $\{1.5\}_{k=1}^{8}$ 12 20 {13, 18, 19}
12 0.5 $\{1.5\}_{k=1}^{8}$ 12 20 {5, 9, 14}

A challenge in comparing multiple different ranking approaches is that each returns a qualitatively different entity. That is, the BPL models estimate a set of item weights, the MC models estimate a stationary probability distribution taking values in the unit simplex, the means and median summarize the ranks, and the remaining methods – LDRBO, CEMCρ, CEMCτ, MM, and EMM – give an integer-valued ordering from most to least preferred. To compare these disparate results, we calculated a metric that we call the ‘ordered root-mean-squared error (RMSE)’. Let $\theta_{(k)}$, $k = 1,\ldots,v$, denote the true value of the kth largest item weight, and let $\theta_{(\hat k)}$ denote the true value of the item weight of the item that the method ranks kth. Then, the ordered RMSE is defined by $\sqrt{\sum_{k=1}^v (\theta_{(k)} - \theta_{(\hat k)})^2 / v}$. The true values of the item weights are always used, but possibly out of order, and methods underperform to the extent that they mis-rank items with substantially different true weights. The advantage of this metric is that it is applicable across all ranking approaches. The disadvantage is that it is not equipped to handle tied ranks, which can occur in the penalized BPL results and in the sample means or medians of the ranks. For the penalized BPL models, we break ties based upon the ordering of items in the solution path at λ = 0. For the sample means and medians, we break ties based upon the frequency with which an item was ranked first; remaining ties were randomly resolved. The smallest possible ordered RMSE is 0, when all items have been ordered according to their true item weights.
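Given a method's ranking and the true item weights, the ordered RMSE is straightforward to compute, as in the following R sketch (our own illustrative code, ignoring the tie-breaking rules just described):

```r
# A sketch of the ordered RMSE: `ranking` is a method's permutation of the item
# labels from most to least preferred, and `theta` the true item weights.
ordered_rmse <- function(ranking, theta) {
  sqrt(mean((sort(theta, decreasing = TRUE) - theta[ranking])^2))
}

theta <- c(1.5, 1.0, 0.5, 0)
ordered_rmse(ranking = c(1, 2, 3, 4), theta)  # 0: perfectly ordered
ordered_rmse(ranking = c(2, 1, 3, 4), theta)  # penalizes the swapped pair
```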

As a subsequent finite-sample assessment focusing exclusively on the BPL models, we calculated additional quantitative measures of the ability of these fitted models to estimate the unknown item weights, which none of the comparator methods do. Specifically, let $\hat\theta_k$ be the estimate of the kth unknown item weight $\theta_k$ (note that these are no longer the order statistics). Then, we calculated the standard RMSE, defined as $\sqrt{\sum_{k=1}^v (\hat\theta_k - \theta_k)^2 / v}$. This differs from the ordered RMSE described in the previous paragraph in that the ordered RMSE does not require a method to estimate the item weights themselves, whereas the standard RMSE does. We also calculated the true positive rate (TPR), defined as $\big(\sum_{k=1}^v 1[\hat\theta_k > 0] \times 1[\theta_k > 0]\big) / \sum_{k=1}^v 1[\theta_k > 0]$, and the true negative rate (TNR), defined as $\big(\sum_{k=1}^v 1[\hat\theta_k = 0] \times 1[\theta_k = 0]\big) / \sum_{k=1}^v 1[\theta_k = 0]$. Finally, we calculated Youden’s index, defined as TPR + TNR − 1, and the running time for each method.
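These metrics likewise admit short R sketches (again our own illustrative code, with theta_hat and theta denoting the estimated and true weight vectors in item-label order):

```r
# Sketches of the estimation metrics for the BPL models.
rmse   <- function(theta_hat, theta) sqrt(mean((theta_hat - theta)^2))
tpr    <- function(theta_hat, theta) sum(theta_hat > 0 & theta > 0) / sum(theta > 0)
tnr    <- function(theta_hat, theta) sum(theta_hat == 0 & theta == 0) / sum(theta == 0)
youden <- function(theta_hat, theta) tpr(theta_hat, theta) + tnr(theta_hat, theta) - 1
```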

4.1. Results

Table 2 gives the average values of the ordered RMSE multiplied by 1000 and then rounded to the nearest integer for readability. All values within 5% of each rowwise minimum are in bold. The penalized BPL models, labeled ‘λAIC’ and ‘λBIC’, are not optimal in all scenarios but generally compare favorably to the remaining methods. The best overall method, AMean, was bold in all rows, followed closely by MC3. The worst method was MC1, which was bold in just two rows.

Table 2:

Mean values of the ordered RMSE (multiplied by 1000 and then rounded to the nearest integer) for the unpenalized and penalized versions of the proposed BPL model against eleven comparators across 36 scenarios (12 generating models from Table 1 × three sample size configurations). All values within 5% of each rowwise minimum are in bold. For λAIC and λBIC, the tied zero-valued estimated items weights are resolved based upon the λ = 0 value in the solution path.

Label n BPL(λ=0) BPL(λAIC) BPL(λBIC) LDRBO AMean-Rank GMean-Rank Median-Rank MC1 MC2 MC3 CEMCρ CEMCτ EMM MM
1 30 48 48 48 48 48 48 48 51 48 48 48 48 48 48
1 100 45 45 46 47 45 45 46 51 46 45 46 45 46 45
1 500 39 40 40 43 39 40 43 51 49 39 40 40 39 40
2 30 49 49 49 48 48 48 49 51 49 48 49 48 48 48
2 100 45 46 47 47 46 46 47 51 51 46 47 46 46 46
2 500 40 42 43 43 40 40 43 51 51 40 43 40 41 41
3 30 242 247 247 347 232 244 256 571 247 233 287 249 267 246
3 100 130 131 133 267 130 137 155 571 147 130 179 145 149 141
3 500 56 55 55 145 55 58 86 571 74 56 75 65 60 64
4 30 268 274 272 333 250 265 278 569 305 252 359 268 291 265
4 100 147 148 154 282 145 150 169 572 209 145 247 154 167 155
4 500 62 62 63 166 63 65 89 571 132 63 107 70 68 69
5 30 15 19 18 275 13 18 34 1041 22 14 44 24 53 25
5 100 0 0 0 38 0 0 0 1043 0 0 0 0 0 0
5 500 1 0 0 0 0 0 0 1044 0 0 0 0 0 0
6 30 38 31 32 324 25 40 60 1037 78 30 135 50 87 44
6 100 4 0 0 58 0 0 0 1043 0 0 3 0 0 0
6 500 4 0 0 0 0 0 0 1044 0 0 0 0 0 0
7 30 41 42 42 41 41 41 41 43 41 41 41 41 41 41
7 100 40 40 40 40 40 40 40 43 40 40 40 40 40 40
7 500 37 37 38 38 36 37 38 43 42 36 37 37 37 37
8 30 41 41 41 41 41 41 41 43 42 41 41 41 41 41
8 100 41 41 41 41 40 40 41 43 43 40 41 40 40 40
8 500 38 39 39 38 37 37 38 43 43 37 38 37 37 37
9 30 333 338 335 423 320 325 352 716 327 320 366 333 345 330
9 100 192 193 193 306 192 196 218 716 201 192 234 207 213 201
9 500 84 84 85 182 86 87 107 716 107 85 116 98 102 94
10 30 357 368 366 420 337 338 367 714 392 337 425 347 376 356
10 100 237 210 220 325 204 207 232 716 337 204 290 216 234 220
10 500 149 95 100 199 93 94 120 716 330 92 154 104 113 101
11 30 114 117 112 426 105 110 185 1031 121 105 218 157 208 150
11 100 0 0 0 95 0 0 1 1032 0 0 1 3 4 0
11 500 0 0 0 0 0 0 0 1032 0 0 0 0 0 0
12 30 198 197 179 486 147 166 231 1024 249 148 376 197 277 203
12 100 43 0 1 177 0 1 3 1032 42 0 28 4 8 1
12 500 40 0 0 1 0 0 0 1031 4 0 0 0 0 0

The ordered RMSE metric does not directly characterize how well the item weights were estimated. To that end, the results in Table 3 make a direct comparison of the unpenalized and penalized BPL models in their ability to estimate the item weights. In contrast to Table 2, one or both penalized versions are nearly always preferred with regard to the standard RMSE metric. The penalized BPL models typically have a better TNR, whereas the unpenalized BPL model has a better TPR. On balance, the discriminatory ability of the penalized BPL models are better, as evidenced by higher values of Youden’s index. Finally, the penalized BPL models require about twice as much running time.

Table 3:

Five operating characteristics comparing the unpenalized and penalized versions of the BPL model across 36 scenarios (12 generating models from Table 1 × three sample size configurations). RMSE, TPR, TNR, and Youden are all multiplied by 100 and rounded to the nearest integer. Each value of an operating characteristic that is within 5% of the better of the two is in bold.

Label n RMSE×100 (λ=0, λAIC, λBIC) TPR×100 (λ=0, λAIC, λBIC) TNR×100 (λ=0, λAIC, λBIC) Youden×100 (λ=0, λAIC, λBIC) Run Time sec. (λ=0, λAIC, λBIC)
1 30 1267 246 205 91 13 9 14 89 92 5 2 1 4 20 20
1 100 459 137 91 87 20 10 20 88 94 7 8 4 5 26 26
1 500 156 76 58 86 29 8 27 89 97 13 18 5 10 72 72
2 30 1184 214 167 87 12 9 15 91 94 2 3 2 3 17 17
2 100 492 147 100 82 19 10 24 87 92 6 7 2 3 22 22
2 500 212 81 63 79 25 10 35 89 95 13 15 5 5 61 61
3 30 880 970 1000 98 36 32 82 100 100 80 36 31 5 20 20
3 100 358 390 594 100 93 69 99 100 100 99 93 69 7 26 26
3 500 157 139 138 100 100 100 100 100 100 100 100 100 24 75 75
4 30 907 941 972 96 30 26 75 100 100 71 30 26 4 18 18
4 100 349 412 673 99 89 57 98 100 100 97 89 57 6 25 25
4 500 155 143 144 100 100 100 100 100 100 100 100 100 18 60 60
5 30 725 375 392 100 99 99 23 97 96 23 96 94 4 21 21
5 100 356 198 189 100 100 100 22 92 98 22 92 98 7 31 31
5 500 156 99 88 100 100 100 22 89 100 22 89 100 20 89 89
6 30 823 431 442 100 96 96 25 96 95 25 92 91 4 19 19
6 100 412 225 220 100 100 100 25 91 98 25 91 98 5 27 27
6 500 185 103 89 100 100 100 26 90 99 26 90 99 13 75 75
7 30 4283 1137 1045 94 17 11 7 85 90 1 1 1 16 108 108
7 100 997 240 112 95 21 8 7 84 94 2 5 2 24 127 127
7 500 326 115 57 91 26 7 14 83 95 5 9 2 60 336 336
8 30 3094 433 300 91 14 7 11 88 93 2 2 1 13 96 96
8 100 948 214 116 91 17 7 11 86 93 2 2 1 12 74 74
8 500 550 85 72 85 12 7 20 91 94 5 3 2 17 183 183
9 30 1595 803 839 97 33 33 17 98 98 14 31 31 17 94 94
9 100 471 284 346 98 70 56 20 97 100 18 66 55 27 137 137
9 500 190 125 150 99 91 80 23 92 100 22 82 80 94 479 479
10 30 1597 761 720 96 32 29 20 97 98 16 29 27 14 82 82
10 100 719 442 464 96 77 61 30 79 91 26 56 52 11 76 76
10 500 520 290 270 96 95 87 41 61 81 36 55 68 19 217 217
11 30 1238 446 450 100 92 96 12 97 93 12 90 89 16 86 86
11 100 591 244 205 100 100 100 14 89 97 14 89 97 21 105 105
11 500 352 195 122 100 100 100 24 66 98 24 66 98 46 274 274
12 30 1356 616 599 100 87 91 17 96 92 17 82 82 12 72 72
12 100 827 380 322 100 100 100 26 80 91 26 80 91 9 73 73
12 500 688 324 248 100 100 100 33 57 86 33 57 86 15 195 195

5. Data analysis: Problem lists

We now analyze the motivating problem list data. Tables 4–6 give the parameter estimates for all models for cases A–C, respectively. The BIC-based results are given for comparison, and we focus on the AIC-based fitted models. Figures S1–S3 in the Supplement give the full solution paths from our algorithm, with the AIC and BIC solutions noted. Tables 4–6 also include the consensus problem list from Krauss et al. (2016) and the other comparator methods evaluated in the simulation study. To alternatively characterize the extent of physician consensus, Figure 3 plots the probability of the most preferred item at each stage according to the AIC-estimated BPL model fit, conditional on all prior stages having also selected the most preferred item. Each such modal list continues until the item “0” is selected.

Table 4:

Parameter estimates from fitted penalized Benter-Plackett-Luce (BPL) models applied to case A data, ordered by the estimated values of θk using λ = λAIC. The final row, p̃λ, gives the number of non-zero parameters in the estimated model. For comparison, the remaining columns give the ranked list of problems according to the listed alternative algorithm or model; comparator entries in bold indicate ranks that were either (i) discrepant with λAIC by not more than 1 position or (ii) larger than the total number of non-zero estimated weights according to λAIC and which λAIC estimated to be zero. The parenthetical numbers in the AMean, GMean, and Median rank columns correspond to the summary rank of that problem, and the parenthetical numbers in the MC columns correspond to the estimated transition probability for the stationary distribution (×100).

Problem / Parameter BPL(λ=0) BPL(λAIC) BPL(λBIC) LDRBO AMean-Rank GMean-Rank Median-Rank MC1 MC2 MC3 CEMCρ CEMCτ EMM MM
θ0 3.95 2.57 2.78
PNEUMONIA 9.54 6.86 7.32 1 1(3.3) 1(1.6) 1(1.0) 1(5.6) 1(19.8) 1(15.2) 1 1 1 1
DIABETES MELLITUS 7.94 5.31 5.74 2 2(5.5) 2(4.4) 3(5.0) 5(3.9) 3(10.5) 3(8.8) 3 3 4 3
ANEMIA 7.93 5.30 5.74 3 3(6.5) 3(4.6) 2(4.0) 2(4.5) 2(14.1) 2(9.1) 2 2 2 2
DEPRESSION WITH ANXIETY 7.15 4.55 4.98 5 4(8.2) 4(6.9) 4(7.0) 9(3.7) 4(7.8) 4(6.3) 7 4 5 5
OSTEOARTHRITIS 6.64 4.07 4.49 6 5(10.1) 6(9.1) 5(8.0) 12(3.6) 5(5.8) 6(4.9) 8 6 6 8
SPLENOMEGALY 6.42 3.79 4.21 4 6(10.8) 5(8.0) 7(12.5) 3(4.1) 7(3.9) 5(5.5) 4 5 3 4
POST MENOPAUSAL ON HRT 6.25 3.68 4.09 7(11.4) 9(9.8) 6(9.5) 15(3.5) 6(4.7) 9(4.4) 13 9 8 9
RENAL FAILURE 5.90 3.29 3.70 7 8(12.3) 8(9.6) 8(16.8) 7(3.8) 8(2.0) 8(4.5) 5 7 9 7
SYSTOLIC MURMUR 5.84 3.22 3.64 9(12.4) 7(9.3) 10(17.0) 4(3.9) 9(2.0) 7(4.5) 6 8 7 6
HISTORY OF SMOKING 4.76 2.26 2.65 10(15.1) 11(13.9) 9(17.0) 16(3.4) 21(1.5) 11(2.7) 25 10 10 10
CHEST PAIN 4.17 0.00 2.00 11(15.7) 10(13.1) 12(18.0) 8(3.8) 14(1.6) 10(2.7) 15 12 12 11
LOWER EXTREMITY EDEMA 4.10 0.00 2.00 12(15.8) 12(14.4) 11(17.8) 14(3.5) 20(1.5) 12(2.5) 9 11 11 12
IRON DEFICIENCY 3.37 0.00 0.00 13(16.6) 15(15.7) 13(18.0) 22(3.3) 15(1.6) 14(2.1) 11 14 13 14
HYPOXEMIA 3.04 0.00 0.00 14(16.7) 13(14.9) 17(18.2) 6(3.8) 17(1.6) 13(2.2) 24 13 20 13
HYPOPHOSPHATEMIA 2.87 0.00 0.00 16(17.0) 19(16.4) 15(18.0) 23(3.3) 11(1.6) 17(1.9) 10 16 16 16
THROMBOCYTOPENIA 2.83 0.00 0.00 15(17.0) 18(16.4) 14(18.0) 21(3.3) 24(1.5) 16(2.0) 16 15 15 15
SHORT OF BREATH 2.68 0.00 0.00 17(17.2) 14(15.5) 18(18.5) 10(3.6) 19(1.6) 15(2.0) 27 17 19 18
HYPOXIA 2.36 0.00 0.00 8 18(17.2) 16(15.8) 21(18.5) 19(3.4) 12(1.6) 19(1.9) 28 18 23 17
PANCYTOPENIA 2.30 0.00 0.00 19(17.3) 17(15.9) 23(18.5) 11(3.6) 13(1.6) 18(1.9) 19 20 24 19
HYPOALBUMINEMIA 2.14 0.00 0.00 20(17.4) 20(16.9) 16(18.2) 25(3.2) 28(1.5) 20(1.8) 20 19 14 20
FEVER 1.15 0.00 0.00 21(17.7) 21(16.9) 19(18.5) 13(3.5) 22(1.5) 21(1.7) 14 22 21 21
CONGESTIVE HEART FAILURE 1.10 0.00 0.00 23(17.8) 22(17.0) 27(18.5) 18(3.4) 10(1.6) 23(1.7) 22 27 27 24
PULMONARY EDEMA 1.07 0.00 0.00 25(17.8) 23(17.0) 25(18.5) 17(3.4) 18(1.6) 24(1.6) 17 25 28 26
PULMONARY EMBOLISM 0.95 0.00 0.00 22(17.7) 24(17.3) 24(18.5) 20(3.3) 16(1.6) 22(1.7) 23 21 25 22
TACHYCARDIA 0.90 0.00 0.00 24(17.8) 26(17.5) 28(18.5) 27(3.0) 23(1.5) 25(1.6) 12 28 22 23
DEPRESSION 0.86 0.00 0.00 26(17.9) 25(17.5) 20(18.5) 24(3.2) 26(1.5) 26(1.6) 21 23 17 27
HIGH HAPTOGLOBIN 0.86 0.00 0.00 27(17.9) 28(17.7) 22(18.5) 28(2.7) 25(1.5) 28(1.6) 18 24 26 25
ANXIETY 0.00 0.00 0.00 28(17.9) 27(17.6) 26(18.5) 26(3.0) 27(1.5) 27(1.6) 26 26 18 28
δ1 0.99 1.00 1.00
δ2 0.61 0.62 0.61

λ 0 3.38 1.19
p̃λ 30 12 14

Table 6:

Parameter estimates from fitted penalized Benter-Plackett-Luce (BPL) models applied to case C data, ordered by the estimated values of θk using λ = λAIC. The final row, p̃λ, gives the number of non-zero parameters in the estimated model. For comparison, the remaining columns give the ranked list of problems according to the listed alternative algorithm or model; comparator entries in bold indicate ranks that were either (i) discrepant with λAIC by not more than 1 position or (ii) larger than the total number of non-zero estimated weights according to λAIC and which λAIC estimated to be zero. The parenthetical numbers in the AMean, GMean, and Median rank columns correspond to the summary rank of that problem, and the parenthetical numbers in the MC columns correspond to the estimated transition probability for the stationary distribution (×100).

Problem / Parameter BPL(λ=0) BPL(λAIC) BPL(λBIC) LDRBO AMean-Rank GMean-Rank Median-Rank MC1 MC2 MC3 CEMCρ CEMCτ EMM MM
θ0 3.47 3.28 3.45
PERICARDIAL EFFUSION 7.46 7.24 7.41 1 1(1.8) 1(1.3) 1(1.0) 1(7.7) 1(18.7) 1(16.8) 1 1 1 1
UTI 5.38 5.17 5.34 2 2(5.8) 2(4.1) 2(4.0) 2(4.3) 2(13.6) 2(9.1) 2 2 2 2
ANEMIA 4.99 4.78 4.95 3 3(6.5) 3(5.2) 3(5.0) 3(3.9) 3(10.3) 3(7.8) 3 3 3 3
ELEVATED LFT’S 4.57 4.36 4.53 4 5(8.4) 5(6.6) 4(5.0) 7(3.5) 4(8.1) 5(6.3) 4 4 4 4
HYPERTENSION 4.54 4.33 4.50 6 4(8.1) 4(6.2) 5(6.0) 6(3.6) 5(6.5) 4(6.5) 5 5 5 5
R EYE BLIND 3.56 3.34 3.52 7 6(12.6) 7(10.6) 6(11.0) 10(3.3) 6(4.1) 7(3.9) 9 6 7 6
SYSTOLIC MURMUR 3.32 3.10 3.28 7(13.3) 6(10.4) 7(17.5) 12(3.3) 10(1.7) 6(4.0) 6 7 6 7
HISTORY OF SMOKING 3.00 2.79 2.97 5 8(14.8) 11(13.4) 8(18.0) 9(3.4) 12(1.7) 10(3.0) 13 8 8 10
FEVER AND NIGHT SWEATS 2.74 2.52 2.70 9(15.2) 8(12.4) 14(18.5) 16(3.2) 8(1.7) 8(3.2) 10 11 12 12
PLEURAL EFFUSION 2.56 2.34 2.52 10(15.6) 12(13.5) 13(18.5) 15(3.2) 13(1.7) 12(2.9) 11 10 9 9
SHORTNESS OF BREATH 2.52 2.30 2.48 11(15.7) 9(12.6) 9(18.5) 4(3.8) 14(1.7) 9(3.0) 7 9 10 8
CHEST PAIN 2.41 2.19 2.38 12(15.8) 10(13.2) 12(18.5) 5(3.6) 7(1.7) 11(3.0) 8 12 11 11
DIASTOLIC MURMUR 2.23 2.02 2.20 13(16.5) 13(15.0) 10(18.5) 17(3.2) 9(1.7) 13(2.4) 26 14 13 13
HISTORY OF TAH/BSO 1.69 0.00 1.66 14(17.5) 14(16.8) 11(18.5) 21(3.0) 11(1.7) 14(2.0) 16 13 14 14
AORTIC DISSECTION 0.19 0.00 0.00 15(18.5) 15(17.3) 15(18.8) 8(3.4) 16(1.6) 15(1.7) 30 15 19 16
MYOCARDIAL INFARCTION 0.07 0.00 0.00 18(18.5) 24(18.2) 16(18.8) 24(2.8) 21(1.6) 22(1.6) 12 22 16 15
CARDIOMYOPATHY 0.04 0.00 0.00 16(18.5) 16(17.7) 17(18.8) 13(3.2) 15(1.6) 16(1.7) 19 23 26 17
INCREASED JVP 0.04 0.00 0.00 20(18.5) 22(18.1) 21(18.8) 22(3.0) 30(1.6) 23(1.6) 28 18 17 18
EKG CHANGES, OLD MI 0.03 0.00 0.00 24(18.6) 25(18.2) 23(18.8) 25(2.8) 20(1.6) 25(1.6) 23 19 23 21
THROMBOCYTOSIS 0.03 0.00 0.00 27(18.7) 28(18.5) 20(18.8) 30(2.5) 27(1.6) 29(1.6) 24 27 15 26
CONGESTIVE HEART FAILURE 0.03 0.00 0.00 17(18.5) 17(17.7) 24(18.8) 18(3.1) 19(1.6) 17(1.7) 18 21 21 19
RENAL INSUFFICIENCY 0.02 0.00 0.00 28(18.7) 29(18.5) 26(18.8) 29(2.5) 29(1.6) 28(1.6) 25 28 18 25
CARDIOMEGALY 0.02 0.00 0.00 19(18.5) 18(17.7) 19(18.8) 11(3.3) 25(1.6) 18(1.7) 15 16 28 20
HYPERTENSIVE HEART DISEASE 0.01 0.00 0.00 21(18.5) 19(17.7) 29(19.0) 19(3.1) 28(1.6) 20(1.6) 20 17 22 24
VALVULAR HEART DISEASE 0.01 0.00 0.00 22(18.5) 20(17.7) 30(19.0) 14(3.2) 26(1.6) 19(1.6) 29 24 24 22
IRON DEFICIENCY 0.01 0.00 0.00 23(18.6) 21(18.0) 22(18.8) 20(3.1) 17(1.6) 21(1.6) 17 20 29 23
EKG CHANGES 0.00 0.00 0.00 25(18.6) 23(18.1) 28(19.0) 23(2.9) 22(1.6) 24(1.6) 21 26 27 27
PULMONARY EMBOLISM 0.00 0.00 0.00 29(18.7) 27(18.4) 27(19.0) 26(2.7) 18(1.6) 27(1.6) 14 29 25 30
ASCVD 0.00 0.00 0.00 26(18.7) 26(18.4) 18(18.8) 27(2.7) 24(1.6) 26(1.6) 22 30 30 28
R SIDED HEART FAILURE 0.00 0.00 0.00 30(18.7) 30(18.5) 25(18.8) 28(2.5) 23(1.6) 30(1.6) 27 25 20 29
δ1 1.00 1.00 1.00
δ2 0.86 0.86 0.86

λ 0 3.04 0.01
p˜λ 29 15 16

Figure 3:

The probability of selecting the most preferred item at each stage according to the AIC-estimated BPL model fit in Tables 4–6, conditional on all prior stages having also selected the most preferred item. Each such modal list continues until the item “0” is selected.

Of the 28 unique problems listed for case A, 10 were estimated to have non-zero weights according to the AIC-selected model; these 10 problems constitute the consensus problem list according to the model. The BIC-selected model included 12 problems. Using AIC, the estimate of δ1 was 1 and the estimate of δ2 was 0.62, suggesting that relative preferences quickly decrease and level off at about 2/3 of their starting values. For example, at stage 3 the dampening function evaluates to δ(3) = 0.62 + 0.38^5 ≈ 0.63, and the relative weight of, say, anemia at this stage (supposing it has not yet been ranked) decreases to 5.30 × 0.63 ≈ 3.3. The BPL models’ ranks agree with the length-8 consensus problem list reported in Krauss et al. (2016), as well as with AMean and GMean, on the three most important problems, and they nearly agree with Median, MC2, MC3, CEMCρ, CEMCτ, and MM. Among the comparator methods, the BPL models agreed most closely with AMean. There was disagreement between some methods at lower ranks, however. The BPL models did not put one problem from Krauss et al. – hypoxia – anywhere in their consensus list. The fatigue parameter θ0 was estimated to be 2.57 in the AIC-selected model. This value does not directly translate into an expected list length, which is a multidimensional function of all elements of β. We therefore simulated many lists from the fitted model to characterize the distribution of list lengths. The first, second, and third quartiles of the lengths of these simulated lists were (4, 6, 9), compared to values of (5, 8, 9) for the observed case A data. From Figure 3, the most preferred item at stage 1 (pneumonia) is estimated to be selected with probability about 0.57, decreasing to about 0.20 for subsequent stages. This seems to disagree with the empirical proportion of physicians who ranked pneumonia first, which was 26/32 ≈ 0.81. This model misspecification is likely due to the fact that two physicians ranked pneumonia 4th and four others never ranked it. The probabilities sometimes increase with stage because of the dampening function.
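
To make the simulation step concrete, below is a minimal R sketch, not the authors' implementation, of drawing one list from a fitted BPL model with a stopping item. The form of the dampening function is supplied by the user, whether θ0 is dampened is an assumption (we leave it undampened), and all item weights other than θ0 = 2.57 are hypothetical.

```r
## Minimal sketch: simulate one ranked list from a BPL model with fatigue.
## At each stage the stopping item "0" competes with the remaining problems;
## choosing it terminates the list.
simulate_bpl_list <- function(theta, theta0, delta = function(s) 1) {
  remaining <- names(theta)
  ranked <- character(0)
  s <- 1
  while (length(remaining) > 0) {
    lw <- c(theta[remaining] * delta(s), "0" = theta0)  # dampened log-weights
    w <- exp(lw - max(lw))                              # stabilized weights
    pick <- sample(names(lw), size = 1, prob = w / sum(w))
    if (pick == "0") break                              # fatigue: stop listing
    ranked <- c(ranked, pick)
    remaining <- setdiff(remaining, pick)
    s <- s + 1
  }
  ranked
}

## Quartiles of simulated list lengths, as in the case A comparison;
## the problem names and weights below are hypothetical.
set.seed(1)
theta <- c(pneumonia = 6.1, anemia = 5.3, uti = 4.2, cough = 1.0)
lens <- replicate(1e4, length(simulate_bpl_list(theta, theta0 = 2.57)))
quantile(lens, probs = c(0.25, 0.50, 0.75))
```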

Case B, given in Table 5, was the most challenging, consistent with the a priori expectation in the protocol design. There were 47 unique problems appearing in at least one of the 32 lists, and 14 unique problems were ranked first on at least one list. The largest log-odds ratio belonged to diabetic ketoacidosis, at 4.56 (AIC) or 4.84 (BIC), both approximately 0.9 larger than that of the next highest ranked problem, renal failure. Beyond the first rank, the differences in log-odds ratios between consecutively ranked problems were even smaller, e.g. 0.16 between ranks 2 and 3, 0.13 between ranks 3 and 4, and 0.28 between ranks 4 and 5, reflecting uncertainty on the part of the physicians regarding which items to rank where. The AIC-selected consensus problem list, i.e. the items with strictly positive log-odds ratios, had length 16, similar to the length of the LDRBO-based consensus list. However, there was substantial reordering of the problems: the top four problems of the AIC-based list were ranked 2nd, 5th, 4th, and 8th, respectively, on the LDRBO list. Further, three problems in the AIC-selected consensus list were not in the LDRBO-based list at all. The AIC-based list agreed to a large extent with AMean, GMean, MC3, and CEMCτ. The fatigue parameter θ0 was estimated to be 2.79, which, together with the remaining parameter estimates, yields expected quartiles for the list length of (5, 10, 14), compared to observed quartiles of (8, 10, 12.5). In agreement with these findings, Figure 3 shows that the most preferred item at stage 1 (diabetic ketoacidosis) is estimated to be selected with probability about 0.26, compared to an observed proportion of 8/32 = 0.25.
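
For reference, these stage-1 probabilities can be reconstructed from the reported log-odds ratios. A schematic of the stage-s choice probability, under our assumptions that the stopping weight exp(θ0) competes undampened at every stage and that zero-weight problems each contribute exp(0) = 1:

```latex
\Pr(\text{select item } k \text{ at stage } s \mid \text{remaining set } R_s)
  = \frac{\exp\{\delta(s)\,\theta_k\}}
         {\exp(\theta_0) + \sum_{j \in R_s} \exp\{\delta(s)\,\theta_j\}},
  \qquad \delta(1) = 1.
```

Plugging the case B AIC estimates into the stage-1 version of this formula gives a probability of about 0.25 for diabetic ketoacidosis, matching the reported value up to the rounding of the tabulated θk.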

Remark 3

The LDRBO-based consensus list for case B, given in Table 5, differs from that reported in Krauss et al. (2016): the 9th and 10th ranked items are swapped, and the 15th ranked item is different. For this paper we wrote a new algorithm for optimizing consensus, and it identified a list having a slightly larger (i.e. better) median pairwise LDRBO with the 32 physician lists: 0.584 here versus 0.581 reported in Krauss et al. (2016).

Finally, the results for case C are given in Table 6. Thirty unique problems were listed across all 32 lists. The largest log-odds ratio was attributed to pericardial effusion (7.24 and 7.41 for AIC and BIC, respectively). There was a sizable gap between it and the next ranked item, urinary tract infection (UTI): the difference in log-odds ratios was 7.24 − 5.17 = 2.07, meaning that the model-estimated odds of ranking pericardial effusion over UTI at stage 1 are exp{7.24 − 5.17} ≈ 8. In total, the consensus problem list had length 13 (AIC) or 14 (BIC), compared to an LDRBO-based length of 7. There was perfect agreement with the LDRBO-based list on the first four problems, and the only discrepancy thereafter involved history of smoking, ranked 8th in the AIC-selected list but 5th in the LDRBO-based list. There was widespread agreement with most of the other comparator methods. The estimate of θ0 was 3.28, and the full set of parameter estimates yielded a simulation-based estimate of the list-length quartiles of (4, 6, 9), similar to the observed quartiles of (6, 7, 9). From Figure 3, pericardial effusion had a model-estimated probability of 0.71 of being selected at stage 1, with the most preferred items at subsequent stages being selected with probability between 0.2 and 0.3.
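
As a numerical check on these quantities, under the same stage-1 construction sketched after the case B results (an undampened stopping weight and exp(0) = 1 for each zero-weight problem, both our assumptions):

```r
## Stage-1 check for case C using the AIC-estimated log-weights from Table 6.
## The 13 non-zero weights are read off the table; the remaining 17 problems
## have weight zero and so contribute exp(0) = 1 each to the denominator.
theta  <- c(7.24, 5.17, 4.78, 4.36, 4.33, 3.34, 3.10, 2.79,
            2.52, 2.34, 2.30, 2.19, 2.02, rep(0, 17))
theta0 <- 3.28
exp(theta[1]) / (exp(theta0) + sum(exp(theta)))  # ~0.70, vs 0.71 in Figure 3
exp(7.24 - 5.17)  # ~7.9: the stage-1 odds of pericardial effusion over UTI
```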

6. Data analysis: NBA team rankings

We briefly present here a secondary analysis of a dataset first reported in Deng et al. (2014). After the 2011 NBA preseason, six professional news agencies ranked all 30 teams in the league (we do not analyze the 28 student surveys reported in that paper). These six lists are complete rather than ragged, and the value of the fatigue parameter maximizing the likelihood is therefore θ0 = ∞, meaning it can be dropped from the model. We applied the same set of methods to these data; the results are reported in Table S1 of the Supplement. There was widespread overall agreement between all methods. However, the AIC-selected list was substantially more parsimonious, a consequence of the small-sample correction: it will not estimate more parameters (item weights) than there are observations (rankers), of which there are just six in these data.
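
The correction invoked here is plausibly of the Hurvich and Tsai (1989) form; the exact formula used is our assumption, but the qualitative behavior is the point: the penalty diverges as the number of estimated parameters p approaches the number of ranked lists n.

```r
## Hedged sketch of a small-sample-corrected AIC (AICc). The extra term
## 2p(p + 1)/(n - p - 1) diverges as p approaches n - 1, so with n = 6
## lists the criterion cannot favor a model with many item weights.
aicc <- function(loglik, p, n) {
  -2 * loglik + 2 * p + 2 * p * (p + 1) / (n - p - 1)
}
aicc(loglik = -10, p = 3, n = 6)  # finite
aicc(loglik = -10, p = 5, n = 6)  # Inf: more parameters than the data support
```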

7. Discussion

A challenging, but not unique, feature of the problem list data is that each list may have a different length, which complicates the implementation of multistage models that assume a uniform list length. Moreover, it is useful to have an aggregated consensus list from which unimportant items have been excluded. With these objectives in mind, we have extended classical multistage models and amalgamated them with modern penalized likelihood ideas. As seen in our second data example, these penalized BPL models apply equally well to the analysis of non-ragged data.
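
Schematically, and in our notation rather than necessarily the paper's, the fitted models maximize a penalized log-likelihood over the non-negative item weights,

```latex
\big(\hat{\theta}_{\lambda}, \hat{\delta}_{\lambda}\big)
  = \arg\max_{\theta \ge 0,\; \delta}
    \Big\{ \ell(\theta, \theta_0, \delta)
           - \lambda \sum_{k=1}^{v} \mathrm{pen}(\theta_k) \Big\},
```

where pen(·) is an L0-type penalty (cf. the seamless-L0 of Dicker et al., 2013) that sets unimportant weights exactly to zero, and λ is chosen by AIC or BIC.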

We have already mentioned some advantages a modeling approach has over the approach taken by Krauss et al. (2016), which calculated a hypothetical problem list maximizing pairwise similarity with the observed problem lists. One additional, as-yet-unmentioned advantage over that approach, and over some of the others considered in this paper, is that the penalized BPL models not only order the items but also give an explicit numerical assessment of their relative importance by way of an estimated relative log-odds ratio. For example, in case B, we can conclude that there were a substantial number of problems about which the physicians were conflicted: the difference in log-odds ratios between the 6th and 15th ranked problems, schizophrenia and sinusitis, respectively, was just 2.89 − 2.03 = 0.86, and consequently the differences between consecutively ranked problems within this range were smaller still. This may be why the penalized BPL consensus lists differed from the LDRBO-based list. In each of our problem list analyses, the existing methods that our penalized BPL models agreed with most often, i.e. MC3 and CEMCτ, were also the methods that performed well in our simulation study.
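
On the odds scale, and noting that the case B dampening estimates were δ1 = δ2 = 1 (Table 5) so that the contrast is the same at every stage, this gap amounts to:

```r
exp(2.89 - 2.03)  # ~2.4: only about 2.4-to-1 odds of listing schizophrenia
                  # ahead of sinusitis while both remain unranked
```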

Tables 4–6 and Figure 3 may seem inconsistent: the sets of non-zero items in the tables are somewhat longer than the modal lists plotted in Figure 3. However, these results describe different dimensions of consensus. The tables describe overall physician agreement on the sets of relevant problems for each case abstract, whereas the figure characterizes the model-estimated probability of the list that an individual physician is most likely to construct. Our results suggest that, for cases A and C, a physician should not expect to construct a list that matches that of her colleague beyond the highest ranked item; collectively, however, the physicians are in agreement on the first five or so items. In contrast, for case B, there was essentially no consensus.

One important design-based challenge to our analysis concerns the defining, naming, and grouping of problems. As described in the introduction, physicians were free to describe problems in their own words during the interview. If a physician named a problem clinically similar to one already listed, either by her or by another physician, the interviewer noted this aloud and offered that she could change her similar-sounding problem to match the existing one; however, she was not required to do so. This is why case B lists history of alcohol abuse, alcoholism, alcoholic cirrhosis with ascites, and alcoholic cirrhosis with SBP as separate problems. We also implicitly assumed that the number of possible items, v, for each case was exactly the number of unique items listed across all physicians, but it is possible that, were more interviews to be conducted, additional unique problems would be added to the vocabulary. One must therefore assume that our sample size was sufficiently large to include, at a minimum, those problems that would fall in the consensus list.


Table 5:

Parameter estimates from fitted penalized Benter-Plackett-Luce (BPL) models applied to the case B data, ordered by the estimated values of θk under λ = λAIC. The row p˜λ gives the number of non-zero parameters in the estimated model. For comparison, the remaining columns give the ranked list of problems according to each listed alternative algorithm or model; comparator entries in bold indicate ranks that either (i) differed from the λAIC-based rank by no more than 1 position or (ii) for problems whose weights λAIC estimated to be zero, exceeded the total number of non-zero estimated weights under λAIC. The parenthetical numbers in the AMean, GMean, and Median Rank columns give the summary rank of that problem, and the parenthetical numbers in the MC columns give the estimated transition probability for the stationary distribution (×100).

Problem / Parameter | BPL: λ = 0 | BPL: λAIC | BPL: λBIC | LDRBO | AMean Rank | GMean Rank | Median Rank | MC1 | MC2 | MC3 | CEMCρ | CEMCτ | EMM | MM
(LDRBO entries are blank for problems that fall outside the LDRBO-based consensus list.)
θ0 3.50 2.79 3.07
DIABETIC KETOACIDOSIS 5.27 4.56 4.84 2 1(4.7) 1(3.0) 1(3.0) 1(3.6) 1(12.7) 1(8.85) 1 1 1 1
RENAL FAILURE 4.39 3.68 3.96 5 2(9.0) 2(6.5) 4(6.0) 3(2.6) 5(6.4) 2(5.98) 4 2 5 4
SPONTANEOUS BACTERIAL PERITONITIS 4.23 3.52 3.80 4 4(11.5) 3(6.5) 2(5.0) 2(2.7) 2(8.5) 3(5.59) 2 3 3 2
CIRRHOSIS DUE TO ALCOHOL 4.10 3.39 3.67 8 3(9.9) 5(7.7) 5(8.0) 11(2.3) 6(5.1) 4(5.16) 7 5 7 7
MAXILLARY SINUS MASS 3.83 3.11 3.39 1 5(13.3) 4(7.2) 3(5.5) 5(2.5) 3(8.5) 5(4.92) 3 4 4 3
SCHIZOPHRENIA 3.60 2.89 3.17 11 6(13.9) 8(11.4) 7(11.0) 9(2.3) 7(4.3) 8(3.84) 10 7 8 8
ENCEPHALOPATHY 3.50 2.79 3.07 3 8(16.7) 6(8.6) 10(26.5) 8(2.3) 10(1.9) 7(4.08) 6 6 2 5
MULTIPLE CRANIAL NERVE PALSIES 3.46 2.75 3.03 6 7(15.9) 7(10.0) 6(8.5) 4(2.5) 4(7.0) 6(4.12) 5 9 6 6
HYPERTENSION 3.14 2.43 2.71 12 9(18.2) 11(15.3) 9(15.5) 13(2.2) 8(3.2) 9(2.90) 12 10 13 13
HISTORY OF IV DRUG USE 3.14 2.42 2.71 14 10(18.5) 13(16.0) 8(14.0) 22(2.1) 9(3.1) 12(2.80) 14 14 14 14
HYPONATREMIA 3.00 2.29 2.57 7 11(19.9) 12(15.5) 15(28.0) 12(2.3) 13(1.2) 11(2.81) 8 12 12 12
HYPERKALEMIA 2.89 2.17 2.46 10 14(20.7) 14(16.4) 12(27.8) 10(2.3) 14(1.2) 14(2.59) 9 13 11 11
TOBACCO USE 2.87 2.15 2.44 13(20.5) 15(18.0) 11(26.5) 15(2.2) 15(1.2) 15(2.43) 16 16 15 16
SINUSITIS 2.73 2.03 2.31 9 12(20.5) 9(13.6) 13(28.0) 6(2.4) 11(1.3) 10(2.88) 44 8 10 9
MENINGITIS 2.74 2.03 2.31 15(21.0) 10(14.0) 14(28.0) 7(2.3) 12(1.2) 13(2.79) 13 11 9 10
SYSTOLIC MURMUR 2.42 1.71 2.00 16(23.5) 16(21.1) 16(28.0) 21(2.1) 16(1.2) 16(1.94) 11 15 16 15
HX ALCOHOL ABUSE 2.05 0.00 1.62 18(25.2) 20(23.4) 18(28.5) 26(2.1) 20(1.1) 18(1.62) 15 17 17 18
ORBIT FRACTURE 1.93 0.00 1.50 13 17(25.0) 17(21.4) 36(29.0) 18(2.2) 18(1.1) 17(1.78) 17 19 19 17
HX GUN SHOT WOUND 1.84 0.00 1.42 19(26.1) 22(24.8) 17(28.5) 29(2.1) 27(1.0) 19(1.46) 46 18 18 19
ANEMIA 1.49 0.00 0.00 20(26.8) 24(25.2) 42(29.0) 30(2.1) 17(1.1) 22(1.37) 19 23 20 21
DIABETES MELLITUS 1.14 0.00 0.00 22(27.1) 19(23.1) 22(29.0) 24(2.1) 41(1.0) 21(1.41) 45 20 35 24
FEVER 1.14 0.00 0.00 15 21(26.9) 18(22.0) 19(29.0) 19(2.2) 19(1.1) 20(1.45) 38 22 27 20
R ORBITAL FRACTURE 1.12 0.00 0.00 23(27.5) 25(26.2) 41(29.0) 28(2.1) 28(1.0) 25(1.25) 39 21 22 22
THROMBOCYTOPENIA 0.77 0.00 0.00 28(28.2) 34(27.6) 37(29.0) 43(1.8) 25(1.0) 28(1.12) 30 24 21 27
MUCORMYCOSIS 0.75 0.00 0.00 24(27.7) 21(24.6) 23(29.0) 14(2.2) 35(1.0) 23(1.27) 24 25 28 25
SEPSIS 0.75 0.00 0.00 25(27.7) 23(25.1) 24(29.0) 16(2.2) 40(1.0) 24(1.26) 37 28 40 23
PALATAL LESION 0.75 0.00 0.00 26(27.9) 27(26.5) 28(29.0) 31(2.0) 23(1.0) 26(1.19) 21 26 26 26
ANEMIA AND THROMBOCYTOPENIA 0.73 0.00 0.00 27(28.1) 32(27.2) 30(29.0) 36(1.9) 22(1.0) 27(1.14) 35 27 23 29
TACHYCARDIA 0.05 0.00 0.00 42(28.8) 44(28.5) 44(29.0) 46(1.6) 26(1.0) 44(1.02) 18 45 24 32
HYPEROSMOLAR COMA 0.04 0.00 0.00 29(28.5) 29(27.0) 38(29.0) 27(2.1) 21(1.0) 30(1.09) 36 41 41 28
PLASMA PROTEIN DISORDER 0.03 0.00 0.00 37(28.7) 40(28.2) 33(29.0) 40(1.8) 34(1.0) 40(1.05) 22 44 29 33
POTENTIAL CVA 0.02 0.00 0.00 30(28.6) 26(26.5) 20(29.0) 17(2.2) 36(1.0) 29(1.10) 43 36 42 30
HEPATO-RENAL SYNDROME 0.01 0.00 0.00 31(28.6) 33(27.4) 25(29.0) 32(2.0) 24(1.0) 34(1.07) 40 29 37 31
LEUKOCYTOSIS 0.01 0.00 0.00 33(28.7) 35(27.6) 47(29.0) 34(2.0) 29(1.0) 35(1.07) 33 32 38 34
CEREBROVASCULAR ACCIDENT 0.01 0.00 0.00 38(28.8) 39(28.1) 27(29.0) 38(1.8) 47(1.0) 39(1.05) 20 37 32 37
DEHYDRATION 0.01 0.00 0.00 36(28.7) 36(27.7) 29(29.0) 33(2.0) 45(1.0) 36(1.06) 23 31 33 35
ALCOHOLISM 0.01 0.00 0.00 41(28.8) 41(28.2) 26(29.0) 39(1.8) 39(1.0) 41(1.04) 28 40 43 40
ALCOHOLIC CIRRHOSIS WITH SBP 0.00 0.00 0.00 32(28.7) 30(27.1) 45(29.0) 20(2.1) 30(1.0) 33(1.08) 26 30 39 36
SMOKING 0.00 0.00 0.00 39(28.8) 38(28.0) 39(29.0) 37(1.9) 44(1.0) 38(1.05) 25 43 34 39
ASCITES 0.00 0.00 0.00 34(28.7) 31(27.1) 31(29.0) 25(2.1) 42(1.0) 31(1.09) 42 38 36 38
ALCOHOLIC CIRRHOSIS WITH ASCITES 0.00 0.00 0.00 35(28.7) 28(26.6) 21(29.0) 23(2.1) 46(1.0) 32(1.09) 32 33 47 41
RENAL FAILURE WITH HYPERKALEMIA 0.00 0.00 0.00 40(28.8) 37(27.9) 43(29.0) 35(2.0) 32(1.0) 37(1.06) 27 34 45 42
POLYSUBSTANCE ABUSE 0.00 0.00 0.00 44(28.8) 43(28.3) 40(29.0) 41(1.8) 33(1.0) 42(1.04) 41 39 46 47
PROTEIN CALORIE MALNUTRITION 0.00 0.00 0.00 43(28.8) 42(28.3) 34(29.0) 42(1.8) 43(1.0) 43(1.03) 29 46 31 44
MALNUTRITION 0.00 0.00 0.00 45(28.9) 45(28.5) 35(29.0) 44(1.7) 31(1.0) 46(1.01) 34 35 25 43
HEPATOMEGALY 0.00 0.00 0.00 46(28.9) 46(28.6) 46(29.0) 45(1.6) 38(1.0) 45(1.01) 31 42 44 46
HX MEDICAL NONCOMPLIANCE 0.00 0.00 0.00 47(29.0) 47(28.8) 32(29.0) 47(1.5) 37(1.0) 47(0.99) 47 47 30 45
δ1 1.00 1.00 1.00
δ2 1.00 1.00 1.00

λ 0 5.03 2.18
p˜λ 40 17 20

Acknowledgments

Supported by the National Institutes of Health (UL1TR002240).

Contributor Information

Philip S. Boonstra, Department of Biostatistics, University of Michigan, USA.

John C. Krauss, Division of Hematology Oncology, University of Michigan, USA.

References

  1. Akaike H (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, pages 267–281.
  2. Benter W (2008). Computer-based horse race handicapping and wagering systems: A report. In Hausch DB, Lo VSY, and Ziemba WT, editors, Efficiency of Racetrack Betting Markets, pages 183–198. World Scientific Publishing.
  3. Boulesteix A-L and Slawski M (2009). Stability and aggregation of ranked gene lists. Briefings in Bioinformatics 10, 556–568.
  4. DeConde RP, Hawley S, Falcon S, Clegg N, Knudsen B, and Etzioni R (2006). Combining results of microarray experiments: A rank aggregation approach. Statistical Applications in Genetics and Molecular Biology 5, Article 15.
  5. Deng K, Han S, Li KJ, and Liu JS (2014). Bayesian aggregation of order-based rank data. Journal of the American Statistical Association 109, 1023–1039.
  6. Dicker L, Huang B, and Lin X (2013). Variable selection and estimation with the seamless-L0 penalty. Statistica Sinica 23, 929–962.
  7. Dwork C, Kumar R, Naor M, and Sivakumar D (2001). Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web, pages 613–622. ACM.
  8. Fligner MA and Verducci JS (1988). Multistage ranking models. Journal of the American Statistical Association 83, 892–901.
  9. Gormley IC and Murphy TB (2008). Exploring voting blocs within the Irish electorate: A mixture modeling approach. Journal of the American Statistical Association 103, 1014–1027.
  10. Gormley IC and Murphy TB (2008). A mixture of experts model for rank data with applications in election studies. The Annals of Applied Statistics 2, 1452–1477.
  11. Hurvich CM and Tsai C-L (1989). Regression and time series model selection in small samples. Biometrika 76, 297–307.
  12. Kendall MG (1948). Rank Correlation Methods. Griffin, Oxford, England.
  13. Krauss JC, Boonstra PS, Vantsevich AV, and Friedman CP (2016). Is the problem list in the eye of the beholder? An exploration of consistency across physicians. Journal of the American Medical Informatics Association 23, 859–865.
  14. Li H, Xu M, Liu JS, and Fan X (2018). ExtMallows: An Extended Mallows Model and Its Hierarchical Version for Ranked Data Aggregation. R package version 0.1.0.
  15. Li H, Xu M, Liu JS, and Fan X (2019). An extended Mallows model for ranked data aggregation. Journal of the American Statistical Association.
  16. Li X, Choudhary PK, Biswas S, and Wang X (2018). A Bayesian latent variable approach to aggregation of partial and top-ranked lists in genomic studies. Statistics in Medicine.
  17. Li X, Wang X, and Xiao G (2017). A comparative study of rank aggregation methods for partial and top ranked lists in genomic applications. Briefings in Bioinformatics.
  18. Lin S (2010). Space oriented rank-based data integration. Statistical Applications in Genetics and Molecular Biology 9.
  19. Lin S and Ding J (2009). Integration of ranked lists via cross entropy Monte Carlo with applications to mRNA and microRNA studies. Biometrics 65, 9–18.
  20. Luce RD (1959). Individual Choice Behavior: A Theoretical Analysis. John Wiley and Sons, New York.
  21. Mallows CL (1957). Non-null ranking models. Biometrika 44, 114–130.
  22. Marden JI (1996). Analyzing and Modeling Rank Data. Chapman & Hall, London.
  23. Meyer AN, Payne VL, Meeks DW, Rao R, and Singh H (2013). Physicians’ diagnostic accuracy, confidence, and resource requests: A vignette study. JAMA Internal Medicine 173, 1952–1958.
  24. Mollica C and Tardella L (2017). Bayesian Plackett–Luce mixture models for partially ranked data. Psychometrika 82, 442–458.
  25. Neuwirth E (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2.
  26. Nombekela SW, Murphy MR, Gonyou HW, and Marden JI (1994). Dietary preferences in early lactation cows as affected by primary tastes and some common feed flavors. Journal of Dairy Science 77, 2393–2399.
  27. Plackett RL (1975). The analysis of permutations. Journal of the Royal Statistical Society, Series C (Applied Statistics) 24, 193–202.
  28. R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  29. Schimek M, Budinska E, Kugler K, Svendova V, Ding J, and Lin S (2015). TopKLists: A comprehensive R package for statistical inference, stochastic aggregation, and visualization of multiple omics ranked lists. Statistical Applications in Genetics and Molecular Biology, pages 311–316.
  30. Schwarz G (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
  31. Spearman C (1904). The proof and measurement of association between two things. The American Journal of Psychology 15, 72–101.
  32. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58, 267–288.
  33. Webber W, Moffat A, and Zobel J (2010). A similarity measure for indefinite rankings. ACM Transactions on Information Systems (TOIS) 28, 20.
  34. Weed LL (1968). Special article: Medical records that guide and teach. New England Journal of Medicine 278, 593–600.
  35. Wickham H (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1.
