Author manuscript. Published in final edited form as: IEEE Trans Autom Sci Eng. 2020 Dec 14;19(1):510–521. doi: 10.1109/TASE.2020.3041411

A Learning Framework for Personalized Random Utility Maximization (RUM) Modeling of User Behavior

Jingshuo Feng 1, Xi Zhu 2, Feilong Wang 3, Shuai Huang 4, Cynthia Chen 5

Abstract

Understanding user behavior is crucial for the success of many emerging applications that aim to provide personalized services for target users, such as many patient-centered health apps and transportation apps. Models based on the random utility maximization (RUM) theory are widely used in learning and understanding behavioral preferences at the population level but have difficulty estimating individuals’ preferences, particularly when each individual’s data are limited and fragmented. To address this problem, our framework builds on the concepts of canonical structure and membership vectors introduced in recent works on collaborative learning and is suitable for modeling a heterogeneous population with insufficient data from each individual. We further propose an extension of the collaborative learning framework using pairwise-fusion regularization as a knowledge discovery tool for real-world applications in which the canonical structure is uneven, e.g., some canonical models may only represent minor subpopulations. Computationally efficient algorithms are developed to solve the corresponding optimization challenges. Extensive simulation studies and a real-world application in smart transportation demand management (TDM) show the effectiveness of our proposed methods.

Keywords: Machine learning, personalized behavior modeling, smart transportation demand management (TDM)

I. Introduction

UNDERSTANDING user behaviors is crucial for the success of many emerging applications, such as online retailers, transportation apps, and patient-centered health systems that aim to provide personalized services for target users. For example, there has recently been much research attention on smart transportation demand management (TDM) due to the rising demand for driving, which is closely related to a number of urban issues, such as congestion, air pollution, and public health [4]. TDM strategies are designed to modify travel behaviors by attaching incentives (or costs) to certain travel behaviors, for instance, promoting transit use by offering low-cost passes. Personalized incentives hold great promise for solving many challenges in TDM [5]–[7], leveraging the rapid proliferation of smart personal technologies that make it possible to offer incentives individually [8], [9] rather than as an average generic incentive.

The key to the success of any personalized service-providing system is a statistically accurate and efficient personalized model that can estimate individuals’ preferences from their behavior data. The random utility maximization (RUM) model has been found to be an effective tool for learning an individual’s preferences from data [10]–[17]. The challenge is that personal behavior data are usually limited and fragmented. This touches on a long-standing problem in statistics and machine learning, considering the huge heterogeneity of the population and the prohibitive cost of collecting sufficient high-quality data from each individual. Most prediction models are learned by pooling the individuals’ data together and creating a population model, thus ignoring individuals’ variations and characterizing only average effects. To address this problem, our framework builds on the concepts of canonical structure and membership vectors introduced in recent works on collaborative learning [1]–[3] and is suitable for modeling a heterogeneous population with insufficient data from each individual.

The personal behavior data themselves are challenging to model. For example, as described in [8], the personal behavior data are collected from each individual, who is asked to choose between a promoted sustainable plan and his/her original travel plan. The RUM model assumes that the probability of choosing among multiple alternatives depends only on the differences in their respective utilities and that an individual will select the alternative that provides the maximum utility. Here, the utility is a concept quantifying the attractiveness of an alternative in a choice scenario, and it is assumed to be indirectly related to various characteristics of the alternative, the individual, and the surrounding environment [11], [18]. In reality, the utilities of alternatives are usually unobservable. For example, given two alternatives A and B, the only information we can observe is the final choice, which only indicates that $\Pr(U_A \geq U_B) \geq 0.5$ when the individual chooses A ($U_A$ is the utility of alternative A); otherwise, $\Pr(U_A \geq U_B) \leq 0.5$. Neither the true probability nor the utilities can be directly observed. Thus, learning personal preferences from behavior data adds to the complexity of building personal RUM models using sparse and fragmented individual data.

To learn user behavior at the individual level, we utilize the theory of RUM and propose a novel logistic collaborative model (LogCM) in this work to address the aforementioned issues. The collaborative learning framework [1]–[3] is one of the state-of-the-art personalized modeling methods, which can learn a distinct personalized model for each individual. The RUM has a solid behavioral basis and provides the proposed models with good interpretability for understanding user behavior. A set of canonical models is learned to represent the heterogeneous population characteristics. Each canonical model can be treated as a representation of one behavior pattern or decision mechanism. It is usually unknown which mechanism an individual may follow, and some individuals may exhibit a mix of those patterns. Thus, mathematically, these canonical models span the modeling space for the individuals and provide a basis to characterize the individuals’ variations. We then learn a membership vector for each individual, which represents the degrees of resemblance between the model of that individual and the canonical models. With the knowledge of the canonical models and membership vectors, the common patterns are found and individual models can be derived. The collaborative learning framework is easy to explain and suitable for a heterogeneous population with insufficient data from each individual, since it considers both the commonalities among individuals and the characteristics of each individual by learning canonical models and membership vectors, respectively. This novel collaborative learning model leads to a nonlinear and nonconvex constrained optimization problem, which is solved by our proposed two-step iterative algorithm. We further argue that in many real-world applications, the canonical structure is uneven, e.g., some canonical models may only represent minor subpopulations, and existing collaborative models fail to identify these minor canonical models. To address this limitation, we propose an extension using pairwise-fusion regularization [19], [20] as a knowledge discovery tool to reveal the canonical structure.

It is worth pointing out that the proposed work differs from some ongoing works in the literature that aim to provide remedies for RUM for various types of complications in real applications. Most of these models are not designed for learning behavior models at the individual level. For example, Azari et al. [21] aimed to learn utilities of a set of alternatives with rank data and did not estimate personal preferences for each individual. Guevara and Ben-Akiva [22] dealt with the endogeneity caused by model misspecification (i.e., omission of attributes). Ben-Akiva et al. [23] incorporated contexts such as social networks, as the decision being made may also be affected by family, friends, and other choices being offered. Hancock et al. [24] used a dynamic model from decision field theory to characterize changing preferences with sequential choices and decisions. However, little of the literature has systematically tackled the problem of personalized modeling in the framework of RUM theory. One exception is the mixed logit model (MLM) [25], [26], which can deal with a heterogeneous population where parameters are assumed to vary across individuals. An extensive comparison between MLM and our proposed model can be found in the numerical studies.

The remainder of this article is organized as follows. Section II presents the details of the proposed logistic collaborative model (LogCM) and an extension of it, the similarity-regularized LogCM (LogSCM). Related methods [e.g., RUM and mixed-effect models (MEMs)] and their relationships to the proposed model are discussed. Section III provides a two-step iterative algorithm for learning the parameters in the proposed models. Practical implementation guidelines, such as initialization and hyperparameter tuning, are also discussed. Section IV evaluates the proposed methods on comprehensive simulation studies. A real-world case study is shown in Section V. In Section VI, we introduce a new extension of the LogCM called the pairwise-fusion LogCM (LogPCM). It provides an innovative tool for better understanding the canonical structure by incorporating pairwise-fusion regularization on the canonical models. Section VII concludes this article and discusses possible future work.

II. LogCM

In this section, we present the LogCM for personalized modeling. Here, we develop the LogCM with the assumption that the decisions made by users are binary since this is the most common decision scenario in practice and multiple-choice outcomes could always be converted into binary outcomes, e.g., if multiple products are presented to a user, we may use binary outcome variables to indicate the “buy” or “not buy” for each product by the user. We first show that the RUM model used in [8] can be reformulated as a logistic regression (LR), which links the characteristics of the alternative and the outcome using the logistic function (also known as sigmoid function in deep learning). We then develop the mathematical formulation of the LogCM. We also derive its connection with the mixed-effects logistic model (logistic MEM) [27], [28], i.e., MLM [25].

A. Relationship Between LR and RUM

As mentioned in Section I, RUM assumes that the probability of selecting among alternatives depends only on the differences in their utilities. In other words, an individual will assign the highest probability to the alternative that provides the maximum utility [11], [17], [18], where the utility is defined to be indirectly related to the various characteristics of the alternative. For instance, for a scenario with two alternatives, A and B, the decision-making problem resembles a binary classification in which the utility of any alternative is a function of some variables that characterize the alternative. Specifically, based on the theory of RUM, for two alternatives, the probability of choosing alternative B can be written as $\Pr(U_B \geq U_A)$, where $U_A = V_A + \epsilon_A$ and $U_B = V_B + \epsilon_B$ are the utilities associated with alternatives A and B, respectively, with $V$ representing the “systematic utility” and $\epsilon$ representing the “random utility” [11]. Then, we can analytically use the utility ratio [8] to represent the probability that an individual will choose alternative B rather than alternative A as $R_B = e^{U_B}/(e^{U_A} + e^{U_B})$. If $R_B \geq 0.5$, the individual has a higher probability of choosing alternative B; otherwise, $R_A \geq 0.5$ and the individual is more likely to choose alternative A.

To characterize the difference between the two alternatives, assume that there are p variables and define the differences of the two alternatives on these p variables as $x_1, x_2, \ldots, x_p$. As only the difference matters in deciding which alternative is more favorable, we can arbitrarily appoint one alternative as a baseline, i.e., we can set $V_A = 0$ if alternative A is the baseline. Then, we can define the systematic utility of alternative B as $V_B = \sum_{p} \beta_p x_p$. Since the systematic utilities represent the predictable part of decision-making, characterized by the variables that define the alternatives, and the random utilities are not observable, the utility ratio can be simplified by eliminating the random utilities as

$$R_B = \frac{e^{V_B}}{e^{V_A} + e^{V_B}} = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}.$$

This equation is mathematically the same as the logistic function in LR. Thus, using the definition of the utility ratio in binary decision-making cases, the RUM model is identical to the LR. As this article focuses on binary decision-making outcomes, the RUM and LR models can be used interchangeably.
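For concreteness, the small R snippet below illustrates this correspondence: with alternative A as the baseline (V_A = 0) and V_B = x⊤β, the utility ratio R_B is exactly the logistic function of x⊤β. The feature differences and coefficients are made-up values for illustration only, not estimates from this article.

```r
beta <- c(0.08, -0.05, 0.02)       # assumed preference weights (illustrative)
x    <- c(10, 15, 40)              # differences between plan B and plan A on 3 attributes
V_B  <- sum(x * beta)              # systematic utility of alternative B (V_A = 0 baseline)
R_B  <- exp(V_B) / (1 + exp(V_B))  # utility ratio; identical to plogis(V_B), the logistic function
R_B                                # > 0.5 means B is the more likely choice
```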

B. Framework of Collaborative Learning

Personalized modeling often encounters the problem of lack of data; we now present how the collaborative learning model is created for learning personalized RUM models in our target application. Usually, a multivariate statistical model requires a considerable number of samples for reliable estimation. For instance, it is a commonly held belief that the ratio of sample size to the number of parameters should be at least 30 in linear regression models with continuous outcomes [29], and the sample size requirement is expected to be even more demanding when the outcomes are binary [30]. This high demand on sample size makes robust estimation challenging for personalized modeling given limited data.

To overcome the problem of limited data, we adopt the collaborative learning framework that has been shown to be effective in a range of engineering and healthcare applications [1]–[3]. The general idea of collaborative learning is to exploit the canonical structure that is embedded beneath the heterogeneity of a given population. An exemplary illustration is shown in Fig. 1. Denote the canonical models as $f_k(x)$, $k = 1, \ldots, K$, which represent common patterns or typical types among the N heterogeneous individuals. The number of canonical models, which can be determined by data-driven approaches as we will show later, is usually much smaller than the number of individuals (i.e., $K \ll N$), granting collaborative learning the advantage of reducing the burden of estimating a large number of free parameters.

Fig. 1. Schematic of the collaborative learning framework.

With the knowledge of the canonical models, each individual model can be characterized as an integration of the canonical models. Here, we assign a membership vector $c_i = [c_{i1}, \ldots, c_{iK}]$, $i = 1, \ldots, N$, to each individual i to represent the degrees of resemblance of the individual model to the canonical models. In other words, we assume that the model of each individual is a combination of the canonical models, with weights given by the elements of the corresponding membership vector. Since each canonical model describes one kind of mechanism pattern in the population, by integrating this set of canonical models, the individual models, denoted as $g_i(x) = \sum_k c_{ik} f_k(x)$, $i = 1, \ldots, N$, can provide an adequate characterization of the individuals. Specifically, consider models with individual parameters $\beta_i = [\beta_{i1}, \ldots, \beta_{ip}]$, $i = 1, \ldots, N$, and note that each canonical model is a model of the same form with a parameter vector $q_k = [q_{k1}, \ldots, q_{kp}]$, $k = 1, \ldots, K$. Under the collaborative learning framework, we assume that $\beta_i$ of the model of individual i is a linear combination of the canonical parameters, i.e., $\beta_i = \sum_k c_{ik} q_k$.

For example, in our work, the canonical models and the personalized models are both logistic models, i.e., the kth canonical model is $f_k(x) = \log\frac{\Pr(y=1)}{1-\Pr(y=1)} = x^\top q_k$, where $q_k$ is the parameter vector of this canonical model. Thus, under the collaborative learning framework, the model of individual i is $g_i(x) = \sum_k c_{ik}\, x^\top q_k = x^\top \sum_k c_{ik} q_k$, and $\beta_i = \sum_k c_{ik} q_k$ is the personalized parameter vector for the individual.
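As a toy illustration (not from the article's code), the R snippet below forms an individual's parameter vector from K = 3 hypothetical canonical models and a membership vector on the simplex, and evaluates the resulting individual choice probability. All numbers are made up.

```r
Q   <- cbind(c(1.0, -0.5), c(-0.8, 0.3), c(0.1, 1.2))  # p = 2 variables, K = 3 canonical models (columns)
c_i <- c(0.7, 0.2, 0.1)                                 # membership vector: nonnegative, sums to 1
beta_i <- Q %*% c_i                                     # individual parameters beta_i = Q c_i
x      <- c(0.4, -1.0)                                  # one observation's characteristic variables
p_hat  <- plogis(sum(x * beta_i))                       # g_i(x): the individual's choice probability
```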

Next, we will present the formulation of our proposed LogCM in detail and an intuitive extension, the similarity-regularized logistic collaborative model (LogSCM).

C. Model Formulation of LogCM and LogSCM

To derive the analytical formulation of LogCM, we first collect the parameters of the K canonical models in a matrix $Q \equiv [q_1, \ldots, q_K] \in \mathbb{R}^{p \times K}$. Then, we can rewrite $\beta_i$ as $\beta_i = Q c_i$.

LR is a widely used statistical model for binary-outcome problems. It assumes that the probability of being in a certain category depends on a set of variables (x’s) through the logit link function. Under the collaborative learning framework, the logistic function can be expressed as

$$\pi_i(x_{ij}) = \Pr(y_{ij}=1) = \frac{\exp(x_{ij}^\top \beta_i)}{1+\exp(x_{ij}^\top \beta_i)} = \frac{\exp(x_{ij}^\top Q c_i)}{1+\exp(x_{ij}^\top Q c_i)}. \quad (1)$$

$\pi_i$ is the LR model for individual i, where $y_{ij}$ is the jth binary observation of this individual and $x_{ij}$ is the p-dimensional vector of characteristic variables. LR is typically learned by maximizing the log-likelihood, or equivalently, minimizing the negative log-likelihood, which can be written as

$$\ell = \log\left(1+\exp(x_{ij}^\top \beta_i)\right) - y_{ij}\left(x_{ij}^\top \beta_i\right). \quad (2)$$

Note that this is also the logistic loss function in machine learning. We can see from (2) that we can learn the parameters without knowing the latent variable $x^\top \beta$ (the systematic utility in the context of RUM). It is straightforward to write the log-likelihood function under the collaborative learning framework for parameter estimation

$$\min_{C, Q}\ \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \left\{ \log\left(1+\exp(x_{ij}^\top Q c_i)\right) - y_{ij}\left(x_{ij}^\top Q c_i\right) \right\} \quad \text{s.t.}\ \ c_i \geq 0,\ \|c_i\|_1 = 1,\ i = 1, \ldots, N. \quad (3)$$

Here, in our LogCM, the objective function is a weighted sum of the logistic losses of all individual models, gauging the goodness-of-fit of the models. $n_i$ is the number of observations of individual i, and the weight $1/n_i$ accounts for the different sample sizes across individuals. The two constraints, $c_{ik} \geq 0$ and $\|c_i\|_1 = 1$, are imposed on $c_i$ due to its definition as a membership vector. By solving this optimization problem, the parameter matrix of the canonical models Q and the membership vectors $c_i$, $i = 1, \ldots, N$, can be estimated. The individual models are then obtained by $\beta_i = Q c_i$.

An obvious advantage of formulating the parameter estimation problem as an integrated optimization framework is that it is flexible to incorporate other kinds of data, prior knowledge, or any structural constraints that we may want to impose on the models. For instance, in many other applications, similarity information among individuals can be very helpful for learning individual models by allowing similar individuals to have similar models [2]. Denote $w_{lm}$ as the similarity between individuals l and m, i.e., the larger $w_{lm}$ is, the more similar the pair. To incorporate the similarity knowledge into the model formulation of LogCM, we can add a regularization term, $\sum_{l,m} \|c_l - c_m\|^2 w_{lm}$, to the objective function of (3) and extend it to the similarity-regularized logistic collaborative model (LogSCM). As in [31], we can reformulate this regularization term as a trace term, which facilitates the development of our optimization solution in Section III

$$\frac{1}{2}\sum_{l,m} \|c_l - c_m\|^2 w_{lm} = \sum_{l=1}^{N} c_l^\top c_l\, d_{ll} - \sum_{l,m} c_l^\top c_m\, w_{lm} = \mathrm{Tr}\left(C L C^\top\right). \quad (4)$$

Here, C is the matrix containing all membership vectors, $C = [c_1, \ldots, c_N] \in \mathbb{R}^{K \times N}$. L is defined as $D - W$, where $W = (w_{lm}) \in \mathbb{R}^{N \times N}$ is the similarity matrix and D is a diagonal matrix with entries $d_{ll} = \sum_m w_{lm}$. Thus, it leads to the following formulation of LogSCM:

$$\min_{C, Q}\ \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \left\{ \log\left(1+\exp(x_{ij}^\top Q c_i)\right) - y_{ij}\left(x_{ij}^\top Q c_i\right) \right\} + \lambda\, \mathrm{Tr}\left(C L C^\top\right) \quad \text{s.t.}\ \ c_i \geq 0,\ \|c_i\|_1 = 1,\ i = 1, \ldots, N \quad (5)$$

where λ ≥ 0 is a hyperparameter that can be tuned to control the effect of the regularization term on parameter estimation. The larger λ is, the greater influence will be imposed on the estimation by the regularization term.

It is not hard to see that when λ = 0, the LogSCM formulation in (5) degenerates to the LogCM formulation in (3). By solving the optimization problem in (5), we can estimate the canonical parameter matrix Q and the membership matrix C. As the formulation is not jointly convex in both parameter matrices, we will propose an iterative two-step approach to solve for them alternately in Section III.
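Before presenting the estimation algorithm, the following R sketch shows how the LogSCM objective in (5) can be evaluated for given parameters: the 1/n_i-weighted logistic loss of every individual model plus the similarity regularizer λ Tr(C L C⊤). The data structures (lists X and y, matrices Q, C_mat, and W) are illustrative assumptions, not the authors' code.

```r
# Evaluate the LogSCM objective (5); X[[i]] is the n_i x p design matrix and
# y[[i]] the binary outcomes of individual i; Q is p x K, C_mat is K x N,
# W is the N x N similarity matrix.
logscm_objective <- function(X, y, Q, C_mat, W, lambda) {
  N <- length(X)
  loss <- 0
  for (i in seq_len(N)) {
    beta_i <- Q %*% C_mat[, i]                               # individual parameters beta_i = Q c_i
    eta    <- as.vector(X[[i]] %*% beta_i)
    loss   <- loss + mean(log1p(exp(eta)) - y[[i]] * eta)    # (1/n_i)-weighted logistic loss
  }
  D <- diag(rowSums(W))                                      # d_ll = sum_m w_lm
  L <- D - W                                                 # graph Laplacian L = D - W
  loss + lambda * sum(diag(C_mat %*% L %*% t(C_mat)))        # + lambda * Tr(C L C')
}
```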

D. Relationship Between LogSCM and Mixed-Effects Logistic Model

The MEM has been a long-standing method to handle heterogeneity in individual models [32]–[34]. While the primary motivation of MEM is not personalized modeling but rather accounting for the hierarchical structure in the variation or correlation among observations, a modern view of MEM also indicates that it provides an approach for personalized modeling, as it incorporates a level-two distribution model to characterize the interrelations of the level-one individuals. Specifically, for LR, the mixed-effects LR model (logistic MEM) [27], [35] assumes that the random parts of the individual parameters $\beta_i$ are independently and identically distributed, sampled from a multivariate normal distribution, i.e., $\beta_i \sim N(0, \Sigma)$, where $\Sigma$ denotes a covariance matrix. Thus, it is of interest to study the relationship between our proposed LogSCM and the logistic MEM. We can prove that the objective function of LogSCM is equivalent to that of the logistic MEM under certain conditions, namely $w_{lm} = 1/(\lambda N)$ for all pairs of individuals and $\Sigma = QQ^\top$.

Theorem 1: The objective function of the LogSCM is equivalent to the objective function of the mixed-effects LR model when W is a matrix with all entries equal to $1/(\lambda N)$ and $\Sigma = QQ^\top$.

The proof of the theorem is provided in the Supplementary Material (Appendix A). Theorem 1 gives a useful insight into our proposed collaborative learning approach’s unique capability for studying heterogeneous models compared to logistic MEM. First, LogSCM provides greater flexibility for incorporating information sources, as the similarity matrix can be freely formed. MEM essentially assumes that W is a matrix with all entries equal to $1/(\lambda N)$. This is not a surprise, because the fundamental assumption of the mixed-effects logistic model is that $\beta_i \sim N(0, \Sigma)$, which treats all individuals equally as independent samples from the same distribution. Furthermore, our model explicitly shows the commonalities and differences in the population by providing explicit forms of the canonical models and the membership vectors, whereas the logistic MEM encapsulates the heterogeneity into random effects. On the other hand, although Theorem 1 reveals the hidden relationship between LogSCM and logistic MEM, it does not indicate that logistic MEM is simply a special case of LogSCM: the logistic MEM can employ different forms of the covariance matrix, which leads to a different model from LogSCM. As such, LogSCM can be considered a knowledge-driven logistic MEM with the extra capability to incorporate the canonical structure and flexible similarity information.

III. Parameter Estimation Algorithm

The formulation of LogSCM shown in (5) has a structure that could be utilized to develop a computational algorithm. Specifically, we note that if we iteratively optimize for Q and C in alternation, the optimization problem could be decomposed into two easier subproblems. This strategy has been exploited in [1]–[3] for linear regression models and has shown promising performances.

A. Estimation Step for Canonical Models (Q Step)

In this step, we focus on solving Q with a given C*, i.e., C* could be the latest estimate of C. Given C*, the regularization term $\mathrm{Tr}(C L C^\top)$ in (5) is a constant. Therefore, the original problem reduces to the subproblem

$$\min_{Q}\ \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \left\{ \log\left(1+\exp(x_{ij}^\top Q c_i)\right) - y_{ij}\left(x_{ij}^\top Q c_i\right) \right\}. \quad (6)$$

To further reveal its structure and show how this optimization problem can be solved, we define

$$\tilde{x}_{ij} = \tilde{X}_{ij}^\top c_i \in \mathbb{R}^{pK \times 1}$$

where

$$\tilde{X}_{ij} = \begin{bmatrix} x_{ij}^\top & 0 & \cdots & 0 \\ 0 & x_{ij}^\top & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_{ij}^\top \end{bmatrix} \in \mathbb{R}^{K \times pK}.$$

Furthermore, denote $q \in \mathbb{R}^{pK \times 1}$ as the vectorized Q (its columns stacked), and it is not hard to see that $x_{ij}^\top Q c_i = \tilde{x}_{ij}^\top q$. Thus, (6) can be simplified to

$$\min_{q}\ \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \left\{ \log\left(1+\exp(\tilde{x}_{ij}^\top q)\right) - y_{ij}\left(\tilde{x}_{ij}^\top q\right) \right\}. \quad (7)$$

This is a weighted sum of logistic losses, and the logistic loss is known to be convex [36]. Therefore, (7) can be solved by many off-the-shelf algorithms. In this article, we use the R package CVXR for specifying and solving convex programs [37] to solve the problem in (7).
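As a minimal sketch of this Q step (not the authors' implementation, which uses CVXR), the base-R code below builds the transformed covariates x̃_ij = kron(c_i, x_ij), so that x_ij⊤ Q c_i = x̃_ij⊤ q, and minimizes the weighted logistic loss in (7) with optim. The data structures X, y, and C_mat follow the assumptions of the earlier sketch.

```r
# Q step: minimize the weighted logistic loss (7) over the vectorized q.
q_step <- function(X, y, C_mat, p, K, q_init = rep(0, p * K)) {
  N <- length(X)
  # Stack x_tilde_ij = kronecker(c_i, x_ij) row by row, with weights 1/n_i.
  Xt <- do.call(rbind, lapply(seq_len(N), function(i) {
    t(apply(X[[i]], 1, function(x_ij) kronecker(C_mat[, i], x_ij)))
  }))
  yv <- unlist(y)
  w  <- unlist(lapply(seq_len(N), function(i) rep(1 / nrow(X[[i]]), nrow(X[[i]]))))
  obj <- function(q) {                        # objective of Eq. (7)
    eta <- as.vector(Xt %*% q)
    sum(w * (log1p(exp(eta)) - yv * eta))
  }
  grad <- function(q) {                       # its gradient
    eta <- as.vector(Xt %*% q)
    as.vector(t(Xt) %*% (w * (plogis(eta) - yv)))
  }
  fit <- optim(q_init, obj, grad, method = "BFGS")
  matrix(fit$par, nrow = p, ncol = K)         # reshape q back into the p x K matrix Q
}
```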

B. Estimation Step for Membership Vectors (C Step)

In this step, we focus on solving C with a given Q*. We briefly show how to solve C using a closed-form updating rule here; the detailed derivation can be found in the Supplementary Material (Appendix B). Given Q*, the Lagrangian function of the original formulation in (5) can be derived as

$$\mathcal{L} = \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \left\{ \log\left(1+\exp(x_{ij}^\top Q c_i)\right) - y_{ij}\left(x_{ij}^\top Q c_i\right) \right\} + \lambda\, \mathrm{Tr}\left(C L C^\top\right) + \sum_{i=1}^{N} \eta_i \left(\|c_i\|_1 - 1\right)$$

by introducing the Lagrangian multiplier $\eta_i$ for the constraint $\|c_i\|_1 = 1$. The optimal C must satisfy the complementary condition, i.e., $(\partial \mathcal{L} / \partial c_{ik})\, c_{ik} = 0$:

$$\frac{1}{n_i} \sum_{j=1}^{n_i} \left[ \frac{\exp(x_{ij}^\top Q c_i)}{1+\exp(x_{ij}^\top Q c_i)} - y_{ij} \right] \left(Q^\top x_{ij}\right)_k c_{ik} + 2\lambda \left((C L)_i\right)_k c_{ik} + \eta_i c_{ik} = 0. \quad (8)$$

Then, with $L = D - W$ and the constraint $\|c_i\|_1 = 1$, the closed form of $\eta_i$ is

$$\eta_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \left\{ y_{ij}\left(x_{ij}^\top Q c_i\right) - \frac{\exp(x_{ij}^\top Q c_i)}{1+\exp(x_{ij}^\top Q c_i)} \left(x_{ij}^\top Q c_i\right) \right\} - 2\lambda (C D)_i^\top c_i + 2\lambda (C W)_i^\top c_i. \quad (9)$$

Plugging this expression for the multiplier $\eta_i$ into the complementary condition (8), we obtain the following updating rule, similar to those in [1]–[3]:

$$c_{ik}^{m+1} = c_{ik}^{m} \times \frac{\frac{1}{n_i}\sum_{j=1}^{n_i}\left[-\delta_-\!\left(\pi_{ij}^m (Q^\top x_{ij})_k\right) + \delta_+\!\left(y_{ij}(Q^\top x_{ij})_k\right)\right] + \frac{1}{n_i}\sum_{j=1}^{n_i}\left[-\delta_-\!\left(y_{ij}\, x_{ij}^\top Q c_i^m\right) + \delta_+\!\left(\pi_{ij}^m\, x_{ij}^\top Q c_i^m\right)\right] + 2\lambda\left((C^m W)_i\right)_k + 2\lambda (C^m D)_i^\top c_i^m}{\frac{1}{n_i}\sum_{j=1}^{n_i}\left[\delta_+\!\left(\pi_{ij}^m (Q^\top x_{ij})_k\right) - \delta_-\!\left(y_{ij}(Q^\top x_{ij})_k\right)\right] + \frac{1}{n_i}\sum_{j=1}^{n_i}\left[\delta_+\!\left(y_{ij}\, x_{ij}^\top Q c_i^m\right) - \delta_-\!\left(\pi_{ij}^m\, x_{ij}^\top Q c_i^m\right)\right] + 2\lambda\left((C^m D)_i\right)_k + 2\lambda (C^m W)_i^\top c_i^m} \quad (10)$$

where $\pi_{ij}^m = \exp(x_{ij}^\top Q c_i^m) / \left(1 + \exp(x_{ij}^\top Q c_i^m)\right)$.

Here, $\delta_+(\cdot)$ is defined as $\delta_+(x) = \max(x, 0)$ and $\delta_-(\cdot)$ as $\delta_-(x) = \min(x, 0)$. Equation (10) is derived from (8) and (9), i.e., the complementary condition and the normalization constraint on the membership vector; therefore, (10) is a necessary condition for solving (5), and its fixed points are stationary points. In addition, by splitting terms with the δ-functions, the numerator and the denominator are both nonnegative. Therefore, given any positive initialization C(0), the nonnegativity of Cm is guaranteed.
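A minimal R sketch of this C step for the unregularized case (λ = 0, i.e., LogCM) is given below; it applies the update (10) to one individual's membership vector. The final renormalization is a numerical safeguard added only in this sketch, and the data structures (X_i, y_i) are illustrative assumptions rather than the authors' code.

```r
delta_plus  <- function(x) pmax(x, 0)   # delta_+(x) = max(x, 0)
delta_minus <- function(x) pmin(x, 0)   # delta_-(x) = min(x, 0)

# One multiplicative update (10) with lambda = 0 for individual i:
# X_i is the n_i x p design matrix, y_i the binary outcomes, Q the current
# p x K canonical matrix, c_i the current membership vector.
update_membership <- function(X_i, y_i, Q, c_i) {
  A   <- X_i %*% Q                      # row j holds (Q' x_ij)' for k = 1, ..., K
  u   <- as.vector(A %*% c_i)           # u_j = x_ij' Q c_i
  pi_ <- plogis(u)                      # current predicted probabilities
  K   <- ncol(Q)
  c_new <- numeric(K)
  for (k in 1:K) {
    a_k <- A[, k]
    num <- mean(-delta_minus(pi_ * a_k) + delta_plus(y_i * a_k)) +
           mean(-delta_minus(y_i * u)   + delta_plus(pi_ * u))
    den <- mean( delta_plus(pi_ * a_k)  - delta_minus(y_i * a_k)) +
           mean( delta_plus(y_i * u)    - delta_minus(pi_ * u))
    c_new[k] <- c_i[k] * num / max(den, 1e-12)
  }
  c_new / sum(c_new)                    # renormalization: numerical safeguard in this sketch
}
```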

Thus, we derive an algorithm that iteratively optimizes for C and Q. An overview of the algorithm is shown in Fig. 2.

Fig. 2. Procedure of the proposed algorithm for learning LogCM and LogSCM.

C. Empirical Guidelines for Implementing the Algorithm

In concluding this section, we introduce a few important empirical guidelines for implementing the algorithm.

1). Initialization of the Parameters:

One important issue is the initialization of the two-step computational algorithm shown in Fig. 2, i.e., determining the initial values of the canonical matrix Q and the membership matrix C. If there are sufficient data from each individual to obtain reliable estimates of the individuals’ regression coefficients, we can employ clustering techniques such as the k-means method on the estimated regression coefficient vectors. The clustering algorithm will identify K vectors as the centers of the K clusters, which serve as our initial values of Q. On the other hand, if data are limited, which is more likely in practice, we recommend using the mixed-effects LR model to obtain estimates of the individuals’ regression coefficients.

Next, the resemblances between the regression coefficient vector of an individual i and the center vectors of the clusters can be calculated and further normalized to obtain the initial values of $c_i$. In this article, inspired by the assumption of the canonical structure, i.e., individual models are combinations of the canonical models, we obtain the initial membership matrix C by solving the following optimization problem:

$$\min_{c_i}\ \sum_{i} \left\| \beta_i - Q^{(0)} c_i \right\|^2 \quad \text{s.t.}\ \ c_i \geq 0,\ \|c_i\|_1 = 1,\ i = 1, \ldots, N \quad (11)$$

where Q(0) is the initial value of Q and βi is the regression coefficient vector of individual i.
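A minimal sketch of this initialization is given below: rough per-individual coefficient estimates (e.g., from a mixed-effects logistic model) are clustered with k-means to obtain Q(0), and (11) is then solved for each c_i. Handling the simplex constraints through a softmax reparameterization with optim is an illustrative choice made for this sketch, not the authors' implementation.

```r
# B_hat: N x p matrix of rough per-individual coefficient estimates; K: number of canonical models.
init_parameters <- function(B_hat, K) {
  km <- kmeans(B_hat, centers = K, nstart = 20)
  Q0 <- t(km$centers)                            # p x K initial canonical matrix Q(0)
  softmax <- function(a) exp(a) / sum(exp(a))
  C0 <- apply(B_hat, 1, function(beta_i) {       # solve (11) for each individual
    fit <- optim(rep(0, K), function(a) sum((beta_i - Q0 %*% softmax(a))^2))
    softmax(fit$par)                             # c_i >= 0 and sums to 1 by construction
  })                                             # K x N initial membership matrix C(0)
  list(Q0 = Q0, C0 = C0)
}
```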

2). Determination of K:

In some applications [2], we have prior knowledge to help determine the number of canonical models. Without reliable prior knowledge, we can obtain the optimal K using model selection methods, such as the AIC/BIC criteria or cross-validation. For instance, in our numerical studies in Sections IV and V, we use fivefold cross validation to evaluate a range of K (for instance, from 2 to 10) in terms of prediction accuracy, using metrics such as the error rate and the area under the receiver operating characteristic (ROC) curve, i.e., the area under the curve (AUC) value. Specifically, the training set is randomly divided into five sets. At each fold, one of them serves as the validation set; models are learned on the other four sets, and the prediction accuracy is reported on the validation set. By comparing the average accuracy for each candidate K, we determine which fits the training data best in terms of prediction accuracy, i.e., lowest error rate or highest AUC.

There are applications where the canonical structure may be uneven, e.g., there are minor subpopulations that may not be adequately represented in the data and thereby cannot be effectively discovered by automatic methods such as cross validation. This challenging situation needs a specialized approach. In Section VI, we develop a method called the pairwise-fusion LogCM (LogPCM) [19], [38] to help determine the number of canonical models and discover the canonical structure. Starting with a large K, LogPCM incorporates regularization on the differences between candidate canonical models and merges similar models into one. It is inspired by the idea of the path solution trajectory commonly used in sparse learning and reveals the structure of the canonical models, i.e., a graphical tool is derived for researchers to look for a proper K that meets their needs.

3). Acquisition of the Similarity Matrix W:

In some applications, the similarity matrix W is already known through expert opinion or previous studies. We could also quantify the similarities between individuals using personal characteristics such as demographic and socioeconomic factors, and other factors depending on the application context [39], [40]. To define similarity, existing approaches, including 0-1 weighting, heat kernel weighting, and dot-product weighting, can be used. For instance, denoting the personal characteristics of individual i as $z_i$, the heat kernel is defined as $w_{lm} = \exp\left(-\|z_l - z_m\|^2 / \sigma^2\right)$, and the dot product is $w_{lm} = z_l^\top z_m$. The 0-1 weighting treats, for each individual, its k nearest neighbors as equally similar and assigns their similarities as 1; for all others, the similarities are 0. We recommend using the heat kernel weighting in practice when W is not available, because it creates a continuous similarity metric and has a tuning parameter $\sigma^2$ that can adapt to the data. On the other hand, even if personal characteristics are not available, we can obtain the similarities using a data-driven approach developed in [2]. The idea is to treat the estimated regression coefficients of the individuals as personal characteristics and calculate the similarities accordingly. In the numerical studies, we use a mixed-effects logistic model to obtain initial estimates of the individuals’ regression coefficients and use heat kernel weighting to calculate the similarity matrix W.
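A short R sketch of the heat kernel weighting is given below; defaulting σ² to the median squared pairwise distance is an assumption made for illustration, not a choice prescribed by the article.

```r
# Z: N x d matrix of personal characteristics; returns the N x N similarity matrix W.
heat_kernel_similarity <- function(Z, sigma2 = NULL) {
  D2 <- as.matrix(dist(Z))^2                                 # squared Euclidean distances
  if (is.null(sigma2)) sigma2 <- median(D2[upper.tri(D2)])   # illustrative default for sigma^2
  W <- exp(-D2 / sigma2)                                     # w_lm = exp(-||z_l - z_m||^2 / sigma^2)
  diag(W) <- 0                                               # no self-similarity
  W
}
```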

IV. Simulation Studies

To evaluate the model performance, we conduct extensive simulation studies in this section and will show the performance in a real-world case of learning personal travel preferences in Section V. We compare our proposed LogCM and LogSCM models with several benchmark methods, including: 1) the one-size-fits-all LR model that treats all individuals homogeneously, pooling all the individuals’ data together to learn one population model; 2) the mixed-effects LR model (logistic MEM), which considers that the coefficients of individuals are sampled from a certain distribution; and 3) the independent logistic regression model (ILM) that learns the regression coefficients of each individual solely based on his/her own data.

In the simulation, where the true parameters are known, we can evaluate the model performance by examining the differences between the learned coefficients and the real ones. We use the average absolute error, defined as $(1/N)\sum_{i=1}^{N} |\beta_i - \hat{\beta}_i|$, and the average correlation, $(1/N)\sum_{i=1}^{N} \rho(\beta_i, \hat{\beta}_i)$. A better model will lead to a smaller average absolute error and a higher average correlation, reflecting smaller gaps between the estimated and real values.

Besides, several prediction accuracy metrics can also reflect model performance. As the outcomes considered in this article are binary, we evaluate the models using the error rate and the ROC curve. The ROC curve plots the true positive rate (TPR, also known as sensitivity) against the false positive rate (FPR, which equals 1 − specificity). We further extract the AUC from the ROC curve for each model; the larger the AUC value, the better the model in terms of prediction accuracy. Since the prediction-accuracy evaluations (error rate and AUC value) do not require knowing the true parameters, they can also be used to evaluate model performance in real-world cases. The experiments were conducted in R (version 3.4.2) on an Intel Core i5, 8-GB, 2.40-GHz PC, and the running times are reported in the results as well.

A. Design of the Simulation Experiments

We conduct a comprehensive set of experiments to evaluate the performance of our proposed methods with benchmark methods in a variety of scenarios. We design the following guidelines to generate data. A detailed note for data generation is given in the Supplementary Material (Appendix C).

With any given number of canonical models, for example, K = 3, we manually set the parameters of the canonical models encoded in Q to make sure that they are sufficiently different, as the canonical models are assumed to represent different preferences, behaviors, or mechanism patterns. Then, for generating C, the Dirichlet distribution is utilized to meet the constraints for membership vectors, $\|c_i\|_1 = 1$ and $c_i \geq 0$. To ensure heterogeneity among individuals, we design three distinct Dirichlet distributions: F1(c) ~ Dir(υ, 1, 1), F2(c) ~ Dir(1, υ, 1), and F3(c) ~ Dir(1, 1, υ) for K = 3, with a large tuning parameter υ = 20. Each individual is first assigned randomly to one of the designed Dirichlet distributions, and the parameter vector is then obtained by βi = Qci. With this generation procedure, the canonical structure is guaranteed to be present in the individual models. The xij’s are generated from multivariate normal distributions, and the yij’s are calculated accordingly with small normally distributed noise. We consider one more layer of complexity, namely the balance of the two classes in the binary outcomes. For balanced data, the two labels are of similar sizes (50% each), and for imbalanced data, the percentage of one label is designed to be around 80%.

We generate 40 data points for each individual and randomly pick 10 for testing. For the remaining 30 data points, two realistic scenarios are considered, i.e., dense sampling [data size M ~ Unif(21, 30)] and sparse sampling [M ~ Unif(6, 12)]. Sparse sampling is designed to refer to the application contexts in which there are only a few data points for each individual.
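A simplified R sketch of this data-generating procedure is shown below; the constants, the noise level, and the way each individual's sample size is drawn are illustrative choices under the design just described, not the exact settings used in the article.

```r
set.seed(1)
p <- 5; K <- 3; N <- 90; upsilon <- 20
Q <- matrix(rnorm(p * K, sd = 2), p, K)                # well-separated canonical models (illustrative)
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), alpha); g / sum(g) }  # one Dirichlet draw

sim_one <- function(n_i) {
  alpha <- rep(1, K); alpha[sample(K, 1)] <- upsilon   # pick one of the K Dirichlet designs at random
  c_i    <- rdirichlet1(alpha)                         # membership vector on the simplex
  beta_i <- as.vector(Q %*% c_i)                       # individual parameters beta_i = Q c_i
  X <- matrix(rnorm(n_i * p), n_i, p)
  y <- rbinom(n_i, 1, plogis(X %*% beta_i + rnorm(n_i, sd = 0.1)))  # small noise on the utility
  list(X = X, y = y, c_i = c_i, beta_i = beta_i)
}

dense_data  <- lapply(1:N, function(i) sim_one(sample(21:30, 1)))   # dense sampling
sparse_data <- lapply(1:N, function(i) sim_one(sample(6:12, 1)))    # sparse sampling
```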

B. Simulation Results

In applying our methods on the simulated data, we test two scenarios: one assumes that the number of canonical models K and the similarity matrix W are known from prior knowledge, and the other assumes that no such knowledge is available, so both must be obtained in a data-driven way. As discussed in Section III-C, we use cross validation to determine K for the LogCM and LogSCM models, choosing the one with the highest average accuracy on the validation sets, and derive the similarities between individuals from the coefficients estimated by logistic MEM using the heat kernel function. Cross validation is also used to determine the tuning hyperparameter λ for LogSCM.

Table I summarizes the results for balanced data with K = 3 and p = 5. We also conducted simulation experiments for other values, such as K = 5, 10 and p = 10, 50, 100, and similar results were observed. In Table I, our LogCM and LogSCM models learned with estimated similarities are labeled uLogCM and uLogSCM, respectively. While not shown in the table, by applying the cross-validation technique based on average AUC, we can successfully identify K = 3, which is the ground-truth number of canonical models. The results for imbalanced data with the same setting are provided in the Supplementary Material (Appendix D); there, K = 3 can also be automatically learned.

TABLE I.

Model Performance Comparison on Testing Set of the Simulated Balanced Data

LR MEM ILM LogCM LogSCM uLogCM uLogSCM
(LogCM and LogSCM are learned with the known similarity matrix; uLogCM and uLogSCM with the estimated one.)
Dense Sampling
Time(sec) 0.026 72.29 0.238 52.38 55.73 49.05 60.37
Absolute Error 4.769 5.578 4.311 1.684 1.589 1.684 1.707
Correlation 0.413 0.465 0.580 0.921 0.925 0.921 0.925
Error Rate 0.253 0.453 0.200 0.137 0.137 0.137 0.137
AUC 0.832 0.555 0.855 0.955 0.958 0.955 0.951

Sparse Sampling
Time (sec) 0.025 32.12 0.213 82.57 43.72 79.42 37.39
Absolute Error 4.745 4.257 5.398 3.973 2.606 3.973 2.906
Correlation 0.411 0.506 0.434 0.557 0.758 0.557 0.721
Error Rate 0.240 0.257 0.213 0.130 0.110 0.130 0.117
AUC 0.749 0.624 0.865 0.947 0.949 0.947 0.954

The following can be observed from Table I and Table S.I (Supplementary Material).

  1. Overall, the LogSCM model outperforms the others since it can exploit the canonical structure of the individual models and the similarity information between individuals to enhance model estimation.

  2. When data are more imbalanced and sparser, the advantage of LogSCM generally becomes larger, showing that a knowledge-driven model can overcome the lack of observations.

  3. When the canonical structure is significant, i.e., υ is large, the proposed LogCM and LogSCM are better than other benchmark models in terms of both prediction accuracy and parameter learning, showing their efficacy in exploiting the canonical structure for better modeling.

  4. The learned similarity matrix (in uLogSCM) can also help enhance the model estimation to some degree, especially in sparse cases, although it may not be as good as the truly known similarity information.

Furthermore, while it is not shown in the tables, we frequently observed that in some sparse sampling or imbalanced scenarios, the ILM and logistic MEM could not even be applied, as the lack of data points results in highly ill-conditioned matrix operations that lead to premature breakdown of their computational algorithms.

V. Real-World Case Study

In this section, we apply our proposed methods to a real-world data set collected in [8]. The study investigated 1956 individuals about their preferences in daily commuting. For each individual, the experiment offered one alternative trip plan at a time to compare with the individual’s original plan. The alternative option includes a change in departure time, an amount of travel time saved on the road, and a certain amount of reward points to encourage the user to choose the new plan. Each time, the respondent was asked to choose between the two alternatives. Fig. 3 shows an exemplary choice scenario from the experiment, and Table II shows several rows of the data. The primary goal of this study is to learn personalized RUM models of individuals’ decisions in selecting among different transportation alternatives. As the RUM models of the individuals in [8] were learned from each individual’s own data separately, here, we aim to evaluate whether our proposed models can learn the individual models better.

Fig. 3. Example of the alternatives and the choice question: the original plan and one promoted alternative plan with a certain amount of incentive.

TABLE II.

Examples of the Travel Behavior Preferences Data

Delay Early (min) Delay Late (min) Travel Time Saving (min) Reward Points Choice Respondent ID Question Number
30 0 5 40 B 1 1
0 30 5 40 A 1 2
10 0 5 40 B 1 3

The full study interviewed each respondent for up to 13 rounds, and in each round, two alternatives were compared. After proper data cleaning, 828 individuals’ data [8] are investigated in our study, all of whom answered all 13 questions. We use the first ten rounds as training data and leave the last three for testing. There are four predictors: schedule delay early, schedule delay late, travel time saving, and the amount of reward points. The separation of delay early and delay late is needed because people have been found to have different preferences for departing earlier or later when asked to change their travel plan [41]. As the number of canonical models is unknown, we apply the same cross-validation procedure used in the simulation studies to determine K for LogCM and LogSCM. Fig. 4 (left) shows the results of fivefold cross validation for determining K for LogCM, with K ranging from 2 to 12. The black dot indicates the average AUC on the validation sets across all five folds, and the error bar indicates the maximum and minimum. When K = 8, the average AUC on the validation sets reaches its maximum, and the variation across the five folds is also smaller than for the other values. The similarities between individuals are derived from the coefficients estimated by logistic MEM, as recommended. We then use the same cross-validation procedure to determine the tuning hyperparameter λ. Fig. 4 (right) shows the results for LogSCM with λ ranging from e^{-4} ≈ 0.018 to e^{2} ≈ 7.389. When λ = e^{-0.21} ≈ 0.811, the average AUC reaches its maximum, although the differences are not significant compared with other values of λ.

Fig. 4. Determining the value for K and λ based on average AUC on validation sets using fivefold cross-validation technique, on the travel behavior preferences data. Left: determining K. Right: determining λ for LogSCM with K = 8.

The algorithms for both LogCM and LogSCM converge quickly; on these data, both converge within five rounds (ten steps). Figures showing the convergence performance can be found in the Supplementary Material (Appendix E).

We compare our proposed models with the benchmark models mentioned before: the one-size-fits-all LR, the mixed-effects LR (logistic MEM), and the independent LR model (ILM). Unlike in the simulation studies, where we know the true values of the parameters of the individual models, in this real-world application we do not have such knowledge. Thus, we focus only on prediction accuracy on the testing data. The error rate and AUC value are adopted, as well as the mean squared error (MSE). MSE is quite popular for judging the goodness-of-fit of a model; a lower MSE indicates higher prediction accuracy. When applied to LR, it is calculated from the probabilities given by the learned model, i.e., $(1/N)\sum_{i=1}^{N} (1/n_i) \sum_{j=1}^{n_i} \left(y_{ij} - \pi_i(x_{ij})\right)^2$.

Table III summarizes the results. We can observe that the one-size-fits-all model (LR) has the worst prediction performance, and that LogCM and LogSCM outperform the others in terms of all three metrics. However, no significant difference between LogCM and LogSCM is observed here. This may be because the similarity information is learned from the data and the data for each individual are not large enough (only ten questions). It is also consistent with what we observed in the cross-validation procedure for λ: the similarity information here may not improve the model performance much, since different λ’s show quite similar performance on the validation sets.

TABLE III.

Model Performance Comparison on Testing Set of the Travel Behavior Preferences Data

LR MEM ILM LogCM LogSCM
Time (sec) 0.034 900.5 1.658 75.67 762.7
MSE 0.195 0.269 0.151 0.126 0.125
Error Rate 0.289 0.268 0.157 0.154 0.150
AUC 0.672 0.742 0.836 0.852 0.851

We further conduct another experiment to evaluate the effectiveness of the models using different sample sizes. Although in this complete training set ten questions were collected from each individual, it is likely that in a real-world implementation of the travel behavior intervention system, the number of observations for an individual may be even smaller and sometimes imbalanced. To account for these complexities, we further randomly eliminate 30% (dense sampling, 70% remaining) and 50% (sparse sampling, 50% remaining) of the data from the training set. Fewer than five questions per individual would make it impossible to run the individual LR with four variables or to run fivefold cross validation. We repeat the whole learning procedure (with cross validation for determining K and λ) on these new sparse data. Table IV summarizes the results of these models, and the tendency can be seen in Fig. 5.

TABLE IV.

Model Performance Comparison on Testing Set of the Sampled Travel Behavior Preferences Data

LR MEM ILM LogCM LogSCM
Dense Sampling
MSE 0.195 0.306 0.187 0.135 0.134
Error Rate 0.290 0.306 0.192 0.159 0.158
AUC 0.671 0.706 0.780 0.844 0.840

Sparse Sampling
MSE 0.196 0.220 0.224 0.164 0.159
Error Rate 0.291 0.317 0.228 0.191 0.182
AUC 0.670 0.681 0.733 0.822 0.823

Fig. 5. Change of AUC of the models on the testing set of the travel behavior preference data, under different levels of missing rate of the training set.

The following can be observed.

  1. The prediction performances of all models become worse when the training data become more sparse.

  2. Our proposed models outperform the others in all three cases.

  3. Fig. 5 shows that the accuracy of ILM and MEM drops very fast, whereas the accuracy of LR remains low but stable. This is because LR pools all individuals’ data and therefore does not suffer from limited data size. Meanwhile, our models show a high and stable capacity for learning with insufficient data.

  4. LogSCM overall is slightly better than LogCM since it incorporates additional information in the data. This advantage is more noticeable when the data are very scarce, consistent with what we have observed in simulation studies.

As the similarity matrix is also learned from the data, we can expect an even better performance of LogSCM if valuable prior knowledge allows us to derive the similarity information. In addition, LogCM and LogSCM can provide explicit insight into some common behavioral preferences in the population through the explicit modeling of the canonical models. In all, the study shows that our proposed models are more powerful in personalized modeling and prediction, and the advantage is even more obvious when learning with limited data.

VI. Extension to Uneven Canonical Structure by Pairwise-Fusion Collaborative Model (LogPCM)

The proposed LogCM and LogSCM models use cross validation to select the number of canonical models. This implicitly assumes that the canonical structure is balanced and even; in other words, the subpopulations that correspond to the canonical models are of similar sizes. If this is not the case, there will be minority canonical models that only represent patterns in minor subpopulations, and they can be hard to identify by cross validation, since the increased model complexity (i.e., more canonical models) would not be justified by the minor gain in model performance.

To address this problem, our strategy is to reveal the whole structure rather than identifying one best choice. This is inspired by the idea of path solution trajectory that has been used in sparse learning literature. We begin with a large K and incorporate pairwise-fusion regularization [19], [38] to penalize the pairwise differences of candidate canonical models. Similar to LASSO or group LASSO [20] methods, with a proper penalty, small differences can be shrunk to zero. In other words, the canonical models with small differences will be fused into one. Thus, we name this method the pairwise-fusion logistic collaborative model (LogPCM). A visual tool is also introduced to show how LogPCM can help researchers get insights into the canonical structures.

A. Formulation of LogPCM

Equation (12) shows the formulation of LogPCM

$$\min_{C, Q}\ \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \left\{ \log\left(1+\exp(x_{ij}^\top Q c_i)\right) - y_{ij}\left(x_{ij}^\top Q c_i\right) \right\} + \frac{\mu}{2} \sum_{k_1, k_2}^{K} \left\| q_{k_1} - q_{k_2} \right\|_2 \quad \text{s.t.}\ \ c_i \geq 0,\ \|c_i\|_1 = 1,\ i = 1, \ldots, N. \quad (12)$$

Here, $q_{k_1}$ and $q_{k_2}$ are two candidate canonical models, i.e., two columns of the matrix Q, and μ ≥ 0 is the tuning parameter controlling the tradeoff between the prediction loss and the pairwise differences of the canonical models. The (unsquared) L2-norm penalty exploits the nondifferentiability at $q_{k_1} - q_{k_2} = 0$, setting the difference to exactly 0 when it is small enough.

To implement LogPCM, we start at a large K and a small μ. Then, gradually increase μ to see how the canonical models evolve, i.e., with the shrinkage effect of the L2-norm, similar candidate canonical models will be fused into one, and eventually when μ is large enough, all canonical models will merge into one. Note that the definition of K here is slightly different from the K in LogCM: it is not the number of final canonical models, but rather an estimated upper bound.
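A skeleton of this path procedure is sketched below: starting from a generous K0, the fusion-penalized model is refit over an increasing grid of μ, and the effective number of canonical models is counted by merging columns of Q whose pairwise distance falls below a tolerance. The solver fit_logpcm() of (12) is an assumed placeholder here, not code provided by the article.

```r
# Count distinct canonical models by single-linkage merging of near-identical columns of Q.
count_distinct <- function(Q, tol = 1e-3) {
  d <- as.matrix(dist(t(Q)))                          # distances between canonical models (columns)
  length(unique(cutree(hclust(as.dist(d), method = "single"), h = tol)))
}

# Trace the fusion path over an increasing grid of mu.
logpcm_path <- function(X, y, K0 = 15, mus = exp(seq(-4, 6, by = 0.5))) {
  sapply(mus, function(mu) {
    fit <- fit_logpcm(X, y, K = K0, mu = mu)          # assumed solver of Eq. (12) (placeholder)
    count_distinct(fit$Q)                             # effective number of canonical models at this mu
  })
}
```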

B. Parameter Estimation Algorithm for LogPCM

As suggested by (6) and (7), the original Q step vectorizes the matrix Q and thereby simplifies the problem into a weighted LR formulation, which is a convex optimization. The additional L2-norm penalty is also convex with respect to the canonical models $q_k$ (k = 1, …, K) and, therefore, with respect to the vectorized q. To see this, we define auxiliary matrices

$$A^{(ij)} = \left[\, 0_p, \ldots, 0_p, I_p, 0_p, \ldots, 0_p, -I_p, 0_p, \ldots, 0_p \,\right] \in \mathbb{R}^{p \times Kp} \quad (13)$$

where each $0_p$ is a p × p matrix of zeros and $I_p$ is the p × p identity matrix. In total, K such square blocks make up the auxiliary matrix $A^{(ij)}$, where the ith block is the identity $I_p$, the jth block is the negative identity $-I_p$, and all the others are zero blocks $0_p$. For the vectorized $q = [q_1^\top, \ldots, q_K^\top]^\top$, we then have

$$A^{(ij)} q = q_i - q_j. \quad (14)$$

With (7) for the nonpenalized q estimation, the modified Q step simplifies to the following unconstrained convex optimization, which can also be solved by CVXR [37]:

$$\min_{q}\ \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \left\{ \log\left(1+\exp(\tilde{x}_{ij}^\top q)\right) - y_{ij}\left(\tilde{x}_{ij}^\top q\right) \right\} + \frac{\mu}{2} \sum_{k_1, k_2}^{K} \left\| A^{(k_1 k_2)} q \right\|_2. \quad (15)$$
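The small R check below illustrates the role of the auxiliary matrices in (13) and (14): A^(k1,k2) extracts the difference q_{k1} − q_{k2} from the vectorized q, so the penalty in (15) is a sum of L2 norms of linear maps of q. The numbers are purely illustrative.

```r
# Build A^(k1,k2) as in (13) and verify (14): A %*% q equals q_{k1} - q_{k2}.
make_A <- function(k1, k2, p, K) {
  A <- matrix(0, p, p * K)
  A[, ((k1 - 1) * p + 1):(k1 * p)] <-  diag(p)   # identity block at position k1
  A[, ((k2 - 1) * p + 1):(k2 * p)] <- -diag(p)   # negative identity block at position k2
  A
}
p <- 2; K <- 3
Q <- matrix(1:6, p, K)                           # toy canonical matrix
q <- as.vector(Q)                                # vectorized q stacks the columns q_1, ..., q_K
all.equal(as.vector(make_A(1, 3, p, K) %*% q), Q[, 1] - Q[, 3])   # TRUE
```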

C. Simulation Study

We run a simulation study similar to that in Section IV, also with K = 3 and p = 5. However, unlike the simulation studies in Section IV, where individuals are randomly assigned to three groups (each group mainly representing one of the canonical models), here, among the 90 generated individuals, only about 10% correspond to canonical model #3, i.e., P(F3(c)) = 0.1; we call this a minority canonical model.

First, we investigate the performance of cross validation for determining K. Fig. 6 shows that K = 5 is chosen based on the average AUC, which is not the ground truth. As a data-driven method, cross validation is designed to choose the parameters that lead to the highest prediction accuracy, but it may not reveal the right canonical structure.

Fig. 6. Determining the value for K based on average AUC on validation sets using fivefold cross-validation technique, on the simulated data with one minority canonical model.

If the primary goal is prediction, the cross-validation technique can lead to a reasonable result; however, it is less advantageous for knowledge discovery. Fig. 7 (left) shows the evolving plot of the candidate canonical models in LogPCM as the hyperparameter μ is tuned from small to large. It shows the relationships between the canonical models during the fusion progress, such as which ones are fused together. Starting with K0 = 15 (the initial K), as μ grows larger, more candidate canonical models are fused together, and when μ is large enough, they all merge into a one-size-fits-all model, i.e., K = 1. Here, we can see that K = 3 represents a distinct geometric pattern, indicating the possibility that there are three unique canonical models. Using K = 3 would lead to better model estimation, as shown in Table V. Thus, the pairwise-fusion regularization gives us more insight into the data set beyond what is provided by cross validation. This does not mean that K = 3 is the only best answer; it is rather a flexible suggestion for determining K that can be combined with domain knowledge. This revelation of the structure is important knowledge that LogPCM can provide for practitioners to make informed decisions.

Fig. 7. Fusion of the canonical models with increasing μ. Left: fusion progress on the simulated data, K0 = 15. Right: fusion progress on the travel behavior preference data, K0 = 20.

TABLE V.

Parameter Estimation Comparison on Testing Set of the Simulated Data With One Minority Canonical Model

LR MEM ILM LogCM-5 LogCM-3
Absolute Error 4.080 7.243 188.8 3.164 2.715
Correlation 0.849 0.747 0.612 0.928 0.960

D. Real-World Case Study

Similar to Section VI-C, we implement the LogPCM method on the real-world data set used in Section V. Recall that in Section V, the cross-validation technique identified K = 8. Here, the evolving plot shown in Fig. 7 (right) reveals that K = 5 (4.5 < log(μ) < 6) is quite likely. If we seek further refinement of the canonical structure, K = 7 (3 < log(μ) < 4.5) and K = 9 (2 < log(μ) < 3) are also reasonable choices. As we do not know the ground truth in this real-world case study, we test all these alternatives of K on the testing set. The results are shown in Table VI.

TABLE VI.

Model Performance Comparison on Testing Set of the Travel Behavior Preferences Data Using LogCM With Different Values of K

K = 8 5 7 9
MSE 0.126 0.118 0.123 0.121
Error Rate 0.154 0.149 0.156 0.147
AUC 0.852 0.871 0.856 0.858

From Table VI, we can observe that K = 5 outperforms K = 8 in terms of AUC and MSE, while K = 9 achieves the best error rate. The results indicate that the new LogPCM method provides a powerful tool for choosing K, which is flexible in practice because it shows the whole view of the canonical structure. Unlike the cross-validation technique, which is entirely data-driven and determines one value of K based on a given validation criterion (which is both its strength and its limitation), LogPCM reveals more information about the canonical structure and gives practitioners a good reference and the flexibility to make informed decisions.

VII. Discussion and Conclusion

In this article, we propose a new method to learn personalized RUM models from behavior data collected in a recently developed TDM system [8], which has shown that a personalized incentive system results in better TDM outcomes. Our first contribution is that the proposed LogCM addresses the challenge of the lack of observations for personalized modeling: the proposed method allows us to learn individual models from both an individual's own data and the relationships with other individuals' data. Second, our LogCM does not assume a central tendency of the population, which is usually required by many competing methods such as the MEM. Thus, our proposed model has greater flexibility and can better capture individual preferences when considerable heterogeneity exists. Third, we further develop a pairwise-fusion collaborative model, which provides a flexible tool to help discover the canonical structure. We also provide an efficient estimation method for our proposed models, which requires less computation time than logistic MEM, as shown in the simulation study. If the computational power permits, our algorithm can cope with most scenarios; however, its current efficiency is not optimized for some scenarios, such as online real-time computing. Under these circumstances, we recommend a partial updating strategy in which the C step is applied every time and Q is only updated periodically. In the future, we will extend the proposed methods to large-scale online applications with millions of users. We will also develop an optimal experimental design strategy based on the collaborative learning framework to better learn personal behavioral preferences by designing the questions for the next stage. Another potential direction is to extend the framework to incorporate more complicated canonical structures, such as interactions with the system or changing preferences, to name a few.

Supplementary Material


Note to Practitioners—

The proposed methods in this article can learn a distinct behavior model for each user and understand his/her own preferences, even when each user's data are limited. With personalized models, today's personalized service apps can understand, explain, and change user behavior in a more targeted and efficient way. The utility of the method is illustrated by an application to a real-world transportation demand management problem where personalized incentives are assigned to users to change their travel behavior.

Biographies


Jingshuo Feng received the B.S. degree in statistics from the Renmin University of China, Beijing, China, in 2017, and the M.S. degree in industrial and systems engineering from the University of Washington, Seattle, WA, USA, in 2019, where he is currently pursuing the Ph.D. degree in industrial and systems engineering.

His research interests include statistical learning and experimental design.


Xi Zhu received the B.S. degree in civil engineering from Tsinghua University, Beijing, China, in 2013, and the M.S. and Ph.D. degrees in civil engineering (transportation engineering) from the University of Washington, Seattle, WA, USA, in 2015 and 2020, respectively.

Her research interests include human behavior modeling in transportation and experimental design.


Feilong Wang received the B.S. degree in civil engineering from Harbin Engineering University, Harbin, China, in 2013, and the M.S. degree in control science and engineering from Beihang University, Beijing, China, in 2016. He is currently pursuing the Ph.D. degree in transportation engineering with the University of Washington, Seattle, WA, USA.

His research interests include transportation big data analysis and data-driven modeling.


Shuai Huang (Member, IEEE) received the B.S. degree in statistics from the University of Science and Technology of China, Hefei, China, in 2007, and the Ph.D. degree in industrial engineering from Arizona State University, Tempe, AZ, USA, in 2012.

He is currently an Associate Professor with the Department of Industrial and Systems Engineering, University of Washington, Seattle, WA, USA. His research interests include statistical learning and data mining with applications in healthcare and manufacturing.

Dr. Huang is a member of the Institute for Operations Research and the Management Sciences (INFORMS) and the Institute of Industrial and System Engineers (IISE).


Cynthia Chen received the Ph.D. degree in civil and environmental engineering from the University of California at Davis, Davis, CA, USA, in 2001.

She was an Assistant Professor with The City College of New York, New York, NY, USA, from 2003 to 2009. She is currently a Professor with the Department of Civil and Environmental Engineering, University of Washington (UW), Seattle, WA, USA. At UW, she directs the Transportation-Human Interaction-and-Network Knowledge (THINK) lab (https://sites.uw.edu/thinklab). The THINK lab's current research focuses on understanding data, modeling the behaviors of individuals (mobility patterns) and networks (e.g., cascading processes), and designing interventions for modifying individual behaviors and network phenomena. Common to these threads is the development of innovative methodologies.

Footnotes

This article has supplementary material provided by the authors and color versions of one or more figures available at https://doi.org/10.1109/TASE.2020.3041411.

Contributor Information

Jingshuo Feng, Department of Industrial and Systems Engineering, University of Washington, Seattle, WA 98195 USA.

Xi Zhu, Department of Civil and Environmental Engineering, University of Washington, Seattle, WA 98195 USA.

Feilong Wang, Department of Civil and Environmental Engineering, University of Washington, Seattle, WA 98195 USA.

Shuai Huang, Department of Industrial and Systems Engineering, University of Washington, Seattle, WA 98195 USA.

Cynthia Chen, Department of Civil and Environmental Engineering, University of Washington, Seattle, WA 98195 USA.

References

[1] Lin Y, Liu K, Byon E, Qian X, and Huang S, "Domain-knowledge driven cognitive degradation modeling for Alzheimer's disease," in Proc. SIAM Int. Conf. Data Mining, Jun. 2015, pp. 721–729.
[2] Lin Y, Liu K, Byon E, Qian X, Liu S, and Huang S, "A collaborative learning framework for estimating many individualized regression models in a heterogeneous population," IEEE Trans. Rel., vol. 67, no. 1, pp. 328–341, Mar. 2018.
[3] Lin Y, Liu S, and Huang S, "Selective sensing of a heterogeneous population of units with dynamic health conditions," IISE Trans., vol. 50, no. 12, pp. 1076–1088, Dec. 2018.
[4] Katzev R, "Car sharing: A new approach to urban transportation problems," Analyses Social Issues Public Policy, vol. 3, no. 1, pp. 65–86, Dec. 2003.
[5] Giuliano G, "Transportation demand management: Promise or panacea?" J. Amer. Planning Assoc., vol. 58, no. 3, pp. 327–335, Sep. 1992.
[6] Stopher PR, "Reducing road congestion: A reality check," Transp. Policy, vol. 11, no. 2, pp. 117–131, Apr. 2004.
[7] Möser G and Bamberg S, "The effectiveness of soft transport policy measures: A critical assessment and meta-analysis of empirical evidence," J. Environ. Psychol., vol. 28, no. 1, pp. 10–26, Mar. 2008.
[8] Zhu X, Wang F, Chen C, and Reed DD, "Personalized incentives for promoting sustainable travel behaviors," Transp. Res. C, Emerg. Technol., vol. 113, pp. 314–331, Apr. 2020.
[9] Chen C, Ma J, Susilo Y, Liu Y, and Wang M, "The promises of big data and small data for travel behavior (aka human mobility) analysis," Transp. Res. C, Emerg. Technol., vol. 68, pp. 285–299, Jul. 2016.
[10] McFadden D et al., "The revealed preferences of a government bureaucracy: Theory," Bell J. Econ., vol. 6, no. 2, pp. 401–416, 1975.
[11] Ben-Akiva ME and Lerman SR, Discrete Choice Analysis: Theory and Application to Travel Demand, vol. 9. Cambridge, MA, USA: MIT Press, 1985.
[12] Viney R, Lancsar E, and Louviere J, "Discrete choice experiments to measure consumer preferences for health and healthcare," Expert Rev. Pharmacoecon. Outcomes Res., vol. 2, no. 4, pp. 319–326, Aug. 2002.
[13] Parsons GR, Helm EC, and Bondelid T, "Measuring the economic benefits of water quality improvements to recreational users in six northeastern states: An application of the random utility maximization model," Univ. Delaware, Newark, DE, USA, Tech. Rep., 2003.
[14] Guimaraes P, Figueiredo O, and Woodward D, "Industrial location modeling: Extending the random utility framework," J. Regional Sci., vol. 44, no. 1, pp. 1–20, 2004.
[15] Chorus CG, Rose JM, and Hensher DA, "Regret minimization or utility maximization: It depends on the attribute," Environ. Planning B, Planning Des., vol. 40, no. 1, pp. 154–169, Feb. 2013.
[16] Kunnumkal S, "Randomization approaches for network revenue management with customer choice behavior," Prod. Oper. Manage., vol. 23, no. 9, pp. 1617–1633, Sep. 2014.
[17] Hess S, Daly A, and Batley R, "Revisiting consistency with random utility maximisation: Theory and implications for practical work," Theory Decis., vol. 84, no. 2, pp. 181–204, Mar. 2018.
[18] Hensher DA, "Stated preference analysis of travel choices: The state of practice," Transportation, vol. 21, no. 2, pp. 107–133, May 1994.
[19] Petry S, Flexeder C, and Tutz G, "Pairwise fused lasso," Dept. Statist., Univ. Munich, Munich, Germany, Tech. Rep. 102, 2011.
[20] Simon N, Friedman J, Hastie T, and Tibshirani R, "A sparse-group lasso," J. Comput. Graph. Statist., vol. 22, no. 2, pp. 231–245, 2013.
[21] Azari H, Parks D, and Xia L, "Random utility theory for social choice," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 126–134.
[22] Guevara CA and Ben-Akiva M, "Addressing endogeneity in discrete choice models: Assessing control-function and latent-variable methods," in Choice Modelling: The State of the Art and the State of Practice. Bingley, U.K.: Emerald Group, 2010.
[23] Ben-Akiva M et al., "Process and context in choice models," Marketing Lett., vol. 23, no. 2, pp. 439–456, 2012.
[24] Hancock TO, Hess S, and Choudhury CF, "Decision field theory: Improvements to current methodology and comparisons with standard choice modelling techniques," Transp. Res. B, Methodol., vol. 107, pp. 18–40, Jan. 2018.
[25] Oh S and Shah D, "Learning mixed multinomial logit model from ordinal data," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 595–603.
[26] Becker F, Danaf M, Song X, Atasoy B, and Ben-Akiva M, "Bayesian estimator for logit mixtures with inter- and intra-consumer heterogeneity," Transp. Res. B, Methodol., vol. 117, pp. 1–17, Nov. 2018.
[27] Have TRT and Localio AR, "Empirical Bayes estimation of random effects parameters in mixed effects logistic regression models," Biometrics, vol. 55, no. 4, pp. 1022–1029, Dec. 1999.
[28] Hedeker D, "A mixed-effects multinomial logistic regression model," Statist. Med., vol. 22, no. 9, pp. 1433–1446, 2003.
[29] Pedhazur EJ and Schmelkin LP, Measurement, Design, and Analysis: An Integrated Approach. London, U.K.: Psychology Press, 2013.
[30] Hsieh FY, Bloch DA, and Larsen MD, "A simple method of sample size calculation for linear and logistic regression," Statist. Med., vol. 17, no. 14, pp. 1623–1634, Jul. 1998.
[31] Cai D, He X, Han J, and Huang TS, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[32] Thomas R, Have T, Kunselman AR, Pulkstenis EP, and Landis JR, "Mixed effects logistic regression models for longitudinal binary response data with informative drop-out," Biometrics, pp. 367–383, 1998.
[33] Gałecki A and Burzykowski T, Linear Mixed-Effects Models Using R: A Step-By-Step Approach. New York, NY, USA: Springer, 2013.
[34] Son J, Zhou Q, Zhou S, Mao X, and Salman M, "Evaluation and comparison of mixed effects model based prognosis for hard failure," IEEE Trans. Rel., vol. 62, no. 2, pp. 379–394, Jun. 2013.
[35] Vermunt JK, "Mixed-effects logistic regression models for indirectly observed discrete outcome variables," Multivariate Behav. Res., vol. 40, no. 3, pp. 281–301, Jul. 2005.
[36] Minka T, "Algorithms for maximum-likelihood logistic regression," Dept. Statist., CMU, Pittsburgh, PA, USA, Tech. Rep. TR 758, 2001.
[37] Fu A, Narasimhan B, and Boyd S, "CVXR: An R package for disciplined convex optimization," 2017, arXiv:1711.07582. [Online]. Available: http://arxiv.org/abs/1711.07582
[38] Ma S and Huang J, "A concave pairwise fusion approach to subgroup analysis," J. Amer. Stat. Assoc., vol. 112, no. 517, pp. 410–423, Jan. 2017.
[39] Wang F, Sun J, Hu J, and Ebadollahi S, "IMet: Interactive metric learning in healthcare applications," in Proc. SIAM Int. Conf. Data Mining, Apr. 2011, pp. 944–955.
[40] Sun J, Sow D, Hu J, and Ebadollahi S, "Localized supervised metric learning on temporal physiological data," in Proc. 20th Int. Conf. Pattern Recognit., Aug. 2010, pp. 4149–4152.
[41] Ben-Elia E and Ettema D, "Changing commuters' behavior using rewards: A study of rush-hour avoidance," Transp. Res. F, Traffic Psychol. Behav., vol. 14, no. 5, pp. 354–368, 2011.
