A Distributed Coordinate Descent Algorithm for Learning Factorization Machine

Kankan Zhao; Jing Zhang; Liangfu Zhang; Cuiping Li; Hong Chen

doi:10.1007/978-3-030-47436-2_66

. 2020 Apr 17;12085:881–893. doi: 10.1007/978-3-030-47436-2_66

A Distributed Coordinate Descent Algorithm for Learning Factorization Machine

Kankan Zhao ^14,¹⁵, Jing Zhang ^14,¹⁵, Liangfu Zhang ^14,¹⁵, Cuiping Li ^14,^15,^✉, Hong Chen ^14,¹⁵

Editors: Hady W Lauw⁸, Raymond Chi-Wing Wong⁹, Alexandros Ntoulas¹⁰, Ee-Peng Lim¹¹, See-Kiong Ng¹², Sinno Jialin Pan¹³

PMCID: PMC7206326

Abstract

Although much effort has been made to implement Factorization Machine (FM) on distributed frameworks, most of them achieve bad model performance or low efficiency. In this paper, we propose a new distributed block coordinate descent algorithm to learn FM. In addition, a distributed pre-computation mechanism incorporated with an optimized Parameter Server framework is designed to avoid the massive repetitive calculations and further reduce the communication cost. Systematically, we evaluate the proposed distributed algorithm on three different genres of datasets for prediction. The experimental results show that the proposed algorithm achieves significantly better performance (3.8%–6.0% RMSE) than the state-of-the-art baselines, and also achieves a 4.6–12.3 Inline graphic speedup when reaching a comparable performance.

Keywords: FM, Block coordinate descent, Distributed framework, Pre-computation

Introduction

Although there exists some research on adapting FM to the distributed frameworks, the problem remains largely unsolved. As we know, most of them implemented the (stochastic) gradient descent (SGD) to optimize the FM model, which is limited by the appropriate learning rate. Especially when the dataset and model space are quite large, an inappropriate learning rate may waste a lot of time in searching the optimal solution. Although some solutions can adjust the learning rate adaptively, the slow convergence rate is intolerable. Comparing with gradient optimization, coordinate descent (CD) can avoid setting learning rate, which makes CD converge faster and perform better in convex model. Thus, a distributed algorithm for solving FM by the CD algorithm is worth studying.

In this paper, we propose a novel CD algorithm under the Parameter Server (PS) framework to learn FM model. It divides all the parameters into blocks by the specific blocking scheme, and then updates one block at a time when maintaining the other blocks unchanged. We call this method as Block Coordinate Descent (BCD). It achieves better model performance within the less time. However, the high communication cost and massive repetitive calculations still affect the efficiency of framework. To address the above drawbacks, we design a distributed pre-computation mechanism incorporated with an optimized PS framework, by which we can avoid massive repetitive calculations and reduce the communication cost.

The main contributions of this paper can be listed as:

We propose a novel distributed BCD method to optimize the FM under the PS. By dividing all the parameters into blocks according to the specific blocking scheme, and updating the parameters in a synchronous way, our proposed approach can achieve better accuracy and efficiency.
We design a distributed pre-computation mechanism incorporated with the optimized PS framework, by which our approach can not only avoid massive repetitive calculations but also reduce the parameter exchanges.
The experimental results show that our algorithm achieves better RMSE performance (3.8%–6.0%) than the state-of-the-art methods, and achieves a 4.6–12.3 speedup to obtain a comparable RMSE performance.

Preliminary

This section describes the FM model and its CD optimization.

Factorization Machine

Suppose the training set of a prediction problem is formulated by Inline graphic , where each pair represents an instance with p-dimension variables and its target value (or label) y, then a FM model of order is defined as:

where notation Inline graphic is the global bias, models the weight of the j-th variable, and represents a k-dimension interaction weight vector of j-th variable.

Give each Inline graphic , FM can be represented as a linear combination of two functions and that are independent of . For example, when , then and .

Learning FM with Coordinate Descent

Suppose we are updating Inline graphic to , the least square loss function is defined as:

By minimizing the above loss function, the optimal Inline graphic can be obtained as:

It is clear that whenever updating a parameter, CD enumerates all the instances containing the related non-zero variables, and calculate Inline graphic and for each instance, which is very time consuming. To improve the training efficiency, [13] proposed a pre-computation mechanism under single-machine FM. To reduce the complexity of computing , the term is defined as:

By storing Inline graphic in a vector (n is the number of instances) and pre-computing it at the beginning, the computation of error terms can be done in constant time O(1). After changing to , can be updated by:

To reduce the complexity of calculating Inline graphic , we reformulate as:

The term Inline graphic can be pre-computed for each instance and stored in a matrix . By pre-computing , can be computed in constant time. When updating to , the corresponding is updated as:

Distributed BCD Framework

In this section, we first introduce how to infer FM by BCD, and then give the details of the propose distributed BCD framework.

Learning FM with BCD

CD updates one coordinate each time while fixing others unchanged. However, if we directly extend it to the distributed platform without any changes, the model training will be very low-efficiency.

To overcome the above drawback, we propose a new CD method named BCD. BCD divides all coordinates into multiple blocks according to the specific scheme. During the training process, all the workers update the same block simultaneously while keeping the other blocks unchanged. The main problem for learning FM under BCD method is how to divide three types of FM model parameters ( Inline graphic ) into blocks? Different blocking schemes may affect the model performance significantly because they correspond to the different orders of the parameter updating in FM. To simplify the blocking process, we combine the global with to form an extended . Then, according to the different combinations of parameter types, we can determine two types of blocking schemes. For the first scheme, each block contains not only Inline graphic , but also . In other words, we divide the parameters according to the order of the variables where each variable j correspond to and , i.e., a block of parameters can be . We name it as Mixed scheme. For the second scheme, each block contains either or . Or, put another way, we first update all blocks which include Inline graphic parameters and then update the blocks with parameters, i.e., a block of parameters can be or . We name this scheme as Separate scheme.

Distributed BCD Under Standard PS

After analyzing the detail of inferring FM by BCD algorithm, we propose adapting BCD-based FM to a specific distributed environment. As far as we know, two types of distributed platforms, Map-Reduce (Spark) platform and PS framework, are popular and always used to the machine learning tasks. Generally speaking, PS framework is more efficient than Map-Reduce platform. Therefore, we propose a distributed BCD to learn FM under the PS framework.

As mentioned in Sect. 3.1, the key idea of BCD is to update a block of parameters while maintaining all the other parameters unchanged. Thus, we can distribute BCD algorithm under synchronous PS framework. Before the model training, the server first initializes the model parameters and assigns the dataset to each worker randomly. Specific to each epoch, the workers update blocks in turns. In the update of each block, the workers firstly pull the valid parameters of current block and some other related parameters (the parameters which are co-occurrence with the parameters of the current block) from the server, and then calculate the intermediate results by the update rule. When all the workers pushed intermediate results to the server, the server updates the corresponding parameters by aggregating all intermediate results. To make the algorithm more understandable, we illustrate the learning framework of distributed BCD in Fig. 1(a). In the figure, notation Inline graphic denotes the valid parameters of block j, denotes the related parameters which are co-occurrence with the parameters of block j, and denotes the intermediate results for block j.

Fig. 1. — The distributed BCD framework with pre-computation mechanism or not.

However, we also observe some drawbacks of this approach. First, when updating a block, in addition to the current block, the workers always need to pull some other related parameters. The high communication cost affects the efficiency of the model training. Second, there are only part of parameters changed when calculating Inline graphic and in Eq. (4). But we completely re-calculate these two terms for updating each parameter. The massive repetitive calculations are very time consuming. To overcome these problems, we propose a distributed pre-computation mechanism under an optimized PS framework in the next section.

Distributed Precomputation

In this section, we design a distributed pre-computation mechanism incorporated with an optimized PS framework to solve this problem.

As a variant of CD method, our distributed BCD also has the massive repetitive calculations problem. To address this problem, we propose a distributed pr-computation mechanism to further improve the efficiency of our distributed BCD framework. The key idea is to pre-compute Inline graphic and for each instance at the beginning, store these pre-computation terms and the valid parameters (corresponding to non-zero variables of the training instance allocated to it) in each worker, and then modify them incrementally with new parameters.

However, two problems need to be solved when we implement the pre-computation to the distributed BCD framework discussed in Sect. 3.2. First, the workers in the above distributed framework pull the needed parameters before updating the block while the pull operation in the pre-computation mechanism is not necessary because all the valid parameters has been stored in each worker. Second, the pre-computation mechanism requests to update the pre-computation terms with the latest block at the end of block updating but the above system doesn’t.

Obviously, to incorporate the pre-computation into our distributed BCD framework, we must optimize the above synchronous PS architecture. Follow the principle of pre-computation mechanism, we give the detail training process in Algorithm 1. Before the model learning, server node does the same initialization as the algorithm described in Sect. 3.1 (Line 2). And then, each worker pulls all valid parameters, and pre-compute e and q for all instances (Line 4). Specific to the update of each block, the workers first calculate the intermediate results for the current block and then push them back to the server (Line 7&8). After collecting all the intermediate results, server node updates the parameters of current block and then push the latest version back to all workers (Line 14&15). Once each worker receives the latest block, it updates its own parameters and pre-computation terms (Line 9&10). To better understand the algorithm, we illustrate the training architecture of our optimized distributed framework in Fig. 1(b). The notations Inline graphic and are two pre-computation terms, and denotes the latest parameters in block j. From the perspective of architecture, we can see that the purposes, occasions and triggers of operations are very different in these two frameworks. In terms of framework efficiency, the optimized PS framework has two advantages. First, the massive repetitive calculations can be avoided by the pre-computation. Second, the communication cost between the server and the workers is further reduced because only the parameters of the current block need to transfer in the optimized framework.

Experiments

In this section, we conduct various experiments to evaluate our algorithm.

Experimental Setup

Datasets. We perform our experiments on three datasets: MovieLens10M1, Movielens20M(See footnote 1) and Yahoo music2. The details are shown in Table 1.

Table 1.

Datasets

Dataset	Ratings	Users	Items
Movielens10M	10 million	71,567	10,681
Movielens20M	20 million	138,000	27,000
Yahoo music	717 million	1.8 million	136,000

Open in a new tab

Comparison Methods. The comparison methods are as follows:

FM-SCD-PS: Implemented by SCD and implemented on PS [20].

FM-Asyn-SGD-PS: A simple version of distributed FM proposed in [9] which is implemented with standard SGD and asynchronous PS.

FM-Ada-SGD-PS: Implemented with AdaGrad and asynchronous PS.

FM-Syn-SGD-PS: Implemented with standard SGD and synchronous PS.

FM-SGD-Spark: Implemented with standard SGD and Spark. For the sake of fairness, we simulate Spark with standard PS system.

FM-BCD-NPC-PS: Implemented with BCD and PS but no pre-computation.

FM-BCD-PS (Our final approach): Implemented with BCD and the distributed pre-computation mechanism under our optimized PS.

Evaluation Measures. We consider the following performance measurements:

Accuracy & Efficiency Performance. We use RMSE and the training time to compare the performance of different methods respectively.
Parameter Analysis. We analyze the effect of different factors in our methods: different blocking schemes and using pre-computation or not.

Platform and Implementation. All methods are implemented using Scala and performed on a cluster containing 16 machines. Among them, 15 machines are used as workers, of which each one contains 2 CPU cores and 16 G memory, and the other one is served as the server, which contains 8 CPU cores and 64 G memory. For the BCD-based algorithms, we default to set the block size as 5000 in Movielens3 datasets and 10000 in Yahoo music dataset. Due to the different convergence rates of SGD and CD algorithms, we set T as 10, 20 and 3000 for BCD-based methods, CD baseline and gradient based algorithms, respectively.

Accuracy and Efficiency Performance

We compare RMSE performance and the corresponding elapsed time of all the comparison methods on the three datasets. To present the learning details of methods, we record a pair of the metric RMSE evaluated on the test set and the corresponding elapsed time. Then we plot all the (RMSE, time) pairs for each algorithm in Fig. 2.

Fig. 2. — Performance of all the comparison methods.

From the results, we can see that the resultant (RMSE, time) points of the proposed FM-BCD-PS are more concentrated in the bottom-left corner than those of other methods, which indicates that FM-BCD-PS can achieve smaller RMSE with less training time than the other methods. The reason for this is that our distributed BCD framework inherits the fast convergence rate of CD method and further improves the efficiency with pre-computation mechanism.

Specifically, when comparing with FM-SCD-PS, FM-BCD-PS achieves better RMSE performance (6.0%–7.8%) within less time (4.6–9.7 Inline graphic speedup for the same 10 iterations). The reason for this lies in that FM-SCD-PS tries to learn parameters on partial related instances and the corresponding imbalance workload in each worker. When comparing with the SGD-based baselines, FM-BCD-PS improves RMSE performance by 3.8%, 4.4% and 6.0%4, respectively, on Movielens10M, Movielens20M and Yahoo music. Meanwhile, it achieves 7.6–11.1(See footnote 4) Inline graphic , 8.3–12.3(See footnote 4) and 5.4–7.4(See footnote 4) speedup on Movielens10M, Movielens20M and Yahoo music, respectively. The main reason is that BCD converges faster than SGD.

Parameter Analysis

We now discuss how different factors affect the model performance.

Effect of Different Blocking Schemes. As mentioned in Sect. 3.1, our proposed algorithm can apply two types of blocking schemes. We compare them and study how different blocking schemes affect the RMSE performance of FM-BCD-PS in Fig. 3. From Fig. 3, we can see that the Separate scheme achieves better RMSE performance (4.6%-5.9%) with similar runtime in all datasets than the Mixed scheme. This is due to the fact that Mixed scheme update the coordinates ( Inline graphic ,) corresponding to the same feature i in the same block which will affect each other. Different from Mixed scheme, Separate scheme not only ensures that the coordinates corresponding to the same features locate in different blocks, but also guarantee that the most coordinates existing in the same instance are put in the different blocks. Thus, we adopt the Separate scheme on our proposed FM-BCD-PS algorithm in all other experiments.

Fig. 3. — Performance of FM-BCD-PS over different blocking schemes.

Effect of the Pre-computation Mechanism. We compare FM-BCD-NPC-PS with FM-BCD-PS, and study how the proposed pre-computation mechanism affects the efficiency performance of our algorithm in Fig. 4 and 5.

In Fig. 4, we record the runtime for each iteration of two algorithms. From Fig. 4, we can see that FM-BCD-PS achieves up to 4.6 Inline graphic speedup on both Movielens datasets and 7.2 speedup on Yahoo music dataset when obtaining a comparable RMSE performance. The reason for this is that FM-BCD-PS not only reduces the size of data exchanged between the server and the workers but also avoids the massive repetitive calculations for updating coordinates.

In Fig. 5, the blue scatter plots and the red scatter plots illustrate the size of exchange data in each block of FM-BCD-PS and FM-BCD-NPC-PS respectively. Furthermore, in the same color of scatter plots, the dot scatter plots and triangle scatter plots are used to respectively represent the maximum and minimum size of exchange data of the block. From Fig. 5, we can see that FM-BCD-PS achieves fewer exchange data size in each block than FM-BCD-NPC-PS. That is to say, the proposed distributed pre-computation mechanism can also reduce the communication cost between the server and the workers.

Effect of Block Size. We study how the block size affects RMSE and efficiency performance of the proposed FM-BCD-PS in Fig. 6 and 7 respectively. To further illustrate the reliability of the experimental results, we sample 20%, 40%, 60%, 80% and 100% of Yahoo music to do experiments, respectively.

Fig. 7. — Efficiency of FM-BCD-PS over different block sizes.

From Fig. 6, we can see that there is no big difference of RMSE over different block sizes on all datasets, which indicates that the block size has little or nothing effect on the model performance. The reason is that most coordinates existing in the same instance are still located in the different blocks. That is to say, the update of coordinates in the same block do not affect each other.

Figure 7 shows the training time can be reduced by increasing the block size. However, the elapsed time becomes stale when the block size reaches about 6000 on Movielens datasets and 12000 on Yahoo music dataset. This is due to the fact that there is a balance between the communication cost of each block and the number of blocks. In summary, we can speed up the model training by increasing the block size to an appropriate value while keeping the model performance.

Related Work

Factorization Machine

To achieve better performance, researchers mainly focus on three directions to extend FM. First, how to learn the FM with high-order interaction? Although [11] gives the general form of the FM model, the papers only propose the learning methods for 2-order FM [5, 11, 13]. To optimize the high-order FM efficiently, [2, 3, 10] give their solutions. Secondly, researchers consider combining FM with neural network algorithms in different ways [6, 15, 17, 19]. Thirdly, to address the non-convex problem of the FM model, researchers try to reconstruct the FM and make it be convex [1, 16]. To improve the training efficiency, some research efforts have been made to scale up FM [4, 9, 18, 20]. These works focus on building FM on distributed frameworks. DiFacto [9] is a distributed FM and can perform fine-grained capacity control based on both data and model statistics. Zhong et al. proposed another version of distributed FM which can take advantages of both data parallelism and model parallelism [20]. To address the heavy communication cost problem, [18] proposes a client-to-client architecture to learn FM model.

Coordinate Descent on Big Data

To adapt the CD to large scale datasets, researchers tried to extend it to distributed platform [14, 20]. [14] designed the first distributed CD system: Hydra. It divides the coordinates to disjoint subset and distributes them to all workers. If we adapt FM to Hydra, the instances stored in each worker are in large-scale and the calculation of gradients for coordinates are imbalanced. In such cases, the training efficiency will be greatly affected. Similar to Hydra, [20] proposed a stochastic CD (SCD) under the hybrid distributed framework. The distributed SCD-based FM has two drawbacks. First, its model performance is not good since the update of parameters may be based on the partial related instances. Second, the model training is not efficient because it does not ensure a balanced amount of calculation among the workers.

Discussion and Future Work

We discuss how to distribute logistic regression (LR), matrix factorization (MF) [8] and other factorization models to our proposed framework.

Logistic Regression. LR only considers the effect of independent variables and equals to the factorization machine that ignores the interaction part, i.e.,

Compared with FM, LR has fewer features but with similar update rule. Thus, LR can directly apply the distributed BCD framework.

Matrix Factorization. MF only incorporates user and item identifications as features and equals to the FM that ignores the other heuristic features, i.e.,

where m and n denotes the number of users and items, respectively. In this case, Inline graphic represents the bias for user u, and is the bias for item i. Accordingly, there is only two interaction latent factors and for user u and item i, respectively.

Compared with FM, MF has fewer variables but with the same update rule. Thus, we can implement the distributed BCD on MF.

Other Factorization Models. FM is a general factorization method, which can mimic other state-of-the-art models like SVD++ [7], PITF [12] and so on. Therefore, our framework can optimize these models in a similar way.

Conclusions

This paper proposes a new distributed BCD framework to learn the FM. Through conducting a pre-computation mechanism incorporated with our optimized PS framework, we can avoid massive repetitive calculations and further reduce the communication cost. In addition, it is worth mentioning that our proposed distributed BCD framework can also be applied to many other factorization models, such as LR, MF, SVD++ and so on. We compare the proposed algorithm with the state-of-the-art baselines, and find that our proposed FM-BCD-PS can achieve better performance (3.8%–6.0% RMSE) within shorter time (4.6–12.3 Inline graphic speedup). For future work, we aim to generalize the distributed BCD framework and apply more other machine learning algorithms on it.

Acknowledgments

This work is supported by National Key R&D Program of China (No.2018YFB1004401), and NSFC under the grant No. (61772537, 61772536, 61702522, 61532021).

Footnotes

https://grouplens.org/datasets/movielens.

https://webscope.sandbox.yahoo.com/catalog.php?datatype=r.

We use Movielens to represent two datasets: Movielens10M and Movielens20M.

⁴

Here we abandon the results of FM-SGD-Spark for its bad performance.

Contributor Information

Hady W. Lauw, Email: hadywlauw@smu.edu.sg

Raymond Chi-Wing Wong, Email: raywong@cse.ust.hk.

Alexandros Ntoulas, Email: antoulas@di.uoa.gr.

Ee-Peng Lim, Email: eplim@smu.edu.sg.

See-Kiong Ng, Email: seekiong@nus.edu.sg.

Sinno Jialin Pan, Email: sinnopan@ntu.edu.sg.

Kankan Zhao, Email: zhaokankan@ruc.edu.cn.

Jing Zhang, Email: zhang-jing@ruc.edu.cn.

Liangfu Zhang, Email: liangfu_zhang@ruc.edu.cn.

Cuiping Li, Email: licuiping@ruc.edu.cn.

Hong Chen, Email: chong@ruc.edu.cn.

References

1.Blondel M, Fujino A, Ueda N. Convex factorization machines. In: Appice A, Rodrigues PP, Santos Costa V, Gama J, Jorge A, Soares C, editors. Machine Learning and Knowledge Discovery in Databases; Cham: Springer; 2015. pp. 19–35. [Google Scholar]
2.Blondel, M., et al.: Higher-order factorization machines. In: NIPS 2016, pp. 3351–3359 (2016)
3.Blondel, M., et al.: Polynomial networks and factorization machines: new insights and efficient training algorithms. In: ICML 2016, pp. 850–858 (2016)
4.Cao, B., et al.: Multi-view machines. In: WSDM 2016, pp. 427–436 (2016)
5.Freudenthaler, C., et al.: Bayesian factorization machines (2011)
6.Guo, H., et al.: DeepFM: a factorization-machine based neural network for CTR prediction. In: IJCAI 2017, pp. 1725–1731 (2017)
7.Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: SIGKDD 2008, pp. 426–434 (2008)
8.Koren Y, et al. Matrix factorization techniques for recommender systems. Computer. 2009;42(8):30–37. doi: 10.1109/MC.2009.263. [DOI] [Google Scholar]
9.Li, M., et al.: DiFacto: distributed factorization machines. In: WSDM 2016, pp. 377–386 (2016)
10.Lu, C.T., et al.: Multilinear factorization machines for multi-task multi-view learning. In: WSDM 2017, pp. 701–709 (2017)
11.Rendle, S.: Factorization machines. In: ICDM 2010, pp. 995–1000. IEEE (2010)
12.Rendle, S., et al.: Pairwise interaction tensor factorization for personalized tag recommendation. In: WSDM 2010, pp. 81–90 (2010)
13.Rendle, S., et al.: Fast context-aware recommendations with factorization machines. In: SIGIR 2011, pp. 635–644 (2011)
14.Richtárik P, et al. Distributed coordinate descent method for learning with big data. J. Mach. Learn. Res. 2016;17(1):2657–2681. [Google Scholar]
15.Xiao, J., et al.: Attentional factorization machines: learning the weight of feature interactions via attention networks. In: IJCAI 2017, pp. 3119–3125 (2017)
16.Yamada, M., et al.: Convex factorization machine for toxicogenomics prediction. In: SIGKDD 2017, pp. 1215–1224 (2017)
17.Zhang W, Du T, Wang J, et al. Deep learning over multi-field categorical data. In: Ferro N, et al., editors. Advances in Information Retrieval; Cham: Springer; 2016. pp. 45–57. [Google Scholar]
18.Zhao K, Zhang J, Zhang L, Li C, Chen H. CDSFM: a circular distributed SGLD-based factorization machines. In: Pei J, Manolopoulos Y, Sadiq S, Li J, editors. Database Systems for Advanced Applications; Cham: Springer; 2018. pp. 701–709. [Google Scholar]
19.Zheng, L., et al.: Joint deep modeling of users and items using reviews for recommendation. In: WSDM 2017, pp. 425–434 (2017)
20.Zhong, E., et al.: Scaling factorization machines with parameter server. In: CIKM 2016, pp. 1583–1592 (2016)

[CR1] 1.Blondel M, Fujino A, Ueda N. Convex factorization machines. In: Appice A, Rodrigues PP, Santos Costa V, Gama J, Jorge A, Soares C, editors. Machine Learning and Knowledge Discovery in Databases; Cham: Springer; 2015. pp. 19–35. [Google Scholar]

[CR2] 2.Blondel, M., et al.: Higher-order factorization machines. In: NIPS 2016, pp. 3351–3359 (2016)

[CR3] 3.Blondel, M., et al.: Polynomial networks and factorization machines: new insights and efficient training algorithms. In: ICML 2016, pp. 850–858 (2016)

[CR4] 4.Cao, B., et al.: Multi-view machines. In: WSDM 2016, pp. 427–436 (2016)

[CR5] 5.Freudenthaler, C., et al.: Bayesian factorization machines (2011)

[CR6] 6.Guo, H., et al.: DeepFM: a factorization-machine based neural network for CTR prediction. In: IJCAI 2017, pp. 1725–1731 (2017)

[CR7] 7.Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: SIGKDD 2008, pp. 426–434 (2008)

[CR8] 8.Koren Y, et al. Matrix factorization techniques for recommender systems. Computer. 2009;42(8):30–37. doi: 10.1109/MC.2009.263. [DOI] [Google Scholar]

[CR9] 9.Li, M., et al.: DiFacto: distributed factorization machines. In: WSDM 2016, pp. 377–386 (2016)

[CR10] 10.Lu, C.T., et al.: Multilinear factorization machines for multi-task multi-view learning. In: WSDM 2017, pp. 701–709 (2017)

[CR11] 11.Rendle, S.: Factorization machines. In: ICDM 2010, pp. 995–1000. IEEE (2010)

[CR12] 12.Rendle, S., et al.: Pairwise interaction tensor factorization for personalized tag recommendation. In: WSDM 2010, pp. 81–90 (2010)

[CR13] 13.Rendle, S., et al.: Fast context-aware recommendations with factorization machines. In: SIGIR 2011, pp. 635–644 (2011)

[CR14] 14.Richtárik P, et al. Distributed coordinate descent method for learning with big data. J. Mach. Learn. Res. 2016;17(1):2657–2681. [Google Scholar]

[CR15] 15.Xiao, J., et al.: Attentional factorization machines: learning the weight of feature interactions via attention networks. In: IJCAI 2017, pp. 3119–3125 (2017)

[CR16] 16.Yamada, M., et al.: Convex factorization machine for toxicogenomics prediction. In: SIGKDD 2017, pp. 1215–1224 (2017)

[CR17] 17.Zhang W, Du T, Wang J, et al. Deep learning over multi-field categorical data. In: Ferro N, et al., editors. Advances in Information Retrieval; Cham: Springer; 2016. pp. 45–57. [Google Scholar]

[CR18] 18.Zhao K, Zhang J, Zhang L, Li C, Chen H. CDSFM: a circular distributed SGLD-based factorization machines. In: Pei J, Manolopoulos Y, Sadiq S, Li J, editors. Database Systems for Advanced Applications; Cham: Springer; 2018. pp. 701–709. [Google Scholar]

[CR19] 19.Zheng, L., et al.: Joint deep modeling of users and items using reviews for recommendation. In: WSDM 2017, pp. 425–434 (2017)

[CR20] 20.Zhong, E., et al.: Scaling factorization machines with parameter server. In: CIKM 2016, pp. 1583–1592 (2016)

PERMALINK

A Distributed Coordinate Descent Algorithm for Learning Factorization Machine

Kankan Zhao

Jing Zhang

Liangfu Zhang

Cuiping Li

Hong Chen

Abstract

Introduction

Preliminary

Factorization Machine

Learning FM with Coordinate Descent

Distributed BCD Framework

Learning FM with BCD

Distributed BCD Under Standard PS

Fig. 1.

Distributed Precomputation

Experiments

Experimental Setup

Table 1.

Accuracy and Efficiency Performance

Fig. 2.

Parameter Analysis

Fig. 3.

Fig. 4.

Fig. 5.

Fig. 6.

Fig. 7.

Related Work

Factorization Machine

Coordinate Descent on Big Data

Discussion and Future Work

Conclusions

Acknowledgments

Footnotes

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases