PLOS ONE. 2018 May 4; 13(5): e0190831. doi: 10.1371/journal.pone.0190831

An improved advertising CTR prediction approach based on the fuzzy deep neural network

Zilong Jiang 1,2, Shu Gao 1, Mingjiang Li 2,*
Editor: Yong Deng3
PMCID: PMC5935396  PMID: 29727443

Abstract

Combining a deep neural network with fuzzy theory, this paper proposes an advertising click-through rate (CTR) prediction approach based on a fuzzy deep neural network (FDNN). In this approach, fuzzy Gaussian-Bernoulli restricted Boltzmann machine (FGBRBM) is first applied to input raw data from advertising datasets. Next, fuzzy restricted Boltzmann machine (FRBM) is used to construct the fuzzy deep belief network (FDBN) with the unsupervised method layer by layer. Finally, fuzzy logistic regression (FLR) is utilized for modeling the CTR. The experimental results show that the proposed FDNN model outperforms several baseline models in terms of both data representation capability and robustness in advertising click log datasets with noise.

Introduction

Internet advertising is regarded as an effective advertising communication approach due to its strong targeted communication ability. Therefore, increasing numbers of researchers from industry and academia have investigated internet advertising, and it has become an important source of income for internet companies.

The cost per click (CPC) model [1] is one of the most common payment models in internet advertising. Approximately 66% of advertising transactions depend on the CPC because the CPC can more accurately reveal the conversion rate compared with the other models [2]. In the CPC model, the click-through rate (CTR) is a significant index for measuring the effect of advertisement placement.

To address this task, Chapelle et al. proposed a machine learning framework that uses maximum entropy to implement a logistic regression model and can handle billions of samples and hundreds of millions of parameters. A two-phase feature selection algorithm is provided to reduce the need for domain expertise: a generalized mutual information method is used to select the feature groups that are used in the model, and feature hashing is then used to regulate the size of the model [3]. McMahan et al. adopted a logistic regression model to solve advertising CTR problems for Google. Using multiple characteristics of advertisements, including user information, search keywords, advertising data and related metadata, as the input of the model, these researchers proposed an online sparse learning algorithm to train the model [4]. The above methods have recently been used in industry to address CTR prediction through the logistic regression (LR) model. They can appropriately handle the large number of features in real advertising systems and can be efficiently parallelized in a distributed computing system. However, they cannot learn complex mapping relations from data because of their limited data representation capability.

T. A. Anh-Phuong presented an online learning algorithm for click-through rate prediction, known as Follow The Regularized Factorized Leader (FTRFL), which combines the Follow-The-Regularized-Leader Proximal (FTRL-P) algorithm with per-coordinate learning rates to train factorization machines. This work attempts to obtain both the sparsity provided by FTRL-Proximal and the ability of FM to estimate higher-order feature interactions [5]. Z. Pan et al. proposed a Sparse Factorization Machine (SFM) model to address the sparsity of advertising transaction datasets. In this model, the Laplace distribution is introduced to model the parameters because it better fits sparse data with a high ratio of zero elements. Furthermore, these researchers parallelized the SFM model on the Spark platform to support CTR prediction in a real advertising system [2]. The above methods have recently been adopted in academia to address CTR prediction through Factorization Machines (FMs). They alleviate the feature learning problem by compressing the sparse features into a dense vector space and using second-order feature interactions. However, when facing a dataset with complex relations, they cannot sufficiently capture the higher-order non-linear representative features and the complex inherent mapping relations among the users, contexts and ads in advertising datasets.

A. I. Baqapuri and I. Trofimov adopted Artificial Neural Networks (ANNs) to predict the advertising CTR and obtained a better effect than that of the logistic regression model [6]. Dave and other researchers proposed a model that inherits the click information of rare/new ads from semantically related frequent ads to calculate CTR values for newly created or rare ads that do not have sufficient historical information. The semantic features are derived from the search ad click-through graphs and advertiser account information, and gradient boosted decision trees (GBDTs) are then adopted to learn from these similarity features. Experiments demonstrate that the learned model, using these features, obtains good CTR prediction performance for new ads [7]. Y. Juan et al. first used the trees in GBDT to generate abstract nonlinear features, combined them with the original features, and fed the result into the FM model for advertising CTR prediction [8]. The above methods address CTR prediction through model ensemble techniques and provide state-of-the-art ideas for both industry and academia.

Zhang and other researchers used recurrent neural networks (RNNs) to predict the CTR of sponsored search advertising. Training their model with back-propagation through time (BPTT), these researchers obtained more accurate advertising CTR predictions than neural networks and logistic regression models [9]. Yu adopted an improved recurrent neural network for advertising CTR prediction, using long short-term memory (LSTM) to modify the RNN; this method effectively prevents exploding or vanishing gradients [10]. Jiang et al. adopted the DBNLR model, which integrates a deep belief network (DBN) with logistic regression (LR), to address CTR prediction: a DBN composed of stacked RBMs is used to obtain abstract features from the original advertisement datasets, and a regression model is then adopted to calculate the CTR prediction value [11]. Junxuan Chen et al. proposed a novel deep neural network in which convolution layers extract representative visual features and fully connected layers learn complex and effective nonlinear features on top of these basic visual features; finally, these features are fed into a logistic regression model to predict the CTR value [12]. The above methods are state-of-the-art solutions that address the CTR prediction task through deep learning technology. In these methods, a deep architecture model is usually used to automatically extract features from advertising click log data, and a regression model is subsequently trained to model the advertising CTR. Because a deep architecture is used to learn the features, these methods do not depend on prior knowledge or human labor, and they can effectively fit complex nonlinear mapping relations in advertising datasets. However, when the dataset contains noisy data such as missing values and outliers, these methods may exhibit performance degradation.

A common strategy for dealing with this problem is to identify and delete records with outliers during data preprocessing. However, deleting entire records results in the loss of much valuable and clean information. The other common strategy is to replace missing values and outliers with “0” or “NULL”; however, the records with outliers are preserved in the training set, which also affects the accuracy of CTR prediction.

Uncertainties not only exist in the data themselves but also occur at each phase of big data processing. For instance, the collected data may be created by faulty sensors or provided by not fully informed customers, and the outputs of specific artificial intelligence algorithms also contain uncertainties. In these cases, fuzzy set techniques can be one of the most efficient tools to handle various types of uncertainties [13]. Because the parameters of the above deep architecture models are constants and the learning processes of the parameters are constrained to a relatively small space, these methods have insufficient representation capability and relatively weak robustness when the training data have been corrupted by uncertainties. Meanwhile, fuzzy set techniques are more efficient when used in association with other decision making techniques, such as probability and neural networks, because each type of technique exhibits its own strengths in representing and handling information granularity [13].

Due to the lack of a comprehensive understanding of advertising datasets with noise, it is often difficult for these methods to ensure that the extracted features capture the optimal information for predicting the CTR. These deficiencies affect the accuracy and fitness of these methods. Given the advantages of fuzzy set techniques, this paper proposes a novel CTR prediction method based on a fuzzy deep neural network (FDNN) to address advertising datasets with noise.

The main contributions of this paper can be summarized as follows:

1. This paper proposes an FDNN model in which FDBN is used to automatically extract abstract and complicated features from raw advertising datasets without any artificial intervention or prior knowledge and subsequently uses fuzzy logistic regression to model the CTR.

2. Fuzzy set techniques are introduced into the Gaussian-Bernoulli Restricted Boltzmann Machine (GBRBM), Restricted Boltzmann Machine (RBM) and logistic regression models to construct basic components of FDNN, and corresponding learning algorithms are presented in detail.

3. We conduct extensive experiments on real-world datasets to demonstrate the effectiveness and efficiency of the proposed FDNN model, and the impacts of several baseline models are also discussed. The experimental results show the superiority of the proposed method: on the first dataset, the proposed method achieves competitive performance compared with state-of-the-art models in terms of both data representation capability and robustness.

Materials and methods

Relative theories of fuzzy numbers and fuzzy functions

Fuzzy number

Definition 1. Triangular Fuzzy Number

A fuzzy number \(\bar{\theta}\) (fuzzy set) on a real domain can be defined as a triangular fuzzy number if its membership function \(\mu_{\bar{\theta}}(x): \mathbb{R} \to [0,1]\) is given by Eq (1) [14]; it is shown in S1 Fig.

\(\mu_{\bar{\theta}}(x)=\bar{\theta}(x)=\begin{cases}\dfrac{x-l}{m-l}, & x\in[l,m]\\ \dfrac{u-x}{u-m}, & x\in[m,u]\\ 0, & \text{otherwise,}\end{cases}\)   (1)

where \(l \le m \le u\); \(l\) and \(u\) stand for the lower and upper values of the support of the fuzzy number \(\bar{\theta}\) (fuzzy set), respectively; and \(m\) is the most likely value (middle value), that is, when \(x=m\), \(x\) fully belongs to \(\bar{\theta}\). The triangular fuzzy number can be denoted by \((l, m, u)\). The support of \(\bar{\theta}\) is the set of elements \(\{x \in \mathbb{R} \mid l < x < u\}\). When \(l = m = u\), it is a non-fuzzy number.

Definition 2. α-cuts of a fuzzy set

The α-cut of fuzzy set \(\bar{\theta}\), represented by \(\bar{\theta}[\alpha]\), is defined as \(\bar{\theta}[\alpha]=\{x \in \Omega \mid \bar{\theta}(x) \ge \alpha\}=[\theta^{L},\theta^{R}]\), where \(0 < \alpha \le 1\), \(\theta^{L}\) is the lower value of \(\bar{\theta}[\alpha]\) and \(\theta^{R}\) is the upper value of \(\bar{\theta}[\alpha]\) [15, 16].
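
To make Definitions 1 and 2 concrete, the following minimal Python sketch computes the triangular membership function of Eq (1) and the α-cut interval of Definition 2. The function names and the example fuzzy number (0.2, 0.5, 0.8) are illustrative assumptions rather than quantities from the paper.

import numpy as np

def triangular_membership(x, l, m, u):
    # Membership degree of x for the triangular fuzzy number (l, m, u), Eq (1).
    x = np.asarray(x, dtype=float)
    return np.clip(np.minimum((x - l) / (m - l), (u - x) / (u - m)), 0.0, 1.0)

def alpha_cut(l, m, u, alpha):
    # Alpha-cut [theta_L, theta_R] of the triangular fuzzy number (l, m, u), Definition 2.
    assert 0.0 < alpha <= 1.0
    return l + alpha * (m - l), u - alpha * (u - m)

# Example: a fuzzy parameter "about 0.5" represented as (0.2, 0.5, 0.8).
print(triangular_membership([0.3, 0.5, 0.9], l=0.2, m=0.5, u=0.8))  # [0.33..., 1.0, 0.0]
print(alpha_cut(0.2, 0.5, 0.8, alpha=0.5))                          # (0.35, 0.65)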

Fuzzy function

Definition 3. Fuzzy Function

Extended from a real-valued function f: Y = f(x, θ), the fuzzy function f¯ can be defined as [16]

\(\bar{Y}=\bar{f}(x,\bar{\theta}),\)   (2)

where \(\theta\) and \(\bar{\theta}\) are the parameters of functions \(f\) and \(\bar{f}\), respectively [16, 17], and \(\bar{Y}\) is the fuzzy output set.

The membership function Y¯(y) can be expressed as

\(\bar{Y}(y)=\sup_{\theta}\{\min(\bar{\theta}_{1}(\theta_{1}),\ldots,\bar{\theta}_{n}(\theta_{n})) \mid f(x,\theta)=y\},\)   (3)

where \(\theta=(\theta_{1},\ldots,\theta_{n})^{T}\) and \(\bar{\theta}=(\bar{\theta}_{1},\ldots,\bar{\theta}_{n})^{T}\).

Property 1. Real Domain Interval Arithmetic

For two real domain intervals [a, b] and [c, d], the interval arithmetic can be defined as [16, 18]

[a, b] + [c, d] = [a + c, b + d];

[a, b] − [c, d] = [a − d, b − c];

[a, b] × [c, d] = [min(a × c, a × d, b × c, b × d), max(a × c, a × d, b × c, b × d)];

[a, b] ÷ [c, d] = [min(a ÷ c, a ÷ d, b ÷ c, b ÷ d), max(a ÷ c, a ÷ d, b ÷ c, b ÷ d)].
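
The interval rules of Property 1 can be written down directly; the short Python sketch below mirrors the four rules and can be used to propagate α-cut intervals through simple expressions. The function names are ours, and the division rule assumes 0 is not contained in [c, d].

def iv_add(x, y):
    # [a, b] + [c, d] = [a + c, b + d]
    (a, b), (c, d) = x, y
    return a + c, b + d

def iv_sub(x, y):
    # [a, b] - [c, d] = [a - d, b - c]
    (a, b), (c, d) = x, y
    return a - d, b - c

def iv_mul(x, y):
    # keep the minimum and maximum of the four endpoint products
    (a, b), (c, d) = x, y
    p = (a * c, a * d, b * c, b * d)
    return min(p), max(p)

def iv_div(x, y):
    # defined only when 0 is not contained in [c, d]
    (a, b), (c, d) = x, y
    q = (a / c, a / d, b / c, b / d)
    return min(q), max(q)

print(iv_add((1, 2), (3, 5)))   # (4, 7)
print(iv_mul((-1, 2), (3, 5)))  # (-5, 10)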

Definition 4. α-Cuts of a Fuzzy Function

α-Cuts of \(\bar{Y}\): for a continuous function \(f\), the α-cut of \(\bar{Y}\), namely \(\bar{Y}[\alpha]=[\bar{Y}_{1}[\alpha],\bar{Y}_{2}[\alpha]]\), can be expressed as [16]

\(\begin{cases}\bar{Y}_{1}[\alpha]=\min\{Y(\theta,x) \mid \theta \in \bar{\theta}[\alpha]\}\\ \bar{Y}_{2}[\alpha]=\max\{Y(\theta,x) \mid \theta \in \bar{\theta}[\alpha]\}.\end{cases}\)   (4)

It is infeasible to calculate the membership function Y¯(y) using formulas (2) and (3) because this requires maximization and minimization of the original function. This paper adopts α-cuts and interval arithmetic to solve this problem.

\(\bar{Y}[\alpha]=f(x,\bar{\theta}[\alpha])\)   (5)

The membership function Y¯(y) of a fuzzy function can be obtained by interval arithmetic because intervals θ¯[α] are easy to calculate. However, when f is very complex, interval arithmetic will be NP-hard; therefore, this paper introduces a defuzzification approach to handle this problem.

Defuzzification of the fuzzy function

The centroid approximate solution [19] is employed to defuzzify the fuzzy function Y¯=f¯(x,θ¯) in this paper.

The centroid of fuzzy function f¯(x,θ¯) can be denoted by fc(x,θ¯) [19]:

\(f^{c}(x,\bar{\theta})=\dfrac{\int \bar{\theta}(\theta)\,f(x,\theta)\,d\theta}{\int \bar{\theta}(\theta)\,d\theta},\quad \theta\in\bar{\theta}.\)   (6)

However, directly calculating formula (6) is difficult because it involves integrals. The centroid can instead be approximated through many α-cuts of the fuzzy function in discrete form. \(\bar{\theta}\) is a vector of fuzzy numbers, and an α-cut of \(\bar{\theta}\) can be denoted as \(\bar{\theta}[\alpha]=[\theta^{L},\theta^{R}]\), where \(\theta^{L}\) and \(\theta^{R}\) are the lower and upper bounds, respectively, of the interval with respect to α. When x is nonnegative, \(\bar{f}(x,\bar{\theta})\) is a monotonically decreasing function with respect to the parameters \(\bar{\theta}\). Thus, according to the interval arithmetic principle, Definition 3 and formula (5), the α-cut of \(\bar{f}(x,\bar{\theta})\) is given by

\(\bar{f}(x,\bar{\theta})[\alpha]=f(x,\bar{\theta}[\alpha])=[f(x,\theta^{L}),f(x,\theta^{R})].\)   (7)

The approximate centroid of fuzzy function f¯(x,θ¯) can be obtained by [20]

\(f^{c}(x,\bar{\theta})\approx\dfrac{\sum_{i=1}^{M}\alpha_{i}\left[f(x,\theta_{i}^{L})+f(x,\theta_{i}^{R})\right]}{2\sum_{i=1}^{M}\alpha_{i}},\)   (8)

where \(\alpha=(\alpha_{1},\ldots,\alpha_{M})\), \(\alpha_{i}\in[0,1]\) and \(\bar{\theta}[\alpha_{i}]=[\theta_{i}^{L},\theta_{i}^{R}]\).

Although all of the α-cuts are bounded intervals, this paper considers only the special case where \(\alpha_{i}=1\). Let \(\theta=[\theta^{L},\theta^{R}]\). Formula (8) can then be expressed as [20]

\(f^{c}(x,\bar{\theta})\approx\tfrac{1}{2}\left[f(x,\theta^{L})+f(x,\theta^{R})\right].\)   (9)

Therefore, fuzzy function Y¯ can be expressed as

\(\bar{Y}=\bar{f}(x,\bar{\theta})=f^{c}(x,\bar{\theta})\approx\tfrac{1}{2}\left[f(x,\theta^{L})+f(x,\theta^{R})\right].\)   (10)
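
As an illustration of Eqs (9)-(10), the following minimal Python sketch defuzzifies a fuzzy function whose parameter is kept as a lower/upper bound pair [θL, θR], exactly the special case αi = 1 used in the rest of the paper. The sigmoid toy model and all names are our own assumptions.

import numpy as np

def centroid_defuzzify(f, x, theta_L, theta_R):
    # Centroid approximation of the fuzzy function f(x, theta) at the alpha = 1 cut,
    # Eqs (9)-(10): 0.5 * (f(x, theta_L) + f(x, theta_R)).
    return 0.5 * (f(x, theta_L) + f(x, theta_R))

# Toy example: a sigmoid "model" with an interval-valued (fuzzy) weight.
sigmoid_model = lambda x, w: 1.0 / (1.0 + np.exp(-w * x))
print(centroid_defuzzify(sigmoid_model, x=2.0, theta_L=0.4, theta_R=0.8))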

Advertising CTR prediction based on the fuzzy deep neural network

Basic fuzzy components and corresponding learning algorithms

Inspired by the ideas of the above fuzzy theory and methods [16-20], we first provide a detailed introduction to several basic components that have been modified using fuzzy techniques and describe their corresponding learning algorithms.

A. FRBM and its Learning Algorithm

The fuzzy restricted Boltzmann machine (FRBM) is a symmetric neural network with binary nodes that is based on an energy model. It contains a set of visible binary nodes \(v \in \{0,1\}^{D}\) and another set of hidden binary nodes \(h \in \{0,1\}^{F}\). There are no connections between different nodes of the same hidden layer or between different nodes of the same visible layer. Fuzzy parameters are employed to govern the FRBM model. The structure of FRBM is described in S2 Fig.

The fuzzy energy function of FRBM can be expressed as [20]

\(\bar{E}(x,h;\bar{\theta})=-\bar{b}^{T}x-\bar{c}^{T}h-h^{T}\bar{W}x.\)   (11)

In this formula, \(\bar{\theta}=\{\bar{b},\bar{c},\bar{W}\}\), \(\bar{b}\) and \(\bar{c}\) are the visible and hidden offsets, and \(\bar{W}_{ij}\) is the connection weight between the ith visible node and the jth hidden node.

According to the free energy function \(F(x,\theta)=-\log\sum_{\tilde{h}}e^{-E(x,\tilde{h};\theta)}\), the fuzzy free energy function \(\bar{F}\) is expressed as

\(\bar{F}(x,\bar{\theta})=-\log\sum_{\tilde{h}}e^{-\bar{E}(x,\tilde{h};\bar{\theta})}.\)   (12)

If \(\bar{F}(x,\bar{\theta})\) is used directly to define the probability, it is difficult to calculate the fuzzy probability, the fuzzy maximum likelihood and the fuzzy objective function in the optimization. Therefore, \(\bar{F}(x,\bar{\theta})\) needs to be defuzzified to transform the optimization into a regular maximum-likelihood optimization. This paper adopts the centroid method to defuzzify \(\bar{F}(x,\bar{\theta})\).

The centroid of \(\bar{F}(x,\bar{\theta})\) is expressed as \(F_{c}(x,\bar{\theta})\), and the probability can be defined as

\(P_{c}(x,\bar{\theta})=\dfrac{e^{-F_{c}(x,\bar{\theta})}}{Z},\qquad Z=\sum_{\tilde{x}}e^{-F_{c}(\tilde{x},\bar{\theta})}.\)   (13)

This paper selects the negative log-likelihood as the objective function of FRBM:

\(L(\bar{\theta},D)=-\sum_{x\in D}\log P_{c}(x,\bar{\theta}).\)   (14)

In the formula, D denotes the training dataset.

The learning algorithm of FRBM finds the parameters \(\bar{\theta}\) that minimize the objective function \(L(\bar{\theta},D)\), i.e., \(\min_{\bar{\theta}} L(\bar{\theta},D)\). This paper adopts the centroid approximation method to defuzzify \(\bar{F}(x,\bar{\theta})\):

\(\bar{F}(x,\bar{\theta})=F_{c}(x,\bar{\theta})\approx\tfrac{1}{2}\left[F(x,\theta^{L})+F(x,\theta^{R})\right].\)   (15)

The gradients of L(θ¯,D) with respect to θL can be expressed as

\(-\dfrac{\partial\log P_{c}(x,\bar{\theta})}{\partial\theta^{L}}=\dfrac{\partial F_{c}(x,\theta^{L})}{\partial\theta^{L}}-E_{P}\!\left[\dfrac{\partial F_{c}(x,\theta^{L})}{\partial\theta^{L}}\right].\)   (16)

In the formula, Ep(⋅) is the expectation over the target probability distribution P.

Correspondingly,

\(-\dfrac{\partial\log P_{c}(x,\bar{\theta})}{\partial\theta^{R}}=\dfrac{\partial F_{c}(x,\theta^{R})}{\partial\theta^{R}}-E_{P}\!\left[\dfrac{\partial F_{c}(x,\theta^{R})}{\partial\theta^{R}}\right].\)   (17)

Because it is difficult to calculate the expectation \(E_{P}\!\left[\partial F_{c}(x,\theta)/\partial\theta\right]\), an approximation approach is employed. After defuzzifying the objective function, Gibbs sampling [21] is adopted to sample from these conditional distributions.

For FRBM, the fuzzy conditional probabilities can be expressed as \(\bar{P}(h_{j}=1\mid x)=\sigma(\bar{c}_{j}+\bar{W}_{\cdot j}x)\) and \(\bar{P}(x_{i}=1\mid h)=\sigma(\bar{b}_{i}+\bar{W}_{i\cdot}^{T}h)\). The α-cuts of the fuzzy conditional probabilities are described as \(\bar{P}(h_{j}=1\mid x)[\alpha]=[P^{L}(h_{j}=1\mid x),P^{R}(h_{j}=1\mid x)]\) and \(\bar{P}(x_{i}=1\mid h)[\alpha]=[P^{L}(x_{i}=1\mid h),P^{R}(x_{i}=1\mid h)]\).

In these formulas, PL(hj|x), PR(hj|x), PL(xi|h) and PR(xi|h) are the conditional probabilities with respect to the lower bounds and upper bounds of the parameters. Consequently,

\(P^{L}(h_{j}\mid x)=P(h_{j}\mid x;\theta^{L})=\sigma(c_{j}^{L}+W_{\cdot j}^{L}x),\qquad P^{R}(h_{j}\mid x)=P(h_{j}\mid x;\theta^{R})=\sigma(c_{j}^{R}+W_{\cdot j}^{R}x),\)   (18)

and

\(P^{L}(x_{i}\mid h)=P(x_{i}\mid h;\theta^{L})=\sigma(b_{i}^{L}+W_{i\cdot}^{L}h),\qquad P^{R}(x_{i}\mid h)=P(x_{i}\mid h;\theta^{R})=\sigma(b_{i}^{R}+W_{i\cdot}^{R}h).\)   (19)

In these formulas, \(W_{ij}^{L}\) and \(W_{ij}^{R}\) are the lower bound and upper bound of the connection weight, \(b_{i}^{L}\) and \(b_{i}^{R}\) are the lower bound and upper bound of the visible bias, and \(c_{j}^{L}\) and \(c_{j}^{R}\) are the lower bound and upper bound of the hidden bias [20].

According to (15), (16), (17) and the expectation estimation method of [20, 22], the gradients of L(θ¯,D) with respect to the fuzzy parameters of FRBM can be obtained as follows:

\(-\dfrac{\partial\log P_{c}(x)}{\partial W_{ij}^{L}}=E_{P}\!\left[P^{L}(h_{j}\mid x)\,x_{i}^{L}\right]-P^{L}(h_{j}\mid x)\,x_{i}^{L},\)
\(-\dfrac{\partial\log P_{c}(x)}{\partial c_{j}^{L}}=E_{P}\!\left[P^{L}(h_{j}\mid x)\right]-P^{L}(h_{j}\mid x),\)
\(-\dfrac{\partial\log P_{c}(x)}{\partial b_{i}^{L}}=E_{P}\!\left[P^{L}(x_{i}\mid h)\right]-x_{i}^{L},\)
\(-\dfrac{\partial\log P_{c}(x)}{\partial W_{ij}^{R}}=E_{P}\!\left[P^{R}(h_{j}\mid x)\,x_{i}^{R}\right]-P^{R}(h_{j}\mid x)\,x_{i}^{R},\)
\(-\dfrac{\partial\log P_{c}(x)}{\partial c_{j}^{R}}=E_{P}\!\left[P^{R}(h_{j}\mid x)\right]-P^{R}(h_{j}\mid x),\)
\(-\dfrac{\partial\log P_{c}(x)}{\partial b_{i}^{R}}=E_{P}\!\left[P^{R}(x_{i}\mid h)\right]-x_{i}^{R},\)

where Pc(x) is the centroid probability.

The CD-1 algorithm [22] is employed to obtain the updating rules for the parameters (\(\theta^{L}\) and \(\theta^{R}\)) and to approximate the expectation \(E_{P}\!\left[\partial F_{c}(x,\theta)/\partial\theta\right]\) [20, 22]. The above description is summarized as Algorithm 1.

Algorithm 1 The learning algorithm of FRBM

Input: x(0) is an input sample of the training set;

  ε is the learning rate;

  WL and WR are the lower and upper bounds of connection weight matrices of the visible and hidden layers;

  bL and bR are the lower and upper bounds of bias vectors of the visible nodes;

  cL and cR are the lower and upper bounds of bias vectors of the hidden nodes.

Output: The latest parameters of FRBM: WL, WR, bL, bR, cL, cR.

 BEGIN

 1. For all hidden nodes j do

  obtain PL(hjL(0)=1|x(0)) and PR(hjR(0)=1|x(0)) by means of formula (18);

  sample hjL(0){0,1} from PL(hjL(0)|x(0));

  sample hjR(0){0,1} from PR(hjR(0)|x(0));

  end for

 2. For all visible nodes i do

  gain PL(xiL(1)=1|hL(0)) and PR(xiR(1)=1|hR(0)) by means of formula (19);

  sample xiL(1){0,1} from PL(xiL(1)|hL(0));

  sample xiR(1){0,1} from PR(xiR(1)|hR(0));

  end for

 3. For all hidden nodes j do

  obtain PL(hjL(1)=1|xL(1)) and PR(hjR(1)=1|xR(1)) by means of formula (18);

  sample hjL(1){0,1} from PL(hjL(1)|xL(1));

  sample hjR(1){0,1} from PR(hjR(1)|xR(1));

  end for

 4. The parameters can be updated according to the following rules:

  WL = WL + ε(x(0)PL(hL(0) = 1|x(0)) − xL(1)PL(hL(1) = 1|xL(1)));

  bL = bL + ε(x(0) − xL(1));

  cL = cL + ε(PL(hL(0) = 1|x(0)) − PL(hL(1) = 1|xL(1)));

  WR = WR + ε(x(0)PR(hR(0) = 1|x(0)) − xR(1)PR(hR(1) = 1|xR(1)));

  bR = bR + ε(x(0) − xR(1));

  cR = cR + ε(PR(hR(0) = 1|x(0)) − PR(hR(1) = 1|xR(1)));

 5. return WL, WR, bL, bR, cL, cR.

 END
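
For concreteness, the following NumPy sketch performs one CD-1 update of Algorithm 1; the same routine is applied to the lower-bound parameter set (WL, bL, cL) and the upper-bound set (WR, bR, cR). The weight update is written with an outer product, which is the usual vector form of the per-node rule in step 4; all variable names and sizes below are illustrative assumptions, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(x0, W, b, c, eps):
    # One CD-1 update for one parameter set (W, b, c); formulas (18)-(19) give the conditionals.
    ph0 = sigmoid(c + x0 @ W)                  # P(h = 1 | x0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample hidden states
    px1 = sigmoid(b + h0 @ W.T)                # P(x = 1 | h0)
    x1 = (rng.random(px1.shape) < px1) * 1.0   # reconstruct visible states
    ph1 = sigmoid(c + x1 @ W)                  # P(h = 1 | x1)
    W += eps * (np.outer(x0, ph0) - np.outer(x1, ph1))   # step 4 of Algorithm 1
    b += eps * (x0 - x1)
    c += eps * (ph0 - ph1)
    return W, b, c

# Toy usage: 6 visible and 4 hidden nodes; the upper bounds start slightly above the lower bounds.
D, F = 6, 4
W_L = rng.normal(0.0, 0.01, (D, F)); W_R = W_L + 0.01
b_L = np.zeros(D);                   b_R = b_L + 0.01
c_L = np.zeros(F);                   c_R = c_L + 0.01
x0 = rng.integers(0, 2, D).astype(float)
cd1_step(x0, W_L, b_L, c_L, eps=0.1)   # update theta_L
cd1_step(x0, W_R, b_R, c_R, eps=0.1)   # update theta_R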

B. FGBRBM and Its Learning Algorithm

The nodes of the visible and hidden layers in the fuzzy Gaussian-Bernoulli restricted Boltzmann machine (FGBRBM) model are Gaussian and Bernoulli nodes, respectively. In other words, the nodes of the visible layer \(v \in \mathbb{R}^{D}\) are Gaussian nodes and the nodes of the first hidden layer \(h \in \{0,1\}^{F}\) are Bernoulli nodes (binary nodes) [23]. Their structures are described in S3 Fig. The fuzzy energy function of FGBRBM can be defined as

\(\bar{E}(v,h;\bar{\theta})=\sum_{i}\dfrac{(v_{i}-\bar{b}_{i})^{2}}{2\sigma_{i}^{2}}-\sum_{j}\bar{c}_{j}h_{j}-\sum_{i,j}\dfrac{v_{i}}{\sigma_{i}}\bar{W}_{ij}h_{j},\)   (20)

where \(\sigma_{i}\) denotes the standard deviation of \(v_{i}\), the ith dimension of the Gaussian visible nodes.

Correspondingly, for FGBRBM, the fuzzy conditional probability distribution for obtaining the nodes of the hidden layer from the nodes of the visible layer can be described as:

\(\bar{P}(h_{j}=1\mid v)=g\!\left(\bar{c}_{j}+\sum_{i}\dfrac{v_{i}}{\sigma_{i}}\bar{W}_{ij}\right),\)   (21)

where g(⋅) denotes the sigmoid function.

The formula for reconstructing the nodes of the visible layer from the nodes of the hidden layer can be expressed as

\(v_{i}=N\!\left(\bar{b}_{i}+\sigma_{i}\sum_{j=1}^{F}\bar{W}_{ij}h_{j},\ \sigma_{i}^{2}\right).\)   (22)

In this formula, N(μ, σ2) is a Gaussian probability density function, where μ is the mean value and σ2 denotes the variance.

The α-cuts of the fuzzy conditional probabilities can be expressed as \(\bar{P}(h_{j}=1\mid v)[\alpha]=[P^{L}(h_{j}=1\mid v),P^{R}(h_{j}=1\mid v)]\) and \(\bar{v}_{i}[\alpha]=N(\bar{b}_{i}+\sigma_{i}\sum_{j=1}^{F}\bar{W}_{ij}h_{j},\sigma_{i}^{2})[\alpha]=[v_{i}^{L},v_{i}^{R}]\), where \(P^{L}(h_{j}\mid v)\) and \(P^{R}(h_{j}\mid v)\) are the conditional probabilities with respect to the lower and upper bounds of the parameters governing the FGBRBM, and \(v_{i}^{L}\) and \(v_{i}^{R}\) are the values given by the corresponding normal distributions. Consequently,

\(P^{L}(h_{j}\mid v)=P(h_{j}\mid v;\theta^{L})=\sigma(c_{j}^{L}+W_{\cdot j}^{L}v),\qquad P^{R}(h_{j}\mid v)=P(h_{j}\mid v;\theta^{R})=\sigma(c_{j}^{R}+W_{\cdot j}^{R}v),\)   (23)

and

\(v_{i}^{L}=N\!\left(b_{i}^{L}+\sigma_{i}\sum_{j=1}^{F}W_{ij}^{L}h_{j},\ \sigma_{i}^{2}\right),\qquad v_{i}^{R}=N\!\left(b_{i}^{R}+\sigma_{i}\sum_{j=1}^{F}W_{ij}^{R}h_{j},\ \sigma_{i}^{2}\right),\)   (24)

where \(W_{ij}^{L}\) and \(W_{ij}^{R}\) are the lower and upper bounds of the connection weight, respectively; \(b_{i}^{L}\) and \(b_{i}^{R}\) are the lower and upper bounds of the visible node bias, respectively; and \(c_{j}^{L}\) and \(c_{j}^{R}\) are the lower and upper bounds of the hidden node bias, respectively.

The CD-1 algorithm can also be used for learning the parameters of the fuzzy Gaussian-Bernoulli RBM model. In the training process, the input data of the whole training set should first be normalized so that each dimension of the input data in the FGBRBM model obeys a standard normal distribution, i.e., the mean value is 0 and the variance is 1. When the formula \(P(v_{i}=1\mid h,\bar{\theta})\) is calculated, σ = 1. In the process of reconstructing the nodes of the visible layer, this paper uses the values of the normal distribution rather than binary data. The above description is summarized as Algorithm 2.

Algorithm 2 The learning algorithm of FGBRBM

Input: x(0) is a sample of the training dataset;

  ε is the learning rate;

  WL and WR are the lower and upper bounds of connection weight matrices of the visible and hidden layers;

  bL and bR are the lower and upper bounds of bias vectors of the visible nodes;

  cL and cR are the lower and upper bounds of bias vectors of the hidden nodes.

Output: The latest parameters of FGBRBM: WL, WR, bL, bR, cL, cR.

 BEGIN

 1. For all hidden nodes j do

  obtain PL(hjL(0)=1|v(0)) and PR(hjR(0)=1|v(0)) by means of formula (23);

  sample hjL(0){0,1} from PL(hjL(0)|v(0));

  sample hjR(0){0,1} from PR(hjR(0)|v(0));

  end for

 2. For all visible nodes i do

  obtain viL and viR by means of formula (24);

  viL(1)=N(biL+j=1FWijLhjL(0),1);

  viR(1)=N(biR+j=1FWijRhjR(0),1);

  end for

 3. For all hidden nodes j do

  obtain PL(hjL(1)=1|vL(1)) and PR(hjR(1)=1|vR(1)) by means of formula (23);

  sample hjL(1){0,1} from PL(hjL(1)|vL(1));

  sample hjR(1){0,1} from PR(hjR(1)|vR(1));

  end for

 4. The parameters can be updated according to the following rules:

  WL = WL + ε(v(0)PL(hL(0) = 1|v(0)) − vL(1)PL(hL(1) = 1|vL(1)));

  bL = bL + ε(v(0) − vL(1));

  cL = cL + ε(PL(hL(0) = 1|v(0)) − PL(hL(1) = 1|vL(1)));

  WR = WR + ε(v(0)PR(hR(0) = 1|v(0)) − vR(1)PR(hR(1) = 1|vR(1)));

  bR = bR + ε(v(0) − vR(1));

  cR = cR + ε(PR(hR(0) = 1|v(0)) − PR(hR(1) = 1|vR(1)));

 5. return WL, WR, bL, bR, cL, cR.

 END
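
A minimal sketch of one CD-1 step of Algorithm 2 is given below, assuming the input has already been normalized so that σ = 1. The visible reconstruction follows formula (24) by drawing from the normal distribution (using its mean instead is an equally common choice), and, as in Algorithm 1, the same routine is run on both parameter bounds. Names and sizes are our own assumptions.

import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def fgbrbm_cd1_step(v0, W, b, c, eps):
    # One CD-1 step with Gaussian visible and Bernoulli hidden nodes (sigma = 1 after normalization).
    ph0 = sigmoid(c + v0 @ W)                       # formula (23)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0
    v1 = rng.normal(loc=b + h0 @ W.T, scale=1.0)    # formula (24): draw from N(b + W h, 1)
    ph1 = sigmoid(c + v1 @ W)
    W += eps * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += eps * (v0 - v1)
    c += eps * (ph0 - ph1)
    return W, b, c

# Usage on whitened (zero-mean, unit-variance) real-valued input, applied to both bounds.
D, F = 5, 3
v0 = rng.normal(size=D)
W_L = rng.normal(0.0, 0.01, (D, F)); W_R = W_L + 0.01
b_L = np.zeros(D);                   b_R = b_L + 0.01
c_L = np.zeros(F);                   c_R = c_L + 0.01
fgbrbm_cd1_step(v0, W_L, b_L, c_L, eps=0.01)
fgbrbm_cd1_step(v0, W_R, b_R, c_R, eps=0.01)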

C. FLR and Its Learning Algorithm Used for Modeling the Click Through Rate

The logistic regression (LR) model [24] is a classic model in the field of advertising click-through rate prediction. The LR model is described in S4 Fig. The activation function of the node in the output layer is a sigmoid function [25]. The output value of the node (the value of the activation function) \(h_{\theta}(x)\) is the probability that a user will click on the advertisement; it can also be described as \(P(y=1\mid x;\theta)\). When a sample \((x_{d},y_{d})\) in the training set \(D=\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{N},y_{N})\}\) is given, the probability value of the advertising click-through rate prediction can be obtained in terms of logistic regression and can be described by formula (25)

\(p(\text{click}\mid a,u,c)=p(y_{d}=1\mid x_{d};\theta)=\dfrac{1}{1+e^{-\theta^{T}x_{d}}}.\)   (25)

Correspondingly, \(p(y_{d}=0\mid x_{d};\theta)=1-\dfrac{1}{1+e^{-\theta^{T}x_{d}}}.\)

In this formula, θT denotes the vector of parameters of the standardized logistic regression model, a refers to the features of the advertisement, u denotes the features of users, and c represents the features of the context environment. These three features compose a vector, which can be denoted as xd. This vector is fed into the logistic regression model for CTR prediction.

For a single sample \((x_{d},y_{d})\), the probability of correctly predicting the advertising click-through rate can be described as \(P(y_{d}\mid x_{d};\theta)=h_{\theta}(x_{d})^{y_{d}}(1-h_{\theta}(x_{d}))^{1-y_{d}}\). This paper selects the negative log-likelihood as the objective function:

\(J(x_{d};\theta)=-\dfrac{1}{N}\log P(Y_{D}\mid X_{D};\theta)=-\dfrac{1}{N}\log\prod_{d=1}^{N}h_{\theta}(x_{d})^{y_{d}}(1-h_{\theta}(x_{d}))^{1-y_{d}}=-\dfrac{1}{N}\sum_{d=1}^{N}\left[y_{d}\log h_{\theta}(x_{d})+(1-y_{d})\log(1-h_{\theta}(x_{d}))\right].\)   (26)

In formula (26), the click label \(y_{d}\) denotes the expected output and \(h_{\theta}(x_{d})\) is the real output value of the neuron node in the output layer (the output of the activation function); \(h_{\theta}(x_{d})\) can be described as \(h_{\theta}(x_{d})=g(z)\), where \(z=\sum_{i}w_{i}x_{i}+b=\theta^{T}x_{d}\) and \(g(\cdot)\) denotes the sigmoid function.

As for the fuzzy logistic regression model, the objective function can be described as

\(\bar{J}(x_{d};\bar{\theta})=-\dfrac{1}{N}\log\bar{P}(Y_{D}\mid X_{D};\bar{\theta})=-\dfrac{1}{N}\log\prod_{d=1}^{N}\bar{h}_{\bar{\theta}}(x_{d})^{y_{d}}(1-\bar{h}_{\bar{\theta}}(x_{d}))^{1-y_{d}}=-\dfrac{1}{N}\sum_{d=1}^{N}\left[y_{d}\log\bar{h}_{\bar{\theta}}(x_{d})+(1-y_{d})\log(1-\bar{h}_{\bar{\theta}}(x_{d}))\right],\)   (27)

where \(\bar{h}_{\bar{\theta}}(x_{d})=\bar{g}(\bar{z})\), \(\bar{z}=\sum_{i}\bar{w}_{i}x_{i}+\bar{b}=\bar{\theta}^{T}x_{d}\) and \(g(x)=\dfrac{1}{1+e^{-x}}\).

According to formulas (5) and (7), the gradient of the objective function of FLR with respect to fuzzy parameter θ¯ can be expressed as

\(\dfrac{\partial\bar{J}(x_{d};\bar{\theta})}{\partial\theta^{L}}=\dfrac{\partial J_{c}(x_{d};\bar{\theta})}{\partial\theta^{L}}=-\dfrac{1}{N}\sum_{d=1}^{N}\left(y_{d}\dfrac{1}{h_{\theta^{L}}(x_{d})}\dfrac{\partial h_{\theta^{L}}(x_{d})}{\partial\theta^{L}}-(1-y_{d})\dfrac{1}{1-h_{\theta^{L}}(x_{d})}\dfrac{\partial h_{\theta^{L}}(x_{d})}{\partial\theta^{L}}\right)=\dfrac{1}{N}\sum_{d=1}^{N}(h_{\theta^{L}}(x_{d})-y_{d})x_{d}=C\sum_{d=1}^{N}(h_{\theta^{L}}(x_{d})-y_{d})x_{d},\)   (28)
\(\dfrac{\partial\bar{J}(x_{d};\bar{\theta})}{\partial\theta^{R}}=\dfrac{\partial J_{c}(x_{d};\bar{\theta})}{\partial\theta^{R}}=-\dfrac{1}{N}\sum_{d=1}^{N}\left(y_{d}\dfrac{1}{h_{\theta^{R}}(x_{d})}\dfrac{\partial h_{\theta^{R}}(x_{d})}{\partial\theta^{R}}-(1-y_{d})\dfrac{1}{1-h_{\theta^{R}}(x_{d})}\dfrac{\partial h_{\theta^{R}}(x_{d})}{\partial\theta^{R}}\right)=\dfrac{1}{N}\sum_{d=1}^{N}(h_{\theta^{R}}(x_{d})-y_{d})x_{d}=C\sum_{d=1}^{N}(h_{\theta^{R}}(x_{d})-y_{d})x_{d}.\)   (29)

The above description is summarized as Algorithm 3.

Algorithm 3 The learning algorithm of FLR

Input: (xd,yd) is a sample of training sets;

  ε is the learning rate;

  θL={WL,cL} and θR={WR,cR}, where WL and WR are the lower and upper bounds of connection weight matrices of FLR, and cL and cR are the lower and upper bounds of bias vectors of the output node of FLR;

Output: The latest parameters of FLR: θL={WL,cL}, θR={WR,cR}.

 BEGIN

 1. obtain \(\partial J_{c}(\bar{\theta})/\partial\theta^{L}\) by means of formula (28) based on the input data \(x_{d}\);

 2. obtain \(\partial J_{c}(\bar{\theta})/\partial\theta^{R}\) by means of formula (29) based on the input data \(x_{d}\);

 3. The parameters can be updated according to the following rules:

  \(\theta^{L}=\theta^{L}+\varepsilon\,\partial J_{c}(\bar{\theta})/\partial\theta^{L}\);

  \(\theta^{R}=\theta^{R}+\varepsilon\,\partial J_{c}(\bar{\theta})/\partial\theta^{R}\);

 4. return θL,θR.

 END
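
The gradient expressions (28)-(29) and the update loop of Algorithm 3 reduce to a few lines of NumPy. The sketch below is our own illustration: the names and toy data are assumptions, the step is written as a descent step on the negative log-likelihood, and the prediction is defuzzified with Eq (10).

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def flr_update(X, y, theta_L, theta_R, eps=0.01):
    # One gradient step of Algorithm 3 on a mini-batch (X, y).
    # Gradients follow Eqs (28)-(29): (1/N) * sum_d (h_theta(x_d) - y_d) * x_d,
    # computed separately for the lower and upper parameter bounds.
    N = X.shape[0]
    grad_L = X.T @ (sigmoid(X @ theta_L) - y) / N
    grad_R = X.T @ (sigmoid(X @ theta_R) - y) / N
    theta_L = theta_L - eps * grad_L      # descent on the negative log-likelihood
    theta_R = theta_R - eps * grad_R
    return theta_L, theta_R

def flr_predict(X, theta_L, theta_R):
    # Defuzzified CTR prediction, Eq (10): average of the two bound outputs.
    return 0.5 * (sigmoid(X @ theta_L) + sigmoid(X @ theta_R))

# Toy usage: 4 samples, 3 features (a bias column can be appended to X if needed).
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3)); y = np.array([0.0, 1.0, 1.0, 0.0])
theta_L, theta_R = np.zeros(3), np.zeros(3) + 0.01
theta_L, theta_R = flr_update(X, y, theta_L, theta_R)
print(flr_predict(X, theta_L, theta_R))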

Advertising CTR prediction model FDNN

A. Construction of Advertising CTR Prediction Model FDNN

This section describes the proposed fuzzy deep neural network (FDNN) model for advertising CTR prediction and its learning algorithm in detail. The architecture of the method and its process of predicting CTR are illustrated in S5 Fig.

First, the fuzzy deep belief network (FDBN), constituted by FGBRBM and FRBMs, is used to automatically learn discriminative features of the raw input data from an advertising dataset. These features capture the internal explanatory information that is hidden in the raw data, amplify the information that is important for discrimination and weaken irrelevant information [26].

Next, FLR is adopted to model the CTR prediction and map the predicted CTR value to the range between 0 and 1.

The raw input data of advertising CTR prediction are the fields that include the corresponding advertisement features, users’ features, and context features, which are extracted from the advertisement click log. These fields contain not only numerical data but also multi-category data; therefore, the raw input data of advertising CTR prediction are composed of vectors with different data types.

Because FRBM can only model input with binary values (0, 1), this paper adopts the fuzzy Gaussian-Bernoulli RBM (FGBRBM) to handle the input data of advertising CTR prediction.

Then, FRBM can be used to construct FDBN through layer-by-layer stacking; the detailed steps can be found in [22].

In real applications, advertising click-through rate prediction is a binary classification or regression task; therefore, this paper uses fuzzy logistic regression to model user click behavior on advertisements. After the input data vector x = (a, u, c) = (x1, x2, …, xn) has been processed by FDBN, the binary-valued vector in the last hidden layer of FDBN is fed into the fuzzy logistic regression model for CTR prediction.

The advantage of the proposed method is that fuzzy set techniques are introduced, and the uncertainties in the relationships between nodes located in adjacent layers are taken into consideration. Nodes in adjacent layers of FDNN often interact in uncertain ways. Because the parameters that represent the relationships between nodes in adjacent layers are fuzzy numbers and the learning process of the fuzzy parameters is extended to a relatively wider space, this advantage is reflected in the fitness of the joint probability distribution. Combined with the merits of deep learning, the proposed method demonstrates superior performance in coping with data containing noise.

B. Construction and Training of the Advertising CTR Prediction Model FDNN

The process of constructing FDNN is illustrated in S6 Fig, and the detailed steps of constructing the FDNN are as follows:

Step 1: The training of FRBM layer by layer

First, the vectors that are constituted by related data of textual advertising click logs are regarded as the input of FGBRBM, and the CD1 algorithm is used to train FGBRBM; in this way, the parameter θ1 of FGBRBM is obtained. Based on the input vectors and the parameter θ1, the hidden nodes h1 of FGBRBM are obtained.

Second, the parameter θ2 of the current FRBM can be obtained when h1 is regarded as the input data for the CD1 algorithm. Based on h1 and θ2, the hidden nodes h2 of the current FRBM can be obtained.

Third, the previous process is repeated until reaching the fixed layer.

Step 2: Constructing FDBN by means of FRBMs

According to the training procedure of FRBM in step 1, the weight values of the FRBMs are connected from bottom to top to construct the FDBN. In this FDBN, the connection between the top two layers is undirected and the other connections are directed from top to bottom [22].

Step 3: Constructing the FDNN that is used for advertising CTR prediction by means of FDBN

After an FLR model is added on top of the FDBN, the probability that the user clicks this advertisement is equal to the output value of the node in the output layer. At this point, the network becomes the initially trained fuzzy deep neural network (FDNN).

Step 4: Fine-tuning the parameters of FDNN

After FDNN is initially trained layer by layer, a fine-tuning process needs to be conducted. In contrast to the unsupervised learning approach in the initial training, the fine-tuning is implemented using supervised learning. Based on the weight values of the network that were obtained from the initial training, error backpropagation [27, 29] and stochastic gradient descent are used to update the weight values and bias of the neural network. After fine-tuning, the process of training FDNN is complete. The output of FDNN is in the form of a probability: p(click|a,u,c)=p(yd=1|xd;θ).

The process of fine-tuning the parameters of FDNN contains two phases.

1) Forward calculation of the input signal:

When labeled data (x_d, y_d) of click logs are given, that is, when the combined feature vector x = (a, u, c) = (x1, x2, …, xn) is given, the notation for the jth node in the lth layer (l = 2, …, Ln) of the FDNN deep neural network is defined as follows:

Ln denotes the total number of layers in the FDNN network;

sl denotes the number of nodes in the lth layer of the FDNN network;

\(\bar{w}_{ji}^{l}\) denotes the connection weight between the jth node in the lth layer and the ith node in the (l − 1)th layer; \(\bar{b}_{j}^{l}\) denotes the bias of the jth node in the lth layer; \(\overline{net}_{j}^{l}=\bar{b}_{j}^{l}+\sum_{i=1}^{s_{l-1}}\bar{w}_{ji}^{l}\bar{o}_{i}^{l-1}\) denotes the weighted-sum input of the jth node in the lth layer;

\(\bar{o}_{j}^{l}\) denotes the output value of the activation function of the jth node in the lth layer;

The selected activation function is sigmoid function.

In the forward calculation, the input signal is transmitted from the input layer to the output layer. The input value of the jth node in the second layer of FDNN (l = 2) can be described as \(\overline{net}_{j}^{2}=\bar{b}_{j}^{2}+\sum_{i=1}^{s_{1}}\bar{w}_{ji}^{2}x_{i}\). When the input of node j is transformed by the activation function, its output value can be described as \(\bar{o}_{j}^{2}=g(\overline{net}_{j}^{2})\). Therefore, the input value of the jth node in the lth layer of FDNN (l = 3, …, Ln) can be described as \(\overline{net}_{j}^{l}=\bar{b}_{j}^{l}+\sum_{i=1}^{s_{l-1}}\bar{w}_{ji}^{l}\bar{o}_{i}^{l-1}\) and its corresponding output value can be described as \(\bar{o}_{j}^{l}=g(\overline{net}_{j}^{l})\).

In particular, the output value of the jth node in the Ln-th (output) layer of FDNN can be described as

\(\bar{h}_{\bar{\theta}}(x_{d})=\bar{o}_{j}^{L_{n}}=\bar{o}_{1}^{L_{n}}.\)   (30)

2) Backpropagation of the error signal:

This paper regards the cross-entropy as the objective function [25]; it can be described as formula (31). The local gradients are refined using backpropagation: the error signal between the output value and the labeled signal is transmitted from the output layer back to the input layer in this process.

\(\bar{J}(x_{d};\bar{\theta})=y_{d}\log\bar{h}_{\bar{\theta}}(x_{d})+(1-y_{d})\log(1-\bar{h}_{\bar{\theta}}(x_{d}))=y_{d}\log g(\overline{net}_{j}^{L_{n}})+(1-y_{d})\log(1-g(\overline{net}_{j}^{L_{n}})),\)   (31)

In this formula, \(y_{d}\) is the label value of the sample, \(y_{d}\in\{0,1\}\).

The gradient descent method is adopted to learn the model. This paper first defines the residual error \(\bar{\delta}_{j}^{l}\) [28] of the jth node in the lth layer as the partial derivative of the objective function \(\bar{J}\) with respect to the weighted-sum input \(\overline{net}_{j}^{l}\) of node j. The residual error \(\bar{\delta}_{j}^{L_{n}}\) of the jth neuron node of the Ln-th layer can be calculated by formula (32)

\(\bar{\delta}_{j}^{L_{n}}=\dfrac{\partial\bar{J}}{\partial\overline{net}_{j}^{L_{n}}}=y_{d}\dfrac{1}{g(\overline{net}_{j}^{L_{n}})}\dfrac{\partial g(\overline{net}_{j}^{L_{n}})}{\partial\overline{net}_{j}^{L_{n}}}-(1-y_{d})\dfrac{1}{1-g(\overline{net}_{j}^{L_{n}})}\dfrac{\partial g(\overline{net}_{j}^{L_{n}})}{\partial\overline{net}_{j}^{L_{n}}}=y_{d}-g(\overline{net}_{j}^{L_{n}})=y_{d}-\bar{o}_{1}^{L_{n}}.\)   (32)

The residual error \(\bar{\delta}_{j}^{l}\) of the jth neuron node of the lth layer (l = 2, …, Ln − 1) can be calculated by formula (33) according to the chain rule [29].

\(\bar{\delta}_{j}^{l}=\dfrac{\partial\bar{J}}{\partial\overline{net}_{j}^{l}}=\dfrac{\partial\bar{J}}{\partial\bar{o}_{j}^{l}}\cdot\dfrac{\partial\bar{o}_{j}^{l}}{\partial\overline{net}_{j}^{l}}=\sum_{i=1}^{s_{l+1}}\left(\dfrac{\partial\bar{J}}{\partial\overline{net}_{i}^{l+1}}\dfrac{\partial\overline{net}_{i}^{l+1}}{\partial\bar{o}_{j}^{l}}\right)\cdot\dfrac{\partial\bar{o}_{j}^{l}}{\partial\overline{net}_{j}^{l}}=\left(\sum_{i=1}^{s_{l+1}}\bar{\delta}_{i}^{l+1}\bar{w}_{ji}^{l+1}\right)\dfrac{\partial g(\overline{net}_{j}^{l})}{\partial\overline{net}_{j}^{l}}=\left(\sum_{i=1}^{s_{l+1}}\bar{\delta}_{i}^{l+1}\bar{w}_{ji}^{l+1}\right)g(\overline{net}_{j}^{l})(1-g(\overline{net}_{j}^{l}))=\left(\sum_{i=1}^{s_{l+1}}\bar{\delta}_{i}^{l+1}\bar{w}_{ji}^{l+1}\right)\bar{o}_{j}^{l}(1-\bar{o}_{j}^{l}),\)   (33)

where \(\bar{w}_{ji}^{l+1}\) denotes the connection weight between the jth node of the lth layer and the ith node of the (l + 1)th layer.

According to formulas (2) and (10), the gradient of the cost function \(\bar{J}(\bar{W},\bar{b})\) with respect to the parameters \(\bar{w}_{ji}^{l}\) and \(\bar{b}_{j}^{l}\) can be calculated by formulas (34), (35), (36) and (37).

\(\dfrac{\partial\bar{J}}{\partial w_{ji}^{lL}}=\dfrac{\partial\bar{J}}{\partial\overline{net}_{j}^{l}}\cdot\dfrac{\partial\overline{net}_{j}^{l}}{\partial w_{ji}^{lL}}=\bar{\delta}_{j}^{lL}\cdot\bar{o}_{i}^{(l-1)L},\)   (34)
\(\dfrac{\partial\bar{J}}{\partial b_{j}^{lL}}=\bar{\delta}_{j}^{lL},\)   (35)
\(\dfrac{\partial\bar{J}}{\partial w_{ji}^{lR}}=\dfrac{\partial\bar{J}}{\partial\overline{net}_{j}^{l}}\cdot\dfrac{\partial\overline{net}_{j}^{l}}{\partial w_{ji}^{lR}}=\bar{\delta}_{j}^{lR}\cdot\bar{o}_{i}^{(l-1)R},\)   (36)
\(\dfrac{\partial\bar{J}}{\partial b_{j}^{lR}}=\bar{\delta}_{j}^{lR}.\)   (37)

The above process is described in Algorithm 4.

Algorithm 4 The algorithm for fine-tuning the parameters of FDNN using BP

Input: (xd, yd) is the labeled training data of textual advertising click logs;

  ε is the learning rate;

  \(w_{ji}^{lL}\) is the lower bound of the connection weight between the jth node in the lth layer and the ith node in the (l − 1)th layer of FDNN;

  \(b_{j}^{lL}\) is the lower bound of the bias of the jth node in the lth layer of FDNN;

  \(w_{ji}^{lR}\) is the upper bound of the connection weight between the jth node in the lth layer and the ith node in the (l − 1)th layer of FDNN;

  \(b_{j}^{lR}\) is the upper bound of the bias of the jth node in the lth layer of FDNN;

Output: The latest parameters: \(w_{ji}^{lL}\), \(w_{ji}^{lR}\), \(b_{j}^{lL}\), \(b_{j}^{lR}\)

 BEGIN

 1. In the forward-propagation calculation, when the input data vector xd is fed into the input layer, the output value of FDNN can be obtained using formula (30);

 2. In the calculation of error backpropagation,

  obtain \(\partial\bar{J}/\partial w_{ji}^{lL}\) by means of formula (34);

  obtain \(\partial\bar{J}/\partial b_{j}^{lL}\) by means of formula (35);

  obtain \(\partial\bar{J}/\partial w_{ji}^{lR}\) by means of formula (36);

  obtain \(\partial\bar{J}/\partial b_{j}^{lR}\) by means of formula (37);

 3. The parameters can be updated according to the following rules:

  \(w_{ji}^{lL}=w_{ji}^{lL}+\varepsilon\,\partial\bar{J}/\partial w_{ji}^{lL}\);

  \(b_{j}^{lL}=b_{j}^{lL}+\varepsilon\,\partial\bar{J}/\partial b_{j}^{lL}\);

  \(w_{ji}^{lR}=w_{ji}^{lR}+\varepsilon\,\partial\bar{J}/\partial w_{ji}^{lR}\);

  \(b_{j}^{lR}=b_{j}^{lR}+\varepsilon\,\partial\bar{J}/\partial b_{j}^{lR}\);

 4. return wjilL, bjlL, wjilR, bjlR.

 END
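
The forward calculation of Eq (30) and the backpropagation rules (32)-(37) of Algorithm 4 can be sketched compactly for one bound set; in FDNN, the same pass is executed once with the lower-bound parameters and once with the upper-bound parameters. The two-layer toy network and all names below are illustrative assumptions, not the paper's implementation.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Forward calculation: o^l = g(W^l o^(l-1) + b^l) for each layer, ending with Eq (30).
    outs = [x]
    for W, b in zip(weights, biases):
        outs.append(sigmoid(W @ outs[-1] + b))
    return outs

def backward(outs, y, weights, biases, eps):
    # Backpropagation of Eqs (32)-(37) for one bound set (theta_L or theta_R).
    delta = y - outs[-1]                       # Eq (32): residual at the output node
    for l in reversed(range(len(weights))):
        grad_W = np.outer(delta, outs[l])      # Eqs (34)/(36)
        grad_b = delta                         # Eqs (35)/(37)
        prev_delta = (weights[l].T @ delta) * outs[l] * (1 - outs[l])  # Eq (33)
        weights[l] += eps * grad_W             # ascent on the objective (31), i.e. descent on cross-entropy loss
        biases[l] += eps * grad_b
        delta = prev_delta
    return weights, biases

# Toy usage: input, hidden and output node counts; run on both bound sets in practice.
rng = np.random.default_rng(3)
sizes = [8, 4, 1]
Ws = [rng.normal(0.0, 0.1, (sizes[i + 1], sizes[i])) for i in range(2)]
bs = [np.zeros(sizes[i + 1]) for i in range(2)]
x, y = rng.normal(size=8), np.array([1.0])
outs = forward(x, Ws, bs)
backward(outs, y, Ws, bs, eps=0.05)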

Experiment datasets and environment

Description of dataset

This paper uses the Criteo Display Advertising Challenge dataset, which was provided by the Criteo company on the machine learning website Kaggle for an advertising CTR competition [30]. Every advertisement log record contains the set of features of the displayed advertisement, which is composed of 13 numerical features and 26 categorical features. The detailed formats of these features are given in S1 Table. The class Label indicates whether the advertisement was clicked: if the advertisement was clicked, the value of Label is 1; otherwise, the value of Label is 0. Categorical features are desensitized using hash mapping; therefore, their real meaning cannot be recovered.

To adequately evaluate the performance of the model and prevent interference from factors such as data timeliness, we randomly shuffle the order of the raw samples in the experiment. Meanwhile, to avoid over-fitting, the 5-fold cross validation method [31] is adopted in the training process of all experiments.

Experiment environment

Six high-performance workstations are used for the experiments. The hardware and software configurations of the experimental environment are shown as S2 Table.

Results and analysis

Baseline models

This paper selects the logistic regression (LR), Factorization Machines (FMs) [32], GBDT+FM [8], and DBNLR [11] models as the baseline models for performance comparison.

Performance evaluation metric of models

AUC

This paper uses the area under the receiver operating characteristic curve (AUC) [33, 34] as the main performance evaluation metric for advertising CTR prediction. The confusion matrix is shown in S3 Table [35, 36]. The formulas TPR = TP/(TP + FN) and FPR = FP/(FP + TN) are used to calculate the corresponding TPR and FPR values of the confusion matrix to obtain a coordinate pair (dot pair). These dot pairs make up the receiver operating characteristic (ROC) curve [37]. The AUC is the total area under the ROC curve. The larger the AUC is, the more accurate the advertising CTR prediction is, and the better the effect achieved by the advertisement placement is.

Log-loss

This paper selects the logarithmic loss function ‘log-loss’ [38, 39] as another auxiliary performance evaluation metric for advertising CTR prediction.

\(\text{logloss}=-\dfrac{1}{L}\sum_{i=1}^{L}\left[y_{i}\log\bar{y}_{i}+(1-y_{i})\log(1-\bar{y}_{i})\right],\)   (38)

where L is the number of samples, \(y_{i}\) is the true label (0 or 1), and \(\bar{y}_{i}\) is the predicted probability. The value of log-loss reflects the similarity between the CTR predicted by the model and the real CTR. The smaller the value of log-loss is, the more accurate the advertising CTR prediction is.
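
Both metrics are available off the shelf; a minimal example with scikit-learn (the label and prediction arrays below are hypothetical, not values from the experiments) is as follows.

import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

# Hypothetical labels and predicted CTR probabilities from any of the compared models.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0.12, 0.74, 0.55, 0.30, 0.81, 0.08])

print("AUC      :", roc_auc_score(y_true, y_pred))   # larger is better
print("log-loss :", log_loss(y_true, y_pred))        # smaller is better, Eq (38)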

Experiments design and results analysis

Configuration optimization and performance analysis of FDNN and DBNLR

A. Feature Pre-processing

FDNN is similar to DBNLR in architecture; their difference lies in the components of the model. Therefore, this paper selects DBNLR as a baseline model.

In the experiment on the FDNN and DBNLR models, the numerical features of the raw dataset are directly adopted, and these categorical features are expanded to binary vectors using one-hot representation.

Although the method of one-hot representation is straightforward and effective, it makes the feature vectors sparse and high-dimensional. In the deep learning models, when the dimensionality of the features is large, the scale of the parameters of the model becomes large, which leads to a sharp increase in training time. Moreover, the hardware will be a bottleneck in the training process. Therefore, this paper adopts a feature hashing strategy (signed hashing trick) [40] to reduce the dimensionality of these sparse binary feature vectors. Finally, the joint vector that is composed of numerical features and categorical features and has been processed by the feature hashing strategy is regarded as the input of the two deep architecture models.

Input vectors can be obtained when the data of the Display Advertising Challenge dataset are preprocessed using the feature preprocessing methods described above. In this experiment, after preprocessing, the dimensionality of the input vector is 1746088. Therefore, the number of nodes in the visible layer (input layer) is 1746088.
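
A hedged sketch of this preprocessing pipeline is given below, using scikit-learn's FeatureHasher as one possible implementation of the signed hashing trick; the field names, the hash dimensionality (2^20 here, not the 1746088 dimensions used in the paper) and the example values are hypothetical.

import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Hypothetical raw Criteo-style records: numerical fields (I1..I13) and categorical fields (C1..C26);
# only a few fields are shown here.
records = [
    {"I1": 3.0, "I2": 12.0, "C1": "68fd1e64", "C2": "80e26c9b"},
    {"I1": 0.0, "I2": 1.0,  "C1": "05db9164", "C2": "38a947a1"},
]

def to_hashed_vector(record, n_features=2**20):
    # Numerical features are kept as values; categorical features become "field=value"
    # tokens and are mapped to a fixed-size vector by the signed hashing trick.
    numeric = [float(v) for k, v in sorted(record.items()) if k.startswith("I")]
    categorical = ["%s=%s" % (k, v) for k, v in record.items() if k.startswith("C")]
    hasher = FeatureHasher(n_features=n_features, input_type="string")
    hashed = hasher.transform([categorical]).toarray()[0]
    return np.concatenate([np.array(numeric), hashed])

print(to_hashed_vector(records[0]).shape)   # (2 + 2**20,) for this toy example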

B. The Parameter Settings for Training FDNN and DBNLR

These two models are developed based on the Theano framework [41] by means of Python. In all experiments, the maximum number of learning epochs in the training process is set to a sufficiently large value (800) and the Early-Stopping strategy is adopted to automatically cut epochs for obtaining the best validation score. Furthermore, this strategy can also avoid over-fitting.

The steps of the Early-Stopping strategy are as follows (a minimal sketch is given after the list):

1. Split the dataset into a training set and a validation set;

2. When the training of every epoch has been finished, calculate the validation error (loss) in the validation dataset;

3. When the validation error on the validation dataset no longer decreases, record the number of epochs and the current validation error value. If the validation error still does not decrease after continued training, the training process is terminated, the best validation score is obtained, and the optimization is complete. Meanwhile, the optimal parameter estimates are obtained.
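
The minimal sketch of this strategy referenced above is shown here; the callbacks train_epoch() and validate(), and the patience threshold, are our own assumptions rather than the exact criterion used in the paper.

def train_with_early_stopping(train_epoch, validate, max_epochs=800, patience=10):
    # train_epoch() runs one training epoch; validate() returns the validation loss.
    # Training stops once the validation loss has not improved for `patience` epochs.
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        loss = validate()
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # record the improved score
        else:
            wait += 1
            if wait >= patience:                           # no further improvement: stop training
                break
    return best_epoch, best_loss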

In the initial training process of FDBN and DBNLR, the gradient descent method with minibatch is used for training on these samples and each minibatch includes 2000 training samples. For FGBRBM (or GBRBM), the learning rate is 0.01, while for FRBM (or RBM) with the binary variable, the learning rate is 0.1. The decay factor of the weight value is 0.0002.

In the fine-tuning process of FDNN and DBNLR, the gradient descent method with minibatch is used for training on these samples and each minibatch includes 4000 training samples. The decay factor of the weight value is 0 and the learning rate is set as 0.005.

Dropout represents the probability that a node is retained in the neural network [42, 43]. A reasonable sparse dropout strategy can strengthen the robustness of a neural network, and dropout is utilized here to reduce over-fitting and the complexity of these two deep learning models. This paper sets the dropout rate to 0.5, 0.6, 0.7, 0.8, and 0.9 in the testing experiments. Under the same configuration structure, the optimal dropout rates for FDNN and DBNLR are 0.5 and 0.6, respectively.

C. Optimization of the number of hidden layers and hidden nodes

Currently, there is no method for quickly setting the number of hidden layers and the number of hidden nodes in each layer; therefore, many experiments and much experience are needed to search for the optimal structure. According to past experience, when the number of nodes in the visible layer is large, the hidden layers need to perform "dimensionality reduction - dimensionality addition - dimensionality reduction", whereas when the number of nodes in the visible layer is small, the hidden layers only need to perform "dimensionality addition - dimensionality reduction". In general, the number of hidden nodes is increased or decreased by multiples. By comparing the performance of many experiments with different configurations (different numbers of nodes in each hidden layer and different numbers of hidden layers), a configuration with relatively optimal performance can be obtained.

In S4 Table, information on different configurations in many experiments is given. In S7 Fig, the corresponding AUC values of DBNLR and FDNN with different configurations are given. In S8 Fig, the corresponding log-loss values of DBNLR and FDNN with different configurations are given.

As presented in S7 and S8 Figs, in the experiments for training FDNN, increasing the number of hidden layers initially improves the performance of the proposed method; however, when the number of hidden layers exceeds 3, the performance degrades. This is because FDNN does not adequately capture feature interactions when the number of hidden layers is small, whereas when the number of hidden layers and the number of nodes in the hidden layers exceed a certain scale, the cross-features disturb and degrade the accuracy of CTR prediction: the over-fitting phenomenon is observed and the algorithm spends more time learning the parameters.

After many experiments, a configuration with relatively optimal performance is obtained: Conf11, in which the number of hidden layers is 3, the number of nodes in the first hidden layer is 170, the number of nodes in the second hidden layer is 1700, and the number of nodes in the third hidden layer is 17. This paper chooses this configuration as the final structure of FDNN. The corresponding AUC and log-loss on the test dataset can be seen in S9 and S10 Figs.

In the experiments for training DBNLR, relatively optimal performance is obtained with the following configuration: Conf7, in which the number of hidden layers is 3, the number of nodes in the first hidden layer is 150, the number of nodes in the second hidden layer is 1500, and the number of nodes in the third hidden layer is 15. This paper chooses this configuration as the final structure of DBNLR. The corresponding AUC and log-loss on the test dataset can be seen in S9 and S10 Figs.

We observe that the AUC curves of the DBNLR model and the FDNN model are jagged and intertwined. However, for the same configuration, the overall performance of DBNLR is worse than that of FDNN, and the same holds for the log-loss index. This is because fuzzy set techniques are introduced into the proposed FDNN model, whose components are fuzzy RBM, fuzzy GBRBM and fuzzy LR. Many records in the Display Advertising Challenge dataset contain outliers and missing values, which imply many uncertainties; these uncertainties can be taken into consideration and efficiently handled by fuzzy set techniques. Therefore, the performance of the FDNN model for CTR prediction is better than that of DBNLR in the experiment.

Configuration optimization of other baseline models

A. LR Solution

CTR prediction in a real advertising system involves a large number of samples, while the logistic regression (LR) model can appropriately handle a large number of features and can be trained rapidly, so the LR model is selected as a baseline model in this paper. This paper selects LIBLINEAR [44] to implement L2-regularized logistic regression, and the stochastic gradient (SG) method is adopted to optimize the parameters of LR. In the LR solution, the numerical features are used directly, and the categorical features are first represented with one-hot encoding and then used to establish the feature space with MurmurHash3. After many experiments, this paper adopts the following parameters: η = 0.2, λ = 0.

B. FM Solution

For advertising CTR prediction, Factorization Machines (FMs) combine the advantages of Support Vector Machines with factorization models and can characterize the basic feature interactions of large amounts of advertising data; therefore, this paper selects the FM model, which is based on feature engineering and matrix design, as a baseline model [45]. This paper adopts LIBFM to implement factorization machines [46, 47], and the stochastic gradient (SG) method is adopted to optimize the parameters of FM. The numerical features are used directly and the categorical features are first represented with one-hot encoding. After many experiments, this paper adopts the following parameters: λ = 40, k = 30.

C. GBDT+FM Solution

Gradient Boosted Decision Tree (GBDT) is a nonlinear model that is based on the ensemble learning concept of boosting. GBDT has strong feature extraction and nonlinear fitting abilities, can flexibly handle the various types of data in the advertising dataset, including continuous and discrete values, and is relatively resistant to over-fitting. Therefore, this paper selects the GBDT+FM solution as a baseline model. Based on the work of Y. Juan et al., this paper implements the GBDT+FM solution for performance comparison and analysis. First, in Preprocessing-A, all numerical features are used directly and all categorical features are represented with one-hot encoding. Then, these features are fed into the decision trees in GBDT to generate GBDT features; this paper selects Xgboost to implement the GBDT [48]. In Preprocessing-A, the number of decision trees in GBDT is 30 and the depth is 7, so thirty features are generated for each impression. The following process is Preprocessing-B, which generates the input features for FM. In Preprocessing-B, numerical features (I1-I13) with values greater than 2 are transformed by v ← ⌊log(v)²⌋; categorical features (C1-C26) that appear fewer than 10 times are transformed into a special value; and the GBDT features are included directly. Therefore, each record contains 69 features: 13 numerical features, 26 categorical features and 30 GBDT features. These three groups of features are then hashed into 1 M dimensions using the hashing trick. Finally, the processed features are fed into the FM model for CTR prediction. The parameters of the FM model that achieve the relatively optimal results are λ = 40 and k = 100. A minimal sketch of this preprocessing is given below.
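
The sketch below illustrates Preprocessing-A and the numerical transform of Preprocessing-B; it uses scikit-learn's GradientBoostingClassifier as a stand-in for Xgboost to obtain the 30 leaf-index features, and the random design matrix is purely hypothetical.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical dense design matrix after Preprocessing-A (numerical + one-hot features).
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, 500)

# Preprocessing-A: 30 trees of depth 7 generate one leaf-index feature per tree.
gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=7).fit(X, y)
leaf_features = gbdt.apply(X).reshape(X.shape[0], -1)   # shape (500, 30)

def transform_numeric(v):
    # Preprocessing-B: v <- floor(log(v)^2) for numerical values greater than 2.
    return np.floor(np.log(v) ** 2) if v > 2 else v

print(leaf_features.shape)        # (500, 30)
print(transform_numeric(100.0))   # 21.0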

Experimental results and analysis

According to S9 and S10 Figs, the proposed FDNN method has the best performance and LR has the worst performance. In the process of training the models, we find that the logistic regression model can be trained rapidly, but its effectiveness is only average. One reason is that when LR is used to predict the CTR, it requires experts with sophisticated experience to implement feature engineering by hand. The other reason is that the simple structure of LR restrains its representation ability. Because of these two disadvantages, LR cannot learn complex feature interactions and is sensitive to outliers and missing values. Therefore, LR shows relatively poor performance in the experiment.

According to S9 and S10 Figs, FM is superior in performance to LR. The reason is that FM extends the idea of LR: it adds second-order fitting on the basis of first-order fitting and can automatically learn the cross-features of any two features from different dimensions, with the cross-features expressed as embedding vectors. Compared with LR, the FM model requires less manual intervention and offers higher efficiency; therefore, FM shows better representation ability than LR in the experiment. However, FM still cannot learn much more complex feature interactions because of its relatively simple structure.

In this experiment, the GBDT+FM solution outperforms the FM model and obtains good prediction results, which are only slightly inferior to those of the proposed method. GBDT can learn useful high-order feature interactions because it has strong nonlinear fitting ability and robustness to hyperparameters, so it can efficiently extract discriminative features and cross-features from the advertising dataset and shows better performance than the FM model in our CTR prediction experiment.

Compared with the GBDT+FM solution, the proposed method achieves more than a 0.09% improvement in terms of AUC (0.01% in terms of log-loss) on the dataset. The reasons are as follows. The first is that the quality of the data exerts a great influence on the effectiveness of a model. Because the dataset contains much noisy data, there are bottlenecks in the effectiveness of the GBDT+FM solution. Compared with this model, the proposed FDNN model, being endowed with fuzzy techniques, can capture more complex high-order feature interactions in the advertising dataset, which contains many outliers and missing values. The second is that the GBDT+FM solution, being a kind of stacked combined model without backpropagation, is not always better than the proposed model with backpropagation for improving CTR prediction from a theoretical perspective. Therefore, the proposed method demonstrates relatively good performance in terms of both data representation capability and prediction accuracy. In a real production environment for an advertising system, even a small improvement in advertising CTR is likely to result in a significant increase in financial income for the internet company and an improved user experience.

Discussion

This paper proposes a fuzzy deep neural network (FDNN) to address the problem of advertising CTR prediction. The network’s performance is compared with those of several baseline models in real advertisement datasets. Due to the fuzziness that is introduced into the proposed model, it can learn the most useful high-order feature interactions and demonstrates good performance in terms of both data representation capability and robustness in advertisement datasets with noise.

Big data bring new perspectives to these fields and challenges in processing and discovering valuable information. The traditional artificial intelligence techniques can no longer effectively extract and organize the discriminative information from raw advertisement data directly [49]. This paper provides a basis for the further application of fuzzy deep neural networks in advertising CTR prediction. Future work will focus on exploring additional unsupervised feature learning and deep architecture models to improve the CTR prediction for internet advertising.

In addition, in the big data era, a real production environment for advertising CTR prediction requires that the advertising system return prediction results in milliseconds or even microseconds. Determining how to implement low-latency, high-efficiency ad-hoc analysis to predict and return results over a large dataset is a great challenge for big data processing systems. Big data systems with stream computing, such as Spark Streaming, Storm, and Flink, can implement analysis and query in real time. However, because of their limited memory capacity and the loss of raw historical data in the system, ad-hoc analysis and query over a large dataset cannot be implemented. Therefore, exploring CTR prediction solutions based on streaming big data processing technology is another future direction of this work.

Supporting information

S1 Fig. Triangular fuzzy number.

(ZIP)

S2 Fig. Structure of fuzzy restricted boltzmann machine.

(ZIP)

S3 Fig. Structure of fuzzy gaussian-bernoulli restricted boltzmann machine.

(ZIP)

S4 Fig. Modeling click-through rate prediction by means of LR.

(ZIP)

S5 Fig. Architecture of the FDNN model and its process of predicting CTR.

(ZIP)

S6 Fig. The process of constructing FDNN.

(ZIP)

S7 Fig. AUC values of FDNN and DBNLR with different numbers of layers and nodes.

(ZIP)

S8 Fig. Log-loss values of FDNN and DBNLR with different numbers of layers and nodes.

(ZIP)

S9 Fig. Comparison of AUC among all models on the test dataset.

(ZIP)

S10 Fig. Comparison of log-loss among all models on the test dataset.

(ZIP)

S1 Table. Data format of dataset.

(ZIP)

S2 Table. Experimental environment.

(ZIP)

S3 Table. Confusion matrix.

(ZIP)

S4 Table. Configure information of FDNN and DBNLR with different numbers of layers and nodes.

(ZIP)

Acknowledgments

This work is supported by the Natural Science Research Project of Education Department of Guizhou Province of China (No.KY2017350), the Project Funds of the Digital Engineering Research Center for Local National Culture of Universities in Guizhou (No.KY2013227) and the Scientific Research Project of Qiannan Normal University for Nationalities (No.qnsy2017006). The authors are thankful for the research work in all referenced manuscripts.

Data Availability

Data are available: kaggle-display-advertising-challenge-dataset (https://doi.org/10.6084/m9.figshare.5732310.v1); iPinYou Global RTB Bidding Algorithm Competition Dataset season2 (https://doi.org/10.6084/m9.figshare.5732328.v1).

Funding Statement

This work is supported by the Natural Science Research Project of Education Department of Guizhou Province of China (No.KY2017350), the Project Funds of the Digital Engineering Research Center for Local National Culture of Universities in Guizhou (No.KY2013227) and the Scientific Research Project of Qiannan Normal University for Nationalities (No.qnsy2017006). The funders contributed to the analysis and the decision to publish.

References

  • 1. Najafi-Asadolahi Sami, Fridgeirsdottir Kristin. Cost-Per-Click Pricing for Display Advertising. Manufacturing & Service Operations Management. 2014; 16(4):482–497. doi: 10.1287/msom.2014.0491 [Google Scholar]
  • 2.Z Pan, E Chen, Q Liu, T Xu, H Ma, H Lin. Sparse Factorization Machines for Click-through Rate Prediction. IEEE 16th International Conference on Data Mining (ICDM 2016), Barcelona, Spain, 2016; pp.400–409.
  • 3. Chapelle Olivier, Manavoglu Eren, Rómer Rosales. Simple and scalable response prediction for display advertising [J]. ACM Transactions on Intelligent Systems and Technology, 2014,5(4):1–34. doi: 10.1145/2532128 [Google Scholar]
  • 4.H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, et al. Ad click prediction: a view from the trenches. Proceedings of the 19th ACM international conference on Knowledge discovery and data mining (ACM SIGKDD 2013), New York, USA, 2013; pp.1222–1230.
  • 5.Anh-Phuong TA. Factorization Machines with Follow-The-Regularized-Leader for CTR prediction in Display Advertising. 2015 IEEE International Conference on Big Data (IEEE BigData 2015). Santa Clara, CA, USA, 2015; pp. 2889–2891.
  • 6.Afroze Ibrahim Baqapuri, Ilya Trofimov. Using Neural Networks for Click Prediction of Sponsored Search. arXiv preprint arXiv:1412.6601, 2014.
  • 7.Kushal Dave, Vasudeva Varma. Predicting the Click-Through Rate for Rare/New Ads. Centre for Search and Information Extraction Lab, International Institute of Information Technology Hyderabad, India, April 2010; pp.1–18.
  • 8.Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, Chih-Jen Lin. Field-aware Factorization Machines for CTR Prediction. ACM Conference on Recommender Systems (ACM RecSys 2016), Boston, United States, 2016; pp. 43–50.
  • 9.Y Zhang, H Dai, C Xu, J Feng, T Wang, J Bian, et al. Sequential Click Prediction for Sponsored Search with Recurrent Neural Networks. 28th AAAI Conference on Artificial Intelligence (AAAI 2014), Québec City, Québec, Canada, 2014; pp.1369–1375.
  • 10. Yu Shimin. Prediction of ads’ click-through rate based on recurrent neural network. Hangzhou: Zhejiang Sci-Tech University, 2016; pp.12–15. [Google Scholar]
  • 11. Jiang Z, Gao S and Dai W. Research on CTR Prediction for Contextual Advertising Based on Deep Architecture Model. Control Engineering and Applied Informatics. 2016; 18(1):11–19. [Google Scholar]
  • 12.Junxuan Chen, Baigui Sun, Hao Li, Hongtao Lu, Xian-Sheng Hua. Deep CTR Prediction in Display Advertising. Proceedings of the 24th ACM on Multimedia Conference (ACM MM 2016), Amsterdam, Netherlands, 2016; pp.811–820.
  • 13. Wang H, Xu Z, Pedrycz W. An overview on the roles of fuzzy set techniques in big data processing: Trends, challenges and opportunities. Knowledge-Based Systems, 2017, 118:15–30. doi: 10.1016/j.knosys.2016.11.008 [Google Scholar]
  • 14. Xu Zeshui. A Method for Priorities of Triangular Fuzzy Number Complementary Judgement Matrices. Fuzzy Systems & Mathematics, 2002; 16(1):47–50. [Google Scholar]
  • 15. Asghari-Larimi M. Upper and lower (alpha, beta)-intuitionistic fuzzy set. International Mathematical Forum, 2012(9–12); pp.531–538. [Google Scholar]
  • 16. Hamrawi Hussam, Coupland Simon, John Robert. Type-2 Fuzzy Alpha-Cuts. IEEE Transactions on Fuzzy Systems, 2017; 25(3):682–692. doi: 10.1109/TFUZZ.2016.2574914 [Google Scholar]
  • 17. Dutta Soma and Chakraborty Mihir K. Fuzzy relation and fuzzy function over fuzzy sets: A retrospective. Soft Computing, 2015; 19(1):99–112. doi: 10.1007/s00500-014-1356-z [Google Scholar]
  • 18. Klimt Gustav, Buckley James J. Fuzzy Probabilities: New Approach and Applications. New York: Physica-Verlag, 2003; 20(3):363–372. [Google Scholar]
  • 19. Runkler T.A., Glesner Manfred. DECADE—fast centroid approximation defuzzification for real time fuzzy control applications. DBLP, 1994:161–165. [Google Scholar]
  • 20. Chen C.L. Philip, Zhang Chun-Yang, Chen Long, and Gan Min. Fuzzy Restricted Boltzmann Machine for the Enhancement of Deep Learning. IEEE Transactions on Fuzzy Systems, 2015; 23(6):2163–2173. doi: 10.1109/TFUZZ.2015.2406889 [Google Scholar]
  • 21. Gelfand Alan E. Gibbs Sampling. Journal of the American Statistical Association, 2000; 95(452):1300–1304. doi: 10.1080/01621459.2000.10474335 [Google Scholar]
  • 22. Hinton Geoffrey E., Osindero Simon, Teh Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 2006; 18(7):1527–1554. doi: 10.1162/neco.2006.18.7.1527 [DOI] [PubMed] [Google Scholar]
  • 23.Hyun-Chul Kim, Jong-Hwan Lee. Evaluation of weight sparsity regularization schemes of deep neural networks applied to functional neuroimaging data. The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, LA, USA, 2017; pp.6150–6154.
  • 24. Osman Taher, Divigalpitiya Prasanna, Arima Takafumi. Driving factors of urban sprawl in Giza governorate of the Greater Cairo Metropolitan Region using a logistic regression model. International Journal of Urban Sciences, 2016; 58:1–20. [Google Scholar]
  • 25. Shrestha Ravi, Mohammed Shahed Khan, Hasan Md. Mehedi, Wahid Khabou. Automated Adaptive Brightness in Wireless Capsule Endoscopy Using Image Segmentation and Sigmoid Function. IEEE Transactions on Biomedical Circuits & Systems. 2016; 10(4):884 doi: 10.1109/TBCAS.2016.2546838 [DOI] [PubMed] [Google Scholar]
  • 26. Yann LeCun, Bengio Yoshua, and Hinton Geoffrey. Deep learning, Nature, 2015; 521(7553): 436–444. doi: 10.1038/nature14539 [DOI] [PubMed] [Google Scholar]
  • 27. Horikawa SI, Furuhashi T, Uchikawa Y. On fuzzy modeling using fuzzy neural networks with the back-propagation algorithm. IEEE Transactions on Neural Networks, 1992; 3(5):801–806. doi: 10.1109/72.159069 [DOI] [PubMed] [Google Scholar]
  • 28. Mosca A, Magoulas GD. Advances in Computational Intelligence Systems. New York: Springer Publishing Company, 2017; pp.433–445. [Google Scholar]
  • 29. Dreyfus SE. Artificial neural networks, back propagation, and the Kelley-Bryson gradient procedure. Journal of Guidance Control & Dynamics, 2012; 13(5):926–928. [Google Scholar]
  • 30.Kaggle Display Advertising Challenge Dataset.[EB/OL]. http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/
  • 31. Triba MN, Le ML, Amathieu R, Goossens C, Bouchemal N. PLS/OPLS models in metabolomics: the impact of permutation of dataset rows on the K-fold cross-validation quality parameters. Molecular Biosystems, 2015; 11(1):13 doi: 10.1039/C4MB00414K [DOI] [PubMed] [Google Scholar]
  • 32. Chen C, Hou C, Xiao J, Yuan X. Purchase Behavior Prediction in E-Commerce with Factorization Machines. IEICE Transactions on Information and Systems, 2016, E99.D(1), pp.270–274. doi: 10.1587/transinf.2015EDL8116 [Google Scholar]
  • 33. Akimova Tatiana, Levine Matthew H., Beier Ulf H., Hancock Wayne W. Standardization, Evaluation, and Area-Under-Curve Analysis of Human and Murine Treg Suppressive Function. Suppression and Regulation of Immune Responses. 2016; vol.1371, pp.43–78. doi: 10.1007/978-1-4939-3139-2_4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Alberto Jiménez-Valverde. Insights into the area under the receiver operating characteristic curve (AUC) as a discrimination measure in species distribution modeling. Global Ecology and Biogeography. 2012; 21(4):498–507. doi: 10.1111/j.1466-8238.2011.00683.x [Google Scholar]
  • 35. Deng X, Liu Q, Deng Y, Mahadevan S. An improved method to construct basic probability assignment based on the confusion matrix for classification problem, Information Sciences, 2016; 340: 250–261. [Google Scholar]
  • 36. Ohsaki M, Wang P, Matsuda K, Katagiri S, Watanabe H, Ralescu A. Confusion-Matrix-Based Kernel Logistic Regression for Imbalanced Data Classification. IEEE Transactions on Knowledge & Data Engineering, 2017; 29(9):1806–1819. doi: 10.1109/TKDE.2017.2682249 [Google Scholar]
  • 37. Talabani A. Jamal, Endreseth B.H., Lydersen S., Edna T.-H. Clinical diagnostic accuracy of acute colonic diverticulitis in patients admitted with acute abdominal pain, a receiver operating characteristic curve analysis. International Journal of Colorectal Disease. 2017; 32(1):41–47. doi: 10.1007/s00384-016-2644-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.sklearn.metrics.log_loss.[EB/OL]. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html.
  • 39.Logarithmic Loss https://www.Kaggle.com/wiki/LogarithmicLoss.
  • 40.sklearn.feature_extraction. FeatureHasher.[EB/OL]. http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html.
  • 41.Theano.[EB/OL]. http://deeplearning.net/software/theano/.
  • 42. Gao W, Zhou Z. Dropout Rademacher complexity of deep neural networks. Science China Information Sciences, 2016; 59(7):072104 doi: 10.1007/s11432-015-5470-z [Google Scholar]
  • 43.Z Li, B Gong, T Yang. Improved Dropout for Shallow and Deep Learning. The 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 2016; pp. 1–8.
  • 44.liblinear.[EB/OL]. http://www.csie.ntu.edu.tw/cjlin/liblinear/.
  • 45.R.J. Oentaryo, E.P. Lim, J.W. Low, D. Lo, M. Finegold. Predicting response in mobile advertising with hierarchical importance-aware factorization machine. The Seventh ACM International Conference on Web Search & Data Mining (ACM WSDM 2014), New York, USA, 2014; pp. 123–132.
  • 46. Rendle Steffen. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology, 2012; 3(3): Article 57(1–22). doi: 10.1145/2168752.2168771 [Google Scholar]
  • 47.libfm.[EB/OL]. http://www.libfm.org/.
  • 48.Xgboost. Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more.[EB/OL]. https://github.com/dmlc/xgboost.
  • 49. Lei Y, Jia F, Lin J, Xing S, and Ding SX. An Intelligent Fault Diagnosis Method Using Unsupervised Feature Learning Towards Mechanical Big Data. IEEE Transactions on Industrial Electronics, 2016; 65(5):3137–3147. [Google Scholar]
