Author manuscript; available in PMC: 2022 Jan 1.
Published in final edited form as: IEEE Trans Image Process. 2020 Dec 11;30:1086–1099. doi: 10.1109/TIP.2020.3042082

Rethinking Shape From Shading for Spoofing Detection

J Matías Di Martino 1,2, Qiang Qiu 3, Guillermo Sapiro 3
PMCID: PMC7894987  NIHMSID: NIHMS1668132  PMID: 33290220

Abstract

Spoofing attacks are critical threats to modern face recognition systems, and the most common countermeasures exploit 2D texture features as they are easy to extract and deploy. 3D shape-based methods can substantially improve spoofing prevention, but extracting the 3D shape of the face often requires complex hardware, such as a 3D scanner, and expensive computation. Motivated by the classical shape-from-shading model, we propose to obtain 3D facial features that can be used to recognize the presence of an actual 3D face, without explicit shape reconstruction. Such shading-based 3D features are extracted highly efficiently from a pair of images captured under different illumination, e.g., two images captured with and without flash. Thus, the proposed method provides a rich 3D geometrical representation at negligible computational cost and with minimal to no additional hardware. A theoretical analysis is provided to support why such simple 3D features can effectively describe the presence of an actual 3D shape while avoiding complicated calibration steps or hardware setups. Experimental validation shows that the proposed method can produce state-of-the-art spoofing prevention and enhance existing texture-based solutions.

Index Terms—Liveness detection, spoofing attack, face recognition, active light, flash, 3D facial features

I. Introduction

Face recognition has become one of the most widespread and popular biometric techniques. In particular, 2D face recognition algorithms have been ubiquitously deployed in recent years, e.g., at airports, ATMs, and on personal devices. However, fake facial images can be deployed to spoof automatic face recognition systems.

Spoofing attacks have proved to be a critical threat to modern face recognition systems. Various hacking methods have been developed to achieve illegal access to systems guarded with face recognition technology [1]. One of the most popular and simple hacking techniques consists of printing a high-quality picture of the target subject or replaying it on some medium [2], [3]. A single image contains limited information to distinguish between a high-quality replica of a subject and the subject itself, as illustrated in Fig. 1. Systems that attempt to distinguish among 2D images such as those in Fig. 1 are relatively easy to overcome, as we demonstrate in the following sections. Fortunately, spoofing detection can be substantially improved by extracting 3D information of the scene, but the main caveat is that pursuing a 3D reconstruction of the scene is a challenging task that typically requires complex hardware and expensive computation.

Fig. 1. Illustration of some of the challenges in identifying photos of real (live) subjects from a single image. From left to right, the second and fourth pictures correspond to photos of the (live) subject. The first picture is a photo of a computer screen where the face of the test subject is displayed. The third and fifth pictures are taken by photographing a printed portrait of the test subject.

Shape-from-shading (SFS) is a classic computer vision theory that allows the extraction of the 3D shape of the scene from a few images of it. But the canonical SFS formulation is impractical for spoofing detection as it requires knowing the illumination conditions and involves solving non-trivial partial differential equations. By generalizing a Lambertian model and bypassing typical requirements of SFS, we arrive at extremely efficient shading-based 3D features with high discriminative power for the detection of spoofing attacks. We provide a theoretical analysis to show that such features capture actual depth information with invariant properties that make them calibration-free and suitable for real-world applications.

The main contributions of this article can be summarized as:

  • We rethink shape-from-shading ideas and propose a simple and effective feature representation design for spoofing detection.

  • We provide a theoretical analysis of the proposed features and show that they capture actual depth information while presenting invariance properties that make them suitable for practical applications (with no need for calibration or a sensitive hardware setup).

  • We show that texture-based spoofing detection strategies provide good results in specific circumstances, but they do not generalize well. Finally, we show that depth information can greatly improve generalization, e.g., making solutions less dependent on the resolution of the input images.

A. Related Work

A wide and ingenious variety of methods have been proposed to prevent spoofing attacks. For example, a class of methods makes use of visible and thermal infrared imagery [4], while other techniques recognize spontaneous eye blinks using a single low cost RGB camera [5]. Methods based on multispectral analysis are among the most reliable liveness detection solutions [4], [6], [7]. But, despite their accuracy, multispectral methods are expensive and hard to deploy as they require very particular optical hardware.

The analysis of video sequences has also been extensively exploited. These solutions include the analysis of short videos from which 3D information of the face can be extracted [5], [8]–[11]. These methods became particularly popular in the past decade since no specific hardware is required, and they can be readily implemented on smartphones and personal computers. In this work, we show that this type of geometrical information can be obtained in a significantly simpler fashion.

1). Single Image:

The methods described above require either the use of expensive hardware or complex computational analysis. Significantly simpler solutions have been proposed analyzing single RGB input frames [12]–[24]. For example, Li et al. [15] compared the Fourier spectra of authentic and impostor images and observed that the high-frequency components of a printed picture are less intense than those of a live face. A more elaborate solution is proposed by Tan et al. [22], who infer illumination properties over the face (using Variational Retinex theory) and measure the effective resolution of input images. Atoum et al. [13] use a pre-trained Convolutional Neural Network (CNN) to hallucinate from the input image the 3D shape of the face. Then, they combine 3D features with texture features extracted from RGB patches of the face. Li et al. [16], on the other hand, propose to extract local binary patterns from the convolutional feature maps of a pre-trained DNN.

2). 3D Features:

Extracting real 3D facial information requires more than a single (RGB) input image [25]; moreover, the problem of retrieving the shape of a scene from a monocular view is mathematically ill-posed (we discuss this in more detail in Sec. III-B). One of the simplest ways to extract actual 3D information is to capture at least two images. This can be done (a) by employing a single camera and changing the illumination conditions, or (b) by using two cameras (e.g., a stereo pair). In the present article, we propose features that can be extracted following the first approach.

Spoofing detection has been an active research topic in recent years, and several related works have been proposed. In the following, we describe the main differences between those and the present work. For example, Kose and Dugelay [26] propose countermeasures to spoofing attacks based on 2D and 3D facial information, but they require a 3D facial scan for the extraction of micro-texture features. Our approach, on the other hand, extracts local 3D information bypassing the 3D facial reconstruction (which is a critical practical limitation). Liu et al. [10], Atoum et al. [13], and Liu et al. [27] propose to extract 2D features and estimate 3D information from a single input image. As we will discuss in Section III-B, this has limitations in the context of spoofing attacks. Even though it can be helpful to improve the extraction of 2D features, this does not provide actual 3D information, as we argue in the following sections. Wen et al. [28], Patel et al. [29], and Garcia and De Queiroz [30] focus on the extraction of texture-based features such as specular reflection, blurriness, chromatic moment, moiré patterns, and color diversity. Their work can complement ours (and vice-versa), as we focus on texture-agnostic 3D feature extraction (which, as we show in Section III, further improves when combined with texture information). Wang et al. [11] infer 3D information from the motion of the facial landmarks. There are two fundamental differences between their work and ours. Firstly, they require video input while we extract features from a pair of images under different illumination. Secondly, their approach fails under replay attacks, as the perceived movement of the landmarks matches that of a real face. Our approach, on the other hand, captures actual 3D information of the scene (independently of the perceived facial motion).

Previous works have proposed ideas in a similar direction to ours. An example is the work of Zhang et al. [24], who extract features by estimating the coefficients that describe the lighting environment. To that end, they model the 3D face using a morphable model (3DMM) and the illumination coefficients using a spherical harmonic illumination model (SHIM). There are many differences between this method and the one proposed in the present work, the most notable one being that we extract local shape information rather than estimating variations in the global lighting conditions. Another related approach is that of Chan et al. [31], who propose the use of two images: one taken under ambient illumination and another one under an additional flash light. This work is, to the best of our knowledge, the closest to ours. Although we exploit similar input data (i.e., a pair of flash and non-flash images), the features we propose are substantially different from theirs. The most significant difference is that we focus on the extraction of albedo-independent 3D geometrical information, while Chan et al. use LBP descriptors, which are known to be heavily influenced by the texture properties of images. This difference has important theoretical and practical consequences, as we discuss in Section II and empirically validate in Section III.

II. Shading-Based 3D Features

Motivated by the classical shape from shading model [32], [33], we show how pictures captured under different illumination can be used to efficiently extract highly discriminative 3D features for spoofing detection. A flash (or an arbitrary light source) is included at testing time to modify the illumination conditions between two consecutive pictures. As we will show, the proposed features are texture invariant and, therefore, not affected by the lighting conditions in which the spoofing samples were generated. In contrast with standard shape from shading approaches, we extract local 3D information and bypass the most computationally expensive step of shape from shading: the integration of an empirical gradient field [34]–[36].

A. Shape From Shading Model

Surfaces reflect light differently depending on their microscopic properties. The two most popular models for describing surface reflection are the Lambertian and the specular model. Lambertian surfaces reflect light by scattering it in all directions. In contrast, specular surfaces reflect light in a specific direction (symmetric to the incident light). Most natural surfaces can be modeled as a combination of these two ideal models, where one can refer to the Lambertian or specular component of a given surface.

The Lambertian model provides a proper approximation of human face reflection properties [37]–[40]. Therefore, the local orientation of the skin with respect to the incident light plays a crucial role in how bright different portions of the face are perceived. More precisely,

I(x, y, z) = I_0 \, a(x, y, z) \, \max\big(\hat{n}(x, y, z) \cdot \hat{l}, \, 0\big), \quad (1)

where I denotes the intensity of the reflected light, I_0 the intensity of the incident light, a represents the surface albedo, \hat{n} is a unit vector normal to the surface, and \hat{l} a unit vector indicating the direction in which the light rays approach the surface at (x, y, z(x, y)). Equation (1) assumes homogeneous illumination (i.e., constant I_0 and \hat{l}). For compactness, we now include the intensity factor I_0 into the vector \hat{l} (i.e., l \overset{\text{def}}{=} I_0 \hat{l}).

Most SFS methods assume uniform albedo a and that the light source l is known (alternative algorithms have been proposed, which can estimate both the albedo and the direction of the illumination [41], [42]). Assuming the light source l = (l_x, l_y, l_z) is known, and expressing the surface normals as

\hat{n}(x, y) = \frac{1}{\sqrt{z_x^2 + z_y^2 + 1}} \, (-z_x, -z_y, 1)^t, \quad (2)

the 3D shape of the scene can be retrieved solving

I(\mathbf{x}) \sqrt{1 + |\nabla z(\mathbf{x})|^2} + (l_x, l_y) \cdot \nabla z(\mathbf{x}) - l_z = 0. \quad (3)

Although Eq. (3) is the most common formulation of the SFS problem [33], [43], other related formulations have also been proposed [44]. Solving the SFS problem is hard; it is mathematically ill-posed and requires the numerical solution of partial differential equations.

Instead of solving Eq. (3) and retrieving the absolute 3D shape of the scene z(x, y), we propose an alternative analysis starting by generalizing Eq. (1). Our method does not require the typical hypotheses of SFS (uniform albedo and known illumination). We are able to eliminate these hypotheses as we do not attempt to obtain a 3D map of the face, but instead, we show that 3D features can be extracted in a significantly simpler manner which can be further exploited by DNN-based learning to discriminate live subjects from spoofing attacks.
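To make the model concrete, the following minimal Python sketch implements the forward direction of Eqs. (1)-(2): given a depth map, an albedo map, and a light direction, it computes the surface normals and the Lambertian intensity. The forward rendering is straightforward; it is the inverse problem (recovering z from I) that is ill-posed. All names and the example surface are illustrative and not taken from the paper.

```python
import numpy as np

def lambertian_render(z, albedo, light, I0=1.0):
    """Forward rendering of Eq. (1): I = I0 * a * max(n_hat . l_hat, 0).

    z      : (H, W) depth map z(x, y)
    albedo : (H, W) surface albedo a(x, y)
    light  : 3-vector with the (unnormalized) light direction
    """
    zy, zx = np.gradient(z)                        # partial derivatives z_y, z_x
    n = np.stack([-zx, -zy, np.ones_like(z)], -1)  # un-normalized normals, Eq. (2)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    l_hat = np.asarray(light, dtype=float)
    l_hat /= np.linalg.norm(l_hat)
    shading = np.maximum(n @ l_hat, 0.0)           # max(n_hat . l_hat, 0)
    return I0 * albedo * shading

# Example: render a hemisphere-like bump lit from the upper-left.
y, x = np.mgrid[-1:1:128j, -1:1:128j]
z = np.sqrt(np.clip(1 - x**2 - y**2, 0, None))
img = lambertian_render(z, albedo=np.ones_like(z), light=[-0.3, -0.3, 1.0])
```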

B. Extraction of 3D Features

The generalization of Eq. (1) to an arbitrary number of illuminants l_1, ..., l_m is

I(x, y, z) = a(x, y, z) \sum_{i=1}^{m} \max\big(\hat{n}(x, y, z) \cdot l_i, \, 0\big). \quad (4)

Figure 2 illustrates a pair of images from a live subject (left pair) and images captured of a spoofing attack (right pair). In each case, one of the pictures is taken under ambient illumination and the other one adding a flash light. Let us denote the image taken under ambient illumination as I_a, and the image taken with an additional source of light as I_f.

Fig. 2. From left to right: picture of a live subject under ambient illumination (denoted as I_a), picture of the same subject taken with an additional source of light (denoted as I_f), picture of a spoofing attack under ambient illumination, and picture of the same attack with the additional source of light.

The camera parameters such as the shutter speed, ISO, and aperture are assumed to be fixed between the shots. Using the model described above, pictures Ia and If can be expressed as

I_a(u, v) = a(u, v) \sum_{i=1}^{m} \max\big(\hat{n}(u, v) \cdot l_i, \, 0\big), \quad (5)
I_f(u, v) = a(u, v) \sum_{i=1}^{m} \max\big(\hat{n}(u, v) \cdot l_i, \, 0\big) + a(u, v) \, \hat{n}(u, v) \cdot l_f. \quad (6)

Equations (5) and (6) model the ambient illumination as an arbitrary (unknown) discrete combination of m light sources. The vector lf in Eq. (6) represents the intensity and direction (both unknown) of the additional light source. One may notice that we replace the three-dimensional world coordinates (x, y, z) by two dimensional pixel coordinates (u, v). I (u, v) represents the light collected by the pixel (u, v) which corresponds to the optical image of the scattered light I (x, y, z).

We now define the quantity

I_B(u, v) \overset{\text{def}}{=} \frac{I_f(u, v) - I_a(u, v)}{I_a(u, v)}, \quad (7)

equivalent to

I_B(u, v) = \frac{\hat{n}(u, v) \cdot l_f}{\sum_{i=1}^{m} \max\big(\hat{n}(u, v) \cdot l_i, \, 0\big)}. \quad (8)

When a flash is used and images Ia/If are collected almost simultaneously, the raw input images can be used to compute the quantity IB defined above. In order to provide a more general framework, we register the input images using facial landmarks, thus allowing us to compute the proposed features even when images are taken seconds apart.

For brevity, we denote the proposed 3D feature in Eq. (8) as the I_B features. I_B features have interesting properties for the particular task at hand, as we will discuss in the following, and provide the base representation for the proposed anti-spoofing solution. Using I_B features as the initial representation, we will show that efficient and effective spoofing detection solutions can be developed.
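As a reference, a minimal sketch of the computation of Eq. (7) is given below, assuming the flash and no-flash images are already registered (e.g., with facial landmarks) and converted to grayscale; the small constant added to the denominator is a numerical guard of ours, not part of the paper.

```python
import numpy as np

def ib_features(I_a, I_f, eps=1e-6):
    """Eq. (7): I_B = (I_f - I_a) / I_a for a registered flash / no-flash pair.

    I_a, I_f : (H, W) grayscale images under ambient and ambient+flash light,
               already aligned (e.g., registered using facial landmarks).
    """
    I_a = I_a.astype(np.float64)
    I_f = I_f.astype(np.float64)
    return (I_f - I_a) / (I_a + eps)   # eps guards against division by zero
```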

C. Analyzing 3D Features From Shading

We provide a theoretical explanation of why such a simple shading-based feature can effectively describe 3D shapes. We address here the following questions: (i) Is the relation between the geometry of the scene and the proposed features bijective? (ii) How sensitive are the proposed features to the physical setup of the additional light?

1). Bijectivity:

Assuming that the main sources of light that play a significant role are located in front of the scene, Eq. (8) can be simplified as

I_B(u, v) = \frac{\hat{n}(u, v) \cdot l_f}{\sum_{i=1}^{m} \hat{n}(u, v) \cdot l_i} = \frac{\hat{n}(u, v) \cdot l_f}{\hat{n}(u, v) \cdot l_a}, \qquad l_a \overset{\text{def}}{=} \sum_{i=1}^{m} l_i. \quad (9)

l_a encapsulates the contribution of all illuminants, and represents the overall intensity and the principal orientation of the ambient light.

Now consider a regular 3D surface S defined by (x, y, z(x, y)) \in \mathbb{R}^3, z \in C^2(\Omega), for an open connected domain \Omega \subset \mathbb{R}^2, and the local (unit) normal vector \hat{n}(x, y) as defined in Eq. (2).

Proposition 1: Let two surfaces

S_1 = \{(x, y, z_1(x, y)), \ (x, y) \in \Omega\} \quad (10)
S_2 = \{(x, y, z_2(x, y)), \ (x, y) \in \Omega\} \quad (11)

have the same set of normal vectors, \hat{n}_1(x, y) = \hat{n}_2(x, y) \ \forall (x, y) \in \Omega. Then, S_1 and S_2 are parallel to each other (i.e., z_1(x, y) = z_2(x, y) + k, \forall (x, y) \in \Omega).

Proof: We define the height difference between S_1 and S_2 as z_3(x, y) = z_1(x, y) - z_2(x, y). As the normal vectors of both surfaces are equal,

\frac{z_{1x}}{\sqrt{z_{1x}^2 + z_{1y}^2 + 1}} = \frac{z_{2x}}{\sqrt{z_{2x}^2 + z_{2y}^2 + 1}} \quad (12)
\frac{z_{1y}}{\sqrt{z_{1x}^2 + z_{1y}^2 + 1}} = \frac{z_{2y}}{\sqrt{z_{2x}^2 + z_{2y}^2 + 1}} \quad (13)
\frac{1}{\sqrt{z_{1x}^2 + z_{1y}^2 + 1}} = \frac{1}{\sqrt{z_{2x}^2 + z_{2y}^2 + 1}}. \quad (14)

Substituting (14) into (12) and (13) we get z_{1x} = z_{2x} and z_{1y} = z_{2y}.

The partial derivatives of z_3 are z_{3i} = (z_1 - z_2)_i = z_{1i} - z_{2i} = 0 for i \in \{x, y\}. Therefore, the difference between the two surfaces is a constant plane parallel to the x and y axes (i.e., z_1(x, y) = z_2(x, y) + k for all (x, y) ∈ Ω). □

Proposition 2: If l_f and l_a are not collinear, and the difference between the normals of two surfaces S_i and S_{ii} is a vector field parallel to l_f × l_a, then these two surfaces present identical I_B features, i.e., I_B^i(x, y) = I_B^{ii}(x, y), ∀ (x, y) ∈ Ω.

Proof: Let us define the vector field u(x, y) as the difference between Si and Sii normals:

\hat{n}_{ii}(x, y) = \hat{n}_i(x, y) + u(x, y), \qquad \forall (x, y) \in \Omega. \quad (15)

Si and Sii have an identical feature representation if and only if,

\frac{\hat{n}_i(x, y) \cdot l_f}{\hat{n}_i(x, y) \cdot l_a} = \frac{\big(\hat{n}_i(x, y) + u(x, y)\big) \cdot l_f}{\big(\hat{n}_i(x, y) + u(x, y)\big) \cdot l_a}, \qquad \forall (x, y) \in \Omega. \quad (16)

Equation (16) can be transformed into

\big(\hat{n}_i(x, y) \cdot l_f\big)\big(u(x, y) \cdot l_a\big) = \big(\hat{n}_i(x, y) \cdot l_a\big)\big(u(x, y) \cdot l_f\big). \quad (17)

Finally, as u(x, y) ∝ lf × la, Eq. (17) trivially holds, which completes the proof. □

Let l_{f_1} and l_{f_2} represent two light sources and l_a the overall ambient illumination. We define I_{B_k} as the features computed from the pair of images \{I_a, I_{f_k}\}. For example, I_{B_1} is produced from the pair of images captured under l_a and l_a + l_{f_1} illumination. Two representations I_{B_1} and I_{B_2} are defined to be independent if the vectors l_{f_1}, l_{f_2}, and l_a associated with these features are non-coplanar.

Proposition 3: If two surfaces S_i and S_{ii} share two independent feature representations, i.e., \{I_{B_1}^i, I_{B_2}^i\} = \{I_{B_1}^{ii}, I_{B_2}^{ii}\}, then both surfaces are parallel to each other.

Proof: The ideas used to prove Proposition 2 can be reproduced considering each pair of illuminants l_{f_k}, l_a for k = \{1, 2\}. The difference between S_i and S_{ii} normals, u(x, y) \overset{\text{def}}{=} \hat{n}_{ii}(x, y) - \hat{n}_i(x, y), must verify

u(x, y) \parallel l_{f_1} \times l_a \qquad \forall (x, y) \in \Omega, \quad (18)
u(x, y) \parallel l_{f_2} \times l_a \qquad \forall (x, y) \in \Omega. \quad (19)

Given that l_{f_1}, l_{f_2}, and l_a are non-coplanar, the unique solution to the previous equations is the trivial solution u(x, y) = 0 ∀ (x, y) ∈ Ω. Finally, considering Prop. 1, if both surfaces have identical normal fields they verify z_i(x, y) = z_{ii}(x, y) + k for all (x, y) ∈ Ω. □

Proposition 2 shows that even though I_B features do not uniquely determine the shape of the scene, they significantly constrain the set of possible shapes (which, as we shall see, provides enough information to distinguish spoofing samples from live faces). Then, Proposition 3 demonstrates that if two or more flashes are considered (located such that l_{f_1}, l_{f_2}, and l_a are non-coplanar), two sets of I_B features can be used to uniquely identify a surface. Again, it is important to highlight that we do not claim that it is easy to perform a 3D reconstruction from the proposed features, but instead, that they are an effective representation to discriminate between live faces and most common spoofing attacks.

2). Sensitivity to Light Position:

Now we will show that our approach is effective in practice, by being insensitive to the relative position of the additional illuminant with respect to the face.

Equation (9) can be interpreted as the ratio between the cosines of two angles. Defining n_p as the projection of the vector \hat{n} onto the plane defined by l_a and l_f, we denote by α, β, and ψ the angles between the pairs of vectors (n_p, \hat{l}_a), (n_p, \hat{l}_f), and (\hat{l}_f, \hat{l}_a), respectively (Fig. 3).

I_B = \frac{\hat{n}(x, y) \cdot l_f}{\hat{n}(x, y) \cdot l_a} = \frac{\cos(\beta(x, y))}{\cos(\alpha(x, y))} = \frac{\cos(\psi - \alpha(x, y))}{\cos(\alpha(x, y))}. \quad (20)

Equation (20) can be expressed as,

I_B = \cos(\psi) + \sin(\psi) \tan(\alpha(x, y)), \quad (21)

where we explicitly include spatial coordinates (x, y) to highlight that α depends on the local shape of the surface while ψ is a constant angle (set by the relative position of illuminants).

Fig. 3. A model of a Lambertian surface under ambient illumination plus an additional source of light. On the right side, a zoomed patch of the surface is displayed. \hat{n} represents the unit vector normal to the surface at (x, y, z(x, y)); n_p is the projection of \hat{n} onto the plane defined by vectors l_f and l_a.

Expressing I_B features in the form of Eq. (20) is interesting because it shows that the larger the angle between the additional light and the overall ambient illumination, the more sensitive the proposed features are to the geometry of the scene. A degenerate case takes place when the additional light and the overall ambient illumination are parallel to each other (ψ = 0); in this condition, no 3D information can be extracted. In other words, in order to have reliable 3D features, we must design the physical setup such that the additional source of light is shifted with respect to the overall ambient illumination.

A second important property is that the relative angle between illuminants only plays the role of a global amplification factor. Thus, a simple normalization step makes the features invariant to the angle ψ (this property is illustrated in the experiment presented in Fig. 10). This has a huge practical advantage: as we will show in the next section, the proposed representation can be used without any prior calibration step or estimation of the lighting position.
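The following sketch illustrates one plausible normalization consistent with Eq. (21): since ψ only contributes a constant offset (cos ψ) and a global gain (sin ψ), standardizing the I_B map over the facial region leaves a quantity that depends only on the local geometry through tan α. The mean/standard-deviation normalization is an assumption of ours; the paper does not specify the exact normalization used.

```python
import numpy as np

def normalize_ib(ib, mask=None):
    """Remove the global offset/gain that psi introduces in Eq. (21).

    Since I_B = cos(psi) + sin(psi) * tan(alpha), a zero-mean / unit-variance
    normalization over the facial region leaves a map that depends only on
    tan(alpha), i.e., on the local geometry.
    """
    vals = ib[mask] if mask is not None else ib.ravel()
    return (ib - vals.mean()) / (vals.std() + 1e-6)
```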

Fig. 10. The first row illustrates a set of facial images captured using different flash lights located at different positions. The ambient illumination is modeled as frontal. The angle between l_f and l_a is provided on top of each picture. In this experiment, the ambient and the flash light are of the same magnitude. The second row shows raw I_B features computed from each I_f image and a common I_a (taken exclusively under ambient illumination). The third row shows the same set of features after normalization. Additionally, horizontal profiles of the patches displayed in the second row are plotted (top right side), and aligned features are observed across light positions (different color lines).

III. Experiments and Discussion

In this section, we first study texture-based features to understand what the intrinsic limitations of single-image methods are, and in what sense 3D features are beneficial. Then, we experimentally evaluate the set of proposed features and describe their main properties. Finally, we show how combining 3D and texture information improves accuracy, generalization, and robustness.

A. Databases

Four datasets are used for experimental validation and analysis. Three of them are publicly available: Casia [3], Oulu [2], and Texas [45]. In addition, we collected a dataset referred to as Ambient-Flash. The Casia and Oulu databases are mainly used to develop and examine texture-based features. The Texas database provides ground truth 3D facial scans, which are used to simulate facial images under different illumination conditions. As we shall see, this is particularly helpful to initialize DNN-based classifiers. Finally, the Ambient-Flash dataset provides realistic and challenging real-world examples to evaluate the performance and generalization capabilities of the proposed method. Example samples of each dataset are illustrated in Fig. 4. A detailed description of these datasets is provided in Appendix A.

Fig. 4. Illustrative samples of the databases used for experimental validation.

B. Texture-Based Methods

As the main objective of this work is to provide simple and reliable 3D features, it is of crucial importance to discuss why 3D features are critical in spoofing detection.

1). 3D From a Single Input Image:

Let us begin by discussing whether it is possible to extract 3D features from a single (RGB) input image.

Machine learning models have been successfully developed to perform the task of 3D hallucination. Abundant literature addresses this problem, e.g., [32], [33], [46]–[54]. However, measuring actual 3D information and 3D hallucination are two very different tasks. In particular, retrieving the shape of a scene from a single image is an ill-posed problem, and therefore, hallucination methods enforce strong priors about the scene. Fig. 5 shows examples of 3D hallucination applying a 3DMM to spoofing (planar) and live examples.

Fig. 5. The first three samples (from left to right) correspond to samples from the Oulu database [2]; the last two images were collected by us. The first row shows the input RGB frames (live subjects and spoof attacks). The second row displays the 3DMM hallucinated from each input image. We used the code and 3DMM model provided by Huber et al. [49].

In other words, given a single RGB image, there are infinitely many potential 3D surfaces that can produce it, as illustrated in Fig. 6. Therefore, we cannot expect hallucination algorithms to provide actual 3D features in the context of spoofing detection. Additional strategies to retrieve true 3D information need to be explored, as proposed in this article. Although a single image does not contain actual 3D information of the scene, it does provide useful visual and texture cues for the detection of spoofing attacks, as we will discuss in the following.

Fig. 6. 3D surfaces (a), (b) and (c) produce the exact same RGB image (illustrated on the bottom of the figure). This simulation is implemented assuming a simple pinhole camera model [25].

2). Spoofing Detection From a Single Image:

Single frame methods are popular for spoofing detection. We evaluate three families of single image methods: (a) classification based on handcrafted texture and color features [55], [56], (b) patch-based DNNs [13], and (c) face DNNs [57]–[59]. For the classification of the handcrafted texture and color features, SVM and a shallow fully connected neural network are considered. A detailed description of these implementations is provided in appendix B.

The equal error rate (EER) and the half total error rate (HTER) are standard measures of biometric solutions [60] and thus are selected for the experimental evaluation. Table I compares standard texture-based anti-spoofing strategies. The classification results obtained are consistent with the values reported in the literature.
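For reference, the sketch below computes FAR/FRR, EER, and HTER from classifier scores using the standard definitions; the convention that higher scores indicate attacks and the grid of thresholds are implementation choices of ours, not details from the paper.

```python
import numpy as np

def far_frr(scores, labels, threshold):
    """FAR/FRR at a given threshold (scores: higher = more attack-like,
    labels: 1 = spoofing attack, 0 = live)."""
    attacks, lives = scores[labels == 1], scores[labels == 0]
    far = np.mean(attacks < threshold)   # attacks accepted as live
    frr = np.mean(lives >= threshold)    # live samples rejected as attacks
    return far, frr

def eer(scores, labels):
    """EER: error rate at the threshold where FAR and FRR meet (grid search)."""
    rates = [far_frr(scores, labels, t) for t in np.sort(np.unique(scores))]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return 0.5 * (far + frr)

def hter(scores, labels, threshold):
    """HTER: average of FAR and FRR at a threshold fixed beforehand
    (e.g., the EER threshold computed on a development set)."""
    far, frr = far_frr(scores, labels, threshold)
    return 0.5 * (far + frr)
```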

TABLE I.

Classification Results of Different Texture-Based Anti-Spoofing Solutions. Equal Error Rate (EER) and Half Total Error Rate (HTER) Are Reported (for Both Measures the Lower the better). For Each Dataset, Train and Test Partitions Are Used for Training and Testing Respectively, Test Results Are the Ones Reported.

Method                               EER (%)   HTER (%)

Casia database
lbp-gray SVM [55]                    14.2      12.5
lbp-gray FC                          14.5      14.1
lbp-YCbCr SVM [56]                   3.3       3.1
lbp-YCbCr FC                         3.5       4.8
adversarial adaptation* [61]         3.2       -
patch-DNN RGB                        3.9       3.8
patch-DNN lbp-YCbCr                  6.1       6.1
patch-based YCbCr* [13]              4.3       4.0
patch-based YCbCr+HSV* [13]          4.4       3.7
patch-based YCbCr+HSV+LBP* [13]      7.7       6.1
face-DNN YCbCr                       6.6       6.6

Oulu database
lbp-gray SVM                         16.4      13.3
lbp-gray FC                          14.4      13.1
lbp-YCbCr SVM                        3.9       3.7
lbp-YCbCr FC                         5.3       4.5
texture and temporal features* [62]  -         2.6
patch-DNN RGB                        5.6       5.6
patch-DNN lbp-YCbCr                  5.8       5.6
face-DNN YCbCr                       6.6       6.5

* Values extracted from the cited reference.

The experiments show that both texture and color cues provide useful information for spoofing classification, which explains their popularity. However, despite these optimistic results, texture-based solutions are difficult to generalize, as we will discuss in the following experiments.

3). Performance vs Generalization:

Although color and texture cues are useful in spoofing detection, they tend to be highly specific to the attack deployed, the properties of the camera, and the display/printer (see, for example, Fig. 7). After all, different screens, printers, and sensors have specific color and resolution profiles. Moreover, the distance between the attack/face and the camera also plays an important role in how the texture information is perceived.

Fig. 7. Example frames from the OULU-NPU face presentation attack database [2]. The first row shows one frame of an input video of a live and an attack example. The second row shows registered patches of the face extracted from each respective input frame. The third row illustrates, for each image patch, the distribution of pixel colors in the Lab color space (a 2D section of constant lighting value is displayed). The fourth row shows the absolute value of the 2D Fourier transform of each facial patch (in logarithmic scale).

To illustrate this, we train and test the patch-DNN and face-DNN classifiers over different datasets. Table II reports the results, and we observe a dramatic performance degradation when the train and test sets come from different datasets. As we can see, generalizing single-image solutions is a challenging and important practical problem.

TABLE II.

Classification Results When Train and Test Sets Are Selected From Different Datasets. Equal Error Rate (EER) and Half Total Error Rate (HTER) Are Reported (for Both Measures the Lower the better). C+O+AF Denotes That the Union of Oulu, Casia and Ambient-Flash Are Used for Training

Test set \ Train set   Casia         Oulu          C+O+AF

patch-DNN RGB (EER% / HTER%)
Casia                  3.9 / 3.8     35.1 / 35.1   3.7 / 3.7
Oulu                   20.8 / 18.5   5.6 / 5.6     8.7 / 8.4
Amb.-Flash             40.1 / 40.1   34.8 / 34.8   -

face-DNN YCbCr (EER% / HTER%)
Casia                  6.9 / 6.9     14.0 / 13.9   8.6 / 8.1
Oulu                   16.3 / 16.3   6.6 / 6.5     9.5 / 9.2
Amb.-Flash             17.6 / 17.6   0.4 / 0.3     -

A second important flaw of texture-based approaches is that the texture of the attack can easily be manipulated to spoof a particular classifier (for example, by dynamically adapting the displayed image using a tablet or phone). Inspired by the ideas proposed by Kurakin et al. [63], we design an adversarial attack and show that it can be effectively deployed in the physical world. Implementation details are provided in Appendix C. Figure 8 illustrates the output of the face-DNN network when the original spoof image and the adversarial attack are displayed. The face ROI is displayed red when the facial image is classified as a spoofing attack, and green when the image is classified as a live subject. As we can see, even when a method is trained with similar data, classifiers can be hacked by modifying the texture of the displayed attack.

Fig. 8. Deployment of the adversarial attack in the physical world. A demo version of the system is running the face-DNN classifier. Images are captured using a webcam and, for each input frame, the dlib face detector is applied to select the facial ROI. Then, if a face is detected, the input image is cropped to the facial ROI and classified. The face ROI is displayed red if the sample is classified as an attack, and green if the sample is classified as a live subject. While the camera is facing the screen, we switched between the original spoofing attack and the generated adversarial attack.

C. Shading-Based 3D Features

In this section, we evaluate in detail the proposed I_B features. Three methods are tested for classification: support vector machines (SVM), a simple fully connected network (FC), and a deep neural network (DNN). The simple fully connected network considered for classification is identical to the one applied in the previous section (and described in Appendix D).

The architecture of the adopted DNN is detailed in Table VIII. The network has approximately one million trainable parameters and is trained in two steps. First, we train from scratch using synthetic data generated from Texas samples (Appendix E). Then, we fine-tune the network using a subset of the subjects available in the Ambient-Flash dataset (leaving the remaining subjects for testing); a sketch of this two-stage training is given after Table VIII. Using the Texas dataset, 70 thousand synthetic training I_B images are generated from 119 different identities. We observed that four epochs are sufficient to achieve convergence. The batch size is set to 16 samples, and the total training time was approximately 60 minutes on an NVIDIA Titan XP GPU with 12GB of memory.

TABLE VIII.

Architecture of the DNN Implemented for the Classification of IB Features. Conv Stands for Convolutional, FC for Fully Connected, and BN for Batch Normalization. Training Is Performed Using Adam Optimizer and Binary Cross-Entropy as the Minimization Loss

Input: 186×224×1 image.
Layer 1.1: Conv - 16 kernels 3×3 , BN, relu.
Layer 1.2: Conv - 16 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 2: Conv - 32 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 3: Conv - 64 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 4: Conv - 128 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 5: FC - 64 neurons, BN, relu.
MaxPooling 2×2
Dropout 50%
Layer 6: FC - 64 neurons , BN, relu.
Output: FC - 1 neuron, sigmoid activation.
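The following Keras-style sketch summarizes the two-stage training described above (pretraining from scratch on synthetic Texas-based I_B images, then fine-tuning on the Ambient-Flash training subjects). The builder function, the data arrays, and the learning rates are placeholders/assumptions; only the number of epochs and the batch size follow the text.

```python
import tensorflow as tf

def train_ib_classifier(build_ib_dnn, synth_x, synth_y, real_x, real_y):
    """Two-stage training of the IB-feature DNN (architecture as in Table VIII).

    build_ib_dnn : callable returning an uncompiled tf.keras.Model
                   (placeholder for the Table VIII network).
    synth_x/y    : synthetic IB maps generated from the Texas dataset.
    real_x/y     : IB maps from the Ambient-Flash training subjects.
    """
    model = build_ib_dnn()
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="binary_crossentropy", metrics=["accuracy"])
    # Stage 1: train from scratch on synthetic data.
    model.fit(synth_x, synth_y, epochs=4, batch_size=16)
    # Stage 2: fine-tune on real Ambient-Flash subjects (test subjects excluded).
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(real_x, real_y, epochs=4, batch_size=16)
    return model
```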

1). Performance of Our 3D Features:

Table III shows the classification results considering the proposed 3D features. During the experiments, training is performed leaving out samples that belong to test subjects. As a fair comparison, we include here the recently proposed 3D features [31], denoted "LBP_FI" and "SD_FIC," which are also extracted from a pair of flash/no-flash images.

TABLE III.

Classification Results Using Exclusively 3D Features. Training Is Performed Excluding Test Subjects From Any Training Step. Results Reported on Ambient-Flash Dataset

Method                      Acc. (%)   HTER (%)   EER (%)   APCER (%)   BPCER (%)
IB DNN (our)                99.9       0.1        0.1       0.0         0.2
IB SVM (our)                91.3       8.3        9.0       6.1         10.5
IB FC (our)                 96.7       3.1        3.4       2.7         3.5
LBP_FI+SD_FIC [31] SVM      90.7       8.8        9.0       8.4         9.2
LBP_FI+SD_FIC [31] FC       96.2       3.8        5.2       3.0         4.6

As we can see, the proposed I_B features enable high discrimination between live samples and spoofing attacks. Figure 9 shows examples of the proposed I_B features extracted from genuine cases (top) and planar and non-planar spoofing attacks (bottom). Such a simple shading-based 3D descriptor produces state-of-the-art detection performance even for simple classification algorithms such as SVM. These features, being intrinsically 3D, are invariant to the type of domain changes that make texture-based systems fail (see, e.g., Table II and Fig. 8). Moreover, virtually perfect classification rates can be achieved when classification is performed using a DNN trained with real plus simulated synthetic data.

Fig. 9. Proposed illumination based features extracted from genuine and spoofing examples.

The effectiveness of the proposed features is in part due to their invariance properties discussed in Sec. II. A simple representation is learned without calibration requirements and for images collected with an additional illuminant located at an arbitrary (unknown) position. We further illustrate this by simulating I_B features for samples from the Texas dataset. Figure 10 shows how the proposed features remain invariant (up to a global amplification) when the flash location is shifted (property described by Eq. (21)).

2). Computational Cost:

The proposed shading-based 3D features can be computed highly efficiently. We tested the time required to compute different features on a Lenovo T430 laptop with an Intel Core i5 CPU. The detection of facial landmarks requires about 25 ms. The computation of the LBP_FI feature requires an extra 29 ms (total time 54 ms). Computing the proposed I_B features requires almost an order of magnitude less time, 4 ms (after landmark detection), leading to a total time of 29 ms.

3). Combining Texture and 3D Features:

Table IV provides the results obtained for the combination of 3D features and texture features. Both texture and geometrical cues contain useful, and more importantly, independent information for the detection of spoofing attacks. We observe that the combination of IB features and texture-based features improves the classification performance. These results are expected as classification algorithms tend to improve when independent sources of information are aggregated [64], [65].
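A minimal sketch of one possible way to combine both cues is given below: the I_B map is downsampled and concatenated with an LBP histogram before being fed to the FC classifier. The downsampling size and the plain concatenation are illustrative choices of ours; the paper only states that 3D and texture features are combined.

```python
import numpy as np

def combine_features(ib_map, lbp_hist, size=32):
    """Concatenate a (downsampled) IB map with an LBP histogram into a single
    feature vector for the FC classifier."""
    h, w = ib_map.shape
    ys = np.linspace(0, h - 1, size).astype(int)
    xs = np.linspace(0, w - 1, size).astype(int)
    ib_small = ib_map[np.ix_(ys, xs)].ravel()   # coarse 3D cue
    return np.concatenate([ib_small, lbp_hist])  # 3D + texture descriptor
```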

TABLE IV.

Comparison of Texture, 3D, and the Combination of Both Features. The Mean Accuracy, HTER, and EER Over Ambient-Flash Dataset Is Reported. As in Previous Experiments, Different Identities are Used for Training and Testing. The FC Network Introduced Before Is Used for Classification. As Texture Descriptor LBP Is Selected. Results Reported on Ambient-Flash Dataset

Features                           Acc. (%)   HTER (%)   EER (%)   APCER (%)   BPCER (%)
Only Depth: [31]                   96.2       3.8        5.2       3.0         4.6
Only Depth: IB (our)               96.7       3.1        3.4       2.7         3.5
Only Texture                       98.7       1.3        2.7       2.5         2.6
Combination: [31] + Texture        99.3       0.5        0.6       0.3         0.4
Combination: IB + Texture (our)    99.9       0.1        0.1       0.0         0.1

The results presented in Table IV must be complemented with the experiment presented in Fig. 11. As we see in Table IV, texture-based features can achieve great performance in certain conditions; however, as discussed previously, generalization is a fundamental practical challenge. For example, Fig. 11 compares, across different input image resolutions, the classification performance of texture features alone against the combination of texture plus 3D features. The combination of texture and depth-based features can be used to improve performance but, more importantly, to improve robustness and generalization capabilities. We observe that, without 3D features, texture information alone is significantly more sensitive to the resolution of the input data.

Fig. 11. Performance of texture and texture plus 3D features when the resolution of the input image changes. The highest resolution (value 1 on the x-axis) corresponds to facial patches of 400×400 pixels; the lowest resolution corresponds to 60% of that (i.e., 240×240). The FC network described before is used for classification. Solid curves represent the mean performance (across different subjects) and the shadows indicate the standard deviation.

4). Potential Cases of Failure:

Discussing algorithmic accountability is essential in security applications. The proposed features are texture-invariant and capture real 3D information of the scene. Therefore, they can be misled by presenting an accurate 3D replica of the subject's face. On the other hand, texture-based features can be misled by presenting a high-resolution (planar or not) image or video of the target subject. Creating a high-quality image is relatively easy; creating a 3D mask with poor texture definition is also relatively easy. Creating a 3D mask that simultaneously and accurately reproduces the subject's shape and texture is much harder, and that is why the proposed features are useful to enhance and boost existing texture-based methods.

In addition, our method relies on capturing two images (one with flash and one without flash). Therefore, in extremely bright outdoor scenarios, it could happen that the power of some devices' flash is insufficient compared to the ambient illumination. This can be solved in practice by using an illumination source with a specific and narrow spectral band (which can be exploited to increase the effective contrast of the projected light).

IV. Conclusion and Future Work

This article proposed and analyzed an extremely lightweight shading-based 3D feature for spoofing detection. The proposed features can be ubiquitously deployed at negligible computational cost and with minimal additional hardware. We showed that popular texture-based solutions generalize poorly and are prone to novel texture manipulation attacks. As demonstrated, the proposed 3D features can be deployed alone or together with texture-based information to further improve their performance, generalization, and robustness.

The proposed ideas can be generalized to handle video inputs, for example, by using a fast light source (such as an LED). This light can be switched at a slower rate than the camera frame rate, e.g., if the camera captures at 30 frames per second, the light can be turned on and off at 10 cycles per second. Then, capturing a video in these conditions, V(x, y, t), we can define for a time interval T (e.g., T = 10ms) two images I_a = V(x, y, t_a) and I_f = V(x, y, t_f), where t_a and t_f are defined as t_a = \arg\min_{t \in T} \|V(x, y, t)\| and t_f = \arg\max_{t \in T} \|V(x, y, t)\|. The intuition behind this idea is that the frame with maximum brightness will be the one for which the light is on, while the frame with minimum brightness is the one for which the light is off. (Because the light switches at a slower rate than the camera, such frames are guaranteed to exist.) This way, the proposed features (extracted from the pair I_a and I_f) can be computed for video inputs at a rate of 1/T features per second.
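A minimal sketch of this frame-selection rule is given below; using the mean pixel intensity as the brightness measure and the handling of the time window are assumptions of ours, not details from the paper.

```python
import numpy as np

def ambient_flash_pair(frames):
    """Select I_a / I_f from a short window of video frames.

    frames : array of shape (T, H, W) covering at least one light on/off cycle.
    The darkest frame (minimum mean brightness) is taken as the ambient image
    I_a and the brightest one as the flash image I_f.
    """
    brightness = frames.reshape(len(frames), -1).mean(axis=1)
    I_a = frames[np.argmin(brightness)]   # light off
    I_f = frames[np.argmax(brightness)]   # light on
    return I_a, I_f
```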

Acknowledgment

J. Matías Di Martino thanks NVIDIA for the donation of a GPU. The authors thank Trishul Nagenalli for his assistance during the collection of some of the data.

This work was supported in part by the Comision Sectorial de Investigacion Cientifica (CSIC), in part by Google, Microsoft, Amazon, and Cisco, in part by the NSF, in part by the National Geospatial Intelligence Agency (NGA), in part by the Office of Naval Research (ONR), and in part by the Army Research Office (ARO).

Biography


J. Matías Di Martino (Member, IEEE) received the B.Sc. and Ph.D. degrees in electrical engineering from the Universidad de la Republica, Uruguay, in 2011 and 2015, respectively. In 2016, he worked as a Research Associate with the Ecole Normale Superieure de Cachan, Paris. He is currently working as an Assistant Professor with the Physics Department, School of Engineering, Universidad de la Republica. He also holds a postdoctoral associate researcher position at Duke University, USA. His main research interests include applied optics, image processing, and data science. In particular, in the past years he has focused on the study of active stereo techniques for 3D reconstruction and the enhancement of machine learning techniques.


Qiang Qiu (Member, IEEE) received the bachelor’s (Hons.) and master’s degrees in computer science from the National University of Singapore in 2001 and 2002, respectively, and the Ph.D. degree in computer science from the University of Maryland, College Park, in 2013. From 2002 to 2007, he was a Senior Research Engineer with the Institute for Infocomm Research, Singapore. He is currently an Assistant Research Professor with the Department of Electrical and Computer Engineering, Duke University. His research interests include computer vision and machine learning, specifically on face recognition, human activity recognition, image classification, and representation learning.


Guillermo Sapiro (Fellow, IEEE) was born in Montevideo, Uruguay, in April 1966. He received the B.Sc. (summa cum laude), M.Sc., and Ph.D. degrees from the Department of Electrical Engineering, Technion, Israel Institute of Technology, in 1989, 1991, and 1993, respectively. After postdoctoral research at MIT, he became a member of the Technical Staff at the research facilities of HP Labs, Palo Alto, CA. He was with the Department of Electrical and Computer Engineering, University of Minnesota, where he held the position of Distinguished McKnight University Professor and the Vincentine Hermes-Luh Chair in electrical and computer engineering. He is currently the Edmund T. Pratt, Jr. School Professor with Duke University. He works on theory and applications in computer vision, computer graphics, medical imaging, image analysis, and machine learning. He has authored or coauthored more than 400 articles in these areas and has written a book published by Cambridge University Press in January 2001. He is a fellow of SIAM. He was awarded the Gutwirth Scholarship for Special Excellence in Graduate Studies in 1991, the Ollendorff Fellowship for Excellence in Vision and Image Understanding Work in 1992, the Rothschild Fellowship for Postdoctoral Studies in 1993, the Office of Naval Research Young Investigator Award in 1998, the Presidential Early Career Award for Scientists and Engineers (PECASE) in 1998, the National Science Foundation CAREER Award in 1999, and the National Security Science and Engineering Faculty Fellowship in 2010. He received the Test of Time Award at ICCV 2011. He was the founding Editor-in-Chief of the SIAM Journal on Imaging Sciences.

Appendix A. Description of the Datasets

Casia.

This database contains 50 different subjects, three acquisition devices and three types of attack: a warped photo attack, a cut photo attack, and a video attack. For each subject 12 videos were recorded (3 genuine and 9 fake), and the final database contains 600 video clips. In order to analyze single-image features, we extract random frames from the input videos and process each of these frames as individual samples.

Oulu.

This database consists of 5940 videos corresponding to 55 subjects. Videos were captured by high-resolution frontal cameras of six different smartphones. High-quality print and video-replay attacks were created using two different printers and display devices. Again, we extracted and analyzed video frames independently.

Texas.

This database consists of 1149 pairs of high resolution, pose normalized, and aligned color and range (3D) images of 118 subjects. Only images of live subjects are provided (originally, this database was intended not to evaluate spoofing attacks but for face recognition). Nevertheless, this database is very useful to test implicit 3D features and, fundamentally, to simulate complementary training data.

Ambient-Flash.

This database was collected by us and includes images of 18 subjects. Images were collected over the span of three days at different indoor locations and lighting conditions. Three different kinds of attacks were collected: planar printed portraits, curved printed portraits, and portraits displayed on digital screens. 7503 pairs of images of live and spoof samples are provided. Each pair contains a photograph taken under ambient illumination and a photograph flashing an additional source of light. Images were collected varying the location, direction, and power of the additional light. More precisely, the ambient luminance varied from 100 to 400 Lux (depending on the location), while the additional source of light was set to increase the luminance over the subjects' faces by 35–45%. Images were captured using three devices: a Logitech C920 HD webcam, the webcam included in a Lenovo T430u laptop (resolution 640×480), and the camera of a Moto G4 smartphone.

Appendix B. Single Image Spoofing Detection: Implementation Details

To analyze how automatic learning can be exploited to extract the texture and color patterns illustrated above, we implement and study three families of single-image methods: (a) classification based on handcrafted texture and color features, (b) patch-based DNNs, and (c) face DNNs.

Texture and color features.

Local binary patterns (LBP) are considered for micro-texture feature extraction. We compare LBP patterns from gray and color versions of the input frame, as proposed in [55] and [56], respectively. Patterns are extracted over nine regions of the face and then concatenated [55]. In a similar way, color features are obtained after mapping the input RGB image into the YCbCr and/or HSV color spaces [56].
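For concreteness, a sketch of this kind of LBP descriptor is given below using scikit-image; the 3×3 grid layout, the (P, R) parameters, and the uniform mapping are common choices that may differ from the exact configuration used in [55], [56].

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_face_descriptor(gray_face, grid=(3, 3), P=8, R=1):
    """Concatenated LBP histograms over a grid of face regions,
    in the spirit of [55]."""
    lbp = local_binary_pattern(gray_face, P, R, method="uniform")
    n_bins = P + 2                      # number of uniform patterns
    h, w = lbp.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            block = lbp[i * h // grid[0]:(i + 1) * h // grid[0],
                        j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogram(block, bins=n_bins,
                                   range=(0, n_bins), density=True)
            feats.append(hist)
    return np.concatenate(feats)        # region histograms concatenated
```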

For classification, SVM and a shallow fully connected neural network are considered. The structure of the network is described in Table V. During training, the Adam optimizer is used and the minimized loss is the binary cross-entropy.

TABLE V.

Light-Weight Fully Connected Neural Network Used for Classification of Texture and Color Based Feature Vectors. FC Stands for Fully Connected and BN for Batch Normalization

Input: N×1 column feature vector.
Layer 1: FC - 32 neurons, BN, relu.
Layer 2: FC - 32 neurons, BN, relu.
Layer 3: FC - 32 neurons, BN, relu.
Output: FC - 1 neuron, sigmoid activation.

Patch-based DNN.

Micro-texture patterns contain additional information to discriminate between live and spoof images. Therefore, a family of modern approaches avoids pre-processing and face registration. Instead, patches from the raw input image are analyzed, aiming to keep the existing texture information unaffected.

Following ideas proposed in [13], we implement a patch-based DNN scheme. First, N random patches of the facial region are extracted. Then, a DNN is trained to classify individual patches. Finally, patch scores are combined to assign an overall score to each input image (a sketch of this scheme is given after Table VI).

Different numbers of patches and patch sizes are evaluated. For the databases at hand, we observed that N = 25 and a patch size of 64×64 pixels are an appropriate choice. The architecture of the network used for patch classification is provided in Table VI.

TABLE VI.

Architecture of Patch-Based DNN. Conv Stands for Convolutional, FC for Fully Connected, and BN for Batch Normalization. Training Is Performed Using Adam Optimizer and Binary Cross-Entropy as the Minimization Loss

Input: 64×64×3 image patch.
Layer 1: Conv - 50 kernels 5×5 , BN, relu.
MaxPooling 2×2
Layer 2: Conv - 75 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 3: Conv - 100 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 4: FC - 20 neurons, BN, relu.
Layer 5: FC - 20 neurons, BN, relu.
Output: FC - 1 neuron, sigmoid activation.
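A sketch of the patch-based scoring scheme follows. The random sampling strategy and the averaging of patch scores are assumptions (the text only states that patch scores are combined), and patch_model stands for a trained Keras-style model implementing the Table VI network.

```python
import numpy as np

def classify_face_by_patches(face_rgb, patch_model, n_patches=25, size=64,
                             rng=np.random.default_rng()):
    """Sample N random 64x64 patches from the facial region, score each with
    the patch DNN, and combine the scores into an image-level score."""
    h, w, _ = face_rgb.shape
    scores = []
    for _ in range(n_patches):
        y = rng.integers(0, h - size)
        x = rng.integers(0, w - size)
        patch = face_rgb[y:y + size, x:x + size, :]
        score = patch_model.predict(patch[None, ...], verbose=0).ravel()[0]
        scores.append(score)
    return float(np.mean(scores))   # overall spoofing score for the image
```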

Face DNN.

Patch-based solutions have two main challenges. First, it is hard to make these approaches scale invariant, as pre-processing steps are explicitly avoided. Second, the prediction error associated with individual patches is diverse, as we shall discuss in the following experiments.

To overcome these difficulties while improving on handcrafted texture and color features, DNN methods have been proposed for global facial analysis. Following ideas introduced in [57]–[59], we implement and test a DNN for global facial analysis. The network implemented is described in Table VII. The architecture is inspired by the standard VGG network [66] with three minor adaptations. In order to capture finer texture details, we consider higher resolution inputs (480×480 instead of 224×224). Then, to account for the higher resolution input and control the total number of parameters, we introduce additional convolutional layers and pooling. Finally, as the spoofing detection problem consists of only two classes, the final fully connected layers are simplified.

TABLE VII.

Architecture of the DNN Implemented for Global Face Analysis. Conv Stands for Convolutional, FC for Fully Connected, and BN for Batch Normalization. Training Is Performed Using Adam Optimizer and Binary Cross-Entropy as the Minimization Loss

Input: 480×480×3 image.
Layer 1.1: Conv - 32 kernels 3×3 , BN, relu.
Layer 1.2: Conv - 32 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 2: Conv - 64 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 3: Conv - 64 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 4: Conv - 128 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 5: Conv - 256 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 6: Conv - 64 kernels 3×3 , BN, relu.
MaxPooling 2×2
Layer 7: FC - 256 neurons, BN, relu.
Dropout 50%
Layer 8: FC - 124 neurons, BN, relu.
Output: FC - 1 neuron, sigmoid activation.

Appendix C. Synthesis of an Adversarial Attack

The face-DNN classifier (evaluated in Sec. III-B of the paper) is trained merging the Casia, Oulu, and Ambient-Flash datasets. We empirically observed that this strategy is the most reliable one in terms of generalization and the hardest to hack.

The DNN can be expressed as a mapping f: \mathcal{I} \to [0, 1], where \mathcal{I} represents the set of all possible discrete images of size 480×480×3. We assume that input images take values in the range [0, 1]. For an input I with ground truth label y, L(I, y) denotes the cross-entropy cost function of the neural network output f(I) = ŷ given the class y.

Algorithm 1 details the iterative process implemented for the generation of spoofing adversarial attacks. Essentially, the process of creating adversarial attacks consists of producing slight modifications to the original input until it is misclassified.

Figure 12 illustrates an example of how the adversarial attack evolves for a test (spoofing) input sample. The initial image (top left) corresponds to a photograph of a printed portrait of a test subject; the score of the face-DNN for this initial image is 0.99 (recall that the label 1 is associated with spoofing attacks and 0 with live samples). For the sake of simplicity, let us assume during this experiment that the classification threshold is set at 0.5. Then, scores above/below 0.5 are associated with the spoofing/live classes, respectively. As shown in Fig. 12, the initial image is correctly classified as a spoofing attack with high confidence (output 0.99). (This image has never been seen by the algorithm or included for training.) The second row of Fig. 12 shows the absolute value of the difference between the original image and the result after 10, 20 and 30 iterations, respectively.

Algorithm 1.

Generation of an Adversarial Attack. An Input Image I0 That Corresponds to a Spoofing Attack Is Modified Until the Network Gives Low Scores to It (i.e., It Is Classified as a Live Sample). Notice That, as We Are Working With a Single Output That Represents p(y = 1|I) and the Gradient Is Normalized by Its Norm, Step 8 Is Equivalent to a Gradient Ascent on the Cross-Entropy L(I, y = 1)

Input: Initial image I0, network model f
 1: Set gradient descent parameters:
 2:  δ = 5×10−3 (gradient step)
 3:  N = 40 (number of steps)
 4:  Δ = 0.15 (max. difference with respect to I0)
 5: I0rgb = I0 (set seed)
 6: for i in 1 … N do
 7:  I0rgb → I0c (map input to the selected color space)
 8:  I1c ← I0c − δ ∇I f(I0c) / ‖∇I f(I0c)‖
 9:  I1c → I1rgb (map back to RGB)
 10:  I1rgb = min(I0 + Δ, max(I0 − Δ, I1rgb))
 11:  I1rgb = clip(I1rgb, 0, 1)
 12:  I0rgb = I1rgb
 13: end for
 14: return I1rgb
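A compact NumPy rendering of the loop in Algorithm 1 is sketched below. It works directly in RGB (omitting the color-space round trip of steps 7 and 9), and the gradient of the network output with respect to the image is assumed to be provided by a user-supplied callable (e.g., obtained by automatic differentiation through the face-DNN).

```python
import numpy as np

def adversarial_attack(I0, f_and_grad, delta=5e-3, n_steps=40, max_diff=0.15):
    """Iterative normalized-gradient attack in the spirit of Algorithm 1.

    f_and_grad : callable returning (score, gradient-of-score w.r.t. the image).
    """
    I = I0.copy()
    for _ in range(n_steps):
        _, g = f_and_grad(I)
        I = I - delta * g / (np.linalg.norm(g) + 1e-12)   # normalized gradient step
        I = np.minimum(I0 + max_diff, np.maximum(I0 - max_diff, I))  # stay near I0
        I = np.clip(I, 0.0, 1.0)                          # keep a valid image
    return I
```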

Fig. 12. The first row shows the result of applying the steps described in Alg. 1. The first image (top left) illustrates an example input image and corresponds to a picture of a printed portrait of a test subject (spoofing example). The second row shows the absolute value of the difference with respect to the original image.

This experiment illustrates that it is possible to modify the texture of a spoofing attack to hack a texture-based classification approach. However, it remains to show that such a procedure is still effective after the image is physically displayed [63]. After all, the process of printing or displaying the generated adversarial image on a screen introduces uncontrolled texture distortion. Figure 8 shows an example of how the adversarial attack is able to mislead a spoofing detection algorithm even in the physical world.

Appendix D. Architecture of the DNNs

The architectures of the DNNs implemented for feature extraction and classification are detailed in Tables V, VI, VII, and VIII.

Appendix E. Generation of Synthetic Training Samples

Live and spoofing samples are generated from the Texas dataset; examples of raw input images from this dataset are illustrated in Fig. 4. Using facial texture and depth information, we simulate live and spoofing samples as follows.

Firstly, we synthesize additional RGB images simulating an additional source of light. Figure 13 illustrates the result of simulating facial images under an additional source of light (e.g., a flash). The intensity of the illuminant and its orientation can be randomly modified.

Fig. 13.

Result of simulating an image with an additional illuminant. The power and location of the additional source of light can be arbitrarily modified.

Second, we compute IB features by combining the original RGB input with the simulated sample under additional light. Figure 14 illustrates examples of simulated live and spoof samples.

Fig. 14.

Simulated live and spoof IB-feature samples generated from the Texas dataset.

Algorithm 2 summarizes the main steps required for the simulation of IB synthetic data.

For the generation of realistic synthetic data, several important details must be addressed. A realistic synthesis is crucial for training DNNs that perform well on real data and generalize across various practical scenarios.

For the reproduction of live samples, the ground-truth depth is naturally used. In contrast, a fake depth profile is simulated when creating spoofing samples. Fake depth profiles are simulated as polynomials with random coefficients plus smooth noise; we use polynomials of degree 3 with coefficients sampled from a uniform distribution.

Smooth random noise is generated by filtering white noise with a random Gaussian kernel of width σ ∼ N(30, 30) (mean and standard deviation of 30 pixels). This procedure is engineered to simulate the depth of deformed printed portraits that could be deployed in practical spoofing situations.
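A minimal sketch of this fake-depth simulation is given below. The coefficient range of the random polynomial is illustrative, as the text above only specifies a degree-3 polynomial with uniformly sampled coefficients plus Gaussian-smoothed noise with σ ∼ N(30, 30).

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def simulate_smooth_random_depth(shape, rng=None):
        # Degree-3 polynomial surface with random coefficients plus smoothed white noise.
        rng = np.random.default_rng() if rng is None else rng
        H, W = shape
        v, u = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")

        D = np.zeros(shape)
        for i in range(4):
            for j in range(4 - i):
                # Coefficient range [-1, 1] is illustrative only.
                D += rng.uniform(-1, 1) * (u ** i) * (v ** j)

        # Smooth noise: white noise filtered by a random Gaussian kernel,
        # with sigma drawn from N(30, 30) pixels and kept positive.
        sigma = max(1.0, rng.normal(30.0, 30.0))
        D += gaussian_filter(rng.standard_normal(shape), sigma=sigma)
        return D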

Algorithm 2.

Simulate IB Features From RGB and Depth

Input: Input images I (RGB) and D (depth)
 1: Random label: y ∼ Bernoulli_{−1,1}(1/2)
 2: if y == 1 then (simulate spoofing sample)
 3:  D = simulate_smooth_random_depth()
 4: else (simulate live sample)
 5:  Use the ground-truth depth data.
 6: end if
 7: Set random values for lightIntensity and lightDirection.
 8: I_f = Simulate_additional_illuminant(I, lightIntensity, lightDirection)
 9: I_f = Random_shift(I_f)
 10: IB = Compute_illumination_based_feat(I, I_f)
 11: return features: IB, label: y

Finally, before computing the IB features from an ambient–flash pair, one of the images is randomly shifted, Ĩ(u, v) = I(u − μ_u, v − μ_v) with μ_i ∼ U[−3, 3]. This step is important to account for the imprecise registration that may occur with real data (recall that, in practice, automatically detected landmarks are used for registration).
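The random-shift step and the overall simulation of Algorithm 2 can be sketched as follows. Here simulate_smooth_random_depth and simulate_additional_illuminant refer to the sketches above, compute_ib_features is a hypothetical placeholder for the IB feature computation defined in the main text, and the label convention ({0, 1} instead of {−1, 1}) and parameter ranges are illustrative.

    import numpy as np

    def random_shift(I, max_shift=3, rng=None):
        # Integer pixel shifts mu_u, mu_v ~ U[-3, 3]; edges wrap around here,
        # while padding or cropping would be an equally reasonable choice.
        rng = np.random.default_rng() if rng is None else rng
        mu_u, mu_v = rng.integers(-max_shift, max_shift + 1, size=2)
        return np.roll(I, shift=(int(mu_u), int(mu_v)), axis=(0, 1))

    def simulate_ib_sample(I, D, compute_ib_features, rng=None):
        # Hedged sketch of Algorithm 2: draw a label, pick real or fake depth,
        # render the extra illuminant, apply a random shift, and compute IB features.
        rng = np.random.default_rng() if rng is None else rng
        y = int(rng.integers(0, 2))                      # 1: spoofing, 0: live
        if y == 1:
            D = simulate_smooth_random_depth(D.shape, rng)
        intensity = rng.uniform(0.2, 0.8)                # illustrative range
        direction = rng.normal(size=3) + np.array([0.0, 0.0, 2.0])
        I_f = simulate_additional_illuminant(I, D, intensity, tuple(direction))
        I_f = random_shift(I_f, rng=rng)
        return compute_ib_features(I, I_f), y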

References

  • [1]. Chingovska I, Erdogmus N, Anjos A, and Marcel S, “Face recognition systems under spoofing attacks,” in Face Recognition Across the Imaging Spectrum. Cham, Switzerland: Springer, 2016, pp. 165–194.
  • [2]. Boulkenafet Z, Komulainen J, Li L, Feng X, and Hadid A, “OULU-NPU: A mobile face presentation attack database with real-world variations,” in Proc. 12th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2017, pp. 612–618.
  • [3]. Zhang Z, Yan J, Liu S, Lei Z, Yi D, and Li SZ, “A face antispoofing database with diverse attacks,” in Proc. 5th IAPR Int. Conf. Biometrics (ICB), March 2012, pp. 26–31.
  • [4]. Socolinsky DA, Selinger A, and Neuheisel JD, “Face recognition with visible and thermal infrared imagery,” Comput. Vis. Image Understand., vol. 91, nos. 1–2, pp. 72–114, July 2003.
  • [5]. Pan G, Sun L, Wu Z, and Lao S, “Eyeblink-based anti-spoofing in face recognition from a generic webcamera,” in Proc. IEEE 11th Int. Conf. Comput. Vis., October 2007, pp. 1–8.
  • [6]. Pavlidis I and Symosek P, “The imaging issue in an automatic face/disguise detection system,” in Proc. IEEE Workshop Comput. Vis. Beyond Visible Spectr., Methods Appl., October 2000, pp. 15–24.
  • [7]. Zhang Z, Yi D, Lei Z, and Li SZ, “Face liveness detection by learning multispectral reflectance distributions,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. Workshops, March 2011, pp. 436–441.
  • [8]. Bao W, Li H, Li N, and Jiang W, “A liveness detection method for face recognition based on optical flow field,” in Proc. Int. Conf. Image Anal. Signal Process., 2009, pp. 233–236.
  • [9]. Kollreider K, Fronthaler H, and Bigun J, “Evaluating liveness by face images and the structure tensor,” in Proc. 4th IEEE Workshop Autom. Identificat. Adv. Technol. (AutoID), October 2005, pp. 75–80.
  • [10]. Liu Y, Jourabloo A, and Liu X, “Learning deep models for face anti-spoofing: Binary or auxiliary supervision,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., June 2018, pp. 389–398.
  • [11]. Wang T, Yang J, Lei Z, Liao S, and Li SZ, “Face liveness detection using 3D structure recovered from a single camera,” in Proc. Int. Conf. Biometrics (ICB), June 2013, pp. 1–6.
  • [12]. Agarwal A, Singh R, and Vatsa M, “Face anti-spoofing using Haralick features,” in Proc. IEEE 8th Int. Conf. Biometrics Theory, Appl. Syst. (BTAS), September 2016, pp. 1–6.
  • [13]. Atoum Y, Liu Y, Jourabloo A, and Liu X, “Face anti-spoofing using patch and depth-based CNNs,” in Proc. IEEE Int. Joint Conf. Biometrics (IJCB), October 2017, pp. 319–328.
  • [14]. Boulkenafet Z et al., “A competition on generalized software-based face presentation attack detection in mobile scenarios,” in Proc. IEEE Int. Joint Conf. Biometrics (IJCB), October 2017, pp. 688–696.
  • [15]. Li J, Wang Y, Tan T, and Jain AK, “Live face detection based on the analysis of Fourier spectra,” Proc. SPIE, vol. 5404, pp. 296–304, August 2004.
  • [16]. Li L, Feng X, Jiang X, Xia Z, and Hadid A, “Face anti-spoofing via deep local binary patterns,” in Proc. IEEE Int. Conf. Image Process. (ICIP), September 2017, pp. 91–111.
  • [17]. Li Q, Xia Z, and Xing G, “A binocular framework for face liveness verification under unconstrained localization,” in Proc. 9th Int. Conf. Mach. Learn. Appl., December 2010, pp. 37–40.
  • [18]. Yang J, Lei Z, and Li SZ, “Learn convolutional neural network for face anti-spoofing,” 2014, arXiv:1408.5601. [Online]. Available: http://arxiv.org/abs/1408.5601
  • [19]. Nikisins O, Mohammadi A, Anjos A, and Marcel S, “On effectiveness of anomaly detection approaches against unseen presentation attacks in face anti-spoofing,” in Proc. Int. Conf. Biometrics (ICB), February 2018, pp. 75–81.
  • [20]. Peixoto B, Michelassi C, and Rocha A, “Face liveness detection under bad illumination conditions,” in Proc. 18th IEEE Int. Conf. Image Process., September 2011, pp. 3557–3560.
  • [21]. Sun X, Huang L, and Liu C, “Dual camera based feature for face spoofing detection,” in Proc. Chin. Conf. Pattern Recognit. (CCPR), 2016, pp. 332–344.
  • [22]. Tan X, Li Y, Liu J, and Jiang L, “Face liveness detection from a single image with sparse low rank bilinear discriminative model,” in Proc. Eur. Conf. Comput. Vis. (ECCV). Berlin, Germany: Springer, 2010, pp. 504–517.
  • [23]. Yeh C-H and Chang H-H, “Face liveness detection with feature discrimination between sharpness and blurriness,” in Proc. 15th IAPR Int. Conf. Mach. Vis. Appl. (MVA), May 2017, pp. 398–401.
  • [24]. Zhang X, Hu X, Ma M, Chen C, and Peng S, “Face spoofing detection based on 3D lighting environment analysis of image pair,” in Proc. 23rd Int. Conf. Pattern Recognit. (ICPR), December 2016, pp. 2995–3000.
  • [25]. Hartley R and Zisserman A, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2003.
  • [26]. Kose N and Dugelay J-L, “Mask spoofing in face recognition and countermeasures,” Image Vis. Comput., vol. 32, no. 10, pp. 779–789, October 2014.
  • [27]. Liu Y, Stehouwer J, Jourabloo A, Atoum Y, and Liu X, “Presentation attack detection for face in mobile phones,” in Advances in Computer Vision and Pattern Recognition. London, U.K.: Springer, 2019, pp. 171–196.
  • [28]. Wen D, Han H, and Jain AK, “Face spoof detection with image distortion analysis,” IEEE Trans. Inf. Forensics Security, vol. 10, no. 4, pp. 746–761, April 2015.
  • [29]. Patel K, Han H, and Jain AK, “Secure face unlock: Spoof detection on smartphones,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 10, pp. 2268–2283, October 2016.
  • [30]. Caetano Garcia D and de Queiroz RL, “Face-spoofing 2D-detection based on Moiré-pattern analysis,” IEEE Trans. Inf. Forensics Security, vol. 10, no. 4, pp. 778–786, April 2015.
  • [31]. Chan PPK et al., “Face liveness detection using a flash against 2D spoofing attack,” IEEE Trans. Inf. Forensics Security, vol. 13, pp. 521–534, February 2018.
  • [32]. Prados E and Faugeras O, “Shape from shading,” in Handbook of Mathematical Models in Computer Vision. Boston, MA, USA: Springer, 2006, pp. 375–388.
  • [33]. Zhang R, Tsai P-S, Cryer JE, and Shah M, “Shape-from-shading: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 8, pp. 690–706, August 1999.
  • [34]. Agrawal A, Chellappa R, and Raskar R, “An algebraic approach to surface reconstruction from gradient fields,” in Proc. 10th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 1, October 2005, pp. 174–181.
  • [35]. Reddy D, Agrawal A, and Chellappa R, “Enforcing integrability by error correction using ℓ1-minimization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2009, pp. 2350–2357.
  • [36]. Tumblin J, Agrawal A, and Raskar R, “Why I want a gradient camera,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2005, pp. 103–110.
  • [37]. Basri R and Jacobs DW, “Lambertian reflectance and linear subspaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, pp. 218–233, February 2003.
  • [38]. Hayakawa H, “Photometric stereo under a light source with arbitrary motion,” J. Opt. Soc. Amer. A, Opt. Image Sci., vol. 11, no. 11, pp. 3079–3089, 1994.
  • [39]. Shashua A, “On photometric issues in 3D visual recognition from a single 2D image,” Int. J. Comput. Vis., vol. 21, nos. 1–2, pp. 99–122, 1997.
  • [40]. Zhou SK, Aggarwal G, Chellappa R, and Jacobs DW, “Appearance characterization of linear Lambertian objects, generalized photometric stereo, and illumination-invariant face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, pp. 230–245, February 2007.
  • [41]. Pentland AP, “Finding the illuminant direction,” J. Opt. Soc. Amer., vol. 72, no. 4, pp. 448–455, April 1982.
  • [42]. Zheng Q and Chellappa R, “Estimation of illuminant direction, albedo, and shape from shading,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., June 1991, pp. 540–545.
  • [43]. Durou J-D, Falcone M, and Sagona M, “Numerical methods for shape-from-shading: A new survey with benchmarks,” Comput. Vis. Image Understand., vol. 109, no. 1, pp. 22–43, January 2008.
  • [44]. Prados E and Faugeras O, “Shape from shading: A well-posed problem?” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2005, pp. 870–877.
  • [45]. Gupta S, Castleman KR, Markey MK, and Bovik AC, “Texas 3D face recognition database,” in Proc. IEEE Southwest Symp. Image Anal. Interpretation (SSIAI), May 2010, pp. 97–100.
  • [46]. Blanz V and Vetter T, “A morphable model for the synthesis of 3D faces,” in Proc. 26th Annu. Conf. Comput. Graph. Interact. Techn. (SIGGRAPH), 1999, pp. 187–194.
  • [47]. Eigen D and Fergus R, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), December 2015, pp. 2650–2658.
  • [48]. Eigen D, Puhrsch C, and Fergus R, “Depth map prediction from a single image using a multi-scale deep network,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2366–2374.
  • [49]. Huber P et al., “A multiresolution 3D morphable face model and fitting framework,” in Proc. 11th Joint Conf. Comput. Vis., Imag. Comput. Graph. Theory Appl., 2016, pp. 1–8.
  • [50]. Laina I, Rupprecht C, Belagiannis V, Tombari F, and Navab N, “Deeper depth prediction with fully convolutional residual networks,” in Proc. 4th Int. Conf. 3D Vis. (3DV), October 2016, pp. 239–248.
  • [51]. Pini S, Grazioli F, Borghi G, Vezzani R, and Cucchiara R, “Learning to generate facial depth maps,” 2018, arXiv:1805.11927. [Online]. Available: http://arxiv.org/abs/1805.11927
  • [52]. Saxena A, Chung SH, and Ng AY, “3-D depth reconstruction from a single still image,” Int. J. Comput. Vis., vol. 76, no. 1, pp. 53–69, December 2007.
  • [53]. Saxena A, Sun M, and Ng AY, “Make3D: Learning 3D scene structure from a single still image,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 824–840, May 2009.
  • [54]. Subbarao M and Surya G, “Depth from defocus: A spatial domain approach,” Int. J. Comput. Vis., vol. 13, no. 3, pp. 271–294, December 1994.
  • [55]. Maatta J, Hadid A, and Pietikainen M, “Face spoofing detection from single images using micro-texture analysis,” in Proc. Int. Joint Conf. Biometrics (IJCB), October 2011, pp. 1–7.
  • [56]. Boulkenafet Z, Komulainen J, and Hadid A, “Face spoofing detection using color texture analysis,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 8, pp. 146–157, August 2016.
  • [57]. Li H, He P, Wang S, Rocha A, Jiang X, and Kot AC, “Learning generalized deep feature representation for face anti-spoofing,” IEEE Trans. Inf. Forensics Security, vol. 13, no. 10, pp. 2639–2652, October 2018.
  • [58]. Li L, Feng X, Boulkenafet Z, Xia Z, Li M, and Hadid A, “An original face anti-spoofing approach using partial convolutional neural network,” in Proc. 6th Int. Conf. Image Process. Theory, Tools Appl. (IPTA), December 2016, pp. 1–6.
  • [59]. Lucena O, Junior A, Moia V, Souza R, Valle E, and Lotufo R, “Transfer learning using convolutional neural networks for face anti-spoofing,” in Proc. Int. Conf. Image Anal. Recognit., July 2017, pp. 27–34.
  • [60]. Poh N, Chan CH, Kittler J, Fierrez J, and Galbally J, “Description of metrics for the evaluation of biometric performance,” Biometrics Eval. Test., Tech. Rep., 2012. [Online]. Available: https://www.beat-eu.org/project/deliverables-public/d3.3-description-of-metrics-for-the-evaluation-of-biometric-performance
  • [61]. Wang G, Han H, Shan S, and Chen X, “Improving cross-database face presentation attack detection via adversarial domain adaptation,” in Proc. Int. Conf. Biometrics (ICB), June 2019, pp. 1–8.
  • [62]. Yang X et al., “Face anti-spoofing: Model matters, so does data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2019, pp. 3507–3516.
  • [63]. Kurakin A, Goodfellow I, and Bengio S, “Adversarial examples in the physical world,” 2016, arXiv:1607.02533. [Online]. Available: http://arxiv.org/abs/1607.02533
  • [64]. Kuncheva LI, Combining Pattern Classifiers: Methods and Algorithms. Hoboken, NJ, USA: Wiley, 2004.
  • [65]. Kuncheva LI, Whitaker CJ, Shipp CA, and Duin RPW, “Is independence good for combining classifiers?” in Proc. 15th Int. Conf. Pattern Recognit. (ICPR), September 2000, pp. 168–171.
  • [66]. Simonyan K and Zisserman A, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
