Abstract
Fairness in data-driven decision-making studies scenarios where individuals from certain population segments may be unfairly treated when being considered for loan or job applications, access to public resources, or other types of services. In location-based applications, decisions are based on individual whereabouts, which often correlate with sensitive attributes such as race, income, and education.
While fairness has received significant attention recently, e.g., in machine learning, there is little focus on achieving fairness when dealing with location data. Due to their characteristics and the specific algorithms used to process them, location data pose important fairness challenges. We introduce the concept of spatial data fairness to address the specific challenges of location data and spatial queries. We devise a novel building block to achieve fairness in the form of fair polynomials. Next, we propose two mechanisms based on fair polynomials that achieve individual spatial fairness, corresponding to two common location-based decision-making types: distance-based and zone-based. Extensive experimental results on real data show that the proposed mechanisms achieve spatial fairness without sacrificing utility.
1. INTRODUCTION
In the past decade, location data have become an integral part of many applications, e.g., mobile apps, smart health, smart cities. Individual locations are often used in creating user profiles, or as an input for various decision-making processes (e.g., machine learning), which may affect an individual’s access to public resources, loans, etc.
It is already well-understood that location bias has significant effects on underprivileged communities, e.g., in the context of transportation and housing [25]. Some public authorities designated entire geographical regions as economically-disadvantaged areas (EDA) [2]. However, while fairness has been studied recently in ML settings [6] for generic data types [16], no specific solution studies fairness in spatial data processing. Addressing spatial data fairness presents two specific challenges:
(1) Location data may be used, intentionally or inadvertently, to exercise bias against individuals from disadvantaged backgrounds in a stealthy fashion. While it is illegal to use race or ethnicity in a loan-granting or hiring decision, one may use the location of current residence as input. Even though location may not seem sensitive, it may be used to discriminate against people of a certain ethnicity, as on many occasions people from the same ethnic group congregate in certain spatially-focused communities [18]. Similar concerns exist for income or education level, which often exhibit strong correlations with the location where an individual works, lives or travels [9]. Note that this sort of discrimination may often occur inadvertently, as opaque ML algorithms automatically exploit certain correlations in location data (e.g., higher default rates in certain zipcodes), without realizing their fairness implications.
(2) Fairness is achieved through some data transformation designed to prevent, or limit, the amount of bias in processing. This causes loss of utility, whereby the result of processing can be sub-optimal compared to the result obtained on the original data. Achieving fairness requires some utility loss, and the emerging fairness-utility trade-off must be carefully considered when devising a fairness mechanism. In the case of location data, utility has specific formulations, which may impact results in a way that is unique to spatial query processing algorithms. Using generic fairness mechanisms devised for other types of data may lead to poor utility, as seen in [27]. Therefore, it is desirable to design customized mechanisms for fighting bias in location data, such that the utility of spatial information is not significantly decreased.
In this paper, we introduce specific mechanisms targeted at providing spatial fairness while preserving data utility. We focus on the case of individual fairness [12], which is more difficult to achieve, but provides a higher level of fairness guarantees compared to its group-level counterpart. We provide specific definitions of location bias, and carefully characterize how location data can be used to exercise discriminatory decisions.
We introduce a novel construction called fair polynomials (Section 3.1) that can be used as a building block within mechanisms for spatial fairness¹. We perform a detailed exploration of fair polynomials in order to understand their properties and the trade-off achieved between enforcing fairness and preserving data utility.
We identify two broad categories of scenarios where location bias occurs, and we define spatial fairness mechanisms for each:
Distance-based fairness is relevant in location-based advertising and ride-hailing, where the dominant query type is nearest-neighbors (NN). In this setting, location bias occurs when individuals are impacted by their distance to a reference point. In location-based marketing, an algorithm may advertise special deals to customers that are nearby a newly-opened health food store. The specific coordinates of the customers may be less relevant for utility, and instead the distance to a landmark is the important factor (i.e., the proximity to the store is a good indicator of likelihood of visiting it). While this may be efficient, it can have fairness repercussions. For instance, if the algorithm chooses the first 100 potential customers to reach based on distance, it is possible that only a rich neighborhood is covered. Customers from a poorer neighborhood that is adjacent to the rich one may never be selected, due to a slightly larger distance threshold. Figure 1a illustrates this case. In this situation, we would like to ensure that individuals from poorer backgrounds also get the chance to benefit from special deals on healthy food, even though they are slightly farther away (i.e., avoid the hard decision boundary phenomenon).
Zone-based fairness is applicable in scenarios like gerrymandering, loan analysis or insurance pricing, where spatial range queries are the norm. In this case, we look at how to ensure spatial fairness with respect to coordinate values, instead of distances. This setting is broader, as it can provide fairness with respect to any reference point. Conversely, the amount of data utility sacrificed in the process may be higher. Figure 1b illustrates this point, where two individual homes are quoted significantly different insurance premiums due to their surrounding characteristics. Ideally, we would like the two residences to have similar premiums, given their physical proximity.
Figure 1: Examples of two common location bias scenarios.
Our specific contributions are:
We identify the problem of bias in spatial data processing, and formalize the notion of spatial data fairness;
We introduce two definitions of spatial data fairness based on common interaction types, namely distance-based and zone-based fairness;
We devise the novel concept of fair polynomials, which can be used as a building block to obtain mechanisms that achieve spatial fairness;
We propose two mechanisms based on fair polynomials that enforce distance-based and zone-based fairness;
We perform an extensive experimental evaluation on real datasets that shows the effectiveness of the proposed mechanisms, and investigates the fairness-utility trade-off.
The rest of the paper is organized as follows: Section 2 formalizes the notions of bias and fairness for location data. Sections 3 and 4 introduce our proposed distance-based and zone-based fairness mechanisms, respectively. We survey related work in Section 5. Section 6 reports the results of our experimental evaluation, followed by conclusions in Section 7.
2. SYSTEM MODEL
2.1. Location Bias
The existence of bias, rooted in data or algorithms, is commonly used as a basis for reasoning on unfairness in decision-making. Several sources of bias have been identified in the literature, such as measurement bias [29] and behavioral bias [24], many of which are intertwined. In this work, we formalize a type of bias that occurs due to location data. Location bias is formally defined as follows:
Definition 1 (Location Bias). Distortion or algorithmic bias generated based on locations of entities in the geospatial domain or their distances to reference points is referred to as location bias.
Distortion refers to bias intrinsic to the data, whereas algorithmic bias [5] refers to bias that is generated by processing algorithms. A category of bias closely related to algorithmic bias is called data processing bias [24], which occurs during data cleaning, enrichment, and aggregation. Our focus is on location bias sourced in processing algorithms. Location bias appears in a variety of applications where distances or locations may cause discrimination against individuals or groups. Consider the example in Fig. 2 in which a store wants to contact nearby customers with daily offers. A widely known algorithm such as nearest neighbors may be used to decide which customers should be contacted. If fairness measures are not considered, and if the store is located in a rich neighborhood, customers who live in less privileged areas of the city may never be contacted, and will be unable to enjoy the special offers.
Figure 2: An example of the distance-based fairness problem.
Such an algorithmic advantage towards a particular group or towards certain individuals is an example of distance-based location bias with respect to a reference point. The distances can be represented as a one-dimensional input feature; however, the source of location bias can be multidimensional, more commonly stored in two or more dimensions, e.g., as latitudes and longitudes. Consider classifying districts in a city as “high risk” or “low risk”, where more police presence and government resources are allocated to locations with higher crime rates. A strict boundary between a low-risk and high-risk district dictates that nearby individuals placed on opposite sides of the border are treated differently, despite their proximity. The unfairness manifests in the number of police patrols, the tolerance of police officers to crime, the cost of insurance, and the approval of home improvement loans. To address location bias in classification tasks, i.e., to prevent individuals with similar features from being treated significantly differently (unfairly), we first introduce the notion of location fairness in Section 2.2, and then formulate the problem of achieving location fairness in Section 2.3.
2.2. Spatial Fairness Definition
There are two categories of fairness definitions, namely group fairness and individual fairness [13]. The former definition addresses the case where a group with certain features is treated statistically different compared to other groups. The latter approach focuses on treating individuals with similar features in a similar way. We adopt the use of individual fairness for spatial data, since (i) it provides higher fairness guarantees; (ii) it is more suitable for continuous domain features such as locations, in contrast to categorical features such as education, race, and gender; and (iii) location attributes tend to be dynamic, and hence more relevant on an individual basis, rather than for a group. To this end, we formally present the notion of individual spatial fairness in Definition 2:
Definition 2. (Individual Spatial Fairness). Let 𝓛 = {l_1, l_2, …, l_m} denote the set of individual locations that need to be classified over the output set 𝓐, where Δ(𝓐) denotes the set of probability distributions over 𝓐. A randomized mapping M : 𝓛 → Δ(𝓐) satisfies individual spatial fairness iff for every two locations l_u, l_v ∈ 𝓛 the (D, d)-Lipschitz constraint holds:

D(M(l_u), M(l_v)) ≤ d(l_u, l_v)    (1)
Intuitively, the definition states that the evaluation process M for two similar locations should yield similar outcomes. The definition relies on two key distance metrics: (1) a similarity distance metric d(.), measuring how similar individuals are, and (2) a distance metric D(.), measuring the distance between outcome distributions. The former metric will be defined thoroughly for locations in the upcoming sections, as it is tailored to the specific location interaction type. The latter metric, on the other hand, is commonly defined as the total variation norm, or so-called statistical distance. Given two probability distributions P and Q over outcome space 𝓐, the statistical distance is calculated as

D(P, Q) = max_{a ∈ 𝓐} |P(a) − Q(a)|    (2)
Our focus in this work is on binary decision-making tasks; hence, the output space is given by 𝓐 = {0, 1}. We assume a classifier modeled as a randomized mechanism M : 𝓛 → Δ(𝓐) mapping individuals to outcomes, where Δ(𝓐) denotes all possible distributions over 𝓐. Thus, the classification of an individual l_u ∈ 𝓛 over outcome space 𝓐 is done according to the distribution M(l_u). To simplify notation, we use M(l_u) to denote the likelihood of the positive outcome, i.e., M(l_u) = M(l_u)(a = 1).
2.3. Problem Formulation
Consider m datapoints 𝓛 = {l_1, l_2, …, l_m} located in a k-dimensional space ℝ^k. Each location represents an individual. Associated attributes of the data points are stored in a tabular format such that the i-th row of the table is dedicated to l_i (as shown in Fig. 2).
Attribute Distance to Reference (DtR) represents the distance from a reference point² R. We use as DtR metric the Minkowski distance of order p (p-norm distance), defined for two data points l_u = (x_1, …, x_k) and l_v = (x′_1, …, x′_k) as

∥(l_u, l_v)∥_p = ( Σ_{i=1}^{k} |x_i − x′_i|^p )^{1/p}    (3)

The u-th entry of the DtR column is associated with datapoint l_u and, with a slight abuse of notation, stores the scalar l_u = ∥(l_u, R)∥_p / γ, where γ = max_{i=1…m} ∥(l_i, R)∥_p. The constant γ ensures the range of distances is [0, 1] (0 ≤ l_i ≤ 1, ∀i = 1…m). The DtR column in Fig. 2b is computed with the DtR metric set to the 2-norm. It is important to note that DtR is based on data representation, and it is not to be confused with the two distance metrics d(.) and D(.), which are crucial elements of individual fairness.
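To make the DtR construction concrete, the following is a minimal sketch that computes normalized DtR values from raw coordinates; the function and variable names are ours, not from the paper.

```python
import numpy as np

def dtr(points, reference, p=2):
    """Normalized Distance-to-Reference (DtR) values for a set of k-dim points.

    Computes the Minkowski p-norm distance from each point to the reference
    point R, then divides by gamma (the maximum distance) so that all DtR
    values fall in [0, 1].
    """
    points = np.asarray(points, dtype=float)        # shape (m, k)
    reference = np.asarray(reference, dtype=float)  # shape (k,)
    dist = np.sum(np.abs(points - reference) ** p, axis=1) ** (1.0 / p)
    gamma = dist.max()                              # normalization constant
    return dist / gamma

# Example: 2D locations, Euclidean (p = 2) DtR
locations = [[1.0, 2.0], [3.5, 0.5], [0.2, 0.9]]
print(dtr(locations, reference=[0.0, 0.0]))
```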
As part of our system model, we assume a classifier performing decision-making on top of the data (this model aligns well with current location-based applications, where some machine learning is involved in data processing). Given an input individual u, the classifier M(lu) returns a likelihood score for that individual based on her location (e.g., likelihood of receiving a location advertisement). Scores are real values in the range of 0 to 1.
The classifier output scores are shown in the last column of Fig. 2b. For example, user A is located at coordinate l_A; her distance from the reference point is normalized to the [0, 1] range (the DtR entry), and the generated score is M(l_A) = 0.8. Other features and attributes used in the model could be race, gender, education, etc. The two problems we seek to address to achieve individual fairness are defined as follows:
Problem 1. (Distance-based Spatial Fairness) For a given location dataset 𝓛 with the corresponding DtRs {l_1, …, l_m}, and a function M : 𝓛 → [0, 1], devise a mechanism to enforce the individual fairness (D, d)-Lipschitz constraints with respect to DtRs:

D(M(l_u), M(l_v)) ≤ d(l_u, l_v),  ∀ l_u, l_v ∈ 𝓛    (4)
Problem 2. (Zone-based Spatial Fairness) For a given location dataset 𝓛 = {l_1, l_2, …, l_m}, and a function M : 𝓛 → [0, 1], devise a mechanism to enforce the individual fairness (D, d)-Lipschitz constraints with respect to location coordinates:

D(M(l_u), M(l_v)) ≤ d(l_u, l_v),  ∀ l_u, l_v ∈ 𝓛    (5)
Fairness mechanisms must inherently alter the output likelihood scores in order to achieve the fairness requirement. Hence, there is a cost for such an operation in terms of utility loss. Since we expect the output of a fairness mechanism to be used for a learning task, we choose as utility metric fitting error, a widely accepted ML metric for output scores, formally presented in Definition 3.
Definition 3. (Utility). Let 𝓑 : M → M′ be a mechanism that maps every likelihood score in M to a likelihood score in M′, given that M, M′ : 𝓛 → Δ(𝓐). The fitting error (utility loss) of 𝓑 is:

E(𝓑) = √( (1/m) Σ_{i=1}^{m} (M(l_i) − M′(l_i))² )    (6)

As an example, suppose that every output score is mapped to the constant 0.5, i.e., M′(l_i) = 0.5, ∀i. The utility loss can then be calculated as E(𝓑) = √( (1/m) Σ_{i=1}^{m} (M(l_i) − 0.5)² ).
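The fitting error can be computed directly from the two score vectors; below is a minimal sketch assuming the RMS form of Definition 3 above (the function name and arrays are ours).

```python
import numpy as np

def fitting_error(scores, fair_scores):
    """Fitting error (Definition 3): RMS deviation between the original
    likelihood scores M and the transformed scores M'."""
    scores = np.asarray(scores)
    fair_scores = np.asarray(fair_scores)
    return np.sqrt(np.mean((scores - fair_scores) ** 2))

# Mapping every score to the constant 0.5, as in the example above
M = np.array([0.8, 0.3, 0.55, 0.9])
print(fitting_error(M, np.full_like(M, 0.5)))
```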
3. DISTANCE-BASED SPATIAL FAIRNESS
We introduce a spatial fairness mechanism for Problem 1. Each data point is augmented with a DtR column representing a user’s distance from the reference point. Therefore, the most natural similarity distance metric is 1-norm.
d(l_u, l_v) = ∥(l_u, l_v)∥_1 = |l_u − l_v|    (7)
Lemma 3.1. Given the classifier output space 𝓐 = {0, 1}, the statistical distance for every two individuals can be calculated as

D(M(l_u), M(l_v)) = |M(l_u) − M(l_v)|    (8)

Proof.

D(M(l_u), M(l_v)) = max_{a ∈ {0,1}} |M(l_u)(a) − M(l_v)(a)|    (9)

= max{ |M(l_u) − M(l_v)|, |(1 − M(l_u)) − (1 − M(l_v))| }    (10)

= |M(l_u) − M(l_v)|    (11)
3.1. Fair-Polynomials
Despite the strong fairness guarantees provided by individual fairness, applying a large number of hard constraints has limited its practicality. A common mechanism for individual fairness is to define an application-specific optimization problem, usually referred to as the vendor’s utility function, and solve it while imposing individual fairness hard constraints. Unfortunately, two major issues arise with such an approach when applied to location data: (i) In existing approaches, e.g., [27], a constrained optimization solver is used to alter input locations such that fairness requirements are met. The number of constraints grows quadratically with the number of data points, which makes their enforcement computationally prohibitive; (ii) The definition of the utility function in most scenarios is not straightforward, limiting the applicable use cases.
We devise the concept of fair polynomials, the intuition behind which is depicted in Figure 2c. A polynomial is efficiently fitted to the output scores of the classifier with a reasonably low fitting error. Fair polynomials no longer require enforcement of a large number of hard constraints: given a new data point, its corresponding fair value can be generated by evaluating the polynomial at that point.
Definition 4 (c-fair Polynomials). A single-variable degree-n polynomial P(x) is said to be c-fair if and only if for every two points x and y in its domain:

|P(x) − P(y)| ≤ c · |x − y|    (12)
Given that a fair polynomial providing good estimates of likelihood scores exists, by one-to-one mapping (fitting) of the likelihood scores to polynomials, individual distance-based fairness can be achieved for every two data points. The constant c ∈ [1, +∞) in c-fair polynomials controls the trade-off between utility and fairness. When c = 1, the optimal individual location fairness is achieved, but it is usually associated with a higher loss in utility. When the value of c grows larger, the fairness constraint is relaxed, leading to higher utility but lower fairness.
As the distance-based fairness problem only involves scalars, our focus is on single-variable degree-n polynomials. We will extend to multi-variable polynomials to accommodate multi-dimensional data points and address Problem 2 in Section 4.1. In the following, we answer three central questions: (i) what is a sufficient condition for a polynomial to be fair, (ii) how to derive the coefficients of the polynomial by imposing individual fairness constraints, and (iii) how to determine the degree of a fair polynomial.
3.2. Sufficient Condition for Fair Polynomials
There are several families of polynomials that preserve individual fairness over the defined distances for DtR. One such family is P(x) = cx^n/n, which is proven in Lemma 3.2 to be c-fair.

Lemma 3.2. The polynomial P(x) = cx^n/n is a c-fair polynomial for every two points x, y ∈ [−1, 1].

Proof. The proof follows by factoring x^n − y^n and applying the triangle inequality, considering that |x^i y^j| ≤ 1, ∀i, j:

|P(x) − P(y)| = (c/n)·|x^n − y^n| = (c/n)·|x − y|·|Σ_{i=0}^{n−1} x^i y^{n−1−i}|    (13)

≤ (c/n)·|x − y|·n = c·|x − y|    (14)
A fair polynomial must be flexible enough to reduce the error once the likelihood scores are fitted to the polynomial, and not every fair family of polynomials is a viable option. Consider the generic degree n polynomial written as
P(x) = Σ_{i=0}^{n} a_i x^i    (15)
where ai are real numbers. In Theorem 1, we derive a sufficient condition for polynomials of order n to preserve individual fairness.
Theorem 1. A sufficient condition for a single-variable degree-n polynomial P(x) = Σ_{i=0}^{n} a_i x^i, given that |x| ≤ 1, to be c-fair is to have:

Σ_{i=1}^{n} i·|a_i| ≤ c    (16)
Proof. Following the definition of individual location fairness:
|P(x) − P(y)| = |Σ_{i=1}^{n} a_i (x^i − y^i)| ≤ Σ_{i=1}^{n} |a_i|·|x^i − y^i|    (17)

|x^i − y^i| = |x − y|·|Σ_{j=0}^{i−1} x^j y^{i−1−j}| ≤ i·|x − y|    (18)

|P(x) − P(y)| ≤ ( Σ_{i=1}^{n} i·|a_i| ) · |x − y| ≤ c·|x − y|    (19)
The above inequality is true based on Jensen’s inequality (and also, extended triangle inequality) as well as applying the result from Lemma 3.2. Given that the inequality in Eq. (16) is satisfied, the polynomial is proven to be c-fair based on the definition.
The theorem indicates that if likelihood scores generated by the model are fitted to a polynomial whose coefficients are selected such that Σ_{i=1}^{n} i·|a_i| ≤ c, then c-fairness is guaranteed for all data entries. The sufficient condition in Theorem 1 can be used directly to learn c-fair polynomials, but the non-linearity in the constraint can result in higher computational complexity, as the individual coefficients are unbounded. Theorem 2 addresses this problem by deriving linear constraints over the coefficients.
Theorem 2. A sufficient condition for a single-variable n-th degree polynomial to be c-fair is to have:

|a_i| ≤ c/(n·i),  ∀i = 1 … n    (20)
Proof. The bound b_i on each |a_i| value must allow for the maximum degree of freedom while fitting the likelihood scores. Therefore, the condition can be written as an optimization problem:

max_{b_1,…,b_n} Σ_{i=1}^{n} log b_i   s.t.   Σ_{i=1}^{n} i·b_i ≤ c    (21)

Writing the Lagrangian and applying the stationarity condition of Karush–Kuhn–Tucker (KKT) [15],

∂/∂b_i [ Σ_{j=1}^{n} log b_j − λ( Σ_{j=1}^{n} j·b_j − c ) ] = 1/b_i − λ·i = 0    (22)

b_i = 1/(λ·i)    (23)

λ can be derived from complementary slackness, Σ_{i=1}^{n} i·b_i = c, to be

λ = n/c    (24)

Therefore, bounds on the coefficients are given as

|a_i| ≤ b_i = c/(n·i)    (25)
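As an illustrative sanity check (not part of the mechanism), one can sample coefficients within the Theorem 2 bounds and verify numerically that the c-fairness inequality of Definition 4 holds on random point pairs; the values and names below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 5, 2.0

# Draw coefficients within the Theorem 2 bounds: |a_i| <= c / (n * i)
i = np.arange(1, n + 1)
a = np.r_[rng.uniform(-1, 1), rng.uniform(-1, 1, size=n) * c / (n * i)]

P = np.polynomial.Polynomial(a)            # a_0 + a_1 x + ... + a_n x^n
x, y = rng.uniform(-1, 1, size=(2, 100000))
viol = np.abs(P(x) - P(y)) > c * np.abs(x - y) + 1e-12
print("violations:", viol.sum())           # expected: 0, since sum_i i*|a_i| <= c
```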
3.3. Derivation of Fair Polynomials
We employ a simple ML model to compute polynomial coefficients, where each location distance represents a training sample used to fit the likelihood scores to a polynomial. The training set can be assembled by choosing at random a number of locations from the same data domain as the application (e.g., residential locations within a city). In cases where the target user population is already known (e.g., the coordinates of customers for a store that is using location-based advertising), this user set can be directly used for training. In the following, to simplify notation, we assume the latter. For a given training input li, the polynomial output is derived as
P(l_i) = Σ_{j=0}^{n} a_j · l_i^j    (26)
We denote the (Vandermonde) matrix of all training examples as

X = [ [1, l_1, l_1², …, l_1^n]; [1, l_2, l_2², …, l_2^n]; …; [1, l_m, l_m², …, l_m^n] ]

Recall that m is the number of training examples, and the variables that we learn are the a_i’s, which define the fair polynomial fitted to the data. The vector of coefficients can be written as

a = [a_0, a_1, …, a_n]^T    (27)

and the likelihood scores are vectorized as

M = [M(l_1), M(l_2), …, M(l_m)]^T    (28)
The convex optimization problem to learn a is formulated as:
min_a ∥Xa − M∥₂²   s.t.   |a_i| ≤ c/(n·i), i = 1 … n    (29)
This is equivalent to a least-squares problem with linear constraints, and can be solved efficiently with algorithms such as Trust Region Reflective [11] and Bounded-Variable Least-Squares [28], with complexity linear in the order of the polynomial n. In contrast, existing work [27] requires enforcing O(m²) hard constraints (m ≫ n).
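A minimal sketch of this fitting step using SciPy’s bounded least-squares solver (the experiments in Section 6 rely on the same Trust Region Reflective implementation); the function name and defaults are ours, and the coefficient bounds follow Theorem 2.

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_fair_polynomial(dtr_values, scores, n, c):
    """Fit a c-fair degree-n polynomial to likelihood scores.

    dtr_values: normalized DtR scalars in [0, 1], shape (m,)
    scores:     classifier likelihood scores in [0, 1], shape (m,)
    Returns the coefficient vector a = (a_0, ..., a_n).
    """
    X = np.vander(dtr_values, N=n + 1, increasing=True)  # X[u, i] = l_u^i
    # Theorem 2 bounds: |a_i| <= c / (n * i) for i >= 1; a_0 is unconstrained.
    hi = np.r_[np.inf, c / (n * np.arange(1, n + 1))]
    res = lsq_linear(X, scores, bounds=(-hi, hi), method='trf')
    return res.x

# Usage: evaluate the fair score of a new individual at DtR = 0.37
a = fit_fair_polynomial(np.random.rand(500), np.random.rand(500), n=10, c=50)
fair_score = np.clip(np.polynomial.polynomial.polyval(0.37, a), 0.0, 1.0)
```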
The selection of the polynomial degree n can be conducted based on a trial-and-error methodology. The optimal degree is the one that results in the minimum variance of the error between likelihood scores and their corresponding values on the polynomial. Formally, let e_i denote the error between M(l_i) and P(l_i), i.e.,

e_i = M(l_i) − P(l_i)    (30)

Then, the value of n ≥ 1 is selected such that

n* = argmin_{n ≥ 1} Var({e_1, …, e_m})    (31)
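A simple trial-and-error loop over candidate degrees, reusing the hypothetical fit_fair_polynomial helper sketched above:

```python
import numpy as np

def select_degree(dtr_values, scores, c, max_degree=20):
    """Pick the degree n minimizing Var(e_i), with e_i = M(l_i) - P(l_i)."""
    best_n, best_var = None, np.inf
    for n in range(1, max_degree + 1):
        a = fit_fair_polynomial(dtr_values, scores, n, c)
        e = scores - np.polynomial.polynomial.polyval(dtr_values, a)
        if np.var(e) < best_var:
            best_n, best_var = n, np.var(e)
    return best_n
```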
4. ZONE-BASED SPATIAL FAIRNESS
Revisiting the example in Figure 2, suppose that two users A and B both apply for a home improvement loan, and despite living in close proximity, one is categorized in an underdeveloped area and the other in a developed region due to geographic segmentation. A bank applies a classifier to decide whether an applicant should be granted a loan. The applicant whose home is in the underdeveloped category might be disadvantaged, as the location category can significantly impact the output of the classifier. Individual fairness argues that if two users are located close to each other, their output likelihood scores should not differ significantly.
In the distance-based fairness case, a single variable c-fair polynomial can fit output scores due to scalar distances. For multidimensional data points, Definition 4 is no longer directly applicable. To address this problem, we extend the definition to multivariate polynomials to achieve individual fairness for higher dimensional data points. The number of variables involved in fair polynomials is equal to the dimensionality of data points (k).
Three key variables are involved in finding an efficient family of fair polynomials that can fit the output likelihood scores with low utility loss: (i) dimensionality of data k; (ii) the distance metric d(.) and (iii) fair polynomial degree n. The individual fairness problem can be characterized with respect to these criteria as follows:
One-dimensional data representation (scalars), 1-norm distance, flexible order polynomial. This corresponds to the distance-based fairness case.
2-Dimensional data representation, 2-norm distances; order 1 polynomial. This is the most common scenario for locations where the attribute columns include 2D coordinates, and the fairness must be achieved with respect to Euclidean distance between individuals.
k-Dimensional data representation, 2-norm distances; order 1 polynomial.
k-Dimensional data representation, p-norm distances; order 1 polynomial.
k-Dimensional data representation, p-norm distances; flexible order polynomial.
We formulate and derive the sufficiency condition to guarantee individual fairness for each mentioned scenario. The optimization problem in Eq. (29) is formulated with the derived constraints. We omit the vectorization process for conciseness. For several of the proofs used in this section, we make use of Generalized Titu’s Lemma provided in Lemma 4.1.
Lemma 4.1 (Generalized Titu’s Lemma). Let m be an integer greater than or equal to 2, p ≥ 1 a real number, and x_i positive real numbers. Then,

Σ_{i=1}^{m} x_i^p ≥ ( Σ_{i=1}^{m} x_i )^p / m^{p−1}    (32)

Proof. The proof is given in Appendix A.
4.1. 2-Norm, 2-Dimensional Data, Order-1 Polynomial
For higher-dimensional data, the most common scenario occurs when data points are in 2D and the order of the polynomial is one. In practice, data points represent coordinates of locations on the map. Consider two locations l_1 = (x_1, x_2) and l_2 = (x′_1, x′_2) in ℝ², where x_1 and x′_1 are the x-axis coordinates, while x_2 and x′_2 denote the y-axis coordinates. To achieve individual fairness with respect to locations, the hard Lipschitz constraints dictate that:

D(M(l_1), M(l_2)) ≤ d(l_1, l_2)    (33)

The distance between score distributions, i.e., D(.), is calculated as before based on Lemma 3.1 (Eq. (8)), and the distance between locations is the 2-norm of the data points (Euclidean distance), calculated as:

d(l_1, l_2) = ∥(l_1, l_2)∥₂ = √( (x_1 − x′_1)² + (x_2 − x′_2)² )    (34)
We start by showing how a fair-polynomial can be derived for the Euclidean similarity distance. Then, we relax the assumptions and generalize the approach for arbitrary distance norms as well as n-dimensional data points. Recall that as location data are stored in 2D, the fair polynomial consists of two variables. The generalized definition of fair polynomials for order n polynomials and k dimensional data is provided in Definition 5.
Definition 5 (Generalized c-Fair Polynomial). The polynomial P(x_1, …, x_k) with real coefficients is c-fair iff for every two points x = (x_1, x_2, …, x_k) and y = (y_1, y_2, …, y_k) in its domain

|P(x) − P(y)| ≤ c · d(x, y)    (35)

In the case of 2-dimensional locations and Euclidean distance, fair polynomials imply that for every two locations l_1 = (x_1, x_2) and l_2 = (x′_1, x′_2), we must have

|P(x_1, x_2) − P(x′_1, x′_2)| ≤ c · √( (x_1 − x′_1)² + (x_2 − x′_2)² )    (36)

where the polynomial is denoted by

P(x_1, x_2) = a_0 + a_1·x_1 + a_2·x_2    (37)
The goal is to learn the coefficients a_i such that the polynomial P(.) models the output scores M(.) and preserves fairness with respect to Euclidean distance. Theorem 3 provides the sufficiency condition for a two-variable, order-one polynomial to be fair.
Theorem 3. A sufficient condition for a 2-variable first-degree polynomial P(x_1, x_2) = a_0 + a_1x_1 + a_2x_2, defined over the 2-norm similarity distance, to be c-fair is to have:

|a_1| ≤ c/√2,  |a_2| ≤ c/√2    (38)
Proof. On the one hand, based on Lemma 4.1 (with p = 2 and m = 2), a lower bound for Euclidean distances can be written as

(x_1 − x′_1)² + (x_2 − x′_2)² ≥ ( |x_1 − x′_1| + |x_2 − x′_2| )² / 2    (39)

√( (x_1 − x′_1)² + (x_2 − x′_2)² ) ≥ ( |x_1 − x′_1| + |x_2 − x′_2| ) / √2    (40)

On the other hand, for the polynomial one can write

|P(l_1) − P(l_2)| = |a_1(x_1 − x′_1) + a_2(x_2 − x′_2)|    (41)

≤ |a_1|·|x_1 − x′_1| + |a_2|·|x_2 − x′_2|    (42)

By combining the two, the sufficiency condition in Equation (38) can be derived from the following inequality:

|a_1|·|x_1 − x′_1| + |a_2|·|x_2 − x′_2| ≤ (c/√2)·( |x_1 − x′_1| + |x_2 − x′_2| ) ≤ c · d(l_1, l_2)    (43)
The above theorem indicates that if the coefficients of polynomials fitted to data are chosen such that |a_1|, |a_2| ≤ c/√2, fairness is guaranteed for every two locations in the domain. The sufficiency condition for first-degree polynomials is generalized for k-dimensional data points in ℝ^k in Theorem 4 (the number of variables in the polynomial is equal to the number of dimensions).
Theorem 4. A sufficient condition for a k-variable first-degree polynomial P(x_1, …, x_k) = a_0 + Σ_{i=1}^{k} a_i x_i, defined over the 2-norm similarity distance, to be c-fair is:

|a_i| ≤ c/√k,  ∀i = 1 … k    (44)

Proof. See the proof of Theorem 7 in Appendix A.
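A minimal sketch of the corresponding fitting step under the Theorem 4 bounds (k = 2 recovers Theorem 3); the function and variable names are ours.

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_fair_linear(coords, scores, c):
    """Fit P(x) = a_0 + sum_i a_i x_i with |a_i| <= c / sqrt(k)
    (Theorems 3 and 4), for 2-norm individual fairness."""
    coords = np.asarray(coords)                 # shape (m, k)
    m, k = coords.shape
    X = np.column_stack([np.ones(m), coords])   # [1, x_1, ..., x_k] per row
    hi = np.r_[np.inf, np.full(k, c / np.sqrt(k))]
    return lsq_linear(X, scores, bounds=(-hi, hi), method='trf').x
```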
4.2. p-Norm, k-Dimensional Data, Order-1 Polynomial
We relax the similarity metric to an arbitrary p-norm distance, calculated for data points l_i = (x_1, …, x_k) and l_j = (x′_1, …, x′_k) as

d(l_i, l_j) = ∥(l_i, l_j)∥_p = ( Σ_{t=1}^{k} |x_t − x′_t|^p )^{1/p}    (45)
Theorem 5. A sufficient condition for a k-variable first-degree polynomial, defined over the p-norm similarity distance, to be c-fair is:

|a_i| ≤ c / k^{(p−1)/p},  ∀i = 1 … k    (46)
Proof. Based on the Generalized Titu’s Lemma, we have on the one hand a lower bound for p-norm distances:

( Σ_{t=1}^{k} |x_t − x′_t|^p )^{1/p} ≥ (1/k^{(p−1)/p}) · Σ_{t=1}^{k} |x_t − x′_t|    (47)

On the other hand, for the polynomial one can write

|P(l_i) − P(l_j)| = |Σ_{t=1}^{k} a_t (x_t − x′_t)|    (48)

≤ Σ_{t=1}^{k} |a_t|·|x_t − x′_t| ≤ max_t |a_t| · Σ_{t=1}^{k} |x_t − x′_t|    (49)

Combining the two, we obtain:

max_t |a_t| · Σ_{t=1}^{k} |x_t − x′_t| ≤ (c/k^{(p−1)/p}) · Σ_{t=1}^{k} |x_t − x′_t| ≤ c · d(l_i, l_j)    (50)

The inequality is satisfied when |a_t| ≤ c/k^{(p−1)/p}, ∀t = 1 … k.
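As with Theorem 2, the bound is easy to verify numerically; the following illustrative snippet (values and names are ours) samples coefficients within the Theorem 5 bound and checks the fairness inequality on random point pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
k, p, c = 4, 3, 1.5

# First-degree coefficients scaled so max |a_i| = c / k**((p-1)/p) (Theorem 5)
a = rng.uniform(-1, 1, size=k)
a *= (c / k ** ((p - 1) / p)) / np.abs(a).max()

x, y = rng.uniform(-1, 1, size=(2, 100000, k))
lhs = np.abs((x - y) @ a)                           # |P(x) - P(y)|; a_0 cancels
rhs = c * np.sum(np.abs(x - y) ** p, axis=1) ** (1 / p)
print("violations:", (lhs > rhs + 1e-12).sum())     # expected: 0
```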
4.3. p-Norm, k-Dimensional Data, Order-n Polynomial
So far, the sufficiency condition for c-fair polynomials was derived for arbitrary norms in k-dimensional space based on order 1 polynomials. Moreover, for distance-based fairness, c-fair polynomials were developed for 1-dimensional distance using arbitrary degree n polynomial. This subsection provides the theoretical background for the generalized scenario in which the location data are in k dimensions with the norm set to p, and degree n polynomials.
Although by increasing the degree of polynomials, a better fit to likelihood scores can be achieved, the existence of monomials in which multiple variables are involved leads to complexity in the derivation of sufficiency conditions. To address this, we assume that the monomials in the multivariable polynomial consist of only a single variable. Making such an assumption comes with the cost of utility loss; however, it greatly reduces the complexity of the generic case. We assume that the degree n polynomial is expressed as the summation of k univariate polynomials.
P(x_1, …, x_k) = Σ_{i=1}^{k} P_i(x_i)    (51)

where P_i(x_i) = Σ_{j=0}^{n} a_{i,j}·x_i^j is a degree-n univariate polynomial whose input is x_i, the i-th variable in the original polynomial. The assumption removes monomials with multiple variables, such as x_1·x_2², and simplifies the derivation of the location fairness sufficiency conditions provided in Theorem 6.
Theorem 6. A sufficient condition for a k-variable n-th degree polynomial, defined over the p-norm similarity distance, to be c-fair is to have:

Σ_{j=1}^{n} j·|a_{i,j}| ≤ c / k^{(p−1)/p},  ∀i = 1 … k    (52)
Proof. We write the polynomial in its component form shown in Equation (51). An upper bound for each component P_i(x_i) can be derived as

|P_i(x_i) − P_i(y_i)| = |Σ_{j=1}^{n} a_{i,j}(x_i^j − y_i^j)| ≤ Σ_{j=1}^{n} |a_{i,j}|·|x_i^j − y_i^j|    (53)

≤ ( Σ_{j=1}^{n} j·|a_{i,j}| ) · |x_i − y_i|    (54)

The upper bound for the sub-terms, for all i = 1 … k, is derived as

|x_i^j − y_i^j| ≤ j·|x_i − y_i|    (55)

The above inequality follows from Eq. (18), setting |x_i| ≤ 1. We also use the lower bound derived in Eq. (47):

d(x, y) ≥ (1/k^{(p−1)/p}) · Σ_{i=1}^{k} |x_i − y_i|    (56)

Putting the derived upper bound and lower bound together, the following inequality is satisfied, and c-fairness is guaranteed:

|P(x) − P(y)| ≤ Σ_{i=1}^{k} ( Σ_{j=1}^{n} j·|a_{i,j}| ) · |x_i − y_i| ≤ (c/k^{(p−1)/p}) · Σ_{i=1}^{k} |x_i − y_i|    (57)

≤ c · d(x, y)    (58)

By applying the method used for DtR (Theorem 2), the bounds are linearized to

|a_{i,j}| ≤ c / (n·j·k^{(p−1)/p}),  ∀i = 1 … k, j = 1 … n    (59)
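A sketch of the resulting zone-based fitting step, combining the separable form in Eq. (51) with the linearized bounds in Eq. (59); the function and variable names are ours, and coordinates are assumed pre-normalized to [−1, 1].

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_fair_separable(coords, scores, n, c, p=2):
    """Fit P(x) = a_0 + sum_i P_i(x_i), each P_i of degree n, under the
    linearized bounds |a_{i,j}| <= c / (n * j * k**((p-1)/p))."""
    coords = np.asarray(coords)        # shape (m, k), entries in [-1, 1]
    m, k = coords.shape
    # Design matrix: constant term, then x_i^j for i = 1..k, j = 1..n
    cols = [np.ones(m)] + [coords[:, i] ** j
                           for i in range(k) for j in range(1, n + 1)]
    X = np.column_stack(cols)
    j = np.tile(np.arange(1, n + 1), k)
    hi = np.r_[np.inf, c / (n * j * k ** ((p - 1) / p))]
    res = lsq_linear(X, scores, bounds=(-hi, hi), method='trf')
    return np.clip(X @ res.x, 0.0, 1.0)   # suppress scores outside [0, 1]
```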
It is worth noting that the generated scores by fair polynomials can result in values greater than one or less than zero. In such scenarios, the values are suppressed to one and zero, respectively. It is straightforward to prove that the suppression process does not violate the individual fairness constraints and leads to higher utility.
5. RELATED WORK
Fairness notions can be grouped into the two broad categories of Group Fairness and Individual Fairness [12]. In group fairness, a protected attribute of the dataset, such as race or gender, which is considered to be critical in decision-making outcomes, partitions individuals into groups. The ML model used for a decision-making task on the dataset is considered to be fair if it achieves some statistical measure across groups. A few of the key statistical measures include statistical parity [21][12], equalized odds [17], treatment equality [7], and test fairness [10]. Individual fairness aims to give similar predictions to similar individuals, focusing on fairness for individuals as opposed to groups. Group fairness notions are generally weaker than individual fairness notions [19]. Despite the higher fairness guarantees provided by individual fairness and the fragility of group fairness notions, group fairness notions are widely studied in the literature due to their easier enforcement [23]. Only a handful of approaches exist in the literature to achieve fairness in the geospatial domain.
The current state-of-the-art approach to enforce individual location fairness is to define a linear loss function once the likelihood scores are generated, and to solve an optimization under individual fairness Lipschitz constraints. Let an instance I of the problem consist of a similarity metric d : 𝓛 × 𝓛 → ℝ and a loss function L : 𝓛 × 𝓐 → ℝ; the optimization problem is defined as

min_M  Σ_{l_u ∈ 𝓛} E_{a ∼ M(l_u)} L(l_u, a)    (60)

s.t.  D(M(l_u), M(l_v)) ≤ d(l_u, l_v),  ∀ l_u, l_v ∈ 𝓛    (61)

M(l_u) ∈ Δ(𝓐),  ∀ l_u ∈ 𝓛    (62)
One can see that the number of constraints in this mechanism grows quadratically with the number of individuals, imposing a large computational burden. The authors in [27] formulate the loss function for location-based advertisements in social media: locations visited on the map are encoded as binary strings, and a classifier is used to predict whether a user should receive a targeted advertisement. Not directly related to locations, individual fairness is also applied to general-purpose advertising and auctions in [13]. Another application for which the loss function has been defined is individual fairness in ranking and recommendation systems [26]. In ranking systems, the amount of unfairness with respect to individuals is measured after ranking, and a loss function aims to reorder the ranking such that the amount of individual unfairness is minimized [8].
Several attempts have also been made to apply the individual fairness notion for clustering datapoints in Cartesian space. The notion in [20] defines clustering conducted for a point in space as fair if the average distance to the points in its own cluster is not greater than the average distance to the points in any other cluster. The authors in [22] focus on defining individual fairness for k-median and k-means algorithms. Clustering is defined to be individually fair if every point expects to have a cluster center within a particular radius. To the best of our knowledge, no work has directly defined individual fairness with respect to locations.
6. EXPERIMENTAL EVALUATION
We evaluate our proposed spatial data fairness mechanisms in the two studied scenarios. For the distance-based case, we use a dataset of taxi fares from New York City; for the zone-based case, we consider budget allocation to police departments according to the Chicago crime occurrence dataset. We ran experiments on a 3.40 GHz Intel Core i7 processor with 8 GB RAM. The code is implemented in Python and uses the Trust Region Reflective least-squares implementation from SciPy [4] (the maximum number of iterations for convergence is set to 300, and the default tolerance threshold of 1e-2 is used to stop the optimization iterations).
6.1. Distance-based Spatial Fairness
We sampled 120,000 records from the NYC taxi dataset [1] providing over 55 million trips and their associated fares. We deployed an Artificial Neural Network (ANN) to assess the likelihood of taxi fares being fair in the system. Specifically, we seek to capture whether there is bias in the setting of fares based on the specific neighborhoods where the trip starts/ends. Ideally, the trip distance should be the only factor determining price (we carefully pre-process the dataset such that trips are clustered according to the time of week/day, such that differences in fare due to demand status and traffic causes are eliminated). Our goal is to first understand the percentage of records for which the individual fairness constraints do not hold with respect to traveled distances. Then, we analyze the performance of the proposed c-fair mechanism.
ML Model for Fairness Characterization.
Our ANN model consists of two hidden layers with 200 and 100 neurons and an output layer with two neurons representing the binary classification task. The activation function used in the model is ReLU, the dropout probability for each layer is set to 0.4, and cross-entropy is used as the loss function. The accuracy of the model is 92%. The input features include pick-up date and time (categorical hour, AM or PM, weekday, EDT date), pick-up longitude, pick-up latitude, drop-off longitude, drop-off latitude, passenger count, and distance traveled in kilometers. The ride fares have a mean of 10 dollars with a standard deviation of 7 dollars, and the average traveled distance is 3.31 km with a standard deviation of 3.2 km. For model training purposes, we split the data into training, validation and test datasets with 96,000, 12,000, and 12,000 records, respectively. To generate the ground truth for the training dataset, we use price per kilometer traveled as the indicator of how fair the associated fares are. For every hour of the day, the average price per kilometer is calculated as the hard threshold between fair and unfair trips. The trips above the threshold are classified as unfair, and those below the threshold are assumed to be fair. This results in a total of 21,928 trips being categorized as unfair.
Once the ANN model is trained, we predict the likelihood of each trip fare being fair on the test dataset. For every two records, the individual fairness constraint is evaluated to reveal whether fares are fair with respect to travel distances. In the absence of any deployed fairness mechanism, 32% of the constraints are not satisfied, hence those trips are unfair. The 32% level is marked with a red horizontal line in the experimental plots, to emphasize the fairness improvement of the considered approaches.
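The percentage of violated constraints reported above can be computed with a vectorized pairwise check; below is a minimal sketch for the DtR case, assuming DtRs and scores are NumPy arrays (quadratic in m, so large datasets may require subsampling or chunking).

```python
import numpy as np

def pct_unfair(dtr_values, scores):
    """Fraction of pairs (u, v) violating |M_u - M_v| <= |l_u - l_v|."""
    d_scores = np.abs(scores[:, None] - scores[None, :])
    d_inputs = np.abs(dtr_values[:, None] - dtr_values[None, :])
    viol = d_scores > d_inputs + 1e-12   # tolerance for floating-point noise
    m = len(scores)
    return viol.sum() / (m * (m - 1))    # ordered pairs; diagonal never violates
```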
Next, we apply the proposed c-fair mechanism to achieve distance-based fairness. Our experiments evaluate the performance based on four key metrics: the percentage of unfairness (constraints not satisfied), the degree of the c-fair polynomial (n), the parameter c, and the root mean square (RMS) of the fitting error to the likelihood scores.
Percentage of Unfairness.
Fig. 3a shows the impact of increasing c on reducing unfairness when the degree of the polynomial is 5, 10, 15, and 20. As expected, lower values of c result in higher fairness in the system, with maximum fairness achieved when c is equal to one. For the maximum fairness scenario, the percentage of unfairness is zero, meaning that all individual fairness constraints are satisfied for every two records in the dataset. By increasing the value of c, fair polynomials have more room to maneuver when fitting the likelihood scores, but this comes at the cost of higher unfairness. Such behavior demonstrates the utility-fairness trade-off captured by the constant c. Increasing the polynomial degree improves the percentage of unfairness until the point where the polynomial overfits the likelihood scores and performance deteriorates. Fig. 3b shows more clearly the impact of increasing the value of n on unfairness. Lower c values result in a lower percentage of unfairness for all polynomial degrees.
Figure 3: Distance-based mechanism, New York taxi dataset.
Fitting Error.
Figs. 3c and 3d demonstrate the amount of utility loss in data due to fitting likelihood scores to a c-fair polynomial. Two key trends can be observed from the figures. First, increasing the value of c lowers the fitting error. This is expected, as higher c allows more flexibility for selecting coefficients and better fitting performance. Second, increasing the value of n for the same value of c raises the fitting error. To understand this behavior, one can intuitively look at the problem as allocating the same amount of budget among several buckets representing coefficients. Although increasing the degrees of freedom provides better fitting performance as higher degree monomials exist, it further restricts the budget for each coefficient. Thus, the lower degree monomials, which have a more significant impact on the performance, are allocated a lower amount of budget, negatively affecting the performance.
Computational Complexity.
We measure the computational overhead of c-fair polynomials in terms of time complexity, number of iterations before optimization convergence, and final optimization cost (the final optimization cost represents the value of the Scipy cost function upon reaching the solution [4]). Fig. 5 shows the results. In each graph, the overhead is shown for four values of c = 25, 50, 75, 100 plotted for varying polynomial degrees. Overall, the time complexity is in the order of milliseconds and does not limit the practical deployment of c-fair polynomials. The second graph illustrates the number of iterations before reaching the optimal point. The optimization process stops either by reaching the maximum number of iterations (300) or when the relative change in optimization cost remains below the tolerance threshold (1e-2). As explained previously, the slight oscillation in the performance is due to selecting a random start point for the optimization.
Figure 5: Computation overhead analysis on the New York taxi dataset.
Note that increasing the degree of the polynomial n results in higher computational overhead. This is expected, as more degrees of freedom (coefficients) lead to more effort for finding the optimal point. Another consistent behavior across all three figures is that increasing c on average reduces the computational cost and facilitates reaching the near-optimal point. The trend is most apparent in the final optimization cost figure, in which it can be clearly seen that a higher c value leads to a lower cost.
6.2. Zone-based Spatial Fairness
For this scenario, we consider the case of budget allocation to different areas of Chicago, USA, based on the measured crime rates. We use the dataset provided by the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system [3], consisting of reported crime incidents in Chicago. A 1024 × 1024 grid is overlaid on top of the Chicago map, and the goal is to fairly allocate the budget such that neighborhoods that are close to each other are treated similarly. We have selected seven major crime categories of sexual assault, homicide, kidnapping, sex offense, motor vehicle theft, criminal damage, and narcotics among the reported crimes, and trained a logistic regression model to infer the likelihood of crime occurrence in each cell. The training dataset includes the crime data from January to November 2015, and the December data is chosen as the test dataset. The accuracy of the model is 94% and its output is a set of likelihoods indicating the probability of crime occurrence. The budget allocated to each cell is proportional to the likelihood score derived by the classifier.
Once the likelihood scores are generated, they are used with the X and Y cell coordinates to achieve individual location fairness with the distance metric set to the 2-norm. In the absence of any fairness mechanism, 44.0% of the individual location fairness constraints are not satisfied.
To understand if the expansion to higher polynomial degrees is essential, we started our experiments by focusing on degree one c-fair polynomials and applying the results in Theorem 3. As expected, the fitting error was rather high, and the utility was insufficient. We also noticed that for degree one polynomials, the optimal solution is achieved even when the value of c is equal to one. Therefore, increasing c does not help with improving the fitting error. Thus, it is crucial to use higher degree polynomials for this purpose.
Next, we apply the optimization formulation derived in Theorem 6, the most general formula, allowing each dimension to contribute to the fit with a degree-n polynomial. Fig. 4 demonstrates the performance of c-fair polynomials for achieving individual location fairness on the crime dataset. The patterns are generally consistent with the distance-based case considered in the New York taxi fares experiment. Figs. 4a and 4b show the impact of c and n on the percentage of unfairness, and Figs. 4c and 4d show the performance with respect to utility. The red line is used as the reference point representing the percentage of unfairness in the original data, in the absence of fairness mechanisms.
Figure 4: Zone-based mechanism.
Increasing the value of c results in a lower degree of fairness and a lower fitting error once the degree of the polynomial reaches an acceptable level. This result further substantiates the fairness-utility trade-off in the system, also observed in the distance-based case. In a 2-dimensional space, using a degree of 10 with the above polynomial to model each dimension ensures that the scores are not under-fitted, preventing high fitting error values. In summary, the amount of fairness achieved with fair polynomials, even for reasonably low polynomial degrees such as 15 and values of c greater than 10, can be seen to be over 70%.
Fig. 6 shows the computation complexity of the proposed mechanism. The first point to notice is that a relatively high amount of time is required to achieve zone fairness compared to the distance-based setting. However, computational complexity is still low in absolute value, and not an obstacle for practical deployment, with sub-second execution time.
Figure 6: Computation overhead analysis on the Chicago crime dataset.
6.3. Comparison with benchmarks
We derive a threshold-based benchmark from the fairness mechanism in [12] and the binary evaluation approach in [14]. Given constant threshold t, let polynomial P(li) = t, and let parameter α define by how much scores can be altered (e.g., if α = 0.1, each score can be altered by at most ±0.1). For DtR, the benchmark cycles through each score Mi and pushes it towards polynomial P(li) with an allowance of α:
M′(l_i) = M(l_i) + sign(P(l_i) − M(l_i)) · min(α, |P(l_i) − M(l_i)|)    (63)

The sign operation ensures the direction of change is favorable to the benchmark, and the min ensures that the change in score does not overshoot P(l_i). When α is zero, no utility loss exists, and the unfairness percentage is the same as for the original data. As α grows, the flexibility margin results in more constraints being satisfied, until absolute fairness is achieved. For the zone-based fairness case, we use the constant polynomial P(x_1, x_2) = t, which is easily extensible to higher dimensions.
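A minimal sketch of this benchmark for the DtR case (the array and parameter names are ours):

```python
import numpy as np

def threshold_benchmark(scores, t=0.5, alpha=0.1):
    """Push each score toward the constant polynomial P(l_i) = t,
    moving it by at most alpha and never overshooting t (Eq. 63)."""
    scores = np.asarray(scores, dtype=float)
    delta = t - scores
    return scores + np.sign(delta) * np.minimum(alpha, np.abs(delta))
```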
Figure 7 shows the utility-fairness trade-off obtained by the baseline for both the distance- and zone-based cases. Unfairness is monotonically decreasing as one allows a higher α alteration to the scores. We notice a plateau of desirable behavior concentrated in the range α ∈ [0.2, 0.6], where the utility-fairness trade-off is good (for α values below 0.2 unfairness is too high, whereas for values higher than 0.6 the fairness gain fades in the multi-variate case).
Figure 7: Comparison benchmark performance.
Next, we focus our attention on this desirable α range, and we plot the relative performance of the benchmark versus our proposed fair polynomials approach for various values of c and n. The plots in Figs. 8 and 9 show that our approach provides a superior trade-off compared to the benchmark. The benchmark outperforms our approach only for zone-based fairness when n = 20. In all other cases, fair polynomials provide either vastly improved utility or better fairness.
Figure 8: Distance-based mechanism benchmark comparison, New York taxi dataset.
Figure 9: Zone-based mechanism benchmark comparison, Chicago crime dataset.
7. CONCLUSION
We studied in depth the problem of individual fairness for location data, and we identified sources of location bias that can occur in practical settings. We formulated two distinct settings for spatial fairness: distance-based and zone-based fairness, and we devised specific techniques to achieve spatial fairness while preserving utility, with the help of a novel construction called fair polynomials. While our focus is on spatial fairness, fair polynomials have the potential to provide useful building blocks for fairness in other application domains. In future work, we plan to study more complex types of spatial queries. At the same time, we plan to study the effect of fairness mechanisms in conjunction with other constraints, such as privacy, e.g., devise mechanisms that can achieve both spatial fairness and location privacy.
Table 1: Summary of notations.
| Symbol | Description |
|---|---|
| 𝓛 = {l1, …, lm} | Set of datapoints in ℝ^k |
| li | Distance from li to reference point |
| m | Number of datapoints |
| ∥.∥p | p-norm distance |
| 𝓐 | Classification output domain |
| d(.) | Distance between datapoints |
| D(.) | Distance between distributions |
| M(li) | Likelihood score of location li |
ACKNOWLEDGMENTS
This research has been funded in part by NSF grants IIS-1910950, IIS-1909806, CNS-2125530, NIH grant R01LM014026, the USC Integrated Media Systems Center (IMSC), and unrestricted cash gifts from Microsoft Research and Google. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the sponsors, such as the NSF.
A. APPENDIX (PROOF OF LEMMAS)
Lemma A.1 (Generalized Titu’s Lemma). Let m be an integer greater than or equal to 2, p ≥ 1 a real number, and x_i positive real numbers. Then,

Σ_{i=1}^{m} x_i^p ≥ ( Σ_{i=1}^{m} x_i )^p / m^{p−1}    (64)

Proof. By Hölder’s inequality,

Σ_{i=1}^{m} x_i = Σ_{i=1}^{m} x_i · 1 ≤ ( Σ_{i=1}^{m} x_i^p )^{1/p} · ( Σ_{i=1}^{m} 1^{p/(p−1)} )^{(p−1)/p}    (65)

= ( Σ_{i=1}^{m} x_i^p )^{1/p} · m^{(p−1)/p}    (66)

Raising both sides to the power p,

( Σ_{i=1}^{m} x_i )^p ≤ m^{p−1} · Σ_{i=1}^{m} x_i^p    (67)

which rearranges to

Σ_{i=1}^{m} x_i^p ≥ ( Σ_{i=1}^{m} x_i )^p / m^{p−1}    (68)
Theorem 7 (Theorem 4 restated). A sufficient condition for a k-variable first-degree polynomial P(x_1, …, x_k) = a_0 + Σ_{i=1}^{k} a_i x_i with real coefficients, defined over the 2-norm (Euclidean) similarity distance, to be c-fair is to have:

|a_i| ≤ c/√k,  ∀i = 1 … k    (69)

Proof. For every two locations l_1 = (x_1, …, x_k) and l_2 = (x′_1, …, x′_k), a lower bound can be derived based on the Generalized Titu’s Lemma:

√( Σ_{i=1}^{k} (x_i − x′_i)² ) ≥ (1/√k) · Σ_{i=1}^{k} |x_i − x′_i|    (70)

On the other hand, the following inequality can be written for the polynomial:

|P(l_1) − P(l_2)| = |Σ_{i=1}^{k} a_i(x_i − x′_i)|    (71)

≤ Σ_{i=1}^{k} |a_i|·|x_i − x′_i|    (72)

By combining the two equations, the sufficiency condition can be derived from

Σ_{i=1}^{k} |a_i|·|x_i − x′_i| ≤ (c/√k) · Σ_{i=1}^{k} |x_i − x′_i| ≤ c · d(l_1, l_2)
B. APPENDIX (POLYNOMIAL VISUALIZATION)
Figure 10 presents the real-world example of Fig. 2 for the NYC taxi dataset. The figure shows a fair polynomial fitted to the output ML scores, shown as blue dots. The degree of the polynomial is 6, and the value of c is set to 100.
Figure 10:
An example of c-fair polynomial applied to the NYC taxi dataset.
Footnotes
PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/SinaShaham/c-Fair-Polynomials/tree/main
1. While our focus is on geospatial data, some of our results can be extended to other types of data with continuous domains.
2. The proposed approach can be extended to multiple reference points by using a composite cost metric, or an existing locality-sensitive hashing (LSH) approach that produces a single scalar distance value over multiple reference points.
Contributor Information
Sina Shaham, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA.
Gabriel Ghinita, College of Science and Engineering, Hamad Bin Khalifa University, Qatar Foundation, Doha, Qatar.
Cyrus Shahabi, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA.
REFERENCES
- [1] New York City taxi fare prediction. [n.d.]. https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data
- [2] Chicago Metropolitan Agency for Planning. 2013. Travel patterns in economically disconnected area clusters. https://www.cmap.illinois.gov/2050/maps/transit-eda
- [3] Chicago Crime Dataset. 2015. https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9
- [4] SciPy Documentation, Version 1.7.1 (lsq_linear). 2021. https://docs.scipy.org/doc/scipy/
- [5] Ricardo Baeza-Yates. 2018. Bias on the web. Commun. ACM 61, 6 (2018), 54–61.
- [6] Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2017. Fairness in machine learning. NIPS Tutorial 1 (2017), 2.
- [7] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2021. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research 50, 1 (2021), 3–44.
- [8] Asia J. Biega, Krishna P. Gummadi, and Gerhard Weikum. 2018. Equity of attention: Amortizing individual fairness in rankings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 405–414.
- [9] Enrico Cantoni. 2020. A precinct too far: Turnout and voting costs. American Economic Journal: Applied Economics 12, 1 (2020), 61–85.
- [10] Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 2 (2017), 153–163.
- [11] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. 2000. Trust Region Methods. SIAM.
- [12] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. 214–226.
- [13] Cynthia Dwork and Christina Ilvento. 2018. Individual fairness under composition. Proceedings of Fairness, Accountability, Transparency in Machine Learning (2018).
- [14] Alessandro Fabris, Stefano Messina, Gianmaria Silvello, and Gian Antonio Susto. 2022. Algorithmic fairness datasets: the story so far. arXiv preprint arXiv:2202.01711 (2022).
- [15] Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, and Mark Crowley. 2021. KKT conditions, first-order and second-order optimization, and distributed optimization: tutorial and survey. arXiv preprint arXiv:2110.01858 (2021).
- [16] Sara Hajian, Francesco Bonchi, and Carlos Castillo. 2016. Algorithmic bias: From discrimination discovery to fairness-aware data mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2125–2126.
- [17] Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29 (2016), 3315–3323.
- [18] Kamyoung Kim. 2020. A spatial optimization approach for simultaneously districting precincts and locating polling places. ISPRS International Journal of Geo-Information 9, 5 (2020), 301.
- [19] Michael P. Kim, Aleksandra Korolova, Guy N. Rothblum, and Gal Yona. 2019. Preference-informed fairness. arXiv preprint arXiv:1904.01793 (2019).
- [20] Matthäus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. 2020. A notion of individual fairness for clustering. arXiv preprint arXiv:2006.04960 (2020).
- [21] Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. arXiv preprint arXiv:1703.06856 (2017).
- [22] Sepideh Mahabadi and Ali Vakilian. 2020. Individual fairness for k-clustering. In International Conference on Machine Learning. PMLR, 6586–6596.
- [23] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
- [24] Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Emre Kıcıman. 2019. Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data 2 (2019), 13.
- [25] Akshat Pandey and Aylin Caliskan. 2021. Disparate impact of artificial intelligence bias in ridehailing economy's price discrimination algorithms. In AIES '21: AAAI/ACM Conference on AI, Ethics, and Society. ACM, 822–833.
- [26] Evaggelia Pitoura, Kostas Stefanidis, and Georgia Koutrika. 2021. Fairness in rankings and recommendations: An overview. arXiv preprint arXiv:2104.05994 (2021).
- [27] Christopher Riederer and Augustin Chaintreau. 2017. The price of fairness in location based advertising. (2017).
- [28] Philip B. Stark and Robert L. Parker. 1995. Bounded-variable least-squares: an algorithm and applications. Computational Statistics 10 (1995), 129–141.
- [29] Harini Suresh and John V. Guttag. 2019. A framework for understanding unintended consequences of machine learning. arXiv preprint arXiv:1901.10002 (2019).
