Author manuscript; available in PMC: 2016 Jan 1.
Published in final edited form as: Geogr Anal. 2014 Aug 26;47(1):50–72. doi: 10.1111/gean.12045

Predicting Regional Self-identification from Spatial Network Models

Zack W Almquist 1, Carter T Butts 2
PMCID: PMC4322384  NIHMSID: NIHMS614669  PMID: 25684791

Abstract

Social scientists characterize social life as a hierarchy of environments, from the micro level of an individual’s knowledge and perceptions to the macro level of large-scale social networks. In accordance with this typology, individuals are typically thought to reside in micro- and macro-level structures, composed of multifaceted relations (e.g., acquaintanceship, friendship, and kinship). This article analyzes the effects of social structure on micro outcomes through the case of regional identification. Self identification occurs in many different domains, one of which is regional; i.e., the identification of oneself with a locationally-associated group (e.g., a “New Yorker” or “Parisian”). Here, regional self-identification is posited to result from an influence process based on the location of an individual’s alters (e.g., friends, kin or coworkers), such that one tends to identify with regions in which many of his or her alters reside. The paper proceeds as follows. We begin with a discussion of the relevant social science literature on both social networks and identification. This is followed by a discussion of competing mechanisms for regional identification, motivated first by the social network literature and second by the social psychological and cognitive literature on decision making and heuristics. Next, the paper covers the data and methods employed to test the proposed mechanisms. Finally, the paper concludes with a discussion of its findings and their implications for the larger social science literature.

Introduction

Social scientists characterize social life as a hierarchy of environments, from the micro level of an individual’s knowledge and perceptions to the macro level of large-scale social networks. In accordance with this typology, individuals typically are thought to reside in micro- and macro-level structures, composed of multifaceted relations, e.g., acquaintanceship, friendship, and kinship (Mayhew and Levinger 1976). In this paper, we treat self identification as occurring when an individual chooses to associate him or herself with a given label (e.g., “Sam’s mother”). Self-identified groups occur when each of two or more individuals chooses to identify with a label (or category) that exists a priori as a result of a consensus (e.g., “I am black,” and/or “I am a mother”). Thus, in this context self-identified groups arise from micro-level processes of individual decision making.

While this article focuses on the particular case of regional self-identification, self-identification, more broadly, is of special concern to social scientists because it determines racial/ethnic, sexual, gender, class, and other identities (Howard 2000). Self-identified groups also are of particular interest to the subfields of social psychology, social boundaries, and gender relations (see Howard 2000; Jenkins 2000; Turner et al. 1987). Howard (2000) argues that these subfields view identity (self identification) as a product of modern society and as a core issue, especially when compared to societies with rigidly imposed identities. Specifically, this paper proposes that the basic underlying mechanism for self identification is a cognitive system, where an individual selects his or her identification from within a set of salient items (e.g., cities) and employs a heuristic, or set of rules, for choosing among those items.

The main hypothesis of this work, here dubbed the Social Network Hypothesis of Regional Self-Identification (SNH), is that individuals choose the region with which they identify based on the salience of the relations of the social networks in which they are embedded (e.g., friends, acquaintances, coworkers, kin; for a visualization of a spatial network see Figure 1). In other words, individuals choose to identify with the region in which they have the most alters (e.g., friends or kin; see Almquist 2012). We contrast this hypothesis with a series of alternatives that are motivated by arguably intuitive, salient components of modern life (e.g., maps, advertisements, schools, postal codes). For example, one might argue that the region that is most salient to an individual is the one that is most proximal, more so even than the one in which he or she has the most social relations.

Figure 1.


A simulated spatial Bernoulli network for San Francisco, CA. The simulation was performed using the procedure outlined in the Social Network Model (Tie Volume Model) section, with the Facebook SIF described there. The map is an orthogonal projection around the centroid point, in meters. Gray lines represent US Census block lines, dots represent individuals, and black lines represent social relations (e.g., friendship).

To date, social scientists have studied regional identification primarily in the context of national identification (see Gould and White 1986; Tan 2005), with a few studies of urban/rural identification (see Fischer 1982; Wirth 1938). More recently, developments in online data processing and management have allowed larger-scale and higher-quality geographic data collection by nonprofessionals, which the geography literature has dubbed volunteered geographic information (VGI) (Goodchild 2007). VGI data are detailed geographic data (e.g., latitude and longitude coordinates) collected by nonprofessionals employing modern geographic information software (GIS; e.g., Google Maps). One of the more famous of these collection efforts is the Common Census Internet Project (Baldwin 2010; Flanagin and Metzger 2008), the data source for this paper.

Background: Social Networks and Geography

Spatially embedded social networks have a long history in the geography literature (e.g., gravity models; Haynes and Fotheringham 1984; Phillips et al. 1976) and the social network literature (for a review, see Barabási and Frangos 2002; Butts 2002; Butts and Acton 2011). In the geography literature, a historical and recent revival of formal network models has taken place that builds on the graph theory, statistics, and machine learning literatures (Gahegan 2000; Gopal and Fischer 1996; Griffith 2011; Rogerson 1997; Tinkler 1972). Related extensions within this context include clever optimization and uses of point process models (Almquist and Butts 2012; Boots 1977; Okabe and Yamada 2001; Schneider 2005; Serra and ReVelle 1999; Shiode 2008; Yamada and Thill 2007) and the application and inclusion of network autocorrelation models in the geographic literature (Farber et al. 2009; Páez et al. 2008; Peeters and Thomas 2009; Townsley 2009). Possibly the longest running literature about spatially embedded networks is that for roads (e.g., Bentley et al. 2013; Black 1992; Hudson 1969; Morley and Thornes 1972; Okabe and Yamada 2001; Okabe et al. 1995; Osleeb and Ratick 1990; Peeters et al. 1998; Xie and Levinson 2009; Zemanian 1980). More recent developments in the network and geography literature include work on the problem of small worlds (e.g., Rogerson 1997; Xu and Sui 2009), originally introduced by Travers and Milgram (1969) and Milgram (1967), and later extended by Watts and Strogatz (1998). Other important examples of empirical spatial networks include those for cities (Neal 2012; Portugali et al. 1994; Taylor 2001), drainage networks (Werner 1972), and t-communities (Grannis 2009; Whalen et al. 2012), as well as the use of networks in cognitive models and spatial thinking problems (Mirchandani 1980; Morley and Thornes 1972; Smith et al. 1982).

Regional Identification as a Cognitive Process

The definition of self identification employed in this paper (i.e., individual y identifies with object x) is that of a behavior requiring an individual to match him or herself with a label that is drawn from a set of potential labels (or categories) that exist in his or her cultural repertoire (e.g., doctor or Asian). In this sense, a component of self identification exists that requires a decision from an actor, and which can be further described as a choice. This choice, at least at some level, must involve the act of information processing, if only for the actor to allocate him or herself to some default option (see Gigerenzer and Todd 1999; Hutchinson and Gigerenzer 2005).

In the case of regional identification, these assumptions imply that an individual has a mechanism for identifying the set of potential geographic categories (e.g., towns, cities, or other culturally recognized places) at the situationally relevant scale, and a way to choose an item from within a given set (e.g., Irvine) with which he or she identifies. This process can be seen in everyday life in a variety of contexts; for example, when an individual proclaims “I am a Persian,” or “I am a New Yorker.” Regional identification can therefore be characterized by the combination of (1) a choice set and (2) a heuristic. Much of the following discussion is dedicated to describing potential mechanisms that compete for the “best” (i.e., most predictively accurate) choice set and heuristic in a model of regional identification, including mechanisms involving social structures.

Scale and Regional Identification

As implied by the preceding discussion, an individual is potentially able to identify him or herself with a preferred geographical unit at multiple scales; each scale is defined by a culturally relevant set of geographical units (e.g., neighborhoods or local communities, towns or cities, states or provinces, nations), which constitutes the choice set for an identification decision. (Meaningful scales for such identification are themselves culturally defined.) Thus, we may envision the regional identification process as producing for each individual a “cone” of valid identities $x_0, x_1, \ldots$, each having the property that individual y associates more strongly with region $x_i$ than with any other region $x_i'$ at the same scale in his or her cultural repertoire. This is depicted schematically in Figure 2.

Figure 2.


At a given culturally defined scale (planes), an individual most closely identifies with a given geographical unit (circled areas). Identification at any given scale can be conceptualized as eliciting a “slice” through the cone-like structure formed from the union of possible elicitations. “Slicing” at a uniform level allows us to examine identification mechanisms across individuals.

Our focus in this paper can be viewed as follows: given a uniform “slice” through the cones of regional identities in a population at a given scale, what predicts the units with which each individual will identify? In particular, we here consider identification for local communities among residents of the United States, at a scale that corresponds to “places” designated by the United States (US) Census.

Mechanisms of Regional Identification

We may hypothesize a variety of processes by which regional identification may occur at a given scale. The mechanisms we consider here are divided into two key subgroups, the SNH, and the Geography and Prominence Hypotheses. The first of these hypotheses is based on social structure in which individuals are embedded; the subsequent competing hypotheses are based on particularly salient properties of modern life.

A Social Network Hypothesis of Regional Self-Identification

One potential mechanism for regional identification is based on the social context in which individuals are embedded (e.g., friendship, co-worker, and kinship networks). In such a case, regional identification might consist of an individual performing a search over his or her personal networks (Dodds et al. 2003), and selecting the region that contains the largest number of alters. Tying this notion back to the concept of salience, which will be important throughout this paper, this hypothesis can be rephrased as an argument that regions containing the maximal number of an individual’s alters are the most salient regions to that individual for this type of identification.

SNH

Individuals choose to identify with the region in which they have the largest number of alters.

Different processes (or combinations thereof) potentially could underlie the ultimate mechanism of regional identification and inform the SNH. Intuitively, we might suspect this hypothesis to be plausible for many reasons. For example: (1) individuals search over their alters and select the region to identify with based on a plurality heuristic; (2) individuals have more exposure to the places where they have more alters, and thus such an area is more salient; and (3) individuals mimic their peers and thus choose to identify with the area with which they view the bulk of their peers as identifying. As the data do not allow us to distinguish between these micro-level processes, we view the SNH as representing a class of mechanisms (one or more of which may be active at once), which we collectively distinguish from other classes of identification mechanisms.

Geography and Prominence Hypotheses

The SNH involves one class of mechanisms for regional identification, but others can be entertained. The first alternative hypothesis proposed here is dubbed the Proximity Hypothesis. The Proximity Hypothesis is based on the intuitive salience of certain geographies to an individual, particularly those that are the closest (most proximal) to an individual (e.g., one lives near Irvine, CA and identifies with Irvine).

Proximity Hypothesis

Individuals choose to identify with the region that is most proximal (closest) to them, given their current geographic location of residence.

An alternative hypothesis – although one which is related to the Proximity Hypothesis – is one in which the most salient region is not simply the most proximal, but is a balance of being both prominent (salient given some characteristic/threshold) and most proximal to an individual’s location. In this case, one assumption is that an individual limits his or her choice-set to only those regions that meet a particular prominence characteristic/threshold (e.g., presence of a National Football League team/population threshold; see Gigerenzer and Todd 1999), and subsequently selects the most proximal region within this limited set.

Prominence Hypothesis

Individuals choose to identify with the closest prominent region to their geographic locations.

One also might propose the reverse of the aforementioned hypothesis, where an individual first limits his or her choice-set by the saliency criterion of distance, and then chooses a region to identify with based on some prominence characteristic/threshold.

Distance Hypothesis

Individuals choose to identify with the most prominent region within a given distance radius.

The Prominence and Distance Hypotheses are motivated, first, by the elimination heuristics that have been shown to be fast and frugal, as well as accurate in judgement making (Berretty et al. 1997), and second, by the vetting models in the fields of population biology and public health (Handcock and Jones 2004).

Elimination models in the cognitive science literature were conceived for choice tasks; in these models, an object is chosen by repeatedly eliminating subsets of objects from further consideration, thereby whittling down the set of remaining possibilities (Tversky 1972). These heuristics have been extended to include categorization tasks such as length and widths of flower parts by Berretty et al. (1997). Similarly the Prominence and Distance Hypotheses may be perceived as a series of elimination heuristics (i.e., limiting the choice set by one criterion after another until a single item remains).

The vetting models were conceived as a two-stage process model for how individuals form sexual partnerships: (1) individuals generate a list of acquaintances, and (2) choose their sexual partners (Handcock and Jones 2004). Similarly the Prominence and Distance Hypotheses may be defined as a two-stage process model in which an individual first limits his or her choice set (e.g., only cities greater than 50,000), and then selects an item in his or her choice set (e.g., the closest remaining city).
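As a rough illustration of the two-stage structure just described, the following Python sketch implements both orderings of the elimination steps. It is only a sketch under stated assumptions: population stands in for "prominence" (one of the example criteria mentioned above), distances are great-circle distances between points, and the place records, names, and thresholds are hypothetical toy values rather than CCP or Census data.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def prominence_choice(home, places, min_population):
    """Stage 1: keep places meeting the prominence threshold. Stage 2: pick the closest."""
    candidates = [p for p in places if p["population"] >= min_population]
    if not candidates:
        return None
    nearest = min(candidates,
                  key=lambda p: haversine_km(home[0], home[1], p["lat"], p["lon"]))
    return nearest["name"]

def distance_choice(home, places, radius_km):
    """Stage 1: keep places within the radius. Stage 2: pick the most populous."""
    candidates = [p for p in places
                  if haversine_km(home[0], home[1], p["lat"], p["lon"]) <= radius_km]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p["population"])["name"]

# Hypothetical toy data (assumed for illustration only).
places = [
    {"name": "Smallville", "lat": 0.0, "lon": 0.1, "population": 5_000},
    {"name": "Midtown", "lat": 0.0, "lon": 0.5, "population": 60_000},
    {"name": "Metropolis", "lat": 0.0, "lon": 2.0, "population": 900_000},
]
home = (0.0, 0.0)  # (lat, lon) of a hypothetical respondent
```

Setting the population threshold to zero in `prominence_choice` removes the first elimination stage, so the function then simply returns the closest place, which corresponds to the Proximity Hypothesis.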

The Case of Community Identification

The regional identification processes outlined in the previous section are hypothesized to predict the identification within a scale-induced choice set; to test these hypotheses, it suffices to consider a set of identification decisions (1) made on a common scale, for which (2) a consensus choice set is readily available. In this paper, we employ data on identification with local communities collected by the Common Census Project (CCP; Baldwin 2010). As discussed further subsequently, CCP respondents overwhelmingly (>95%) selected regions of identification that correspond to Census Designated Places (CDPs) as defined by the year 2000 US Census (US Census Bureau 2001). CDPs are constructed to correspond to towns, cities, or other well-defined local population aggregates with a commonly identified name, and thus serve as an effective operationalization of culturally recognized “communities”; respondents readily selecting CDPs when describing the local community or area with which they identify (despite being given the opportunity to enter alternative labels) further validates the intelligibility of this geographical unit to the study population. Henceforth, we employ CDPs as our geographical unit of interest, using the term “community” as an intuitive shorthand to describe what these units represent.

Our subjects identifying with units at the community scale does not preclude them from identifying with units at other scales. Rather, the community scale serves as a uniform slice through respondents’ regional identification cones, giving us a basis for systematic prediction across respondents. We do not, in particular, require that respondents’ identification at the community scale be stronger or more salient than, for example, their identification at larger scales. What we do require is that each respondent identify more strongly with the community he or she selects than with any other available community, an assumption that is consistent with the nature of the CCP data.

Of particular use to researchers investigating regional identification (of cities or other geographical levels) is the body of spatial and geographic data from the 2000 US Census (Almquist 2010; US Census Bureau 2001) and data from the Common Census internet project (Baldwin 2010), each of which is a readily available, detailed resource of geographic and identification data. Before proceeding to a description of our analysis techniques, we provide an overview of the US Census and Common Census data sets.

US Census Demographic and Geographic Data

The 2000 US Census Summary File 1 (SF1) data consist of population counts and other basic demographics at five geographic resolutions: blocks, block groups, tracts, counties, and states (for detailed definitions, see US Census Bureau 2001), each of which exhaustively covers the land mass of the US. The US Census data also contain geographic and demographic data for what it calls census designated places (CDPs), which shall be referred to as communities in the remainder of this article.

As noted previously, the US Census Bureau’s definition of CDPs closely approximates what most individuals in the US would consider communities. This linkage is reinforced by analysis of the CCP data, for which 96 percent of respondents’ reports of identification are found to coincide with places as they are defined by the US Census Bureau (e.g., a respondent might choose Irvine for his or her identification). This outcome occurs even though respondents were both given the option of choosing items outside the category of places, and provided the option of writing in their own preference. There are a total of 24,670 places ranging in population size from zero to eight million (there is no minimum population requirement for a place; US Census Bureau 2001), with most corresponding to towns, cities, or well-defined and commonly named areas within larger urban areas. CDPs also can include military installations or other areas that are well-recognized (and which may have a residential population), but that are not captured by conventional definitions of “city,” “town,” or the like.

Our analysis employs a GIS implementation of the 2000 US Census data by Almquist (2010), implemented in the R statistical computing environment (R Development Core Team 2010). R’s spatial tools (Bivand et al. 2008) were used for associated data manipulation and analysis.

Common Census Internet Project

The Common Census internet project is a website started in 2005 by Baldwin (2010) to develop a “natural” (perceptual consensus) mapping of the US such that the borders of/within an area emerge from a consensus among the individuals who reside in that area. In practice, the CCP data are a convenience sample from 2005–present that consists of five questions related to one’s geography, several of which focus on regional identification. In this work, responses to three of the five questions are used to test the hypotheses proposed in this paper. The following analysis also utilizes data from the first of the five questions from the online questionnaire, which elicits a respondent’s address and automatically geocodes his or her location (after which these results were anonymized to the census geography of the block).¹

Given that respondents may answer any of the Common Census questions at idiosyncratic geographic levels, we limit our analyses to those respondents who supply at least one answer at the community (CDP) level. The second² and third³ questions pertain largely to community-level identification, and approximately 96% of respondents supply an answer corresponding to a CDP.

Crucially, both questions request that a respondent ignore any official boundaries and answer only with the region he or she feels that he or she identifies with, which should elicit the processes of regional identification this work is interested in, rather than simply a report of the geographic location of individuals.

Responses collected after 2007 were omitted from analysis as a result of faulty geocoding of respondents’ locations; this elimination left a total of 51,655 respondents, of whom 45,167 answered with a CDP for the second question; this number increases to 49,769 when the results of the second and third questions are combined. By using a combination of questions two and three, this analysis utilizes a sample of 49,769 respondents, which is 96 percent of the total surveyed population (2005–2007). Figure 3 visualizes the location of each respondent. The resulting sample includes individuals who identify with 10,325 different places, where approximately 20% of these individuals selected out-of-state places.

Figure 3.


Centroid locations of the Common Census Internet Project respondents in the continental US, in an Albers Conical Equal Area projection (in meters).

Because the Common Census is an internet-based, self-selected sample, it contains systematic biases.⁴ We would expect these biases to follow those commonly found in internet surveys (e.g., respondents are younger, wealthier, and more educated⁵).

Methodology

In order to utilize the Common Census data to evaluate the previously discussed hypotheses, these hypotheses must first be operationalized for a specific level of regional identification (here, the community level). What follows is a series of model proposals, each of which represents one of the aforementioned hypotheses in an analytical framework. The first of these proposals is for a baseline model, here dubbed the Uniform Choice Model. This baseline model provides a comparison point for all other models, assuring a reader that the regional identification data of interest here does, in fact, contain structure (is non-random).

The Uniform Choice Model

The Uniform Choice Model is a family of parameterized models that, given a respondent and his or her geographic location, map each respondent’s location to a randomly-chosen item (place) from within the choice-set (P). The location of a respondent is coded, using an anonymization procedure, in terms of the centroid longitude and latitude coordinates of the US Census block in which the respondent resides. Consequently, multiple respondents can have the same coordinates, although they do not live in the same household. The choice-set of locations available to individuals for identification is the set of all CDPs in the continental US (24,670 places).

In effect, the Uniform Choice Model is a mapping of respondents’ locations to a CDP randomly drawn from a uniform probability distribution, and is implemented using the following algorithm: (1) map all CDPs onto the natural numbers, (2) select each respondent’s place of identification by drawing a random number from a uniform distribution, and (3) map that number back to the corresponding place (e.g., if the draw for a respondent located at (−108.62, 44.97) maps to Ardmore, AL, the model predicts that this respondent identifies with Ardmore, AL).
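The three steps of this baseline can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation (the analysis itself was carried out in R); the function names, the `seed` argument, and the toy inputs are assumptions introduced here.

```python
import random

def uniform_choice_predictions(respondent_ids, places, seed=None):
    """Assign each respondent a place drawn uniformly at random from the choice set."""
    rng = random.Random(seed)
    # Step 1: map places onto the integers 0, 1, ..., len(places) - 1.
    index_to_place = dict(enumerate(places))
    predictions = {}
    for rid in respondent_ids:
        # Step 2: draw an integer uniformly from the index range.
        k = rng.randrange(len(places))
        # Step 3: map the integer back to the corresponding place.
        predictions[rid] = index_to_place[k]
    return predictions

def chance_of_match(n_places):
    """Probability that a uniform draw matches any one fixed reported place."""
    return 1.0 / n_places
```

Note that the respondent's location plays no role in the draw, which is what makes this model a useful floor: with a choice set of 24,670 places, the chance that a uniform draw matches a respondent's reported place is 1/24,670, or roughly 0.004 percent.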

Tie Volume and the Social Network Hypothesis of Regional Self-Identification

Interest in large-scale, spatially-embedded networks has a long history in the social sciences, stemming from the famous Milgram experiments (Milgram 1967; Travers and Milgram 1969), later re-popularized as the “small-world” phenomenon by Watts and Strogatz (1998). Recently, methods for statistical and simulation-based modeling of large-scale spatially-embedded networks have been developed by Butts (2003); Butts and Acton (2011); Butts et al. (2012).

Spatial Bernoulli Graphs and the Spatial Interaction Function

A well-established empirical regularity is that the marginal probability of a social tie between two persons declines with increasing geographical distance for a wide range of social relations (e.g., Bossard 1932; Festinger et al. 1950; Freeman et al. 1988; Hägerstrand 1966; Latané et al. 1994; McPherson et al. 2001). Butts (2003) demonstrates that, under fairly weak conditions, spatial structure is adequate to account for the vast majority of network structure (in terms of total entropy) at large geographical scales. Simple network models based on the distance/tie probability relationship have been shown to produce reasonable distributions for structural features such as degree distributions (Butts et al. 2012) and have been found to have predictive power for outcomes such as crime rates in neighborhoods (Hipp et al. 2013).

The most basic family of such network models is the set of spatial Bernoulli graphs. We define a spatial Bernoulli graph in the manner of Butts and Acton (2011). Consider a set of vertices, V, which are spatially embedded with a distance matrix $D \in [0, \infty)^{N \times N}$. Let G be a random graph on V, with stochastic adjacency matrix $Y \in \{0, 1\}^{N \times N}$. The pmf of G given D is

$$\Pr(Y = y \mid D, \mathcal{F}_d) = \prod_{\{i,j\}} B\left(y_{ij} \mid \mathcal{F}_d(d_{ij})\right) \tag{1}$$

where B is the Bernoulli pmf, and $\mathcal{F}_d : [0, \infty) \to [0, 1]$ is the spatial interaction function (SIF). The SIF controls the underlying structure of a network, and thus is the key component within this family of models; specifically, the SIF relates distance to the marginal tie probability. Empirically, real-world social networks typically appear to have an SIF in which the marginal tie probability decays with distance (see Butts 2003). This decay is consistent with the empirical regularity noted above, which holds for a broad range of relationships (e.g., Arentze and Timmermans 2005; Axhausen 2007; Bossard 1932; Carrasco et al. 2008; Festinger et al. 1950; Freeman et al. 1988; Hägerstrand 1966; Latané et al. 1994; McPherson et al. 2001). This tendency suggests that the functional form for a social network SIF is some variant of a power law. Here we consider two basic functional forms of an SIF, based on empirical data estimated from two large communication networks (see Section Social Network Model (Tie Volume Model)):

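$$\mathcal{F}_d(x) = \frac{p_d}{1 + (\alpha x)^{\gamma}} \quad \text{(attenuated power law)} \tag{2}$$
$$\mathcal{F}_d(x) = \frac{p_d}{(1 + \alpha x)^{\gamma}} \quad \text{(power law)} \tag{3}$$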

where $p_d$ is the baseline tie probability at distance 0, $\gamma$ is a shape parameter governing the distance effect, and $\alpha$ is a scaling term. For a typical visualization of a network drawn from a model of this type, see Figure 1.
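To make the model concrete, the following Python sketch codes the two SIF forms (equations 2 and 3) and draws a single graph from a spatial Bernoulli model by flipping an independent coin for each pair of vertices. It is a minimal sketch under stated assumptions: vertices are given as planar coordinates with Euclidean distance, the graph is undirected, and the function names are introduced here for illustration only.

```python
import math
import random

def attenuated_power_law(x, p_d, alpha, gamma):
    """Equation 2: p_d / (1 + (alpha * x) ** gamma)."""
    return p_d / (1.0 + (alpha * x) ** gamma)

def power_law(x, p_d, alpha, gamma):
    """Equation 3: p_d / (1 + alpha * x) ** gamma."""
    return p_d / (1.0 + alpha * x) ** gamma

def sample_spatial_bernoulli(coords, sif, seed=None):
    """Draw one undirected graph; pair {i, j} is tied independently with probability sif(d_ij)."""
    rng = random.Random(seed)
    edges = []
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])  # Euclidean distance (assumed planar coordinates)
            if rng.random() < sif(d):
                edges.append((i, j))
    return edges
```

Both forms equal `p_d` at distance zero and decline monotonically with distance. The all-pairs loop is quadratic in the number of vertices, so it is suitable only for small illustrations, not for populations at the scale considered in this paper.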

As the preceding discussion suggests, network structure and geography are intricately linked, and the spatial Bernoulli graphs can be viewed as providing a social structural interpretation of the classical gravity models (Haynes and Fotheringham 1984) that are replete in the geographical literature. The gravity models can be viewed as a family of nonlinear regression models for valued relational data, in which the expected degree of interaction between elements is taken to be a product of marginal rates (i.e., row/column effects) and an attenuation function dependent upon the distance between them. Formally,

$$\mathbb{E}[Y_{ij}] \propto P(i)\,P(j)\,\mathcal{F}_d(d(i,j)), \tag{4}$$

where P(x) is the interaction potential of element x, and $\mathcal{F}_d$ is the SIF. Thus the spatial Bernoulli graphs can be viewed as a special class of gravity models for dichotomous interactions (although this does not extend to the general class of spatial random graph models; see, e.g., Daraganova et al. 2012). While gravity models are not always motivated by a clear social mechanism, here models of this form (i.e., spatial Bernoulli graphs) are used to capture the expected number of social ties between an individual respondent and all individuals in a given areal unit (based on extrapolative simulation from models fit to network data in prior work). Thus this paper provides an example of the connection between classical geographical techniques and other forms of relational analysis.

Tie Volume

Geographically embedded networks have many properties that are jointly related to space and social structure (Butts 2003; ming), the most relevant to this work being tie volume. The tie volume, $V(A, B)$, between areal units A and B for graph G is the number of edges (i, j) such that vertex i resides in unit A and vertex j resides in unit B. If we take $A_i$ to be an arbitrarily small region around vertex i (such that $A_i$ contains no other vertices), $V(A_i, B)$ also can be used to express the total number of ties from vertex i to individuals in areal unit B; we use the shorthand $V(i, B)$ to denote this special case. When dealing with extrapolatively simulated networks (as in the present context), it is natural to work with the expected tie volume $\mathbb{E}V(A, B)$ rather than the tie volume for an observed graph; in what follows, we refer to the expected tie volume simply as the “tie volume” where there is no danger of confusion.
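For intuition, the expected tie volume has a simple closed form under the spatial Bernoulli model when vertex locations are taken as known and A and B are disjoint. This short derivation is offered as a sketch that follows from equation 1, not as a statement of the computational procedure used in the paper. Because each potential edge is an independent Bernoulli draw with success probability given by the SIF, linearity of expectation gives

$$\mathbb{E}V(A, B) = \mathbb{E}\Big[\sum_{i \in A} \sum_{j \in B} Y_{ij}\Big] = \sum_{i \in A} \sum_{j \in B} \mathcal{F}_d(d_{ij}), \qquad \mathbb{E}V(i, B) = \sum_{j \in B} \mathcal{F}_d(d_{ij}).$$

The expected number of ties from an individual to an areal unit is thus the sum of the SIF evaluated at the distance to each resident of that unit, so populous nearby units contribute the most.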

Now, SNH can be operationalized in terms of tie volume in a straightforward manner. Given the calculations of the expected tie volume between two locations (e.g., a respondent’s home location and each community in the US), the SNH predicts that an individual identifies with the community that has the largest expected tie volume with his or her residential location.

Social Network Model (Tie Volume Model)

To obtain identification predictions from the Tie Volume Model, we proceed as follows: first, calculate the expected tie volume between a respondent’s home block and the block groups that make up a given community, dividing by the population of the respondent’s block to obtain the expected number of ties from the respondent to residents of each block group in the community. Next, sum the expected tie volumes from the respondent to each block group in the community, providing the expected tie volume between the respondent and the community as a whole. Repeat this procedure for each community in the choice set, then select the community with the maximum expected tie volume as the location with which the individual identifies. This procedure may be written explicitly as follows:

  1. Let $P$ be the set of communities, with each $P_k$ consisting of $n_k$ block groups $g_{k1}, \ldots, g_{kn_k}$ with population counts given by $\mathrm{pop}(g_{k1}), \ldots, \mathrm{pop}(g_{kn_k})$. Let $r_i$ be the Census block in which the $i$th respondent resides.

  2. For each $k \in 1, \ldots, |P|$, calculate $\mathbf{E}V(i, P_k) = \frac{1}{\mathrm{pop}(r_i)} \sum_{j=1}^{n_k} \mathbf{E}V(r_i, g_{kj})$.

  3. Select $\arg\max_{P_k \in P} \mathbf{E}V(i, P_k)$; this is the community with which $i$ is predicted to identify.
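The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' R implementation: it assumes the block-to-block-group expected tie volumes have already been computed elsewhere, and the function name, identifiers, and numbers are all hypothetical.

```python
from typing import Dict, List


def predict_community(
    block_population: float,
    expected_tie_volume: Dict[str, float],
    communities: Dict[str, List[str]],
) -> str:
    """Return the community with the largest expected tie volume to a respondent.

    block_population    -- population of the respondent's Census block
    expected_tie_volume -- expected tie volume between the respondent's block
                           and each block group (assumed precomputed)
    communities         -- community name -> ids of its block groups
    """
    scores = {}
    for name, block_groups in communities.items():
        # Step 2: sum over the community's block groups, then divide by the
        # population of the respondent's block.
        total = sum(expected_tie_volume[g] for g in block_groups)
        scores[name] = total / block_population
    # Step 3: arg max over the choice set.
    return max(scores, key=scores.get)


# Made-up numbers, purely for illustration.
volumes = {"g1": 12.0, "g2": 30.0, "g3": 25.0, "g4": 4.0}
choice_set = {
    "Community A": ["g1"],
    "Community B": ["g2", "g3"],
    "Community C": ["g4"],
}
prediction = predict_community(40.0, volumes, choice_set)
```

Because every community's score is divided by the same block population, the division does not alter which community is selected; it only rescales the scores to a per-respondent expectation.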

The expected tie volume between a respondent’s location and a given block group depends on both the detailed geometry of the blocks/block groups and the SIF, and is computed via a Monte Carlo quadrature algorithm (Butts Forthcoming).6 In this paper we employ two distinct SIFs. The first is a classic SIF estimated from a large-scale phone network, and the second is a modern example from the social networking site Facebook.7

The first SIF used in this article is based on Hägerstrand’s data set of phone calls made between regions in rural Sweden in 1950. Butts (2002) computed this SIF from Hägerstrand’s (1966) “technologically mediated communication” relation, which acts as a long-tailed example with a slowly decaying distance function (approximately $d^{-2.95}$). The parametric form is an attenuated power law (see equation 2), with parameters (0.937, 0.538, 2.956).
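As a rough illustration of how such an SIF behaves, the sketch below evaluates an attenuated power law in Python. It assumes that equation 2 (not reproduced in this section) has the common form $F(d) = p_b / (1 + \alpha d)^{\gamma}$ and that the quoted parameters are ordered $(p_b, \alpha, \gamma)$; both are assumptions, and the function names are hypothetical.

```python
import math


def attenuated_power_law(d: float, p_b: float, alpha: float, gamma: float) -> float:
    """Assumed SIF form: tie probability p_b / (1 + alpha * d) ** gamma."""
    return p_b / (1.0 + alpha * d) ** gamma


# Parameters quoted in the text; the (p_b, alpha, gamma) ordering is assumed.
HAGERSTRAND = (0.937, 0.538, 2.956)


def local_exponent(d: float, params, step: float = 1.01) -> float:
    """Log-log slope of the SIF between distances d and step * d."""
    f1 = attenuated_power_law(d, *params)
    f2 = attenuated_power_law(d * step, *params)
    return (math.log(f2) - math.log(f1)) / math.log(step)


# At large distances the slope approaches -gamma, matching the quoted decay.
slope_far = local_exponent(1e6, HAGERSTRAND)
```

Under this assumed form, the function equals $p_b$ at distance zero and decays like $d^{-\gamma}$ in the tail, which is consistent with the approximate $d^{-2.95}$ behavior described above.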

The second SIF used in this paper is based on a uniform sample of Facebook users in 2009 collected by Gjoka et al. (2010), where the authors recorded (when identified) a user’s university affiliation and his or her alter’s affiliation. From this information Spiro et al. (2012) computed an SIF for Facebook friendship between university-affiliated individuals. This SIF represents a “modern technologically mediated communication” relation, which acts as a long-tailed example with a slowly decaying distance function (approximately $d^{-6.527}$). Its parametric form is a power law (see equation 3), with parameters (0.627, 0.049, 6.527).

The regional identification proposed in the SNH suggests that a weak interaction SIF such as that from a communication network might be representative of the type of macro-level structure underlying this phenomenon. The SIFs employed here are two examples of how such a network can scale with distance; by representing a fairly wide range of scaling parameters, they allow us to examine the robustness of the SNH while still employing SIFs based on (previously) observed network structure. Both SIFs were inferred from observed networks in previous studies, and were not in any way fit to the CCP (or other regional identification) data. Thus, these models are zero-parameter with respect to CCP prediction, because they contain no free parameters that are adjusted to improve fit for the regional identification data.

Because optimal prediction from the Tie Volume Model requires that one either have a priori knowledge of the exact SIF governing identification-relevant relationships or infer an SIF from the data (to guarantee the best-fitting model), computing the expected tie volume in the manner implemented here is a more stringent test of the SNH than, for example, fitting the observed data to a gravity model. If the Tie Volume Model outperforms competing models in predicting regional identification, the extent of this superior performance would only increase in a better-fitted model. In effect, the sub-optimality of this model makes it a stronger test of the effects of large-scale social networks on regional identification.

To demonstrate the Tie Volume Model, we consider an illustrative case within California. Respondent A lives within a Census block in Albany, CA. First, we compute the expected tie volume from respondent A’s home location within the city of Albany to every community in California under the aforementioned SIF (see Figure 4). We then rank the results and select the community with the highest expected tie volume. In this case respondent A lives in Albany but identifies with Berkeley, as the Tie Volume Model predicts (see Figure 4).

Figure 4. An example of the Tie Volume Model for a single respondent living in Albany, CA. Results are logged for visualization purposes (log is a rank-preserving transformation and therefore does not change the results). (a) Full-state view of the Tie Volume Model for a single respondent living in Albany, CA. (b) A close-up of the same example.

The Proximity Model

The Proximity Model is a family of parameterized models that map the location of each respondent to the nearest item (community) in the choice-set (P), where nearest here is defined as the item with minimum distance between itself and the respondent’s location. Given the notation of the Uniform Choice Model Section, in combination with a distance function d(·, ·), the algorithm first calculates a respondent’s distance from his or her location to every item in the choice-set, and then selects the item with the smallest corresponding distance.

In order to best approximate the actual physical distance between a respondent and each place in the continental US, the Great Circle distance8 is calculated between the longitude and latitude of each respondent and the center point of each community. For example, a respondent with a location (−107.53, 41.03) would be predicted to identify with Dixon, WY as a result of the respondent having a distance of zero between his or her location and the location of Dixon, WY and greater than zero distance for all other places.
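A minimal Python sketch of the Proximity Model follows. It is not the authors' R code: it uses the standard latitude form of the spherical law of cosines with an Earth radius of 6,371 km, and the second place in the example ("Hypothetical place") and its coordinates are invented for illustration.

```python
import math
from typing import Dict, Tuple

EARTH_RADIUS_KM = 6371.0

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def great_circle_km(a: Point, b: Point) -> float:
    """Great-circle distance via the spherical law of cosines."""
    lon1, lat1 = map(math.radians, a)
    lon2, lat2 = map(math.radians, b)
    cos_angle = (math.sin(lat1) * math.sin(lat2)
                 + math.cos(lat1) * math.cos(lat2) * math.cos(lon1 - lon2))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return EARTH_RADIUS_KM * math.acos(max(-1.0, min(1.0, cos_angle)))


def proximity_prediction(respondent: Point, centers: Dict[str, Point]) -> str:
    """Return the community whose center point is nearest to the respondent."""
    return min(centers, key=lambda name: great_circle_km(respondent, centers[name]))


centers = {
    "Dixon, WY": (-107.53, 41.03),            # location quoted in the text
    "Hypothetical place": (-105.00, 41.00),   # invented coordinates
}
nearest = proximity_prediction((-107.53, 41.03), centers)
```

The arccosine form loses precision for very small separations, so a haversine formulation may be preferable when many respondents sit close to a community center.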

Vetting Models

Given the notation in Section and the distance function of The Proximity Model Section, two distinct families of single-parameter vetting models are proposed, the first of which is a distance-based vetting model, here dubbed the Distance Vetting Model, and the second of which is a prominence-based vetting model, here called the Population Vetting Model. All vetting models are named according to the initial rule an individual uses to first limit his or her choice set.

Each vetting model may be viewed as a two-stage process (Handcock and Jones 2004) in which an individual first limits his or her personal choice set with a decision rule, and subsequently selects a final choice based on a different decision rule. This procedure follows the same basic logic as the elimination heuristics in the cognitive science literature (Berretty et al. 1997; Tversky 1972), and involves the following three basic steps:

  • Step 1) Select a rule to limit the choice set (e.g., individuals contemplate only communities within 50 miles of where they live).

  • Step 2) Select a rule to pick from among the limited choice set (e.g., individuals choose the highest-population community within the resulting choice set).

  • Step 3) Apply the conjunction of Steps 1 and 2.

Step 1 constrains a choice set using a decision rule, motivated by the hypotheses in the Mechanisms of Regional Identification Section, which is operationalized as a parameter constraint, θ, and relation operator, R (e.g., a binary relation R usually is defined as an ordered triple (X, Y, G) where X and Y are arbitrary sets, and G is a subset of the Cartesian product X × Y; this is commonly written xRy). For example, in the case of the Distance Vetting Model, a choice set is limited to only those communities less than θ distance from a respondent.

In Step 2 another decision rule is chosen, again motivated by the hypotheses in the Mechanisms of Regional Identification Section. In this article, two decision rules are proposed: closest (C) and largest (L). Under closest, an actor chooses the community nearest to where he or she lives that is contained within his or her limited choice set. Under largest, an actor chooses the most salient item in terms of population size (i.e., the largest) within the limited choice set. The largest decision rule requires a monotonicity assumption for a choice set, which can be accomplished by listing cities in descending order based on population size.

The Distance Vetting Model

The Distance Vetting Model assumes that, in the process of regional identification, an individual considers only those regions within some maximum distance of where he or she lives. This initial limitation is achieved by narrowing the choice set to only those cities which are less than or equal to θ distance from an individual (i.e., R is the ≤ operator). Subsequently, an individual makes his or her ultimate choice by selecting the most prominent community from within that radius (θ).

For example, a respondent with a location (−86.816, 33.272) and a θ = 106.58 km has an initial, limited choice-set of 171 communities. The largest three communities within this radius (in descending order) are Birmingham, AL (population 242,820); Tuscaloosa, AL (population 77,906); and Hoover, AL (population 62,742). Thus, this respondent is predicted to identify with Birmingham, AL.
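The two-stage logic of the Distance Vetting Model can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: distances are taken as precomputed, the populations of the three Alabama cities come from the example above, and every distance value and the "Hypothetical distant city" entry are invented.

```python
from typing import List, NamedTuple, Optional


class Community(NamedTuple):
    name: str
    population: float
    distance_km: float  # distance from the respondent (invented values below)


def distance_vetting(communities: List[Community], theta_km: float) -> Optional[str]:
    """Keep communities within theta_km, then pick the most populous one."""
    # Stage 1: limit the choice set (R is the <= operator).
    shortlist = [c for c in communities if c.distance_km <= theta_km]
    if not shortlist:
        return None  # no community survives the vetting step
    # Stage 2: apply the "largest" decision rule.
    return max(shortlist, key=lambda c: c.population).name


candidates = [
    Community("Birmingham, AL", 242_820, 30.0),
    Community("Tuscaloosa, AL", 77_906, 75.0),
    Community("Hoover, AL", 62_742, 15.0),
    Community("Hypothetical distant city", 500_000, 230.0),
]
choice = distance_vetting(candidates, theta_km=106.58)
```

Returning `None` for an empty shortlist is a design choice of this sketch; the source does not say how respondents with no community inside the radius are handled.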

The Population Vetting Model

According to the Population Vetting Model, an individual considers only communities whose population is greater than or equal to θ (e.g., population ≥ 50,000, for θ = 50,000). This individual then makes his or her final choice by selecting the closest community from within this choice set.9

For example, a respondent with a location (−86.816, 33.272) and θ = 289,315.4 has a resulting initial choice-set of 57 communities. The closest three communities from within this set are (in ascending order of distance) Atlanta, GA (228.9 km), Nashville, TN (322.3 km), and Memphis, TN (356.6 km). Thus, this respondent is predicted to identify with Atlanta, GA.
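A comparable Python sketch of the Population Vetting Model is given below. It is illustrative only: the three distances are those quoted in the example, whereas the populations of Atlanta, Nashville, and Memphis and the distance to Hoover are placeholder values invented so that the example runs.

```python
from typing import List, NamedTuple, Optional


class Community(NamedTuple):
    name: str
    population: float   # placeholder values except for Hoover, AL
    distance_km: float  # distance from the respondent


def population_vetting(communities: List[Community], theta_pop: float) -> Optional[str]:
    """Keep communities with population >= theta_pop, then pick the closest."""
    # Stage 1: limit the choice set by prominence.
    shortlist = [c for c in communities if c.population >= theta_pop]
    if not shortlist:
        return None  # no community is large enough
    # Stage 2: apply the "closest" decision rule.
    return min(shortlist, key=lambda c: c.distance_km).name


candidates = [
    Community("Atlanta, GA", 400_000, 228.9),
    Community("Nashville, TN", 550_000, 322.3),
    Community("Memphis, TN", 650_000, 356.6),
    Community("Hoover, AL", 62_742, 15.0),
]
choice = population_vetting(candidates, theta_pop=289_315.4)
nearest_overall = population_vetting(candidates, theta_pop=0)
```

With θ = 0 the vetting step removes nothing, so the function simply returns the nearest community, which is the sense in which the Population Vetting Model reduces to the Proximity Model.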

Computational Considerations

Each of the aforementioned algorithms is implemented in the R statistical programming environment (R Development Core Team 2010). The Uniform Choice Model and Proximity Model are implemented exactly as discussed, as is the Tie Volume Model, including estimation of the expected tie volume between a respondent’s block and the block groups of a given city using the spatialNetwork package. The Population Vetting Model and Distance Vetting Model obtain their parameter estimates through numerical optimization (specifically, the optimization function provided in the R base code; R Development Core Team 2010).
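To make the estimation step concrete, the Python sketch below fits the single parameter θ of a distance-vetting rule by maximizing the proportion of correct predictions. Note that it swaps in an exhaustive search over a list of candidate thresholds in place of the R optimization function the authors used; the data, names, and candidate values are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (name, population, distance from the respondent in km)
Option = Tuple[str, float, float]
# (choice set as seen by one respondent, community the respondent reported)
Respondent = Tuple[List[Option], str]


def distance_vetting(options: List[Option], theta_km: float) -> Optional[str]:
    """Most populous community within theta_km, or None if there is none."""
    shortlist = [o for o in options if o[2] <= theta_km]
    return max(shortlist, key=lambda o: o[1])[0] if shortlist else None


def fit_threshold(candidates: Sequence[float],
                  respondents: List[Respondent]) -> Tuple[float, float]:
    """Return (best theta, proportion correct) over the candidate thresholds."""
    best_theta, best_score = candidates[0], -1.0
    for theta in candidates:
        hits = sum(distance_vetting(opts, theta) == truth
                   for opts, truth in respondents)
        score = hits / len(respondents)
        if score > best_score:  # ties keep the smaller, earlier theta
            best_theta, best_score = theta, score
    return best_theta, best_score


# Toy data: three respondents, three communities A, B, C.
toy = [
    ([("A", 100, 5), ("B", 1000, 40), ("C", 5000, 200)], "B"),
    ([("A", 100, 60), ("B", 1000, 10), ("C", 5000, 120)], "B"),
    ([("A", 100, 3), ("B", 1000, 300), ("C", 5000, 45)], "C"),
]
best_theta, best_score = fit_threshold([10, 50, 150, 250], toy)
```

The proportion correct is a step function of θ (it changes only when θ crosses one of the observed distances), so a search over candidate values is a natural fit for a sketch like this one.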

For parametric models, estimates of model standard errors and confidence intervals are performed using a non-parametric bootstrap (10,000 replications), allowing for tests of statistically significant differences in performance among the proposed models (Dwass 1957).
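The bootstrap step can be illustrated with the Python sketch below. It assumes a percentile interval computed by resampling respondents with replacement; the source does not specify the interval type, so that choice, the function name, and the toy data are all assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_ci(correct: List[int], replications: int = 10_000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap for the proportion of correct predictions.

    correct -- one 0/1 entry per respondent (1 = prediction was correct)
    Returns (point estimate, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(correct)
    estimates = []
    for _ in range(replications):
        # Resample respondents with replacement and recompute the proportion.
        resample = rng.choices(correct, k=n)
        estimates.append(sum(resample) / n)
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * replications)]
    upper = estimates[int((1.0 - tail) * replications) - 1]
    return sum(correct) / n, lower, upper


# Made-up outcomes: 63 correct predictions out of 100 respondents.
outcomes = [1] * 63 + [0] * 37
estimate, lo, hi = bootstrap_ci(outcomes, replications=2000)
```

For a model with a fitted parameter, θ would also need to be re-estimated within each replication so that the interval reflects estimation uncertainty; this sketch omits that step for brevity.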

Analysis and Results

Uniform Choice Model and Baseline Models

To test the hypotheses proposed in the Mechanisms of Regional Identification Section, each model discussed in the Methodology Section has been applied to the Common Census data set, the results of which (at national-level estimates) are summarized in Table 1. Currently, computation of the Tie Volume Model solutions for the entire national data set is not feasible; rather, it has been applied on a state-by-state basis to the contiguous US. This implementation means that any individual who resides in one state, but selects a city in another state, counts against the model for its predictive analysis (e.g., if a respondent lives in New Jersey and selects New York City as the city with which he or she identifies, then the model cannot predict it and is penalized).

Table 1.

National Comparison of each Model; standard errors and 95% CI calculated using a non-parametric bootstrap with 10,000 replications.

Model | Proportion Correct | 95% CI
Uniform Choice Model | 0.00020 | (0.00007, 0.00033)
Tie-Volume Model | N/A | N/A
Proximity Model | 0.62968 | (0.62028, 0.63909)
Distance Vetting Model | 0.34939 | (0.34002, 0.35874)
Population Vetting Model | 0.62956 | (0.62003, 0.63908)

Table 1 shows model prediction for community-level regional identification. These results illustrate the poor performance of the Uniform Choice Model, which correctly predicts only 0.02% of the data. This performance is interpreted as evidence of the presence of underlying structure in the data set (i.e., individuals do not choose to identify with a place at random). Of the four baseline models proposed, the Proximity Model performs the best. The Distance Vetting Model performs quite poorly (θ̂ = 53.13 kilometers), which may be a result of heterogeneity of human settlements (a feature not accounted for in the single-parameter Distance Vetting Model). Although including additional parameters in any of the models, including the Distance Vetting Model, improves overall performance, the relative performance of the Distance Vetting Model most likely would not change given its initial shortcomings in the single-parameter version (e.g., a reduction of roughly 28 percentage points in accurate predictions compared with the Proximity Model; Table 1). The Proximity Model and Population Vetting Model (θ̂ = 1,480 people) are statistically indistinguishable, and the limit of the Population Vetting Model is the Proximity Model (if θ = 0 the Population Vetting Model is identical to the Proximity Model). Overall, the national results presented in Table 1 imply: (1) regional identification is not a random process, and (2) of the baseline models proposed, the Proximity hypothesis is the most likely mechanism for regional identification at the community scale.

Tie Volume Model versus the Proximity Model: A State-by-State Analysis

Because it was not possible to utilize the Tie Volume Model nationally, this paper presents a state-by-state comparison of the proportion correctly predicted by it and by the Proximity Model (where both models have been provided a limited choice set such that only the cities within a state are considered; both models are penalized by individuals who select communities out of state; Table 2). The Proximity Model is chosen for this comparison because it is the best performing model for the baseline hypotheses.10,11

Table 2.

State-by-State Comparison of the Tie-Volume Model versus the Proximity Model; Tie-Volume Model and Proximity Model proportions predicted correctly where both models have been provided a limited choice set such that only the cities within a state are considered. The difference of the two proportions is compared using an un-pooled z-test and bootstrap-estimated standard errors.

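Columns 2–5 report the Hägerstrand SIF and columns 6–9 the Facebook SIF (TV = Tie-Volume Model proportion correct; Prox. = Proximity Model proportion correct; Diff. = difference between the two).
State TV Prox. Diff. P-value TV Prox. Diff. P-value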
Alabama 0.7107 0.6311 0.0796 0.0046* 0.7767 0.6311 0.1462 0.0000*
Arizona 0.7120 0.4839 0.2281 0.0000* 0.7414 0.4839 0.2540 0.0000*
Arkansas 0.8320 0.7033 0.1286 0.0000* 0.8594 0.7019 0.1618 0.0000*
California 0.6384 0.5302 0.1081 0.0000* 0.6763 0.5302 0.1476 0.0000*
Colorado 0.7684 0.4433 0.3250 0.0000* 0.8067 0.4429 0.3626 0.0000*
Connecticut 0.3984 0.3938 0.0047 0.8642 0.5611 0.3938 0.1738 0.0000*
DC 0.5510 0.5510 0.0000 1.0000 0.5510 0.5510 0.0000 1.0000
Delaware 0.5852 0.5057 0.0795 0.1351 0.6640 0.5057 0.1619 0.0000*
Florida 0.6435 0.4597 0.1838 0.0000* 0.6925 0.4488 0.2426 0.0000*
Georgia 0.5430 0.4547 0.0883 0.0000* 0.5803 0.4547 0.1252 0.0000*
Idaho 0.4081 0.3969 0.0112 0.7318 0.4271 0.3969 0.0334 0.1334
Illinois 0.8121 0.6002 0.2120 0.0000* 0.8768 0.6002 0.2761 0.0000*
Indiana 0.6151 0.5666 0.0485 0.0294* 0.6659 0.5666 0.1003 0.0000*
Iowa 0.8291 0.7450 0.0840 0.0001* 0.8794 0.7450 0.1341 0.0000*
Kansas 0.7930 0.6356 0.1573 0.0000* 0.8805 0.6343 0.2455 0.0000*
Kentucky 0.4859 0.4382 0.0477 0.1067 0.5181 0.4374 0.0797 0.0000*
Louisiana 0.7340 0.4601 0.2739 0.0000* 0.7778 0.4601 0.3190 0.0000*
Maine 0.5447 0.5000 0.0447 0.3218 0.7688 0.5000 0.2754 0.0000*
Maryland 0.5244 0.4810 0.0434 0.0252* 0.5703 0.4781 0.0897 0.0000*
Massachusetts 0.4604 0.4490 0.0113 0.4667 0.5824 0.4490 0.1338 0.0000*
Michigan 0.6858 0.6585 0.0273 0.1174 0.8081 0.6585 0.1484 0.0000*
Minnesota 0.8231 0.6248 0.1983 0.0000* 0.8559 0.6224 0.2334 0.0000*
Mississippi 0.7429 0.6381 0.1048 0.0192* 0.8430 0.6351 0.2089 0.0000*
Missouri 0.7093 0.6166 0.0927 0.0000* 0.7794 0.6166 0.1598 0.0000*
Montana 0.7228 0.7065 0.0163 0.7278 0.7727 0.7065 0.0721 0.0201*
Nebraska 0.8000 0.7323 0.0677 0.0422* 0.8543 0.7278 0.1295 0.0000*
Nevada 0.6842 0.3454 0.3388 0.0000* 0.6574 0.3454 0.3109 0.0000*
New Hampshire 0.5535 0.5203 0.0332 0.4385 0.7636 0.5203 0.2429 0.0000*
New Jersey 0.5105 0.5252 −0.0147 0.4331 0.6851 0.5252 0.1614 0.0000*
New Mexico 0.8182 0.5273 0.2909 0.0000* 0.8374 0.5273 0.3130 0.0000*
New York 0.4409 0.4337 0.0072 0.5943 0.5171 0.4336 0.0830 0.0000*
North Carolina 0.7440 0.5245 0.2195 0.0000* 0.8225 0.5245 0.2981 0.0000*
North Dakota 0.8288 0.6937 0.1351 0.0162* 0.8700 0.6937 0.1798 0.0001*
Ohio 0.6341 0.5592 0.0749 0.0000* 0.7034 0.5595 0.1450 0.0000*
Oklahoma 0.7896 0.4239 0.3657 0.0000* 0.8171 0.4239 0.3939 0.0000*
Oregon 0.6785 0.5302 0.1483 0.0000* 0.6924 0.5302 0.1594 0.0000*
Pennsylvania 0.5421 0.4730 0.0691 0.0000* 0.6299 0.4730 0.1560 0.0000*
Rhode Island 0.6080 0.5227 0.0852 0.1047 0.6923 0.5169 0.1757 0.0000*
South Carolina 0.6524 0.5025 0.1499 0.0000* 0.7145 0.5025 0.2106 0.0000*
South Dakota 0.8226 0.7661 0.0565 0.2700 0.8981 0.7661 0.1242 0.0007*
Tennessee 0.5272 0.4877 0.0395 0.1348 0.5618 0.4871 0.0772 0.0001*
Texas 0.7317 0.5213 0.2104 0.0000* 0.7832 0.5214 0.2616 0.0000*
Utah 0.6831 0.6399 0.0432 0.1525 0.7315 0.6399 0.0895 0.0001*
Vermont 0.4533 0.3667 0.0867 0.1233 0.8243 0.3667 0.4551 0.0000*
Virginia 0.5623 0.5192 0.0430 0.0044* 0.6313 0.5192 0.1092 0.0000*
Washington 0.4770 0.3730 0.1040 0.0000* 0.4830 0.3730 0.1108 0.0000*
West Virginia 0.6270 0.5164 0.1107 0.0133* 0.7015 0.5164 0.1952 0.0000*
Wisconsin 0.7799 0.5619 0.2181 0.0000* 0.8876 0.5602 0.3282 0.0000*
Wyoming 0.8601 0.8042 0.0559 0.2060 0.8958 0.8042 0.0902 0.0060*

Pooled 0.6330 0.5199 0.1131 0.0000* 0.7010 0.5190 0.1814 0.0000*
* denotes significance at the 0.05 alpha level.
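The un-pooled z-test named in the Table 2 caption can be sketched in Python as follows. This is an illustration, not the authors' code: it assumes a two-sided test that treats the two bootstrap standard errors as independent, and the standard errors in the example are made-up placeholders (the table does not report them).

```python
import math
from typing import Tuple


def unpooled_z_test(p1: float, se1: float,
                    p2: float, se2: float) -> Tuple[float, float, float]:
    """Compare two proportions using separately estimated standard errors.

    Returns (difference, z statistic, two-sided p-value).
    """
    diff = p1 - p2
    # Un-pooled: the variances are added rather than estimated jointly.
    z = diff / math.sqrt(se1 ** 2 + se2 ** 2)
    # Two-sided tail probability of a standard normal variate.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return diff, z, p_value


# Proportions from the Alabama row (Hägerstrand SIF); the standard errors
# of 0.02 are placeholders chosen only to make the example run.
diff, z, p_value = unpooled_z_test(0.7107, 0.02, 0.6311, 0.02)

# Identical proportions give a p-value of exactly 1.
tie = unpooled_z_test(0.5510, 0.07, 0.5510, 0.07)
```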

Inspecting the pattern of results presented in Table 2, the Tie Volume Model consistently outperforms the Proximity Model, by as much as 37 percentage points in the Hägerstrand SIF case (Oklahoma) and by as much as 45 percentage points in the case of the Facebook SIF (Vermont). The Tie Volume Model performs significantly better than the Proximity Model for 31 of the 49 units analyzed (the 48 contiguous states and DC) in the case of the Hägerstrand SIF, and for 47 of the 49 in the case of the Facebook SIF. In other words, the Tie Volume Model predicts regional identification significantly better for most of the states analyzed (and, under the Facebook SIF, for almost all of them), and it yields a greater raw number of correct predictions for all but one state (New Jersey, where the difference is not significant) in the case of the Hägerstrand SIF, and for all states in the case of the Facebook SIF; DC is tied under both SIFs. If one takes the aggregation of the Tie Volume Model applied to each state individually as an estimate for the full contiguous US and then compares this estimate to that for the Proximity Model, one again finds a highly significant result for both SIFs: an improvement of over 11 percentage points for the Hägerstrand SIF and of over 18 percentage points for the Facebook SIF (Table 2). The Tie Volume Model with the Facebook SIF also performs significantly better than the national estimates of all the baseline models, with a 7 percentage point improvement over the best performing model (Tables 1 and 2). As a cautionary note, the pooled Tie Volume Model only moderately outperforms the unconstrained Proximity Model in the case of the Hägerstrand SIF (a difference that would not be statistically significant); this caveat is tempered by the Facebook SIF results, which significantly outperform all the baseline models.

Among the states that do not exhibit a statistically significant difference in the performance of the Tie Volume versus Proximity Models (18 of 49, < 50%, for the Hägerstrand SIF and 2 of 49, < 5%, for the Facebook SIF), at least some cases may be due to power constraints (e.g., a state like Delaware, which has only 176 respondents, may lack the requisite statistical power for such a comparison). A closer look at several of the worst-performance states (from the perspective of the Tie Volume Model) reveals that several of these cases are ones furnishing arguably fertile ground for out-of-state identification based on the size of the state, as well as the size of nearby (yet not in-state) cities (e.g., Connecticut or Maryland, which are near New York and Washington, DC, respectively).

Discussion and Conclusion

This paper outlines a cognitive representation of self identification, and further makes a case for regional self-identification as a particularly interesting case study of self identification. It summarizes an evaluation of six competing hypotheses, in which we find that the social influence model performs the best. The social network model performs the best without fitting to the data (i.e., it is a zero-parameter model), whereas the other five models are optimized to the data, thus providing a stronger result. The superior performance of the SNH-based model affirms the theory that regional identification is both a social and a geographical process.

The application of comparable models (and hypotheses of social structure) to other forms of identification (e.g., gender, racial/ethnic, urban/rural, and national) may shed light on many different areas of identification. For example, large-scale social network methods might be used to predict even difficult cases of identification (e.g., boundary cases of ethnic identity).

Finally, the successful application of large-scale social network models to the regional identification problem provides further validation for geographical factors as critical drivers of social process (Mayhew 1984). Even very simple spatial network models, incorporating marginal distance effects, are here able to predict a complex social psychological process. Applications of such models to other social processes would seem to be a fruitful direction for further research.

Acknowledgments

This work was supported in part by an Office of Naval Research (ONR) award (# N00014-08-1-1015), National Science Foundation (NSF) awards (# BCS-0827027), (# SES-1260798) and (# OIA-1028394), and National Institute of Health (NIH)/National Institute of Child Health & Human Development (NICHD) award (# 1R01HD068395-01).

Footnotes

1

As a volunteered, self-reported/administered survey, the CCP is necessarily limited by both the design of the instrument and informant inaccuracy. Although error from such sources can never be ruled out, we did not observe evidence suggestive of data quality problems, and we note that our findings are robust to fairly large perturbations in the data set.

2

Question two states: Many Americans have addresses that say they live in one town or neighborhood, have a government or police force of another name, and fall in the school district of yet another area. For this question, forget about what all “official” sources have told you and answer whatever you feel you identify with most.

What do you consider to be your local community? Don’t confuse this with your whole local area; that will be in the next step. This is about the single local community you most feel you live in.

3

Question three states: This step asks for you to identify with a slightly larger area. Again, please ignore all “official” boundaries like counties, telephone area codes or zip codes, and answer what you feel you identify most with.

We need a way to identify your local area–the local community you just specified, together with the local communities that immediately surround it.

So, please choose the name of the local community that you feel is the natural cultural and economic center within your local area.

Or, if you feel a general name (i.e. “Hope Valley”, “Pleasant Lake Area” or “Midway-Fairview Area”) is more descriptive of your local area culturally than the name of a single central community, then please give what you feel to be the best commonly accepted name for your local area.

4

It is not obvious that this should be a problem for this article as the mechanisms proposed should be largely universal; however, one might be concerned that the geography/population of the respondents could be systematically different from that of most “Americans.” This, however, does not appear to be the case as far as can be tested with the anonymized data. We used the block-level data of each individual’s home as a proxy for his or her neighborhood and compared the racial composition of each respondent’s neighborhood with that of the city, county, and state over a series of common demographics, detecting only minor variations from what would be expected from a random sample of individuals.

5

The Pew Internet & American Life Survey, December 2010, http://www.pewinternet.org.

6

These algorithms have been implemented in the spatialNetwork package for the R statistical environment (Butts and Almquist 2013; R Development Core Team 2010). The spatialNetwork software package requires that the user choose a particular parametric form of the SIF.

7

Facebook, an online social networking site, offers a rich context in which to study social relations, and it has attracted researchers from many different fields (Lewis et al. 2008; Tufekci 2008; Wimmer and Lewis 2010). Users of the website build detailed personal profiles, including information about demographics, interests, and activities. Beyond personal characteristics, Facebook allows users to publicly declare “friendships” with other users (so-called Facebook friendship). Declared friendships must be confirmed by both parties involved, and therefore constitute a mutual relationship acknowledged by both individuals. Although much debate exists over the nature of Facebook friendships, evidence suggests that Facebook users maintain a significant degree of online/offline integration (Lampe et al. 2006; Wimmer and Lewis 2010). That is, individuals primarily use the service to “friend” others whom they met in an offline context, rather than to seek out friends with whom they have had no offline interaction. The popularity and global penetration of Facebook make it extremely attractive to researchers as a source of rich population-level social interaction data; it is one of the most prominent sources of large-scale social network data. Given its extremely high membership (and daily usage) rates, Facebook users have access to an extremely large, diverse (both spatially and demographically) population of potential social contacts.

8

$d(v_i, v_j) = C_r \cos^{-1}\left[\cos (v_i)_2 \cos (v_j)_2 + \cos\left((v_i)_1 - (v_j)_1\right) \sin (v_i)_2 \sin (v_j)_2\right]$, where $C_r$ is the spherical radius (approximately 6,371 km in the case of the Earth).

9

Note that the Population Vetting Model reproduces the Proximity Model when θ = 0, resulting in a “limited” choice set that is, in fact, the entire choice set. This effect also may be observed in cases in which θ is sufficiently small.

10

One might worry that performing a state-by-state comparison unfairly limits the choice set for the Tie Volume Model. Although this might be the case, we have no evidence that this should be an issue. To check, we took a moderately sized state (Nebraska) and repeated our procedure with a choice set comprising Nebraska and all states contiguous with it (i.e., Wyoming, South Dakota, Iowa, Missouri, Kansas, and Colorado); the model performed approximately identically to the single-state constraint. Notice that many of the adjoining states have large nearby cities which might influence the prediction, e.g., Denver, CO.

11

Results for the baseline models maintain their rank order when given the more limited choice set with a linear decrease in prediction.

References

  1. Almquist ZW. Us census spatial and demographic data in r: The uscensus2000 suite of packages. Journal of Statistical Software. 2010;37(6):1–31. [Google Scholar]
  2. Almquist ZW. Random errors in egocentric networks. Social Networks. 2012;34(4):493– 505. doi: 10.1016/j.socnet.2012.03.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Almquist ZW, Butts CT. Point process models for household distributions within small areal units. Demographic Research. 2012;26(12):593–632. [Google Scholar]
  4. Arentze TA, Timmermans HJ. Representing mental maps and cognitive learning in micro-simulation models of activity-travel choice dynamics. Transportation. 2005;32(4):321–340. [Google Scholar]
  5. Axhausen KW. Activity spaces, biographies, social networks and their welfare gains and externalities: Some hypotheses and empirical results. Mobilities. 2007;2(1):15–36. [Google Scholar]
  6. Baldwin M. [Accessed in 2010];The Common Census Internet Project. 2010 http://www.commoncensus.org.
  7. Barabási A-L, Frangos J. Linked: The New Science Of Networks Science Of Networks. Basic Books; 2002. [Google Scholar]
  8. Bentley GC, Cromley RG, Atkinson-Palombo C. The network interpolation of population for flow modeling using dasymetric mapping. Geographical Analysis. 2013;45:307–323. [Google Scholar]
  9. Berretty PM, Todd PM, Blythe PW. Categorization by elimination: A fast and frugal approach to categorization. Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society; 1997. pp. 43–48. [Google Scholar]
  10. Bivand RS, Pebesma EJ, Gómez-Rubio V. Applied Spatial Data Analysis with R. Springer; New York, NY: 2008. [Google Scholar]
  11. Black WR. Network autocorrelation in transport network and flow systems. Geographical Analysis. 1992;39:268–292. [Google Scholar]
  12. Boots BN. Contact number properties in the study of cellular networks. Geographical Analysis. 1977;9:379–387. [Google Scholar]
  13. Bossard JHS. Residential propinquity as a factor in marriage selection. American Journal of Sociology. 1932;38(2):219–224. [Google Scholar]
  14. Butts CT. Doctoral dissertation in the department of social and decision sciences. Carnegie Mellon University; Pittsburgh, PA: 2002. Spatial Models of Large-Scale Interpersonal Networks. [Google Scholar]
  15. Butts CT. Predictability of large-scale spatially embedded networks. In: Breiger RL, Carley KM, Pattison P, editors. Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers. National Academies Press; D.C: 2003. pp. 313–323. [Google Scholar]
  16. Butts CT. Space and Structure: Models and Methods for Large-scale Interpersonal Networks. Springer; New York, NY: Forthcoming. [Google Scholar]
  17. Butts CT, Acton RM. Spatial modeling of social networks. In: Nyerges Timothy, Helen Couclelis RM., editors. The Sage Handbook of GIS and Society Research. SAGE Publications; Thousand Oaks, CA: 2011. pp. 222–250. [Google Scholar]
  18. Butts CT, Acton RM, Hipp JR, Nagle NN. Geographical variability and network structure. Social Networks. 2012;34:82–100. [Google Scholar]
  19. Butts CT, Almquist ZW. R package version 1.0. 2013. network Spatial: Tools for the Generation and Analysis of Spatially-embedded Networks. [Google Scholar]
  20. Carrasco JA, Miller EJ, Wellman B. How far and with whom do people socialize?: Empirical evidence about distance between social network members. Transportation Research Record: Journal of the Transportation Research Board. 2008;2076(1):114– 122. [Google Scholar]
  21. Daraganova G, Pattison P, Koskinen J, Mitchell B, Bill A, Watts M, Baum S. Networks and geography: Modelling community network structures as the outcome of both spatial and network processes. Social Networks. 2012;34(1):6– 17. [Google Scholar]
  22. Dodds PS, Muhamad R, Watts DJ. An experimental study of search in global social networks. Science. 2003:301. doi: 10.1126/science.1081058. [DOI] [PubMed] [Google Scholar]
  23. Dwass M. Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical Statistics. 1957;28:181–187. [Google Scholar]
  24. Farber S, Páez A, Volz E. opology and dependency tests in spatial and network autoregressive models. Geographical Analysis. 2009;41:158–180. [Google Scholar]
  25. Festinger L, Schachter S, Back K. Social Pressures in Informal Groups: a Study of Human Factors in Housing. Stanford University Press; Palo Alto, CA: 1950. [Google Scholar]
  26. Fischer CS. To Dwell Among Friends: Personal Networks in Town and City. The University of Chicago Press; Chicago, IL: 1982. [Google Scholar]
  27. Flanagin AJ, Metzger MJ. The credibility of volunteered geographic information. GeoJournal. 2008;72:137–148. [Google Scholar]
  28. Freeman LC, Freeman SC, Michaelson AG. On human social intelligence. Journal of Social and Biological Structures. 1988;11:415–425. [Google Scholar]
  29. Gahegan M. On the application of inductive machine learning tools to geographical analysis. Geographical Analysis. 2000;32:113–139. [Google Scholar]
  30. Gigerenzer G, Todd PM, editors. Simple heuristics that make us smart. Oxford University Press; New York, NY: 1999. [Google Scholar]
  31. Gjoka M, Kurant M, Butts CT, Markopoulou A. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. Proceedings of IEEE INFOCOM ‘10; San Diego, CA. 2010. [Google Scholar]
  32. Goodchild MF. Citizens as sensors: the world of volunteered geography. GeoJournal. 2007;69:211–221. [Google Scholar]
  33. Gopal S, Fischer MM. Learning in single hidden-layer feedforward network models: Backpropagation in a spatial interaction modeling context. Geographical Analysis. 1996;28:38–55. [Google Scholar]
  34. Gould P, White R. Mental Maps. 2nd ed. Routledge; London, England: 1986. [Google Scholar]
  35. Grannis R. From the Ground Up: Translating Geography into Community through Neighbor Networks. Princeton University Press; Princeton, NJ: 2009. [Google Scholar]
  36. Griffith D. Geography, graph theory, and the new network science. Geographical Analysis. 2011;43:345–346. [Google Scholar]
  37. Hägerstrand T. Aspects of the spatial structure of social communication and the diffusion of information. Papers in Regional Science. 1966;16(1):27–42. [Google Scholar]
  38. Handcock MS, Jones JH. Likelihood-based inference for stochastic models of sexual network formation. Theoretical Population Biology. 2004;65:413–422. doi: 10.1016/j.tpb.2003.09.006. [DOI] [PubMed] [Google Scholar]
  39. Haynes KE, Fotheringham AS. Gravity and spatial interaction models, volume 2 of Scientific Geography. Sage Publications; Beverly Hills, CA: 1984. [Google Scholar]
  40. Hipp JR, Butts CT, Acton RM, Nagle NN, Boessen A. Extrapolative simulation of neighborhood networks based on population spatial distribution: Do they predict crime? Social Networks. 2013;35:614–625. [Google Scholar]
  41. Howard JA. Social psychology of identities. Annual Review of Sociology. 2000;26:367–393. [Google Scholar]
  42. Hudson JC. A model of spatial relations. Geographical Analysis. 1969;1:260–271. [Google Scholar]
  43. Hutchinson JM, Gigerenzer G. Simple heuristics and rules of thumb: Where psychologists and behavioral biologists might meet. Behavioural Processes. 2005;69:97–124. doi: 10.1016/j.beproc.2005.02.019. [DOI] [PubMed] [Google Scholar]
  44. Jenkins R. Categorization: Identity, social process and epistemology. Current Sociology. 2000;48(7):7–25. [Google Scholar]
  45. Lampe C, Ellison N, Steinfield C. A Face(book) in the crowd: Social searching vs. social browsing. In: Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work; ACM; 2006. pp. 167–170. [Google Scholar]
  46. Latané B, Nowak A, Liu JH. Measuring emergent social phenomena: Dynamism, polarization, and clustering as order parameters of social systems. Behavioral Science. 1994;39:1–24. [Google Scholar]
  47. Lewis K, Kaufman J, Gonzalez M, Wimmer A, Christakis N. Tastes, ties, and time: A new social network dataset using Facebook.com. Social Networks. 2008;30(4):330–342. [Google Scholar]
  48. Mayhew BH. Chance and necessity in sociological theory. Journal of Mathematical Sociology. 1984;9:305–339. [Google Scholar]
  49. Mayhew BH, Levinger RL. Size and the density of interaction in human aggregates. The American Journal of Sociology. 1976;82(1):86–110. [Google Scholar]
  50. McPherson M, Smith-Lovin L, Cook JM. Birds of a feather: Homophily in social networks. Annual Review of Sociology. 2001;27:415–444. [Google Scholar]
  51. Milgram S. The small-world problem. Psychology Today. 1967;1(1):61–67. [Google Scholar]
  52. Mirchandani PB. Locational decisions on stochastic networks. Geographical Analysis. 1980;12:172–183. [Google Scholar]
  53. Morley CD, Thornes JB. A markov decision model for network flows. Geographical Analysis. 1972;4:180–193. [Google Scholar]
  54. Neal Z. Structural determinism in the interlocking world city network. Geographical Analysis. 2012;44:162–170. [Google Scholar]
  55. Okabe A, Yamada I. The K-function method on a network and its computational implementation. Geographical Analysis. 2001;33:271–290. [Google Scholar]
  56. Okabe A, Yomono H, Kitamura M. Statistical analysis of the distribution of points on a network. Geographical Analysis. 1995;27:152–175. [Google Scholar]
  57. Osleeb JP, Ratick SJ. A dynamic location-allocation model for evaluating the spatial impacts for just-in-time planning. Geographical Analysis. 1990;22:50–69. [Google Scholar]
  58. Páez A, Scott DM, Volz E. Weight matrices for social influence analysis: An investigation of measurement errors and their effect on model identification and estimation quality. Social Networks. 2008;30(4):309–317. [Google Scholar]
  59. Peeters D, Thisse JF, Thomas I. Transportation networks and the location of human activities. Geographical Analysis. 1998;30:355–371. [Google Scholar]
  60. Peeters D, Thomas I. Network autocorrelation. Geographical Analysis. 2009;41:436–443. [Google Scholar]
  61. Phillips F, White GM, Haynes KE. Extremal approaches to estimating spatial interaction. Geographical Analysis. 1976;8:185–200. [Google Scholar]
  62. Portugali J, Benenson I, Omer I. Sociospatial residential dynamics: Stability and instability within a self-organizing city. Geographical Analysis. 1994;26:321–340. [Google Scholar]
  63. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2010. [Google Scholar]
  64. Rogerson PA. Estimating the size of social networks. Geographical Analysis. 1997;29:50–63. [Google Scholar]
  65. Schneider B. Extraction of hierarchical surface networks from bilinear surface patches. Geographical Analysis. 2005;37:244–263. [Google Scholar]
  66. Serra D, ReVelle C. Competitive location and pricing on networks. Geographical Analysis. 1999;31:109–129. [Google Scholar]
  67. Shiode S. Analysis of a distribution of point events using the network-based quadrat method. Geographical Analysis. 2008;20:122–139. [Google Scholar]
  68. Smith TR, Pellegrino JW, Golledge RG. Computational process modeling of spatial cognition and behavior. Geographical Analysis. 1982;14:305–325. [Google Scholar]
  69. Spiro ES, Almquist ZW, Butts CT. Working paper. Department of Sociology, University of California; Irvine: 2012. The persistence of division: Geography, institutions, and online friendship ties. [Google Scholar]
  70. Tan S. Challenging citizenship: group membership and cultural identity in a global age. Ashgate Publishing Limited; Burlington, VT: 2005. [Google Scholar]
  71. Taylor PJ. Specification of the world city network. Geographical Analysis. 2001;33:181–194. [Google Scholar]
  72. Tinkler KJ. Bounded planar networks: A theory of radial structures. Geographical Analysis. 1972;4:5–33. [Google Scholar]
  73. Townsley M. Spatial autocorrelation and impacts on criminology. Geographical Analysis. 2009;41:452–461. [Google Scholar]
  74. Travers J, Milgram S. An experimental study of the small world problem. Sociometry. 1969;32(4):425–443. [Google Scholar]
  75. Tufekci Z. Grooming, Gossip, Facebook and Myspace. Information, Communication & Society. 2008;11(4):544–564. [Google Scholar]
  76. Turner JC, Hogg MA, Oakes PJ, Reicher SD, Wetherell MS. Rediscovering the Social Group: A Self-Categorization Theory. Basil Blackwell Ltd; New York, NY: 1987. [Google Scholar]
  77. Tversky A. Elimination by aspects: A theory of choice. Psychological Review. 1972;79(4):281–299. [Google Scholar]
  78. US Census Bureau. Technical report. US Census Bureau; 2001. Census 2000 Summary File 1 United States / prepared by the U.S. Census Bureau. [Google Scholar]
  79. Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998;393(6684):440–442. doi: 10.1038/30918. [DOI] [PubMed] [Google Scholar]
  80. Werner C. Patterns of drainage areas with random topology. Geographical Analysis. 1972;4:119–133. [Google Scholar]
  81. Whalen KE, Páez A, Bhat C, Moniruzzaman M, Paleti R. T-communities and sense of community in a university town: Evidence from a student sample using a spatial ordered response model. Urban Studies. 2012;49:1357–1376. [Google Scholar]
  82. Wimmer A, Lewis K. Beyond and below racial homophily: ERG models of a friendship network documented on Facebook. American Journal of Sociology. 2010;116(2):583–642. doi: 10.1086/653658. [DOI] [PubMed] [Google Scholar]
  83. Wirth L. Urbanism as a way of life. The American Journal of Sociology. 1938;44:1–24. [Google Scholar]
  84. Xie F, Levinson D. Effect of small-world networks on epidemic propagation and intervention. Geographical Analysis. 2009;41:263–282. [Google Scholar]
  85. Xu Z, Sui DZ. Effect of small-world networks on epidemic propagation and intervention. Geographical Analysis. 2009;41:263–282. [Google Scholar]
  86. Yamada I, Thill JC. Local indicators of network-constrained clusters in spatial point patterns. Geographical Analysis. 2007;39:268–292. [Google Scholar]
  87. Zemanian AH. Two-level periodic marketing networks wherein traders store goods. Geographical Analysis. 1980;12:353–372. [Google Scholar]
