Abstract
The School Attendance Boundary Information System is a social science data infrastructure project that assembles, processes, and distributes spatial data delineating K through 12th grade school attendance boundaries for thousands of school districts in the U.S. Although geography is a fundamental organizing feature of K to 12 education, until now school attendance boundary data have not been made readily available on a massive basis and in an easy-to-use format. The School Attendance Boundary Information System removes these barriers by linking spatial data delineating school attendance boundaries with tabular data describing the demographic characteristics of populations living within those boundaries. This paper explains why a comprehensive GIS database of K through 12 school attendance boundaries is valuable, how original spatial information delineating school attendance boundaries is collected from local agencies, and which techniques are used to model and store the data so they provide maximum flexibility to the user community. An important goal of this paper is to share the techniques used to assemble the SABINS database so that local and state agencies can apply a standard set of procedures and models as they gather data for their regions.
INTRODUCTION
The School Attendance Boundary Information System (SABINS) is a spatial data infrastructure project that assembles and distributes a database consisting of Kindergarten through 12th grade public school attendance boundaries for thousands of school districts in the United States. School districts represented in the data collected during the SABINS project contain over half of all school children and these data are available free-of-charge from www.sabinsdata.org. Until the advent of SABINS, the Census Bureau’s administrative units were the “coin of the realm” for researchers and policy makers interested in understanding a core social science issue: the impact of social context on life outcomes. An important goal of the SABINS project is to add to the quality of geographic and demographic data available to social scientists and policy makers who explore such issues. For example, researchers who study how neighborhood context influences educational outcomes, crime, disease, and related social processes are typically confined to using areal units such as Census tracts or block groups. While useful, these administrative geographies have limitations. In particular, they fail to “map onto” socially meaningful boundaries that have significance to the people who live, work and play within them. Nor do Census tracts and block groups have a direct relationship with important local services; in particular, they do not indicate which children have access to public educational facilities. School attendance boundaries, which are the catchment areas or zones that local school districts draw to designate the housing units served by public schools, offer an alternative set of geographies that researchers and policy makers can use to improve the delivery of educational services.
Yet, assembling and harmonizing school attendance boundary geography for hundreds of school districts is simply too expensive and time-consuming for small research teams and daunting for scholars whose expertise lies outside the domain of Geographic Information Systems (GIS). Moreover, school attendance boundary geographies present a variety of unexpected and difficult challenges given the counterintuitive relationships between schools and the attendance boundaries they serve.
The SABINS project overcomes these challenges by creating a data structure that allows the seamless integration of school attendance boundaries with three datasets: (1) school level information from the National Center for Education Statistics’ Common Core of Data (CCD), which is a Federal database describing the name, location and student enrollment of all public schools in the U.S.; (2) complete count population data from the 2010 Census; and (3) detailed socio-demographic data from the American Community Survey (ACS). The goal of this paper is to explain how school attendance boundaries are collected, processed, stored and modeled so that state and local agencies can adopt and modify the procedures and models we describe. The aim is to significantly expand the SABINS database, both spatially and temporally, by describing how the first, massive effort to assemble and model school attendance boundary geography was completed.
The SABINS database contains Kindergarten through 12th grade attendance boundaries for three states (Delaware, Minnesota and Oregon), roughly 600 school districts embedded within a sample of regionally diverse Metropolitan Statistical Areas (MSAs)1, and over 400 of the largest school districts. Figure 1 shows collected data in red. The SABINS data project has also, for the first time, identified all school districts in the U.S. that are served by one and only one elementary, middle and high school. More precisely, there is one and only one school in the district that has a Kindergarten, a 1st grade, a 2nd grade and so on for every grade to grade 12. In these districts, every attendance boundary is coincident with the entire school district boundary. These districts are “de facto” school attendance boundaries since the Census Bureau distributes a national file of school district boundaries in GIS format. Nearly 60 percent of school districts in the U.S. are de facto school attendance boundaries and they enroll roughly 20 percent of the public school students in the U.S. Identifying de facto attendance boundaries demonstrates the scope of work necessary for states to build their own school boundary datasets, which is a long term goal of the SABINS project.
Figure 1.
Geographic Distribution of School Districts Collected During SABINS Project.
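The de facto rule described above can be sketched in a few lines of code. This is a minimal illustration, not the SABINS project's own program: the function name, the dictionary layout, and the toy districts are hypothetical, and the sketch assumes that a district qualifies only when every grade from Kindergarten (coded "00") through 12 is offered by exactly one school.

```python
from collections import Counter

# Grade codes "00" (Kindergarten) through "12": 13 grade levels.
GRADES = [f"{g:02d}" for g in range(13)]


def is_de_facto(schools_in_district):
    """Return True if exactly one school in the district offers each grade.

    `schools_in_district` maps a (hypothetical) school ID to the set of
    grade codes that school offers.
    """
    offered = Counter(
        grade for grades in schools_in_district.values() for grade in grades
    )
    # A grade offered by no school counts as 0, so the district fails.
    return all(offered[grade] == 1 for grade in GRADES)


# Toy districts invented for illustration (not SABINS or CCD records).
single_site = {"S1": set(GRADES)}
three_tier = {
    "E1": set(GRADES[:6]),    # K-5
    "M1": set(GRADES[6:9]),   # 6-8
    "H1": set(GRADES[9:]),    # 9-12
}
two_elementaries = {
    "E1": set(GRADES[:6]),
    "E2": set(GRADES[:6]),    # a second K-5 school breaks the rule
    "M1": set(GRADES[6:9]),
    "H1": set(GRADES[9:]),
}
```

Under these assumptions, `single_site` and `three_tier` are de facto attendance boundaries, while `two_elementaries` is not because two schools offer grades K through 5.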
In addition to this massive data collection effort, another accomplishment of the SABINS project is to organize spatial data delineating school attendance boundaries into a data model that allows users the flexibility to analyze geographic areas that meet their particular needs. The primary aspects of the SABINS dataset consist of the following elements. First, the SABINS data allow extraction of grade-specific school attendance boundary geographies for grades Kindergarten through 12. From a purely geographic standpoint, it is intractable to group school attendance boundaries into a three tier classification system of “elementary,” “middle” and “high” schools. The grade spans these three labels signify vary widely across school districts. For example, a school district might have school attendance boundaries that cover the typical K-5, 6–8 and 9–12 grade spans but that same school district may also have boundaries that cover atypical grade spans such as K-2, 3–6, K-8 and 10–12. The SABINS data are structured so that any school attendance boundary at one grade level that is coincident with a school attendance boundary at another grade level can be linked together with a common identification code. Thus, if the school attendance boundaries for grades K, 1, 2, 3, 4 and 5 are coincident, this “K-5” boundary will have the same identification code for each of these grades. This coding system allows users the flexibility of working with grade-specific boundaries or boundaries that span grades so that users can define “elementary,” “middle,” and “high” schools as they see fit.
The SABINS project has also created a data model that specifies the “many to many” relationship between school attendance boundaries and the schools that supply services to those boundaries. Schools and their corresponding boundaries are closely related but are not the same. While 93 percent of schools serve one boundary, there are deviations from this dominant pattern, including: (1) two or more schools can provide services to the same boundary; (2) two or more schools can provide services to a portion of an overlapping boundary; (3) the same school can provide services to different boundaries at different grade levels—e.g., a school can serve a Kindergarten boundary that covers a different area than the first grade boundary; (4) a school can supply services to school attendance boundaries located in different school districts (or even in different states). Although these variations exist, 93 percent of school and school attendance boundaries have a “one-to-one” relationship. Still, the SABINS project treats schools and school attendance boundaries as separate entities and this allows users to effortlessly link schools with their corresponding geographies. The logical relationships among geographic and physical entities are discussed in detail in the data model section of this paper.
In addition to linking schools with their boundaries, the SABINS project associates school attendance boundaries with Census geography. This enables the SABINS project to efficiently estimate the socio-demographic characteristics of people and households within school attendance boundaries. Every school attendance boundary in the SABINS database is associated with a Census block. This relationship facilitates the summary of block-level population characteristics to: (1) grade-specific school attendance boundaries; (2) school attendance boundaries that are coincident across grade spans; (3) schools that provide services to specific areas. Finally, the SABINS project integrates school attendance boundaries with detailed socio-demographic data from the Census Bureau’s American Community Survey. American Community Survey data are summarized to block groups—but block groups do not nest within school attendance boundaries. To overcome the misalignment between these geographies, the SABINS project uses a straightforward spatial allocation technique to estimate detailed population characteristics within school attendance boundaries.
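The two summaries described above can be illustrated with a short sketch. The paper does not specify the allocation technique at this point, so the weighting used here is an assumption: each block group's count is split across boundaries in proportion to the block population that falls inside each boundary. All identifiers and numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical block records: (block_id, block_group_id, boundary_id, population).
blocks = [
    ("b1", "bg1", "A", 100),
    ("b2", "bg1", "A", 50),
    ("b3", "bg1", "B", 50),
    ("b4", "bg2", "B", 200),
]


def block_totals(blocks):
    """Sum complete-count block populations to each attendance boundary."""
    totals = defaultdict(int)
    for _block, _bg, boundary, pop in blocks:
        totals[boundary] += pop
    return dict(totals)


def allocate_block_group_counts(blocks, bg_counts):
    """Split each block group's count across the boundaries it overlaps.

    Assumed weight: the share of the block group's block population that
    lies inside each boundary (one of several possible allocation rules).
    """
    bg_pop = defaultdict(int)
    overlap_pop = defaultdict(int)
    for _block, bg, boundary, pop in blocks:
        bg_pop[bg] += pop
        overlap_pop[(bg, boundary)] += pop

    estimates = defaultdict(float)
    for (bg, boundary), pop in overlap_pop.items():
        if bg_pop[bg] > 0:  # skip unpopulated block groups
            estimates[boundary] += bg_counts[bg] * pop / bg_pop[bg]
    return dict(estimates)


# Hypothetical ACS-style counts per block group.
bg_counts = {"bg1": 40, "bg2": 10}
totals = block_totals(blocks)
estimates = allocate_block_group_counts(blocks, bg_counts)
```

In this toy case, block group bg1 straddles boundaries A and B, so three quarters of its count goes to A and one quarter to B, while bg2 falls entirely within B.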
THE USEFULNESS OF SCHOOL ATTENDANCE BOUNDARIES
Some scholars argue that Census tracts and other administrative geography (e.g., block groups and zip codes) are questionable proxies for neighborhoods (Sampson et al. 2002) and they are clearly not interchangeable with school attendance boundaries themselves. Yet, due to a lack of school boundary geography, researchers are often forced to make the simplifying assumption that the Census tract or block group inside of which a school is located is an adequate proxy for a school’s attendance boundary (Card and Kreuger 1992; Entwisle et al. 1997; Reardon and Yun 2001; Frankenburg et al. 2003; Logan and Oakley 2004; Owens 2010). This is problematic for a variety of reasons. Most importantly, the demographic composition of a school attendance boundary is not perfectly correlated with the demographic composition of proxy areas.
As shown in Table 1, across Kindergarten boundaries in Delaware, the correlation coefficient between the percentage of non-Hispanic black people in Census tracts in which schools are located and the percentage of non-Hispanic black people in schools’ actual attendance boundaries is .87; this correlation is .41 for grade 7 boundaries and .58 for grade 12 boundaries.
Table 1.
Correlation Coefficients between Neighborhood racial composition across geographies by grade levels
| Percent Non-Hispanic African American | (1) | (2) | (3) |
|---|---|---|---|
| Kindergarten | | | |
| (1) School Boundary | 1.00 | | |
| (2) Census Tracts | 0.87 | 1.00 | |
| (3) Block Groups | 0.82 | 0.94 | 1.00 |
| Grade 7 | | | |
| (1) School Boundary | 1.00 | | |
| (2) Census Tracts | 0.41 | 1.00 | |
| (3) Block Groups | 0.34 | 0.90 | 1.00 |
| Grade 12 | | | |
| (1) School Boundary | 1.00 | | |
| (2) Census Tracts | 0.58 | 1.00 | |
| (3) Block Groups | 0.66 | 0.92 | 1.00 |
Number of Kindergarten observations is 84; grade 7 is 33 and grade 12 is 23. Data source: 2010 Census Redistricting Data (Public Law 94-171) Summary File 1, DE prepared by the U.S. Census Bureau, 2011.
Similar correlations exist between block-groups and school attendance boundaries. The data are also consistent for other racial groups (tables upon request). Beyond the imperfect correlations between actual and proxy zones, some Census tracts or block groups do not contain a school while others contain multiple schools. Using proxy zones such as Census tracts has the result of counting populations in some areas multiple times while failing to include populations in areas that do not contain a school. Thus, using tracts and block groups leads to inaccuracies when tabulating population totals for an entire school district.
While school attendance boundaries provide a more accurate representation of the population characteristics in schools’ attendance boundaries, not all students who live within school attendance boundaries attend their local public schools. As shown in Table 2, in the U.S. during 2010, 10.7 percent of students attended private schools and 5.7 percent attended magnet or charter schools. Thus, it is important to emphasize that public schools provide educational services to a fixed area—and it is possible to determine the characteristics of students who live in the area a school serves—but this information is an imperfect indicator of the characteristics of children who are actually enrolled in a school.
Table 2.
Percent of students enrolled in public school by grade span, 2009
| | Kindergarten | 1 to 4 | 5 to 8 | 9 to 12 | Total |
|---|---|---|---|---|---|
| % in Private | 13.2 | 10.7 | 10.8 | 9.9 | 10.7 |
| % in Magnet | 2.0 | 2.1 | 2.5 | 3.8 | 2.8 |
| % in Charter | 3.2 | 3.0 | 3.0 | 2.7 | 2.9 |
| Total | 18.4 | 15.8 | 16.3 | 16.4 | 16.4 |
Sources: Percent public school enrollment from the 2009 American Community Survey; Percent magnet and charter school enrollment derived from 2009–2010 school-year Common Core of Data (Chen 2011).
Despite this limitation, there are three important points to emphasize. First, Census data distinguish between public and private school children and this makes it possible to create demographic profiles of school attendance boundaries for public school children. Researchers have already used the SABINS data to distinguish between public and private school students who live in a school attendance boundary (The National Academy of Sciences 2009). Second, although school attendance boundaries are “permeable,” they are much better than substituting Census tracts, block groups and zip codes as proxy areas for school attendance boundaries. Third, the geography and corresponding socio-demographic data are important in and of themselves. For example, scholars and policy makers can use school attendance boundary data to investigate unique questions such as the extent to which local districts delineate their boundaries to reduce or contribute to racial and economic segregation (Heckman and Taylor, 1969); to conduct studies of public health and epidemiology (Elliott and Wartenberg 2004; Diez Roux 2001; Krieger 2006; Krieger et al. 2002; Shai 2006; Winkleby and Cubbin 2003; Xue et al. 2009); to study the effects of school quality on housing values (Black 1999; Brunner et al. 2002; Brunner et al. 2001; Downes and Zabel 2002; Ioannides 2004; Weimer and Wolkoff 2001); and to understand the factors that lead to public and private school choice (Saporito 2009).
Specific policy and planning applications using school attendance boundaries include building safe walking and biking routes to school (Huang and Hawley 2009); estimating public school populations eligible for subsidized school meals (The National Academy of Sciences 2009); making school enrollment projections (Edwards and Ehrenthal 2008); and planning efficient bus routes and siting new school construction (Lemberga and Church 2000). Thus, a part of the utility of school attendance boundaries is the geography itself.
BUILDING THE SCHOOL ATTENDANCE BOUNDARY INFORMATION SYSTEM
Perhaps the most daunting and uncertain task in building the SABINS database is collecting the source information from hundreds of local school districts and county GIS offices. One goal of this paper is to document the feasibility of collecting and compiling this information. The very largest school districts (those that enroll 20,000 or more students) have scores or hundreds of K-12 school attendance boundaries, but this information exists in a wide variety of formats. Most school districts post school attendance boundary information on their web pages to inform parents which schools their children should attend. This information is typically displayed or described in one of four formats: (1) as static, cartographic images (e.g., PDF images) displaying school attendance boundaries and the streets and other line features that school attendance boundaries follow; (2) as interactive, web-enabled maps that allow parents to pan and zoom to areas within the school district; (3) as narrative or legal descriptions that verbally describe the boundaries; (4) as web-enabled “address locators” that allow parents to enter their residential address into a search engine. All of this information is used to digitize school attendance boundaries.
Not all school districts post school attendance boundary information on their web pages. In such cases, school districts have an “in-house” document or map that they do not distribute. Such documents are public, however, and most school districts supply their information when asked informally. If a district rejects an informal request, a formal public records request is filed with a school district or county GIS office. Most state laws entitle “anyone” or “any citizen” to request copies of public information. Typically, a person who requests information is not required to specify the reason they want the information and public officials are often barred from asking people why they are making a request. Most states specify that public information can be used for any purpose. The list of documents included in most state public records laws is expansive. For example, New Mexico’s Statute reads: “All documents, papers, letters, books, maps, tapes, photographs, recordings and other materials, regardless of physical form or characteristics, that are used, created, received, maintained or held by or on behalf of any public body and relate to public business, whether or not the records are required by law to be created or maintained” (New Mexico Statute § 14-2–6(E), NMSA 1978). These broad rights generally entitle the public to copies of GIS files and other related information. The willingness of school districts to share data—bolstered with legal entitlement to the data—allows the SABINS project to acquire nearly 100 percent of the information that it seeks.
Digitizing Procedures
The SABINS project digitizes “analog” information such as static images of school attendance zones, narrative/legal descriptions of school attendance boundaries, or lists of addresses that schools serve. Standard procedures are used to digitize this information. Given the national scope of the SABINS project, coupled with the fundamental goal of creating demographic estimates describing the characteristics of persons, families and households within school attendance boundaries, the primary base layer used to digitize school attendance boundaries consists of the line features from the Census Bureau’s 2010 Topologically Integrated Geographic Encoding and Referencing System (2010 TIGER/Line files). These files are available at http://www.census.gov/geo/www/tiger/tgrshp2010/tgrshp2010.html. The TIGER/Line files consist of streets, railways, hydrographic features, utility lines, fence lines, and walking paths. In addition to delineating physical line features, the TIGER/Line files also contain legal boundaries such as those of municipalities and school districts. The advantages of using TIGER/Line files are their ready availability and the fact that the metadata document the positional accuracy of lines for each county.
Using these nation-wide line features also ensures that all school attendance boundaries have a common spatial reference system and clean topology (no gaps and overlaps between adjacent polygons). To ensure clean topology, line features that comprise the outlines of a school attendance boundary are “geotraced.” Geotracing adopts all of the vertices of the TIGER/Line features such that school attendance boundaries follow them precisely.
Although most local school districts delineate their school attendance boundaries to follow line features represented in the TIGER/Line files, some portions of some school attendance boundaries do not follow roadways or other line features. This occurs when a portion of a school attendance boundary encompasses addresses on both sides of a street. In such cases, various data sources are used to supplement the digitizing process. In particular, Esri’s “ArcGIS Online” database provides three important resources. The first is the Digital Orthophoto Quarter Quad (DOQQ) aerial imagery (at a resolution of one meter), which identifies features of interest (typically housing units). The second is the “World Street Map,” which often contains parcel outlines (sans attributes); this imagery is used as a rough guide for digitizing along the periphery of parcel boundaries. Finally, actual parcel data for many counties are also available via ArcGIS Online and these actual boundaries are used as a base layer when available. If a school attendance boundary encompasses houses on both sides of a street, these three resources are used to ensure that school attendance boundary polygons do not cut across housing units.
Although these digitizing methods produce a set of electronic GIS files that follow a consistent system, in most situations it is desirable to obtain original electronic GIS files from a school district. Most of the very largest school districts in the country (those that enroll over 20,000 students) create digital GIS files in-house or contract a consulting firm to make them. In most cases, school districts share these data upon request (typically in the form of an Esri shapefile or geodatabase, a MapInfo file or a Computer Aided Drafting [CAD] file).
The quality of these GIS files varies widely. In ideal cases, school districts or county GIS offices use cadastral data to digitize school attendance boundaries. In the experience of the SABINS project, a few school districts and most county GIS offices use local cadastral data to digitize their boundaries and some enforce topological rules on their spatial data, typically “no gaps” and “no overlaps” among polygons. In other cases, the quality of the digital GIS files is poor: the lines of school attendance boundaries do not carefully follow visible line features, and gaps and overlaps between boundaries are extensive. Despite the lower quality of some GIS files obtained from local agencies, the benefits of collecting digital GIS files outweighed the work of digitizing polygons from scratch. Given this, the SABINS project made efforts to acquire electronic GIS files: approximately half of the source information obtained from local districts arrived in Esri format, CAD files, or MapInfo files.
LOGICAL GIS DATA MODEL
The primary entities in the SABINS database consist of school attendance boundaries, the public school or schools that provide educational services to each attendance boundary, and the Census blocks that lie within each school attendance boundary (Figure 2). Related entities include the school districts that contain school attendance boundaries and the number of private, charter and magnet schools located within school attendance boundaries.
Figure 2.
Logical Data Model Representing Relationships Among School Boundary Entities
There is a temptation to think of schools and school attendance boundaries as the same entity, and many local districts simply assign a school’s identification code—usually in the form of a school name—to the boundary to which it supplies services. However, schools and their corresponding boundaries do not have a one-to-one relationship. As shown in Figure 3, some noteworthy relationships in the model include:
Figure 3.
Types of School Attendance Boundaries
(1) Two or more schools that provide services to the same school attendance boundary. This scenario is depicted by the green polygon. Children who live within the green shaded school attendance boundary attend either Adams or Taylor.
(2) Two or more schools that provide services to an overlapping portion of two or more “parent” school attendance boundaries. This scenario is depicted by the light red boundaries (i.e., the parent boundaries) and the dark red boundary (i.e., the partially overlapping “child” boundary). Children who live in the dark red polygon have the option of attending either Washington or Lincoln. Children who live in the light red boundary labeled “Washington” must attend Washington school and children who live in the boundary labeled “Lincoln” must attend Lincoln school.
(3) A school that provides services to school attendance boundaries in different school districts. This scenario typically occurs in rural areas in the higher grades (typically grades 9 to 12). In this scenario, each school district’s polygon is preserved but the school that provides services to both school attendance boundaries is associated with each polygon.
These relationships are counterintuitive to the commonplace notion that one school serves one school attendance boundary, or that every school only serves children within a single school district. To accommodate these relationships, school attendance boundaries are assigned identification codes that are separate from but linked with school identification codes. School identification codes are derived from the U.S. Department of Education’s Common Core of Data (CCD) (U.S. Department of Education 2010). The CCD contains a unique identification code for each of the roughly 100,000 public schools in the U.S. A relational table links schools in the CCD table with school attendance boundaries in the geographic file. Note that the CCD provides a limited set of variables including school names, grade spans, student enrollment by grade and race and the number of children who receive a free or reduced-price lunch.
The related tables address situations (1) through (3) above. Situation (1) is addressed in a straightforward manner by building a relational table that links schools and school attendance boundaries. A public school student who lives in one of these “optional” school attendance boundaries can choose to enroll in one of the neighborhood schools that supply services to the area. Situation (2) occurs when portions of school attendance boundaries overlap, as shown in the boundaries shaded in red in Figure 3. In this scenario, the overlapping portion of the two school attendance boundaries is treated as a separate polygon with a unique identification code. Two or more public schools provide educational services to the area and children who live in this polygon can select one of the schools that supply services to it. Thus, as shown in the logical data model, the relationship between schools and boundaries is “many to many.” Situation (3) occurs when a school supplies services to children in school attendance boundaries in different school districts. This is a rare exception and usually occurs in grades nine and above and in rural areas. Students who live in one school district attend a high school that falls under the administrative jurisdiction of another school district.
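The relational table described above can be sketched as a list of school-to-boundary pairs. The school names Adams, Taylor, Washington and Lincoln come from Figure 3; the boundary codes, the "COUNTY_HIGH" school, and the function name are hypothetical and are not the identifiers used in the SABINS files.

```python
from collections import defaultdict

# Hypothetical link table rows: (school_id, boundary_id).
links = [
    # Situation (1): two schools serve the same boundary.
    ("ADAMS", "B1"),
    ("TAYLOR", "B1"),
    # Situation (2): parent boundaries B2 and B3, plus the overlapping
    # portion stored as its own polygon B4 served by both schools.
    ("WASHINGTON", "B2"),
    ("LINCOLN", "B3"),
    ("WASHINGTON", "B4"),
    ("LINCOLN", "B4"),
    # Situation (3): one school serves boundaries in two districts.
    ("COUNTY_HIGH", "D1_B9"),
    ("COUNTY_HIGH", "D2_B9"),
]


def index_links(links):
    """Build lookups in both directions of the many-to-many relationship."""
    schools_by_boundary = defaultdict(set)
    boundaries_by_school = defaultdict(set)
    for school, boundary in links:
        schools_by_boundary[boundary].add(school)
        boundaries_by_school[school].add(boundary)
    return dict(schools_by_boundary), dict(boundaries_by_school)


schools_by_boundary, boundaries_by_school = index_links(links)
```

Keeping the pairs in their own table means that neither the school record nor the boundary polygon needs to change when a new school-to-boundary assignment is added.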
Although these three situations necessitate a series of normalized tables that relate schools with the boundaries to which they supply services, there are other compelling reasons to treat schools and attendance boundaries as separate but related entities. A single school can serve one boundary but, over time, the geography of the boundary that a school serves can change. Thus, while the school remains the same, each of its unique school attendance boundaries can have a separate identification code to indicate a temporal change in geographic coverage.
Creating Grade-specific Geography
While the entities in the SABINS data model are relatively few compared with some spatial data models, this masks some of the complex spatial and tabular relationships among these entities. In particular, school attendance boundaries are typically thought of as three layers consisting of “elementary,” “middle” and “high” school polygons. The terms “elementary,” “middle” and “high” have no standard grade-ranges and are merely convenient labels for attempting to describe schools that provide services to “children,” “adolescents” and “teenagers.” Indeed, there are 91 possible grade spans that a school attendance boundary can cover (e.g., grade K, grades K to 1; grades K to 2 and so on to grade 12) and there are school attendance boundaries that cover most of these 91 possible grade spans.
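The figure of 91 can be checked by counting contiguous runs of grades among the 13 grade levels from K through 12. As a brief worked count, assuming every grade span is an unbroken range from a lowest grade to a highest grade, there are 13 one-grade spans, 12 two-grade spans, and so on down to a single 13-grade span:

$$
\sum_{n=1}^{13} (14 - n) = 13 + 12 + \cdots + 1 = \frac{13 \times 14}{2} = 91
$$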
To overcome this challenge, the SABINS database contains school attendance boundaries by grade level. Users have access to grade-specific boundaries or boundaries that can be re-assembled to cover grade spans2. A simple example illustrates the challenges presented when trying to configure “elementary” school attendance boundaries. There are many cases in which some sixth grade school attendance boundaries are embedded in an “elementary” school layer while other sixth grade boundaries are embedded in the “middle” and the “high” school layers. If a school district represents its sixth grade school attendance boundaries in three separate layers, it is not possible to determine the school assignment of all sixth grade students by examining only the elementary school geography. In this situation, it is necessary to create a separate school attendance boundary layer for sixth grade by merging the sixth grade polygons from the elementary, middle and high school layers. This principle holds true for all grades. Thus the SABINS project creates 13 geographic layers, one for each grade K through 12.3
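The restructuring described above can be sketched as follows. This is an illustrative outline rather than the project's own processing script: the polygon names, grade ranges, and function name are hypothetical, and geometry is omitted so that only the regrouping of polygons into 13 grade layers is shown. Grade "00" stands for Kindergarten.

```python
# Grade codes "00" (Kindergarten) through "12".
GRADES = [f"{g:02d}" for g in range(13)]

# Hypothetical source polygons from a district's three tiered layers:
# (source layer, polygon name, lowest grade, highest grade).
source_polygons = [
    ("elementary", "Adams", "00", "06"),   # this elementary zone keeps grade 6
    ("elementary", "Taylor", "00", "05"),
    ("middle", "Lincoln", "06", "08"),
    ("high", "Washington", "09", "12"),
]


def build_grade_layers(polygons):
    """Return {grade: [polygon names]} with one layer per grade K-12."""
    layers = {grade: [] for grade in GRADES}
    for _layer, name, low, high in polygons:
        first, last = GRADES.index(low), GRADES.index(high)
        # Copy the polygon into every grade layer it covers.
        for grade in GRADES[first:last + 1]:
            layers[grade].append(name)
    return layers


layers = build_grade_layers(source_polygons)
```

In this toy district the sixth grade layer draws one polygon from the elementary layer (Adams) and one from the middle layer (Lincoln), which is the situation the text describes.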
While it is true that every area within a school district must be covered by a school attendance boundary for each grade, many school attendance boundaries are coincident across grade spans. Indeed, in almost all school districts, the second grade attendance boundaries are coincident with third grade attendance boundaries. It is valuable to know whether boundaries are coincident across grade levels and this information is preserved in the primary key of each school attendance boundary polygon. The primary key consists of four fields: (1) LEAID, which is a unique identification code for every school district and is derived from the U.S. Census Bureau; (2) BOUNDARYID, which is the unique identification code for every school attendance boundary within a district; (3) YEAR is the school year for the data (where a value of “10” is the 2009–2010 school year) and (4) GRADE, where a value of “00” represents Kindergarten. If a school attendance boundary is the same across grades (for the same year) the values for the fields LEAID and the BOUNDARYID are the same. For example, if all of the school attendance boundaries in a given district are coincident for grades K through 5, then values for LEAID, BOUNDARYID and YEAR will be the same; if a specific Kindergarten boundary remains coincident over time (i.e., the school district has not “rezoned” a K boundary), then the LEAID, BOUNDARYID and GRADE will be the same across years.
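A short sketch shows how the four-field key lets users reassemble grade spans. The field order follows the text (LEAID, BOUNDARYID, YEAR, GRADE); the district code "9900001", the boundary codes, and the function name are invented for illustration and do not come from the SABINS files.

```python
from collections import defaultdict

# Hypothetical grade-specific records: (LEAID, BOUNDARYID, YEAR, GRADE).
# YEAR "10" is the 2009-2010 school year; GRADE "00" is Kindergarten.
records = (
    [("9900001", "001", "10", f"{g:02d}") for g in range(0, 6)]
    + [("9900001", "002", "10", f"{g:02d}") for g in range(6, 9)]
)


def grade_spans(records):
    """Group grades that share LEAID, BOUNDARYID and YEAR.

    Records with the same three values describe one coincident boundary,
    so the grouped grades give the span that boundary covers.
    """
    spans = defaultdict(list)
    for leaid, boundary_id, year, grade in records:
        spans[(leaid, boundary_id, year)].append(grade)
    return {key: sorted(grades) for key, grades in spans.items()}


spans = grade_spans(records)
```

Here boundary "001" is recovered as a K-5 span and boundary "002" as a 6-8 span, while each grade-specific record remains available on its own.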
Attributes describing the objects stored in the PY_SABINS polygon layer also include a series of binary fields indicating whether or not a school attendance boundary is coincident with the entire school district, is a multi-part polygon, or is an open enrollment boundary in which children can select from among two or more schools that provide services to the boundary. A special case code also indicates whether or not a boundary is served by multiple schools or whether a portion of a boundary overlaps with a portion of another boundary.
PROCESSING STEPS TO IMPLEMENT THE DATA MODEL
When GIS data of school attendance boundaries are available from local agencies, it is preferable to process these files rather than digitize the school attendance boundaries from paper maps or narrative descriptions. The primary advantage of obtaining existing GIS data is that it saves time (particularly for the largest school districts). While using GIS-ready files saves time, some of the data acquired have the following deficiencies: (1) there are no topological rules established or enforced in the geography and some school districts may have hundreds of gaps and overlaps between their school attendance polygons; (2) some portions of “elementary,” “middle” and “high” school attendance boundaries should share the same line segments but they do not and, as discussed below, this can cause problems in associating Census blocks to school attendance boundaries across grade-specific boundaries; (3) school districts need to be edge-matched to eliminate gaps and overlaps between them; (4) most school attendance boundaries do not have an identification code that is distinct from the school or schools that provide services to the school attendance boundary; (5) multipart polygons are typically treated as single part polygons; (6) the files do not necessarily identify the grades that a particular school attendance boundary serves. Despite these shortcomings, it is still preferable to obtain and “clean” the digitized and attributed GIS data rather than digitize the boundaries from scratch. It is also easier to process data that are digitized by SABINS staff, even though these data do not have the deficiencies listed above.
The SABINS project has written a series of custom GIS programs that quickly, consistently and accurately correct the problems associated with school attendance boundary geography. Indeed, these scripts are used to process all GIS data, whether or not they were digitized by SABINS staff or were obtained directly from local agencies. The first processing step consists of assigning all school attendance boundaries a common set of fields in the attribute table. All school attendance boundaries are assigned a field called “source name” that contains the “identification” code that school districts assign each school attendance boundary; most school districts identified their school attendance boundary with a single field that contains the name(s) of the school(s) that provide services to a school attendance boundary. This “source name” was preserved throughout all processing steps as a means of quality assurance. (Note, any single-part polygons that are supposed to be the same—as indicated by the same source name—are dissolved into multipart polygons.)
Because local agencies typically conflate schools with school attendance boundaries, the second processing step is to assign each school attendance boundary an identification code that is preserved throughout all processing stages. This identification code preserves the original geography obtained from school districts. It is created by concatenating three separate fields. The first is a school district identification code number, called the Local Education Agency ID (or LEAID). The LEAID is derived from the U.S. Department of Education’s CCD. The second field is a school level field, where the character “E” is assigned to elementary school polygons while the characters “M” and “H” are assigned to middle and high school attendance boundaries, respectively. (Some school districts have up to five layers of school attendance boundaries, such as “pre-school” and “intermediate” layers; these are labeled with a “P” and “I” respectively.) The third field is a sequential set of numbers that are automatically generated in the attribute table of ArcMap 10’s shapefiles and feature classes (and stored as a static field in the attribute table). Concatenating these three fields allows SABINS to reproduce the original “pre-school,” “elementary,” “intermediate,” “middle,” and “high” school polygon layers that were delineated by local school districts, while the identification code distinguishes these original boundaries from the schools that supply services to them.
Associating schools with school attendance boundary polygons
School attendance boundaries are then assigned a series of fields that contain the unique identification codes of the schools that provide services to them. The school identification codes are derived from the CCD and, as noted, this identification field is called NCESSCH. The process of assigning the NCESSCH school identification codes to every polygon is completed with “human intelligence” using a custom tool that allows users to identify the name of the school that serves a school attendance boundary (which are typically in the attribute table of a GIS layer) and find its corresponding school name in the CCD. For example, the polygon layer may have the school name “John P. Jones” in the attribute table while the CCD could have the school name “John Paul Jones Elementary School.” The NCESSCH identification code from the CCD is populated in the attribute table of school attendance boundary polygon layer with a point-and-click of a mouse. (The school name from the CCD is also populated in a separate field of the attribute table to ensure the accuracy of the school assignment.) If two or more schools serve an attendance boundary, their corresponding NCESSCH identification codes are stored in a subsequent series of separate fields.
Each school attendance boundary is then assigned 13 fields for grades Kindergarten through 12. If a school attendance boundary is contained within an elementary school layer, the fields are named “E_00” to “E_12”; middle school layers are given field names of “M_00” to “M_12” and high school layers are given “H_00” to “H_12.” These fields are set to null initially. Each unique “elementary,” “middle” and “high” school attendance boundary is then joined with the CCD (by the NCESSCH code). If the CCD indicates that the student enrollment for a grade is greater than 0, then the school attendance boundary is assigned the NCESSCH code for that grade. For example, if an elementary school attendance boundary is served by a school that has at least one student enrolled in grades Kindergarten through five, the E_00 through E_05 fields are assigned NCESSCH codes while the E_06 through E_12 fields remain null. This is accomplished with scripts. A typical middle school will have NCESSCH codes in the M_06 through M_08 fields while the remaining fields remain null; similarly, a typical high school will have NCESSCH codes in the H_09 through H_12 fields.
If a grade-specific boundary is served by more than one school, then the NCESSCH identification codes of all schools that serve the boundary are stored in a comma delimited list (where the identification codes are sorted from low to high). For example, a school attendance boundary in the elementary layer can be served by two schools: Jones School where NCESSCH is “1200200001” and Montgomery School where NCESSCH is “1200200002”. The CCD indicates that the Jones School offers grades 00 to 06 and Montgomery School offers grades 04 to 06. This results in populating grade-specific fields, as shown in Table 3. At this stage, another field is added that indicates the range of grades of the schools providing services to a school attendance boundary. In the example shown in Table 3, this field contains 0,1,2,3,4,5,6. These steps are completed with custom Python scripts to ensure speed and accuracy.
Table 3.
Assignment of Grade-specific School Identification Codes to the “Elementary” School Layer.
| Name of Grade-specific Field: | NCESSCH Codes |
|---|---|
| e_00 | 1200200001 |
| e_01 | 1200200001 |
| e_02 | 1200200001 |
| e_03 | 1200200001 |
| e_04 | 1200200001,1200200002 |
| e_05 | 1200200001,1200200002 |
| e_06 | 1200200001,1200200002 |
| e_07 | Null |
| e_08 | Null |
| e_09 | Null |
| e_10 | Null |
| e_11 | Null |
| e_12 | Null |
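The logic that produces Table 3 can be sketched as follows. This is an illustrative reconstruction, not the project's own script: the function names `grade_fields` and `offered_grades` are hypothetical, and the input is assumed to be a mapping from each NCESSCH code to the set of grades in which the CCD reports enrollment greater than zero.

```python
def grade_fields(level_prefix, schools):
    """Build the 13 grade-specific fields for one attendance boundary.

    level_prefix: "E", "M" or "H" (the layer the boundary came from).
    schools: dict mapping an NCESSCH code (string) to the set of grades
             (integers 0-12) with enrollment greater than zero in the CCD.
    """
    fields = {}
    for grade in range(13):
        # NCESSCH codes have equal length, so a string sort is low-to-high.
        codes = sorted(code for code, grades in schools.items() if grade in grades)
        fields[f"{level_prefix}_{grade:02d}"] = ",".join(codes) if codes else None
    return fields


def offered_grades(fields):
    """Comma-delimited list of grades that have at least one school code."""
    grades = sorted(int(name[-2:]) for name, value in fields.items() if value is not None)
    return ",".join(str(g) for g in grades)


# The Jones and Montgomery example from the text.
example = grade_fields("E", {
    "1200200001": set(range(0, 7)),  # Jones School, grades 00 to 06
    "1200200002": {4, 5, 6},         # Montgomery School, grades 04 to 06
})
```

With these inputs, `example["E_04"]` holds both codes in sorted order, the fields for grades 7 through 12 stay null, and `offered_grades(example)` yields the grade-range string described above.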
Storing the Separate Layers as a Union
Once each elementary, middle and high school attendance boundary in each of the layers is assigned school identification codes for specific grades, the layers are “unioned” in GIS. This process results in creating a new set of polygons (stored in a single layer) that represent the unique intersections among the polygons from the original layers. This overlay process quickly eliminates gaps and overlaps between polygons (or holes within polygons) and conflates line features from the elementary, middle and high school layers. The conflation process helps ensure that any lines that should be the same are the same.
Correcting the topological flaws in the polygon files is made possible with tolerance settings in GIS software. During the union process a user can specify a tolerance setting (expressed as units of distance) that will automatically snap vertices of adjacent polygons together (within the same polygon layer or across different polygon layers). The snapping process moves two vertices that are within a specified tolerance level to the same location. If the tolerance setting is 30 feet and 2 vertices are within 30 feet of each other, both vertices will be moved to the midpoint of their original locations. The tolerance setting affects a layer’s precision since it specifies the amount of coordinate movement allowed. Any vertices that are within the distance of the tolerance setting are moved and this can have the unwanted effect of simplifying features if a tolerance level is set too high. The SABINS project uses a tolerance level of no more than 30 feet.
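The midpoint rule for a single pair of vertices can be illustrated with a short Python function. This is only a sketch of the rule as stated in the paragraph above; GIS software applies tolerances to whole clusters of vertices, and the function name `snap_pair` is hypothetical.

```python
import math


def snap_pair(p, q, tolerance):
    """Snap two vertices to their midpoint when they lie within the tolerance.

    p, q: (x, y) coordinates in the same linear unit as the tolerance (e.g. feet).
    Returns the (possibly moved) pair of vertices.
    """
    if math.dist(p, q) <= tolerance:
        midpoint = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
        return midpoint, midpoint
    return p, q  # farther apart than the tolerance: leave both untouched
```

For instance, with a 30 foot tolerance, vertices 20 feet apart collapse to a shared point 10 feet from each original location, whereas vertices 40 feet apart are left in place and can later give rise to a sliver.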
The union process is essential for enforcing topological rules. Correcting gaps and overlaps in the elementary, middle and high school attendance boundary files separately is insufficient. A simple example illustrates the challenge. If some sixth grade school attendance boundaries are represented in the “elementary” school layer and others are represented in the “middle” and “high” school layers, creating a single sixth grade file requires merging boundaries from all three layers. Thus, it is not sufficient to enforce topology for each of the three layers separately—and then merge the sixth grade polygons that originated from each layer. Simply merging the sixth grade polygons derived from the (topologically corrected) elementary, middle and high school layers would still lead to gaps and overlaps between the sixth grade polygons that originated from three layers. The union process eliminates this problem by conflating line features from the three layers.
Although the union immediately eliminates gaps, overlaps and small holes within polygons in the same layer, and conflates line features across layers, a union among the three layers often creates new “sliver” polygons. Slivers are small areal features that commonly occur along the borders of polygons following the union of two or more polygon layers. Slivers primarily arise from gaps and overlaps that exceed the tolerance level set when a union is created, and from line features in the original overlain layers that should have been coincident but were not.
If the tolerance setting was 30 feet, slivers will be formed between school attendance boundaries that did not have vertices within 30 feet of each other. Eliminating slivers essentially conflates line features and is accomplished in two ways. First, any sliver that was not covered by the original “elementary,” “middle,” or “high” school layer is identified. Such polygons are almost always on the periphery of a school district. These slivers are identified by visual inspection and eliminated by merging them with larger polygons, which ensures that all K-12 boundaries cover the exact same area of the school district. The final step is to identify the remaining slivers manually and merge them with neighboring school attendance boundary polygons. (It is necessary to “explode” multipart polygons into single-part polygons in order to identify and eliminate slivers.)
In addition to enforcing topology quickly, the union of the three layers ensures that the entire area within a school district (i.e., all intersections among the original elementary, middle and high school polygons) is covered by every grade K through 12. Recall that temporary identification codes were assigned to the original school attendance boundaries and these school attendance boundaries were linked with grade-specific school identification codes (i.e., E_00 to E_12, M_00 to M_12, and H_00 to H_12). Custom GIS scripts ensure that every intersection of the union formed with the original layers is served by at least one school for every grade. Using grade six as an example, at least one school identification code must be present in the sixth-grade fields derived from the elementary, middle and high school layers. That is, at least one of the E_06, M_06 or H_06 fields must have an NCESSCH code. This ensures that a given intersection is served by at least one school for each grade.
If a specific polygon in the union file is missing a particular grade (e.g., the fields E_06, M_06, and H_06 are all null) this indicates that a boundary is not served by a particular grade. This is an error. This error occurs when inaccurate information for a school’s grade range is provided by the CCD. The SABINS project relies on the CCD to determine the grades covered by a school attendance boundary. When the CCD is inaccurate, the missing information is obtained through a phone call to a local school district or examination of the district’s web page. The attribute table of the union file can be quickly updated by entering the correct NCESSCH code. By contrast, if the same grade originates from two different layers (e.g., the E_05 and M_05 fields both have an NCESSCH code) then this indicates that two different schools serve that boundary for the same grade.
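The two checks described above can be sketched in Python. The sketch assumes, hypothetically, that each polygon of the union is available as a dictionary keyed by the grade-specific field names (E_00 to H_12); the function names are illustrative and are not taken from the SABINS scripts.

```python
LEVELS = ("E", "M", "H")  # elementary, middle and high school layers


def missing_grades(row):
    """Grades (0-12) for which no layer supplies an NCESSCH code: an error."""
    return [
        grade for grade in range(13)
        if not any(row.get(f"{level}_{grade:02d}") for level in LEVELS)
    ]


def multi_layer_grades(row):
    """Grades for which more than one layer supplies an NCESSCH code."""
    return [
        grade for grade in range(13)
        if sum(bool(row.get(f"{level}_{grade:02d}")) for level in LEVELS) > 1
    ]
```

A polygon for which `missing_grades` returns a non-empty list would be flagged for follow-up with the school district, while `multi_layer_grades` identifies grades served by schools originating from different layers.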
The final step in the union process is to add 13 new fields to the attribute table. These fields store the NCESSCH identification codes of the schools that provide services to school attendance boundaries. (The NCESSCH identification codes for a grade are sorted from low to high within a single field and are comma delimited.) These 13 fields are “final_id00” to “final_id12.” The code that undertakes this task determines if, for example, a fifth grade school attendance boundary polygon originated from the elementary, middle or high school attendance boundary file. The logic of the code can be summarized as follows: if E_05 has an NCESSCH code, the “final_id05” field is assigned the NCESSCH school identification code from the elementary school layer; if M_05 has an NCESSCH code, the “final_id05” field is assigned the NCESSCH code from the middle school layer. Once all thirteen final_id fields are assigned NCESSCH codes, the union is dissolved 13 times, once for each of the fields from final_id00 to final_id12. This creates 13 layers for grades Kindergarten through 12 and each layer has an attribute describing the school or schools that supply services to a school attendance boundary. This entire process creates 13 topologically correct and geometrically consistent grade-separated feature datasets, one each for grades Kindergarten through 12. Each of the boundaries in the 13 feature datasets is then assigned a permanent identification code (the SABINSID as shown in the PY_SABINS feature dataset in Figure 3). At this stage the “NS_SABINS_CCD” associational table is created in order to follow the normal forms necessary for robust spatial database management systems.4
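A simplified version of this step is sketched below. It assumes, hypothetically, that the union's attribute table is held as a dictionary of rows keyed by polygon identifier, and it stands in for the GIS dissolve by grouping polygon identifiers only; no geometry is merged. The function names are illustrative.

```python
from collections import defaultdict

LEVELS = ("E", "M", "H")  # elementary, middle and high school layers


def final_id(row, grade):
    """Combine the NCESSCH codes for one grade from all layers.

    Codes are de-duplicated, sorted low to high and comma delimited.
    """
    codes = set()
    for level in LEVELS:
        value = row.get(f"{level}_{grade:02d}")
        if value:
            codes.update(value.split(","))
    return ",".join(sorted(codes)) if codes else None


def dissolve_groups(rows, grade):
    """Group union polygons that share the same final id for one grade."""
    groups = defaultdict(list)
    for polygon_id, row in rows.items():
        groups[final_id(row, grade)].append(polygon_id)
    return dict(groups)
```

Calling `dissolve_groups` once for each grade from 0 to 12 yields the 13 groupings that correspond to the 13 grade-specific layers.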
One final processing step incorporates ancillary information about school attendance boundaries. This information is stored in the “NS_SABINS” table shown in Figure 3. This information includes the number of magnet (cnt_magnet_inside), charter (cnt_charter_inside), special education (cnt_special_ed_public_inside) and private (cnt_private_inside) schools inside a school attendance boundary. The private school locations are derived from the Private School Universe Survey (U.S. Department of Education 2008). Other fields count the number of schools that supply services to a boundary (called cnt_open_enrollment) and the number of parts of multipart polygons (cnt_multi_part). The SABINS database also provides a hyperlink that allows users to download the original source information (e.g., narrative descriptions, static images and shapefiles) used to create school attendance boundaries (link_raw_data).
Block Rectification for Custom Census Tabulations
As stated earlier, the primary geography SABINS supplies to the public consists of “block rectified” school attendance boundaries. This means that, in the SABINS database, school attendance boundaries are aggregates of Census blocks. Most school attendance boundaries are closely aligned with the TIGER/Line files and, since these line features comprise Census blocks, most school attendance boundaries are, in fact, meant to entirely contain Census blocks. Still, some school districts delineate some of their school attendance boundaries such that a portion of some school attendance boundaries serve children on both sides of a street. In such cases, the school attendance boundary legitimately and intentionally splits a Census block. Even so, the SABINS database assigns an entire Census block to an attendance boundary regardless of whether it is split by that attendance boundary. Thus, block rectified school attendance boundaries are not precisely the same as those delineated or described by a local school district. (SABINS allows users to obtain the original source information with a hyperlink stored in the “link_raw_data” field so that they may access “pre-block rectified” files.)
The SABINS project uses a straightforward block–rectification technique. A point file is created that represents the geographic center (or centroid) of all U.S. Census blocks. The Census Bureau’s block file contains the centroid coordinates for each block and these are used to create a point layer. The point layer representing block centroids is then spatially joined with the school attendance boundaries—if a block point falls within a particular school attendance boundary polygon, the block centroid is assigned the school boundary identification code. The block-assignment method is one reason topological rules such as “no gaps” and “no overlaps” among polygons must be enforced in the processing stages of the project. Once the block points have been assigned school attendance boundary identification codes, the block points are then rejoined with the original block polygons from which these points were generated. After the block polygon file is associated with the identification codes of a school attendance boundary, the block polygons are dissolved into “block-rectified” school attendance boundaries. As discussed below, since most blocks have over 90 percent of their area within a school attendance boundary, the block rectification process essentially conflates school attendance boundaries with 2010 TIGER/Lines (which are the features that comprise census blocks).
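The centroid-based assignment can be sketched in plain Python. This is an illustration of the technique rather than the project's GIS workflow: it uses a standard ray-casting point-in-polygon test on simple polygons without holes, and the function names and input structures are hypothetical.

```python
from collections import defaultdict


def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices of a simple polygon."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_blocks(block_centroids, boundaries):
    """Give each block the id of the boundary that contains its centroid."""
    assignment = {}
    for block_id, (x, y) in block_centroids.items():
        for boundary_id, ring in boundaries.items():
            if point_in_polygon(x, y, ring):
                assignment[block_id] = boundary_id
                break  # with no gaps or overlaps, one match is expected
    return assignment


def blocks_by_boundary(assignment):
    """List the blocks that would be dissolved into each rectified boundary."""
    groups = defaultdict(list)
    for block_id, boundary_id in assignment.items():
        groups[boundary_id].append(block_id)
    return dict(groups)
```

The `break` after the first match reflects the reliance on the "no gaps" and "no overlaps" rules: if boundaries overlapped, a centroid could fall in two polygons and the assignment would depend on iteration order.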
There are three reasons that the SABINS database assigns entire blocks to school attendance boundaries. First, block rectified boundaries can be used to create custom tabulations of Census population counts that can be released to the public. The Census Bureau’s disclosure policies include a stipulation that Census blocks be entirely nested within the custom geography before a custom tabulation can be released publicly. The Census Bureau maintains strict disclosure rules to ensure that geographic or tabular information supplied to the public cannot be used to identify households and the characteristics of individuals who live within them. An example demonstrates how custom tabulations from the Census Bureau are useful. The U.S. Department of Agriculture’s Food and Nutrition Services branch used SABINS data to estimate the percentage of public school students enrolled in a school who were eligible to receive a subsidized meal. These custom tabulations have been released to the public for some school districts, which was made possible because the block rectified school attendance boundaries met the disclosure rules of the Census Bureau.
A second reason to block-rectify school attendance boundaries is to generate school attendance boundary population totals using published Census data. The Census block is the smallest unit for which the Census Bureau tabulates data. Thus, aggregating the block-level population totals to the block-rectified attendance boundaries provides a straightforward way of producing demographic information.
A third reason to block rectify school attendance boundaries is to conflate the line work of school attendance boundaries with Census blocks. Many school districts that supply electronic GIS files to the SABINS project use local cadastral data to delineate their school attendance boundaries, as is the case for districts in Delaware. A school attendance boundary may follow a line segment such as a road but this road, as delineated in local source data, is not aligned with TIGER/Line files. As shown in Figure 4, the source information used to digitize a school attendance boundary along Dallas Highway does not align precisely with the Census Bureau’s representation of Dallas Highway. Thus, using the MAF/TIGER lines to address-match will locate some address points within the wrong (pre block-rectified) school attendance boundary. This is shown by the address circled in red, which has been assigned to the wrong boundary. After block rectification, a Census block along the periphery of a school attendance boundary will share geometry with the TIGER/Lines. This makes geocoding with TIGER/Line features more manageable. (Of course, some addresses will still be inaccurately assigned to school boundaries after boundaries are rectified to blocks; these errors will occur when a school boundary intentionally splits a block. SABINS metadata cautions users about the pitfalls of using SABINS data to locate addresses within school attendance boundaries.)
Figure 4.
Misalignment of Locally Delineated Streets and School Attendance Boundaries; Points Represent Geocoded Addresses with a 10 Foot Offset.
Degree of block nesting
Table 4 shows the percentage of 2010 Census blocks in Delaware that have varying proportions of their area within a school attendance boundary. Only blocks with at least one person living in them are considered. There are 15,933 populated blocks in Delaware (out of a total of 24,115 blocks). Of the blocks with a population, 96.4 percent have at least 99 percent of their area within a Kindergarten school attendance boundary. This same figure is 98.1 percent for grades 7 and 12. If the proportion of a block’s area within a school attendance boundary is increased to 90 percent then the percentage of blocks that are nested within Kindergarten attendance boundaries is 98.5 percent; this figure is 99.5 percent for grades 7 and 12. The threshold of 90 percent is somewhat arbitrary. Still, if a school attendance boundary contains 90 percent of a block’s area, the imperfect nesting almost always results from discrepancies in line work between TIGER/Line features and locally defined features. This still means that less than two percent of blocks are legitimately split by school attendance boundaries. In cases in which a block is intended to be split by a school attendance boundary, the entire block is still assigned to a single school attendance boundary and, although this is less than desirable, it is necessary for obtaining custom tabulations from the U.S. Census Bureau.
Table 4.
Percent of 2010 Census blocks that have varying percentages of their area within school attendance boundaries, Delaware, 2009–2010.
| Share of Block Area Within Boundary | Kindergarten | Grade 7 | Grade 12 |
|---|---|---|---|
| 99 percent within | 96.4 | 98.2 | 98.2 |
| 95 percent within | 98.1 | 99.3 | 99.4 |
| 90 percent within | 98.5 | 99.4 | 99.5 |
| Number of Attendance Boundaries | 84 | 33 | 23 |
It is important to know how much assigning entire blocks to school attendance boundaries potentially impacts population estimates within school attendance boundaries. Table 5 shows the results of a sensitivity analysis that estimates how much the block rectification process may affect population characteristics within school attendance boundaries. As in the previous analysis, grades Kindergarten, 7 and 12 in Delaware are used to explore this topic. This analysis compares the percentage of non-Hispanic white, non-Hispanic black and Hispanic people in (1) block rectified school attendance boundaries with the percentage produced by (2) a modified areal weighting approach. The modified areal weighting approach assigns all of a block’s population to a school attendance boundary if the school attendance boundary contains more than 90 percent of the block. If a school attendance boundary contains between 10 and 90 percent of a block’s area, the school attendance boundary is assigned population totals in proportion to the area of the block that it contains. For example, if a school attendance boundary contains 85 percent of a block it is assigned 85 percent of its population. Block rectified boundaries consist of the entire population of whole Census blocks, where a Census block is assigned to a school attendance boundary if its centroid lies within it.
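The two allocation rules compared in this analysis can be written as small Python functions. This is a sketch of the rules as described above, with hypothetical function names; the treatment of blocks with less than 10 percent of their area inside a boundary is not stated in the text, and assigning them no population is an assumption made here.

```python
def areal_weighted_population(block_population, share_inside):
    """Modified areal weighting for one block and one attendance boundary.

    share_inside: fraction (0 to 1) of the block's area within the boundary.
    """
    if share_inside > 0.90:
        return float(block_population)          # treat the block as fully nested
    if share_inside >= 0.10:
        return block_population * share_inside  # proportional to area
    return 0.0  # ASSUMPTION: below 10 percent the block contributes nothing


def block_rectified_population(block_population, centroid_inside):
    """Block rectification: all or nothing, decided by the block centroid."""
    return float(block_population) if centroid_inside else 0.0
```

For a block of 200 people with 85 percent of its area inside a boundary, the first rule allocates 170 people, whereas block rectification allocates either all 200 or none, depending on where the centroid falls.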
Table 5.
Correlation coefficients between allocated and known percentages of various racial groups in school attendance boundaries, Delaware, 2009–2010.
| Racial Group | Kindergarten | Grade 7 | Grade 12 |
|---|---|---|---|
| Non-Hispanic Black | .997 | .999 | .998 |
| Hispanic | .993 | .998 | .998 |
| Non-Hispanic White | .997 | .999 | .998 |
| N Catchment Areas | 84 | 33 | 23 |
Findings show that the difference between the percent of non-Hispanic white people in boundaries that are block-rectified and areal weighted is less than one percentage point for 93 percent of observations. Slightly fewer than five percent of boundaries differ by one to two percentage points while about two percent of the boundaries differ by two to four percentage points. For grades 7 and 12 attendance boundaries, 100 percent of school attendance boundaries have less than a one percentage point difference in the percent of non-Hispanic white people in block rectified and modified areal weighted school attendance boundaries. Thus, at the lower grade levels, the block rectification process results in more inaccuracy than at the higher grade levels. (Results for other racial groups are nearly identical and are available upon request.) Even though “block-rectified” boundaries result in some inaccuracy in socio-demographic estimates, the trade-off is the ability to acquire custom tabulations with U.S. Census Data and to seamlessly create accurate population totals using publicly available block-level Census Data.
INTEGRATING DEMOGRAPHIC DATA WITH ATTENDANCE BOUNDARIES
The SABINS database provides users with demographic estimates describing the characteristics of persons, families, households and housing units within block-rectified school attendance boundaries. Data that describe these characteristics originate from two sources. The first source is basic population counts from the decennial census that are summarized at the Census block level (from the Census Bureau’s 2010 Summary File 1). The block-level Summary File 1 file represents the “complete count” or 100 percent census. Basic block-level summary counts include the number of people by age, gender and race. Data are also available for the number of occupied and vacant housing units. Since school attendance boundaries are aggregations of Census blocks, it is fairly straightforward to sum block-level population characteristics to school attendance boundaries. For example, users will have ready access to information such as the percent of five to nine year old children who are Hispanic.
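Summing block-level counts to block-rectified boundaries can be sketched as follows. The variable names (`age5_9`, `age5_9_hispanic`) and the input structures are hypothetical and chosen only for illustration; they are not Summary File 1 field names.

```python
from collections import defaultdict


def aggregate_to_boundaries(block_counts, assignment):
    """Sum block-level counts to the boundary each block is assigned to.

    block_counts: {block_id: {variable_name: count}}
    assignment:   {block_id: boundary_id} from block rectification
    """
    totals = defaultdict(lambda: defaultdict(int))
    for block_id, counts in block_counts.items():
        boundary_id = assignment.get(block_id)
        if boundary_id is None:
            continue  # block lies outside every boundary in this layer
        for variable, count in counts.items():
            totals[boundary_id][variable] += count
    return {boundary: dict(values) for boundary, values in totals.items()}


def percent(counts, numerator, denominator):
    """Percentage of one count relative to another, or None if undefined."""
    if not counts.get(denominator):
        return None
    return 100.0 * counts[numerator] / counts[denominator]
```

Because each block belongs to exactly one boundary, the aggregation is a simple sum and requires no areal interpolation.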
More detailed socio-demographic characteristics are available from the American Community Survey (ACS). The ACS samples U.S. households continuously throughout the year and does so every year (American Community Survey Office 2011). A rolling, five-year compilation of the annual ACS sample data is made available to the public. The SABINS database contains the latest five-year series, which runs from 2006 to 2010. ACS data are summarized at the block group level. Block groups are relatively small areas that are aggregations of Census blocks (in Delaware, there is an average of 28 populated blocks for every block group). Examples of ACS variables summarized at the block group level include the number of people who are employed; work in various occupations and industries; have achieved various levels of education; are poor; belong to various ethnic categories; speak various languages; and are born in foreign countries.
While the ACS provides much information about the U.S. population, the challenge is to summarize these data to school attendance boundaries. The SABINS project uses a straightforward interpolation procedure to allocate socio-demographic characteristics from block groups to school attendance boundaries, as described by Saporito et al. (2007). The procedure first estimates the number of people of a given social characteristic who live within a block. The 100 percent count block level variables from the decennial census guide the allocation. The SABINS project then re-aggregates the block-level interpolated values to school attendance boundaries.
A simple hypothetical example describes the process of allocating block-group poverty rates to school attendance boundaries. A block group contains three Census blocks within it. The first block has 100 people, the second 200, and the third 300. The entire block group contains 100 low-income residents. Since the first block contains 1/6th of the block-group’s population, the allocation procedure assigns 1/6th of the block group’s low-income residents to that block. If a school attendance boundary contains the first block (but not the remaining two blocks) then 1/6th of the block-group’s low-income residents are allocated to that school attendance boundary.
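The hypothetical example above can be reproduced with a short Python function. This is a minimal sketch of population-proportional allocation; the function name `allocate_count` and the block labels are illustrative.

```python
def allocate_count(block_group_total, block_populations):
    """Allocate a block-group count to blocks in proportion to block population.

    block_group_total: count reported for the block group (e.g. low-income residents)
    block_populations: {block_id: population used to guide the allocation}
    """
    group_population = sum(block_populations.values())
    if group_population == 0:
        return {block_id: 0.0 for block_id in block_populations}
    return {
        block_id: block_group_total * population / group_population
        for block_id, population in block_populations.items()
    }


# Three blocks of 100, 200 and 300 people; 100 low-income residents in total.
shares = allocate_count(100, {"block_1": 100, "block_2": 200, "block_3": 300})
```

Here `shares["block_1"]` is one sixth of the 100 low-income residents, and the three allocated values sum back to the block-group total. Passing a different guiding variable, such as a block-level count for a specific group, changes only the `block_populations` argument.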
This basic principle can be extended in two ways. First, different block-level variables can be used to guide the allocation procedure. For example, a variable might describe the number of Asian people in a block group who achieved various levels of education (e.g., the number of Asian people who earned a college degree). Since the number of Asian people is tabulated at the block level, it is more accurate to use block-level counts of Asian people to guide the allocation than it is to use the entire population in the block. This is because Asian people might be distributed differently within block groups than other ethnic groups (i.e., Asian people might be segregated from the members of other groups). Second, the allocation procedure also allows the allocation of block group data that are provided as averages (e.g., the mean income of persons age 25 and over who reside within a block group). The procedure for allocating average income, for example, begins by estimating the total number of dollars in each block group. This is completed by multiplying the total number of people within the block group by the mean income within the block group. Total dollars are then allocated from block groups down to blocks depending upon the number of people who live within the block. For example, if 1/10th of a block group’s people live within one of its blocks, then 1/10th of the block group’s total dollars are assigned to that block. The total number of dollars allocated to blocks are summed to school attendance boundaries and divided by the total number of people in school attendance boundaries to estimate mean income of people within them.
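The allocation of an average can be sketched in Python as follows. The input structure, the function name `boundary_mean_income`, and any figures used with it are hypothetical; the steps follow the description above (total dollars per block group, proportional allocation to blocks, then re-aggregation to the boundary).

```python
def boundary_mean_income(block_groups, boundary_blocks):
    """Estimate mean income within a boundary from block-group means.

    block_groups: list of dicts, each with
        "mean_income": mean income reported for the block group
        "block_pops":  {block_id: number of people in the block}
    boundary_blocks: set of block ids assigned to the attendance boundary
    """
    dollars = 0.0
    people = 0
    for group in block_groups:
        group_population = sum(group["block_pops"].values())
        if group_population == 0:
            continue
        total_dollars = group["mean_income"] * group_population
        for block_id, population in group["block_pops"].items():
            if block_id in boundary_blocks:
                # The block receives dollars in proportion to its population share.
                dollars += total_dollars * population / group_population
                people += population
    return dollars / people if people else None
```

As an illustration with made-up figures, a boundary containing 100 people from a block group with a mean income of 30,000 and 100 people from a block group with a mean income of 50,000 receives an estimated mean of 40,000; the result is a population-weighted average of the block-group means.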
This allocation procedure introduces error because sub-populations are not necessarily distributed evenly across blocks within block groups. A simple, hypothetical example illustrates the challenge: a block group contains 4 blocks, 400 people and 100 low-income people; the people are distributed evenly across blocks (i.e., there are 100 people in each block) but all low-income people are concentrated in one of the four blocks. Yet, the allocation procedure assumes that there are 25 low-income people in each block. While such an extreme example rarely occurs in practice, the allocation procedure introduces some error.
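The size of the error in this extreme case can be worked through directly. The sketch below uses only the hypothetical numbers from the example; the variable names are illustrative.

```python
# Four blocks of 100 people each; all 100 low-income people live in the first block.
block_pops = [100, 100, 100, 100]
actual_low_income = [100, 0, 0, 0]

group_low_income = sum(actual_low_income)
total_pop = sum(block_pops)

# Population-weighted allocation assigns the same number to every block.
allocated = [group_low_income * pop / total_pop for pop in block_pops]

# Allocation error per block: allocated minus actual.
errors = [a - t for a, t in zip(allocated, actual_low_income)]
print(allocated)
print(errors)
```

A boundary that contained only the first block would be assigned 25 low-income people rather than 100, while the errors cancel for any boundary that contains the whole block group.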
To determine how much error the allocation procedure introduces, a sensitivity analysis was conducted in which the actual racial characteristics of people in school attendance boundaries were correlated with values interpolated from block groups. Because 2010 census data provide counts of people by race at the block level, it is possible to generate actual counts of people by race for each school attendance boundary. These actual counts are correlated with estimates produced by interpolating block-group data. Specifically, correlations between known and interpolated values were computed for the percent of non-Hispanic white, non-Hispanic black and Hispanic people in Kindergarten, 7th and 12th grade school attendance boundaries in Delaware. Results are shown in Table 5. The correlation coefficients between the actual and interpolated values are at least .993 for all racial comparisons, and at least .997 for 8 of the 9 racial comparisons. In substantive terms, across Kindergarten school attendance boundaries, the difference between actual and interpolated values (for all three racial groups considered collectively) is less than 2 percentage points for 90 percent of Kindergarten boundaries. The percentage point difference is between 2.1 and 5 for 9.4 percent of the Kindergarten boundaries, while 0.6 percent of the cases have a difference of between 5 and 7 percentage points. At the higher grade levels, the differences between the actual and interpolated values for all racial groups are less than 1.5 percentage points.
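A comparison of this kind can be sketched as follows. The percentages below are invented purely for illustration and are not SABINS results; the helper names `pearson_r` and `share_within` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def share_within(actual, interpolated, threshold):
    """Share of boundaries whose absolute difference is below the threshold."""
    diffs = [abs(a - i) for a, i in zip(actual, interpolated)]
    return sum(d < threshold for d in diffs) / len(diffs)

# Made-up percentages of one racial group in five attendance boundaries.
actual_pct = [10.0, 35.0, 60.0, 82.0, 95.0]   # from block-level counts
interp_pct = [11.0, 34.0, 61.5, 80.0, 97.5]   # interpolated from block groups

r = pearson_r(actual_pct, interp_pct)
within_2 = share_within(actual_pct, interp_pct, 2.0)
print(round(r, 4), within_2)
```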
From Grade-specific School Attendance Boundaries to Schools
As noted throughout this paper, schools provide services to school attendance boundaries, but schools and their boundaries do not necessarily share a one-to-one relationship. In some cases, boundaries are served by multiple schools. In other cases, portions of some boundaries overlap and the area of intersection is served by multiple schools. If a user wants to determine the population characteristics of the people living in the boundaries served by each school (for a single grade level), it is necessary to sum the populations living in each school’s attendance boundaries after dividing each boundary’s population by the number of schools that serve it.
This procedure is relatively straightforward. Users join counts of people within school attendance boundaries to the schools that supply services to those boundaries. Once every boundary is joined to the school (or schools) that provide services to it, a value is generated that counts the number of schools that supply services to a school attendance boundary. If one school provides services to an area (which is the most typical scenario) then the “school count” value will be one; if two schools supply services to an area, the count will be two, and so on. The “school count” value is then divided into the population counts of each school attendance boundary, and the “weighted counts” are summed to the schools that provide services to each boundary. This procedure preserves the original population counts of the entire school district while still providing the ability to produce meaningful statistics at the school level (e.g., the percent of children who are low-income). The summation process also produces a single estimate for each school that describes the population characteristics of people who live within a boundary, but it is based on the assumption that children are evenly allocated to each school that serves a boundary.
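The “school count” weighting can be sketched as follows for a single grade level. The function name, boundary identifiers, and school identifiers are hypothetical, and the sketch assumes that each boundary-school pair appears only once in the link table.

```python
from collections import Counter, defaultdict

def boundary_counts_to_schools(boundary_pop, links):
    """Allocate boundary populations to the schools that serve them.

    boundary_pop: {boundary_id: population count}
    links: list of unique (boundary_id, school_id) pairs
    """
    # Number of schools that serve each boundary (the "school count").
    school_count = Counter(boundary for boundary, _ in links)
    school_pop = defaultdict(float)
    for boundary, school in links:
        # Weighted count: the boundary population split evenly among its schools.
        school_pop[school] += boundary_pop[boundary] / school_count[boundary]
    return dict(school_pop)

# Hypothetical district: boundary B is served by two schools.
boundary_pop = {"A": 600, "B": 400, "C": 300}
links = [("A", "s1"), ("B", "s2"), ("B", "s3"), ("C", "s3")]

school_pop = boundary_counts_to_schools(boundary_pop, links)
print(school_pop)
```

In this made-up district the school totals add up to the district population of 1,300, which reflects the property noted above: dividing by the school count preserves the original population counts.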
From Grade-specific Schools to Entire Schools
Many users will want to estimate the population characteristics of people who live within a school’s catchment areas, not simply for a single grade but for an entire school irrespective of the grade span it serves. This can be achieved if some simple assumptions are made. To illustrate this process, assume that the goal is to generate the number of black and white people who live within the school attendance boundaries served by schools and to generate these estimates for all schools irrespective of their grade ranges. It is useful to recall that the SABINS database consists of 13 grade-specific polygon layers spanning grades K to 12. Creating school counts from these 13 layers requires several basic steps. First, the number of black and white people in each grade-specific set of boundaries is divided by 13. This essentially allocates 1/13th of the population to each grade. The second step is to join the weighted counts of black and white people to the schools that provide services to those boundaries; at this stage every record is a school. Third, the number of black and white people who live within each grade-specific attendance boundary is divided by the number of schools that provide services to that boundary; if a boundary is served by two schools, the population in the boundary is divided by two. Fourth, the 13 grade-specific, school-based data files are appended together (i.e., stacked on top of one another). For example, if a school serves five grades, the data for the five grades will be repeated in the database. The final step is to sum the number of black and white people across the 13 “stacked” files to each school. This is accomplished by aggregating (or collapsing) on school identification code (using the NCESSCH field). This last step reproduces the original population counts in a school district, but the final result allocates data to whole schools and not simply to grade-specific boundaries.
The result is an estimate of the characteristics of people who live within the attendance boundaries served by every school—irrespective of the grade spans and attendance boundary combinations that those schools serve. These steps can be completed in most statistical software.5
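The steps above can be sketched in Python as one illustration of how they might be carried out. The data layout, the function name `whole_school_counts`, and the district below are hypothetical; only the NCESSCH field name comes from the SABINS database. A single population count is used for brevity, and the same steps would apply to each racial group.

```python
from collections import Counter, defaultdict

N_GRADES = 13  # Kindergarten through 12th grade

def whole_school_counts(grade_layers):
    """Aggregate grade-specific boundary counts to whole schools.

    grade_layers: {grade: (boundary_pop, links)} where boundary_pop maps
    boundary ids to counts and links is a list of unique
    (boundary_id, ncessch) pairs for the schools serving that grade.
    """
    stacked = []  # one record per school per grade (the "stacked" files)
    for grade, (boundary_pop, links) in grade_layers.items():
        school_count = Counter(boundary for boundary, _ in links)
        for boundary, ncessch in links:
            # Steps 1 to 3: allocate 1/13th to the grade, join to the school,
            # and split evenly among the schools that serve the boundary.
            weighted = boundary_pop[boundary] / N_GRADES / school_count[boundary]
            stacked.append((ncessch, grade, weighted))
    # Final step: collapse the stacked records on the school identification code.
    totals = defaultdict(float)
    for ncessch, _grade, weighted in stacked:
        totals[ncessch] += weighted
    return dict(totals)

# Hypothetical district of 1,300 people: two K-5 schools, one 6-8, one 9-12.
elem = ({"e_a": 520, "e_b": 780}, [("e_a", "E1"), ("e_b", "E2")])
middle = ({"m": 1300}, [("m", "M1")])
high = ({"h": 1300}, [("h", "H1")])

grades = ["K"] + [str(g) for g in range(1, 13)]
grade_layers = {g: elem for g in grades[:6]}
grade_layers.update({g: middle for g in grades[6:9]})
grade_layers.update({g: high for g in grades[9:]})

school_totals = whole_school_counts(grade_layers)
print(school_totals)
```

In this made-up district the four school totals sum to 1,300, the district population, which mirrors the observation above that the last step reproduces the original population counts.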
CONCLUSIONS AND FUTURE WORK
The School Attendance Boundary Information System is a spatial data infrastructure project that has, for the first time, collected, processed, harmonized, and disseminated K-12 educational geography on a massive scale. The first major achievement of the project is to demonstrate the feasibility of collecting and digitizing this information for hundreds of the largest school districts in the U.S. This demonstrates that it is realistic for educational agencies to replicate such collection efforts within their own states, which is a long-term strategy for institutionalizing the SABINS project. In this paper, we describe how school attendance boundary geographies are modeled and integrated with data sources from the U.S. Census Bureau and the U.S. Department of Education. The aim is to provide states and other agencies with a template for building their own school attendance boundary database. While there are imperfections in the original source information used to digitize school boundaries, and in the techniques used to digitize those boundaries and link them with schools, these deficiencies can be corrected with the models and procedures we describe. There are also some inaccuracies introduced by rectifying school attendance boundaries with census blocks and by allocating population counts from block groups to school attendance boundaries. Yet, the errors introduced during this process still leave users with data that are superior to administrative geographies used as proxies for school attendance boundaries.
Still, there are ways to improve existing data. The SABINS project is also in the midst of building a web-based digitizing system that school districts can use to delineate their boundaries (and update them annually). A web-based digitizing system has several potential advantages. It can save local districts money because they do not have to buy software or pay consultants to digitize and display their boundaries; it allows the project to collect data on an even larger scale and in a standard format; and it improves the accuracy of the data because the system can provide tutorials that teach best practices in using GIS to digitize school attendance boundaries. It also enables “edge matching” between school districts so that there are no gaps or overlaps among adjacent local education agencies. Finally, to the extent that districts can use TIGER/Line files to trace their boundaries, there will be less error (and more consistency) in the GIS files that represent school boundaries. SABINS is in the midst of building this capacity.
The second major accomplishment of the SABINS project is taming some of the seemingly intractable difficulties of modeling the relationships among school attendance boundary geographies and the schools that supply services to those geographies. While it is reasonable for a local school district to treat schools and school attendance boundaries as the same entities, such a system is not useful at larger scales. Designing a robust database management system consisting of normalized, relational tables linking schools with boundaries allows for easy management, update and analysis of the database, and, more importantly, it makes it possible to integrate existing school-level databases such as the CCD with school geography. In particular, the SABINS database structure allows users to extract grade-specific boundaries or boundaries that span more than one grade; it allows users to then join each of these boundaries to schools that supply services to them; it allows more than one school to supply services to a single boundary or an overlapping boundary; it allows schools to have different boundaries at different grade levels; finally, it allows schools to serve boundaries in multiple school districts. This model serves as a template that states can adopt, modify, and incorporate into their own enterprise GIS systems.
SABINS also uses straightforward spatial interpolation techniques that estimate the socio-demographic characteristics of people and households that are located within school boundaries. Although allocating demographic information from Census geography to school attendance boundaries is reasonably accurate, it is important to remember that school attendance boundaries are “permeable” because some students are enrolled in private, charter and magnet schools that draw children from traditional, neighborhood schools. While the interpolation techniques used in the SABINS project result in reasonable estimates of the characteristics of people who live within school boundaries, these estimates are imperfect reflections of who is enrolled in a school. This limitation will be addressed with the release of custom tabulations from the Census that consist of public school children only, which is a closer representation of who is enrolled in the school that serves a given boundary. Taken as a whole, SABINS provides researchers, policy makers and local administrators with a rich new spatial and tabular data source that serves the diverse needs of a wide constituency.
Acknowledgments
The authors wish to thank the following people for their contributions: Doug Geverdt, Jeff Han, Laura Nixon, and Petra Noble. This research was supported with grants from the National Science Foundation (SES-1123727, SES-1123894, SES-0921794 and SES-0921279) and the United States Department of Education’s National Center for Education Statistics.
Biographies
Salvatore Saporito is an Associate Professor of Sociology at the College of William and Mary. He has conducted research on the causes and consequences of racial and economic segregation in schools and he has used school attendance boundary information to investigate the process of segregation. He has a B.A. in sociology from Glassboro State College and a Ph.D. in sociology from Temple University.
Corresponding Address:
Department of Sociology
The College of William and Mary
Williamsburg, VA 23187-8795
Email: sjsapo@wm.edu
David Van Riper is the Director of the Spatial Analysis Core at the Minnesota Population Center and has worked extensively on creating and disseminating large GIS data infrastructure such as the National Historical Geographic Information System (http://www.nhgis.org). He completed a B.A. in geography in 1999 at the University of Wisconsin-Madison and holds an M.A. in geography from the University of Minnesota.
Corresponding Address:
Minnesota Population Center
50 Willey Hall
225 – 19th Avenue South
Minneapolis, MN 55455
Email: vanriper@umn.edu
Ashwini Wakchaure is a GIS programmer on the SABINS project. She earned a Ph.D. from the Department of Urban and Regional Planning at the University of Florida in 2009. She teaches courses in GIS programming at the College of William and Mary.
Corresponding Address:
Department of Sociology
The College of William and Mary
Williamsburg, VA 23187-8795
Email: awakchaure@wm.edu
Footnotes
1. The 13 metropolitan areas are Atlanta GA, Bakersfield CA, Hartford CT, Houston TX, Kansas City MO, Miami FL, Milwaukee WI, Minneapolis-St. Paul MN, Philadelphia PA, Portland OR, Orlando FL, Tampa FL, Tucson AZ, Virginia Beach VA, and Washington DC.
2. The SABINS project preserves all of the original “elementary,” “middle” and “high” school attendance boundaries. (Some school districts have four or five sets of boundaries, such as “primary” or “intermediate” boundaries, and these are preserved as well.) The original boundaries are available to users upon request.
3. In rare cases, a school may have different boundaries at different grade levels. For example, a school’s K to 5 boundaries may cover a different area than its 6 to 8 boundaries.
4. At this stage, it is also possible to extract the original, topologically corrected “elementary,” “middle,” and “high” school layers (and their corresponding NCESSCH school identification codes) from the union. The attribute tables can be normalized and duplicate school identification codes can be eliminated so that all data follow the normal forms.
5. Since most schools serve the same boundaries across grades, the records for each school will repeat themselves; thus, the number of black and white people in a school’s boundaries will be the same. However, if a school serves different boundaries at different grades, the number of black and white people will vary by grade. This process of aggregating to schools addresses rare situations in which a school serves different boundaries at different grade levels.
Contributor Information
Salvatore Saporito, Department of Sociology, College of William and Mary, sjsapo@wm.edu, Phone: (757) 221-2604.
David Van Riper, Minnesota Population Center, University of Minnesota, vanriper@umn.edu.
Ashwini Wakchaure, Department of Sociology, College of William and Mary, awakchaure@wm.edu.
References
- Alexander K, Entwisle D, Olson L. Children, schools and inequality. Boulder, CO: Westview Press; 1997. [Google Scholar]
- Black Sandra. Do better schools matter? Parental valuation of elementary education. Quarterly Journal of Economics. 1999;114:577–599. [Google Scholar]
- Brunner E, Murdoch J, Thayer M. School finance reform and housing values: Evidence from Los Angeles. Public finance and management. 2002;2:535–565. [Google Scholar]
- Brunner E, Sonstelie J, Thayer M. Capitalization and the voucher. Journal of urban economics. 2001;50:517–536. [Google Scholar]
- Card D, Kreuger A. Does school quality matter? Returns to education and the characteristics of public schools in the United States. Journal of Political Economy. 1992;100:1–40. [Google Scholar]
- Chen C. U.S Department of Education. Washington, DC: National Center for Education Statistics; 2011. Numbers and types of public elementary and secondary schools from the Common Core of Data: School year 2009–10 (NCES 2011-345) Retrieved [June 15, 2011] from http://nces.ed.gov/pubsearch. [Google Scholar]
- Diez Roux A. Investigating neighborhood and area effects on health. American public health association. 2000;91:1783–1789. doi: 10.2105/ajph.91.11.1783. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Downes T, Zabel J. The impact of school characteristics on house price: Chicago 1987–1991. Journal of urban economics. 2002;52:1–25. [Google Scholar]
- Edwards V, Ehrenthal M. Making individual school enrollment projections using ‘micro-geographies’; Presentation at the Population Association of American Meetings; New Orleans. 2008. April 17th to 19th. [Google Scholar]
- Elliott P, Wartenberg D. Spatial epidemiology: Current approaches and future challenges. Environmental Health Perspectives. 2004;112(9):998–1006. doi: 10.1289/ehp.6735. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Frankenberg E, Lee C, Orfield G. A multiracial society with segregated schools: Are we losing the dream? Cambridge, MA: Harvard University Civil Rights Project; 2003. [Google Scholar]
- Heckman L, Taylor H. School rezoning to achieve racial balance: A linear programming approach. Socio-Economic Planning Sciences. 1969;3:127–133. [Google Scholar]
- Huang R, Hawley D. A data model and Internet GIS framework for safe routes to school. Journal of the Urban and Regional Information Systems Association. 2009;21:21–31. [Google Scholar]
- Ionnides Y. Neighborhood income distribution. Journal of urban economics. 2004;56:435–457. [Google Scholar]
- Krieger N. A century of census tracts: Health and the body politic (1906–2006) Journal of urban health. 2006;83:355–361. doi: 10.1007/s11524-006-9040-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Krieger N, Waterman P, Chen J, Soobader M, Subramanian SV, Carson R. ZIP code caveat: Bias due to spatiotemporal mismatches between ZIP codes and US census-defined geographic areas—the public health disparities geocoding project. American journal of public health. 2002;92:1100–1102. doi: 10.2105/ajph.92.7.1100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lemberga D, Church R. The school boundary stability problem over time. Socio-Economic Planning Sciences. 2000;34:159–176. [Google Scholar]
- Logan J, Oakley D. The continuing legacy of the Brown decision: Court action and school segregation, 1960–2000. Albany, NY: Lewis Mumford Center for Comparative Urban and Regional Research; 2004. [Google Scholar]
- National Academy of Sciences. Estimating children eligible for school nutrition programs using the American Community Survey. Current projects system. 2009 Retrieved from http://www8.nationalacademies.org/cp/projectview.aspx?key=49102.
- New Mexico Statute § 14-2-6(E) NMSA. 1978 [Google Scholar]
- Owens A. Neighborhoods and schools as competing and reinforcing contexts for educational attainment. Sociology of Education. 2010;83:287–311. [Google Scholar]
- Sampson R, Morenoff J, Gannon-Rowley T. Assessing ‘neighborhood effects’: Social processes and new directions in research. Annual Review of Sociology. 2002;28:433–478. [Google Scholar]
- Reardon S, Yun J. Suburban racial change and suburban school segregation, 1987–1995. Sociology of education. 2001;74:79–101. [Google Scholar]
- Saporito S. School Choice in Black and White: Private School Enrollment among Racial Groups, 1990–2000. Peabody Journal of Education. 2009;84:172–190. [Google Scholar]
- Shai D. Income, housing, and fire injuries: A census tract analysis. Public health reports. 2006;121:149–154. doi: 10.1177/003335490612100208. [DOI] [PMC free article] [PubMed] [Google Scholar]
- United State Census Bureau. 2010 Topologically Integrated Geographically Encoded Referencing System Line Shapefiles (TIGER/Line Shapefiles) Washington, DC: U.S. Census Bureau; 2011. [machine readable data files] [Google Scholar]
- United States Census Bureau. 2006–2010 American Community Survey (ACS) Washington, DC: U.S. Census Bureau; 2011. [machine readable data files] [Google Scholar]
- United States Census Bureau. 2006–2010 American Community Survey (ACS) 5-Year Summary File Technical Documentation. Washington, DC: U.S. Census Bureau; 2011. [Google Scholar]
- United States Census Bureau. 2010 Census Redistricting Data (Public Law 94-171) Washington, DC: U.S Census Bureau; 2011. [machine-readable data files] [Google Scholar]
- United States Census Bureau. 2010 Census Summary File 1 (SF1) Washington, DC: U.S Census Bureau; 2011. [machine-readable data files] [Google Scholar]
- United States Department of Education. Common Core of Data. Washington, DC: U.S. Department of Education; 2008. [Machine-readable database] [Google Scholar]
- United States Department of Education. Private School Universe Survey. Washington, DC: U.S. Department of Education; 2008. [Machine-readable database] [Google Scholar]
- Weimer D, Wolkoff M. School Performance and Housing Values: Using NonContiguous District and Incorporation Boundaries to Identify School Effects. National Tax Journal. 2001;54:231–253. [Google Scholar]
- Winkleby MA, Cubbin C. Influence of individual and neighborhood socioeconomic status on mortality among black, Mexican-American, and white women and men in the United States. Journal of epidemiology and community health. 2003;57:444–452. doi: 10.1136/jech.57.6.444. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xue J, McCurdy T, Burke J, Bhaduri B, Liu C, Nutaro J, Patterson L. Analysis of school commuting data for exposure modeling purposes. Journal of exposure science and environmental epidemiology. 2009;20:69–78. doi: 10.1038/jes.2009.3. [DOI] [PubMed] [Google Scholar]




