Summary of Data Sources and Geographies

The Bay Area Equity Atlas (referred to simply as the “Atlas” below) draws upon a regional equity indicators database assembled using a broad array of data sources and methodologies. The database builds on the National Equity Atlas indicators database, also developed and maintained by PolicyLink and ERI, and adds more than a dozen new indicators derived from local and state data sources as well as unique surveys. The Atlas includes data for the following geographies:

  • Region: The Five- and Nine-County Bay Area regions
  • County: The nine Bay Area counties (Alameda, Contra Costa, Marin, Napa, San Mateo, San Francisco, Santa Clara, Solano, and Sonoma)
  • Sub-county: 40 Consistent Public Use Microdata Areas (CPUMAs)
  • Large city: Six large Bay Area cities (Fremont, Hayward, Oakland, San Francisco, San José, and Sunnyvale)
  • Other city or town: 95 other Bay Area cities and towns
  • Census Designated Place (CDP): 119 unincorporated areas of Bay Area counties identified by the Census for statistical purposes
  • State: California

Some maps in the Atlas also include data for the 1,588 census tracts in the region. Indicator data is available for a variety of years from 2000 onward and was derived to reflect consistent geographic boundaries over time. Some geographies, such as counties, tend to be fairly stable over time; others, such as census-defined places and census tracts, can change with each decennial census. Data for counties (and thus the five- and nine-county regions), census-defined places, and census tracts reflect the official geographic boundaries of the 2010 Decennial Census (although there have been no county boundary changes in the region since 2000).

Consistent Public Use Microdata Areas (referred to as “CPUMAs” in the Atlas) are a geography created by the Integrated Public Use Microdata Series (IPUMS USA). The version used in the Atlas is based on the CPUMA0010 variable and is drawn to essentially form what is the lowest common denominator from a geographic perspective between 2000 and 2010 Public Use Microdata Areas (PUMAs). PUMAs are statistical geographic areas of at least 100,000 people and are defined for the dissemination of Public Use Microdata Sample (PUMS) data from the Decennial Census and American Community Survey (ACS).

Although a variety of state and local data sources are used, the summary and microdata files from the long form (Summary File 3) of the 2000 Census and from the ACS for years 2010 and later are the primary data sources, and account for a large portion of the indicators and breakdowns found in the Atlas. Given the detailed ways in which the Atlas cuts data by geography and by social and demographic characteristics, we selected versions of the summary and microdata files that provide a large enough sample size for a reasonable degree of statistical reliability. For example, Summary File 3 of the 2000 Census includes a sample of about one in six households, and the microdata files are available as both 1- and 5-percent samples of the U.S. population. We chose the 2000 5-percent sample. The ACS, which replaced the long form of the Decennial Census in 2010, is an annual survey administered continuously throughout the year that covers about 1 percent of the U.S. population each year. The data are currently released as both 1-year and 5-year samples, and we use the 5-year samples to achieve a sample size comparable to that in 2000.

There is a tradeoff between the summary files and microdata in terms of the level of detail of the social and demographic characteristics that can be tabulated, and the level of detail in terms of geography. The summary files of the Decennial Census and ACS include a limited set of summary tabulations of population and housing characteristics and are available at a high level of geographic detail (down to what is known as the census block group level), while the microdata (or PUMS) files contain individual responses to these respective surveys, allowing for great flexibility in terms of the characteristics that can be tabulated, but they can only be tabulated down to the PUMA level. 

To strike a balance between providing some data for all Atlas geographies for as many indicators as possible — including smaller geographies such as census-defined places and census tracts — and providing data by detailed demographic categories such as race/ethnicity, gender, ancestry, and poverty level, we draw upon both the summary and microdata files for many indicators. In general, we use the Census and ACS summary files for certain breakdowns (e.g., trends, rankings, maps) that cover all Atlas geographies but are not highly detailed in terms of demographic categories, and draw upon Census and ACS microdata for other breakdowns that are not available for the smallest Atlas geographies (e.g., other city or town, census designated place, and census tract) but provide indicator data by far more detailed demographic categories.

Details on the data sources and methods used for each indicator can be found by clicking on the People, Place, and Power indicator category links at the bottom of this page. However, a few points about dataset construction are relevant to many indicators and are worth describing here; most relate to how the Census and ACS summary files and microdata are applied.

Assembling Geographically Consistent Data Over Time

As noted above, most Atlas geographies are already consistent between 2000 and later years, with the exception of census-defined places and census tracts. To derive data for 2000 in geographies consistent with the 2010 Census, we generally draw on a version of Summary File 3 of the 2000 Census from GeoLytics, Inc. that has been “re-shaped” to reflect 2010 census geographies. This was necessary mainly for census tracts, which experienced many changes between the 2000 and 2010 Census, but also for census-defined places, which experienced fewer changes. Fortunately, the ACS summary files for 2010 and later (at least until census boundaries are redrawn based on the 2020 Decennial Census) already generally follow 2010 Census geographies. While no census tracts changed boundaries in the Nine-County Bay Area between 2010 and later years, a handful of census-defined places did, and in those cases data is missing in years after 2010 (e.g., 2015), given that the Atlas is based upon consistent 2010 Census geographies.

Data at the Consistent Public Use Microdata Area (CPUMA) level are derived for 2000 and later years from 2000 Census (GeoLytics, Inc.) and ACS summary files by aggregating across 2010 census tracts within each CPUMA, using a crosswalk that assigns each census tract to the CPUMA that contains the largest share of its 2010 population by census block (based on the 2010 Census, Summary File 1). Data at the CPUMA level are derived for 2000 and later years from the IPUMS USA version of the microdata files by aggregating observations by the CPUMA0010 variable.
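
The tract-to-CPUMA assignment and aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the Atlas's actual code: the function names, the input layout (one tuple per census block), and the tie-breaking rule are assumptions made for the example.

```python
from collections import defaultdict


def build_tract_to_cpuma(blocks):
    """Assign each tract to the CPUMA holding the largest share of its population.

    `blocks` is an iterable of (tract_id, cpuma_id, block_population) tuples,
    one per census block. This layout is hypothetical.
    """
    population = defaultdict(lambda: defaultdict(int))
    for tract_id, cpuma_id, block_pop in blocks:
        population[tract_id][cpuma_id] += block_pop

    crosswalk = {}
    for tract_id, by_cpuma in population.items():
        # Assumption: ties go to the lowest CPUMA id so results are deterministic.
        crosswalk[tract_id] = max(sorted(by_cpuma), key=lambda c: by_cpuma[c])
    return crosswalk


def aggregate_to_cpuma(tract_values, crosswalk):
    """Sum a tract-level count (e.g., a population total) up to the CPUMA level."""
    totals = defaultdict(int)
    for tract_id, value in tract_values.items():
        totals[crosswalk[tract_id]] += value
    return dict(totals)
```

Only additive quantities (counts) can be summed this way; rates and medians would need to be recomputed from aggregated numerators and denominators.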

Censoring Observations with Small Sample Sizes

Most indicators in the Atlas are measures of central tendency (e.g., means and medians) based on survey data, and are subject to a margin of error. While we do not report margins of error, we do make efforts to avoid reporting highly unreliable estimates. Unless otherwise noted, for all indicators derived from the Census and ACS summary files, we do not report values that are based on fewer than 100 (weighted) observations in the denominator/universe. For example, the universe for the median earnings indicator is full-time workers ages 16 years and older with earnings, and we do not report indicator values with fewer than 100 such workers in the summary file data for any particular geography/demographic group. Similarly (and unless otherwise noted), for all indicators based on the Census and ACS microdata, we do not report any estimates based on a universe of fewer than 100 individual survey respondents. For example, the universe for the disconnected youth indicator (at least for the microdata-based breakdowns) is the population ages 16 through 24 years, and we do not report the percentage of disconnected youth if there are fewer than 100 individual survey respondents (i.e., unweighted) in the universe for any particular geography/demographic group.

However, it is important to keep in mind that even with this restriction in place, all indicator values should be regarded as estimates, and particular care should be taken when interpreting data for less populated geographies and for smaller demographic subgroups. Users should not assume that small differences in indicator values between demographic subgroups are statistically significant. Finally, even with the aforementioned sample size restrictions in place, estimates of zero or 0 percent are possible. Such estimates should be regarded as very small numbers/percentages and not actually zero. Similarly, estimates of 100 percent should be regarded as high percentages, and not actually 100 percent.
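
The two suppression rules above differ in what is counted: weighted observations for summary file data, and unweighted respondents for microdata. A minimal sketch of that distinction follows; the function names and the record layout are assumptions for illustration, and only the threshold of 100 comes from the text.

```python
MIN_UNIVERSE = 100  # threshold stated in the text


def censor_summary_value(value, weighted_universe, minimum=MIN_UNIVERSE):
    """Summary-file rule: suppress when the weighted universe is below the minimum."""
    if weighted_universe < minimum:
        return None
    return value


def censored_microdata_share(records, minimum=MIN_UNIVERSE):
    """Microdata rule: suppress when fewer than `minimum` respondents are in the universe.

    `records` is a list of (person_weight, in_numerator) tuples, one per survey
    respondent in the universe (a hypothetical layout). The reported share is
    weighted, but the suppression test counts respondents without weights.
    """
    if len(records) < minimum:
        return None
    total_weight = sum(weight for weight, _ in records)
    if total_weight <= 0:
        return None
    numerator = sum(weight for weight, flag in records if flag)
    return numerator / total_weight
```

Under this sketch, 50 respondents who each carry a large person weight would still be suppressed in the microdata case, because the check ignores weights.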

Categorizing People by Race/Ethnicity, Nativity, and Ancestry

In the Atlas, categorization of people by race/ethnicity is generally based on individual responses to various census surveys. For most indicators, people are categorized into six mutually exclusive groups on the basis of their response to two separate questions on race and Hispanic origin, plus one more category for all people of color combined, as follows:

  • “White” is used to refer to all people who identify as White alone and do not identify as being of Hispanic origin.
  • “Black” is used to refer to all people who identify as Black or African American alone and do not identify as being of Hispanic origin.
  • “Latino” and “Latinx” are used interchangeably to refer to all people who identify as being of Hispanic origin, regardless of racial identification. 
  • “Asian or Pacific Islander” is used to refer to all people who identify as Asian American, Native Hawaiian, or Pacific Islander alone and do not identify as being of Hispanic origin.
  • “Native American” is used to refer to all people who identify as Native American or Alaska Native alone and do not identify as being of Hispanic origin.
  • “Mixed/other” is used to refer to all people who identify with a single racial category not included above, or who identify with multiple racial categories, and do not identify as being of Hispanic origin. Importantly, prior to the 2000 Census the questionnaire did not allow for multiple responses to the race question, causing some degree of inconsistency in data for this racial/ethnic category before and after 2000.
  • “People of color” is used to refer to all people who do not identify as non-Hispanic White.
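
The default categorization listed above can be sketched as a small function. This is an illustration only: the race labels are placeholder strings rather than actual Census or IPUMS codes, and treating any multi-race response as “Mixed/other” is an assumption about how the "alone" wording is applied.

```python
# Placeholder labels for single-race responses (hypothetical, not Census codes).
SINGLE_RACE_GROUPS = {
    "White": "White",
    "Black or African American": "Black",
    "Asian": "Asian or Pacific Islander",
    "Native Hawaiian or Other Pacific Islander": "Asian or Pacific Islander",
    "American Indian or Alaska Native": "Native American",
}


def categorize_race_ethnicity(is_hispanic, races):
    """Return one of the six mutually exclusive groups.

    `is_hispanic` is the answer to the Hispanic-origin question and `races`
    is the set of race labels the respondent selected.
    """
    if is_hispanic:
        # Hispanic origin takes precedence, regardless of racial identification.
        return "Latino"
    if len(races) == 1:
        (race,) = races
        # Single races not listed above fall into Mixed/other.
        return SINGLE_RACE_GROUPS.get(race, "Mixed/other")
    return "Mixed/other"


def is_person_of_color(group):
    """Everyone who is not non-Hispanic White counts as a person of color."""
    return group != "White"
```

Checking Hispanic origin first is what makes the groups mutually exclusive: each remaining group is non-Hispanic by construction.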

Exceptions to this categorization are noted in the data notes that can be found by clicking on the question mark above each indicator display. They generally arise because Census and/or ACS summary file data is used (rather than microdata), and most summary file tables that disaggregate socioeconomic data by race/ethnicity exclude those who identify as Hispanic or Latinx only from the White population. This means that the data presented for the Black, Asian or Pacific Islander, Native American, and Mixed/other populations may include some number of people from the Latinx category. The Mixed/other category is likely to have the largest share of Latinx people included in the indicator data reported for it, but this depends on the geography being examined.

Categorization of people by nativity is generally based on individual responses to survey questions on country of birth and parental citizenship. Unless otherwise noted, people are categorized into two mutually exclusive groups as follows:

  • “U.S.-born” refers to all people who identify as being born in the United States (including U.S. territories and outlying areas), or born abroad of at least one U.S. citizen parent.
  • “Immigrant” refers to all people who identify as being born abroad, outside of the United States, of non-U.S. citizen parents.
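
Expressed as a rule, the two nativity groups above reduce to a single test. The sketch below assumes two hypothetical boolean inputs derived from the survey responses; it is not the Atlas's actual code.

```python
def categorize_nativity(born_in_us_or_territory, has_us_citizen_parent):
    """Return "U.S.-born" or "Immigrant" following the two definitions above.

    Both arguments are hypothetical booleans: whether the person was born in
    the United States (including U.S. territories and outlying areas), and
    whether at least one parent is a U.S. citizen.
    """
    if born_in_us_or_territory or has_us_citizen_parent:
        return "U.S.-born"
    return "Immigrant"
```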

Some Atlas indicator breakdowns include further detail by ancestry. Most breakdowns are based on 2000 Census and ACS microdata from IPUMS USA, with both broad and detailed ancestral groups identified. The purpose of the data by ancestry is to provide further information on equity indicators and population diversity for distinct subgroups within each of the mutually exclusive broad racial/ethnic groups described above (except for the Mixed/other group). For this reason, the ancestral groupings are defined by examining each broad racial/ethnic group separately and selecting the ancestries within each group that capture a reasonably large number of people identified nationwide. As a result, some ancestral groups are included in more than one broad racial/ethnic group. For example, data for those of Panamanian ancestry is broken out and reported for both the Black and Latinx populations, while data for those of Irish ancestry is available for both the Black and White populations (subject to sample size limitations in the IPUMS data). The ancestral groups broken out for each broad racial/ethnic group other than Native Americans are based on the first response to the census question on ancestry, recorded in the IPUMS variable “ANCESTR1.” For Native Americans, they are based on the detailed responses to the census question on race, recorded in the IPUMS variable “RACED.” The reason for this is that the vast majority of responses for Native Americans to the ancestry question (about 75 percent in most years) are coded in the ANCESTR1 variable as simply “American Indian (all tribes)” while the responses reflected in the RACED variable identify a variety of detailed Native American tribes. For more information on how the ancestral categories were constructed, please see the Data and Methods document for the National Equity Atlas.
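
The choice of source variable described above can be sketched as follows. ANCESTR1 and RACED are the IPUMS variable names given in the text; the function names, the dictionary record layout, and the use of text labels in place of IPUMS numeric codes are assumptions for illustration.

```python
def ancestry_source_variable(broad_group):
    """Pick the IPUMS variable used to identify detailed subgroups."""
    # Native American subgroups come from detailed race (RACED) because most
    # ANCESTR1 responses are coded only as "American Indian (all tribes)".
    return "RACED" if broad_group == "Native American" else "ANCESTR1"


def detailed_group(record, broad_group):
    """Return the detailed subgroup label for one respondent, or None.

    `record` is a hypothetical dict keyed by IPUMS variable name, holding
    text labels instead of the numeric codes used in the real files.
    """
    if broad_group == "Mixed/other":
        # No ancestry breakdown is provided for the Mixed/other group.
        return None
    return record[ancestry_source_variable(broad_group)]
```

Because the lookup is done within each broad group separately, the same ancestry label can appear under more than one broad group, as with the Panamanian and Irish examples in the text.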