The American
Community Survey (ACS) offers geodatabases with geographic
information and associated data of interest to researchers in the area.
The goal of geogenr is to facilitate access to this
information through functions that allow us to select the geodatabases
that interest us, download them, access the information they contain,
filter it and export it in various formats so that we can process it
with other tools if required.
Other packages are available that are very useful to access the same
data, such as tidycensus,
which works in an integrated way with tigris.
The main characteristics of geogenr that distinguish it
from other proposals are the following:
it works locally, once available geodatabases are downloaded (can be downloaded using the package);
supports access at the level of group of variables integrated in a layer, instead of at the level of variable or vector of variables;
decomposes ACS composite variables into structured fields;
allows to directly integrate variables of several years;
and It allows us to export the data to other formats.
The rest of this document is structured as follows: First, the starting data are presented. Then, an illustrative example of how the package works is developed. Finally, the document ends with conclusions.
The package is based on the geodatabases available on the TIGER/Line with Selected Demographic and Economic Data web page. For each year (as of 2012) a list of geodatabases appears under two sections:
Legal and Administrative Areas;
Statistical Areas.
These geodatabases bring together geography from the TIGER/Line Shapefiles and data from the American Community Survey (ACS) 5-year estimates.
Each ACS geodatabase is structured in layers: a geographic layer, a metadata layer, and the rest are data layers. The data layers have a matrix form, the rows are indexed by instances of the geographic layer, the columns by variables defined in the metadata layer, the cells are numeric values. Here we have an example:
GEOID          B01001e1 B01001m1 B01001e2 B01001m2 B01001e3 B01001m3 B01001e4 B01001m4 ...
16000US0100100      218      165       92      114       10       16       18       30
16000US0100124     2582       24     1313       98       45       37       14       19
16000US0100460     4374       24     1963      158      144       76      105       68
16000US0100484      641      159      326       89       10       17       16       11
16000US0100676      295      102      143       55        7       11       14       17
16000US0100820    32878       57    16236      453     1159      257     1151      209
...
Some of the defined variables are shown below.
Short_Name Full_Name
B01001e1   SEX BY AGE: Total: Total Population -- (Estimate)
B01001m1   SEX BY AGE: Total: Total Population -- (Margin of Error)
B01001e2   SEX BY AGE: Male: Total Population -- (Estimate)
B01001m2   SEX BY AGE: Male: Total Population -- (Margin of Error)
B01001e3   SEX BY AGE: Male: Under 5 years: Total Population -- (Estimate)
B01001m3   SEX BY AGE: Male: Under 5 years: Total Population -- (Margin of Error)
B01001e4   SEX BY AGE: Male: 5 to 9 years: Total Population -- (Estimate)
B01001m4   SEX BY AGE: Male: 5 to 9 years: Total Population -- (Margin of Error)
...
Each variable (Short_Name) corresponds to combinations
of various field values separated by a separator (:),
forming a string (Full_Name). The field name of each value
is not available but the topics included are detailed on the web page Subjects
Included in the Survey. There are tens of thousands of variables of
these characteristics that, in addition to the metadata layer, can be
found on the TIGER/Line
with Selected Demographic and Economic Data Record Layouts web page.
For each combination of values, one variable associated with the
estimate and another with the margin of error are
defined. Within each layer, variables can be considered in groups,
defined by the first part of the Full Name (for example
UNWEIGHTED SAMPLE HOUSING UNITS and
SEX BY AGE).
A module of geogenr package analyses the components of
Full_name, structuring them in fields; and it allows access
to variables in groups.
To work with this data with the geogenr package we
distinguish three phases:
obtaining the data,
data selection,
and generation of results,
which are developed below.
Once the result structure is generated, we can export it or define queries on it.
The data is available in the form of a geodatabase. One geodatabase for each area in each of the two area groups.
In this example, first, we select and download the ACS geodatabases using the functions offered by the package. When we have them in the same folder (they can be distributed in subfolders), we can access the information they contain to select the one that interests us.
We create an object of class acs_5yr, indicating a
folder where we will download the geodatabases.
We can query the available geodatabases by group, area, subject and year using the methods offered by the object.
ac |>
  get_area_groups()
#> [1] "Legal and Administrative Areas" "Statistical Areas"
ac |>
  get_areas(group = "Legal and Administrative Areas")
#>  [1] "American Indian/Alaska Native/Native Hawaiian Area"
#>  [2] "Alaska Native Regional Corporation"                
#>  [3] "Congressional District (116th Congress)"           
#>  [4] "County"                                            
#>  [5] "Place"                                             
#>  [6] "Elementary School District"                        
#>  [7] "Secondary School District"                         
#>  [8] "Unified School District"                           
#>  [9] "State"                                             
#> [10] "State Legislative Districts Upper Chamber"         
#> [11] "State Legislative Districts Lower Chamber"         
#> [12] "Code Tabulation Area"
ac |>
  get_area_years(area = "Alaska Native Regional Corporation")
#> [1] "2013" "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021"Once we decide the area to work with, we download the files if they are not already downloaded. There are also functions available to consult previously downloaded data.
ac <- ac |>
  select_area_files("Alaska Native Regional Corporation", 2020:2021)
files <- ac |>
  download_selected_files(unzip = FALSE)#> [1] TRUE TRUEThe data is downloaded to the working folder associated with the object. We can indicate that they are classified by year or by area. We can also indicate that it will be automatically unzipped once downloaded. In this case they are explicitly unzipped.
With these operations we already have the data available locally.
Once unzipped, we can consult the available areas and years.
ac |>
  get_available_areas()
#> [1] "Alaska Native Regional Corporation"
ac |>
  get_available_area_years(area = "Alaska Native Regional Corporation")
#> [1] "2020" "2021"For each area we can query the topics of the available reports.
ac |>
  get_available_area_topics("Alaska Native Regional Corporation")
#>  [1] "X01 Age And Sex"                     "X02 Race"                           
#>  [3] "X03 Hispanic Or Latino Origin"       "X04 Ancestry"                       
#>  [5] "X05 Foreign Born Citizenship"        "X06 Place Of Birth"                 
#>  [7] "X07 Migration"                       "X08 Commuting"                      
#>  [9] "X09 Children Household Relationship" "X10 Grandparents Grandchildren"     
#> [11] "X11 Household Family Subfamilies"    "X12 Marital Status And History"     
#> [13] "X13 Fertility"                       "X14 School Enrollment"              
#> [15] "X15 Educational Attainment"          "X16 Language Spoken At Home"        
#> [17] "X17 Poverty"                         "X18 Disability"                     
#> [19] "X19 Income"                          "X20 Earnings"                       
#> [21] "X21 Veteran Status"                  "X22 Food Stamps"                    
#> [23] "X23 Employment Status"               "X24 Industry Occupation"            
#> [25] "X25 Housing Characteristics"         "X26 Group Quarters"                 
#> [27] "X27 Health Insurance"                "X28 Computer And Internet Use"      
#> [29] "X99 Imputation"We can select one or more topics to obtain the reports they
contain.To select them, we create an object of class
acs_5yr_topic. We can also select the years for which
reports are included; by default all available years are considered.
Once a topic has been selected, we can consult the available reports or subreports.
act |>
  get_report_names()
#> [1] "B01001-Sex By Age"        "B01002-Median Age By Sex"
#> [3] "B01003-Total Population"We can focus on a report or subreport, we can also work with all the reports contained in the topic.
In this case we are going to work with the entire topic.
Once we have obtained a group of reports, we can obtain the associated data in various formats.
acs_5yr_geo class formatIn this format, three layers are defined:
data with geographical information and and the names
and values of the variables for each instance,metadata with the description of the variables,origin with the metadata about the origin of the
data.This format allows us to perform simple queries using the metadata and the geographic layer.
We obtain and consult the content of the metadata: structured description of the variables.
metadata <- geo |>
  get_metadata()
metadata
#> # A tibble: 1,436 × 12
#>    variable year  Short_Name Full_Name   report subreport report_var report_desc
#>    <chr>    <chr> <chr>      <chr>       <chr>  <chr>          <int> <chr>      
#>  1 V0001    2020  B01001Ae1  Sex By Age… B01001 A                  1 Sex By Age…
#>  2 V0002    2020  B01001Ae10 Sex By Age… B01001 A                 10 Sex By Age…
#>  3 V0003    2020  B01001Ae11 Sex By Age… B01001 A                 11 Sex By Age…
#>  4 V0004    2020  B01001Ae12 Sex By Age… B01001 A                 12 Sex By Age…
#>  5 V0005    2020  B01001Ae13 Sex By Age… B01001 A                 13 Sex By Age…
#>  6 V0006    2020  B01001Ae14 Sex By Age… B01001 A                 14 Sex By Age…
#>  7 V0007    2020  B01001Ae15 Sex By Age… B01001 A                 15 Sex By Age…
#>  8 V0008    2020  B01001Ae16 Sex By Age… B01001 A                 16 Sex By Age…
#>  9 V0009    2020  B01001Ae17 Sex By Age… B01001 A                 17 Sex By Age…
#> 10 V0010    2020  B01001Ae18 Sex By Age… B01001 A                 18 Sex By Age…
#> # ℹ 1,426 more rows
#> # ℹ 4 more variables: measure <chr>, item1 <chr>, item2 <chr>, group <chr>We filter the metadata using the functions of the dplyr
package.
metadata <-
  dplyr::filter(
    metadata,
    item2 == "Female" &
      group == "People Who Are American Indian And Alaska Native Alone" &
      measure == "estimate"
  )We obtain a new object whose data corresponds to the metadata that we have obtained as a result of the selection operations.
geo2 <- geo |>
  set_metadata(metadata)
geo2 |>
  get_metadata()
#> # A tibble: 2 × 12
#>   variable year  Short_Name Full_Name    report subreport report_var report_desc
#>   <chr>    <chr> <chr>      <chr>        <chr>  <chr>          <int> <chr>      
#> 1 V0671    2020  B01002Ce3  Median Age … B01002 C                  3 Median Age…
#> 2 V1389    2021  B01002Ce3  Median Age … B01002 C                  3 Median Age…
#> # ℹ 4 more variables: measure <chr>, item1 <chr>, item2 <chr>, group <chr>We work with the new object, for example, by defining a new variable and graphing it using the associated geographic data.
geo_layer <- geo2 |> 
  get_geo_layer()
geo_layer$faiana21vs20 <- 100 * (geo_layer$V1389 - geo_layer$V0671) / geo_layer$V0671
plot(sf::st_shift_longitude(geo_layer[, "faiana21vs20"]))Additionally, we can export any of these objects in
GeoPackage format to use other query tools, such as
QGIS.
rolap package objectsWe can export reports from a theme as flat_table or
star_database objects from the rolap
package, which allow us to enrich them with additional information or
work directly with OLAP tools.
We are going to export them as an object of the
star_database class.
Below are the tables that make up the star ROLAP design obtained.
st_dm <- st |>
  rolap::as_dm_class(pk_facts = FALSE)
st_dm |> 
  dm::dm_draw(rankdir = "LR", view_type = "all")We also show the number of instances of each table.
l_db <- st |>
  rolap::as_tibble_list()
names <- sort(names(l_db))
for (name in names){
  cat(sprintf("name: %s, %d rows\n", name, nrow(l_db[[name]])))
}
#> name: anrc, 8572 rows
#> name: dim_what, 359 rows
#> name: dim_when, 2 rows
#> name: dim_where, 12 rowsWe can also export it to any RDBMS.
The American Community Survey (ACS) offers geodatabases with geographic information and associated data of interest to researchers in the area. These data can be accessed through various alternatives in which we must indicate the year and variable names. Due to the large number of variables and their structure, this operation is not easy.
The geogenr package offers an alternative that allows us
to download the geodatabases that are considered necessary and access
the variables by selecting data layers and logical groups of variables.
Additionally, it allows us to obtain data in various formats to consult
them directly or use other tools.