Chapter 4 Getting started

4.1 Introduction

This chapter provides an overview of the packages, data, and design objects we use frequently throughout this book. As mentioned in Chapter 2, understanding how a survey was conducted helps us make sense of the results and interpret findings. Therefore, we provide background on the datasets used in examples and exercises. Next, we walk through how to create the survey design objects necessary to begin an analysis. Finally, we provide an overview of the {srvyr} package and the steps needed for analysis. Please report any bugs and issues encountered while going through the book to the book’s GitHub repository.

4.2 Setup

This section provides details on the required packages and data, as well as the steps for preparing survey design objects. For a streamlined learning experience, we recommend taking the time to walk through the code provided here and making sure everything is properly set up.

4.2.1 Packages

We use several packages throughout the book, but let’s install and load specific ones for this chapter. Many functions in the examples and exercises are from three packages: {tidyverse}, {survey}, and {srvyr} (Wickham et al. 2019; Lumley 2010; Freedman Ellis and Schneider 2024). The packages can be installed from the Comprehensive R Archive Network (CRAN) using the code below:

install.packages(c("tidyverse", "survey", "srvyr"))

We bundled the datasets used in the book in an R package, {srvyrexploR} (Zimmer, Powell, and Velásquez 2024). To install it from GitHub, use the {pak} package (Csárdi and Hester 2024):

install.packages("pak")
pak::pak("tidy-survey-r/srvyrexploR")

After installing these packages, load them using the library() function:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)

The packages {broom}, {gt}, and {gtsummary} play a role in displaying output and creating formatted tables (Iannone et al. 2024; Robinson, Hayes, and Couch 2023; Sjoberg et al. 2021). Install them with the provided code2:

install.packages(c("gt", "gtsummary"))

After installing these packages, load them using the library() function:

library(broom)
library(gt)
library(gtsummary)

Install and load the {censusapi} package to access the Current Population Survey (CPS), which we use to ensure accurate weighting of a key dataset in the book (Recht 2024). Run the code below to install {censusapi}:

install.packages("censusapi")

After installing this package, load it using the library() function:

library(censusapi)

Note that the {censusapi} package requires a Census API key, available for free from the U.S. Census Bureau website (refer to the package documentation for more information). We recommend storing the Census API key in the R environment instead of directly in the code. To do this, run the Sys.setenv() script below, substituting the API key where it says YOUR_API_KEY_HERE.

Sys.setenv(CENSUS_KEY = "YOUR_API_KEY_HERE")

Then, restart the R session. Once the Census API key is stored, we can retrieve it in our R code with Sys.getenv("CENSUS_KEY").

There are a few other packages used in the book in limited frequency. We list them in the Prerequisite boxes at the beginning of each chapter. As we work through the book, make sure to check the Prerequisite box and install any missing packages before proceeding.

4.2.2 Data

The {srvyrexploR} package contains the datasets used in the book. Once installed and loaded, explore the documentation using the help() function. Read the descriptions of the datasets to understand what they contain:

help(package = "srvyrexploR")

This book uses two main datasets: the American National Election Studies (ANES – DeBell 2010) and the Residential Energy Consumption Survey (RECS – U.S. Energy Information Administration 2023b), which are included as anes_2020 and recs_2020 in the {srvyrexploR} package, respectively.

American National Election Studies Data

American National Election Studies (ANES) collect data from election surveys dating back to 1948. These surveys contain information on public opinion and voting behavior in U.S. presidential elections and some midterm elections3. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey (data used in this book) was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI).

When working with new survey data, we should review the survey documentation (see Chapter 3) to understand the data collection methods. The original ANES data contains variables starting with V20 (DeBell 2010), so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent’s age is now in a variable called Age, and gender is in a variable called Gender. These descriptive variables are included in the {srvyrexploR} package. A complete overview of all variables can be found in Appendix B.

Before beginning an analysis, it is useful to view the data to understand the available variables. The dplyr::glimpse() function produces a list of all variables, their types (e.g., function, double), and a few example values. Below, we remove variables containing a “V” followed by numbers with select(-matches("^V\\d")) before using glimpse() to get a quick overview of the data with descriptive variable names:

anes_2020 %>%
  select(-matches("^V\\d")) %>%
  glimpse()
## Rows: 7,453
## Columns: 21
## $ CaseID                  <dbl> 200015, 200022, 200039, 200046, 200053…
## $ InterviewMode           <fct> Web, Web, Web, Web, Web, Web, Web, Web…
## $ Weight                  <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658…
## $ VarUnit                 <fct> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2,…
## $ Stratum                 <fct> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, …
## $ CampaignInterest        <fct> Somewhat interested, Not much interest…
## $ EarlyVote2020           <fct> NA, NA, NA, NA, NA, NA, NA, NA, Yes, N…
## $ VotedPres2016           <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, …
## $ VotedPres2016_selection <fct> Trump, Other, Clinton, Clinton, Trump,…
## $ PartyID                 <fct> Strong republican, Independent, Indepe…
## $ TrustGovernment         <fct> Never, Never, Some of the time, About …
## $ TrustPeople             <fct> About half the time, Some of the time,…
## $ Age                     <dbl> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43…
## $ AgeGroup                <fct> 40-49, 30-39, 40-49, 40-49, 70 or olde…
## $ Education               <fct> Bachelor's, Post HS, High school, Post…
## $ RaceEth                 <fct> "Hispanic", "Asian, NH/PI", "White", "…
## $ Gender                  <fct> Male, Female, Female, Male, Male, Fema…
## $ Income                  <fct> "$175,000-249,999", "$70,000-74,999", …
## $ Income7                 <fct> $125k or more, $60k to < 80k, $100k to…
## $ VotedPres2020           <fct> NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, …
## $ VotedPres2020_selection <fct> NA, Other, Biden, Biden, Trump, Biden,…

From the output, we can see there are 7,453 rows and 21 variables in the ANES data. This output also indicates that most of the variables are factors (e.g., InterviewMode), while a few variables are in double (numeric) format (e.g., Age).

Residential Energy Consumption Survey Data

Residential Energy Consumption Survey (RECS) is a study that measures energy consumption and expenditure in American households. Funded by the Energy Information Administration, RECS data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web, with modes changing over time. The survey has been fielded 14 times between 1950 and 2020. It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance.

We should read the survey documentation (see Chapter 3) to understand how the data were collected and implemented. An overview of all variables can be found in Appendix C.

Before starting an analysis, we recommend viewing the data to understand the types of data and variables that are included. The dplyr::glimpse() function produces a list of all variables, the type of the variable (e.g., function, double), and a few example values. Below, we remove the weight variables with select(-matches("^NWEIGHT")) before using glimpse() to get a quick overview of the data:

recs_2020 %>%
  select(-matches("^NWEIGHT")) %>%
  glimpse()
## Rows: 18,496
## Columns: 39
## $ DOEID            <dbl> 1e+05, 1e+05, 1e+05, 1e+05, 1e+05, 1e+05, 1e+…
## $ ClimateRegion_BA <fct> Mixed-Dry, Mixed-Humid, Mixed-Dry, Mixed-Humi…
## $ Urbanicity       <fct> Urban Area, Urban Area, Urban Area, Urban Are…
## $ Region           <fct> West, South, West, South, Northeast, South, S…
## $ REGIONC          <chr> "WEST", "SOUTH", "WEST", "SOUTH", "NORTHEAST"…
## $ Division         <fct> Mountain South, West South Central, Mountain …
## $ STATE_FIPS       <chr> "35", "05", "35", "45", "34", "48", "40", "28…
## $ state_postal     <fct> NM, AR, NM, SC, NJ, TX, OK, MS, DC, AZ, CA, T…
## $ state_name       <fct> New Mexico, Arkansas, New Mexico, South Carol…
## $ HDD65            <dbl> 3844, 3766, 3819, 2614, 4219, 901, 3148, 1825…
## $ CDD65            <dbl> 1679, 1458, 1696, 1718, 1363, 3558, 2128, 237…
## $ HDD30YR          <dbl> 4451, 4429, 4500, 3229, 4896, 1150, 3564, 266…
## $ CDD30YR          <dbl> 1027, 1305, 1010, 1653, 1059, 3588, 2043, 216…
## $ HousingUnitType  <fct> Single-family detached, Apartment: 5 or more …
## $ YearMade         <ord> 1970-1979, 1980-1989, 1960-1969, 1980-1989, 1…
## $ TOTSQFT_EN       <dbl> 2100, 590, 900, 2100, 800, 4520, 2100, 900, 7…
## $ TOTHSQFT         <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 900, 7…
## $ TOTCSQFT         <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 0, 500…
## $ SpaceHeatingUsed <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ ACUsed           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FAL…
## $ HeatingBehavior  <fct> Set one temp and leave it, Turn on or off as …
## $ WinterTempDay    <dbl> 70, 70, 69, 68, 68, 76, 74, 70, 68, 70, 72, 7…
## $ WinterTempAway   <dbl> 70, 65, 68, 68, 68, 76, 65, 70, 60, 70, 70, 7…
## $ WinterTempNight  <dbl> 68, 65, 67, 68, 68, 68, 74, 68, 62, 68, 72, 7…
## $ ACBehavior       <fct> Set one temp and leave it, Turn on or off as …
## $ SummerTempDay    <dbl> 71, 68, 70, 72, 72, 69, 68, NA, 72, 74, 77, 7…
## $ SummerTempAway   <dbl> 71, 68, 68, 72, 72, 74, 70, NA, 76, 74, 77, 7…
## $ SummerTempNight  <dbl> 71, 68, 68, 72, 72, 68, 70, NA, 68, 72, 77, 7…
## $ BTUEL            <dbl> 42723, 17889, 8147, 31647, 20027, 48968, 4940…
## $ DOLLAREL         <dbl> 1955.06, 713.27, 334.51, 1424.86, 1087.00, 18…
## $ BTUNG            <dbl> 101924.4, 10145.3, 22603.1, 55118.7, 39099.5,…
## $ DOLLARNG         <dbl> 701.83, 261.73, 188.14, 636.91, 376.04, 439.4…
## $ BTULP            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 178…
## $ DOLLARLP         <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ BTUFO            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 681…
## $ DOLLARFO         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 187…
## $ BTUWOOD          <dbl> 0, 0, 0, 0, 0, 3000, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TOTALBTU         <dbl> 144648, 28035, 30750, 86765, 59127, 85401, 13…
## $ TOTALDOL         <dbl> 2656.9, 975.0, 522.6, 2061.8, 1463.0, 2335.1,…

From the output, we can see that the RECS data has 18,496 rows and 39 non-weight variables. This output also indicates that most of the variables are in double (numeric) format (e.g., TOTSQFT_EN), with some factor (e.g., Region), Boolean (e.g., ACUsed), character (e.g., REGIONC), and ordinal (e.g., YearMade) variables.

4.2.3 Design objects

The design object is the backbone for survey analysis. It is where we specify the sampling design, weights, and other necessary information to ensure we account for errors in the data. Before creating the design object, we should carefully review the survey documentation to understand how to create the design object for accurate analysis.

In this section, we provide details on how to code the design object for the ANES and RECS data used in the book. However, we only provide a high-level overview to get readers started. For a deeper understanding of creating design objects for a variety of sampling designs, see Chapter 10.

While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter 12), the actual survey analysis and inference should be performed with the survey design objects instead of the original survey data. For example, the ANES data is called anes_2020. If we create a survey design object called anes_des, our survey analyses should begin with anes_des and not anes_2020. Using the survey design object ensures that our calculations appropriately account for the details of the survey design.

American National Election Studies Design Object

The ANES documentation (DeBell 2010) details the sampling and weighting implications for analyzing the survey data. From this documentation and as noted in Chapter 3, the 2020 ANES data are weighted to the sample, not the population. To make generalizations about the population, we need to weigh the data against the full population count. The ANES methodology recommends using the Current Population Survey (CPS) to determine the number of non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or D.C. in March 2020.

We can use the {censusapi} package to obtain the information needed for the survey design object. The getCensus() function allows us to retrieve the CPS data for March (cps/basic/mar) in 2020 (vintage = 2020). Additionally, we extract several variables from the CPS:

  • month (HRMONTH) and year (HRYEAR4) of the interview: to confirm the correct time period
  • age (PRTAGE) of the respondent: to narrow the population to 18 and older (eligible age to vote)
  • citizenship status (PRCITSHP) of the respondent: to narrow the population to only those eligible to vote
  • final person-level weight (PWSSWGT)

Detailed information for these variables can be found in the CPS data dictionary.

cps_state_in <- getCensus(
  name = "cps/basic/mar",
  vintage = 2020,
  region = "state",
  vars = c(
    "HRMONTH", "HRYEAR4",
    "PRTAGE", "PRCITSHP", "PWSSWGT"
  ),
  key = Sys.getenv("CENSUS_KEY")
)

cps_state <- cps_state_in %>%
  as_tibble() %>%
  mutate(across(
    .cols = everything(),
    .fns = as.numeric
  ))

In the code above, we include region = "state". The default region type for the CPS data is at the state level. While not required, including the region can be helpful for understanding the geographical context of the data.

In getCensus(), we filtered the dataset by specifying the month (HRMONTH == 3) and year (HRYEAR4 == 2020) of our request. Therefore, we expect that all interviews within our output were conducted during that particular month and year. We can confirm that the data are from March 2020 by running the code below:

cps_state %>%
  distinct(HRMONTH, HRYEAR4)
## # A tibble: 1 × 2
##   HRMONTH HRYEAR4
##     <dbl>   <dbl>
## 1       3    2020

We can narrow down the dataset using the age and citizenship variables to include only individuals who are 18 years or older (PRTAGE >= 18) and have U.S. citizenship (PRCITSHIP %in% c(1:4)):

cps_narrow_resp <- cps_state %>%
  filter(
    PRTAGE >= 18,
    PRCITSHP %in% c(1:4)
  )

To calculate the U.S. population from the filtered data, we sum the person weights (PWSSWGT):

targetpop <- cps_narrow_resp %>%
  pull(PWSSWGT) %>%
  sum()

scales::comma(targetpop)
## [1] "231,034,125"

The population of interest in 2020 is 231,034,125. This result gives us what we need to create the survey design object for estimating population statistics. Using the anes_2020 data, we adjust the weighting variable (V200010b) using the population of interest we just calculated (targetpop). We determine the proportion of the total weight for each individual weight (V200010b / sum(V200010b)) and then multiply that proportion by the calculated population of interest.

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = V200010b / sum(V200010b) * targetpop)

Once we have the adjusted weights, we can refer to the rest of the documentation to create the survey design. The documentation indicates that the study uses a stratified cluster sampling design. Therefore, we need to specify variables for strata and ids (cluster) and fill in the nest argument. The documentation provides guidance on which strata and cluster variables to use depending on whether we are analyzing pre- or post-election data. In this book, we analyze post-election data, so we need to use the post-election weight V200010b, strata variable V200010d, and Primary Sampling Unit (PSU)/cluster variable V200010c. Additionally, we set nest=TRUE to ensure the clusters are nested within the strata.

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = V200010d,
    ids = V200010c,
    nest = TRUE
  )

anes_des
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (101) clusters.
## Called via srvyr
## Sampling variables:
##   - ids: V200010c 
##   - strata: V200010d 
##   - weights: Weight 
## Data variables: 
##   - V200001 (dbl), CaseID (dbl), V200002 (dbl+lbl), InterviewMode
##     (fct), V200010b (dbl), Weight (dbl), V200010c (dbl), VarUnit (fct),
##     V200010d (dbl), Stratum (fct), V201006 (dbl+lbl), CampaignInterest
##     (fct), V201023 (dbl+lbl), EarlyVote2020 (fct), V201024 (dbl+lbl),
##     V201025x (dbl+lbl), V201028 (dbl+lbl), V201029 (dbl+lbl), V201101
##     (dbl+lbl), V201102 (dbl+lbl), VotedPres2016 (fct), V201103
##     (dbl+lbl), VotedPres2016_selection (fct), V201228 (dbl+lbl),
##     V201229 (dbl+lbl), V201230 (dbl+lbl), V201231x (dbl+lbl), PartyID
##     (fct), V201233 (dbl+lbl), TrustGovernment (fct), V201237 (dbl+lbl),
##     TrustPeople (fct), V201507x (dbl+lbl), Age (dbl), AgeGroup (fct),
##     V201510 (dbl+lbl), Education (fct), V201546 (dbl+lbl), V201547a
##     (dbl+lbl), V201547b (dbl+lbl), V201547c (dbl+lbl), V201547d
##     (dbl+lbl), V201547e (dbl+lbl), V201547z (dbl+lbl), V201549x
##     (dbl+lbl), RaceEth (fct), V201600 (dbl+lbl), Gender (fct), V201607
##     (dbl+lbl), V201610 (dbl+lbl), V201611 (dbl+lbl), V201613 (dbl+lbl),
##     V201615 (dbl+lbl), V201616 (dbl+lbl), V201617x (dbl+lbl), Income
##     (fct), Income7 (fct), V202051 (dbl+lbl), V202066 (dbl+lbl), V202072
##     (dbl+lbl), VotedPres2020 (fct), V202073 (dbl+lbl), V202109x
##     (dbl+lbl), V202110x (dbl+lbl), VotedPres2020_selection (fct)

We can examine this new object to learn more about the survey design, such that the ANES is a “Stratified 1 - level Cluster Sampling design (with replacement) With (101) clusters.” Additionally, the output displays the sampling variables and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis.

Residential Energy Consumption Survey Design Object

The RECS documentation (U.S. Energy Information Administration 2023b) provides information on the survey’s sampling and weighting implications for analysis. The documentation shows the 2020 RECS uses Jackknife weights, where the main analytic weight is NWEIGHT, and the Jackknife weights are NWEIGHT1-NWEIGHT60. We can specify these in the weights and repweights arguments in the survey design object code, respectively.

With Jackknife weights, additional information is required: type, scale, and mse. Chapter 10 discusses in depth each of these arguments; but to quickly get started, the RECS documentation lets us know that type=JK1, scale=59/60, and mse = TRUE. We can use the following code to create the survey design object:

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )

recs_des
## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
## Sampling variables:
##   - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
##     NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
##     NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 +
##     NWEIGHT17 + NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 +
##     NWEIGHT22 + NWEIGHT23 + NWEIGHT24 + NWEIGHT25 + NWEIGHT26 +
##     NWEIGHT27 + NWEIGHT28 + NWEIGHT29 + NWEIGHT30 + NWEIGHT31 +
##     NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 + NWEIGHT36 +
##     NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
##     NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 +
##     NWEIGHT47 + NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 +
##     NWEIGHT52 + NWEIGHT53 + NWEIGHT54 + NWEIGHT55 + NWEIGHT56 +
##     NWEIGHT57 + NWEIGHT58 + NWEIGHT59 + NWEIGHT60` 
##   - weights: NWEIGHT 
## Data variables: 
##   - DOEID (dbl), ClimateRegion_BA (fct), Urbanicity (fct), Region
##     (fct), REGIONC (chr), Division (fct), STATE_FIPS (chr),
##     state_postal (fct), state_name (fct), HDD65 (dbl), CDD65 (dbl),
##     HDD30YR (dbl), CDD30YR (dbl), HousingUnitType (fct), YearMade
##     (ord), TOTSQFT_EN (dbl), TOTHSQFT (dbl), TOTCSQFT (dbl),
##     SpaceHeatingUsed (lgl), ACUsed (lgl), HeatingBehavior (fct),
##     WinterTempDay (dbl), WinterTempAway (dbl), WinterTempNight (dbl),
##     ACBehavior (fct), SummerTempDay (dbl), SummerTempAway (dbl),
##     SummerTempNight (dbl), NWEIGHT (dbl), NWEIGHT1 (dbl), NWEIGHT2
##     (dbl), NWEIGHT3 (dbl), NWEIGHT4 (dbl), NWEIGHT5 (dbl), NWEIGHT6
##     (dbl), NWEIGHT7 (dbl), NWEIGHT8 (dbl), NWEIGHT9 (dbl), NWEIGHT10
##     (dbl), NWEIGHT11 (dbl), NWEIGHT12 (dbl), NWEIGHT13 (dbl), NWEIGHT14
##     (dbl), NWEIGHT15 (dbl), NWEIGHT16 (dbl), NWEIGHT17 (dbl), NWEIGHT18
##     (dbl), NWEIGHT19 (dbl), NWEIGHT20 (dbl), NWEIGHT21 (dbl), NWEIGHT22
##     (dbl), NWEIGHT23 (dbl), NWEIGHT24 (dbl), NWEIGHT25 (dbl), NWEIGHT26
##     (dbl), NWEIGHT27 (dbl), NWEIGHT28 (dbl), NWEIGHT29 (dbl), NWEIGHT30
##     (dbl), NWEIGHT31 (dbl), NWEIGHT32 (dbl), NWEIGHT33 (dbl), NWEIGHT34
##     (dbl), NWEIGHT35 (dbl), NWEIGHT36 (dbl), NWEIGHT37 (dbl), NWEIGHT38
##     (dbl), NWEIGHT39 (dbl), NWEIGHT40 (dbl), NWEIGHT41 (dbl), NWEIGHT42
##     (dbl), NWEIGHT43 (dbl), NWEIGHT44 (dbl), NWEIGHT45 (dbl), NWEIGHT46
##     (dbl), NWEIGHT47 (dbl), NWEIGHT48 (dbl), NWEIGHT49 (dbl), NWEIGHT50
##     (dbl), NWEIGHT51 (dbl), NWEIGHT52 (dbl), NWEIGHT53 (dbl), NWEIGHT54
##     (dbl), NWEIGHT55 (dbl), NWEIGHT56 (dbl), NWEIGHT57 (dbl), NWEIGHT58
##     (dbl), NWEIGHT59 (dbl), NWEIGHT60 (dbl), BTUEL (dbl), DOLLAREL
##     (dbl), BTUNG (dbl), DOLLARNG (dbl), BTULP (dbl), DOLLARLP (dbl),
##     BTUFO (dbl), DOLLARFO (dbl), BTUWOOD (dbl), TOTALBTU (dbl),
##     TOTALDOL (dbl)

Viewing this new object provides information about the survey design, such that RECS is an “Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.” Additionally, the output shows the sampling variables (NWEIGHT1-NWEIGHT60) and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis.

4.3 Survey analysis process

There is a general process for analyzing data to create estimates with {srvyr} package:

  1. Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep()

  2. Subset data (if needed) using filter() (to create subpopulations)

  3. Specify domains of analysis using group_by()

  4. Within summarize(), specify variables to calculate, including means, totals, proportions, quantiles, and more

In Section 4.2.3, we follow Step 1 to create the survey design objects for the ANES and RECS data featured in this book. Additional details on how to create design objects can be found in Chapter 10. Then, once we have the design object, we can filter the data to any subpopulation of interest (if needed). It is important to filter the data after creating the design object. This ensures that we are accurately accounting for the survey design in our calculations. Finally, we can use group_by(), summarize(), and other functions from the {survey} and {srvyr} packages to analyze the survey data by estimating means, totals, and so on.

4.4 Similarities between {dplyr} and {srvyr} functions

The {dplyr} package from the tidyverse offers flexible and intuitive functions for data wrangling (Wickham et al. 2023). One of the major advantages of using {srvyr} is that it applies {dplyr}-like syntax to the {survey} package (Freedman Ellis and Schneider 2024). We can use pipes, such as %>% from the {magrittr} package, to specify a survey design object, apply a function, and then feed that output into the next function’s first argument (Bache and Wickham 2022). Functions follow the ‘tidy’ convention of snake_case function names.

To help explain the similarities between {dplyr} functions and {srvyr} functions, we use the towny dataset from the {gt} package and apistrat data that comes in the {survey} package. The towny dataset provides population data for municipalities in Ontario, Canada on census years between 1996 and 2021. Taking a look at towny with dplyr::glimpse(), we can see the dataset has 25 columns with a mix of character and numeric data.

towny %>%
  glimpse()
## Rows: 414
## Columns: 25
## $ name                     <chr> "Addington Highlands", "Adelaide Metc…
## $ website                  <chr> "https://addingtonhighlands.ca", "htt…
## $ status                   <chr> "lower-tier", "lower-tier", "lower-ti…
## $ csd_type                 <chr> "township", "township", "township", "…
## $ census_div               <chr> "Lennox and Addington", "Middlesex", …
## $ latitude                 <dbl> 45.00, 42.95, 44.13, 45.53, 43.86, 48…
## $ longitude                <dbl> -77.25, -81.70, -79.93, -76.90, -79.0…
## $ land_area_km2            <dbl> 1293.99, 331.11, 371.53, 519.59, 66.6…
## $ population_1996          <int> 2429, 3128, 9359, 2837, 64430, 1027, …
## $ population_2001          <int> 2402, 3149, 10082, 2824, 73753, 956, …
## $ population_2006          <int> 2512, 3135, 10695, 2716, 90167, 958, …
## $ population_2011          <int> 2517, 3028, 10603, 2844, 109600, 864,…
## $ population_2016          <int> 2318, 2990, 10975, 2935, 119677, 969,…
## $ population_2021          <int> 2534, 3011, 10989, 2995, 126666, 954,…
## $ density_1996             <dbl> 1.88, 9.45, 25.19, 5.46, 966.84, 8.81…
## $ density_2001             <dbl> 1.86, 9.51, 27.14, 5.44, 1106.74, 8.2…
## $ density_2006             <dbl> 1.94, 9.47, 28.79, 5.23, 1353.05, 8.2…
## $ density_2011             <dbl> 1.95, 9.14, 28.54, 5.47, 1644.66, 7.4…
## $ density_2016             <dbl> 1.79, 9.03, 29.54, 5.65, 1795.87, 8.3…
## $ density_2021             <dbl> 1.96, 9.09, 29.58, 5.76, 1900.75, 8.1…
## $ pop_change_1996_2001_pct <dbl> -0.0111, 0.0067, 0.0773, -0.0046, 0.1…
## $ pop_change_2001_2006_pct <dbl> 0.0458, -0.0044, 0.0608, -0.0382, 0.2…
## $ pop_change_2006_2011_pct <dbl> 0.0020, -0.0341, -0.0086, 0.0471, 0.2…
## $ pop_change_2011_2016_pct <dbl> -0.0791, -0.0125, 0.0351, 0.0320, 0.0…
## $ pop_change_2016_2021_pct <dbl> 0.0932, 0.0070, 0.0013, 0.0204, 0.058…

Let’s examine the towny object’s class. We verify that it is a tibble, as indicated by "tbl_df", by running the code below:

class(towny)
## [1] "tbl_df"     "tbl"        "data.frame"

All tibbles are data.frames, but not all data.frames are tibbles. Compared to data.frames, tibbles have some advantages, with the printing behavior being a noticeable advantage. When working with tidyverse style code, we recommend making all your datasets tibbles for ease of analysis.

The {survey} package contains datasets related to the California Academic Performance Index, which measures student performance in schools with at least 100 students in California. We can access these datasets by loading the {survey} package and running data(api).

Let’s work with the apistrat dataset, which is a stratified random sample, stratified by school type (stype) with three levels: E for elementary school, M for middle school, and H for high school. We first create the survey design object (see Chapter 10 for more information). The sample is stratified by the stype variable and the sampling weights are found in the pw variable. We can use this information to construct the design object, apistrat_des.

data(api)

apistrat_des <- apistrat %>%
  as_survey_design(
    strata = stype,
    weights = pw
  )

When we check the class of apistrat_des, it is not a typical data.frame. Applying the as_survey_design() function transforms the data into a tbl_svy, a special class specifically for survey design objects. The {srvyr} package is designed to work with the tbl_svy class of objects.

class(apistrat_des)
## [1] "tbl_svy"        "survey.design2" "survey.design"

Let’s look at how {dplyr} works with regular data frames. The example below calculates the mean and median for the land_area_km2 variable in the towny dataset.

towny %>%
  summarize(
    area_mean = mean(land_area_km2),
    area_median = median(land_area_km2)
  )
## # A tibble: 1 × 2
##   area_mean area_median
##       <dbl>       <dbl>
## 1      373.        273.

In the code below, we calculate the mean and median of the variable api00 using apistrat_des. Note the similarity in the syntax. However, the standard error of the statistic is also calculated in addition to the statistic itself.

apistrat_des %>%
  summarize(
    api00_mean = survey_mean(api00),
    api00_med = survey_median(api00)
  )
## # A tibble: 1 × 4
##   api00_mean api00_mean_se api00_med api00_med_se
##        <dbl>         <dbl>     <dbl>        <dbl>
## 1       662.          9.54       668         13.7

The functions in {srvyr} also play nicely with other tidyverse functions. For example, if we wanted to select columns with shared characteristics, we can use {tidyselect} functions such as starts_with(), num_range(), etc. (Henry and Wickham 2024). In the examples below, we use a combination of across() and starts_with() to calculate the mean of variables starting with “population” in the towny data frame and those beginning with api in the apistrat_des survey object.

towny %>%
  summarize(across(
    starts_with("population"),
    ~ mean(.x, na.rm = TRUE)
  ))
## # A tibble: 1 × 6
##   population_1996 population_2001 population_2006 population_2011
##             <dbl>           <dbl>           <dbl>           <dbl>
## 1          25866.          27538.          29173.          30838.
## # ℹ 2 more variables: population_2016 <dbl>, population_2021 <dbl>
apistrat_des %>%
  summarize(across(
    starts_with("api"),
    survey_mean
  ))
## # A tibble: 1 × 6
##   api00 api00_se api99 api99_se api.stu api.stu_se
##   <dbl>    <dbl> <dbl>    <dbl>   <dbl>      <dbl>
## 1  662.     9.54  629.     10.1    498.       16.4

We have the flexibility to use {dplyr} verbs such as mutate(), filter(), and select() on our survey design object. As mentioned in Section 4.3, these steps should be performed on the survey design object. This ensures our survey design is properly considered in all our calculations.

apistrat_des_mod <- apistrat_des %>%
  mutate(api_diff = api00 - api99) %>%
  filter(stype == "E") %>%
  select(stype, api99, api00, api_diff, api_students = api.stu)

apistrat_des_mod
## Stratified Independent Sampling design (with replacement)
## Called via srvyr
## Sampling variables:
##   - ids: `1` 
##   - strata: stype 
##   - weights: pw 
## Data variables: 
##   - stype (fct), api99 (int), api00 (int), api_diff (int), api_students
##     (int)
apistrat_des
## Stratified Independent Sampling design (with replacement)
## Called via srvyr
## Sampling variables:
##   - ids: `1` 
##   - strata: stype 
##   - weights: pw 
## Data variables: 
##   - cds (chr), stype (fct), name (chr), sname (chr), snum (dbl), dname
##     (chr), dnum (int), cname (chr), cnum (int), flag (int), pcttest
##     (int), api00 (int), api99 (int), target (int), growth (int),
##     sch.wide (fct), comp.imp (fct), both (fct), awards (fct), meals
##     (int), ell (int), yr.rnd (fct), mobility (int), acs.k3 (int),
##     acs.46 (int), acs.core (int), pct.resp (int), not.hsg (int), hsg
##     (int), some.col (int), col.grad (int), grad.sch (int), avg.ed
##     (dbl), full (int), emer (int), enroll (int), api.stu (int), pw
##     (dbl), fpc (dbl)

Several functions in {srvyr} must be called within srvyr::summarize(), with the exception of srvyr::survey_count() and srvyr::survey_tally(). This is similar to how dplyr::count() and dplyr::tally() are not called within dplyr::summarize(). The summarize() function can be used in conjunction with the group_by() function or by/.by arguments, which applies the functions on a group-by-group basis to create grouped summaries.

towny %>%
  group_by(csd_type) %>%
  dplyr::summarize(
    area_mean = mean(land_area_km2),
    area_median = median(land_area_km2)
  )
## # A tibble: 5 × 3
##   csd_type     area_mean area_median
##   <chr>            <dbl>       <dbl>
## 1 city             498.        198. 
## 2 municipality     607.        488. 
## 3 town             183.        129. 
## 4 township         363.        301. 
## 5 village           23.0         3.3

We use a similar setup to summarize data in {srvyr}:

apistrat_des %>%
  group_by(stype) %>%
  summarize(
    api00_mean = survey_mean(api00),
    api00_median = survey_median(api00)
  )
## # A tibble: 3 × 5
##   stype api00_mean api00_mean_se api00_median api00_median_se
##   <fct>      <dbl>         <dbl>        <dbl>           <dbl>
## 1 E           674.          12.5          671            20.7
## 2 H           626.          15.5          635            21.6
## 3 M           637.          16.6          648            24.1

An alternative way to do grouped analysis on the towny data would be with the .by argument:

towny %>%
  dplyr::summarize(
    area_mean = mean(land_area_km2),
    area_median = median(land_area_km2),
    .by = csd_type
  )
## # A tibble: 5 × 3
##   csd_type     area_mean area_median
##   <chr>            <dbl>       <dbl>
## 1 township         363.        301. 
## 2 town             183.        129. 
## 3 municipality     607.        488. 
## 4 city             498.        198. 
## 5 village           23.0         3.3

The .by syntax is similarly implemented in {srvyr} for grouped analysis:

apistrat_des %>%
  summarize(
    api00_mean = survey_mean(api00),
    api00_median = survey_median(api00),
    .by = stype
  )
## # A tibble: 3 × 5
##   stype api00_mean api00_mean_se api00_median api00_median_se
##   <fct>      <dbl>         <dbl>        <dbl>           <dbl>
## 1 E           674.          12.5          671            20.7
## 2 H           626.          15.5          635            21.6
## 3 M           637.          16.6          648            24.1

As mentioned above, {srvyr} functions are meant for tbl_svy objects. Attempting to manipulate data on non-tbl_svy objects, like the towny example shown below, results in an error. Running the code lets us know what the issue is: Survey context not set.

towny %>%
  summarize(area_mean = survey_mean(land_area_km2))
## Error in `summarize()`:
## ℹ In argument: `area_mean = survey_mean(land_area_km2)`.
## Caused by error in `cur_svy()`:
## ! Survey context not set

A few functions in {srvyr} have counterparts in {dplyr}, such as srvyr::summarize() and srvyr::group_by(). Unlike {srvyr}-specific verbs, {srvyr} recognizes these parallel functions if applied to a non-survey object. Instead of causing an error, the package provides the equivalent output from {dplyr}:

towny %>%
  srvyr::summarize(area_mean = mean(land_area_km2))
## # A tibble: 1 × 1
##   area_mean
##       <dbl>
## 1      373.

Because this book focuses on survey analysis, most of our pipes stem from a survey object. When we load the {dplyr} and {srvyr} packages, the functions automatically figure out the class of data and use the appropriate one from {dplyr} or {srvyr}. Therefore, we do not need to include the namespace for each function (e.g., srvyr::summarize()).

References

Bache, Stefan Milton, and Hadley Wickham. 2022. magrittr: A Forward-Pipe Operator for R. https://magrittr.tidyverse.org.
Csárdi, Gábor, and Jim Hester. 2024. pak: Another Approach to Package Installation. https://pak.r-lib.org/.
DeBell, Matthew. 2010. “How to Analyze ANES Survey Data.” ANES Technical Report Series nes012492. Palo Alto, CA: Stanford University; Ann Arbor, MI: the University of Michigan; https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf.
Freedman Ellis, Greg, and Ben Schneider. 2024. srvyr: ’dplyr’-Like Syntax for Summary Statistics of Survey Data. http://gdfe.co/srvyr/.
Henry, Lionel, and Hadley Wickham. 2024. tidyselect: Select from a Set of Strings. https://tidyselect.r-lib.org.
Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, JooYoung Seo, Ken Brevoort, and Olivier Roy. 2024. gt: Easily Create Presentation-Ready Display Tables. https://github.com/rstudio/gt.
Lumley, Thomas. 2010. Complex Surveys: A Guide to Analysis Using R. John Wiley & Sons.
Recht, Hannah. 2024. censusapi: Retrieve Data from the Census APIs. https://github.com/hrecht/censusapi.
Robinson, David, Alex Hayes, and Simon Couch. 2023. broom: Convert Statistical Objects into Tidy Tibbles. https://broom.tidymodels.org/.
Sjoberg, Daniel D., Karissa Whiting, Michael Curry, Jessica A. Lavery, and Joseph Larmarange. 2021. “Reproducible Summary Tables with the gtsummary Package.” The R Journal 13: 570–80. https://doi.org/10.32614/RJ-2021-053.
———. 2023b. 2020 Residential Energy Consumption Survey: Household Characteristics Technical Documentation Summary.” https://www.eia.gov/consumption/residential/data/2020/pdf/2020%20RECS_Methodology%20Report.pdf.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org.
Zimmer, Stephanie, Rebecca Powell, and Isabella Velásquez. 2024. srvyrexploR: Data Supplement for Exploring Complex Survey Data Analysis Using R. https://github.com/tidy-survey-r/srvyrexploR.

  1. Note: {broom} is already included in the tidyverse, so no separate installation is required.↩︎

  2. In the United States, presidential elections are held in years divisible by four. In other even years, there are elections at the federal level for Congress, which are referred to as midterm elections as they occur at the middle of the term of a president.↩︎