
Chapter 4 Getting started
4.1 Introduction
This chapter provides an overview of the packages, data, and design objects we use frequently throughout this book. As mentioned in Chapter 2, understanding how a survey was conducted helps us make sense of the results and interpret findings. Therefore, we provide background on the datasets used in examples and exercises. Next, we walk through how to create the survey design objects necessary to begin an analysis. Finally, we provide an overview of the {srvyr} package and the steps needed for analysis. Please report any bugs and issues encountered while going through the book to the book’s GitHub repository.
4.2 Setup
This section provides details on the required packages and data, as well as the steps for preparing survey design objects. For a streamlined learning experience, we recommend taking the time to walk through the code provided here and making sure everything is properly set up.
4.2.1 Packages
We use several packages throughout the book, but let’s install and load specific ones for this chapter. Many functions in the examples and exercises are from three packages: {tidyverse}, {survey}, and {srvyr} (Wickham et al. 2019; Lumley 2010; Freedman Ellis and Schneider 2024). The packages can be installed from the Comprehensive R Archive Network (CRAN) using the code below:
We bundled the datasets used in the book in an R package, {srvyrexploR} (Zimmer, Powell, and Velásquez 2024). To install it from GitHub, use the {pak} package (Csárdi and Hester 2024):
After installing these packages, load them using the library()
function:
The packages {broom}, {gt}, and {gtsummary} play a role in displaying output and creating formatted tables (Iannone et al. 2024; Robinson, Hayes, and Couch 2023; Sjoberg et al. 2021). Install them with the provided code2:
After installing these packages, load them using the library()
function:
Install and load the {censusapi} package to access the Current Population Survey (CPS), which we use to ensure accurate weighting of a key dataset in the book (Recht 2024). Run the code below to install {censusapi}:
After installing this package, load it using the library()
function:
Note that the {censusapi} package requires a Census API key, available for free from the U.S. Census Bureau website (refer to the package documentation for more information). We recommend storing the Census API key in the R environment instead of directly in the code. To do this, run the Sys.setenv()
script below, substituting the API key where it says YOUR_API_KEY_HERE
.
Then, restart the R session. Once the Census API key is stored, we can retrieve it in our R code with Sys.getenv("CENSUS_KEY")
.
There are a few other packages used in the book in limited frequency. We list them in the Prerequisite boxes at the beginning of each chapter. As we work through the book, make sure to check the Prerequisite box and install any missing packages before proceeding.
4.2.2 Data
The {srvyrexploR} package contains the datasets used in the book. Once installed and loaded, explore the documentation using the help()
function. Read the descriptions of the datasets to understand what they contain:
This book uses two main datasets: the American National Election Studies (ANES – DeBell 2010) and the Residential Energy Consumption Survey (RECS – U.S. Energy Information Administration 2023b), which are included as anes_2020
and recs_2020
in the {srvyrexploR} package, respectively.
American National Election Studies Data
American National Election Studies (ANES) collect data from election surveys dating back to 1948. These surveys contain information on public opinion and voting behavior in U.S. presidential elections and some midterm elections3. They cover topics such as party affiliation, voting choice, and level of trust in the government. The 2020 survey (data used in this book) was fielded online, through live video interviews, or via computer-assisted telephone interviews (CATI).
When working with new survey data, we should review the survey documentation (see Chapter 3) to understand the data collection methods. The original ANES data contains variables starting with V20
(DeBell 2010), so to assist with our analysis throughout the book, we created descriptive variable names. For example, the respondent’s age is now in a variable called Age
, and gender is in a variable called Gender
. These descriptive variables are included in the {srvyrexploR} package. A complete overview of all variables can be found in Appendix B.
Before beginning an analysis, it is useful to view the data to understand the available variables. The dplyr::glimpse()
function produces a list of all variables, their types (e.g., function, double), and a few example values. Below, we remove variables containing a “V” followed by numbers with select(-matches("^V\\d"))
before using glimpse()
to get a quick overview of the data with descriptive variable names:
## Rows: 7,453
## Columns: 21
## $ CaseID <dbl> 200015, 200022, 200039, 200046, 200053…
## $ InterviewMode <fct> Web, Web, Web, Web, Web, Web, Web, Web…
## $ Weight <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658…
## $ VarUnit <fct> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2,…
## $ Stratum <fct> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, …
## $ CampaignInterest <fct> Somewhat interested, Not much interest…
## $ EarlyVote2020 <fct> NA, NA, NA, NA, NA, NA, NA, NA, Yes, N…
## $ VotedPres2016 <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, …
## $ VotedPres2016_selection <fct> Trump, Other, Clinton, Clinton, Trump,…
## $ PartyID <fct> Strong republican, Independent, Indepe…
## $ TrustGovernment <fct> Never, Never, Some of the time, About …
## $ TrustPeople <fct> About half the time, Some of the time,…
## $ Age <dbl> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43…
## $ AgeGroup <fct> 40-49, 30-39, 40-49, 40-49, 70 or olde…
## $ Education <fct> Bachelor's, Post HS, High school, Post…
## $ RaceEth <fct> "Hispanic", "Asian, NH/PI", "White", "…
## $ Gender <fct> Male, Female, Female, Male, Male, Fema…
## $ Income <fct> "$175,000-249,999", "$70,000-74,999", …
## $ Income7 <fct> $125k or more, $60k to < 80k, $100k to…
## $ VotedPres2020 <fct> NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, …
## $ VotedPres2020_selection <fct> NA, Other, Biden, Biden, Trump, Biden,…
From the output, we can see there are 7,453 rows and 21 variables in the ANES data. This output also indicates that most of the variables are factors (e.g., InterviewMode
), while a few variables are in double (numeric) format (e.g., Age
).
Residential Energy Consumption Survey Data
Residential Energy Consumption Survey (RECS) is a study that measures energy consumption and expenditure in American households. Funded by the Energy Information Administration, RECS data are collected through interviews with household members and energy suppliers. These interviews take place in person, over the phone, via mail, and on the web, with modes changing over time. The survey has been fielded 14 times between 1950 and 2020. It includes questions about appliances, electronics, heating, air conditioning (A/C), temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance.
We should read the survey documentation (see Chapter 3) to understand how the data were collected and implemented. An overview of all variables can be found in Appendix C.
Before starting an analysis, we recommend viewing the data to understand the types of data and variables that are included. The dplyr::glimpse()
function produces a list of all variables, the type of the variable (e.g., function, double), and a few example values. Below, we remove the weight variables with select(-matches("^NWEIGHT"))
before using glimpse()
to get a quick overview of the data:
## Rows: 18,496
## Columns: 39
## $ DOEID <dbl> 1e+05, 1e+05, 1e+05, 1e+05, 1e+05, 1e+05, 1e+…
## $ ClimateRegion_BA <fct> Mixed-Dry, Mixed-Humid, Mixed-Dry, Mixed-Humi…
## $ Urbanicity <fct> Urban Area, Urban Area, Urban Area, Urban Are…
## $ Region <fct> West, South, West, South, Northeast, South, S…
## $ REGIONC <chr> "WEST", "SOUTH", "WEST", "SOUTH", "NORTHEAST"…
## $ Division <fct> Mountain South, West South Central, Mountain …
## $ STATE_FIPS <chr> "35", "05", "35", "45", "34", "48", "40", "28…
## $ state_postal <fct> NM, AR, NM, SC, NJ, TX, OK, MS, DC, AZ, CA, T…
## $ state_name <fct> New Mexico, Arkansas, New Mexico, South Carol…
## $ HDD65 <dbl> 3844, 3766, 3819, 2614, 4219, 901, 3148, 1825…
## $ CDD65 <dbl> 1679, 1458, 1696, 1718, 1363, 3558, 2128, 237…
## $ HDD30YR <dbl> 4451, 4429, 4500, 3229, 4896, 1150, 3564, 266…
## $ CDD30YR <dbl> 1027, 1305, 1010, 1653, 1059, 3588, 2043, 216…
## $ HousingUnitType <fct> Single-family detached, Apartment: 5 or more …
## $ YearMade <ord> 1970-1979, 1980-1989, 1960-1969, 1980-1989, 1…
## $ TOTSQFT_EN <dbl> 2100, 590, 900, 2100, 800, 4520, 2100, 900, 7…
## $ TOTHSQFT <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 900, 7…
## $ TOTCSQFT <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 0, 500…
## $ SpaceHeatingUsed <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ ACUsed <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FAL…
## $ HeatingBehavior <fct> Set one temp and leave it, Turn on or off as …
## $ WinterTempDay <dbl> 70, 70, 69, 68, 68, 76, 74, 70, 68, 70, 72, 7…
## $ WinterTempAway <dbl> 70, 65, 68, 68, 68, 76, 65, 70, 60, 70, 70, 7…
## $ WinterTempNight <dbl> 68, 65, 67, 68, 68, 68, 74, 68, 62, 68, 72, 7…
## $ ACBehavior <fct> Set one temp and leave it, Turn on or off as …
## $ SummerTempDay <dbl> 71, 68, 70, 72, 72, 69, 68, NA, 72, 74, 77, 7…
## $ SummerTempAway <dbl> 71, 68, 68, 72, 72, 74, 70, NA, 76, 74, 77, 7…
## $ SummerTempNight <dbl> 71, 68, 68, 72, 72, 68, 70, NA, 68, 72, 77, 7…
## $ BTUEL <dbl> 42723, 17889, 8147, 31647, 20027, 48968, 4940…
## $ DOLLAREL <dbl> 1955.06, 713.27, 334.51, 1424.86, 1087.00, 18…
## $ BTUNG <dbl> 101924.4, 10145.3, 22603.1, 55118.7, 39099.5,…
## $ DOLLARNG <dbl> 701.83, 261.73, 188.14, 636.91, 376.04, 439.4…
## $ BTULP <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 178…
## $ DOLLARLP <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ BTUFO <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 681…
## $ DOLLARFO <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 187…
## $ BTUWOOD <dbl> 0, 0, 0, 0, 0, 3000, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TOTALBTU <dbl> 144648, 28035, 30750, 86765, 59127, 85401, 13…
## $ TOTALDOL <dbl> 2656.9, 975.0, 522.6, 2061.8, 1463.0, 2335.1,…
From the output, we can see that the RECS data has 18,496 rows and 39 non-weight variables. This output also indicates that most of the variables are in double (numeric) format (e.g., TOTSQFT_EN
), with some factor (e.g., Region
), Boolean (e.g., ACUsed
), character (e.g., REGIONC
), and ordinal (e.g., YearMade
) variables.
4.2.3 Design objects
The design object is the backbone for survey analysis. It is where we specify the sampling design, weights, and other necessary information to ensure we account for errors in the data. Before creating the design object, we should carefully review the survey documentation to understand how to create the design object for accurate analysis.
In this section, we provide details on how to code the design object for the ANES and RECS data used in the book. However, we only provide a high-level overview to get readers started. For a deeper understanding of creating design objects for a variety of sampling designs, see Chapter 10.
While we recommend conducting exploratory data analysis on the original data before diving into complex survey analysis (see Chapter 12), the actual survey analysis and inference should be performed with the survey design objects instead of the original survey data. For example, the ANES data is called anes_2020
. If we create a survey design object called anes_des
, our survey analyses should begin with anes_des
and not anes_2020
. Using the survey design object ensures that our calculations appropriately account for the details of the survey design.
American National Election Studies Design Object
The ANES documentation (DeBell 2010) details the sampling and weighting implications for analyzing the survey data. From this documentation and as noted in Chapter 3, the 2020 ANES data are weighted to the sample, not the population. To make generalizations about the population, we need to weigh the data against the full population count. The ANES methodology recommends using the Current Population Survey (CPS) to determine the number of non-institutional U.S. citizens aged 18 or older living in the 50 U.S. states or D.C. in March 2020.
We can use the {censusapi} package to obtain the information needed for the survey design object. The getCensus()
function allows us to retrieve the CPS data for March (cps/basic/mar
) in 2020 (vintage = 2020
). Additionally, we extract several variables from the CPS:
- month (
HRMONTH
) and year (HRYEAR4
) of the interview: to confirm the correct time period - age (
PRTAGE
) of the respondent: to narrow the population to 18 and older (eligible age to vote) - citizenship status (
PRCITSHP
) of the respondent: to narrow the population to only those eligible to vote - final person-level weight (
PWSSWGT
)
Detailed information for these variables can be found in the CPS data dictionary.
cps_state_in <- getCensus(
name = "cps/basic/mar",
vintage = 2020,
region = "state",
vars = c(
"HRMONTH", "HRYEAR4",
"PRTAGE", "PRCITSHP", "PWSSWGT"
),
key = Sys.getenv("CENSUS_KEY")
)
cps_state <- cps_state_in %>%
as_tibble() %>%
mutate(across(
.cols = everything(),
.fns = as.numeric
))
In the code above, we include region = "state"
. The default region type for the CPS data is at the state level. While not required, including the region can be helpful for understanding the geographical context of the data.
In getCensus()
, we filtered the dataset by specifying the month (HRMONTH == 3
) and year (HRYEAR4 == 2020
) of our request. Therefore, we expect that all interviews within our output were conducted during that particular month and year. We can confirm that the data are from March 2020 by running the code below:
## # A tibble: 1 × 2
## HRMONTH HRYEAR4
## <dbl> <dbl>
## 1 3 2020
We can narrow down the dataset using the age and citizenship variables to include only individuals who are 18 years or older (PRTAGE >= 18
) and have U.S. citizenship (PRCITSHIP %in% c(1:4)
):
To calculate the U.S. population from the filtered data, we sum the person weights (PWSSWGT
):
## [1] "231,034,125"
The population of interest in 2020 is 231,034,125. This result gives us what we need to create the survey design object for estimating population statistics. Using the anes_2020
data, we adjust the weighting variable (V200010b
) using the population of interest we just calculated (targetpop
). We determine the proportion of the total weight for each individual weight (V200010b / sum(V200010b)
) and then multiply that proportion by the calculated population of interest.
Once we have the adjusted weights, we can refer to the rest of the documentation to create the survey design. The documentation indicates that the study uses a stratified cluster sampling design. Therefore, we need to specify variables for strata
and ids
(cluster) and fill in the nest
argument. The documentation provides guidance on which strata and cluster variables to use depending on whether we are analyzing pre- or post-election data. In this book, we analyze post-election data, so we need to use the post-election weight V200010b
, strata variable V200010d
, and Primary Sampling Unit (PSU)/cluster variable V200010c
. Additionally, we set nest=TRUE
to ensure the clusters are nested within the strata.
anes_des <- anes_adjwgt %>%
as_survey_design(
weights = Weight,
strata = V200010d,
ids = V200010c,
nest = TRUE
)
anes_des
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (101) clusters.
## Called via srvyr
## Sampling variables:
## - ids: V200010c
## - strata: V200010d
## - weights: Weight
## Data variables:
## - V200001 (dbl), CaseID (dbl), V200002 (dbl+lbl), InterviewMode
## (fct), V200010b (dbl), Weight (dbl), V200010c (dbl), VarUnit (fct),
## V200010d (dbl), Stratum (fct), V201006 (dbl+lbl), CampaignInterest
## (fct), V201023 (dbl+lbl), EarlyVote2020 (fct), V201024 (dbl+lbl),
## V201025x (dbl+lbl), V201028 (dbl+lbl), V201029 (dbl+lbl), V201101
## (dbl+lbl), V201102 (dbl+lbl), VotedPres2016 (fct), V201103
## (dbl+lbl), VotedPres2016_selection (fct), V201228 (dbl+lbl),
## V201229 (dbl+lbl), V201230 (dbl+lbl), V201231x (dbl+lbl), PartyID
## (fct), V201233 (dbl+lbl), TrustGovernment (fct), V201237 (dbl+lbl),
## TrustPeople (fct), V201507x (dbl+lbl), Age (dbl), AgeGroup (fct),
## V201510 (dbl+lbl), Education (fct), V201546 (dbl+lbl), V201547a
## (dbl+lbl), V201547b (dbl+lbl), V201547c (dbl+lbl), V201547d
## (dbl+lbl), V201547e (dbl+lbl), V201547z (dbl+lbl), V201549x
## (dbl+lbl), RaceEth (fct), V201600 (dbl+lbl), Gender (fct), V201607
## (dbl+lbl), V201610 (dbl+lbl), V201611 (dbl+lbl), V201613 (dbl+lbl),
## V201615 (dbl+lbl), V201616 (dbl+lbl), V201617x (dbl+lbl), Income
## (fct), Income7 (fct), V202051 (dbl+lbl), V202066 (dbl+lbl), V202072
## (dbl+lbl), VotedPres2020 (fct), V202073 (dbl+lbl), V202109x
## (dbl+lbl), V202110x (dbl+lbl), VotedPres2020_selection (fct)
We can examine this new object to learn more about the survey design, such that the ANES is a “Stratified 1 - level Cluster Sampling design (with replacement) With (101) clusters.” Additionally, the output displays the sampling variables and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis.
Residential Energy Consumption Survey Design Object
The RECS documentation (U.S. Energy Information Administration 2023b) provides information on the survey’s sampling and weighting implications for analysis. The documentation shows the 2020 RECS uses Jackknife weights, where the main analytic weight is NWEIGHT
, and the Jackknife weights are NWEIGHT1
-NWEIGHT60
. We can specify these in the weights
and repweights
arguments in the survey design object code, respectively.
With Jackknife weights, additional information is required: type
, scale
, and mse
. Chapter 10 discusses in depth each of these arguments; but to quickly get started, the RECS documentation lets us know that type=JK1
, scale=59/60
, and mse = TRUE
. We can use the following code to create the survey design object:
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
type = "JK1",
scale = 59 / 60,
mse = TRUE
)
recs_des
## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
## Sampling variables:
## - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
## NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
## NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 +
## NWEIGHT17 + NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 +
## NWEIGHT22 + NWEIGHT23 + NWEIGHT24 + NWEIGHT25 + NWEIGHT26 +
## NWEIGHT27 + NWEIGHT28 + NWEIGHT29 + NWEIGHT30 + NWEIGHT31 +
## NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 + NWEIGHT36 +
## NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
## NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 +
## NWEIGHT47 + NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 +
## NWEIGHT52 + NWEIGHT53 + NWEIGHT54 + NWEIGHT55 + NWEIGHT56 +
## NWEIGHT57 + NWEIGHT58 + NWEIGHT59 + NWEIGHT60`
## - weights: NWEIGHT
## Data variables:
## - DOEID (dbl), ClimateRegion_BA (fct), Urbanicity (fct), Region
## (fct), REGIONC (chr), Division (fct), STATE_FIPS (chr),
## state_postal (fct), state_name (fct), HDD65 (dbl), CDD65 (dbl),
## HDD30YR (dbl), CDD30YR (dbl), HousingUnitType (fct), YearMade
## (ord), TOTSQFT_EN (dbl), TOTHSQFT (dbl), TOTCSQFT (dbl),
## SpaceHeatingUsed (lgl), ACUsed (lgl), HeatingBehavior (fct),
## WinterTempDay (dbl), WinterTempAway (dbl), WinterTempNight (dbl),
## ACBehavior (fct), SummerTempDay (dbl), SummerTempAway (dbl),
## SummerTempNight (dbl), NWEIGHT (dbl), NWEIGHT1 (dbl), NWEIGHT2
## (dbl), NWEIGHT3 (dbl), NWEIGHT4 (dbl), NWEIGHT5 (dbl), NWEIGHT6
## (dbl), NWEIGHT7 (dbl), NWEIGHT8 (dbl), NWEIGHT9 (dbl), NWEIGHT10
## (dbl), NWEIGHT11 (dbl), NWEIGHT12 (dbl), NWEIGHT13 (dbl), NWEIGHT14
## (dbl), NWEIGHT15 (dbl), NWEIGHT16 (dbl), NWEIGHT17 (dbl), NWEIGHT18
## (dbl), NWEIGHT19 (dbl), NWEIGHT20 (dbl), NWEIGHT21 (dbl), NWEIGHT22
## (dbl), NWEIGHT23 (dbl), NWEIGHT24 (dbl), NWEIGHT25 (dbl), NWEIGHT26
## (dbl), NWEIGHT27 (dbl), NWEIGHT28 (dbl), NWEIGHT29 (dbl), NWEIGHT30
## (dbl), NWEIGHT31 (dbl), NWEIGHT32 (dbl), NWEIGHT33 (dbl), NWEIGHT34
## (dbl), NWEIGHT35 (dbl), NWEIGHT36 (dbl), NWEIGHT37 (dbl), NWEIGHT38
## (dbl), NWEIGHT39 (dbl), NWEIGHT40 (dbl), NWEIGHT41 (dbl), NWEIGHT42
## (dbl), NWEIGHT43 (dbl), NWEIGHT44 (dbl), NWEIGHT45 (dbl), NWEIGHT46
## (dbl), NWEIGHT47 (dbl), NWEIGHT48 (dbl), NWEIGHT49 (dbl), NWEIGHT50
## (dbl), NWEIGHT51 (dbl), NWEIGHT52 (dbl), NWEIGHT53 (dbl), NWEIGHT54
## (dbl), NWEIGHT55 (dbl), NWEIGHT56 (dbl), NWEIGHT57 (dbl), NWEIGHT58
## (dbl), NWEIGHT59 (dbl), NWEIGHT60 (dbl), BTUEL (dbl), DOLLAREL
## (dbl), BTUNG (dbl), DOLLARNG (dbl), BTULP (dbl), DOLLARLP (dbl),
## BTUFO (dbl), DOLLARFO (dbl), BTUWOOD (dbl), TOTALBTU (dbl),
## TOTALDOL (dbl)
Viewing this new object provides information about the survey design, such that RECS is an “Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.” Additionally, the output shows the sampling variables (NWEIGHT1
-NWEIGHT60
) and then lists the remaining variables in the dataset. This design object is used throughout this book to conduct survey analysis.
4.3 Survey analysis process
There is a general process for analyzing data to create estimates with {srvyr} package:
Create a
tbl_svy
object (a survey object) using:as_survey_design()
oras_survey_rep()
Subset data (if needed) using
filter()
(to create subpopulations)Specify domains of analysis using
group_by()
Within
summarize()
, specify variables to calculate, including means, totals, proportions, quantiles, and more
In Section 4.2.3, we follow Step 1 to create the survey design objects for the ANES and RECS data featured in this book. Additional details on how to create design objects can be found in Chapter 10. Then, once we have the design object, we can filter the data to any subpopulation of interest (if needed). It is important to filter the data after creating the design object. This ensures that we are accurately accounting for the survey design in our calculations. Finally, we can use group_by()
, summarize()
, and other functions from the {survey} and {srvyr} packages to analyze the survey data by estimating means, totals, and so on.
4.4 Similarities between {dplyr} and {srvyr} functions
The {dplyr} package from the tidyverse offers flexible and intuitive functions for data wrangling (Wickham et al. 2023). One of the major advantages of using {srvyr} is that it applies {dplyr}-like syntax to the {survey} package (Freedman Ellis and Schneider 2024). We can use pipes, such as %>%
from the {magrittr} package, to specify a survey design object, apply a function, and then feed that output into the next function’s first argument (Bache and Wickham 2022). Functions follow the ‘tidy’ convention of snake_case function names.
To help explain the similarities between {dplyr} functions and {srvyr} functions, we use the towny
dataset from the {gt} package and apistrat
data that comes in the {survey} package. The towny
dataset provides population data for municipalities in Ontario, Canada on census years between 1996 and 2021. Taking a look at towny
with dplyr::glimpse()
, we can see the dataset has 25 columns with a mix of character and numeric data.
## Rows: 414
## Columns: 25
## $ name <chr> "Addington Highlands", "Adelaide Metc…
## $ website <chr> "https://addingtonhighlands.ca", "htt…
## $ status <chr> "lower-tier", "lower-tier", "lower-ti…
## $ csd_type <chr> "township", "township", "township", "…
## $ census_div <chr> "Lennox and Addington", "Middlesex", …
## $ latitude <dbl> 45.00, 42.95, 44.13, 45.53, 43.86, 48…
## $ longitude <dbl> -77.25, -81.70, -79.93, -76.90, -79.0…
## $ land_area_km2 <dbl> 1293.99, 331.11, 371.53, 519.59, 66.6…
## $ population_1996 <int> 2429, 3128, 9359, 2837, 64430, 1027, …
## $ population_2001 <int> 2402, 3149, 10082, 2824, 73753, 956, …
## $ population_2006 <int> 2512, 3135, 10695, 2716, 90167, 958, …
## $ population_2011 <int> 2517, 3028, 10603, 2844, 109600, 864,…
## $ population_2016 <int> 2318, 2990, 10975, 2935, 119677, 969,…
## $ population_2021 <int> 2534, 3011, 10989, 2995, 126666, 954,…
## $ density_1996 <dbl> 1.88, 9.45, 25.19, 5.46, 966.84, 8.81…
## $ density_2001 <dbl> 1.86, 9.51, 27.14, 5.44, 1106.74, 8.2…
## $ density_2006 <dbl> 1.94, 9.47, 28.79, 5.23, 1353.05, 8.2…
## $ density_2011 <dbl> 1.95, 9.14, 28.54, 5.47, 1644.66, 7.4…
## $ density_2016 <dbl> 1.79, 9.03, 29.54, 5.65, 1795.87, 8.3…
## $ density_2021 <dbl> 1.96, 9.09, 29.58, 5.76, 1900.75, 8.1…
## $ pop_change_1996_2001_pct <dbl> -0.0111, 0.0067, 0.0773, -0.0046, 0.1…
## $ pop_change_2001_2006_pct <dbl> 0.0458, -0.0044, 0.0608, -0.0382, 0.2…
## $ pop_change_2006_2011_pct <dbl> 0.0020, -0.0341, -0.0086, 0.0471, 0.2…
## $ pop_change_2011_2016_pct <dbl> -0.0791, -0.0125, 0.0351, 0.0320, 0.0…
## $ pop_change_2016_2021_pct <dbl> 0.0932, 0.0070, 0.0013, 0.0204, 0.058…
Let’s examine the towny
object’s class. We verify that it is a tibble, as indicated by "tbl_df"
, by running the code below:
## [1] "tbl_df" "tbl" "data.frame"
All tibbles are data.frames, but not all data.frames are tibbles. Compared to data.frames, tibbles have some advantages, with the printing behavior being a noticeable advantage. When working with tidyverse style code, we recommend making all your datasets tibbles for ease of analysis.
The {survey} package contains datasets related to the California Academic Performance Index, which measures student performance in schools with at least 100 students in California. We can access these datasets by loading the {survey} package and running data(api)
.
Let’s work with the apistrat
dataset, which is a stratified random sample, stratified by school type (stype
) with three levels: E
for elementary school, M
for middle school, and H
for high school. We first create the survey design object (see Chapter 10 for more information). The sample is stratified by the stype
variable and the sampling weights are found in the pw
variable. We can use this information to construct the design object, apistrat_des
.
When we check the class of apistrat_des
, it is not a typical data.frame
. Applying the as_survey_design()
function transforms the data into a tbl_svy
, a special class specifically for survey design objects. The {srvyr} package is designed to work with the tbl_svy
class of objects.
## [1] "tbl_svy" "survey.design2" "survey.design"
Let’s look at how {dplyr} works with regular data frames. The example below calculates the mean and median for the land_area_km2
variable in the towny
dataset.
## # A tibble: 1 × 2
## area_mean area_median
## <dbl> <dbl>
## 1 373. 273.
In the code below, we calculate the mean and median of the variable api00
using apistrat_des
. Note the similarity in the syntax. However, the standard error of the statistic is also calculated in addition to the statistic itself.
## # A tibble: 1 × 4
## api00_mean api00_mean_se api00_med api00_med_se
## <dbl> <dbl> <dbl> <dbl>
## 1 662. 9.54 668 13.7
The functions in {srvyr} also play nicely with other tidyverse functions. For example, if we wanted to select columns with shared characteristics, we can use {tidyselect} functions such as starts_with()
, num_range()
, etc. (Henry and Wickham 2024). In the examples below, we use a combination of across()
and starts_with()
to calculate the mean of variables starting with “population” in the towny
data frame and those beginning with api
in the apistrat_des
survey object.
## # A tibble: 1 × 6
## population_1996 population_2001 population_2006 population_2011
## <dbl> <dbl> <dbl> <dbl>
## 1 25866. 27538. 29173. 30838.
## # ℹ 2 more variables: population_2016 <dbl>, population_2021 <dbl>
## # A tibble: 1 × 6
## api00 api00_se api99 api99_se api.stu api.stu_se
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 662. 9.54 629. 10.1 498. 16.4
We have the flexibility to use {dplyr} verbs such as mutate()
, filter()
, and select()
on our survey design object. As mentioned in Section 4.3, these steps should be performed on the survey design object. This ensures our survey design is properly considered in all our calculations.
apistrat_des_mod <- apistrat_des %>%
mutate(api_diff = api00 - api99) %>%
filter(stype == "E") %>%
select(stype, api99, api00, api_diff, api_students = api.stu)
apistrat_des_mod
## Stratified Independent Sampling design (with replacement)
## Called via srvyr
## Sampling variables:
## - ids: `1`
## - strata: stype
## - weights: pw
## Data variables:
## - stype (fct), api99 (int), api00 (int), api_diff (int), api_students
## (int)
## Stratified Independent Sampling design (with replacement)
## Called via srvyr
## Sampling variables:
## - ids: `1`
## - strata: stype
## - weights: pw
## Data variables:
## - cds (chr), stype (fct), name (chr), sname (chr), snum (dbl), dname
## (chr), dnum (int), cname (chr), cnum (int), flag (int), pcttest
## (int), api00 (int), api99 (int), target (int), growth (int),
## sch.wide (fct), comp.imp (fct), both (fct), awards (fct), meals
## (int), ell (int), yr.rnd (fct), mobility (int), acs.k3 (int),
## acs.46 (int), acs.core (int), pct.resp (int), not.hsg (int), hsg
## (int), some.col (int), col.grad (int), grad.sch (int), avg.ed
## (dbl), full (int), emer (int), enroll (int), api.stu (int), pw
## (dbl), fpc (dbl)
Several functions in {srvyr} must be called within srvyr::summarize()
, with the exception of srvyr::survey_count()
and srvyr::survey_tally()
. This is similar to how dplyr::count()
and dplyr::tally()
are not called within dplyr::summarize()
. The summarize()
function can be used in conjunction with the group_by()
function or by/.by
arguments, which applies the functions on a group-by-group basis to create grouped summaries.
towny %>%
group_by(csd_type) %>%
dplyr::summarize(
area_mean = mean(land_area_km2),
area_median = median(land_area_km2)
)
## # A tibble: 5 × 3
## csd_type area_mean area_median
## <chr> <dbl> <dbl>
## 1 city 498. 198.
## 2 municipality 607. 488.
## 3 town 183. 129.
## 4 township 363. 301.
## 5 village 23.0 3.3
We use a similar setup to summarize data in {srvyr}:
apistrat_des %>%
group_by(stype) %>%
summarize(
api00_mean = survey_mean(api00),
api00_median = survey_median(api00)
)
## # A tibble: 3 × 5
## stype api00_mean api00_mean_se api00_median api00_median_se
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 E 674. 12.5 671 20.7
## 2 H 626. 15.5 635 21.6
## 3 M 637. 16.6 648 24.1
An alternative way to do grouped analysis on the towny
data would be with the .by
argument:
towny %>%
dplyr::summarize(
area_mean = mean(land_area_km2),
area_median = median(land_area_km2),
.by = csd_type
)
## # A tibble: 5 × 3
## csd_type area_mean area_median
## <chr> <dbl> <dbl>
## 1 township 363. 301.
## 2 town 183. 129.
## 3 municipality 607. 488.
## 4 city 498. 198.
## 5 village 23.0 3.3
The .by
syntax is similarly implemented in {srvyr} for grouped analysis:
apistrat_des %>%
summarize(
api00_mean = survey_mean(api00),
api00_median = survey_median(api00),
.by = stype
)
## # A tibble: 3 × 5
## stype api00_mean api00_mean_se api00_median api00_median_se
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 E 674. 12.5 671 20.7
## 2 H 626. 15.5 635 21.6
## 3 M 637. 16.6 648 24.1
As mentioned above, {srvyr} functions are meant for tbl_svy
objects. Attempting to manipulate data on non-tbl_svy
objects, like the towny
example shown below, results in an error. Running the code lets us know what the issue is: Survey context not set
.
## Error in `summarize()`:
## ℹ In argument: `area_mean = survey_mean(land_area_km2)`.
## Caused by error in `cur_svy()`:
## ! Survey context not set
A few functions in {srvyr} have counterparts in {dplyr}, such as srvyr::summarize()
and srvyr::group_by()
. Unlike {srvyr}-specific verbs, {srvyr} recognizes these parallel functions if applied to a non-survey object. Instead of causing an error, the package provides the equivalent output from {dplyr}:
## # A tibble: 1 × 1
## area_mean
## <dbl>
## 1 373.
Because this book focuses on survey analysis, most of our pipes stem from a survey object. When we load the {dplyr} and {srvyr} packages, the functions automatically figure out the class of data and use the appropriate one from {dplyr} or {srvyr}. Therefore, we do not need to include the namespace for each function (e.g., srvyr::summarize()
).
References
Note: {broom} is already included in the tidyverse, so no separate installation is required.↩︎
In the United States, presidential elections are held in years divisible by four. In other even years, there are elections at the federal level for Congress, which are referred to as midterm elections as they occur at the middle of the term of a president.↩︎