Chapter 1 Introduction

Surveys are valuable tools for gathering information about a population. Researchers, governments, and businesses use surveys to better understand public opinion and behaviors. For example, a non-profit group may analyze societal trends to measure their impact, government agencies may study behaviors to inform policy, or companies may seek to learn customer product preferences to refine business strategy. With survey data, we can explore the world around us.

Surveys are often conducted with a sample of the population rather than the entire population. To generalize from the sample to the population, we use weights that adjust the survey results for unequal probabilities of selection, nonresponse, and post-stratification. These adjustments help ensure the sample accurately represents the population of interest (Gard et al. 2023). To account for the intricate nature of the survey design, analysts rely on statistical software such as SAS, Stata, SUDAAN, and R.
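
As a minimal sketch of why weighting matters, consider a hypothetical population of 1,000 people in which one group is sampled at a higher rate than the other. All numbers and names below are invented for illustration, and the example covers only the unequal-probability-of-selection adjustment, not nonresponse or post-stratification.

```r
# Hypothetical population: 800 people in group A, 200 in group B.
# We sample 40 people from each group, so group B is oversampled.
sample_df <- data.frame(
  group = rep(c("A", "B"), each = 40),
  y     = rep(c(10, 20), each = 40) # everyone in A reports 10, everyone in B reports 20
)

# Probability of selection for each sampled person: n_h / N_h
sample_df$prob <- ifelse(sample_df$group == "A", 40 / 800, 40 / 200)

# Base weight = inverse of the probability of selection
sample_df$wt <- 1 / sample_df$prob

# The unweighted mean treats both groups as equally common in the population
unweighted_mean <- mean(sample_df$y)

# The weighted mean lets each respondent stand in for 1 / prob people
weighted_mean <- weighted.mean(sample_df$y, sample_df$wt)

# True population mean, known here only because the population is made up
population_mean <- (800 * 10 + 200 * 20) / 1000
```

In this sketch the weights sum to the population size, and the weighted mean matches the population mean while the unweighted mean does not.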

In this book, we focus on R to introduce survey analysis. Our goal is to provide a comprehensive guide for individuals new to survey analysis but with some familiarity with statistics and R programming. We use a combination of the {survey} and {srvyr} packages and present the code following best practices from the tidyverse (Freedman Ellis and Schneider 2024; Lumley 2010; Wickham et al. 2019).

1.1 Survey analysis in R

The {survey} package was released on the Comprehensive R Archive Network (CRAN) in 2003 and has been continuously developed over time. This package, primarily authored by Thomas Lumley, offers an extensive array of features, including:

  • Calculation of point estimates and estimates of their uncertainty, including means, totals, ratios, quantiles, and proportions
  • Estimation of regression models, including generalized linear models, log-linear models, and survival curves
  • Variances by Taylor linearization or by replicate weights, including balanced repeated replication, jackknife, bootstrap, multistage bootstrap, or user-supplied methods
  • Hypothesis testing for means, proportions, and other parameters
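
The features listed above can be sketched with the `api` example data that ships with the {survey} package. This is an illustration under assumptions rather than the workflow used later in the book: the object names (`dclus1`, `rclus1`, and so on) are chosen here for the example, and the model and test are arbitrary.

```r
library(survey)

# Example data shipped with {survey}: California Academic Performance Index
data(api)

# One-stage cluster sample of school districts (dnum), with weights (pw)
# and a finite population correction (fpc)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Point estimates with standard errors (Taylor linearization)
mean_api00   <- svymean(~api00, design = dclus1)
total_enroll <- svytotal(~enroll, design = dclus1)

# Design-based linear regression
fit <- svyglm(api00 ~ ell + meals, design = dclus1)

# The same design converted to jackknife replicate weights
rclus1 <- as.svrepdesign(dclus1, type = "JK1")
mean_api00_jk <- svymean(~api00, design = rclus1)

# Design-based t-test of mean enrollment by comp.imp
tt <- svyttest(enroll ~ comp.imp, design = dclus1)
```

Printing `mean_api00` and `mean_api00_jk` shows the same point estimate with standard errors computed by the two variance methods.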

The {srvyr} package builds on the {survey} package by providing wrappers for functions that align with the tidyverse philosophy. This is our motivation for using and recommending the {srvyr} package. We find that it is user-friendly for those familiar with the tidyverse packages in R.

For example, while many functions in the {survey} package access variables through formulas, the {srvyr} package uses tidy selection to pass variable names, a common feature in the tidyverse (Henry and Wickham 2024). Users of the tidyverse are also likely familiar with the magrittr pipe operator (%>%), which seamlessly works with functions from the {srvyr} package. Moreover, several common functions from {dplyr}, such as filter(), mutate(), and summarize(), can be applied to survey objects (Wickham et al. 2023). This enables users to streamline their analysis workflow and leverage the benefits of both the {srvyr} and {tidyverse} packages.
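
To make the contrast concrete, the sketch below estimates the same mean both ways using the `api` example data from the {survey} package. The object names and the derived variable `api_change` are assumptions made for this illustration.

```r
library(survey)
library(dplyr)
library(srvyr)

data(api, package = "survey")

# {survey}: variables are passed as one-sided formulas
dclus1_survey <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
est_survey <- svymean(~api00, design = dclus1_survey)

# {srvyr}: variables are passed with tidy selection
dclus1_srvyr <- apiclus1 %>%
  as_survey_design(ids = dnum, weights = pw, fpc = fpc)

est_srvyr <- dclus1_srvyr %>%
  summarize(api00_mean = survey_mean(api00))

# dplyr verbs such as mutate(), group_by(), and summarize() work on the design object
by_stype <- dclus1_srvyr %>%
  mutate(api_change = api00 - api99) %>%
  group_by(stype) %>%
  summarize(change_mean = survey_mean(api_change, vartype = "ci"))
```

Both approaches rely on the same underlying calculations, so `est_survey` and `est_srvyr` hold the same estimate and standard error. The {srvyr} result is returned as a tibble, which makes it easy to pass on to other tidyverse functions.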

While the {srvyr} package offers many advantages, there is one notable limitation: it doesn’t fully incorporate the modeling capabilities of the {survey} package into tidy wrappers. When discussing modeling and hypothesis testing, we primarily rely on the {survey} package. However, we provide information on how to apply the pipe operator to these functions to maintain clarity and consistency in analyses.
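
One way to combine the pipe with {survey} modeling functions is sketched below, again using the `api` example data from the {survey} package. The model is arbitrary and the object names are assumptions for this illustration. Because the design is not the first argument of svyglm() or svyttest(), the magrittr placeholder (`.`) tells the pipe where to put it.

```r
library(survey)
library(srvyr)

data(api, package = "survey")

# Create the design object with {srvyr}
dclus1 <- apiclus1 %>%
  as_survey_design(ids = dnum, weights = pw, fpc = fpc)

# Pipe the design into a {survey} model; `.` marks where the design goes
fit <- dclus1 %>%
  svyglm(api00 ~ ell + meals, design = .)

# The same pattern works for hypothesis tests
tt <- dclus1 %>%
  svyttest(enroll ~ comp.imp, design = .)
```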

1.2 What to expect

This book covers many aspects of survey design and analysis, from understanding how to create design objects to conducting descriptive analysis, statistical tests, and models. We emphasize coding best practices and effective presentation techniques while using real-world data and practical examples to help readers gain proficiency in survey analysis.

Below is a summary of each chapter:

  • Chapter 2 - Overview of surveys:
    • Overview of survey design processes
    • References for more in-depth knowledge
  • Chapter 3 - Survey data documentation:
    • Guide to survey documentation types
    • How to read survey documentation
  • Chapter 4 - Getting started:
    • Installation of packages
    • Introduction to the {srvyrexploR} package and its analytic datasets
    • Outline of the survey analysis process
    • Comparison between the {dplyr} and {srvyr} packages
  • Chapter 5 - Descriptive analyses:
    • Calculation of point estimates
    • Estimation of standard errors and confidence intervals
    • Calculation of design effects
  • Chapter 6 - Statistical testing:
    • Statistical testing methods
    • Comparison of means and proportions
    • Goodness-of-fit tests, tests of independence, and tests of homogeneity
  • Chapter 7 - Modeling:
    • Overview of model formula specifications
    • Linear regression, ANOVA, and logistic regression modeling
  • Chapter 8 - Communication of results:
    • Strategies for communicating survey results
    • Tools and guidance for creating publishable tables and graphs
  • Chapter 9 - Reproducible research:
    • Tools and methods for achieving reproducibility
    • Resources for reproducible research
  • Chapter 10 - Sample designs and replicate weights:
    • Overview of common sampling designs
    • Replicate weight methods
    • How to specify survey designs in R
  • Chapter 11 - Missing data:
    • Overview of missing data in surveys
    • Approaches to dealing with missing data
  • Chapter 12 - Successful survey analysis recommendations:
    • Tips for successful analysis
    • Recommendations for debugging
  • Chapter 13 - National Crime Victimization Survey Vignette:
    • Vignette on analyzing National Crime Victimization Survey (NCVS) data
    • Illustration of analysis requiring multiple files for victimization rates
  • Chapter 14 - AmericasBarometer Vignette:
    • Vignette on analyzing AmericasBarometer survey data
    • Creation of choropleth maps with survey estimates

The majority of chapters contain code that readers can follow. Each of these chapters starts with a “Prerequisites” section, which includes the code needed to load the packages and datasets used in the chapter. We then provide the main idea of the chapter and examples of how to use the functions. Most chapters conclude with exercises to work through. We provide the solutions to the exercises in the online version of the book.

While we provide a brief overview of survey methodology and statistical theory, this book is not intended to be the sole resource for these topics. We reference other materials and encourage readers to seek them out for more information.

1.3 Prerequisites

To get the most out of this book, we assume a survey has already been conducted and readers have obtained a microdata file. Microdata, also known as respondent-level or row-level data, differ from summarized data typically found in tables. Microdata contain individual survey responses, along with analysis weights and design variables such as strata or clusters.

Additionally, the survey data should already include weights and design variables. These are required to accurately calculate unbiased estimates. The concepts and techniques discussed in this book help readers to extract meaningful insights from survey data, but this book does not cover how to create weights, as this is a separate complex topic. If weights are not already created for the survey data, we recommend reviewing other resources focused on weight creation such as Valliant and Dever (2018).

This book is tailored for analysts already familiar with R and the tidyverse, but who may be new to complex survey analysis in R. We anticipate that readers of this book can:

  • Install R and their Integrated Development Environment (IDE) of choice, such as RStudio
  • Install and load packages from CRAN and GitHub repositories
  • Run R code
  • Read data from a folder or their working directory
  • Understand fundamental tidyverse concepts such as tidy/long/wide data, tibbles, the magrittr pipe (%>%), and tidy selection
  • Use the tidyverse packages to wrangle, tidy, and visualize data
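
As a quick self-check of the installation skills listed above, the sketch below reports which of a few packages named in this chapter are not yet installed. The package vector is a subset chosen for illustration, and the install commands are left as comments so that nothing is installed without the reader choosing to do so. The GitHub install line assumes the {remotes} package is available.

```r
# A few CRAN packages named in this chapter (illustrative subset)
cran_pkgs <- c("tidyverse", "survey", "srvyr")

# Check which packages can be loaded without attaching them
is_installed <- vapply(cran_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- cran_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed packages are installed.")
}

# To install missing CRAN packages:
# install.packages(missing_pkgs)

# To install a package from GitHub (assumes {remotes} is installed):
# remotes::install_github("tidy-survey-r/srvyrexploR")
```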

If these concepts or skills are unfamiliar, we recommend starting with introductory resources to cover these topics before reading this book. R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023) is a beginner-friendly guide for getting started in data science using R. It offers guidance on preliminary installation steps, basic R syntax, and tidyverse workflows and packages.

1.4 Datasets used in this book

We work with two key datasets throughout the book: the Residential Energy Consumption Survey (RECS – U.S. Energy Information Administration 2023b) and the American National Election Studies (ANES – DeBell 2010). We introduce the loading and preparation of these datasets in Chapter 4.

1.5 Conventions

Throughout the book, we use the following typographical conventions:

  • Package names are surrounded by curly brackets: {srvyr}
  • Function names are in constant-width text format and include parentheses: survey_mean()
  • Object and variable names are in constant-width text format: anes_des

1.6 Getting help

We recommend first trying to resolve errors and issues independently using the tips provided in Chapter 12.

There are several community forums for asking questions, including:

Please report any bugs and issues to the book’s GitHub repository.

1.7 Acknowledgments

We would like to thank Holly Cast, Greg Freedman Ellis, Joe Murphy, and Sheila Saia for their reviews of the initial draft. Their detailed and honest feedback helped improve this book, and we are grateful for their input. Additionally, this book started with two short courses. The first was at the Annual Conference for the American Association for Public Opinion Research (AAPOR), and the second was a series of webinars for the Midwest Association of Public Opinion Research (MAPOR). We would also like to thank those who assisted us by moderating breakout rooms and answering questions from attendees: Greg Freedman Ellis, Raphael Nishimura, and Benjamin Schneider.

1.8 Colophon

This book was written in bookdown using RStudio. The complete source is available on GitHub.

This version of the book was built with R version 4.4.0 (2024-04-24) and with the packages listed in Table 1.1.

TABLE 1.1: Package versions and sources used in building this book
Package            Version      Source
DiagrammeR         1.0.11       CRAN
Matrix             1.7-0        CRAN
bookdown           0.39         CRAN
broom              1.0.5        CRAN
censusapi          0.9.0.9000   GitHub (hrecht/censusapi@74334d4)
dplyr              1.1.4        CRAN
forcats            1.0.0        CRAN
ggpattern          1.0.1        CRAN
ggplot2            3.5.1        CRAN
gt                 0.11.0.9000  GitHub (rstudio/gt@28de628)
gtsummary          1.7.2        CRAN
haven              2.5.4        CRAN
janitor            2.2.0        CRAN
kableExtra         1.4.0        CRAN
knitr              1.46         CRAN
labelled           2.13.0       CRAN
lubridate          1.9.3        CRAN
naniar             1.1.0        CRAN
osfr               0.2.9        CRAN
prettyunits        1.2.0        CRAN
purrr              1.0.2        CRAN
readr              2.1.5        CRAN
renv               1.0.7        CRAN
rmarkdown          2.26         CRAN
rnaturalearth      1.0.1        CRAN
rnaturalearthdata  1.0.0        CRAN
sf                 1.0-16       CRAN
srvyr              1.3.0        CRAN
srvyrexploR        1.0.1        GitHub (tidy-survey-r/srvyrexploR@cdf9316)
stringr            1.5.1        CRAN
styler             1.10.3       CRAN
survey             4.4-2        CRAN
survival           3.6-4        CRAN
tibble             3.2.1        CRAN
tidycensus         1.6.3        CRAN
tidyr              1.3.1        CRAN
tidyselect         1.2.1        CRAN
tidyverse          2.0.0        CRAN

References

DeBell, Matthew. 2010. “How to Analyze ANES Survey Data.” ANES Technical Report Series nes012492. Palo Alto, CA: Stanford University; Ann Arbor, MI: the University of Michigan. https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf.
Freedman Ellis, Greg, and Ben Schneider. 2024. srvyr: ’dplyr’-Like Syntax for Summary Statistics of Survey Data. http://gdfe.co/srvyr/.
Gard, Arianna M., Luke W. Hyde, Steven G. Heeringa, Brady T. West, and Colter Mitchell. 2023. “Why Weight? Analytic Approaches for Large-Scale Population Neuroscience Data.” Developmental Cognitive Neuroscience 59: 101196. https://doi.org/10.1016/j.dcn.2023.101196.
Henry, Lionel, and Hadley Wickham. 2024. tidyselect: Select from a Set of Strings. https://tidyselect.r-lib.org.
Lumley, Thomas. 2010. Complex Surveys: A Guide to Analysis Using R. John Wiley & Sons.
U.S. Energy Information Administration. 2023b. “2020 Residential Energy Consumption Survey: Household Characteristics Technical Documentation Summary.” https://www.eia.gov/consumption/residential/data/2020/pdf/2020%20RECS_Methodology%20Report.pdf.
Valliant, Richard, and Jill A. Dever. 2018. Survey Weights: A Step-by-Step Guide to Calculation. Stata Press.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz/.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org.