Getting Started

Warm Up

Welcome to the Towny Data

We’ll be using the towny data for warm-up exercises
Available in the {gt} package

Welcome to the Towny Data

library(tidyverse)
library(gt)
glimpse(towny)

Rows: 414
Columns: 25
$ name                     <chr> "Addington Highlands", "Adelaide Metcalfe"…
$ website                  <chr> "https://addingtonhighlands.ca", "https://…
$ status                   <chr> "lower-tier", "lower-tier", "lower-tier", …
$ csd_type                 <chr> "township", "township", "township", "towns…
$ census_div               <chr> "Lennox and Addington", "Middlesex", "Simc…
$ latitude                 <dbl> 45.00000, 42.95000, 44.13333, 45.52917, 43…
$ longitude                <dbl> -77.25000, -81.70000, -79.93333, -76.89694…
$ land_area_km2            <dbl> 1293.99, 331.11, 371.53, 519.59, 66.64, 11…
$ population_1996          <int> 2429, 3128, 9359, 2837, 64430, 1027, 8315,…
$ population_2001          <int> 2402, 3149, 10082, 2824, 73753, 956, 8593,…
$ population_2006          <int> 2512, 3135, 10695, 2716, 90167, 958, 8654,…
$ population_2011          <int> 2517, 3028, 10603, 2844, 109600, 864, 9196…
$ population_2016          <int> 2318, 2990, 10975, 2935, 119677, 969, 9680…
$ population_2021          <int> 2534, 3011, 10989, 2995, 126666, 954, 9949…
$ density_1996             <dbl> 1.88, 9.45, 25.19, 5.46, 966.84, 8.81, 21.…
$ density_2001             <dbl> 1.86, 9.51, 27.14, 5.44, 1106.74, 8.20, 21…
$ density_2006             <dbl> 1.94, 9.47, 28.79, 5.23, 1353.05, 8.22, 22…
$ density_2011             <dbl> 1.95, 9.14, 28.54, 5.47, 1644.66, 7.41, 23…
$ density_2016             <dbl> 1.79, 9.03, 29.54, 5.65, 1795.87, 8.31, 24…
$ density_2021             <dbl> 1.96, 9.09, 29.58, 5.76, 1900.75, 8.18, 25…
$ pop_change_1996_2001_pct <dbl> -0.0111, 0.0067, 0.0773, -0.0046, 0.1447, …
$ pop_change_2001_2006_pct <dbl> 0.0458, -0.0044, 0.0608, -0.0382, 0.2226, …
$ pop_change_2006_2011_pct <dbl> 0.0020, -0.0341, -0.0086, 0.0471, 0.2155, …
$ pop_change_2011_2016_pct <dbl> -0.0791, -0.0125, 0.0351, 0.0320, 0.0919, …
$ pop_change_2016_2021_pct <dbl> 0.0932, 0.0070, 0.0013, 0.0204, 0.0584, -0…

Posit Cloud

We will be using Posit Cloud for exercises
Posit Cloud has the same features and appearance as RStudio for ease of use
Access the project and files Posit Cloud
Navigate to “Complex Survey Analysis in R” under Projects then click name to get started

Your Turn

Go to bit.ly/user-survey-tutorial
Open 02-getting-started_exercises.qmd

10:00

Packages

R Packages for Survey Analysis

{survey} package first on CRAN in 2003
- descriptive analysis
- statistical testing
- modeling
- weighting
{srvyr} package first on CRAN in 2016
- “wrapper” for {survey} with {tidyverse}-style syntax
- only descriptive analysis

Install Packages

Install packages for data wrangling and survey analysis:

install.packages(c("tidyverse", "survey", "srvyr"))

Install {srvyrexploR} to access the survey data for today’s workshop:

install.packages("pak")
pak::pak("tidy-survey-r/srvyrexploR")

Load the packages:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)

Data

Today’s Data

American National Election Studies (ANES – DeBell 2010)

Stored as anes_2020 in {srvyrexploR}

Residential Energy Consumption Survey (RECS – U.S. Energy Information Administration 2023b)

Stored as recs_2020 in {srvyrexploR}

American National Election Studies (ANES) 2020

Pre- and post-election surveys fielded almost every 2 years since 1948
Topics include voter registration status, candidate preference, opinions on country and government, party and ideology affiliation, opinions on policy, news sources, and more
Collaboration of Stanford, University of Michigan - funding by the National Science Foundation
Target Population: US citizens, 18 and older living in US
Mode: Web, videoconference, or telephone
Sample Information: Pseudo-strata and pseudo-cluster included for variance estimation

https://electionstudies.org/

American National Election Studies (ANES) 2020

anes_2020 %>%
  select(-matches("^V\\d")) %>%
  glimpse()

Rows: 7,453
Columns: 21
$ CaseID                  <dbl> 200015, 200022, 200039, 200046, 200053, 200…
$ InterviewMode           <fct> Web, Web, Web, Web, Web, Web, Web, Web, Web…
$ Weight                  <dbl> 1.0057375, 1.1634731, 0.7686811, 0.5210195,…
$ VarUnit                 <fct> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2…
$ Stratum                 <fct> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7…
$ CampaignInterest        <fct> Somewhat interested, Not much interested, S…
$ EarlyVote2020           <fct> NA, NA, NA, NA, NA, NA, NA, NA, Yes, NA, NA…
$ VotedPres2016           <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, Yes, …
$ VotedPres2016_selection <fct> Trump, Other, Clinton, Clinton, Trump, NA, …
$ PartyID                 <fct> Strong republican, Independent, Independent…
$ TrustGovernment         <fct> Never, Never, Some of the time, About half …
$ TrustPeople             <fct> About half the time, Some of the time, Some…
$ Age                     <dbl> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43, 37,…
$ AgeGroup                <fct> 40-49, 30-39, 40-49, 40-49, 70 or older, 70…
$ Education               <fct> Bachelor's, Post HS, High school, Post HS, …
$ RaceEth                 <fct> "Hispanic", "Asian, NH/PI", "White", "Asian…
$ Gender                  <fct> Male, Female, Female, Male, Male, Female, F…
$ Income                  <fct> "$175,000-249,999", "$70,000-74,999", "$100…
$ Income7                 <fct> $125k or more, $60k to < 80k, $100k to < 12…
$ VotedPres2020           <fct> NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, Yes, …
$ VotedPres2020_selection <fct> NA, Other, Biden, Biden, Trump, Biden, Trum…

Residential Energy Consumption Survey (RECS) 2020

Fielded 14 times between 1950 and 2020
Topics include appliances, electronics, heating, a/c, temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance
Energy consumption/expenditures collected through energy suppliers
Funded by the Energy Information Administration
Target Population: Primary occupied housing units in the US
Mode: In-person, phone, paper, and web interview mode
Sample Information: Jackknife Replicate weights included for variance estimation

https://www.eia.gov/consumption/residential/index.php

Residential Energy Consumption Survey (RECS) 2020

recs_2020 %>%
  select(-matches("^NWEIGHT")) %>%
  glimpse()

Rows: 18,496
Columns: 39
$ DOEID            <dbl> 100001, 100002, 100003, 100004, 100005, 100006, 10…
$ ClimateRegion_BA <fct> Mixed-Dry, Mixed-Humid, Mixed-Dry, Mixed-Humid, Mi…
$ Urbanicity       <fct> Urban Area, Urban Area, Urban Area, Urban Area, Ur…
$ Region           <fct> West, South, West, South, Northeast, South, South,…
$ REGIONC          <chr> "WEST", "SOUTH", "WEST", "SOUTH", "NORTHEAST", "SO…
$ Division         <fct> Mountain South, West South Central, Mountain South…
$ STATE_FIPS       <chr> "35", "05", "35", "45", "34", "48", "40", "28", "1…
$ state_postal     <fct> NM, AR, NM, SC, NJ, TX, OK, MS, DC, AZ, CA, TX, LA…
$ state_name       <fct> New Mexico, Arkansas, New Mexico, South Carolina, …
$ HDD65            <dbl> 3844, 3766, 3819, 2614, 4219, 901, 3148, 1825, 423…
$ CDD65            <dbl> 1679, 1458, 1696, 1718, 1363, 3558, 2128, 2374, 11…
$ HDD30YR          <dbl> 4451, 4429, 4500, 3229, 4896, 1150, 3564, 2660, 44…
$ CDD30YR          <dbl> 1027, 1305, 1010, 1653, 1059, 3588, 2043, 2164, 14…
$ HousingUnitType  <fct> Single-family detached, Apartment: 5 or more units…
$ YearMade         <ord> 1970-1979, 1980-1989, 1960-1969, 1980-1989, 1960-1…
$ TOTSQFT_EN       <dbl> 2100, 590, 900, 2100, 800, 4520, 2100, 900, 750, 7…
$ TOTHSQFT         <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 900, 750, 7…
$ TOTCSQFT         <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 0, 500, 760…
$ SpaceHeatingUsed <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
$ ACUsed           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, T…
$ HeatingBehavior  <fct> Set one temp and leave it, Turn on or off as neede…
$ WinterTempDay    <dbl> 70, 70, 69, 68, 68, 76, 74, 70, 68, 70, 72, 74, 74…
$ WinterTempAway   <dbl> 70, 65, 68, 68, 68, 76, 65, 70, 60, 70, 70, 74, 74…
$ WinterTempNight  <dbl> 68, 65, 67, 68, 68, 68, 74, 68, 62, 68, 72, 74, 74…
$ ACBehavior       <fct> Set one temp and leave it, Turn on or off as neede…
$ SummerTempDay    <dbl> 71, 68, 70, 72, 72, 69, 68, NA, 72, 74, 77, 77, 74…
$ SummerTempAway   <dbl> 71, 68, 68, 72, 72, 74, 70, NA, 76, 74, 77, 77, 74…
$ SummerTempNight  <dbl> 71, 68, 68, 72, 72, 68, 70, NA, 68, 72, 77, 77, 74…
$ BTUEL            <dbl> 42723.28, 17889.29, 8146.63, 31646.53, 20027.42, 4…
$ DOLLAREL         <dbl> 1955.06, 713.27, 334.51, 1424.86, 1087.00, 1895.66…
$ BTUNG            <dbl> 101924.43, 10145.32, 22603.08, 55118.66, 39099.51,…
$ DOLLARNG         <dbl> 701.8300, 261.7348, 188.1400, 636.9100, 376.0400, …
$ BTULP            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
$ DOLLARLP         <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
$ BTUFO            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
$ DOLLARFO         <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
$ BTUWOOD          <dbl> 0, 0, 0, 0, 0, 3000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TOTALBTU         <dbl> 144647.71, 28034.61, 30749.71, 86765.19, 59126.93,…
$ TOTALDOL         <dbl> 2656.8900, 975.0048, 522.6500, 2061.7700, 1463.040…

Design Objects

Backbone for survey analysis
Specify the sampling design, weights, and other necessary information to account for errors in the data
Survey analysis and inference should be performed with the survey design objects, not the original survey data
We will be covering survey design in more detail later in the workshop!

American National Election Studies Design Object

anes_des <- anes_2020 %>%
  mutate(Weight = V200010b / sum(V200010b) * 231034125) %>%
  as_survey_design(
    weights = Weight,
    strata = V200010d,
    ids = V200010c,
    nest = TRUE
  )

American National Election Studies Design Object

anes_des

Stratified 1 - level Cluster Sampling design (with replacement)
With (101) clusters.
Called via srvyr
Sampling variables:
  - ids: V200010c 
  - strata: V200010d 
  - weights: Weight 
Data variables: 
  - V200001 (dbl), CaseID (dbl), V200002 (hvn_lbll), InterviewMode (fct),
    V200010b (dbl), Weight (dbl), V200010c (dbl), VarUnit (fct), V200010d
    (dbl), Stratum (fct), V201006 (hvn_lbll), CampaignInterest (fct),
    V201023 (hvn_lbll), EarlyVote2020 (fct), V201024 (hvn_lbll), V201025x
    (hvn_lbll), V201028 (hvn_lbll), V201029 (hvn_lbll), V201101 (hvn_lbll),
    V201102 (hvn_lbll), VotedPres2016 (fct), V201103 (hvn_lbll),
    VotedPres2016_selection (fct), V201228 (hvn_lbll), V201229 (hvn_lbll),
    V201230 (hvn_lbll), V201231x (hvn_lbll), PartyID (fct), V201233
    (hvn_lbll), TrustGovernment (fct), V201237 (hvn_lbll), TrustPeople
    (fct), V201507x (hvn_lbll), Age (dbl), AgeGroup (fct), V201510
    (hvn_lbll), Education (fct), V201546 (hvn_lbll), V201547a (hvn_lbll),
    V201547b (hvn_lbll), V201547c (hvn_lbll), V201547d (hvn_lbll), V201547e
    (hvn_lbll), V201547z (hvn_lbll), V201549x (hvn_lbll), RaceEth (fct),
    V201600 (hvn_lbll), Gender (fct), V201607 (hvn_lbll), V201610
    (hvn_lbll), V201611 (hvn_lbll), V201613 (hvn_lbll), V201615 (hvn_lbll),
    V201616 (hvn_lbll), V201617x (hvn_lbll), Income (fct), Income7 (fct),
    V202051 (hvn_lbll), V202066 (hvn_lbll), V202072 (hvn_lbll),
    VotedPres2020 (fct), V202073 (hvn_lbll), V202109x (hvn_lbll), V202110x
    (hvn_lbll), VotedPres2020_selection (fct)

Residential Energy Consumption Survey Design Object

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )

Residential Energy Consumption Survey Design Object

recs_des

Call: Called via srvyr
Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
Sampling variables:
  - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
    NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
    NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 + NWEIGHT17 +
    NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 + NWEIGHT22 + NWEIGHT23 +
    NWEIGHT24 + NWEIGHT25 + NWEIGHT26 + NWEIGHT27 + NWEIGHT28 + NWEIGHT29 +
    NWEIGHT30 + NWEIGHT31 + NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 +
    NWEIGHT36 + NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
    NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 + NWEIGHT47 +
    NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 + NWEIGHT52 + NWEIGHT53 +
    NWEIGHT54 + NWEIGHT55 + NWEIGHT56 + NWEIGHT57 + NWEIGHT58 + NWEIGHT59 +
    NWEIGHT60` 
  - weights: NWEIGHT 
Data variables: 
  - DOEID (dbl), ClimateRegion_BA (fct), Urbanicity (fct), Region (fct),
    REGIONC (chr), Division (fct), STATE_FIPS (chr), state_postal (fct),
    state_name (fct), HDD65 (dbl), CDD65 (dbl), HDD30YR (dbl), CDD30YR
    (dbl), HousingUnitType (fct), YearMade (ord), TOTSQFT_EN (dbl), TOTHSQFT
    (dbl), TOTCSQFT (dbl), SpaceHeatingUsed (lgl), ACUsed (lgl),
    HeatingBehavior (fct), WinterTempDay (dbl), WinterTempAway (dbl),
    WinterTempNight (dbl), ACBehavior (fct), SummerTempDay (dbl),
    SummerTempAway (dbl), SummerTempNight (dbl), NWEIGHT (dbl), NWEIGHT1
    (dbl), NWEIGHT2 (dbl), NWEIGHT3 (dbl), NWEIGHT4 (dbl), NWEIGHT5 (dbl),
    NWEIGHT6 (dbl), NWEIGHT7 (dbl), NWEIGHT8 (dbl), NWEIGHT9 (dbl),
    NWEIGHT10 (dbl), NWEIGHT11 (dbl), NWEIGHT12 (dbl), NWEIGHT13 (dbl),
    NWEIGHT14 (dbl), NWEIGHT15 (dbl), NWEIGHT16 (dbl), NWEIGHT17 (dbl),
    NWEIGHT18 (dbl), NWEIGHT19 (dbl), NWEIGHT20 (dbl), NWEIGHT21 (dbl),
    NWEIGHT22 (dbl), NWEIGHT23 (dbl), NWEIGHT24 (dbl), NWEIGHT25 (dbl),
    NWEIGHT26 (dbl), NWEIGHT27 (dbl), NWEIGHT28 (dbl), NWEIGHT29 (dbl),
    NWEIGHT30 (dbl), NWEIGHT31 (dbl), NWEIGHT32 (dbl), NWEIGHT33 (dbl),
    NWEIGHT34 (dbl), NWEIGHT35 (dbl), NWEIGHT36 (dbl), NWEIGHT37 (dbl),
    NWEIGHT38 (dbl), NWEIGHT39 (dbl), NWEIGHT40 (dbl), NWEIGHT41 (dbl),
    NWEIGHT42 (dbl), NWEIGHT43 (dbl), NWEIGHT44 (dbl), NWEIGHT45 (dbl),
    NWEIGHT46 (dbl), NWEIGHT47 (dbl), NWEIGHT48 (dbl), NWEIGHT49 (dbl),
    NWEIGHT50 (dbl), NWEIGHT51 (dbl), NWEIGHT52 (dbl), NWEIGHT53 (dbl),
    NWEIGHT54 (dbl), NWEIGHT55 (dbl), NWEIGHT56 (dbl), NWEIGHT57 (dbl),
    NWEIGHT58 (dbl), NWEIGHT59 (dbl), NWEIGHT60 (dbl), BTUEL (dbl), DOLLAREL
    (dbl), BTUNG (dbl), DOLLARNG (dbl), BTULP (dbl), DOLLARLP (dbl), BTUFO
    (dbl), DOLLARFO (dbl), BTUWOOD (dbl), TOTALBTU (dbl), TOTALDOL (dbl)

Comparison of {dplyr} and {srvyr} Functions

Why use {srvyr} over {survey}?

Use the pipe function %>% or |>
Ability to use tidy selection
Functions follow snake_case convention

Apply {dplyr} functions to survey objects

A few functions in {srvyr} have counterparts in {dplyr}
- dplyr::summarize() and srvyr::summarize()
- dplyr::group_by() and srvyr::group_by()
Based on the object class, R will recognize which package to use

Using `summarize()` to calculate mean and median

Tibble:

towny %>%
  summarize(
    area_mean = mean(land_area_km2)
  )

# A tibble: 1 × 1
  area_mean
      <dbl>
1      373.

Using `summarize()` to calculate mean and median

Survey object:

recs_des %>%
  summarize(
    TOTHSQFT_mean = survey_mean(TOTHSQFT),
  )

# A tibble: 1 × 2
  TOTHSQFT_mean TOTHSQFT_mean_se
          <dbl>            <dbl>
1         1614.             7.19

Using tidyselect

Tibble:

towny %>%
  summarize(across(
    starts_with("population"),
    ~ mean(.x, na.rm = TRUE)
  ))

# A tibble: 1 × 6
  population_1996 population_2001 population_2006 population_2011
            <dbl>           <dbl>           <dbl>           <dbl>
1          25866.          27538.          29173.          30838.
# ℹ 2 more variables: population_2016 <dbl>, population_2021 <dbl>

Using tidyselect

Survey object:

recs_des %>%
  summarize(across(
    starts_with("TOT"),
    ~ survey_mean(.x, na.rm = TRUE)
  ))

# A tibble: 1 × 10
  TOTSQFT_EN TOTSQFT_EN_se TOTHSQFT TOTHSQFT_se TOTCSQFT TOTCSQFT_se TOTALBTU
       <dbl>         <dbl>    <dbl>       <dbl>    <dbl>       <dbl>    <dbl>
1      1819.          6.62    1614.        7.19    1335.        6.72   76753.
# ℹ 3 more variables: TOTALBTU_se <dbl>, TOTALDOL <dbl>, TOTALDOL_se <dbl>

Using {dplyr} functions

Tibble:

towny %>%
  filter(csd_type == "township") %>%
  select(csd_type, name)

# A tibble: 195 × 2
   csd_type name                  
   <chr>    <chr>                 
 1 township Addington Highlands   
 2 township Adelaide Metcalfe     
 3 township Adjala-Tosorontio     
 4 township Admaston/Bromley      
 5 township Alberton              
 6 township Alfred and Plantagenet
 7 township Algonquin Highlands   
 8 township Alnwick/Haldimand     
 9 township Amaranth              
10 township The Archipelago       
# ℹ 185 more rows

Using {dplyr} functions

Survey object:

recs_des_mod <- recs_des %>%
  filter(Urbanicity == "Rural") %>%
  select(ACUsed)

recs_des_mod

Call: Called via srvyr
Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
Sampling variables:
  - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
    NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
    NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 + NWEIGHT17 +
    NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 + NWEIGHT22 + NWEIGHT23 +
    NWEIGHT24 + NWEIGHT25 + NWEIGHT26 + NWEIGHT27 + NWEIGHT28 + NWEIGHT29 +
    NWEIGHT30 + NWEIGHT31 + NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 +
    NWEIGHT36 + NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
    NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 + NWEIGHT47 +
    NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 + NWEIGHT52 + NWEIGHT53 +
    NWEIGHT54 + NWEIGHT55 + NWEIGHT56 + NWEIGHT57 + NWEIGHT58 + NWEIGHT59 +
    NWEIGHT60` 
  - weights: NWEIGHT 
Data variables: 
  - ACUsed (lgl)

Grouping

Tibble:

towny %>%
  group_by(csd_type) %>%
  summarize(area_mean = mean(land_area_km2))

# A tibble: 5 × 2
  csd_type     area_mean
  <chr>            <dbl>
1 city             498. 
2 municipality     607. 
3 town             183. 
4 township         363. 
5 village           23.0

Grouping

Survey object:

recs_des %>%
  group_by(Urbanicity) %>%
  summarize(TOTHSQFT_mean = survey_mean(TOTHSQFT))

# A tibble: 3 × 3
  Urbanicity    TOTHSQFT_mean TOTHSQFT_mean_se
  <fct>                 <dbl>            <dbl>
1 Urban Area            1532.             9.24
2 Urban Cluster         1537.            21.5 
3 Rural                 1966.            17.7

Using {srvyr} functions on non-survey objects

towny %>%
  summarize(area_mean = survey_mean(land_area_km2))

Error in `summarize()`:
ℹ In argument: `area_mean = survey_mean(land_area_km2)`.
Caused by error in `cur_svy()`:
! Survey context not set

Survey Analysis Process

Overview of Survey Analysis using {srvyr} Package

Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep()
Subset data (if needed) using filter() (to create subpopulations)
Specify domains of analysis using group_by()
Within summarize(), specify variables to calculate, including means, totals, proportions, quantiles, and more

Next Up: Descriptive Statistics

Extra Content

Wrap Up

We will be using {tidyverse}, {survey}, and {srvyr} for survey analysis
Our data is from ANES and RECS

Wrap Up

(Survey) design objects are the backbone for survey analysis
They are objects with special class tbl_svy
They specify the sampling design, weights, and other necessary information to account for errors in the data

Wrap Up

{srvyr} wraps {survey} so that we can use friendly, “tidy” syntax
Based on the object type, R will know whether to use {dplyr} or {srvyr}
All functions must be done on the survey design object, not the raw data

Ex. 1

How many different types of CSD are there in the dataset?

towny %>%
  count(csd_type)

towny %>%
  group_by(csd_type) %>%
  summarize(n = n(), .groups = "drop")

# A tibble: 5 × 2
  csd_type         n
  <chr>        <int>
1 city            52
2 municipality    68
3 town            88
4 township       195
5 village         11

Ex. 2

How many different types of CSD and status are there in the dataset?

towny %>%
  count(csd_type, status)

towny %>%
  group_by(csd_type, status) %>%
  summarize(n = n(), .groups = "drop")

# A tibble: 10 × 3
   csd_type     status          n
   <chr>        <chr>       <int>
 1 city         lower-tier     20
 2 city         single-tier    32
 3 municipality lower-tier     41
 4 municipality single-tier    27
 5 town         lower-tier     61
 6 town         single-tier    27
 7 township     lower-tier    113
 8 township     single-tier    82
 9 village      lower-tier      6
10 village      single-tier     5

Ex. 3

What is the proportion of each type of CSD?

towny %>%
  count(csd_type) %>%
  mutate(p = n / sum(n))

# A tibble: 5 × 3
  csd_type         n      p
  <chr>        <int>  <dbl>
1 city            52 0.126 
2 municipality    68 0.164 
3 town            88 0.213 
4 township       195 0.471 
5 village         11 0.0266

Ex. 4

What is the proportion of each status within type of CSD?

towny %>%
  count(csd_type, status) %>%
  group_by(csd_type) %>%
  mutate(p = n / sum(n))

# A tibble: 10 × 4
# Groups:   csd_type [5]
   csd_type     status          n     p
   <chr>        <chr>       <int> <dbl>
 1 city         lower-tier     20 0.385
 2 city         single-tier    32 0.615
 3 municipality lower-tier     41 0.603
 4 municipality single-tier    27 0.397
 5 town         lower-tier     61 0.693
 6 town         single-tier    27 0.307
 7 township     lower-tier    113 0.579
 8 township     single-tier    82 0.421
 9 village      lower-tier      6 0.545
10 village      single-tier     5 0.455

Ex. 5

What is the mean population of all of the municipalities in 2021?

towny %>%
  summarize(mean_pop = mean(population_2021, na.rm = TRUE))

# A tibble: 1 × 1
  mean_pop
     <dbl>
1   34142.

Ex. 6

What is the mean population by CSD in 2021?

towny %>%
  group_by(csd_type) %>%
  summarize(mean_pop = mean(population_2021, na.rm = TRUE))

# A tibble: 5 × 2
  csd_type     mean_pop
  <chr>           <dbl>
1 city          200005.
2 municipality   10102.
3 town           22579.
4 township        5367.
5 village         1276.

Ex. 7

What is the mean population of all of the municipalities in 1996, 2001, 2006, 2011, 2016, and 2021? Try to use the across function.

towny %>%
  summarize(across(starts_with("population"), ~ mean(.x, na.rm = TRUE)))

# A tibble: 1 × 6
  population_1996 population_2001 population_2006 population_2011
            <dbl>           <dbl>           <dbl>           <dbl>
1          25866.          27538.          29173.          30838.
  population_2016 population_2021
            <dbl>           <dbl>
1          32264.          34142.

Ex. 8

Run a simple t-test to see if the average population in 1996 is different from the average population in 2016.

t.test(towny$population_1996, towny$population_2016, paired = TRUE)


    Paired t-test

data:  towny$population_1996 and towny$population_2016
t = -4.3204, df = 410, p-value = 1.956e-05
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -9593.244 -3593.423
sample estimates:
mean difference 
      -6593.333

Getting Started

Warm Up

Welcome to the Towny Data

Welcome to the Towny Data

Posit Cloud

Your Turn

Packages

R Packages for Survey Analysis

Install Packages

Data

Today’s Data

American National Election Studies (ANES) 2020

American National Election Studies (ANES) 2020

Residential Energy Consumption Survey (RECS) 2020

Residential Energy Consumption Survey (RECS) 2020

Design Objects

Design Objects

American National Election Studies Design Object

American National Election Studies Design Object

Residential Energy Consumption Survey Design Object

Residential Energy Consumption Survey Design Object

Comparison of {dplyr} and {srvyr} Functions

Why use {srvyr} over {survey}?

Apply {dplyr} functions to survey objects

Using summarize() to calculate mean and median

Tibble:

Using summarize() to calculate mean and median

Survey object:

Using tidyselect

Tibble:

Using tidyselect

Survey object:

Using {dplyr} functions

Tibble:

Using {dplyr} functions

Survey object:

Grouping

Tibble:

Grouping

Survey object:

Using {srvyr} functions on non-survey objects

Survey Analysis Process

Overview of Survey Analysis using {srvyr} Package

Next Up: Descriptive Statistics

Extra Content

Wrap Up

Wrap Up

Wrap Up

Ex. 1

Ex. 2

Ex. 3

Ex. 4

Ex. 5

Ex. 6

Ex. 7

Ex. 8

Using `summarize()` to calculate mean and median

Using `summarize()` to calculate mean and median