Getting Started

Warm Up

Welcome to the Towny Data

  • We’ll be using the towny data for warm-up exercises
  • Available in the {gt} package

Welcome to the Towny Data

library(tidyverse)
library(gt)
glimpse(towny)
Rows: 414
Columns: 25
$ name                     <chr> "Addington Highlands", "Adelaide Metcalfe"…
$ website                  <chr> "https://addingtonhighlands.ca", "https://…
$ status                   <chr> "lower-tier", "lower-tier", "lower-tier", …
$ csd_type                 <chr> "township", "township", "township", "towns…
$ census_div               <chr> "Lennox and Addington", "Middlesex", "Simc…
$ latitude                 <dbl> 45.00000, 42.95000, 44.13333, 45.52917, 43…
$ longitude                <dbl> -77.25000, -81.70000, -79.93333, -76.89694…
$ land_area_km2            <dbl> 1293.99, 331.11, 371.53, 519.59, 66.64, 11…
$ population_1996          <int> 2429, 3128, 9359, 2837, 64430, 1027, 8315,…
$ population_2001          <int> 2402, 3149, 10082, 2824, 73753, 956, 8593,…
$ population_2006          <int> 2512, 3135, 10695, 2716, 90167, 958, 8654,…
$ population_2011          <int> 2517, 3028, 10603, 2844, 109600, 864, 9196…
$ population_2016          <int> 2318, 2990, 10975, 2935, 119677, 969, 9680…
$ population_2021          <int> 2534, 3011, 10989, 2995, 126666, 954, 9949…
$ density_1996             <dbl> 1.88, 9.45, 25.19, 5.46, 966.84, 8.81, 21.…
$ density_2001             <dbl> 1.86, 9.51, 27.14, 5.44, 1106.74, 8.20, 21…
$ density_2006             <dbl> 1.94, 9.47, 28.79, 5.23, 1353.05, 8.22, 22…
$ density_2011             <dbl> 1.95, 9.14, 28.54, 5.47, 1644.66, 7.41, 23…
$ density_2016             <dbl> 1.79, 9.03, 29.54, 5.65, 1795.87, 8.31, 24…
$ density_2021             <dbl> 1.96, 9.09, 29.58, 5.76, 1900.75, 8.18, 25…
$ pop_change_1996_2001_pct <dbl> -0.0111, 0.0067, 0.0773, -0.0046, 0.1447, …
$ pop_change_2001_2006_pct <dbl> 0.0458, -0.0044, 0.0608, -0.0382, 0.2226, …
$ pop_change_2006_2011_pct <dbl> 0.0020, -0.0341, -0.0086, 0.0471, 0.2155, …
$ pop_change_2011_2016_pct <dbl> -0.0791, -0.0125, 0.0351, 0.0320, 0.0919, …
$ pop_change_2016_2021_pct <dbl> 0.0932, 0.0070, 0.0013, 0.0204, 0.0584, -0…

Posit Cloud

  • We will be using Posit Cloud for exercises
  • Posit Cloud has the same features and appearance as RStudio for ease of use
  • Access the project and files Posit Cloud
  • Navigate to “Complex Survey Analysis in R” under Projects then click name to get started

Your Turn

  • Go to bit.ly/user-survey-tutorial
  • Open 02-getting-started_exercises.qmd
10:00

Packages

R Packages for Survey Analysis

  • {survey} package first on CRAN in 2003
    • descriptive analysis
    • statistical testing
    • modeling
    • weighting
  • {srvyr} package first on CRAN in 2016
    • “wrapper” for {survey} with {tidyverse}-style syntax
    • only descriptive analysis

Install Packages

Install packages for data wrangling and survey analysis:

install.packages(c("tidyverse", "survey", "srvyr"))

Install {srvyrexploR} to access the survey data for today’s workshop:

install.packages("pak")
pak::pak("tidy-survey-r/srvyrexploR")

Load the packages:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)

Data

Today’s Data

American National Election Studies (ANES – DeBell 2010)

  • Stored as anes_2020 in {srvyrexploR}

Residential Energy Consumption Survey (RECS – U.S. Energy Information Administration 2023b)

  • Stored as recs_2020 in {srvyrexploR}

American National Election Studies (ANES) 2020

  • Pre- and post-election surveys fielded almost every 2 years since 1948
  • Topics include voter registration status, candidate preference, opinions on country and government, party and ideology affiliation, opinions on policy, news sources, and more
  • Collaboration of Stanford, University of Michigan - funding by the National Science Foundation
  • Target Population: US citizens, 18 and older living in US
  • Mode: Web, videoconference, or telephone
  • Sample Information: Pseudo-strata and pseudo-cluster included for variance estimation

https://electionstudies.org/

American National Election Studies (ANES) 2020

anes_2020 %>%
  select(-matches("^V\\d")) %>%
  glimpse()
Rows: 7,453
Columns: 21
$ CaseID                  <dbl> 200015, 200022, 200039, 200046, 200053, 200…
$ InterviewMode           <fct> Web, Web, Web, Web, Web, Web, Web, Web, Web…
$ Weight                  <dbl> 1.0057375, 1.1634731, 0.7686811, 0.5210195,…
$ VarUnit                 <fct> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2…
$ Stratum                 <fct> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7…
$ CampaignInterest        <fct> Somewhat interested, Not much interested, S…
$ EarlyVote2020           <fct> NA, NA, NA, NA, NA, NA, NA, NA, Yes, NA, NA…
$ VotedPres2016           <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, Yes, …
$ VotedPres2016_selection <fct> Trump, Other, Clinton, Clinton, Trump, NA, …
$ PartyID                 <fct> Strong republican, Independent, Independent…
$ TrustGovernment         <fct> Never, Never, Some of the time, About half …
$ TrustPeople             <fct> About half the time, Some of the time, Some…
$ Age                     <dbl> 46, 37, 40, 41, 72, 71, 37, 45, 70, 43, 37,…
$ AgeGroup                <fct> 40-49, 30-39, 40-49, 40-49, 70 or older, 70…
$ Education               <fct> Bachelor's, Post HS, High school, Post HS, …
$ RaceEth                 <fct> "Hispanic", "Asian, NH/PI", "White", "Asian…
$ Gender                  <fct> Male, Female, Female, Male, Male, Female, F…
$ Income                  <fct> "$175,000-249,999", "$70,000-74,999", "$100…
$ Income7                 <fct> $125k or more, $60k to < 80k, $100k to < 12…
$ VotedPres2020           <fct> NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, Yes, …
$ VotedPres2020_selection <fct> NA, Other, Biden, Biden, Trump, Biden, Trum…

Residential Energy Consumption Survey (RECS) 2020

  • Fielded 14 times between 1950 and 2020
  • Topics include appliances, electronics, heating, a/c, temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance
  • Energy consumption/expenditures collected through energy suppliers
  • Funded by the Energy Information Administration
  • Target Population: Primary occupied housing units in the US
  • Mode: In-person, phone, paper, and web interview mode
  • Sample Information: Jackknife Replicate weights included for variance estimation

https://www.eia.gov/consumption/residential/index.php

Residential Energy Consumption Survey (RECS) 2020

recs_2020 %>%
  select(-matches("^NWEIGHT")) %>%
  glimpse()
Rows: 18,496
Columns: 39
$ DOEID            <dbl> 100001, 100002, 100003, 100004, 100005, 100006, 10…
$ ClimateRegion_BA <fct> Mixed-Dry, Mixed-Humid, Mixed-Dry, Mixed-Humid, Mi…
$ Urbanicity       <fct> Urban Area, Urban Area, Urban Area, Urban Area, Ur…
$ Region           <fct> West, South, West, South, Northeast, South, South,…
$ REGIONC          <chr> "WEST", "SOUTH", "WEST", "SOUTH", "NORTHEAST", "SO…
$ Division         <fct> Mountain South, West South Central, Mountain South…
$ STATE_FIPS       <chr> "35", "05", "35", "45", "34", "48", "40", "28", "1…
$ state_postal     <fct> NM, AR, NM, SC, NJ, TX, OK, MS, DC, AZ, CA, TX, LA…
$ state_name       <fct> New Mexico, Arkansas, New Mexico, South Carolina, …
$ HDD65            <dbl> 3844, 3766, 3819, 2614, 4219, 901, 3148, 1825, 423…
$ CDD65            <dbl> 1679, 1458, 1696, 1718, 1363, 3558, 2128, 2374, 11…
$ HDD30YR          <dbl> 4451, 4429, 4500, 3229, 4896, 1150, 3564, 2660, 44…
$ CDD30YR          <dbl> 1027, 1305, 1010, 1653, 1059, 3588, 2043, 2164, 14…
$ HousingUnitType  <fct> Single-family detached, Apartment: 5 or more units…
$ YearMade         <ord> 1970-1979, 1980-1989, 1960-1969, 1980-1989, 1960-1…
$ TOTSQFT_EN       <dbl> 2100, 590, 900, 2100, 800, 4520, 2100, 900, 750, 7…
$ TOTHSQFT         <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 900, 750, 7…
$ TOTCSQFT         <dbl> 2100, 590, 900, 2100, 800, 3010, 1200, 0, 500, 760…
$ SpaceHeatingUsed <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
$ ACUsed           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, T…
$ HeatingBehavior  <fct> Set one temp and leave it, Turn on or off as neede…
$ WinterTempDay    <dbl> 70, 70, 69, 68, 68, 76, 74, 70, 68, 70, 72, 74, 74…
$ WinterTempAway   <dbl> 70, 65, 68, 68, 68, 76, 65, 70, 60, 70, 70, 74, 74…
$ WinterTempNight  <dbl> 68, 65, 67, 68, 68, 68, 74, 68, 62, 68, 72, 74, 74…
$ ACBehavior       <fct> Set one temp and leave it, Turn on or off as neede…
$ SummerTempDay    <dbl> 71, 68, 70, 72, 72, 69, 68, NA, 72, 74, 77, 77, 74…
$ SummerTempAway   <dbl> 71, 68, 68, 72, 72, 74, 70, NA, 76, 74, 77, 77, 74…
$ SummerTempNight  <dbl> 71, 68, 68, 72, 72, 68, 70, NA, 68, 72, 77, 77, 74…
$ BTUEL            <dbl> 42723.28, 17889.29, 8146.63, 31646.53, 20027.42, 4…
$ DOLLAREL         <dbl> 1955.06, 713.27, 334.51, 1424.86, 1087.00, 1895.66…
$ BTUNG            <dbl> 101924.43, 10145.32, 22603.08, 55118.66, 39099.51,…
$ DOLLARNG         <dbl> 701.8300, 261.7348, 188.1400, 636.9100, 376.0400, …
$ BTULP            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
$ DOLLARLP         <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
$ BTUFO            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
$ DOLLARFO         <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
$ BTUWOOD          <dbl> 0, 0, 0, 0, 0, 3000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TOTALBTU         <dbl> 144647.71, 28034.61, 30749.71, 86765.19, 59126.93,…
$ TOTALDOL         <dbl> 2656.8900, 975.0048, 522.6500, 2061.7700, 1463.040…

Design Objects

Design Objects

  • Backbone for survey analysis
  • Specify the sampling design, weights, and other necessary information to account for errors in the data
  • Survey analysis and inference should be performed with the survey design objects, not the original survey data
  • We will be covering survey design in more detail later in the workshop!

American National Election Studies Design Object

anes_des <- anes_2020 %>%
  mutate(Weight = V200010b / sum(V200010b) * 231034125) %>%
  as_survey_design(
    weights = Weight,
    strata = V200010d,
    ids = V200010c,
    nest = TRUE
  )

American National Election Studies Design Object

anes_des
Stratified 1 - level Cluster Sampling design (with replacement)
With (101) clusters.
Called via srvyr
Sampling variables:
  - ids: V200010c 
  - strata: V200010d 
  - weights: Weight 
Data variables: 
  - V200001 (dbl), CaseID (dbl), V200002 (hvn_lbll), InterviewMode (fct),
    V200010b (dbl), Weight (dbl), V200010c (dbl), VarUnit (fct), V200010d
    (dbl), Stratum (fct), V201006 (hvn_lbll), CampaignInterest (fct),
    V201023 (hvn_lbll), EarlyVote2020 (fct), V201024 (hvn_lbll), V201025x
    (hvn_lbll), V201028 (hvn_lbll), V201029 (hvn_lbll), V201101 (hvn_lbll),
    V201102 (hvn_lbll), VotedPres2016 (fct), V201103 (hvn_lbll),
    VotedPres2016_selection (fct), V201228 (hvn_lbll), V201229 (hvn_lbll),
    V201230 (hvn_lbll), V201231x (hvn_lbll), PartyID (fct), V201233
    (hvn_lbll), TrustGovernment (fct), V201237 (hvn_lbll), TrustPeople
    (fct), V201507x (hvn_lbll), Age (dbl), AgeGroup (fct), V201510
    (hvn_lbll), Education (fct), V201546 (hvn_lbll), V201547a (hvn_lbll),
    V201547b (hvn_lbll), V201547c (hvn_lbll), V201547d (hvn_lbll), V201547e
    (hvn_lbll), V201547z (hvn_lbll), V201549x (hvn_lbll), RaceEth (fct),
    V201600 (hvn_lbll), Gender (fct), V201607 (hvn_lbll), V201610
    (hvn_lbll), V201611 (hvn_lbll), V201613 (hvn_lbll), V201615 (hvn_lbll),
    V201616 (hvn_lbll), V201617x (hvn_lbll), Income (fct), Income7 (fct),
    V202051 (hvn_lbll), V202066 (hvn_lbll), V202072 (hvn_lbll),
    VotedPres2020 (fct), V202073 (hvn_lbll), V202109x (hvn_lbll), V202110x
    (hvn_lbll), VotedPres2020_selection (fct)

Residential Energy Consumption Survey Design Object

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )

Residential Energy Consumption Survey Design Object

recs_des
Call: Called via srvyr
Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
Sampling variables:
  - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
    NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
    NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 + NWEIGHT17 +
    NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 + NWEIGHT22 + NWEIGHT23 +
    NWEIGHT24 + NWEIGHT25 + NWEIGHT26 + NWEIGHT27 + NWEIGHT28 + NWEIGHT29 +
    NWEIGHT30 + NWEIGHT31 + NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 +
    NWEIGHT36 + NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
    NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 + NWEIGHT47 +
    NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 + NWEIGHT52 + NWEIGHT53 +
    NWEIGHT54 + NWEIGHT55 + NWEIGHT56 + NWEIGHT57 + NWEIGHT58 + NWEIGHT59 +
    NWEIGHT60` 
  - weights: NWEIGHT 
Data variables: 
  - DOEID (dbl), ClimateRegion_BA (fct), Urbanicity (fct), Region (fct),
    REGIONC (chr), Division (fct), STATE_FIPS (chr), state_postal (fct),
    state_name (fct), HDD65 (dbl), CDD65 (dbl), HDD30YR (dbl), CDD30YR
    (dbl), HousingUnitType (fct), YearMade (ord), TOTSQFT_EN (dbl), TOTHSQFT
    (dbl), TOTCSQFT (dbl), SpaceHeatingUsed (lgl), ACUsed (lgl),
    HeatingBehavior (fct), WinterTempDay (dbl), WinterTempAway (dbl),
    WinterTempNight (dbl), ACBehavior (fct), SummerTempDay (dbl),
    SummerTempAway (dbl), SummerTempNight (dbl), NWEIGHT (dbl), NWEIGHT1
    (dbl), NWEIGHT2 (dbl), NWEIGHT3 (dbl), NWEIGHT4 (dbl), NWEIGHT5 (dbl),
    NWEIGHT6 (dbl), NWEIGHT7 (dbl), NWEIGHT8 (dbl), NWEIGHT9 (dbl),
    NWEIGHT10 (dbl), NWEIGHT11 (dbl), NWEIGHT12 (dbl), NWEIGHT13 (dbl),
    NWEIGHT14 (dbl), NWEIGHT15 (dbl), NWEIGHT16 (dbl), NWEIGHT17 (dbl),
    NWEIGHT18 (dbl), NWEIGHT19 (dbl), NWEIGHT20 (dbl), NWEIGHT21 (dbl),
    NWEIGHT22 (dbl), NWEIGHT23 (dbl), NWEIGHT24 (dbl), NWEIGHT25 (dbl),
    NWEIGHT26 (dbl), NWEIGHT27 (dbl), NWEIGHT28 (dbl), NWEIGHT29 (dbl),
    NWEIGHT30 (dbl), NWEIGHT31 (dbl), NWEIGHT32 (dbl), NWEIGHT33 (dbl),
    NWEIGHT34 (dbl), NWEIGHT35 (dbl), NWEIGHT36 (dbl), NWEIGHT37 (dbl),
    NWEIGHT38 (dbl), NWEIGHT39 (dbl), NWEIGHT40 (dbl), NWEIGHT41 (dbl),
    NWEIGHT42 (dbl), NWEIGHT43 (dbl), NWEIGHT44 (dbl), NWEIGHT45 (dbl),
    NWEIGHT46 (dbl), NWEIGHT47 (dbl), NWEIGHT48 (dbl), NWEIGHT49 (dbl),
    NWEIGHT50 (dbl), NWEIGHT51 (dbl), NWEIGHT52 (dbl), NWEIGHT53 (dbl),
    NWEIGHT54 (dbl), NWEIGHT55 (dbl), NWEIGHT56 (dbl), NWEIGHT57 (dbl),
    NWEIGHT58 (dbl), NWEIGHT59 (dbl), NWEIGHT60 (dbl), BTUEL (dbl), DOLLAREL
    (dbl), BTUNG (dbl), DOLLARNG (dbl), BTULP (dbl), DOLLARLP (dbl), BTUFO
    (dbl), DOLLARFO (dbl), BTUWOOD (dbl), TOTALBTU (dbl), TOTALDOL (dbl)

Comparison of {dplyr} and {srvyr} Functions

Why use {srvyr} over {survey}?

  • Use the pipe function %>% or |>
  • Ability to use tidy selection
  • Functions follow snake_case convention

Apply {dplyr} functions to survey objects

  • A few functions in {srvyr} have counterparts in {dplyr}
    • dplyr::summarize() and srvyr::summarize()
    • dplyr::group_by() and srvyr::group_by()
  • Based on the object class, R will recognize which package to use

Using summarize() to calculate mean and median

Tibble:

towny %>%
  summarize(
    area_mean = mean(land_area_km2)
  )
# A tibble: 1 × 1
  area_mean
      <dbl>
1      373.

Using summarize() to calculate mean and median

Survey object:

recs_des %>%
  summarize(
    TOTHSQFT_mean = survey_mean(TOTHSQFT),
  )
# A tibble: 1 × 2
  TOTHSQFT_mean TOTHSQFT_mean_se
          <dbl>            <dbl>
1         1614.             7.19

Using tidyselect

Tibble:

towny %>%
  summarize(across(
    starts_with("population"),
    ~ mean(.x, na.rm = TRUE)
  ))
# A tibble: 1 × 6
  population_1996 population_2001 population_2006 population_2011
            <dbl>           <dbl>           <dbl>           <dbl>
1          25866.          27538.          29173.          30838.
# ℹ 2 more variables: population_2016 <dbl>, population_2021 <dbl>

Using tidyselect

Survey object:

recs_des %>%
  summarize(across(
    starts_with("TOT"),
    ~ survey_mean(.x, na.rm = TRUE)
  ))
# A tibble: 1 × 10
  TOTSQFT_EN TOTSQFT_EN_se TOTHSQFT TOTHSQFT_se TOTCSQFT TOTCSQFT_se TOTALBTU
       <dbl>         <dbl>    <dbl>       <dbl>    <dbl>       <dbl>    <dbl>
1      1819.          6.62    1614.        7.19    1335.        6.72   76753.
# ℹ 3 more variables: TOTALBTU_se <dbl>, TOTALDOL <dbl>, TOTALDOL_se <dbl>

Using {dplyr} functions

Tibble:

towny %>%
  filter(csd_type == "township") %>%
  select(csd_type, name)
# A tibble: 195 × 2
   csd_type name                  
   <chr>    <chr>                 
 1 township Addington Highlands   
 2 township Adelaide Metcalfe     
 3 township Adjala-Tosorontio     
 4 township Admaston/Bromley      
 5 township Alberton              
 6 township Alfred and Plantagenet
 7 township Algonquin Highlands   
 8 township Alnwick/Haldimand     
 9 township Amaranth              
10 township The Archipelago       
# ℹ 185 more rows

Using {dplyr} functions

Survey object:

recs_des_mod <- recs_des %>%
  filter(Urbanicity == "Rural") %>%
  select(ACUsed)

recs_des_mod
Call: Called via srvyr
Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
Sampling variables:
  - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
    NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
    NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 + NWEIGHT17 +
    NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 + NWEIGHT22 + NWEIGHT23 +
    NWEIGHT24 + NWEIGHT25 + NWEIGHT26 + NWEIGHT27 + NWEIGHT28 + NWEIGHT29 +
    NWEIGHT30 + NWEIGHT31 + NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 +
    NWEIGHT36 + NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
    NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 + NWEIGHT47 +
    NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 + NWEIGHT52 + NWEIGHT53 +
    NWEIGHT54 + NWEIGHT55 + NWEIGHT56 + NWEIGHT57 + NWEIGHT58 + NWEIGHT59 +
    NWEIGHT60` 
  - weights: NWEIGHT 
Data variables: 
  - ACUsed (lgl)

Grouping

Tibble:

towny %>%
  group_by(csd_type) %>%
  summarize(area_mean = mean(land_area_km2))
# A tibble: 5 × 2
  csd_type     area_mean
  <chr>            <dbl>
1 city             498. 
2 municipality     607. 
3 town             183. 
4 township         363. 
5 village           23.0

Grouping

Survey object:

recs_des %>%
  group_by(Urbanicity) %>%
  summarize(TOTHSQFT_mean = survey_mean(TOTHSQFT))
# A tibble: 3 × 3
  Urbanicity    TOTHSQFT_mean TOTHSQFT_mean_se
  <fct>                 <dbl>            <dbl>
1 Urban Area            1532.             9.24
2 Urban Cluster         1537.            21.5 
3 Rural                 1966.            17.7 

Using {srvyr} functions on non-survey objects

towny %>%
  summarize(area_mean = survey_mean(land_area_km2))
Error in `summarize()`:
ℹ In argument: `area_mean = survey_mean(land_area_km2)`.
Caused by error in `cur_svy()`:
! Survey context not set

Survey Analysis Process

Overview of Survey Analysis using {srvyr} Package

  1. Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep()

  2. Subset data (if needed) using filter() (to create subpopulations)

  3. Specify domains of analysis using group_by()

  4. Within summarize(), specify variables to calculate, including means, totals, proportions, quantiles, and more

Next Up: Descriptive Statistics

Extra Content

Wrap Up

  • We will be using {tidyverse}, {survey}, and {srvyr} for survey analysis
  • Our data is from ANES and RECS

Wrap Up

  • (Survey) design objects are the backbone for survey analysis
  • They are objects with special class tbl_svy
  • They specify the sampling design, weights, and other necessary information to account for errors in the data

Wrap Up

  • {srvyr} wraps {survey} so that we can use friendly, “tidy” syntax
  • Based on the object type, R will know whether to use {dplyr} or {srvyr}
  • All functions must be done on the survey design object, not the raw data

Ex. 1

How many different types of CSD are there in the dataset?

towny %>%
  count(csd_type)

or

towny %>%
  group_by(csd_type) %>%
  summarize(n = n(), .groups = "drop")
# A tibble: 5 × 2
  csd_type         n
  <chr>        <int>
1 city            52
2 municipality    68
3 town            88
4 township       195
5 village         11

Ex. 2

How many different types of CSD and status are there in the dataset?

towny %>%
  count(csd_type, status)

or

towny %>%
  group_by(csd_type, status) %>%
  summarize(n = n(), .groups = "drop")
# A tibble: 10 × 3
   csd_type     status          n
   <chr>        <chr>       <int>
 1 city         lower-tier     20
 2 city         single-tier    32
 3 municipality lower-tier     41
 4 municipality single-tier    27
 5 town         lower-tier     61
 6 town         single-tier    27
 7 township     lower-tier    113
 8 township     single-tier    82
 9 village      lower-tier      6
10 village      single-tier     5

Ex. 3

What is the proportion of each type of CSD?

towny %>%
  count(csd_type) %>%
  mutate(p = n / sum(n))
# A tibble: 5 × 3
  csd_type         n      p
  <chr>        <int>  <dbl>
1 city            52 0.126 
2 municipality    68 0.164 
3 town            88 0.213 
4 township       195 0.471 
5 village         11 0.0266

Ex. 4

What is the proportion of each status within type of CSD?

towny %>%
  count(csd_type, status) %>%
  group_by(csd_type) %>%
  mutate(p = n / sum(n))
# A tibble: 10 × 4
# Groups:   csd_type [5]
   csd_type     status          n     p
   <chr>        <chr>       <int> <dbl>
 1 city         lower-tier     20 0.385
 2 city         single-tier    32 0.615
 3 municipality lower-tier     41 0.603
 4 municipality single-tier    27 0.397
 5 town         lower-tier     61 0.693
 6 town         single-tier    27 0.307
 7 township     lower-tier    113 0.579
 8 township     single-tier    82 0.421
 9 village      lower-tier      6 0.545
10 village      single-tier     5 0.455

Ex. 5

What is the mean population of all of the municipalities in 2021?

towny %>%
  summarize(mean_pop = mean(population_2021, na.rm = TRUE))
# A tibble: 1 × 1
  mean_pop
     <dbl>
1   34142.

Ex. 6

What is the mean population by CSD in 2021?

towny %>%
  group_by(csd_type) %>%
  summarize(mean_pop = mean(population_2021, na.rm = TRUE))
# A tibble: 5 × 2
  csd_type     mean_pop
  <chr>           <dbl>
1 city          200005.
2 municipality   10102.
3 town           22579.
4 township        5367.
5 village         1276.

Ex. 7

What is the mean population of all of the municipalities in 1996, 2001, 2006, 2011, 2016, and 2021? Try to use the across function.

towny %>%
  summarize(across(starts_with("population"), ~ mean(.x, na.rm = TRUE)))
# A tibble: 1 × 6
  population_1996 population_2001 population_2006 population_2011
            <dbl>           <dbl>           <dbl>           <dbl>
1          25866.          27538.          29173.          30838.
  population_2016 population_2021
            <dbl>           <dbl>
1          32264.          34142.

Ex. 8

Run a simple t-test to see if the average population in 1996 is different from the average population in 2016.

t.test(towny$population_1996, towny$population_2016, paired = TRUE)

    Paired t-test

data:  towny$population_1996 and towny$population_2016
t = -4.3204, df = 410, p-value = 1.956e-05
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -9593.244 -3593.423
sample estimates:
mean difference 
      -6593.333