Tidy Data,
Weighted Insights

Analyzing Complex Survey Data in R

About us

Stephanie Zimmer

RTI International

Rebecca Powell

Fors Marsh

Isabella Velásquez

Posit

Overview

Using R for survey analysis
Setting up your analysis
Introducing databases
Calculating means
Conducting t-tests
Creating tables
Wrap up

Introduction to {survey} and {srvyr}

R packages for survey analysis

{survey} package first on CRAN in 2003
- descriptive analysis
- statistical testing
- modeling
- weighting
{srvyr} package first on CRAN in 2016
- “wrapper” for {survey} with {tidyverse}-style syntax
- only descriptive analysis
{gtsummary} package first on CRAN in 2016
- creates publication-ready tables from survey data
- currently cannot handle replicate weights

Comparison with dplyr

dplyr: summary functions called within summarize()

dplyr

towny %>%
  group_by(status) %>%
  summarize(
    area_mean = mean(land_area_km2),
    area_median = median(land_area_km2)
  )

# A tibble: 2 × 3
  status      area_mean area_median
  <chr>           <dbl>       <dbl>
1 lower-tier       362.        310.
2 single-tier      388.        194.

Comparison with dplyr

srvyr: survey_*() functions called with summarize()

srvyr

apistrat_des %>%
  group_by(stype) %>%
  summarize(
    api00_mean = survey_mean(api00),
    api00_med = survey_median(api00)
  )

# A tibble: 3 × 5
  stype api00_mean api00_mean_se api00_med api00_med_se
  <fct>      <dbl>         <dbl>     <dbl>        <dbl>
1 E           674.          12.5       671         20.7
2 H           626.          15.5       635         21.6
3 M           637.          16.6       648         24.1

Survey analysis process

Steps for descriptive analysis

Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep()
Subset data (if needed) using filter() (to create subpopulations)

3. Specify domains of analysis using group_by()

4. Within summarize(), specify variables to calculate, including means, totals, proportions, quantiles, and more

Steps for testing

Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep()
Subset data (if needed) using filter() (to create subpopulations)

3. Use svyttest() for comparisons of proportions and means, svygofchisq() for GOF test, or svychisq() for test of independence and test of homogeneity

Steps for modeling

Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep()
Subset data (if needed) using filter() (to create subpopulations)

3. Use svyglm() for linear models and logistic models, svycoxph() for Cox proportional-hazards, svykm() for Kaplan-Meier, svyloglin() for log-linear models, svyolr() for multinomial

Load packages and data

Load packages

# install.packages(c("survey", "srvyr", "duckdb", "dbplyr", "gt"))
library(survey)
library(srvyr)
library(duckdb)
library(dbplyr)
library(gt)

Today’s data

Data from the 2021 Census of Population conducted by Statistics Canada

Public-use microdata from the long-form census sample, named data_donnees_2021_ind_v2.csv
Analysis uses survey weights and replicate weights provided with the data
980,868 rows and 144 columns (604 MB)
Available online at: https://www150.statcan.gc.ca/n1/pub/98m0001x/index-eng.htm

Database detour

Initiate DuckDB database
Load CSV into database - data does NOT go into RAM
nread creates a small check that the number of records loaded as expected

con <- dbConnect(duckdb())
nread <- duckdb_read_csv(
  conn = con,
  name = "c2021_ind",
  here::here("RawDat", "data_donnees_2021_ind_v2.csv")
)

nread

[1] 980868

Database tables in R

Make a R object pointing to the table
This R object is NOT a data.frame

c2021_ind_db <- dplyr::tbl(con, I("c2021_ind"))
object.size(c2021_ind_db) %>% print(units="Gb")

0 Gb

c2021_ind_db

# A query:  ?? x 144
# Database: DuckDB 1.4.4 [root@Darwin 22.5.0:R 4.5.2/:memory:]
   PPSORT ABOID AGEGRP AGEIMM ATTSCH BFNMEMB BedRm CFInc CFInc_AT CFSTAT   CHDBN
    <int> <int>  <int>  <int>  <int>   <int> <int> <int>    <int>  <int>   <int>
 1      1     6     13      7      1       0     4    30       27      2 8.89 e7
 2      2     6     11      5      1       0     3    18       18      2 1.15 e4
 3      3     1     13     99      1       0     0     7        7      6 1.000e8
 4      4     6     16     99      1       0     4    15       15      2 1.000e8
 5      5     6     18     99      1       0     3    13       13      3 1.000e8
 6      6     2     16     99      1       0     4     1        1      7 1.000e8
 7      7     6     16     99      1       0     3    10       10      1 1.000e8
 8      8     6     16      7      1       0     4    30       27      1 1.000e8
 9      9     6     11     99      1       0     4    25       24      2 1.000e8
10     10     6     12      6      1       0     4    22       22      2 1.000e8
# ℹ more rows
# ℹ 133 more variables: CIP2021 <int>, CIP2021_STEM_SUM <int>, CMA <int>,
#   CONDO <int>, COVID_ERB <int>, COW <int>, CQPPB <int>, CapGn <int>,
#   CfSize <int>, ChldC <int>, CitOth <int>, Citizen <int>, DIST <int>,
#   DPGRSUM <int>, DTYPE <int>, EFDecile <int>, EFInc <int>, EFInc_AT <int>,
#   EICBN <int>, ETHDER <int>, EfDIMBM_2018 <int>, EfSize <int>, EmpIn <int>,
#   FOL <int>, FPTWK <int>, Gender <int>, GENSTAT <int>, GovtI <int>, …

Design object

Syntax: replicate weights

For studies with replicate weights, create the survey object using the as_survey_rep() function.

as_survey_rep(
  .data,
  variables = NULL,
  weights = NULL,
  repweights = NULL,
  type = c("BRR", "Fay", "JK1", "JKn", "bootstrap", 
           "successive-difference", "ACS", "other"),
  combined_weights = TRUE,
  rho = NULL,
  bootstrap_average = NULL,
  scale = NULL,
  rscales = NULL,
  fpc = NULL,
  fpctype = c("fraction", "correction"),
  mse = getOption("survey.replicates.mse"),
  degf = NULL,
  ...
)

Syntax: replicate weights

For studies with replicate weights, create the survey object using the as_survey_rep() function.

as_survey_rep(
  .data,
  variables = NULL,
  weights = NULL,
  repweights = NULL,
  type = c("BRR", "Fay", "JK1", "JKn", "bootstrap", 
           "successive-difference", "ACS","other"),
  combined_weights = TRUE,
  rho = NULL,
  bootstrap_average = NULL,
  scale = NULL,
  rscales = NULL,
  fpc = NULL,
  fpctype = c("fraction", "correction"),
  mse = getOption("survey.replicates.mse"),
  degf = NULL,
  ...
)

Implementation - design object

cens_des <- c2021_ind_db %>%
  as_survey_rep(
1    weights = WEIGHT,
2    repweights = WT1:WT16,
    type = "other",
3    scale = 1/35,
4    mse = TRUE,
5    rscales = 1)

1: Main analytic weight in WEIGHT variable
2: Replicate weights start with WT1-WT16 variables
3: Scaling per documentation is dividing by 35
4: Center replicates around main estimate, not mean of replicates
5: When specifying type="other", must specify rscales

Results

Design object contains all the design data (i.e., replicate weights)
Remaining data is in the database only
Efficiency opportunity: specify degrees of freedom when creating design object

object.size(cens_des) %>% print(units="Gb")

0.2 Gb

cens_des

Call: Called via srvyr
with 16 replicates and MSE variances.
Sampling variables:
  - repweights: `WT1 + WT2 + WT3 + WT4 + WT5 + WT6 + WT7 + WT8 + WT9 + WT10 +
    WT11 + WT12 + WT13 + WT14 + WT15 + WT16` 
  - weights: WEIGHT 
Data variables: 
  - PPSORT (int), ABOID (int), AGEGRP (int), AGEIMM (int), ATTSCH (int),
    BFNMEMB (int), BedRm (int), CFInc (int), CFInc_AT (int), CFSTAT (int),
    CHDBN (int), CIP2021 (int), CIP2021_STEM_SUM (int), CMA (int), CONDO (int),
    COVID_ERB (int), COW (int), CQPPB (int), CapGn (int), CfSize (int), ChldC
    (int), CitOth (int), Citizen (int), DIST (int), DPGRSUM (int), DTYPE (int),
    EFDecile (int), EFInc (int), EFInc_AT (int), EICBN (int), ETHDER (int),
    EfDIMBM_2018 (int), EfSize (int), EmpIn (int), FOL (int), FPTWK (int),
    Gender (int), GENSTAT (int), GovtI (int), GTRfs (int), HCORENEED_IND (int),
    HDGREE (int), HHInc (int), HHInc_AT (int), HHMRKINC (int), HHSIZE (int),
    HHTYPE (int), HLMOSTEN (int), HLMOSTFR (int), HLMOSTNO (int), HLREGEN
    (int), HLREGFR (int), HLREGNO (int), IMMCAT5 (int), IMMSTAT (int), IncTax
    (int), Invst (int), JOBPERM (int), KOL (int), LFACT (int), LICO_BT (int),
    LICO_AT (int), LIPROGTYPE (int), LI_ELIG_OML_U18 (int), LOCSTUD (int),
    LOC_ST_RES (int), LSTWRK (int), LWMOSTEN (int), LWMOSTFR (int), LWMOSTNO
    (int), LWREGEN (int), LWREGFR (int), LWREGNO (int), LoLIMA (int), LoLIMB
    (int), LoMBM_2018 (int), MODE (int), MTNEN (int), MTNFR (int), MTNNO (int),
    MarStH (int), Mob1 (int), Mob5 (int), MrkInc (int), NAICS (int), NOC21
    (int), NOL (int), NOS (int), OASGI (int), OtInc (int), PKID25 (int),
    PKID0_1 (int), PKID15_24 (int), PKID2_5 (int), PKID6_14 (int), PKIDS (int),
    POB (int), POBPAR1 (int), POBPAR2 (int), POWST (int), PR (int), PR1 (int),
    PR5 (int), PresMortG (int), PRIHM (int), PWDUR (int), PWLEAVE (int), PWOCC
    (int), PWPR (int), REGIND (int), Relig (int), REPAIR (int), ROOM (int),
    Retir (int), SHELCO (int), SSGRAD (int), Subsidy (int), SempI (int), Tenur
    (int), TotInc (int), TotInc_AT (int), VISMIN (int), Value (int), WKSWRK
    (int), WRKACT (int), Wages (int), YRIM (int), WEIGHT (dbl), WT1 (dbl), WT2
    (dbl), WT3 (dbl), WT4 (dbl), WT5 (dbl), WT6 (dbl), WT7 (dbl), WT8 (dbl),
    WT9 (dbl), WT10 (dbl), WT11 (dbl), WT12 (dbl), WT13 (dbl), WT14 (dbl), WT15
    (dbl), WT16 (dbl)

Calculate means

Syntax

The survey_mean() calculates means while taking into account the survey design elements.

survey_mean(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = FALSE,
  prop_method = c("logit", "likelihood", "asin", "beta", "mean"),
  deff = FALSE,
  df = NULL
)

Implementation

Calculate the estimated average cost of housing (SHELCO) in Canada:

cens_des %>%
  summarize(housing_cost = survey_mean(SHELCO, 
                                       vartype = c("se", "ci")))

Use the survey design object, not raw data

Implementation

Calculate the estimated average cost of housing (SHELCO) in Canada:

cens_des %>%
  summarize(housing_cost = survey_mean(SHELCO, 
                                       vartype = c("se", "ci")))

Use the survey design object, not raw data
Call survey_mean() within summarize() function

Implementation

Calculate the estimated average cost of housing (SHELCO) in Canada:

cens_des %>%
  summarize(housing_cost = survey_mean(SHELCO, 
                                       vartype = c("se", "ci")))

Use the survey design object, not raw data
Call survey_mean() within summarize() function
Specify the type of variance output, here we output the standard error and confidence interval

Results

Calculate the estimated average cost of housing (SHELCO) in Canada:

cens_des %>%
  summarize(housing_cost = survey_mean(SHELCO, 
                                       vartype = c("se", "ci")))

# A tibble: 1 × 4
  housing_cost housing_cost_se housing_cost_low housing_cost_upp
         <dbl>           <dbl>            <dbl>            <dbl>
1        1592.            1.12            1590.            1594.

Calculate means with groups

Calculate the estimated average cost of housing (SHELCO) in Canada by each province (PR) by including a group_by() function with the variable of interest before the summarize() function:

cens_des %>%
  group_by(PR) %>%
  summarize(housing_cost = survey_mean(SHELCO, 
                                       vartype = c("se", "ci")))

# A tibble: 11 × 5
      PR housing_cost housing_cost_se housing_cost_low housing_cost_upp
   <int>        <dbl>           <dbl>            <dbl>            <dbl>
 1    10        1113.            7.75            1097.            1130.
 2    11        1126.            8.33            1108.            1143.
 3    12        1187.            5.14            1177.            1198.
 4    13        1011.            3.42            1004.            1019.
 5    24        1211.            1.77            1207.            1215.
 6    35        1810.            2.05            1806.            1814.
 7    46        1274.            6.39            1260.            1287.
 8    47        1350.            4.80            1340.            1360.
 9    48        1760.            2.70            1755.            1766.
10    59        1845.            3.75            1837.            1853.
11    70        1432.           18.0             1394.            1471.

Database detour: Factors/Labelled Data

Working with databases is fast and helpful with large data! BUT…
Cannot have labelled data or factors, so our output may not be meaningful
Couple of options:
1. Create a new variable that is the labelled data
2. After completing your analyses, label the final tables

Factors/Labelled Data - Option 1

cens_des %>%
  mutate(Province = case_when(PR == 10 ~ "Newfoundland and Labrador",
                              PR == 11 ~ "Prince Edward Island",
                              PR == 12 ~ "Nova Scotia",
                              PR == 13 ~ "New Brunswick",
                              PR == 24 ~ "Quebec",
                              PR == 35 ~ "Ontario",
                              PR == 46 ~ "Manitoba",
                              PR == 47 ~ "Saskatchewan",
                              PR == 48 ~ "Alberta",
                              PR == 59 ~ "British Columbia",
                              PR == 70 ~ "Northern Canada")) %>% 
  group_by(Province) %>%
  summarize(housing_cost = survey_mean(SHELCO, 
                                       vartype = c("se", "ci")))

# A tibble: 11 × 5
   Province       housing_cost housing_cost_se housing_cost_low housing_cost_upp
   <chr>                 <dbl>           <dbl>            <dbl>            <dbl>
 1 Alberta               1760.            2.70            1755.            1766.
 2 British Colum…        1845.            3.75            1837.            1853.
 3 Manitoba              1274.            6.39            1260.            1287.
 4 New Brunswick         1011.            3.42            1004.            1019.
 5 Newfoundland …        1113.            7.75            1097.            1130.
 6 Northern Cana…        1432.           18.0             1394.            1471.
 7 Nova Scotia           1187.            5.14            1177.            1198.
 8 Ontario               1810.            2.05            1806.            1814.
 9 Prince Edward…        1126.            8.33            1108.            1143.
10 Quebec                1211.            1.77            1207.            1215.
11 Saskatchewan          1350.            4.80            1340.            1360.

Factors/Labelled Data - Option 2

cens_des %>%
  group_by(PR) %>%
  summarize(housing_cost = survey_mean(SHELCO, 
                                       vartype = c("se", "ci"))) %>%
  mutate(PR = factor(as.character(PR),
                     levels=c("10", "11", "12", "13", "24", "35", "46",
                              "47", "48", "59", "70"),
                     labels=c("Newfoundland and Labrador", "Prince Edward Island", 
                              "Nova Scotia", "New Brunswick", "Quebec", "Ontario",
                              "Manitoba", "Saskatchewan", "Alberta", 
                              "British Columbia", "Northern Canada")))

# A tibble: 11 × 5
   PR             housing_cost housing_cost_se housing_cost_low housing_cost_upp
   <fct>                 <dbl>           <dbl>            <dbl>            <dbl>
 1 Newfoundland …        1113.            7.75            1097.            1130.
 2 Prince Edward…        1126.            8.33            1108.            1143.
 3 Nova Scotia           1187.            5.14            1177.            1198.
 4 New Brunswick         1011.            3.42            1004.            1019.
 5 Quebec                1211.            1.77            1207.            1215.
 6 Ontario               1810.            2.05            1806.            1814.
 7 Manitoba              1274.            6.39            1260.            1287.
 8 Saskatchewan          1350.            4.80            1340.            1360.
 9 Alberta               1760.            2.70            1755.            1766.
10 British Colum…        1845.            3.75            1837.            1853.
11 Northern Cana…        1432.           18.0             1394.            1471.

Conduct t-tests

Syntax

Use the svyttest() function to compare two proportions or means.

Syntax:

svyttest(
  formula,
  design,
  ...
)

Dot notation

Design object is not the first argument
Need to specify with dot notation in pipes

design_obj %>%
  svyttest(formula = var1 ~ var2,
           design = .)

design_obj %>%
  svyttest(design = .,
           formula = var1 ~ var2)

Syntax pipeline with databases

Select the variables involved in the test
Collect the design object
Perform the test

design_obj %>%
  select(var1, var2) %>%
  collect() %>%
  svyttest(formula = var1 ~ var2,
           design = .)

Implementation: one-sample t-test

Is the proportion of females¹ in Canada different from 50% among those 65 years old and over?

cens_des %>%
  filter(between(AGEGRP, 17, 21)) %>%
  mutate(Woman=if_else(Gender==1, 1, 0)) %>%
  summarize(
    WomenPct=survey_mean(Woman, proportion = TRUE, vartype = "ci")
  )

# A tibble: 1 × 3
  WomenPct WomenPct_low WomenPct_upp
     <dbl>        <dbl>        <dbl>
1    0.532        0.529        0.535