Tidy Data, Weighted Insights

Analyzing Complex Survey Data in R

About us

Stephanie Zimmer

RTI International

Rebecca Powell

Fors Marsh

Isabella Velásquez

Posit

Book overview

Motivation

  • We are R users who work with survey data regularly
  • Share knowledge with
    • R users who are inaccurately analyzing survey data
    • SAS/SUDAAN/Stata users who may not know about the capabilities of R
  • {srvyr} package developed using tidyverse style syntax
  • Stephanie and Rebecca conducted a virtual short course at the AAPOR conference in 2021
  • Connected with Isabella to turn the short course into a book

What’s in the book

  • High-level overview of the survey process
  • Comparison of syntax between {dplyr} and {srvyr}
  • How to read survey documentation
  • Descriptive analysis, statistical testing, and modeling
  • Publication-ready tables and figures accounting for error
  • Creating the survey design object
  • Analysis examples using real-world data

R, SAS, & SUDAAN capabilities

Feature | R {survey} package | SAS survey procs | SUDAAN procs
Descriptive (out of the box) | mean, total, proportion, percentage, quantile, ratio, variance, correlation | mean, total, proportion, percentage, geometric mean, quantile, ratio, variance | mean, total, proportion, percentage, geometric mean, quantile, ratio, variance, correlation
Custom descriptive functions | Yes, but must use delta method | No method in docs | Yes, through vargen proc
Testing | means, proportions, quantiles, association, GOF | means, proportions, association, GOF | means, proportions, association, GOF
Design effects | Not for quantiles, variances, or correlations | Only for proportions | All estimates
Imputation | None | Hot-deck, approximate Bayesian bootstrap, fully efficient fractional, two-stage fully efficient fractional, fractional hot-deck | Weighted sequential hot deck, cell mean, regression-based (linear and logistic)
Weighting | Post-stratification in estimation, calibration (linear, raking, logit) | Post-stratification in estimation | Post-stratification in estimation, calibration: nonresponse and post-stratification (WTADJUST), using variables only known for respondents in models (WTADJX)
Modeling | Linear, Logistic, Cox proportional hazards, Kaplan-Meier, Multinomial, Poisson, Log-linear | Linear, Logistic, Cox proportional hazards | Linear, Logistic, Cox proportional hazards, Kaplan-Meier, Multinomial, Poisson-like count

Setup

R packages for survey analysis

  • {survey} package first on CRAN in 2003
    • descriptive analysis
    • statistical testing
    • modeling
    • weighting
  • {srvyr} package first on CRAN in 2016
    • “wrapper” for {survey} with {tidyverse}-style syntax
    • only descriptive analysis
  • {gtsummary} package first on CRAN in 2016
    • creates publication-ready tables from survey data
    • currently cannot handle replicate weights

Comparison with dplyr

  • dplyr: summary functions called within summarize()

dplyr

towny %>%
  group_by(status) %>%
  summarize(
    area_mean = mean(land_area_km2),
    area_median = median(land_area_km2)
  )
# A tibble: 2 × 3
  status      area_mean area_median
  <chr>           <dbl>       <dbl>
1 lower-tier       362.        310.
2 single-tier      388.        194.

Comparison with dplyr

  • srvyr: survey_*() functions called within summarize()

srvyr

apistrat_des %>%
  group_by(stype) %>%
  summarize(
    api00_mean = survey_mean(api00),
    api00_med = survey_median(api00)
  )
# A tibble: 3 × 5
  stype api00_mean api00_mean_se api00_med api00_med_se
  <fct>      <dbl>         <dbl>     <dbl>        <dbl>
1 E           674.          12.5       671         20.7
2 H           626.          15.5       635         21.6
3 M           637.          16.6       648         24.1

Steps for descriptive analysis

  1. Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep()

  2. Subset data (if needed) using filter() (to create subpopulations)

  3. Specify domains of analysis using group_by()

  4. Within summarize(), specify variables to calculate, including means, totals, proportions, quantiles, and more
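
The four steps above can be sketched end to end. This is a minimal illustration rather than the authors' own example: it assumes the apistrat data that ships with {survey} (strata in stype, weights in pw, finite population correction in fpc), and the filter on awards is chosen only to demonstrate step 2.

```r
library(survey)
library(srvyr)

# Example data shipped with {survey}; loads apistrat among others
data(api, package = "survey")

# Step 1: create the tbl_svy object (here, a stratified design)
apistrat_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw, fpc = fpc)

award_means <- apistrat_des %>%
  # Step 2: subset to a subpopulation (schools eligible for awards)
  filter(awards == "Yes") %>%
  # Step 3: define the domains of analysis
  group_by(stype) %>%
  # Step 4: request the estimates
  summarize(
    api00_mean = survey_mean(api00, vartype = "se"),
    n = unweighted(n())
  )

award_means
```

Filtering after the design object is created, rather than before, keeps the full design information available for variance estimation.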

Steps for testing

  1. Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep()

  2. Subset data (if needed) using filter() (to create subpopulations)

  3. Use svyttest() for comparisons of proportions and means, svygofchisq() for GOF test, or svychisq() for test of independence and test of homogeneity
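
As a rough sketch of step 3, again assuming the apistrat example data from {survey} rather than any data set used later in these slides, the testing functions take a formula and the design object:

```r
library(survey)
library(srvyr)

data(api, package = "survey")

apistrat_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw, fpc = fpc)

# Two-sample t-test: does mean api00 differ by awards eligibility?
ttest_res <- svyttest(api00 ~ awards, design = apistrat_des)

# Test of independence between school type and awards eligibility
chisq_res <- svychisq(~ stype + awards, design = apistrat_des)

ttest_res
chisq_res
```

These functions come from {survey}, not {srvyr}, so the design object is passed through the design argument instead of being piped into summarize().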

Steps for modeling

  1. Create a tbl_svy object (a survey object) using: as_survey_design() or as_survey_rep()

  2. Subset data (if needed) using filter() (to create subpopulations)

  3. Use svyglm() for linear and logistic models, svycoxph() for Cox proportional-hazards models, svykm() for Kaplan-Meier estimates, svyloglin() for log-linear models, and svyolr() for ordinal (proportional-odds) logistic models
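
A minimal modeling sketch, assuming the apistrat example data from {survey}; the choice of predictors (ell, meals) is illustrative only:

```r
library(survey)
library(srvyr)

data(api, package = "survey")

apistrat_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw, fpc = fpc)

# Linear model: the default family is gaussian()
api_lm <- svyglm(api00 ~ ell + meals, design = apistrat_des)

# Logistic model: quasibinomial() avoids warnings about
# non-integer counts that binomial() raises with survey weights
api_logit <- svyglm(
  sch.wide ~ ell + meals,
  design = apistrat_des,
  family = quasibinomial()
)

summary(api_lm)
summary(api_logit)
```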

Load packages and data

# install.packages(c("survey", "srvyr", "gt"))
# pak::pak("tidy-survey-r/srvyrexploR")
library(survey)
library(srvyr)
library(gt)
library(srvyrexploR)

summary(recs_2020)
     DOEID           ClimateRegion_BA         Urbanicity          Region    
 Min.   :100001   Cold       :7116    Urban Area   :12395   Northeast:3657  
 1st Qu.:104625   Mixed-Humid:5579    Urban Cluster: 2020   Midwest  :3832  
 Median :109248   Hot-Humid  :2545    Rural        : 4081   South    :6426  
 Mean   :109248   Hot-Dry    :1577                          West     :4581  
 3rd Qu.:113872   Marine     : 911                                          
 Max.   :118496   Very-Cold  : 572                                          
                  (Other)    : 196                                          
   REGIONC                        Division     STATE_FIPS       
 Length:18496       South Atlantic    :3256   Length:18496      
 Class :character   Pacific           :2497   Class :character  
 Mode  :character   East North Central:2014   Mode  :character  
                    Middle Atlantic   :1977                     
                    West South Central:1827                     
                    West North Central:1818                     
                    (Other)           :5107                     
  state_postal           state_name        HDD65           CDD65     
 CA     : 1152   California   : 1152   Min.   :    0   Min.   :   0  
 TX     : 1016   Texas        : 1016   1st Qu.: 2434   1st Qu.: 814  
 NY     :  904   New York     :  904   Median : 4396   Median :1179  
 FL     :  655   Florida      :  655   Mean   : 4272   Mean   :1526  
 PA     :  617   Pennsylvania :  617   3rd Qu.: 5810   3rd Qu.:1805  
 MA     :  552   Massachusetts:  552   Max.   :17383   Max.   :5534  
 (Other):13600   (Other)      :13600                                 
    HDD30YR         CDD30YR                       HousingUnitType 
 Min.   :    0   Min.   :   0   Mobile home               :  974  
 1st Qu.: 2898   1st Qu.: 601   Single-family detached    :12319  
 Median : 4825   Median :1020   Single-family attached    : 1751  
 Mean   : 4679   Mean   :1310   Apartment: 2-4 Units      : 1013  
 3rd Qu.: 6290   3rd Qu.:1703   Apartment: 5 or more units: 2439  
 Max.   :16071   Max.   :4905                                     
                                                                  
        YearMade      TOTSQFT_EN       TOTHSQFT        TOTCSQFT    
 1970-1979  :2817   Min.   :  200   Min.   :    0   Min.   :    0  
 2000-2009  :2748   1st Qu.: 1100   1st Qu.: 1000   1st Qu.:  460  
 Before 1950:2721   Median : 1700   Median : 1520   Median : 1200  
 1990-1999  :2451   Mean   : 1960   Mean   : 1744   Mean   : 1394  
 1980-1989  :2435   3rd Qu.: 2510   3rd Qu.: 2300   3rd Qu.: 2000  
 1960-1969  :1867   Max.   :15000   Max.   :15000   Max.   :14600  
 (Other)    :3457                                                  
 SpaceHeatingUsed   ACUsed       
 Mode :logical    Mode :logical  
 FALSE:751        FALSE:2325     
 TRUE :17745      TRUE :16171    
                                 
                                 
                                 
                                 
                                                               HeatingBehavior
 Set one temp and leave it                                             :7806  
 Manually adjust at night/no one home                                  :4654  
 Programmable or smart thermostat automatically adjusts the temperature:3310  
 Turn on or off as needed                                              :1491  
 No control                                                            : 438  
 Other                                                                 :  46  
 NA's                                                                  : 751  
 WinterTempDay   WinterTempAway  WinterTempNight
 Min.   :50.00   Min.   :50.00   Min.   :50.00  
 1st Qu.:68.00   1st Qu.:65.00   1st Qu.:65.00  
 Median :70.00   Median :68.00   Median :68.00  
 Mean   :69.77   Mean   :67.45   Mean   :68.01  
 3rd Qu.:72.00   3rd Qu.:70.00   3rd Qu.:70.00  
 Max.   :90.00   Max.   :90.00   Max.   :90.00  
 NA's   :751     NA's   :751     NA's   :751    
                                                                  ACBehavior  
 Set one temp and leave it                                             :6738  
 Manually adjust at night/no one home                                  :3637  
 Programmable or smart thermostat automatically adjusts the temperature:2638  
 Turn on or off as needed                                              :2746  
 No control                                                            : 409  
 Other                                                                 :   3  
 NA's                                                                  :2325  
 SummerTempDay   SummerTempAway  SummerTempNight    NWEIGHT       
 Min.   :50.00   Min.   :50.00   Min.   :50.00   Min.   :  437.9  
 1st Qu.:70.00   1st Qu.:70.00   1st Qu.:68.00   1st Qu.: 4018.7  
 Median :72.00   Median :74.00   Median :72.00   Median : 6119.4  
 Mean   :72.01   Mean   :73.45   Mean   :71.22   Mean   : 6678.7  
 3rd Qu.:75.00   3rd Qu.:78.00   3rd Qu.:74.00   3rd Qu.: 8890.0  
 Max.   :90.00   Max.   :90.00   Max.   :90.00   Max.   :29279.1  
 NA's   :2325    NA's   :2325    NA's   :2325                     
    NWEIGHT1        NWEIGHT2        NWEIGHT3        NWEIGHT4    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3950   1st Qu.: 3951   1st Qu.: 3954   1st Qu.: 3953  
 Median : 6136   Median : 6151   Median : 6151   Median : 6153  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8976   3rd Qu.: 8979   3rd Qu.: 8994   3rd Qu.: 8998  
 Max.   :30015   Max.   :29422   Max.   :29431   Max.   :29494  
                                                                
    NWEIGHT5        NWEIGHT6        NWEIGHT7        NWEIGHT8    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3957   1st Qu.: 3966   1st Qu.: 3944   1st Qu.: 3956  
 Median : 6134   Median : 6147   Median : 6135   Median : 6151  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8987   3rd Qu.: 8984   3rd Qu.: 8998   3rd Qu.: 8988  
 Max.   :30039   Max.   :29419   Max.   :29586   Max.   :29499  
                                                                
    NWEIGHT9       NWEIGHT10       NWEIGHT11       NWEIGHT12    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3947   1st Qu.: 3961   1st Qu.: 3950   1st Qu.: 3947  
 Median : 6139   Median : 6163   Median : 6140   Median : 6160  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8974   3rd Qu.: 8994   3rd Qu.: 8991   3rd Qu.: 8988  
 Max.   :29845   Max.   :29635   Max.   :29681   Max.   :29849  
                                                                
   NWEIGHT13       NWEIGHT14       NWEIGHT15       NWEIGHT16    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3967   1st Qu.: 3962   1st Qu.: 3958   1st Qu.: 3958  
 Median : 6142   Median : 6154   Median : 6145   Median : 6133  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8977   3rd Qu.: 8981   3rd Qu.: 8997   3rd Qu.: 8979  
 Max.   :29843   Max.   :30184   Max.   :29970   Max.   :29825  
                                                                
   NWEIGHT17       NWEIGHT18       NWEIGHT19       NWEIGHT20    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3958   1st Qu.: 3937   1st Qu.: 3947   1st Qu.: 3943  
 Median : 6126   Median : 6155   Median : 6153   Median : 6139  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8977   3rd Qu.: 8993   3rd Qu.: 8979   3rd Qu.: 8992  
 Max.   :30606   Max.   :29689   Max.   :29336   Max.   :30274  
                                                                
   NWEIGHT21       NWEIGHT22       NWEIGHT23       NWEIGHT24    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3960   1st Qu.: 3964   1st Qu.: 3943   1st Qu.: 3946  
 Median : 6135   Median : 6149   Median : 6148   Median : 6136  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8956   3rd Qu.: 8988   3rd Qu.: 8980   3rd Qu.: 8978  
 Max.   :29766   Max.   :29791   Max.   :30126   Max.   :29946  
                                                                
   NWEIGHT25       NWEIGHT26       NWEIGHT27       NWEIGHT28    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3952   1st Qu.: 3966   1st Qu.: 3942   1st Qu.: 3956  
 Median : 6150   Median : 6136   Median : 6125   Median : 6149  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8972   3rd Qu.: 8980   3rd Qu.: 8996   3rd Qu.: 8989  
 Max.   :30445   Max.   :29893   Max.   :30030   Max.   :29599  
                                                                
   NWEIGHT29       NWEIGHT30       NWEIGHT31       NWEIGHT32    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3970   1st Qu.: 3956   1st Qu.: 3944   1st Qu.: 3954  
 Median : 6146   Median : 6149   Median : 6144   Median : 6159  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8979   3rd Qu.: 8991   3rd Qu.: 8994   3rd Qu.: 8982  
 Max.   :30136   Max.   :29895   Max.   :29604   Max.   :29310  
                                                                
   NWEIGHT33       NWEIGHT34       NWEIGHT35       NWEIGHT36    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3964   1st Qu.: 3950   1st Qu.: 3967   1st Qu.: 3948  
 Median : 6148   Median : 6139   Median : 6141   Median : 6149  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8993   3rd Qu.: 8985   3rd Qu.: 8990   3rd Qu.: 8979  
 Max.   :29408   Max.   :29564   Max.   :30437   Max.   :27896  
                                                                
   NWEIGHT37       NWEIGHT38       NWEIGHT39       NWEIGHT40    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3955   1st Qu.: 3954   1st Qu.: 3940   1st Qu.: 3959  
 Median : 6133   Median : 6139   Median : 6147   Median : 6144  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8975   3rd Qu.: 8974   3rd Qu.: 8991   3rd Qu.: 8980  
 Max.   :30596   Max.   :30130   Max.   :29262   Max.   :30344  
                                                                
   NWEIGHT41       NWEIGHT42       NWEIGHT43       NWEIGHT44    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3975   1st Qu.: 3949   1st Qu.: 3947   1st Qu.: 3956  
 Median : 6153   Median : 6137   Median : 6157   Median : 6148  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8982   3rd Qu.: 8988   3rd Qu.: 9005   3rd Qu.: 8986  
 Max.   :29594   Max.   :29938   Max.   :29878   Max.   :29896  
                                                                
   NWEIGHT45       NWEIGHT46       NWEIGHT47       NWEIGHT48    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3952   1st Qu.: 3966   1st Qu.: 3938   1st Qu.: 3953  
 Median : 6149   Median : 6152   Median : 6150   Median : 6139  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8992   3rd Qu.: 8959   3rd Qu.: 8991   3rd Qu.: 8991  
 Max.   :29729   Max.   :29103   Max.   :30070   Max.   :29343  
                                                                
   NWEIGHT49       NWEIGHT50       NWEIGHT51       NWEIGHT52    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3947   1st Qu.: 3948   1st Qu.: 3958   1st Qu.: 3938  
 Median : 6146   Median : 6159   Median : 6150   Median : 6154  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8990   3rd Qu.: 8995   3rd Qu.: 8992   3rd Qu.: 9012  
 Max.   :29590   Max.   :30027   Max.   :29247   Max.   :29445  
                                                                
   NWEIGHT53       NWEIGHT54       NWEIGHT55       NWEIGHT56    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3959   1st Qu.: 3954   1st Qu.: 3945   1st Qu.: 3957  
 Median : 6156   Median : 6151   Median : 6143   Median : 6153  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 8979   3rd Qu.: 8973   3rd Qu.: 8977   3rd Qu.: 8995  
 Max.   :30131   Max.   :29439   Max.   :29216   Max.   :29203  
                                                                
   NWEIGHT57       NWEIGHT58       NWEIGHT59       NWEIGHT60    
 Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0  
 1st Qu.: 3942   1st Qu.: 3962   1st Qu.: 3965   1st Qu.: 3953  
 Median : 6138   Median : 6137   Median : 6144   Median : 6140  
 Mean   : 6679   Mean   : 6679   Mean   : 6679   Mean   : 6679  
 3rd Qu.: 9004   3rd Qu.: 8986   3rd Qu.: 8977   3rd Qu.: 8983  
 Max.   :29819   Max.   :29818   Max.   :29606   Max.   :29818  
                                                                
     BTUEL             DOLLAREL           BTUNG            DOLLARNG     
 Min.   :   143.3   Min.   : -889.5   Min.   :      0   Min.   :   0.0  
 1st Qu.: 20205.8   1st Qu.:  836.5   1st Qu.:      0   1st Qu.:   0.0  
 Median : 31890.0   Median : 1257.9   Median :  22012   Median : 313.9  
 Mean   : 37016.2   Mean   : 1424.8   Mean   :  36960   Mean   : 396.0  
 3rd Qu.: 48298.0   3rd Qu.: 1819.0   3rd Qu.:  62714   3rd Qu.: 644.9  
 Max.   :628155.5   Max.   :15680.2   Max.   :1134709   Max.   :8155.0  
                                                                        
     BTULP           DOLLARLP           BTUFO           DOLLARFO      
 Min.   :     0   Min.   :   0.00   Min.   :     0   Min.   :   0.00  
 1st Qu.:     0   1st Qu.:   0.00   1st Qu.:     0   1st Qu.:   0.00  
 Median :     0   Median :   0.00   Median :     0   Median :   0.00  
 Mean   :  3917   Mean   :  80.89   Mean   :  5109   Mean   :  88.43  
 3rd Qu.:     0   3rd Qu.:   0.00   3rd Qu.:     0   3rd Qu.:   0.00  
 Max.   :364215   Max.   :6621.44   Max.   :426268   Max.   :7003.69  
                                                                      
    BTUWOOD          TOTALBTU          TOTALDOL      
 Min.   :     0   Min.   :   1182   Min.   : -150.5  
 1st Qu.:     0   1st Qu.:  45565   1st Qu.: 1258.3  
 Median :     0   Median :  74180   Median : 1793.2  
 Mean   :  3596   Mean   :  83002   Mean   : 1990.2  
 3rd Qu.:     0   3rd Qu.: 108535   3rd Qu.: 2472.0  
 Max.   :500000   Max.   :1367548   Max.   :20043.4  
                                                     

Design object

Syntax: common sampling designs

The as_survey_design() function is used for most common sampling designs, such as stratified or clustered designs.

as_survey_design(
  .data,
  ids = NULL,
  probs = NULL,
  strata = NULL,
  variables = NULL,
  fpc = NULL,
  nest = FALSE,
  check_strata = !nest,
  weights = NULL,
  pps = FALSE,
  variance = c("HT", "YG"),
  ...
)
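
In practice only a few of these arguments are needed for a given design. As a sketch, assuming the apiclus1 example data from {survey} (a one-stage cluster sample in which school districts, dnum, are the sampled clusters):

```r
library(survey)
library(srvyr)

data(api, package = "survey")

# One-stage cluster sample: ids names the cluster variable,
# weights the analysis weight, fpc the population number of clusters
apiclus1_des <- apiclus1 %>%
  as_survey_design(ids = dnum, weights = pw, fpc = fpc)

apiclus1_des
```

A stratified design would instead supply strata = and leave ids at its default.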

Syntax: replicate weights

For studies with replicate weights, create the survey object using the as_survey_rep() function.

as_survey_rep(
  .data,
  variables = NULL,
  weights = NULL,
  repweights = NULL,
  type = c("BRR", "Fay", "JK1", "JKn", "bootstrap", 
           "successive-difference", "ACS", "other"),
  combined_weights = TRUE,
  rho = NULL,
  bootstrap_average = NULL,
  scale = NULL,
  rscales = NULL,
  fpc = NULL,
  fpctype = c("fraction", "correction"),
  mse = getOption("survey.replicates.mse"),
  degf = NULL,
  ...
)

Implementation

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT, # main analytic weight
    repweights = NWEIGHT1:NWEIGHT60, # jackknife replicate weights
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )
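
The scale and mse arguments can be read off the jackknife variance formula. As a sketch, assuming the standard delete-one-group (JK1) estimator with R = 60 replicates, where \hat{\theta} is the estimate computed with NWEIGHT and \hat{\theta}_r is the estimate computed with the r-th replicate weight:

```latex
\hat{v}(\hat{\theta}) = \frac{R-1}{R} \sum_{r=1}^{R} \left( \hat{\theta}_r - \hat{\theta} \right)^2,
\qquad R = 60
```

The factor (R - 1)/R is the value passed as scale = 59 / 60, and mse = TRUE centers the squared deviations on the full-sample estimate rather than on the mean of the replicate estimates.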

Results

recs_des
Call: Called via srvyr
Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
Sampling variables:
  - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
    NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
    NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 + NWEIGHT17 +
    NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 + NWEIGHT22 + NWEIGHT23 +
    NWEIGHT24 + NWEIGHT25 + NWEIGHT26 + NWEIGHT27 + NWEIGHT28 + NWEIGHT29 +
    NWEIGHT30 + NWEIGHT31 + NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 +
    NWEIGHT36 + NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
    NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 + NWEIGHT47 +
    NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 + NWEIGHT52 + NWEIGHT53 +
    NWEIGHT54 + NWEIGHT55 + NWEIGHT56 + NWEIGHT57 + NWEIGHT58 + NWEIGHT59 +
    NWEIGHT60` 
  - weights: NWEIGHT 
Data variables: 
  - DOEID (dbl), ClimateRegion_BA (fct), Urbanicity (fct), Region (fct),
    REGIONC (chr), Division (fct), STATE_FIPS (chr), state_postal (fct),
    state_name (fct), HDD65 (dbl), CDD65 (dbl), HDD30YR (dbl), CDD30YR (dbl),
    HousingUnitType (fct), YearMade (ord), TOTSQFT_EN (dbl), TOTHSQFT (dbl),
    TOTCSQFT (dbl), SpaceHeatingUsed (lgl), ACUsed (lgl), HeatingBehavior
    (fct), WinterTempDay (dbl), WinterTempAway (dbl), WinterTempNight (dbl),
    ACBehavior (fct), SummerTempDay (dbl), SummerTempAway (dbl),
    SummerTempNight (dbl), NWEIGHT (dbl), NWEIGHT1 (dbl), NWEIGHT2 (dbl),
    NWEIGHT3 (dbl), NWEIGHT4 (dbl), NWEIGHT5 (dbl), NWEIGHT6 (dbl), NWEIGHT7
    (dbl), NWEIGHT8 (dbl), NWEIGHT9 (dbl), NWEIGHT10 (dbl), NWEIGHT11 (dbl),
    NWEIGHT12 (dbl), NWEIGHT13 (dbl), NWEIGHT14 (dbl), NWEIGHT15 (dbl),
    NWEIGHT16 (dbl), NWEIGHT17 (dbl), NWEIGHT18 (dbl), NWEIGHT19 (dbl),
    NWEIGHT20 (dbl), NWEIGHT21 (dbl), NWEIGHT22 (dbl), NWEIGHT23 (dbl),
    NWEIGHT24 (dbl), NWEIGHT25 (dbl), NWEIGHT26 (dbl), NWEIGHT27 (dbl),
    NWEIGHT28 (dbl), NWEIGHT29 (dbl), NWEIGHT30 (dbl), NWEIGHT31 (dbl),
    NWEIGHT32 (dbl), NWEIGHT33 (dbl), NWEIGHT34 (dbl), NWEIGHT35 (dbl),
    NWEIGHT36 (dbl), NWEIGHT37 (dbl), NWEIGHT38 (dbl), NWEIGHT39 (dbl),
    NWEIGHT40 (dbl), NWEIGHT41 (dbl), NWEIGHT42 (dbl), NWEIGHT43 (dbl),
    NWEIGHT44 (dbl), NWEIGHT45 (dbl), NWEIGHT46 (dbl), NWEIGHT47 (dbl),
    NWEIGHT48 (dbl), NWEIGHT49 (dbl), NWEIGHT50 (dbl), NWEIGHT51 (dbl),
    NWEIGHT52 (dbl), NWEIGHT53 (dbl), NWEIGHT54 (dbl), NWEIGHT55 (dbl),
    NWEIGHT56 (dbl), NWEIGHT57 (dbl), NWEIGHT58 (dbl), NWEIGHT59 (dbl),
    NWEIGHT60 (dbl), BTUEL (dbl), DOLLAREL (dbl), BTUNG (dbl), DOLLARNG (dbl),
    BTULP (dbl), DOLLARLP (dbl), BTUFO (dbl), DOLLARFO (dbl), BTUWOOD (dbl),
    TOTALBTU (dbl), TOTALDOL (dbl)

Calculate means

Syntax

The survey_mean() function calculates means while taking the survey design elements into account.

survey_mean(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = FALSE,
  prop_method = c("logit", "likelihood", "asin", "beta", "mean"),
  deff = FALSE,
  df = NULL
)

Implementation

Calculate the estimated average cost of electricity (DOLLAREL) in the United States:

recs_des %>%
  summarize(elec_bill = survey_mean(DOLLAREL,
                                    vartype = c("se", "ci")))
  • Use the survey design object, not the raw data
  • Call survey_mean() within the summarize() function
  • Specify the type of variance output; here, the standard error and confidence interval

Results

Calculate the estimated average cost of electricity (DOLLAREL) in the United States:

recs_des %>%
  summarize(elec_bill = survey_mean(DOLLAREL,
                                    vartype = c("se", "ci")))
# A tibble: 1 × 4
  elec_bill elec_bill_se elec_bill_low elec_bill_upp
      <dbl>        <dbl>         <dbl>         <dbl>
1     1380.         5.38         1369.         1391.

Calculate means with groups

Calculate the estimated average cost of electricity (DOLLAREL) in the U.S. for each region (Region) by adding group_by() with the grouping variable before summarize():

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_mean(DOLLAREL,
                                    vartype = c("se", "ci")))
# A tibble: 4 × 5
  Region    elec_bill elec_bill_se elec_bill_low elec_bill_upp
  <fct>         <dbl>        <dbl>         <dbl>         <dbl>
1 Northeast     1343.         14.6         1313.         1372.
2 Midwest       1293.         11.7         1270.         1317.
3 South         1548.         10.3         1527.         1568.
4 West          1211.         12.0         1187.         1235.

Conduct t-tests

Syntax

Use the svyttest() function to compare two proportions or means.

svyttest(formula,
         design,
         ...)

Implementation: one-sample t-test

Stephanie usually sets her home to 68°F at night during the summer. Is this different from the average household in the U.S.?

First, look at the estimated average nighttime temperature U.S. households set their homes to during the summer (SummerTempNight).

recs_des %>%
  summarize(mu = survey_mean(SummerTempNight, na.rm = TRUE))
# A tibble: 1 × 2
     mu  mu_se
  <dbl>  <dbl>
1  71.4 0.0397

Implementation: one-sample t-test

Test if the average U.S. household sets its temperature at a value different from 68°F using svyttest():

recs_des %>%
  svyttest(
    formula = SummerTempNight - 68 ~ 0, 
    design = ., 
    na.rm = TRUE
  )
  • The formula tests whether the true mean of SummerTempNight minus 68°F is equal to 0
  • The dot (.) passes the recs_des object into the design argument

Results: one-sample t-test

    Design-based one-sample t-test

data:  SummerTempNight - 68 ~ 0
t = 84.788, df = 58, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 3.287816 3.446810
sample estimates:
    mean 
3.367313 
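
The {survey} help page describes the one-sample case of svyttest() as a wrapper around svymean(), so the t statistic can be read as the estimated mean of the shifted variable divided by its standard error. The sketch below checks that reading. It assumes the apistrat example data from {survey} in place of recs_des, and the comparison value of 600 is arbitrary, chosen only for illustration.

```r
library(survey)
library(srvyr)
library(dplyr)

# Example data from {survey}; a stand-in for the RECS design object
data(api, package = "survey")
api_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw, fpc = fpc)

# One-sample test: is the mean of api00 different from 600 (arbitrary value)?
tt <- api_des %>%
  svyttest(
    formula = api00 - 600 ~ 0,
    design = .
  )

# By hand: mean of the shifted variable divided by its standard error
shifted <- svymean(~ I(api00 - 600), design = api_des)
t_by_hand <- as.numeric(coef(shifted) / SE(shifted))

c(svyttest = as.numeric(tt$statistic), by_hand = t_by_hand)
```

Adding the comparison value back to the reported mean recovers the estimate on the original scale, just as 68 + 3.37 gives the 71.4°F shown earlier.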

Implementation: two-sample t-test

On average, is the electric bill significantly different for households with and without air-conditioning?

First, look at the estimated average for households with and without air-conditioning.

recs_des %>%
  group_by(ACUsed) %>%
  summarize(mean = survey_mean(DOLLAREL, na.rm = TRUE))
# A tibble: 2 × 3
  ACUsed  mean mean_se
  <lgl>  <dbl>   <dbl>
1 FALSE  1056.   16.0 
2 TRUE   1422.    5.69

Implementation: two-sample t-test

Test if the electricity expenditure is significantly different for homes with and without air-conditioning:

recs_des %>%
  svyttest(
    formula = DOLLAREL ~ ACUsed,
    design = ., 
    na.rm = TRUE
  )
  • Formula with electricity expenditure on the left and air-conditioning usage on the right

Results: two-sample t-test

    Design-based t-test

data:  DOLLAREL ~ ACUsed
t = 21.29, df = 58, p-value < 2.2e-16
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 331.3343 400.1054
sample estimates:
difference in mean 
          365.7199 
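
The {survey} help page describes the two-sample case of svyttest() as a wrapper around svyglm(), so the "difference in mean" and its t statistic should match the slope for the grouping variable in a design-based linear regression. The sketch below illustrates this. It assumes the apistrat example data from {survey} in place of recs_des, with the two-level factor sch.wide standing in for ACUsed.

```r
library(survey)
library(srvyr)
library(dplyr)

# Example data from {survey}; a stand-in for the RECS design object
data(api, package = "survey")
api_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw, fpc = fpc)

# Two-sample test: does mean api00 differ between the two levels of sch.wide?
tt2 <- api_des %>%
  svyttest(
    formula = api00 ~ sch.wide,
    design = .
  )

# Same comparison written as a design-based linear regression
fit <- svyglm(api00 ~ sch.wide, design = api_des)
coef_table <- summary(fit)$coefficients

# Row 2 holds the slope for the second level of sch.wide
coef_table[2, c("Estimate", "t value")]
c(estimate = as.numeric(tt2$estimate), t = as.numeric(tt2$statistic))
```

Reading the test as a regression also suggests a natural next step when more than two groups or additional covariates are involved: fit the model with svyglm() directly.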

Create tables

Syntax

With the {gt} package, supply the input data table to gt() and add options to modify and format your table.

data %>%
  gt() %>%
  ... add options here...

Implementation

Create a table for estimated average household electricity bill by region:

recs_tab <- recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_mean(DOLLAREL,
                                    vartype = "ci"))

recs_tab
# A tibble: 4 × 4
  Region    elec_bill elec_bill_low elec_bill_upp
  <fct>         <dbl>         <dbl>         <dbl>
1 Northeast     1343.         1313.         1372.
2 Midwest       1293.         1270.         1317.
3 South         1548.         1527.         1568.
4 West          1211.         1187.         1235.

Implementation

Pipe (%>%) your data frame (recs_tab) into the gt() function:

recs_tab %>%
  gt()
Census Region  elec_bill  elec_bill_low  elec_bill_upp
Northeast       1342.647       1313.386       1371.907
Midwest         1293.233       1269.827       1316.639
South           1547.653       1527.115       1568.191
West            1211.020       1187.045       1234.994

Implementation

Continue adding to your table, for example, designating Region as a “stub”:

recs_tab %>%
  gt(rowname_col = "Region")
           elec_bill  elec_bill_low  elec_bill_upp
Northeast   1342.647       1313.386       1371.907
Midwest     1293.233       1269.827       1316.639
South       1547.653       1527.115       1568.191
West        1211.020       1187.045       1234.994

Implementation

Add labels to columns:

recs_tab %>%
  gt(rowname_col = "Region") %>%
  cols_label(
    elec_bill = "Average",
    elec_bill_low = "Lower",
    elec_bill_upp = "Upper"
  )
            Average     Lower     Upper
Northeast  1342.647  1313.386  1371.907
Midwest    1293.233  1269.827  1316.639
South      1547.653  1527.115  1568.191
West       1211.020  1187.045  1234.994

Implementation

Add a spanner to break up the labels:

recs_tab %>%
  gt(rowname_col = "Region") %>%
  cols_label(
    elec_bill = "Average",
    elec_bill_low = "Lower",
    elec_bill_upp = "Upper"
  ) %>%
  tab_spanner(
    label = "Cost of electricity in the U.S. by region",
    columns = c(elec_bill, elec_bill_low, elec_bill_upp))
           Cost of electricity in the U.S. by region
                Average         Lower         Upper
Northeast      1342.647      1313.386      1371.907
Midwest        1293.233      1269.827      1316.639
South          1547.653      1527.115      1568.191
West           1211.020      1187.045      1234.994

Implementation

Format numbers using the fmt_*() functions:

recs_tab %>%
  gt(rowname_col = "Region") %>%
  cols_label(
    elec_bill = "Average",
    elec_bill_low = "Lower",
    elec_bill_upp = "Upper"
  ) %>%
  tab_spanner(
    label = "Cost of electricity in the U.S. by region",
    columns = c(elec_bill, elec_bill_low, elec_bill_upp)) %>%
  fmt_currency()
           Cost of electricity in the U.S. by region
                Average         Lower         Upper
Northeast     $1,342.65     $1,313.39     $1,371.91
Midwest       $1,293.23     $1,269.83     $1,316.64
South         $1,547.65     $1,527.12     $1,568.19
West          $1,211.02     $1,187.05     $1,234.99
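
The fmt_*() functions also accept arguments that control which columns are formatted and how. The sketch below is one possible variation: it limits fmt_currency() to named columns and rounds to whole dollars. The data frame est_tab and its values are made up purely to keep the example self-contained; they are not RECS estimates.

```r
library(gt)
library(dplyr)

# Hypothetical estimates, invented only to demonstrate the formatting calls
est_tab <- tibble(
  group = c("A", "B"),
  avg = c(1234.567, 987.654),
  low = c(1200.123, 950.456),
  upp = c(1269.011, 1024.852)
)

tab <- est_tab %>%
  gt(rowname_col = "group") %>%
  cols_label(
    avg = "Average",
    low = "Lower",
    upp = "Upper"
  ) %>%
  tab_spanner(
    label = "Hypothetical cost by group",
    columns = c(avg, low, upp)
  ) %>%
  # Format only the listed columns, as U.S. dollars with no decimals
  fmt_currency(
    columns = c(avg, low, upp),
    currency = "USD",
    decimals = 0
  )

tab
```

Naming the columns explicitly keeps the currency format away from any non-monetary columns that might be added to the table later.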

Wrap-up

References

Where to find our book

Print copies:

Online version:

Q & A