Sampling Designs in {srvyr}

Sampling designs

Sampling methods

Units can be selected in various ways such as:

  • Simple random sampling (with or without replacement): every unit has the same chance of being selected

  • Systematic sampling: order the units in a list, choose a random starting point, and then select units at a fixed interval

  • Probability proportional to size: probability of selection is proportional to “size”
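
The three selection methods above can be sketched in base R. This is only an illustration on a made-up frame of 1,000 units; the names `frame` and `size` and the sample size of 50 are assumptions, not part of the workshop data. The PPS draw is done with replacement, which keeps each draw's selection probability exactly proportional to size.

```r
set.seed(42)

# Hypothetical sampling frame: 1,000 units with a made-up size measure
N <- 1000
n <- 50
frame <- data.frame(id = seq_len(N), size = round(runif(N, 10, 500)))

# Simple random sampling without replacement: every unit has chance n / N
srs_ids <- sample(frame$id, size = n, replace = FALSE)

# Systematic sampling: random start in 1..k, then every k-th unit
k <- N / n                       # sampling interval (20 here)
start <- sample(seq_len(k), 1)
sys_ids <- seq(from = start, by = k, length.out = n)

# Probability proportional to size, drawn with replacement
pps_ids <- sample(frame$id, size = n, replace = TRUE, prob = frame$size)
```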

Complex designs

Designs can also incorporate stratification and/or clustering:

  • Stratified sampling: divide population into mutually exclusive subgroups (strata). Randomly sample within each stratum

  • Clustered sampling: divide population into mutually exclusive subgroups (clusters). Randomly sample clusters and then individuals within clusters
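
As a rough sketch of the two ideas above, the base R code below draws a stratified sample and a two-stage cluster sample from a made-up population. The population `pop`, its four strata, and its 50 clusters are assumptions chosen only for illustration.

```r
set.seed(42)

# Hypothetical population: 1,000 people in 4 strata and 50 clusters
pop <- data.frame(
  id = 1:1000,
  stratum = rep(c("A", "B", "C", "D"), each = 250),
  cluster = rep(1:50, each = 20)
)

# Stratified sampling: draw 10 people independently within every stratum
strat_rows <- unlist(lapply(
  split(pop$id, pop$stratum),
  function(ids) sample(ids, size = 10)
))
strat_sample <- pop[pop$id %in% strat_rows, ]

# Clustered sampling: draw 5 clusters, then 4 people within each sampled cluster
sampled_clusters <- sample(unique(pop$cluster), size = 5)
clus_rows <- unlist(lapply(
  sampled_clusters,
  function(cl) sample(pop$id[pop$cluster == cl], size = 4)
))
clus_sample <- pop[pop$id %in% clus_rows, ]
```

Every stratum is represented in `strat_sample`, while `clus_sample` only contains people from the five sampled clusters.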

Why does the design matter?

  • The design type impacts the variability of the estimates

  • Weights impact the point estimate and the variability estimates

  • Specifying the components of the design (strata and/or clusters) and weights in R is necessary to get correct estimates
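
To see how weights change a point estimate, here is a small base R sketch under made-up assumptions: a population with one large stratum of low values and one small stratum of high values, sampled with the same number of units from each. None of these numbers come from a real survey.

```r
set.seed(1)

# Hypothetical population: a large stratum with low values, a small one with high values
pop <- data.frame(
  stratum = rep(c("large", "small"), times = c(9000, 1000)),
  y = c(rnorm(9000, mean = 50, sd = 5), rnorm(1000, mean = 100, sd = 5))
)

# Disproportionate stratified sample: 100 units from each stratum
samp <- do.call(rbind, lapply(split(pop, pop$stratum), function(s) {
  s[sample(nrow(s), 100), ]
}))

# Weight = stratum population size / stratum sample size
samp$wt <- ifelse(samp$stratum == "large", 9000 / 100, 1000 / 100)

true_mean <- mean(pop$y)
unweighted_mean <- mean(samp$y)                  # ignores the design
weighted_mean <- weighted.mean(samp$y, samp$wt)  # accounts for unequal selection
```

The unweighted mean over-represents the small stratum, so it lands far from `true_mean`, while the weighted mean stays close to it.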

Determining the design

  • Check the documentation such as methodology, design, analysis guide, or technical documentation

  • Documentation will indicate the variables needed to specify the design. Look for:

    • weight (almost always)
    • strata and/or clusters/PSUs (sometimes called pseudo-strata and pseudo-clusters), OR
    • replicate weights (used instead of strata/clusters for analysis)
    • Finite population correction or population sizes (uncommon)
  • Documentation may include SAS, SUDAAN, Stata and/or R syntax

Specifying sampling designs in {srvyr} without replicate weights

as_survey_design(): Syntax

  • Specifying the sampling design when you don’t have replicate weights

  • as_survey_design() creates a tbl_svy object that then correctly calculates weighted estimates and SEs

as_survey_design(
  .data,
  ids = NULL, # cluster IDs/PSUs
  strata = NULL, # strata variables
  variables = NULL, # defaults to all in .data
  fpc = NULL, # variables defining the fpc
  nest = FALSE, # TRUE/FALSE - relabel clusters to nest within strata
  check_strata = !nest, # check that clusters are nested in strata
  weights = NULL # weight variable
)

Syntax for common designs

Load the example data from the {survey} package:

data(api, package="survey")

Simple Random Sample (SRS)

apisrs %>% as_survey_design(fpc = fpc)

Stratified SRS

apistrat %>% as_survey_design(strata = stype, weights = pw)

One-stage cluster sample with a FPC variable

apiclus1 %>% as_survey_design(ids = dnum, weights = pw, fpc = fpc)

Two-stage cluster sample, weights computed from population size

apiclus2 %>% as_survey_design(ids = c(dnum, snum), fpc = c(fpc1, fpc2))

Stratified, cluster design

apistrat %>% as_survey_design(ids = dnum, strata = stype, weights = pw, nest = TRUE)
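
Once a design object exists, it can be passed to the usual {srvyr} verbs. The sketch below reuses the stratified SRS specification shown earlier and estimates a weighted mean with `survey_mean()`; the object names `strat_des` and `strat_est` and the choice of the `api00` column are illustrative assumptions.

```r
library(dplyr)
library(srvyr)

data(api, package = "survey")

# Stratified SRS design on the apistrat example data
strat_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# Weighted mean of api00 with its design-based standard error
strat_est <- strat_des %>%
  summarize(api00_mean = survey_mean(api00))

strat_est
```

The result is a one-row tibble holding the estimate and a companion `_se` column.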

Example

Example: ANES 2020

  • User Guide and Codebook: Section “Data Analysis, Weights, and Variance Estimation” includes information on weights and strata/cluster variables

For analysis of the complete set of cases using pre-election data only, including all cases and representative of the 2020 electorate, use the full sample pre-election weight, V200010a. For analysis including post-election data for the complete set of participants (i.e., analysis of post-election data only or a combination of pre- and post-election data), use the full sample post-election weight, V200010b. Additional weights are provided for analysis of subsets of the data, as follows.

For weight    Use variance unit/PSU/cluster    And use variance stratum
V200010a      V200010c                         V200010d
V200010b      V200010c                         V200010d

Example: ANES 2020 Syntax

anes <- anes_2020 %>%
  # Adjust the weight of the ANES data to reflect the national population
  mutate(Weight = V200010b / sum(V200010b) * 231592693)

anes_des <- anes %>%
  as_survey_design(
    weights = Weight, # Specify the weight variable
    strata = V200010d, # Specify the strata variable per documentation
    ids = V200010c, # Specify the cluster variable per documentation
    nest = TRUE # Indicate that the clusters are nested within strata
  )
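
The weight adjustment in the first step is plain rescaling: every weight is multiplied by the same constant so that the weights sum to the target population. A minimal base R sketch with made-up weights (these are not real ANES values; only the target total is taken from the example above):

```r
# Hypothetical raw weights and responses for five respondents
raw_wt <- c(0.5, 1.2, 0.8, 2.0, 1.5)
y <- c(1, 0, 1, 1, 0)

# Target population total used in the ANES example
target_pop <- 231592693

# Rescale so the weights sum to the target population
scaled_wt <- raw_wt / sum(raw_wt) * target_pop

sum(scaled_wt)                                       # matches target_pop
weighted.mean(y, raw_wt) - weighted.mean(y, scaled_wt)  # effectively zero
```

Rescaling changes estimated totals but leaves means and proportions unchanged, because the constant cancels out.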

Example: ANES 2020 Design

summary(anes_des)
Stratified 1 - level Cluster Sampling design (with replacement)
With (101) clusters.
Called via srvyr
Probabilities:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
4.839e-06 2.657e-05 4.689e-05 7.688e-05 8.331e-05 3.895e-03 
Stratum Sizes: 
             1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
obs        167 148 158 151 147 172 163 159 160 159 137 179 148 160 159 148
design.PSU   3   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   3   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
            17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
obs        158 156 154 144 170 146 165 147 169 165 172 133 157 167 154 143
design.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
            33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48
obs        143 124 138 130 136 145 140 125 158 146 130 126 126 135 133 140
design.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
            49  50
obs        133 130
design.PSU   2   2
actual.PSU   2   2
Data variables:
 [1] "V200001"                 "CaseID"                 
 [3] "V200002"                 "InterviewMode"          
 [5] "V200010b"                "Weight"                 
 [7] "V200010c"                "VarUnit"                
 [9] "V200010d"                "Stratum"                
[11] "V201006"                 "CampaignInterest"       
[13] "V201023"                 "EarlyVote2020"          
[15] "V201024"                 "V201025x"               
[17] "V201028"                 "V201029"                
[19] "V201101"                 "V201102"                
[21] "VotedPres2016"           "V201103"                
[23] "VotedPres2016_selection" "V201228"                
[25] "V201229"                 "V201230"                
[27] "V201231x"                "PartyID"                
[29] "V201233"                 "TrustGovernment"        
[31] "V201237"                 "TrustPeople"            
[33] "V201507x"                "Age"                    
[35] "AgeGroup"                "V201510"                
[37] "Education"               "V201546"                
[39] "V201547a"                "V201547b"               
[41] "V201547c"                "V201547d"               
[43] "V201547e"                "V201547z"               
[45] "V201549x"                "RaceEth"                
[47] "V201600"                 "Gender"                 
[49] "V201607"                 "V201610"                
[51] "V201611"                 "V201613"                
[53] "V201615"                 "V201616"                
[55] "V201617x"                "Income"                 
[57] "Income7"                 "V202051"                
[59] "V202066"                 "V202072"                
[61] "VotedPres2020"           "V202073"                
[63] "V202109x"                "V202110x"               
[65] "VotedPres2020_selection"

National Health and Nutrition Examination Survey (NHANES)

  • Analysis weight: WTINT2YR
  • Variance Stratum: SDMVSTRA
  • Variance Primary Sampling Unit: VPSU
  • Large population
nhanes_des <- nhanes %>%
   as_survey_design(
      weights = ___________,
      ids = ___________, 
      strata = ___________, 
      fpc = ___________, 
   )
nhanes_des <- nhanes %>%
   as_survey_design(
      weights = WTINT2YR,
      ids = VPSU,
      strata = SDMVSTRA,
      fpc = NULL
   )
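
Why "large population" leads to `fpc = NULL`: for simple random sampling without replacement, the finite population correction multiplies the variance by (1 - n/N), which is essentially 1 when the sample is a tiny share of the population. The sizes below are hypothetical and only illustrate the arithmetic.

```r
# Finite population correction factor for a sample of n from a population of N
fpc_factor <- function(n, N) 1 - n / N

# Hypothetical: a sample of 10,000 from a very large population
fpc_large <- fpc_factor(n = 10000, N = 300e6)

# Same sample size from a small population, where the correction matters
fpc_small <- fpc_factor(n = 10000, N = 40000)

c(large_population = fpc_large, small_population = fpc_small)
```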

Replicate weight methods

Replicate weights overview

Replicate weights are another way to estimate variability. In general, they are constructed and used as follows:

  1. Divide the sample into subsample replicates that mirror the design of the sample
  2. Calculate weights for each replicate using the same procedures for the full-sample weight (i.e., nonresponse and post-stratification)
  3. Calculate estimates for each replicate using the same method as the full-sample estimate
  4. Calculate the estimated variance, which is proportional to the variance of the replicate estimates
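
The four steps can be traced by hand in base R. The sketch below uses a delete-one jackknife on a made-up simple random sample of 8 values, which is a simplification: real surveys typically drop whole PSUs and rerun the nonresponse and post-stratification adjustments for each replicate. All names and numbers here are assumptions for illustration.

```r
# Hypothetical sample of 8 values with equal full-sample weights
y <- c(12, 15, 9, 20, 14, 11, 17, 13)
n <- length(y)
full_wt <- rep(10, n)

# Steps 1-2: one replicate per unit; drop that unit and reweight the rest
rep_wts <- sapply(seq_len(n), function(r) {
  w <- full_wt * n / (n - 1)
  w[r] <- 0
  w
})

# Step 3: estimate the mean with the full-sample weight and with each replicate
full_est <- weighted.mean(y, full_wt)
rep_ests <- apply(rep_wts, 2, function(w) weighted.mean(y, w))

# Step 4: variance is a scaled sum of squared deviations of the replicate estimates
jk_scale <- (n - 1) / n
jk_var <- jk_scale * sum((rep_ests - full_est)^2)
jk_se <- sqrt(jk_var)
```

For the mean of a simple random sample, this jackknife variance matches the textbook value `var(y) / n`, which makes it a handy sanity check.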

Common replicate weight methods

  • Balanced repeated replication (BRR)
  • Fay’s BRR
  • Jackknife
  • Bootstrap

Specifying sampling designs in {srvyr} with replicate weights

as_survey_rep(): Syntax

  • as_survey_rep() creates a tbl_svy object that then correctly calculates weighted estimates and SEs
as_survey_rep(
  .data,
  variables = NULL, # defaults to all in .data
  repweights = NULL, # Variables specifying the replication weights
  weights = NULL, # Variable specifying the analytic/main weight
  type = c(
    "BRR", "Fay", "JK1", "JKn", "bootstrap",
    "successive-difference", "ACS", "other"
  ), # Type of replication weight
  combined_weights = TRUE, # TRUE if repweights already include sampling weights, usually TRUE
  rho = NULL, # Shrinkage factor for Fay's method
  bootstrap_average = NULL, # For type="bootstrap", if the bootstrap weights have been averaged, gives the number of iterations averaged over
  scale = NULL, # Scaling constant for variance
  rscales = NULL, # Scaling constants for variance
  mse = getOption("survey.replicates.mse"), # If TRUE, compute variance based around point estimates rather than mean of replicates
  degf = NULL # Design degrees of freedom, otherwise calculated based on number of replicate weights
)

Syntax for common replicate methods

brr_des <- dat %>%
  as_survey_rep(
    weights = WT,
    repweights = starts_with("REPWT"),
    type = "BRR",
    mse = TRUE
  )

fay_des <- dat %>%
  as_survey_rep(
    weights = WT0,
    repweights = num_range("WT", 1:20),
    type = "Fay",
    mse = TRUE,
    rho = 0.3
  )

jkn_des <- dat %>%
  as_survey_rep(
    weights = WT0,
    repweights = WT1:WT20,
    type = "JKn",
    mse = TRUE,
    rscales = rep(0.1, 20)
  )

bs_des <- dat %>%
  as_survey_rep(
    weights = pw,
    repweights = pw1:pw50,
    type = "bootstrap",
    scale = 0.02186589,
    mse = TRUE
  )
  • Note: these are syntax examples only; they refer to fake data and can’t be run
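
For a version that does run, the sketch below builds a tiny made-up dataset with delete-one jackknife replicate weights and passes it to `as_survey_rep()`. The data, the column names `WT0` and `WT1` to `WT8`, and the object names are all assumptions for illustration, not a real survey file.

```r
library(dplyr)
library(srvyr)

# Hypothetical data: 8 respondents with a full-sample weight WT0
dat <- tibble(
  y = c(12, 15, 9, 20, 14, 11, 17, 13),
  WT0 = 10
)
n <- nrow(dat)

# Delete-one jackknife replicate weights WT1..WT8 (made up for illustration)
for (r in seq_len(n)) {
  w <- dat$WT0 * n / (n - 1)
  w[r] <- 0
  dat[[paste0("WT", r)]] <- w
}

jk1_des <- dat %>%
  as_survey_rep(
    weights = WT0,
    repweights = num_range("WT", 1:8),
    type = "JK1",
    scale = (n - 1) / n,
    mse = TRUE
  )

jk1_est <- jk1_des %>%
  summarize(y_mean = survey_mean(y))
```

With real data, the replicate type, scale, and `mse` setting should come from the survey documentation rather than from a sketch like this.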

Example

Example: RECS 2020

  • Using the microdata file to compute estimates and relative standard errors

The following instructions are examples for calculating any RECS estimate using the final weights (NWEIGHT) and the associated RSE using the replicate weights (NWEIGHT1 – NWEIGHT60).

  • Includes R syntax for {survey} package which gets us what we need for {srvyr}
repweights <- select(RECS2020, NWEIGHT1:NWEIGHT60)
RECS <- svrepdesign(
  data = RECS2020,
  weight = ~NWEIGHT,
  repweights = repweights,
  type = "JK1",
  combined.weights = TRUE,
  scale = (ncol(repweights) - 1) / ncol(repweights),
  mse = TRUE
)

Example: RECS 2020 Syntax

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT, # Specify the weight variable
    repweights = NWEIGHT1:NWEIGHT60, # Specify the replicate weight variables
    type = "JK1", # Specify the replicate type per documentation
    scale = 59 / 60, # Specify the scale
    mse = TRUE # Specify using MSE for variance estimation
  )

Example: RECS 2020 Output

summary(recs_des)
Call: Called via srvyr
Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
Sampling variables:
  - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
    NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
    NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 + NWEIGHT17 +
    NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 + NWEIGHT22 + NWEIGHT23 +
    NWEIGHT24 + NWEIGHT25 + NWEIGHT26 + NWEIGHT27 + NWEIGHT28 + NWEIGHT29 +
    NWEIGHT30 + NWEIGHT31 + NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 +
    NWEIGHT36 + NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
    NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 + NWEIGHT47 +
    NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 + NWEIGHT52 + NWEIGHT53 +
    NWEIGHT54 + NWEIGHT55 + NWEIGHT56 + NWEIGHT57 + NWEIGHT58 + NWEIGHT59 +
    NWEIGHT60` 
  - weights: NWEIGHT 
Data variables: 
  - DOEID (dbl), ClimateRegion_BA (fct), Urbanicity (fct), Region (fct),
    REGIONC (chr), Division (fct), STATE_FIPS (chr), state_postal (fct),
    state_name (fct), HDD65 (dbl), CDD65 (dbl), HDD30YR (dbl), CDD30YR
    (dbl), HousingUnitType (fct), YearMade (ord), TOTSQFT_EN (dbl), TOTHSQFT
    (dbl), TOTCSQFT (dbl), SpaceHeatingUsed (lgl), ACUsed (lgl),
    HeatingBehavior (fct), WinterTempDay (dbl), WinterTempAway (dbl),
    WinterTempNight (dbl), ACBehavior (fct), SummerTempDay (dbl),
    SummerTempAway (dbl), SummerTempNight (dbl), NWEIGHT (dbl), NWEIGHT1
    (dbl), NWEIGHT2 (dbl), NWEIGHT3 (dbl), NWEIGHT4 (dbl), NWEIGHT5 (dbl),
    NWEIGHT6 (dbl), NWEIGHT7 (dbl), NWEIGHT8 (dbl), NWEIGHT9 (dbl),
    NWEIGHT10 (dbl), NWEIGHT11 (dbl), NWEIGHT12 (dbl), NWEIGHT13 (dbl),
    NWEIGHT14 (dbl), NWEIGHT15 (dbl), NWEIGHT16 (dbl), NWEIGHT17 (dbl),
    NWEIGHT18 (dbl), NWEIGHT19 (dbl), NWEIGHT20 (dbl), NWEIGHT21 (dbl),
    NWEIGHT22 (dbl), NWEIGHT23 (dbl), NWEIGHT24 (dbl), NWEIGHT25 (dbl),
    NWEIGHT26 (dbl), NWEIGHT27 (dbl), NWEIGHT28 (dbl), NWEIGHT29 (dbl),
    NWEIGHT30 (dbl), NWEIGHT31 (dbl), NWEIGHT32 (dbl), NWEIGHT33 (dbl),
    NWEIGHT34 (dbl), NWEIGHT35 (dbl), NWEIGHT36 (dbl), NWEIGHT37 (dbl),
    NWEIGHT38 (dbl), NWEIGHT39 (dbl), NWEIGHT40 (dbl), NWEIGHT41 (dbl),
    NWEIGHT42 (dbl), NWEIGHT43 (dbl), NWEIGHT44 (dbl), NWEIGHT45 (dbl),
    NWEIGHT46 (dbl), NWEIGHT47 (dbl), NWEIGHT48 (dbl), NWEIGHT49 (dbl),
    NWEIGHT50 (dbl), NWEIGHT51 (dbl), NWEIGHT52 (dbl), NWEIGHT53 (dbl),
    NWEIGHT54 (dbl), NWEIGHT55 (dbl), NWEIGHT56 (dbl), NWEIGHT57 (dbl),
    NWEIGHT58 (dbl), NWEIGHT59 (dbl), NWEIGHT60 (dbl), BTUEL (dbl), DOLLAREL
    (dbl), BTUNG (dbl), DOLLARNG (dbl), BTULP (dbl), DOLLARLP (dbl), BTUFO
    (dbl), DOLLARFO (dbl), BTUWOOD (dbl), TOTALBTU (dbl), TOTALDOL (dbl)
Variables: 
  [1] "DOEID"            "ClimateRegion_BA" "Urbanicity"      
  [4] "Region"           "REGIONC"          "Division"        
  [7] "STATE_FIPS"       "state_postal"     "state_name"      
 [10] "HDD65"            "CDD65"            "HDD30YR"         
 [13] "CDD30YR"          "HousingUnitType"  "YearMade"        
 [16] "TOTSQFT_EN"       "TOTHSQFT"         "TOTCSQFT"        
 [19] "SpaceHeatingUsed" "ACUsed"           "HeatingBehavior" 
 [22] "WinterTempDay"    "WinterTempAway"   "WinterTempNight" 
 [25] "ACBehavior"       "SummerTempDay"    "SummerTempAway"  
 [28] "SummerTempNight"  "NWEIGHT"          "NWEIGHT1"        
 [31] "NWEIGHT2"         "NWEIGHT3"         "NWEIGHT4"        
 [34] "NWEIGHT5"         "NWEIGHT6"         "NWEIGHT7"        
 [37] "NWEIGHT8"         "NWEIGHT9"         "NWEIGHT10"       
 [40] "NWEIGHT11"        "NWEIGHT12"        "NWEIGHT13"       
 [43] "NWEIGHT14"        "NWEIGHT15"        "NWEIGHT16"       
 [46] "NWEIGHT17"        "NWEIGHT18"        "NWEIGHT19"       
 [49] "NWEIGHT20"        "NWEIGHT21"        "NWEIGHT22"       
 [52] "NWEIGHT23"        "NWEIGHT24"        "NWEIGHT25"       
 [55] "NWEIGHT26"        "NWEIGHT27"        "NWEIGHT28"       
 [58] "NWEIGHT29"        "NWEIGHT30"        "NWEIGHT31"       
 [61] "NWEIGHT32"        "NWEIGHT33"        "NWEIGHT34"       
 [64] "NWEIGHT35"        "NWEIGHT36"        "NWEIGHT37"       
 [67] "NWEIGHT38"        "NWEIGHT39"        "NWEIGHT40"       
 [70] "NWEIGHT41"        "NWEIGHT42"        "NWEIGHT43"       
 [73] "NWEIGHT44"        "NWEIGHT45"        "NWEIGHT46"       
 [76] "NWEIGHT47"        "NWEIGHT48"        "NWEIGHT49"       
 [79] "NWEIGHT50"        "NWEIGHT51"        "NWEIGHT52"       
 [82] "NWEIGHT53"        "NWEIGHT54"        "NWEIGHT55"       
 [85] "NWEIGHT56"        "NWEIGHT57"        "NWEIGHT58"       
 [88] "NWEIGHT59"        "NWEIGHT60"        "BTUEL"           
 [91] "DOLLAREL"         "BTUNG"            "DOLLARNG"        
 [94] "BTULP"            "DOLLARLP"         "BTUFO"           
 [97] "DOLLARFO"         "BTUWOOD"          "TOTALBTU"        
[100] "TOTALDOL"        

American Community Survey (ACS)

  • Analysis weight: PWGTP
  • Replicate weights: PWGTP1-PWGTP80
  • Jackknife with scale adjustment of 4/80
acs_des <- acs_pums %>%
   as_survey_rep(
      weights = ___________,
      repweights = ___________,
      type = ___________,
      scale = _________ 
   )
acs_des <- acs_pums %>%
   as_survey_rep(
      weights = PWGTP,
      repweights = num_range("PWGTP", 1:80),
      type = "JK1",
      scale = 4/80
   )

Your Turn

Open 05-design-exercises.qmd


Open Q & A