Chapter 10 Sample designs and replicate weights
For this chapter, load the following packages:
To help explain the different types of sample designs, this chapter uses the api
and scd
data that are included in the {survey} package (Lumley 2010):
This chapter uses data from the Residential Energy Consumption Survey (RECS), both 2015 and 2020, so we load the RECS data from the {srvyrexploR} package using their object names recs_2015
and recs_2020
, respectively (Zimmer, Powell, and Velásquez 2024).
10.1 Introduction
The primary reason for using packages like {survey} and {srvyr} is to account for the sampling design or replicate weights into point and uncertainty estimates (Freedman Ellis and Schneider 2024; Lumley 2010). By incorporating the sampling design or replicate weights, these estimates are appropriately calculated.
In this chapter, we introduce common sampling designs and common types of replicate weights, the mathematical methods for calculating estimates and standard errors for a given sampling design, and the R syntax to specify the sampling design or replicate weights. While we show the math behind the estimates, the functions in these packages handle the calculation. To deeply understand the math and the derivation, refer to Penn State (2019), Särndal, Swensson, and Wretman (2003), Wolter (2007), or Fuller (2011) (these are listed in order of increasing statistical rigorousness).
The general process for estimation in the {srvyr} package is to:
Create a
tbl_svy
object (a survey object) using:as_survey_design()
oras_survey_rep()
Subset data (if needed) using
filter()
(subpopulations)Specify domains of analysis using
group_by()
Within
summarize()
, specify variables to calculate, including means, totals, proportions, quantiles, and more
This chapter includes details on the first step: creating the survey object. Once this survey object is created, it can be used in the other steps (detailed in Chapters 5 through 7) to account for the complex survey design.
10.2 Common sampling designs
A sampling design is the method used to draw a sample. Both logistical and statistical elements are considered when developing a sampling design. When specifying a sampling design in R, we specify the levels of sampling along with the weights. The weight for each record is constructed so that the particular record represents that many units in the population. For example, in a survey of 6th-grade students in the United States, the weight associated with each responding student reflects how many 6th-grade students across the country that record represents. Generally, the weights represent the inverse of the probability of selection, such that the sum of the weights corresponds to the total population size, although some studies may have the sum of the weights equal to the number of respondent records.
Some common terminology across the designs are:
- sample size, generally denoted as \(n\), is the number of units selected to be sampled
- population size, generally denoted as \(N\), is the number of units in the population of interest
- sampling frame, the list of units from which the sample is drawn (see Chapter 2 for more information)
10.2.1 Simple random sample without replacement
The simple random sample (SRS) without replacement is a sampling design in which a fixed sample size is selected from a sampling frame, and every possible subsample has an equal probability of selection. Without replacement refers to the fact that once a sampling unit has been selected, it is removed from the sample frame and cannot be selected again.
- Requirements: The sampling frame must include the entire population.
- Advantages: SRS requires no information about the units apart from contact information.
- Disadvantages: The sampling frame may not be available for the entire population.
- Example: Randomly select students in a university from a roster provided by the registrar’s office.
The math
The estimate for the population mean of variable \(y\) is:
\[\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i\] where \(\bar{y}\) represents the sample mean, \(n\) is the total number of respondents (or observations), and \(y_i\) is each individual value of \(y\).
The estimate of the standard error of the mean is:
\[se(\bar{y})=\sqrt{\frac{s^2}{n}\left( 1-\frac{n}{N} \right)}\] where
\[s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\bar{y}\right)^2.\]
and \(N\) is the population size. This standard error estimate might look very similar to equations in other statistical applications except for the part on the right side of the equation: \(1-\frac{n}{N}\). This is called the finite population correction (FPC) factor. If the size of the frame, \(N\), is very large in comparison to the sample, the FPC is negligible, so it is often ignored. A common guideline is if the sample is less than 10% of the population, the FPC is negligible.
To estimate proportions, we define \(x_i\) as the indicator if the outcome is observed. That is, \(x_i=1\) if the outcome is observed, and \(x_i=0\) if the outcome is not observed for respondent \(i\). Then the estimated proportion from an SRS design is:
\[\hat{p}=\frac{1}{n}\sum_{i=1}^n x_i \] and the estimated standard error of the proportion is:
\[se(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1-\frac{n}{N}\right)} \]
The syntax
If a sample was drawn through SRS and had no nonresponse or other weighting adjustments, we specify this design in R as:
where dat
is a tibble or data.frame with the survey data, and fpcvar
is a variable in the data indicating the sampling frame’s size (this variable has the same value for all cases in an SRS design). If the frame is very large, sometimes the frame size is not provided. In that case, the FPC is not needed, and we specify the design as:
If some post-survey adjustments were implemented and the weights are not all equal, we specify the design as:
where wtvar
is a variable in the data indicating the weight for each case. Again, the FPC can be omitted if it is unnecessary because the frame is large compared to the sample size.
Example
The {survey} package in R provides some example datasets that we use throughout this chapter. One of the example datasets we use is from the Academic Performance Index Program (APIP). The APIP program administered by the California Department of Education, and the {survey} package includes a population file (sample frame) of all schools with at least 100 students and several different samples pulled from that data using different sampling methods. For this first example, we use the apisrs
dataset, which contains an SRS of 200 schools. For printing purposes, we create a new dataset called apisrs_slim
, which sorts the data by the school district and school ID and subsets the data to only a few columns. The SRS sample data are illustrated below:
apisrs_slim <-
apisrs %>%
as_tibble() %>%
arrange(dnum, snum) %>%
select(cds, dnum, snum, dname, sname, fpc, pw)
apisrs_slim
## # A tibble: 200 × 7
## cds dnum snum dname sname fpc pw
## <chr> <int> <dbl> <chr> <chr> <dbl> <dbl>
## 1 19642126061220 1 1121 ABC Unified Haske… 6194 31.0
## 2 19642126066716 1 1124 ABC Unified Stowe… 6194 31.0
## 3 36675876035174 5 3895 Adelanto Elementary Adela… 6194 31.0
## 4 33669776031512 19 3347 Alvord Unified Arlan… 6194 31.0
## 5 33669776031595 19 3352 Alvord Unified Wells… 6194 31.0
## 6 31667876031033 39 3271 Auburn Union Elementary Cain … 6194 31.0
## 7 19642876011407 42 1169 Baldwin Park Unified Deanz… 6194 31.0
## 8 19642876011464 42 1175 Baldwin Park Unified Heath… 6194 31.0
## 9 19642956011589 48 1187 Bassett Unified Erwin… 6194 31.0
## 10 41688586043392 49 4948 Bayshore Elementary Baysh… 6194 31.0
## # ℹ 190 more rows
Table 10.1 provides details on all the variables in this dataset.
Variable Name | Description |
---|---|
cds |
Unique identifier for each school |
dnum |
School district identifier within county |
snum |
School identifier within district |
dname |
District Name |
sname |
School Name |
fpc |
Finite population correction factor |
pw |
Weight |
To create the tbl_survey
object for the SRS data, we specify the design as:
## Independent Sampling design
## Called via srvyr
## Sampling variables:
## - ids: `1`
## - fpc: fpc
## - weights: pw
## Data variables:
## - cds (chr), dnum (int), snum (dbl), dname (chr), sname (chr), fpc
## (dbl), pw (dbl)
In the printed design object, the design is described as an “Independent Sampling design,” which is another term for SRS. The ids are specified as 1
, which means there is no clustering (a topic described in Section 10.2.4), the FPC variable is indicated, and the weights are indicated. We can also look at the summary of the design object (summary()
) and see the distribution of the probabilities (inverse of the weights) along with the population size and a list of the variables in the dataset.
## Independent Sampling design
## Called via srvyr
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0323 0.0323 0.0323 0.0323 0.0323 0.0323
## Population size (PSUs): 6194
## Data variables:
## [1] "cds" "dnum" "snum" "dname" "sname" "fpc" "pw"
10.2.2 Simple random sample with replacement
Similar to the SRS design, the simple random sample with replacement (SRSWR) design randomly selects the sample from the entire sampling frame. However, while SRS removes sampled units before selecting again, the SRSWR instead replaces each sampled unit before drawing again, so units can be selected more than once.
- Requirements: The sampling frame must include the entire population.
- Advantages: SRSWR requires no information about the units apart from contact information.
- Disadvantages:
- The sampling frame may not be available for the entire population.
- Units can be selected more than once, resulting in a smaller realized sample size because receiving duplicate information from a single respondent does not provide additional information.
- For small populations, SRSWR has larger standard errors than SRS designs.
- Example: A professor puts all students’ names on paper slips and selects them randomly to ask students questions, but the professor replaces the paper after calling on the student so they can be selected again at any time.
In general for surveys, using an SRS design (without replacement) is preferred as we do not want respondents to answer a survey more than once.
The math
The estimate for the population mean of variable \(y\) is:
\[\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i\]
and the estimate of the standard error of mean is:
\[se(\bar{y})=\sqrt{\frac{s^2}{n}}\] where
\[s^2=\frac{1}{n-1}\sum_{i=1}^n\left(y_i-\bar{y}\right)^2.\] To calculate the estimated proportion, we define \(x_i\) as the indicator that the outcome is observed (as we did with SRS):
\[\hat{p}=\frac{1}{n}\sum_{i=1}^n x_i \] and the estimated standard error of the proportion is:
\[se(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
The syntax
If we had a sample that was drawn through SRSWR and had no nonresponse or other weighting adjustments, in R, we specify this design as:
where dat
is a tibble or data.frame containing our survey data. This syntax is the same as an SRS design, except an FPC is not included. This is because when calculating a sample with replacement, the population pool to select from is no longer finite, so a correction is not needed. Therefore, with large populations where the FPC is negligible, the underlying formulas for SRS and SRSWR designs are the same.
If some post-survey adjustments were implemented and the weights are not all equal, we specify the design as:
where wtvar
is the variable for the weight of the data.
Example
The {survey} package does not include an example of SRSWR. To illustrate this design, we need to create an example. We use the APIP population data provided by the {survey} package (apipop
) and select a sample of 200 cases using the slice_sample()
function from the tidyverse. One of the arguments in the slice_sample()
function is replace
. If replace=TRUE
, then we are conducting an SRSWR. We then calculate selection weights as the inverse of the probability of selection and call this new dataset apisrswr
.
set.seed(409963)
apisrswr <- apipop %>%
as_tibble() %>%
slice_sample(n = 200, replace = TRUE) %>%
select(cds, dnum, snum, dname, sname) %>%
mutate(weight = nrow(apipop) / 200)
head(apisrswr)
## # A tibble: 6 × 6
## cds dnum snum dname sname weight
## <chr> <int> <dbl> <chr> <chr> <dbl>
## 1 43696416060065 533 5348 Palo Alto Unified Jordan (Da… 31.0
## 2 07618046005060 650 509 San Ramon Valley Unified Alamo Elem… 31.0
## 3 19648086085674 457 2134 Montebello Unified La Merced … 31.0
## 4 07617056003719 346 377 Knightsen Elementary Knightsen … 31.0
## 5 19650606023022 744 2351 Torrance Unified Carr (Evel… 31.0
## 6 01611196090120 6 13 Alameda City Unified Paden (Wil… 31.0
Because this is an SRS design with replacement, there may be duplicates in the data. It is important to keep the duplicates in the data for proper estimation. For reference, we can view the duplicates in the example data we just created.
## # A tibble: 4 × 6
## # Groups: cds [2]
## cds dnum snum dname sname weight
## <chr> <int> <dbl> <chr> <chr> <dbl>
## 1 15633216008841 41 869 Bakersfield City Elem Chipman Junio… 31.0
## 2 15633216008841 41 869 Bakersfield City Elem Chipman Junio… 31.0
## 3 39686766042782 716 4880 Stockton City Unified Tyler Skills … 31.0
## 4 39686766042782 716 4880 Stockton City Unified Tyler Skills … 31.0
We created a weight variable in this example data, which is the inverse of the probability of selection. We specify the sampling design for apisrswr
as:
## Independent Sampling design (with replacement)
## Called via srvyr
## Sampling variables:
## - ids: `1`
## - weights: weight
## Data variables:
## - cds (chr), dnum (int), snum (dbl), dname (chr), sname (chr), weight
## (dbl)
## Independent Sampling design (with replacement)
## Called via srvyr
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0323 0.0323 0.0323 0.0323 0.0323 0.0323
## Data variables:
## [1] "cds" "dnum" "snum" "dname" "sname" "weight"
In the output above, the design object and the object summary are shown. Both note that the sampling is done “with replacement” because no FPC was specified. The probabilities, which are derived from the weights, are summarized in the summary function output.
10.2.3 Stratified sampling
Stratified sampling occurs when a population is divided into mutually exclusive subpopulations (strata), and then samples are selected independently within each stratum.
- Requirements: The sampling frame must include the information to divide the population into strata for every unit.
- Advantages:
- This design ensures sample representation in all subpopulations.
- If the strata are correlated with survey outcomes, a stratified sample has smaller standard errors compared to a SRS sample of the same size.
- This results in a more efficient design.
- Disadvantages: Auxiliary data may not exist to divide the sampling frame into strata, or the data may be outdated.
- Examples:
- Example 1: A population of North Carolina residents could be stratified into urban and rural areas, and then an SRS of residents from both rural and urban areas is selected independently. This ensures there are residents from both areas in the sample.
- Example 2: Law enforcement agencies could be stratified into the three primary general-purpose categories in the U.S.: local police, sheriff’s departments, and state police. An SRS of agencies from each of the three types is then selected independently to ensure all three types of agencies are represented.
The math
Let \(\bar{y}_h\) be the sample mean for stratum \(h\), \(N_h\) be the population size of stratum \(h\), \(n_h\) be the sample size of stratum \(h\), and \(H\) be the total number of strata. Then, the estimate for the population mean under stratified SRS sampling is:
\[\bar{y}=\frac{1}{N}\sum_{h=1}^H N_h\bar{y}_h\] and the estimate of the standard error of \(\bar{y}\) is:
\[se(\bar{y})=\sqrt{\frac{1}{N^2} \sum_{h=1}^H N_h^2 \frac{s_h^2}{n_h}\left(1-\frac{n_h}{N_h}\right)} \]
where \[s_h^2=\frac{1}{n_h-1}\sum_{i=1}^{n_h}\left(y_{i,h}-\bar{y}_h\right)^2\]
For estimates of proportions, let \(\hat{p}_h\) be the estimated proportion in stratum \(h\). Then, the population proportion estimate is:
\[\hat{p}= \frac{1}{N}\sum_{h=1}^H N_h \hat{p}_h\]
The standard error of the proportion is:
\[se(\hat{p}) = \frac{1}{N} \sqrt{ \sum_{h=1}^H N_h^2 \frac{\hat{p}_h(1-\hat{p}_h)}{n_h-1} \left(1-\frac{n_h}{N_h}\right)}\]
The syntax
In addition to the fpc
and weights
arguments discussed in the types above, stratified designs require the addition of the strata
argument. For example, to specify a stratified SRS design in {srvyr} when using the FPC, that is, where the population sizes of the strata are not too large and are known, we specify the design as:
where fpcvar
is a variable on our data that indicates \(N_h\) for each row, and stratavar
is a variable indicating the stratum for each row. We can omit the FPC if it is not applicable. Additionally, we can indicate the weight variable if it is present where wtvar
is a variable on our data with a numeric weight.
Example
In the example APIP data, apistrat
is a stratified random sample, stratified by school type (stype
) with three levels: E
for elementary school, M
for middle school, and H
for high school. As with the SRS example above, we sort and select specific variables for use in printing. The data are illustrated below, including a count of the number of cases per stratum:
apistrat_slim <-
apistrat %>%
as_tibble() %>%
arrange(dnum, snum) %>%
select(cds, dnum, snum, dname, sname, stype, fpc, pw)
apistrat_slim %>%
count(stype, fpc)
## # A tibble: 3 × 3
## stype fpc n
## <fct> <dbl> <int>
## 1 E 4421 100
## 2 H 755 50
## 3 M 1018 50
The FPC is the same for each case within each stratum. This output also shows that 100 elementary schools, 50 middle schools, and 50 high schools were sampled. It is often common for the number of units sampled from each strata to be different based on the goals of the project, or to mirror the size of each strata in the population. We specify the design as:
apistrat_des <- apistrat_slim %>%
as_survey_design(
strata = stype,
weights = pw,
fpc = fpc
)
apistrat_des
## Stratified Independent Sampling design
## Called via srvyr
## Sampling variables:
## - ids: `1`
## - strata: stype
## - fpc: fpc
## - weights: pw
## Data variables:
## - cds (chr), dnum (int), snum (dbl), dname (chr), sname (chr), stype
## (fct), fpc (dbl), pw (dbl)
## Stratified Independent Sampling design
## Called via srvyr
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0226 0.0226 0.0359 0.0401 0.0534 0.0662
## Stratum Sizes:
## E H M
## obs 100 50 50
## design.PSU 100 50 50
## actual.PSU 100 50 50
## Population stratum sizes (PSUs):
## E H M
## 4421 755 1018
## Data variables:
## [1] "cds" "dnum" "snum" "dname" "sname" "stype" "fpc" "pw"
When printing the object, it is specified as a “Stratified Independent Sampling design,” also known as a stratified SRS, and the strata variable is included. Printing the summary, we see a distribution of probabilities, as we saw with SRS; but we also see the sample and population sizes by stratum.
10.2.4 Clustered sampling
Clustered sampling occurs when a population is divided into mutually exclusive subgroups called clusters or primary sampling units (PSUs). A random selection of PSUs is sampled, and then another level of sampling is done within these clusters. There can be multiple levels of this selection. Clustered sampling is often used when a list of the entire population is not available or data collection involves interviewers needing direct contact with respondents.
- Requirements: There must be a way to divide the population into clusters. Clusters are commonly structural, such as institutions (e.g., schools, prisons) or geography (e.g., states, counties).
- Advantages:
- Clustered sampling is advantageous when data collection is done in person, so interviewers are sent to specific sampled areas rather than completely at random across a country.
- With clustered sampling, a list of the entire population is not necessary. For example, if sampling students, we do not need a list of all students, but only a list of all schools. Once the schools are sampled, lists of students can be obtained within the sampled schools.
- Disadvantages: Compared to a simple random sample for the same sample size, clustered samples generally have larger standard errors of estimates.
- Examples:
- Example 1: Consider a study needing a sample of 6th-grade students in the United States. No list likely exists of all these students. However, it is more likely to obtain a list of schools that enroll 6th graders, so a study design could select a random sample of schools that enroll 6th graders. The selected schools can then provide a list of students to do a second stage of sampling where 6th-grade students are randomly sampled within each of the sampled schools. This is a one-stage sample design (the one representing the number of clusters) and is the type of design we discuss in the formulas below.
- Example 2: Consider a study sending interviewers to households for a survey. This is a more complicated example that requires two levels of clustering (two-stage sample design) to efficiently use interviewers in geographic clusters. First, in the U.S., counties could be selected as the PSU and then census block groups within counties could be selected as the secondary sampling unit (SSU). Households could then be randomly sampled within the block groups. This type of design is popular for in-person surveys, as it reduces the travel necessary for interviewers.
The math
Consider a survey where \(a\) clusters are sampled from a population of \(A\) clusters via SRS. Within each sampled cluster, \(i\), there are \(B_i\) units in the population, and \(b_i\) units are sampled via SRS. Let \(\bar{y}_{i}\) be the sample mean of cluster \(i\). Then, a ratio estimator of the population mean is:
\[\bar{y}=\frac{\sum_{i=1}^a B_i \bar{y}_{i}}{ \sum_{i=1}^a B_i}\] Note this is a consistent but biased estimator. Often the population size is not known, so this is a method to estimate a mean without knowing the population size. The estimated standard error of the mean is:
\[se(\bar{y})= \frac{1}{\hat{N}}\sqrt{\left(1-\frac{a}{A}\right)\frac{s_a^2}{a} + \frac{A}{a} \sum_{i=1}^a \left(1-\frac{b_i}{B_i}\right) \frac{s_i^2}{b_i} }\] where \(\hat{N}\) is the estimated population size, \(s_a^2\) is the between-cluster variance, and \(s_i^2\) is the within-cluster variance.
The formula for the between-cluster variance (\(s_a^2\)) is:
\[s_a^2=\frac{1}{a-1}\sum_{i=1}^a \left( \hat{y}_i - \frac{\sum_{i=1}^a \hat{y}_{i} }{a}\right)^2\] where \(\hat{y}_i =B_i\bar{y_i}\).
The formula for the within-cluster variance (\(s_i^2\)) is:
\[s_i^2=\frac{1}{a(b_i-1)} \sum_{j=1}^{b_i} \left(y_{ij}-\bar{y}_i\right)^2\] where \(y_{ij}\) is the outcome for sampled unit \(j\) within cluster \(i\).
The syntax
Clustered sampling designs require the addition of the ids
argument, which specifies the cluster level variable(s). To specify a two-stage clustered design without replacement, we specify the design as:
where PSU
and SSU
are the variables indicating the PSU and SSU identifiers, and A
and B
are the variables indicating the population sizes for each level (i.e., A
is the number of clusters, and B
is the number of units within each cluster). Note that A
is the same for all records, and B
is the same for all records within the same cluster.
If clusters were sampled with replacement or from a very large population, the FPC is unnecessary. Additionally, only the first stage of selection is necessary regardless of whether the units were selected with replacement at any stage. The subsequent stages of selection are ignored in computation as their contribution to the variance is overpowered by the first stage (see Särndal, Swensson, and Wretman (2003) or Wolter (2007) for a more in-depth discussion). Therefore, the two design objects specified below yield the same estimates in the end:
clus2ex1_des <- dat %>%
as_survey_design(weights = wtvar,
ids = c(PSU, SSU))
clus2ex2_des <- dat %>%
as_survey_design(weights = wtvar,
ids = PSU)
Note that there is one additional argument that is sometimes necessary, which is nest = TRUE
. This option relabels cluster IDs to enforce nesting within strata. Sometimes, as an example, there may be a cluster 1
within each stratum, but cluster 1
in stratum 1
is a different cluster than cluster 1
in stratum 2
. These are actually different clusters. This option indicates that repeated numbering does not mean it is the same cluster. If this option is not used and there are repeated cluster IDs across different strata, an error is generated.
Example
The survey
package includes a two-stage cluster sample data, apiclus2
, in which school districts were sampled, and then a random sample of five schools was selected within each district. For districts with fewer than five schools, all schools were sampled. School districts are identified by dnum
, and schools are identified by snum
. The variable fpc1
indicates how many districts there are in California (the total number of PSUs or A
), and fpc2
indicates how many schools were in a given district with at least 100 students (the total number of SSUs or B
). The data include a row for each school. In the data printed below, there are 757 school districts, as indicated by fpc1
, and there are nine schools in District 731, one school in District 742, two schools in District 768, and so on as indicated by fpc2
. For illustration purposes, the object apiclus2_slim
has been created from apiclus2
, which subsets the data to only the necessary columns and sorts the data.
apiclus2_slim <-
apiclus2 %>%
as_tibble() %>%
arrange(desc(dnum), snum) %>%
select(cds, dnum, snum, fpc1, fpc2, pw)
apiclus2_slim
## # A tibble: 126 × 6
## cds dnum snum fpc1 fpc2 pw
## <chr> <int> <dbl> <dbl> <int[1d]> <dbl>
## 1 47704826050942 795 5552 757 1 18.9
## 2 07618126005169 781 530 757 6 22.7
## 3 07618126005177 781 531 757 6 22.7
## 4 07618126005185 781 532 757 6 22.7
## 5 07618126005193 781 533 757 6 22.7
## 6 07618126005243 781 535 757 6 22.7
## 7 19650786023337 768 2371 757 2 18.9
## 8 19650786023345 768 2372 757 2 18.9
## 9 54722076054423 742 5898 757 1 18.9
## 10 50712906053086 731 5781 757 9 34.1
## # ℹ 116 more rows
To specify this design in R, we use the following:
apiclus2_des <- apiclus2_slim %>%
as_survey_design(
ids = c(dnum, snum),
fpc = c(fpc1, fpc2),
weights = pw
)
apiclus2_des
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## Called via srvyr
## Sampling variables:
## - ids: `dnum + snum`
## - fpc: `fpc1 + fpc2`
## - weights: pw
## Data variables:
## - cds (chr), dnum (int), snum (dbl), fpc1 (dbl), fpc2 (int[1d]), pw
## (dbl)
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## Called via srvyr
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00367 0.03774 0.05284 0.04239 0.05284 0.05284
## Population size (PSUs): 757
## Data variables:
## [1] "cds" "dnum" "snum" "fpc1" "fpc2" "pw"
The design objects are described as “2 - level Cluster Sampling design,” and include the ids (cluster), FPC, and weight variables. The summary notes that the sample includes 40 first-level clusters (PSUs), which are school districts, and 126 second-level clusters (SSUs), which are schools. Additionally, the summary includes a numeric summary of the probabilities of selection and the population size (number of PSUs) as 757.
10.3 Combining sampling methods
SRS, stratified, and clustered designs are the backbone of sampling designs, and the features are often combined in one design. Additionally, rather than using SRS for selection, other sampling mechanisms are commonly used, such as probability proportional to size (PPS), systematic sampling, or selection with unequal probabilities, which are briefly described here. In PPS sampling, a size measure is constructed for each unit (e.g., the population of the PSU or the number of occupied housing units), and units with larger size measures are more likely to be sampled. Systematic sampling is commonly used to ensure representation across a population. Units are sorted by a feature, and then every \(k\) units is selected from a random start point so the sample is spread across the population. In addition to PPS, other unequal probabilities of selection may be used. For example, in a study of establishments (e.g., businesses or public institutions) that conducts a survey every year, an establishment that recently participated (e.g., participated last year) may have a reduced chance of selection in a subsequent round to reduce the burden on the establishment. To learn more about sampling designs, refer to Valliant, Dever, and Kreuter (2013), Cox et al. (2011), Cochran (1977), and Deming (1991).
A common method of sampling is to stratify PSUs, select PSUs within the stratum using PPS selection, and then select units within the PSUs either with SRS or PPS. Reading survey documentation is an important first step in survey analysis to understand the design of the survey we are using and variables necessary to specify the design. Good documentation highlights the variables necessary to specify the design. This is often found in the user guide, methodology report, analysis guide, or technical documentation (see Chapter 3 for more details).
Example
For example, the 2017-2019 National Survey of Family Growth had a stratified multi-stage area probability sample:
- In the first stage, PSUs are counties or collections of counties and are stratified by Census region/division, size (population), and MSA status. Within each stratum, PSUs were selected via PPS.
- In the second stage, neighborhoods were selected within the sampled PSUs using PPS selection.
- In the third stage, housing units were selected within the sampled neighborhoods.
- In the fourth stage, a person was randomly chosen among eligible persons within the selected housing units using unequal probabilities based on the person’s age and sex.
The public use file does not include all these levels of selection and instead has pseudo-strata and pseudo-clusters, which are the variables used in R to specify the design. As specified on page 4 of the documentation, the stratum variable is SEST
, the cluster variable is SECU
, and the weight variable is WGT2017_2019
. Thus, to specify this design in R, we use the following syntax:
10.4 Replicate weights
Replicate weights are often included on analysis files instead of, or in addition to, the design variables (strata and PSUs). Replicate weights are used as another method to estimate variability. Often, researchers choose to use replicate weights to avoid publishing design variables (strata or clustering variables) as a measure to reduce the risk of disclosure. There are several types of replicate weights, including balanced repeated replication (BRR), Fay’s BRR, jackknife, and bootstrap methods. An overview of the process for using replicate weights is as follows:
- Divide the sample into subsample replicates that mirror the design of the sample
- Calculate weights for each replicate using the same procedures for the full-sample weight (i.e., nonresponse and post-stratification)
- Calculate estimates for each replicate using the same method as the full-sample estimate
- Calculate the estimated variance, which is proportional to the variance of the replicate estimates
The different types of replicate weights largely differ between step 1 (how the sample is divided into subsamples) and step 4 (which multiplication factors, scales, are used to multiply the variance). The general format for the standard error is:
\[ \sqrt{\alpha \sum_{r=1}^R \alpha_r (\hat{\theta}_r - \hat{\theta})^2 }\]
where \(R\) is the number of replicates, \(\alpha\) is a constant that depends on the replication method, \(\alpha_r\) is a factor associated with each replicate, \(\hat{\theta}\) is the weighted estimate based on the full sample, and \(\hat{\theta}_r\) is the weighted estimate of \(\theta\) based on the \(r^{\text{th}}\) replicate.
To create the design object for surveys with replicate weights, we use as_survey_rep() instead of as_survey_design(), which we use for the common sampling designs in the sections above.
10.4.1 Balanced Repeated Replication method
The balanced repeated replication (BRR) method requires a stratified sample design with two PSUs in each stratum. Each replicate is constructed by deleting one PSU per stratum using a Hadamard matrix. For the PSU that is included, the weight is generally multiplied by two but may have other adjustments, such as post-stratification. A Hadamard matrix is a special square matrix with entries of +1 or –1 with mutually orthogonal rows. Hadamard matrices must have one row, two rows, or a multiple of four rows. The size of the Hadamard matrix is determined by the first multiple of 4 greater than or equal to the number of strata. For example, if a survey had seven strata, the Hadamard matrix would be an \(8\times8\) matrix. Additionally, a survey with eight strata would also have an \(8\times8\) Hadamard matrix. The columns in the matrix specify the strata, and the rows specify the replicate. In each replicate (row), a +1 means to use the first PSU, and a –1 means to use the second PSU in the estimate. For example, here is a \(4\times4\) Hadamard matrix:
\[ \begin{array}{rrrr} +1 &+1 &+1 &+1\\ +1&-1&+1&-1\\ +1&+1&-1&-1\\ +1 &-1&-1&+1 \end{array} \] In the first replicate (row), all the values are +1; so in each stratum, the first PSU would be used in the estimate. In the second replicate, the first PSU would be used in strata 1 and 3, while the second PSU would be used in strata 2 and 4. In the third replicate, the first PSU would be used in strata 1 and 2, while the second PSU would be used in strata 3 and 4. Finally, in the fourth replicate, the first PSU would be used in strata 1 and 4, while the second PSU would be used in strata 2 and 3. For more information about Hadamard matrices, see Wolter (2007). Note that supplied BRR weights from a data provider already incorporate this adjustment, and the {survey} package generates the Hadamard matrix, if necessary, for calculating BRR weights; so an analyst does not need to create or provide the matrix.
The math
A weighted estimate for the full sample is calculated as \(\hat{\theta}\), and then a weighted estimate for each replicate is calculated as \(\hat{\theta}_r\) for \(R\) replicates. Using the generic notation above, \(\alpha=\frac{1}{R}\) and \(\alpha_r=1\) for each \(r\). The standard error of the estimate is calculated as follows:
\[se(\hat{\theta})=\sqrt{\frac{1}{R} \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}\]
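As a quick illustration (not from the survey data used in this chapter), this formula can be applied directly to a vector of replicate estimates; the function name and the numbers below are hypothetical:

```r
# Standard error from replicate estimates, mirroring the BRR formula above
brr_se <- function(theta_hat, theta_reps) {
  R <- length(theta_reps)
  sqrt(sum((theta_reps - theta_hat)^2) / R)
}

brr_se(theta_hat = 10, theta_reps = c(9.5, 10.5, 9.8, 10.2))
## 0.3807887
```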
Specifying replicate weights in R requires specifying the type of replicate weights, the main weight variable, the replicate weight variables, and other options. One of the key options is for the mean squared error (MSE). If mse = TRUE, variances are computed around the point estimate \((\hat{\theta})\); if mse = FALSE, variances are computed around the mean of the replicates \((\bar{\theta})\) instead, which looks like this:
\[se(\hat{\theta})=\sqrt{\frac{1}{R} \sum_{r=1}^R \left( \hat{\theta}_r-\bar{\theta}\right)^2}\] where \[\bar{\theta}=\frac{1}{R}\sum_{r=1}^R \hat{\theta}_r\]
The default option for mse is to use the global option “survey.replicates.mse,” which is set to FALSE unless a user changes it. To determine whether mse should be set to TRUE or FALSE, read the survey documentation. If there is no indication in the survey documentation for BRR, we recommend setting mse to TRUE, as this is the default in other software (e.g., SAS, SUDAAN).
The syntax
Replicate weights generally come in groups and are sequentially numbered, such as PWGTP1, PWGTP2, …, PWGTP80 for the person weights in the American Community Survey (ACS) (U.S. Census Bureau 2021) or BRRWT1, BRRWT2, …, BRRWT96 in the 2015 Residential Energy Consumption Survey (RECS) (U.S. Energy Information Administration 2017). This makes it easy to use some of the tidy selection functions in R.
To specify a BRR design, we need to specify the weight variable (weights), the replicate weight variables (repweights), the type of replicate weights as BRR (type = "BRR"), and whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE). For example, if a dataset had WT0 for the main weight and 20 BRR weights indicated as WT1, WT2, …, WT20, we can use the following syntax (both are equivalent):
brr_des <- dat %>%
as_survey_rep(weights = WT0,
repweights = all_of(str_c("WT", 1:20)),
type = "BRR",
mse = TRUE)
brr_des <- dat %>%
as_survey_rep(weights = WT0,
repweights = num_range("WT", 1:20),
type = "BRR",
mse = TRUE)
If a dataset had WT for the main weight and had 20 BRR weights indicated REPWT1, REPWT2, …, REPWT20, we can use the following syntax (both are equivalent):
brr_des <- dat %>%
as_survey_rep(weights = WT,
repweights = all_of(str_c("REPWT", 1:20)),
type = "BRR",
mse = TRUE)
brr_des <- dat %>%
as_survey_rep(weights = WT,
repweights = starts_with("REPWT"),
type = "BRR",
mse = TRUE)
If the replicate weight variables are in the file consecutively, we can also use the following syntax:
brr_des <- dat %>%
as_survey_rep(weights = WT,
repweights = REPWT1:REPWT20,
type = "BRR",
mse = TRUE)
Typically, each replicate weight sums to a value similar to the main weight, as both the replicate weights and the main weight are supposed to provide population estimates. Rarely, an alternative method is used where the replicate weights have values of 0 or 2 in the case of BRR weights. This would be indicated in the documentation (see Chapter 3 for more information on reading documentation). In this case, the replicate weights are not combined, and the option combined_weights = FALSE should be indicated, as the default value for this argument is TRUE. This specific syntax is shown below:
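A sketch of this case, again using the hypothetical dataset dat with main weight WT and replicate factors REPWT1, REPWT2, …, REPWT20:

```r
brr_des <- dat %>%
  as_survey_rep(
    weights = WT,
    repweights = starts_with("REPWT"),
    type = "BRR",
    mse = TRUE,
    combined_weights = FALSE # replicates hold 0/2 factors, not full weights
  )
```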
Example
The {survey} package includes a data example from section 12.2 of Levy and Lemeshow (2013). In this fictional data, two out of five ambulance stations were sampled from each of three emergency service areas (ESAs); thus BRR weights are appropriate with two PSUs (stations) sampled in each stratum (ESA). In the code below, we create BRR weights as was done by Levy and Lemeshow (2013).
scdbrr <- scd %>%
as_tibble() %>%
mutate(
wt = 5 / 2,
rep1 = 2 * c(1, 0, 1, 0, 1, 0),
rep2 = 2 * c(1, 0, 0, 1, 0, 1),
rep3 = 2 * c(0, 1, 1, 0, 0, 1),
rep4 = 2 * c(0, 1, 0, 1, 1, 0)
)
scdbrr
## # A tibble: 6 × 9
## ESA ambulance arrests alive wt rep1 rep2 rep3 rep4
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 120 25 2.5 2 2 0 0
## 2 1 2 78 24 2.5 0 0 2 2
## 3 2 1 185 30 2.5 2 0 2 0
## 4 2 2 228 49 2.5 0 2 0 2
## 5 3 1 670 80 2.5 2 0 0 2
## 6 3 2 530 70 2.5 0 2 2 0
To specify the BRR weights, we use the following syntax:
scdbrr_des <- scdbrr %>%
as_survey_rep(
type = "BRR",
repweights = starts_with("rep"),
combined_weights = FALSE,
weights = wt
)
scdbrr_des
## Call: Called via srvyr
## Balanced Repeated Replicates with 4 replicates.
## Sampling variables:
## - repweights: `rep1 + rep2 + rep3 + rep4`
## - weights: wt
## Data variables:
## - ESA (int), ambulance (int), arrests (dbl), alive (dbl), wt (dbl),
## rep1 (dbl), rep2 (dbl), rep3 (dbl), rep4 (dbl)
## Variables:
## [1] "ESA" "ambulance" "arrests" "alive" "wt"
## [6] "rep1" "rep2" "rep3" "rep4"
Note that combined_weights was specified as FALSE because these weights are simply specified as 0 and 2 and do not incorporate the overall weight. When printing the object, the type of replication is noted as Balanced Repeated Replicates, and the replicate weights and the weight variable are specified. Additionally, the summary lists the variables included in the data and design object.
10.4.2 Fay’s BRR method
Fay’s BRR method for replicate weights is similar to the BRR method in that it uses a Hadamard matrix to construct replicate weights. However, rather than deleting PSUs for each replicate, with Fay’s BRR, half of the PSUs have a replicate weight equal to the main weight multiplied by \(\rho\), and the other half have the main weight multiplied by \((2-\rho)\), where \(0 \le \rho < 1\). Note that when \(\rho=0\), this is equivalent to the standard BRR weights, and as \(\rho\) approaches 1, this method becomes more similar to the jackknife methods discussed in Section 10.4.3. To obtain the value of \(\rho\), it is necessary to read the survey documentation (see Chapter 3).
The math
The standard error estimate for \(\hat{\theta}\) is slightly different than the BRR, due to the addition of the multiplier of \(\rho\). Using the generic notation above, \(\alpha=\frac{1}{R \left(1-\rho\right)^2}\) and \(\alpha_r=1 \text{ for all } r\). The standard error is calculated as:
\[se(\hat{\theta})=\sqrt{\frac{1}{R (1-\rho)^2} \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}\]
The syntax
The syntax is very similar for BRR and Fay’s BRR. To specify a Fay’s BRR design, we need to specify the weight variable (weights), the replicate weight variables (repweights), the type of replicate weights as Fay’s BRR (type = "Fay"), whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE), and Fay’s multiplier (rho). For example, if a dataset had WT0 for the main weight, 20 BRR weights indicated as WT1, WT2, …, WT20, and a Fay’s multiplier of 0.3, we use the following syntax:
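A sketch under those assumptions (the object name dat is hypothetical):

```r
fay_des <- dat %>%
  as_survey_rep(
    weights = WT0,
    repweights = num_range("WT", 1:20),
    type = "Fay",
    rho = 0.3, # Fay's multiplier from the survey documentation
    mse = TRUE
  )
```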
Example
The 2015 RECS (U.S. Energy Information Administration 2017) uses Fay’s BRR weights with the final weight as NWEIGHT and replicate weights as BRRWT1 - BRRWT96, and the documentation specifies a Fay’s multiplier of 0.5. On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total energy cost, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the census region. We use the 2015 RECS data from the {srvyrexploR} package that provides data for this book (see the Prerequisites box at the beginning of this chapter). To specify the design for the recs_2015 data, we use the following syntax:
recs_2015_des <- recs_2015 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = BRRWT1:BRRWT96,
type = "Fay",
rho = 0.5,
mse = TRUE,
variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC)
)
recs_2015_des
## Call: Called via srvyr
## Fay's variance method (rho= 0.5 ) with 96 replicates and MSE variances.
## Sampling variables:
## - repweights: `BRRWT1 + BRRWT2 + BRRWT3 + BRRWT4 + BRRWT5 + BRRWT6 +
## BRRWT7 + BRRWT8 + BRRWT9 + BRRWT10 + BRRWT11 + BRRWT12 + BRRWT13 +
## BRRWT14 + BRRWT15 + BRRWT16 + BRRWT17 + BRRWT18 + BRRWT19 + BRRWT20
## + BRRWT21 + BRRWT22 + BRRWT23 + BRRWT24 + BRRWT25 + BRRWT26 +
## BRRWT27 + BRRWT28 + BRRWT29 + BRRWT30 + BRRWT31 + BRRWT32 + BRRWT33
## + BRRWT34 + BRRWT35 + BRRWT36 + BRRWT37 + BRRWT38 + BRRWT39 +
## BRRWT40 + BRRWT41 + BRRWT42 + BRRWT43 + BRRWT44 + BRRWT45 + BRRWT46
## + BRRWT47 + BRRWT48 + BRRWT49 + BRRWT50 + BRRWT51 + BRRWT52 +
## BRRWT53 + BRRWT54 + BRRWT55 + BRRWT56 + BRRWT57 + BRRWT58 + BRRWT59
## + BRRWT60 + BRRWT61 + BRRWT62 + BRRWT63 + BRRWT64 + BRRWT65 +
## BRRWT66 + BRRWT67 + BRRWT68 + BRRWT69 + BRRWT70 + BRRWT71 + BRRWT72
## + BRRWT73 + BRRWT74 + BRRWT75 + BRRWT76 + BRRWT77 + BRRWT78 +
## BRRWT79 + BRRWT80 + BRRWT81 + BRRWT82 + BRRWT83 + BRRWT84 + BRRWT85
## + BRRWT86 + BRRWT87 + BRRWT88 + BRRWT89 + BRRWT90 + BRRWT91 +
## BRRWT92 + BRRWT93 + BRRWT94 + BRRWT95 + BRRWT96`
## - weights: NWEIGHT
## Data variables:
## - DOEID (dbl), TOTALDOL (dbl), TOTSQFT_EN (dbl), REGIONC (dbl)
## Variables:
## [1] "DOEID" "TOTALDOL" "TOTSQFT_EN" "REGIONC"
In specifying the design, the variables option was also used to indicate which variables might be used in analyses. This is optional but can make our object smaller and easier to work with. When printing the design object or looking at the summary, the replicate weight type is reiterated as Fay's variance method (rho= 0.5) with 96 replicates and MSE variances, and the variables are included. No weight or probability summary is included in this output, as we have seen in some other design objects.
10.4.3 Jackknife method
There are three jackknife estimators implemented in {srvyr}: jackknife 1 (JK1), jackknife n (JKn), and jackknife 2 (JK2). The JK1 method can be used for unstratified designs, and replicates are created by removing one PSU at a time so the number of replicates is the same as the number of PSUs. If there is no clustering, then the PSU is the ultimate sampling unit (e.g., students).
The JKn method is used for stratified designs and requires two or more PSUs per stratum. In this case, each replicate is created by deleting one PSU from a single stratum, so the number of replicates is the number of total PSUs across all strata. The JK2 method is a special case of JKn when there are exactly 2 PSUs sampled per stratum. For variance estimation, we also need to specify the scaling constants.
The math
Using the generic notation above, \(\alpha=\frac{R-1}{R}\) and \(\alpha_r=1 \text{ for all } r\). For the JK1 method, the standard error estimate for \(\hat{\theta}\) is calculated as:
\[se(\hat{\theta})=\sqrt{\frac{R-1}{R} \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}\] The JKn method is a bit more complex, but the coefficients are generally provided with restricted and public-use files. For each replicate, one stratum has a PSU removed, and the weights are adjusted by \(n_h/(n_h-1)\) where \(n_h\) is the number of PSUs in stratum \(h\). The coefficients in other strata are set to 1. Denote the coefficient that results from this process for replicate \(r\) as \(\alpha_r\), then the standard error estimate for \(\hat{\theta}\) is calculated as:
\[se(\hat{\theta})=\sqrt{\sum_{r=1}^R \alpha_r \left( \hat{\theta}_r-\hat{\theta}\right)^2}\]
The syntax
To specify the jackknife method, we use the survey documentation to understand the type of jackknife (1, n, or 2) and the multiplier. In the syntax, we need to specify the weight variable (weights), the replicate weight variables (repweights), the type of replicate weights as jackknife 1 (type = "JK1"), n (type = "JKN"), or 2 (type = "JK2"), whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE), and the multiplier (scale). For example, if the survey uses the jackknife 1 method with a multiplier of \(\alpha=(R-1)/R=19/20=0.95\), and the dataset has WT0 for the main weight and 20 replicate weights indicated as WT1, WT2, …, WT20, we use the following syntax:
jk1_des <- dat %>%
as_survey_rep(
weights = WT0,
repweights = num_range("WT", 1:20),
type = "JK1",
mse = TRUE,
scale = 0.95
)
For a jackknife n method, we need to specify the multiplier for all replicates. In this case, we use the rscales argument to specify each one. The documentation provides details on what the multipliers (\(\alpha_r\)) are, and they may be the same for all replicates. For example, consider a case where \(\alpha_r=0.1\) for all replicates, and the dataset had WT0 for the main weight and 20 replicate weights indicated as WT1, WT2, …, WT20. We specify the type as type = "JKN" and the multiplier as rscales = rep(0.1, 20):
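A sketch of that JKn specification (the object name dat is hypothetical):

```r
jkn_des <- dat %>%
  as_survey_rep(
    weights = WT0,
    repweights = num_range("WT", 1:20),
    type = "JKN",
    mse = TRUE,
    rscales = rep(0.1, 20) # one multiplier per replicate
  )
```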
Example
The 2020 RECS (U.S. Energy Information Administration 2023c) uses jackknife weights with the final weight as NWEIGHT and replicate weights as NWEIGHT1 - NWEIGHT60 with a scale of \((R-1)/R=59/60\). On the file, DOEID is a unique identifier for each respondent, TOTALDOL is the total cost of energy, TOTSQFT_EN is the total square footage of the residence, and REGIONC is the census region. We use the 2020 RECS data from the {srvyrexploR} package that provides data for this book (see the Prerequisites box at the beginning of this chapter).
To specify this design, we use the following syntax:
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
type = "JK1",
scale = 59 / 60,
mse = TRUE,
variables = c(DOEID, TOTALDOL, TOTSQFT_EN, REGIONC)
)
recs_des
## Call: Called via srvyr
## Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
## Sampling variables:
## - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
## NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
## NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 +
## NWEIGHT17 + NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 +
## NWEIGHT22 + NWEIGHT23 + NWEIGHT24 + NWEIGHT25 + NWEIGHT26 +
## NWEIGHT27 + NWEIGHT28 + NWEIGHT29 + NWEIGHT30 + NWEIGHT31 +
## NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 + NWEIGHT36 +
## NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
## NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 +
## NWEIGHT47 + NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 +
## NWEIGHT52 + NWEIGHT53 + NWEIGHT54 + NWEIGHT55 + NWEIGHT56 +
## NWEIGHT57 + NWEIGHT58 + NWEIGHT59 + NWEIGHT60`
## - weights: NWEIGHT
## Data variables:
## - DOEID (dbl), TOTALDOL (dbl), TOTSQFT_EN (dbl), REGIONC (chr)
## Variables:
## [1] "DOEID" "TOTALDOL" "TOTSQFT_EN" "REGIONC"
When printing the design object or looking at the summary, the replicate weight type is reiterated as Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances, and the variables are included. No weight or probability summary is included.
10.4.4 Bootstrap method
In bootstrap resampling, replicates are created by selecting simple random samples of the PSUs with replacement (SRSWR). If there are \(A\) PSUs in the sample, then each replicate is created by selecting a random sample of \(A\) PSUs with replacement. Each replicate is created independently, and the weights for each replicate are adjusted to reflect the population, generally using the same method as how the analysis weight was adjusted.
The math
A weighted estimate for the full sample is calculated as \(\hat{\theta}\), and then a weighted estimate for each replicate is calculated as \(\hat{\theta}_r\) for \(R\) replicates. Then the standard error of the estimate is calculated as follows:
\[se(\hat{\theta})=\sqrt{\alpha \sum_{r=1}^R \left( \hat{\theta}_r-\hat{\theta}\right)^2}\]
where \(\alpha\) is the scaling constant. Note that the scaling constant (\(\alpha\)) is provided in the survey documentation, as there are many types of bootstrap methods that generate custom scaling constants.
The syntax
To specify a bootstrap method, we need to specify the weight variable (weights), the replicate weight variables (repweights), the type of replicate weights as bootstrap (type = "bootstrap"), whether the mean squared error should be used (mse = TRUE) or not (mse = FALSE), and the multiplier (scale). For example, if a dataset had WT0 for the main weight, 20 bootstrap weights indicated as WT1, WT2, …, WT20, and a multiplier of \(\alpha=.02\), we use the following syntax:
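A sketch under those assumptions (the object name dat is hypothetical):

```r
bs_des <- dat %>%
  as_survey_rep(
    weights = WT0,
    repweights = num_range("WT", 1:20),
    type = "bootstrap",
    mse = TRUE,
    scale = .02 # bootstrap scaling constant from the survey documentation
  )
```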
Example
Returning to the APIP example, we are going to create a dataset with bootstrap weights to use as an example. In this example, we construct a one-cluster design with 50 replicate weights.
apiclus1_slim <-
apiclus1 %>%
as_tibble() %>%
arrange(dnum) %>%
select(cds, dnum, fpc, pw)
set.seed(662152)
apibw <-
bootweights(
psu = apiclus1_slim$dnum,
strata = rep(1, nrow(apiclus1_slim)),
fpc = apiclus1_slim$fpc,
replicates = 50
)
bwmata <-
apibw$repweights$weights[apibw$repweights$index, ] * apiclus1_slim$pw
apiclus1_slim <- bwmata %>%
as.data.frame() %>%
set_names(str_c("pw", 1:50)) %>%
cbind(apiclus1_slim) %>%
as_tibble() %>%
select(cds, dnum, fpc, pw, everything())
apiclus1_slim
## # A tibble: 183 × 54
## cds dnum fpc pw pw1 pw2 pw3 pw4 pw5 pw6 pw7
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0
## 2 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0
## 3 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0
## 4 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0
## 5 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0
## 6 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0
## 7 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0
## 8 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0
## 9 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0
## 10 43693776… 61 757 33.8 33.8 0 0 33.8 0 33.8 0
## # ℹ 173 more rows
## # ℹ 43 more variables: pw8 <dbl>, pw9 <dbl>, pw10 <dbl>, pw11 <dbl>,
## # pw12 <dbl>, pw13 <dbl>, pw14 <dbl>, pw15 <dbl>, pw16 <dbl>,
## # pw17 <dbl>, pw18 <dbl>, pw19 <dbl>, pw20 <dbl>, pw21 <dbl>,
## # pw22 <dbl>, pw23 <dbl>, pw24 <dbl>, pw25 <dbl>, pw26 <dbl>,
## # pw27 <dbl>, pw28 <dbl>, pw29 <dbl>, pw30 <dbl>, pw31 <dbl>,
## # pw32 <dbl>, pw33 <dbl>, pw34 <dbl>, pw35 <dbl>, pw36 <dbl>, …
The output of apiclus1_slim includes the same variables we have seen in other APIP examples (see Table 10.1), but now it additionally includes bootstrap weights pw1, …, pw50. When creating the survey design object, we use the bootstrap weights as the replicate weights. Additionally, with replicate weights we need to include the scale (\(\alpha\)). For this example, we created:
\[\alpha=\frac{A}{(A-1)(R-1)}=\frac{15}{(15-1)*(50-1)}=0.02186589\]
where \(A\) is the average number of PSUs per stratum, and \(R\) is the number of replicates. There is only one stratum, and the number of clusters/PSUs is 15, so \(A=15\). Using this information, we specify the design object as:
api1_bs_des <- apiclus1_slim %>%
as_survey_rep(
weights = pw,
repweights = pw1:pw50,
type = "bootstrap",
scale = 0.02186589,
mse = TRUE
)
api1_bs_des
## Call: Called via srvyr
## Survey bootstrap with 50 replicates and MSE variances.
## Sampling variables:
## - repweights: `pw1 + pw2 + pw3 + pw4 + pw5 + pw6 + pw7 + pw8 + pw9 +
## pw10 + pw11 + pw12 + pw13 + pw14 + pw15 + pw16 + pw17 + pw18 + pw19
## + pw20 + pw21 + pw22 + pw23 + pw24 + pw25 + pw26 + pw27 + pw28 +
## pw29 + pw30 + pw31 + pw32 + pw33 + pw34 + pw35 + pw36 + pw37 + pw38
## + pw39 + pw40 + pw41 + pw42 + pw43 + pw44 + pw45 + pw46 + pw47 +
## pw48 + pw49 + pw50`
## - weights: pw
## Data variables:
## - cds (chr), dnum (int), fpc (dbl), pw (dbl), pw1 (dbl), pw2 (dbl),
## pw3 (dbl), pw4 (dbl), pw5 (dbl), pw6 (dbl), pw7 (dbl), pw8 (dbl),
## pw9 (dbl), pw10 (dbl), pw11 (dbl), pw12 (dbl), pw13 (dbl), pw14
## (dbl), pw15 (dbl), pw16 (dbl), pw17 (dbl), pw18 (dbl), pw19 (dbl),
## pw20 (dbl), pw21 (dbl), pw22 (dbl), pw23 (dbl), pw24 (dbl), pw25
## (dbl), pw26 (dbl), pw27 (dbl), pw28 (dbl), pw29 (dbl), pw30 (dbl),
## pw31 (dbl), pw32 (dbl), pw33 (dbl), pw34 (dbl), pw35 (dbl), pw36
## (dbl), pw37 (dbl), pw38 (dbl), pw39 (dbl), pw40 (dbl), pw41 (dbl),
## pw42 (dbl), pw43 (dbl), pw44 (dbl), pw45 (dbl), pw46 (dbl), pw47
## (dbl), pw48 (dbl), pw49 (dbl), pw50 (dbl)
## Variables:
## [1] "cds" "dnum" "fpc" "pw" "pw1" "pw2" "pw3" "pw4" "pw5"
## [10] "pw6" "pw7" "pw8" "pw9" "pw10" "pw11" "pw12" "pw13" "pw14"
## [19] "pw15" "pw16" "pw17" "pw18" "pw19" "pw20" "pw21" "pw22" "pw23"
## [28] "pw24" "pw25" "pw26" "pw27" "pw28" "pw29" "pw30" "pw31" "pw32"
## [37] "pw33" "pw34" "pw35" "pw36" "pw37" "pw38" "pw39" "pw40" "pw41"
## [46] "pw42" "pw43" "pw44" "pw45" "pw46" "pw47" "pw48" "pw49" "pw50"
As with other replicate design objects, when printing the object or looking at the summary, the replicate weights are provided along with the data variables.
10.5 Exercises
For this chapter, the exercises entail reading public documentation to determine how to specify the survey design. While reading the documentation, be on the lookout for descriptions of the weights and the survey design variables or replicate weights.
1. The National Health Interview Survey (NHIS) is an annual household survey conducted by the National Center for Health Statistics (NCHS). The NHIS includes a wide variety of health topics for adults, including health status and conditions, functioning and disability, health care access and health service utilization, health-related behaviors, health promotion, mental health, barriers to receiving care, and community engagement. Like many national in-person surveys, the sampling design is a stratified clustered design, with details included in the Survey Description (National Center for Health Statistics 2023). The Survey Description provides information on setting up syntax in SUDAAN, Stata, SPSS, SAS, and R ({survey} package implementation). We have imported the data and the variable containing the data as nhis_adult_data. How would we specify the design using either as_survey_design() or as_survey_rep()?
2. The General Social Survey (GSS) is a survey that has been administered since 1972 on social, behavioral, and attitudinal topics. The 2016-2020 GSS Panel codebook provides examples of setting up syntax in SAS and Stata but not R (Davern et al. 2021). We have imported the data and the variable containing the data as gss_data. How would we specify the design in R using either as_survey_design() or as_survey_rep()?