Chapter 5 Descriptive analyses
For this chapter, load the following packages:
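A minimal sketch of the packages the examples rely on:

library(tidyverse) # data wrangling and piping
library(survey)    # survey analysis engine underlying {srvyr}
library(srvyr)     # tidy survey estimation functions used throughout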
We are using data from ANES and RECS described in Chapter 4. As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).
targetpop <- 231592693

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )
For RECS, details are included in the RECS documentation and Chapters 4 and 10.
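As a sketch, assuming the 60 jackknife replicate weights (NWEIGHT1 through NWEIGHT60) described in the RECS documentation, the design object can be created as follows:

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60, # jackknife scaling factor per the RECS documentation
    mse = TRUE
  )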
5.1 Introduction
Descriptive analyses, such as basic counts, cross-tabulations, or means, are among the first steps in making sense of our survey results. During descriptive analyses, we calculate point estimates of unknown population parameters, such as the population mean, and uncertainty estimates, such as confidence intervals. By reviewing the findings, we can glean insight into the data, the underlying population, and any unique aspects of the data or population. For example, if only 10% of survey respondents are male, it could indicate a unique population, a potential error or bias, an intentional survey sampling method, or other factors. Additionally, descriptive analyses provide summaries of distribution and other measures. These analyses lay the groundwork for the next steps of running statistical tests or developing models.
We discuss many different types of descriptive analyses in this chapter. However, it is important to know what type of data we are working with and which statistics are appropriate. In survey data, we typically consider data as one of four main types:
- Categorical/nominal data: variables with levels or descriptions that cannot be ordered, such as the region of the country (North, South, East, and West)
- Ordinal data: variables that can be ordered, such as those from a Likert scale (strongly disagree, disagree, agree, and strongly agree)
- Discrete data: variables that are counted or measured, such as number of children
- Continuous data: variables that are measured and whose values can lie anywhere on an interval, such as income
This chapter discusses how to analyze measures of distribution (e.g., cross-tabulations), central tendency (e.g., means), relationship (e.g., ratios), and dispersion (e.g., standard deviation) using functions from the {srvyr} package (Freedman Ellis and Schneider 2024).
Measures of distribution describe how often an event or response occurs. These measures include counts and totals. We cover the following functions:
- Count of observations (survey_count() and survey_tally())
- Summation of variables (survey_total())
Measures of central tendency find the central (or average) responses. These measures include means and medians. We cover the following functions:
- Means and proportions (survey_mean() and survey_prop())
- Quantiles and medians (survey_quantile() and survey_median())
Measures of relationship describe how variables relate to each other. These measures include correlations and ratios. We cover the following functions:
- Correlations (survey_corr())
- Ratios (survey_ratio())
Measures of dispersion describe how data spread around the central tendency for continuous variables. These measures include standard deviations and variances. We cover the following functions:
- Variances and standard deviations (survey_var() and survey_sd())
To incorporate each of these survey functions, recall the general process for survey estimation from Chapter 4:
1. Create a tbl_svy object using srvyr::as_survey_design() or srvyr::as_survey_rep().
2. Subset the data for subpopulations using srvyr::filter(), if needed.
3. Specify domains of analysis using srvyr::group_by(), if needed.
4. Analyze the data with survey-specific functions.
This chapter walks through how to apply the survey functions in Step 4. Note that unless otherwise specified, our estimates are weighted as a result of setting up the survey design object.
To look at the data by different subgroups, we can choose to filter and/or group the data. It is very important that we filter and group the data only after creating the design object. This ensures that the results accurately reflect the survey design. If we filter or group data before creating the survey design object, the data for those cases are not included in the survey design information and estimations of the variance, leading to inaccurate results.
For the sake of simplicity, we’ve removed cases with missing values in the examples below. For a more detailed explanation of how to handle missing data, please refer to Chapter 11.
5.2 Counts and cross-tabulations
Using survey_count() and survey_tally(), we can calculate the estimated population counts for a given variable or combination of variables. These summaries, often referred to as cross-tabulations or cross-tabs, are applied to categorical data. They help in estimating counts of the population size for different groups based on the survey data.
5.2.1 Syntax
The syntax for survey_count() is similar to the dplyr::count() syntax, as mentioned in Chapter 4. However, as noted above, this function can only be called on tbl_svy objects. Let’s explore the syntax:
survey_count(
  x,
  ...,
  wt = NULL,
  sort = FALSE,
  name = "n",
  .drop = dplyr::group_by_drop_default(x),
  vartype = c("se", "ci", "var", "cv")
)
The arguments are:

- x: a tbl_svy object created by as_survey
- ...: variables to group by, passed to group_by
- wt: a variable to weight on in addition to the survey weights, defaults to NULL
- sort: how to sort the variables, defaults to FALSE
- name: the name of the count variable, defaults to n
- .drop: whether to drop empty groups
- vartype: type(s) of variation estimate to calculate, including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see the discussion of variability types below)
To generate a count or cross-tabs by different variables, we include them in the (...) argument. This argument can take any number of variables and breaks down the counts by all combinations of the provided variables. This is similar to dplyr::count(). To obtain an estimate of the overall population, we can exclude any variables from the (...) argument or use the survey_tally() function. While the survey_tally() function has a similar syntax to the survey_count() function, it does not include the (...) or the .drop arguments:
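# survey_tally() signature, per the {srvyr} documentation
survey_tally(
  x,
  wt,
  sort = FALSE,
  name = "n",
  vartype = c("se", "ci", "var", "cv")
)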
Both functions include the vartype argument with four different values:

- se: standard error
  - The estimated standard deviation of the estimate
  - Output has a column with the variable name specified in the name argument with a suffix of "_se"
- ci: confidence interval
  - The lower and upper limits of a confidence interval
  - Output has two columns with the variable name specified in the name argument with suffixes of "_low" and "_upp"
  - By default, this is a 95% confidence interval, but it can be changed by using the level argument and specifying a number between 0 and 1. For example, level = 0.8 would produce an 80% confidence interval.
- var: variance
  - The estimated variance of the estimate
  - Output has a column with the variable name specified in the name argument with a suffix of "_var"
- cv: coefficient of variation
  - A ratio of the standard error and the estimate
  - Output has a column with the variable name specified in the name argument with a suffix of "_cv"
The confidence intervals are always calculated using a symmetric t-distribution based method, given by the formula:
\[ \text{estimate} \pm t^*_{df}\times SE\]
where \(t^*_{df}\) is the critical value from a t-distribution based on the confidence level and the degrees of freedom. By default, the degrees of freedom are based on the design or number of replicates, but they can be specified using the df argument. For survey design objects, the degrees of freedom are calculated as the number of primary sampling units (PSUs or clusters) minus the number of strata (see Chapter 10 for more information on PSUs, strata, and sample designs). For replicate-based objects, the degrees of freedom are calculated as one less than the rank of the matrix of replicate weights, where the number of replicates is typically the rank. Note that specifying df = Inf is equivalent to using a normal (z-based) confidence interval; this is the default in {survey}. These variability types are the same for most of the survey functions, and we provide examples using different variability types throughout this chapter.
5.2.2 Examples
Example 1: Estimated population count
If we want to obtain the estimated number of households in the U.S. (the population of interest) using the Residential Energy Consumption Survey (RECS) data, we can use survey_count(). If we do not specify any variables in the survey_count() function, it outputs the estimated population count (n) and its corresponding standard error (n_se).
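A minimal call produces the output below:

recs_des %>%
  survey_count()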
## # A tibble: 1 × 2
## n n_se
## <dbl> <dbl>
## 1 123529025. 0.148
Based on this calculation, the estimated number of households in the U.S. is 123,529,025.
Alternatively, we could also use the survey_tally() function. The example below yields the same results as survey_count().
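A sketch of the equivalent call:

recs_des %>%
  survey_tally()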
## # A tibble: 1 × 2
## n n_se
## <dbl> <dbl>
## 1 123529025. 0.148
Example 2: Estimated counts by subgroups (cross-tabs)
To calculate the estimated number of observations for specific subgroups, such as Region and Division, we can include the variables of interest in the survey_count() function. In the example below, we calculate the estimated number of housing units by region and division. The name = argument in survey_count() allows us to change the name of the count variable in the output from the default n to N.
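Code along these lines yields the output below:

recs_des %>%
  survey_count(Region, Division, name = "N")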
## # A tibble: 10 × 4
## Region Division N N_se
## <fct> <fct> <dbl> <dbl>
## 1 Northeast New England 5876166 0.0000000137
## 2 Northeast Middle Atlantic 16043503 0.0000000487
## 3 Midwest East North Central 18546912 0.000000437
## 4 Midwest West North Central 8495815 0.0000000177
## 5 South South Atlantic 24843261 0.0000000418
## 6 South East South Central 7380717. 0.114
## 7 South West South Central 14619094 0.000488
## 8 West Mountain North 4615844 0.119
## 9 West Mountain South 4602070 0.0000000492
## 10 West Pacific 18505643. 0.00000295
When we run the cross-tab, we see that there are an estimated 5,876,166 housing units in the New England Division.
The code results in an error if we try to use the survey_count() syntax with survey_tally():
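# fails: a factor cannot be passed directly to survey_tally()
recs_des %>%
  survey_tally(Region, Division, name = "N")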
## Error in `dplyr::summarise()`:
## ℹ In argument: `N = survey_total(Region, vartype = vartype,
## na.rm = TRUE)`.
## Caused by error:
## ! Factor not allowed in survey functions, should be used as a grouping variable.
Use a group_by() function prior to using survey_tally() to successfully run the cross-tab:
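# group first, then tally
recs_des %>%
  group_by(Region, Division) %>%
  survey_tally(name = "N")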
## # A tibble: 10 × 4
## # Groups: Region [4]
## Region Division N N_se
## <fct> <fct> <dbl> <dbl>
## 1 Northeast New England 5876166 0.0000000137
## 2 Northeast Middle Atlantic 16043503 0.0000000487
## 3 Midwest East North Central 18546912 0.000000437
## 4 Midwest West North Central 8495815 0.0000000177
## 5 South South Atlantic 24843261 0.0000000418
## 6 South East South Central 7380717. 0.114
## 7 South West South Central 14619094 0.000488
## 8 West Mountain North 4615844 0.119
## 9 West Mountain South 4602070 0.0000000492
## 10 West Pacific 18505643. 0.00000295
5.3 Totals and sums
The survey_total() function is analogous to sum(). It can be applied to continuous variables to obtain the estimated total quantity in a population. Starting from this point in the chapter, all the introduced functions must be called within summarize().
5.3.1 Syntax
Here is the syntax:
survey_total(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  deff = FALSE,
  df = NULL
)
The arguments are:
- x: a variable, expression, or empty
- na.rm: an indicator of whether missing values should be dropped, defaults to FALSE
- vartype: type(s) of variation estimate to calculate, including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see Section 5.2.1 for more information)
- level: a number or a vector indicating the confidence level, defaults to 0.95
- deff: a logical value stating whether the design effect should be returned, defaults to FALSE (this is described in more detail in Section 5.9.3)
- df: (for vartype = "ci"), a numeric value indicating degrees of freedom for the t-distribution
5.3.2 Examples
Example 1: Estimated population count
To calculate a population count estimate with survey_total(), we leave the argument x empty, as shown in the example below:
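# total with no x argument estimates the population count
recs_des %>%
  summarize(Tot = survey_total())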
## # A tibble: 1 × 2
## Tot Tot_se
## <dbl> <dbl>
## 1 123529025. 0.148
The estimated number of households in the U.S. is 123,529,025. Note that this result obtained from survey_total() is equivalent to the ones from the survey_count() and survey_tally() functions. However, the survey_total() function is called within summarize(), whereas survey_count() and survey_tally() are not.
Example 2: Overall summation of continuous variables
The distinction between survey_total() and survey_count() becomes more evident when working with continuous variables. Let’s compute the total cost of electricity in whole dollars from the variable DOLLAREL.⁴
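A call along these lines yields the output below:

recs_des %>%
  summarize(elec_bill = survey_total(DOLLAREL))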
## # A tibble: 1 × 2
## elec_bill elec_bill_se
## <dbl> <dbl>
## 1 170473527909. 664893504.
It is estimated that American residential households spent a total of $170,473,527,909 on electricity in 2020, and the estimate has a standard error of $664,893,504.
Example 3: Summation by groups
Since we are using the {srvyr} package, we can use group_by() to calculate the cost of electricity for different groups. Let’s examine the variations in the cost of electricity in whole dollars across regions and display the confidence interval instead of the default standard error.
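Requesting vartype = "ci" replaces the default standard error column with confidence limits; a sketch:

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_total(DOLLAREL,
    vartype = "ci"
  ))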
## # A tibble: 4 × 4
## Region elec_bill elec_bill_low elec_bill_upp
## <fct> <dbl> <dbl> <dbl>
## 1 Northeast 29430369947. 28788987554. 30071752341.
## 2 Midwest 34972544751. 34339576041. 35605513460.
## 3 South 72496840204. 71534780902. 73458899506.
## 4 West 33573773008. 32909111702. 34238434313.
The survey results estimate that households in the Northeast spent $29,430,369,947 with a confidence interval of ($28,788,987,554, $30,071,752,341) on electricity in 2020, while households in the South spent an estimated $72,496,840,204 with a confidence interval of ($71,534,780,902, $73,458,899,506).
As we calculate these numbers, we may notice that the confidence interval of the South is larger than those of other regions. This implies that we have less certainty about the true value of electricity spending in the South. A larger confidence interval could be due to a variety of factors, such as a wider range of electricity spending in the South. We could try to analyze smaller regions within the South to identify areas that are contributing to more variability. Descriptive analyses serve as a valuable starting point for more in-depth exploration and analysis.
5.4 Means and proportions
Means and proportions form the foundation of many research studies. These estimates are often the first things we look for when reviewing research on a given topic. The survey_mean() and survey_prop() functions calculate means and proportions while taking into account the survey design elements. The survey_mean() function should be used on continuous variables of survey data, while the survey_prop() function should be used on categorical variables.
5.4.1 Syntax
The syntax for both means and proportions is very similar:
survey_mean(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = FALSE,
  prop_method = c("logit", "likelihood", "asin", "beta", "mean"),
  deff = FALSE,
  df = NULL
)

survey_prop(
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = TRUE,
  prop_method =
    c("logit", "likelihood", "asin", "beta", "mean", "xlogit"),
  deff = FALSE,
  df = NULL
)
Both functions have the following arguments and defaults:
- na.rm: an indicator of whether missing values should be dropped, defaults to FALSE
- vartype: type(s) of variation estimate to calculate, including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see Section 5.2.1 for more information)
- level: a number or a vector indicating the confidence level, defaults to 0.95
- prop_method: method to calculate the confidence interval for proportions
- deff: a logical value stating whether the design effect should be returned, defaults to FALSE (this is described in more detail in Section 5.9.3)
- df: (for vartype = "ci"), a numeric value indicating degrees of freedom for the t-distribution
There are two main differences in the syntax. The survey_mean() function includes the first argument x, representing the variable or expression on which the mean should be calculated. The survey_prop() function does not have an argument to include the variables directly. Instead, prior to summarize(), we must use the group_by() function to specify the variables of interest for survey_prop(). For survey_mean(), including a group_by() function allows us to obtain the means by different groups.
The other main difference is with the proportion argument. The survey_mean() function can be used to calculate both means and proportions. Its proportion argument defaults to FALSE, indicating it is used for calculating means. If we wish to calculate a proportion using survey_mean(), we need to set the proportion argument to TRUE. In the survey_prop() function, the proportion argument defaults to TRUE because the function is specifically designed for calculating proportions.
In Section 5.2.1, we provide an overview of different variability types. The confidence interval used for most measures, such as means and counts, is referred to as a Wald-type interval. However, for proportions, a Wald-type interval with a symmetric t-based confidence interval may not provide accurate coverage, especially when dealing with small sample sizes or proportions “near” 0 or 1. We can use other methods to calculate confidence intervals, which we specify using the prop_method option in survey_prop(). The options include:
- logit: fits a logistic regression model and computes a Wald-type interval on the log-odds scale, which is then transformed to the probability scale. This is the default method.
- likelihood: uses the (Rao-Scott) scaled chi-squared distribution for the log-likelihood from a binomial distribution.
- asin: uses the variance-stabilizing transformation for the binomial distribution, the arcsine square root, and then back-transforms the interval to the probability scale.
- beta: uses the incomplete beta function with an effective sample size based on the estimated variance of the proportion.
- mean: the Wald-type interval (\(\pm t_{df}^*\times SE\)).
- xlogit: uses a logit transformation of the proportion, calculates a Wald-type interval, and then back-transforms to the probability scale. This method is the same as those used by default in SUDAAN and SPSS.
Each option yields slightly different confidence interval bounds when dealing with proportions. Please note that when working with survey_mean(), we do not need to specify a method unless the proportion argument is TRUE. If proportion is FALSE, it calculates a symmetric mean type of confidence interval.
5.4.2 Examples
Example 1: One variable proportion
If we are interested in obtaining the proportion of people in each region in the RECS data, we can use group_by() and survey_prop() as shown below:
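# weighted proportion of housing units in each region
recs_des %>%
  group_by(Region) %>%
  summarize(p = survey_prop())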
## # A tibble: 4 × 3
## Region p p_se
## <fct> <dbl> <dbl>
## 1 Northeast 0.177 0.000000000212
## 2 Midwest 0.219 0.000000000262
## 3 South 0.379 0.000000000740
## 4 West 0.224 0.000000000816
17.7% of the households are in the Northeast, 21.9% are in the Midwest, and so on. Note that the proportions in column p add up to one.

The survey_prop() function is essentially the same as using survey_mean() with a categorical variable and without specifying a numeric variable in the x argument. The following code gives us the same results as above:
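# survey_mean() with no x behaves like survey_prop() within groups
recs_des %>%
  group_by(Region) %>%
  summarize(p = survey_mean())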
## # A tibble: 4 × 3
## Region p p_se
## <fct> <dbl> <dbl>
## 1 Northeast 0.177 0.000000000212
## 2 Midwest 0.219 0.000000000262
## 3 South 0.379 0.000000000740
## 4 West 0.224 0.000000000816
Example 2: Conditional proportions
We can also obtain proportions by more than one variable. In the following example, we look at the proportion of housing units by Region and whether air conditioning (A/C) is used (ACUsed).⁵
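One way to compute these conditional proportions:

recs_des %>%
  group_by(Region, ACUsed) %>%
  summarize(p = survey_prop())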
## # A tibble: 8 × 4
## # Groups: Region [4]
## Region ACUsed p p_se
## <fct> <lgl> <dbl> <dbl>
## 1 Northeast FALSE 0.110 0.00590
## 2 Northeast TRUE 0.890 0.00590
## 3 Midwest FALSE 0.0666 0.00508
## 4 Midwest TRUE 0.933 0.00508
## 5 South FALSE 0.0581 0.00278
## 6 South TRUE 0.942 0.00278
## 7 West FALSE 0.255 0.00759
## 8 West TRUE 0.745 0.00759
When specifying multiple variables, the proportions are conditional. In the results above, notice that the proportions sum to 1 within each region. This can be interpreted as the proportion of housing units with A/C within each region. For example, in the Northeast region, approximately 11.0% of housing units don’t have A/C, while around 89.0% have A/C.
Example 3: Joint proportions
If we’re interested in a joint proportion, we use the interact() function. In the example below, we apply the interact() function to Region and ACUsed:
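# interact() treats Region x ACUsed as a single joint grouping
recs_des %>%
  group_by(interact(Region, ACUsed)) %>%
  summarize(p = survey_prop())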
## # A tibble: 8 × 4
## Region ACUsed p p_se
## <fct> <lgl> <dbl> <dbl>
## 1 Northeast FALSE 0.0196 0.00105
## 2 Northeast TRUE 0.158 0.00105
## 3 Midwest FALSE 0.0146 0.00111
## 4 Midwest TRUE 0.204 0.00111
## 5 South FALSE 0.0220 0.00106
## 6 South TRUE 0.357 0.00106
## 7 West FALSE 0.0573 0.00170
## 8 West TRUE 0.167 0.00170
In this case, all proportions sum to 1, not just within regions. This means that 15.8% of the population lives in the Northeast and has A/C. As noted earlier, we can use both the survey_prop() and survey_mean() functions, and they produce the same results.
Example 4: Overall mean
Below, we calculate the estimated average cost of electricity in the U.S. using survey_mean(). To include both the standard error and the confidence interval, we can include them in the vartype argument:
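# request both the SE and the CI
recs_des %>%
  summarize(elec_bill = survey_mean(DOLLAREL,
    vartype = c("se", "ci")
  ))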
## # A tibble: 1 × 4
## elec_bill elec_bill_se elec_bill_low elec_bill_upp
## <dbl> <dbl> <dbl> <dbl>
## 1 1380. 5.38 1369. 1391.
Nationally, the average household spent $1,380 in 2020.
Example 5: Means by subgroup
We can also calculate the estimated average cost of electricity in the U.S. by each region. To do this, we include a group_by() function with the variable of interest before the summarize() function:
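# average electric bill by region
recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_mean(DOLLAREL))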
## # A tibble: 4 × 3
## Region elec_bill elec_bill_se
## <fct> <dbl> <dbl>
## 1 Northeast 1343. 14.6
## 2 Midwest 1293. 11.7
## 3 South 1548. 10.3
## 4 West 1211. 12.0
Households from the West spent approximately $1,211, while in the South, the average spending was $1,548.
5.5 Quantiles and medians
To better understand the distribution of a continuous variable like income, we can calculate quantiles at specific points. For example, computing estimates of the quartiles (25%, 50%, 75%) helps us understand how income is spread across the population. We use the survey_quantile() function to calculate quantiles in survey data.

Medians are useful for finding the midpoint of a continuous distribution when the data are skewed, as medians are less affected by outliers compared to means. The median is the same as the 50th percentile, meaning the value where 50% of the data are higher and 50% are lower. Because medians are a special, common case of quantiles, we have a dedicated function called survey_median() for calculating the median in survey data. Alternatively, we can use the survey_quantile() function with the quantiles argument set to 0.5 to achieve the same result.
5.5.1 Syntax
The syntax for survey_quantile() and survey_median() is nearly identical:
survey_quantile(
  x,
  quantiles,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  interval_type =
    c("mean", "beta", "xlogit", "asin", "score", "quantile"),
  qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3",
            "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"),
  df = NULL
)

survey_median(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  interval_type =
    c("mean", "beta", "xlogit", "asin", "score", "quantile"),
  qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3",
            "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"),
  df = NULL
)
The arguments available in both functions are:
- x: a variable, expression, or empty
- na.rm: an indicator of whether missing values should be dropped, defaults to FALSE
- vartype: type(s) of variation estimate to calculate, defaults to se (standard error)
- level: a number or a vector indicating the confidence level, defaults to 0.95
- interval_type: method for calculating a confidence interval
- qrule: rule for defining quantiles. The default is the lower end of the quantile interval ("math"). The midpoint of the quantile interval is the "school" rule. "hf1" to "hf9" are weighted analogs to type = 1 to 9 in quantile(). "shahvaish" corresponds to a rule proposed by Shah and Vaish (2006). See vignette("qrule", package = "survey") for more information.
- df: (for vartype = "ci"), a numeric value indicating degrees of freedom for the t-distribution
The only difference between survey_quantile() and survey_median() is the inclusion of the quantiles argument in the survey_quantile() function. This argument takes a vector with values between 0 and 1 to indicate which quantiles to calculate. For example, if we wanted the quartiles of a variable, we would provide quantiles = c(0.25, 0.5, 0.75). While we can specify quantiles of 0 and 1, which represent the minimum and maximum, this is not recommended. It only returns the minimum and maximum of the respondents and cannot be extrapolated to the population, as there is no valid definition of standard error.
In Section 5.2.1, we provide an overview of the different variability types. The interval used in confidence intervals for most measures, such as means and counts, is referred to as a Wald-type interval. However, this is not always the most accurate interval for quantiles. Similar to confidence intervals for proportions, quantiles have various interval types, including asin, beta, mean, and xlogit (see Section 5.4.1). Quantiles also have two more methods available:

- score: the Francisco and Fuller confidence interval based on inverting a score test (only available for design-based survey objects and not replicate-based objects)
- quantile: based on the replicates of the quantile. This is not valid for jackknife-type replicates but is available for bootstrap and BRR replicates.
One note with the score method is that when there are numerous ties in the data, this method may produce confidence intervals that do not contain the estimate. When dealing with a high propensity for ties (e.g., many respondents are the same age), it is recommended to use another method. SUDAAN, for example, uses the score method but adds noise to the values to prevent issues. The documentation in the {survey} package indicates, in general, that the score method may have poorer performance compared to the beta and logit intervals (Lumley 2010).
5.5.2 Examples
Example 1: Overall quartiles
Quantiles provide insights into the distribution of a variable. Let’s look into the quartiles, specifically the first quartile (p=0.25), the median (p=0.5), and the third quartile (p=0.75) of electric bills:
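# quartiles of annual electricity cost
recs_des %>%
  summarize(elec_bill = survey_quantile(DOLLAREL,
    quantiles = c(0.25, 0.5, 0.75)
  ))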
## # A tibble: 1 × 6
## elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se
## <dbl> <dbl> <dbl> <dbl>
## 1 795. 1215. 1770. 5.69
## elec_bill_q50_se elec_bill_q75_se
## <dbl> <dbl>
## 1 6.33 9.99
The output above shows the values for the three quartiles of electric bill costs and their respective standard errors: the 25th percentile is $795 with a standard error of $5.69, the 50th percentile (median) is $1,215 with a standard error of $6.33, and the 75th percentile is $1,770 with a standard error of $9.99.
Example 2: Quartiles by subgroup
We can estimate the quantiles of electric bills by region by using the group_by()
function:
recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_quantile(DOLLAREL,
    quantiles = c(0.25, 0.5, 0.75)
  ))
## # A tibble: 4 × 7
## Region elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Northeast 740. 1148. 1712. 13.7
## 2 Midwest 769. 1149. 1632. 8.88
## 3 South 968. 1402. 1945. 10.6
## 4 West 623. 1028. 1568. 10.8
## elec_bill_q50_se elec_bill_q75_se
## <dbl> <dbl>
## 1 16.6 25.8
## 2 11.6 18.6
## 3 9.17 13.9
## 4 14.3 20.5
The 25th percentile for the Northeast region is $740, while it is $968 for the South.
Example 3: Minimum and maximum
As mentioned in the syntax section, we can specify quantiles of 0 (minimum) and 1 (maximum), and R calculates these values. However, these are only the minimum and maximum values in the data, and there is not enough information to determine their standard errors:
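recs_des %>%
  summarize(elec_bill = survey_quantile(DOLLAREL,
    quantiles = c(0, 1)
  ))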
## # A tibble: 1 × 4
## elec_bill_q00 elec_bill_q100 elec_bill_q00_se elec_bill_q100_se
## <dbl> <dbl> <dbl> <dbl>
## 1 -889. 15680. NaN 0
The minimum cost of electricity in the dataset is –$889, while the maximum is $15,680, but the standard errors are shown as NaN and 0, respectively. Notice that the minimum cost is a negative number. This may be surprising, but some housing units with solar power sell their energy back to the grid and earn money, which is recorded as a negative expenditure.
Example 4: Overall median
We can calculate the estimated median cost of electricity in the U.S. using the survey_median() function:
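recs_des %>%
  summarize(elec_bill = survey_median(DOLLAREL))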
## # A tibble: 1 × 2
## elec_bill elec_bill_se
## <dbl> <dbl>
## 1 1215. 6.33
Nationally, the median household spent $1,215 in 2020. This is the same result as we obtained using the survey_quantile() function. Interestingly, the average electric bill for households that we calculated in Section 5.4 is $1,380, but the estimated median electric bill is $1,215, indicating the distribution is likely right-skewed.
Example 5: Medians by subgroup
We can calculate the estimated median cost of electricity in the U.S. by region using the group_by() function with the variable(s) of interest before the summarize() function, similar to when we found the mean by region:
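recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_median(DOLLAREL))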
## # A tibble: 4 × 3
## Region elec_bill elec_bill_se
## <fct> <dbl> <dbl>
## 1 Northeast 1148. 16.6
## 2 Midwest 1149. 11.6
## 3 South 1402. 9.17
## 4 West 1028. 14.3
We estimate that households in the Northeast spent a median of $1,148 on electricity, and in the South, they spent a median of $1,402.
5.6 Ratios
A ratio is the quotient of the sums of two variables, specifically of the form:
\[ \frac{\sum x_i}{\sum y_i}.\] Note that the ratio is not the same as calculating the following:
\[ \frac{1}{N} \sum \frac{x_i}{y_i} \]
which can be calculated with survey_mean() by creating a derived variable \(z=x/y\) and then calculating the mean of \(z\).
Say we wanted to assess the energy efficiency of homes in a standardized way, where we can compare homes of different sizes. We can calculate the ratio of energy consumption to the square footage of a home. This helps us meaningfully compare homes of different sizes by identifying how much energy is being used per unit of space. To calculate this ratio, we would run survey_ratio(Energy Consumption in BTUs, Square Footage of Home). If, instead, we used survey_mean(Energy Consumption in BTUs / Square Footage of Home), we would estimate the average energy consumption per square foot of all surveyed homes. While helpful in understanding general energy use, this statistic does not account for differences in home sizes.
5.6.1 Syntax
The syntax for survey_ratio() is as follows:
survey_ratio(
  numerator,
  denominator,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  deff = FALSE,
  df = NULL
)
The arguments are:
- numerator: The numerator of the ratio
- denominator: The denominator of the ratio
- na.rm: A logical value to indicate whether missing values should be dropped
- vartype: type(s) of variation estimate to calculate, including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see Section 5.2.1 for more information)
- level: A single number or vector of numbers indicating the confidence level
- deff: A logical value to indicate whether the design effect should be returned (this is described in more detail in Section 5.9.3)
- df: (For vartype = "ci" only) A numeric value indicating the degrees of freedom for the t-distribution
5.6.2 Examples
Example 1: Overall ratios
Suppose we wanted to find the ratio of dollars spent on liquid propane per unit (in British thermal units [Btu]) nationally.⁶ To find the average cost to a household, we can use survey_mean(). However, to find the national unit rate, we can use survey_ratio(). In the following example, we show both methods and discuss the interpretation of each:
recs_des %>%
  summarize(
    DOLLARLP_Tot = survey_total(DOLLARLP, vartype = NULL),
    BTULP_Tot = survey_total(BTULP, vartype = NULL),
    DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP),
    DOL_BTU_Avg = survey_mean(DOLLARLP / BTULP, na.rm = TRUE)
  )
## # A tibble: 1 × 6
## DOLLARLP_Tot BTULP_Tot DOL_BTU_Rat DOL_BTU_Rat_se DOL_BTU_Avg
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8122911173. 391425311586. 0.0208 0.000240 0.0240
## DOL_BTU_Avg_se
## <dbl>
## 1 0.000223
The ratio of the total spent on liquid propane to the total consumption was 0.0208, but the average rate was 0.024. With a bit of calculation, we can show that the ratio is the ratio of the totals: DOLLARLP_Tot/BTULP_Tot = 8,122,911,173/391,425,311,586 = 0.0208. Although the estimated ratio can be calculated manually in this manner, the standard error requires the use of the survey_ratio() function. The average can be interpreted as the average rate paid by a household.
Example 2: Ratios by subgroup
As previously done with other estimates, we can use group_by() to examine whether this ratio varies by region:
recs_des %>%
  group_by(Region) %>%
  summarize(DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP)) %>%
  arrange(DOL_BTU_Rat)
## # A tibble: 4 × 3
## Region DOL_BTU_Rat DOL_BTU_Rat_se
## <fct> <dbl> <dbl>
## 1 Midwest 0.0158 0.000240
## 2 South 0.0245 0.000388
## 3 West 0.0246 0.000875
## 4 Northeast 0.0247 0.000488
Although not a formal statistical test, it appears that the cost ratios for liquid propane are the lowest in the Midwest (0.0158).
5.7 Correlations
The correlation is a measure of the linear relationship between two continuous variables, which ranges between –1 and 1. The most commonly used method is Pearson’s correlation (referred to as correlation henceforth). A sample correlation for a simple random sample is calculated as follows:
\[\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2} \sqrt{\sum(y_i-\bar{y})^2}} \]
When using survey_corr() for designs other than a simple random sample, the weights are applied when estimating the correlation.
5.7.1 Syntax
The syntax for survey_corr() is as follows:
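# survey_corr() signature, per the {srvyr} documentation
survey_corr(
  x,
  y,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  df = NULL
)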
The arguments are:
- x: A variable or expression
- y: A variable or expression
- na.rm: A logical value to indicate whether missing values should be dropped
- vartype: Type(s) of variation estimate to calculate, including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see Section 5.2.1 for more information)
- level: (For vartype = "ci" only) A single number or vector of numbers indicating the confidence level
- df: (For vartype = "ci" only) A numeric value indicating the degrees of freedom for the t-distribution
5.7.2 Examples
Example 1: Overall correlation
We can calculate the correlation between the total square footage of homes (TOTSQFT_EN)⁷ and electricity consumption (BTUEL).⁸
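A call along these lines produces the output below:

recs_des %>%
  summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL))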
## # A tibble: 1 × 2
## SQFT_Elec_Corr SQFT_Elec_Corr_se
## <dbl> <dbl>
## 1 0.417 0.00689
The correlation between the total square footage of homes and electricity consumption is 0.417, indicating a moderate positive relationship.
Example 2: Correlations by subgroup
We can explore the correlation between total square footage and electricity consumption based on subgroups, such as whether A/C is used (ACUsed):
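recs_des %>%
  group_by(ACUsed) %>%
  summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL))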
## # A tibble: 2 × 3
## ACUsed SQFT_Elec_Corr SQFT_Elec_Corr_se
## <lgl> <dbl> <dbl>
## 1 FALSE 0.290 0.0240
## 2 TRUE 0.401 0.00808
For homes without A/C, there is a small positive correlation between total square footage and electricity consumption (0.29). For homes with A/C, the correlation of 0.401 indicates a stronger positive relationship between total square footage and electricity consumption.
5.8 Standard deviation and variance
All survey functions produce an estimate of the variability of a given estimate, so no additional function is needed to obtain the variability of an estimate. However, if we are specifically interested in population variance and standard deviation, we can use the survey_var() and survey_sd() functions. In our experience, it is not common practice to use these functions. They can be used when designing a future study to gauge population variability and inform sampling precision.
5.8.1 Syntax
As with non-survey data, the standard deviation estimate is the square root of the variance estimate. Therefore, the survey_var() and survey_sd() functions share the same arguments, except that the standard deviation does not allow the usage of vartype.
survey_var(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var"),
  level = 0.95,
  df = NULL
)

survey_sd(
  x,
  na.rm = FALSE
)
The arguments are:
- x: A variable or expression, or empty
- na.rm: A logical value to indicate whether missing values should be dropped
- vartype: Type(s) of variation estimate to calculate, including any of c("se", "ci", "var"), defaults to se (standard error) (see Section 5.2.1 for more information)
- level: (For vartype = "ci" only) A single number or vector of numbers indicating the confidence level
- df: (For vartype = "ci" only) A numeric value indicating the degrees of freedom for the t-distribution
5.8.2 Examples
Example 1: Overall variability
Let’s return to electricity bills and explore the variability in electricity expenditure:
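# population variance and standard deviation, matching the columns below
recs_des %>%
  summarize(
    var_elbill = survey_var(DOLLAREL),
    sd_elbill = survey_sd(DOLLAREL)
  )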
## # A tibble: 1 × 3
## var_elbill var_elbill_se sd_elbill
## <dbl> <dbl> <dbl>
## 1 704906. 13926. 840.
We may encounter a warning related to deprecated underlying calculations performed by the survey_var() function. This warning is a result of changes in the way R handles recycling in vectorized operations. The results are still valid. They give an estimate of the population variance of electricity bills (var_elbill), the standard error of that variance (var_elbill_se), and the estimated population standard deviation of electricity bills (sd_elbill). Note that no standard error is associated with the standard deviation; this is the only estimate that does not include a standard error.
Example 2: Variability by subgroup
To find out if the variability in electricity expenditure is similar across regions, we can calculate the variance by region using group_by():
recs_des %>%
  group_by(Region) %>%
  summarize(
    var_elbill = survey_var(DOLLAREL),
    sd_elbill = survey_sd(DOLLAREL)
  )
## # A tibble: 4 × 4
## Region var_elbill var_elbill_se sd_elbill
## <fct> <dbl> <dbl> <dbl>
## 1 Northeast 775450. 38843. 881.
## 2 Midwest 552423. 25252. 743.
## 3 South 702521. 30641. 838.
## 4 West 717886. 30597. 847.
5.9 Additional topics
5.9.1 Unweighted analysis
Sometimes, it is helpful to calculate an unweighted estimate of a given variable. For this, we use the unweighted() function in the summarize() function. The unweighted() function calculates unweighted summaries from a tbl_svy object, providing the summary among the respondents without extrapolating to a population estimate. The unweighted() function can be used in conjunction with any {dplyr} functions. Here is an example looking at the average household electricity cost:
recs_des %>%
  summarize(
    elec_bill = survey_mean(DOLLAREL),
    elec_unweight = unweighted(mean(DOLLAREL))
  )
## # A tibble: 1 × 3
## elec_bill elec_bill_se elec_unweight
## <dbl> <dbl> <dbl>
## 1 1380. 5.38 1425.
It is estimated that American residential households spent an average of $1,380 on electricity in 2020, and the estimate has a standard error of $5.38. The unweighted() function calculates the unweighted average and represents the average amount of money spent on electricity in 2020 by the respondents, which was $1,425.
5.9.2 Subpopulation analysis
We mentioned using filter() to subset a survey object for analysis. This operation should be done after creating the survey design object. Subsetting data before creating the object can lead to incorrect variability estimates if subsetting removes an entire Primary Sampling Unit (PSU; see Chapter 10 for more information on PSUs and sample designs).

Suppose we want estimates of the average amount spent on natural gas among housing units using natural gas (based on the variable BTUNG).⁹ We first filter records to only include records where BTUNG > 0 and then find the average amount spent:
recs_des %>%
  filter(BTUNG > 0) %>%
  summarize(NG_mean = survey_mean(DOLLARNG,
    vartype = c("se", "ci")
  ))
## # A tibble: 1 × 4
## NG_mean NG_mean_se NG_mean_low NG_mean_upp
## <dbl> <dbl> <dbl> <dbl>
## 1 631. 4.64 621. 640.
The estimated average amount spent on natural gas among households that use natural gas is $631. Let’s compare this to the mean when we do not filter:
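# the same call without the filter() step
recs_des %>%
  summarize(NG_mean = survey_mean(DOLLARNG,
    vartype = c("se", "ci")
  ))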
## # A tibble: 1 × 4
## NG_mean NG_mean_se NG_mean_low NG_mean_upp
## <dbl> <dbl> <dbl> <dbl>
## 1 382. 3.41 375. 389.
Based on this calculation, the estimated average amount spent on natural gas is $382. Note that applying the filter to include only housing units that use natural gas yields a higher mean than when not applying the filter. This is because including housing units that do not use natural gas introduces many $0 amounts, impacting the mean calculation.
5.9.3 Design effects
The design effect measures how the precision of an estimate is influenced by the sampling design. In other words, it measures how much more or less statistically efficient the survey design is compared to a simple random sample (SRS). It is computed by taking the ratio of the estimate’s variance under the design at hand to the estimate’s variance under a simple random sample without replacement. A design effect less than 1 indicates that the design is more statistically efficient than an SRS design, which is rare but possible in a stratified sampling design where the outcome correlates with the stratification variable(s). A design effect greater than 1 indicates that the design is less statistically efficient than an SRS design. From a design effect, we can calculate the effective sample size as follows:
\[n_{eff}=\frac{n}{D_{eff}} \]
where \(n\) is the nominal sample size (the number of survey responses) and \(D_{eff}\) is the estimated design effect. We can interpret the effective sample size \(n_{eff}\) as the hypothetical sample size that a survey using an SRS design would need to achieve the same precision as the design at hand. Design effects are specific to each outcome: outcomes that are less clustered in the population have smaller design effects than outcomes that are clustered.
In the {srvyr} package, design effects can be calculated for totals, proportions, means, and ratio estimates by setting the deff argument to TRUE in the corresponding functions. In the example below, we calculate the design effects for the average consumption of electricity (BTUEL), natural gas (BTUNG), liquid propane (BTULP), fuel oil (BTUFO), and wood (BTUWOOD) by setting deff = TRUE:
recs_des %>%
  summarize(across(
    c(BTUEL, BTUNG, BTULP, BTUFO, BTUWOOD),
    ~ survey_mean(.x, deff = TRUE, vartype = NULL)
  )) %>%
  select(ends_with("deff"))
## # A tibble: 1 × 5
## BTUEL_deff BTUNG_deff BTULP_deff BTUFO_deff BTUWOOD_deff
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.597 0.938 1.21 0.720 1.10
For the values less than 1 (BTUEL_deff and BTUFO_deff), the results suggest that the survey design is more efficient than a simple random sample. For the values greater than 1 (BTUNG_deff, BTULP_deff, and BTUWOOD_deff), the results indicate that the survey design is less efficient than a simple random sample.
5.9.4 Creating summary rows
When using group_by() in analysis, the results are returned with a row for each group or combination of groups. Often, we want both breakdowns by group and a summary row for the estimate representing the entire population. For example, we may want the average electricity consumption by region and nationally. The {srvyr} package has the convenient cascade() function, which adds summary rows for the total of a group. It is used instead of summarize() and has similar functionalities along with some additional features.
Syntax
The syntax is as follows:
cascade(
  .data,
  ...,
  .fill = NA,
  .fill_level_top = FALSE,
  .groupings = NULL
)
where the arguments are:
- .data: A tbl_svy object
- ...: Name-value pairs of summary functions (same as the summarize() function)
- .fill: Value to fill in for group summaries (defaults to NA)
- .fill_level_top: When filling factor variables, whether to put the value '.fill' in the first position (defaults to FALSE, placing it at the bottom)
Example
First, let’s look at an example where we calculate the average household electricity cost. Then, we build on it to examine the features of the cascade() function. In the first example below, we calculate the average household energy cost DOLLAREL_mn using survey_mean() without modifying any of the argument defaults in the function:
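# with no grouping, cascade() returns a single overall row
recs_des %>%
  cascade(DOLLAREL_mn = survey_mean(DOLLAREL))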
## # A tibble: 1 × 2
## DOLLAREL_mn DOLLAREL_mn_se
## <dbl> <dbl>
## 1 1380. 5.38
Next, let’s group the results by region by adding group_by() before the cascade() function:
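# adds a summary row for all regions combined
recs_des %>%
  group_by(Region) %>%
  cascade(DOLLAREL_mn = survey_mean(DOLLAREL))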
## # A tibble: 5 × 3
## Region DOLLAREL_mn DOLLAREL_mn_se
## <fct> <dbl> <dbl>
## 1 Northeast 1343. 14.6
## 2 Midwest 1293. 11.7
## 3 South 1548. 10.3
## 4 West 1211. 12.0
## 5 <NA> 1380. 5.38
We can see the estimated average electricity bills by region: $1,343 for the Northeast, $1,548 for the South, and so on. The last row, where Region = NA, is the national average electricity bill, $1,380. However, naming the national “region” as NA is not very informative. We can give it a better name using the .fill argument:
recs_des %>%
  group_by(Region) %>%
  cascade(
    DOLLAREL_mn = survey_mean(DOLLAREL),
    .fill = "National"
  )
## # A tibble: 5 × 3
## Region DOLLAREL_mn DOLLAREL_mn_se
## <fct> <dbl> <dbl>
## 1 Northeast 1343. 14.6
## 2 Midwest 1293. 11.7
## 3 South 1548. 10.3
## 4 West 1211. 12.0
## 5 National 1380. 5.38
We can move the summary row to the first row by adding .fill_level_top = TRUE to cascade():
recs_des %>%
  group_by(Region) %>%
  cascade(
    DOLLAREL_mn = survey_mean(DOLLAREL),
    .fill = "National",
    .fill_level_top = TRUE
  )
## # A tibble: 5 × 3
## Region DOLLAREL_mn DOLLAREL_mn_se
## <fct> <dbl> <dbl>
## 1 National 1380. 5.38
## 2 Northeast 1343. 14.6
## 3 Midwest 1293. 11.7
## 4 South 1548. 10.3
## 5 West 1211. 12.0
While the results remain the same, the table is now easier to interpret.
5.9.5 Calculating estimates for many outcomes
Often, we are interested in a summary statistic across many variables. Useful tools include the across() function in {dplyr}, shown a few times above, and the map() function in {purrr}.

The across() function applies the same function to multiple columns within summarize(). This works well with all functions shown above, except for survey_prop(). In a later example, we tackle summarizing multiple proportions.
Example 1: across()
Suppose we want to calculate the total and average consumption, along with coefficients of variation (CV), for each fuel type. These include the reported consumption of electricity (BTUEL), natural gas (BTUNG), liquid propane (BTULP), fuel oil (BTUFO), and wood (BTUWOOD), as mentioned in the section on design effects. We can take advantage of the fact that these are the only variables that start with “BTU” by selecting them with starts_with("BTU") in the across() function. For each selected column (.x), across() creates a list of two functions to be applied: survey_total() to calculate the total and survey_mean() to calculate the mean, along with their CV (vartype = "cv"). Finally, .unpack = "{outer}.{inner}" specifies that the resulting column names are a concatenation of the variable name, followed by Total or Mean, and then “coef” or “cv.”
consumption_ests <- recs_des %>%
  summarize(across(
    starts_with("BTU"),
    list(
      Total = ~ survey_total(.x, vartype = "cv"),
      Mean = ~ survey_mean(.x, vartype = "cv")
    ),
    .unpack = "{outer}.{inner}"
  ))
consumption_ests
## # A tibble: 1 × 20
## BTUEL_Total.coef BTUEL_Total._cv BTUEL_Mean.coef BTUEL_Mean._cv
## <dbl> <dbl> <dbl> <dbl>
## 1 4453284510065 0.00377 36051. 0.00377
## # ℹ 16 more variables: BTUNG_Total.coef <dbl>, BTUNG_Total._cv <dbl>,
## # BTUNG_Mean.coef <dbl>, BTUNG_Mean._cv <dbl>,
## # BTULP_Total.coef <dbl>, BTULP_Total._cv <dbl>,
## # BTULP_Mean.coef <dbl>, BTULP_Mean._cv <dbl>,
## # BTUFO_Total.coef <dbl>, BTUFO_Total._cv <dbl>,
## # BTUFO_Mean.coef <dbl>, BTUFO_Mean._cv <dbl>,
## # BTUWOOD_Total.coef <dbl>, BTUWOOD_Total._cv <dbl>, …
The estimated total consumption of electricity (BTUEL) is 4,453,284,510,065 (BTUEL_Total.coef), the estimated average consumption is 36,051 (BTUEL_Mean.coef), and the CV is 0.0038.
In the example above, the table was quite wide. We may prefer a row for each fuel type. Using the pivot_longer() and pivot_wider() functions from {tidyr} can help us achieve this. First, we use pivot_longer() to make each variable a column, changing the data to a “long” format. We use the names_to argument to specify new column names: FuelType, Stat, and Type. Then, the names_pattern argument extracts the names in the original column names based on the regular expression pattern BTU(.*)_(.*)\\.(.*). They are saved in the columns defined in names_to.
consumption_ests_long <- consumption_ests %>%
  pivot_longer(
    cols = everything(),
    names_to = c("FuelType", "Stat", "Type"),
    names_pattern = "BTU(.*)_(.*)\\.(.*)"
  )
consumption_ests_long
## # A tibble: 20 × 4
## FuelType Stat Type value
## <chr> <chr> <chr> <dbl>
## 1 EL Total coef 4453284510065
## 2 EL Total _cv 0.00377
## 3 EL Mean coef 36051.
## 4 EL Mean _cv 0.00377
## 5 NG Total coef 4240769382106.
## 6 NG Total _cv 0.00908
## 7 NG Mean coef 34330.
## 8 NG Mean _cv 0.00908
## 9 LP Total coef 391425311586.
## 10 LP Total _cv 0.0380
## 11 LP Mean coef 3169.
## 12 LP Mean _cv 0.0380
## 13 FO Total coef 395699976655.
## 14 FO Total _cv 0.0343
## 15 FO Mean coef 3203.
## 16 FO Mean _cv 0.0343
## 17 WOOD Total coef 345091088404.
## 18 WOOD Total _cv 0.0454
## 19 WOOD Mean coef 2794.
## 20 WOOD Mean _cv 0.0454
Then, we use pivot_wider() to create a table that is nearly ready for publication. Within the function, we can make the names for each element more descriptive and informative by gluing the Stat and Type together with names_glue. Further details on creating publication-ready tables are covered in Chapter 8.
consumption_ests_long %>%
  mutate(Type = case_when(
    Type == "coef" ~ "",
    Type == "_cv" ~ " (CV)"
  )) %>%
  pivot_wider(
    id_cols = FuelType,
    names_from = c(Stat, Type),
    names_glue = "{Stat}{Type}",
    values_from = value
  )
## # A tibble: 5 × 5
## FuelType Total `Total (CV)` Mean `Mean (CV)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 EL 4453284510065 0.00377 36051. 0.00377
## 2 NG 4240769382106. 0.00908 34330. 0.00908
## 3 LP 391425311586. 0.0380 3169. 0.0380
## 4 FO 395699976655. 0.0343 3203. 0.0343
## 5 WOOD 345091088404. 0.0454 2794. 0.0454
Example 2: Proportions with across()
As mentioned earlier, proportions do not work as well directly with the across() method. If we want the proportion of houses with A/C and the proportion of houses with heating, we require two separate group_by() statements, as shown below:
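# proportion of housing units with and without A/C
recs_des %>%
  group_by(ACUsed) %>%
  summarize(p = survey_prop())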
## # A tibble: 2 × 3
## ACUsed p p_se
## <lgl> <dbl> <dbl>
## 1 FALSE 0.113 0.00306
## 2 TRUE 0.887 0.00306
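And likewise for heating:

recs_des %>%
  group_by(SpaceHeatingUsed) %>%
  summarize(p = survey_prop())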
## # A tibble: 2 × 3
## SpaceHeatingUsed p p_se
## <lgl> <dbl> <dbl>
## 1 FALSE 0.0469 0.00207
## 2 TRUE 0.953 0.00207
We estimate 88.7% of households have A/C and 95.3% have heating.
If we are only interested in the TRUE outcomes, that is, the proportion of households that have A/C and the proportion that have heating, we can simplify the code. Applying survey_mean() to a logical variable is the same as using survey_prop(), as shown below:
cool_heat_tab <- recs_des %>%
  summarize(across(c(ACUsed, SpaceHeatingUsed), ~ survey_mean(.x),
    .unpack = "{outer}.{inner}"
  ))
cool_heat_tab
## # A tibble: 1 × 4
## ACUsed.coef ACUsed._se SpaceHeatingUsed.coef SpaceHeatingUsed._se
## <dbl> <dbl> <dbl> <dbl>
## 1 0.887 0.00306 0.953 0.00207
Note that the estimates are the same as those obtained using the separate group_by() statements. As before, we can use pivot_longer() to structure the table in a more suitable format for distribution:
cool_heat_tab %>%
  pivot_longer(everything(),
    names_to = c("Comfort", ".value"),
    names_pattern = "(.*)\\.(.*)"
  ) %>%
  rename(
    p = coef,
    se = `_se`
  )
## # A tibble: 2 × 3
## Comfort p se
## <chr> <dbl> <dbl>
## 1 ACUsed 0.887 0.00306
## 2 SpaceHeatingUsed 0.953 0.00207
Example 3: purrr::map()
Loops are a common tool when dealing with repetitive calculations. The {purrr} package provides the map() functions, which, like a loop, allow us to perform the same task across different elements (Wickham and Henry 2023). In our case, we may want to calculate proportions from the same design multiple times. A straightforward approach is to design the calculation for one variable, build a function based on that, and then apply it iteratively for the rest of the variables.
Suppose we want to create a table that shows the proportion of people who express trust in their government (TrustGovernment)¹⁰ as well as those that trust in people (TrustPeople),¹¹ using data from the 2020 ANES.
First, we create a table for a single variable. The table includes the variable name as a column, the response, and the corresponding percentage with its standard error.
anes_des %>%
  drop_na(TrustGovernment) %>%
  group_by(TrustGovernment) %>%
  summarize(p = survey_prop() * 100) %>%
  mutate(Variable = "TrustGovernment") %>%
  rename(Answer = TrustGovernment) %>%
  select(Variable, everything())
## # A tibble: 5 × 4
## Variable Answer p p_se
## <chr> <fct> <dbl> <dbl>
## 1 TrustGovernment Always 1.55 0.204
## 2 TrustGovernment Most of the time 13.2 0.553
## 3 TrustGovernment About half the time 30.9 0.829
## 4 TrustGovernment Some of the time 43.4 0.855
## 5 TrustGovernment Never 11.0 0.566
We estimate that 1.55% of people always trust the government, 13.16% trust the government most of the time, and so on.
Now, we want to use the original series of steps as a template to create a general function, calcps(), that can apply the same steps to other variables. We replace TrustGovernment with an argument for a generic variable, var. Referring to var involves a bit of tidy evaluation, an advanced skill. To learn more, we recommend Wickham (2019).
calcps <- function(var) {
  anes_des %>%
    drop_na(!!sym(var)) %>%
    group_by(!!sym(var)) %>%
    summarize(p = survey_prop() * 100) %>%
    mutate(Variable = var) %>%
    rename(Answer := !!sym(var)) %>%
    select(Variable, everything())
}
We then apply this function to the two variables of interest, TrustGovernment and TrustPeople:
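# first, the government-trust variable
calcps("TrustGovernment")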
## # A tibble: 5 × 4
## Variable Answer p p_se
## <chr> <fct> <dbl> <dbl>
## 1 TrustGovernment Always 1.55 0.204
## 2 TrustGovernment Most of the time 13.2 0.553
## 3 TrustGovernment About half the time 30.9 0.829
## 4 TrustGovernment Some of the time 43.4 0.855
## 5 TrustGovernment Never 11.0 0.566
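And for the second variable:

calcps("TrustPeople")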
## # A tibble: 5 × 4
## Variable Answer p p_se
## <chr> <fct> <dbl> <dbl>
## 1 TrustPeople Always 0.809 0.164
## 2 TrustPeople Most of the time 41.4 0.857
## 3 TrustPeople About half the time 28.2 0.776
## 4 TrustPeople Some of the time 24.5 0.670
## 5 TrustPeople Never 5.05 0.422
Finally, we use map() to iterate over as many variables as needed. We feed our desired variables into map() along with our custom function, calcps. The output is a tibble with the variable names in the “Variable” column, the responses in the “Answer” column, along with the percentage and standard error. The list_rbind() function combines the rows into a single tibble. This example extends nicely when dealing with numerous variables for which we want percentage estimates:
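# iterate calcps() over the variables and stack the results
c("TrustGovernment", "TrustPeople") %>%
  map(calcps) %>%
  list_rbind()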
## # A tibble: 10 × 4
## Variable Answer p p_se
## <chr> <fct> <dbl> <dbl>
## 1 TrustGovernment Always 1.55 0.204
## 2 TrustGovernment Most of the time 13.2 0.553
## 3 TrustGovernment About half the time 30.9 0.829
## 4 TrustGovernment Some of the time 43.4 0.855
## 5 TrustGovernment Never 11.0 0.566
## 6 TrustPeople Always 0.809 0.164
## 7 TrustPeople Most of the time 41.4 0.857
## 8 TrustPeople About half the time 28.2 0.776
## 9 TrustPeople Some of the time 24.5 0.670
## 10 TrustPeople Never 5.05 0.422
In addition to our results above, we can also see the output for TrustPeople. While we estimate that 1.55% of people always trust the government, 0.81% always trust people.
5.10 Exercises
The exercises use the design objects anes_des and recs_des provided in the Prerequisites box at the beginning of the chapter.
1. How many females have a graduate degree? Hint: The variables Gender and Education will be useful.
2. What percentage of people identify as “Strong Democrat”? Hint: The variable PartyID indicates someone’s party affiliation.
3. What percentage of people who voted in the 2020 election identify as “Strong Republican”? Hint: The variable VotedPres2020 indicates whether someone voted in 2020.
4. What percentage of people voted in both the 2016 election and the 2020 election? Include the logit confidence interval. Hint: The variable VotedPres2016 indicates whether someone voted in 2016.
5. What is the design effect for the proportion of people who voted early? Hint: The variable EarlyVote2020 indicates whether someone voted early in 2020.
6. What is the median temperature people set their thermostats to at night during the winter? Hint: The variable WinterTempNight indicates the temperature that people set their thermostat to in the winter at night.
7. People sometimes set their temperature differently over different seasons and during the day. What median temperatures do people set their thermostats to in the summer and winter, both during the day and at night? Include confidence intervals. Hint: Use the variables WinterTempDay, WinterTempNight, SummerTempDay, and SummerTempNight.
8. What is the correlation between the temperature that people set their thermostats to at night and during the day in the summer?
9. What is the 1st, 2nd, and 3rd quartile of money spent on energy by Building America (BA) climate zone? Hint: TOTALDOL indicates the total amount spent on all fuel, and ClimateRegion_BA indicates the BA climate zones.
Footnotes

4. RECS has two components: a household survey and an energy supplier survey. For each household that responds, their energy providers are contacted to obtain their energy consumption and expenditure. This value reflects the dollars spent on electricity in 2020, according to the energy supplier. See U.S. Energy Information Administration (2023a) for more details.
5. Question text: “Is any air conditioning equipment used in your home?” (U.S. Energy Information Administration 2020)
6. The value of DOLLARLP reflects the annualized amount spent on liquid propane and BTULP reflects the annualized consumption in Btu of liquid propane (U.S. Energy Information Administration 2020).
7. Question text: “What is the square footage of your home?” (U.S. Energy Information Administration 2020)
8. BTUEL is derived from the supplier side component of the survey, where BTUEL represents the electricity consumption in British thermal units (Btus) converted from kilowatt hours (kWh) in a year (U.S. Energy Information Administration 2020).
9. BTUNG is derived from the supplier side component of the survey, where BTUNG represents the natural gas consumption in British thermal units (Btus) in a year (U.S. Energy Information Administration 2020).
10. Question text: “How often can you trust the federal government in Washington to do what is right? (Always, most of the time, about half the time, some of the time, or never)” (American National Election Studies 2021)
11. Question text: “Generally speaking, how often can you trust other people? (Always, most of the time, about half the time, some of the time, or never)” (American National Election Studies 2021)