Chapter 5 Descriptive analyses

Prerequisites

For this chapter, load the following packages:

library(tidyverse)
library(srvyr)
library(srvyrexploR)
library(broom)

We are using data from ANES and RECS described in Chapter 4. As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).

targetpop <- 231592693

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )

For RECS, details are included in the RECS documentation and Chapters 4 and 10.

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )

5.1 Introduction

Descriptive analyses, such as basic counts, cross-tabulations, or means, are among the first steps in making sense of our survey results. During descriptive analyses, we calculate point estimates of unknown population parameters, such as population mean, and uncertainty estimates, such as confidence intervals. By reviewing the findings, we can glean insight into the data, the underlying population, and any unique aspects of the data or population. For example, if only 10% of survey respondents are male, it could indicate a unique population, a potential error or bias, an intentional survey sampling method, or other factors. Additionally, descriptive analyses provide summaries of distribution and other measures. These analyses lay the groundwork for the next steps of running statistical tests or developing models.

We discuss many different types of descriptive analyses in this chapter. However, it is important to know what type of data we are working with and which statistics are appropriate. In survey data, we typically consider data as one of four main types:

  • Categorical/nominal data: variables with levels or descriptions that cannot be ordered, such as the region of the country (North, South, East, and West)
  • Ordinal data: variables that can be ordered, such as those from a Likert scale (strongly disagree, disagree, agree, and strongly agree)
  • Discrete data: variables that are counted or measured, such as number of children
  • Continuous data: variables that are measured and whose values can lie anywhere on an interval, such as income

This chapter discusses how to analyze measures of distribution (e.g., cross-tabulations), central tendency (e.g., means), relationship (e.g., ratios), and dispersion (e.g., standard deviation) using functions from the {srvyr} package (Freedman Ellis and Schneider 2024).

Measures of distribution describe how often an event or response occurs. These measures include counts and totals. We cover the following functions:

  • Count of observations (survey_count() and survey_tally())
  • Summation of variables (survey_total())

Measures of central tendency find the central (or average) responses. These measures include means and medians. We cover the following functions:

  • Means and proportions (survey_mean() and survey_prop())
  • Quantiles and medians (survey_quantile() and survey_median())

Measures of relationship describe how variables relate to each other. These measures include correlations and ratios. We cover the following functions:

  • Correlations (survey_corr())
  • Ratios (survey_ratio())

Measures of dispersion describe how data spread around the central tendency for continuous variables. These measures include standard deviations and variances. We cover the following functions:

  • Variances and standard deviations (survey_var() and survey_sd())

To incorporate each of these survey functions, recall the general process for survey estimation from Chapter 4:

  1. Create a tbl_svy object using srvyr::as_survey_design() or srvyr::as_survey_rep().
  2. Subset the data for subpopulations using srvyr::filter(), if needed.
  3. Specify domains of analysis using srvyr::group_by(), if needed.
  4. Analyze the data with survey-specific functions.

This chapter walks through how to apply the survey functions in Step 4. Note that unless otherwise specified, our estimates are weighted as a result of setting up the survey design object.

To look at the data by different subgroups, we can choose to filter and/or group the data. It is very important that we filter and group the data only after creating the design object. This ensures that the results accurately reflect the survey design. If we filter or group data before creating the survey design object, the data for those cases are not included in the survey design information and estimations of the variance, leading to inaccurate results.

For the sake of simplicity, we’ve removed cases with missing values in the examples below. For a more detailed explanation of how to handle missing data, please refer to Chapter 11.

5.2 Counts and cross-tabulations

Using survey_count() and survey_tally(), we can calculate the estimated population counts for a given variable or combination of variables. These summaries, often referred to as cross-tabulations or cross-tabs, are applied to categorical data. They help in estimating counts of the population size for different groups based on the survey data.

5.2.1 Syntax

The syntax for survey_count() is similar to the dplyr::count() syntax, as mentioned in Chapter 4. However, as noted above, this function can only be called on tbl_svy objects. Let’s explore the syntax:

survey_count(
  x,
  ...,
  wt = NULL,
  sort = FALSE,
  name = "n",
  .drop = dplyr::group_by_drop_default(x),
  vartype = c("se", "ci", "var", "cv")
  )

The arguments are:

  • x: a tbl_svy object created by as_survey
  • ...: variables to group by, passed to group_by
  • wt: a variable to weight on in addition to the survey weights, defaults to NULL
  • sort: how to sort the variables, defaults to FALSE
  • name: the name of the count variable, defaults to n
  • .drop: whether to drop empty groups
  • vartype: type(s) of variation estimate to calculate including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see Section 5.2.1 for more information)

To generate a count or cross-tabs by different variables, we include them in the (...) argument. This argument can take any number of variables and breaks down the counts by all combinations of the provided variables. This is similar to dplyr::count(). To obtain an estimate of the overall population, we can exclude any variables from the (...) argument or use the survey_tally() function. While the survey_tally() function has a similar syntax to the survey_count() function, it does not include the (...) or the .drop arguments:

survey_tally(
  x,
  wt,
  sort = FALSE,
  name = "n",
  vartype = c("se", "ci", "var", "cv")
)

Both functions include the vartype argument with four different values:

  • se: standard error
    • The estimated standard deviation of the estimate
    • Output has a column with the variable name specified in the name argument with a suffix of “_se”
  • ci: confidence interval
    • The lower and upper limits of a confidence interval
    • Output has two columns with the variable name specified in the name argument with a suffix of “_low” and “_upp”
    • By default, this is a 95% confidence interval but can be changed by using the argument level and specifying a number between 0 and 1. For example, level=0.8 would produce an 80% confidence interval.
  • var: variance
    • The estimated variance of the estimate
    • Output has a column with the variable name specified in the name argument with a suffix of “_var”
  • cv: coefficient of variation
    • A ratio of the standard error and the estimate
    • Output has a column with the variable name specified in the name argument with a suffix of “_cv”

The confidence intervals are always calculated using a symmetric t-distribution based method, given by the formula:

\[ \text{estimate} \pm t^*_{df}\times SE\]

where \(t^*_{df}\) is the critical value from a t-distribution based on the confidence level and the degrees of freedom. By default, the degrees of freedom are based on the design or number of replicates, but they can be specified using the df argument. For survey design objects, the degrees of freedom are calculated as the number of primary sampling units (PSUs or clusters) minus the number of strata (see Chapter 10 for more information on PSUs, strata, and sample designs). For replicate-based objects, the degrees of freedom are calculated as one less than the rank of the matrix of replicate weight, where the number of replicates is typically the rank. Note that specifying df = Inf is equivalent to using a normal (z-based) confidence interval – this is the default in {survey}. These variability types are the same for most of the survey functions, and we provide examples using different variability types throughout this chapter.

5.2.2 Examples

Example 1: Estimated population count

If we want to obtain the estimated number of households in the U.S. (the population of interest) using the Residential Energy Consumption Survey (RECS) data, we can use survey_count(). If we do not specify any variables in the survey_count() function, it outputs the estimated population count (n) and its corresponding standard error (n_se).

recs_des %>%
  survey_count()
## # A tibble: 1 × 2
##            n  n_se
##        <dbl> <dbl>
## 1 123529025. 0.148

Based on this calculation, the estimated number of households in the U.S. is 123,529,025.

Alternatively, we could also use the survey_tally() function. The example below yields the same results as survey_count().

recs_des %>%
  survey_tally()
## # A tibble: 1 × 2
##            n  n_se
##        <dbl> <dbl>
## 1 123529025. 0.148

Example 2: Estimated counts by subgroups (cross-tabs)

To calculate the estimated number of observations for specific subgroups, such as Region and Division, we can include the variables of interest in the survey_count() function. In the example below, we calculate the estimated number of housing units by region and division. The argument name = in survey_count() allows us to change the name of the count variable in the output from the default n to N.

recs_des %>%
  survey_count(Region, Division, name = "N")
## # A tibble: 10 × 4
##    Region    Division                   N         N_se
##    <fct>     <fct>                  <dbl>        <dbl>
##  1 Northeast New England         5876166  0.0000000137
##  2 Northeast Middle Atlantic    16043503  0.0000000487
##  3 Midwest   East North Central 18546912  0.000000437 
##  4 Midwest   West North Central  8495815  0.0000000177
##  5 South     South Atlantic     24843261  0.0000000418
##  6 South     East South Central  7380717. 0.114       
##  7 South     West South Central 14619094  0.000488    
##  8 West      Mountain North      4615844  0.119       
##  9 West      Mountain South      4602070  0.0000000492
## 10 West      Pacific            18505643. 0.00000295

When we run the cross-tab, we see that there are an estimated 5,876,166 housing units in the New England Division.

The code results in an error if we try to use the survey_count() syntax with survey_tally():

recs_des %>%
  survey_tally(Region, Division, name = "N")
## Error in `dplyr::summarise()`:
## ℹ In argument: `N = survey_total(Region, vartype = vartype,
##   na.rm = TRUE)`.
## Caused by error:
## ! Factor not allowed in survey functions, should be used as a grouping variable.

Use a group_by() function prior to using survey_tally() to successfully run the cross-tab:

recs_des %>%
  group_by(Region, Division) %>%
  survey_tally(name = "N")
## # A tibble: 10 × 4
## # Groups:   Region [4]
##    Region    Division                   N         N_se
##    <fct>     <fct>                  <dbl>        <dbl>
##  1 Northeast New England         5876166  0.0000000137
##  2 Northeast Middle Atlantic    16043503  0.0000000487
##  3 Midwest   East North Central 18546912  0.000000437 
##  4 Midwest   West North Central  8495815  0.0000000177
##  5 South     South Atlantic     24843261  0.0000000418
##  6 South     East South Central  7380717. 0.114       
##  7 South     West South Central 14619094  0.000488    
##  8 West      Mountain North      4615844  0.119       
##  9 West      Mountain South      4602070  0.0000000492
## 10 West      Pacific            18505643. 0.00000295

5.3 Totals and sums

The survey_total() function is analogous to sum. It can be applied to continuous variables to obtain the estimated total quantity in a population. Starting from this point in the chapter, all the introduced functions must be called within summarize().

5.3.1 Syntax

Here is the syntax:

survey_total(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  deff = FALSE,
  df = NULL
)

The arguments are:

  • x: a variable, expression, or empty
  • na.rm: an indicator of whether missing values should be dropped, defaults to FALSE
  • vartype: type(s) of variation estimate to calculate including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see Section 5.2.1 for more information)
  • level: a number or a vector indicating the confidence level, defaults to 0.95
  • deff: a logical value stating whether the design effect should be returned, defaults to FALSE (this is described in more detail in Section 5.9.3)
  • df: (for vartype = 'ci'), a numeric value indicating degrees of freedom for the t-distribution

5.3.2 Examples

Example 1: Estimated population count

To calculate a population count estimate with survey_total(), we leave the argument x empty, as shown in the example below:

recs_des %>%
  summarize(Tot = survey_total())
## # A tibble: 1 × 2
##          Tot Tot_se
##        <dbl>  <dbl>
## 1 123529025.  0.148

The estimated number of households in the U.S. is 123,529,025. Note that this result obtained from survey_total() is equivalent to the ones from the survey_count() and survey_tally() functions. However, the survey_total() function is called within summarize(), whereas survey_count() and survey_tally() are not.

Example 2: Overall summation of continuous variables

The distinction between survey_total() and survey_count() becomes more evident when working with continuous variables. Let’s compute the total cost of electricity in whole dollars from variable DOLLAREL4.

recs_des %>%
  summarize(elec_bill = survey_total(DOLLAREL))
## # A tibble: 1 × 2
##       elec_bill elec_bill_se
##           <dbl>        <dbl>
## 1 170473527909.   664893504.

It is estimated that American residential households spent a total of $170,473,527,909 on electricity in 2020, and the estimate has a standard error of $664,893,504.

Example 3: Summation by groups

Since we are using the {srvyr} package, we can use group_by() to calculate the cost of electricity for different groups. Let’s examine the variations in the cost of electricity in whole dollars across regions and display the confidence interval instead of the default standard error.

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_total(DOLLAREL,
    vartype = "ci"
  ))
## # A tibble: 4 × 4
##   Region       elec_bill elec_bill_low elec_bill_upp
##   <fct>            <dbl>         <dbl>         <dbl>
## 1 Northeast 29430369947.  28788987554.  30071752341.
## 2 Midwest   34972544751.  34339576041.  35605513460.
## 3 South     72496840204.  71534780902.  73458899506.
## 4 West      33573773008.  32909111702.  34238434313.

The survey results estimate that households in the Northeast spent $29,430,369,947 with a confidence interval of ($28,788,987,554, $30,071,752,341) on electricity in 2020, while households in the South spent an estimated $72,496,840,204 with a confidence interval of ($71,534,780,902, $73,458,899,506).

As we calculate these numbers, we may notice that the confidence interval of the South is larger than those of other regions. This implies that we have less certainty about the true value of electricity spending in the South. A larger confidence interval could be due to a variety of factors, such as a wider range of electricity spending in the South. We could try to analyze smaller regions within the South to identify areas that are contributing to more variability. Descriptive analyses serve as a valuable starting point for more in-depth exploration and analysis.

5.4 Means and proportions

Means and proportions form the foundation of many research studies. These estimates are often the first things we look for when reviewing research on a given topic. The survey_mean() and survey_prop() functions calculate means and proportions while taking into account the survey design elements. The survey_mean() function should be used on continuous variables of survey data, while the survey_prop() function should be used on categorical variables.

5.4.1 Syntax

The syntax for both means and proportions is very similar:

survey_mean(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = FALSE,
  prop_method = c("logit", "likelihood", "asin", "beta", "mean"),
  deff = FALSE,
  df = NULL
)

survey_prop(
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = TRUE,
  prop_method = 
    c("logit", "likelihood", "asin", "beta", "mean", "xlogit"),
  deff = FALSE,
  df = NULL
)

Both functions have the following arguments and defaults:

  • na.rm: an indicator of whether missing values should be dropped, defaults to FALSE
  • vartype: type(s) of variation estimate to calculate including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see Section 5.2.1 for more information)
  • level: a number or a vector indicating the confidence level, defaults to 0.95
  • prop_method: Method to calculate the confidence interval for confidence intervals
  • deff: a logical value stating whether the design effect should be returned, defaults to FALSE (this is described in more detail in Section 5.9.3)
  • df: (for vartype = 'ci'), a numeric value indicating degrees of freedom for the t-distribution

There are two main differences in the syntax. The survey_mean() function includes the first argument x, representing the variable or expression on which the mean should be calculated. The survey_prop() does not have an argument to include the variables directly. Instead, prior to summarize(), we must use the group_by() function to specify the variables of interest for survey_prop(). For survey_mean(), including a group_by() function allows us to obtain the means by different groups.

The other main difference is with the proportion argument. The survey_mean() function can be used to calculate both means and proportions. Its proportion argument defaults to FALSE, indicating it is used for calculating means. If we wish to calculate a proportion using survey_mean(), we need to set the proportion argument to TRUE. In the survey_prop() function, the proportion argument defaults to TRUE because the function is specifically designed for calculating proportions.

In Section 5.2.1, we provide an overview of different variability types. The confidence interval used for most measures, such as means and counts, is referred to as a Wald-type interval. However, for proportions, a Wald-type interval with a symmetric t-based confidence interval may not provide accurate coverage, especially when dealing with small sample sizes or proportions “near” 0 or 1. We can use other methods to calculate confidence intervals, which we specify using the prop_method option in survey_prop(). The options include:

  • logit: fits a logistic regression model and computes a Wald-type interval on the log-odds scale, which is then transformed to the probability scale. This is the default method.
  • likelihood: uses the (Rao-Scott) scaled chi-squared distribution for the log-likelihood from a binomial distribution.
  • asin: uses the variance-stabilizing transformation for the binomial distribution, the arcsine square root, and then back-transforms the interval to the probability scale.
  • beta: uses the incomplete beta function with an effective sample size based on the estimated variance of the proportion.
  • mean: the Wald-type interval (\(\pm t_{df}^*\times SE\)).
  • xlogit: uses a logit transformation of the proportion, calculates a Wald-type interval, and then back-transforms to the probability scale. This method is the same as those used by default in SUDAAN and SPSS.

Each option yields slightly different confidence interval bounds when dealing with proportions. Please note that when working with survey_mean(), we do not need to specify a method unless the proportion argument is TRUE. If proportion is FALSE, it calculates a symmetric mean type of confidence interval.

5.4.2 Examples

Example 1: One variable proportion

If we are interested in obtaining the proportion of people in each region in the RECS data, we can use group_by() and survey_prop() as shown below:

recs_des %>%
  group_by(Region) %>%
  summarize(p = survey_prop())
## # A tibble: 4 × 3
##   Region        p           p_se
##   <fct>     <dbl>          <dbl>
## 1 Northeast 0.177 0.000000000212
## 2 Midwest   0.219 0.000000000262
## 3 South     0.379 0.000000000740
## 4 West      0.224 0.000000000816

17.7% of the households are in the Northeast, 21.9% are in the Midwest, and so on. Note that the proportions in column p add up to one.

The survey_prop() function is essentially the same as using survey_mean() with a categorical variable and without specifying a numeric variable in the x argument. The following code gives us the same results as above:

recs_des %>%
  group_by(Region) %>%
  summarize(p = survey_mean())
## # A tibble: 4 × 3
##   Region        p           p_se
##   <fct>     <dbl>          <dbl>
## 1 Northeast 0.177 0.000000000212
## 2 Midwest   0.219 0.000000000262
## 3 South     0.379 0.000000000740
## 4 West      0.224 0.000000000816

Example 2: Conditional proportions

We can also obtain proportions by more than one variable. In the following example, we look at the proportion of housing units by Region and whether air conditioning (A/C) is used (ACUsed)5.

recs_des %>%
  group_by(Region, ACUsed) %>%
  summarize(p = survey_prop())
## # A tibble: 8 × 4
## # Groups:   Region [4]
##   Region    ACUsed      p    p_se
##   <fct>     <lgl>   <dbl>   <dbl>
## 1 Northeast FALSE  0.110  0.00590
## 2 Northeast TRUE   0.890  0.00590
## 3 Midwest   FALSE  0.0666 0.00508
## 4 Midwest   TRUE   0.933  0.00508
## 5 South     FALSE  0.0581 0.00278
## 6 South     TRUE   0.942  0.00278
## 7 West      FALSE  0.255  0.00759
## 8 West      TRUE   0.745  0.00759

When specifying multiple variables, the proportions are conditional. In the results above, notice that the proportions sum to 1 within each region. This can be interpreted as the proportion of housing units with A/C within each region. For example, in the Northeast region, approximately 11.0% of housing units don’t have A/C, while around 89.0% have A/C.

Example 3: Joint proportions

If we’re interested in a joint proportion, we use the interact() function. In the example below, we apply the interact() function to Region and ACUsed:

recs_des %>%
  group_by(interact(Region, ACUsed)) %>%
  summarize(p = survey_prop())
## # A tibble: 8 × 4
##   Region    ACUsed      p    p_se
##   <fct>     <lgl>   <dbl>   <dbl>
## 1 Northeast FALSE  0.0196 0.00105
## 2 Northeast TRUE   0.158  0.00105
## 3 Midwest   FALSE  0.0146 0.00111
## 4 Midwest   TRUE   0.204  0.00111
## 5 South     FALSE  0.0220 0.00106
## 6 South     TRUE   0.357  0.00106
## 7 West      FALSE  0.0573 0.00170
## 8 West      TRUE   0.167  0.00170

In this case, all proportions sum to 1, not just within regions. This means that 15.8% of the population lives in the Northeast and has A/C. As noted earlier, we can use both the survey_prop() and survey_mean() functions, and they produce the same results.

Example 4: Overall mean

Below, we calculate the estimated average cost of electricity in the U.S. using survey_mean(). To include both the standard error and the confidence interval, we can include them in the vartype argument:

recs_des %>%
  summarize(elec_bill = survey_mean(DOLLAREL,
    vartype = c("se", "ci")
  ))
## # A tibble: 1 × 4
##   elec_bill elec_bill_se elec_bill_low elec_bill_upp
##       <dbl>        <dbl>         <dbl>         <dbl>
## 1     1380.         5.38         1369.         1391.

Nationally, the average household spent $1,380 in 2020.

Example 5: Means by subgroup

We can also calculate the estimated average cost of electricity in the U.S. by each region. To do this, we include a group_by() function with the variable of interest before the summarize() function:

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_mean(DOLLAREL))
## # A tibble: 4 × 3
##   Region    elec_bill elec_bill_se
##   <fct>         <dbl>        <dbl>
## 1 Northeast     1343.         14.6
## 2 Midwest       1293.         11.7
## 3 South         1548.         10.3
## 4 West          1211.         12.0

Households from the West spent approximately $1,211, while in the South, the average spending was $1,548.

5.5 Quantiles and medians

To better understand the distribution of a continuous variable like income, we can calculate quantiles at specific points. For example, computing estimates of the quartiles (25%, 50%, 75%) helps us understand how income is spread across the population. We use the survey_quantile() function to calculate quantiles in survey data.

Medians are useful for finding the midpoint of a continuous distribution when the data are skewed, as medians are less affected by outliers compared to means. The median is the same as the 50th percentile, meaning the value where 50% of the data are higher and 50% are lower. Because medians are a special, common case of quantiles, we have a dedicated function called survey_median() for calculating the median in survey data. Alternatively, we can use the survey_quantile() function with the quantiles argument set to 0.5 to achieve the same result.

5.5.1 Syntax

The syntax for survey_quantile() and survey_median() are nearly identical:

survey_quantile(
  x,
  quantiles,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  interval_type = 
    c("mean", "beta", "xlogit", "asin", "score", "quantile"),
  qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3", 
            "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"),
  df = NULL
)

survey_median(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  interval_type = 
    c("mean", "beta", "xlogit", "asin", "score", "quantile"),
  qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3", 
            "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"),
  df = NULL
)

The arguments available in both functions are:

  • x: a variable, expression, or empty
  • na.rm: an indicator of whether missing values should be dropped, defaults to FALSE
  • vartype: type(s) of variation estimate to calculate, defaults to se (standard error)
  • level: a number or a vector indicating the confidence level, defaults to 0.95
  • interval_type: method for calculating a confidence interval
  • qrule: rule for defining quantiles. The default is the lower end of the quantile interval (“math”). The midpoint of the quantile interval is the “school” rule. “hf1” to “hf9” are weighted analogs to type=1 to 9 in quantile(). “shahvaish” corresponds to a rule proposed by Shah and Vaish (2006). See vignette("qrule", package="survey") for more information.
  • df: (for vartype = 'ci'), a numeric value indicating degrees of freedom for the t-distribution

The only difference between survey_quantile() and survey_median() is the inclusion of the quantiles argument in the survey_quantile() function. This argument takes a vector with values between 0 and 1 to indicate which quantiles to calculate. For example, if we wanted the quartiles of a variable, we would provide quantiles = c(0.25, 0.5, 0.75). While we can specify quantiles of 0 and 1, which represent the minimum and maximum, this is not recommended. It only returns the minimum and maximum of the respondents and cannot be extrapolated to the population, as there is no valid definition of standard error.

In Section 5.2.1, we provide an overview of the different variability types. The interval used in confidence intervals for most measures, such as means and counts, is referred to as a Wald-type interval. However, this is not always the most accurate interval for quantiles. Similar to confidence intervals for proportions, quantiles have various interval types, including asin, beta, mean, and xlogit (see Section 5.4.1). Quantiles also have two more methods available:

  • score: the Francisco and Fuller confidence interval based on inverting a score test (only available for design-based survey objects and not replicate-based objects)
  • quantile: based on the replicates of the quantile. This is not valid for jackknife-type replicates but is available for bootstrap and BRR replicates.

One note with the score method is that when there are numerous ties in the data, this method may produce confidence intervals that do not contain the estimate. When dealing with a high propensity for ties (e.g., many respondents are the same age), it is recommended to use another method. SUDAAN, for example, uses the score method but adds noise to the values to prevent issues. The documentation in the {survey} package indicates, in general, that the score method may have poorer performance compared to the beta and logit intervals (Lumley 2010).

5.5.2 Examples

Example 1: Overall quartiles

Quantiles provide insights into the distribution of a variable. Let’s look into the quartiles, specifically, the first quartile (p=0.25), the median (p=0.5), and the third quartile (p=0.75) of electric bills.

recs_des %>%
  summarize(elec_bill = survey_quantile(DOLLAREL,
    quantiles = c(0.25, .5, 0.75)
  ))
## # A tibble: 1 × 6
##   elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se
##           <dbl>         <dbl>         <dbl>            <dbl>
## 1          795.         1215.         1770.             5.69
##   elec_bill_q50_se elec_bill_q75_se
##              <dbl>            <dbl>
## 1             6.33             9.99

The output above shows the values for the three quartiles of electric bill costs and their respective standard errors: the 25th percentile is $795 with a standard error of $5.69, the 50th percentile (median) is $1,215 with a standard error of $6.33, and the 75th percentile is $1,770 with a standard error of $9.99.

Example 2: Quartiles by subgroup

We can estimate the quantiles of electric bills by region by using the group_by() function:

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_quantile(DOLLAREL,
    quantiles = c(0.25, .5, 0.75)
  ))
## # A tibble: 4 × 7
##   Region    elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se
##   <fct>             <dbl>         <dbl>         <dbl>            <dbl>
## 1 Northeast          740.         1148.         1712.            13.7 
## 2 Midwest            769.         1149.         1632.             8.88
## 3 South              968.         1402.         1945.            10.6 
## 4 West               623.         1028.         1568.            10.8 
##   elec_bill_q50_se elec_bill_q75_se
##              <dbl>            <dbl>
## 1            16.6              25.8
## 2            11.6              18.6
## 3             9.17             13.9
## 4            14.3              20.5

The 25th percentile for the Northeast region is $740, while it is $968 for the South.

Example 3: Minimum and maximum

As mentioned in the syntax section, we can specify quantiles of 0 (minimum) and 1 (maximum), and R calculates these values. However, these are only the minimum and maximum values in the data, and there is not enough information to determine their standard errors:

recs_des %>%
  summarize(elec_bill = survey_quantile(DOLLAREL,
    quantiles = c(0, 1)
  ))
## # A tibble: 1 × 4
##   elec_bill_q00 elec_bill_q100 elec_bill_q00_se elec_bill_q100_se
##           <dbl>          <dbl>            <dbl>             <dbl>
## 1         -889.         15680.              NaN                 0

The minimum cost of electricity in the dataset is –$889, while the maximum is $15,680, but the standard error is shown as NaN and 0, respectively. Notice that the minimum cost is a negative number. This may be surprising, but some housing units with solar power sell their energy back to the grid and earn money, which is recorded as a negative expenditure.

Example 4: Overall median

We can calculate the estimated median cost of electricity in the U.S. using the survey_median() function:

recs_des %>%
  summarize(elec_bill = survey_median(DOLLAREL))
## # A tibble: 1 × 2
##   elec_bill elec_bill_se
##       <dbl>        <dbl>
## 1     1215.         6.33

Nationally, the median household spent $1,215 in 2020. This is the same result as we obtained using the survey_quantile() function. Interestingly, the average electric bill for households that we calculated in Section 5.4 is $1,380, but the estimated median electric bill is $1,215, indicating the distribution is likely right-skewed.

Example 5: Medians by subgroup

We can calculate the estimated median cost of electricity in the U.S. by region using the group_by() function with the variable(s) of interest before the summarize() function, similar to when we found the mean by region.

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_median(DOLLAREL))
## # A tibble: 4 × 3
##   Region    elec_bill elec_bill_se
##   <fct>         <dbl>        <dbl>
## 1 Northeast     1148.        16.6 
## 2 Midwest       1149.        11.6 
## 3 South         1402.         9.17
## 4 West          1028.        14.3

We estimate that households in the Northeast spent a median of $1,148 on electricity, and in the South, they spent a median of $1,402.

5.6 Ratios

A ratio is a measure of the ratio of the sum of two variables, specifically in the form of:

\[ \frac{\sum x_i}{\sum y_i}.\] Note that the ratio is not the same as calculating the following:

\[ \frac{1}{N} \sum \frac{x_i}{y_i} \] which can be calculated with survey_mean() by creating a derived variable \(z=x/y\) and then calculating the mean of \(z\).

Say we wanted to assess the energy efficiency of homes in a standardized way, where we can compare homes of different sizes. We can calculate the ratio of energy consumption to the square footage of a home. This helps us meaningfully compare homes of different sizes by identifying how much energy is being used per unit of space. To calculate this ratio, we would run survey_ratio(Energy Consumption in BTUs, Square Footage of Home). If, instead, we used survey_mean(Energy Consumption in BTUs/Square Footage of Home), we would estimate the average energy consumption per square foot of all surveyed homes. While helpful in understanding general energy use, this statistic does not account for differences in home sizes.

5.6.1 Syntax

The syntax for survey_ratio() is as follows:

survey_ratio(
  numerator,
  denominator,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  deff = FALSE,
  df = NULL
)

The arguments are:

  • numerator: The numerator of the ratio
  • denominator: The denominator of the ratio
  • na.rm: A logical value to indicate whether missing values should be dropped
  • vartype: type(s) of variation estimate to calculate including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see Section 5.2.1 for more information)
  • level: A single number or vector of numbers indicating the confidence level
  • deff: A logical value to indicate whether the design effect should be returned (this is described in more detail in Section 5.9.3)
  • df: (For vartype = “ci” only) A numeric value indicating the degrees of freedom for t-distribution

5.6.2 Examples

Example 1: Overall ratios

Suppose we wanted to find the ratio of dollars spent on liquid propane per unit (in British thermal unit [Btu]) nationally6. To find the average cost to a household, we can use survey_mean(). However, to find the national unit rate, we can use survey_ratio(). In the following example, we show both methods and discuss the interpretation of each:

recs_des %>%
  summarize(
    DOLLARLP_Tot = survey_total(DOLLARLP, vartype = NULL),
    BTULP_Tot = survey_total(BTULP, vartype = NULL),
    DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP),
    DOL_BTU_Avg = survey_mean(DOLLARLP / BTULP, na.rm = TRUE)
  )
## # A tibble: 1 × 6
##   DOLLARLP_Tot     BTULP_Tot DOL_BTU_Rat DOL_BTU_Rat_se DOL_BTU_Avg
##          <dbl>         <dbl>       <dbl>          <dbl>       <dbl>
## 1  8122911173. 391425311586.      0.0208       0.000240      0.0240
##   DOL_BTU_Avg_se
##            <dbl>
## 1       0.000223

The ratio of the total spent on liquid propane to the total consumption was 0.0208, but the average rate was 0.024. With a bit of calculation, we can show that the ratio is the ratio of the totals DOLLARLP_Tot/BTULP_Tot=8,122,911,173/391,425,311,586=0.0208. Although the estimated ratio can be calculated manually in this manner, the standard error requires the use of the survey_ratio() function. The average can be interpreted as the average rate paid by a household.

Example 2: Ratios by subgroup

As previously done with other estimates, we can use group_by() to examine whether this ratio varies by region.

recs_des %>%
  group_by(Region) %>%
  summarize(DOL_BTU_Rat = survey_ratio(DOLLARLP, BTULP)) %>%
  arrange(DOL_BTU_Rat)
## # A tibble: 4 × 3
##   Region    DOL_BTU_Rat DOL_BTU_Rat_se
##   <fct>           <dbl>          <dbl>
## 1 Midwest        0.0158       0.000240
## 2 South          0.0245       0.000388
## 3 West           0.0246       0.000875
## 4 Northeast      0.0247       0.000488

Although not a formal statistical test, it appears that the cost ratios for liquid propane are the lowest in the Midwest (0.0158).

5.7 Correlations

The correlation is a measure of the linear relationship between two continuous variables, which ranges between –1 and 1. The most commonly used method is Pearson’s correlation (referred to as correlation henceforth). A sample correlation for a simple random sample is calculated as follows:

\[\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2} \sqrt{\sum(y_i-\bar{y})^2}} \]

When using survey_corr() for designs other than a simple random sample, the weights are applied when estimating the correlation.

5.7.1 Syntax

The syntax for survey_corr() is as follows:

survey_corr(
  x,
  y,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  df = NULL
)

The arguments are:

  • x: A variable or expression
  • y: A variable or expression
  • na.rm: A logical value to indicate whether missing values should be dropped
  • vartype: Type(s) of variation estimate to calculate including any of c("se", "ci", "var", "cv"), defaults to se (standard error) (see Section 5.2.1 for more information)
  • level: (For vartype = “ci” only) A single number or vector of numbers indicating the confidence level
  • df: (For vartype = “ci” only) A numeric value indicating the degrees of freedom for t-distribution

5.7.2 Examples

Example 1: Overall correlation

We can calculate the correlation between the total square footage of homes (TOTSQFT_EN)7 and electricity consumption (BTUEL)8.

recs_des %>%
  summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL))
## # A tibble: 1 × 2
##   SQFT_Elec_Corr SQFT_Elec_Corr_se
##            <dbl>             <dbl>
## 1          0.417           0.00689

The correlation between the total square footage of homes and electricity consumption is 0.417, indicating a moderate positive relationship.

Example 2: Correlations by subgroup

We can explore the correlation between total square footage and electricity consumption based on subgroups, such as whether A/C is used (ACUsed).

recs_des %>%
  group_by(ACUsed) %>%
  summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, DOLLAREL))
## # A tibble: 2 × 3
##   ACUsed SQFT_Elec_Corr SQFT_Elec_Corr_se
##   <lgl>           <dbl>             <dbl>
## 1 FALSE           0.290           0.0240 
## 2 TRUE            0.401           0.00808

For homes without A/C, there is a small positive correlation between total square footage with electricity consumption (0.29). For homes with A/C, the correlation of 0.401 indicates a stronger positive correlation between total square footage and electricity consumption.

5.8 Standard deviation and variance

All survey functions produce an estimate of the variability of a given estimate. No additional function is needed when dealing with variable estimates. However, if we are specifically interested in population variance and standard deviation, we can use the survey_var() and survey_sd() functions. In our experience, it is not common practice to use these functions. They can be used when designing a future study to gauge population variability and inform sampling precision.

5.8.1 Syntax

As with non-survey data, the standard deviation estimate is the square root of the variance estimate. Therefore, the survey_var() and survey_sd() functions share the same arguments, except the standard deviation does not allow the usage of vartype.

survey_var(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var"),
  level = 0.95,
  df = NULL
)

survey_sd(
  x, 
  na.rm = FALSE
)

The arguments are:

  • x: A variable or expression, or empty
  • na.rm: A logical value to indicate whether missing values should be dropped
  • vartype: Type(s) of variation estimate to calculate including any of c("se", "ci", "var"), defaults to se (standard error) (see Section 5.2.1 for more information)
  • level: (For vartype = “ci” only) A single number or vector of numbers indicating the confidence level
  • df: (For vartype = “ci” only) A numeric value indicating the degrees of freedom for t-distribution

5.8.2 Examples

Example 1: Overall variability

Let’s return to electricity bills and explore the variability in electricity expenditure.

recs_des %>%
  summarize(
    var_elbill = survey_var(DOLLAREL),
    sd_elbill = survey_sd(DOLLAREL)
  )
## # A tibble: 1 × 3
##   var_elbill var_elbill_se sd_elbill
##        <dbl>         <dbl>     <dbl>
## 1    704906.        13926.      840.

We may encounter a warning related to deprecated underlying calculations performed by the survey_var() function. This warning is a result of changes in the way R handles recycling in vectorized operations. The results are still valid. They give an estimate of the population variance of electricity bills (var_elbill), the standard error of that variance (var_elbill_se), and the estimated population standard deviation of electricity bills (sd_elbill). Note that no standard error is associated with the standard deviation; this is the only estimate that does not include a standard error.

Example 2: Variability by subgroup

To find out if the variability in electricity expenditure is similar across regions, we can calculate the variance by region using group_by():

recs_des %>%
  group_by(Region) %>%
  summarize(
    var_elbill = survey_var(DOLLAREL),
    sd_elbill = survey_sd(DOLLAREL)
  )
## # A tibble: 4 × 4
##   Region    var_elbill var_elbill_se sd_elbill
##   <fct>          <dbl>         <dbl>     <dbl>
## 1 Northeast    775450.        38843.      881.
## 2 Midwest      552423.        25252.      743.
## 3 South        702521.        30641.      838.
## 4 West         717886.        30597.      847.

5.9 Additional topics

5.9.1 Unweighted analysis

Sometimes, it is helpful to calculate an unweighted estimate of a given variable. For this, we use the unweighted() function in the summarize() function. The unweighted() function calculates unweighted summaries from a tbl_svy object, providing the summary among the respondents without extrapolating to a population estimate. The unweighted() function can be used in conjunction with any {dplyr} functions. Here is an example looking at the average household electricity cost:

recs_des %>%
  summarize(
    elec_bill = survey_mean(DOLLAREL),
    elec_unweight = unweighted(mean(DOLLAREL))
  )
## # A tibble: 1 × 3
##   elec_bill elec_bill_se elec_unweight
##       <dbl>        <dbl>         <dbl>
## 1     1380.         5.38         1425.

It is estimated that American residential households spent an average of $1,380 on electricity in 2020, and the estimate has a standard error of $5.38. The unweighted() function calculates the unweighted average and represents the average amount of money spent on electricity in 2020 by the respondents, which was $1,425.

5.9.2 Subpopulation analysis

We mentioned using filter() to subset a survey object for analysis. This operation should be done after creating the survey design object. Subsetting data before creating the object can lead to incorrect variability estimates, if subsetting removes an entire Primary Sampling Unit (PSU; see Chapter 10 for more information on PSUs and sample designs).

Suppose we want estimates of the average amount spent on natural gas among housing units using natural gas (based on the variable BTUNG)9. We first filter records to only include records where BTUNG > 0 and then find the average amount spent.

recs_des %>%
  filter(BTUNG > 0) %>%
  summarize(NG_mean = survey_mean(DOLLARNG,
    vartype = c("se", "ci")
  ))
## # A tibble: 1 × 4
##   NG_mean NG_mean_se NG_mean_low NG_mean_upp
##     <dbl>      <dbl>       <dbl>       <dbl>
## 1    631.       4.64        621.        640.

The estimated average amount spent on natural gas among households that use natural gas is $631. Let’s compare this to the mean when we do not filter.

recs_des %>%
  summarize(NG_mean = survey_mean(DOLLARNG,
    vartype = c("se", "ci")
  ))
## # A tibble: 1 × 4
##   NG_mean NG_mean_se NG_mean_low NG_mean_upp
##     <dbl>      <dbl>       <dbl>       <dbl>
## 1    382.       3.41        375.        389.

Based on this calculation, the estimated average amount spent on natural gas is $382. Note that applying the filter to include only housing units that use natural gas yields a higher mean than when not applying the filter. This is because including housing units that do not use natural gas introduces many $0 amounts, impacting the mean calculation.

5.9.3 Design effects

The design effect measures how the precision of an estimate is influenced by the sampling design. In other words, it measures how much more or less statistically efficient the survey design is compared to a simple random sample (SRS). It is computed by taking the ratio of the estimate’s variance under the design at hand to the estimate’s variance under a simple random sample without replacement. A design effect less than 1 indicates that the design is more statistically efficient than an SRS design, which is rare but possible in a stratified sampling design where the outcome correlates with the stratification variable(s). A design effect greater than 1 indicates that the design is less statistically efficient than an SRS design. From a design effect, we can calculate the effective sample size as follows:

\[n_{eff}=\frac{n}{D_{eff}} \]

where \(n\) is the nominal sample size (the number of survey responses) and \(D_{eff}\) is the estimated design effect. We can interpret the effective sample size \(n_{eff}\) as the hypothetical sample size that a survey using an SRS design would need to achieve the same precision as the design at hand. Design effects specific to each outcome — outcomes that are less clustered in the population have smaller design effects than outcomes that are clustered.

In the {srvyr} package, design effects can be calculated for totals, proportions, means, and ratio estimates by setting the deff argument to TRUE in the corresponding functions. In the example below, we calculate the design effects for the average consumption of electricity (BTUEL), natural gas (BTUNG), liquid propane (BTULP), fuel oil (BTUFO), and wood (BTUWOOD) by setting deff = TRUE:

recs_des %>%
  summarize(across(
    c(BTUEL, BTUNG, BTULP, BTUFO, BTUWOOD),
    ~ survey_mean(.x, deff = TRUE, vartype = NULL)
  )) %>%
  select(ends_with("deff"))
## # A tibble: 1 × 5
##   BTUEL_deff BTUNG_deff BTULP_deff BTUFO_deff BTUWOOD_deff
##        <dbl>      <dbl>      <dbl>      <dbl>        <dbl>
## 1      0.597      0.938       1.21      0.720         1.10

For the values less than 1 (BTUEL_deff and BTUFO_deff), the results suggest that the survey design is more efficient than a simple random sample. For the values greater than 1 (BTUNG_deff, BTULP_deff, and BTUWOOD_deff), the results indicate that the survey design is less efficient than a simple random sample.

5.9.4 Creating summary rows

When using group_by() in analysis, the results are returned with a row for each group or combination of groups. Often, we want both breakdowns by group and a summary row for the estimate representing the entire population. For example, we may want the average electricity consumption by region and nationally. The {srvyr} package has the convenient cascade() function, which adds summary rows for the total of a group. It is used instead of summarize() and has similar functionalities along with some additional features.

Syntax

The syntax is as follows:

cascade(
  .data, 
  ..., 
  .fill = NA, 
  .fill_level_top = FALSE, 
  .groupings = NULL
)

where the arguments are:

  • .data: A tbl_svy object
  • ...: Name-value pairs of summary functions (same as the summarize() function)
  • .fill: Value to fill in for group summaries (defaults to NA)
  • .fill_level_top: When filling factor variables, whether to put the value ‘.fill’ in the first position (defaults to FALSE, placing it in the bottom)

Example

First, let’s look at an example where we calculate the average household electricity cost. Then, we build on it to examine the features of the cascade() function. In the first example below, we calculate the average household energy cost DOLLAREL_mn using survey_mean() without modifying any of the argument defaults in the function:

recs_des %>%
  cascade(DOLLAREL_mn = survey_mean(DOLLAREL))
## # A tibble: 1 × 2
##   DOLLAREL_mn DOLLAREL_mn_se
##         <dbl>          <dbl>
## 1       1380.           5.38

Next, let’s group the results by region by adding group_by() before the cascade() function:

recs_des %>%
  group_by(Region) %>%
  cascade(DOLLAREL_mn = survey_mean(DOLLAREL))
## # A tibble: 5 × 3
##   Region    DOLLAREL_mn DOLLAREL_mn_se
##   <fct>           <dbl>          <dbl>
## 1 Northeast       1343.          14.6 
## 2 Midwest         1293.          11.7 
## 3 South           1548.          10.3 
## 4 West            1211.          12.0 
## 5 <NA>            1380.           5.38

We can see the estimated average electricity bills by region: $1,343 for the Northeast, $1,548 for the South, and so on. The last row, where Region = NA, is the national average electricity bill, $1,380. However, naming the national “region” as NA is not very informative. We can give it a better name using the .fill argument.

recs_des %>%
  group_by(Region) %>%
  cascade(
    DOLLAREL_mn = survey_mean(DOLLAREL),
    .fill = "National"
  )
## # A tibble: 5 × 3
##   Region    DOLLAREL_mn DOLLAREL_mn_se
##   <fct>           <dbl>          <dbl>
## 1 Northeast       1343.          14.6 
## 2 Midwest         1293.          11.7 
## 3 South           1548.          10.3 
## 4 West            1211.          12.0 
## 5 National        1380.           5.38

We can move the summary row to the first row by adding .fill_level_top = TRUE to cascade():

recs_des %>%
  group_by(Region) %>%
  cascade(
    DOLLAREL_mn = survey_mean(DOLLAREL),
    .fill = "National",
    .fill_level_top = TRUE
  )
## # A tibble: 5 × 3
##   Region    DOLLAREL_mn DOLLAREL_mn_se
##   <fct>           <dbl>          <dbl>
## 1 National        1380.           5.38
## 2 Northeast       1343.          14.6 
## 3 Midwest         1293.          11.7 
## 4 South           1548.          10.3 
## 5 West            1211.          12.0

While the results remain the same, the table is now easier to interpret.

5.9.5 Calculating estimates for many outcomes

Often, we are interested in a summary statistic across many variables. Useful tools include the across() function in {dplyr}, shown a few times above, and the map() function in {purrr}.

The across() function applies the same function to multiple columns within summarize(). This works well with all functions shown above, except for survey_prop(). In a later example, we tackle summarizing multiple proportions.

Example 1: across()

Suppose we want to calculate the total and average consumption, along with coefficients of variation (CV), for each fuel type. These include the reported consumption of electricity (BTUEL), natural gas (BTUNG), liquid propane (BTULP), fuel oil (BTUFO), and wood (BTUWOOD), as mentioned in the section on design effects. We can take advantage of the fact that these are the only variables that start with “BTU” by selecting them with starts_with("BTU") in the across() function. For each selected column (.x), across() creates a list of two functions to be applied: survey_total() to calculate the total and survey_mean() to calculate the mean, along with their CV (vartype = "cv"). Finally, .unpack = "{outer}.{inner}" specifies that the resulting column names are a concatenation of the variable name, followed by Total or Mean, and then “coef” or “cv.”

consumption_ests <- recs_des %>%
  summarize(across(
    starts_with("BTU"),
    list(
      Total = ~ survey_total(.x, vartype = "cv"),
      Mean = ~ survey_mean(.x, vartype = "cv")
    ),
    .unpack = "{outer}.{inner}"
  ))

consumption_ests
## # A tibble: 1 × 20
##   BTUEL_Total.coef BTUEL_Total._cv BTUEL_Mean.coef BTUEL_Mean._cv
##              <dbl>           <dbl>           <dbl>          <dbl>
## 1    4453284510065         0.00377          36051.        0.00377
## # ℹ 16 more variables: BTUNG_Total.coef <dbl>, BTUNG_Total._cv <dbl>,
## #   BTUNG_Mean.coef <dbl>, BTUNG_Mean._cv <dbl>,
## #   BTULP_Total.coef <dbl>, BTULP_Total._cv <dbl>,
## #   BTULP_Mean.coef <dbl>, BTULP_Mean._cv <dbl>,
## #   BTUFO_Total.coef <dbl>, BTUFO_Total._cv <dbl>,
## #   BTUFO_Mean.coef <dbl>, BTUFO_Mean._cv <dbl>,
## #   BTUWOOD_Total.coef <dbl>, BTUWOOD_Total._cv <dbl>, …

The estimated total consumption of electricity (BTUEL) is 4,453,284,510,065 (BTUEL_Total.coef), the estimated average consumption is 36,051 (BTUEL_Mean.coef), and the CV is 0.0038.

In the example above, the table was quite wide. We may prefer a row for each fuel type. Using the pivot_longer() and pivot_wider() functions from {tidyr} can help us achieve this. First, we use pivot_longer() to make each variable a column, changing the data to a “long” format. We use the names_to argument to specify new column names: FuelType, Stat, and Type. Then, the names_pattern argument extracts the names in the original column names based on the regular expression pattern BTU(.*)_(.*)\\.(.*). They are saved in the column names defined in names_to.

consumption_ests_long <- consumption_ests %>%
  pivot_longer(
    cols = everything(),
    names_to = c("FuelType", "Stat", "Type"),
    names_pattern = "BTU(.*)_(.*)\\.(.*)"
  )

consumption_ests_long
## # A tibble: 20 × 4
##    FuelType Stat  Type                value
##    <chr>    <chr> <chr>               <dbl>
##  1 EL       Total coef  4453284510065      
##  2 EL       Total _cv               0.00377
##  3 EL       Mean  coef          36051.     
##  4 EL       Mean  _cv               0.00377
##  5 NG       Total coef  4240769382106.     
##  6 NG       Total _cv               0.00908
##  7 NG       Mean  coef          34330.     
##  8 NG       Mean  _cv               0.00908
##  9 LP       Total coef   391425311586.     
## 10 LP       Total _cv               0.0380 
## 11 LP       Mean  coef           3169.     
## 12 LP       Mean  _cv               0.0380 
## 13 FO       Total coef   395699976655.     
## 14 FO       Total _cv               0.0343 
## 15 FO       Mean  coef           3203.     
## 16 FO       Mean  _cv               0.0343 
## 17 WOOD     Total coef   345091088404.     
## 18 WOOD     Total _cv               0.0454 
## 19 WOOD     Mean  coef           2794.     
## 20 WOOD     Mean  _cv               0.0454

Then, we use pivot_wider() to create a table that is nearly ready for publication. Within the function, we can make the names for each element more descriptive and informative by gluing the Stat and Type together with names_glue. Further details on creating publication-ready tables are covered in Chapter 8.

consumption_ests_long %>%
  mutate(Type = case_when(
    Type == "coef" ~ "",
    Type == "_cv" ~ " (CV)"
  )) %>%
  pivot_wider(
    id_cols = FuelType,
    names_from = c(Stat, Type),
    names_glue = "{Stat}{Type}",
    values_from = value
  )
## # A tibble: 5 × 5
##   FuelType          Total `Total (CV)`   Mean `Mean (CV)`
##   <chr>             <dbl>        <dbl>  <dbl>       <dbl>
## 1 EL       4453284510065       0.00377 36051.     0.00377
## 2 NG       4240769382106.      0.00908 34330.     0.00908
## 3 LP        391425311586.      0.0380   3169.     0.0380 
## 4 FO        395699976655.      0.0343   3203.     0.0343 
## 5 WOOD      345091088404.      0.0454   2794.     0.0454

Example 2: Proportions with across()

As mentioned earlier, proportions do not work as well directly with the across() method. If we want the proportion of houses with A/C and the proportion of houses with heating, we require two separate group_by() statements as shown below:

recs_des %>%
  group_by(ACUsed) %>%
  summarize(p = survey_prop())
## # A tibble: 2 × 3
##   ACUsed     p    p_se
##   <lgl>  <dbl>   <dbl>
## 1 FALSE  0.113 0.00306
## 2 TRUE   0.887 0.00306
recs_des %>%
  group_by(SpaceHeatingUsed) %>%
  summarize(p = survey_prop())
## # A tibble: 2 × 3
##   SpaceHeatingUsed      p    p_se
##   <lgl>             <dbl>   <dbl>
## 1 FALSE            0.0469 0.00207
## 2 TRUE             0.953  0.00207

We estimate 88.7% of households have A/C and 95.3% have heating.

If we are only interested in the TRUE outcomes, that is, the proportion of households that have A/C and the proportion that have heating, we can simplify the code. Applying survey_mean() to a logical variable is the same as using survey_prop(), as shown below:

cool_heat_tab <- recs_des %>%
  summarize(across(c(ACUsed, SpaceHeatingUsed), ~ survey_mean(.x),
    .unpack = "{outer}.{inner}"
  ))

cool_heat_tab
## # A tibble: 1 × 4
##   ACUsed.coef ACUsed._se SpaceHeatingUsed.coef SpaceHeatingUsed._se
##         <dbl>      <dbl>                 <dbl>                <dbl>
## 1       0.887    0.00306                 0.953              0.00207

Note that the estimates are the same as those obtained using the separate group_by() statements. As before, we can use pivot_longer() to structure the table in a more suitable format for distribution.

cool_heat_tab %>%
  pivot_longer(everything(),
    names_to = c("Comfort", ".value"),
    names_pattern = "(.*)\\.(.*)"
  ) %>%
  rename(
    p = coef,
    se = `_se`
  )
## # A tibble: 2 × 3
##   Comfort              p      se
##   <chr>            <dbl>   <dbl>
## 1 ACUsed           0.887 0.00306
## 2 SpaceHeatingUsed 0.953 0.00207

Example 3: purrr::map()

Loops are a common tool when dealing with repetitive calculations. The {purrr} package provides the map() functions, which, like a loop, allow us to perform the same task across different elements (Wickham and Henry 2023). In our case, we may want to calculate proportions from the same design multiple times. A straightforward approach is to design the calculation for one variable, build a function based on that, and then apply it iteratively for the rest of the variables.

Suppose we want to create a table that shows the proportion of people who express trust in their government (TrustGovernment)10 as well as those that trust in people (TrustPeople)11 using data from the 2020 ANES.

First, we create a table for a single variable. The table includes the variable name as a column, the response, and the corresponding percentage with its standard error.

anes_des %>%
  drop_na(TrustGovernment) %>%
  group_by(TrustGovernment) %>%
  summarize(p = survey_prop() * 100) %>%
  mutate(Variable = "TrustGovernment") %>%
  rename(Answer = TrustGovernment) %>%
  select(Variable, everything())
## # A tibble: 5 × 4
##   Variable        Answer                  p  p_se
##   <chr>           <fct>               <dbl> <dbl>
## 1 TrustGovernment Always               1.55 0.204
## 2 TrustGovernment Most of the time    13.2  0.553
## 3 TrustGovernment About half the time 30.9  0.829
## 4 TrustGovernment Some of the time    43.4  0.855
## 5 TrustGovernment Never               11.0  0.566

We estimate that 1.55% of people always trust the government, 13.16% trust the government most of the time, and so on.

Now, we want to use the original series of steps as a template to create a general function calcps() that can apply the same steps to other variables. We replace TrustGovernment with an argument for a generic variable, var. Referring to var involves a bit of tidy evaluation, an advanced skill. To learn more, we recommend Wickham (2019).

calcps <- function(var) {
  anes_des %>%
    drop_na(!!sym(var)) %>%
    group_by(!!sym(var)) %>%
    summarize(p = survey_prop() * 100) %>%
    mutate(Variable = var) %>%
    rename(Answer := !!sym(var)) %>%
    select(Variable, everything())
}

We then apply this function to the two variables of interest, TrustGovernment and TrustPeople:

calcps("TrustGovernment")
## # A tibble: 5 × 4
##   Variable        Answer                  p  p_se
##   <chr>           <fct>               <dbl> <dbl>
## 1 TrustGovernment Always               1.55 0.204
## 2 TrustGovernment Most of the time    13.2  0.553
## 3 TrustGovernment About half the time 30.9  0.829
## 4 TrustGovernment Some of the time    43.4  0.855
## 5 TrustGovernment Never               11.0  0.566
calcps("TrustPeople")
## # A tibble: 5 × 4
##   Variable    Answer                   p  p_se
##   <chr>       <fct>                <dbl> <dbl>
## 1 TrustPeople Always               0.809 0.164
## 2 TrustPeople Most of the time    41.4   0.857
## 3 TrustPeople About half the time 28.2   0.776
## 4 TrustPeople Some of the time    24.5   0.670
## 5 TrustPeople Never                5.05  0.422

Finally, we use map() to iterate over as many variables as needed. We feed our desired variables into map() along with our custom function, calcps. The output is a tibble with the variable names in the “Variable” column, the responses in the “Answer” column, along with the percentage and standard error. The list_rbind() function combines the rows into a single tibble. This example extends nicely when dealing with numerous variables for which we want percentage estimates.

c("TrustGovernment", "TrustPeople") %>%
  map(calcps) %>%
  list_rbind()
## # A tibble: 10 × 4
##    Variable        Answer                   p  p_se
##    <chr>           <fct>                <dbl> <dbl>
##  1 TrustGovernment Always               1.55  0.204
##  2 TrustGovernment Most of the time    13.2   0.553
##  3 TrustGovernment About half the time 30.9   0.829
##  4 TrustGovernment Some of the time    43.4   0.855
##  5 TrustGovernment Never               11.0   0.566
##  6 TrustPeople     Always               0.809 0.164
##  7 TrustPeople     Most of the time    41.4   0.857
##  8 TrustPeople     About half the time 28.2   0.776
##  9 TrustPeople     Some of the time    24.5   0.670
## 10 TrustPeople     Never                5.05  0.422

In addition to our results above, we can also see the output for TrustPeople. While we estimate that 1.55% of people always trust the government, 0.81% always trust people.

5.10 Exercises

The exercises use the design objects anes_des and recs_des provided in the Prerequisites box at the beginning of the chapter.

  1. How many females have a graduate degree? Hint: The variables Gender and Education will be useful.

  2. What percentage of people identify as “Strong Democrat”? Hint: The variable PartyID indicates someone’s party affiliation.

  3. What percentage of people who voted in the 2020 election identify as “Strong Republican”? Hint: The variable VotedPres2020 indicates whether someone voted in 2020.

  4. What percentage of people voted in both the 2016 election and the 2020 election? Include the logit confidence interval. Hint: The variable VotedPres2016 indicates whether someone voted in 2016.

  5. What is the design effect for the proportion of people who voted early? Hint: The variable EarlyVote2020 indicates whether someone voted early in 2020.

  6. What is the median temperature people set their thermostats to at night during the winter? Hint: The variable WinterTempNight indicates the temperature that people set their thermostat to in the winter at night.

  7. People sometimes set their temperature differently over different seasons and during the day. What median temperatures do people set their thermostats to in the summer and winter, both during the day and at night? Include confidence intervals. Hint: Use the variables WinterTempDay, WinterTempNight, SummerTempDay, and SummerTempNight.

  8. What is the correlation between the temperature that people set their temperature at during the night and during the day in the summer?

  9. What is the 1st, 2nd, and 3rd quartile of money spent on energy by Building America (BA) climate zone? Hint: TOTALDOL indicates the total amount spent on all fuel, and ClimateRegion_BA indicates the BA climate zones.

References

American National Election Studies. 2021. “ANES 2020 Time Series Study: Pre-Election and Post-Election Survey Questionnaires.” https://electionstudies.org/wp-content/uploads/2021/07/anes_timeseries_2020_questionnaire_20210719.pdf.
Freedman Ellis, Greg, and Ben Schneider. 2024. srvyr: ’dplyr’-Like Syntax for Summary Statistics of Survey Data. http://gdfe.co/srvyr/.
Lumley, Thomas. 2010. Complex Surveys: A Guide to Analysis Using R. John Wiley & Sons.
Shah, Babubhai V, and Akhil K Vaish. 2006. “Confidence Intervals for Quantile Estimation from Complex Survey Data.” In Proceedings of the Section on Survey Research Methods. http://www.asasrms.org/Proceedings/y2006/Files/JSM2006-000749.pdf.
———. 2020. “Residential Energy Consumption Survey (RECS) Form EIA-457A 2020 Household Questionnaire.” https://www.eia.gov/survey/form/eia_457/archive/2020_RECS-457A.pdf.
———. 2023a. 2020 Residential Energy Consumption Survey: Consumption and Expenditures Technical Documentation Summary.” https://www.eia.gov/consumption/residential/data/2020/pdf/2020%20RECS%20CE%20Methodology_Final.pdf.
———. 2019. Advanced R. https://adv-r.hadley.nz/; CRC Press.
Wickham, Hadley, and Lionel Henry. 2023. purrr: Functional Programming Tools. https://purrr.tidyverse.org/.

  1. RECS has two components: a household survey and an energy supplier survey. For each household that responds, their energy providers are contacted to obtain their energy consumption and expenditure. This value reflects the dollars spent on electricity in 2020, according to the energy supplier. See U.S. Energy Information Administration (2023a) for more details.↩︎

  2. Question text: “Is any air conditioning equipment used in your home?” (U.S. Energy Information Administration 2020)↩︎

  3. The value of DOLLARLP reflects the annualized amount spent on liquid propane and BTULP reflects the annualized consumption in Btu of liquid propane (U.S. Energy Information Administration 2020).↩︎

  4. Question text: “What is the square footage of your home?” (U.S. Energy Information Administration 2020)↩︎

  5. BTUEL is derived from the supplier side component of the survey where BTUEL represents the electricity consumption in British thermal units (Btus) converted from kilowatt hours (kWh) in a year (U.S. Energy Information Administration 2020).↩︎

  6. BTUNG is derived from the supplier side component of the survey where BTUNG represents the natural gas consumption in British thermal units (Btus) in a year (U.S. Energy Information Administration 2020).↩︎

  7. Question text: “How often can you trust the federal government in Washington to do what is right? (Always, most of the time, about half the time, some of the time, or never)” (American National Election Studies 2021)↩︎

  8. Question text: “Generally speaking, how often can you trust other people? (Always, most of the time, about half the time, some of the time, or never)” (American National Election Studies 2021)↩︎