Descriptive Analysis

Introduction

Descriptive analyses lay the groundwork for the next steps of running statistical tests or developing models.

Calculate…

  • Point estimates of unknown population parameters, such as the mean
  • Uncertainty estimates, such as confidence intervals

Types of data

  • Categorical/nominal data: variables with levels or descriptions that cannot be ordered, such as the region of the country (North, South, East, and West)
  • Ordinal data: variables that can be ordered, such as those from a Likert scale (strongly disagree, disagree, agree, and strongly agree)
  • Discrete data: variables that are counted or measured, such as number of children
  • Continuous data: variables that are measured and whose values can lie anywhere on an interval, such as income
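
As a rough illustration of how these four types are commonly represented in R, the sketch below uses made-up values. The object names (region, agreement, n_children, income) are hypothetical and are not taken from the ANES or RECS data.

```r
# Hypothetical toy values; none of these come from ANES or RECS.

# Categorical/nominal: a factor whose levels have no inherent order
region <- factor(c("North", "South", "East", "West", "South"))

# Ordinal: an ordered factor with explicitly ranked levels
agreement <- factor(
  c("agree", "strongly disagree", "disagree", "agree", "strongly agree"),
  levels = c("strongly disagree", "disagree", "agree", "strongly agree"),
  ordered = TRUE
)

# Discrete: whole-number counts stored as integers
n_children <- c(0L, 2L, 1L, 3L, 0L)

# Continuous: values that can fall anywhere on an interval
income <- c(41250.50, 58000, 32110.75, 99000.10, 47500)

is.ordered(region)     # FALSE: nominal levels cannot be ranked
is.ordered(agreement)  # TRUE: ordinal levels can be ranked
agreement < "agree"    # TRUE for the two responses ranked below "agree"
table(region)          # counts per level are a natural summary for nominal data
```

The type of a variable guides which of the measures that follow make sense: counts and proportions for factors, means and quantiles for numeric values.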

Types of measures

Measures of distribution

Measures of distribution describe how often an event or response occurs.

We cover the following functions:

  • Count of observations (survey_count() and survey_tally())
  • Summation of variables (survey_total())

Measures of central tendency

Measures of central tendency find the central (or average) responses. These measures include means and medians.

  • Means and proportions (survey_mean() and survey_prop())
  • Quantiles and medians (survey_quantile() and survey_median())

Measures of dispersion

Measures of dispersion describe how data spread around the central tendency for continuous variables. These measures include standard deviations and variances.

  • Variances and standard deviations (survey_var() and survey_sd())
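
A minimal sketch of the two dispersion functions named above. It assumes the apistrat example data that ships with the {survey} package as a stand-in dataset; the design object api_des and the column api00 belong to that example data, not to RECS or ANES.

```r
library(dplyr)
library(srvyr)

# Assumption: the `api` example data from {survey} stands in for the workshop data.
data(api, package = "survey")

# Stratified design: school type is the stratum, pw is the sampling weight
api_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# Estimated population variance and standard deviation of the api00 score
disp <- api_des %>%
  summarize(
    api00_var = survey_var(api00),
    api00_sd = survey_sd(api00)
  )
disp
```

The standard deviation reported by survey_sd() should be the square root of the variance estimate from survey_var().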

Measures of relationship

Measures of relationship describe how variables relate to each other. These measures include correlations and ratios.

  • Correlations (survey_corr())
  • Ratios (survey_ratio())
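
survey_ratio() is not demonstrated elsewhere in these slides, so here is a hedged sketch. It assumes the apistrat example data from the {survey} package as a stand-in; api.stu (students tested) and enroll (enrollment) are columns of that example data, and the output names r, num, and den are chosen only for illustration.

```r
library(dplyr)
library(srvyr)

# Assumption: the `api` example data from {survey} stands in for the workshop data.
data(api, package = "survey")

api_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# Ratio of two estimated totals: students tested per enrolled student
ratio <- api_des %>%
  summarize(
    r = survey_ratio(api.stu, enroll),
    num = survey_total(api.stu, vartype = NULL),
    den = survey_total(enroll, vartype = NULL)
  )
ratio
```

The point estimate r should equal num / den; survey_ratio() additionally supplies a standard error for the ratio itself.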

Survey analysis process

Overview of survey analysis using the {srvyr} package

  1. Create a tbl_svy object (a survey design object) using: as_survey_design() or as_survey_rep()

For ANES:

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )

For RECS:

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )

Overview of survey analysis using the {srvyr} package

  1. Create a tbl_svy object (a survey design object) using: as_survey_design() or as_survey_rep()
  2. Subset data (if needed) using filter() (to create subpopulations)
  3. Specify domains of analysis using group_by()
  4. Specify variables to calculate, including means, totals, proportions, quantiles, and more
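
The four steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the workshop's own analysis: it uses the apistrat example data from the {survey} package, and the filter condition and grouping variable are arbitrary choices.

```r
library(dplyr)
library(srvyr)

# Assumption: the `api` example data from {survey} stands in for the workshop data.
data(api, package = "survey")

# Step 1: create the tbl_svy object
api_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

out <- api_des %>%
  # Step 2: subset to a subpopulation (schools eligible for awards)
  filter(awards == "Yes") %>%
  # Step 3: define the domains of analysis (school type)
  group_by(stype) %>%
  # Step 4: calculate the estimate of interest
  summarize(api00_mean = survey_mean(api00))
out
```

Each row of the result is one domain, with the estimate and its standard error in adjacent columns.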

Counts and totals

survey_count()

  • Calculate the estimated observation counts for a given variable or combination of variables
  • Applied to categorical data
  • Sometimes called “cross-tabulations” or “cross-tabs”
  • survey_count() functions similarly to dplyr::count() in that it is NOT called within summarize()
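
To make the relationship between the counting functions concrete, the sketch below computes the same estimated counts three ways. It assumes the apistrat example data from the {survey} package as a stand-in; stype is a column of that data.

```r
library(dplyr)
library(srvyr)

# Assumption: the `api` example data from {survey} stands in for the workshop data.
data(api, package = "survey")

api_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# survey_count(): grouping variables are passed directly, no summarize()
counts <- api_des %>%
  survey_count(stype, name = "N")

# survey_tally(): same estimate, but the grouping is set up beforehand
tallies <- api_des %>%
  group_by(stype) %>%
  survey_tally(name = "N")

# survey_total() with no x-variable inside summarize()
totals <- api_des %>%
  group_by(stype) %>%
  summarize(N = survey_total())

counts
```

All three approaches sum the weights within each group, so the N columns should agree.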

survey_count(): syntax

survey_count(
  x,
  ...,
  wt = NULL,
  sort = FALSE,
  name = "n",
  .drop = dplyr::group_by_drop_default(x),
  vartype = c("se", "ci", "var", "cv")
)

survey_count(): examples

survey_count(): example

Calculate the estimated number of households in the U.S. using RECS data:

recs_des %>%
  survey_count()
# A tibble: 1 × 2
           n  n_se
       <dbl> <dbl>
1 123529025. 0.148

survey_count(): subgroup example

Calculate the estimated number of observations for Region and Division:

recs_des %>%
  survey_count(Region, Division, 
               name = "N")
# A tibble: 10 × 4
   Region    Division                   N         N_se
   <fct>     <fct>                  <dbl>        <dbl>
 1 Northeast New England         5876166  0.0000000137
 2 Northeast Middle Atlantic    16043503  0.0000000487
 3 Midwest   East North Central 18546912  0.000000437 
 4 Midwest   West North Central  8495815  0.0000000177
 5 South     South Atlantic     24843261  0.0000000418
 6 South     East South Central  7380717. 0.114       
 7 South     West South Central 14619094  0.000488    
 8 West      Mountain North      4615844  0.119       
 9 West      Mountain South      4602070  0.0000000492
10 West      Pacific            18505643. 0.00000295  

survey_total()

  • Calculate the estimated total quantity in a population
  • Applied to continuous data
  • Must be called within summarize()
  • If used with no x-variable, survey_total() calculates a population count estimate within summarize()

survey_total(): syntax

survey_total(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  deff = FALSE,
  df = NULL
)

survey_total(): examples

survey_total(): example

Calculate the population count estimate, which for RECS is the estimated number of U.S. households:

recs_des %>%
  summarize(Tot = survey_total())
# A tibble: 1 × 2
         Tot Tot_se
       <dbl>  <dbl>
1 123529025.  0.148

survey_total(): continuous data example

Calculate the total cost of electricity in whole dollars:

recs_des %>%
  summarize(elec_bill = survey_total(DOLLAREL))
# A tibble: 1 × 2
      elec_bill elec_bill_se
          <dbl>        <dbl>
1 170473527909.   664893504.

survey_total(): group_by() example

Calculate the total cost of electricity in whole dollars by region, with confidence intervals:

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_total(DOLLAREL,
    vartype = "ci"
  ))
# A tibble: 4 × 4
  Region       elec_bill elec_bill_low elec_bill_upp
  <fct>            <dbl>         <dbl>         <dbl>
1 Northeast 29430369947.  28788987554.  30071752341.
2 Midwest   34972544751.  34339576041.  35605513460.
3 South     72496840204.  71534780902.  73458899506.
4 West      33573773008.  32909111702.  34238434313.

unweighted()

  • Sometimes, it is helpful to calculate an unweighted estimate of a given variable
  • unweighted() does not extrapolate to a population estimate
  • Used in conjunction with any {dplyr} functions
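
A common use of unweighted() is reporting the number of sampled cases next to the weighted population estimate. The sketch below assumes the apistrat example data from the {survey} package as a stand-in; the output names n_pop and n_sample are illustrative.

```r
library(dplyr)
library(srvyr)

# Assumption: the `api` example data from {survey} stands in for the workshop data.
data(api, package = "survey")

api_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

sizes <- api_des %>%
  group_by(stype) %>%
  summarize(
    n_pop = survey_total(vartype = NULL),  # weighted: estimated schools in the population
    n_sample = unweighted(n())             # unweighted: rows actually sampled
  )
sizes
```

The unweighted column describes the sample only, so it is useful for judging how much data supports each estimate rather than as an estimate itself.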

unweighted(): example

Calculate the unweighted average household electricity cost alongside the weighted estimate:

recs_des %>%
  summarize(
    elec_bill = survey_mean(DOLLAREL),
    elec_unweight = unweighted(mean(DOLLAREL))
  )
# A tibble: 1 × 3
  elec_bill elec_bill_se elec_unweight
      <dbl>        <dbl>         <dbl>
1     1380.         5.38         1425.

Means, proportions, and quantiles

survey_mean() and survey_prop()

  • Calculate the estimated mean of a variable, or the estimated proportion for each level of a variable or combination of variables
  • survey_mean() applies to continuous data, survey_prop() to categorical data
  • Must be called within summarize()
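
One way to see how the two functions relate: a proportion is the mean of a 0/1 indicator. The sketch below checks this, assuming the apistrat example data from the {survey} package as a stand-in; awards is a Yes/No factor in that data.

```r
library(dplyr)
library(srvyr)

# Assumption: the `api` example data from {survey} stands in for the workshop data.
data(api, package = "survey")

api_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# Proportion via survey_mean() on a 0/1 indicator
p_mean <- api_des %>%
  summarize(p = survey_mean(as.numeric(awards == "Yes")))

# Proportion via survey_prop(), one row per level of the grouping variable
p_prop <- api_des %>%
  group_by(awards) %>%
  summarize(p = survey_prop())

p_mean
p_prop
```

The point estimate for the "Yes" row of p_prop should match p_mean, and the proportions across levels should sum to 1.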

survey_mean(): syntax

survey_mean(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = FALSE,
  prop_method = c("logit", "likelihood", "asin", "beta", "mean"),
  deff = FALSE,
  df = NULL
)

survey_prop(): syntax

survey_prop(
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = TRUE,
  prop_method =
    c("logit", "likelihood", "asin", "beta", "mean", "xlogit"),
  deff = FALSE,
  df = NULL
)

survey_mean() and survey_prop(): examples

survey_prop(): one variable proportion example

Calculate the proportion of households in each region in the RECS data:

recs_des %>%
  group_by(Region) %>%
  summarize(p = survey_prop())
# A tibble: 4 × 3
  Region        p     p_se
  <fct>     <dbl>    <dbl>
1 Northeast 0.177 2.12e-10
2 Midwest   0.219 2.62e-10
3 South     0.379 7.40e-10
4 West      0.224 8.16e-10

survey_mean(): one variable proportion example

Calculate the proportion of households in each region in the RECS data, this time using survey_mean() with no x-variable:

recs_des %>%
  group_by(Region) %>%
  summarize(p = survey_mean())
# A tibble: 4 × 3
  Region        p     p_se
  <fct>     <dbl>    <dbl>
1 Northeast 0.177 2.12e-10
2 Midwest   0.219 2.62e-10
3 South     0.379 7.40e-10
4 West      0.224 8.16e-10

survey_prop(): conditional proportions example

Calculate the proportion of housing units by Region and whether air conditioning (A/C) is used:

recs_des %>%
  group_by(Region, ACUsed) %>%
  summarize(p = survey_prop())
# A tibble: 8 × 4
# Groups:   Region [4]
  Region    ACUsed      p    p_se
  <fct>     <lgl>   <dbl>   <dbl>
1 Northeast FALSE  0.110  0.00590
2 Northeast TRUE   0.890  0.00590
3 Midwest   FALSE  0.0666 0.00508
4 Midwest   TRUE   0.933  0.00508
5 South     FALSE  0.0581 0.00278
6 South     TRUE   0.942  0.00278
7 West      FALSE  0.255  0.00759
8 West      TRUE   0.745  0.00759

survey_prop(): joint proportions example

Calculate the joint proportion for each combination using interact():

recs_des %>%
  group_by(interact(Region, ACUsed)) %>%
  summarize(p = survey_prop())
# A tibble: 8 × 4
  Region    ACUsed      p    p_se
  <fct>     <lgl>   <dbl>   <dbl>
1 Northeast FALSE  0.0196 0.00105
2 Northeast TRUE   0.158  0.00105
3 Midwest   FALSE  0.0146 0.00111
4 Midwest   TRUE   0.204  0.00111
5 South     FALSE  0.0220 0.00106
6 South     TRUE   0.357  0.00106
7 West      FALSE  0.0573 0.00170
8 West      TRUE   0.167  0.00170

survey_mean(): overall mean example

Calculate the estimated average cost of electricity in the U.S.:

recs_des %>%
  summarize(
    elec_bill = survey_mean(DOLLAREL, vartype = c("se", "ci"))
  )
# A tibble: 1 × 4
  elec_bill elec_bill_se elec_bill_low elec_bill_upp
      <dbl>        <dbl>         <dbl>         <dbl>
1     1380.         5.38         1369.         1391.

survey_mean(): mean by subgroup example

Calculate the estimated average cost of electricity in the U.S. by region:

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_mean(DOLLAREL))
# A tibble: 4 × 3
  Region    elec_bill elec_bill_se
  <fct>         <dbl>        <dbl>
1 Northeast     1343.         14.6
2 Midwest       1293.         11.7
3 South         1548.         10.3
4 West          1211.         12.0

Your Turn

  • Open 03-descriptive-exercises.qmd
  • Work through Exercises - Part 1

Quantiles and Medians

survey_quantile() and survey_median(): Quantiles and Medians

  • Calculate quantiles at specific points
  • Because the median is a common special case of a quantile (the 0.5 quantile), there is a dedicated survey_median() function
  • Must be called within summarize()
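
The sketch below illustrates that survey_median() returns the same estimate as the 0.5 quantile from survey_quantile(). It assumes the apistrat example data from the {survey} package as a stand-in; the output names qtl and med are illustrative.

```r
library(dplyr)
library(srvyr)

# Assumption: the `api` example data from {survey} stands in for the workshop data.
data(api, package = "survey")

api_des <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

q <- api_des %>%
  summarize(
    qtl = survey_quantile(api00, quantiles = c(0.25, 0.5, 0.75)),
    med = survey_median(api00)
  )

# Quantile columns are suffixed with the quantile, e.g. qtl_q50
q %>% select(qtl_q25, qtl_q50, qtl_q75, med)
```

Requesting several quantiles at once produces one estimate column and one standard error column per quantile, which is why the output is wide.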

survey_quantile(): syntax

survey_quantile(
  x,
  quantiles,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  interval_type = 
    c("mean", "beta", "xlogit", "asin", "score", "quantile"),
  qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3", 
            "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"),
  df = NULL
)

survey_quantile() and survey_median(): examples

survey_quantile(): example

Calculate the first quartile (p=0.25), the median (p=0.5), and the third quartile (p=0.75) of electric bills:

recs_des %>%
  summarize(elec_bill = survey_quantile(DOLLAREL,
    quantiles = c(0.25, 0.5, 0.75)
  ))
# A tibble: 1 × 6
  elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se elec_bill_q50_se
          <dbl>         <dbl>         <dbl>            <dbl>            <dbl>
1          795.         1215.         1770.             5.69             6.33
# ℹ 1 more variable: elec_bill_q75_se <dbl>

survey_median(): example

Calculate the estimated median cost of electricity in the U.S.:

recs_des %>%
  summarize(elec_bill = survey_median(DOLLAREL))
# A tibble: 1 × 2
  elec_bill elec_bill_se
      <dbl>        <dbl>
1     1215.         6.33

Helper functions for survey analysis

filter()

  • Subpopulation analysis
  • Use filter() to subset a survey object for analysis
  • Must be done after creating the survey design object
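
The ordering matters because filtering a design object keeps the information about the full sample, while filtering the raw data first discards it. The sketch below contrasts the two orders, assuming the apistrat example data from the {survey} package as a stand-in; the object names after_design and before_design are illustrative.

```r
library(dplyr)
library(srvyr)

# Assumption: the `api` example data from {survey} stands in for the workshop data.
data(api, package = "survey")

# Recommended: build the design first, then filter the tbl_svy object
after_design <- apistrat %>%
  as_survey_design(strata = stype, weights = pw) %>%
  filter(awards == "Yes") %>%
  summarize(m = survey_mean(api00))

# Not recommended: filter the raw data, then build the design
before_design <- apistrat %>%
  filter(awards == "Yes") %>%
  as_survey_design(strata = stype, weights = pw) %>%
  summarize(m = survey_mean(api00))

after_design
before_design
```

The point estimates agree because the same rows and weights are used, but the standard errors can differ: only the first version treats the subpopulation as a domain within the original design.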

filter(): example

Calculate an estimate of the average amount spent on natural gas among housing units using natural gas:

recs_des %>%
  filter(BTUNG > 0) %>%
  summarize(NG_mean = survey_mean(DOLLARNG,
    vartype = c("se", "ci")
  ))
# A tibble: 1 × 4
  NG_mean NG_mean_se NG_mean_low NG_mean_upp
    <dbl>      <dbl>       <dbl>       <dbl>
1    631.       4.64        621.        640.

cascade()

  • Creates a summary row for the estimate representing the entire population
  • The {srvyr} package has the convenient cascade() function
  • Used instead of summarize()

cascade(): syntax

cascade(
  .data, 
  ..., 
  .fill = NA, 
  .fill_level_top = FALSE, 
  .groupings = NULL
)

cascade(): examples

cascade(): example

Start by calculating the average household electricity cost, then build on this example to show the features of cascade():

recs_des %>%
  cascade(DOLLAREL_mn = survey_mean(DOLLAREL))
# A tibble: 1 × 2
  DOLLAREL_mn DOLLAREL_mn_se
        <dbl>          <dbl>
1       1380.           5.38

cascade(): example

Group by region:

recs_des %>%
  group_by(Region) %>%
  cascade(DOLLAREL_mn = survey_mean(DOLLAREL))
# A tibble: 5 × 3
  Region    DOLLAREL_mn DOLLAREL_mn_se
  <fct>           <dbl>          <dbl>
1 Northeast       1343.          14.6 
2 Midwest         1293.          11.7 
3 South           1548.          10.3 
4 West            1211.          12.0 
5 <NA>            1380.           5.38

cascade(): example

Give the summary row a better name with .fill:

recs_des %>%
  group_by(Region) %>%
  cascade(
    DOLLAREL_mn = survey_mean(DOLLAREL),
    .fill = "National"
  )
# A tibble: 5 × 3
  Region    DOLLAREL_mn DOLLAREL_mn_se
  <fct>           <dbl>          <dbl>
1 Northeast       1343.          14.6 
2 Midwest         1293.          11.7 
3 South           1548.          10.3 
4 West            1211.          12.0 
5 National        1380.           5.38

cascade(): example

Move the summary row to the top with .fill_level_top = TRUE:

recs_des %>%
  group_by(Region) %>%
  cascade(
    DOLLAREL_mn = survey_mean(DOLLAREL),
    .fill = "National",
    .fill_level_top = TRUE
  )
# A tibble: 5 × 3
  Region    DOLLAREL_mn DOLLAREL_mn_se
  <fct>           <dbl>          <dbl>
1 National        1380.           5.38
2 Northeast       1343.          14.6 
3 Midwest         1293.          11.7 
4 South           1548.          10.3 
5 West            1211.          12.0 

Your Turn

  • Open 03-descriptive-exercises.qmd
  • Work through Exercises - Part 2

Descriptive analysis summary

Summary

Descriptive analyses…

  • lay the groundwork for the next steps of running statistical tests or developing models
  • help us glean insight into the data, the underlying population, and any unique aspects of the data or population

The {srvyr} package has functions for calculating measures of distribution, central tendency, relationship, and dispersion.

  • Depending on the type of data, we determine what statistics to calculate

  • We calculate estimates after creating the design object, by running the {srvyr} functions on the tbl_svy object
  • filter() and group_by() precede the calculation functions, but still follow the design object
  • There are additional functions for unweighted analyses and calculating summary rows

Extra Content

Correlations

survey_corr()

  • Measure the linear relationship between two continuous variables
  • The most commonly used method is Pearson’s correlation
  • Ranges between –1 and 1

survey_corr(): syntax

survey_corr(
  x,
  y,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  df = NULL
)

survey_corr(): examples

survey_corr(): example

Calculate the correlation between the total square footage of homes and electricity consumption:

recs_des %>%
  summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL))
# A tibble: 1 × 2
  SQFT_Elec_Corr SQFT_Elec_Corr_se
           <dbl>             <dbl>
1          0.417           0.00689