Descriptive Analysis

Introduction

Descriptive analyses lay the groundwork for the next steps of running statistical tests or developing models.

Calculate point estimates of…

Unknown population parameters, such as mean.

Uncertainty estimates, such as confidence intervals.

Types of data

Categorical/nominal data: variables with levels or descriptions that cannot be ordered, such as the region of the country (North, South, East, and West)
Ordinal data: variables that can be ordered, such as those from a Likert scale (strongly disagree, disagree, agree, and strongly agree)
Discrete data: variables that are counted and take only distinct, separate values, such as number of children
Continuous data: variables that are measured and whose values can lie anywhere on an interval, such as income
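
One possible way to represent these four data types in R is sketched below. The values are made up for illustration, and the object names (region, agree, children, income) are hypothetical rather than taken from ANES or RECS.

```r
# Hypothetical toy values, chosen only to illustrate the four data types.
region <- factor(c("North", "South", "East", "West"))      # categorical/nominal

agree <- factor(
  c("agree", "strongly disagree", "disagree", "strongly agree"),
  levels = c("strongly disagree", "disagree", "agree", "strongly agree"),
  ordered = TRUE                                            # ordinal
)

children <- c(0L, 2L, 1L, 3L)                               # discrete (counts)
income   <- c(41250.50, 67300.00, 52810.75, 88120.10)       # continuous

# Nominal factors have no order; ordered factors support comparisons.
is.ordered(region)
is.ordered(agree)
agree[1] > agree[3]   # compares "agree" with "disagree"
```

Storing ordinal variables as ordered factors keeps the level order available to later grouping and tabulation steps.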

Types of measures

Measures of distribution

Measures of distribution describe how often an event or response occurs.

We cover the following functions:

Count of observations (survey_count() and survey_tally())
Summation of variables (survey_total())

Measures of central tendency

Measures of central tendency find the central (or average) responses. These measures include means and medians.

Means and proportions (survey_mean() and survey_prop())
Quantiles and medians (survey_quantile() and survey_median())

Measures of dispersion

Measures of dispersion describe how data spread around the central tendency for continuous variables. These measures include standard deviations and variances.

Variances and standard deviations (survey_var() and survey_sd())

Measures of relationship

Measures of relationship describe how variables relate to each other. These measures include correlations and ratios.

Correlations (survey_corr())
Ratios (survey_ratio())

Survey analysis process

Overview of survey analysis using the {srvyr} package

Create a tbl_svy object (a survey design object) using: as_survey_design() or as_survey_rep()

For ANES:

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )

For RECS:

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )

Create a tbl_svy object (a survey design object) using: as_survey_design() or as_survey_rep()

Subset data (if needed) using filter() (to create subpopulations)
Specify domains of analysis using group_by()

Specify variables to calculate, including means, totals, proportions, quantiles, and more
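
The four steps can be sketched end to end on a small made-up data set. Everything here (the toy data frame and its columns region, ac, bill, and wt) is a hypothetical stand-in for real survey data, and the design uses weights only, with no strata or clusters.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy sample: 5 households with made-up weights (not RECS data).
toy <- data.frame(
  region = factor(c("North", "North", "South", "South", "South")),
  ac     = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  bill   = c(100, 200, 300, 400, 500),
  wt     = c(10, 20, 30, 20, 20)
)

bill_by_region <- toy %>%
  as_survey_design(weights = wt) %>%       # 1. create the tbl_svy object
  filter(ac) %>%                           # 2. subset to households with A/C
  group_by(region) %>%                     # 3. one domain per region
  summarize(bill_mn = survey_mean(bill))   # 4. weighted mean bill per domain

bill_by_region
```

With so few rows the standard errors are not meaningful; the sketch is only meant to show the order of the steps.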

Counts and totals

`survey_count()`

Calculate the estimated observation counts for a given variable or combination of variables
Applied to categorical data
Sometimes called “cross-tabulations” or “cross-tabs”
survey_count() functions similarly to dplyr::count() in that it is NOT called within summarize()
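
To illustrate the last point, the sketch below calls survey_count() directly on a design object and compares it with the group_by() plus summarize(survey_total()) route. The toy data and its column names are hypothetical, not RECS variables.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy sample with made-up weights (not RECS data).
toy <- data.frame(
  region = factor(c("North", "North", "South", "South", "South")),
  wt     = c(10, 20, 30, 20, 20)
)
toy_des <- as_survey_design(toy, weights = wt)

# survey_count() is called directly on the design object
counts_a <- toy_des %>%
  survey_count(region, name = "N")

# The same point estimates via group_by() and summarize()
counts_b <- toy_des %>%
  group_by(region) %>%
  summarize(N = survey_total())

counts_a
counts_b
```

Both tables should report the sum of the weights in each region as the estimated count.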

`survey_count()`: syntax

survey_count(
  x,
  ...,
  wt = NULL,
  sort = FALSE,
  name = "n",
  .drop = dplyr::group_by_drop_default(x),
  vartype = c("se", "ci", "var", "cv")
)

`survey_count()`: examples

`survey_count()`: example

Calculate the estimated number of households in the U.S. using RECS data:

recs_des %>%
  survey_count()

# A tibble: 1 × 2
           n  n_se
       <dbl> <dbl>
1 123529025. 0.148

`survey_count()`: subgroup example

Calculate the estimated number of households by Region and Division:

recs_des %>%
  survey_count(Region, Division, 
               name = "N")

# A tibble: 10 × 4
   Region    Division                   N         N_se
   <fct>     <fct>                  <dbl>        <dbl>
 1 Northeast New England         5876166  0.0000000137
 2 Northeast Middle Atlantic    16043503  0.0000000487
 3 Midwest   East North Central 18546912  0.000000437 
 4 Midwest   West North Central  8495815  0.0000000177
 5 South     South Atlantic     24843261  0.0000000418
 6 South     East South Central  7380717. 0.114       
 7 South     West South Central 14619094  0.000488    
 8 West      Mountain North      4615844  0.119       
 9 West      Mountain South      4602070  0.0000000492
10 West      Pacific            18505643. 0.00000295

`survey_total()`

Calculate the estimated total quantity in a population
Applied to continuous data
Must be called within summarize()
If used with no x-variable, survey_total() calculates a population count estimate within summarize()
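
As a rough check on what survey_total() estimates, the sketch below compares it with by-hand weighted sums on a made-up data set. The toy data frame and its columns (bill, wt) are hypothetical.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy sample with made-up weights (not RECS data).
toy <- data.frame(
  bill = c(100, 200, 300, 400, 500),
  wt   = c(10, 20, 30, 20, 20)
)
toy_des <- as_survey_design(toy, weights = wt)

out <- toy_des %>%
  summarize(
    n_hh     = survey_total(),      # no x: estimated population count
    bill_tot = survey_total(bill)   # with x: estimated population total of bill
  )

# By-hand point estimates for comparison
n_by_hand    <- sum(toy$wt)             # sum of the weights
bill_by_hand <- sum(toy$wt * toy$bill)  # weighted sum of bill

out
```

The point estimates in out should agree with n_by_hand and bill_by_hand; the standard errors depend on the design and are not reproduced by these simple sums.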

`survey_total()`: syntax

survey_total(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  deff = FALSE,
  df = NULL
)

`survey_total()`: examples

`survey_total()`: example

Calculate the estimated number of U.S. households (the population count):

recs_des %>%
  summarize(Tot = survey_total())

# A tibble: 1 × 2
         Tot Tot_se
       <dbl>  <dbl>
1 123529025.  0.148

`survey_total()`: continuous data example

Calculate the total cost of electricity in whole dollars:

recs_des %>%
  summarize(elec_bill = survey_total(DOLLAREL))

# A tibble: 1 × 2
      elec_bill elec_bill_se
          <dbl>        <dbl>
1 170473527909.   664893504.

`survey_total()`: `group_by()` example

Calculate the total cost of electricity in whole dollars for each region, with confidence intervals:

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_total(DOLLAREL,
    vartype = "ci"
  ))

# A tibble: 4 × 4
  Region       elec_bill elec_bill_low elec_bill_upp
  <fct>            <dbl>         <dbl>         <dbl>
1 Northeast 29430369947.  28788987554.  30071752341.
2 Midwest   34972544751.  34339576041.  35605513460.
3 South     72496840204.  71534780902.  73458899506.
4 West      33573773008.  32909111702.  34238434313.

`unweighted()`

Sometimes, it is helpful to calculate an unweighted estimate of a given variable
unweighted() does not extrapolate to a population estimate
Used in conjunction with any {dplyr} functions
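
A small sketch of the contrast between weighted and unweighted estimates, assuming a made-up data set: unweighted(n()) returns the number of sampled rows, while survey_total() extrapolates to the population. The toy data and column names are hypothetical.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy sample with made-up weights (not RECS data).
toy <- data.frame(
  bill = c(100, 200, 300, 400, 500),
  wt   = c(10, 20, 30, 20, 20)
)

out <- toy %>%
  as_survey_design(weights = wt) %>%
  summarize(
    n_sample = unweighted(n()),          # rows in the sample
    n_pop    = survey_total(),           # estimated population size
    bill_unw = unweighted(mean(bill)),   # plain sample mean
    bill_wtd = survey_mean(bill)         # weighted mean
  )

out
```

Note that unweighted() results come without a standard error column, since they describe the sample rather than the population.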

`unweighted()`: example

Calculate the weighted and unweighted average household electricity cost:

recs_des %>%
  summarize(
    elec_bill = survey_mean(DOLLAREL),
    elec_unweight = unweighted(mean(DOLLAREL))
  )

# A tibble: 1 × 3
  elec_bill elec_bill_se elec_unweight
      <dbl>        <dbl>         <dbl>
1     1380.         5.38         1425.

Means, proportions, and quantiles

`survey_mean()` and `survey_prop()`

Calculate the estimated mean or proportion for a given variable or combination of variables
survey_mean() applies to continuous data, survey_prop() to categorical data
Must be called within summarize()
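
To make the two estimators concrete, the sketch below compares survey_mean() with a by-hand weighted mean and survey_prop() with each group's share of the total weight. The toy data and column names are hypothetical, and the design uses weights only.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy sample with made-up weights (not RECS data).
toy <- data.frame(
  region = factor(c("North", "North", "South", "South", "South")),
  bill   = c(100, 200, 300, 400, 500),
  wt     = c(10, 20, 30, 20, 20)
)
toy_des <- as_survey_design(toy, weights = wt)

# Continuous variable: weighted mean
mean_out <- toy_des %>%
  summarize(bill_mn = survey_mean(bill))
mean_by_hand <- weighted.mean(toy$bill, toy$wt)

# Categorical variable: proportion of the population in each region
prop_out <- toy_des %>%
  group_by(region) %>%
  summarize(p = survey_prop())
prop_by_hand <- tapply(toy$wt, toy$region, sum) / sum(toy$wt)

mean_out
prop_out
```

The point estimates should match mean_by_hand and prop_by_hand; only the standard errors require the survey design machinery.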

`survey_mean()`: syntax

survey_mean(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = FALSE,
  prop_method = c("logit", "likelihood", "asin", "beta", "mean"),
  deff = FALSE,
  df = NULL
)

`survey_prop()`: syntax

survey_prop(
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = TRUE,
  prop_method =
    c("logit", "likelihood", "asin", "beta", "mean", "xlogit"),
  deff = FALSE,
  df = NULL
)

`survey_mean()` and `survey_prop()`: examples

`survey_prop()`: one variable proportion example

Calculate the proportion of households in each region in the RECS data:

recs_des %>%
  group_by(Region) %>%
  summarize(p = survey_prop())

# A tibble: 4 × 3
  Region        p     p_se
  <fct>     <dbl>    <dbl>
1 Northeast 0.177 2.12e-10
2 Midwest   0.219 2.62e-10
3 South     0.379 7.40e-10
4 West      0.224 8.16e-10

`survey_mean()`: one variable proportion example

Calculate the proportion of households in each region in the RECS data, this time using survey_mean():

recs_des %>%
  group_by(Region) %>%
  summarize(p = survey_mean())

# A tibble: 4 × 3
  Region        p     p_se
  <fct>     <dbl>    <dbl>
1 Northeast 0.177 2.12e-10
2 Midwest   0.219 2.62e-10
3 South     0.379 7.40e-10
4 West      0.224 8.16e-10

`survey_prop()`: conditional proportions example

Calculate the proportion of housing units by Region and whether air conditioning (A/C) is used:

recs_des %>%
  group_by(Region, ACUsed) %>%
  summarize(p = survey_prop())

# A tibble: 8 × 4
# Groups:   Region [4]
  Region    ACUsed      p    p_se
  <fct>     <lgl>   <dbl>   <dbl>
1 Northeast FALSE  0.110  0.00590
2 Northeast TRUE   0.890  0.00590
3 Midwest   FALSE  0.0666 0.00508
4 Midwest   TRUE   0.933  0.00508
5 South     FALSE  0.0581 0.00278
6 South     TRUE   0.942  0.00278
7 West      FALSE  0.255  0.00759
8 West      TRUE   0.745  0.00759

`survey_prop()`: joint proportions example

Calculate the joint proportion for each combination using interact():

recs_des %>%
  group_by(interact(Region, ACUsed)) %>%
  summarize(p = survey_prop())

# A tibble: 8 × 4
  Region    ACUsed      p    p_se
  <fct>     <lgl>   <dbl>   <dbl>
1 Northeast FALSE  0.0196 0.00105
2 Northeast TRUE   0.158  0.00105
3 Midwest   FALSE  0.0146 0.00111
4 Midwest   TRUE   0.204  0.00111
5 South     FALSE  0.0220 0.00106
6 South     TRUE   0.357  0.00106
7 West      FALSE  0.0573 0.00170
8 West      TRUE   0.167  0.00170
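
The difference between conditional and joint proportions can be checked on a small made-up data set: conditional proportions sum to 1 within each region, while joint proportions sum to 1 across all combinations. The toy data and column names are hypothetical.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy sample with made-up weights (not RECS data).
toy <- data.frame(
  region = factor(c("North", "North", "South", "South", "South")),
  ac     = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  wt     = c(10, 20, 30, 20, 20)
)
toy_des <- as_survey_design(toy, weights = wt)

# Conditional: share of A/C use within each region
cond <- toy_des %>%
  group_by(region, ac) %>%
  summarize(p = survey_prop())

# Joint: share of each region-by-A/C combination in the whole population
joint <- toy_des %>%
  group_by(interact(region, ac)) %>%
  summarize(p = survey_prop())

tapply(cond$p, cond$region, sum)   # one total per region
sum(joint$p)                       # one total overall
```

With only five rows the standard errors are not informative, so the sketch focuses on the point estimates.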

`survey_mean()`: overall mean example

Calculate the estimated average cost of electricity in the U.S.:

recs_des %>%
  summarize(
    elec_bill = survey_mean(DOLLAREL, vartype = c("se", "ci"))
  )

# A tibble: 1 × 4
  elec_bill elec_bill_se elec_bill_low elec_bill_upp
      <dbl>        <dbl>         <dbl>         <dbl>
1     1380.         5.38         1369.         1391.

`survey_mean()`: mean by subgroup example

Calculate the estimated average cost of electricity in each U.S. region:

recs_des %>%
  group_by(Region) %>%
  summarize(elec_bill = survey_mean(DOLLAREL))

# A tibble: 4 × 3
  Region    elec_bill elec_bill_se
  <fct>         <dbl>        <dbl>
1 Northeast     1343.         14.6
2 Midwest       1293.         11.7
3 South         1548.         10.3
4 West          1211.         12.0

Your Turn

Open 03-descriptive-exercises.qmd
Work through Exercises - Part 1

15:00

Quantiles and Medians

`survey_quantile()` and `survey_median()`: Quantiles and Medians

Calculate quantiles at specific points
Because the median is a special, common case of a quantile, there is a dedicated survey_median() function
Must be called within summarize()
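
A minimal sketch of the relationship between the two functions, assuming a made-up data set: the 0.5 quantile from survey_quantile() and the result of survey_median() should be the same estimate. The toy data and column names are hypothetical.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy sample with made-up weights (not RECS data).
toy <- data.frame(
  bill = c(100, 200, 300, 400, 500),
  wt   = c(10, 20, 30, 20, 20)
)

out <- toy %>%
  as_survey_design(weights = wt) %>%
  summarize(
    q   = survey_quantile(bill, quantiles = c(0.25, 0.5, 0.75)),
    med = survey_median(bill)
  )

# survey_quantile() columns are named <name>_q25, <name>_q50, <name>_q75
out$q_q50
out$med
```

On a sample this small the quantile standard errors are unstable, so only the point estimates are worth comparing.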

`survey_quantile()`: syntax

survey_quantile(
  x,
  quantiles,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  interval_type = 
    c("mean", "beta", "xlogit", "asin", "score", "quantile"),
  qrule = c("math", "school", "shahvaish", "hf1", "hf2", "hf3", 
            "hf4", "hf5", "hf6", "hf7", "hf8", "hf9"),
  df = NULL
)

`survey_quantile()` and `survey_median()`: examples

`survey_quantile()`: example

Calculate the first quartile (p=0.25), the median (p=0.5), and the third quartile (p=0.75) of electric bills:

recs_des %>%
  summarize(elec_bill = survey_quantile(DOLLAREL,
    quantiles = c(0.25, 0.5, 0.75)
  ))

# A tibble: 1 × 6
  elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se elec_bill_q50_se
          <dbl>         <dbl>         <dbl>            <dbl>            <dbl>
1          795.         1215.         1770.             5.69             6.33
# ℹ 1 more variable: elec_bill_q75_se <dbl>

`survey_median()`: example

Calculate the estimated median cost of electricity in the U.S.:

recs_des %>%
  summarize(elec_bill = survey_median(DOLLAREL))

# A tibble: 1 × 2
  elec_bill elec_bill_se
      <dbl>        <dbl>
1     1215.         6.33

Helper functions for survey analysis

`filter()`

Subpopulation analysis
Use filter() to subset a survey object for analysis
Must be done after creating the survey design object

`filter()`: example

Calculate an estimate of the average amount spent on natural gas among housing units using natural gas:

recs_des %>%
  filter(BTUNG > 0) %>%
  summarize(NG_mean = survey_mean(DOLLARNG,
    vartype = c("se", "ci")
  ))

# A tibble: 1 × 4
  NG_mean NG_mean_se NG_mean_low NG_mean_upp
    <dbl>      <dbl>       <dbl>       <dbl>
1    631.       4.64        621.        640.

`cascade()`

Creates a summary row for the estimate representing the entire population
The {srvyr} package has the convenient cascade() function
Used instead of summarize()

`cascade()`: syntax

cascade(
  .data, 
  ..., 
  .fill = NA, 
  .fill_level_top = FALSE, 
  .groupings = NULL
)

`cascade()`: examples

`cascade()`: example

Calculate the average household electricity cost. Let’s build on it to show the features of cascade():

recs_des %>%
  cascade(DOLLAREL_mn = survey_mean(DOLLAREL))

# A tibble: 1 × 2
  DOLLAREL_mn DOLLAREL_mn_se
        <dbl>          <dbl>
1       1380.           5.38

`cascade()`: example

Group by region:

recs_des %>%
  group_by(Region) %>%
  cascade(DOLLAREL_mn = survey_mean(DOLLAREL))

# A tibble: 5 × 3
  Region    DOLLAREL_mn DOLLAREL_mn_se
  <fct>           <dbl>          <dbl>
1 Northeast       1343.          14.6 
2 Midwest         1293.          11.7 
3 South           1548.          10.3 
4 West            1211.          12.0 
5 <NA>            1380.           5.38

`cascade()`: example

Give the summary row a better name with .fill:

recs_des %>%
  group_by(Region) %>%
  cascade(
    DOLLAREL_mn = survey_mean(DOLLAREL),
    .fill = "National"
  )

# A tibble: 5 × 3
  Region    DOLLAREL_mn DOLLAREL_mn_se
  <fct>           <dbl>          <dbl>
1 Northeast       1343.          14.6 
2 Midwest         1293.          11.7 
3 South           1548.          10.3 
4 West            1211.          12.0 
5 National        1380.           5.38

`cascade()`: example

Move the summary row to the top with .fill_level_top = TRUE:

recs_des %>%
  group_by(Region) %>%
  cascade(
    DOLLAREL_mn = survey_mean(DOLLAREL),
    .fill = "National",
    .fill_level_top = TRUE
  )

# A tibble: 5 × 3
  Region    DOLLAREL_mn DOLLAREL_mn_se
  <fct>           <dbl>          <dbl>
1 National        1380.           5.38
2 Northeast       1343.          14.6 
3 Midwest         1293.          11.7 
4 South           1548.          10.3 
5 West            1211.          12.0

Your Turn

Open 03-descriptive-exercises.qmd
Work through Exercises - Part 2

10:00

Descriptive analysis summary

Summary

Descriptive analyses…

lay the groundwork for the next steps of running statistical tests or developing models
help us glean insight into the data, the underlying population, and any unique aspects of the data or population

Summary

The {srvyr} package has functions for calculating measures of distribution, central tendency, relationship, and dispersion.

Depending on the type of data, we determine what statistics to calculate

Summary

We create variables after creating the design object, running the functions on the tbl_svy object
filter() and group_by() precede the calculation functions, but still follow the design object
There are additional functions for unweighted analyses and calculating summary rows

Extra Content

Correlations

`survey_corr()`

Measure the linear relationship between two continuous variables
The most commonly used method is Pearson’s correlation
Ranges between –1 and 1
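
As a sketch of what the point estimate represents, the code below compares survey_corr() with a weighted Pearson correlation computed by hand. The toy data and column names (sqft, bill, wt) are hypothetical, and the comparison assumes the estimate is the weighted covariance divided by the square root of the product of the weighted variances.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy sample with made-up weights (not RECS data).
toy <- data.frame(
  sqft = c(800, 1200, 1500, 2100, 2400),
  bill = c(100, 200, 300, 400, 500),
  wt   = c(10, 20, 30, 20, 20)
)

out <- toy %>%
  as_survey_design(weights = wt) %>%
  summarize(r = survey_corr(sqft, bill))

# Weighted Pearson correlation by hand
w  <- toy$wt
dx <- toy$sqft - weighted.mean(toy$sqft, w)
dy <- toy$bill - weighted.mean(toy$bill, w)
r_by_hand <- sum(w * dx * dy) / sqrt(sum(w * dx^2) * sum(w * dy^2))

out$r
r_by_hand
```

Any common scaling of the variance and covariance estimates cancels in the ratio, which is why the simple weighted sums are enough for the by-hand version.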

`survey_corr()`: syntax

survey_corr(
  x,
  y,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  df = NULL
)

`survey_corr()`: examples

`survey_corr()`: example

Calculate the correlation between the total square footage of homes and electricity consumption:

recs_des %>%
  summarize(SQFT_Elec_Corr = survey_corr(TOTSQFT_EN, BTUEL))

# A tibble: 1 × 2
  SQFT_Elec_Corr SQFT_Elec_Corr_se
           <dbl>             <dbl>
1          0.417           0.00689
