Descriptive analyses lay the groundwork for the next steps of running statistical tests or developing models.
Calculate point estimates of…
Measures of distribution describe how often an event or response occurs.
We cover the following functions:
survey_count() and survey_tally()
survey_total()
Measures of central tendency find the central (or average) responses. These measures include means and medians.
survey_mean() and survey_prop()
survey_quantile() and survey_median()
Measures of dispersion describe how data spread around the central tendency for continuous variables. These measures include standard deviations and variances.
survey_var() and survey_sd()
Measures of relationship describe how variables relate to each other. These measures include correlations and ratios.
survey_corr()
survey_ratio()
Create a tbl_svy object (a survey design object) using as_survey_design() or as_survey_rep()
Subset data (if needed) using filter() (to create subpopulations)
Specify domains of analysis using group_by()
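The steps above can be sketched end to end on a small made-up data set. This is a minimal illustration, not the workshop's RECS code: the toy data frame, its column names (region, uses_gas, elec_bill, wt), and the object names are all hypothetical, and the design is assumed to have weights only (no strata or clusters).

```r
library(dplyr)
library(srvyr)

# Hypothetical toy data (not RECS): six households with sampling weights
toy <- tibble(
  region    = factor(c("North", "North", "North", "South", "South", "South")),
  uses_gas  = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE),
  elec_bill = c(100, 120, 140, 200, 220, 240),
  wt        = c(10, 10, 20, 5, 5, 10)
)

# Step 1: create the tbl_svy object (assumed weights-only design)
toy_des <- toy %>% as_survey_design(weights = wt)

# Steps 2-3: subset to a subpopulation, then define the domains;
# the estimate itself is computed inside summarize()
gas_by_region <- toy_des %>%
  filter(uses_gas) %>%
  group_by(region) %>%
  summarize(bill_mean = survey_mean(elec_bill))

gas_by_region
```

The result has one row per region with the weighted mean (bill_mean) and its standard error (bill_mean_se).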
survey_count()
survey_count() functions similarly to dplyr::count() in that it is NOT called within summarize()
survey_count(): syntax
survey_count(): example
Calculate the estimated number of households in the U.S. using Residential Energy Consumption Survey (RECS) data:
# A tibble: 1 × 2
n n_se
<dbl> <dbl>
1 123529025. 0.148
survey_count(): subgroup example
Calculate the estimated number of observations for Region and Division:
# A tibble: 10 × 4
Region Division N N_se
<fct> <fct> <dbl> <dbl>
1 Northeast New England 5876166 0.0000000137
2 Northeast Middle Atlantic 16043503 0.0000000487
3 Midwest East North Central 18546912 0.000000437
4 Midwest West North Central 8495815 0.0000000177
5 South South Atlantic 24843261 0.0000000418
6 South East South Central 7380717. 0.114
7 South West South Central 14619094 0.000488
8 West Mountain North 4615844 0.119
9 West Mountain South 4602070 0.0000000492
10 West Pacific 18505643. 0.00000295
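The two survey_count() calls behind output of this shape might look like the sketch below. It runs on hypothetical toy data rather than RECS, so the numbers differ from those shown above; the data frame, its columns, and the weights-only design are assumptions for illustration.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy data (not RECS): six households with sampling weights
toy <- tibble(
  region = factor(c("North", "North", "North", "South", "South", "South")),
  wt     = c(10, 10, 20, 5, 5, 10)
)
toy_des <- toy %>% as_survey_design(weights = wt)

# Overall weighted count; survey_count() is called directly on the design,
# not inside summarize()
n_all <- toy_des %>% survey_count()

# Weighted count by subgroup, renaming the estimate column to N
n_region <- toy_des %>% survey_count(region, name = "N")

n_all     # columns n and n_se; n is the sum of the weights (60)
n_region  # columns region, N, and N_se
```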
survey_total()
survey_total() calculates a population count estimate within summarize()
survey_total(): syntax
survey_total(): example
Calculate the population count estimate (the estimated number of U.S. households):
# A tibble: 1 × 2
Tot Tot_se
<dbl> <dbl>
1 123529025. 0.148
survey_total(): continuous data example
Calculate the total cost of electricity in whole dollars:
# A tibble: 1 × 2
elec_bill elec_bill_se
<dbl> <dbl>
1 170473527909. 664893504.
survey_total(): group_by() example
Calculate the variation in the cost of electricity in whole dollars across regions:
# A tibble: 4 × 4
Region elec_bill elec_bill_low elec_bill_upp
<fct> <dbl> <dbl> <dbl>
1 Northeast 29430369947. 28788987554. 30071752341.
2 Midwest 34972544751. 34339576041. 35605513460.
3 South 72496840204. 71534780902. 73458899506.
4 West 33573773008. 32909111702. 34238434313.
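The three survey_total() patterns shown above (population count, total of a continuous variable, and totals by group with confidence intervals) can be sketched as follows. The toy data, column names, and weights-only design are hypothetical stand-ins for the RECS design object.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy data (not RECS): six households with sampling weights
toy <- tibble(
  region    = factor(c("North", "North", "North", "South", "South", "South")),
  elec_bill = c(100, 120, 140, 200, 220, 240),
  wt        = c(10, 10, 20, 5, 5, 10)
)
toy_des <- toy %>% as_survey_design(weights = wt)

# Population count: survey_total() with no variable sums the weights
pop_total <- toy_des %>% summarize(Tot = survey_total())

# Total of a continuous variable: weighted sum of elec_bill
bill_total <- toy_des %>% summarize(bill = survey_total(elec_bill))

# Totals by region, requesting confidence intervals instead of standard errors
bill_by_region <- toy_des %>%
  group_by(region) %>%
  summarize(bill = survey_total(elec_bill, vartype = "ci"))

pop_total
bill_total
bill_by_region  # columns region, bill, bill_low, bill_upp
```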
unweighted()
unweighted() does not extrapolate to a population estimate
unweighted(): example
Calculate the unweighted average household electricity cost:
# A tibble: 1 × 3
elec_bill elec_bill_se elec_unweight
<dbl> <dbl> <dbl>
1 1380. 5.38 1425.
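A side-by-side comparison like the one above can be sketched by putting survey_mean() and unweighted() in the same summarize() call. The toy data and names below are hypothetical; the point is only that the weighted and unweighted means differ when the weights are unequal.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy data (not RECS): six households with sampling weights
toy <- tibble(
  elec_bill = c(100, 120, 140, 200, 220, 240),
  wt        = c(10, 10, 20, 5, 5, 10)
)
toy_des <- toy %>% as_survey_design(weights = wt)

bill_compare <- toy_des %>%
  summarize(
    bill_mean     = survey_mean(elec_bill),       # weighted estimate with SE
    bill_unweight = unweighted(mean(elec_bill))   # plain sample mean, no SE
  )

bill_compare  # weighted mean is 9500 / 60; unweighted mean is 170
```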
survey_mean() and survey_prop()
survey_mean() applies to continuous data, survey_prop() to categorical data
Both are called within summarize()
survey_mean(): syntax
survey_prop(): syntax
survey_mean() and survey_prop(): examples
survey_prop(): one variable proportion example
Calculate the proportion of households in each region in the RECS data:
# A tibble: 4 × 3
Region p p_se
<fct> <dbl> <dbl>
1 Northeast 0.177 2.12e-10
2 Midwest 0.219 2.62e-10
3 South 0.379 7.40e-10
4 West 0.224 8.16e-10
survey_mean(): one variable proportion example
Calculate the proportion of households in each region in the RECS data:
# A tibble: 4 × 3
Region p p_se
<fct> <dbl> <dbl>
1 Northeast 0.177 2.12e-10
2 Midwest 0.219 2.62e-10
3 South 0.379 7.40e-10
4 West 0.224 8.16e-10
survey_prop(): conditional proportions example
Calculate the proportion of housing units by Region and whether air conditioning (A/C) is used:
# A tibble: 8 × 4
# Groups: Region [4]
Region ACUsed p p_se
<fct> <lgl> <dbl> <dbl>
1 Northeast FALSE 0.110 0.00590
2 Northeast TRUE 0.890 0.00590
3 Midwest FALSE 0.0666 0.00508
4 Midwest TRUE 0.933 0.00508
5 South FALSE 0.0581 0.00278
6 South TRUE 0.942 0.00278
7 West FALSE 0.255 0.00759
8 West TRUE 0.745 0.00759
survey_prop(): joint proportions example
Calculate the joint proportion for each combination using interact():
# A tibble: 8 × 4
Region ACUsed p p_se
<fct> <lgl> <dbl> <dbl>
1 Northeast FALSE 0.0196 0.00105
2 Northeast TRUE 0.158 0.00105
3 Midwest FALSE 0.0146 0.00111
4 Midwest TRUE 0.204 0.00111
5 South FALSE 0.0220 0.00106
6 South TRUE 0.357 0.00106
7 West FALSE 0.0573 0.00170
8 West TRUE 0.167 0.00170
survey_mean(): overall mean example
Calculate the estimated average cost of electricity in the U.S.:
# A tibble: 1 × 4
elec_bill elec_bill_se elec_bill_low elec_bill_upp
<dbl> <dbl> <dbl> <dbl>
1 1380. 5.38 1369. 1391.
survey_mean(): mean by subgroup example
Calculate the estimated average cost of electricity in the U.S. by each region:
# A tibble: 4 × 3
Region elec_bill elec_bill_se
<fct> <dbl> <dbl>
1 Northeast 1343. 14.6
2 Midwest 1293. 11.7
3 South 1548. 10.3
4 West 1211. 12.0
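The difference between the one-variable, conditional, and joint proportions above comes down to how the grouping is specified. The sketch below illustrates this on hypothetical toy data (the data frame, columns, and weights-only design are assumptions, not the RECS objects), and ends with an overall mean that requests both standard errors and confidence intervals.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy data (not RECS): six households with sampling weights
toy <- tibble(
  region    = factor(c("North", "North", "North", "South", "South", "South")),
  uses_gas  = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE),
  elec_bill = c(100, 120, 140, 200, 220, 240),
  wt        = c(10, 10, 20, 5, 5, 10)
)
toy_des <- toy %>% as_survey_design(weights = wt)

# One-variable proportion: share of households in each region
p_region <- toy_des %>%
  group_by(region) %>%
  summarize(p = survey_prop())

# Conditional proportion: share using gas within each region (sums to 1 per region)
p_cond <- toy_des %>%
  group_by(region, uses_gas) %>%
  summarize(p = survey_prop())

# Joint proportion: interact() makes the shares sum to 1 over all combinations
p_joint <- toy_des %>%
  group_by(interact(region, uses_gas)) %>%
  summarize(p = survey_prop())

# Overall mean of a continuous variable with SE and confidence interval
bill_overall <- toy_des %>%
  summarize(bill = survey_mean(elec_bill, vartype = c("se", "ci")))
```

In the toy data, North households using gas make up 0.75 of the North region (conditional) but 0.5 of all households (joint).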
Exercises: 03-descriptive-exercises.qmd (15:00)
survey_quantile() and survey_median(): Quantiles and Medians
The median has its own survey_median() function
Both are called within summarize()
survey_quantile(): syntax
survey_quantile() and survey_median(): examples
survey_quantile(): example
Calculate the first quartile (p=0.25), the median (p=0.5), and the third quartile (p=0.75) of electric bills:
# A tibble: 1 × 6
elec_bill_q25 elec_bill_q50 elec_bill_q75 elec_bill_q25_se elec_bill_q50_se
<dbl> <dbl> <dbl> <dbl> <dbl>
1 795. 1215. 1770. 5.69 6.33
# ℹ 1 more variable: elec_bill_q75_se <dbl>
survey_median(): example
Calculate the estimated median cost of electricity in the U.S.:
# A tibble: 1 × 2
elec_bill elec_bill_se
<dbl> <dbl>
1 1215. 6.33
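Calls that produce output in the shape of the two quantile examples above might look like this sketch. It uses hypothetical toy data rather than RECS, and with so few observations the standard errors are not meaningful; the aim is only to show the call pattern and the resulting column names.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy data (not RECS): six households with sampling weights
toy <- tibble(
  elec_bill = c(100, 120, 140, 200, 220, 240),
  wt        = c(10, 10, 20, 5, 5, 10)
)
toy_des <- toy %>% as_survey_design(weights = wt)

# Quartiles: one estimate column per requested quantile (suffix _q25, _q50, _q75),
# each with a matching _se column
bill_quartiles <- toy_des %>%
  summarize(bill = survey_quantile(elec_bill, quantiles = c(0.25, 0.5, 0.75)))

# Median: shorthand for the 0.5 quantile, returned without the _q50 suffix
bill_median <- toy_des %>%
  summarize(bill = survey_median(elec_bill))

names(bill_quartiles)
names(bill_median)
```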
filter()
Use filter() to subset a survey object for analysis
filter(): example
Calculate an estimate of the average amount spent on natural gas among housing units using natural gas:
# A tibble: 1 × 4
NG_mean NG_mean_se NG_mean_low NG_mean_upp
<dbl> <dbl> <dbl> <dbl>
1 631. 4.64 621. 640.
cascade()
The cascade() function adds a summary row for the overall estimate
It is used in place of summarize()
cascade(): syntax
cascade(): example
Calculate the average household electricity cost. Let’s build on it to show the features of cascade():
# A tibble: 1 × 2
DOLLAREL_mn DOLLAREL_mn_se
<dbl> <dbl>
1 1380. 5.38
cascade(): example
Group by region:
# A tibble: 5 × 3
Region DOLLAREL_mn DOLLAREL_mn_se
<fct> <dbl> <dbl>
1 Northeast 1343. 14.6
2 Midwest 1293. 11.7
3 South 1548. 10.3
4 West 1211. 12.0
5 <NA> 1380. 5.38
cascade(): example
Give the summary row a better name with .fill:
# A tibble: 5 × 3
Region DOLLAREL_mn DOLLAREL_mn_se
<fct> <dbl> <dbl>
1 Northeast 1343. 14.6
2 Midwest 1293. 11.7
3 South 1548. 10.3
4 West 1211. 12.0
5 National 1380. 5.38
cascade(): example
Move the summary row to the top with .fill_level_top = TRUE:
# A tibble: 5 × 3
Region DOLLAREL_mn DOLLAREL_mn_se
<fct> <dbl> <dbl>
1 National 1380. 5.38
2 Northeast 1343. 14.6
3 Midwest 1293. 11.7
4 South 1548. 10.3
5 West 1211. 12.0
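The progression in the cascade() examples above (grouped estimate with a summary row, then naming the row, then moving it to the top) can be sketched on hypothetical toy data. The data frame, column names, and weights-only design are assumptions; region is stored as a factor, as Region is in the output above.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy data (not RECS): six households with sampling weights
toy <- tibble(
  region    = factor(c("North", "North", "North", "South", "South", "South")),
  elec_bill = c(100, 120, 140, 200, 220, 240),
  wt        = c(10, 10, 20, 5, 5, 10)
)
toy_des <- toy %>% as_survey_design(weights = wt)

# cascade() in place of summarize(): group rows plus an overall row (region is NA)
by_region <- toy_des %>%
  group_by(region) %>%
  cascade(bill_mn = survey_mean(elec_bill))

# Label the overall row with .fill
named <- toy_des %>%
  group_by(region) %>%
  cascade(bill_mn = survey_mean(elec_bill), .fill = "National")

# Put the overall row first with .fill_level_top = TRUE
top <- toy_des %>%
  group_by(region) %>%
  cascade(bill_mn = survey_mean(elec_bill),
          .fill = "National", .fill_level_top = TRUE)

by_region
named
top
```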
Exercises: 03-descriptive-exercises.qmd (10:00)
Descriptive analyses…
The {srvyr} package has functions for calculating measures of distribution, central tendency, relationship, and dispersion.
tbl_svy object
filter() and group_by() precede the calculation functions, but still follow the design object
survey_corr()
survey_corr(): syntax
survey_corr(): example
Calculate the correlation between the total square footage of homes and electricity consumption:
# A tibble: 1 × 2
SQFT_Elec_Corr SQFT_Elec_Corr_se
<dbl> <dbl>
1 0.417 0.00689
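A correlation estimate of this shape might be produced by a call like the one sketched below. The toy data and the sqft column are hypothetical, and the sketch assumes a version of {srvyr} recent enough to include survey_corr().

```r
library(dplyr)
library(srvyr)

# Hypothetical toy data (not RECS): six households with sampling weights
toy <- tibble(
  sqft      = c(800, 1000, 1500, 1200, 2000, 1800),
  elec_bill = c(100, 120, 140, 200, 220, 240),
  wt        = c(10, 10, 20, 5, 5, 10)
)
toy_des <- toy %>% as_survey_design(weights = wt)

# Weighted correlation between two continuous variables, with its standard error
sqft_elec <- toy_des %>%
  summarize(sqft_elec_corr = survey_corr(sqft, elec_bill))

sqft_elec  # columns sqft_elec_corr and sqft_elec_corr_se
```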