Statistical Testing

t-tests

t-tests

  • Determine if one proportion or mean is statistically different from another
  • Different types:
    • One-Sample: Comparing our data to a set value
    • Two-Sample: Comparing two sets of data
      • Unpaired: Comparing two values for two different groups
      • Paired: Comparing two values for the same unit

t-test syntax

Use the svyttest() function to compare two proportions or means.

Syntax:

svyttest(
  formula,
  design,
  ...
)

t-test syntax

recs_des %>%
  svyttest(SummerTempNight - 68 ~ 0,
    na.rm = TRUE
  )
Error in formula[[3]] == 1 || formula[[3]] == 0: 'length = 60' in coercion to 'logical(1)'

t-test syntax

Use the svyttest() function to compare two proportions or means.

Syntax:

svyttest(
  formula,
  design,
  ...
)

Dot notation

Dot notation

Let’s walk through an example using the towny data.

towny
# A tibble: 414 × 25
   name   website status csd_type census_div latitude longitude land_area_km2
   <chr>  <chr>   <chr>  <chr>    <chr>         <dbl>     <dbl>         <dbl>
 1 Addin… https:… lower… township Lennox an…     45       -77.2        1294. 
 2 Adela… https:… lower… township Middlesex      43.0     -81.7         331. 
 3 Adjal… https:… lower… township Simcoe         44.1     -79.9         372. 
 4 Admas… https:… lower… township Renfrew        45.5     -76.9         520. 
 5 Ajax   https:… lower… town     Durham         43.9     -79.0          66.6
 6 Alber… https:… singl… township Rainy Riv…     48.6     -93.5         117. 
 7 Alfre… https:… lower… township Prescott …     45.6     -74.9         392. 
 8 Algon… https:… lower… township Haliburton     45.4     -78.8        1000. 
 9 Alnwi… https:… lower… township Northumbe…     44.1     -78.0         398. 
10 Amara… https:… lower… township Dufferin       44.0     -80.2         265. 
# ℹ 404 more rows
# ℹ 17 more variables: population_1996 <int>, population_2001 <int>,
#   population_2006 <int>, population_2011 <int>, population_2016 <int>,
#   population_2021 <int>, density_1996 <dbl>, density_2001 <dbl>,
#   density_2006 <dbl>, density_2011 <dbl>, density_2016 <dbl>,
#   density_2021 <dbl>, pop_change_1996_2001_pct <dbl>,
#   pop_change_2001_2006_pct <dbl>, pop_change_2006_2011_pct <dbl>, …

Dot notation

Using the towny data, let’s filter to only cities.

filter(towny, csd_type == "city")
# A tibble: 52 × 25
   name   website status csd_type census_div latitude longitude land_area_km2
   <chr>  <chr>   <chr>  <chr>    <chr>         <dbl>     <dbl>         <dbl>
 1 Barrie https:… singl… city     Simcoe         44.4     -79.7          99.0
 2 Belle… https:… singl… city     Hastings       44.2     -77.4         247. 
 3 Bramp… https:… lower… city     Peel           43.7     -79.8         266. 
 4 Brant  https:… singl… city     Brant          43.1     -80.4         818. 
 5 Brant… https:… singl… city     Brant          43.1     -80.3          98.6
 6 Brock… https:… singl… city     Leeds and…     44.6     -75.7          20.9
 7 Burli… https:… lower… city     Halton         43.4     -79.8         186. 
 8 Cambr… https:… lower… city     Waterloo       43.4     -80.3         113. 
 9 Clare… https:… lower… city     Prescott …     45.5     -75.2         297. 
10 Cornw… https:… singl… city     Stormont,…     45.0     -74.7          61.5
# ℹ 42 more rows
# ℹ 17 more variables: population_1996 <int>, population_2001 <int>,
#   population_2006 <int>, population_2011 <int>, population_2016 <int>,
#   population_2021 <int>, density_1996 <dbl>, density_2001 <dbl>,
#   density_2006 <dbl>, density_2011 <dbl>, density_2016 <dbl>,
#   density_2021 <dbl>, pop_change_1996_2001_pct <dbl>,
#   pop_change_2001_2006_pct <dbl>, pop_change_2006_2011_pct <dbl>, …

Dot notation

Using the towny data, let’s filter to only cities.

filter(towny, csd_type == "city")
towny %>% filter(csd_type == "city")
# A tibble: 52 × 25
   name   website status csd_type census_div latitude longitude land_area_km2
   <chr>  <chr>   <chr>  <chr>    <chr>         <dbl>     <dbl>         <dbl>
 1 Barrie https:… singl… city     Simcoe         44.4     -79.7          99.0
 2 Belle… https:… singl… city     Hastings       44.2     -77.4         247. 
 3 Bramp… https:… lower… city     Peel           43.7     -79.8         266. 
 4 Brant  https:… singl… city     Brant          43.1     -80.4         818. 
 5 Brant… https:… singl… city     Brant          43.1     -80.3          98.6
 6 Brock… https:… singl… city     Leeds and…     44.6     -75.7          20.9
 7 Burli… https:… lower… city     Halton         43.4     -79.8         186. 
 8 Cambr… https:… lower… city     Waterloo       43.4     -80.3         113. 
 9 Clare… https:… lower… city     Prescott …     45.5     -75.2         297. 
10 Cornw… https:… singl… city     Stormont,…     45.0     -74.7          61.5
# ℹ 42 more rows
# ℹ 17 more variables: population_1996 <int>, population_2001 <int>,
#   population_2006 <int>, population_2011 <int>, population_2016 <int>,
#   population_2021 <int>, density_1996 <dbl>, density_2001 <dbl>,
#   density_2006 <dbl>, density_2011 <dbl>, density_2016 <dbl>,
#   density_2021 <dbl>, pop_change_1996_2001_pct <dbl>,
#   pop_change_2001_2006_pct <dbl>, pop_change_2006_2011_pct <dbl>, …

Dot notation

Using the towny data, let’s filter to only cities.

filter(towny, csd_type == "city")
towny %>% filter(csd_type == "city")
towny %>% filter(., csd_type == "city")
# A tibble: 52 × 25
   name   website status csd_type census_div latitude longitude land_area_km2
   <chr>  <chr>   <chr>  <chr>    <chr>         <dbl>     <dbl>         <dbl>
 1 Barrie https:… singl… city     Simcoe         44.4     -79.7          99.0
 2 Belle… https:… singl… city     Hastings       44.2     -77.4         247. 
 3 Bramp… https:… lower… city     Peel           43.7     -79.8         266. 
 4 Brant  https:… singl… city     Brant          43.1     -80.4         818. 
 5 Brant… https:… singl… city     Brant          43.1     -80.3          98.6
 6 Brock… https:… singl… city     Leeds and…     44.6     -75.7          20.9
 7 Burli… https:… lower… city     Halton         43.4     -79.8         186. 
 8 Cambr… https:… lower… city     Waterloo       43.4     -80.3         113. 
 9 Clare… https:… lower… city     Prescott …     45.5     -75.2         297. 
10 Cornw… https:… singl… city     Stormont,…     45.0     -74.7          61.5
# ℹ 42 more rows
# ℹ 17 more variables: population_1996 <int>, population_2001 <int>,
#   population_2006 <int>, population_2011 <int>, population_2016 <int>,
#   population_2021 <int>, density_1996 <dbl>, density_2001 <dbl>,
#   density_2006 <dbl>, density_2011 <dbl>, density_2016 <dbl>,
#   density_2021 <dbl>, pop_change_1996_2001_pct <dbl>,
#   pop_change_2001_2006_pct <dbl>, pop_change_2006_2011_pct <dbl>, …

Dot notation

Using the towny data, let’s filter to only cities.

filter(towny, csd_type == "city")
towny %>% filter(csd_type == "city")
towny %>% filter(., csd_type == "city")
towny %>% filter(.data = ., csd_type == "city")
# A tibble: 52 × 25
   name   website status csd_type census_div latitude longitude land_area_km2
   <chr>  <chr>   <chr>  <chr>    <chr>         <dbl>     <dbl>         <dbl>
 1 Barrie https:… singl… city     Simcoe         44.4     -79.7          99.0
 2 Belle… https:… singl… city     Hastings       44.2     -77.4         247. 
 3 Bramp… https:… lower… city     Peel           43.7     -79.8         266. 
 4 Brant  https:… singl… city     Brant          43.1     -80.4         818. 
 5 Brant… https:… singl… city     Brant          43.1     -80.3          98.6
 6 Brock… https:… singl… city     Leeds and…     44.6     -75.7          20.9
 7 Burli… https:… lower… city     Halton         43.4     -79.8         186. 
 8 Cambr… https:… lower… city     Waterloo       43.4     -80.3         113. 
 9 Clare… https:… lower… city     Prescott …     45.5     -75.2         297. 
10 Cornw… https:… singl… city     Stormont,…     45.0     -74.7          61.5
# ℹ 42 more rows
# ℹ 17 more variables: population_1996 <int>, population_2001 <int>,
#   population_2006 <int>, population_2011 <int>, population_2016 <int>,
#   population_2021 <int>, density_1996 <dbl>, density_2001 <dbl>,
#   density_2006 <dbl>, density_2011 <dbl>, density_2016 <dbl>,
#   density_2021 <dbl>, pop_change_1996_2001_pct <dbl>,
#   pop_change_2001_2006_pct <dbl>, pop_change_2006_2011_pct <dbl>, …

Example
one-sample t-test

Example: one-sample t-test

  • one sample t-tests compare data against a single value
  • Stephanie usually sets her home to 68°F at night during the summer. Is this different from the average household in the U.S.?
  • Use SummerTempNight

Example: one-sample t-test

Test if the average U.S. household sets its temperature at a value different from 68°F using svyttest():

recs_des %>%
  svyttest(
    formula = SummerTempNight - 68 ~ 0,
    design = .,
    na.rm = TRUE
  )
  • Formula to test if the true mean of SummerTempNight variable minus 68°F is equal to 0
  • Dot notation . that passes the recs_des object into the design argument

Results: one-sample t-test

Test if the average U.S. household sets its temperature at a value different from 68°F using svyttest():

recs_des %>%
  svyttest(
    formula = SummerTempNight - 68 ~ 0,
    design = .,
    na.rm = TRUE
  )

    Design-based one-sample t-test

data:  SummerTempNight - 68 ~ 0
t = 84.788, df = 58, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 3.287816 3.446810
sample estimates:
    mean 
3.367313 

Example
unpaired two-sample t-test

Example: unpaired two-sample t-test

  • Two-sample t-test compares data from two populations
  • On average, is there a significant different electric bill for households with and without air-conditioning?
  • Using DOLLAREL for the electricity bill and ACUsed to determine if households have air-conditioning

Example: unpaired two-sample t-test

Test if the electricity expenditure is significantly different for homes with and without air-conditioning.

recs_des %>%
  svyttest(
    formula = DOLLAREL ~ ACUsed,
    design = .,
    na.rm = TRUE
  )
  • Formula with electricity expenditure on the left and air-conditioning usage on the right
  • Dot notation . that passes the recs_des object into the design argument

Results: unpaired two-sample t-test

Test if the electricity expenditure is significantly different for homes with and without air-conditioning:

recs_des %>%
  svyttest(
    formula = DOLLAREL ~ ACUsed,
    design = .,
    na.rm = TRUE
  )

    Design-based t-test

data:  DOLLAREL ~ ACUsed
t = 21.29, df = 58, p-value < 2.2e-16
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 331.3343 400.1054
sample estimates:
difference in mean 
          365.7199 

Example
paired two-sample t-test

Example: paired t-test

Do U.S. households set their thermostat differently at night in the summer and winter?

recs_des %>%
  svyttest(
    formula = SummerTempNight - WinterTempNight ~ 0,
    design = .,
    na.rm = TRUE
  )

    Design-based one-sample t-test

data:  SummerTempNight - WinterTempNight ~ 0
t = 50.833, df = 58, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 2.739403 2.963995
sample estimates:
    mean 
2.851699 

Your Turn

  • Open 04-testing-exercises.qmd
  • Work through Exercises - Part 1
10:00

Chi-square tests

Chi-square tests

  • Compare multiple proportions to see if they are statistically different from each other
  • Different types:
    • Goodness-of-fit tests compare observed data to expected data
    • Tests of independence compare two types of observed data to see if there is a relationship
    • Tests of homogeneity compare two distributions to see if they match

Chi-square syntax

There are two functions that we use to compare proportions:

  • svygofchisq(): For goodness-of-fit tests
  • svychisq(): For tests of independence and homogeneity

Syntax for goodness-of-fit tests

svygofchisq(formula,
  p,
  design,
  na.rm = TRUE,
  ...
)

Syntax for tests of independence and homogeneity

svychisq(formula,
  design,
  statistic = c(
    "F", "Chisq", "Wald", "adjWald",
    "lincom", "saddlepoint"
  ),
  na.rm = TRUE,
  ...
)

Example
goodness-of-fit test

Example: goodness-of-fit test

  • Let’s check to see if the education levels from the ANES match the levels from the ACS
  • Here is the education breakdown from the ACS in 2020
    • 11% had less than a high school degree
    • 27% had a high school degree
    • 29% had some college or an associate’s degree
    • 33% had a bachelor’s degree or higher

Example: goodness-of-fit test

  • Let’s check to see if the education levels from the ANES match the levels from the ACS
  • Here is the education breakdown from our survey data
anes_des %>%
  drop_na(Education) %>%
  group_by(Education) %>%
  summarize(p = survey_mean())
# A tibble: 5 × 3
  Education         p    p_se
  <fct>         <dbl>   <dbl>
1 Less than HS 0.0805 0.00568
2 High school  0.277  0.0102 
3 Post HS      0.290  0.00713
4 Bachelor's   0.226  0.00633
5 Graduate     0.126  0.00499

Example: goodness-of-fit test

Let’s collapse Bachelor’s and Graduate degrees into a single category for comparison.

anes_des_educ <- anes_des %>%
  mutate(
    Education2 =
      fct_collapse(Education,
        "Bachelor or Higher" = c("Bachelor's", "Graduate")
      )
  )

Example: goodness-of-fit test

anes_des_educ %>%
  drop_na(Education2) %>%
  group_by(Education2) %>%
  summarize(p = survey_mean())
# A tibble: 4 × 3
  Education2              p    p_se
  <fct>               <dbl>   <dbl>
1 Less than HS       0.0805 0.00568
2 High school        0.277  0.0102 
3 Post HS            0.290  0.00713
4 Bachelor or Higher 0.352  0.00732

Results: goodness-of-fit test

Test to see if the ANES education matches the population percentages

anes_des_educ %>%
  svygofchisq(
    formula = ~Education2,
    p = c(0.11, 0.27, 0.29, 0.33),
    design = .,
    na.rm = TRUE
  )

    Design-based chi-squared test for given probabilities

data:  ~Education2
X-squared = 2177472, scale = 1.1203e+05, df = 2.2629e+00, p-value =
8.745e-05

Results: goodness-of-fit test

Code
anes_des_educ %>%
  drop_na(Education2) %>%
  group_by(Education2) %>%
  summarize(Observed = survey_mean(vartype = "ci")) %>%
  rename(Education = Education2) %>%
  mutate(Expected = c(0.11, 0.27, 0.29, 0.33)) %>%
  select(Education, Expected, everything()) %>%
  pivot_longer(
    cols = c("Expected", "Observed"),
    names_to = "Names",
    values_to = "Proportion"
  ) %>%
  mutate(
    Observed_low = if_else(Names == "Observed", Observed_low, NA_real_),
    Observed_upp = if_else(Names == "Observed", Observed_upp, NA_real_),
    Names = if_else(Names == "Observed",
      "ANES (observed)", "ACS (expected)"
    )
  ) %>%
  ggplot(aes(x = Education, y = Proportion, color = Names)) +
  geom_point(alpha = 0.75, size = 2) +
  geom_errorbar(aes(ymin = Observed_low, ymax = Observed_upp),
    width = 0.25
  ) +
  theme_bw() +
  scale_color_manual(name = "Type", values = c("#ff8484", "#0b3954")) +
  theme(legend.position = "bottom", legend.title = element_blank())

Example
test of independence

Example: test of independence

ANES asked respondents two questions about trust:

  • Question text: “How often can you trust the federal government to do what is right?”
  • Question text: “How often can you trust other people?”

Run a test of independence to see if there is a relationship between these two questions.

Example: test of independence

Run a test of independence to see if there is a relationship between these two questions.

anes_des %>%
  svychisq(
    formula = ~ TrustGovernment + TrustPeople,
    design = .,
    statistic = "Wald",
    na.rm = TRUE
  )

    Design-based Wald test of association

data:  NextMethod()
F = 20.876, ndf = 16, ddf = 51, p-value < 2.2e-16

Your Turn

  • Open 04-testing-exercises.qmd
  • Work through Exercises - Part 2
15:00

Next Up: Survey Design Object

Extra Content

Analysis of Variance (ANOVA)

ANOVA

Use Analysis of Variance (ANOVA) to compare two or more means.

ANOVA syntax

Use Analysis of Variance (ANOVA) to compare two or more means.

Syntax:

svyglm(
  formula = outcome ~ group,
  design = .,
  na.action = na.omit,
  df.resid = NULL
)

Example: ANOVA

  • Does the temperature that U.S. households set their AC during summer nights vary by region?
  • Use svyglm() function and variables SummerTempNight and Region
recs_des %>%
  svyglm(
    formula = SummerTempNight ~ Region,
    design = .
  ) %>%
  tidy()

Example: ANOVA

recs_des %>%
  svyglm(
    formula = SummerTempNight ~ Region,
    design = .
  ) %>%
  tidy()
# A tibble: 4 × 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)      69.7      0.103    674.   3.69e-111
2 RegionMidwest     1.34     0.138      9.68 1.46e- 13
3 RegionSouth       2.05     0.128     16.0  1.36e- 22
4 RegionWest        2.80     0.177     15.9  2.27e- 22