Tidy Survey Analysis in R using the srvyr Package

class: center, middle, inverse, title-slide

# Tidy Survey Analysis in R using the srvyr Package
## Workshop Day 2 - Continuous Data
### Stephanie Zimmer, Abt Associates
### Rebecca Powell, RTI International
### Isabella Velásquez, RStudio
### April 22, 2022

---

class: inverse center middle
# Introduction
---

## Overview

- At the end of this workshop series, you should be able to 
   - Calculate point estimates and their standard errors with survey data 
      - Proportions, totals, and counts
      - Means, quantiles, and ratios
   - Perform t-tests and chi-squared tests
   - Fit regression models
   - Specify a survey design in R to create a survey object

- We will not be going over the following but provide some resources at the end
   - Weighting (calibration, post-stratification, raking, etc.)
   - Survival analysis
   - Nonlinear models
---

## About Us

<div class="row">
<div class="column">
<center>
<img src="http://www.mapor.org/wp-content/uploads/2022/03/StephanieZimmer_Headshot.jpeg" width="200px" />
<br>
<b>Stephanie Zimmer</b>
<br>
Abt Associates
</center>
</div>

<div class="column">
<center>
<img src="http://www.mapor.org/wp-content/uploads/2020/03/Powell_Rebecca_image-e1584649023839.jpg" width="200px" />
<br>
<b>Rebecca Powell</b>
<br>
RTI International
</center>
</div>

<br>
<b>Isabella Velásquez</b>
<br>
RStudio
</center>
</div>

</div>

---

## About This Workshop

- Hosted by Midwest Association for Public Opinion Research (MAPOR), a regional chapter of the American Association for Public Opinion Research (AAPOR).

- Originally delivered at AAPOR Conference in May 2021

---

## Upcoming Work

- Book on analyzing survey data in R, published by CRC, Taylor & Francis Group

- We would love your help! After each course, we will send out a survey to gather your feedback on the material, organization, etc.

- Keep updated by following our project on GitHub: [https://github.com/tidy-survey-r](https://github.com/tidy-survey-r)

---

## Overview

- Last week we introduced how to do survey analysis in R with categorical data

- Today we focus on continuous data
- At the end of today, you should be able to 
  - Calculate means and quantiles with their standard errors 
  
  - Perform t-tests 
  - Fit linear regression models

- Next week, we will discuss
   - Specifying a survey design object
   - Creating replicate weights
   - Creating derived/analysis/recoded variables
   - Reproducibility

---
## Overview: Workshop 2 Roadmap

- Quick refresh of RStudio Cloud with a warm-up exercise using the tidyverse

- Introduce the survey data we'll be using today

- Calculate point estimates for continuous data with time for practice

- Significance testing with t-test and linear regression models with time for practice

- Closing

---
## Logistics

- We will be using RStudio Cloud today to ensure everyone has access

- Sign-up for a free RStudio Cloud account (https://rstudio.cloud/)

- Access the project and files via link in email and Zoom chat

- Click "START" to open the project and get started

- RStudio Cloud has the same features and appearance as RStudio for ease of use

- All slides and code are available on GitHub: https://github.com/tidy-survey-r/tidy-survey-short-course

???
Github repo is for future reference, all material on RStudio cloud

---
## Intro to RStudio Cloud: Penguins!!

- Using `palmerpenguins` data for warm-up exercises

- Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

- Access data through `palmerpenguins` package https://github.com/allisonhorst/palmerpenguins/

####If you are using your own RStudio environment:
- Make sure you have `tidyverse`, `here`, and `palmerpenguins` installed

```r
# Run package installation if you don't have these packages already
# As a reminder, installing takes package from internet to your computer 
# and only needs to be done once, not each session

install.packages(c("tidyverse", "here", "palmerpenguins"))
```

---
## Intro to RStudio Cloud: Penguins!!

- Load `tidyverse`, `here`, and `palmerpenguins`

- Look at the penguins dataset using `glimpse`

```r
library(tidyverse) # for tidyverse
library(here) # for file paths
library(palmerpenguins) # for warm-up data
glimpse(penguins)
```

```
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <fct> male, female, female, NA, female, male, female, male~
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
```

---
## Warm-up Exercises: WarmUpExercises.Rmd

- <b>Let's open RStudio cloud and do some warm-up examples</b>

- Take 8 minutes to do these exercises in breakout rooms.

- Explore the penguins data
   - What is the mean body mass in grams of all penguins? Hint: use `summarize` and remove missing data
   
   - What is the mean length of flipper by species? Hint: use `group_by`
   
   - What is the mean flipper length by species and sex? 
   
   - Fit a simple linear regression between body mass and flipper length.

---
## Ex. 1: What is the mean body mass in grams of all penguins? Hint: use `summarize` and remove missing data

```r
penguins %>%
   summarize(
     MeanBodyMass=mean(body_mass_g, 
                       na.rm=TRUE))
```

```
## # A tibble: 1 x 1
##   MeanBodyMass
##          <dbl>
## 1        4202.
```
???
- `na.rm=TRUE` removes missing data from the calculation
    - forgetting this argument will result in a value of `NA` as the function will try to average missing data

---
## Ex. 2: What is the mean length of flipper by species? Hint: use `group_by`

```r
penguins %>%
   group_by(species) %>%
   summarize(
     MeanFlipperLength=mean(flipper_length_mm, 
                            na.rm=TRUE))
```

```
## # A tibble: 3 x 2
##   species   MeanFlipperLength
##   <fct>                 <dbl>
## 1 Adelie                 190.
## 2 Chinstrap              196.
## 3 Gentoo                 217.
```
???
- `group_by` allows us to look at metrics by different subgroups like species
- when using `group_by` follow it with `summarize` to get metrics (like average) at the group level

---
## Ex. 3: What is the mean flipper length by species and sex?

```r
penguins %>%
   group_by(species,sex) %>%
   summarize(
     MeanFlipperLength=mean(flipper_length_mm, 
                            na.rm=TRUE))
```

```
## # A tibble: 8 x 3
## # Groups:   species [3]
##   species   sex    MeanFlipperLength
##   <fct>     <fct>              <dbl>
## 1 Adelie    female              188.
## 2 Adelie    male                192.
## 3 Adelie    <NA>                186.
## 4 Chinstrap female              192.
## 5 Chinstrap male                200.
## 6 Gentoo    female              213.
## 7 Gentoo    male                222.
## 8 Gentoo    <NA>                216.
```
???
- As we learned last week with `count`, you can also `group_by` multiple variables

---
## Ex. 4: Fit a simple linear regression between body mass and flipper length.
.small[

```r
mod1 <- lm(body_mass_g ~ flipper_length_mm, data=penguins)
summary(mod1)
```

```
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1058.80  -259.27   -26.88   247.33  1288.69 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5780.831    305.815  -18.90   <2e-16 ***
## flipper_length_mm    49.686      1.518   32.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 394.3 on 340 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.759,	Adjusted R-squared:  0.7583 
## F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16
```
]
???
- here we use the `lm` (linear model) function
- can also use the `glm` (generalized linear model) function
- equation is written as y-variable ~ x-variables

---
class: inverse center middle
# Continuous survey data analysis
---
## Overview of Survey Analysis using `srvyr` Package

1. Create a `tbl_svy` object using: `as_survey_design` or `as_survey_rep`

2. Subset data (if needed) using `filter` (subpopulations)

3. Specify domains of analysis using `group_by`

4. Within `summarize`, specify variables to calculate including means, medians, quantiles and more

<b>Note: We will be teaching this in the reverse order!!!</b>
???
- Session 3 will cover how to create the survey design object

---
## Dataset: Residential Energy Consumption Survey (RECS) 2015

- Energy consumption/expenditures collected through energy suppliers

- Fielded 14 times between 1950 and 2015

- Topics include appliances, electronics, heating, a/c, temperatures, water heating, lighting, energy bills, respondent demographics, and energy assistance

- Funded by the Energy Information Administration

- <b>Target Population</b>: Primary occupied housing units in the US

- <b>Mode</b>: In-person, paper, and web interview mode

- <b>Sample Information</b>: BRR Replicate weights included for variance estimation

https://www.eia.gov/consumption/residential/index.php

???
- We have subset the columns of this data and created derived variables, code in repository

---
## Set-up for Analysis
- `srvyr` package uses tidy-syntax but uses the `survey` package behind it to do calculations

- If using your own RStudio environment, install both packages:

```r
# Install survey and srvyr packages

remotes::install_github("bschneidr/r-forge-survey-mirror")
install.packages("srvyr")
```

- First, we will set-up a design object and talk about what it means in Session 3

```r
library(survey) # for survey analysis
library(srvyr) # for tidy survey analysis

recs <- read_rds(here("Data", "recs.rds"))

recs_des <- recs %>%
   as_survey_rep(weights=NWEIGHT,
                 repweights=starts_with("BRRWT"),
                 type="Fay",
                 rho=0.5,
                 mse=TRUE)
```
???
- need to install github version of survey package if you want CIs with quantiles
- we will cover setting up the sample design object later

---
## Weighted Analysis for Continuous Variables

- Common functions for continuous summaries
   - survey_mean
   - survey_total (like sum)
   - survey_median
   - survey_quantile
   - survey_ratio

- Always call within `summarize`/`summarise`
---
## `survey_mean` Syntax

```r
survey_mean(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  proportion = FALSE,
  deff = FALSE,
  df = NULL,
  ...
)
```

To calculate a survey mean, we use this in `summarize`/`summarise`

```r
survey_design_object %>%
   summarize(
      mean_varname=survey_mean(x = continuous_varname)
      )
```

???
Only required argument is the variable

---
## `survey_mean` Example 1: On average, how much do US households spend on energy each year?

This is an example using the `recs_des` survey design object and `survey_mean` function defaults

```r
recs_des %>%
   summarize(
      TD_mean=survey_mean(x = TOTALDOL)
      )
```

```
## # A tibble: 1 x 2
##   TD_mean TD_mean_se
##     <dbl>      <dbl>
## 1   1859.       15.6
```
---
## `survey_mean` Example 2: What is the average temperature US households set their homes to on a summer day?

Run this code. What happens?

```r
recs_des %>%
   summarize(
      TD_mean=survey_mean(x = SummerTempDay)
      )
```
---
## `survey_mean` Example 2: What is the average temperature US households set their homes to on a summer day?

Run this code. What happens?

```r
recs_des %>%
   summarize(
      TD_mean=survey_mean(x = SummerTempDay)
      )
```

```
## Error in `dplyr::summarise()`:
## ! Problem while computing `TD_mean = survey_mean(x = SummerTempDay)`.
## Caused by error in `svrVar()`:
## ! All replicates contained NAs
```

???
- missing data in temperature, need `na.rm=TRUE`

---
## `survey_mean` Example 2: Missing data solution

```r
recs_des %>%
   summarize(
      TD_mean = survey_mean(
        x = SummerTempDay, 
*       na.rm = TRUE )
      )
```

```
## # A tibble: 1 x 2
##   TD_mean TD_mean_se
##     <dbl>      <dbl>
## 1    72.4     0.0793
```

---
## `survey_median` Syntax

```r
survey_median(
  x,
  na.rm = FALSE,
  vartype = c("se", "ci"),
  level = 0.95,
  df = NULL,
  ...
)
```

???
Only required argument is the variable

---
## `survey_median` Example: What is the median temperature US households set their homes to on a summer day?

.pull-left[

```r
recs_des %>%
   summarize(
      TD_median=survey_median(x=_________,
                          na.rm=_________)
      )
```
]
--
.pull-right[

```r
recs_des %>%
 summarize(
   TD_median=survey_median(x=SummerTempDay,
                           na.rm=TRUE)
 )
```

```
## # A tibble: 1 x 2
##   TD_median TD_median_se
##       <dbl>        <dbl>
## 1        72        0.252
```
]
???
- Mean temperature set is 72.4 degrees Fahrenheit with a standard error of 0.08
- Median temperature set is 72 degrees Fahrenheit with a standard error of 0.25

---
## `survey_quantile` Syntax

```r
survey_quantile(
  x,
* quantiles,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  df = NULL,
  ...
)
```

???
- need both the variable and the quantiles in a vector e.g. (c(.25, .75))

---
## `survey_quantile` Example 1: What are the 1st and 3rd quantile of dollars spent on energy?

```r
recs_des %>%
   summarize(
      Spent=survey_quantile(
        x = TOTALDOL,
*       quantiles = c(.25, .75))
      )
```

```
## # A tibble: 1 x 4
##   Spent_q25 Spent_q75 Spent_q25_se Spent_q75_se
##       <dbl>     <dbl>        <dbl>        <dbl>
## 1     1153.     2353.         13.9         22.7
```
???
- This estimates the 25th and 75th percentile

---
## `survey_quantile` Example 2:  What are the 1st and 3rd quantile of dollars spent on energy with confidence intervals?

```r
recs_des %>%
   summarize(
      Spent=survey_quantile(x = TOTALDOL,
                            quantiles = c(.25, .75),
*                           vartype = "ci"
         )
      )
```

```
## # A tibble: 1 x 6
##   Spent_q25 Spent_q75 Spent_q25_low Spent_q75_low Spent_q25_upp Spent_q75_upp
##       <dbl>     <dbl>         <dbl>         <dbl>         <dbl>         <dbl>
## 1     1153.     2353.         1124.         2310.         1179.         2400.
```

---
## `survey_quantile` Updated Output

```r
recs_des %>%
   summarize(
      Spent=survey_quantile(x = TOTALDOL,
                            quantiles = c(.25, .75),
                            vartype = "ci"
         )
      ) %>% 
   pivot_longer(cols=c(Spent_q25:Spent_q75_upp),
                names_to="varname",values_to="value") %>% 
   mutate(Quantile=paste0(str_sub(varname,8,9),"th Quantile"),
          type=case_when(str_detect(varname,"_low")~"Lower_Bound",
                         str_detect(varname,"_upp")~"Upper_Bound",
                         TRUE~"Estimate")) %>% 
   pivot_wider(id_cols=Quantile,names_from=type,values_from=value)
```

```
## # A tibble: 2 x 4
##   Quantile      Estimate Lower_Bound Upper_Bound
##   <chr>            <dbl>       <dbl>       <dbl>
## 1 25th Quantile    1153.       1124.       1179.
## 2 75th Quantile    2353.       2310.       2400.
```

---
## `survey_ratio` Syntax

- Note this estimates: `$\sum x_i/\sum y_i$` not `$\sum \frac{x_i}{y_i}$`

```r
survey_ratio(
* numerator,
* denominator,
  na.rm = FALSE,
  vartype = c("se", "ci", "var", "cv"),
  level = 0.95,
  deff = FALSE,
  df = NULL,
  ...
)
```

---
## `survey_ratio` Example: What is the average dollar per BTU spent on energy?

```r
recs_des %>%
   summarize(
      DolPerBTU=survey_ratio(
*        numerator = TOTALDOL,
*        denominator = TOTALBTU,
         na.rm = TRUE
         )
      )
```

```
## # A tibble: 1 x 2
##   DolPerBTU DolPerBTU_se
##       <dbl>        <dbl>
## 1    0.0241     0.000217
```
???
- BTU (British Thermal Units)

---
## Breakout rooms: Practice time

- Open ContinuousExercises.Rmd and work through Part 1

- We will take 15 minutes. Use this time for the exercises and questions.

---
## Weighted Analysis for Continuous Variables: Domain Analysis

- If we want to get estimates by another variable, we need to add a `group_by` statement before doing the analysis.

- Example: What is the average amount of dollars spent on electricity for households that use AC and those that do not use AC?

```r
recs_des %>%
* group_by(ACUsed) %>%
  summarize(
    ElBill=survey_mean(DOLLAREL, 
                       na.rm=TRUE)
  )
```

```
## # A tibble: 2 x 3
##   ACUsed ElBill ElBill_se
##   <lgl>   <dbl>     <dbl>
## 1 FALSE    972.      25.8
## 2 TRUE    1435.      15.8
```
---
## Domain Analysis: Totals

- If we want the overall average electric bill too, use the `cascade` function instead of `summarize`

```r
recs_des %>%
   group_by(ACUsed) %>%
   cascade(
      ElBill=survey_mean(DOLLAREL, 
                         na.rm=TRUE)
   )
```

```
## # A tibble: 3 x 3
##   ACUsed ElBill ElBill_se
##   <lgl>   <dbl>     <dbl>
## 1 FALSE    972.      25.8
## 2 TRUE    1435.      15.8
## 3 NA      1375.      14.1
```

<b>Note: The overall average electric bill appears as NA</b>

---
## Domain Analysis: Totals

- Also can add sample and pop sizes

```r
recs_des %>%
   group_by(ACUsed) %>%
   cascade(
      ElBill=survey_mean(DOLLAREL, na.rm=TRUE),
*     N=survey_total(!is.na(DOLLAREL)),
*     n=unweighted(sum(!is.na(DOLLAREL)))
   )
```

```
## # A tibble: 3 x 6
##   ACUsed ElBill ElBill_se          N        N_se     n
##   <lgl>   <dbl>     <dbl>      <dbl>       <dbl> <int>
## 1 FALSE    972.      25.8  15401242. 976901.       737
## 2 TRUE    1435.      15.8 102807008. 976901.      4949
## 3 NA      1375.      14.1 118208250.      0.0320  5686
```

???
- survey_total gets a weighted total
- unweighted does just that, an unweighted estimate, can also get an unweighted mean or any other stat

---
## Weighted Analysis for Specific Subpopulations

- filtering (subsetting) the data should be done AFTER specifying the design to ensure accurate standard errors

- Use the `filter` function after creating the survey design object and before summarizing

Wrong way:

```r
data %>%
* filter(state=="NC") %>%
  as_survey_design(...) %>%
  summarize(AvgAge=mean(Age))
```

Right way:

```r
data %>%
  as_survey_design(...) %>%
* filter(state=="NC") %>%
  summarize(AvgAge=mean(Age))
```

???
- Required to ensure correct calculation when sub-population is not in all strata or PSUs

---
## Subpopulation Example: Average electric cost of single family homes

```r
recs_des %>%
  filter(HousingUnitType %in% c("Single-family detached",
                                "Single-family attached")) %>%
  summarize(
    ElBill=survey_mean(DOLLAREL, 
                       na.rm=TRUE)
  )
```

```
## # A tibble: 1 x 2
##   ElBill ElBill_se
##    <dbl>     <dbl>
## 1  1542.      17.2
```

???
- Filter goes AFTER the design object
- Previous syntax showed the creation of the design object, but with this example, we already created it

---
## Comparisons with t-tests: `svyttest` Syntax

- t-tests are done in the package `survey` not `srvyr` but you can use the same design object

```r
svyttest(formula, # outcome~group for two-sample, outcome~0 for one-sample
         design,
         na.rm = FALSE
         ....)
```

???
- Uses standard R formula notation
- will go over examples of 1-sample, 2-sample, and paired t-test

---
## `svyttest` Syntax with `%>%`

```r
recs_des %>%
   svyttest(formula=,
*           design=.,
            na.rm=TRUE)
```

???
- To use in tidyverse need the "dot" notation as highlighted here
- Pipe then puts the design object in the correct placement

---
## `svyttest` Syntax with `%>%`

```r
recs_des %>%
*  svyttest(design=.,
            formula=,
            na.rm=TRUE)
```

???
- Order doesn't matter for the arguments when you use the "dot" notation

---
## `svyttest` Example 1: One-sample t-test

- I keep my house at 68 degrees at night during the summer. Is this different from the national average?

```r
recs_des %>%
   svyttest(design=.,
*           formula=I(SummerTempNight-68)~0,
            na.rm=TRUE)
```

```
## 
## 	Design-based one-sample t-test
## 
## data:  I(SummerTempNight - 68) ~ 0
## t = 41.013, df = 94, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.424776 3.773247
## sample estimates:
##     mean 
## 3.599012
```

???
- Note the I notation, this does the arithmetic before modeling

---
## `svyttest` Example 2: Comparing two variables

- Do people keep their house the same temperature at night during the summer and the winter?

```r
recs_des %>%
   svyttest(design=.,
            formula=I(SummerTempNight-WinterTempNight)~0,
            na.rm=TRUE)
```

```
## 
## 	Design-based one-sample t-test
## 
## data:  I(SummerTempNight - WinterTempNight) ~ 0
## t = 29.079, df = 94, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  2.995084 3.434072
## sample estimates:
##     mean 
## 3.214578
```

???
- this is a paired t-test
- testing whether the difference is 0 for each household

---
## `svyttest` Example 3: Two-sample t-test

- Are electric bills different between those with and without A/C?

```r
recs_des %>%
   svyttest(design=.,
            formula=DOLLAREL~ACUsed,
            na.rm=TRUE)
```

```
## 
## 	Design-based t-test
## 
## data:  DOLLAREL ~ ACUsed
## t = 14.772, df = 94, p-value < 2.2e-16
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
##  400.6588 525.0903
## sample estimates:
## difference in mean 
##           462.8746
```

---
## Linear Regression or ANOVA: `svyglm` Syntax

- As with t-tests, regressions are done in the package `survey` not `srvyr` but you can use the same design object

- Syntax is similar between t-test and glm

```r
svyglm(formula, 
       design,
       na.action, #default is na.omit
       ....)
```
---
## `svyglm` Example: Two-sample

Same example as two-sample t-test: Are electric bills different between those with and without A/C?

```r
recs_des %>%
   svyttest(design=.,
            formula=DOLLAREL~ACUsed,
*           na.rm=TRUE)
```

```r
recs_des %>%
   svyglm(design=.,
          formula=DOLLAREL~ACUsed,
*         na.action=na.omit)
```

???
- one major difference in how you specify to ignore NA values
- svyttest can only have 2-levels in group variable
- svyglm, the variable on right can be anything (continuous or factor)

---
## `svyglm` Example: Two-sample

Are electric bills different between those with and without A/C?
.small[

```r
recs_des %>%
   svyglm(design=.,
          formula=DOLLAREL~ACUsed,
          na.action=na.omit) %>%
  summary()
```

```
## 
## Call:
## svyglm(design = ., formula = DOLLAREL ~ ACUsed, na.action = na.omit)
## 
## Survey design:
## Called via srvyr
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   972.09      25.81   37.66   <2e-16 ***
## ACUsedTRUE    462.87      31.33   14.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3543220488)
## 
## Number of Fisher Scoring iterations: 2
```
]

???
- same results as t-test

---
## `svyglm` Example 1: ANOVA Test

Does temperature of AC at night vary by region?
.smaller[

```r
recs_des %>%
   svyglm(design=.,
          formula=SummerTempNight~Region,
          na.action=na.omit) %>%
  summary()
```

```
## 
## Call:
## svyglm(design = ., formula = SummerTempNight ~ Region, na.action = na.omit)
## 
## Survey design:
## Called via srvyr
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    70.4848     0.1968 358.151  < 2e-16 ***
## RegionMidwest   0.8744     0.2526   3.461 0.000818 ***
## RegionSouth     1.4865     0.2306   6.446 5.20e-09 ***
## RegionWest      1.6568     0.3529   4.695 9.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 119075)
## 
## Number of Fisher Scoring iterations: 2
```
]

???
- Region is a factor variable, if it is numeric - this will treat it like a linear model

---
## `svyglm` Example 2: Linear Model

- Is there a relationship between square footage and electric bill?
- Let's review the data first with a ggplot. <i>Note we use the original data and do <b>NOT</b> use the survey design object.</i>

```r
p <- recs %>%
  ggplot(aes(x=TOTSQFT_EN, y=DOLLAREL, weight=NWEIGHT)) +
  geom_hex(color="white") + 
  scale_fill_gradient(guide="colourbar",name="Count of Housing Units",labels=comma)
```

???
- When using the original data, make sure you include the weight variable

---
## `svyglm` Example 2: Linear Model
<img src="Slides-day-2_files/figure-html/plot_sf_elbill_disp-1.png" width="90%" style="display: block; margin: auto;" />

---
## `svyglm` Example 2: Linear Model
.small[

```r
m_electric_sqft <- recs_des %>%
   svyglm(design=.,
          formula=DOLLAREL~TOTSQFT_EN,
          na.action=na.omit)
summary(m_electric_sqft)
```

```
## 
## Call:
## svyglm(design = ., formula = DOLLAREL ~ TOTSQFT_EN, na.action = na.omit)
## 
## Survey design:
## Called via srvyr
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 879.89542   26.31370   33.44   <2e-16 ***
## TOTSQFT_EN    0.24633    0.01338   18.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3125448288)
## 
## Number of Fisher Scoring iterations: 2
```
]

???
- for every additional square foot, the electricity cost is on average 24.6 cents more
---
## Breakout rooms: Practice time

- Open ContinuousExercises.Rmd and work through Part 2

- We will take 15 minutes. Use this time for the exercises and questions.

---
class: inverse center middle
# Closing
---
## Resources for more learning

- https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html

- https://r-survey.r-forge.r-project.org/survey/ 
  - Includes more advanced modeling

---
## Thank You!

### We hope you learned a lot in this session!

Please let us know if you have any feedback on this workshop. All feedback is welcome!

## Questions?

---
## Sources

- <font size="2">*Residential Energy Consumption Survey: Using the 2015 Microdata File to Compute Estimates and Standard Errors.* U.S. Department of Energy (2017) https://www.eia.gov/consumption/residential/data/2015/pdf/microdata_v3.pdf </font>

- <font size="2">Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/ </font>

- <font size="2">T. Lumley (2020) "survey: analysis of complex survey samples". R package version 4.0. https://r-survey.r-forge.r-project.org/survey/ </font>

- <font size="2">Greg Freedman Ellis and Ben Schneider (2020). srvyr: 'dplyr'-Like Syntax for Summary Statistics of Survey Data. R package version 1.0.0. https://CRAN.R-project.org/package=srvyr </font>

- <font size="2">Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.5. https://CRAN.R-project.org/package=dplyr </font>

---
## Session info - platform

```
##  setting  value
##  version  R version 4.1.3 (2022-03-10)
##  os       Windows 10 x64 (build 19042)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_United States.1252
##  ctype    English_United States.1252
##  tz       America/New_York
##  date     2022-04-12
##  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
```

---
## Session info - packages

```
##  package        * version date (UTC) lib source
##  dplyr          * 1.0.8   2022-02-08 [1] CRAN (R 4.1.2)
##  forcats        * 0.5.1   2021-01-27 [1] CRAN (R 4.1.2)
##  ggplot2        * 3.3.5   2021-06-25 [1] CRAN (R 4.1.2)
##  here           * 1.0.1   2020-12-13 [1] CRAN (R 4.1.2)
##  hexbin         * 1.28.2  2021-01-08 [1] CRAN (R 4.1.2)
##  knitr          * 1.37    2021-12-16 [1] CRAN (R 4.1.2)
##  Matrix         * 1.4-0   2021-12-08 [2] CRAN (R 4.1.3)
##  palmerpenguins * 0.1.0   2020-07-23 [1] CRAN (R 4.1.2)
##  purrr          * 0.3.4   2020-04-17 [1] CRAN (R 4.1.2)
##  readr          * 2.1.2   2022-01-30 [1] CRAN (R 4.1.2)
##  remotes        * 2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  scales         * 1.1.1   2020-05-11 [1] CRAN (R 4.1.2)
##  srvyr          * 1.1.1   2022-02-20 [1] CRAN (R 4.1.3)
##  stringr        * 1.4.0   2019-02-10 [1] CRAN (R 4.1.2)
##  survey         * 4.2     2022-03-31 [1] Github (bschneidr/r-forge-survey-mirror@69c62ff)
##  survival       * 3.2-13  2021-08-24 [2] CRAN (R 4.1.3)
##  tibble         * 3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
##  tidyr          * 1.2.0   2022-02-01 [1] CRAN (R 4.1.2)
##  tidyverse      * 1.3.1   2021-04-15 [1] CRAN (R 4.1.2)
##  xaringan       * 0.23    2022-03-08 [1] CRAN (R 4.1.3)
## 
##  [1] D:/Users/zimmers/Documents/R/win-library/4.1
##  [2] C:/Program Files/R/R-4.1.3/library
```