The srvyrexploR package provides the datasets used in the book Exploring Complex Survey Data Analysis Using R: A Tidy Introduction with {srvyr} and {survey}. With these datasets, readers can follow along with the examples and work through the exercises.

Installation

To install the development version from GitHub, use:

# install.packages("pak")
pak::pak("tidy-survey-r/srvyrexploR")

To load the package, use:

library(srvyrexploR)

About the data

This package includes data from five surveys: the American National Election Studies (ANES), the National Crime Victimization Survey (NCVS), the National Survey on Drug Use and Health (NSDUH), the Residential Energy Consumption Survey (RECS), and the California Health Interview Survey (CHIS).

ANES

The ANES data is based on the publicly available 2020 ANES data, with additional derived variables, and is subset to people who completed both the pre-election and post-election interviews. The ANES Time Series Studies collect data on political polling in the United States and have been conducted since 1948. For more information about the 2020 study, see the American National Election Studies website, where you can learn more about the study, see codebooks and methodology reports, and download the data (after registering). We received permission to distribute this data for the purpose of the book. Once the package is loaded, you can use the data immediately as follows:

head(anes_2020)
#>   V200001 CaseID V200002 InterviewMode  V200010b    Weight V200010c VarUnit
#> 1  200015 200015       3           Web 1.0057375 1.0057375        2       2
#> 2  200022 200022       3           Web 1.1634731 1.1634731        2       2
#> 3  200039 200039       3           Web 0.7686811 0.7686811        1       1
#> 4  200046 200046       3           Web 0.5210195 0.5210195        2       2
#> 5  200053 200053       3           Web 0.9657892 0.9657892        1       1
#> 6  200060 200060       3           Web 0.2347078 0.2347078        2       2
#>   V200010d Stratum V201006     CampaignInterest V201023 EarlyVote2020 V201024
#> 1        9       9       2  Somewhat interested      -1          <NA>      -1
#> 2       26      26       3  Not much interested      -1          <NA>      -1
#> 3       41      41       2  Somewhat interested      -1          <NA>      -1
#> 4       29      29       3  Not much interested      -1          <NA>      -1
#> 5       23      23       2  Somewhat interested      -1          <NA>      -1
#> 6       37      37       1 Very much interested      -1          <NA>      -1
#>   V201025x V201028 V201029 V201101 V201102 VotedPres2016 V201103
#> 1        3      -1      -1      -1       1           Yes       2
#> 2        3      -1      -1      -1       1           Yes       5
#> 3        3      -1      -1      -1       1           Yes       1
#> 4        3      -1      -1      -1       1           Yes       1
#> 5        3      -1      -1      -1       1           Yes       2
#> 6        3      -1      -1      -1       2            No      -1
#>   VotedPres2016_selection V201228 V201229 V201230 V201231x
#> 1                   Trump       2       1      -1        7
#> 2                   Other       5      -1       2        4
#> 3                 Clinton       3      -1       3        3
#> 4                 Clinton       2       2      -1        6
#> 5                   Trump       3      -1       2        4
#> 6                    <NA>       3      -1       3        3
#>                      PartyID V201233     TrustGovernment V201237
#> 1          Strong republican       5               Never       3
#> 2                Independent       5               Never       4
#> 3       Independent-democrat       4    Some of the time       4
#> 4 Not very strong republican       3 About half the time       2
#> 5                Independent       5               Never       4
#> 6       Independent-democrat       4    Some of the time       2
#>           TrustPeople V201507x Age    AgeGroup V201510   Education V201546
#> 1 About half the time       46  46       40-49       6  Bachelor's       1
#> 2    Some of the time       37  37       30-39       3     Post HS       2
#> 3    Some of the time       40  40       40-49       2 High school       2
#> 4    Most of the time       41  41       40-49       4     Post HS       2
#> 5    Some of the time       72  72 70 or older       8    Graduate       2
#> 6    Most of the time       71  71 70 or older       3     Post HS       2
#>   V201547a V201547b V201547c V201547d V201547e V201547z V201549x      RaceEth
#> 1       -3       -3       -3       -3       -3       -3        3     Hispanic
#> 2       -3       -3       -3       -3       -3       -3        4 Asian, NH/PI
#> 3       -3       -3       -3       -3       -3       -3        1        White
#> 4       -3       -3       -3       -3       -3       -3        4 Asian, NH/PI
#> 5       -3       -3       -3       -3       -3       -3        5        AI/AN
#> 6       -3       -3       -3       -3       -3       -3        1        White
#>   V201600 Gender V201607 V201610 V201611 V201613 V201615 V201616 V201617x
#> 1       1   Male      -3      -3      -3      -3      -3      -3       21
#> 2       2 Female      -3      -3      -3      -3      -3      -3       13
#> 3       2 Female      -3      -3      -3      -3      -3      -3       17
#> 4       1   Male      -3      -3      -3      -3      -3      -3        7
#> 5       1   Male      -3      -3      -3      -3      -3      -3       22
#> 6       2 Female      -3      -3      -3      -3      -3      -3        3
#>             Income         Income7 V202051 V202066 V202072 VotedPres2020
#> 1 $175,000-249,999   $125k or more      -1       1      -1          <NA>
#> 2   $70,000-74,999   $60k to < 80k      -1       4       1           Yes
#> 3 $100,000-109,999 $100k to < 125k      -1       4       1           Yes
#> 4   $35,000-39,999   $20k to < 40k      -1       4       1           Yes
#> 5 $250,000 or more   $125k or more      -1       4       1           Yes
#> 6   $15,000-19,999      Under $20k      -1       4       1           Yes
#>   V202073 V202109x V202110x VotedPres2020_selection
#> 1      -1        0       -1                    <NA>
#> 2       3        1        3                   Other
#> 3       1        1        1                   Biden
#> 4       1        1        1                   Biden
#> 5       2        1        2                   Trump
#> 6       1        1        1                   Biden

See ?anes_2020 for more information about the data.
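
In the output above, the raw ANES variables (names starting with V) carry negative codes such as -1 and -3, while derived variables such as EarlyVote2020 show NA in the same rows. The sketch below illustrates that kind of recode on made-up values. It assumes that negative codes mark non-substantive responses (check the ANES codebook for the meaning of each code), the name raw_code is hypothetical, and it is not necessarily how the derived variables in the package were built.

```r
# Made-up raw codes in the style of V201006; negative values are treated
# as missing here (an assumption; see the ANES codebook for their meaning).
raw_code <- c(2, 3, 1, -9, 2)

# Labels for codes 1 to 3 as they appear in the CampaignInterest column above.
interest <- factor(
  ifelse(raw_code < 0, NA, raw_code),
  levels = 1:3,
  labels = c("Very much interested", "Somewhat interested", "Not much interested")
)

# Tabulate the recoded values, keeping the missing category visible.
table(interest, useNA = "ifany")
```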

The package also includes a Stata version of the ANES data with a subset of the columns, likewise restricted to people who completed both the pre-election and post-election interviews. To load this dataset, we recommend using the {haven} package as follows:

anes_stata <- haven::read_dta(system.file("extdata", "anes_2020_stata_example.dta", package = "srvyrexploR"))

NCVS

The NCVS data is based on publicly available data for the 2021 NCVS. The NCVS is a survey conducted by the Bureau of Justice Statistics that asks people age 12 and over about their crime victimizations. The study has been conducted continuously since 1992. This package includes three datasets: one for household-level data (ncvs_2021_household), one for person-level data (ncvs_2021_person), and one for incident-level data (ncvs_2021_incident). Each includes a subset of the columns of the full 2021 data available from ICPSR. This data is reproduced here with permission from ICPSR.

head(ncvs_2021_household)
#> # A tibble: 6 × 12
#>   YEARQ IDHH    WGTHHCY V2117 V2118 V2015 V2143 SC214A V2122 V2126B V2127B V2129
#>   <dbl> <chr>     <dbl> <dbl> <dbl> <fct> <fct> <fct>  <fct> <fct>  <fct>  <fct>
#> 1 2021. 171005…      0    139     1 <NA>  3     12     33    0      2      3    
#> 2 2021. 171005…   1072.    63     2 2     2     8      32    17     2      1    
#> 3 2021. 171005…      0    140     1 <NA>  2     5      33    13     2      3    
#> 4 2021. 171005…      0    139     1 <NA>  3     13     33    0      2      3    
#> 5 2021. 171005…   1200.   138     1 1     2     11     29    18     2      1    
#> 6 2021. 171005…   1254.   138     1 1     2     8      24    13     2      2
head(ncvs_2021_person)
#> # A tibble: 6 × 11
#>   YEARQ IDHH           IDPER WGTPERCY V3014 V3015 V3018 V3023A V3024 V3084 V3086
#>   <dbl> <chr>          <chr>    <dbl> <dbl> <fct> <fct> <fct>  <fct> <fct> <fct>
#> 1 2021. 1710051365368… 1710…    1216.    84 3     2     1      2     6     2    
#> 2 2021. 1710053925458… 1710…    1362.    70 5     2     1      2     2     2    
#> 3 2021. 1710053925458… 1710…       0     43 5     1     1      2     <NA>  <NA> 
#> 4 2021. 1710053925458… 1710…       0     15 5     1     1      2     <NA>  <NA> 
#> 5 2021. 1710053965345… 1710…    1422.    89 1     2     1      2     2     2    
#> 6 2021. 1710053965345… 1710…       0     90 1     1     1      2     <NA>  <NA>
head(ncvs_2021_incident)
#> # A tibble: 6 × 60
#>   YEARQ IDHH     IDPER V4012 WGTVICCY V4016 V4017 V4018 V4019 V4021B V4022 V4024
#>   <dbl> <chr>    <chr> <dbl>    <dbl> <dbl> <fct> <fct> <fct> <fct>  <fct> <fct>
#> 1 2021. 1710071… 1710…     1    1780.     1 1     <NA>  <NA>  9      3     6    
#> 2 2021. 1710071… 1710…     1    1990.     2 1     <NA>  <NA>  8      3     7    
#> 3 2021. 1710071… 1710…     2    1990.     2 1     <NA>  <NA>  8      3     7    
#> 4 2021. 1710073… 1710…     1    4653.     1 1     <NA>  <NA>  1      3     5    
#> 5 2021. 1710074… 1710…     1    2302.     1 1     <NA>  <NA>  2      3     21   
#> 6 2021. 1710074… 1710…     1    2308.     1 1     <NA>  <NA>  8      3     5    
#> # ℹ 48 more variables: V4049 <fct>, V4050 <fct>, V4051 <fct>, V4052 <fct>,
#> #   V4053 <fct>, V4054 <fct>, V4055 <fct>, V4056 <fct>, V4057 <fct>,
#> #   V4058 <fct>, V4234 <fct>, V4235 <fct>, V4241 <fct>, V4242 <fct>,
#> #   V4243 <fct>, V4244 <fct>, V4245 <fct>, V4248 <dbl>, V4256 <fct>,
#> #   V4257 <fct>, V4258 <fct>, V4259 <fct>, V4260 <fct>, V4261 <fct>,
#> #   V4262 <fct>, V4263 <fct>, V4264 <fct>, V4265 <fct>, V4266 <fct>,
#> #   V4267 <fct>, V4268 <fct>, V4269 <fct>, V4270 <fct>, V4271 <fct>, …
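
The three NCVS files share identifier columns (IDHH in all three, IDPER in the person and incident files), which suggests they can be linked across levels. A minimal base R sketch of such a link follows, using small made-up data frames. Only the column names IDHH, IDPER, WGTHHCY, WGTPERCY, and WGTVICCY come from the output above; every value and the n_incidents column are hypothetical.

```r
# Toy stand-ins for the household, person, and incident files.
# All values are invented for illustration.
household <- data.frame(IDHH = c("H1", "H2"), WGTHHCY = c(1000, 1200))
person <- data.frame(
  IDHH = c("H1", "H1", "H2"),
  IDPER = c("H1-01", "H1-02", "H2-01"),
  WGTPERCY = c(1100, 1150, 1300)
)
incident <- data.frame(
  IDHH = c("H1", "H1", "H2"),
  IDPER = c("H1-01", "H1-01", "H2-01"),
  WGTVICCY = c(1700, 1700, 2000)
)

# Count incident rows per person (FUN = length counts rows in each group).
inc_counts <- aggregate(WGTVICCY ~ IDHH + IDPER, data = incident, FUN = length)
names(inc_counts)[3] <- "n_incidents"

# Attach the counts to the person file, keeping people with no incidents.
person_inc <- merge(person, inc_counts, by = c("IDHH", "IDPER"), all.x = TRUE)
person_inc$n_incidents[is.na(person_inc$n_incidents)] <- 0

# Attach household-level columns to each person row.
person_full <- merge(person_inc, household, by = "IDHH")
person_full
```

Keeping all person rows (all.x = TRUE) matters here: dropping people without incidents would remove them from the denominator of any victimization rate.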

NSDUH

The National Survey on Drug Use and Health (NSDUH) is an annual survey of the civilian, non-institutionalized population in the United States who are at least 12 years old. Topics include substance use (tobacco, alcohol, and illicit drugs including marijuana), mental health, and general health. This package provides a subset of the variables from the 2023 Public Use File. For more details about the study and the data, refer to the Methodological Summary and Definitions, Data User’s Guide, and Codebook.

head(nsduh_2023)
#> # A tibble: 6 × 22
#>   QUESTID2 ANALWT2_C VESTR_C VEREP NICVAPMON TOBMON ALCMON ILLMON ILTOBVAPALC
#>      <dbl>     <dbl>   <dbl> <dbl>     <int>  <int>  <int>  <int>       <int>
#> 1 10000053     3276.   40031     2         0      0      1      0           1
#> 2 10000679    15630.   40021     2         0      1      1      0           1
#> 3 10001208     4018.   40043     1         0      1      0      1           1
#> 4 10001260    10712.   40030     2         0      0      0      0           0
#> 5 10001588     8195.   40023     2         0      0      1      0           1
#> 6 10004996     3771.   40048     1         1      1      1      0           1
#> # ℹ 13 more variables: BNGDRKMON <int>, IRPYUD5ALC <int>, UD5ILLANY <int>,
#> #   UD5ILALANY <int>, YMDELT <fct>, YMDEYR <fct>, MDEIMPY <fct>, AMIPY <int>,
#> #   SMIPY <int>, AGE3 <fct>, NEWRACE2 <fct>, IRSEX <fct>, POVERTY3 <fct>

RECS

Three RECS files are included: the 2015 data with some derived variables created for the book (recs_2015), the 2020 data with some derived variables created for the book (recs_2020), and the 2020 data with the original variables (recs_2020_raw). RECS is a survey about energy consumption and expenditure among residential households in the United States and has been conducted since 1979 by the Energy Information Administration. More information about the original data is available at the RECS website.

head(recs_2015)
#> # A tibble: 6 × 141
#>   DOEID REGIONC Region    Division MSAStatus Urbanicity HousingUnitType YearMade
#>   <dbl>   <dbl> <fct>     <fct>    <fct>     <fct>      <fct>           <ord>   
#> 1 10001       4 West      Pacific  Metropol… Urban Area Single-family … 2000-20…
#> 2 10002       3 South     West So… None      Rural      Single-family … 1980-19…
#> 3 10003       3 South     East So… Metropol… Urban Area Single-family … 1970-19…
#> 4 10004       2 Midwest   West No… Micropol… Urban Clu… Single-family … 1950-19…
#> 5 10005       1 Northeast Middle … Metropol… Urban Area Single-family … 1970-19…
#> 6 10006       1 Northeast New Eng… None      Urban Clu… Apartment: 5 o… 1980-19…
#> # ℹ 133 more variables: SpaceHeatingUsed <lgl>, HeatingBehavior <fct>,
#> #   WinterTempDay <dbl>, WinterTempAway <dbl>, WinterTempNight <dbl>,
#> #   ACUsed <lgl>, ACBehavior <fct>, SummerTempDay <dbl>, SummerTempAway <dbl>,
#> #   SummerTempNight <dbl>, TOTCSQFT <dbl>, TOTHSQFT <dbl>, TOTSQFT_EN <dbl>,
#> #   TOTUCSQFT <dbl>, TOTUSQFT <dbl>, NWEIGHT <dbl>, BRRWT1 <dbl>, BRRWT2 <dbl>,
#> #   BRRWT3 <dbl>, BRRWT4 <dbl>, BRRWT5 <dbl>, BRRWT6 <dbl>, BRRWT7 <dbl>,
#> #   BRRWT8 <dbl>, BRRWT9 <dbl>, BRRWT10 <dbl>, BRRWT11 <dbl>, BRRWT12 <dbl>, …
head(recs_2020)
#> # A tibble: 6 × 100
#>    DOEID ClimateRegion_BA Urbanicity Region    REGIONC   Division     STATE_FIPS
#>    <dbl> <fct>            <fct>      <fct>     <chr>     <fct>        <chr>     
#> 1 100001 Mixed-Dry        Urban Area West      WEST      Mountain So… 35        
#> 2 100002 Mixed-Humid      Urban Area South     SOUTH     West South … 05        
#> 3 100003 Mixed-Dry        Urban Area West      WEST      Mountain So… 35        
#> 4 100004 Mixed-Humid      Urban Area South     SOUTH     South Atlan… 45        
#> 5 100005 Mixed-Humid      Urban Area Northeast NORTHEAST Middle Atla… 34        
#> 6 100006 Hot-Humid        Urban Area South     SOUTH     West South … 48        
#> # ℹ 93 more variables: state_postal <fct>, state_name <fct>, HDD65 <dbl>,
#> #   CDD65 <dbl>, HDD30YR <dbl>, CDD30YR <dbl>, HousingUnitType <fct>,
#> #   YearMade <ord>, TOTSQFT_EN <dbl>, TOTHSQFT <dbl>, TOTCSQFT <dbl>,
#> #   SpaceHeatingUsed <lgl>, ACUsed <lgl>, HeatingBehavior <fct>,
#> #   WinterTempDay <dbl>, WinterTempAway <dbl>, WinterTempNight <dbl>,
#> #   ACBehavior <fct>, SummerTempDay <dbl>, SummerTempAway <dbl>,
#> #   SummerTempNight <dbl>, NWEIGHT <dbl>, NWEIGHT1 <dbl>, NWEIGHT2 <dbl>, …
head(recs_2020_raw)
#> # A tibble: 6 × 789
#>    DOEID REGIONC   DIVISION        STATE_FIPS state_postal state_name BA_climate
#>    <dbl> <chr>     <chr>           <chr>      <chr>        <chr>      <chr>     
#> 1 100001 WEST      Mountain South  35         NM           New Mexico Mixed-Dry 
#> 2 100002 SOUTH     West South Cen… 05         AR           Arkansas   Mixed-Hum…
#> 3 100003 WEST      Mountain South  35         NM           New Mexico Mixed-Dry 
#> 4 100004 SOUTH     South Atlantic  45         SC           South Car… Mixed-Hum…
#> 5 100005 NORTHEAST Middle Atlantic 34         NJ           New Jersey Mixed-Hum…
#> 6 100006 SOUTH     West South Cen… 48         TX           Texas      Hot-Humid 
#> # ℹ 782 more variables: IECC_climate_code <chr>, UATYP10 <chr>, HDD65 <dbl>,
#> #   CDD65 <dbl>, HDD30YR_PUB <dbl>, CDD30YR_PUB <dbl>, TYPEHUQ <dbl>,
#> #   CELLAR <dbl>, CRAWL <dbl>, CONCRETE <dbl>, BASEOTH <dbl>, BASEFIN <dbl>,
#> #   ATTIC <dbl>, ATTICFIN <dbl>, STORIES <dbl>, PRKGPLC1 <dbl>,
#> #   SIZEOFGARAGE <dbl>, KOWNRENT <dbl>, YEARMADERANGE <dbl>, BEDROOMS <dbl>,
#> #   NCOMBATH <dbl>, NHAFBATH <dbl>, OTHROOMS <dbl>, TOTROOMS <dbl>,
#> #   STUDIO <dbl>, WALLTYPE <dbl>, ROOFTYPE <dbl>, HIGHCEIL <dbl>, …

CHIS

The CHIS data is a subset of variables from the 2023 California Health Interview Survey Adult Public Use File. CHIS is an annual survey of people in households in California that covers several topics related to health and social determinants of health. For more information about the study, refer to the CHIS website. To download a full version of the data with all variables or to view codebooks, create an account and download the public use files. See a snippet of the data below:

head(chis_2023)
#> # A tibble: 6 × 98
#>   PUF1Y_ID AH1V2 AH22  SMKCUR30 AB1    DIABETES BMI_P RBMI  AB17  DSTRS12 AB29V2
#>   <chr>    <fct> <fct> <fct>    <fct>  <fct>    <dbl> <fct> <fct> <fct>   <fct> 
#> 1 23021436 Yes   No    No       Very … No        35.6 Obes… No    No      No    
#> 2 23009146 Yes   No    No       Excel… No        23.0 Norm… No    No      No    
#> 3 23005039 Yes   No    No       Good   No        25.6 Over… Yes   No      Borde…
#> 4 23025815 Yes   Yes   No       Fair   No        42.5 Obes… No    No      Borde…
#> 5 23010158 Yes   No    No       Good   Yes       24.7 Norm… No    No      Yes   
#> 6 23006250 Yes   No    No       Excel… No        19.1 Norm… No    No      No    
#> # ℹ 87 more variables: SPK_ENG <fct>, POVLL2_P1V2 <dbl>, POVLL <fct>,
#> #   SRAGE_P1 <ord>, SRSEX <fct>, OMBSRR_P1 <fct>, RAKEDW0 <dbl>, RAKEDW1 <dbl>,
#> #   RAKEDW2 <dbl>, RAKEDW3 <dbl>, RAKEDW4 <dbl>, RAKEDW5 <dbl>, RAKEDW6 <dbl>,
#> #   RAKEDW7 <dbl>, RAKEDW8 <dbl>, RAKEDW9 <dbl>, RAKEDW10 <dbl>,
#> #   RAKEDW11 <dbl>, RAKEDW12 <dbl>, RAKEDW13 <dbl>, RAKEDW14 <dbl>,
#> #   RAKEDW15 <dbl>, RAKEDW16 <dbl>, RAKEDW17 <dbl>, RAKEDW18 <dbl>,
#> #   RAKEDW19 <dbl>, RAKEDW20 <dbl>, RAKEDW21 <dbl>, RAKEDW22 <dbl>, …

See ?chis_2023 for more information about the data.

Examples

To analyze the survey data, we recommend using the {srvyr} package as follows:

# install.packages("pak")
pak::pak("gergness/srvyr")
library(srvyr)
library(srvyrexploR) # provides recs_2020

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT, repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1", scale = 59 / 60, mse = TRUE,
    variables = c(ACUsed, Region)
  )

recs_des
#> Call: Called via srvyr
#> Unstratified cluster jacknife (JK1) with 60 replicates and MSE variances.
#> Sampling variables:
#>   - repweights: `NWEIGHT1 + NWEIGHT2 + NWEIGHT3 + NWEIGHT4 + NWEIGHT5 +
#>     NWEIGHT6 + NWEIGHT7 + NWEIGHT8 + NWEIGHT9 + NWEIGHT10 + NWEIGHT11 +
#>     NWEIGHT12 + NWEIGHT13 + NWEIGHT14 + NWEIGHT15 + NWEIGHT16 + NWEIGHT17 +
#>     NWEIGHT18 + NWEIGHT19 + NWEIGHT20 + NWEIGHT21 + NWEIGHT22 + NWEIGHT23 +
#>     NWEIGHT24 + NWEIGHT25 + NWEIGHT26 + NWEIGHT27 + NWEIGHT28 + NWEIGHT29 +
#>     NWEIGHT30 + NWEIGHT31 + NWEIGHT32 + NWEIGHT33 + NWEIGHT34 + NWEIGHT35 +
#>     NWEIGHT36 + NWEIGHT37 + NWEIGHT38 + NWEIGHT39 + NWEIGHT40 + NWEIGHT41 +
#>     NWEIGHT42 + NWEIGHT43 + NWEIGHT44 + NWEIGHT45 + NWEIGHT46 + NWEIGHT47 +
#>     NWEIGHT48 + NWEIGHT49 + NWEIGHT50 + NWEIGHT51 + NWEIGHT52 + NWEIGHT53 +
#>     NWEIGHT54 + NWEIGHT55 + NWEIGHT56 + NWEIGHT57 + NWEIGHT58 + NWEIGHT59 +
#>     NWEIGHT60` 
#>   - weights: NWEIGHT 
#> Data variables: 
#>   - ACUsed (lgl), Region (fct)

recs_des %>%
  group_by(Region) %>%
  summarize(
    p = survey_mean(ACUsed, vartype = "ci", proportion = TRUE, prop_method = "logit")
  )
#> # A tibble: 4 × 4
#>   Region        p p_low p_upp
#>   <fct>     <dbl> <dbl> <dbl>
#> 1 Northeast 0.890 0.877 0.901
#> 2 Midwest   0.933 0.922 0.943
#> 3 South     0.942 0.936 0.947
#> 4 West      0.745 0.729 0.760

The above example estimates the proportion of residential households that use air-conditioning by region with a 95% confidence interval.
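
To make the replicate-weight settings in the design more concrete: with mse = TRUE, a replicate-based variance is the scale factor times the sum of squared differences between each replicate estimate and the full-sample estimate. The sketch below works through that arithmetic in base R on made-up data with 4 replicates. The data, the grouping, and the way the replicate weights are built are assumptions for illustration (the real design uses the 60 replicate weights shipped with RECS and scale = 59 / 60), and this is not the internal implementation of {srvyr} or {survey}.

```r
# Made-up data: a 0/1 outcome and a constant full-sample weight.
y <- c(1, 0, 1, 1, 0, 1, 1, 0)
w <- rep(10, 8)

# Build delete-a-group jackknife replicate weights: each replicate drops one
# group of two units and scales the remaining weights by R / (R - 1).
R <- 4
group <- rep(1:R, each = 2)
repw <- sapply(1:R, function(r) ifelse(group == r, 0, w * R / (R - 1)))

# Full-sample and replicate estimates of the weighted proportion.
theta <- weighted.mean(y, w)
theta_r <- apply(repw, 2, function(wr) weighted.mean(y, wr))

# Replicate variance with deviations taken from the full-sample estimate
# (the mse = TRUE convention), using the scale (R - 1) / R.
scale <- (R - 1) / R
se <- sqrt(scale * sum((theta_r - theta)^2))

c(estimate = theta, se = se)
```

On this toy data the estimate is 0.625 and the standard error is 0.125. The same pattern, with 60 replicates and a logit-based interval, underlies the confidence limits in the table above.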

License

Data are available under the CC BY 4.0 license. Additionally, redistributing the ANES or NCVS datasets is subject to their policies.

Additional data use information

Anyone interested in redistributing the NCVS data should refer to ICPSR: Requests for Permission to Redistribute ICPSR Data.

Anyone interested in redistributing the ANES data should refer to the ANES FAQ - disseminate.

References

Data citations:

ANES:

  • American National Election Studies, 2021. ANES 2020 Time Series Study Full Release [dataset and documentation]. July 19, 2021 version. https://www.electionstudies.org

CHIS:

  • California Health Interview Survey. CHIS 2023 Adult Public Use Files. [Computer file]. UCLA Center for Health Policy Research, Los Angeles, CA. February 2025 version

NCVS:

  • United States. Bureau of Justice Statistics. National Crime Victimization Survey, [United States], 2021. Inter-university Consortium for Political and Social Research [distributor], 2022-09-19. https://doi.org/10.3886/ICPSR38429.v1

NSDUH:

RECS: