A Importing survey data into R
To analyze a survey, we need to bring the survey data into R. This process is often referred to as importing, loading, or reading in data. Survey files come in different formats depending on the software used to create them. One of the many advantages of R is its flexibility in handling various data formats, regardless of their file extensions. Here are examples of common public-use survey file formats we may encounter:
- Delimiter-separated text files
- Excel spreadsheets in .xls or .xlsx format
- R native .rda files
- Stata datasets in .dta format
- SAS datasets in .sas format
- SPSS datasets in .sav format
- Application Programming Interfaces (APIs), often in JavaScript Object Notation (JSON) format
- Data stored in databases
This appendix guides analysts through the process of importing these various types of survey data into R.
A.1 Importing delimiter-separated files into R
Delimiter-separated files use specific characters, known as delimiters, to separate values within the file. For example, CSV (comma-separated values) files use commas as delimiters, while TSV (tab-separated values) files use tabs. These file formats are widely used because of their simplicity and compatibility with various software applications.
The {readr} package, part of the tidyverse ecosystem, offers efficient ways to import delimiter-separated files into R (Wickham, Hester, and Bryan 2024). Its advantages include automatic data type detection and flexible handling of missing values. The {readr} package includes functions for:
- read_csv(): specifically designed to read CSV files
- read_tsv(): for TSV files
- read_delim(): handles a broader range of delimiter-separated files, including CSV and TSV; specify the delimiter using the delim argument
- read_fwf(): imports fixed-width files (FWF), where columns have predetermined widths and values are aligned in specific positions
- read_table(): for whitespace-separated files, such as those with spaces or multiple spaces as delimiters
- read_log(): reads and parses web log files
The syntax for read_csv() is:
read_csv(
file,
col_names = TRUE,
col_types = NULL,
col_select = NULL,
id = NULL,
locale = default_locale(),
na = c("", "NA"),
comment = "",
trim_ws = TRUE,
skip = 0,
n_max = Inf,
guess_max = min(1000, n_max),
name_repair = "unique",
num_threads = readr_threads(),
progress = show_progress(),
show_col_types = should_show_types(),
skip_empty_rows = TRUE,
lazy = should_read_lazy()
)
The arguments are:
- file: the path to the CSV file to import
- col_names: a value of TRUE imports the first row of the file as column names, which are not included in the data frame; a value of FALSE creates automated column names; alternatively, we can provide a vector of column names
- col_types: by default, R infers the column variable types; we can also provide a column specification using list() or cols() (for example, col_types = cols(.default = "c") reads all the columns as characters), or use a string to specify the variable types for each column
- col_select: the columns to include in the results
- id: a column for storing the file path, useful for keeping track of the input file when importing multiple CSVs at a time
- locale: the location-specific defaults for the file
- na: a character vector of values to interpret as missing
- comment: a character vector of values to interpret as comments
- trim_ws: a value of TRUE trims leading and trailing white space
- skip: number of lines to skip before importing the data
- n_max: maximum number of lines to read
- guess_max: maximum number of lines used for guessing column types
- name_repair: whether to check column names; by default, the column names are unique
- num_threads: the number of processing threads to use for initial parsing and lazy reading of data
- progress: a value of TRUE displays a progress bar
- show_col_types: a value of TRUE displays the column types
- skip_empty_rows: a value of TRUE ignores blank rows
- lazy: a value of TRUE reads values lazily
The other functions share a similar syntax to read_csv(). To find more details, run ?? followed by the function name. For example, run ??read_tsv in the Console for additional information on importing TSV files.
In the example below, we use {readr} to import a CSV file named ‘anes_timeseries_2020_csv_20220210.csv’ into an R object called anes_csv. The read_csv() function imports the file and stores the data in the anes_csv object. We can then use this object for further analysis.
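A minimal sketch, assuming the CSV file is saved in a local data/ folder (the folder and path are placeholders — adjust them to match your setup):

library(readr)

# Placeholder path -- substitute the location of your CSV file
anes_csv <-
  read_csv(file = "data/anes_timeseries_2020_csv_20220210.csv")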
A.2 Importing Excel files into R
Excel, a widely used spreadsheet software program created by Microsoft, is a common file format in survey research. We can import Excel spreadsheets into the R environment using the {readxl} package. The package supports both the legacy .xls files and the modern .xlsx format.
To import Excel data into R, we can use the read_excel() function from the {readxl} package. This function offers a range of options for the import process. Let’s explore the syntax:
read_excel(
path,
sheet = NULL,
range = NULL,
col_names = TRUE,
col_types = NULL,
na = "",
trim_ws = TRUE,
skip = 0,
n_max = Inf,
guess_max = min(1000, n_max),
progress = readxl_progress(),
.name_repair = "unique"
)
The arguments are:
- path: the path to the Excel file to import
- sheet: the name or index of the sheet (sometimes called tabs) within the Excel file
- range: the range of cells to import (for example, P15:T87)
- col_names: indicates whether the first row of the dataset contains column names
- col_types: specifies the data types of columns
- na: defines the representation of missing values (for example, NULL)
- trim_ws: controls whether leading and trailing whitespaces should be trimmed
- skip and n_max: enable skipping rows and limiting the number of rows imported
- guess_max: sets the maximum number of rows used for data type guessing
- progress: specifies a progress bar for large imports
- .name_repair: determines how column names are repaired if they are not valid
In the code example below, we import an Excel spreadsheet named ‘anes_timeseries_2020_csv_20220210.xlsx’ into R. The resulting data are saved as a tibble in the anes_excel object, ready for further analysis.
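A minimal sketch, again assuming the spreadsheet sits in a local data/ folder (the path is a placeholder):

library(readxl)

# Placeholder path -- substitute the location of your Excel file
anes_excel <-
  read_excel(path = "data/anes_timeseries_2020_csv_20220210.xlsx")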
A.3 Importing Stata, SAS, and SPSS files into R
The {haven} package, also from the tidyverse ecosystem, imports various proprietary data formats: Stata .dta files, SPSS .sav files, and SAS .sas7bdat and .sas7bcat files (Wickham, Miller, and Smith 2023). One of the notable strengths of the {haven} package is its ability to handle multiple proprietary formats within a unified framework. It offers dedicated functions for each supported proprietary format, making it straightforward to import data regardless of the program. Here, we introduce read_dta() for Stata files, read_sav() for SPSS files, and read_sas() for SAS files.
A.3.1 Syntax
Let’s explore the syntax for importing Stata .dta files using haven::read_dta():
read_dta(
file,
encoding = NULL,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = "unique"
)
The arguments are:
- file: the path to the proprietary data file to import
- encoding: specifies the character encoding of the data file
- col_select: selects specific columns for import
- skip and n_max: control the number of rows skipped and the maximum number of rows imported
- .name_repair: determines how column names are repaired if they are not valid
The syntax for read_sav() is similar to read_dta():
read_sav(
file,
encoding = NULL,
user_na = FALSE,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = "unique"
)
The arguments are:
- file: the path to the proprietary data file to import
- encoding: specifies the character encoding of the data file
- col_select: selects specific columns for import
- user_na: a value of TRUE reads variables with user-defined missing labels into labelled_spss() objects
- skip and n_max: control the number of rows skipped and the maximum number of rows imported
- .name_repair: determines how column names are repaired if they are not valid
The syntax for importing SAS files with read_sas() is as follows:
read_sas(
data_file,
catalog_file = NULL,
encoding = NULL,
catalog_encoding = encoding,
col_select = NULL,
skip = 0L,
n_max = Inf,
.name_repair = "unique"
)
The arguments are:
- data_file: the path to the proprietary data file to import
- catalog_file: the path to the catalog file to import
- encoding: specifies the character encoding of the data file
- catalog_encoding: specifies the character encoding of the catalog file
- col_select: selects specific columns for import
- skip and n_max: control the number of rows skipped and the maximum number of rows imported
- .name_repair: determines how column names are repaired if they are not valid
In the code examples below, we demonstrate how to import Stata, SPSS, and SAS files into R using the respective {haven} functions. The resulting data are stored in the anes_dta, anes_sav, and anes_sas objects as tibbles, ready for use in R. For the Stata example, we show how to import the data from the {srvyrexploR} package to use in examples.
Stata:
library(haven)
anes_dta <-
read_dta(file = system.file("extdata",
"anes_2020_stata_example.dta",
package = "srvyrexploR"
))
SPSS:
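A sketch with a hypothetical file path — substitute the location of your .sav file:

library(haven)

# Placeholder path -- the file name here is an assumption
anes_sav <-
  read_sav(file = "data/anes_2020_spss_example.sav")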
SAS:
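As with the SPSS example, the path below is a placeholder for your .sas7bdat file:

library(haven)

# Placeholder path -- the file name here is an assumption
anes_sas <-
  read_sas(data_file = "data/anes_2020_sas_example.sas7bdat")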
A.3.2 Working with labeled data
Stata, SPSS, and SAS files can contain labeled variables and values. These labels provide descriptive information about categorical data, making them easier to understand and analyze. When importing data from Stata, SPSS, or SAS, we want to preserve these labels to maintain data fidelity.
Consider a variable like ‘Education Level’ with coded values (e.g., 1, 2, 3). Without labels, these codes can be cryptic. However, with labels (‘High School Graduate,’ ‘Bachelor’s Degree,’ ‘Master’s Degree’), the data become more informative and easier to work with.
With the {haven} package, we can import and work with labeled data from Stata, SPSS, and SAS files. The package uses a special class of data called haven_labelled to store labeled variables. When a dataset label is defined in Stata, it is stored in the ‘label’ attribute of the tibble when imported, ensuring that the information is not lost.
We can use functions like select(), glimpse(), and is.labelled() to inspect the imported data and verify if the variables are labeled. Take a look at the ANES Stata file. Notice that the categorical variables V200002 and V201006 are marked with a type of <dbl+lbl>. This notation indicates that these variables are labeled.
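A call along these lines produces the output below (assuming, as the output suggests, that the example extract contains just these six variables):

anes_dta %>%
  glimpse()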
## Rows: 7,453
## Columns: 6
## $ V200001 <dbl> 200015, 200022, 200039, 200046, 200053, 200060, 20008…
## $ V200002 <dbl+lbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ V200010b <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658, 0.2347, 0.440…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 38, 21, …
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,…
## $ V201006 <dbl+lbl> 2, 3, 2, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2, 1, 2, 1, 1…
We can confirm their label status using the haven::is.labelled() function.
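For example, checking the V200002 variable:

haven::is.labelled(anes_dta$V200002)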
## [1] TRUE
To explore the labels further, we can use the attributes() function. This function provides insights into both the variable label ($label) and the associated value labels ($labels).
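Applying it to V200002 returns the output below:

attributes(anes_dta$V200002)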
## $label
## [1] "Mode of interview: pre-election interview"
##
## $format.stata
## [1] "%10.0g"
##
## $class
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $labels
## 1. Video 2. Telephone 3. Web
## 1 2 3
When we import a labeled dataset using {haven}, it results in a tibble containing both the data and label information. However, this is meant to be an intermediary data structure and not intended to be the final data format for analysis. Instead, we should convert it into a regular R data frame before continuing our data workflow. There are two primary methods to achieve this conversion: (1) convert to factors or (2) remove the labels.
Option 1: Convert the vector into a factor
Factors are native R data types for working with categorical data. They consist of integer values that correspond to character values, known as levels. Below is a dummy example of factors. The factors object shows the four different levels in the data: strongly agree, agree, disagree, and strongly disagree.
response <-
c("strongly agree", "agree", "agree", "disagree", "strongly disagree")
response_levels <-
c("strongly agree", "agree", "disagree", "strongly disagree")
factors <- factor(response, levels = response_levels)
factors
## [1] strongly agree agree agree
## [4] disagree strongly disagree
## Levels: strongly agree agree disagree strongly disagree
Factors are integer vectors, though they may look like character strings. We can confirm by looking at the vector’s structure:
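Using str() on the factors object:

str(factors)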
## Factor w/ 4 levels "strongly agree",..: 1 2 2 3 4
R’s factors differ from Stata, SPSS, or SAS labeled vectors. However, we can convert labeled variables into factors using the as_factor() function.
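One way to produce the tibble below (a sketch; select() keeps just the column we convert):

anes_dta %>%
  select(V200002) %>%
  as_factor()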
## # A tibble: 7,453 × 1
## V200002
## <fct>
## 1 3. Web
## 2 3. Web
## 3 3. Web
## 4 3. Web
## 5 3. Web
## 6 3. Web
## 7 3. Web
## 8 3. Web
## 9 3. Web
## 10 3. Web
## # ℹ 7,443 more rows
The as_factor() function can be applied to all columns in a data frame or to individual ones. Below, we convert all <dbl+lbl> columns into factors.
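A sketch of the conversion — applied to the whole tibble, as_factor() converts only the labeled columns:

anes_dta %>%
  as_factor() %>%
  glimpse()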
## Rows: 7,453
## Columns: 6
## $ V200001 <dbl> 200015, 200022, 200039, 200046, 200053, 200060, 20008…
## $ V200002 <fct> 3. Web, 3. Web, 3. Web, 3. Web, 3. Web, 3. Web, 3. We…
## $ V200010b <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658, 0.2347, 0.440…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 38, 21, …
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,…
## $ V201006 <fct> 2. Somewhat interested, 3. Not much interested, 2. So…
Option 2: Strip the labels
The second option is to remove the labels altogether, converting the labeled data into a regular R data frame. To remove, or ‘zap,’ the labels from our tibble, we can use the {haven} package’s zap_label() and zap_labels() functions. This approach removes the labels but retains the data values in their original form.
The ANES Stata file columns contain variable labels. Using the map() function from {purrr}, we can review the labels using attr(). In the example below, we list the first two variables and their labels. For instance, the label for V200002 is “Mode of interview: pre-election interview.”
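A sketch of this review (assuming {purrr} is loaded, for example via library(tidyverse)):

anes_dta %>%
  select(1:2) %>%
  map(~ attr(.x, "label"))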
## $V200001
## [1] "2020 Case ID"
##
## $V200002
## [1] "Mode of interview: pre-election interview"
Use zap_label() to remove the variable labels but retain the value labels. In the output below, V200001 (which has no value labels) returns NULL, while V200002 retains its value labels.
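A sketch that reproduces the output below — after zapping the variable labels, we map over the value-label attribute (labels):

anes_dta %>%
  zap_label() %>%
  select(1:2) %>%
  map(~ attr(.x, "labels"))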
## $V200001
## NULL
##
## $V200002
## 1. Video 2. Telephone 3. Web
## 1 2 3
To remove the value labels, use zap_labels(). Notice that the previous <dbl+lbl> columns are now <dbl>.
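A sketch of the conversion:

anes_dta %>%
  zap_labels() %>%
  glimpse()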
## Rows: 7,453
## Columns: 6
## $ V200001 <dbl> 200015, 200022, 200039, 200046, 200053, 200060, 20008…
## $ V200002 <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ V200010b <dbl> 1.0057, 1.1635, 0.7687, 0.5210, 0.9658, 0.2347, 0.440…
## $ V200010d <dbl> 9, 26, 41, 29, 23, 37, 7, 37, 32, 41, 22, 7, 38, 21, …
## $ V200010c <dbl> 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,…
## $ V201006 <dbl> 2, 3, 2, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1,…
While it is important to convert labeled datasets into regular R data frames for working in R, the labels themselves often contain valuable information that provides context and meaning to the survey variables. To aid with interpretability and documentation, we can create a data dictionary from the labeled dataset. A data dictionary is a reference document that provides detailed information about the variables and values of a survey.
The {labelled} package offers a convenient function, generate_dictionary(), that creates data dictionaries directly from a labeled dataset (Larmarange 2024). This function extracts variable labels, value labels, and other metadata and organizes them into a structured document that we can browse and reference throughout our analysis.
Let’s create a data dictionary from the ANES Stata dataset as an example:
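A minimal sketch — the object name dictionary is our choice:

library(labelled)

dictionary <- generate_dictionary(anes_dta)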
Once we’ve generated the data dictionary, we can take a look at the V200002 variable and see the label, column type, number of missing entries, and associated values.
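For example, filtering the dictionary like a data frame (a sketch, using the hypothetical dictionary object created above):

dictionary %>%
  filter(variable == "V200002")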
## pos variable label col_type missing values
## 2 V200002 Mode of interview: pre~ dbl+lbl 0 [1] 1. Video
## [2] 2. Telephone
## [3] 3. Web
A.3.3 Labeled missing data values
In survey data analysis, dealing with missing values is a crucial aspect of data preparation. Stata, SPSS, and SAS files each have their own method for handling missing values.
- Stata has “extended” missing values, .A through .Z.
- SAS has “special” missing values, .A through .Z and ._.
- SPSS has per-column “user” missing values. Each column can declare up to three distinct values or a range of values (plus one distinct value) that should be treated as missing.
SAS and Stata use a concept known as ‘tagged’ missing values, which extend R’s regular NA. A ‘tagged’ missing value is essentially an NA with an additional single-character label. These values behave identically to regular NA in standard R operations while preserving the informative tag associated with the missing value.
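As a small illustration, a sketch using {haven}’s tagged_na() and na_tag() helpers:

library(haven)

# Two ordinary values plus two tagged missing values
x <- c(1, 2, tagged_na("d"), tagged_na("i"))

is.na(x)  # TRUE for both tagged values -- they behave like regular NA
na_tag(x) # recovers the tags: NA, NA, "d", "i"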
Here is an example from NORC at the University of Chicago’s 2018 General Social Survey, where Don’t Know (DK) responses are tagged as NA(d), Inapplicable (IAP) responses are tagged as NA(i), and No Answer responses are tagged as NA(n) (Davern et al. 2021).
head(gss_dta$HEALTH)
#> <labelled<double>[6]>: condition of health
#> [1] 2 1 NA(i) NA(i) 1 2
#>
#> Labels:
#> value label
#> 1 excellent
#> 2 good
#> 3 fair
#> 4 poor
#> NA(d) DK
#> NA(i) IAP
#> NA(n) NA
In contrast, SPSS uses a different approach called ‘user-defined values’ to denote missing values. Each column in an SPSS dataset can have up to three distinct values designated as missing or a specified range of missing values. To model these additional user-defined missing values, {haven} provides the labelled_spss() subclass of labelled(). When importing SPSS data, {haven} ensures that user-defined missing values are correctly handled, and we can work with these data in R while preserving the unique missing value conventions from SPSS.
Here is what the GSS SPSS dataset looks like when loaded with {haven}.
head(gss_sps$HEALTH)
#> <labelled_spss<double>[6]>: Condition of health
#> [1] 2 1 0 0 1 2
#> Missing values: 0, 8, 9
#>
#> Labels:
#> value label
#> 0 IAP
#> 1 EXCELLENT
#> 2 GOOD
#> 3 FAIR
#> 4 POOR
#> 8 DK
#> 9 NA
A.4 Importing data from APIs into R
In addition to working with data saved as files, we may also need to retrieve data through Application Programming Interfaces (APIs). APIs provide a structured way to access data hosted on external servers and import them directly into R for analysis.
To access these data, we need to understand how to construct API requests. Each API has unique endpoints, parameters, and authentication requirements. Pay attention to:
- Endpoints: These are URLs that point to specific data or services
- Parameters: Information passed to the API to customize the request (e.g., date ranges, filters)
- Authentication: APIs may require API keys or tokens for access
- Rate Limits: APIs may have usage limits, so be aware of any rate limits or quotas
Typically, we begin by making a GET request to an API endpoint. The {httr2} package allows us to generate and process HTTP requests (Wickham 2024). We can make the GET request by pointing to the URL that contains the data we would like:
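A dummy example with {httr2} (the URL is a placeholder):

library(httr2)

# Placeholder URL -- replace with the endpoint for your API
response <- request("https://api.example.com/survey-data") %>%
  req_perform()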
Once we make the request, we obtain the data as the response. The data often come in JSON format. We can extract and parse the data using the {jsonlite} package, allowing us to work with them in R (Ooms 2014). The fromJSON() function, shown below, converts JSON data to an R object.
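Continuing the dummy example, we extract the response body as text and parse it (the object names are our choices):

library(jsonlite)

# Extract the JSON body of the response as text, then parse it into an R object
json_text <- resp_body_string(response)
survey_data <- fromJSON(json_text)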
Note that these are dummy examples. Please review the documentation to understand how to make requests from a specific API.
R offers several packages that simplify API access by providing ready-to-use functions for popular APIs. These packages are called “wrappers,” as they “wrap” the API in R to make it easier to use. For example, the {tidycensus} package used in this book simplifies access to U.S. Census data, allowing us to retrieve data with R commands instead of writing API requests from scratch (Walker and Herman 2024). Behind the scenes, get_pums() makes a GET request to the Census API, and the {tidycensus} functions convert the response into an R-friendly format. For example, if we are interested in the age, sex, race, and Hispanicity of those in the American Community Survey sample of Durham County, North Carolina, we can use the get_pums() function to extract the microdata as shown in the code below. We can then use the replicate weights to create a survey object and calculate estimates for Durham County.
library(tidycensus)
durh_pums <- get_pums(
variables = c("PUMA", "SEX", "AGEP", "RAC1P", "HISP"),
state = "NC",
puma = c("01301", "01302"),
survey = "acs1",
year = 2022,
rep_weights = "person"
)
## Getting data from the 2022 1-year ACS Public Use Microdata Sample
## Warning: • You have not set a Census API key. Users without a key are limited to 500
## queries per day and may experience performance limitations.
## ℹ For best results, get a Census API key at
## http://api.census.gov/data/key_signup.html and then supply the key to the
## `census_api_key()` function to use it throughout your tidycensus session.
## This warning is displayed once per session.
## # A tibble: 2,724 × 90
## SERIALNO SPORDER AGEP PUMA ST SEX HISP RAC1P WGTP PWGTP
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2022HU0937941 1 60 01302 37 2 01 1 132 132
## 2 2022HU0937941 2 61 01302 37 1 01 1 132 107
## 3 2022HU0938759 1 44 01301 37 1 01 1 60 61
## 4 2022HU0938759 2 48 01301 37 2 01 1 60 63
## 5 2022HU0938759 3 19 01301 37 1 01 1 60 107
## 6 2022HU0938759 4 16 01301 37 2 01 1 60 50
## 7 2022HU0938759 5 12 01301 37 2 01 1 60 84
## 8 2022HU0939904 1 53 01302 37 1 01 1 104 104
## 9 2022HU0939904 2 53 01302 37 1 01 1 104 101
## 10 2022HU0941348 1 70 01301 37 1 01 1 77 77
## # ℹ 2,714 more rows
## # ℹ 80 more variables: PWGTP1 <dbl>, PWGTP2 <dbl>, PWGTP3 <dbl>,
## # PWGTP4 <dbl>, PWGTP5 <dbl>, PWGTP6 <dbl>, PWGTP7 <dbl>,
## # PWGTP8 <dbl>, PWGTP9 <dbl>, PWGTP10 <dbl>, PWGTP11 <dbl>,
## # PWGTP12 <dbl>, PWGTP13 <dbl>, PWGTP14 <dbl>, PWGTP15 <dbl>,
## # PWGTP16 <dbl>, PWGTP17 <dbl>, PWGTP18 <dbl>, PWGTP19 <dbl>,
## # PWGTP20 <dbl>, PWGTP21 <dbl>, PWGTP22 <dbl>, PWGTP23 <dbl>, …
In Chapter 4, we used the {censusapi} package to get data from the Census data API for the Current Population Survey. To discover if there is an R package that directly interfaces with a specific survey or data source, search for “[survey] R wrapper” or “[data source] R package” online.
A.5 Importing data from databases in R
Databases provide a secure and organized solution as the volume and complexity of data grow. We can access, manage, and update data stored in databases in a systematic way. Because of how the data are organized, teams can draw from the same source and obtain any metadata that would be helpful for analysis.
There are various ways of using R to work with databases. If using RStudio, we can connect to different databases through the Connections Pane in the top right of the IDE. We can also use packages like {DBI} and {odbc} to access database tables in R files. Here is an example script connecting to a database:
con <-
DBI::dbConnect(
odbc::odbc(),
Driver = "[driver name]",
Server = "[server path]",
UID = rstudioapi::askForPassword("Database user"),
PWD = rstudioapi::askForPassword("Database password"),
Database = "[database name]",
Warehouse = "[warehouse name]",
Schema = "[schema name]"
)
The {dbplyr} and {dplyr} packages allow us to make queries and run data analysis entirely using {dplyr} syntax. All of the code can be written in R, so we do not have to switch between R and SQL to explore the data. Here is some sample code:
library(dplyr)
library(dbplyr)

# Build a lazy query against the "bank" table; {dbplyr} translates it to SQL
q1 <- tbl(con, "bank") %>%
  group_by(month_idx, year, month) %>%
  summarize(subscribe = sum(ifelse(term_deposit == "yes", 1, 0)),
            total = n())

# Inspect the generated SQL
show_query(q1)
Be sure to check the documentation to configure a database connection.
A.6 Importing data from other formats
R also offers dedicated packages such as {googlesheets4} for Google Sheets or {qualtRics} for Qualtrics. With less common or proprietary file formats, the broader data science community can often provide guidance. Online resources like Stack Overflow and dedicated forums like Posit Community are valuable sources of information for importing data into R.
References
The public use microdata areas (PUMA) for Durham County were identified using the 2020 PUMA Names File: https://www2.census.gov/geo/pdfs/reference/puma2020/2020_PUMA_Names.pdf