Chapter 11 Missing data

Prerequisites

For this chapter, load the following packages:

library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
library(naniar)
library(haven)
library(gt)

We are using data from ANES and RECS described in Chapter 4. As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter 4 for more information).

targetpop <- 231592693

anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)

anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )

For RECS, details are included in the RECS documentation and Chapter 10.

recs_des <- recs_2020 %>%
  as_survey_rep(
    weights = NWEIGHT,
    repweights = NWEIGHT1:NWEIGHT60,
    type = "JK1",
    scale = 59 / 60,
    mse = TRUE
  )

11.1 Introduction

Missing data in surveys refer to situations where participants do not provide complete responses to survey questions. Respondents may not have seen a question by design. Or, they may not respond to a question for various other reasons, such as not wanting to answer a particular question, not understanding the question, or simply forgetting to answer. Missing data are important to consider and account for, as they can introduce bias and reduce the representativeness of the data. This chapter provides an overview of the types of missing data, how to assess missing data in surveys, and how to conduct analysis when missing data are present. Understanding this complex topic can help ensure accurate reporting of survey results and provide insight into potential changes to the survey design for the future.

11.2 Missing data mechanisms

There are two main categories that missing data typically fall into: missing by design and unintentional missing data. Missing by design is part of the survey plan and can be more easily incorporated into weights and analyses. Unintentional missing data, on the other hand, can lead to bias in survey estimates if not correctly accounted for. Below we provide more information on the types of missing data.

Missing by design/questionnaire skip logic: This type of missingness occurs when certain respondents are intentionally directed to skip specific questions based on their previous responses or characteristics. For example, in a survey about employment, if a respondent indicates that they are not employed, they may be directed to skip questions related to their job responsibilities. Additionally, some surveys randomize questions or modules so that not all participants respond to all questions. In these instances, respondents would have missing data for the modules not randomly assigned to them.
Unintentional missing data: This type of missingness occurs when researchers do not intend for there to be missing data on a particular question, for example, if respondents did not finish the survey or refused to answer individual questions. There are three main types of unintentional missing data that each should be considered and handled differently (Mack, Su, and Westreich 2018; Schafer and Graham 2002):
1. Missing completely at random (MCAR): The missing data are unrelated to both observed and unobserved data, and the probability of being missing is the same across all cases. For example, a respondent may miss a question because they had to leave the survey early due to an emergency.
2. Missing at random (MAR): The missing data are related to observed data but not unobserved data, and the probability of being missing is the same within groups. For example, if we know the respondents’ ages, older respondents may choose not to answer specific questions that younger respondents do answer.
3. Missing not at random (MNAR): The missing data are related to unobserved data, and the probability of being missing varies for reasons we are not measuring. For example, respondents with depression may not answer a question about depression severity.
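To make the three mechanisms concrete, here is a small simulation sketch in base R. Everything in it is hypothetical: the variables `age` and `income`, the sample size, and the missingness probabilities are invented for illustration and are not drawn from ANES or RECS.

```r
# Hypothetical simulation: how each mechanism affects a naive (complete-case) mean
set.seed(2024)
n <- 10000
age <- sample(18:80, n, replace = TRUE)
# The outcome rises with age, so age carries information about the outcome
income <- 20 + 0.5 * age + rnorm(n, sd = 5)

# MCAR: every case has the same 30% chance of being missing
miss_mcar <- runif(n) < 0.3
# MAR: missingness depends only on an observed variable (age group)
miss_mar <- runif(n) < ifelse(age >= 60, 0.6, 0.1)
# MNAR: missingness depends on the value that would have been reported
miss_mnar <- runif(n) < plogis((income - mean(income)) / 5)

means <- c(
  truth = mean(income),
  mcar = mean(income[!miss_mcar]),
  mar = mean(income[!miss_mar]),
  mnar = mean(income[!miss_mnar])
)
round(means, 1)

# Under MAR, comparing within the observed group removes the distortion
older <- age >= 60
mar_within <- c(
  truth_older = mean(income[older]),
  observed_older = mean(income[older & !miss_mar])
)
round(mar_within, 1)
```

In this sketch, the MCAR mean stays close to the truth, while the naive MAR and MNAR means are pulled downward because cases with higher values are more likely to be missing. Within an age group, the MAR mean is again close to the truth, which illustrates why MAR can be addressed with observed covariates while MNAR cannot.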

11.3 Assessing missing data

Before beginning an analysis, we should explore the data to determine if there are missing data and what types of missing data are present. Conducting descriptive analysis can help with the analysis and reporting of survey data and can inform the survey design in future studies. For example, large amounts of unexpected missing data may indicate the questions were unclear or difficult to recall. There are several ways to explore missing data, which we walk through below. When assessing the missing data, we recommend using a data.frame object and not the survey object, as most of the analysis is about patterns of records, and weights are not necessary.

11.3.1 Summarize data

A very rudimentary first exploration is to use the summary() function to summarize the data, which illuminates NA values in the data. Let’s look at a few analytic variables on the ANES 2020 data using summary():

anes_2020 %>%
  select(V202051:EarlyVote2020) %>%
  summary()

##     V202051                Income7                  Income    
##  Min.   :-9.000   $125k or more:1468   Under $9,999    : 647  
##  1st Qu.:-1.000   Under $20k   :1076   $50,000-59,999  : 485  
##  Median :-1.000   $20k to < 40k:1051   $100,000-109,999: 451  
##  Mean   :-0.726   $40k to < 60k: 984   $250,000 or more: 405  
##  3rd Qu.:-1.000   $60k to < 80k: 920   $80,000-89,999  : 383  
##  Max.   : 3.000   (Other)      :1437   (Other)         :4565  
##                   NA's         : 517   NA's            : 517  
##     V201617x       V201616      V201615      V201613      V201611  
##  Min.   :-9.0   Min.   :-3   Min.   :-3   Min.   :-3   Min.   :-3  
##  1st Qu.: 4.0   1st Qu.:-3   1st Qu.:-3   1st Qu.:-3   1st Qu.:-3  
##  Median :11.0   Median :-3   Median :-3   Median :-3   Median :-3  
##  Mean   :10.4   Mean   :-3   Mean   :-3   Mean   :-3   Mean   :-3  
##  3rd Qu.:17.0   3rd Qu.:-3   3rd Qu.:-3   3rd Qu.:-3   3rd Qu.:-3  
##  Max.   :22.0   Max.   :-3   Max.   :-3   Max.   :-3   Max.   :-3  
##                                                                    
##     V201610      V201607      Gender        V201600     
##  Min.   :-3   Min.   :-3   Male  :3375   Min.   :-9.00  
##  1st Qu.:-3   1st Qu.:-3   Female:4027   1st Qu.: 1.00  
##  Median :-3   Median :-3   NA's  :  51   Median : 2.00  
##  Mean   :-3   Mean   :-3                 Mean   : 1.47  
##  3rd Qu.:-3   3rd Qu.:-3                 3rd Qu.: 2.00  
##  Max.   :-3   Max.   :-3                 Max.   : 2.00  
##                                                         
##                 RaceEth        V201549x       V201547z     V201547e 
##  White              :5420   Min.   :-9.0   Min.   :-3   Min.   :-3  
##  Black              : 650   1st Qu.: 1.0   1st Qu.:-3   1st Qu.:-3  
##  Hispanic           : 662   Median : 1.0   Median :-3   Median :-3  
##  Asian, NH/PI       : 248   Mean   : 1.5   Mean   :-3   Mean   :-3  
##  AI/AN              : 155   3rd Qu.: 2.0   3rd Qu.:-3   3rd Qu.:-3  
##  Other/multiple race: 237   Max.   : 6.0   Max.   :-3   Max.   :-3  
##  NA's               :  81                                           
##     V201547d     V201547c     V201547b     V201547a     V201546     
##  Min.   :-3   Min.   :-3   Min.   :-3   Min.   :-3   Min.   :-9.00  
##  1st Qu.:-3   1st Qu.:-3   1st Qu.:-3   1st Qu.:-3   1st Qu.: 2.00  
##  Median :-3   Median :-3   Median :-3   Median :-3   Median : 2.00  
##  Mean   :-3   Mean   :-3   Mean   :-3   Mean   :-3   Mean   : 1.84  
##  3rd Qu.:-3   3rd Qu.:-3   3rd Qu.:-3   3rd Qu.:-3   3rd Qu.: 2.00  
##  Max.   :-3   Max.   :-3   Max.   :-3   Max.   :-3   Max.   : 2.00  
##                                                                     
##         Education       V201510             AgeGroup         Age      
##  Less than HS: 312   Min.   :-9.00   18-29      : 871   Min.   :18.0  
##  High school :1160   1st Qu.: 3.00   30-39      :1241   1st Qu.:37.0  
##  Post HS     :2514   Median : 5.00   40-49      :1081   Median :53.0  
##  Bachelor's  :1877   Mean   : 5.62   50-59      :1200   Mean   :51.8  
##  Graduate    :1474   3rd Qu.: 6.00   60-69      :1436   3rd Qu.:66.0  
##  NA's        : 116   Max.   :95.00   70 or older:1330   Max.   :80.0  
##                                      NA's       : 294   NA's   :294   
##     V201507x                 TrustPeople      V201237     
##  Min.   :-9.0   Always             :  48   Min.   :-9.00  
##  1st Qu.:35.0   Most of the time   :3511   1st Qu.: 2.00  
##  Median :51.0   About half the time:2020   Median : 3.00  
##  Mean   :49.4   Some of the time   :1597   Mean   : 2.78  
##  3rd Qu.:66.0   Never              : 264   3rd Qu.: 3.00  
##  Max.   :80.0   NA's               :  13   Max.   : 5.00  
##                                                           
##             TrustGovernment    V201233     
##  Always             :  80   Min.   :-9.00  
##  Most of the time   :1016   1st Qu.: 3.00  
##  About half the time:2313   Median : 4.00  
##  Some of the time   :3313   Mean   : 3.43  
##  Never              : 702   3rd Qu.: 4.00  
##  NA's               :  29   Max.   : 5.00  
##                                            
##                      PartyID        V201231x        V201230      
##  Strong democrat         :1796   Min.   :-9.00   Min.   :-9.000  
##  Strong republican       :1545   1st Qu.: 2.00   1st Qu.:-1.000  
##  Independent-democrat    : 881   Median : 4.00   Median :-1.000  
##  Independent             : 876   Mean   : 3.83   Mean   : 0.013  
##  Not very strong democrat: 790   3rd Qu.: 6.00   3rd Qu.: 1.000  
##  (Other)                 :1540   Max.   : 7.00   Max.   : 3.000  
##  NA's                    :  25                                   
##     V201229          V201228      VotedPres2016_selection
##  Min.   :-9.000   Min.   :-9.00   Clinton:2911           
##  1st Qu.:-1.000   1st Qu.: 1.00   Trump  :2466           
##  Median : 1.000   Median : 2.00   Other  : 390           
##  Mean   : 0.515   Mean   : 1.99   NA's   :1686           
##  3rd Qu.: 1.000   3rd Qu.: 3.00                          
##  Max.   : 2.000   Max.   : 5.00                          
##                                                          
##     V201103      VotedPres2016    V201102          V201101      
##  Min.   :-9.00   Yes :5810     Min.   :-9.000   Min.   :-9.000  
##  1st Qu.: 1.00   No  :1622     1st Qu.:-1.000   1st Qu.:-1.000  
##  Median : 1.00   NA's:  21     Median : 1.000   Median :-1.000  
##  Mean   : 1.04                 Mean   : 0.105   Mean   : 0.085  
##  3rd Qu.: 2.00                 3rd Qu.: 1.000   3rd Qu.: 1.000  
##  Max.   : 5.00                 Max.   : 2.000   Max.   : 2.000  
##                                                                 
##     V201029          V201028        V201025x        V201024     
##  Min.   :-9.000   Min.   :-9.0   Min.   :-4.00   Min.   :-9.00  
##  1st Qu.:-1.000   1st Qu.:-1.0   1st Qu.: 3.00   1st Qu.:-1.00  
##  Median :-1.000   Median :-1.0   Median : 3.00   Median :-1.00  
##  Mean   :-0.897   Mean   :-0.9   Mean   : 2.92   Mean   :-0.86  
##  3rd Qu.:-1.000   3rd Qu.:-1.0   3rd Qu.: 3.00   3rd Qu.:-1.00  
##  Max.   :12.000   Max.   : 2.0   Max.   : 4.00   Max.   : 4.00  
##                                                                 
##  EarlyVote2020
##  Yes : 375    
##  No  : 115    
##  NA's:6963    
##               
##               
##               
##

We see that there are NA values in several of the derived variables (those not beginning with “V”) and negative values in the original variables (those beginning with “V”). We can also use the count() function to get an understanding of the different types of missing data on the original variables. For example, let’s look at the count of data for V202072, which corresponds to our VotedPres2020 variable.

anes_2020 %>%
  count(VotedPres2020, V202072)

## # A tibble: 7 × 3
##   VotedPres2020 V202072                                   n
##   <fct>         <dbl+lbl>                             <int>
## 1 Yes           -1 [-1. Inapplicable]                   361
## 2 Yes            1 [1. Yes, voted for President]       5952
## 3 No            -1 [-1. Inapplicable]                    10
## 4 No             2 [2. No, didn't vote for President]    77
## 5 <NA>          -9 [-9. Refused]                          2
## 6 <NA>          -6 [-6. No post-election interview]       4
## 7 <NA>          -1 [-1. Inapplicable]                  1047

Here, we can see that there are three types of missing data, and the majority of them fall under the “Inapplicable” category. This term is usually associated with data missing due to skip patterns and is considered to be missing data by design. Based on the documentation from ANES (DeBell 2010), we can see that this question was only asked of respondents who voted in the election.
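When there are many variables, a compact per-variable tally can complement summary() and count(). The following is a minimal base R sketch on a hypothetical toy data frame (the columns Q1 to Q3 and their values are invented) that uses ANES-style negative codes alongside true NA values.

```r
# Hypothetical toy data using ANES-style negative codes
toy <- data.frame(
  Q1 = c(1, 2, -9, -1, 1, NA),
  Q2 = c(-1, -1, 3, 2, -8, 1),
  Q3 = c(1, 1, 2, 2, 1, 2)
)

# Per variable: number of NA values and number of negative (special) codes
miss_summary <- data.frame(
  variable = names(toy),
  n_na = colSums(is.na(toy)),
  n_negative = colSums(toy < 0, na.rm = TRUE),
  row.names = NULL
)
miss_summary

# Breakdown of the special codes for a single variable
table(toy$Q2[toy$Q2 < 0])
```

A tally like this can point to the variables that deserve a closer look with count() and the codebook.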

11.3.2 Visualization of missing data

Reviewing tables for every variable can be challenging, so it may be more efficient to view missing data in a graphical format to help narrow in on patterns or unique variables. The {naniar} package is very useful in exploring missing data visually. We can use the vis_miss() function available in both {visdat} and {naniar} packages to view the amount of missing data by variable (see Figure 11.1) (Tierney 2017; Tierney and Cook 2023).

anes_2020_derived <- anes_2020 %>%
  select(
    -starts_with("V2"), -CaseID, -InterviewMode,
    -Weight, -Stratum, -VarUnit
  )

anes_2020_derived %>%
  vis_miss(cluster = TRUE, show_perc = FALSE) +
  scale_fill_manual(
    values = book_colors[c(3, 1)],
    labels = c("Present", "Missing"),
    name = ""
  ) +
  theme(
    plot.margin = margin(5.5, 30, 5.5, 5.5, "pt"),
    axis.text.x = element_text(angle = 70)
  )

This chart shows the missingness of the selected variables where missing is highlighted in a dark color. Each row of the plot is an observation and each column is a variable. There are some patterns observed such as a large block of missing for `VotedPres2016_selection` and many of the same respondents also having missing for `VotedPres2020_selection`.

FIGURE 11.1: Visual depiction of missing data in the ANES 2020 data

From the visualization in Figure 11.1, we can start to get a picture of what questions may be connected in terms of missing data. Even if we did not have the informative variable names, we could deduce that VotedPres2020, VotedPres2020_selection, and EarlyVote2020 are likely connected since their missing data patterns are similar.

We can also look at VotedPres2016_selection and see that there are a lot of missing data in that variable. The missing data are likely due to a skip pattern, and we can look at other graphics to see how they relate to other variables. The {naniar} package has multiple visualization functions that can help dive deeper, such as the gg_miss_fct() function, which looks at missing data for all variables by levels of another variable (see Figure 11.2).

anes_2020_derived %>%
  gg_miss_fct(VotedPres2016) +
  scale_fill_gradientn(
    guide = "colorbar",
    name = "% Miss",
    colors = book_colors[c(3, 2, 1)]
  ) +
  ylab("Variable") +
  xlab("Voted for President in 2016")

This chart has x-axis 'Voted for President in 2016' with labels Yes, No and NA and has y-axis 'Variable' with labels Age, AgeGroup, CampaignInterest, EarlyVote2020, Education, Gender, Income, Income7, PartyID, RaceEth, TrustGovernment, TrustPeople, VotedPres2016_selection, VotedPres2020 and VotedPres2020_selection. There is a legend indicating fill is used to show pct_miss, ranging from 0 represented by fill very pale blue to 100 shown as fill dark blue. Among those that voted for president in 2016, they had little missing for other variables (light color) but those that did not vote have more missing data in their 2020 voting patterns and their 2016 president selection.

FIGURE 11.2: Missingness in variables for each level of ‘VotedPres2016,’ in the ANES 2020 data

In Figure 11.2, we can see that if respondents did not vote for president in 2016 or did not answer that question, then they were not asked about who they voted for in 2016 (the percentage of missing data is 100%). Additionally, we can see with Figure 11.2 that there are more missing data across all questions if they did not provide an answer to VotedPres2016.
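The percentages behind a plot like Figure 11.2 can also be tabulated directly. Below is a minimal {dplyr} sketch on a hypothetical toy tibble; the column names only mimic the ANES derived variables, and the values are invented.

```r
library(dplyr)

# Hypothetical toy data; names mimic the ANES derived variables
toy <- tibble(
  VotedPres2016 = c("Yes", "Yes", "Yes", "No", "No", NA),
  VotedPres2016_selection = c("Clinton", "Trump", NA, NA, NA, NA),
  Age = c(34, 51, 67, NA, 45, NA)
)

# Percentage of missing values in every other column, by level of VotedPres2016
pct_miss_by_group <- toy %>%
  group_by(VotedPres2016) %>%
  summarize(across(everything(), ~ mean(is.na(.x)) * 100))

pct_miss_by_group
```

Each row is a level of the grouping variable (including NA), and each remaining column holds the percentage of missing values for that variable within the group.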

There are other visualizations that work well with numeric data. For example, in the RECS 2020 data, we can plot two continuous variables and the missing data associated with them to see if there are any patterns in the missingness. To do this, we can use the bind_shadow() function from the {naniar} package. This creates a nabular (combination of “na” with “tabular”), which features the original columns followed by the same number of columns with a specific NA format. These NA columns are indicators of whether the value in the original data is missing or not. The example printed below shows that every non-missing level of HeatingBehavior is flagged as !NA in the shadow variable HeatingBehavior_NA, while records that are missing HeatingBehavior are flagged as NA.

recs_2020_shadow <- recs_2020 %>%
  bind_shadow()

ncol(recs_2020)

## [1] 100

ncol(recs_2020_shadow)

## [1] 200

recs_2020_shadow %>%
  count(HeatingBehavior, HeatingBehavior_NA)

## # A tibble: 7 × 3
##   HeatingBehavior                               HeatingBehavior_NA     n
##   <fct>                                         <fct>              <int>
## 1 Set one temp and leave it                     !NA                 7806
## 2 Manually adjust at night/no one home          !NA                 4654
## 3 Programmable or smart thermostat automatical… !NA                 3310
## 4 Turn on or off as needed                      !NA                 1491
## 5 No control                                    !NA                  438
## 6 Other                                         !NA                   46
## 7 <NA>                                          NA                   751

We can then use these new variables to plot the missing data alongside the actual data. For example, let’s plot a histogram of the total energy cost grouped by those missing and not missing by heating behavior (see Figure 11.3).

recs_2020_shadow %>%
  filter(TOTALDOL < 5000) %>%
  ggplot(aes(x = TOTALDOL, fill = HeatingBehavior_NA)) +
  geom_histogram() +
  scale_fill_manual(
    values = book_colors[c(3, 1)],
    labels = c("Present", "Missing"),
    name = "Heating Behavior"
  ) +
  theme_minimal() +
  xlab("Total Energy Cost (Truncated at $5000)") +
  ylab("Number of Households")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This chart has title 'Histogram of Energy Cost by Heating Behavior Missing Data'. It has x-axis 'Total Energy Cost (Truncated at $5000)' with labels 0, 1000, 2000, 3000, 4000 and 5000. It has y-axis 'Number of Households' with labels 0, 500, 1000 and 1500. There is a legend indicating fill is used to show HeatingBehavior_NA, with 2 levels: !NA shown as very pale blue fill and NA shown as dark blue fill. The chart is a bar chart with 30 vertical bars. These are stacked, as sorted by HeatingBehavior_NA.

FIGURE 11.3: Histogram of energy cost by heating behavior missing data

Figure 11.3 indicates that respondents who did not provide a response for the heating behavior question may have a different distribution of total energy cost compared to respondents who did provide a response. This view of the raw data and missingness could indicate some bias in the data. Survey researchers typically account for differences like these when calculating weights, so we need to make sure that we incorporate the weights when analyzing the data.
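A numeric summary can accompany a histogram like this one. The sketch below builds a shadow-style indicator by hand with {dplyr} rather than calling bind_shadow(), and it uses a hypothetical toy tibble whose column names mimic RECS but whose values are invented.

```r
library(dplyr)

# Hypothetical toy data; names mimic RECS, values are invented
toy <- tibble(
  TOTALDOL = c(900, 1200, 1500, 1800, 2100, 400, 600, 800),
  HeatingBehavior = c(
    "Set one temp", "Manual", "Set one temp", "Other", "Manual",
    NA, NA, NA
  )
)

# Hand-built indicator using the same "!NA" / "NA" labels as a shadow column
toy_shadow <- toy %>%
  mutate(
    HeatingBehavior_NA = factor(
      if_else(is.na(HeatingBehavior), "NA", "!NA"),
      levels = c("!NA", "NA")
    )
  )

# Compare the cost distribution for present versus missing heating behavior
cost_by_miss <- toy_shadow %>%
  group_by(HeatingBehavior_NA) %>%
  summarize(
    n = n(),
    median_cost = median(TOTALDOL),
    mean_cost = mean(TOTALDOL)
  )

cost_by_miss
```

These are unweighted summaries of the records, in keeping with the exploratory purpose of this section.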

There are many other visualizations that can be helpful in reviewing the data, and we recommend reviewing the {naniar} documentation for more information (Tierney and Cook 2023).

11.4 Analysis with missing data

Once we understand the types of missingness, we can begin the analysis of the data. Different missingness types may be handled in different ways. In most publicly available datasets, researchers have already calculated weights and imputed missing values if necessary. Often, there are imputation flags included in the data that indicate if each value in a given variable is imputed. For example, in the RECS data we may see a logical variable of ZWinterTempNight, where a value of TRUE means that the value of WinterTempNight for that respondent was imputed, and FALSE means that it was not imputed. We may use these imputation flags if we are interested in examining the nonresponse rates in the original data. For those interested in learning more about how to calculate weights and impute data for different missing data mechanisms, we recommend Kim and Shao (2021) and Valliant and Dever (2018).
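As a minimal sketch of how an imputation flag can be used, the following base R code computes the share of imputed values on a hypothetical toy data frame. The values and the weight column `wt` are invented; only the flag naming pattern (ZWinterTempNight for WinterTempNight) follows the description above.

```r
# Hypothetical toy data: ZWinterTempNight flags whether WinterTempNight was imputed
toy <- data.frame(
  WinterTempNight = c(68, 70, 65, 72, 66, 69),
  ZWinterTempNight = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  wt = c(100, 300, 100, 200, 100, 200) # hypothetical analysis weight
)

# Unweighted share of records whose value was imputed
unweighted_rate <- mean(toy$ZWinterTempNight)

# Weighted share, for comparison
weighted_rate <- weighted.mean(toy$ZWinterTempNight, toy$wt)

c(unweighted = unweighted_rate, weighted = weighted_rate)
```

The unweighted rate describes the responding sample, while the weighted rate describes the share of the represented population whose value was imputed.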

Even with weights and imputation, missing data are most likely still present and need to be accounted for in analysis. This section provides an overview of how to recode missing data in R and how to account for skip patterns in analysis.

11.4.1 Recoding missing data

Even within a variable, there can be different reasons for missing data. In publicly released data, negative values are often present to provide different meanings for values. For example, in the ANES 2020 data, they have the following negative values to represent different types of missing data:

–9: Refused
–8: Don’t Know
–7: No post-election data, deleted due to incomplete interview
–6: No post-election interview
–5: Interview breakoff (sufficient partial IW)
–4: Technical error
–3: Restricted
–2: Other missing reason (question specific)
–1: Inapplicable

When we created the derived variables for use in this book, we coded all negative values as NA and proceeded to analyze the data. For most cases, this is an appropriate approach as long as we filter the data appropriately to account for skip patterns (see Section 11.4.2). However, the {naniar} package does have the option to code special missing values. For example, if we wanted to have two NA values, one that indicated the question was missing by design (e.g., due to skip patterns) and one for the other missing categories, we can use the nabular format to incorporate these with the recode_shadow() function.

anes_2020_shadow <- anes_2020 %>%
  select(starts_with("V2")) %>%
  mutate(across(everything(), ~ case_when(
    .x < -1 ~ NA,
    TRUE ~ .x
  ))) %>%
  bind_shadow() %>%
  recode_shadow(V201103 = .where(V201103 == -1 ~ "skip"))

anes_2020_shadow %>%
  count(V201103, V201103_NA)

## # A tibble: 5 × 3
##   V201103                 V201103_NA     n
##   <dbl+lbl>               <fct>      <int>
## 1 -1 [-1. Inapplicable]   NA_skip     1643
## 2  1 [1. Hillary Clinton] !NA         2911
## 3  2 [2. Donald Trump]    !NA         2466
## 4  5 [5. Other {SPECIFY}] !NA          390
## 5 NA                      NA            43

However, it is important to note that at the time of publication, there is no easy way to apply recode_shadow() to multiple variables at once (e.g., we cannot use the tidyverse feature of across()). The example code above only implements this for a single variable, so this would have to be done manually or in a loop for all variables of interest.
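One possible workaround, which sits outside {naniar} and does not use recode_shadow(), is to build plain "reason" columns with {dplyr} before converting the codes to NA. This is only a sketch under assumptions: the toy variables V1 and V2, the `_reason` suffix, and the reason labels are hypothetical, and the resulting columns are ordinary character columns rather than true shadow columns.

```r
library(dplyr)

# Hypothetical toy data with ANES-style codes (-1 = Inapplicable)
toy <- tibble(
  V1 = c(1, 2, -1, -9, 1),
  V2 = c(-1, -1, 5, 2, -8)
)

toy_reason <- toy %>%
  # Step 1: record why each value is missing, for several variables at once
  mutate(
    across(
      c(V1, V2),
      ~ case_when(
        is.na(.x) ~ "other missing",
        .x == -1 ~ "skip",
        .x < -1 ~ "other missing",
        TRUE ~ "present"
      ),
      .names = "{.col}_reason"
    )
  ) %>%
  # Step 2: convert all negative codes in the original columns to NA
  mutate(across(c(V1, V2), ~ replace(.x, which(.x < 0), NA)))

toy_reason %>%
  count(V1, V1_reason)
```

The reason columns can then be used in filters in much the same way as the NA_skip values in the nabular.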

11.4.2 Accounting for skip patterns

When questions are skipped by design in a survey, the resulting missing data are meaningful. For example, the RECS asks people how they control the heat in their home in the winter (HeatingBehavior). This question is asked only of those who have heat in their home (SpaceHeatingUsed). If there is no heating equipment used, the value of HeatingBehavior is missing. We have two choices when analyzing these data: (1) only including those with a valid value of HeatingBehavior and specifying the universe as those with heat or (2) including those who do not have heat. It is important to specify what population an analysis generalizes to.

Here is an example where we only include those with a valid value of HeatingBehavior (choice 1). Note that we use the design object (recs_des) and then filter to those that are not missing on HeatingBehavior.

heat_cntl_1 <- recs_des %>%
  filter(!is.na(HeatingBehavior)) %>%
  group_by(HeatingBehavior) %>%
  summarize(
    p = survey_prop()
  )

heat_cntl_1

## # A tibble: 6 × 3
##   HeatingBehavior                                              p    p_se
##   <fct>                                                    <dbl>   <dbl>
## 1 Set one temp and leave it                              0.430   4.69e-3
## 2 Manually adjust at night/no one home                   0.264   4.54e-3
## 3 Programmable or smart thermostat automatically adjust… 0.168   3.12e-3
## 4 Turn on or off as needed                               0.102   2.89e-3
## 5 No control                                             0.0333  1.70e-3
## 6 Other                                                  0.00208 3.59e-4
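The order of operations noted above (create the design object, then filter) can be illustrated with a small sketch. It uses a hypothetical stratified toy sample; the columns `stratum`, `wt`, `has_heat`, and `bill` are invented and are not RECS variables, and the design is a simple stratified one rather than the replicate-weight design used for RECS.

```r
library(dplyr)
library(srvyr)

# Hypothetical stratified toy sample; bill is only defined for homes with heat
set.seed(11)
toy <- tibble(
  stratum = rep(c("A", "B"), each = 50),
  wt = rep(c(10, 30), each = 50),
  has_heat = rep(c(TRUE, TRUE, TRUE, TRUE, FALSE), times = 20),
  bill = if_else(has_heat, round(rnorm(100, mean = 1500, sd = 300)), NA_real_)
)

# Approach used in this chapter: build the design on all records, then filter
est_design_first <- toy %>%
  as_survey_design(strata = stratum, weights = wt) %>%
  filter(has_heat) %>%
  summarize(mean_bill = survey_mean(bill))

# Contrast: filter the data frame first, then build the design
est_filter_first <- toy %>%
  filter(has_heat) %>%
  as_survey_design(strata = stratum, weights = wt) %>%
  summarize(mean_bill = survey_mean(bill))

bind_rows(
  design_first = est_design_first,
  filter_first = est_filter_first,
  .id = "approach"
)
```

In this sketch the two point estimates agree, but the standard errors may differ because only the first approach keeps the full sample design available for variance estimation.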

Here is an example where we include those who do not have heat (choice 2). To help understand what we are looking at, we have included the output to show both variables, SpaceHeatingUsed and HeatingBehavior.

heat_cntl_2 <- recs_des %>%
  group_by(interact(SpaceHeatingUsed, HeatingBehavior)) %>%
  summarize(
    p = survey_prop()
  )

heat_cntl_2

## # A tibble: 7 × 4
##   SpaceHeatingUsed HeatingBehavior                             p    p_se
##   <lgl>            <fct>                                   <dbl>   <dbl>
## 1 FALSE            <NA>                                  0.0469  2.07e-3
## 2 TRUE             Set one temp and leave it             0.410   4.60e-3
## 3 TRUE             Manually adjust at night/no one home  0.251   4.36e-3
## 4 TRUE             Programmable or smart thermostat aut… 0.160   2.95e-3
## 5 TRUE             Turn on or off as needed              0.0976  2.79e-3
## 6 TRUE             No control                            0.0317  1.62e-3
## 7 TRUE             Other                                 0.00198 3.41e-4

If we ran the first analysis, we would say that 16.8% of households with heat use a programmable or smart thermostat for heating their home. If we used the results from the second analysis, we would say that 16.0% of households use a programmable or smart thermostat for heating their home. The two statements differ in their universe: the first describes households with heat, while the second describes all households. Skip patterns often change the universe we are talking about and need to be carefully examined.
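The two sets of estimates are linked by the share of households without heat. As a quick arithmetic check in base R, using only the rounded proportions printed in the two outputs above, multiplying each "households with heat" proportion by the share of households that have heat should approximately reproduce the all-households proportion.

```r
# Rounded estimates copied from the printed outputs of heat_cntl_1 and heat_cntl_2
p_with_heat <- c(0.430, 0.264, 0.168, 0.102, 0.0333, 0.00208) # choice 1
p_all <- c(0.410, 0.251, 0.160, 0.0976, 0.0317, 0.00198) # choice 2
p_no_heat <- 0.0469

# Rescale the "households with heat" proportions to the all-households universe
p_rescaled <- p_with_heat * (1 - p_no_heat)

round(cbind(p_all, p_rescaled, diff = p_all - p_rescaled), 4)
```

The small remaining differences come from rounding in the printed output.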

Filtering to the correct universe is important when handling these types of missing data. The nabular we created above can also help with this. If we have NA_skip values in the shadow, we can make sure that we filter out all of these values and only include relevant missing values. To do this with survey data, we could first create the nabular, then create the design object on that data, and then use the shadow variables to assist with filtering the data. Let’s use the nabular we created above for ANES 2020 (anes_2020_shadow) to create the design object.

anes_adjwgt_shadow <- anes_2020_shadow %>%
  mutate(V200010b = V200010b / sum(V200010b) * targetpop)

anes_des_shadow <- anes_adjwgt_shadow %>%
  as_survey_design(
    weights = V200010b,
    strata = V200010d,
    ids = V200010c,
    nest = TRUE
  )

Then, we can use this design object to look at the percentage of the population who voted for each candidate in 2016 (V201103). First, let’s look at the percentages without removing any cases:

pres16_select1 <- anes_des_shadow %>%
  group_by(V201103) %>%
  summarize(
    All_Missing = survey_prop()
  )

pres16_select1

## # A tibble: 5 × 3
##   V201103                 All_Missing All_Missing_se
##   <dbl+lbl>                     <dbl>          <dbl>
## 1 -1 [-1. Inapplicable]       0.324          0.00933
## 2  1 [1. Hillary Clinton]     0.330          0.00728
## 3  2 [2. Donald Trump]        0.299          0.00728
## 4  5 [5. Other {SPECIFY}]     0.0409         0.00230
## 5 NA                          0.00627        0.00121

Next, we look at the percentages, removing only those missing due to skip patterns (i.e., they did not receive this question).

pres16_select2 <- anes_des_shadow %>%
  filter(V201103_NA != "NA_skip") %>%
  group_by(V201103) %>%
  summarize(
    No_Skip_Missing = survey_prop()
  )

pres16_select2

## # A tibble: 4 × 3
##   V201103                 No_Skip_Missing No_Skip_Missing_se
##   <dbl+lbl>                         <dbl>              <dbl>
## 1  1 [1. Hillary Clinton]         0.488              0.00870
## 2  2 [2. Donald Trump]            0.443              0.00856
## 3  5 [5. Other {SPECIFY}]         0.0606             0.00330
## 4 NA                              0.00928            0.00178

Finally, we look at the percentages, removing all missing values both due to skip patterns and due to those who refused to answer the question.

pres16_select3 <- anes_des_shadow %>%
  filter(V201103_NA == "!NA") %>%
  group_by(V201103) %>%
  summarize(
    No_Missing = survey_prop()
  )

pres16_select3

## # A tibble: 3 × 3
##   V201103                No_Missing No_Missing_se
##   <dbl+lbl>                   <dbl>         <dbl>
## 1 1 [1. Hillary Clinton]     0.492        0.00875
## 2 2 [2. Donald Trump]        0.447        0.00861
## 3 5 [5. Other {SPECIFY}]     0.0611       0.00332

TABLE 11.1: Percentage of votes by candidate for different missing data inclusions
Candidate	Including All Missing Data		Removing Skip Patterns Only		Removing All Missing Data
Candidate	%	s.e. (%)	%	s.e. (%)	%	s.e. (%)
Did Not Vote for President in 2016	32.4	0.9	NA	NA	NA	NA
Hillary Clinton	33.0	0.7	48.8	0.9	49.2	0.9
Donald Trump	29.9	0.7	44.3	0.9	44.7	0.9
Other Candidate	4.1	0.2	6.1	0.3	6.1	0.3
Missing	0.6	0.1	0.9	0.2	NA	NA

As Table 11.1 shows, the results can vary greatly depending on which types of missing data are removed. If we remove only the skip patterns, the margin between Clinton and Trump is 4.5 percentage points; but if we include all data, even those who did not vote in 2016, the margin is 3.1 percentage points. How we handle the different types of missing values is important for interpreting the data.
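The margins quoted above follow directly from the percentages in Table 11.1. A short base R check, using only values copied from the table, also shows the margin when all missing data are removed:

```r
# Percentages copied from Table 11.1
tab <- data.frame(
  handling = c(
    "Including all missing data",
    "Removing skip patterns only",
    "Removing all missing data"
  ),
  clinton = c(33.0, 48.8, 49.2),
  trump = c(29.9, 44.3, 44.7)
)

# Clinton-Trump margin in percentage points under each handling of missing data
tab$margin <- round(tab$clinton - tab$trump, 1)
tab
```

Because the table values are rounded to one decimal place, margins computed from unrounded estimates could differ slightly in the last digit.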

References

DeBell, Matthew. 2010. “How to Analyze ANES Survey Data.” ANES Technical Report Series nes012492. Palo Alto, CA: Stanford University; Ann Arbor, MI: University of Michigan. https://electionstudies.org/wp-content/uploads/2018/05/HowToAnalyzeANESData.pdf.

Kim, Jae Kwang, and Jun Shao. 2021. Statistical Methods for Handling Incomplete Data. Chapman & Hall/CRC Press.

Mack, Christina, Zhaohui Su, and Daniel Westreich. 2018. “Types of Missing Data.” In Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide, Third Edition [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US). https://www.ncbi.nlm.nih.gov/books/NBK493614/.

Schafer, Joseph L, and John W Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7: 147–77. https://doi.org/10.1037//1082-989X.7.2.147.

Tierney, Nicholas. 2017. “visdat: Visualising Whole Data Frames.” Journal of Open Source Software 2 (16): 355. https://doi.org/10.21105/joss.00355.

Tierney, Nicholas, and Dianne Cook. 2023. “Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.” Journal of Statistical Software 105 (7): 1–31. https://doi.org/10.18637/jss.v105.i07.

Valliant, Richard, and Jill A. Dever. 2018. Survey Weights: A Step-by-Step Guide to Calculation. Stata Press.