Decision Tree & Random Forest (predicting voter choice 2024)

Authors and Affiliations

  • Aashia Khan, Binghamton University

  • Zihan Hei, Binghamton University

  • Jeff John, Binghamton University

  • Shane McCarty, Binghamton University; Promote Care & Prevent Harm

Published

November 3, 2025

Abstract

Background: Zero-sum beliefs, the perception that one group’s gains necessarily result in another group’s losses, are important predictors of political attitudes. However, whether the referent of a zero-sum belief is economic or tied to social identity remains underexplored in relation to political ideology, party affiliation, and voting behavior in contemporary elections.

Method: We conducted a comprehensive analysis examining three dimensions of zero-sum beliefs (general, economic, and social identity). Using Kruskal-Wallis tests on eleven zero-sum beliefs, we investigated how political party affiliation and racial/ethnic identity influenced endorsement of zero-sum beliefs across multiple domains. Subsequently, we examined whether these zero-sum belief patterns predicted self-reported voting for Donald Trump versus Kamala Harris in the 2024 presidential election.

Results: Political party affiliation was a significant predictor of all eight zero-sum social identity beliefs, but of none of the economic or general beliefs. Republican voters and certain racial/ethnic groups demonstrated higher endorsement of zero-sum social identity beliefs. A logistic regression showed that, after controlling for political ideology, a composite of zero-sum social identity beliefs predicted voting behavior in the 2024 presidential election: stronger zero-sum social identity thinking was associated with Trump support, and weaker zero-sum social identity thinking with Harris support. Other sociodemographic factors and zero-sum economic thinking were not significant predictors.

Discussion: Zero-sum social identity beliefs may represent a competitive core belief underlying contemporary political party affiliation and candidate preference. These findings affirm prior work showing that zero-sum thinking about economics differs from zero-sum thinking about social identities: agreement with zero-sum economic beliefs was similar across political parties, whereas agreement with zero-sum social identity beliefs differed significantly by party affiliation. To the best of our knowledge, this study is the first to show that zero-sum thinking about social identities predicts voter preference in the 2024 election. Future work needs to examine how to reduce zero-sum social identity thinking.

Keywords

zero-sum beliefs, social identities, political affiliation, racial identity

Predicting Voting Behavior for 2024 Presidential Candidate by Machine Learning

In [1]:

# Load necessary libraries
library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)
library(ggrain)  
Registered S3 methods overwritten by 'ggpp':
  method                  from   
  heightDetails.titleGrob ggplot2
  widthDetails.titleGrob  ggplot2

library(rmarkdown)
library(readr)
library(dplyr, warn.conflicts = FALSE)
library(haven)
library(rempsyc)
Suggested APA citation: Thériault, R. (2023). rempsyc: Convenience functions for psychology. 
Journal of Open Source Software, 8(87), 5466. https://doi.org/10.21105/joss.05466

library(knitr)
library(broom)
library(ggdist)
library(devtools)
Loading required package: usethis

library(apaTables)
library(ggpubr)
library(psych)

Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':

    %+%, alpha

library(forcats)
library(corrplot)
corrplot 0.95 loaded
In [2]:
select_data <- read.csv("/cloud/project/data/select_data.csv")

Predictive Models (Decision Tree & Random Forest)

Decision Tree and Random Forest Analysis

To further validate these findings and examine the predictive power of our variables using a different analytical approach, we employed a series of machine learning techniques. Our analysis proceeded in three stages:

  • Stage 1: Initial Decision Tree. We first constructed a simple decision tree to identify the primary predictors of voting for Trump and their splitting thresholds. This provided an interpretable baseline model showing how the algorithm segments voters.

  • Stage 2: Extended Decision Tree with Cross-Validation. We then built a more complex decision tree incorporating additional demographic variables and used cross-validation to determine the optimal model complexity. The tree with the lowest cross-validated error was the 2-split model, with a cross-validated relative error (xerror) of 0.32 and a training relative error of 0.24, both expressed relative to the root node error. This suggests that, despite having access to multiple demographic and ideological variables, the most predictive tree needs only two splits to classify voters effectively.

  • Stage 3: Random Forest Analysis. Finally, we employed a Random Forest ensemble method to capture potential non-linear relationships and interactions while providing robust variable importance measures. This approach was consistent with our regression findings: it identified ZEROSUM_IDENTITY and POLITICALBELIEFS as the most important predictors, with substantially higher importance scores than all other variables.

This machine learning approach serves as a complementary check on our regression-based findings: it uses fundamentally different algorithms to examine the same relationships and provides additional confidence in our substantive conclusions about the predictors of voting behavior.

In [3]:

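# Recode the 2024 vote into a binary outcome:
# VOTE2024 == 1 becomes 1 (Trump), VOTE2024 == 2 becomes 0, anything else is missing
select_data <- select_data %>%
  mutate(TRUMPVOTE = case_when(
    VOTE2024 == 1 ~ 1,
    VOTE2024 == 2 ~ 0,
    TRUE ~ NA_real_   # typed NA keeps every branch numeric on older dplyr versions
  ))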
In [4]:

select_data <- select_data %>%
  mutate(
    ZEROSUM_ECONOMIC = (ZEROSUM_2 + ZEROSUM_3)/2,
    ZEROSUM_IDENTITY = (ZEROSUM_4 + ZEROSUM_5 + ZEROSUM_6 + ZEROSUM_7 + ZEROSUM_8 + ZEROSUM_9 + ZEROSUM_10 + ZEROSUM_11)/8
  )
In [5]:
select_data <- select_data %>%
  mutate(TRUMPVOTE = factor(TRUMPVOTE, levels = c(0, 1)))  # 0 = non-Trump, 1 = Trump
In [6]:

library(rpart)
library(rpart.plot)

dt_model <- rpart(TRUMPVOTE ~ POLITICALBELIEFS + ZEROSUM_ECONOMIC + ZEROSUM_IDENTITY + ZEROSUM_1 + GENDER_MALE +
    RELIGIOUS_YES + RACE_BLACK + RACE_ASIAN + RACE_OTHER + EDUCATION_HIGH + SOCIALSTATUS,
                  data = select_data,
                  method = "class")

rpart.plot(dt_model, extra = 104)

Figure x. Decision Tree Analysis of Voting Predictors.
  • The only variable the tree uses is ZEROSUM_IDENTITY, suggesting it is the most important predictor in our model.

  • If a respondent scores below 3.3 on ZEROSUM_IDENTITY, they are much more likely not to vote for Trump (84%).

  • If they score 3.3 or higher, they are much more likely to vote for Trump (87%).
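
The split threshold and leaf percentages described above can also be read from a fitted `rpart` object instead of from the plot. The following is a minimal sketch on simulated data, not the study data: the variable names are reused only for readability, and the sample size, the 1 to 5 score range, the cut-point near 3.3, and the class probabilities are all assumptions chosen to mimic the pattern reported here. `maxdepth = 1` is set so that the sketch yields a single split.

```r
library(rpart)

# Simulated stand-in for the survey data (NOT the study data).
set.seed(1)
n <- 200
sim <- data.frame(ZEROSUM_IDENTITY = runif(n, min = 1, max = 5))

# Assumed pattern: scores at or above 3.3 make a Trump vote more likely.
p_trump <- ifelse(sim$ZEROSUM_IDENTITY >= 3.3, 0.87, 0.16)
sim$TRUMPVOTE <- factor(rbinom(n, size = 1, prob = p_trump), levels = c(0, 1))

# Fit a one-split classification tree.
fit <- rpart(TRUMPVOTE ~ ZEROSUM_IDENTITY, data = sim, method = "class",
             control = rpart.control(maxdepth = 1))

# The first row of fit$splits is the primary split at the root node.
root_var <- rownames(fit$splits)[1]
root_cut <- unname(fit$splits[1, "index"])
cat("Root split:", root_var, "at", round(root_cut, 2), "\n")

# Class proportions in the two leaves, obtained by predicting for one
# hypothetical respondent on each side of the split.
new_resp <- data.frame(ZEROSUM_IDENTITY = c(2, 4))
probs <- predict(fit, newdata = new_resp, type = "prob")
print(round(probs, 2))
```

The columns of `probs` are named after the factor levels, so `probs[, "1"]` is the estimated proportion of Trump voters in the leaf each hypothetical respondent falls into.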

In [7]:

dt_model <- rpart(
  TRUMPVOTE ~ POLITICALBELIEFS + ZEROSUM_ECONOMIC + ZEROSUM_IDENTITY + ZEROSUM_1 + GENDER_MALE +
    RELIGIOUS_YES + RACE_BLACK + RACE_ASIAN + RACE_OTHER + EDUCATION_HIGH + SOCIALSTATUS,
  data = select_data,
  method = "class",
  control = rpart.control(
    cp = 0.001,         # smaller = deeper tree
    minsplit = 10,      # smaller = allows more splits
    maxdepth = 5        # allow up to 5 levels deep
  )
)

rpart.plot(dt_model, extra = 104)

Figure x. Extended Decision Tree with Demographic Variables.
  • This expanded decision tree incorporates demographic variables (gender and race) alongside the core predictors. The tree shows how demographic factors interact with ideological variables to refine predictions, with male respondents and those from “other” racial categories showing higher Trump support within similar ideological profiles.
In [8]:

dt_model <- rpart(
  TRUMPVOTE ~ POLITICALBELIEFS + ZEROSUM_ECONOMIC + ZEROSUM_IDENTITY + ZEROSUM_1 + GENDER_MALE +
    RELIGIOUS_YES + RACE_BLACK + RACE_ASIAN + RACE_OTHER + EDUCATION_HIGH + SOCIALSTATUS,
  data = select_data,
  method = "class",
  control = rpart.control(cp = 0.001)
)

printcp(dt_model)

Classification tree:
rpart(formula = TRUMPVOTE ~ POLITICALBELIEFS + ZEROSUM_ECONOMIC + 
    ZEROSUM_IDENTITY + ZEROSUM_1 + GENDER_MALE + RELIGIOUS_YES + 
    RACE_BLACK + RACE_ASIAN + RACE_OTHER + EDUCATION_HIGH + SOCIALSTATUS, 
    data = select_data, method = "class", control = rpart.control(cp = 0.001))

Variables actually used in tree construction:
[1] POLITICALBELIEFS ZEROSUM_IDENTITY

Root node error: 50/101 = 0.49505

n=101 (21 observations deleted due to missingness)

     CP nsplit rel error xerror     xstd
1 0.720      0      1.00   1.22 0.098302
2 0.040      1      0.28   0.36 0.076921
3 0.001      2      0.24   0.32 0.073390
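  • The tree with the lowest cross-validated error is the 2-split model (xerror = 0.32; training rel error = 0.24). Both values are expressed relative to the root node error of 0.495.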
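
The `rel error` and `xerror` columns in the `printcp()` output are scaled so that the root node equals 1, so they are not misclassification rates on their own. The sketch below copies the table values shown above into a data frame, converts them to rates, and compares two common selection rules. The one-standard-error rule is a widely used convention and is included for comparison only; it is not presented as the rule the authors applied. The data frame and column names are hypothetical.

```r
# Complexity table copied from the printcp() output above.
cp_table <- data.frame(
  CP        = c(0.720, 0.040, 0.001),
  nsplit    = c(0, 1, 2),
  rel_error = c(1.00, 0.28, 0.24),
  xerror    = c(1.22, 0.36, 0.32),
  xstd      = c(0.098302, 0.076921, 0.073390)
)
root_error <- 50 / 101  # "Root node error" line of the same output

# Multiply by the root node error to obtain misclassification rates.
cp_table$train_rate <- cp_table$rel_error * root_error
cp_table$cv_rate    <- cp_table$xerror * root_error

# Rule 1: the tree with the smallest cross-validated error.
best <- which.min(cp_table$xerror)

# Rule 2 (one-standard-error rule): the simplest tree whose xerror lies
# within one xstd of the minimum.
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= threshold))

print(round(cp_table[, c("nsplit", "train_rate", "cv_rate")], 3))
cat("Minimum-xerror tree: nsplit =", cp_table$nsplit[best], "\n")
cat("One-SE-rule tree:    nsplit =", cp_table$nsplit[one_se], "\n")
```

With these values the 2-split tree has a cross-validated misclassification rate of about 0.16. Its xerror (0.32) is within one standard error of the 1-split tree's (0.36), so the two trees are hard to distinguish at this sample size.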
In [9]:

library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:psych':

    outlier
The following object is masked from 'package:dplyr':

    combine
The following object is masked from 'package:ggplot2':

    margin

library(tidyr)

# need to drop NA to get accuracy
select_data <- select_data %>%
  drop_na(TRUMPVOTE, POLITICALBELIEFS, ZEROSUM_ECONOMIC, ZEROSUM_IDENTITY, ZEROSUM_1,
          GENDER_MALE, RELIGIOUS_YES, RACE_BLACK, RACE_ASIAN, RACE_OTHER, EDUCATION_HIGH, SOCIALSTATUS)

# split into training and testing sets
set.seed(123)
train_idx <- sample(seq_len(nrow(select_data)), size = 0.7 * nrow(select_data))
train <- select_data[train_idx, ]
test  <- select_data[-train_idx, ]

# Fit random forest model
rf_model <- randomForest(
  TRUMPVOTE ~ POLITICALBELIEFS + ZEROSUM_ECONOMIC + ZEROSUM_IDENTITY + ZEROSUM_1 + GENDER_MALE +
    RELIGIOUS_YES + RACE_BLACK + RACE_ASIAN + RACE_OTHER + EDUCATION_HIGH + SOCIALSTATUS,
  data = train,
  na.action = na.roughfix,
  ntree = 500
)

# Predict on test set
pred <- predict(rf_model, newdata = test)


# Confusion matrix
table(Predicted = pred, Actual = test$TRUMPVOTE)
         Actual
Predicted  0  1
        0 13  3
        1  2 12

# Accuracy
mean(pred == test$TRUMPVOTE)
[1] 0.8333333

# Variable importance
varImpPlot(rf_model)

Figure x. Variable Importance in Random Forest Model.
  • Zero-sum identity beliefs and political beliefs emerge as the most important predictors, with Mean Decrease Gini values of roughly 9 to 12, substantially higher than those of the other variables. This ranking is consistent with our regression results, which identified these two variables as the strongest predictors of voting for Trump, while demographic and other ideological variables play a secondary role.
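
Mean Decrease Gini summarizes how much splits on a variable reduce node impurity, accumulated over the splits that use the variable and averaged over the trees in the forest. The sketch below works through that quantity for a single split, as an illustration only. The class counts are invented, the helper `gini()` is hypothetical, and the exact scaling used inside `randomForest` is not reproduced here.

```r
# Gini impurity of a node, given its class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical class counts (non-Trump, Trump); illustrative only.
parent <- c(35, 35)
left   <- c(30, 5)   # e.g. respondents below a cut-point
right  <- c(5, 30)   # e.g. respondents at or above it

# Size-weighted impurity before and after the split.
before   <- sum(parent) * gini(parent)
after    <- sum(left) * gini(left) + sum(right) * gini(right)
decrease <- before - after

cat("Impurity before split:", round(before, 2), "\n")
cat("Impurity after split: ", round(after, 2), "\n")
cat("Gini decrease:        ", round(decrease, 2), "\n")
```

A variable that produces cleaner child nodes yields a larger decrease, which is why the two strongest predictors stand out in the importance plot.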
In [10]:

library(yardstick)

Attaching package: 'yardstick'
The following object is masked from 'package:readr':

    spec

library(ggplot2)
library(dplyr)

# Create data frame for predictions and actual values
conf_df <- data.frame(
  truth = test$TRUMPVOTE,
  prediction = pred
)

# Create confusion matrix object
conf_mat_obj <- conf_mat(conf_df, truth = truth, estimate = prediction)

# Visualize it
# (yardstick's heatmap places the truth on the x-axis and the prediction on the y-axis)
autoplot(conf_mat_obj, type = "heatmap") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Confusion Matrix: Random Forest",
       x = "Actual",
       y = "Predicted")
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

Figure x. Random Forest Model Performance.
  • The confusion matrix shows the random forest model’s prediction accuracy on the test data. The model achieved an overall accuracy of 83.33% (25 of 30), correctly classifying 13 of 15 non-Trump voters and 12 of 15 Trump voters. The model produced three false negatives (Trump voters predicted as non-Trump voters) and two false positives (non-Trump voters predicted as Trump voters), indicating strong but not perfect prediction performance on a small test set.
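
The per-class rates can be recomputed from the counts printed by `table()` above, where rows are predicted classes and columns are actual classes. The sketch below copies those counts into a matrix and treats a Trump vote (class 1) as the positive class. The exact binomial interval is an added illustration of how much uncertainty a 30-case test set leaves; it is not part of the original analysis.

```r
# Confusion matrix copied from the table() output above:
# rows = predicted class, columns = actual class.
cm <- matrix(c(13, 2, 3, 12), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

tp <- cm["1", "1"]  # Trump voters predicted as Trump voters
tn <- cm["0", "0"]  # non-Trump voters predicted as non-Trump voters
fp <- cm["1", "0"]  # non-Trump voters predicted as Trump voters
fn <- cm["0", "1"]  # Trump voters predicted as non-Trump voters

accuracy    <- (tp + tn) / sum(cm)  # 25 / 30
sensitivity <- tp / (tp + fn)       # 12 / 15
specificity <- tn / (tn + fp)       # 13 / 15

# Exact (Clopper-Pearson) 95% interval for the accuracy.
ci <- binom.test(tp + tn, sum(cm))$conf.int

cat("Accuracy:   ", round(accuracy, 3), "\n")
cat("Sensitivity:", round(sensitivity, 3), "\n")
cat("Specificity:", round(specificity, 3), "\n")
cat("95% CI for accuracy:", round(ci[1], 3), "to", round(ci[2], 3), "\n")
```

The column sums give 15 actual non-Trump voters and 15 actual Trump voters, which are the denominators for specificity and sensitivity respectively.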