
“Elections belong to the people.”
— Abraham Lincoln
How well can I predict vote share and the winner of congressional elections?
Methodology
With the 2022 congressional midterms just around the corner, I use a machine learning algorithm called random forest to make these predictions. A random forest combines the predictions of many decision trees. In short, each tree splits the data at decision points chosen to maximize predictive power, and the forest as a whole ranks the predictive variables from most useful to least useful.
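To make the idea of a decision point concrete, here is a minimal sketch of how a single regression tree might choose one split. It is not the code behind my model: the function name best_split and the toy numbers are hypothetical, and a real random forest repeats this search over many trees, bootstrap samples, and random subsets of predictors.

```python
def best_split(xs, ys):
    """Return (threshold, sse) for the split on xs that minimizes squared error in ys."""
    def sse(values):
        # Sum of squared errors around the mean of one side of the split.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = (None, float("inf"))
    points = sorted(set(xs))
    # Candidate thresholds are the midpoints between consecutive distinct values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy data: lagged Democratic vote share and current vote share.
lagged_share = [30, 35, 40, 60, 65, 70]
current_share = [32, 34, 41, 61, 63, 72]
threshold, error = best_split(lagged_share, current_share)
print(threshold)
```

In this toy example the best decision point falls between the safely Republican and safely Democratic districts, which is the kind of split a tree would make first.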
I use congressional election data from 1974 to 2018 collected by my colleague Carlos Algara, along with U.S. Census data.
In my predictive model, I have two outcomes of interest. The first is the Democratic candidate’s two-party congressional vote share (the percentage of the combined Democratic and Republican vote that the Democratic candidate wins on Election Day). Data limitations required me to use two-party vote share instead of overall vote share. The second is which party’s candidate (Democratic, Republican, or Independent) won the election.
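As a small illustration, two-party vote share can be computed as follows. This is a sketch that assumes the standard definition (Democratic votes divided by the sum of Democratic and Republican votes); the function name and vote counts are hypothetical.

```python
def two_party_share(dem_votes, rep_votes):
    """Democratic share of the votes cast for the two major parties, in percent."""
    total = dem_votes + rep_votes
    if total == 0:
        raise ValueError("no major-party votes recorded")
    return 100.0 * dem_votes / total


# Hypothetical race: third-party votes are left out of the denominator.
share = two_party_share(dem_votes=120_000, rep_votes=100_000)
print(round(share, 1))
```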
I include the same predictor variables in both models. These variables include incumbency, whether a congressional district was redistricted, the Democratic candidate’s campaign contributions, the Republican candidate’s campaign contributions, the president’s vote share from the previous presidential election, the Democratic candidate’s vote share from the previous congressional election, the percent of unemployed constituents, the district’s median income, the district’s income inequality level, and a term for each congressional election cycle. To keep the exercise realistic, I selected predictors that The New York Times might have access to the day before the election.
Next, I split my data into two groups: a training set (which I use to fit the model) and a testing set (held-out data on which the model makes predictions). I determine the accuracy of a model by how well it predicts the testing set (known as out-of-sample estimation). The better my model is at predicting outcomes and vote share in the testing set, the more confidence we can have that it will do equally well at predicting future vote share and correctly picking winners.
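The split can be sketched as below. This is an illustration rather than my actual code: it assumes a simple random 70/30 split without replacement, and the function name, seed, and placeholder race records are hypothetical.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and cut them into a training set and a testing set."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = rows[:]             # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical placeholder records standing in for congressional races.
races = [{"race_id": i} for i in range(100)]
train, test = train_test_split(races)
print(len(train), len(test))
```

The model only ever sees the training rows while it is being fit, so performance on the testing rows is an honest estimate of how it handles races it has not seen.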
Results
The first set of results reports how accurately my model predicts the winner of congressional elections. The random forest algorithm predicts the winner of out-of-sample congressional elections with 94 percent accuracy. That is, out of the 2,981 observations in the testing set, the model correctly predicted the winner of 2,803 races.
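The accuracy figure is simply the share of testing-set races called correctly. The sketch below shows the calculation on a few hypothetical labels and then checks the arithmetic using the two counts reported above; the function name is my own for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of races where the predicted winner matches the actual winner."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels, only to show how the function is used.
toy_accuracy = accuracy(["D", "R", "D", "R"], ["D", "R", "R", "R"])

# Arithmetic check using the counts reported in the text.
reported_accuracy = 2803 / 2981
print(round(reported_accuracy, 2))
```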
The figure on the left shows the percent of correct and incorrect predictions for races won by Democrats and races won by Republicans. The model is excellent at predicting the winner of congressional contests, with greater than 90 percent accuracy for both parties.
The figure on the right shows the variables with the strongest predictive power in determining the winner of a congressional contest. This figure shows that lagged Democratic vote share is the most important variable in predicting the winner, followed by incumbency and campaign contributions.
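One way to think about predictive power is permutation importance: shuffle one variable and see how much accuracy drops. The sketch below uses that technique for illustration only. It is not necessarily the importance measure behind my figure (random forest packages often report an impurity-based measure instead), and the stand-in predict rule, column names, and data are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows where the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, column, seed=0):
    """Drop in accuracy after shuffling one column across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    values = [r[column] for r in rows]
    rng.shuffle(values)
    # Rebuild each row with the shuffled value in place of the original.
    permuted = [dict(r, **{column: v}) for r, v in zip(rows, values)]
    return baseline - accuracy(predict, permuted, labels)


def predict(row):
    # Hypothetical stand-in for a fitted model: it only looks at lagged vote share.
    return "D" if row["lagged_dem_share"] > 50 else "R"


shares = [30, 40, 45, 55, 60, 70, 35, 65]
rows = [{"lagged_dem_share": s, "noise": i} for i, s in enumerate(shares)]
labels = [predict(r) for r in rows]

lagged_importance = permutation_importance(predict, rows, labels, "lagged_dem_share")
noise_importance = permutation_importance(predict, rows, labels, "noise")
```

A variable the model never uses, like the noise column here, shows no drop in accuracy when shuffled, while a variable the model relies on can only hurt accuracy when scrambled.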
The second set of figures reports how accurately my model predicts Democratic candidates’ two-party vote share. The random forest model estimates the vote shares of out-of-sample congressional candidates with an average prediction error (root mean squared error) of 5.2 percentage points.
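Root mean squared error is the square root of the average squared gap between predicted and actual vote share, so it is expressed in the same units as vote share. A minimal sketch, with hypothetical vote shares rather than my results:

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of numbers."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predicted and actual Democratic two-party vote shares (percent).
predicted = [52.0, 47.0, 61.0, 38.0]
actual = [55.0, 43.0, 61.0, 33.0]
error = rmse(predicted, actual)
print(round(error, 2))
```

Because errors are squared before averaging, a few badly missed races raise RMSE more than many small misses do.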
The figure on the left shows how well the model performs at making out-of-sample predictions by plotting the actual vote share against the predicted vote share. It reports that the correlation between predicted and actual vote share is 0.96 and that the training model’s R-squared is 0.91. The figure shows that the model does a solid job of predicting congressional vote share.
The figure on the right shows which variables have the strongest predictive power in determining two-party vote share. This figure shows that lagged Democratic vote share is the most important variable in predicting candidates’ vote share, followed by incumbency and campaign contributions.
Both models exceed expectations. Correctly classifying about 94 percent of out-of-sample races and predicting vote share with an average error of approximately five percentage points shows the power of machine learning and is a solid starting point for building an even more accurate model.
Moving Forward
Given the limited time to prepare this model, I could not incorporate all my ideas. If I had more time, here are several changes I would make to improve the model’s predictive power.
First, rather than specifying the model with covariates that I believe have a theoretical relationship to the outcomes of interest, I can use an inductive approach to predictor selection. Specifically, I can use correlations or Predictive Power Scores (PPS) to select the predictors with the strongest quantitative relationship to both vote share and outcomes, then include them in my models.
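The correlation version of this idea can be sketched as follows: score each candidate predictor by the absolute value of its Pearson correlation with vote share, then rank them. The column names and values are hypothetical, and this sketch does not attempt PPS, which needs a dedicated library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sd_x * sd_y)


def rank_predictors(columns, outcome):
    """Order predictor names by absolute correlation with the outcome, strongest first."""
    scores = {name: abs(pearson(values, outcome)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical district-level data for five races.
columns = {
    "lagged_dem_share": [30, 40, 50, 60, 70],
    "unemployment": [6.0, 4.0, 7.0, 5.0, 6.5],
    "median_income": [70, 55, 60, 52, 48],
}
vote_share = [33, 41, 49, 62, 68]
ranking = rank_predictors(columns, vote_share)
print(ranking)
```

Correlation only captures linear relationships, which is one reason a measure like PPS might surface predictors that this ranking would miss.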
Second, I can construct an ensemble model. Put simply, an ensemble model in machine learning seeks better predictive performance by combining the predictions from multiple machine learning models.
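In its simplest form, an ensemble averages the predictions of several models. The sketch below assumes three hypothetical stand-in models written as plain functions; in practice each would be a fitted model such as a random forest or a regression, and the combination could be weighted rather than a simple mean.

```python
def ensemble_predict(models, row):
    """Average the vote share predictions of several models for one race."""
    predictions = [model(row) for model in models]
    return sum(predictions) / len(predictions)


# Hypothetical stand-in models, each returning a predicted Democratic vote share.
def lagged_model(row):
    return row["lagged_dem_share"]


def presidential_model(row):
    return row["pres_dem_share"]


def incumbency_model(row):
    return 55.0 if row["dem_incumbent"] else 45.0


race = {"lagged_dem_share": 58.0, "pres_dem_share": 52.0, "dem_incumbent": True}
estimate = ensemble_predict([lagged_model, presidential_model, incumbency_model], race)
print(estimate)
```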
Third, I’d like to add variables that measure the general mood of the congressional district. Variables like presidential approval, congressional approval, and polling data would be useful in capturing mood, which would likely improve both models’ out-of-sample performance, especially in wave elections. These would be individual-level data aggregated to the congressional district level.
Finally, I use a bootstrapped random sample of 70 percent of the data to train my model. That means my testing set is composed of random races across multiple years. The next step would be to use a bootstrapped random sample of data to predict an entire election cycle’s outcome. This exercise would more closely approximate the job of generating baseline estimates for the needle.
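One way to set up that exercise is to hold out a full election cycle as the testing set and train only on earlier cycles. The sketch below makes that assumption (training strictly on prior years is my design choice for illustration, not something I have implemented), and the function name and race records are hypothetical.

```python
def split_by_cycle(races, test_year):
    """Train on every cycle before test_year and test on test_year itself."""
    train = [r for r in races if r["year"] < test_year]
    test = [r for r in races if r["year"] == test_year]
    return train, test


# Hypothetical races from three election cycles.
races = [
    {"year": 2014, "district": "A"},
    {"year": 2014, "district": "B"},
    {"year": 2016, "district": "A"},
    {"year": 2016, "district": "B"},
    {"year": 2018, "district": "A"},
    {"year": 2018, "district": "B"},
]
train, test = split_by_cycle(races, test_year=2018)
print(len(train), len(test))
```

Splitting by year rather than at random keeps information from the cycle being predicted out of the training data, which mirrors the situation on the day before an election.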
Thank you for the opportunity to interview for the position. I am extremely enthusiastic about it and the contribution I can make to the needle and any other projects. If you have any questions, or if there is anything else I can do to help in the selection process, please feel free to reach out.
To see the code I use to generate my predictions, click here: