Predicting election outcomes using machine learning
Elections understandably generate a lot of interest, given how much their outcomes affect our lives. So intense is the interest that we try to preview the results through exit polls, even though the actual results become available in just a few days. Wouldn’t it be nice, then, to use machine learning algorithms to predict the results even further in advance, using data from previous elections? The predictions wouldn’t be accurate, but neither are exit polls.
My home state, Tamil Nadu in India, recently went to the polls, and I will be using the data from the Election Commission of India’s website, found here:
https://eci.gov.in/files/file/3473-tamil-nadu-general-legislative-election-2016/
First we do some imports. Not all of them are essential (the os module and Seaborn, for example), and there are other ways to do things, but these are convenient and appealing to me.
I am interested in the second file, which contains the results of the State election in Tamil Nadu (TN). So I’m going to read the .xls file into a dataframe, with the header parameter indicating the Excel row to use as the header. Rows above the header row are left out of the dataframe. I also take care to drop rows with NaN values. Then we view the last five rows using the DataFrame.tail() method.
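As a rough sketch of that loading step: the snippet below uses a tiny in-memory CSV as a stand-in for the ECI spreadsheet, so it runs without the file. The column names and rows are made up for illustration. With the real file, pd.read_excel takes a header argument in the same way, but the row number to pass depends on the file's layout.

```python
import io

import pandas as pd

# Hypothetical stand-in for the spreadsheet: two title rows, then the header row.
raw = io.StringIO(
    "Hypothetical report title\n"
    "Generated for illustration only\n"
    "Constituency,Candidate,Sex,Age,Party,Votes\n"
    "A,Cand 1,M,45,P1,1000\n"
    "A,Cand 2,F,38,P2,800\n"
    "B,Cand 3,M,,P1,1200\n"
    "B,Cand 4,F,52,P3,300\n"
)

# header=2 means the third line (0-indexed) holds the column names;
# the lines above it are left out of the dataframe.
df = pd.read_csv(raw, header=2)

# Drop any row that has a missing value (here, the candidate with no age).
df = df.dropna()

# Look at the last rows (at most five).
print(df.tail())
```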
In the following code snippet, we perform a groupby operation and plot the result to visualize the mean age of each gender along with their actual numbers.
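The aggregation behind such a plot might look like the sketch below. The column names "Sex" and "Age" and the six rows are assumptions for illustration, not the actual ECI data, and the plotting call is left out.

```python
import pandas as pd

# Made-up candidates; the real dataframe has one row per candidate.
candidates = pd.DataFrame(
    {
        "Sex": ["M", "M", "M", "F", "F", "O"],
        "Age": [50, 40, 60, 45, 35, 30],
    }
)

# One row per gender: the mean age and the number of candidates.
summary = candidates.groupby("Sex")["Age"].agg(mean_age="mean", n_candidates="count")
print(summary)
```

The resulting table has one row per gender and can be passed to whatever bar plot you prefer.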
It’s disappointing to see such low numbers of female and other-gender candidates being given tickets to contest. I wonder how this affects a party’s chances of winning. Do parties with more female contestants perform better? We will see that later. But before that, below is the function I used to create annotations for the bar plots, based on something I came across on Stack Overflow.
But before we explore the party-wise gender participation, let’s also take a look at another important metric of inclusivity, i.e., the caste of the candidate, which is indicated in the column ‘Candidate Category’, and see how inclusive the parties are there.
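A minimal sketch of such an annotation helper is shown here. It is not necessarily the function from the original post: the name annotate_bars and the bar heights are hypothetical, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt


def annotate_bars(ax, fmt="{:.0f}"):
    """Write each bar's height just above the bar and return the labels."""
    labels = []
    for patch in ax.patches:
        height = patch.get_height()
        x_center = patch.get_x() + patch.get_width() / 2
        label = fmt.format(height)
        ax.annotate(label, (x_center, height), ha="center", va="bottom")
        labels.append(label)
    return labels


# Made-up counts per gender, purely for illustration.
fig, ax = plt.subplots()
ax.bar(["M", "F", "O"], [3200, 320, 2])
labels = annotate_bars(ax)
```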
I really wish I had written more readable code for the second plot. I apologise for its draconian length. But it does the job of showing the percentages of the different categories, and the stark contrast already tells us that a lot remains to be done on this front as well. But are there rewards for parties that pursue inclusivity on these two criteria? We will see in the following plots how the various political parties fare. Let’s examine this for parties that field candidates in at least 24 constituencies, to filter out small parties which do not have a pan-State presence.
Now I know that the two principal parties in my State are DMK and ADMK, and neither can be said to be particularly successful because of being inclusive of either sexes or castes. Even a party like the BJP, which at the time of this election in 2016 formed the government at the Union, fares badly on sponsoring female candidates and only marginally better than the regional heavyweights in supporting suppressed castes.
So far so good, but we haven’t done any machine learning yet. Let’s look at one more plot, showing the vote shares of the top six parties, before seeing whether we can use machine learning algorithms to predict them.
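One way to apply that 24-constituency filter is sketched below. The parties P1 to P3, the column names, and the candidate rows are all invented for illustration; only the threshold of 24 comes from the text.

```python
import pandas as pd

MIN_CONSTITUENCIES = 24  # threshold taken from the text above

# Invented data: (party, constituencies contested, every n-th candidate is female).
rows = []
for party, n_seats, female_every in [("P1", 30, 10), ("P2", 24, 4), ("P3", 3, 1)]:
    for i in range(n_seats):
        sex = "F" if i % female_every == 0 else "M"
        rows.append({"Party": party, "Constituency": f"C{i}", "Sex": sex})
df = pd.DataFrame(rows)

# Count the distinct constituencies each party contested.
seats = df.groupby("Party")["Constituency"].nunique()

# Keep only parties that meet the threshold.
major = seats[seats >= MIN_CONSTITUENCIES].index
filtered = df[df["Party"].isin(major)]

# Percentage of female candidates per remaining party.
female_pct = filtered.groupby("Party")["Sex"].apply(lambda s: (s == "F").mean() * 100)
print(female_pct)
```

The same pattern works for the caste column by swapping the comparison on "Sex" for one on the category of interest.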
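The vote shares themselves can be computed with a groupby, roughly as follows. The party labels and vote counts are invented, and "Party" and "Votes" are assumed column names.

```python
import pandas as pd

# Invented totals, split evenly over two constituencies per party.
party_totals = {"P1": 400, "P2": 300, "P3": 100, "P4": 80,
                "P5": 60, "P6": 40, "P7": 15, "P8": 5}
rows = []
for party, total in party_totals.items():
    rows.append({"Party": party, "Constituency": "A", "Votes": total / 2})
    rows.append({"Party": party, "Constituency": "B", "Votes": total / 2})
df = pd.DataFrame(rows)

# Sum votes per party, convert to a percentage of all votes, sort descending.
party_votes = df.groupby("Party")["Votes"].sum()
vote_share = (party_votes / party_votes.sum() * 100).sort_values(ascending=False)

# Keep the six largest parties for the plot.
top_six = vote_share.head(6)
print(top_six)
```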
So now we have some context around the dataset to do some actual machine learning. We are going to use the same dataset we’ve been exploring and select features and target for our X and y variables respectively.
For our features variable, we should take care to include only those columns that we would have access to before the results are announced. Including anything else is a form of leakage, which can ruin machine learning projects if not carefully identified.
In the first step I am renaming the columns and performing some preprocessing, like encoding the non-numeric fields and the year field using scikit-learn’s LabelEncoder. I’m skipping normalization since I chose to use a tree-based model. However, if you use a distance-based model, then scaling the features should be part of the preprocessing pipeline.
Finally we drop the columns that are now encoded, along with other columns that would cause leakage, and run the random forest regressor. The topic of leakage is very important, and even after dropping a few columns that would cause leakage, I can say with certainty that the dataset is still not impervious to it. I’ll explain why I feel this way in just a bit.
The score on the validation set seems OK, but what does it mean for the vote share prediction? Let’s plot pie charts of the actual and predicted vote shares to see what a score of 91.26% looks like.
So we see that a score of about 91% does not look bad on the charts. But before we prematurely celebrate the success of our model, I’d like to revisit the topic of leakage.
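The encoding step could look something like this sketch. The column names and values are assumptions for illustration; the real dataframe has more fields.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few invented rows with the kinds of fields mentioned above.
df = pd.DataFrame(
    {
        "party": ["DMK", "ADMK", "BJP", "DMK"],
        "sex": ["M", "F", "M", "F"],
        "year": [2016, 2016, 2016, 2016],
    }
)

# Fit a separate encoder per column and store the integer codes alongside.
for col in ["party", "sex", "year"]:
    df[col + "_enc"] = LabelEncoder().fit_transform(df[col])

print(df)
```

LabelEncoder assigns codes in sorted order of the distinct values, so the integers carry no meaning beyond identity. scikit-learn documents LabelEncoder as intended for target labels; OrdinalEncoder is its feature-side counterpart and would work equally well here.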
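Here is a minimal sketch of that final step on synthetic data. Everything about the data is an assumption: the feature names, the way votes are generated, and "position" as an example of a leaky column (a candidate's rank is only known after counting). The hyperparameters are illustrative too, not the ones used for the reported score.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed dataframe.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame(
    {
        "party_enc": rng.integers(0, 6, n),
        "sex_enc": rng.integers(0, 2, n),
        "age": rng.integers(25, 80, n),
        "total_electors": rng.integers(150_000, 300_000, n),
    }
)
df["votes"] = (df["party_enc"] * 5000 + rng.normal(0, 2000, n)).clip(lower=0)

# Hypothetical leaky column: derived from the result itself.
df["position"] = df["votes"].rank(ascending=False)

# Features exclude the target and anything only known after the result.
leaky_columns = ["position"]
X = df.drop(columns=leaky_columns + ["votes"])
y = df["votes"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# For a regressor, score() returns R^2 on the given data.
score = model.score(X_valid, y_valid)
print(f"Validation R^2: {score:.4f}")
```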
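To get from per-candidate predictions to the numbers behind such pie charts, the votes can be summed per party and converted to shares, roughly like this. The figures below are invented; in practice "predicted" would hold the model's output for the validation rows.

```python
import pandas as pd

# Invented actual and predicted votes for a handful of candidates.
results = pd.DataFrame(
    {
        "party": ["P1", "P1", "P2", "P3"],
        "actual": [500, 100, 300, 100],
        "predicted": [450, 110, 330, 110],
    }
)

# Total votes per party, then each party's percentage of the column total.
totals = results.groupby("party")[["actual", "predicted"]].sum()
shares = totals / totals.sum() * 100
print(shares)
```

Each column of shares can then be drawn as its own pie chart for a side-by-side comparison.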
Conclusion
In this model I’m using only the data collected from the 2016 elections, both for training and for testing. Voters’ choices may have depended on factors that are not represented as columns, like alliances, anti-incumbency, or recent events. Because those factors were shared by both the training and the test sets, the model likely performs better here than it would on previously unseen data from a different election. The results dataset for the recently concluded 2021 elections has yet to be published on the ECI’s website, and I look forward to running the model on it as soon as the data becomes available.