Voter Segmentation

Using unsupervised machine learning to segment voters

Posted by Christian Payne in July 2022

Given the recent talk about new Conservative leadership candidates and their potential to hold the 2019 Conservative coalition together, I've been thinking about voters. Specifically, I've been thinking about types of voters. Lots of voter segmentation work focuses on dividing voters by their likelihood to vote, purely for the purposes of campaigning. But what about identifying groups of voters using more than just their propensity to vote?

Voter Data

The data we'll be using is the latest wave of the British Election Study (BES) panel survey. This data contains interesting fields concerning voter demographics and political preferences, including their estimated left/right alignment. Using this data, I've provided a plot showcasing the left/right alignment of voters by their vote in the 2019 general election.

Altogether, the BES panel is an incredibly rich dataset for exploring voter behaviour. Exactly what we need for our clustering project.

K-means Clustering

For the cluster modelling, we'll be using the k-means clustering algorithm. In layman's terms, the algorithm works by placing a number (K) of points, called "centroids", throughout the data, where K is a number that we have to choose before fitting the algorithm. Each data point is assigned to the cluster of the centroid it is closest to. How much of the data is explained by those centroids is then calculated. Then each centroid moves to the middle of the points assigned to it, and the process starts again. This continues until the centroids settle and the amount of the data explained by the clusters stops improving. But how do we decide on how many centroids to start with?
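Before answering that, the loop described above can be sketched in a few lines of Python. This is a minimal illustration using NumPy on made-up two-dimensional data, not the code used for this project; the function name `kmeans`, its parameters and the toy data are all assumptions for the sake of the example.

```python
import numpy as np


def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X (Euclidean distance)."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)
        # Move each centroid to the mean of the points assigned to it;
        # a centroid with no points stays where it is.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # the centroids have stopped moving
        centroids = new_centroids
    labels = nearest_centroid(X, centroids)
    # Within-cluster sum of squared distances: lower means tighter clusters.
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia


# Made-up data: three blobs of 50 points each (NOT the BES survey).
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

labels, centroids, inertia = kmeans(toy, k=3)
print(centroids.round(1))
print(f"within-cluster sum of squares: {inertia:.1f}")
```

The quantity being tracked here is the within-cluster sum of squares, which the algorithm drives down. That is the mirror image of "how much of the data is explained": the tighter the clusters, the more of the variation the centroids account for. The result can depend on the random starting points, which is why library implementations usually try several starts.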

To determine how many centroids we should use (the value of K), we will use an elbow plot. This plots how much of the data can be explained by an algorithm with K clusters. Naturally, more clusters increase the amount of the data the algorithm can explain. However, the marginal gain from adding another cluster shrinks as we add more clusters. The optimal number of clusters is the value of K at which the trajectory of the line changes, like an elbow.

In the plot above the change is relatively subtle, particularly as the plot is a little stretched. We'll be using 3 clusters for the algorithm we'll build.
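For anyone wanting to reproduce this kind of plot, here is a rough sketch of how the numbers behind an elbow plot can be computed. It assumes scikit-learn and uses synthetic data from `make_blobs` as a stand-in for the survey features; the metric shown is scikit-learn's `inertia_` (the within-cluster sum of squares), which may not be the exact measure used for the plot above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the survey features (NOT the BES data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    # n_init=10 repeats the fit from 10 random starts and keeps the best one.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# How much the within-cluster sum of squares falls with each extra cluster.
drops = -np.diff(inertias)
for k, drop in zip(k_values[1:], drops):
    print(f"K={k}: inertia fell by {drop:.1f}")
```

Because inertia measures what the clusters fail to explain, the curve falls as K grows and then flattens; the elbow is where the drops become small. Plotting `k_values` against `inertias`, for example with matplotlib's `plt.plot(k_values, inertias, marker="o")`, gives the familiar picture.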

Clustering Voters

With our value of K determined, we can then fit the algorithm. Visualising the clusters is a little difficult given the number of variables in our data. However, what we can do is use another algorithm called Principal Component Analysis (PCA). This reduces the number of columns by combining them into "components", ordered by how much of the variation in the data they explain. This allows us to visualise the clusters we've identified a little more easily. Below I've plotted the voter clusters against the two largest of these components, with the percentage of the data they explain in brackets.

I admit, it's a little difficult to see the difference when plotting against only two of these components. So I've put together an interactive 3D visualisation of the clusters with a third component added.
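As a rough guide to how the component scores and the percentages in brackets can be produced, here is a minimal sketch assuming scikit-learn. The data is synthetic, and the standardisation step (`StandardScaler`) is my assumption for the example rather than something confirmed for this project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey features (NOT the BES data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)

# Assumption: put every column on the same scale before clustering and PCA.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Keep the three components that explain the most variance.
pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)

# Axis labels in the style "PC1 (xx.x% of variance)".
for i, share in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i} ({share:.1%} of variance)")
```

The first two columns of `components`, coloured by `labels`, give a 2D scatter plot of the clusters; adding the third column gives the 3D version.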

Meet the Clusters

Now that we have the clusters, I think it's time to actually look into the voters within each cluster to identify any common patterns.
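The comparisons that follow boil down to two calculations: each cluster's share of the sample, and the proportion of each cluster with a given characteristic. A small sketch with pandas shows the idea; the table and its column names (`cluster`, `university`, `voted_remain`) are hypothetical and hand-made, not taken from the BES data.

```python
import pandas as pd

# Hypothetical toy data: one row per respondent, 1 = yes, 0 = no.
voters = pd.DataFrame({
    "cluster":      [1, 1, 1, 1, 2, 2, 3, 3],
    "university":   [1, 1, 1, 0, 1, 0, 0, 0],
    "voted_remain": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Share of the sample that falls in each cluster.
cluster_share = voters["cluster"].value_counts(normalize=True).sort_index()

# Mean of a 0/1 column within a cluster = proportion answering "yes".
profile = voters.groupby("cluster").mean()

print(cluster_share)
print(profile)
```

With real survey data, respondent weights would normally need to be taken into account as well; this sketch ignores them.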

Cluster 1: 'The Liberal Cosmopolitans'

Looking at the first cluster, we can see that its voters tend to be more akin to the cosmopolitan voter typology. Looking at the graph below, the first cluster has the highest proportions of Guardian readers, voters with a university education and Remain voters, as well as of those considering voting for Labour at the next election. Cluster 1 is also the most economically secure, as measured by self-reported concerns around unemployment and poverty.

Finally, Cluster 1 is also the largest cluster in our sample, with a membership encompassing around 64% of the voters.

Cluster 2: 'The Tory Loyalists'

Cluster 2 is very similar to Cluster 1 in many respects, as we can see from the above graph, particularly with regard to measures of economic well-being. Indeed, around 26.7% of voters in Cluster 2 report a gross household income higher than £50,000, compared with 26.1% and 21.7% for Clusters 1 and 3 respectively (although Cluster 1 does edge Cluster 2 out in the higher income brackets). However, one factor that separates Cluster 2 from the others is its staunch loyalty to the Conservative party. Below I've identified some more key features that stand out for Cluster 2 and plotted them.

As the name would suggest, Cluster 2 are ardent supporters of the Conservatives, with the party enjoying large support from them in 2015 and 2019, although their support dipped in 2017. Cluster 2 also voted heavily for 'Leave' in 2016 and are the most Christian of the clusters. One final interesting piece of information that can be gleaned from this chart actually concerns Cluster 3. Specifically, it is interesting that the significant rises in Cluster 3's support for the Tories coincide with the Conservatives' sizable wins in both 2015 and 2019. Indeed, it is perhaps through mobilising this group that the Conservatives have paved their pathway to No. 10.

Cluster 3: 'The Politically Uninterested'

The final cluster we have identified, Cluster 3, is interesting with respect to a number of features. At first glance, Cluster 3 stands out as the most economically insecure of the 3 clusters: as described earlier, only 1 in 5 voters in Cluster 3 report a gross household income above £50,000, compared with just over 1 in 4 voters in Clusters 1 and 2. Cluster 3 is also the most ethnically diverse of the clusters, as well as the one with the fewest formal qualifications. I've presented these insights in the graph below.

Final Thoughts

I hope that this analysis has improved your understanding of unsupervised machine learning as well as the British political landscape.

One thing that has surprised me during this project is how similar the clusters in our sample are. Below I've plotted the left/right alignment by cluster, and what sticks out is how closely most of the clusters overlap. Perhaps we aren't so different after all.

As always, all the code I've written for this project can be found on my GitHub here.