I was recently inspired by this YouTube video and accompanying repo that built a character relationship network for the Witcher series of books. Given that I've never worked with network data before, I thought conducting a similar analysis would be a great project to familiarise myself with network analysis. While I don't know much about the Witcher series, I love Lord of the Rings. So I thought, why not replicate this analysis with that series?
All of the code used for this analysis, as well as the outputs, is stored on my repo here.
Extracting Character Relationships
The first thing we need to do is create the character relationship data from the books. For this we'll be using the English language model from the spaCy library. spaCy is a natural language processing library whose pre-trained models can, among other things, extract named entities from text. This makes it particularly good at identifying characters from sentences in a book, for example. After loading the model, I decided to conduct a proof of concept analysis using only The Two Towers. Letting the model loose on the book returned the following output.
Pretty good, albeit not perfect, particularly when the model attempts to determine the class of the entity (person, group, etc). However, it generally seems pretty decent at extracting the characters from sentences. With this we can then move on to extracting those entities and filtering them.
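As a rough sketch of what the extraction step could look like, the snippet below collects the named entities found in each sentence. The model name `en_core_web_sm`, the helper names and the file name in the usage comment are my assumptions for illustration, not necessarily what the repo uses.

```python
from typing import List, Tuple


def entities_per_sentence(doc) -> List[List[Tuple[str, str]]]:
    """Return, for each sentence, the (text, label) pairs of its named entities.

    `doc` is expected to behave like a spaCy Doc: it exposes `.sents`, and each
    sentence exposes `.ents` whose items have `.text` and `.label_`.
    """
    return [[(ent.text, ent.label_) for ent in sent.ents] for sent in doc.sents]


def load_doc(path: str, model: str = "en_core_web_sm"):
    """Run a spaCy pipeline over a plain-text book (path and model are assumptions)."""
    import spacy  # imported lazily so the helper above can be used without spaCy

    nlp = spacy.load(model)
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    # spaCy rejects texts longer than nlp.max_length characters (1,000,000 by default),
    # so raise the limit if the book is longer than that.
    nlp.max_length = max(nlp.max_length, len(text) + 1)
    return nlp(text)


# Hypothetical usage (requires spaCy and the model to be installed):
# doc = load_doc("the_two_towers.txt")
# for sentence_entities in entities_per_sentence(doc)[:5]:
#     print(sentence_entities)
```

The label (`PERSON`, `ORG`, `GPE` and so on) is the part the model tends to get wrong for fantasy names, which is why the next step filters on the entity text rather than on the label.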
The next thing we'll need is a list of characters so we can filter the model's output. For this purpose, I have scraped a list of character names from here and manually edited a couple of them so they're more representative of the names used in the book. This list of character names is not perfect, and a better alternative would be to use a CSV file with a field for alternative names (such as Strider or Mithrandir), but it'll work for now. Using this names list we can generate the character-to-character relationship data by filtering the output of the spaCy model to produce individual connections between characters. Summing these individual connections together gives us the total value for each relationship in the book.
I've included an extract of this network data above and it will serve as the basis for building the network.
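To make the filtering and summing steps concrete, here is a minimal sketch. It assumes two characters are "connected" whenever they are both mentioned within a sliding window of a few sentences. The window size and the function name `count_relationships` are my assumptions, and the repo may define a connection differently.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def count_relationships(
    sentence_entities: Iterable[Iterable[str]],
    characters: Set[str],
    window: int = 5,
) -> Dict[Tuple[str, str], int]:
    """Count how often two known characters appear within `window` consecutive sentences."""
    # Keep only entities that are in the character list, one set per sentence.
    filtered: List[Set[str]] = [
        {name for name in entities if name in characters}
        for entities in sentence_entities
    ]
    weights: Counter = Counter()
    # Slide the window one sentence at a time (at least one window for short texts).
    for start in range(max(len(filtered) - window + 1, 1)):
        present = set().union(*filtered[start:start + window])
        # Sorting makes (Frodo, Sam) and (Sam, Frodo) the same edge.
        for pair in combinations(sorted(present), 2):
            weights[pair] += 1
    return dict(weights)


# Made-up sentences for illustration only.
sentences = [["Frodo", "Sam"], ["Sam", "Mordor"], ["Frodo"]]
relationships = count_relationships(sentences, {"Frodo", "Sam", "Gollum"}, window=2)
print(relationships)
```

Because the windows overlap, a single passage where two characters talk is counted more than once. That inflates the absolute numbers but keeps the relative strength of relationships comparable, which is what matters for the network.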
The Two Towers Network Relationship
With this data we can then begin to build an initial character relationship network graph for the characters in The Two Towers. I've created one below using the NetworkX package in Python:
Pretty cool, right? Feel free to play around with the dynamic network graph and take a look at the relationships between your favourite characters. You can move the nodes around or rotate the network.
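For anyone following along, building the graph itself takes only a few lines. The sketch below assumes the relationship data is a dictionary of `(source, target)` pairs mapped to weights. The helper name `build_graph` and the weights are invented for illustration and are not results from the book.

```python
import networkx as nx


def build_graph(relationships):
    """Build an undirected weighted graph from {(source, target): weight} pairs."""
    graph = nx.Graph()
    for (source, target), weight in relationships.items():
        graph.add_edge(source, target, weight=weight)
    return graph


# Made-up weights for illustration only.
toy_relationships = {
    ("Frodo", "Sam"): 12,
    ("Frodo", "Gollum"): 7,
    ("Sam", "Gollum"): 5,
    ("Aragorn", "Gimli"): 9,
}
G = build_graph(toy_relationships)
print(G.number_of_nodes(), G.number_of_edges())
print(G["Frodo"]["Sam"]["weight"])
```

NetworkX only handles the graph structure here. The interactive rendering is a separate step that this sketch does not cover.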
Community Detection
With this initial network we can then identify different communities of nodes in the data. For example, one group of characters might interact with each other more than they would with another group of characters. We can then use an algorithm to identify these groups and display the result visually. For this, we'll be using the Louvain algorithm. In layman's terms, it groups characters so that there are many more connections inside each community than between communities. More precisely, it searches for the grouping that maximises a score called modularity.
Once we've run this algorithm for all the characters in our network, we can then visualise these communities on another network with colour coding to represent the different communities the algorithm has identified.
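Here is a small sketch of community detection on a toy graph. I'm using the `louvain_communities` function that ships with NetworkX (version 2.8 or later), which is an assumption on my part since the repo may use a different Louvain implementation. The two groups of characters are invented so that the expected split is obvious.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.Graph()
three_hunters = ["Aragorn", "Legolas", "Gimli", "Gandalf"]
ring_bearers = ["Frodo", "Sam", "Gollum", "Faramir"]
for group in (three_hunters, ring_bearers):
    # Connect every pair of characters inside a group.
    G.add_edges_from((a, b) for i, a in enumerate(group) for b in group[i + 1:])
# A single link between the two groups.
G.add_edge("Gandalf", "Frodo")

# The seed makes the (otherwise randomised) node ordering reproducible.
communities = louvain_communities(G, seed=42)
# Map each character to a community index, handy for colour coding nodes.
community_of = {name: idx for idx, members in enumerate(communities) for name in members}

print(communities)
print(round(modularity(G, communities), 3))
```

The `community_of` dictionary is the piece you would pass on to a plotting step, for example as a node attribute that decides each node's colour.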
Measures of Centrality
This is all well and good, but it doesn't settle an obvious next thought. Who's the most important character in The Two Towers? When dealing with a network there are a number of definitions of 'importance'. For example, would the character that has the most connections be the most important? Or maybe the character that serves as the best "bridge" between groups of characters? Here we will provide measures for three definitions of importance:
- Degree Centrality: which characters have the most connections
- Betweenness Centrality: which characters are the best "bridges" between groups of characters
- Closeness Centrality: which characters are best placed to spread information to other characters
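All three measures are available as one-liners in NetworkX. The sketch below runs them on a toy graph that I've invented for illustration (two trios joined only by a Frodo-Aragorn link), and the `top_characters` helper is a hypothetical name of mine. Note that, called this way, all three treat the graph as unweighted.

```python
import networkx as nx


def top_characters(scores, n=10):
    """Return the n highest-scoring (character, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Toy graph, invented for illustration: two trios joined by one link.
G = nx.Graph([
    ("Frodo", "Sam"), ("Frodo", "Gollum"), ("Sam", "Gollum"),
    ("Frodo", "Aragorn"),
    ("Aragorn", "Legolas"), ("Aragorn", "Gimli"), ("Legolas", "Gimli"),
])

# Share of the other characters each character is directly connected to.
degree = nx.degree_centrality(G)
# Share of shortest paths between other pairs that pass through each character.
betweenness = nx.betweenness_centrality(G)
# Inverse of the average shortest-path distance to every other character.
closeness = nx.closeness_centrality(G)

for name, scores in [("degree", degree), ("betweenness", betweenness), ("closeness", closeness)]:
    print(name, top_characters(scores, n=3))
```

In this toy graph Frodo and Aragorn come out on top of every measure because they are the only route between the two trios, while Sam has a betweenness of zero since no shortest path needs to go through him.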
Degree Centrality
Using the network we have defined, we are able to quantify the degree centrality for each of the characters. I have plotted the top 10 characters below.
A couple of surprises there. The top three (Aragorn, Gandalf, and Frodo) are not surprising. What is surprising is that Merry doesn't even make the top 10, while Legolas struggles to make the cut compared to the rest of the fellowship.
Betweenness Centrality
As mentioned before, betweenness is essentially defined here as the ability of that character to act as a bridge. More technically, the more times that the shortest path between two characters passes through that character, the better a bridge they are. Again, I have visualised the top 10 characters.
Very interesting. Sam and Aragorn are wayyy ahead of the rest, Legolas is nowhere to be seen, and there's a surprise performance from Grishnakh the Orc.
Closeness Centrality
Finally we have closeness. Basically this is a function of the distance to all the other characters: the shorter a character's paths to everyone else, the higher their score. The higher the number, the easier it would be to spread information from that character throughout the network.
Interestingly, there's not much difference between the characters in terms of their closeness, which probably reflects how densely connected the network we have built is.
One Network to Rule Them All
Now that we have our proof of concept network running, I think it's time to let the spaCy model loose on the entire series and build a definitive network. Following all the steps laid out previously, I have produced this final network for characters from the Lord of the Rings book series:
One final thing we can do is visualise how the importance of characters has evolved over time. We can do this by measuring the centrality for each of the characters in the fellowship, and plotting their importance by book. We can see this plot below.
Here we can see that Frodo starts out strong, while Aragorn becomes progressively more important. What's also interesting is that whilst Merry goes from strength to strength with regards to his centrality, Pippin remains the least important character in the fellowship and actually becomes less important as the series goes on.
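The by-book comparison boils down to building one graph per book and collecting a score per character from each. The sketch below assumes degree centrality as the measure and uses a hypothetical helper name, `centrality_by_book`. The tiny graphs are made up and are not the real book networks.

```python
import networkx as nx


def centrality_by_book(book_graphs, characters):
    """Map each character to a list of degree-centrality scores, one per book.

    `book_graphs` is a dict of {book title: graph}; dict order sets the book order.
    """
    history = {name: [] for name in characters}
    for graph in book_graphs.values():
        scores = nx.degree_centrality(graph)
        for name in characters:
            # A character missing from a book's graph scores 0.0 for that book.
            history[name].append(scores.get(name, 0.0))
    return history


# Made-up graphs for illustration only.
book_graphs = {
    "Book A": nx.Graph([("Frodo", "Sam"), ("Frodo", "Gandalf")]),
    "Book B": nx.Graph([("Frodo", "Sam")]),
}
history = centrality_by_book(book_graphs, ["Frodo", "Gandalf"])
print(history)
```

Each list in `history` can then be drawn as one line on a chart, with the books along the x-axis.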
There and Back Again
I hope you enjoyed reading about this project and learnt a little about networks along the way. As mentioned before, all the code I wrote is stored on my GitHub here. Please feel free to take a look, and to watch the YouTube video I referred to earlier that inspired this project.