Social network analysis has been around for many years now, but with the burgeoning amounts of network data available to both governmental and private organisations, the use of these techniques is becoming ever more popular and valuable for business and crime fighting purposes. As a lot of the technical terms come up repeatedly in meetings with customers, I thought it’d be useful to provide a very brief (and reasonably non-technical) overview of what these terms mean.
First of all, a clarification. The term “social network” in this context should not be confused with social media such as Twitter, Facebook and Google+. Although these sites are good sources of social network information, the term is actually generic, encompassing all kinds of network data between people. This includes intelligence data gathered by police or counterterrorist agencies, which provides a set of people and the links between them, or even telecoms data on calling patterns. It is true, however, that most applications using publicly available data will probably obtain that information from social media, as it is such a rich seam.
The problem with social network data comes when there’s a lot of it. For small amounts of data, it is easy to draw a network graph showing the people and the connections between them, and visually identify those individuals who are “key players” in the network. But when the data set is large, the picture looks like a mess of spaghetti, and it’s very difficult to see anything meaningful in there.
Several mathematical techniques are available to make some sense of the data. The most common methods are called “centrality” measures, as they aim to show which people are most “central” in the network. Each of these measures tells us something different about the network, and they all have value.
Degree
This is a nice simple measure. It’s just the number of links from and to the node (i.e. the person). So a person with 5 links to other people has a higher degree centrality than a person with only 2 links to other people.
As it is so simple, this measure is exceptionally easy (and fast) to calculate. It’s not particularly powerful, but it is a good first step in analysis to see which people have a lot of connections to others.
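To show just how simple this is, here is a minimal sketch in Python. The people and links are made up for illustration, and the function name `degree_centrality` is my own choice rather than anything from a particular library. It assumes each link is undirected, so it counts once for the person at each end.

```python
from collections import defaultdict

# Hypothetical network: each pair is an undirected link between two people.
links = [
    ("Ann", "Bob"), ("Ann", "Cat"), ("Ann", "Dan"), ("Ann", "Eve"),
    ("Ann", "Fay"), ("Bob", "Cat"), ("Fay", "Gus"),
]

def degree_centrality(links):
    """Count how many links touch each person."""
    degree = defaultdict(int)
    for a, b in links:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

degrees = degree_centrality(links)

# Rank people from most connected to least connected.
ranking = sorted(degrees, key=degrees.get, reverse=True)

print(degrees["Ann"], degrees["Bob"])  # 5 2
```

Ann, with 5 links, comes out at the top of the ranking, ahead of Bob with only 2.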
Betweenness
This measure tells us which people are most “between” other people. There’s a mathematical definition of what this means, but loosely speaking we can say that a person who is on the shortest path of connections between two other people is between them. Put another way: if A and Z are connected through a chain of other people, and Q lies on the shortest such chain, then Q is said to be between A and Z.
A person who is between a lot of other people has a higher betweenness centrality measure than a person who is not between many other people. Betweenness is useful because it potentially tells us which people are the key connectors of other people, or groups of people. These connectors could be individuals who act as go-betweens linking two criminal groups, or perhaps people who straddle two communities, maybe sharing ideas back-and-forth between them.
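For the curious, the go-between idea can be sketched in Python as follows. This uses Brandes' algorithm, which is one well-known way of computing betweenness; it is an illustration under my own assumptions (links are undirected and unweighted, and the network of two tight groups joined by Q is invented), not a description of any particular product.

```python
from collections import deque

# Hypothetical network: two tight groups (A, B, C and X, Y, Z) joined only by Q.
graph = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "Q"],
    "Q": ["C", "X"],
    "X": ["Q", "Y", "Z"], "Y": ["X", "Z"], "Z": ["X", "Y"],
}

def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search, counting the shortest paths from source.
        order = []
        preds = {v: [] for v in graph}
        n_paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v is on a shortest path to w
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest people, sharing out the credit.
        credit = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                credit[v] += n_paths[v] / n_paths[w] * (1 + credit[w])
            if w != source:
                score[w] += credit[w]
    # Each pair of people was counted once from each end, so halve the totals.
    return {v: s / 2 for v, s in score.items()}

between = betweenness_centrality(graph)
print(between["Q"], between["C"], between["A"])  # 9.0 8.0 0.0
```

Q sits on the shortest path for all 9 pairs that span the two groups, so Q scores highest even though Q has only two links.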
Closeness
The measure of closeness indicates which people are closest to other people in the network. A person who is only a couple of hops away from everyone else in the network has a higher closeness centrality than someone who is a large number of hops from many people.
By measuring closeness, we can determine the people who have the best access to everyone else in the network in terms of the number of connections needed to get to them.
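As a rough sketch, closeness can be computed by counting hops outwards from each person. The version below is one common formulation (the number of other people reached, divided by the total number of hops to reach them); it assumes undirected links and a network where everyone can reach everyone else, and the chain of five people is invented for illustration.

```python
from collections import deque

# Hypothetical network: five people in a simple chain A - B - C - D - E.
graph = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E"], "E": ["D"],
}

def hops_from(graph, source):
    """Breadth-first search: number of hops from source to each reachable person."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def closeness_centrality(graph):
    """People reached divided by total hops; higher means closer to everyone."""
    result = {}
    for person in graph:
        dist = hops_from(graph, person)
        total = sum(dist.values())
        # An isolated person reaches nobody, so give them a score of zero.
        result[person] = (len(dist) - 1) / total if total else 0.0
    return result

closeness = closeness_centrality(graph)
ranking = sorted(closeness, key=closeness.get, reverse=True)
```

C, in the middle of the chain, is at most two hops from everyone and ranks first; A and E, at the ends, rank last.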
Eigenvector
Eigenvector centrality is a little harder to describe, but it is one of the most powerful techniques in the social network analysis toolkit. This measure takes into account not just the number of links that each person has (as in degree centrality), but also the number of links of the connected people, and their links too, and so on throughout the network. So if A is the key player in the group, with lots of connections to many other people, then a person B connected directly to A (but only to A) still has a lot of importance, even though B has only one connection. Person Z, out at the edge of the network, might be connected to three people, but if those individuals are not of high importance themselves, then Z’s importance is similarly low.
If we rank people by eigenvector centrality, we can see who the key people are in the network. At the top of the list these may be obvious, but things can get more interesting as we see people who have a high eigenvector centrality even though they are not obviously important. Their appearance high up in the list gives us a clue that we may need to investigate further to determine why they are so high.
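The A, B and Z example can be sketched in Python using repeated averaging (power iteration), which is one standard way to approximate eigenvector centrality. Everything here is an assumption for illustration: the network is invented, links are undirected, and each person's own score is included in every round (a common trick to help the calculation settle down) rather than being part of the definition.

```python
import math

# Hypothetical network: A sits in a tight group with C, D, E and F.
# B is linked only to A. Z is out at the edge, linked to P1, P2 and P3.
links = [
    ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"), ("C", "D"),
    ("C", "E"), ("C", "F"), ("D", "E"), ("D", "F"), ("E", "F"),
    ("A", "B"),
    ("F", "P1"), ("Z", "P1"), ("Z", "P2"), ("Z", "P3"),
]

graph = {}
for a, b in links:
    graph.setdefault(a, []).append(b)
    graph.setdefault(b, []).append(a)

def eigenvector_centrality(graph, max_iter=1000, tol=1e-10):
    """Approximate eigenvector centrality by power iteration."""
    score = {v: 1.0 for v in graph}
    for _ in range(max_iter):
        # Each person's new score is their own score plus their neighbours' scores.
        new = {v: score[v] + sum(score[w] for w in graph[v]) for v in graph}
        # Rescale so the scores do not grow without limit.
        norm = math.sqrt(sum(x * x for x in new.values()))
        new = {v: x / norm for v, x in new.items()}
        # Stop once the scores have stopped changing.
        if max(abs(new[v] - score[v]) for v in graph) < tol:
            return new
        score = new
    return score

scores = eigenvector_centrality(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this made-up network B has one link and Z has three, yet B ranks above Z, because B's single link is to a well-connected person while Z's three contacts are all peripheral.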
The key point to remember is that as powerful as these methods are, the algorithms do not themselves understand what the data represents, or what the results of the calculations mean in real-world terms. By ranking the people in the network according to their centrality measures, it is possible for an educated analyst to use the results to inform their understanding of what is going on in the network. But they need to bring to bear their own expertise in interpreting the data. These techniques are merely tools. Humans are not obsolete yet!
Tags: centrality, counterterrorism, eigenvector, social network analysis
Interesting article. I’m curious, though, when you say eigenvector centrality is significant. What evidence is there that someone connected only to a highly connected person has any influence on them? Is it also based on the nature (e.g. content & direction) of the connection?
I don’t spend much time on Twitter, but I do have an account. I follow Stephen Fry & Obama (via RSS into my Google+) but I doubt I can start a secular revolution even with the aid of these two highly influential connections.
Hi Alan.
As with all these mathematical techniques, the key point is the one I made in the last paragraph:
“as powerful as these methods are, the algorithms do not themselves understand what the data represents, or what the results of the calculations mean in real-world terms.”
In many cases it will, indeed, be the case that someone who is connected to a very central figure is not actually of significant influence. It depends on the nature of the problem being investigated, and the maths can’t really help with that.
What it can do is draw out aspects of the network that are not obvious to a human observer. The term ‘influence’ is a bit of a misnomer in many cases, but we have to call it something, and in many cases it is an appropriate term to use.
In the example you gave you are correct that you don’t have much influence on these two twitterers. In fact, there are mathematical techniques that can be used to compensate for this to some extent in the case of celebrities. But even without these methods, if the maths said that you did have influence then a human user would be able to knock @stephenfry out of the calculation as he’s creating an anomaly.
It’s Quite Interesting!