Part 1: What is Social Network Analysis and how do I get started quickly?
I am going to share the path I took to get up to speed on Social Network Analysis (SNA) using SPSS Modeler. If you are new to SNA, or just curious about what it is all about, I hope you find this useful. I will walk you through everything from starting from scratch to some use cases where I have successfully used SNA in predictive models and data mining projects to uncover criminal activity. I will also show some examples of SNA in SPSS. In future posts I will go into more detail on how to do SNA within SPSS Modeler and how to incorporate it into your predictive models.
I first came across SNA about five years ago, when I was working with a health insurer client in the US, developing predictive models to detect fraud and abuse. This particular client had a skunkworks operation charged with trying new technologies to detect payment fraud and abuse. At that time, SNA was attracting a lot of buzz as the silver bullet we were all looking for to detect fraud. Unfortunately, the industry has since found that SNA alone will not catch all of the bad guys. However, SNA combined with predictive modeling can successfully detect the bad guys, and I will walk you through an example.
So what is SNA? At a high level, it finds relationships between entities. An entity can be a person, a part of a machine, a particle in physics, or pretty much anything that has a connection or relation to another entity. If you have used LinkedIn, you have seen SNA in action: the list of people you might know that LinkedIn generates is based on SNA. It could be argued that most of LinkedIn’s IP is built around its SNA engine.
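To make the "people you might know" idea concrete, here is a minimal sketch in Python. It is not how LinkedIn actually does it (I have no inside knowledge of their engine); it simply assumes that two people who share many connections are likely to know each other. The names, the `connections` list, and the `build_graph` and `suggest` helpers are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical connections; each pair is an undirected relationship (an "edge").
connections = [
    ("Ann", "Bob"),
    ("Ann", "Cara"),
    ("Bob", "Dan"),
    ("Cara", "Dan"),
    ("Dan", "Eve"),
]


def build_graph(pairs):
    """Map each entity (node) to the set of entities it is connected to."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def suggest(graph, person):
    """Rank people not yet connected to `person` by number of shared connections."""
    counts = defaultdict(int)
    for friend in graph[person]:
        for candidate in graph[friend]:
            # Skip the person themselves and anyone they already know.
            if candidate != person and candidate not in graph[person]:
                counts[candidate] += 1
    # Most shared connections first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


graph = build_graph(connections)
suggestions = suggest(graph, "Ann")
print(suggestions)
```

In this toy network Ann is not connected to Dan, but both of her connections are, so Dan comes back as the suggestion. Real SNA engines use far richer measures, but the underlying data is the same: entities and the relationships between them.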
So how do you get up to speed quickly on SNA and start using it in your predictive models? As you research and learn more about SNA, you will find that getting the data into the right format is the key to doing SNA. SNA programs tend to be very finicky about how the data must be structured before it can be fed into the software. There is no way I can teach you everything about SNA, and all the different ways data can be fed into it, in a single post, so I am going to recommend that you start with an open source program called NodeXL.
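To give a feel for what "the right format" usually means, here is a rough sketch in Python. Most SNA tools want some form of edge list: one row per relationship, with a source, a target, and optionally a weight. The claim records, the `to_edge_list` helper, and the `source`/`target`/`weight` column names below are my own assumptions for illustration; check the documentation of whichever tool you use for the exact layout it expects.

```python
import csv
import io
from collections import Counter

# Hypothetical claim records: (claim_id, provider, patient).
claims = [
    ("C1", "Provider_A", "Patient_1"),
    ("C2", "Provider_A", "Patient_2"),
    ("C3", "Provider_B", "Patient_1"),
    ("C4", "Provider_A", "Patient_1"),
]


def to_edge_list(records):
    """Collapse transaction rows into one weighted edge per provider-patient pair."""
    weights = Counter((provider, patient) for _, provider, patient in records)
    return [(src, dst, w) for (src, dst), w in sorted(weights.items())]


edges = to_edge_list(claims)

# Write the edge list as CSV text (in memory here; a real job would write a file).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["source", "target", "weight"])  # assumed column names
writer.writerows(edges)
edge_csv = buffer.getvalue()
print(edge_csv)
```

The four claim rows collapse into three edges, with the repeated Provider_A and Patient_1 pair carrying a weight of 2. Most of the ETL effort in an SNA project goes into this kind of reshaping.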
NodeXL is an add-on for Microsoft Excel, so it has a user interface that you will find familiar. Along with the software, I recommend you spend $30 on the companion book. It has several examples and will walk you through all of the steps, from getting your data in shape for the analytics through to interpreting the results. The book also does a great job of introducing you to all of the terminology that is unique to SNA. Another nice feature of NodeXL is that it can tap into Twitter and bring back Tweets for you to run SNA on.
If you want to learn more about SNA, I encourage you to check out Coursera. They typically have at least one SNA course running at any given time.
If you are working with really large datasets and you want to do more graphical exploration of your data, I recommend Gephi. I will often use NodeXL or Gephi on my SNA data before I run SNA in SPSS, just to see what types of relationships they can find. If they can’t find anything interesting, there might be a problem with my ETL, or SNA may not be appropriate and something like association rule modeling might work better.
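Before I even open a graphical tool, a quick sanity check on the edge list can catch the more obvious ETL problems. The sketch below is plain Python and is not what NodeXL or Gephi do internally; the sample `edges` and the `summarize` helper are hypothetical. It counts nodes, edges, and self-loops, and measures how the network breaks into connected groups.

```python
from collections import defaultdict, deque

# Hypothetical edge list; ("F", "F") is a self-loop, often a sign of an ETL slip.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E"), ("F", "F")]


def summarize(edge_pairs):
    """Return basic counts plus the sizes of the connected components."""
    adjacency = defaultdict(set)
    self_loops = 0
    for a, b in edge_pairs:
        if a == b:
            self_loops += 1
            adjacency.setdefault(a, set())  # keep the node, add no neighbor
            continue
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    component_sizes = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search to collect everything reachable from `start`.
        queue = deque([start])
        seen.add(start)
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        component_sizes.append(size)

    return {
        "nodes": len(adjacency),
        "edges": len(edge_pairs),
        "self_loops": self_loops,
        "component_sizes": sorted(component_sizes, reverse=True),
    }


summary = summarize(edges)
print(summary)
```

If nearly every component has only two nodes, or the self-loop count is high, I would look at the ETL first before concluding that SNA has nothing to offer on that data.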
In my next post, I will go into more detail about SNA with some use cases.