
Writeprint Analysis Using IBM SPSS Statistics and Modeler

Writeprint analysis, or forensic linguistics as it is often termed, is used to analyze unstructured text data to determine authorship.  Variations of writeprint analysis have been used for hundreds of years, for example to analyze the different books of the Bible to determine their authorship.  The works of Shakespeare have also been extensively studied to determine whether Shakespeare was the true author.  Currently, this type of analytics is often used by law enforcement and national security agencies.  An offshoot of forensic linguistics analyzes social media postings to estimate the age and gender of the poster based upon the content they have posted.

There is very little “out-of-the-box” software for forensic linguistics.  This is mainly because the business rules inherent to the analysis are specific to the problem being analyzed and often do not translate easily to other use cases.  IBM SPSS does not offer an “out-of-the-box” solution; however, using the powerful text string functions built into SPSS Statistics and SPSS Modeler, along with SPSS text analytics, it is possible to construct a writeprint solution.

The core of this analysis is the development of the business rules.  If you are analyzing emails, for example, the rules could measure the number of paragraphs, the number of sentences in a paragraph, the number of words in a sentence, the length of the words, etc.  If you are looking at chat room usage or phone text messages, they could measure the number of posts since a particular word occurred and the position of the word within the message.  Potentially there can be dozens of these rules generating hundreds of variables.
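To make the idea of a business rule concrete, here is a minimal sketch in Python rather than in SPSS syntax.  The function name `writeprint_features`, the feature names, and the naive sentence and word splitting are all assumptions made for illustration; real rules would be tailored to the data being analyzed.

```python
import re
from statistics import mean


def writeprint_features(text: str) -> dict:
    """Compute a few simple email-style features (hypothetical rule set)."""
    # Paragraphs: blocks of text separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: naive split on ., ! or ? followed by whitespace.
    sentences = [
        s
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p.strip())
        if s
    ]
    # Words: runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "n_paragraphs": len(paragraphs),
        "n_sentences": len(sentences),
        "sentences_per_paragraph": len(sentences) / len(paragraphs) if paragraphs else 0.0,
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "mean_word_length": mean(len(w) for w in words) if words else 0.0,
    }


if __name__ == "__main__":
    sample = "Hello there. How are you?\n\nI am fine."
    print(writeprint_features(sample))
```

Each key in the returned dictionary becomes one variable per message, which is how a few dozen rules can grow into hundreds of variables.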


This is where IBM SPSS Modeler comes in handy.  If your training data has the author identified, or maybe just a flag for good guy versus bad guy, you can use a Support Vector Machine (SVM) model to determine how well the rules work at identifying authorship.  If you do not have any training data with known authors, you can use a Kohonen Self-Organizing Map to segment the data.  An easy way to think about a Kohonen Self-Organizing Map is as a traditional clustering model like K-Means, but built on a neural network whose output units are arranged on a grid, so that similar records land in nearby clusters and hidden patterns across the clusters become easier to uncover.
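For readers who want to see the mechanics of the unsupervised case, the following is a minimal sketch of a one-dimensional Kohonen map in plain Python.  It is not how SPSS Modeler implements its Kohonen node; the function names `train_som` and `best_unit`, the decay schedules, and the fixed presentation order (chosen so the sketch is deterministic, where real implementations usually shuffle) are all assumptions.

```python
import math


def best_unit(units, row):
    """Index of the unit whose weight vector is closest to the row."""
    return min(range(len(units)), key=lambda i: math.dist(units[i], row))


def train_som(rows, n_units=2, epochs=50):
    """Train a tiny 1-D self-organizing map on numeric feature rows."""
    # Assumption: initialise units from evenly spaced training rows.
    step = max(n_units - 1, 1)
    units = [list(rows[i * (len(rows) - 1) // step]) for i in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                        # learning rate decays
        radius = max((n_units / 2) * (1.0 - frac), 0.5)  # neighbourhood shrinks
        for row in rows:  # fixed order keeps the sketch deterministic
            bmu = best_unit(units, row)
            for i, unit in enumerate(units):
                # Units near the winner on the grid are pulled along with it;
                # this neighbourhood pull is what K-Means does not have.
                influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(len(row)):
                    unit[d] += rate * influence * (row[d] - unit[d])
    return units


if __name__ == "__main__":
    # Hypothetical, already-scaled feature rows for six messages.
    rows = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
            [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
    units = train_som(rows)
    print([best_unit(units, r) for r in rows])
```

Each message is assigned to the unit returned by `best_unit`, which plays the role of its segment.  In practice the features would be standardized first so that no single rule dominates the distance calculation.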

Based upon the results of these models, you then revisit your rules and modify or enhance them with measures such as word order similarity, intensity, or other types of measures.  There are multiple published papers on writeprint analysis and forensic linguistics, but there is no one technique or set of rules that works across all data.  So it is up to you and the limits of your imagination to use the tools to develop solutions to the problem, which is what makes it fun and challenging.

Below is a real-world graphical representation of a writeprint that is unique to an individual involved in online criminal activity.

[Figure: writeprint fingerprint]

 
