In Part 2 of this series, we looked in detail at the approach we used to predict spammers using neural networks and text mining.
In this post, we look at some of the complexities involved in that approach and wrap up with some alternative approaches and the latest trends.
COMPLEXITIES & WORKAROUNDS
- Size of the dataset – At roughly 3 GB (3 million records), the dataset is too large to process comfortably on a typical machine with 4 GB of RAM, whichever approach is used. One workaround is random sampling. This is a standard data mining practice: a sufficiently large random sample is unbiased, so the prediction should not be materially affected.
- Neural Network learning
Neural networks take a considerable amount of time to train. While it's important to partition the dataset so that a higher percentage is allocated to training than to validation, we should not overdo it: there needs to be enough data left to validate the model properly. We also need to choose wisely when it comes to factors such as the number of hidden layers and hidden nodes, since larger networks take longer to train.
- Insufficient JVM heap – When using Java-based open source tools such as RapidMiner and Weka, one can run out of JVM heap space. To get around this, raise the maximum heap size with the -Xmx parameter (for example, -Xmx3g), keeping the value within the physical memory available on the machine.
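
The sampling and partitioning workarounds above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this series: the function names, the 70/30 split, and the generated `user_…` records are assumptions made for the example. Reservoir sampling is used here as one way to draw a uniform random sample without loading the whole dataset into memory.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Draw k records uniformly at random from an iterable of unknown length.

    Only k records are held in memory at any time, so the input can be
    far larger than the available RAM.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def train_validation_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for streaming a large file record by record.
records = (f"user_{i}" for i in range(100_000))

sample = reservoir_sample(records, k=1_000)
train, validation = train_validation_split(sample, train_fraction=0.7)
print(len(sample), len(train), len(validation))
```

The larger share goes to training, but the validation set still keeps a meaningful number of records, which is the balance described in the neural network item above.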
ALTERNATIVE APPROACHES
- Naive Bayes Classifier
In this approach, the number of posts and the posted tags for each user are extracted from the training data, and the tags are then sorted by mutual information. The tags with high mutual information values are chosen for the classification task. The advantage of the Naive Bayes classifier is its simplicity and efficiency. More details of this approach can be found here.
- Combining Clustering with Classification
In this method, clustering is used as a complementary step to text classification. Using clustering in this way can significantly improve the performance of the application. A detailed paper about this approach can be found here.
- Using MapReduce / Hadoop
Though this is not a data mining approach to detecting spammers as such, it is important to emphasize the power of frameworks such as MapReduce and Hadoop. Using these frameworks in place of traditional databases for the data processing can increase performance significantly.
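
To make the Naive Bayes alternative above more concrete, here is a rough Python sketch of tag selection by mutual information followed by a Bernoulli Naive Bayes classifier. It is an illustration under stated assumptions, not the implementation from the referenced work: the toy users, tags, and labels are invented, only tag presence is used as a feature (the number of posts is left out for brevity), and Laplace smoothing is an added choice.

```python
import math
from collections import Counter


def mutual_information(users, labels, tag):
    """Mutual information (in nats) between 'user posted this tag' and the class."""
    n = len(users)
    joint = Counter((tag in tags, label) for tags, label in zip(users, labels))
    mi = 0.0
    for (present, label), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (p, _), c in joint.items() if p == present) / n
        p_y = sum(c for (_, l), c in joint.items() if l == label) / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


def select_tags(users, labels, k):
    """Return the k tags with the highest mutual information (ties broken by name)."""
    vocabulary = set().union(*users)
    return sorted(
        vocabulary,
        key=lambda t: (-round(mutual_information(users, labels, t), 12), t),
    )[:k]


def train_naive_bayes(users, labels, tags):
    """Estimate class priors and Laplace-smoothed per-tag probabilities."""
    model = {}
    for c in sorted(set(labels)):
        members = [u for u, l in zip(users, labels) if l == c]
        prior = len(members) / len(users)
        likelihood = {
            t: (sum(t in u for u in members) + 1) / (len(members) + 2) for t in tags
        }
        model[c] = (prior, likelihood)
    return model


def predict(model, user_tags):
    """Return the class with the highest log-posterior for a user's tag set."""
    def log_score(c):
        prior, likelihood = model[c]
        score = math.log(prior)
        for t, p in likelihood.items():
            score += math.log(p if t in user_tags else 1 - p)
        return score
    return max(model, key=log_score)


# Invented toy data: each user is represented by the set of tags they posted.
users = [
    {"viagra", "casino", "cheap"},
    {"casino", "loans", "cheap"},
    {"viagra", "loans", "news"},
    {"python", "news", "linux"},
    {"python", "music", "linux"},
    {"music", "news", "photos"},
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

selected = select_tags(users, labels, k=4)
model = train_naive_bayes(users, labels, selected)
print(selected)
print(predict(model, {"casino", "cheap", "news"}))
```

Restricting the classifier to the highest-scoring tags keeps the model small, which is part of what makes this approach simple and efficient.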