A Data Mining Approach to Spam Detection in Social Bookmarking Sites – Part 3

In Part 2 of this series, we looked in detail at the approach we used to predict spammers with neural networks and text mining.

In this post, we'll look at some of the complexities involved in this approach, and then wrap up with some alternative approaches and recent trends.

 

COMPLEXITIES & WORKAROUNDS

  • Size of the dataset – The dataset is approximately 3 GB (3 million records), so applying any approach to it on a typical machine with 4 GB of RAM is a huge challenge. One workaround is random sampling. This is a standard data mining practice: a properly drawn random sample is unbiased, so a model built on the sample should behave much like one built on the full dataset.
  • Neural network learning – Neural networks take a considerable amount of time to learn a model. While it is important to partition the dataset so that a higher percentage is allocated to training than to validation, we should not overdo it: there needs to be enough data left to perform a meaningful validation. We also need to choose wisely when it comes to factors such as the number of hidden layers and hidden nodes, since a larger network takes longer to train.

  • Insufficient JVM heap – Java-based open source tools such as RapidMiner and Weka can run out of memory with the default JVM heap size. To get around this, raise the maximum heap size with the -Xmx option (for example, -Xmx2g), keeping it within the memory actually available on the machine.
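
To make the sampling and partitioning steps concrete, here is a minimal Python sketch. It is not the code we used in the project. It assumes the records can be read one at a time (for example, line by line from a file), uses reservoir sampling so that the full dataset never has to fit in memory, and then splits the sample into training and validation sets. The function names, the sample size and the 70/30 split are illustrative assumptions.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Pick k records uniformly at random from an iterable of unknown length.

    Only k records are held in memory at any time, so the source can be
    much larger than RAM.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep record i with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = record
    return sample


def split_train_validation(rows, train_fraction=0.7, seed=None):
    """Shuffle rows and split them into (training, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Stand-in for streaming lines from the real dataset file.
    stream = (f"record-{n}" for n in range(100_000))
    sample = reservoir_sample(stream, k=1_000, seed=42)
    train, validation = split_train_validation(sample, 0.7, seed=42)
    print(len(train), len(validation))
```

Passing a fixed seed makes the sample reproducible, which helps when comparing different network configurations on the same data.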

 

ALTERNATIVE APPROACHES

  • Naive Bayes Classifier

In this approach, the number of posts and the posted tags for each user are extracted from the training data, and the tags are then ranked by mutual information. The tags with the highest mutual information values are chosen as features for the classification task. The advantage of the Naive Bayes classifier is its simplicity and efficiency. More details of this approach can be found here.
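
As a rough illustration of the idea, the sketch below ranks tags by their mutual information with the spam label, keeps the top few, and trains a Bernoulli Naive Bayes classifier on tag presence. It is a simplified sketch and not the method from the referenced work: it ignores the number of posts, and the toy data, the labels "spam" and "ham", and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def mutual_information(users, tag):
    """Mutual information (in nats) between 'user used this tag' and the label."""
    n = len(users)
    joint = Counter((tag in tags, label) for tags, label in users)
    mi = 0.0
    for (present, label), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (p, _), c in joint.items() if p == present) / n
        p_y = sum(c for (_, l), c in joint.items() if l == label) / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


def select_tags(users, k):
    """Return the k tags with the highest mutual information."""
    vocabulary = {t for tags, _ in users for t in tags}
    # Rounding plus the alphabetical tie-break keeps the ranking deterministic.
    ranked = sorted(
        vocabulary,
        key=lambda t: (-round(mutual_information(users, t), 12), t),
    )
    return ranked[:k]


def train(users, features):
    """Estimate class priors and Laplace-smoothed P(tag present | label)."""
    label_counts = Counter(label for _, label in users)
    priors = {label: c / len(users) for label, c in label_counts.items()}
    cond = {}
    for label, n_label in label_counts.items():
        for f in features:
            hits = sum(1 for tags, l in users if l == label and f in tags)
            cond[(f, label)] = (hits + 1) / (n_label + 2)
    return priors, cond, list(features)


def predict(model, tags):
    """Return the label with the highest log-posterior for a set of tags."""
    priors, cond, features = model
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for f in features:
            p = cond[(f, label)]
            score += math.log(p if f in tags else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy training data: (set of tags a user posted, label).
    users = [
        ({"cheap", "pills", "deal"}, "spam"),
        ({"cheap", "casino"}, "spam"),
        ({"pills", "casino", "deal"}, "spam"),
        ({"python", "web"}, "ham"),
        ({"python", "data"}, "ham"),
        ({"web", "design", "deal"}, "ham"),
    ]
    features = select_tags(users, k=3)
    model = train(users, features)
    print(features, predict(model, {"cheap", "pills"}))
```

Keeping only the high-scoring tags shrinks the model and drops tags such as "deal" in the toy data, which appear for both kinds of user and say little about the label.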

  • Combining Clustering with Classification

In this method, clustering is used as a complementary step to text classification. Using clustering can significantly improve the performance of the application. A detailed paper about this approach can be found here.
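
There are several ways to combine the two steps, and the sketch below shows just one of them; it does not reproduce the method from the paper. It clusters all users (labelled and unlabelled) with a small k-means, labels each cluster by majority vote of its labelled members, and then classifies a new user by the cluster it falls into. The two numeric features (posts per day, distinct tags per post) and every name in the code are hypothetical.

```python
import math
from collections import Counter, defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points are the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each non-empty cluster's centroid to the mean of its members.
        for i, members in groups.items():
            centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def label_clusters(centroids, labelled):
    """Give each cluster the majority label of its labelled members."""
    votes = defaultdict(Counter)
    for point, label in labelled:
        votes[nearest(point, centroids)][label] += 1
    return {i: counts.most_common(1)[0][0] for i, counts in votes.items()}


def classify(point, centroids, cluster_labels, default="unknown"):
    """Predict the label of the cluster the point falls into."""
    return cluster_labels.get(nearest(point, centroids), default)


if __name__ == "__main__":
    # Hypothetical features: (posts per day, distinct tags per post).
    labelled = [((50, 1), "spam"), ((60, 2), "spam"), ((2, 5), "ham"), ((3, 4), "ham")]
    unlabelled = [(55, 1), (58, 2), (1, 6), (4, 5)]
    points = [p for p, _ in labelled] + unlabelled
    centroids = kmeans(points, k=2)
    cluster_labels = label_clusters(centroids, labelled)
    print(classify((52, 1), centroids, cluster_labels))
```

The appeal of this arrangement is that the unlabelled users also shape the clusters, so a small labelled set can go further than it would on its own.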

  • Using MapReduce / Hadoop

Though this is not a data mining approach to detecting spammers as such, it is worth emphasizing the power of frameworks such as MapReduce and Hadoop. Using these frameworks in place of traditional databases for the data processing can improve performance significantly.
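
To show the shape of such a job, here is a single-process Python sketch of the MapReduce model that counts posts per user, a typical pre-processing step for this kind of data. It only simulates the map, shuffle and reduce phases in memory and does not use Hadoop itself; the tab-separated record layout (user id, URL, tag) and the function names are assumptions for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (user_id, 1) for one tab-separated record: user_id, url, tag."""
    user_id, _url, _tag = line.split("\t")
    yield user_id, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts emitted for one user."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce over an iterable of input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    records = [
        "u1\thttp://a.example\tcheap",
        "u1\thttp://b.example\tpills",
        "u2\thttp://c.example\tpython",
    ]
    print(run_job(records))
```

On a real cluster the map and reduce functions would run in parallel on many machines, each over its own slice of the data, which is where the performance gain comes from.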


Deepak Ramanathan
