With the growing popularity of social bookmarking sites, spammers increasingly use these services as a playground for their activities. As we all know, spam is one of the main disadvantages of social bookmarking systems. Spammers use these systems to pursue two goals:
- Place links on the sites to attract people to advertising sites
- Increase the PageRank of their own sites by placing links on as many popular websites as possible, making those sites more visible in search engines
The usual countermeasures, such as CAPTCHAs (challenge-response tests intended to ensure that a response is not generated by a computer), are not effective enough to prevent misuse of the system. In this three-part series, we will look at a novel method that uses neural networks and text mining to learn a model that predicts whether a user is a spammer.
(Ref: ECML PKDD Discovery Challenge 2008)
What is social bookmarking?
Social bookmarking is a method for users to store, organize, search, and manage bookmarks of web pages with the help of tags. On a social bookmarking site, users save links to web pages that they want to remember or share; these bookmarks can be public, private, or shared only with specified people. Some popular social bookmarking sites include:
- del.icio.us (Delicious)
- Bibsonomy
- Digg
Dataset
The training data (provided on the ECML site) is heavily skewed: it contains 25,000 spammers and only 2,000 non-spammers.
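To make the skew concrete, here is a minimal Python sketch that computes the imbalance ratio, the accuracy of a trivial "always spammer" baseline, and inverse-frequency class weights. The class counts are taken as stated above; the label names and the `inverse_frequency_weights` helper are hypothetical, and the weighting heuristic is only one common way to handle imbalance, not necessarily the approach used later in this series.

```python
from collections import Counter

# Class counts as stated in the text; the label names are illustrative.
class_counts = Counter({"spammer": 25000, "non_spammer": 2000})


def inverse_frequency_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count).

    Rarer classes receive larger weights, so errors on them count more.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# How many spammers there are for every non-spammer.
imbalance_ratio = class_counts["spammer"] / class_counts["non_spammer"]

# Accuracy of a model that labels every user as a spammer.
majority_baseline = max(class_counts.values()) / sum(class_counts.values())

weights = inverse_frequency_weights(class_counts)

print(f"spammers per non-spammer: {imbalance_ratio:.1f}")
print(f"accuracy of always predicting 'spammer': {majority_baseline:.3f}")
print(f"class weights: {weights}")
```

The baseline figure shows why plain accuracy is a misleading metric here: a model that never identifies a single legitimate user would still score above 90 percent.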
In Part 2 of this series, we will look at the approach, specifically the text mining and the modeling (neural networks) used to make the predictions.