Primary goal: to develop a classification algorithm to detect Web Spam.
Web Spam refers to a set of techniques intended to inflate the ranking of a page in a search engine. From the point of view of both search engine providers and Web users, Web Spam degrades the quality of information search on the Web [1] [2] [3]. Web Spam can be broadly classified into two types: content spam and link spam. Detecting Web Spam is a critical and challenging task, and successful detection has high commercial value for industry.
The goal of Web Spam detection is to decide whether a given page or website is spam or not. This is a typical binary classification problem in the field of machine learning.
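To make the classification framing concrete, the following is a minimal sketch of a binary spam classifier, not the algorithm this project will deliver. It assumes each page has already been reduced to two numeric features; the feature names (`keyword_density`, `same_host_link_ratio`), the toy values, and the choice of logistic regression are all illustrative assumptions rather than anything taken from the dataset.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=1.0, epochs=3000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y  # derivative of log loss w.r.t. score
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_spam(weights, bias, x, threshold=0.5):
    """Return True if the estimated spam probability reaches the threshold."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score) >= threshold


# Hypothetical toy data: [keyword_density, same_host_link_ratio] per page.
# Label 1 means spam, 0 means non-spam. These values are made up.
toy_features = [
    [0.05, 0.10], [0.10, 0.20], [0.08, 0.15],  # non-spam
    [0.60, 0.80], [0.70, 0.90], [0.55, 0.75],  # spam
]
toy_labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(toy_features, toy_labels)
print(predict_spam(weights, bias, [0.65, 0.85]))
print(predict_spam(weights, bias, [0.07, 0.12]))
```

Any other binary classifier could take the place of logistic regression here; the point is only that the input is a feature vector per page and the output is a spam / non-spam decision.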
This project will focus on developing a classification algorithm to detect Web Spam. It is expected to target one or more Web Spam types: content spam, link spam, or both. The outcome of the project is a classification algorithm together with a prototype. The WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets [4] will be used for training and testing.
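As a rough illustration of the testing step, the sketch below scores a classifier's predictions on a held-out test split. The choice of precision, recall, and F1 is an assumption made for illustration (accuracy alone can be misleading when one class is much rarer than the other), and the label lists are made-up placeholders, not results from the datasets.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall, and F1 for the spam class (label 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or absent.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical test-split labels (1 = spam, 0 = non-spam); values are made up.
test_true = [1, 1, 1, 0, 0, 0, 0, 1]
test_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(test_true, test_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```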