Research Area:  Machine Learning
The rapid growth of unsolicited and unwanted messages has inspired the development of many anti-spam methods. Machine-learning methods such as Naive Bayes, support vector machines or neural networks have been particularly effective in categorizing spam/non-spam messages. In order to further enhance the performance of review spam detection, I propose a novel contentbased approach that considers both bag-of-words and word context. More precisely, the proposed approach utilizes n-grams and the Skip-Gram word embedding method to build a vector model. As a result, high-dimensional eature representation is generated. To handle the representation and classify the spam accurately, ensemble learning techniques with regularized deep feed-forward neural networks as base learners are used in order to overcome slow optimization convergence to a poor local minimum and overfitting ssues. In order to verify the proposed approach, I use seven different types of datasets from different spam filtering domains. I show that the proposed spam filtering model outperforms existing methods in terms of classification accuracy, false negative and false positive rates, F-score, area under ROC and misclassification cost. The only drawback of the proposed algorithm is its higher computation complexity.
Name of the Researcher:  Barushka, Aliaksandr
Name of the Supervisor(s):  Pokorný, Miroslav
Year of Completion:  2020
University:  Univerzita Pardubice
Thesis Link:   Home Page Url