Research Area:  Machine Learning
Social media today is overwhelmed with unfiltered content ranging from cyberbullying and cyberstalking to hate speech, so identifying and cleaning up such toxic language is a major challenge and an active area of research. This study addresses multi-aspect hate speech detection by classifying text into multiple labels: ‘identity hate’, ‘threat’, ‘insult’, ‘obscene’, ‘toxic’ and ‘severe toxic’. The proposed approach combines the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model with Deep Learning (DL) models to compose several ensemble learning architectures. The DL models are built by stacking Bidirectional Long Short-Term Memory (Bi-LSTM) and/or Bidirectional Gated Recurrent Unit (Bi-GRU) layers on GloVe and FastText word embeddings. These models and BERT are trained individually on a multi-label hateful-speech dataset and then combined for hate speech detection on social media. We demonstrate that encoding texts with recent word embedding techniques such as FastText and GloVe alongside Bi-LSTM and Bi-GRU yields models that, when combined with BERT, raise the ROC-AUC score to 98.63%.
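To make the described pipeline concrete, below is a minimal TensorFlow/Keras sketch of one ensemble member (Bi-LSTM stacked on Bi-GRU over frozen pre-trained embeddings, with six sigmoid outputs) and one plausible combination rule. The identifiers VOCAB_SIZE, MAX_LEN, embedding_matrix and ensemble_predict, as well as all layer sizes, are illustrative assumptions rather than the authors' exact configuration.

# A minimal sketch of one ensemble member: Bi-LSTM stacked on Bi-GRU over
# frozen pre-trained word embeddings, with a sigmoid head for the six
# labels. All hyperparameters here (vocabulary size, sequence length,
# layer widths) are illustrative assumptions, not the paper's settings.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, MAX_LEN, EMB_DIM, NUM_LABELS = 50_000, 128, 300, 6

# Placeholder for a matrix whose rows are GloVe or FastText vectors;
# in practice it is built from the pre-trained embedding files.
embedding_matrix = np.random.rand(VOCAB_SIZE, EMB_DIM).astype("float32")

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(
    VOCAB_SIZE, EMB_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)(inputs)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)  # Bi-LSTM
x = layers.Bidirectional(layers.GRU(64))(x)                          # Bi-GRU
x = layers.Dropout(0.3)(x)
# Independent sigmoids: one comment can carry several labels at once.
outputs = layers.Dense(NUM_LABELS, activation="sigmoid")(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(multi_label=True, name="roc_auc")])

# One plausible ensemble rule: average the per-label probabilities of the
# individually trained members (each recurrent model and a fine-tuned
# BERT classifier would expose a comparable predict function).
def ensemble_predict(members, token_ids):
    probs = [m.predict(token_ids, verbose=0) for m in members]
    return np.mean(probs, axis=0)

The sigmoid head with binary cross-entropy treats the six labels independently, which is the standard multi-label formulation; a fine-tuned BERT classifier's per-label probabilities can be averaged into the same ensemble.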
Keywords:  
Author(s) Name:  Ahmed Cherif Mazari, Nesrine Boudoukhani & Abdelhamid Djeffal
Journal name:  Cluster Computing
Conference name:  
Publisher name:  Springer
DOI:  10.1007/s10586-022-03956-x
Volume Information:  Volume 27, pages 325-339 (2024)
Paper Link:  https://link.springer.com/article/10.1007/s10586-022-03956-x