Research Area:  Machine Learning
In recent years, the number of social networks worldwide has increased steadily. Among them, Twitter has consolidated its position as one of the most influential social platforms, with Brazilian Portuguese speakers holding the fifth position in the number of users. Due to the informal linguistic style of tweets, discovering information in such an environment poses a challenge to Natural Language Processing (NLP) tasks such as sentiment analysis. In this work, we frame sentiment analysis as a binary (positive and negative) and multiclass (positive, negative, and neutral) classification task at the level of tweets written in Portuguese. Following a feature extraction approach, embeddings are first gathered for a tweet and then given as input to a classifier. This study was designed to evaluate the effectiveness of different word representations, from the original pre-trained language model to continued pre-training strategies, in improving the predictive performance of sentiment classification, using three different classifier algorithms and eight Portuguese tweet datasets. Because of the lack of a language model specific to Brazilian Portuguese tweets, we have expanded our evaluation to consider six different embeddings: fastText, GloVe, Word2Vec, BERT-multilingual (mBERT), BERTweet, and BERTimbau. The experiments showed that BERTimbau, whose embeddings are trained from scratch solely on the target Portuguese language, outperforms the static representations (fastText, GloVe, and Word2Vec) and the Transformer-based models mBERT and BERTweet. In addition, we show that extracting the contextualized embedding without any adjustment to the pre-trained language model is the best approach for most datasets.
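The feature extraction approach described in the abstract (embed a tweet, then train a classifier on the embedding) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-dimensional vectors in `TOY_VECTORS` are invented values standing in for pre-trained embeddings, the helper names `embed_tweet`, `fit_centroids`, and `predict` are hypothetical, and the nearest-centroid classifier is a simple stand-in, not one of the three classifier algorithms evaluated in the paper.

```python
from statistics import fmean

# Toy 2-dimensional word vectors (invented values, for illustration only;
# a real setup would load pre-trained vectors such as fastText or Word2Vec).
TOY_VECTORS = {
    "bom": [1.0, 0.0],
    "ótimo": [0.9, 0.1],
    "ruim": [0.0, 1.0],
    "péssimo": [0.1, 0.9],
}
DIM = 2


def embed_tweet(text):
    """Mean-pool the vectors of known tokens; zero vector if none are known."""
    vectors = [TOY_VECTORS[tok] for tok in text.lower().split() if tok in TOY_VECTORS]
    if not vectors:
        return [0.0] * DIM
    return [fmean(column) for column in zip(*vectors)]


def fit_centroids(tweets, labels):
    """Compute one centroid per class from the tweet embeddings."""
    grouped = {}
    for text, label in zip(tweets, labels):
        grouped.setdefault(label, []).append(embed_tweet(text))
    return {
        label: [fmean(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, text):
    """Return the class whose centroid is nearest (squared Euclidean distance)."""
    vec = embed_tweet(text)

    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda label: distance(centroids[label]))


# Hypothetical training tweets with binary sentiment labels.
centroids = fit_centroids(
    ["filme bom", "dia ótimo", "serviço ruim", "jogo péssimo"],
    ["positive", "positive", "negative", "negative"],
)
```

Calling `predict(centroids, "muito bom")` embeds the tweet and picks the closest class centroid. In this setting the embedding model itself is never updated; only the downstream classifier is fit, which is what distinguishes feature extraction from fine-tuning.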
Keywords:  
Author(s) Name:  Daniela Vianna, Fernando Carneiro, Jonnathan Carvalho, Alexandre Plastino & Aline Paes
Journal name:  Language Resources and Evaluation
Conference name:  
Publisher name:  Springer
DOI:  10.1007/s10579-023-09661-4
Volume Information:  Volume 58, pages 223-272, (2024)
Paper Link:   https://link.springer.com/article/10.1007/s10579-023-09661-4