Deep Learning Methods for Short, Informal, and Multilingual Text Analytics

Research Area:  Machine Learning


The popularity of social media platforms and knowledge-sharing websites has greatly increased the amount of user-generated textual content. Such content is usually short and often written informally (e.g., improper grammar, self-created abbreviations, and varying spellings). It is also influenced by local languages and often mixes multiple languages mid-utterance, a phenomenon known as code-switching. Traditional text analytics and natural language processing (NLP) approaches perform poorly on short, informal, and multilingual text compared with well-written longer documents because of the limited context and language resources available for learning. In recent years, deep learning has improved results on many NLP tasks. However, these approaches have some major shortcomings: (1) they are tailored for specific problem settings (e.g., short text or informal languages) and do not generalize well to other settings, (2) they do not exploit multiple perspectives and resources for effective learning, and (3) they are hampered by smaller training datasets. In this research, we present methods and models for effective classification of user-generated text, with a specific application to English and Roman Urdu short and informal text. We present a novel multi-cascaded deep learning model (McM) for robust classification of noisy and clean short text. McM incorporates three independent CNN and LSTM (with and without soft attention) cascades for feature learning. Each cascade is responsible for capturing a specific aspect of natural language. The CNN-based cascade extracts n-gram information.
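
As an illustration of how a convolutional cascade can extract n-gram information, the sketch below slides filters over every window of n consecutive token embeddings and keeps the strongest response per filter (max-over-time pooling). It is not the thesis implementation: the function names, the toy Roman Urdu tokens, the embedding size, and the random weights are all assumptions made for this example, and a trained model would learn both the embeddings and the filters.

```python
import random


def conv1d_ngram(embeddings, filters, n):
    """Apply each filter to every window of n consecutive token vectors.

    embeddings: list of token vectors (each a list of floats, same length).
    filters: list of flat weight lists, each of length n * embedding_dim.
    Returns one feature map (list of ReLU activations) per filter.
    """
    if len(embeddings) < n:
        raise ValueError("text is shorter than the n-gram window")
    feature_maps = []
    for weights in filters:
        fmap = []
        for start in range(len(embeddings) - n + 1):
            # Concatenate the n token vectors in this window into one flat vector.
            window = [x for vec in embeddings[start:start + n] for x in vec]
            activation = sum(w * x for w, x in zip(weights, window))
            fmap.append(max(0.0, activation))  # ReLU non-linearity
        feature_maps.append(fmap)
    return feature_maps


def max_over_time(feature_maps):
    """Keep the strongest n-gram response of each filter."""
    return [max(fmap) for fmap in feature_maps]


# Hypothetical toy input: a short Roman Urdu / English code-switched text.
tokens = ["yeh", "movie", "bohat", "achi", "thi"]
embedding_dim, num_filters, n = 4, 3, 2  # assumed toy sizes; n=2 means bigrams

rng = random.Random(0)
# Random stand-ins for learned embeddings and learned filter weights.
embeddings = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in tokens]
filters = [[rng.uniform(-1, 1) for _ in range(n * embedding_dim)]
           for _ in range(num_filters)]

feature_maps = conv1d_ngram(embeddings, filters, n)
features = max_over_time(feature_maps)
print(len(features), "pooled n-gram features")
```

Each pooled value summarizes how strongly one filter responded to its best-matching bigram anywhere in the text, which is why this kind of layer copes with short inputs of varying length.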

Name of the Researcher:  Muhammad Haroon Shakeel

Name of the Supervisor(s):   Dr. Asim Karim

Year of Completion:  2020

University:  Lahore University of Management Sciences

Thesis Link:   Home Page Url