Research Area:  Machine Learning
Only humans can grasp the meaning that underlies natural written language; machines can form semantic relationships only after humans have supplied the parameters needed to model that meaning. To give computer models access to the underlying meaning of written language, accurate and sufficient document representation is crucial. Recent word embedding approaches have drawn much attention in text mining research. One of the main benefits of such approaches is the use of large global corpora to generate pre-trained word vectors. Although very effective, these approaches have a key disadvantage: sole reliance on pre-trained word vectors may neglect the local context and increase word ambiguity. In this thesis, four new document representation approaches are introduced to mitigate the risk of word ambiguity and inject local context into globally pre-trained word vectors. The proposed approaches are document representation frameworks that use word embedding learning features for text classification: Content Tree Word Embedding; Composed Maximum Spanning Content Tree; Embedding-based Word Clustering; and Autoencoder-based Word Embedding. The results show an improved F-score on a document classification task over the IMDB Movie Reviews, Hate Speech Identification, 20 Newsgroups, Reuters-21578, and AG News benchmark datasets, compared with three deep-learning-based word embedding approaches (GloVe, Word2Vec, and fastText) and two other document representations (LSA and random word embeddings).
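For readers unfamiliar with the baseline the abstract critiques, the following is a minimal sketch of the standard pipeline that relies solely on pre-trained word vectors: each document is represented by averaging the vectors of its words, and a classifier is evaluated with the F-score. It does not reproduce the thesis's proposed content-tree or clustering methods; the tiny embedding table, documents, and labels below are hypothetical stand-ins (real experiments would load GloVe, Word2Vec, or fastText vectors and the benchmark datasets named above).

```python
# Sketch of the baseline: document vectors as averages of pre-trained
# word vectors, then classification scored with F1. All data here is toy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical "pre-trained" word vectors standing in for GloVe/Word2Vec/fastText.
rng = np.random.default_rng(0)
vocab = ["good", "great", "fun", "bad", "boring", "awful", "movie", "plot"]
embeddings = {w: rng.normal(size=50) for w in vocab}

def doc_vector(text: str) -> np.ndarray:
    """Average the vectors of in-vocabulary words; zero vector if none match."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

docs = ["good fun movie", "great plot", "boring bad movie", "awful plot"]
labels = [1, 1, 0, 0]  # toy sentiment labels

X = np.stack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(f1_score(labels, clf.predict(X)))  # F-score on the toy training set
```

Because averaging discards word order and document-specific context, one vector per word type is shared across all documents, which is exactly the local-context and ambiguity limitation the thesis's four approaches aim to address.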
Name of the Researcher:  Kamkarhaghighi, Mehran
Name of the Supervisor(s):  Makrehchi, Masoud
Year of Completion:  2019
University:  University of Ontario Institute of Technology
Thesis Link:   Home Page Url