Research Area:  Data Mining
The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.
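The multi-view idea in the abstract can be illustrated with a minimal sketch. It assumes each document has already been converted into one numeric feature vector per representation (in the paper's setting these would be TF–IDF, LDA topic distributions, and Doc2Vec embeddings). The nearest-centroid classifier, the margin-based confidence score, and the names `centroid_fit`, `centroid_predict`, and `multi_view_co_training` are illustrative assumptions, not the authors' MCT implementation.

```python
import math


def centroid_fit(X, y):
    """Return {label: mean feature vector} for the labeled rows."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def centroid_predict(model, x):
    """Return (label, confidence); confidence is the distance margin
    between the nearest and second-nearest class centroid."""
    dists = sorted(
        ((math.dist(x, c), lab) for lab, c in model.items()),
        key=lambda t: t[0],
    )
    best_dist, best_label = dists[0]
    margin = dists[1][0] - best_dist if len(dists) > 1 else 0.0
    return best_label, margin


def multi_view_co_training(views, labels, rounds=5, per_round=1):
    """Pseudo-label unlabeled documents using several feature views.

    views:  list of feature matrices, one per representation, same row order.
    labels: list holding a class label, or None for an unlabeled document.
    Assumes at least one labeled document per class (hypothetical setup).
    """
    labels = list(labels)
    for _ in range(rounds):
        labeled = [i for i, lab in enumerate(labels) if lab is not None]
        unlabeled = [i for i, lab in enumerate(labels) if lab is None]
        if not unlabeled:
            break
        proposals = {}  # document index -> (confidence, proposed label)
        for X in views:
            # Train one classifier per view on the current labeled pool.
            model = centroid_fit([X[i] for i in labeled],
                                 [labels[i] for i in labeled])
            scored = []
            for i in unlabeled:
                lab, conf = centroid_predict(model, X[i])
                scored.append((conf, lab, i))
            scored.sort(key=lambda t: t[0], reverse=True)
            # Each view nominates its most confident predictions.
            for conf, lab, i in scored[:per_round]:
                if i not in proposals or conf > proposals[i][0]:
                    proposals[i] = (conf, lab)
        # Nominated documents join the labeled pool shared by all views.
        for i, (_, lab) in proposals.items():
            labels[i] = lab
    return labels
```

Calling `multi_view_co_training([view_a, view_b, view_c], labels)` with `None` marking unlabeled documents returns a label list in which confident predictions from any view have been added as pseudo-labels. A real pipeline would replace the centroid model with a stronger classifier and build the views with a text-processing library.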
Keywords:  
Author(s) Name:  Donghwa Kim, Deokseong Seo, Pilsung Kang and Suhyoun Cho
Journal name:  Information Sciences
Conference name:  
Publisher name:  Elsevier
DOI:  10.1016/j.ins.2018.10.006
Volume Information:  Volume 477, March 2019, Pages 15-29
Paper Link:  https://www.sciencedirect.com/science/article/abs/pii/S0020025518308028