Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec - 2018

Research Area:  Data Mining

Abstract:

The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.
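The multi-view idea described in the abstract can be illustrated with a rough, standard-library-only sketch. It is not the authors' implementation: the nearest-centroid classifier, the shared pseudo-label pool, the helper names (`tfidf`, `cosine`, `centroid`, `predict`, `multi_co_train`), the toy corpus, and the hand-written topic proportions standing in for LDA output are all assumptions made for illustration. Only two views are used here; a Doc2Vec view would be passed as a third entry in `views`.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse {key: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Average a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def predict(view, labels, i):
    """Nearest-centroid guess for example i: (label, confidence margin)."""
    scores = {}
    for c in sorted(set(labels.values())):
        members = [view[j] for j, y in labels.items() if y == c]
        scores[c] = cosine(view[i], centroid(members))
    ranked = sorted(scores, key=scores.get, reverse=True)
    runner_up = scores[ranked[1]] if len(ranked) > 1 else 0.0
    return ranked[0], scores[ranked[0]] - runner_up


def multi_co_train(views, seed_labels, rounds=5):
    """Each view in turn pseudo-labels its most confident unlabeled example
    and adds it to a label pool shared by all views (an assumed variant)."""
    labels = dict(seed_labels)
    n = len(views[0])
    for _ in range(rounds):
        for view in views:
            unlabeled = [i for i in range(n) if i not in labels]
            if not unlabeled:
                return labels
            guesses = {i: predict(view, labels, i) for i in unlabeled}
            best = max(guesses, key=lambda i: guesses[i][1])
            labels[best] = guesses[best][0]
    return labels


# Toy corpus (hypothetical); only documents 0 and 1 carry true labels.
corpus = [
    "the team won the football match",
    "stocks fell as the market closed",
    "the football team lost the match",
    "the market rallied and stocks rose",
    "football fans cheered the team",
    "investors sold stocks in the market",
]
tfidf_view = tfidf([text.split() for text in corpus])

# Hand-written stand-in for per-document LDA topic proportions (assumption).
topic_view = [
    {"t0": 0.9, "t1": 0.1},
    {"t0": 0.1, "t1": 0.9},
    {"t0": 0.8, "t1": 0.2},
    {"t0": 0.2, "t1": 0.8},
    {"t0": 0.7, "t1": 0.3},
    {"t0": 0.3, "t1": 0.7},
]

result = multi_co_train([tfidf_view, topic_view], {0: "sports", 1: "finance"})
print(result)
```

In a realistic setting each view would come from a fitted model (for example a TF–IDF vectorizer, an LDA topic model, and a Doc2Vec model), and the per-view classifier and confidence rule would be chosen to match the experimental setup rather than the simple margin used here.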

Keywords:  

Author(s) Name:  Donghwa Kim, Deokseong Seo, Pilsung Kang and Suhyoun Cho

Journal name:  Information Sciences

Conference name:  

Publisher name:  Elsevier

DOI:  10.1016/j.ins.2018.10.006

Volume Information:  Volume 477, March 2019, Pages 15-29