A Mechanics-Based Similarity Measure for Text Classification in Machine Learning Paradigm - 2018

Research Area:  Machine Learning


Document classification and clustering are emerging as a new challenge in the Big Data era, where terabytes of data are generated every second through billions of mobile phones, desktops, servers, and other devices such as cameras and watches. The effectiveness of classification and clustering algorithms depends on the similarity measure used between two text documents in the corpus. We apply the Maxwell-Boltzmann distribution to find the similarity between two documents within a document corpus. In this paper, the document corpus is treated as a large system, individual documents as containers, attributes as subcontainers, and each term as a particle. The proposed similarity measure is named the Maxwell-Boltzmann Similarity Measure (MBSM). MBSM is derived from the overall distribution of feature values and the total number of nonzero features among the documents. We demonstrate that MBSM satisfies all properties of a document similarity measure. MBSM is incorporated in single-label K-nearest neighbors classification (SLKNN), multi-label K-nearest neighbors classification (MLKNN), and K-means clustering. We benchmark MBSM against other similarity measures such as Euclidean, Cosine, Jaccard, Pairwise, ITSim, and SMTP. The comparative performance shows that MBSM outperformed all existing similarity measures, improving the classification accuracy of SLKNN and MLKNN and the clustering accuracy and entropy of the K-means algorithm while making them more robust. The highest accuracy obtained from tenfold cross-validation is 0.9531 for SLKNN and 0.9373 for MLKNN. For K-means clustering, MBSM achieved a maximum accuracy of 0.6592 and a minimum entropy of 0.2426 among all similarity measures, on a unit scale.
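To illustrate how a similarity measure plugs into single-label K-nearest neighbors classification, the sketch below passes the measure to the classifier as a function argument. The abstract does not give the MBSM formula, so cosine similarity (one of the baselines named above) stands in for it. The names `tf`, `cosine_similarity`, `knn_predict`, and the toy corpus are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tf(text):
    """Turn raw text into a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors.

    Stand-in measure only: the MBSM formula is not stated in the abstract.
    """
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(query, docs, labels, k=3, similarity=cosine_similarity):
    """Single-label KNN: majority vote among the k most similar documents."""
    ranked = sorted(
        zip(docs, labels),
        key=lambda pair: similarity(query, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only.
train_docs = [
    tf("cat dog pet"),
    tf("dog pet animal"),
    tf("stock market trade"),
    tf("market trade price"),
]
train_labels = ["animals", "animals", "finance", "finance"]

predicted = knn_predict(tf("pet dog"), train_docs, train_labels, k=3)
print(predicted)
```

Because `similarity` is a parameter, a different measure can be swapped in without changing the classifier, which is the kind of substitution the benchmark described above relies on.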

Author(s) Name:  Venkatanareshbabu Kuppili; Mainak Biswas; Damodar Reddy Edla; K. J. Ravi Prasad and Jasjit S. Suri

Journal name:  IEEE Transactions on Emerging Topics in Computational Intelligence

Conference name:  

Publisher name:  IEEE

DOI:  10.1109/TETCI.2018.2863728

Volume Information:  Volume: 4, Issue: 2, April 2020, Page(s): 180-200