Research Area:  Machine Learning
In this paper, two new similarity measures, namely the distance of term frequency-based similarity measure (DTFSM) and the presence of common terms-based similarity measure (PCTSM), are proposed to compute the similarity between two documents and thereby improve the effectiveness of text document clustering. The proposed measures are evaluated on the Reuters-21578 and WebKB datasets, clustering the documents with the K-means and K-means++ algorithms. The results obtained with DTFSM and PCTSM are significantly better than those of other measures for document clustering in terms of accuracy, entropy, recall and F-measure. The proposed similarity measures not only improve the effectiveness of text document clustering but also reduce the complexity of the similarity computation, in terms of the number of operations required during clustering.
Keywords:  
similarity measure
document
WebKB dataset
accuracy
entropy
recall
F-measure
Author(s) Name:  R. Lakshmi and S. Baskar
Journal name:  International Journal of Business Intelligence and Data Mining
Conference name:  
Publisher name:   Inderscience
DOI:  10.1504/IJBIDM.2021.111741
Volume Information:  Vol. 18, No. 1, November 6, 2020, pp. 49-72
Paper Link:   https://www.inderscienceonline.com/doi/abs/10.1504/IJBIDM.2021.111741