Research Area:  Big Data
In the fast-growing big data era, the volume and variety of data processed by Internet applications are increasing drastically. Real-world search engines commonly use text classifiers with thousands of classes to improve relevance or data quality. Classification at this scale poses severe runtime performance challenges, so practitioners often resort to fast approximation techniques. The increase in classification speed comes at a cost, however: approximations are lossy, mis-assigning classes relative to the original reference classification algorithm. To address this problem, we introduce a Lossless Pruned Naive Bayes (LPNB) classification algorithm tailored to real-world big data applications with thousands of classes. LPNB achieves significant speed-ups by drawing on Information Retrieval (IR) techniques for efficient posting-list traversal and pruning. We show empirically that LPNB classifies text up to eleven times faster than standard Naive Bayes on a real-world dataset with 7205 classes, with larger gains extrapolated for larger taxonomies. This acceleration is significant in practice, as it greatly cuts the required computation time. Moreover, LPNB is lossless: its output is identical to that of standard Naive Bayes, in contrast to existing techniques such as hierarchical classification. The acceleration does not rely on the taxonomy structure and applies to both hierarchical and flat taxonomies.
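The paper's exact pruning rule is not reproduced in this abstract, so the following is only a minimal sketch of the general idea it describes: score classes by traversing per-term posting lists, and drop a class as soon as its best achievable score (current score plus an upper bound on the remaining terms' contributions) falls below the running leader. Because smoothed log-likelihood deltas over the "term absent" default are non-negative, this pruning never changes the argmax, so the prediction matches brute-force Naive Bayes. All names, the toy data, and the multinomial NB setup are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict

def train(docs, labels, alpha=1.0):
    # Multinomial NB with Laplace smoothing; likelihoods are stored as
    # posting lists: term -> {class: delta above the smoothed zero-count default}.
    classes = sorted(set(labels))
    vocab = sorted({t for d in docs for t in d})
    term_counts = {c: defaultdict(int) for c in classes}
    total = {c: 0 for c in classes}
    prior = defaultdict(int)
    for d, y in zip(docs, labels):
        prior[y] += 1
        for t in d:
            term_counts[y][t] += 1
            total[y] += 1
    V = len(vocab)
    log_prior = {c: math.log(prior[c] / len(docs)) for c in classes}
    default = {c: math.log(alpha / (total[c] + alpha * V)) for c in classes}
    postings = {}
    for t in vocab:
        plist = {}
        for c in classes:
            n = term_counts[c][t]
            if n:  # only classes that actually contain the term get an entry
                plist[c] = math.log((n + alpha) / (total[c] + alpha * V)) - default[c]
        postings[t] = plist
    return classes, log_prior, default, postings

def classify_brute(model, doc):
    # Reference: score every class fully, keep the best (ties -> first in sorted order).
    classes, log_prior, default, postings = model
    best, best_s = None, -math.inf
    for c in classes:
        s = log_prior[c] + sum(default[c] + postings.get(t, {}).get(c, 0.0) for t in doc)
        if s > best_s:
            best, best_s = c, s
    return best

def classify_pruned(model, doc):
    # Posting-list traversal with lossless pruning: scores only grow as terms
    # are processed, so any class whose score + remaining upper bound is below
    # the current leader can never win and is dropped.
    classes, log_prior, default, postings = model
    terms = [t for t in doc if t in postings]
    score = {c: log_prior[c] + len(doc) * default[c] for c in classes}
    ub = [max(postings[t].values(), default=0.0) for t in terms]
    rem = sum(ub)
    cand = set(classes)
    for i, t in enumerate(terms):
        for c, delta in postings[t].items():
            if c in cand:
                score[c] += delta
        rem -= ub[i]
        leader = max(score[c] for c in cand)
        cand = {c for c in cand if score[c] + rem >= leader}
    return max(sorted(cand), key=lambda c: score[c])

# Hypothetical toy corpus to check the pruned traversal against brute force.
docs = [["ball", "goal", "team"], ["ball", "match"],
        ["code", "python", "bug"], ["python", "code"]]
labels = ["sport", "sport", "tech", "tech"]
model = train(docs, labels)
pred_a = classify_pruned(model, ["ball", "team"])
pred_b = classify_pruned(model, ["python", "bug"])
```

In a real engine the candidate set would start at thousands of classes, and the upper bounds would be precomputed per posting list, so pruning removes most classes after a few terms while provably leaving the winner untouched.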
Keywords:  
Author(s) Name:  Nanfei Sun, Bingjun Sun, Jian (Denny) Lin and Michael Yu-Chi Wu
Journal name:  Big Data Research
Conference name:  
Publisher name:  Elsevier
DOI:  10.1016/j.bdr.2018.05.007
Volume Information:  Volume 14, December 2018, Pages 27-36
Paper Link:  https://www.sciencedirect.com/science/article/abs/pii/S2214579616301320