Topic modeling algorithms and applications: A survey

Research Area: Machine Learning

Abstract:

Topic modeling is used in information retrieval to infer the hidden themes in a collection of documents and thus provides an automatic means to organize, understand and summarize large collections of textual information. Topic models also offer an interpretable representation of documents used in several downstream Natural Language Processing (NLP) tasks. Modeling techniques vary from probabilistic graphical models to the more recent neural models. This paper surveys topic models from four aspects. The first aspect categorizes different topic modeling techniques into four categories: algebraic, fuzzy, probabilistic, and neural. We review the wide variety of available models from each category, highlight differences and similarities between models and model categories using a unified perspective, investigate these models’ characteristics and limitations, and discuss their proper use cases. The second aspect illustrates six criteria for proper evaluation of topic models, from modeling quality to interpretability, stability, efficiency, and beyond. Topic modeling has found applications in various disciplines, owing to its interpretability. We examine these applications along with some popular software tools which provide an implementation of some models. The fourth aspect reviews available datasets and benchmarks. Using two benchmark datasets, we conducted experiments to compare seven topic models along the proposed metrics. The discussion highlights the differences between the models and their relative suitability for various applications. It notes the relationship between evaluation metrics and proposes four key aspects to help decide which model to use for an application. Our discussion also shows that the research trends move towards developing and tuning neural topic models and leveraging the power of pre-trained language models. Finally, it highlights research gaps in developing unified benchmarks and evaluation metrics.

Keywords:
Topic modeling
Neural
Probabilistic
Evaluation
LDA
Natural Language Processing
Machine learning

Author(s) Name: Aly Abdelrazek, Yomna Eid, Eman Gawish, Walaa Medhat, Ahmed Hassan

Journal name: Information Systems

Conferrence name:

Publisher name: Elsevier

DOI: 10.1016/j.is.2022.102131

Volume Information: Volume 112, February 2023, 102131

Paper Link: https://www.sciencedirect.com/science/article/abs/pii/S0306437922001090

Office Address

Social List