Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications - 2019

Research Paper on Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications

Research Area: Machine Learning

Abstract:

Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question-answering systems. Recognizing the importance of NER, a plethora of NER techniques for Western and Asian languages have been developed. However, despite there being over 490 million Urdu language speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this article makes four key contributions. First, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated named entities (NEs). The corpus contains at least twice as many manually tagged NEs as any existing Urdu NER corpus. Second, we have generated six new word embeddings using three different techniques, fastText, Word2Vec, and GloVe, on two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language, besides the recently released Urdu word embeddings by Facebook. Third, we have pioneered the application of deep learning techniques, namely neural networks (NN) and recurrent neural networks (RNN), to Urdu named entity recognition. Finally, we have performed 32 different experiments, each over 10 folds, using combinations of a traditional supervised learning technique and deep learning techniques, seven types of word embeddings, and two different Urdu NER datasets. Based on the analysis of the results, several valuable insights are provided about the effectiveness of deep learning techniques, the impact of word embeddings, and variations of datasets.
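
The 10 folds mentioned in the abstract are not described further on this page. As a minimal sketch, assuming they refer to standard 10-fold cross-validation, the splitting step could look like the following. The function name `k_fold_indices` and the toy size of 25 items are hypothetical and are not taken from the paper.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k contiguous, near-equal folds.

    Assumption: plain k-fold cross-validation; the paper's actual
    protocol (shuffling, stratification) is not stated on this page.
    """
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# Toy usage: 25 hypothetical sentences split into 10 folds.
folds = list(k_fold_indices(25, k=10))
```

Each item appears in exactly one test fold, so every sentence is used for evaluation once and for training in the remaining nine runs.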

Keywords:
Named Entity Recognition
Corpus Generation
Deep Learning Applications
Word Embeddings
Machine Learning

Author(s) Name: Safia Kanwal, Kamran Malik, Khurram Shahzad, Faisal Aslam, Zubair Nawaz

Journal name: ACM Transactions on Asian and Low-Resource Language Information Processing

Publisher name: ACM

DOI: 10.1145/3329710

Volume Information: Volume 19, Issue 1, January 2020, Article No. 8, pp. 1–13

Paper Link: https://dl.acm.org/doi/abs/10.1145/3329710
