Multilingual Corpus Creation for Multilingual Semantic Similarity Task

Research Area: Machine Learning

Abstract:

In natural language processing, the performance of a semantic similarity task relies heavily on the availability of a large corpus. Various monolingual corpora are available (mainly English), but multilingual resources are very limited. In this work, we describe a semi-automated framework to create a multilingual corpus which can be used for the multilingual semantic similarity task. The similar sentence pairs are obtained by crawling bilingual websites, whereas the dissimilar sentence pairs are selected by applying topic modeling and an OpenAI GPT model on the similar sentence pairs. We focus on websites in the government, insurance, and banking domains to collect English-French and English-Spanish sentence pairs; however, this corpus creation approach can be applied to any other industry vertical provided that a bilingual website exists. We also show experimental results for multilingual semantic similarity to verify the quality of the corpus and demonstrate its usage.

Keywords:

Author(s) Name: Mahtab Ahmed, Chahna Dixit, Robert E. Mercer, Atif Khan, Muhammad Rifayat Samee, Felipe Urra

Journal name:

Conference name: Proceedings of the 12th Language Resources and Evaluation Conference

Publisher name: European Language Resources Association

DOI:

Volume Information:

Paper Link: https://aclanthology.org/2020.lrec-1.516/

Multilingual Corpus Creation for Multilingual Semantic Similarity Task - 2020