Word embeddings model the relationship between a target word and its context words through neural networks. They are neural-network-based distributed representations, obtained by training language models over large corpora, that carry rich contextual and semantic information. The main purpose of word embedding is to represent each word as a vector that captures latent linguistic information, such as semantics. The two types of word embedding, static and contextualized, are discussed below.
Static Word Embedding:
Static word embeddings do not account for polysemy: the same word receives the same embedding regardless of context. Static word embedding methods include Word2vec, GloVe, and FastText.
• Word2vec:
Word2vec is an efficient and effective tool for learning word representations from a corpus; it implements two models, CBOW and Skip-gram. Both models capture word semantics effectively and transfer easily to downstream tasks. The main aim of Word2vec is to capture linguistic context and syntactic and semantic word relationships by clustering words with similar meanings together and placing dissimilar words apart.
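As an illustration of how Skip-gram training data is derived, the sketch below generates (target, context) pairs from a token sequence using a fixed context window; the function name and window size are illustrative choices, not part of the original Word2vec code.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as in the Skip-gram model."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context words lie within `window` positions of the target word.
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs
```

Each pair becomes one training example: the model learns to predict the context word from the target word, which pushes words appearing in similar contexts toward similar vectors.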
• Global Vectors for Word Representation (GloVe):
GloVe captures both local and global statistics to form word representations. It constructs a global word-word co-occurrence matrix and trains word vectors against it, deriving semantic relationships between words from the co-occurrence counts.
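The co-occurrence statistic that GloVe factorizes can be sketched as follows; this is a minimal counting routine (names and window size are illustrative), not GloVe's weighted least-squares training itself.

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """Count symmetric word-word co-occurrences within a fixed window,
    the global statistic from which GloVe derives word vectors."""
    counts = Counter()
    for i, w in enumerate(tokens):
        # Look only to the left; adding both orderings keeps the matrix symmetric.
        for j in range(max(0, i - window), i):
            counts[(w, tokens[j])] += 1
            counts[(tokens[j], w)] += 1
    return counts
```

GloVe then fits word vectors so that their dot products approximate the logarithms of these co-occurrence counts.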
• FastText:
FastText, a simple and effective word embedding model, constructs word vectors from character n-grams. This allows the model to represent the morphology and lexical similarity of words, including unseen words.
FastText is useful because words that do not occur in the training corpus are still assigned a vector built from their character n-grams. It extends the Word2vec skip-gram model by summing subword embeddings, and thus handles new, out-of-vocabulary (OOV) words.
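The subword decomposition behind this OOV handling can be sketched as below: a word is wrapped in boundary markers and split into character n-grams, whose vectors FastText sums to form the word vector. The function name and n-gram range are illustrative assumptions.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams (with boundary markers < and >) whose vectors
    are summed to form a word's representation, so even unseen words
    can be assigned a vector from their subwords."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams
```

Because an unseen word shares n-grams with morphologically related training words (e.g. a shared stem), its composed vector lands near theirs.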
Contextualized Word Embedding:
These embeddings capture word semantics in context: the same word receives different representations in different contexts. Recently, prominent language models such as ELMo and BERT have been introduced.
• Count-based language model or n-gram language model:
It is a probabilistic language model for predicting the next word in a given sequence of words. An n-gram denotes a sequence of N words in the text. The model counts how often word sequences occur in a corpus and estimates probabilities from those counts.
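The count-and-estimate step can be sketched for the bigram (N = 2) case; this is a plain maximum-likelihood estimate with no smoothing, and the function name is an illustrative choice.

```python
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood bigram model:
    P(w2 | w1) = count(w1, w2) / count(w1)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Each bigram count is normalized by the count of its history word.
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
```

A real n-gram model would add smoothing (e.g. Laplace or Kneser-Ney) so that unseen sequences do not receive zero probability.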
• Embeddings from language models (ELMo):
ELMo is a feature-based, deep contextualized word representation approach introduced to model both the complex characteristics of word use and ambiguity. In ELMo, the representation of a given word depends entirely on the context in which it is used. It has achieved outstanding performance over a wide range of natural language problems.
• Bidirectional Encoder Representations from Transformers (BERT):
BERT is a contextualized word representation model based on a multi-layer bidirectional transformer encoder. It is pre-trained for language representation using masked language modeling (MLM) and next sentence prediction. It captures long-range and pragmatic information and yields deep bidirectional pre-trained representations.
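The MLM objective can be sketched as a masking step over a token sequence; this simplified version only replaces tokens with a [MASK] symbol (BERT's actual recipe also sometimes substitutes random tokens or leaves tokens unchanged), and the function name, rate, and seed are illustrative.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK]; under the MLM
    objective the model is trained to recover the original tokens."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # no prediction needed at this position
    return masked, labels
```

Because the model must predict a masked token from both its left and right neighbors, the learned representations are bidirectional.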
In the real world, language models play a fundamental role in everyday applications such as grammatical error correction, speech recognition, and text summarization. Word embeddings have offered solutions to various NLP applications, including word similarity computation, document clustering, sentiment analysis, text classification, and recommendation systems.
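The word-similarity application mentioned above typically reduces to cosine similarity between embedding vectors, which can be sketched as:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors:
    the dot product normalized by both vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Words with similar meanings tend to have similar vectors, so their cosine similarity is close to 1, while unrelated words score near 0.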
Research Challenges in Word Embeddings:
Despite a high volume of word embeddings available for language models and their developments in many application domains, many challenges and opportunities exist for further research.
• Inflected words appear less frequently than their base forms in certain contexts, which is a challenging issue for word embedding.
• Word embeddings become ineffective when they fail to handle out-of-vocabulary words.
• The lack of a shared representation for morphologically similar words is a significant issue for embedding models.