Unsupervised Keyphrase Extraction for Web Pages - 2019

Research Area:  Machine Learning

Abstract:

Keyphrase extraction is an important part of natural language processing (NLP) research, although little research has been done in the domain of web pages. The World Wide Web contains billions of pages that are potentially interesting for various NLP tasks, yet it remains largely untouched in scientific research. Current research is often applied only to clean corpora such as abstracts and articles from academic journals, or to sets of scraped texts from a single domain. However, textual data from web pages differs from normal text documents, as it is structured using HTML elements and often consists of many small fragments. These elements are furthermore used in a highly inconsistent manner and are likely to contain noise. We evaluated the keyphrases extracted by several state-of-the-art extraction methods and found that they did not transfer well to web pages. We therefore propose WebEmbedRank, an adaptation of a recently proposed extraction method that can make use of structural information in web pages in a robust manner. We compared this novel method to other baselines and state-of-the-art methods using a manually annotated dataset and found that WebEmbedRank achieved significant improvements over existing extraction methods on web pages.
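
The abstract's point that web text arrives as many small fragments wrapped in inconsistently used HTML elements can be illustrated with a minimal sketch. This is not the WebEmbedRank method itself: the class `FragmentCollector` and the function `extract_fragments` are hypothetical names introduced here, and the sketch only assumes that a keyphrase extractor would want each visible text fragment paired with its enclosing tag as a rough structural signal. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical helper, not taken from the paper: it pairs each visible
# text fragment with the tag that directly encloses it.
class FragmentCollector(HTMLParser):
    SKIP = {"script", "style"}                       # non-visible content
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags
        self.fragments = []   # collected (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: pop back to the most recent matching tag,
        # and ignore closing tags that were never opened.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or any(t in self.SKIP for t in self.stack):
            return
        tag = self.stack[-1] if self.stack else ""
        self.fragments.append((tag, text))


def extract_fragments(html):
    """Return a list of (enclosing_tag, text) pairs for visible text."""
    parser = FragmentCollector()
    parser.feed(html)
    parser.close()
    return parser.fragments


if __name__ == "__main__":
    page = ("<html><head><title>Shop</title><style>p{}</style></head>"
            "<body><h1>Red  shoes</h1><p>Buy <b>now</b><br>today</p></body></html>")
    for tag, text in extract_fragments(page):
        print(tag, "|", text)
```

Even on this tiny page a single sentence is split across three fragments ("Buy", "now", "today"), which hints at why methods tuned on clean corpora may not transfer directly to web pages.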

Keywords:
Unsupervised keyphrase extraction
Sequence embeddings
Web pages
WebEmbedRank
Natural Language Processing (NLP)
Machine Learning
Deep Learning

Author(s) Name:  Tim Haarman, Bastiaan Zijlema and Marco Wiering

Journal name:  Multimodal Technologies and Interaction

Conference name:

Publisher name:  MDPI

DOI:  10.3390/mti3030058

Volume Information:  Volume 3, Issue 3