Research Breakthrough Possible @S-Logix pro@slogix.in

How to Perform Word Stemming Using the NLTK Library in NLP?

Condition for Performing Word Stemming Using the NLTK Library in NLP

  • Description:
    Word stemming is the process of reducing a word to its root or base form, known as the stem. Unlike lemmatization, stemming does not always produce valid words; it applies heuristic rules that strip or rewrite word endings (suffixes) rather than looking words up in a dictionary. The nltk library provides stemmers such as PorterStemmer and LancasterStemmer.
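To make the point about non-word stems concrete, here is a minimal sketch using PorterStemmer. The word list is only an illustrative assumption; any inflected English words would do.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Illustrative words (an assumption for this sketch, not a fixed test set).
words = ["studies", "running", "fairness"]

# Map each word to its Porter stem.
stems = {word: porter.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Here "studies" is reduced to "studi", which is not a dictionary word, while "running" and "fairness" are reduced to the recognizable stems "run" and "fair".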
Step-by-Step Process
  • Install and Import NLTK:
    Ensure nltk is installed (for example, with pip install nltk) and import the necessary modules.
  • Using the Porter Stemmer:
    The Porter Stemmer is one of the most commonly used stemmers, known for its simplicity and speed.
  • Using the Lancaster Stemmer:
    The Lancaster Stemmer is more aggressive than the Porter Stemmer and may result in over-stemming (removing more characters than necessary).
  • Comparison of Stemmers:
    The choice of stemmer depends on the task.
    The Porter Stemmer is less aggressive and retains more recognizable roots, whereas the Lancaster Stemmer may be useful when a highly compact representation is wanted.
  • Word Stemming with Tokenized Text:
    You can use stemming with tokenized sentences to process entire texts.
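One rough way to reason about the trade-off between the two stemmers is to count how many distinct stems each one produces for the same words. The sketch below assumes a hypothetical helper named distinct_stems (not part of nltk) and a small illustrative word list.

```python
from nltk.stem import LancasterStemmer, PorterStemmer


def distinct_stems(stemmer, words):
    """Return the set of distinct stems the given stemmer produces.

    Hypothetical helper for this sketch; it is not an nltk function.
    """
    return {stemmer.stem(word) for word in words}


# Illustrative word family (an assumption for this sketch).
words = ["connection", "connected", "connections", "connecting"]

porter_vocab = distinct_stems(PorterStemmer(), words)
lancaster_vocab = distinct_stems(LancasterStemmer(), words)

print("Porter:", sorted(porter_vocab))
print("Lancaster:", sorted(lancaster_vocab))
```

Fewer distinct stems means a more compact vocabulary, but an aggressive stemmer can also merge words that are not actually related, so it is worth inspecting the output on your own data before choosing.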
Sample Code
  • import nltk
    nltk.download('punkt') # For tokenization (if needed)
    nltk.download('punkt_tab') # Newer NLTK releases load the tokenizer data from this resource
    from nltk.stem import PorterStemmer
    # Initialize the stemmer
    porter = PorterStemmer()
    # Words to stem
    words = ["running", "runner", "easily", "fairness", "studies"]
    # Stem each word
    stems = [porter.stem(word) for word in words]
    print(stems)
    from nltk.stem import LancasterStemmer
    # Initialize the stemmer
    lancaster = LancasterStemmer()
    # Words to stem
    words = ["running", "runner", "easily", "fairness", "studies"]
    # Stem each word
    stems = [lancaster.stem(word) for word in words]
    print(stems)
    # Comparing Porter and Lancaster
    words = ["connection", "connected", "connections", "connecting"]
    porter_stems = [porter.stem(word) for word in words]
    lancaster_stems = [lancaster.stem(word) for word in words]
    print("Porter Stems:", porter_stems)
    print("Lancaster Stems:", lancaster_stems)
    from nltk.tokenize import word_tokenize
    text = "The quick brown fox jumps over the lazy dogs while running and connected to the internet."
    # Tokenize text
    tokens = word_tokenize(text)
    # Stem each token
    porter_stems = [porter.stem(token) for token in tokens]
    print("Original:", tokens)
    print("Stemmed:", porter_stems)
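If downloading tokenizer data is not convenient, a possible variation on the tokenized-text example is to swap word_tokenize for nltk's RegexpTokenizer, which needs no extra resources. This is a sketch under the assumption that splitting on runs of word characters is acceptable; unlike word_tokenize, it drops punctuation instead of returning it as tokens.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

text = "The quick brown fox jumps over the lazy dogs while running and connected to the internet."

# Assumption: tokens are runs of word characters; punctuation is discarded.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text)

# Stem each token with the Porter stemmer.
porter = PorterStemmer()
stems = [porter.stem(token) for token in tokens]

for token, stem in zip(tokens, stems):
    print(f"{token} -> {stem}")
```

With this tokenizer the trailing period never reaches the stemmer, so the stemmed list contains only words.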
Screenshots
  • Stemming.png