Description: CountVectorizer is a technique from the sklearn library that converts a collection of text documents into a matrix of token counts.It transforms text into a numerical representation by counting the frequency of each word in the document.
Step-by-Step Process
Import Required Libraries: First, you need to import the necessary libraries for Count Vectorization.
Create the Text Data: Create a list of strings (or documents) that we want to analyze.
Initialize CountVectorizer: We initialize the CountVectorizer which will convert our text data into a matrix of word counts.
Fit and Transform the Text Data: Now, we apply the fit_transform method to the text data. This method learns the vocabulary from the text data and transforms the text into a numerical matrix.
View the Feature Names (Vocabulary): We can also see the vocabulary (the list of words) that the vectorizer has learned.This helps understand which index corresponds to which word.
Sample Code
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
text_data = [
"I love programming",
"Programming is fun",
"I love fun",
"I love Python programming"
]
# Initialize CountVectorizer
vectorizer = CountVectorizer()
# Fit and transform the data
X = vectorizer.fit_transform(text_data)
# Convert the result to an array and display
print(X.toarray())
# View the feature names (vocabulary)
print(vectorizer.get_feature_names_out())