Description: One-hot encoding converts categorical data into a binary matrix of 0s and 1s. For text, each unique word gets its own column, so a word is represented by a vector with a single 1 in that word's position and 0s everywhere else. This turns text into a numerical form that machine learning models can take as input.
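To make the idea concrete before bringing in any library, here is a minimal pure-Python sketch. The word list and the helper name one_hot are made up for illustration, and sorting the vocabulary is just one way to give each word a stable position.

```python
# Hypothetical toy word list, used only for illustration.
words = ["cat", "dog", "cat", "bird"]

# Sort the unique words so each one gets a stable column index.
vocab = sorted(set(words))
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a 1 at the word's index and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[index[word]] = 1
    return vector

# Encode every word in the original order.
encoded = [one_hot(w) for w in words]
print(vocab)
print(encoded)
```

Each row has exactly one 1, and repeated words ("cat" here) map to the same vector.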
Step-by-Step Process
Import Required Libraries: We will use libraries like pandas or sklearn for encoding.
Organize the Data: Convert the text data into a format suitable for encoding (like a column in a DataFrame).
Initialize OneHotEncoder: We create an instance of OneHotEncoder from sklearn.
Fit and Transform Data: Fit the encoder to the data and transform it into a binary matrix.
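Since the first step mentions pandas as an option, the same steps can also be sketched with pandas alone using pd.get_dummies. This is a rough alternative under the assumption that the data fits in a single Series; the variable names are illustrative and not taken from the sample code.

```python
import pandas as pd

# Illustrative data held in a pandas Series.
fruits = pd.Series(["apple", "banana", "apple", "orange"], name="Fruits")

# get_dummies creates one column per unique value;
# dtype=int asks for 0/1 integers rather than booleans.
dummies = pd.get_dummies(fruits, dtype=int)
print(dummies)
```

Unlike OneHotEncoder, get_dummies does not keep a fitted object around, so it is less convenient when the same encoding has to be applied to new data later.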
Sample Code
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
text_data = ["apple", "banana", "apple", "orange", "banana", "apple"]
# Convert to a DataFrame
df = pd.DataFrame(text_data, columns=["Fruits"])
print(df)
# Initialize OneHotEncoder
# sparse_output=False returns a dense array (scikit-learn 1.2+; older versions used sparse=False)
encoder = OneHotEncoder(sparse_output=False)
# Fit and transform
onehot_encoded = encoder.fit_transform(df[["Fruits"]])
print(onehot_encoded)
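The printed matrix does not say which column belongs to which fruit. One possible way to label it, sketched here as an assumption rather than as part of the original sample, is to read the learned categories from the encoder's categories_ attribute. This version keeps the encoder's default sparse output and converts it with toarray(), so it does not depend on the name of the sparse keyword.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same fruit data as the sample code, rebuilt so this block runs on its own.
df = pd.DataFrame({"Fruits": ["apple", "banana", "apple", "orange", "banana", "apple"]})

encoder = OneHotEncoder()  # default output is a SciPy sparse matrix
dense = encoder.fit_transform(df[["Fruits"]]).toarray()

# categories_ holds one array of learned categories per input column.
labels = list(encoder.categories_[0])

# Wrap the matrix in a DataFrame so each column carries its category name.
labeled = pd.DataFrame(dense, columns=labels).astype(int)
print(labeled)
```

The categories are stored in sorted order, so the columns come out as apple, banana, orange regardless of the order in which the values first appear.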