Description: TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document in a collection. It is implemented here as three chained MapReduce jobs: the first counts how often each word occurs in each document, the second computes each document's total word count and from it the Term Frequency (TF), and the third computes the TF-IDF score for each term in each document.
Steps for Implementing TF-IDF Using MapReduce
Step 1: First MapReduce - Calculate Term Frequency
Map Phase:
Input: Document ID and the line of the document.
Process: Extract words from the line using a tokenizer.
Output: (word, documentID) as key and 1 as value, emitted once for every occurrence of the word.
Reduce Phase:
Input: (word, documentID) as key and list of counts (1s) as values.
Process: Sum the counts for each word in the document.
Output: (word, documentID) as key and word frequency as value.
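The first job can be sketched in plain Python, with an in-memory dictionary standing in for the shuffle between map and reduce. This is only an illustration under stated assumptions, not a Hadoop job: the function names (tf_map, tf_reduce, run_word_count), the regex tokenizer, and the sample lines are all hypothetical.

```python
import re
from collections import defaultdict


def tf_map(doc_id, line):
    """Map: emit ((word, doc_id), 1) for every token in the line."""
    # Assumption: a simple lowercase regex tokenizer; the source does not
    # say which tokenizer is used.
    for word in re.findall(r"[a-z0-9']+", line.lower()):
        yield (word, doc_id), 1


def tf_reduce(key, counts):
    """Reduce: sum the 1s collected for one (word, doc_id) key."""
    return key, sum(counts)


def run_word_count(lines):
    """Simulate the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for doc_id, line in lines:
        for key, value in tf_map(doc_id, line):
            grouped[key].append(value)
    return dict(tf_reduce(key, values) for key, values in grouped.items())


# Hypothetical input: (documentID, line) pairs.
lines = [("d1", "the cat sat"), ("d1", "the cat"), ("d2", "the dog")]
word_counts = run_word_count(lines)
# word_counts[("the", "d1")] is 2 and word_counts[("dog", "d2")] is 1
```

Keeping the document ID inside the key is what lets the same word be counted separately for each document.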
Step 2: Second MapReduce - Calculate Total Word Count and Term Frequency
Map Phase:
Input: (word, documentID) as key and word frequency as value.
Process: Re-key each record by document so that all words of one document reach the same reducer.
Output: documentID as key and (word, word frequency) as value.
Reduce Phase:
Input: documentID as key and list of (word, word frequency) pairs as values.
Process: Sum the word frequencies to get the total word count of the document. Calculate Term Frequency (TF) for each word by dividing its word frequency by the total word count.
Output: (word, documentID) as key and Term Frequency as value.
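A minimal Python sketch of the second job follows. It assumes the mapper re-keys each record by document ID and carries the word along in the value, so the reducer can both total the document and emit one TF per word; it also assumes the output keeps the document ID in the key. The function names and sample counts are hypothetical.

```python
from collections import defaultdict


def total_map(key, count):
    """Map: re-key by document so one reducer sees all words of a document."""
    word, doc_id = key
    yield doc_id, (word, count)


def total_reduce(doc_id, word_counts):
    """Reduce: total the document's words, then emit TF = count / total."""
    word_counts = list(word_counts)
    total = sum(count for _, count in word_counts)
    for word, count in word_counts:
        yield (word, doc_id), count / total


def run_term_frequency(word_counts):
    """Simulate the shuffle: group by document ID, then reduce each group."""
    grouped = defaultdict(list)
    for key, count in word_counts.items():
        for doc_id, value in total_map(key, count):
            grouped[doc_id].append(value)
    tf = {}
    for doc_id, values in grouped.items():
        for key, value in total_reduce(doc_id, values):
            tf[key] = value
    return tf


# Hypothetical output of the first job: (word, documentID) -> count.
word_counts = {
    ("the", "d1"): 2, ("cat", "d1"): 2, ("sat", "d1"): 1,
    ("the", "d2"): 1, ("dog", "d2"): 1,
}
tf = run_term_frequency(word_counts)
# d1 has 5 words in total, so "the" in d1 gets TF 2/5
```

The reducer buffers the document's pairs in a list because it needs the total before it can divide, which means it reads the values twice.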
Step 3: Third MapReduce - Calculate TF-IDF
Map Phase:
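Input: (word, documentID) as key and Term Frequency as value.
Output: word as key and (documentID, Term Frequency) as value.
Reduce Phase:
Input: Word as key and list of (documentID, Term Frequency) pairs as values.
Process: Obtain N (total number of documents in the collection, which must be supplied to the job because the values of a single word do not reveal it) and count n (number of documents containing the word, i.e. the number of values for this key). Compute TF-IDF for each document as TF * log(N/n).
Output: (word, documentID) as key and TF-IDF score as value.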
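The third job might look like the sketch below. Several details are assumptions rather than part of the description above: records are grouped by word with the document ID carried in the value, N is computed by the driver and handed to the reducer, the natural logarithm is used, and all function names and sample values are hypothetical.

```python
import math
from collections import defaultdict


def tfidf_map(key, tf):
    """Map: re-key by word so one reducer sees every document containing it."""
    word, doc_id = key
    yield word, (doc_id, tf)


def tfidf_reduce(word, doc_tfs, total_docs):
    """Reduce: n is the number of documents with the word; score each one."""
    doc_tfs = list(doc_tfs)
    n = len(doc_tfs)
    # Assumption: natural logarithm; the source does not state a base.
    idf = math.log(total_docs / n)
    for doc_id, tf in doc_tfs:
        yield (word, doc_id), tf * idf


def run_tfidf(tf):
    """Simulate the shuffle; N is counted once here and passed to reducers."""
    total_docs = len({doc_id for _, doc_id in tf})
    grouped = defaultdict(list)
    for key, value in tf.items():
        for word, pair in tfidf_map(key, value):
            grouped[word].append(pair)
    scores = {}
    for word, pairs in grouped.items():
        for key, score in tfidf_reduce(word, pairs, total_docs):
            scores[key] = score
    return scores


# Hypothetical output of the second job: (word, documentID) -> TF.
tf = {
    ("the", "d1"): 0.4, ("cat", "d1"): 0.4, ("sat", "d1"): 0.2,
    ("the", "d2"): 0.5, ("dog", "d2"): 0.5,
}
scores = run_tfidf(tf)
# "the" occurs in both documents, so log(2/2) = 0 and its score is 0.0
```

A word that appears in every document scores zero, which is the intended effect of the IDF factor: such a word does not distinguish one document from another.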