TF-IDF (Term Frequency and Inverse Document Frequency)
TF Term Frequency
Step 1: Let's take an example of 2 sentences containing words:
s1 = w1 w3 w2 w2 w5 → 5 words
s2 = w1 w2 w3 w5 w6 w4 → 6 words
Step 2: Create the Bag of Words representation and compute term frequency:
TF(wi, sj) = [No. of times wi occurs in sj] / [Total number of words in sj]
TF(w2, s1) = 2/5, since w2 is repeated 2 times in sentence 1 (5 words total)
TF(w4, s1) = 0/5, since w4 is repeated 0 times in sentence 1
In words: TF = (No. of repetitions of a word in the sentence) / (Total no. of words in the sentence)
Note: TF (of any word in a sentence) lies between 0 and 1:
0 ≤ TF(wi , sj) ≤ 1
So, term frequency can be thought of as how often word wi occurs in sentence sj:
The more often wi occurs in sj, the higher the term frequency [i.e. closer to 1]
The less often wi occurs in sj, the lower the term frequency [i.e. closer to 0]
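The TF computation above can be sketched in Python; this is a minimal illustration assuming simple whitespace tokenization, and the function name tf is just illustrative:

```python
def tf(word, sentence):
    # TF = (count of word in sentence) / (total words in sentence)
    words = sentence.split()
    return words.count(word) / len(words)

# The two example sentences from the notes
s1 = "w1 w3 w2 w2 w5"   # 5 words
s2 = "w1 w2 w3 w5 w6 w4" # 6 words

print(tf("w2", s1))  # w2 occurs 2 times in 5 words -> 0.4
print(tf("w4", s1))  # w4 occurs 0 times in 5 words -> 0.0
```

Since the numerator is at most the denominator, the result always lies between 0 and 1, matching the note above.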
IDF Inverse Document Frequency
Let S = the set of sentences (s1, s2, s3, … sn), with N sentences in total:
s1 = w2 w3 w1 w4 w5 …
s2 = w1 w4 w1 w4 w2
.
.
sn = w5 w3 w1 w3 w5
IDF(wi, S) = log(N / ni), where N = number of sentences and ni = number of sentences that contain wi
Important points to remember:
1: ni ≤ N
2: N / ni ≥ 1
3: log (N / ni) ≥ 0
Case: as ni increases, N/ni decreases, and log(N/ni) also decreases:
e.g. 1000 / 10 > 1000 / 20,
where in the 1st case, out of 1000 sentences, 10 actually contain the word (N = 1000, ni = 10),
and in the 2nd case, out of 1000 sentences, 20 actually contain the word (N = 1000, ni = 20).
So, as ni increases, N/ni keeps reducing, and the same holds for log(N/ni).
1: Common words like {the, is, not, like, you, see, etc.} occur in many sentences, so ni is large, N/ni is small, and log(N/ni) is also small.
2: Rare words like {historical, civilization, goodfellas, translation, etc.} occur in very few sentences, so ni is small, N/ni is large, and log(N/ni) is also large.
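The IDF formula can be sketched the same way; this is a minimal illustration assuming whitespace tokenization (note that if a word appears in no sentence, ni = 0 and the formula is undefined):

```python
import math

def idf(word, sentences):
    # IDF = log(N / ni), N = total sentences, ni = sentences containing the word
    n = len(sentences)
    n_i = sum(1 for s in sentences if word in s.split())
    return math.log(n / n_i)  # assumes n_i >= 1

sentences = ["w2 w3 w1 w4 w5",
             "w1 w4 w1 w4 w2",
             "w5 w3 w1 w3 w5"]

print(idf("w1", sentences))  # in all 3 sentences -> log(3/3) = 0.0
print(idf("w4", sentences))  # in 2 of 3 sentences -> log(3/2)
```

A word that appears in every sentence gets IDF = log(1) = 0, which is exactly how IDF suppresses common words like "the" or "is".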
How to combine both TF & IDF? Multiply them: TF-IDF(wi, sj) = TF(wi, sj) × IDF(wi, S). A word gets a high score when it occurs often in a sentence but appears in few sentences overall.
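Putting the two pieces together, TF-IDF is the product TF(wi, sj) × IDF(wi, S). A minimal sketch, again assuming whitespace tokenization and using the two example sentences from the TF section:

```python
import math

def tf(word, sentence):
    words = sentence.split()
    return words.count(word) / len(words)

def idf(word, sentences):
    n_i = sum(1 for s in sentences if word in s.split())
    return math.log(len(sentences) / n_i)  # assumes the word occurs somewhere

def tfidf(word, sentence, sentences):
    # High when the word is frequent in this sentence but rare across sentences
    return tf(word, sentence) * idf(word, sentences)

sentences = ["w1 w3 w2 w2 w5", "w1 w2 w3 w5 w6 w4"]

# w2 is frequent in s1, but it appears in both sentences, so IDF = log(2/2) = 0
print(tfidf("w2", sentences[0], sentences))  # 0.0

# w6 appears only in s2, so it gets a non-zero score: (1/6) * log(2/1)
print(tfidf("w6", sentences[1], sentences))
```

Note that with only two documents the scores collapse quickly to zero; the weighting becomes informative on larger corpora, where common and rare words separate clearly.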
Drawback: TF-IDF does not capture the semantic meaning of words.