TF-IDF (Term Frequency and Inverse Document Frequency)
TF Term Frequency
Step 1: Let's take an example of 2 sentences containing words:
s1 = w1 w3 w2 w2 w5 → 5 words
s2 = w1 w2 w3 w5 w6 w4 → 6 words
Step 2: Create the Bag of Words representation and compute term frequency:
TF(wi, sj) = [No. of times wi occurs in sj] / [Total number of words in sj]
TF(w2, s1) = 2/5, since w2 is repeated 2 times in sentence 1 (5 words total)
TF(w4, s1) = 0/5, since w4 is repeated 0 times in sentence 1
In words: TF = (No. of repetitions of a word in the sentence) / (Total no. of words in the sentence)
Note: TF (of any word in a sentence) lies between 0 and 1:
0 ≤ TF(wi , sj) ≤ 1
So, term frequency can be thought of as how often word wi occurs in sentence sj:
The more often wi occurs in sj, the higher the term frequency [i.e. closer to 1]
The less often wi occurs in sj, the lower the term frequency [i.e. closer to 0]
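The TF computation above can be sketched in Python; this is a minimal illustration assuming simple whitespace tokenization, and the function name tf is just illustrative:

```python
def tf(word, sentence):
    # TF = (count of word in sentence) / (total words in sentence)
    words = sentence.split()
    return words.count(word) / len(words)

# The two example sentences from the notes
s1 = "w1 w3 w2 w2 w5"   # 5 words
s2 = "w1 w2 w3 w5 w6 w4" # 6 words

print(tf("w2", s1))  # w2 occurs 2 times in 5 words -> 0.4
print(tf("w4", s1))  # w4 occurs 0 times in 5 words -> 0.0
```

Since the numerator is at most the denominator, the result always lies between 0 and 1, matching the note above.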
IDF Inverse Document Frequency
Let S = the set of sentences (s1, s2, s3, … sn), with N sentences in total:
s1 = w2 w3 w1 w4 w5 …
s2 = w1 w4 w1 w4 w2
.
.
sn = w5 w3 w1 w3 w5
IDF(wi, S) = log(N / ni), where N = number of sentences and ni = number of sentences that contain wi
Important points to remember:
1: ni ≤ N
2: N / ni ≥ 1
3: log (N / ni) ≥ 0
Case: as ni increases, N/ni decreases, and log(N/ni) also decreases:
e.g. 1000 / 10 > 1000 / 20,
where in the 1st case, out of 1000 sentences, 10 actually contain the word (N = 1000, ni = 10),
and in the 2nd case, out of 1000 sentences, 20 actually contain the word (N = 1000, ni = 20).
So, as ni increases, N/ni keeps reducing, and the same holds for log(N/ni).
1: Common words like {the, is, not, like, you, see, etc.} occur in many sentences, so ni is large, N/ni is small, and log(N/ni) is also small.
2: Rare words like {historical, civilization, goodfellas, translation, etc.} occur in very few sentences, so ni is small, N/ni is large, and log(N/ni) is also large.
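The IDF formula can be sketched the same way; this is a minimal illustration assuming whitespace tokenization (note that if a word appears in no sentence, ni = 0 and the formula is undefined):

```python
import math

def idf(word, sentences):
    # IDF = log(N / ni), N = total sentences, ni = sentences containing the word
    n = len(sentences)
    n_i = sum(1 for s in sentences if word in s.split())
    return math.log(n / n_i)  # assumes n_i >= 1

sentences = ["w2 w3 w1 w4 w5",
             "w1 w4 w1 w4 w2",
             "w5 w3 w1 w3 w5"]

print(idf("w1", sentences))  # in all 3 sentences -> log(3/3) = 0.0
print(idf("w4", sentences))  # in 2 of 3 sentences -> log(3/2)
```

A word that appears in every sentence gets IDF = log(1) = 0, which is exactly how IDF suppresses common words like "the" or "is".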
How to combine both TF & IDF? Multiply them: TF-IDF(wi, sj) = TF(wi, sj) × IDF(wi, S). A word gets a high score when it occurs often in a sentence but appears in few sentences overall.
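Putting the two pieces together, TF-IDF is the product TF(wi, sj) × IDF(wi, S). A minimal sketch, again assuming whitespace tokenization and using the two example sentences from the TF section:

```python
import math

def tf(word, sentence):
    words = sentence.split()
    return words.count(word) / len(words)

def idf(word, sentences):
    n_i = sum(1 for s in sentences if word in s.split())
    return math.log(len(sentences) / n_i)  # assumes the word occurs somewhere

def tfidf(word, sentence, sentences):
    # High when the word is frequent in this sentence but rare across sentences
    return tf(word, sentence) * idf(word, sentences)

sentences = ["w1 w3 w2 w2 w5", "w1 w2 w3 w5 w6 w4"]

# w2 is frequent in s1, but it appears in both sentences, so IDF = log(2/2) = 0
print(tfidf("w2", sentences[0], sentences))  # 0.0

# w6 appears only in s2, so it gets a non-zero score: (1/6) * log(2/1)
print(tfidf("w6", sentences[1], sentences))
```

Note that with only two documents the scores collapse quickly to zero; the weighting becomes informative on larger corpora, where common and rare words separate clearly.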
Drawback: TF-IDF does not capture the semantic meaning of words.