TF-IDF (Term Frequency and Inverse Document Frequency)

Nihar Jamdar
3 min read · Feb 28, 2021


https://dataaspirant.com/tf-idf-term-frequency-inverse-document-frequency/

TF (Term Frequency)

Step 1: Let's take an example of 2 sentences made up of the following words:

s1 = w1 w3 w2 w2 w5 → 5 words

s2 = w1 w2 w3 w5 w6 w4 → 6 words

Step 2: Create a bag-of-words representation:

This represents the words of each sentence in vector format.
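As a rough illustration of the vector format, here is a minimal sketch that turns s1 and s2 into count vectors. The function name bag_of_words and the choice of a sorted vocabulary are assumptions made for this example, not something the article prescribes.

```python
from collections import Counter

# The two example sentences from Step 1, split into words.
s1 = "w1 w3 w2 w2 w5".split()
s2 = "w1 w2 w3 w5 w6 w4".split()

# Vocabulary: every distinct word across both sentences, in sorted order
# (the ordering is an arbitrary choice for this sketch).
vocabulary = sorted(set(s1) | set(s2))

def bag_of_words(sentence, vocabulary):
    """Return how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence)
    return [counts[word] for word in vocabulary]

bow_s1 = bag_of_words(s1, vocabulary)
bow_s2 = bag_of_words(s2, vocabulary)

print(vocabulary)  # ['w1', 'w2', 'w3', 'w4', 'w5', 'w6']
print(bow_s1)      # [1, 2, 1, 0, 1, 0]
print(bow_s2)      # [1, 1, 1, 1, 1, 1]
```

Each position in a vector corresponds to one vocabulary word, so w2 (second position) has a count of 2 in s1.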

TF(wi, sj) = (Number of times wi occurs in sj) / (Total number of words in sj)

TF(w2, s1) = 2 / 5, i.e. w2 occurs 2 times in sentence 1, which has 5 words

TF(w4, s1) = 0 / 5 = 0, i.e. w4 occurs 0 times in sentence 1

In words: TF = (Number of repetitions of the word in the sentence) / (Total number of words in the sentence)

Note: the TF of any word in a sentence lies between 0 and 1:

0 ≤ TF(wi, sj) ≤ 1

So, term frequency can be thought of as how often the word wi occurs in the sentence sj, relative to the length of sj.

The more often wi occurs in sj, the higher the term frequency (i.e. closer to 1).

The less often wi occurs in sj, the lower the term frequency (i.e. closer to 0).
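The TF formula above can be sketched in a few lines of Python. The helper name term_frequency is hypothetical, and the sketch assumes a non-empty sentence whose words are separated by whitespace.

```python
def term_frequency(word, sentence):
    """TF(word, sentence) = occurrences of word / total words in sentence."""
    words = sentence.split()
    return words.count(word) / len(words)

# The two example sentences from Step 1.
s1 = "w1 w3 w2 w2 w5"
s2 = "w1 w2 w3 w5 w6 w4"

tf_w2_s1 = term_frequency("w2", s1)  # 2 / 5
tf_w4_s1 = term_frequency("w4", s1)  # 0 / 5
tf_w4_s2 = term_frequency("w4", s2)  # 1 / 6

print(tf_w2_s1, tf_w4_s1, tf_w4_s2)
```

Because the count of a word can never exceed the length of the sentence, every value returned stays within the 0 to 1 range noted above.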

IDF (Inverse Document Frequency)

https://medium.datadriveninvestor.com/tf-idf-in-natural-language-processing-8db8ef4a7736

Let S be the collection of sentences (s1, s2, s3, s4, …, sn):

s1 = w2 w3 w1 w4 w5 …

s2 = w1 w4 w1 w4 w2

.

.

sn = w5 w3 w1 w3 w5

IDF(wi, S) = log(N / ni), where N = number of sentences in S and ni = number of sentences that contain wi

Important points to remember (for a word that occurs in at least one sentence, so ni ≥ 1):

1: ni ≤ N

2: N / ni ≥ 1

3: log (N / ni) ≥ 0

Case: as ni increases, N/ni decreases, and log(N/ni) also decreases:

e.g. 1000 / 10 > 1000 / 20,

where in the 1st case, 10 out of 1000 sentences contain the word (N = 1000, ni = 10),

and in the 2nd case, 20 out of 1000 sentences contain the word (N = 1000, ni = 20).

So, as ni increases, N/ni keeps decreasing, and so does log(N/ni).
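To put numbers on this case, here is a small sketch that evaluates log(N/ni) for both values of ni. The article does not say which logarithm base it uses, so the natural log here is an assumption; another base would only rescale the values, not change their order.

```python
import math

N = 1000  # total number of sentences

idf_ni_10 = math.log(N / 10)  # word appears in 10 sentences
idf_ni_20 = math.log(N / 20)  # word appears in 20 sentences

print(round(idf_ni_10, 3))  # 4.605
print(round(idf_ni_20, 3))  # 3.912
```

Doubling ni from 10 to 20 lowers the IDF, which matches the trend described above.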

1: Words like {the, is, not, like, you, see, etc.} occur in a lot of sentences, so ni will be large, N/ni will be small, and log(N/ni) will also be small.

2: Words like {historical, civilization, goodfellas, translation, etc.} occur in very few sentences, so ni will be small, N/ni will be large, and log(N/ni) will also be large.
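The contrast between common and rare words can be checked on a toy example. The corpus below is made up purely for illustration, the helper name inverse_document_frequency is hypothetical, and the natural log is again an assumption since the article does not fix a base.

```python
import math

def inverse_document_frequency(word, sentences):
    """IDF(word, S) = log(N / n_i), where n_i = sentences containing word."""
    N = len(sentences)
    n_i = sum(1 for sentence in sentences if word in sentence.split())
    if n_i == 0:
        # log(N / 0) is undefined; the formula assumes the word occurs somewhere.
        raise ValueError(f"{word!r} does not occur in any sentence")
    return math.log(N / n_i)

# Hypothetical four-sentence corpus (not from the article).
corpus = [
    "the movie is a historical drama",
    "the plot is slow",
    "the acting is good",
    "the translation is poor",
]

idf_the = inverse_document_frequency("the", corpus)                # log(4 / 4)
idf_historical = inverse_document_frequency("historical", corpus)  # log(4 / 1)

print(idf_the)                   # 0.0
print(round(idf_historical, 3))  # 1.386
```

Here "the" appears in every sentence, so its IDF is 0, while "historical" appears in only one sentence and gets the largest IDF possible for a corpus of this size.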

How to combine both TF & IDF?
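The article does not spell out the combination, so the sketch below assumes the common convention of multiplying the two: TF-IDF(wi, sj, S) = TF(wi, sj) × IDF(wi, S). The function names tf, idf and tf_idf are hypothetical, the natural log is an assumption, and the corpus reuses the two example sentences from the TF section.

```python
import math

def tf(word, sentence):
    """Occurrences of word in the sentence / total words in the sentence."""
    words = sentence.split()
    return words.count(word) / len(words)

def idf(word, sentences):
    """log(N / n_i); assumes the word occurs in at least one sentence."""
    n_i = sum(1 for sentence in sentences if word in sentence.split())
    return math.log(len(sentences) / n_i)

def tf_idf(word, sentence, sentences):
    """Assumed combination: the product of TF and IDF."""
    return tf(word, sentence) * idf(word, sentences)

corpus = ["w1 w3 w2 w2 w5", "w1 w2 w3 w5 w6 w4"]

# w2 is frequent in s1 (TF = 2/5) but occurs in every sentence (IDF = log(2/2) = 0).
score_w2_s1 = tf_idf("w2", corpus[0], corpus)

# w4 occurs once in s2 (TF = 1/6) and in only one sentence (IDF = log(2/1)).
score_w4_s2 = tf_idf("w4", corpus[1], corpus)

print(score_w2_s1)            # 0.0
print(round(score_w4_s2, 4))  # 0.1155
```

Under this convention, a word scores highly only when it is frequent within a sentence and rare across the collection.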

Drawback: TF-IDF does not capture the semantic meaning of words.
