Q1: What is Multicollinearity?

Multicollinearity occurs when independent features (for example x1 and x2) are highly correlated with each other, e.g. by more than 90%.

Let's say we are solving a regression or classification problem with 10 to 15 features, i.e. a data set of n rows and 15 features (n x 15).

Step 1: Plot a correlation heat map that compares every feature with every other feature.

Step 2: Suppose that after step 1 we find two features, f3 and f4, whose correlation is above 90%.

Step 3: We can then remove one of the two features, for example the one which has [p-value > 0.05].
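Steps 1 and 2 can be sketched in code. This is only a minimal illustration on synthetic data: the feature names f1 to f4, the 0.9 threshold and the helper highly_correlated_pairs are assumptions made for the example, not part of the original walkthrough. The plot itself is left out; the same correlation matrix could be passed to a heat map function such as seaborn's heatmap.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, corr) for feature pairs with |corr| > threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic data: f4 is almost a multiple of f3, the other features are independent.
rng = np.random.default_rng(0)
f1, f2, f3 = rng.normal(size=(3, 200))
f4 = 2 * f3 + rng.normal(scale=0.1, size=200)
X = np.column_stack([f1, f2, f3, f4])
names = ["f1", "f2", "f3", "f4"]

pairs = highly_correlated_pairs(X, names)
for a, b, c in pairs:
    print(f"{a} and {b} are highly correlated: {c:.3f}")
```

Checking every pair this way costs on the order of the number of features squared, which is why it stops being practical for wide data sets.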

Important note: it is not practical to find the correlation for each feature if we have large…

We are going to perform this using two methods:

Method 1: Perform it manually by computing the eigenvalues and eigenvectors explicitly.
Method 2: PCA using sklearn.
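As a rough preview of Method 1, the manual computation can be sketched on a small synthetic array instead of the full data set. The function pca_manual, the standardization step and the random data are assumptions made for this illustration; the article's own steps may differ.

```python
import numpy as np

def pca_manual(X, n_components=2):
    """Project X onto the top eigenvectors of its covariance matrix."""
    # Standardize each column (assumes no column is constant).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance matrix of the features, shape (d, d).
    cov = np.cov(X_std, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors that belong to the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return X_std @ components, eigvals[order]

# Synthetic data: 100 points, 5 features, two of them strongly related.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

projected, top_eigvals = pca_manual(X)
print("Projected shape:", projected.shape)
```

Each eigenvalue equals the variance of the data along its eigenvector, so keeping the largest ones keeps the directions with the most spread. On image data with constant pixel columns, the division by the standard deviation would need a guard against zero.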

Method 1 : Compute Manually

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("mnist_train.csv")
# 60K data points; 785 columns = 784 pixel features (28*28) + 1 label column
print("Shape of df:", df.shape)

Shape of df: (60000, 785)

Separate Dependent and Independent features

# Dependent feature: class labels (0-9)
labels = df['label']
# Independent features: the 784 pixel columns
data = df.drop("label", axis='columns')

labels.shape: (60000,)
data.shape: (60000, 784)

Step 2 : Data Preprocessing using sklearn.preprocessing

As soon as we got…


Haberman’s data set contains data from a study conducted at the University of Chicago’s Billings Hospital between 1958 and 1970 on patients who had undergone surgery for breast cancer.


Predict the survival status of patients who underwent surgery.
Survival status [1] = the patient survived 5 years or longer
Survival status [2] = the patient died within 5 years

Import Libraries:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Load Dataset:

haberman = pd.read_csv("haberman.csv")

Attribute Information:

Age:                It represent the age of…

TF: Term Frequency

Step 1: Let's take an example of 2 sentences containing the following words:

s1 = w1 w3 w2 w2 w5 → 5 words

s2 = w1 w2 w3 w5 w6 w4 → 6 words

Step 2: Create the bag-of-words representation:
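A minimal sketch of this step for s1 and s2, using only the standard library. It assumes the common definition of term frequency as the count of a word divided by the number of words in the sentence; the helper names bag_of_words and term_frequency are made up for the example.

```python
from collections import Counter

s1 = "w1 w3 w2 w2 w5".split()      # 5 words
s2 = "w1 w2 w3 w5 w6 w4".split()   # 6 words

# Vocabulary: every distinct word across both sentences, in a fixed order.
vocab = sorted(set(s1) | set(s2))

def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def term_frequency(tokens):
    """TF(word, sentence) = count of word / number of words in the sentence."""
    return [c / len(tokens) for c in bag_of_words(tokens)]

print("vocab :", vocab)
print("BoW s1:", bag_of_words(s1))
print("BoW s2:", bag_of_words(s2))
print("TF  s1:", term_frequency(s1))
```

In s1 the word w2 occurs twice out of 5 words, so its term frequency is 2/5, while every word in s2 has a term frequency of 1/6.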


Let's take a sentence:

text = "I'm going to make him an offer he can't refuse"

Step 1: Cleaning text

text = str(text).lower() → convert the text to lower case

text = text.replace("i'm", "i am").replace("can't", "cannot") → expand contractions

>>> Cleaned text → "i am going to make him an offer he cannot refuse"

Step 2: Remove stopwords

We first take "i", "him" and "he" out of the stopword list so that they are kept in the text:

stopwords.remove("i")

stopwords.remove("him")

stopwords.remove("he")

>>> final_text → "i going make him offer he cannot refuse"
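Steps 1 and 2 can be put together in a short runnable sketch. The stopword set below is a small hypothetical list written out for this example; a real project would more likely start from a library's stopword list.

```python
text = "I'm going to make him an offer he can't refuse"

# Step 1: lower-case the text and expand contractions.
text = str(text).lower()
text = text.replace("i'm", "i am").replace("can't", "cannot")

# Step 2: hypothetical stopword list, with "i", "him" and "he" taken out
# so that these three words survive the filtering.
stopwords = {"i", "am", "to", "him", "an", "he", "a", "the"}
for keep in ("i", "him", "he"):
    stopwords.remove(keep)

final_text = " ".join(word for word in text.split() if word not in stopwords)
print(final_text)
```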

Step 3: Apply TF-IDF weighted Word2Vec (W2V) on this final_text
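One common way to do this is to average the word vectors of the sentence, weighting each vector by the word's TF-IDF score. The sketch below assumes that formulation. The 2-dimensional word vectors and the IDF values are made-up toy numbers; in practice they would come from a trained Word2Vec model and from a corpus. The function name tfidf_weighted_w2v is also hypothetical.

```python
from collections import Counter
import numpy as np

def tfidf_weighted_w2v(tokens, word_vectors, idf):
    """Sentence vector = sum(tfidf_w * vec_w) / sum(tfidf_w) over known words."""
    dim = len(next(iter(word_vectors.values())))
    weighted_sum = np.zeros(dim)
    weight_total = 0.0
    for word, count in Counter(tokens).items():
        if word not in word_vectors or word not in idf:
            continue  # skip words without a vector or an IDF value
        tfidf = (count / len(tokens)) * idf[word]
        weighted_sum += tfidf * np.asarray(word_vectors[word], dtype=float)
        weight_total += tfidf
    return weighted_sum / weight_total if weight_total else weighted_sum

# Toy inputs (hypothetical values, for illustration only).
final_text = "i going make him offer he cannot refuse"
word_vectors = {"offer": [1.0, 0.0], "refuse": [0.0, 1.0], "make": [1.0, 1.0]}
idf = {"offer": 2.0, "refuse": 3.0, "make": 1.0}

sentence_vector = tfidf_weighted_w2v(final_text.split(), word_vectors, idf)
print(sentence_vector)
```

Words with a higher TF-IDF score pull the sentence vector further towards their own word vector, so rare, informative words count for more than frequent ones.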


Let's take two random variables [Height, Weight]

Height (cm)   Weight
120           50
130           60
150           80
140           75
130           65

Covariance quantifies the relationship between 2 variables, i.e.

If height increases and weight also increases

If height decreases and weight also decreases

[Positive covariance]

If height increases and weight decreases

If height decreases and weight increases

[Negative covariance]
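For the five rows in the table above, the sign can be checked directly. This is a minimal sketch that assumes the sample covariance (dividing by n - 1); dividing by n instead changes the value but not the sign.

```python
heights = [120, 130, 150, 140, 130]   # cm
weights = [50, 60, 80, 75, 65]

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations from the means, over n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

cov_hw = covariance(heights, weights)
print("Covariance of height and weight:", cov_hw)
print("Positive" if cov_hw > 0 else "Negative or zero", "covariance")
```

The result is positive because taller rows in the table also tend to have a larger weight.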

The mathematical formula is represented as:
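The formula itself is missing at this point in the text. The standard definition of the sample covariance of two variables x and y over n observations is given below, with x-bar and y-bar denoting the means; the population version divides by n instead of n - 1.

```latex
\operatorname{Cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
```

With x as height and y as weight, each term of the sum is positive when both values lie on the same side of their means and negative when they lie on opposite sides, which matches the positive and negative cases described above.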

Nihar Jamdar

Data Science Enthusiast
