What is multicollinearity and how do we resolve it?
Q1: What is multicollinearity?
Multicollinearity occurs when any of our independent features (e.g. x1, x2) are highly correlated with each other, say more than 90%.
Q2: How does multicollinearity arise, and how do we resolve it?
Let's say we are solving a regression/classification problem with 10–15 features, i.e. n rows by 15 features (n x 15).
Step 1: Plot a correlation heatmap comparing each feature against every other feature.
Step 2: Suppose after Step 1 we find a pair of features [f3, f4] that are highly correlated, with more than 90% correlation.
Step 3: Remove whichever of the two features has a p-value > 0.05 in the regression summary (i.e. the one that is not statistically significant).
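The correlation scan in Steps 1–2 can be sketched in code. This is a minimal illustration on synthetic data (the feature names f1–f4, the 0.9 threshold, and the data itself are assumptions for demonstration, not from a real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
f3 = rng.normal(size=n)
f4 = 0.98 * f3 + rng.normal(scale=0.05, size=n)  # f4 is a near-copy of f3
df = pd.DataFrame({"f1": f1, "f2": f2, "f3": f3, "f4": f4})

corr = df.corr().abs()
# keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```

Here f4 is constructed as a near-duplicate of f3, so the pairwise scan flags it; in practice, Step 3 says to keep whichever member of the pair is more significant (lower p-value) rather than dropping blindly.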
Important note: Inspecting pairwise correlations becomes impractical when we have a large number of features, e.g. (n rows, 200 features) = n x 200. To handle that case, we use regularization techniques called Ridge and Lasso regression.
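To illustrate why the regularization route helps: with two nearly identical columns, ordinary least squares can assign large offsetting coefficients to the pair, while the ridge penalty keeps them small and stable. A minimal sketch using the closed-form ridge solution b = (XᵀX + λI)⁻¹Xᵀy on synthetic data (the data, λ = 1, and the true coefficients are made-up assumptions; in practice you would use a library implementation such as sklearn's Ridge or Lasso):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 3] + rng.normal(scale=0.001, size=n)  # last two columns nearly identical
y = X @ np.array([1.0, 2.0, 0.0, 3.0, 3.0]) + rng.normal(scale=0.1, size=n)

lam = 1.0
# ridge closed form: (X^T X + lam*I)^-1 X^T y
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
# plain OLS for comparison: the twin columns' coefficients can drift far apart
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(b_ridge, 2))  # ridge splits the shared effect evenly across the twin columns
print(np.round(b_ols, 2))
```

Ridge shrinks the ill-determined direction (the difference between the twin columns) toward zero, so each twin gets roughly half the shared effect instead of an arbitrary offsetting split.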
Let's take an example.
Example 1:
import pandas as pd

df = pd.read_csv("Advertising.csv")
X = df[["TV", "radio", "newspaper"]]  # independent features
y = df["sales"]                       # dependent feature
df.head()
y = b0 + b1*x1 + b2*x2 + b3*x3
x1 = TV, x2 = radio, x3 = newspaper, y = sales
b0 = intercept
b1, b2, b3 = slopes (coefficients)
To check whether there is a multicollinearity issue, we will fit an OLS (Ordinary Least Squares) model.
import statsmodels.api as sm

X = sm.add_constant(X)  # adds a constant column for the intercept b0
X.head()
model = sm.OLS(y, X).fit()
model.summary()
X.iloc[:, 1:].corr()  # correlation matrix, skipping the constant column
Conclusion: None of the features are internally correlated; all pairwise correlations among TV, radio, and newspaper are close to zero, i.e. the independent features are not correlated with each other.
Example 2:
df1 = pd.read_csv("Salary_Data.csv")
df1.head()
X = df1[["YearsExperience", "Age"]]  # independent features
y = df1["Salary"]                    # dependent feature
Using the OLS model:
import statsmodels.api as sm

X = sm.add_constant(X)  # adds a constant column for the intercept b0
X.head()
model = sm.OLS(y, X).fit()
model.summary()
X.iloc[:, 1:].corr()  # correlation matrix, skipping the constant column
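Besides the correlation matrix, a common numeric check for multicollinearity is the variance inflation factor (VIF), where VIF_j = 1/(1 - R²_j) from regressing feature j on the remaining features; values above roughly 10 flag a problem. A self-contained sketch on synthetic data standing in for YearsExperience and Age (the generated data and the one-for-one age/experience relationship are assumptions, since Salary_Data.csv is not available here):

```python
import numpy as np

def vif(X):
    """VIF per column: 1/(1 - R^2) from regressing that column on the rest."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # include an intercept
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
n = 200
exp = rng.uniform(1, 15, n)                     # stand-in for YearsExperience
age = 22 + exp + rng.normal(scale=0.5, size=n)  # Age moves almost one-for-one with experience
X = np.column_stack([exp, age])
vifs = vif(X)
print(np.round(vifs, 1))  # both VIFs are large, flagging multicollinearity
```

When two features track each other this closely, both VIFs blow up, and Step 3 applies: drop whichever of the pair is less significant in the OLS summary.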