https://kaiserm.medium.com/how-to-tackle-multicollinearity-79afe58e9479

Q1: What is Multicollinearity?

Multicollinearity occurs when two or more of our independent features (e.g., x1, x2) are highly correlated with each other, say more than 90%.

Let's say we are solving a regression/classification problem with 10–15 features, i.e., data of shape (n rows, 15 features) = n x 15.

Step 1: We plot a correlation heatmap comparing each feature with every other feature.

Step 2: Suppose after Step 1 we find that features [f3, f4] are highly correlated, with more than 90% correlation.

Step 3: We remove whichever of the two features has a p-value > 0.05 (i.e., is not statistically significant), as shown in the sketch below.
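A minimal sketch of Steps 1–3 on synthetic data (the frame df, the target y, and the feature names f3/f4/f5 are all illustrative, not from the original article):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

# synthetic data: f4 is almost a copy of f3
rng = np.random.default_rng(0)
f3 = rng.normal(size=200)
df = pd.DataFrame({
    "f3": f3,
    "f4": 0.98 * f3 + rng.normal(scale=0.05, size=200),
    "f5": rng.normal(size=200),
})
df["y"] = 2 * df["f3"] + df["f5"] + rng.normal(size=200)

X = df.drop("y", axis="columns")

# Step 1: correlation heatmap of every feature pair
corr = X.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Step 2: list pairs with |correlation| > 0.9
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and abs(corr.loc[a, b]) > 0.9]
print(pairs)  # [('f3', 'f4')]

# Step 3: fit OLS and drop whichever of the pair has p-value > 0.05
model = sm.OLS(df["y"], sm.add_constant(X)).fit()
print(model.pvalues[["f3", "f4"]])  # here f4 turns out redundant given f3
X = X.drop("f4", axis="columns")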

Important Note: It is not possible to find the correlation for every feature pair if we have a large…


https://www.neuraldesigner.com/blog/principal-components-analysis

We are going to perform PCA using two methods:

Method 1: Perform it manually by computing eigenvalues and eigenvectors explicitly.
Method 2: PCA using sklearn.

Method 1: Compute Manually

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("mnist_train.csv")
print("Shape of df:", df.shape)  # 60K data points: 784 pixel features (28*28) plus the label column

Shape of df: (60000, 785)

Separate Dependent and Independent features

labels = df['label']  # dependent feature: class labels from 0-9
data = df.drop("label", axis='columns')  # independent features: 784 pixel values
print("labels.shape:", labels.shape)
print("data.shape:", data.shape)

labels.shape: (60000,)
data.shape: (60000, 784)

Step 2: Data Preprocessing using sklearn.preprocessing

As soon as we got…
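The text is cut off here, but here is a minimal sketch of how Method 1 usually continues (plus a Method 2 cross-check), assuming the data frame loaded above; note that eigenvector signs are arbitrary, so the two projections may differ in sign:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# standardize so every pixel feature has mean 0 and variance 1
std_data = StandardScaler().fit_transform(data)

# covariance matrix of the standardized data: (784, 784)
cov_matrix = np.cov(std_data, rowvar=False)

# eigen-decomposition; np.linalg.eigh suits symmetric matrices and
# returns eigenvalues in ascending order
eig_values, eig_vectors = np.linalg.eigh(cov_matrix)
top2 = eig_vectors[:, -2:]      # eigenvectors of the 2 largest eigenvalues
projected = std_data @ top2     # (60000, 2): data reduced to 2 dimensions

# Method 2: the same reduction with sklearn
projected_sk = PCA(n_components=2).fit_transform(std_data)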


https://mahavirdabasmd.wixsite.com/blog/post/abc-of-eda-with-haberman-s-survival-dataset

Introduction:

Haberman's dataset contains data from a study conducted at the University of Chicago's Billings Hospital between 1958 and 1970 on patients who had undergone surgery for breast cancer.

Objective:

Predict the survival status of patients who underwent surgery.
Survival status [1] = the patient survived 5 years or longer
Survival status [2] = the patient died within 5 years

Import Libraries:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Load Dataset:

https://www.kaggle.com/gilsousa/habermans-survival-data-set

haberman = pd.read_csv("haberman.csv")

Attribute Information:

Age: It represents the age of…
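The attribute list is cut off here. A small loading-and-first-look sketch follows; the Kaggle file typically ships without a header row, so the column names below (age, year, nodes, status) are supplied explicitly and are an assumption about your copy of the file:

haberman = pd.read_csv("haberman.csv",
                       names=["age", "year", "nodes", "status"])

print(haberman.shape)                     # (rows, 4 attributes)
print(haberman["status"].value_counts())  # class balance: 1 vs 2

# univariate view of each attribute, split by survival status
for col in ["age", "year", "nodes"]:
    sns.FacetGrid(haberman, hue="status", height=4) \
       .map(sns.histplot, col) \
       .add_legend()
    plt.show()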

https://dataaspirant.com/tf-idf-term-frequency-inverse-document-frequency/

TF (Term Frequency)

Step 1: Let's take an example of two sentences containing the following words:

s1 = w1 w3 w2 w2 w5 → 5 words

s2 = w1 w2 w3 w5 w6 w4 → 6 words

Step 2: Create a bag-of-words representation (see the sketch below):
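A sketch of the bag-of-words counts and the resulting term frequencies for s1 and s2, using sklearn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

s1 = "w1 w3 w2 w2 w5"     # 5 words
s2 = "w1 w2 w3 w5 w6 w4"  # 6 words

# bag-of-words counts over the shared vocabulary
cv = CountVectorizer()
counts = cv.fit_transform([s1, s2]).toarray()
print(cv.get_feature_names_out())  # ['w1' 'w2' 'w3' 'w4' 'w5' 'w6']
print(counts)                      # s1 -> [1 2 1 0 1 0], s2 -> [1 1 1 1 1 1]

# term frequency: count of word in sentence / total words in sentence
tf = counts / counts.sum(axis=1, keepdims=True)
print(tf[0])  # e.g. TF(w2, s1) = 2/5 = 0.4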


TF-IDF Weighted W2V

Let's take a sentence:

Text = “I’m going to make him an offer he can’t refuse”

Step 1: Clean the text

text = str(text).lower()  → lowercase the text

text = text.replace("i'm", "i am").replace("can't", "cannot")  → expanding contractions

>>> Cleaned text → "i am going to make him an offer he cannot refuse"

Step 2: Remove stopwords

Here we remove "i", "him", and "he" from the stopword list itself, so these pronouns are kept in the sentence while ordinary stopwords ("am", "to", "an") are dropped:

stopwords.remove("i")
stopwords.remove("him")
stopwords.remove("he")

>>> final_text → "i going make him offer he cannot refuse"
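A runnable sketch of Steps 1 and 2 together, assuming NLTK's English stopword list (which includes "i", "him", "he", "am", "to", and "an"):

import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

text = "I'm going to make him an offer he can't refuse"

# Step 1: clean
text = str(text).lower()
text = text.replace("i'm", "i am").replace("can't", "cannot")

# Step 2: build a stopword set but keep the pronouns we care about
stop_words = set(stopwords.words("english"))
for keep in ("i", "him", "he"):
    stop_words.discard(keep)

final_text = " ".join(w for w in text.split() if w not in stop_words)
print(final_text)  # -> i going make him offer he cannot refuse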

Step 3: Apply TF-IDF weighted W2V on this final_text (a sketch follows):
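A sketch of the idea using gensim's Word2Vec; the tiny model trained on this single sentence is purely illustrative (in practice you would use vectors trained on a large corpus), and with one document the IDF term is degenerate:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

final_text = "i going make him offer he cannot refuse"
tokens = final_text.split()

# toy Word2Vec: 50-d vectors, trained on just this sentence
w2v = Word2Vec([tokens], vector_size=50, min_count=1, seed=1)

# tf-idf weight of each word in the sentence (token_pattern widened
# so single-letter words like "i" are kept)
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
row = tfidf.fit_transform([final_text]).toarray()[0]
weight = dict(zip(tfidf.get_feature_names_out(), row))

# sentence vector = tf-idf weighted average of the word vectors
num = sum(w2v.wv[w] * weight[w] for w in tokens)
sent_vec = num / sum(weight[w] for w in tokens)
print(sent_vec.shape)  # (50,)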


Covariance:

Let's take two random variables [Height, Weight]:

height (cm)   weight (kg)
120           50
130           60
150           80
...
140           75
130           65

Covariance quantifies the relationship between the two variables:

If height increases and weight also increases, or
if height decreases and weight also decreases
→ [Positive covariance]

If height increases and weight decreases, or
if height decreases and weight increases
→ [Negative covariance]

The mathematical formula is:

cov(X, Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / n

where x̄ and ȳ are the means of X and Y, and n is the number of samples.
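A quick numeric check of the formula on a few of the sample values from the table above:

import numpy as np

height = np.array([120, 130, 150, 140, 130])  # cm
weight = np.array([50, 60, 80, 75, 65])       # kg

# manual: mean of the products of deviations from the means
cov_manual = np.mean((height - height.mean()) * (weight - weight.mean()))

# numpy: bias=True matches the 1/n formula above
cov_np = np.cov(height, weight, bias=True)[0, 1]

print(cov_manual, cov_np)  # positive -> height and weight rise together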

Nihar Jamdar

Data Science Enthusiast
