`If any two of our independent features (x1, x2) are more than 90% correlated with each other.`

Let's say we are solving a regression/classification problem with 10–15 features, i.e. a dataset of shape n x 15 (n rows, 15 features).

**Step 1**: We plot a correlation heat map that compares every feature with every other feature.

**Step 2**: Let's say that after Step 1 we find features [f3, f4] whose correlation with each other is above 90%.

**Step 3**: We can then remove whichever of these features has [p-value > 0.05].
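Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the original notebook: the data is synthetic, the column names `f1` to `f5` are made up, and `f4` is deliberately built from `f3` so that the pair crosses the 0.9 threshold.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): 200 rows, 5 features,
# with f4 constructed as a noisy multiple of f3.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=(n, 4))
df = pd.DataFrame({
    "f1": base[:, 0],
    "f2": base[:, 1],
    "f3": base[:, 2],
    "f4": 2.0 * base[:, 2] + rng.normal(scale=0.1, size=n),
    "f5": base[:, 3],
})

# Step 1: absolute pairwise Pearson correlation between all features.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is reported once
# and the diagonal (always 1.0) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Step 2: collect the pairs whose correlation exceeds 90%.
threshold = 0.9
high_pairs = [
    (row, col, float(upper.loc[row, col]))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)
```

If seaborn is available, `sns.heatmap(corr)` is one way to draw the heat map from the same `corr` matrix. Step 3 (the p-value check) is not covered by this sketch.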

**Important Note**: It is not possible to find the correlation for each feature if we have large…

**We are going to perform this using 2 methods**

**Method 1**: We perform it manually by computing eigenvalues and eigenvectors explicitly.

**Method 2**: PCA using Sklearn.

`import numpy as np`

`import pandas as pd`

`import matplotlib.pyplot as plt`

`df = pd.read_csv("mnist_train.csv")`

`print("Shape of df:", df.shape)`

`# i.e. we have 60K data points and 785 columns: 784 pixel features [28*28] plus the label`

Shape of df: (60000, 785)

**Separate Dependent and Independent features**

`labels = df['label']`

`# Dependent feature: class labels (0–9)`

`data = df.drop("label", axis='columns')`

`# Independent features`

`print("labels.shape:", labels.shape)`

`print("data.shape:", data.shape)`

labels.shape: (60000,)

data.shape: (60000, 784)
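Method 1 (explicit eigenvalues and eigenvectors) might look roughly like the sketch below. It makes several assumptions: a small random matrix stands in for the MNIST `data`, the columns are centered but not scaled, and the helper name `pca_eig` is hypothetical.

```python
import numpy as np

def pca_eig(X, n_components=2):
    """Project X onto its top principal components via eigendecomposition."""
    # Center each feature so the covariance matrix is computed around the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns), shape (d, d).
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Pick the indices of the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # Coordinates of every row in the new, lower-dimensional basis.
    return X_centered @ components, eigvals[order]

# Stand-in data (assumption): 100 points with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

projected, top_eigvals = pca_eig(X, n_components=2)
print(projected.shape)
print(top_eigvals)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the largest eigenvalues mark the directions that preserve the most variance.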

**As soon as we got…**

`Haberman’s data set contains data from a study conducted at the University of Chicago’s Billings Hospital between 1958 and 1970 on patients who had undergone surgery for breast cancer.`

`Predict the survival status of patients who have undergone surgery.`

Survival status [1] = the patient survived 5 years or longer

Survival status [2] = the patient died within 5 years

`import pandas as pd`

`import seaborn as sns`

`import matplotlib.pyplot as plt`

`import numpy as np`

https://www.kaggle.com/gilsousa/habermans-survival-data-set

`haberman = pd.read_csv("haberman.csv")`

**Age**: It represents the age of…

**Step 1: Let's take an example of 2 sentences containing these words:**

s1 = w1 w3 w2 w2 w5 → 5 words

s2 = w1 w2 w3 w5 w6 w4 → 6 words

**Step 2: Create the bag-of-words representation:**
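As a minimal sketch of this step, the two sentences can be turned into count vectors over a shared vocabulary. The sorted vocabulary order and the helper name `bag_of_words` are choices made for this illustration only.

```python
from collections import Counter

# The two example sentences, already tokenized into words.
s1 = "w1 w3 w2 w2 w5".split()
s2 = "w1 w2 w3 w5 w6 w4".split()

# Shared vocabulary: every distinct word across both sentences, in sorted order.
vocabulary = sorted(set(s1) | set(s2))

def bag_of_words(tokens, vocabulary):
    """Return one count per vocabulary word (0 if the word is absent)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_s1 = bag_of_words(s1, vocabulary)
bow_s2 = bag_of_words(s2, vocabulary)

print(vocabulary)
print(bow_s1)
print(bow_s2)
```

Each sentence becomes a vector whose length equals the vocabulary size, so `w2` appearing twice in s1 shows up as a count of 2 in its vector.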

**TF-IDF -W2V**

Let's take a sentence:

**Text = “I’m going to make him an offer he can’t refuse”**

**Step 1: Cleaning text**

text = str(text).lower() → convert the text to lowercase

text = text.replace("i’m", "i am").replace("can’t", "cannot") → expand contractions

**>>> Cleaned text → “i am going to make him an offer he cannot refuse”**

**Step 2: Remove stopwords**

First take "i", "him" and "he" out of the stopword list so that these words are kept in the text:

Stopwords.remove("i")

Stopwords.remove("him")

Stopwords.remove("he")

**>>> final_text → “i going make him offer he cannot refuse”**
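Steps 1 and 2 can be put together in a short sketch. The stopword set here is a deliberately tiny, hypothetical list chosen for this sentence; the source does not say which list it uses, and a real stopword list would be much longer.

```python
text = "I'm going to make him an offer he can't refuse"

def clean_text(text):
    """Lowercase the text and expand the two contractions used in the example."""
    text = str(text).lower()
    return text.replace("i'm", "i am").replace("can't", "cannot")

# Hypothetical stopword list (assumption), kept tiny for illustration.
stopwords = {"i", "am", "to", "an", "a", "the", "him", "he"}

# Remove "i", "him" and "he" from the stopword list so they stay in the text.
for word in ("i", "him", "he"):
    stopwords.remove(word)

cleaned = clean_text(text)
final_text = " ".join(word for word in cleaned.split() if word not in stopwords)

print(cleaned)
print(final_text)
```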

**Step 3: Apply TF-IDF weighted W2V to this final_text**
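One common way to combine the two, sketched under assumptions: each word's vector is multiplied by its TF-IDF weight, the results are summed, and the sum is divided by the total weight. The function name `tfidf_weighted_w2v`, the 2-dimensional toy vectors and the toy weights below are all made up for illustration; in practice the vectors would come from a trained Word2Vec model and the weights from a fitted TF-IDF model.

```python
def tfidf_weighted_w2v(tokens, word_vectors, tfidf):
    """Weighted average of word vectors, using TF-IDF values as weights."""
    dim = len(next(iter(word_vectors.values())))
    weighted_sum = [0.0] * dim
    total_weight = 0.0
    for word in tokens:
        # Skip words that have no vector or no TF-IDF weight.
        if word in word_vectors and word in tfidf:
            weight = tfidf[word]
            for k, value in enumerate(word_vectors[word]):
                weighted_sum[k] += weight * value
            total_weight += weight
    if total_weight == 0:
        # No known words: return the zero vector.
        return weighted_sum
    return [value / total_weight for value in weighted_sum]

# Toy inputs (assumptions, not real model output).
toy_vectors = {"offer": [1.0, 0.0], "refuse": [0.0, 1.0]}
toy_tfidf = {"offer": 1.0, "refuse": 3.0}

sentence_vector = tfidf_weighted_w2v(["offer", "refuse"], toy_vectors, toy_tfidf)
print(sentence_vector)
```

Because "refuse" carries three times the weight of "offer" in the toy inputs, the sentence vector lies closer to the "refuse" vector.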

Covariance:

Let's take two random variables [Height, Weight]

| Height | Weight |
|--------|--------|
| 120cm  | 50     |
| 130cm  | 60     |
| 150cm  | 80     |
| …      | …      |
| 140cm  | 75     |
| 130cm  | 65     |

Covariance quantifies the relationship between 2 variables, i.e.

If height increases and weight also increases

If height decreases and weight also decreases

[Positive covariance]

If height increases and weight decreases

If height decreases and weight increases

[Negative covariance]
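As a quick check of the sign, here is a minimal sketch that uses only the five height/weight rows listed above (the elided rows are left out). It assumes the sample covariance, which divides by n − 1, and the helper name `sample_covariance` is hypothetical.

```python
# The five visible rows from the example (heights in cm; weights as given).
heights = [120, 130, 150, 140, 130]
weights = [50, 60, 80, 75, 65]

def sample_covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the respective means.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

cov = sample_covariance(heights, weights)
print(cov)  # positive: weight tends to rise with height

# Flipping the sign of one variable flips the sign of the covariance.
neg_cov = sample_covariance(heights, [-w for w in weights])
print(neg_cov)
```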

The mathematical formula is:
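Assuming the usual sample covariance over n paired observations $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$ (the population version divides by $n$ instead of $n - 1$), it can be written as:

$$
\operatorname{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

Each term is positive when $x_i$ and $y_i$ lie on the same side of their means and negative when they lie on opposite sides, which matches the positive and negative cases described above.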
