2-D Visualization using Principal Component Analysis (PCA) on MNIST dataset
We will do this using two methods:
Method 1 : Perform PCA manually by computing the eigenvalues and eigenvectors explicitly.
Method 2 : PCA using Scikit-Learn.
Method 1 : Compute Manually
Step 1 : Import libraries and load the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("mnist_train.csv")
print("Shape of df:", df.shape)
# i.e. we have 60K data points, each with 1 label column and 784 pixel features (28 * 28)
Shape of df: (60000, 785)
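To make the 784 = 28 * 28 relationship concrete, here is a minimal sketch of how one flattened row maps back to an image grid. The array `flat` is stand-in data, not a real MNIST row, and the row-major pixel order is an assumption about how the CSV was flattened.

```python
import numpy as np

# Stand-in for one row of 784 pixel values (hypothetical data, not real MNIST)
flat = np.arange(784)

# Assumption: pixels were flattened row by row, so reshape restores the 28 x 28 grid
image = flat.reshape(28, 28)

print(image.shape)
# The first pixel of the second image row is element 28 of the flat vector
print(image[1, 0])
```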
Separate Dependent and Independent features
labels = df['label']
data = df.drop("label", axis='columns')
print("labels.shape:", labels.shape)  # dependent feature: class labels (0 to 9)
print("data.shape:", data.shape)  # independent features: pixel values
labels.shape: (60000,)
data.shape: (60000, 784)
Step 2 : Data Preprocessing using sklearn.preprocessing
Once the data is loaded, we column-standardize it using sklearn.preprocessing.
Q1] What is column standardization?
>> Say feature fj has values [x1 ... xn]. We transform them to [x1' ... xn'] such that the mean of the xi' is 0 and their standard deviation is 1.
Q2] How do we compute column standardization manually?
>> xi' = (xi - x_mean) / x_std, where x_mean and x_std are the mean and standard deviation of feature fj.
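The manual formula can be sketched in NumPy as follows. The helper name `standardize_columns` and the toy array are hypothetical. Leaving constant columns at zero instead of dividing by a zero standard deviation is an assumption made here to keep the sketch safe.

```python
import numpy as np

def standardize_columns(X):
    """Column-standardize X: subtract each column's mean, divide by its std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0] = 1.0      # assumption: constant columns stay at 0 instead of dividing by 0
    return (X - mean) / std

# Toy data: 3 points, 3 features; the last feature is constant
toy = np.array([[1.0, 10.0, 5.0],
                [2.0, 20.0, 5.0],
                [3.0, 60.0, 5.0]])
toy_std = standardize_columns(toy)

print(toy_std.mean(axis=0))  # column means after standardization
print(toy_std.std(axis=0))   # column standard deviations after standardization
```

Each non-constant column ends up with mean 0 and standard deviation 1, which is the property the next steps rely on.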
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
print(standardized_data.shape)
(60000, 784)
So far we have standardized the data, so each non-constant feature (column) has mean = 0 and std-dev = 1.
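Step 3 : Find the covariance matrix, i.e. S = X^T * X, where S is a [d x d] matrix, X^T is [d x n] and X is [n x d]. Strictly, the covariance matrix of the standardized data is (1/n) * X^T * X; we skip the constant factor because it scales the eigenvalues but does not change the eigenvectors.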
sample_data = standardized_data
# matrix multiplication using NumPy
cov_matrix = np.matmul(sample_data.T, sample_data)
print("The shape of co-variance matrix = ", cov_matrix.shape)
The shape of co-variance matrix = (784, 784)
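As a rough check of Step 3, the sketch below uses random toy data (not MNIST) to show that X^T * X divided by n matches NumPy's covariance estimate for standardized data, and that dropping the 1/n factor leaves the eigenvectors unchanged up to sign. All variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # toy data: n = 200 points, d = 5 features
X = (X - X.mean(axis=0)) / X.std(axis=0)       # column-standardize

n = X.shape[0]
scatter = X.T @ X                              # what Step 3 computes, shape (d, d)
cov = scatter / n                              # covariance with the 1/n factor
cov_np = np.cov(X, rowvar=False, bias=True)    # bias=True also normalizes by n

# Eigenvectors of the scaled and unscaled matrices (np.linalg.eigh, ascending order)
_, vecs_scatter = np.linalg.eigh(scatter)
_, vecs_cov = np.linalg.eigh(cov)

print(np.allclose(cov, cov_np))
print(np.allclose(np.abs(vecs_scatter), np.abs(vecs_cov)))
```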
Step 4 : With the basic steps done, it is time to compute the top 2 eigenvalues and eigenvectors, since we are projecting the data to 2-D.
Note: we use scipy.linalg to compute the eigenvalues and eigenvectors. Its eigh function always returns the eigenvalues (and the matching eigenvectors) in ascending order, i.e. from lowest to highest.
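The ascending order can be seen on a small symmetric matrix. This sketch uses np.linalg.eigh rather than scipy.linalg.eigh so that it only needs NumPy. The 3 x 3 matrix is a made-up stand-in for the 784 x 784 covariance matrix.

```python
import numpy as np

# Small symmetric matrix standing in for the covariance matrix (hypothetical values)
S = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])

values, vectors = np.linalg.eigh(S)        # eigenvalues come back in ascending order

# Take the last two (the largest) and reverse them so the largest comes first
top2_values = values[-2:][::-1]
top2_vectors = vectors[:, -2:][:, ::-1]    # matching eigenvectors, one per column

print(values)
print(top2_values)
```

Reversing the slice is what lets the first column be treated as the first principal direction.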
from scipy.linalg import eigh
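# eigh returns eigenvalues in ascending order, so indices 782 and 783 select the two largest.
# Recent SciPy releases use subset_by_index; older releases used eigvals=(782, 783) instead.
values, vectors = eigh(cov_matrix, subset_by_index=[782, 783])
print(vectors.shape)  # (784, 2): one eigenvector per column, smaller eigenvalue first
# reverse the columns so the largest eigenvalue comes first, then transpose (784, 2) to (2, 784)
vectors = vectors[:, ::-1].T
print("Updated shape of eigen vectors = ", vectors.shape)
(784, 2)
Updated shape of eigen vectors = (2, 784)
# project the standardized data onto the plane formed by the two
# principal eigenvectors, using matrix multiplication
new_coordinates = np.matmul(vectors, sample_data.T)
print("new_coordinates:", vectors.shape, " X ", sample_data.T.shape, " = ", new_coordinates.shape)
new_coordinates: (2, 784) X (784, 60000) = (2, 60000)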
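The projection step can be sanity-checked on toy data. The sketch below assumes centered random data (not MNIST) and the 1/n covariance. It checks that the variance of the data along each projected axis equals the corresponding eigenvalue, which is why the top eigenvectors are the directions worth keeping.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy data with very different spreads per feature (hypothetical values)
X = rng.normal(size=(500, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)                     # center each column

cov = X.T @ X / X.shape[0]                 # covariance with the 1/n factor
values, vectors = np.linalg.eigh(cov)      # ascending order

top2_values = values[::-1][:2]             # two largest eigenvalues, largest first
W = vectors[:, ::-1][:, :2]                # (4, 2) matching eigenvectors

projected = X @ W                          # (500, 2); same numbers as (W.T @ X.T).T

print(projected.shape)
print(projected.var(axis=0), top2_values)
```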
# append the labels to the 2-D projected data
# vstack = vertical stack
new_coordinates = np.vstack((new_coordinates, labels)).T
# create a new data frame for plotting the labeled points
dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
print(dataframe.head())
# plot the 2-D data points with seaborn
import seaborn as sn
# height replaces the older size argument of FacetGrid
sn.FacetGrid(dataframe, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()
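The vstack-then-transpose step used to attach the labels before plotting is easy to misread, so here is a small sketch of what it does to the shapes. The three points and their labels are made up for illustration.

```python
import numpy as np

# Projected coordinates in (2, n) layout, as produced by the projection step (toy values)
coords = np.array([[0.1, 0.2, 0.3],
                   [1.0, 2.0, 3.0]])
toy_labels = np.array([7, 0, 9])           # one label per point, shape (n,)

stacked = np.vstack((coords, toy_labels))  # labels become a third row: (3, n)
table = stacked.T                          # one row per point: (n, 3)

print(stacked.shape, table.shape)
print(table[0])                            # first point: two coordinates, then its label
```

After the transpose, each row holds the two coordinates and the label of one data point, which is the layout the DataFrame constructor expects.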
Observations : We have drawn a scatter plot with 1st_principal on the x-axis and 2nd_principal on the y-axis. PCA does not give a very clean visualization here, but it does manage to separate class labels [9, 7, 0, 8], while the other labels are not well separated.
Method 2 : PCA using Scikit-Learn
Step 1 : Import decomposition from sklearn and initialize PCA
# initializing the pca
from sklearn import decomposition
pca = decomposition.PCA()
Step 2 : Select number of pca components you need for Visualization
pca.n_components = 2  # select 2 components for 2-D visualization
pca_data = pca.fit_transform(sample_data)
# pca_data will contain the 2-D projections of sample_data
print("shape of pca_data = ", pca_data.shape)
shape of pca_data = (60000, 2)
Step 3 : Visualization
# attach the labels to the 2-D projection
pca_data = np.vstack((pca_data.T, labels)).T
pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal", "label"))
# height replaces the older size argument of FacetGrid
sn.FacetGrid(pca_df, hue="label", height=6).map(plt.scatter, '1st_principal', '2nd_principal').add_legend()
plt.show()
Observation: This gives essentially the same result as the manual computation above. The plot may look rotated or mirrored, because scikit-learn orders the components by decreasing variance and the sign of each component is arbitrary. As discussed, PCA does its best and separates some class labels, but not all.
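The claim that two routes agree up to orientation can be checked on toy data. In this sketch an SVD of the centered data stands in for a library implementation; it does not call scikit-learn, and the data is random, not MNIST.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data with clearly different spreads per feature (hypothetical values)
X = rng.normal(size=(300, 6)) * np.array([6.0, 3.0, 1.5, 1.0, 0.5, 0.2])
X = X - X.mean(axis=0)                     # center each column

# Route 1: eigen decomposition of X^T X, as in Method 1
_, vectors = np.linalg.eigh(X.T @ X)       # ascending order
proj_eig = X @ vectors[:, ::-1][:, :2]     # project on the two largest, largest first

# Route 2: SVD of X; the rows of Vt are the principal directions, largest first
_, _, Vt = np.linalg.svd(X, full_matrices=False)
proj_svd = X @ Vt[:2].T

# Each axis may be flipped in sign, so compare absolute values
same_up_to_sign = np.allclose(np.abs(proj_eig), np.abs(proj_svd))
print(same_up_to_sign)
```

A sign flip on one axis mirrors the scatter plot, and swapping the two axes reflects it across the diagonal; neither changes how well the classes separate.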