# 2-D Visualization using Principal Component Analysis (PCA) on MNIST dataset

**We will do this using two methods:**

**Method 1**: We perform it manually by computing the eigenvalues and eigenvectors explicitly.

**Method 2**: PCA using sklearn.

# Method 1 : Compute Manually

**Step 1: Import libraries, load dataset:**

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("mnist_train.csv")
    print("Shape of df:", df.shape)
    # i.e. we have 60K data points, each with 784 pixel features (28 * 28) plus one label column

Shape of df: (60000, 785)

**Separate Dependent and Independent features**

    # dependent variable: class labels from 0 to 9
    labels = df["label"]

    # independent features: the 784 pixel values
    data = df.drop("label", axis="columns")

    print("labels.shape:", labels.shape)
    print("data.shape:", data.shape)

labels.shape: (60000,)

data.shape: (60000, 784)

**Step 2 : Data Preprocessing using sklearn.preprocessing**

**Once we have our data, we apply column standardization using sklearn.preprocessing.**

Q] What is column standardization?

>> Say feature fj takes the values [x1 to xn] across the n data points. We transform them to [x1' to xn'] such that **the mean of the xi' = 0 and the std-dev of the xi' = 1**.

Q2] How do we compute column standardization manually?

>> **[xi' = (xi - x_mean) / x_std]**, where x_mean and x_std are the mean and standard deviation of feature fj.
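As a minimal sketch of the formula above, the snippet below standardizes the columns of a small made-up matrix with NumPy. The array `X` and the helper name `standardize_columns` are illustrative assumptions, not part of the MNIST code. A constant column is left at zero to avoid dividing by zero.

```python
import numpy as np

def standardize_columns(X):
    """Return a copy of X where every non-constant column has mean 0 and std-dev 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)          # x_mean of each column
    std = X.std(axis=0)            # x_std of each column (population std-dev)
    std[std == 0] = 1.0            # constant column: avoid division by zero
    return (X - mean) / std

# toy data: 4 points, 3 features (the last feature is constant)
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 30.0, 5.0],
              [4.0, 40.0, 5.0]])
X_std = standardize_columns(X)
print(X_std.mean(axis=0))   # column means after standardization
print(X_std.std(axis=0))    # column std-devs after standardization
```

The first two columns end up with mean 0 and std-dev 1, while the constant third column becomes all zeros.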

    from sklearn.preprocessing import StandardScaler

    standardized_data = StandardScaler().fit_transform(data)
    print(standardized_data.shape)

(60000, 784)

**So, up to this point we have standardized our data so that each feature (column) has mean = 0 and std-dev = 1.** A column that is constant, such as a pixel that is always black, is simply left at 0.

## Step 3 : **Find Covariance Matrix**, i.e. **S = X^T * X** (the covariance up to a constant factor, which does not change the eigenvectors), where **S = [d x d]**, **X^T = [d x n]** and **X = [n x d]**

    sample_data = standardized_data

    # matrix multiplication using NumPy
    cov_matrix = np.matmul(sample_data.T, sample_data)
    print("The shape of covariance matrix = ", cov_matrix.shape)

The shape of covariance matrix = (784, 784)
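To make the relation between **X^T * X** and the usual sample covariance concrete, here is a small sketch on made-up random data (the array `X` and the seed are assumptions for illustration only). It checks that dividing **X^T * X** by n - 1 reproduces what `np.cov` computes on centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # made-up data: n = 200 points, d = 5 features
X = (X - X.mean(axis=0)) / X.std(axis=0)   # column standardization

n = X.shape[0]
S = X.T @ X                                # unnormalized matrix, as in Step 3
cov = np.cov(X, rowvar=False)              # NumPy's sample covariance (divides by n - 1)

print(S.shape)                             # (5, 5), i.e. d x d
print(np.allclose(S / (n - 1), cov))       # True
```

Since the two matrices differ only by a positive constant, they share the same eigenvectors, which is why the scaling can be skipped for visualization.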

## Step 4 : With the basic steps done, it is now time to compute the top 2 eigenvalues and eigenvectors, since we are transforming the data to 2-D.

**Note**: we use scipy.linalg to compute the eigenvalues and eigenvectors. Its eigh function always returns them in ascending order, i.e. from lowest to highest.

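    from scipy.linalg import eigh

    # eigh returns eigenvalues in ascending order, so indices 782 and 783 select the two largest
    # (older SciPy versions spelled this eigvals=(782, 783))
    values, vectors = eigh(cov_matrix, subset_by_index=[782, 783])
    print(vectors.shape)  # (784, 2): the top 2 eigenvectors as columns

    # transpose from (784, 2) to (2, 784) and reverse the rows,
    # so that the top 2 are in descending order (largest first)
    vectors = vectors.T[::-1]
    print("Updated shape of eigen vectors = ", vectors.shape)

(784, 2)

Updated shape of eigen vectors = (2, 784)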
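The ascending order is easy to trip over, so here is a tiny sketch on a matrix whose eigenvalues are known in advance. It uses `np.linalg.eigh` as a stand-in for `scipy.linalg.eigh` (both return eigenvalues in ascending order); the matrix `S` and the names `top2` and `top2_values` are made up for illustration.

```python
import numpy as np

# small symmetric matrix with known eigenvalues 1, 2 and 5
S = np.diag([5.0, 1.0, 2.0])

values, vectors = np.linalg.eigh(S)   # eigenvalues come back in ascending order
print(values)

# take the last two columns (largest eigenvalues) and reverse them,
# so that row 0 is the 1st principal direction and row 1 the 2nd
top2 = vectors[:, -2:].T[::-1]
top2_values = values[-2:][::-1]
print(top2_values)
```

Without the final reversal, row 0 would hold the direction of the second-largest eigenvalue, and any column named "1st_principal" built from it would be mislabeled.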

    # projecting the original data sample onto the plane formed by
    # the two principal eigenvectors, using matrix multiplication
    new_coordinates = np.matmul(vectors, sample_data.T)
    print("new_coordinates:", vectors.shape, " X ", sample_data.T.shape, " = ", new_coordinates.shape)

new_coordinates: (2, 784) X (784, 60000) = (2, 60000)
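The projection step can be sanity-checked on a small example. The sketch below uses made-up random data and `np.linalg.eigh` (the names `X`, `top2` and `spread` are assumptions for illustration): the sum of squared coordinates along each new axis should equal the corresponding eigenvalue of **X^T * X**.

```python
import numpy as np

rng = np.random.default_rng(1)
# made-up data: 300 points, 4 features with different spreads
X = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)                     # center the columns

values, vectors = np.linalg.eigh(X.T @ X)  # ascending eigenvalues
top2 = vectors[:, -2:].T[::-1]             # (2, d): 1st direction, then 2nd

new_coordinates = top2 @ X.T               # (2, d) X (d, n) = (2, n)
print(new_coordinates.shape)

# the sum of squares along each new axis equals the matching eigenvalue
spread = (new_coordinates ** 2).sum(axis=1)
print(np.allclose(spread, values[-2:][::-1]))
```

This is why the top eigenvectors are the ones to keep: they are the directions along which the projected data retains the most spread.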

    # appending the label to the 2d projected data (vstack = vertical stack)
    new_coordinates = np.vstack((new_coordinates, labels)).T

    # creating a new data frame for plotting the labeled points
    dataframe = pd.DataFrame(data=new_coordinates, columns=("1st_principal", "2nd_principal", "label"))
    print(dataframe.head())

    # plotting the 2d data points with seaborn
    # (the FacetGrid argument is height; older seaborn versions called it size)
    import seaborn as sn

    sn.FacetGrid(dataframe, hue="label", height=6).map(plt.scatter, "1st_principal", "2nd_principal").add_legend()
    plt.show()

**Observations**: We have plotted a scatter plot with 1st_principal on the x-axis and 2nd_principal on the y-axis. PCA does not give a very clean visualization here, but it does its best: class labels [9, 7, 0, 8] are separated to some extent, while the other labels are not well separated.

# Method 2 : PCA using Scikit-Learn

**Step 1 : from sklearn import decomposition**

    # initializing the pca
    from sklearn import decomposition

    pca = decomposition.PCA()

## Step 2 : Select number of pca components you need for Visualization

    # select 2 components for 2-D visualization
    pca.n_components = 2
    pca_data = pca.fit_transform(sample_data)

    # pca_data will contain the 2-d projections of sample_data
    print("shape of pca_data = ", pca_data.shape)

shape of pca_data = (60000, 2)

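## Step 3 : Visualization

    # attach the labels to the 2-d projection and plot it
    # (the FacetGrid argument is height; older seaborn versions called it size)
    pca_data = np.vstack((pca_data.T, labels)).T
    pca_df = pd.DataFrame(data=pca_data, columns=("1st_principal", "2nd_principal", "label"))
    sn.FacetGrid(pca_df, hue="label", height=6).map(plt.scatter, "1st_principal", "2nd_principal").add_legend()
    plt.show()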

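**Observation**: This gives essentially the same result as the manual computation above. The plot may look flipped or rotated, because the sign of each eigenvector is arbitrary and the order of the two axes depends on how the eigenvectors were sorted. As discussed, PCA does its best by separating some class labels, but not all.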
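To see why the two plots can differ only by flips, here is a hedged sketch on made-up data. scikit-learn's PCA is documented as being based on a singular value decomposition of the centered data, so the snippet compares the eigenvector route of Method 1 with a plain NumPy SVD as a stand-in for it; the data `X` and the names `proj_eig` and `proj_svd` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# made-up data: 100 points, 6 features with different spreads
X = rng.normal(size=(100, 6)) * np.array([4.0, 3.0, 2.0, 1.0, 1.0, 1.0])
X = X - X.mean(axis=0)                      # centered toy data

# route 1: eigenvectors of X^T X (as in Method 1)
_, vectors = np.linalg.eigh(X.T @ X)
proj_eig = X @ vectors[:, -2:][:, ::-1]     # (n, 2), largest direction first

# route 2: SVD of X (rows of Vt are the principal directions)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
proj_svd = X @ Vt[:2].T                     # (n, 2)

# the two projections agree up to the sign of each axis
print(np.allclose(np.abs(proj_eig), np.abs(proj_svd)))
```

A sign flip on one axis mirrors the scatter plot, and a flip on both axes looks like a 180 degree rotation, but the distances between points and the class separation are unchanged.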