Exploratory Data Analysis on Haberman Cancer Survival Dataset
Introduction:
Haberman’s dataset contains data from a study conducted at the University of Chicago’s Billings Hospital between 1958 and 1970 on patients who had undergone surgery for breast cancer.
Objective:
Predict the survival status of patients who underwent surgery.
Survival status [1] = the patient survived 5 years or longer
Survival status [2] = the patient died within 5 years
Import Libraries:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
Load Dataset:
https://www.kaggle.com/gilsousa/habermans-survival-data-set
haberman = pd.read_csv("haberman.csv")
Attribute Information:
Age: the patient’s age at the time of surgery (30 to 83)
Operation_Age: the year in which the surgery was performed (1958–1969)
Auxilary_Nodes: the number of positive axillary lymph nodes detected (the column name keeps the dataset’s misspelling of “axillary”)
Survival_Status: [1] = the patient survived 5 years or longer [2] = the patient died within 5 years
# (Q1) how many data-points and features?
print (haberman.shape)
(306, 4), i.e. 306 data points and 4 columns (three features plus the class label)
#(Q2) What are the column names in our dataset?
print (haberman.columns)
Index(['Age', 'Operation_Age', 'Auxilary_Nodes', 'Survival_Status'],
      dtype='object')
#(Q3) How many data points for each class 1 and 2 are present?
haberman["Survival_Status"].value_counts()
1    225
2     81
Name: Survival_Status, dtype: int64
Haberman is an imbalanced dataset: the two classes are far from equal in size (225 vs. 81, roughly a 73/27 split).
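The imbalance can be quantified with `value_counts(normalize=True)`. The frame below is a hypothetical stand-in built with the same class counts, so the sketch runs without the CSV; on the real data, call the same two lines on `haberman`.

```python
import pandas as pd

# Hypothetical mini-frame with the same class skew as Haberman (225 vs 81)
df = pd.DataFrame({"Survival_Status": [1] * 225 + [2] * 81})

counts = df["Survival_Status"].value_counts()
proportions = df["Survival_Status"].value_counts(normalize=True)
print(counts)                # 225 in class 1, 81 in class 2
print(proportions.round(3))  # roughly 0.735 vs 0.265
```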
2-D Scatter Plot:
sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
.map(plt.scatter, "Auxilary_Nodes", "Age") \
.add_legend();
plt.show();
Observations
The scatter plot uses Auxilary_Nodes on the x-axis and Age on the y-axis, colored by Survival_Status: blue points are class 1 (the patient survived 5 years or longer) and orange points are class 2 (the patient died within 5 years). The blue and orange points are not well separated; they overlap considerably.
Q] Can we draw multiple 2-D scatter plots for each combination of features?
How many combinations exist? 3C2 = 3.
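The count 3C2 = 3 can be checked with `itertools.combinations`, which also enumerates the exact feature pairs a pair-plot will draw:

```python
from itertools import combinations

# The three feature columns (the class label is excluded)
features = ["Age", "Operation_Age", "Auxilary_Nodes"]

pairs = list(combinations(features, 2))
print(len(pairs))  # 3
print(pairs)       # each tuple is one scatter-plot combination
```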
Pair-plot:
sns.pairplot(haberman, hue="Survival_Status", height=3);
plt.show()
Observations
We have 3C2 = 3 feature combinations: (Age, Operation_Age), (Age, Auxilary_Nodes) and (Operation_Age, Auxilary_Nodes). The plots below the diagonal mirror those above it (the same pairs with axes swapped), so only one triangle of three scatter plots needs to be read.
(Age, Operation_Age): most points overlap and are not separated, so we cannot distinguish the class labels.
(Age, Auxilary_Nodes): highly overlapping, especially between 0 and 20 axillary nodes and ages 30 to 70; in that range patients may or may not survive, so again we cannot distinguish the class labels.
(Operation_Age, Auxilary_Nodes): less overlap than the other two combinations, but still not enough to conclude the survival status.
Q] What about 1-D scatter plot using just one feature?
Histogram, PDF, CDF:
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
.map(sns.distplot, "Operation_Age") \
.add_legend();
plt.show();
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
.map(sns.distplot, "Auxilary_Nodes") \
.add_legend();
plt.show();
Observation for Auxilary_Nodes:
if axillary_nodes <= 0:
    patient = long survival
elif axillary_nodes <= 3.5:   # approx. threshold read off the plot
    patient = long survival, chances are high
else:                         # axillary_nodes > 3.5
    patient = short survival
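The rule of thumb above can be sketched as a small helper. This is a hypothetical, unvalidated heuristic read off the density plot, not a trained classifier; the function name and threshold are assumptions for illustration.

```python
def survival_guess(axillary_nodes):
    """Rough heuristic from the density plot; NOT a validated classifier."""
    if axillary_nodes <= 0:
        return "long survival likely"
    elif axillary_nodes <= 3.5:  # approximate threshold read off the plot
        return "long survival chances are high"
    else:
        return "short survival more likely"

print(survival_guess(0))   # long survival likely
print(survival_guess(2))   # long survival chances are high
print(survival_guess(10))  # short survival more likely
```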
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
.map(sns.distplot, "Age") \
.add_legend();
plt.show();
Observation: Similarly, we cannot predict anything from the Age histograms: the density is nearly the same for both classes across the range, and the PDFs of the two classes overlap each other heavily.
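This section’s heading also promises a CDF, which can be computed from the histogram with `np.histogram` and `np.cumsum` and then plotted with `plt.plot`. The array below is synthetic stand-in data so the sketch runs without the CSV; on the real data, substitute a column such as `survival_status_1["Age"]`.

```python
import numpy as np

# Synthetic ages standing in for one survival class (assumption, for illustration)
ages = np.array([30, 34, 38, 42, 45, 47, 50, 52, 55, 58, 61, 65, 70, 77, 83])

counts, bin_edges = np.histogram(ages, bins=10, density=True)
pdf = counts / counts.sum()  # normalize so the bar heights sum to 1
cdf = np.cumsum(pdf)         # running total of the PDF gives the CDF

print(pdf.round(3))
print(cdf.round(3))          # monotonically increasing, ends at 1.0
# plt.plot(bin_edges[1:], pdf); plt.plot(bin_edges[1:], cdf) to visualize
```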
Mean, Variance and Std-dev:
import numpy as np
survival_status_1 = haberman.loc[haberman["Survival_Status"] == 1];
survival_status_2 = haberman.loc[haberman["Survival_Status"] == 2];
Age
print("Means:")
print(np.mean(survival_status_1["Age"]))
print(np.mean(survival_status_2["Age"]))
print("\nStd-dev:");
print(np.std(survival_status_1["Age"]))
print(np.std(survival_status_2["Age"]))
Means:
52.01777777777778
53.67901234567901
Std-dev:
10.98765547510051
10.10418219303131
Operation_Age
print("Means:")
print(np.mean(survival_status_1["Operation_Age"]))
print(np.mean(survival_status_2["Operation_Age"]))
print("\nStd-dev:");
print(np.std(survival_status_1["Operation_Age"]))
print(np.std(survival_status_2["Operation_Age"]))
Means:
62.86222222222222
62.82716049382716
Std-dev:
3.2157452144021956
3.3214236255207883
Auxilary_Nodes
print("Means:")
print(np.mean(survival_status_1["Auxilary_Nodes"]))
print(np.mean(survival_status_2["Auxilary_Nodes"]))
print("\nStd-dev:");
print(np.std(survival_status_1["Auxilary_Nodes"]))
print(np.std(survival_status_2["Auxilary_Nodes"]))
Means:
2.7911111111111113
7.45679012345679
Std-dev:
5.857258449412131
9.128776076761632
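The per-class means and standard deviations above can also be obtained in a single call with `groupby(...).agg(...)`. The tiny frame below is hypothetical stand-in data so the sketch runs without the CSV; on the real data, group `haberman` the same way.

```python
import pandas as pd

# Hypothetical mini-frame mirroring the real column names (assumption)
df = pd.DataFrame({
    "Survival_Status": [1, 1, 1, 2, 2],
    "Auxilary_Nodes":  [0, 1, 2, 5, 25],
})

# One row of summary statistics per class, one column per statistic
summary = df.groupby("Survival_Status")["Auxilary_Nodes"].agg(["mean", "std", "median"])
print(summary)
```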
Median, Percentile, Quantile, IQR, MAD
For Operation_Age
print("\nMedian:")
print(np.median(survival_status_1["Operation_Age"]))
print(np.median(survival_status_2["Operation_Age"]))

print("\nQuantiles:")
print(np.percentile(survival_status_1["Operation_Age"], np.arange(0, 100, 25)))
print(np.percentile(survival_status_2["Operation_Age"], np.arange(0, 100, 25)))

print("\n90th Percentiles:")
print(np.percentile(survival_status_1["Operation_Age"], 90))
print(np.percentile(survival_status_2["Operation_Age"], 90))

from statsmodels import robust
print("\nMedian Absolute Deviation")
print(robust.mad(survival_status_1["Operation_Age"]))
print(robust.mad(survival_status_2["Operation_Age"]))
Median:
63.0
63.0
Quantiles:
[58. 60. 63. 66.]
[58. 59. 63. 65.]
90th Percentiles:
67.0
67.0
Median Absolute Deviation
4.447806655516806
4.447806655516806
For Auxilary_Nodes
print("\nMedian:")
print(np.median(survival_status_1["Auxilary_Nodes"]))
print(np.median(survival_status_2["Auxilary_Nodes"]))

print("\nQuantiles:")
print(np.percentile(survival_status_1["Auxilary_Nodes"], np.arange(0, 100, 25)))
print(np.percentile(survival_status_2["Auxilary_Nodes"], np.arange(0, 100, 25)))

print("\n90th Percentiles:")
print(np.percentile(survival_status_1["Auxilary_Nodes"], 90))
print(np.percentile(survival_status_2["Auxilary_Nodes"], 90))

from statsmodels import robust
print("\nMedian Absolute Deviation")
print(robust.mad(survival_status_1["Auxilary_Nodes"]))
print(robust.mad(survival_status_2["Auxilary_Nodes"]))
Median:
0.0
4.0
Quantiles:
[0. 0. 0. 3.]
[ 0. 1. 4. 11.]
90th Percentiles:
8.0
20.0
Median Absolute Deviation
0.0
5.930408874022408
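The IQR named in the section heading is just the gap between the 25th and 75th percentiles. A minimal sketch, using hypothetical node counts so it runs without the CSV:

```python
import numpy as np

# Hypothetical positive-node counts for one class (assumption, for illustration)
nodes = np.array([0, 0, 0, 0, 1, 1, 2, 3, 4, 8, 13, 23])

q1, q3 = np.percentile(nodes, [25, 75])  # 25th and 75th percentiles
iqr = q3 - q1                            # interquartile range
print(q1, q3, iqr)  # 0.0 5.0 5.0
```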
For Age
print("\nMedian:")
print(np.median(survival_status_1["Age"]))
print(np.median(survival_status_2["Age"]))

print("\nQuantiles:")
print(np.percentile(survival_status_1["Age"], np.arange(0, 100, 25)))
print(np.percentile(survival_status_2["Age"], np.arange(0, 100, 25)))

print("\n90th Percentiles:")
print(np.percentile(survival_status_1["Age"], 90))
print(np.percentile(survival_status_2["Age"], 90))

from statsmodels import robust
print("\nMedian Absolute Deviation")
print(robust.mad(survival_status_1["Age"]))
print(robust.mad(survival_status_2["Age"]))
Median:
52.0
53.0
Quantiles:
[30. 43. 52. 60.]
[34. 46. 53. 61.]
90th Percentiles:
67.0
67.0
Median Absolute Deviation
13.343419966550417
11.860817748044816
Box plot and Whiskers:
Another way of visualizing the 1-D scatter plot more intuitively. The box spans the 25th to the 75th percentile, and the line inside it marks the median (the 50th percentile). By seaborn’s convention the whiskers extend to the smallest value above Q1 − 1.5·IQR and the largest value below Q3 + 1.5·IQR; any point outside the whiskers, on either the low or the high side, is treated as an outlier.
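The whisker limits and outliers described above can be computed by hand with NumPy. The values below are hypothetical stand-in data; on the real dataset, substitute a column such as `survival_status_1["Auxilary_Nodes"]`.

```python
import numpy as np

# Hypothetical node counts (assumption); the same arithmetic applies to any column
values = np.array([0, 0, 1, 1, 2, 3, 4, 5, 8, 30])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr  # lower whisker limit
upper = q3 + 1.5 * iqr  # upper whisker limit
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)  # points outside [lower, upper] are outliers
```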
For Age:
sns.boxplot(x='Survival_Status',y='Age', data=haberman)
plt.show()
For Operation_Age:
sns.boxplot(x='Survival_Status',y='Operation_Age', data=haberman)
plt.show()
For Auxilary_Nodes:
sns.boxplot(x='Survival_Status', y='Auxilary_Nodes', data=haberman)
plt.show()
Conclusion:
Haberman’s dataset alone cannot diagnose cancer, but this exploratory analysis shows which feature matters most for predicting the survival status: the number of positive axillary nodes. Patients with few or no positive nodes are far more likely to survive 5 years or longer, while Age and Operation_Age show heavy class overlap and carry little discriminating power on their own.