Exploratory Data Analysis on Haberman Cancer Survival Dataset

Nihar Jamdar
Mar 21, 2021
https://mahavirdabasmd.wixsite.com/blog/post/abc-of-eda-with-haberman-s-survival-dataset

Introduction:

Haberman’s data set contains cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on patients who had undergone surgery for breast cancer.

Objective:

Predict the survival status of patients who underwent surgery.
Survival status [1] = the patient survived 5 years or longer
Survival status [2] = the patient died within 5 years

Import Libraries:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Load Dataset:

https://www.kaggle.com/gilsousa/habermans-survival-data-set

haberman = pd.read_csv("haberman.csv")
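
The column names used throughout this post (Age, Operation_Age, Auxilary_Nodes, Survival_Status) may not match the header of the downloaded file. Below is a minimal sketch of assigning them at load time, assuming the CSV has no header row and holds four comma-separated columns in this order. The inline rows are illustrative stand-ins for the real file.

```python
import io
import pandas as pd

# Column names used in this post (spelling kept as in the rest of the code).
COLUMNS = ["Age", "Operation_Age", "Auxilary_Nodes", "Survival_Status"]

# Illustrative stand-in for haberman.csv; assumed layout: no header row.
sample_csv = io.StringIO("30,64,1,1\n30,62,3,1\n34,59,0,2\n")

# header=None tells pandas the first line is data; names= supplies the labels.
sample = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(sample.head())
```

With the real file, pass "haberman.csv" instead of sample_csv. If the file already has a header row, header=0 together with names= overwrites the existing labels.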

Attribute Information:

Age: Age of the patient at the time of surgery (30 to 83)
Operation_Age: Year in which the patient underwent surgery (1958–1969)
Auxilary_Nodes: Number of positive axillary nodes detected
Survival_Status: [1] = the patient survived 5 years or longer, [2] = the patient died within 5 years

# (Q1) How many data points and features are there?

print(haberman.shape)

(306, 4), i.e. 306 data points and 4 columns (3 features plus the class label)

# (Q2) What are the column names in our dataset?

print(haberman.columns)

Index(['Age', 'Operation_Age', 'Auxilary_Nodes', 'Survival_Status'], dtype='object')

# (Q3) How many data points are present for each class, 1 and 2?

haberman["Survival_Status"].value_counts()

1    225
2     81
Name: Survival_Status, dtype: int64

Haberman is an imbalanced dataset: the two classes are far from equal in size, with 225 data points in class 1 and only 81 in class 2.
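
To put a number on the imbalance, the class counts can be turned into proportions. This is a small sketch that rebuilds the class column from the counts reported above rather than reading the file; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the class column from the counts reported above (225 and 81).
status = pd.Series([1] * 225 + [2] * 81, name="Survival_Status")

# normalize=True returns fractions instead of raw counts.
proportions = status.value_counts(normalize=True)
print(proportions.round(3))
```

A model that always predicts class 1 would already be right for about 73.5% of the patients, so plain accuracy is a weak yardstick on this data.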

2-D Scatter Plot:

sns.set_style("whitegrid")
# `size` was renamed to `height` in seaborn 0.9
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
    .map(plt.scatter, "Auxilary_Nodes", "Age") \
    .add_legend()
plt.show()
2D scatter plot

Observations

The plot uses Auxilary_Nodes (positive axillary lymph nodes) on the x-axis and Age on the y-axis, colored by the output, Survival_Status. Blue points represent class 1 (patient survived 5 years or longer) and orange points represent class 2 (patient died within 5 years). The blue and orange points overlap considerably, so these two features do not separate the classes well.

Q] Can we draw multiple 2-D scatter plots for each combination of features?
How many combinations exist? 3C2 = 3.
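
The count 3C2 = 3 can be checked by listing the feature pairs directly. This is a short sketch using the standard library; the feature list simply repeats the column names used in this post.

```python
from itertools import combinations

features = ["Age", "Operation_Age", "Auxilary_Nodes"]

# Every unordered pair of features corresponds to one distinct scatter plot.
pairs = list(combinations(features, 2))
for x, y in pairs:
    print(x, "vs", y)
print(len(pairs))
```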

Pair-plot:

# hue must name an existing column; the class label column is Survival_Status
sns.pairplot(haberman, hue="Survival_Status", height=3)
plt.show()
Pair Plot

Observations

With 3 features there are 3C2 = 3 combinations: (Age, Operation_Age), (Age, Auxilary_Nodes) and (Operation_Age, Auxilary_Nodes). We only need to consider the 3 scatter plots on the upper right, because the 3 plots on the lower left show the same pairs with the axes swapped.

(Age, Operation_Age): Most points overlap, so the two classes cannot be distinguished.

(Age, Auxilary_Nodes): The overlap is heavy, especially between 0 and 20 auxiliary nodes and between ages 30 and 70. In this range a patient may or may not survive, so again we cannot distinguish the classes.

(Operation_Age, Auxilary_Nodes): There is less overlap than in the other two combinations, but we still cannot draw a conclusion about Survival_Status.

Q] What about 1-D scatter plot using just one feature?

Histogram, PDF, CDF:

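# distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable histogram plus density curve
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
    .map(sns.histplot, "Operation_Age", kde=True, stat="density") \
    .add_legend()
plt.show()
PDF for Operation_Age
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
    .map(sns.histplot, "Auxilary_Nodes", kde=True, stat="density") \
    .add_legend()
plt.show()
PDF for Auxilary_Nodes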

Observation for Auxilary_Nodes:

if (Auxilary_Nodes ≤ 0)

Patient = long survival is most likely

else if (Auxilary_Nodes > 0 && Auxilary_Nodes ≤ 3.5 (approx.))

Patient = chances of long survival are high

else if (Auxilary_Nodes > 3.5)

Patient = short survival is more likely
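
The rule of thumb above can be written as a small function. This is only a sketch of the visual reading of the PDF, not a validated classifier: the function name guess_survival is hypothetical, and the threshold of 3.5 is the approximate value read off the plot.

```python
def guess_survival(auxilary_nodes, threshold=3.5):
    """Heuristic guess from the node count alone (threshold read off the plot)."""
    if auxilary_nodes <= 0:
        return "long survival"
    if auxilary_nodes <= threshold:
        return "long survival chances are high"
    return "short survival"

for n in (0, 2, 10):
    print(n, "->", guess_survival(n))
```

Because the two PDFs overlap, any such rule will misclassify some patients on both sides of the threshold.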

# histplot with kde=True replaces the deprecated distplot
sns.FacetGrid(haberman, hue="Survival_Status", height=5) \
    .map(sns.histplot, "Age", kde=True, stat="density") \
    .add_legend()
plt.show()
PDF for Age

Observation: Here, too, we cannot predict anything from the histograms, as the density of the two classes is nearly equal at every age. The PDFs of the two classes overlap almost entirely.
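
The heading of this section mentions CDFs, but the plots above only show histograms and PDFs. Below is a minimal sketch of how a binned PDF and its CDF could be computed with NumPy. The function name binned_pdf_cdf and the sample values are illustrative and not taken from the dataset.

```python
import numpy as np

def binned_pdf_cdf(values, bins=10):
    """Return (pdf, cdf, bin_edges) for a 1-D sample.

    pdf[i] is the fraction of values falling in bin i; cdf[i] is the
    fraction of values at or below the right edge of bin i.
    """
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()
    cdf = np.cumsum(pdf)
    return pdf, cdf, bin_edges

# Illustrative node counts, not taken from the dataset.
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 20])
pdf, cdf, edges = binned_pdf_cdf(nodes, bins=4)
print(pdf)
print(cdf)
```

Calling this once per class, for example on survival_status_1["Auxilary_Nodes"], and plotting cdf against edges[1:] with plt.plot would show what fraction of each class lies below a given node count.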

Mean, Variance and Std-dev:

import numpy as np
survival_status_1 = haberman.loc[haberman["Survival_Status"] == 1]
survival_status_2 = haberman.loc[haberman["Survival_Status"] == 2]

Age

print("Means:")
print(np.mean(survival_status_1["Age"]))
print(np.mean(survival_status_2["Age"]))
print("\nStd-dev:")
print(np.std(survival_status_1["Age"]))
print(np.std(survival_status_2["Age"]))

Means:
52.01777777777778
53.67901234567901

Std-dev:
10.98765547510051
10.10418219303131

Operation_Age

print("Means:")
print(np.mean(survival_status_1["Operation_Age"]))
print(np.mean(survival_status_2["Operation_Age"]))
print("\nStd-dev:")
print(np.std(survival_status_1["Operation_Age"]))
print(np.std(survival_status_2["Operation_Age"]))

Means:
62.86222222222222
62.82716049382716

Std-dev:
3.2157452144021956
3.3214236255207883

Auxilary_Nodes

print("Means:")
print(np.mean(survival_status_1["Auxilary_Nodes"]))
print(np.mean(survival_status_2["Auxilary_Nodes"]))
print("\nStd-dev:")
print(np.std(survival_status_1["Auxilary_Nodes"]))
print(np.std(survival_status_2["Auxilary_Nodes"]))

Means:
2.7911111111111113
7.45679012345679

Std-dev:
5.857258449412131
9.128776076761632
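
The three blocks above repeat the same pattern per feature. As an alternative sketch, a pandas groupby can produce the per-class means and standard deviations for all features at once. The toy frame below uses made-up values and only borrows the column names. Note that pandas defaults to the sample standard deviation (ddof=1), while np.std computes the population value (ddof=0), so ddof=0 is passed to stay comparable with the figures above.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with the same column names; values are made up.
toy = pd.DataFrame({
    "Age": [30, 40, 50, 60],
    "Auxilary_Nodes": [0, 2, 4, 10],
    "Survival_Status": [1, 1, 2, 2],
})

grouped = toy.groupby("Survival_Status")
means = grouped.mean()
# pandas defaults to the sample std (ddof=1); ddof=0 matches np.std.
stds = grouped.std(ddof=0)
print(means)
print(stds)
```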

Median, Percentile, Quantile, IQR, MAD

For Operation_Age

print("\nMedian:")
print(np.median(survival_status_1["Operation_Age"]))
print(np.median(survival_status_2["Operation_Age"]))
print("\nQuantiles:")
print(np.percentile(survival_status_1["Operation_Age"],np.arange(0, 100, 25)))
print(np.percentile(survival_status_2["Operation_Age"],np.arange(0, 100, 25)))
print("\n90th Percentiles:")
print(np.percentile(survival_status_1["Operation_Age"],90))
print(np.percentile(survival_status_2["Operation_Age"],90))
from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(survival_status_1["Operation_Age"]))
print(robust.mad(survival_status_2["Operation_Age"]))

Median:
63.0
63.0

Quantiles:
[58. 60. 63. 66.]
[58. 59. 63. 65.]

90th Percentiles:
67.0
67.0

Median Absolute Deviation
4.447806655516806
4.447806655516806
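
The value 4.4478 is not the raw median absolute deviation. By default, statsmodels' robust.mad divides the raw MAD by about 0.6745 (the 0.75 quantile of the standard normal distribution), so that the result estimates the standard deviation for normally distributed data. Here the raw MAD is 3 years, and 3 / 0.6745 ≈ 4.4478. The sketch below reproduces this with NumPy; the function name scaled_mad and the sample values are illustrative.

```python
import numpy as np

# 0.75 quantile of the standard normal; statsmodels' default scaling constant.
NORMAL_Q75 = 0.6744897501960817

def scaled_mad(values, c=NORMAL_Q75):
    """Median absolute deviation around the median, divided by c."""
    values = np.asarray(values, dtype=float)
    raw_mad = np.median(np.abs(values - np.median(values)))
    return raw_mad / c

# Illustrative values whose raw MAD is 3, not taken from the dataset.
years = [58, 60, 63, 66, 69]
print(scaled_mad(years, c=1.0))  # raw MAD
print(scaled_mad(years))         # scaled MAD
```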

For Auxilary_Nodes

print("\nMedian:")
print(np.median(survival_status_1["Auxilary_Nodes"]))
print(np.median(survival_status_2["Auxilary_Nodes"]))
print("\nQuantiles:")
print(np.percentile(survival_status_1["Auxilary_Nodes"],np.arange(0, 100, 25)))
print(np.percentile(survival_status_2["Auxilary_Nodes"],np.arange(0, 100, 25)))
print("\n90th Percentiles:")
print(np.percentile(survival_status_1["Auxilary_Nodes"],90))
print(np.percentile(survival_status_2["Auxilary_Nodes"],90))
from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(survival_status_1["Auxilary_Nodes"]))
print(robust.mad(survival_status_2["Auxilary_Nodes"]))

Median:
0.0
4.0

Quantiles:
[0. 0. 0. 3.]
[ 0. 1. 4. 11.]

90th Percentiles:
8.0
20.0

Median Absolute Deviation
0.0
5.930408874022408

For Age

print("\nMedian:")
print(np.median(survival_status_1["Age"]))
print(np.median(survival_status_2["Age"]))
print("\nQuantiles:")
print(np.percentile(survival_status_1["Age"],np.arange(0, 100, 25)))
print(np.percentile(survival_status_2["Age"],np.arange(0, 100, 25)))
print("\n90th Percentiles:")
print(np.percentile(survival_status_1["Age"],90))
print(np.percentile(survival_status_2["Age"],90))
from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(survival_status_1["Age"]))
print(robust.mad(survival_status_2["Age"]))

Median:
52.0
53.0

Quantiles:
[30. 43. 52. 60.]
[34. 46. 53. 61.]

90th Percentiles:
67.0
67.0

Median Absolute Deviation
13.343419966550417
11.860817748044816

Box plot and Whiskers:

https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

A box plot is another, more intuitive way of visualizing a 1-D scatter plot. The box spans the 25th to the 75th percentile, and the line inside it marks the median (the 50th percentile). In seaborn, the whiskers extend to the lowest data point at or above Q1 - 1.5*IQR and the highest data point at or below Q3 + 1.5*IQR, where IQR = Q3 - Q1. Any point outside the whiskers is treated as an outlier, on either the low or the high side.
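
The IQR and the whisker limits described above can also be computed by hand. This is a minimal sketch, assuming the usual 1.5*IQR rule; the function name whisker_summary and the sample values are illustrative and not taken from the dataset.

```python
import numpy as np

def whisker_summary(values):
    """Return IQR, the 1.5*IQR fences, whisker ends and outliers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside.min(), inside.max()),
        "outliers": outliers,
    }

# Illustrative node counts, not taken from the dataset.
summary = whisker_summary([0, 0, 1, 2, 3, 4, 5, 30])
print(summary)
```

The same function could be applied to each class separately, for example to survival_status_2["Auxilary_Nodes"], to list the points that the box plot draws as outliers.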

For Age:

sns.boxplot(x='Survival_Status',y='Age', data=haberman)
plt.show()
Box Plot and Whiskers for Age

For Operation_Age:

sns.boxplot(x='Survival_Status',y='Operation_Age', data=haberman)
plt.show()
Box Plot and Whiskers for Operation_Age

For Auxilary_Nodes:

sns.boxplot(x='Survival_Status', y='Auxilary_Nodes', data=haberman)
plt.show()
Box Plot and Whiskers for Auxilary_Nodes

Conclusion:

Yes, Haberman’s data set can be used to analyze the survival status of breast cancer patients by applying various data analysis techniques with Python libraries.
