Data Wrangling with Pandas - Kyphosis
Business Tasks.
Perform basic EDA on the Data
In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Practical Data Wrangling with Pandas/00-kyphosis.csv')
In [2]:
df.shape
Out[2]:
(81, 4)
In [3]:
df.info
Out[3]:
<bound method DataFrame.info of Kyphosis Age Number Start 0 absent 71 3 5 1 absent 158 3 14 2 present 128 4 5 3 absent 2 5 1 4 absent 1 4 15 .. ... ... ... ... 76 present 157 3 13 77 absent 26 7 13 78 absent 120 2 13 79 present 42 7 6 80 absent 36 4 13 [81 rows x 4 columns]>
In [4]:
df.isnull().sum().sum()
Out[4]:
0
In [5]:
round(df.describe(),2)
Out[5]:
Age | Number | Start | |
---|---|---|---|
count | 81.00 | 81.00 | 81.00 |
mean | 83.65 | 4.05 | 11.49 |
std | 58.10 | 1.62 | 4.88 |
min | 1.00 | 2.00 | 1.00 |
25% | 26.00 | 3.00 | 9.00 |
50% | 87.00 | 4.00 | 13.00 |
75% | 130.00 | 5.00 | 16.00 |
max | 206.00 | 10.00 | 18.00 |
List the Average, Minimum and Maximum age (in years) considered in this study
In [6]:
#The "Age" is in months, so we need to devide the results by 12
mmm = round(df['Age'].describe().loc[['mean','max','min']],2)
mmm_years = round(mmm / 12, 2)
mmm_years
Out[6]:
mean 6.97 max 17.17 min 0.08 Name: Age, dtype: float64
Plot the Correlation Matrix
In [7]:
import matplotlib.pyplot as plt
import seaborn as sns
In [8]:
df2 = df.select_dtypes(include=['number'])
sns.heatmap(df2.corr(),annot=True)
plt.show()
Convert the age column datatype from int64 to float64
In [9]:
df['Age']=df['Age'].astype("float64")
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 81 entries, 0 to 80 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Kyphosis 81 non-null object 1 Age 81 non-null float64 2 Number 81 non-null int64 3 Start 81 non-null int64 dtypes: float64(1), int64(2), object(1) memory usage: 2.7+ KB
Define a function that converts age from months to years
In [10]:
def months_to_years(age) :
return age/12
Apply the function to the "Age" column and add the results into a new column entitled "Age in years"
In [11]:
df['Age in Years']=round(df['Age'].apply(months_to_years),2)
df
Out[11]:
Kyphosis | Age | Number | Start | Age in Years | |
---|---|---|---|---|---|
0 | absent | 71.0 | 3 | 5 | 5.92 |
1 | absent | 158.0 | 3 | 14 | 13.17 |
2 | present | 128.0 | 4 | 5 | 10.67 |
3 | absent | 2.0 | 5 | 1 | 0.17 |
4 | absent | 1.0 | 4 | 15 | 0.08 |
... | ... | ... | ... | ... | ... |
76 | present | 157.0 | 3 | 13 | 13.08 |
77 | absent | 26.0 | 7 | 13 | 2.17 |
78 | absent | 120.0 | 2 | 13 | 10.00 |
79 | present | 42.0 | 7 | 6 | 3.50 |
80 | absent | 36.0 | 4 | 13 | 3.00 |
81 rows × 5 columns
List the features of the oldest and youngest child in this study
In [12]:
df[df['Age']==df['Age'].max()]
Out[12]:
Kyphosis | Age | Number | Start | Age in Years | |
---|---|---|---|---|---|
73 | absent | 206.0 | 4 | 10 | 17.17 |
In [13]:
df[df['Age']==df['Age'].min()]
Out[13]:
Kyphosis | Age | Number | Start | Age in Years | |
---|---|---|---|---|---|
4 | absent | 1.0 | 4 | 15 | 0.08 |
5 | absent | 1.0 | 2 | 16 | 0.08 |
13 | absent | 1.0 | 4 | 12 | 0.08 |
15 | absent | 1.0 | 3 | 16 | 0.08 |
36 | absent | 1.0 | 3 | 9 | 0.08 |
Scale the raw Age column (in months) using both Standardization and Normalization. Perform a Sanity Check
In [14]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
In [15]:
scaler = MinMaxScaler()
df['Age']=scaler.fit_transform(df['Age'].values.reshape(-1,1))
In [16]:
df.describe().round(2)
Out[16]:
Age | Number | Start | Age in Years | |
---|---|---|---|---|
count | 81.00 | 81.00 | 81.00 | 81.00 |
mean | 0.40 | 4.05 | 11.49 | 6.97 |
std | 0.28 | 1.62 | 4.88 | 4.84 |
min | 0.00 | 2.00 | 1.00 | 0.08 |
25% | 0.12 | 3.00 | 9.00 | 2.17 |
50% | 0.42 | 4.00 | 13.00 | 7.25 |
75% | 0.63 | 5.00 | 16.00 | 10.83 |
max | 1.00 | 10.00 | 18.00 | 17.17 |
In [17]:
scaler = StandardScaler()
df['Age']=scaler.fit_transform(df['Age'].values.reshape(-1,1))
In [18]:
#df.describe().round(2)
round(df.describe(),2)
Out[18]:
Age | Number | Start | Age in Years | |
---|---|---|---|---|
count | 81.00 | 81.00 | 81.00 | 81.00 |
mean | 0.00 | 4.05 | 11.49 | 6.97 |
std | 1.01 | 1.62 | 4.88 | 4.84 |
min | -1.43 | 2.00 | 1.00 | 0.08 |
25% | -1.00 | 3.00 | 9.00 | 2.17 |
50% | 0.06 | 4.00 | 13.00 | 7.25 |
75% | 0.80 | 5.00 | 16.00 | 10.83 |
max | 2.12 | 10.00 | 18.00 | 17.17 |