Mastering Data Analysis with Pandas #3
Task 1. DataFrames with Functions
Import DataSet
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
employee_df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Mastering Data Analysis with Pandas Part 3/00-employee_information.csv')
employee_df
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
0 | Mike | Moe | 5000.00 | 3 | N94 3M0 | bird@gmail.com |
1 | Noah | Ryan | 10000.00 | 8 | N8S 14K | nsmall@hotmail.com |
2 | Nina | Keller | 9072.02 | 17 | S1T 4E6 | azikez@gahew.mr |
3 | Chanel | Steve | 11072.02 | 12 | N7T 3E6 | chanel@gmail.com |
4 | Kate | Noor | 5000.00 | 23 | K8N 5H6 | kate@hotmail.com |
5 | Samer | Mo | 100000.00 | 13 | J7H 3HY | samer@gmail.com |
6 | Heba | Steve | 50000.00 | 7 | K8Y 3M8 | heba.ismail@hotmail.com |
7 | Laila | Aly | 20000.00 | 5 | J8Y 3M0 | Laila.a@hotmail.com |
8 | Joseph | Patton | 2629.13 | 2 | M6U 5U7 | daafeja@boh.jm |
9 | Noah | Moran | 8626.96 | 11 | K2D 4M9 | guutodi@bigwoc.kw |
Define a Function that increases all salaries by 10%
def salary_update(salary): return salary*1.1
employee_df['Salary']=employee_df['Salary'].apply(salary_update)
employee_df
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
0 | Mike | Moe | 5500.000 | 3 | N94 3M0 | bird@gmail.com |
1 | Noah | Ryan | 11000.000 | 8 | N8S 14K | nsmall@hotmail.com |
2 | Nina | Keller | 9979.222 | 17 | S1T 4E6 | azikez@gahew.mr |
3 | Chanel | Steve | 12179.222 | 12 | N7T 3E6 | chanel@gmail.com |
4 | Kate | Noor | 5500.000 | 23 | K8N 5H6 | kate@hotmail.com |
5 | Samer | Mo | 110000.000 | 13 | J7H 3HY | samer@gmail.com |
6 | Heba | Steve | 55000.000 | 7 | K8Y 3M8 | heba.ismail@hotmail.com |
7 | Laila | Aly | 22000.000 | 5 | J8Y 3M0 | Laila.a@hotmail.com |
8 | Joseph | Patton | 2892.043 | 2 | M6U 5U7 | daafeja@boh.jm |
9 | Noah | Moran | 9489.656 | 11 | K2D 4M9 | guutodi@bigwoc.kw |
Define a Function that doubles all salaries and adds 100$
def salary_update_2(salary): return salary*2+100
employee_df['Salary']=employee_df['Salary'].apply(salary_update_2)
employee_df
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
0 | Mike | Moe | 11100.000 | 3 | N94 3M0 | bird@gmail.com |
1 | Noah | Ryan | 22100.000 | 8 | N8S 14K | nsmall@hotmail.com |
2 | Nina | Keller | 20058.444 | 17 | S1T 4E6 | azikez@gahew.mr |
3 | Chanel | Steve | 24458.444 | 12 | N7T 3E6 | chanel@gmail.com |
4 | Kate | Noor | 11100.000 | 23 | K8N 5H6 | kate@hotmail.com |
5 | Samer | Mo | 220100.000 | 13 | J7H 3HY | samer@gmail.com |
6 | Heba | Steve | 110100.000 | 7 | K8Y 3M8 | heba.ismail@hotmail.com |
7 | Laila | Aly | 44100.000 | 5 | J8Y 3M0 | Laila.a@hotmail.com |
8 | Joseph | Patton | 5884.086 | 2 | M6U 5U7 | daafeja@boh.jm |
9 | Noah | Moran | 19079.312 | 11 | K2D 4M9 | guutodi@bigwoc.kw |
Calculate the updated total salaries of all employees
employee_df['Salary'].sum()
488080.286
Task 2. Pandas DataFrame Operations/Filtering #1
employee_df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Mastering Data Analysis with Pandas Part 3/00-employee_information.csv')
Retrieve the employees who have been working at the company for at least 10 years
loyal=employee_df[employee_df['Years with Company']>=10].sort_values(by='Years with Company', ascending=False)
loyal
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
4 | Kate | Noor | 5000.00 | 23 | K8N 5H6 | kate@hotmail.com |
2 | Nina | Keller | 9072.02 | 17 | S1T 4E6 | azikez@gahew.mr |
5 | Samer | Mo | 100000.00 | 13 | J7H 3HY | samer@gmail.com |
3 | Chanel | Steve | 11072.02 | 12 | N7T 3E6 | chanel@gmail.com |
9 | Noah | Moran | 8626.96 | 11 | K2D 4M9 | guutodi@bigwoc.kw |
Retrieve the employees who have been working at the company for at least 10 years with a salary of at least 50k$
c1=employee_df['Years with Company']>=10
c2=employee_df['Salary']>=50000
loyal=employee_df[c1 & c2].sort_values(by='Salary', ascending=False)
loyal
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
5 | Samer | Mo | 100000.0 | 13 | J7H 3HY | samer@gmail.com |
Retrieve the employees with a min salary of 20k$ and find the Combined Salaries of these employees
high=employee_df[employee_df['Salary']>=20000].sort_values(by='Salary', ascending=False)
high
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
5 | Samer | Mo | 100000.0 | 13 | J7H 3HY | samer@gmail.com |
6 | Heba | Steve | 50000.0 | 7 | K8Y 3M8 | heba.ismail@hotmail.com |
7 | Laila | Aly | 20000.0 | 5 | J8Y 3M0 | Laila.a@hotmail.com |
high['Salary'].sum()
170000.0
Task 3. Pandas DataFrame Operations/Filtering #2
employee_df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Mastering Data Analysis with Pandas Part 3/00-employee_information.csv')
Retrieve the employees with 'Heba' as First Name
heba=employee_df[employee_df['First Name']=='Heba']
heba
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
6 | Heba | Steve | 50000.0 | 7 | K8Y 3M8 | heba.ismail@hotmail.com |
Retrieve the employees with 'Steve' or 'Mo' as Last Name
mask=employee_df['Last Name'].isin(['Steve','Mo'])
employee_df[mask]
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
3 | Chanel | Steve | 11072.02 | 12 | N7T 3E6 | chanel@gmail.com |
5 | Samer | Mo | 100000.00 | 13 | J7H 3HY | samer@gmail.com |
6 | Heba | Steve | 50000.00 | 7 | K8Y 3M8 | heba.ismail@hotmail.com |
Retrieve the employees who earn between 5000 and 9000
employee_df[employee_df['Salary'].between(5000,9000)]
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
0 | Mike | Moe | 5000.00 | 3 | N94 3M0 | bird@gmail.com |
4 | Kate | Noor | 5000.00 | 23 | K8N 5H6 | kate@hotmail.com |
9 | Noah | Moran | 8626.96 | 11 | K2D 4M9 | guutodi@bigwoc.kw |
Delete Employees with similar Last Names
#Detect duplicates
employee_df['Last Name'].duplicated(keep=False)
0 False 1 False 2 False 3 True 4 False 5 False 6 True 7 False 8 False 9 False Name: Last Name, dtype: bool
#Convert False into True and vice versa
mask = ~employee_df['Last Name'].duplicated(keep=False)
mask
0 True 1 True 2 True 3 False 4 True 5 True 6 False 7 True 8 True 9 True Name: Last Name, dtype: bool
#Delete Duplicates
employee_df[mask]
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
0 | Mike | Moe | 5000.00 | 3 | N94 3M0 | bird@gmail.com |
1 | Noah | Ryan | 10000.00 | 8 | N8S 14K | nsmall@hotmail.com |
2 | Nina | Keller | 9072.02 | 17 | S1T 4E6 | azikez@gahew.mr |
4 | Kate | Noor | 5000.00 | 23 | K8N 5H6 | kate@hotmail.com |
5 | Samer | Mo | 100000.00 | 13 | J7H 3HY | samer@gmail.com |
7 | Laila | Aly | 20000.00 | 5 | J8Y 3M0 | Laila.a@hotmail.com |
8 | Joseph | Patton | 2629.13 | 2 | M6U 5U7 | daafeja@boh.jm |
9 | Noah | Moran | 8626.96 | 11 | K2D 4M9 | guutodi@bigwoc.kw |
Delete Employees with similar Last Names using Drop Duplicates
employee_df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Mastering Data Analysis with Pandas Part 3/00-employee_information.csv')
employee_df.drop_duplicates(subset=['Last Name'], inplace=True)
employee_df
First Name | Last Name | Salary | Years with Company | Postal Code | ||
---|---|---|---|---|---|---|
0 | Mike | Moe | 5000.00 | 3 | N94 3M0 | bird@gmail.com |
1 | Noah | Ryan | 10000.00 | 8 | N8S 14K | nsmall@hotmail.com |
2 | Nina | Keller | 9072.02 | 17 | S1T 4E6 | azikez@gahew.mr |
3 | Chanel | Steve | 11072.02 | 12 | N7T 3E6 | chanel@gmail.com |
4 | Kate | Noor | 5000.00 | 23 | K8N 5H6 | kate@hotmail.com |
5 | Samer | Mo | 100000.00 | 13 | J7H 3HY | samer@gmail.com |
7 | Laila | Aly | 20000.00 | 5 | J8Y 3M0 | Laila.a@hotmail.com |
8 | Joseph | Patton | 2629.13 | 2 | M6U 5U7 | daafeja@boh.jm |
9 | Noah | Moran | 8626.96 | 11 | K2D 4M9 | guutodi@bigwoc.kw |
#Drop duplicates keeps first value and drops the following, while the first method gets rid of all values
Task 4. DataFrame Concatenation
employee_df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Mastering Data Analysis with Pandas Part 3/00-employee_information.csv')
Create a new DataFrame with Employee ID, First Name, Last Name as Group 1
raw_data = {'Employee ID':['1','2','3','4','5'],
'First Name':['Nancy','Alex','Shep','Max','Allen'],
'Last Name':['Rob','Ali','George','Mitch','Steve']}
employee1_df=pd.DataFrame(raw_data, columns=['Employee ID','First Name','Last Name'])
employee1_df
Employee ID | First Name | Last Name | |
---|---|---|---|
0 | 1 | Nancy | Rob |
1 | 2 | Alex | Ali |
2 | 3 | Shep | George |
3 | 4 | Max | Mitch |
4 | 5 | Allen | Steve |
Create a new DataFrame with Employee ID, First Name, Last Name as Group 2
raw_data = {'Employee ID':['6','7','8','9','10'],
'First Name':['Bill','Dina','Sarah','Heather','Holly'],
'Last Name':['Christian','Mo','Steve','Bob','Michelle']}
employee2_df=pd.DataFrame(raw_data, columns=['Employee ID','First Name','Last Name'])
employee2_df
Employee ID | First Name | Last Name | |
---|---|---|---|
0 | 6 | Bill | Christian |
1 | 7 | Dina | Mo |
2 | 8 | Sarah | Steve |
3 | 9 | Heather | Bob |
4 | 10 | Holly | Michelle |
Concatenate Group 1 and Group 2
employee_all_df=pd.concat([employee1_df,employee2_df])
employee_all_df
Employee ID | First Name | Last Name | |
---|---|---|---|
0 | 1 | Nancy | Rob |
1 | 2 | Alex | Ali |
2 | 3 | Shep | George |
3 | 4 | Max | Mitch |
4 | 5 | Allen | Steve |
0 | 6 | Bill | Christian |
1 | 7 | Dina | Mo |
2 | 8 | Sarah | Steve |
3 | 9 | Heather | Bob |
4 | 10 | Holly | Michelle |
Concatenate Group 1 and Group 2 ignoring the Index
employee_all_df=pd.concat([employee1_df,employee2_df],ignore_index=True)
employee_all_df
Employee ID | First Name | Last Name | |
---|---|---|---|
0 | 1 | Nancy | Rob |
1 | 2 | Alex | Ali |
2 | 3 | Shep | George |
3 | 4 | Max | Mitch |
4 | 5 | Allen | Steve |
5 | 6 | Bill | Christian |
6 | 7 | Dina | Mo |
7 | 8 | Sarah | Steve |
8 | 9 | Heather | Bob |
9 | 10 | Holly | Michelle |
Task 5. DataFrame Concatenation with Multi-Indexing
Display and Index Group 1 and Group 2 in the DataFrame
employee_all_df=pd.concat([employee1_df, employee2_df], keys=['Group 1', 'Group 2'])
employee_all_df
Employee ID | First Name | Last Name | ||
---|---|---|---|---|
Group 1 | 0 | 1 | Nancy | Rob |
1 | 2 | Alex | Ali | |
2 | 3 | Shep | George | |
3 | 4 | Max | Mitch | |
4 | 5 | Allen | Steve | |
Group 2 | 0 | 6 | Bill | Christian |
1 | 7 | Dina | Mo | |
2 | 8 | Sarah | Steve | |
3 | 9 | Heather | Bob | |
4 | 10 | Holly | Michelle |
Retrieve the First row in Group 1
employee_all_df.loc[('Group 1'),0]
Employee ID 1 First Name Nancy Last Name Rob Name: (Group 1, 0), dtype: object
Retrieve all First Names in Group 2
employee_all_df.loc[('Group 2'), 'First Name'].to_frame()
First Name | |
---|---|
0 | Bill |
1 | Dina |
2 | Sarah |
3 | Heather |
4 | Holly |
Retrieve all Data in Group 1
employee_all_df.loc[('Group 1'), :]
Employee ID | First Name | Last Name | |
---|---|---|---|
0 | 1 | Nancy | Rob |
1 | 2 | Alex | Ali |
2 | 3 | Shep | George |
3 | 4 | Max | Mitch |
4 | 5 | Allen | Steve |
Task 6. DataFrame Merging
employee_all_df=pd.concat([employee1_df,employee2_df],ignore_index=True)
Creating additional information : Annual Salary [$/year]
raw_data={
'Employee ID':['1','2','3','4','5','6','7','8','9','10'],
'Annual Salary [$/year]':[25000,35000,45000,48000,49000,32000,33000,34000,23000,22000]
}
raw_data
{'Employee ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'], 'Annual Salary [$/year]': [25000, 35000, 45000, 48000, 49000, 32000, 33000, 34000, 23000, 22000]}
employee_salary_df=pd.DataFrame(raw_data, columns=['Employee ID', 'Annual Salary [$/year]'])
employee_salary_df
Employee ID | Annual Salary [$/year] | |
---|---|---|
0 | 1 | 25000 |
1 | 2 | 35000 |
2 | 3 | 45000 |
3 | 4 | 48000 |
4 | 5 | 49000 |
5 | 6 | 32000 |
6 | 7 | 33000 |
7 | 8 | 34000 |
8 | 9 | 23000 |
9 | 10 | 22000 |
Merge all the Data from the 2 DataFrames
employee_all_df=pd.merge(employee_all_df, employee_salary_df, on = 'Employee ID')
employee_all_df
Employee ID | First Name | Last Name | Annual Salary [$/year] | |
---|---|---|---|---|
0 | 1 | Nancy | Rob | 25000 |
1 | 2 | Alex | Ali | 35000 |
2 | 3 | Shep | George | 45000 |
3 | 4 | Max | Mitch | 48000 |
4 | 5 | Allen | Steve | 49000 |
5 | 6 | Bill | Christian | 32000 |
6 | 7 | Dina | Mo | 33000 |
7 | 8 | Sarah | Steve | 34000 |
8 | 9 | Heather | Bob | 23000 |
9 | 10 | Holly | Michelle | 22000 |
Creating additional information : Credit Card Debt, Age
raw_data={
'Employee ID':['1','2','3','4','5','6','7','8','9','10'],
'Credit Card Debt':[1000,100,500,600,0,20,360,127,3000,2200],
'Age':[44,35,67,19,22,45,48,33,34,36]
}
raw_data
{'Employee ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'], 'Credit Card Debt': [1000, 100, 500, 600, 0, 20, 360, 127, 3000, 2200], 'Age': [44, 35, 67, 19, 22, 45, 48, 33, 34, 36]}
employee_credit_age_df=pd.DataFrame(raw_data, columns=['Employee ID', 'Credit Card Debt', 'Age'])
employee_credit_age_df
Employee ID | Credit Card Debt | Age | |
---|---|---|---|
0 | 1 | 1000 | 44 |
1 | 2 | 100 | 35 |
2 | 3 | 500 | 67 |
3 | 4 | 600 | 19 |
4 | 5 | 0 | 22 |
5 | 6 | 20 | 45 |
6 | 7 | 360 | 48 |
7 | 8 | 127 | 33 |
8 | 9 | 3000 | 34 |
9 | 10 | 2200 | 36 |
Merge all the Data from the 2 DataFrames
employee_all_df=pd.merge(employee_all_df,employee_credit_age_df, on='Employee ID')
employee_all_df
Employee ID | First Name | Last Name | Annual Salary [$/year] | Credit Card Debt | Age | |
---|---|---|---|---|---|---|
0 | 1 | Nancy | Rob | 25000 | 1000 | 44 |
1 | 2 | Alex | Ali | 35000 | 100 | 35 |
2 | 3 | Shep | George | 45000 | 500 | 67 |
3 | 4 | Max | Mitch | 48000 | 600 | 19 |
4 | 5 | Allen | Steve | 49000 | 0 | 22 |
5 | 6 | Bill | Christian | 32000 | 20 | 45 |
6 | 7 | Dina | Mo | 33000 | 360 | 48 |
7 | 8 | Sarah | Steve | 34000 | 127 | 33 |
8 | 9 | Heather | Bob | 23000 | 3000 | 34 |
9 | 10 | Holly | Michelle | 22000 | 2200 | 36 |