Mastering Data Analysis with Pandas #3


Task 1. DataFrames with Functions

Import DataSet

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
In [2]:
employee_df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Mastering Data Analysis with Pandas Part 3/00-employee_information.csv')
In [3]:
employee_df
Out[3]:
First Name Last Name Salary Years with Company Postal Code Email
0 Mike Moe 5000.00 3 N94 3M0 bird@gmail.com
1 Noah Ryan 10000.00 8 N8S 14K nsmall@hotmail.com
2 Nina Keller 9072.02 17 S1T 4E6 azikez@gahew.mr
3 Chanel Steve 11072.02 12 N7T 3E6 chanel@gmail.com
4 Kate Noor 5000.00 23 K8N 5H6 kate@hotmail.com
5 Samer Mo 100000.00 13 J7H 3HY samer@gmail.com
6 Heba Steve 50000.00 7 K8Y 3M8 heba.ismail@hotmail.com
7 Laila Aly 20000.00 5 J8Y 3M0 Laila.a@hotmail.com
8 Joseph Patton 2629.13 2 M6U 5U7 daafeja@boh.jm
9 Noah Moran 8626.96 11 K2D 4M9 guutodi@bigwoc.kw

Define a Function that increases all salaries by 10%

In [4]:
def salary_update(salary): return salary*1.1
In [5]:
employee_df['Salary']=employee_df['Salary'].apply(salary_update)
In [6]:
employee_df
Out[6]:
First Name Last Name Salary Years with Company Postal Code Email
0 Mike Moe 5500.000 3 N94 3M0 bird@gmail.com
1 Noah Ryan 11000.000 8 N8S 14K nsmall@hotmail.com
2 Nina Keller 9979.222 17 S1T 4E6 azikez@gahew.mr
3 Chanel Steve 12179.222 12 N7T 3E6 chanel@gmail.com
4 Kate Noor 5500.000 23 K8N 5H6 kate@hotmail.com
5 Samer Mo 110000.000 13 J7H 3HY samer@gmail.com
6 Heba Steve 55000.000 7 K8Y 3M8 heba.ismail@hotmail.com
7 Laila Aly 22000.000 5 J8Y 3M0 Laila.a@hotmail.com
8 Joseph Patton 2892.043 2 M6U 5U7 daafeja@boh.jm
9 Noah Moran 9489.656 11 K2D 4M9 guutodi@bigwoc.kw

Define a Function that doubles all salaries and adds 100$

In [7]:
def salary_update_2(salary): return salary*2+100
In [8]:
employee_df['Salary']=employee_df['Salary'].apply(salary_update_2)
employee_df
Out[8]:
First Name Last Name Salary Years with Company Postal Code Email
0 Mike Moe 11100.000 3 N94 3M0 bird@gmail.com
1 Noah Ryan 22100.000 8 N8S 14K nsmall@hotmail.com
2 Nina Keller 20058.444 17 S1T 4E6 azikez@gahew.mr
3 Chanel Steve 24458.444 12 N7T 3E6 chanel@gmail.com
4 Kate Noor 11100.000 23 K8N 5H6 kate@hotmail.com
5 Samer Mo 220100.000 13 J7H 3HY samer@gmail.com
6 Heba Steve 110100.000 7 K8Y 3M8 heba.ismail@hotmail.com
7 Laila Aly 44100.000 5 J8Y 3M0 Laila.a@hotmail.com
8 Joseph Patton 5884.086 2 M6U 5U7 daafeja@boh.jm
9 Noah Moran 19079.312 11 K2D 4M9 guutodi@bigwoc.kw

Calculate the updated total salaries of all employees

In [9]:
employee_df['Salary'].sum()
Out[9]:
488080.286

Task 2. Pandas DataFrame Operations/Filtering #1

In [10]:
employee_df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Mastering Data Analysis with Pandas Part 3/00-employee_information.csv')

Retrieve the employees who have been working at the company for at least 10 years

In [11]:
loyal=employee_df[employee_df['Years with Company']>=10].sort_values(by='Years with Company', ascending=False)
loyal
Out[11]:
First Name Last Name Salary Years with Company Postal Code Email
4 Kate Noor 5000.00 23 K8N 5H6 kate@hotmail.com
2 Nina Keller 9072.02 17 S1T 4E6 azikez@gahew.mr
5 Samer Mo 100000.00 13 J7H 3HY samer@gmail.com
3 Chanel Steve 11072.02 12 N7T 3E6 chanel@gmail.com
9 Noah Moran 8626.96 11 K2D 4M9 guutodi@bigwoc.kw

Retrieve the employees who have been working at the company for at least 10 years with a salary of at least 50k$

In [12]:
c1=employee_df['Years with Company']>=10
c2=employee_df['Salary']>=50000
loyal=employee_df[c1 & c2].sort_values(by='Salary', ascending=False)
loyal
Out[12]:
First Name Last Name Salary Years with Company Postal Code Email
5 Samer Mo 100000.0 13 J7H 3HY samer@gmail.com

Retrieve the employees with a min salary of 20k$ and find the Combined Salaries of these employees

In [13]:
high=employee_df[employee_df['Salary']>=20000].sort_values(by='Salary', ascending=False)
high
Out[13]:
First Name Last Name Salary Years with Company Postal Code Email
5 Samer Mo 100000.0 13 J7H 3HY samer@gmail.com
6 Heba Steve 50000.0 7 K8Y 3M8 heba.ismail@hotmail.com
7 Laila Aly 20000.0 5 J8Y 3M0 Laila.a@hotmail.com
In [14]:
high['Salary'].sum()
Out[14]:
170000.0

Task 3. Pandas DataFrame Operations/Filtering #2

In [15]:
employee_df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Mastering Data Analysis with Pandas Part 3/00-employee_information.csv')

Retrieve the employees with 'Heba' as First Name

In [16]:
heba=employee_df[employee_df['First Name']=='Heba']
heba
Out[16]:
First Name Last Name Salary Years with Company Postal Code Email
6 Heba Steve 50000.0 7 K8Y 3M8 heba.ismail@hotmail.com

Retrieve the employees with 'Steve' or 'Mo' as Last Name

In [17]:
mask=employee_df['Last Name'].isin(['Steve','Mo'])
employee_df[mask]
Out[17]:
First Name Last Name Salary Years with Company Postal Code Email
3 Chanel Steve 11072.02 12 N7T 3E6 chanel@gmail.com
5 Samer Mo 100000.00 13 J7H 3HY samer@gmail.com
6 Heba Steve 50000.00 7 K8Y 3M8 heba.ismail@hotmail.com

Retrieve the employees who earn between 5000 and 9000

In [18]:
employee_df[employee_df['Salary'].between(5000,9000)]
Out[18]:
First Name Last Name Salary Years with Company Postal Code Email
0 Mike Moe 5000.00 3 N94 3M0 bird@gmail.com
4 Kate Noor 5000.00 23 K8N 5H6 kate@hotmail.com
9 Noah Moran 8626.96 11 K2D 4M9 guutodi@bigwoc.kw

Delete Employees with similar Last Names

In [19]:
#Detect duplicates
employee_df['Last Name'].duplicated(keep=False)
Out[19]:
0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
9    False
Name: Last Name, dtype: bool
In [20]:
#Convert False into True and vice versa
mask = ~employee_df['Last Name'].duplicated(keep=False)
mask
Out[20]:
0     True
1     True
2     True
3    False
4     True
5     True
6    False
7     True
8     True
9     True
Name: Last Name, dtype: bool
In [21]:
#Delete Duplicates
employee_df[mask]
Out[21]:
First Name Last Name Salary Years with Company Postal Code Email
0 Mike Moe 5000.00 3 N94 3M0 bird@gmail.com
1 Noah Ryan 10000.00 8 N8S 14K nsmall@hotmail.com
2 Nina Keller 9072.02 17 S1T 4E6 azikez@gahew.mr
4 Kate Noor 5000.00 23 K8N 5H6 kate@hotmail.com
5 Samer Mo 100000.00 13 J7H 3HY samer@gmail.com
7 Laila Aly 20000.00 5 J8Y 3M0 Laila.a@hotmail.com
8 Joseph Patton 2629.13 2 M6U 5U7 daafeja@boh.jm
9 Noah Moran 8626.96 11 K2D 4M9 guutodi@bigwoc.kw

Delete Employees with similar Last Names using Drop Duplicates

In [22]:
employee_df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Mastering Data Analysis with Pandas Part 3/00-employee_information.csv')
In [23]:
employee_df.drop_duplicates(subset=['Last Name'], inplace=True)
employee_df
Out[23]:
First Name Last Name Salary Years with Company Postal Code Email
0 Mike Moe 5000.00 3 N94 3M0 bird@gmail.com
1 Noah Ryan 10000.00 8 N8S 14K nsmall@hotmail.com
2 Nina Keller 9072.02 17 S1T 4E6 azikez@gahew.mr
3 Chanel Steve 11072.02 12 N7T 3E6 chanel@gmail.com
4 Kate Noor 5000.00 23 K8N 5H6 kate@hotmail.com
5 Samer Mo 100000.00 13 J7H 3HY samer@gmail.com
7 Laila Aly 20000.00 5 J8Y 3M0 Laila.a@hotmail.com
8 Joseph Patton 2629.13 2 M6U 5U7 daafeja@boh.jm
9 Noah Moran 8626.96 11 K2D 4M9 guutodi@bigwoc.kw
In [24]:
#Drop duplicates keeps first value and drops the following, while the first method gets rid of all values

Task 4. DataFrame Concatenation

In [25]:
employee_df=pd.read_csv('/Users/mekki/Python_Projects_Datasets/Mastering Data Analysis with Pandas Part 3/00-employee_information.csv')

Create a new DataFrame with Employee ID, First Name, Last Name as Group 1

In [26]:
raw_data = {'Employee ID':['1','2','3','4','5'],
           'First Name':['Nancy','Alex','Shep','Max','Allen'],
           'Last Name':['Rob','Ali','George','Mitch','Steve']}
employee1_df=pd.DataFrame(raw_data, columns=['Employee ID','First Name','Last Name'])
employee1_df
Out[26]:
Employee ID First Name Last Name
0 1 Nancy Rob
1 2 Alex Ali
2 3 Shep George
3 4 Max Mitch
4 5 Allen Steve

Create a new DataFrame with Employee ID, First Name, Last Name as Group 2

In [27]:
raw_data = {'Employee ID':['6','7','8','9','10'],
           'First Name':['Bill','Dina','Sarah','Heather','Holly'],
           'Last Name':['Christian','Mo','Steve','Bob','Michelle']}
employee2_df=pd.DataFrame(raw_data, columns=['Employee ID','First Name','Last Name'])
employee2_df
Out[27]:
Employee ID First Name Last Name
0 6 Bill Christian
1 7 Dina Mo
2 8 Sarah Steve
3 9 Heather Bob
4 10 Holly Michelle

Concatenate Group 1 and Group 2

In [28]:
employee_all_df=pd.concat([employee1_df,employee2_df])
employee_all_df
Out[28]:
Employee ID First Name Last Name
0 1 Nancy Rob
1 2 Alex Ali
2 3 Shep George
3 4 Max Mitch
4 5 Allen Steve
0 6 Bill Christian
1 7 Dina Mo
2 8 Sarah Steve
3 9 Heather Bob
4 10 Holly Michelle

Concatenate Group 1 and Group 2 ignoring the Index

In [29]:
employee_all_df=pd.concat([employee1_df,employee2_df],ignore_index=True)
employee_all_df
Out[29]:
Employee ID First Name Last Name
0 1 Nancy Rob
1 2 Alex Ali
2 3 Shep George
3 4 Max Mitch
4 5 Allen Steve
5 6 Bill Christian
6 7 Dina Mo
7 8 Sarah Steve
8 9 Heather Bob
9 10 Holly Michelle

Task 5. DataFrame Concatenation with Multi-Indexing

Display and Index Group 1 and Group 2 in the DataFrame

In [30]:
employee_all_df=pd.concat([employee1_df, employee2_df], keys=['Group 1', 'Group 2'])
employee_all_df
Out[30]:
Employee ID First Name Last Name
Group 1 0 1 Nancy Rob
1 2 Alex Ali
2 3 Shep George
3 4 Max Mitch
4 5 Allen Steve
Group 2 0 6 Bill Christian
1 7 Dina Mo
2 8 Sarah Steve
3 9 Heather Bob
4 10 Holly Michelle

Retrieve the First row in Group 1

In [31]:
employee_all_df.loc[('Group 1'),0]
Out[31]:
Employee ID        1
First Name     Nancy
Last Name        Rob
Name: (Group 1, 0), dtype: object

Retrieve all First Names in Group 2

In [32]:
employee_all_df.loc[('Group 2'), 'First Name'].to_frame()
Out[32]:
First Name
0 Bill
1 Dina
2 Sarah
3 Heather
4 Holly

Retrieve all Data in Group 1

In [33]:
employee_all_df.loc[('Group 1'), :]
Out[33]:
Employee ID First Name Last Name
0 1 Nancy Rob
1 2 Alex Ali
2 3 Shep George
3 4 Max Mitch
4 5 Allen Steve

Task 6. DataFrame Merging

In [34]:
employee_all_df=pd.concat([employee1_df,employee2_df],ignore_index=True)

Creating additional information : Annual Salary [$/year]

In [35]:
raw_data={
    'Employee ID':['1','2','3','4','5','6','7','8','9','10'],
    'Annual Salary [$/year]':[25000,35000,45000,48000,49000,32000,33000,34000,23000,22000]
}
raw_data
Out[35]:
{'Employee ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
 'Annual Salary [$/year]': [25000,
  35000,
  45000,
  48000,
  49000,
  32000,
  33000,
  34000,
  23000,
  22000]}
In [36]:
employee_salary_df=pd.DataFrame(raw_data, columns=['Employee ID', 'Annual Salary [$/year]'])
employee_salary_df
Out[36]:
Employee ID Annual Salary [$/year]
0 1 25000
1 2 35000
2 3 45000
3 4 48000
4 5 49000
5 6 32000
6 7 33000
7 8 34000
8 9 23000
9 10 22000

Merge all the Data from the 2 DataFrames

In [37]:
employee_all_df=pd.merge(employee_all_df, employee_salary_df, on = 'Employee ID')
employee_all_df
Out[37]:
Employee ID First Name Last Name Annual Salary [$/year]
0 1 Nancy Rob 25000
1 2 Alex Ali 35000
2 3 Shep George 45000
3 4 Max Mitch 48000
4 5 Allen Steve 49000
5 6 Bill Christian 32000
6 7 Dina Mo 33000
7 8 Sarah Steve 34000
8 9 Heather Bob 23000
9 10 Holly Michelle 22000

Creating additional information : Credit Card Debt, Age

In [38]:
raw_data={
    'Employee ID':['1','2','3','4','5','6','7','8','9','10'],
    'Credit Card Debt':[1000,100,500,600,0,20,360,127,3000,2200],
    'Age':[44,35,67,19,22,45,48,33,34,36]
}
raw_data
Out[38]:
{'Employee ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
 'Credit Card Debt': [1000, 100, 500, 600, 0, 20, 360, 127, 3000, 2200],
 'Age': [44, 35, 67, 19, 22, 45, 48, 33, 34, 36]}
In [39]:
employee_credit_age_df=pd.DataFrame(raw_data, columns=['Employee ID', 'Credit Card Debt', 'Age'])
employee_credit_age_df
Out[39]:
Employee ID Credit Card Debt Age
0 1 1000 44
1 2 100 35
2 3 500 67
3 4 600 19
4 5 0 22
5 6 20 45
6 7 360 48
7 8 127 33
8 9 3000 34
9 10 2200 36

Merge all the Data from the 2 DataFrames

In [40]:
employee_all_df=pd.merge(employee_all_df,employee_credit_age_df, on='Employee ID')
employee_all_df
Out[40]:
Employee ID First Name Last Name Annual Salary [$/year] Credit Card Debt Age
0 1 Nancy Rob 25000 1000 44
1 2 Alex Ali 35000 100 35
2 3 Shep George 45000 500 67
3 4 Max Mitch 48000 600 19
4 5 Allen Steve 49000 0 22
5 6 Bill Christian 32000 20 45
6 7 Dina Mo 33000 360 48
7 8 Sarah Steve 34000 127 33
8 9 Heather Bob 23000 3000 34
9 10 Holly Michelle 22000 2200 36