IE6400_Day6_Lab6
html
keyboard_arrow_up
School
Northeastern University *
*We aren’t endorsed by this school
Course
6400
Subject
Industrial Engineering
Date
Feb 20, 2024
Type
html
Pages
20
Uploaded by ColonelStraw13148
IE6400 Foundations for Data Analytics
Engineering
¶
Fall 2023
¶
Module 1: Pandas Library - Part 2 - Lab 6
¶
Handling Missing Data
¶
Again, real-world datasets are rarely perfect. They may contain missing values, wrong data types, unreadable characters, and other unexpected values.
The first step to any proper data analysis is cleansing and organizing the data. We'll be
working with small employee dataset.
In [1]:
import pandas as pd
dataframe = pd.read_csv('employee.csv')
In [2]:
# Displaying using Jupyter Notebook
dataframe
# For non-Jupyter environments # print(dataframe) Out[2]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
n.a.
115163.0 10.125
False
Legal
4
NaN
Female
0.0
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
na
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
10
NaN
NaN
NaN
NaN
NaN
NaN
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
Taking a closer look at the dataset, we note that Pandas assigns NaN if the value for a particular column is empty string. However, there are cases where missing values are represented by a custom value, for example, the string 'na' or 0 for a numeric column. These technically aren't missing values, as there's something there, but they're functionally missing values.
If we try using utility methods such as dropna( ), it will not work properly.
We'll want to first clean this up and categorize them, before we try to handle them as missing values.
Exercise 6.1 Customizing Missing Data Values
¶
In our dataset, we want to consider these as missing values:
•
A 0 value in the Salary column •
An na value in the Team column The easiest way to deal with missing values is to handle them at import-time.
When loading our dataset in, let's set 0 and na as missing values for the Salary and Team columns:
In [3]:
import pandas as pd
df = pd.read_csv('employee.csv', na_values = {"Salary" : [0], "Team" : ['na']})
In [4]:
# Display using Jupyter Notebook
df
Out[4]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
n.a.
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
9
Stephanie
Female
36844.0
5.574
True
Business Development
10
NaN
NaN
NaN
NaN
NaN
NaN
Now, when we load our data in, all instances of 0 and na will be turned into NaN, which
is the correct data type of missing values.
Great! Now we don't have hidden missing values anymore. Or do we?
There's a n.a. cell in the Gender column, on index 3. We'll want to add more matchers for missing values to solve this.
We will create a list of values which will be treated as missing globally, in all columns. Let's make a list of various "hidden" missing values and pass that list to the na_values argument:
In [5]:
missing_values = ["n.a.", "NA", "n/a", "na", 0]
In [6]:
df = pd.read_csv('employee.csv', na_values = missing_values)
In [7]:
# Display using Jupyter Notebook
df
Out[7]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
9
Stephanie
Female
36844.0
5.574
True
Business Development
10
NaN
NaN
NaN
NaN
NaN
NaN
Now, the dataset is finally cleaned from the garbage values, and we can work on handling missing values now.
Exercise 6.2 Removing Rows With Missing Values
¶
One approach would be removing all the rows which contain missing values. This can easily be done with dropna( ) function.
In [8]:
df
Out[8]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
10
NaN
NaN
NaN
NaN
NaN
NaN
In [9]:
# drops all rows with NaN values
clean_df = df.dropna(axis = 0, inplace = False)
inplace = True makes all the changes in the existing DataFrame without returning a new one.
The axis argument specifies if you're working with rows or columns - 0 being rows and 1 being columns.
In [10]:
clean_df
Out[10]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
You can control whether you want to remove the rows containing at least 1 NaN or all NaN values by setting the how parameter in the dropna method.
how:
•
any : if any NaN values are present, drop the row •
all : if all values are NaN, drop the row In [11]:
df.dropna(axis = 0, inplace = True, how = 'all')
In [12]:
df
Out[12]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
This would only remove the last row from the dataset since how = 'all' would only drop
a row if all of the values are missing from it.
Filling out Missing Values
¶
It might not be the best approach to remove the rows containing missing values. They might contain valuable data in other columns and we don't want to skew the data towards an inaccurate state.
There are several options for dealing with this type of issue, and the most common ones are:
•
Fill NaNs with Mean, Median, or Mode of the data •
Fill NaNs with a constant value •
Forward Fill or Backward Fill NaNs •
Interpolate Data and Fill NaNs Exercise 6.3 Filling Missing Values with Column Mean
¶
Let's start out with the fillna( ) method. It fills the NaN-marked values with values you supply the method with. For example, you can use .mean( ) function on a column and supply those as the fill value:
In [13]:
import pandas as pd
# Cleansing the missing values
missing_values = ["n.a.", "NA", "n/a", "na", 0]
df = pd.read_csv('employee.csv', na_values = missing_values)
# Removing all NaN
df.dropna(axis = 0, inplace = True, how = 'all')
# Show df
df
Out[13]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
In [14]:
# Using mean
df['Salary'].fillna(int(df['Salary'].mean()), inplace = True)
In [15]:
df
Out[15]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
88920.0
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
88920.0
11.598
NaN
Finance
5
Angela
NaN
88920.0
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
Exercise 6.4 Filling Missing Values with Column Median
¶
In [16]:
import pandas as pd
# Cleansing the missing values
missing_values = ["n.a.", "NA", "n/a", "na", 0]
df = pd.read_csv('employee.csv', na_values = missing_values)
# Removing all NaN
df.dropna(axis = 0, inplace = True, how = 'all')
# Show df
df
Out[16]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
In [17]:
# Using median
df['Salary'].fillna(int(df['Salary'].median()), inplace = True)
In [18]:
df
Out[18]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
2
Jerry
Male
97308.0
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
97308.0
11.598
NaN
Finance
5
Angela
NaN
97308.0
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
Exercise 6.5 Filling Missing Values with Column Mode
¶
In [19]:
import pandas as pd
# Cleansing the missing values
missing_values = ["n.a.", "NA", "n/a", "na", 0]
df = pd.read_csv('employee.csv', na_values = missing_values)
# Removing all NaN
df.dropna(axis = 0, inplace = True, how = 'all')
# Show df
df
Out[19]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
In [20]:
# Using mode
df['Salary'].fillna(df['Salary'].mode(), inplace = True)
In [21]:
df
Out[21]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
61933.0
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
111737.0 11.598
NaN
Finance
5
Angela
NaN
115163.0 18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
Now, if a salary is missing from a person's row, a mean, mode, or median are used to fill that value.
Exercise 6.6 Filling Missing Values with a Constant
¶
You could also decide to fill the NaN-marked values with a constant value. For example, you can put in a special string or numerical value:
In [22]:
import pandas as pd
# Cleansing the missing values
missing_values = ["n.a.", "NA", "n/a", "na", 0]
df = pd.read_csv('employee.csv', na_values = missing_values)
# Removing all NaN
df.dropna(axis = 0, inplace = True, how = 'all')
# Show df
df
Out[22]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
In [23]:
df['Salary'].fillna(0, inplace = True)
In [24]:
df
Out[24]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
0.0
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
0.0
11.598
NaN
Finance
5
Angela
NaN
0.0
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
Notice that all NaN values in Salary column are changed to "0" now.
Exercise 6.7 Forward Fill Missing DataFrame Values
¶
This method would fill the missing values with first non-missing value that shows before it.
In [25]:
import pandas as pd
# Cleansing the missing values
missing_values = ["n.a.", "NA", "n/a", "na", 0]
df = pd.read_csv('employee.csv', na_values = missing_values)
# Removing all NaN
df.dropna(axis = 0, inplace = True, how = 'all')
# Show df
df
Out[25]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
In [26]:
df['Salary'].fillna(method = 'ffill', inplace = True)
In [27]:
df
Out[27]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
61933.0
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
115163.0 11.598
NaN
Finance
5
Angela
NaN
115163.0 18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
9
Stephanie
Female
36844.0
5.574
True
Business Development
For example, Jerry would inherit the Salary from Thomas.
Exercise 6.8 Backward Fill Missing DataFrame Values
¶
This method would fill the missing values with first non-missing value that shows after it.
In [28]:
import pandas as pd
# Cleansing the missing values
missing_values = ["n.a.", "NA", "n/a", "na", 0]
df = pd.read_csv('employee.csv', na_values = missing_values)
# Removing all NaN
df.dropna(axis = 0, inplace = True, how = 'all')
# Show df
df
Out[28]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
In [29]:
df['Team'].fillna(method = 'bfill', inplace = True)
In [30]:
df
Out[30]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
Finance
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
Business Development
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
This work exactly the other way around from the previous approach, and Thomas will inherit Finance as his Team from Jerry.
Exercise 6.9 Fill Missing DataFrame Values with Interpolation - polynomial
¶
Finally, this method uses mathematical interpolation to determine what value would have been in the place of a missing value.
In [31]:
import pandas as pd
# Cleansing the missing values
missing_values = ["n.a.", "NA", "n/a", "na", 0]
df = pd.read_csv('employee.csv', na_values = missing_values)
# Removing all NaN
df.dropna(axis = 0, inplace = True, how = 'all')
# Show df
df
Out[31]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
In [32]:
df['Salary'].interpolate(method = 'polynomial', order = 5, inplace = True)
In [33]:
df
Out[33]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.000000
6.945
True
Marketing
1
Thomas
Male
61933.000000
NaN
True
NaN
2
Jerry
Male
108558.907129 9.340
True
Finance
3
Dennis
NaN
115163.000000 10.125
False
Legal
4
NaN
Female
85046.355671
11.598
NaN
Finance
5
Angela
NaN
73044.323666
18.523
True
Engineering
6
Shawn
Male
111737.000000 6.414
False
NaN
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
7
Rachel
Female
142032.000000 12.599
False
Business Development
8
Linda
Female
57427.000000
9.557
True
Client Services
9
Stephanie
Female
36844.000000
5.574
True
Business Development
Exercise 6.10 Fill Missing DataFrame Values with Interpolation - spline
¶
In [34]:
import pandas as pd
# Cleansing the missing values
missing_values = ["n.a.", "NA", "n/a", "na", 0]
df = pd.read_csv('employee.csv', na_values = missing_values)
# Removing all NaN
df.dropna(axis = 0, inplace = True, how = 'all')
# Show df
df
Out[34]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
Development
In [35]:
df['Salary'].interpolate(method = 'spline', order = 5, inplace = True)
In [36]:
df
Out[36]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.000000
6.945
True
Marketing
1
Thomas
Male
61933.000000
NaN
True
NaN
2
Jerry
Male
108557.233181 9.340
True
Finance
3
Dennis
NaN
115163.000000 10.125
False
Legal
4
NaN
Female
85049.266288
11.598
NaN
Finance
5
Angela
NaN
73048.433135
18.523
True
Engineering
6
Shawn
Male
111737.000000 6.414
False
NaN
7
Rachel
Female
142032.000000 12.599
False
Business Development
8
Linda
Female
57427.000000
9.557
True
Client Services
9
Stephanie
Female
36844.000000
5.574
True
Business Development
Exercise 6.11 Fill Missing DataFrame Values with Interpolation - linear
¶
In [37]:
import pandas as pd
# Cleansing the missing values
missing_values = ["n.a.", "NA", "n/a", "na", 0]
df = pd.read_csv('employee.csv', na_values = missing_values)
# Removing all NaN
df.dropna(axis = 0, inplace = True, how = 'all')
# Show df
df
Out[37]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
NaN
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
NaN
11.598
NaN
Finance
5
Angela
NaN
NaN
18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
In [38]:
df['Salary'].interpolate(method = 'linear', inplace = True)
In [39]:
df
Out[39]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
88548.0
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
114021.0 11.598
NaN
Finance
5
Angela
NaN
112879.0 18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
Exercise 6.12 Writing CSV File with to_csv( )
¶
Here, we've cleaned the employee DataFrame. Then, we've gone ahead and saved that data into a CSV file using to_csv( ) and providing the filename.
In [40]:
df
Out[40]:
First
Name
Gender
Salary
Bonus
%
Senior
Management
Team
0
Douglas
Male
97308.0
6.945
True
Marketing
1
Thomas
Male
61933.0
NaN
True
NaN
2
Jerry
Male
88548.0
9.340
True
Finance
3
Dennis
NaN
115163.0 10.125
False
Legal
4
NaN
Female
114021.0 11.598
NaN
Finance
5
Angela
NaN
112879.0 18.523
True
Engineering
6
Shawn
Male
111737.0 6.414
False
NaN
7
Rachel
Female
142032.0 12.599
False
Business Development
8
Linda
Female
57427.0
9.557
True
Client Services
9
Stephanie
Female
36844.0
5.574
True
Business Development
In [41]:
df.to_csv('cleaned_employee.csv')
Revised Date: September 7th, 2023
In [ ]: