Handling Missing Data in Pandas - Data Analytics Lab 6

IE6400 Foundations for Data Analytics Engineering
Fall 2023
Module 1: Pandas Library - Part 2 - Lab 6

Handling Missing Data

Again, real-world datasets are rarely perfect. They may contain missing values, wrong data types, unreadable characters, and other unexpected values. The first step in any proper data analysis is cleansing and organizing the data. We'll be working with a small employee dataset.

In [1]:
import pandas as pd
dataframe = pd.read_csv('employee.csv')

In [2]:
# Display using Jupyter Notebook
dataframe
# For non-Jupyter environments
# print(dataframe)

Out[2]:
    First Name  Gender    Salary  Bonus %  Senior Management                  Team
 0     Douglas    Male   97308.0    6.945               True             Marketing
 1      Thomas    Male   61933.0      NaN               True                   NaN
 2       Jerry    Male       NaN    9.340               True               Finance
 3      Dennis    n.a.  115163.0   10.125              False                 Legal
 4         NaN  Female       0.0   11.598                NaN               Finance
 5      Angela     NaN       NaN   18.523               True           Engineering
 6       Shawn    Male  111737.0    6.414              False                    na
 7      Rachel  Female  142032.0   12.599              False  Business Development
 8       Linda  Female   57427.0    9.557               True       Client Services
 9   Stephanie  Female   36844.0    5.574               True  Business Development
10         NaN     NaN       NaN      NaN                NaN                   NaN
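Before cleaning anything, it can help to count how many values pandas currently recognizes as missing in each column. The sketch below is only an illustration: the inline CSV text is a hand-typed stand-in for a few rows of employee.csv (an assumption, not the lab's actual file), and the names csv_text, sample, and missing_per_column are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for employee.csv: a few rows typed from the table above.
csv_text = """First Name,Gender,Salary,Bonus %,Senior Management,Team
Douglas,Male,97308,6.945,True,Marketing
Thomas,Male,61933,,True,
Jerry,Male,,9.340,True,Finance
Dennis,n.a.,115163,10.125,False,Legal
,Female,0,11.598,,Finance
Shawn,Male,111737,6.414,False,na
"""

sample = pd.read_csv(io.StringIO(csv_text))

# isna() flags each cell as True/False; sum() counts the True values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Only cells that pandas has already converted to NaN are counted here, so the totals reflect empty fields and nothing else.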

Taking a closer look at the dataset, we note that pandas assigns NaN when a field in the CSV file is empty. However, there are cases where missing values are represented by a custom value, for example, the string 'na', or 0 in a numeric column. These technically aren't missing values, as there's something there, but they're functionally missing values. Utility methods such as dropna() only act on values that pandas recognizes as missing, so they will skip these placeholders. We'll want to clean these up and mark them as missing first, before we try to handle them as missing values.

Exercise 6.1 Customizing Missing Data Values

In our dataset, we want to consider these as missing values:
• A 0 value in the Salary column
• An na value in the Team column

The easiest way to deal with such placeholders is to handle them at import time. When loading our dataset, let's set 0 and na as missing values for the Salary and Team columns, respectively:

In [3]:
import pandas as pd
df = pd.read_csv('employee.csv', na_values={"Salary": [0], "Team": ['na']})

In [4]:
# Display using Jupyter Notebook
df

Out[4]:
    First Name  Gender    Salary  Bonus %  Senior Management                  Team
 0     Douglas    Male   97308.0    6.945               True             Marketing
 1      Thomas    Male   61933.0      NaN               True                   NaN
 2       Jerry    Male       NaN    9.340               True               Finance
 3      Dennis    n.a.  115163.0   10.125              False                 Legal
 4         NaN  Female       NaN   11.598                NaN               Finance
 5      Angela     NaN       NaN   18.523               True           Engineering
 6       Shawn    Male  111737.0    6.414              False                   NaN
 7      Rachel  Female  142032.0   12.599              False  Business Development
 8       Linda  Female   57427.0    9.557               True       Client Services
 9   Stephanie  Female   36844.0    5.574               True  Business Development
10         NaN     NaN       NaN      NaN                NaN                   NaN

Now, when we load our data, every 0 in the Salary column and every na in the Team column is turned into NaN, the marker pandas uses for missing values.

Great! Now we don't have hidden missing values anymore. Or do we? There's an n.a. cell in the Gender column, at index 3. We'll want to add more matchers for missing values to solve this.

We will create a list of values that will be treated as missing globally, in all columns. Let's make a list of various "hidden" missing values and pass that list to the na_values argument. Because this list applies to every column, a 0 in any other numeric column, such as Bonus %, would also be read as NaN.

In [5]:
missing_values = ["n.a.", "NA", "n/a", "na", 0]

In [6]:
df = pd.read_csv('employee.csv', na_values=missing_values)

In [7]:
# Display using Jupyter Notebook
df

Out[7]:
    First Name  Gender    Salary  Bonus %  Senior Management                  Team
 0     Douglas    Male   97308.0    6.945               True             Marketing
 1      Thomas    Male   61933.0      NaN               True                   NaN
 2       Jerry    Male       NaN    9.340               True               Finance
 3      Dennis     NaN  115163.0   10.125              False                 Legal
 4         NaN  Female       NaN   11.598                NaN               Finance
 5      Angela     NaN       NaN   18.523               True           Engineering
 6       Shawn    Male  111737.0    6.414              False                   NaN
 7      Rachel  Female  142032.0   12.599              False  Business Development
 8       Linda  Female   57427.0    9.557               True       Client Services
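To make the effect of na_values on dropna() concrete, here is a minimal sketch that loads the same text twice, once with the defaults and once with the missing_values list from the lab, and compares how many rows survive dropna(). The inline CSV is a hand-typed stand-in with only four columns (an assumption, not the real employee.csv), and the names csv_text, raw, and cleaned are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for employee.csv (hand-typed rows, not the lab's file).
csv_text = """First Name,Gender,Salary,Team
Douglas,Male,97308,Marketing
Dennis,n.a.,115163,Legal
Shawn,Male,111737,na
Rachel,Female,0,Finance
"""

# Same list as in the lab: placeholders to treat as missing in every column.
missing_values = ["n.a.", "NA", "n/a", "na", 0]

raw = pd.read_csv(io.StringIO(csv_text))
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=missing_values)

# With the defaults, the placeholders look like ordinary data,
# so dropna() finds nothing to remove.
print(len(raw.dropna()))

# With na_values, each placeholder became NaN, so dropna() removes
# every row that contained one.
print(len(cleaned.dropna()))
```

In this small sample, only the Douglas row has no placeholder, so it is the only row left after dropna() on the cleaned frame.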
