01_Missing_Value_Data_Cleaning

School

Dr. Filemon C. Aguilar Memorial College of Las Piñas City

Course

ECONOMIC H

Subject

Statistics

Date

Feb 20, 2024

Uploaded by AgentProton22731

Unit 1: Missing Value Data Cleaning

Case Study: Unit Topics

1. Research motivation and goal
2. Initial dataset
3. Checking for missing data: a tutorial
   3.1. Identifying explicit missing values
   3.2. Dealing with explicit missing values
   3.3. pd.read_csv() will automatically convert some strings to NaN values
   3.4. Forcing pd.read_csv() to convert additionally specified strings to NaN values
4. Cleaning the data in the Chicago Airbnb dataset
5. Dataframe exporting
6. Potential missing value data cleaning techniques shortcomings

Following Along in the E-Book

Check out Modules 7-01 to 7-04 in the E-Book if you'd like the "written"/"book" version of this unit's lecture material: https://d7.cs.illinois.edu/ds207/dev/ds207-exploration-dev/

Research Goal: Design a predictive model that will effectively predict the price of a Chicago Airbnb listing.

Main Approach: We will eventually (in a later unit) use a linear regression model with the Chicago Airbnb dataset described in this unit to pursue this goal.

1. Research Motivation and Goal

Motivation: Suppose that you are a data scientist who works for Airbnb, and company research has shown that many hosts with new properties are unsure what daily price to set for their listing initially. You'd like to design a new service on the Airbnb website that recommends a good starting price to these hosts, based on listings in the same city and other property-related information, including:

1. Neighborhood
2. Room type
3. How many people the listing accommodates
4. How many bedrooms the listing has
5. How many beds the listing has

Research Goal: Design a predictive model that will effectively predict the price of a Chicago Airbnb listing.

Main Approach: We will eventually (in a later unit) use a linear regression model with the Chicago Airbnb dataset described in this unit to pursue this goal.

2. Initial Dataset

Below is the initial dataset that we were given.

2.1. Reading the initial full dataset

2.2. Inspecting the initial full dataset
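
The code for steps 2.1 and 2.2 is not reproduced in this text, so the following is only a minimal sketch of how reading and inspecting might look. The CSV contents and column names are hypothetical stand-ins, read from an in-memory string so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'chicago_listings.csv'; the real file has
# many more rows and columns, and its column names may differ.
csv_text = """id,price,neighborhood,room_type,accommodates,bedrooms,beds
1,85,Logan Square,Entire home/apt,4,2,2
2,40,Uptown,Private room,2,1,1
3,120,Loop,Entire home/apt,6,3,4
"""

# With the real file this would be: df = pd.read_csv('chicago_listings.csv')
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows (here, all three)
```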

Let's inspect the first row of the dataset.

Let's temporarily force pandas not to truncate the number of rows it displays to just 10.
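
One possible way to do both steps is sketched below. The dataframe is a hypothetical wide table, and `pd.option_context` is used here as one way to make the display change temporary; the original may have used `pd.set_option` instead.

```python
import pandas as pd

# Hypothetical wide dataframe: one row, 80 columns.
df = pd.DataFrame({f"col_{i}": [i] for i in range(80)})

# Selecting a single row returns a Series with one entry per column,
# so a wide row prints as a long vertical listing that pandas may truncate.
first_row = df.iloc[0]

# Lift the row display limit only inside this block; the previous
# setting is restored automatically afterwards.
with pd.option_context("display.max_rows", None):
    print(first_row)
```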

2.3. Creating a reduced dataset

Let's focus on just the 6 variables that we need for this analysis.

2.4. Potential Dataset Questions/Shortcomings

Questions:

1. Given the data scientist's research goal, what questions might you have about how the data in the 'chicago_listings.csv' file that we were given was collected?
2. How might the answers to these questions impact how well the data scientist is able to meet their research goal?
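
Returning to the column reduction in 2.3, a minimal sketch of selecting six variables could look like this. The column names are assumptions chosen to match the price plus the five listing attributes named in the motivation; the real file may spell them differently.

```python
import pandas as pd

# Hypothetical full dataframe; column names are assumed for illustration.
df = pd.DataFrame({
    "id": [1, 2],
    "host_name": ["A", "B"],
    "price": [85, 40],
    "neighborhood": ["Logan Square", "Uptown"],
    "room_type": ["Entire home/apt", "Private room"],
    "accommodates": [4, 2],
    "bedrooms": [2, 1],
    "beds": [2, 1],
})

keep = ["price", "neighborhood", "room_type", "accommodates", "bedrooms", "beds"]

# .copy() makes the reduced dataframe independent of the original.
df_small = df[keep].copy()
print(df_small.columns.tolist())
```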

3. Checking for Missing Data: A Tutorial

There are two ways that a cell in a dataframe might be considered a "missing value" in Python: explicitly or implicitly.

3.1. Identifying Explicit Missing Values

3.1.1. NaN: an Explicit Missing Value

In Python, we use the special floating-point value NaN, standing for "not a number", to explicitly represent any value that is undefined or unrepresentable.

Question: What are two ways we could isolate the NaN value in the first row (using pandas functions)?

3.1.2. Checking for NaN Values
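
As a hedged illustration of 3.1.1 and 3.1.2, the sketch below uses a small hypothetical dataframe. Label-based and position-based indexing are shown as two possible answers to the question above; other answers exist.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one explicit missing value in the first row.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [np.nan, 2.5, 4.0]})

# Two ways to isolate the single cell: by label or by integer position.
by_label = df.loc[0, "score"]
by_position = df.iloc[0, 1]

# NaN never compares equal to anything, including itself,
# so test for it with isna() rather than ==.
print(by_label == np.nan)  # False
print(pd.isna(by_label))   # True

# isna() on the whole dataframe gives a same-shaped table of booleans.
print(df.isna())
```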

3.1.3. Counting the Number of NaN Values in Each Column
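
A common way to get these counts is to chain `isna()` with `sum()`, since `True` counts as 1 when summed. The data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two of the three columns.
df = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 3.0, np.nan],
    "beds": [1.0, 2.0, np.nan, 4.0],
    "price": [50, 80, 120, 95],
})

# isna() marks each cell True/False; sum() then counts the Trues per column.
nan_counts = df.isna().sum()
print(nan_counts)
```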

3.2. Dealing with Explicit Missing Values

3.2.1. Approach #1: Drop all rows with a NaN value.

3.2.2. Approach #2: Impute all NaN values with 0.

3.2.3. Approach #3: Impute all NaN values with the respective column mean.
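
The three approaches can be sketched with standard pandas calls as follows. The dataframe is hypothetical, and each call returns a new dataframe rather than changing the original.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one NaN in each column.
df = pd.DataFrame({"bedrooms": [1.0, np.nan, 3.0], "beds": [1.0, 2.0, np.nan]})

# Approach #1: drop every row that contains at least one NaN.
dropped = df.dropna()

# Approach #2: replace every NaN with 0.
zero_filled = df.fillna(0)

# Approach #3: replace each NaN with the mean of its own column.
# df.mean() skips NaN, and fillna matches the means to columns by name.
mean_filled = df.fillna(df.mean())

print(dropped, zero_filled, mean_filled, sep="\n\n")
```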

3.3. pd.read_csv() Will Automatically Convert Some Strings to NaN Values

3.3.1. Additional Functionality of the pd.read_csv() Function

The pd.read_csv() function will automatically convert the following strings into a NaN value in the pandas dataframe.

'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'None', 'n/a', 'nan', 'null'

Exercise: Fill in the blanks for the read dataframe below based on what you see in the dirty_data.csv to the right.

3.3.2. Checking the Column Datatypes of a Dataframe

Common datatype outputs are the following.

float64: which means __________ objects in the column are float64 data types (basically a decimal number).

int64: which means __________ objects in the column are int64 data types (a type of integer object).

object: which means _____________ object in the column is a _______________.

Note: pd.read_csv() will convert a column to a float64 (int64) type if it detects only what look like decimals (integers) and the string values in the list above. Because NaN is itself a floating-point value, a column of integers that also contains one of these strings is read as float64 rather than int64.
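
The behavior described in 3.3 can be observed on a small example. The CSV below is a hypothetical stand-in (it is not the dirty_data.csv from the exercise): 'NA' and 'n/a' are on the default list, while 'unknown' is not.

```python
import io
import pandas as pd

# Hypothetical CSV; 'NA' and 'n/a' are converted automatically, 'unknown' is not.
csv_text = """name,age,height
Ann,23,1.70
Bob,NA,n/a
Cy,31,unknown
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.isna())  # True where a default missing-value string was found
print(df.dtypes)  # age becomes float64; height stays non-numeric
                  # because 'unknown' was read as an ordinary string
```

The `height` column is the kind of case the rest of this unit is concerned with: a single unconverted string keeps an intended numeric column from being read as numbers.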

3.3.3. Mathematical Calculations with Different Column Types

1. Mathematical functions in Python work well with float64 or int64 columns.

Question: What do you think the .mean() function does with the NaN value?

2. Mathematical functions in Python don't work well with string columns.

A detectable error example…
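
The original code for these two points is not reproduced here, so the following is an illustrative sketch on hypothetical data. It shows `.mean()` skipping NaN in a numeric column, and the same call raising an error on a column of strings.

```python
import numpy as np
import pandas as pd

# Numeric column: NaN is skipped by default, so the mean is (10 + 30) / 2.
num = pd.Series([10.0, np.nan, 30.0])
print(num.mean())  # 20.0

# String column (object dtype, as a CSV read can produce when one value
# is not numeric): taking the mean fails loudly.
text = pd.Series(["10", "unknown", "30"], dtype=object)
raised = False
try:
    text.mean()
except (TypeError, ValueError) as err:
    raised = True
    print("mean() failed:", err)
```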

A less detectable error…

Question: What do you think this code would return? What would we ideally want it to return?
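
The code this question refers to is not reproduced in this text. As an assumed example of the same kind of silent problem, `.sum()` and `.max()` on a column of numeric-looking strings run without error but return results that are not the numeric ones.

```python
import pandas as pd

# Hypothetical column of numbers that were read in as strings.
prices = pd.Series(["9", "10", "100"], dtype=object)

# No error is raised, but strings are concatenated rather than added...
print(prices.sum())  # '910100'

# ...and compared alphabetically rather than numerically.
print(prices.max())  # '9'

# Converting to numbers first gives the intended results.
numeric = pd.to_numeric(prices)
print(numeric.sum(), numeric.max())  # 119 100
```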

3.4. Forcing pd.read_csv() to Convert Additionally Specified Strings to NaN Values

Automatically converted strings:

'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'None', 'n/a', 'nan', 'null'
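
The `na_values` parameter of pd.read_csv() is the usual way to add strings to this list. A minimal sketch, assuming a hypothetical file in which 'unknown' and '-' were used to mark missing entries:

```python
import io
import pandas as pd

# Hypothetical CSV using markers that are not on the default list.
csv_text = """name,age
Ann,23
Bob,unknown
Cy,-
"""

# Default read: the markers stay as strings, so no NaN is detected.
default = pd.read_csv(io.StringIO(csv_text))

# na_values adds to the default list rather than replacing it.
forced = pd.read_csv(io.StringIO(csv_text), na_values=["unknown", "-"])

print(default["age"].isna().sum())  # 0
print(forced["age"].isna().sum())   # 2
print(forced.dtypes)                # age is now float64
```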

4. Cleaning the Data in the Chicago Airbnb Dataset

4.1. After a Basic pd.read_csv()

True or False: The code below demonstrates that the data in our 'chicago_listings.csv' (just the 6 columns) has no missing values.

4.2. Inspecting the Intended Numerical Variable Types

Question: Is there any evidence to suggest that there may have been some implicit missing values in the 'chicago_listings.csv' file that were not automatically detected and converted to NaN? If so, which variables may have some of these undetected implicit missing values?

Let's inspect the values in the bedroom column.

For easier viewing, let's just inspect the unique values in these two columns.
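
A sketch of this inspection is shown below. The column names (`bedrooms`, `beds`) and the markers 'Missing' and 'N' are assumptions for illustration; the actual values in the Chicago file are not shown in this text.

```python
import pandas as pd

# Hypothetical columns that were intended to be numeric but were read
# as strings because of unconverted missing-value markers.
df = pd.DataFrame(
    {"bedrooms": ["1", "2", "Missing", "3", "2"],
     "beds": ["1", "N", "2", "2", "1"]},
    dtype=object,
)

# unique() lists each distinct value once, which makes odd entries stand out.
for col in ["bedrooms", "beds"]:
    print(col, df[col].unique())

# Optional check: values that cannot be parsed as numbers.
not_numeric = pd.to_numeric(df["bedrooms"], errors="coerce").isna()
odd = df.loc[not_numeric, "bedrooms"].unique()
print(odd)
```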

Question: Were there any other unaccounted-for (and non-converted) implicit missing values in the intended categorical variable columns?

4.3. Re-Reading the csv File with Forced Specified Missing Value Conversion

4.4. Rechecking the Data Types

4.5. Looking for NaN Values

4.6. Dropping the Missing Values

What if we had only written the following code instead here?
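
The alternative code the question refers to is not reproduced in this text. One plausible reading, offered here only as an assumption, is a call to `dropna()` whose result is never assigned. The sketch below shows why that matters, using hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned-up dataframe with one remaining NaN.
df = pd.DataFrame({"price": [85, 40, 120], "bedrooms": [2.0, np.nan, 3.0]})

# dropna() returns a new dataframe; without assignment the result is
# discarded and df itself is unchanged.
df.dropna()
rows_without_assignment = len(df)
print(rows_without_assignment)  # 3

# Assigning the result back is what actually removes the row.
df = df.dropna()
rows_with_assignment = len(df)
print(rows_with_assignment)     # 2
```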

5. Dataframe Exporting

Let's export this dataframe (with the rows that have been dropped) to a new csv file.
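
A minimal sketch of the export with `to_csv` follows. The output file name is hypothetical, and a temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned dataframe; the index has a gap where a row was dropped.
df_clean = pd.DataFrame({"price": [85, 120], "bedrooms": [2.0, 3.0]}, index=[0, 2])

# Hypothetical output name, written to a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "chicago_listings_clean.csv")

# index=False keeps the row labels out of the file, so reading it back
# does not produce an extra unnamed column.
df_clean.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(reloaded)
```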

6. Potential Missing Value Data Cleaning Techniques Shortcomings

Approach #1: Dropping the Rows with the Missing Values

Question: What if the distribution of room types had changed in the following way (pre-row dropping vs. post-row dropping)? How would this impact the ability to meet our research goal with a linear regression model?

Research Goal: Design a predictive model that will effectively predict the price of a Chicago Airbnb listing.

Main Approach: We will eventually (in a later unit) use a linear regression model with the Chicago Airbnb dataset described in this unit to pursue this goal.
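
One way to check for this kind of shift is to compare the room type proportions before and after dropping rows. The data below is invented to make the effect obvious and does not reflect the actual Chicago dataset.

```python
import numpy as np
import pandas as pd

# Invented data: missing bedroom counts occur mostly in private rooms.
df = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4,
    "bedrooms": [1.0, 2.0, 3.0, 2.0, np.nan, np.nan, np.nan, 1.0],
})

# normalize=True gives proportions instead of raw counts.
before = df["room_type"].value_counts(normalize=True)
after = df.dropna()["room_type"].value_counts(normalize=True)

print(before)  # 50% / 50%
print(after)   # 80% entire homes, 20% private rooms
```

If the rows that go missing are concentrated in one room type, a model fit to the remaining rows sees that type far less often than it occurs among real listings.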

Approach #2: Imputing the Missing Values with 0s

Approach #3: Imputing the Missing Values with Column Means
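
The discussion for these two approaches is not included in this text. As a hedged illustration of the kind of distortion imputation can introduce, the sketch below compares summary statistics of a hypothetical column before and after each approach.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
s = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan])

zero_filled = s.fillna(0)
mean_filled = s.fillna(s.mean())

# Filling with 0 pulls the mean down toward zero.
print(s.mean(), zero_filled.mean())

# Filling with the mean keeps the mean but shrinks the spread,
# because the filled values all sit exactly at the center.
print(s.std(), mean_filled.std())
```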