01_Missing _Value_Data_Cleaning.pdf

School: Dr. Filemon C. Aguilar Memorial College of Las Piñas City
Course: ECONOMIC H
Subject: Statistics
Date: Feb 20, 2024
Type: pdf
Pages: 20
Uploaded by AgentProton22731

Unit 1: Missing Value Data Cleaning

Case Study

Unit Topics
1. Research motivation and goal
2. Initial dataset
3. Checking for missing data: a tutorial
   3.1. Identifying explicit missing values
   3.2. Dealing with explicit missing values
   3.3. pd.read_csv() will automatically convert some strings to NaN values
   3.4. Forcing pd.read_csv() to convert additionally specified strings to NaN values
4. Cleaning the data in the Chicago Airbnb dataset
5. Dataframe exporting
6. Potential missing value data cleaning techniques shortcomings

Following Along in the E-Book
Check out Modules 7-01 to 7-04 in the E-Book if you’d like the “written”/“book” version of this unit’s lecture material: https://d7.cs.illinois.edu/ds207/dev/ds207-exploration-dev/

Research Goal: Design a predictive model that will effectively predict the price of a Chicago Airbnb listing.

Main Approach: We will eventually (in a later unit) use a linear regression model with the Chicago Airbnb dataset described in this unit to pursue this goal.

1. Research Motivation and Goal

Motivation: Suppose that you are a data scientist who works for Airbnb, and that company research has shown that many hosts with new properties are unsure what daily price to set initially for their listing. You'd like to design a new service on the Airbnb website that recommends a good starting price to hosts with new properties, based on listings in the same city and other property-related information, including:
1. Neighborhood
2. Room type
3. How many people the listing accommodates
4. How many bedrooms the listing has
5. How many beds the listing has

Research Goal: Design a predictive model that will effectively predict the price of a Chicago Airbnb listing.

Main Approach: We will eventually (in a later unit) use a linear regression model with the Chicago Airbnb dataset described in this unit to pursue this goal.

2. Initial Dataset

Below is the initial dataset that we were given.

2.1. Reading the Initial Full Dataset

2.2. Inspecting the Initial Full Dataset
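The code for this page did not survive in this copy. As a minimal sketch of reading and inspecting a dataset with pandas, the example below uses a small made-up stand-in for 'chicago_listings.csv'; the column names and values are assumptions, not the real file's contents.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'chicago_listings.csv' (made-up rows and columns).
csv_text = """id,price,neighborhood,room_type,accommodates,bedrooms,beds
1,85,Logan Square,Entire home/apt,4,2,2
2,40,Hyde Park,Private room,1,1,1
3,120,Loop,Entire home/apt,6,3,4
"""

# With the real file this would be: df = pd.read_csv('chicago_listings.csv')
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # the first few rows
```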

Let’s inspect the first row of the dataset.

Let’s temporarily force pandas to not truncate the number of rows it displays to just 10.
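One way to do this, sketched under assumptions: the dataframe below is a made-up wide table, and `pd.option_context` is used so that the display setting reverts automatically afterwards. The original notes may have used a different call, such as `pd.set_option`.

```python
import pandas as pd

# Hypothetical wide dataframe: 80 made-up columns, 2 rows.
df = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(80)})

# The first row, as a Series indexed by column name.
first_row = df.iloc[0]

# Remember the current display limit so we can confirm it is restored.
default_limit = pd.get_option("display.max_rows")

# Inside this block pandas prints every entry; the limit reverts on exit.
with pd.option_context("display.max_rows", None):
    print(first_row)
```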

2.3. Creating a Reduced Dataset

Let’s focus on just the 6 variables that we need for this analysis.

2.4. Potential Dataset Questions/Shortcomings

Question:
1. Given the data scientist's research goal, what questions might you have about how the data in the 'chicago_listings.csv' file that we were given was collected?
2. How might the answers to these questions impact how well the data scientist is able to meet their research goal?
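Selecting the six variables from Section 2.3 could look like the sketch below. The column names are assumptions inferred from the variable list in Section 1 plus price; the real file's names may differ.

```python
import pandas as pd

# Hypothetical full dataset with a few extra columns we do not need.
df = pd.DataFrame({
    "id": [1, 2],
    "host_name": ["A", "B"],
    "price": [85, 40],
    "neighborhood": ["Logan Square", "Hyde Park"],
    "room_type": ["Entire home/apt", "Private room"],
    "accommodates": [4, 1],
    "bedrooms": [2, 1],
    "beds": [2, 1],
})

# Assumed names for the six variables used in the analysis.
keep_cols = ["price", "neighborhood", "room_type",
             "accommodates", "bedrooms", "beds"]

# .copy() makes the reduced dataframe independent of the original.
df_reduced = df[keep_cols].copy()
print(df_reduced)
```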

3. Checking for Missing Data: A Tutorial

There are actually two ways that a cell in a dataframe might be considered a "missing value" in Python: explicitly or implicitly.

3.1. Identifying Explicit Missing Values

3.1.1. NaN: an Explicit Missing Value

In Python, we use the special floating-point value NaN, standing for “not a number”, to explicitly represent any value that is undefined or unrepresentable.

Question: What are two ways we could isolate the NaN value in the first row (using pandas functions)?

3.1.2. Checking for NaN Values
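A minimal sketch of isolating and checking for NaN values, using a small made-up dataframe (the column names are hypothetical). It also shows why `==` cannot be used to detect NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one explicit missing value in the first row.
df = pd.DataFrame({"name": ["a", "b", "c"],
                   "score": [np.nan, 2.5, 4.0]})

# Two ways to isolate the value in the first row of the 'score' column.
by_position = df.iloc[0, 1]    # integer positions: row 0, column 1
by_label = df.loc[0, "score"]  # labels: row 0, column 'score'

# NaN is never equal to anything, including itself.
print(by_position == by_label)  # False
print(pd.isna(by_position))     # True

# Elementwise check: True wherever a cell holds NaN.
print(df.isna())
```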

3.1.3. Counting the Number of NaN Values in Each Column
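A common way to count NaN values per column is to chain `.isna()` with `.sum()`, since True counts as 1. The dataframe here is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a different number of NaN values per column.
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, 2.0, 3.0, 4.0],
    "z": ["a", "b", "c", "d"],
})

# .isna() gives a True/False table; .sum() adds up the Trues column by column.
nan_counts = df.isna().sum()
print(nan_counts)
```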

3.2. Dealing with Explicit Missing Values

3.2.1. Approach #1: Drop all rows with a NaN value.

3.2.2. Approach #2: Imputing all NaN values with 0.

3.2.3. Approach #3: Imputing all NaN values with the respective column mean.
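The three approaches can be sketched as follows on a small made-up numeric dataframe. Each call returns a new dataframe and leaves `df` itself unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one NaN in each column.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                   "y": [4.0, 5.0, np.nan]})

# Approach #1: drop every row that contains at least one NaN.
dropped = df.dropna()

# Approach #2: replace every NaN with 0.
zero_filled = df.fillna(0)

# Approach #3: replace each NaN with the mean of its own column.
# df.mean() skips NaN, giving x -> 2.0 and y -> 4.5.
mean_filled = df.fillna(df.mean())

print(dropped)
print(zero_filled)
print(mean_filled)
```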

3.3. pd.read_csv() Will Automatically Convert Some Strings to NaN Values

3.3.1. Additional Functionality of the pd.read_csv() Function

The pd.read_csv() function will automatically convert the following strings into a NaN value in the pandas dataframe.

'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'None', 'n/a', 'nan', 'null'

Exercise: Fill in the blanks for the read data frame below based on what you see in the dirty_data.csv to the right.

3.3.2. Checking the Column Datatypes of a Dataframe

Common datatype outputs are the following.
• float64: which means __________ objects in the column are float64 data types (basically a decimal number).
• int64: which means __________ objects in the column are int64 data types (a type of integer object).
• object: which means _____________ object in the column is a _______________.

Note: pd.read_csv() will convert a column to a float64 type if it only detects what look like decimals and the string values in the list above. A column in which it only detects what look like integers becomes int64, unless it also contains one of the strings in the list above, in which case it becomes float64, because NaN is a floating-point value.
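To see the automatic conversion and its effect on column types, here is a minimal sketch. The CSV text is made up and is not the dirty_data.csv from the exercise.

```python
import io
import pandas as pd

# Hypothetical CSV: 'NA' and 'n/a' are in the default NaN list, '-' is not.
csv_text = """a,b,c,d
1,2.5,x,1
NA,3.5,y,2
3,n/a,-,3
"""

df = pd.read_csv(io.StringIO(csv_text))

# a: integers plus 'NA'  -> float64 (NaN forces a float column)
# b: decimals plus 'n/a' -> float64
# c: text, '-' kept as a string (typically reported as object)
# d: integers only       -> int64
print(df.dtypes)
print(df)
```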

3.3.3. Mathematical Calculations with Different Column Types

1. Mathematical functions in Python work well with float64 or int64 columns.

Question: What do you think the .mean() function does with the NaN value?

2. Mathematical functions in Python don’t work well with string columns.

A detectable error example…
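The two cases can be tried on made-up Series, as sketched below. The original notes' own example is not shown in this copy, so the string column here is an assumption chosen to trigger a visible error.

```python
import numpy as np
import pandas as pd

# Numeric column: .mean() skips NaN by default.
nums = pd.Series([1.0, np.nan, 3.0])
print(nums.mean())              # 2.0, the average of 1.0 and 3.0
print(nums.mean(skipna=False))  # nan, because NaN is no longer skipped

# String column (hypothetical values): .mean() cannot average text.
words = pd.Series(["1", "two", "3"], dtype=object)
mean_failed = False
try:
    words.mean()
except (TypeError, ValueError) as err:
    # Recent pandas versions raise a TypeError here.
    mean_failed = True
    print("Error:", err)
```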

A less detectable error…

Question: What do you think this code would return? What would we ideally want it to return?
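The code this question refers to is not shown in this copy. One plausible illustration of a silent error, offered as an assumption rather than the notes' own example, is arithmetic on a column of numeric-looking strings: no exception is raised, but the result is not what we want.

```python
import pandas as pd

# Hypothetical column that was read in as strings instead of numbers.
beds = pd.Series(["1", "2", "3"], dtype=object)

# Multiplying a string repeats it, so no error is raised.
doubled_text = beds * 2
print(doubled_text.tolist())  # ['11', '22', '33']

# Converting to numbers first gives the result we actually wanted.
doubled_numbers = pd.to_numeric(beds) * 2
print(doubled_numbers.tolist())  # [2, 4, 6]
```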

3.4. Forcing pd.read_csv() to Convert Additionally Specified Strings to NaN Values

Automatically converted strings:

'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'None', 'n/a', 'nan', 'null'
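Extra strings can be added to this list with the `na_values` parameter of `pd.read_csv()`. The sketch below uses made-up CSV text, and the placeholder strings 'missing' and '?' are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV with two placeholders that pandas does not recognise.
csv_text = """bedrooms,beds
2,1
missing,2
3,?
"""

# Default behaviour: the placeholders survive as ordinary strings.
default_df = pd.read_csv(io.StringIO(csv_text))

# na_values adds to the default list, so both placeholders become NaN.
forced_df = pd.read_csv(io.StringIO(csv_text), na_values=["missing", "?"])

print(default_df.dtypes)
print(forced_df.dtypes)
print(forced_df.isna().sum())
```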

4. Cleaning the Data in the Chicago Airbnb Dataset

4.1. After a Basic pd.read_csv()

True or False: The code below demonstrates that the data in our 'chicago_listings.csv' (just the 6 columns) has no missing values.

4.2. Inspecting the Intended Numerical Variable Types

Question: Is there any evidence to suggest that there may have been some implicit missing values in the 'chicago_listings.csv' file that were not automatically detected and converted to NaN? If so, which variables may have some of these undetected implicit missing values?
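A zero NaN count does not rule out implicit missing values. One hedged way to look for evidence is to check whether the columns we intend to be numeric actually have a numeric type. The data and column names below are made up.

```python
import pandas as pd

# Hypothetical data: 'bedrooms' hides a missing value as the text 'missing'.
df = pd.DataFrame({
    "price": [85, 40, 120],
    "bedrooms": ["2", "missing", "3"],
})

# No explicit NaN values are found...
print(df.isna().sum())

# ...but an intended numeric column that is not numeric is a warning sign.
intended_numeric = ["price", "bedrooms"]
suspicious = [col for col in intended_numeric
              if not pd.api.types.is_numeric_dtype(df[col])]
print(suspicious)  # ['bedrooms']
```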

Let’s inspect the values in the bedroom column. For easier viewing, let’s just inspect the unique values in these two columns.
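Listing the unique values makes stray placeholders easy to spot. In this sketch the two columns are assumed to be 'bedrooms' and 'beds', and the placeholder strings are made up.

```python
import pandas as pd

# Hypothetical columns read in as text because of hidden placeholders.
df = pd.DataFrame({
    "bedrooms": ["2", "1", "missing", "2"],
    "beds": ["1", "?", "2", "1"],
})

non_numeric = {}
for col in ["bedrooms", "beds"]:
    # Every distinct value that appears in the column.
    print(col, df[col].unique())

    # Values that cannot be parsed as numbers become NaN with errors="coerce".
    cannot_parse = pd.to_numeric(df[col], errors="coerce").isna()
    non_numeric[col] = df.loc[cannot_parse, col].unique().tolist()

print(non_numeric)  # {'bedrooms': ['missing'], 'beds': ['?']}
```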

Question: Were there any other unaccounted for (and non-converted) implicit missing values in the intended categorical variable columns?

4.3. Re-Reading the csv File with Forced Specified Missing Value Conversion

4.4. Rechecking the Data Types

4.5. Looking for NaN Values

4.6. Dropping the Missing Values

What if we had only written the following code instead here?
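The alternative code is not shown in this copy. A common pitfall it may be pointing at, stated here as an assumption, is calling `.dropna()` without assigning the result: the method returns a new dataframe and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned-up numeric data with two incomplete rows.
df = pd.DataFrame({"bedrooms": [2.0, np.nan, 3.0],
                   "beds": [1.0, 2.0, np.nan]})

# The result is thrown away, so df still has all of its rows.
df.dropna()
rows_after_bare_call = len(df)
print(rows_after_bare_call)  # 3

# Assigning the result back is what actually removes the rows.
df = df.dropna()
rows_after_assignment = len(df)
print(rows_after_assignment)  # 1
```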

5. Dataframe Exporting

Let’s export this dataframe (with the rows that have been dropped) to a new csv file.
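A minimal sketch of the export step with `to_csv()`. The output file name is hypothetical, and the file is written to a temporary folder so the example runs anywhere.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned dataframe.
df_clean = pd.DataFrame({
    "price": [85, 120],
    "room_type": ["Entire home/apt", "Private room"],
    "bedrooms": [2.0, 3.0],
})

# Hypothetical output name, placed in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "chicago_listings_clean.csv")

# index=False keeps the row labels out of the file.
df_clean.to_csv(out_path, index=False)

# Reading the file back confirms what was written.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Without `index=False`, the row labels are written as an extra unnamed first column, which then shows up as a new column when the file is read back in.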

6. Potential Missing Value Data Cleaning Techniques Shortcomings

Approach #1: Dropping the Rows with the Missing Values

Question: What if the distribution of room types had changed in the following way (pre-row dropping vs. post-row dropping)? How would this impact the ability to meet our research goal, predicting the price of a Chicago Airbnb listing, with a linear regression model?
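To compare the room type distribution before and after dropping rows, one option is `value_counts(normalize=True)`. The data below is made up so that the missing values are concentrated in one room type; it is not the Chicago dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: shared rooms are the ones most often missing 'bedrooms'.
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room",
                  "Shared room", "Shared room"],
    "bedrooms": [2.0, 3.0, 1.0, np.nan, np.nan, np.nan],
})

# Proportion of each room type before and after dropping incomplete rows.
before = df["room_type"].value_counts(normalize=True)
after = df.dropna()["room_type"].value_counts(normalize=True)

# Room types that vanish entirely show up as 0 after fillna.
comparison = pd.DataFrame({"before": before, "after": after}).fillna(0)
print(comparison)
```

In this made-up case, shared rooms disappear completely after dropping, so a model fit to the remaining rows would have seen no shared-room listings at all.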

Approach #2: Imputing the Missing Values with 0’s

Approach #3: Imputing the Missing Values with Column Means
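The notes for these two approaches are not included in this copy. As a hedged illustration of what can go wrong, the sketch below uses a made-up column: filling with 0 pulls the mean down, and filling with the mean shrinks the spread.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values out of five.
beds = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan])

zero_filled = beds.fillna(0)
mean_filled = beds.fillna(beds.mean())

# Filling with 0 lowers the mean below that of the observed values.
print(beds.mean(), zero_filled.mean())

# Filling with the mean keeps the mean but reduces the standard deviation.
print(beds.mean(), mean_filled.mean())
print(beds.std(), mean_filled.std())
```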
