# -*- coding: utf-8 -*-
"""Data Repairing and Priming Lab.ipynb

Automatically generated by Colaboratory.

Original file is located at

## Imports
"""

# Commented out IPython magic to ensure Python compatibility.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# %matplotlib inline
import matplotlib.pyplot as plt

"""## Download the input file"""

!wget bit.ly/housing_data -O housing_prices_data.zip
!unzip housing_prices_data.zip
!cat housing_prices_data_description.txt

"""## Get the input file's meta-data"""

# Load the input file from the current working directory where it was downloaded and unzipped
data = "housing_prices_data.csv"
XAll = pd.read_csv(data, index_col='Id')

XAll.shape

XAll.columns

XAll.info()
# Notice that we have many missing values in a number of columns where
# the Non-Null Count in the output shows values below the total row count of 1460,
# e.g. the Alley column has only 91 non-null values.
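
# A minimal sketch (an optional check, assuming XAll is loaded as above): list the columns
# with the largest numbers of missing values, to see the same pattern that info() hints at.
XAll.isnull().sum().sort_values(ascending=False).head(10)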
"""### *Q1:Some columns (variables) have very few non-null values, e.g. PoolQC has only 7 non-null values. Does it make them good candidates for removing from the input dataset so that you can reduce the dimensionality of the dataset and make the model simpler (and less prone to overfitting)?*""" XAll[:10] # Keep the originally created DataFrame XAll around in case we need it # and create a copy of it df = XAll.copy() """## Remove the "outliers" using the YearBuilt variable""" df.YrSold.describe() df.YearBuilt.describe() df.YearBuilt.min() def hist_for_df (df): plt.figure (figsize=(10,5)) plt.grid() plt.xticks(range(df.YearBuilt.min(),df.YearBuilt.max(),10)) plt.hist(df.YearBuilt, histtype='step',); hist_for_df(df) # We will remove the houses built before 1954 (below the 25 percentile) # for these reasons: # -- outdated building codes # -- property bought for land/location, not accommodation # ..... basically, those are treated as outliers that could skew the prediction results df = df[df.YearBuilt >= 1954 ] print (f' We dropped {XAll.shape[0] - df.shape[0]} residential properties ...') hist_for_df(df)
"""### *Q2: There seems to be a noticeable drop in the YearBuilt histogram around 1984 ... How would you interpret this fact?*""" """## Get stats on the missing data""" count_of_missing_values = df.shape[0] // 2 # 50% # Let's find out the columns that have more than count_of_missing_values columns_with_nulls = [] report_nulls = df.isnull().sum() # report_nulls is a pandas Series object, which is an iterator # yielding tuples containing column name and the count of the missing values in that column for item in report_nulls.items(): if item[1] > count_of_missing_values: print (item) columns_with_nulls.append(item[0]) # column name print (f'Count of columns with NaN more than {count_of_missing_values}: {len(columns_with_nulls)} out of {df.shape[1]} columns') #count_of_missing_values columns_with_nulls """## Interprete the missing data ``` The data may be missing for a variety of reasons: * Data is not available * The type of data is not applicable * Data has not been reported/recorded For example, 'NA' in the Alley column means that there is no alley access to the property. We will treat the missing data in the input dataset as not applicable to the property in question.
```

## Deal with the GarageYrBlt variable

```
Some properties do not have garages; others may have garages (re-)built in a year different from
the year the house was built. Generally, it appears that GarageYrBlt may be a very weak predictor.
We will introduce a new boolean variable (sort of a one-hot-encoded feature) with values 0 and 1,
where 0 represents no garage(s), and 1 -- there is/are garage(s). And drop GarageYrBlt.
```

### *Exercise 1: Write Python code that will capture all the variable names pertaining to garages.*
"""

# Write your code in this cell (one possible approach is sketched below, after the has_or_not helper).

"""### *Q3: So, there are about half a dozen predictors related to the Garage concept. Would it make sense to combine those into a single variable that would account for different aspects of the Garage idea?*"""

df[["GarageYrBlt", "YearBuilt"]]  # mostly ==

df[df.GarageYrBlt != df.YearBuilt][["GarageYrBlt", "YearBuilt"]]  # 171 rows

def has_or_not(g):
    if np.isnan(g):  # g :: np.float64
        return 0
    else:
        return 1
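
# One possible approach to Exercise 1 (a minimal sketch; the name garage_col_names is just
# illustrative): collect every column whose name mentions "Garage".
garage_col_names = [c for c in df.columns if "Garage" in c]
garage_col_names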
df['HasGarage'] = df['GarageYrBlt'].apply(lambda g: has_or_not(g))
df.HasGarage

# Finding the count of each value's occurrence
# Take 1:
from collections import Counter
c = Counter()
c.update(df.HasGarage)
print(c)
print()

# Take 2: Simpler
print(df.HasGarage.value_counts())

df.drop("GarageYrBlt", inplace=True, axis=1)

"""## Convert the MSSubClass variable into categorical

```
By design, MSSubClass is a nominal categorical variable that has been cataloged by pandas
as being of type int. There is no meaning in the numeric order implied in the variable,
so we will change its type to categorical.
```
"""

df.MSSubClass = df.MSSubClass.astype("category")
df.MSSubClass
# Now if we apply the one-hot encoding scheme, the values will be mapped to
# MSSubClass_20, MSSubClass_40, etc. columns (see the check sketched at the end of the notebook)

"""## Apply the One-Hot Encoding Scheme"""

# Let's review the existing categorical columns
categorical_col_names = []
count_of_unique_values = 0
for c in df.columns:
    col = df.loc[:, c]  # .loc is the accessor
    if col.dtype == np.dtype('O'):  # 'O' [object] is how pandas (via NumPy) sees the Python type str (strings)
        print(col.name, col.dtype, col.unique())
        categorical_col_names.append(col.name)
        count_of_unique_values += len(col.unique())

# Apply the one-hot encoding via the get_dummies() pandas method
df_h = pd.get_dummies(df)

len(df_h.columns)

for c in df_h.columns:
    print(c)

"""## Save the DataFrame as a CSV file"""

df_h.to_csv("housing_2.csv", header=True, index=False)

"""## Using the Files perspective, download the file saved above to your local computer -- you will use it in the next lab!

*End of the lab notebook*
"""
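
# Optional sanity check (a minimal sketch, assuming the cells above have run): confirm that
# MSSubClass was expanded into MSSubClass_* dummy columns by get_dummies(), and that the
# saved CSV loads back with the same shape as df_h.
print([c for c in df_h.columns if c.startswith("MSSubClass_")])
print(pd.read_csv("housing_2.csv").shape, df_h.shape)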