# -*- coding: utf-8 -*-
"""Data Repairing and Priming Lab.ipynb

Automatically generated by Colaboratory.

Original file is located at

## Imports
"""

# Commented out IPython magic to ensure Python compatibility.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# %matplotlib inline
import matplotlib.pyplot as plt

"""## Download the input file"""

!wget bit.ly/housing_data -O housing_prices_data.zip
!unzip housing_prices_data.zip
!cat housing_prices_data_description.txt

"""## Get the input file's meta-data"""

# Load the input file from the current working directory where it was downloaded and unzipped
data = "housing_prices_data.csv"
XAll = pd.read_csv(data, index_col='Id')

XAll.shape

XAll.columns

XAll.info()
# Notice that we have many missing values in a number of columns where
# the Non-Null Count in the output shows values below the total row count of 1460,
# e.g. the Alley column has only 91 non-null values.
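
# A minimal sketch (an optional check, assuming XAll is loaded as above): list the columns
# with the largest numbers of missing values, to see the same pattern that info() hints at.
XAll.isnull().sum().sort_values(ascending=False).head(10)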
"""### *Q1:Some columns (variables) have very few non-null values, e.g. PoolQC has only 7 non-null values. Does it make them good candidates for removing from the input dataset so that you can reduce the dimensionality of the dataset and make the model simpler (and less prone to overfitting)?*""" XAll[:10] # Keep the originally created DataFrame XAll around in case we need it # and create a copy of it df = XAll.copy() """## Remove the "outliers" using the YearBuilt variable""" df.YrSold.describe() df.YearBuilt.describe() df.YearBuilt.min() def hist_for_df (df): plt.figure (figsize=(10,5)) plt.grid() plt.xticks(range(df.YearBuilt.min(),df.YearBuilt.max(),10)) plt.hist(df.YearBuilt, histtype='step',); hist_for_df(df) # We will remove the houses built before 1954 (below the 25 percentile) # for these reasons: # -- outdated building codes # -- property bought for land/location, not accommodation # ..... basically, those are treated as outliers that could skew the prediction results df = df[df.YearBuilt >= 1954 ] print (f' We dropped {XAll.shape[0] - df.shape[0]} residential properties ...') hist_for_df(df)
"""### *Q2: There seems to be a noticeable drop in the YearBuilt histogram around 1984 ... How would you interpret this fact?*""" """## Get stats on the missing data""" count_of_missing_values = df.shape[0] // 2 # 50% # Let's find out the columns that have more than count_of_missing_values columns_with_nulls = [] report_nulls = df.isnull().sum() # report_nulls is a pandas Series object, which is an iterator # yielding tuples containing column name and the count of the missing values in that column for item in report_nulls.items(): if item[1] > count_of_missing_values: print (item) columns_with_nulls.append(item[0]) # column name print (f'Count of columns with NaN more than {count_of_missing_values}: {len(columns_with_nulls)} out of {df.shape[1]} columns') #count_of_missing_values columns_with_nulls """## Interprete the missing data ``` The data may be missing for a variety of reasons: * Data is not available * The type of data is not applicable * Data has not been reported/recorded For example, 'NA' in the Alley column means that there is no alley access to the property. We will treat the missing data in the input dataset as not applicable to the property in question.
```

## Deal with the GarageYrBlt variable

```
Some properties do not have garages; others may have garages (re-)built in a year different from
the year the house was built. Generally, it appears that GarageYrBlt may be a very weak predictor.
We will introduce a new boolean variable (sort of a one-hot-encoded feature) with values 0 and 1,
where 0 represents no garage(s), and 1 -- there is/are garage(s). And drop GarageYrBlt.
```

### *Exercise 1: Write Python code that will capture all the variable names pertaining to garages.*
"""

# Write your code in this cell (one possible approach is sketched below, after the has_or_not helper).

"""### *Q3: So, there are about half a dozen predictors related to the Garage concept. Would it make sense to combine those into a single variable that would account for different aspects of the Garage idea?*"""

df[["GarageYrBlt", "YearBuilt"]]  # mostly ==

df[df.GarageYrBlt != df.YearBuilt][["GarageYrBlt", "YearBuilt"]]  # 171 rows

def has_or_not(g):
    if np.isnan(g):  # g :: np.float64
        return 0
    else:
        return 1
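
# One possible approach to Exercise 1 (a minimal sketch; the name garage_col_names is just
# illustrative): collect every column whose name mentions "Garage".
garage_col_names = [c for c in df.columns if "Garage" in c]
garage_col_names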
df['HasGarage'] = df['GarageYrBlt'].apply(lambda g: has_or_not(g))
df.HasGarage

# Finding the count of each value's occurrence
# Take 1:
from collections import Counter
c = Counter()
c.update(df.HasGarage)
print(c)
print()

# Take 2: Simpler
print(df.HasGarage.value_counts())

df.drop("GarageYrBlt", inplace=True, axis=1)

"""## Convert the MSSubClass variable into categorical

```
By design, MSSubClass is a nominal categorical variable that has been cataloged by pandas
as being of type int. There is no meaning in the numeric order implied in the variable,
so we will change its type to categorical.
```
"""

df.MSSubClass = df.MSSubClass.astype("category")
df.MSSubClass
# Now if we apply the one-hot encoding scheme, the values will be mapped to
# MSSubClass_20, MSSubClass_40, etc. columns (see the check sketched at the end of the notebook)

"""## Apply the One-Hot Encoding Scheme"""

# Let's review the existing categorical columns
categorical_col_names = []
count_of_unique_values = 0
for c in df.columns:
    col = df.loc[:, c]  # .loc is the accessor
    if col.dtype == np.dtype('O'):  # 'O' [object] is how pandas (via NumPy) sees the Python type str (strings)
        print(col.name, col.dtype, col.unique())
        categorical_col_names.append(col.name)
        count_of_unique_values += len(col.unique())

# Apply the one-hot encoding via the get_dummies() pandas method
df_h = pd.get_dummies(df)

len(df_h.columns)

for c in df_h.columns:
    print(c)

"""## Save the DataFrame as a CSV file"""

df_h.to_csv("housing_2.csv", header=True, index=False)

"""## Using the Files perspective, download the file saved above to your local computer -- you will use it in the next lab!

*End of the lab notebook*
"""
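
# Optional sanity check (a minimal sketch, assuming the cells above have run): confirm that
# MSSubClass was expanded into MSSubClass_* dummy columns by get_dummies(), and that the
# saved CSV loads back with the same shape as df_h.
print([c for c in df_h.columns if c.startswith("MSSubClass_")])
print(pd.read_csv("housing_2.csv").shape, df_h.shape)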