Homework7_Aaryan Pimple
pdf
keyboard_arrow_up
School
University of Southern California *
*We aren’t endorsed by this school
Course
558
Subject
Industrial Engineering
Date
Feb 20, 2024
Type
Pages
15
Uploaded by CoachMorningYak37
ISE-558 Data Management for Analytics
Homework 7
For this homework assignment, enter your answers in the blank cells below, either as Python
code (in code cells) or as text in Markdown cells. When you are completed, create a PDF version
of your solution with the menu command File > Download As HTML and then open the html �
le
in a browser and "print" it to a PDF �
le. Upload this PDF �
le to Gradescope in the normal way.
import pandas as pd
import numpy as np
Problem 1
You are to use the tables found in “Orders - Data Integration.csv” and “Product Costs - Data
Integration.csv” to generate a report of the total pro
�
t from each of your customers. Perform the
following steps:
1A) Read in the two �
les and determine if either (or both) of the �
les are not in Tidy format.
Summarize below which �
les are not in Tidy format and why:
# Load the CSV files
orders_data = pd.read_csv('/content/Orders - Data Integration.csv')
product_costs_data = pd.read_csv('/content/Product Costs - Data Integration.csv')
def check_tidy_format(data, table_name):
variables_in_columns = all (data.columns == data.columns.str.strip())
observations_in_rows = data.shape[0] == len(data)
observational_unit = len(data.index.unique()) == len(data)
tidy_summary = {
"Table Name": table_name,
"Variables in Columns": variables_in_columns,
"Observations in Rows": observations_in_rows,
"Observational Unit": observational_unit,
"Tidy Format": variables_in_columns and observations_in_rows and observational_unit
}
return tidy_summary
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
1 of 15
05-12-2023, 22:39
orders_tidy_summary = check_tidy_format (orders_data, "Orders")
product_costs_tidy_summary = check_tidy_format (product_costs_data, "Product Costs")
print("Summary for Orders table:")
print(orders_tidy_summary)
print("\nSummary for Product Costs table:")
print(product_costs_tidy_summary)
Summary for Orders table:
{'Table Name': 'Orders', 'Variables in Columns': True, 'Observations in Rows': True, Summary for Product Costs table:
{'Table Name': 'Product Costs', 'Variables in Columns': True, 'Observations in Rows':
xxx
1B) If either table is not in Tidy format, correct it. Also, convert the costs from character to
numeric types.
def tidy_and_convert_numeric (data, table_name):
if 'cost' in data.columns:
data[ 'cost'] = pd.to_numeric(data['cost'], errors='coerce')
data = data.dropna()
data = data.reset_index(drop=True)
print(f"{table_name} table has been corrected and costs have been converted to numeric.
return data
orders_data_corrected = tidy_and_convert_numeric (orders_data, "Orders")
product_costs_data_corrected = tidy_and_convert_numeric (product_costs_data, "Product Cos
Orders table has been corrected and costs have been converted to numeric.
Product Costs table has been corrected and costs have been converted to numeric.
1C) Combine the two data frames to add the “cost” information column from the Product Costs -
Data Integration data frame to the Orders - Data Integration data frame. Display your resulting
combined data frame.
combined_data = pd.merge(orders_data_corrected, product_costs_data_corrected, on='Product
print("Combined Data Frame:")
print(combined_data)
Combined Data Frame:
Order_Num Customer Line Item Year Purchased Product Code Quantity \
0 1000 Customer A 1 2019 X1189 65 1 1000 Customer A 2 2019 A33 63 2 1000 Customer A 3 2019 BW243 75 3 1000 Customer A 4 2019 X1388 20 Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
2 of 15
05-12-2023, 22:39
3 1000 Customer A 4 2019 X1388 20 4 1000 Customer A 5 2019 Y12 82 5 1001 Customer B 1 2020 X1388 83 6 1001 Customer B 2 2020 BW243 29 7 1001 Customer B 3 2020 GG2554 70 8 1002 Customer C 1 2020 X1388 52 9 1002 Customer C 2 2020 HC155 73 10 1002 Customer C 3 2020 ZZ52 81 11 1002 Customer C 4 2020 YYS1 73 12 1002 Customer C 5 2020 GG2554 98 13 1002 Customer C 6 2020 Y12 93 14 1002 Customer C 7 2020 HCK15 81 15 1003 Customer D 1 2019 X1388 62 Unit Price 2019 Cost 2020 Cost 0 314 244.92 273.92 1 698 467.66 484.66 2 483 367.08 380.08 3 684 540.36 566.36 4 474 322.32 335.32 5 684 540.36 566.36 6 483 367.08 380.08 7 595 428.40 447.40 8 684 540.36 566.36 9 404 282.80 303.80 10 317 196.54 226.54 11 258 157.38 172.38 12 595 428.40 447.40 13 474 322.32 335.32 14 532 335.16 357.16 15 684 540.36 566.36 1D) Create a new column in the joined table that is equal to the total pro
�
t for that line item
(note: pro
�
t equals the price that you charge for an item minus the cost that you have to pay to
get that item from your supplier). Display the resulting dataframe.
combined_data['Total Profit'] = (combined_data['Unit Price'] - combined_data['2020 Cost']
print("Combined Data Frame with Total Profit: ")
print(combined_data)
Combined Data Frame with Total Profit: Order_Num Customer Line Item Year Purchased Product Code Quantity \
0 1000 Customer A 1 2019 X1189 65 1 1000 Customer A 2 2019 A33 63 2 1000 Customer A 3 2019 BW243 75 3 1000 Customer A 4 2019 X1388 20 4 1000 Customer A 5 2019 Y12 82 5 1001 Customer B 1 2020 X1388 83 6 1001 Customer B 2 2020 BW243 29 7 1001 Customer B 3 2020 GG2554 70 8 1002 Customer C 1 2020 X1388 52 9 1002 Customer C 2 2020 HC155 73 Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
3 of 15
05-12-2023, 22:39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
9 1002 Customer C 2 2020 HC155 73 10 1002 Customer C 3 2020 ZZ52 81 11 1002 Customer C 4 2020 YYS1 73 12 1002 Customer C 5 2020 GG2554 98 13 1002 Customer C 6 2020 Y12 93 14 1002 Customer C 7 2020 HCK15 81 15 1003 Customer D 1 2019 X1388 62 Unit Price 2019 Cost 2020 Cost Total Profit 0 314 244.92 273.92 2605.20 1 698 467.66 484.66 13440.42 2 483 367.08 380.08 7719.00 3 684 540.36 566.36 2352.80 4 474 322.32 335.32 11371.76 5 684 540.36 566.36 9764.12 6 483 367.08 380.08 2984.68 7 595 428.40 447.40 10332.00 8 684 540.36 566.36 6117.28 9 404 282.80 303.80 7314.60 10 317 196.54 226.54 7327.26 11 258 157.38 172.38 6250.26 12 595 428.40 447.40 14464.80 13 474 322.32 335.32 12897.24 14 532 335.16 357.16 14162.04 15 684 540.36 566.36 7293.68 1E) Generate and display a data frame that contains the total pro
�
t received from each of your
customers (your resulting table will contain four rows - one row for each customer).
customer_profit = combined_data.groupby('Customer')['Total Profit'].sum().reset_index()
print("Total Profit from Each Customer:")
print(customer_profit)
Total Profit from Each Customer:
Customer Total Profit
0 Customer A 37489.18
1 Customer B 23080.80
2 Customer C 68533.48
3 Customer D 7293.68
Problem 2
2A) Read into dataframe the following six star schema �
les: rentals.csv, addresses.csv,
cities.csv, countries.csv, payments.csv, and customers.csv. Then, combine the six �
les into a
single dataframe as your analytical base table. Display the �
rst few rows of this analytical base
table (don’t worry if you can’t display all the columns in your printout).
rental_df = pd.read_csv('/content/rentals.csv')
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
4 of 15
05-12-2023, 22:39
rental_df = pd.read_csv('/content/rentals.csv')
address_df = pd.read_csv('/content/addresses.csv')
city_df = pd.read_csv('/content/cities.csv')
country_df = pd.read_csv('/content/countries.csv')
customer_df = pd. read_csv ('/content/customers.csv')
payments_df = pd.read_csv('/content/payments.csv')
merged_df = pd.merge(rental_df, customer_df, on='CUSTOMER_ID')
merged_df = pd.merge(merged_df, address_df, on='ADDRESS_ID')
merged_df = pd.merge(merged_df, city_df, on='CITY_ID')
merged_df = pd.merge(merged_df, country_df, on='COUNTRY_ID')
merged_df = pd.merge(merged_df, payments_df, on='RENTAL_ID')
print (merged_df.head())
RENTAL_ID RENTAL_DATE CUSTOMER_ID RETURN_DATE STAFF_ID \
0 1 2005-05-24 22:53:30 130 2005-05-26 22:04:30 1 1 746 2005-05-29 09:25:10 130 2005-06-02 04:20:10 2 2 1630 2005-06-16 07:55:01 130 2005-06-19 06:38:01 1 3 1864 2005-06-17 01:39:47 130 2005-06-24 19:39:47 2 4 2163 2005-06-17 23:46:16 130 2005-06-22 22:48:16 2 FIRST_NAME LAST_NAME EMAIL ADDRESS_ID \
0 CHARLOTTE HUNTER CHARLOTTE.HUNTER@sakilacustomer.org 134 1 CHARLOTTE HUNTER CHARLOTTE.HUNTER@sakilacustomer.org 134 2 CHARLOTTE HUNTER CHARLOTTE.HUNTER@sakilacustomer.org 134 3 CHARLOTTE HUNTER CHARLOTTE.HUNTER@sakilacustomer.org 134 4 CHARLOTTE HUNTER CHARLOTTE.HUNTER@sakilacustomer.org 134 ACTIVE ... DISTRICT CITY_ID POSTAL_CODE PHONE \
0 True ... Gois 190 82639.0 9.354486e+11 1 True ... Gois 190 82639.0 9.354486e+11 2 True ... Gois 190 82639.0 9.354486e+11 3 True ... Gois 190 82639.0 9.354486e+11 4 True ... Gois 190 82639.0 9.354486e+11 CITY COUNTRY_ID COUNTRY PAYMENT_ID AMOUNT \
0 Águas Lindas de Goiás 15 Brazil 3504 2.99 1 Águas Lindas de Goiás 15 Brazil 3505 2.99 2 Águas Lindas de Goiás 15 Brazil 3506 2.99 3 Águas Lindas de Goiás 15 Brazil 3507 2.99 4 Águas Lindas de Goiás 15 Brazil 3508 2.99 PAYMENT_DATE 0 2005-05-24 22:53:30 1 2005-05-29 09:25:10 2 2005-06-16 07:55:01 3 2005-06-17 01:39:47 4 2005-06-17 23:46:16 [5 rows x 22 columns]
2B) Write a command to return the total amount of all of the payments by customers who live in
the United States
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
5 of 15
05-12-2023, 22:39
the United States
total_payments_us = merged_df.loc[merged_df['COUNTRY'] == 'United States', 'AMOUNT'].sum(
print(f'Total payments by customers in th eUnited States: {total_payments_us}')
Total payments by customers in th eUnited States: 2015.1900000000003
2C) Write a command to return the total payment amounts for every country that has a total
payment abount greater than 1000. Sort the list from highest to lowest.
result_df = merged_df.groupby('COUNTRY')['AMOUNT'].sum().reset_index()
result_df = result_df[result_df['AMOUNT'] > 1000].sort_values(by='AMOUNT', ascending=Fals
print(result_df)
COUNTRY AMOUNT
42 India 3052.44
21 China 2769.04
101 United States 2015.19
48 Japan 1767.85
58 Mexico 1762.00
13 Brazil 1591.23
77 Russian Federation 1552.35
73 Philippines 1124.21
2D) Write a command to return the total payment amounts for the 10 customer IDs with the
largest total payment amounts. Sort the list from highest to lowest amounts.
result_df = merged_df.groupby('CUSTOMER_ID')['AMOUNT'].sum().reset_index()
result_df = result_df.nlargest(10, 'AMOUNT').sort_values(by='AMOUNT', ascending=False)
print(result_df)
CUSTOMER_ID AMOUNT
525 526 116.74
175 176 110.77
235 236 103.73
468 469 101.78
143 144 97.80
177 178 95.82
308 309 92.84
186 187 92.82
238 239 92.80
306 307 92.79
Problem 3
The �
les corporations.csv, departments.csv, and teams.csv contain data that you want to use to
analyze the effects of various factors on a department’s error rates on the tasks that they
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
6 of 15
05-12-2023, 22:39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
perform and track. Read these �
les into tibbles and combine in them into a single one row per
analysis subject tibble to perform this analysis.
corporations_df =pd.read_csv('/content/corporations.csv')
departments_df =pd.read_csv('/content/departments.csv')
teams_df =pd.read_csv('/content/teams.csv')
merged_df = corporations_df.merge(departments_df, on = 'Corporation_ID', how='left')
merged_df = merged_df.merge(teams_df, left_on = ['Corporation_ID','Department_Name'], rig
print(merged_df.head())
Corporation_ID Corporation_Name Total_Employees Total_Revenue \
0 1 TechCorp Inc. 10801 $7B 1 1 TechCorp Inc. 10801 $7B 2 1 TechCorp Inc. 10801 $7B 3 1 TechCorp Inc. 10801 $7B 4 1 TechCorp Inc. 10801 $7B Headquarters Department_Name Department_Size Department \
0 San Francisco, CA HR 277 HR 1 San Francisco, CA HR 277 HR 2 San Francisco, CA HR 277 HR 3 San Francisco, CA Finance 1302 Finance 4 San Francisco, CA Finance 1302 Finance Team_Name Team_Leader Team_Size Task_Completion_Rate \
0 Training Frank 55 0.93 1 Employee Relations Karen 229 0.76 2 Recruitment Larry 166 0.95 3 Audit Hannah 47 0.96 4 Investments Hannah 23 0.61 Error_Rate Training_Hours 0 0.13 36 1 0.06 40 2 0.15 38 3 0.09 14 4 0.11 11 merged_df.drop(['Team_Name','Team_Leader'],axis=1,inplace=True)
merged_df.columns
Index(['Corporation_ID', 'Corporation_Name', 'Total_Employees',
'Total_Revenue', 'Headquarters', 'Department_Name', 'Department_Size',
'Department', 'Team_Size', 'Task_Completion_Rate', 'Error_Rate',
'Training_Hours'],
dtype='object')
merged_df_group = merged_df.groupby(['Corporation_ID','Corporation_Name','Total_Employees'
'Team_Size':['sum'], 'Task_Completion_Rate':['mean'],'Error_Rate':['mean'], 'Training_
merged_df_group.head(10)
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
7 of 15
05-12-2023, 22:39
Corporation_ID
Corporation_Name
Total_Employees
Total_Revenue
Headquarters
De
0
1
TechCorp Inc.
10801
$7B
San
Francisco, CA
1
1
TechCorp Inc.
10801
$7B
San
Francisco, CA
2
1
TechCorp Inc.
10801
$7B
San
Francisco, CA
3
1
TechCorp Inc.
10801
$7B
San
Francisco, CA
4
1
TechCorp Inc.
10801
$7B
San
Francisco, CA
5
2
HealthHub LLC
9568
$3B
New York, NY
6
2
HealthHub LLC
9568
$3B
New York, NY
7
2
HealthHub LLC
9568
$3B
New York, NY
8
2
HealthHub LLC
9568
$3B
New York, NY
9
2
HealthHub LLC
9568
$3B
New York, NY
Problem 4
For this problem, you will be performing initial data cleansing and feature engineering on the
dataset "House Price Dataset.csv" that will be used to predict house prices based on the
following attributes:
• House Sale Identi
�
er
• House Age
• SqFt (square footage of the house)
• Num Bathrooms
• Num Bedrooms
• Average Income
• Sales Price
Read the csv �
le into a Python/Pandas dataframe and prepare the dataset for analytics by
performing the steps below.
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
8 of 15
05-12-2023, 22:39
4A) Are there any missing values? If so, decide how to correct the issue and show your code
that does so.
df = pd.read_csv('/content/House Price Dataset.csv')
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)
Missing Values:
House Sale Identifier 0
House Age 180
SqFt 0
Num Bedrooms 0
Num Bathrooms 50
Average Income 0
Sales Price 0
dtype: int64
House Sale
Identifier
House
Age
SqFt
Num
Bedrooms
Num
Bathrooms
Average
Income
Sales
Price
0
1615233
4.0
2194.0
2
1.0
63.433
218862
1
8524268
22.0
3217.0
6
5.0
43.375
293655
2
5739494
45.0
2006.0
2
1.0
64.505
234964
3
3043801
55.0
3250.0
5
4.0
44.431
285840
4
4548709
36.0
2912.0
4
3.0
45.225
295944
df.head()
df['House Age'].fillna(df['House Age'].mean(), inplace=True)
df['Num Bathrooms'].fillna(df['Num Bedrooms'].mean(), inplace=True)
missing_values_after_filling = df.isnull().sum()
print("\nMissing Values After Filling:\n", missing_values_after_filling)
Missing Values After Filling:
House Sale Identifier 0
House Age 0
SqFt 0
Num Bedrooms 0
Num Bathrooms 0
Average Income 0
Sales Price 0
dtype: int64
4B) Are there any extreme outliers? If so, decide how to correct the issue and show your code
that does so.
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
9 of 15
05-12-2023, 22:39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
import matplotlib.pyplot as plt import seaborn as sns
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[['House Age', 'SqFt', 'Num Bedrooms', 'Num Bathrooms', 'Average Incom
plt.title('Box Plot for Identifying Outliers')
plt. show()
xxx
Q1 = df['Sales Price'].quantile(0.25)
Q3 = df['Sales Price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['Sales Price'] < lower_bound) | (df['Sales Price'] > upper_bound)]
df_no_outliers = df[(df['Sales Price'] >= lower_bound) & (df['Sales Price'] <= upper_boun
print ("Outliers:")
print (outliers)
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
10 of 15
05-12-2023, 22:39
print (outliers)
Outliers:
House Sale Identifier House Age SqFt Num Bedrooms Num Bathrooms \
12 5987860 46.000000 3110.0 6 5.000 28 3273662 42.000000 2358.0 1 0.000 50 1656962 30.000000 4483.0 6 5.000 54 4951518 30.000000 2068.0 4 3.000 101 9938683 7.000000 2987.0 4 3.635 121 8490383 30.000000 2725.0 2 1.000 150 1215048 22.000000 249385.5 6 5.000 174 2320295 30.000000 4467.0 7 6.000 185 7228100 16.000000 1906.0 3 2.000 188 7445076 26.614634 3694.0 5 4.000 197 6174356 34.000000 2783.0 4 3.000 233 8278041 25.000000 2363.0 5 4.000 248 7693757 3.000000 2193.0 3 2.000 249 9228407 17.000000 2403.0 6 5.000 264 8303461 47.000000 1938.0 3 2.000 272 1998101 39.000000 3135.0 3 2.000 280 4641054 48.000000 1996.0 5 4.000 284 2892940 29.000000 2228.0 1 0.000 293 2279439 7.000000 2489.0 5 4.000 295 7630534 24.000000 1784.0 4 3.000 298 3806167 18.000000 2345.0 2 1.000 317 3525167 33.000000 2947.0 5 4.000 321 9396285 26.614634 2300.0 3 2.000 325 7892873 27.000000 2244.0 4 3.635 344 5894467 35.000000 2311.0 3 2.000 364 3441886 39.000000 2173.0 1 0.000 373 4305566 26.614634 2082.0 7 6.000 380 6941350 42.000000 3518.0 4 3.000 430 4209982 16.000000 3413.0 2 1.000 449 8318327 25.000000 2519.0 2 1.000 490 6777658 40.000000 1855.0 6 5.000 505 6637705 8.000000 4199.0 4 3.635 532 1273378 9.000000 1871.0 2 1.000 554 7461958 42.000000 1918.0 2 1.000 558 3814258 34.000000 1920.0 4 3.000 574 8463703 26.614634 4367.0 3 2.000 580 3913188 34.000000 2595.0 3 2.000 608 4039593 15.000000 2752.0 3 2.000 622 2027180 26.614634 3286.0 3 2.000 625 1328669 26.614634 1907.0 1 0.000 629 8900210 39.000000 2086.0 5 4.000 660 8630729 60.000000 2242.0 3 2.000 680 4997299 26.614634 3014.0 3 2.000 716 1897133 26.614634 2238.0 1 0.000 738 5255322 26.000000 2182.0 3 2.000 766 2521571 45.000000 1699.0 3 2.000 786 3658410 0.000000 1966.0 3 2.000 828 8829895 34.000000 2274.0 2 1.000 837 8687036 24.000000 3777.0 7 6.000 847 9009467 33.000000 2112.0 3 2.000 852 1120173 15.000000 2060.0 5 4.000 857 5183967 26.614634 3238.0 6 5.000 860 7929707 26.614634 4128.0 6 5.000 Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
11 of 15
05-12-2023, 22:39
860 7929707 26.614634 4128.0 6 5.000 875 2822891 18.000000 3120.0 4 3.000 885 2193492 46.000000 3256.0 6 5.000 892 7361753 39.000000 2926.0 6 5.000 print("\nDataframe without Outliers:")
print(df_no_outliers)
Dataframe without Outliers:
House Sale Identifier House Age SqFt Num Bedrooms Num Bathrooms \
0 1615233 4.000000 2194.0 2 1.0 1 8524268 22.000000 3217.0 6 5.0 2 5739494 45.000000 2006.0 2 1.0 3 3043801 55.000000 3250.0 5 4.0 4 4548709 36.000000 2912.0 4 3.0 .. ... ... ... ... ... 995 8911243 2.000000 4525.0 4 3.0 996 1845461 26.614634 1948.0 4 3.0 997 7530903 26.614634 2836.0 6 5.0 998 1780321 38.000000 3059.0 5 4.0 999 9077141 27.000000 1867.0 2 1.0 Average Income Sales Price 0 63.433 218862 1 43.375 293655 2 64.505 234964 3 44.431 285840 4 45.225 295944 .. ... ... 995 45.349 281308 996 42.609 206885 997 66.243 265869 998 64.369 213394 999 52.280 230043 [940 rows x 7 columns]
4C) Are any of the variable signi
�
cantly skewed? If so, correct them. Display any resulting
histograms to show that the skew has been reduced.
xxx
from scipy.stats import skew
import numpy as np
plt.figure(figsize=(12, 6))
sns.histplot(df['Sales Price'], kde=True)
plt.title('Histogram of Sales Price(Before Transformation)')
plt.show()
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
12 of 15
05-12-2023, 22:39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Skewness before transformation: 2.3827741670175127
skewness_before = skew(df['Sales Price'])
print(f"Skewness before transformation: {skewness_before}")
df['Sales Price'] = np.log1p(df['Sales Price'])
plt.figure(figsize=(12, 6))
sns.histplot(df['Sales Price'], kde=True)
plt.title('Histogram of Sales Price (After Log Transformation)')
plt.show()
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
13 of 15
05-12-2023, 22:39
skewness_after = skew(df['Sales Price'])
print(f"Skewness after log transformation: {skewness_after}")
Skewness after log transformation: 0.5862674016203109
4D) Does the dataset require scaling? If so, perform that operation.
from sklearn.preprocessing import StandardScaler
features_to_scale = ['House Age', 'SqFt', 'Num Bedrooms', 'Num Bathrooms', 'Average Incom
scalar = StandardScaler()
df[features_to_scale] = scalar.fit_transform(df[features_to_scale])
print("Scaled DataFrame:")
print(df.head())
Scaled DataFrame:
House Sale Identifier House Age SqFt Num Bedrooms Num Bathrooms \
0 1615233 -1.727576 -0.119639 -0.935926 -0.975395 1 8524268 -0.352521 -0.070017 1.353801 1.345937 2 5739494 1.404494 -0.128759 -0.935926 -0.975395 3 3043801 2.168413 -0.068417 0.781369 0.765604 4 4548709 0.716966 -0.084812 0.208938 0.185271 Average Income Sales Price 0 0.641111 12.296201 1 -1.015240 12.590164 2 0.729635 12.367192 3 -0.928038 12.563191 4 -0.862471 12.597929 4E) Display your resulting dataset
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
14 of 15
05-12-2023, 22:39
House Sale
Identifier
House Age
SqFt
Num
Bedrooms
Num
Bathrooms
Average
Income
Sales
Price
0
1615233
-1.727576e+00
-0.119639
-0.935926
-0.975395
0.641111
12.296201
1
8524268
-3.525208e-01
-0.070017
1.353801
1.345937
-1.015240
12.590164
2
5739494
1.404494e+00
-0.128759
-0.935926
-0.975395
0.729635
12.367192
3
3043801
2.168413e+00
-0.068417
0.781369
0.765604
-0.928038
12.563191
4
4548709
7.169663e-01
-0.084812
0.208938
0.185271
-0.862471
12.597929
...
...
...
...
...
...
...
...
995
8911243
-1.880360e+00
-0.006571
0.208938
0.185271
-0.852231
12.547209
996
1845461
-2.713987e-16
-0.131572
0.208938
0.185271
-1.078495
12.239923
997
7530903
-2.713987e-16
-0.088498
1.353801
1.345937
0.873156
12.490763
998
1780321
8.697502e-01
-0.077681
0.781369
0.765604
0.718404
12.270900
999
9077141
2.943884e-02
-0.135501
-0.935926
-0.975395
-0.279882
12.346026
1000 rows × 7 columns
df
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
15 of 15
05-12-2023, 22:39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help