Homework7_Aaryan Pimple

pdf

School

University of Southern California *

*We aren’t endorsed by this school

Course

558

Subject

Industrial Engineering

Date

Feb 20, 2024

Type

pdf

Pages

15

Uploaded by CoachMorningYak37

Report
ISE-558 Data Management for Analytics Homework 7 For this homework assignment, enter your answers in the blank cells below, either as Python code (in code cells) or as text in Markdown cells. When you are completed, create a PDF version of your solution with the menu command File > Download As HTML and then open the html le in a browser and "print" it to a PDF le. Upload this PDF le to Gradescope in the normal way. import pandas as pd import numpy as np Problem 1 You are to use the tables found in “Orders - Data Integration.csv” and “Product Costs - Data Integration.csv” to generate a report of the total pro t from each of your customers. Perform the following steps: 1A) Read in the two les and determine if either (or both) of the les are not in Tidy format. Summarize below which les are not in Tidy format and why: # Load the CSV files orders_data = pd.read_csv('/content/Orders - Data Integration.csv') product_costs_data = pd.read_csv('/content/Product Costs - Data Integration.csv') def check_tidy_format(data, table_name): variables_in_columns = all (data.columns == data.columns.str.strip()) observations_in_rows = data.shape[0] == len(data) observational_unit = len(data.index.unique()) == len(data) tidy_summary = { "Table Name": table_name, "Variables in Columns": variables_in_columns, "Observations in Rows": observations_in_rows, "Observational Unit": observational_unit, "Tidy Format": variables_in_columns and observations_in_rows and observational_unit } return tidy_summary Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 1 of 15 05-12-2023, 22:39
orders_tidy_summary = check_tidy_format (orders_data, "Orders") product_costs_tidy_summary = check_tidy_format (product_costs_data, "Product Costs") print("Summary for Orders table:") print(orders_tidy_summary) print("\nSummary for Product Costs table:") print(product_costs_tidy_summary) Summary for Orders table: {'Table Name': 'Orders', 'Variables in Columns': True, 'Observations in Rows': True, Summary for Product Costs table: {'Table Name': 'Product Costs', 'Variables in Columns': True, 'Observations in Rows': xxx 1B) If either table is not in Tidy format, correct it. Also, convert the costs from character to numeric types. def tidy_and_convert_numeric (data, table_name): if 'cost' in data.columns: data[ 'cost'] = pd.to_numeric(data['cost'], errors='coerce') data = data.dropna() data = data.reset_index(drop=True) print(f"{table_name} table has been corrected and costs have been converted to numeric. return data orders_data_corrected = tidy_and_convert_numeric (orders_data, "Orders") product_costs_data_corrected = tidy_and_convert_numeric (product_costs_data, "Product Cos Orders table has been corrected and costs have been converted to numeric. Product Costs table has been corrected and costs have been converted to numeric. 1C) Combine the two data frames to add the “cost” information column from the Product Costs - Data Integration data frame to the Orders - Data Integration data frame. Display your resulting combined data frame. combined_data = pd.merge(orders_data_corrected, product_costs_data_corrected, on='Product print("Combined Data Frame:") print(combined_data) Combined Data Frame: Order_Num Customer Line Item Year Purchased Product Code Quantity \ 0 1000 Customer A 1 2019 X1189 65 1 1000 Customer A 2 2019 A33 63 2 1000 Customer A 3 2019 BW243 75 3 1000 Customer A 4 2019 X1388 20 Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 2 of 15 05-12-2023, 22:39
3 1000 Customer A 4 2019 X1388 20 4 1000 Customer A 5 2019 Y12 82 5 1001 Customer B 1 2020 X1388 83 6 1001 Customer B 2 2020 BW243 29 7 1001 Customer B 3 2020 GG2554 70 8 1002 Customer C 1 2020 X1388 52 9 1002 Customer C 2 2020 HC155 73 10 1002 Customer C 3 2020 ZZ52 81 11 1002 Customer C 4 2020 YYS1 73 12 1002 Customer C 5 2020 GG2554 98 13 1002 Customer C 6 2020 Y12 93 14 1002 Customer C 7 2020 HCK15 81 15 1003 Customer D 1 2019 X1388 62 Unit Price 2019 Cost 2020 Cost 0 314 244.92 273.92 1 698 467.66 484.66 2 483 367.08 380.08 3 684 540.36 566.36 4 474 322.32 335.32 5 684 540.36 566.36 6 483 367.08 380.08 7 595 428.40 447.40 8 684 540.36 566.36 9 404 282.80 303.80 10 317 196.54 226.54 11 258 157.38 172.38 12 595 428.40 447.40 13 474 322.32 335.32 14 532 335.16 357.16 15 684 540.36 566.36 1D) Create a new column in the joined table that is equal to the total pro t for that line item (note: pro t equals the price that you charge for an item minus the cost that you have to pay to get that item from your supplier). Display the resulting dataframe. combined_data['Total Profit'] = (combined_data['Unit Price'] - combined_data['2020 Cost'] print("Combined Data Frame with Total Profit: ") print(combined_data) Combined Data Frame with Total Profit: Order_Num Customer Line Item Year Purchased Product Code Quantity \ 0 1000 Customer A 1 2019 X1189 65 1 1000 Customer A 2 2019 A33 63 2 1000 Customer A 3 2019 BW243 75 3 1000 Customer A 4 2019 X1388 20 4 1000 Customer A 5 2019 Y12 82 5 1001 Customer B 1 2020 X1388 83 6 1001 Customer B 2 2020 BW243 29 7 1001 Customer B 3 2020 GG2554 70 8 1002 Customer C 1 2020 X1388 52 9 1002 Customer C 2 2020 HC155 73 Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 3 of 15 05-12-2023, 22:39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
9 1002 Customer C 2 2020 HC155 73 10 1002 Customer C 3 2020 ZZ52 81 11 1002 Customer C 4 2020 YYS1 73 12 1002 Customer C 5 2020 GG2554 98 13 1002 Customer C 6 2020 Y12 93 14 1002 Customer C 7 2020 HCK15 81 15 1003 Customer D 1 2019 X1388 62 Unit Price 2019 Cost 2020 Cost Total Profit 0 314 244.92 273.92 2605.20 1 698 467.66 484.66 13440.42 2 483 367.08 380.08 7719.00 3 684 540.36 566.36 2352.80 4 474 322.32 335.32 11371.76 5 684 540.36 566.36 9764.12 6 483 367.08 380.08 2984.68 7 595 428.40 447.40 10332.00 8 684 540.36 566.36 6117.28 9 404 282.80 303.80 7314.60 10 317 196.54 226.54 7327.26 11 258 157.38 172.38 6250.26 12 595 428.40 447.40 14464.80 13 474 322.32 335.32 12897.24 14 532 335.16 357.16 14162.04 15 684 540.36 566.36 7293.68 1E) Generate and display a data frame that contains the total pro t received from each of your customers (your resulting table will contain four rows - one row for each customer). customer_profit = combined_data.groupby('Customer')['Total Profit'].sum().reset_index() print("Total Profit from Each Customer:") print(customer_profit) Total Profit from Each Customer: Customer Total Profit 0 Customer A 37489.18 1 Customer B 23080.80 2 Customer C 68533.48 3 Customer D 7293.68 Problem 2 2A) Read into dataframe the following six star schema les: rentals.csv, addresses.csv, cities.csv, countries.csv, payments.csv, and customers.csv. Then, combine the six les into a single dataframe as your analytical base table. Display the rst few rows of this analytical base table (don’t worry if you can’t display all the columns in your printout). rental_df = pd.read_csv('/content/rentals.csv') Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 4 of 15 05-12-2023, 22:39
rental_df = pd.read_csv('/content/rentals.csv') address_df = pd.read_csv('/content/addresses.csv') city_df = pd.read_csv('/content/cities.csv') country_df = pd.read_csv('/content/countries.csv') customer_df = pd. read_csv ('/content/customers.csv') payments_df = pd.read_csv('/content/payments.csv') merged_df = pd.merge(rental_df, customer_df, on='CUSTOMER_ID') merged_df = pd.merge(merged_df, address_df, on='ADDRESS_ID') merged_df = pd.merge(merged_df, city_df, on='CITY_ID') merged_df = pd.merge(merged_df, country_df, on='COUNTRY_ID') merged_df = pd.merge(merged_df, payments_df, on='RENTAL_ID') print (merged_df.head()) RENTAL_ID RENTAL_DATE CUSTOMER_ID RETURN_DATE STAFF_ID \ 0 1 2005-05-24 22:53:30 130 2005-05-26 22:04:30 1 1 746 2005-05-29 09:25:10 130 2005-06-02 04:20:10 2 2 1630 2005-06-16 07:55:01 130 2005-06-19 06:38:01 1 3 1864 2005-06-17 01:39:47 130 2005-06-24 19:39:47 2 4 2163 2005-06-17 23:46:16 130 2005-06-22 22:48:16 2 FIRST_NAME LAST_NAME EMAIL ADDRESS_ID \ 0 CHARLOTTE HUNTER CHARLOTTE.HUNTER@sakilacustomer.org 134 1 CHARLOTTE HUNTER CHARLOTTE.HUNTER@sakilacustomer.org 134 2 CHARLOTTE HUNTER CHARLOTTE.HUNTER@sakilacustomer.org 134 3 CHARLOTTE HUNTER CHARLOTTE.HUNTER@sakilacustomer.org 134 4 CHARLOTTE HUNTER CHARLOTTE.HUNTER@sakilacustomer.org 134 ACTIVE ... DISTRICT CITY_ID POSTAL_CODE PHONE \ 0 True ... Gois 190 82639.0 9.354486e+11 1 True ... Gois 190 82639.0 9.354486e+11 2 True ... Gois 190 82639.0 9.354486e+11 3 True ... Gois 190 82639.0 9.354486e+11 4 True ... Gois 190 82639.0 9.354486e+11 CITY COUNTRY_ID COUNTRY PAYMENT_ID AMOUNT \ 0 Águas Lindas de Goiás 15 Brazil 3504 2.99 1 Águas Lindas de Goiás 15 Brazil 3505 2.99 2 Águas Lindas de Goiás 15 Brazil 3506 2.99 3 Águas Lindas de Goiás 15 Brazil 3507 2.99 4 Águas Lindas de Goiás 15 Brazil 3508 2.99 PAYMENT_DATE 0 2005-05-24 22:53:30 1 2005-05-29 09:25:10 2 2005-06-16 07:55:01 3 2005-06-17 01:39:47 4 2005-06-17 23:46:16 [5 rows x 22 columns] 2B) Write a command to return the total amount of all of the payments by customers who live in the United States Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 5 of 15 05-12-2023, 22:39
the United States total_payments_us = merged_df.loc[merged_df['COUNTRY'] == 'United States', 'AMOUNT'].sum( print(f'Total payments by customers in th eUnited States: {total_payments_us}') Total payments by customers in th eUnited States: 2015.1900000000003 2C) Write a command to return the total payment amounts for every country that has a total payment abount greater than 1000. Sort the list from highest to lowest. result_df = merged_df.groupby('COUNTRY')['AMOUNT'].sum().reset_index() result_df = result_df[result_df['AMOUNT'] > 1000].sort_values(by='AMOUNT', ascending=Fals print(result_df) COUNTRY AMOUNT 42 India 3052.44 21 China 2769.04 101 United States 2015.19 48 Japan 1767.85 58 Mexico 1762.00 13 Brazil 1591.23 77 Russian Federation 1552.35 73 Philippines 1124.21 2D) Write a command to return the total payment amounts for the 10 customer IDs with the largest total payment amounts. Sort the list from highest to lowest amounts. result_df = merged_df.groupby('CUSTOMER_ID')['AMOUNT'].sum().reset_index() result_df = result_df.nlargest(10, 'AMOUNT').sort_values(by='AMOUNT', ascending=False) print(result_df) CUSTOMER_ID AMOUNT 525 526 116.74 175 176 110.77 235 236 103.73 468 469 101.78 143 144 97.80 177 178 95.82 308 309 92.84 186 187 92.82 238 239 92.80 306 307 92.79 Problem 3 The les corporations.csv, departments.csv, and teams.csv contain data that you want to use to analyze the effects of various factors on a department’s error rates on the tasks that they Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 6 of 15 05-12-2023, 22:39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
perform and track. Read these les into tibbles and combine in them into a single one row per analysis subject tibble to perform this analysis. corporations_df =pd.read_csv('/content/corporations.csv') departments_df =pd.read_csv('/content/departments.csv') teams_df =pd.read_csv('/content/teams.csv') merged_df = corporations_df.merge(departments_df, on = 'Corporation_ID', how='left') merged_df = merged_df.merge(teams_df, left_on = ['Corporation_ID','Department_Name'], rig print(merged_df.head()) Corporation_ID Corporation_Name Total_Employees Total_Revenue \ 0 1 TechCorp Inc. 10801 $7B 1 1 TechCorp Inc. 10801 $7B 2 1 TechCorp Inc. 10801 $7B 3 1 TechCorp Inc. 10801 $7B 4 1 TechCorp Inc. 10801 $7B Headquarters Department_Name Department_Size Department \ 0 San Francisco, CA HR 277 HR 1 San Francisco, CA HR 277 HR 2 San Francisco, CA HR 277 HR 3 San Francisco, CA Finance 1302 Finance 4 San Francisco, CA Finance 1302 Finance Team_Name Team_Leader Team_Size Task_Completion_Rate \ 0 Training Frank 55 0.93 1 Employee Relations Karen 229 0.76 2 Recruitment Larry 166 0.95 3 Audit Hannah 47 0.96 4 Investments Hannah 23 0.61 Error_Rate Training_Hours 0 0.13 36 1 0.06 40 2 0.15 38 3 0.09 14 4 0.11 11 merged_df.drop(['Team_Name','Team_Leader'],axis=1,inplace=True) merged_df.columns Index(['Corporation_ID', 'Corporation_Name', 'Total_Employees', 'Total_Revenue', 'Headquarters', 'Department_Name', 'Department_Size', 'Department', 'Team_Size', 'Task_Completion_Rate', 'Error_Rate', 'Training_Hours'], dtype='object') merged_df_group = merged_df.groupby(['Corporation_ID','Corporation_Name','Total_Employees' 'Team_Size':['sum'], 'Task_Completion_Rate':['mean'],'Error_Rate':['mean'], 'Training_ merged_df_group.head(10) Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 7 of 15 05-12-2023, 22:39
Corporation_ID Corporation_Name Total_Employees Total_Revenue Headquarters De 0 1 TechCorp Inc. 10801 $7B San Francisco, CA 1 1 TechCorp Inc. 10801 $7B San Francisco, CA 2 1 TechCorp Inc. 10801 $7B San Francisco, CA 3 1 TechCorp Inc. 10801 $7B San Francisco, CA 4 1 TechCorp Inc. 10801 $7B San Francisco, CA 5 2 HealthHub LLC 9568 $3B New York, NY 6 2 HealthHub LLC 9568 $3B New York, NY 7 2 HealthHub LLC 9568 $3B New York, NY 8 2 HealthHub LLC 9568 $3B New York, NY 9 2 HealthHub LLC 9568 $3B New York, NY Problem 4 For this problem, you will be performing initial data cleansing and feature engineering on the dataset "House Price Dataset.csv" that will be used to predict house prices based on the following attributes: House Sale Identi er House Age SqFt (square footage of the house) Num Bathrooms Num Bedrooms Average Income Sales Price Read the csv le into a Python/Pandas dataframe and prepare the dataset for analytics by performing the steps below. Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 8 of 15 05-12-2023, 22:39
4A) Are there any missing values? If so, decide how to correct the issue and show your code that does so. df = pd.read_csv('/content/House Price Dataset.csv') missing_values = df.isnull().sum() print("Missing Values:\n", missing_values) Missing Values: House Sale Identifier 0 House Age 180 SqFt 0 Num Bedrooms 0 Num Bathrooms 50 Average Income 0 Sales Price 0 dtype: int64 House Sale Identifier House Age SqFt Num Bedrooms Num Bathrooms Average Income Sales Price 0 1615233 4.0 2194.0 2 1.0 63.433 218862 1 8524268 22.0 3217.0 6 5.0 43.375 293655 2 5739494 45.0 2006.0 2 1.0 64.505 234964 3 3043801 55.0 3250.0 5 4.0 44.431 285840 4 4548709 36.0 2912.0 4 3.0 45.225 295944 df.head() df['House Age'].fillna(df['House Age'].mean(), inplace=True) df['Num Bathrooms'].fillna(df['Num Bedrooms'].mean(), inplace=True) missing_values_after_filling = df.isnull().sum() print("\nMissing Values After Filling:\n", missing_values_after_filling) Missing Values After Filling: House Sale Identifier 0 House Age 0 SqFt 0 Num Bedrooms 0 Num Bathrooms 0 Average Income 0 Sales Price 0 dtype: int64 4B) Are there any extreme outliers? If so, decide how to correct the issue and show your code that does so. Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 9 of 15 05-12-2023, 22:39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
import matplotlib.pyplot as plt import seaborn as sns plt.figure(figsize=(12, 6)) sns.boxplot(data=df[['House Age', 'SqFt', 'Num Bedrooms', 'Num Bathrooms', 'Average Incom plt.title('Box Plot for Identifying Outliers') plt. show() xxx Q1 = df['Sales Price'].quantile(0.25) Q3 = df['Sales Price'].quantile(0.75) IQR = Q3 - Q1 lower_bound = Q1 - 1.5 * IQR upper_bound = Q3 + 1.5 * IQR outliers = df[(df['Sales Price'] < lower_bound) | (df['Sales Price'] > upper_bound)] df_no_outliers = df[(df['Sales Price'] >= lower_bound) & (df['Sales Price'] <= upper_boun print ("Outliers:") print (outliers) Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 10 of 15 05-12-2023, 22:39
print (outliers) Outliers: House Sale Identifier House Age SqFt Num Bedrooms Num Bathrooms \ 12 5987860 46.000000 3110.0 6 5.000 28 3273662 42.000000 2358.0 1 0.000 50 1656962 30.000000 4483.0 6 5.000 54 4951518 30.000000 2068.0 4 3.000 101 9938683 7.000000 2987.0 4 3.635 121 8490383 30.000000 2725.0 2 1.000 150 1215048 22.000000 249385.5 6 5.000 174 2320295 30.000000 4467.0 7 6.000 185 7228100 16.000000 1906.0 3 2.000 188 7445076 26.614634 3694.0 5 4.000 197 6174356 34.000000 2783.0 4 3.000 233 8278041 25.000000 2363.0 5 4.000 248 7693757 3.000000 2193.0 3 2.000 249 9228407 17.000000 2403.0 6 5.000 264 8303461 47.000000 1938.0 3 2.000 272 1998101 39.000000 3135.0 3 2.000 280 4641054 48.000000 1996.0 5 4.000 284 2892940 29.000000 2228.0 1 0.000 293 2279439 7.000000 2489.0 5 4.000 295 7630534 24.000000 1784.0 4 3.000 298 3806167 18.000000 2345.0 2 1.000 317 3525167 33.000000 2947.0 5 4.000 321 9396285 26.614634 2300.0 3 2.000 325 7892873 27.000000 2244.0 4 3.635 344 5894467 35.000000 2311.0 3 2.000 364 3441886 39.000000 2173.0 1 0.000 373 4305566 26.614634 2082.0 7 6.000 380 6941350 42.000000 3518.0 4 3.000 430 4209982 16.000000 3413.0 2 1.000 449 8318327 25.000000 2519.0 2 1.000 490 6777658 40.000000 1855.0 6 5.000 505 6637705 8.000000 4199.0 4 3.635 532 1273378 9.000000 1871.0 2 1.000 554 7461958 42.000000 1918.0 2 1.000 558 3814258 34.000000 1920.0 4 3.000 574 8463703 26.614634 4367.0 3 2.000 580 3913188 34.000000 2595.0 3 2.000 608 4039593 15.000000 2752.0 3 2.000 622 2027180 26.614634 3286.0 3 2.000 625 1328669 26.614634 1907.0 1 0.000 629 8900210 39.000000 2086.0 5 4.000 660 8630729 60.000000 2242.0 3 2.000 680 4997299 26.614634 3014.0 3 2.000 716 1897133 26.614634 2238.0 1 0.000 738 5255322 26.000000 2182.0 3 2.000 766 2521571 45.000000 1699.0 3 2.000 786 3658410 0.000000 1966.0 3 2.000 828 8829895 34.000000 2274.0 2 1.000 837 8687036 24.000000 3777.0 7 6.000 847 9009467 33.000000 2112.0 3 2.000 852 1120173 15.000000 2060.0 5 4.000 857 5183967 26.614634 3238.0 6 5.000 860 7929707 26.614634 4128.0 6 5.000 Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 11 of 15 05-12-2023, 22:39
860 7929707 26.614634 4128.0 6 5.000 875 2822891 18.000000 3120.0 4 3.000 885 2193492 46.000000 3256.0 6 5.000 892 7361753 39.000000 2926.0 6 5.000 print("\nDataframe without Outliers:") print(df_no_outliers) Dataframe without Outliers: House Sale Identifier House Age SqFt Num Bedrooms Num Bathrooms \ 0 1615233 4.000000 2194.0 2 1.0 1 8524268 22.000000 3217.0 6 5.0 2 5739494 45.000000 2006.0 2 1.0 3 3043801 55.000000 3250.0 5 4.0 4 4548709 36.000000 2912.0 4 3.0 .. ... ... ... ... ... 995 8911243 2.000000 4525.0 4 3.0 996 1845461 26.614634 1948.0 4 3.0 997 7530903 26.614634 2836.0 6 5.0 998 1780321 38.000000 3059.0 5 4.0 999 9077141 27.000000 1867.0 2 1.0 Average Income Sales Price 0 63.433 218862 1 43.375 293655 2 64.505 234964 3 44.431 285840 4 45.225 295944 .. ... ... 995 45.349 281308 996 42.609 206885 997 66.243 265869 998 64.369 213394 999 52.280 230043 [940 rows x 7 columns] 4C) Are any of the variable signi cantly skewed? If so, correct them. Display any resulting histograms to show that the skew has been reduced. xxx from scipy.stats import skew import numpy as np plt.figure(figsize=(12, 6)) sns.histplot(df['Sales Price'], kde=True) plt.title('Histogram of Sales Price(Before Transformation)') plt.show() Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 12 of 15 05-12-2023, 22:39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Skewness before transformation: 2.3827741670175127 skewness_before = skew(df['Sales Price']) print(f"Skewness before transformation: {skewness_before}") df['Sales Price'] = np.log1p(df['Sales Price']) plt.figure(figsize=(12, 6)) sns.histplot(df['Sales Price'], kde=True) plt.title('Histogram of Sales Price (After Log Transformation)') plt.show() Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 13 of 15 05-12-2023, 22:39
skewness_after = skew(df['Sales Price']) print(f"Skewness after log transformation: {skewness_after}") Skewness after log transformation: 0.5862674016203109 4D) Does the dataset require scaling? If so, perform that operation. from sklearn.preprocessing import StandardScaler features_to_scale = ['House Age', 'SqFt', 'Num Bedrooms', 'Num Bathrooms', 'Average Incom scalar = StandardScaler() df[features_to_scale] = scalar.fit_transform(df[features_to_scale]) print("Scaled DataFrame:") print(df.head()) Scaled DataFrame: House Sale Identifier House Age SqFt Num Bedrooms Num Bathrooms \ 0 1615233 -1.727576 -0.119639 -0.935926 -0.975395 1 8524268 -0.352521 -0.070017 1.353801 1.345937 2 5739494 1.404494 -0.128759 -0.935926 -0.975395 3 3043801 2.168413 -0.068417 0.781369 0.765604 4 4548709 0.716966 -0.084812 0.208938 0.185271 Average Income Sales Price 0 0.641111 12.296201 1 -1.015240 12.590164 2 0.729635 12.367192 3 -0.928038 12.563191 4 -0.862471 12.597929 4E) Display your resulting dataset Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 14 of 15 05-12-2023, 22:39
House Sale Identifier House Age SqFt Num Bedrooms Num Bathrooms Average Income Sales Price 0 1615233 -1.727576e+00 -0.119639 -0.935926 -0.975395 0.641111 12.296201 1 8524268 -3.525208e-01 -0.070017 1.353801 1.345937 -1.015240 12.590164 2 5739494 1.404494e+00 -0.128759 -0.935926 -0.975395 0.729635 12.367192 3 3043801 2.168413e+00 -0.068417 0.781369 0.765604 -0.928038 12.563191 4 4548709 7.169663e-01 -0.084812 0.208938 0.185271 -0.862471 12.597929 ... ... ... ... ... ... ... ... 995 8911243 -1.880360e+00 -0.006571 0.208938 0.185271 -0.852231 12.547209 996 1845461 -2.713987e-16 -0.131572 0.208938 0.185271 -1.078495 12.239923 997 7530903 -2.713987e-16 -0.088498 1.353801 1.345937 0.873156 12.490763 998 1780321 8.697502e-01 -0.077681 0.781369 0.765604 0.718404 12.270900 999 9077141 2.943884e-02 -0.135501 -0.935926 -0.975395 -0.279882 12.346026 1000 rows × 7 columns df Module 7 Homework.ipynb - Colaboratory https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo... 15 of 15 05-12-2023, 22:39
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help