1Dbk8_m8vR0d8SMRHmfgo_SbPSPdlq7Bj
pdf
keyboard_arrow_up
School
University of Southern California *
*We aren’t endorsed by this school
Course
558
Subject
Industrial Engineering
Date
Feb 20, 2024
Type
Pages
6
Uploaded by CoachMorningYak37
ISE-558 Data Management for Analytics
Homework 7
For this homework assignment, enter your answers in the blank cells below, either as Python
code (in code cells) or as text in Markdown cells. When you are completed, create a PDF version
of your solution with the menu command File > Download As HTML and then open the html �
le
in a browser and "print" it to a PDF �
le. Upload this PDF �
le to Gradescope in the normal way.
import pandas as pd
import numpy as np
Problem 1
You are to use the tables found in “Orders - Data Integration.csv” and “Product Costs - Data
Integration.csv” to generate a report of the total pro
�
t from each of your customers. Perform the
following steps:
1A) Read in the two �
les and determine if either (or both) of the �
les are not in Tidy format.
Summarize below which �
les are not in Tidy format and why:
import pandas as pd
# Load the CSV files
orders_path = '/Orders - Data Integration.csv'
product_costs_path = ''
# Reading the CSV files into DataFrames
orders_df = pd.read_csv(orders_path)
product_costs_df = pd.read_csv(product_costs_path)
# Merging the orders dataframe with the product costs dataframe
merged_df = orders_df.merge(product_costs_df, on="Product Code")
# Function to calculate the cost for each line item
def calculate_cost(row):
year = row['Year Purchased']
quantity = row['Quantity']
cost_per_unit = row['2019 Cost'] if year == 2019 else row['2020 Cost']
return cost_per_unit * quantity
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
1 of 6
05-12-2023, 17:52
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-1-d2e39827f55f> in <cell line: 8>
()
6
7
# Reading the CSV files into DataFrames
----> 8
orders_df
=
pd
.
read_csv
(
orders_path
)
9
product_costs_df
=
pd
.
read_csv
(
product_costs_path
)
10
6 frames
/usr/local/lib/python3.10/dist-packages/pandas/io/common.py in get_handle
(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
854
if
ioargs
.
encoding
and
"b"
not in
ioargs
.
mode
:
855
# Encoding
--> 856 handle = open(
857
handle
,
858
ioargs
.
mode
,
FileNotFoundError
: [Errno 2] No such file or directory: '/Orders - Data Integration.csv'
# Adding a new column for the total cost
merged_df['Total Cost'] = merged_df.apply(calculate_cost, axis=1)
# Calculating profit for each line item (Revenue - Cost)
merged_df['Profit'] = (merged_df['Unit Price'] * merged_df['Quantity']) - merged_df['Total
# Grouping by customer and calculating total profit
total_profit_per_customer = merged_df.groupby('Customer')['Profit'].sum().reset_index()
# Display the result
print(total_profit_per_customer)
xxx
1B) If either table is not in Tidy format, correct it. Also, convert the costs from character to
numeric types.
1C) Combine the two data frames to add the “cost” information column from the Product Costs -
Data Integration data frame to the Orders - Data Integration data frame. Display your resulting
combined data frame.
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
2 of 6
05-12-2023, 17:52
1D) Create a new column in the joined table that is equal to the total pro
�
t for that line item
(note: pro
�
t equals the price that you charge for an item minus the cost that you have to pay to
get that item from your supplier). Display the resulting dataframe.
1E) Generate and display a data frame that contains the total pro
�
t received from each of your
customers (your resulting table will contain four rows - one row for each customer).
Problem 2
2A) Read into dataframe the following six star schema �
les: rentals.csv, addresses.csv,
cities.csv, countries.csv, payments.csv, and customers.csv. Then, combine the six �
les into a
single dataframe as your analytical base table. Display the �
rst few rows of this analytical base
table (don’t worry if you can’t display all the columns in your printout).
2B) Write a command to return the total amount of all of the payments by customers who live in
the United States
2C) Write a command to return the total payment amounts for every country that has a total
payment abount greater than 1000. Sort the list from highest to lowest.
2D) Write a command to return the total payment amounts for the 10 customer IDs with the
largest total payment amounts. Sort the list from highest to lowest amounts.
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
3 of 6
05-12-2023, 17:52
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Problem 3
The �
les corporations.csv, departments.csv, and teams.csv contain data that you want to use to
analyze the effects of various factors on a department’s error rates on the tasks that they
perform and track. Read these �
les into tibbles and combine in them into a single one row per
analysis subject tibble to perform this analysis.
Problem 4
For this problem, you will be performing initial data cleansing and feature engineering on the
dataset "House Price Dataset.csv" that will be used to predict house prices based on the
following attributes:
• House Sale Identi
�
er
• House Age
• SqFt (square footage of the house)
• Num Bathrooms
• Num Bedrooms
• Average Income
• Sales Price
Read the csv �
le into a Python/Pandas dataframe and prepare the dataset for analytics by
performing the steps below.
4A) Are there any missing values? If so, decide how to correct the issue and show your code
that does so.
xxx
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
4 of 6
05-12-2023, 17:52
4B) Are there any extreme outliers? If so, decide how to correct the issue and show your code
that does so.
xxx
4C) Are any of the variable signi
�
cantly skewed? If so, correct them. Display any resulting
histograms to show that the skew has been reduced.
xxx
4D) Does the dataset require scaling? If so, perform that operation.
4E) Display your resulting dataset
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
5 of 6
05-12-2023, 17:52
Module 7 Homework.ipynb - Colaboratory
https://colab.research.google.com/drive/1Dbk8_m8vR0d8SMRHmfgo...
6 of 6
05-12-2023, 17:52
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help