Assignment_4_Cleaning_Data_Set_Instructions

docx

School

Indiana University, Bloomington *

*We aren’t endorsed by this school

Course

BUSH352

Subject

Computer Science

Date

Feb 20, 2024

Type

docx

Pages

4

Uploaded by emilyguse09

Report
Assignment 4: Cleaning a Data Set Task: Identify and Correct Errors in a Data Set From Canvas, download and save: BtownOrders(Jan2019)_assignment_data.xlsx and Assignment4_Cleaning_Data_Set_Answersheet.docx Scenario: You have done such a good job with forecasting and cleaning data from Btown Interactive that they have asked you to do some further cleaning of their data. The sales group is suspicious that there might be errors in the data for January. You will be working with a new set of 3,296 orders with 5,182 line items from January 2019. The data is in a file called “ BtownOrders(Jan2019)_assignment_data.xlsx .” A “line item” is just an order for a specified number of a particular product – there can be multiple line items per order. You’ll be looking for errors in the data in several places: 1. Errors in the product names . 2. Errors in the promotional codes . 3. Errors in the total_product_price . You will find, document, and correct the errors in the Excel workbook. Deliverables: Complete the Assignment4_Cleaning_Data_Set_Answersheet.docx associated with this assignment on Canvas. Submit the completed answer sheet with the “cleaned” Excel file on Canvas by the due date. Evaluation: You will be graded based on the number of correct answers. There are 10 questions overall. Be sure to insert rows in the answer sheet where appropriate to answer the questions completely. 1 Make sure you complete the in-class exercise “Locating Bad Data in Excel” before attempting this assignment! See the W6C2 in-class assignment on Canvas for guidance.
Part 1: Errors in Product Names Verify that the product names (Column J) are correct using the master list in the Lookups tab and correct any errors. You can assume the information in the Lookups tab is always right. So if there is a mismatch, the error is in your data set. To do this, you will use the MATCH function (use W6C2 in-class exercise as a guide). Place your MATCH calculation in column N of the “Btown Orders (Jan)” worksheet. Title of the column “ ProdMatch ” (in cell N1) and start your MATCH formulas in cell N2. HINT: Using the Sort and Filter features in Excel can also help you. You’ve used both of those features in the exercises we’ve done so far in this course. ANOTHER HINT: Remember, there is a list of correct product names in the Lookups worksheet. Using the Assignment4_Cleaning_Data_Set_Answersheet.docx provided with this assignment, answer the following questions: 1) How many line items (rows) had incorrect product names? 2) List the products names with errors, listing the incorrect name, the corrected name, and how many rows of data had the error. The product names are only off by a couple of letters and should be easily identified in the Lookups list. Correct all invalid product names in the data. (Try sorting by product_name . You only need to list each incorrect product name once.) Now fix the incorrect product names in the “ Btown Orders (Jan) ” worksheet. Use caution when sorting!!! Refer to item #4 from the Taber reading . HINT: Use “Find and Replace” to speed up fixing the errors. Part 2: Errors in Promotional Codes Verify that the promotional codes (Column E) are correct using the master list in the Lookups tab and correct any errors. Use the MATCH function and place your function in Column O of the Btown Orders (Jan) ” worksheet. Make the title of the column “ PromMatch ” (in cell O1) and start your MATCH formulas in cell O2. Answer the following questions: 1) How many line items (rows) had incorrect promotional codes? The value in the promo_code field should match the Lookups data sheet exactly. 2) List the promotional codes with errors, listing the incorrect codes, the corrected 2
codes, and how many rows of data had the error. (Try sorting by promo_code . You only need to list each incorrect promotional code once.) Now fix the incorrect promotional code values in the “Btown Orders (Jan)” worksheet. Remember, there is a list of correct promotional codes in the Lookups worksheet. Part 3: Errors in the Total Product Price Verify that the total product price is correct for each line item. We know that the product prices and quantities were recorded correctly, but we’re just not sure the total product price was calculated correctly, which is the total price of the entire order and the amount we bill our customers. To do this, keep a few things in mind: The total product price is the item product price multiplied by the product quantity . For the first line item in the data set, we see this is true: First , see if there are any outliers for total_product_price . 1) How many outliers are there? 2) Copy/paste the plot into the worksheet on the answer sheet. Now sort by total product price to identify those outliers. 3) List the lineitem_ids and the total product price for the outliers as listed. By looking at the quantity purchased and the total price, it seems unlikely that the item product price is incorrect (this would make the products very expensive!). So, correct the total product price for these rows in column I of the spreadsheet. Remember, total_product_price is product_quantity times item_product_price . Don’t delete the rows, fix them. 3
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Second , check for 0 values for total product price. 4) How many 0 values are there? Now correct the total product price for these rows in column I of the spreadsheet. Don’t delete the rows, fix them. Third , check to see if there are any other errors in the data set. You can do this by comparing the product_quantity (column G in the spreadsheet) multiplied by the item_product_price (column H in the spreadsheet) to the total_product_price (column I in the spreadsheet). If the item price OR the total price is incorrect, then these two values won’t match, indicating a problem. 5) How many line items still have errors (rows with “WRONG”)? 6) List the lineitem_id for each row with an error and the incorrect total_product_price . Now correct the total product price for these rows in column I of the spreadsheet. Be sure to check each item ( product_name ) to compare prices. Don’t delete the rows, fix them. Submit both your answer sheet and your saved completed Excel file on Canvas with this assignment. 4 HINT: Use an IF function in Excel. Place your IF function in Column P. Make the title of the column “TotalCheck” (in cell P1) and start your IF formulas in cell Q2. ANOTHER HINT: As a separate example, if we wanted to compare whether the sum of the values in cells A2 and B2 were equal to the value in cell C2, we could do this: =IF((A2 + B2) = C2),"RIGHT","WRONG") {don’t use this one here! –just showing you how} Which says that if the equation is true (A2 + B2) = C2) , then display the word RIGHT in the cell. Otherwise, display the word WRONG . This will allow you to find out which rows have a problem. Use Filter to locate the WRONG s.