Assignment 6.1 - ETL Project
School: San Francisco State University
Subject: Information Systems
Date: Nov 24, 2024
Pages: 15 (docx)
UMGC Data 620 Assignment 6.1

Your company wants to merge its old product order data into a new data mart to facilitate analysis. Your team has been tasked with writing an ETL (extract, transform, and load) code sequence and executing it on three years' worth of order data. Your team will produce:

- A .csv data file suitable for direct upload to the data mart, matching the data mart format given in the assignment.
- A Microsoft Word memo to the executive team, outlining what you did and what your recommendations are for moving forward. The Appendix of the memo will contain the SQL code you wrote.

Of course, it is possible to perform ETL using a variety of software packages, even Excel. But for this project, please do *all* of your programming in MySQL Workbench. A correct answer obtained by using something other than MySQL will not receive credit.

Rubric:

Element | Possible Points | Notes
.csv file deliverable | 20 | Graded according to correctness of the .csv file data over all years, product lines, and other summary fields. The file should have headers describing the columns. Columns should be sorted per instructions. The .csv data supports the answers in the Executive Memo below.
SQL code | 40 | SQL code is submitted as a separate attachment and labeled as required. Graded according to the SQL rubric. Include comments and easy-to-follow queries. If we cannot generate your .csv file from the input files using your SQL code and following your directions, you will receive a 0 on this part. You are welcome to include screenshots, but please also include SQL code such that we can run it. If your only SQL is a screenshot, you will receive very little credit on this part.
Executive Memo: ERD, ETL documentation, and metadata | 20 | The ERD is clearly documented and contains sentences denoting the cardinality of relationships. The process explanation is clear and in business English, not "technology-speak." Diagrams are encouraged. There is no use of SQL in this part; instead, references are made to SQL code by caption number in the Appendix where needed. Metadata is clear and comprehensive, and would be sufficient for a new programmer to come up to speed quickly.
Question 1 (granularity) | 30 | Complete answer to the question, with examples where needed to support points. Demonstrates understanding of granularity in data marts.
Question 2 (Ramon) | 30 | Complete answer to the question, with examples where needed. Demonstrates understanding of what Ramon would need to answer the query; can run an example with one or two pieces of final data to illustrate.
Question 3 (different format for the data) | 30 | Complete answer to the question. Demonstrates understanding of the advantages and disadvantages of the two data layouts. Identifies any missing ideas and defends the answer.
Question 4 (tidy data) | 30 | Complete answer to the question. Demonstrates understanding of tidy data and applies the concepts to this case study. Critically evaluates the two data layouts in the context of tidy data.
APA formatting | Up to 20-point deduction if incorrect | The memo conforms to the desired formatting: APA formatting for everything except important charts/diagrams. Do not put important charts/diagrams in the Appendix. Instead, put them in line with your text, with a Figure caption or a Table reference using Microsoft Word's References -> Insert Caption and Cross-Reference capability. The Appendix should contain technical SQL code and any secondary charts/diagrams. Few grammatical or spelling errors. Passes a Turnitin plagiarism check. Correct APA formatting for nearly all of the paper is expected.
TOTAL | 200 |

Getting Started:

This assignment starts with the script "week6_bu_grad.sql". This script should create a table called "business_unit" and a table called "Product_BU." Unfortunately, the metadata descriptions have been lost, so you will need to figure out what you can from the SQL script. The only thing you know about the metadata is that the company runs several individual strategic business units, such as "On The Go" and "Snack." Each of these business units is run under an umbrella designation, such as "Growth" or "Decline." The company will run marketing for growth products differently than it would run marketing for products on the decline.

You also have product order files from 2017, 2018, and 2019. They are attached as .csv files titled:

- "2017_product_data_students-final.csv"
- "2018_product_data_students-final.csv"
- "2019_product_data_students-final.csv"

Your job is to use SQL to perform an ETL which will accomplish the following:

1. Extract data from the 2017, 2018, and 2019 order files.
2. Transform the data according to the given rules.
3. Load it into one final table.
4. Export your final output table under the name "GX_output_final.csv". (You may create as many or as few data objects as you like in your work, but the data in the .csv file named "GX_output_final.csv" will be the data evaluated.)

You may write one large SQL script to accomplish the entire process. You may also break your SQL commands into smaller batches. Just like commercial ETL scripts, your entire script should run without human intervention; you should not require the user to use MySQL GUI commands partway through.
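The four steps above can be sketched in MySQL roughly as follows. This is an illustrative skeleton only: the staging table names (orders_2017, orders_2018, orders_2019, combined_orders) and the backtick-quoted column names are assumptions for illustration, not names the assignment prescribes, and the per-year expressions simply anticipate the field definitions given later in this document.

```sql
-- Illustrative skeleton only; all table and column names are assumed.
-- Assumes the three yearly .csv files have already been imported into
-- staging tables orders_2017, orders_2018, and orders_2019.
DROP TABLE IF EXISTS combined_orders;

CREATE TABLE combined_orders AS
SELECT 2017 AS `Year`, `Month`, Region, Product,
       Quantity,
       `Order Total` AS Order_Total
FROM orders_2017
UNION ALL
SELECT 2018, `Month`, Region, Product,
       Quantity_1 + Quantity_2,                       -- total units across both shipments
       `Per-Unit Price` * (Quantity_1 + Quantity_2)   -- 2018 file carries no order-total column
FROM orders_2018
UNION ALL
SELECT 2019, `Month`, Region, Product,
       Quantity,
       `Order Subtotal` - `Quantity Discount`         -- net of the 2019 quantity discount
FROM orders_2019;
```

From here, the transformation and load steps would join this combined table to the business-unit tables, subtotal, sort, and export.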
If as a last resort you do end up doing this, your notes should reflect what you did (for example, in the Appendix
you could say "We created database YYY, and then used the "import" button on the MySQL GUI interface to upload the .csv file into Table Z. Then we ran the script shown in Figure X ...").

Please only use MySQL for this assignment. The only exceptions here are minor edits made using Notepad or Excel, such as putting headers on column names. Document these carefully in your Appendix; if your SQL script doesn't write column headers, but your output file magically has them, we want to know how they got there. You can just say something like "After we did << XXX >> to export the data, we used Notepad to insert Row 1, which contains the header names."

Remember, you have learned how to download and run a .sql script. And in Week 4, we learned how to import and export files. You will need both of these skills this week.

Detailed Instructions:

The very last pages of this assignment document contain a method for export. You are welcome to use this. If you want to use a different export method, verify with your faculty that it is OK.

Extraction: Your extracted data should meet the following criteria for each of the 2017, 2018, and 2019 data sets. Business Unit Designations are for "Growth" and "Mature" only. Among other things, this means you should not choose any orders which are associated with a "Decline" designation.

Transformation: Your output file should follow this format, for loading into the data mart. A sample of some output is given below; note that your data may or may not match these numbers.

1. BU Designation: this is Growth and Mature; subtotal by this field.
2. BU Name: no transformations; subtotal by this field within BU Designation.
3. Product: no transformations; subtotal by this field within BU Name.
4. Region: no transformations; subtotal by this field within Product.
5. Year: no transformations; subtotal by this field within Region.
6. Month: no transformations; subtotal by this field within Year.
7. Sum of Quantity: this reflects the sum of the "Quantity" field in the relevant data. For example, in the data below, done for a previous data warehouse, the first line indicates that for Growth/Energy/Purple Pain/Eastern/2012/April, there was a total of 20 Purple Pain packets sold. This could reflect twenty 1-packet sales, four 5-packet sales, or one sale of 20 packets. (Note this data is for illustration only, and does not reflect your more recent sales data.)
8. Sum of Order Total: this reflects the sum of the "Order total" field in the relevant data. For example, in the data below, from a previous data warehouse, the first line indicates that for Growth/Energy/Purple Pain/Eastern/2012/April, there was a total of 6960 cents in revenue from the 20 Purple Pain packets sold. (This implies a price of 6960/20 = 348 cents, or $3.48, per Purple Pain packet in 2012.) You can assume pricing is stable throughout a calendar year, and any price changes happen instantaneously at midnight on December 31 and apply to the entire next year.

BU Designation | BU Name | Product | Region | Year | Month | Sum of Quantity | Sum of Order Total
Growth | Energy | Purple Pain | Eastern | 2012 | 4 | 20 | 6960
Growth | Energy | Purple Pain | Eastern | 2012 | 8 | 19 | 6612
Growth | Energy | Purple Pain | Western | 2012 | 6 | 0 | 0
Growth | Energy | Red Hot Chili Peppers | Eastern | 2012 | 1 | 33 | 14190
Growth | Energy | Red Hot Chili Peppers | Eastern | 2012 | 8 | 30 | 12900
Growth | Energy | Red Hot Chili Peppers | Midwest | 2012 | 6 | 37 | 15910
Growth | Energy | Red Hot Chili Peppers | Western | 2012 | 2 | 12 | 5160
Growth | Energy | Red Hot Chili Peppers | Western | 2012 | 3 | 33 | 14190
Growth | Snack | Crocodile Tears | Eastern | 2012 | 2 | 26 | 7332
Growth | Snack | Crocodile Tears | Southeast | 2012 | 4 | 4 | 1128
Growth | Snack | Crocodile Tears | Western | 2012 | 3 | 18 | 5076
Mature | Health | Panda Gummies | Eastern | 2012 | 4 | 69 | 10074
Mature | Health | Panda Gummies | Midwest | 2012 | 7 | 16 | 2336
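Taken together, the eight rules above describe a grouped, filtered, and sorted query. A minimal sketch follows, under the assumption that a combined orders table (here called combined_orders) has already been joined to the business_unit and Product_BU tables so that a designation and BU name are available on every row; all names here are illustrative, not prescribed.

```sql
-- Sketch only: combined_orders, Designation, BU_Name, and Order_Total are assumed names.
SELECT Designation      AS `BU Designation`,
       BU_Name          AS `BU Name`,
       Product, Region, `Year`, `Month`,
       SUM(Quantity)    AS `Sum of Quantity`,
       SUM(Order_Total) AS `Sum of Order Total`
FROM combined_orders
WHERE Designation IN ('Growth', 'Mature')   -- drops all "Decline" orders per the extraction rule
GROUP BY Designation, BU_Name, Product, Region, `Year`, `Month`
ORDER BY Designation, BU_Name, Product, Region, `Year`, `Month`;
```

The GROUP BY list mirrors the subtotal hierarchy (fields 1 through 6), and the ORDER BY list mirrors the required left-to-right sort precedence.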
Load: Your deliverable is a single .csv file with the applicable data in it. It should contain only the fields listed above, and should be sorted ascending (alphabetically or numerically) in each field, with the leftmost fields having precedence. For example, you should first sort on BU Designation, and within that, sort on BU_Name. Your one data file should contain the data from all three years (2017, 2018, and 2019). If the field names are not already present upon export, make sure to use your .csv editor (such as Notepad or Excel) to insert them into your .csv file.

Management Memo

You will write a memo to management outlining your answers to the following questions:

1. Create and explain an entity relationship diagram (ERD) to go with this data. Your ERD should describe the business situation in existence as best as you can infer it. Since your input files are not necessarily in the best shape, your ERD should not simply map the input files. Your output file is by definition a flat file with no major database schema, so your ERD shouldn't map that either. As a hint, consider this: based on the data here, what relationship can you infer exists between BU Designation and Product? One to one? One to many? Must-have or may-have? Use the ER tool in MySQL to do your ERD, and incorporate a screenshot of your ERD in the management memo. Note you will not forward or reverse engineer from this file, nor will you use any code it may generate; you will just use the ERD printout as documentation of what "should" be. (You do not need to attach the MySQL entity file.)
2. Document your ETL process. Which functions did you use, and what logic did you follow? This should be at a level that your boss, who has an MBA but not an IT/database background, can follow. Do not use "computer-ese" here; use regular business English.
3. Give metadata for your final deliverable file. The analysts who follow you will thank you.
4. Your boss has a question for you.
"We think this is about the right level of granularity for our data mart. What do you think? Should we extract more detailed information, and if so, what? Or would you recommend going to a coarser level of granularity, and if so, what fields would you recommend we drop?" Give your rationale. Think critically, and demonstrate a good understanding of data management.

5. Your boss wants to know the answer to this business question: "We believe our Growth segment should show at least 10% year-over-year growth in either quantity sold or order total. We also believe our Mature segment should remain pretty much the same in terms of quantity and order totals. If I give the final data file you produced to Ramon (an expert analyst), can he run queries to answer this?" (You may wish to run a query or two as proof of concept.) Tell the boss whether you believe the data, laid out as it is, will easily support Ramon in that sort of analysis. If it will, what about it makes it easy? If it won't, how could it change to support this analysis?
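One possible proof-of-concept shape for such a query is sketched below. It assumes the final deliverable has been loaded back into a table (final_output is an assumed name, as are the column names); it only produces the yearly totals from which year-over-year growth can then be computed.

```sql
-- Sketch only: final_output is an assumed name for the loaded deliverable.
SELECT `BU Designation`,
       `Year`,
       SUM(`Sum of Quantity`)    AS yearly_quantity,
       SUM(`Sum of Order Total`) AS yearly_order_total
FROM final_output
GROUP BY `BU Designation`, `Year`
ORDER BY `BU Designation`, `Year`;
```

Comparing consecutive rows per designation (for example, via a self-join on `Year` = `Year` + 1) would then give the year-over-year percentage change.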
6. Your boss has another question: "Our database folks have suggested we use a different format for the ETL if I'm so interested in growth. It's copied below. It's the exact same data, just a little differently arranged. What do you think of it? Bobby, one of my IT people, thinks a data mart with this layout is a brilliant answer to the growth question. But Susie, another of my IT people, has concerns that this data layout will make it hard to query on any other dimension, such as whether a particular product is doing well or poorly in a given region regardless of year, or monthly seasonal trends. Am I missing anything here? What do you recommend? If we had to go with just one layout of our data mart, which layout should it be?"

Existing layout:

<< Other fields such as Designation here >> | Year | Sum of Quantity | Sum of Order Total
XXX | 2017 | 15 | 150
XXX | 2018 | 16 | 160
XXX | 2019 | 17 | 170

Proposed layout:

<< Other fields such as Designation here >> | Sum of Quantity for 2017 | Sum of Order Total for 2017 | Sum of Quantity for 2018 | Sum of Order Total for 2018 | Sum of Quantity for 2019 | Sum of Order Total for 2019
XXX | 15 | 150 | 16 | 160 | 17 | 170

7. There is one final question. "I hear a lot about 'tidy data,' but I'm not really sure what it is. Can you define tidy data, and then let me know how this concept might apply to the data layouts in Question 6 above?"

A successful memo will meet the following criteria:

- Times New Roman, double spaced, 12-point font, with 1-inch margins.
- Contains a cover page with your number and names on it.
- Contains a bibliography in APA format citing appropriate references (you may need to cite only this classroom and the Reference Manual; if you look up other sources, cite them too).
- Passes a Turnitin check for plagiarism.
- Is in memo form, addressed to your boss, in business English (not computer-ese). Technical talk goes in the Appendix.
- Is of reasonable length. There are no page minimums or maximums, but please be reasonable. Something on the order of 10 pages or less for the written memo should probably suffice; the Appendix may run longer.
- An Appendix with any technical information you want to include. Classic APA formatting calls for all figures, exhibits, and tables to be in the Appendix. I'm changing this requirement here. If a diagram (for example, a flowchart of something) would make more sense in the body of your paper, put it in the body. If it would make more sense in the Appendix, leave it in the Appendix. In the past, papers which have earned 85% or above have used Microsoft Word cross-referencing and captions, and have had at least a few diagrams in the main body of the paper. Papers which have all their figures in the Appendix typically earn less than 85%.

Submit:

- Your memo, labeled "GX_memo.docx" (or .pdf).
- Your final output file, labeled "GX_output_final.csv".
- The SQL code file(s) you used to make this happen. If there is more than one file, label them to make it easy to find and assemble them. If you have 3 files, you can call them "GX_1_extract.sql", "GX_2_transform.sql", "GX_3_load.sql", etc. The 1, 2, 3 are the order in which we should run the scripts, and each word is a summary of what that script does; the words you use are entirely up to you.
2017 Data Notes:

Your 2017 order data is contained in the attached file "2017_product_data_students-final.csv." A sample of this file's type of data is contained below in Table 1, Sample of order data from 2017. (Note your file may or may not have the same data in it.) Your field definitions follow:

- ID: Record ID number.
- Month: integer, corresponds to the month of the sale. For example, 5 = May.
- Country: character, should all be USA. (All data in this exercise should be USA.)
- Region: character, represents the regions within the country.
- State: two characters, USPS state abbreviations. Each state is within one region.
- Product: character. This is the name of a packaged food product.
- Per-unit price: integer. This represents the per-unit price in cents; for example, 466 indicates that Blue Rock Candy sells for $4.66 per package. (For the purposes of this exercise, you may disregard all currency formatting and just use 466 to represent $4.66. If you choose to do this, make sure you note it in your final product.)
- Quantity: integer. This represents how many items were in that particular order. The first order here was for 3 packages of Blue Rock Candy.
- Order Total: integer. This is the per-unit price x the quantity. The first line here indicates that 466 x 3 = 1398 (or $13.98) was the price of the first order.

Table 1. Sample of order data from 2017

Month | Country | Region | State | Product | Per-Unit Price | Quantity | Order Total
5 | USA | Midwest | MN | Blue Rock Candy | 466 | 3 | 1398
5 | USA | Eastern | RI | Pink Bubble Gum | 318 | 15 | 4770
4 | USA | Southeast | MO | Crocodile Tears | 282 | 4 | 1128
1 | USA | Eastern | MD | Yellow Zonkers | 258 | 27 | 6966
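Because Order Total is defined as per-unit price times quantity, one simple sanity check after import is to look for rows that violate that rule. A sketch, assuming the file has been imported into a staging table named orders_2017 (an illustrative name, not one the assignment prescribes):

```sql
-- Sketch only: orders_2017 is an assumed staging-table name.
SELECT *
FROM orders_2017
WHERE `Order Total` <> `Per-Unit Price` * Quantity;   -- expect an empty result set
```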
2018 Data Notes:

Your 2018 order data is contained in the attached file "2018_product_data_students-final.csv." A sample of this file's data is contained below as Table 2, Sample of order data from 2018. (Note your file may or may not have the same data in it.) Your field definitions follow:

- ID: Record ID number.
- Month: integer, corresponds to the month of the sale. For example, 5 = May.
- Region: character, represents the regions within the country.
- Customer_ID: numeric, represents the customer's unique Customer ID number.
- Product: character. This is the name of a packaged food product. The product name is consistent between 2017, 2018, and 2019; for example, if something is called "Orange Creepies" in 2017, those characters refer to the same product in 2018 and 2019.
- Per-unit price: integer. This represents the per-unit price in cents; for example, 293 indicates that Crocodile Tears sells for $2.93 per package. (For the purposes of this exercise, you may disregard all currency formatting and just use 293 to represent $2.93. If you choose to do this, make sure you note it in your final product.)
- Quantity_1: integer. This represents how many items were in the first shipment of that particular order. This year we had shipping problems, and could often not ship the entire order all at once. Orders were split into two shipments where necessary, and Quantity_1 reflects how many units were shipped first. (Assume all shipments were completed in the month listed, and that no shipments had the first shipment in one month and the second shipment in the subsequent month.)
- Quantity_2: integer. This represents how many items were in the second shipment of that particular order. A 0 indicates a second shipment was not necessary. To get the total number of items shipped, you need to add Quantity_1 and Quantity_2.
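Since the 2018 file carries no order-total column, both the total quantity and the order total have to be derived from the other fields. A sketch, assuming a staging table named orders_2018 (an illustrative name):

```sql
-- Sketch only: orders_2018 is an assumed staging-table name.
SELECT `Month`, Region, Product,
       Quantity_1 + Quantity_2                       AS total_quantity,
       `Per-Unit Price` * (Quantity_1 + Quantity_2)  AS order_total
FROM orders_2018;
```

For the first sample Crocodile Tears row, this derivation yields 13 + 1 = 14 units and 293 x 14 = 4102 cents.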
The first line here reflects that Crocodile Tears had a first shipment of 13 units and a second shipment of 1 unit, all within the month of May, for a total of 13 + 1 = 14 units.

Table 2. Sample of order data from 2018

Month | Region | Customer_ID | Product | Per-Unit Price | Quantity_1 | Quantity_2
5 | Southeast | 857 | Crocodile Tears | 293 | 13 | 1
9 | Midwest | 785 | Blue Rock Candy | 489 | 16 | 10
5 | Eastern | 906 | Nap Be Gone | 427 | 24 | 4
2 | Western | 939 | Yellow Zonkers | 253 | 8 | 5
7 | Western | 558 | Pink Bubble Gum | 318 | 26 | 7

2019 Data Notes:
Your 2019 order data is contained in the attached file "2019_product_data_students-final.csv." A sample of this file's data is contained below as Table 3, Sample of order data from 2019. (Note your file may or may not have the same data in it.) Your field definitions follow:

- ID: Record ID number.
- Month: integer, corresponds to the month of the sale. For example, 5 = May.
- Country: character, represents the country of the customer. Should all be USA.
- Region: character, represents the regions within the country.
- State: USPS code for the 50 United States.
- Product: character. Same as in the 2017 and 2018 data.
- Per-unit price: integer. This represents the per-unit price in cents; for example, 425 indicates that Red Hot Chili Peppers sells for $4.25 per package. (For the purposes of this exercise, you may disregard all currency formatting and just use 425 to represent $4.25. If you choose to do this, make sure you note it in your final product.)
- Quantity: This represents how many items were in that particular order. The first order here was for 32 packages of Red Hot Chili Peppers.
- Order Subtotal: This represents the order subtotal, calculated as per-unit price x quantity. For example, the first order here reflects a per-unit price of 425 cents x 32 units, for a subtotal of 13,600 (or $136.00).
- Quantity Discount: This represents the new policy (effective January 1, 2019) that all orders of 91 units and up automatically earn a 10% discount. An order of exactly 91 units earns the discount; an order of 90 units does not. All order discounts have been rounded to the nearest penny, so you can assume this field has no decimals in it. In the data below:
  - The first order, of 32 Red Hot Chili Peppers to New Jersey, did not qualify for the Quantity Discount. Therefore, the Order Total is simply the Order Subtotal.
  - The fourth order, of 95 Green Lightning to Rhode Island, did qualify for the Quantity Discount, which has already been computed as 3705, or 10% of 37050. In this case, the final order total would be 37,050 - 3,705 = 33,345 (or $333.45).
When you subtotal the discounts, the Order Subtotal, Quantity Discount, and Order Total should just add up. It should work like this (below is a very abbreviated data set, designed to show only the treatment of order totals):

Product Name | Order Subtotal | Quantity Discount | Order Total
Product A | 100 | 0 | 100
Product A | 500 | 50 | 450
Product A | 200 | 0 | 200

Rolled up, this data should give:

Product Name | Sum of Order Subtotal | Sum of Quantity Discount | Sum of Order Total
Product A | 800 | 50 | 750

Table 3. Sample of order data from 2019

Month | Country | Region | State | Product | Per-Unit Price | Quantity | Order Subtotal | Quantity Discount
6 | USA | Eastern | NJ | Red Hot Chili Peppers | 425 | 32 | 13600 | 0
5 | USA | Eastern | DE | Giant Gummies | 428 | 34 | 14552 | 0
5 | USA | Southeast | LA | Orange Creepies | 466 | 25 | 11650 | 0
6 | USA | Eastern | RI | Green Lightning | 390 | 95 | 37050 | 3705
10 | USA | Eastern | GA | Giant Gummies | 428 | 90 | 38520 | 0
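For 2019, the order total is therefore the subtotal minus the already-computed discount. A sketch, assuming a staging table named orders_2019 (an illustrative name); the CASE expression is only a cross-check of the 91-unit rule, not a required output field:

```sql
-- Sketch only: orders_2019 is an assumed staging-table name.
SELECT `Month`, State, Product, Quantity,
       `Order Subtotal` - `Quantity Discount` AS order_total,
       CASE WHEN Quantity >= 91
            THEN ROUND(`Order Subtotal` * 0.10)   -- expected 10% discount at 91+ units
            ELSE 0
       END AS expected_discount                   -- should match the Quantity Discount field
FROM orders_2019;
```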
Assignment 6 Azure Instructions

This section is where our FA Deepa Devarakonda kindly demonstrates how to do an Azure export using some 2012 data. Note the 2012 data is not the data you are using for this assignment; for this assignment, you will need to update the files. This section illustrates one method for getting your ETL code export friendly.

Please note the query below is not a query you can use exactly to complete this assignment. Instead, it is simply a baby query to show a sample of SQL for a similar data set and to demonstrate the CLI pipe. It assumes you have imported the 2012 file to a table named 2012_orders. If you wish to use another method for export, please first verify with your faculty that it is OK, and if it is, document it lavishly when you turn it in so we can reproduce what you did.

Write this in your SQL file and name it something. I named mine "assignment6rev1.sql":

    -- The quoted commas are deliberate: they become the field separators in the
    -- piped CSV output. Group by the non-aggregated columns so that each output
    -- row is one Product/Region/Month subtotal.
    SELECT 2012_orders.Product AS Product, ',',
           2012_orders.Region AS Region, ',',
           2012_orders.Month, ',',
           SUM(2012_orders.Quantity) AS 'Sum of Quantity', ',',
           SUM(2012_orders.ordertotal) AS 'Sum of Order Total'
    FROM 2012_orders
    GROUP BY 2012_orders.Product, 2012_orders.Region, 2012_orders.Month
    ORDER BY 2012_orders.Month;

After you have saved it, you will pipe your SQL script using the MySQL CLI interface:

1. Create and save your ETL code in a .sql file. (You just did this above.)
2. Save it to your local drive. For example, I saved my script under "C:\Users\ddevarakonda\Downloads\assignment6rev1.sql". (These instructions were written in part by FA Deepa Devarakonda, so her userid is ddevarakonda. You should substitute your own userid.)
3. Locate the local installation of MySQL. It should be something like C:\Program Files\MySQL\MySQL Workbench 6.3 CE (copy this path); it can also be C:\Program Files\MySQL\MySQL Workbench 8.0 CE, depending on the Workbench version installed.
4. Open a command prompt window and execute the change directory command:

       cd C:\Program Files\MySQL\MySQL Workbench 6.3 CE

   (the path of MySQL from step 3)
5. Copy this command to a notepad and modify everything that is inside < >:
       mysql.exe -h <Server Name> -P 3306 -u <MySQL User> -p --batch --skip-column-names < C:\Users\ddevarakonda\Downloads\assignment6rev1.sql > C:\Users\<user>\Downloads\output6_1.csv

   - <Server Name>: server name from your MySQL connection.
   - <MySQL User>: DB user name from your MySQL connections.
   - C:\Users\ddevarakonda\Downloads\assignment6rev1.sql: this is your SQL file. Replace ddevarakonda with your own userid.
   - C:\Users\<user>\Downloads\output6_1.csv: modify this to your desired file path for your output. Replace <user> with your own userid on your own computer.

6. Copy this formatted text into the command window and run it (press Enter).
7. Enter the DB password when prompted (Pa$$word1, or whatever you set your password to).
8. Go look for it! Locate the CSV file saved in your destination folder.

Note: All query results will be appended to one column as a result of this query execution (see sample below). Also, this CSV file doesn't contain headers. You will need to manually split each attribute out into different columns and then save your file. (It's OK to use the Text to Columns feature in Excel to do this if you need to. Just document what you did.) Include headers in your CSV file before you submit for grading.
9. As written, the command does not include column names. If you want to include the column names, just remove the --skip-column-names option:

       mysql.exe -h <Server Name> -P 3306 -u <MySQL User> -p --batch < C:\Users\<user>\Downloads\assignment6rev1.sql > C:\Users\<user>\Downloads\output6_1.csv

With this, all the information would be separated into its respective columns in the CSV.