A - Data Profiling (1)

docx

School

Syracuse University *

*We aren’t endorsed by this school

Course

722

Subject

Industrial Engineering

Date

Feb 20, 2024

Type

docx

Pages

9

Uploaded by HighnessKnowledge7427

Report
IST722 – Data Warehouse Homework Assignment A Michael A. Fudge, Jr. Data Profiling Before We Start: Understanding Every Assignment in this Course. Since this is your very first assignment, I thought I’d take some time to explain how every assignment is structured in this course. Purpose The purpose of these assignments is to give you hands-on experience by doing some data warehousing; allowing you to put into practice the concepts you learned through the class sessions. You should learn quite a bit as you work through these assignments. This is by design. I strongly recommend taking notes as you complete the work, that way you can ask questions in class. There will always be time allocated to asking questions so you can clear up any misunderstandings. Since eventually you will be expected to apply these techniques as you complete the project, you will need to understand them well. Be prepared to ask questions! Is This Individual or Group Work? You are permitted to, in fact encouraged to, collaborate / work with others as you complete these assignments. Learning from each other is a very natural way of learning. However, it is expected every student will write the code on their own. You learn by doing, not by copying what someone else did! So, in summary you will complete an individual assignment, but are allowed to work on it together. You should specify who you worked with in your reflection. The Written Reflection Every student is expected to reflect on their learning. From the teacher’s perspective this is the most important part of the assignment itself. Your written reflection should briefly explain what you learned by completing the assignment, where you struggled, what you feel you need to do to understand the material better. It should also detail who you collaborated with and how that helped. Take the time to think about it and be specific and personal. You will also assign a reflection score to yourself based on a 1-4 scale. A lower score does not have a negative impact on your grade. What impacts your grade are terse, unspecific reflections that do not inform the teacher of where you are on your learning journey or provide any indication of what you learned from the assignment itself. Three Parts to Every Assignment Every assignment follows the same structure: 1. It begins with an overview section which explains the activity, its goals, and requirements. 2. The second part of the assignment will walk you through the process of completing the activity. Typically, in the walkthrough we will use the Northwind Traders database. 3. The third part of the assignment, you must complete on your own. It is expected you will take what you learned from your studies, class sessions, and your experiences with parts 1 and 2, then apply them to the problems you face in the 3 rd part. In many cases, you must complete part 2 before part 3 can be done. What am I handing in? At the end of every assignment, there is a section called “Turning it in” that contains specific instructions of what should be turned in from the specific assignment. There is a generic template we use for each assignment as this contains the reflection part for you. Page 1 of 9
IST722 – Data Warehouse Homework Assignment A Michael A. Fudge, Jr. Data Profiling Assignment A: Data Profiling the External World Part 1: Overview You cannot build a data warehouse without a clear understanding of your organization’s data and business processes. Therefore, a critical activity of data warehousing is taking time to understand your data. This is known as Data profiling - the activity of learning about the characteristics, capabilities, and quality of your data and metadata (schema). Getting intimate with your data is an important step in building the data warehouse because you must understand the data available to work with before you can leverage it as part of your data warehouse solution. After all, you cannot create something that isn’t there. For example, if you don’t have a time- history of end-of-day inventory levels, then you cannot promise to build a solution that tracks inventory levels over time! This assignment will help you discover how to understand your organization’s data so that you can figure out how to properly build a data warehouse around it. Goals Specifically the goals are to: Get you familiar with how to use the tools provided in the course. Help you understand how to profile data – reading unknown database schemas / data and deducing intent behind table designs and data. Get you comfortable writing SQL SELECT queries against a relational database, to understand the capabilities of the data and metadata that you have. Prerequisites To complete this assignment, you will need the following: Read and review External World (where you found this document). Microsoft Azure Data Studio https://learn.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio Access to the external world externalworld.cent-su.org The Northwind Traders Case Study Northwind traders is a fictitious importer and exporter of specialty foods from around the world. It was created by Microsoft as a sample operational database to use with their database products, such as Microsoft Access and SQL Server. We’ll use this database as a case study for building a data warehouse. You will use this same database Page 2 of 9
IST722 – Data Warehouse Homework Assignment A Michael A. Fudge, Jr. Data Profiling throughout each of your assignments in this course and will get very intimate with the data and table design. Page 3 of 9
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
IST722 – Data Warehouse Homework Assignment A Michael A. Fudge, Jr. Data Profiling The Northwind Data Model Below is a screenshot of the internal model for the Northwind Traders database. Use this diagram as a reference for understanding the structure of the Northwind data and building your data warehouse. Page 4 of 9
IST722 – Data Warehouse Homework Assignment A Michael A. Fudge, Jr. Data Profiling When you open the Northwind database in Azure Data Studio’s object explorer, you will see the following tables which comprise the database. Some of these tables such as the Customer tables represent master data. Master data is also known as reference data as it is normally in reference to organizational processes. This type of data exists to store information about a key business entity, in this example customers . Other tables, like Orders represent business activities. These are transactions, events or actions. This is commonly known as process-oriented data. From a data warehouse perspective, all source data falls into one of these two categories. It should be noted that the boundary of what is master data or process data does not always align with the database tables. You must look at the business processes themselves and how data flows through the organization and “ends up” in the table. The tables are just an implementation and like any implementation they are at the discretion of the database designer. Of course in this class we cannot ask someone who works at Northwind Traders about customers or orders. We can only make assumptions about how they would use the data, and for that reason we will rely heavily on the existing database schema. Page 5 of 9
IST722 – Data Warehouse Homework Assignment A Michael A. Fudge, Jr. Data Profiling Part 2: Walk-Through In this part we will walk you through the process of data profiling. We will profile two types of sources, a master/reference data source, and a business process. First Step: Is That Master Data or A Business Process? In our walk through, our master data will be Employees and our business process will be the assignment of sales territories to the employee. How do you know which source is master data and which is a business process? The trick is to think of your data at the conceptual model level: An Employee is assigned to 1 or more Territories A Territory is assigned to 1 or more Employees In this example Employee and Territory are master data. They represent categorical business data. The Many-To-Many relationship between them is a business process representing who gets assigned to which territory. If you’re a part of speech fan, the nouns are usually the reference-oriented/master data, and the verbs are likely the process-oriented data. Where this is tricky is when dealing with process at rest. For example, ordering is a process, but the process at rest is an order. In the Kimball method of data warehousing your master data becomes dimensions and the business processes, events or transactions become fact tables . Honestly, that a gross over simplification, but for now it works. Next: Understanding your data at its atomic level. Now that you know which are master and which are business processes. It’s time to understand the following: What are the business or natural keys? We’re not interested in the primary key , which is the database constraint set on the RDMBS table. We also don’t want surrogate keys as those carry no meaning by design. We’re looking for what makes each entity unique from the business user’s vantage point. For example, a natural key for a customer might be their email address. After all, one expects each customer to have a distinct email. There should always be a natural key for master data, but there might not be one for business processes. What does “one row” of data mean? The answer to this question often reveals itself when you discover the business key or identify the business process. Am I looking at data that is in reference, or data that is the result of a business process? Page 6 of 9 Employee Territory Assigned To 1/M 1/M Assigned To
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
IST722 – Data Warehouse Homework Assignment A Michael A. Fudge, Jr. Data Profiling The easiest way to figure this out is by using the SELECT statement. Observe: If you want to try it for yourself, open a new query window in the Northwind database by pressing CTRL+N , then type the select statement and press the Run button in the toolbar. It appears as if there’s one row for each employee. That stands to reason, but to be honest it’s not always that simple. What about the business key? What is used to uniquely identify the employee? Normally this would be social-security number or tax payer ID number as each row would have a unique value. With this data it’s not clear. You might think of using name: It seems to work as there are still 9 rows so we didn’t lose an employee. It’s not a good business key because there could be two employees with the same name. In this case there isn’t but we should always think beyond the data when we’re profiling. In this case we have no choice but to use the Primary key EmployeeID . Finally: Understanding the characteristics of your data In addition to knowing what one row means, you might need to discover other characteristics of your data. These are commonly dictated by the business requirements. It’s crucial to understand what you have so that when it is delivered to the data warehouse you can determine if the data is still accurate. After all, we only want quality data in our data warehouse. Page 7 of 9
IST722 – Data Warehouse Homework Assignment A Michael A. Fudge, Jr. Data Profiling Here are some sample queries, which might be asked as part of a functional business requirement and the SQL statements which satisfy them. What are the different job titles for each employee? There are 4 distinct job titles for 9 employees. That’s a decent category, as possibility for drill- down in our analytics as well see in future assignments. Example: How many sales people per territory? From the screenshot we now know there is only 1 territory with more than one employee assigned: New York . Let’s take a moment to break down this SQL statement as writing queries like this is common practice in data profiling. We are joining the Territories table to the EmployeeTerritories table so we can get a count ( count(*) ) of employees for each TerritoryDescription (that’s the group by ). The t and et are table aliases. These are needed since both tables have a column called Territory ID. We order by EmployeeCount to see the Territories with the most employees first. Page 8 of 9
IST722 – Data Warehouse Homework Assignment A Michael A. Fudge, Jr. Data Profiling Part 3: On your own Profile the following table in the Northwind Traders Database. The first two were done for you in the previous part. The last one NorthwindDailyInventoryLevelsOneWeek.csv can be found on the Minio S3 compatible server in the northwind folder. Download the csv file and open it in Excel to profile it. Table Type Row Count Business Key One Row Is Employees Master Data 9 None (PK used) An Employee EmployeeTerritories Business Process 49 None (PK’s used) An Employee assigned to a territory. Territories Customers Suppliers Products Shipments (of Orders) Details (of an Order) Shippers NorthwindDaily InventoryLevels (1 week) CSV file in Minio S3. Write SQL Queries to answer the following questions which might be associated with functional business requirements in a data warehouse. For each of the following provide a screen shot of the SQL query, its output, and make sure your name or netid appears in the screenshot. 1. List the customer contact names and titles sorted by company name. 2. Factoring in discounts, what is total amount of product sold? 3. Provide a list of product category names with counts of products in each category. 4. Select a specific customer and display that customer’s orders with total amount of product sold for each order. 5. Select a specific employee and each order, how it was shipped (shipvia) company who shipped it, and the total number of days elapsed from order date to shipped date. Turning it in: Please turn in the submission template your completed reflection and with your name, netid, and date at the top. Copy and paste your completed part 3 (data profiling table + 5 SQL questions). Be sure you include screenshots as directed. Page 9 of 9
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help