A - Data Profiling (1)
docx
keyboard_arrow_up
School
Syracuse University *
*We aren’t endorsed by this school
Course
722
Subject
Industrial Engineering
Date
Feb 20, 2024
Type
docx
Pages
9
Uploaded by HighnessKnowledge7427
IST722 – Data Warehouse
Homework Assignment A
Michael A. Fudge, Jr.
Data Profiling
Before We Start: Understanding Every Assignment in this Course.
Since this is your very first assignment, I thought I’d take some time to explain how every assignment is structured in this course. Purpose
The purpose of these assignments is to give you hands-on experience by doing some data warehousing; allowing you to put into practice the concepts you learned through the class sessions. You should learn quite a bit as you work through these assignments. This is by design. I strongly recommend taking notes as you complete the work, that way you can ask questions in class. There will always be time allocated to asking questions so you can clear up any misunderstandings. Since eventually you will be expected to apply these techniques as you complete the project, you will need to understand them well. Be prepared to ask questions! Is This Individual or Group Work?
You are permitted to, in fact encouraged to, collaborate / work with others as you complete these assignments. Learning from each other is a very natural way of learning. However, it is expected every student will write the code on their own. You learn by doing, not by copying what someone else did!
So, in summary you will complete an individual assignment, but are allowed to work on it together. You should specify who you worked with in your reflection.
The Written Reflection
Every student is expected to reflect on their learning. From the teacher’s perspective this is the most important part of the assignment itself. Your written reflection should briefly explain what you learned by completing the assignment, where you struggled, what you feel you need to do to understand the material better. It should also detail who you collaborated with and how that helped. Take the time to think about it and be specific and personal. You will also assign a reflection score to yourself based on a 1-4 scale. A lower score does not have a negative impact on your grade. What impacts your grade are terse, unspecific reflections that do not inform the teacher of where you are on your learning journey or provide any indication of what you learned from the assignment itself. Three Parts to Every Assignment
Every assignment follows the same structure:
1.
It begins with an overview section which explains the activity, its goals, and requirements. 2.
The second part of the assignment will walk you through the process of completing the activity. Typically, in the walkthrough we will use the Northwind Traders database. 3.
The third part of the assignment, you must complete on your own. It is expected you will take what you learned from your studies, class sessions, and your experiences with parts 1 and 2, then apply them to the
problems you face in the 3
rd
part. In many cases, you must complete part 2 before part 3 can be done. What am I handing in?
At the end of every assignment, there is a section called “Turning it in” that contains specific instructions of what should be turned in from the specific assignment. There is a generic template we use for each assignment as this contains the reflection part for you. Page 1 of 9
IST722 – Data Warehouse
Homework Assignment A
Michael A. Fudge, Jr.
Data Profiling
Assignment A: Data Profiling the External World
Part 1: Overview
You cannot build a data warehouse without a clear understanding of your organization’s data and business processes. Therefore, a critical activity of data warehousing is taking time to understand your data. This is known as Data profiling
- the activity of learning about the characteristics, capabilities, and quality of your data and metadata (schema). Getting intimate with your data is an important step in building the data warehouse because you must understand the data available to work with before you can leverage it as part of your data warehouse solution. After all, you cannot create something that isn’t there. For example, if you don’t have a time-
history of end-of-day inventory levels, then you cannot promise to build a solution that tracks inventory levels over time! This assignment will help you discover how to understand your organization’s data so that you can figure out how to properly build a data warehouse around it. Goals
Specifically the goals are to:
Get you familiar with how to use the tools provided in the course.
Help you understand how to profile data – reading unknown database schemas / data and deducing intent behind table designs and data.
Get you comfortable writing SQL SELECT queries against a relational database, to understand the capabilities of the data and metadata that you have.
Prerequisites
To complete this assignment, you will need the following:
Read and review External World (where you found this document).
Microsoft Azure Data Studio https://learn.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio
Access to the external world externalworld.cent-su.org
The Northwind Traders Case Study
Northwind traders is a fictitious importer and exporter of specialty foods from around the world. It was created by Microsoft as a sample operational
database to use with their database products, such as Microsoft Access and SQL Server. We’ll use this database as a case study for building a data warehouse. You will use this same database Page 2 of 9
IST722 – Data Warehouse
Homework Assignment A
Michael A. Fudge, Jr.
Data Profiling
throughout each of your assignments in this course and will get very intimate with the data and table design. Page 3 of 9
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
IST722 – Data Warehouse
Homework Assignment A
Michael A. Fudge, Jr.
Data Profiling
The Northwind Data Model Below is a screenshot of the internal model for the Northwind Traders database. Use this diagram as a reference for understanding the structure of the Northwind data and building your data warehouse.
Page 4 of 9
IST722 – Data Warehouse
Homework Assignment A
Michael A. Fudge, Jr.
Data Profiling
When you open the Northwind
database in Azure Data Studio’s object explorer, you will see the following tables which comprise the database. Some of these tables such as the Customer
tables represent master data. Master data is also known as reference data as it is normally in reference to organizational processes. This type of data
exists to store information about a key business entity, in this example customers
.
Other tables, like Orders
represent business activities. These are transactions, events or actions. This is commonly known as process-oriented data. From a data warehouse perspective, all source data falls into one of these two categories. It should be noted that the boundary of what is master data or process data does not always align with the database tables. You must look at the business processes themselves and how data flows through the organization and “ends up” in the table. The tables are just an implementation and like any implementation they are at the discretion of the database designer.
Of course in this class we cannot ask someone who works at Northwind Traders about customers or orders. We can only make assumptions about how they would use the data, and for that reason we will rely heavily on the existing database schema.
Page 5 of 9
IST722 – Data Warehouse
Homework Assignment A
Michael A. Fudge, Jr.
Data Profiling
Part 2: Walk-Through
In this part we will walk you through the process of data profiling. We will profile two types of sources, a
master/reference data source, and a business process. First Step: Is That Master Data or A Business Process?
In our walk through, our master data will be Employees and our business process will be the assignment of sales territories to the employee.
How do you know which source is master data and which is a business process? The trick is to think of your data at the conceptual model level:
An Employee
is assigned to
1 or more Territories
A Territory
is assigned to
1 or more Employees
In this example Employee and Territory are master data. They represent categorical business data. The Many-To-Many relationship between them is a business process representing who gets assigned to which territory. If you’re a part of speech fan, the nouns
are usually the reference-oriented/master data,
and the verbs
are likely the process-oriented data. Where this is tricky is when dealing with process at rest. For example, ordering is a process, but the process at rest is an order.
In the Kimball method of data warehousing your master data becomes dimensions
and the business processes, events or transactions become fact tables
. Honestly, that a gross over simplification, but for now it works. Next: Understanding your data at its atomic level.
Now that you know which are master and which are business processes. It’s time to understand the following:
What are the business or natural keys? We’re not interested in the primary key
, which is the database constraint set on the RDMBS table. We also don’t want surrogate keys
as those carry no meaning by design. We’re looking for what makes each entity unique from the business user’s vantage point. For example, a natural key for a customer might be their email address. After all, one expects each customer to have a distinct email. There should always be a natural key for master data, but there might not be one for business processes.
What does “one row” of data mean? The answer to this question often reveals itself when you discover the business key or identify the business process.
Am I looking at data that is in reference, or data that is the result of a business process?
Page 6 of 9
Employee
Territory
Assigned To 1/M
1/M Assigned To
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
IST722 – Data Warehouse
Homework Assignment A
Michael A. Fudge, Jr.
Data Profiling
The easiest way to figure this out is by using the SELECT
statement. Observe:
If you want to try it for yourself, open a new query window in the Northwind database by pressing CTRL+N
, then type the select statement and press the Run
button in the toolbar.
It appears as if there’s one row for each employee. That stands to reason, but to be honest it’s not always that simple. What about the business key? What is used to uniquely identify the employee? Normally this would be social-security number or tax payer ID number as each row would have a unique value. With this data it’s not clear. You might think of using name: It seems to work as there are still 9 rows so we didn’t lose an employee. It’s not a good business key because there could
be two employees with the same name. In this case there isn’t but we should always think beyond the data
when we’re profiling. In this case we have no choice but to use the Primary key EmployeeID
.
Finally: Understanding the characteristics of your data
In addition to knowing what one row means, you might need to discover other characteristics of your data. These are commonly dictated by the business requirements. It’s crucial to understand what you have
so that when it is delivered to the data warehouse you can determine if the data is still accurate. After all, we only want quality data in our data warehouse. Page 7 of 9
IST722 – Data Warehouse
Homework Assignment A
Michael A. Fudge, Jr.
Data Profiling
Here are some sample queries, which might be asked as part of a functional business requirement and the SQL statements which satisfy them.
What are the different job titles for each employee?
There are 4 distinct job titles for 9 employees. That’s a decent category, as possibility for drill-
down in our analytics as well see in future assignments.
Example: How many sales people per territory?
From the screenshot we now know there is only 1 territory with more than one employee assigned: New York
.
Let’s take a moment to break down this SQL statement as writing queries like this is common practice in data profiling.
We are joining the Territories
table to the EmployeeTerritories
table so we can get a count (
count(*)
) of employees for each TerritoryDescription
(that’s the group by
). The t
and et
are
table aliases. These are needed since both tables have a column called Territory ID.
We order by
EmployeeCount
to see the Territories with the most employees first.
Page 8 of 9
IST722 – Data Warehouse
Homework Assignment A
Michael A. Fudge, Jr.
Data Profiling
Part 3: On your own
Profile the following table in the Northwind Traders Database. The first two were done for you in the previous part. The last one NorthwindDailyInventoryLevelsOneWeek.csv can be found on the Minio S3 compatible server in the northwind
folder. Download the csv file
and open it in Excel to profile it. Table
Type
Row Count
Business Key
One Row Is Employees
Master Data
9
None (PK used)
An Employee
EmployeeTerritories
Business Process
49
None (PK’s used)
An Employee assigned to a territory.
Territories
Customers
Suppliers
Products
Shipments (of Orders)
Details (of an Order)
Shippers
NorthwindDaily
InventoryLevels (1 week) CSV file in Minio S3. Write SQL Queries to answer the following questions which might be associated with functional business
requirements in a data warehouse. For each of the following provide a screen shot of the SQL query, its output, and make sure your name or netid appears in the screenshot.
1.
List the customer contact names and titles sorted by company name.
2.
Factoring in discounts, what is total amount of product sold?
3.
Provide a list of product category names with counts of products in each category.
4.
Select a specific customer and display that customer’s orders with total amount of product sold for each order.
5.
Select a specific employee and each order, how it was shipped (shipvia) company who shipped it, and the total number of days elapsed from order date to shipped date.
Turning it in:
Please turn in the submission template your completed reflection and with your name, netid, and date at the top. Copy and paste your completed part 3 (data profiling table + 5 SQL questions). Be sure you include screenshots as directed. Page 9 of 9
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help