DWH_Notes

docx

School

California State University, East Bay *

*We aren’t endorsed by this school

Course

620

Subject

Information Systems

Date

Oct 30, 2023

Type

docx

Pages

5

Uploaded by AmbassadorBeaver2154

Report
Dimensional Modeling Used to design schemas to support multidimensional analysis Consists of a Fact table and Dimension tables Fact table consists of key measurements of a business process of interest and keys to dimensions: Ex: Sales process -> Sales figures Dimensions – context (who, what, where, when, how) of measurement: Ex: Time, Location, Product Dimension attribute – columns in dimension table: May have hierarchical relationships Each dimension table contains data for one dimension: May require de-normalization of transaction databases Building Dimensional Tables Transaction database : Location(lid , code, loc_name, stateid), -,State(stateid, state_name, countryid) -Country(countryid, country_name) D imension table: Locations(dimid, loc_code, city_name, state_name, country_name) dim id Loc_co de City_na me State_na me 1 SJ1 San Jose CA TimeI D LocID ProdID Sales 1 1 1 454 (Fact) Contains key measurements of business process: Sales – monthly sales, Level of detail is granularity or “grain” Dimensional Modeling Star schema: Easy to create, Easy to understand, Extensible Modeling steps: Identify process, Identify grain, Identify dimensions, Identify facts Dimension Table exampleS : (i)Retail -- store, products, date, (ii) Telecommunications -- call origin, call destination, (iii) Banking -- customers, accounts, branch, account officer (iv) Insurance -- policy type, insured party Fact Table examples: (i) Retail -- number of units sold, sales amount, (ii) Telecommunications -- length of call in minutes, average number of calls, (iii) Banking -- average monthly balance (iv) Insurance -- claims amount Extending Start Schema: (i)DW may also store pre-aggregated data: Separate fact table (monthly, weekly..), Fact constellations – multiple fact tables that share dimensions (ii)DB Server Issues – Index structures, parallel processing, complex sql queries etc. (Complex queries – nested, aggregates, functions etc. DW Goals (i)Accessible – data intuitive and obvious to user, easy to use interface (ii) Consistency – data should be reliable. Common labels. (iii) Change – inevitable but when new questions are asked, existing data and apps are not disrupted. Gracefully (iv) Timely – operational decisions, time to clean and validate – users know(v) Secure – IP (vi) Trust – data driven business decisions, trust data DW/BI Manager Responsibilities Understand users – what do they want – job responsibility goals etc, identify champions early adopters, find new users; Quality and rel – choose robust and actionable data selected from vast. Easy to use ui, trusted data, monitor accuracy cont. adapt to change; Sustain – take credit, update, maintain trust, keep customers happy Dimensional Modeling Process Select Business Process; Declare Grain; Identify Dimensions; Identify Facts Business Process - What is a business process? Common characteristics: Action verbs, Supported by IT systems, Input/Output/Metrics/KPI Examples: Start with examples – then abstract characteristics (verbs, IT systems, IO, KPI). Not interested in non IT processes since no data. Retail – sales; Healthcare – billing, treatment?; Education – registration? Teaching?; Insurance – write policy, process claims; Transportation – UPS – shipment, Uber? – travel; Strategy – Hotel – Good customer exp, Financial strength, Happy Employees and Invest in Emp., Competitive, ; Measure: Checkin Time, How many are booking, how much training do emp get, Declare the Grain: Level of detail associated with fact table measurements; What level of granularity should be used?; Summarizing data in DW: (i) Summarizing to improve performance (ii) Summarize to meet business needs. Grain (i) Include lowest grain – lowest level of detail (ii) Lowest level of grain allows flexibility; Lowest level of grain ups – package od vs package segments Grain: Level of detail associated with fact table measurements. Examples: One row per scan of individual product on customer sales transaction; One row per individual boarding pass scanned at an airport gate; Insurance?; Transportation? Ride Share Service?; Education?; ; Transportation/logistics – container; Retail – line item Identify Dimensions Context of measurements; Who, what, why, where, when, how; Data types – fact vs dimensions: Data types – facts – numeric, dimens- categorical Identify Facts What is the process measuring?; Key performance metrics; Typical facts are numeric additive figures; KPI determines process; Display product stacking. Retail Business Where is data collected?; Many stores?; Profit = high sales price, high quantity, low costs; Decisions – pricing and promotions
(i)Start with most important process : Across Org-ROI (ii) “What level of data detail should be made available in a dimensional model?” Use most detailed atomic information captured by a business process: Monday vs. Sunday?; Individual sizes of certain brands; How many shoppers used 50 cents off promotion? (iii)Identify Dimensions : Product, Store, Date/Time, Payment Method, Promotion, Cashier. Add more dimensions if additional dimensions naturally take on only one value under each combination of primary dimensions. (iv) Identify Facts: “Facts must be true to grain”; Quantity; ; Regular unit price; Discount unit price; Net unit price; Extended discount price; Extended Sales dollar amount; Extended cost dollar amount; Extended gross profit. Additive – quantity, ext sales, cost, profit; Account Balance semi additive; Profit margin – non additive; Regular unit price – not additive, summing unit price is not unit price, average is not weighted Sell 1 for $1, 4 for $0.5. , 1 + 0.5 = 1.5 is not unit price. 0.75 is not average. Modeling Time Dimension Is date dimension needed?; Time vs Date; Present in every DW; Date and Time of Day dimensions (i) Date -> 20yrs -> 7300 approx. (ii) Time of Day -> 20 x 365 x 24 x 60 = 10,512,000; Date dimension needed – expensive join, direct sql query; Business user not well versed in sql date syntax etc.; Holidays, fiscal etc. roll ups expensive as real calcs on calendar func; Time – sparsity, large size, needed? What should date dimension capture?: Day, Month, Year; Weekday, Weekend ;Day of Week; Day of Month; Calendar vs Fiscal. Sql date stamp; Holiday Indicator – Y/N or Holiday, Non Holiday – use meaningful values for end user Product Dimension: Product dimension contains every single SKU (Stock Keeping Unit); Product<Brand<Subcategory<Category<Department; HG Dark Chocolate < HG < Ice Cream < Frozen Desserts < Frozen Foods; Normalization of Repeated Values?; List price – fact or dimension? (i) Used for filtering or grouping – dimension (ii) Used for calculations - fact (iii) Both - both Store Dimension: Describes every store; Primary geographic dimension; Geographic Hierarchy; Internal District Hierarchy Promotion Dimension Most interesting dimension, often called causal dimension; Typical promotions: price reduction, end aisle display, news paper ad, coupons; Need to capture each combination of promotion; Modeled as separate dimensions or combined; ; Promotion cost?; No promotion key; Null values in fact or dimension tables Multiple payment methods –pay one bill with cash and credit – then? Change grain? Separate fact table per transaction; Null in Fact ok, Null in Dim not ok – say NA or Unknown Schema Extensibility: Example: Retailer adds frequent shopper program – how would this be supported? (i)Add new dimension table, surrogate keys (ii) Modeling at most granular level allows extensibility (iii) Extension schemas by adding (a) New dimensions (b)New dimension attributes (c) New measured facts Eg: New measured fact – rfid tags to shopping cart. Fact and dimension table keys (i)Use surrogate keys as primary keys: a)Integer keys, sequential (ii)Advantages vs. natural keys: Buffer from operational rules; Integrate multiple source systems; Improve performance; Handle null or unknown conditions; Dimension attribute change tracking. Unknown conditions – no promotion; Dimension attribute change – leapfrom from electronic toys to educational dept. ; Surrogate keys for degenerate dimensions – pos no are reused Snowflake schemas are a type of data warehousing schema that prioritizes data integrity, minimizes data redundancy, and offers flexibility for complex hierarchies. They are particularly useful in situations where maintaining the accuracy and consistency of data is of paramount importance. However, they may result in more complex queries and potentially slower query performance due to the multiple joins involved. Centipede fact tables Large fact table – fact table is larges and too much space consumed as opposed to dimen tables Indexing large multipart key is hard Value Chain: Done with sales, what about others? Where next?; Services value chain etc. Insurance value chain? Retail Value Chain: Inventory Management: Inventory Models Goals: (i) Sufficient inventory for smooth operations and prevent stockouts (ii) Minimize inventory to reduce carrying costs and increase profit Inventory Models (i) Periodic Snapshot (ii) Inventory Transactions (iii) Accumulating Snapshots Inventory Snapshots: Dimensional Model: Process: Periodic inventory snapshots; Grain: Snapshot; Dimensions: Product, Date, Store or Warehouse; Fact: Quantity Inventory Snapshot Schema: Date dim – same as in retail ; Product DIM and store can have additional details; Product – min reorder quantity; Discuss reorder point – if inventory falls to this level reorder or will stockout; How much to reorder- Economic reorder point – minimize order cost vs holding cost; Store DIM – storage capacity Semi Additive FACTS (i)Inventory Quantity on Hand is Semi-Additive (a) Additive across stores and products (b)Not additive across dates (ii)Most useful ways of combining is Average (a) SQL AVG Function cannot work (b) OLAP AVG function can avg with specific aggregation rules SQL AVG Func for Average Daily Inventory: Sql will avg by dividing across all rows (9) than just date(3); Group by will give 3 averages – need one.
Enhanced Inventory Facts: Quantity on Hand; Quantity Sold; Inventory Dollar Value at Cost; Inventory Dollar Value at Selling Price; Are the above additive across date, product and store dimensions? Inventory Dollar Value at cost or selling price are not additive. (see errata) Inventory Transactions: (i)Record every transaction that affects inventory (ii) Inventory transactions at warehouse: Receive product; Inspect;Return to vendor; Place in product bin; Package/Ship ; Receive from customer/returns; Remove from inventory Inventory Transaction Models: Detailed data that mirrors transaction tables, Useful for measuring transaction type frequencies, but cumbersome and impractical for inventory position.; Fact table fact : dollar amount Inventory Accumulating Snapshot: Each fact row tracks receipt through disposition of product; Possible when products received in one shipment can be distinguished from products received later; Track products by Serial no or lot no Dimensions : (i) Sales: Product; Date; Store; Promotion; Payment (ii) Inventory: Date; Product; Store; Warehouse; Supplier Enterprise data warehouse bus architecture “Family of dimensional models that share a comprehensive set of common, conformed dimensions”; “Master suite of standardized dimensions and facts with uniform interpretation across enterprise”; New dimensional models fit together like pieces of a puzzle; Bus architecture – overall enterprise wide design but can be built incrementally Enterprise Data Warehouse Bus Matrix Development process; Important deliverable of DW/BI; Multiple purposes – architecture planning, db design, data governance coord, org communication Replace dim with business function; Requirements determination – which groups?--> Conformed Dimensions Aka Master , Common or Reference Dimension; Built once in ETL, replicated logically or physically ; Help combine measures across business processes; Identical Conformed Dimensions; Shrunken Rollup Conformed Dimensions; Limited Conformity Drill across open order, inventory, sales – diff fact tables. Common product dimension; Identical – date dim – same keys, column names, definitions, values. Small change July JULY, or Month, MonthName don’t confirm; Shrunken – rollup, attribute subset – product->brand; Shrunken row subset – specific products (referential integrity issue); Limited – GE – aeroplane engine, appliances Conformed facts Revenue, profit, costs, quality measures etc.; Use same label only if definitions are same, same units of measurement. Different bp’s have revenue – can be added only if same
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Revenue = sold, or revenue = sold - returns Data Goveranance: Conformed Dimensions – need enterprise consensus on attribute names and contents; IT can facilitate but cannot be sole driver – SME lead; Data Governance Objective: Common data definitions, labels and domain values; Master Data Management : Master file that is a common point of reference Enterprise Data Warehouse Architecture (i) Value chain processes (a) Sales, Inventory (b)Support process: Procurement (ii) Procurement process: Critical business process for many industries including Retail, Manufacturing, Services (iii)Overall procurement goal: Source appropriate materials in most economical manner Procurement: ( i) Procurement Activities: Negotiate contracts; Issue Purchase requisitions; Track receipts and authorize payments (ii)Procurement analytics requirements: What is purchased? From whom? At what cost?; Monitoring organizational purchasing behavior; Monitoring vendor performance Procurement Processes : Purchase Requisitions; Purchase Orders;Shipping Notification; Warehouse Receipts; Vendor Invoice; Vendor Payments Single vs Multiple Fact Tables : (i)Varying Perspectives on Procurement Process: Purchase orders, receiving, accounts payable; Different systems, dimensions (ii) Single or Multiple Fact table design considerations: Analytic requirements, Unique business processes; Source systems; Dimensionality Procurement Process Bus Matrix (i) Bus Matrix (ii) Multiple Fact Tables: Allows richer and more descriptive dimensionality (iii) Accumulating Fact tables : Procurement Snapshot Table; Pipeline/Process View and Monitoring Procurement Dimensional Model Procurement transaction table; Shared common conformed dimensions Slowly Changing Dimensions: As Dimensions change over time, how should they be handled in DW? Type 0: Retain Original : Attribute value never changes; Persistent durable keys, date dimension attributes Type 1: Overwrite: Overwrite old attribute value with current value; Loss of history, rebuild aggregates, appropriate when changes are insignificant Type 2: Add New Row: Predominant technique to preserve history; Effective and expiration dates, Current row indicator; Pre-existing aggregations do not change; Safest response if not clear about which rule to apply; Difference between Type 1 and Type2 Type 3: Add New Column for Prior Value: Appropriate when: Business need is to track both new and prior values; ;Significant change affecting many rows;Example: Product line or sales district reorganization Type 4: Add Mini-Dimension: Appropriate when: Dimensions are large; Many frequent changes; Example: Customer Demographics Order Management Processes : Ordering, Shipping, Invoicing, Receiving Payments, Returns; Order Management Bus Matrix Role Playing Dimensions: Single dimension appears multiple times in the same fact table ; Each role is a separate view of underlying single physical table; Representation in bus matrix Junk Dimensions: (i)Transactional data sources may often include numerous flags and indicators; (ii)Options to consider: Ignore;Add as fact row; Make each indicator a dimension; Add to header dimension (iii)Alternative: Junk Dimension:Aggregate low cardinality indicators into one dimension Ignore, fact rows, each indiv dimension, Tax-code, prepay, discount, Hospital – admission type – walk, apt, transfer; level of care – ICU, Med Floor, Ped ICU, Surgery Ignore flags, fact row, sep dim, junk dim Junk Dim Junk Dimension Design Number of rows in junk dimension depends on cardinality of indicators;Example: If a single junk dimension has 10 two value indicators, number of rows would be 2^ 10 = 1024 ; Rows can be added incrementally to dimension table or can include all combinations of indicators; Best approach depends on cardinality of indicators and frequency of occurrence of different values of indicator
Header –Line Pattern Problems: Design accurately represents header – line relationship but has several shortcomings; Header may be close to 20% of fact table when average orderliness per header at 5.; Dimension tables should be much smaller than fact tables; Header dimension grows with fact table. Dimension tables should not normally grow at same rate as fact Multiple Currencies: Order transactions typically expressed in both local and global currency; Conversion rate could be rate at end of day, at time of capture or another as per company policy; Above technique allows easy roll up of data across different currencies; Global currency is fully additive, local currency is only additive for the currency specified; Currency dimension needed even when geography known; Currency conversion fact table allows more complex analysis. Supporting Facts of Different Granularity: Example: Shipping Charges are at Order Level, whereas fact table is at OrderLine level; (i) Two possible approaches: Allocation (preferable if possible); Separate fact table (ii) Warning: Shouldn’t mix fact granularities at different levels Header/Line Pattern to Avoid (2): Avoid joins between fact tables; Fact tables are very large and can take very long to query Audit dimension Data Quality: Accuracy, completeness etc. of data; Data Provenance: How was data calculated, what formulas were used?; Business users may ask

Browse Popular Homework Q&A

Q: If $20,000 is invested at an interest rate of 2% per year, compounded semiannually, find the value…
Q: The given contingency table lists the cross tabulation of the number of business partnerships by…
Q: Give the chair flip of each structure and indicate which structure in each pair is more stable. a.…
Q: You need to buy some chicken for dinner tonight. You found an ad showing that the store across town…
Q: Evaluate the following double integral over the region R by converting it to an iterated integral. e…
Q: Leaves contain an array of different light-absorbing molecules to harvest the full spectrum of…
Q: at x: つ:8ー5x, atx=e
Q: (a) lim 2²/ I→∞
Q: The size P(t) of a population at time t is modeled by the equation dP dt (a) In the long term, how…
Q: What was the difference between the Ku Klux Klan of Reconstruction and the KKK of the 1920s? A. The…
Q: Let f(x) 2 and g(x) 3 Find lim(fog)(x) and - I %3D x – 1° | lim (g • f)(x).
Q: As soon as a traffic light turns green, a car speeds up from rest to 51.0 mi/h with constant…
Q: $HOT! a. If the disease were X-linked recessive, what would be the probability that a male child…
Q: What is the relationship between the following two molecules? HO -H H-OH HO -H H- -OH A) enantiomers…
Q: Mai Lai massacre so devastating to Americans?
Q: Write a program that uses one loop to process the integers from 300 down to 200, inclusive. The…
Q: cos 0 = 6100 9' tan Ꮎ < 0 Find sin 0.
Q: In your paper, discuss IPv6, the Internet, virtual private networks, and voice over internet…
Q: -x²15-y²15 Consider the function F(x,y) = e and the point P(-2,2). a. Find the unit vectors that…
Q: Java create a class
Q: What will be the actual length of a 5” pipeline displayed on a letter size paper if a 1=2,000 scale…
Q: Find the arc length of the graph of the function over the indicated interval. (Round your answer to…