-
Notifications
You must be signed in to change notification settings - Fork 0
/
section 4: Data-ware-house-design-building-blocks
133 lines (117 loc) · 7.71 KB
/
section 4: Data-ware-house-design-building-blocks
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
Decide what you are Data Warehouse will be used for:
Purpose of Data Warehousing:
The primary objective of a data warehouse is to enable data-driven decisions. Ideally, these decisions should encompass:
The Past: Historical analysis to understand trends and patterns.
The Present: Real-time or near real-time analysis to monitor current operations.
The Future: Predictive analytics to forecast future outcomes.
The Unknown: Exploratory analytics to discover new insights.
Synergy Between BI and Data Warehousing:
Business intelligence (BI) and data warehousing emerged simultaneously, influencing and driving each other's growth.
This synergistic relationship means that as BI evolved, it required more robust and sophisticated data warehousing solutions, and vice versa
Categories of Business Intelligence:
Basic Reporting: Traditional reports generated regularly to provide insights into business operations.
Online Analytical Processing (OLAP): Tools that allow users to interactively analyze multidimensional data from multiple perspectives.
Dimensional Modeling:
Dimensional modeling involves organizing data into fact and dimension tables, optimized for query performance and ease of understanding. This model is well-suited for:
Aggregating data: Summarizing data to support reporting.
Multidimensional analysis: Enabling OLAP functionalities to drill down into data, slice and dice it, and pivot across various dimensions.
Predictive and Exploratory Analytics:
If your data warehouse aims to support predictive or exploratory analytics, you'll need different data models. These analytics require:
Advanced Statistical Models: Often requiring raw, unstructured data.
Machine Learning Models: Needing large datasets that may not fit into traditional dimensional models.
Data Lakes: To store vast amounts of raw data in its native format.
Design Engineering and Dimensional Modeling:
In the context of dimensional modeling, we'll explore:
Fact Tables: Central tables that contain quantitative data for analysis.
Dimension Tables: Surrounding tables that describe the dimensions (attributes) related to the facts.
The Basic principles of Dimensionality:
Focus on Basic Reporting and OLAP
Purpose: Enable data-driven decisions using basic reporting and online analytical processing (OLAP).
Dimensional Structure: Organize data dimensionally to easily slice, dice, and drill down into data for better insights.
Importance of Measurements and Context
Example:
With Context: "What is the average annual faculty salary by rank, by department, by academic year?"
Measurement: Average annual faculty salary.
Context: Rank, department, academic year.
Understanding "By" and "For"
By: Groups data by every possible value in a dimension.
For: Filters data for specific values in a dimension.
Organize data by measurements with clear, dimensional context to derive meaningful insights for decision-making.
Compare Facts, Fact Tables, Dimensions and Dimension Tables:
Facts and Fact Tables
Facts: Numeric, quantifiable measurements (e.g., salary, number of credits, dollar amounts).
Fact Tables: Store these measurements in a relational database.
Dimensions and Dimension Tables
Dimensions: Provide context for facts (e.g., academic department, student major, campus building).
Dimension Tables: Store this contextual information.
Key Points
Fact vs. Logical Fact: A data warehousing fact is a measurable metric, not just a true statement.
Storing Facts: Facts go into fact tables; these tables hold our measurable data.
Storing Dimensions: Dimensions go into dimension tables; these tables provide context to our facts.
Star vs. Snowflake Schema
Star Schema: Combines all levels of a dimension hierarchy into a single table.
Snowflake Schema: Splits each level of a dimension hierarchy into separate tables.
Important Notes
Facts are different from fact tables.
We use "dimension" and "dimension table" interchangeably, though they can technically differ.
Understanding schemas (star vs. snowflake) helps clarify how dimensions are structured.
Compare Different forms of additivity in facts:
Types of Facts:
Additive Facts: Refers to the ability to add facts.(Can be added in all scenarios.)
Non-Additive Facts: Cannot be added up.(Store components and calculate ratios, averages, or percentages at report time).
Semi-Additive Facts: Sometimes additive, sometimes not (Require careful handling, typically used in specific types of fact tables.)
Compare a star schems to a snowflake schema:
Key Points Recap:
Data Warehouse: Drives analytical decision making.
Business Intelligence (BI): A key part of decision making.
OLAP (Online Analytical Processing): Supports BI through dimensional analysis.
Dimensional Data: Involves facts and dimensions, stored in fact tables and dimension tables.
Star Schema:
Structure: All levels of a hierarchy (e.g., product, product family, category) are in a single dimension table.
Appearance: Resembles a star with one fact table at the center and dimension tables around it.
Joins: Fewer database joins due to single dimension tables.
Storage: Requires more storage due to repeated data.
Normalization: Considered de-normalized, but not strictly accurate.
Snowflake Schema:
Structure: Each level of a hierarchy is in its own table (e.g., product in one table, product family in another).
Appearance: Resembles a snowflake with multiple related tables.
Joins: Requires more database joins due to split dimension tables.
Storage: Requires less storage as it avoids data repetition.
Normalization: Considered normalized, but not strictly accurate.
Example Comparison:
Star Schema Example:
Fact Table: Budget Fact
Dimension Tables:
Academic Department DIM
Time DIM
Expense Category DIM
Hierarchy: All levels squished into single dimension tables.
Snowflake Schema Example:
Fact Table: Budget Fact
Dimension Tables:
Academic Department DIM
College DIM (related to Academic Department)
Semester DIM (related to Time DIM)
Academic Year DIM (related to Semester DIM)
Expense Category DIM
OPEX/CAPEX DIM (related to Expense Category)
Summary:
Both schemas contain the same dimensions but differ in table representation.
Star Schema: Simpler structure, fewer joins, more storage.
Snowflake Schema: More complex structure, more joins, less storage.
Database keys for for data warehousing:
Fundamentals of Databases and Keys:
Schemas: Whether using a star or snowflake schema in a relational database management system (RDBMS), logical relationships connect data across different tables.
Keys: Officially establish relationships through keys.
Primary Keys: Unique identifiers for each row in a table. They can be single or multiple columns.
Foreign Keys: Primary keys from other tables, used to designate logical relationships, enforce data integrity, and improve query performance.
Natural vs. Surrogate Keys:
Natural Keys: Originated from source systems, can be cryptic or understandable. They travel into the data warehouse from source systems.
Surrogate Keys: Generated within the data warehouse, have no business meaning, and are used to relate data across tables.
Key Design Decisions:
Use of Surrogate or Natural Keys:
Guidance: Use surrogate keys as primary and foreign keys. Avoid using natural keys due to data integrity concerns.
Natural Keys in Dimension Tables:
Recommendation: Keep natural keys in dimension tables as secondary keys, even if not used for primary-foreign key relationships.
Natural Keys in Fact Tables:
Guidance: Generally, do not use natural keys in fact tables. This will be discussed in detail later when setting up fact tables and managing history with slowly changing dimensions.