-
Notifications
You must be signed in to change notification settings - Fork 0
/
section 3: bring-data-into-data-warehouse
146 lines (134 loc) · 10.2 KB
/
section 3: bring-data-into-data-warehouse
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
Compare ETL to ELT:
ETL: Extract, Transform, Load
Extract:
Quickly pull data from various source applications.
Traditionally done in batches (e.g., every hour, daily, weekly).
Transform:
Prepare and standardize the data from different sources.
Handle complex transformations to ensure uniformity (e.g., standardizing product data, customer information).
This phase is crucial for ensuring that data from multiple sources can be compared and analyzed together.
Load:
Move the transformed data into the user access layer of the data warehouse or data mart.
The data is now ready for business intelligence (BI) and analytics.
Challenges with ETL:
Requires significant business analysis and data modeling before data storage.
A lot of preparatory work is needed to ensure data integrity and usability.
ELT: Extract, Load, Transform:
Extract:
Pull data from various source applications, similar to ETL.
Collect data in its rawest form, regardless of structure (structured, semi-structured, unstructured).
Load:
Load the raw data directly into a big data environment, such as Hadoop Distributed File System (HDFS) or AWS S3 buckets.
This step eliminates the need for upfront data modeling and analysis.
Transform:
Transform data when needed, using the computational power of big data environments.
Allows for on-demand data transformation based on specific BI and analytics needs.
Advantages of ELT:
Flexibility: Defer schema creation and data modeling until data is needed for analysis.
Performance: Utilize the computational power of big data environments for transformation, improving performance with large datasets.
Schema on Write (ETL): Schema is defined when data is written to the storage. Data structure is fixed upfront.
Schema on Read (ELT): Schema is defined when data is read for analysis. Data structure is flexible until needed.
Use Cases:
Traditional Data Warehousing (ETL):
Suitable for environments with relational databases and OLAP cubes.
Ideal for structured data and scenarios requiring predefined schemas.
Big Data and Data Lakes (ELT):
Suitable for environments with unstructured or semi-structured data.
Ideal for scenarios requiring flexibility in data modeling and analysis.
ETL and ELT are both essential data processing techniques with distinct advantages and use cases.
ETL remains prevalent in traditional data warehousing, where structured data and predefined schemas are essential.
ELT is gaining popularity with big data technologies, providing flexibility and improved performance for large volumes of diverse data.
Understanding the differences and applications of ETL and ELT allows organizations to choose the best approach for their data processing needs.
Design the Initial Load ETL:
Initial Loading ETL:
Overview:
Initial Loading ETL is typically a one-time process performed before a data warehouse goes live.
The primary goal is to gather all the necessary data to get the data warehouse up and running.
This involves extracting relevant data, transforming it, and loading it into the user access layer of the data warehouse.
Steps Involved:
Extract:
Pull relevant data from various source systems.
Focus on gathering data that will be essential for business intelligence (BI) and analytics.
Transform:
Standardize and clean the extracted data.
Ensure data consistency and quality before loading it into the warehouse.
Load:
Move the transformed data into the staging area, then into the user access layer of the data warehouse.
This prepares the data for use in BI and analytics.
Key Considerations:
Relevance: Only bring in data that is relevant and necessary for BI and analytics.
Absolutely Needed Data: Data required for standard reports and visualizations.
Probably Needed Data: Data that is closely related to the essential data and likely to be used soon.
Historical Data: Include historical data to support historical analysis and trend reporting.
One-Time Process: Initial Loading ETL is typically performed once. However, it might need to be redone in specific scenarios:
Data Warehouse Issues: If the data warehouse is corrupted or re-platformed.
Complete Overhaul: When a significant redesign or overhaul of the data warehouse is required.
Incremental ETL:
Overview:
Incremental ETL is used to keep the data warehouse up to date after the initial loading.
This process involves regularly updating the data warehouse with new and updated data from source systems.
Difference from Initial Loading ETL:
Initial Loading ETL is a comprehensive, one-time data load, while Incremental ETL is an ongoing process.
Incremental ETL ensures that the data warehouse remains current by adding new data and updating existing data regularly.
Initial Loading ETL is a critical step in setting up a data warehouse, ensuring that all necessary data is in place before the system goes live.
It focuses on relevance, extracting, transforming, and loading data that is essential for BI and analytics.
While typically a one-time process, it may need to be redone in certain situations.
Following the initial load, Incremental ETL processes are used to keep the data warehouse up to date, ensuring continuous availability of fresh data for analytical purposes.
Compare Different Models for Incremental ETL:
Incremental ETL:
Overview:
Incremental ETL is performed regularly to update the data warehouse with new, modified, and deleted data.
This process ensures the data warehouse remains current, reflecting the latest changes from the source systems.
Key Activities:
1.Adding New Data
2.Updating Modified Data
3.Handling Deleted Data
purpose:
The main goal of incremental ETL is to keep the data warehouse up to date. It balances the need for non-volatility (ensuring static data during strategic planning) with the need for current info.
Incremental ETL Patterns:
There are four major patterns used in incremental ETL:
Append Pattern:
New data is added to the existing tables without altering the existing records.
Suitable for scenarios where historical data accumulation is important.
In-Place Update Pattern:
Existing rows are updated with new values, but the number of rows remains unchanged.
Commonly used for type one slowly changing dimensions in dimensional modeling.
Complete Replacement Pattern:
An entire segment of the data warehouse is overwritten, even if most data remains unchanged.
Less common in modern data warehousing but might be used in legacy systems.
Rolling Append Pattern:
Maintains a fixed time window of historical data by appending new data and deleting the oldest equivalent data.
For example, keeping 36 months of sales history and updating it weekly.
Incremental ETL is crucial for keeping a data warehouse updated with the latest data.
Understanding and applying the right ETL patterns ensure that your data warehouse can support timely and accurate business intelligence and analytics.
The choice of pattern depends on the specific requirements and architecture of the data warehouse, with append and in-place updates being the most commonly used in modern systems.
Explore the role of Data Transformation:
The transformation step ensures that the data is uniform and well-structured, facilitating effective business intelligence and analytics.
Data transformation is pivotal in creating a cohesive, functional data warehouse.
By unifying data values, types, sizes, and eliminating duplicates, we ensure the data is consistent and reliable for business intelligence and analytics.
Understanding these transformation models and applying them effectively is crucial for building robust data warehousing solutions.
Goals of Data Transformation:
1. Uniformity: Ensure that data from various sources looks consistent, allowing for accurate comparisons and analyses.
2. Restructuring: Organize raw data from the staging layer into a well-engineered set of data structures suitable for the data warehouse.
Transformation Models
1. Data Value Unification: Objective: Standardize how data values are represented across different systems.
2. Data Type and Size Unification: Objective: Standardize the data types and sizes for similar data across different systems.
3. Data De-duplication: Objective: Remove duplicate entries to avoid counting the same entity multiple times.
4. Dropping Columns (Vertical Slicing): Objective: Remove unnecessary columns from the data to streamline what gets loaded into the data warehouse.
5. Row Filtering (Horizontal Slicing): Objective: Filter out rows based on specific values to include only relevant data in the data warehouse.
6. Error Correction: Objective: Correct known errors in the source data during the ETL process to maintain data accuracy and uniformity.
Implement Mix-and-Match Incremental ETL:
ETL feeds might seem conceptually similar: copying data from source systems and feeding it into a data warehouse regularly to keep everything updated.
However, upon closer inspection, you'll notice that each ETL feed can have different frequencies and patterns.
ETL Feed Frequencies:
Daily Updates: Some ETL feeds update the data warehouse daily. This is suitable for data that changes frequently but doesn't require real-time updates.
Hourly Updates: For more critical data, hourly updates ensure the data warehouse stays up-to-date with minimal lag.
Weekly Updates: For less volatile data, weekly updates are sufficient, reducing the load on the system and optimizing resources.
ETL Patterns:
Appends: New data is added to existing records without altering previous data.
Rolling Appends: Similar to appends but with a mechanism to handle and remove older data, maintaining a rolling window of recent data.
Complete Replacements: Entire tables or datasets are replaced with new data, ensuring complete accuracy but potentially using more resources.
In-Place Updates: Specific records are updated directly in the data warehouse, which is efficient for small, frequent changes.
Mix-and-Match Philosophy:
Each ETL feed can use a different combination of frequency and pattern based on the specific needs of the data.
This flexibility extends down to the table level within each source system.