|
Table of Contents |
5 |
|
|
About the Author |
14 |
|
|
About the Technical Reviewer |
15 |
|
|
Acknowledgments |
16 |
|
|
Introduction |
17 |
|
|
Chapter 1: Core Data and Analytics Concepts |
19 |
|
|
Core Data Concepts |
19 |
|
|
What Is Data? |
20 |
|
|
Structured Data |
20 |
|
|
Semi-structured Data |
21 |
|
|
Unstructured Data |
21 |
|
|
Data Processing Methods |
22 |
|
|
Batch Data Processing |
22 |
|
|
Streaming or Real-Time Data Processing |
23 |
|
|
Relational Data and Its Characteristics |
24 |
|
|
Non-Relational Data and Its Characteristics |
26 |
|
|
Core Data Analytics Concepts |
28 |
|
|
What Is Data Analytics? |
28 |
|
|
Data Ingestion |
28 |
|
|
Data Exploration |
29 |
|
|
Data Processing |
30 |
|
|
ETL |
30 |
|
|
ELT |
31 |
|
|
ELT / ETL Tools |
32 |
|
|
Data Visualization |
32 |
|
|
Data Analytics Categories |
33 |
|
|
Descriptive Analytics |
34 |
|
|
Diagnostic Analytics |
34 |
|
|
Predictive Analytics |
35 |
|
|
Prescriptive Analytics |
35 |
|
|
Cognitive Analytics |
36 |
|
|
Summary |
36 |
|
|
Chapter 2: Modern Data Warehouses and Data Lakehouses |
38 |
|
|
What Is a Data Warehouse? |
39 |
|
|
Core Data Warehouse Concepts |
40 |
|
|
Data Model |
40 |
|
|
Model Types |
41 |
|
|
Schema Types |
41 |
|
|
Metadata |
42 |
|
|
Why Do We Need a Data Warehouse? |
42 |
|
|
Efficient Decision-Making |
42 |
|
|
Separation of Concerns |
42 |
|
|
Single Version of the Truth |
43 |
|
|
Data Restructuring |
43 |
|
|
Self-Service BI |
43 |
|
|
Historical Data |
44 |
|
|
Security |
44 |
|
|
Data Quality |
44 |
|
|
Data Mining |
45 |
|
|
More Revenues |
45 |
|
|
What Is a Modern Data Warehouse? |
45 |
|
|
Difference Between Traditional & Modern Data Warehouses |
46 |
|
|
Cloud vs. On-Premises |
46 |
|
|
Separation of Compute and Storage Resources |
46 |
|
|
Cost |
47 |
|
|
Scalability |
47 |
|
|
ETL vs. ELT |
48 |
|
|
Disaster Recovery |
48 |
|
|
Overall Architecture |
48 |
|
|
Data Lakehouse |
49 |
|
|
What Is a Data Lake? |
49 |
|
|
What Is Delta Lake? |
50 |
|
|
What Is Apache Spark? |
51 |
|
|
What Is a Data Lakehouse? |
52 |
|
|
Characteristics of a Data Lakehouse |
53 |
|
|
Various Data Types |
53 |
|
|
AI |
53 |
|
|
Decoupled Compute and Storage Resources |
54 |
|
|
Open Source Storage Format |
54 |
|
|
Data Analytics and BI Tools |
54 |
|
|
ACID Properties |
54 |
|
|
Differences Between a Data Warehouse and a Data Lakehouse |
55 |
|
|
Architecture |
55 |
|
|
Access to Raw Data |
55 |
|
|
Open Source vs. Proprietary |
56 |
|
|
Workloads |
56 |
|
|
Query Engines |
56 |
|
|
Data Processing |
57 |
|
|
Real-Time Data |
57 |
|
|
Examples of Data Lakehouses |
58 |
|
|
Azure Synapse Analytics |
58 |
|
|
Databricks |
59 |
|
|
Benefits of Data Lakehouse |
60 |
|
|
Support for All Types of Data |
60 |
|
|
Time to Market |
61 |
|
|
More Cost Effective |
61 |
|
|
AI |
61 |
|
|
Reduction in ETL/ELT Jobs |
62 |
|
|
Usage of Open Source Tools and Technologies |
62 |
|
|
Efficient and Easy Data Governance |
62 |
|
|
Drawbacks of Data Lakehouse |
63 |
|
|
Monolithic Architecture |
63 |
|
|
Technical Infancy |
63 |
|
|
Migration Cost |
64 |
|
|
Lack of Many Products/Options |
64 |
|
|
Scarcity of Skilled Technical Resources |
64 |
|
|
Summary |
65 |
|
|
Chapter 3: Introduction to Azure Synapse Analytics |
66 |
|
|
What Is Azure Synapse Analytics? |
66 |
|
|
Azure Synapse Analytics vs. Azure SQL Data Warehouse |
68 |
|
|
Why Should You Learn Azure Synapse Analytics? |
69 |
|
|
Main Features of Azure Synapse Analytics |
70 |
|
|
Unified Data Analytics Experience |
70 |
|
|
Powerful Data Insights |
71 |
|
|
Unlimited Scale |
72 |
|
|
Security, Privacy, and Compliance |
72 |
|
|
HTAP |
73 |
|
|
Key Service Capabilities of Azure Synapse Analytics |
73 |
|
|
Data Lake Exploration |
74 |
|
|
Multiple Language Support |
75 |
|
|
Deeply Integrated Apache Spark |
76 |
|
|
Serverless Synapse SQL Pool |
77 |
|
|
Hybrid Data Integration |
78 |
|
|
Power BI Integration |
79 |
|
|
AI Integration |
80 |
|
|
Enterprise Data Warehousing |
81 |
|
|
Seamless Streaming Analytics |
82 |
|
|
Workload Management |
82 |
|
|
Advanced Security |
84 |
|
|
Summary |
85 |
|
|
Chapter 4: Architecture and Its Main Components |
86 |
|
|
High-Level Architecture |
87 |
|
|
Main Components of Architecture |
90 |
|
|
Synapse SQL |
90 |
|
|
Compute Layer |
90 |
|
|
Dedicated Synapse SQL Pool |
90 |
|
|
Serverless Synapse SQL Pool |
91 |
|
|
Storage Layer |
93 |
|
|
Synapse Spark or Apache Spark |
94 |
|
|
Synapse Pipelines |
96 |
|
|
Synapse Studio |
98 |
|
|
Synapse Link |
100 |
|
|
Summary |
102 |
|
|
Chapter 5: Synapse SQL |
104 |
|
|
Synapse SQL Architecture Components |
105 |
|
|
Massively Parallel Processing Engine |
106 |
|
|
Distributed Query Processing Engine |
107 |
|
|
Control Node |
107 |
|
|
Compute Nodes |
108 |
|
|
Data Movement Service |
109 |
|
|
Distribution |
109 |
|
|
Hash Distribution |
111 |
|
|
Round-Robin Distribution |
112 |
|
|
Replication-based Distribution |
112 |
|
|
Azure Storage |
114 |
|
|
Dedicated or Provisioned Synapse SQL Pool |
114 |
|
|
Serverless or On-Demand Synapse SQL Pool |
116 |
|
|
Synapse SQL Feature Comparison |
117 |
|
|
Database Object Types |
117 |
|
|
Query Language |
119 |
|
|
Security |
120 |
|
|
Tools |
123 |
|
|
Storage Options |
124 |
|
|
Data Formats |
125 |
|
|
Resource Consumption Model for Synapse SQL |
125 |
|
|
Synapse SQL Best Practices |
126 |
|
|
Best Practices for Serverless Synapse SQL Pool |
127 |
|
|
Best Practices for Dedicated Synapse SQL Pool |
128 |
|
|
How-To’s |
129 |
|
|
Create a Dedicated Synapse SQL Pool |
129 |
|
|
Create a Serverless or On-Demand Synapse SQL Pool |
132 |
|
|
Load Data Using COPY Statement in Dedicated Synapse SQL Pool |
132 |
|
|
Ingest Data into Azure Data Lake Storage Gen2 |
133 |
|
|
Summary |
134 |
|
|
Chapter 6: Synapse Spark |
136 |
|
|
What Is Apache Spark? |
137 |
|
|
What Is Synapse Spark in Azure Synapse Analytics? |
139 |
|
|
Synapse Spark Features & Capabilities |
140 |
|
|
Speed |
140 |
|
|
Faster Start Time |
140 |
|
|
Ease of Creation |
140 |
|
|
Ease of Use |
141 |
|
|
Security |
141 |
|
|
Automatic Scalability |
141 |
|
|
Separation of Concerns |
142 |
|
|
Multiple Language Support |
142 |
|
|
Integration with IDEs |
142 |
|
|
Pre-loaded Libraries |
143 |
|
|
REST APIs |
143 |
|
|
Delta Lake and Its Importance in Synapse Spark |
144 |
|
|
Synapse Spark Job Optimization |
145 |
|
|
Data Format |
145 |
|
|
Memory Management |
146 |
|
|
Data Serialization |
146 |
|
|
Data Caching |
147 |
|
|
Data Abstraction |
147 |
|
|
Join and Shuffle Optimization |
148 |
|
|
Bucketing |
149 |
|
|
Hyperspace Indexing |
149 |
|
|
Synapse Spark Machine Learning |
149 |
|
|
Data Preparation and Exploration |
150 |
|
|
Build Machine Learning Models |
150 |
|
|
Train Machine Learning Models |
150 |
|
|
Model Deployment and Scoring |
151 |
|
|
How-To’s |
151 |
|
|
How to Create a Synapse Spark Pool |
151 |
|
|
How to Create and Submit Apache Spark Job Definition in Synapse Studio Using Python |
157 |
|
|
How to Monitor Synapse Spark Pools Using Synapse Studio |
163 |
|
|
Summary |
166 |
|
|
Chapter 7: Synapse Pipelines |
168 |
|
|
Overview of Azure Data Factory |
169 |
|
|
Overview of Synapse Pipelines |
171 |
|
|
Activities |
172 |
|
|
Pipelines |
173 |
|
|
Linked Services |
173 |
|
|
Dataset |
174 |
|
|
Integration Runtimes (IR) |
175 |
|
|
Azure Integration Runtime (Azure IR) |
175 |
|
|
Self-Hosted Integration Runtimes (SHIR) |
176 |
|
|
Azure SSIS Integration Runtimes (Azure SSIS IR) |
177 |
|
|
Control Flow |
177 |
|
|
Parameters |
178 |
|
|
Data Flow |
178 |
|
|
Data Movement Activities |
178 |
|
|
Category: Azure |
179 |
|
|
Category: Database |
180 |
|
|
Category: NoSQL |
181 |
|
|
Category: File |
181 |
|
|
Category: Generic |
182 |
|
|
Category: Services and Applications |
182 |
|
|
Data Transformation Activities |
184 |
|
|
Control Flow Activities |
185 |
|
|
Copy Pipeline Example |
186 |
|
|
Transformation Pipeline Example |
188 |
|
|
Pipeline Triggers |
189 |
|
|
Summary |
190 |
|
|
Chapter 8: Synapse Workspace and Studio |
192 |
|
|
What Is a Synapse Analytics Workspace? |
193 |
|
|
Synapse Analytics Workspace Components and Features |
194 |
|
|
Azure Data Lake Storage Gen2 Account and File System |
194 |
|
|
Serverless Synapse SQL Pool |
195 |
|
|
Shared Metadata Management |
195 |
|
|
Code Artifacts |
196 |
|
|
What Is Synapse Studio? |
197 |
|
|
Main Features of Synapse Studio |
199 |
|
|
Home Hub |
199 |
|
|
Data Hub |
199 |
|
|
Develop Hub |
200 |
|
|
Integrate Hub |
201 |
|
|
Monitor Hub |
202 |
|
|
Integration |
203 |
|
|
Activities |
204 |
|
|
Manage Hub |
204 |
|
|
Analytics Pools |
204 |
|
|
External Connections |
205 |
|
|
Integration |
205 |
|
|
Security |
206 |
|
|
Synapse Studio Capabilities |
206 |
|
|
Data Preparation |
206 |
|
|
Data Management |
207 |
|
|
Data Exploration |
207 |
|
|
Data Warehousing |
207 |
|
|
Data Visualization |
208 |
|
|
Machine Learning |
208 |
|
|
Power BI in Synapse Studio |
209 |
|
|
How-To’s |
210 |
|
|
How to Create or Provision a New Azure Synapse Analytics Workspace Using Azure Portal |
210 |
|
|
How to Launch Azure Synapse Studio |
212 |
|
|
How to Link Power BI with Azure Synapse Studio |
213 |
|
|
Summary |
215 |
|
|
Chapter 9: Synapse Link |
217 |
|
|
OLTP vs. OLAP |
218 |
|
|
What Is HTAP? |
219 |
|
|
Benefits of HTAP |
219 |
|
|
No-ETL Analytics |
219 |
|
|
Instant Insights |
220 |
|
|
Reduced Data Duplication |
220 |
|
|
Simplified Technical Architecture |
220 |
|
|
What Is Azure Synapse Link? |
221 |
|
|
Azure Cosmos DB |
222 |
|
|
Azure Cosmos DB Analytical Store |
222 |
|
|
Columnar Storage |
224 |
|
|
Decoupling of Operational Store |
224 |
|
|
Automatic Data Synchronization |
225 |
|
|
SQL API and MongoDB API |
225 |
|
|
Analytical TTL |
225 |
|
|
Automatic Schema Updates |
226 |
|
|
Cost-Effective Archiving |
226 |
|
|
Scalability |
227 |
|
|
When to Use Azure Synapse Link for Cosmos DB |
227 |
|
|
Azure Synapse Link Limitations |
228 |
|
|
Azure Synapse Link Use Cases |
229 |
|
|
Industrial IOT |
230 |
|
|
Predictive Maintenance Pipeline |
231 |
|
|
Operational Reporting |
231 |
|
|
Real-Time Applications |
232 |
|
|
Real-Time Personalization for E-Commerce Users |
232 |
|
|
How-To’s |
233 |
|
|
How to Enable Azure Synapse Link for Azure Cosmos DB |
233 |
|
|
How to Create an Azure Cosmos DB Container with Analytical Store Using Azure Portal |
235 |
|
|
How to Connect to Azure Synapse Link for Azure Cosmos DB Using Azure Portal |
236 |
|
|
Summary |
237 |
|
|
Chapter 10: Azure Synapse Analytics Use Cases and Reference Architecture |
240 |
|
|
Where Should You Use Azure Synapse Analytics? |
241 |
|
|
Large Volume of Data |
241 |
|
|
Disparate Sources of Data |
241 |
|
|
Data Transformation |
241 |
|
|
Batch or Streaming Data |
242 |
|
|
Where Should You Not Use Azure Synapse Analytics? |
242 |
|
|
Use Cases for Azure Synapse Analytics |
243 |
|
|
Financial Services |
243 |
|
|
Manufacturing |
244 |
|
|
Retail |
245 |
|
|
Healthcare |
245 |
|
|
Reference Architectures for Azure Synapse Analytics |
246 |
|
|
Modern Data Warehouse Architecture |
246 |
|
|
Real-Time Analytics on Big Data Architecture |
251 |
|
|
Summary |
254 |
|
|
Index |
257 |
|