Modern Data Warehousing

28 Sep 2021

Used as the source for analysis, reporting, and online analytical processing (OLAP).
Data warehouses have to handle “Big Data”.
Big data is the term used for large quantities of data collected in escalating volumes, at higher velocities, and in a greater variety of formats.
Historical (meaning stored) or real time (meaning streamed from the source).
Modern data warehouse might contain a mixture of relational and non-relational data, including files, social media streams, and Internet of Things (IoT) sensor data.
Data warehousing solution -> Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, and Azure Analysis Services.
Power BI -> analyze and visualize the data, generating reports, charts, and dashboards.
Combination of up-to-the-second data, and historical information.

A data integration service.
Purpose : retrieve data from one or more data sources, and convert it into a format that you process.
Enables you to extract the interesting data, and discard the rest.
Azure Data Factory as a pipeline of operations.
A pipeline can run continuously, as data is received from the various data sources.
Azure Data Lake Storage
A repository for large quantities of raw data.
Data lake as a staging point for your ingested data, before it’s massaged and converted into a format suitable for performing analytics.
A data warehouse also stores large quantities of data, but the data in a warehouse has been processed to convert it into a format for efficient analysis.
A data lake holds raw data, but a data warehouse holds structured information.
Azure Data Lake Storage is essentially an extension of Azure Blob storage, organized as a near-infinite file system.
Charateristics of Data Lake Storage:
Organizes your files into directories and subdirectories for improved file organization.
Supports the Portable Operating System Interface (POSIX) file and directory permissions to enable granular Role-Based Access Control (RBAC) on your data.
Compatible with the Hadoop Distributed File System (HDFS).

Azure Databricks is an Apache Spark environment running on Azure to provide big data processing, streaming, and machine leark consumes large amounts of dataning.
Apache Spark efficient data processing engine that can consume and process large amounts of data very quickly.
Spark supports ->R,Python,Scala
Spark code using notebooks . notebooks contains cells, each contains a seperate block of code.

An analytics engine
Azure Synapse Analytics leverages a massively parallel processing (MPP) architecture.
The Control node is the brain of the architecture.
The Compute nodes provide the computational power.
Azure Synapse Analytics supports two computational models: SQL pools and Spark pools.
In a SQL pool, each compute node uses an Azure SQL Database and Azure Storage to handle a portion of the data.
submit queries -> Transact-SQL statements -> Azure Synapse Analytics runs them
PolyBase enables you to retrieve data from relational and non-relational sources, such as delimited text files, Azure Blob Storage, and Azure Data Lake Storage.
You can only scale a SQL pool when it’s not running a Transact-SQL query.
Spark pool, the nodes are replaced with a Spark cluster
Spark is optimized for in-memory processing. A Spark job can load and cache data into memory and query it repeatedly. In-memory computing is much faster than disk-based applications, but requires additional memory resources.
Autoscaling can occur while processing is active.
Azure Analysis Services
Build tabular models to support online analytical processing (OLAP) queries.

Azure HDInsight is a big data processing service.
HDInsight implements a clustered model that distributes processing across a set of computers.
This model is similar to that used by Synapse Analytics, except that the nodes are running the Spark processing engine rather than Azure SQL Database.
Hadoop is an open source framework that breaks large data processing problems down into smaller chunks and distributes them across a cluster of servers, similar to the way in which Synapse Analytics operates.
Hive is a SQL-like query facility that you can use with an HDInsight cluster to examine data held in a variety of formats. You can use it to create, load, and query external tables, in a manner similar to PolyBase for Azure Synapse Analytics

Trail Blazed