Concepts of data analytics

Data analytics

  • Taking data and finding meaningful information and inferences from it.
  • Helps to identify an organization's strengths and weaknesses to support business decisions.
  • Data analytics requires that you capture the data you need and wrangle it into an appropriate format.
  • Wrangling is the process by which you transform and map raw data into a more useful format for analysis. It can involve writing code to capture, filter, clean, combine, and aggregate data from many sources.
  • Two important stages in data analytics: 1) data ingestion 2) data processing

Data Process

Data ingestion

  • Process of obtaining and importing data for immediate use or storage in a database.
  • The data can arrive as a continuous stream, or it may come in batches, depending on the source.
  • The purpose of the ingestion process is to capture this data and store it.
  • This raw data can be held in a repository such as a database management system, a set of files, or some other type of fast, easily accessible storage.
  • The ingestion process might also perform filtering (rejecting suspicious, corrupt, or duplicated data).
  • It might perform some transformations at this stage, converting data into a standard form for later processing (for example, reformatting all date and time data to use the same representation, and converting all measurement data to use the same units).
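The filtering and standardization described above can be sketched in a few lines of Python. The record shapes, date formats, and temperature fields are hypothetical; this only illustrates ingest-time filtering and unit conversion, not any Azure API.

```python
from datetime import datetime

# Hypothetical raw readings from two sources with different conventions.
raw_records = [
    {"ts": "2023-01-15", "temp_f": 68.0},   # Fahrenheit, ISO date
    {"ts": "15/01/2023", "temp_c": 20.0},   # Celsius, day/month/year
    {"ts": "not-a-date", "temp_c": 19.5},   # corrupt: rejected at ingestion
]

def ingest(record):
    """Reject corrupt rows; convert the rest to one standard form."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            ts = datetime.strptime(record["ts"], fmt)
            break
        except ValueError:
            continue
    else:
        return None  # suspicious or corrupt data is filtered out
    # Standardize all measurements to Celsius.
    temp_c = record["temp_c"] if "temp_c" in record else (record["temp_f"] - 32) * 5 / 9
    return {"ts": ts.date().isoformat(), "temp_c": round(temp_c, 1)}

clean = [r for r in map(ingest, raw_records) if r is not None]
print(clean)  # two standardized rows, both 2023-01-15 at 20.0 °C
```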

    Data Processing

  • The data processing stage occurs after the data has been ingested and collected.
  • Data processing takes the data in its raw form, cleans it, and converts it into a more meaningful format (tables, graphs, documents, and so on).

    Data cleaning

  • A generalized term that encompasses a range of actions, such as removing anomalies, and applying filters and transformations that would be too time-consuming to run during the ingestion stage.
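A minimal sketch of the kinds of actions cleaning can involve, assuming made-up sensor readings and an illustrative valid range:

```python
# Hypothetical sensor readings; 50-120 is an illustrative valid range.
readings = [105.2, 98.6, 98.6, -40.0, 101.3, 9999.0]

def clean(values, low=50.0, high=120.0):
    """Drop out-of-range anomalies, then remove duplicates, keeping order."""
    filtered = [v for v in values if low <= v <= high]
    seen, result = set(), []
    for v in filtered:
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result

cleaned = clean(readings)
print(cleaned)  # [105.2, 98.6, 101.3]
```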

Data Process Stages

ETL

  • ETL stands for Extract, Transform, and Load.
  • The raw data is retrieved and transformed before being saved.
  • The extract, transform, and load steps can run as a continuous pipeline of operations.
  • Suitable for basic data cleaning tasks, deduplicating data, and reformatting the contents of individual fields.

    ELT

  • ELT stands for Extract, Load, and Transform.
  • The data processing engine can take an iterative approach, retrieving and processing the data from storage, before writing the transformed data and models back to storage.
  • ELT is more suitable for constructing complex models that depend on multiple items in the database, often using periodic batch processing.
  • ELT is scalable -> suitable for the cloud.
  • ETL is stream oriented.
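The difference between the two approaches is purely the order of the same three steps. A toy Python sketch (the extract/transform/load functions are stand-ins, not a real pipeline API):

```python
# Toy stand-ins for the three steps; not a real pipeline API.
def extract():
    return [{"name": " Alice ", "spend": "120"}, {"name": "Bob", "spend": "80"}]

def transform(rows):
    return [{"name": r["name"].strip(), "spend": int(r["spend"])} for r in rows]

storage = []

def load(rows):
    storage.extend(rows)

# ETL: transform in flight, so only cleaned data ever reaches storage.
storage.clear()
load(transform(extract()))
etl_result = list(storage)

# ELT: land the raw data first, then transform it where it is stored,
# possibly revisiting it iteratively (e.g. periodic batch processing).
storage.clear()
load(extract())
storage[:] = transform(storage)
elt_result = list(storage)

print(etl_result == elt_result)  # True: same end state, different step order
```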


Azure Data Factory

Create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows, or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

<- Back to Main Menu

Describe concepts of relational data

Databases in early days

Problems:
  • Every application stored data in its own unique structure.
  • When developers wanted to build applications to use that data, they had to know a lot about the particular data structure to find the data they needed.
  • Data structures were inefficient, hard to maintain, and hard to optimize for delivering good application performance.
  • The relational database model was designed to solve the problem of multiple arbitrary data structures.
  • The relational model provided a standard way of representing and querying data that could be used by any application.
    Strengths:
  • Its use of tables: efficient, and flexible way to store and access structured information.
  • Relational databases are used to track inventories, process ecommerce transactions, manage huge amounts of mission-critical customer information, and much more.
  • A relational database is useful for storing any information containing related data elements that must be organized in a rules-based, consistent way.

Explore the characteristics of relational data

  • One of the main benefits of computer databases is that they make it easy to store information so it’s quick and easy to find.
  • An ecommerce system might use a database to record information about the products an organization sells, and the details of customers and the orders they’ve placed.
  • A relational database provides a model for storing the data, and a query capability that enables you to retrieve data quickly.

Understand the characteristics of relational data

  • Tables -> you model collections of entities from the real world.
  • An entity is described as a thing about which information needs to be known or held.
  • In the ecommerce example, you might create tables for customers, products, and orders.
  • A table contains rows, and each row represents a single instance of an entity.
  • The rows in a table have one or more columns that define the properties of the entity, such as the customer name, or product ID.
  • All rows in the same table have the same columns. Some columns are used to maintain relationships between tables.
  • You design a relational database by creating a data model.
  • The primary key (PK) indicates the column (or combination of columns) that uniquely identifies each row. Every table should have a primary key.
  • The data model also defines the relationships between the tables:
    • 1-to-many (1 customer -> many orders).
    • many-to-1 (many orders -> 1 customer).
  • A foreign key(FK) also helps to identify and prevent anomalies, such as orders for customers that don’t exist in the Customers table.
  • The main characteristics of a relational database :
    • All data is tabular.
    • All rows in the same table have the same set of columns.
    • A table can contain any number of rows.
    • A primary key(PK) uniquely identifies each row in a table. No two rows can share the same primary key.
    • A foreign key references rows in another, related table.
    • Structured Query Language (SQL)
    • Use SQL to create tables, insert, update, and delete rows in tables, and to query data.
    • Combine the data from multiple tables in a query using a join operation.
    • A join operation spans the relationships between tables, enabling you to retrieve the data from more than one table at a time.
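These characteristics can be demonstrated with SQLite from Python's standard library: a primary key, a foreign key that blocks orphaned orders, and a join that spans the 1-to-many relationship. Table and column names are illustrative.

```python
import sqlite3

# In-memory sketch of the ecommerce model: Customers (PK) and Orders (FK).
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
db.execute("""CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID),
    Total REAL)""")
db.execute("INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob')")
db.execute("INSERT INTO Orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)")

# A join spans the 1-to-many relationship: one customer, many orders.
rows = db.execute("""
    SELECT c.Name, COUNT(o.OrderID), SUM(o.Total)
    FROM Customers c JOIN Orders o ON o.CustomerID = c.CustomerID
    GROUP BY c.Name ORDER BY c.Name""").fetchall()
print(rows)  # [('Alice', 2, 65.0), ('Bob', 1, 15.0)]

# The FK prevents anomalies: an order for a non-existent customer fails.
fk_error = None
try:
    db.execute("INSERT INTO Orders VALUES (13, 99, 9.99)")
except sqlite3.IntegrityError as e:
    fk_error = e
print(fk_error)  # FOREIGN KEY constraint failed
```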
      Explore relational database use cases
  • Use a relational database any time you can easily model your data as a collection of tables with a fixed set of columns.
  • If you have a collection of music, video, or other media files, a relational database is a poor fit; use Azure Blob Storage instead.
  • Social networking sites use databases to store data about millions of users, each of whom can be linked to any number of other users in a highly complex web of relationships. This type of data lends itself more to a graph database structure rather than a collection of relational tables.
  • OLTP applications are focused on transaction-oriented tasks that process a very large number of transactions per minute. Relational databases are well suited for OLTP applications because they naturally support insert, update, and delete operations.
  • Examples of OLTP applications that use relational databases are:
    • Banking solutions
    • Online retail applications
    • Flight reservation systems
    • Many online purchasing applications.
      Explore relational data structures
  • Index -> An index helps you search for data in a table.
    • When the user runs a query that specifies this column in the WHERE clause, the database management system can use this index to fetch the data more quickly than if it had to scan through the entire table row by row.
    • An index might consume additional storage space, and each time you insert, update, or delete data in a table, the indexes for that table must be maintained.
    • This additional work can slow down insert, update, and delete operations, and incur additional processing charges.
    • When deciding which indexes to create, you must strike a balance between having indexes that speed up your queries versus the cost of performing other operations.
    • In a table that is read only, or that contains data that is modified infrequently, more indexes will improve query performance.
    • If a table is queried infrequently, but subject to a large number of inserts, updates, and deletes (such as a table involved in OLTP), then creating indexes on that table can slow your system down.
    • A clustered index physically reorganizes a table by the index key.
  • View -> A view is a virtual table based on the result set of a query.
  • Think of a view as a window on specified rows in an underlying table.
  • You can query the view and filter the data in much the same way as a table.
  • A view can also join tables together.
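Both structures can be tried out with SQLite from Python's standard library. The table names and 1,000-row dataset are illustrative, and the exact EXPLAIN QUERY PLAN wording varies slightly between SQLite versions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, Status TEXT, Total REAL)")
db.executemany("INSERT INTO Orders VALUES (?, ?, ?)",
               [(i, "open" if i % 2 else "shipped", float(i)) for i in range(1, 1001)])

# Index: without one, a WHERE on Status scans every row; with one, the
# engine can seek directly (EXPLAIN QUERY PLAN shows SCAN vs SEARCH).
plan_before = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Orders WHERE Status = 'open'").fetchone()
db.execute("CREATE INDEX idx_orders_status ON Orders(Status)")
plan_after = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Orders WHERE Status = 'open'").fetchone()
print(plan_before[3], "->", plan_after[3])

# View: a virtual table based on a query, filterable like any table.
db.execute("CREATE VIEW OpenOrders AS "
           "SELECT OrderID, Total FROM Orders WHERE Status = 'open'")
big_open = db.execute("SELECT COUNT(*) FROM OpenOrders WHERE Total > 500").fetchone()[0]
print(big_open)  # 250 open orders with Total above 500
```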

    Choose the right platform for a relational workload

    On-Premises Databases
  • Maintaining the hardware and software, applying patches, backing up databases, and restoring them is the organization's responsibility.
  • Scalability is also a concern.
  • Scaling up usually means buying and installing new hardware, a formidable task during which the data may have to go offline.
    Cloud
  • Many of these operations can be handled for you by the data center staff, in many cases with no (or minimal) downtime.
  • You're free to focus on the data itself; the management concerns are taken care of by others (for Azure fees).
  • No capital expenses, data can be backed up regularly, and companies only have to pay for the resources they use.
  • Organizations that plan aggressive expansion on a global basis can connect with customers, partners, and other businesses anywhere with minimal effort.
  • Instant provisioning: because everything is already configured, setup time is eliminated and users can access the application immediately.

Understand IaaS and PaaS

Infrastructure-as-a-Service(IaaS)
  • Create a virtual infrastructure in the cloud that mirrors the way an on-premises data center might work
  • Create a set of virtual machines, connect them together using a virtual network, and add a range of virtual devices.
  • Companies are responsible for many of the day-to-day operations, such as installing and configuring the software, patching, taking backups, and restoring data when needed.
  • The IaaS approach is best for migrations and applications requiring operating system-level access. SQL virtual machines are lift-and-shift (copy your on-premises solution directly to a virtual machine in the cloud).
    Platform-as-a-service(PaaS)
  • Rather than creating a virtual infrastructure, and installing and managing the database software yourself, a PaaS solution does this for you.
  • Azure automatically creates the necessary virtual machines, networks, and other devices for you.
  • PaaS solutions for relational databases, include Azure SQL Database, Azure Database for PostgreSQL, Azure Database for MySQL, and Azure Database for MariaDB.
  • Connect to them, create your databases, and upload your data.

benefits and tradeoffs when running a database management system

<- Back to Main Menu

Describe concepts of non-relational data

Introduction

  • Video, audio, images, temporal information, large volumes of free text, encrypted information etc.
Azure Table storage
  • NoSQL key-value model -> semi-structured data.
  • The data for an item is stored as a set of fields, and the item is identified by a unique key.
  • Items are referred to as rows, and fields are known as columns.
  • No concept of relationships, stored procedures, secondary indexes, or foreign keys.
  • Denormalized, with each row holding the entire data for a logical entity.
  • Faster access -> NO Joins
  • Partitioning is a mechanism for grouping related rows, based on a common property -> partition key.
  • Partition advantages -> 1. Improve scalability and performance. 2. Partitions can grow and shrink as rows are added and removed. 3. Improve search performance by reducing the amount of I/O (reads and writes) needed to locate the data.
  • Each row has 2 key elements: 1) the partition key that identifies the partition containing the row, and 2) a row key that is unique to each row in the same partition.

  • Two query types: Point queries that identify a single row, and Range queries that fetch a contiguous block of rows in a partition.
  • The partition key and row key effectively define a clustered index over the data.
  • The columns in a table can hold numeric, string, or binary data up to 64 KB in size. A table can have up to 252 columns, apart from the partition and row keys. The maximum row size is 1 MB.
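A toy in-memory model of partition-key/row-key addressing, point queries, and range queries (plain Python dictionaries, not the real Azure Table storage SDK):

```python
# Toy model of partition-key + row-key addressing (not the Azure SDK).
table = {}  # partition key -> {row key -> entity}

def upsert(partition_key, row_key, entity):
    table.setdefault(partition_key, {})[row_key] = entity

def point_query(partition_key, row_key):
    """Fetch a single row: touches exactly one partition."""
    return table[partition_key][row_key]

def range_query(partition_key, start, end):
    """Fetch a contiguous block of rows within one partition."""
    rows = table.get(partition_key, {})
    return [rows[k] for k in sorted(rows) if start <= k <= end]

upsert("UK", "C001", {"name": "Alice"})
upsert("UK", "C002", {"name": "Bob"})
upsert("US", "C001", {"name": "Carol"})  # same row key, different partition

p = point_query("US", "C001")
r = range_query("UK", "C001", "C999")
print(p)  # {'name': 'Carol'}
print(r)  # [{'name': 'Alice'}, {'name': 'Bob'}]
```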
    Advantages
  • It’s simpler to scale. An Azure storage account can hold up to 5 PB of data.
  • Holds semi-structured data
  • No need to map and maintain the complex relationships(no normalization).
  • Row insertion is fast
  • Data retrieval is fast(the partition and row keys pair for queries).
Disadvantages:
  • Consistency isn’t guaranteed
  • No referential integrity
  • Difficult to filter and sort on non-key data.
Excellent for:
  • Storing TBs of structured data capable of serving web scale applications.
  • Storing datasets that don’t require complex joins, foreign keys, or stored procedures, and that can be denormalized for fast access.
  • Capturing event logging and performance monitoring data.
Availability and security:
  • Provides high-availability guarantees in a single region.
  • Each table is replicated three times within an Azure region.
  • Create tables in geo-redundant storage -> additional costs.
  • Azure will transparently switch to a working replica while the failed replica is recovered.
  • Helps to protect your data(configure security and role-based access control)
Azure Blob storage
  • Store massive amounts of unstructured data, or blobs, in the cloud.
  • Create blobs using an Azure storage account.
  • Block blobs -> 1. A block blob is handled as a set of blocks, each up to 100 MB. 2. Up to 50,000 blocks per blob. 3. Max size of about 4.7 TB. 4. A block is the smallest amount of data that can be read or written as an individual unit. 5. Best used to store discrete, large, binary objects that change infrequently.
  • Page blobs -> 1. collection of fixed size 512-byte pages. 2. support random read and write operations 3. hold up to 8 TB of data. 4. implement virtual disk storage for virtual machines.
  • Append blobs -> 1. Support append operations. 2. You can only add blocks to the end of an append blob; updating or deleting existing blocks isn't supported. 3. Each block can vary in size, up to 4 MB. 4. The maximum size of an append blob is just over 195 GB.
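The block blob maximum quoted above follows directly from the block limits; a quick sanity check:

```python
# Sanity-check: 50,000 blocks of up to 100 MB each give the ~4.7 TB maximum.
MB = 1024 ** 2
TB = 1024 ** 4
max_block_blob_tb = round(50_000 * 100 * MB / TB, 2)
print(max_block_blob_tb)  # 4.77
```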

  • Azure storage account -> create blobs -> containers.
  • A container provides a convenient way of grouping related blobs together, and you can organize blobs in a hierarchy of folders, similar to files in a file system on disk.
  • Control -> read/write access to blobs inside containers.
  • 3 access tiers:
    • Hot tier -> 1. default 2. frequently accesses 3. high-performance media.
    • Cool tier ->1. lower performance 2. accessed infrequently. 3. create the blob in the Hot tier, but migrate it to the Cool tier later and vice-versa.
    • Archive tier -> 1. lowest storage cost, but with increased latency. 2. Can take hours for the data to become available. 3. An archived blob must be rehydrated (moved back to the Hot or Cool tier) before it can be read.
  • A lifecycle management policy can automatically move a blob from Hot to Cool, and then to the Archive tier, can also arrange to delete outdated blobs.
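A hypothetical sketch of the tiering decision such a lifecycle policy automates. The 30- and 180-day thresholds are illustrative, not Azure defaults, and a real policy is defined declaratively on the storage account rather than in code:

```python
# Illustrative thresholds only, not Azure defaults.
def choose_tier(days_since_last_modified):
    if days_since_last_modified >= 180:
        return "Archive"  # lowest cost; hours to rehydrate before reading
    if days_since_last_modified >= 30:
        return "Cool"     # lower cost, lower performance
    return "Hot"          # default tier for frequently accessed blobs

tiers = [choose_tier(d) for d in (5, 45, 200)]
print(tiers)  # ['Hot', 'Cool', 'Archive']
```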
Use Case and Management
  • static website
  • Storing files for distributed access
  • Streaming video and audio
  • Storing data for backup and restore, disaster recovery, and archiving
  • Storing data for analysis by an on-premises or Azure-hosted service
Azure Blob storage include:
  • Versioning. You can maintain and restore earlier versions of a blob.

  • Soft delete. This feature enables you to recover a blob that has been removed or overwritten, by accident or otherwise.

  • Snapshots. A snapshot is a read-only version of a blob at a particular point in time.

  • Change Feed. The change feed for a blob provides an ordered, read-only record of the updates made to a blob. You can use the change feed to monitor these changes, and perform operations such as:

    • Update a secondary index, synchronize with a cache, search-engine, or any other content-management scenarios.
    • Extract business analytics insights and metrics, based on changes that occur to your objects, either in a streaming manner or batched mode.
    • Store, audit, and analyze changes to your objects, over any period of time, for security, compliance or intelligence for enterprise data management.
    • Build solutions to back up, mirror, or replicate object state in your account for disaster management or compliance.
    • Build connected application pipelines that react to change events or schedule executions based on created or changed objects.

      Azure File Storage

  • Azure File Storage enables you to create file shares in the cloud, and access these file shares from anywhere with an internet connection.
  • Azure File Storage exposes file shares using the Server Message Block 3.0 (SMB) protocol.
  • Control access to shares in Azure File Storage using authentication and authorization services available through Azure Active Directory Domain Services.
  • Share up to 100 TB of data in a single storage account.
  • Data can be distributed across any number of file shares in the account.
  • Maximum size of a single file is 1 TB.
  • Supports up to 2000 concurrent connections per file share.
  • Upload files to Azure File Storage using 1) the Azure portal, or 2) tools such as the AzCopy utility.
  • Azure File Sync service to synchronize locally cached copies of shared files with the data in Azure File Storage.
  • Two performance tiers: 1) Standard tier uses hard disk-based hardware in a datacenter, and the Premium tier uses solid-state disks. 2) Premium tier offers greater throughput, but is charged at a higher rate.

Use cases and management benefits of using Azure File Storage

  • Migrate existing applications to the cloud.
  • Share server data across on-premises and cloud.
  • Integrate modern applications with Azure File Storage.
  • Simplify hosting High Availability (HA) workload data.

Important Points

  • Don’t use Azure File Storage for files that can be written by multiple concurrent processes simultaneously. Multiple writers require careful synchronization, otherwise the changes made by one process can be overwritten by another.
  • The alternative solution is to lock the file as it is written, and then release the lock when the write operation is complete. However, this approach can severely impact concurrency and limit performance.

  • A fully managed service.
  • Shared data is replicated locally within a region, but can also be geo-replicated to a second region.
  • 300 MB/second of throughput for a single Standard file share, but you can increase throughput capacity by creating a Premium file share, for additional cost.
  • All data is encrypted at rest.
  • Can enable encryption for data in-transit between Azure File Storage and your applications.

    Azure Cosmos DB

  • NoSQL -> documents, graphs, key-value stores, and column family stores.
  • Azure Cosmos DB is a multi-model NoSQL database management system. Cosmos DB manages data as a partitioned set of documents. A document is a collection of fields, identified by a key. The fields in each document can vary, and a field can contain child documents.
  • Many document databases use JSON (JavaScript Object Notation) to represent the document structure.
  • A document can hold up to 2 MB of data, including small binary objects.
  • If you need to store larger blobs as part of a document, use Azure Blob storage, and add a reference to the blob in the document.
  • Cosmos DB provides APIs(Application Programming Interface) to access these documents using a set of well-known interfaces.
    1. SQL API -> relational database
    2. Table API -> key-value store
    3. MongoDB API -> document database
    4. Cassandra API -> column family database management system
    5. Gremlin API -> graph database
  • In a Cosmos DB database, the documents in a container are grouped together into partitions.
  • A partition holds a set of documents that share a common partition key.
  • Partitioning helps to reduce the amount of I/O (disk reads) that queries might need to perform when retrieving a set of documents for a given entity.
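A toy illustration of grouping documents into partitions by hashing a partition key. Cosmos DB's real partitioning scheme is internal; this only shows why documents with the same key land together:

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative; real partition management is internal

def partition_for(partition_key):
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

docs = [
    {"id": "1", "customer": "alice", "total": 20},
    {"id": "2", "customer": "alice", "total": 35},
    {"id": "3", "customer": "bob", "total": 10},
]

partitions = {}
for doc in docs:
    partitions.setdefault(partition_for(doc["customer"]), []).append(doc)

# All of alice's documents share a partition key, so retrieving them reads
# from a single partition rather than fanning out across all of them.
alice_ids = [d["id"] for d in partitions[partition_for("alice")]
             if d["customer"] == "alice"]
print(alice_ids)  # ['1', '2']
```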

Use cases and management benefits

  • Highly scalable DBMS
  • Automatically allocates space in a container for your partitions, and each partition can grow up to 10 GB in size.
  • Indexes are created and maintained automatically.
  • Virtually no administrative overhead
  • To ensure availability, all databases are replicated within a single region.
  • Replication is transparent, and failover from a failed replica is automatic.
  • Guarantees 99.99% high availability.
  • Replicate anywhere in the world (additional cost).
  • All replicas are synchronized.
  • Types of Consistency:
    • Strong -> Reads are guaranteed to return the most recent committed version of an item (linearizability).
    • Bounded staleness -> The reads are guaranteed to honor the consistent-prefix guarantee. The “staleness” can be configured in two ways: 1) the number of versions (K) of the item, and 2) the time interval (T) reads might lag behind the writes.
    • Session -> a single client session reads are guaranteed to honor the consistent-prefix, monotonic reads, monotonic writes, read-your-writes, and write-follows-reads guarantees.
    • Consistent prefix -> updates that are returned contain some prefix of all the updates, with no gaps. Consistent prefix consistency level guarantees that reads never see out-of-order writes.
    • Eventual-> 1. There’s no ordering guarantee for reads. In the absence of any further writes, the replicas eventually converge. 2. Weakest form of Consistency.
  • You can find out the probability that your clients get strong and consistent reads for your workloads by looking at the Probabilistically Bounded Staleness (PBS) metric.

    Scenarios

  • IoT and telematics
  • Retail and marketing.
  • Gaming.
  • Web and mobile applications.

    Azure Files

  • fully managed file shares in the cloud that are accessible via the industry standard Server Message Block (SMB) protocol or Network File System (NFS) protocol.

    Why Azure Files is useful

  • Replace or supplement on-premises file servers
  • “Lift and shift” applications
  • Simplify cloud development
  • Diagnostic share
  • Dev/Test/Debug
  • Containerization

    Key benefits

  • Shared access
  • Fully managed
  • Scripting and tooling
  • Resiliency
  • Familiar programmability

<- Back to Main Menu

Description of normalization

Normalization:

  • The process of organizing data in a database.
  • It involves creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.
  • Redundant data -> wastes disk space and creates maintenance problems.
  • Problem of redundancy : If data that exists in more than one place must be changed, the data must be changed in exactly the same way in all locations.
  • Inconsistent dependency: can make data difficult to access because the path to find the data may be missing or broken.

Rules of Normalization

1NF
  1. Eliminate repeating groups in individual tables.
  2. Create a separate table for each set of related data.
  3. Identify each set of related data with a primary key.
2NF
  1. Create separate tables for sets of values that apply to multiple records.
  2. Relate these tables with a foreign key.
3NF
  1. Eliminate fields that do not depend on the key.
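The rules above can be applied to a small denormalized dataset using SQLite from Python's standard library. Table and column names are illustrative:

```python
import sqlite3

# A denormalized table repeats customer details on every order row.
flat = [
    ("Alice", "alice@example.com", "Widget", 2),
    ("Alice", "alice@example.com", "Gadget", 1),
    ("Bob",   "bob@example.com",   "Widget", 5),
]

db = sqlite3.connect(":memory:")
# Customer attributes depend only on the customer key, so they move to
# their own table (2NF/3NF); Orders keeps a foreign key back to it.
db.execute("CREATE TABLE Customers "
           "(CustomerID INTEGER PRIMARY KEY, Name TEXT UNIQUE, Email TEXT)")
db.execute("""CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    CustomerID INTEGER REFERENCES Customers(CustomerID),
    Product TEXT, Qty INTEGER)""")

for name, email, product, qty in flat:
    db.execute("INSERT OR IGNORE INTO Customers (Name, Email) VALUES (?, ?)",
               (name, email))
    cid = db.execute("SELECT CustomerID FROM Customers WHERE Name = ?",
                     (name,)).fetchone()[0]
    db.execute("INSERT INTO Orders (CustomerID, Product, Qty) VALUES (?, ?, ?)",
               (cid, product, qty))

# Each email is now stored exactly once, so updating it can't be inconsistent.
customer_count = db.execute("SELECT COUNT(*) FROM Customers").fetchone()[0]
order_count = db.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
print(customer_count, order_count)  # 2 3
```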

Normalization Example

AWS Cloud Practitioner Notes

1. The Cloud
2. Understanding Your AWS Account
3. Getting Support on AWS
4. Understanding the AWS Environment
5. Securing Your AWS Resources
6. Working with Your AWS Resources
7. The Core Compute Services
8. The Core Storage Services
9. The Core Database Services
10. The Core Networking Services
11. Automating Your AWS Workloads
12. Common Use-Case Scenarios