Batch Processing – Intelligent Secure Data (ISD) CO. LTD

Course Content

Introduction to Big Data and Data Engineering – part1

Big Data needs Data Engineering because raw data is too large, too fast-moving, and too messy to process or use directly. Data Engineering solves this by building scalable (scale-out) systems that collect, store, and process data efficiently.
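
The scale-out idea can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the word-count workload and the names partition, count_words, and scale_out_count are hypothetical, and threads on one machine stand in for the separate machines a real cluster would use.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_workers):
    """Split records into num_workers buckets using a stable hash of each record."""
    parts = [[] for _ in range(num_workers)]
    for record in records:
        idx = zlib.crc32(record.encode("utf-8")) % num_workers
        parts[idx].append(record)
    return parts


def count_words(records):
    """Work done independently by one worker on its own partition."""
    return Counter(records)


def scale_out_count(records, num_workers=3):
    """Partition the data, process partitions in parallel, then merge partial results."""
    parts = partition(records, num_workers)
    # Threads are an assumption standing in for separate worker machines.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_words, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Adding capacity in this model means raising num_workers (more machines), whereas a scale-up approach would keep one worker and give it a bigger machine.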

Why Does Big Data Need Data Engineering?

Scale-Up Solution

Scale-Out Solution

Why Data Engineering Became Necessary

How Data Engineering Solves This Problem

Introduction to Big Data and Data Engineering – part2

Data keeps growing because of social media, IoT, and digital systems. Big Data is large, fast, and diverse data characterized by the 5 Vs (Volume, Velocity, Variety, Veracity, Value). It is handled with Batch Processing for large historical datasets and Real-Time Analytics for instant insights and fast decision-making.
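
The contrast between the two processing styles can be sketched in a few lines of Python. This is a toy illustration, not a production design: the EVENTS data, the batch_totals function, and the RunningTotals class are hypothetical names introduced here.

```python
from collections import defaultdict

# Hypothetical (key, value) events, e.g. readings from IoT sensors.
EVENTS = [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 4)]


def batch_totals(events):
    """Batch style: the full historical dataset is available and processed in one pass."""
    totals = defaultdict(int)
    for key, value in events:
        totals[key] += value
    return dict(totals)


class RunningTotals:
    """Real-time style: state is updated per event and a result is available immediately."""

    def __init__(self):
        self._totals = defaultdict(int)

    def on_event(self, key, value):
        self._totals[key] += value
        return self._totals[key]  # up-to-date total for this key
```

The batch function answers only after all events have been collected, while RunningTotals.on_event returns a fresh answer after every single event.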

Data continues to grow

Define Big Data

Types of Data in Big Data

Where Big Data Comes From

Big Data Characteristics

How to deal with Big Data

Batch Processing

Real-Time Analytics

Introduction to Big Data and Data Engineering – part3

Big Data brings challenges such as huge volume, variety, and processing complexity, so systems like OLTP databases, Data Warehouses, Data Lakes, and Lakehouses are used to manage it. ETL transforms data before loading it into the target system, while ELT loads the raw data first and then transforms it inside the target system, which suits modern big data processing.
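
The difference between ETL and ELT can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under assumptions: SQLite stands in for a real warehouse, and the RAW_ROWS data, the table names, and the functions clean, run_etl, and run_elt are hypothetical.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ROWS = [("  Alice ", "120.50"), ("BOB", "80"), ("carol", "not-a-number")]


def clean(name, amount):
    """Transform step: trim and title-case the name, parse the amount."""
    try:
        return name.strip().title(), float(amount)
    except ValueError:
        return None  # reject rows whose amount cannot be parsed


def run_etl(conn):
    """ETL: transform in application code, then load only the clean rows."""
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount REAL)")
    cleaned = [row for row in (clean(n, a) for n, a in RAW_ROWS) if row is not None]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)
    return conn.execute("SELECT name, amount FROM sales_etl ORDER BY name").fetchall()


def run_elt(conn):
    """ELT: load the raw rows as-is, then transform inside the database with SQL."""
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        """
        CREATE TABLE sales_elt AS
        SELECT UPPER(SUBSTR(TRIM(name), 1, 1)) || LOWER(SUBSTR(TRIM(name), 2)) AS name,
               CAST(amount AS REAL) AS amount
        FROM sales_raw
        WHERE amount GLOB '[0-9]*'  -- simplistic check: keep amounts starting with a digit
        """
    )
    return conn.execute("SELECT name, amount FROM sales_elt ORDER BY name").fetchall()
```

Both functions produce the same cleaned table; what differs is where the transformation runs. In run_elt the untouched raw table stays available in the database, so the data can be re-transformed later without going back to the source.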

Big Data Challenges

Schema-on-write

Understand the role of network infrastructure in big data

Introduction to Big Data and Data Engineering – part4

Introduction to Distributed Systems in Big Data

Why Big Data Needs Distributed Systems

Basics of Distributed Computing

How Distributed Systems Work

Key Concepts in Distributed Systems-part1

Key Concepts in Distributed Systems-part2

Role in Big Data Architecture

Single Machine vs Distributed Systems

Fault Tolerance

Parallel Processing

Data Distribution

Data Engineering with SQL & Python

Install Anaconda

Jupyter Notebook

Data Structures

Comparison Operators

Conditional Statements: if, elif, else

Loops: for Loop and while Loop

Introduction to pandas

Create DataFrame

Working with columns

Working with Rows

Working with Files

View the first few rows of a DataFrame

View the last few rows of a DataFrame

Get summary of the DataFrame

Generate summary statistics for numerical columns

Get the number of rows and columns in a DataFrame

View the column names of the DataFrame

Access the index of the DataFrame

Check the data types of each column

Check for missing values in the DataFrame

Remove rows with missing values

Rename columns or rows in the DataFrame

Sort the DataFrame

Group data by one or more columns to perform aggregation

Merge two DataFrames

Create a pivot table for summarizing data

Handling Missing Values

Data Consistency

Data Transformation and Normalization

Validating Data Types

Fixing Data Entry Errors

Consistency in Categorical Data

Standardizing Data Formats

Data Type Conversion

Data Validation and Verification

Handling Categorical Variables

Project 1 : Bank Transactions

Project 2 : Customer Purchases

Project 3 : Payroll Analysis

Introduction to Database

Relational Database

Primary Keys and Foreign Keys

Relationships between Tables

Types of SQL Commands

DDL – Data Definition Language

DQL – Data Query Language

DML – Data Manipulation Language

DCL – Data Control Language

TCL – Transaction Control Language

SQL Server Management Studio (SSMS)

Install SQL Server Developer and SQL Server Management Studio (SSMS)

Transportation database

SQL Server Data Types

AND – OR conditions

Introduction to Joins

Aggregate Functions and GROUP BY

String Functions

Conversion Functions

Noncorrelated Subqueries

Correlated Subqueries

Window Functions

Common Table Expressions (CTEs)

Project : Transportation Analysis

Hadoop Production Deployment & Cluster Setup

Why do we develop something

Challenges in Data Processing Before Hadoop

Introduction to Hadoop

Hadoop Ecosystem Overview

Hadoop Architecture-part1

Hadoop Architecture-part2

Hadoop Architecture-part3

Hadoop Architecture-part4

Hadoop Architecture-part5

HDFS Architecture – NameNode

HDFS Architecture – DataNode

Read Operation and Write Operation

Block and Replication

Secondary NameNode Role

HDFS Block Storage

Fault Tolerance in HDFS

Data Locality Concept

Hadoop Installation (VirtualBox) + Services + HDFS Commands Practice

Start Hadoop Services

Stop Hadoop Services

NameNode UI (HDFS Web Interface)

Basic Navigation Commands

File Upload & Download Commands

File Viewing Commands

File & Directory Management

System & Advanced Commands

HDFS High Availability & Rack Awareness Architecture

Apache ZooKeeper: Architecture, Coordination & Distributed Leadership in Hadoop

Project 1 : HDFS Small Files Optimization

Project 2 : HDFS Block Size & Replication

Project 3 : HDFS File Format & Compression Optimization

Enterprise Data Engineering with Apache Spark

Challenges in Big Data Before Apache Spark and Understanding Apache Spark

Differences Between Spark 2.x and Spark 1.x

Apache Spark 2.x Architecture

Fault Tolerance & Scalability: Hadoop vs Spark

Spark UI – Test Spark Job

Stage details – 0

Stage details – 1

Stage details – 2

Stage details – 3

Stage details – 4

Environment Tab

SQL Tab and SQL metrics

Structured Streaming Tab

JDBC/ODBC Server Tab

Shuffle read and Shuffle write in Spark

PySpark RDD API and RDD (Resilient Distributed Dataset)

RDD Transformations

Run (Spark Job Execution Flow)

DAG (Directed Acyclic Graph)

Stages and Tasks

Partitioning Strategy

Caching & Persisting

Performance Tuning Basics

PySpark DataFrame API

Schema Inference

Schema Enforcement

Column Operations

Filtering & Aggregation

Normal join – Broadcast join – production problem

Handling Null Values

UDF (User Defined Functions)

Window Functions in PySpark

PySpark SQL and Spark SQL

Spark Streaming and Structured Streaming

Streaming Sources (Kafka, Files)

Window Operations

Late data handling (watermark)

Stateful Streaming + Final Streaming Pipeline Architecture

Memory Management

Executor Memory Structure

JVM Heap Memory in Spark

Memory for Execution vs Storage

Unified Memory Management

Off-Heap Memory

Garbage Collection (GC) Impact

Partition Tuning

Reduce Shuffle Operations

Broadcast Join Optimization

Avoid Wide Transformations

Data Skew Handling

File Format Optimization (Parquet, ORC)

Predicate Pushdown

Project 1: DataFrame API Performance Optimization Project

Project 2: Spark SQL Query Optimization Project

Project 3: Spark Memory Tuning & Resource Optimization

Project 4: Shuffle Handling & Optimization in Spark

Project 5: Data Skew Detection & Mitigation

Project 6: Broadcast Join Optimization

Introduction to Hive and Sqoop

Introduction to Hive

Hive Architecture

Hive Data Model

Hive Query Language (HQL)

Data Types in Hive

Install VirtualBox on Windows

Install PuTTY on Windows

Install WinSCP on Windows

Install Cloudera and Setting Up Hadoop

Hive Installation

Create Database

Creating and Managing Tables in Hive

Loading Data into Hive Tables

Managed Tables in Hive

External Tables in Hive

Hive Queries – HQL

Partitioning in Hive

Static Partitioning

Dynamic Partitioning

Hive Optimization Techniques

Sqoop Architecture

Key Features of Sqoop

Sqoop Connectors

Sqoop Commands Overview

Sqoop Installation

Importing Data from RDBMS to HDFS

Exporting Data from HDFS to RDBMS

Adding More Mappers to a Sqoop Job

Handling Portions of Data with Sqoop

Incremental Data Import in Sqoop

Data Compression with Sqoop

SequenceFile format

Create a Sqoop Job

Sqoop Performance Optimization

Kafka: From Zero to Production

Starting the Kafka Journey – part1

Starting the Kafka Journey – part2

Starting the Kafka Journey – part3

Starting the Kafka Journey – part4

Starting the Kafka Journey – part5

Starting the Kafka Journey – part6

Event Streaming Concept

Kafka vs traditional messaging systems

Kafka ecosystem overview

Real-time Data Pipelines

Log Aggregation

Streaming Analytics

Event-driven Microservices

CDC (Change Data Capture)

Distributed event streaming platform

Publish/subscribe messaging

Durable commit log

Replayable events

High throughput architecture

Horizontal scalability

ZooKeeper Mode and KRaft Mode

Lab 1 : Install Kafka

Lab 1 : Start broker

Lab 1 : Create topic

Lab 1 : Produce & consume messages

Lab 1 : Inspect logs

Kafka Architecture Overview

Kafka Log Structure

Event flow (routing)

Append-only log design

Partition-based scalability

Offset indexing

Distributed storage model

Lab 2 : Create multi-partition topics

Lab 2 : Observe partition distribution

Lab 2 : Test message ordering

Lab 2 : Send keyed messages

Lab 2 : Explore broker storage directories

Broker Internals

Partitions and Replication Factor

Leader/Follower Model

ISR (In-Sync Replicas)

Leader Election

Fault Tolerance

Ordering Guarantees

Durability Model

Data Consistency Model

Throughput vs Latency

Retention Policies

Replicated partitions

Automatic failover

High durability storage

Segment-based logs

Time-based retention

Size-based retention

Lab 3 : Kafka Replication, ISR & Broker Failure

Consumer API and Consumer Groups

Offset Commit Strategies

Delivery Semantics

Idempotent Producer

Message Keys and Partitioning Logic

Batching and Compression

Retry Mechanisms – Error Handling – Dead Letter Queue (DLQ)

Parallel consumption

Offset management

Duplicate prevention

Exactly-once semantics

Lab 4 : Kafka Consumer Groups, Consumer Lag & Rebalancing

Broker Configuration

Removing Brokers

Broker Replacement

Rebalancing Partitions

Rolling Restart

High Availability

ZooKeeper vs KRaft

Cluster Scaling

Hardware Planning

Storage Planning

Failure Recovery

Multi-broker architecture

Partition reassignment

Zero-downtime upgrades

Cluster balancing

Lab 5 : Kafka Add/Remove Brokers, Scaling & Recovery

Retention Policies

Time-based Retention

Size-based Retention

Segment Management

Offset Retention

Replication Tuning

Disaster Recovery

Data Durability Guarantees

Lab 6 : Configure retention rules

Monitoring & Observability

Performance Tuning

Stream Processing

Kafka Integration Ecosystem

Project 1 : End-to-End Big Data Streaming Platform : Kafka – Spark – HDFS

Project 2 : Real-Time Data Pipeline using Kafka, Spark & Snowflake

Project 3 : End-to-End Big Data Streaming Platform with Apache Kafka, Apache Spark, PostgreSQL & Grafana

Project 4 : End-to-End CDC Pipeline with Apache Kafka, Debezium & PostgreSQL

Snowflake and dbt: Zero to Production Data Engineering

Cloud Data Warehousing Essentials

Getting Started with Snowflake Cloud

Snowflake as a SaaS Platform

Snowflake Account & Core Building Blocks

Snowflake Architecture & Execution Model

Databases & Table Structures in Snowflake

Time Travel & Data Recovery System

Schemas & Session Context Management

Data Integrity & Data Types

Zero-Copy Cloning & Data Replication

Stored Procedures & Automation Logic

Security, Roles & Access Control

Transactions & Data Consistency

Streams & Data Change Tracking

Task Automation & Workflow Scheduling

Automated Data Partitioning & Incremental Loading Using Snowflake Tasks

Incremental Load Using Snowflake Tasks

SnowSQL & Command Line Operations

Snowflake COPY INTO Command for Data Loading and Unloading

External Storage

BI Integration with Power BI

Introduction to Modern Data Transformation

Data Modeling with dbt

Dynamic SQL with Jinja

Testing & Data Documentation

Seeds & Data Sources

Deployment & CI/CD Pipelines

dbt Best Practices & Optimization

Hooks & Workflow Extensions

Snapshots & Historical Tracking

dbt Packages & Ecosystem Extensions

Environment Setup & Prerequisites

Building the Snowflake Data Warehouse

Initializing the dbt Project

Configuring Snowflake Connection – Connecting dbt to Snowflake

Building Source Definitions in dbt

Creating Staging Models for Data Cleaning

Implementing Business Logic with Intermediate Models

Building Analytics-Ready Data Marts

Configuring dbt Project Settings

Running dbt Pipelines

Check Data Warehouse

Generate dbt Documentation

Building Final Retail KPI Reports

Additional Session 1 : dbt

Additional Session 2 : dbt

Project : Real-Time Analytics Pipeline with Kafka, Spark & Snowflake

Apache Airflow: From Basics to Production

What is Apache Airflow and Airflow architecture overview

Airflow Setup & Installation + lab

A DAG (Directed Acyclic Graph) + Lab

DAG structure in Airflow + lab

DAG parameters (start_date, schedule) + lab

Task definition basics + lab

Dependencies concept + lab

Cron expressions

Airflow scheduling system

Catchup vs no catchup

Backfill & Catchup

Timezones in scheduling

Manual vs automatic triggers

Scheduling + lab

Task Dependencies + Lab

Parallel tasks + Lab

Diamond Dependency + Lab

Branching workflows + Lab

Trigger rules + Lab

Data Passing, XCom, and the Push & Pull Mechanism

Variables in Airflow

Connections concept

Sensors (advanced use)

Email alerts setup

Project 1 : End-to-End Real-Time Data Pipeline using Kafka, Spark, Hadoop & Airflow

Project 2 : End-to-End Data Pipeline with Kafka, Spark, HDFS, PostgreSQL & Airflow

Project 3 : End-to-End Data Engineering Platform with Hadoop, Spark, Airflow & dbt

Modern Data Lakehouse with Apache Iceberg

Introduction to Iceberg

Why Iceberg Was Created

Iceberg Architecture

Install Spark + enable Iceberg catalog and create first table

Building Your First Apache Iceberg Table with Spark and HDFS Catalog + Lab

Understanding Table Metadata

Iceberg Table Internals : Metadata JSON, Snapshots, and Version History + Lab

Exploring Iceberg Manifests, Snapshots, and File-Level Metadata + Lab

Hive Table Architecture and Partition Storage Analysis + Lab

Schema evolution + column IDs + backward/forward compatibility + snapshots + hidden partitioning + metadata inspection + Lab

Project : Event-Driven Lakehouse Pipeline: Kafka Ingestion, Spark Processing, Iceberg Storage

Building Real-Time Data Pipelines with Apache Flink

Introduction to Apache Flink

Real-Time vs Batch Processing

Sources (Data Input)

Transformations

Sinks (Data Output)

Parallelism (Distributed Processing)

Stateful Processing

Checkpointing (Fault Tolerance)

Event Time + Watermarks

Fault Tolerance + Recovery

Project : Building Real-Time Data Pipelines Using Kafka, Apache Flink & Flink SQL

End-to-End Data Flow Engineering with Apache NiFi

Introduction to Apache NiFi

How Apache NiFi Handles Big Data Challenges

The change from ETL to data streaming and how Apache NiFi fits in

Apache NiFi’s key features

How Apache NiFi Addresses Big Data Integration Challenges

When NiFi Might Not Be Ideal

NiFi and Big Data Tools

Flow-based programming (FBP)

Apache NiFi’s Main Components

Configuring and Tuning Apache NiFi

Cluster Terminologies

Load Balancing Strategies

Benefits of Clustering

Case 1: Real-Time Operational Intelligence System

Case 2: Enterprise CDC Data Integration with Apache NiFi

Case 3: Data Governance & GDPR-Compliant Data Platform

Data Warehouse Design & Implementation

Evolution: OLTP → DW → Data Lake → Lakehouse

OLTP (Online Transaction Processing)

Data Warehouse (DW)

Why Do We Need a Data Warehouse

From OLTP to DW – ETL/ELT Pipelines

Data Modelling – Star Schema

NORMALIZED (OLTP style – 3NF)

DENORMALIZED (Star Schema – Data Warehouse)

Normalize to write fast. Denormalize to read fast.

Historical Data and Time-Based Analysis

Why OLTP is BAD for Historical Analysis

Slowly Changing Dimensions (SCD)

Enables historical accuracy in reports.

Typical Data Warehouse Workloads

Enterprise Data Warehouse Technologies

OLTP vs Data Warehouse – Limitations of Data Warehouses

Data Lake Architecture

Data Zones in Data Lake

Use Cases for Data Lakes

Strengths and Weaknesses of Data Lakes

OLTP → DW → Data Lake

Lakehouse Architecture

Enterprise Data Warehouse

Inmon vs Kimball vs Data Vault

Enterprise Analytical Requirements

Subject Areas & Conformed Dimensions

Data Domains & Ownership

Enterprise Bus Matrix

Advanced Topics

Project 0 : Data Warehouse for an Airline Sales Analysis System

Project Step 1 – Business Understanding & Project Objectives

Project Step 2 – Airline Data Collection & CSV Preparation

Project Step 3 – HDFS Storage Layer Setup

Project Step 4 – Snowflake Data Warehouse Configuration

Project Step 5 – Staging Tables & Data Ingestion

Project Step 6 – Fact & Dimension Tables

Project Step 7 – ETL Pipeline Development

Project Step 8 – KPI Analytics, SQL Views & Stored Procedures

Project Step 9 – Managing DIM_CUSTOMER (SCD Type 2) History

Project Step 10 – Power BI Dashboard & Business Reporting

Distributed NoSQL with Apache Cassandra

Apache Cassandra

Distributed and Decentralized (Peer-to-Peer)

Cassandra architecture

Partitioning & Data Distribution

Tunable Consistency

Project : Distributed Real-Time Analytics System with Kafka, Cassandra & Airflow

Professional Data Engineering and Big Data Program
