Professional Data Engineering and Big Data Program

Level: All Levels
Total Enrolled: 3
Duration: 50 hours 28 minutes
Last Updated: June 22, 2026

Course content:

Introduction to Big Data and Data Engineering – part1
Big Data needs Data Engineering because raw data is too large, fast, and messy to process or use directly. Data Engineering solves this by building scalable systems (scale-out) to collect, store, and process data efficiently.

Why Big Data Needs Data Engineering?

03:13
Scale-up solution

02:59
Scale-out solution

05:15
Why Data Engineering Became Necessary

10:02
How Data Engineering Solves This Problem

02:04

Introduction to Big Data and Data Engineering – part2
Data keeps growing because of social media, IoT, and digital systems. Big Data is data that is large, fast-moving, and diverse, commonly described by the 5 Vs (Volume, Velocity, Variety, Veracity, Value). It is handled with Batch Processing for large historical datasets and Real-Time Analytics for instant insights and fast decision-making.
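
As a rough illustration of the difference between batch and real-time handling, the sketch below computes the same per-sensor average two ways. The `events` list, `batch_average`, and `RunningAverage` are hypothetical names invented for this example; in practice an engine such as Spark or Flink would do this work.

```python
from collections import defaultdict

# Hypothetical sample events: (sensor_id, reading)
events = [("s1", 20), ("s2", 30), ("s1", 40), ("s2", 10)]


def batch_average(all_events):
    """Batch: scan the complete historical dataset, then report once."""
    totals, counts = defaultdict(float), defaultdict(int)
    for sensor, reading in all_events:
        totals[sensor] += reading
        counts[sensor] += 1
    return {sensor: totals[sensor] / counts[sensor] for sensor in totals}


class RunningAverage:
    """Real-time: keep small state and refresh the answer per event."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, sensor, reading):
        self.totals[sensor] += reading
        self.counts[sensor] += 1
        # Return the current average for this sensor immediately.
        return self.totals[sensor] / self.counts[sensor]


batch_result = batch_average(events)

stream = RunningAverage()
live_results = [stream.update(sensor, reading) for sensor, reading in events]
```

The batch function only produces an answer after reading everything, while the running version yields an updated value after each event at the cost of keeping state between events.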

Introduction to Big Data and Data Engineering – part3
Big Data brings challenges such as huge volume, variety, and processing complexity, so systems such as OLTP databases, Data Warehouses, Data Lakes, and Lakehouses are used to manage it. ETL transforms data before loading it into the target system, while ELT loads the raw data first and then transforms it inside the target, which suits modern big data processing.
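
The ETL and ELT orderings can be sketched side by side. This is a minimal illustration only, assuming an in-memory SQLite database stands in for the warehouse; `raw_rows`, `run_etl`, `run_elt`, and the table names are hypothetical and not taken from the course material.

```python
import sqlite3

# Hypothetical raw source records: (customer, amount stored as messy text)
raw_rows = [("alice", " 10.5 "), ("bob", "7"), ("alice", "2.5")]

SUMMARY_SQL = "SELECT customer, SUM(amount) FROM {} GROUP BY customer ORDER BY customer"


def run_etl(conn):
    """ETL: clean the data in application code, then load the clean rows."""
    cleaned = [(name.title(), float(amount.strip())) for name, amount in raw_rows]
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)
    return conn.execute(SUMMARY_SQL.format("sales_etl")).fetchall()


def run_elt(conn):
    """ELT: load the raw rows unchanged, then transform with SQL in the target."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw_rows)
    conn.execute(
        """
        CREATE TABLE sales_elt AS
        SELECT UPPER(SUBSTR(customer, 1, 1)) || SUBSTR(customer, 2) AS customer,
               CAST(TRIM(amount) AS REAL) AS amount
        FROM sales_raw
        """
    )
    return conn.execute(SUMMARY_SQL.format("sales_elt")).fetchall()


conn = sqlite3.connect(":memory:")
etl_result = run_etl(conn)
elt_result = run_elt(conn)
```

Both paths end with the same cleaned totals. The difference is where the transformation runs: before the load in ETL, inside the target system in ELT, which also keeps the untouched raw table available for later reprocessing.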

Big data challenges

00:54
OLTP

02:15
Data warehouse

03:38
Data Lake

01:38
Data Lakehouse

01:38
Schema-on-write

01:20
Schema-on-read

01:38
ETL

01:05
ELT

01:20
Understand the role of network infrastructure in big data

01:20

Introduction to Big Data and Data Engineering – part4

Introduction to Distributed Systems in Big Data

02:09
Why Big Data Needs Distributed Systems

01:19
Basics of Distributed Computing

01:39
How Distributed Systems Work

01:07
Key Concepts in Distributed Systems-part1

01:37
Key Concepts in Distributed Systems-part2

00:48
Role in Big Data Architecture

00:51
Single Machine vs Distributed Systems

00:43
Scalability

00:57
Fault Tolerance

01:04
Parallel Processing

00:48
Data Distribution

00:46

Data Engineering with SQL & Python

Introduction

03:34
IDE

02:38
Anaconda

12:38
Install Anaconda

16:31
Jupyter Notebook

15:40
Print

10:40
Numbers

03:07
Variables

05:54
Strings

01:13
String methods

19:52
Data Structures

05:48
Lists

01:00
Dictionaries

02:44
Tuples

02:21
Sets

03:03
Booleans

04:05
Comparison Operators

03:20
Conditional statements: if, elif, else

03:27
Loops: for loop and while loop

08:30
Functions

05:06
Introduction to pandas

00:59
DataFrame

01:36
Create DataFrame

04:27
Working with columns

07:16
Working with Rows

08:57
Subsets

03:32
Working with files

05:48
Method1

04:23
Method2

22:25
Summarize data

00:51
View the first few rows of a DataFrame

01:47
View the last few rows of a DataFrame

00:46
Get summary of the DataFrame

00:57
Generate summary statistics for numerical columns

00:50
Get the number of rows and columns in a DataFrame

00:46
View the column names of the DataFrame

00:17
Access the index of the DataFrame

00:29
Check the data types of each column

00:24
Check for missing values in the DataFrame

01:12
Remove rows with missing values

01:21
Rename columns or rows in the DataFrame

00:56
Sample method

01:09
Sort the DataFrame

03:10
Group data by one or more columns to perform aggregation

05:57
Merge two DataFrames

03:35
Create a pivot table for summarizing data

02:02
File formats

03:55
CSV file

03:23
Excel file

01:48
JSON file

02:25
Parquet file

01:21
XML file

00:45
Data Quality

00:39
Handling missing values

08:07
duplicated

02:12
Data Consistency

01:34
Data Transformation and Normalization

03:51
Validating Data Types

02:07
Fixing Data Entry Errors

02:32
Consistency in Categorical Data

01:13
Standardizing Data Formats

02:20
Data Types conversion

01:00
Data Validation and Verification

01:09
Handling Categorical Variables

02:00
Project 1 : Bank Transactions

22:48
Project 2 : Customer Purchases

23:13
Project 3 : Payroll analysis

16:01
Introduction to Database

02:19
Table

01:09
Schema

00:41
Relational Database

01:05
Primary key and foreign key

02:47
Relationships between Tables

01:04
One-to-One

00:27
One-to-Many

00:28
Many-to-Many

00:22
RDBMS

00:55
SQL

00:52
Types of SQL commands

01:56
DDL – Data Definition Language

00:50
DQL – Data Query Language

00:47
DML – Data Manipulation Language

00:42
DCL – Data Control Language

00:38
TCL – Transaction Control Language

00:40
SQL Server Management Studio (SSMS)

01:06
Install SQL Server Developer and SQL Server Management Studio (SSMS)

10:38
SSMS Overview

03:38
Transportation database

06:40
SQL Server data types

06:40
SELECT

01:49
AND – OR conditions

04:01
ORDER BY

02:07
DISTINCT

00:42
BETWEEN

00:36
IN

01:41
LIKE

02:00
Introduction to Joins

02:17
Inner JOIN

04:05
Left JOIN

02:57
Right JOIN

01:21
Full JOIN

01:51
Aggregate functions and GROUP BY

03:15
COUNT

00:47
SUM

00:59
AVG

00:28
MAX

00:25
MIN

00:17
GROUP BY

01:18
HAVING

01:08
String Functions

03:04
Date Functions

05:32
Conversion Functions

01:49
Subquery

00:40
Noncorrelated Subqueries

01:44
Correlated Subqueries

01:13
Window functions

00:50
ROW_NUMBER()

02:02
RANK()

01:56
DENSE_RANK()

01:39
CASE statement

02:09
A Common Table Expression

01:22
Project : Transportation Analysis project

23:53

Hadoop Production Deployment & Cluster Setup

Why do we develop something

03:34
Challenges in Data Processing Before Hadoop

02:38
Hadoop History

12:38
Introduction to Hadoop

16:31
Hadoop Ecosystem Overview

15:40
Hadoop Architecture-part1

10:40
Hadoop Architecture-part2

03:07
Hadoop Architecture-part3

05:54
Hadoop Architecture-part4

01:13
Hadoop Architecture-part5

19:52
HDFS Architecture – NameNode

05:48
HDFS Architecture – DataNode

01:00
Read Operation and Write Operation

02:44
Block and Replication

02:21
Secondary NameNode Role

03:03
HDFS Block Storage

04:05
Fault Tolerance in HDFS

03:20
Data Locality Concept

03:27
Hadoop Installation (VirtualBox) + Services + HDFS Commands Practice

08:30
Start Hadoop Services

05:06
jps command

00:59
Stop Hadoop Services

01:36
NameNode UI (HDFS Web Interface)

04:27
Basic Navigation Commands

07:16
File Upload & Download Commands

08:57
File Viewing Commands

03:32
File & Directory Management

05:48
System & Advanced Commands

02:41
HDFS High Availability & Rack Awareness Architecture

22:25
Apache ZooKeeper: Architecture, Coordination & Distributed Leadership in Hadoop

04:04
Project 1 : HDFS Small Files Optimization

06:18
Project 2 : HDFS Block Size & Replication

05:57
Project 3 : HDFS File Format & Compression Optimization

08:27

Enterprise Data Engineering with Apache Spark

Challenges in Big Data Before Apache Spark and Understanding Apache Spark

20:02
Differences Between Spark 2.x and Spark 1.x

05:17
Apache Spark 2.x Architecture

12:41
Fault Tolerance & Scalability: Hadoop vs Spark

06:05
Spark setup

03:42
PySpark Setup

01:41
Spark UI

01:59
Spark UI – Test Spark Job

05:39
Jobs details

11:32
Stage details – 0

11:23
Stage details – 1

09:38
Stage details – 2

03:32
Stage details – 3

07:36
Stage details – 4

06:34
Executors Tab

00:58
Environment Tab

00:54
Storage Tab

01:08
SQL Tab and SQL metrics

00:59
Structured Streaming Tab

02:02
JDBC/ODBC Server Tab

01:17
Shuffle read and Shuffle write in Spark

12:12
PySpark RDD API and RDD (Resilient Distributed Dataset)

05:10
RDD Transformations

08:30
RDD Actions

02:34
Run (Spark Job Execution Flow)

00:54
DAG (Directed Acyclic Graph)

00:47
Stages and Tasks

01:09
Partitioning Strategy

01:35
Caching & Persisting

01:18
Performance Tuning Basics

02:20
PySpark DataFrame API

02:34
DataFrame API

01:08
Dataset API

01:03
Schema Inference

11:51
Schema Enforcement

02:25
Column Operations

30:15
Filtering & Aggregation

11:08
Normal join – Broadcast join – production problem

17:36
Handling Null Values

06:46
UDF (User Defined Functions)

12:14
Window Functions in PySpark

26:33
PySpark SQL and Spark SQL

11:12
Spark Streaming and Structured Streaming

01:27
Streaming Sources (Kafka, Files)

01:55
Window Operations

04:59
Late data handling (watermark)

02:03
Stateful Streaming + Final Streaming Pipeline Architecture

09:00
Memory Management

03:53
Executor Memory Structure

02:48
JVM Heap Memory in Spark

02:15
Memory for Execution vs Storage

00:48
Unified Memory Management

01:25
Off-Heap Memory

01:37
Garbage Collection (GC) Impact

02:16
Partition Tuning

02:29
Reduce Shuffle Operations

01:20
Broadcast Join Optimization

00:58
Avoid Wide Transformations

00:48
Data Skew Handling

01:57
File Format Optimization (Parquet, ORC)

01:15
Predicate Pushdown

01:04
Column Pruning

00:43
Project 1: DataFrame API Performance Optimization Project

12:40
Project 2: Spark SQL Query Optimization Project

07:27
Project 3: Spark Memory Tuning & Resource Optimization

08:13
Project 4: Shuffle Handling & Optimization in Spark

04:30
Project 5: Data Skew Detection & Mitigation

03:11
Project 6: Broadcast Join Optimization

17:36

Introduction to Hive and Sqoop

Introduction to Hive

01:53
Hive Architecture

06:11
Hive Data Model

01:06
Hive Query Language (HQL)

00:32
Data Types in Hive

01:03
DAG

04:02
Install VirtualBox on Windows

01:59
Install PuTTY on Windows

01:09
Install WinSCP on Windows

01:21
Install Cloudera and Setting Up Hadoop

09:20
Hive Installation

01:50
Create Database

02:30
Creating and Managing Tables in Hive

07:23
Loading Data into Hive Tables

12:55
Managed Table in Hive

02:07
External Table in Hive

07:28
Hive queries (HQL)

09:49
Partitioning in Hive

00:50
Static Partitioning

04:08
Dynamic Partitioning

05:45
Hive Bucketing

09:52
Hive Joins

07:29
Hive Optimization Techniques

01:55
Apache Sqoop

02:21
Sqoop Architecture

00:53
Key Features of Sqoop

02:20
Sqoop Connectors

01:22
Sqoop Commands Overview

00:56
Sqoop Installation

00:36
MySQL Database

02:32
Importing Data from RDBMS to HDFS

06:22
Exporting Data from HDFS to RDBMS

09:23
Adding more mappers to a Sqoop job

04:21
Handling portions of data with Sqoop

06:29
Incremental Data Import in Sqoop

15:59
Data Compression with Sqoop

04:20
Avro format

03:53
SequenceFile format

03:16
Parquet format

02:46
Create a Sqoop job

04:14
Sqoop Performance Optimization

01:00

Kafka: From Zero to Production

Starting the Kafka Journey – part1

24:05
Starting the Kafka Journey – part2

12:26
Starting the Kafka Journey – part3

27:39
Starting the Kafka Journey – part4

18:30
Starting the Kafka Journey – part5

07:08
Starting the Kafka Journey – part6

47:43
What is Kafka

01:36
Event Streaming Concept

01:11
Kafka vs traditional messaging systems

02:44
Kafka ecosystem overview

01:57
Real-time Data Pipelines

00:39
Log Aggregation

00:39
Streaming Analytics

00:41
Event-driven Microservices

01:56
CDC (Change Data Capture)

01:01
Distributed event streaming platform

00:52
Publish/subscribe messaging

01:31
Durable commit log

00:42
Replayable events

00:58
High throughput architecture

00:44
Horizontal scalability

00:50
ZooKeeper mode and KRaft mode

04:34
Lab 1 : Install Kafka

09:47
Lab 1 : Start broker

01:22
Lab 1 : Create topic

06:18
Lab 1 : Produce & consume messages

03:22
Lab 1 : Inspect logs

02:46
Kafka Architecture Overview

01:47
Producers

01:00
Consumers

01:24
Topics

00:45
Partitions

00:40
Brokers

00:51
Clusters

00:47
Kafka Log Structure

00:43
Event flow (routing)

01:08
Append-only log design

00:43
Partition-based scalability

01:02
Offset indexing

00:45
Distributed storage model

01:01
Event routing

00:56
Lab 2 : Create multi-partition topics

02:52
Lab 2 : Observe partition distribution

01:30
Lab 2 : Test message ordering

03:36
Lab 2 : Send keyed messages

03:49
Lab 2 : Explore broker storage directories

02:29
Broker Internals

01:49
Partitions and Replication Factor

01:19
Leader/Follower Model

01:15
ISR (In-Sync Replicas)

01:19
Leader Election

01:48
Fault Tolerance

01:19
Offset Concept

01:03
Ordering Guarantees

00:58
Durability Model

01:24
Data Consistency Model

01:01
Throughput vs Latency

01:07
Retention Policies

01:25
Log Segments

01:07
Replayability

01:17
Replicated partitions

01:02
Automatic failover

00:46
High durability storage

00:47
Segment-based logs

00:31
Time-based retention

00:42
Size-based retention

00:50
Lab 3 : Kafka Replication, ISR & Broker Failure

09:45
Producer API

02:14
Consumer API and Consumer Groups

01:22
Offset Commit Strategies

01:15
Rebalancing

01:40
Delivery Semantics

01:24
Idempotent Producer

01:53
Message Keys and Partitioning Logic

01:02
Batching and Compression

01:10
Retry Mechanisms – Error Handling – Dead Letter Queue (DLQ)

01:20
Parallel consumption

00:27
Offset management

00:47
Duplicate prevention

00:56
Exactly-once semantics

00:52
Lab 4 : Kafka Consumer Groups, Consumer Lag & Rebalancing

11:27
Cluster Setup

01:14
Broker Configuration

01:21
Adding Brokers

01:12
Removing Brokers

00:59
Broker Replacement

00:59
Rebalancing Partitions

01:07
Rolling Restart

01:04
Kafka Upgrades

00:49
High Availability

00:57
ZooKeeper vs KRaft

00:51
Cluster Scaling

01:00
Network Tuning

00:40
Hardware Planning

00:47
Storage Planning

00:57
Failure Recovery

00:51
Multi-broker architecture

00:52
Partition reassignment

00:45
Zero-downtime upgrades

00:42
Cluster balancing

00:45
Lab 5 : Kafka Add/Remove Brokers, Scaling & Recovery

09:58
Retention Policies

02:42
Time-based Retention

01:04
Size-based Retention

01:12
Log Compaction

01:47
Segment Management

00:55
Offset Retention

00:47
Replication Tuning

01:24
Disaster Recovery

01:10
Data Durability Guarantees

01:24
Lab 6 : Configure retention rules

10:13
Kafka Security

09:52
Monitoring & Observability

07:09
Performance Tuning

07:09
Stream Processing

04:57
Kafka Integration Ecosystem

04:53
Project 1 : End-to-End Big Data Streaming Platform : Kafka – Spark – HDFS

46:09
Project 2 : Real-Time Data Pipeline using Kafka, Spark & Snowflake

47:00
Project 3 : End-to-End Big Data Streaming Platform with Apache Kafka, Apache Spark, PostgreSQL & Grafana

35:19
Project 4 : End-to-End CDC Pipeline with Apache Kafka, Debezium & PostgreSQL

58:07

Snowflake and dbt: Zero to Production Data Engineering

Cloud Data Warehousing Essentials

04:21
Getting Started with Snowflake Cloud

19:12
Snowflake as a SaaS Platform

03:08
Snowflake Account & Core Building Blocks

03:47
Snowflake Architecture & Execution Model

11:14
Databases & Table Structures in Snowflake

15:35
Time Travel & Data Recovery System

09:31
Schemas & Session Context Management

17:49
Data Integrity & Data Types

08:01
Zero-Copy Cloning & Data Replication

07:33
Stored Procedures & Automation Logic

05:10
Security, Roles & Access Control

13:35
Transactions & Data Consistency

06:55
Streams & Data Change Tracking

08:15
Task Automation & Workflow Scheduling

08:27
Automated data partitioning & incremental loading using Snowflake tasks

07:23
Incremental load using Snowflake tasks

08:50
SnowSQL & Command Line Operations

02:44
Snowflake COPY INTO Command for Data Loading and Unloading

06:11
External Storage

01:05
BI Integration with Power BI

03:53
Introduction to Modern Data Transformation

17:22
Data Modeling with DBT

07:37
Dynamic SQL with Jinja

05:21
Testing & Data Documentation

04:34
Seeds & Data Sources

07:52
Deployment & CI/CD Pipelines

05:44
DBT Best Practices & Optimization

05:58
Hooks & Workflow Extensions

05:48
Snapshots & Historical Tracking

06:40
DBT Packages & Ecosystem Extensions

01:32
Environment Setup & Prerequisites

05:40
Building the Snowflake Data Warehouse

06:33
Initializing the dbt Project

13:05
Configuring Snowflake Connection – Connecting dbt to Snowflake

02:03
Building Source Definitions in dbt

02:55
Creating Staging Models for Data Cleaning

05:04
Implementing Business Logic with Intermediate Models

00:45
Building Analytics-Ready Data Marts

01:14
Configuring dbt Project Settings

10:41
Running dbt Pipelines

04:49
Check Data Warehouse

05:28
Generate dbt Documentation

01:59
Building Final Retail KPI Reports

01:20
Additional Session 1 : DBT

01:19:35
Additional Session 2 : DBT

01:09:22
Project : Real-Time Analytics Pipeline with Kafka, Spark & Snowflake

47:00

Apache Airflow: From Basics to Production

What is Apache Airflow and Airflow architecture overview

15:13
Airflow Setup & Installation + lab

09:54
A DAG (Directed Acyclic Graph) + Lab

09:01
DAG structure in Airflow + lab

09:19
DAG parameters (start_date, schedule) + lab

01:56
Task definition basics + lab

01:58
Dependencies concept + lab

09:52
Scheduling

00:57
Cron expressions

01:58
Airflow scheduling system

01:12
Catchup vs no catchup

01:43
Backfill & Catchup

01:43
Timezones in scheduling

01:07
Manual vs automatic triggers

01:04
Scheduling + lab

08:45
Task Dependencies + Lab

05:15
Parallel tasks + Lab

02:51
Diamond Dependency + Lab

03:53
Branching workflows + Lab

04:38
Trigger rules + Lab

06:46
Data Passing: XCom and the Push & Pull Mechanism

07:10
Variables in Airflow

02:46
Connections concept

05:31
Sensors (advanced use)

01:56
HTTP Operator

01:01
SQL Operator

01:27
Email alerts setup

34:29
Project 1 : End-to-End Real-Time Data Pipeline using Kafka, Spark, Hadoop & Airflow

09:42
Project 2 : End-to-End Data Pipeline with Kafka, Spark, HDFS, PostgreSQL & Airflow

01:02:43
Project 3 : End-to-End Data Engineering Platform with Hadoop, Spark, Airflow & dbt

41:05

Modern Data Lakehouse with Apache Iceberg

Introduction to Iceberg

11:27
Why Iceberg Was Created

07:14
Iceberg Architecture

12:44
Install Spark + enable Iceberg catalog and create first table

04:20
Building Your First Apache Iceberg Table with Spark and HDFS Catalog + Lab

10:19
Understanding Table Metadata

05:46
Iceberg Table Internals : Metadata JSON, Snapshots, and Version History + Lab

16:05
Exploring Iceberg Manifests, Snapshots, and File-Level Metadata + Lab

11:57
Hive Table Architecture and Partition Storage Analysis + Lab

10:36
Schema evolution + column IDs + backward/forward compatibility + snapshots + hidden partitioning + metadata inspection + Lab

13:11
Project : Event-Driven Lakehouse Pipeline: Kafka Ingestion, Spark Processing, Iceberg Storage

10:19

Building Real-Time Data Pipelines with Apache Flink

Introduction to Apache Flink

06:44
Real time vs Batch Processing

03:00
Dataflow Model

03:54
Sources (Data Input)

03:39
Transformations

01:35
Sinks (Data Output)

01:47
Parallelism (Distributed Processing)

02:53
Stateful Processing

08:15
Checkpointing (Fault Tolerance)

06:41
Event Time + Watermarks

09:30
Fault Tolerance + Recovery

02:06
Project : Building Real-Time Data Pipelines Using Kafka, Apache Flink & Flink SQL

05:17

End-to-End Data Flow Engineering with Apache NiFi

Introduction to Apache NiFi

03:53
How Apache NiFi handles big data challenges

01:33
The change from ETL to data streaming and how Apache NiFi fits in

01:32
Apache NiFi’s key features

02:30
How Apache NiFi Addresses Big Data Integration Challenges

00:50
When NiFi Might Not Be Ideal

01:10
NiFi and Big Data tools

02:04
Flow-based programming (FBP)

02:36
Apache NiFi’s Main Components

02:42
NiFi GUI

01:25
Configuring and Tuning Apache NiFi

04:17
Scaling NiFi

00:54
Cluster Terminologies

01:18
Load Balancing Strategies

01:13
Benefits of Clustering

00:37
Case 1: Real-Time Operational Intelligence System

00:49
Case 2: Enterprise CDC Data Integration with Apache NiFi

03:52
Case 3: Data Governance & GDPR-Compliant Data Platform

02:13

Data Warehouse Design & Implementation

Evolution: OLTP → DW → Data Lake → Lakehouse

02:03
OLTP (Online Transaction Processing)

06:57
Data Warehouse (DW)

01:34
Why Do We Need a Data Warehouse

03:10
From OLTP to DW – ETL/ELT Pipelines

01:41
Data Modelling – Star Schema

08:13
NORMALIZED (OLTP style – 3NF)

00:54
DENORMALIZED (Star Schema – Data Warehouse)

00:49
Normalize to write fast. Denormalize to read fast.

01:07
Historical Data and Time-Based Analysis

04:55
Why OLTP is BAD for Historical Analysis

00:57
Slowly Changing Dimensions (SCD)

02:22
Enables historical accuracy in reports.

02:49
Typical Data Warehouse Workloads

00:51
Enterprise Data Warehouse Technologies

01:06
OLTP vs Data Warehouse and Limitations of Data Warehouses

00:51
Data Lake

02:33
Schema-on-Read

01:02
Data Lake Architecture

00:48
Data Zones in Data Lake

01:25
Use Cases for Data Lakes

00:37
Strengths and Weaknesses of Data Lakes

01:33
OLTP → DW → Data Lake

01:09
Lakehouse architecture

08:54
Enterprise Data Warehouse

04:28
Inmon vs Kimball vs Data Vault

05:14
Enterprise Analytical Requirements

03:43
Subject Areas & Conformed Dimensions

03:39
Data Domains & Ownership

03:33
Enterprise Bus Matrix

01:55
Advanced topics

14:26
Project 0 : Data warehouse for an airline sales analysis system

01:15
Project Step 1 – Business Understanding & Project Objectives

01:14
Project Step 2 – Airline Data Collection & CSV Preparation

08:51
Project Step 3 – HDFS Storage Layer Setup

01:10
Project Step 4 – Snowflake Data Warehouse Configuration

08:17
Project Step 5 – Staging Tables & Data Ingestion

06:53
Project Step 6 – Fact & Dimension Tables

04:40
Project Step 7 – ETL Pipeline Development

16:22
Project Step 8 – KPI Analytics, SQL Views & Stored Procedures

02:11
Project Step 9 – Managing DIM_CUSTOMER (SCD Type 2) History

05:50
Project Step 10 – Power BI Dashboard & Business Reporting

01:43

Distributed NoSQL with Apache Cassandra

Design and build scalable End-to-End Data Pipelines
Develop batch and real-time data processing systems
Implement modern Data Engineering architectures in production environments
Build and manage Data Warehouses and ETL workflows
Work with tools such as Python, SQL, Spark, Kafka, Airflow, DBT, and Snowflake
Process large-scale data using distributed systems like Hadoop and Spark
Apply best practices in data quality, governance, and monitoring
Troubleshoot real-world issues such as pipeline failures and data inconsistencies
Optimize data workflows for performance and scalability
Build a strong, job-ready Data Engineering portfolio through real projects

Access to recorded sessions
Live coaching and mentoring sessions
Hands-on, production-level projects
Pre-configured technical environment
Real-world datasets and case studies
Data pipeline and architecture templates
Interview preparation resources
Ongoing technical support

Instructor:

Eng Mohammed

Big Data Engineer and Data Consultant @ ISD Company
