Big Data Introduction:
- What is Big Data
- Evolution of Big Data
- Benefits of Big Data
- Operational vs Analytical Big Data
- Need for Big Data Analytics
- Big Data Challenges
Hadoop Cluster:
- Master Nodes
- NameNode
- Secondary NameNode
- JobTracker
- Client Nodes
- Slaves
- Hadoop configuration
- Setting up a Hadoop cluster
HDFS:
- Introduction to HDFS
- HDFS Features
- HDFS Architecture
- Blocks
- Goals of HDFS
- The NameNode & DataNode
- Secondary NameNode
- The JobTracker
- The Process of a File Read
- The Process of a File Write
- Data Replication
- Rack Awareness
- HDFS Federation
- Configuring HDFS
- HDFS Web Interface
- Fault tolerance
- NameNode failure management
- Access HDFS from Java
YARN:
- Introduction to YARN
- Why YARN
- Classic MapReduce vs YARN
- Advantages of YARN
- YARN Architecture
- ResourceManager
- NodeManager
- ApplicationMaster
- Application submission in YARN
- NodeManager containers
- ResourceManager components
- YARN applications
- Scheduling in YARN
- Fair Scheduler
- Capacity Scheduler
- Fault tolerance
MapReduce:
- What is MapReduce
- Why MapReduce
- How MapReduce works
- Difference between Hadoop 1 & Hadoop 2
- Identity mapper & reducer
- Data flow in MapReduce
- Input Splits
- Relation Between Input Splits and HDFS Blocks
- Flow of Job Submission in MapReduce
- Job submission & Monitoring
- MapReduce algorithms
- Sorting
- Searching
- Indexing
- TF-IDF
Hadoop Fundamentals:
- What is Hadoop
- History of Hadoop
- Hadoop Architecture
- Hadoop Ecosystem Components
- How does Hadoop work
- Why Hadoop & Big Data
- Hadoop Cluster introduction
- Cluster Modes
- Standalone
- Pseudo-distributed
- Fully distributed
- HDFS Overview
- Introduction to MapReduce
- Hadoop in demand
HDFS Operations:
- Starting HDFS
- Listing files in HDFS
- Writing a file into HDFS
- Reading data from HDFS
- Shutting down HDFS
HDFS Command Reference:
- Listing contents of directory
- Displaying and printing disk usage
- Moving files & directories
- Copying files and directories
- Displaying file contents
Java Overview For Hadoop:
- Object oriented concepts
- Variables and Data types
- Static data type
- Primitive data types
- Objects & Classes
- Java Operators
- Method and its types
- Constructors
- Conditional statements
- Looping in Java
- Access Modifiers
- Inheritance
- Polymorphism
- Method overloading & overriding
- Interfaces
MapReduce Programming:
- Hadoop data types
- The Mapper Class
- Map method
- The Reducer Class
- Shuffle Phase
- Sort Phase
- Secondary Sort
- Reduce Phase
- The Job class
- Job class constructor
- JobContext interface
- Combiner Class
- How Combiner works
- Record Reader
- Map Phase
- Combiner Phase
- Reducer Phase
- Record Writer
- Partitioners
- Input Data
- Map Tasks
- Partitioner Task
- Reduce Task
- Compilation & Execution
Pig:
- What is Apache Pig?
- Why Apache Pig?
- Pig features
- Where should Pig be used
- Where not to use Pig
- The Pig Architecture
- Pig components
- Pig vs MapReduce
- Pig vs SQL
- Pig vs Hive
- Pig Installation
- Pig Execution Modes & Mechanisms
- Grunt Shell Commands
- Pig Latin - Data Model
- Pig Latin Statements
- Pig data types
- Pig Latin operators
- Case Sensitivity
- Grouping & Co-grouping in Pig Latin
- Sorting & Filtering
- Joins in Pig Latin
- Built-in Functions
- Writing UDFs
- Macros in Pig
HBase:
- What is HBase
- History of HBase
- The NoSQL Scenario
- HBase & HDFS
- Physical Storage
- HBase vs RDBMS
- Features of HBase
- HBase Data model
- Master server
- Region servers & Regions
- HBase Shell
- Create table and column family
- The HBase Client API
Spark:
- Introduction to Apache Spark
- Features of Spark
- Spark built on Hadoop
- Components of Spark
- Resilient Distributed Datasets
- Data Sharing using Spark RDD
- Iterative Operations on Spark RDD
- Interactive Operations on Spark RDD
- Spark shell
- RDD transformations
- Actions
- Programming with RDD
- Start Shell
- Create RDD
- Execute Transformations
- Caching Transformations
- Applying Action
- Checking output
- GraphX overview
Impala:
- Introducing Cloudera Impala
- Impala Benefits
- Features of Impala
- Relational databases vs Impala
- How Impala works
- Architecture of Impala
- Components of Impala
- The Impala Daemon
- The Impala Statestore
- The Impala Catalog Service
- Query Processing Interfaces
- Impala Shell Command Reference
- Impala Data Types
- Creating & deleting databases and tables
- Inserting & overwriting table data
- Record Fetching and ordering
- Grouping records
- Using the Union clause
- Working of Impala with Hive
- Impala vs Hive vs HBase
MongoDB Overview:
- Introduction to MongoDB
- MongoDB vs RDBMS
- Why & Where to use MongoDB
- Databases & Collections
- Inserting & querying documents
- Schema Design
- CRUD Operations
Oozie & Hue Overview:
- Introduction to Apache Oozie
- Oozie Workflow
- Oozie Coordinators
- Property File
- Oozie Bundle system
- CLI and extensions
- Overview of Hue
Hive:
- What is Hive?
- Features of Hive
- The Hive Architecture
- Components of Hive
- Installation & configuration
- Primitive types
- Complex types
- Built-in functions
- Hive UDFs
- Views & Indexes
- Hive Data Models
- Hive vs Pig
- Co-groups
- Importing data
- Hive DDL statements
- Hive Query Language
- Data types & Operators
- Type conversions
- Joins
- Sorting & controlling data flow
- Local vs MapReduce mode
- Partitions
- Buckets
Sqoop:
- Introducing Sqoop
- Sqoop installation
- Working of Sqoop
- Understanding connectors
- Importing data from MySQL to Hadoop HDFS
- Selective imports
- Importing data to Hive
- Importing data to HBase
- Exporting data to MySQL from Hadoop
- Controlling import process
Flume:
- What is Flume?
- Applications of Flume
- Advantages of Flume
- Flume architecture
- Data flow in Flume
- Flume features
- Flume Event
- Flume Agent
- Sources
- Channels
- Sinks
- Log Data in Flume
ZooKeeper Overview:
- ZooKeeper Introduction
- Distributed Application
- Benefits of Distributed Applications
- Why use ZooKeeper
- ZooKeeper Architecture
- Hierarchical Namespace
- Znodes
- Stat structure of a Znode
- Electing a leader
Kafka Basics:
- Messaging Systems
- Point-to-Point
- Publish-Subscribe
- What is Kafka
- Kafka Benefits
- Kafka Topics & Logs
- Partitions in Kafka
- Brokers
- Producers & Consumers
- What are Followers
- Kafka Cluster Architecture
- Kafka as a Pub-Sub Messaging System
- Kafka as a Queue Messaging System
- Role of ZooKeeper
- Basic Kafka Operations
- Creating a Kafka Topic
- Listing Topics
- Starting Producer
- Starting Consumer
- Modifying a Topic
- Deleting a Topic
- Integration With Spark
Scala Basics:
- Introduction to Scala
- Spark & Scala interdependence
- Objects & Classes
- Class definition in Scala
- Creating Objects
- Scala Traits
- Basic Data Types
- Operators in Scala
- Control structures
- Fields in Scala
- Functions in Scala
- Collections in Scala
- Mutable collections
- Immutable collections