CLD18 – HDP Data Science


This course provides instruction on the theory and practice of data science, including machine learning and natural language processing. It introduces the core concepts behind many of today’s most commonly used algorithms and demonstrates them in practical applications. We’ll discuss concepts and key algorithms in all of the major areas – Classification, Regression, Clustering, and Dimensionality Reduction – including a primer on Neural Networks. We’ll cover both single-server tools and frameworks (Python, NumPy, pandas, SciPy, scikit-learn, NLTK, TensorFlow, Jupyter) and large-scale tools and frameworks (Spark MLlib, Stanford CoreNLP, TensorFlowOnSpark/Horovod/MLeap, Apache Zeppelin). Download the data sheet to view the full list of objectives and labs.

  • Prerequisites
    Students must have experience with Python, Scala, and Spark, as well as prior exposure to statistics and probability and a basic understanding of big data and Hadoop principles. While brief reviews of these topics are offered, students new to Hadoop are encouraged to attend the Apache Hadoop Essentials (HDP-123) course and HDP Spark Developer (DEV-343), as well as the language-specific introduction courses.
  • Target Audience
    Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Spark/Hadoop

DAY 1 – An Introduction to Data Science, SciKit-Learn, HDFS, Reviewing Spark apps, DataFrames and NoSQL

  • Discuss aspects of Data Science, the team members, and the team roles
  • Discuss use cases for Data Science
  • Discuss the current State of the Art and its future direction
  • Review HDFS, Spark, Jupyter, and Zeppelin
  • Work with SciKit-Learn, Pandas, NumPy, Matplotlib, and Seaborn
  • Hello, ML w/ SciKit-Learn 
  • Spark REPLs, Spark Submit, & Zeppelin Review 
  • HDFS Review 
  • Spark DataFrames and Files 
  • NiFi Review

DAY 2 – Algorithms in Spark ML and SciKit-Learn: Linear Regression, Logistic Regression, Support Vectors, Decision Trees

  • Discuss categories and use cases of the various ML Algorithms
  • Understand Linear Regression, Logistic Regression, and Support Vectors
  • Understand Decision Trees and their limitations
  • Understand Nearest-Neighbors
  • Discuss and demonstrate a Spam Classifier
  • Linear Regression as a Projection 
  • Logistic Regression 
  • Support Vectors 
  • Decision Trees 
  • Linear Regression as a Classifier
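Linear regression appears twice in the labs above, as a projection and as a classifier. As a hedged, from-scratch sketch of the underlying idea (not the course’s lab code, which uses Spark ML and SciKit-Learn), ordinary least squares for a single feature has a closed form:

```python
def fit_ols(xs, ys):
    """Closed-form ordinary least squares for a single feature:
    slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Perfectly linear toy data: y = 2x + 1
slope, intercept = fit_ols([0, 1, 2, 3], [1, 3, 5, 7])
```

The library implementations generalize this to many features (and, for the classifier variant, threshold the predicted value); the closed form above is just the one-variable special case.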

DAY 3 – K-Means & GMM Clustering, Essential TensorFlow, NLP with NLTK, NLP with Stanford CoreNLP

  • Discuss and understand Clustering Algorithms
  • Work with TensorFlow to create a basic neural network
  • Discuss Natural Language Processing
  • Discuss Dimensionality Reduction Algorithms
  • K-Means Clustering 
  • GMM Clustering 
  • Essential TensorFlow 
  • Sentiment Analysis
  • Dimensionality Reduction with PCA
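K-Means clustering, the first lab listed above, is easiest to see in one dimension. The following is a minimal from-scratch sketch of Lloyd’s algorithm, the idea behind K-Means (the labs themselves use Spark MLlib and SciKit-Learn implementations, not code like this):

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm in one dimension: repeatedly assign each point
    to its nearest centroid, then move each centroid to the mean of its
    assigned points (a centroid is kept if its cluster ends up empty)."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two well-separated 1-D blobs; centroids converge to the blob means
centers = kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.2, 9.8], centroids=[0.0, 5.0])
```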

DAY 4 – Hyperparameter Tuning, K-Fold Validation, Ensemble Methods, ML Pipelines in SparkML

  • Discuss Hyper-Parameter Tuning and K-Fold Validation
  • Understand Ensemble Models
  • Discuss ML Pipelines in Spark MLlib
  • Discuss ML in production and real-world issues
  • Demonstrate TensorFlowOnSpark
  • Hyper-parameter tuning 
  • K-Fold Validation 
  • Ensemble Methods 
  • ML Pipelines in SparkML 
  • Demo: TensorFlowOnSpark
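K-Fold validation, discussed and practiced on Day 4, comes down to a simple index-splitting scheme: each fold is held out once as a validation set while the rest train the model. A hedged pure-Python sketch (SciKit-Learn and Spark ML provide utilities for this; for simplicity this version assumes k divides n evenly):

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k-fold validation.
    Each fold of size n // k is held out exactly once."""
    fold_size = n // k
    indices = list(range(n))
    for f in range(k):
        start, stop = f * fold_size, (f + 1) * fold_size
        yield indices[:start] + indices[stop:], indices[start:stop]

splits = list(k_fold_indices(6, 3))
# Every row index lands in exactly one validation fold
```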

CLD19 – HDP Operations: Apache Hadoop Security Training


This course is designed for experienced administrators who will be implementing secure Hadoop clusters using authentication, authorization, auditing and data protection strategies and tools. Download the data sheet to view the full list of course objectives and labs.

  • Prerequisites
    Students should be experienced in the management of Hadoop using Ambari and Linux environments. Completion of the Hadoop Administration I course is highly recommended.
  • Target Audience
    IT administrators and operators responsible for installing, configuring and supporting an Apache Hadoop deployment in a Linux environment.

Day 1 – An Introduction to Security

  • Definition of Security
  • Securing Sensitive Data
  • Integrating HDP Security
  • What Security Tools to use for Each Use Case
  • HDP Security Prerequisites
  • Ambari Server Security
  • Kerberos Deep Dive
  • Setting up the Lab Environment
  • Configuring the AD Resolution Certificate
  • Security Options for Ambari

Day 2 – Working with Kerberos and Apache Ranger

  • Enable Kerberos
  • Apache Ranger Installation
  • Apache Ranger KMS
  • Kerberizing the Cluster
  • Installing Apache Ranger
  • Setting up Apache Ranger KMS Data Encryption

DAY 3 – Working with Apache Knox

  • Secure Access with Ranger
  • Apache Knox Overview
  • Apache Knox Installation
  • Ambari Views for Controlled Access
  • Secured Hadoop Exercises
  • Configuring Apache Knox
  • Exploring Other Security Features of Apache Ambari

CLD20 – HDP Spark Developer


This course introduces the Apache Spark distributed computing engine, and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. It is based on the Spark 2.x release. The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g. RDDs and the distributed compute engine), as well as higher-level constructs that provide a simpler and more capable interface. It includes in-depth coverage of Spark SQL, DataFrames, and DataSets, which are now the preferred programming API. This includes exploring possible performance issues and strategies for optimization. The course also covers more advanced capabilities such as the use of Spark Streaming to process streaming data, and integrating with the Kafka server.

  • Prerequisites
    Students should be familiar with programming principles and have previous experience in software
    development using Scala. Previous experience with data streaming, SQL, and HDP is also helpful, but not
    required.
  • Target Audience
    Software engineers who are looking to develop in-memory applications for time-sensitive, highly iterative
    workloads in an enterprise HDP environment.

DAY 1 – Scala Ramp Up, Introduction to Spark

  • Scala Introduction
  • Working with: Variables, Data Types, and Control Flow
  • The Scala Interpreter
  • Collections and their Standard Methods (e.g. map())
  • Working with: Functions, Methods, and Function Literals
  • Define the Following as they Relate to Scala: Class, Object, and Case Class
  • Overview, Motivations, Spark Systems
  • Spark Ecosystem
  • Spark vs. Hadoop
  • Acquiring and Installing Spark
  • The Spark Shell, SparkContext
  • Setting Up the Lab Environment
  • Starting the Scala Interpreter
  • A First Look at Spark
  • A First Look at the Spark Shell

DAY 2 – RDDs and Spark Architecture, Spark SQL, DataFrames and DataSets

  • RDD Concepts, Lifecycle, Lazy Evaluation
  • RDD Partitioning and Transformations
  • Working with RDDs Including: Creating and Transforming
  • An Overview of RDDs
  • SparkSession, Loading/Saving Data, Data Formats
  • Introducing DataFrames and DataSets
  • Identify Supported Data Formats
  • Working with the DataFrame (untyped) Query DSL
  • SQL-based Queries
  • Working with the DataSet (typed) API
  • Mapping and Splitting
  • DataSets vs. DataFrames vs. RDDs
  • RDD Basics
  • Operations on Multiple RDDs
  • Data Formats
  • Spark SQL Basics
  • DataFrame Transformations
  • The DataSet Typed API
  • Splitting Up Data
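Lazy evaluation, listed above under RDD concepts, is the key behavioral difference from ordinary collections: transformations only describe work, and nothing runs until an action demands a result. As a Spark-free sketch of the concept, Python generators show the same deferred-execution shape (illustrative only, not Spark code):

```python
calls = []

def numbers():
    """Stand-in data source: records when each element is actually produced."""
    for n in range(5):
        calls.append(n)
        yield n

# "Transformations": keep evens, square them -- nothing has executed yet
pipeline = (n * n for n in numbers() if n % 2 == 0)
assert calls == []          # the pipeline is defined lazily

# The "action" forces evaluation of the whole pipeline
total = sum(pipeline)       # 0 + 4 + 16
```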

Day 3 – Shuffling, Transformations and Performance, Performance Tuning

  • Working with: Grouping, Reducing, Joining
  • Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
  • Exploring the Catalyst Query Optimizer
  • The Tungsten Optimizer
  • Discuss Caching, Including: Concepts, Storage Type, Guidelines
  • Minimizing Shuffling for Increased Performance
  • Using Broadcast Variables and Accumulators
  • General Performance Guidelines
  • Exploring Group Shuffling
  • Seeing Catalyst at Work
  • Seeing Tungsten at Work
  • Working with Caching, Joins, Shuffles, Broadcasts, Accumulators
  • Broadcast General Guidelines

Day 4 – Creating Standalone Applications and Spark Streaming

  • Core API, SparkSession.Builder
  • Configuring and Creating a SparkSession
  • Building and Running Applications
  • Application Lifecycle (Driver, Executors, and Tasks)
  • Cluster Managers (Standalone, YARN, Mesos)
  • Logging and Debugging
  • Introduction and Streaming Basics
  • Spark Streaming (Spark 1.0+)
  • Structured Streaming (Spark 2+)
  • Consuming Kafka Data
  • Spark Job Submission
  • Additional Spark Capabilities
  • Spark Streaming
  • Spark Structured Streaming
  • Spark Structured Streaming with Kafka

CLD21 – HDP Operations: Administration Foundations


This four-day instructor-led training course provides students with the foundational knowledge required to plan, deploy, configure, and manage a cluster running the Hortonworks Data Platform (HDP).

What You Will Learn

Students who successfully complete this course will learn how to administer Apache Hadoop and the Hortonworks Data Platform (HDP). You will be able to:

  • Install the Hortonworks Data Platform
  • Manage Hadoop services
  • Use and manage Hadoop Distributed File System (HDFS) Storage
  • Configure rack awareness
  • Manage cluster nodes and cluster node storage
  • Use HDFS snapshots and Distributed Copy (DistCp)
  • Configure heterogeneous storage and HDFS centralized cache
  • Configure an HDFS NFS gateway and NameNode high availability
  • Describe the View File System (ViewFS)
  • Manage YARN resources and run YARN applications
  • Configure the YARN capacity scheduler, containers, and queues to manage computing resources
  • Configure YARN node labels and YARN ResourceManager high availability
  • Manage Ambari alerts
  • Deploy an HDP cluster using Ambari blueprints
  • Upgrade a cluster to a newer version of HDP

What to Expect

This course is designed primarily for system administrators and system operators responsible for installing, configuring, and managing an HDP cluster.

Students must have experience working in a Linux environment with standard Linux system commands. Students should be able to read and execute basic Linux shell scripts. In addition, we recommend that students have some operational experience in data center practices.

CLD14 – Just Enough Scala


Scala is a programming language that runs on the Java Virtual Machine and interoperates with Java, blending the object-oriented and the functional programming paradigms. The language is complex and could take a semester or more to master. This class focuses only on the elements that are necessary to be able to program in Cloudera’s training courses. 

Immersive Training

Through instructor-led discussion or OnDemand videos, as well as hands-on exercises, participants will learn:

  • What Scala is and how it differs from languages such as Java or Python
  • Why Scala is a good choice for Spark programming
  • How to use key language features such as data types, collections, and flow control
  • How to implement functional programming solutions in Scala
  • How to work with Scala classes, packages, and libraries 

Audience and prerequisites

Basic knowledge of programming concepts such as objects, conditional statements, and looping is required. This course is best suited to students with Java programming experience. Those with experience in another language may prefer the Just Enough Python course. Basic knowledge of Linux is assumed.

Please note that this course does not teach big data concepts, nor does it cover how to use Cloudera software. Instead, it is meant as a precursor for one of our developer-focused training courses that provide those skills. 

CLD15 – Just Enough Python


Cloudera University’s Python training course will teach you the key language concepts and programming techniques you need so that you can concentrate on the subjects covered in Cloudera’s developer courses without also having to learn a complex programming language and a new programming paradigm on the fly. 

Immersive Training

Through instructor-led discussion, as well as hands-on exercises, participants will learn:

  • How to define, assign, and access variables
  • Which collection types are commonly used, how they differ, and how to use them
  • How to control program flow using conditional statements, looping, iteration, and exception handling
  • How to define and use both named and anonymous (Lambda) functions
  • How to organize code into separate modules
  • How to use important features of standard Python libraries, including mathematical and regular expression support 
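Several of the bullets above fit in a few lines. The following illustrative sketch (not the course’s exercise code; the names and data are made up) touches collections, anonymous functions, and the standard library’s regular expression support:

```python
import re

# Collections: dicts, lists, and a comprehension filtering by value
scores = {"alice": 82, "bob": 74, "carol": 91}
passing = [name for name, score in scores.items() if score >= 80]

# Anonymous (lambda) functions, here used as a sort key
ranked = sorted(scores, key=lambda name: scores[name], reverse=True)

# Regular expression support from the standard library
log_line = "2024-05-01 ERROR disk full"
match = re.search(r"(ERROR|WARN) (.+)", log_line)
level, message = match.group(1), match.group(2)
```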

Audience and prerequisites

Prior knowledge of Hadoop is not required. Since this course is intended for developers who do not yet have experience writing code in Python, basic programming experience in at least one commonly used programming language (ideally Java, but Ruby, Perl, Scala, C, C++, PHP, or JavaScript will suffice) is assumed. 

Please note that this course does not teach big data concepts, nor does it cover how to use Cloudera software. Instead, it is meant as a precursor for one of our developer-focused training courses that provide those skills. 

CLD16 – HDP Self-Paced Learning Library


The HDP OnDemand Library offers individuals anytime, anywhere access to a collection of self-paced training courses on the Hortonworks Data Platform (HDP). Courses are available for administrators, developers, and data scientists. The HDP courses are designed and developed by Hadoop experts and provide an immersive and valuable real-world experience. In our scenario-based training courses, we offer unmatched depth and expertise. Our Hadoop learning path prepares you to be an expert with highly valued, practical skills. The HDP OnDemand Library accelerates the journey to Hadoop competency.

Current courses include: 

  • HDP Overview: Apache Hadoop Essentials (HDP-123)
  • HDP Developer: Spark (DEV-343)
  • HDF: NiFi Flow Management (ADM-301)
  • HDP Operations: Administration Foundations (ADM-221)
  • HDP Operations: Security (ADM-351)
  • HDP Data Science (SCI-241)

CLD17 – HDP Apache Hive Training


This four-day training course is designed for analysts and developers who need to create and analyze Big Data stored in Apache Hadoop using Hive. Topics include: Understanding of HDP and HDF and their integration with Hive; Hive on Tez, LLAP, and Druid OLAP query analysis; Hive data ingestion using HDF and Spark; and Enterprise Data Warehouse offload capabilities in HDP using Hive.


Students should be familiar with programming principles and have experience in software development. Knowledge of SQL, data modeling, and scripting is also helpful. No prior Hadoop knowledge is needed.

Course Details

Information Architecture and Big Data

  • Enterprise Data Warehouse Optimization

Introduction to Apache Hive

  • About Apache Hive
  • About Apache Zeppelin and Apache Superset (incubating)

Apache Hive Architecture

  • Apache Hive Architecture

Apache Hive Programming

  • Apache Hive Basics
  • Apache Hive Transactions (Hive ACID)

File Formats

  • SerDes and File Formats

Partitions and Bucketing

  • Partitions
  • Bucketing
  • Skew and Temporary Tables

Advanced Apache Hive Programming

  • Data Sorting
  • Apache Hive User Defined Functions (UDFs)
  • Subqueries and Views
  • Joins
  • Windowing and Grouping
  • Other Topics

Apache Hive Performance Tuning

  • Cost-Based Optimization and Statistics
  • Bloom Filters
  • Execution and Resource Plans

Live Long and Process (LLAP) Deep Dive

  • Live Long and Process Overview
  • Apache Hive and LLAP Performance
  • Apache Hive and LLAP Installation

Security and Data Governance

  • Apache Ranger
  • Apache Ranger and Hive
  • Apache Atlas
  • Apache Atlas and Hive Integration

Apache HBase and Phoenix Integration with Hive

  • Apache HBase Overview
  • Apache HBase Integration with Apache Hive
  • Apache Phoenix Overview

Apache Druid (incubating) with Apache Hive

  • Apache Druid (incubating) Overview
  • Apache Druid (incubating) Queries
  • Apache Druid (incubating) and Hive Integration

Apache Sqoop and Integration with Apache Hive

  • Overview of Apache Sqoop

Apache Spark and Integration with Apache Hive

  • Introduction to Apache Spark
  • Apache Hive and Spark

Introduction to HDF (Apache NiFi) and Integration with Apache Hive

  • Introduction to Apache NiFi
  • Apache NiFi and Apache Hive

CLD12 – Apache HBase Training


Take your knowledge to the next level with Cloudera Training for Apache HBase. Cloudera Educational Services’ three-day training course enables participants to store and access massive quantities of multi-structured data and perform hundreds of thousands of operations per second.

Hands-on Hadoop

Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem, learning topics such as:

  • The use cases and usage occasions for HBase, Hadoop, and RDBMS
  • Using the HBase shell to directly manipulate HBase tables
  • Designing optimal HBase schemas for efficient data storage and recovery
  • How to connect to HBase using the Java API to insert and retrieve data in real time
  • Best practices for identifying and resolving performance bottlenecks

Audience and prerequisites

This course is appropriate for developers and administrators who intend to use HBase. Prior experience with databases and data modeling is helpful, but not required. Knowledge of Java is assumed. Prior knowledge of Hadoop is not required, but Cloudera Developer Training for Spark and Hadoop provides an excellent foundation for this course.

Course Contents


Introduction to Hadoop and HBase

  • Introducing Hadoop
  • Core Hadoop Components
  • What Is HBase?
  • Why Use HBase?
  • Strengths of HBase
  • HBase in Production
  • Weaknesses of HBase

HBase Tables

  • HBase Concepts
  • HBase Table Fundamentals
  • Thinking About Table Design
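A common way to frame the table fundamentals above: HBase is often described as a sparse, sorted, multidimensional map from row key to column family to qualifier to value. Plain Python dicts can sketch that shape (conceptual only; real HBase also versions cells by timestamp, and the row key and values here are made up):

```python
# Conceptual model: row key -> column family -> qualifier -> value
table = {}

def put(table, row, family, qualifier, value):
    """Insert or overwrite a single cell."""
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(table, row, family, qualifier):
    """Read a single cell, or None if any level is absent (sparse table)."""
    return table.get(row, {}).get(family, {}).get(qualifier)

put(table, "user#1001", "info", "name", "Ada")
put(table, "user#1001", "info", "email", "ada@example.com")
name = get(table, "user#1001", "info", "name")

# Real HBase keeps rows sorted by row key, which is why range scans are cheap
sorted_rows = sorted(table)
```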

HBase Shell

  • Creating Tables with the HBase Shell
  • Working with Tables
  • Working with Table Data

HBase Architecture Fundamentals

  • HBase Regions
  • HBase Cluster Architecture
  • HBase and HDFS Data Locality

HBase Schema Design

  • General Design Considerations
  • Application-Centric Design
  • Designing HBase Row Keys
  • Other HBase Table Features

Basic Data Access with the HBase API

  • Options to Access HBase Data
  • Creating and Deleting HBase Tables
  • Retrieving Data with Get
  • Retrieving Data with Scan
  • Inserting and Updating Data
  • Deleting Data

More Advanced HBase API Features

  • Filtering Scans
  • Best Practices
  • HBase Coprocessors

HBase Write Path

  • HBase Write Path
  • Compaction
  • Splits

HBase Read Path

  • How HBase Reads Data
  • Block Caches for Reading

HBase Performance Tuning

  • Column Family Considerations
  • Schema Design Considerations
  • Configuring for Caching
  • Memory Considerations
  • Dealing with Time Series and Sequential Data
  • Pre-Splitting Regions
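Two of the bullets above, time-series handling and pre-splitting, usually come down to row-key design. A hedged sketch of two common tricks, salting and timestamp reversal (the key layout, bucket count, and MAX_TS ceiling are illustrative assumptions, not an HBase API):

```python
import zlib

MAX_TS = 10**13  # assumed epoch-millis ceiling, used only for the reversal trick

def row_key(sensor_id, ts_millis, salt_buckets=4):
    """Hypothetical key layout: salt|sensor|reversed-timestamp."""
    salt = zlib.crc32(sensor_id.encode()) % salt_buckets  # spreads hot writers
    reversed_ts = MAX_TS - ts_millis                      # newest sorts first
    return f"{salt:02d}|{sensor_id}|{reversed_ts:013d}"

older = row_key("sensor-a", 1_700_000_000_000)
newer = row_key("sensor-a", 1_700_000_000_500)
# Lexicographically, the newer reading now sorts before the older one
```

The salt prefix spreads sequential writes across pre-split regions, and subtracting the timestamp from a fixed maximum makes the most recent rows the first ones a scan encounters.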

HBase Administration and Cluster Management

  • HBase Daemons
  • ZooKeeper Considerations
  • HBase High Availability
  • Using the HBase Balancer
  • Fixing Tables with hbck
  • HBase Security

HBase Replication and Backup

  • HBase Replication
  • HBase Backup
  • MapReduce and HBase Clusters

Using Hive and Impala with HBase

  • How to Use Hive and Impala to Access HBase


Appendix A: Accessing Data with Python and Thrift

  • Thrift Usage
  • Working with Tables
  • Getting and Putting Data
  • Scanning Data
  • Deleting Data
  • Counters
  • Filters

Appendix B: OpenTSDB

CLD13 – Introduction to Apache Kudu


Cloudera’s Introduction to Apache Kudu training teaches students the basics of Apache Kudu, a data storage system for the Hadoop platform that is optimized for analytical queries. The course covers common Kudu use cases and Kudu architecture. Students will learn how to create, manage, and query Kudu tables, and to develop Spark applications that use Kudu.

Get hands-on experience

Through instructor-led discussion, as well as hands-on exercises, participants will learn topics including:

  • A high-level explanation of Kudu
  • How Kudu compares to other relevant storage systems, and which use cases are best implemented with Kudu
  • Kudu’s architecture, and how to design tables that store data for optimum performance
  • Data management techniques for inserting, updating, and deleting records in Kudu tables using Impala, as well as bulk loading methods
  • Developing Apache Spark applications with Apache Kudu

What to expect

This material is intended for a broad audience of students involved with either software development or data analysis. This would include software developers, data engineers, DBAs, data scientists, and data analysts.

Students should know SQL. Familiarity with Impala is preferred but not required. Students should also know how to develop Apache Spark applications using either Python or Scala. Basic Linux experience is expected. 

Course Contents


Overview and Architecture

  • What Is Kudu?
  • Why Use Kudu?
  • Kudu Use Cases
  • Architecture Overview
  • Kudu Tools
  • Essential Points

Apache Kudu Tables

  • Kudu Tables
  • Data Storage Options
  • Designing Schemas
  • Partitioning Tables for Best Performance
  • Using Kudu Tools with Tables
  • Essential Points
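Partitioning tables for best performance, listed above, relies on hash partitioning to spread rows evenly across tablets. A hedged sketch of the idea (Kudu’s actual hash function differs from the CRC32 stand-in here; only the even spread is the point):

```python
import zlib

def partition_for(key, num_partitions=4):
    """Map a primary-key value to one of num_partitions tablets.
    CRC32 stands in for Kudu's real hash function."""
    return zlib.crc32(str(key).encode()) % num_partitions

# 1000 distinct keys land deterministically, one tablet each, roughly evenly
counts = [0] * 4
for i in range(1000):
    counts[partition_for(f"user-{i}")] += 1
```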

Using Apache Kudu with Apache Impala

  • Apache Impala Overview
  • Creating and Querying Tables
  • Deleting Tables
  • Loading and Modifying Data in Kudu Tables
  • Defining Partitioning Strategy
  • Essential Points

Developing Apache Spark Applications with Apache Kudu

  • Apache Spark and Apache Kudu
  • Kudu, Spark SQL, and DataFrames
  • Managing Kudu Table Data with Scala
  • Creating Kudu Tables with Scala
  • Essential Points