Cloudera – Knowledge Factory

CLD18 – HDP Data Science

Overview

This course provides instruction on the theory and practice of data science, including machine learning and natural language processing. It introduces many of the core concepts behind today’s most commonly used algorithms and applies them in practical settings. We’ll discuss concepts and key algorithms in all of the major areas (Classification, Regression, Clustering, and Dimensionality Reduction) and include a primer on Neural Networks. We’ll focus on both single-server tools and frameworks (Python, NumPy, pandas, SciPy, Scikit-learn, NLTK, TensorFlow, Jupyter) and large-scale tools and frameworks (Spark MLlib, Stanford CoreNLP, TensorFlowOnSpark/Horovod/MLeap, Apache Zeppelin). Download the data sheet to view the full list of objectives and labs.

Prerequisites
Students must have experience with Python, Scala, and Spark, prior exposure to statistics and probability, and a basic understanding of big data and Hadoop principles. While brief reviews of these topics are offered, students new to Hadoop are encouraged to attend the Apache Hadoop Essentials (HDP-123) course and HDP Spark Developer (DEV-343), as well as the language-specific introduction courses.

Target Audience
Architects, software developers, analysts, and data scientists who need to apply data science and machine learning on Spark/Hadoop.

DAY 1 – An Introduction to Data Science, SciKit-Learn, HDFS, Reviewing Spark apps, DataFrames and NoSQL

OBJECTIVES

Discuss aspects of Data Science, the team members, and the team roles
Discuss use cases for Data Science
Discuss the current State of the Art and its future direction
Review HDFS, Spark, Jupyter, and Zeppelin
Work with SciKit-Learn, Pandas, NumPy, Matplotlib, and Seaborn

LABS

Hello, ML w/ SciKit-Learn
Spark REPLs, Spark Submit, & Zeppelin Review
HDFS Review
Spark DataFrames and Files
NiFi Review

DAY 2 – Algorithms in Spark ML and SciKit-Learn: Linear Regression, Logistic Regression, Support Vectors, Decision Trees

OBJECTIVES

Discuss categories and use cases of the various ML Algorithms
Understand Linear Regression, Logistic Regression, and Support Vectors
Understand Decision Trees and their limitations
Understand Nearest-Neighbors
Discuss and demonstrate a Spam Classifier

LABS

Linear Regression as a Projection
Logistic Regression
Support Vectors
Decision Trees
Linear Regression as a Classifier

DAY 3 – K-Means & GMM Clustering, Essential TensorFlow, NLP with NLTK, NLP with Stanford CoreNLP

OBJECTIVES

Discuss and understand Clustering Algorithms
Work with TensorFlow to create a basic neural network
Discuss Natural Language Processing
Discuss Dimensionality Reduction Algorithms

LABS

K-Means Clustering
GMM Clustering
Essential TensorFlow
Sentiment Analysis
Dimensionality Reduction with PCA

DAY 4 – HyperParameter Tuning, K-Fold Validation, Ensemble Methods, ML Pipelines in SparkML

OBJECTIVES

Discuss Hyper-Parameter Tuning and K-Fold Validation
Understand Ensemble Models
Discuss ML Pipelines in Spark MLlib
Discuss ML in production and real-world issues
Demonstrate TensorFlowOnSpark

LABS

Hyper-parameter tuning
K-Fold Validation
Ensemble Methods
ML Pipelines in SparkML
Demo: TensorFlowOnSpark
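
As an illustration of the K-Fold Validation topic listed for Day 4, the following is a minimal sketch in plain Python of how a dataset's row indices can be split into k train/test folds. It is not taken from the course labs; the function name `k_fold_indices` and the sample sizes are hypothetical, and a real lab would more likely rely on a library implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds.

    Hypothetical helper for illustration only; not part of any library.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Split 10 samples into 3 folds; each sample is used for testing exactly once.
folds = list(k_fold_indices(10, 3))
```

Each fold holds out a different contiguous slice for testing and trains on the remainder, so a model's score can be averaged over all k folds. This sketch does not shuffle the data first, which is usually advisable in practice.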

CLD19 – HDP Operations: Apache Hadoop Security Training

Overview

This course is designed for experienced administrators who will be implementing secure Hadoop clusters using authentication, authorization, auditing and data protection strategies and tools. Download the data sheet to view the full list of course objectives and labs.

Prerequisites
Students should be experienced in the management of Hadoop using Ambari and Linux environments. Completion of the Hadoop Administration I course is highly recommended.

Target Audience
IT administrators and operators responsible for installing, configuring and supporting an Apache Hadoop deployment in a Linux environment.

DAY 1 – An Introduction to Security

OBJECTIVES

Definition of Security
Securing Sensitive Data
Integrating HDP Security
What Security Tools to use for Each Use Case
HDP Security Prerequisites
Ambari Server Security
Kerberos Deep Dive

LABS

Setting up the Lab Environment
Configuring the AD Resolution Certificate
Security Options for Ambari

DAY 2 – Working with Kerberos and Apache Ranger

OBJECTIVES

Enable Kerberos
Apache Ranger Installation
Apache Ranger KMS

LABS

Kerberizing the Cluster
Installing Apache Ranger
Setting up Apache Ranger KMS Data Encryption

DAY 3 – Working with Apache Knox

OBJECTIVES

Secure Access with Ranger
Apache Knox Overview
Apache Knox Installation
Ambari Views for Controlled Access

LABS

Secured Hadoop Exercises
Configuring Apache Knox
Exploring Other Security Features of Apache Ambari

CLD20 – HDP Spark Developer

Overview

This course introduces the Apache Spark distributed computing engine, and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. It is based on the Spark 2.x release. The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g. RDDs and the distributed compute engine), as well as higher-level constructs that provide a simpler and more capable interface. It includes in-depth coverage of Spark SQL, DataFrames, and DataSets, which are now the preferred programming API. This includes exploring possible performance issues and strategies for optimization. The course also covers more advanced capabilities such as the use of Spark Streaming to process streaming data, and integrating with the Kafka server.

Prerequisites
Students should be familiar with programming principles and have previous experience in software development using Scala. Previous experience with data streaming, SQL, and HDP is also helpful, but not required.

Target Audience
Software engineers who are looking to develop in-memory applications for time-sensitive and highly iterative workloads in an Enterprise HDP environment.

DAY 1 – Scala Ramp Up, Introduction to Spark

OBJECTIVES

Scala Introduction
Working with: Variables, Data Types, and Control Flow
The Scala Interpreter
Collections and their Standard Methods (e.g. map())
Working with: Functions, Methods, and Function Literals
Define the Following as they Relate to Scala: Class, Object, and Case Class
Overview, Motivations, Spark Systems
Spark Ecosystem
Spark vs. Hadoop
Acquiring and Installing Spark
The Spark Shell, SparkContext

LABS

Setting Up the Lab Environment
Starting the Scala Interpreter
A First Look at Spark
A First Look at the Spark Shell

DAY 2 – RDDs and Spark Architecture, Spark SQL, DataFrames and DataSets

OBJECTIVES

RDD Concepts, Lifecycle, Lazy Evaluation
RDD Partitioning and Transformations
Working with RDDs Including: Creating and Transforming
An Overview of RDDs
SparkSession, Loading/Saving Data, Data Formats
Introducing DataFrames and DataSets
Identify Supported Data Formats
Working with the DataFrame (untyped) Query DSL
SQL-based Queries
Working with the DataSet (typed) API
Mapping and Splitting
DataSets vs. DataFrames vs. RDDs

LABS

RDD Basics
Operations on Multiple RDDs
Data Formats
Spark SQL Basics
DataFrame Transformations
The DataSet Typed API
Splitting Up Data

DAY 3 – Shuffling, Transformations and Performance, Performance Tuning

OBJECTIVES

Working with: Grouping, Reducing, Joining
Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
Exploring the Catalyst Query Optimizer
The Tungsten Optimizer
Discuss Caching, Including: Concepts, Storage Type, Guidelines
Minimizing Shuffling for Increased Performance
Using Broadcast Variables and Accumulators
General Performance Guidelines

LABS

Exploring Group Shuffling
Seeing Catalyst at Work
Seeing Tungsten at Work
Working with Caching, Joins, Shuffles, Broadcasts, Accumulators
Broadcast General Guidelines

DAY 4 – Creating Standalone Applications and Spark Streaming

OBJECTIVES

Core API, SparkSession.Builder
Configuring and Creating a SparkSession
Building and Running Applications
Application Lifecycle (Driver, Executors, and Tasks)
Cluster Managers (Standalone, YARN, Mesos)
Logging and Debugging
Introduction and Streaming Basics
Spark Streaming (Spark 1.0+)
Structured Streaming (Spark 2+)
Consuming Kafka Data

LABS

Spark Job Submission
Additional Spark Capabilities
Spark Streaming
Spark Structured Streaming
Spark Structured Streaming with Kafka

CLD21 – HDP Operations: Administration Foundations

Overview

This four-day instructor-led training course provides students with the foundational knowledge required to plan, deploy, configure, and manage a cluster running the Hortonworks Data Platform (HDP).

What You Will Learn

Students who successfully complete this course will learn how to administer Apache Hadoop and the Hortonworks Data Platform (HDP). You will be able to:

Install the Hortonworks Data Platform
Manage Hadoop services
Use and manage Hadoop Distributed File System (HDFS) Storage
Configure rack awareness
Manage cluster nodes and cluster node storage
Use HDFS snapshots and Distributed Copy (DistCp)
Configure heterogeneous storage and HDFS centralized cache
Configure an HDFS NFS gateway and NameNode high availability
Describe the View File System (ViewFS)
Manage YARN resources and run YARN applications
Configure the YARN capacity scheduler, containers, and queues to manage computing resources
Configure YARN node labels and YARN ResourceManager high availability
Manage Ambari alerts
Deploy an HDP cluster using Ambari blueprints
Upgrade a cluster to a newer version of HDP

What to Expect

This course is designed primarily for system administrators and system operators responsible for installing, configuring, and managing an HDP cluster.

Students must have experience working in a Linux environment with standard Linux system commands. Students should be able to read and execute basic Linux shell scripts. In addition, we recommend that students have some operational experience in data center practices.

CLD14 – Just Enough Scala

Overview

Scala is a programming language that runs on the Java Virtual Machine and interoperates with Java, blending the object-oriented and the functional programming paradigms. The language is complex and could take a semester or more to master. This class focuses only on the elements that are necessary to be able to program in Cloudera’s training courses.

Immersive Training

Through instructor-led discussion or OnDemand videos, as well as hands-on exercises, participants will learn:

What Scala is and how it differs from languages such as Java or Python
Why Scala is a good choice for Spark programming
How to use key language features such as data types, collections, and flow control
How to implement functional programming solutions in Scala
How to work with Scala classes, packages, and libraries

Audience and prerequisites

Basic knowledge of programming concepts such as objects, conditional statements, and looping is required. This course is best suited to students with Java programming experience. Those with experience in another language may prefer the Just Enough Python course. Basic knowledge of Linux is assumed.

Please note that this course does not teach big data concepts, nor does it cover how to use Cloudera software. Instead, it is meant as a precursor for one of our developer-focused training courses that provide those skills.

CLD15 – Just Enough Python

Overview

Cloudera University’s Python training course will teach you the key language concepts and programming techniques you need so that you can concentrate on the subjects covered in Cloudera’s developer courses without also having to learn a complex programming language and a new programming paradigm on the fly.

Immersive Training

Through instructor-led discussion, as well as hands-on exercises, participants will learn:

How to define, assign, and access variables
Which collection types are commonly used, how they differ, and how to use them
How to control program flow using conditional statements, looping, iteration, and exception handling
How to define and use both named and anonymous (Lambda) functions
How to organize code into separate modules
How to use important features of standard Python libraries, including mathematical and regular expression support
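
The topics in the list above can be sketched in a few lines of plain Python. This is an illustrative example only, not material from the course: the names `circle_area`, `to_int`, and the sample data are hypothetical and use nothing beyond the standard library.

```python
import math
import re


def circle_area(radius):
    """Named function that uses the standard math module."""
    return math.pi * radius ** 2


def to_int(text, default=0):
    """Exception handling around a conversion that can fail."""
    try:
        return int(text)
    except ValueError:
        return default


# A list (ordered collection) sorted with an anonymous (lambda) key function.
words = ["spark", "hive", "kudu", "impala"]
by_length = sorted(words, key=lambda w: len(w))

# A dictionary (key/value collection) filled using a loop and a conditional.
lengths = {}
for w in words:
    if len(w) > 4:
        lengths[w] = len(w)

# Regular expression support: find codes shaped like "DEV-343".
codes = re.findall(r"[A-Z]{3}-\d{3}", "See HDP-123 and DEV-343 for details.")
```

Because `sorted` is stable, words of equal length keep their original relative order, and `to_int` returns the supplied default instead of raising when the text is not a valid integer.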

Audience and prerequisites

Prior knowledge of Hadoop is not required. Since this course is intended for developers who do not yet have the prerequisite skills writing code in Python, basic programming experience in at least one commonly used programming language (ideally Java, but Ruby, Perl, Scala, C, C++, PHP, or JavaScript will suffice) is assumed.

CLD16 – HDP Self-Paced Learning Library

Overview

The HDP OnDemand Library offers an individual anytime, anywhere access to a collection of self-paced training courses on the Hortonworks Data Platform (HDP). Courses are available for administrators, developers, and data scientists. The HDP courses are designed and developed by Hadoop experts and provide an immersive and valuable real-world experience. In our scenario-based training courses, we offer unmatched depth and expertise. Our Hadoop learning path prepares you to be an expert with highly valued, practical skills. The HDP OnDemand Library accelerates the journey to Hadoop competency.

Current courses include:

HDP Overview: Apache Hadoop Essentials (HDP-123)
HDP Developer: Spark (DEV-343)
HDF: NiFi Flow Management (ADM-301)
HDP Operations: Administration Foundations (ADM-221)
HDP Operations: Security (ADM-351)
HDP Data Science (SCI-241)

CLD17 – HDP Apache Hive Training

Overview

This four-day training course is designed for analysts and developers who need to create and analyze Big Data stored in Apache Hadoop using Hive. Topics include: Understanding of HDP and HDF and their integration with Hive; Hive on Tez, LLAP, and Druid OLAP query analysis; Hive data ingestion using HDF and Spark; and Enterprise Data Warehouse offload capabilities in HDP using Hive.

Prerequisites

Students should be familiar with programming principles and have experience in software development. Knowledge of SQL, data modeling, and scripting is also helpful. No prior Hadoop knowledge is needed.

Course Details

Information Architecture and Big Data

Enterprise Data Warehouse Optimization

Introduction to Apache Hive

About Apache Hive
About Apache Zeppelin and Apache Superset (incubating)

Apache Hive Architecture

Apache Hive Architecture

Apache Hive Programming

Apache Hive Basics
Apache Hive Transactions (Hive ACID)

File Formats

SerDes and File Formats

Partitions and Bucketing

Partitions
Bucketing
Skew and Temporary Tables

Advanced Apache Hive Programming

Data Sorting
Apache Hive User Defined Functions (UDFs)
Subqueries and Views
Joins
Windowing and Grouping
Other Topics

Apache Hive Performance Tuning

Cost-Based Optimization and Statistics
Bloom Filters
Execution and Resource Plans

Live Long and Process (LLAP) Deep Dive

Live Long and Process Overview
Apache Hive and LLAP Performance
Apache Hive and LLAP Installation

Security and Data Governance

Apache Ranger
Apache Ranger and Hive
Apache Atlas
Apache Atlas and Hive Integration

Apache HBase and Phoenix Integration with Hive

Apache HBase Overview
Apache Ranger and Hive
Apache HBase Integration with Apache Hive
Apache Phoenix Overview

Apache Druid (incubating) with Apache Hive

Apache Druid (incubating) Overview
Apache Druid (incubating) Queries
Apache Druid (incubating) and Hive Integration

Apache Sqoop and Integration with Apache Hive

Overview of Apache Sqoop

Apache Spark and Integration with Apache Hive

Introduction to Apache Spark
Apache Hive and Spark

Introduction to HDF (Apache NiFi) and Integration with Apache Hive

Introduction to Apache NiFi
Apache NiFi and Apache Hive

CLD13 – Introduction to Apache Kudu

Overview

Cloudera’s Introduction to Apache Kudu training teaches students the basics of Apache Kudu, a data storage system for the Hadoop platform that is optimized for analytical queries. The course covers common Kudu use cases and Kudu architecture. Students will learn how to create, manage, and query Kudu tables, and to develop Spark applications that use Kudu.

Get hands-on experience

Through instructor-led discussion, as well as hands-on exercises, participants will learn topics including:

A high-level explanation of Kudu
How Kudu compares to other relevant storage systems and which use cases are best implemented with it
Kudu’s architecture and how to design tables that store data for optimum performance
Data management techniques for inserting, updating, or deleting records in Kudu tables using Impala, as well as bulk loading methods
How to develop Apache Spark applications with Apache Kudu

What to expect

This material is intended for a broad audience of students involved with either software development or data analysis. This would include software developers, data engineers, DBAs, data scientists, and data analysts.

Students should know SQL. Familiarity with Impala is preferred but not required. Students should also know how to develop Apache Spark applications using either Python or Scala. Basic Linux experience is expected.

Course Contents

Introduction

Overview and Architecture

What Is Kudu?
Why Use Kudu?
Kudu Use Cases
Architecture Overview
Kudu Tools
Essential Points

Apache Kudu Tables

Kudu Tables
Data Storage Options
Designing Schemas
Partitioning Tables for Best Performance
Using Kudu Tools with Tables
Essential Points

Using Apache Kudu with Apache Impala

Apache Impala Overview
Creating and Querying Tables
Deleting Tables
Loading and Modifying Data in Kudu Tables
Defining Partitioning Strategy
Essential Points

Developing Apache Spark Applications with Apache Kudu

Apache Spark and Apache Kudu
Kudu, Spark SQL, and DataFrames
Managing Kudu Table Data with Scala
Creating Kudu Tables with Scala
Essential Points

Conclusion

CLD10 – Cloudera Data Science Workbench Training

Overview

Cloudera Data Science Workbench Training prepares learners to complete data science and machine learning projects using Cloudera Data Science Workbench (CDSW).

Get Hands-On Experience

Through narrated demonstrations and hands-on exercises, learners achieve proficiency in CDSW and develop the skills required to:

Navigate CDSW’s options and interfaces with confidence
Create projects in CDSW and collaborate securely with other users and teams
Develop and run reproducible Python and R code
Customize projects by installing packages and setting environment variables
Connect to a secure (Kerberized) Cloudera or Hortonworks cluster
Work with large-scale data using Apache Spark 2 with PySpark and sparklyr
Perform end-to-end machine learning workflows in CDSW using Python or R (read, inspect, transform, visualize, and model data)
Measure, track, and compare machine learning models using CDSW’s Experiments capability
Deploy models as REST API endpoints serving predictions using CDSW’s Models capability
Work collaboratively using CDSW together with Git

What to Expect

This OnDemand course is designed for learners at organizations using CDSW under a trial license or a commercial license. The learner must have access to a CDSW environment on a Cloudera or Hortonworks cluster running Apache Spark 2. Some experience with data science using Python or R is helpful but not required. No prior knowledge of Spark or other Hadoop ecosystem tools is required.