CLD18 – HDP Data Science

Overview

This course provides instruction on the theory and practice of data science, including machine learning and natural language processing. It introduces many of the core concepts behind today’s most commonly used algorithms and demonstrates them in practical applications. We’ll discuss concepts and key algorithms in all of the major areas – Classification, Regression, Clustering, and Dimensionality Reduction – and include a primer on Neural Networks. We’ll work with both single-server tools and frameworks (Python, NumPy, pandas, SciPy, scikit-learn, NLTK, TensorFlow, Jupyter) and large-scale tools and frameworks (Spark MLlib, Stanford CoreNLP, TensorFlowOnSpark/Horovod/MLeap, Apache Zeppelin). Download the data sheet to view the full list of objectives and labs.

  • Prerequisites
    Students must have experience with Python, Scala, and Spark, prior exposure to statistics and probability, and a basic understanding of big data and Hadoop principles. While brief reviews of these topics are offered, students new to Hadoop are encouraged to attend the Apache Hadoop Essentials (HDP-123) course and HDP Spark Developer (DEV-343), as well as the language-specific introduction courses.
  • Target Audience
    Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Spark/Hadoop

DAY 1 – An Introduction to Data Science, scikit-learn, HDFS, Reviewing Spark Apps, DataFrames, and NoSQL

OBJECTIVES
  • Discuss aspects of Data Science, the team members, and the team roles
  • Discuss use cases for Data Science
  • Discuss the current State of the Art and its future direction
  • Review HDFS, Spark, Jupyter, and Zeppelin
  • Work with scikit-learn, pandas, NumPy, Matplotlib, and Seaborn
LABS
  • Hello, ML w/ scikit-learn (see the sketch after this list)
  • Spark REPLs, Spark Submit, & Zeppelin Review 
  • HDFS Review 
  • Spark DataFrames and Files 
  • NiFi Review
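
For a flavor of the first lab, here is a minimal scikit-learn sketch of a “hello, ML” classifier. The dataset (Iris) and model choice are illustrative assumptions, not the actual lab content:

    # Load a toy dataset, split it, fit a baseline classifier, and score it.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)          # illustrative dataset choice
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)

    model = LogisticRegression(max_iter=200)   # a simple baseline model
    model.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))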

DAY 2 – Algorithms in Spark ML and scikit-learn: Linear Regression, Logistic Regression, Support Vectors, Decision Trees

OBJECTIVES
  • Discuss categories and use cases of the various ML Algorithms
  • Understand Linear Regression, Logistic Regression, and Support Vectors
  • Understand Decision Trees and their limitations
  • Understand Nearest-Neighbors
  • Discuss and demonstrate a Spam Classifier
LABS
  • Linear Regression as a Projection 
  • Logistic Regression 
  • Support Vectors 
  • Decision Trees (see the sketch after this list)
  • Linear Regression as a Classifier
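
To illustrate the decision-tree objective above, a minimal scikit-learn sketch showing how unlimited tree depth can overfit, visible as a gap between train and test scores; the dataset and depth values are illustrative assumptions:

    # Compare shallow and fully grown trees; a large train/test gap signals overfitting.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in (2, 5, None):   # None lets the tree grow until leaves are pure
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_train, y_train)
        print(f"max_depth={depth}: train={tree.score(X_train, y_train):.3f}, "
              f"test={tree.score(X_test, y_test):.3f}")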

DAY 3 – K-Means & GMM Clustering, Essential TensorFlow, NLP with NLTK, NLP with Stanford CoreNLP

OBJECTIVES
  • Discuss and understand Clustering Algorithms
  • Work with TensorFlow to create a basic neural network
  • Discuss Natural Language Processing
  • Discuss Dimensionality Reduction Algorithms
LABS
  • K-Means Clustering (see the sketch after this list)
  • GMM Clustering 
  • Essential TensorFlow 
  • Sentiment Analysis
  • Dimensionality Reduction with PCA
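
For the clustering material, a minimal scikit-learn K-Means sketch on synthetic data; the three blob centers and the choice of k=3 are illustrative assumptions:

    # Fit k=3 clusters to three synthetic blobs and inspect the learned centers.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(loc, 0.5, size=(100, 2))   # one blob per center
                        for loc in ((0, 0), (5, 5), (0, 5))])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
    print("centers:\n", km.cluster_centers_)
    print("first ten labels:", km.labels_[:10])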

DAY 4 – Hyper-Parameter Tuning, K-Fold Validation, Ensemble Methods, ML Pipelines in SparkML

OBJECTIVES
  • Discuss Hyper-Parameter Tuning and K-Fold Validation
  • Understand Ensemble Models
  • Discuss ML Pipelines in Spark MLlib
  • Discuss ML in production and real-world issues
  • Demonstrate TensorFlowOnSpark
LABS
  • Hyper-parameter tuning 
  • K-Fold Validation 
  • Ensemble Methods 
  • ML Pipelines in SparkML (see the sketch after this list)
  • Demo: TensorFlowOnSpark
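
To make the pipeline and tuning objectives concrete, a minimal PySpark sketch of a Spark ML Pipeline, with a CrossValidator constructed around it; the toy DataFrame, feature names, and parameter grid are illustrative assumptions:

    # A two-stage Spark ML pipeline: feature assembly followed by a classifier.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()
    df = spark.createDataFrame(
        [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0), (0.1, 0.9, 0.0)],
        ["f1", "f2", "label"])

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])
    pipeline.fit(df).transform(df).select("label", "prediction").show()

    # Hyper-parameter tuning with k-fold cross-validation; not fit here because
    # the toy DataFrame above is far too small for meaningful folds.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1]).build()
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(labelCol="label"),
                        numFolds=3)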

CLD19 – HDP Operations: Apache Hadoop Security Training

Overview

This course is designed for experienced administrators who will be implementing secure Hadoop clusters using authentication, authorization, auditing and data protection strategies and tools. Download the data sheet to view the full list of course objectives and labs.

  • Prerequisites
    Students should be experienced in the management of Hadoop using Ambari and Linux environments. Completion of the Hadoop Administration I course is highly recommended.
  • Target Audience
    IT administrators and operators responsible for installing, configuring and supporting an Apache Hadoop deployment in a Linux environment.

Day 1 – An Introduction to Security

OBJECTIVES
  • Definition of Security
  • Securing Sensitive Data
  • Integrating HDP Security
  • What Security Tools to use for Each Use Case
  • HDP Security Prerequisites
  • Ambari Server Security
  • Kerberos Deep Dive
LABS
  • Setting up the Lab Environment
  • Configuring the AD Resolution Certificate
  • Security Options for Ambari

Day 2 – Working with Kerberos and Apache Ranger

OBJECTIVES
  • Enable Kerberos
  • Apache Ranger Installation
  • Apache Ranger KMS
LABS
  • Kerberizing the Cluster
  • Installing Apache Ranger (see the sketch after this list)
  • Setting up Apache Ranger KMS Data Encryption
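
Once Ranger is installed, one quick sanity check is to list policies through Ranger Admin’s public REST API. A minimal sketch; the hostname, port, and lab credentials are placeholder assumptions:

    # List Ranger policies via the public v2 REST API (placeholder host and creds).
    import requests

    RANGER_URL = "http://ranger-admin.example.com:6080"   # Ranger Admin UI port
    resp = requests.get(RANGER_URL + "/service/public/v2/api/policy",
                        auth=("admin", "admin"),          # lab-only credentials
                        timeout=10)
    resp.raise_for_status()
    for policy in resp.json():
        print(policy["id"], policy["service"], policy["name"])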

Day 3 – Working with Apache Knox

OBJECTIVES
  • Secure Access with Ranger
  • Apache Knox Overview
  • Apache Knox Installation
  • Ambari Views for Controlled Access
LABS
  • Secured Hadoop Exercises
  • Configuring Apache Knox
  • Exploring Other Security Features of Apache Ambari

CLD20 – HDP Spark Developer

Overview

This course introduces the Apache Spark distributed computing engine and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. It is based on the Spark 2.x release. The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g., RDDs and the distributed compute engine), as well as higher-level constructs that provide a simpler and more capable interface. It includes in-depth coverage of Spark SQL, DataFrames, and DataSets, which are now the preferred programming API. This includes exploring possible performance issues and strategies for optimization. The course also covers more advanced capabilities such as the use of Spark Streaming to process streaming data, and integration with the Kafka server.

  • Prerequisites
    Students should be familiar with programming principles and have previous experience in software
    development using Scala. Previous experience with data streaming, SQL, and HDP is also helpful, but not
    required.
  • Target Audience
    Software engineers who are looking to develop in-memory applications for time-sensitive and highly iterative
    workloads in an Enterprise HDP environment.

DAY 1 – Scala Ramp Up, Introduction to Spark

OBJECTIVES
  • Scala Introduction
  • Working with: Variables, Data Types, and Control Flow
  • The Scala Interpreter
  • Collections and their Standard Methods (e.g. map())
  • Working with: Functions, Methods, and Function Literals
  • Define the Following as they Relate to Scala: Class, Object, and Case Class
  • Overview, Motivations, Spark Systems
  • Spark Ecosystem
  • Spark vs. Hadoop
  • Acquiring and Installing Spark
  • The Spark Shell, SparkContext
LABS
  • Setting Up the Lab Environment
  • Starting the Scala Interpreter
  • A First Look at Spark
  • A First Look at the Spark Shell

DAY 2 – RDDs and Spark Architecture, Spark SQL, DataFrames and DataSets

OBJECTIVES
  • RDD Concepts, Lifecycle, Lazy Evaluation
  • RDD Partitioning and Transformations
  • Working with RDDs Including: Creating and Transforming
  • An Overview of RDDs
  • SparkSession, Loading/Saving Data, Data Formats
  • Introducing DataFrames and DataSets
  • Identify Supported Data Formats
  • Working with the DataFrame (untyped) Query DSL
  • SQL-based Queries
  • Working with the DataSet (typed) API
  • Mapping and Splitting
  • DataSets vs. DataFrames vs. RDDs
LABS
  • RDD Basics
  • Operations on Multiple RDDs
  • Data Formats
  • Spark SQL Basics
  • DataFrame Transformations (see the sketch after this list)
  • The DataSet Typed API
  • Splitting Up Data
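
As a flavor of the DataFrame material, here is a minimal sketch of the same query expressed with the untyped query DSL and with SQL. The course itself uses Scala; the sketch uses PySpark for consistency with the other examples here, and the input path is a placeholder:

    # One query two ways: the untyped DataFrame DSL and a SQL view.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("df-sketch").getOrCreate()
    df = spark.read.json("people.json")        # placeholder input file

    df.filter(F.col("age") > 21).groupBy("city").count().show()

    df.createOrReplaceTempView("people")
    spark.sql("SELECT city, COUNT(*) AS n FROM people "
              "WHERE age > 21 GROUP BY city").show()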

DAY 3 – Shuffling, Transformations and Performance, Performance Tuning

OBJECTIVES
  • Working with: Grouping, Reducing, Joining
  • Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
  • Exploring the Catalyst Query Optimizer
  • The Tungsten Optimizer
  • Discuss Caching, Including: Concepts, Storage Type, Guidelines
  • Minimizing Shuffling for Increased Performance
  • Using Broadcast Variables and Accumulators
  • General Performance Guidelines
LABS
  • Exploring Group Shuffling
  • Seeing Catalyst at Work
  • Seeing Tungsten at Work
  • Working with Caching, Joins, Shuffles, Broadcasts, Accumulators
  • Broadcast General Guidelines

DAY 4 – Creating Standalone Applications and Spark Streaming

OBJECTIVES
  • Core API, SparkSession.Builder
  • Configuring and Creating a SparkSession
  • Building and Running Applications
  • Application Lifecycle (Driver, Executors, and Tasks)
  • Cluster Managers (Standalone, YARN, Mesos)
  • Logging and Debugging
  • Introduction and Streaming Basics
  • Spark Streaming (Spark 1.0+)
  • Structured Streaming (Spark 2+)
  • Consuming Kafka Data
LABS
  • Spark Job Submission
  • Additional Spark Capabilities
  • Spark Streaming
  • Spark Structured Streaming
  • Spark Structured Streaming with Kafka (see the sketch after this list)
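
A minimal Structured Streaming sketch that consumes a Kafka topic, shown in PySpark for consistency; the broker address, topic name, and availability of the spark-sql-kafka connector are assumptions of the sketch:

    # Read a Kafka topic as an unbounded DataFrame and echo values to the console.
    # Requires the spark-sql-kafka connector package on the classpath.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("stream-sketch").getOrCreate()
    lines = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
             .option("subscribe", "events")                     # placeholder topic
             .load()
             .select(F.col("value").cast("string")))

    query = lines.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()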

CLD21 – HDP Operations: Administration Foundations

Overview

This four-day instructor-led training course provides students with the foundational knowledge required to plan, deploy, configure, and manage a cluster running the Hortonworks Data Platform (HDP).

What You Will Learn

Students who successfully complete this course will learn how to administer Apache Hadoop and the Hortonworks Data Platform (HDP). You will be able to:

  • Install the Hortonworks Data Platform
  • Manage Hadoop services
  • Use and manage Hadoop Distributed File System (HDFS) Storage
  • Configure rack awareness
  • Manage cluster nodes and cluster node storage
  • Use HDFS snapshots and Distributed Copy (DistCp)
  • Configure heterogeneous storage and HDFS centralized cache
  • Configure an HDFS NFS gateway and NameNode high availability
  • Describe the View File System (ViewFS)
  • Manage YARN resources and run YARN applications
  • Configure the YARN capacity scheduler, containers, and queues to manage computing resources
  • Configure YARN node labels and YARN ResourceManager high availability
  • Manage Ambari alerts
  • Deploy an HDP cluster using Ambari blueprints
  • Upgrade a cluster to a newer version of HDP

What to Expect

This course is designed primarily for system administrators and system operators responsible for installing, configuring, and managing an HDP cluster.

Students must have experience working in a Linux environment with standard Linux system commands. Students should be able to read and execute basic Linux shell scripts. In addition, we recommend that students have some operational experience in data center practices.

CLD15 – Just Enough Python

Overview

Cloudera University’s Python training course will teach you the key language concepts and programming techniques you need so that you can concentrate on the subjects covered in Cloudera’s developer courses without also having to learn a complex programming language and a new programming paradigm on the fly. 

Immersive Training

Through instructor-led discussion, as well as hands-on exercises, participants will learn (a combined sketch follows this list):

  • How to define, assign, and access variables
  • Which collection types are commonly used, how they differ, and how to use them
  • How to control program flow using conditional statements, looping, iteration, and exception handling
  • How to define and use both named and anonymous (Lambda) functions
  • How to organize code into separate modules
  • How to use important features of standard Python libraries, including mathematical and regular expression support 
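
A short sketch touching several of the items above: collections, flow control, a named and an anonymous function, and the standard re module. The sample words are arbitrary:

    # Collections, a comprehension, a lambda sort key, and regular expressions.
    import re

    def describe(word):
        """A named function using a conditional expression."""
        return "long" if len(word) > 4 else "short"

    words = ["spark", "hive", "hadoop"]              # a list
    lengths = {w: len(w) for w in words}             # a dict comprehension

    for w in sorted(words, key=lambda w: len(w)):    # anonymous (lambda) function
        print(w, describe(w), lengths[w])

    match = re.search(r"h(\w+)", "hadoop")           # regular expression support
    print(match.group(1) if match else "no match")   # prints "adoop"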

Audience and prerequisites

Prior knowledge of Hadoop is not required. Since this course is intended for developers who do not yet have the prerequisite skill of writing code in Python, basic programming experience in at least one commonly used programming language (ideally Java, but Ruby, Perl, Scala, C, C++, PHP, or JavaScript will suffice) is assumed.

Please note that this course does not teach big data concepts, nor does it cover how to use Cloudera software. Instead, it is meant as a precursor for one of our developer-focused training courses that provide those skills. 

CLD16 – HDP Self-Paced Learning Library

Overview

The HDP OnDemand Library offers individuals anytime, anywhere access to a collection of self-paced training courses on the Hortonworks Data Platform (HDP). Courses are available for administrators, developers, and data scientists. The HDP courses are designed and developed by Hadoop experts and provide an immersive and valuable real-world experience. In our scenario-based training courses, we offer unmatched depth and expertise. Our Hadoop learning path prepares you to be an expert with highly valued, practical skills. The HDP OnDemand Library accelerates the journey to Hadoop competency.

Current courses include: 

  • HDP Overview: Apache Hadoop Essentials (HDP-123)
  • HDP Developer: Spark (DEV-343)
  • HDF: NiFi Flow Management (ADM-301)
  • HDP Operations: Administration Foundations (ADM-221)
  • HDP Operations: Security (ADM-351)
  • HDP Data Science (SCI-241)

CLD17 – HDP Apache Hive Training

Overview

This four-day training course is designed for analysts and developers who need to create and analyze Big Data stored in Apache Hadoop using Hive. Topics include: Understanding of HDP and HDF and their integration with Hive; Hive on Tez, LLAP, and Druid OLAP query analysis; Hive data ingestion using HDF and Spark; and Enterprise Data Warehouse offload capabilities in HDP using Hive.

Prerequisites

Students should be familiar with programming principles and have experience in software development. Knowledge of SQL, data modeling, and scripting is also helpful. No prior Hadoop knowledge is needed.

Course Details

Information Architecture and Big Data

  • Enterprise Data Warehouse Optimization

Introduction to Apache Hive

  • About Apache Hive
  • About Apache Zeppelin and Apache Superset (incubating)

Apache Hive Architecture

  • Apache Hive Architecture

Apache Hive Programming

  • Apache Hive Basics
  • Apache Hive Transactions (Hive ACID)

File Formats

  • SerDes and File Formats

Partitions and Bucketing

  • Partitions
  • Bucketing (see the DDL sketch after this list)
  • Skew and Temporary Tables
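
To make partitioning and bucketing concrete, a minimal HiveQL DDL sketch, issued here through a Hive-enabled SparkSession for consistency with the other examples. The table, columns, and bucket count are illustrative, and some Spark versions require bucketed Hive-format tables to be created from Hive/Beeline instead:

    # Create a table partitioned by date and hash-bucketed by user within each partition.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-ddl-sketch")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("""
        CREATE TABLE IF NOT EXISTS web_logs (
            user_id STRING,
            url     STRING
        )
        PARTITIONED BY (log_date STRING)        -- one HDFS directory per date
        CLUSTERED BY (user_id) INTO 8 BUCKETS   -- fixed number of files per partition
        STORED AS ORC
    """)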

Advanced Apache Hive Programming

  • Data Sorting
  • Apache Hive User Defined Functions (UDFs)
  • Subqueries and Views
  • Joins
  • Windowing and Grouping
  • Other Topics

Apache Hive Performance Tuning

  • Cost-Based Optimization and Statistics
  • Bloom Filters
  • Execution and Resource Plans

Live Long and Process (LLAP) Deep Dive

  • Live Long and Process Overview
  • Apache Hive and LLAP Performance
  • Apache Hive and LLAP Installation

Security and Data Governance

  • Apache Ranger
  • Apache Ranger and Hive
  • Apache Atlas
  • Apache Atlas and Hive Integration

Apache HBase and Phoenix Integration with Hive

  • Apache HBase Overview
  • Apache HBase Integration with Apache Hive
  • Apache Phoenix Overview

Apache Druid (incubating) with Apache Hive

  • Apache Druid (incubating) Overview
  • Apache Druid (incubating) Queries
  • Apache Druid (incubating) and Hive Integration

Apache Sqoop and Integration with Apache Hive

  • Overview of Apache Sqoop

Apache Spark and Integration with Apache Hive

  • Introduction to Apache Spark
  • Apache Hive and Spark

Introduction to HDF (Apache NiFi) and Integration with Apache Hive

  • Introduction to Apache NiFi
  • Apache NiFi and Apache Hive

CLD14 – Just Enough Scala

Overview

Scala is a programming language that runs on the Java Virtual Machine and interoperates closely with Java, blending the object-oriented and the functional programming paradigms. The language is complex and could take a semester or more to master. This class focuses only on the elements that are necessary to be able to program in Cloudera’s training courses. 

Immersive Training

Through instructor-led discussion or OnDemand videos, as well as hands-on exercises, participants will learn:

  • What Scala is and how it differs from languages such as Java or Python
  • Why Scala is a good choice for Spark programming
  • How to use key language features such as data types, collections, and flow control
  • How to implement functional programming solutions in Scala
  • How to work with Scala classes, packages, and libraries 

Audience and prerequisites

Basic knowledge of programming concepts such as objects, conditional statements, and looping is required. This course is best suited to students with Java programming experience. Those with experience in another language may prefer the Just Enough Python course. Basic knowledge of Linux is assumed.

Please note that this course does not teach big data concepts, nor does it cover how to use Cloudera software. Instead, it is meant as a precursor for one of our developer-focused training courses that provide those skills. 

CLD10 – Cloudera Data Science Workbench Training

Overview

Cloudera Data Science Workbench Training prepares learners to complete data science and machine learning projects using Cloudera Data Science Workbench (CDSW).

Get Hands-On Experience

Through narrated demonstrations and hands-on exercises, learners achieve proficiency in CDSW and develop the skills required to:

  • Navigate CDSW’s options and interfaces with confidence
  • Create projects in CDSW and collaborate securely with other users and teams
  • Develop and run reproducible Python and R code
  • Customize projects by installing packages and setting environment variables
  • Connect to a secure (Kerberized) Cloudera or Hortonworks cluster
  • Work with large-scale data using Apache Spark 2 with PySpark and sparklyr (see the sketch after this list)
  • Perform end-to-end machine learning workflows in CDSW using Python or R (read, inspect, transform, visualize, and model data)
  • Measure, track, and compare machine learning models using CDSW’s Experiments capability
  • Deploy models as REST API endpoints serving predictions using CDSW’s Models capability
  • Work collaboratively using CDSW together with Git
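
For a flavor of working with Spark from a CDSW session, a minimal PySpark sketch, assuming the session is attached to a cluster with a Hive table to read; the table name is a placeholder:

    # Start a Spark session from a CDSW project and read a Hive table.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cdsw-sketch")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SELECT * FROM default.flights LIMIT 10").show()  # placeholder table
    spark.stop()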

What to Expect

This OnDemand course is designed for learners at organizations using CDSW under a trial license or a commercial license. The learner must have access to a CDSW environment on a Cloudera or Hortonworks cluster running Apache Spark 2. Some experience with data science using Python or R is helpful but not required. No prior knowledge of Spark or other Hadoop ecosystem tools is required.

CLD11 – Cloudera Search Training

Overview

Cloudera Educational Services’ three-day Search training course is for developers and data engineers who want to index data in Hadoop for more powerful real-time queries. Participants will learn to get more value from their data by integrating Cloudera Search with external applications.

Learn a modern toolset

Cloudera Search brings full-text, interactive search and scalable, flexible indexing to Hadoop and an enterprise data hub. Powered by Apache Solr, Search delivers scale and reliability for a new generation of integrated, multi-workload queries.

Get hands-on experience

Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem and learn how to:

  • Perform batch indexing of data stored in HDFS and HBase (see the sketch after this list)
  • Perform indexing of streaming data in near-real-time with Flume
  • Index content in multiple languages and file formats
  • Process and transform incoming data with Morphlines
  • Create a user interface for your index using Hue
  • Integrate Cloudera Search with external applications
  • Improve the Search experience using features such as faceting, highlighting, and spelling correction
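
Cloudera Search is powered by Apache Solr, so a standard Solr client illustrates the index-then-query loop. A minimal sketch using the third-party pysolr package; the URL, collection name, and documents are placeholder assumptions, not Cloudera Search specifics:

    # Index two documents into a Solr collection, then run a faceted full-text query.
    import pysolr

    solr = pysolr.Solr("http://search-node.example.com:8983/solr/articles",
                       timeout=10)

    solr.add([
        {"id": "1", "title": "Hadoop indexing", "lang": "en"},
        {"id": "2", "title": "Near-real-time search", "lang": "fr"},
    ])
    solr.commit()

    results = solr.search("title:indexing",
                          **{"facet": "true", "facet.field": "lang"})
    for doc in results:
        print(doc["id"], doc["title"])
    print(results.facets)                       # document counts per language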

What to expect

This course is intended for developers and data engineers with at least basic familiarity with Hadoop and experience programming in a general-purpose language such as Java, C, C++, Perl, or Python. Participants should be comfortable with the Linux command line and should be able to perform basic tasks such as creating and removing directories, viewing and changing file permissions, executing scripts, and examining file output. No prior experience with Apache Solr or Cloudera Search is required, nor is any experience with HBase or SQL.