CLD13 – Introduction to Apache Kudu

Overview

Cloudera’s Introduction to Apache Kudu training teaches students the basics of Apache Kudu, a data storage system for the Hadoop platform that is optimized for analytical queries. The course covers common Kudu use cases and Kudu architecture. Students will learn how to create, manage, and query Kudu tables, and to develop Spark applications that use Kudu.

Get hands-on experience

Through instructor-led discussion, as well as hands-on exercises, participants will learn topics including:

  • A high-level explanation of Kudu
  • How it compares to other relevant storage systems and which use cases are best implemented with Kudu
  • Kudu’s architecture and how to design tables that store data for optimum performance
  • Data management techniques for inserting, updating, or deleting records in Kudu tables using Impala, as well as bulk loading methods
  • Developing Apache Spark applications with Apache Kudu

What to expect

This material is intended for a broad audience of students involved with either software development or data analysis, including software developers, data engineers, DBAs, data scientists, and data analysts.

Students should know SQL. Familiarity with Impala is preferred but not required. Students should also know how to develop Apache Spark applications using either Python or Scala. Basic Linux experience is expected. 

Course Contents

Introduction

Overview and Architecture

  • What Is Kudu?
  • Why Use Kudu?
  • Kudu Use Cases
  • Architecture Overview
  • Kudu Tools
  • Essential Points

Apache Kudu Tables

  • Kudu Tables
  • Data Storage Options
  • Designing Schemas
  • Partitioning Tables for Best Performance
  • Using Kudu Tools with Tables
  • Essential Points

Using Apache Kudu with Apache Impala

  • Apache Impala Overview
  • Creating and Querying Tables
  • Deleting Tables
  • Loading and Modifying Data in Kudu Tables
  • Defining Partitioning Strategy
  • Essential Points

Developing Apache Spark Applications with Apache Kudu

  • Apache Spark and Apache Kudu
  • Kudu, Spark SQL, and DataFrames
  • Managing Kudu Table Data with Scala
  • Creating Kudu Tables with Scala
  • Essential Points

Conclusion

CLD10 – Cloudera Data Science Workbench Training

Overview

Cloudera Data Science Workbench Training prepares learners to complete data science and machine learning projects using Cloudera Data Science Workbench (CDSW).

Get Hands-On Experience

Through narrated demonstrations and hands-on exercises, learners achieve proficiency in CDSW and develop the skills required to:

  • Navigate CDSW’s options and interfaces with confidence
  • Create projects in CDSW and collaborate securely with other users and teams
  • Develop and run reproducible Python and R code
  • Customize projects by installing packages and setting environment variables
  • Connect to a secure (Kerberized) Cloudera or Hortonworks cluster
  • Work with large-scale data using Apache Spark 2 with PySpark and sparklyr
  • Perform end-to-end machine learning workflows in CDSW using Python or R (read, inspect, transform, visualize, and model data)
  • Measure, track, and compare machine learning models using CDSW’s Experiments capability
  • Deploy models as REST API endpoints serving predictions using CDSW’s Models capability
  • Work collaboratively using CDSW together with Git

What to Expect

This OnDemand course is designed for learners at organizations using CDSW under a trial license or a commercial license. The learner must have access to a CDSW environment on a Cloudera or Hortonworks cluster running Apache Spark 2. Some experience with data science using Python or R is helpful but not required. No prior knowledge of Spark or other Hadoop ecosystem tools is required.

CLD6 – Spark Application Performance Tuning Workshop

Overview

This three-day hands-on training course presents the concepts and architectures of Spark and the underlying data platform, providing students with the conceptual understanding necessary to diagnose and solve performance issues.

With this understanding of Spark internals and the underlying data platform, the course teaches students how to tune Spark application code and configuration. The course illustrates performance design best practices and pitfalls. Students are prepared to apply these patterns and anti-patterns to their own designs and code.

The course format emphasizes instructor-led demos of performance issues and techniques to address them, followed by hands-on exercises. Students explore these performance issues and techniques in an interactive notebook environment. Students take away from the course a practical, illustrative body of code.

Prerequisites

This course is designed for software developers, engineers, and data scientists who develop Spark applications and need the information and techniques for tuning their code. This is not a beginning course in Spark; students should be comfortable completing the tasks covered in Cloudera Developer Training for Apache Spark and Hadoop. Spark examples and hands-on exercises are presented in Python and Scala. The ability to program in one of those languages is required. Basic familiarity with the Linux command line is assumed. Basic knowledge of SQL is helpful.

Course Topics

Spark Architecture

  • Coverage of all concepts found in the Spark Application UI
  • RDD execution
  • DataFrame execution
  • Catalyst optimizer
  • Partitioning
  • Shuffling

Optimizing Data

  • Recognizing and dealing with skewed data
  • Handling small files
  • Join optimizations
    Broadcast Joins
    Common Joins
    Skewed Joins
    Bucketed Joins
  • Unbalanced partitions
  • Partitioned and bucketed tables
  • Object serialization
  • Compression
  • File formats
  • Storage options
  • Schema inference

Optimizing Processing

  • Static vs. dynamic scheduling
  • Dynamic resource pools in YARN
  • Partition processing
  • Broadcast variables
  • Driver and executor memory and CPU core configuration
  • Python overhead
  • UDFs

Developing High Performance Algorithms

  • Caching data
  • Checkpoints
  • Recovery

CLD7 – Big Data Architecture Workshop

Overview

The Big Data Architecture Workshop (BDAW) is a learning event that addresses advanced big data architecture topics. BDAW brings technical contributors together in a group setting to design and architect solutions to a challenging business problem. The workshop addresses big data architecture problems in general, and then applies them to the design of a challenging system.

Throughout the highly interactive workshop, participants apply concepts to real-world examples, leading to detailed discussions. The workshop helps participants learn techniques for architecting big data systems, drawing not only on Cloudera’s experience but also on the experiences of fellow participants.

Audience & Prerequisites

To gain the most from the workshop, participants should have working knowledge of technologies such as HDFS, Spark, MapReduce, Hive/Impala, data formats, and relational database management systems. Detailed API-level knowledge is not needed, as there will not be any programming activities.

Participants will be divided into small groups to discuss the problems and develop solutions. Each group will select a spokesperson who will present the group’s findings to the workshop. There will not be any programming labs, but solutions implemented and deployed in the cloud will be demonstrated during the workshop.

Course Outline

Introduction

Workshop Application Use Cases

  • Oz Metropolitan
  • Architectural questions
  • Team activity: Analyze Metroz Application Use Cases

Application Vertical Slice

  • Definition
  • Minimizing risk of an unsound architecture
  • Selecting a vertical slice
  • Team activity: Identify an initial vertical slice for Metroz

Application Processing

  • Real-time and near-real-time processing
  • Batch processing
  • Data access patterns
  • Delivery and processing guarantees
  • Machine Learning pipelines
  • Team activity: identify delivery and processing patterns in Metroz, characterize response time requirements, identify Machine Learning pipelines

Application Data

  • Three V’s of Big Data
  • Data Lifecycle
  • Data Formats
  • Transforming Data
  • Team activity: Metroz Data Requirements

Scalable Applications

  • Scale up, scale out, scale to X
  • Determining if an application will scale
  • Poll: scalable airport terminal designs
  • Hadoop and Spark Scalability
  • Team activity: Scaling Metroz

Fault Tolerant Distributed Systems

  • Principles
  • Transparency
  • Hardware vs. Software redundancy
  • Tolerating disasters
  • Stateless functional fault tolerance
  • Stateful fault tolerance
  • Replication and group consistency
  • Fault tolerance in Spark and MapReduce
  • Application tolerance for failures
  • Team activity: Identify Metroz component failures and requirements

Security and Privacy

  • Principles
  • Privacy
  • Threats
  • Technologies
  • Team activity: identify threats and security mechanisms in Metroz

Deployment

  • Cluster sizing and evolution
  • On-premise vs. Cloud
  • Edge computing
  • Team activity: select deployment for Metroz

Technology Selection

  • HDFS
  • HBase
  • Kudu
  • Relational Database Management Systems
  • MapReduce
  • Spark, including streaming, Spark SQL, and Spark ML
  • Hive
  • Impala
  • Cloudera Search
  • Data Sets and Formats
  • Team activity: technologies relevant to Metroz

Software Architecture

  • Architecture artifacts
  • One platform or multiple, lambda architecture
  • Team activity: produce high level architecture, selected technologies, revisit vertical slice
  • Vertical Slice demonstration

CLD8 – Data Analyst Training

Overview

Cloudera Educational Services’ four-day Data Analyst Training course will teach you to apply traditional data analytics and business intelligence skills to big data. This course presents the tools data professionals need to access, manipulate, transform, and analyze complex data sets using SQL and familiar scripting languages.

What to Expect

Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the ecosystem, learning:

  • How the open source ecosystem of big data tools addresses challenges not met by traditional RDBMSs
  • Using Apache Hive and Apache Impala to provide SQL access to data
  • Hive and Impala syntax and data formats, including functions and subqueries
  • Creating, modifying, and deleting tables, views, and databases; loading data; and storing results of queries
  • Creating and using partitions and different file formats
  • Combining two or more datasets using JOIN or UNION, as appropriate
  • What analytic and windowing functions are, and how to use them
  • Storing and querying complex or nested data structures
  • Processing and analyzing semi-structured and unstructured data
  • Techniques for optimizing Hive and Impala queries
  • Extending the capabilities of Hive and Impala using parameters, custom file formats and SerDes, and external scripts
  • How to determine whether Hive, Impala, an RDBMS, or a mix of these is best for a given task

Audience & Prerequisites

This course is designed for data analysts, business intelligence specialists, developers, system architects, and database administrators. Some knowledge of SQL is assumed, as is basic Linux command-line familiarity. Prior knowledge of Apache Hadoop is not required.

CLD9 – Data Scientist Training

Overview

This workshop covers data science and machine learning workflows at scale using Apache Spark 2 and other key components of a big data ecosystem. The workshop emphasizes the use of data science and machine learning methods to address real-world business challenges.

What to expect

The workshop is designed for data scientists who currently use Python or R to work with smaller datasets on a single machine and who need to scale up their analyses and machine learning models to large datasets on distributed clusters. Data engineers and developers with some knowledge of data science and machine learning may also find this workshop useful.

Workshop participants should have a basic understanding of Python or R and some experience exploring and analyzing data and developing statistical or machine learning models. Knowledge of Hadoop or Spark is not required.

The workshop includes brief lectures, interactive demonstrations, hands-on exercises, and discussions covering topics including:

  • Overview of data science and machine learning at scale
  • Overview of the Hadoop ecosystem
  • Working with HDFS data and Hive tables using Hue
  • Introduction to Cloudera Data Science Workbench
  • Overview of Apache Spark 2
  • Reading and writing data
  • Inspecting data quality
  • Cleansing and transforming data
  • Summarizing and grouping data
  • Combining, splitting, and reshaping data
  • Exploring data
  • Configuring, monitoring, and troubleshooting Spark applications
  • Overview of machine learning in Spark MLlib
  • Extracting, transforming, and selecting features
  • Building and evaluating regression models
  • Building and evaluating classification models
  • Building and evaluating clustering models
  • Cross-validating models and tuning hyperparameters
  • Building machine learning pipelines
  • Deploying machine learning models

Technologies

Participants gain practical skills and hands-on experience with data science tools including:

CLD1 – Cloudera OnDemand Training Library

Overview

Cloudera’s OnDemand Library offers anytime, anywhere access to our extensive collection of self-paced training courses. Designed to provide a robust training experience, it covers topics across Cloudera’s enterprise platforms and is an invaluable asset for organizations building solutions with Cloudera. Individuals receive detailed web-based instruction and complete challenging, practice-based exercises in a cloud-based environment. Take entire courses, or use the embedded search capabilities to find content specific to your needs across the portfolio. With regular knowledge checks throughout the courses, an iOS app for offline access, and a discussion board monitored by Cloudera staff, our OnDemand students have all the tools they need to successfully complete their training and apply their skills on the job.

Summary

The Cloudera Educational Services OnDemand Library subscription provides one year of access, via our OnDemand Portal, to all the courses listed below, plus any updates or new content added during your subscription. It also includes 100 hours of access to the cloud-hosted hands-on exercise environments.

Current courses include:

  • Cloudera Administrator Training 
  • Cloudera Developer Training for Spark and Hadoop
  • Cloudera Data Analyst Training: Using Hive and Impala (NEW)
  • Cloudera Security 
  • Cloudera Search Training
  • Cloudera Training for Apache HBase
  • Cloudera Data Science Workbench (NEW)
  • Just Enough Python
  • Just Enough Scala
  • Introduction to Apache Kafka
  • Introduction to Apache Kudu
  • Deploying and Scaling Cloudera Enterprise on Microsoft Azure
  • Introduction to Cloudera Altus Director
  • Introduction to Cloudera Manager
  • Introduction to Cloudera Navigator
  • CDP Essentials (NEW)
  • AWS Fundamentals for CDP Public Cloud (NEW)
  • Introduction to Cloudera Data Warehouse: Self-Service Analytics in the Cloud with CDP (NEW)
  • CDP for CDH Users (NEW)
  • CDP for HDP Users (NEW)

CLD2 – Administrator Training

Overview

Take your knowledge to the next level with Cloudera’s Administrator Training and Certification. Cloudera Educational Services’ four-day administrator training course provides participants with a comprehensive understanding of all the steps necessary to operate and maintain a Hadoop cluster using Cloudera Manager. From installation and configuration through load balancing and tuning, this training course is the best preparation for the real-world challenges faced by Cloudera administrators.

Get hands-on experience

Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem, learning topics such as:

  • Cloudera Manager features that make managing your clusters easier, such as aggregated logging, configuration management, resource management, reports, alerts, and service management
  • Configuring and deploying production-scale clusters that provide key Hadoop-related services, including YARN, HDFS, Impala, Hive, Spark, Kudu, and Kafka
  • Determining the correct hardware and infrastructure for your cluster
  • Proper cluster configuration and deployment to integrate with the data center
  • Ingesting, storing, and accessing data in HDFS, Kudu, and cloud object stores such as Amazon S3
  • How to load file-based and streaming data into the cluster using Kafka and Flume
  • Configuring automatic resource management to ensure service-level agreements are met for multiple users of a cluster
  • Best practices for preparing, tuning, and maintaining a production cluster
  • Troubleshooting, diagnosing, and solving cluster issues

What to expect

This course is best suited to systems administrators and IT managers who have basic Linux experience. Prior knowledge of Apache Hadoop is not required.

Get certified

Upon completion of the course, attendees are encouraged to continue their study and register for the CCA Administrator exam. Certification is a great differentiator. It helps establish you as a leader in the field, providing employers and customers with tangible evidence of your skills and expertise.

CLD3 – Cloudera DataFlow: Flow Management with Apache NiFi

Overview

This three-day hands-on training course provides the fundamental concepts and experience necessary to automate the ingest, flow, transformation, and egress of data using Apache NiFi.

Along with gaining a grasp of the key features, concepts, and benefits of NiFi, participants will create and run NiFi dataflows for a variety of scenarios. Students will gain expertise using processors, connections, and process groups, and will use NiFi Expression Language to control the flow of data from various sources to multiple destinations. Participants will monitor dataflows, examine progress of data through a dataflow, and connect dataflows to external systems such as Kafka and HDFS. After taking this course, participants will have key knowledge and expertise for configuring and managing data ingestion, movement, and transformation scenarios for the enterprise.

What You Will Learn

Students who successfully complete this course will be able to:

  • Understand the role of Apache NiFi and MiNiFi in the Cloudera DataFlow platform
  • Describe NiFi’s architecture, including standalone and clustered configurations
  • Use key features, including FlowFiles, processors, process groups, controllers, and connections, to define a NiFi dataflow
  • Navigate and configure dataflows, and use dataflow information, with the NiFi User Interface
  • Trace the life of data (its origin, transformation, and destination) using data provenance
  • Organize and simplify dataflows
  • Manage dataflow versions using the NiFi Registry
  • Use the NiFi Expression Language to control dataflows
  • Implement dataflow optimization methods and available monitoring and reporting features
  • Connect dataflows with other systems, such as Kafka and HDFS
  • Describe aspects of NiFi security

What to Expect

This course is designed for Developers, Data Engineers, Data Scientists, and Data Stewards. It provides a no-code, graphical approach to configuring real-time data streaming, ingestion, and management solutions for a variety of use cases. Though programming experience is not required, basic experience with Linux is presumed. Exposure to big data concepts and applications is helpful.

CLD4 – Security Training

Overview

Get the Knowledge and Skills

After successfully completing this course, the student will be able to:

  • Describe security in the context of Hadoop
  • Assess threats to a production Hadoop cluster
  • Plan and deploy defenses against these threats
  • Improve the security of each node in the cluster
  • Set up authentication with Kerberos and Active Directory
  • Use permissions and ACLs to control access to files in HDFS
  • Use platform authorization features to control data access
  • Perform common key management tasks
  • Use encryption to protect data in motion and at rest
  • Monitor a cluster for suspicious activity

What To Expect

The course is intended for system administrators and those in similar roles. Prospective students should have a good understanding of Hadoop’s architecture, the ability to perform system administration tasks in the Linux environment, and at least basic exposure to Cloudera Manager. We recommend that students complete the Cloudera Administrator Training for Apache Hadoop course, or have equivalent on-the-job experience, before beginning this course. No prior training or experience with computer security is required.