Cloudera – Page 2 – Knowledge Factory

CLD12 – Apache HBase Training

Overview

Take your knowledge to the next level with Cloudera Training for Apache HBase. Cloudera Educational Services’ three-day training course enables participants to store and access massive quantities of multi-structured data and perform hundreds of thousands of operations per second.

Hands-on Hadoop

Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem, learning topics such as:

The use cases and usage occasions for HBase, Hadoop, and RDBMS
Using the HBase shell to directly manipulate HBase tables
Designing optimal HBase schemas for efficient data storage and recovery
How to connect to HBase using the Java API to insert and retrieve data in real time
Best practices for identifying and resolving performance bottlenecks

Audience and prerequisites

This course is appropriate for developers and administrators who intend to use HBase. Prior experience with databases and data modeling is helpful, but not required. Knowledge of Java is assumed. Prior knowledge of Hadoop is not required, but Cloudera Developer Training for Spark and Hadoop provides an excellent foundation for this course.

Course Contents

Introduction

Introduction to Hadoop and HBase

Introducing Hadoop
Core Hadoop Components
What Is HBase?
Why Use HBase?
Strengths of HBase
HBase in Production
Weaknesses of HBase

HBase Tables

HBase Concepts
HBase Table Fundamentals
Thinking About Table Design

HBase Shell

Creating Tables with the HBase Shell
Working with Tables
Working with Table Data

HBase Architecture Fundamentals

HBase Regions
HBase Cluster Architecture
HBase and HDFS Data Locality

HBase Schema Design

General Design Considerations
Application-Centric Design
Designing HBase Row Keys
Other HBase Table Features

Basic Data Access with the HBase API

Options to Access HBase Data
Creating and Deleting HBase Tables
Retrieving Data with Get
Retrieving Data with Scan
Inserting and Updating Data
Deleting Data

More Advanced HBase API Features

Filtering Scans
Best Practices
HBase Coprocessors

HBase Write Path

HBase Write Path
Compaction
Splits

HBase Read Path

How HBase Reads Data
Block Caches for Reading

HBase Performance Tuning

Column Family Considerations
Schema Design Considerations
Configuring for Caching
Memory Considerations
Dealing with Time Series and Sequential Data
Pre-Splitting Regions
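
As a rough illustration of the time series and pre-splitting topics above, the following Python sketch shows one common way to keep sequential writes from piling onto a single region: salting the row key. It is not taken from the course materials. The function names, the `|` separator, and the bucket count of 8 are hypothetical choices made for the example.

```python
import hashlib
from typing import List

NUM_BUCKETS = 8  # hypothetical number of salt buckets (one per pre-split region)


def salted_row_key(sensor_id: str, timestamp_ms: int, buckets: int = NUM_BUCKETS) -> bytes:
    """Build a row key whose leading salt spreads sensors across regions."""
    # A stable hash of the sensor id picks the bucket, so a given sensor
    # always lands in the same bucket and its rows stay contiguous.
    digest = hashlib.md5(sensor_id.encode("utf-8")).digest()
    salt = digest[0] % buckets
    # Zero-padded fields keep byte order equal to numeric order.
    return f"{salt:02d}|{sensor_id}|{timestamp_ms:013d}".encode("utf-8")


def split_points(buckets: int = NUM_BUCKETS) -> List[bytes]:
    """Return the boundaries between salt prefixes, usable as pre-split keys."""
    return [f"{b:02d}".encode("utf-8") for b in range(1, buckets)]
```

With this layout, reads for a single sensor touch only that sensor's bucket, while a time-range scan across all sensors has to fan out over every bucket. That trade-off is the usual cost of salting.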

HBase Administration and Cluster Management

HBase Daemons
ZooKeeper Considerations
HBase High Availability
Using the HBase Balancer
Fixing Tables with hbck
HBase Security

HBase Replication and Backup

HBase Replication
HBase Backup
MapReduce and HBase Clusters

Using Hive and Impala with HBase

How to Use Hive and Impala to Access HBase

Conclusion

Appendix A: Accessing Data with Python and Thrift

Thrift Usage
Working with Tables
Getting and Putting Data
Scanning Data
Deleting Data
Counters
Filters

Appendix B: OpenTSDB

CLD13 – Introduction to Apache Kudu

Overview

Cloudera’s Introduction to Apache Kudu training teaches students the basics of Apache Kudu, a data storage system for the Hadoop platform that is optimized for analytical queries. The course covers common Kudu use cases and Kudu architecture. Students will learn how to create, manage, and query Kudu tables, and to develop Spark applications that use Kudu.

Get hands-on experience

Through instructor-led discussion, as well as hands-on exercises, participants will learn topics including:

A high-level explanation of Kudu
How Kudu compares to other relevant storage systems, and which use cases are best implemented with it
Kudu’s architecture and how to design tables that store data for optimum performance
Data management techniques for inserting, updating, or deleting records in Kudu tables using Impala, as well as bulk loading methods
Developing Apache Spark applications with Apache Kudu

What to expect

This material is intended for a broad audience of students involved with either software development or data analysis. This would include software developers, data engineers, DBAs, data scientists, and data analysts.

Students should know SQL. Familiarity with Impala is preferred but not required. Students should also know how to develop Apache Spark applications using either Python or Scala. Basic Linux experience is expected.

Course Contents

Introduction

Overview and Architecture

What Is Kudu?
Why Use Kudu?
Kudu Use Cases
Architecture Overview
Kudu Tools
Essential Points

Apache Kudu Tables

Kudu Tables
Data Storage Options
Designing Schemas
Partitioning Tables for Best Performance
Using Kudu Tools with Tables
Essential Points

Using Apache Kudu with Apache Impala

Apache Impala Overview
Creating and Querying Tables
Deleting Tables
Loading and Modifying Data in Kudu Tables
Defining Partitioning Strategy
Essential Points

Developing Apache Spark Applications with Apache Kudu

Apache Spark and Apache Kudu
Kudu, Spark SQL, and DataFrames
Managing Kudu Table Data with Scala
Creating Kudu Tables with Scala
Essential Points

Conclusion

CLD6 – Spark Application Performance Tuning Workshop

Overview

This three-day hands-on training course presents the concepts and architectures of Spark and the underlying data platform, providing students with the conceptual understanding necessary to diagnose and solve performance issues.

With this understanding of Spark internals and the underlying data platform, the course teaches students how to tune Spark application code and configuration. The course illustrates performance design best practices and pitfalls. Students are prepared to apply these patterns and anti-patterns to their own designs and code.

The course format emphasizes instructor-led demos of performance issues and techniques to address them, followed by hands-on exercises. Students explore these performance issues and techniques in an interactive notebook environment. Students take away from the course a practical, illustrative body of code.

Prerequisites

This course is designed for software developers, engineers, and data scientists who develop Spark applications and need the information and techniques for tuning their code. This is not a beginning course in Spark; students should be comfortable completing the tasks covered in Cloudera Developer Training for Apache Spark and Hadoop. Spark examples and hands-on exercises are presented in Python and Scala. The ability to program in one of those languages is required. Basic familiarity with the Linux command line is assumed. Basic knowledge of SQL is helpful.

Course Topics

Spark Architecture

Coverage of all concepts found in the Spark Application UI
RDD execution
DataFrame execution
Catalyst optimizer
Partitioning
Shuffling

Optimizing Data

Recognizing and dealing with skewed data
Handling small files
Join optimizations
Broadcast Joins
Common Joins
Skewed Joins
Bucketed Joins
Unbalanced partitions
Partitioned and bucketed tables
Object serialization
Compression
File formats
Storage options
Schema inference
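
To make the idea of skewed data and unbalanced partitions concrete, here is a minimal Python sketch that hash-partitions a list of keys and reports how uneven the result is. It is an illustration only: `zlib.crc32` stands in for whatever partitioner Spark actually applies, and the names `partition_sizes` and `skew_ratio` are hypothetical.

```python
import zlib
from collections import Counter
from typing import Iterable, List


def partition_sizes(keys: Iterable[str], num_partitions: int) -> List[int]:
    """Count how many records each hash partition would receive."""
    counts = Counter(
        zlib.crc32(key.encode("utf-8")) % num_partitions for key in keys
    )
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes: List[int]) -> float:
    """Largest partition divided by the mean partition size (1.0 = balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# One "hot" key dominates the data set, so one partition gets most records.
keys = ["hot"] * 90 + [f"k{i}" for i in range(10)]
sizes = partition_sizes(keys, num_partitions=4)
print(sizes, skew_ratio(sizes))
```

A ratio well above 1.0 suggests that one task will run far longer than the others, which is the symptom that techniques such as skewed joins and key salting aim to remove.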

Optimizing Processing

Static vs. dynamic scheduling
Dynamic resource pools in YARN
Partition processing
Broadcast variables
Driver and executor memory and CPU core configuration
Python overhead
UDFs

Developing High Performance Algorithms

Caching data
Checkpoints
Recovery

CLD7 – Big Data Architecture Workshop

Overview

BDAW is a learning event that addresses advanced big data architecture topics. BDAW brings together technical contributors into a group setting to design and architect solutions to a challenging business problem. The workshop addresses big data architecture problems in general, and then applies them to the design of a challenging system.

Throughout the highly interactive workshop, participants apply concepts to real-world examples, which leads to detailed discussions. Participants learn techniques for architecting big data systems, not only from Cloudera’s experience but also from the experiences of fellow participants.

Audience & Prerequisites

To gain the most from the workshop, participants should have a working knowledge of technologies such as HDFS, Spark, MapReduce, Hive/Impala, data formats, and relational database management systems. Detailed API-level knowledge is not needed, as there will not be any programming activities.

Participants will be divided into small groups to discuss the problems and develop solutions. Each group will select a spokesperson who will present the group’s findings to the workshop. There will not be any programming labs, but we will have solutions implemented and deployed in the cloud for demos during the workshop.

Course Outline

Introduction

Workshop Application Use Cases

Oz Metropolitan
Architectural questions
Team activity: Analyze Metroz Application Use Cases

Application Vertical Slice

Definition
Minimizing risk of an unsound architecture
Selecting a vertical slice
Team activity: Identify an initial vertical slice for Metroz

Application Processing

Real time, near real time processing
Batch processing
Data access patterns
Delivery and processing guarantees
Machine Learning pipelines
Team activity: identify delivery and processing patterns in Metroz, characterize response time requirements, identify Machine Learning pipelines
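
One way to picture the delivery and processing guarantees listed above: with at-least-once delivery a message may arrive more than once, so the consumer has to make processing idempotent. The Python sketch below is a simplified, in-memory illustration under that assumption. The `IdempotentConsumer` class and its fields are hypothetical and are not part of the workshop's Metroz application.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self) -> None:
        self.seen_ids = set()  # ids of messages already applied
        self.total = 0         # state updated by processing

    def handle(self, message_id: str, amount: int) -> bool:
        """Process a message; return False if it was a duplicate delivery."""
        if message_id in self.seen_ids:
            return False  # redelivery: skip so the state is not double-counted
        self.seen_ids.add(message_id)
        self.total += amount
        return True


consumer = IdempotentConsumer()
# "m1" is delivered twice, as can happen after a retry.
for message_id, amount in [("m1", 10), ("m2", 5), ("m1", 10)]:
    consumer.handle(message_id, amount)
```

In a real system the set of seen ids would need to be stored durably and updated together with the state; the sketch keeps both in memory only to show the principle.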

Application Data

Three V’s of Big Data
Data Lifecycle
Data Formats
Transforming Data
Team activity: Metroz Data Requirements

Scalable Applications

Scale up, scale out, scale to X
Determining if an application will scale
Poll: scalable airport terminal designs
Hadoop and Spark Scalability
Team activity: Scaling Metroz

Fault Tolerant Distributed Systems

Principles
Transparency
Hardware vs. Software redundancy
Tolerating disasters
Stateless functional fault tolerance
Stateful fault tolerance
Replication and group consistency
Fault tolerance in Spark and MapReduce
Application tolerance for failures
Team activity: Identify Metroz component failures and requirements

Security and Privacy

Principles
Privacy
Threats
Technologies
Team activity: identify threats and security mechanisms in Metroz

Deployment

Cluster sizing and evolution
On-premises vs. cloud
Edge computing
Team activity: select deployment for Metroz

Technology Selection

HDFS
HBase
Kudu
Relational Database Management Systems
MapReduce
Spark, including streaming, SparkSQL and SparkML
Hive
Impala
Cloudera Search
Data Sets and Formats
Team activity: technologies relevant to Metroz

Software Architecture

Architecture artifacts
One platform or multiple, lambda architecture
Team activity: produce high level architecture, selected technologies, revisit vertical slice
Vertical Slice demonstration

CLD8 – Data Analyst Training

Overview

Cloudera Educational Services’ four-day Data Analyst Training course will teach you to apply traditional data analytics and business intelligence skills to big data. This course presents the tools data professionals need to access, manipulate, transform, and analyze complex data sets using SQL and familiar scripting languages.

What to Expect

Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the ecosystem, learning:

How the open source ecosystem of big data tools addresses challenges not met by traditional RDBMSs
Using Apache Hive and Apache Impala to provide SQL access to data
Hive and Impala syntax and data formats, including functions and subqueries
Creating, modifying, and deleting tables, views, and databases; loading data; and storing results of queries
Creating and using partitions and different file formats
Combining two or more datasets using JOIN or UNION, as appropriate
What analytic and windowing functions are, and how to use them
Storing and querying complex or nested data structures
Processing and analyzing semi-structured and unstructured data
Techniques for optimizing Hive and Impala queries
Extending the capabilities of Hive and Impala using parameters, custom file formats and SerDes, and external scripts
How to determine whether Hive, Impala, an RDBMS, or a mix of these is best for a given task
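
As a small illustration of the windowing functions mentioned in the list above, the following Python sketch computes a per-region running total with `SUM(...) OVER (PARTITION BY ... ORDER BY ...)`. It uses Python's built-in `sqlite3` module as a stand-in so that it runs without a cluster (this assumes the bundled SQLite is version 3.25 or later, which added window functions). The `sales` table and its columns are invented for the example, and the query is a sketch rather than course material.

```python
import sqlite3

# In-memory database with a tiny, made-up sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", 1, 10), ("east", 2, 20), ("east", 3, 5),
        ("west", 1, 7), ("west", 2, 3),
    ],
)

# The window restarts for each region and accumulates in day order.
rows = conn.execute(
    """
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales
    ORDER BY region, day
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike a `GROUP BY`, the window function keeps every input row and adds the aggregate alongside it, so the east region shows running totals of 10, 30, and 35 across its three days.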

Audience & Prerequisites

This course is designed for data analysts, business intelligence specialists, developers, system architects, and database administrators. Some knowledge of SQL is assumed, as is basic Linux command-line familiarity. Prior knowledge of Apache Hadoop is not required.

CLD9 – Data Scientist Training

Overview

This workshop covers data science and machine learning workflows at scale using Apache Spark 2 and other key components of a big data ecosystem. The workshop emphasizes the use of data science and machine learning methods to address real-world business challenges.

What to expect

The workshop is designed for data scientists who currently use Python or R to work with smaller datasets on a single machine and who need to scale up their analyses and machine learning models to large datasets on distributed clusters. Data engineers and developers with some knowledge of data science and machine learning may also find this workshop useful.

Workshop participants should have a basic understanding of Python or R and some experience exploring and analyzing data and developing statistical or machine learning models. Knowledge of Hadoop or Spark is not required.

The workshop includes brief lectures, interactive demonstrations, hands-on exercises, and discussions covering topics including:

Overview of data science and machine learning at scale
Overview of the Hadoop ecosystem
Working with HDFS data and Hive tables using Hue
Introduction to Cloudera Data Science Workbench
Overview of Apache Spark 2
Reading and writing data
Inspecting data quality
Cleansing and transforming data
Summarizing and grouping data
Combining, splitting, and reshaping data
Exploring data
Configuring, monitoring, and troubleshooting Spark applications
Overview of machine learning in Spark MLlib
Extracting, transforming, and selecting features
Building and evaluating regression models
Building and evaluating classification models
Building and evaluating clustering models
Cross-validating models and tuning hyperparameters
Building machine learning pipelines
Deploying machine learning models

Technologies

Participants gain practical skills and hands-on experience with data science tools including:

Spark, Spark SQL, and Spark MLlib
PySpark and sparklyr
Cloudera Data Science Workbench (CDSW)
Hue

CLD1 – Cloudera OnDemand Training Library

Overview

Cloudera’s OnDemand Library offers anytime, anywhere access to our extensive collection of self-paced training courses. Designed to provide a robust training experience, it covers topics across Cloudera’s enterprise platforms and is an invaluable asset for organizations building solutions with Cloudera. Individuals receive detailed web-based instruction and complete challenging, practice-based exercises in a cloud-based environment. Take entire courses, or use the embedded search capabilities to find content specific to your needs across the portfolio. With regular knowledge checks throughout the courses, an iOS app for offline access, and a discussion board monitored by Cloudera staff, our OnDemand students have all the tools they need to successfully complete their training and apply their skills on the job.

Summary

The Cloudera Educational Services OnDemand Library subscription provides one year of access, via our OnDemand Portal, to all the courses listed below, plus any updates or new content added during your subscription. It also includes 100 hours of access to the cloud-hosted hands-on exercise environments.

Current courses include:

Cloudera Administrator Training
Cloudera Developer Training for Spark and Hadoop
Cloudera Data Analyst Training: Using Hive and Impala (NEW)
Cloudera Security
Cloudera Search Training
Cloudera Training for Apache HBase
Cloudera Data Science Workbench (NEW)
Just Enough Python
Just Enough Scala
Introduction to Apache Kafka
Introduction to Apache Kudu
Deploying and Scaling Cloudera Enterprise on Microsoft Azure
Introduction to Cloudera Altus Director
Introduction to Cloudera Manager
Introduction to Cloudera Navigator
CDP Essentials (NEW)
AWS Fundamentals for CDP Public Cloud (NEW)
Introduction to Cloudera Data Warehouse: Self-Service Analytics in the Cloud with CDP (NEW)
CDP for CDH Users (NEW)
CDP for HDP Users (NEW)

CLD2 – Administrator Training

Overview

Take your knowledge to the next level with Cloudera’s Administrator Training and Certification. Cloudera Educational Services’ four-day administrator training course provides participants with a comprehensive understanding of all the steps necessary to operate and maintain a Hadoop cluster using Cloudera Manager. From installation and configuration through load balancing and tuning, this training course is the best preparation for the real-world challenges faced by Cloudera administrators.

Get hands-on experience

Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem, learning topics such as:

Cloudera Manager features that make managing your clusters easier, such as aggregated logging, configuration management, resource management, reports, alerts, and service management
Configuring and deploying production-scale clusters that provide key Hadoop-related services, including YARN, HDFS, Impala, Hive, Spark, Kudu, and Kafka
Determining the correct hardware and infrastructure for your cluster
Proper cluster configuration and deployment to integrate with the data center
Ingesting, storing, and accessing data in HDFS, Kudu, and cloud object stores such as Amazon S3
How to load file-based and streaming data into the cluster using Kafka and Flume
Configuring automatic resource management to ensure service-level agreements are met for multiple users of a cluster
Best practices for preparing, tuning, and maintaining a production cluster
Troubleshooting, diagnosing, and solving cluster issues

What to expect

This course is best suited to systems administrators and IT managers who have basic Linux experience. Prior knowledge of Apache Hadoop is not required.

Get certified

Upon completion of the course, attendees are encouraged to continue their study and register for the CCA Administrator exam. Certification is a great differentiator. It helps establish you as a leader in the field, providing employers and customers with tangible evidence of your skills and expertise.

CLD3 – Cloudera DataFlow: Flow Management with Apache NiFi

Overview

This three-day hands-on training course provides the fundamental concepts and experience necessary to automate the ingest, flow, transformation, and egress of data using Apache NiFi.

Along with gaining a grasp of the key features, concepts, and benefits of NiFi, participants will create and run NiFi dataflows for a variety of scenarios. Students will gain expertise using processors, connections, and process groups, and will use NiFi Expression Language to control the flow of data from various sources to multiple destinations. Participants will monitor dataflows, examine progress of data through a dataflow, and connect dataflows to external systems such as Kafka and HDFS. After taking this course, participants will have key knowledge and expertise for configuring and managing data ingestion, movement, and transformation scenarios for the enterprise.

What You Will Learn

Students who successfully complete this course will be able to:

Understand the role of Apache NiFi and MiNiFi in the Cloudera DataFlow platform
Describe NiFi’s architecture, including standalone and clustered configurations
Use key features, including FlowFiles, processors, process groups, controllers, and connections, to define a NiFi dataflow
Navigate, configure dataflows, and use dataflow information with the NiFi User Interface
Trace the life of data, its origin, transformation, and destination, using data provenance
Organize and simplify dataflows
Manage dataflow versions using the NiFi Registry
Use the NiFi Expression Language to control dataflows
Implement dataflow optimization methods and available monitoring and reporting features
Connect dataflows with other systems, such as Kafka and HDFS
Describe aspects of NiFi security

What to Expect

This course is designed for Developers, Data Engineers, Data Scientists, and Data Stewards. It provides a no-code, graphical approach to configuring real-time data streaming, ingestion, and management solutions for a variety of use cases. Though programming experience is not required, basic experience with Linux is presumed. Exposure to big data concepts and applications is helpful.

CLD4 – Security Training

Overview

Get the Knowledge and Skills

After successfully completing this course, the student will be able to:

Describe security in the context of Hadoop
Assess threats to a production Hadoop cluster
Plan and deploy defenses against these threats
Improve the security of each node in the cluster
Set up authentication with Kerberos and Active Directory
Use permissions and ACLs to control access to files in HDFS
Use platform authorization features to control data access
Perform common key management tasks
Use encryption to protect data in motion and at rest
Monitor a cluster for suspicious activity

What To Expect

The course is intended for system administrators and those in similar roles. Prospective students should have a good understanding of Hadoop’s architecture, the ability to perform system administration tasks in the Linux environment, and at least basic exposure to Cloudera Manager. We recommend that students complete the Cloudera Administrator Training for Apache Hadoop course, or have equivalent on-the-job experience, before beginning this course. No prior training or experience with computer security is required.