Spark Application Performance Tuning Workshop

Overview

This three-day hands-on training course presents the concepts and architectures of Spark and the underlying data platform, providing students with the conceptual understanding necessary to diagnose and solve performance issues.

With this understanding of Spark internals and the underlying data platform, the course teaches students how to tune Spark application code and configuration. The course illustrates performance design best practices and pitfalls. Students are prepared to apply these patterns and anti-patterns to their own designs and code.

The course format emphasizes instructor-led demos of performance issues and the techniques that address them, followed by hands-on exercises in which students explore the same issues and techniques in an interactive notebook environment. Students leave the course with a practical, illustrative body of code.

Prerequisites

This course is designed for software developers, engineers, and data scientists who develop Spark applications and need to tune their code. This is not an introductory Spark course; students should be comfortable completing the tasks covered in Cloudera Developer Training for Apache Spark and Hadoop. Spark examples and hands-on exercises are presented in Python and Scala, and the ability to program in one of those languages is required. Basic familiarity with the Linux command line is assumed. Basic knowledge of SQL is helpful.

Course Topics

Spark Architecture

Coverage of all concepts found in the Spark Application UI
RDD execution
DataFrame execution
Catalyst optimizer
Partitioning
Shuffling
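
As a rough picture of the partitioning and shuffling topics listed above, the sketch below routes key-value records to partitions by hashing the key, so that all records sharing a key land in the same partition. It is plain Python, not Spark code, and it is not taken from the course material; the names partition_for and shuffle_by_key are hypothetical, and CRC32 stands in for whatever hash function a real engine uses.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a deterministic hash.

    CRC32 is an assumption made for this illustration; it is used here
    only because it gives the same result in every Python process.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(records, num_partitions):
    """Group (key, value) records by target partition, as a shuffle would."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
shuffled = shuffle_by_key(records, num_partitions=4)
```

Because the target partition depends only on the key, a key that dominates the data also dominates one partition, which is the starting point for the skew topics in the next section.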

Optimizing Data

Recognizing and dealing with skewed data
Handling small files
Join optimizations: broadcast joins, common joins, skewed joins, and bucketed joins
Unbalanced partitions
Partitioned and bucketed tables
Object serialization
Compression
File formats
Storage options
Schema inference
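
One widely used way of dealing with skewed data, shown here only as an illustration and not as the course's own method, is key salting: a random suffix is appended to an over-represented key so that its records spread across several groups. The sketch below is plain Python rather than Spark code; the function names, the 50% threshold, and the sample data are all assumptions chosen for the example.

```python
import random
from collections import Counter


def find_skewed_keys(keys, threshold=0.5):
    """Return keys whose share of all records exceeds the threshold.

    The threshold value is an arbitrary assumption for this sketch.
    """
    counts = Counter(keys)
    total = len(keys)
    return {k for k, c in counts.items() if c / total > threshold}


def salt_key(key, skewed, num_salts, rng):
    """Append a random salt to skewed keys; leave other keys unchanged."""
    if key in skewed:
        return f"{key}#{rng.randrange(num_salts)}"
    return key


# Hypothetical data set in which one key accounts for 90% of the records.
keys = ["hot"] * 90 + ["cold_a"] * 5 + ["cold_b"] * 5
skewed = find_skewed_keys(keys, threshold=0.5)

rng = random.Random(0)  # fixed seed so repeated runs give the same salts
salted = [salt_key(k, skewed, num_salts=4, rng=rng) for k in keys]
salted_counts = Counter(salted)
```

After salting, the records of the hot key are spread over up to four salted keys, so no single group holds all 90 of them. In a join, the other side would need matching salted copies of that key, which this sketch leaves out.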

Optimizing Processing

Static vs. dynamic scheduling
Dynamic resource pools in YARN
Partition processing
Broadcast variables
Driver and executor memory and CPU core configuration
Python overhead
UDFs

Developing High Performance Algorithms

Caching data
Checkpoints
Recovery
