HDP Spark Developer – Knowledge Factory

Overview

This course introduces the Apache Spark distributed computing engine and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. It is based on the Spark 2.x release. The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g., RDDs and the distributed compute engine), as well as higher-level constructs that provide a simpler and more capable interface. It includes in-depth coverage of Spark SQL, DataFrames, and DataSets, which are now the preferred programming API, including possible performance issues and strategies for optimization. The course also covers more advanced capabilities, such as using Spark Streaming to process streaming data and integrating with Kafka.

Prerequisites
Students should be familiar with programming principles and have previous experience in software
development using Scala. Previous experience with data streaming, SQL, and HDP is also helpful, but not
required.

Target Audience
Software engineers who want to develop in-memory applications for time-sensitive and highly iterative
workloads in an enterprise HDP environment.

Day 1 – Scala Ramp Up, Introduction to Spark

OBJECTIVES

Scala Introduction
Working with: Variables, Data Types, and Control Flow
The Scala Interpreter
Collections and their Standard Methods (e.g. map())
Working with: Functions, Methods, and Function Literals
Define the Following as they Relate to Scala: Class, Object, and Case Class
Overview, Motivations, Spark Systems
Spark Ecosystem
Spark vs. Hadoop
Acquiring and Installing Spark
The Spark Shell, SparkContext
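
As a rough illustration of the Scala topics listed above (case classes, collections and map(), function literals, methods, and control flow), the following minimal sketch uses only the Scala standard library. The names Lab, ScalaRampUp, and describe are hypothetical and chosen for this example; this is not taken from the course materials.

```scala
// A case class: an immutable data holder with generated equals/toString.
// `Lab` is a hypothetical type used only for this illustration.
final case class Lab(day: Int, title: String)

object ScalaRampUp {
  // An immutable List collection holding sample values.
  val labs: List[Lab] = List(
    Lab(1, "Setting Up the Lab Environment"),
    Lab(1, "A First Look at Spark"),
    Lab(2, "RDD Basics")
  )

  // A function literal passed to map(): transforms every element.
  val titles: List[String] = labs.map(lab => lab.title.toUpperCase)

  // A method using if/else as an expression (control flow returns a value).
  def describe(lab: Lab): String =
    if (lab.day == 1) s"Day 1: ${lab.title}" else s"Later: ${lab.title}"

  // groupBy followed by map: count the labs per day.
  val labsPerDay: Map[Int, Int] =
    labs.groupBy(_.day).map { case (day, group) => day -> group.size }

  def main(args: Array[String]): Unit = {
    titles.foreach(println)
    labs.map(describe).foreach(println)
    println(labsPerDay)
  }
}
```

The same definitions can also be pasted into the Scala interpreter one at a time, which is a convenient way to experiment with them.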

LABS

Setting Up the Lab Environment
Starting the Scala Interpreter
A First Look at Spark
A First Look at the Spark Shell

Day 2 – RDDs and Spark Architecture, Spark SQL, DataFrames and DataSets

OBJECTIVES

RDD Concepts, Lifecycle, Lazy Evaluation
RDD Partitioning and Transformations
Working with RDDs Including: Creating and Transforming
An Overview of RDDs
SparkSession, Loading/Saving Data, Data Formats
Introducing DataFrames and DataSets
Identify Supported Data Formats
Working with the DataFrame (untyped) Query DSL
SQL-based Queries
Working with the DataSet (typed) API
Mapping and Splitting
DataSets vs. DataFrames vs. RDDs

LABS

RDD Basics
Operations on Multiple RDDs
Data Formats
Spark SQL Basics
DataFrame Transformations
The DataSet Typed API
Splitting Up Data

Day 3 – Shuffling, Transformations and Performance, Performance Tuning

OBJECTIVES

Working with: Grouping, Reducing, Joining
Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
Exploring the Catalyst Query Optimizer
The Tungsten Optimizer
Discuss Caching, Including: Concepts, Storage Type, Guidelines
Minimizing Shuffling for Increased Performance
Using Broadcast Variables and Accumulators
General Performance Guidelines

LABS

Exploring Group Shuffling
Seeing Catalyst at Work
Seeing Tungsten at Work
Working with Caching, Joins, Shuffles, Broadcasts, Accumulators
Broadcast General Guidelines

Day 4 – Creating Standalone Applications and Spark Streaming

OBJECTIVES

Core API, SparkSession.Builder
Configuring and Creating a SparkSession
Building and Running Applications
Application Lifecycle (Driver, Executors, and Tasks)
Cluster Managers (Standalone, YARN, Mesos)
Logging and Debugging
Introduction and Streaming Basics
Spark Streaming (Spark 1.0+)
Structured Streaming (Spark 2+)
Consuming Kafka Data

LABS

Spark Job Submission
Additional Spark Capabilities
Spark Streaming
Spark Structured Streaming
Spark Structured Streaming with Kafka

Request a date

There are currently no scheduled dates.

Fill in the form and we will notify you as soon as a new course date is announced.