Hands-on Exercises

Welcome

Welcome to the AMP Camp 6 hands-on exercises! These exercises are extended and enhanced from those given at previous AMP Camp Big Data Bootcamps. They were written by volunteer graduate students and postdocs in the UC Berkeley AMPLab, and many of those same graduate students are present today as teaching assistants. The exercises we cover today will have you working directly with the Spark-specific components of the AMPLab’s open-source software stack, called the Berkeley Data Analytics Stack (BDAS).

These hands-on exercises will have you walk through examples of how to use Spark, Tachyon, and related projects.

Prerequisites

Assumptions

In order to get the most out of this course, we assume:

If you would like a quick primer on Scala, check out the following doc in the appendix:

Exercises Overview

Languages Used

In several of the following training modules, you can choose which language you want to use as you follow along and gain experience with the tools. The following table shows which languages this mini course supports for each section. You are welcome to mix and match languages depending on your preferences and interests.

Section
Spark               yes      no       yes
Spark SQL           yes      no       yes
IndexedRDD          yes      no       no
Tachyon             yes      yes      no
SparkR              R only   R only   R only
Succinct            yes      no       no
KeystoneML          yes      no       no
Splash              yes      no       no
Spark Time Series   yes      no       no

Exercise Content

The modules we will cover at the AMP Camp training are listed below. They can be done in any order, according to your interests, though we recommend that new users start with Spark.

Note: Please follow the setup instructions on the Getting Started page before starting any of the exercises.

Day 1

Exercise     Description                                            Length   More Documentation
Spark        Use the Spark shell to write interactive queries       Short    Programming Guide
Spark SQL    Use the Spark shell to write interactive SQL queries   Short    Programming Guide
IndexedRDD   Use mutable RDDs                                       Medium   GitHub
Tachyon      Deploy Tachyon and try out its basic functionality     Medium   Project Website
SparkR       Interactive data analytics using Spark in R            Short    Programming Guide

Day 2

Exercise            Description                                       Length   More Documentation
Succinct            Query compressed data with Succinct               Medium   Project Page
KeystoneML          Text and Image classification with KeystoneML     Medium   Project Page
Splash              Use Splash to run stochastic learning algorithms  Short    Project Page
Spark Time Series   Analyze time series data                          Long

IRC Chat Room

A chat room is available for participants to connect with each other and get real-time help with the exercises. The room can be joined here or by using an IRC client to connect to the #ampcamp channel on the FreeNode (irc.freenode.net) network.
