Hands-on Exercises

Welcome

Welcome to the AMP Camp 5 hands-on exercises! These exercises are extended and enhanced from those given at previous AMP Camp Big Data Bootcamps. They were written by volunteer graduate students and postdocs in the UC Berkeley AMPLab. Many of those same graduate students are present today as teaching assistants. The exercises we cover today will have you working directly with the Spark-specific components of the AMPLab’s open-source software stack, called the Berkeley Data Analytics Stack (BDAS).

These hands-on exercises will have you walk through examples of how to use Tachyon, Spark, and related projects.

Prerequisites

Assumptions

In order to get the most out of this course, we assume:

If you would like a quick primer on Scala, check out the following doc in the appendix:

Exercises Overview

Languages Used

In several of the following training modules, you can choose which language to use as you follow along and gain experience with the tools. The table below shows which languages this mini course supports for each section. You are welcome to mix and match languages depending on your preferences and interests.

Section | Scala | Java | Python
Spark | yes | no | yes
Spark SQL | yes | no | yes
Tachyon | no | yes | no
MLlib | yes | no | yes
GraphX | yes | no | no
Pipelines | yes | no | no
SparkR | R only | R only | R only
ADAM | yes | no | no

Exercise Content

The modules we will cover at the AMP Camp training are listed below. They can be done in any order according to your interests, though we recommend that new users start with Spark.

Note: Please follow the setup instructions at the Getting Started page before any of the exercises.

Day 1

Exercise | Description | Length | More Documentation
Spark | Use the Spark shell to write interactive queries | Short | Programming Guide
Spark SQL | Use the Spark shell to write interactive SQL queries | Short | Programming Guide
Tachyon | Deploy Tachyon and try simple functionalities | Medium | Project Website
MLlib | Build a movie recommender with Spark | Medium | Programming Guide
GraphX | Explore graph-structured data and graph algorithms | Long | Programming Guide

Day 2

Exercise | Description | Length | More Documentation
Pipelines | Image classification with pipelines | Medium |
SparkR | Interactive data analytics using Spark in R | Short | Project Page; GitHub
ADAM | Genome analysis with ADAM | Medium |

IRC Chat Room

A chat room is available for participants to connect with each other and get real-time help with the exercises. You can join the room here or use an IRC client to connect to the #ampcamp channel on the FreeNode (irc.freenode.net) network.
