Hands-on Exercises

Welcome

Welcome to the AMP Camp 4 hands-on exercises! These exercises are extended and enhanced from those given at previous AMP Camp Big Data Bootcamps. They were written by volunteer graduate students and postdocs in the UC Berkeley AMPLab, and many of those same graduate students are present today as teaching assistants. The exercises we cover today will have you working directly with the Spark-specific components of the AMPLab’s open-source software stack, called the Berkeley Data Analytics Stack (BDAS).

You can navigate around the exercises by looking in the page header or footer and clicking on the arrows or the dropdown button that shows the current page title (as shown in the figure below).

The components we will cover at the first Spark Training are listed below.

Introductory Exercises

The tutorial begins with a set of introductory exercises that should be done in order.

  1. Scala - a quick crash course on the Scala language and command-line interface.
  2. Spark (project homepage) - a fast cluster computing engine.
  3. Shark (project homepage) - a SQL layer on top of Spark.

Advanced Exercises

These can be done in any order according to your interests.

  1. Spark Streaming (project homepage) - A stream processing layer on top of Spark.
  2. Machine Learning with MLlib (project homepage) - Build a movie recommender with Spark.
  3. Graph Analytics with GraphX (project homepage) - Explore graph-structured data (e.g., Web-Graph) and graph algorithms (e.g., PageRank) with GraphX.
  4. Tachyon (project homepage) - Deploy a reliable in-memory filesystem across the cluster.
  5. BlinkDB (project homepage) - Use SQL with statistical sampling to decrease latency.

Course Prerequisites

A few of the components support multiple languages. In some sections of this training material, you can choose which language you want to use as you follow along and gain experience with the tools. The following table shows which languages this mini course supports for each section. You are welcome to mix and match languages depending on your preferences and interests.

Section                          Scala   Java   Python
Spark Interactive                yes     no     yes
Shark Interactive                All SQL
Spark Streaming                  yes     yes    no
Machine Learning                 yes     no     no
GraphX - Graph Analytics         yes     no     no
BlinkDB - SQL With Sampling      All SQL

Providing feedback

We are using cutting-edge versions (i.e., the master branches) of most of our software components, which means you may run into a few issues. If you do, please call over a TA and explain what’s going on. To report a problem, please create a new issue at the AMPLab’s training docs GitHub issue tracker (there is also a link to this in the footer on all pages of the exercises).

Getting Started

If you are attending Spark Training in person, the TAs will hand out cluster hostnames, and you can obtain the private key from the TinyURL address on the projector. Once you have your cluster hostname and private key, you can follow the directions to log in to your cluster.

If you are participating in the exercises from a remote location, you will want to launch a BDAS cluster on Amazon EC2 for yourself.
