The exercises in this mini course are divided into sections designed to give a hands-on experience with various software components of the Berkeley Data Analytics Stack (BDAS). For Spark, we will walk you through using the Spark shell for interactive exploration of data. You have the choice of doing the exercises using Scala or using Python. For Shark, you will be using SQL in the Shark console to interactively explore the same data. For Spark Streaming, we will walk you through writing stand alone Spark programs in Scala to processing Twitter’s sample stream of tweets. Finally, you will have to complete a complex machine learning exercise which will test your understanding of Spark.
Your cluster contains 6 m1.xlarge Amazon EC2 nodes. One of these 6 nodes is the master node, responsible for scheduling tasks as well as maintaining the HDFS metadata (a.k.a. HDFS name node). The other 5 are the slave nodes on which tasks are actually executed. You will mainly interact with the master node. If you haven’t already, let’s ssh onto the master node (see instructions above).
Once you’ve used SSH to log into the master, run the
ls command and you will see a number of directories.
Some of the more important ones are listed below:
- Templates for exercises:
kmeans: Template for the k-means clustering exercise
streaming: Standalone program for Spark Streaming exercises
java-app-template: Template for standalone Spark programs written in Java
scala-app-template: Template for standalone Spark programs written in Scala
shark-0.2: Shark installation
spark: Spark installation
- Useful scripts/documentation:
mesos-ec2: Suite of scripts to manage Mesos on EC2
spark-ec2: Suite of scripts to manage Spark on EC2
training: Documentation and code used for training exercises
ephemeral-hdfs: Hadoop installation
scala-2.9.2: Scala installation
hive-0.9.0-bin: Hive installation
mesos: Mesos installation
You can find a list of your 5 slave nodes in spark-ec2/slaves:
For stand-alone Spark programs, you will have to know the Spark cluster URL. You can find that in spark-ec2/cluster-url:
Dataset For Exploration
Your HDFS cluster should come preloaded with 20GB of Wikipedia traffic statistics data obtained from http://aws.amazon.com/datasets/4182 . To make the analysis feasible (within the short timeframe of the exercise), we took three days worth of data (May 5 to May 7, 2009; roughly 20G and 329 million entries). You can list the files:
ephemeral-hdfs/bin/hadoop fs -ls /wiki/pagecounts
There are 74 files (2 of which are intentionally left empty).
The data are partitioned by date and time. Each file contains traffic statistics for all pages in a specific hour. Let’s take a look at the file:
ephemeral-hdfs/bin/hadoop fs -cat /wiki/pagecounts/part-00148 | less
The first few lines of the file are copied here:
20090507-040000 aa ?page=http://www.stockphotosharing.com/Themes/Images/users_raw/id.txt 3 39267 20090507-040000 aa Main_Page 7 51309 20090507-040000 aa Special:Boardvote 1 11631 20090507-040000 aa Special:Imagelist 1 931
Each line, delimited by a space, contains stats for one page. The schema is:
<date_time> <project_code> <page_title> <num_hits> <page_size>
<date_time> field specifies a date in the YYYYMMDD format (year, month, day) followed by a hyphen and then the hour in the HHmmSS format (hour, minute, second).
There is no information in mmSS.
<project_code> field contains information about the language of the pages.
For example, project code “en” indicates an English page.
<page_title> field gives the title of the Wikipedia page.
<num_hits> field gives the number of page views in the hour-long time slot starting at
<page_size> field gives the size in bytes of the Wikipedia page.
less, stop viewing the file, and return to the command line, press