This section will walk you through the process of launching a small cluster using your own Amazon EC2 account and our scripts and AMI (New to AMIs? See this intro to AMIs).
The cluster setup script we’ll use below requires Python 2.x and has been tested to work on Linux or OS X. We will use the Bash shell in our examples below. If you are using Windows, consider installing Cygwin (note that we have not tested this, hence providing debug support would be hard).
Setting up EC2 keys
Make sure you have an Amazon EC2 account.
Set the environment variables
AWS_SECRET_ACCESS_KEY to your Amazon EC2 access key ID and secret access key.
These can be obtained from the AWS homepage by clicking
Account > Security Credentials > Access Credentials:
export AWS_ACCESS_KEY_ID=<ACCESS_KEY_ID> export AWS_SECRET_ACCESS_KEY=<SECRET_ACCESS_KEY>
Create an Amazon EC2 key pair for yourself.
This can be done by logging into your Amazon Web Services account through the AWS console, selecting
EC2 from the
Services menu, selecting
Key Pairs on the left sidebar, and creating and downloading a key:
Make sure that you set the permissions for the private key file to
600 (i.e. only you can read and write it) so that
ssh will work (commands to do this are provided farther below).
Getting the scripts to launch EC2 cluster
Check out the launch scripts by cloning the github repository.
git clone git://github.com/amplab/training-scripts.git
You can also obtain them by downloading the zip file at
Launching the cluster
Launch the cluster by running the following command. This script will launch a cluster, create a HDFS cluster and configure Mesos, Spark, and Shark. Finally, it will copy the datasets used in the exercises from EBS to the HDFS cluster. This can take around 20-30 mins.
cd training-scripts ./spark-ec2 -i <key_file> -k <name_of_key_pair> --copy launch amplab-training
<name_of_key_pair> is the name of your EC2 key pair (that you gave it when you created it),
<key_file> is the private key file for your key pair.
For example, if you created a key pair named
ampcamp-key and the private key (
<key_file>) is in your home directory and is called
ampcamp.pem, then the command would be
./spark-ec2 -i ~/ampcamp.pem -k ampcamp-key --copy launch amplab-training
This command may take a 30-40 minutes or longer and should produce a bunch of output as it first spins up the nodes for your cluster, sets up BDAS on them, and performs a large distributed file copy of the wikipedia files we’ll use in these training documents from EBS to your instance of HDFS.
The following are some errors that you may encounter, and other frequently asked questions:
Question: I got the following permission error when I ran the above command. Help!
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ WARNING: UNPROTECTED PRIVATE KEY FILE! @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ Permissions 0644 for ‘../ampcamp.pem' are too open. It is recommended that your private key files are NOT accessible by others. This private key will be ignored. bad permissions: ignore key: ../ampcamp.pem Permission denied (publickey).
Answer: Run the next two commands.
chmod 600 ../ampcamp.pem ./spark-ec2 -i <key_file> -k <name_of_key_pair> --copy --resume launch amplab-training
Could not read http://s3.amazonaws.com/ampcamp-amis/latest-ampcamp3
Question: I got the following error when I ran the above command. Help!
Searching for existing cluster amplab-training... Could not read http://s3.amazonaws.com/ampcamp-amis/latest-ampcamp3
Answer: The lookup for the AMP Camp AMI failed. You can manually specificy the AMI to use by adding the ‘-a’ flag to the script. For example to use the AMP Camp 3 AMI, you can try the following command
./spark-ec2 -i <key_file> -k <name_of_key_pair> -a ami-452f622c --copy launch amplab-training
Your requested instance type (m1.xlarge) is not supported...
Question: I got the following permission error when I ran the above command. Help!
"Your requested instance type (m1.xlarge) is not supported in your requested Availability Zone (us-east-1b). Please retry your request by not specifying an Availability Zone or choosing us-east-1d, us-east-1c, us-east-1a, us-east-1e."
Answer: Add the
-z flag to your command line arguments to use an availability zone other than
You can set the value of that flag to “none”, as in the following example command, which tells the script to pick a random availability zone.
It may randomly pick an availability zone that doesn’t support this instance size (such as
us-east-1b), so you may need to try this command a few times to get it to work.
./spark-ec2 -i <key_file> -k <name_of_key_pair> -z none --copy launch amplab-training
Question: The above command is stuck at the following line. Help!
Copying AMP Camp Wikipedia pagecount data...
Answer: The data copy from EBS to your HDFS cluster is running and can take up to 30-40 minutes. If it has been longer and if you want to retry the data copy, use the following steps:
Stop the current process using
Ctrl + C
Login to the master node by running
./spark-ec2 -i <key_file> -k <key_pair> login amplab-training
Delete the directory the data was supposed to be copied to
/root/ephemeral-hdfs/bin/hadoop fs -rmr /wiki
Logout and run the following command to retry copying data from EBS
./spark-ec2 -i <key_file> -k <key_pair> copy-data amplab-training
Question: Can I specify the instances types while creating the cluster?
Answer: These exercises have been designed to work with at least 5 slave machines using instances of type m1.xlarge. You can also launch the cluster with different instance types. However, you should ensure two things:
Correct number of slaves: Make sure that the total memory in the slaves is about 54GB as the exercises are designed accordingly. So if you are using
m1.largeinstances (which have 7.5 GB memory), then you should launch a cluster with at least 8 slaves.
You can specify the instance type in the above command by setting the flag
-t <instance_type>. Similarly, you can specify the number of slaves by setting the flag
-s <number of slaves>. For example, to launching a cluster with 8
./spark-ec2 -i <key_file> -k <name_of_key_pair> -t m1.large -s 8 --copy launch amplab-training
Correct java heap setting for Spark: Make sure to change the
/root/shark/conf/shark-env.shon all of the instances to match the amount of memory available in the instance type you use. This is typically set it to the total amount of memory of the instance minus 1 GB for the OS (that is, for
m1.largewith 7.5GB memory, set
SPARK_MEM=6g). There is a easy way to change this configuration on all the instances. First, change this file in the master. Then run
/root/spark-ec2/copy-dir /root/spark/conf/ .
to copy the configuration directory to all slaves.
Information: Sometimes the EC2 instances don’t initialize within the standard waiting time of 120 seconds.
If that happens you, will ssh errors (or check in the Amazon web console).
In this case, try increasing the waiting to 4 minutes using the
-w 240 option.
Question: Can I use a EC2 region other than us-east-1 while creating the cluster?
Answer: These exercises have been created and tested on the us-east-1 region. However we have also copied the AMI to the us-west-1 region as well. To use the us-west-1 region, you can run the following command:
./spark-ec2 -i <key_file> -k <name_of_key_pair> -r us-west-1 -a ami-6ac4ee2f --copy launch amplab-training
We do not support running the AMI in any other EC2 regions. In case you need to do so, feel free to contact us for more help.
If you launched the cluster with the default script above (no custom instance type and/or number of slaves), your cluster should contain 6 m1.xlarge Amazon EC2 nodes.
Your cluster should be ready to use.
You can find the master hostname (
<master_node_hostname> in the instructions below) by running
./spark-ec2 -i <key_file> -k <key_pair> get-master amplab-training
At this point, it would be helpful to open a text file and copy
In a later exercise, you will want to have
<master_node_hostname> ready at hand without having to scroll through your terminal history.
Terminating the cluster (Not yet, only after you do the rest of the exercises!)
After you are done with your exercises (and only then), you can terminate the cluster by running
./spark-ec2 -i <key_file> -k <key_pair> destroy amplab-training
Log into your cluster
Move onto the next section for instructions that walk you through logging into your shiney new BDAS cluster.