This section will walk you through the process of launching a small cluster using your own Amazon EC2 account and our scripts and AMI (New to AMIs? See this intro to AMIs).
Pre-requisites
The cluster setup script we’ll use below requires Python 2.x and has been tested to work on Linux or OS X. We will use the Bash shell in our examples below. If you are using Windows, consider installing Cygwin (note that we have not tested this, hence providing debug support would be hard).
Setting up EC2 keys
Make sure you have an Amazon EC2 account.
Set the environment variables AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
to your Amazon EC2 access key ID and secret access key.
These can be obtained from the AWS homepage by clicking Account > Security Credentials > Access Credentials
:
export AWS_ACCESS_KEY_ID=<ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<SECRET_ACCESS_KEY>
Create an Amazon EC2 key pair for yourself.
This can be done by logging into your Amazon Web Services account through the AWS console, selecting EC2
from the Services
menu, selecting Key Pairs
on the left sidebar, and creating and downloading a key:
Make sure that you set the permissions for the private key file to 600
(i.e. only you can read and write it) so that ssh
will work (commands to do this are provided farther below).
Getting the scripts to launch EC2 cluster
Check out the launch scripts by cloning the github repository.
git clone git://github.com/amplab/training-scripts.git
You can also obtain them by downloading the zip file at https://github.com/amplab/training-scripts/archive/master.zip
Launching the cluster
Launch the cluster by running the following command. This script will launch a cluster, create a HDFS cluster and configure Mesos, Spark, and Shark. Finally, it will copy the datasets used in the exercises from S3 to the HDFS cluster. This can take around 15-20 mins.
cd training-scripts
./spark-ec2 -i <key_file> -k <name_of_key_pair> --copy launch amplab-training
Where <name_of_key_pair>
is the name of your EC2 key pair (that you gave it when you created it), <key_file>
is the private key file for your key pair.
For example, if you created a key pair named ampcamp-key
and the private key (<key_file>
) is in your home directory and is called ampcamp.pem
, then the command would be
./spark-ec2 -i ~/ampcamp.pem -k ampcamp-key --copy launch amplab-training
This command may take a 30-40 minutes or longer and should produce a bunch of output as it first spins up the nodes for your cluster, sets up BDAS on them, and performs a large distributed file copy of the wikipedia files we’ll use in these training documents from S3 to your instance of HDFS.
The following are some errors that you may encounter, and other frequently asked questions:
UNPROTECTED KEY FILE...
I get an error when running this command: Your requested instance type (m1.xlarge) is not supported...
I get an error when running this command: java.lang.IllegalArgumentException: Invalid hostname in URI...
Can I specify the instances types while creating the cluster?
Question: I got the following permission error when I ran the above command. Help!
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ WARNING: UNPROTECTED PRIVATE KEY FILE! @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ Permissions 0644 for ‘../ampcamp.pem' are too open. It is recommended that your private key files are NOT accessible by others. This private key will be ignored. bad permissions: ignore key: ../ampcamp.pem Permission denied (publickey).
Answer: Run the next two commands.
chmod 600 ../ampcamp.pem
./spark-ec2 -i <key_file> -k <name_of_key_pair> --copy --resume launch amplab-training
Question: I got the following permission error when I ran the above command. Help!
"Your requested instance type (m1.xlarge) is not supported in your requested Availability Zone (us-east-1b). Please retry your request by not specifying an Availability Zone or choosing us-east-1d, us-east-1c, us-east-1a, us-east-1e."
Answer: Add the -z
flag to your command line arguments to use an availability zone other than us-east-1b
.
You can set the value of that flag to “none”, as in the following example command, which tells the script to pick a random availability zone.
It may randomly pick an availability zone that doesn’t support this instance size (such as us-east-1b
), so you may need to try this command a few times to get it to work.
./spark-ec2 -i <key_file> -k <name_of_key_pair> -z none --copy launch amplab-training
Question: I got the following error when I ran the above command. Help!
12/08/21 16:50:45 INFO tools.DistCp: destPath=hdfs://ip-10-42-151-150.ec2.internal:9000/wiki/pagecounts java.lang.IllegalArgumentException: Invalid hostname in URI s3n://AKIAJIFGXUZ4MDJNYCGQ:COWo3AxVhjyu43Ug5kDvTnO/V3wQloBRIEOYEQgG@ampcamp-data/wikistats_20090505-07
Answer: The data copy from S3 to your EC2 cluster has failed. Do the following steps:
-
Login to the master node by running
./spark-ec2 -i <key_file> -k <key_pair> login amplab-training
-
Open the HDFS config file at
/root/ephemeral-hdfs/conf/core-site.xml
and copy your AWS access key and secret key into the respective fields. -
Restart HDFS
/root/ephemeral-hdfs/bin/stop-dfs.sh /root/ephemeral-hdfs/bin/start-dfs.sh
-
Delete the directory the data was supposed to be copied to
/root/ephemeral-hdfs/bin/hadoop fs -rmr /wiki
-
Logout and run the following command to retry copying data from S3
./spark-ec2 -i <key_file> -k <key_pair> copy-data amplab-training
Question: Can I specify the instances types while creating the cluster?
Answer: These exercises have been designed to work with at least 5 slave machines using instances of type m1.xlarge. You can also launch the cluster with different instance types. However, you should ensure two things:
-
Correct number of slaves: Make sure that the total memory in the slaves is about 54GB as the exercises are designed accordingly. So if you are using
m1.large
instances (which have 7.5 GB memory), then you should launch a cluster with at least 8 slaves.You can specify the instance type in the above command by setting the flag
-t <instance_type>
. Similarly, you can specify the number of slaves by setting the flag-s <number of slaves>
. For example, to launching a cluster with 8m1.large
slaves, use./spark-ec2 -i <key_file> -k <name_of_key_pair> -t m1.large -s 8 --copy launch amplab-training
-
Correct java heap setting for Spark: Make sure to change the
SPARK_MEM
variable in/root/spark/conf/spark-env.sh
and/root/shark/conf/shark-env.sh
on all of the instances to match the amount of memory available in the instance type you use. This is typically set it to the total amount of memory of the instance minus 1 GB for the OS (that is, form1.large
with 7.5GB memory, setSPARK_MEM=6g
). There is a easy way to change this configuration on all the instances. First, change this file in the master. Then run/root/mesos-ec2/copy-dir /root/spark/conf/ .
to copy the configuration directory to all slaves.
Information: Sometimes the EC2 instances don’t initialize within the standard waiting time of 120 seconds.
If that happens you, will ssh errors (or check in the Amazon web console).
In this case, try increasing the waiting to 4 minutes using the -w 240
option.
If you launched the cluster with the default script above (no custom instance type and/or number of slaves), your cluster should contain 6 m1.xlarge Amazon EC2 nodes.
Post-launch steps
Your cluster should be ready to use.
You can find the master hostname (<master_node_hostname>
in the instructions below) by running
./spark-ec2 -i <key_file> -k <key_pair> get-master amplab-training
At this point, it would be helpful to open a text file and copy <master_node_hostname>
there.
In a later exercise, you will want to have <master_node_hostname>
ready at hand without having to scroll through your terminal history.
Go through the Tutorial
Contratulations, you should now have a cluster running on EC2 with the all of the BDAS components you need to perform the Big Data Analytics tutorial.
Terminating the cluster (Not yet, only after you do the exercises!)
After you are done with your exercises, you can terminate the cluster by running
./spark-ec2 -i <key_file> -k <key_pair> destroy amplab-training