Hands-on Exercises

What is Tachyon

Memory is the key to fast Big Data processing. This has been realized by many, and frameworks such as Spark already leverage memory performance. As data sets continue to grow, storage is increasingly becoming a critical bottleneck in many workloads.

To address this need, we have developed Tachyon. Tachyon is an open source, memory-centric distributed storage system that enables reliable data sharing at memory speed across cluster jobs, possibly written in different computation frameworks. In the big data ecosystem, Tachyon lies between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS, or Ceph. Tachyon brings significant performance improvement to the stack; for example, Baidu uses Tachyon to improve their data analytics performance by a factor of 30. Beyond performance, Tachyon bridges new workloads with data stored in traditional storage systems. Users can run Tachyon in its standalone cluster mode, for example on Amazon EC2, or launch Tachyon with Apache Mesos or Apache YARN.

Tachyon is Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code changes. Tachyon is the default off-heap option in Spark, which means that RDDs can automatically be stored inside Tachyon to make Spark more resilient and avoid GC overheads. The project is open source and is already deployed at multiple companies. In addition, Tachyon has more than 130 contributors from over 50 institutions, including Yahoo, Intel, Red Hat, and Pivotal. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and is also part of the Fedora distribution.

In this chapter, we first go over the basic operations of Tachyon, and then run a Spark program on top of it. For more information, please visit Tachyon’s website or GitHub repository. We also host regular meetups in the Bay Area.

Prerequisites

Assumptions

Launch Tachyon

Configurations

All of the system’s configuration files are in the tachyon/conf folder. You configure the system by setting your own environment variables in tachyon/conf/tachyon-env.sh. For this exercise, we have provided a preconfigured Tachyon installation.

$ cp conf/tachyon-env.sh.template conf/tachyon-env.sh

For more information on configuration values, you can visit the Tachyon Configuration Settings Docs.

Format the storage

Before starting Tachyon for the first time, we need to format the system using the tachyon script in the tachyon/bin folder. Please type the following command. Note that if you are running Linux or Mac OS, Tachyon will request root permissions using sudo when creating the RAM disk.

$ ./bin/tachyon format
Connection to localhost... Formatting Tachyon Worker @ HYMac-2.local
Removing local data under folder: /Users/haoyuan/Downloads/test/tachyon/libexec/../ramdisk/tachyonworker/
Formatting Tachyon Master @ localhost
Formatting JOURNAL_FOLDER: /Users/haoyuan/Downloads/test/tachyon/libexec/../journal/
Formatting UNDERFS_DATA_FOLDER: /Users/haoyuan/Downloads/test/tachyon/libexec/../../data/tmp/tachyon/data
Formatting UNDERFS_WORKERS_FOLDER: /Users/haoyuan/Downloads/test/tachyon/libexec/../../data/tmp/tachyon/workers

Start the system

After formatting the storage, we can start the system using the tachyon/bin/tachyon-start.sh script.

$ ./bin/tachyon-start.sh local
Killed 0 processes
Killed 0 processes
Connection to localhost... Killed 0 processes
Starting master @ localhost
Starting worker @ HYMac-2.local

Interacting with Tachyon

In this section, we will go over three approaches to interact with Tachyon:

  1. Command Line Interface
  2. Application Programming Interface
  3. Web User Interface

Command Line Interface

You can interact with Tachyon using the following command:

$ ./bin/tachyon tfs

Then, it will return a list of options:

Usage: java TfsShell
       [cat <path>]
       [copyFromLocal <src> <remoteDst>]
       [copyToLocal <src> <localDst>]
       [count <path>]
       [du <path>]
       [fileinfo <path>]
       [free <file path|folder path>]
       [getUsedBytes]
       [getCapacityBytes]
       [load <path>]
       [loadMetadata <path>]
       [location <path>]
       [ls <path>]
       [lsr <path>]
       [mkdir <path>]
       [mount <tachyonPath> <ufsURI>]
       [mv <src> <dst>]
       [pin <path>]
       [report <path>]
       [request <tachyonaddress> <dependencyId>]
       [rm <path>]
       [rmr <path>]
       [setTTL <path> <time to live(in milliseconds)>]
       [unsetTTL <path>]
       [tail <path>]
       [touch <path>]
       [unmount <tachyonPath>]
       [unpin <path>]

Please try to put the local file tachyon/LICENSE into the Tachyon file system as /LICENSE using the command line.

$ ./bin/tachyon tfs copyFromLocal LICENSE /LICENSE
Copied LICENSE to /LICENSE

You can also use the command line interface to verify this:

$ ./bin/tachyon tfs ls /
11.40 KB  02-07-2014 23:23:44:008  In Memory      /LICENSE
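If you want to check such listings from a script, the following is a minimal Python sketch that splits one ls output line into its fields. The field layout (size, date, time, status, path) is an assumption inferred only from the sample output lines shown in this chapter, and parse_ls_line is a hypothetical helper, not part of Tachyon.

```python
import re

# Assumed layout, inferred from the sample output in this chapter:
#   <size><unit>  <MM-DD-YYYY> <HH:MM:SS:mmm>  <status>      <path>
LS_LINE = re.compile(
    r"^(?P<size>[\d.]+)\s*(?P<unit>[A-Za-z]+)\s+"
    r"(?P<date>\d{2}-\d{2}-\d{4})\s+(?P<time>[\d:]+)\s+"
    r"(?P<status>.+?)\s{2,}(?P<path>/\S*)$"
)

def parse_ls_line(line):
    """Return the fields of one listing line as a dict, or None if it does not match."""
    match = LS_LINE.match(line.strip())
    return match.groupdict() if match else None

entry = parse_ls_line("11.40 KB  02-07-2014 23:23:44:008  In Memory      /LICENSE")
print(entry["path"], entry["status"])
```

A line that does not follow the assumed layout yields None, so a script can skip it rather than fail.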

Now, check the content of the file:

$ ./bin/tachyon tfs cat /LICENSE
                                 Apache License
                          Version 2.0, January 2004
                       http://www.apache.org/licenses/
  TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
....

Application Programming Interface

After using the command line to interact with Tachyon, you can also use its API. We have several sample applications. For example, BasicOperations.java shows how to use the file create, write, and read operations.

Using the tachyon script, you can run this sample program with the following command. It runs several tests like BasicOperations.java, and also verifies Tachyon’s installation.

$ ./bin/tachyon runTests
/root/ampcamp6/tachyon/bin/tachyon runTest Basic CACHE_PROMOTE MUST_CACHE
/default_tests_files/BasicFile_CACHE_PROMOTE_MUST_CACHE has been removed
2015-11-18 15:47:50,229 INFO   (ClientBase.java:connect) - Tachyon client (version 0.8.2) is trying to connect with FileSystemMaster master @ localhost/127.0.0.1:19998
2015-11-18 15:47:50,242 INFO   (ClientBase.java:connect) - Client registered with FileSystemMaster master @ localhost/127.0.0.1:19998
2015-11-18 15:47:50,269 INFO   (BasicOperations.java:createFile) - createFile with fileId 889192447 took 47 ms.
2015-11-18 15:47:50,279 INFO   (ClientBase.java:connect) - Tachyon client (version 0.8.2) is trying to connect with BlockMaster master @ localhost/127.0.0.1:19998
2015-11-18 15:47:50,280 INFO   (ClientBase.java:connect) - Client registered with BlockMaster master @ localhost/127.0.0.1:19998
2015-11-18 15:47:50,304 INFO   (WorkerClient.java:connect) - Connecting local worker @ /192.168.1.9:29998
2015-11-18 15:47:50,341 INFO   (FileUtils.java:createStorageDirPath) - Folder /Volumes/ramdisk/tachyonworker/3540443706671334291 was created!
2015-11-18 15:47:50,346 INFO   (LocalBlockOutStream.java:<init>) - LocalBlockOutStream created new file block, block path: /Volumes/ramdisk/tachyonworker/3540443706671334291/872415232
2015-11-18 15:47:50,392 INFO   (BasicOperations.java:writeFile) - writeFile to file /default_tests_files/BasicFile_CACHE_PROMOTE_MUST_CACHE took 123 ms.
2015-11-18 15:47:50,457 INFO   (BasicOperations.java:readFile) - readFile file /default_tests_files/BasicFile_CACHE_PROMOTE_MUST_CACHE took 65 ms.
Passed the test!
...

Web User Interface

After using commands and API to interact with Tachyon, let’s take a look at its web user interface. The URI is http://localhost:19999.
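If the page does not load, a quick first check is whether anything is listening on the expected ports. The following is a small Python sketch that assumes the defaults seen in this chapter (19998 for the master, as shown in the log output, and 19999 for the web UI); is_port_open is a hypothetical helper, not a Tachyon tool.

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Ports taken from this chapter; adjust them if your installation differs.
for name, port in (("master", 19998), ("web UI", 19999)):
    state = "reachable" if is_port_open("localhost", port) else "not reachable"
    print(f"{name} port {port}: {state}")
```

A reachable port only shows that some process accepts connections there; it does not prove that the process is Tachyon.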

The first page is the overview of the running system. The second page is the system configuration.

If you click Browse File System, it shows all the files you just created and copied.

You can also click a particular file or folder, e.g. the /LICENSE file, to see detailed information about it.

Run Spark on Tachyon

Input/Output with Tachyon

In this section, we run Spark programs that interact with Tachyon. The first one does a word count on the /LICENSE file. In the /root/spark folder, execute the following command to start the Spark shell.

$ ./bin/spark-shell

Scala (in the Spark shell):

sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("tachyon://localhost:19998/result")

Java:

// spark is a JavaSparkContext
JavaRDD<String> file = spark.textFile("tachyon://localhost:19998/LICENSE");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("tachyon://localhost:19998/result");

Python:

file = sc.textFile("tachyon://localhost:19998/LICENSE")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("tachyon://localhost:19998/result")

The results are stored in the /result folder. You can verify the results through the web UI or the command line. Because /LICENSE is in memory, when a new Spark program comes up, it can load the in-memory data directly from Tachyon. In the meantime, we are also working on other features to make Tachyon further enhance Spark’s performance.
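To see what the word count computes without a cluster, here is a plain Python sketch of the same flatMap, map, and reduceByKey logic applied to a small in-memory list. The word_count function and the sample lines are illustrative assumptions, not part of Spark or Tachyon.

```python
from collections import Counter

def word_count(lines):
    """Count tokens the way the Python job above does: split each line on a single space."""
    counts = Counter()
    for line in lines:
        # str.split(" ") keeps empty strings for consecutive spaces,
        # so indented lines contribute "" tokens to the result.
        counts.update(line.split(" "))
    return dict(counts)

sample = ["Apache License", "  Version 2.0, January 2004", "Apache License"]
result = word_count(sample)
print(result["Apache"])  # 2
print(result[""])        # 2, from the two leading spaces of the second line
```

The empty-string key explains why the saved results of a word count over an indented file such as LICENSE can contain an entry for an empty word.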

Store RDD OFF_HEAP in Tachyon

Storing an RDD as OFF_HEAP storage in Tachyon has several advantages; as noted earlier, it makes Spark more resilient and avoids GC overheads.

Please try the following example:

sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
counts.take(10)

Now, try running counts.take(10) again and you will see that it’s much faster than the first time because the counts RDD has been stored OFF_HEAP in Tachyon.

Mounting a Storage System in Tachyon

In this section, we will discover how to mount a part of the local file system to a Tachyon directory. Doing so will allow Tachyon to transparently access all the data of the mounted storage system without manually copying data, as we previously did with copyFromLocal. In addition, multiple storage systems can be mounted to different Tachyon paths, allowing Tachyon to provide a unified namespace for an arbitrary number of storage systems.
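One way to picture a unified namespace is as a mount table that maps path prefixes to storage locations. The following Python sketch resolves a Tachyon path by longest-prefix match. It is an illustration under that assumption, not Tachyon's actual implementation, and the resolve function and the s3n://example-bucket/data location are hypothetical.

```python
import posixpath

def resolve(mount_table, tachyon_path):
    """Return the under-storage location for tachyon_path, or None if no mount covers it."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/") + "/"
        covered = tachyon_path == mount_point or tachyon_path.startswith(prefix)
        # Prefer the longest matching mount point so nested mounts win.
        if covered and (best is None or len(mount_point) > len(best)):
            best = mount_point
    if best is None:
        return None
    relative = tachyon_path[len(best):].lstrip("/")
    base = mount_table[best]
    return posixpath.join(base, relative) if relative else base

# Hypothetical mount table: one local folder and one made-up bucket.
mounts = {
    "/local": "/root/ampcamp6/tachyon",
    "/archive": "s3n://example-bucket/data",
}
print(resolve(mounts, "/local/NOTICE"))
print(resolve(mounts, "/archive/2015/a.txt"))
```

Paths outside every mount point resolve to None in this sketch, which keeps the lookup explicit about what is and is not backed by a mounted system.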

Using the Mount Command

To mount an under storage system to a path in Tachyon, use the mount command of the Tachyon shell. Try mounting the ampcamp6/tachyon folder to /local in Tachyon.

$ ./bin/tachyon tfs mount /local /root/ampcamp6/tachyon
Mounted /root/ampcamp6/tachyon at /local

Now that the ampcamp6/tachyon directory has been mounted to /local in Tachyon, you can try to list the contents of the folder.

$ ./bin/tachyon tfs ls /local

You will see that there are no files in the folder. By default, Tachyon loads data lazily from mounted storage systems, fetching the files on demand to prevent a performance penalty when mounting a storage system with many objects.
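The lazy behavior described above can be modeled in a few lines of Python. This is a toy sketch only: LazyMount is a hypothetical class that keeps a dictionary standing in for the mounted storage, and it says nothing about how Tachyon implements loading internally.

```python
class LazyMount:
    """Toy model of a mounted directory whose entries appear only after first access."""

    def __init__(self, under_storage_files):
        self._under = dict(under_storage_files)  # name -> contents in the mounted storage
        self._loaded = {}                        # name -> contents already fetched

    def ls(self):
        # Only entries that have been accessed are listed.
        return sorted(self._loaded)

    def load(self, name):
        # Fetch on demand; a repeated load reuses the copy fetched earlier.
        if name not in self._loaded:
            self._loaded[name] = self._under[name]
        return self._loaded[name]

mount = LazyMount({"NOTICE": "notice text", "README.md": "readme text"})
before = mount.ls()
mount.load("NOTICE")
after = mount.ls()
print(before, after)  # [] ['NOTICE']
```

Mounting costs nothing per file in this model, which mirrors the motivation given above: no penalty for mounting a storage system with many objects.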

Loading data from a Mounted Storage System

To fetch data from the mounted storage system, we simply need to access it. Try using the load command in the Tachyon shell to load the NOTICE file.

$ ./bin/tachyon tfs load /local/NOTICE
/local/NOTICE loaded

You can now verify the NOTICE file has been fetched from the mounted storage with ls or through the web UI.

$ ./bin/tachyon tfs ls /local
4111.00B  11-19-2015 10:02:03:013  In Memory      /local/NOTICE

The load command is just one way to fetch the data. We can also access the data directly from a Spark program, which pulls it from the underlying storage.

Run spark-shell again in the spark directory.

$ ./bin/spark-shell

Now we will run a word count on the README.md file, which has not been loaded into Tachyon but exists in the mounted storage system.

Scala (in the Spark shell):

sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
val file = sc.textFile("tachyon://localhost:19998/local/README.md")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("tachyon://localhost:19998/result2")

Java:

// spark is a JavaSparkContext
JavaRDD<String> file = spark.textFile("tachyon://localhost:19998/local/README.md");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("tachyon://localhost:19998/result2");

Python:

file = sc.textFile("tachyon://localhost:19998/local/README.md")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("tachyon://localhost:19998/result2")

You can view the result of the operation through the web UI or Tachyon shell. You can also see that the README.md file appears in /local.

Shutting down Tachyon

To shut down Tachyon, issue the following command:

$ ./bin/tachyon-stop.sh

This brings us to the end of the Tachyon chapter of the tutorial. We encourage you to continue playing with the code and to check out the project website, GitHub repository, and meetup group for further information.

Bug reports and feature requests are welcome.
