Hands-on Exercises

Now that we’ve had fun with Spark, let’s try out Shark. Remember Shark is a large-scale data warehouse system that runs on top of Spark and provides a SQL like interface (that is fully compatible with Apache Hive).

  1. First, launch the Shark console:

  2. Similar to Apache Hive, Shark can query external tables (i.e., tables that are not created in Shark). Before you do any querying, you will need to tell Shark where the data is and define its schema.

    shark> create external table wikistats (dt string, project_code string, page_name string, page_views int, bytes int) row format delimited fields terminated by ' ' location '/wiki/pagecounts';
    Time taken: 0.232 seconds
    13/02/05 21:31:25 INFO CliDriver: Time taken: 0.232 seconds

    FAQ: If you see the following errors, don’t worry. Things are still working under the hood.

    12/08/18 21:07:34 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
    12/08/18 21:07:34 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
    12/08/18 21:07:34 ERROR DataNucleus.Plugin: Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.

    FAQ: If you see errors like these, you might have copied and pasted a line break, and should be able to remove it to get rid of the errors.

    13/02/05 21:22:16 INFO parse.ParseDriver: Parsing command: CR
    FAILED: Parse Error: line 1:0 cannot recognize input near 'CR' '<EOF>' '<EOF>'

    FAQ: If you partially complete the exercises and then restart them later, your tables will stick around. You will see errors like this if you try to create them again:

    FAILED: Error in metadata: AlreadyExistsException(message:Table wikistats already exists)
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

    To fix this, drop the table (wikistats or, introduced shortly, wikistats_cached), then create it again using the same command described above. You can drop a table with:

    drop table wikistats;
  3. Let’s create a table containing all English records and cache it in the cluster’s memory. Shark automatically caches any table having a name with the suffix “_cached”.

    shark> create table wikistats_cached as select * from wikistats where project_code="en";
    Time taken: 127.5 seconds
    13/02/05 21:57:34 INFO CliDriver: Time taken: 127.5 seconds
  4. Do a simple count to get the number of English records. If you have some familiarity working with databases, note that we us the “count(1)” syntax here since in earlier versions of Hive, the more popular “count(*)” operation was not supported. The Hive syntax is described in detail in the Hive Getting Started Guide.

    shark> select count(1) from wikistats_cached;
    Time taken: 7.632 seconds
    12/08/18 21:23:13 INFO CliDriver: Time taken: 7.632 seconds
  5. Output the total traffic to Wikipedia English pages for each hour between May 7 and May 9, with one line per hour.

    shark> select dt, sum(page_views) from wikistats_cached group by dt;
    20090507-070000	6292754
    20090505-120000	7304485
    20090506-110000	6609124
    Time taken: 12.614 seconds
    13/02/05 22:05:18 INFO CliDriver: Time taken: 12.614 seconds
  6. In the Spark section, we ran a very expensive query to compute pages that were viewed more than 200,000 times. It is fairly simple to do the same thing in SQL.

    To make the query run faster, we increase the number of reducers used in this query to 50 in the first command. Note that the default number of reducers, which we have been using so far in this section, is 1.

    shark> set mapred.reduce.tasks=50;
    shark> select page_name, sum(page_views) as views from wikistats_cached group by page_name having views > 200000;
    index.html      310642
    Swine_influenza 534253
    404_error/      43822489
    YouTube 203378
    X-Men_Origins:_Wolverine        204604
    Dom_DeLuise     396776
    Special:Watchlist       311465
    The_Beatles     317708
    Special:Search  17657352
    Special:Random  5816953
    Special:Export  248624
    Scrubs_(TV_series)      234855
    Cinco_de_Mayo   695817
    2009_swine_flu_outbreak 237677
    Deadpool_(comics)       382510
    Wiki    464935
    Special:Randompage      3521336
    Main_Page       18730347
    Time taken: 68.693 seconds
    13/02/26 08:12:42 INFO CliDriver: Time taken: 68.693 seconds
  7. With all the warm up, now it is your turn to write queries. Write Hive QL queries to answer the following questions:

To exit Shark, type the following at the Shark command line (and don’t forget the semicolon!).

   shark> exit;
Submit an issue on GitHub
Hands-on Exercises