<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
    <meta name="description" content="Course homepage for CS 489 Big Data Infrastructure (Winter 2016) at the University of Waterloo">
    <meta name="author" content="Jimmy Lin">
    <title>Big Data Infrastructure</title>

    <!-- Bootstrap -->
    <link href="css/bootstrap.min.css" rel="stylesheet">

    <!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
    <link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">

    <style>
      body {
        padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
      }
    </style>

    <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
    <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
    <!--[if lt IE 9]>
      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
      <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
  </head>


  <body>

    <nav class="navbar navbar-inverse navbar-fixed-top">
      <div class="container">
        <div class="navbar-header">
          <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
            <span class="sr-only">Toggle navigation</span>
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
            <span class="icon-bar"></span>
          </button>
        </div>
        <div id="navbar" class="collapse navbar-collapse">
          <ul class="nav navbar-nav">
            <li><a href="index.html">Overview</a></li>
            <li><a href="organization.html">Organization</a></li>
            <li><a href="syllabus.html">Syllabus</a></li>
            <li class="active"><a href="assignments.html">Assignments</a></li>
            <li><a href="software.html">Software</a></li>
          </ul>
        </div><!--/.nav-collapse -->
      </div>
    </nav>

    <div class="container">


  <div class="page-header">
    <div style="float: right"><img src="images/waterloo_logo.png"/></div>
    <h1>Assignments <small>CS 489/698 Big Data Infrastructure (Winter 2016)</small></h1>
  </div>

  <div class="subnav">
    <ul class="nav nav-pills">
      <li><a href="#assignment0">0</a></li>
      <li><a href="#assignment1">1</a></li>
      <li><a href="#assignment2">2</a></li>
      <li><a href="#assignment3">3</a></li>
      <li><a href="#assignment4">4</a></li>
      <li><a href="#assignment5">5</a></li>
      <li><a href="#assignment6">6</a></li>
      <li><a href="#assignment7">7</a></li>
      <li><a href="#project">Final Project</a></li>
    </ul>
  </div>


<section id="assignment0" style="padding-top:35px">
<div>
<h3>Assignment 0: Warmup <small>due 8:30am January 12</small></h3>

<p>The purpose of this assignment is to serve as a simple warmup
exercise and to serve as a practice "dry run" for the submission
procedures of subsequent assignments. You'll have to write a bit of
code but this assignment is mostly about the "mechanics" of setting up
your Hadoop development environment. In addition to running Hadoop
locally in either the Linux student CS environment or on your own
machine, you'll also try running jobs on the Altiscale cluster.</p>

<p>The general setup is as follows:
you will complete your assignments and check everything into a private
GitHub repo. Shortly after the assignment deadline, we'll pull your
repo for grading. Although we will be examining your solutions to
assignment 0, it will not be graded <i>per se</i>.</p>

<p>I'm assuming you already have
a <a href="http://github.com/">GitHub</a> account. If not, create one
as soon as possible. Once you've signed up for an account, go and
<a href="https://education.github.com/discount_requests/new">request
an educational account</a>. This will allow you to create private
repos for free. Please do this as soon as possible since there may be
delays in the request verification process.</p>

<h4 style="padding-top: 10px">Setting up Hadoop and Spark</h4>

<p>Hadoop and Spark are already installed in
the <code>linux.student.cs.uwaterloo.ca</code>
environment (you just need to add some paths).
Alternatively, you may wish to install everything locally
on your own machine. For both, see the <a href="software.html">software page</a> for
more details.</p>

<p>Bespin is a library that contains reference implementations of "big
data" algorithms in MapReduce and Spark. We'll be using it throughout
this course. Go and run
the <a href="https://github.com/lintool/bespin">Word Count in
MapReduce and Spark</a> example as shown in the Bespin README (clone
and build the repo, download the data files, run word count in both
MapReduce and Spark, and verify output). Assuming you are
using <code>linux.student.cs.uwaterloo.ca</code> (or if you have
properly set up your local environment), this task should be as simple
as copying and pasting commands from the Bespin README.</p>

<p>When running Hadoop, you might get the following warning: "Unable
to load native-hadoop library for your platform... using builtin-java
classes where applicable". It's okay: no need to worry.</p>

<h4 style="padding-top: 10px">Time to write some code!</h4>

<p>Create a <b>private</b> repo
called <code>bigdata2016w</code>. I'm assuming that you're
already familiar with Git and GitHub, but just in case, here
is <a href="https://help.github.com/articles/create-a-repo">how you
create a repo on GitHub</a>. For "Who has access to this repository?",
make sure you click "Only the people I specify". If you've
successfully gotten an educational account (per above), you should be
able to create private repos for free. If you're not already familiar
with Git, there are plenty of good tutorials online: do a simple web
search and find one you like.</p>

<p>What you're going to do now is to copy the MapReduce word count
example into your own private repo. Start with
<a href="assignments/pom.xml">this <code>pom.xml</code></a>: copy it
into your <code>bigdata2016w</code> repo. Then replace this line in
that file:</p>

<pre>
  &lt;groupId&gt;ca.uwaterloo.cs.bigdata2016w.lintool&lt;/groupId&gt;
</pre>

<p>Instead of <code>lintool</code>, substitute your GitHub
username. You'll be working in your own namespace, so in everything
that follows, substitute your own GitHub username in place
of <code>lintool</code>.</p>

<p>Next, copy:</p>

<ul>
  <li><code>bespin/src/main/java/io/bespin/java/mapreduce/wordcount/WordCount.java</code> over to
  <li><code>bigdata2016w/src/main/java/ca/uwaterloo/cs/bigdata2016w/lintool/assignment0/WordCount.java</code>.
</ul>

<p>Open up this new version of <code>WordCount.java</code> using a
text editor (or your IDE of choice) and change the Java package
to <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment0</code>.</p>

<p>Now, in the <code>bigdata2016w/</code> base directory, you should
be able to run Maven to build your package:</p>

<pre>
$ mvn clean package
</pre>

<p>Once the build succeeds, you should be able to run the word count
demo program in your own repository:</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment0.WordCount \
   -input data/Shakespeare.txt -output wc
</pre>

<p>You should be running this in the Linux student CS environment or
on your own machine. Note that you'll need to copy over the
Shakespeare collection in <code>data/</code>. The output should be
exactly the same as the same program in Bespin, but the difference
here is that the code is now in a repository under your control, in
your own private namespace.</p>

<p>Let's make a simple modification to word count: instead of counting
words, I want to count the occurrences of two-character prefixes of
words, i.e., the first two characters. That is, I want to know how
many words begin with "aa", "ab", "ac", etc., all the way to "zz"
(including special characters, etc.). Create a program
called <code>PrefixCount</code> in the
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment0</code> that does this.</p>

<p>To be clear, the <code>WordCount</code> program defines a "word" as
follows:</p>

<pre>
String w = itr.nextToken().toLowerCase().replaceAll("(^[^a-z]+|[^a-z]+$)", "");
</pre>

<p>Simply take whatever <code>w.substring(0, 2)</code> gives you as a
prefix. This means, of course, that you should ignore single
characters.</p>
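<p>As a rough illustration of the rule above, here is a minimal sketch
in plain Java (no Hadoop types). The class and method names
(<code>PrefixSketch</code>, <code>countPrefixes</code>) are
hypothetical and not part of Bespin; only the tokenization line is
taken from <code>WordCount</code>.</p>

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative sketch only: PrefixSketch and countPrefixes are hypothetical names.
final class PrefixSketch {
  // Counts two-character prefixes using the same token cleanup as WordCount.
  static Map<String, Integer> countPrefixes(String text) {
    Map<String, Integer> counts = new TreeMap<>();
    StringTokenizer itr = new StringTokenizer(text);
    while (itr.hasMoreTokens()) {
      String w = itr.nextToken().toLowerCase().replaceAll("(^[^a-z]+|[^a-z]+$)", "");
      // substring(0, 2) throws on shorter strings, so skip empty and single-character words.
      if (w.length() < 2) continue;
      counts.merge(w.substring(0, 2), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Prints {li=3, th=2}
    System.out.println(countPrefixes("A lion, a LIGHT; the thin line!"));
  }
}
```

<p>Note that the cleanup only strips non-letters from the two ends of a
token, so a token such as "o'er" yields the prefix "o'". This is why
prefixes can include special characters.</p>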

<p>We should be able to run your program as follows:</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment0.PrefixCount \
   -input data/Shakespeare.txt -output cs489-2016w-lintool-a0-shakespeare
</pre>

<p>You shouldn't need to write more than a couple lines of code
(beyond changing class names and other boilerplate). We'll go over the
Hadoop API in more detail in class, but the changes should be
straightforward.</p>

<p>Answer the following questions:</p>

<p><b>Question 1.</b> In the Shakespeare collection, what are the
three most frequent two-character prefixes and how many times does
each occur? (Remember when I mentioned "command line"-fu skills in
class? This is where such skills will come in handy...)</p>

<p><b>Question 2.</b> In the Shakespeare collection, how frequently does
the prefix "li" occur?</p>
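<p>If you would rather not use the command line for this, a small
helper program works too. The sketch below assumes each output line
has the form <code>prefix&lt;TAB&gt;count</code> (what
<code>TextOutputFormat</code> writes by default); the class name
<code>TopPrefixes</code> and the sample counts are made up for
illustration.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: TopPrefixes is a hypothetical helper, not part of Bespin.
final class TopPrefixes {
  // Returns the k lines with the largest counts, highest count first.
  // Assumes each line looks like "prefix<TAB>count".
  static List<String> topK(List<String> lines, int k) {
    List<String> sorted = new ArrayList<>(lines);
    sorted.sort(Comparator.comparingInt(
        (String line) -> Integer.parseInt(line.split("\t")[1])).reversed());
    return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
  }

  public static void main(String[] args) {
    // Made-up sample values, not real results from the Shakespeare collection.
    List<String> sample = Arrays.asList("th\t5", "li\t3", "an\t7", "be\t2");
    for (String line : topK(sample, 3)) {
      System.out.println(line);
    }
  }
}
```

<p>In practice you would read the lines from the <code>part-r-*</code>
files in your output directory instead of a hard-coded list.</p>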

<p>You can run the above instructions using 
<a href="assignments/check_assignment0_public_linux.py"><code>check_assignment0_public_linux.py</code></a> as follows:</p>

<pre>
$ wget http://lintool.github.io/bigdata-2016w/assignments/check_assignment0_public_linux.py
$ ./check_assignment0_public_linux.py lintool
</pre>

<p>In fact, we'll be using exactly this script to check your
assignment in the Linux Student CS environment. That is, make sure
that your code runs there even if you do development on your own
machine.</p>

<h4 style="padding-top: 10px">Using the Altiscale Cluster</h4>

<p>The <a href="software.html">software page</a> has details on
getting started with the Altiscale cluster. Register your account and
follow instructions to set up ssh into the "workspace".  Make sure
you've properly set up the proxy to view the cluster Resource Manager
(RM) webapp
at <a href="http://rm-ia.s3s.altiscale.com:8088/cluster/"><code>http://rm-ia.s3s.altiscale.com:8088/cluster/</code></a>.
Getting access to the RM webapp is important&mdash;you'll need it to
track your job status and for debugging purposes.</p>

<p>Once you've ssh'ed into the workspace, check out Bespin and run
word count:</p>

<pre>
$ hadoop jar target/bespin-0.1.0-SNAPSHOT.jar io.bespin.java.mapreduce.wordcount.WordCount \
   -input /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt -output wc-jmr-combiner
</pre>

<p>Note that we're running word count over a larger collection here: a
10% sample of English Wikipedia totaling 1.3 GB (here's a chance to
exercise your newly-acquired HDFS skills to confirm for yourself).</p>

<p><b>Question 3.</b> Were you able to successfully run word count on
the Altiscale cluster and get access to the Resource Manager webapp?
(Yes or No)</p>

<p>Now switch into your own <code>bigdata2016w/</code> repo and run
your prefix count program on the sample Wikipedia data:</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment0.PrefixCount \
   -input /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt -output cs489-2016w-lintool-a0-wiki
</pre>

<p><b>Question 4.</b> In the sample Wikipedia collection, what are the
three most frequent two-character prefixes and how many times does
each occur?</p>

<p><b>Question 5.</b> In the sample Wikipedia collection, how frequently
does the prefix "li" occur?</p>

<p>Note that the Altiscale cluster is a shared resource, and how fast
your jobs complete will depend on how busy it is. You're advised to
begin the assignment early so as to avoid long job queues. "I wasn't able
to complete the assignment because there were too many jobs running on
the cluster" will not be accepted as an excuse if your assignment is
late.</p>

<p>You can run the above instructions using 
<a href="assignments/check_assignment0_public_altiscale.py"><code>check_assignment0_public_altiscale.py</code></a> as follows:</p>

<pre>
$ wget http://lintool.github.io/bigdata-2016w/assignments/check_assignment0_public_altiscale.py
$ ./check_assignment0_public_altiscale.py lintool
</pre>

<p>In fact, we'll be using exactly this script to check your
assignment on the Altiscale cluster.</p>


<h4 style="padding-top: 10px">Turning in the Assignment</h4>

<p>At this point, you should have a GitHub
repo <code>bigdata2016w/</code> and inside the repo, you should have
the word count program copied over from Bespin and the new prefix count
implementation, along with your <code>pom.xml</code>.  Commit these
files. Next, create a file called <code>assignment0.md</code>
inside <code>bigdata2016w/</code>. In that file, put your answers to
the above questions (1&ndash;5). Use the Markdown annotation format: here's
a <a href="http://daringfireball.net/projects/markdown/basics">simple
guide</a>.</p>

<p><b>Note:</b> there is no need to commit <code>data/</code>
or <code>target/</code> (or any results that you may have generated),
so your repo should be very compact &mdash; it should only have four
files: two Java source files, <code>pom.xml</code>,
and <code>assignment0.md</code>. You can add a <code>.gitignore</code>
file if you wish.</p>

<p>For this and all subsequent assignments, make sure everything is on
the master branch. Push your repo to GitHub. You can verify that it's
there by logging into your GitHub account in a web browser: your
assignment should be viewable in the web interface.</p>

<p>For this (and the following assignments) there are two parts, one
that can be completed locally, and another that requires the Altiscale
cluster. For the first, make sure that your code runs in the Linux
Student CS environment (even if you do development on your own
machine), which is where we will be doing the grading. "But it runs on
my laptop!" will not be accepted as an excuse if we can't get your
code to run.</p>

<p>Almost there! Add the
user <a href="https://github.com/teachtool">teachtool</a> as a
collaborator to your repo so that we can access it (under settings in
the main web interface on your repo). Note: do <b>not</b> add my
primary GitHub
account <a href="https://github.com/lintool">lintool</a> as a
collaborator.</p>

<p>Finally, you need to tell us your GitHub account so we can link it
to you. Submit your user
name <a href="http://goo.gl/forms/UBIaZNNzHF">here</a>.</p>

<p>And that's it!</p>

<p>To give you an idea of how we'll be grading this and future
assignments&mdash;we will clone your repo and use the above check
scripts:</p>

<ul>

<li><a href="assignments/check_assignment0_public_linux.py"><code>check_assignment0_public_linux.py</code></a>
in the Linux Student CS environment.</li>

<li><a href="assignments/check_assignment0_public_altiscale.py"><code>check_assignment0_public_altiscale.py</code></a> on the Altiscale cluster.</li>

</ul>

<p>We'll make sure the data files are in the right place, and once the
code completes, we will verify the output. It is highly recommended that
you run these check scripts: if they don't work for you, they won't work
for us either.</p>

<p>As mentioned above, one main purpose of this assignment is to
provide a practice "dry run" of how assignments will be submitted in
the future. It is your responsibility to follow these instructions and
learn the process: we will work with you to get the process sorted out
for this assignment, but in subsequent assignments, you may be docked
points for failing to conform to our expectations.</p>

<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>


<section id="assignment1" style="padding-top:35px">
<div>
<h3>Assignment 1: Counting in MapReduce <small>due 8:30am January 19</small></h3>

<p>By now, you should already be familiar with the Hadoop execution
environment (e.g., submitting jobs) and using Maven to organize your
assignments. You will be working in the same repo as before, except
that everything should go into the package namespace
<code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment1</code>
(obviously, replace <code>lintool</code> with your actual GitHub
username).</p>

<p>Note that the point of assignment 0 was to familiarize you with
GitHub and the Hadoop development environment. We will work through
issues with you, but starting this assignment, excuses along the lines
of "I couldn't get my repo set up properly", "I couldn't figure out
how to push my assignment to GitHub", etc. will not be accepted. It is
your responsibility to sort out any issues with the mechanics.</p>

<p>Before starting this assignment, it is <i>highly recommended</i>
that you look at the implementations of bigram relative frequency and
co-occurrence matrix computation
in <a href="http://bespin.io">Bespin</a>.</p>

<p>In this assignment you'll be
computing <a href="http://en.wikipedia.org/wiki/Pointwise_mutual_information">pointwise
mutual information</a>, which is a function of two events <i>x</i>
and <i>y</i>:</p>

<p><img width="200" src="assignments/PMI.png"/></p>

<p>The larger the magnitude of PMI for <i>x</i> and <i>y</i> is,
the more information you know about the probability of seeing <i>y</i>
having just seen <i>x</i> (and vice-versa, since PMI is
symmetrical). If seeing <i>x</i> gives you no information about seeing
<i>y</i>, then <i>x</i> and <i>y</i> are independent and the PMI is
zero.</p>

<p>Write a program (two separate implementations, actually&mdash;more details below)
that computes the PMI of words in the
<code>data/Shakespeare.txt</code> collection that's used in the Bespin
demos and the previous assignment. Your implementation should be in Java. To be more specific, the event
we're after is <i>x</i> occurring on a line in the file (the denominator above) or <i>x</i>
and <i>y</i> co-occurring on a line (the numerator above). That is, if a line contains "A B
C", then the co-occurring pairs are:</p>

<ul>
  <li>(A, B)</li>
  <li>(A, C)</li>
  <li>(B, A)</li>
  <li>(B, C)</li>
  <li>(C, A)</li>
  <li>(C, B)</li>
</ul>

<p>If the line contains "A A B C", the co-occurring pairs are still
the same as above; same if the line contains "A B C A B C"; or any
combinations of A, B, and C in any order.</p>
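<p>To make the pair definition concrete, here is a minimal sketch in
plain Java. It splits on whitespace only (it does not apply the course's
definition of "word"), and the class name
<code>CooccurrencePairs</code> is hypothetical.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: CooccurrencePairs is a hypothetical name.
final class CooccurrencePairs {
  // Returns every ordered pair (x, y) with x != y over the distinct tokens of a line.
  static List<String> pairs(String line) {
    // A set removes repeated tokens, so "A A B C" behaves like "A B C".
    Set<String> distinct = new LinkedHashSet<>();
    for (String token : line.trim().split("\\s+")) {
      if (!token.isEmpty()) distinct.add(token);
    }
    String[] words = distinct.toArray(new String[0]);
    List<String> result = new ArrayList<>();
    for (int i = 0; i < words.length; i++) {
      for (int j = 0; j < words.length; j++) {
        if (i != j) result.add("(" + words[i] + ", " + words[j] + ")");
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Both lines print [(A, B), (A, C), (B, A), (B, C), (C, A), (C, B)]
    System.out.println(pairs("A B C"));
    System.out.println(pairs("A A B C"));
  }
}
```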

<p>A few additional important details:</p>

<ul>

<li>To reduce the number of spurious pairs, we are only interested in
pairs of words that co-occur in ten or more lines.</li>

<li>To reduce the computational complexity of the problem, we are only
going to consider up to the first 100 words in each line.</li>

<li>Just so everyone's answer is consistent, please use
log base 10.</li>

</ul>
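<p>As a worked example of the arithmetic, the sketch below assumes the
figure above shows the standard definition, PMI(x, y) = log(p(x, y) /
(p(x) p(y))), with each probability estimated as a line count divided
by the total number of lines. The class name <code>PmiSketch</code> and
the counts in <code>main</code> are made up for illustration.</p>

```java
// Illustrative sketch only: PmiSketch is a hypothetical name.
final class PmiSketch {
  // PMI(x, y) = log10( p(x, y) / (p(x) * p(y)) ), where each probability is
  // estimated as (number of lines containing the event) / (total number of lines).
  static double pmi(long linesWithBoth, long linesWithX, long linesWithY, long totalLines) {
    double pXY = (double) linesWithBoth / totalLines;
    double pX = (double) linesWithX / totalLines;
    double pY = (double) linesWithY / totalLines;
    return Math.log10(pXY / (pX * pY));
  }

  public static void main(String[] args) {
    // Made-up counts: x on 100 lines, y on 50 lines, both on 20 lines, 1000 lines total.
    // The ratio is 0.02 / (0.1 * 0.05) = 4, so the PMI is log10(4), roughly 0.602.
    System.out.println(pmi(20, 100, 50, 1000));
  }
}
```

<p>With the same counts for <i>x</i> and <i>y</i> but only 5 lines in
common, the ratio is 1 and the PMI is zero, which is the independent
case described above.</p>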

<p>Use the same definition of "word" as in the word count demo.
Just to make sure we're all on the same page, use this as the
starting point of your mapper:</p>

<pre>
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = ((Text) value).toString();
      StringTokenizer itr = new StringTokenizer(line);

      int cnt = 0;
      Set&lt;String&gt; set = Sets.newHashSet();
      while (itr.hasMoreTokens()) {
        cnt++;
        String w = itr.nextToken().toLowerCase().replaceAll("(^[^a-z]+|[^a-z]+$)", "");
        if (w.length() == 0) continue;
        set.add(w);
        if (cnt >= 100) break;
      }

      String[] words = new String[set.size()];
      words = set.toArray(words);

      // Your code goes here...
   }
</pre>

<p>You will build two versions of the program (put both in
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment1</code>):</p>

<ol>

  <li>A "pairs" implementation. The implementation must use
  combiners. Name this implementation <code>PairsPMI</code>.</li>

  <li>A "stripes" implementation.  The implementation must use
  combiners. Name this implementation <code>StripesPMI</code>.</li>

</ol>

<p>Since PMI is symmetrical, PMI(x, y) = PMI(y, x). However, it's
actually easier in your implementation to compute both values, so
don't worry about duplicates. Also, use <code>TextOutputFormat</code>
so the results of your program are human readable.</p>

<p>Make sure that the pairs implementation and the stripes
implementation give the same answers!</p>

<p>Answer the following questions:</p>

<p><b>Question 1.</b> (6 points) <i>Briefly</i> describe in prose your solution,
both the pairs and stripes implementation. For example: how many
MapReduce jobs? What are the input records? What are the intermediate
key-value pairs? What are the final output records? A paragraph for
each implementation is about the expected length.</p>

<p><b>Question 2.</b> (2 points) What is the running time of the complete pairs
implementation? What is the running time of the complete stripes
implementation? (Tell me where you ran these experiments,
e.g., <code>linux.student.cs.uwaterloo.ca</code> or your own
laptop.)</p>

<p><b>Question 3.</b> (2 points) Now disable all combiners. What is the running
time of the complete pairs implementation now? What is the running
time of the complete stripes implementation? (Tell me where you ran
these experiments, e.g., <code>linux.student.cs.uwaterloo.ca</code> or
your own laptop.)</p>

<p><b>Question 4.</b> (3 points) How many distinct PMI pairs did you extract?</p>

<p><b>Question 5.</b> (3 points) What's the pair (x, y) (or pairs if there are
ties) with the highest PMI? Write a sentence or two to explain why
the PMI is so high.</p>

<p><b>Question 6.</b> (6 points) What are the three words that have the highest
PMI with "tears" and "death"? And what are the PMI values?</p>

<p>Note that you can compute the answers to questions 4&ndash;6 however
you wish: a helper Java program, a Python script, command-line
one-liner, etc.</p>

<h4 style="padding-top: 10px">Running on the Altiscale cluster</h4>

<p>Now, on the Altiscale cluster, run your pairs and stripes
implementation on the sample Wikipedia collection stored on HDFS
at <code>/shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt</code>. Note
that in the Wikipedia collection, each article is on a line, so
we're computing co-occurring words in (the beginning of) the article. Also, the
"first 100 words" restriction will definitely apply here (whereas in
the Shakespeare collection, all the lines contained fewer than 100
words, so it was a no-op).</p>

<p>Make sure your code runs on this larger dataset. Assuming that
there aren't many competing jobs on the cluster, your programs should
not take more than 20 minutes to run. If your job is taking much
longer than that, then please kill it so it doesn't waste resources
and slow other people's jobs down. Obviously, if the cluster is really
busy or if there's a long list of queued jobs, your job will take
longer, so use your judgement here. The only point is: be nice. It's a
shared resource, and let's not let runaway jobs slow everyone
down.</p>

<p>One final detail, set your MapReduce job parameters as follows:</p>

<pre>
job.getConfiguration().setInt("mapred.max.split.size", 1024 * 1024 * 64);
job.getConfiguration().set("mapreduce.map.memory.mb", "3072");
job.getConfiguration().set("mapreduce.map.java.opts", "-Xmx3072m");
job.getConfiguration().set("mapreduce.reduce.memory.mb", "3072");
job.getConfiguration().set("mapreduce.reduce.java.opts", "-Xmx3072m");
</pre>

<p>What the last four options do is fairly obvious. The first sets
the <i>maximum</i> split size to be 64 MB. What effect does that have?
(Hint: consider the physical execution of MapReduce programs we
discussed in class.)</p>

<p><b>Question 7.</b> (6 points) In the Wikipedia sample, what are the three
words that have the highest PMI with "waterloo" and "toronto"? And
what are the PMI values?</p>

<p>It's worth noting again: the Altiscale cluster is a shared
resource, and how fast your jobs complete will depend on how busy it
is. You're advised to begin the assignment early so as to avoid long job
queues. "I wasn't able to complete the assignment because there were
too many jobs running on the cluster" will not be accepted as an
excuse if your assignment is late.</p>

<h4 style="padding-top: 10px">Turning in the Assignment</h4>

<p>Please follow these instructions carefully!</p>

<p>Make sure your repo has the following items:</p>

<ul>

<li>Similar to assignment 0, the answers to the questions go
in <code>bigdata2016w/assignment1.md</code>.</li>

<li>The pairs and stripes implementation should be in
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment1</code>.</li>

</ul>

<p>When grading, we will pull your repo and build your code:</p>

<pre>
$ mvn clean package
</pre>

<p>Your code should build successfully. We are then going to check
your code (both the pairs and stripes implementations).</p>

<p>We're going to run your code on the Linux student CS environment as
follows (we will make sure the collection is there):</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment1.PairsPMI \
   -input data/Shakespeare.txt -output cs489-2016w-lintool-a1-shakespeare-pairs -reducers 5

$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment1.StripesPMI \
   -input data/Shakespeare.txt -output cs489-2016w-lintool-a1-shakespeare-stripes -reducers 5
</pre>

<p>Make sure that your code runs in the Linux Student CS environment
(even if you do development on your own machine), which is where we
will be doing the grading. "But it runs on my laptop!" will not be
accepted as an excuse if we can't get your code to run.</p>

<p>You can run the above instructions using 
<a href="assignments/check_assignment1_public_linux.py"><code>check_assignment1_public_linux.py</code></a>.</p>

<p>We're going to run your code on the Altiscale cluster as
follows:</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment1.PairsPMI \
   -input /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt -output cs489-2016w-lintool-a1-wiki-pairs -reducers 5

$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment1.StripesPMI \
   -input /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt -output cs489-2016w-lintool-a1-wiki-stripes -reducers 5
</pre>

<p>You can run the above instructions using 
<a href="assignments/check_assignment1_public_altiscale.py"><code>check_assignment1_public_altiscale.py</code></a>.</p>

<p><b>Important:</b> Make sure that your code accepts the command-line
parameters above! That is, make sure the check scripts work!</p>

<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", verify
everything above works by performing a clean clone of your repo and
going through the steps above.</p>

<p>That's it! There's no need to send us anything&mdash;we already know
your username from the first assignment. Note that everything should
be committed and pushed to origin before the deadline.</p>

<h4 style="padding-top: 10px">Hints</h4>

<ul>
  <li>Did you take a look at the implementations of bigram relative
  frequency and co-occurrence matrix computation
  in <a href="http://bespin.io">Bespin</a>?</li>

  <li>Your solution will likely require more than one MapReduce job.</li>

  <li>You may have to load in "side data"?</li>

  <li>My <a href="https://github.com/lintool/tools/tree/master/lintools-datatypes/">lintools-datatypes
  package</a> has <code>Writable</code> datatypes that you might find
  useful. (Feel free to use it, but the assignment can be completed
  without it.)</li>

</ul>

<h4 style="padding-top: 10px">Grading</h4>

<p>This assignment is worth a total of 50 points, broken down as
follows:</p>

<ul>

  <li>The questions above are worth a total of 28 points.</li>

  <li>Getting your code to compile and successfully run is worth
  another 16 points (4 points each for the pairs and stripes
  implementation in the Linux student CS environment and on
  Altiscale). We will make a minimal effort to fix <i>trivial</i>
  issues with your code (e.g., a typo)&mdash;and deduct
  points&mdash;but <b>will not</b> spend time debugging your code. It
  is your responsibility to make sure your code runs: we have taken
  care to specify exactly how we will run your code&mdash;if anything
  is unclear, it is your responsibility to seek clarification.  In
  order to get a perfect score of 16 for this portion of the grade, we
  should be able to run the two public check
  scripts: <a href="assignments/check_assignment1_public_linux.py"><code>check_assignment1_public_linux.py</code></a>
  (on Linux Student CS)
  and <a href="assignments/check_assignment1_public_altiscale.py"><code>check_assignment1_public_altiscale.py</code></a>
  (on Altiscale cluster) successfully without any errors.</li>

  <li>Another 6 points is allotted to us verifying the output of your
  program in ways that we will not tell you. We're giving you the
  "public" versions of the check scripts; we'll run a "private"
  version to examine your output further (i.e., think blind test
  cases).</li>

</ul>

<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>


<section id="assignment2" style="padding-top:35px">
<div>
<h3>Assignment 2: Counting in Spark <small>due 8:30am January 26</small></h3>

<p>In this assignment you will "port" the MapReduce implementations of
the bigram frequency count program
from <a href="http://bespin.io">Bespin</a> over to Spark (in
Scala). Your starting points
are <code>ComputeBigramRelativeFrequencyPairs</code>
and <code>ComputeBigramRelativeFrequencyStripes</code> in
package <code>io.bespin.java.mapreduce.bigram</code> (in Java).
You are welcome to build on the <code>BigramCount</code> (Scala)
implementation <a href="https://github.com/lintool/bespin/blob/master/src/main/scala/io/bespin/scala/spark/bigram/BigramCount.scala">here</a>
for tokenization and "boilerplate" code like command-line argument
parsing. To be consistent in tokenization, you should copy over
the <code>Tokenizer</code> trait
<a href="https://github.com/lintool/bespin/blob/master/src/main/scala/io/bespin/scala/util/Tokenizer.scala">here</a>. You'll
also need to grab missing Maven dependencies
from <a href="https://github.com/lintool/bespin/blob/master/pom.xml">here</a>.</p>

<p>Put your code in the
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment2</code>. Since
you'll be writing Scala code, your source files should go
into <code>src/main/scala/ca/uwaterloo/cs/bigdata2016w/lintool/assignment2/</code>. Note
that the repository is designed so that Scala/Spark code will also
compile with the same Maven build command:</p>

<pre>
$ mvn clean package
</pre>

<p>Following the Java implementations, you will write both a "pairs"
and a "stripes" implementation in Spark. Note that although Spark has a
different API than MapReduce, the algorithmic concepts are still very
much applicable. Your pairs and stripes implementation should follow
the same logic as in the MapReduce implementations. In particular,
your program should only take one pass through the input data.</p>
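<p>As a reminder of what is being computed, here is a small in-memory
sketch in plain Java (not Spark, and not the Bespin code). It assumes
that the relative frequency of a bigram (a, b) means count(a, b)
divided by the total count of bigrams that start with a; the class and
method names are hypothetical.</p>

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: names are hypothetical; this is not the Bespin implementation.
final class BigramRelativeFrequencySketch {
  // Builds one "stripe" per left word: left word -> (right word -> count).
  static Map<String, Map<String, Integer>> stripes(String[] tokens) {
    Map<String, Map<String, Integer>> stripes = new TreeMap<>();
    for (int i = 1; i < tokens.length; i++) {
      stripes.computeIfAbsent(tokens[i - 1], k -> new TreeMap<>())
          .merge(tokens[i], 1, Integer::sum);
    }
    return stripes;
  }

  // f(right | left) = count(left, right) / sum of counts over the whole stripe of left.
  static double relativeFrequency(
      Map<String, Map<String, Integer>> stripes, String left, String right) {
    Map<String, Integer> stripe = stripes.get(left);
    if (stripe == null || !stripe.containsKey(right)) return 0.0;
    int marginal = 0;
    for (int count : stripe.values()) marginal += count;
    return (double) stripe.get(right) / marginal;
  }

  public static void main(String[] args) {
    // Bigrams: (a,b) (b,a) (a,c) (c,a) (a,b), so the stripe for "a" is {b=2, c=1}.
    Map<String, Map<String, Integer>> s = stripes("a b a c a b".split(" "));
    System.out.println(s);
    System.out.println(relativeFrequency(s, "a", "b")); // 2 of the 3 bigrams starting with "a"
  }
}
```

<p>In the stripes formulation the marginal comes for free once a stripe
is complete; in the pairs formulation the marginal has to be computed
separately for each left word.</p>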

<p>Make sure your implementation runs in the Linux student CS
environment on the Shakespeare collection and also on the sample
Wikipedia
file <code>/shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt</code>
on HDFS in the Altiscale cluster. Note that submitting Spark jobs on
the Altiscale cluster requires a rather arcane command-line
invocation; see the <a href="software.html">software page</a> for
more details.</p>

<p>You can verify the correctness of your algorithm by comparing the
output of the MapReduce implementation with your Spark
implementation. The output should be the same.</p>

<p>Clarification on terminology: informally, we often refer to
"mappers" and "reducers" in the context of Spark. That's a shorthand
way of saying map-like transformations
(<code>map</code>, <code>flatMap</code>, <code>filter</code>, <code>mapPartitions</code>,
etc.) and reduce-like transformations
(e.g., <code>reduceByKey</code>, <code>groupByKey</code>, <code>aggregateByKey</code>,
etc.). Hopefully it's clear from lecture that while Spark represents a
generalization of MapReduce, the notions of per-record processing
(i.e., map-like transformation) and grouping/shuffling (i.e.,
reduce-like transformations) are shared across both frameworks.</p>

<h4 style="padding-top: 10px">Turning in the Assignment</h4>

<p>Please follow these instructions carefully!</p>

<p>The pairs and stripes implementations should be in
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment2</code>;
your Scala code should be
in <code>src/main/scala/ca/uwaterloo/cs/bigdata2016w/lintool/assignment2/</code>.
There are no questions to answer in this assignment unless there is
something you would like to communicate with us, and if so, put it
in <code>assignment2.md</code>.</p>

<p>When grading, we will pull your repo and build your code:</p>

<pre>
$ mvn clean package
</pre>

<p>Your code should build successfully. We are then going to check
your code (both the pairs and stripes implementations).</p>

<p>We're going to run your code on the Linux student CS environment as
follows (we will make sure the collection is there):</p>

<pre>
$ spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment2.ComputeBigramRelativeFrequencyPairs \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input data/Shakespeare.txt --output cs489-2016w-lintool-a2-shakespeare-pairs --reducers 5

$ spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment2.ComputeBigramRelativeFrequencyStripes \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input data/Shakespeare.txt --output cs489-2016w-lintool-a2-shakespeare-stripes --reducers 5
</pre>

<p>Make sure that your code runs in the Linux Student CS environment
(even if you do development on your own machine), which is where we
will be doing the grading. "But it runs on my laptop!" will not be
accepted as an excuse if we can't get your code to run.</p>

<p>We're going to run your code on the Altiscale cluster as follows
(note we add <code>--num-executors 10</code> to specify the number of
executors; also note that we use the <code>my-spark-submit</code>
launch script&mdash;see the <a href="software.html">software</a>
page for details):</p>

<pre>
$ my-spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment2.ComputeBigramRelativeFrequencyPairs --num-executors 10 \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt \
   --output cs489-2016w-lintool-a2-wiki-pairs --reducers 10

$ my-spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment2.ComputeBigramRelativeFrequencyStripes --num-executors 10 \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt \
   --output cs489-2016w-lintool-a2-wiki-stripes --reducers 10
</pre>

<p><b>Important:</b> Make sure that your code accepts the command-line
parameters above!</p>

<p>Brief explanation about the relationship
between <code>--num-executors</code>
and <code>--reducers</code>. The <code>--num-executors</code> flag
specifies the number of Spark workers that you allocate for this
particular job. The <code>--reducers</code> flag is the amount of
parallelism that you set in your program in the reduce
stage. If <code>--num-executors</code> is larger
than <code>--reducers</code>, some of the workers will be sitting
idle, since you've allocated more workers for the job than the
parallelism you've specified in your
program. If <code>--reducers</code> is larger
than <code>--num-executors</code>, then your reduce tasks will queue
up at the workers, i.e., a worker will be assigned more than one
reduce task. In the above example we set the two equal.</p>

<p>Note that the setting of these two parameters should not affect the
correctness of your program. The setting of ten above is a reasonable
middle ground between having your jobs finish in a reasonable amount
of time and not monopolizing cluster resources.</p>

<p>A related but still orthogonal concept is partitions. Partitions
describe the physical division of records across workers during
execution. When reading from HDFS, the number of HDFS blocks
determines the number of partitions in your RDD. When you apply a
reduce-like transformation, you can optionally specify the number of
partitions (or Spark applies a default) &mdash; in this case, the
number of partitions is equal to the number of reducers.</p>
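<p>As a back-of-the-envelope sketch of how <code>--reducers</code>
and <code>--num-executors</code> interact, the arithmetic below
assumes each executor runs one reduce task at a time. That is a
simplification (the real number depends on how many cores each
executor has), and the class and method names are made up for this
example.</p>

<pre>
final class ReduceWaves {
  // Rounds of reduce tasks needed, assuming one task per executor at a time.
  static int waves(int reducers, int executors) {
    return (reducers + executors - 1) / executors;  // ceiling division
  }

  // Executors that never receive a reduce task.
  static int idleExecutors(int reducers, int executors) {
    return Math.max(0, executors - reducers);
  }

  public static void main(String[] args) {
    System.out.println(waves(10, 10));         // the setting used above
    System.out.println(waves(25, 10));         // tasks queue up at the workers
    System.out.println(idleExecutors(5, 10));  // more workers than parallelism
  }
}
</pre>

<p>Under that assumption, ten reducers on ten executors finish in a
single round, twenty-five reducers on ten executors need three rounds,
and five reducers on ten executors leave five executors idle during
the reduce stage.</p>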

<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we would
recommend that you verify everything above works by performing a clean
clone of your repo and going through the steps above.</p>

<p>That's it!</p>

<h4 style="padding-top: 10px">Grading</h4>

<p>This assignment is worth a total of 20 points, broken down as
follows:</p>

<ul>
  <li>The pairs implementation running locally is worth 6 points; the stripes implementation running locally is worth another 6 points.</li>
  <li>The pairs implementation running on Altiscale is worth 4 points; the stripes implementation running on Altiscale is worth another 4 points.</li>
</ul>

<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>


<section id="assignment3" style="padding-top:35px">
<div>
<h3>Assignment 3: Inverted Indexing <small>due 8:30am February 2</small></h3>

<p>This assignment is to be completed in MapReduce in Java. You will
be working in the same repo as before, except that everything should
go into the package namespace
<code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment3</code>
(obviously, replace <code>lintool</code> with your actual GitHub
username).</p>

<p>Look at the inverted indexing and boolean retrieval implementation
in <a href="http://bespin.io">Bespin</a>. Make sure you understand the
code. Starting from the inverted indexing
baseline <code>BuildInvertedIndex</code>, modify the indexer code in
the following ways:</p>

<p><b>1. Index Compression.</b> The index should be compressed using
<code>VInts</code>:
see <code>org.apache.hadoop.io.WritableUtils</code>. You should also
use gap-compression techniques as appropriate.</p>
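<p>If gap compression is new to you, the sketch below shows the idea
on a sorted list of document ids: store the difference from the
previous id, then write each difference with a variable-length
encoding so that small numbers take fewer bytes. The
class <code>GapVByteSketch</code> and its methods are hypothetical,
and the byte layout is a generic 7-bits-per-byte scheme written by
hand for illustration. It is not necessarily the layout
that <code>WritableUtils</code> produces, so in your assignment use
the Hadoop methods rather than this code.</p>

<pre>
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

final class GapVByteSketch {
  // Writes a non-negative int, 7 data bits per byte; the high bit means "more bytes follow".
  static void writeVByte(ByteArrayOutputStream out, int value) {
    while ((value &amp; ~0x7F) != 0) {
      out.write((value &amp; 0x7F) | 0x80);
      value &gt;&gt;&gt;= 7;
    }
    out.write(value);
  }

  static int readVByte(ByteArrayInputStream in) {
    int value = 0;
    for (int shift = 0; ; shift += 7) {
      int b = in.read();
      if (b &lt; 0) {
        throw new IllegalStateException("truncated input");
      }
      value |= (b &amp; 0x7F) &lt;&lt; shift;
      if ((b &amp; 0x80) == 0) {
        return value;
      }
    }
  }

  // Encodes the list length followed by the gaps between consecutive ids.
  static byte[] encode(int[] sortedDocIds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVByte(out, sortedDocIds.length);
    int previous = 0;
    for (int docId : sortedDocIds) {
      writeVByte(out, docId - previous);
      previous = docId;
    }
    return out.toByteArray();
  }

  static int[] decode(byte[] bytes) {
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    int[] docIds = new int[readVByte(in)];
    int previous = 0;
    for (int i = 0; i &lt; docIds.length; i++) {
      previous += readVByte(in);  // undo the gap
      docIds[i] = previous;
    }
    return docIds;
  }

  public static void main(String[] args) {
    int[] ids = {3, 130, 131, 100000};
    System.out.println(encode(ids).length + " bytes instead of " + 4 * ids.length);
  }
}
</pre>

<p>For the ids 3, 130, 131, 100000 the gaps are 3, 127, 1, and 99869,
which this scheme stores in 7 bytes including the length, compared
with 16 bytes for four fixed-width ints.</p>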

<p><b>2. Buffering postings.</b> The baseline indexer implementation
currently buffers and sorts postings in the reducer, which as we
discussed in class is not a scalable solution. Address this
scalability bottleneck using techniques we discussed in class and in
the textbook.</p>

<p><b>3. Term partitioning.</b> The baseline indexer implementation
currently uses only one reducer and therefore all postings lists are
shuffled to the same node and written to HDFS in a single
partition. Change this so we can specify the number of reducers
(hence, partitions) as a command-line argument. This is, of course,
easy to do, but we need to make sure that the searcher understands
this partitioning also.</p>
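<p>One way the searcher can "understand" the partitioning is to
recompute which partition a term was sent to. The sketch below uses
the same arithmetic as Hadoop's default <code>HashPartitioner</code>
(a non-negative hash modulo the number of
partitions). Everything else is an assumption for illustration: the
class name, the use of <code>String.hashCode()</code> as a stand-in
for the hash of your actual key type, and the part file naming, which
you should confirm by listing your own output directory. Probing
every partition for the term is a simpler alternative.</p>

<pre>
final class TermPartitionSketch {
  // Non-negative hash modulo the partition count.
  static int partitionFor(int keyHash, int numPartitions) {
    return (keyHash &amp; Integer.MAX_VALUE) % numPartitions;
  }

  // Assumption: String.hashCode() stands in for the hash of the real key type,
  // and reducer output is named part-r-00000, part-r-00001, and so on.
  static String partFileFor(String term, int numPartitions) {
    return String.format("part-r-%05d", partitionFor(term.hashCode(), numPartitions));
  }

  public static void main(String[] args) {
    System.out.println(partFileFor("fortune", 4));
  }
}
</pre>

<p>Whichever approach you take, the searcher must produce the same
results regardless of the number of partitions the index was built
with.</p>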

<p><b>Note:</b> The major scalability issue is
buffering <i>uncompressed</i> postings in memory. In your solution,
you'll still end up buffering each postings list, but
in <i>compressed</i> form (raw bytes, no additional object
overhead). This is fine because if you use the right compression
technique, the postings lists are quite small. As a data point, on a
collection of 50 million web pages, 2GB heap is more than enough for a
full <i>positional</i> index (and in this assignment you're not asked
to store positional information in your postings).</p>

<p>To go into a bit more detail: in the reference implementation, the
final value type is <code>PairOfWritables&lt;IntWritable,
ArrayListWritable&lt;PairOfInts&gt;&gt;</code>. The most obvious idea
is to change that into something
like <code>PairOfWritables&lt;VIntWritable,
ArrayListWritable&lt;PairOfVInts&gt;&gt;</code>. This does not work!
The reason is that you will still be materializing each posting, i.e.,
all <code>PairOfVInts</code> objects in memory. This translates into a
Java object for every posting, which is wasteful in terms of memory
usage and will exhaust memory pretty quickly as you scale. In other
words, you're <i>still</i> buffering objects&mdash;just inside
the <code>ArrayListWritable</code>.</p>

<p>This new indexer should be
named <code>BuildInvertedIndexCompressed</code>. This new class should
be in the
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment3</code>. Make
sure it works on the Shakespeare collection.</p>

<p>Modify <code>BooleanRetrieval</code> so that it works with the new
compressed indexes. Name this new
class <code>BooleanRetrievalCompressed</code>. This new class should
be in the same package as above and give the same
output as the old version.</p>

<p>Use <code>BuildInvertedIndex</code>
and <code>BooleanRetrieval</code> from Bespin as your starting
points. That is, copy over into your repo, rename, and begin your
assignment from there. Don't unnecessarily change code not directly
related to points #1-#3 above. In particular, <b>do not</b> change how
the documents are tokenized, etc. in <code>BuildInvertedIndex</code>
(otherwise there's no good way to check for the correctness of your
algorithm). Also, <b>do not</b> change the <code>fetchLine</code>
method in <code>BooleanRetrieval</code> so that everyone's output
looks the same.</p>

<p>In more detail, make sure that you can build the inverted index
with the following command (make sure your implementation runs in the
Linux student CS environment, as that is where we will be doing the
grading):</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment3.BuildInvertedIndexCompressed \
   -input data/Shakespeare.txt -output cs489-2016w-lintool-a3-index-shakespeare -reducers 4
</pre>

<p>We should be able to control the number of partitions (#3 above)
with the <code>-reducers</code> option. That is, the code should give
the correct results no matter what we set the value to.</p>

<p>Once we build the index, we should then be able to run a boolean
query as follows (in exactly the same manner
as <code>BooleanRetrieval</code> in Bespin):</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment3.BooleanRetrievalCompressed \
   -index cs489-2016w-lintool-a3-index-shakespeare -collection data/Shakespeare.txt \
   -query "outrageous fortune AND"

$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment3.BooleanRetrievalCompressed \
   -index cs489-2016w-lintool-a3-index-shakespeare -collection data/Shakespeare.txt \
   -query "white red OR rose AND pluck AND"
</pre>

<p>Of course, we will try your program with additional queries to
verify its correctness.</p>
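<p>The queries above are written in postfix form: terms come first
and each operator applies to the two results before it. As a rough
sketch of how such a query can be evaluated with a stack, the code
below works over a toy in-memory index that maps each term to a set of
document ids. The class name and the in-memory index are assumptions
for illustration; this is not the Bespin implementation, which reads
postings from the index on disk.</p>

<pre>
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class PostfixBooleanSketch {
  static Set&lt;Integer&gt; evaluate(String query, Map&lt;String, Set&lt;Integer&gt;&gt; index) {
    Deque&lt;Set&lt;Integer&gt;&gt; stack = new ArrayDeque&lt;&gt;();
    for (String token : query.trim().split("\\s+")) {
      if (token.equals("AND") || token.equals("OR")) {
        Set&lt;Integer&gt; right = stack.pop();
        // Copy the left operand so the sets stored in the index are never modified.
        Set&lt;Integer&gt; left = new TreeSet&lt;Integer&gt;(stack.pop());
        if (token.equals("AND")) {
          left.retainAll(right);  // intersection
        } else {
          left.addAll(right);     // union
        }
        stack.push(left);
      } else {
        // A term: push its postings, or an empty set if the term is unknown.
        stack.push(index.getOrDefault(token, new TreeSet&lt;Integer&gt;()));
      }
    }
    return stack.pop();
  }

  public static void main(String[] args) {
    Map&lt;String, Set&lt;Integer&gt;&gt; index = new HashMap&lt;&gt;();
    index.put("outrageous", new TreeSet&lt;Integer&gt;(java.util.Arrays.asList(7, 9)));
    index.put("fortune", new TreeSet&lt;Integer&gt;(java.util.Arrays.asList(2, 9)));
    System.out.println(evaluate("outrageous fortune AND", index));
  }
}
</pre>

<p>Read this way, <code>"white red OR rose AND pluck AND"</code>
means ((white OR red) AND rose) AND pluck.</p>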

<p>Answer the following question:</p>

<p><b>Question 1.</b> What is the size of your compressed
index for the Shakespeare collection? Just so we're using the same units,
report the output of <code>du -h</code>.</p>

<h4 style="padding-top: 10px">Running on the Altiscale cluster</h4>

<p>Now let's try running your implementation on the Altiscale cluster,
on the sample Wikipedia
file <code>/shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt</code>
on HDFS:</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment3.BuildInvertedIndexCompressed \
   -input /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt \
   -output cs489-2016w-lintool-a3-index-wiki -reducers 4
</pre>

<p>And let's try running a query:</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment3.BooleanRetrievalCompressed \
   -index cs489-2016w-lintool-a3-index-wiki \
   -collection /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt \
   -query "waterloo stanford OR cheriton AND"

$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment3.BooleanRetrievalCompressed \
   -index cs489-2016w-lintool-a3-index-wiki \
   -collection /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt \
   -query "internet startup AND canada AND ontario AND"
</pre>

<p>Answer the following questions:</p>

<p><b>Question 2.</b> What is the size of your compressed
index for the sample Wikipedia collection? Just so we're using the
same units, report the output of <code>hadoop fs -du -h</code>.</p>

<p><b>Question 3.</b> What are the Wikipedia articles (just the
article titles) retrieved in response to the query <code>"waterloo
stanford OR cheriton AND"</code>?</p>

<p><b>Question 4.</b> What are the Wikipedia articles (just
the article titles) retrieved in response to the query <code>"internet
startup AND canada AND ontario AND"</code>?</p>

<h4 style="padding-top: 10px">Turning in the Assignment</h4>

<p>Please follow these instructions carefully!</p>

<p>Make sure your repo has the following items:</p>

<ul>

<li>Similar to the previous assignments, the answers to the questions go
in <code>bigdata2016w/assignment3.md</code>.</li>

<li>The implementations should be in
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment3</code>.</li>

</ul>

<p>Make sure your implementation runs in the Linux student CS
environment on the Shakespeare collection and also on sample Wikipedia
file <code>/shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt</code>
on HDFS in the Altiscale cluster, per above.</p>

<p>Specifically, we will clone your repo and use the below check
scripts:</p>

<ul>

<li><a href="assignments/check_assignment3_public_linux.py"><code>check_assignment3_public_linux.py</code></a>
in the Linux Student CS environment.</li>

<li><a href="assignments/check_assignment3_public_altiscale.py"><code>check_assignment3_public_altiscale.py</code></a> on the Altiscale cluster.</li>

</ul>

<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we would
recommend that you verify everything above works by performing a clean
clone of your repo and running the public check scripts.</p>

<p>That's it!</p>

<h4 style="padding-top: 10px">Grading</h4>

<p>This assignment is worth a total of 50 points, broken down as
follows:</p>

<ul>

  <li>Implementation correctness is worth 30 points. Note that the
  questions above are not explicitly worth any points; they exist
  primarily to help us gauge your implementation correctness. For
  example, if your index size is larger than we expect, it's likely
  you've not applied compression correctly. If your retrieved results
  do not match ours, it's likely you have a bug in your retrieval
  implementation.</li>

  <li>Getting your code to compile and successfully run is worth
  another 10 points (5 points for the Linux student CS environment and
  5 points on the Altiscale cluster). We will make a minimal effort to
  fix <i>trivial</i> issues with your code (e.g., a typo)&mdash;and
  deduct points&mdash;but <b>will not</b> spend time debugging your
  code. It is your responsibility to make sure your code runs: we have
  taken care to specify exactly how we will run your code&mdash;if
  anything is unclear, it is your responsibility to seek
  clarification.   In
  order to get a perfect score of 10 for this portion of the grade, we
  should be able to run the two public check
  scripts: <a href="assignments/check_assignment3_public_linux.py"><code>check_assignment3_public_linux.py</code></a>
  (on Linux Student CS)
  and <a href="assignments/check_assignment3_public_altiscale.py"><code>check_assignment3_public_altiscale.py</code></a>
  (on Altiscale cluster) successfully without any errors.</li>

  <li>Another 10 points is allotted to us verifying the behavior and
  output of your program in ways that we will not tell you. We're
  giving you the "public" versions of the check scripts; we'll run a
  "private" version to examine your output further (i.e., think blind
  test cases).</li>

</ul>

<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>


<section id="assignment4" style="padding-top:35px">
<div>
<h3>Assignment 4: Multi-Source Personalized PageRank <small>due 8:30am February 9</small></h3>

<p>For this assignment, you will be working in the same repo as
before, except that everything should go into the package namespace
<code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment4</code>
(obviously, replace <code>lintool</code> with your actual GitHub
username).</p>

<p>Begin by taking the time to understand
the <a href="https://github.com/lintool/bespin/tree/master/src/main/java/io/bespin/java/mapreduce/pagerank">PageRank
reference implementation</a> in <a href="http://bespin.io">Bespin</a>
(particularly <code>RunPageRankBasic</code>).  For this assignment,
you are going to implement multiple-source personalized PageRank. As
we discussed in class, personalized PageRank is different from
ordinary PageRank in a few respects:</p>

<ul>

  <li>There is the notion of a <i>source</i> node, which is what we're
  computing the personalization with respect to.</li>

  <li>When initializing PageRank, instead of a uniform distribution
  across all nodes, the source node gets a mass of one and every other
  node gets a mass of zero.</li>

  <li>Whenever the model makes a random jump, the random jump is
  always back to the source node; this is unlike in ordinary PageRank,
  where there is an equal probability of jumping to any node.</li>

  <li>All mass lost at the dangling nodes is put back into the source
  node; this is unlike in ordinary PageRank, where the missing mass is
  evenly distributed across all nodes.</li>

</ul>
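<p>The differences listed above can be sketched as a small sequential
program for a single source. This is only an illustration in plain
Java on a toy graph, not the MapReduce implementation you are asked to
write: the class name is made up, the random-jump probability of 0.15
is an assumed value, and the code works with ordinary probabilities
rather than log probabilities to keep it short.</p>

<pre>
final class PersonalizedPageRankSketch {
  // One iteration over adjacency lists. alpha is the random-jump probability;
  // mass[v] is the current probability at node v.
  static double[] iterate(int[][] outLinks, double[] mass, int source, double alpha) {
    double[] next = new double[mass.length];
    double dangling = 0.0;
    for (int v = 0; v &lt; outLinks.length; v++) {
      if (outLinks[v].length == 0) {
        dangling += mass[v];  // no out-links: this mass would otherwise be lost
        continue;
      }
      double share = mass[v] / outLinks[v].length;
      for (int target : outLinks[v]) {
        next[target] += share;
      }
    }
    for (int v = 0; v &lt; next.length; v++) {
      next[v] *= (1.0 - alpha);  // mass that arrived by following a link
    }
    // Random jumps and the dangling mass all go back to the source.
    next[source] += alpha + (1.0 - alpha) * dangling;
    return next;
  }

  static double[] run(int[][] outLinks, int source, double alpha, int iterations) {
    double[] mass = new double[outLinks.length];
    mass[source] = 1.0;  // the source starts with all of the mass
    for (int i = 0; i &lt; iterations; i++) {
      mass = iterate(outLinks, mass, source, alpha);
    }
    return mass;
  }

  public static void main(String[] args) {
    int[][] graph = { {1, 2}, {2}, {} };  // node 2 is a dangling node
    double[] mass = run(graph, 0, 0.15, 20);
    for (int v = 0; v &lt; mass.length; v++) {
      System.out.println(String.format("%.5f %d", mass[v], v));
    }
  }
}
</pre>

<p>A useful sanity check, both here and in your own implementation,
is that the total mass across all nodes stays at one after every
iteration.</p>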

<p>Here are some publications about personalized PageRank if you're
interested. They're just provided for background; neither is necessary
for completing the assignment.</p>

<ul>

  <li>Daniel Fogaras, Balazs Racz, Karoly Csalogany, and Tamas Sarlos. (2005) <a href="assignments/Fogaras_etal_2005.pdf">Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments.</a> Internet Mathematics, 2(3):333-358.</li>

  <li>Bahman Bahmani, Abdur Chowdhury, and Ashish Goel. (2010) <a href="assignments/Bahmani_etal_VLDB2010.pdf">Fast Incremental and Personalized PageRank.</a> Proceedings of the 36th International Conference on Very Large Data Bases (VLDB 2010).</li>


</ul>

<p>Your implementation is going to run multiple personalized PageRank
computations in parallel, one with respect to each source. The sources
will be specified on the command line. This means that each
PageRank node object (i.e., <code>Writable</code>) is going to contain
an array of PageRank values.</p>

<p>Here's how the implementation is going to work: it largely follows
the reference implementation of <code>RunPageRankBasic</code>. You
must make your implementation work with respect to the command-line
invocations specified below.</p>

<p>First, convert the adjacency list into PageRank node records:</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment4.BuildPersonalizedPageRankRecords \
   -input data/p2p-Gnutella08-adj.txt -output cs489-2016w-lintool-a4-Gnutella-PageRankRecords \
   -numNodes 6301 -sources 367,249,145
</pre>

<p>The <code>-sources</code> option specifies the source nodes for the
personalized PageRank computations. In this case, we're running three
computations in parallel, with respect to node ids 367, 249, and
145. You can expect the option value to be in the form of a
comma-separated list, and that all node ids actually exist in the
graph. The list of source nodes may be arbitrarily long, but for
practical purposes we won't test your code with more than a few.</p>

<p>Since we're running three personalized PageRank computations in
parallel, each PageRank node is going to hold an array of three
values, the personalized PageRank values with respect to the first
source, second source, and third source. You can expect the array
positions to correspond exactly to the position of the node id in the
source string.</p>

<p>Next, partition the graph (hash partitioning) and get ready to
iterate:</p>

<pre>
$ hadoop fs -mkdir cs489-2016w-lintool-a4-Gnutella-PageRank

$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment4.PartitionGraph \
   -input cs489-2016w-lintool-a4-Gnutella-PageRankRecords \
   -output cs489-2016w-lintool-a4-Gnutella-PageRank/iter0000 -numPartitions 5 -numNodes 6301
</pre>

<p>After setting everything up, iterate multi-source personalized
PageRank:</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment4.RunPersonalizedPageRankBasic \
   -base cs489-2016w-lintool-a4-Gnutella-PageRank -numNodes 6301 -start 0 -end 20 -sources 367,249,145
</pre>

<p>Note that the sources are passed in from the command line
again. Here, we're running twenty iterations.</p>

<p>Finally, run a program to extract the top ten personalized PageRank
values, with respect to each source.</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment4.ExtractTopPersonalizedPageRankNodes \
   -input cs489-2016w-lintool-a4-Gnutella-PageRank/iter0020 -output cs489-2016w-lintool-a4-Gnutella-PageRank-top10 \
   -top 10 -sources 367,249,145
</pre>

<p>The above program should print the following answer to stdout:</p>

<pre>
Source: 367
0.35257 367
0.04181 264
0.03889 266
0.03888 559
0.03883 5
0.03860 1317
0.03824 666
0.03817 7
0.03799 4
0.00850 249

Source: 249
0.34089 249
0.04034 251
0.03721 762
0.03688 123
0.03686 250
0.03668 753
0.03627 755
0.03623 351
0.03622 983
0.00949 367

Source: 145
0.36937 145
0.04195 1317
0.04120 265
0.03847 390
0.03606 367
0.03566 246
0.03525 667
0.03519 717
0.03513 149
0.03496 2098
</pre>

<h4 style="padding-top: 10px">Additional Specifications</h4>

<p>To make the final output easier to read, in the
class <code>ExtractTopPersonalizedPageRankNodes</code>, use the
following format to print each (personalized PageRank value, node id)
pair:</p>

<pre>
String.format("%.5f %d", pagerank, nodeid)
</pre>

<p>This will generate the final results in the same format as
above. Also note: print actual probabilities, not log
probabilities&mdash;although during the actual PageRank computation
keep values as log probabilities.</p>
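<p>If you have not worked with log probabilities before, the sketch
below shows the three operations you will need: adding two
probabilities without leaving log space, initializing the per-source
array for a node, and converting back to an actual probability when
printing. The class and method names are hypothetical; the reference
implementation has its own helper for summing log probabilities, which
you are free to use instead.</p>

<pre>
final class LogProbSketch {
  // Returns log(exp(a) + exp(b)), where negative infinity represents log(0).
  static float sumLogProbs(float a, float b) {
    if (a == Float.NEGATIVE_INFINITY) {
      return b;
    }
    if (b == Float.NEGATIVE_INFINITY) {
      return a;
    }
    float hi = Math.max(a, b);
    float lo = Math.min(a, b);
    return (float) (hi + Math.log1p(Math.exp(lo - hi)));
  }

  // Initial values for one node: log(1) = 0 in the slot whose source is this node,
  // log(0) = negative infinity in every other slot.
  static float[] initialMass(int nodeId, int[] sources) {
    float[] mass = new float[sources.length];
    for (int i = 0; i &lt; sources.length; i++) {
      mass[i] = (sources[i] == nodeId) ? 0.0f : Float.NEGATIVE_INFINITY;
    }
    return mass;
  }

  // Converts back to an actual probability only when printing.
  static String formatEntry(float logProb, int nodeId) {
    return String.format("%.5f %d", (float) Math.exp(logProb), nodeId);
  }

  public static void main(String[] args) {
    float quarter = (float) Math.log(0.25);
    System.out.println(formatEntry(sumLogProbs(quarter, quarter), 367));
  }
}
</pre>

<p>With <code>-sources 367,249,145</code>, node 249 would start with
all of its mass in the second slot of its array and none in the other
two.</p>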

<p>The final class <code>ExtractTopPersonalizedPageRankNodes</code>
does not need to be a MapReduce job (but it does need to read from
HDFS). Obviously, the other classes need to run MapReduce jobs.</p>
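<p>Since the extraction step does not have to be a MapReduce job, one
simple approach is to scan the records once while keeping only the
best few in a small heap. The sketch below does this for an in-memory
array of values indexed by node id. The class name and the array input
are assumptions for illustration; your program would instead read the
node records from HDFS and repeat the selection once per source.</p>

<pre>
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class TopKSketch {
  // Returns the indices of the k largest values, largest first,
  // using a min-heap that never holds more than k entries after each step.
  static List&lt;Integer&gt; topK(float[] values, int k) {
    PriorityQueue&lt;Integer&gt; heap =
        new PriorityQueue&lt;&gt;((a, b) -&gt; Float.compare(values[a], values[b]));
    for (int node = 0; node &lt; values.length; node++) {
      heap.add(node);
      if (heap.size() &gt; k) {
        heap.poll();  // evict the smallest value seen so far
      }
    }
    List&lt;Integer&gt; result = new ArrayList&lt;&gt;();
    while (!heap.isEmpty()) {
      result.add(heap.poll());  // comes out smallest first
    }
    Collections.reverse(result);
    return result;
  }

  public static void main(String[] args) {
    float[] values = {0.1f, 0.5f, 0.3f, 0.05f};
    for (int node : topK(values, 2)) {
      System.out.println(String.format("%.5f %d", values[node], node));
    }
  }
}
</pre>

<p>The heap keeps memory use proportional to the number of results
requested rather than the number of nodes, which matters once you move
to the Wikipedia graph.</p>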

<p>The reference implementation of PageRank in Bespin has many
options: you can either use in-mapper combining or ordinary
combiners. In your implementation, use ordinary combiners.
Also, the reference implementation has
an option to either use range partitioning or hash partitioning: you
only need to implement hash partitioning. You can start with the
reference implementation and remove code that you don't need (see #2
below).</p>

<h4 style="padding-top: 10px">Hints and Suggestions</h4>

<p>To help you out, there's a small helper program in Bespin that
computes personalized PageRank using a sequential algorithm. Use it to
check your answers:</p>

<pre>
$ hadoop jar target/bespin-0.1.0-SNAPSHOT.jar io.bespin.java.mapreduce.pagerank.SequentialPersonalizedPageRank \
   -input data/p2p-Gnutella08-adj.txt -source 367
</pre>

<p>Note that this isn't actually a MapReduce job; we're simply using
Hadoop to run the <code>main</code> for convenience. The values from
your implementation should be pretty close to the output of the above
program, but might differ a bit due to convergence issues. After 20
iterations, the output of the MapReduce implementation should match to
at least the fourth decimal place.</p>

<p>This is a complex assignment. We would suggest breaking the
implementation into the following steps:</p>

<ol>

<li>First, copy the reference PageRank implementation into your own
assignments repo (renaming the classes appropriately). Make sure you
can get it to run and output the correct results with ordinary
PageRank.</li>

<li>Simplify the code; i.e., since you will be using ordinary
combiners, remove the code that implements in-mapper combining.</li>

<li>Implement personalized PageRank from a single source; that is, if
the user sets option <code>-sources w,x,y,z</code>, simply
ignore <code>x,y,z</code> and run personalized PageRank with respect
to <code>w</code>. This can be accomplished with the
existing <code>PageRankNode</code>, which holds a single floating
point value.</li>

<li>Extend the <code>PageRankNode</code> class to store an array of
floats (length of array is the number of sources) instead of a single
float. Make sure single-source personalized PageRank still runs.</li>

<li>Implement multi-source personalized PageRank.</li>

</ol>

<p>In particular, #3 is a nice checkpoint. If you're not able to get
the multiple-source personalized PageRank to work, at least completing
the single-source implementation will allow us to give you partial
credit.</p>

<h4 style="padding-top: 10px">Running on the Altiscale cluster</h4>

<p>Once you get your implementation working and debugged in the Linux
environment, you're going to run your code on a non-trivial graph: the
link structure of (all of) Wikipedia. The adjacency lists are stored
in <code>/shared/cs489/data/wiki-adj</code> on HDFS. The graph has
16,117,779 vertices and 155,472,640 edges.</p>

<p>First, convert the adjacency list into PageRank node records:</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment4.BuildPersonalizedPageRankRecords \
   -input /shared/cs489/data/wiki-adj -output cs489-2016w-lintool-a4-wiki-PageRankRecords \
   -numNodes 16117779 -sources 73273,73276
</pre>

<p>Next, partition the graph (hash partitioning) and get ready to
iterate:</p>

<pre>
$ hadoop fs -mkdir cs489-2016w-lintool-a4-wiki-PageRank

$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment4.PartitionGraph \
   -input cs489-2016w-lintool-a4-wiki-PageRankRecords \
   -output cs489-2016w-lintool-a4-wiki-PageRank/iter0000 -numPartitions 10 -numNodes 16117779
</pre>

<p>After setting everything up, iterate multi-source
personalized PageRank:</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment4.RunPersonalizedPageRankBasic \
   -base cs489-2016w-lintool-a4-wiki-PageRank -numNodes 16117779 -start 0 -end 20 -sources 73273,73276
</pre>

<p>On the Altiscale cluster, each iteration shouldn't take more than a
couple of minutes to complete. If it's taking more than five minutes
per iteration, you're doing something wrong.</p>

<p>Finally, run a program to extract the top ten personalized PageRank
values, with respect to each source.</p>

<pre>
$ hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
   ca.uwaterloo.cs.bigdata2016w.lintool.assignment4.ExtractTopPersonalizedPageRankNodes \
   -input cs489-2016w-lintool-a4-wiki-PageRank/iter0020 -output cs489-2016w-lintool-a4-wiki-PageRank-top10 \
   -top 10 -sources 73273,73276
</pre>

<p>In the example above, we're running personalized PageRank with
respect to two sources: 73273 and 73276. What articles do these ids
correspond to? There's a file on HDFS
at <code>/shared/cs489/data/wiki-titles.txt</code> that provides the
answer. How do you know if your implementation is correct? You can
sanity check your results by taking the ids and looking up the
articles that they correspond to. Do the results make sense?</p>

<h4 style="padding-top: 10px">Turning in the Assignment</h4>

<p>Please follow these instructions carefully!</p>

<p>Make sure your repo has the following items:</p>

<ul>

<li>Similar to the previous assignments, you'll create a file
called <code>bigdata2016w/assignment4.md</code> (more below).</li>

<li>All the implementations described above should be in
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment4</code>.</li>

</ul>

<p>Make sure your implementation runs in the Linux student CS
environment on the Gnutella graph and on the Wikipedia graph on the
Altiscale cluster.</p>

<p>In <code>bigdata2016w/assignment4.md</code>, tell us if you were
able to successfully complete the assignment. In case we can't get
your code to run, we need to know whether that is because you weren't
able to complete the assignment successfully or because of some other
issue. If you were not able to implement everything, describe how far
you've gotten. Feel free to use this space to tell us additional
things we should look for in your implementation.</p>

<p>Also, in this file, copy the output of your implementation on the
Altiscale cluster, i.e., personalized PageRank with respect to
vertices 73273 and 73276. This will give us something to look at and
check if we can't get your code to run successfully. Something that
looks like this:</p>

<pre>
Source: 73273
0.XXXXX XXX
...

Source: 73276
0.XXXXX XXX
...
</pre>

<p>When grading, we will clone your repo and use the below check
scripts:</p>

<ul>

<li><a href="assignments/check_assignment4_public_linux.py"><code>check_assignment4_public_linux.py</code></a>
in the Linux Student CS environment.</li>

<li><a href="assignments/check_assignment4_public_altiscale.py"><code>check_assignment4_public_altiscale.py</code></a>
on the Altiscale cluster.</li>

</ul>

<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we would
recommend that you verify everything above works by performing a clean
clone of your repo and running the public check scripts.</p>

<p>That's it!</p>

<h4 style="padding-top: 10px">Grading</h4>

<p>The entire assignment is worth 55 points:</p>

<ul>

<li>A correct implementation of single-source personalized PageRank is
worth 10 points.</li>

<li>That we are able to run the single-source personalized PageRank
implementation in the Linux Student CS environment is worth 5
points.</li>

<li>A correct implementation of multiple-source personalized PageRank
is worth 15 points.</li>

<li>That we are able to run the multiple-source personalized PageRank
implementation in the Linux Student CS environment is worth 5
points.</li>

<li>Scaling the single-source personalized PageRank implementation on
Altiscale is worth 10 points.</li>

<li>Scaling the multiple-source personalized PageRank implementation
on Altiscale is worth 10 points.</li>

</ul>

<p>In our private check scripts, we will specify arbitrary source
nodes to verify the correctness of your implementation.</p>

<p>Note that this grading scheme discretizes each component of the
assignment. For example, if you implement everything and it works
correctly on the Linux Student CS environment, but can't get it to
scale on the Altiscale cluster to the larger graph, you'll receive 35
out of 55 points. On the other hand, if you implement single-source
personalized PageRank correctly and it runs in both the Linux Student
CS environment and Altiscale, you will receive 25 out of 55
points. And combinations thereof.</p>

<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>


<section id="assignment5" style="padding-top:35px">
<div>
<h3>Assignment 5: Data Warehousing <small>due 8:30am March 3</small></h3>

<p>In this assignment you'll be hand-crafting Spark programs that
implement SQL queries in a data warehousing scenario. SQL-on-Hadoop
systems all provide an SQL query interface to data stored in HDFS,
executing queries via an intermediate framework. For
example, Hive queries are "compiled" into MapReduce jobs; Spark SQL
queries rely on Spark processing primitives for query execution. In
this assignment, you'll play the role of mediating between SQL queries
and the underlying execution framework (Spark). In more detail, you'll
be given a series of SQL queries, and for each you'll have to
hand-craft a corresponding Spark program.</p>

<p><b>Important:</b> You are not allowed to use the DataFrame API or
Spark SQL to complete this assignment. You must write code to manipulate raw RDDs.
Furthermore, you are not allowed to use <code>join</code> and
closely-related transformations in Spark for this assignment, because
otherwise it defeats the point of the exercise. The assignment will
guide you toward what we are looking for, but if you have any
questions as to what is allowed or not, ask!</p>

<p>We will be working with data from the TPC-H benchmark in this
assignment. The Transaction Processing Performance Council (TPC) is a
non-profit organization that defines various database benchmarks so
that database vendors can evaluate the performance of their products
fairly; it sets the "rules of the game", so to speak. One of these
benchmarks is TPC-H, for evaluating ad-hoc
decision support systems in a data warehousing scenario. The current
version of the TPC-H benchmark is located
<a href="assignments/tpc-h_v2.17.1.pdf">here</a>. You'll want to skim
through this (long) document; the most important part is the
entity-relationship diagram of the data warehouse on page
13. Throughout the assignment you'll likely be referring to it, as it
will help you make sense of the queries you are running.</p>

<p>The TPC-H benchmark comes with a data generator, and we have
generated some data for
you <a href="assignments/TPC-H-0.1-TXT.tar.gz">here</a>
(<code>TPC-H-0.1-TXT.tar.gz</code>). For the first
part of the assignment where you will be working with Spark locally,
you will run your queries against this data.
Download and unpack the data above: you will see a number of text files,
each corresponding to a table in the TPC-H schema. The files are
delimited by <code>|</code>. You'll notice that some of the fields,
especially the text fields, are gibberish&mdash;that's normal, since
the data are randomly generated.</p>

<p>Implement the following seven SQL queries, running on
the <code>TPC-H-0.1-TXT</code> data. Each SQL query is
accompanied by a written description of what the query does; if there
are any ambiguities in the language, you can always assume that the
SQL query is correct. Each of your queries will be a separate Spark
program. Put your code in the package
<code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment5</code>, in the
same repo that you've been working in all semester. Since you'll be
writing Scala code, your source files should go into
<code>src/main/scala/ca/uwaterloo/cs/bigdata2016w/lintool/assignment5/</code>.
Obviously, replace <code>lintool</code> with your actual GitHub
username.</p>

<p><b>Q1:</b> How many items were shipped on a particular date? This corresponds to the following SQL query:</p>

<pre>
select count(*) from lineitem where l_shipdate = 'YYYY-MM-DD';
</pre>

<p>Write a program such that when we execute the following
command:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment5.Q1 \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01'
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed), in a line that matches
the following regular expression:</p>

<pre>
ANSWER=\d+
</pre>

<p>The output of the program can contain logging and debug information,
but there must be a line with the answer in <b>exactly</b> the above
format.</p>

<p>The value of the
<code>--input</code> argument is the directory that contains the
plain-text tables. The value of the <code>--date</code> argument
corresponds to the <code>l_shipdate</code> predicate in the where
clause in the SQL query. You need to anticipate dates of the
form <code>YYYY-MM-DD</code>, <code>YYYY-MM</code>, or
just <code>YYYY</code>, and your query needs to give the correct
answer depending on the date format. You can assume that a valid date
(in one of the above formats) is provided, so you do not need to
perform input validation.</p>
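
<p>As a minimal sketch of the date handling, assume
that <code>l_shipdate</code> values in the text files are always
written as <code>YYYY-MM-DD</code>, and that a shorter argument is
meant to match every date in that month or year. Under those
assumptions, a prefix test covers all three formats. The helper
name <code>matchesDate</code> and the sample dates are hypothetical,
and the snippet is plain Scala rather than a Spark program:</p>

<pre>
// Hypothetical helper: true when shipdate (assumed "YYYY-MM-DD") falls
// within the day, month, or year named by the --date argument.
def matchesDate(shipdate: String, date: String): Boolean =
  shipdate.startsWith(date)

// Made-up sample values standing in for the l_shipdate column.
val shipdates = Seq("1996-01-01", "1996-01-15", "1996-02-01", "1995-12-31")

val answer = shipdates.count(d => matchesDate(d, "1996-01"))
println("ANSWER=" + answer)  // prints ANSWER=2
</pre>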

<p><b>Q2:</b> Which clerks were responsible for processing items that
were shipped on a particular date? List the first 20 by order
key. This corresponds to the following SQL query:</p>

<pre>
select o_clerk, o_orderkey from lineitem, orders
where
  l_orderkey = o_orderkey and
  l_shipdate = 'YYYY-MM-DD'
order by o_orderkey asc limit 20;
</pre>

<p>Write a program such that when we execute the following
command:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment5.Q2 \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01'
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed), as a sequence of tuples
in the following format:</p>

<pre>
(o_clerk,o_orderkey)
(o_clerk,o_orderkey)
...
</pre>

<p>That is, each tuple is comma-delimited and surrounded by
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>
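
<p>Scala's built-in tuples already print in this format, so
calling <code>println</code> on a tuple is one straightforward way to
produce it. A small illustration, with made-up values:</p>

<pre>
// Hypothetical sample row; real values come from the orders table.
val row = ("Clerk#000000001", 42)
println(row)  // prints (Clerk#000000001,42)
</pre>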

<p>In the design of this data warehouse, the <code>lineitem</code>
and <code>orders</code> tables are not likely to fit in
memory. Therefore, the only scalable join approach is the reduce-side
join. You must implement this join in Spark using
the <code>cogroup</code> transformation.</p>
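
<p>To build intuition for what <code>cogroup</code> produces, here is
a plain-Scala simulation on in-memory collections (no Spark
involved). In Spark, <code>cogroup</code> on two pair RDDs yields, for
each key, a pair of collections holding the values from each side;
the local function below mimics that shape. The function itself, the
variable names, and the sample rows are all hypothetical, and this is
a sketch of the idea, not the required implementation:</p>

<pre>
// Local stand-in for cogroup: for every key on either side, collect the
// values from the left and from the right.
def cogroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
  val l = left.groupBy(_._1)
  val r = right.groupBy(_._1)
  (l.keySet ++ r.keySet).toSeq.map { k =>
    val ls = l.getOrElse(k, Seq.empty[(K, A)]).map(_._2)
    val rs = r.getOrElse(k, Seq.empty[(K, B)]).map(_._2)
    (k, (ls, rs))
  }.toMap
}

// Made-up rows: (o_orderkey, o_clerk) and (l_orderkey, l_shipdate),
// with the lineitem side already filtered by date.
val orders = Seq((1, "Clerk#000000001"), (2, "Clerk#000000002"))
val lineitems = Seq((1, "1996-01-01"), (1, "1996-01-01"), (3, "1996-01-01"))

// A key contributes to the join only when both sides are non-empty.
val joined = cogroup(orders, lineitems).toSeq.flatMap { case (key, (clerks, dates)) =>
  clerks.flatMap(clerk => dates.map(_ => (clerk, key)))
}.sortBy(_._2)

joined.foreach(println)
</pre>

<p>Order key 1 appears twice in the output because two line items
match it; keys 2 and 3 are dropped because one side is empty.</p>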

<p><b>Q3:</b> What are the names of parts and suppliers of items
shipped on a particular date? List the first 20 by order key. This
corresponds to the following SQL query:</p>

<pre>
select l_orderkey, p_name, s_name from lineitem, part, supplier
where
  l_partkey = p_partkey and
  l_suppkey = s_suppkey and
  l_shipdate = 'YYYY-MM-DD'
order by l_orderkey asc limit 20;
</pre>

<p>Write a program such that when we execute the following
command:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment5.Q3 \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01'
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed), as a sequence of tuples
in the following format:</p>

<pre>
(l_orderkey,p_name,s_name)
(l_orderkey,p_name,s_name)
...
</pre>

<p>That is, each tuple is comma-delimited and surrounded by
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>In the design of this data warehouse, it is assumed that
the <code>part</code> and <code>supplier</code> tables will fit in
memory. Therefore, it is possible to implement a hash join. For this
query, you must implement a hash join in Spark with broadcast
variables.</p>
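
<p>The core of a hash join can be sketched in plain Scala: load the
small table into a map, then probe it once per row of the large
table. In a Spark program the map would be wrapped
with <code>sc.broadcast(...)</code> and read
through <code>.value</code> inside the transformation; that part is
omitted here so the snippet runs on its own. The names and values
below are made up for illustration:</p>

<pre>
// Small table, assumed to fit in memory: p_partkey -> p_name.
val partNames = Map(10 -> "blue widget", 11 -> "red widget")

// Large table, already filtered by date: (l_orderkey, l_partkey).
val lineitems = Seq((1, 10), (2, 11), (3, 12))

// Probe the map once per row; rows without a matching part are dropped.
val joined = lineitems.flatMap { case (orderkey, partkey) =>
  partNames.get(partkey).map(name => (orderkey, name))
}

joined.foreach(println)
</pre>

<p>Because the lookup table travels to the data rather than the other
way around, the large table never needs to be shuffled.</p>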

<p><b>Q4:</b> How many items were shipped to each country on a
particular date? This corresponds to the following SQL query:</p>

<pre>
select n_nationkey, n_name, count(*) from lineitem, orders, customer, nation
where
  l_orderkey = o_orderkey and
  o_custkey = c_custkey and
  c_nationkey = n_nationkey and
  l_shipdate = 'YYYY-MM-DD'
group by n_nationkey, n_name
order by n_nationkey asc;
</pre>

<p>Write a program such that when we execute the following
command:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment5.Q4 \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01'
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed). Format
the output in the same manner as with the above queries: one tuple per
line, where each tuple is comma-delimited and surrounded by
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>Implement this query with different join techniques as you see
fit. You can assume that the <code>lineitem</code>
and <code>orders</code> tables will not fit in memory, but you can
assume that the <code>customer</code> and <code>nation</code> tables
will both fit in memory. For this query, the performance as well as
the scalability of your implementation will contribute to the
grade.</p>

<p><b>Q5:</b> This query represents a very simple end-to-end ad hoc
analysis task: Related to <b>Q4</b>, your boss has asked you to
compare shipments to Canada vs. the United States by month, given all
the data in the data warehouse. You think this request is best
fulfilled by a line graph, with two lines (one representing the US and
one representing Canada), where the <i>x</i>-axis is the year/month and
the <i>y</i>-axis is the volume, i.e., <code>count(*)</code>.
Generate this graph for your boss.</p>

<p>First, write a program such that when we execute the following
command:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment5.Q5 \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT
</pre>

<p>the raw data necessary for the graph will be printed to stdout (on the
console where the above command is executed).
Format the output in the same manner as with the above queries: one
tuple per line, where each tuple is comma-delimited and surrounded by
parentheses.</p>

<p>Next, create this actual graph: use whatever tool you are
comfortable with, e.g., Excel, gnuplot, etc.</p>

<p><b>Q6:</b> This is a slightly modified version of TPC-H Q1 "Pricing
Summary Report Query". This query reports the amount of business that
was billed, shipped, and returned:</p>

<pre>
select
  l_returnflag,
  l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
  sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
from lineitem
where
  l_shipdate = 'YYYY-MM-DD'
group by l_returnflag, l_linestatus;
</pre>

<p>Write a program such that when we execute the following
command:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment5.Q6 \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01'
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed). Format
the output in the same manner as with the above queries: one tuple per
line, where each tuple is comma-delimited and surrounded by
parentheses. Everything described in <b>Q1</b> about dates applies
here as well.</p>

<p>Implement this query as efficiently as you can, using all of the
optimizations we discussed in lecture. You will only get full points
for this question if you exploit all the optimization opportunities
that are available.</p>
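
<p>The lecture's list of optimizations is not restated here. One
idea that often applies to queries of this shape, offered as an
assumption rather than as the intended answer, is to carry running
sums and counts together so that sums and averages come out of a
single pass over the data. A plain-Scala sketch with made-up rows and
only one of the aggregated columns:</p>

<pre>
// Made-up rows: (l_returnflag, l_linestatus, l_quantity).
val rows = Seq(("A", "F", 10.0), ("A", "F", 30.0), ("N", "O", 5.0))

val aggregated = rows
  .map { case (flag, status, qty) => ((flag, status), (qty, 1L)) }
  .groupBy(_._1)
  .map { case (key, values) =>
    // Combine (sum, count) pairs; the average is derived at the very end.
    val (sum, count) = values.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    (key._1, key._2, sum, sum / count)
  }

aggregated.foreach(println)
</pre>

<p>The combining step is associative, which is what would allow
partial results to be merged before any data moves between
machines.</p>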

<p><b>Q7:</b> This is a slightly modified version of TPC-H Q3
"Shipping Priority Query".  This query retrieves the 10 unshipped
orders with the highest value:</p>

<pre>
select
  c_name,
  l_orderkey,
  sum(l_extendedprice*(1-l_discount)) as revenue,
  o_orderdate,
  o_shippriority
from customer, orders, lineitem
where
  c_custkey = o_custkey and
  l_orderkey = o_orderkey and
  o_orderdate &lt; 'YYYY-MM-DD' and
  l_shipdate &gt; 'YYYY-MM-DD'
group by
  c_name,
  l_orderkey,
  o_orderdate,
  o_shippriority
order by
  revenue desc
limit 10;
</pre>

<p>Write a program such that when we execute the following
command:</p>

<pre>
spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment5.Q7 \
   target/bigdata2016w-0.1.0-SNAPSHOT.jar --input TPC-H-0.1-TXT --date '1996-01-01'
</pre>

<p>the answer to the above SQL query will be printed to stdout (on the
console where the above command is executed).
Format the output in the same manner as with the above queries: one
tuple per line, where each tuple is comma-delimited and surrounded by
parentheses. Here you can assume that the date argument is only in the
format <code>YYYY-MM-DD</code> and that it is a valid date.</p>

<p>Implement this query as efficiently as you can, using all of the
optimizations we discussed in lecture. Plan your joins as you see fit,
keeping in mind the above assumptions on what will and will not fit in
memory. You will only get full points for this question if you exploit
all the optimization opportunities that are available.</p>

<h4 style="padding-top: 10px">Scaling up on Altiscale</h4>

<p>Once you get your implementation working and debugged in the Linux
environment, run your code on a larger TPC-H dataset, located on HDFS
at <code>/shared/cs489/data/TPC-H-10-TXT</code>. Make sure that all
seven queries above run correctly on this larger dataset.</p>

<p>On the Altiscale cluster, we will run your code with the following
command-line parameters (same for Q1-Q7):</p>

<pre>
my-spark-submit --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment5.Q1 --num-executors 10 --driver-memory 2g --executor-memory 2G \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/TPC-H-10-TXT --date '1996-01-01'
</pre>

<p>In this configuration, your programs shouldn't take more than a
couple of minutes. If it's taking more than five minutes, you're
probably doing something wrong.</p>

<p><b>Important:</b> In your <code>my-spark-submit</code> script, make
sure you set <code>--deploy-mode client</code>. This will force the
driver to run on the client (i.e., workspace), so that you will see
the output of <code>println</code>. Otherwise, the driver will run on
an arbitrary cluster node, making stdout not directly visible.</p>

<h4 style="padding-top: 10px">Turning in the Assignment</h4>

<p>Please follow these instructions carefully!</p>

<p>Make sure your repo has the following items:</p>

<ul>

<li>Optional: put anything that you want to convey to us about your
implementation in <code>bigdata2016w/assignment5.md</code>.</li>

<li>Two files,
named <code>bigdata2016w/assignment5-Q5-small.pdf</code>
and <code>bigdata2016w/assignment5-Q5-large.pdf</code>, that contain
the graphs for Q5 on the <code>TPC-H-0.1-TXT</code>
and <code>TPC-H-10-TXT</code> datasets, respectively. If you cannot
easily generate PDFs, the files should be in some easily viewable format,
e.g., <code>png</code>, <code>gif</code>, etc.</li>

<li>Your implementations for the queries should be in
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment5</code>. There
should be at the minimum seven classes (Q1-Q7), but you may include
helper classes as you see fit.</li>

</ul>

<p>Make sure your implementation runs in the Linux Student CS
environment on <code>TPC-H-0.1-TXT</code>, and on the Altiscale
cluster on the <code>TPC-H-10-TXT</code> data.</p>

<p>Specifically, we will clone your repo and use the below check
scripts:</p>

<ul>

<li><a href="assignments/check_assignment5_public_linux.py"><code>check_assignment5_public_linux.py</code></a>
in the Linux Student CS environment.</li>

<li><a href="assignments/check_assignment5_public_altiscale.py"><code>check_assignment5_public_altiscale.py</code></a>
on the Altiscale cluster.</li>

</ul>

<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we would
recommend that you verify everything above works by performing a clean
clone of your repo and running the public check scripts.</p>

<p>That's it!</p>

<h4 style="padding-top: 10px">Grading</h4>

<p>The entire assignment is worth 100 points:</p>

<ul>

  <li>Getting your code to compile is worth 10 points (by now, these
  should be "free" points).</li>

  <li>For Q1-Q3, each query is worth 10 points: 5 points for a correct
  implementation that works in the Linux Student CS environment, 5
  points for a correct implementation that works on the Altiscale
  cluster.</li>

  <li>Q4 and Q5 are each worth 14 points: 7 points for a correct
  implementation that works in the Linux Student CS environment, 7
  points for a correct implementation that works on the Altiscale
  cluster.</li>

  <li>Q6 and Q7 are each worth 16 points: 8 points for a correct
  implementation that works in the Linux Student CS environment, 8
  points for a correct implementation that works on the Altiscale
  cluster.</li>

</ul>

<p>A working implementation means that your code gives the right
output according to our private check scripts, which will
contain <code>--date</code> parameters that are unknown to you (but
will nevertheless conform to our specifications above).</p>

<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>


<section id="assignment6" style="padding-top:35px">
<div>
<h3>Assignment 6: Spam Classification <small>due 8:30am March 22</small></h3>

<!-- p><b>Warning:</b> This assignment is not completely written yet. It
is provided in draft form for those who wish to get a head start on
the assignment. The assignment will not be considered "complete" until
this message is removed. If you start working on the problems, it is
your responsibility to make sure that your solutions actually match
the final questions.</p -->

<p>In this assignment, you will build a spam classifier trained using
stochastic gradient descent in Spark, replicating the work described
in <a href="http://arxiv.org/abs/1004.5168">Efficient and Effective
Spam Filtering and Re-ranking for Large Web Datasets</a> by Cormack,
Smucker, and Clarke. We will draw your attention to specific sections
of the paper that are directly pertinent to the assignment, but you
should read the entire paper for background.</p>

<h4 style="padding-top: 10px">Downloading Data</h4>

<p>First let's grab the training and test data:</p>

<pre>
wget https://www.student.cs.uwaterloo.ca/~cs489/spam/spam.train.group_x.txt.bz2
wget https://www.student.cs.uwaterloo.ca/~cs489/spam/spam.train.group_y.txt.bz2
wget https://www.student.cs.uwaterloo.ca/~cs489/spam/spam.train.britney.txt.bz2
wget https://www.student.cs.uwaterloo.ca/~cs489/spam/spam.test.qrels.txt.bz2
</pre>

<!--p>Just to verify what you've downloaded:</p>

<div style="width: 50%">
<table class="table table-striped">
<thead><tr><td><b>File</b></td><td><b>MD5</b></td><td><b>Size</b></td></tr></thead>
<tr><td><code>spam.train.group_x.txt.bz2</code></td><td><code>947faf932afee7e35d79e7da10fe0e3e</code></td><td>5.5 MB</td></tr>
<tr><td><code>spam.train.group_y.txt.bz2</code></td><td><code>7cf45c3666915999f1048aafeff4c60e</code></td><td>6.6 MB</td></tr>
<tr><td><code>spam.train.britney.txt.bz2</code></td><td><code>bad2e4ccaed7482f9e99e65e58c6beda</code></td><td>248 MB</td></tr>
<tr><td><code>spam.test.qrels.txt.bz2</code></td><td><code>99858d9dd1b40e994732a641703859ec</code></td><td>303 MB</td></tr>
</table>
</div-->

<p>The sizes of the above files are 5.5 MB, 6.6 MB, 248 MB, and 303
MB, respectively. After you've downloaded the data, unpack:</p>

<pre>
bunzip2 spam.train.group_x.txt.bz2
bunzip2 spam.train.group_y.txt.bz2
bunzip2 spam.train.britney.txt.bz2
bunzip2 spam.test.qrels.txt.bz2
</pre>

<p>Verify the unpacked data:</p>

<div style="width: 50%">
<table class="table table-striped">
<thead><tr><td><b>File</b></td><td><b>MD5</b></td><td><b>Size</b></td></tr></thead>
<tr><td><code>spam.train.group_x.txt</code></td><td><code>d6897ed8319c71604b1278b660a479b6</code></td><td>25 MB</td></tr>
<tr><td><code>spam.train.group_y.txt</code></td><td><code>4d103821fdf369be526347b503655da5</code></td><td>20 MB</td></tr>
<tr><td><code>spam.train.britney.txt</code></td><td><code>b52d54caa20325413491591f034b5e7b</code></td><td>766 MB</td></tr>
<tr><td><code>spam.test.qrels.txt</code></td>   <td><code>df1d26476ec41fec625bc2eb9969875c</code></td><td>1.1 GB</td></tr>
</table>
</div>

<p>Next, download the three files you'll need for evaluating the output of
the spam classifier (links below):</p>

<ul>

<li><a href="assignments/compute_spam_metrics.c"><code>compute_spam_metrics.c</code></a></li>

<li><a href="assignments/spam_eval.sh"><code>spam_eval.sh</code></a></li>

<li><a href="assignments/spam_eval_hdfs.sh"><code>spam_eval_hdfs.sh</code></a></li>

</ul>

<p>Compile the C program:</p>

<pre>
gcc -O2 -o compute_spam_metrics compute_spam_metrics.c -lm
</pre>

<p>You might get some warnings but don't worry&mdash;the code should
compile fine. The actual evaluation script <code>spam_eval.sh</code>
(and <code>spam_eval_hdfs.sh</code>)
calls <code>compute_spam_metrics</code>, so make sure they're in the
same directory.</p>

<p>Note on local vs. Altiscale: for this assignment, your code must
(eventually) work in Altiscale, but feel free to develop locally or in
the Linux Student CS environment. The instructions below are written
for running locally, but in a separate section later we will cover
details specific to Altiscale.</p>

<h4 style="padding-top: 10px">Basic Spam Classifier</h4>

<p>In this assignment, we'll take you through building spam
classifiers of increasing complexity, but let's start with a basic
implementation using stochastic gradient descent. Build the spam
classifier in <b>exactly</b> the way we describe below, because later
parts of the assignment will depend on the setup.</p>

<p>First, let's write the classifier trainer. The classifier trainer
takes all the training instances, runs stochastic gradient descent,
and produces a model as output.</p>

<p>Look at the Cormack, Smucker, and Clarke paper:
the entire algorithm is literally 34 lines of C, shown in Figure 2 on
page 10. The stochastic gradient descent update equations are in
equations (11) and (12) on page 11. We actually made things <i>even
simpler</i> for you: the features used in the spam classifier are
hashed byte 4-grams (thus, integers)&mdash;we've pre-computed the
features for you.</p>

<p>Take a look at <code>spam.train.group_x.txt</code>. The first line
begins as follows:</p>

<pre>
clueweb09-en0094-20-13546 spam 387908 697162 426572 161118 688171 ...
</pre>

<p>In the file, each training instance is on a line. Each line begins
with a document id, the string "spam" or "ham" (the label), and a list
of integers (the features).</p>
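
<p>Parsing one such line might look like the following sketch. It
assumes fields are separated by single spaces, as in the sample
above, and it uses only the part of the sample line shown before the
ellipsis. The variable names are illustrative, and the encoding of
the label as 1 or 0 is a choice made here so the label can be used
directly in arithmetic:</p>

<pre>
// The visible portion of the sample line (the real line has more features).
val line = "clueweb09-en0094-20-13546 spam 387908 697162 426572 161118 688171"

val tokens = line.split(" ")
val docid = tokens(0)
// Encode the label numerically: 1 for spam, 0 for ham.
val isSpam = if (tokens(1) == "spam") 1 else 0
// Everything after the label is a hashed feature.
val features = tokens.drop(2).map(_.toInt)

println((docid, isSpam, features.length))  // prints (clueweb09-en0094-20-13546,1,5)
</pre>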

<p>Therefore, your spam classifier will look something like this:</p>

<pre>
import scala.collection.mutable
import scala.math.exp

// w is the weight vector (make sure the variable is within scope).
// It must be a mutable map, since the learner updates it in place.
val w = mutable.Map[Int, Double]()

// Scores a document based on its list of features.
def spamminess(features: Array[Int]) : Double = {
  var score = 0d
  features.foreach(f => if (w.contains(f)) score += w(f))
  score
}

// This is the main learner:
val delta = 0.002

// For each instance...
val isSpam = ...   // label: 1 if spam, 0 if ham
val features = ... // feature vector of the training instance

// Update the weights as follows:
val score = spamminess(features)
val prob = 1.0 / (1 + exp(-score))
features.foreach(f => {
  if (w.contains(f)) {
    w(f) += (isSpam - prob) * delta
  } else {
    w(f) = (isSpam - prob) * delta
  }
})
</pre>

<p>We've given you the code fragment for the learner above as a
starting point&mdash;it's your job to understand exactly how it works
and turn it into a complete classifier trainer in Spark.</p>
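
<p>To check your understanding of the update rule, it can help to
trace a single step by hand. The sketch below repackages the fragment
as a hypothetical <code>update</code> function and applies it to one
made-up spam instance, starting from an empty weight vector. With no
weights yet, the score is 0, the predicted probability is 0.5, and
each feature's weight becomes (1 - 0.5) * 0.002 = 0.001:</p>

<pre>
import scala.collection.mutable
import scala.math.exp

val w = mutable.Map[Int, Double]()
val delta = 0.002

def spamminess(features: Array[Int]): Double =
  features.map(f => w.getOrElse(f, 0.0)).sum

// Hypothetical wrapper around the update step; isSpam is 1 for spam, 0 for ham.
def update(isSpam: Int, features: Array[Int]): Unit = {
  // The probability is computed once, before any weight changes.
  val prob = 1.0 / (1 + exp(-spamminess(features)))
  features.foreach(f => w(f) = w.getOrElse(f, 0.0) + (isSpam - prob) * delta)
}

update(1, Array(387908, 697162))
println(w(387908))  // prints 0.001
</pre>

<p>A second spam instance sharing these features would now score
above 0, so its predicted probability would exceed 0.5 and its
update would be smaller.</p>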

<p>For the structure of the Spark trainer program, take a look at
slide 14 in the <a href="slides/week08b.pdf">Week 8, part 2</a>
deck. We're going to build the configuration shown there (even though
the slide says MapReduce, we're implementing it in
Spark). Specifically, we're going to run a single reducer to make sure we
pump all the training instances through a single learner on the
reducer end. The overall structure of your program is going to look
something like this:</p>

<pre>
val textFile = sc.textFile("/path/to/training/data")

val trained = textFile.map(line =>{
  // Parse input
  // ..
  (0, (docid, isSpam, features))
  }).groupByKey(1)
  // Then run the trainer...

trained.saveAsTextFile(...)
</pre>

<p>Note the mappers are basically just parsing the feature vectors and
pushing them over to the reducer side for additional processing. We
emit "0" as a "dummy key" to make sure all the training instances get
collected at the reducer end via <code>groupByKey()</code>... after
which you run the trainer (which applies the SGD updates, per
above). Of course, it's your job to figure out how to connect the
pieces together. This is the crux of the assignment.</p>

<p>Putting everything together, you will write a trainer program
called <code>TrainSpamClassifier</code> that we will execute in the
following manner:</p>

<pre>
spark-submit --driver-memory 2g --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment6.TrainSpamClassifier \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input spam.train.group_x.txt --model cs489-2016w-lintool-a6-model-group_x
</pre>

<p>The <code>--input</code> option specifies the input training
instances (from above); the <code>--model</code> option specifies the
output directory where the model goes. Inside the model
directory <code>cs489-2016w-lintool-a6-model-group_x</code>, there
should be a single file, <code>part-00000</code>, that contains the
trained model. The trained model should be a sequence of tuples, one
on each line; each tuple should contain a feature and its weight (a
double value). Something like:</p>

<pre>
$ head -5 cs489-2016w-lintool-a6-model-group_x/part-00000
(547993,2.019484093190069E-4)
(577107,5.255371091500805E-5)
(12572,-4.40967560913553E-4)
(270898,-0.001340150007664197)
(946531,2.560528666942676E-4)
</pre>

<p>Next, you will write another Spark program
named <code>ApplySpamClassifier</code> that will apply the trained spam
classifier to the test instances. That is, the program will read in
each input instance, compute the spamminess score (from above), and
make a prediction: if the spamminess score is above 0, classify the
document as spam; otherwise, classify the document as ham.</p>

<p>We will run the program in the following manner:</p>

<pre>
spark-submit --driver-memory 2g --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment6.ApplySpamClassifier \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input spam.test.qrels.txt \
 --output cs489-2016w-lintool-a6-test-group_x --model cs489-2016w-lintool-a6-model-group_x
</pre>

<p>The <code>--input</code> option specifies the input test instances;
the <code>--model</code> option specifies the classifier model; and
the <code>--output</code> option specifies the output directory.  The
test data is organized in exactly the same way as the training data.
The output of <code>ApplySpamClassifier</code> should be organized as
follows:</p>

<pre>
$ cat cs489-2016w-lintool-a6-test-group_x/* | sort | head -5
(clueweb09-en0000-00-00142,spam,2.601624279252943,spam)
(clueweb09-en0000-00-01005,ham,2.5654162439491004,spam)
(clueweb09-en0000-00-01382,ham,2.5893946346394188,spam)
(clueweb09-en0000-00-01383,ham,2.6190102258752614,spam)
(clueweb09-en0000-00-03449,ham,1.500142758578532,spam)
</pre>

<p>The first field in each tuple is the document id and the second
field is the test label. These are just copied from the
test data. The third field is the spamminess score, and the fourth
field is the classifier's prediction.</p>

<p><b>Important:</b> It is absolutely critical that your
classifier <b>does not</b> use the label in the test data when making its
predictions. The only reason the label is included in the
output is to facilitate evaluation (see below).</p>

<p>Finally, you can evaluate your results:</p>

<pre>
$ ./spam_eval.sh cs489-2016w-lintool-a6-test-group_x
1-ROCA%: 17.25
</pre>

<p>The eval script prints the evaluation metric, 1-ROCA%: one minus the
area under the receiver operating characteristic (ROC) curve, expressed
as a percentage. This is a common way to
characterize classifier error. The lower this score, the better.</p>

<p>If you've done everything correctly up until now, you should be
able to replicate the above results.</p>

<p>You should then be able to train on the <code>group_y</code> training set:</p>

<pre>
spark-submit --driver-memory 2g --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment6.TrainSpamClassifier \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input spam.train.group_y.txt --model cs489-2016w-lintool-a6-model-group_y
</pre>

<p>And make predictions:</p>

<pre>
spark-submit --driver-memory 2g --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment6.ApplySpamClassifier \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input spam.test.qrels.txt \
 --output cs489-2016w-lintool-a6-test-group_y --model cs489-2016w-lintool-a6-model-group_y
</pre>

<p>And evaluate:</p>

<pre>
$ ./spam_eval.sh cs489-2016w-lintool-a6-test-group_y
1-ROCA%: 12.82
</pre>

<p>Finally, train on the <code>britney</code> training set:</p>

<pre>
spark-submit --driver-memory 2g --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment6.TrainSpamClassifier \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input spam.train.britney.txt --model cs489-2016w-lintool-a6-model-britney
</pre>

<p>And make predictions:</p>

<pre>
spark-submit --driver-memory 2g --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment6.ApplySpamClassifier \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input spam.test.qrels.txt \
 --output cs489-2016w-lintool-a6-test-britney --model cs489-2016w-lintool-a6-model-britney
</pre>

<p>And evaluate:</p>

<pre>
$ ./spam_eval.sh cs489-2016w-lintool-a6-test-britney
1-ROCA%: 16.46
</pre>

<p>There may be some non-determinism in running over
the <code>britney</code> dataset, so you might get something slightly
different.</p>

<p>Here's a placeholder for question 1 that you're going to answer
below (see Altiscale section).</p>

<h4 style="padding-top: 10px">Ensemble Spam Classifier</h4>

<p>Next, let's build an ensemble classifier. Start by gathering all
the models from each of the individual classifiers into a common
directory:</p>

<pre>
mkdir cs489-2016w-lintool-a6-model-fusion
cp cs489-2016w-lintool-a6-model-group_x/part-00000 cs489-2016w-lintool-a6-model-fusion/part-00000
cp cs489-2016w-lintool-a6-model-group_y/part-00000 cs489-2016w-lintool-a6-model-fusion/part-00001
cp cs489-2016w-lintool-a6-model-britney/part-00000 cs489-2016w-lintool-a6-model-fusion/part-00002
</pre>

<p>With these three separate classifiers, implement two different
ensemble techniques:</p>

<ul>
  <li>Score averaging: Average the spamminess score from each
  individual classifier. If the average score is above 0, classify the
  document as spam; otherwise, classify the document as ham.</li>

  <li>Voting: Each classifier gets a vote on spam/ham. Majority
  wins. The spamminess score in this case is # spam - # ham (so the
  possible scores are -3, -1, 1, 3).</li>

</ul>
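
<p>The two techniques can disagree on the same document. The
plain-Scala sketch below applies both to three made-up spamminess
scores; the function names are illustrative and not part of the
required interface:</p>

<pre>
// Score averaging: the ensemble score is the mean of the individual scores.
def average(scores: Seq[Double]): Double = scores.sum / scores.size

// Voting: the ensemble score is (# spam votes) - (# ham votes).
def vote(scores: Seq[Double]): Double = {
  val spam = scores.count(_ > 0)
  val ham = scores.size - spam
  (spam - ham).toDouble
}

def label(score: Double): String = if (score > 0) "spam" else "ham"

// Made-up scores from three classifiers: one confident "spam", two weak "ham".
val scores = Seq(2.5, -0.5, -0.25)

println(label(average(scores)))  // prints spam
println(label(vote(scores)))     // prints ham
</pre>

<p>Averaging lets one confident classifier outweigh two hesitant
ones, while voting ignores the magnitude of each score.</p>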

<p>Write a program <code>ApplyEnsembleSpamClassifier</code> that we
will execute in the following manner:</p>

<pre>
spark-submit --driver-memory 2g --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment6.ApplyEnsembleSpamClassifier \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input spam.test.qrels.txt \
 --output cs489-2016w-lintool-a6-test-fusion-average --model cs489-2016w-lintool-a6-model-fusion --method average
</pre>

<p>The <code>--input</code> option specifies the input test instances.
The <code>--model</code> option specifies the base directory of all
the classifier models; in this directory your program should expect
each individual model in a <code>part-XXXXX</code> file; it's okay to
hard code the part files for convenience. The <code>--output</code>
option specifies the output directory. Finally,
the <code>--method</code> option specifies the ensemble technique,
either "average" or "vote" per above.</p>

<p>Your prediction program needs to load all three models, apply the
specified ensemble technique, and make predictions. Hint: Spark
broadcast variables are helpful in this implementation.</p>

<p>The output format of the predictions should be the same as the
output of the <code>ApplySpamClassifier</code> program. You should be
able to evaluate with <code>spam_eval.sh</code> in the same way. Go
ahead and predict with the two ensemble techniques and evaluate the
predictions. Note that ensemble techniques can sometimes improve on
the best classifier; sometimes not.</p>

<p>Here's a placeholder for questions 2 and 3 that you're going to
answer below (see Altiscale section).</p>

<p>How does the ensemble compare to just concatenating all the
training data together and training a single classifier? Let's find
out:</p>

<pre>
cat spam.train.group_x.txt spam.train.group_y.txt spam.train.britney.txt > spam.train.all.txt
</pre>

<p>Now train on this larger training set, predict, and evaluate.</p>

<p>Here's a placeholder for question 4 that you're going to answer
below (see Altiscale section).</p>

<h4 style="padding-top: 10px">The Effects of Data Shuffling</h4>

<p>In class, we talked about how a model trained using stochastic
gradient descent is dependent on the order in which the training
instances are presented to the trainer. Let's explore this effect.</p>

<p>Modify the <code>TrainSpamClassifier</code> to implement a new
option <code>--shuffle</code>. With this option, the program will
randomly shuffle the training instances before running the
trainer:</p>

<pre>
spark-submit --driver-memory 2g --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment6.TrainSpamClassifier \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input spam.train.britney.txt --model cs489-2016w-lintool-a6-model-britney-shuffle --shuffle
</pre>

<p>You <i>must</i> shuffle the data using Spark. The way to accomplish
this in Spark is to generate a random number for each instance and
then sort the instances by the value. That is, you <i>cannot</i>
simply read all the training instances into memory in the driver,
shuffle, and then parallelize.</p>
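<p>To make the "random key, then sort" idea concrete, here is a small
plain-Java sketch. It only illustrates the logic on an in-memory array;
the names (<code>RandomKeyShuffle</code>, <code>Keyed</code>) and the
fixed seed are assumptions made for this example. In your submission,
the keying and sorting must be expressed as Spark transformations, not
done in the driver like this.</p>

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration of shuffling by sorting on random keys.
class RandomKeyShuffle {
  // Pairs a training instance with its random sort key.
  static class Keyed {
    final double key;
    final String instance;

    Keyed(double key, String instance) {
      this.key = key;
      this.instance = instance;
    }
  }

  static String[] shuffle(String[] instances, long seed) {
    Random rng = new Random(seed);
    Keyed[] keyed = new Keyed[instances.length];
    for (int i = 0; i < instances.length; i++) {
      // Step 1: attach a random number to each instance.
      keyed[i] = new Keyed(rng.nextDouble(), instances[i]);
    }
    // Step 2: sort by the random key, which produces a random ordering.
    Arrays.sort(keyed, (a, b) -> Double.compare(a.key, b.key));
    String[] shuffled = new String[keyed.length];
    for (int i = 0; i < keyed.length; i++) {
      // Step 3: drop the key and keep only the instance.
      shuffled[i] = keyed[i].instance;
    }
    return shuffled;
  }

  public static void main(String[] args) {
    String[] instances = {"doc1", "doc2", "doc3", "doc4", "doc5"};
    System.out.println(Arrays.toString(shuffle(instances, 42L)));
  }
}
```

<p>The fixed seed is there only to make the sketch repeatable; the
input array itself is left untouched.</p>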

<p>Obviously, the addition of the <code>--shuffle</code> option should
not break existing functionality; that is, without the option, the
program should behave exactly as before.</p>

<p>Note that in this case we're working with the <code>britney</code>
data because the two other datasets have very few
examples&mdash;random shuffles can lead to weird idiosyncratic
effects.</p>

<p>You should be able to evaluate the newly trained model in exactly
the same way as above. If you are getting wildly different 1-ROCA%
scores each time, you're doing something wrong.</p>

<p>Here's a placeholder for question 5 that you're going to answer
below (see Altiscale section).</p>

<h4 style="padding-top: 10px">Running on Altiscale</h4>

<p>You are free to develop locally on your own machine or in the Linux
Student CS environment (and in fact, the instructions above assume
so), but you must make sure that your code runs in Altiscale
also. This is just to verify that your Spark programs will work in a
distributed environment, and that you are not inadvertently taking
advantage of some local feature.</p>

<p>All training and test data are located
in <code>/shared/cs489/data/</code> on HDFS. Note that
<code>spam.train.all.txt</code> has already been prepared for you in
that directory also.</p>

<p>For example, training, predicting, and evaluating on
the <code>group_x</code> dataset in Altiscale:</p>

<pre>
my-spark-submit --driver-memory 2g --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment6.TrainSpamClassifier \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/spam.train.group_x.txt \
 --model cs489-2016w-lintool-a6-model-group_x

my-spark-submit --driver-memory 2g --class ca.uwaterloo.cs.bigdata2016w.lintool.assignment6.ApplySpamClassifier \
 target/bigdata2016w-0.1.0-SNAPSHOT.jar --input /shared/cs489/data/spam.test.qrels.txt \
 --output cs489-2016w-lintool-a6-test-group_x --model cs489-2016w-lintool-a6-model-group_x

./spam_eval_hdfs.sh cs489-2016w-lintool-a6-test-group_x
</pre>

<p>The major differences are:</p>

<ul>
  <li>Location of the training/test data (on HDFS).</li>
  <li>All input/output from/to HDFS.</li>
  <li>Use of the <code>my-spark-submit</code> script for launching Spark programs.</li>
  <li>Use of <code>spam_eval_hdfs.sh</code> for the evaluation script.</li>
</ul>

<p>Refer back to the placeholders above and answer the following
questions, <i>running your code on the Altiscale cluster</i>:</p>

<p><b>Question 1:</b> For each of the individual classifiers trained
on <code>group_x</code>, <code>group_y</code>,
and <code>britney</code>, what are the 1-ROCA% scores? You should be
able to replicate our results
on <code>group_x</code> and <code>group_y</code>, but there may be some
non-determinism for <code>britney</code>, which is why we want you to
report the figures.</p>

<p><b>Question 2:</b> What is the 1-ROCA% score of the score averaging
technique in the 3-classifier ensemble?</p>

<p><b>Question 3:</b> What is the 1-ROCA% score of the voting
technique in the 3-classifier ensemble?</p>

<p><b>Question 4:</b> What is the 1-ROCA% score of a single classifier
trained on all available training data concatenated together?</p>

<p><b>Question 5:</b> Run the shuffle trainer 10 times on
the <code>britney</code> dataset, predict and evaluate the classifier
on the test data each time. Report the 1-ROCA% score in each of the
ten trials and compute the overall average.</p>

<h4 style="padding-top: 10px">Turning in the Assignment</h4>

<p>Please follow these instructions carefully!</p>

<p>Make sure your repo has the following items:</p>

<ul>

<li>Put the answers to all the questions above
in <code>bigdata2016w/assignment6.md</code>.</li>

<li>Your implementations should go in
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment6</code>. At the minimum, you should have
<code>TrainSpamClassifier</code>, <code>ApplySpamClassifier</code>,
and <code>ApplyEnsembleSpamClassifier</code>. Feel free to include helper code also.</li>

</ul>

<p>Make sure your implementation runs on the Altiscale cluster. The
following check script is provided for you (check the source for the relevant flags):</p>

<ul>

<li><a href="assignments/check_assignment6_public.py"><code>check_assignment6_public.py</code></a></li>


</ul>

<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we
recommend that you verify everything above works by performing a clean
clone of your repo and running the public check script.</p>

<p>That's it!</p>

<h4 style="padding-top: 10px">Grading</h4>

<p>The entire assignment is worth 60 points:</p>

<ul>

  <li>Getting your code to compile is worth 4 points.</li>

  <li>A correct implementation of the
  basic <code>TrainSpamClassifier</code> is worth 15 points.</li>

  <li>A correct implementation of <code>ApplySpamClassifier</code> is
  worth 5 points.</li>

  <li>A correct implementation
  of <code>ApplyEnsembleSpamClassifier</code> is worth 6 points.</li>

  <li>A correct implementation of the <code>--shuffle</code> option
  in <code>TrainSpamClassifier</code> is worth 5 points.</li>

  <li>The answers to questions 1-5 are worth 3 points each.</li>

  <li>Being able to successfully run all your code on Altiscale is
  worth 10 points. We will begin by testing all your code on
  Altiscale. If everything works there, you will get full marks. If we
  can't get your code to run successfully on Altiscale, we will try
  running your code in the Linux Student CS environment. Even if
  everything works perfectly there, you will receive zero marks for
  this item.</li>

</ul>

<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>


<section id="assignment7" style="padding-top:35px">
<div>
<h3>Assignment 7: Inverted Indexing (Redux) <small>due 8:30am March 31</small></h3>

<p>In this assignment you'll revisit the inverted indexing and boolean
retrieval program in <a href="#assignment3">assignment 3</a>. In
assignment 3, your indexer program wrote postings to HDFS
in <code>MapFile</code>s and your boolean retrieval program read
postings from those <code>MapFile</code>s. In this assignment, you'll
write postings to and read postings from HBase instead. In other
words, the program logic should not change, except for the backend
storage that you are using. This assignment is to be completed using
MapReduce in Java.</p>

<h4 style="padding-top: 10px">HBase Word Count</h4>

<p>Because HBase requires additional daemon processes to be installed
and configured properly, this assignment must be completed in the
Altiscale environment. That is, do not use the Linux Student CS
environment for this assignment.</p>

<p>To start, take a look at <code>HBaseWordCount</code> in Bespin,
which is in the
package <code>io.bespin.java.mapreduce.wordcount</code>. Make sure you
pull the repo to grab the latest version of the code.
The <code>HBaseWordCount</code> program is like the basic word count
demo, except that it stores the output in an HBase table. That is, the
reducer output is directly written to an HBase table: the word serves
as the row key, "c" is the column family, "count" is the column
qualifier, and the value is the actual count.</p>

<p>The <code>HBaseWordCountFetch</code> program in the same package
illustrates how you can fetch these counts out of HBase and shows you
how to use the basic HBase "get" API.</p>

<p>Study these two programs to make sure you understand how they
work. The two sample programs should give you a good introduction to
the HBase APIs. A free
online <a href="https://hbase.apache.org/book.html#mapreduce">HBase
book</a> is a good source of additional details.</p>

<p>Make sure you can run both
programs. Running <code>HBaseWordCount</code>:</p>

<pre>
hadoop jar target/bespin-0.1.0-SNAPSHOT.jar io.bespin.java.mapreduce.wordcount.HBaseWordCount \
 -config /home/hbase-0.98.16-hadoop2/conf/hbase-site.xml \
 -input /shared/cs489/data/Shakespeare.txt -table lintool-wc-shakes -reducers 5
</pre>

<p>Use the <code>-config</code> option to specify the HBase config
file: point to a version on the Altiscale workspace that we've
prepared for you. This config file tells the program how to connect to
the HBase cluster. Use the <code>-table</code> option to name the
table you're inserting the word counts into. The other options should
be straightforward to understand.</p>

<p><b>Note:</b> Since HBase is a shared resource across the cluster,
please make your tables unique by using your username as part of the
table name, per above.</p>

<p>You should then be able to fetch the word counts from HBase:</p>

<pre>
hadoop jar target/bespin-0.1.0-SNAPSHOT.jar io.bespin.java.mapreduce.wordcount.HBaseWordCountFetch \
 -config /home/hbase-0.98.16-hadoop2/conf/hbase-site.xml \
 -table lintool-wc-shakes -term love
</pre>

<p>If everything works, you'll discover that the term "love" appears
2053 times in the Shakespeare collection.</p>

<p>Next, you should try the same word count demo on the larger sample
wiki collection on HDFS
at <code>/shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt</code>.</p>

<h4 style="padding-top: 10px">HBase Storage</h4>

<p>Now it's time to write some code! Following past procedures,
everything should go into the package namespace
<code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment7</code> (obviously, replace
<code>lintool</code> with your actual GitHub username). Note we're back
to coding in Java for this assignment.</p>

<p>Before you begin, you'll need to pull in the HBase-related
artifacts; otherwise, your code will not compile. Add the following
lines in the dependencies block of your <code>pom.xml</code>:</p>

<pre>
    &lt;dependency&gt;
      &lt;groupId&gt;org.apache.hbase&lt;/groupId&gt;
      &lt;artifactId&gt;hbase-client&lt;/artifactId&gt;
      &lt;version&gt;0.98.16-hadoop2&lt;/version&gt;
    &lt;/dependency&gt;
    &lt;dependency&gt;
      &lt;groupId&gt;org.apache.hbase&lt;/groupId&gt;
      &lt;artifactId&gt;hbase-server&lt;/artifactId&gt;
      &lt;version&gt;0.98.16-hadoop2&lt;/version&gt;
    &lt;/dependency&gt;
</pre>

<p>You will write two programs, <code>BuildInvertedIndexHBase</code>
and <code>BooleanRetrievalHBase</code>. These are the counterparts of
the programs you wrote in assignment 3 (and the original Bespin
demos); feel free to use your code there as a starting point. Note
that you don't need to worry about index compression for this
assignment!</p>

<p>The <code>BuildInvertedIndexHBase</code> program is the HBase
version of <code>BuildInvertedIndex</code> from the Bespin
demo. Instead of writing the index to HDFS, you will write the index
to an HBase table. Use the following table structure: the term will be
the row key. Your table will have a single column family called
"p". In the column family, each document id will be a column
qualifier. The value will be the term frequency.</p>

<p>The <code>BooleanRetrievalHBase</code> program is the HBase version
of <code>BooleanRetrieval</code> from the Bespin demo. This program
should read postings from HBase. Note that the only thing you need to
change is the method <code>fetchDocumentSet</code>: instead of reading
from the <code>MapFile</code>, you'll read from HBase.</p>

<p>We advise that you begin with the Shakespeare dataset. You should
be able to build the HBase index with the following command:</p>

<pre>
hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
  ca.uwaterloo.cs.bigdata2016w.lintool.assignment7.BuildInvertedIndexHBase \
  -config /home/hbase-0.98.16-hadoop2/conf/hbase-site.xml \
  -input /shared/cs489/data/Shakespeare.txt \
  -table cs489-2016w-lintool-a7-index-shakespeare -reducers 4
</pre>

<p>And run a query as follows:</p>

<pre>
hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
  ca.uwaterloo.cs.bigdata2016w.lintool.assignment7.BooleanRetrievalHBase \
  -config /home/hbase-0.98.16-hadoop2/conf/hbase-site.xml \
  -table cs489-2016w-lintool-a7-index-shakespeare \
  -collection /shared/cs489/data/Shakespeare.txt \
  -query "outrageous fortune AND"
</pre>

<p>After you've verified that everything works on the smaller
Shakespeare collection, move on to the sample Wikipedia
collection. Index as follows:</p>

<pre>
hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
  ca.uwaterloo.cs.bigdata2016w.lintool.assignment7.BuildInvertedIndexHBase \
  -config /home/hbase-0.98.16-hadoop2/conf/hbase-site.xml \
  -input /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt \
  -table cs489-2016w-lintool-a7-index-wiki -reducers 5
</pre>

<p>And run a query as follows:</p>

<pre>
hadoop jar target/bigdata2016w-0.1.0-SNAPSHOT.jar \
  ca.uwaterloo.cs.bigdata2016w.lintool.assignment7.BooleanRetrievalHBase \
  -config /home/hbase-0.98.16-hadoop2/conf/hbase-site.xml \
  -collection /shared/cs489/data/enwiki-20151201-pages-articles-0.1sample.txt \
  -table cs489-2016w-lintool-a7-index-wiki \
  -query "waterloo stanford OR cheriton AND"
</pre>
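<p>The sample queries place each operator after its operands (postfix
notation), so they can be evaluated with a stack. The following is a
minimal sketch of that evaluation, not the Bespin implementation: the
class name <code>PostfixQuery</code> and the toy postings are made up
for illustration, and document sets are modelled
with <code>java.util.BitSet</code>. In your program,
<code>fetchDocumentSet</code> would read the postings from HBase
instead.</p>

```java
import java.util.BitSet;

// Hypothetical sketch of evaluating a postfix boolean query with a stack.
class PostfixQuery {
  // Toy index with made-up postings: returns the ids of the documents
  // that contain the term. Unknown terms match no documents.
  static BitSet fetchDocumentSet(String term) {
    BitSet docs = new BitSet();
    switch (term) {
      case "waterloo": docs.set(1); docs.set(2); break;
      case "stanford": docs.set(3); break;
      case "cheriton": docs.set(2); docs.set(3); docs.set(4); break;
      default: break;
    }
    return docs;
  }

  // Assumes a well-formed query: every operator has two operands.
  static BitSet evaluate(String query) {
    String[] tokens = query.trim().split("\\s+");
    BitSet[] stack = new BitSet[tokens.length];
    int top = 0;
    for (String token : tokens) {
      if (token.equals("AND") || token.equals("OR")) {
        // Pop two operands, combine them, and push the result.
        BitSet right = stack[--top];
        BitSet left = stack[--top];
        if (token.equals("AND")) {
          left.and(right); // set intersection
        } else {
          left.or(right);  // set union
        }
        stack[top++] = left;
      } else {
        // A term: push its document set.
        stack[top++] = fetchDocumentSet(token);
      }
    }
    return stack[top - 1];
  }

  public static void main(String[] args) {
    // (waterloo OR stanford) AND cheriton, prints {2, 3} for the toy postings.
    System.out.println(evaluate("waterloo stanford OR cheriton AND"));
  }
}
```

<p>Because only <code>fetchDocumentSet</code> touches the storage
backend, the evaluation loop stays the same whether postings come from
a <code>MapFile</code> or from HBase.</p>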

<p>You should verify that all the sample queries (from assignment 3)
on both collections work.</p>

<h4 style="padding-top: 10px">Turning in the Assignment</h4>

<p>Please follow these instructions carefully!</p>

<p>Make sure your repo has the following items:</p>

<ul>

<li>If you have any notes you wish to convey to us, put them
in <code>bigdata2016w/assignment7.md</code>. Otherwise, please create
an empty file&mdash;following previous assignments, this is where the
grade will go.</li>

<li>Your implementations should go in
package <code>ca.uwaterloo.cs.bigdata2016w.lintool.assignment7</code>. At the minimum, you should have
<code>BuildInvertedIndexHBase</code>
and <code>BooleanRetrievalHBase</code>. Feel free to include helper
code also.</li>

</ul>

<p>The following check script is provided for you:</p>

<ul>

<li><a href="assignments/check_assignment7_public.py"><code>check_assignment7_public.py</code></a></li>


</ul>

<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we
recommend that you verify everything above works by performing a clean
clone of your repo and running the public check script.</p>

<p>That's it!</p>

<h4 style="padding-top: 10px">Grading</h4>

<p>The entire assignment is worth 30 points:</p>

<ul>

  <li>The implementation of <code>BuildInvertedIndexHBase</code> is
  worth 10 points; the implementation
  of <code>BooleanRetrievalHBase</code> is worth 5 points.</li>

  <li>Getting your code to run on sample queries (the same as the ones
  in assignment 3) is worth 10 points. That is, to earn all 10 points,
  we should be able to run your code on both the Shakespeare and
  sample Wikipedia collections, following exactly the procedure
  above. Therefore, if all the answers are correct and the
  implementation seems correct, but we cannot get your code to build
  and run, you will not earn these points.</li>

  <li>Another 5 points is allotted to us verifying the behavior and
  output of your program in ways that we will not tell you. We're
  giving you the "public" versions of the check scripts; we'll run a
  "private" version to examine your output further (i.e., think blind
  test cases).</li>

</ul>


<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>

<section id="project" style="padding-top:35px">
<div>
<h3>Final Project</h3>

<p>The final project is a requirement only for graduate students taking CS 698.</p>

<p>The final project can be on any topic you wish in the
space of big data. Anything reasonably related to topics that we
covered in the course is within scope. For reference, there are three
types of projects you might consider:</p>

<ul>

  <li>Implement a big data algorithm in MapReduce or Spark: choose a
  particular big data algorithm (for processing text, graphs,
  relational data, etc.) and implement it. Ideally, the implementation
  does not already exist in a library or open-source package. Since we
  want you to implement the algorithm from scratch, it might perhaps
  be too tempting to simply copy existing
  code&mdash;see <a href="organization.html">notes on academic
  integrity</a>.</li>

  <li>Learn and explore a (new) big data processing framework:
  although we discussed a variety of processing frameworks in class,
  the assignments focused on MapReduce and Spark exclusively. Here's
  your chance to learn a new processing framework, e.g., Spark
  Streaming, GraphX, Giraph, Flink, etc. The project would involve
  learning to use the processing framework and doing something
  interesting with it. The "something interesting" might be a data
  mining algorithm, although note that the expectations would be lower
  than building something in MapReduce or Spark, since learning the
  new framework would form an essential component of the project.</li>

  <li>Perform some interesting data science. Is there a particular
  dataset you'd like to explore or analyze? Your project could involve
  performing interesting analytics on a dataset&mdash;here, the focus
  would be the analytical product and the insights gleaned, as opposed
  to the raw algorithms themselves. However, a superficial analysis
  with existing machine-learning libraries is not enough.</li>

</ul>

<p>You may work in groups of up to three, or you can also work by
yourself if you wish. The amount of effort devoted to the project
should be proportional to the number of people in the team. We would
expect a level of effort comparable to two assignments per person.</p>

<p>When you are ready, send an email to the instructors
at <code>uwaterloo-bigdata-2016w-staff@googlegroups.com</code>
describing what you'd like to work on. We will provide you
feedback on appropriateness, scope, etc.</p>

<p>In terms of resources, you are welcome to use the Altiscale
cluster. Note that we expect your project to be more than a "toy". To
calibrate what we mean by "toy", consider the assignments throughout
the course: they have a "run on local" part and "run on Altiscale"
part. The first part is "toy"; the Altiscale part would not be. If
you're planning to work with a framework that doesn't run on
Altiscale, you're responsible for finding your own hardware
resources.</p>

<p>The deliverable for the final project is a report. Use
the <a href="http://www.acm.org/publications/proceedings-template">ACM
Templates</a>. The contents of the report will of course vary by the
topic, but we would expect the following sections:</p>

<ul>

  <li>describe the problem you're tackling and what you're trying to
  accomplish (introduction, problem statement)</li>

  <li>present existing solutions (background, related work)</li>

  <li>detail how you went about solving the problem (methods,
  algorithms, implementation details, etc.)</li>

  <li>discuss how well things work (some sort of evaluation and results).</li>

</ul>

<p>Once again, length would vary, but 6 pages (in the ACM Template)
seems about right.</p>

<p style="padding-top: 20px"><a href="#">Back to top</a></p>
</div>
</section>


<p style="padding-top:100px" />

    </div><!-- /.container -->


    <!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
    <!-- Include all compiled plugins (below), or include individual files as needed -->
    <script src="js/bootstrap.min.js"></script>

    <!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
    <script src="js/ie10-viewport-bug-workaround.js"></script>
  </body>

</html>