Scala for the Impatient: Chapter 7 exercises solutions

I’ve started going through the book: Scala for the Impatient by Cay Horstmann. I’m posting the solution to chapter 7 exercises below.


2014 in review

The stats helper monkeys prepared a 2014 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 9,700 times in 2014. If it were a concert at Sydney Opera House, it would take about 4 sold-out performances for that many people to see it.

Click here to see the complete report.

Writing and Running a Standalone App with Spark 1.0 and YARN 2.4.0

This article is a step-by-step guide to write, build and run a standalone app on Spark 1.0 written in Scala. It assumes that you have a working installation of Spark 1.0. If not, you can follow the steps detailed in the below post:

Running Spark 1.0 on Hadoop/YARN 2.4.0

Installing SBT

We’ll use Scala SBT (Simple Build Tool) for building our standalone app. To install sbt, follow these steps.

Note: Replace parambirs:parambirs with your username:groupname combination

$ cd ~/Downloads
$ wget
$ tar zxvf sbt-0.13.2.tgz 
$ sudo mv sbt /usr/local
$ cd /usr/local
$ sudo chown -R parambirs:parambirs sbt
$ vim ~/.bashrc

Add the following at the end of the file

export PATH=$PATH:/usr/local/sbt/bin

Refresh bash environment to reflect sbt addition to path

$ source ~/.bashrc

Download some sample data for our app and add it to hdfs

$ cd ~/Downloads
$ wget
$ hadoop dfs -put 1342.txt.utf-8 /user/parambirs/pride.txt
$ hadoop dfs -ls /user/parambirs
-rw-r--r-- 1 parambirs supergroup 717569 2014-05-20 13:56 /user/parambirs/pride.txt

Write a simple app

The source code for this sample app is available on github

$ cd ~
$ mkdir -p SimpleApp/src/main/scala
$ cd SimpleApp
$ vim src/main/scala/SimpleApp.scala

Add the following code to SimpleApp.scala file. The code loads the text of “Pride and Prejudice” book by Jane Austen and calculates the number of lines that contain the letter “a” and “b”. It then prints this data to the console.

/*** SimpleApp.scala ***/
import org.apache.spark._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "hdfs://localhost:9000/user/parambirs/pride.txt"
    val conf = new SparkConf().setAppName("Simple App")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

Create sbt build file

$ vim simple.sbt

Add the following content to the build file

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"

resolvers += "Akka Repository" at ""

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0"

The final structure for the source folder should be like this

$ find .

Building the App

Before we build our app, we need to publish spark-1.0 artifacts to local maven repository because our app has a dependency on it.

$ cd /usr/local/spark
$ sbt/sbt publish-local

Now we can build our app

$ cd ~/SimpleApp
$ sbt package

Running the app on YARN

We’ll use the yarn cluster mode (using spark-submit) to run our app. This sends our app as well as the spark assembly to the yarn cluster and the code is executed remotely.

$ cd /usr/local/spark
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --master yarn --deploy-mode cluster --class SimpleApp --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 ~/SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar

Checking the output

The output of the program is stored in yarn logs. On my machine, the application Id for this app was application_1400566266818_0007 and therefore, the output could be read using the following command

$ cat /usr/local/hadoop/logs/userlogs/application_1400566266818_0007/container_1400566266818_0007_01_000001/stdout
Lines with a: 10559, Lines with b: 5874