Writing and Running a Standalone App with Spark 1.0 and YARN 2.4.0

This article is a step-by-step guide to writing, building, and running a standalone Scala app on Spark 1.0. It assumes that you have a working installation of Spark 1.0. If not, you can follow the steps detailed in the post below:

Running Spark 1.0 on Hadoop/YARN 2.4.0

Installing SBT

We’ll use Scala SBT (Simple Build Tool) for building our standalone app. To install sbt, follow these steps.

Note: Replace parambirs:parambirs with your username:groupname combination

$ cd ~/Downloads
$ wget http://dl.bintray.com/sbt/native-packages/sbt/0.13.2/sbt-0.13.2.tgz
$ tar zxvf sbt-0.13.2.tgz 
$ sudo mv sbt /usr/local
$ cd /usr/local
$ sudo chown -R parambirs:parambirs sbt
$ vim ~/.bashrc

Add the following at the end of the file

export PATH=$PATH:/usr/local/sbt/bin

Reload the bash environment so that sbt is picked up on the PATH

$ source ~/.bashrc
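To verify the installation, you can ask sbt to print its own version (on the first run, sbt downloads its dependencies, so this can take a few minutes):

$ sbt sbt-version
...
[info] 0.13.2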

Download some sample data for our app and add it to HDFS

$ cd ~/Downloads
$ wget http://www.gutenberg.org/ebooks/1342.txt.utf-8
$ hadoop dfs -put 1342.txt.utf-8 /user/parambirs/pride.txt
$ hadoop dfs -ls /user/parambirs
-rw-r--r-- 1 parambirs supergroup 717569 2014-05-20 13:56 /user/parambirs/pride.txt

Write a simple app

The source code for this sample app is available on GitHub.

$ cd ~
$ mkdir -p SimpleApp/src/main/scala
$ cd SimpleApp
$ vim src/main/scala/SimpleApp.scala

Add the following code to the SimpleApp.scala file. The code loads the text of Jane Austen’s “Pride and Prejudice” and counts the number of lines that contain the letter “a” and the number that contain the letter “b”. It then prints these counts to the console.

/*** SimpleApp.scala ***/
import org.apache.spark._

object SimpleApp {
  def main(args: Array[String]) {
    // Input file in HDFS (matches the fs.default.name configured for Hadoop)
    val logFile = "hdfs://localhost:9000/user/parambirs/pride.txt"
    val conf = new SparkConf().setAppName("Simple App")
    val sc = new SparkContext(conf)
    // Load the file as an RDD of lines and cache it, since we run two actions over it
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    sc.stop()
  }
}

Create sbt build file

$ vim simple.sbt

Add the following content to the build file

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0"

The final structure of the project folder should look like this

$ find .
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

Building the App

Before we build our app, we need to publish the Spark 1.0 artifacts to the local Ivy repository, because our app depends on them.

$ cd /usr/local/spark
$ sbt/sbt publish-local
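sbt publishes these artifacts to the local Ivy repository in your home directory. Assuming the default ~/.ivy2 location, you can check that the Spark modules are now available:

$ ls ~/.ivy2/local/org.apache.spark/

Among the listed modules you should see spark-core_2.10, which is the dependency declared in simple.sbt.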

Now we can build our app

$ cd ~/SimpleApp
$ sbt package
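If the build succeeds, sbt writes the application jar under target/scala-2.10; the artifact name is derived from the name and version fields in simple.sbt:

$ ls target/scala-2.10/*.jar
target/scala-2.10/simple-project_2.10-1.0.jar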

Running the app on YARN

We’ll run our app in yarn-cluster mode using spark-submit. This ships both our application jar and the Spark assembly to the YARN cluster, and the code is executed remotely.

$ cd /usr/local/spark
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --master yarn --deploy-mode cluster --class SimpleApp --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 ~/SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar

Checking the output

The output of the program is stored in the YARN logs. On my machine, the application ID for this run was application_1400566266818_0007, so the output could be read using the following command

$ cat /usr/local/hadoop/logs/userlogs/application_1400566266818_0007/container_1400566266818_0007_01_000001/stdout
Lines with a: 10559, Lines with b: 5874
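Alternatively, if log aggregation is enabled in YARN (yarn.log-aggregation-enable set to true in yarn-site.xml — it is not enabled in the setup described in these posts), the same output can be fetched with the yarn CLI:

$ yarn logs -applicationId application_1400566266818_0007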

Running Spark 1.0 on Hadoop/YARN 2.4.0

Prerequisites

This article assumes that you’ve set up Hadoop 2.4.0 and also cloned the Spark git repository. If not, you can follow the steps detailed in the posts below:

Install Hadoop/YARN 2.4.0 on Ubuntu (VirtualBox)

Building and running Spark 1.0 on Ubuntu

Building Spark for Hadoop 2.4.0 and YARN

To enable YARN support and build against the correct hadoop libraries (for v2.4.0), set the SPARK_YARN and SPARK_HADOOP_VERSION variables while building the spark assembly

$ cd /usr/local/spark
$ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true sbt/sbt clean assembly

Specifying the “clean” task is important as sbt sometimes mixes up jars from the default hadoop 1.0.4 version and the specified hadoop 2.4.0 version. Doing an explicit clean prevents this problem.

After this step, an assembly JAR for Spark with Hadoop 2.4.0 and YARN support will be created at the following location

./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar
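You can confirm that the assembly was built against Hadoop 2.4.0 by listing the jar:

$ ls assembly/target/scala-2.10/*.jar
assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar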

Running an example Spark program on YARN

$ cd /usr/local/spark
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.4.0.jar

This runs the application in yarn-cluster mode. spark-submit starts a YARN client program, and the SparkPi code then runs as a child thread of the ApplicationMaster. Since the application runs on a remote machine, interactive applications such as spark-shell can’t work this way. Refer to the next section for details on running such applications.

You can view cluster details on the YARN web interface (http://localhost:8088)

[Screenshot: YARN web interface showing the running application]

The output of YARN apps isn’t printed on the console. To see the output of this app, we’ll have to go through the app’s logs. YARN logs generally go into the $HADOOP_HOME/logs/userlogs/application_<appID> folder. You can get the appID from the YARN web interface or from the YARN client output printed on the console.

$ cat /usr/local/hadoop/logs/userlogs/application_1400566266818_0001/container_1400566266818_0001_01_000001/stdout
...
Pi is roughly 3.1429

You can also see the logs from the web interface by navigating to YARN_App > history > logs

Running Spark Shell on YARN

In yarn-client mode, the driver runs locally (as in local mode), while the executors run on the YARN cluster. To run spark-shell in yarn-client mode, use the following command

$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client ./bin/spark-shell

You can see on the YARN web interface that this launches an application on the cluster

[Screenshot: YARN web interface showing the spark-shell application]

Let’s try out some commands in the spark-shell

scala> val textFile=sc.textFile("file:///usr/local/spark/README.md")
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> textFile.count
res0: Long = 126
scala> textFile.filter(_.contains("the")).count
res1: Long = 28
scala> exit

When running in yarn-client mode, it’s important to prefix local file paths with “file://”. This is because in this mode, Spark assumes that unqualified paths refer to HDFS (relative to the /user/<username> directory). For example, try the following commands in the spark-shell

scala> val textFile=sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> textFile.count
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/parambirs/README.md
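If you want the unqualified path to work, you can upload the file to your HDFS home directory first (a sketch — adjust the path to your own username) and then re-run the same command in the spark-shell:

$ hadoop dfs -put /usr/local/spark/README.md /user/parambirs/README.md

After this, sc.textFile("README.md").count resolves against hdfs://localhost:9000/user/parambirs/README.md and succeeds.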

Building and running Spark 1.0 on Ubuntu

This article describes a step-by-step approach to building and running Apache Spark 1.0.0-SNAPSHOT. I personally use a virtual machine for testing out different big data software (Hadoop, Spark, Hive, etc.), and I’ve used LinuxMint 16 on VirtualBox 4.3.10 for this blog post.

Install JDK 7

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

Verify the Java installation:

$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

Create a symlink for easier configuration later

$ cd /usr/lib/jvm/
$ sudo ln -s java-7-oracle jdk

Download Spark

Note: parambirs is my user name as well as group name on the ubuntu machine. Please replace this with your own user/group name

$ cd ~/Downloads
$ git clone https://github.com/apache/spark.git
$ sudo mv spark /usr/local
$ cd /usr/local
$ sudo chown -R parambirs:parambirs spark

Build

$ cd /usr/local/spark
$ sbt/sbt clean assembly

Run an Example

$ cd /usr/local/spark
$ ./bin/run-example org.apache.spark.examples.SparkPi 
...
Pi is roughly 3.1399
...

Run Spark Shell

$ ./bin/spark-shell

Try out some commands in the spark shell

scala> val textFile=sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> textFile.count
res0: Long = 126
scala> textFile.filter(_.contains("the")).count
res1: Long = 28
scala> exit

Install Hadoop/YARN 2.4.0 on Ubuntu (VirtualBox)

This article describes a step-by-step approach to installing Hadoop/YARN 2.4.0 on Ubuntu and its derivatives (LinuxMint, Kubuntu, etc.). I personally use a virtual machine for testing out different big data software (Hadoop, Spark, Hive, etc.), and I’ve used LinuxMint 16 on VirtualBox 4.3.10 for this blog post.

Install JDK 7

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

Verify the Java installation:

$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

Create a symlink for easier configuration later

$ cd /usr/lib/jvm/
$ sudo ln -s java-7-oracle jdk

Install OpenSSH Server

$ sudo apt-get install openssh-server
$ ssh-keygen -t rsa

Hit Enter at all prompts, i.e. accept all the defaults, including an empty passphrase. Next, to prevent password prompts, add the public key of this machine to the authorized_keys file (Hadoop services use ssh to talk among themselves, even on a single-node cluster).

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

SSH to localhost to test the SSH server and to add localhost to the list of known hosts. The next time you ssh to localhost, there will be no prompts.

$ ssh localhost

Download Hadoop

Note 1: You should use a mirror URL from the official downloads page
Note 2: parambirs is my user name as well as group name on the ubuntu machine. Please replace this with your own user/group name

$ cd Downloads/
$ wget http://apache.claz.org/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz
$ tar zxvf hadoop-2.4.0.tar.gz
$ sudo mv hadoop-2.4.0 /usr/local/
$ cd /usr/local
$ sudo ln -s hadoop-2.4.0 hadoop
$ sudo chown -R parambirs:parambirs hadoop-2.4.0
$ sudo chown -R parambirs:parambirs hadoop

Environment Configuration

$ cd ~
$ vim .bashrc

Add the following to the end of .bashrc file

#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL

Modify hadoop-env.sh

$ cd /usr/local/hadoop/etc/hadoop
$ vim hadoop-env.sh
 
#modify JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/jdk/

Verify hadoop installation

Reload the shell environment to reflect the configuration changes we’ve made, then check the hadoop version

$ source ~/.bashrc
$ hadoop version
Hadoop 2.4.0
Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1583262
Compiled by jenkins on 2014-03-31T08:29Z
Compiled with protoc 2.5.0
From source with checksum 375b2832a6641759c6eaf6e3e998147
This command was run using /usr/local/hadoop-2.4.0/share/hadoop/common/hadoop-common-2.4.0.jar

Hadoop Configuration

$ cd ~
$ mkdir -p mydata/hdfs/namenode
$ mkdir -p mydata/hdfs/datanode

core-site.xml

$ cd /usr/local/hadoop/etc/hadoop/
$ vim core-site.xml

Add the following between the <configuration></configuration> elements

<property>
 <name>fs.default.name</name>
 <value>hdfs://localhost:9000</value>
</property>

yarn-site.xml

$ vim yarn-site.xml

Add the following between the <configuration></configuration> elements

<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>
<property>
 <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
 <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

mapred-site.xml

$ cp mapred-site.xml.template mapred-site.xml
$ vim mapred-site.xml

Add the following between the <configuration></configuration> elements

<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
</property>

hdfs-site.xml

$ vim hdfs-site.xml

Add the following between the <configuration></configuration> elements. Replace /home/parambirs with your own home directory.

<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>
<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:/home/parambirs/mydata/hdfs/namenode</value>
</property>
<property>
 <name>dfs.datanode.data.dir</name>
 <value>file:/home/parambirs/mydata/hdfs/datanode</value>
</property>

Running Hadoop

Format the namenode

$ hdfs namenode -format

Start hadoop

$ start-dfs.sh
$ start-yarn.sh

Verify all services are running

$ jps
5037 SecondaryNameNode
4690 NameNode
5166 ResourceManager
4777 DataNode
5261 NodeManager
5293 Jps
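If any of these daemons is missing, check its log file under the Hadoop logs directory. The file names include the daemon name, your username and the hostname; the exact name below is an assumption for this setup:

$ ls /usr/local/hadoop/logs/
$ tail -n 50 /usr/local/hadoop/logs/hadoop-parambirs-namenode-*.log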

Check the web interfaces of the different services, e.g. the ResourceManager at http://localhost:8088 and the NameNode at http://localhost:50070

Run a hadoop example MR job

$ cd /usr/local/hadoop
$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar pi 2 5
