Building and running Spark 1.0 on Ubuntu

This article describes, step by step, how to build and run Apache Spark 1.0.0-SNAPSHOT. I personally use a virtual machine for testing out different big data software (Hadoop, Spark, Hive, etc.), and I’ve used Linux Mint 16 on VirtualBox 4.3.10 for the purpose of this blog post.

Install JDK 7

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

Verify the Java installation:

$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

Create a symlink to the JDK directory for easier configuration later:

$ cd /usr/lib/jvm/
$ sudo ln -s java-7-oracle jdk
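
The symlink gives you a stable path that does not change when the JDK directory name does. As a minimal sketch of how it might be used later, the snippet below prints the export lines for JAVA_HOME. The JDK_LINK variable and the java_home_exports function are hypothetical names of my own, and setting JAVA_HOME at all is an assumption, not one of the original steps.

```shell
# Assumption: /usr/lib/jvm/jdk is the symlink created in the step above.
JDK_LINK="${JDK_LINK:-/usr/lib/jvm/jdk}"

# Print the export lines for a given JDK directory (hypothetical helper).
java_home_exports() {
  printf 'export JAVA_HOME=%s\n' "$1"
  # $PATH stays literal here so it is expanded when the line is later sourced.
  printf 'export PATH=%s/bin:$PATH\n' "$1"
}

# Only print the lines if a java binary really exists behind the symlink.
if [ -x "$JDK_LINK/bin/java" ]; then
  java_home_exports "$JDK_LINK"
else
  echo "No JDK found at $JDK_LINK" >&2
fi
```

If the printed lines look right, they could be appended to ~/.bashrc so that new shells pick them up.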

Download Spark

Note: parambirs is my user name as well as my group name on the Ubuntu machine. Please replace it with your own user/group name.

$ cd ~/Downloads
$ git clone
$ sudo mv spark /usr/local
$ cd /usr/local
$ sudo chown -R parambirs:parambirs spark

Build Spark

$ cd /usr/local/spark
$ sbt/sbt clean assembly
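
The assembly step is memory hungry (sbt can easily use more than 1 GB of RAM), so on a small VM it may be worth checking how much RAM the machine has before starting the build. The following is a minimal sketch, assuming a Linux system that exposes /proc/meminfo. The total_mem_mib function, the MIN_MIB variable and the 2048 MiB threshold are illustrative choices of mine, not documented Spark requirements.

```shell
# Print MemTotal in MiB from a meminfo-style file (default: /proc/meminfo).
total_mem_mib() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Assumption: 2048 MiB is an illustrative threshold, not an official figure.
MIN_MIB="${MIN_MIB:-2048}"

if [ -r /proc/meminfo ]; then
  total=$(total_mem_mib)
  if [ "$total" -lt "$MIN_MIB" ]; then
    echo "Only ${total} MiB of RAM; the sbt build may run out of memory" >&2
  else
    echo "${total} MiB of RAM on this machine"
  fi
fi
```

If the warning appears inside a VM, allocating more memory to the instance before building is the simplest remedy.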

Run an Example

$ cd /usr/local/spark
$ ./bin/run-example org.apache.spark.examples.SparkPi 
Pi is roughly 3.1399

Run Spark Shell

$ ./bin/spark-shell

Try out some commands in the Spark shell:

scala> val textFile=sc.textFile("")
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> textFile.count
res0: Long = 126
scala> textFile.filter(_.contains("the")).count
res1: Long = 28
scala> exit
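
To sanity-check the numbers the Spark shell reports, the same two counts can be reproduced with grep outside of Spark. This is only a rough cross-check, assuming the file is plain text on the local disk. The count_lines and count_matching functions are hypothetical helper names, and the file path is whatever you passed to sc.textFile.

```shell
# Count every line in a file, including a last line without a trailing
# newline (comparable to textFile.count).
count_lines() {
  grep -c '' "$1"
}

# Count lines containing a fixed substring
# (comparable to textFile.filter(_.contains(...)).count).
count_matching() {
  grep -c -F -- "$2" "$1"
}

# Hypothetical usage: pass the same file you loaded in the Spark shell.
if [ -n "${1:-}" ] && [ -f "${1:-}" ]; then
  echo "lines: $(count_lines "$1")"
  echo "lines containing 'the': $(count_matching "$1" the)"
fi
```

Like the Scala filter, the -F match is a case-sensitive substring test, so a line containing "other" counts as well.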

10 thoughts on “Building and running Spark 1.0 on Ubuntu”

  2. Hi, I hit an error when executing this command: $ sbt/sbt clean assembly. According to the logs:
    # There is insufficient memory for the Java Runtime Environment to continue.
    # Native memory allocation (malloc) failed to allocate 1431699456 bytes for committing reserved memory.

    Could you advise on how to overcome this?

    • Hi,

      It seems your system doesn’t have enough memory. How much RAM do you have? On my Mac, sbt easily uses more than 1 GB of RAM while executing, so you might need to try this out on another machine. If you’re using a VM, try allocating more memory to the instance.


  3. Hi, I hit the error below when executing sbt/sbt clean assembly. Could you assist? Thanks.

    [warn] ::::::::::::::::::::::::::::::::::::::::::::::
    [warn] ::::::::::::::::::::::::::::::::::::::::::::::
    [warn] :: org.scala-lang#scala-library;2.10.2: not found
    [warn] ::::::::::::::::::::::::::::::::::::::::::::::
    sbt.ResolveException: unresolved dependency: org.scala-lang#scala-library;2.10.2: not found

  6. Hi, thanks for sharing, but it doesn’t build:
    [error] (streaming-kafka-assembly/*:assembly) duplicate entry: META-INF/MANIFEST.MF
    [error] (streaming-flume-sink/avro:generate) org.apache.avro.SchemaParseException: Undefined name: “strıng”
    [error] (assembly/*:assembly) duplicate entry: META-INF/MANIFEST.MF
    How can I solve this problem?

  7. org.apache.avro.SchemaParseException: Undefined name: “strıng”
    at org.apache.avro.Schema.parse(
    at org.apache.avro.Schema.parse(
    at org.apache.avro.Schema.parse(
    at org.apache.avro.Protocol.parseTypes(
    at org.apache.avro.Protocol.parse(
    at org.apache.avro.Protocol.parse(
    at org.apache.avro.Protocol.parse(
    at sbtavro.SbtAvro$$anonfun$sbtavro$SbtAvro$$compile$2.apply(SbtAvro.scala:81)
    at sbtavro.SbtAvro$$anonfun$sbtavro$SbtAvro$$compile$2.apply(SbtAvro.scala:78)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at sbtavro.SbtAvro$.sbtavro$SbtAvro$$compile(SbtAvro.scala:78)
    at sbtavro.SbtAvro$$anonfun$sourceGeneratorTask$1$$anonfun$1.apply(SbtAvro.scala:112)
    at sbtavro.SbtAvro$$anonfun$sourceGeneratorTask$1$$anonfun$1.apply(SbtAvro.scala:111)
    at sbt.FileFunction$$anonfun$cached$1.apply(Tracked.scala:186)
    at sbt.FileFunction$$anonfun$cached$1.apply(Tracked.scala:186)
    at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3$$anonfun$apply$4.apply(Tracked.scala:200)
    at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3$$anonfun$apply$4.apply(Tracked.scala:196)
    at sbt.Difference.apply(Tracked.scala:175)
    at sbt.Difference.apply(Tracked.scala:157)
    at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3.apply(Tracked.scala:196)
    at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3.apply(Tracked.scala:195)
    at sbt.Difference.apply(Tracked.scala:175)
    at sbt.Difference.apply(Tracked.scala:151)
    at sbt.FileFunction$$anonfun$cached$2.apply(Tracked.scala:195)
    at sbt.FileFunction$$anonfun$cached$2.apply(Tracked.scala:193)
    at sbtavro.SbtAvro$$anonfun$sourceGeneratorTask$1.apply(SbtAvro.scala:114)
    at sbtavro.SbtAvro$$anonfun$sourceGeneratorTask$1.apply(SbtAvro.scala:108)
    at scala.Function5$$anonfun$tupled$1.apply(Function5.scala:35)
    at scala.Function5$$anonfun$tupled$1.apply(Function5.scala:34)
    at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
    at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
    at sbt.std.Transform$$anon$
    at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
    at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
    at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
    at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
    at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
    at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
    at sbt.CompletionService$$anon$
    at java.util.concurrent.Executors$
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$
    [error] (streaming-flume-sink/avro:generate) org.apache.avro.SchemaParseException: Undefined name: “strıng”
    [error] Total time: 1555 s, completed 09.Tem.2015 09:32:57

    Any suggestions, please?

