Building and running Spark 1.0 on Ubuntu

This article describes a step-by-step approach to building and running Apache Spark 1.0.0-SNAPSHOT. I personally use a virtual machine for trying out different big data software (Hadoop, Spark, Hive, etc.), and for this blog post I’ve used Linux Mint 16 (an Ubuntu-based distribution) on VirtualBox 4.3.10.

Install JDK 7

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

Verify the Java installation:

$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

Create a symlink for easier configuration later:

$ cd /usr/lib/jvm/
$ sudo ln -s java-7-oracle jdk
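
With the jdk symlink in place, JAVA_HOME can point at a stable path. As a small sketch (assuming you use bash and want this set in every new shell), you could add the following to ~/.bashrc:

$ echo 'export JAVA_HOME=/usr/lib/jvm/jdk' >> ~/.bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc
$ echo $JAVA_HOME
/usr/lib/jvm/jdk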

Download Spark

Note: parambirs is my user name as well as my group name on the Ubuntu machine. Please replace it with your own user/group name.

$ cd ~/Downloads
$ git clone https://github.com/apache/spark.git
$ sudo mv spark /usr/local
$ cd /usr/local
$ sudo chown -R parambirs:parambirs spark
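
Cloning master like this gives you whatever 1.0.0-SNAPSHOT happens to be at that moment. If you’d rather build from the 1.0 release branch, you can check it out before building; branch-1.0 exists in the apache/spark repository, but verify the exact branch or tag you want first:

$ cd /usr/local/spark
$ git branch -a          # list available branches
$ git checkout branch-1.0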

Build

$ cd /usr/local/spark
$ sbt/sbt clean assembly
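
This builds against Spark’s default Hadoop version. Around the time of 1.0.0-SNAPSHOT the sbt build also picked up the SPARK_HADOOP_VERSION and SPARK_YARN environment variables, so if you plan to run on a specific Hadoop/YARN cluster you could build along these lines (treat the version number as a placeholder for your own):

$ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true sbt/sbt clean assembly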

Run an Example

$ cd /usr/local/spark
$ ./bin/run-example org.apache.spark.examples.SparkPi 
...
Pi is roughly 3.1399
...
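
In the 1.0 version of the example, SparkPi also takes an optional argument for the number of slices (parallel tasks) to use; more slices generally gives a slightly better estimate. For example (the value below is just an illustration):

$ ./bin/run-example org.apache.spark.examples.SparkPi 10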

Run Spark Shell

$ ./bin/spark-shell

Try out some commands in the Spark shell:

scala> val textFile=sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala> textFile.count
res0: Long = 126
scala> textFile.filter(_.contains("the")).count
res1: Long = 28
scala> exit
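
If you want to try something a bit more interesting than counting lines, a classic word count also fits in a couple of shell commands. This is just a sketch using the standard RDD API (flatMap, map, reduceByKey, sortByKey); the exact output depends on your copy of README.md:

scala> val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.map(_.swap).sortByKey(false).take(5).foreach(println)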

10 thoughts on “Building and running Spark 1.0 on Ubuntu”

  1. Pingback: Running Spark-1.0.0-SNAPSHOT on Hadoop/YARN 2.4.0 | Param Gyaan

  2. Hi, I hit an error when executing this command: $ sbt/sbt clean assembly. From the logs:
    # There is insufficient memory for the Java Runtime Environment to continue.
    # Native memory allocation (malloc) failed to allocate 1431699456 bytes for committing reserved memory.

    Could you advise how to overcome this?
    Thanks

    • Hi,

      It seems your system doesn’t have enough memory. How much RAM do you have? On my Mac, sbt easily uses more than 1 GB of RAM while building, so you might need to try this on another machine. If you’re using a VM, try allocating more memory to the instance.

      Thanks
      Param
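
      For example, with VirtualBox you can check how much memory the guest actually has and then grow it from the host (the VM name "mint16" below is just a placeholder for your own VM, and the VM has to be powered off first):

      $ free -m                                      # inside the guest: available RAM
      $ VBoxManage modifyvm "mint16" --memory 4096   # on the host: give the VM 4 GB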

  3. Hi, I hit the error below when executing sbt/sbt clean assembly. Could you assist? Thanks

    [warn] ::::::::::::::::::::::::::::::::::::::::::::::
    [warn] :: UNRESOLVED DEPENDENCIES ::
    [warn] ::::::::::::::::::::::::::::::::::::::::::::::
    [warn] :: org.scala-lang#scala-library;2.10.2: not found
    [warn] ::::::::::::::::::::::::::::::::::::::::::::::
    sbt.ResolveException: unresolved dependency: org.scala-lang#scala-library;2.10.2: not found

  4. Pingback: Spark : Futur of the past | BigData, Synthesis and Algorithmic

  5. Pingback: Spark : Futur of the past | BigData Synthesis and Algorithmic

  6. Hi, thanks for sharing, but it doesn’t build:
    [error] (streaming-kafka-assembly/*:assembly) java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF
    [error] (streaming-flume-sink/avro:generate) org.apache.avro.SchemaParseException: Undefined name: “strıng”
    [error] (assembly/*:assembly) java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF
    How can I solve this problem?

  7. org.apache.avro.SchemaParseException: Undefined name: “strıng”
    at org.apache.avro.Schema.parse(Schema.java:1075)
    at org.apache.avro.Schema.parse(Schema.java:1158)
    at org.apache.avro.Schema.parse(Schema.java:1116)
    at org.apache.avro.Protocol.parseTypes(Protocol.java:438)
    at org.apache.avro.Protocol.parse(Protocol.java:400)
    at org.apache.avro.Protocol.parse(Protocol.java:390)
    at org.apache.avro.Protocol.parse(Protocol.java:380)
    at sbtavro.SbtAvro$$anonfun$sbtavro$SbtAvro$$compile$2.apply(SbtAvro.scala:81)
    at sbtavro.SbtAvro$$anonfun$sbtavro$SbtAvro$$compile$2.apply(SbtAvro.scala:78)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at sbtavro.SbtAvro$.sbtavro$SbtAvro$$compile(SbtAvro.scala:78)
    at sbtavro.SbtAvro$$anonfun$sourceGeneratorTask$1$$anonfun$1.apply(SbtAvro.scala:112)
    at sbtavro.SbtAvro$$anonfun$sourceGeneratorTask$1$$anonfun$1.apply(SbtAvro.scala:111)
    at sbt.FileFunction$$anonfun$cached$1.apply(Tracked.scala:186)
    at sbt.FileFunction$$anonfun$cached$1.apply(Tracked.scala:186)
    at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3$$anonfun$apply$4.apply(Tracked.scala:200)
    at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3$$anonfun$apply$4.apply(Tracked.scala:196)
    at sbt.Difference.apply(Tracked.scala:175)
    at sbt.Difference.apply(Tracked.scala:157)
    at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3.apply(Tracked.scala:196)
    at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3.apply(Tracked.scala:195)
    at sbt.Difference.apply(Tracked.scala:175)
    at sbt.Difference.apply(Tracked.scala:151)
    at sbt.FileFunction$$anonfun$cached$2.apply(Tracked.scala:195)
    at sbt.FileFunction$$anonfun$cached$2.apply(Tracked.scala:193)
    at sbtavro.SbtAvro$$anonfun$sourceGeneratorTask$1.apply(SbtAvro.scala:114)
    at sbtavro.SbtAvro$$anonfun$sourceGeneratorTask$1.apply(SbtAvro.scala:108)
    at scala.Function5$$anonfun$tupled$1.apply(Function5.scala:35)
    at scala.Function5$$anonfun$tupled$1.apply(Function5.scala:34)
    at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
    at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
    at sbt.std.Transform$$anon$4.work(System.scala:63)
    at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
    at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
    at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
    at sbt.Execute.work(Execute.scala:235)
    at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
    at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
    at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
    at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    [error] (streaming-flume-sink/avro:generate) org.apache.avro.SchemaParseException: Undefined name: “strıng”
    [error] Total time: 1555 s, completed 09.Tem.2015 09:32:57

    Any suggestions, please?
    Thank you.
