First look at Spark

Spark, started at the UC Berkeley AMPLab in 2009, is a fast and general cluster computing system for Big Data.

Build From Source

The Spark source tree contains a shell script, build/mvn, which automatically downloads compatible versions of Scala and Maven and compiles the source code.¹

git clone https://github.com/apache/spark
cd spark
build/mvn -DskipTests clean package

Launch Master & Slaves

The Spark config files are located in conf/. To launch the master and slaves:

sbin/start-all.sh
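Which hosts start a worker is controlled by the conf/slaves file, one hostname per line. A minimal sketch (the hostnames here are illustrative, not from the source):

```
# conf/slaves — one worker host per line (example hostnames)
localhost
worker1.example.com
worker2.example.com
```

With no conf/slaves file, start-all.sh starts a single worker on the local machine.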

Running the Examples

To run examples provided in Scala:

bin/run-example SparkPi 10

The run-example script runs the examples bundled with Spark by calling spark-submit.

Submitting Jobs with spark-submit

To submit jobs provided in Python:

bin/spark-submit examples/src/main/python/pi.py 10

To submit jobs provided in Java/Scala:

bin/spark-submit --class org.apache.spark.examples.SparkPi examples/target/spark-examples_*.jar 10

For more information about spark-submit, see the Spark documentation.

Code Structure

Spark has three main components:

  • The Master manages the Workers.
  • The Worker waits for commands from the Master.
  • The SparkSubmit provides the main gateway for launching a Spark application.

Each component has its own main function.

Communication between the Master and the Workers is handled by Akka.
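As a rough sketch of that protocol (the message names below follow Spark's org.apache.spark.deploy.DeployMessages, but the fields are simplified and illustrative, not the real signatures):

```scala
// Simplified, illustrative Master–Worker messages, sent between Akka actors.
// The real messages in DeployMessages carry more fields.
case class RegisterWorker(id: String, host: String, port: Int, cores: Int, memory: Int)
case class RegisteredWorker(masterUrl: String)   // Master's acknowledgement
case class Heartbeat(workerId: String)           // Worker -> Master liveness ping
case class LaunchExecutor(appId: String, execId: Int, cores: Int, memory: Int)
```

A Worker registers with the Master at startup, then sends periodic Heartbeat messages; the Master replies with commands such as LaunchExecutor.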

Each example also has a main function; this is where the example code starts.

So there are four main functions in total.

Take SparkPi as an example:

package org.apache.spark.examples

import scala.math.random

import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
  • The code above first creates a ParallelCollectionRDD by calling parallelize; this RDD holds the data, split into slices (partitions).
  • The map call then creates a new MapPartitionsRDD, which stores the parent ParallelCollectionRDD and the map function but does not compute anything yet.
  • Only when reduce is called are the map and reduce jobs actually scheduled and run.
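The same laziness can be mimicked in plain Scala (this is only an analogy, not Spark code): a collection view defers the map the way an RDD transformation does, and reduce forces the computation the way an RDD action does.

```scala
// Plain-Scala analogy for RDD laziness (no Spark involved):
val mapped = (1 until 10).view.map { i => i * i }  // deferred, like MapPartitionsRDD
val total  = mapped.reduce(_ + _)                  // forces evaluation, like an action
// total == 285
```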

More examples about RDDs can be found here.


  1. When compiling the source code, the "clean" goal is recommended. If "clean" is omitted, the build is incremental, and you are then likely to hit compile errors or "function not found" errors at runtime.