Apache Spark

Apache Spark is a distributed computing platform that builds on the Hadoop MapReduce model. It extends MapReduce, making applications both easier to write and faster to execute.

Spark provides APIs in Java, Scala, and Python; any of these languages can be used to write Spark applications.

Spark supports map and reduce operations, SQL queries, streaming data, machine learning algorithms, and graph algorithms.

Spark stack contains:

  • Spark Core: provides the core functionality, in-memory computing, and the ability to reference datasets in external storage systems.
  • Spark SQL: provides support for structured and semi-structured data through SchemaRDD (a data abstraction).
  • Spark Streaming: ingests input data in small batches and performs RDD transformations on those batches, enabling near-real-time data processing.
  • MLlib: machine learning library.
  • GraphX: provides an API for distributed graph processing.
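As an illustration of one stack component, the sketch below queries semi-structured data with Spark SQL. It assumes a Spark 1.x shell (where the SparkContext is available as `sc`, and the 1.x-era APIs match the SchemaRDD terminology above) and a hypothetical file `people.json` with `name` and `age` fields:

```scala
// Sketch only: `sc` and "people.json" are assumptions.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.jsonFile("people.json")   // schema inferred from JSON
people.registerTempTable("people")                // expose it to SQL queries
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
```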

Resilient Distributed Datasets - RDD

An RDD is an immutable (read-only) distributed collection of objects. An RDD can contain Python, Java, or Scala objects, including user-defined classes.

RDDs can be created using any of the three methods below:

  1. Parallelizing an existing collection: the data already lives in your driver program as a regular collection. Parallelizing distributes its elements as an RDD so they can be operated on in parallel.
  2. Referencing a dataset: reference a dataset in Hadoop-supported external storage such as HDFS, HBase, or Cassandra; the reference becomes an RDD.
  3. Transforming an existing RDD: applying a transformation to an existing RDD creates a new RDD.

RDDs support two types of operations: transformations and actions. Transformations are lazy: each one returns a new RDD and extends a Directed Acyclic Graph (DAG) of pending computations, which is only evaluated when an action is called. Actions trigger evaluation of the DAG and return a value to the driver program (or write results to storage).
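The distinction can be sketched as follows (assuming a Spark shell where the SparkContext is available as `sc`):

```scala
val nums    = sc.parallelize(1 to 10)
val squares = nums.map(n => n * n)        // transformation: lazy, nothing runs yet
val evens   = squares.filter(_ % 2 == 0)  // transformation: DAG grows, still lazy
val total   = evens.reduce(_ + _)         // action: evaluates the DAG, returns 220
```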

RDDs work with any storage supported by Hadoop, such as HDFS, HBase, and Cassandra.

Example code to create RDDs

The code below is in Scala and assumes a Spark shell, where the SparkContext is available as sc:

val data = 1 to 100                     // a plain Scala collection (Range)
val myData = sc.parallelize(data)       // method 1: parallelize an existing collection
val myFile = sc.textFile("myfile.txt")  // method 2: reference an external dataset
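Continuing the example, transforming `myFile` illustrates the third creation method. This word-count sketch assumes `myfile.txt` exists and that a Spark shell provides `sc`; the output path "counts" is arbitrary:

```scala
// Method 3: new RDDs created by transforming an existing RDD.
val words  = myFile.flatMap(line => line.split(" "))  // RDD of words
val pairs  = words.map(word => (word, 1))             // RDD of (word, 1) pairs
val counts = pairs.reduceByKey(_ + _)                 // RDD of (word, count)
counts.saveAsTextFile("counts")                       // action: triggers the job
```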