Sunday, December 28, 2014

Loading data using Spark SQL (cont.)

In the previous post we loaded data into Spark via a Scala case class. We're continuing our experiment; this time the data will be loaded using Spark SQL. Spark SQL is a component of Spark that provides a SQL-like interface.

Continuing with the code from the previous post, let's create a SQLContext, which is the entry point for running queries in Spark:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
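
Note that Spark SQL ships as a separate module from Spark core. If you are reusing the project from the previous posts, build.sbt also needs the spark-sql artifact; a minimal sketch of the extra line, assuming the same Spark 1.2.0 / Scala 2.10 versions used earlier:

libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.2.0"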

Let's create a schema RDD programmatically; according to the Spark SQL documentation, this method is recommended when a case class cannot be defined ahead of time.


There are three steps as detailed in the documentation:



  1. Create an RDD of Rows from the original RDD.
  2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.
  3. Apply the schema to the RDD of Rows via the applySchema method provided by SQLContext.

The schema is represented by a StructType matching the structure of the data being read:


import org.apache.spark.sql._  // brings in StructType, StructField, IntegerType, DoubleType in Spark 1.2

val schema = StructType(Array(
  StructField("year", IntegerType, true),
  StructField("month", IntegerType, true),
  StructField("tAverage", DoubleType, true),
  StructField("tMaximum", DoubleType, true),
  StructField("tMinimum", DoubleType, true)))

Next, we convert each line of the weather data into a Row:


val rowRDD = source.map(_.split(","))
  .map(p => org.apache.spark.sql.Row(p(1).toInt, p(2).toInt,
    p(4).toDouble, p(5).toDouble, p(6).toDouble))

Next, we apply the schema


val tempDataRDD = sqlContext.applySchema(rowRDD, schema)

Finally, we register the schema RDD as a temporary table:


tempDataRDD.registerTempTable("temperatureData")

Now that the table has been registered, we can run a SQL query


sql("select year, month, AVG(tAverage) 
  from temperatureData group by year order by 
  year").foreach(println)

to compute the average temperature by year. The results are displayed interleaved with Spark's log entries.
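
If the interleaving with the log output makes the results hard to read, one option is to collect the aggregated rows back to the driver and print them there; a small sketch reusing the temperatureData table registered above:

// Collect the result of the aggregation to the driver before printing,
// so the output is not interleaved with executor log lines.
val avgByYear = sql("select year, avg(tAverage) from temperatureData " +
  "group by year order by year").collect()
avgByYear.foreach(println)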




Note that while the application is running, you can see all the job details in the Spark web UI; the port is typically 4040 (the actual port is displayed on the console).



Friday, December 26, 2014

Loading data with Apache Spark

A previous post showed how to get started with Apache Spark; the goal of this post is to document my experiments with Spark as I learn it. In this post we will learn:
  • how to debug a Spark program
  • how to load a CSV file in Spark and perform a few calculations
This post assumes that you've followed the previous post, First steps with Apache Spark.
Start a new IDEA project, create a new Scala project, then create a new Scala object “LoadData” on the next screen.
The data file to be loaded is a CSV file containing monthly temperature data for Tucson, AZ, downloaded from the United States Historical Climatology Network (HCN). The data covers the period 1893 through December 2013. The data file format is as follows:
  • Station ID: String
  • Year: Integer
  • Month: Integer
  • Precipitation: Double
  • Minimum Temperature: Double
  • Average Temperature: Double
  • Maximum Temperature: Double
Let's load the data using a Scala case class that holds the year, month, and the minimum, average, and maximum temperatures from the CSV data file.

case class TempData(year:Int,month:Int,tMaximum:Double,
  tAverage:Double,tMinimum:Double)
The full code for the LoadData Scala object follows:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object LoadData {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("LoadData").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val source = sc.textFile("AZ028815_9131.csv")
    case class TempData(year: Int, month: Int, tMaximum: Double,
      tAverage: Double, tMinimum: Double)
    // Skip lines containing ",." (missing values), split on commas,
    // and parse the fields we need into TempData records
    val tempData = source.filter(!_.contains(",."))
      .map(_.split(","))
      .map(p => TempData(p(1).toInt, p(2).toInt, p(4).toDouble,
        p(5).toDouble, p(6).toDouble))
  }
}
Let’s verify that the data is loaded correctly by adding a print statement:

tempData.foreach(println)
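
The post also promised a few calculations. As a small taste of the RDD API, here is a minimal sketch, assuming the tempData RDD built above, that computes the mean of the monthly average temperatures for each year:

// Pair-RDD operations such as reduceByKey and sortByKey need this import in Spark 1.2
import org.apache.spark.SparkContext._

val avgByYear = tempData
  .map(t => (t.year, (t.tAverage, 1)))                              // (year, (temperature, count))
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }  // sum temperatures and counts per year
  .mapValues { case (sum, count) => sum / count }                   // divide to get the yearly mean
  .sortByKey()

avgByYear.foreach(println)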

Friday, December 19, 2014

First steps with Apache Spark

Apache Spark is a promising open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. It allows data analysts to rapidly iterate over data using techniques such as machine learning and streaming that require fast, in-memory data processing.

The goal of this post is to show how to get set up to run Spark locally on your machine.

One of the attractive features of Spark is the ability to perform data analysis interactively from the command-line REPL or an IDE. We'll work with the IntelliJ IDEA IDE to experiment with Spark. To get started you'll need to install the following:
  • the Java JDK
  • IntelliJ IDEA with the Scala plugin

Once the software listed above is installed, start a new IDEA project.

On the next screen, select SBT. SBT is a handy build tool for Scala, similar to Maven; it takes care of the project's dependency management.
On the next screen, enter a name for the project
The IDEA project is displayed; create a new Scala object.

In order to create a Scala program with a main method, we select “Object” from the drop-down.


The new project workspace is displayed


Notice the build.sbt file; as mentioned, SBT handles dependency management, similar to Maven. Update your build.sbt file with:

name := "HelloWorld" 


version := "1.0"


libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"

This will cause IDEA to download all dependencies needed by the project.


Update the Scala code as follows:

/**
 * Getting Started with Apache Spark
 */

import org.apache.spark.{SparkContext, SparkConf}

object HelloWorld {
  val conf = new SparkConf().setAppName("HelloWorld").setMaster("local[4]")
  val sc = new SparkContext(conf)

  def main(args: Array[String]) {
    println("Hello, world!")
  }
}

The first line inside the HelloWorld object sets common properties for your Spark application; in this particular case we set the application name and the number of threads (4) for our local environment. The second line creates a SparkContext, which tells Spark how to access a cluster.
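
To convince yourself that the SparkContext is actually doing work, you can replace the body of main with a tiny computation; a minimal sketch, using nothing beyond the sc defined above:

def main(args: Array[String]) {
  // Distribute the numbers 1 to 100 across the local worker threads and sum them up.
  val sum = sc.parallelize(1 to 100).reduce(_ + _)
  println(s"Sum of 1 to 100 = $sum")
}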
We're done setting up our first Spark application; let's run it!
Success!