Thursday, January 22, 2015

Visualizing a data set in Spark

The goal of this post is to introduce Wisp, a tool by Quantifind for visualizing Scala data in a web browser. In a previous post, we loaded a weather data set into an RDD. To include Wisp in the project, add the following line to the sbt build definition:

libraryDependencies += "com.quantifind" %% "wisp" % "0.0.1"

We’re going to plot the monthly averages of

  • tMinimum
  • tAverage
  • tMaximum

computed over all years.


The averages will be stored in mutable lists:

    val tempAverage = new ListBuffer[Double]
    val tempMinimum = new ListBuffer[Double]
    val tempMaximum = new ListBuffer[Double]


This requires imports for Scala's mutable ListBuffer and for Wisp's Highcharts API:

    import scala.collection.mutable.ListBuffer
    import com.quantifind.charts.Highcharts._

The averages are computed using the Spark function “aggregate”:

    for (month <- 1 to 12) {
      // Keep only the records for the current month, across all years
      val monthData = tempData.filter(_.month == month)
      // Each aggregate call returns a (sum, count) pair
      val tAve = monthData.map(_.tAverage).aggregate((0.0, 0.0))((p, q) => (p._1 + q, p._2 + 1), (p, q) => (p._1 + q._1, p._2 + q._2))
      val tMin = monthData.map(_.tMinimum).aggregate((0.0, 0.0))((p, q) => (p._1 + q, p._2 + 1), (p, q) => (p._1 + q._1, p._2 + q._2))
      val tMax = monthData.map(_.tMaximum).aggregate((0.0, 0.0))((p, q) => (p._1 + q, p._2 + 1), (p, q) => (p._1 + q._1, p._2 + q._2))
      // Average = sum / count
      tempMinimum += tMin._1 / tMin._2
      tempAverage += tAve._1 / tAve._2
      tempMaximum += tMax._1 / tMax._2
    }

The aggregate function provides a customized way to perform reductions and aggregations on an RDD. It takes an initial value, a function that folds each element into a per-partition accumulator, and a function that merges the accumulators of different partitions. In this particular case, each aggregate call computes 2 values at the same time

  • the sum of the temperature values
  • the number of elements

The ratio of the 2 values represents the average.
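
To make the two-function form of aggregate more concrete, here is a minimal plain-Scala sketch that mimics it without Spark. The object AggregateSketch, its helper names, and the nested Seq standing in for the partitions of an RDD are all assumptions made for illustration; this is not how Spark is implemented, only the same (sum, count) logic applied by hand.

```scala
// Plain-Scala sketch of the (sum, count) aggregation, without Spark.
// Assumption: a Seq[Seq[Double]] stands in for an RDD split into partitions.
object AggregateSketch {
  type Acc = (Double, Double) // (running sum, running count)

  val zero: Acc = (0.0, 0.0)

  // First function: fold one temperature into a partition-local accumulator
  def seqOp(acc: Acc, t: Double): Acc = (acc._1 + t, acc._2 + 1)

  // Second function: merge the accumulators of two partitions
  def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

  // Fold each partition on its own, then merge the partial results
  def aggregate(partitions: Seq[Seq[Double]]): Acc =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  // The average is the ratio of the two aggregated values
  def mean(partitions: Seq[Seq[Double]]): Double = {
    val (sum, count) = aggregate(partitions)
    sum / count
  }
}
```

For example, AggregateSketch.mean(Seq(Seq(10.0, 20.0), Seq(30.0))) aggregates to (60.0, 3.0) and returns 20.0. Note that with no elements at all the ratio is 0.0/0.0, which is NaN, so a month without any records would need separate handling.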


Let's now use Wisp to plot the temperature profile:

    line(1 to 12, tempMinimum)
    // hold() draws the next plot on the same chart
    hold()
    line(1 to 12, tempAverage)
    hold()
    line(1 to 12, tempMaximum)
    title("Temperature")
    xAxis("Month")
    yAxis("Temperature")
    // Labels follow the order in which the lines were plotted
    legend(List("Tminimum", "Taverage", "Tmaximum"))

Compile and run the code. If all goes well, the console displays a URL:


Output written to http://machine-name:PORT

Navigate to the URL using a web browser and you should see a chart showing the monthly temperature averages for tMinimum, tAverage and tMaximum.

