Wednesday, May 20, 2015

Connecting OBIEE to Spark

In this post, we document the high level steps required to connect Oracle Business Intelligence 11.1.1.9 to Apache Spark using Spark SQL. The goal is to be able to run interactive queries in OBIEE using Spark as a data source. OBIEE 11.1.1.7 introduced the capabilities to integrate with Hadoop sources using the Hive server1 ODBC interface. A connector by Cloudera allows connecting to Hive Server 2 an improved version of Hive. Apache Spark SQL supports reading and writing data stored in Apache Hive. For this experiment, we are running OBIEE in Windows 7 and Hive, Spark in Linux Mint.

- Oracle Business Intelligence Enterprise Edition 11.1.1.9  

- Apache Spark 1.3.1

- Cloudera ODBC Driver for Hive (v2.5.12)

Once our environment is ready, we start our Apache Spark service:

image

Once the service is up and running, you should be able to navigate to http://[hostname]:8080

image

Next, we start the Spark Thrift Server service and direct it to connect to the Spark service;

image

For more details about the Apache Spark Thrift Server, visit the following link.

Now that the environment is ready, let’s create a connection to Hive. This is done by creating and ODBC data source in Windows. The connection to use is the “Cloudera ODBC Driver for Apache Hive”.  

image

To configure the ODBC connection, simply add host / port of the thrift interface, and make sure that HiveServer2 is selected. Authentication on Spark is disabled by default.

image

image

Let’s load some data and have fun! We will be using a new command shell called beeline that works with Hive Server2.

image

and connect to Hive

image

Using the beeline command shell, we load a subset of columns from the Chicago crime data set.

create table crime
(
ID STRING,
BLOCK STRING,
PRIMARY_TYPE STRING,
DESCRIPTION STRING,
LOCATIONDESC STRING,
YEAR INT,
LAT DOUBLE,
LON DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tmp/data/crime.csv';

Now that we’ve created a data source and loaded data into hive, we need to create the metadata in in the OBIEE administration tool.

image

We first create a new Database (Right Click on the physical layer-> New Database)


image


Then a connection Pool


image


Let’s test the connection, this is done by right clicking on the connection pool and selecting import Metadata

clip_image001

If the connection is successful, the Import Metadata window is displayed

clip_image003

Click Next

clip_image005

For some reason, the hive tables are not listed under the “default” database; click on the shuttled icon then Finish.

The Physical catalog HIVE is created under the Hive Server2 connection pool

clip_image006

We need to create the “crime” table manually in OBIEE. This is done by right clicking on the “HIVE” physical catalog->New Object->Physical Table

clip_image007

The key thing to note is that the table name (CRIME) and the physical columns must match those in the hive database.

image

We then create 2 aliases then a business model in order to be able to create simple interactive reports in OBIEE Answers.


image


We also create a simple measure: count of ID to represent the number of crime occurrences.


image


We’re now ready to create a simple report in OBIEE, the first report will show the # of crime occurrences versus year (2001 through 2015). As soon as the job is submitted, we can log into Spark UI to monitor the job. Sure enough, we see one job being run


image


After a few seconds, the job completes


image


The results are displayed in OBIEE.


image


We then add the crime type in a pivot table


image


We add a tree map visualization to display the number of crime occurrences by crime type for 2015


image


In this post, we connected OBIEE to Hive integrated with Spark SQL; we then created analysis using the Hive-Spark data source.