
Visualizing HANA Graph with Zeppelin

The SAP HANA database has the capability to store information in a graph. A graph is a data structure made up of vertices (nodes) and edges. Graph structures are powerful because some types of analysis, such as nearest-neighbor or shortest-path calculations, can be performed more quickly. A graph also enables faceted searching. Zeppelin has recently added support for displaying network graphs.

Simple Network Graph

Zeppelin 0.8 has support for network graphs. You need to download the Zeppelin 0.8 snapshot and build it to get these features.
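The code to make a graph like this looks roughly like the following; this is a sketch based on the example in the Zeppelin network display documentation:

%network {
    "nodes": [
        {"id": 1, "label": "User", "data": {"fullName": "Andrea Santurbano"}},
        {"id": 2, "label": "User", "data": {"fullName": "Paolo Baldini"}},
        {"id": 3, "label": "Project", "data": {"name": "Zeppelin"}}
    ],
    "edges": [
        {"source": 2, "target": 1, "id": 1, "label": "HELPS"},
        {"source": 2, "target": 3, "id": 2, "label": "CONTRIBUTES_TO"},
        {"source": 1, "target": 3, "id": 3, "label": "CONTRIBUTES_TO"}
    ],
    "labels": {"User": "#8BC34A", "Project": "#3071A9"},
    "directed": true,
    "types": ["HELPS", "CONTRIBUTES_TO"]
}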

Note that the part of the JSON that holds the attributes for a node or edge lives in an inner JSON object called "data". This is how each node and edge can carry different data depending on what type of node or edge it is, for example:

"data": {"fullName":"Andrea Santurbano"}}

Because I happen to be learning the features of SAP HANA, I wanted to display a graph from HANA using Zeppelin. A previous post shows how to connect Zeppelin to HANA.

I am using a series of paragraphs to build the data and then visualize the graph. I have already built my graph workspace in HANA with help from Creating Graph Database Objects and the HANA Graph Reference. The difficult part is transforming the data and relationships so they fit into a vertex/node table and an edge table.
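As a rough sketch, a graph workspace definition looks like the following; the schema, table, and column names are my own illustration, following the syntax in Creating Graph Database Objects:

CREATE GRAPH WORKSPACE "MEETUP"."MEETUP_GRAPH"
    EDGE TABLE "MEETUP"."EDGES"
        SOURCE COLUMN "SOURCE_ID"
        TARGET COLUMN "TARGET_ID"
        KEY COLUMN "EDGE_ID"
    VERTEX TABLE "MEETUP"."NODES"
        KEY COLUMN "NODE_ID";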

I am using sample data from meetup.com, which contains events organized by groups, attended by members, and held at venues.

It is important to figure out which attributes should exist on the edges and which ones should be in the nodes. HANA is good at allowing sparse data in the attributes because of the way it stores data in columnar form. If you wish to display the data on a map and in a graph using the same structure supplied by Zeppelin, it is important to put your geo coordinates in the nodes and not in the edges.

First we connect to the database, load the node and edge tables with Spark/Scala, and build the data frames. One issue here was converting the HANA ST_POINT data type into latitude and longitude values. I defined a select statement with the HANA functions ST_X() and ST_Y() to perform the conversions before the data is put into the dataframe.
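Here is a minimal sketch of that load; the MEETUP.NODES table, its columns, and the connection values are placeholders of my own (see the previous post for the real connection setup):

%spark
// Push the ST_POINT -> lat/lng conversion down to HANA with a subquery;
// ST_X() and ST_Y() return the point coordinates as doubles
val nodeQuery = """(SELECT NODE_ID, NODE_TYPE, NAME,
                   "SHAPE".ST_X() AS LNG, "SHAPE".ST_Y() AS LAT
                   FROM MEETUP.NODES) AS NODES"""
val nodesDF = sqlContext.read.format("jdbc")
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("url", "jdbc:sap://[server]:[port]")
    .option("user", "[username]")
    .option("password", "[password]")
    .option("dbtable", nodeQuery)
    .load()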

You will have problems and errors if your tables in HANA have null values. Databases don't mind null values, but Scala seems to hate null. So you have to convert any columns that could contain null values to something that makes sense. In this case I converted varchar to empty strings and double to 0.0.
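In Spark this conversion can be done with na.fill; a sketch, assuming the nodesDF dataframe and column names from above:

%spark
val cleanNodes = nodesDF
    .na.fill("", Seq("NAME"))        // null varchar -> empty string
    .na.fill(0.0, Seq("LAT", "LNG")) // null double  -> 0.0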

Then I query the data frames to get the data needed for the visualization and transform it into JSON strings for the collections of nodes and edges. In the end this note outputs two JSON arrays: one for the nodes and one for the edges.
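A sketch of the node transformation; the edge array is built the same way from the edge dataframe, and the column names are again my hypothetical ones:

%spark
// Build the node array by hand; NODE_ID is assumed to be a BIGINT
val nodesJson = cleanNodes.collect().map { row =>
    s"""{"id": ${row.getAs[Long]("NODE_ID")}, "label": "${row.getAs[String]("NODE_TYPE")}", "data": {"name": "${row.getAs[String]("NAME")}", "lat": ${row.getAs[Double]("LAT")}, "lng": ${row.getAs[Double]("LNG")}}}"""
}.mkString("[", ",", "]")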


Now we will visualize the data using the new Zeppelin directive called %network.

In my example data extracted from meetup.com, I have four types of nodes: venue, member, group and event. These are defined as labels. My edges, or relationships, I have defined as: held, sponsored and rsvp. These become the lines on the graph.
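To hand the result to the visualization, the Spark paragraph can print a %network block built from the two JSON arrays. A sketch, assuming the nodesJson string from the previous paragraph and an edgesJson string built the same way (the colors are arbitrary):

%spark
print(s"""%network {
    "nodes": $nodesJson,
    "edges": $edgesJson,
    "labels": {"venue": "#3071A9", "member": "#8BC34A", "group": "#FF9800", "event": "#E91E63"},
    "directed": true,
    "types": ["held", "sponsored", "rsvp"]
}""")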

Zeppelin combines the data from the edges and nodes into a single table view, so in tabular format it will look like this:

Zeppelin Edges

 

Zeppelin Nodes

When you press the network button in Zeppelin the graph diagram appears.

Zeppelin Network Graph Diagram

Under settings you can specify what data displays on the screen. It does not allow you to specify which edge attribute is displayed as the label, and it does not seem to support a weight option.

Network Graph Settings

If you select the map option and have the Leaflet visualization loaded, you can show the data on a map. Since I put the coordinates in the edges, it maps the edge data. It would be better if I moved the coordinates into each node so the nodes could be displayed on the map.

Zeppelin Graph On a Map

Hopefully this will help you get past some of the issues I had getting the Zeppelin network diagram feature to work with HANA, and perhaps with other databases.

In the future I hope to show how to call HANA features such as nearest neighbor, shortest path and pattern matching algorithms to enhance the graph capabilities in Zeppelin.

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint


Zeppelin Maps the Hard Way

In Zeppelin Maps the Easy Way I showed how to add a map to Zeppelin with a Helium module. But what if you do not have access to the Helium NPM server to load in that module? And what if you want to add features to your Leaflet Map that are not supported in the volume-leaflet package?

This post will show you how the Angular JavaScript library allows you to add a map user interface to a Zeppelin paragraph.

Zeppelin Angular Leaflet Map

Zeppelin Angular Leaflet Map with Markers

First we want to get a map on the screen with markers.

In Zeppelin create a new note.

As was shown in How to Use Zeppelin With SAP HANA, we create a separate paragraph to build the database connection. Substitute your own database driver and connection string to make it work for other databases. There are other examples where you can pull in data from a csv file and turn it into a table object.

In the next paragraph we place the Spark Scala code to query the database and build the markers that will be passed to the final paragraph, which is built with Angular.

The data query paragraph has a basic way to query a bounding box. It just looks for coordinates that are greater than the southwest corner and less than the northeast corner of the bounding box.

var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
    var coords = box.split(",")
    sql1 = sql1 + " where lng > " + coords(0).toFloat + " and lat > " + coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " + coords(3).toFloat
}

var sql = sql1 + " limit 20"
val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect())

The data from this query is used to make the map_pings and bind them to Angular so that any Angular code can reference them. Zeppelin has the ability to bind data into other languages so it can be used by different paragraphs in the same note. There are samples for other databases, json and csv files at this link.

We do not have access to the HANA proprietary functions because Zeppelin loads the data up in its own table view of the HANA table. We are using the command "createOrReplaceTempView" so that a copy of the data is not made in Zeppelin. It will just pass the data through.

Note that you should set up the HANA jdbc driver as described in How to Use Zeppelin With SAP HANA.

It is best if you set up a dependency to the HANA jdbc jar in the Spark interpreter. Go to the Zeppelin settings menu.

Zeppelin Settings Menu

Pick the Interpreter and find the Spark section and press edit.

Zeppelin Interpreter Screen

Then add the path where you have the SAP HANA jdbc driver, ngdbc.jar, installed.

Configure HANA jdbc in Spark Interpreter

First Paragraph

%spark
import org.apache.spark.sql._
val driver ="com.sap.db.jdbc.Driver"
val url="jdbc:sap://11.1.88.110:30015/tri"
val database   = "database schema"   
val username   = "username for the database"
val password   = "the Password for the database"
val table_view = "event_view"
var box=""
val jdbcDF = sqlContext.read.format("jdbc").option("driver",driver)
                                           .option("url",url)
                                           .option("databaseName", database)
                                           .option("user", username)
                                           .option("password",password)
                                           .option("dbtable", table_view).load()
jdbcDF.createOrReplaceTempView("event_view")

Second Paragraph

%spark

var box = "20.214843750000004,1.9332268264771233,42.36328125000001,29.6880527498568";
var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
    var coords = box.split(",")
    sql1 = sql1 + " where lng  > " + coords(0).toFloat + " and lat > " +  
        coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " +
        coords(3).toFloat
}
var sql = sql1 +" limit 20" 

val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect())
z.angularBind("paragraph", z.getInterpreterContext().getParagraphId())
// get the paragraph id of the angular paragraph and put it below
z.run("20171127-081000_380354042")

Third Paragraph

In the third paragraph we add the Angular code with the %angular directive. Note the forEach loop section where it builds the markers and adds them to the map.

%angular 
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.css" />
.
<div id="map" style="height: 300px; width: 100%"></div>
<script type="text/javascript">
function initMap() {
    var element = $('#textbox');
    var map = L.map('map').setView([30.00, -30.00], 3);
   
    L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);
    var geoMarkers = L.layerGroup().addTo(map);
    
    var el = angular.element($('#map').parent('.ng-scope'));
    var $scope = el.scope().compiledScope;
   
    angular.element(el).ready(function() {
        window.locationWatcher = $scope.$watch('locations', function(newValue, oldValue) {
            //geoMarkers.clearLayers();
            angular.forEach(newValue, function(event) {
                if (event) {
                    var marker = L.marker([event.values[1], event.values[2]]).bindPopup(event.values[0]).addTo(geoMarkers);
                }
            });
        })
    });
}
if (window.locationWatcher) { window.locationWatcher(); }

// ensure we only load the script once, seems to cause issues otherwise
if (window.L) {
    initMap();
} else {
    console.log('Loading Leaflet library');
    var sc = document.createElement('script');
    sc.type = 'text/javascript';
    sc.src = 'https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.js';
    sc.onload = initMap;
    sc.onerror = function(err) { alert(err); };
    document.getElementsByTagName('head')[0].appendChild(sc);
}
</script>
<p>Testing the Map</p>

<form class="form-inline">
  <div class="form-group">
    <input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>
    <label for="paragraphId">Paragraph Id: </label>
    <input type="text" class="form-control" id="paragraphId" placeholder="Paragraph Id ..." ng-model="paragraph"></input>
  </div>
  <button type="submit" class="btn btn-primary" ng-click="z.runParagraph(paragraph)"> Run Paragraph</button>
</form>

Now when you run the three paragraphs in order it should produce a map with markers on it.

The next step is to add a way to query the database by drawing a box on the screen. Into the Scala/Spark code we add a variable for the bounding box with the z.angularBind() command. Then a watcher is made to see when this variable changes so the new value can be used to run the query.

Modify Second Paragraph

%spark
z.angularBind("box", box)
// Get the bounding box
z.angularWatch("box", (oldValue: Object, newValue: Object) => {
    println(s"value changed from $oldValue to $newValue")
    box = newValue.asInstanceOf[String]
})

var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
    var coords = box.split(",")
    sql1 = sql1 + " where lng  > " + coords(0).toFloat + " and lat > " +  coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " +  coords(3).toFloat
}
var sql = sql1 +" limit 20" 

val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect())
z.angularBind("paragraph", z.getInterpreterContext().getParagraphId())
z.run("20171127-081000_380354042") // put the paragraph id for your angular paragraph here

To the Angular section we need to add an additional Leaflet plugin called Leaflet.draw. This is done by adding an additional css link and a javascript script tag. Then the draw controls are added as shown in the code below.

Modify the Third Paragraph

%angular 
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet.draw/0.4.13/leaflet.draw.css" />
.
<script src='https://cdnjs.cloudflare.com/ajax/libs/leaflet.draw/0.4.13/leaflet.draw.js'></script>
<div id="map" style="height: 300px; width: 100%"></div>

<script type="text/javascript">
function initMap() {
    var element = $('#textbox');
    var map = L.map('map').setView([30.00, -30.00], 3);
   
    L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);
    var geoMarkers = L.layerGroup().addTo(map);
    var drawnItems = new L.FeatureGroup();
    
    map.addLayer(drawnItems);
    
    var drawControl = new L.Control.Draw({
        draw: {
             polygon: false,
             marker: false,
             polyline: false
        },
        edit: {
            featureGroup: drawnItems
        }
    });
    map.addControl(drawControl);
    
    map.on('draw:created', function (e) {
        var type = e.layerType;
        var layer = e.layer;
        drawnItems.addLayer(layer);
        element.val(layer.getBounds().toBBoxString());
        map.fitBounds(layer.getBounds());
        window.setTimeout(function(){
           //Triggers Angular to do its thing with changed model values
           element.trigger('input');
        }, 500);
    });
    
    var el = angular.element($('#map').parent('.ng-scope'));
    var $scope = el.scope().compiledScope;
   
    angular.element(el).ready(function() {
        window.locationWatcher = $scope.$watch('locations', function(newValue, oldValue) {
            $scope.latlng = [];
            angular.forEach(newValue, function(event) {
                if (event) {
                    var marker = L.marker([event.values[1], event.values[2]]).bindPopup(event.values[0]).addTo(geoMarkers);
                    $scope.latlng.push(L.latLng(event.values[1], event.values[2]));
                }
            });
            var bounds = L.latLngBounds($scope.latlng)
            map.fitBounds(bounds)
        })
    });

}

if (window.locationWatcher) { window.locationWatcher(); }

// ensure we only load the script once, seems to cause issues otherwise
if (window.L) {
    initMap();
} else {
    console.log('Loading Leaflet library');
    var sc = document.createElement('script');
    sc.type = 'text/javascript';
    sc.src = 'https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.js';
    sc.onload = initMap;
    sc.onerror = function(err) { alert(err); };
    document.getElementsByTagName('head')[0].appendChild(sc);
}
</script>
<p>Testing the Map</p>

<form class="form-inline">
  <div class="form-group">
    <input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>
    <label for="paragraphId">Paragraph Id: </label>
    <input type="text" class="form-control" id="paragraphId" placeholder="Paragraph Id ..." ng-model="paragraph"></input>
  </div>
  <button type="submit" class="btn btn-primary" ng-click="z.runParagraph(paragraph)"> Run Paragraph</button>
</form>

There are some important features to mention here that took some investigation to figure out.

Within Zeppelin I was unable to get the box being drawn to stay visible. So instead, drawing a box will cause the map to zoom to the selected area by utilizing this code:
element.val(layer.getBounds().toBBoxString());
map.fitBounds(layer.getBounds());

To make the map zoom back to the area after the query is run, this code is triggered:

$scope.latlng.push(L.latLng(event.values[1], event.values[2]))
...
var bounds = L.latLngBounds($scope.latlng)
map.fitBounds(bounds)

To trigger the Spark/Scala paragraph to run after drawing a box, this attribute runs the query paragraph: data-ng-change="z.runParagraph(paragraph);"

<input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>

The html form at the bottom is what holds and binds the data back and forth between the paragraphs. It is visible for debugging at the moment.

Query of a geographic region with Zeppelin

Query of a geographic region

Please let us know how it works out for you. Hopefully this will help you add maps to your Zeppelin notebook. I am sure there are better ways to accomplish this feature set, but this is the first way I was able to get it all working together.

Demo of the interface:

You can contact us using twitter at @volumeint.

Some code borrowed from: https://gist.github.com/granturing/a09aed4a302a7367be92 and https://zeppelin.apache.org/docs/latest/displaysystem/front-end-angular.html

Zeppelin Maps the Easy Way

You want your data on a map in Apache Zeppelin.

Apache Zeppelin allows a person to query all sorts of databases and big data systems using many languages. This data can then be displayed as graphs and charts. There is also a way to make a data driven map.

A map with data markers on it in Zeppelin

There is a very easy way to get a basic map with markers on it into Zeppelin. The first thing we will do is query the database for the data we want to put on the map. Create a new note and add a query for your database of choice.

If you are using SAP HANA, there are directions on how to install the jdbc driver in How to Use Zeppelin With SAP HANA.

If you are using another database just use the %jdbc interpreter and modify your database configuration settings in the interpreter settings section of Zeppelin.

The query you write must have a column for the longitude and the latitude.

Query for data with Zeppelin

In my example I am using the query "select country_name, lat, lng, event_type from event_view where actor_type_id='2' and country_name='South Sudan' limit 10". Note that my interpreter is HANA. This is described in a previous post about HANA and Zeppelin.

Once this data is in the table, the plug-in for the map needs to be installed. In the upper right corner click on the login indicator and select Helium.

The Zeppelin Setup Menu

A list of Helium plug-ins will be displayed. Enable the one called zeppelin-leaflet or volume-leaflet. The Helium module for maps utilizes another javascript library called Leaflet.

Helium Modules List

Helium Modules

Once it is enabled, go back to your note where you created the map query. A new button will show up. It looks like the earth on a white background. When you press this a map will appear that needs to be configured.

Configure Map Settings

Press the settings link. As shown above you will see the fields from the query. Just drag the fields down into the boxes called latitude, longitude, tooltip and popup.

Mapping the Columns for the map

Mapping the Columns

Once the latitude and longitude are filled in, the map will appear with markers on it for the data points. If you put columns into the tooltip and popup boxes, the tooltips and popups will work also.

The Finished Map

This demonstrates the easiest way to add a map to Zeppelin. More advanced data processing before a map is built requires writing a paragraph of code in Spark Scala, Python or another language supported by Apache Zeppelin.

Future posts will show how to write Scala code to preprocess data and another post on how to draw a box on the map to select the area of the earth to be queried for data points.

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint


Ryft and Apache Zeppelin

Ryft

Ryft is an FPGA (field-programmable gate array) appliance that allows for hosting and searching data quickly. In this post I will show one way to connect up Apache Zeppelin for use in data analysis using Scala code. Previously I showed how to connect Apache Zeppelin to SAP HANA.

The Ryft can quickly search structured and unstructured data without needing to build an index. This ability is attributed to the FPGA, which can filter data on demand; it uses the internal 4 FPGA modules to process the data at search time. Other types of search systems like Elasticsearch, Solr, Lucene or a database have to build and store an index of the data. Ryft operates without an index.

Ryft Speed Comparison

I have populated my Ryft with a cache of data from Enron. It is a dump of Enron emails obtained from Carnegie Mellon. This was as simple as uploading files to the Ryft and running a command like this:

ryftutil -copy "enron*" -c enron_email -a ryft.volumeintegration.com:8765

In the Zeppelin interface I will be able to search for keywords or phrases in the email files and display them. The size of the Enron e-mail archive is 20 megabytes.

Ryft One Appliance

Apache Zeppelin

Apache Zeppelin is an open source web notebook that allows a person to write code in many languages to manipulate and visualize data.

Apache Zeppelin with Volume Analytics Interface

To make Apache Zeppelin work with Ryft, I installed Apache Zeppelin onto the Ryft appliance and connected the Spark Ryft Connector jar found at this git project. Or download a prebuilt jar.

Follow the directions provided at the spark-ryft-connector project to compile the jar file needed. I compiled the jar file on my local desktop computer. Place the spark-ryft-connector jar file onto the Ryft machine. I did run into one issue that was not documented: the Ryft connector was not working properly, giving the error "java.lang.NoClassDefFoundError: org/apache/spark/Logging".

I resolved the issue by downloading spark-core_2.11-1.5.2.logging.jar from https://raw.githubusercontent.com/swordsmanliu/SparkStreamingHbase/master/lib/spark-core_2.11-1.5.2.logging.jar and putting it in the zeppelin/interpreter/spark/dep directory.

Now you can create a note in Zeppelin. I am using the Spark interpreter which allows you to write the code in Scala.

First you have to make sure Zeppelin can use the Ryft code in the jar file. Make a dependency paragraph with this code:

%dep
z.reset()
z.load("/home/ryftuser/spark-ryft-connector-2.10.6-0.9.0.jar")

Ryft Query

Now make a new paragraph with the code to make form fields and run the Ryft API commands to perform a search. Figuring these queries out takes a detailed study of the documentation.

These are the commands to prepare and run the query. I show a simple search, a fuzzy hamming search and a fuzzy edit distance search. The Ryft can perform very fast fuzzy searches with wide edit distances because no index needs to be built.

Simple Query

queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
query = SimpleQuery(searchFor.toString)

Hamming Query

queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte, fhs)

Edit Distance Query

queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte)

The Search

var searchRDD = sc.ryftRDD(Seq(query), queryOptions)

This produces an RDD that can be manipulated to view the contents using code like the example below.

searchRDD.asInstanceOf[RyftRDD[RyftData]].collect.foreach { ryftData =>
   println(ryftData.offset)
   println(ryftData.length)
   println(ryftData.fuzziness)
   println(ryftData.data.replace("\n", " "))
   println(ryftData.file)
}

The Result in Zeppelin

Result of Searching Ryft with Zeppelin

In addition I have included code that allows the user to click on Show File to see the original e-mail with the relevant text highlighted in bold.

Results in Bold

In order for Apache Zeppelin to display the original email, I had to give it access to the part of the filesystem where the original emails were stored. Ryft uses a catalog of the emails to perform searches, as it performs better when searching fewer larger files than many smaller ones. The catalog feature allows it to combine many small files into one large file.

The search results return a filename and offset which Apache Zeppelin uses to retrieve the relevant file and highlight the appropriate match. 

In the end, Ryft found all instances of the name Mohammad with various spelling differences in 0.148 seconds in a dataset of 30 megabytes. When I performed the same search on 48 gigabytes of data, it ran in 5.89 seconds; 94 gigabytes took 12.274 seconds, and 102 gigabytes took 13 seconds. These are just quick sample numbers using dumps of many files. Perhaps performance could be improved by consolidating small files into catalogs.

Zeppelin Editor

The code is edited in Zeppelin itself.

Code in Zeppelin

You edit the code in the web interface, but Zeppelin can hide it once you have the form fields working. Here is the part of the code that produces the form fields:

 val searchFor = z.input("Search String", "mohammad")
 val distance = z.input("Search Distance", 2)
 var queryType = z.select("Query Type", Seq(("1","Simple"),("2","Hamming"),("3","Edit Distance"))).toString
 var surrounding = z.input("Surrounding", "line")

So in the end we end up with the following code.

%spark
import com.ryft.spark.connector._
import com.ryft.spark.connector.domain.RyftQueryOptions
import com.ryft.spark.connector.query.SimpleQuery
import com.ryft.spark.connector.query.value.{EditValue, HammingValue}
import com.ryft.spark.connector.rdd.RyftRDD
import com.ryft.spark.connector.domain.{fhs, RyftData, RyftQueryOptions}
import scala.language.postfixOps
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import scala.io.Source

def isEmpty(x: String) = x == null || x.isEmpty
  var queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
  val searchFor = z.input("Search String", "mohammad")
  val distance = z.input("Search Distance", 2)
  var queryType = z.select("Query Type",("2","Hamming"), Seq(("1","Simple"),("2","Hamming"),("3","Edit Distance"))).toString
  var surrounding = z.input("Surrounding", "line")
  var query = SimpleQuery(searchFor.toString)

 
  if (isEmpty(queryType)) {
      queryType = "2"
  }

  if (queryType.toString.toInt == 1) {
        println("simple")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, 0 toByte)
        }
        query = SimpleQuery(searchFor.toString)

  } else if (queryType.toString.toInt ==2) {
        println("hamming")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte, fhs)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte, fhs)
        }
  } else {
        println("edit")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte)
        }
  }

  var searchRDD = sc.ryftRDD(Seq(query), queryOptions)
  var count = searchRDD.count()

  print(s"%html <h2>Count: $count</h2>")

  if (count > 0) {
        println(s"Hamming search RDD first: ${searchRDD.first()}")
        println(searchRDD.count())
        print("%html <table>")
        print("<script>")
        println("function showhide(id) { var e = document.getElementById(id); e.style.display = (e.style.display == 'block') ? 'none' : 'block';}")
        print("</script>")
        print("<tr><td>File</td><td>Data</td></tr>")

        searchRDD.asInstanceOf[RyftRDD[RyftData]].collect.foreach { ryftData =>
            print("<tr><td style='width:600px'><a href=javascript:showhide('"+ryftData.file+"')>Show File </a></td>")
            val x = ryftData.data.replace("\n", " ")
            print(s"<td> $x</td></tr>")
            println("<tr id="+ ryftData.file +" style='display:none;'>")
            println("<td style='width:600px'>")

            val source = Source.fromFile("/home/ryftuser/maildir/"+ryftData.file)
            var theFile = try source.mkString finally source.close()
            var newDoc = ""
            var totalCharCount = 0
            var charCount = 0
            // Walk the file character by character, wrapping the match that
            // starts at ryftData.offset and runs for ryftData.length in <b> tags
            for (c <- theFile) {
                charCount = charCount + 1
                if (totalCharCount + charCount == ryftData.offset) {
                    newDoc = newDoc+"<b>"
                } else if (totalCharCount+charCount == ryftData.offset+ryftData.length+1) {
                    newDoc = newDoc+"</b>"
                }
                newDoc = newDoc+c
            }
            print(newDoc.replace("\n", "<br>"))
            totalCharCount = totalCharCount + charCount
            println("</td>")
            println("</tr>")
        }
        print("</table>")
    }

So this should get you started on being able to search data with Zeppelin and Ryft. You can use this interface to experiment with the different edit distances and search queries the Ryft supports. You can also implement additional methods to search by RegEx, IP addresses, dates and currency.

Please follow us on Facebook and on twitter at volumeint.

How to Use Zeppelin With SAP HANA

Apache Zeppelin is an open source tool that allows interactive data analytics from many data sources like databases, Hive, Spark, Python, HDFS, HANA and more.

It allows for:

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration

https://aws.amazon.com/blogs/aws/amazon-emr-update-apache-spark-1-5-2-ganglia-presto-zeppelin-and-oozie/

SAP HANA is an in-memory data platform for storing and analyzing data. At its core it is a columnar database with the ability to also act as a row database.

HANA overview

Setup Zeppelin

Install Apache Zeppelin from https://zeppelin.apache.org/download.html with all interpreters. Once Zeppelin is installed, go to the bin directory and run ./zeppelin-daemon.sh start. You can then view the Zeppelin interface at http://localhost:8080/
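For example (the daemon script also supports stop, restart and status; your directory name will depend on the version you downloaded):

cd zeppelin-0.x.x-bin-all/bin
./zeppelin-daemon.sh start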

zeppelin start screen

Click on the menu to go to interpreter setup.

Configuration Menu

Choose the Create button.

Create new interpreter

Pick the JDBC interpreter group in the pull down.

Pick JDBC Interpreter

Fill in these HANA properties with your specific server name and port. Give it the name of hana.

HANA interpreter property values

The important parts are:

default.driver: com.sap.db.jdbc.Driver

default.url: jdbc:sap://[server]:[port]

Save the interpreter with the save button.

Install JDBC Driver

Find the jar file called ngdbc.jar and place it in zeppelin-0.x.x-bin-all/interpreter/jdbc.

The ngdbc.jar may be embedded in another jar file. You will need to install SAP HANA Studio or the SQL client drivers. In order to find the ngdbc.jar I had to unzip the com.sap.ndb.studio.jdbc_2.3.6.jar file by changing the .jar extension to .zip. When it was unzipped I found the ngdbc.jar in the lib directory.

The com.sap.ndb.studio.jdbc_2.3.6.jar was found in a directory called studio/core_repository/plugins.
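As a sketch, the extraction can also be done from the command line; the studio jar version and paths here will vary with your installation:

cd studio/core_repository/plugins
unzip com.sap.ndb.studio.jdbc_2.3.6.jar lib/ngdbc.jar    # a jar is just a zip archive
cp lib/ngdbc.jar /path/to/zeppelin-0.x.x-bin-all/interpreter/jdbc/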

Make a New Note

Go to the menu and create a new note.

Create New Note

Pick hana as your default interpreter and give your note a name.

Create New Note

This will create a workspace where you can create your tables and graphs from data in HANA and also annotate your report with text. It is also possible to pull in data from other database systems you have already configured.

Create New Note

Now enter your query in the text area called the paragraph.

type in query

Click the arrow to run the query and then you can select what kind of graph to display.

Type in query and pick a visualization

Now you can add additional paragraphs with markdown text in them to describe your information.

So we have successfully configured Apache Zeppelin to pull data from HANA using a jdbc driver.

How to Use HANA from Spark

You may also want to connect to HANA directly from Spark using Scala, Python or R code. The best way to implement this is to put the reference to the jdbc jar called ngdbc.jar in the Spark interpreter. Go to the Zeppelin settings menu.

Zeppelin Settings Menu

Pick the Interpreter and find the Spark section and press edit.

Zeppelin Interpreter Screen

Then add the path where you have the SAP HANA jdbc driver, ngdbc.jar, installed.

Configure HANA jdbc in Spark Interpreter
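With the driver on the Spark interpreter's classpath, a minimal Spark paragraph to read a HANA table into a dataframe looks something like this sketch (server, credentials, and table name are placeholders):

%spark
val jdbcDF = sqlContext.read.format("jdbc")
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("url", "jdbc:sap://[server]:[port]")
    .option("user", "[username]")
    .option("password", "[password]")
    .option("dbtable", "[schema.table]")
    .load()
jdbcDF.show(10)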

This will get you started on using Apache Zeppelin with SAP HANA. We have further tutorials on how to build interfaces in Zeppelin:

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint