
HANA Zeppelin Query Builder with Map Visualization

SAP HANA Query Builder On Apache Zeppelin Demo

While working with Apache Zeppelin I found that users wanted a way to explore data and build charts without needing to know SQL right away. This post describes a Zeppelin note that lets a new data scientist get familiar with the data structure of their database, build simple single-table queries that can quickly be turned into charts and maps, and see the SQL used to perform the work.

Demo

This video demonstrates how it works. I leveraged Randy Gelhausen’s query builder post on how to make a where clause builder, along with Damien Sorel’s jQuery QueryBuilder. These were used to make a series of paragraphs that look up tables and columns in HANA and let the user build a custom query. The resulting data can be quickly graphed using the Zeppelin Helium visualizations.

The Code

This section is for data scientists and coders who want to replicate this in their own Zeppelin.

Note that this code is imperfect as I have not worked out all the issues with it. You may need to make changes to get it to work. It only works on the Zeppelin 0.8.0 snapshot, and it is written to work with SAP HANA as the database.

It supports only one type of aggregation, sum, and it has no way to add a HAVING clause, but these features could easily be added.

This Zeppelin note is dependent on code from a previous post. Follow the directions in Using Zeppelin to Explore a Database first.

Paragraph One

%spark
import org.apache.spark.sql.Row

// Get the list of column names for a given table
def columns1(table: String) : Array[String] = {
  sqlContext.sql("select * from " + table + " limit 0").columns
}

// Get the columns as (value, label) pairs for the Angular select lists
def columns(table: String) : Array[(String, String)] = {
  sqlContext.sql("select * from " + table + " limit 0").columns.map(x => (x, x))
}

// Get the INTEGER columns so the where-clause builder can assign numeric filters
def number_column_types(table: String) : Array[String] = {
  val columnType = sqlContext.sql("select column_name from table_columns where table_name='" +
    table + "' and data_type_name = 'INTEGER'")

  columnType.map { case Row(column_name: String) => column_name }.collect()
}

// set up the tables select list
val tables = sqlContext.sql("show tables").collect.map(s=>s(1).asInstanceOf[String].toUpperCase())
z.angularBind("tables", tables)
var sTable ="tables"
z.angularBind("selectedTable", sTable)


z.angularUnwatch("selectedTable")
z.angularWatch("selectedTable", (before:Object, after:Object) => {
 println("running " + after)
 sTable = after.asInstanceOf[String]
 // put the id for paragraph 2 and 3 here
 z.run("20180109-121251_268745664")
 z.run("20180109-132517_167004794")
})


var col = columns1(sTable)
col = col :+ "*"
z.angularBind("columns", col)
// hack to make the where clause work on initial load
var col2 = columns(sTable)
var extra = ("1","1")
col2 = col2 :+ extra
z.angularBind("columns2", col2)
var colTypes = number_column_types(sTable)
z.angularBind("numberColumns", colTypes)
var sColumns = Array("*")
// hack to make the where clause work on initial load
var clause = "1=1"
var countColumn = "*"
var limit = "10"

// setup for the columns select list
z.angularBind("selectedColumns", sColumns)
z.angularUnwatch("selectedColumns")
z.angularWatch("selectedColumns", (before:Object, after:Object) => {
 sColumns = after.asInstanceOf[Array[String]]
 // put the id for paragraph 2 and 3 here
 z.run("20180109-121251_268745664")
 z.run("20180109-132517_167004794")
})
z.angularBind("selectedCount", countColumn)
z.angularUnwatch("selectedCount")
z.angularWatch("selectedCount", (before:Object, after:Object) => {
 countColumn = after.asInstanceOf[String]
})
// bind the where clause
z.angularBind("clause", clause)
z.angularUnwatch("clause")
z.angularWatch("clause", (oldVal, newVal) => {
 clause = newVal.asInstanceOf[String]
})

z.angularBind("limit", limit)
z.angularUnwatch("limit")
z.angularWatch("limit", (oldVal, newVal) => {
 limit = newVal.asInstanceOf[String]
})

This paragraph is Scala code that defines functions for querying the table that lists tables and the table that lists columns. The tables must be loaded into Spark as views or tables in order to appear in the select lists. The paragraph also performs all of the Angular binding so that the next paragraph, which is Angular code, can use the data built here.

Paragraph Two

%angular
<link rel="stylesheet" href="https://cdn.rawgit.com/mistic100/jQuery-QueryBuilder/master/dist/css/query-builder.default.min.css">
<script src="https://cdn.rawgit.com/mistic100/jQuery-QueryBuilder/master/dist/js/query-builder.standalone.min.js"></script>

<script type="text/javascript">
  var button = $('#generateQuery');
  var qb = $('#builder');
  var whereClause = $('#whereClause');
 
  button.click(function(){
    whereClause.val(qb.queryBuilder('getSQL').sql);
    whereClause.trigger('input'); //triggers Angular to detect changed value
  });
 
  // this builds the where statement builder
  var el = angular.element(qb.parent('.ng-scope'));
  angular.element(el).ready(function(){
    var integer_columns = angular.element('#numCol').val()
    //Executes on page-load and on update to 'columns', defined in first snippet
    window.watcher = el.scope().compiledScope.$watch('columns2', function(newVal, oldVal) {
      //Append each column to QueryBuilder's list of filters
      var options = {allowEmpty: true, filters: []}
      $.each(newVal, function(i, v){
        if(integer_columns.split(',').indexOf(v._1) !== -1){
          options.filters.push({id: v._1, type: 'integer'});
        } else if(v._1.indexOf("DATE") !== -1) {
          options.filters.push({id: v._1, type: 'date'})
        } else { 
          options.filters.push({id: v._1, type: 'string'});
        }
      });
      qb.queryBuilder(options);
    });
  });
</script>
<input type="text" ng-model="numberColumns" id="numCol"></input>
<form class="form-inline">
 <div class="form-group">
 Please select table: Select Columns:<br>
 <select size=5 ng-model="selectedTable" ng-options="o as o for o in tables" 
       data-ng-change="z.runParagraph('20180109-151738_134370871')"></select>
 <select size=5 multiple ng-model="selectedColumns" ng-options="o as o for o in columns">
 <option value="*">*</option>
 </select>
 Sum Column:
 <select ng-model="selectedCount" ng-options="o as o for o in columns">
 <option value="*">*</option>
 </select>
 <label for="limitId">Limit: </label> <input type="text" class="form-control" 
       id="limitId" placeholder="Limit Rows" ng-model="limit"></input>
 </div>
</form>
<div id="builder"></div>
<button type="submit" id="generateQuery" class="btn btn-primary" 
       ng-click="z.runParagraph('20180109-132517_167004794')">Run Query</button>
<input id="whereClause" type="text" ng-model="clause" class="hide"></input>

<h3>Query: select {{selectedColumns.toString()}} from {{selectedTable}} where {{clause}} 
   with a sum on: {{selectedCount}} </h3>

Paragraph two uses the jQuery and jQuery QueryBuilder JavaScript libraries. In the z.runParagraph command, use the paragraph id of paragraph three.

Paragraph Three

The results of the query show up in this paragraph. Its function is to generate the query and run it for display.

%spark
import scala.collection.mutable.ArrayBuffer

var selected_count_column = z.angular("selectedCount").asInstanceOf[String]
var selected_columns = z.angular("selectedColumns").asInstanceOf[Array[String]]
var limit = z.angular("limit").asInstanceOf[String]
var limit_clause = ""
if (limit != "*") {
 limit_clause = "limit " + limit
}
val countColumn = z.angular("selectedCount")
var selected_columns_n = selected_columns.toBuffer
// remove from list of columns
selected_columns_n -= selected_count_column

if (countColumn != "*") {
 val query = "select "+ selected_columns_n.mkString(",") + ", sum(" + selected_count_column +
     ") "+ selected_count_column +"_SUM from " + z.angular("selectedTable") + " where " + 
      z.angular("clause") + " group by " + selected_columns_n.mkString(",") + " " + 
      limit_clause
 println(query)
 z.show(sqlContext.sql(query))
} else {
 val query2 = "select "+ selected_columns.mkString(",") +" from " + z.angular("selectedTable") + 
      " where " + z.angular("clause") + " " + limit_clause
 println(query2)
 z.show(sqlContext.sql(query2))
}

If everything is set up correctly you will be able to query your tables without writing SQL. This is a limited example, as I have not provided options for different types of aggregation, advanced grouping, or joins across multiple tables.
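One way to extend it would be to let the user pick the aggregation function as well. The sketch below is a hypothetical addition to paragraph one; the variable names (aggFunctions, selectedAgg) are my own and not part of the original note.

%spark
// Hypothetical extension: bind a list of aggregation functions for the user to pick from.
// The Angular variable names here are illustrative, not part of the original note.
val aggFunctions = Array("SUM", "AVG", "MIN", "MAX", "COUNT")
var aggFunction = "SUM"
z.angularBind("aggFunctions", aggFunctions)
z.angularBind("selectedAgg", aggFunction)
z.angularUnwatch("selectedAgg")
z.angularWatch("selectedAgg", (before: Object, after: Object) => {
  aggFunction = after.asInstanceOf[String]
})

// Paragraph three would then splice the chosen function into the generated SQL, e.g.
// "select " + selected_columns_n.mkString(",") + ", " + aggFunction + "(" +
//   selected_count_column + ") from " + z.angular("selectedTable") + " where " + ...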

 

Please follow us on our website at https://volumeintegration.com and on Twitter at @volumeint.

Using Zeppelin to Explore a Database

While getting started with Apache Zeppelin I found it difficult to simply explore a new database. This was the situation when connecting an SAP HANA database to Apache Zeppelin using the JDBC driver.

So I created a Zeppelin interface that can be used by a person who does not know how to code or use SQL.

The result is a note with code in multiple paragraphs that lets a person see a list of all the tables in the database, view the structure of each one, and look at a sample of its data.

Volume Analytics Table Explorer – HANA & Zeppelin

When using a standard database with Apache Zeppelin, each table needs to be registered with Spark so that Spark can query it and build DataFrames from the native tables. I got around this by allowing the user to choose the tables they want to register into Apache Zeppelin and Spark. The registration uses the createOrReplaceTempView function on a DataFrame, which lets us retain the speed of HANA without copying all the data into a Spark table.
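A minimal sketch of that registration step looks like the following; the connection values and table name are placeholders, not the actual ones from this note.

%spark
// Minimal sketch: expose one HANA table to Spark without copying the data.
// Connection values and table name are placeholders.
val df = sqlContext.read.format("jdbc")
 .option("driver", "com.sap.db.jdbc.Driver")
 .option("url", "jdbc:sap://hana-host:30015")
 .option("user", "username")
 .option("password", "password")
 .option("dbtable", "MY_SCHEMA.MY_TABLE")
 .load()

// Other notes on the Zeppelin server can now query the view MY_TABLE.
df.createOrReplaceTempView("MY_TABLE")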

The video shows a short demonstration of how this works.

Once tables are registered as Spark views they can be used by all the other notes on the Apache Zeppelin server. This means that other users can leverage the tables without knowing they came from the HANA database.

The code is specific to HANA because of the names of the system tables where it stores the lists of tables and column names. The code also converts HANA-specific data types such as ST_POINT to comma-delimited strings.

This example of dynamic forms was informed by Data-Driven Dynamic Forms in Apache Zeppelin.

The Code

Be aware that this is prototype code that works on the Zeppelin 0.8.0 snapshot, which as of this writing is pre-release and needs to be built from source.

First Paragraph

In the first paragraph I load the HANA JDBC driver. You can avoid this step by adding your JDBC jar to the dependencies section of the interpreter configuration, as laid out in How to Use Zeppelin With SAP HANA.

%dep
z.reset() 
z.load("/projects/zeppelin/interpreter/jdbc/ngdbc.jar")

Second Paragraph

In the second paragraph we build DataFrames from the HANA system tables that contain the list of tables and columns in the database. These will be used to show the user what tables and columns are available for data analysis.

%spark
import org.apache.spark.sql._
val driver ="com.sap.db.jdbc.Driver"
val url="jdbc:sap://120.12.83.105:30015/ffa"
val database = "dbname"
val username = "username"
val password = "password"
// type in the schemas you wish to expose
val tables = """(select * from tables where schema_name in ('FFA', 'SCHEMA_B')) a """
val columns = """(select * from table_columns where schema_name in ('FFA', 'SCHEMA_B')) b """

val jdbcDF = sqlContext.read.format("jdbc").option("driver",driver)
 .option("url",url)
 .option("databaseName", database)
 .option("user", username)
 .option("password",password)
 .option("dbtable", tables).load()
jdbcDF.createOrReplaceTempView("tables")

val jdbcDF2 = sqlContext.read.format("jdbc").option("driver",driver)
 .option("url",url)
 .option("databaseName", database)
 .option("user", username)
 .option("password",password)
 .option("dbtable", columns).load()
jdbcDF2.createOrReplaceTempView("table_columns")

Third Paragraph

The third paragraph contains the Spark / Scala functions used by the fourth paragraph. These functions return the column names and types when given a table name, and include the function that loads a HANA table into a Spark table view.

%spark
//Get list of distinct values on a column for given table
def distinctValues(table: String, col: String) : Array[(String, String)] = {
 sqlContext.sql("select distinct " + col + " from " + table + " order by " + col).collect.map(x => (x(0).asInstanceOf[String], x(0).asInstanceOf[String]))
}

def distinctWhere(table: String, col: String, schema: String) : Array[(String, String)] = {
 var results = sqlContext.sql("select distinct " + col + " from " + table + " where schema_name = '" + schema +"' order by " + col)
 results.collect.map(x => (x(0).asInstanceOf[String], x(0).asInstanceOf[String]))
}

//Get list of tables
def tables(): Array[(String, String)] = {
 sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String].toUpperCase(), x(1).asInstanceOf[String].toUpperCase()))
}

//Get list of columns on a given table
def columns(table: String) : Array[(String, String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => (x, x))
}

def hanaColumns(schema: String, table: String): Array[(String, String)] = {
 sqlContext.sql("select column_name, data_type_name from table_columns where schema_name = '"+ schema + "' and table_name = '" + table+"'").collect.map(x => (x(0).asInstanceOf[String], x(1).asInstanceOf[String]))
}

//load table into spark
def loadSparkTable(schema: String, table: String) : Unit = {
  var columns = hanaColumns(schema, table)
  var tableSql = "(select "
  for (c <- columns) {
    // If this column is a geo datatype convert it to a string
    if (c._2 == "ST_POINT" || c._2 == "ST_GEOMETRY") {
      tableSql = tableSql + c._1 + ".st_y()|| ',' || " + c._1 + ".st_x() " + c._1 + ", "
    } else {
      tableSql = tableSql + c._1 + ", "
    }
  }
 tableSql = tableSql.dropRight(2)
 tableSql = tableSql + " from " + schema +"."+table+") " + table

 val jdbcDF4 = sqlContext.read.format("jdbc").option("driver",driver)
  .option("url",url)
  .option("databaseName", "FFA")
  .option("user", username)
  .option("password", password)
  .option("dbtable", tableSql).load()
  jdbcDF4.createOrReplaceTempView(table)
 
}

//Wrapper for printing any DataFrame in Zeppelin table format
def printQueryResultsAsTable(query: String) : Unit = {
 val df = sqlContext.sql(query)
 print("%table\n" + df.columns.mkString("\t") + '\n'+ df.map(x => x.mkString("\t")).collect().mkString("\n")) 
}

def printTableList(): Unit = {
 println(sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String])).mkString("%table\nTables Loaded\n","\n","\n"))
}

// this part keeps a list of the tables that have been registered for reference
val aRDD = sc.parallelize(sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String])))
val aDF = aRDD.toDF()
aDF.registerTempTable("tables_loaded")

Fourth Paragraph

The fourth paragraph contains the Spark code needed to produce the interface with select lists for picking the tables. It uses dynamic forms as described in the Zeppelin documentation and illustrated in more detail by Rander Zander.

%spark
val schema = z.select("Schemas", distinctValues("tables","schema_name")).asInstanceOf[String]
var table = z.select("Tables", distinctWhere("tables", "table_name", schema)).asInstanceOf[String]
val options = Seq(("yes","yes"))
val load = z.checkbox("Register & View Data", options).mkString("")

val query = "select column_name, data_type_name, length, is_nullable, comments from table_columns where schema_name = '" + schema + "' and table_name = '" + table + "' order by position"
val df = sqlContext.sql(query)


if (load == "yes") { 
 if (table != null && !table.isEmpty()) {
   loadSparkTable(schema, table)
   z.run("20180108-113700_1925475075")
 }
}

if (table != null && !table.isEmpty()) {
 println("%html <h1>"+schema)
 println(table + "</h1>")
 z.show(df)
} else {
 println("%html <h1>Pick a Schema and Table</h1>")
}

As the user changes the schema select list, the functions from paragraph three are called and the tables select list is populated with that schema's tables. When the user selects a table, the paragraph refreshes with a table showing details about its columns, such as their types and sizes.

When they select the Register & View Data checkbox, the table is turned into a Spark view and paragraph five shows the contents of the table. Note the z.run command: it runs a specific paragraph, and you need to put in your own value here, namely the paragraph id of the next paragraph, which is paragraph five.

Paragraph Five

%spark
z.show(sql("select * from " + table +" limit 100"))

The last paragraph lists the first 100 rows of the table that has been selected and registered with the Register & View Data checkbox.

Slight modifications of this code will allow the same sort of interface to be built for MySQL, Postgres, Oracle, MS-SQL or any other database.
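As a rough, untested sketch of what such a modification might look like for PostgreSQL: only the driver, URL, and system catalog queries in the second paragraph need to change, with information_schema taking the place of HANA's TABLES and TABLE_COLUMNS. The connection values below are placeholders.

%spark
// Untested sketch: the same setup against PostgreSQL instead of HANA.
// Only the driver, URL, and catalog queries change; the later paragraphs stay the same.
val driver = "org.postgresql.Driver"
val url = "jdbc:postgresql://pg-host:5432/mydb"
val username = "username"
val password = "password"

val tables = """(select table_schema as schema_name, table_name
 from information_schema.tables
 where table_schema in ('public')) a"""
val columns = """(select table_schema as schema_name, table_name, column_name,
 data_type as data_type_name
 from information_schema.columns
 where table_schema in ('public')) b"""

sqlContext.read.format("jdbc").option("driver", driver).option("url", url)
 .option("user", username).option("password", password)
 .option("dbtable", tables).load().createOrReplaceTempView("tables")

sqlContext.read.format("jdbc").option("driver", driver).option("url", url)
 .option("user", username).option("password", password)
 .option("dbtable", columns).load().createOrReplaceTempView("table_columns")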

Now go to SAP HANA Query Builder On Apache Zeppelin Demo and you will find code to build a simple query builder note.

Please let us know on Twitter, Facebook, or LinkedIn if this helps you or if you find a better way to do this in Zeppelin.

Visualizing HANA Graph with Zeppelin

The SAP HANA database can store information as a graph: a data structure with vertices (nodes) and edges. Graph structures are powerful because some types of analysis, such as nearest neighbor or shortest path calculations, can be performed more quickly, and they also enable faceted searching. Zeppelin has recently added support for displaying network graphs.

Simple Network Graph

Zeppelin 0.8 has support for network graphs. You need to download the Zeppelin 0.8 snapshot and build it to get these features. The simple network graph above is produced by a small %network payload.
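A rough sketch of a %spark paragraph that emits such a payload, loosely following the example in the Zeppelin documentation (the labels, nodes, and edges here are sample values), is:

%spark
// Rough sketch: print a %network payload from a %spark paragraph.
// Keys and sample values loosely follow the Zeppelin documentation example; adjust as needed.
println("""%network {
 "labels": {"User": "#8BC34A", "Mail": "#FF9800"},
 "directed": true,
 "nodes": [
  {"id": 1, "label": "User", "data": {"fullName": "Andrea Santurbano"}},
  {"id": 2, "label": "Mail", "data": {"title": "Spark 2.0"}}
 ],
 "edges": [
  {"id": 1, "source": 2, "target": 1, "label": "SENT"}
 ]
}""")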

This format is described in the Zeppelin documentation. Note that the part of the JSON that holds the attributes for a node or edge lives in an inner JSON object called “data”. This is how each node and edge can carry different data depending on what type of node or edge it is.

"data": {"fullName":"Andrea Santurbano"}}

Because I happen to be learning the features of SAP HANA, I wanted to display a graph from HANA in Zeppelin. A previous post shows how to connect Zeppelin to HANA.

I am using a series of paragraphs to build the data and then visualize the graph. I have already built my graph workspace in HANA with help from Creating Graph Database Objects and the HANA Graph Reference. The difficult part is transforming the data and relationships so they fit into a vertex / node table and an edge table.

I am using sample data from meetup.com which contains events organized by groups and attended by members and held at venues.

It is important to figure out which attributes should exist on the edge and which ones should be in the nodes. HANA is good at allowing sparse data in the attributes because of the way it stores data in a columnar form. If you wish to display the data on a map and in a graph using the same structure supplied by Zeppelin it is important to put your geo coordinates in the nodes and not in the edges.

First we connect to the database, load the node and edge tables with Spark / Scala, and build the DataFrames. One issue here was converting the HANA ST_POINT data type into latitude and longitude values. I defined a select statement with the HANA functions ST_X() and ST_Y() to perform the conversion before the data is put into the DataFrame.
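A sketch of that conversion looks like the following; the schema, table, and column names are illustrative, and the connection values are placeholders.

%spark
// Sketch: convert a HANA ST_POINT column to plain latitude / longitude columns at read time.
// Schema, table, and column names are illustrative; connection values are placeholders.
val venueSql = """(select VENUE_ID, VENUE_NAME,
 LOCATION.ST_Y() LATITUDE,
 LOCATION.ST_X() LONGITUDE
 from MEETUP.VENUES) venues"""

val venuesDF = sqlContext.read.format("jdbc")
 .option("driver", "com.sap.db.jdbc.Driver")
 .option("url", "jdbc:sap://hana-host:30015")
 .option("user", "username")
 .option("password", "password")
 .option("dbtable", venueSql)
 .load()
venuesDF.createOrReplaceTempView("venues")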

You will have problems and errors if your tables in HANA contain null values. Databases don't mind nulls, but Scala does not tolerate them well, so you have to convert any columns that could be null to something that makes sense. In this case I converted varchar columns to empty strings and double columns to 0.0.
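Assuming a DataFrame like venuesDF from the sketch above, the na.fill functions give a compact way to do this: fill("") replaces nulls in the string columns and fill(0.0) replaces nulls in the double columns.

%spark
// Sketch: replace nulls before building the graph JSON.
val cleanedDF = venuesDF.na.fill("").na.fill(0.0)
cleanedDF.createOrReplaceTempView("venues")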

Then I query the DataFrames to get the data needed for the visualization and transform it into JSON strings for the collections of nodes and edges. In the end the note outputs two JSON arrays: one for the nodes and one for the edges.
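A sketch of that transformation, assuming node and edge views with illustrative names and columns, could look like this:

%spark
// Sketch: turn the node and edge views into the two JSON arrays the note outputs.
// View and column names are illustrative.
val nodesJson = sqlContext.sql("select VENUE_ID, VENUE_NAME, LATITUDE, LONGITUDE from venues")
 .collect()
 .map(r => s"""{"id": ${r(0)}, "label": "venue", "data": {"name": "${r(1)}", "latitude": ${r(2)}, "longitude": ${r(3)}}}""")
 .mkString("[", ",", "]")

val edgesJson = sqlContext.sql("select EDGE_ID, SOURCE_ID, TARGET_ID, RELATION from edges")
 .collect()
 .map(r => s"""{"id": ${r(0)}, "source": ${r(1)}, "target": ${r(2)}, "label": "${r(3)}"}""")
 .mkString("[", ",", "]")

// The two arrays can then be combined into a single %network payload as in the earlier sketch.
println(nodesJson)
println(edgesJson)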


Now we will visualize the data using the new Zeppelin directive called %network.

In my example data extracted from meetup.com, I have four types of nodes: venue, member, group, and event. These are defined as labels. My edges, or relationships, are defined as held, sponsored, and rsvp; these become the lines on the graph. Zeppelin combines the data from the edges and nodes into a single table view.

So in tabular format it will look like this:

Zeppelin Edges

 

Zeppelin Nodes

When you press the network button in Zeppelin the graph diagram appears.

Zeppelin Network Graph Diagram

Under settings you can specify what data displays on the screen. It does not allow you to specify which edge label is displayed and does not seem to support a weight option.

Network Graph Settings

If you select the map option and have the Leaflet visualization loaded, you can show the data on a map. Since I put the coordinates in the edges, it maps the edge data. It would be better if I moved the coordinates into each node so the nodes could be displayed on the map.

Zeppelin Graph On a Map

Hopefully this will help you get past some of the issues I had getting the Zeppelin network diagram feature to work with HANA, and perhaps with other databases.

In the future I hope to show how to call HANA features such as nearest neighbor, shortest path, and pattern matching algorithms to enhance the graph capabilities in Zeppelin.

Please follow us on our website at https://volumeintegration.com and on Twitter at @volumeint.

Mapping an Epidemic

This map changed the way we see the world and the way we study science, nature, and disease.

In August of 1854, cholera was ravaging the Soho neighborhood of London, where John Snow was a doctor. People were fleeing the area as they thought cholera was spread by gasses in the air or, as they called it, “bad air.”

Just as there is disinformation today about Ebola being airborne, the experts of that time thought most disease was spread in the air. There was no concept that disease might be in the water. They had no idea that bacteria even existed.

John had worked as a doctor in a major outbreak of cholera in a mine. But despite working in close quarters with the miners, he never contracted the disease. He wondered why the air did not affect him.

This inspired him to write a paper on why he believed cholera was spread through water and bodily fluids. The experts at the time did not accept his theory; they continued to believe cholera was caused by the odors emitted by rotting waste.

In the Soho outbreak in August 1854, John Snow saw a chance to further prove his theory. He went door to door keeping a tally of deaths at each home. This was only part of his quest to find evidence to prove the source of the plagues of the day.

He had been collecting statistical information, personal interviews, and other research for many years. He added this information to his paper, “On the Mode of Communication of Cholera.” The paper and his work in researching and collecting evidence founded the science of epidemiology.

One of the most innovative features was plotting data using a map; it was the first published use of dots on a map supporting a scientific conclusion. Each of the bars on John Snow’s map represents one death. Using this visual technique, he could illustrate that the deaths were centered around a point and further investigate and interview people in the area. He could also find anomalies and outliers such as deaths far from the concentration and areas with no deaths.

Epicenter Pump and Brewery

He found through personal interviews and mapping the data that the workers in the brewery (in the epicenter of the epidemic) were not dying. The owner of the brewery said that the workers were given free beer, and he thought that they never drank water at all. In fact, there was a deep well in the brewery used in the beer. In other cases, John Snow found that addresses with low deaths had their own personal well.

He also investigated the outlying incidents through interviews: some worked in the area of the pump or walked by it on the way to school. One woman who got sick had the water brought to her by a wagon each day because she liked the taste of that particular well water. One person he talked to even said the water smelled like sewage and did not drink it, but his servant did and came down with a case of cholera.

The incidents highlighted the area around a public pump on Broad Street. Using his data, he convinced the local authorities to have the pump handle removed.

The most innovative feature of the map is that it changed the way we use maps. The idea that data could be visualized to prove a fact was very new.

John Snow’s map of the service areas of two water companies

John Snow also produced another map showing which water companies supplied water in London. This map showed that the water company which stopped using water from the Thames had a lower death rate due to cholera. The map allowed John Snow to provide further evidence of disease spread through water and what could be done to fix the issue.

This is similar to the Ebola outbreak of today where tracking the disease is important. John Snow’s idea of collecting data in the field and mapping it lives on in maps like those from HealthMap, which show the spread of the Ebola virus.

Data Exploration via Map

Today, we use data-driven maps as a powerful tool for all sorts of reasons. But it all started with John Snow.

(For an interesting take on this event and other historical technology that changed the way we live today, watch the “Clean” episode of the How We Got to Now series on PBS.)

To learn more about Volume Labs and Volume Integration, please follow us on Twitter @volumeint and check out our website.