Programming

Our Volume Integration posts that are about programming or contain code examples.

Volume Analytics Chaos Control

Volume Analytics is a software tool used to build, deploy and manage data processing applications.

Volume Analytics is a scalable data management platform that allows the rapid ingest, transformation, and loading of high volumes of data into multiple analytic models, as defined by your requirements or your existing data models.

Volume Analytics is a platform for streaming large volumes of varied data at high velocity.

Volume Analytics is a tool that enables both rapid software development and operational maintainability, with scalability for high data volumes. It can be used for all of your data mining, fusion, extraction, transformation and loading needs. It has been used to mine and analyze social media feeds, monitor and alert on insider threats, and automate the search for cyber threats. It is also being used to consolidate data from many data sources (databases, HDFS, file systems, data lakes) and to produce multiple data models for multiple data analytics visualization tools. It could also be used to consolidate sensor data from IoT devices or to monitor a SCADA industrial control network.

Volume Analytics makes it easy to quickly develop highly redundant software that is both scalable and maintainable. In the end you save money on labor for the development and maintenance of systems built with Volume Analytics.

In other words, Volume Analytics provides the plumbing of a data processing system. The application you are building has distinct units of work that need to be done. We might compare it to a water treatment plant. Dirty water comes into the system through a pipe and arrives at a large contaminant filter. The filter is a work task and the pipe is a topic. Together they make a flow.

After the first filter, another pipe carries the water, minus the dirt, to another purification worker. In the water plant there is a dashboard for the managers to monitor the system and see whether they need to fix something or add more pipes and cleaning tasks.

Volume Analytics provides the pipes, a platform to run the worker tasks and a management tool to control the flow of data through the system.

A Volume Analytics Flow for Finding Social Media Bots

In addition, Volume Analytics has redundancy for disaster recovery, high availability and parallel processing. This is where our analogy fails. Data is duplicated across multiple topics, so the failure of a particular topic (pipe) does not destroy any data because it is preserved on another topic. Topics are optimally set up in multiple data centers to maintain high availability.

In Volume Analytics, the water filters in the analogy are called tasks. Tasks are groups of code that perform some unit of work. Your specific application will have its own tasks. The tasks are deployed on more than one server in more than one data center.

Benefits

Faster start-up saves time and money.

Volume Analytics allows a faster start-up for a new application or system. The team does not need to build the platform that moves the data to tasks, and they do not need to build a monitoring system, as those features are included. Volume Analytics will also integrate with your current monitoring systems.

System is down less often

The DevOps team gets visibility into the system out of the box. They do not have to stand up a log search system, which saves time. They can see what is going on and fix problems quickly.

Plan for Growth

As your data grows and the system needs to process more of it, Volume Analytics grows with it. Add server instances to increase the processing power; as work grows, Volume Analytics allocates work to the new instances. There is no re-coding needed, so you save time and money because developers are not needed to re-implement the code to work at a larger scale.

Less Disruptive Deployments

Construct your application in a way that allows new features to be deployed with a lower impact on features already in production. New code libraries and modules can be deployed to the platform and allowed to interact with the already running parts of the system without an outage. A built-in code library repository is included.

In addition currently running flows can be terminated while the data waits on the topics for the newly programmed flow to be started.

This Flow processes files to find IP addresses, searches multiple APIs for matches and inserts data into a HANA database

A threat-search data processing flow in production. Each of the boxes is a task that performs a unit of work. The task puts the processed data on the topic represented by the star. Then the next task picks up the data and does another part of the job. The combination of a set of tasks and topics is a flow.

Geolocate IP Flow

An additional flow to geolocate IP addresses, added while the first flow is running.

Combined Flows

The combination of flows working together. The topic ip4-topic is an integration point.

Modular

Volume Analytics is modular and tasks are reusable. You can reconfigure your data processing pipeline without introducing new code. You can use tasks in more than one application.

Highly Available

Out of the box, Volume Analytics is highly available due to its built-in redundancy. Work tasks and topics (pipes) run in triplicate. As long as your compute instances are in multiple data centers you will have redundancy built in. Volume Analytics knows how to balance the data between duplicates and avoid data loss if one or more work tasks fail; this extends to queuing up work if all work tasks fail.

Integration

Volume Analytics integrates with other products. It can retrieve and save data to other systems like topics, queues, databases, file systems and data stores. In addition these integrations happen over encrypted channels.

In our sample application, CyberFlow, there are many tasks that integrate with other systems. The read bucket task reads files from an AWS S3 bucket, the ThreatCrowd task calls the API at https://www.threatcrowd.org, and the Honeypot task calls https://www.projecthoneypot.org. The insert tasks then integrate with the SAP HANA database used in this example.

Volume Analytics integrates with your enterprise authentication and authorization systems like LDAP, Active Directory, CAP and more.

Data Management

Volume Analytics ingests datasets from throughout the enterprise, tracking each delivery and routing it through the system to extract the greatest benefit. It shares common capabilities such as text extraction, sentiment analysis, categorization, and indexing. A series of services makes those datasets discoverable and available to authorized users and other downstream systems.

Data Analytics

In addition to the management console, Volume Analytics comes with a notebook application. This allows a data scientist or analyst to discover data and turn it into information on reports. After your data is processed by Volume Analytics and put into a database, the notebook can be used to visualize the data. The data is sliced and diced and displayed on graphs, charts and maps.

Volume Analytics Notebook

Flow Control Panel

The Flow control panel allows for control and basic monitoring of flows. Flows are groupings of tasks and topics working together. You can stop, start and terminate flows. From this screen you can also launch additional flow virtual machines when there is a heavy data processing load, and start up extra worker tasks as needed. There is also a link that will let you analyze the logs in Kibana.

Topic Control Panel

The topic control panel allows for the control and monitoring of topics. Monitor and delete topics from here.

Consumer Monitor Panel

The consumer monitor panel allows for the monitoring of consumer tasks. Consumer tasks are the tasks that read from a topic. They may also write to a topic. This screen will allow you to monitor that the messages are being processed and determine if there is a lag in the processing.

Volume Analytics is used by our customers to process data from many data streams and data sources quickly and reliably. In addition, it has enabled the production of prototype systems that scale up into enterprise systems without rebuilding and re-coding the entire system.

And now this tour of Volume Analytics leads into a video demonstration of how it all works together.

Demonstration Video

This video will further describe the features of Volume Analytics using an example application which parses IP addresses out of incident reports and searches other systems for indications of those IP addresses. The data is saved into a SAP HANA database.

Request a Demo Today

Volume Analytics is scalable, fast, maintainable and repeatable. Contact us to request a free demo and experience the power and efficiency of Volume Analytics today.

Contact

SAP HANA Query Builder On Apache Zeppelin Demo

HANA Zeppelin Query Builder with Map Visualization

In working with Apache Zeppelin I found that users wanted a way to explore data and build charts without needing to know SQL right away. This is an attempt to build a note in Zeppelin that lets a new data scientist get familiar with the data structure of their database. It also lets them build simple single-table queries so they can create charts and maps quickly. In addition it shows the SQL used to perform the work.

Demo

This video will demonstrate how it works. I leveraged Randy Gelhausen’s query builder post for the where clause builder, and I also used Damien Sorel’s jQuery Query Builder. These were used to make a series of paragraphs that look up tables and columns in HANA and allow the user to build a custom query. This data can be quickly graphed using the Zeppelin Helium visualizations.

The Code

This is for those data scientists and coders who want to replicate this in their own Zeppelin.

Note that this code is imperfect, as I have not worked out all the issues with it. You may need to make changes to get it to work. It only works on the Zeppelin 0.8.0 snapshot, and it is written to work with SAP HANA as the database.

It has only one type of aggregation (sum) and no way to add a HAVING clause, but these features could easily be added.

This Zeppelin note is dependent on code from a previous post. Follow the directions in Using Zeppelin to Explore a Database first.

Paragraph One

%spark
//Get list of columns on a given table
def columns1(table: String) : Array[(String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => x.asInstanceOf[String])
}

def columns(table: String) : Array[(String, String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => (x, x))
}

def number_column_types(table: String) : Array[String] = {
 var columnType = sqlContext.sql("select column_name from table_columns where table_name='" +
    table + "' and data_type_name = 'INTEGER'")
 
 columnType.map {case Row(column_name: String) => (column_name)}.collect()
}

// set up the tables select list
val tables = sqlContext.sql("show tables").collect.map(s=>s(1).asInstanceOf[String].toUpperCase())
z.angularBind("tables", tables)
var sTable ="tables"
z.angularBind("selectedTable", sTable)


z.angularUnwatch("selectedTable")
z.angularWatch("selectedTable", (before:Object, after:Object) => {
 println("running " + after)
 sTable = after.asInstanceOf[String]
 // put the id for paragraph 2 and 3 here
 z.run("20180109-121251_268745664")
 z.run("20180109-132517_167004794")
})


var col = columns1(sTable)
col = col :+ "*"
z.angularBind("columns", col)
// hack to make the where clause work on initial load
var col2 = columns(sTable)
var extra = ("1","1")
col2 = col2 :+ extra
z.angularBind("columns2", col2)
var colTypes = number_column_types(sTable)
z.angularBind("numberColumns", colTypes)
var sColumns = Array("*")
// hack to make the where clause work on initial load
var clause = "1=1"
var countColumn = "*"
var limit = "10"

// setup for the columns select list
z.angularBind("selectedColumns", sColumns)
z.angularUnwatch("selectedColumns")
z.angularWatch("selectedColumns", (before:Object, after:Object) => {
 sColumns = after.asInstanceOf[Array[String]]
 // put the id for paragraph 2 and 3 here
 z.run("20180109-121251_268745664")
 z.run("20180109-132517_167004794")
})
z.angularBind("selectedCount", countColumn)
z.angularUnwatch("selectedCount")
z.angularWatch("selectedCount", (before:Object, after:Object) => {
 countColumn = after.asInstanceOf[String]
})
// bind the where clause
z.angularBind("clause", clause)
z.angularUnwatch("clause")
z.angularWatch("clause", (oldVal, newVal) => {
 clause = newVal.asInstanceOf[String]
})

z.angularBind("limit", limit)
z.angularUnwatch("limit")
z.angularWatch("limit", (oldVal, newVal) => {
 limit = newVal.asInstanceOf[String]
})

This paragraph is Scala code that sets up functions used to query the table containing the list of tables and the table containing the list of columns. You must have the tables loaded into Spark as views or tables in order to see them in the select lists. This paragraph performs all the binding so that the next paragraph, which is Angular code, can use the data built here.

Paragraph Two

%angular
<link rel="stylesheet" href="https://cdn.rawgit.com/mistic100/jQuery-QueryBuilder/master/dist/css/query-builder.default.min.css">
<script src="https://cdn.rawgit.com/mistic100/jQuery-QueryBuilder/master/dist/js/query-builder.standalone.min.js"></script>

<script type="text/javascript">
  var button = $('#generateQuery');
  var qb = $('#builder');
  var whereClause = $('#whereClause');
 
  button.click(function(){
    whereClause.val(qb.queryBuilder('getSQL').sql);
    whereClause.trigger('input'); //triggers Angular to detect changed value
  });
 
  // this builds the where statement builder
  var el = angular.element(qb.parent('.ng-scope'));
  angular.element(el).ready(function(){
    var integer_columns = angular.element('#numCol').val()
    //Executes on page-load and on update to 'columns', defined in first snippet
    window.watcher = el.scope().compiledScope.$watch('columns2', function(newVal, oldVal) {
      //Append each column to QueryBuilder's list of filters
      var options = {allowEmpty: true, filters: []}
      $.each(newVal, function(i, v){
        if(integer_columns.split(',').indexOf(v._1) !== -1){
          options.filters.push({id: v._1, type: 'integer'});
        } else if(v._1.indexOf("DATE") !== -1) {
          options.filters.push({id: v._1, type: 'date'})
        } else { 
          options.filters.push({id: v._1, type: 'string'});
        }
      });
      qb.queryBuilder(options);
    });
  });
</script>
<input type="text" ng-model="numberColumns" id="numCol"></input>
<form class="form-inline">
 <div class="form-group">
 Please select table: Select Columns:<br>
 <select size=5 ng-model="selectedTable" ng-options="o as o for o in tables" 
       data-ng-change="z.runParagraph('20180109-151738_134370871')"></select>
 <select size=5 multiple ng-model="selectedColumns" ng-options="o as o for o in columns">
 <option value="*">*</option>
 </select>
 Sum Column:
 <select ng-model="selectedCount" ng-options="o as o for o in columns">
 <option value="*">*</option>
 </select>
 <label for="limitId">Limit: </label> <input type="text" class="form-control" 
       id="limitId" placeholder="Limit Rows" ng-model="limit"></input>
 </div>
</form>
<div id="builder"></div>
<button type="submit" id="generateQuery" class="btn btn-primary" 
       ng-click="z.runParagraph('20180109-132517_167004794')">Run Query</button>
<input id="whereClause" type="text" ng-model="clause" class="hide"></input>

<h3>Query: select {{selectedColumns.toString()}} from {{selectedTable}} where {{clause}} 
   with a sum on: {{selectedCount}} </h3>

Paragraph two uses JavaScript libraries from jQuery and jQuery Query Builder. In the z.runParagraph command use the paragraph id from paragraph three.

Paragraph Three

The results of the query show up in this paragraph. Its function is to generate the query and run it for display.

%spark
import scala.collection.mutable.ArrayBuffer

var selected_count_column = z.angular("selectedCount").asInstanceOf[String]
var selected_columns = z.angular("selectedColumns").asInstanceOf[Array[String]]
var limit = z.angular("limit").asInstanceOf[String]
var limit_clause = ""
if (limit != "*") {
 limit_clause = "limit " + limit
}
val countColumn = z.angular("selectedCount")
var selected_columns_n = selected_columns.toBuffer
// remove from list of columns
selected_columns_n -= selected_count_column

if (countColumn != "*") {
 val query = "select "+ selected_columns_n.mkString(",") + ", sum(" + selected_count_column +
     ") "+ selected_count_column +"_SUM from " + z.angular("selectedTable") + " where " + 
      z.angular("clause") + " group by " + selected_columns_n.mkString(",") + " " + 
      limit_clause
 println(query)
 z.show(sqlContext.sql(query))
} else {
 val query2 = "select "+ selected_columns.mkString(",") +" from " + z.angular("selectedTable") + 
      " where " + z.angular("clause") + " " + limit_clause
 println(query2)
 z.show(sqlContext.sql(query2))
}

Now, if everything is just right, you will be able to query your tables without writing SQL. This is a limited example, as I have not provided options for different types of aggregation, advanced grouping or joins across multiple tables; a sketch of one possible extension follows.
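
If you want to go one step further, the aggregation could be made selectable instead of hard-coding sum. The sketch below is only an illustration of that idea and is not part of the note above; the names aggFunctions and selectedAgg are my own, and paragraph three would need to use the resulting expression in place of the hard-coded sum(...).

%spark
// Sketch: bind a list of aggregation functions so the user can pick one,
// mirroring the other z.angularBind / z.angularWatch calls in paragraph one.
val aggFunctions = Array("SUM", "AVG", "MIN", "MAX", "COUNT")
z.angularBind("aggFunctions", aggFunctions)
var aggFunction = "SUM"
z.angularBind("selectedAgg", aggFunction)
z.angularUnwatch("selectedAgg")
z.angularWatch("selectedAgg", (before: Object, after: Object) => {
 aggFunction = after.asInstanceOf[String]
})

// Paragraph three could then build its aggregate expression from the selection.
// "STARS" stands in for whatever column the user picked in the Sum Column select list.
var selected_count_column = "STARS"
val aggExpression = aggFunction + "(" + selected_count_column + ") " + selected_count_column + "_" + aggFunction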

 

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint.

Using Zeppelin to Explore a Database

In attempting to use Apache Zeppelin I found it difficult to simply explore a new database. This was the situation when connecting an SAP HANA database to Apache Zeppelin using the JDBC driver.

So I created a Zeppelin interface that can be used by a person who does not know how to code or use SQL.

This is a note with code in multiple paragraphs that lets a person see a list of all the tables in the database, view their structure, and look at a sample of the data in each table.

Volume Analytics Table Explorer – HANA & Zeppelin

When using a standard database with Apache Zeppelin, one needs to register each table into Spark so that it can query it and make DataFrames from the native tables. I got around this by allowing the user to choose the tables they want to register into Apache Zeppelin and Spark. This registration uses the createOrReplaceTempView function on a DataFrame, which allows us to retain the speed of HANA without copying all the data into a Spark table.
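
The registration itself is a small amount of code once you have a DataFrame. Here is a minimal sketch of the idea (the connection values and the table name MY_TABLE are placeholders; the loadSparkTable function later in this post builds the dbtable select dynamically and does the real work):

%spark
// Sketch: expose a JDBC-backed HANA table to Spark SQL without copying the data.
val sketchDF = sqlContext.read.format("jdbc")
 .option("driver", "com.sap.db.jdbc.Driver")
 .option("url", "jdbc:sap://hana-host:30015")   // placeholder connection string
 .option("user", "username")
 .option("password", "password")
 .option("dbtable", "MY_TABLE")
 .load()
// After this call the table can be queried from %sql and from other notes on the server.
sketchDF.createOrReplaceTempView("MY_TABLE")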

The video shows a short demonstration of how this works.

Once tables are registered as Spark views they can be used by all the other notes on the Apache Zeppelin server. This means that other users can leverage the tables without knowing they came from the HANA database.

The code is custom to HANA because of the names of the system tables where it stores the lists of tables and column names. The code also converts HANA specific data types such as ST_POINT to comma delimited strings.

This example of dynamic forms was informed by Data-Driven Dynamic Forms in Apache Zeppelin.

Previous posts on Apache Zeppelin and SAP Hana are:

The Code

Be aware this is prototype code that works on Zeppelin 0.8.0 Snapshot which as of today needs to be built from source. It is pre-release.

First Paragraph

In the first paragraph I am loading the HANA JDBC driver. You can avoid doing this by adding your JDBC jar to the dependencies section of the interpreter configuration, as laid out in How to Use Zeppelin With SAP HANA.

%dep
z.reset() 
z.load("/projects/zeppelin/interpreter/jdbc/ngdbc.jar")

Second Paragraph

In the second paragraph we build the Data Frames from tables in HANA that contain the list of tables and columns in the database. This will be used to show the user what tables and columns are available to use for data analysis.

%spark
import org.apache.spark.sql._
val driver ="com.sap.db.jdbc.Driver"
val url="jdbc:sap://120.12.83.105:30015/ffa"
val database = "dbname"
val username = "username"
val password = "password"
// type in the schemas you wish to expose
val tables = """(select * from tables where schema_name in ('FFA', 'SCHEMA_B')) a """
val columns = """(select * from table_columns where schema_name in ('FFA', 'SCHEMA_B')) b """

val jdbcDF = sqlContext.read.format("jdbc").option("driver",driver)
 .option("url",url)
 .option("databaseName", database)
 .option("user", username)
 .option("password",password)
 .option("dbtable", tables).load()
jdbcDF.createOrReplaceTempView("tables")

val jdbcDF2 = sqlContext.read.format("jdbc").option("driver",driver)
 .option("url",url)
 .option("databaseName", database)
 .option("user", username)
 .option("password",password)
 .option("dbtable", columns).load()
jdbcDF2.createOrReplaceTempView("table_columns")

Third Paragraph

The third paragraph contains the functions that will be used in the fourth paragraph, which needs to call Spark / Scala functions. These functions return the column names and types when a table name is given. It also has the function that loads a HANA table into a Spark table view.

%spark
//Get list of distinct values on a column for given table
def distinctValues(table: String, col: String) : Array[(String, String)] = {
 sqlContext.sql("select distinct " + col + " from " + table + " order by " + col).collect.map(x => (x(0).asInstanceOf[String], x(0).asInstanceOf[String]))
}

def distinctWhere(table: String, col: String, schema: String) : Array[(String, String)] = {
 var results = sqlContext.sql("select distinct " + col + " from " + table + " where schema_name = '" + schema +"' order by " + col)
 results.collect.map(x => (x(0).asInstanceOf[String], x(0).asInstanceOf[String]))
}

//Get list of tables
def tables(): Array[(String, String)] = {
 sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String].toUpperCase(), x(1).asInstanceOf[String].toUpperCase()))
}

//Get list of columns on a given table
def columns(table: String) : Array[(String, String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => (x, x))
}

def hanaColumns(schema: String, table: String): Array[(String, String)] = {
 sqlContext.sql("select column_name, data_type_name from table_columns where schema_name = '"+ schema + "' and table_name = '" + table+"'").collect.map(x => (x(0).asInstanceOf[String], x(1).asInstanceOf[String]))
}

//load table into spark
def loadSparkTable(schema: String, table: String) : Unit = {
  var columns = hanaColumns(schema, table)
  var tableSql = "(select "
  for (c <- columns) {
    // If this column is a geo datatype convert it to a string
    if (c._2 == "ST_POINT" || c._2 == "ST_GEOMETRY") {
      tableSql = tableSql + c._1 + ".st_y()|| ',' || " + c._1 + ".st_x() " + c._1 + ", "
    } else {
      tableSql = tableSql + c._1 + ", "
    }
  }
 tableSql = tableSql.dropRight(2)
 tableSql = tableSql + " from " + schema +"."+table+") " + table

 val jdbcDF4 = sqlContext.read.format("jdbc").option("driver",driver)
  .option("url",url)
  .option("databaseName", "FFA")
  .option("user", username)
  .option("password", password)
  .option("dbtable", tableSql).load()
  jdbcDF4.createOrReplaceTempView(table)
 
}

//Wrapper for printing any DataFrame in Zeppelin table format
def printQueryResultsAsTable(query: String) : Unit = {
 val df = sqlContext.sql(query)
 print("%table\n" + df.columns.mkString("\t") + '\n'+ df.map(x => x.mkString("\t")).collect().mkString("\n")) 
}

def printTableList(): Unit = {
 println(sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String])).mkString("%table\nTables Loaded\n","\n","\n"))
}

// this part keeps a list of the tables that have been registered for reference
val aRDD = sc.parallelize(sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String])))
val aDF = aRDD.toDF()
aDF.registerTempTable("tables_loaded")

Fourth Paragraph

The fourth paragraph contains the Spark code needed to produce the interface with select lists for picking the tables. It uses dynamic forms as described in the Zeppelin documentation and illustrated in more detail by Rander Zander.

%spark
val schema = z.select("Schemas", distinctValues("tables","schema_name")).asInstanceOf[String]
var table = z.select("Tables", distinctWhere("tables", "table_name", schema)).asInstanceOf[String]
val options = Seq(("yes","yes"))
val load = z.checkbox("Register & View Data", options).mkString("")

val query = "select column_name, data_type_name, length, is_nullable, comments from table_columns where schema_name = '" + schema + "' and table_name = '" + table + "' order by position"
val df = sqlContext.sql(query)


if (load == "yes") { 
 if (table != null && !table.isEmpty()) {
   loadSparkTable(schema, table)
   z.run("20180108-113700_1925475075")
 }
}

if (table != null && !table.isEmpty()) {
 println("%html <h1>"+schema)
 println(table + "</h1>")
 z.show(df)
} else {
 println("%html <h1>Pick a Schema and Table</h1>")
}

As the user changes the schema select list, paragraph 3 will be called and the tables select list will be populated with the new tables. When they select a table, the paragraph will refresh with a table containing some of the details about the table columns, like the column types and sizes.

When they select the Register & View checkbox, the table will be turned into a Spark view and paragraph five will contain the data contents of the table. Note the z.run command: this runs a specific paragraph, and you need to put in your own value here. It should be the paragraph id of the next paragraph, which is paragraph five.

Paragraph Five

%spark
z.show(sql("select * from " + table +" limit 100"))

The last paragraph lists the first 100 rows of the table that has been selected and has Register & View checked.

Slight modifications of this code will allow the same sort of interface to be built for MySQL, Postgres, Oracle, MS-SQL or any other database.

Now go to SAP HANA Query Builder On Apache Zeppelin Demo and you will find code to build a simple query builder note.

Please let us know on Twitter, Facebook or LinkedIn if this helps you or if you find a better way to do this in Zeppelin.

Previous posts on Apache Zeppelin and SAP Hana are:

 

Visualizing HANA Graph with Zeppelin

The SAP HANA database has the capability to store information in a graph. A graph is a data structure with vertices (nodes) and edges. Graph structures are powerful because you can perform some types of analysis more quickly, like nearest neighbor or shortest path calculations. It also enables faceted searching. Zeppelin has recently added support for displaying network graphs.

Simple Network Graph

Zeppelin 0.8 has support for network graphs. You need to download the Zeppelin 0.8 snapshot and build it to get these features. The code to make this graph is described in the Zeppelin documentation.

Note that the part of the JSON that holds the attributes for a node or edge is in an inner JSON object called "data". This is how each node and edge can have different data depending on what type of node or edge it is, for example:

"data": {"fullName":"Andrea Santurbano"}}

Because I happen to be learning the features of SAP HANA, I wanted to display a graph from HANA using Zeppelin. A previous post shows how to connect Zeppelin to HANA.

I am using a series of paragraphs to build the data and then visualize the graph. I have already built my graph workspace in HANA with help from Creating Graph Database Objects and the HANA Graph Reference. The difficult part is transforming the data and relationships so they fit into a vertex / node table and an edge table.

I am using sample data from meetup.com which contains events organized by groups and attended by members and held at venues.

It is important to figure out which attributes should exist on the edge and which ones should be in the nodes. HANA is good at allowing sparse data in the attributes because of the way it stores data in a columnar form. If you wish to display the data on a map and in a graph using the same structure supplied by Zeppelin it is important to put your geo coordinates in the nodes and not in the edges.

First we connect to the database, load the node and edge tables with Spark / Scala, and build the data frames. One issue here was converting the HANA ST_POINT data type into latitude and longitude values. I defined a select statement with the HANA functions ST_X() and ST_Y() to perform the conversion before the data is put into the DataFrame, as sketched below.
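
A sketch of that kind of pushed-down select follows. The connection values and the schema, table and column names are placeholders for illustration; the real note builds one DataFrame for the nodes and one for the edges this way.

%spark
// Sketch: convert a HANA ST_POINT column to plain LAT / LNG columns inside the
// dbtable select, so Spark only ever sees simple types. All names are placeholders.
val driver = "com.sap.db.jdbc.Driver"
val url = "jdbc:sap://hana-host:30015"          // placeholder connection string
val username = "username"
val password = "password"

val nodeSql = """(select ID, NAME,
   SHAPE.ST_Y() as LAT,
   SHAPE.ST_X() as LNG
 from MEETUP.VENUES) nodes"""

val nodesDF = sqlContext.read.format("jdbc")
 .option("driver", driver)
 .option("url", url)
 .option("user", username)
 .option("password", password)
 .option("dbtable", nodeSql)
 .load()
nodesDF.createOrReplaceTempView("nodes")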

You will have problems and errors if your tables in HANA have null values. Databases don't mind null values, but Scala seems to hate null. So you have to convert any columns that could have null values to something that makes sense. In this case I converted varchar columns to empty strings and double columns to 0.0.
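
One way to do that conversion in Spark is the DataFrame na.fill methods. A sketch, continuing with the placeholder nodesDF from above (the column names are again illustrative):

%spark
// Sketch: replace nulls before the data is turned into JSON for the graph.
// Fill string columns with "" and numeric columns with 0.0.
val cleanedNodes = nodesDF
 .na.fill("", Seq("NAME"))
 .na.fill(0.0, Seq("LAT", "LNG"))
cleanedNodes.createOrReplaceTempView("nodes")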

Then I query the data frames to get the data needed for the visualization and transform it into JSON strings for the collections of nodes and edges. In the end this note outputs two JSON arrays: one for the nodes and one for the edges.
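
A sketch of that transformation using the DataFrame toJSON method is below. It assumes node and edge views registered as above, with column names that already line up with the payload sketched earlier (id, source, target, label); remember that, per the note above, any extra attributes belong in an inner "data" object, which would also need to be built into the JSON.

%spark
// Sketch: collect the node and edge views as JSON arrays and hand them to %network.
val nodesJson = sqlContext.sql("select * from nodes").toJSON.collect().mkString("[", ",", "]")
val edgesJson = sqlContext.sql("select * from edges").toJSON.collect().mkString("[", ",", "]")

print("%network {\"nodes\": " + nodesJson + ", \"edges\": " + edgesJson + "}")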


Now we will visualize the data using the new Zeppelin directive called %network.

In my example, data is extracted from meetup.com. I have four types of nodes: venue, member, group and event. These are defined as labels. I have defined my edges, or relationships, as held, sponsored and rsvp. These become the lines on the graph. Zeppelin combines the data from the edges and nodes into a single table view.

So in tabular format it will look like this:

Zeppelin Edges

 

Zeppelin Nodes

When you press the network button in Zeppelin the graph diagram appears.

Zeppelin Network Graph Diagram

Under settings you can specify what data displays on the screen. It does not allow for specifying the edge label displays and does not seem to support a weight option.

Network Graph Settings

If you select the map option and have the leaflet visualization loaded you can show the data on a map. Since I put the coordinates in the edges it will map the edge data. It would be better if I moved the coordinates into each node so the nodes could be displayed on the map.

Zeppelin Graph On a Map

This will help you get past some of the issues I had with getting the Zeppelin network diagram feature to work with HANA and perhaps with other databases.

In the future I hope to show how to call HANA features such as nearest neighbor, shortest path and pattern  matching algorithms to enhance the graph capabilities in Zeppelin.

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint

Zeppelin Maps the Hard Way

In Zeppelin Maps the Easy Way I showed how to add a map to Zeppelin with a Helium module. But what if you do not have access to the Helium NPM server to load in that module? And what if you want to add features to your Leaflet Map that are not supported in the volume-leaflet package?

This will show you how the Angular javascript library will allow you to add a map user interface to a Zeppelin paragraph.

Zeppelin Angular Leaflet Map

Zeppelin Angular Leaflet Map with Markers

First we want to get a map on the screen with markers.

In Zeppelin create a new note.

As was shown in How to Use Zeppelin With SAP HANA we create a separate paragraph to build the database connection. Please substitute in your own database driver and connection string to make it work for other databases. There are other examples where you can pull in data from a csv file and turn it into a table object.

In the next paragraph we place the spark scala code to query the database and build the markers that will be passed to the final paragraph which is built with angular.

The data query paragraph has a basic way to query a bounding box. It just looks for coordinates that are greater than the southwest corner and less than the northeast corner of the bounding box.

var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
var coords = box.split(",")
sql1 = sql1 + " where lng > " + coords(0).toFloat + " and lat > " + coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " + coords(3).toFloat
}

var sql = sql1 +" limit 20"
val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect()) 

The data from this query is used to make the map_pings and bind it to angular so that any angular code can reference it. Zeppelin has the ability to bind data into other languages so it can be used by different paragraphs in the same note. There are samples for other databases, json and csv files at this link.

We do not have access to the HANA proprietary functions because Zeppelin will load the data up in its own table view of the HANA table. We are using the command "createOrReplaceTempView" so that a copy of the data is not made in Zeppelin; it will just pass the data through.

Note that you should set up the HANA jdbc driver as described in How to Use Zeppelin With SAP HANA.

It is best if you set up a dependency to the HANA jdbc jar in the Spark interpreter. Go to the Zeppelin settings menu.

Zeppelin Settings Menu

Pick the Interpreter and find the Spark section and press edit.

Zeppelin Interpreter Screen

Then add the path where you have the SAP HANA JDBC driver, called ngdbc.jar, installed.

Configure HANA jdbc in Spark Interpreter

First Paragraph

%spark
import org.apache.spark.sql._
val driver ="com.sap.db.jdbc.Driver"
val url="jdbc:sap://11.1.88.110:30015/tri"
val database   = "database schema"   
val username   = "username for the database"
val password   = "the Password for the database"
val table_view = "event_view"
var box=""
val jdbcDF = sqlContext.read.format("jdbc").option("driver",driver)
                                           .option("url",url)
                                           .option("databaseName", database)
                                           .option("user", username)
                                           .option("password",password)
                                           .option("dbtable", table_view).load()
jdbcDF.createOrReplaceTempView("event_view")

Second Paragraph

%spark

var box = "20.214843750000004,1.9332268264771233,42.36328125000001,29.6880527498568";
var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
    var coords = box.split(",")
    sql1 = sql1 + " where lng  > " + coords(0).toFloat + " and lat > " +  
        coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " +
        coords(3).toFloat
}
var sql = sql1 +" limit 20" 

val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect())
z.angularBind("paragraph", z.getInterpreterContext().getParagraphId())
// get the paragraph id of the the angular paragraph and put it below
z.run("20171127-081000_380354042")

Third Paragraph

In the third paragraph we add the Angular code with the %angular directive. Note the forEach loop section where it builds the markers and adds them to the map.

%angular 
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.css" />
.
<div id="map" style="height: 300px; width: 100%"></div>
<script type="text/javascript">
function initMap() {
    var element = $('#textbox');
    var map = L.map('map').setView([30.00, -30.00], 3);
   
    L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);
    var geoMarkers = L.layerGroup().addTo(map);
    
    var el = angular.element($('#map').parent('.ng-scope'));
    var $scope = el.scope().compiledScope;
   
    angular.element(el).ready(function() {
        window.locationWatcher = $scope.$watch('locations', function(newValue, oldValue) {
            //geoMarkers.clearLayers();
            angular.forEach(newValue, function(event) {
                if (event)
                  var marker = L.marker([event.values[1], event.values[2]]).bindPopup(event.values[0]).addTo(geoMarkers);
            });
        })
    });
}
if (window.locationWatcher) { window.locationWatcher(); }

// ensure we only load the script once, seems to cause issues otherwise
if (window.L) {
    initMap();
} else {
    console.log('Loading Leaflet library');
    var sc = document.createElement('script');
    sc.type = 'text/javascript';
    sc.src = 'https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.js';
    sc.onerror = function(err) { alert(err); }
    document.getElementsByTagName('head')[0].appendChild(sc);
}
</script>
<p>Testing the Map</p>

<form class="form-inline">
  <div class="form-group">
    <input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>
    <label for="paragraphId">Paragraph Id: </label>
    <input type="text" class="form-control" id="paragraphId" placeholder="Paragraph Id ..." ng-model="paragraph"></input>
  </div>
  <button type="submit" class="btn btn-primary" ng-click="z.runParagraph(paragraph)"> Run Paragraph</button>
</form>

Now when you run the three paragraphs in order it should produce a map with markers on it.

The next step is to add a way to query the database by drawing a box on the screen. Into the scala / spark code we add a variable for the bounding box with the z.angularBind() command. Then a watcher is made to see when this variable changes so the new value can be used to run the query.

Modify Second Paragraph

%spark
z.angularBind("box", box)
// Get the bounding box
z.angularWatch("box", (oldValue: Object, newValue: Object) => {
    println(s"value changed from $oldValue to $newValue")
    box = newValue.asInstanceOf[String]
})

var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
    var coords = box.split(",")
    sql1 = sql1 + " where lng  > " + coords(0).toFloat + " and lat > " +  coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " +  coords(3).toFloat
}
var sql = sql1 +" limit 20" 

val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect())
z.angularBind("paragraph", z.getInterpreterContext().getParagraphId())
z.run("20171127-081000_380354042") // put the paragraph id for your angular paragraph here

To the angular section we need to add in an additional leaflet library called leaflet.draw. This is done by adding an additional css link and a javascript script. Then the draw controls are added as shown in the code below.

Modify the Third Paragraph

%angular 
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet.draw/0.4.13/leaflet.draw.css" />
.
<script src='https://cdnjs.cloudflare.com/ajax/libs/leaflet.draw/0.4.13/leaflet.draw.js'></script>
<div id="map" style="height: 300px; width: 100%"></div>

<script type="text/javascript">
function initMap() {
    var element = $('#textbox');
    var map = L.map('map').setView([30.00, -30.00], 3);
   
    L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);
    var geoMarkers = L.layerGroup().addTo(map);
    var drawnItems = new L.FeatureGroup();
    
    map.addLayer(drawnItems);
    
    var drawControl = new L.Control.Draw({
        draw: {
             polygon: false,
             marker: false,
             polyline: false
        },
        edit: {
            featureGroup: drawnItems
        }
    });
    map.addControl(drawControl);
    
    map.on('draw:created', function (e) {
        var type = e.layerType;
        var layer = e.layer;
        drawnItems.addLayer(layer);
        element.val(layer.getBounds().toBBoxString());
        map.fitBounds(layer.getBounds());
        window.setTimeout(function(){
           //Triggers Angular to do its thing with changed model values
           element.trigger('input');
        }, 500);
    });
    
    var el = angular.element($('#map').parent('.ng-scope'));
    var $scope = el.scope().compiledScope;
   
    angular.element(el).ready(function() {
        window.locationWatcher = $scope.$watch('locations', function(newValue, oldValue) {
            $scope.latlng = [];
            angular.forEach(newValue, function(event) {
                if (event)
                  var marker = L.marker([event.values[1], event.values[2]]).bindPopup(event.values[0]).addTo(geoMarkers);
                  $scope.latlng.push(L.latLng(event.values[1], event.values[2]));
            });
            var bounds = L.latLngBounds($scope.latlng)
            map.fitBounds(bounds)
        })
    });

}

if (window.locationWatcher) { window.locationWatcher(); }

// ensure we only load the script once, seems to cause issues otherwise
if (window.L) {
    initMap();
} else {
    console.log('Loading Leaflet library');
    var sc = document.createElement('script');
    sc.type = 'text/javascript';
    sc.src = 'https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.js';
    sc.onerror = function(err) { alert(err); }
    document.getElementsByTagName('head')[0].appendChild(sc);
    sc.onload = initMap;
}
</script>
<p>Testing the Map</p>

<form class="form-inline">
  <div class="form-group">
    <input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>
    <label for="paragraphId">Paragraph Id: </label>
    <input type="text" class="form-control" id="paragraphId" placeholder="Paragraph Id ..." ng-model="paragraph"></input>
  </div>
  <button type="submit" class="btn btn-primary" ng-click="z.runParagraph(paragraph)"> Run Paragraph</button>
</form>

There are some important features to mention here that took some investigation to figure out.

Within Zeppelin I was unable to get the box being drawn to be visible. So instead, drawing a box will cause the map to zoom to the selected area, using this code:
element.val(layer.getBounds().toBBoxString());
map.fitBounds(layer.getBounds());

To make the map zoom back to the area after the query is run this code is triggered.

$scope.latlng.push(L.latLng(event.values[1], event.values[2]))
...
var bounds = L.latLngBounds($scope.latlng)
map.fitBounds(bounds)

To trigger the Spark / Scala query paragraph to run after drawing a box, this attribute is used: data-ng-change="z.runParagraph(paragraph_id);"

<input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>

The html form at the bottom is what holds and binds the data back and forth between the paragraphs. It is visible for debugging at the moment.

Query of a geographic region with Zeppelin

Query of a geographic region

Please let us know how it works out for you. Hopefully this will help you add maps to your Zeppelin notebook. I am sure there are many other better ways to accomplish this feature set but this is the first way I was able to get it all to work together.

Demo of the interface:

You can contact us using twitter at @volumeint.

Some code borrowed from: https://gist.github.com/granturing/a09aed4a302a7367be92 and https://zeppelin.apache.org/docs/latest/displaysystem/front-end-angular.html

How to Animate an Auto-height Element

ruler and graphboard

Animating an auto-height element seems like it should be fairly straightforward; however, it seems I’m not the only one who has struggled with this particular issue. The problem is usually some variant of the following:

  • I have some element I would like to allow to vertically expand and collapse.
  • The element has dynamic content – so therefore the height of the expanded element is unknown/dynamic.
  • I need to set the height of the element to auto to allow the element to change height based on its contents.
  • CSS doesn’t allow transitioning to an auto height, so it just jumps to the final height when expanding/collapsing. No animation 🙁

 

This is what I want to do.

Showing auto-height expander

Some Workarounds

You may find several potential solutions to this problem if you spend a bit of time poking around the internet.

For example – there is the max-height workaround. In this solution you would basically transition the max-height property instead of the height. The trick is to set the final max-height to something way larger than you think the element will ever grow. This will effectively animate to the height of the element’s contents. This might feel a little hacky to you – and for good reason. For starters – you have to guess what might be the largest the contents of the element will ever get. But the content is dynamic – so that could easily get out of hand. Furthermore, the transition will animate to the full max-height specified. The visible height will stop at the height of the content – but the transition thinks it needs to grow all the way to the max-height. So for example – if you set a transition time of 300ms – it will take that long to animate to the full max-height even though the visual height stops well before then.

Other workarounds involve hiding the visual elements instead of changing the actual height or using javascript to manually animate/hide elements etc., but these are even more complicated than the max-height solution and introduce a slew of new problems to deal with (the very least of which is wreaking havoc on the element’s accessibility).

My Hack Solution

If you’re the kind of person that peeks at the end of the book (shame on you) then you can check out my working solution on codepen.

It still uses CSS to animate the height property through the transition property.  However it also uses a bit of JavaScript to store some state for the element.

This solution will not work for all situations – but it suited my needs well. There are some restrictions:

  • You must know the initial default height of the element.  This means if you don’t know what content will be in your div on initial load – this might not work so well.  But if your element has an initial set of known contents this should work like a champ.
  • Content can only be added or removed from the element while it is in the expanded state.  If content is added/removed from the div while collapsed  – then you’re out of luck again.

Assuming your needs fulfill these requirements – this should work nicely.

The solution essentially works like this:

  1. Store the initial height of the element in a variable somewhere. We’ll call it expandHeight for now.
  2. When the element expands – you can easily transition the height from 0 to the expandHeight.
  3. After the transition is complete (use a setTimeout based on whatever you set the transition-duration property to) then set the element’s height property to auto
  4. Add/remove content to the element as desired
  5. When collapsing –
    1. First store the element’s current height back into the expandHeight variable.
    2. Next set the element’s height back to a fixed value (what you just stored in expandHeight). This is because the element cannot transition from an auto height either. It can only transition to/from a fixed height.
    3. Now you can transition back to a height of 0.
  6. When you need to expand again – just start at step 2 above and repeat as necessary!

 

That’s about all there is to it and it has worked well for me. One caveat is that you may need to stick step 5.3 in another setTimeout with a very small delay to allow the DOM time to register that the height attribute has changed from an auto height to a fixed height.

Here’s my fully functioning example:

See the Pen Auto-Height Expanding Div by Nate Gibbons (@marshallformula) on CodePen.

The astute observer might notice that it would not take too much imagination to create a high order ReactJS component out of this solution that stores its own state internally so you can re-use it anywhere with ease.

Let me know what you think.  More importantly – let me know if you’ve got something even better!  Cheers!

Feature Photo by Christian Kaindl

JSON Joins - jq

Like a bunch of json objects being manipulated. Digital Cherries by Bradley Johnson

Digital Cherries by Bradley Johnson https://art.wowak.com like little JSON objects

Manipulation of JSON files is an interesting challenge, but it is much easier than trying to manipulate XML files. I was given two files that were lists of JSON objects. The two files had a common key, called business_id, that could be used to join them together. The objective was to flatten the two files into one file.

Once they are in one large file we will be using a system called Ryft to search the data. Ryft is an FPGA (Field Programmable Gate Array) system that searches data very quickly with special capabilities for performing fuzzy searches with a large edit distance. I am hypothesizing that the Ryft will work better on flat data instead of needing to perform a join between two tables.

The files are from the Yelp dataset challenge. We will use the program called jq to join the data. The data files are yelp_business.json and yelp_review.json

The file yelp_business.json has this format below and each record / object is separated by a hard return. This file is 74 Megabytes. Note: I added hard returns between the fields to make it easier to read.
{
"business_id": "UsFtqoBl7naz8AVUBZMjQQ",
"full_address": "202 McClure St\nDravosburg, PA 15034",
"hours": {},
"open": true,
"categories": ["Nightlife"],
"city": "Dravosburg",
"review_count": 5,
"name": "Clancy's Pub",
"neighborhoods": [],
"longitude": -79.8868138,
"state": "PA",
"stars": 3.0,
"latitude": 40.3505527,
"attributes": {"Happy Hour": true, "Accepts Credit Cards": true, "Good For Groups": true, "Outdoor Seating": false, "Price Range": 1},
"type": "business"
}

The file yelp_review.json has this format below and is also separated by hard returns. Now for each business there are many reviews. So if a business has five reviews the final output will contain five rows of data for that one business. This file is 2.1 Gigabytes

{
"votes": {"funny": 0, "useful": 0, "cool": 0},
"user_id": "uK8tzraOp4M5u3uYrqIBXg",
"review_id": "Di3exaUCFNw1V4kSNW5pgA",
"stars": 5,
"date": "2013-11-08",
"text": "All the food is great here. But the best thing they have is their wings. Their wings are simply fantastic!! The \"Wet Cajun\" are by the best & most popular. I also like the seasoned salt wings. Wing Night is Monday & Wednesday night, $0.75 whole wings!\n\nThe dining area is nice. Very family friendly! The bar is very nice is well. This place is truly a Yinzer's dream!! \"Pittsburgh Dad\" would love this place n'at!!",
"type": "review",
"business_id": "UsFtqoBl7naz8AVUBZMjQQ"
}

Notice that “business_id” is in both objects. This is the field we wish to join with.

Originally I started with this answer on Stack Overflow.

So what we really want is a left join of all the data like a database would perform.

The jq software allows one to read functions out of a file to make the command line easier. With help from pkoppstein at the jq GitHub we have a file called leftJoin.jq.

# leftJoin(a1; a2; field) expects a1 and a2 to be arrays of JSON objects
# and that for each of the objects, the field value is a string.
# A left join is performed on "field".
def leftJoin(a1; a2; field):
# hash phase:
(reduce a2[] as $o ({}; . + { ($o | field): $o } )) as $h2
# join phase:
| reduce a1[] as $o ([]; . + [$h2[$o | field] + $o ])|.[];

leftJoin( $file2; $file1; .business_id)

Based on this code, the last line is what passes in the variables for the file names and sets the key used for the join to business_id. Within the reduce commands it turns the lists of objects into JSON arrays, then finds the business_id field and concatenates the two JSON objects together with the plus (+) sign. The final command "|.[]", at the end where the semicolon finalizes the function, turns the JSON array back into a stream of JSON list objects. The Ryft appliance only reads in JSON as lists of objects.

If there are any fields that are identically named, the jq code will use the one from file2. So, because both files have the field "type", the new data file will have type = review.

To run this we use the command line of:
jq -nc --slurpfile file1 yelp_business.json --slurpfile file2 yelp_review.json -f leftJoin.jq > yelp_bus_review.json

As a result, this command takes the two files and passes them to the jq code to do the work of joining them together. It will write out a new file called yelp_bus_review.json. It may take a long time to run depending on the size of your files; I ended up with a 4.8 gigabyte file when finished. Here are two rows in the new file:

{"business_id":"UsFtqoBl7naz8AVUBZMjQQ","full_address":"202 McClure St\nDravosburg, PA 15034","hours":{},"open":true,"categories":["Nightlife"],"city":"Dravosburg","review_count":5,"name":"Clancy's Pub","neighborhoods":[],"longitude":-79.8868138,"state":"PA","stars":4,"latitude":40.3505527,"attributes":{"Happy Hour":true,"Accepts Credit Cards":true,"Good For Groups":true,"Outdoor Seating":false,"Price Range":1},"type":"review","votes":{"funny":0,"useful":0,"cool":0},"user_id":"JPPhyFE-UE453zA6K0TVgw","review_id":"mjCJR33jvUNt41iJCxDU_g","date":"2014-11-28","text":"Cold cheap beer. Good bar food. Good service. \n\nLooking for a great Pittsburgh style fish sandwich, this is the place to go. The breading is light, fish is more than plentiful and a good side of home cut fries. \n\nGood grilled chicken salads or steak. Soup of day is homemade and lots of specials. Great place for lunch or bar snacks and beer."}
{"business_id":"UsFtqoBl7naz8AVUBZMjQQ","full_address":"202 McClure St\nDravosburg, PA 15034","hours":{},"open":true,"categories":["Nightlife"],"city":"Dravosburg","review_count":5,"name":"Clancy's Pub","neighborhoods":[],"longitude":-79.8868138,"state":"PA","stars":2,"latitude":40.3505527,"attributes":{"Happy Hour":true,"Accepts Credit Cards":true,"Good For Groups":true,"Outdoor Seating":false,"Price Range":1},"type":"review","votes":{"funny":0,"useful":0,"cool":0},"user_id":"pl78RcFgklDns8atQegwVA","review_id": "kG7wxkBu62X6yxUuZ5IQ6Q","date":"2016-02-24","text":"Possibly the most overhyped establishment in Allegheny County. If you're not a regular, you will be ignored by those who're tending bar. Beer selection is okay, the prices are good and the service is terrible. I would go here, but only if it was someone else's idea."}

Now we have one large flat file instead of two. This will allow for quicker searches as we do not have to join the data together in the middle of the search.

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint

Package Your WebApp

package webapp

So you’re building a modern web application. That most likely means you’re building a Single Page Application (SPA) in JavaScript and reading data from a server via REST. The REST server code could be implemented using any number of programming languages and technology stacks.

There are a few schools of thought when it comes to developing a web application. One option is to keep the development of client and server code completely separate. Another approach is to develop both client & server code together via Universal Javascript. Additionally there are issues regarding how to store the code base(s) in the repository, how versioning is applied, and finally how the code is deployed and maintained.

This article proposes a solution that has worked well for one of our projects at Volume Integration. I’ve created a sample application that demonstrates some of the key components of this solution.

You can download/clone the project here:

https://github.com/marshallformula/packaged-webapp

2015 Utah State Park Attendance Example Application

The sample application is a very simple web application that shows a graph of Utah State Park attendance for 2015. Here is a screenshot of the finished product.

Sample Web Application Screenshot

The main components of the application are:

Standalone Java Web Server (Spring Boot). The main purpose is to provide a set of REST services for the WebApp to consume. But it also initially serves the static web application code (HTML, CSS, JavaScript)

Web Application (SPA). The web application uses modern web application practices – including transpiling ES2016 code using Babel, bundling and optimizing code and dependencies using webpack, as well as compiling advanced CSS using preprocessors like Less and Sass.

Requirements

Development of this application requires the following:

A Java JDK (Gradle itself is provided by the included Gradle wrapper script)

Node.js and npm (used to build the JavaScript web application)

Developing the Application

This application is set up so that you can develop the REST services and JavaScript application independently.

Developing REST Services

The REST services are written in Java utilizing Spring Boot and Spring MVC functionality. All of that code is located in src/main/java.

To develop the services code interactively just run

gradlew bootRun

This will start up the embedded webserver (Tomcat by default) and deliver your services. As you write your code, the server should detect code changes and restart as necessary due to the inclusion of Spring Boot DevTools.

If you are developing/running the REST server interactively while developing the JavaScript web application – you will need to add a system property like so:

gradlew bootRun -Dcors.origins=http://localhost:3000

One small hack is required to let the bootRun Gradle task accept and apply system properties passed this way. Add this snippet to the build.gradle file:

bootRun {
    systemProperties System.properties
}

This is necessary because, during development, the web application runs on its own development server on port 3000 (see below). Since that is a different origin, CORS must be configured to allow the web application to consume the REST services.

I won’t delve into all of the intricacies of how Spring MVC works its magic, but the following configuration is required in the app to make it work:

@SpringBootApplication
public class ExampleApplication extends WebMvcConfigurerAdapter {

    public static void main(String[] args) {
        SpringApplication.run(ExampleApplication.class, args);
    }

    @Value("${cors.origins}")
    private String origins;

    // Allow the CORS origins specified in the cors.origins property to communicate with this server.
    @Override
    public void addCorsMappings(CorsRegistry registry) {
        if (!StringUtils.isEmpty(origins)) {
            CorsRegistration registration = registry.addMapping("/api/**");
            Arrays.stream(origins.split(","))
                .map(String::trim)
                .forEach(registration::allowedOrigins);
            registration.allowedMethods("GET", "POST", "PUT", "DELETE");
        } else {
            super.addCorsMappings(registry);
        }
    }

    // Forward all unmapped requests to index.html.
    // This is required if you want to use the HTML5 History API.
    @Override
    public void addViewControllers(ViewControllerRegistry registry) {
        registry.setOrder(Ordered.LOWEST_PRECEDENCE);
        registry.addViewController("/**").setViewName("forward:/index.html");
    }

    // Works with the method above to allow paths to the /assets folder for images, files, etc.
    @Override
    public void addResourceHandlers(ResourceHandlerRegistry registry) {
        registry.addResourceHandler("/assets/**").addResourceLocations("classpath:/static/");
        super.addResourceHandlers(registry);
    }
}

Developing the JavaScript Web Application

The JavaScript web application is set up to operate as a standalone web application (when provided some REST services to connect to). All of the web application code is under src/main/app. You will need to be in this directory to run the following commands.

The web application is packaged with webpack and utilizes webpack-dev-server to enable interactive development.

First you must download all of the necessary dependencies from npm.

npm run setup

Once that is complete, you are ready to start the development server by simply running npm start. However, the application can be configured via an environment variable to connect to a REST server at any location. This way you can work on the web application and connect to any instance of your API (dev/test environments).

If you would like to connect to your local instance of the REST services that are running using the instructions above – you will just need to set the REST_URL environment variable to http://localhost:8080/api. The easiest way to do that is to just prepend the variable declaration to the start command like this:

REST_URL=http://localhost:8080/api npm start

You might wonder how an environment variable on the command line can be incorporated into the necessary places in the client JavaScript files. There are most likely several ways to do this – one of the simplest is through the webpack DefinePlugin.

Just add and configure the plugin in the webpack.config.js file like this:

new webpack.DefinePlugin({
    REST_URL: JSON.stringify(process.env.REST_URL || "/api")
})
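
To give a feel for how that plays out on the client side, here is roughly what consuming the injected constant could look like (the endpoint name below is just a placeholder for illustration, not necessarily the sample project's actual route):

// REST_URL is replaced at build time by the DefinePlugin above,
// so it compiles down to a plain string literal in the bundle.
fetch(`${REST_URL}/attendance`)            // '/attendance' is a made-up example endpoint
  .then(response => response.json())
  .then(data => console.log(data))         // the real app would render the chart here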

The application is transpiled, packaged and available at http://localhost:3000. This is why the CORS property must be configured properly above.

The development server communicates with your browser via web sockets – so any changes that are made to your code are immediately re-packaged and available to your browser without needing to refresh. Like Magic!
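
For reference, the webpack-dev-server settings behind that behavior look something like the sketch below – the sample project's actual config may differ in the details:

const webpack = require('webpack')

module.exports = {
  // ...entry, output, loaders and the DefinePlugin shown earlier...
  plugins: [
    new webpack.HotModuleReplacementPlugin()  // enables hot module swapping
  ],
  devServer: {
    port: 3000,  // the origin we allowed via cors.origins when starting bootRun
    hot: true    // push rebuilt modules over a web socket instead of forcing a full refresh
  }
}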

Building the WebApp

The application is packaged together as a Spring Boot runnable jar compiled using Gradle. Installing Gradle manually is not necessary – the project is configured to use the Gradle wrapper script.

To compile and package the application just run this:

./gradlew bootRepackage on OSX/Linux

gradlew.bat bootRepackage on Windows.

This will download all dependencies, compile/transpile and package all of the code for both the Java REST application and the JavaScript web application into a runnable jar. The jar file is located in build/libs/packaged-webapp-1.0-SNAPSHOT.jar. This is accomplished by a very helpful Gradle plugin that runs npm scripts. The build npm script inside our web application's package.json will transpile and package all of the front end code as necessary and place it in src/main/resources/static – from which the Spring Boot application is preconfigured to serve static content.

The key is to add the proper gradle build dependencies to run the npm scripts before packaging the entire application into a jar. This is done with the following code in the build.gradle file:

task buildApp(type: NpmTask) {
    args = ['run', 'build']
    execOverrides {
        it.workingDir = 'src/main/app'
    }
}

task npmClean(type: NpmTask) {
    args = ['run', 'clean']
    execOverrides {
        it.workingDir = 'src/main/app'
    }
}

clean.dependsOn npmClean
bootRepackage.dependsOn buildApp

After it is packaged, running the application is simple:

java -jar build/libs/packaged-webapp-1.0-SNAPSHOT.jar

This will start an embedded webserver (Tomcat) which you can access at http://localhost:8080. You can change the port if necessary by adding the --server.port argument:

java -jar build/libs/packaged-webapp-1.0-SNAPSHOT.jar --server.port=8989

Being able to develop the separate application components both individually and independently provides many benefits. We can have server-side and client developers working concurrently in the same code base. This helps keep the REST services and web application in sync.

Also utilizing Spring Boot to package and run the application simplifies both the building and deployment of the application. A simple gradle command compiles, transpiles, and packages all of the code (server code and client code) together. Deployment is simple – because it’s only one simple jar file and the only dependency is Java. No mucking about with slightly different servlet container configurations on different environments etc.

I’m sure there are other great solutions out there that help ease the burden of developing server/client web applications together and we’d love to hear about them. Let us know in the comments.

If you have any questions about how we’re making this work or questions about the example project feel free to reach out:

Forking and Threading in Ruby

Great American Restaurants

There is a fantastic restaurant chain in the DC area called Great American Restaurants. They have received numerous awards for their food, their service and their work environment. However, their service is one of the primary reasons I love going to their restaurants. They have an innovative team method of taking care of tables. Basically, if a server is walking by, they will check your table to see if you need anything, which means empty glasses or plates don't sit on your table. When you put the credit card down, the next server that walks by picks it up and brings it back immediately. Unlike restaurants where a single server is assigned to your table, at a GAR restaurant tasks are accomplished quickly because the restaurant makes all of the workers available to every table. This is parallel processing, and it is exactly what we needed from Ruby in the situation we found ourselves in.

This week, we are migrating a competitor’s database into our database. The competitor’s database will add millions of records to our database. The plan is to migrate all of their data and then index the whole combined database into Solr. But when we figured out that it was going to take 7 days to index all of our new data into Solr, we knew something had to change.

Our application is a Rails application. We use Sunspot for our ActiveRecord and Solr integration. Sunspot enables full-text and faceted searching of ActiveRecord models. When you need to reindex your models, Sunspot provides a nice rake task to handle it; unfortunately, the reindexing is performed by a single-threaded, non-parallel worker. This is where change needed to happen. We needed a full team of workers to get this data indexed in a timely manner.

The proper way to get parallel processing in your Ruby app is to use the Parallel gem. But I didn't find that gem until after I had already created a working solution (I'll sketch what the Parallel version would look like at the end of this post). However, after implementing this solution, I think knowing how to use threads and forks is something that every Rubyist should know, so I will share my solution. It brings together one idea that was not covered in any of my searches. The idea is simply:

You need both Threads and Forks.

I found great articles on Threads. And I found great articles on Forks. But there were no articles that said, “but what you really need are Threads and Forks.” Let’s start by defining Threads and Forks.

Threads share the same CPU and the same Ruby interpreter. This means that all of your variables are available to all of the other threads. This has consequences, both good and bad, so be careful with how you are using variables with Threads.

Forks spawn a separate process with its own Ruby interpreter, which can run on a different CPU. A Fork gets a copy of the parent's variables, and that copy is not linked back to the originals. If you change variables in a Fork, the parent process and the other Fork processes will not see those changes. Again, this has consequences, so caution should be used here. One of my earliest attempts had me reindexing everything 5 times because I was copying the entire queue into every Fork. That did not speed things up!
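
Here is a contrived little script (not from the indexing code, just an illustration, and Process.fork only works on Unix-like systems) that shows the difference. The Thread sees the parent's later change to counter, while the Fork only ever sees the copy it was handed at fork time:

counter = 0

t = Thread.new do
  sleep 0.1
  puts "thread sees #{counter}"  # prints 1 -- threads share the parent's memory
end

pid = Process.fork do
  sleep 0.1
  puts "fork sees #{counter}"    # prints 0 -- the fork got a copy at fork time
end

counter += 1
t.join
Process.wait(pid)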

My first attempt followed this example of Ruby threads. This gave me multiple workers but didn’t give me the multi-CPU solution that I needed.

To get multi-CPUs working I found this article on Ruby forks. But, this wasn’t the full answer either.

The goal was to keep each processor busy with work. We have 6 processors, so I wanted to keep 5 of them busy at all times. My first strategy was to use Rails' find_in_batches feature to loop over all of the records 1000 at a time and send each batch to Solr. My second attempt took the 1000 records, split them into 5 groups using Rails' in_groups and then sent those groups to Solr. Here's the second attempt:

cpu_available = 5

Order.includes(:products).find_in_batches do |batch|
  batch.in_groups(cpu_available, false) do |group|
    Process.fork do
      Sunspot.index! group
    end
  end
  Process.waitall
end

This looks good, but has one small problem. If one of the groups is larger than the others, you could have idle CPUs. You don't want this when you have 1.7m Order records that need to be indexed. Imagine a large warehouse that handles Orders that are big and small: Orders that have 4 Products and Orders that have 1000 Products. Your CPUs may be waiting a while if one of your Forks is indexing a couple of those Orders with 1000 Products.

This is where you need Threads and Forks to keep all of the processors busy. Here is my final solution:

batches = Queue.new

# Load just the Order ids, in batches of 1000, onto a thread-safe Queue.
Order.select(:id).find_in_batches do |batch|
  batches << batch.map(&:id)
end

cpu_available.times do
  Thread.new do
    while (ids = batches.pop)
      # Each batch is indexed in its own forked process so it can use its own CPU.
      pid = Process.fork do
        Sunspot.index! Order.includes(:products).find(ids)
      end
      Process.wait(pid)
    end
  end
end

# Wait until the Queue is drained, then let the script finish.
Thread.new do
  sleep 1 until batches.size.zero?
end.join

Process.waitall

I know that looks a little crazy, but it rocks, so let me explain how it works. We start off grabbing all of the ids of the Orders and putting them into a Queue. Each item in the Queue is a batch of 1000 Orders. We just want the ids and not the whole Order object, so we call #map(&:id). Then we create a new Thread for each of our CPUs. Inside of each Thread we loop while there are batches still available in the Queue. For each batch we get, we create a new Fork. Inside of that Fork we execute the find that brings back all of the Order objects and their Products and throw the results into Solr. As soon as the Fork completes, we get the next batch and start again. The last block watches the Queue, and when the Queue is empty it cleans up all of the threads. The last line waits for all of the Forks to complete before allowing the script to finish.

Without the Threads, each Fork would get a full copy of the Queue and try to work the whole Queue. Without the Forks, the Threads would all stay on a single CPU. However, the Threads all share the same memory, so the Queue is shared across all of the Threads. Each Fork is then given a single batch to work and is gone until the next Fork is created. Just like my favorite restaurant, this is efficient and wonderful. All of the CPUs are kept busy until there is no more work left on the Queue.
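
For completeness: if I had found the Parallel gem before rolling my own, I believe the same idea would read roughly like this (a sketch of the gem's API, not something we actually ran or benchmarked):

require 'parallel'

batch_ids = []
Order.select(:id).find_in_batches do |batch|
  batch_ids << batch.map(&:id)
end

# Hand each batch of ids to one of 5 worker processes.
Parallel.each(batch_ids, in_processes: 5) do |ids|
  Sunspot.index! Order.includes(:products).find(ids)
end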

Photo credit: Be considerate via photopin (license)

Writing Better Software

Programming is hard. Human brains are not wired to work like computers. Humans also have limited cognitive resources. Computers are only limited by the hardware they are built with.

Regardless – until we reach the singularity – humans are responsible for programming computers and as a whole we are not doing a very good job of it. There are many reasons that our software is bad, but I want to focus on solving the complexity issue.

You see, humans have a very limited amount of cognitive resources – which makes trying to reason out (“grok” if you will) a program very difficult. In fact, we humans are even more limited in cognitive resources than you might think:

While the focus of that video revolves around utilizing our resources better when trying to learn software (which is a fascinating study in itself) – it demonstrates the scarcity of our cognitive resources.

Software systems are only becoming more complex. So how do we keep up without wasting resources and writing bad code?

Abstraction and Immutability


Abstraction:

When was the last time you wrote some assembly code? I sincerely hope the answer is never, because we have high-level programming languages that compile down to and abstract away the actual machine code processed by the computer, thereby preserving our cognitive resources.

Have you been writing object-oriented code? This is yet another abstraction, one that models computer solutions in terms of objects and their relationships. As humans, we grasp this paradigm much more easily than procedural coding because the model reflects the objects and relationships of our physical world. Our brains see familiar patterns and understand them.

You may at this point be saying to yourself “Sure but we’ve had high-level languages and object-oriented programming for years now – yet programming is still hard and we’re still writing buggy software”. One of the biggest problems with object-oriented programming is that it encourages mutable state within objects (see below), but we can resolve that. Programs are so complex now that OO abstraction is no longer robust or abstract enough for humans to grok effectively. So let’s abstract even higher!

What I'm talking about here is writing declarative code versus imperative code. Writing code imperatively means giving the computer step-by-step instructions to perform in a specific order. Writing declarative code is more like abstractly describing a problem and what you would like to see as the solution – then letting the computer figure out the best way to solve it. That may seem somewhat strange to you – but you've most likely been using some declarative programming without realizing it. If you've ever written any SQL – then you, my friend, have written declarative code.

Examine the following SQL statement:

Select id, name, price from products join orders on products.id = orders.product_id where price > 10 and orders.date > '2010-01-01'

If I were to re-write this imperatively in pseudo-code I might have to do something like this:

var allProducts = fetchAllProducts()
var allOrders = fetchAllOrders()
var selectedProducts = []

for(int i = 0; i < allProducts.length; i++) {
  for(int j = 0; j < allOrders.length; j++) {
    var currentProduct = allProducts[i]
    var currentOrder = allOrders[j]

    if(currentProduct.id == currentOrder.product_id && currentProduct.price > 10 && currentOrder.date > '2010-01-01'){
      selectedProducts.push(currentProduct)
    }
  }
}

return selectedProducts

Isn’t declarative so much cleaner and easier to understand? In SQL – I just describe what I want and let the database code figure out the most efficient way of getting the data back to me. Such nice abstraction! Why are we still writing imperative code in our applications? Wouldn’t it be nice to write declarative code and let the computer figure out the most efficient way to execute the instructions?
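
For comparison, here is roughly how that same logic could be expressed more declaratively in JavaScript – describing the result with filter and some instead of managing loop counters (fetchAllProducts and fetchAllOrders are the same placeholder functions as in the pseudo-code above):

const allProducts = fetchAllProducts()
const allOrders = fetchAllOrders()

// Describe what we want: products over 10 that appear on an order placed after 2010-01-01.
const selectedProducts = allProducts.filter(product =>
  product.price > 10 &&
  allOrders.some(order =>
    order.product_id === product.id && order.date > '2010-01-01'))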


Immutability:

I’m going to make a completely unscientific and unqualified assertion that 82.3% of the bugs in your object-oriented code are caused by mutable state.

Seriously though – why do we even use a debugger, if not to examine the state of objects at certain points in the application's execution, trying to figure out how/where/why it's changing?

In addition to bug-prone code, trying to reason out a program with mutable state severely drains your cognitive resources. Even with a debugger, trying to juggle the current state & interaction of all those objects in your head will take its toll.

If you’re wondering how you can do anything useful without mutable state, you’re not alone. Rest assured it’s possible and effective. It just takes a bit of re-training for your brain not to think that way. That stack overflow article is very helpful, also there’s this library (read the case for immutability) and this article, and this one about immutable objects.
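
As a small taste of the style in JavaScript: rather than mutating an object in place, you derive a new object and leave the original untouched.

const order = Object.freeze({ id: 42, status: 'pending' })

// Mutating the frozen object is silently ignored (or throws in strict mode):
// order.status = 'shipped'

// Instead, derive a brand new object that reflects the change:
const shippedOrder = Object.assign({}, order, { status: 'shipped' })

console.log(order.status)         // 'pending' -- the original never changed
console.log(shippedOrder.status)  // 'shipped'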


By now hopefully you can see the benefits of abstraction, declarative coding, and immutability. So where is this magical land of unicorns and rainbows where we can write simple robust software using these constructs? Believe it or not the answer is not a new language or another framework. It has actually been here all along, silently waiting for us to dust it off and realize its full potential.

From Wikipedia:

In computer science, functional programming is a programming paradigm — a style of building the structure and elements of computer programs — that treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data. It is a declarative programming paradigm, which means programming is done with expressions. In functional code, the output value of a function depends only on the arguments that are input to the function, so calling a function f twice with the same value for an argument x will produce the same result f(x) each time. Eliminating side effects, i.e. changes in state that do not depend on the function inputs, can make it much easier to understand and predict the behavior of a program, which is one of the key motivations for the development of functional programming.

Key facets of functional programming:

  1. Functions are pure: Functional purity means that a function given the same input will always return the same output, and executing the function produces no side-effects.
  2. Functions do not store or mutate state: This is really just a follow-on to the last point. The only reason for a function to store state would be to use it in subsequent calls, which violates its purity. If a function mutates state outside of itself, that is what we call a side-effect, which is also impure. State changes in a functional program are modeled by passing a data structure into a pure function and receiving a new data structure as the result. This new data structure is completely predictable (read: testable) by nature of the function's purity given the input.
  3. Functions are first class citizens: Essentially this means you can treat functions just like any other code construct, i.e., you can store functions in arrays, assign them to variables, and pass them around (even as arguments to other functions!).
  4. Functions are composable: This feature you get by nature of the previously listed facets, and it is perhaps the most important feature of functional programming. This is where that higher level of abstraction comes into play. Let me try to demonstrate with this (admittedly straw-man) scenario:

First, let's say I've created a pure function yToZ() that converts a 'y' to a 'z'.

const yToZ = function(yArg) {
  // some transformation here
  return z;
}

Great. That's a nice pure function. Since we're good developers, we write a suite of unit tests against that function. Now we've built up our confidence that yToZ() is a solid function that will produce the expected output without side effects (remember, it's pure). In fact, I've built up enough confidence that I don't even care how the implementation of yToZ() works. All I know is that I give it a 'y' and I get a 'z' every time, and nothing else in the program gets touched. And now I've just freed up all those cognitive resources that were being used to understand how yToZ() works.

Uh-oh, here comes Manager Bob with a feature request – now we need a program that converts 'x' to 'z'. Great… now I have to write a brand new function xToZ()… or do I? I think to myself: 'x' is pretty close to 'y', and I've already got yToZ(). I don't remember how it works, but I know it takes a 'y' and returns a 'z'. Let me create xToY() and see where that gets me.

const xToY = function(xArg) {
  //some transformation here
  return y;
}

Now of course I write my unit tests for xToY(), forget about its implementation, and free up my cognitive resources again. I don't remember or care how these functions work, but I know what they expect and what they output. Now wouldn't it be nice if I could combine (compose) these two functions to get from my input 'x' to the desired output 'z'?

import { compose } from 'rambda'

const xToZ = compose(yToZ, xToY)

// Now we can call xToZ(xArg)
// which is the same as calling:
//  yToZ(xToY(xArg))
//
//  Composition!

Nice! Now I have xToZ()! I’ll write my unit tests – and free up cognitive resources. I don’t have to remember that xToZ() is actually composed of two separate functions. I just know that I give it an ‘x’ and I get a ‘z’.

three months later…

Manager Bob: users want to give us a 'w'… but they still want a 'z' returned. At this point I have no recollection of how xToZ() was implemented, and I don't care. All I have to write is a wToX() and compose them and I'm done… and so on…

const wToZ = compose(xToZ, wToX)   // wToX runs first, then xToZ
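
And if compose itself seems like magic, it isn't. A bare-bones, two-function version (a simplified sketch, not the library's actual implementation) is just a one-liner:

// Simplified sketch: returns a new function that applies g first, then f.
const compose = (f, g) => x => f(g(x))

// So compose(xToZ, wToX) builds a function equivalent to w => xToZ(wToX(w)).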

 

This is overly simplistic, but it demonstrates the point. Functional purity and immutable data mean no more unexpected behavior (aka bugs). You should also see that after a short while you can quickly build a library of very modular pieces of code, like a collection of Lego bricks. Eventually writing programs simply becomes deciding how to snap the bricks together in a very DRY process. Best of all – you most likely don't even have to adopt a new language like Haskell or Clojure to start programming this way (although after a while you will want to). Languages like Scala, Python, JavaScript and Swift all give you the tools that enable functional coding.

Is functional programming going to solve all your problems and make all of your bugs disappear? Obviously not. Additionally, it has a bit of a steep learning curve – especially if you've been programming imperatively for a long time. You will need to re-train your brain into a drastically different way of solving problems programmatically. Also, functional paradigms can sometimes seem a bit too academic and theoretical. That's because most functional concepts are backed by solid mathematical axioms (Category Theory) and usually expressed in mathematical notation. You might start to see words like Functors, Monads, chain, and foldMap. Don't let it scare you away. All you need is a good teacher and a bit of effort, and it's well worth that effort. Might I very highly recommend the following:

 

Functional programming is nothing new. However, its popularity is rapidly increasing and I predict a functional renaissance in the very near future. Let's stop wasting our limited cognitive resources writing difficult code that's hard to maintain and hard to understand. Let's write simple declarative code and start building our Lego creations.