A Volume Analytics Flow for Finding Social Media Bots

Volume Analytics Chaos Control

Volume Analytics is a software tool used to build, deploy and manage data processing applications.

Volume Analytics is a scalable data management platform that allows the rapid ingest, transformation, and loading of high volumes of data into multiple analytic models, as defined by your requirements or your existing data models.

Volume Analytics is a platform for streaming large volumes of varied data at high velocity.

Volume Analytics is a tool that enables rapid software development and operational maintainability, with scalability for high data volumes. Volume Analytics can be used for all of your data mining, fusion, extraction, transformation and loading needs. It has been used to mine and analyze social media feeds, monitor and alert on insider threats, and automate the search for cyber threats. It is also being used to consolidate data from many data sources (databases, HDFS, file systems, data lakes) and produce multiple data models for multiple data analytics visualization tools. It could also be used to consolidate sensor data from IoT devices or monitor a SCADA industrial control network.

Volume Analytics makes it easy to quickly develop highly redundant software that is both scalable and maintainable. In the end you save money on the labor to develop and maintain systems built with Volume Analytics.

In other words, Volume Analytics provides the plumbing of a data processing system. The application you are building has distinct units of work that need to be done. We might compare it to a water treatment plant. Dirty water comes into the system through a pipe and reaches a large contaminant filter. The filter is a work task and the pipe is a topic. Together they make a flow.

After the first filter, another pipe carries the water, minus the dirt, to another purification worker. In the water plant there is a dashboard for the managers to monitor the system to see if they need to fix something or add more pipes and cleaning tasks to the system.

Volume Analytics provides the pipes, a platform to run the worker tasks and a management tool to control the flow of data through the system.

A Volume Analytics Flow for Finding Social Media Bots

In addition, Volume Analytics has redundancy for disaster recovery, high availability and parallel processing. This is where our analogy fails. Data is duplicated across multiple topics. The failure of a particular topic (pipe) does not destroy any data because it is preserved on another topic. Topics are ideally set up in multiple data centers to maintain high availability.

In Volume Analytics, the water filters in the analogy are called tasks. Tasks are groups of code that perform some unit of work. Your specific application will have its own tasks. The tasks are deployed on more than one server in more than one data center.

Benefits

Faster start-up time saves money and time.

Volume Analytics allows a faster start-up time for a new application or system. The team does not need to build the platform that moves the data to tasks, nor a monitoring system, because those features are included. Volume Analytics will also integrate with your current monitoring systems.

System is down less often

The DevOps team gets visibility into the system out of the box. They do not have to stand up a log search system, which saves time: they can see what is going on and fix it quickly.

Plan for Growth

As your data grows and the system needs to process more of it, Volume Analytics grows with it. Add server instances to increase processing power; as the workload grows, Volume Analytics allocates work to the new instances. There is no re-coding needed, which saves time and money because developers are not needed to re-implement the code to work at a larger scale.

Less Disruptive Deployments

Construct your application in a way that allows for deployments of new features with a lower impact on features in production. New code libraries and modules can be deployed to the platform and allowed to interact with the already running parts of the system without an outage. A built-in code library repository is included.

In addition, currently running flows can be terminated while the data waits on the topics for the newly programmed flow to start.

This Flow processes files to find IP addresses, searches multiple APIs for matches and inserts data into a HANA database

A threat-search data processing flow in production. Each of the boxes is a task that performs a unit of work. The task puts the processed data on the topic represented by the star. Then the next task picks up the data and does another part of the job. The combination of a set of tasks and topics is a flow.

Geolocate IP Flow

An additional flow to geolocate IP addresses, added while the first flow is running.

Combined Flows

The combination of flows working together. The topic ip4-topic is an integration point.

Modular

Volume Analytics is modular and tasks are reusable. You can reconfigure your data processing pipeline without introducing new code. You can use tasks in more than one application.

Highly Available

Out of the box, Volume Analytics is highly available due to its built-in redundancy. Work tasks and topics (pipes) run in triplicate. As long as your compute instances are in multiple data centers, you will have redundancy built in. Volume Analytics knows how to balance the data between duplicates and avoid data loss if one or more work tasks fail; this extends to queuing up work if all work tasks fail.

Integration

Volume Analytics integrates with other products. It can retrieve data from and save data to other systems such as topics, queues, databases, file systems and data stores. These integrations happen over encrypted channels.

In our sample application, CyberFlow, there are many tasks that integrate with other systems. The read bucket task reads files from an AWS S3 bucket, the ThreatCrowd task calls the API at https://www.threatcrowd.org, and the Honeypot task calls https://www.projecthoneypot.org. The insert tasks then write to the SAP HANA database used in this example.

Volume Analytics integrates with your enterprise authentication and authorization systems like LDAP, Active Directory, CAP and more.

Data Management

Volume Analytics ingests datasets from throughout the enterprise, tracking each delivery and routing it through the platform to extract the greatest benefit. It shares common capabilities such as text extraction, sentiment analysis, categorization, and indexing. A series of services makes those datasets discoverable and available to authorized users and other downstream systems.

Data Analytics

In addition to the management console, Volume Analytics comes with a notebook application. This allows a data scientist or analyst to discover data and turn it into information on reports. After your data is processed by Volume Analytics and put into a database, the notebook can be used to visualize it. The data can be sliced and diced and displayed on graphs, charts and maps.

Volume Analytics Notebook

Flow Control Panel

The Flow control panel allows for control and basic monitoring of flows. Flows are groupings of tasks and topics working together. You can stop, start and terminate flows. From this screen you can launch additional flow virtual machines when there is a heavy data processing load. The panel also gives access to start up extra worker tasks as needed. There is also a link that will allow you to analyze the logs in Kibana.

Topic Control Panel

The topic control panel allows for the control and monitoring of topics. Monitor and delete topics  from here.

Consumer Monitor Panel

The consumer monitor panel allows for the monitoring of consumer tasks. Consumer tasks are the tasks that read from a topic. They may also write to a topic. This screen will allow you to monitor that the messages are being processed and determine if there is a lag in the processing.

Volume Analytics is used by our customers to process data from many data streams and data sources quickly and reliably. In addition, it has enabled the production of prototype systems that scale up into enterprise systems without rebuilding and re-coding the entire system.

And now this tour of Volume Analytics leads into a video demonstration of how it all works together.

Demonstration Video

This video will further describe the features of Volume Analytics using an example application which parses IP addresses out of incident reports and searches other systems for indications of those IP addresses. The data is saved into a SAP HANA database.

Request a Demo Today

Volume Analytics is scalable, fast, maintainable and repeatable. Contact us to request a free demo and experience the power and efficiency of Volume Analytics today.

Contact

HANA Zeppelin Query Builder with Map Visualization

SAP HANA Query Builder On Apache Zeppelin Demo

HANA Zeppelin Query Builder with Map Visualization

In working with Apache Zeppelin I found that users wanted a way to explore data and build charts without needing to know SQL right away. This is an attempt to build a note in Zeppelin that allows a new data scientist to get familiar with the data structure of their database and to build simple single-table queries for charts and maps quickly. In addition, it shows the SQL used to perform the work.

Demo

This video will demonstrate how it works. I have leveraged work done in Randy Gelhausen's query builder post on how to make a where clause builder. I also used Damien Sorel's jQuery QueryBuilder. These were used to make a series of paragraphs to look up tables and columns in HANA and allow the user to build a custom query. This data can be quickly graphed using the Zeppelin Helium visualizations.

The Code

This is for those data scientists and coders who want to replicate this in their own Zeppelin.

Note that this code is imperfect as I have not worked out all the issues with it. You may need to make changes to get it to work. It only works on Zeppelin 0.8.0 Snapshot. It is also made to work with SAP HANA as the database.

It has only one type of aggregation (sum), and it does not have a way to perform a HAVING clause. But these features could easily be added.

This Zeppelin note is dependent on code from a previous post. Follow the directions in Using Zeppelin to Explore a Database first.

Paragraph One

%spark
//Get list of columns on a given table
def columns1(table: String) : Array[(String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => x.asInstanceOf[String])
}

def columns(table: String) : Array[(String, String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => (x, x))
}

def number_column_types(table: String) : Array[String] = {
 var columnType = sqlContext.sql("select column_name from table_columns where table_name='" +
    table + "' and data_type_name = 'INTEGER'")
 
 columnType.map {case Row(column_name: String) => (column_name)}.collect()
}

// set up the tables select list
val tables = sqlContext.sql("show tables").collect.map(s=>s(1).asInstanceOf[String].toUpperCase())
z.angularBind("tables", tables)
var sTable ="tables"
z.angularBind("selectedTable", sTable)


z.angularUnwatch("selectedTable")
z.angularWatch("selectedTable", (before:Object, after:Object) => {
 println("running " + after)
 sTable = after.asInstanceOf[String]
 // put the id for paragraph 2 and 3 here
 z.run("20180109-121251_268745664")
 z.run("20180109-132517_167004794")
})


var col = columns1(sTable)
col = col :+ "*"
z.angularBind("columns", col)
// hack to make the where clause work on initial load
var col2 = columns(sTable)
var extra = ("1","1")
col2 = col2 :+ extra
z.angularBind("columns2", col2)
var colTypes = number_column_types(sTable)
z.angularBind("numberColumns", colTypes)
var sColumns = Array("*")
// hack to make the where clause work on initial load
var clause = "1=1"
var countColumn = "*"
var limit = "10"

// setup for the columns select list
z.angularBind("selectedColumns", sColumns)
z.angularUnwatch("selectedColumns")
z.angularWatch("selectedColumns", (before:Object, after:Object) => {
 sColumns = after.asInstanceOf[Array[String]]
 // put the id for paragraph 2 and 3 here
 z.run("20180109-121251_268745664")
 z.run("20180109-132517_167004794")
})
z.angularBind("selectedCount", countColumn)
z.angularUnwatch("selectedCount")
z.angularWatch("selectedCount", (before:Object, after:Object) => {
 countColumn = after.asInstanceOf[String]
})
// bind the where clause
z.angularBind("clause", clause)
z.angularUnwatch("clause")
z.angularWatch("clause", (oldVal, newVal) => {
 clause = newVal.asInstanceOf[String]
})

z.angularBind("limit", limit)
z.angularUnwatch("limit")
z.angularWatch("limit", (oldVal, newVal) => {
 limit = newVal.asInstanceOf[String]
})

This paragraph is Scala code that sets up functions used to query the table with the list of tables and the table with the list of columns. You must have the tables loaded into Spark as views or tables in order to see them in the select lists. This paragraph performs all the binding so that the next paragraph, which is Angular code, can get the data built here.

Paragraph Two

%angular
<link rel="stylesheet" href="https://cdn.rawgit.com/mistic100/jQuery-QueryBuilder/master/dist/css/query-builder.default.min.css">
<script src="https://cdn.rawgit.com/mistic100/jQuery-QueryBuilder/master/dist/js/query-builder.standalone.min.js"></script>

<script type="text/javascript">
  var button = $('#generateQuery');
  var qb = $('#builder');
  var whereClause = $('#whereClause');
 
  button.click(function(){
    whereClause.val(qb.queryBuilder('getSQL').sql);
    whereClause.trigger('input'); //triggers Angular to detect changed value
  });
 
  // this builds the where statement builder
  var el = angular.element(qb.parent('.ng-scope'));
  angular.element(el).ready(function(){
    var integer_columns = angular.element('#numCol').val()
    //Executes on page-load and on update to 'columns', defined in first snippet
    window.watcher = el.scope().compiledScope.$watch('columns2', function(newVal, oldVal) {
      //Append each column to QueryBuilder's list of filters
      var options = {allowEmpty: true, filters: []}
      $.each(newVal, function(i, v){
        if(integer_columns.split(',').indexOf(v._1) !== -1){
          options.filters.push({id: v._1, type: 'integer'});
        } else if(v._1.indexOf("DATE") !== -1) {
          options.filters.push({id: v._1, type: 'date'})
        } else { 
          options.filters.push({id: v._1, type: 'string'});
        }
      });
      qb.queryBuilder(options);
    });
  });
</script>
<input type="text" ng-model="numberColumns" id="numCol"></input>
<form class="form-inline">
 <div class="form-group">
 Please select table: Select Columns:<br>
 <select size=5 ng-model="selectedTable" ng-options="o as o for o in tables" 
       data-ng-change="z.runParagraph('20180109-151738_134370871')"></select>
 <select size=5 multiple ng-model="selectedColumns" ng-options="o as o for o in columns">
 <option value="*">*</option>
 </select>
 Sum Column:
 <select ng-model="selectedCount" ng-options="o as o for o in columns">
 <option value="*">*</option>
 </select>
 <label for="limitId">Limit: </label> <input type="text" class="form-control" 
       id="limitId" placeholder="Limit Rows" ng-model="limit"></input>
 </div>
</form>
<div id="builder"></div>
<button type="submit" id="generateQuery" class="btn btn-primary" 
       ng-click="z.runParagraph('20180109-132517_167004794')">Run Query</button>
<input id="whereClause" type="text" ng-model="clause" class="hide"></input>

<h3>Query: select {{selectedColumns.toString()}} from {{selectedTable}} where {{clause}} 
   with a sum on: {{selectedCount}} </h3>

Paragraph two uses the jQuery and jQuery QueryBuilder JavaScript libraries. In the z.runParagraph command, use the paragraph id from paragraph three.

Paragraph Three

The results of the query show up in this paragraph. Its function is to generate the query and run it for display.

%spark
import scala.collection.mutable.ArrayBuffer

var selected_count_column = z.angular("selectedCount").asInstanceOf[String]
var selected_columns = z.angular("selectedColumns").asInstanceOf[Array[String]]
var limit = z.angular("limit").asInstanceOf[String]
var limit_clause = ""
if (limit != "*") {
 limit_clause = "limit " + limit
}
val countColumn = z.angular("selectedCount")
var selected_columns_n = selected_columns.toBuffer
// remove from list of columns
selected_columns_n -= selected_count_column

if (countColumn != "*") {
 val query = "select "+ selected_columns_n.mkString(",") + ", sum(" + selected_count_column +
     ") "+ selected_count_column +"_SUM from " + z.angular("selectedTable") + " where " + 
      z.angular("clause") + " group by " + selected_columns_n.mkString(",") + " " + 
      limit_clause
 println(query)
 z.show(sqlContext.sql(query))
} else {
 val query2 = "select "+ selected_columns.mkString(",") +" from " + z.angular("selectedTable") + 
      " where " + z.angular("clause") + " " + limit_clause
 println(query2)
 z.show(sqlContext.sql(query2))
}

Now, if everything is just right, you will be able to query your tables without writing SQL. This is a limited example, as I have not provided options for different types of aggregation, advanced grouping, or joins across multiple tables.
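
As one example, the aggregation function could be made selectable rather than hard-coded. The sketch below reuses the variables defined in Paragraph Three and assumes a hypothetical selectedAgg Angular binding (it would be bound and watched in Paragraph One the same way selectedCount is) holding sum, avg, min or max; it is an illustration of the idea, not part of the note above.

%spark
// Hypothetical extension: use a user-selected aggregate function instead of a hard-coded sum.
// Assumes selectedAgg was bound with z.angularBind("selectedAgg", "sum") in Paragraph One.
val agg = z.angular("selectedAgg").asInstanceOf[String]
val aggQuery = "select " + selected_columns_n.mkString(",") + ", " + agg + "(" + selected_count_column + ") " +
    selected_count_column + "_" + agg.toUpperCase + " from " + z.angular("selectedTable") +
    " where " + z.angular("clause") + " group by " + selected_columns_n.mkString(",") + " " + limit_clause
println(aggQuery)
z.show(sqlContext.sql(aggQuery))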

 

Please follow us on our website at https://volumeintegration.com and on Twitter at @volumeint.

Volume Analytics Table Explorer - HANA & Zeppelin

Using Zeppelin to Explore a Database

In attempting to use Apache Zeppelin I found it difficult to simply explore a new database. This was the situation when connecting an SAP HANA database to Apache Zeppelin using the JDBC driver.

So I created a Zeppelin interface that can be used by a person who does not know how to code or use SQL.

This is a note with code in multiple paragraphs that allows a person to see a list of all the tables in the database, view their structure, and look at a sample of the data in each table.

Volume Analytics Table Explorer – HANA & Zeppelin

When using a standard database with Apache Zeppelin, one needs to register each table into Spark so that it can query it and make DataFrames from the native tables. I got around this by allowing the user to choose the tables they want to register into Apache Zeppelin and Spark. This registration uses the createOrReplaceTempView function on a DataFrame. This allows us to retain the speed of HANA without copying all the data into a Spark table.
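
As a minimal sketch of that registration step (the driver, URL, credentials and table name below are placeholders for your environment), a DataFrame is read over JDBC and then exposed to Spark SQL under a view name:

%spark
// Minimal sketch: read one HANA table over JDBC and register it as a Spark view.
// Connection details and MY_SCHEMA.MY_TABLE are placeholders.
val df = sqlContext.read.format("jdbc")
  .option("driver", "com.sap.db.jdbc.Driver")
  .option("url", "jdbc:sap://hana-host:30015/dbname")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", "MY_SCHEMA.MY_TABLE")
  .load()
// The view name is what other paragraphs and notes query; the data itself stays in HANA
// and is fetched through the JDBC connection rather than copied into Spark.
df.createOrReplaceTempView("MY_TABLE")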

The video shows a short demonstration of how this works.

Once tables are registered as Spark views they can be used by all the other notes on the Apache Zeppelin server. This means that other users can leverage the tables without knowing they came from the HANA database.

The code is custom to HANA because of the names of the system tables where it stores the lists of tables and column names. The code also converts HANA specific data types such as ST_POINT to comma delimited strings.

This example of dynamic forms was informed by Data-Driven Dynamic Forms in Apache Zeppelin.

The Code

Be aware this is prototype code that works on Zeppelin 0.8.0 Snapshot, which as of this writing needs to be built from source. It is pre-release.

First Paragraph

In the first paragraph I am loading the HANA JDBC driver. But you can avoid doing this by adding your JDBC jar to the dependencies section of the interpreter configuration, as laid out in How to Use Zeppelin With SAP HANA.

%dep
z.reset() 
z.load("/projects/zeppelin/interpreter/jdbc/ngdbc.jar")

Second Paragraph

In the second paragraph we build DataFrames from the HANA system tables that contain the list of tables and columns in the database. These will be used to show the user what tables and columns are available for data analysis.

%spark
import org.apache.spark.sql._
val driver ="com.sap.db.jdbc.Driver"
val url="jdbc:sap://120.12.83.105:30015/ffa"
val database = "dbname"
val username = "username"
val password = "password"
// type in the schemas you wish to expose
val tables = """(select * from tables where schema_name in ('FFA', 'SCHEMA_B')) a """
val columns = """(select * from table_columns where schema_name in ('FFA', 'SCHEMA_B')) b """

val jdbcDF = sqlContext.read.format("jdbc").option("driver",driver)
 .option("url",url)
 .option("databaseName", database)
 .option("user", username)
 .option("password",password)
 .option("dbtable", tables).load()
jdbcDF.createOrReplaceTempView("tables")

val jdbcDF2 = sqlContext.read.format("jdbc").option("driver",driver)
 .option("url",url)
 .option("databaseName", database)
 .option("user", username)
 .option("password",password)
 .option("dbtable", columns).load()
jdbcDF2.createOrReplaceTempView("table_columns")

Third Paragraph

The third paragraph contains the functions that the fourth paragraph will call. These functions return the column names and types when given a table name. It also has the function that loads a HANA table into a Spark table view.

%spark
//Get list of distinct values on a column for given table
def distinctValues(table: String, col: String) : Array[(String, String)] = {
 sqlContext.sql("select distinct " + col + " from " + table + " order by " + col).collect.map(x => (x(0).asInstanceOf[String], x(0).asInstanceOf[String]))
}

def distinctWhere(table: String, col: String, schema: String) : Array[(String, String)] = {
 var results = sqlContext.sql("select distinct " + col + " from " + table + " where schema_name = '" + schema +"' order by " + col)
 results.collect.map(x => (x(0).asInstanceOf[String], x(0).asInstanceOf[String]))
}

//Get list of tables
def tables(): Array[(String, String)] = {
 sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String].toUpperCase(), x(1).asInstanceOf[String].toUpperCase()))
}

//Get list of columns on a given table
def columns(table: String) : Array[(String, String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => (x, x))
}

def hanaColumns(schema: String, table: String): Array[(String, String)] = {
 sqlContext.sql("select column_name, data_type_name from table_columns where schema_name = '"+ schema + "' and table_name = '" + table+"'").collect.map(x => (x(0).asInstanceOf[String], x(1).asInstanceOf[String]))
}

//load table into spark
def loadSparkTable(schema: String, table: String) : Unit = {
  var columns = hanaColumns(schema, table)
  var tableSql = "(select "
  for (c <- columns) {
    // If this column is a geo datatype convert it to a string
    if (c._2 == "ST_POINT" || c._2 == "ST_GEOMETRY") {
      tableSql = tableSql + c._1 + ".st_y()|| ',' || " + c._1 + ".st_x() " + c._1 + ", "
    } else {
      tableSql = tableSql + c._1 + ", "
    }
  }
 tableSql = tableSql.dropRight(2)
 tableSql = tableSql + " from " + schema +"."+table+") " + table

 val jdbcDF4 = sqlContext.read.format("jdbc").option("driver",driver)
  .option("url",url)
  .option("databaseName", "FFA")
  .option("user", username)
  .option("password", password)
  .option("dbtable", tableSql).load()
  jdbcDF4.createOrReplaceTempView(table)
 
}

//Wrapper for printing any DataFrame in Zeppelin table format
def printQueryResultsAsTable(query: String) : Unit = {
 val df = sqlContext.sql(query)
 print("%table\n" + df.columns.mkString("\t") + '\n'+ df.map(x => x.mkString("\t")).collect().mkString("\n")) 
}

def printTableList(): Unit = {
 println(sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String])).mkString("%table\nTables Loaded\n","\n","\n"))
}

// this part keeps a list of the tables that have been registered for reference
val aRDD = sc.parallelize(sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String])))
val aDF = aRDD.toDF()
aDF.registerTempTable("tables_loaded")

Fourth Paragraph

The fourth paragraph contains the Spark code needed to produce the interface with select lists for picking the tables. It uses dynamic forms as described in the Zeppelin documentation and illustrated in more detail by Rander Zander.

%spark
val schema = z.select("Schemas", distinctValues("tables","schema_name")).asInstanceOf[String]
var table = z.select("Tables", distinctWhere("tables", "table_name", schema)).asInstanceOf[String]
val options = Seq(("yes","yes"))
val load = z.checkbox("Register & View Data", options).mkString("")

val query = "select column_name, data_type_name, length, is_nullable, comments from table_columns where schema_name = '" + schema + "' and table_name = '" + table + "' order by position"
val df = sqlContext.sql(query)


if (load == "yes") { 
 if (table != null && !table.isEmpty()) {
   loadSparkTable(schema, table)
   z.run("20180108-113700_1925475075")
 }
}

if (table != null && !table.isEmpty()) {
 println("%html <h1>"+schema)
 println(table + "</h1>")
 z.show(df)
} else {
 println("%html <h1>Pick a Schema and Table</h1>")
}

As the user changes the schema select list, the functions in paragraph three are called and the tables select list is populated with the new tables. When they select a table, the paragraph refreshes with a table containing some details about the table's columns, such as their types and sizes.

When they select the Register & View Data checkbox, the table is turned into a Spark view and paragraph five will contain the data contents of the table. Note the z.run command: it runs a specific paragraph, and you need to put in your own value here. It should be the paragraph id of the next paragraph, which is paragraph five.

Paragraph Five

%spark
z.show(sql("select * from " + table +" limit 100"))

The last paragraph lists the first 100 rows of the table that has been selected and has Register & View Data checked.

Slight modifications of this code will allow the same sort of interface to be built for MySQL, Postgres, Oracle, MS-SQL or any other database.
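
As a rough sketch of what those modifications look like for PostgreSQL (the driver class, URL, credentials and schema names below are placeholders), the HANA system tables are swapped for information_schema and the rest of the note stays the same:

%spark
// Hedged sketch: table and column discovery against PostgreSQL instead of HANA.
val driver = "org.postgresql.Driver"
val url = "jdbc:postgresql://db-host:5432/dbname"
// information_schema replaces HANA's tables and table_columns system views
val tables = """(select table_name, table_schema as schema_name from information_schema.tables
  where table_schema in ('public')) a"""
val columns = """(select table_name, column_name, data_type as data_type_name, is_nullable
  from information_schema.columns where table_schema in ('public')) b"""
// The sqlContext.read.format("jdbc") ... createOrReplaceTempView calls from the second paragraph
// are unchanged. HANA-specific columns (length, comments) and the ST_POINT handling in
// loadSparkTable would need small adjustments for other databases.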

Now go to SAP HANA Query Builder On Apache Zeppelin Demo and you will find code to build a simple query builder note.

Please let us know on Twitter, Facebook and LinkedIn if this helps you or you find a better way to do this in Zeppelin.

Query of a geographic region.

Zeppelin Maps the Hard Way

In Zeppelin Maps the Easy Way I showed how to add a map to Zeppelin with a Helium module. But what if you do not have access to the Helium NPM server to load in that module? And what if you want to add features to your Leaflet Map that are not supported in the volume-leaflet package?

This will show you how the Angular JavaScript library allows you to add a map user interface to a Zeppelin paragraph.

Zeppelin Angular Leaflet Map

Zeppelin Angular Leaflet Map with Markers

First we want to get a map on the screen with markers.

In Zeppelin create a new note.

As was shown in How to Use Zeppelin With SAP HANA we create a separate paragraph to build the database connection. Please substitute in your own database driver and connection string to make it work for other databases. There are other examples where you can pull in data from a csv file and turn it into a table object.

In the next paragraph we place the Spark Scala code to query the database and build the markers that will be passed to the final paragraph, which is built with Angular.

The data query paragraph has a basic way to query a bounding box. It just looks for coordinates that are greater than the southwest corner and less than the northeast corner of the bounding box.

var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
var coords = box.split(",")
sql1 = sql1 + " where lng > " + coords(0).toFloat + " and lat > " + coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " + coords(3).toFloat
}

var sql = sql1 +" limit 20"
val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect()) 

The data from this query is used to make the map_pings and bind them to Angular so that any Angular code can reference them. Zeppelin has the ability to bind data across languages so it can be used by different paragraphs in the same note. There are samples for other databases, JSON and CSV files at this link.

We do not have access to the Hana proprietary functions because Zeppelin will load the data up in its own table view of the HANA table. We are using the command “createOrReplaceTempView” so that a copy of the data is not made in Zeppelin. It will just pass the data through.

Note that you should set up the HANA jdbc driver as described in How to Use Zeppelin With SAP HANA.

It is best if you set up a dependency to the HANA jdbc jar in the Spark interpreter. Go to the Zeppelin settings menu.

Zeppelin Settings Menu

Pick the Interpreter and find the Spark section and press edit.

Zeppelin Interpreter Screen

Then add the path where you have the SAP HANA JDBC driver, ngdbc.jar, installed.

Configure HANA jdbc in Spark Interpreter

First Paragraph

%spark
import org.apache.spark.sql._
val driver ="com.sap.db.jdbc.Driver"
val url="jdbc:sap://11.1.88.110:30015/tri"
val database   = "database schema"   
val username   = "username for the database"
val password   = "the Password for the database"
val table_view = "event_view"
var box=""
val jdbcDF = sqlContext.read.format("jdbc").option("driver",driver)
                                           .option("url",url)
                                           .option("databaseName", database)
                                           .option("dbtable", "event_view")
                                           .option("user", username)
                                           .option("password",password)
                                           .option("dbtable", table_view).load()
jdbcDF.createOrReplaceTempView("event_view")

Second Paragraph

%spark

var box = "20.214843750000004,1.9332268264771233,42.36328125000001,29.6880527498568";
var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
    var coords = box.split(",")
    sql1 = sql1 + " where lng  > " + coords(0).toFloat + " and lat > " +  
        coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " +
        coords(3).toFloat
}
var sql = sql1 +" limit 20" 

val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect())
z.angularBind("paragraph", z.getInterpreterContext().getParagraphId())
// get the paragraph id of the angular paragraph and put it below
z.run("20171127-081000_380354042")

Third Paragraph

In the third paragraph we add the Angular code with the %angular directive. Note the forEach loop section where it builds the markers and adds them to the map.

%angular 
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.css" />
.
<div id="map" style="height: 300px; width: 100%"></div>
<script type="text/javascript">
function initMap() {
    var element = $('#textbox');
    var map = L.map('map').setView([30.00, -30.00], 3);
   
    L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);
    var geoMarkers = L.layerGroup().addTo(map);
    
    var el = angular.element($('#map').parent('.ng-scope'));
    var $scope = el.scope().compiledScope;
   
    angular.element(el).ready(function() {
        window.locationWatcher = $scope.$watch('locations', function(newValue, oldValue) {
            //geoMarkers.clearLayers();
            angular.forEach(newValue, function(event) {
                if (event)
                  var marker = L.marker([event.values[1], event.values[2]]).bindPopup(event.values[0]).addTo(geoMarkers);
            });
        })
    });
}
if (window.locationWatcher) { window.locationWatcher(); }

// ensure we only load the script once, seems to cause issues otherwise
if (window.L) {
    initMap();
} else {
    console.log('Loading Leaflet library');
    var sc = document.createElement('script');
    sc.type = 'text/javascript';
    sc.src = 'https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.js';
    sc.onerror = function(err) { alert(err); }
    sc.onload = initMap;  // initialize the map once the Leaflet script has loaded
    document.getElementsByTagName('head')[0].appendChild(sc);
}
</script>
<p>Testing the Map</p>

<form class="form-inline">
  <div class="form-group">
    <input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>
    <label for="paragraphId">Paragraph Id: </label>
    <input type="text" class="form-control" id="paragraphId" placeholder="Paragraph Id ..." ng-model="paragraph"></input>
  </div>
  <button type="submit" class="btn btn-primary" ng-click="z.runParagraph(paragraph)"> Run Paragraph</button>
</form>

Now when you run the three paragraphs in order it should produce a map with markers on it.

The next step is to add a way to query the database by drawing a box on the screen. Into the Scala / Spark code we add a variable for the bounding box with the z.angularBind() command. Then a watcher is made to see when this variable changes so the new value can be used to run the query.

Modify Second Paragraph

%spark
z.angularBind("box", box)
// Get the bounding box
z.angularWatch("box", (oldValue: Object, newValue: Object) => {
    println(s"value changed from $oldValue to $newValue")
    box = newValue.asInstanceOf[String]
})

var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
    var coords = box.split(",")
    sql1 = sql1 + " where lng  > " + coords(0).toFloat + " and lat > " +  coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " +  coords(3).toFloat
}
var sql = sql1 +" limit 20" 

val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect())
z.angularBind("paragraph", z.getInterpreterContext().getParagraphId())
z.run("20171127-081000_380354042") // put the paragraph id for your angular paragraph here

To the Angular section we need to add an additional Leaflet library called Leaflet.draw. This is done by adding an additional CSS link and a JavaScript script tag. Then the draw controls are added as shown in the code below.

Modify the Third Paragraph

%angular 
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet.draw/0.4.13/leaflet.draw.css" />
.
<script src='https://cdnjs.cloudflare.com/ajax/libs/leaflet.draw/0.4.13/leaflet.draw.js'></script>
<div id="map" style="height: 300px; width: 100%"></div>

<script type="text/javascript">
function initMap() {
    var element = $('#textbox');
    var map = L.map('map').setView([30.00, -30.00], 3);
   
    L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);
    var geoMarkers = L.layerGroup().addTo(map);
    var drawnItems = new L.FeatureGroup();
    
    map.addLayer(drawnItems);
    
    var drawControl = new L.Control.Draw({
        draw: {
             polygon: false,
             marker: false,
             polyline: false
        },
        edit: {
            featureGroup: drawnItems
        }
    });
    map.addControl(drawControl);
    
    map.on('draw:created', function (e) {
        var type = e.layerType;
        var layer = e.layer;
        drawnItems.addLayer(layer);
        element.val(layer.getBounds().toBBoxString());
        map.fitBounds(layer.getBounds());
        window.setTimeout(function(){
           //Triggers Angular to do its thing with changed model values
           element.trigger('input');
        }, 500);
    });
    
    var el = angular.element($('#map').parent('.ng-scope'));
    var $scope = el.scope().compiledScope;
   
    angular.element(el).ready(function() {
        window.locationWatcher = $scope.$watch('locations', function(newValue, oldValue) {
            $scope.latlng = [];
            angular.forEach(newValue, function(event) {
                if (event) {
                  var marker = L.marker([event.values[1], event.values[2]]).bindPopup(event.values[0]).addTo(geoMarkers);
                  $scope.latlng.push(L.latLng(event.values[1], event.values[2]));
                }
            });
            var bounds = L.latLngBounds($scope.latlng)
            map.fitBounds(bounds)
        })
    });

}

if (window.locationWatcher) { window.locationWatcher(); }

// ensure we only load the script once, seems to cause issues otherwise
if (window.L) {
    initMap();
} else {
    console.log('Loading Leaflet library');
    var sc = document.createElement('script');
    sc.type = 'text/javascript';
    sc.src = 'https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.js';
    sc.onerror = function(err) { alert(err); }
    document.getElementsByTagName('head')[0].appendChild(sc);
    sc.onload = initMap;
}
</script>
<p>Testing the Map</p>

<form class="form-inline">
  <div class="form-group">
    <input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>
    <label for="paragraphId">Paragraph Id: </label>
    <input type="text" class="form-control" id="paragraphId" placeholder="Paragraph Id ..." ng-model="paragraph"></input>
  </div>
  <button type="submit" class="btn btn-primary" ng-click="z.runParagraph(paragraph)"> Run Paragraph</button>
</form>

There are some important features to mention here that took some investigation to figure out.

Within Zeppelin I was unable to get the box being drawn to be visible. So instead, drawing a box causes the map to zoom to the selected area by utilizing this code:
element.val(layer.getBounds().toBBoxString());
map.fitBounds(layer.getBounds());

To make the map zoom back to the area after the query is run, this code is triggered:

$scope.latlng.push(L.latLng(event.values[1], event.values[2]))
...
var bounds = L.latLngBounds($scope.latlng)
map.fitBounds(bounds)

To trigger the Spark / Scala paragraph to run after a box is drawn, this attribute causes the query paragraph to run: data-ng-change="z.runParagraph(paragraph);"

<input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>

The HTML form at the bottom is what holds and binds the data back and forth between the paragraphs. It is left visible for debugging at the moment.

Query of a geographic region with Zeppelin

Query of a geographic region

Please let us know how it works out for you. Hopefully this will help you add maps to your Zeppelin notebook. I am sure there are many other better ways to accomplish this feature set but this is the first way I was able to get it all to work together.

Demo of the interface:

You can contact us on Twitter at @volumeint.

Some code borrowed from: https://gist.github.com/granturing/a09aed4a302a7367be92 and https://zeppelin.apache.org/docs/latest/displaysystem/front-end-angular.html

Zeppelin Maps the Easy Way

You want your data on a map in Apache Zeppelin.

Apache Zeppelin allows a person to query all sorts of databases and big data systems using many languages. This data can then be displayed as graphs and charts. There is also a way to make a data driven map.

A map with data markers on it in Zeppelin

There is a very easy way to get a basic map with markers on it into Zeppelin. The first thing to do is query the database for the data we want to put on the map. Create a new note and add a query for your database of choice.

If you are using SAP HANA, there are directions on how to install the JDBC driver at How to Use Zeppelin With SAP HANA.

If you are using another database just use the %jdbc interpreter and modify your database configuration settings in the interpreter settings section of Zeppelin.

The query you write must have a column for the longitude and the latitude.

Query for data with Zeppelin

In my example I am using the query "select country_name, lat, lng, event_type from event_view where actor_type_id='2' and country_name ='South Sudan' limit 10". Note that my interpreter is HANA. This is described in a previous post about HANA and Zeppelin.
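
For reference, that query sits in its own paragraph. The %jdbc prefix below assumes the generic JDBC interpreter mentioned above; if you bound a dedicated HANA interpreter instead, use that interpreter name.

%jdbc
select country_name, lat, lng, event_type
from event_view
where actor_type_id = '2' and country_name = 'South Sudan'
limit 10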

Once this data is in the table, the plug-in for the map needs to be installed. In the upper right corner click on the login indicator and select Helium.

The Zeppelin Setup Menu

A list of Helium plug-ins will be displayed. Enable the one called zeppelin-leaflet or volume-leaflet. The Helium module for maps utilizes another JavaScript library called Leaflet.

Helium Modules List

Helium Modules

Once it is enabled, go back to your note where you created the map query. A new button will show up. It looks like the earth on a white background. When you press it, a map will appear that needs to be configured.

Configure Map Settings

Press the settings link. As shown above you will see the fields from the query. Just drag the fields down into the boxes called latitude, longitude, tooltip and popup.

Mapping the Columns for the map

Mapping the Columns

Once the latitude and longitude are filled in, the map will appear with markers for the data points. If you put columns into the tooltip and popup boxes, the tooltips and popups will work also.

The Finished Map

This demonstrates the easiest way to add a map to Zeppelin. More advanced data processing before a map is built requires writing a paragraph of code in Spark Scala, Python or another language supported by Apache Zeppelin.
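
As a small taste of that kind of preprocessing, the hedged sketch below counts events per location before handing the result to the map; it assumes event_view has already been registered as a Spark view, as shown in the earlier HANA posts.

%spark
// Hedged sketch: aggregate events per location before visualizing them on the map.
// Assumes event_view was registered with createOrReplaceTempView as in the previous posts.
val byLocation = sqlContext.sql(
  "select country_name, lat, lng, count(*) as event_count " +
  "from event_view group by country_name, lat, lng")
z.show(byLocation)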

Future posts will show how to write Scala code to preprocess data and how to draw a box on the map to select the area of the earth to be queried for data points.

Please follow us on our website at https://volumeintegration.com and on Twitter at @volumeint.

War, What is it Good For?

Do you remember the Edwin Starr song “War” from 1969? The chorus repeats:

War, huh yeah

What is it good for?

Absolutely nothing, oh hoh, oh

Well, war is good for at least one thing…maps!

Wartime Maps

Mapping data before computers was difficult and seems to have been a primary concern during war. In fact, wars have advanced the state of the art in mapping data for situational awareness throughout history. The speed at which we can determine events and plot them on a map shows amazing technical advancement.

The basic idea is to visualize the placement of the enemy and friendly forces on a paper map with pins, which we still do today. But instead of physical pins, we use images of pins on an electronic map.

Churchill’s War Rooms

The Map room – Churchill War Rooms

I want to take you to where Winston Churchill pored over maps during World War II. His war rooms were contained in an underground bunker beneath five feet of concrete in London. According to the Imperial War Museums, there was a concern that Londoners would feel abandoned and evacuation would be slow. So the government built a bunker right in London for use during the next war.

These rooms were left exactly the way they were found on August 16, 1945, at the end of the war. You can still see the pin holes in the maps for past troop movements and ships as they crossed the ocean.

Large Wall Map – Churchill War Rooms

There are also walls full of graphs and charts. It's the 1940s version of today's management dashboard. These charts outlined the number of troops and were kept up to date by an army of people moving pins and updating charts.

Informational Bar Charts – Churchill War Rooms

It is obvious how these maps and charts were used to enhance decision-making. They provided accurate knowledge and understanding of location, type and counts of equipment, and health of the troops for both the Axis and the Allies.

Graphs – Churchill War Rooms

There is even a map of Germany with an acetate covering to allow them to write on it. The last thing they wrote were the outlines of which countries would administer the division of Germany.

Germany Divided

Men of Maps

Churchill enjoyed studying maps so much that he had his sleeping/office quarters in the bunker papered with maps from floor to ceiling. His love for maps was well known.

In fact, his peer and collaborator in America, Franklin Roosevelt, was also a big fan of maps and had a steady stream of updated maps provided to him by the National Geographic Society. In the FDR White House, there was a cloakroom converted into a map room modeled after Churchill’s map room. The FDR Library says, “Maps posted in the room were used to track the locations of land, sea and air forces.”

Secret Room

There was another more secretive part of the Churchill War Rooms. Down a back hallway there was a restroom, or as it is called in England, the WC.

It was reserved for Winston Churchill’s use alone. Very few people really knew what was on the other side of the door.

Churchill’s “Water Closet” in the War Rooms

Typical Restroom Lock Indicator for Restrooms in England

The space was actually a secret telephone room with a direct line to FDR in the White House. The two leaders would coordinate the war operations over the encrypted line. It was encrypted with a system called SIGSALY that sat under the Selfridges department store on one end and the Pentagon on the other.

Innovations Continue Today

The use of great human effort, paper maps, and telecommunications aided in the war effort and led to innovations in managing logistics and monitoring world events geospatially. We have come a long way, but we still put pins in a map; they just happen to be electronic. The militaries of the world continue to upgrade their map rooms into walls of video screens and server rooms of computers to make visualization updates in near real-time. Onward!

 

To learn more about Volume Labs and Volume Integration, please follow us on Twitter @volumeint and check out our website.

Mapping an Epidemic

This map changed the way we see the world and the way we study science, nature, and disease.

In August of 1854, cholera was ravaging the Soho neighborhood of London where John Snow was a doctor. People were fleeing the area as they thought cholera was spread by gasses in the air or, as they called it, "bad air."

Just as there is disinformation today about Ebola being airborne, the experts of that time thought most disease was spread in the air. There was no concept that disease might be in the water. They had no idea that bacteria even existed.

John had worked as a doctor in a major outbreak of cholera in a mine. But despite working in close quarters with the miners, he never contracted the disease. He wondered why the air did not affect him.

This inspired him to write a paper on why he believed cholera was spread through water and bodily fluids. The experts at the time did not accept his theory; they continued to believe cholera was caused by the odors emitted by rotting waste.

In the Soho outbreak in August 1854, John Snow saw a chance to further prove his theory. He went door to door keeping a tally of deaths at each home. This was only part of his quest to find evidence to prove the source of the plagues of the day.

He had been collecting statistical information, personal interviews, and other research for many years. He added this information to his paper, “On the Mode of Communication of Cholera.” The paper and his work in researching and collecting evidence founded the science of epidemiology.

One of the most innovative features was plotting data using a map; it was the first published use of dots on a map supporting a scientific conclusion. Each of the bars on John Snow’s map represents one death. Using this visual technique, he could illustrate that the deaths were centered around a point and further investigate and interview people in the area. He could also find anomalies and outliers such as deaths far from the concentration and areas with no deaths.

Epicenter Pump and Brewery

He found through personal interviews and mapping the data that the workers in the brewery (in the epicenter of the epidemic) were not dying. The owner of the brewery said that the workers were given free beer, and he thought that they never drank water at all. In fact, there was a deep well in the brewery used in the beer. In other cases, John Snow found that addresses with low deaths had their own personal well.

He also investigated the outlying incidents through interviews: some worked in the area of the pump or walked by it on the way to school. One woman who got sick had the water brought to her by a wagon each day because she liked the taste of that particular well water. One person he talked to even said the water smelled like sewage and did not drink it, but his servant did and came down with a case of cholera.

The incidents highlighted the area around a public pump on Broad Street. Using his data, he convinced the local authorities to have the pump handle removed.

The most innovative feature of the map is that it changed the way we use maps. The idea that data could be visualized to prove a fact was very new.

John Snow’s map of the service areas of two water companies

John Snow also produced another map showing which water companies supplied water in London. This map showed that the water company which stopped using water from the Thames had a lower death rate due to cholera. The map allowed John Snow to provide further evidence of disease spread through water and what could be done to fix the issue.

This is similar to the Ebola outbreak of today where tracking the disease is important. John Snow’s idea of collecting data in the field and mapping it lives on in maps like those from HealthMap, which show the spread of the Ebola virus.

Data Exploration via Map

Today, we use data driven maps as a powerful tool for all sorts of reasons. But it all started with John Snow.

(For an interesting take on this event and other historical technology that changed the way we live today, watch the "Clean" episode of the How We Got to Now series on PBS.)

To learn more about Volume Labs and Volume Integration, please follow us on Twitter @volumeint and check out our website.

10+ Surprising Geospatial Technologies

Data Organized on Map

I’ve spent years in the geospatial arena, so I’m a bit of a geospatial technology geek. But now it seems like the rest of the world is increasingly interested in this technology too.

You may remember the old latitude and longitude numbers that you learned about in school. Perhaps they didn’t seem very useful or relevant to life at the time, but these coordinates are now tracked constantly with our various GPS enabled gadgets. It’s becoming increasingly common to use coordinates to define the location of data collected, a person, landmark, and more. We can add even further accuracy by recording elevation and point in time.

I would like to describe some of the components that fall under the umbrella of geospatial technology. You might find some surprises!

Equipment

First, let’s discuss some of the tools used to collect geospatial data.

1. GPS

Global Positioning System (GPS) technology is the software and equipment needed to provide the location of things on the planet. This is most often done with the use of special satellites but is often augmented by other methods like WiFi signals. There are even technologies in use that determine location by looking at the stars.

2. Field Sensors

Field sensors are electronic devices that are placed to collect information about weather, soil, or other environmental conditions. These data collecting devices could be anything from a camera to a cell phone. During collection, the data is tagged with geospatial information, so the location of the event is known and can be mapped.

Overhead Imagery

My next geospatial category is overhead imagery. This includes all the imagery from aircrafts and satellites.

3. Visual Overhead Imagery

Visual overhead imagery includes what you see in Google Maps and Google Earth when you use the satellite function. This imagery could be collected via satellite or aircraft, and the technology used involves cameras, aircraft, satellites, global positioning systems, altimeters, and microwave transmission equipment. Today, even video is collected overhead by Planet Labs.

If you don’t own an airplane or satellite, can you collect visual overhead imagery? Yes! It doesn’t have to be expensive. Some hobbyists and students are cutting their teeth on low-cost imagery collection using kites and balloons.

Balloon mapping of Lake Borgne, Louisiana (Cartographer: Stewart Long/publiclab.org)

4. Hyperspectral Overhead Imagery

Hyperspectral refers to the waves of light that are beyond human sight. Engineers have developed sensors that can gather these waves from space, but it can also be done from aircraft. The data is then transformed into a visual representation through analysis and processing to create hyperspectral overhead imagery.

This type of geospatial technology has some surprising uses. Over at the US Geological Survey (USGS), they have used hyperspectral overhead imagery collected via satellite to detect the presence of arsenic in the leaves of ferns. Further analysis led them to aid in locating arsine gas canisters buried in Washington, DC. For more information, check out the full dissertation entitled Remote Sensing Investigations of Fugitive Soil Arsenic and its Effects on Vegetation Reflectance.

5. LIDAR

Light Detection and Ranging (LIDAR) is a technology that uses an airborne system to measure distance by shining a laser to the ground and measuring the reflected light. This yields a very accurate contour of the earth’s surface as shown in the image of the Three Sisters below.

LIDAR image of the Three Sisters volcanic peaks in Oregon (DOGAMI)

LIDAR can also measure objects on the ground such as trees and houses. This type of data is used to determine elevation and is often used when processing other imagery to improve accuracy.

How do autonomous vehicles “see” where they are going and what is in the way? LIDAR, of course! Plus, it’s even used in various industries to make 3D models of buildings and topography.

Processing

So now that we collected all this imagery, how do we use it?

6. Imagery Processing Systems

The overhead imagery produced from satellites and aircraft is not perfect for human viewing in raw form. So we use imagery processing systems to help automate the manipulation of images and data collected. This collection of computer systems makes the images and data useful to us.

Most images are taken from an angle and must be adjusted or warped. Imagery processing systems assign each pixel a geographic coordinate and an elevation. This is done by combining GPS data that was collected with each click of the camera.

Often this process is called orthorectification. To see a simplified illustration, take a look at this orthorectification animation from Satellite Imaging Corporation.

7. Geospatial Mapping

Geospatial mapping is the process and technology involved in placing information on a map. It is often the final stage of geospatial processing.

Mapping combines data from many sources and layers it onto a map, so conclusions can be drawn about the data. There are different degrees of accuracy required in this process. For some applications, showing data in an approximate relation to each other is sufficient. But other applications, like construction and military exercises, require specialized software and equipment to be as precise as possible.

In an earlier post, I wrote about creating maps with D3. The goal was to build a heat map to display the count of documents for each place name as shown in the image below.

Data Organized on Map

Applications

Let's explore some of the applications of all this geospatial technology.

8. Geospatial Marketing

Geospatial marketing is the concept of using geospatial tools and the collection of location information to improve marketing to customers. This is often a subset of geospatial mapping, but this application combines data about customers’ locations. This can help determine where to place a store or how many customers purchase from a particular location. For example, companies can use data about where people typically go after a ballgame to determine where advertisements should be placed.

Another widespread application of geospatial data in marketing is using the IP addresses gained from customers browsing websites and viewing advertisements. These IP addresses can be geographically located, sometimes as specifically as a person’s house, and then used to target advertisements or redesign a website.

9. Location-Aware Applications

Location-aware applications are a category of technologies that are cognizant of their location and provide feedback based on that location. In fact, if an IP address can be tied to a location, almost any application can be location-aware.

With the advent of smart phones, location-aware applications have become even more common. Of course, your phone’s mapping application can display your location on a map.

There are also smartphone apps that will trigger events or actions on a phone when you cross into a geospatial area. Some examples are Geofencer and PhoneWeaver.
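At its core, a geofence trigger is just a position check against a boundary. Here is a minimal sketch of how such a check might work for a circular fence (the fence coordinates and the insideFence helper are hypothetical, not taken from either app):

// Haversine distance between two latitude/longitude points, in meters.
function distanceMeters(lat1, lon1, lat2, lon2) {
  var R = 6371000; // mean Earth radius in meters
  var toRad = function(deg) { return deg * Math.PI / 180; };
  var dLat = toRad(lat2 - lat1), dLon = toRad(lon2 - lon1);
  var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) *
          Math.sin(dLon / 2) * Math.sin(dLon / 2);
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Hypothetical circular fence; fire an action when the current position is inside it.
var fence = { lat: 38.8895, lon: -77.0353, radiusMeters: 500 };
function insideFence(pos) {
  return distanceMeters(pos.lat, pos.lon, fence.lat, fence.lon) <= fence.radiusMeters;
}
if (insideFence({ lat: 38.8890, lon: -77.0350 })) {
  console.log("Entered the geofence: trigger the app's action");
}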

Additionally, the cameras on smart phones can collect the location of the phone when taking a picture. This is embedded within the picture and can be used by Facebook, Picasa, Photoshop, and other photo software to display locale information on a map. (You may want to disable this feature if you would rather not have people know where you live.)

10. Internet of Things

The Internet of Things (IoT) is the category of technology that includes electronic objects that connect to the internet and transmit their location. This is a broad and emerging area of geospatial technology that will add even more location data to the world.

IoT includes objects like cars, fire alarms, energy-saving devices like Nest and Neurio, fitness tracking bands like the ones from Jawbone or Nike, and more. For these IoT applications and devices to work optimally, they need to know your location and combine it with other information sensed around them.

Nike+ FuelBand (Peter Parkes/flickr.com)

11. Geospatial Virtual Reality

Virtual reality that makes use of geospatial data is another emerging category. This technology will allow for an immersive experience in realistic geospatial models.

Geospatial virtual reality incorporates all of the technologies listed above to put people into the middle of simulated real-world environments. It’s already been implemented with new hardware like the Oculus Rift, a virtual reality headset that enables players to step inside their favorite games and virtual worlds.

Oculus Rift (Sebastian Stabinger/commons.wikimedia.org)

Show Me the Data!

At the base of all of this technology is data. Increasingly, we have to invent more ways to store geospatial data in order for it to be processed and analyzed. The next steps of geospatial technologies involve attaching geospatial information to all data collection and then processing and filtering the massive amounts of data, which is known as big data.

This is my list of surprising geospatial technologies that matter today. It started out as a top 10 list, but evolved to 11 because I just couldn’t leave out geospatial virtual reality. It’s so cool! Feel free to add your suggestions of geospatial technologies in the comments below or as a pingback.

Making maps with D3


I used D3 to build a data-driven map. The goal was to build a map with D3 that is populated with data from a service.

The service provides a JSON file, which consists of place names and a count of documents. The place names are country names or US states, as in the following sample:

[{"id":121,"value":"iran","count":2508},{"id":88,"value":"washington","count":1778}]

Overview

I started with Mike Bostock’s Let’s Make a Map since it was most helpful in getting me to a US map.

The general steps are as follows:

  1. Get shape files
  2. Filter out what you need
  3. Merge and convert to TopoJSON
  4. Build D3 Javascript to join data to the TopoJSON and display the map

1.  Find Shape Files

After much experimentation I found the best shape files to use were from Natural Earth Data. They have three sizes – large 1:10m, medium 1:50m, and small 1:110m.

I found that the large size produced a JSON file that was around 2.4 megabytes, much too large for use in a web browser. The lines drawn for the large map were very smooth. The small shape would produce a JSON file that was 96k, but it was missing a good number of small countries and used more jagged lines. The medium size came out to 618k and contained all of the countries I needed.

2.  Filter Shape Files

For this project, I used Admin 0 Countries and Admin 1 States & Provinces without large lakes. To begin, we need to extract just the US states from the states & provinces shape file.

To do this, we use some SQL. First, find the column name that indicates what data is from the USA using ogrinfo.

ogrinfo -sql 'select * from ne_50m_admin_1_states_provinces_lakes' ne_50m_admin_1_states_provinces_lakes.shp -fid 0

This will print out the data in the first row of the shape file, which should be all the data for the first state in the file. Find the column name that indicates the country name. In this case it is sr_adm0_a3. To see if it works with the USA, use this:

ogrinfo -sql "select * from ne_50m_admin_1_states_provinces_lakes where sr_adm0_a3 = 'USA'" ne_50m_admin_1_states_provinces_lakes.shp -fid 0

So now we want to convert it to a GeoJSON file using ogr2ogr.

ogr2ogr -f GeoJSON -where "sr_adm0_a3 = 'USA'" states.json ne_50m_admin_1_states_provinces_lakes.shp

Now let’s move on to countries. The first time I tried this, I got all the way to producing the map in the browser and found that the coloring representing the data was not being applied to the countries.

It turns out that this shape file has the country names defined with a column name that is uppercase NAME. The states file had it defined as lowercase name. This is the key to matching up the data in the JSON file. I could tell by running ogrinfo that the column names were different.

ogr2ogr countries.shp ne_50m_admin_0_countries.shp -sql "select NAME as name, POSTAL, ISO_A2, ISO_A3, scalerank, LABELRANK from ne_50m_admin_0_countries"

To rename the column, use the SQL as keyword shown in the command above. The shape file also contains lots of data I did not need, like population and GDP. I eliminated it by selecting only the columns that I wanted to use. This produces an interim shape file called countries.shp.

Next, convert the countries shape files into a GeoJSON file using ogr2ogr.

ogr2ogr -f GeoJSON countries.json countries.shp

3. Convert to TopoJSON

The goal is to get a TopoJSON file, since it encodes shared borders only once and so stores the data most efficiently. This next command will convert the two GeoJSON files (states and countries) into one TopoJSON file by merging the data together. (Remember that we have named the countries of the world countries and the US states states.)

topojson --id-property name --allow-empty -o world.json countries.json states.json

The --id-property setting makes the name field the id. This is used to join the data from the document count JSON file. The --allow-empty setting forces it to keep the polygons for all the countries, even if they are very small. Without that setting, I found that TopoJSON would remove some of the small geographical entities that are countries or territories, like Aruba. See the TopoJSON documentation for more info.

4. Build D3 Javascript

The next step is to build the html page and javascript that will draw the map using our data. If you prefer to skip to the completed code use this link.

First of course, we must have the TopoJSON and D3 javascripts loaded.
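For example, the page might pull in D3, TopoJSON, and queue.js like this (the exact versions are my assumption; this post uses the D3 v3-era APIs):

<script src="http://d3js.org/d3.v3.min.js"></script>
<script src="http://d3js.org/topojson.v1.min.js"></script>
<script src="http://d3js.org/queue.v1.min.js"></script>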

Next, set up the map projection and size of the map.

<script>
var width = 960,
    height = 960;
var projection = d3.geo.mercator().scale(200);
var path = d3.geo.path().projection(projection);
var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height);
var g = svg.append("g");

The queue command will set up the loading of the two JSON files that will be merged.
queue()
.defer(d3.json, "world.json")
.defer(d3.json, "toplocations.json")
.await(ready);

Now, here’s the main function that is used to draw everything.
function ready(error, world, locations) {
console.log(world)

In the style section, we need to add the following styles in order to draw the lines of the map.

.subunit-boundary {
fill: none;
stroke: #777;
stroke-linejoin: round;
}

The console command outputs the contents to the browser console, which you can view using Firebug or the JavaScript console built into many browsers. Inside of this function, put the following code to draw the boundaries of the continents. Note that I am referencing world.objects.countries; this is where each country is stored in the world.json file.

Listing of the countries in the world.json

g.append("path")
.datum(topojson.mesh(world, world.objects.countries, function(a, b) { return a == b }))
.attr("d", path)
.attr("class", "subunit-boundary");

 

Ocean borders sans countries

This next one draws the lines between each country.

g.append("path")
.datum(topojson.mesh(world, world.objects.countries, function(a, b) { return a !== b }))
.attr("d", path)
.attr("class", "subunit-boundary");

Country Borders

Now add the US states and the little bit around the Great Lakes.

g.append("path")
.datum(topojson.mesh(world, world.objects.states, function(a, b) { return a !== b }))
.attr("d", path)
.attr("class", "subunit-boundary");
g.append("path")
.datum(topojson.mesh(world, world.objects.states, function(a, b) { return a == b }))
.attr("d", path)
.attr("class", "subunit-boundary");
};

US States

The next step involves converting the count on each country into a color, finding the matching state or country, and filling in the color. This is done using D3 and CSS.

Add the following code to the style section, which will control what colors go in each range of values indicated for each country. Again, thanks to Mike Bostock for this piece of code. The .subunit class is used to fill in the regions that do not have a count value in the toplocations.json.

.subunit { fill: #aaa; }
.q0-9 { fill:rgb(247,251,255); }
.q1-9 { fill:rgb(222,235,247); }
.q2-9 { fill:rgb(198,219,239); }
.q3-9 { fill:rgb(158,202,225); }
.q4-9 { fill:rgb(107,174,214); }
.q5-9 { fill:rgb(66,146,198); }
.q6-9 { fill:rgb(33,113,181); }
.q7-9 { fill:rgb(8,81,156); }
.q8-9 { fill:rgb(8,48,107); }

Next is the JavaScript that runs through the toplocations.json file and builds a map of every country name and its count. This loop iterates through each location and puts its name and count in a map, which is used later to match up the country name (id) in the world.json file and find the count.

var countByName = d3.map(); // lookup from place name to document count
locations.forEach(function(data) {
  countByName.set(data.value, data.count);
});

We also need this quantization code outside of the main function block. Open this window to see where it is placed in the code.

var quantize = d3.scale.quantize()
.domain([0, 2000])
.range(d3.range(9).map(function(i) { return "q" + i + "-9"; }));

This takes the count from each country and converts it into one of nine classes, q0-9 through q8-9. Because the domain is set to [0, 2000], any count above 2000 falls into the darkest class.

Next, we need code similar to what we used to draw the borders: one block for the countries and one for the states, because they ended up as two different objects in world.json.

g.selectAll(".countries")
.data(topojson.feature(world, world.objects.countries).features)
.enter().append("path")
.attr("class", function(d) { return "subunit " + quantize(countByName.get(d.id.toLowerCase())); })
.attr("d", path);
g.selectAll(".states")
.data(topojson.feature(world, world.objects.states).features)
.enter().append("path")
.attr("class", function(d) { return "subunit " + quantize(countByName.get(d.id.toLowerCase())); })
.attr("d", path);

Note the countByName.get call, which takes the id from the map JSON; this is the country or state name. It must be changed to lowercase so it will match up with the data in our toplocations.json file. The lookup returns the count for that country, and quantize then converts the count to a CSS class that corresponds to a color. That class is added to the path element so that when the browser draws it, the region is filled with the correct color.
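As a quick worked example using the sample data shown earlier (a sketch of the lookup chain, not code from the post):

// d.id is "Iran"              -> lowercased to "iran"
// countByName.get("iran")     -> 2508 (from the toplocations.json sample)
// quantize(2508)              -> "q8-9" (counts above the 2000 domain max land in the darkest class)
// resulting class attribute   -> "subunit q8-9"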

Now for the cream on top. It is always nice to allow for zooming and movement of the map. The following code will allow your map users to control their point of view.

var zoom = d3.behavior.zoom()
.on("zoom", function() {
g.attr("transform", "translate(" +
d3.event.translate.join(",") + ")scale(" + d3.event.scale + ")");
});
svg.call(zoom);

Your Turn

Hopefully, this example will help you in building your own data-driven map. All the code and a working sample can be found at http://bl.ocks.org/bradllj/8326068.

References