AI Brain Animated

Artificial Intelligence of Art

Let’s Make some Art

Learning through experience is one of the best ways to learn, so this is how I went about educating myself about machine learning, artificial intelligence and neural networks. I am an artist and programmer who became interested in how computers could make art. So let's use artificial intelligence to create new artwork.

I decided to use an open source project on GitHub called art-DCGAN by Robbie Barrat. His implementation was recently used in Christie's first artificial intelligence art auction, but I first encountered Robbie Barrat's work on SuperRare.co.

To start out we need some data to train the neural network. I wanted to use simple drawings to train my first model. I happen to have a library of art that I personally created: line drawings made using sound played through an oscilloscope. It turns out that most of the prebuilt code libraries work with very small image sizes, and you need at least 10,000 images to train them effectively.

Get Lots of Images

I happen to have movies of animated oscilloscope line drawings. Most of them come from this artwork called “Baby’s First Mobile”.

To generate a large number of images I turned the videos into individual frames using ffmpeg, with commands like:

ffmpeg -i babyMobileVoice.mp4 -vf crop=500:500 thumb%04d.jpg -hide_banner

I was able to crop out the center of the video and produce individual frames. After running ffmpeg on all of my oscilloscope videos and deleting all the black frames, I ended up with over 10,000 images.
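
To automate this across every video and then drop the all-black frames, a small Python script helps. This is a minimal sketch under assumed paths (a videos/ folder of .mp4 files and a frames/ output folder); the brightness threshold for "black" is something you would tune for your own footage.

import os
import subprocess

import cv2

os.makedirs("frames", exist_ok=True)

# run ffmpeg on every video, cropping the center and writing numbered frames
for i, video in enumerate(sorted(os.listdir("videos"))):
    if video.endswith(".mp4"):
        subprocess.run([
            "ffmpeg", "-hide_banner", "-i", os.path.join("videos", video),
            "-vf", "crop=500:500",
            os.path.join("frames", "vid%d_thumb%%04d.jpg" % i),
        ], check=True)

# remove frames that are essentially black
for name in os.listdir("frames"):
    path = os.path.join("frames", name)
    frame = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if frame is None or frame.mean() < 5:
        os.remove(path)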

Make them Small

The next step is to make the images small. Some tools only accept 64 x 64 pixel images; art-DCGAN uses 128 x 128 pixel images. I borrowed a Python script from a blog at coria.com. To double the number of training examples, I read that someone rotated their images by 90 degrees, so I added a rotate feature that writes a rotated copy of each image.

import sys
import os
from os import walk

import cv2
import imutils

# check the arguments before using them
if len(sys.argv) != 5:
    print("Please specify width, height, input directory and output directory.")
    sys.exit(0)

# width to resize
width = int(sys.argv[1])
# height to resize
height = int(sys.argv[2])
# location of the input dataset
input_dir = sys.argv[3]
# location of the output dataset
out_dir = sys.argv[4]

print("Working...")

# get all the pictures in the input directory
images = []
ext = (".jpeg", ".jpg", ".png")

for (dirpath, dirnames, filenames) in walk(input_dir):
    for filename in filenames:
        if filename.endswith(ext):
            images.append(os.path.join(dirpath, filename))

for image in images:
    img = cv2.imread(image, cv2.IMREAD_UNCHANGED)
    print(image)
    h, w = img.shape[:2]
    pad_bottom, pad_right = 0, 0
    if h < w:
        ratio = w / h
    else:
        ratio = h / w

    if h > height or w > width:
        # shrinking image algorithm
        interp = cv2.INTER_AREA
    else:
        # stretching image algorithm
        interp = cv2.INTER_CUBIC

    # scale to the target width, then pad out to the full size with black
    w = width
    h = round(w / ratio)
    if h > height:
        h = int(height)
        w = round(h * ratio)
    pad_bottom = abs(height - h)
    pad_right = abs(width - w)

    scaled_img = cv2.resize(img, (w, int(h)), interpolation=interp)
    padded_img = cv2.copyMakeBorder(
        scaled_img, 0, pad_bottom, 0, pad_right,
        borderType=cv2.BORDER_CONSTANT, value=[0, 0, 0])

    # write the resized image plus a copy rotated 90 degrees to double the dataset
    cv2.imwrite(os.path.join(out_dir, os.path.basename(image)), padded_img)
    rotate_img = imutils.rotate(padded_img, 90)
    cv2.imwrite(os.path.join(out_dir, '90_' + os.path.basename(image)), rotate_img)

print("Completed!")

Running this command

python resize_images.py 128 128 /origin /destination

will produce images of the proper size for the training to begin.

Setup an AWS GPU Instance

The GitHub art-DCGAN project has install directions but I will add to them here. In AWS I selected:

Ubuntu Server 14.04 LTS (HVM), SSD Volume Type – ami-06b5810be11add0e2 with the g3s.xlarge instance type.

Install all the Software

Once the instance started, I followed the art-DCGAN install directions. Make sure to find the exact packages named in the directions. On the NVIDIA site you will need to press the legacy releases button to find the right version of the CUDA software. The .deb file I used is cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb. Newer releases of the software will not work with the art-DCGAN code.
Once the build has finished, clone the art-DCGAN code. Now move your thousands of images into a folder called myimages/images. Make sure there is a folder called images inside the myimages folder, as described in the project that art-DCGAN was forked from.
Now you can run the command below, but note the checkpoint options. They cause it to save checkpoint models as it runs; the code tended to crash on me before it completed.
DATA_ROOT=myimages dataset=folder saveIter=10 name=scope1 ndf=50 ngf=150 th main.lua
saveIter tells it to save a model for the discriminator and the generator after every 10 epochs, where an epoch is one pass over the training set. In my case an epoch was 318 training rounds.
But with these settings the training would constantly collapse. If the output of the training starts to show either the discriminator or the generator running at 0%, your training has failed. To fix it I lowered the number of filters for the discriminator so it could keep up with the generator.
A GAN is trained adversarially: the generator creates a random image and the discriminator judges it against the training source images. This process repeats until the neural network is wired in such a way that it produces images like the ones in the training set.
DATA_ROOT=myimages dataset=folder saveIter=10 name=scope1 ndf=12 ngf=150 th main.lua
is what worked for my particular artwork.
After running for five hours and making it to 90 epochs before failing, I ended up with some images that look a lot like mine but with many variations.
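
For readers who want to see what the adversarial loop described above looks like in code, here is a tiny self-contained sketch in PyTorch. It is only an illustration of the idea, not the project's training code (art-DCGAN is written in Torch/Lua and uses real convolutional networks and real image batches, while this toy version uses stand-in networks and random data):

import torch
import torch.nn as nn

# toy stand-ins for the DCGAN generator and discriminator
netG = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 64 * 64), nn.Tanh())
netD = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

criterion = nn.BCELoss()
optD = torch.optim.Adam(netD.parameters(), lr=2e-4)
optG = torch.optim.Adam(netG.parameters(), lr=2e-4)

real_batch = torch.rand(64, 64 * 64)  # stand-in for a batch of real training images

for step in range(100):
    noise = torch.randn(64, 100)
    fake_batch = netG(noise)

    # 1) train the discriminator to score real images high and generated images low
    optD.zero_grad()
    loss_d = criterion(netD(real_batch), torch.ones(64, 1)) + \
             criterion(netD(fake_batch.detach()), torch.zeros(64, 1))
    loss_d.backward()
    optD.step()

    # 2) train the generator to fool the discriminator
    optG.zero_grad()
    loss_g = criterion(netD(fake_batch), torch.ones(64, 1))
    loss_g.backward()
    optG.step()

When one side of this contest overwhelms the other, its loss collapses toward zero, which is the failure mode that lowering ndf worked around in my runs.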

Now the Artificial Intelligence model will make some new art

To generate a 36 image grid find the newest model file and use the command:
batchSize=36 net=./checkpoints/scope_90_net_G.t7 th generate.lua

AI Generated Oscilloscope Art

Now it becomes a matter of picking out the best work as an AI art curator.

With all of this work it becomes obvious that this is a very slow way to make art. It is still very much a human endeavor. But allowing a computer to generate lots of ideas for an artist to explore is helpful. Many artists have used randomized systems to generate art ideas. Artists such as David Bowie, Brian Eno, Nam June Paik and John Cage have used mechanical and computer generators to automate some of their work and kick-start the creative process.

In addition, by making this model we have learned how to train other models that can be used for image recognition. You can also try out the other features of art-DCGAN, which will download art from a public site and then generate new images based on that genre.

A Volume Analytics Flow for Finding Social Media Bots

Volume Analytics Chaos Control

Volume Analytics is a software tool used to build, deploy and manage data processing applications.

Volume Analytics is a scalable data management platform that allows the rapid ingest, transformation, and loading of high volumes of data into multiple analytic models, as defined by your requirements or your existing data models.

Volume Analytics is a platform for streaming large volumes of varied data at high velocity.

Volume Analytics is a tool that enables both rapid software development and operational maintainability with scalability for high data volumes. Volume Analytics can be used for all of your data mining, fusion, extraction, transform and loading needs. Volume Analytics has been used to mine and analyze social media feeds, monitor and alert on insider threats and automate the search for cyber threats. In addition, it is being used to consolidate data from many data sources (databases, HDFS, file systems, data lakes) and to produce multiple data models for multiple data analytics visualization tools. It could also be used to consolidate sensor data from IoT devices or to monitor a SCADA industrial control network.

Volume Analytics makes it easy to quickly develop highly redundant software that is both scalable and maintainable. In the end you save money on labor for the development and maintenance of systems built with Volume Analytics.

In other words, Volume Analytics provides the plumbing of a data processing system. The application you are building has distinct units of work that need to be done. We might compare it to a water treatment plant: dirty water comes into the system through a pipe and reaches a large contaminant filter. The filter is a work task and the pipe is a topic. Together they make a flow.

After the first filter another pipe carries the water minus the dirt to another water purification worker. In the water plant there is a dashboard for the managers to monitor the system to see if they need to fix something or add more pipes and cleaning tasks to the system.

Volume Analytics provides the pipes, a platform to run the worker tasks and a management tool to control the flow of data through the system.

A Volume Analytics Flow for Finding Social Media Bots

In addition, Volume Analytics has redundancy for disaster recovery, high availability and parallel processing. This is where our analogy fails: data is duplicated across multiple topics, so the failure of a particular topic (pipe) does not destroy any data because it is preserved on another topic. Topics are optimally set up in multiple data centers to maintain high availability.

In Volume Analytics the water filters in the analogy are called tasks. Tasks are groups of code that perform some unit of work. Your specific application will have its own tasks. The tasks are deployed on more than one server in more than one data center.

Benefits

Faster start up time saves money and time.

Volume Analytics allows a faster start up time for a new application or system being built. The team does not need to build the platform that moves the data to tasks. They do not need to build a monitoring system as those features are included. However, Volume Analytics will integrate with your current monitoring systems.

System is down less often

The DevOps team gets visibility into the system out of the box. They do not have to stand up a log search system. So it saves time. They can see what is going on and fix it quickly.

Plan for Growth

As your data grows and the system needs to process more data, Volume Analytics grows. Add server instances to increase the processing power. As work grows, Volume Analytics allocates work to new instances. There is no re-coding needed. Save time and money, as developers are not needed to re-implement the code to work at a larger scale.

Less Disruptive deployments

Construct your application in a way that allows for deployments of new features with a lower impact on features in production. New code libraries and modules can be deployed to the platform and allowed to interact with the already running parts of the system without an outage. A built in code library repository is included.

In addition currently running flows can be terminated while the data waits on the topics for the newly programmed flow to be started.

This Flow processes files to find IP addresses, searches multiple APIs for matches and inserts data into a HANA database

A threat-search data processing flow in production. Each of the boxes is a task that performs a unit of work. The task puts the processed data on the topic represented by the star. Then the next task picks up the data and does another part of the job. The combination of a set of tasks and topics is a flow.
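
Volume Analytics' own task APIs are not shown in this post, but the task-and-topic pattern itself is easy to picture with a generic message broker. The sketch below is only an illustration using Apache Kafka and the kafka-python client, not Volume Analytics code (the raw-reports topic name and the parsing logic are made up; ip4-topic is the integration topic mentioned below): a task reads records from one topic, performs its unit of work, and publishes the result to the next topic for the next task in the flow.

import json
import re

from kafka import KafkaConsumer, KafkaProducer

# one "task": read raw incident reports, extract IP addresses,
# and publish them to the next topic in the flow
consumer = KafkaConsumer("raw-reports", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

for message in consumer:
    text = message.value.decode("utf-8")
    for ip in ip_pattern.findall(text):
        producer.send("ip4-topic", json.dumps({"ip": ip}).encode("utf-8"))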

Geolocate IP Flow

Additional flow to geolocate IP addresses added as the first flow is running.

Combined Flows

The combination of flows working together. The topic ip4-topic is an integration point.

Modular

Volume Analytics is modular and tasks are reusable. You can reconfigure your data processing pipeline without introducing new code. You can use tasks in more than one application.

Highly Available

Out of the box, Volume Analytics is highly available due to its built-in redundancy. Work tasks and topics (pipes) run in triplicate. As long as your compute instances are in multiple data centers you will have redundancy built in. Volume Analytics knows how to balance the data between duplicates and avoid data loss if one or more work tasks fail; this extends to queuing up work if all work tasks fail.

Integration

Volume Analytics integrates with other products. It can retrieve and save data to other systems like topics, queues, databases, file systems and data stores. In addition these integrations happen over encrypted channels.

In our sample application, CyberFlow, there are many tasks that integrate with other systems. The read bucket task reads files from an AWS S3 bucket, the ThreatCrowd task calls the API at https://www.threatcrowd.org, and the Honeypot task calls https://www.projecthoneypot.org. Then the insert tasks integrate with the SAP HANA database used in this example.

Volume Analytics integrates with your enterprise authentication and authorization systems like LDAP, ActiveDirectory, CAP and more.

Data Management

Volume Analytics ingests datasets from throughout the enterprise, tracking each delivery and routing it through Volume Analytics to extract the greatest benefit. It shares common capabilities such as text extraction, sentiment analysis, categorization, and indexing. A series of services make those datasets discoverable and available to authorized users and other downstream systems.

Data Analytics

In addition to the management console, Volume Analytics comes with a notebook application. This allows a data scientist or analyst to discover data and convert it into information on reports. After your data is processed by Volume Analytics and put into a database, the notebook can be used to visualize the data. The data can be sliced, diced and displayed on graphs, charts and maps.

Volume Analytics Notebook

Flow Control Panel

The Flow control panel allows for control and basic monitoring of flows. Flows are groupings of tasks and topics working together. You can stop, start and terminate flows, and from this screen you can also launch additional flow virtual machines when there is a heavy data processing load. The panel gives access to start up extra worker tasks as needed, and there is a link that will let you analyze the logs in Kibana.

Topic Control Panel

The topic control panel allows for the control and monitoring of topics. Monitor and delete topics  from here.

Consumer Monitor Panel

The consumer monitor panel allows for the monitoring of consumer tasks. Consumer tasks are the tasks that read from a topic. They may also write to a topic. This screen will allow you to monitor that the messages are being processed and determine if there is a lag in the processing.

Volume Analytics is used by our customers to process data from many data streams and data sources quickly and reliably. In addition, it has enabled the production of prototype systems that scale up into enterprise systems without rebuilding and re-coding the entire system.

And now this tour of Volume Analytics leads into a video demonstration of how it all works together.

Demonstration Video

This video will further describe the features of Volume Analytics using an example application which parses IP addresses out of incident reports and searches other systems for indications of those IP addresses. The data is saved into a SAP HANA database.

Request a Demo Today

Volume Analytics is scalable, fast, maintainable and repeatable. Contact us to request a free demo and experience the power and efficiency of Volume Analytics today.

Contact

SAP HANA Query Builder On Apache Zeppelin Demo

HANA Zeppelin Query Builder with Map Visualization

In working with Apache Zeppelin I found that users wanted a way to explore data and build charts without needing to know SQL right away. This is an attempt to build a note in Zeppelin that allows a new data scientist to get familiar with the data structure of their database and to build simple single-table queries for quickly creating charts and maps. In addition, it shows the SQL used to perform the work.

Demo

This video will demonstrate how it works. I have leveraged work done in Randy Gelhausen's query builder post on how to make a where clause builder. I also used Damien Sorel's jQuery Query Builder. These were used to make a series of paragraphs that look up tables and columns in HANA and allow the user to build a custom query. This data can be quickly graphed using the Zeppelin Helium visualizations.

The Code

This is for those data scientists and coders that want to replicate this in their Zeppelin.

Note that this code is imperfect, as I have not worked out all the issues with it. You may need to make changes to get it to work. It only works on the Zeppelin 0.8.0 snapshot, and it is made to work with SAP HANA as the database.

It only has one type of aggregation (sum), and it does not have a way to build a HAVING clause. But these features could easily be added.

This Zeppelin note is dependent on code from a previous post. Follow the directions in Using Zeppelin to Explore a Database first.

Paragraph One

%spark
//Get list of columns on a given table
def columns1(table: String) : Array[(String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => x.asInstanceOf[String])
}

def columns(table: String) : Array[(String, String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => (x, x))
}

def number_column_types(table: String) : Array[String] = {
 var columnType = sqlContext.sql("select column_name from table_columns where table_name='" +
    table + "' and data_type_name = 'INTEGER'")
 
 columnType.map {case Row(column_name: String) => (column_name)}.collect()
}

// set up the tables select list
val tables = sqlContext.sql("show tables").collect.map(s=>s(1).asInstanceOf[String].toUpperCase())
z.angularBind("tables", tables)
var sTable ="tables"
z.angularBind("selectedTable", sTable)


z.angularUnwatch("selectedTable")
z.angularWatch("selectedTable", (before:Object, after:Object) => {
 println("running " + after)
 sTable = after.asInstanceOf[String]
 // put the id for paragraph 2 and 3 here
 z.run("20180109-121251_268745664")
 z.run("20180109-132517_167004794")
})


var col = columns1(sTable)
col = col :+ "*"
z.angularBind("columns", col)
// hack to make the where clause work on initial load
var col2 = columns(sTable)
var extra = ("1","1")
col2 = col2 :+ extra
z.angularBind("columns2", col2)
var colTypes = number_column_types(sTable)
z.angularBind("numberColumns", colTypes)
var sColumns = Array("*")
// hack to make the where clause work on initial load
var clause = "1=1"
var countColumn = "*"
var limit = "10"

// setup for the columns select list
z.angularBind("selectedColumns", sColumns)
z.angularUnwatch("selectedColumns")
z.angularWatch("selectedColumns", (before:Object, after:Object) => {
 sColumns = after.asInstanceOf[Array[String]]
 // put the id for paragraph 2 and 3 here
 z.run("20180109-121251_268745664")
 z.run("20180109-132517_167004794")
})
z.angularBind("selectedCount", countColumn)
z.angularUnwatch("selectedCount")
z.angularWatch("selectedCount", (before:Object, after:Object) => {
 countColumn = after.asInstanceOf[String]
})
// bind the where clause
z.angularBind("clause", clause)
z.angularUnwatch("clause")
z.angularWatch("clause", (oldVal, newVal) => {
 clause = newVal.asInstanceOf[String]
})

z.angularBind("limit", limit)
z.angularUnwatch("limit")
z.angularWatch("limit", (oldVal, newVal) => {
 limit = newVal.asInstanceOf[String]
})

This paragraph is Scala code that sets up the functions used to query the table containing the list of tables and the table containing the list of columns. You must have the tables loaded into Spark as views or tables in order to see them in the select lists. This paragraph performs all the binding so that the next paragraph, which is Angular code, can get the data built here.

Paragraph Two

%angular
<link rel="stylesheet" href="https://cdn.rawgit.com/mistic100/jQuery-QueryBuilder/master/dist/css/query-builder.default.min.css">
<script src="https://cdn.rawgit.com/mistic100/jQuery-QueryBuilder/master/dist/js/query-builder.standalone.min.js"></script>

<script type="text/javascript">
  var button = $('#generateQuery');
  var qb = $('#builder');
  var whereClause = $('#whereClause');
 
  button.click(function(){
    whereClause.val(qb.queryBuilder('getSQL').sql);
    whereClause.trigger('input'); //triggers Angular to detect changed value
  });
 
  // this builds the where statement builder
  var el = angular.element(qb.parent('.ng-scope'));
  angular.element(el).ready(function(){
    var integer_columns = angular.element('#numCol').val()
    //Executes on page-load and on update to 'columns', defined in first snippet
    window.watcher = el.scope().compiledScope.$watch('columns2', function(newVal, oldVal) {
      //Append each column to QueryBuilder's list of filters
      var options = {allowEmpty: true, filters: []}
      $.each(newVal, function(i, v){
        if(integer_columns.split(',').indexOf(v._1) !== -1){
          options.filters.push({id: v._1, type: 'integer'});
        } else if(v._1.indexOf("DATE") !== -1) {
          options.filters.push({id: v._1, type: 'date'})
        } else { 
          options.filters.push({id: v._1, type: 'string'});
        }
      });
      qb.queryBuilder(options);
    });
  });
</script>
<input type="text" ng-model="numberColumns" id="numCol"></input>
<form class="form-inline">
 <div class="form-group">
 Please select table: Select Columns:<br>
 <select size=5 ng-model="selectedTable" ng-options="o as o for o in tables" 
       data-ng-change="z.runParagraph('20180109-151738_134370871')"></select>
 <select size=5 multiple ng-model="selectedColumns" ng-options="o as o for o in columns">
 <option value="*">*</option>
 </select>
 Sum Column:
 <select ng-model="selectedCount" ng-options="o as o for o in columns">
 <option value="*">*</option>
 </select>
 <label for="limitId">Limit: </label> <input type="text" class="form-control" 
       id="limitId" placeholder="Limit Rows" ng-model="limit"></input>
 </div>
</form>
<div id="builder"></div>
<button type="submit" id="generateQuery" class="btn btn-primary" 
       ng-click="z.runParagraph('20180109-132517_167004794')">Run Query</button>
<input id="whereClause" type="text" ng-model="clause" class="hide"></input>

<h3>Query: select {{selectedColumns.toString()}} from {{selectedTable}} where {{clause}} 
   with a sum on: {{selectedCount}} </h3>

Paragraph two uses javascript libraries from jQuery and jQuery Query Builder. In the z.runParagraph command, use the paragraph id from paragraph three.

Paragraph Three

The results of the query show up in this paragraph. Its function is to generate the query and run it for display.

%spark
import scala.collection.mutable.ArrayBuffer

var selected_count_column = z.angular("selectedCount").asInstanceOf[String]
var selected_columns = z.angular("selectedColumns").asInstanceOf[Array[String]]
var limit = z.angular("limit").asInstanceOf[String]
var limit_clause = ""
if (limit != "*") {
 limit_clause = "limit " + limit
}
val countColumn = z.angular("selectedCount")
var selected_columns_n = selected_columns.toBuffer
// remove from list of columns
selected_columns_n -= selected_count_column

if (countColumn != "*") {
 val query = "select "+ selected_columns_n.mkString(",") + ", sum(" + selected_count_column +
     ") "+ selected_count_column +"_SUM from " + z.angular("selectedTable") + " where " + 
      z.angular("clause") + " group by " + selected_columns_n.mkString(",") + " " + 
      limit_clause
 println(query)
 z.show(sqlContext.sql(query))
} else {
 val query2 = "select "+ selected_columns.mkString(",") +" from " + z.angular("selectedTable") + 
      " where " + z.angular("clause") + " " + limit_clause
 println(query2)
 z.show(sqlContext.sql(query2))
}

Now if everything is just right you will be able to query your tables without writing SQL. This is a limited example as I have not provided options for different types of aggregation, advanced grouping or joins for multiple tables.

 

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint.

Using Zeppelin to Explore a Database

In attempting to use Apache Zeppelin I found it difficult to just explore a new database. This was the situation when connecting an SAP HANA database to Apache Zeppelin using the JDBC driver.

So I created a Zeppelin interface that can be used by a person who does not know how to code or use SQL.

This is a note with code in multiple paragraphs that would allow a person to see a list of all the tables in the database and then view the structure of them and look at a sample of the data in each table.

Volume Analytics Table Explorer - HANA & Zeppelin

When using a standard database with Apache Zeppelin, one needs to register each table into Spark so that it can query it and make DataFrames from the native tables. I got around this by allowing the user to choose the tables they want to register into Apache Zeppelin and Spark. The registration uses the createOrReplaceTempView function on a DataFrame, which allows us to retain the speed of HANA without copying all the data into a Spark table.

The video shows a short demonstration of how this works.

Once tables are registered as Spark views they can be used by all the other notes on the Apache Zeppelin server. This means that other users can leverage the tables without knowing they came from the HANA database.

The code is custom to HANA because of the names of the system tables where it stores the lists of tables and column names. The code also converts HANA specific data types such as ST_POINT to comma delimited strings.

This example of dynamic forms was informed by Data-Driven Dynamic Forms in Apache Zeppelin.

See the previous posts on Apache Zeppelin and SAP HANA for background.

The Code

Be aware this is prototype code that works on Zeppelin 0.8.0 Snapshot which as of today needs to be built from source. It is pre-release.

First Paragraph

In the first paragraph I am loading the HANA JDBC driver. You can avoid doing this by adding your JDBC jar to the dependencies section of the interpreter configuration, as laid out in How to Use Zeppelin With SAP HANA.

%dep
z.reset() 
z.load("/projects/zeppelin/interpreter/jdbc/ngdbc.jar")

Second Paragraph

In the second paragraph we build the Data Frames from tables in HANA that contain the list of tables and columns in the database. This will be used to show the user what tables and columns are available to use for data analysis.

%spark
import org.apache.spark.sql._
val driver ="com.sap.db.jdbc.Driver"
val url="jdbc:sap://120.12.83.105:30015/ffa"
val database = "dbname"
val username = "username"
val password = "password"
// type in the schemas you wish to expose
val tables = """(select * from tables where schema_name in ('FFA', 'SCHEMA_B')) a """
val columns = """(select * from table_columns where schema_name in ('FFA', 'SCHEMA_B')) b """

val jdbcDF = sqlContext.read.format("jdbc").option("driver",driver)
 .option("url",url)
 .option("databaseName", database)
 .option("user", username)
 .option("password",password)
 .option("dbtable", tables).load()
jdbcDF.createOrReplaceTempView("tables")

val jdbcDF2 = sqlContext.read.format("jdbc").option("driver",driver)
 .option("url",url)
 .option("databaseName", database)
 .option("user", username)
 .option("password",password)
 .option("dbtable", columns).load()
jdbcDF2.createOrReplaceTempView("table_columns")

Third Paragraph

The third paragraph contains the functions that the fourth paragraph will call to do its Spark / Scala work. These functions return the column names and types when a table name is given. It also has the function that will load a HANA table into a Spark table view.

%spark
//Get list of distinct values on a column for given table
def distinctValues(table: String, col: String) : Array[(String, String)] = {
 sqlContext.sql("select distinct " + col + " from " + table + " order by " + col).collect.map(x => (x(0).asInstanceOf[String], x(0).asInstanceOf[String]))
}

def distinctWhere(table: String, col: String, schema: String) : Array[(String, String)] = {
 var results = sqlContext.sql("select distinct " + col + " from " + table + " where schema_name = '" + schema +"' order by " + col)
 results.collect.map(x => (x(0).asInstanceOf[String], x(0).asInstanceOf[String]))
}

//Get list of tables
def tables(): Array[(String, String)] = {
 sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String].toUpperCase(), x(1).asInstanceOf[String].toUpperCase()))
}

//Get list of columns on a given table
def columns(table: String) : Array[(String, String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => (x, x))
}

def hanaColumns(schema: String, table: String): Array[(String, String)] = {
 sqlContext.sql("select column_name, data_type_name from table_columns where schema_name = '"+ schema + "' and table_name = '" + table+"'").collect.map(x => (x(0).asInstanceOf[String], x(1).asInstanceOf[String]))
}

//load table into spark
def loadSparkTable(schema: String, table: String) : Unit = {
  var columns = hanaColumns(schema, table)
  var tableSql = "(select "
  for (c <- columns) {
    // If this column is a geo datatype convert it to a string
    if (c._2 == "ST_POINT" || c._2 == "ST_GEOMETRY") {
      tableSql = tableSql + c._1 + ".st_y()|| ',' || " + c._1 + ".st_x() " + c._1 + ", "
    } else {
      tableSql = tableSql + c._1 + ", "
    }
  }
 tableSql = tableSql.dropRight(2)
 tableSql = tableSql + " from " + schema +"."+table+") " + table

 val jdbcDF4 = sqlContext.read.format("jdbc").option("driver",driver)
  .option("url",url)
  .option("databaseName", "FFA")
  .option("user", username)
  .option("password", password)
  .option("dbtable", tableSql).load()
  jdbcDF4.createOrReplaceTempView(table)
 
}

//Wrapper for printing any DataFrame in Zeppelin table format
def printQueryResultsAsTable(query: String) : Unit = {
 val df = sqlContext.sql(query)
 print("%table\n" + df.columns.mkString("\t") + '\n'+ df.map(x => x.mkString("\t")).collect().mkString("\n")) 
}

def printTableList(): Unit = {
 println(sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String])).mkString("%table\nTables Loaded\n","\n","\n"))
}

// this part keeps a list of the tables that have been registered for reference
val aRDD = sc.parallelize(sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String])))
val aDF = aRDD.toDF()
aDF.registerTempTable("tables_loaded")

Fourth Paragraph

The fourth paragraph contains the Spark code needed to produce the interface with select lists for picking the tables. It uses dynamic forms as described in the Zeppelin documentation and illustrated in more detail by Rander Zander.

%spark
val schema = z.select("Schemas", distinctValues("tables","schema_name")).asInstanceOf[String]
var table = z.select("Tables", distinctWhere("tables", "table_name", schema)).asInstanceOf[String]
val options = Seq(("yes","yes"))
val load = z.checkbox("Register & View Data", options).mkString("")

val query = "select column_name, data_type_name, length, is_nullable, comments from table_columns where schema_name = '" + schema + "' and table_name = '" + table + "' order by position"
val df = sqlContext.sql(query)


if (load == "yes") { 
 if (table != null && !table.isEmpty()) {
   loadSparkTable(schema, table)
   z.run("20180108-113700_1925475075")
 }
}

if (table != null && !table.isEmpty()) {
 println("%html <h1>"+schema)
 println(table + "</h1>")
 z.show(df)
} else {
 println("%html <h1>Pick a Schema and Table</h1>")
}

As the user changes the schema select list, paragraph three will be called and the tables select list will be populated with the new tables. When they select a table, the paragraph will refresh and show a table of details about the selected table's columns, such as the column types and sizes.

When they select the Register & View Data checkbox, the table will get turned into a Spark view and paragraph five will show the contents of the table. Note the z.run command: it runs a specific paragraph, and you need to put in your own value here. It should be the paragraph id of the next paragraph, which is paragraph five.

Paragraph Five

%spark
z.show(sql("select * from " + table +" limit 100"))

The last paragraph lists the first 100 rows of the table that has been selected and registered with the Register & View Data checkbox.

Slight modifications of this code will allow the same sort of interface to be built for MySQL, Postgres, Oracle, MS-SQL or any other database.
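
For example, pointing the metadata paragraph at PostgreSQL only changes the JDBC options and the system tables queried. A rough sketch, shown here in PySpark syntax with placeholder connection values (the Scala paragraphs in this post take exactly the same options):

%pyspark
tables_df = (sqlContext.read.format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("user", "username")
    .option("password", "password")
    .option("dbtable", "information_schema.tables")
    .load())
tables_df.createOrReplaceTempView("tables")

The column metadata would then come from information_schema.columns instead of HANA's table_columns.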

Now go to SAP HANA Query Builder On Apache Zeppelin Demo and you will find code to build a simple query builder note.

Please let us know on twitter, facebook and LinkedIn if this helps you or you find a better way to do this in Zeppelin.

See the previous posts on Apache Zeppelin and SAP HANA for more background.

 

Visualizing HANA Graph with Zeppelin

The SAP HANA database has the capability to store information in a graph. A graph is a data structure with vertices (nodes) and edges. Graph structures are powerful because you can perform some types of analysis more quickly, like nearest neighbor or shortest path calculations. They also enable faceted searching. Zeppelin has recently added support for displaying network graphs.

Simple Network Graph

Zeppelin 0.8.0 has support for network graphs. You need to download the 0.8.0 snapshot and build it to get these features. The code to make this graph is a %network paragraph.
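
A minimal %network paragraph, shaped like the example in the Zeppelin display-system documentation (check the docs for your build, since the feature is new in 0.8.0 and the exact field names may differ), looks something like this:

%network {
    "nodes": [
        {"id": 1, "label": "person", "data": {"fullName": "Andrea Santurbano"}},
        {"id": 2, "label": "project", "data": {"name": "Apache Zeppelin"}}
    ],
    "edges": [
        {"id": 1, "source": 1, "target": 2}
    ]
}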

This is described in the documentation. Note that the part of the json that holds the attributes for the node or edge is in an inner json object called “data”. This is how each node and edge can have different data depending on what type of node or edge it is.

"data": {"fullName":"Andrea Santurbano"}}

Because I happen to be learning the features of SAP HANA, I wanted to display a graph from HANA using Zeppelin. A previous post shows how to connect Zeppelin to HANA.

I am using a series of paragraphs to build the data and then visualize the graph. I have already built my graph workspace in HANA with help from Creating Graph Database Objects and the HANA Graph Reference. The difficult part is transforming the data and relationships so they fit into a vertex/node table and an edge table.

I am using sample data from meetup.com which contains events organized by groups and attended by members and held at venues.

It is important to figure out which attributes should exist on the edge and which ones should be in the nodes. HANA is good at allowing sparse data in the attributes because of the way it stores data in a columnar form. If you wish to display the data on a map and in a graph using the same structure supplied by Zeppelin it is important to put your geo coordinates in the nodes and not in the edges.

First we connect to the database and load the node and edge tables with Spark / Scala and build the data frames. One issue here was converting the HANA ST_POINT data type into latitude and longitude values. I defined a select statement with the HANA functions of ST_X() and ST_Y() to perform the conversions before the data is put into the dataframe.

You will have problems and errors if your tables in HANA have null values. Databases don't mind null values, but Scala seems to hate null. So you have to convert any columns that could contain null values to something that makes sense. In this case I converted varchar columns to empty strings and double columns to 0.0.
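
The replacement can be done in one call on the DataFrame before the JSON is built. A sketch in PySpark syntax with placeholder column names (the Scala DataFrame API used in this note has the same na.fill method, taking a Map instead of a dict):

nodes_df = nodes_df.na.fill({"fullName": "", "lat": 0.0, "lng": 0.0})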

Then I query the data frames to get the data needed for the visualization and transform it into json strings for the collections of nodes and edges. In the end this note outputs two json arrays. One is the nodes and the other is the edges.


Now we will visualize the data using the new Zeppelin directive called %network.

In my example data extracted from meetup.com I have four types of nodes: venue, member, group and event. These are defined as labels. My edges, or relationships, are defined as: held, sponsored and rsvp. These become the lines on the graph. Zeppelin combines the data from the edges and nodes into a single table view.

So in tabular format it will look like this:

Zeppelin Edges

 

Zeppelin Nodes

When you press the network button in Zeppelin the graph diagram appears.

Zeppelin Network Graph Diagram

Under settings you can specify what data displays on the screen. It does not allow for specifying the edge label displays and does not seem to support a weight option.

Network Graph Settings

If you select the map option and have the leaflet visualization loaded you can show the data on a map. Since I put the coordinates in the edges it will map the edge data. It would be better if I moved the coordinates into each node so the nodes could be displayed on the map.

Zeppelin Graph On a Map

This will help you get past some of the issues I had with getting the Zeppelin network diagram feature to work with HANA and perhaps with other databases.

In the future I hope to show how to call HANA features such as nearest neighbor, shortest path and pattern  matching algorithms to enhance the graph capabilities in Zeppelin.

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint

Zeppelin Maps the Hard Way

In Zeppelin Maps the Easy Way I showed how to add a map to Zeppelin with a Helium module. But what if you do not have access to the Helium NPM server to load in that module? And what if you want to add features to your Leaflet Map that are not supported in the volume-leaflet package?

This will show you how the Angular javascript library will allow you to add a map user interface to a Zeppelin paragraph.

Zeppelin Angular Leaflet Map with Markers

First we want to get a map on the screen with markers.

In Zeppelin create a new note.

As was shown in How to Use Zeppelin With SAP HANA we create a separate paragraph to build the database connection. Please substitute in your own database driver and connection string to make it work for other databases. There are other examples where you can pull in data from a csv file and turn it into a table object.

In the next paragraph we place the spark scala code to query the database and build the markers that will be passed to the final paragraph which is built with angular.

The data query paragraph has a basic way to query a bounding box. It just looks for coordinates that are greater and less than the northwest and southeast corners of a bounding box.

var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
    var coords = box.split(",")
    sql1 = sql1 + " where lng > " + coords(0).toFloat + " and lat > " + coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " + coords(3).toFloat
}

var sql = sql1 + " limit 20"
val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect())

The data from this query is used to make the map_pings and bind it to angular so that any angular code can reference it. Zeppelin has the ability to bind data into other languages so it can be used by different paragraphs in the same note. There are samples for other databases, json and csv files at this link.

We do not have access to the Hana proprietary functions because Zeppelin will load the data up in its own table view of the HANA table. We are using the command “createOrReplaceTempView” so that a copy of the data is not made in Zeppelin. It will just pass the data through.

Note that you should set up the HANA jdbc driver as described in How to Use Zeppelin With SAP HANA.

It is best if you set up a dependency to the HANA jdbc jar in the Spark interpreter. Go to the Zeppelin settings menu.

Zeppelin Settings Menu

Pick the Interpreter and find the Spark section and press edit.

Zeppelin Interpreter Screen

Then add the path where you have the SAP HANA JDBC driver, ngdbc.jar, installed.

Configure HANA jdbc in Spark Interpreter

First Paragraph

%spark
import org.apache.spark.sql._
val driver ="com.sap.db.jdbc.Driver"
val url="jdbc:sap://11.1.88.110:30015/tri"
val database   = "database schema"   
val username   = "username for the database"
val password   = "the Password for the database"
val table_view = "event_view"
var box=""
val jdbcDF = sqlContext.read.format("jdbc").option("driver",driver)
                                           .option("url",url)
                                           .option("databaseName", database)
                                           .option("dbtable", "event_view")
                                           .option("user", username)
                                           .option("password",password)
                                           .option("dbtable", table_view).load()
jdbcDF.createOrReplaceTempView("event_view")

Second Paragraph

%spark

var box = "20.214843750000004,1.9332268264771233,42.36328125000001,29.6880527498568";
var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
    var coords = box.split(",")
    sql1 = sql1 + " where lng  > " + coords(0).toFloat + " and lat > " +  
        coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " +
        coords(3).toFloat
}
var sql = sql1 +" limit 20" 

val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect())
z.angularBind("paragraph", z.getInterpreterContext().getParagraphId())
// get the paragraph id of the angular paragraph and put it below
z.run("20171127-081000_380354042")

Third Paragraph

In the third paragraph we add the angular code with the %angular directive. Note the for each loop section where it builds the markers and adds them to the map.

%angular 
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.css" />
.
<div id="map" style="height: 300px; width: 100%"></div>
<script type="text/javascript">
function initMap() {
    var element = $('#textbox');
    var map = L.map('map').setView([30.00, -30.00], 3);
   
    L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);
    var geoMarkers = L.layerGroup().addTo(map);
    
    var el = angular.element($('#map').parent('.ng-scope'));
    var $scope = el.scope().compiledScope;
   
    angular.element(el).ready(function() {
        window.locationWatcher = $scope.$watch('locations', function(newValue, oldValue) {
            //geoMarkers.clearLayers();
            angular.forEach(newValue, function(event) {
                if (event)
                  var marker = L.marker([event.values[1], event.values[2]]).bindPopup(event.values[0]).addTo(geoMarkers);
            });
        })
    });
}
if (window.locationWatcher) { window.locationWatcher(); }

// ensure we only load the script once, seems to cause issues otherwise
if (window.L) {
    initMap();
} else {
    console.log('Loading Leaflet library');
    var sc = document.createElement('script');
    sc.type = 'text/javascript';
    sc.src = 'https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.js';
    sc.onload = initMap;
    sc.onerror = function(err) { alert(err); }
    document.getElementsByTagName('head')[0].appendChild(sc);
}
</script>
<p>Testing the Map</p>

<form class="form-inline">
  <div class="form-group">
    <input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>
    <label for="paragraphId">Paragraph Id: </label>
    <input type="text" class="form-control" id="paragraphId" placeholder="Paragraph Id ..." ng-model="paragraph"></input>
  </div>
  <button type="submit" class="btn btn-primary" ng-click="z.runParagraph(paragraph)"> Run Paragraph</button>
</form>

Now when you run the three paragraphs in order it should produce a map with markers on it.

The next step is to add a way to query the database by drawing a box on the screen. Into the scala / spark code we add a variable for the bounding box with the z.angularBind() command. Then a watcher is made to see when this variable changes so the new value can be used to run the query.

Modify Second Paragraph

%spark
z.angularBind("box", box)
// Get the bounding box
z.angularWatch("box", (oldValue: Object, newValue: Object) => {
    println(s"value changed from $oldValue to $newValue")
    box = newValue.asInstanceOf[String]
})

var sql1 = "select comments desc, lat, lng from EVENT_VIEW "
if (box.length > 0) {
    var coords = box.split(",")
    sql1 = sql1 + " where lng  > " + coords(0).toFloat + " and lat > " +  coords(1).toFloat + " and lng < " + coords(2).toFloat + " and lat < " +  coords(3).toFloat
}
var sql = sql1 +" limit 20" 

val map_pings = jdbcDF.sqlContext.sql(sql)
z.angularBind("locations", map_pings.collect())
z.angularBind("paragraph", z.getInterpreterContext().getParagraphId())
z.run("20171127-081000_380354042") // put the paragraph id for your angular paragraph here

To the angular section we need to add in an additional leaflet library called leaflet.draw. This is done by adding an additional css link and a javascript script. Then the draw controls are added as shown in the code below.

Modify the Third Paragraph

%angular 
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet.draw/0.4.13/leaflet.draw.css" />
.
<script src='https://cdnjs.cloudflare.com/ajax/libs/leaflet.draw/0.4.13/leaflet.draw.js'></script>
<div id="map" style="height: 300px; width: 100%"></div>

<script type="text/javascript">
function initMap() {
    var element = $('#textbox');
    var map = L.map('map').setView([30.00, -30.00], 3);
   
    L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);
    var geoMarkers = L.layerGroup().addTo(map);
    var drawnItems = new L.FeatureGroup();
    
    map.addLayer(drawnItems);
    
    var drawControl = new L.Control.Draw({
        draw: {
             polygon: false,
             marker: false,
             polyline: false
        },
        edit: {
            featureGroup: drawnItems
        }
    });
    map.addControl(drawControl);
    
    map.on('draw:created', function (e) {
        var type = e.layerType;
        var layer = e.layer;
        drawnItems.addLayer(layer);
        element.val(layer.getBounds().toBBoxString());
        map.fitBounds(layer.getBounds());
        window.setTimeout(function(){
           //Triggers Angular to do its thing with changed model values
           element.trigger('input');
        }, 500);
    });
    
    var el = angular.element($('#map').parent('.ng-scope'));
    var $scope = el.scope().compiledScope;
   
    angular.element(el).ready(function() {
        window.locationWatcher = $scope.$watch('locations', function(newValue, oldValue) {
            $scope.latlng = [];
            angular.forEach(newValue, function(event) {
                if (event)
                  var marker = L.marker([event.values[1], event.values[2]]).bindPopup(event.values[0]).addTo(geoMarkers);
                  $scope.latlng.push(L.latLng(event.values[1], event.values[2]));
            });
            var bounds = L.latLngBounds($scope.latlng)
            map.fitBounds(bounds)
        })
    });

}

if (window.locationWatcher) { window.locationWatcher(); }

// ensure we only load the script once, seems to cause issues otherwise
if (window.L) {
    initMap();
} else {
    console.log('Loading Leaflet library');
    var sc = document.createElement('script');
    sc.type = 'text/javascript';
    sc.src = 'https://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.5/leaflet.js';
    sc.onerror = function(err) { alert(err); }
    document.getElementsByTagName('head')[0].appendChild(sc);
    sc.onload = initMap;
}
</script>
<p>Testing the Map</p>

<form class="form-inline">
  <div class="form-group">
    <input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>
    <label for="paragraphId">Paragraph Id: </label>
    <input type="text" class="form-control" id="paragraphId" placeholder="Paragraph Id ..." ng-model="paragraph"></input>
  </div>
  <button type="submit" class="btn btn-primary" ng-click="z.runParagraph(paragraph)"> Run Paragraph</button>
</form>

There are some important features to mention here that took some investigation to figure out.

Within Zeppelin I was unable to get the box being drawn to be visible. So instead, drawing a box makes the map zoom to the selected area, using this code:
element.val(layer.getBounds().toBBoxString());
map.fitBounds(layer.getBounds());

To make the map zoom back to the area after the query is run this code is triggered.

$scope.latlng.push(L.latLng(event.values[1], event.values[2]))
...
var bounds = L.latLngBounds($scope.latlng)
map.fitBounds(bounds)

To trigger the Spark / Scala query paragraph to run after drawing a box, the text input's data-ng-change attribute calls z.runParagraph with the paragraph id:

<input id="textbox" ng-model="box" data-ng-change="z.runParagraph(paragraph);"></input>

The html form at the bottom is what holds and binds the data back and forth between the paragraphs. It is visible for debugging at the moment.

Query of a geographic region with Zeppelin

Please let us know how it works out for you. Hopefully this will help you add maps to your Zeppelin notebook. I am sure there are many other better ways to accomplish this feature set but this is the first way I was able to get it all to work together.

Demo of the interface:

You can contact us using twitter at @volumeint.

Some code borrowed from: https://gist.github.com/granturing/a09aed4a302a7367be92 and https://zeppelin.apache.org/docs/latest/displaysystem/front-end-angular.html

Zeppelin Maps the Easy Way

You want your data on a map in Apache Zeppelin.

Apache Zeppelin allows a person to query all sorts of databases and big data systems using many languages. This data can then be displayed as graphs and charts. There is also a way to make a data driven map.

A map with data markers on it in Zeppelin

There is a very easy way to get a basic map with markers on it into Zeppelin. The first thing we will do is to query the database for the data we want to put on the map. First create a new note and add a query for your database of choice.

If you are using SAP HANA, there are directions on how to install the JDBC driver in How to Use Zeppelin With SAP HANA.

If you are using another database just use the %jdbc interpreter and modify your database configuration settings in the interpreter settings section of Zeppelin.

The query you write must have a column for the longitude and the latitude.

Query for data with Zeppelin

In my example I am using the query "select country_name, lat, lng, event_type from event_view where actor_type_id='2' and country_name='South Sudan' limit 10". Note that my interpreter is HANA. This is described in a previous post about HANA and Zeppelin.

Once this data is in the table, the plug-in for the map needs to be installed. In the upper right corner click on the login indicator and select Helium.

The Zeppelin Setup Menu

A list of Helium plug-ins will be displayed. Enable the one called zeppelin-leaflet or volume-leaflet. The Helium module for maps utilizes another javascript library called Leaflet.

Helium Modules List

Once it is enabled go back to your note where you created the map query. A new button will show up.  It looks like the earth on a white background. When you press this a map will appear that needs to be configured.

Configure Map Settings

Press the settings link. As shown above you will see the fields from the query. Just drag the fields down into the boxes called latitude, longitude, tooltip and popup.

Mapping the Columns for the map

Once the latitude and longitude are filled in, the map will appear with markers on it for the data points. If you put columns into the tooltip and popup boxes, the tooltips and popups will work also.

The Finished Map

This demonstrates the easiest way to add a map to Zeppelin. More advanced data processing before a map is built requires writing a paragraph of code in Spark Scala, Python or another language supported by Apache Zeppelin.

Future posts will show how to write Scala code to preprocess data and another post on how to draw a box on the map to select the area of the earth to be queried for data points.

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint

Countering Computational Propaganda

Future Fear by bardionson.com

There is something new happening at the intersection of computer science and social media: computational propaganda. Computational propaganda is the use of computer information systems for political purposes, according to the journal Big Data. This also includes the efforts of governments to influence the public opinion of another country in order to change that country's foreign relations and policy, or to cause dissent among citizens against their own government.

Countering Propaganda

The first and most important step is to realize that propaganda is real, and then to admit that propaganda does impact the thought processes of citizens. So I will start by describing computational propaganda and then ways to computationally identify it.

Elements of Computational Propaganda

There are various elements to computational propaganda that I will attempt to outline. Some of the elements include: algorithms, automation, human curation, artificial intelligence, social media bots, sock puppets, troll farms, cyber attacks and stolen information, disinformation and data science.

This new generation of propaganda has gone through a process of computerized automation. It is not totally automated, but computers, networks and the internet have made it possible to deliver it in an automated manner. They also enable delivery in a personal manner, in a way that makes it seem to be coming from a real human. Technology is also making strides in the automatic production of propaganda.

Previously propaganda was slower moving and required getting news and editorials published. One used to have to own or control media outlets and then put people in place to spread disinformation. Our new modern way of spreading information online with social media has lowered the cost of spreading propaganda.

What is propaganda?

Propaganda is not just pure lies and conspiracy theories. It has come a long way from the Nazi and communist USSR modes of operation that might seem obvious to us now. Current propaganda is infused with truth, although it is often only partly true or taken out of context in ways hidden from individual audiences. This information can now be tailored and targeted at individuals. Often it is geared to lead people to action, as all good propaganda should be.

Jacques Ellul, in Propaganda: The Formation of Men's Attitudes, says that it is difficult to determine what propaganda is because it is a "secret action". Ellul found it impossible to pin down an exact definition of propaganda that would not take up an entire book. Our society has propaganda baked in that we are not aware of. In short, propaganda for Ellul covered the following areas:

  • Psychological Action – seeks to modify opinions by pure psychological methods
  • Psychological Warfare – destroy the morale of an adversary so that the opponent begins to doubt the validity of his beliefs and actions
  • Brainwashing – only used on prisoners to turn enemies to allies
  • Public and Human Relations – These group activities are also propaganda as they seek to adapt a person to a society. As we will see foreign governments use this technique in social media today.

Ellul says: “modern propaganda is based on scientific analysis of psychology and sociology… the propagandist builds his techniques on the basis of his knowledge of man.” The effects of the propaganda are then measured and the techniques refined.

These thoughts, written in 1956, have only continued to be refined. They are now at work on the internet and in social media, and computers and the tracking of online activity provide a much faster way to measure the results.

In reality, there are competing propaganda efforts happening online: democratic propaganda competes with anti-democratic strains as well as militant Islamic ones.

Computational Propaganda in Action

But to take a step back I want to outline what this looks like online at the moment as I understand it.

2016

Computational propaganda burst into public awareness during the 2016 US presidential election and continues today. This effort was carried out by Russian actors over many years, and I will outline some of its features that are computational in nature. The entire propaganda system is co-dependent but tries to appear as a set of disconnected entities.

Recently it has been reported that Russian-based accounts on Facebook and Twitter have been spreading propaganda to divide the American public. They used Facebook to support both sides of the Black Lives Matter protests and to promote gun rights and anti-immigration positions. The same behaviour has been observed on Reddit and Twitter. Some accounts were fully or semi-automated bots, while others were used to purchase targeted advertisements.

Cyber War

One element was stealing information through cyber attacks and social engineering. This exhibited itself in the theft of information from the Democratic National Committee and the use of classified leaks given to WikiLeaks.

Personal information on American citizens has been stolen by foreign intelligence services. Security questionnaires were taken by Chinese services, credit information was recently appropriated, and voter rolls were accessed in an unknown number of states during the elections. This personal information is alleged to have enabled more precise targeting of propaganda at specific populations.

Propaganda Generation

Then this information was selectively transformed into propaganda by taking it out of context and targeting it at select audiences. I assume the producers of the propaganda used software to search the massive amount of stolen information and then edited the most damaging pieces for maximum effect on specific audiences.

Targeting Propaganda

This was also the case for the Trump campaign, which took comments damaging to an opponent and targeted voter-suppression ads at specific people. Social media ad networks and databases were used in these efforts. These systems allow anyone to target a message using thousands of different personal identifiers, locations and income brackets, and this can be done on both Facebook and Twitter.

Targeting Ad on Facebook

The advertisement above is targeted at people in New York City who engage with liberal political content and have a net worth between $1 million and $2 million.

Personal Propaganda

In addition, there are stores of personal data that were correlated with computational algorithms to determine specific personality traits of individuals. Using this information, messaging was crafted and targeted at individuals’ psychological vulnerabilities to change their thinking or push them to action (source).

The Troll Factory

On top of this paid targeting, there are places called troll factories where armies of people engage in social interactions on social media and blogs. One of these, in St. Petersburg, is called The Factory. Its workers engage with online content to discredit it using propaganda techniques such as “whataboutism”, in which valid facts are contested by pointing out a perceived hypocrisy on the other side of the issue. They will also attempt to generate fake news to cause panic in a particular community.

Bots and Sock Puppets

These efforts are combined with the computational techniques of online bots, trolls and sock puppets. Sock puppets are fake online personas. They appear to be real people, often posing as Americans, but they exist only to broadcast propaganda.

Bots are pieces of software that perform online tasks. Some simply pick up other people’s messages online and rebroadcast them. Others are more human-like and engage in conversations: they search for keywords and reply with a “whataboutism” or some disparaging message to confuse readers. Still others are semi-automated; once they are challenged as bots or face a question they cannot respond to, a human intervenes and manually provides responses.

Bot networks

These bot and sock puppet accounts often work as a team. They gang up on conversations and re-broadcast each other’s messages. The goal is to make their message seem mainstream through sheer volume or apparent popularity. They can derail rational conversations and arguments online by misdirecting them and triggering emotional responses.

Suppression of Speech Bots

Recently bots have also been used to suppress people who are trying to counter disinformation. In the case of Brian Krebs, who was attacking the messaging of pro-Putin bots, they tricked Twitter into disabling his account. They did this by having thousands of accounts follow him and retweet his tweets en masse, which caused Twitter to automatically assume that Krebs was paying bots to promote his account, so Twitter turned it off.

Artificial Intelligence

Artificial intelligence and machine learning are additional computational techniques being weaponized in this information battle. Not only is AI used to make bots appear human, it can also create messaging content. The system processes source documents or training data, and programmers configure it to output new messages, multiplying the effort of a single human content creator. The output is then curated or even tested online to validate its effectiveness, and this feedback loop is used to create ever more effective triggers for people.

Attacking the Disenfranchised

Often these efforts to trigger action leverage marginalized groups in society. The bots and troll factories take domestic content and amplify it for their own purposes. It has been shown that Russian-based bots often rebroadcast messages intended to deepen divisions among US citizens. The Hamilton 68 dashboard illustrates these efforts.

Hamilton 68 dashboard
This is a dashboard that tracks known Russian bots and exposes what they are promoting. Often this is anything that breeds mistrust in the US government and pits groups against each other.

Some countries also invest in traditional media like newspapers, radio and television stations to broadcast messages. They attempt to make this look like real news, but it is actually disinformation and propaganda. This “news” is picked up as legitimate by other news outlets and rebroadcast, and people who buy into the messaging will use the source as proof of their opinions.

Conspiracy Theories

Propagandists often use conspiracy theories already in circulation to oppose competing messages or true news. They neutralize the opposition by inventing a secret conspiracy or amplifying one that is already spreading.

Live events

Recently The Daily Beast reported that Russian operatives organized and promoted rallies on Facebook. This illustrates the purpose of propaganda, which is to move people to action: once people act on the beliefs pushed by the propaganda, it can tip into political action.

Politics

At the moment, propaganda from Russia seems aimed at changing the foreign policy of the US government. The Russians have secured the information space of their own citizens in an authoritarian way that would not be acceptable in American society, and they have leveraged the lack of privacy controls in the American capitalist system, where information about people is sold for marketing purposes. According to Chris Zappone of The Age, Russian propaganda efforts are ultimately aimed at the destruction of democracy and western values.

Computational Countermeasures

There are hundreds of ideas on how to counter this online propaganda. Some involve government policy, industry self-policing and educational programs. But I want to focus on countermeasures that are computational: in effect, an attempt to fight fire with fire.

Inoculation

One way to minimize the impact of propaganda is to have tools that alert individuals when they are being targeted. Another type of computational tool lets a community monitor what its members are being targeted with, which can prevent individual weaknesses from being exploited. There are efforts underway in this space, but there is plenty of room for improvement.

Current concepts

Ideas from Volume

  • A dark advertisement exposure network. Volunteers install browser agents to gather ads and put them in a public searchable database with the targeting criteria. Could also use fake personas to collect advertisements.
  • Public searchable databases of bots and sock puppets identified by computational techniques, such as time of day analysis, linguistic patterns and rebroadcasting behaviours (see the sketch after this list)
  • The bot collection database would also hold the relationships between accounts, messages and motives.
  • A software package of computational methods to identify bots that pretend to be human and expose them as bots on social media
  • Browser plug-in that will show a user the motives of a bot and expose a network of bots that link or help each other. It enables a person to ignore and discount ideas coming from that entity.
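To make the computational techniques in the list above a bit more concrete, here is a minimal sketch of two of the signals mentioned: time-of-day analysis and rebroadcasting behaviour. It is written in Scala (the language used for the analysis code later in this post), and the Post case class, the thresholds and the likelyBots helper are hypothetical illustrations rather than part of any existing tool.

import java.time.{Instant, ZoneOffset}

// One observed post: the account that sent it, when it was sent,
// and whether it was a re-broadcast of someone else's message.
case class Post(account: String, time: Instant, isRebroadcast: Boolean)

object BotScreen {

  // Time-of-day analysis: most humans go quiet for several hours a day,
  // so an account that is active in nearly every hour of the day is suspicious.
  def activeHours(posts: Seq[Post]): Int =
    posts.map(_.time.atZone(ZoneOffset.UTC).getHour).distinct.size

  // Rebroadcasting behaviour: bots in a network mostly relay other accounts' messages.
  def rebroadcastRatio(posts: Seq[Post]): Double =
    if (posts.isEmpty) 0.0 else posts.count(_.isRebroadcast).toDouble / posts.size

  // Flag accounts that post around the clock and mostly re-broadcast.
  // The thresholds are arbitrary starting points, not tuned values.
  def likelyBots(allPosts: Seq[Post], minHours: Int = 22, minRatio: Double = 0.9): Seq[String] =
    allPosts.groupBy(_.account).collect {
      case (account, posts) if activeHours(posts) >= minHours && rebroadcastRatio(posts) >= minRatio => account
    }.toSeq
}

In practice these two signals would be combined with linguistic patterns and the account-relationship data mentioned above before any account is added to a public database of suspected bots.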

In a blog post, Colin Crowell, Twitter’s vice president of public policy, government and philanthropy, said that Twitter itself is a “powerful antidote to the spreading of all types of false information” and that “journalists, experts and engaged citizens Tweet side-by-side correcting and challenging public discourse in seconds.”

The issue with this is that the bots are programmatically generating information using artificial intelligence or are backed by armies of cheap labor. How can citizens and journalists keep up with researching and debunking half-truths, obvious lies and nonsense?

Perhaps we need to build counter-bot networks. Of course, ones that comply with the social media companies’ terms of service.

Bot Combat

(more ideas from Volume Integration)

  • Bots that disrupt bot networks by sending trigger words to them, keeping them busy and away from meaningful conversations
  • Bots that look for disinformation and hoaxes and broadcast information to debunk them
  • Artificial intelligence social bots that can automatically research messaging from propaganda bots and counter it
  • Fact-checking messaging crafted to target the organizations running troll factories and bot networks
  • A force of vetted volunteers who could perform analysis tasks to find bots and propagandists and then write counter-arguments to them

Analysis

Of course all of this rests on analysis, so we would need tools to visualize the data and support the human effort of finding the relationships and motives of the actors. Some of this work would use the many algorithms already designed to detect bots and online propaganda. In addition, Volume Integration has a tool that can help monitor and alert on social media account activity and messaging. See our products.

At the end of the post is a list of papers on methods and analysis techniques to find automated bots online.

Prevention (CyberSecurity)

One factor in recent propaganda campaigns has been the ability of bad actors to obtain classified or private information. Here better cyber security is needed. Sometimes the information is gathered via social engineering, where outside actors manipulate a person inside an organization into providing the information or access to the computer systems.

Ideas from Volume

  • Separate internal corporate networks from the internet
  • Increase Cybersecurity methods and policies (patching schedules, inventory control, multiple factor authentication, firewalls, packet inspection, audits)
  • Trust / Risk Verification Systems like Volume Analytics which monitor events on a computer network to alert on unauthorized or risky behaviour.

Conclusion

I am afraid there is no real conclusion. We are at the beginning of our ability to counter computational propaganda. It is going to be an arms race as tactics and systems change; technology and technique will breed more technology and technique. I hope we are able to separate out the false information and come a bit closer to the truth. In the end, this is the next phase of human conflict and of manipulation to gain power and wealth at the expense of personal freedom.

Contact us if you want to know more about our work or follow us on Twitter, LinkedIn or Facebook.

Current Studies on Automating Analysis

Ryft-ONE

Ryft and Apache Zeppelin

Ryft

Ryft is an FPGA (field-programmable gate array) appliance that allows data to be hosted and searched quickly. In this post I will show one way to connect Apache Zeppelin to it for data analysis using Scala code. Previously I showed how to connect Apache Zeppelin to SAP HANA.

The Ryft can quickly search structured and unstructured data without needing to build an index. This ability comes from the FPGA, which can filter data on demand; the appliance uses its four internal FPGA modules to process the data at search time. Other search systems such as Elasticsearch, Solr, Lucene or a database have to build and store an index of the data, but Ryft operates without one.

Ryft Speed Comparison

I have populated my Ryft with a cache of data from Enron. It is a dump of Enron emails obtained from Carnegie Mellon. This was as simple as uploading files to the Ryft and running a command like this:

ryftutil -copy "enron*" -c enron_email -a ryft.volumeintegration.com:8765

In the Zeppelin interface I will be able to search for keywords or phrases in the email files and display them. The size of the Enron e-mail archive is 20 megabytes.

Ryft One Appliance

Apache Zeppelin

Apache Zeppelin is an open source web notebook that allows a person to write code in many languages to manipulate and visualize data.

Apache Zeppelin with Volume Analytics Interface

To make Apache Zeppelin work with Ryft, I installed Zeppelin onto the Ryft appliance and connected the Spark Ryft Connector jar found at this git project. You can also download a prebuilt jar.

Follow the directions provided at the spark-ryft-connector project to compile the jar file; I compiled it on my local desktop computer and then placed it onto the Ryft machine. I did run into one issue that was not documented: the Ryft connector was not working properly and gave the error “java.lang.NoClassDefFoundError: org/apache/spark/Logging”.

I resolved the issue by downloading spark-core_2.11-1.5.2.logging.jar from https://raw.githubusercontent.com/swordsmanliu/SparkStreamingHbase/master/lib/spark-core_2.11-1.5.2.logging.jar and putting it in the zeppelin/interpreter/spark/dep directory.

Now you can create a note in Zeppelin. I am using the Spark interpreter which allows you to write the code in Scala.

First you have to make sure Zeppelin can use the ryft code in the jar file. Make a dependency paragraph with this code:

%dep
z.reset()
z.load("/home/ryftuser/spark-ryft-connector-2.10.6-0.9.0.jar")

Ryft Query

Now make a new paragraph with the code to make form fields and run the Ryft API commands to perform a search. Figuring these queries out takes a detailed study of the documentation.

These are the commands to prepare and run the query. I show a simple search, a fuzzy Hamming search and a fuzzy edit-distance search. The Ryft can perform very fast fuzzy searches with wide edit distances because no index has to be built.

Simple Query:

queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
query = SimpleQuery(searchFor.toString)

Hamming Query:

queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte, fhs)

Edit Distance Query:

queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte)

The Search:

var searchRDD = sc.ryftRDD(Seq(query), queryOptions)

This produces an RDD that can be manipulated to view the contents using code like the example below.

searchRDD.asInstanceOf[RyftRDD[RyftData]].collect.foreach { ryftData =>
   println(ryftData.offset)
   println(ryftData.length)
   println(ryftData.fuzziness)
   println(ryftData.data.replace("\n", " "))
   println(ryftData.file)
}

The Result in Zeppelin

Result of Searching Ryft with Zeppelin

In addition I have included code that allows the user to click on Show File to see the original e-mail with the relevant text highlighted in bold.

Results in Bold

In order for Apache Zeppelin to display the original email, I had to give it access to the part of the filesystem where the original emails were stored. Ryft uses a catalog of the emails to perform searches, as it performs better when searching a few large files rather than many small ones; the catalog feature combines many small files into one large file.

The search results return a filename and offset which Apache Zeppelin uses to retrieve the relevant file and highlight the appropriate match. 

In the end, Ryft found all instances of the name Mohammad, with various spelling differences, in 0.148 seconds on a dataset of 30 megabytes. The same search ran in 5.89 seconds on 48 gigabytes of data, 12.274 seconds on 94 gigabytes and 13 seconds on 102 gigabytes. These are just quick sample numbers using dumps of many files; performance could perhaps be improved by consolidating the small files into catalogs.

Zeppelin Editor

The code is edited in Zeppelin itself.

Code in Zeppelin

You edit the code in the web interface, but you can hide it once the form fields are working. Here is the part of the code that produces the form fields:

 val searchFor = z.input("Search String", "mohammad")
 val distance = z.input("Search Distance", 2)
 var queryType = z.select("Query Type", Seq(("1","Simple"),("2","Hamming"),("3","Edit Distance"))).toString
 var surrounding = z.input("Surrounding", "line")

Putting it all together, we end up with the following code.

%spark
import com.ryft.spark.connector._
import com.ryft.spark.connector.domain.RyftQueryOptions
import com.ryft.spark.connector.query.SimpleQuery
import com.ryft.spark.connector.query.value.{EditValue, HammingValue}
import com.ryft.spark.connector.rdd.RyftRDD
import com.ryft.spark.connector.domain.{fhs, RyftData, RyftQueryOptions}
import scala.language.postfixOps
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import scala.io.Source

def isEmpty(x: String) = x == null || x.isEmpty
  var queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
  val searchFor = z.input("Search String", "mohammad")
  val distance = z.input("Search Distance", 2)
  var queryType = z.select("Query Type",("2","Hamming"), Seq(("1","Simple"),("2","Hamming"),("3","Edit Distance"))).toString
  var surrounding = z.input("Surrounding", "line")
  var query = SimpleQuery(searchFor.toString)

 
  if (isEmpty(queryType)) {
      queryType = "2"
  }

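  // Build the query options for the selected query type: 1 = simple, 2 = Hamming, 3 = edit distance.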
  if (queryType.toString.toInt == 1) {
        println("simple")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, 0 toByte)
        }
        query = SimpleQuery(searchFor.toString)

  } else if (queryType.toString.toInt ==2) {
        println("hamming")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte, fhs)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte, fhs)
        }
  } else {
        println("edit")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte)
        }
  }

  var searchRDD = sc.ryftRDD(Seq(query), queryOptions)
  var count = searchRDD.count()

  print(s"%html <h2>Count: $count</h2>")

  if (count > 0) {
        println(s"Hamming search RDD first: ${searchRDD.first()}")
        println(searchRDD.count())
        print("%html <table>")
        print("<script>")
        println("function showhide(id) { var e = document.getElementById(id); e.style.display = (e.style.display == 'block') ? 'none' : 'block';}")
        print("</script>")
        print("<tr><td>File</td><td>Data</td></tr>")

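        // For each match, output a visible row with a Show File link and the matching text,
        // plus a hidden row containing the full e-mail with the match highlighted.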
        searchRDD.asInstanceOf[RyftRDD[RyftData]].collect.foreach { ryftData =>
            print("<tr><td style='width:600px'><a href=javascript:showhide('"+ryftData.file+"')>Show File </a></td>")
            val x = ryftData.data.replace("\n", " ")
            print(s"<td> $x</td></tr>")
            println("<tr id="+ ryftData.file +" style='display:none;'>")
            println("<td style='width:600px'>")

            val source = Source.fromFile("/home/ryftuser/maildir/"+ryftData.file)
            var theFile = try source.mkString finally source.close()
            var newDoc = ""
            var totalCharCount = 0
            var charCount = 0
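            // Walk the file character by character, inserting <b> and </b> around
            // the matched span reported by Ryft (offset and length).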
            for (c <- theFile) {
                charCount = charCount + 1
                if (totalCharCount + charCount == ryftData.offset) {
                    newDoc = newDoc+"<b>"
                } else if (totalCharCount+charCount == ryftData.offset+ryftData.length+1) {
                    newDoc = newDoc+"</b>"
                }
                newDoc = newDoc+c
            }
            print(newDoc.replace("\n", "<br>"))
            totalCharCount = totalCharCount + charCount
            println("</td>")
            println("</tr>")
        }
        print("</table>")
    }

This should get you started searching data with Zeppelin and Ryft. You can use this interface to experiment with the different edit distances and search queries the Ryft supports, and you can also implement additional methods to search by regular expressions, IP addresses, dates and currency.

Please follow us on Facebook and on twitter at volumeint.

How to Use Zeppelin With SAP HANA

Apache Zeppelin is an open source tool that allows interactive data analytics from many data sources like databases, Hive, Spark, Python, HDFS, HANA and more.

It allows for:

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration

https://aws.amazon.com/blogs/aws/amazon-emr-update-apache-spark-1-5-2-ganglia-presto-zeppelin-and-oozie/

SAP HANA is an in-memory data platform for storing and analyzing data. At its core it is a columnar database that can also act as a row-oriented database.

HANA overview

Setup Zeppelin

Install Apache Zeppelin from https://zeppelin.apache.org/download.html with all interpreters. Once Zeppelin is installed, go to the bin directory and run ./zeppelin-daemon.sh start. You can then view the Zeppelin interface at http://localhost:8080/

zeppelin start screen

Click on the menu to go to interpreter setup.

Configuration Menu

Choose the Create button.

Create new interpreter

Pick the JDBC interpreter group in the pull down.

Pick JDBC Interpreter

Fill in the HANA properties with your specific server name and port, and give the interpreter the name hana.

HANA interpreter property values

The important parts are:

default.driver: com.sap.db.jdbc.Driver

default.url: jdbc:sap://[server]:[port]

Save the interpreter with the save button.

Install JDBC Driver

Find the jar file called ngdbc.jar and place it in zeppelin-0.x.X-bin-all/interpreter/jdbc.

The ngdbc.jar may be embedded in another jar file; you will need to install SAP HANA Studio or the SQL client drivers to get it. To find ngdbc.jar I had to unzip the com.sap.ndb.studio.jdbc_2.3.6.jar file by changing its extension from .jar to .zip. Once it was unzipped, I found ngdbc.jar in the lib directory.

The com.sap.ndb.studio.jdbc_2.3.6.jar was found in a directory called studio/core_repository/plugins

Make a New Note

Go to the menu and create a new note.

Create New Note

Pick hana as your default interpreter and give your note a name.

Create New Note

This will create a workspace where you can create your tables and graphs from data in HANA and also annotate your report with text. It is also possible to pull in data from other database systems you have already configured.

Create New Note

Now enter your query in the text area called the paragraph.

type in query

Click the arrow to run the query and then you can select what kind of graph to display.

Type in query and pick a visualization

Now you can add additional paragraphs with markdown text in them to describe your information.

So we have successfully configured Apache Zeppelin to pull data from HANA using a jdbc driver.

How to Use HANA from Spark

You may also want to connect to HANA directly from Spark using Scala, Python or R code. The best way to implement this is to put the reference to the jdbc jar called ngdbc.jar in the Spark interpreter. Go to the Zeppelin settings menu.

Zeppelin Settings Menu

Pick the Interpreter and find the Spark section and press edit.

Zeppelin Interpreter Screen

Then add the path where you have the SAP HANA JDBC driver (ngdbc.jar) installed.

Configure HANA jdbc in Spark Interpreter

Configure jdbc in Spark Interpreter
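Once the driver is on the Spark interpreter's classpath, a %spark paragraph like the one below can read a HANA table straight into a Spark DataFrame. This is only a sketch, assuming a Spark 2.x interpreter where the spark session is available: the host, port, credentials and table name are placeholders to replace with your own values.

%spark
// Read a HANA table over JDBC into a Spark DataFrame.
// The host, port, user, password and table names below are placeholders.
val hanaDF = spark.read
  .format("jdbc")
  .option("driver", "com.sap.db.jdbc.Driver")
  .option("url", "jdbc:sap://your-hana-host:30015")
  .option("user", "YOUR_USER")
  .option("password", "YOUR_PASSWORD")
  .option("dbtable", "YOUR_SCHEMA.YOUR_TABLE")
  .load()

// Preview the first rows in the Zeppelin output.
hanaDF.show(10)

From here you can register hanaDF as a temporary view and mix it with data from the other interpreters you have configured.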

This will get you started on using Apache Zeppelin with SAP HANA. We have further tutorials on how to build interfaces in Zeppelin.

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint