Posts

Volume Analytics Table Explorer - HANA & Zeppelin

Using Zeppelin to Explore a Database

In attempting to use Apache Zeppelin I found it difficult to just explore a new database. This was the situation when connecting SAP HANA database to Apache Zeppelin using the JDBC driver.

So I created a Zeppelin interface that can be used by a person who does not know how to code or use SQL.

This is a note with code in multiple paragraphs that would allow a person to see a list of all the tables in the database and then view the structure of them and look at a sample of the data in each table.

Volume Analytics Table Explorer - HANA & Zeppelin

Volume Analytics Table Explorer – HANA & Zeppelin

When using a standard database with Apache Zeppelin one needs to register each table into Spark so that it can query it and make DataFrames from the native tables. I got around this by allowing the user to choose they tables they want to register into Apache Zeppelin and Spark. This registration involved using the createOrReplaceTempView function on a DataFrame. This allows us to retain the speed of HANA without copying all the data into a Spark table.

The video shows a short demonstration of how this works.

Once tables are registered as Spark views they can be used by all the other notes on the Apache Zeppelin server. This means that other users can leverage the tables without knowing they came from the HANA database.

The code is custom to HANA because of the names of the system tables where it stores the lists of tables and column names. The code also converts HANA specific data types such as ST_POINT to comma delimited strings.

This example of dynamic forms with informed by Data-Driven Dynamic Forms in Apache Zeppelin

Previous posts on Apache Zeppelin and SAP Hana are:

The Code

Be aware this is prototype code that works on Zeppelin 0.8.0 Snapshot which as of today needs to be built from source. It is pre-release.

First Paragraph

In the first paragraph I am loading up the HANA jdbc driver. But you can avoid doing this by adding your jdbc jar to the dependencies section of the interpreter configuration as laid out in How to Use Zeppelin With SAP HANA

%dep
z.reset() 
z.load("/projects/zeppelin/interpreter/jdbc/ngdbc.jar")

Second Paragraph

In the second paragraph we build the Data Frames from tables in HANA that contain the list of tables and columns in the database. This will be used to show the user what tables and columns are available to use for data analysis.

%spark
import org.apache.spark.sql._
val driver ="com.sap.db.jdbc.Driver"
val url="jdbc:sap://120.12.83.105:30015/ffa"
val database = "dbname"
val username = "username"
val password = "password"
// type in the schemas you wish to expose
val tables = """(select * from tables where schema_name in ('FFA', 'SCHEMA_B')) a """
val columns = """(select * from table_columns where schema_name in ('FFA', 'SCHEMA_B')) b """

val jdbcDF = sqlContext.read.format("jdbc").option("driver",driver)
 .option("url",url)
 .option("databaseName", database)
 .option("user", username)
 .option("password",password)
 .option("dbtable", tables).load()
jdbcDF.createOrReplaceTempView("tables")

val jdbcDF2 = sqlContext.read.format("jdbc").option("driver",driver)
 .option("url",url)
 .option("databaseName", database)
 .option("user", username)
 .option("password",password)
 .option("dbtable", columns).load()
jdbcDF2.createOrReplaceTempView("table_columns")

Third Paragraph

The third paragraph contains the functions that will be used in the fourth paragraph that needs to call Spark / Scala functions. These functions will return the column names and types when a table name is given. Also it has the function that will load a HANA table into a Spark table view.

%spark
//Get list of distinct values on a column for given table
def distinctValues(table: String, col: String) : Array[(String, String)] = {
 sqlContext.sql("select distinct " + col + " from " + table + " order by " + col).collect.map(x => (x(0).asInstanceOf[String], x(0).asInstanceOf[String]))
}

def distinctWhere(table: String, col: String, schema: String) : Array[(String, String)] = {
 var results = sqlContext.sql("select distinct " + col + " from " + table + " where schema_name = '" + schema +"' order by " + col)
 results.collect.map(x => (x(0).asInstanceOf[String], x(0).asInstanceOf[String]))
}

//Get list of tables
def tables(): Array[(String, String)] = {
 sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String].toUpperCase(), x(1).asInstanceOf[String].toUpperCase()))
}

//Get list of columns on a given table
def columns(table: String) : Array[(String, String)] = {
 sqlContext.sql("select * from " + table + " limit 0").columns.map(x => (x, x))
}

def hanaColumns(schema: String, table: String): Array[(String, String)] = {
 sqlContext.sql("select column_name, data_type_name from table_columns where schema_name = '"+ schema + "' and table_name = '" + table+"'").collect.map(x => (x(0).asInstanceOf[String], x(1).asInstanceOf[String]))
}

//load table into spark
def loadSparkTable(schema: String, table: String) : Unit = {
  var columns = hanaColumns(schema, table)
  var tableSql = "(select "
  for (c <- columns) {
    // If this column is a geo datatype convert it to a string
    if (c._2 == "ST_POINT" || c._2 == "ST_GEOMETRY") {
      tableSql = tableSql + c._1 + ".st_y()|| ',' || " + c._1 + ".st_x() " + c._1 + ", "
    } else {
      tableSql = tableSql + c._1 + ", "
    }
  }
 tableSql = tableSql.dropRight(2)
 tableSql = tableSql + " from " + schema +"."+table+") " + table

 val jdbcDF4 = sqlContext.read.format("jdbc").option("driver",driver)
  .option("url",url)
  .option("databaseName", "FFA")
  .option("user", username)
  .option("password", password)
  .option("dbtable", tableSql).load()
  jdbcDF4.createOrReplaceTempView(table)
 
}

//Wrapper for printing any DataFrame in Zeppelin table format
def printQueryResultsAsTable(query: String) : Unit = {
 val df = sqlContext.sql(query)
 print("%table\n" + df.columns.mkString("\t") + '\n'+ df.map(x => x.mkString("\t")).collect().mkString("\n")) 
}

def printTableList(): Unit = {
 println(sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String])).mkString("%table\nTables Loaded\n","\n","\n"))
}

// this part keeps a list of the tables that have been registered for reference
val aRDD = sc.parallelize(sqlContext.sql("show tables").collect.map(x => (x(1).asInstanceOf[String])))
val aDF = aRDD.toDF()
aDF.registerTempTable("tables_loaded")

Fourth Paragraph

The fourth paragraph contains the Spark code needed to produce the interface with select lists for picking the tables. It uses dynamic forms as described in the Zeppelin documentation and illustrated in more detail by Rander Zander.

%spark
val schema = z.select("Schemas", distinctValues("tables","schema_name")).asInstanceOf[String]
var table = z.select("Tables", distinctWhere("tables", "table_name", schema)).asInstanceOf[String]
val options = Seq(("yes","yes"))
val load = z.checkbox("Register & View Data", options).mkString("")

val query = "select column_name, data_type_name, length, is_nullable, comments from table_columns where schema_name = '" + schema + "' and table_name = '" + table + "' order by position"
val df = sqlContext.sql(query)


if (load == "yes") { 
 if (table != null && !table.isEmpty()) {
   loadSparkTable(schema, table)
   z.run("20180108-113700_1925475075")
 }
}

if (table != null && !table.isEmpty()) {
 println("%html <h1>"+schema)
 println(table + "</h1>")
 z.show(df)
} else {
 println("%html <h1>Pick a Schema and Table</h1>")
}

As the user changes the select lists schema in paragraph 3 will be called and the tables select list will be populated with the new tables. When they select the table the paragraph will refresh with a table containing some of the details about the table columns like the column types and sizes.

When they select the Register and View checkbox the table will get turned into a Spark view and paragraph five will contain the data contents of the table. Note the z.run command. This runs a specific paragraph and you need to put in your own value here. This should be the paragraph id from the next paragraph which is paragraph five.

Paragraph Five

%spark
z.show(sql("select * from " + table +" limit 100"))

The last paragraph will list the first 100 rows from the table that have been selected and has the register and view on.

Slight modifications of this code will allow the same sort of interface to be built for MySQL, Postgres, Oracle, MS-SQL or any other database.

Now go to SAP HANA Query Builder On Apache Zeppelin Demo and you will find code to build a simple query builder note.

Please let us know on twitter, facebook and LinkedIn if this helps you or your find a better way to do this in Zeppelin.

Previous posts on Apache Zeppelin and SAP Hana are:

 

I Code Like a Girl

I Code Like a Girl

Programmers Betty Jean Jennings (left) and Fran Bilas (right) operate ENIAC's main control panel at the Moore School of Electrical Engineering. (U.S. Army photo from the archives of the ARL Technical Library)

Programmers Betty Jean Jennings (left) and Fran Bilas (right) operate ENIAC’s main control panel at the Moore School of Electrical Engineering. (U.S. Army photo from the archives of the ARL Technical Library)

The American computer industry seems dominated by people like me – males. But it was not always this way, in fact I stand on the shoulders of women.

While writing about SAGE in previous posts, I learned that at the beginning of the computer age, most programmers were women. My skills, including programming languages and tools, have a foundation built on the contributions of those who came before me. It turns out that I code like a girl, using concepts pioneered by women.

Computers: Human Women or Machine?

In the 1940’s and 50’s, there was a peak in women working as computers. Mathematically intensive industries like nuclear research, ballistics, and engineering often employed women to perform computations.

Do you remember “When Computers Were Women?” The article reminds us that a “computer” was actually a human being until around 1945. Afterward, a computer was a machine and humans were called operators.

ENAIC

In 1946, the first electronic general-purpose computer called ENIAC (Electronic Numerical Integrator And Computer) was built. This began a process of reducing and automating manual calculations.

A select number of women operators and former “computers” were enlisted to become programmers for ENIAC. Notably, the first ENIAC programming team was comprised entirely of women: Kay McNulty, Betty Jennings, Betty Snyder, Marlyn Wescoff, Fran Bilas, and Ruth Lichterman.

Women programmer School of Electrical Engineering. (U.S. Army photo from the archives of the ARL Technical Library)

School of Electrical Engineering. (U.S. Army photo from the archives of the ARL Technical Library)

At the time, there was a strong division between the male domain of hardware and the female sphere of software. Male electronic engineers built the ENAIC system. But since software design and programming were considered clerical work, women instructed the the 27-ton hand-built mass of wires and vacuum tubes to perform calculations in sequence.

In true programmer fashion, the women learned by doing. The hardware engineers dropped the blueprints and wiring documents on them and said, “Here figure out how the machine works and then figure out how to program it.” So the women crawled around the massive frame and learned how each component worked. They successfully understood the interplay between hardware and software and how the computer’s behavior could be traced to a hardware or a software issue.

Unfortunately, all the credit for creating ENIAC went to the men who conceived it and built the hardware. The media covered the debut of ENIAC in February 1946, which showed off the centerpiece calculation of a trajectory. The program created by Betty Snyder and Betty Jennings impressed the VIPs because it allowed the computer to calculate faster than the projectile itself. But the women were not mentioned, seen in pictures of the event, nor invited to the press lunch with the men. In the end, the computer was the star of the show, depicted as an autonomous brain.

Gender Code

Words powerfully describe gender roles. What is now considered a male-dominated field, was once defined as “women’s work.” In the days of the first computers, the norms were as follows:

Male Female
Hard sciences Soft sciences
Engineering Programming
Hardware Software

Society was keen on recognizing men’s contributions, while neglecting those of women. In the book, _Recoding Gender: Women’s Changing Participation in Computing_, Janet Abbate found that publicity materials for ENIAC state that the machine reduced 25 man-months of human computer time to two hours on the ENIAC. However, it fails to mention that most of the human computers were really women. The materials also neglect to highlight the years of labor by both men (on the hardware) and women (writing software) to create the system. The only human labor noted in the press was the initial design of the machine, which was performed by men.

But even women of the time seemed to define their computer jobs as gender specific. Elise Shutt was a programmer on a later version of ENIAC called ORDVAC. When she was hired by Raytheon in 1953, she said, “It really amazed me that these men were programmers, because I thought it was women’s work.”

In another example, Grace Hopper compared programming to tasks like sewing clothes, making a recipe, and the work of a mother teaching a child. Thus, she defined programming as a female occupation. But this seems to have been lost on her supervisor, Howard Aiken who said in praise of Grace, “Grace was a good man.

Recruiting materials were also used to attract women to programming with various metaphors and generalizations. In the 1940s, MIT had a shortage of men and highlighted skills such as needlework and knitting as characteristics useful for programmers. Others noted that female pursuits like crosswords and puzzles would make good programmers.

Recognition

The first women who pioneered programming on the ENIAC finally gained the recognition they deserved in 1997 when they were inducted into the Women in Technology Hall of Fame in 1997 and IEEE in 1997 and 2008.

It seems that we continue to struggle with metaphors and defining skills used to train and attract the next generation to computer work. From recent statistics, we are finding a wider gender gap in the computer industry. The reasons for this are inconclusive and give us a reason for self-evaluation and consideration of language used to hire and promote, treatment of women, and how skills are evaluated.

It becomes increasingly important to value each member of a programming team regardless of gender, age, race, or creed to attract and keep the best minds to build our future software. There seems to be no end to the amount of programming work needed. Code on!

 

Check out more of our work at Volume Integration and follow us on Twitter.