
Ryft-ONE

Ryft and Apache Zeppelin

Ryft

Ryft is an FPGA (field programmable gate array) appliance for hosting and searching data quickly. In this post I will show one way to connect Apache Zeppelin to it for data analysis using Scala code. Previously I showed how to connect Apache Zeppelin to SAP HANA.

The Ryft can quickly search structured and unstructured data without needing to build an index. This ability comes from the FPGA, which filters data on demand; the appliance uses its four internal FPGA modules to process the data at search time. Other search systems such as Elasticsearch, Solr, Lucene, or a database have to build and store an index of the data. Ryft operates without one.

Ryft Speed Comparison

I have populated my Ryft with a cache of Enron data: a dump of Enron emails obtained from Carnegie Mellon. Loading it was as simple as uploading the files to the Ryft and running a command like this:

ryftutil -copy "enron*" -c enron_email -a ryft.volumeintegration.com:8765

In the Zeppelin interface I will be able to search for keywords or phrases in the email files and display them. The Enron e-mail archive is 20 megabytes.

Ryft One Appliance

Apache Zeppelin

Apache Zeppelin is an open source web notebook that allows a person to write code in many languages to manipulate and visualize data.

Apache Zeppelin with Volume Analytics Interface

To make Apache Zeppelin work with Ryft, I installed Apache Zeppelin onto the Ryft appliance and connected the Spark Ryft Connector jar found at this git project. Alternatively, you can download a prebuilt jar.

Follow the directions provided at the spark-ryft-connector project to compile the jar file needed; I compiled it on my local desktop computer. Place the spark-ryft-connector jar file onto the Ryft machine. I did run into one issue that was not documented: the Ryft connector was not working properly and gave the error "java.lang.NoClassDefFoundError: org/apache/spark/Logging".

I resolved the issue by downloading spark-core_2.11-1.5.2.logging.jar from https://raw.githubusercontent.com/swordsmanliu/SparkStreamingHbase/master/lib/spark-core_2.11-1.5.2.logging.jar and putting it in the zeppelin/interpreter/spark/dep directory.

Now you can create a note in Zeppelin. I am using the Spark interpreter, which allows you to write the code in Scala.

First you have to make sure Zeppelin can use the Ryft code in the jar file. Make a dependency paragraph with this code:

%dep
z.reset()
z.load("/home/ryftuser/spark-ryft-connector-2.10.6-0.9.0.jar")

Ryft Query

Now make a new paragraph with the code to make form fields and run the Ryft API commands to perform a search. Figuring these queries out takes a detailed study of the documentation.

These are the commands to prepare and run the query. I show a simple search, a fuzzy hamming search, and a fuzzy edit distance search. The Ryft can perform very fast fuzzy searches with wide edit distances because no index has to be built.

Simple Query

queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
query = SimpleQuery(searchFor.toString)

Hamming Query

queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte, fhs)

Edit Distance Query

queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte)

The Search

var searchRDD = sc.ryftRDD(Seq(query), queryOptions)
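
Pieced together, a minimal search paragraph might look like the sketch below. It is only a sketch: it assumes the enron_email catalog from earlier and hard-codes the search term and distance that the form fields provide in the full listing at the end of this post.

%spark
import com.ryft.spark.connector._
import com.ryft.spark.connector.domain.{fhs, RyftQueryOptions}
import com.ryft.spark.connector.query.SimpleQuery

// exact search, returning the whole matching line from the enron_email catalog
val exactOptions = RyftQueryOptions("enron_email", "line", 0.toByte)
// fuzzy hamming search with an edit distance of 2
val hammingOptions = RyftQueryOptions("enron_email", "line", 2.toByte, fhs)
// fuzzy edit distance search with a distance of 2
val editOptions = RyftQueryOptions("enron_email", "line", 2.toByte)

val query = SimpleQuery("mohammad")
val searchRDD = sc.ryftRDD(Seq(query), hammingOptions)
println(s"Count: ${searchRDD.count()}")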

This produces an RDD that can be manipulated to view the contents using code like the example below.

searchRDD.asInstanceOf[RyftRDD[RyftData]].collect.foreach { ryftData =>
   println(ryftData.offset)
   println(ryftData.length)
   println(ryftData.fuzziness)
   println(ryftData.data.replace("\n", " "))
   println(ryftData.file)
}

The Result in Zeppelin

Result of Searching Ryft with Zeppelin

In addition I have included code that allows the user to click on Show File to see the original e-mail with the relevant text highlighted in bold.

Results in Bold

In order for Apache Zeppelin to display the original email, I had to give it access to the part of the filesystem on the server where the original emails were stored. Ryft itself searches a catalog of the emails, since it performs better when searching a few large files rather than many small ones; the catalog feature combines many small files into one large file.

The search results return a filename and offset which Apache Zeppelin uses to retrieve the relevant file and highlight the appropriate match. 
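
As a rough sketch of that step, assuming a ryftData record from the RDD above and the /home/ryftuser/maildir/ path where I keep the original files, the match can be wrapped in bold tags like this (the full listing at the end of the post does the same thing character by character):

import scala.io.Source

// read the original email that the search result points back to
val source = Source.fromFile("/home/ryftuser/maildir/" + ryftData.file)
val theFile = try source.mkString finally source.close()

// wrap the matched span, located by offset and length, in <b> tags
val start = ryftData.offset.toInt
val end = start + ryftData.length.toInt
val highlighted =
  theFile.substring(0, start) + "<b>" + theFile.substring(start, end) + "</b>" + theFile.substring(end)

print(highlighted.replace("\n", "<br>"))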

In the end, Ryft found all instances of the name Mohammad, with various spelling differences, in 0.148 seconds in a dataset of 30 megabytes. When I performed the same search on 48 gigabytes of data, it ran in 5.89 seconds; 94 gigabytes took 12.274 seconds, and 102 gigabytes took 13 seconds. These are just quick sample numbers using dumps of many files. Perhaps performance could be improved by consolidating small files into catalogs.

Zeppelin Editor

The code is edited in Zeppelin itself.

Code in Zeppelin

You edit the code in the web interface, but Zeppelin can hide it once you have the form fields working. Here is the part of the code that produces the form fields:

 val searchFor = z.input("Search String", "mohammad")
 val distance = z.input("Search Distance", 2)
 var queryType = z.select("Query Type", Seq(("1","Simple"),("2","Hamming"),("3","Edit Distance"))).toString
 var surrounding = z.input("Surrounding", "line")

Putting it all together, we end up with the following code.

%spark
import com.ryft.spark.connector._
import com.ryft.spark.connector.domain.RyftQueryOptions
import com.ryft.spark.connector.query.SimpleQuery
import com.ryft.spark.connector.query.value.{EditValue, HammingValue}
import com.ryft.spark.connector.rdd.RyftRDD
import com.ryft.spark.connector.domain.{fhs, RyftData, RyftQueryOptions}
import scala.language.postfixOps
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import scala.io.Source

def isEmpty(x: String) = x == null || x.isEmpty
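  // default query options; overwritten below once the form selections are read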
  var queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
  val searchFor = z.input("Search String", "mohammad")
  val distance = z.input("Search Distance", 2)
  var queryType = z.select("Query Type",("2","Hamming"), Seq(("1","Simple"),("2","Hamming"),("3","Edit Distance"))).toString
  var surrounding = z.input("Surrounding", "line")
  var query = SimpleQuery(searchFor.toString)

 
  if (isEmpty(queryType)) {
      queryType = "2"
  }

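  // build the query options for the selected query type; surrounding is either "line" or a numeric width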
  if (queryType.toString.toInt == 1) {
        println("simple")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, 0 toByte)
        }
        query = SimpleQuery(searchFor.toString)

  } else if (queryType.toString.toInt ==2) {
        println("hamming")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte, fhs)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte, fhs)
        }
  } else {
        println("edit")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte)
        }
  }

  var searchRDD = sc.ryftRDD(Seq(query), queryOptions)
  var count = searchRDD.count()

  print(s"%html <h2>Count: $count</h2>")

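  // render each match as a table row, with a Show File link that toggles the highlighted original email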
  if (count > 0) {
        println(s"Hamming search RDD first: ${searchRDD.first()}")
        println(searchRDD.count())
        print("%html <table>")
        print("<script>")
        println("function showhide(id) { var e = document.getElementById(id); e.style.display = (e.style.display == 'block') ? 'none' : 'block';}")
        print("</script>")
        print("<tr><td>File</td><td>Data</td></tr>")

        searchRDD.asInstanceOf[RyftRDD[RyftData]].collect.foreach { ryftData =>
            print("<tr><td style='width:600px'><a href=javascript:showhide('"+ryftData.file+"')>Show File </a></td>")
            val x = ryftData.data.replace("\n", " ")
            print(s"<td> $x</td></tr>")
            println("<tr id="+ ryftData.file +" style='display:none;'>")
            println("<td style='width:600px'>")

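            // read the original email and wrap the match, located by offset and length, in <b> tags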
            val source = Source.fromFile("/home/ryftuser/maildir/"+ryftData.file)
            var theFile = try source.mkString finally source.close()
            var newDoc = ""
            var totalCharCount = 0
            var charCount = 0
            for (c <- theFile) {
                charCount = charCount + 1
                if (totalCharCount + charCount == ryftData.offset) {
                    newDoc = newDoc+"<b>"
                } else if (totalCharCount+charCount == ryftData.offset+ryftData.length+1) {
                    newDoc = newDoc+"</b>"
                }
                newDoc = newDoc+c
            }
            print(newDoc.replace("\n", "<br>"))
            totalCharCount = totalCharCount + charCount
            println("</td>")
            println("</tr>")
        }
        print("</table>")
    }

So this should get you started on searching data with Zeppelin and Ryft. You can use this interface to experiment with the different edit distances and search queries the Ryft supports. You can also implement additional methods to search by regular expression, IP address, date, and currency.

Please follow us on Facebook and on Twitter at volumeint.

JSON Joins - jq

Digital Cherries by Bradley Johnson (https://art.wowak.com), like a bunch of little JSON objects being manipulated

Manipulating JSON files is an interesting challenge, but it is much easier than trying to manipulate XML files. I was given two files that were lists of JSON objects. The two files had a common key, business_id, that could be used to join them together. The objective was to flatten the two files into one file.

Once they are in one large file, we will use a system called Ryft to search the data. Ryft is an FPGA (Field Programmable Gate Array) system that searches data very quickly, with special capabilities for performing fuzzy searches with a large edit distance. I am hypothesizing that the Ryft will work better on flat data than on data that requires a join between two tables at search time.

The files are from the Yelp dataset challenge. We will use a program called jq to join the data. The data files are yelp_business.json and yelp_review.json.

The file yelp_business.json has the format below, and each record/object is separated by a hard return. This file is 74 megabytes. Note: I added hard returns between the fields to make it easier to read.
{
"business_id": "UsFtqoBl7naz8AVUBZMjQQ",
"full_address": "202 McClure St\nDravosburg, PA 15034",
"hours": {},
"open": true,
"categories": ["Nightlife"],
"city": "Dravosburg",
"review_count": 5,
"name": "Clancy's Pub",
"neighborhoods": [],
"longitude": -79.8868138,
"state": "PA",
"stars": 3.0,
"latitude": 40.3505527,
"attributes": {"Happy Hour": true, "Accepts Credit Cards": true, "Good For Groups": true, "Outdoor Seating": false, "Price Range": 1},
"type": "business"
}

The file yelp_review.json has the format below and is also separated by hard returns. For each business there are many reviews, so if a business has five reviews, the final output will contain five rows of data for that one business. This file is 2.1 gigabytes.

{
"votes": {"funny": 0, "useful": 0, "cool": 0},
"user_id": "uK8tzraOp4M5u3uYrqIBXg",
"review_id": "Di3exaUCFNw1V4kSNW5pgA",
"stars": 5,
"date": "2013-11-08",
"text": "All the food is great here. But the best thing they have is their wings. Their wings are simply fantastic!! The \"Wet Cajun\" are by the best & most popular. I also like the seasoned salt wings. Wing Night is Monday & Wednesday night, $0.75 whole wings!\n\nThe dining area is nice. Very family friendly! The bar is very nice is well. This place is truly a Yinzer's dream!! \"Pittsburgh Dad\" would love this place n'at!!",
"type": "review",
"business_id": "UsFtqoBl7naz8AVUBZMjQQ"
}

Notice that "business_id" is in both objects. This is the field we wish to join on.

Originally I started with this answer on Stack Overflow.

So what we really want is a left join of all the data like a database would perform.

The jq software allows one to read functions out of a file, which makes the command line easier. With help from pkoppstein on the jq GitHub, we have a file called leftJoin.jq.

# leftJoin(a1; a2; field) expects a1 and a2 to be arrays of JSON objects
# and that for each of the objects, the field value is a string.
# A left join is performed on "field".
def leftJoin(a1; a2; field):
# hash phase:
(reduce a2[] as $o ({}; . + { ($o | field): $o } )) as $h2
# join phase:
| reduce a1[] as $o ([]; . + [$h2[$o | field] + $o ])|.[];

leftJoin( $file2; $file1; .business_id)

In this code, the last line is what passes in the variables for the file names and sets the key used for the join to .business_id. Within the reduce commands, it turns the lists of objects into JSON arrays, looks up each object's business_id, and concatenates the two matching JSON objects together with the plus (+) sign. The final "|.[]", just before the semicolon that ends the definition, turns the JSON array back into a stream of JSON objects, because the Ryft appliance only reads JSON as lists of objects.

If any fields are identically named, the jq code will use the value from file2; in jq, {"type": "business"} + {"type": "review"} evaluates to {"type": "review"}, since the right-hand operand wins. So because both files have the field "type", every record in the new data file will have type = review.

To run this we use the command line of:
jq -nc --slurpfile file1 yelp_business.json --slurpfile file2 yelp_review.json -f leftJoin.jq > yelp_bus_review.json

This command takes the two files and passes them to jq to do the work of joining them together, writing out a new file called yelp_bus_review.json. It may take a long time to run depending on the size of your files; I ended up with a 4.8 gigabyte file when finished. Here are two rows in the new file:

{"business_id":"UsFtqoBl7naz8AVUBZMjQQ","full_address":"202 McClure St\nDravosburg, PA 15034","hours":{},"open":true,"categories":["Nightlife"],"city":"Dravosburg","review_count":5,"name":"Clancy's Pub","neighborhoods":[],"longitude":-79.8868138,"state":"PA","stars":4,"latitude":40.3505527,"attributes":{"Happy Hour":true,"Accepts Credit Cards":true,"Good For Groups":true,"Outdoor Seating":false,"Price Range":1},"type":"review","votes":{"funny":0,"useful":0,"cool":0},"user_id":"JPPhyFE-UE453zA6K0TVgw","review_id":"mjCJR33jvUNt41iJCxDU_g","date":"2014-11-28","text":"Cold cheap beer. Good bar food. Good service. \n\nLooking for a great Pittsburgh style fish sandwich, this is the place to go. The breading is light, fish is more than plentiful and a good side of home cut fries. \n\nGood grilled chicken salads or steak. Soup of day is homemade and lots of specials. Great place for lunch or bar snacks and beer."}
{"business_id":"UsFtqoBl7naz8AVUBZMjQQ","full_address":"202 McClure St\nDravosburg, PA 15034","hours":{},"open":true,"categories":["Nightlife"],"city":"Dravosburg","review_count":5,"name":"Clancy's Pub","neighborhoods":[],"longitude":-79.8868138,"state":"PA","stars":2,"latitude":40.3505527,"attributes":{"Happy Hour":true,"Accepts Credit Cards":true,"Good For Groups":true,"Outdoor Seating":false,"Price Range":1},"type":"review","votes":{"funny":0,"useful":0,"cool":0},"user_id":"pl78RcFgklDns8atQegwVA","review_id": "kG7wxkBu62X6yxUuZ5IQ6Q","date":"2016-02-24","text":"Possibly the most overhyped establishment in Allegheny County. If you're not a regular, you will be ignored by those who're tending bar. Beer selection is okay, the prices are good and the service is terrible. I would go here, but only if it was someone else's idea."}

Now we have one large flat file instead of two. This will allow for quicker searches as we do not have to join the data together in the middle of the search.

Please follow us on our website at https://volumeintegration.com and on Twitter at volumeint.