Digital Cherries by Bradley Johnson

JSON Joins - jq

Like a bunch of json objects being manipulated. Digital Cherries by Bradley Johnson

Digital Cherries by Bradley Johnson https://art.wowak.com like little JSON objects

Manipulation of JSON files is an interesting challenge but it is much easier than trying to manipulate XML files. I was given two files that were lists of JSON objects. The two files had a common key that could be used to join the files together called business_id. The object was to flatten out the two files into one file.

Once they are in one large file we will be using a system called Ryft to search the data. Ryft is an FPGA (Field Programmable Gate Array) system that searches data very quickly with special capabilities for performing fuzzy searches with a large edit distance. I am hypothesizing that the Ryft will work better on flat data instead of needing to perform a join between two tables.

The files are from the Yelp dataset challenge. We will use the program called jq to join the data. The data files are yelp_business.json and yelp_review.json

The file yelp_business.json has this format below and each record / object is separated by a hard return. This file is 74 Megabytes. Note: I added hard returns between the fields to make it easier to read.
{
"business_id": "UsFtqoBl7naz8AVUBZMjQQ",
"full_address": "202 McClure St\nDravosburg, PA 15034",
"hours": {},
"open": true,
"categories": ["Nightlife"],
"city": "Dravosburg",
"review_count": 5,
"name": "Clancy's Pub",
"neighborhoods": [],
"longitude": -79.8868138,
"state": "PA",
"stars": 3.0,
"latitude": 40.3505527,
"attributes": {"Happy Hour": true, "Accepts Credit Cards": true, "Good For Groups": true, "Outdoor Seating": false, "Price Range": 1},
"type": "business"
}

The file yelp_review.json has this format below and is also separated by hard returns. Now for each business there are many reviews. So if a business has five reviews the final output will contain five rows of data for that one business. This file is 2.1 Gigabytes

{
"votes": {"funny": 0, "useful": 0, "cool": 0},
"user_id": "uK8tzraOp4M5u3uYrqIBXg",
"review_id": "Di3exaUCFNw1V4kSNW5pgA",
"stars": 5,
"date": "2013-11-08",
"text": "All the food is great here. But the best thing they have is their wings. Their wings are simply fantastic!! The \"Wet Cajun\" are by the best & most popular. I also like the seasoned salt wings. Wing Night is Monday & Wednesday night, $0.75 whole wings!\n\nThe dining area is nice. Very family friendly! The bar is very nice is well. This place is truly a Yinzer's dream!! \"Pittsburgh Dad\" would love this place n'at!!",
"type": "review",
"business_id": "UsFtqoBl7naz8AVUBZMjQQ"
}

Notice that “business_id” is in both objects. This is the field we wish to join with.

Originally I started with this answer on OpenStack.

So what we really want is a left join of all the data like a database would perform.

The jq software one to read functions out of a file to make the command line easier. With help from pkoppstein at the jq github we have a file called leftJoin.jq.

# leftJoin(a1; a2; field) expects a1 and a2 to be arrays of JSON objects
# and that for each of the objects, the field value is a string.
# A left join is performed on "field".
def leftJoin(a1; a2; field):
# hash phase:
(reduce a2[] as $o ({}; . + { ($o | field): $o } )) as $h2
# join phase:
| reduce a1[] as $o ([]; . + [$h2[$o | field] + $o ])|.[];

leftJoin( $file2; $file1; .business_id)

Based on this code the last line is what passes in the variables for the file names and sets the key used to join to be business_id. With in the reduce commands it is turning the lists of objects into json arrays and then finding the field of business_id and concatenating the two json objects together with the (+) plus sign. The final command “|.[]” at the end where the semicolon finalizes the command is used to turn the json array back into a stream of json list objects. The Ryft appliance only reads in json as lists of objects.

If there are any fields that are identically named the jq code will use the one from file2. So because both files have the field of “type” the new data file will be type = review.

To run this we use the command line of:
jq -nc --slurpfile file1 yelp_business.json --slurpfile file2 yelp_review.json -f leftJoin.jq > yelp_bus_review.json

As a result this command takes the two files and passes them to the jq commands to do the work of joining them together. It will write out a new file called yelp_bus_review.json. It may take a long time to run depending on the size of your files but I end up with a 4.8 Gigabyte file when finished. Here are two rows in the new file:

{"business_id":"UsFtqoBl7naz8AVUBZMjQQ","full_address":"202 McClure St\nDravosburg, PA 15034","hours":{},"open":true,"categories":["Nightlife"],"city":"Dravosburg","review_count":5,"name":"Clancy's Pub","neighborhoods":[],"longitude":-79.8868138,"state":"PA","stars":4,"latitude":40.3505527,"attributes":{"Happy Hour":true,"Accepts Credit Cards":true,"Good For Groups":true,"Outdoor Seating":false,"Price Range":1},"type":"review","votes":{"funny":0,"useful":0,"cool":0},"user_id":"JPPhyFE-UE453zA6K0TVgw","review_id":"mjCJR33jvUNt41iJCxDU_g","date":"2014-11-28","text":"Cold cheap beer. Good bar food. Good service. \n\nLooking for a great Pittsburgh style fish sandwich, this is the place to go. The breading is light, fish is more than plentiful and a good side of home cut fries. \n\nGood grilled chicken salads or steak. Soup of day is homemade and lots of specials. Great place for lunch or bar snacks and beer."}
{"business_id":"UsFtqoBl7naz8AVUBZMjQQ","full_address":"202 McClure St\nDravosburg, PA 15034","hours":{},"open":true,"categories":["Nightlife"],"city":"Dravosburg","review_count":5,"name":"Clancy's Pub","neighborhoods":[],"longitude":-79.8868138,"state":"PA","stars":2,"latitude":40.3505527,"attributes":{"Happy Hour":true,"Accepts Credit Cards":true,"Good For Groups":true,"Outdoor Seating":false,"Price Range":1},"type":"review","votes":{"funny":0,"useful":0,"cool":0},"user_id":"pl78RcFgklDns8atQegwVA","review_id": "kG7wxkBu62X6yxUuZ5IQ6Q","date":"2016-02-24","text":"Possibly the most overhyped establishment in Allegheny County. If you're not a regular, you will be ignored by those who're tending bar. Beer selection is okay, the prices are good and the service is terrible. I would go here, but only if it was someone else's idea."}

Now we have one large flat file instead of two. This will allow for quicker searches as we do not have to join the data together in the middle of the search.

Please follow us on our website at https://volumeintegration.com and on twitter at volumeint

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published. Required fields are marked *