
Volume Analytics Chaos Control


Volume Analytics is a software tool used to build, deploy, and manage data processing applications. It is a scalable data management platform for streaming large volumes of varied data at high velocity, allowing rapid ingest, transformation, and loading of high-volume data into multiple analytic models as defined by your requirements or your existing data models.

Volume Analytics enables both rapid software development and operational maintainability, with scalability for high data volumes. It can be used for all of your data mining, fusion, extraction, transform, and loading needs. Volume Analytics has been used to mine and analyze social media feeds, monitor and alert on insider threats, and automate the search for cyber threats. It is also being used to consolidate data from many data sources (databases, HDFS, file systems, data lakes) and to produce multiple data models for multiple data analytics visualization tools. It could likewise consolidate sensor data from IoT devices or monitor a SCADA industrial control network.

Volume Analytics makes it easy to quickly develop highly redundant software that is both scalable and maintainable. In the end you save money on labor for the development and maintenance of systems built with Volume Analytics.

In other words, Volume Analytics provides the plumbing of a data processing system. The application you are building has distinct units of work that need to be done. We might compare it to a water treatment plant: dirty water comes into the system through a pipe and reaches a large contaminant filter. The filter is a work task and the pipe is a topic. Together they make a flow.

After the first filter, another pipe carries the water, minus the dirt, to another purification worker. In the water plant there is a dashboard where managers monitor the system to see if they need to fix something or add more pipes and cleaning tasks.

Volume Analytics provides the pipes, a platform to run the worker tasks and a management tool to control the flow of data through the system.

A Volume Analytics Flow for Finding Social Media Bots


In addition, Volume Analytics has redundancy for disaster recovery, high availability, and parallel processing. This is where our analogy fails: data is duplicated across multiple topics. The failure of a particular topic (pipe) does not destroy any data because it is preserved on another topic. Topics are optimally set up in multiple data centers to maintain high availability.

In Volume Analytics, the water filters in the analogy are called tasks. Tasks are groups of code that perform some unit of work. Your specific application will have its own tasks. The tasks are deployed on more than one server in more than one data center.
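Volume Analytics' own APIs are not shown in this post, so purely as a conceptual sketch (not Volume Analytics code), the task-and-topic pattern can be modeled in plain Scala with in-memory queues standing in for topics:

import java.util.concurrent.LinkedBlockingQueue

// A topic is the pipe: a queue that carries records between tasks.
// (Real topics are replicated across servers and data centers; this
// in-memory stand-in only illustrates the shape of a flow.)
type Topic[A] = LinkedBlockingQueue[A]

// A task is a unit of work: read from one topic, transform, write to the next.
def runTask[A, B](in: Topic[A], out: Topic[B])(work: A => B): Thread = {
  val t = new Thread(() => while (true) out.put(work(in.take())))
  t.setDaemon(true)
  t.start()
  t
}

// A tiny two-task flow: raw report lines -> extracted values -> enriched records.
val rawTopic      = new Topic[String]()
val valueTopic    = new Topic[String]()
val enrichedTopic = new Topic[(String, String)]()

runTask(rawTopic, valueTopic)(line => line.trim)           // stand-in "filter" task
runTask(valueTopic, enrichedTopic)(v => (v, "enriched"))   // stand-in "enrich" task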

Benefits

Faster start up time saves money and time.

Volume Analytics allows a faster start-up time for a new application or system being built. The team does not need to build the platform that moves the data to tasks, and they do not need to build a monitoring system, as those features are included. Volume Analytics will also integrate with your current monitoring systems.

System is down less often

The DevOps team gets visibility into the system out of the box. They do not have to stand up a log search system. So it saves time. They can see what is going on and fix it quickly.

Plan for Growth

As your data grows and the system needs to process more, Volume Analytics grows with it. Add server instances to increase processing power; as work grows, Volume Analytics allocates work to the new instances. No re-coding is needed, saving time and money because developers are not required to re-implement the code to work at a larger scale.

Less Disruptive deployments

Construct your application in a way that allows new features to be deployed with lower impact on features already in production. New code libraries and modules can be deployed to the platform and allowed to interact with the already running parts of the system without an outage. A built-in code library repository is included.

In addition currently running flows can be terminated while the data waits on the topics for the newly programmed flow to be started.

This Flow processes files to find IP addresses, searches multiple APIs for matches and inserts data into a HANA database


A threat search data processing flow in production. Each box is a task that performs a unit of work. The task puts the processed data on the topic represented by the star; then the next task picks up the data and does another part of the job. The combination of a set of tasks and topics is a flow.

Geolocate IP Flow


An additional flow to geolocate IP addresses, added while the first flow is running.

Combined Flows


The combination of flows working together. The topic ip4-topic is an integration point.

Modular

Volume Analytics is modular and tasks are reusable. You can reconfigure your data processing pipeline without introducing new code. You can use tasks in more than one application.

Highly Available

Out of the box, Volume Analytics is highly available due to its built-in redundancy. Work tasks and topics (pipes) run in triplicate. As long as your compute instances are in multiple data centers, you have redundancy built in. Volume Analytics knows how to balance the data between duplicates and avoid data loss if one or more work tasks fail; this extends to queuing up work if all work tasks fail.

Integration

Volume Analytics integrates with other products. It can retrieve data from and save data to other systems such as topics, queues, databases, file systems, and data stores, and these integrations happen over encrypted channels.

In our sample application, CyberFlow, many tasks integrate with other systems. The read-bucket task reads files from an AWS S3 bucket, the ThreatCrowd task calls the API at https://www.threatcrowd.org, and the Honeypot task calls https://www.projecthoneypot.org. The insert tasks then integrate with the SAP HANA database used in this example.

Volume Analytics integrates with your enterprise authentication and authorization systems like LDAP, ActiveDirectory, CAP and more.

Data Management

Volume Analytics ingests datasets from throughout the enterprise, tracking each delivery and routing it through the platform to extract the greatest benefit. It shares common capabilities such as text extraction, sentiment analysis, categorization, and indexing, and a series of services makes those datasets discoverable and available to authorized users and other downstream systems.

Data Analytics

In addition to the management console, Volume Analytics comes with a notebook application. This allows a data scientist or analyst to discover data and turn it into information on reports. After your data is processed by Volume Analytics and put into a database, the notebook can be used to visualize the data: sliced, diced, and displayed on graphs, charts, and maps.

Volume Analytics Notebook

Flow Control Panel


The Flow control panel allows for control and basic monitoring of flows. Flows are groupings of tasks and topics working together. You can stop, start, and terminate flows, and launch additional flow virtual machines from this screen when there is a heavy load of data processing work. The panel also gives access to start up extra worker tasks as needed, and there is a link that lets you analyze the logs in Kibana.

Topic Control Panel


The topic control panel allows for the control and monitoring of topics. Monitor and delete topics from here.

Consumer Monitor Panel


The consumer monitor panel allows for the monitoring of consumer tasks. Consumer tasks are the tasks that read from a topic. They may also write to a topic. This screen will allow you to monitor that the messages are being processed and determine if there is a lag in the processing.

Volume Analytics is used by our customers to process data from many data streams and data sources quickly and reliably. In addition, it has enabled the production of prototype systems that scale up into enterprise systems without rebuilding and re-coding the entire system.

And now this tour of Volume Analytics leads into a video demonstration of how it all works together.

Demonstration Video

This video further describes the features of Volume Analytics using an example application which parses IP addresses out of incident reports and searches other systems for indications of those IP addresses. The data is saved into a SAP HANA database.

Request a Demo Today

Volume Analytics is scalable, fast, maintainable and repeatable. Contact us to request a free demo and experience the power and efficiency of Volume Analytics today.

Contact

Future Fear by BardIonson.com

Countering Computational Propaganda


There is something new happening in computer science and social media: computational propaganda. Computational propaganda is the use of computer information systems for political purposes, according to the journal Big Data. It also includes the efforts of governments to influence public opinion in another country in order to change that country's foreign relations and policy, or to turn citizens against their own government.

Countering Propaganda

The first and most important step is to realize that propaganda is real, and then to admit that propaganda does impact the thought processes of citizens. So I will start with the first step of describing computational propaganda, and then ways to computationally identify it.

Elements of Computational Propaganda

There are various elements to computational propaganda that I will attempt to outline. Some of the elements include: algorithms, automation, human curation, artificial intelligence, social media bots, sock puppets, troll farms, cyber attacks and stolen information, disinformation, and data science.

This new generation of propaganda has gone through a process of computerized automation. It is not totally automated, but computers, networks, and the internet have made it possible to deliver it in an automated manner. They also enable delivery in a personal manner, in a way that makes it seem to come from a real human. Technology is also making strides in the automatic production of propaganda.

Previously propaganda was slower moving and required getting news and editorials published. One used to have to own or control media outlets and then put people in place to spread disinformation. Our new modern way of spreading information online with social media has lowered the cost of spreading propaganda.

What is propaganda?

Propaganda is not just pure lies and conspiracy theories. It has come a long way from the Nazi and communist USSR modes of operation that might seem obvious to us now. Current propaganda is infused with truth, although it is also partly false or taken out of context in ways hidden from individual audiences. This information can now be tailored and targeted at individuals. Often it is geared to lead people to action, as all good propaganda should.

Jacques Ellul, in Propaganda: The Formation of Men's Attitudes, says that it is difficult to determine what propaganda is because it is a "secret action". Ellul found it impossible to pin down an exact definition of propaganda that did not take up an entire book. Our society has propaganda baked in that we are not aware of. In short, propaganda for Ellul covered the following areas:

  • Psychological Action – seeks to modify opinions by pure psychological methods
  • Psychological Warfare – destroy the morale of an adversary so that the opponent begins to doubt the validity of his beliefs and actions
  • Brainwashing – only used on prisoners to turn enemies to allies
  • Public and Human Relations – These group activities are also propaganda as they seek to adapt a person to a society. As we will see foreign governments use this technique in social media today.

Ellul says: “modern propaganda is based on scientific analysis of psychology and sociology… the propagandist builds his techniques on the basis of his knowledge of man” Then the effects of the propaganda are measured for results and refined.

These thoughts written in 1956 have only continued to be refined. They are now at work on the internet and in social media. And now there is a faster way to measure the results with computers and tracking of online activity.

In reality, there are competing propaganda efforts happening online. There is democratic propaganda that competes with anti-democratic strains and militant Islamic ones as well.

Computational Propaganda in Action

But to take a step back I want to outline what this looks like online at the moment as I understand it.

2016

Computational propaganda burst into public awareness during the 2016 US presidential election and continues today. This effort has been carried out by Russian actors over many years, and I will outline some of the features that are computational in nature. The entire propaganda system is co-dependent but attempts to appear as disconnected entities.

Recently it has been reported that Russian-based accounts on Facebook and Twitter have been spreading propaganda to divide the American public. They used Facebook to support both sides of the Black Lives Matter protests and to promote gun rights and anti-immigration sentiment. This has also been observed on Reddit and Twitter. Some accounts were bots or semi-automated bots, and other accounts were used to purchase targeted advertisements.

Cyber War

One element was stealing information through cyber attacks and social engineering. This exhibited itself in the taking of information from the Democratic National Committee and the use of classified leaks given to WikiLeaks.

Personal information on American citizens has been stolen by foreign intelligence services. Security questionnaires were taken by Chinese services, and recently credit information was appropriated, in addition to voter rolls from an unknown number of states during the elections. This personal information is alleged to have enabled more precise targeting of propaganda at specific populations.

Propaganda Generation

This information was then selectively transformed into propaganda by taking it out of context and targeting it at select audiences. I can assume that the producers of the propaganda used computer software to search the massive amounts of information that were stolen, then edited the most damaging of it for maximum effect on specific audiences.

Targeting Propaganda

This was also the case for the efforts of the Trump campaign, which took comments damaging to an opponent and targeted voter suppression ads at specific people. Social media ad networks and databases were used in these efforts. These systems allow anyone to target a message by thousands of different personal identifiers, locations, and income brackets. This can be done on Facebook and Twitter.

Targeting Ad on Facebook


The advertisement above is being targeted at people in New York City who engage with liberal political content and have a net worth between $1 million and $2 million.

Personal Propaganda

In addition, there are stores of personal data that were correlated with computational algorithms to determine specific personality traits of individuals. Using this information, messaging was crafted and targeted at the psychological vulnerabilities of individuals to change their thinking or push them to action (source).

The Troll Factory

On top of this paid targeting there are places called troll factories, where armies of people engage in social interactions on social media and blogs. There is one of these in St. Petersburg called The Factory. People will engage with online content to discredit it using propaganda techniques such as "whataboutism", where people contest valid facts by pointing out a perceived hypocrisy on the other side of the issue. They will also attempt to generate fake news to cause panic in a particular community.

Bots and Sock Puppets

These efforts combine with the computational techniques of online bots, trolls, and sock puppets. Sock puppets are fake online personas. They appear to be real people, often posing as Americans, but they just broadcast propaganda.

Bots are pieces of software that perform online tasks. Some just pick up other people's messages online and rebroadcast them. Other bots are more human-like and engage in conversations. They often search for keywords in conversations and generate a "whataboutism" or some disparaging message to confuse readers. There are also semi-automated bots: once these are challenged as being bots, or face a question they cannot respond to, a human will intervene and manually provide responses.

Bot networks

These bots and sock puppet accounts often work as a team. They will gang up on conversations and rebroadcast each other's messages. The goal is to make their message seem mainstream through its volume or popularity. They can confuse rational conversations and arguments online by misdirecting them and triggering emotional responses.

Suppression of Speech Bots

Recently bots have been suppressing messaging from people who are attempting to counter disinformation. In the case of Brian Krebs, who was attacking the messaging of bots that support Putin, they tricked Twitter into disabling Brian's account. They did this by causing thousands of accounts to follow him and then retweet his tweets en masse. This caused Twitter to automatically assume that Brian was paying bots to promote his account, so it was turned off.

Artificial Intelligence

Artificial intelligence and machine learning are additional computational techniques being weaponized in this information battle. Not only is AI used to automate bots so they appear human, it can also be used to create messaging content. Artificial intelligence processes source documents or training data, then programmers configure the system to output new messages. It multiplies the efforts of a human to generate new content. This output is then curated or even tested out online, which validates the effectiveness of the content. This feedback loop is used to create more effective triggers for people.

Attacking the Disenfranchised

Often these efforts to trigger action leverage marginalized groups in society. The bots and troll factories can take domestic content and amplify it for their own purposes. It has been shown that Russian-based bots often rebroadcast messages that attempt to deepen divisions among US citizens. Hamilton68 illustrates these efforts.

This is a dashboard that tracks known Russian bots and exposes what they are promoting. Often this is anything that breeds mistrust in the US government and pits groups against each other.

Some countries also invest in traditional media like newspapers, radio, and television stations to broadcast messages. They attempt to make this look like real news, but it is actually disinformation and propaganda. This "news" is picked up as legitimate by other news outlets and rebroadcast. People who buy into the messaging will use the source as proof of their opinions.

Conspiracy Theories

Propagandists often use currently circulating conspiracy theories to oppose competing messages or true news. They neutralize the opposition by coming up with a secret conspiracy or amplifying one already in circulation.

Live events

Recently The Daily Beast reported that Russian operatives organized and promoted rallies on Facebook. This illustrates the purpose of propaganda, which is to move people to action. Once they are acting on the beliefs pushed by the propaganda, it can tip into political action.

Politics

At the moment, propaganda from Russia seems aimed at changing the foreign policy of the US government. The Russians have secured the information space of their citizens in an authoritarian way that is not acceptable in American society, and they have leveraged the lack of privacy controls in the American capitalist system, where information about people is sold for marketing purposes. It seems that Russian propaganda efforts are aimed only at the destruction of democracy or western values, according to Chris Zappone of The Age.

Computational Countermeasures

There are hundreds of ideas on how to counter this online propaganda. Some are government policies, industry self-policing, and educational programs. But I want to focus on countermeasures that are computational: in effect, an attempt to fight fire with fire.

Inoculation

One way to minimize the impact of propaganda is to have tools that alert individuals when they are being targeted. Another type of computational tool is one that allows a community to monitor what others in their community are being targeted with. This can prevent individual weaknesses from being exploited. There are efforts underway in this space, but there are opportunities for continual improvement.

Current concepts

Ideas from Volume

  • A dark advertisement exposure network. Volunteers install browser agents to gather ads and put them in a public searchable database along with the targeting criteria. Fake personas could also be used to collect advertisements.
  • Public searchable databases of bots and sock puppets identified by computational techniques such as time-of-day analysis, linguistic patterns, and rebroadcasting behaviours (a sketch of the time-of-day approach follows this list).
  • The bot collection database would also hold the relationships between accounts, messages, and motives.
  • A computational methods software package to identify bots that pretend to be human and out them as bots on social media.
  • A browser plug-in that shows a user the motives of a bot and exposes a network of bots that link to or help each other. It enables a person to ignore and discount ideas coming from that entity.
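As a rough illustration of the time-of-day idea above, here is a minimal sketch in Scala; the Post record, the field names, and the thresholds are hypothetical and not tied to any particular platform's API:

import java.time.{Instant, ZoneOffset}

// Hypothetical record of a single post by a single account.
case class Post(account: String, timestamp: Instant)

// Flag accounts with heavy activity spread across nearly every hour of the day,
// a pattern that is difficult for a single human to sustain.
def suspiciousByTimeOfDay(posts: Seq[Post],
                          minPosts: Int = 200,
                          minActiveHours: Int = 22): Set[String] =
  posts.groupBy(_.account).collect {
    case (account, accountPosts)
        if accountPosts.size >= minPosts &&
           accountPosts.map(_.timestamp.atZone(ZoneOffset.UTC).getHour).toSet.size >= minActiveHours =>
      account
  }.toSet

Real systems would combine several signals (linguistic patterns, rebroadcast ratios, account age) rather than relying on posting hours alone.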

In a blog post, Colin Crowell, Twitter's vice president of public policy, government and philanthropy, said that Twitter itself is a "powerful antidote to the spreading of all types of false information" and that "journalists, experts and engaged citizens Tweet side-by-side correcting and challenging public discourse in seconds."

The issue with this is that the bots generate information programmatically using artificial intelligence or have armies of cheap labor behind them. How can citizens and journalists keep up with researching and debunking half-truths, obvious lies, and nonsense?

Perhaps we need to build counter-bot networks; of course, they would need to fit within the social media companies' terms of service.

Bot Combat

(more ideas from Volume Integration)

  • Bots that disrupt bot networks by sending trigger words to them and keep them busy and away from meaningful conversations.
  • Bots that look for disinformation and hoaxes and broadcast information to debunk it
  • Artificial intelligence social bots that can automatically research messaging from propaganda bots and counter the messaging
  • Crafting fact checking messaging to target back at organizations running troll factories and bot networks
  • A force of vetted volunteers that could perform analysis tasks to find bots and propagandists and then write counter arguments to them

Analysis

Of course, this rests on analysis, so we would need tools to visualize the data and support the human effort needed to find the relationships and motives of the actors. Some of this effort would use the many algorithms already designed to detect bots and online propaganda. In addition, Volume Integration has a tool that can help monitor and alert on social media account activity and messaging. See our products.

At the end of the post is a list of papers on methods and analysis techniques to find automated bots online.

Prevention (CyberSecurity)

One factor in recent propaganda has been the ability of bad actors to obtain classified or private information. In this case, better cybersecurity is needed. Sometimes the information is gathered via social engineering: outside actors manipulate a person inside an organization into providing the information or access to the computer systems.

Ideas from Volume

  • Separate internal corporate networks from the internet
  • Increase Cybersecurity methods and policies (patching schedules, inventory control, multiple factor authentication, firewalls, packet inspection, audits)
  • Trust / Risk Verification Systems like Volume Analytics which monitor events on a computer network to alert on unauthorized or risky behaviour.

Conclusion

I am afraid that there is no real conclusion. We are at the beginning of our ability to counter computational propaganda. It is going to be an arms race as tactics and systems change; technology and technique will breed more technology and technique. I hope we are able to separate out the false information and come a bit closer to truth. In the end, it will be the next phase of human conflict and manipulation to gain power and wealth over personal freedom.

Contact us if you want to know more about our work or follow us on Twitter, LinkedIn or Facebook.

Current Studies on Automating Analysis


Ryft and Apache Zeppelin

Ryft

Ryft is an FPGA (field-programmable gate array) appliance for hosting and searching data quickly. In this post I will show one way to connect Apache Zeppelin to it for data analysis using Scala code. Previously I showed how to connect Apache Zeppelin to SAP HANA.

The Ryft can quickly search structured and unstructured data without needing to build an index. This ability is attributed to the FPGA, which can filter data on demand; it uses four internal FPGA modules to process the data at search time. Other types of search systems, like Elasticsearch, Solr, Lucene, or a database, have to build and store an index of the data. Ryft operates without an index.

Ryft Speed Comparison


I have populated my Ryft with a cache of data from Enron: a dump of Enron emails obtained from Carnegie Mellon. This was as simple as uploading the files to the Ryft and running a command like this:

ryftutil -copy "enron*" -c enron_email -a ryft.volumeintegration.com:8765

In the Zeppelin interface I will be able to search for keywords or phrases in the email files and display them. The size of the enron e-mail archive is 20 megabytes.

Ryft One Appliance


Apache Zeppelin

Apache Zeppelin is an open source web notebook that allows a person to write code in many languages to manipulate and visualize data.

Apache Zeppelin with Volume Analytics Interface


To make Apache Zeppelin work with Ryft, I installed Apache Zeppelin onto the Ryft appliance and connected the Spark Ryft Connector jar found at this git project (or download a prebuilt jar).

Follow the directions provided at the spark-ryft-connector project to compile the jar file; I compiled it on my local desktop computer. Place the spark-ryft-connector jar file onto the Ryft machine. I did run into one issue that was not documented: the Ryft connector was not working properly, giving the error "java.lang.NoClassDefFoundError: org/apache/spark/Logging".

I resolved the issue by downloading spark-core_2.11-1.5.2.logging.jar from https://raw.githubusercontent.com/swordsmanliu/SparkStreamingHbase/master/lib/spark-core_2.11-1.5.2.logging.jar and putting it in the zeppelin/interpreter/spark/dep directory.

Now you can create a note in Zeppelin. I am using the Spark interpreter which allows you to write the code in Scala.

First you have to make sure Zeppelin can use the ryft code in the jar file. Make a dependency paragraph with this code:

%dep
z.reset()
z.load("/home/ryftuser/spark-ryft-connector-2.10.6-0.9.0.jar")

Ryft Query

Now make a new paragraph with the code to make form fields and run the Ryft API commands to perform a search. Figuring these queries out takes a detailed study of the documentation.

These are the commands to prepare and run the query. I show a simple search, a fuzzy Hamming search, and a fuzzy edit distance search. The Ryft can perform very fast fuzzy searches with wide edit distances because no index is being built.

Simple query:

queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
query = SimpleQuery(searchFor.toString)

Hamming query:

queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte, fhs)

Edit distance query:

queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte)

The search:

var searchRDD = sc.ryftRDD(Seq(query), queryOptions)

This produces an RDD that can be manipulated to view the contents using code like the example below.

searchRDD.asInstanceOf[RyftRDD[RyftData]].collect.foreach { ryftData =>
   println(ryftData.offset)
   println(ryftData.length)
   println(ryftData.fuzziness)
   println(ryftData.data.replace("\n", " "))
   println(ryftData.file)
}
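Because the larger example below also imports spark.implicits and the Spark SQL types, one natural extension is to load these results into a DataFrame for ad hoc queries. A minimal sketch, assuming searchRDD was built as above and that a Spark 2.x SparkSession named spark is available in the notebook:

import com.ryft.spark.connector.domain.RyftData
import com.ryft.spark.connector.rdd.RyftRDD
import spark.implicits._

// Map each RyftData record to a tuple and give the columns friendly names.
val resultsDF = searchRDD.asInstanceOf[RyftRDD[RyftData]]
  .map(r => (r.file, r.offset, r.length, r.data.replace("\n", " ")))
  .toDF("file", "offset", "length", "snippet")

resultsDF.show(10, truncate = false)
resultsDF.createOrReplaceTempView("ryft_results")   // now queryable from a %sql paragraph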

The Result in Zeppelin

Result of Searching Ryft with Zeppelin


In addition I have included code that allows the user to click on Show File to see the original e-mail with the relevant text highlighted in bold.

Results in Bold

In order for Apache Zeppelin to display the original email, I had to give it access to the part of the filesystem where the original emails were stored. Ryft uses a catalog of the emails to perform searches, as it performs better when searching fewer, larger files than many smaller ones. The catalog feature allows it to combine many small files into one large file.

The search results return a filename and offset which Apache Zeppelin uses to retrieve the relevant file and highlight the appropriate match. 
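For reference, the highlighting can also be done directly with substring operations on that offset and length. A minimal sketch, assuming (as the full example below does) that the offset indexes into the text of the original file:

import scala.io.Source

// Wrap the matched region in <b> tags using the offset and length from RyftData.
def highlight(path: String, offset: Int, length: Int): String = {
  val source = Source.fromFile(path)
  val text = try source.mkString finally source.close()
  val end = math.min(offset + length, text.length)
  (text.substring(0, offset) + "<b>" + text.substring(offset, end) + "</b>" +
    text.substring(end)).replace("\n", "<br>")
}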

In the end, Ryft found all instances of the name Mohammad, with various spelling differences, in 0.148 seconds in a dataset of 30 megabytes. When I performed the same search on 48 gigabytes of data, it ran in 5.89 seconds; 94 gigabytes took 12.274 seconds, and 102 gigabytes took 13 seconds. These are just quick sample numbers using dumps of many files; performance could likely be improved by consolidating small files into catalogs.

Zeppelin Editor

The code is edited in Zeppelin itself.

Code in Zeppelin


You edit the code in the web interface, but Zeppelin can hide it once you have the form fields working. Here is the part of the code that produces the form fields:

 val searchFor = z.input("Search String", "mohammad")
 val distance = z.input("Search Distance", 2)
 var queryType = z.select("Query Type", Seq(("1","Simple"),("2","Hamming"),("3","Edit Distance"))).toString
 var surrounding = z.input("Surrounding", "line")

So in the end we end up with the following code.

%spark
import com.ryft.spark.connector._
import com.ryft.spark.connector.domain.RyftQueryOptions
import com.ryft.spark.connector.query.SimpleQuery
import com.ryft.spark.connector.query.value.{EditValue, HammingValue}
import com.ryft.spark.connector.rdd.RyftRDD
import com.ryft.spark.connector.domain.{fhs, RyftData, RyftQueryOptions}
import scala.language.postfixOps
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import scala.io.Source

def isEmpty(x: String) = x == null || x.isEmpty
  var queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
  val searchFor = z.input("Search String", "mohammad")
  val distance = z.input("Search Distance", 2)
  var queryType = z.select("Query Type",("2","Hamming"), Seq(("1","Simple"),("2","Hamming"),("3","Edit Distance"))).toString
  var surrounding = z.input("Surrounding", "line")
  var query = SimpleQuery(searchFor.toString)

 
  if (isEmpty(queryType)) {
      queryType = "2"
  }

  if (queryType.toString.toInt == 1) {
        println("simple")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, 0 toByte)
        }
        query = SimpleQuery(searchFor.toString)

  } else if (queryType.toString.toInt ==2) {
        println("hamming")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte, fhs)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte, fhs)
        }
  } else {
        println("edit")
        if (surrounding == "line") {
            queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte)
        } else {
            queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte)
        }
  }

  var searchRDD = sc.ryftRDD(Seq(query), queryOptions)
  var count = searchRDD.count()

  print(s"%html <h2>Count: $count</h2>")

  if (count > 0) {
        println(s"Hamming search RDD first: ${searchRDD.first()}")
        println(searchRDD.count())
        print("%html <table>")
        print("<script>")
        println("function showhide(id) { var e = document.getElementById(id); e.style.display = (e.style.display == 'block') ? 'none' : 'block';}")
        print("</script>")
        print("<tr><td>File</td><td>Data</td></tr>")

        searchRDD.asInstanceOf[RyftRDD[RyftData]].collect.foreach { ryftData =>
            print("<tr><td style='width:600px'><a href=javascript:showhide('"+ryftData.file+"')>Show File </a></td>")
            val x = ryftData.data.replace("\n", " ")
            print(s"<td> $x</td></tr>")
            println("<tr id="+ ryftData.file +" style='display:none;'>")
            println("<td style='width:600px'>")

            val source = Source.fromFile("/home/ryftuser/maildir/"+ryftData.file)
            var theFile = try source.mkString finally source.close()
            var newDoc = ""
            var totalCharCount = 0
            var charCount = 0
            for (c <- theFile) {
                charCount = charCount + 1
                if (totalCharCount + charCount == ryftData.offset) {
                    newDoc = newDoc+"<b>"
                } else if (totalCharCount+charCount == ryftData.offset+ryftData.length+1) {
                    newDoc = newDoc+"</b>"
                }
                newDoc = newDoc+c
            }
            print(newDoc.replace("\n", "<br>"))
            totalCharCount = totalCharCount + charCount
            println("</td>")
            println("</tr>")
        }
        print("</table>")
    }

So this should get you started on searching data with Zeppelin and Ryft. You can use this interface to experiment with the different edit distances and search queries the Ryft supports. You can also implement additional methods to search by regex, IP addresses, dates, and currency.

Please follow us on Facebook and on Twitter at volumeint.

A Better Web Server with Free SSL


While researching the best way to get conversations on our Rocket.Chat server encrypted, I ran across the most innovative web server I have seen. In our previous posts on Rocket.Chat on Raspberry Pi 2 I described how to install it all, but I left the SSL configuration until now.

I found that the easiest way to get Rocket.Chat set up with SSL is to use a second web server. The Rocket.Chat git repository had some directions on how to set up Apache, but this left the problem of getting an SSL certificate.

Caddy made this so easy. Typically you can install it with your package manager (for example, apt-get), but since I am deployed on Arch Linux on a Raspberry Pi it was more difficult. There are download packages for the major operating systems. On Arch Linux on a Raspberry Pi 2 you need to do the following; otherwise just running 'sudo pacman -S caddyserver' will do the trick.

curl -L -O https://aur.archlinux.org/cgit/aur.git/snapshot/caddy-git.tar.gz
tar -xvf caddy-git.tar.gz
cd caddy-git
pacman -S fakeroot
mv caddy-git /home/user/
makepkg -sri
chown -R user:user caddy-git/

Modify Caddyfile

chat.yourServer.com
proxy / 127.0.0.1:3000

Then to start it up run:

caddy -conf="/home/user/Caddyfile" -email yourEmail@server.com -agree

The magic added bonus is that if you have ports 80 and 443 open, Caddy will go get a Let's Encrypt SSL certificate and start running with it.

The Caddyfile directive file is very powerful and easy to configure, and it is much more flexible and understandable than Apache conf files. The proxy directive is what takes the user's page requests arriving on port 443 at Caddy and passes them through to port 3000, where Rocket.Chat is running.

I can say that Caddy is my new favorite web server after many years of using Apache and Jetty.

I am encouraged to see free SSL certificates being offered. It always seemed that the price put on encryption for web sites was out of line with the work it takes to create an SSL certificate. These certificates verify the identity of a web host and encrypt all the data being viewed on a web page. My post on entropy outlines how easy it is to generate enough random data to create certificates. Let's Encrypt provides a simple and easy way to get and manage SSL certificates.

Subscribe to this page or follow us on Twitter at volumeint to be informed when new posts from Volume Labs appear. Check out http://VolumeIntegration.com for more about Volume.

Best Entropy Generation Software for Linux


Randomness and entropy – why do computers need them? The key feature of entropy in computer information security is its unpredictability and uncertainty. The important part of keeping information in computers secure is an unpredictable key; unpredictability matters more than true randomness.

Like a safe, it is important to pick a combination that is unpredictable, one that no one can guess. What better way to find a combination that no one will guess than turning to randomness? But if your source of random numbers happens to give you a series like 222334, you should reject it and try again. It may be random, but it does not pass a level of uncertainty or entropy.

Operating systems generate a pool of entropy which is used to generate random numbers. This pool of entropy or randomness is used as a very important part of securing your computer.

With this data, secure keys, passwords, and session IDs that track unique users are generated. One important key to having encrypted data is a piece of data that no one else knows or can guess. A series of binary numbers from the entropy pool is how computers attempt to find a random, unique value that no other computer would come up with. A little bit of random data is used as a seed to generate lots of random data.

Real Hardware vs. Virtual

If you are running your operating system and software on a physical server, maintaining entropy is not really an issue with modern CPU technology; Intel built random number generation into its newer chipsets, although not everyone trusts it. But for virtual machines (VMs) it is more of a problem, since the operating system is running on a virtual or simulated machine. Operating systems may collect entropy from things like the movement of the mouse, keystrokes on the keyboard, the timing of computing operations, or other hardware like video or sound cards. The issue with a VM is that it does not have access to real hardware that behaves in a random fashion; it only has access to virtual hardware, er... software.

Sometimes you may not be able to generate encryption keys or start software like web servers until enough entropy has accumulated. In Linux this pool of randomness is typically exposed through /dev/random. If your software uses this pool, it will wait until the system has collected and computed enough quality randomness. But you can always point applications at /dev/urandom, which is good enough as long as the server has had time to generate some entropy. /dev/urandom will not block software from working, but there is potentially less randomness behind any keys generated before entropy has accumulated. Myths About urandom describes this in great detail.
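The same blocking versus non-blocking distinction shows up in application code on the JVM. A minimal Scala sketch, assuming a Linux host (the byte counts are arbitrary):

import java.io.FileInputStream
import java.security.SecureRandom

// Read 16 bytes straight from the kernel's non-blocking pool (/dev/urandom);
// this returns immediately even on a freshly booted VM.
val buf = new Array[Byte](16)
val in = new FileInputStream("/dev/urandom")
try in.read(buf) finally in.close()

// SecureRandom is also seeded from the OS entropy sources. getInstanceStrong
// typically maps to a blocking source (like /dev/random) and may stall on a
// VM that has not yet accumulated entropy.
val strong = SecureRandom.getInstanceStrong
val key = new Array[Byte](32)
strong.nextBytes(key)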

In my situation, my VM has not generated a sufficient amount of entropy for my taste at the point it boots and begins to start application servers. So, I need to find a way to get some entropy soon after boot up.

Testing for Entropy

For this experiment round I will be testing with VirtualBox running Ubuntu on a Macintosh.

On most Linux operating systems this command will show you the available entropy:

cat /proc/sys/kernel/random/entropy_avail

Note that every time you check the entropy using the command it will deduct some entropy bits from the pool.

To test for randomness I will be taking samples from the /dev/random pool with:

dd if=/dev/random of=random_output.txt count=8192

I may also sample /dev/urandom if the random generation software is slow in building entropy. This will produce a file full of random data.

ENT

The file will be tested using ENT, A Pseudorandom Number Sequence Test Program, to gauge the quality of the entropy and measure randomness. To use it, take the file from the previous dd command and give it to ent as a parameter. Ent will print out metrics used to judge the randomness and entropy of the data in the file.
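For reference, the chi-square figure ent reports is the standard goodness-of-fit statistic applied to the byte frequencies of the file: with N sample bytes, each of the 256 possible byte values is expected to occur about N/256 times, and ent reports how often a truly random file would exceed the observed statistic.

\chi^2 = \sum_{i=0}^{255} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = \frac{N}{256}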

Rngtest

I will also use rngtest (from rng-tools) to gain metrics on the throughput of the random generation tools.

cat /dev/random | rngtest -c 100

The Best Software

Now, which is the best software to use in order to keep your pool of entropy full? Well, that depends on what operating system you are using and what platform you use to host your virtual machine.

Randomsound

Randomsound works by using the low order bit of the ADC output of your sound card.

I started out with less than 100 bits of entropy, but after running randomsound (sudo randomsound -D) for hours I only had 2652 bits of entropy. It is very slow. I am fairly confident it is not able to reach the Macintosh's sound card from inside the virtual machine. Running entropy_avail after typing and mouse movements showed the entropy did increase, to 2830 bits.

So let's stop randomsound with sudo /etc/init.d/randomsound stop and try something else.

timer-entropy

This program is available at http://www.vanheusden.com/te/ where VanHeusden says:

“This program feeds the /dev/random device with entropy-data (random values) read from timers. It does this by measuring how much longer or shorter a sleep takes (this fluctuates a little – microseconds). The time for a sleep jitters due to that the frequency of the clocks of the timers change when they become colder or hotter (and a few other parameters). This program does not require any extra hardware. It produces around 500 bits per second.”

Once started, it took about 2 minutes to generate 200 bits of entropy. I let it run to 3968 bits.

I ran dd if=/dev/random of=rand_timer_entropy.txt count=8192 to get a file of random data. Output:

^C0+111 records in
21+0 records out
10752 bytes (11 kB) copied, 255.934 s, 0.0 kB/s

ent rand_timer_entropy.txt reported:

Entropy = 7.979475 bits per byte.
Optimum compression would reduce the size of this 10752 byte file by 0 percent.
Chi square distribution for 10752 samples is 306.62, and randomly would exceed this value 2.50 percent of the times.
Arithmetic mean value of data bytes is 126.7684 (127.5 = random).
Monte Carlo value for Pi is 3.154017857 (error 0.40 percent).
Serial correlation coefficient is -0.001984 (totally uncorrelated = 0.0).

From Calomel.org I learned that the closer the entropy is to 8 bits per byte, the better; 7.979475 seems good. The optimum compression figure is good because the file will not compress, which means there is no repeated data. The chi square distribution should be greater than 10% and lower than 90% according to Calomel.org; timer-entropy failed here with a value of 2.5%. Also, the arithmetic mean value is off by almost one, which points to the file being not quite random. The best data will also have a Monte Carlo value almost equal to Pi; we are off by 0.4 percent. Serial correlation should be as close to zero as possible.
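The Monte Carlo figure, in outline, treats successive groups of bytes as (x, y) coordinates inside a square and counts how many land inside the inscribed circle; for random data the hit ratio approaches pi/4, so:

\pi \approx 4 \cdot \frac{\text{points with } x^2 + y^2 \le r^2}{\text{total points}}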

HAVEGEd

haveged is based on HAVEGE, which stands for HArdware Volatile Entropy Gathering and Expansion. In summary, it works by timing the ticks of the processor between certain operations, which they call flutter. That is a gross oversimplification, but there are four pages of explanation on the haveged website.

Entropy available is at 100 bits now.

sudo /etc/init.d/haveged start

Random Data

Within 10 seconds I have 3848 bits of entropy, so I collect my file of random data, which runs in seconds. haveged is very fast at generating entropy; by the output below it looks to be around 3.8 MB/s.

dd if=/dev/random of=havegedentropy.txt count=8192
0+8192 records in
2047+1 records out
1048478 bytes (1.0 MB) copied, 0.274385 s, 3.8 MB/s
$ cat /proc/sys/kernel/random/entropy_avail
3968

And at the end the entropy pool is up to 3968 bits.

Ent reports the following:

ent havegedentropy.txt
Entropy = 7.999847 bits per byte.
Optimum compression would reduce the size of this 1048478 byte file by 0 percent.
Chi square distribution for 1048478 samples is 223.14, and randomly would exceed this value 90.00 percent of the times.
Arithmetic mean value of data bytes is 127.4428 (127.5 = random).
Monte Carlo value for Pi is 3.138109027 (error 0.11 percent).
Serial correlation coefficient is 0.000016 (totally uncorrelated = 0.0).

Second run:

Entropy = 7.999848 bits per byte.
Optimum compression would reduce the size of this 1048434 byte file by 0 percent.
Chi square distribution for 1048434 samples is 220.87, and randomly would exceed this value 90.00 percent of the times.
Arithmetic mean value of data bytes is 127.4005 (127.5 = random).
Monte Carlo value for Pi is 3.140752780 (error 0.03 percent).
Serial correlation coefficient is 0.000161 (totally uncorrelated = 0.0).

Third run:

ent havegedentropy3.txt
Entropy = 7.999779 bits per byte.
Optimum compression would reduce the size of this 1048505 byte file by 0 percent.
Chi square distribution for 1048505 samples is 321.04, and randomly would exceed this value 0.50 percent of the times.
Arithmetic mean value of data bytes is 127.6146 (127.5 = random).
Monte Carlo value for Pi is 3.137052933 (error 0.14 percent).
Serial correlation coefficient is -0.000989 (totally uncorrelated = 0.0).

Now these are better numbers. The entropy is higher and the rest of the values are closer to the goal. The chi square distribution is almost too high now: it is at 90% for the first two runs but then drops to 0.5% on the final run.

The good thing about haveged is that it generates data very quickly and works on VirtualBox. According to Tom Leek on Security Stack Exchange, haveged needs access to the rdtsc instruction to read the processor's clock with sub-nanosecond precision. On VirtualBox this instruction is not virtualized. On VMware it is, so for haveged to work you need to configure VMware with:
monitor_control.disable_tsc_offsetting=TRUE

Rngtest

Using rngtest to verify the quality of haveged yields the following results:

cat /dev/random | rngtest -c 10000
rngtest 5
Copyright (c) 2004 by Henrique de Moraes Holschuh
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
rngtest: starting FIPS tests…
rngtest: bits received from input: 200000032
rngtest: FIPS 140-2 successes: 9995
rngtest: FIPS 140-2 failures: 5
rngtest: FIPS 140-2(2001-10-10) Monobit: 1
rngtest: FIPS 140-2(2001-10-10) Poker: 0
rngtest: FIPS 140-2(2001-10-10) Runs: 3
rngtest: FIPS 140-2(2001-10-10) Long run: 1
rngtest: FIPS 140-2(2001-10-10) Continuous run: 0
rngtest: input channel speed: (min=1.322; avg=29.429; max=75.092)Mibits/s
rngtest: FIPS tests speed: (min=1.789; avg=133.068; max=171.833)Mibits/s
rngtest: Program run time: 7925437 microseconds

Second run

cat /dev/random | rngtest -c 10000
rngtest 5
Copyright (c) 2004 by Henrique de Moraes Holschuh
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

rngtest: starting FIPS tests…
rngtest: bits received from input: 200000032
rngtest: FIPS 140-2 successes: 9987
rngtest: FIPS 140-2 failures: 13
rngtest: FIPS 140-2(2001-10-10) Monobit: 2
rngtest: FIPS 140-2(2001-10-10) Poker: 2
rngtest: FIPS 140-2(2001-10-10) Runs: 3
rngtest: FIPS 140-2(2001-10-10) Long run: 6
rngtest: FIPS 140-2(2001-10-10) Continuous run: 0
rngtest: input channel speed: (min=1.537; avg=28.381; max=19073.486)Mibits/s
rngtest: FIPS tests speed: (min=1.803; avg=137.853; max=171.833)Mibits/s
rngtest: Program run time: 8116055 microseconds

Out of 20,000 tests I had only 18 failures.

rngd

rng-tools and its rngd command are not tools for generating entropy. rngd is a program that takes randomness from a true random hardware device and puts it into /dev/random. These devices include OneRNG and TrueRNG.

OneRNG, a true random number generation hardware device

But we can use rngd to insert our own data file or data stream into /dev/random with this command:

sudo rngd -f -r [random data file]

Confirmed Random File

To see if we can get better randomness, I will try to load data from RNG Research into our entropy pool. The data is generated from an outside source that is not a computer. They have tested the data with a tool called diehard and documented the entropy and randomness of the files. I am downloading block0.rng. Do not do this if you really want to use /dev/random to generate keys: this data is no longer unpredictable since it is public on the internet.

$ cat /proc/sys/kernel/random/entropy_avail
67
$ sudo rngd -r Downloads/block0.rng
$ cat /proc/sys/kernel/random/entropy_avail
3164
$ dd if=/dev/random of=randfile.txt count=8192
0+8192 records in
1943+1 records out
995134 bytes (995 kB) copied, 0.305225 s, 3.3 MB/s
$ ent randfile.txt
Entropy = 7.999799 bits per byte.

Optimum compression would reduce the size of this 995134 byte file by 0 percent.

Chi square distribution for 995134 samples is 276.44, and randomly would exceed this value 25.00 percent of the times.

Arithmetic mean value of data bytes is 127.5281 (127.5 = random).
Monte Carlo value for Pi is 3.146266317 (error 0.15 percent).
Serial correlation coefficient is 0.000545 (totally uncorrelated = 0.0).

If this data is compared to the results from the other tools, all of the tests above indicate that it is more random. Using a device outside of the computer to generate random entropy may be the best solution.

timer-entropy and haveged together

Now, what happens if I combine methods? By mistake I left multiple methods running and noticed better results. With timer-entropy and haveged together the results are:

Entropy = 7.999807 bits per byte.

Optimum compression would reduce the size of this 1048339 byte file by 0 percent.

Chi square distribution for 1048339 samples is 280.77, and randomly would exceed this value 25.00 percent of the times.

Arithmetic mean value of data bytes is 127.3958 (127.5 = random).

Monte Carlo value for Pi is 3.144199676 (error 0.08 percent).

Serial correlation coefficient is -0.000197 (totally uncorrelated = 0.0).

Results

Test | randomsound | timer-entropy | haveged | known random data | timer-entropy & haveged
entropy = 8 | N/A | 7.979475 | 7.999847 | 7.999799 | 7.999807
chi square = 10%-90% | N/A | 2.50% | 90% | 25% | 25%
Arithmetic mean = 127.5 | N/A | 126.7684 | 127.4428 | 127.5281 | 127.3958
Monte Carlo value for Pi | N/A | error 0.40% | error 0.11% | error 0.15% | error 0.08%
Serial correlation = 0 | N/A | -0.001984 | 0.000016 | 0.000545 | -0.000197
rngtest | N/A | N/A | 0.09% failure | 0.01% failure | 0.05%
speed | did not finish | 23k/sec | 3.8 MB/sec | N/A | 3.8 MB/sec

Conclusion

Getting random data is important to keeping secrets. At one time I thought this problem was solved, but then virtualization became the standard way to deploy new servers. When software ran on real hardware it was much easier to find sources from which to build entropy. Now virtual machine platforms block access to the real world where randomness can be harvested.

It seems that the best way to get entropy may be connecting an external device dedicated to generating it for all the virtual machines being hosted. But from these experiments, using haveged together with at least one other software method for gathering entropy might be the best option.

So in conclusion, I do not yet know the best software to generate random data and entropy. But hopefully this will point you in a beneficial direction.

Other Sources of Randomness

RDRAND

Computers with Intel CPUs have built-in random data generation hardware on the Ivy Bridge Xeon processor. It is accessed with the RDRAND instruction. There is a way to have rngd use it to populate the entropy pool /dev/random, as documented in this case study. I am running on a MacBook, which does not have a Xeon, so it will not work for me.

Random.org

Random.org is a service on the internet where one can get random data. Random.org uses multiple sources of noise, such as static from radios, to feed algorithms that produce random files for free.

Entropy in the News

Forbes Cloud Computing Security Technology

BBC – Web’s random numbers are too weak, researchers warn

Resources used

White Noise Boutique – (Art) a place to purchase random white noise.

https://en.wikipedia.org/wiki/Entropy_(computing)

http://spectrum.ieee.org/computing/hardware/behind-intels-new-randomnumber-generator

http://bredsaal.dk/improving-randomness-and-entropy-in-ubuntu-9-10

http://openfortress.org/cryptodoc/random/

http://www.randomnumbers.info/content/Generating.htm

https://blog.cloudflare.com/why-randomness-matters/

https://blog.mozilla.org/warner/2014/03/04/remote-entropy/


haveged as a source of entropy

Not-So-Random Numbers in Virtualized Linux and the Whirlwind RNG

Jumping the Gap - Data Transmission Over An Air Gap


This experiment from Volume Labs demonstrates how to send data between two points without the use of RF (radio frequency) or a wired connection. One computer translates a text file into a series of QR codes and displays them on a screen. The file is divided into small packages, since a QR code can hold at most about 3 KB, or around 4,000 characters. The second computer performs change detection on input from a camera and translates the QR codes back into a file.

Applications

I see three potential applications of this technology.

Unidirectional Data Transfer

One use would be the unidirectional movement of data between networks, often called a one-way transfer device or a data diode. One-way transfer systems are used to protect a network or computer against viruses, outbound transmission of data, and other cyber attacks. One industry that uses one-way transfer systems is power generation: to protect the power grid, there should be a gap between the internet and the power control systems.

Data Transfer To/From Faraday Cage

Another application of this technology is sending data visually in or out of a Faraday cage without introducing an RF leak.

one-way transfer via qr code

Long-Distance Data Transfer

Another use is to transmit data over a larger distance using a camera with a telephoto lens.

long distance transmission

It could be used to send data between devices, share large documents with a wide audience, or transmit additional data alongside a television broadcast. This is accomplished using an open source package called ZXing. The QrCodeTransfer code is available on GitHub.
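As an illustration of the sending side, here is a minimal Scala sketch using the ZXing API (assuming the zxing core and javase artifacts are on the classpath). The chunk size, Base64 framing, and file names are hypothetical and are not taken from the QrCodeTransfer project:

import java.nio.file.{Files, Paths}
import java.util.Base64
import com.google.zxing.BarcodeFormat
import com.google.zxing.client.j2se.MatrixToImageWriter
import com.google.zxing.qrcode.QRCodeWriter

// Split the file into chunks that fit comfortably inside one QR code
// (well under the ~3 KB capacity mentioned above) and write one PNG per chunk.
// A receiver would scan the frames and reassemble the chunks by sequence number.
val bytes = Files.readAllBytes(Paths.get("report.txt"))   // hypothetical input file
val chunkSize = 1000
val writer = new QRCodeWriter()

bytes.grouped(chunkSize).zipWithIndex.foreach { case (chunk, i) =>
  val payload = s"$i:" + Base64.getEncoder.encodeToString(chunk)
  val matrix = writer.encode(payload, BarcodeFormat.QR_CODE, 400, 400)
  MatrixToImageWriter.writeToPath(matrix, "PNG", Paths.get(f"frame_$i%04d.png"))
}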

 

Additional Reference added 4/30/2014:

  • QuickeR: Using video QR codes to transfer data