
Volume Analytics Chaos Control

Volume Analytics is a software tool used to build, deploy and manage data processing applications.

Volume Analytics is a scalable data management platform that allows the rapid ingest, transformation, and loading of high volumes of data into multiple analytic models, as defined by your requirements or your existing data models.

Volume Analytics is a platform for streaming large volumes of varied data at high velocity.

Volume Analytics is a tool that enables both rapid software development and operational maintainability, with scalability for high data volumes. It can be used for all of your data mining, fusion, extraction, transformation and loading needs. Volume Analytics has been used to mine and analyze social media feeds, monitor and alert on insider threats, and automate the search for cyber threats. In addition, it is being used to consolidate data from many data sources (databases, HDFS, file systems, data lakes) and to produce multiple data models for multiple data analytics and visualization tools. It could also be used to consolidate sensor data from IoT devices or to monitor a SCADA industrial control network.

Volume Analytics makes it easy to quickly develop highly redundant software that is both scalable and maintainable. In the end, you save money on labor for developing and maintaining the systems built with it.

In other words, Volume Analytics provides the plumbing of a data processing system. The application you are building has distinct units of work that need to be done. Compare it to a water treatment plant: dirty water comes into the system through a pipe and reaches a large contaminant filter. The filter is a work task and the pipe is a topic. Together they make a flow.

After the first filter, another pipe carries the water, minus the dirt, to the next purification worker. The plant also has a dashboard so the managers can monitor the system and see whether they need to fix something or add more pipes and cleaning tasks.

Volume Analytics provides the pipes, a platform to run the worker tasks and a management tool to control the flow of data through the system.

A Volume Analytics Flow for Finding Social Media Bots

In addition, Volume Analytics has redundancy for disaster recovery, high availability and parallel processing. This is where our analogy breaks down: data is duplicated across multiple topics, so the failure of a particular topic (pipe) does not destroy any data because it is preserved on another topic. Topics are ideally set up in multiple data centers to maintain high availability.

In Volume Analytics, the water filters in the analogy are called tasks. Tasks are groups of code that perform some unit of work; your specific application will have its own. Tasks are deployed on more than one server in more than one data center.
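To make the task-and-topic idea concrete, here is a rough Java sketch. These interfaces are hypothetical, written only to illustrate the pattern described above; they are not the actual Volume Analytics API.

interface Topic<T> {
    T take() throws InterruptedException;            // read the next record from the pipe
    void put(T record) throws InterruptedException;  // place a record onto the pipe
}

abstract class Task<I, O> implements Runnable {
    private final Topic<I> input;   // the pipe this task reads from
    private final Topic<O> output;  // the pipe this task writes to

    Task(Topic<I> input, Topic<O> output) {
        this.input = input;
        this.output = output;
    }

    // Each concrete task supplies one unit of work, e.g. "filter out the contaminants".
    protected abstract O process(I record) throws Exception;

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                output.put(process(input.take()));
            } catch (InterruptedException stop) {
                Thread.currentThread().interrupt();  // shut down cleanly when asked
            } catch (Exception e) {
                e.printStackTrace();                 // a real task would report this to monitoring
            }
        }
    }
}

A flow, then, is a set of such tasks wired together by shared topics, with the platform responsible for running the copies, moving the data and monitoring the whole chain.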

Benefits

A faster start-up time saves money and time.

Volume Analytics allows a faster start-up time for a new application or system. The team does not need to build the platform that moves data to tasks, nor a monitoring system, because those features are included. Volume Analytics will also integrate with your current monitoring systems.

System is down less often

The DevOps team gets visibility into the system out of the box. They do not have to stand up a log search system, which saves time, and they can see what is going on and fix it quickly.

Plan for Growth

As your data grows and the system needs to process more of it, Volume Analytics grows with it. Add server instances to increase processing power; as work grows, Volume Analytics allocates it to the new instances. No re-coding is needed, which saves time and money because developers are not needed to re-implement the code to work at a larger scale.

Less Disruptive Deployments

Construct your application in a way that allows new features to be deployed with lower impact on features already in production. New code libraries and modules can be deployed to the platform and allowed to interact with the running parts of the system without an outage. A built-in code library repository is included.

In addition, currently running flows can be terminated while the data waits on the topics for the newly programmed flow to start.

This Flow processes files to find IP addresses, searches multiple APIs for matches and inserts data into a HANA database

A threat-search data processing flow in production. Each box is a task that performs a unit of work. The task puts the processed data on the topic, represented by the star, and the next task picks it up and does another part of the job. The combination of a set of tasks and topics is a flow.

Geolocate IP Flow

An additional flow to geolocate IP addresses, added while the first flow is running.

Combined Flows

The combination of flows working together. The topic ip4-topic is an integration point.

Modular

Volume Analytics is modular and tasks are reusable. You can reconfigure your data processing pipeline without introducing new code. You can use tasks in more than one application.

Highly Available

Out of the box, Volume Analytics is highly available due to its built-in redundancy. Work tasks and topics (pipes) run in triplicate, so as long as your compute instances are in multiple data centers you have redundancy built in. Volume Analytics knows how to balance the data between the duplicates and avoid data loss if one or more work tasks fail; it will even queue up work if all work tasks fail.

Integration

Volume Analytics integrates with other products. It can retrieve and save data to other systems such as topics, queues, databases, file systems and data stores. In addition, these integrations happen over encrypted channels.

In our sample application, CyberFlow, many tasks integrate with other systems. The read bucket task reads files from an AWS S3 bucket, the ThreatCrowd task calls the API at https://www.threatcrowd.org, and the Honeypot task calls https://www.projecthoneypot.org. The insert tasks then write to the SAP HANA database used in this example.

Volume Analytics integrates with your enterprise authentication and authorization systems such as LDAP, Active Directory, CAP and more.

Data Management

Volume Analytics ingests datasets from throughout the enterprise, tracking each delivery and routing it through the platform to extract the greatest benefit. It shares common capabilities such as text extraction, sentiment analysis, categorization, and indexing, and a series of services makes those datasets discoverable and available to authorized users and other downstream systems.

Data Analytics

In addition to the management console, Volume Analytics comes with a notebook application. This allows a data scientist or analyst to discover data and turn it into information on reports. After your data has been processed by Volume Analytics and put into a database, the notebook can be used to visualize it: the data is sliced and diced and displayed on graphs, charts and maps.

Volume Analytics Notebook

Flow Control Panel

The flow control panel allows for control and basic monitoring of flows. Flows are groupings of tasks and topics working together. From this screen you can stop, start and terminate flows, and launch additional flow virtual machines when there is a heavy data processing load. The panel also gives access to start up extra worker tasks as needed, and there is a link that lets you analyze the logs in Kibana.

Topic Control Panel

The topic control panel allows for the control and monitoring of topics. Monitor and delete topics from here.

Consumer Monitor Panel

The consumer monitor panel allows for the monitoring of consumer tasks. Consumer tasks are the tasks that read from a topic; they may also write to a topic. This screen lets you verify that messages are being processed and determine whether there is a lag in processing.

Volume Analytics is used by our customers to process data from many data streams and data sources quickly and reliably. In addition, it has enabled the production of prototype systems that scale up into enterprise systems without rebuilding and re-coding the entire system.

And now this tour of Volume Analytics leads into a video demonstration of how it all works together.

Demonstration Video

This video further describes the features of Volume Analytics using an example application that parses IP addresses out of incident reports and searches other systems for indications of those IP addresses. The data is saved into an SAP HANA database.
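To give a flavor of what that parsing step involves, here is a small, self-contained Java sketch, not code from the product, that pulls IPv4 addresses out of free text with a regular expression:

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IpExtractor {
    // Matches dotted-quad IPv4 addresses; strict octet range checking is left out for brevity.
    private static final Pattern IPV4 = Pattern.compile("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b");

    public static Set<String> extract(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = IPV4.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints [203.0.113.7, 198.51.100.23]
        System.out.println(extract("Blocked traffic from 203.0.113.7 and 198.51.100.23 today."));
    }
}

In the demo, tasks like this put their results on topics so that the downstream search and database-insert tasks can pick them up.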

Request a Demo Today

Volume Analytics is scalable, fast, maintainable and repeatable. Contact us to request a free demo and experience the power and efficiency of Volume Analytics today.

Contact

How Do You Host a Website on Amazon AWS?

At Volume Labs we have been working to convert our site from WordPress to a static site. In doing this we determined that Hexo was the best tool for us. When considering where to deploy the new site we instantly thought of AWS, because it has a way to host static pages right out of S3. We have deployed Volume Labs and Volume Integration to AWS, and I will show you how to do it in this post.

First, create an S3 bucket. I named ours using the name of our website. S3 buckets are a place to store files on AWS, and each bucket name must be unique across all users, so our domain name works well. S3 is more cost-effective than using an AWS server instance.

S3 Bucket Button

 

Create the Bucket

S3 is redundant as the data you store there is spread across at least three data centers. You pay for the amount of storage used and the bandwidth used to get it in and out of S3.

When you create the S3 bucket, give it the following properties by clicking the Properties button. This will configure it to act like a web host and serve up the web pages.

S3 Bucket Properties

Note your website address for the bucket. You will need this later.
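If you would rather script this step than click through the console, the same setup can be sketched with the AWS SDK for Java (v1, the same SDK generation used in our other AWS posts). The bucket name, error page and credentials file are assumptions for illustration:

import com.amazonaws.auth.ClasspathPropertiesFileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.BucketWebsiteConfiguration;

public class CreateSiteBucket {
    public static void main(String[] args) {
        // Credentials are read from AwsCredentials.properties on the classpath.
        AmazonS3 s3 = new AmazonS3Client(new ClasspathPropertiesFileCredentialsProvider());

        String bucket = "yoursite.com"; // use your domain name as the bucket name

        // Create the bucket and turn on static website hosting with an index and
        // error document, which is what the Properties screen does in the console.
        s3.createBucket(bucket);
        s3.setBucketWebsiteConfiguration(bucket,
                new BucketWebsiteConfiguration("index.html", "error.html"));
    }
}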

Then configure the policy document to allow everyone on the internet read access to your files.

S3 Bucket Policy

S3 Bucket Policy

{
  "Version": "2012-10-17",
  "Id": "Policy1477706476623",
  "Statement": [
    {
      "Sid": "Stmt14777064735",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::yoursite.com/*"
    }
  ]
}

Upload your files to the S3 bucket using the Upload button or the S3 API.
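If you go the API route, a single page can be uploaded with the same Java SDK; the bucket name and local file path below are placeholders for your own site:

import java.io.File;

import com.amazonaws.auth.ClasspathPropertiesFileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class UploadPage {
    public static void main(String[] args) {
        AmazonS3 s3 = new AmazonS3Client(new ClasspathPropertiesFileCredentialsProvider());

        // Setting the content type matters: browsers need text/html to render the
        // page instead of downloading it.
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentType("text/html");

        PutObjectRequest request = new PutObjectRequest(
                "yoursite.com",                   // bucket
                "index.html",                     // key (path within the bucket)
                new File("public/index.html"));   // local file generated by Hexo
        request.setMetadata(metadata);

        s3.putObject(request);
    }
}

For a whole Hexo site you would loop over the generated public directory and set each object's content type from its file extension.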

Copy your website address that AWS gave you for the S3 bucket and test it out in a browser.

Now configure a certificate with Certificate Manager. Make sure you configure the certificate before you change the DNS settings at the provider of your domain name. When you request a certificate, AWS will send you an e-mail to authorize it using the contacts in your DNS entry. It will also try webmaster, hostmaster, administrator, postmaster and admin@your-domain.com.

Certificate Button

Request a certificate

Enter all of the domains that your site should respond to; use your main domain name and the www subdomain at least. When you finalize your request, AWS will send you an e-mail to validate your certificate. Make sure, as the owner of your domain name, that your domain registration is set up to send you e-mail. If the MX record is set up correctly with your domain name provider, the e-mail will arrive.

Then we set up CloudFront. CloudFront distributes content across the world by caching it on many servers, which puts the content closer to the people viewing it.

In CloudFront you can configure the connection to the S3 bucket and to the certificate. It is also possible to have it serve the files with compression, which further improves the speed at which your website is delivered to the browser.

CloudFront Button

Press the Create Distribution button. Pick the web delivery method on the next screen.

CloudFront Delivery Method

Set the origin settings. This is where you tell CloudFront to read the files from your S3 bucket. Paste the S3 website URL you saved earlier into the Origin Domain Name field.

CloudFront Origin Settings

Now set up the cache behavior. Since we are setting up an SSL/TLS certificate, turn on the Redirect HTTP to HTTPS setting. Also turn on the Compress Objects setting to improve the speed of downloading your pages. Keep the other settings as is. You can reduce the Time To Live (TTL) if your pages change more often.

CloudFront Cache Behavior

For worldwide coverage, set the distribution behavior to use all edge locations; this pushes your pages out to servers all over the world. Enter the alternate domain names your site will use, and then set the custom SSL certificate to the one created earlier with AWS Certificate Manager.

CloudFront Distribution Behavior

In addition, turning off IPv6 makes it easier to deploy using Route 53, so I turned it off in the distribution behavior section.

Press the Create Distribution button to finish the work here in CloudFront.

After CloudFront finishes distributing your site, use the URL it generates to view it. Save this URL in order to configure Route 53 (so named because the default port for DNS is port 53).

Now you need to configure Route 53. Route 53 is a DNS service on steroids with none of the side effects. This is the final step to getting your personal domain name to serve up the content.

Route 53 allows for aliases that route requests for your root domain and www subdomain to the CloudFront distribution. It will also direct HTTP traffic to HTTPS.

Go to Route 53 in the AWS Console. Press the Create Hosted Zone button.

Route 53 Hosted Zone Button

Enter the top-level domain name of your site and press Create. You will see that some settings are created.

Route 53 Name Servers

Take the name server (NS) settings generated by Route 53 and enter them in the DNS settings at your domain name provider. This allows Route 53 to act as your domain name service and gives you all of its nice features.

The next step is to set up an alias that will guide requests to your pages sitting in CloudFront which is getting them from S3.

We need two alias routes set as record sets in this screen. So press the Create Record Set button.

Route 53 Alias Record

Click the Alias Yes radio button and enter the CloudFront URL where your site is hosted. Leave the name field blank. Then do it again: create another record set with the name www, set the alias to Yes, and enter the URL for the CloudFront distribution again. This creates the route for the www subdomain of your site.
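The same two alias records can also be created through the Route 53 API. Here is a minimal Java sketch under a few assumptions: the AWS SDK for Java v1, a placeholder hosted zone ID, and your own CloudFront domain name pasted in. CloudFront alias targets always use the fixed hosted zone ID Z2FDTNDATAQYW2.

import com.amazonaws.auth.ClasspathPropertiesFileCredentialsProvider;
import com.amazonaws.services.route53.AmazonRoute53Client;
import com.amazonaws.services.route53.model.AliasTarget;
import com.amazonaws.services.route53.model.Change;
import com.amazonaws.services.route53.model.ChangeAction;
import com.amazonaws.services.route53.model.ChangeBatch;
import com.amazonaws.services.route53.model.ChangeResourceRecordSetsRequest;
import com.amazonaws.services.route53.model.RRType;
import com.amazonaws.services.route53.model.ResourceRecordSet;

public class CreateAliasRecords {
    // CloudFront distributions always use this fixed hosted zone ID as an alias target.
    private static final String CLOUDFRONT_ZONE_ID = "Z2FDTNDATAQYW2";

    public static void main(String[] args) {
        AmazonRoute53Client route53 =
                new AmazonRoute53Client(new ClasspathPropertiesFileCredentialsProvider());

        // Placeholders: your hosted zone ID and the URL CloudFront generated for you.
        String hostedZoneId = "YOUR_HOSTED_ZONE_ID";
        String cloudFrontDomain = "dxxxxxxxxxxxxx.cloudfront.net";

        ChangeBatch batch = new ChangeBatch().withChanges(
                aliasChange("yoursite.com.", cloudFrontDomain),
                aliasChange("www.yoursite.com.", cloudFrontDomain));

        route53.changeResourceRecordSets(new ChangeResourceRecordSetsRequest()
                .withHostedZoneId(hostedZoneId)
                .withChangeBatch(batch));
    }

    // Builds an A record whose alias target points at the CloudFront distribution.
    private static Change aliasChange(String recordName, String target) {
        ResourceRecordSet recordSet = new ResourceRecordSet()
                .withName(recordName)
                .withType(RRType.A)
                .withAliasTarget(new AliasTarget()
                        .withHostedZoneId(CLOUDFRONT_ZONE_ID)
                        .withDNSName(target)
                        .withEvaluateTargetHealth(false));
        return new Change()
                .withAction(ChangeAction.UPSERT)
                .withResourceRecordSet(recordSet);
    }
}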

Now wait. It takes 1 to 48 hours for the worldwide network of DNS servers to get the changes you made at your domain name provider.

As another option you can purchase or move your domain name to AWS.

In order to set up e-mail forwarding, I use the free e-mail forwarding service ImproveMX. Just register your domain with ImproveMX and create a record set for an MX record in Route 53 with the mail server settings.

Now enjoy your site. You should see increased performance over a run-of-the-mill web hosting provider, and your costs might be lower.

Please follow us at VolumeInt and check us out at Volume Integration.

Installing Rocket.Chat on Raspberry Pi 2

The goal is to get Rocket.Chat running on a Raspberry Pi 2. And what a crazy path it took me on.

Rocket.Chat is an open source chat server that is easy to use and has lots of features that support communication and sharing of links and files. It is going to be set up as a private chat platform for Volume Integration to increase collaboration.

Rocket Chat at Volume

I decided to start at the beginning with NOOBS and the Raspbian OS. My research indicated that Rocket.Chat had been installed and run on a Raspberry Pi. Rocket.Chat requires Node.js, npm, Meteor and MongoDB. I started by following some directions for installing Meteor, then ran into major issues getting Node and MongoDB to install. At this point I realized that I had a Pi 2 and that many popular packages did not have binaries for it. The newest MongoDB only runs on 64-bit architectures, and the Pi's ARM processor is 32-bit.

Raspberry Pi Logo

After much searching and compiling of different versions of MongoDB and Node, following the installation-without-Docker instructions, the Rocket.Chat RockOnPi community released Rocket.Chat Raspberry Pi directions. They should soon have a build for the Raspberry Pi Zero. These directions worked with Raspbian, but they call for using mongolab.com for the Mongo database, and I could not find a build of MongoDB that worked on the Raspberry Pi 2.

But the goal was to get it all working on a single Raspberry Pi. There is a version of Linux called Arch Linux that has MongoDB 3.2, so the first step was to install it. This was a side adventure documented in Installing Arch Linux on Raspberry Pi 2. The major issue is that as of NOOBS 1.5 there was no support for installing Arch Linux on the Raspberry Pi 2 using the NOOBS installer, which required me to write the image to the SD card and boot from there.

MongoDB

First I installed MongoDB because it was the hardest part on Raspbian: there was no build that would both work on the Raspberry Pi 2 and support Rocket.Chat. I found instructions for an old version that did work, but it was too old for Rocket.Chat.

One item of note with MongoDB on the Raspberry Pi is that the Pi 2's ARM processor is 32-bit, which means MongoDB will only support databases up to 2 GB in size. Sing Li, a contributor to the Rocket.Chat project, told me on their demo chat server:

“that’s by no means limitation for Rocket.Chat 😄 a 2 GB mongodb database IS VERY LARGE ! For reference … this demo server with 38,000 registered users and close to 300,000 messages has a database that is less than 2 GB in size (for message storage). Hopefully the Pi server is expected to handle a little less.”

First, install the dependencies as root. Arch Linux has a default user of alarm; if you are connecting via SSH, log in as alarm and from there you can su to root.

pacman -S npm
pacman -S curl
pacman -S graphicsmagick
pacman -S mongodb

This caused issues with incompatible versions. So I ran:

pacman -Syu mongodb

Now we need to make a data directory for Mongo.
mkdir /data/
mkdir /data/db

If your '/data/db' directory doesn't have the correct permissions and ownership (shown below), do this:

First check what user and group your mongo user has:
# grep mongo /etc/passwd
mongod:x:498:496:mongod:/var/lib/mongo:/bin/false

You should have an entry for mongod in /etc/passwd, as it's a daemon.
chmod 0755 /data/db
chown -R 498:496 /data/db # using the user-id, group-id
ls -ld /data/db/
drwxr-xr-x 4 mongod mongod 4096 Oct 26 10:31 /data/db/

The left side ‘drwxr-xr-x’ shows the permissions for the User, Group, and Others. ‘mongod mongod’ shows who owns the directory, and which group that directory belongs to. Both are called ‘mongod’ in this case.

Now try to start MongoDB to see if it works. On 32-bit architectures you must start it with the mmapv1 storage engine.
mongod --storageEngine=mmapv1

In theory you should enable MongoDB so it starts up on boot.

Modify the /usr/lib/systemd/system/mongodb.service file with the storage engine settings.
systemctl enable mongodb.service

But having it run as a service caused issues when starting Rocket.Chat: Rocket.Chat says that the database driver version 2.7 is incompatible. So for now I run it as a regular user with the mongod command. To have it keep running after logout, install screen.

pacman -S screen
screen
mongod --storageEngine=mmapv1

Some of the following directions are based on https://github.com/RocketChat/Rocket.Chat.RaspberryPi

Meteor and NPM install

The easiest way to get both is to clone from the Meteor universal project.

As a user that is not root follow this:
cd ~
git clone --depth 1 https://github.com/4commerce-technologies-AG/meteor.git

then

$HOME/meteor/meteor -v

Rocket.Chat Install

I received some great help and encouragement from the Raspberry Pi community on the Rocket.Chat chat site.

You do not need to be root to perform this step.

Download the Rocket.Chat binary for Raspberry Pi

cd $HOME
mkdir rocketchat
cd rocketchat
curl https://cdn-download.rocket.chat/build/rocket.chat-pi-develop.tgz -o rocket.chat.tgz
tar zxvf rocket.chat.tgz

This will download and untar the app in $HOME/rocketchat

After some trial and error I discovered that some dependencies were needed: make, gcc and python2. Root must run pacman.
pacman -S python2
pacman -S make
pacman -S gcc

Now try the install procedure, but use Python 2.7; the --save flag will show any errors that happen. I used --save to figure out that I did not have gcc (g++) installed.

cd ~/rocketchat/bundle/programs/server

~/meteor/dev_bundle/bin/npm install --python=python2.7 --save

Testing to make sure it works

export PORT=3000

export ROOT_URL=http://<your url or ip address>:3000

export MONGO_URL=mongodb://localhost:27017/rocketchat

$HOME/meteor/dev_bundle/bin/node main.js

Linger

To keep session running after logout in Arch Linux use:
loginctl enable-linger alarm

Run on Startup

Unfortunately I have been unable to get Rocket.Chat to recognize MongoDB when Mongo is running as a service on Arch Linux. It says that the version of the database driver is not compatible. For now I start Mongo up as a user and place it in the background.

Next Steps

The next steps are to configure the Rocket.Chat server to start up on boot and run over SSL. We want to protect the chats flowing between our employees. Follow this blog and the volumeint Twitter account to find out about the next posts on how to install Arch Linux on Raspberry Pi 2 and how to get a free SSL certificate for your chat server.

What's the Best Tool to Monitor Redis?

High volume services like Twitter, Pinterest, and Flickr use Redis to deliver small pieces of information very quickly. Redis is ideal for these applications because it stores data in memory and on disk at the same time. Retrieving data from the rows and columns of a database can be slow, so Redis stores data in key-value pairs.

Volume Integration uses Redis in our software product called Volume Analytics. Out of the box, Redis is manipulated via the command line, but we wanted a web interface and monitoring tool to track memory usage and the up/down status of Redis. So we set out to find the best tool for the job.
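For context, the numbers we cared about, memory usage and whether the server is up, can be read straight from Redis itself. Here is a minimal Java sketch using the Jedis client; Jedis and the localhost connection are assumptions for illustration, not part of how the tools below work:

import redis.clients.jedis.Jedis;

public class RedisHealthCheck {
    public static void main(String[] args) {
        // Assumes a local Redis instance on the default port.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Up/down status: PING answers PONG if the server is alive.
            System.out.println("status: " + jedis.ping());

            // Memory usage and uptime come back as sections of the INFO command.
            System.out.println(jedis.info("memory"));
            System.out.println(jedis.info("server"));
        }
    }
}

Monitoring dashboards like Redmon's put a friendlier face on this same kind of INFO data.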

The tools we evaluated were Reddish, Redis Commander, and Redmon. I was able to install all of them in one day, so the installation is fairly easy.

Reddish

Reddish is programmed in Node.js and has a very basic interface. It allows for web searching of keys by name or wildcards. The interface also allows editing of values in the datastore.

Reddish Console

Reddish was not the right tool for us, as it did not have any way to monitor the service itself.

Redis Commander

Redis Commander is also built with Node.js and includes a tree-based navigation of the data with counts of how much data is in each folder. It also enables users to change the configuration settings of the Redis server through the tree.

Redis Commander Configuration

The interface allows for modification of the configuration and the data elements. At the bottom of the window, Redis Commander provides access to the command-line interface. Redis Commander includes many screens and different options to manage the data in Redis.

Edit Key Values

We found that if there is a lot of data in Redis, it can take a while for Redis Commander to load the data into the tree. This tool also did not meet our need to monitor the service for usage and uptime.

Redmon

Redmon is programmed in Sinatra and was the easiest to install. Just run gem install redmon, then start the Redmon server with a single command.

Redmon Monitor Screen

Redmon contains only three screens: monitoring dashboard, CLI, and configuration control. The first screen was exactly what we needed – a graph showing the performance and usage of the system.

Configuring Redis Redmon

The configuration tab allows us to change the settings of the Redis server to improve performance.

The video below shows the entire Redmon interface.

Redmon Demo

Evaluation

After our evaluation, we selected Redmon. It was the only product that was a monitoring tool. Plus, it fit well into our system since we already use Ruby for other parts of the application. Redis Commander and Reddish would be more suitable for projects that need a visual interface to manage data within Redis.

Let us know how you are using Redis and what interface you use. What is your favorite tool for managing Redis?

 

To learn more about Volume Labs and Volume Integration, please follow us on Twitter @volumeint and check out our website.

Cloud Management Demands an Organizational Shake-Up

(flickr.com/George Thomas)

The cloud is here. Most organizations now have contracts that allow for the construction of applications in a cloud environment. The cloud has promised lower costs, greater efficiency, and greater security. But these cost savings cannot be realized without simplifying the organizational structure.

The Rules Have Changed

Every new technology creates new processes. When the personal computer came around, it forced the established information technology department to change or die. The departments that would not change had their work supplanted by other departments that found lower cost and more efficient ways to work. Why pay your big IT department for time on a mainframe, when you can just go buy a PC that does it cheaper?

The same scenario is happening again with the cloud. Cloud computing provides the ability to only pay for the processing and storage you need on demand. In addition, it does not require staff to install and configure servers since all cloud services provide a web interface to instantly use a new server.

We do not manage client/server networks in the same way that we managed mainframes. So depending on old processes and labor categories will greatly hamper the cloud and ultimately make it as inefficient as the old mainframe.

Let's use companies that build internet applications as a guide. It's possible for a small department to create an innovation so powerful that it can supersede much larger organizational groups. This can occur when those small departments use the cloud, which allows them to purchase only the computing power needed. The first group that figures out how to bypass the old process, policies, and rules is able to build something so important that policies, rules, and organizational structures are redefined.

I have fallen into the trap of rigidly following old policies myself. There is a great new technology called Hadoop, which is a way to process big data over many servers by dividing it into small pieces. But Hadoop requires code and data to be distributed on many computers. So I rejected Hadoop because of an organizational policy against automatic remote code execution. But it turns out that there were other divisions building tools with Hadoop and proving that the technology could change the way data analysis happens. After they showed the power of Hadoop, the policy changed.

For revolutionary technology, there is a way to mitigate risk and modify policies. The adoption of Hadoop has revolutionized the way big data is processed in my organization and made many small groups the most powerful and efficient in meeting new mission sets.

Change the Roles

Right now, many teams are divided into these groups: users (analysts, statisticians, etc.), decision makers, requirements, system administrators, database administrators, systems engineers, programmers, testers, project managers, and security engineers. With the new cloud systems, where almost anyone in this list can learn how to start up a new virtual machine and install software on it, why are all these roles needed?

I have found that it works best to employ one or more technical generalists who know system administration, databases, programming, systems engineering, testing, and requirements. This technical generalist can get a new application running very easily. After sitting with the actual users, technical generalists can collect ideas, build, and show progress quickly.

The cloud is part of what makes this possible. People who build the solution that users need can instantly start up new servers and compute clusters on demand. If the application is successful, they can instantly scale up without having to wait on another department to start up the servers.

Teamwork?

DevOps

Consider combining development with operations and maintenance. The cloud forces this issue anyway, and you gain great efficiencies.

In the cloud, developers and testers are concerned about real production issues. The cloud makes it easier to deploy new software because the developers think of it up front and build code to automatically deploy the software. Developers need to learn how to do system administration in order to write better code, and system administrators need to learn how to code to deploy applications better.

If a developer or analyst can make the decision to implement a new service on a new virtual machine, why should he wait on someone else to click a few buttons on a web interface to start up a new instance? In the worst cases, there is a long list of departments and boards that need to approve the action.

It’s even better if you are able to find a motivated user who can actually build or prototype the application. They know exactly what is needed, and if they have the right technical team supporting them, they will find a way to get it done. I have seen analysts and mathematicians with access to computational power manipulate it on demand as the mission shifted. They were able to gain the technical title that allowed them to be analyst, system administrator, and programmer all in one; they changed the way things work.

Automate

Instead of hiring more people, make sure the people you have are writing code and scripts to automate the work.

  • Testers should be using code to test applications repetitively.
  • System administrators can write code to deploy patches and code.
  • Programmers can write code to test and deploy systems and set up monitoring tools to watch everything.
  • Analysts and mathematicians can write code to filter and sift the data.

Everyone can rely on others in the group to come up with solutions together. But if the key players are in different departments, they will not be able to work together effectively. In my experience, it’s better to have a small team together than a large, disparate one.

With cloud computing, it is possible to automate the scaling up and down of servers. The right code makes it possible to have servers deploy themselves automatically as load goes up or when there are server failures. Work toward building systems that can utilize this feature of the cloud.

Accomplish the Mission

The key is to get everyone closer to the actual mission – solving the problem with software. Too often, work is tightly controlled by functional divisions. In these cases, the system administrator is able to keep the server running but has no power or responsibility to keep the network, database, or application up. But a running server with no network isn’t very useful to the mission.

The use of the cloud puts more power than ever before in the hands of people with technical skills. Anyone with an internet connection can write an application and deploy it on a cloud server for almost nothing. But within large organizations, we get stymied by processes and labor categories. We lack the access to develop and deploy new technology without impediments.

My recommendation is to find ways to collapse job roles and allow technical generalists to gain direct access to the necessary resources needed. The cloud will only live up to its promise if we can control it directly.

What is your experience in deployment of solutions to the cloud? Does your bureaucracy get in the way?

To learn more about Volume Labs and Volume Integration, please follow us on Twitter @volumeint and check out our website.

Assessing Organizational Risk with CloudTrail

Airplanes leave trails in the clouds to let us know where they’ve been. (flickr.com/Vicki Burton)

Recently, we’ve been experimenting with collecting CloudTrail data from Amazon Web Services (AWS). Here is a description of CloudTrail, according to the FAQ:

AWS CloudTrail is a web service that records API calls made on your account and delivers log files to your Amazon S3 bucket. …CloudTrail provides visibility into user activity by recording API calls made on your account. CloudTrail records important information about each API call, including the name of the API, the identity of the caller, the time of the API call, the request parameters, and the response elements returned by the AWS service. This information helps you to track changes made to your AWS resources and to troubleshoot operational issues. CloudTrail makes it easier to ensure compliance with internal policies and regulatory standards.

We use CloudTrail to further enhance our clients’ knowledge of organizational risk with AWS. We collect AWS information from CloudTrail, load it into our Volume Analytics product, and use the information to reinforce our risk models. While this short tutorial focuses on CloudTrail, the code can be used to read any data from S3.

Getting Started

What you need:

  • Cloud Trail enabled on your AWS instance
  • S3 enabled on your AWS instance
  • Java AWS SDK

Collect CloudTrail Data

  1. Create an S3 bucket.
  2. Configure CloudTrail to deliver its logs to that bucket.
  3. Enable API access.
  4. Finally, you can use the AWS Java API to pull CloudTrail data from S3. Make sure you configure the AwsCredentials file with the correct accessKey and secretKey. Here is my sample code for pulling data from S3 in CloudTrailTest.java on GitHub:
//package com.volume.hooks.s3;

import com.amazonaws.auth.ClasspathPropertiesFileCredentialsProvider;
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Iterator;
import java.util.zip.GZIPInputStream;

public class CloudTrailTest {

    public static AmazonS3 s3;
    public static Region usEast1;

    public static void main(String[] args) throws IOException, InterruptedException {
        // Poll the CloudTrail bucket forever, printing each record as it is found.
        while (true) {
            // Read credentials from AwsCredentials.properties on the classpath
            s3 = new AmazonS3Client(new ClasspathPropertiesFileCredentialsProvider());

            // Set your AWS region
            usEast1 = Region.getRegion(Regions.US_EAST_1);
            s3.setRegion(usEast1);

            // Name of the S3 bucket containing CloudTrail JSON
            ObjectListing objectListing = s3.listObjects(new ListObjectsRequest()
                    .withBucketName("vatraildata"));

            // Iterate through all objects in the CloudTrail bucket
            for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
                System.out.println("Downloading an object: " + objectSummary.getKey());
                S3Object object = s3.getObject(new GetObjectRequest("vatraildata", objectSummary.getKey()));
                System.out.println("Content-Type: " + object.getObjectMetadata().getContentType());

                // If the object contains content, treat it as a file
                if (objectSummary.getSize() > 0) {
                    displayTextInputStream(object.getObjectContent());
                    // Optional: delete the file after it has been read.
                    // s3.deleteObject("vatraildata", object.getKey());
                }
            }
        }
    }

    private static void displayTextInputStream(InputStream input) throws IOException {
        // All of the files are GZipped JSON
        GZIPInputStream gzipStream = new GZIPInputStream(input);
        Reader decoder = new InputStreamReader(gzipStream, "US-ASCII");
        BufferedReader reader = new BufferedReader(decoder);

        // Read the whole document into one string
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            json.append(line);
        }

        // Use your favorite JSON parser and go to town!
        ObjectMapper m = new ObjectMapper();
        JsonNode rootNode = m.readTree(json.toString());
        JsonNode records = rootNode.path("Records");
        Iterator<JsonNode> recordItr = records.iterator();
        while (recordItr.hasNext()) {
            JsonNode node = recordItr.next();
            JsonNode userIdentity = node.path("userIdentity");
            System.out.println("accountId: " + userIdentity.path("accountId").asText());
            System.out.println("type: " + userIdentity.path("type").asText());
            System.out.println("principalId: " + userIdentity.path("principalId").asText());
            System.out.println("arn: " + userIdentity.path("arn").asText());
            System.out.println("accessKeyId: " + userIdentity.path("accessKeyId").asText());
            System.out.println("event: " + node.path("eventName").asText());
            System.out.println("ip: " + node.path("sourceIPAddress").asText());
            System.out.println("eventTime: " + node.path("eventTime").asText());
            System.out.println("----");
        }

        reader.close();
        decoder.close();
        gzipStream.close();
    }
}

Conclusion

CloudTrail is extremely helpful for gaining detailed insight into your AWS environment. As we’ve shown in this tutorial, it’s very easy to configure and pull the information into your own applications. Let us know about your experience with CloudTrail and what you discovered about your organizational risk in the comments.

To learn more about Volume Labs and Volume Integration, please follow us on Twitter @volumeint and check out our website.