Skip to content

Instantly share code, notes, and snippets.

@BBischof
Created December 17, 2015 22:56
Show Gist options
  • Save BBischof/9ea11dbcc52b4bb1c048 to your computer and use it in GitHub Desktop.
Save BBischof/9ea11dbcc52b4bb1c048 to your computer and use it in GitHub Desktop.

Data Engineering Capstone Project -- Bryan Bischof

Dec. 17, 2015

Project Description

Given unstructured log data from Aspera's ASCP transfer, one needs to parse these logs, and store them to a large key-value store(currently Redis). The current solution is a Python script that runs a series of regexes, and is deployed on Spark to a Mesos cluster for analysis. However, this script is highly inefficient and isn't designed to interact with a lambda architecture. In particular, it doesn't connect to a permanent data store, and second, it doesn't accept incoming streams, only batch upload and processing.

This project is to rewrite this script to do three things:

  • pure scala implementation of these hundred-so regexs
  • accept incoming streaming data sources
  • also equiped to write to a permanent data store

Accomplishing these goals will involve taking the existing Python script and reimplementing in Scala, and writing original code to handle the incoming stream, and writing to persistant storage. Additionally, integrating with Aspera's FaspStream as a datastream is necessary.

Project Demonstration

The proofing of this project will depend on a few things:

  • is it optimized from the current Python implementation?
  • does it accept the streaming data source?
  • can it write to a permanent data store?

The project will be built as a standalone Scala app with a test data set to demonstrate the successfulness of the script. First, we will build a live demo of this Scala app for some test logs. Second, we will integrate this with the current system and do performance analysis on some test longs. The performance analysis will be presented in D3. Third, we will integrate with FaspStream and present a demo of a data transfer integrated with the Scala app. Finally, we will build a demo of writing to a permanent data store. This will all be presented finally in a Whitepaper with D3 visualizations for performance analysis.

First Step

The first step for this project is twofold

  • understand the current Python implementation
  • understand the log data form

Because this project initially sets out to improve upon the current Python code, it's important to begin working through this code, and converting it to Scala. Getting a Scala parser for this data is the absolute first goal. To write that parser, it is also necessary to understand the (un)structure of the data. So that is an additional component of building this parser.

Data

Fortunately, the data for this project is very available. Additionally, the example data set which is used to evaluate the current implementation is available with documentation. Thus, this example data will be used to build and initially test the app.

An example line from this data set is as follows:

Jan 23 10:47:43 elsaspera03 ascp[5276]: LOG Receiver bl t/o/r/d=5292/5292/0/0 rex_rtt l/h/s/o=0/0/67/8 ooo_rtt l/h/s/o=0/0/67/8 rate_rtt b/l/h/s/r/f=67/67/117/108/0/1 ctl bm/bs=0/0 rex n/s/q/v/a/r=0/0/0/0/0/0 bl l/d/o/r/a/x/dl/df/dm/ds=0/0/0/0/0/0/0/0/0/0 disk l/h/b/w=0/1/0/7683984 rate t/m/c/n/vl/vr/r=80000000/0/69277032/69277032/80000000/80000000/80000000 prog t/f/e=7683984/385957572/1000142 rcvD=0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment