
Data Engineering Capstone Project -- Bryan Bischof

Dec. 22, 2015

Project Description

Aspera's ASCP is a transfer protocol that is especially useful for large data transfers over suboptimal networks. In particular, ASCP is a UDP-based transfer with guaranteed delivery. FASPstream is a version of ASCP designed specifically for streaming data transfers. During a transfer of either type, a log file is produced that contains time-series data for

  • bandwidth
  • retransmission rate
  • delay
  • etc.

These log files provide a fingerprint of what a transfer looked like at every moment during the transfer. It is of great value to Aspera to be able to store, process, and analyze them. The current solution is a Python script that runs a series of regexes and is deployed with Spark on a Mesos cluster for analysis. However, this script is highly inefficient and isn't designed to interact with a lambda architecture: first, it doesn't connect to a permanent data store, and second, it accepts only batch uploads, not incoming streams.

This project is to rewrite this script to do three things:

  • provide a pure Scala implementation of the hundred-or-so regexes
  • accept incoming streaming data sources
  • write to a permanent data store

Accomplishing these goals will involve taking the existing Python script and reimplementing it in Scala, writing original code to handle the incoming stream, and writing to persistent storage. Additionally, the app must integrate with Aspera's FASPstream as a data stream.
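
As a sketch of what the regex port might look like, here is one illustrative pattern translated from Python-style re usage to Scala's scala.util.matching.Regex. The rate_rtt pattern is inferred from the example log line shown in the Data section below, not taken from the actual script:

```scala
object RegexPort {
  // Python original (illustrative):
  //   re.search(r"rate_rtt b/l/h/s/r/f=(\d+)/(\d+)/(\d+)/(\d+)/(\d+)/(\d+)", line)
  // Scala equivalent using scala.util.matching.Regex:
  val rateRtt = raw"rate_rtt b/l/h/s/r/f=(\d+)/(\d+)/(\d+)/(\d+)/(\d+)/(\d+)".r

  // Returns the six rate_rtt values if the line contains them.
  def parseRateRtt(line: String): Option[Seq[Long]] =
    rateRtt.findFirstMatchIn(line).map(m => (1 to 6).map(i => m.group(i).toLong))
}
```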

Project Demonstration

Proving out this project will depend on a few things:

  • does it improve on the performance of the current Python implementation?
  • does it accept the streaming data source?
  • can it write to a permanent data store?

The project will be built as a standalone Scala app with a test data set to demonstrate that the script succeeds. First, we will build a live demo of this Scala app on some test logs. Second, we will integrate it with the current system and run a performance analysis on test logs. The performance analysis will be presented in D3. Third, we will integrate with FASPstream and present a demo of a data transfer feeding the Scala app. Finally, we will build a demo of writing to a permanent data store. All of this will be presented in a whitepaper with D3 visualizations for the performance analysis.

First Step

The first step for this project is twofold:

  • understand the current Python implementation
  • understand the log data form

Because this project initially sets out to improve upon the current Python code, it's important to begin by working through that code and converting it to Scala. Getting a Scala parser for this data is the absolute first goal. Writing that parser also requires understanding the (un)structure of the data, so that is an additional component of building it.

Data

Fortunately, the data for this project is readily available. Additionally, the example data set used to evaluate the current implementation is available with documentation. This example data will therefore be used to build and initially test the app.

An example line from this data set is as follows:

Jan 23 10:47:43 elsaspera03 ascp[5276]: LOG Receiver bl t/o/r/d=5292/5292/0/0 rex_rtt l/h/s/o=0/0/67/8 ooo_rtt l/h/s/o=0/0/67/8 rate_rtt b/l/h/s/r/f=67/67/117/108/0/1 ctl bm/bs=0/0 rex n/s/q/v/a/r=0/0/0/0/0/0 bl l/d/o/r/a/x/dl/df/dm/ds=0/0/0/0/0/0/0/0/0/0 disk l/h/b/w=0/1/0/7683984 rate t/m/c/n/vl/vr/r=80000000/0/69277032/69277032/80000000/80000000/80000000 prog t/f/e=7683984/385957572/1000142 rcvD=0
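
As a minimal sketch, a Scala parser for lines of this shape might look like the following. The LogRecord fields and the choice to extract only the prog t/f/e triple are illustrative assumptions, not the script's actual schema:

```scala
// Illustrative record type; the real schema covers many more of the log's fields.
case class LogRecord(timestamp: String, host: String, progT: Long, progF: Long, progE: Long)

object LineParser {
  // Matches the syslog-style prefix and the "prog t/f/e" triple.
  // .unanchored lets the pattern match anywhere in the line rather than the whole line.
  val pattern =
    raw"(\w+ +\d+ [\d:]+) (\S+) ascp\[\d+\]: LOG .*prog t/f/e=(\d+)/(\d+)/(\d+)".r.unanchored

  def parse(line: String): Option[LogRecord] = line match {
    case pattern(ts, host, t, f, e) => Some(LogRecord(ts, host, t.toLong, f.toLong, e.toLong))
    case _                          => None
  }
}
```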

Techniques & Pipeline Structure

The data pipeline for the log parser should be relatively straightforward:

Log file == FASPstream ==> (Socket)Spark ====> ScaleKV

This project will focus primarily on the Spark application. For the purposes of this project, we will begin by reading the log file into Spark and testing its comprehension and processing of the log file. Later we will connect the streaming source. The Spark application itself will read in the log file, process it line by line, and output lines to be written into a K-V store, as sketched below.
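
A minimal sketch of that batch flow, assuming the illustrative LineParser above and Spark's RDD API (the paths and the key-value line format are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogBatchApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ascp-log-parser")
    val sc   = new SparkContext(conf)

    // Read the log file and keep only the lines the parser understands.
    val records = sc.textFile("hdfs:///logs/ascp.log")  // path is illustrative
      .flatMap(line => LineParser.parse(line))

    // Emit key-value lines ready to be loaded into a K-V store.
    records
      .map(r => s"${r.host}:${r.timestamp}\t${r.progT}/${r.progF}/${r.progE}")
      .saveAsTextFile("hdfs:///out/ascp-kv")  // stand-in for the eventual ScaleKV sink

    sc.stop()
  }
}
```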

For the streaming integration, I have decided to use Spark Streaming, after reviewing the differences between Storm and Spark Streaming, the need for stateful processing, and the preference for small-batch processing. Spark Streaming will allow the incoming data stream to be converted to a DStream for processing within Spark.
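
A minimal sketch of the streaming side, assuming FASPstream feeds lines to a TCP socket (the host, port, and batch interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LogStreamApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ascp-log-stream")
    // A small batch interval, matching the preference for small-batch processing.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Lines arrive over a socket fed by FASPstream; host and port are placeholders.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Reuse the same parser as the batch app, printing a sample of each batch.
    lines.flatMap(line => LineParser.parse(line)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```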

Speedbumps

The two most intimidating components of this work are reading from the stream and writing to ScaleKV.

Reading from the stream can potentially be handled with a socket. However, to be sure this is worthwhile, I want to run performance tests on processing log files through the stream versus transferring them normally and loading the data in as a batch.
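
A rough timing harness for those comparisons might be as simple as the following sketch:

```scala
object Timing {
  // Runs a block, prints how long it took, and returns its result.
  def time[A](label: String)(body: => A): A = {
    val start     = System.nanoTime()
    val result    = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label: $elapsedMs%.1f ms")
    result
  }
}

// Example usage: Timing.time("batch parse") { records.count() }
```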

The second potential hangup is writing to the KV store. There will be extra work to put the data into the format needed by the KV store. This work shouldn't be too difficult, but it will require an extra step.
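
Since ScaleKV is proprietary and its API isn't documented here, the following sketches that formatting step against a hypothetical client trait; the key layout and value encoding are assumptions, not ScaleKV's actual format:

```scala
// Hypothetical stand-in for a ScaleKV client; the real API may differ entirely.
trait KVClient {
  def put(key: String, value: String): Unit
}

object KVWriter {
  // Assumed key layout (host + timestamp) and value encoding.
  def write(client: KVClient, r: LogRecord): Unit =
    client.put(s"${r.host}:${r.timestamp}", s"${r.progT}/${r.progF}/${r.progE}")
}
```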

Timeline

I expect that the project can be completed in the allotted time. By the first week of January, I should have completed the Spark application to process the data. From there, it should take a couple of days to integrate with streaming and ScaleKV, and a day for performance analysis.

External Resources

Most of the resources for this project are contained in Spark. To conform to the eventual use case, I will need to use FASP as the transfer protocol, which is a proprietary technology. Additionally, I will use ScaleKV, which is also proprietary. Both will be made available to me by Aspera for the purposes of this project.
