Bryan Bischof BBischof

  • current: Hex | prev: Weights and Biases, Stitch Fix, Blue Bottle, QuasiCoherent Labs, IBM
  • Berkeley, California
  • X @bebischof

Data Engineering Capstone Project -- Bryan Bischof

Dec. 17, 2015

Project Description

Given unstructured log data from Aspera's ASCP transfers, the task is to parse these logs and store them in a large key-value store (currently Redis). The current solution is a Python script that runs a series of regexes and is deployed on Spark to a Mesos cluster for analysis. However, this script is highly inefficient and isn't designed to interact with a lambda architecture. In particular, it doesn't connect to a permanent data store, and it doesn't accept incoming streams, only batch upload and processing.

This project is to rewrite this script to do three things:

  • a pure Scala implementation of these hundred or so regexes

Rough project steps

  • wget all the articles from 2015 into a directory
  • use find | grep | awk to build a list of file paths and save the list to a variable
  • loop over the list of files and use cat | grep | sed to parse each file, writing the output to new files
  • loop over the new files and use cat to concatenate multi-part files into single transcripts

Bash Commands Run

@BBischof
BBischof / numeralDecoderPuzzle
Created December 6, 2013 00:41
Some little programming puzzle I found. Apparently Facebook asked this sometime. A message containing letters from A-Z is being encoded to numbers using the following mapping: a -> 1, b -> 2, ... etc. How many decodings of some given number. Takes an input of numeral characters.
import sys
### Some little programming puzzle I found. Apparently Facebook asked this sometime.
###
### A message containing letters from A-Z is being encoded to numbers using the following mapping: a -> 1, b -> 2, ... etc. How many decodings of some given number.
###
### Takes an input of numeral characters.
input = sys.argv[1]
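The preview above stops after reading the argument. As a minimal sketch of one way to count the decodings (not necessarily the approach the gist takes), a two-variable dynamic program works: a digit can be decoded alone if it is 1-9, and together with the previous digit if the pair falls in 10-26. The function name `count_decodings` is hypothetical.

```python
def count_decodings(digits: str) -> int:
    """Count the ways to decode a digit string under a -> 1, ..., z -> 26."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of digits 0-9")
    if digits[0] == "0":
        return 0  # no letter maps to a code starting with 0

    # ways_prev2 / ways_prev1: decodings of the prefixes ending two and one
    # characters before the current position.
    ways_prev2, ways_prev1 = 1, 1
    for i in range(1, len(digits)):
        ways = 0
        if digits[i] != "0":
            ways += ways_prev1              # current digit decoded on its own
        if 10 <= int(digits[i - 1:i + 1]) <= 26:
            ways += ways_prev2              # current digit paired with previous
        ways_prev2, ways_prev1 = ways_prev1, ways
    return ways_prev1
```

For example, `count_decodings("226")` counts "bbf", "bz", and "vf". The gist's command-line form would pass `sys.argv[1]` to a function like this one.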

Data Engineering Capstone Project -- Bryan Bischof

Dec. 22, 2015

Project Description

Aspera's ASCP is a transfer protocol that is especially useful for large data transfers over suboptimal networks. In particular, ASCP is a UDP-based transfer with guaranteed delivery. FASPstream is a version of ASCP specifically for streaming data transfers. During a transfer of these types, a log file is produced that contains time-series data for

  • bandwidth
  • retransmission rate
#Asim Quotes
- "We keep saying that laziness is happening but we dont really have any proof" -Asim, "No, I can prove that my laziness is happening..."
- "If there is any motivation, money is one" -Asim
- "Ideas are worthless" -Asim
- "Capstone projects are harder than starting a company[sic]" -Asim
- "Any time you see penalties, that is a signal that there is a business there." -Asim
- "Your cat will never turn into a toaster." -Asim
- "You can't lick a volume." -Asim
def memoize(f):
    cache = {}
    def decorated_function(*args):
        if args in cache:
            return cache[args]
        else:
            cache[args] = f(*args)
            return cache[args]
    return decorated_function
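A short usage sketch for the decorator above, assuming it is applied to a recursive function. The `fib` function and the `call_log` list are hypothetical and exist only to show that each argument is computed once. Because the cache is keyed on the `args` tuple, this only works for hashable positional arguments.

```python
def memoize(f):
    cache = {}
    def decorated_function(*args):
        if args in cache:
            return cache[args]
        else:
            cache[args] = f(*args)
            return cache[args]
    return decorated_function

call_log = []  # records every argument that was actually computed

@memoize
def fib(n):
    call_log.append(n)
    # The recursive calls go through the decorated name, so they hit the cache.
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

Without the decorator, `fib(30)` makes over a million recursive calls. With it, each value from 0 to 30 is computed exactly once.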
@BBischof
BBischof / hist.sh
Last active March 9, 2016 01:45
Makes a little histogram in bash from your data. Uses = for bars
while read d n
do
    printf "%s\t%${n}s" "$d" = | tr ' ' '=' ;
    echo " $n" ;
done < data.txt
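The loop above pads a single `=` to a field of width `n` and then turns the padding spaces into `=`, which yields a bar of `n` characters. The same output can be sketched in Python, assuming each input line holds a label followed by an integer count. The function names `parse_rows` and `histogram_lines` are hypothetical.

```python
def parse_rows(text):
    """Split 'label count' lines into (label, count) pairs."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        label, count = line.split(None, 1)
        rows.append((label, int(count)))
    return rows

def histogram_lines(rows, bar_char="="):
    """Render each (label, count) pair as 'label<TAB>bar count'."""
    return [f"{label}\t{bar_char * count} {count}" for label, count in rows]
```

One difference: a count of 0 gives an empty bar here, whereas the shell version always prints at least one `=`, because the `=` argument itself is printed. The shell version also replaces any spaces inside the label, since `tr` runs over the whole line.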
@BBischof
BBischof / expandIt.py
Created March 22, 2016 19:05
A solution to a codefights daily puzzle that was amusing and might be useful to reference
# You are given a string s composed of letters and numbers, which was compressed with some algorithm. Every letter in s is followed by a number (possibly with leading zeros), which represents the number of times this letter occurs consecutively. For example, "aaaaaaaabbbbbbcc" would be given as "a8b6c2".
# Decompress s, sort its letters and return the kth (1-based) letter of the obtained string.
# Note: each letter occurs in s no more than 251 times.
# Example
# Expand_It("a2b3c2a1", 2) = "a"
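The preview shows only the problem statement. One possible solution, sketched here as an assumption and not taken from the gist, avoids building the decompressed string at all: total the counts per letter, then walk the letters in sorted order until the running total reaches k. The name `expand_it` is a hypothetical Python spelling of `Expand_It`.

```python
import re
from collections import Counter

def expand_it(s: str, k: int) -> str:
    """Return the kth (1-based) letter of the sorted, decompressed string."""
    if k < 1:
        raise ValueError("k is 1-based and must be positive")

    # Sum the run lengths per letter; int() handles leading zeros.
    counts = Counter()
    for letter, digits in re.findall(r"([A-Za-z])(\d+)", s):
        counts[letter] += int(digits)

    # Walk letters in sorted order, consuming k as each run is skipped.
    remaining = k
    for letter in sorted(counts):
        if remaining <= counts[letter]:
            return letter
        remaining -= counts[letter]
    raise ValueError("k is larger than the decompressed length")
```

For the example in the statement, "a2b3c2a1" has three a's, three b's, and two c's, so the sorted string is "aaabbbcc" and the second letter is "a".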
@BBischof
BBischof / bean_counter.scala
Created April 1, 2016 18:47
Computing the amount of beans for the hopper
def bean_counter(output_shots: Int): Int = {
  if (output_shots >= 0) {
    (output_shots / 2) * 20 + (output_shots % 2) * 16
  } else {
    throw new IllegalArgumentException("Not a valid number of shots.")
  }
}
@BBischof
BBischof / takeWhileSumLess.scala
Created April 12, 2016 22:07
I like this example of grabbing elements from a List until you reach your desired sum.
{ var sum = 0; (1 to 10000) takeWhile { i => sum += i; sum <= 100 } }
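The Scala one-liner relies on a mutable `sum` captured by the `takeWhile` closure. A comparable Python sketch keeps the running total inside a generator instead. The name `take_while_sum_at_most` is hypothetical.

```python
def take_while_sum_at_most(values, limit):
    """Yield items from values while their running sum stays <= limit."""
    total = 0
    for value in values:
        total += value
        if total > limit:
            return  # stop at the first item that would exceed the limit
        yield value
```

With `range(1, 10001)` and a limit of 100 this yields 1 through 13, since 1 + ... + 13 = 91 and adding 14 gives 105.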