Justin Alpino jalpino

Fix the 'j' code to 'sc' code mappings
- Adjust the mapping provided by the data team to roll up the 'j' codes to the parent level 'sc' code
- Fill the gaps for the missing 'j' codes
Build our cluster centers
- With our manually judged 'r's, group them by their parent level 'sc' codes
- For each group take a random sample (75%?) to produce our uber vectors
- Use the relational termstats approach to calculate the weights
- These uber vectors will be used to pre-seed our cluster centers
JavaRDD<LabeledPoint> centers = .....
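A minimal sketch of the seeding steps above, in plain JavaScript rather than Spark. The record shape (`{ sc, vector }`), the function names, and the injectable random generator are assumptions for illustration, and a plain element-wise mean stands in for the relational termstats weighting, which these notes do not define.

```javascript
// Hypothetical record shape: { sc: "sc-code", vector: [numbers] }.

// Group the manually judged records by their parent-level 'sc' code.
function groupBySc(records) {
  const groups = new Map();
  for (const rec of records) {
    if (!groups.has(rec.sc)) groups.set(rec.sc, []);
    groups.get(rec.sc).push(rec);
  }
  return groups;
}

// Take a random sample of roughly `fraction` of the items (at least one),
// using a partial Fisher-Yates shuffle. `rand` returns a number in [0, 1).
function sample(items, fraction, rand = Math.random) {
  const copy = items.slice();
  const k = Math.max(1, Math.round(copy.length * fraction));
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(rand() * (copy.length - i));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, k);
}

// Element-wise mean; an assumed stand-in for the termstats weighting.
function meanVector(vectors) {
  const dim = vectors[0].length;
  const sum = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let d = 0; d < dim; d++) sum[d] += v[d];
  }
  return sum.map((x) => x / vectors.length);
}

// Build one "uber vector" per 'sc' code to pre-seed the cluster centers.
function buildCenters(records, fraction = 0.75, rand = Math.random) {
  const centers = new Map();
  for (const [sc, group] of groupBySc(records)) {
    const picked = sample(group, fraction, rand);
    centers.set(sc, meanVector(picked.map((r) => r.vector)));
  }
  return centers;
}
```

Passing `rand` explicitly makes the sampling repeatable in tests; in a real job the sampling and averaging would presumably run per group on the cluster.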
jalpino / gist:c1bc823423369c3cac2b8bda70b8439b
Created December 2, 2016 16:01
Notes from Spark Developer Training
General Notes
--------------------------------------------------------------------------------------
- MR is very io-bound
- MR is Java based
- Payload must be kvps
- Workflow must be broken into Map-Reduce steps
- Spark can use HDFS for data storage
- YARN is the cluster manager (there is also a standalone Spark manager, and others like Mesos)
- Spark does more localized disk I/O than MR, which reads and writes everything through Hadoop (HDFS)
- Spark can be orders of magnitude faster because it utilizes memory and local disk. It does not need to persist to HDFS in between each map-reduce step. The shuffle from map to reduce still goes to disk, but subsequent operations use the data already on the partitions rather than re-reading it from HDFS as MR does.
jalpino / gist:e985984a27a69579a7c5c6b5e825a25c
Created October 2, 2016 14:31
Notes from Strata+Hadoop 2016 NYC
-------------------------------------------------------------------------------------
Structured Streaming for Machine Learning
-------------------------------------------------------------------------------------
@shendrickson16 (http://github.com/sethah)
http://github.com/holdenk (another presenter)
- Structured Streaming is still very much ALPHA (boooo!)
- extends the dataset API
- Datasets
  - new in Spark 1.6
  - give a strongly typed version of DataFrames
jalpino / levenshtein.js
Created August 6, 2012 13:23
Levenshtein Distance
/**
* This class is used to compare strings for relative distance using
* the Levenshtein Distance algorithm. Essentially it determines how
* many operations (adding of a letter, removing of a letter, substituting a letter)
* it takes to go from string A to string B.
*
* According to Wikipedia, most 'innocent' misspellings differ by two operations
*
* Justin Alpino
*
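The preview above is cut off before the implementation. As a hedged illustration of the algorithm the comment describes, here is a minimal sketch of the standard dynamic-programming form of Levenshtein distance. The function name `levenshtein` and the two-row layout are assumptions, not necessarily what the original file does.

```javascript
// Levenshtein distance via the classic dynamic-programming recurrence,
// keeping only two rows of the table in memory.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // turning i chars of a into "" takes i removals
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // remove a letter
        curr[j - 1] + 1,    // add a letter
        prev[j - 1] + cost  // substitute a letter (free if they match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein("kitten", "sitting")); // prints 3
```

Keeping two rows instead of the full table uses memory proportional to the length of `b` rather than the product of both lengths.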
jalpino / gist:1063473
Created July 4, 2011 15:18
WordSearch Puzzle
<!---first stab at wordSearch custom tag (still very very buggy)--->
<cfsetting enablecfoutputonly = "false"/>
<cfparam name="attributes.rows" default="12" type="numeric" />
<cfparam name="attributes.cols" default="12" type="numeric" />
<!--- Constants --->
<cfset variables.HORIZONTAL = 1>
<cfset variables.VERTICAL = 2>
<cfset variables.DIAGONAL = 3>
<cfset variables.REVERSE = 1>
jalpino / wordcount.cfm
Created May 31, 2011 22:38
An example using Collections.cfc to map, reduce and manipulate data
<!---
This example uses a combination of different methods in Collections.cfc
to find the most popular words being used in the description of the feeds
that ColdfusionBloggers.org aggregates. This approach is not the most
efficient way to reach the result, but it gives a practical example of various
methods in the library.
Grab the Collections.cfc from my repo http://github.com/jalpino/collections
and place it in the same folder as this example.
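The map, reduce, and sort steps described above can be sketched in plain JavaScript. This does not use Collections.cfc or reproduce its API; the function names, the tokenizing rule, and the sample descriptions are assumptions for illustration only.

```javascript
// Split a description into lowercase words (letters and apostrophes only).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Map each description to its words, then reduce the words into counts.
function countWords(descriptions) {
  return descriptions
    .map(tokenize)
    .reduce((counts, words) => {
      for (const w of words) counts.set(w, (counts.get(w) || 0) + 1);
      return counts;
    }, new Map());
}

// Return the `n` most frequent words as [word, count] pairs,
// breaking ties alphabetically so the result is deterministic.
function topWords(descriptions, n) {
  return [...countWords(descriptions)]
    .sort((x, y) => y[1] - x[1] || (x[0] < y[0] ? -1 : x[0] > y[0] ? 1 : 0))
    .slice(0, n);
}

// Made-up feed descriptions, standing in for the aggregated feed data.
const sampleDescriptions = ["ColdFusion is fun", "ColdFusion rocks", "fun with CFML"];
console.log(topWords(sampleDescriptions, 2)); // [ [ 'coldfusion', 2 ], [ 'fun', 2 ] ]
```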