Justin Alpino jalpino

## gist:89d629e4743de54f939d90f84a55819b
Fix the 'j' code to 'sc' code mappings
   - Adjust the mapping provided by the data team to roll up the 'j' codes to the parent level 'sc' code
   - Fill the gaps for the missing 'j' codes

Build our cluster centers
   - With our manually judged 'r's, group them by their parent level 'sc' codes
   - For each group take a random sample (75%?) to produce our uber vectors
      - use the relational termstats approach to calculate the weights
   - These uber vectors will be used to pre-seed our cluster centers
   JavaRDD<LabeledPoint> centers = .....
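
A rough sketch of the cluster-center steps above, in plain JavaScript rather than Spark. The name `buildCenters`, the `{ sc, vector }` record shape and the injectable `rng` are assumptions for illustration. A plain component-wise mean stands in for the relational termstats weighting, which these notes do not spell out.

```javascript
// Hypothetical helper: build one "uber vector" per parent-level 'sc' code.
// ASSUMPTION: a plain mean replaces the termstats weighting from the notes.
function buildCenters(judged, fraction = 0.75, rng = Math.random) {
  // Group the manually judged vectors by their parent-level 'sc' code.
  const groups = new Map();
  for (const { sc, vector } of judged) {
    if (!groups.has(sc)) groups.set(sc, []);
    groups.get(sc).push(vector);
  }

  const centers = new Map();
  for (const [sc, vectors] of groups) {
    // Fisher-Yates shuffle on a copy, then keep the first `fraction` of it.
    const pool = vectors.slice();
    for (let i = pool.length - 1; i > 0; i--) {
      const j = Math.floor(rng() * (i + 1));
      [pool[i], pool[j]] = [pool[j], pool[i]];
    }
    const keep = Math.max(1, Math.round(pool.length * fraction));
    const sample = pool.slice(0, keep);

    // Sum the sampled vectors component-wise, then divide by the sample size.
    const sums = new Array(sample[0].length).fill(0);
    for (const v of sample) {
      v.forEach((x, k) => { sums[k] += x; });
    }
    centers.set(sc, sums.map((s) => s / sample.length));
  }
  return centers;
}
```

The returned map holds one vector per 'sc' code, which is the shape needed to pre-seed the cluster centers. The items left out of each sample (25% here) stay available for checking the clusters afterwards.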

## gist:c1bc823423369c3cac2b8bda70b8439b
General Notes
--------------------------------------------------------------------------------------
- MR is very I/O-bound
	- MR is Java based
	- Payload must be key-value pairs (KVPs)
	- Workflow must be broken into Map-Reduce steps
- Spark uses HDFS for data storage
- YARN is the cluster manager that schedules execution (there is also the standalone Spark manager, and others like Mesos)
- Spark does more localized disk I/O than MR, which is all Hadoop
- Spark can run orders of magnitude faster because it uses memory and local disk. It does not need to persist to disk in between each map-reduce set. The map-to-reduce handoff does, but subsequent operations use the data already on the partitions rather than re-reading it from HDFS like MR does.
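
To make the key-value and Map-Reduce constraints above concrete, here is an in-memory sketch in plain JavaScript. It is not the Hadoop API. The names `mapStep`, `shuffle`, `reduceStep` and `wordCount` are hypothetical, and the word-count job is only an illustration.

```javascript
// Map step: emit one [key, value] pair per word in a line.
function mapStep(line) {
  return line
    .toLowerCase()
    .split(/\W+/)
    .filter(Boolean)
    .map((word) => [word, 1]);
}

// Shuffle: group all emitted values by key, as happens between map and reduce.
function shuffle(pairs) {
  const grouped = new Map();
  for (const [key, value] of pairs) {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  }
  return grouped;
}

// Reduce step: collapse the values for one key into a single [key, total] pair.
function reduceStep(key, values) {
  return [key, values.reduce((a, b) => a + b, 0)];
}

// Driver: every stage consumes and produces key-value pairs.
function wordCount(lines) {
  const pairs = lines.flatMap(mapStep);
  return [...shuffle(pairs)].map(([key, values]) => reduceStep(key, values));
}
```

Anything that does not fit one map and one reduce has to be chained as further jobs. That is where MR pays for the extra HDFS round trips described above.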

## gist:e985984a27a69579a7c5c6b5e825a25c
-------------------------------------------------------------------------------------
Structured Streaming for Machine Learning
-------------------------------------------------------------------------------------
@shendrickson16   (http://github.com/sethah)
http://github.com/holdenk (another presenter)
- Structured Streaming is still very much ALPHA (boooo!)
  - extends the Dataset API
- Datasets
  - new in Spark 1.6
  - give a strongly typed version of DataFrames
## levenshtein.js
/**
 * This class is used to compare strings for relative distance using
 * the Levenshtein Distance algorithm. Essentially it determines how
 * many operations (inserting a letter, removing a letter, substituting a letter)
 * it takes to go from string A to string B.
 *
 * According to Wikipedia, most 'innocent' misspellings differ by two operations
 *
 * Justin Alpino
 *
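
A minimal sketch of the distance calculation described above, using the standard two-row dynamic-programming formulation. The function name `levenshtein` is an assumption. The original class's API is not shown in this preview and may differ.

```javascript
// Levenshtein distance between strings a and b (illustrative sketch).
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);

  for (let i = 1; i <= a.length; i++) {
    // Turning the first i chars of a into an empty string takes i deletions.
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // remove a letter from a
        curr[j - 1] + 1,    // add a letter to a
        prev[j - 1] + cost  // substitute a letter (free when they match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}
```

For example, `levenshtein("kitten", "sitting")` returns 3: two substitutions and one added letter.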

## gist:1063473
<!---first stab at wordSearch custom tag (still very very buggy)--->
<cfsetting enablecfoutputonly = "false"/>
<cfparam name="attributes.rows" default="12" type="numeric" />
<cfparam name="attributes.cols" default="12" type="numeric" />

<!--- Constants --->
<cfset variables.HORIZONTAL = 1>
<cfset variables.VERTICAL = 2>
<cfset variables.DIAGONAL = 3>
<cfset variables.REVERSE = 1>

## wordcount.cfm
<!---

	This example uses a combination of different methods in Collections.cfc
	to find the most popular words used in the descriptions of the feeds
	that ColdfusionBloggers.org aggregates. This approach is not the most
	efficient way to reach the result, but it gives a practical example of
	various methods in the library.

	Grab the Collections.cfc from my repo http://github.com/jalpino/collections
	and place it in the same folder as this example.