Fix the 'j' code to 'sc' code mappings
- Adjust the mapping provided by the data team to roll up the 'j' codes to the parent level 'sc' code
- Fill the gaps for the missing 'j' codes
Build our cluster centers
- With our manually judged 'r's, group them by their parent level 'sc' codes
- For each group take a random sample (75%?) to produce our uber vectors
- Use the relational termstats approach to calculate the weights
- These uber vectors will be used to pre-seed our cluster centers
JavaRDD<LabeledPoint> centers = .....
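The grouping and sampling steps above can be sketched in plain Java, without Spark, to make the idea concrete. This is only an illustration under stated assumptions: the `Judged` holder, the `seedCenters` method, and the fixed random seed are hypothetical names, and a plain arithmetic mean stands in for the relational termstats weighting, which these notes do not spell out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

final class SeedCentersSketch {

    /** Hypothetical holder: one manually judged record with its parent 'sc' code. */
    static final class Judged {
        final String scCode;
        final double[] features;

        Judged(String scCode, double[] features) {
            this.scCode = scCode;
            this.features = features;
        }
    }

    /**
     * Groups records by 'sc' code, samples a fraction of each group, and
     * averages the sampled vectors into one seed vector per code.
     * Assumption: every feature vector has the same length, and a simple
     * mean replaces the termstats weighting mentioned in the notes.
     */
    static Map<String, double[]> seedCenters(List<Judged> judged, double fraction, long seed) {
        Map<String, List<double[]>> groups = new TreeMap<>();
        for (Judged j : judged) {
            groups.computeIfAbsent(j.scCode, k -> new ArrayList<>()).add(j.features);
        }

        Random rng = new Random(seed);
        Map<String, double[]> centers = new TreeMap<>();
        for (Map.Entry<String, List<double[]>> e : groups.entrySet()) {
            List<double[]> members = new ArrayList<>(e.getValue());
            Collections.shuffle(members, rng); // random order, then take a prefix as the sample
            int n = Math.max(1, (int) Math.round(members.size() * fraction));

            double[] mean = new double[members.get(0).length];
            for (double[] v : members.subList(0, n)) {
                for (int d = 0; d < mean.length; d++) {
                    mean[d] += v[d];
                }
            }
            for (int d = 0; d < mean.length; d++) {
                mean[d] /= n;
            }
            centers.put(e.getKey(), mean);
        }
        return centers;
    }
}
```

Calling `seedCenters(records, 0.75, 42L)` returns one vector per 'sc' code. In a Spark job, those vectors could then be handed to the clustering step as its initial centers.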
General Notes
--------------------------------------------------------------------------------------
- MR (MapReduce) is very I/O-bound
- MR is Java based
- Payload must be key-value pairs (KVPs)
- Workflow must be broken into Map-Reduce steps
- Spark uses HDFS as the data storage
- YARN is the cluster manager that runs the jobs (alternatives include Spark's standalone manager and Mesos)
- Spark does more localized disk I/O than MR, which does all of its I/O through Hadoop (HDFS)
- Spark can be orders of magnitude faster because it utilizes memory and local disk. It does not need to persist to HDFS in between each map-reduce step. The map-to-reduce shuffle still writes to disk, but subsequent operations use the data already on the partitions rather than re-reading it from HDFS as MR does.
-------------------------------------------------------------------------------------
Structured Streaming for Machine Learning
-------------------------------------------------------------------------------------
@shendrickson16 (http://github.com/sethah)
http://github.com/holdenk (another presenter)
- Structured Streaming is still very much ALPHA (boooo!)
- extends the Dataset API
- Datasets
  - new in Spark 1.6
  - give a strongly typed version of DataFrames
/**
 * This class is used to compare strings for relative distance using
 * the Levenshtein Distance algorithm. Essentially it determines how
 * many operations (inserting a letter, deleting a letter, substituting a letter)
 * it takes to go from string A to string B.
 *
 * According to Wikipedia, most 'innocent' misspellings differ by two operations
 *
 * Justin Alpino
 *
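The class body is not shown here, so the following is only a minimal sketch of the standard dynamic-programming form of Levenshtein distance, not the author's implementation. The class name `LevenshteinSketch` and method name `distance` are hypothetical.

```java
final class LevenshteinSketch {

    /**
     * Returns the minimum number of single-character insertions, deletions,
     * and substitutions needed to turn string a into string b.
     * Keeps only two rows of the DP table, so memory is O(b.length()).
     */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev; // reuse the two rows instead of allocating per iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A spelling check built on this could, for example, treat two words as a likely match when `distance(a, b) <= 2`, in line with the two-operation rule of thumb mentioned in the comment.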
<!--- first stab at wordSearch custom tag (still very very buggy) --->
<cfsetting enablecfoutputonly = "false"/>
<cfparam name="attributes.rows" default="12" type="numeric" />
<cfparam name="attributes.cols" default="12" type="numeric" />
<!--- Constants --->
<cfset variables.HORIZONTAL = 1>
<cfset variables.VERTICAL = 2>
<cfset variables.DIAGONAL = 3>
<cfset variables.REVERSE = 1>
<!---
This example uses a combination of different methods in Collections.cfc
to find the most popular words being used in the descriptions of the feeds
that ColdfusionBloggers.org aggregates. This approach is not the most
efficient way to reach the result, but it gives a practical example of various
methods in the library.
Grab the Collections.cfc from my repo http://github.com/jalpino/collections
and place it in the same folder as this example.