Prasad Chalasani pchalasani

## normcore-llm.md

      
veekaybee / normcore-llm.md (1 file, 306 forks, 40 comments, 3217 stars)
Last active November 3, 2024 18:57

Normcore LLM Reads

Anti-hype LLM reading list

Goals: Add links that are reasonable and good explanations of how stuff works. No hype and no vendor content if possible. Practical first-hand accounts of models in prod eagerly sought.
Foundational Concepts


Pre-Transformer Models


## gist:ea883a19232833cf2647
import com.twitter.scalding._
import com.twitter.algebird._

/**
 * More sensible aggregation with Monoids.
 * Use SketchMap to get only the top words that we are interested in.
 * SketchMap is a generalization of the CountMinSketch in Algebird. It holds a list of the top items.
 * The size of the CMS will not grow, so this will not run out of memory.
 */
class WordCount5(args: Args) extends Job(args)  {
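
The comment above relies on the fact that a Count-Min Sketch uses a fixed amount of memory no matter how many distinct words arrive. The following is a minimal Python sketch of that idea, not Algebird's implementation: the class name, the `width` and `depth` defaults, and the use of salted BLAKE2 hashes are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency sketch: `depth` rows of `width` counters each."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Salting the hash with the row number gives one hash function per row.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Every row gets updated; memory use never changes.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )

    def merge(self, other):
        # Sketches of the same shape combine cell by cell, which is what
        # makes the structure usable as a monoid in an aggregation.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same shape")
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [
                a + b for a, b in zip(self.table[row], other.table[row])
            ]
        return merged


left, right = CountMinSketch(), CountMinSketch()
for word in ["spark", "scalding", "spark"]:
    left.add(word)
for word in ["spark", "algebird"]:
    right.add(word)
combined = left.merge(right)
```

Estimates from `combined.estimate(word)` are never below the true count; how far above they can be depends on the chosen `width` and `depth`.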

## spark_ide.py
#!/public/spark-0.9.1/bin/pyspark

import os
import sys

# Set the path for spark installation
# this is the path where you have built spark using sbt/sbt assembly
os.environ['SPARK_HOME'] = "/public/spark-0.9.1"
# os.environ['SPARK_HOME'] = "/home/jie/d2/spark-0.9.1"
# Append to PYTHONPATH so that pyspark could be found
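
The preview stops at the comment about making `pyspark` importable. One way this step is commonly done is sketched below, under the assumption that the PySpark sources live in the `python` subdirectory of `SPARK_HOME`, as they do in Spark source distributions. The helper name `add_pyspark_to_path` is hypothetical, and a bundled Py4J archive may also need to be added depending on the Spark version.

```python
import os
import sys


def add_pyspark_to_path(spark_home):
    """Append Spark's Python directory to sys.path (assumed layout)."""
    pyspark_dir = os.path.join(spark_home, "python")
    if pyspark_dir not in sys.path:
        sys.path.append(pyspark_dir)
    return pyspark_dir


# Fall back to the path used in the snippet above if SPARK_HOME is unset.
spark_home = os.environ.get("SPARK_HOME", "/public/spark-0.9.1")
pyspark_dir = add_pyspark_to_path(spark_home)
```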

## gist:8172796

      
debasishg / gist:8172796 (1 file, 405 forks, 23 comments, 1649 stars)
Last active October 3, 2024 12:09

A collection of links for streaming algorithms and data structures

General Background and Overview


Probabilistic Data Structures for Web Analytics and Data Mining : A great overview of the space of probabilistic data structures and how they are used to implement approximation algorithms.
Models and Issues in Data Stream Systems
Philippe Flajolet’s contribution to streaming algorithms : A presentation by Jérémie Lumbroso that visits some of the historical perspectives and how it all began with Flajolet.
Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
[Methods for Finding Frequent Items in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&amp;rep=rep1&amp;t
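
To make the frequent-items problem in the links above concrete, here is a minimal Python sketch of one classic counter-based approach, the Misra-Gries summary. It is offered as an illustration of the problem, not as the specific algorithm of any one paper listed; the function name and the parameter `k` are chosen here for the example.

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters while scanning the stream once.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to survive, and each surviving counter undercounts the
    true frequency by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a", "b", "a", "c", "a", "d", "a", "e", "a"]
summary = misra_gries(stream, k=3)
```

A second pass over the data is needed if exact counts for the surviving candidates are required.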


## ItemSimilarity.scala
import com.twitter.scalding._
import com.twitter.algebird.{ MinHasher, MinHasher32, MinHashSignature }

/**
 * Computes similar items (with a string itemId), based on approximate
 * Jaccard similarity, using LSH.
 *
 * Assumes an input data TSV file of the following format:
 *
 *    itemId   userId
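
The approach described in this comment, approximate Jaccard similarity via MinHash with LSH banding, can be sketched in plain Python as follows. This is not Algebird's `MinHasher`; the function names, the 128-hash default, and the salted BLAKE2 hashing are assumptions for illustration. Each item is represented by the non-empty set of user IDs seen with it, matching the `itemId userId` input format.

```python
import hashlib


def _hash(seed, value):
    # One 64-bit hash function per seed, derived by salting BLAKE2.
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(user_ids, num_hashes=128):
    """Signature = minimum hash of the (non-empty) set under each function."""
    return [min(_hash(seed, u) for u in user_ids) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    # The fraction of agreeing positions estimates |A & B| / |A | B|.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_buckets(signature, bands):
    """Split a signature into bands; items sharing any bucket are candidates."""
    rows = len(signature) // bands
    return [
        (band, tuple(signature[band * rows:(band + 1) * rows]))
        for band in range(bands)
    ]


item_x = {"u1", "u2", "u3", "u4"}
item_y = {"u1", "u2", "u3", "u5"}
item_z = {"u8", "u9"}

sig_x = minhash_signature(item_x)
sig_y = minhash_signature(item_y)
sig_z = minhash_signature(item_z)
similarity_xy = estimated_jaccard(sig_x, sig_y)  # true Jaccard is 3/5
```

Banding trades precision for speed: only pairs that land in at least one common bucket are compared, so most dissimilar pairs are never examined.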

## cascalog-map.clj
(defn cascalog-map
  [op-var output-fields & {:keys [stateful?]}]
  (let [ser (KryoService/serialize (ops/fn-spec op-var))]
    (proxy [BaseOperation Function] [^Fields output-fields]
      (prepare [^FlowProcess flow-process ^OperationCall op-call]
        (let [op (Util/bootFn (KryoService/deserialize ser))]
          (-> op-call
              (.setContext [op (if stateful? (op))]))))
      (operate [^FlowProcess flow-process ^FunctionCall fn-call]
        (let [[op] (.getContext fn-call)