Bitwise tricks

Inspired by this article. Neat tricks for speeding up integer computations.

Note: cin.sync_with_stdio(false); disables the synchronization of C++ streams with C's stdio and gives you a performance boost. If you use it, read input only through cin (don't mix cin and scanf while sync is disabled, for example) or you will get unexpected results.

Multiply by a power of 2

x = x << 1; // x = x * 2
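
The same idea works for any power of two, and a right shift divides. In Python, for example:

x = 13
assert x << 3 == x * 8   # shifting left by n multiplies by 2**n
assert x >> 2 == x // 4  # shifting right by n floor-divides by 2**n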

# Dirichlet process Gaussian mixture model
import numpy as np
from scipy.special import gammaln
from scipy.linalg import cholesky
from sliceSample import sliceSample

def multinomialDraw(dist):
    """Return the index of a single draw from the given multinomial distribution."""
    return np.random.multinomial(1, dist).argmax()
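
# Example usage (illustrative): a single draw from a three-component
# distribution returns the index of the sampled component.
weights = np.array([0.2, 0.5, 0.3])  # mixture weights, must sum to 1
k = multinomialDraw(weights)         # k is 0, 1, or 2
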
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

def get_vectors(vocab_size=5000):
    newsgroups_train = fetch_20newsgroups(subset='train')
    vectorizer = CountVectorizer(max_df=.9, max_features=vocab_size)
    vecs = vectorizer.fit_transform(newsgroups_train.data)
    vocabulary = vectorizer.vocabulary_  # the fitted vocabulary attribute ends with an underscore
    terms = np.array(list(vocabulary.keys()))
import spark.SparkContext
import SparkContext._
/**
* A port of [[http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/]]
* to Spark.
* Uses movie ratings data from MovieLens 100k dataset found at [[http://www.grouplens.org/node/73]]
*/
object MovieSimilarities {

Text Classification

To demonstrate text classification with scikit-learn, we'll build a simple spam filter. The filters in production at services like Gmail are vastly more sophisticated, but the model we'll have by the end of this chapter is effective and surprisingly accurate.

Spam filtering is the "hello world" of document classification, but we aren't limited to two classes: the classifier we'll use supports multi-class classification, which opens up possibilities such as author identification and support email routing. In this example, however, we'll stick to two classes: SPAM and HAM.

For this exercise, we'll use a combination of the Enron-Spam data sets and the SpamAssassin public corpus. Both are publicly available and are retrieved from the internet during the setup phase of the example code that accompanies this chapter.
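
As a rough preview of where we're headed, here is a minimal two-class sketch using scikit-learn's CountVectorizer and MultinomialNB. The messages and labels below are placeholders, not the Enron-Spam or SpamAssassin data this chapter actually loads:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the real email data sets.
messages = ["win a free prize now", "meeting rescheduled to friday",
            "cheap meds online", "lunch tomorrow?"]
labels = ["SPAM", "HAM", "SPAM", "HAM"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)  # bag-of-words term counts

classifier = MultinomialNB()
classifier.fit(counts, labels)

print(classifier.predict(vectorizer.transform(["free prize meds"])))  # ['SPAM'] on this toy data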

Loading Examples

# coding=UTF-8
from __future__ import division
import nltk
from collections import Counter
# This is a simple tool for adding automatic hashtags into an article title
# Created by Shlomi Babluki
# Sep, 2013
import urllib2
import re
import sys
from collections import defaultdict
from random import random
"""
PLEASE DO NOT RUN THIS QUOTED CODE FOR THE SAKE OF daemonology's SERVER, IT IS
NOT MY SERVER AND I FEEL BAD FOR ABUSING IT. JUST GET THE RESULTS OF THE
CRAWL HERE: http://pastebin.com/raw.php?i=nqpsnTtW AND SAVE THEM TO "archive.txt"
  1. General Background and Overview