Ali Hürriyetoğlu ahurriyetoglu

## gist:38574f7ac70cb04e8eb6

      
Created June 23, 2014 08:23, forked from debasishg/gist:8172796

General Background and Overview


Probabilistic Data Structures for Web Analytics and Data Mining: An overview of probabilistic data structures and how they are used to implement approximation algorithms.
Models and Issues in Data Stream Systems
Philippe Flajolet’s contribution to streaming algorithms: A presentation by Jérémie Lumbroso that covers the historical background and how the field began with Flajolet's work.
Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
[Methods for Finding Frequent Items in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&amp;rep


## gist:29189bae26bbd0f7a82a
>>> from pandas import DataFrame
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ["You can catch more flies with honey than you can with vinegar.",
...         "You can lead a horse to water, but you can't make him drink."]
>>> vect = CountVectorizer(min_df=0., max_df=1.0)
>>> X = vect.fit_transform(docs)
>>> print(DataFrame(X.toarray(), columns=vect.get_feature_names_out()).to_string())
   but  can  catch  drink  flies  him  honey  horse  lead  make  more  than  to  vinegar  water  with  you
0    0    2      1      0      1    0      1      0     0     0     1     1   0        1      0     2    2
1    1    2      0      1      0    1      0      1     1     1     0     0   1        0      1     0    2
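
The term counts in the session above can be reproduced without scikit-learn, which may help show what the vectorizer is doing. The following is a minimal sketch, not the library's implementation: it assumes the vectorizer's default behaviour of lowercasing the text and keeping only tokens of two or more word characters, and the names `TOKEN`, `term_counts`, `vocab` and `matrix` are made up for this illustration.

```python
import re
from collections import Counter

docs = ["You can catch more flies with honey than you can with vinegar.",
        "You can lead a horse to water, but you can't make him drink."]

# Assumption: tokens are runs of two or more word characters, so "a" and
# the "t" in "can't" are dropped, as in the output shown above.
TOKEN = re.compile(r"\b\w\w+\b")


def term_counts(doc):
    """Count lowercased tokens in a single document."""
    return Counter(TOKEN.findall(doc.lower()))


counts = [term_counts(d) for d in docs]

# The vocabulary is the sorted union of all tokens; each column of the
# matrix corresponds to one vocabulary entry.
vocab = sorted(set().union(*counts))

# A Counter returns 0 for missing keys, which gives the zero entries.
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each row of `matrix` should line up with the corresponding row of the DataFrame printed above, with `vocab` playing the role of the column labels.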

## kmeans.py
#!/usr/bin/python
#
# K-means clustering using Lloyd's algorithm in pure Python.
# Written by Lars Buitinck. This code is in the public domain.
#
# The main program runs the clustering algorithm on a bunch of text documents
# specified as command-line arguments. These documents are first converted to
# sparse vectors, represented as lists of (index, value) pairs.

from collections import defaultdict