
@entaroadun
entaroadun / gist:1653794
Created January 21, 2012 20:10
Recommendation and Ratings Public Data Sets For Machine Learning

Movie recommendation:

Music recommendation:

@ravasthi
ravasthi / _config.yml
Created February 15, 2012 08:59
Multiple authors on Jekyll
authors:
  hanzou:
    name: Hanzou Hattori
    display_name: Hanzou
    gravatar: c66919cb194f96c696c1da0c47354a6a
    email: hanzou@company.com
    web: http://company.com
    twitter: company
    github: hhattori
  jorgen:
from unicodedata import *
script_data = {
"names":['Common', 'Latin', 'Greek', 'Cyrillic', 'Armenian', 'Hebrew', 'Arabic',
'Syriac', 'Thaana', 'Devanagari', 'Bengali', 'Gurmukhi', 'Gujarati', 'Oriya',
'Tamil', 'Telugu', 'Kannada', 'Malayalam', 'Sinhala', 'Thai', 'Lao', 'Tibetan',
'Myanmar', 'Georgian', 'Hangul', 'Ethiopic', 'Cherokee', 'Canadian_Aboriginal',
'Ogham', 'Runic', 'Khmer', 'Mongolian', 'Hiragana', 'Katakana', 'Bopomofo',
'Han', 'Yi', 'Old_Italic', 'Gothic', 'Deseret', 'Inherited', 'Tagalog',
'Hanunoo', 'Buhid', 'Tagbanwa', 'Limbu', 'Tai_Le', 'Linear_B', 'Ugaritic',
@dreampuf
dreampuf / gist:3946886
Created October 24, 2012 15:51
Pinyin Split
#!/usr/bin/env python
#vim: encoding=utf-8
"""
Pinyin word segmentation (拼音分词)
"""
__author__ = "dreampuf<soddyque@gmail.com>"
import unittest
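The preview cuts off after the imports. As a rough illustration of the idea (not the gist's code), here is a minimal longest-match splitter over a tiny, assumed syllable set; a real implementation would use the full pinyin syllable table.

# Toy longest-match pinyin splitter; SYLLABLES is a stand-in for the full syllable table.
SYLLABLES = {"ni", "hao", "zhong", "guo", "xi", "an", "xian"}

def split_pinyin(s, longest=5):
    parts = []
    i = 0
    while i < len(s):
        # Try the longest candidate syllable first, then shrink.
        for j in range(min(len(s), i + longest), i, -1):
            if s[i:j] in SYLLABLES:
                parts.append(s[i:j])
                i = j
                break
        else:
            return None  # no syllable matches at this position
    return parts

print(split_pinyin("nihao"))     # ['ni', 'hao']
print(split_pinyin("zhongguo"))  # ['zhong', 'guo']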
@luw2007
luw2007 / 词性标记.md
Last active June 29, 2024 14:17
POS tags: covers the ICTPOS 3.0 POS tag set, the ICTCLAS Chinese POS tag set, the POS tags found in the jieba dictionary, and the POS tags that can be ignored for simhash

Word categories

  • Content words: nouns, verbs, adjectives, state words, distinguishing words, numerals, measure words, pronouns
  • Function words: adverbs, prepositions, conjunctions, particles, onomatopoeia, interjections

ICTPOS 3.0 POS tag set

n noun

nr person name
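For a quick, hedged illustration of where these tags show up in practice (assuming the jieba package is installed), its POS tagger returns flags drawn from this set, e.g. n and nr:

import jieba.posseg as pseg

# Each segmented word comes back with a POS flag such as 'n' (noun) or 'nr' (person name).
for word, flag in pseg.cut("李小福是创新办主任"):
    print(word, flag)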

@wpm
wpm / spark_parallel_boost.py
Last active December 3, 2018 02:56
A simple example of how to integrate the Spark parallel computing framework and the scikit-learn machine learning toolkit. This script randomly generates test and train data sets, trains an ensemble of decision trees using boosting, and applies the ensemble to the test set. The ensemble training is done in parallel.
from pyspark import SparkContext
import numpy as np
from sklearn.cross_validation import train_test_split, Bootstrap
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
def run(sc):
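The preview stops at run(sc). Since true boosting is sequential, a simple way to parallelize with Spark is bagging-style: train each tree on its own bootstrap resample in a separate task and combine by majority vote. The sketch below follows that idea with current APIs (sklearn.model_selection rather than the long-removed sklearn.cross_validation) and is an assumption about the approach, not the gist's exact code.

import numpy as np
from pyspark import SparkContext
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def train_tree(seed, X, y):
    # Fit one tree on a bootstrap resample drawn with the given seed.
    rng = np.random.RandomState(seed)
    idx = rng.randint(0, len(y), len(y))
    return DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])

if __name__ == "__main__":
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    sc = SparkContext("local[*]", "parallel_trees")
    trees = sc.parallelize(range(10)).map(lambda s: train_tree(s, X_train, y_train)).collect()
    # Average the per-tree predictions and threshold for a majority vote.
    votes = np.mean([t.predict(X_test) for t in trees], axis=0)
    print("accuracy:", np.mean((votes > 0.5) == y_test))
    sc.stop()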
@songron
songron / NTU - Machine Learning
Last active August 25, 2018 08:11
Code and programming assignments for Machine Learning Foundations (NTU), National Taiwan University's 机器学习基石 course (Coursera edition).
Code for Machine Learning Foundations (NTU)
Programming assignments and related code for National Taiwan University's 机器学习基石 course (Coursera edition).
Course page: https://class.coursera.org/ntumlone-001/
import numpy as np
import marisa_trie
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.externals import six
class MarisaCountVectorizer(CountVectorizer):
# The ``CountVectorizer.fit`` method calls ``fit_transform``, so ``fit``
# does not need to be overridden here.
def fit_transform(self, raw_documents, y=None):
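The preview ends before the body of fit_transform. A plausible sketch of the usual memory-saving trick follows: fit the ordinary vocabulary, rebuild it as a marisa_trie.Trie, and permute the matrix columns to the trie's key ids. This is an assumption about the approach, not the gist's verbatim code.

import numpy as np
import marisa_trie
from sklearn.feature_extraction.text import CountVectorizer

class MarisaCountVectorizer(CountVectorizer):
    def fit_transform(self, raw_documents, y=None):
        X = super().fit_transform(raw_documents, y)
        old_vocab = self.vocabulary_               # term -> column index (plain dict)
        trie = marisa_trie.Trie(old_vocab.keys())  # term -> compact key id
        # perm[new_col] = old_col, so that column trie[t] holds the counts for term t.
        perm = np.empty(len(old_vocab), dtype=np.intp)
        for term, old_idx in old_vocab.items():
            perm[trie[term]] = old_idx
        self.vocabulary_ = trie
        return X[:, perm]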
@willpatera
willpatera / Google-Sheet-Form-Post.md
Last active May 3, 2024 12:57
Post to a Google spreadsheet from an HTML form

Overview

This collection of files serves as a simple static demonstration of how to post to a Google spreadsheet from an external HTML <form>, following the example by Martin Hawksey.

Deprecation warning: this code is not maintained and should be seen as a reference implementation only. If you're looking to add features or update it, fork the code and update it as needed.

Run example

You should be able to just open index.html in your browser and test locally.
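For testing the endpoint outside the browser, here is a hedged Python sketch of the same POST; the SCRIPT_URL deployment ID and the field names are hypothetical and must match your published Apps Script web app and the form's input names.

import requests

# Hypothetical URL of the published Google Apps Script web app.
SCRIPT_URL = "https://script.google.com/macros/s/EXAMPLE_DEPLOYMENT_ID/exec"

# Keys must match the <input name="..."> attributes the script expects.
payload = {"name": "Ada Lovelace", "email": "ada@example.com", "message": "hello"}

resp = requests.post(SCRIPT_URL, data=payload)
print(resp.status_code, resp.text[:200])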

@ramhiser
ramhiser / one-hot.py
Last active April 7, 2021 06:44
Apply one-hot encoding to a pandas DataFrame
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
def encode_onehot(df, cols):
"""
One-hot encoding is applied to columns specified in a pandas DataFrame.
Modified from: https://gist.github.com/kljensen/5452382
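The preview cuts off inside the docstring. As a point of comparison, a minimal sketch of the same idea using pandas alone (pd.get_dummies) rather than the gist's DictVectorizer-based version:

import pandas as pd

def encode_onehot_simple(df, cols):
    # One-hot encode the listed columns and drop the originals;
    # get_dummies prefixes each new column with its source column name.
    dummies = pd.get_dummies(df[cols], columns=cols)
    return pd.concat([df.drop(columns=cols), dummies], axis=1)

df = pd.DataFrame({"color": ["red", "blue", "red"], "value": [1, 2, 3]})
print(encode_onehot_simple(df, ["color"]))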