Hai Liang W. hailiang-wang

@erning
erning / hz2py.md
Created November 4, 2011 05:51
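Chinese characters to pinyin

Chinese Characters to Pinyin (hz2py)

We want Anjuke's search engine to tolerate homophone errors better, and pinyin-based matching is a good way to achieve this. That requires a component that converts Chinese characters to pinyin. Such a component has other uses as well, for example looking up residential community names or personal names by the initial letters of their pinyin.

We therefore need a general-purpose service that converts Chinese characters to pinyin.

Features

The basic function is romanization of Chinese: the input is a piece of Chinese text, and the output is that text converted to Hanyu Pinyin.

Full-width punctuation, spaces, and similar characters in the source text should be converted to their half-width equivalents. Where Chinese characters and English text are not separated by a space in the source, a space should be inserted between them after conversion to pinyin.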
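The requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the component the gist describes: `PINYIN` is a hypothetical five-character table standing in for a full dictionary, the names `to_halfwidth` and `hz2py` are made up for this example, syllables are separated by spaces (the text does not say how they should be joined), and polyphonic characters are ignored. Only the full-width ASCII variants (U+FF01 to U+FF5E) and the ideographic space (U+3000) are folded to half-width.

```python
# Hypothetical, deliberately tiny character-to-pinyin table; a real service
# would load a full dictionary and deal with polyphonic characters.
PINYIN = {"安": "an", "居": "ju", "客": "ke", "小": "xiao", "区": "qu"}


def to_halfwidth(text):
    """Fold full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:    # full-width '!' through '~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def hz2py(text):
    """Replace known Han characters with pinyin and space them from ASCII words."""
    out = []
    prev_kind = None  # "han", "ascii" (letter or digit), or None for anything else
    for ch in to_halfwidth(text):
        if ch in PINYIN:
            kind, piece = "han", PINYIN[ch]
        elif ch.isascii() and ch.isalnum():
            kind, piece = "ascii", ch
        else:
            kind, piece = None, ch        # punctuation, spaces, unknown characters
        # Add a space between two syllables and between a syllable and an
        # adjacent ASCII letter or digit; never inside an ASCII word.
        if prev_kind and kind and "han" in (prev_kind, kind):
            out.append(" ")
        out.append(piece)
        prev_kind = kind
    return "".join(out)


print(hz2py("安居客App,小区"))
```

Characters missing from the table pass through unchanged, which makes gaps in the dictionary easy to spot in the output.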

@glennblock
glennblock / fork forced sync
Created March 4, 2012 19:27
Force your forked repo to be the same as upstream.
git fetch upstream
git reset --hard upstream/master
@ttezel
ttezel / gist:4138642
Last active May 7, 2024 13:34
Natural Language Processing Notes

# A Collection of NLP notes

## N-grams

### Calculating unigram probabilities:

P( wi ) = count( wi ) / count( total number of words )

In English..
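As a minimal sketch of the unigram estimate above, the following Python counts each word and divides by the total number of words. The function name `unigram_probabilities` and the whitespace tokenization of the sample sentence are assumptions made for this illustration, not part of the notes.

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Return P(w) = count(w) / total number of tokens for every word w."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


tokens = "the cat sat on the mat".split()
probs = unigram_probabilities(tokens)
print(probs["the"])  # "the" accounts for 2 of the 6 tokens
```

An empty token list yields an empty dictionary, so no division by zero occurs.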

@luw2007
luw2007 / 词性标记.md
Last active March 18, 2024 06:36
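Part-of-speech tags: includes the ICTPOS3.0 tag set, the ICTCLAS Chinese part-of-speech tag set, the tags that appear in the jieba dictionary, and some tags that can be ignored in simhash

Word classes

  • Content words: nouns, verbs, adjectives, status words, distinguishing words, numerals, measure words, pronouns
  • Function words: adverbs, prepositions, conjunctions, particles, onomatopoeia, interjections.

ICTPOS3.0 part-of-speech tag set

n noun

nr personal name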

@zviri
zviri / clusterdump.sh
Created December 3, 2013 10:11
Mahout cheat-sheet
# Flags used below (a comment after a trailing backslash breaks the line continuation):
#   -dt  dictionary format: {Integer => String}
#   -d   dictionary: {id => word}
#   -i   input
#   -o   output (local filesystem)
#   -b   format length
#   -n   number of top terms to print
#   --distanceMeasure  default is euclidean distance
mahout clusterdump \
  -dt sequencefile \
  -d reuters-vectors/dictionary.file-* \
  -i reuters-kmeans-clusters/clusters-3-final \
  -o clusters.txt \
  -b 10 \
  -n 10 \
  --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure
@syllog1sm
syllog1sm / gist:10343947
Last active November 7, 2023 13:09
A simple Python dependency parser
"""A simple implementation of a greedy transition-based parser. Released under BSD license."""
from os import path
import os
import sys
from collections import defaultdict
import random
import time
import pickle
SHIFT, RIGHT, LEFT = 0, 1, 2
@CptMauli
CptMauli / empty-eclipse.target
Created May 2, 2014 07:24
a target platform for use with Eclipse SCADA
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?pde version="3.8"?><target name="simple" sequenceNumber="12">
<locations>
<location path="${env_var:ECLIPSE_432_HOME}" type="Profile"/>
<location path="${project_loc:builder_external}/builder/lib" type="Directory"/>
<location includeAllPlatforms="false" includeConfigurePhase="true" includeMode="planner" includeSource="true" type="InstallableUnit">
<unit id="org.apache.commons.beanutils" version="1.8.0.v201205091237"/>
<unit id="org.apache.commons.collections" version="3.2.0.v2013030210310"/>
<unit id="com.google.guava" version="12.0.0.v201212092141"/>
<unit id="com.google.gson" version="2.1.0.v201303041604"/>
@drorata
drorata / gist:146ce50807d16fd4a6aa
Last active February 27, 2024 10:15
Minimal Working example of Elasticsearch scrolling using Python client
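from elasticsearch import Elasticsearch

# Assumption: a node reachable at the client's default address.
es = Elasticsearch()

# Initialize the scroll
# Note: doc_type and search_type='scan' target older Elasticsearch releases.
page = es.search(
    index='yourIndex',
    doc_type='yourType',
    scroll='2m',
    search_type='scan',
    size=1000,
    body={
        # Your query's body
    })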
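Once the scroll is initialized, the remaining pages are fetched by passing the returned scroll id back to the cluster until a page comes back empty. The helper below is a hedged sketch of that loop, not the gist's own continuation: `iter_scroll_hits` is a hypothetical name, and it assumes a client object exposing `search` and `scroll` in the style of the Python client used above. It also assumes a regular scrolled search rather than `search_type='scan'`, since a scan-type search returns no hits in its first response and would end this loop immediately.

```python
def iter_scroll_hits(es, index, body, scroll="2m", size=1000):
    """Yield every hit of a scrolled search, fetching one page at a time."""
    page = es.search(index=index, body=body, scroll=scroll, size=size)
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Each response carries the scroll id to use for the next request.
        page = es.scroll(scroll_id=page["_scroll_id"], scroll=scroll)
```

Because it is a generator, a caller can process documents as they arrive, for example `for hit in iter_scroll_hits(es, 'yourIndex', {...}): ...`, without holding the full result set in memory.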
<img src="http://musa-hw-cafe.qiniudn.com/Screen%20Shot%202014-10-16%20at%2011.12.12.png" width="320px" height="100px"/>
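Baoliandeng (宝莲灯) is the client for a location-aware service.
* What is a location-aware service?
Based on near-field communication technology, it renders real-time indoor maps and, in consumer venues, provides a location-based social network connecting consumers with each other and with merchants.
* Typical use cases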
@trusktr
trusktr / DefaultKeyBinding.dict
Last active May 17, 2024 01:58
My DefaultKeyBinding.dict for Mac OS X
/* ~/Library/KeyBindings/DefaultKeyBinding.Dict
This file remaps the key bindings of a single user on Mac OS X 10.5 to more
closely match default behavior on Windows systems. This makes the Command key
behave like Windows Control key. To use Control instead of Command, either swap
Control and Command in Apple->System Preferences->Keyboard->Modifier Keys...
or replace @ with ^ in this file.
Here is a rough cheatsheet for syntax.
Key Modifiers