
@weakish
Created August 6, 2010 06:47
A #python script to mark up characters that need manual checks. #bottle

I wrote stupidm to mark up characters that need manual checking after converting between Simplified and Traditional Chinese. It may have other uses, for example marking up characters that are not available in some fonts, or annotating the pronunciation of rare characters.



Use stupidm with opencc:

With the make_words script, we can generate table files for stupidm with data files from opencc:

cat st_multi.txt | cut -f 2 | sed -r 's/ //g' |
make_words > st_multi.table
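
As a rough illustration of why the permutation step matters: stupidm looks each character up by the first character of a table line, so every character in a group of variants needs a line that starts with it. The sketch below expands one group and builds the lookup dictionary in memory. The helper names `expand_group` and `build_lookup` are invented for this example and are not part of the gist.

```python
import itertools


def expand_group(group):
    """Return every ordering of the characters in one group of variants."""
    return [''.join(p) for p in itertools.permutations(group)]


def build_lookup(lines):
    """Map the first character of each line to the rest of the line.

    A later line overwrites an earlier one that starts with the same
    character, so each character ends up with exactly one entry.
    """
    return {line[0]: line[1:] for line in lines if line}


lines = expand_group('係系繫')
lookup = build_lookup(lines)
print(len(lines))   # 6 orderings for a group of 3 characters
print(lookup)       # one entry per character in the group
```

Each character in the group ends up as a key whose value is the other two characters, which is the text stupidm inserts between the start and end tags.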

Example:

$ echo '胡适云:“南宫适诚不欺余也。”' | opencc | stupidm st_multi.table '{' '}'
胡{鬍衚}適{适}雲{云}:“南宮适{適}誠不欺餘{余}也。”
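
The marking step itself is a per-character dictionary lookup. Below is a condensed, in-memory sketch of that idea, assuming a five-line excerpt of st_multi.table and the text as opencc would output it. The function names `table_from_lines` and `mark_text` are made up here; the gist's own functions read the table from a file.

```python
def table_from_lines(lines):
    """First character of each line is the key; the rest are the candidates."""
    return {line[0]: line[1:] for line in lines if line}


def mark_text(text, table, pre, post):
    """Append the tagged candidates after every character found in the table."""
    return ''.join(
        ch + pre + table[ch] + post if ch in table else ch
        for ch in text
    )


# A small excerpt of st_multi.table (assumed sufficient for this sentence).
excerpt = ['胡鬍衚', '適适', '适適', '雲云', '餘余']
table = table_from_lines(excerpt)

converted = '胡適雲:“南宮适誠不欺餘也。”'   # text after opencc
result = mark_text(converted, table, '{', '}')
print(result)
# 胡{鬍衚}適{适}雲{云}:“南宮适{適}誠不欺餘{余}也。”
```

Characters that are not in the table pass through unchanged, so the output can be searched for the start tag to find every spot that needs a manual check.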

Setting up an online service:

With the stupidm_web.py script (requires bottle), you can set up an online service quickly.

An nginx proxy config example is also provided.
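
Once the service is running, a client can submit text by posting the same fields the HTML form uses (`content`, `config`, `pre`, `post`). The sketch below only builds such a request. The URL `http://localhost:8080/` and the helper name `build_request` are assumptions for illustration; adjust them to your own deployment.

```python
import urllib.parse
import urllib.request


def build_request(text, url='http://localhost:8080/',
                  config='傳統中文', pre='{', post='}'):
    """Build a form-encoded POST request for the stupidm web form."""
    fields = {'content': text, 'config': config, 'pre': pre, 'post': post}
    data = urllib.parse.urlencode(fields).encode('ascii')
    # Supplying data makes urllib send the request as a POST.
    return urllib.request.Request(url, data=data)


req = build_request('胡適雲')
print(req.get_method())
print(req.data)

# To actually send it (the service must be running):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode('utf-8'))
```

The response is the full HTML page, with the marked-up text inside the `<pre>` element.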

I've ported stupidm and stupidm_web to Python 2.5 for hosting on Google App Engine. An app.yaml example is provided, too.

You may test my Google App Engine application:

http://weakish.appspot.com/zhtran

application: stupidm-web-example
version: 1
runtime: python
api_version: 1

handlers:
- url: /zhtran
  script: stupidm_web_py2.py
#!/usr/bin/env python3.1
# by weakish <weakish@gmail.com>, licensed under GPL v2.
'''Arrangements of letters of words.

Example:

In:
    abc
    de
Out:
    abc
    acb
    bac
    bca
    cab
    cba
    de
    ed

Note: Words with duplicated letters, e.g. hello, will produce duplicated
results.
'''

import itertools
import sys


def main():
    print(make_words(sys.stdin.read()))


def make_words(text: str) -> str:
    # One input word per line; each word expands to all of its orderings.
    return '\n'.join(arrange_chars(word) for word in text.splitlines())


def arrange_chars(word: str) -> str:
    ''' 'abc' -> 'abc\nacb\nbac\nbca\ncab\ncba' '''
    return '\n'.join(''.join(i) for i in itertools.permutations(word))


if __name__ == '__main__':
    main()
# config file for nginx
# In Debian, put this file under /etc/nginx/sites-available, then link to
# it from /etc/nginx/sites-enabled, and restart nginx:
# sudo /etc/init.d/nginx restart
server {
    listen 80;
    server_name example.com;
    access_log /var/log/nginx/example.com.access.log;
    location / {
        proxy_pass http://localhost:8080;
    }
}
劃畫划
劃划畫
畫劃划
畫划劃
划劃畫
划畫劃
滷鹵
鹵滷
歷曆
曆歷
發髮
髮發
只隻
隻只
臺檯颱台
臺檯台颱
臺颱檯台
臺颱台檯
臺台檯颱
臺台颱檯
檯臺颱台
檯臺台颱
檯颱臺台
檯颱台臺
檯台臺颱
檯台颱臺
颱臺檯台
颱臺台檯
颱檯臺台
颱檯台臺
颱台臺檯
颱台檯臺
台臺檯颱
台臺颱檯
台檯臺颱
台檯颱臺
台颱臺檯
台颱檯臺
後后
后後
壇罈
罈壇
復複覆
復覆複
複復覆
複覆復
覆復複
覆複復
盡儘
儘盡
幹乾干榦
幹乾榦干
幹干乾榦
幹干榦乾
幹榦乾干
幹榦干乾
乾幹干榦
乾幹榦干
乾干幹榦
乾干榦幹
乾榦幹干
乾榦干幹
干幹乾榦
干幹榦乾
干乾幹榦
干乾榦幹
干榦幹乾
干榦乾幹
榦幹乾干
榦幹干乾
榦乾幹干
榦乾干幹
榦干幹乾
榦干乾幹
樹榦版榦
樹榦榦版
樹版榦榦
樹版榦榦
樹榦榦版
樹榦版榦
榦樹版榦
榦樹榦版
榦版樹榦
榦版榦樹
榦榦樹版
榦榦版樹
版樹榦榦
版樹榦榦
版榦樹榦
版榦榦樹
版榦樹榦
版榦榦樹
榦樹榦版
榦樹版榦
榦榦樹版
榦榦版樹
榦版樹榦
榦版榦樹
並併并
並并併
併並并
併并並
并並併
并併並
當噹
噹當
志誌
誌志
匯彙
彙匯
係系繫
係繫系
系係繫
系繫係
繫係系
繫系係
髒臟
臟髒
蕩盪
盪蕩
獲穫
穫獲
採采
采採
里裏
裏里
鍾鐘
鐘鍾
飢饑
饑飢
豐丰
丰豐
醜丑
丑醜
了瞭
瞭了
借藉
藉借
克剋
剋克
準准
准準
刮颳
颳刮
制製
製制
籲吁
吁籲
吊弔
弔吊
團糰
糰團
困睏
睏困
佈布
布佈
御禦
禦御
鬭斗
斗鬭
曲麯
麯曲
鬆松
松鬆
澱淀
淀澱
纖縴
縴纖
致緻
緻致
蔑衊
衊蔑
仇讎
讎仇
冬鼕
鼕冬
咸鹹
鹹咸
雲云
云雲
僕仆
仆僕
舍捨
捨舍
籖簽
簽籖
折摺
摺折
谷穀
穀谷
幾几
几幾
闢辟
辟闢
奸姦
姦奸
遊游
游遊
傭佣
佣傭
蘇囌甦
蘇甦囌
囌蘇甦
囌甦蘇
甦蘇囌
甦囌蘇
回迴
迴回
面麪
麪面
向嚮曏
向曏嚮
嚮向曏
嚮曏向
曏向嚮
曏嚮向
夥伙
伙夥
鬱郁
郁鬱
樸朴
朴樸
才纔
纔才
朱硃
硃朱
別彆
彆別
捲卷
卷捲
蒙矇濛懞
蒙矇懞濛
蒙濛矇懞
蒙濛懞矇
蒙懞矇濛
蒙懞濛矇
矇蒙濛懞
矇蒙懞濛
矇濛蒙懞
矇濛懞蒙
矇懞蒙濛
矇懞濛蒙
濛蒙矇懞
濛蒙懞矇
濛矇蒙懞
濛矇懞蒙
濛懞蒙矇
濛懞矇蒙
懞蒙矇濛
懞蒙濛矇
懞矇蒙濛
懞矇濛蒙
懞濛蒙矇
懞濛矇蒙
徵征
征徵
症癥
癥症
惡噁
噁惡
注註
註注
哄鬨
鬨哄
參蔘
蔘參
醃腌
腌醃
彩綵
綵彩
佔占
占佔
欲慾
慾欲
扎紮
紮扎
熏燻
燻熏
贊讚
讚贊
嘗嚐
嚐嘗
煙菸
菸煙
周週賙
周賙週
週周賙
週賙周
賙周週
賙週周
櫃柜
柜櫃
餵喂
喂餵
幸倖
倖幸
兇凶
凶兇
傑杰
杰傑
針鍼
鍼針
戚慼鏚
戚鏚慼
慼戚鏚
慼鏚戚
鏚戚慼
鏚慼戚
托託
託托
挨捱
捱挨
挽輓
輓挽
慄栗
栗慄
煉鍊
鍊煉
鏈鍊
鍊鏈
穗繐
繐穗
雕鵰
鵰雕
樑梁
梁樑
升昇
昇升
擺襬
襬擺
巖岩
岩巖
娘孃
孃娘
僵殭
殭僵
藥葯
葯藥
餘余
余餘
蠟蜡
蜡蠟
出齣
齣出
卜蔔
蔔卜
同衕
衕同
板闆
闆板
漓灕
灕漓
術朮
朮術
侖崙
崙侖
秋鞦
鞦秋
千韆
韆千
簾帘
帘簾
庵菴
菴庵
屍尸
尸屍
胡衚鬍
胡鬍衚
衚胡鬍
衚鬍胡
鬍胡衚
鬍衚胡
須鬚
鬚須
據据
据據
築筑
筑築
誇夸
夸誇
蘋苹
苹蘋
裊嫋
嫋裊
暗闇
闇暗
衝沖冲
衝冲沖
沖衝冲
沖冲衝
冲衝沖
冲沖衝
表錶
錶表
杆桿
桿杆
鑒鑑
鑑鑒
搜蒐
蒐搜
杯盃
盃杯
剷鏟
鏟剷
扣釦
釦扣
念唸
唸念
杠槓
槓杠
泛汎氾
泛氾汎
汎泛氾
汎氾泛
氾泛汎
氾汎泛
核覈
覈核
巨鉅
鉅巨
嘆歎
歎嘆
價价
价價
私俬
俬私
局侷
侷局
拐柺
柺拐
弦絃
絃弦
譁嘩
嘩譁
悽淒
淒悽
家傢
傢家
席蓆
蓆席
酸痠
痠酸
噪譟
譟噪
咽嚥
嚥咽
愈癒
癒愈
凌淩
淩凌
毀譭
譭毀
苔薹
薹苔
糊餬
餬糊
抵牴
牴抵
恤卹
卹恤
蔭廕
廕蔭
皁皂
皂皁
芸蕓
蕓芸
極极
极極
願愿
愿願
勝胜
胜勝
確确
确確
葉叶
叶葉
蟲虫
虫蟲
廠厂
厂廠
修脩
脩修
價价
价價
合閤
閤合
適适
适適
彌瀰
瀰彌
釐厘
厘釐
塗涂
涂塗
個箇个
個个箇
箇個个
箇个個
个個箇
个箇個
於于
于於
黨党
党黨
種种
种種
萬万
万萬
範范
范範
瀋沈
沈瀋
薑姜
姜薑
掛挂
挂掛
閒閑
閑閒
證証
証證
芸蕓
蕓芸
佑祐
祐佑
#!/usr/bin/env python3.1
# by weakish <weakish@gmail.com>, licensed under GPL v2.
'''Mark up characters that need manual checks.'''

import sys

helpinfo = '''stupidm -- Mark up characters that need manual checks

Usage:

    cat infile | stupidm table starttag endtag > outfile
    stupidm -h # print this help page

Example:

    table.txt
        fb
        iopt

    $ echo 'hi, foo' | stupidm table.txt '{' '}'
    hi{opt}, f{b}oo
'''


def main():
    if len(sys.argv) >= 2 and sys.argv[1] == '-h':
        print(helpinfo)
        sys.exit()
    elif len(sys.argv) < 4:
        # table, starttag and endtag are all required.
        print(helpinfo)
        sys.exit(2)
    else:
        text = sys.stdin.read()
        table = gen_table(sys.argv[1])
        pre, post = sys.argv[2], sys.argv[3]
        print(markup(text, table, pre, post))


def markup(text: str, table: dict, pre: str, post: str) -> str:
    def mark(char: 'c') -> str:
        # Append the tagged candidates only after characters found in the table.
        return (char + pre + table[char] + post) if (char in table) else char
    return ''.join(mark(char) for char in text)


def gen_table(file: 'file') -> dict:
    # The first character of each line is the key; the rest of the line is
    # its value. The table files are UTF-8 text.
    with open(file, encoding='utf-8') as table:
        return { line[0]: line[1:].rstrip('\n') for line in table.readlines() }


if __name__ == '__main__':
    main()
#!/usr/bin/env python2.5
# -*- coding: utf-8 -*-
# by weakish <weakish@gmail.com>, licensed under GPL v2.
'''Mark up characters that need manual checks.'''

helpinfo = '''stupidm -- Mark up characters that need manual checks

Usage:

    cat infile | stupidm table starttag endtag > outfile
    stupidm -h # print this help page

Example:

    table.txt
        fb
        iopt

    $ echo 'hi, foo' | stupidm table.txt '{' '}'
    hi{opt}, f{b}oo
'''


def main():
    import sys
    if len(sys.argv) >= 2 and sys.argv[1] == '-h':
        print helpinfo
        sys.exit()
    elif len(sys.argv) < 4:
        # table, starttag and endtag are all required.
        print helpinfo
        sys.exit(2)
    else:
        text = unicode(sys.stdin.read(), 'utf-8')
        table = gen_table(sys.argv[1])
        pre, post = sys.argv[2], sys.argv[3]
        print markup(text, table, pre, post).encode('utf-8')


def markup(text, table, pre, post):
    def mark(char):
        # Append the tagged candidates only after characters found in the table.
        return (char + pre + table[char] + post) if (char in table) else char
    return ''.join(mark(char) for char in text)


def gen_table(file):
    import codecs
    table = codecs.open(file, 'r', encoding='utf-8')
    try:
        # The first character of each line is the key; the rest is its value.
        return dict((line[0],line[1:].rstrip('\n')) for line in table.readlines())
    finally:
        table.close()


if __name__ == '__main__':
    main()
#!/usr/bin/env python3.1
# by weakish <weakish@gmail.com>, licensed under GPL v2.

import stupidm
from bottle import get, post, request, run

HEAD = '''<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>簡化中文、傳統中文轉換易誤字標記</title>
</head>
<body>
<h1>簡化中文、傳統中文轉換易誤字標記</h1>
'''

FORM = '''
<form method="post">
<textarea name="content" rows="20" cols="80"></textarea>
<p>
<select name="config">
<option selected="selected" >傳統中文</option>
<option>简化中文</option>
</select>
起始標籤 <input type="text" value="{" name="pre" />
結束標籤 <input type="text" value="}" name="post" />
<input type="submit" value="提交" />
</p>
</form>
'''

FOOTER ='''
<hr />
<p>Powered by <a href="http://gist.github.com/510960">stupidm</a>,
and <a href="http://code.google.com/p/opencc/">opencc</a>. <a href="http://validator.w3.org/check/referer">Valid html5</a>.</p>
</body>
'''


@get('/')
def form():
    page = HEAD + FORM + FOOTER
    return page


@post('/')
def submit():
    text = request.forms.get('content')
    option = request.forms.get('config')
    # Pick the table that matches the selected target script.
    tablefile = 'st_multi.table' if option == '傳統中文' else 'ts_multi.table'
    table = stupidm.gen_table(tablefile)
    pre = request.forms.get('pre')
    post = request.forms.get('post')
    cooked_text = stupidm.markup(text, table, pre, post)
    page = HEAD + '<pre>' + cooked_text + '</pre>' + '<hr />' + FORM + FOOTER
    return page


def main(port):
    run(port=port)


if __name__ == '__main__':
    import sys
    # Command-line arguments are strings, so convert the port to an integer.
    main(port=(int(sys.argv[1]) if len(sys.argv) == 2 else 8080))
# -*- coding: utf-8 -*-
# by weakish <weakish@gmail.com>, licensed under GPL v2.

import stupidm_py2 as stupidm
from bottle import get, post, request, default_app
from google.appengine.ext.webapp.util import run_wsgi_app

HEAD = unicode('''<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>簡化中文、傳統中文轉換易誤字標記</title>
</head>
<body>
<h1>簡化中文、傳統中文轉換易誤字標記</h1>
''', encoding='utf-8')

FORM = unicode('''
<form method="post">
<textarea name="content" rows="20" cols="80"></textarea>
<p>
<select name="config">
<option selected="selected" >傳統中文</option>
<option>简化中文</option>
</select>
起始標籤 <input type="text" value="{" name="pre" />
結束標籤 <input type="text" value="}" name="post" />
<input type="submit" value="提交" />
</p>
</form>
''', encoding='utf-8')

FOOTER =unicode('''
<hr />
<p>Powered by <a href="http://gist.github.com/510960">stupidm</a>. <a href="http://validator.w3.org/check/referer">Valid html5</a>. <a href="http://flattr.com/thing/89312/stupidm">Flattr this!</a></p>
</body>
''', encoding='utf-8')


@get('/zhtran')
def form():
    page = HEAD + FORM + FOOTER
    return page.encode('utf-8')


@post('/zhtran')
def submit():
    text = unicode(request.forms.get('content'), encoding='utf-8')
    option = unicode(request.forms.get('config'), encoding='utf-8')
    tablefile = 'st_multi.table' if option == unicode('傳統中文', encoding='utf-8') else 'ts_multi.table'
    table = stupidm.gen_table(tablefile)
    pre = unicode(request.forms.get('pre'), encoding='utf-8')
    post = unicode(request.forms.get('post'), encoding='utf-8')
    cooked_text = stupidm.markup(text, table, pre, post)
    page = HEAD + '<pre>' + cooked_text + '</pre>' + '<hr />' + FORM + FOOTER
    return page.encode('utf-8')


def main():
    run_wsgi_app(default_app())


if __name__ == "__main__":
    main()
画划
划画
覆复
复覆
藉借
借藉
乾干
干乾
瞭了
了瞭
炼链
链炼
苹蘋
蘋苹
于於
於于
巨钜
钜巨
衹只
只衹
着著
著着
沈沉
沉沈