@cyfdecyf
Created December 18, 2011 09:12
Join consecutive Chinese lines into a single long line.
#!/usr/bin/env ruby
#encoding: UTF-8
# Requires ruby 1.9.x, and assumes UTF-8 encoding
class String
  # The regular expression trick to match CJK characters comes from
  # http://stackoverflow.com/a/4681577/306935
  # For more info on the regex syntax used here, see http://oniguruma.rubyforge.org/svn/Syntax.txt
  # For Unicode character categories, see http://www.fileformat.info/info/unicode/category/index.htm
  # Unfortunately, there is no Unicode category specific to Chinese
  # punctuation, so the marks are listed manually. (Not complete, just the
  # most common ones.)
  HAN_OR_PUNCT = '\p{Han}|[,。?;:‘’“”、!…()]'

  # Built once as a constant. An instance variable on String would be
  # rebuilt for every new string, so it would never act as a cache.
  CHINESE_LINE_BREAK = Regexp.new("(#{HAN_OR_PUNCT}) *\n *(#{HAN_OR_PUNCT})")

  # Removes a line break (and any spaces around it) that sits between two
  # Chinese characters or punctuation marks.
  def join_chinese
    gsub(CHINESE_LINE_BREAK, '\1\2')
  end
end

if ARGV.size != 1
  puts "Usage: #{File.basename $0} <file>"
  exit 1
end

# Read the file explicitly as UTF-8, since the script assumes that encoding.
print File.read(ARGV[0], encoding: 'UTF-8').join_chinese
this is
a sentence
你好,
我是
某某
Hello, I'm,
某某
他说:
“很高兴见到你”
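
To illustrate what the script does with the first lines of the sample text above, here is a minimal standalone sketch. It assumes Ruby 2.0 or later (where source files default to UTF-8). The helper name `join_chinese_lines` and both constants are hypothetical, chosen only for this example; the punctuation list is copied as it appears in this listing and is not complete.

```ruby
# Hypothetical standalone version of the gist's pattern, for illustration only.
# A character counts as "Chinese" if it is in the Han script or in the
# hand-written punctuation list (copied from the listing, not exhaustive).
HAN_OR_PUNCT_SKETCH = '\p{Han}|[,。?;:‘’“”、!…()]'
JOIN_RE_SKETCH = Regexp.new("(#{HAN_OR_PUNCT_SKETCH}) *\n *(#{HAN_OR_PUNCT_SKETCH})")

# Drops a newline (and surrounding spaces) between two Chinese characters.
def join_chinese_lines(text)
  text.gsub(JOIN_RE_SKETCH, '\1\2')
end

sample = "this is\na sentence\n你好,\n我是\n某某\n"
puts join_chinese_lines(sample)
# prints:
# this is
# a sentence
# 你好,我是某某
```

The English lines keep their line breaks because neither side of the newline matches the pattern. One limitation to be aware of: `gsub` matches do not overlap, so in a run of single-character lines (for example "我\n是\n某") only every other break is removed in one pass.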