@Kimtaro
Created May 28, 2009 02:15
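# -*- encoding: utf-8 -*-
require 'benchmark'
require 'rubygems'
require 'iconv' # Iconv is used below, so load it explicitly
require 'active_support'
require 'oniguruma'
include Oniguruma

# Converter that silently drops invalid UTF-8 byte sequences.
SKIP_INVALID_UTF8 = Iconv.new('UTF-8//IGNORE', 'UTF-8')
# Matches a single UTF-8 character; used to count characters.
SIZE_R = ORegexp.new('.', '', 'utf8')
# Matches up to 11 characters, the same span as [0..10] in the benchmarks.
SLICE_R = ORegexp.new('.{,11}', '', 'utf8')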

class String
  # Returns the string's Unicode code points as an array of integers.
  def ord_u_all
    # Ignore invalid UTF-8 fix from http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/
    SKIP_INVALID_UTF8.iconv(self + ' ')[0..-2].unpack('U*')
  end
end

n = 100_000
text = '北朝鮮、寧辺で核再処理開始か 韓国政府が分析'

puts "\n> Size"
Benchmark.bm(10) do |x|
  x.report('mb_chars') { n.times { text.mb_chars.size } }
  x.report('ord_u_all') { n.times { text.ord_u_all.size } }
  x.report('regexp') { n.times { SIZE_R.scan(text).size } }
end

puts "\n> Index"
Benchmark.bm(10) do |x|
  x.report('mb_chars') { n.times { text.mb_chars[0..10] } }
  x.report('ord_u_all') { n.times { text.ord_u_all[0..10] } }
  x.report('regexp') { n.times { SLICE_R.match(text)[0] } }
end
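
The script above targets Ruby 1.8, where strings are byte arrays and character-level work needs ActiveSupport, Iconv, or the oniguruma gem. For comparison, here is a minimal sketch of the same three operations using only built-ins, assuming Ruby 2.1 or later (where `String#scrub` is available). The helper names `codepoints_skipping_invalid`, `char_count`, and `first_chars` are hypothetical and not part of the original gist, and `scrub('')` is swapped in for the Iconv `//IGNORE` trick rather than being an exact equivalent.

```ruby
# -*- encoding: utf-8 -*-
require 'benchmark'

# NOTE: hypothetical helpers, not from the original gist. Assumes Ruby 2.1+.

# Drop invalid UTF-8 byte sequences, then return the code points,
# roughly what ord_u_all does via Iconv in the original script.
def codepoints_skipping_invalid(str)
  str.scrub('').unpack('U*')
end

# Count characters with the encoding-aware String#size.
def char_count(str)
  str.scrub('').size
end

# Return up to `count` leading characters using a UTF-8 regexp.
def first_chars(str, count)
  str.scrub('')[/\A.{0,#{count}}/mu]
end

if __FILE__ == $PROGRAM_NAME
  n = 100_000
  text = '北朝鮮、寧辺で核再処理開始か 韓国政府が分析'

  puts "\n> Size"
  Benchmark.bm(12) do |x|
    x.report('String#size') { n.times { char_count(text) } }
    x.report('unpack(U*)')  { n.times { codepoints_skipping_invalid(text).size } }
  end

  puts "\n> Index"
  Benchmark.bm(12) do |x|
    x.report('String#[]') { n.times { text[0..10] } }
    x.report('regexp')    { n.times { first_chars(text, 11) } }
  end
end
```

Timings from this sketch are not comparable with the Ruby 1.8 numbers, since the interpreter and string implementation differ. It only shows which built-ins cover the same ground.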