in ruby:
t = topicsColl.find({"subject" => /[\xF3]/}).count()
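A small sketch of why a byte pattern like `\xF3` and a literal character like `ó` are not interchangeable. It assumes `\xF3` in the query above is meant as the Latin-1 byte for "ó", and that Ruby 1.9 or later is available (for `String#encode`); `hex_bytes` is a made-up helper name, not part of any library.

```ruby
# encoding: utf-8

# Hypothetical helper: show the bytes of a string in a given encoding as hex.
def hex_bytes(str, encoding = "UTF-8")
  str.encode(encoding).bytes.map { |b| "%02X" % b }.join(" ")
end

o_acute = "\u00F3" # "ó", written as an escape so the source encoding doesn't matter

puts hex_bytes(o_acute)               # two bytes in UTF-8
puts hex_bytes(o_acute, "ISO-8859-1") # one byte in Latin-1
```

If the subjects are stored as UTF-8, a pattern built around the single byte F3 is looking for something that never appears on its own in the data, which may be part of why the literal-character regex is worth trying.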
- regexp = /[^ёЁа-яА-Яa-zA-Zà-üÀ-Ü0-9\.\-\+_]/u (from http://stackoverflow.com/questions/6113010/ruby-1-8-7-unicode-regular-expression-question)
- {"subject": /[\xE1]/ }
- {"subject": /[áướốóạậì]/i} is a regex query that might work: according to Rubular this should work (http://rubular.com/r/mYJNuYJieB), so it might just be MongoHQ that can't handle it
- in a given time period, find the topics that match the regex, find their authors, and print a sample title and the first 66 characters of content for each author to determine whether they are a spammer
- use Control-X 8 Return (C-x 8 RET) to enter Unicode in Emacs on Windows, Mac and Linux (especially useful on Windows, which can't handle UTF-8 easily; I believe mintty can handle UTF-8 but I haven't figured out how)
References:
- http://vietunicode.sourceforge.net/charset/
- http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18
- http://stackoverflow.com/questions/256822/how-to-use-regex-for-utf8-in-ruby documents some Ruby 1.9 kludges, which I don't think most folks should be using, but if it works for you :-) go for it!
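
The spam check described in the notes above (topics in a time period that match the regex, one sample per author) could be sketched roughly like this. It runs against plain in-memory hashes rather than a live MongoDB collection. The field names ("author", "subject", "content", "created_at"), the helper name `spam_samples` and the sample data are all assumptions for illustration, not the real schema. It also assumes Ruby 1.9 or later, where string slicing counts characters rather than bytes.

```ruby
# encoding: utf-8

# Character class taken from the regex query in the notes above.
SPAM_CHARS = /[áướốóạậì]/i

# Hypothetical helper: for topics created between from and to whose subject
# matches the pattern, keep one sample (title plus content preview) per author.
def spam_samples(topics, from, to, pattern = SPAM_CHARS, preview_len = 66)
  samples = {}
  topics.each do |topic|
    next unless (from..to).cover?(topic["created_at"])
    next unless topic["subject"] =~ pattern
    # Only the first matching topic per author is kept as the sample.
    samples[topic["author"]] ||= {
      "title"   => topic["subject"],
      "preview" => topic["content"].to_s[0, preview_len]
    }
  end
  samples
end

# Made-up documents standing in for what the collection would return.
topics = [
  { "author" => "alice", "subject" => "Firefox crashes on startup",
    "content" => "It crashes every time I open it.", "created_at" => Time.utc(2012, 1, 10) },
  { "author" => "bob", "subject" => "Dịch vụ chuyển nhà giá tốt",
    "content" => "x" * 100, "created_at" => Time.utc(2012, 1, 12) },
  { "author" => "carol", "subject" => "Giá tốt nhất",
    "content" => "outside the time period", "created_at" => Time.utc(2011, 6, 1) }
]

samples = spam_samples(topics, Time.utc(2012, 1, 1), Time.utc(2012, 1, 31))
samples.each do |author, sample|
  puts "#{author}: #{sample['title']} | #{sample['preview']}"
end
```

Printing one line per author keeps the output short enough to eyeball, which is the point of the check: a human still decides who is a spammer.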