rtanglao/findVietnamese.rb

## findVietnamese.rb
#!/usr/bin/env ruby
# -*- coding: utf-8 -*-
require 'rubygems'
require 'json'
require 'time'
require 'date'
require 'mongo'
require 'pp'

MONGO_HOST = ENV["MONGO_HOST"]
raise(StandardError,"Set Mongo hostname in ENV: 'MONGO_HOST'") if !MONGO_HOST
MONGO_PORT = ENV["MONGO_PORT"]
raise(StandardError,"Set Mongo port in ENV: 'MONGO_PORT'") if !MONGO_PORT
MONGO_USER = ENV["MONGO_USER"]
raise(StandardError,"Set Mongo user in ENV: 'MONGO_USER'") if !MONGO_USER
MONGO_PASSWORD = ENV["MONGO_PASSWORD"]
raise(StandardError,"Set Mongo password in ENV: 'MONGO_PASSWORD'") if !MONGO_PASSWORD
db = Mongo::Connection.new(MONGO_HOST, MONGO_PORT.to_i).db("gs")
auth = db.authenticate(MONGO_USER, MONGO_PASSWORD)
if !auth
  # raise already aborts the script, so no explicit exit is needed
  raise(StandardError, "Couldn't authenticate, exiting")
end

topicsColl = db.collection("topics")
# Count the topics whose subject contains any of these Vietnamese characters
t = topicsColl.find({"subject" => /[ảựăậ]/u}).count()
puts t
# pp t["subject"]

## regex-to-find-suspected-vietnamese-spammers.md

      
In Ruby:
t = topicsColl.find({"subject" => /[\xF3]/}).count()


regexp = /[^ёЁа-яА-Яa-zA-Zà-üÀ-Ü0-9\.\-\+_]/u (from http://stackoverflow.com/questions/6113010/ruby-1-8-7-unicode-regular-expression-question)
{"subject": /[\xE1]/ }
{"subject": /[áướốóạậì]/i} is a regex query that might work. According to Rubular it should match (http://rubular.com/r/mYJNuYJieB), so it might just be MongoHQ that can't handle it.
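To check a character class locally before sending it to Mongo, something like the following sketch may help. It assumes Ruby 1.9 or later with a UTF-8 source file; the constant `VIETNAMESE_CHARS` and the helper `suspected_vietnamese` are made-up names for illustration, and the sample subjects are invented, not taken from the real data.

```ruby
# -*- coding: utf-8 -*-
# Hypothetical character class built from the characters tried in these notes;
# extend it with whatever diacritics the spam actually uses.
VIETNAMESE_CHARS = /[áướốóạậìảựă]/i

# Returns only the subjects that contain at least one character from the class.
def suspected_vietnamese(subjects)
  subjects.select { |s| s =~ VIETNAMESE_CHARS }
end

# Invented sample subjects, for illustration only.
samples = [
  "How do I reset my password?",
  "Dịch vụ kế toán trọn gói",
  "Căn hộ giá rẻ"
]

puts suspected_vietnamese(samples)
```

Plain ASCII subjects are filtered out, so a quick run against a handful of known spam and non-spam titles shows whether the class is too narrow or too broad.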
Within a given time period, find the topics that match the regex, find their authors, and print a sample title and the first 66 characters of content for each author to determine whether they are a spammer.
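A minimal sketch of that triage step, working on plain Ruby hashes rather than a live Mongo cursor. The keys "author", "subject", "content" and "created_at" are assumptions about the topic documents, and `spam_samples` and `SUSPECT` are hypothetical names; adjust them to the real schema.

```ruby
# -*- coding: utf-8 -*-
# Character class taken from the regex query in these notes.
SUSPECT = /[áướốóạậì]/i

# topics: array of hashes with the assumed keys "author", "subject",
# "content" and "created_at" (a Time).
# Returns one line per author: a sample title plus the first 66 characters
# of that topic's content.
def spam_samples(topics, from, to, pattern = SUSPECT)
  # Keep topics inside the time window whose subject matches the pattern.
  hits = topics.select do |t|
    t["created_at"] >= from && t["created_at"] <= to && t["subject"] =~ pattern
  end
  # Group by author and use each author's first matching topic as the sample.
  hits.group_by { |t| t["author"] }.map do |author, author_topics|
    sample = author_topics.first
    "#{author}: #{sample["subject"]} | #{sample["content"][0, 66]}"
  end
end
```

With the Mongo driver, the array passed in could come from calling `to_a` on the result of a `find`, at which point the time filter could equally be pushed into the query itself.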
Use Control-X 8 Return to enter Unicode in Emacs on Windows, Mac and Linux (especially useful on Windows, which can't handle UTF-8 easily; I believe mintty can handle UTF-8 but I haven't figured out how).
References:
http://vietunicode.sourceforge.net/charset/
http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18
http://stackoverflow.com/questions/256822/how-to-use-regex-for-utf8-in-ruby documents some Ruby 1.9 kludges, which I don't think most folks should be using, but if it works for you :-) go for it!