@timnew
Created September 7, 2017 02:13
A spam filter targeting Chinese spammers!
class ChineseSpamFilter
  class << self
    # A job is flagged as spam when it comes from a qq.com address and
    # its title or company is at least 20% CJK ideographs.
    def spam?(job)
      tencent_email?(job.email) &&
        (
          in_cjk?(job.title, 0.2) ||
          in_cjk?(job.company, 0.2)
        )
    end

    def tencent_email?(email)
      email.end_with?('@qq.com')
    end

    # Check whether the share of CJK ideographs in text reaches threshold
    # (a ratio between 0 and 1). Only Han ideographs are counted; kana and
    # hangul are not.
    def in_cjk?(text, threshold)
      chars = text.unpack('U*') # Unicode code points as integers
      return false if chars.empty? # avoid 0 / 0 on empty text

      cjk_char_count = chars.count { |char| cjk_char?(char) }
      ratio = cjk_char_count.to_f / chars.length
      puts "count: #{cjk_char_count}, ratio: #{ratio * 100}" # debug output
      ratio >= threshold
    end

    # Detect a CJK ideograph by checking the code point against Unicode blocks
    def cjk_char?(char)
      char.between?(0x4E00, 0x9FFF) ||   # Main block
        char.between?(0x3400, 0x4DBF) ||   # Extended Block A
        char.between?(0x20000, 0x2A6DF) || # Extended Block B
        char.between?(0x2A700, 0x2B73F) || # Extended Block C
        char.between?(0x2B740, 0x2B81F) || # Extended Block D
        char.between?(0x2B820, 0x2CEAF) || # Extended Block E
        char.between?(0x2CEB0, 0x2EBEF) || # Extended Block F
        char.between?(0xF900, 0xFAFF)      # Compatibility Block
    end
  end
end
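
A minimal, self-contained sketch of the same rule, for trying it outside the original app. Everything named here is an assumption for illustration: `Job` is a hypothetical Struct standing in for whatever record the filter receives (the gist only shows that it responds to `email`, `title` and `company`), and `han_ratio` and `looks_like_spam?` are made-up helper names. It also swaps the explicit code-point ranges for Ruby's `\p{Han}` regex property, which covers a slightly different set of characters than the blocks listed above.

```ruby
# Hypothetical stand-in for the job record; not part of the original gist.
Job = Struct.new(:email, :title, :company)

# Share of characters in `text` that belong to the Han script.
# Returns 0.0 for nil or empty input instead of dividing by zero.
def han_ratio(text)
  chars = text.to_s.chars
  return 0.0 if chars.empty?

  chars.count { |c| c.match?(/\p{Han}/) }.to_f / chars.length
end

# Same rule as ChineseSpamFilter.spam?: a qq.com sender plus a title or
# company whose Han ratio reaches the threshold (20% by default).
def looks_like_spam?(job, threshold = 0.2)
  return false unless job.email.to_s.end_with?('@qq.com')

  han_ratio(job.title) >= threshold || han_ratio(job.company) >= threshold
end

spam = Job.new('12345@qq.com', '高薪招聘 Ruby 工程师', 'Example Co')
ham  = Job.new('hr@example.com', 'Ruby 工程师', 'Example Co')

puts looks_like_spam?(spam) # prints true (7 of 13 title characters are Han)
puts looks_like_spam?(ham)  # prints false (sender is not a qq.com address)
```

Both conditions must hold, so a Chinese-language posting from another mail provider passes, as does an English-language posting from a qq.com address.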