@adamlwatson
Created March 18, 2014 16:29
Strip emoji
# This scrubs emoji sequences from a string. I think it covers all of them.
def strip_emoji(str)
  # Work on a copy so the caller's string is not re-tagged in place
  # (force_encoding mutates its receiver and raises on frozen strings).
  str = str.dup.force_encoding('utf-8').encode
  # emoticons 1F600 - 1F64F
  regex = /[\u{1f600}-\u{1f64f}]/
  clean_text = str.gsub regex, ''
  # dingbats 2702 - 27B0
  regex = /[\u{2702}-\u{27b0}]/
  clean_text = clean_text.gsub regex, ''
  # transport/map symbols 1F680 - 1F6FF
  regex = /[\u{1f680}-\u{1f6ff}]/
  clean_text = clean_text.gsub regex, ''
  # enclosed chars 24C2 - 1F251
  # NOTE: this is one contiguous range, so it also matches everything between
  # U+24C2 and U+1F251, including CJK ideographs, kana, and Hangul
  # (see the comments below).
  regex = /[\u{24C2}-\u{1F251}]/
  clean_text = clean_text.gsub regex, ''
  # symbols & pics 1F300 - 1F5FF
  regex = /[\u{1f300}-\u{1f5ff}]/
  clean_text = clean_text.gsub regex, ''
  clean_text
end

def test_strip_emoji
  # NOTE: strip_emoji_full is not defined anywhere in this gist.
  File.foreach("emoji.txt") do |line|
    puts strip_emoji_full(line)
  end
end
@franklsf95

This also strips out Chinese characters.

@tigerjj

tigerjj commented Jul 20, 2015

Be careful, this also strips out CJK (Chinese, Japanese, Korean)
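
One possible way around the CJK problem, sketched here as an assumption rather than as the gist author's fix, is to split the "enclosed chars" range into U+24C2 plus U+1F170 to U+1F251, so that nothing between the Basic Multilingual Plane symbol and the supplementary-plane block is matched. The names `NARROW_EMOJI_RANGES` and `strip_emoji_narrow` are hypothetical, and the ranges still do not cover every emoji.

```ruby
# Hypothetical variant of the gist's ranges; names are illustrative only.
NARROW_EMOJI_RANGES = Regexp.union(
  /[\u{1f600}-\u{1f64f}]/,         # emoticons
  /[\u{2702}-\u{27b0}]/,           # dingbats
  /[\u{1f680}-\u{1f6ff}]/,         # transport/map symbols
  /[\u{24c2}\u{1f170}-\u{1f251}]/, # enclosed chars, split so CJK is not in range (assumption)
  /[\u{1f300}-\u{1f5ff}]/          # symbols & pictographs
)

def strip_emoji_narrow(str)
  # dup so the caller's string is left untouched
  str.dup.force_encoding('utf-8').gsub(NARROW_EMOJI_RANGES, '')
end

puts strip_emoji_narrow("中文 한국어 \u{1F600}\u{1F680} ok")
```

With these ranges the Chinese and Korean text in the sample line is left alone and only the two emoji are removed.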

@philipgiuliani

strip_emoji_full method is missing!

@64kramsystem

This caught my attention because a colleague of mine used it as reference.

If the objective is to remove 4-byte characters from a UTF-8 string (a widespread problem for MySQL installations that use the default utf8 character set, which stores at most 3 bytes per character), then this is a more standard solution:

scrubbed_utf8_mb3_string = utf8_mb4_string.each_char.select { |char| char.bytesize < 4 }.join

Note that this code is taken from https://github.com/maximeg/activecleaner.
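
To make the trade-off of the byte-size filter concrete, here is a small sketch. The wrapper name `strip_four_byte_chars` is hypothetical; the body is the one-liner quoted above. It keeps anything encoded in 1 to 3 bytes (so CJK text survives, but so do 3-byte emoji such as U+263A) and drops every 4-byte character, including non-emoji ones such as the CJK Extension B ideograph U+20000.

```ruby
# Hypothetical wrapper around the one-liner above.
def strip_four_byte_chars(str)
  # Keep only characters whose UTF-8 encoding is shorter than 4 bytes.
  str.each_char.select { |char| char.bytesize < 4 }.join
end

sample = "中文 caf\u00e9 \u{1F600} \u263A"
puts strip_four_byte_chars(sample)

# 3-byte emoji are kept, 4-byte non-emoji characters are dropped:
p strip_four_byte_chars("\u263A")    # U+263A is 3 bytes in UTF-8
p strip_four_byte_chars("\u{20000}") # U+20000 is 4 bytes in UTF-8
```

So this is a filter for what MySQL's 3-byte utf8 can store, not an emoji filter as such.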

@juanroldan1989

Thanks for this method!

BTW, the comment above worked like a charm too :)
https://gist.github.com/adamlwatson/9623703#gistcomment-1785300

@loicginoux

loicginoux commented Jul 9, 2017

This does not work for all emoji.
See the complete list here: http://unicode.org/emoji/charts/full-emoji-list.html
Examples of unfiltered emoji:
U+1F195
U+1F1F2
U+1F6A7
...

The comment from @saveriomiroddi is better:
scrubbed_utf8_mb3_string = utf8_mb4_string.each_char.select { |char| char.bytesize < 4 }.join
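
Another option, sketched here as an assumption and not something proposed in this thread, is to lean on the Unicode `Emoji_Presentation` property instead of hand-written ranges. This assumes a reasonably recent Ruby whose regex engine knows that property; the method name `strip_emoji_by_property` is hypothetical. It removes the three code points listed above and leaves CJK text and digits alone, but it does not touch text-presentation symbols such as U+263A, variation selectors, or zero-width joiners.

```ruby
# Hypothetical helper; relies on the regex engine supporting
# the \p{Emoji_Presentation} Unicode property (assumption).
def strip_emoji_by_property(str)
  str.gsub(/\p{Emoji_Presentation}/, '')
end

# The three code points from the list above:
listed = "\u{1F195}\u{1F1F2}\u{1F6A7}"
p strip_emoji_by_property(listed).empty?
puts strip_emoji_by_property("中文 abc 123 \u{1F6A7}")
```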

@reducm

reducm commented Oct 18, 2017

It removes Chinese as well...

@guanting112

Try this:
https://github.com/guanting112/remove_emoji

(It does not remove any Chinese characters; it only strips emoji, as defined by the standard.)
