Created
August 5, 2021 21:08
Save kernelsmith/d03926b71f99b0672cf6a737b05919c1 to your computer and use it in GitHub Desktop.
Ruby string encoding defaults to UTF-8, but String#strip doesn't alter its definition of whitespace to match the encoding; it's always defined as '\x00\t\n\v\f\r '. This does not include Unicode whitespace no matter the string's encoding; see Ruby Regexp Character Classes for more info. It would appear that [[:space:]] does in fact include unic…
# This was probably encountered and solved a long time ago, but I ran into it
# in my own Ruby dealings and thought it might be an issue elsewhere.
#
# Ruby string encoding defaults to UTF-8, but String#strip doesn't alter its
# definition of whitespace to match the encoding.
# https://ruby-doc.org/core-2.6.8/String.html#method-i-strip
# String#strip removes leading/trailing whitespace defined as: '\x00\t\n\v\f\r '
# (null, horizontal tab, line feed, vertical tab, form feed, carriage return,
# and space). This does not include Unicode whitespace, no matter the string's
# encoding. See Regexp character classes for more info:
# https://ruby-doc.org/core-2.6.8/Regexp.html#class-Regexp-label-Character+Classes
# It would appear that [[:space:]] does in fact include Unicode whitespace
# (at least) when the string encoding is UTF-8. Unicode has all sorts of
# whitespace chars that you've probably never heard of, like the ogham space mark.
# https://en.wikipedia.org/wiki/Whitespace_character#Unicode
# Note: [[:space:]] does not match null, so unlike String#strip these helpers
# leave leading/trailing "\x00" in place.

# \z anchors at the very end of the string; the greedy [[:space:]]+ before it
# consumes any trailing newline as well.
STRIP_REGEX = /(?:\A[[:space:]]+|[[:space:]]+\z)/

# Strips in place. Like String#strip!, returns nil when nothing was removed.
def strip_harder!(str)
  str.gsub!(STRIP_REGEX, '')
end

# Returns a stripped copy and leaves str untouched.
def strip_harder(str)
  str.gsub(STRIP_REGEX, '')
end

s = "https://www.app.moc/support/security-bulletins.html\u00a0"
#=> "https://www.app.moc/support/security-bulletins.html "
strip_harder(s)
#=> "https://www.app.moc/support/security-bulletins.html"
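
To make the difference concrete, here's a rough sketch that runs String#strip and a [[:space:]]-based regex side by side over a handful of Unicode whitespace characters. The names SPACE_EDGES, SAMPLES, and strip_comparison are made up for this example (they aren't part of the gist above), and the sample characters are just a few picked from the Wikipedia list, not an exhaustive set. It assumes the strings are UTF-8, which is what "\u..." literals produce.

```ruby
# Hypothetical names for illustration only. The regex is redefined here so
# the snippet runs on its own without the helpers above.
SPACE_EDGES = /\A[[:space:]]+|[[:space:]]+\z/

# A few Unicode whitespace characters, chosen arbitrarily as samples.
SAMPLES = {
  "no-break space"    => "\u00a0",
  "ogham space mark"  => "\u1680",
  "em space"          => "\u2003",
  "ideographic space" => "\u3000"
}.freeze

# Pads "text" with each sample character on both sides, then records whether
# String#strip and the regex each get back to plain "text".
# Returns rows of [name, stripped_by_strip, stripped_by_regex].
def strip_comparison(samples = SAMPLES)
  samples.map do |name, ch|
    padded = "#{ch}text#{ch}"
    [name, padded.strip == "text", padded.gsub(SPACE_EDGES, "") == "text"]
  end
end

strip_comparison.each do |name, plain, regex|
  puts format("%-18s strip: %-5s regex: %s", name, plain, regex)
end
```

Every row should report false for strip and true for the regex, since none of these characters are in strip's fixed ASCII set.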