Strip and its associated methods in Ruby 1.9.2 and later (including 1.9.3 and 2.0.0) don't remove leading or trailing non-ASCII whitespace (such as the Unicode ideographic space or the Unicode non-breaking space). This is a change from Ruby 1.9.1, where these characters were removed. These space characters are recognized by /[[:space:]]/. This patch (for 2.0.0) restores the 1.9.1 behavior. The same patch applies to 1.9.2 and 1.9.3 with an offset and slight fuzz. This patch (excepting the isspace changes) also fixes a bug in Ruby 1.9.1 where strings would fail to hash consistently after being stripped if the strip converted a non-7-bit-ASCII-clean string to a 7-bit-ASCII-clean string. A constant, UNICODE_STRIP_PATCH, is provided on the String class so that code can check at runtime whether the patch is applied. See the comment below the diff for a fuller explanation (formatting is not allowed in the gist description, which otherwise makes it difficult to read).

Ruby-2.0.0-p0-String-Strip-Unicode-Support-Patch.patch
--- a/string.c 2013-05-04 10:44:37.201664309 -0400
+++ b/string.c 2013-05-04 11:06:36.660207151 -0400
@@ -6802,7 +6802,7 @@ rb_str_lstrip_bang(VALUE str)
int n;
unsigned int cc = rb_enc_codepoint_len(s, e, &n, enc);
- if (!rb_isspace(cc)) break;
+ if (!rb_enc_isspace(cc, enc)) break;
s += n;
}
@@ -6810,6 +6810,9 @@ rb_str_lstrip_bang(VALUE str)
STR_SET_LEN(str, t-s);
memmove(RSTRING_PTR(str), s, RSTRING_LEN(str));
RSTRING_PTR(str)[RSTRING_LEN(str)] = '\0';
+ if (ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
+ ENC_CODERANGE_CLEAR(str);
+ }
return str;
}
return Qnil;
@@ -6871,7 +6874,7 @@ rb_str_rstrip_bang(VALUE str)
while ((tp = rb_enc_prev_char(s, t, e, enc)) != NULL) {
unsigned int c = rb_enc_codepoint(tp, e, enc);
- if (c && !rb_isspace(c)) break;
+ if (c && !rb_enc_isspace(c, enc)) break;
t = tp;
}
}
@@ -6880,6 +6883,9 @@ rb_str_rstrip_bang(VALUE str)
STR_SET_LEN(str, len);
RSTRING_PTR(str)[len] = '\0';
+ if (ENC_CODERANGE(str) != ENC_CODERANGE_7BIT) {
+ ENC_CODERANGE_CLEAR(str);
+ }
return str;
}
return Qnil;
@@ -8143,6 +8149,7 @@ Init_String(void)
rb_include_module(rb_cString, rb_mComparable);
rb_define_alloc_func(rb_cString, empty_str_alloc);
rb_define_singleton_method(rb_cString, "try_convert", rb_str_s_try_convert, 1);
+ rb_define_const(rb_cString, "UNICODE_STRIP_PATCH", Qtrue);
rb_define_method(rb_cString, "initialize", rb_str_init, -1);
rb_define_method(rb_cString, "initialize_copy", rb_str_replace, 1);
rb_define_method(rb_cString, "<=>", rb_str_cmp_m, 1);

Strip and its associated methods in Ruby 1.9.2 and later (including 1.9.3 and 2.0.0) don't remove leading or trailing non-ASCII whitespace (such as the Unicode ideographic space or the Unicode non-breaking space). This is a change from Ruby 1.9.1, where these characters were removed. These space characters are recognized by /[[:space:]]/.

1.9.1p378 :001 > "\u3000a\u00a0".strip
=> "a"

1.9.2p320 :001 > "\u3000a\u00a0".strip
=> "　a "

1.9.3p286 :001 > "\u3000a\u00a0".strip
=> "　a "

2.0.0p0 :001 > "\u3000a\u00a0".strip
=> "　a "
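
For interpreters where the patch is not applied, the fact that /[[:space:]]/ still recognizes these characters suggests a pure-Ruby fallback. The following is only a sketch: strip_unicode_space is a hypothetical helper name, not part of the patch, and it assumes the string is validly encoded (an invalid byte sequence would raise ArgumentError during matching).

```ruby
# Hypothetical helper, not part of the patch: removes leading and trailing
# characters matched by /[[:space:]]/, which includes non-ASCII spaces such
# as U+3000 (ideographic space) and U+00A0 (non-breaking space).
def strip_unicode_space(str)
  # \A and \z anchor to the very start and end of the string, so only the
  # outer whitespace runs are removed; inner whitespace is left alone.
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

strip_unicode_space("\u3000a\u00a0")
```

Unlike the patched strip, this always allocates a new string and goes through the regexp engine, so it is a stopgap rather than a replacement for the C-level fix.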

This patch restores the 1.9.1 behavior, where strip and its associated routines ( [l/r]strip[!] ) remove encoding-specific spaces. Other than shifted line numbers, the patch works in 1.9.3 and 1.9.2 as well. The patch also fixes a bug in Ruby 1.9.1 where a string would fail to hash consistently if stripping whitespace converted a Unicode string to a 7-bit-ASCII-clean string. A constant, UNICODE_STRIP_PATCH, is provided on the String class to query whether the patch is present at runtime.
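
One way the runtime query might look is sketched below. The constant name comes from the patch; unicode_strip_patched? is a hypothetical helper name chosen for illustration.

```ruby
# Hypothetical helper: true only when the interpreter was built with the
# patch, i.e. String::UNICODE_STRIP_PATCH is defined and set to true.
def unicode_strip_patched?
  String.const_defined?(:UNICODE_STRIP_PATCH) &&
    String::UNICODE_STRIP_PATCH == true
end

puts(unicode_strip_patched? ? "patched strip" : "stock strip")
```

Checking const_defined? first avoids a NameError on a stock interpreter, where the constant does not exist.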

a = (orig = "middle lane swan street\u00A0").strip
b = "middle lane swan street"

a == b
=> true
a.hash == b.hash
=> false
orig.chop.hash == b.hash
=> true
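
The transcript above can be turned into a small check for any interpreter. This is a sketch only: hash_consistent? is a hypothetical name, and it simply asserts the usual contract that equal strings produce equal hashes.

```ruby
# Hypothetical check: two strings that compare equal must also hash equally,
# otherwise they will not behave as the same Hash key.
def hash_consistent?(a, b)
  a != b || a.hash == b.hash
end

orig = "middle lane swan street\u00A0"
b    = "middle lane swan street"

hash_consistent?(orig.chop, b)   # chop removes the trailing U+00A0
hash_consistent?(orig.strip, b)  # only meaningful where strip removes U+00A0
```

On an interpreter whose strip leaves U+00A0 in place, the second call passes trivially because the two strings are not equal.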

To install using rvm:
rvm reinstall ruby-2.0.0-p0-UNICODE_STRIP --patch /Ruby-2.0.0-p0-String-Strip-Unicode-Support-Patch.patch
rvm use ruby-2.0.0-p0-UNICODE_STRIP
