zacksheppard/post.md

## post.md

      
| layout | title | date | comments | categories |
| --- | --- | --- | --- | --- |
| post | Removing leading and trailing spaces and `&nbsp;` in Ruby | 2014-10-17 15:46:09 -0400 | true | |
  
  
tl;dr


If `.strip` isn't working, try `.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')`.
The POSIX space character class `[:space:]` matches tab (`\t`), line feed (`\n`), vertical tab (`\v`), form feed (`\f`), carriage return (`\r`), and space. In Unicode, it also matches no-break spaces (`&nbsp;`), next line, and the variable-width spaces. Because newlines and `&nbsp;` are sometimes used for ad hoc spacing, `[:space:]` is very useful for stripping all of these out when scraping.
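
As a minimal sketch of the difference, the snippet below pads a string with no-break spaces and compares the two approaches. The helper name `strip_unicode_space` and the sample string are made up for illustration; they are not from the original project.

```ruby
# Hypothetical helper: removes leading and trailing whitespace,
# including Unicode whitespace such as the no-break space (U+00A0).
def strip_unicode_space(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

# Illustrative sample: a no-break space on each end, like a decoded &nbsp;.
padded = "\u00A0 First sentence \u00A0"

# String#strip only handles ASCII whitespace and null, so nothing is removed here.
puts padded.strip == padded                 # prints "true"

# The [[:space:]] version removes the no-break spaces as well.
puts strip_unicode_space(padded).inspect    # prints "First sentence" (with quotes)
```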


For a recent project, I needed to scrape a set of pages on a site with inconsistent HTML, punctuation, and spacing. One of the challenges was getting rid of leading and trailing whitespace.
Ruby has a `.strip` method that does exactly this, but it wasn't working consistently. (It's also worth noting that there are lesser-known `.lstrip` and `.rstrip` methods, which remove left (leading) and right (trailing) whitespace from a string, respectively.)
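
A quick sketch of the three built-in variants on an ASCII-only string. The helper name `strip_variants` and the sample text are illustrative assumptions.

```ruby
# Hypothetical helper that applies all three built-in variants to one string.
def strip_variants(text)
  {
    left:  text.lstrip,  # leading whitespace removed
    right: text.rstrip,  # trailing whitespace removed
    both:  text.strip    # whitespace removed from both ends
  }
end

result = strip_variants("  \tNext line\r\n")
puts result[:left].inspect   # prints "Next line\r\n"
puts result[:right].inspect  # prints "  \tNext line"
puts result[:both].inspect   # prints "Next line"
```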
After some investigation, I figured out that there was a non-breaking space in some of the strings. This didn't show up when viewing the source or in the pry console:

    <p align="left"> First sentence <br /> Next line</p>

    pry> doc.css('.leading-0').children[2].to_s
    => " First sentence "
    pry> doc.css('.leading-0').children[4].to_s
    => " Next line"

However, it did show up in the browser's inspector console:

    <p align="left">
    "Location:"
    <br>
    " First sentence&nbsp;"
    <br>
    " Next line"
    </p>

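
One way to confirm a suspicion like this from the console is to look at code points rather than at the printed string. The sketch below uses a hypothetical helper, `non_ascii_whitespace`, and hand-written sample strings (not the scraped page) to list any whitespace characters that fall outside ASCII.

```ruby
# Hypothetical helper: returns the code point (as U+XXXX) of every
# whitespace character in the string that is not plain ASCII.
def non_ascii_whitespace(str)
  str.each_char
     .select { |ch| ch.match?(/[[:space:]]/) && !ch.ascii_only? }
     .map { |ch| format('U+%04X', ch.ord) }
end

# Illustrative strings: the first ends with a no-break space, the second does not.
p non_ascii_whitespace(" First sentence\u00A0")  # prints ["U+00A0"]
p non_ascii_whitespace(" Next line")             # prints []
```

An empty result means every whitespace character in the string is ordinary ASCII, which `.strip` already handles.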
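Trying to gsub the characters out with `.gsub(/\&nbsp\;/, '')` didn't work, presumably because the parser had already decoded the entity into an actual no-break space character (U+00A0), so the literal text `&nbsp;` was no longer in the string. After some googling, I found the POSIX space character class and was able to remove all leading and trailing spaces with this gsub: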

    gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')

This matches the start of the string (`\A`) followed by one or more whitespace characters, or (`|`) one or more whitespace characters followed by the end of the string (`\z`).
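
The same pattern can also be written in extended (`/x`) mode so that each half carries its own comment. This is only a sketch; the constant `EDGE_SPACE` and the method `trim_edges` are hypothetical names, and the sample string is invented.

```ruby
# Hypothetical constant: the same regex, spread out with comments.
EDGE_SPACE = /
  \A [[:space:]]+   # start of string, then one or more whitespace characters
  |                 # or
  [[:space:]]+ \z   # one or more whitespace characters, then end of string
/x

# Hypothetical helper: removes whitespace at both edges of the string.
def trim_edges(str)
  str.gsub(EDGE_SPACE, '')
end

# Whitespace inside the string is left alone; only the edges are trimmed.
puts trim_edges("\n\u00A0 First  sentence \r\n").inspect  # prints "First  sentence"
```

Because both alternatives are anchored, runs of whitespace in the middle of the string never match.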
It was a particularly good find for this project because the pages I'm scraping are littered with spaces and multiple `\n`s and `\r`s. `[:space:]` gets rid of all of them!