Note: Just before I started writing this, I noticed that the Feedzirra gem had 1) changed its name to Feedjira (explanation here) and 2) pushed a bunch of updates. Everything I talk about below worked on 0.2.4, and appears to work just fine on 1.2.0. Also, looking at the CHANGELOG, none of the updates should mess with what I talk about here.
Web feeds (RSS, Atom, etc., though I tend to call them all "RSS feeds"), while sometimes viewed as unfashionable or antiquated, are ubiquitous among web publishers. When you are building a system that needs to know when new content is published, RSS feeds are a perfect low-bar method for programmatically discovering new content.
In the Ruby world, Feedjira (formerly Feedzirra) is the most popular tool for fetching and parsing web feeds. It does a great job out of the box; just give it a feed URL (or an array of feed URLs) and it will return a Feed object with an entries method to access all the feed entries with standardized method names. The documentation explains this well, so I'll stop there.
What I'd like to talk about are the times when you want to customize the feed parsing a little bit further: when you find yourself constantly doing some kind of transformation on the fields of a particular RSS feed.
Maybe you want to save the URL for an article, but (because of your business logic) you don't want to save the query string. Or maybe you want to use Feedjira to fetch and parse non-standard feeds, or even just some rando XML residing at an API endpoint.
Thankfully, SAX Machine makes defining parsers dead easy, and Feedjira (which uses SAX Machine) exposes a method to add parsers to its internal stack.
The rest of this post walks through an example of creating a custom RSS feed parser and offers a way to organize your code in a Rails app.
Let's take this RSS feed:
<script src="https://gist.github.com/mertonium/11087612.js?file=janky_feed.xml"></script>
We want to build a parser that:
- Removes the query string from the article URL
- Gets the published timestamp from pubDate instead of dc:created.
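The first goal can be tried on its own before wiring it into a parser. This is a minimal sketch using only Ruby's standard library; the helper name strip_query and the example URL are assumptions for illustration, not part of Feedjira or of the parser below.

```ruby
require "uri"

# Hypothetical helper (not part of Feedjira): return the URL without its query string.
def strip_query(url)
  uri = URI.parse(url)
  uri.query = nil # drop everything after the "?"
  uri.to_s
end

# Example URL is made up for illustration.
clean_url = strip_query("http://example.com/articles/42?utm_source=rss&utm_medium=feed")
puts clean_url
```

Inside a parser, the same logic could live in whatever method exposes the entry URL, so callers never see the query string.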
Here is the full code for a custom parser that achieves these goals (there are explanatory comments in the code):
<script src="https://gist.github.com/mertonium/11087612.js?file=janky_parser.rb"></script>
We can see the class in action by firing up an irb console and parsing our actual feed.
<script src="https://gist.github.com/mertonium/11087612.js?file=command_line_example.rb"></script>
Our custom feed parser is complete! If you are using Feedjira in a Rails application, you'll need to include your parser in an initializer. At Versa, our custom parsers live in a parsers folder in our app folder. Then we use an initializer to add them to Feedjira.
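A minimal sketch of such an initializer, assuming a parser class named JankyParser defined in app/parsers/janky_parser.rb (both the class name and the file names are assumptions for illustration). It uses Feedjira::Feed.add_feed_class, the registration method in the Feedjira versions discussed here; newer major versions may configure parsers differently, so check the README for the version you run.

```ruby
# config/initializers/feedjira.rb (hypothetical file name)
#
# Assumes JankyParser lives in app/parsers/janky_parser.rb so Rails can
# autoload it. The class name is an assumption for illustration.
Rails.application.config.to_prepare do
  # Put the custom parser at the front of Feedjira's parser stack so it is
  # tried before the built-in parsers.
  Feedjira::Feed.add_feed_class(JankyParser)
end
```

Wrapping the call in to_prepare is a design choice here, not something the original setup necessarily did: it defers the reference to the autoloaded class until the application code is ready to load.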
With our custom parser(s) added to Feedjira's parser stack, it will automatically parse the feeds it is supposed to.
The last thing I wanted to mention on this is that obviously we should be writing tests for our parsers. I won't go into great detail, as it could be a post by itself, but I did write a spec for this example parser.
And with that, we've built a custom feed parser for Feedjira and integrated it into our Rails app. Go forth and consume ye feeds.