Parsing an OPML Document Recursively With Ruby While Preserving Its Structure

Martin Dittus · 2006-02-14 · code, tools · 3 comments

I just started to write an aggregator in Ruby which will form the basis of a number of web applications, and a couple of minutes into the project I'm already excited about the expressiveness of Ruby and its standard library. So much so that I had to share the results of my first five minutes of coding.

I decided that the aggregator I'm writing will take its feed URLs from an OPML document.

A nice property of OPML is that it allows you to group feeds into a hierarchy of named elements, so that you can e.g. group some feeds in a "blogs" category, some other feeds in a "news" category, and so on. You could even have subgroups, so that e.g. your "news" category has subcategories "politics", "weather", "tech", etc.

So I thought a bit about how you can parse the OPML in a way that extracts feed URLs and still preserves this notion of hierarchical "categories" -- and it turns out it's remarkably simple, and it did indeed only take a couple of minutes to implement, most of which was spent reading up on API calls.

Here's the complete function:

# parse_opml (opml_node, parent_names=[])
#
# takes an REXML::Element that has OPML outline nodes as children, 
# parses its subtree recursively and returns a hash:
# { feed_url => [parent_name_1, parent_name_2, ...] } 
#
def parse_opml(opml_node, parent_names=[])
  feeds = {}
  opml_node.elements.each('outline') do |el|
    if (el.elements.size != 0) 
      feeds.merge!(parse_opml(el, parent_names + [el.attributes['text']]))
    end
    if (el.attributes['xmlUrl'])
      feeds[el.attributes['xmlUrl']] = parent_names
    end
  end
  return feeds
end

And here's how you call it:

require 'rexml/document'

opml = REXML::Document.new(File.read('my_feeds.opml'))
feeds = parse_opml(opml.elements['opml/body'])

To make it clear what I'm trying to do I'll show you a simple example. If this is the content of my_feeds.opml:

<opml version="1.1">
<body>
  <outline xmlUrl="http://example1.com/feed" />
  <outline text="blogs">
    <outline xmlUrl="http://example2.com/feed" />
    <outline text="dev">
      <outline xmlUrl="http://example3.com/feed" />
    </outline>
  </outline>
</body>
</opml>

...then the hash returned from parse_opml will look like this:

{
  "http://example1.com/feed" => [],
  "http://example2.com/feed" => ["blogs"],
  "http://example3.com/feed" => ["blogs", "dev"]
}
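
Once you have that hash, it's easy to turn it around and list feeds per category path. The following is only a minimal sketch: feeds_by_category is a made-up helper name that isn't part of the code above, and joining the parent names with '/' is just one assumed way to label a path.

```ruby
# Hash in the shape parse_opml returns for the example above.
feeds = {
  "http://example1.com/feed" => [],
  "http://example2.com/feed" => ["blogs"],
  "http://example3.com/feed" => ["blogs", "dev"]
}

# Hypothetical helper (not part of the original code):
# inverts the hash into { "category/path" => [feed_url, ...] }.
def feeds_by_category(feeds)
  grouped = {}
  feeds.each do |url, parent_names|
    # Top-level feeds have no parents, so their path is the empty string.
    path = parent_names.join('/')
    (grouped[path] ||= []) << url
  end
  grouped
end

feeds_by_category(feeds).each do |path, urls|
  label = path.empty? ? '(top level)' : path
  puts "#{label}: #{urls.join(', ')}"
end
```

Feeds without a parent end up under the empty path, which the loop prints as "(top level)".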

And I'm still amazed. Eat this, PHP ;)

Comments

I took a look at this to try to make it work for the open source podcast directory project (openpodcastdirectory.org). It looks nice, but it breaks if you put more than one link into a category. So if blogs has two links you get something like this:

http://example5.com/feed:
- blogs2
http://example3.com/feed: &id001
- blogs
- dev
http://example2.com/feed:
- blogs
http://example4.com/feed: *id001

Any idea how to fix the problem? I'm trying to make something like this work so I can create the right parent and child categories given an OPML list.

Here's a test file: http://www.digitalpodcast.com/opml/test1.opml

If you have any idea please let me know.

Thanks

Alex
DigitalPodcast.com

Alex Nesbitt, 2006-06-29 06:28 CET (+0100)

Sorry, I guess it does work. I was getting confused by the way it displays.

I still have the problem that when the same feed is in two categories it gets overwritten in the hash. Any ideas how to fix that?

Alex, 2006-06-29 06:39 CET (+0100)

>I still have the problem that when the same feed is in two categories it gets overwritten in the hash. Any ideas how to fix that?

Ahh, that's true. And annoying. It didn't occur to me because my current feed reader does not allow putting a feed in multiple folders, so I didn't even think about testing this.

I don't think there's an obvious quick fix without changing the overall logic, but if anyone has suggestions let us know.

Martin Dittus, 2006-07-07 16:28 CET (+0100) Link
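
For what it's worth, one way the overwrite problem discussed above might be approached is to keep a list of parent-name lists per feed URL instead of a single one. This is only a sketch under assumptions: parse_opml_multi is a hypothetical name, the changed return shape means callers of the original parse_opml would need adjusting, and it has only been reasoned through against the tiny inline OPML sample shown here.

```ruby
require 'rexml/document'

# Hypothetical variant of parse_opml (not the original function).
# Returns { feed_url => [[parent_name_1, ...], [other_parent_1, ...]] },
# i.e. one list of parent names for every place the feed appears.
def parse_opml_multi(opml_node, parent_names = [], feeds = {})
  opml_node.elements.each('outline') do |el|
    url = el.attributes['xmlUrl']
    # Append instead of assigning, so a repeated URL keeps all its paths.
    # dup gives each entry its own array rather than a shared object.
    (feeds[url] ||= []) << parent_names.dup if url
    if el.elements.size != 0
      parse_opml_multi(el, parent_names + [el.attributes['text']], feeds)
    end
  end
  feeds
end

# Assumed sample input: the same feed listed in two categories.
opml = REXML::Document.new(<<~OPML)
  <opml version="1.1">
  <body>
    <outline text="blogs">
      <outline xmlUrl="http://example2.com/feed" />
    </outline>
    <outline text="news">
      <outline xmlUrl="http://example2.com/feed" />
    </outline>
  </body>
  </opml>
OPML

feeds = parse_opml_multi(opml.elements['opml/body'])
```

With this input, the single feed URL maps to both [“blogs”] and ["news"] rather than only the last one seen.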

dekstop: weblog