Using the FeedTools Cache in Plain Ruby Scripts

Martin Dittus · 2005-12-08 · code, tools · 4 comments

FeedTools is an amazingly complete Ruby library by Bob Aman for accessing, parsing and generating feeds from within Ruby. While it is designed to work well with Rails applications, you can just as easily use it in your Ruby scripts:

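require 'rubygems'
require 'feed_tools'

# fetch and parse the feed, then print its title and each item's title and link
feed = FeedTools::Feed.open('http://dekstop.de/weblog/index.xml')
puts feed.title
feed.items.each do |item|
  puts item.title
  puts item.link
end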

Requesting a feed every time you run your script is fine as long as you only parse your own feeds, but you should be a bit more polite as soon as you start requesting someone else's feed. Over time feeds can cost site owners a lot of bandwidth, so keep that in mind when you start developing your own feed reader. A proper aggregator keeps a local cache of all feeds and only downloads feeds when they have actually changed, not every time the application is run.
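To make "only downloads feeds when they have actually changed" a bit more concrete, here is a minimal sketch of an HTTP conditional GET using only Ruby's standard library. It is not how FeedTools does it internally (I haven't checked); the helper name conditional_feed_request and the ETag value are made up for illustration.

```ruby
require 'net/http'
require 'time'
require 'uri'

# Build a GET request that asks the server to send the feed only if it
# changed since our cached copy. `etag` and `last_modified` are values we
# would have stored from an earlier response (hypothetical cache fields).
def conditional_feed_request(url, etag = nil, last_modified = nil)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request['If-None-Match'] = etag if etag
  request['If-Modified-Since'] = last_modified.httpdate if last_modified
  request
end

request = conditional_feed_request(
  'http://dekstop.de/weblog/index.xml',
  '"abc123"',                        # made-up ETag for illustration
  Time.utc(2005, 12, 8, 21, 12, 1)   # when we last fetched the feed
)
puts request['If-None-Match']        # "abc123" (including the quotes)
puts request['If-Modified-Since']    # Thu, 08 Dec 2005 21:12:01 GMT
```

The sketch only builds the request and never sends it. A server that honours these headers answers with 304 Not Modified and an empty body when nothing has changed, in which case the cached copy can be reused.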

FeedTools has excellent support for keeping and managing such a local cache; it's just not documented very well. To address that, I'll show you how I've been caching feed requests with FeedTools 0.2.17. As you'll see, it's quite simple as long as you are aware of some caveats.

The following description shows you how to use the default FeedTools caching mechanism with FeedTools::DatabaseFeedCache, which is based on David Heinemeier Hansson's ActiveRecord. In order to use this method you need Rails installed locally, or at least the ActiveRecord gem. Another option would be to implement your own custom cache store, but that's beyond the scope of this article (see FeedTools::feed_cache= for some pointers). Update: I've implemented a simple custom cache store for SQLite3.

Note that I've only started using FeedTools, so there might be errors -- any feedback and suggestions are greatly appreciated.

Prerequisites

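The default FeedTools cache is automagically enabled as soon as you fulfill the following prerequisites: ActiveRecord is installed, a database.yml describes your database connection, and that database contains the FeedTools cache table. Each of these is covered in more detail below.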

That's it. A database connection is established as soon as you require the FeedTools library, and FeedTools::Feed.open will start caching requests transparently.

Internals, Caveats

FeedTools keeps a reference to a class that implements the caching mechanism, and you can access this reference via FeedTools::feed_cache() -- by default this points to the class FeedTools::DatabaseFeedCache, a subclass of ActiveRecord::Base.

When the FeedTools library is loaded, it calls FeedTools::feed_cache.initialize_cache(), which at this point is really a message to FeedTools::DatabaseFeedCache.initialize_cache().

This function attempts to locate your database.yml, extracts the database options for the environment specified in FeedTools::FEED_TOOLS_ENV and uses these options to create a database connection with a call to ActiveRecord::Base.establish_connection().

Note that in contrast to Rails, FeedTools defaults to loading the production database environment; it took me a couple of failed attempts and a look at the FeedTools code to figure that one out.
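The lookup described above can be sketched with plain YAML. This is not the FeedTools code, only an illustration of picking one environment's settings out of a database.yml; the helper database_options and the constant DATABASE_YML are hypothetical names.

```ruby
require 'yaml'

# Stand-in for the contents of config/database.yml.
DATABASE_YML = <<CONFIG
development:
  adapter: sqlite3
  dbfile: db/feeds.db

production:
  adapter: sqlite3
  dbfile: db/feeds.db
CONFIG

# Hypothetical helper: return the settings for one environment,
# defaulting to 'production' as FeedTools 0.2.17 does.
def database_options(yaml_text, environment = 'production')
  config = YAML.load(yaml_text)
  options = config[environment]
  raise ArgumentError, "no settings for #{environment}" unless options.is_a?(Hash)
  options
end

options = database_options(DATABASE_YML)
puts options['adapter']   # sqlite3
puts options['dbfile']    # db/feeds.db
```

A hash like this is what ends up being handed to ActiveRecord::Base.establish_connection().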

I assume you could use another database environment by changing the value of FeedTools::FEED_TOOLS_ENV to e.g. 'development' and calling FeedTools::feed_cache.initialize_cache() again -- I'm not yet sure how you would prevent the initial connection to the default environment though.

Another thing to note is that FeedTools loads the database environment slightly differently from what you might be used to, with an important side effect. In Rails there is an easy way to tell ActiveRecord to use the same configuration for several environments:

development:
  adapter: sqlite3
  dbfile: db/feeds.db

production:
  development

This shortcut doesn't work with FeedTools outside of Rails. You need to specify all your environments explicitly, like this:

development:
  adapter: sqlite3
  dbfile: db/feeds.db

production:
  adapter: sqlite3
  dbfile: db/feeds.db
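
You can see why the shortcut is a problem by parsing it with plain YAML: the production entry comes back as the string "development" rather than a hash of options, so anything that expects a hash has to resolve the reference first. The helper resolve_environment below is a hypothetical sketch of such a step, not something FeedTools provides.

```ruby
require 'yaml'

SHORTCUT_YML = <<CONFIG
development:
  adapter: sqlite3
  dbfile: db/feeds.db

production:
  development
CONFIG

config = YAML.load(SHORTCUT_YML)
p config['development'].class   # Hash
p config['production']          # "development"

# Hypothetical helper: follow one level of "use that environment" reference.
def resolve_environment(config, name)
  value = config[name]
  value.is_a?(String) ? config[value] : value
end

p resolve_environment(config, 'production')['adapter']   # "sqlite3"
```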

Preparing Your Database

When using FeedTools 0.2.17 you need to set up the database yourself -- in older versions this used to be done automatically, but that apparently had all kinds of issues and thus was abandoned.

The schema definitions for SQLite, PostgreSQL and MySQL are provided with the gem, see the subdirectory ./db/ in the FeedTools gem's local installation directory.

Here's the content of my schema.sqlite.sql:

-- Example Sqlite schema
  CREATE TABLE feeds (
    id                INTEGER PRIMARY KEY NOT NULL,
    url               VARCHAR(255) DEFAULT NULL,
    title             VARCHAR(255) DEFAULT NULL,
    link              VARCHAR(255) DEFAULT NULL,
    feed_data         TEXT DEFAULT NULL,
    feed_data_type    VARCHAR(20) DEFAULT NULL,
    http_headers      TEXT DEFAULT NULL,
    last_retrieved    TIMESTAMP DEFAULT NULL
  );

I've found that there are timezone conversion issues when using DATETIME columns with ActiveRecord and SQLite, so I recommend using TIMESTAMP instead (the SQLite schema delivered with FeedTools uses DATETIME). I have not explored this further however, so I could be talking out of my ass.

The database table used by FeedTools::DatabaseFeedCache, as you can see above, is named 'feeds'. If you want FeedTools to use a different table name (e.g. because it collides with existing table names) you could probably simply override FeedTools::DatabaseFeedCache.table_name before you connect to the database, and of course you need to adjust your database schema accordingly.

Using the Cache

Besides the above, you usually don't have to change anything in your code; FeedTools handles your cache appropriately. It does this in a number of elegant ways (take a look at the documentation and source code): among other things it stores an expiration period with each feed, and it only re-requests a feed if it has actually changed on the server. The expiration period is taken from the feed itself, and defaults to an hour if the feed doesn't specify it.
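
The expiration rule is simple enough to sketch in a few lines of plain Ruby. This is my own illustration of the rule as described, not FeedTools code; cache_expired? and ONE_HOUR are hypothetical names.

```ruby
ONE_HOUR = 60 * 60  # default time to live in seconds, as described above

# Hypothetical helper: a cached feed is stale once its time to live has
# passed; fall back to one hour if the feed did not specify one.
def cache_expired?(last_retrieved, time_to_live = nil, now = Time.now)
  now >= last_retrieved + (time_to_live || ONE_HOUR)
end

last_retrieved = Time.utc(2005, 12, 8, 21, 12, 1)
puts cache_expired?(last_retrieved, nil, Time.utc(2005, 12, 8, 21, 30, 0))  # false
puts cache_expired?(last_retrieved, nil, Time.utc(2005, 12, 8, 22, 30, 0))  # true
```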

Here are some examples of how you might use the cache:

if (FeedTools::feed_cache_connected?)
  puts "All requests are cached"
end


feed_url = 'http://dekstop.de/weblog/index.xml'

# request a feed and cache it
feed = FeedTools::Feed.open(feed_url)
puts feed.live?    # true

# request the same feed again
feed = FeedTools::Feed.open(feed_url)
puts feed.live?    # false -> we are using a cached version


# check when a cached feed is scheduled to be updated
puts "last retrieved: #{ feed.last_retrieved }"
puts "expires: #{ feed.last_retrieved + feed.time_to_live }"

# output:
# last retrieved: Thu Dec 08 21:12:01 UTC 2005
# expires: Thu Dec 08 22:12:01 UTC 2005


# request a feed from cache regardless of its expiration state
# (e.g. to prevent time consuming network connections)
feed = FeedTools::Feed.open(feed_url, :cache_only => true)
puts feed.live?    # false

Wrapping Up

For the sake of clarity I like to keep a copy of the schema file in my project's directory structure. I normally use a directory structure like this:

./config/database.yml
./db/feeds.db
./db/schema.sqlite.sql
./my_script.rb

This is the content of a database.yml that tells ActiveRecord to use SQLite as database storage (note that in the current FeedTools version only the production environment is actually used):

development:
  adapter: sqlite3
  dbfile: db/feeds.db

production:
  adapter: sqlite3
  dbfile: db/feeds.db

test:
  adapter: sqlite3
  dbfile: db/feeds-test.db

And this is how you quickly generate the corresponding SQLite database file from your schema definition:

$ sqlite3 db/feeds.db < db/schema.sqlite.sql 

That's it in a nutshell. If you have been watching the progress Bob is making with FeedTools you will be aware that any of this is subject to change -- I expect that FeedTools 0.3 will work quite differently regarding database specifics.

Again, I appreciate any feedback -- I'm also interested to hear from you if you're doing it in a different way, or if you have implemented a custom cache store.

Comments

Great write-up!

Btw, it's "Aman," not "Amann." :-)

Regarding configuration, technically, you don't need database.yml, that's mostly a crutch for testing, and quickie scripts. Basically, you can use any method you want to set up ActiveRecord. The only thing that matters is that ActiveRecord knows where the database is. If you're running Rails, this will already be done via the Rails application's own database.yml file.

Strangely, I can't remember why I defaulted to 'production'... I'm sure there was a reason, but it might've been something dumb. Within a Rails application though, it should just use Rails' own environment.

I probably ought to steal Rails' code for loading database.yml. I basically just wrote up a quick hacky script to slurp up the values. It's really pretty ugly.

Fortunately, configuration stuff is getting a major overhaul soon, so I'll probably work that into a future version.

Regarding the schema definitions, those are also available in the RDoc, which is possibly more convenient for some people. Regarding timestamp on SQLite, believe it or not, I've personally never tested FeedTools with SQLite. In fact, I just guessed what I thought the file would look like, and since I never got any complaints, I figured it probably worked. Very sloppy of me, but I'm yet to get SQLite successfully installed, so... In any case, you're probably quite right about the time zone issues.

And yeah, definitely, everything is subject to change. Especially with 0.3.x. I've seen bunches of libraries that tried to maintain backward compatibility with older versions, and man, that gets ugly, really fast. Fortunately, FeedTools is abstract enough that even with dramatic changes, most people never realize anything's different, since they never go beyond the open method and the feed/feed item accessors.

Bob Aman, 2005-12-09 18:38 CET (+0100) Link


Bob,

thanks for the comment! I've corrected your name in the article.

Yeah I've never actually tried to use ActiveRecord in plain Ruby, I'm more used to accessing the database "manually" when outside of Rails -- it's good to know that there are other (shorter?) options to initialize the database. Maybe at some point I'll look into that and post an update, unless someone else posts a link here.

As I've hinted at above I've never fully tried to understand the DATETIME/TIMESTAMP differences in SQLite, I usually simply use timestamps. If I find some time I'll research this further and send you more detailed feedback.

I also might look into implementing an SQLite cache store that doesn't rely on ActiveRecord, in that case I'll write up an article here and post the code.

I absolutely agree with you regarding backward compatibility -- if it's up to me I'd rather use clean frameworks and update my code for new revisions. Rails itself already has too many old way/new way alternatives, and I've found that this usually stands in the way of clear and consistent code structure in my applications. To be honest I'd rather be forced to update my code, but then I'm not hosting big commercial applications, so who am I to complain... ;)

Thanks a lot for this great framework!

martin, 2005-12-09 18:56 CET (+0100) Link


I do try to keep significant API changes to major version number changes. Makes it easier for people to use with rubygems.

Bob Aman, 2005-12-13 00:46 CET (+0100) Link


In order to change the name of the cache table, the following has worked in Rails. Add this to the end of config/environment.rb:

require 'feed_tools'
FeedTools.configurations[:feed_cache] = "FeedTools::DatabaseFeedCache"
FeedTools::DatabaseFeedCache
class FeedTools::DatabaseFeedCache < ActiveRecord::Base
  set_table_name 'feed_tools_cache'
end

Joshua Llorach, 2006-06-16 10:46 CET (+0100) Link

