SearchFox Not Suited for Aggregated, High-traffic Feeds? And Some Comments on Community Attention.

Martin Dittus · 2005-11-04 · commentary, data mining, recommendation engines, tools · 2 comments

Just read in a comment by Esteban Kozak that SearchFox RSS uses both "attention and community data" when determining the value of an article, which means that some of the weird effects documented earlier might be a result of other people's behavior, as opposed to my own.

To recapitulate: I'm trying to understand the algorithms behind SearchFox RSS's "Topics I Like" listing, and found that some terms are conspicuously high on the list where they don't really deserve to be (currently: "quake", "ning" -- see image below), and others that I care about more are nowhere to be found (currently: "ruby", "rails").

My current list of "topics I like" in the SearchFox RSS reader.

But I should also note that I'm watching the rather high-traffic del.icio.us Rails feed, and of course there are a lot more posts in it than I care to read. So another explanation for the system's apparent ignorance of my interest in Rails might be the disparity between the high number of occurrences of the term in my feeds and the comparatively low number of articles I actually click on.

Translation: I might read more articles containing the term "Rails" than the terms "oktober" or "macorama", but there are even more articles containing the term "Rails" that I don't read. And I definitely read every single fscklog article (where the other two terms come from).

Which could mean that SearchFox's algorithms are not really suited for aggregated feeds such as http://del.icio.us/rss/tag/rails where the click-to-article ratio is noticeably lower than with a normal blog feed.
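To make the suspected effect concrete, here is a minimal sketch. It assumes a term is scored by something like its click ratio (clicks divided by articles seen). I don't know SearchFox's real algorithm: the function name `term_scores`, the article ids, and all the numbers are made up for illustration.

```python
from collections import Counter


def term_scores(articles, clicked_ids):
    """Score each term by the fraction of articles containing it that were clicked.

    Hypothetical scoring rule, NOT SearchFox's actual algorithm.
    articles: dict mapping article id -> list of terms in that article
    clicked_ids: set of article ids the reader clicked on
    """
    seen = Counter()
    clicked = Counter()
    for article_id, terms in articles.items():
        for term in set(terms):  # count each term once per article
            seen[term] += 1
            if article_id in clicked_ids:
                clicked[term] += 1
    return {term: clicked[term] / seen[term] for term in seen}


# Invented numbers: 100 "rails" posts from an aggregated feed, 10 of them read;
# 5 "oktober" posts from a single blog, all of them read.
articles = {f"rails-{i}": ["rails"] for i in range(100)}
articles.update({f"fscklog-{i}": ["oktober"] for i in range(5)})
clicked_ids = {f"rails-{i}" for i in range(10)} | {f"fscklog-{i}" for i in range(5)}

scores = term_scores(articles, clicked_ids)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Under this assumed rule "oktober" outranks "rails", even though I clicked on twice as many Rails articles in absolute terms.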

My recommendations to SearchFox's developers

First, please talk a bit about the algorithms involved, so that we understand how to use the system to its fullest potential and can anticipate which actions might destroy the validity of its attention arithmetic. For example, I would like to know if "quake" and "ning" are "Topics I Like" because a lot of other people like them, or because of something I did.

Second, let us help you improve your algorithms. Talk to your users so that we can help you get a better understanding of how we are actually using the system. I have the feeling that your algorithms are confused by high-traffic feeds, because I do click on a lot of Rails-related articles and yet the term doesn't show up in my list.

Third, let's see some of that attention data ;). Now that I know that the internal logic is also community-driven I am really curious to know what other people read. Make a "Topics Our Users Like" list with at least the most popular 100 words (let's see some of that long tail!), and link each word to a search in my own feeds.

Ok, enough for now. Just one more thing: I don't get why people are so content with using Rojo. I used it for a week or so in parallel with SearchFox RSS, and SearchFox clearly took the lead. I guess that as soon as SearchFox gets some more attention, people will realize it's the better system, and Rojo will be toast ;)

Comments

Well, there is nothing weird about SearchFox's behavior. Your diagnosis of the problem is right on the money. We are working on several ways to improve the algorithm. First, we'll add an aging policy for topics. Second, we'll adjust the score calculation to include heavier weight on the total number of clicks on a topic as an extra measure of interestingness.

As for the attention data APIs, you'll have to be patient. But I promise we'll get there.

Esteban Kozak, 2005-11-04 19:36 CET (+0100)

Esteban,

thanks for the clarifications.

And don't worry, I'm patient ;)

martin, 2005-11-04 19:44 CET (+0100)


dekstop: weblog