Thursday, April 21, 2011

While making an RSS reader which saves articles, how can I prevent duplicates?

Let's say I have an RSS feed which lists the 3 newest questions on SO. At 1 o'clock, the feed looks like this:

  • While making an RSS reader which saves articles, how can I prevent duplicates?
  • Convert char array to UNICODE in MFC C++
  • How to deploy a Java Swing application with an embedded JavaDB database?

At 2 o'clock, this feed looks like:

  • django url from another template than the one associated with the view-function
  • While making an RSS reader which saves articles, how can I prevent duplicates?
  • Convert char array to UNICODE in MFC C++

(The last two articles are the duplicates: they already appeared in the 1 o'clock feed.)

I want to download the RSS feed every 5 minutes, parse it, and save only the articles that aren't already saved. I do not want duplicates, that is, items that remain in the updated feed, as in the example above. What can I use to determine whether an article is already saved? Thanks

From stackoverflow
  • In theory, you can just use the guid element for RSS 2 and the id element for Atom. Each is supposed to be permanent and unique to its item. However, in practice some sites don't conform to this, so you have to use heuristics.

    Time Machine : Sorry, I am making a generic RSS reader which should be able to read all feeds from all sites.
    Peter Hosey : Koning Baard: That's where the heuristics come in. Check for a duplicate permalink, title, description/summary, etc. How far you go depends on how sensitive you want to be to duplicates: the further you go beyond the spec's requirements, the more you risk hypersensitivity, i.e. treating distinct articles as duplicates.
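
The approach described in the answer and comments could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: it assumes each feed item has already been parsed into a plain dict with optional "guid", "id", "link", "title" and "summary" keys, and the names entry_key and save_new_entries are made up for this example. The order of the fallbacks (identifier, then permalink, then a hash of the text) is one possible choice of heuristic, not something the specs prescribe.

```python
import hashlib


def _normalize(text):
    """Collapse whitespace and lowercase, so trivial edits do not create 'new' items."""
    return " ".join((text or "").split()).lower()


def entry_key(feed_url, entry):
    """Build a key that should stay the same for an item across polls.

    `entry` is assumed to be a dict produced by whatever parser is in use
    (hypothetical shape: guid/id/link/title/summary, all optional).
    """
    # 1. Prefer the identifier the formats provide: guid (RSS 2) or id (Atom).
    for field in ("guid", "id"):
        value = (entry.get(field) or "").strip()
        if value:
            return f"{feed_url}|id|{value}"

    # 2. Heuristic fallback: the permalink.
    link = (entry.get("link") or "").strip()
    if link:
        return f"{feed_url}|link|{link}"

    # 3. Last resort: a hash of the normalized title and summary.
    text = _normalize(entry.get("title")) + "\n" + _normalize(entry.get("summary"))
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{feed_url}|hash|{digest}"


def save_new_entries(feed_url, entries, seen, store):
    """Append unseen entries to `store`; return how many were added.

    `seen` is a set of keys already saved. A real reader would persist it,
    for example as a unique column in a database table.
    """
    added = 0
    for entry in entries:
        key = entry_key(feed_url, entry)
        if key in seen:
            continue  # already saved during an earlier poll
        seen.add(key)
        store.append(entry)
        added += 1
    return added
```

With the feeds from the question, the 1 o'clock poll would save three items and the 2 o'clock poll only the one new item, because the other two produce keys that are already in the set. Including the feed URL in the key keeps identifiers from different sites from colliding, at the cost of not detecting the same article syndicated through two feeds.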
