12 February 2008

Handling Redundancy in Content

Many feed sources today provide content using Atom or RSS and augment it with application-specific namespaces whose tags often duplicate content already present in the container format. For example, an iTunes podcast feed can carry a <description> inside the RSS <item> tag along with an <itunes:summary> description of different quality.

Up to 1.4.x, Liferea had a simple implementation that primarily used the Atom/RSS description. The exception was the <content:encoded> tag from the Content namespace, which, depending on tag order, overruled the default description. Only if there was no default item description were additional namespace sources (atom:summary, dc:description...) used as a content source.

This was an unsatisfactory solution for several reasons:

  • More detailed information in application-specific namespaces was invisible.
  • Ordering problems with <description> and <content:encoded> sometimes hid better content.
  • A Dublin Core description (while rare to encounter) never won.
  • When a summary was better than the description, the short description still won.

As a simple solution, Liferea 1.5.x now selects the "best content" by simple length comparison. The assumption is that the format of the content (plain text, HTML, XHTML...) doesn't matter, or more exactly, that the additional length of (X)HTML encoding indicates better content.
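
To make the idea concrete, here is a minimal sketch of such a length comparison in Python. It is not Liferea's actual code (Liferea is written in C). The sample item, the list of candidate tags and the function name best_content are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample item, invented for this illustration only.
SAMPLE_ITEM = """\
<item xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Episode 1</title>
  <description>Short teaser.</description>
  <itunes:summary>A much longer summary of the episode with all the details.</itunes:summary>
</item>
"""

# Candidate content tags in ElementTree's "{namespace}tag" notation.
# Which tags to consider is an assumption of this sketch.
CANDIDATE_TAGS = [
    "description",
    "{http://purl.org/rss/1.0/modules/content/}encoded",
    "{http://purl.org/dc/elements/1.1/}description",
    "{http://www.w3.org/2005/Atom}summary",
    "{http://www.itunes.com/dtds/podcast-1.0.dtd}summary",
]


def best_content(item):
    """Return the longest candidate text among the item's direct children.

    The content format (plain text, HTML, XHTML) is deliberately ignored;
    only the length counts. On a tie the earlier candidate is kept.
    """
    best = ""
    for tag in CANDIDATE_TAGS:
        for element in item.findall(tag):
            text = (element.text or "").strip()
            if len(text) > len(best):
                best = text
    return best


item = ET.fromstring(SAMPLE_ITEM)
print(best_content(item))
```

With this sample item the <itunes:summary> text is selected, because it is longer than the <description>. Tag order no longer plays a role, and a source that would never have won before (such as dc:description) wins whenever it is the longest.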

As a result you might see additional content in namespace-rich feeds (e.g. iTunes podcast feeds).
