21 August 2006

Avoiding lax XML parsing

After a discussion with Daniel Veillard (the person behind libxml2) about why XML consumers should never process invalid XML, and about the possibility that future libxml2 versions might drop the recovery mode, I reconsidered the use of libxml2's recovery mode in Liferea. It has been enabled since the earliest versions to allow parsing the non-XML RSS 0.9x feed formats and to repair occasionally broken XML feeds.

There are two ideological standpoints on parsing of broken XML:
  1. The XML spec says parsing broken XML is forbidden. Tolerant parsing lets the feed generators stay broken forever.
  2. Some feeds will always be broken. Total perfection of all generators isn't possible and only the user experience counts. Therefore tolerant XML parsing is mandatory for a good aggregator.
The first opinion is very popular with XML advocates, while the second one is (at least that is my impression) common amongst aggregator developers.

With the rise of Atom 1.0 and the continuous improvements of the major feed generators, it is becoming more and more realistic to follow the approach of opinion 1: to refuse broken feeds and force the feed generators to fix the problem. The main reason to stop using libxml2's recovery mode is, of course, the prospect of its future removal.

The plan:
  • Stop using recovery mode for Atom 1.0 and OPML immediately (released with v1.1.1).
  • Continue to use recovery mode for RSS for now.
  • Later, split the RSS parser into a 0.9x parser and a 1.0/2.0 parser, where only the 0.9x parser uses recovery mode.
  • When libxml2 removes recovery mode, either drop RSS 0.9x support or write a new parser.
I expect this to cause some user reports about feeds that are suddenly broken, feeds that only worked so far because wrong encodings, broken HTML or unknown entities were recovered or stripped automatically.
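
To make these three failure classes concrete, here is a minimal sketch of what a strict parser does with them. It uses Python's standard-library parser (expat based) rather than libxml2, purely for illustration, and the sample feeds and the helper name `parse_strict` are made up for this example; Liferea's actual code looks different.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feeds, each broken in one of the ways named above.
SAMPLES = {
    "well-formed": (
        b'<?xml version="1.0" encoding="UTF-8"?>'
        b"<rss><channel><title>ok</title></channel></rss>"
    ),
    # &nbsp; is an HTML entity; XML does not know it without a DTD.
    "unknown entity": b"<rss><channel><title>a&nbsp;b</title></channel></rss>",
    # Unescaped HTML with unclosed tags inside an element.
    "broken HTML": (
        b"<rss><channel><description><p>unclosed<br>"
        b"</description></channel></rss>"
    ),
    # Declared as UTF-8, but the byte 0xE9 is Latin-1 and invalid here.
    "wrong encoding": (
        b'<?xml version="1.0" encoding="UTF-8"?>'
        b"<rss><channel><title>caf\xe9</title></channel></rss>"
    ),
}


def parse_strict(data):
    """Parse without any recovery.

    Returns (root, None) on success or (None, error message) on failure.
    """
    try:
        return ET.fromstring(data), None
    except ET.ParseError as err:
        return None, str(err)


if __name__ == "__main__":
    for name, data in SAMPLES.items():
        root, error = parse_strict(data)
        status = "accepted" if root is not None else "rejected: " + error
        print(name + ": " + status)
```

Only the first sample is accepted; a strict parser rejects the other three outright instead of guessing what the feed author meant.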

1 comment:

Anonymous said...

There are two ideological standpoints on parsing of broken XML: [...] Therefore tolerant XML parsing is mandatory for a good aggregator.

The problem is that there is no single way to interpret broken XML, so a feed would look OK in some feed readers (probably the major ones, since the majority of users will test their feed with those readers) and would be badly interpreted by other feed readers. This means that the minor feed readers would have to emulate, by guessing, the major feed readers. That would cost their developers time, introduce bugs and so on. This is not fair.

Of course, users could fix their feeds so that everyone can read them, but we know that most of them wouldn't do it, either because they don't know that their feed is broken (because of tolerant XML parsers) or simply because they don't care. Remember Netscape (and later, MSIE) and HTML...

So, it is very good that libxml2 will drop the recovery mode.