doc/bugs/HTML_inlined_into_Atom_not_necessarily_well-formed.mdwn

   1 If a blog entry contains a HTML named entity, such as the `&mdash;` produced by [[plugins/rst]] for blockquote citations, it's pasted into the Atom feed as-is. However, Atom feeds don't have a DTD, so named entities beyond `&lt;`, `&gt;`, `&quot;`, `&amp;` and `&apos;` aren't well-formed XML.
   2
   3 Possible solutions:
   4
   5 * Put HTML in Atom feeds as type="html" (and use ESCAPE=HTML) instead
   6
   7 > Are there any particular downsides to doing that ..? --[[Joey]]
   8
   9 >> It's the usual XHTML/HTML distinction. type="html" will always be interpreted as "tag soup", I believe - this may lead to it being rendered differently in some browsers. In general ikiwiki seems to claim to produce XHTML (at least, the default page.tmpl makes it claim to be XHTML Strict). On the other hand, this is a much simpler solution... see escape-feed-html branch in my repository, which I'm now using instead --[[smcv]]
  10
  11 >>> Of course, browsers [probably don't treat xhtml pages as xhtml anyway](http://hixie.ch/advocacy/xhtml).
  12 >>> And the same content will be treated as html (probably as tag soup) if it's
  13 >>> in a rss feed.
  14
  15 >>> [[merged|done]]
  16
  17 * Keep HTML in Atom feeds as type="xhtml", but replace named entities with numeric ones,
  18   like in the re-escape-entities branch in my repository ([diff here](http://git.debian.org/?p=users/smcv/ikiwiki.git;a=commitdiff;h=c0eb041c65d0653bacf0d4acb7a602e9bda8888e))
  19
  20 >> I can see why you think this is excessively complex! --[[smcv]]
  21
  22 (Also, the HTML in RSS feeds would probably get better interoperability if it was escaped with ESCAPE=HTML rather than being in a CDATA section?)
  23
  24 > Can't see why? --[[Joey]]
  25
  26 >> For a start, `]]>` in content wouldn't break the feed :-) but I was really thinking of non-XML, non-SGML parsers (more tag soup) that don't understand CDATA (I've suffered from CDATA damage when feeding generated code through gtkdoc, for instance). --[[smcv]]
  27
  28 >>> FWIW, the htmlscrubber escapes the `]]>`. (Wouldn't hurt to make that
  29 >>> more robust tho.)
  30 >>>
  31 >>> ikiwiki has used CDATA from the beginning -- this is the first time
  32 >>> I've heard about rss 2.0 parsers that didn't know about CDATA.
  33 >>>
  34 >>> (IIRC, I used CDATA because the result is more space-efficient and less
  35 >>> craptacular to read manually.)