doc/bugs/garbled_non-ascii_characters_in_body_in_web_interface.mdwn

since my latest jessie upgrade here, charsets are all broken when editing a page. the page i'm trying to edit is [this wishlist](http://anarc.at/wishlist/), and it used to work fine. now, instead of:

`Voici des choses que vous pouvez m'acheter si vous êtes le Père Nowel (yeah right):`

... as we see in the rendered body right now, when i edit the page i see:

`Voici des choses que vous pouvez m'acheter si vous �tes le P�re Nowel (yeah right):`

... a typical double-encoding nightmare. The actual binary data is this for the word "Père" according to `hd`:

~~~~
anarcat@marcos:ikiwiki$ echo "Père" | hd
00000000  50 c3 a8 72 65 0a                                 |P..re.|
00000006
anarcat@marcos:ikiwiki$ echo "P�re" | hd
00000000  50 ef bf bd 72 65 0a                              |P...re.|
00000007
~~~~
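
For reference, the bytes `ef bf bd` in the second dump are the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER. The Python 3 sketch below shows one way such bytes can arise. It assumes, as a guess rather than anything established in this report, that the `è` reached a UTF-8 decoder as the single Latin-1 byte `e8` and was decoded leniently. The variable names are illustrative only.

```python
# Hypothetical reconstruction; nothing here is taken from ikiwiki itself.
good = "Père".encode("utf-8")        # 50 c3 a8 72 65, as in the first dump
latin1 = "Père".encode("latin-1")    # 50 e8 72 65: 'è' is the single byte e8

# e8 followed by 'r' is not a valid UTF-8 sequence, so a lenient decoder
# substitutes U+FFFD REPLACEMENT CHARACTER for the offending byte.
mangled = latin1.decode("utf-8", errors="replace")
bad = mangled.encode("utf-8")        # 50 ef bf bd 72 65, as in the second dump

print(" ".join(f"{b:02x}" for b in good))
print(" ".join(f"{b:02x}" for b in bad))
```

Once U+FFFD has been substituted, the original byte is gone, so the damage cannot be undone by re-decoding the text.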

> I don't know what that is, but it isn't the usual double-UTF-8 encoding:
>
>     >>> u'è'.encode('utf-8')
>     '\xc3\xa8'
>     >>> u'è'.encode('utf-8').decode('latin-1').encode('utf-8')
>     '\xc3\x83\xc2\xa8'
>
> A packet capture of the incorrect HTTP request/response headers and body
> might be enlightening? --[[smcv]]
>
> > Here are the headers according to chromium:
> >
> > ~~~~
> > GET /ikiwiki.cgi?do=edit&page=wishlist HTTP/1.1
> > Host: anarc.at
> > Connection: keep-alive
> > Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
> > User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36
> > Referer: http://anarc.at/wishlist/
> > Accept-Encoding: gzip,deflate,sdch
> > Accept-Language: fr,en-US;q=0.8,en;q=0.6
> > Cookie: openid_provider=openid; ikiwiki_session_anarcat=XXXXXXXXXXXXXXXXXXXXXXX
> >
> > HTTP/1.1 200 OK
> > Date: Mon, 08 Sep 2014 21:22:24 GMT
> > Server: Apache/2.4.10 (Debian)
> > Set-Cookie: ikiwiki_session_anarcat=XXXXXXXXXXXXXXXXXXXXXXX; path=/; HttpOnly
> > Vary: Accept-Encoding
> > Content-Encoding: gzip
> > Content-Length: 4093
> > Keep-Alive: timeout=5, max=100
> > Connection: Keep-Alive
> > Content-Type: text/html; charset=utf-8
> > ~~~~
> >
> > ... which seem fairly normal... getting more data than this is a little inconvenient since the data is gzip-encoded and i'm kind of lazy extracting that from the stream. Chromium does seem to auto-detect it as utf8 according to the menus however... not sure what's going on here. I would focus on the following error however, since it's clearly emanating from the CGI... --[[anarcat]]

Clicking on the Cancel button yields the following warning:

~~~~
Error: Cannot decode string with wide characters at /usr/lib/x86_64-linux-gnu/perl/5.20/Encode.pm line 215.
~~~~

> Looks as though you might be able to get a Python-style backtrace for this
> by setting `$Carp::Verbose = 1`.
>
> The error is that we're taking some string (which string? only a backtrace
> would tell you) that is already flagged as Unicode, and trying to decode
> it from byte-blob to Unicode again, analogous to this Python:
>
>     some_bytes.decode('utf-8').decode('utf-8')
>
> --[[smcv]]
> >
> > I couldn't figure out where to set that Carp thing - it doesn't work simply by setting it in /usr/bin/ikiwiki - so i am not sure how to use this. However, with some debugging code in Encode.pm, i was able to find a case of double-encoding - in the left menu, for example, which is the source of the Encode.pm crash.
> >
> > It seems that some unicode semantics changed in Perl 5.20, or more precisely, in Encode.pm 2.53, according to [this](https://code.activestate.com/lists/perl-unicode/3314/). 5.20 does have significant Unicode changes, but I am not sure they are related (see [perldelta](https://metacpan.org/pod/distribution/perl/pod/perldelta.pod)). Doing more archeology, it seems that Encode.pm is indeed where the problem started, all the way back in [commit 8005a82](https://github.com/dankogai/p5-encode/commit/8005a82d8aa83024d72b14e66d9eb97d82029eeb#diff-f3330aa405ffb7e3fec2395c1fc953ac) (August 2013), taken from [pull request #11](https://github.com/dankogai/p5-encode/pull/11) which explicitly forbids double-decoding, in effect failing like python does in the above example you gave (Perl used to silently succeed instead, a rather big change if you ask me).
> >
> > So stepping back, it seems that this would be a bug in Ikiwiki. It could be in any of those places:
> >
> > ~~~~
> > anarcat@marcos:ikiwiki$ grep -r decode_utf8 IkiWiki* | wc -l
> > 31
> > ~~~~
> >
> > Now the fun part is to determine which one should be turned off... or should we duplicate the logic that was removed in decode_utf8, or make a safe_decode_utf8 for ourselves? --[[anarcat]]
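
A rough sketch of the `safe_decode_utf8` idea, written in Python 3 as an analogy rather than as ikiwiki's actual Perl code. The helper name comes from the suggestion above. The `isinstance` check is an assumption standing in for a test of whether a Perl string is already flagged as Unicode.

```python
def safe_decode_utf8(value):
    """Decode UTF-8 bytes to text; return already-decoded text unchanged."""
    if isinstance(value, bytes):
        # Raw octets (for example from a form submission): decode exactly once.
        return value.decode("utf-8")
    # Already text: decoding again would be the double-decode bug.
    return value


raw = "Père".encode("utf-8")     # bytes as they might arrive over HTTP
once = safe_decode_utf8(raw)     # bytes -> str
twice = safe_decode_utf8(once)   # str passes through untouched
print(once == twice)
```

A guard like this makes decoding idempotent at each call site, which avoids having to work out which of the 31 `decode_utf8` calls receives already-decoded input.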

The apache logs yield:

~~~~
[Mon Sep 08 16:17:43.995827 2014] [cgi:error] [pid 2609] [client 192.168.0.3:47445] AH01215: Died at /usr/share/perl5/IkiWiki/CGI.pm line 467., referer: http://anarc.at/ikiwiki.cgi?do=edit&page=wishlist
~~~~

Interestingly enough, I can't reproduce the bug here (at least in this page). Also, editing the page through git works fine.

I had put ikiwiki on hold during the last upgrade, so it was upgraded separately. The bug happens both with 3.20140613 and 3.20140831. The major thing that happened today is the upgrade from perl 5.18 to 5.20. Here's the output of `egrep '[0-9] (remove|purge|install|upgrade)' /var/log/dpkg.log | pastebinit -b paste.debian.net` to give an idea of what was upgraded today:

http://paste.debian.net/plain/119944

This is a major bug which should probably be fixed before jessie, yet i can't seem to find a severity statement in reportbug that would justify blocking the release based on this - unless we consider non-english speakers as "most" users (i don't know the demographics well enough). It certainly makes ikiwiki completely unusable for my users that operate on the web interface in french... --[[anarcat]]

Note that on this one page, i can't even get the textarea to display and i immediately get `Error: Cannot decode string with wide characters at /usr/lib/x86_64-linux-gnu/perl/5.20/Encode.pm line 215`: http://anarc.at/ikiwiki.cgi?do=edit&page=hardware%2Fserver%2Fmarcos.

Also note that this is the same as [[forum/"Error: cannot decode string with wide characters" on Mageia Linux x86-64 Cauldron]], I believe. The backtrace I get here is:

~~~~
Error: Cannot decode string with wide characters at /usr/lib/x86_64-linux-gnu/perl/5.20/Encode.pm line 215. Encode::decode_utf8("**Menu**\x{d}\x{a}\x{d}\x{a} * [[\x{fffd} propos|index]]\x{d}\x{a} * [[Logiciels|software]]"...)
called at /usr/share/perl5/IkiWiki/CGI.pm line 117 IkiWiki::decode_form_utf8(CGI::FormBuilder=HASH(0x2ad63b8))
called at /usr/share/perl5/IkiWiki/Plugin/editpage.pm line 90 IkiWiki::cgi_editpage(CGI=HASH(0xd514f8), CGI::Session=HASH(0x27797e0))
called at /usr/share/perl5/IkiWiki/CGI.pm line 443 IkiWiki::__ANON__(CODE(0xfaa460))
called at /usr/share/perl5/IkiWiki.pm line 2101 IkiWiki::run_hooks("sessioncgi", CODE(0x2520138))
called at /usr/share/perl5/IkiWiki/CGI.pm line 443 IkiWiki::cgi()
called at /usr/bin/ikiwiki line 192 eval {...}
called at /usr/bin/ikiwiki line 192 IkiWiki::main()
called at /usr/bin/ikiwiki line 231
~~~~

so this would explain the error on cancel, but doesn't explain the weird encoding i get when editing the page... <sigh>...

... and that leads me to this crazy patch which fixes all of the above issues by avoiding double-decoding... go figure that shit out...

[[!template  id=gitbranch branch=anarcat/dev/safe_unicode author="[[anarcat]]"]]

> [[Looks good to me|users/smcv/ready]] although I'm not sure how valuable
> the `$] < 5.02 || ` test is - I'd be tempted to just call `is_utf8`. --[[smcv]]

>> [[merged|done]] --[[smcv]]