update with new features to deal with large sites

diff --git a/doc/bugs/UTF-16_and_UTF-32_are_unhandled.mdwn b/doc/bugs/UTF-16_and_UTF-32_are_unhandled.mdwn

index 21df334a8e1d1e662dfcd84837f07f6356e19be0..9e8fba4b96df1c47b7a7b8820c6105a9fd1f1482 100644 (file)
--- a/doc/bugs/UTF-16_and_UTF-32_are_unhandled.mdwn
+++ b/doc/bugs/UTF-16_and_UTF-32_are_unhandled.mdwn
@@ -18,3 +18,12 @@ BOMless LE and BE input is probably a lost cause.
  Optimally, UTF-16 (which is ubiquitous in the Windows world) and UTF-32 should be fully supported, probably by converting to mostly-UTF-8 and using `&#xXXXX;` or `&#DDDDD;` XML escapes where necessary.
  
  Suboptimally, UTF-16 and UTF-32 should be converted to UTF-8 where cleanly possible and a warning printed where impossible.
+
+----
+According to the Wikipedia pages on [[!wikipedia UTF-8]] and [[!wikipedia UTF-16]], every valid Unicode character is representable in UTF-8, UTF-16 and UTF-32, so the only errors possible when translating UTF-16/32 to UTF-8 come from encoding errors in the original document.
+
+Of course, it's entirely possible that not all browsers support UTF-8 correctly, and we might need to offer the option of encoding into [[!wikipedia CESU-8]] instead, which has the side effect of allowing UTF-16 or UTF-32 encoding errors to be transcribed into the output byte stream rather than pedantically removed.
+
+An interesting question is how to determine the character set of an arbitrary new file added to the repository. If the repository itself handles character encoding, we can simply ask it to hand us a UTF-8 encoded version of the file.
+
+-- [[Martin Rudat|http://www.toraboka.com/~mrudat]]
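
For reference, the conversion discussed in the added comment could look roughly like the sketch below. It is written in Python purely for illustration and is not ikiwiki code. The helper name `to_utf8` and the fallback of treating BOM-less input as UTF-8 are assumptions, not something the bug page specifies. The sketch identifies the encoding from a byte order mark, decodes strictly so that encoding errors in the original (for example an unpaired surrogate) raise an exception instead of being passed through, and re-encodes as UTF-8.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the UTF-16 LE BOM (FF FE). A UTF-16 LE file whose first character is U+0000
# would therefore be misread as UTF-32 LE; this sketch accepts that ambiguity.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: transcode BOM-marked input to BOM-less UTF-8.

    Raises UnicodeDecodeError if the input contains encoding errors.
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strict decoding (the default) rejects malformed input.
            text = raw[len(bom):].decode(encoding)
            return text.encode("utf-8")
    # Assumption: input without a BOM is treated as UTF-8 and only validated.
    return raw.decode("utf-8").encode("utf-8")
```

A caller that prefers the "suboptimal" behaviour described on the page could catch `UnicodeDecodeError` and print a warning rather than aborting.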