From 5150874861509b45835c7bd9565531b550a35db5 Mon Sep 17 00:00:00 2001
From: smcv
Date: Tue, 16 May 2017 05:17:00 -0400
Subject: [PATCH] browsers and specifications support more Unicode than we give them credit for

---
 .../i18nheadinganchors/discussion.mdwn | 92 +++++++++++++++++++
 1 file changed, 92 insertions(+)
 create mode 100644 doc/plugins/contrib/i18nheadinganchors/discussion.mdwn

diff --git a/doc/plugins/contrib/i18nheadinganchors/discussion.mdwn b/doc/plugins/contrib/i18nheadinganchors/discussion.mdwn
new file mode 100644
index 000000000..1c3eb6325
--- /dev/null
+++ b/doc/plugins/contrib/i18nheadinganchors/discussion.mdwn
@@ -0,0 +1,92 @@
+I would not be comfortable with merging this into headinganchors and enabling it by
+default for two reasons:
+
+* it adds a new dependency on [[!cpan Text::Unidecode]]
+* Text::Unidecode specifically documents its transliteration as not being stable
+  across versions
+
+There are several "slugify" libraries available other than Text::Unidecode.
+It isn't clear to me which one is the best. Pandoc also documents
+[an algorithm for generating slugs](http://pandoc.org/MANUAL.html#extension-auto_identifiers),
+and it would be nice if our fallback implementation (with i18n disabled) was compatible
+with Pandoc's, at least for English text.
+
+However! In HTML5, IDs are allowed to contain anything except _space characters_
+(space, newline, tab, CR, FF), so we could consider just passing non-ASCII
+through the algorithm untouched. This [example link to a Russian
+anchor name](#пример) (the output of putting "example" into English-to-Russian
+Google Translate) hopefully works? (Use a small browser window to make it
+clearer where it goes)
+
+So perhaps we could try this Unicode-aware version of what Pandoc documents:
+
+* Remove footnote links if any (this might have to be heuristic, or we could
+  skip this step for a first implementation)
+* Take only the plain text, no markup (passing the heading through HTML::Parser
+  and collecting only the text nodes would be the fully-correct version of this,
+  or we could fake it with regexes and be at least mostly correct)
+* Strip punctuation, using some Unicode-aware definition of what is punctuation:
+  perhaps `s/[^-\w_. ]//gu;` (delete anything that is not a (Unicode-aware) word
+  character, hyphen-minus, underscore, dot or space)
+* Replace spaces with hyphen-minus
+* Force to lower-case with `lc`
+* Strip leading digits and punctuation
+* If the string is empty, use `section`
+* If we already generated a matching identifier, append `-1`, `-2`, etc. until we find
+  an unused identifier
+
+(Or to provide better uniqueness, we could parse the document looking for any existing
+ID, then generate IDs avoiding collisions with any of them.)
+
+This would give us, for example, `## Visiting 北京` → `id="visiting-北京"`
+(where Text::Unidecode would instead transliterate, resulting in `id="visiting-bei-jing"`).
+
+To use these IDs in fragments, I would be inclined to rely on browsers
+supporting [IRIs](https://tools.ietf.org/html/rfc3987): ``.
+
+--[[smcv]]
+
+----
+
+Some long scrollable text
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+Example fragment ID in Russian should point here
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
+.
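
The Unicode-aware slug steps proposed earlier in this discussion could be prototyped roughly as follows. This is only a sketch under stated assumptions, not ikiwiki code: it is written in Python rather than Perl purely for illustration, the function name `slugify` and the `used` set are hypothetical, and it assumes the heading has already been reduced to plain text (the footnote-removal and markup-stripping steps are skipped, as the list suggests a first implementation might do).

```python
import re


def slugify(heading_text, used):
    """Return a Unicode-preserving ID for plain heading text.

    `used` is a set of IDs already generated for this page; it is
    updated in place. Both names are hypothetical, for illustration.
    """
    # Strip punctuation: keep only word characters (Unicode-aware for
    # str patterns in Python 3, and including underscore), hyphen-minus,
    # dot and space. Rough analogue of the Perl s/[^-\w_. ]//gu.
    slug = re.sub(r"[^-\w. ]", "", heading_text)
    # Replace spaces with hyphen-minus.
    slug = slug.replace(" ", "-")
    # Force to lower case (the analogue of Perl's lc).
    slug = slug.lower()
    # Strip leading digits and whatever punctuation survived above.
    slug = re.sub(r"^[\d\-_.]+", "", slug)
    # Fall back to "section" if nothing is left.
    if not slug:
        slug = "section"
    # Append -1, -2, ... until the identifier is unused.
    candidate = slug
    n = 0
    while candidate in used:
        n += 1
        candidate = f"{slug}-{n}"
    used.add(candidate)
    return candidate


if __name__ == "__main__":
    seen = set()
    for heading in ["Visiting 北京", "Visiting 北京", "1. Intro", "!!!"]:
        print(heading, "->", slugify(heading, seen))
```

Seeding `used` with every ID that already exists in the page before generating any new ones would be one way to get the stronger uniqueness mentioned in the parenthetical after the list.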
-- 
2.39.5