doc/plugins/contrib/i18nheadinganchors/discussion.mdwn

I would not be comfortable with merging this into headinganchors and enabling it by
default for two reasons:

* it adds a new dependency on [[!cpan Text::Unidecode]]
* Text::Unidecode specifically documents its transliteration as not being stable
  across versions

There are several "slugify" libraries available other than Text::Unidecode.
It isn't clear to me which one is the best. Pandoc also documents
[an algorithm for generating slugs](http://pandoc.org/MANUAL.html#extension-auto_identifiers),
and it would be nice if our fallback implementation (with i18n disabled) were compatible
with Pandoc's, at least for English text.

However! In HTML5, IDs are allowed to contain anything except _space characters_
(space, newline, tab, CR, FF), so we could consider just passing non-ASCII
through the algorithm untouched. This [example link to a Russian
anchor name](#пример) (the output of putting "example" into English-to-Russian
Google Translate) hopefully works? (Use a small browser window to make it
clearer where it goes)

So perhaps we could try this Unicode-aware version of what Pandoc documents:

* Remove footnote links if any (this might have to be heuristic, or we could
  skip this step for a first implementation)
* Take only the plain text, no markup (passing the heading through HTML::Parser
  and collecting only the text nodes would be the fully-correct version of this,
  or we could fake it with regexes and be at least mostly correct)
* Strip punctuation, using some Unicode-aware definition of what is punctuation:
  perhaps `s/[^-\w_. ]//gu;` (delete anything that is not a (Unicode-aware) word
  character, hyphen-minus, underscore, dot or space)
* Replace spaces with hyphen-minus
* Force to lower-case with `lc`
* Strip leading digits and punctuation
* If the string is empty, use `section`
* If we already generated a matching identifier, append `-1`, `-2`, etc. until we find
  an unused identifier

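For the "plain text, no markup" step in the list above, a minimal sketch of the HTML::Parser approach might look like this. The helper name `heading_text` is made up for illustration and is not existing ikiwiki code; footnote removal is skipped, as suggested for a first implementation.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not part of ikiwiki): return only the text nodes
# of an HTML fragment, with character entities decoded.
sub heading_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' asks for the text with entities such as &amp; decoded
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;    # flush any text still buffered by the parser

    return $text;
}
```

With this, `heading_text('Visiting <em>Beijing</em> &amp; more')` should yield `Visiting Beijing & more`, which can then go through the remaining steps.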
(Or to provide better uniqueness, we could parse the document looking for any existing
ID, then generate IDs avoiding collisions with any of them.)

This would give us, for example, `## Visiting 北京` → `id="visiting-北京"`
(where Text::Unidecode would instead transliterate, resulting in `id="visiting-bei-jing"`).

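As a rough sketch of the remaining steps (strip punctuation, hyphenate, lower-case, strip leading non-letters, fall back to `section`, then de-duplicate), assuming the input is already plain text: the name `heading_id` and the `%seen` hash are hypothetical, and the leading-character rule here follows Pandoc's "remove everything up to the first letter". Pre-populating `%seen` with IDs already present in the document would give the stronger uniqueness mentioned above.

```perl
use v5.14;        # enables strict, say, and Unicode string semantics
use warnings;
use utf8;         # this file contains UTF-8 string literals
use open qw(:std :utf8);

# Hypothetical helper (not part of ikiwiki): derive an ID from plain
# heading text. $seen is a hash ref of identifiers already in use.
sub heading_id {
    my ($text, $seen) = @_;

    my $id = $text;
    $id =~ s/[^-\w. ]//gu;    # keep word characters, '-', '.', and space
    $id =~ s/ /-/g;           # spaces become hyphen-minus
    $id = lc $id;
    $id =~ s/^\P{L}+//;       # drop everything before the first letter
    $id = 'section' if $id eq '';

    # Append -1, -2, ... until the identifier is unused
    my ($unique, $n) = ($id, 0);
    $unique = $id . '-' . ++$n while $seen->{$unique};
    $seen->{$unique} = 1;

    return $unique;
}

my %seen;
say heading_id('Visiting 北京', \%seen);    # visiting-北京
say heading_id('Visiting 北京', \%seen);    # visiting-北京-1
say heading_id('2. ???', \%seen);           # section
```

Runs of spaces become runs of hyphens here, which is also what a literal reading of the step list gives; collapsing them would be a separate decision.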
To use these IDs in fragments, I would be inclined to rely on browsers
supporting [IRIs](https://tools.ietf.org/html/rfc3987): `<a href="#visiting-北京">`.

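If relying on IRIs turned out to be a problem, the fallback would presumably be the IRI-to-URI mapping from RFC 3987: percent-encode the UTF-8 bytes of each non-ASCII character. A small sketch, with the hypothetical name `fragment_to_uri`, assuming the fragment contains no ASCII characters that themselves need escaping (generated IDs never contain spaces):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of ikiwiki): map an IRI fragment to its
# URI form by percent-encoding every non-ASCII byte of its UTF-8 encoding.
sub fragment_to_uri {
    my ($fragment) = @_;
    my $bytes = encode('UTF-8', $fragment);
    $bytes =~ s/([^\x00-\x7F])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}
```

For the example above this gives `visiting-%E5%8C%97%E4%BA%AC`, which browsers should treat as equivalent to `#visiting-北京`.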
--[[smcv]]

----

<pre>Some long scrollable text
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
<span id="пример">Example fragment ID in Russian should point here</span>
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.</pre>
  92 .</pre>