doc/plugins/contrib/i18nheadinganchors/discussion.mdwn

I would not be comfortable with merging this into headinganchors and enabling it by
default for two main reasons:

* it adds a new dependency on [[!cpan Text::Unidecode]]
* Text::Unidecode specifically documents its transliteration as not being stable
  across versions

There are several "slugify" libraries available other than Text::Unidecode.
It isn't clear to me which one is the best. Pandoc also documents
[an algorithm for generating slugs](http://pandoc.org/MANUAL.html#extension-auto_identifiers),
and it would be nice if our fallback implementation (with i18n disabled) were compatible
with Pandoc's, at least for English text.

However! In HTML5, IDs are allowed to contain anything except _space characters_
(space, newline, tab, CR, FF), so we could consider just passing non-ASCII characters
through the algorithm untouched. This [example link to a Russian
anchor name](#пример) (the output of putting "example" into English-to-Russian
Google Translate) hopefully works? (Use a small browser window to make it
clearer where it goes.)
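As a rough illustration of that HTML5 rule, a candidate ID could be checked along these lines. This is a Python sketch for illustration only (ikiwiki itself is Perl), and the name `is_valid_html5_id` is made up for this example. It also assumes the HTML5 requirement that an ID contains at least one character.

```python
# The "space characters" listed above: space, tab, newline (LF), form feed, CR.
HTML5_SPACE_CHARACTERS = frozenset(" \t\n\f\r")


def is_valid_html5_id(candidate: str) -> bool:
    """Return True if candidate is non-empty and contains no space characters.

    Hypothetical helper for illustration; not part of ikiwiki.
    """
    return bool(candidate) and not any(
        ch in HTML5_SPACE_CHARACTERS for ch in candidate
    )


print(is_valid_html5_id("пример"))         # True: non-ASCII is allowed
print(is_valid_html5_id("visiting-北京"))   # True
print(is_valid_html5_id("two words"))      # False: contains a space
```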

> Can we assume Ikiwiki generates HTML5 all the time? I thought that was still a
> setting off by default... --[[anarcat]]

>> ikiwiki always generates HTML5, since 3.20150107. The `html5` option has
>> been repurposed to control whether we generate new-in-HTML5 semantic
>> markup like `<section>` and `<nav>` (`html5` enabled), or HTML4 equivalents
>> like `<div>` with a class (`html5` disabled). The default is still off,
>> although I should probably either toggle it to on or remove the option
>> altogether in the next release. --s

So perhaps we could try this Unicode-aware version of what Pandoc documents:

* Remove footnote links if any (this might have to be heuristic, or we could
  skip this step for a first implementation)
* Take only the plain text, no markup (passing the heading through HTML::Parser
  and collecting only the text nodes would be the fully-correct version of this,
  or we could fake it with regexes and be at least mostly correct)
* Strip punctuation, using some Unicode-aware definition of what is punctuation:
  perhaps `s/[^-\w_. ]//gu;` (delete anything that is not a (Unicode-aware) word
  character, hyphen-minus, underscore, dot or space)
* Replace spaces with hyphen-minus
* Force to lower-case with `lc`
* Strip leading digits and punctuation
* If the string is empty, use `section`
* If we already generated a matching identifier, append `-1`, `-2`, etc. until we find
  an unused identifier

(Or to provide better uniqueness, we could parse the document looking for any existing
ID, then append `-1`, `-2` to each generated ID until there is no collision.)

This would give us, for example, `## Visiting 北京` → `id="visiting-北京"`
(whereas Text::Unidecode would instead transliterate, resulting in
`id="visiting-bei-jing"`).

To use these IDs in fragments, I would be inclined to rely on browsers
supporting [IRIs](https://tools.ietf.org/html/rfc3987): `<a href="#visiting-北京">`.

--[[smcv]]
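The steps listed above can be sketched roughly as follows. This is a Python illustration, not the plugin's Perl, and the names `slugify` and `unique_id` are invented here. It assumes the heading has already been reduced to plain text, so the footnote and markup steps are skipped. Python's `\w` is Unicode-aware for `str` patterns and already includes the underscore.

```python
import re


def slugify(heading_text: str) -> str:
    """Turn plain heading text into an ID candidate (hypothetical helper)."""
    # Strip punctuation: keep word characters, hyphen-minus, dot and space.
    slug = re.sub(r"[^-\w. ]", "", heading_text)
    # Replace spaces with hyphen-minus, then force to lower case.
    slug = slug.replace(" ", "-").lower()
    # Strip leading digits and punctuation (anything before the first letter).
    slug = re.sub(r"^[\W\d_]+", "", slug)
    # Fall back to "section" if nothing is left.
    return slug or "section"


def unique_id(slug: str, used: set) -> str:
    """Append -1, -2, ... until the ID is not in `used`, then record it."""
    candidate, n = slug, 0
    while candidate in used:
        n += 1
        candidate = f"{slug}-{n}"
    used.add(candidate)
    return candidate


used_ids = set()
print(unique_id(slugify("Visiting 北京"), used_ids))        # visiting-北京
print(unique_id(slugify("2. Getting started!"), used_ids))  # getting-started
print(unique_id(slugify("Getting started"), used_ids))      # getting-started-1
print(unique_id(slugify("123"), used_ids))                  # section
```

Seeding `used_ids` with the IDs already present in the page would give the stronger uniqueness mentioned in the parenthetical above.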

> I guess this makes sense. I just wonder how well this is actually supported in all
> browsers. I looked around and suspect this will work in more recent browsers, but,
> as an example, https://caniuse.com/ doesn't have that feature listed in their
> tables. :) -- [[anarcat]]

>> That might well indicate that all major browsers have always supported it so
>> there is no need to check. I don't see any particular reason why a browser vendor
>> would not want to accept arbitrary non-whitespace as a valid anchor.
>>
>> In practice, minor or old browsers are probably insecure anyway, so I don't care
>> too much about supporting them perfectly... --s

> After thinking more about this, I don't feel that IRIs are a good
> solution. Sure, there are machine-readable ways of encoding
> non-ASCII characters in URLs. But that's not the point here: the
> point here is to have *human* readable URLs. In the example I give
> in the plugin documentation, I mention the French word "liberté"
> which can easily be transliterated to "liberte". By using the
> RFC 3987 scheme, we could use Unicode directly in the links
> (`<a href="#liberté">`), but the actual URL would be encoded as
> `#libert%C3%A9`, which is really not as pretty.
>
> I understand you not wanting to introduce another dependency. And I
> also worry about the transliteration not being stable across
> releases. After all, it might not even be stable across Unicode
> releases either! But I'm ready to live with that inconvenience for
> the user-friendliness of the resulting URLs. --[[anarcat]]
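For reference, RFC 3987 maps an IRI to a URI by encoding each non-ASCII character as UTF-8 and percent-encoding the resulting bytes. The small Python sketch below prints that encoded form for the two fragments discussed here. It uses the standard library's `urllib.parse.quote`, which defaults to UTF-8, and is only meant to show what the encoded URL looks like, not how any browser or ikiwiki does it.

```python
from urllib.parse import quote, unquote

for fragment in ("liberté", "visiting-北京"):
    # safe="" so that every reserved character would be encoded as well;
    # letters, digits and "-" are never percent-encoded.
    encoded = quote(fragment, safe="")
    print(f"#{fragment} -> #{encoded}")
    # The mapping is reversible: decoding gives back the readable form.
    assert unquote(encoded) == fragment
```

This prints `#libert%C3%A9` and `#visiting-%E5%8C%97%E4%BA%AC` as the encoded forms.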

----

Documentation says:

> _Also note that all heading attributes are overridden with the ID tag. If this
> is not desirable, we'd need to fire up a full HTML::Parser or do some more
> regex magic to preserve the attributes other than id= which we want to keep._

I think this is a bug, particularly if you are using Pandoc's
[header attributes](http://pandoc.org/MANUAL.html#extension-header_attributes)
or similar.

> It's not a bug, it's a limitation. :) But sure, it's a thing. It's an issue in
> headinganchors as well of course. -- [[anarcat]]

>> No, current/historical headinganchors has a different bug: it ignores headings
>> that have any attributes, and does not generate anchors for them. That gives it
>> degraded functionality, but no information loss. I think that's less bad. --s

I think we should try to use an existing ID before generating our own, with the
generation step as a fallback, just like Pandoc does. If a htmlize layer like
Text::MultiMarkdown or Pandoc is generating worse IDs than this plugin, then
the right solution to that is to send a bug report / feature request to
make its IDs as good as this plugin's, or turn off ID generation in the
htmlize layer, or stop using Text::MultiMarkdown.

--[[smcv]]
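A minimal sketch of the "keep an existing ID, generate one only as a fallback" behaviour might look like this. It is Python for illustration only, it uses regexes rather than a real HTML parser (so it is at best mostly correct), and `add_heading_ids` and `simple_slug` are hypothetical names, not anything in ikiwiki. Headings that already carry an `id` are left untouched, other attributes are preserved, and IDs already present anywhere in the page are reserved to avoid collisions.

```python
import re

# Regex-based sketch; a fully correct version would use an HTML parser.
HEADING_RE = re.compile(r"<(h[1-6])(\s[^>]*)?>(.*?)(</\1>)", re.IGNORECASE | re.DOTALL)
EXISTING_ID_RE = re.compile(r"""<[^>]*\sid\s*=\s*["']([^"']*)["']""", re.IGNORECASE)
HAS_ID_RE = re.compile(r"\sid\s*=", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def simple_slug(text):
    """Hypothetical stand-in for whatever ID generator the plugin uses."""
    slug = re.sub(r"[^-\w. ]", "", text).replace(" ", "-").lower()
    return re.sub(r"^[\W\d_]+", "", slug) or "section"


def add_heading_ids(html, make_slug=simple_slug):
    # Reserve every ID already present in the page, on any element.
    used = set(EXISTING_ID_RE.findall(html))

    def fix_heading(match):
        tag, attrs = match.group(1), match.group(2) or ""
        inner, close = match.group(3), match.group(4)
        if HAS_ID_RE.search(attrs):
            return match.group(0)  # existing ID wins; heading left as-is
        slug = make_slug(TAG_RE.sub("", inner))  # plain text only
        candidate, n = slug, 0
        while candidate in used:
            n += 1
            candidate = f"{slug}-{n}"
        used.add(candidate)
        # Keep the heading's other attributes and append the new id.
        return f'<{tag}{attrs} id="{candidate}">{inner}{close}'

    return HEADING_RE.sub(fix_heading, html)


page = (
    '<h2 class="intro" id="custom">Intro</h2>\n'
    '<h2 class="x">Custom</h2>\n'
    '<h2>Visiting 北京</h2>'
)
print(add_heading_ids(page))
```

In this example the first heading keeps `id="custom"`, the second keeps its `class` and gets `id="custom-1"` because `custom` is already taken, and the third gets `id="visiting-北京"`.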

> Agreed. However, the situation I was in was that multimarkdown *and* the
> headinganchors plugins had issues I had to fix. So it was better and easier
> for me to just override whatever attributes were there for testing and
> fixing this in the short term... -- [[anarcat]]

> To bounce on this again: my problem with keeping existing IDs is
> that it basically makes headinganchors fail to do anything if
> something else adds the anchors. So I understand where you're coming
> from with this, but that "bug" was introduced on purpose, to
> actually fix a problem I was having.
>
> So I understand you might not want to *replace* headinganchors
> completely with this module, but could we at least merge it in so I
> wouldn't have to carry this patch around forever? :) Or what's our
> way forward here?
>
> Thanks! -- [[anarcat]]

----

<pre>Some long scrollable text
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
<span id="пример">Example fragment ID in Russian should point here</span>
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.</pre>

> This works for me on `Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0` on Debian stretch, FWIW. --[[anarcat]]