I would not be comfortable with merging this into headinganchors and enabling it by
default for two main reasons:

* it adds a new dependency on [[!cpan Text::Unidecode]]
* Text::Unidecode specifically documents its transliteration as not being stable

There are several "slugify" libraries available other than Text::Unidecode.
It isn't clear to me which one is the best. Pandoc also documents
[an algorithm for generating slugs](http://pandoc.org/MANUAL.html#extension-auto_identifiers),
and it would be nice if our fallback implementation (with i18n disabled) was compatible
with Pandoc's, at least for English text.

However! In HTML5, IDs are allowed to contain anything except _space characters_
(space, newline, tab, CR, FF), so we could consider just passing non-ASCII
through the algorithm untouched. This [example link to a Russian
anchor name](#пример) (the output of putting "example" into English-to-Russian
Google Translate) hopefully works? (Use a small browser window to make it
clearer where it goes.)

> Can we assume Ikiwiki generates HTML5 all the time? I thought that was still a
> setting off by default... --[[anarcat]]

So perhaps we could try this Unicode-aware version of what Pandoc documents:

* Remove footnote links if any (this might have to be heuristic, or we could
  skip this step for a first implementation)
* Take only the plain text, no markup (passing the heading through HTML::Parser
  and collecting only the text nodes would be the fully-correct version of this,
  or we could fake it with regexes and be at least mostly correct)
* Strip punctuation, using some Unicode-aware definition of what is punctuation:
  perhaps `s/[^-\w_. ]//gu;` (delete anything that is not a (Unicode-aware) word
  character, hyphen-minus, underscore, dot or space)
* Replace spaces with hyphen-minus
* Force to lower-case with `lc`
* Strip leading digits and punctuation
* If the string is empty, use `section`
* If we already generated a matching identifier, append `-1`, `-2`, etc. until we find
  one that is unused

(Or to provide better uniqueness, we could parse the document looking for any existing
ID, then append `-1`, `-2` to each generated ID until there is no collision.)

This would give us, for example, `## Visiting 北京` → `id="visiting-北京"`
(whereas Text::Unidecode would instead transliterate, resulting in
`id="visiting-bei-jing"`).

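
As a rough illustration of the steps listed above, here is a minimal Perl sketch. It is not the headinganchors code: `make_slugifier` is a hypothetical name, markup removal uses the regex shortcut rather than HTML::Parser, and the footnote step is skipped. Passing IDs that already exist in the page to `make_slugifier` seeds the collision check, which approximates the stricter uniqueness variant.

```perl
use strict;
use warnings;
use utf8;                                   # this source contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper, not part of ikiwiki. Returns a closure that maps
# heading HTML to an ID and remembers which IDs it has handed out.
# Any arguments are treated as IDs that are already taken.
sub make_slugifier {
    my %seen = map { $_ => 1 } @_;
    return sub {
        my ($text) = @_;
        $text =~ s/<[^>]*>//g;              # crude markup removal (assumption: regex is good enough)
        $text =~ s/[^-\w_. ]//gu;           # keep word characters, hyphen-minus, underscore, dot, space
        $text =~ s/ /-/g;                   # spaces become hyphen-minus
        $text = lc $text;
        $text =~ s/^[\d\W_]+//u;            # strip leading digits and punctuation
        $text = 'section' if $text eq '';
        my ($id, $n) = ($text, 0);
        $id = $text . '-' . ++$n while $seen{$id};
        $seen{$id} = 1;
        return $id;
    };
}

my $slug = make_slugifier('intro');         # 'intro' stands in for an ID already in the page
print $slug->('Visiting 北京'), "\n";       # visiting-北京
print $slug->('Visiting 北京'), "\n";       # visiting-北京-1
print $slug->('1. <em>Intro</em>!'), "\n";  # intro-1
print $slug->('???'), "\n";                 # section
```

The closure keeps the "already generated" set per page, so a fresh slugifier would be created for each page being rendered.
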
To use these IDs in fragments, I would be inclined to rely on browsers
supporting [IRIs](https://tools.ietf.org/html/rfc3987): `<a href="#visiting-北京">`.

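
If raw non-ASCII fragments turn out to be a problem somewhere, one possible fallback is to percent-encode the UTF-8 bytes of the ID in the `href` while leaving the `id` attribute itself untouched. As far as I know, browsers that follow the HTML5 fragment rules percent-decode the fragment before matching it against IDs, but that is an assumption worth testing. `escape_fragment` is a hypothetical helper that uses only the core Encode module:

```perl
use strict;
use warnings;
use utf8;                                   # this source contains UTF-8 literals
use Encode qw(encode);

# Hypothetical helper: percent-encode everything outside the URI
# "unreserved" set so the href is plain ASCII.
sub escape_fragment {
    my $bytes = encode('UTF-8', shift);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

print '<a href="#', escape_fragment('visiting-北京'), '">', "\n";
# <a href="#visiting-%E5%8C%97%E4%BA%AC">
```
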
> I guess this makes sense. I just wonder how well this is actually supported in all
> browsers... I looked around and suspect this will work in more recent browsers, but,
> as an example, https://caniuse.com/ doesn't have that feature listed in their
> tables. :) -- [[anarcat]]

> _Also note that all heading attributes are overridden with the ID tag. If this
> is not desirable, we'd need to fire up a full HTML::Parser or do some more
> regex magic to preserve the attributes other than id= which we want to keep._

I think this is a bug, particularly if you are using Pandoc's
[header attributes](http://pandoc.org/MANUAL.html#extension-header_attributes).

> It's not a bug, it's a limitation. :) But sure, it's a thing. It's an issue in
> headinganchors as well of course. -- [[anarcat]]

I think we should try to use an existing ID before generating our own, with the
generation step as a fallback, just like Pandoc does. If a htmlize layer like
Text::MultiMarkdown or Pandoc is generating worse IDs than this plugin, the
right solution to that is to send a bug report / feature request to
make its IDs as good as this plugin's, or turn off ID generation in the
htmlize layer, or stop using Text::MultiMarkdown.

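
A minimal sketch of that fallback order, assuming regex-based rewriting is acceptable for a first version (HTML::Parser would be more robust, for instance against `>` inside attribute values). `add_heading_ids` and its `$make_id` callback are hypothetical names, not existing ikiwiki functions. A heading that already carries an `id` is left untouched, and its other attributes are preserved either way.

```perl
use strict;
use warnings;

# Hypothetical helper: add an id to <h1>..<h6> only when none is present.
# $make_id is any callback that maps heading content to an ID.
sub add_heading_ids {
    my ($html, $make_id) = @_;
    $html =~ s{<(h[1-6])\b([^>]*)>(.*?)</\1>}{
        my ($tag, $attrs, $body) = ($1, $2, $3);
        $attrs =~ /(?:^|\s)id\s*=/i
            ? "<$tag$attrs>$body</$tag>"    # existing ID wins, nothing is changed
            : "<$tag$attrs id=\"" . $make_id->($body) . "\">$body</$tag>";
    }gsie;
    return $html;
}

my $html = '<h2 class="x">Intro</h2><h2 id="custom">Intro</h2>';
print add_heading_ids($html, sub { lc $_[0] }), "\n";
# <h2 class="x" id="intro">Intro</h2><h2 id="custom">Intro</h2>
```
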
> Agreed. However, the situation I was in was that multimarkdown *and* the
> headinganchors plugins had issues I had to fix. So it was better and easier
> for me to just override whatever attributes were there for testing and
> fixing this in the short term... -- [[anarcat]]

<pre>Some long scrollable text
</pre>

<span id="пример">Example fragment ID in Russian should point here</span>

> This works for me on `Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0` on Debian stretch, FWIW. --[[anarcat]]