Respond with benchmarks and an updated branch

diff --git a/doc/todo/should_optimise_pagespecs.mdwn b/doc/todo/should_optimise_pagespecs.mdwn

index 02d7483c24c3f6071543042c1d25142950863aa7..3dfa8e1f28a0c1a87bdfe4750ec319cc2f043286 100644
--- a/doc/todo/should_optimise_pagespecs.mdwn
+++ b/doc/todo/should_optimise_pagespecs.mdwn
@@ -79,6 +79,8 @@ I can think about reducung the size of my wiki source and making it available on
  > 
  > --[[Joey]]
  
+[[!template id=gitbranch branch=smcv/ready/optimize-depends author="[[smcv]]"]]
+
  >> I've been looking at optimizing ikiwiki for a site using
  >> [[plugins/contrib/album]] (which produces a lot of pages) and it seems
  >> that checking which pages depend on which pages does take a significant
@@ -88,4 +90,170 @@ I can think about reducung the size of my wiki source and making it available on
  >> rather than a single pagespec. This does turn out to be faster, although
  >> not as much as I'd like. --[[smcv]]
  
-[[!tag wishlist patch]]
+>>> I just wanted to note that there is a whole long discussion of dependencies and pagespecs on the [[todo/tracking_bugs_with_dependencies]] page. -- [[Will]]
+
+>>>> Yeah, I had a look at that (as the only other mention of `pagespec_merge`).
+>>>> I think I might have solved some of the problems mentioned there,
+>>>> actually - `pagespec_merge` no longer needs to exist in my branch (although
+>>>> I haven't actually deleted it), because the "or" operation is now done in
+>>>> the Perl code, rather than by merging pagespecs and translating. --[[smcv]]
+
+[[!template id=gitbranch branch=smcv/ready/remove-pagespec-merge author="[[smcv]]"]]
+
+>>>>> I've now added a patch to the end of that branch that deletes
+>>>>> `pagespec_merge` almost entirely (we do need to keep a copy around, in
+>>>>> ikiwiki-transition, but that copy doesn't have to be optimal or support
+>>>>> future features like [[tracking_bugs_with_dependencies]]). --[[smcv]]
+
+---
+
+Some questions on your optimize-depends branch. --[[Joey]]
+
+In saveindex it still ORs together the depends list, but the `{depends}`
+field seems useful only for backwards compatibility (i.e. ikiwiki-transition
+still uses it), and otherwise just bloats the index.
+
+> If it's acceptable to declare that downgrading IkiWiki requires a complete
+> rebuild, I'm happy with that. I'd prefer to keep the (simple form of the)
+> transition done automatically during a load/save cycle, rather than
+> requiring ikiwiki-transition to be run; we should probably say in NEWS
+> that the performance increase won't fully apply until the next
+> rebuild. --[[smcv]]
+
+>> It is acceptable not to support downgrades.
+>> I don't think we need a NEWS file update since any sort of refresh,
+>> not just a full rebuild, will cause the indexdb to be loaded and saved,
+>> enabling the optimisation. --[[Joey]]
+
+>>> A refresh will load the current dependencies from `{depends}` and save
+>>> them as-is as a one-element `{dependslist}`; only a rebuild will replace
+>>> the single complex pagespec with a long list of simpler pagespecs.
+>>> --[[smcv]]
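
Reviewer sketch of the refresh-time transition described above, written in Python rather than ikiwiki's Perl. The record layout and the helper name `upgrade_record` are assumptions made for illustration; only the field names `depends` and `dependslist` come from the thread.

```python
def upgrade_record(record):
    """Return a copy of an index record that carries a 'dependslist'.

    Hypothetical helper: an old record holds one combined pagespec in
    'depends'; it is carried over unchanged as a one-element list.
    """
    upgraded = dict(record)
    if "dependslist" not in upgraded and "depends" in upgraded:
        # As-is conversion: the complex pagespec is not split up here.
        upgraded["dependslist"] = [upgraded["depends"]]
    return upgraded


old_record = {"page": "index", "depends": "plugins/* or todo/*"}
new_record = upgrade_record(old_record)
```

Under these assumptions, a refresh only wraps the existing pagespec; splitting it into many simple pagespecs would need the dependencies to be re-registered, which is what a rebuild does.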
+
+Is an array the right data structure? `add_depends` has to loop through the
+array to avoid dups; it would be better if a hash were used there. Since
+inline (and other plugins) explicitly add all linked pages, each as a
+separate item, the list can get rather long, and that single `add_depends`
+loop has suddenly become O(N^2) in the number of pages, which is something
+to avoid.
+
+> I was also thinking about this (I've been playing with some stuff based on the
+> `remove-pagespec-merge` branch).  A hash, by itself, is not optimal because
+> the dependency list holds two things: page names and page specs.  The hash would
+> work well for the page names, but you'll still need to iterate through the page specs.
+> I was thinking of keeping a list and a hash.  You use the list for pagespecs
+> and the hash for individual page names.  To make this work you need to adjust the
+> API so it knows which you're adding.  -- [[Will]]
+
+> I wasn't thinking about a lookup hash, just a dedup hash, FWIW.
+> --[[Joey]]
+
+>> I was under the impression from previous code review that you preferred
+>> to represent unordered sets as lists, rather than hashes with dummy
+>> values. If I was wrong, great, I'll fix that and it'll probably go
+>> a bit faster. --[[smcv]]
+
+>>> It depends, really. And it'd certainly make sense to benchmark such a
+>>> change. --[[Joey]]
+
+>>>> Benchmarked, below. --[[smcv]]
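
Reviewer sketch of the two de-duplication strategies discussed in this thread, in Python rather than ikiwiki's Perl. The function names `add_depends_list` and `add_depends_hash` and the sample data are hypothetical, and the real `add_depends` has a different signature.

```python
def add_depends_list(depends, spec):
    """Append spec unless already present; the membership test scans the list."""
    if spec not in depends:
        depends.append(spec)


def add_depends_hash(depends, spec):
    """Record spec as a dict key; the duplicate check is a hash lookup."""
    depends[spec] = True


# 2000 additions drawn from only 500 distinct dependencies.
specs = ["page%d" % (i % 500) for i in range(2000)]

as_list = []
as_hash = {}
for spec in specs:
    add_depends_list(as_list, spec)
    add_depends_hash(as_hash, spec)
```

Both containers end up holding the same 500 entries. The list version does work proportional to the list length on every call, which is the quadratic behaviour raised above; the dict version does not.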
+
+Also, since a lot of places are calling add_depends in a loop, it probably
+makes sense to just make it accept a list of dependencies to add. It'll be
+marginally faster, probably, and should allow for better optimisation
+when adding a lot of depends at once.
+
+> That'd be an API change; perhaps marginally faster, but I don't
+> see how it would allow better optimisation if we're de-duplicating
+> anyway? --[[smcv]]
+
+>> Well, I was thinking that it might be sufficient to build a `%seen`
+>> hash of dependencies inside `add_depends`, if the places that call
+>> it lots were changed to just call it once. Of course the only way to
+>> tell is benchmarking. --[[Joey]]
+
+>>> It doesn't seem that it significantly affects performance either way.
+>>> --[[smcv]]
+
+In Render.pm, we now have a triply nested loop, which is a bit
+scary for efficiency. It seems there should be a way to
+rework this code so it can use the optimised `pagespec_match_list`,
+and/or hoist some of the inner loop calculations (like the `pagename`)
+out.
+
+> I don't think the complexity is any greater than it was: I've just
+> moved one level of "loop" out of the generated Perl, to be
+> in visible code. I'll see whether some of it can be hoisted, though.
+> --[[smcv]]
+
+>> The call to `pagename` is the only part I can see that's clearly
+>> run more often than before. That function is pretty inexpensive, but..
+>> --[[Joey]]
+
+>>> I don't see anything that can be hoisted without significant refactoring,
+>>> actually. Beware that there are two pagename calls in the loop: one for
+>>> `$f` (which is the page we might want to rebuild), and one for `$file`
+>>> (which is the changed page that it might depend on). Note that I didn't
+>>> choose those names!
+>>>
+>>> The three loops are over source files, their lists of dependency pagespecs,
+>>> and files that might have changed. I see the following things we might be
+>>> doing redundantly:
+>>>
+>>> * If `$file` is considered as a potential dependency for more than
+>>>   one `$f`, we evaluate `pagename($file)` more than once. Potential fix:
+>>>   cache them (this turns out to save about half a second on the docwiki,
+>>>   see below).
+>>> * If several pages depend on the same pagespec, we evaluate whether each
+>>>   changed page matches that pagespec more than once: however, we do so
+>>>   with a different location parameter every time, so repeated calls are,
+>>>   in the general case, the only correct thing to do. Potential fix:
+>>>   perhaps special-case "page x depends on page y and nothing else"
+>>>   (i.e. globs that have no wildcards) into a separate hash? I haven't
+>>>   done anything in this direction.
+>>> * Any preparatory work done by pagespec_match (converting the pagespec
+>>>   into Perl, mostly?) is done in the inner loop; switching to
+>>>   pagespec_match_list (significant refactoring) saves more than half a
+>>>   second on the docwiki.
+>>>
+>>> --[[smcv]]
+
+Very good catch on img/meta using the wrong dependency; verified in the wild!
+(I've cherry-picked those bug fixes.)
+
+----
+
+Benchmarking results: I benchmarked by altering docwiki.setup to switch off
+verbose, running "make clean && ./Makefile.PL && make", and timing one rebuild
+of the docwiki followed by three refreshes. Before each refresh I used
+`touch plugins/*.mdwn` to have something significant to refresh.
+
+I'm assuming that "user" CPU time is the important thing here (system time was
+relatively small in all cases, up to 0.35 seconds per run).
+
+master at the time of rebasing: 14.20s to rebuild, 10.04/12.07/14.01s to
+refresh. I think you can see the bug clearly here - the pagespecs are getting
+more complicated every time!
+
+After the initial optimization: 14.27s to rebuild, 8.26/8.33/8.26s to refresh.
+Success!
+
+Not pre-joining dependencies actually took about 0.2s more; I don't know why.
+I'm worried that duplicates will just build up (again) in less simple cases,
+though, so 0.2s is probably a small price to pay for that not happening (it
+might well be experimental error, for that matter).
+
+Not saving `{depends}` to the index, using a hash instead of a list to
+de-duplicate, and allowing `add_depends` to take an arrayref instead of a single
+pagespec had no noticeable positive or negative effect on this test.
+
+Memoizing the results of `pagename` brought the rebuild time down to 14.06s
+and the refresh time down to 7.96/7.92/7.92s, a significant win.
+
+Refactoring to use `pagespec_match_list` looks more risky from a code churn
+point of view; rebuild now takes 14.35s, but refresh is only 7.30/7.29/7.28s,
+another significant win.
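
Reviewer sketch of why matching a whole list at once can be cheaper than matching page by page, in Python rather than ikiwiki's Perl. Shell-style globs stand in for pagespecs here, which is a simplification: real pagespecs also support boolean operators and location-relative matching. The helper `match_list` is hypothetical and is not the `pagespec_match_list` API.

```python
import fnmatch
import re


def match_list(glob, pages):
    """Compile glob once, then test every page against the compiled pattern."""
    pattern = re.compile(fnmatch.translate(glob))
    return [page for page in pages if pattern.match(page)]


pages = ["plugins/img", "plugins/meta", "todo/done", "index"]
changed_plugins = match_list("plugins/*", pages)
```

The translation step runs once per pattern rather than once per page, which is the kind of preparatory work the refactoring moves out of the inner loop.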
+
+--[[smcv]]
+
+[[!tag wishlist patch patch/core]]