X-Git-Url: http://git.vanrenterghem.biz/git.ikiwiki.info.git/blobdiff_plain/d5e65e582a34ae2494fee493ddcb2b959e4a4bce..3a04e96389def78bcb873a4487b85f4d75653199:/doc/todo/should_optimise_pagespecs.mdwn?ds=inline diff --git a/doc/todo/should_optimise_pagespecs.mdwn b/doc/todo/should_optimise_pagespecs.mdwn index 0ef8a7847..728ab8994 100644 --- a/doc/todo/should_optimise_pagespecs.mdwn +++ b/doc/todo/should_optimise_pagespecs.mdwn @@ -79,4 +79,235 @@ I can think about reducung the size of my wiki source and making it available on > > --[[Joey]] -[[!tag wishlist]] +>> I've been looking at optimizing ikiwiki for a site using +>> [[plugins/contrib/album]] (which produces a lot of pages) and it seems +>> that checking which pages depend on which pages does take a significant +>> amount of time. The optimize-depends branch in my git repository +>> avoids using `pagespec_merge()` for this (indeed it's no longer used +>> at all), and instead represents dependencies as a list of pagespecs +>> rather than a single pagespec. This does turn out to be faster, although +>> not as much as I'd like. --[[smcv]] + +>>> [[Merged|done]] --[[smcv]] + +>>> I just wanted to note that there is a whole long discussion of dependencies and pagespecs on the [[todo/tracking_bugs_with_dependencies]] page. -- [[Will]] + +>>>> Yeah, I had a look at that (as the only other mention of `pagespec_merge`). +>>>> I think I might have solved some of the problems mentioned there, +>>>> actually - `pagespec_merge` no longer needs to exist in my branch (although +>>>> I haven't actually deleted it), because the "or" operation is now done in +>>>> the Perl code, rather than by merging pagespecs and translating. --[[smcv]] + +>>>>> I've now added a patch to the end of that branch that deletes +>>>>> `pagespec_merge` almost entirely (we do need to keep a copy around, in +>>>>> ikiwiki-transition, but that copy doesn't have to be optimal or support +>>>>> future features like [[tracking_bugs_with_dependencies]]). --[[smcv]] + +--- + +Some questions on your optimize-depends branch. --[[Joey]] + +In saveindex it still or'd together the depends list, but the `{depends}` +field seems only useful for backwards compatability (ie, ikiwiki-transition +uses it still), and otherwise just bloats the index. + +> If it's acceptable to declare that downgrading IkiWiki requires a complete +> rebuild, I'm happy with that. I'd prefer to keep the (simple form of the) +> transition done automatically during a load/save cycle, rather than +> requiring ikiwiki-transition to be run; we should probably say in NEWS +> that the performance increase won't fully apply until the next +> rebuild. --[[smcv]] + +>> It is acceptable not to support downgrades. +>> I don't think we need a NEWS file update since any sort of refresh, +>> not just a full rebuild, will cause the indexdb to be loaded and saved, +>> enabling the optimisation. --[[Joey]] + +>>> A refresh will load the current dependencies from `{depends}` and save +>>> them as-is as a one-element `{dependslist}`; only a rebuild will replace +>>> the single complex pagespec with a long list of simpler pagespecs. +>>> --[[smcv]] + +Is an array the right data structure? `add_depends` has to loop through the +array to avoid dups, it would be better if a hash were used there. Since +inline (and other plugins) explicitly add all linked pages, each as a +separate item, the list can get rather long, and that single add_depends +loop has suddenly become O(N^2) to the number of pages, which is something +to avoid.. + +> I was also thinking about this (I've been playing with some stuff based on the +> `remove-pagespec-merge` branch). A hash, by itself, is not optimal because +> the dependency list holds two things: page names and page specs. The hash would +> work well for the page names, but you'll still need to iterate through the page specs. +> I was thinking of keeping a list and a hash. You use the list for pagespecs +> and the hash for individual page names. To make this work you need to adjust the +> API so it knows which you're adding. -- [[Will]] + +> I wasn't thinking about a lookup hash, just a dedup hash, FWIW. +> --[[Joey]] + +>> I was under the impression from previous code review that you preferred +>> to represent unordered sets as lists, rather than hashes with dummy +>> values. If I was wrong, great, I'll fix that and it'll probably go +>> a bit faster. --[[smcv]] + +>>> It depends, really. And it'd certianly make sense to benchmark such a +>>> change. --[[Joey]] + +>>>> Benchmarked, below. --[[smcv]] + +Also, since a lot of places are calling add_depends in a loop, it probably +makes sense to just make it accept a list of dependencies to add. It'll be +marginally faster, probably, and should allow for better optimisation +when adding a lot of depends at once. + +> That'd be an API change; perhaps marginally faster, but I don't +> see how it would allow better optimisation if we're de-duplicating +> anyway? --[[smcv]] + +>> Well, I was thinking that it might be sufficient to build a `%seen` +>> hash of dependencies inside `add_depends`, if the places that call +>> it lots were changed to just call it once. Of course the only way to +>> tell is benchmarking. --[[Joey]] + +>>> It doesn't seem that it significantly affects performance either way. +>>> --[[smcv]] + +In Render.pm, we now have a triply nested loop, which is a bit +scary for efficiency. It seems there should be a way to +rework this code so it can use the optimised `pagespec_match_list`, +and/or hoist some of the inner loop calculations (like the `pagename`) +out. + +> I don't think the complexity is any greater than it was: I've just +> moved one level of "loop" out of the generated Perl, to be +> in visible code. I'll see whether some of it can be hoisted, though. +> --[[smcv]] + +>> The call to `pagename` is the only part I can see that's clearly +>> run more often than before. That function is pretty inexpensive, but.. +>> --[[Joey]] + +>>> I don't see anything that can be hoisted without significant refactoring, +>>> actually. Beware that there are two pagename calls in the loop: one for +>>> `$f` (which is the page we might want to rebuild), and one for `$file` +>>> (which is the changed page that it might depend on). Note that I didn't +>>> choose those names! +>>> +>>> The three loops are over source files, their lists of dependency pagespecs, +>>> and files that might have changed. I see the following things we might be +>>> doing redundantly: +>>> +>>> * If `$file` is considered as a potential dependency for more than +>>> one `$f`, we evaluate `pagename($file)` more than once. Potential fix: +>>> cache them (this turns out to save about half a second on the docwiki, +>>> see below). +>>> * If several pages depend on the same pagespec, we evaluate whether each +>>> changed page matches that pagespec more than once: however, we do so +>>> with a different location parameter every time, so repeated calls are, +>>> in the general case, the only correct thing to do. Potential fix: +>>> perhaps special-case "page x depends on page y and nothing else" +>>> (i.e. globs that have no wildcards) into a separate hash? I haven't +>>> done anything in this direction. +>>> * Any preparatory work done by pagespec_match (converting the pagespec +>>> into Perl, mostly?) is done in the inner loop; switching to +>>> pagespec_match_list (significant refactoring) saves more than half a +>>> second on the docwiki. +>>> +>>> --[[smcv]] + +Very good catch on img/meta using the wrong dependency; verified in the wild! +(I've cherry-picked those bug fixes.) + +---- + +Benchmarking results: I benchmarked by altering docwiki.setup to switch off +verbose, running "make clean && ./Makefile.PL && make", and timing one rebuild +of the docwiki followed by three refreshes. Before each refresh I used +`touch plugins/*.mdwn` to have something significant to refresh. + +I'm assuming that "user" CPU time is the important thing here (system time was +relatively small in all cases, up to 0.35 seconds per run). + +master at the time of rebasing: 14.20s to rebuild, 10.04/12.07/14.01s to +refresh. I think you can see the bug clearly here - the pagespecs are getting +more complicated every time! + +> I can totally see a bug here, and it's one I didn't think existed. Ie, +> I thought that after the first refresh, the pagespec should stabalize, +> and what it stabalized to was probably unnecessarily long, but not +> growing w/o bounds! +> +> a) Explains why ikiwiki.info has been so slow lately. Well that and some +> other things that overloaded the system. +> b) Suggests to me we will probably want to force a rebuild on upgrade +> when fixing this (via the mechanism in the postinst). +> +> I've investigated why the pagespecs keep growing: When page A changes, +> its old depends are cleared. Then +> page B that inlines A gets rebuilt, and its old depends are also cleared. +> But page B also inlines page C; which means C gets re-rendered. And this +> happens w/o its old depends being cleared, so C's depends are doubled. +> --[[Joey]] + +After the initial optimization: 14.27s to rebuild, 8.26/8.33/8.26 to refresh. +Success! + +Not pre-joining dependencies actually took about ~0.2s more; I don't know why. +I'm worried that duplicates will just build up (again) in less simple cases, +though, so 0.2s is probably a small price to pay for that not happening (it +might well be experimental error, for that matter). + +> It's weird that the suggested optimisations to +> `add_depends` had no effect. So, the commit message to +> b6fcb1cb0ef27e5a63184440675d465fad652acf is actually wrong.. ? --[[Joey]] + +>> I'll try benchmarking again on the non-public wiki where I had the 4% +>> speedup. The docwiki is so small that 4% is hard to measure... --[[smcv]] + +Not saving {depends} to the index, using a hash instead of a list to +de-duplicate, and allowing add_depends to take an arrayref instead of a single +pagespec had no noticable positive or negative effect on this test. + +> I see e4cd168ebedd95585290c97ff42234344bfed46c is still in your branch +> though. I don't like using an arrayref, it could just take `($page, @depends)`. +> and I don't see the need to keep it if it doesn't currently help. + +>> I'll drop it. --[[smcv]] + +> Is there any reason to keep 7227c2debfeef94b35f7d81f42900aa01820caa3 +> if it doesn't improve speed? +> --[[Joey]] + +>> I'll try benchmarking on a more complex wiki and see whether it has a +>> positive or negative effect. It does avoid being O(n**2) in number of +>> dependencies. --[[smcv]] + +Memoizing the results of pagename brought the rebuild time down to 14.06s +and the refresh time down to 7.96/7.92/7.92, a significant win. + +> Ok, that seems safe to memoize. (It's a real function and it isn't +> called with a great many inputs.) Why did you chose to memoize it +> explicitly rather than adding it to the memoize list at the top? + +>> It does depend on global variables, so using Memoize seemed like asking for +>> trouble. I suppose what I did is equivalent to Memoize though... --[[smcv]] + +Refactoring to use pagespec_match_list looks more risky from a code churn +point of view; rebuild now takes 14.35s, but refresh is only 7.30/7.29/7.28, +another significant win. + +--[[smcv]] + +> I had mostly convinced myself that +> `pagespec_match_list` would not lead to a speed gain here. My reasoning +> was that you want to stop after finding one match, while `pagespec_match_list` +> checks all pages for matches. So what we're seeing is that +> on a rebuild, `@changed` is all pages, and not short-circuiting leads +> to unnecessary work. OTOH, on refresh, `@changed` is small and I suppose +> `pagespec_match_list`'s other slight efficiencies win out somehow. +> +> Welcome to the "I made ikiwiki twice as fast +> and all I got was this lousy git sha1sum" club BTW :-) --[[Joey]] + +[[!tag wishlist patch patch/core]]