X-Git-Url: http://git.vanrenterghem.biz/git.ikiwiki.info.git/blobdiff_plain/cecbd529389788c1f1cb647e2ff297cda7554456..db6ec539cd9c5eb42bc6ffd1a918f33cee4bfb10:/doc/todo/smarter_sorting.mdwn diff --git a/doc/todo/smarter_sorting.mdwn b/doc/todo/smarter_sorting.mdwn index 5a6d63ef1..901e143a7 100644 --- a/doc/todo/smarter_sorting.mdwn +++ b/doc/todo/smarter_sorting.mdwn @@ -13,14 +13,119 @@ out of maybe thousands). As [[smcv]] noted, It could be flipped, so the pagespec is applied first, and then sort the smaller matching set. But, checking pagespecs is likely more expensive than sorting. (Also, influence calculation complicates -doing that, since only influences for the M returned pages should be tracked.) +doing that.) Another option, when there is a limit on M pages to return, might be to -cull the M top pages without sorting the rest. This could be done using -a variant of Ye Olde Bubble Sort. Take the first M pages, and (quick)sort. -Then for each of the rest, check if it is higher than the Mth page. -If it is, bubble it up so it's sorted. -If not, throw it out (that's the fast bit and why this is not O(N^2)). +cull the M top pages without sorting the rest. + +> The patch below implements this. +> +> But, I have not thought enough about influence calculation. +> I need to figure out which pagespec matches influences need to be +> accumulated for in order to determine all possible influences of a +> pagespec are known. +> +> The old code accumulates influences from matching all successful pages +> up to the num cutoff, as well as influences from an arbitrary (sometimes +> zero) number of failed matches. New code does not accumulate influences +> from all the top successful matches, only an arbitrary group of +> successes and some failures. +> +> Also, by the time I finished this, it was not measuarably faster than +> the old method. At least not with a few thousand pages; it +> might be worth revisiting this sometime for many more pages? [[done]] +> --[[Joey]] + +
+diff --git a/IkiWiki.pm b/IkiWiki.pm
+index 1730e47..bc8b23d 100644
+--- a/IkiWiki.pm
++++ b/IkiWiki.pm
+@@ -2122,36 +2122,54 @@ sub pagespec_match_list ($$;@) {
+ 	my $num=$params{num};
+ 	delete @params{qw{num deptype reverse sort filter list}};
+ 	
+-	# when only the top matches will be returned, it's efficient to
+-	# sort before matching to pagespec,
+-	if (defined $num && defined $sort) {
+-		@candidates=IkiWiki::SortSpec::sort_pages(
+-			$sort, @candidates);
+-	}
+-	
++	# Find the first num matches (or all), before sorting.
+ 	my @matches;
+-	my $firstfail;
+ 	my $count=0;
+ 	my $accum=IkiWiki::SuccessReason->new();
+-	foreach my $p (@candidates) {
+-		my $r=$sub->($p, %params, location => $page);
++	my $i;
++	for ($i=0; $i < @candidates; $i++) {
++		my $r=$sub->($candidates[$i], %params, location => $page);
+ 		error(sprintf(gettext("cannot match pages: %s"), $r))
+ 			if $r->isa("IkiWiki::ErrorReason");
+ 		$accum |= $r;
+ 		if ($r) {
+-			push @matches, $p;
++			push @matches, $candidates[$i];
+ 			last if defined $num && ++$count == $num;
+ 		}
+ 	}
+ 
++	# We have num natches, but they may not be the best.
++	# Efficiently find and add the rest, without sorting the full list of
++	# candidates.
++	if (defined $num && defined $sort) {
++		@matches=IkiWiki::SortSpec::sort_pages($sort, @matches);
++
++		for ($i++; $i < @candidates; $i++) {
++			# Comparing candidate with lowest match is cheaper,
++			# so it's done before testing against pagespec.
++			if (IkiWiki::SortSpec::cmptwo($candidates[$i], $matches[-1], $sort) < 0 &&
++			    $sub->($candidates[$i], %params, location => $page)
++			) {
++				# this could be done less expensively
++				# using a binary search
++				for (my $j=0; $j < @matches; $j++) {
++					if (IkiWiki::SortSpec::cmptwo($candidates[$i], $matches[$j], $sort) < 0) {
++						splice @matches, $j, $#matches-$j+1, $candidates[$i],
++							@matches[$j..$#matches-1];
++						last;
++					}
++				}
++			}
++		}
++	}
++
+ 	# Add simple dependencies for accumulated influences.
+-	my $i=$accum->influences;
+-	foreach my $k (keys %$i) {
+-		$depends_simple{$page}{lc $k} |= $i->{$k};
++	my $inf=$accum->influences;
++	foreach my $k (keys %$inf) {
++		$depends_simple{$page}{lc $k} |= $inf->{$k};
+ 	}
+ 
+-	# when all matches will be returned, it's efficient to
+-	# sort after matching
++	# Sort if we didn't already.
+ 	if (! defined $num && defined $sort) {
+ 		return IkiWiki::SortSpec::sort_pages(
+ 			$sort, @matches);
+@@ -2455,6 +2473,12 @@ sub sort_pages {
+ 	sort $f @_
+ }
+ 
++sub cmptwo {
++	$a=$_[0];
++	$b=$_[1];
++	$_[2]->();
++}
++
+ sub cmp_title {
+ 	IkiWiki::pagetitle(IkiWiki::basename($a))
+ 	cmp
+
This would be bad when M is very large, and particularly, of course, when there is no limit and all pages are being matched on. (For example, an @@ -29,5 +134,8 @@ date range.) Well, in this case, it *does* make sense to flip it, limit by pagespe first, and do a (quick)sort second. (No influence complications, either.) +> Flipping when there's no limit implemented, and it knocked 1/3 off +> the rebuild time of my blog's archive pages. --[[Joey]] + Adding these special cases will be more complicated, but I think the best of both worlds. --[[Joey]]