-After using it for a while, my feeling is that hyperestradier, as used in
+[[done]], using xapian-omega! --[[Joey]]
+
+After using it for a while, my feeling is that hyperestraier, as used in
the [[plugins/search]] plugin, is not robust enough for ikiwiki. It doesn't
-upgrade well, and it has a habit of sig-11 on certian input from time to
+upgrade well, and it has a habit of sig-11 on certain input from time to
time.
-So some other engine should be found and used instead. Enrico had one that
-he was using for debtags stuff that looked pretty good.
+So some other engine should be found and used instead.
+
+Enrico had one that he was using for debtags stuff that looked pretty good.
+That was [Xapian](http://www.xapian.org/), which has perl bindings in
+libsearch-xapian-perl. The nice thing about xapian is that it does a ranked
+search so it understands what words are most important in a search. (So
+does Lucene..) Another nice thing is it supports "more documents like this
+one" kind of search. --[[Joey]]
+
+## xapian
+
+I've invesitgated xapian briefly. I think a custom xapian indexer and use
+of the omega for cgi searches could work well for ikiwiki. --[[Joey]]
+
+### indexer
+
+A custom indexer is needed because omindex isn't good enough for ikiwiki's
+needs for incremental rendering. (And because, since ikiwiki has page info
+in memory, it's silly to write it to disk and have omindex read it back.)
+
+The indexer would run as a ikiwiki hook. It needs to be passed the page
+name, and the content. Which hook to use is an open question.
+Possibilities:
+
+* `filter` - Since this runs before preprocess, only the actual text
+ written on the page would be indexed. Not text generated by directives,
+ pulled in by inlining, etc. There's something to be said for that. And
+ something to be said against it. It would also get markdown formatted
+ content, mostly, though it would still need to strip html, and also
+ probably strip preprocessor directives too.
+* `sanitize` - Would get the htmlized content, so would need to strip html.
+ Preprocessor directive output would be indexed. Doesn't get a destpage
+ parameter, making optimisation hard.
+* `format` - Would get the entire html page, including the page template.
+ Probably not a good choice as indexing the same template for each page
+ is unnecessary.
+
+The hook would remove any html from the content, and index it.
+It would need to add the same document data that omindex would.
+
+The indexer (and deleter) will need a way to figure out the ids in xapian
+of the documents to delete. One way is storing the id of each page in the
+ikiwiki index.
+
+The other way would be adding a special term to the xapian db that can be
+used with replace_document_by_term/delete_document_by_term.
+Hmm, let's use a term named "P<pagename>".
+
+The hook should try to avoid re-indexing pages that have not changed since
+they were last indexed. One problem is that, if a page with an inline is
+built, every inlined item will get each hook run. And so a naive hook would
+index each of those items, even though none of them have necessarily
+changed. Date stamps are one possibility. Another would be to avoid having
+the hook not do any indexing when `%preprocessing` is set (Ikiwiki.pm would
+need to expose that variable.) Another approach would be to use a
+needsbuild hook and only index the pages that are being built.
+
+#### cgi
+
+The cgi hook would exec omega to handle the searching, much as is done
+with estseek in the current search plugin.
+
+It would first set `OMEGA_CONFIG_FILE=.ikiwiki/omega.conf` ; that omega.conf
+would set `database_dir=.ikiwiki/xapian` and probably also set a custom
+`template_dir`, which would have modified templates branded for ikiwiki. So
+the actual xapian db would be in `.ikiwiki/xapian/default/`.
+
+## lucene
>> I've done a bit of prototyping on this. The current hip search library is [Lucene](http://lucene.apache.org/java/docs/). There's a Perl port called [Plucene](http://search.cpan.org/~tmtm/Plucene-1.25/). Given that it's already packaged, as `libplucene-perl`, I assumed it would be a good starting point. I've written a **very rough** patch against `IkiWiki/Plugin/search.pm` to handle the indexing side (there's no facility to view the results yet, although I have a command-line interface working). That's below, and should apply to SVN trunk.
>> If this seems a sensible approach, I'll write the CGI interface, and clean up the plugin. -- Ben
+>>> The weird thing about lucene is that these are all reimplmentations of
+>>> it. Thank you java.. The C++ version seems like a better choice to me
+>>> (packages are trivial). --[[Joey]]
+
+> Might I suggest renaming the "search" plugin to "hyperestraier", and then creating new search plugins for different engines? No reason to pick a single replacement. --[[JoshTriplett]]
+
<pre>
Index: IkiWiki/Plugin/search.pm
===================================================================
+ $PLUCENE_DIR = $config{wikistatedir}.'/plucene';
+}
+
- sub import { #{{{
+ sub import {
- hook(type => "getopt", id => "hyperestraier",
-- call => \&getopt);
+- call => \&getopt);
- hook(type => "checkconfig", id => "hyperestraier",
+ hook(type => "checkconfig", id => "plucene",
- call => \&checkconfig);
+ call => \&checkconfig);
- hook(type => "pagetemplate", id => "hyperestraier",
-- call => \&pagetemplate);
+- call => \&pagetemplate);
- hook(type => "delete", id => "hyperestraier",
+ hook(type => "delete", id => "plucene",
- call => \&delete);
+ call => \&delete);
- hook(type => "change", id => "hyperestraier",
+ hook(type => "change", id => "plucene",
- call => \&change);
+ call => \&change);
- hook(type => "cgi", id => "hyperestraier",
-- call => \&cgi);
- } # }}}
+- call => \&cgi);
+ }
--sub getopt () { #{{{
+-sub getopt () {
- eval q{use Getopt::Long};
- error($@) if $@;
- Getopt::Long::Configure('pass_through');
- GetOptions("estseek=s" => \$config{estseek});
--} #}}}
+-}
+sub writer {
+ init();
+ grep { defined pagetype($_) } @_;
+}
+
- sub checkconfig () { #{{{
+ sub checkconfig () {
foreach my $required (qw(url cgiurl)) {
if (! length $config{$required}) {
@@ -36,112 +58,55 @@
}
- } #}}}
+ }
-my $form;
--sub pagetemplate (@) { #{{{
+-sub pagetemplate (@) {
- my %params=@_;
- my $page=$params{page};
- my $template=$params{template};
+#my $form;
-+#sub pagetemplate (@) { #{{{
++#sub pagetemplate (@) {
+# my %params=@_;
+# my $page=$params{page};
+# my $template=$params{template};
+#
+# $template->param(searchform => $form);
+# }
-+#} #}}}
++#}
- # Add search box to page header.
- if ($template->query(name => "searchform")) {
-
- $template->param(searchform => $form);
- }
--} #}}}
+-}
-
- sub delete (@) { #{{{
+ sub delete (@) {
- debug(gettext("cleaning hyperestraier search index"));
- estcmd("purge -cl");
- estcfg();
+ $reader->delete_term( Plucene::Index::Term->new({ field => "id", text => $_ }));
+ }
+ $reader->close;
- } #}}}
+ }
- sub change (@) { #{{{
+ sub change (@) {
- debug(gettext("updating hyperestraier search index"));
- estcmd("gather -cm -bc -cl -sd",
- map {
+ $doc->add(Plucene::Document::Field->UnStored('text' => $data));
+ $writer->add_document($doc);
+ }
- } #}}}
+ }
-
--sub cgi ($) { #{{{
+-sub cgi ($) {
- my $cgi=shift;
-
- if (defined $cgi->param('phrase') || defined $cgi->param("navi")) {
- chdir("$config{wikistatedir}/hyperestraier") || error("chdir: $!");
- exec("./".IkiWiki::basename($config{cgiurl})) || error("estseek.cgi failed");
- }
--} #}}}
+-}
-
-my $configured=0;
--sub estcfg () { #{{{
+-sub estcfg () {
- return if $configured;
- $configured=1;
-
- unlink($cgi);
- my $estseek = defined $config{estseek} ? $config{estseek} : '/usr/lib/estraier/estseek.cgi';
- symlink($estseek, $cgi) || error("symlink $estseek $cgi: $!");
--} # }}}
+-}
-
--sub estcmd ($;@) { #{{{
+-sub estcmd ($;@) {
- my @params=split(' ', shift);
- push @params, "-cl", "$config{wikistatedir}/hyperestraier";
- if (@_) {
- open(STDOUT, "/dev/null"); # shut it up (closing won't work)
- exec("estcmd", @params) || error("can't run estcmd");
- }
--} #}}}
+-}
-
-1
+1;