I'm trying to set up a [planet of my users' blogs](http://help.schmonz.com/planet/). I've enabled the aggregate, meta, and tag plugins (but not htmltidy, that thing has a gajillion dependencies). `aggregateinternal` is 1. The cron job is running and I've also enabled the webtrigger. My usage is like so:

    \[[!inline pages="internal(planet/*) show=0"]]

    \[[!aggregate
    name="Amitai's blog"
    url="http://www.schmonz.com/"
    dir="planet/schmonz-blog"
    feedurl="http://www.schmonz.com/atom/"
    expirecount="2"
    tag="schmonz"
    ]]

    \[[!aggregate
    name="Amitai's photos"
    url="http://photos.schmonz.com/"
    dir="planet/schmonz-photos"
    feedurl="http://photos.schmonz.com/main.php?g2_view=rss.SimpleRender&g2_itemId=7"
    expirecount="2"
    tag="schmonz"
    ]]

(and a few more `aggregate` directives like these)

Two things aren't working as I'd expect:

1. `expirecount` doesn't take effect on the first run, but on the second. (This is minor, just a bit confusing at first.)
2. Where are the article bodies for e.g. David's and Nathan's blogs? The bodies aren't showing up in the `._aggregated` files for those feeds, though the bodies for my own blog are. That explains the planet problem, but I don't understand the underlying aggregation problem. (Those feeds include article bodies, and show up normally in my usual feed reader rss2email.) How can I debug this further? --[[schmonz]]

> I only looked at David's, but its rss feed is not escaping the html
> inside the rss `description` tags, which is illegal for rss 2.0. These
> unknown tags then get ignored, including their content, and all that's
> left is whitespace. Escaping the html to `&lt;` and `&gt;` fixes the
> problem. You can see the feed validator complain about it here:
> <http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fwww.davidj.org%2Frss.xml>
>
> It's sorta unfortunate that [[!cpan XML::Feed]] doesn't just assume the
> unescaped html is part of the description field. Probably other feed
> parsers are more lenient. --[[Joey]]

>> Thanks for the quick response (and the `expirecount` fix); I've forwarded it to David so he can fix his feed. Nathan's Atom feed validates -- it's generated by the same CMS as mine -- so I'm still at a loss on that one. --[[schmonz]]

>>> Nathan's feed contains only summary elements, with no content elements.
>>> This is legal according to the Atom spec, so I've fixed ikiwiki to use
>>> the summary if no content is available. --[[Joey]]

>>>> After applying your diffs, blowing away my cached aggregated stuff, and running the aggregate cron job by hand, the resulting planet still doesn't have Nathan's summaries... and the two posts from each feed that aren't being expired aren't the two newest ones (not sure what the pattern is there). Have I done something wrong? --[[schmonz]]

>>>>> I think that both issues are now fixed. Thanks for testing.
>>>>> --[[Joey]]

>>>>>> I can confirm, they're fixed on my end. --[[schmonz]]

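For anyone who lands here with the same symptoms, the two fixes discussed above can be sketched in core Perl. This is only an illustration: `escape_for_rss` and `entry_body` are hypothetical names, the entries are plain hashes standing in for parsed feed entries, and none of this is ikiwiki's or XML::Feed's actual code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that would otherwise be read as XML markup,
# so HTML can sit inside an RSS 2.0 <description> as plain text.
sub escape_for_rss {
    my ($html) = @_;
    $html =~ s/&/&amp;/g;    # must run first, or the entities below get re-escaped
    $html =~ s/</&lt;/g;
    $html =~ s/>/&gt;/g;
    return $html;
}

# Assumption: an entry is a plain hash with optional 'content' and
# 'summary' keys. Prefer non-empty content, else fall back to the summary.
sub entry_body {
    my ($entry) = @_;
    for my $field (qw(content summary)) {
        my $text = $entry->{$field};
        return $text if defined $text && $text =~ /\S/;
    }
    return '';
}

print escape_for_rss('<p>Tom & Jerry</p>'), "\n";
print entry_body({ summary => 'Only a summary here.' }), "\n";
```

The first call shows what a feed author would emit inside `description`; the second shows a summary-only entry still producing a body.
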
New bug: new posts aren't getting displayed (or cached for aggregation). After fixing his feed, David posted a new item today, and the aggregator is convinced there's nothing to do, whether by cronjob or webtrigger. I verified that it wasn't another problem with his feed by adding another of my ikiwiki's feeds to the planet, running the aggregator, posting a new item, and running the aggregator again: no new item. --[[schmonz]]

> Even if you start it more frequently, aggregation will only occur every
> `updateinterval` minutes (default 15), maximum. Does this explain what
> you're seeing? --[[Joey]]

>> Crap, right, and my test update has since made it into the planet. His post still hasn't. So it must be something with David's feed again? A quick test with XML::Feed looks like it's parsing just fine: --[[schmonz]]

    $ perl
    use XML::Feed;
    my $feed = XML::Feed->parse(URI->new('http://www.davidj.org/rss.xml')) or die XML::Feed->errstr;
    print $feed->title, "\n";
    for my $entry ($feed->entries) {
        print $entry->title, ": ", $entry->issued, "\n";
    }
    ^D
    davidj.org
    Amway Stories - Refrigerator Pictures: 2008-09-19T00:12:27
    Amway Stories - Coffee: 2008-09-13T10:08:17
    Google Alphabet Update: 2008-09-11T22:55:37
    Writing for writing's sake: 2008-09-09T23:39:05
    Google Chrome: 2008-09-02T23:12:26
    Mister Casual: 2008-07-25T09:01:17
    Parental Conversations: 2008-07-24T10:44:44
    Place Of George Orwell: 2008-06-03T22:11:07
    The Raw Beauty Of A National Duolian: 2008-05-31T12:41:06

> I had no problem getting the "Refrigerator Pictures" post to aggregate
> here, though without a copy of the old feed I can't be 100% sure I've
> reproduced your ikiwiki's state. --[[Joey]]

>> Okay, I blew away the cached entries and aggregator state files and reran the aggregator and all appears well again. If the problem recurs I'll be sure to post here. :-) --[[schmonz]]

>>> On the off chance that you retained a copy of the old state, I'd not
>>> mind having a copy to investigate. --[[Joey]]

>>>> Didn't think of that, will keep a copy if there's a next time. -- [[schmonz]]

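As a reminder of where the interval mentioned above is set: `updateinterval` is given per feed, in minutes, on the `aggregate` directive. A sketch reusing the first feed from the top of this page, with 60 chosen purely as an illustrative value:

```
\[[!aggregate
name="Amitai's blog"
url="http://www.schmonz.com/"
dir="planet/schmonz-blog"
feedurl="http://www.schmonz.com/atom/"
updateinterval="60"
expirecount="2"
tag="schmonz"
]]
```

Running the cron job or webtrigger more often than this won't fetch the feed any sooner, so a post that seems to be missing may just be waiting for the next interval.
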
-----

In a corporate environment where feeds are generally behind
authentication, I need to prime the aggregator's `LWP::UserAgent`
with some cookies. What I've done is write a custom plugin to populate
`$config{cookies}` with an `HTTP::Cookies` object, plus this diff:

    --- /var/tmp/pkg/lib/perl5/vendor_perl/5.10.0/IkiWiki/Plugin/aggregate.pm  2010-06-24 13:03:33.000000000 -0400
    +++ aggregate.pm    2010-06-24 13:04:09.000000000 -0400
    @@ -488,7 +488,11 @@
                            }
                            $feed->{feedurl}=pop @urls;
                    }
    -           my $res=URI::Fetch->fetch($feed->{feedurl});
    +           my $res=URI::Fetch->fetch($feed->{feedurl},
    +                                     UserAgent => LWP::UserAgent->new(
    +                                           cookie_jar => $config{cookies},
    +                                     ),
    +           );
                    if (! $res) {
                            $feed->{message}=URI::Fetch->errstr;
                            $feed->{error}=1;

It works, but I have to remember to apply the diff whenever I update
ikiwiki.  Can you provide a more elegant means of allowing cookies and/or
the user agent to be programmatically manipulated? --[[schmonz]]

> Ping -- is the above patch perhaps acceptable (or near-acceptable)? -- [[schmonz]]

>> Pong.. I'd be happier with a more 100% solution that let cookies be used
>> w/o needing to write a custom plugin to do it. --[[Joey]]

>>> According to LWP::UserAgent, for the common case, a complete
>>> and valid configuration for `$config{cookies}` would be `{ file =>
>>> "$ENV{HOME}/.cookies.txt" }`. In the more common case of not needing
>>> to prime one's cookies, `cookie_jar` can be `undef` (that's the
>>> default). In my less common case, the cookies are generated by
>>> visiting a couple magic URLs, which would be trivial to turn into
>>> config options, except that these particular URLs rely on SPNEGO
>>> and so LWP::Authen::Negotiate has to be loaded. So I think adding
>>> `$config{cookies}` (and using it in the aggregate plugin) should
>>> be safe, might help people in typical cases, and won't prevent
>>> further enhancements for less typical cases. --[[schmonz]]

>>>> Ok, done. Called it cookiejar. --[[Joey]]
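
Going by the exchange above, the setting in a Perl-format setup file would presumably look something like the fragment below. The option name comes from Joey's reply and the value shape from the `LWP::UserAgent` `cookie_jar` form quoted earlier; the path is only the example used in this thread, so check the documentation for your installed ikiwiki before relying on it.

```perl
# Fragment of an ikiwiki setup file (other options omitted).
# Assumption: cookiejar takes the same hash that LWP::UserAgent
# accepts for its cookie_jar option.
cookiejar => { file => "$ENV{HOME}/.cookies.txt" },
```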