X-Git-Url: http://git.vanrenterghem.biz/git.ikiwiki.info.git/blobdiff_plain/8e92468eae9ac0ab8161a0c71ff6c6a0a8aef07a..3d6cb667aafa175819ffcec86930eef3621c8b04:/doc/tips/convert_mediawiki_to_ikiwiki.mdwn diff --git a/doc/tips/convert_mediawiki_to_ikiwiki.mdwn b/doc/tips/convert_mediawiki_to_ikiwiki.mdwn index f03703b46..e60b413dd 100644 --- a/doc/tips/convert_mediawiki_to_ikiwiki.mdwn +++ b/doc/tips/convert_mediawiki_to_ikiwiki.mdwn @@ -1,4 +1,286 @@ -[[sabr]] explains how to [import MediaWiki content into -git](http://u32.net/Mediawiki_Conversion/index.html?updated), including -full edit hostory. The [[plugins/contrib/mediawiki]] plugin can then be -used by ikiwiki to build the wiki. +[[!toc levels=2]] + +Mediawiki is a dynamically-generated wiki which stores its data in a +relational database. Pages are marked up using a proprietary markup. It is +possible to import the contents of a Mediawiki site into an ikiwiki, +converting some of the Mediawiki conventions into Ikiwiki ones. + +The following instructions describe ways of obtaining the current version of +the wiki. We do not yet cover importing the history of edits. + +Another set of instructions and conversion tools (which imports the full history) +can be found at + +## Step 1: Getting a list of pages + +The first bit of information you require is a list of pages in the Mediawiki. +There are several different ways of obtaining these. + +### Parsing the output of `Special:Allpages` + +Mediawikis have a special page called `Special:Allpages` which list all the +pages for a given namespace on the wiki. + +If you fetch the output of this page to a local file with something like + + wget -q -O tmpfile 'http://your-mediawiki/wiki/Special:Allpages' + +You can extract the list of page names using the following python script. Note +that this script is sensitive to the specific markup used on the page, so if +you have tweaked your mediawiki theme a lot from the original, you will need +to adjust this script too: + + import sys + from xml.dom.minidom import parse, parseString + + dom = parse(sys.argv[1]) + tables = dom.getElementsByTagName("table") + pagetable = tables[-1] + anchors = pagetable.getElementsByTagName("a") + for a in anchors: + print a.firstChild.toxml().\ + replace('&','&').\ + replace('<','<').\ + replace('>','>') + +Also, if you have pages with titles that need to be encoded to be represented +in HTML, you may need to add further processing to the last line. + +Note that by default, `Special:Allpages` will only list pages in the main +namespace. You need to add a `&namespace=XX` argument to get pages in a +different namespace. (See below for the default list of namespaces) + +Note that the page names obtained this way will not include any namespace +specific prefix: e.g. `Category:` will be stripped off. + +### Querying the database + +If you have access to the relational database in which your mediawiki data is +stored, it is possible to derive a list of page names from this. With mediawiki's +MySQL backend, the page table is, appropriately enough, called `table`: + + SELECT page_namespace, page_title FROM page; + +As with the previous method, you will need to do some filtering based on the +namespace. + +### namespaces + +The list of default namespaces in mediawiki is available from . Here are reproduced the ones you are most likely to encounter if you are running a small mediawiki install for your own purposes: + +[[!table data=""" +Index | Name | Example +0 | Main | Foo +1 | Talk | Talk:Foo +2 | User | User:Jon +3 | User talk | User_talk:Jon +6 | File | File:Barack_Obama_signature.svg +10 | Template | Template:Prettytable +14 | Category | Category:Pages_needing_review +"""]] + +## Step 2: fetching the page data + +Once you have a list of page names, you can fetch the data for each page. + +### Method 1: via HTTP and `action=raw` + +You need to create two derived strings from the page titles: the +destination path for the page and the source URL. Assuming `$pagename` +contains a pagename obtained above, and `$wiki` contains the URL to your +mediawiki's `index.php` file: + + src=`echo "$pagename" | tr ' ' _ | sed 's,&,&,g'` + dest=`"$pagename" | tr ' ' _ | sed 's,&,__38__,g'` + + mkdir -p `dirname "$dest"` + wget -q "$wiki?title=$src&action=raw" -O "$dest" + +You may need to add more conversions here depending on the precise page titles +used in your wiki. + +If you are trying to fetch pages from a different namespace to the default, +you will need to prefix the page title with the relevant prefix, e.g. +`Category:` for category pages. You probably don't want to prefix it to the +output page, but you may want to vary the destination path (i.e. insert an +extra directory component corresponding to your ikiwiki's `tagbase`). + +### Method 2: via HTTP and `Special:Export` + +Mediawiki also has a special page `Special:Export` which can be used to obtain +the source of the page and other metadata such as the last contributor, or the +full history, etc. + +You need to send a `POST` request to the `Special:Export` page. See the source +of the page fetched via `GET` to determine the correct arguments. + +You will then need to write an XML parser to extract the data you need from +the result. + +### Method 3: via the database + +It is possible to extract the page data from the database with some +well-crafted queries. + +## Step 3: format conversion + +The next step is to convert Mediawiki conventions into Ikiwiki ones. + +### categories + +Mediawiki uses a special page name prefix to define "Categories", which +otherwise behave like ikiwiki tags. You can convert every Mediawiki category +into an ikiwiki tag name using a script such as + + import sys, re + pattern = r'\[\[Category:([^\]]+)\]\]' + + def manglecat(mo): + return '\[[!tag %s]]' % mo.group(1).strip().replace(' ','_') + + for line in sys.stdin.readlines(): + res = re.match(pattern, line) + if res: + sys.stdout.write(re.sub(pattern, manglecat, line)) + else: sys.stdout.write(line) + +## Step 4: Mediawiki plugin or Converting to Markdown + +You can use a plugin to make ikiwiki support Mediawiki syntax, or you can +convert pages to a format ikiwiki understands. + +### Step 4a: Mediawiki plugin + +The [[plugins/contrib/mediawiki]] plugin can be used by ikiwiki to interpret +most of the Mediawiki syntax. + +The following things are not working: + +* templates +* tables +* spaces and other funky characters ("?") in page names + +### Step 4b: Converting pages + +#### Converting to Markdown + +There is a Python script for converting from the Mediawiki format to Markdown in [[mithro]]'s conversion repository at . *WARNING:* While the script tries to preserve everything is can, Markdown syntax is not as flexible as Mediawiki so the conversion is lossy! + + # The script needs the mwlib library to work + # If you don't have easy_install installed, apt-get install python-setuptools + sudo easy_install mwlib + + # Get the repository + git clone git://github.com/mithro/media2iki.git + cd media2iki + + # Do a conversion + python mediawiki2markdown.py --no-strict --no-debugger > output.md + + +[[mithro]] doesn't frequent this page, so please report issues on the [github issue tracker](https://github.com/mithro/media2iki/issues). + +## Scripts + +There is a repository of tools for converting MediaWiki to Git based Markdown wiki formats (such as ikiwiki and github wikis) at . It also includes a standalone tool for converting from the Mediawiki format to Markdown. [[mithro]] doesn't frequent this page, so please report issues on the [github issue tracker](https://github.com/mithro/media2iki/issues). + +[[Albert]] wrote a ruby script to convert from mediawiki's database to ikiwiki at + +[[scy]] wrote a python script to convert from mediawiki XML dumps to git repositories at . + +[[Anarcat]] wrote a python script to convert from a mediawiki website to ikiwiki at git://src.anarcat.ath.cx/mediawikigitdump.git/. The script doesn't need any special access or privileges and communicates with the documented API (so it's a bit slower, but allows you to mirror sites you are not managing, like parts of Wikipedia). The script can also incrementally import new changes from a running site, through RecentChanges inspection. It also supports mithro's new Mediawiki2markdown converter (which I have a copy here: git://src.anarcat.ath.cx/media2iki.git/). + +> Some assembly is required to get Mediawiki2markdown and its mwlib +> gitmodule available in the right place for it to use.. perhaps you could +> automate that? --[[Joey]] + +> > You mean a debian package? :) media2iki is actually a submodule, so you need to go through extra steps to install it. mwlib being the most annoying part... I have fixed my script so it looks for media2iki directly in the submodule and improved the install instructions in the README file, but I'm not sure I can do much more short of starting to package the whole thing... --[[anarcat]] + +>>> You may have forgotten to push that, I don't see those changes. +>>> Packaging the python library might be a good 1st step. +>>> --[[Joey]] + +> Also, when I try to run it with -t on www.amateur-radio-wiki.net, it +> fails on some html in the page named "4_metres". On archiveteam.org, +> it fails trying to write to a page filename starting with "/", --[[Joey]] + +> > can you show me exactly which commandline arguments you're using? also, I have made improvements over the converter too, also available here: git://src/anarcat.ath.cx/media2iki.git/ -- [[anarcat]] + +>>> Not using your new converter, just the installation I did earlier +>>> today: +>>> --[[Joey]] + +
+fetching page 4 metres  from http://www.amateur-radio-wiki.net//index.php?action=raw&title=4+metres into 4_metres.mdwn
+Unknown tag TagNode tagname='div' vlist={'style': {u'float': u'left', u'border': u'2px solid #aaa', u'margin-left': u'20px'}}->'div' div
+Traceback (most recent call last):
+  File "./mediawikigitdump.py", line 298, in 
+    fetch_allpages(namespace)
+  File "./mediawikigitdump.py", line 82, in fetch_allpages
+    fetch_page(page.getAttribute('title'))
+  File "./mediawikigitdump.py", line 187, in fetch_page
+    c.parse(urllib.urlopen(url).read())
+  File "/home/joey/tmp/mediawikigitdump/mediawiki2markdown.py", line 285, in parse
+    self.parse_node(ast)
+  File "/home/joey/tmp/mediawikigitdump/mediawiki2markdown.py", line 76, in parse_node
+    f(node)
+  File "/home/joey/tmp/mediawikigitdump/mediawiki2markdown.py", line 88, in on_article
+    self.parse_children(node)
+  File "/home/joey/tmp/mediawikigitdump/mediawiki2markdown.py", line 83, in parse_children
+    self.parse_node(child)
+  File "/home/joey/tmp/mediawikigitdump/mediawiki2markdown.py", line 76, in parse_node
+    f(node)
+  File "/home/joey/tmp/mediawikigitdump/mediawiki2markdown.py", line 413, in on_section
+    self.parse_node(child)
+  File "/home/joey/tmp/mediawikigitdump/mediawiki2markdown.py", line 76, in parse_node
+    f(node)
+  File "/home/joey/tmp/mediawikigitdump/mediawiki2markdown.py", line 83, in parse_children
+    self.parse_node(child)
+  File "/home/joey/tmp/mediawikigitdump/mediawiki2markdown.py", line 76, in parse_node
+    f(node)
+  File "/home/joey/tmp/mediawikigitdump/mediawiki2markdown.py", line 474, in on_tagnode
+    assert not options.STRICT
+AssertionError
+zsh: exit 1     ./mediawikigitdump.py -v -t http://www.amateur-radio-wiki.net/
+
+ +
+joey@wren:~/tmp/mediawikigitdump>./mediawikigitdump.py -v -t http://archiveteam.org            
+fetching page list from namespace 0 ()
+found 222 pages
+fetching page /Sites using MediaWiki (English)  from http://archiveteam.org/index.php?action=raw&title=%2FSites+using+MediaWiki+%28English%29 into /Sites_using_MediaWiki_(English).mdwn
+Traceback (most recent call last):
+  File "./mediawikigitdump.py", line 298, in 
+    fetch_allpages(namespace)
+  File "./mediawikigitdump.py", line 82, in fetch_allpages
+    fetch_page(page.getAttribute('title'))
+  File "./mediawikigitdump.py", line 188, in fetch_page
+    f = open(filename, 'w')
+IOError: [Errno 13] Permission denied: u'/Sites_using_MediaWiki_(English).mdwn'
+zsh: exit 1     ./mediawikigitdump.py -v -t http://archiveteam.org
+
+ +> > > > > I have updated my script to call the parser without strict mode and to trim leading slashes (and /../, for that matter...) -- [[anarcat]] + +> > > > > > Getting this error with the new version on any site I try (when using -t only): `TypeError: argument 1 must be string or read-only character buffer, not None` +> > > > > > bisecting, commit 55941a3bd89d43d09b0c126c9088eee0076b5ea2 broke it. +> > > > > > --[[Joey]] + +> > > > > > > I can't reproduce here, can you try with -v or -d to try to trace down the problem? -- [[anarcat]] + +
+fetching page list from namespace 0 ()
+found 473 pages
+fetching page 0 - 9  from http://www.amateur-radio-wiki.net/index.php?action=raw&title=0+-+9 into 0_-_9.mdwn
+Traceback (most recent call last):
+  File "./mediawikigitdump.py", line 304, in 
+    main()
+  File "./mediawikigitdump.py", line 301, in main
+    fetch_allpages(options.namespace)
+  File "./mediawikigitdump.py", line 74, in fetch_allpages
+    fetch_page(page.getAttribute('title'))
+  File "./mediawikigitdump.py", line 180, in fetch_page
+    f.write(options.convert(urllib.urlopen(url).read()))
+TypeError: argument 1 must be string or read-only character buffer, not None
+zsh: exit 1     ./mediawikigitdump.py -v -d -t http://www.amateur-radio-wiki.net/
+