Description
Provide a way to import / export topics from Wikipedia into JAMWiki.
- Do you mean MediaWiki, not Wikipedia? MediaWiki is the software; Wikipedia is a service that runs on it. XML import is needed for many other things, including importing entire MediaWiki sites into JAMWiki as an upgrade.
Author(s)
Status
The code is now in the Subversion trunk repository, is running on jamwiki.org, and will be included in JAMWiki 0.5.0. I'd like to mark this feature as "experimental" for now due to some issues (see below), so it is not linked from Special:Specialpages and can only be reached by visiting Special:Import directly.
Comments
Import from Wikipedia
Hi! I have implemented some XML-import functionality. Now I'm testing it with files exported from Ukrainian Wikipedia. I have found some issues which might be interesting for you.
- Topics with names containing "?" can't be loaded. Why did you decide to add "?" to the INVALID_TOPIC_NAME_PATTERN?
- Importing a topic that links to a topic whose name contains "?" (even if the linked topic does not exist) causes the process to hang.
- The first letter of category and template names is not automatically capitalized, so Category:test is not the same as Category:Test. --Gutsul 23-Nov-2006 01:43 PST
- Question marks have been problematic due to their usage with URL query parameters, but if they are needed for MediaWiki compatibility then I can look into it. As to capitalization of categories, I didn't realize MediaWiki treated "Category:Test" and "Category:test" the same,
- Not only that, it treats "category:test" and "Category:test" the same, and JAMWiki doesn't as of 0.6.3. Unfortunately that doesn't hold true in all versions of MediaWiki when a user creates their own namespace by using colons in names, and it should. So it's problematic, since MediaWiki does this wrong.
- so I can try to get that fixed for the next release. Let me know how the rest of your testing goes - I'm currently out of town but will try to get online at least once a day. -- Ryan 23-Nov-2006 08:23 PST
- Gutsul, I've moved your topic next to the thread I started on the same subject... I'm working on a similar thing. If you already have some code, can you please share it? I was planning to work on XML import from a MediaWiki dump, so if you have already started on something we might work together. I first implemented an option to zip topics, which will help me keep the installation size smaller; next is the import itself. I will check my code into my private branch in the SourceForge repository (https://svn.sourceforge.net/svnroot/jamwiki/wiki/branches/ncsaba) in the next few days. Cheers, Csaba 23-Nov-2006 08:38 PST
- I've also started a small XML Wikipedia import tool in my Java Wikipedia API tools project: http://plog4u.svn.sourceforge.net/viewvc/plog4u/info.bliki.wiki/src/info/bliki/wiki/dump/ . It would be nice if we could generalize this tool, for example to create offline HTML (or PDF?) documents. -- Axel Kramer 23-Nov-2006 10:17 PST
- First - I've added code (now running on jamwiki.org) to allow categories to be processed in a case-insensitive way, so "Category:Test" and "Category:test" should now be viewed the same.
- Second - I'd also like to see a more general way of importing/exporting XML. Gutsul wrote to me to ask for Subversion write access, so hopefully his code will soon be in the repository and we can all take a look to figure out how best this feature should be implemented. -- Ryan 27-Nov-2006 16:28 PST
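For anyone following the capitalization discussion above, here is a minimal sketch of one way to make category and template names compare equal regardless of case. It is not the code running on jamwiki.org; the class name `TopicNameNormalizer`, the method `normalize`, and the fixed namespace list are all assumptions made for illustration. Names with an unrecognized prefix are left untouched, since user-invented "namespaces" made with colons were reported above to behave inconsistently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JAMWiki: canonicalizes names in known namespaces.
class TopicNameNormalizer {
    // Assumption: only these namespaces get case-insensitive handling.
    static final List<String> KNOWN_NAMESPACES = Arrays.asList("Category", "Template");

    // Upper-cases the first code point, so non-ASCII letters are handled as well.
    static String capitalizeFirst(String s) {
        if (s.isEmpty()) {
            return s;
        }
        int first = s.codePointAt(0);
        return new StringBuilder(s.length())
                .appendCodePoint(Character.toUpperCase(first))
                .append(s, Character.charCount(first), s.length())
                .toString();
    }

    // "category:test" and "Category:Test" both map to "Category:Test".
    static String normalize(String topicName) {
        int colon = topicName.indexOf(':');
        if (colon <= 0) {
            return topicName; // no namespace prefix
        }
        String prefix = topicName.substring(0, colon);
        for (String namespace : KNOWN_NAMESPACES) {
            if (namespace.equalsIgnoreCase(prefix)) {
                return namespace + ":" + capitalizeFirst(topicName.substring(colon + 1));
            }
        }
        return topicName; // unknown prefix: leave as typed
    }

    public static void main(String[] args) {
        System.out.println(normalize("category:test"));
        System.out.println(normalize("my:topic"));
    }
}
```

Normalizing once at import and lookup time, rather than comparing case-insensitively everywhere, keeps a single canonical spelling in storage; whether that fits the existing schema is a separate question.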
- Regarding issues with "?" in JAMWiki topic names - I tried removing the "?" from the disallowed character list, and it broke badly. The problem is that it's very difficult to determine whether a question mark is part of a topic name or the beginning of a query parameter. Given the choice between allowing question marks in topic names or disallowing them I'd be inclined to disallow them, but a strong argument could be made that since Mediawiki supports them JAMWiki should as well. At this point I'm not sure what the correct solution is, so if anyone feels strongly that JAMWiki should support topic names containing question marks please add your comments. -- Ryan 30-Nov-2006 00:58 PST
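To make the ambiguity described above concrete, here is a small sketch using only the standard `java.net` classes. It shows that an unencoded "?" is read as the start of the query string, while a percent-encoded "%3F" stays in the path. The base URL, the class name `QuestionMarkDemo`, and the helper `encodeTopic` are hypothetical, and this says nothing about how a servlet container or JAMWiki's own URL handling would treat the decoded path afterwards.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JAMWiki.
class QuestionMarkDemo {
    // Assumed base URL, for illustration only.
    static final String BASE = "http://example.com/wiki/en/";

    // Percent-encodes a topic name for use as a URL path segment.
    static String encodeTopic(String topicName) {
        // URLEncoder does form encoding, so spaces come out as "+"; use "%20" in a path.
        return URLEncoder.encode(topicName, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        // Unencoded: is the topic "Who?" with action=edit, or the topic "Who"?
        URI ambiguous = URI.create(BASE + "Who?action=edit");
        System.out.println(ambiguous.getPath() + " | " + ambiguous.getQuery());

        // Encoded: the question mark belongs to the topic name.
        URI encoded = URI.create(BASE + encodeTopic("Who?") + "?action=edit");
        System.out.println(encoded.getPath() + " | " + encoded.getQuery());
    }
}
```

Encoding on the way out only helps if every link generator does it consistently, which may be part of why simply dropping "?" from the disallowed list broke things.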
Status
Gutsul emailed me the XML import code last night, and it's non-obtrusive enough that I don't see any reason not to merge it into the trunk. I'd like to do a bit of cleanup first for Coding Style and to remove a DataHandler dependency, and it also needs to be updated to match changes made for 0.5.0, but once that's done I'll commit it and would appreciate feedback from others. -- Ryan 29-Nov-2006 12:38 PST
- The code is now integrated on the trunk. I've made some updates for coding style and re-arranged a few things, but for the most part it's the same code as Gutsul sent to me. I ran a couple of tests on my local machine to make sure the code works (it does), but I didn't look too closely into the XML processing code, and will leave it to others to provide any feedback for that. A couple of issues I noticed:
- The current code does not import version history, and it would be nice to have that capability.
- The current code does not import author history. I'm not sure how to handle that problem, but due to attribution requirements with the GFDL and CC-SA licenses it probably needs to be addressed.
- Axel and others have expressed interest in more generic XML support, so I'll leave it to those who are interested to work out a more flexible framework.
- Since the code is now in Subversion anyone can view and modify it, so please do so and add any comments or feedback here. -- Ryan 29-Nov-2006 15:26 PST
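As a starting point for the version-history and author-history issues listed above, here is a rough sketch that reads every revision of every page from a MediaWiki XML export, together with its timestamp and contributor. It is not the code on the trunk. It assumes the usual export layout (`page` containing `title` and one or more `revision` elements, each with `timestamp`, `contributor` holding `username` or `ip`, and `text`), and the names `DumpRevisionReader`, `Revision`, and `read` are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical reader, not part of JAMWiki: collects one entry per <revision>.
class DumpRevisionReader {
    static final class Revision {
        final String title;
        final String timestamp;
        final String author; // username, or IP address for anonymous edits
        final String text;

        Revision(String title, String timestamp, String author, String text) {
            this.title = title;
            this.timestamp = timestamp;
            this.author = author;
            this.text = text;
        }
    }

    static List<Revision> read(String xml) {
        List<Revision> revisions = new ArrayList<>();
        String title = null, timestamp = null, author = null, text = null;
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs needed here
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    switch (reader.getLocalName()) {
                        case "revision": timestamp = null; author = null; text = null; break;
                        case "title": title = reader.getElementText(); break;
                        case "timestamp": timestamp = reader.getElementText(); break;
                        case "username":
                        case "ip": author = reader.getElementText(); break;
                        case "text": text = reader.getElementText(); break;
                        default: break; // ignore ids, comments, site info, and so on
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "revision".equals(reader.getLocalName())) {
                    // A revision is complete: record it under the current page title.
                    revisions.add(new Revision(title, timestamp, author, text));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse export XML", e);
        }
        return revisions;
    }
}
```

A streaming parser is used because full-site dumps can be far too large to load as a DOM tree; a real importer would read from a file stream rather than a string and would write each revision to storage as it goes instead of building a list.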