Tech:XML import

This page (and all pages in the Tech: namespace) is a developer discussion about a feature that is either proposed for inclusion in JAMWiki or one that has already been implemented. This page is NOT documentation of JAMWiki functionality - for a list of documentation, see Category:JAMWiki.
Status of this feature: IMPLEMENTED. This functionality was implemented in JAMWiki 0.5.0 and significantly enhanced with JAMWiki 0.8.0.

Description

Provide a way to import / export topics from Wikipedia into JAMWiki.

Do you mean mediawiki, not Wikipedia? Mediawiki is the software, Wikipedia is a service using that software. XML import is needed for many other things including importing entire mediawiki sites into jamwiki as an upgrade.

Author(s)

Status

The code is now in the Subversion trunk repository, is running on jamwiki.org, and will be included in JAMWiki 0.5.0. I'd like to mark this feature as "experimental" for now due to some issues (see below), so it is not linked to from Special:Specialpages, but can only be visited from Special:Import.

Status Update July 2009

Import

See http://meta.wikimedia.org/wiki/Help:Import and http://meta.wikimedia.org/wiki/Help:Export for descriptions of how this functionality is currently implemented in Mediawiki. See Special:Import and Special:Export for JAMWiki's current import/export functionality. Some items I would like to implement:

  1. Import full topic history.
  2. Provide attribution to the original author.
  3. When a topic already exists force the importing user to indicate whether to delete / merge the topic.
  4. When importing a topic multiple times, do not re-import the same revisions. This functionality would allow for syncing wikis by exporting and re-importing topic information.
  5. Do not make the import / export functionality specific to Mediawiki; instead create a generic import / export framework and make the file format a plugin.
  6. Add ROLE_IMPORT to control who can import topics.

-- Ryan • (comments) • 05-Jul-2009 16:10 PDT
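As an illustration of item 4 above, here is a minimal sketch of skipping revisions that were already imported. The class names `ImportedRevision` and `RevisionSync` are hypothetical and are not part of JAMWiki; the sketch also assumes that author, edit date and content together are enough to identify a revision, which may not hold for every wiki.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical value type for illustration; not a JAMWiki class.
final class ImportedRevision {
    final String author;
    final Instant editDate;
    final String content;

    ImportedRevision(String author, Instant editDate, String content) {
        this.author = author;
        this.editDate = editDate;
        this.content = content;
    }

    // Assumption: author + edit date + content identify a revision.
    List<Object> key() {
        return Arrays.asList(author, editDate, content);
    }
}

final class RevisionSync {
    /** Returns the incoming revisions that are not already present, in their original order. */
    static List<ImportedRevision> newRevisions(List<ImportedRevision> existing,
                                               List<ImportedRevision> incoming) {
        Set<List<Object>> seen = new HashSet<>();
        for (ImportedRevision revision : existing) {
            seen.add(revision.key());
        }
        List<ImportedRevision> result = new ArrayList<>();
        for (ImportedRevision revision : incoming) {
            // Set.add returns false when the key is already known, so
            // repeats inside the import file are skipped as well.
            if (seen.add(revision.key())) {
                result.add(revision);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Instant first = Instant.parse("2009-07-05T23:10:00Z");
        Instant second = Instant.parse("2009-07-10T06:28:00Z");
        List<ImportedRevision> existing = Arrays.asList(
                new ImportedRevision("Ryan", first, "initial text"));
        List<ImportedRevision> incoming = Arrays.asList(
                new ImportedRevision("Ryan", first, "initial text"),
                new ImportedRevision("Gutsul", second, "updated text"));
        for (ImportedRevision revision : newRevisions(existing, incoming)) {
            System.out.println(revision.author + " " + revision.editDate);
        }
    }
}
```

With a check like this, exporting from one wiki and re-importing into another would only append the revisions that the target does not have yet. Merging interleaved histories (item 3) is a separate problem that this sketch does not address.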

Revision 2629 adds support for importing full topic history, includes edit comment and edit date in the import, and makes use of a new org.jamwiki.migrate.Migrator interface. I'll continue working on this functionality as time permits. -- Ryan • (comments) • 09-Jul-2009 23:28 PDT
Recent additions add support for importing author name. I'm not sure that I'll get around to adding support for merging revisions (#3 and #4 above), but even without that functionality the importing process is much improved. The next piece I'd like to work on is adding unit tests and then starting on export functionality. -- Ryan • (comments) • 14-Jul-2009 08:32 PDT

Export

Base unit tests for import are now in place, and revision 2647 adds support for exporting topics as XML. Note that these changes are not yet available on jamwiki.org but once all database changes for 0.8.0 are complete I'll update the site with the new functionality. -- Ryan • (comments) • 25-Jul-2009 22:56 PDT

TODO:
  • Topic names need to be converted to Mediawiki namespaces.
  • Unit tests are needed.
  • Security checks are needed to ensure a user has access to export a topic.
-- Ryan • (comments) • 25-Jul-2009 23:21 PDT
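For the security check in the TODO list, one possible shape is to filter the requested topics through a view-permission check before anything is written to the export file. This is only a sketch: `ExportFilter` and the `canView` predicate are hypothetical stand-ins for whatever permission lookup JAMWiki actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper for illustration; not a JAMWiki class.
final class ExportFilter {
    /**
     * Returns only the topic names the current user may view.
     * The canView predicate is an assumed stand-in for the real permission check.
     */
    static List<String> exportable(List<String> requestedTopics, Predicate<String> canView) {
        List<String> allowed = new ArrayList<>();
        for (String topic : requestedTopics) {
            if (canView.test(topic)) {
                allowed.add(topic);
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        // Example policy (an assumption): hide everything in a "Private:" namespace.
        Predicate<String> canView = topic -> !topic.startsWith("Private:");
        List<String> requested = Arrays.asList("StartingPoints", "Private:Plans", "Tech:XML import");
        System.out.println(exportable(requested, canView));
    }
}
```

Whether hidden topics should be silently dropped, as here, or should make the whole export fail is a design choice left open.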

Status Update 02-October-2009

At this point these are the outstanding issues I'm aware of with import / export:

  • Topic names containing question marks fail to be imported. Revision 2720 resolves this issue by stripping question marks from topic names in the import file.
  • Importing a topic when a topic with the same name already exists is not supported. Adding functionality to merge revisions will need to be a future enhancement.
  • A user can export a topic even if the site security is configured to prevent them from viewing the topic. A workaround for this issue is to simply disable exports for non-privileged users.
  • When exporting a topic there needs to be some work done for namespace configuration. This is probably a future enhancement; in the meantime the export XML can be manually edited. Update: revision 2737 converts namespaces to Mediawiki namespaces when exporting.
  • Topic content that links to topic names containing "?" is not updated to match the stripped names. This is a future enhancement.
  • Case-sensitivity handling for JAMWiki is an ongoing concern - User:wrh2 is not treated as the same topic as User:Wrh2.
  • More unit tests are needed, particularly for export functionality.

At this point I'm comfortable releasing with the functionality as-is since it's an improvement over what was previously available, but would like to revisit these features in the future to make them even more useful. Additional topics that could be examined include exporting the templates included in a topic and other features currently offered by Mediawiki. -- Ryan • (comments) • 03-Oct-2009 10:50 PDT
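To make the question mark items in the list above concrete, here is a rough sketch of stripping "?" from an imported topic name and applying the same change to link targets inside topic content (the part listed as a future enhancement). This is not the code from revision 2720; `QuestionMarkCleanup` is a hypothetical class, and the regular expression only handles plain `[[Target]]` and `[[Target|label]]` links.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the code from revision 2720.
final class QuestionMarkCleanup {
    // Matches the target part of a wiki link: [[Target]] or [[Target|label]].
    private static final Pattern LINK_TARGET = Pattern.compile("\\[\\[([^\\]|]+)");

    /** Removes every question mark from a topic name. */
    static String cleanTopicName(String name) {
        return name.replace("?", "");
    }

    /** Applies cleanTopicName to each link target, leaving labels and other text alone. */
    static String cleanLinks(String content) {
        Matcher matcher = LINK_TARGET.matcher(content);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String target = cleanTopicName(matcher.group(1));
            // quoteReplacement guards against '$' or '\' in topic names.
            matcher.appendReplacement(out, Matcher.quoteReplacement("[[" + target));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(cleanTopicName("What is a wiki?"));
        System.out.println(cleanLinks("See [[What is a wiki?]] and [[Why?|why]]?"));
    }
}
```

Question marks outside of link targets are left untouched, so ordinary prose is not altered.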

Comments

Import from Wikipedia

Hi! I have implemented some XML-import functionality. Now I'm testing it with files exported from Ukrainian Wikipedia. I have found some issues which might be interesting for you.

  • Topics with names containing "?" can't be loaded. Why did you decide to add "?" to the INVALID_TOPIC_NAME_PATTERN?
  • Importing a topic that links to a topic whose name contains "?" (even if the linked topic does not exist) causes the process to hang.
  • The first letter of categories and templates is not automatically capitalized. Category:test is not the same as Category:Test. --Gutsul 23-Nov-2006 01:43 PST
Question marks have been problematic due to their usage with HTML query parameters, but if they are needed for Mediawiki compatibility then I can look into it. As to capitalization of categories, I didn't realize Mediawiki treated "Category:Test" and "Category:test" the same,
Not only that, it treats "category:test" and "Category:test" the same, and jamwiki doesn't as of 0.6.3. Unfortunately that doesn't hold true in all versions of mediawiki when a user creates their own namespace by using colons in names, and it should. So it's problematic since mediawiki does this wrong.
so I can try to get that fixed for the next release. Let me know how the rest of your testing goes - I'm currently out of town but will try to get online at least once a day. -- Ryan 23-Nov-2006 08:23 PST
Gutsul, I've moved your topic near the thread I started about the same subject, since I'm working on a similar thing. If you already have some code, can you please share it? I was planning to work on XML import from a Mediawiki dump; if you have already started on something we might work together. I first implemented an option to zip topics, which will help me keep the installation size smaller; next is the import itself. I will check my code into the SourceForge repository in my private branch (https://svn.sourceforge.net/svnroot/jamwiki/wiki/branches/ncsaba) in the next few days. Cheers, Csaba 23-Nov-2006 08:38 PST
I've also started a small XML Wikipedia import tool in my Java Wikipedia API tools project: http://plog4u.svn.sourceforge.net/viewvc/plog4u/info.bliki.wiki/src/info/bliki/wiki/dump/ It would be nice if we could generalize this tool, for example to create offline HTML (or PDF?) documents. -- Axel Kramer 23-Nov-2006 10:17 PST
First - I've added code (now running on jamwiki.org) to allow categories to be processed in a case-insensitive way, so "Category:Test" and "Category:test" should now be viewed the same.
Second - I'd also like to see a more general way of importing/exporting XML. Gutsul wrote to me to ask for Subversion write access, so hopefully his code will soon be in the repository and we can all take a look to figure out how best this feature should be implemented. -- Ryan 27-Nov-2006 16:28 PST
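For reference, the behavior discussed above (namespace prefix matched without regard to case, first letter of the page name capitalized) could look roughly like the following. This is a sketch and not the code that went onto jamwiki.org; `TopicNameNormalizer` and its namespace list are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not the code running on jamwiki.org.
final class TopicNameNormalizer {
    // Assumed namespace list; a real wiki would read this from its configuration.
    private static final List<String> NAMESPACES = Arrays.asList("Category", "Template", "User");

    /** Normalizes "category:test" to "Category:Test"; names without a known namespace are unchanged. */
    static String normalize(String topicName) {
        int colon = topicName.indexOf(':');
        if (colon > 0) {
            String prefix = topicName.substring(0, colon);
            for (String namespace : NAMESPACES) {
                if (namespace.equalsIgnoreCase(prefix)) {
                    return namespace + ":" + capitalizeFirst(topicName.substring(colon + 1));
                }
            }
        }
        return topicName;
    }

    private static String capitalizeFirst(String text) {
        if (text.isEmpty()) {
            return text;
        }
        // Work on code points so characters outside the BMP are not split.
        int end = text.offsetByCodePoints(0, 1);
        return text.substring(0, end).toUpperCase(Locale.ROOT) + text.substring(end);
    }

    public static void main(String[] args) {
        System.out.println(normalize("category:test"));
        System.out.println(normalize("User:wrh2"));
        System.out.println(normalize("Foo:bar"));
    }
}
```

Leaving unknown prefixes untouched avoids rewriting topic names that merely contain a colon, which is the user-defined namespace case raised earlier in this thread.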
Regarding issues with "?" in JAMWiki topic names - I tried removing the "?" from the disallowed character list, and it broke badly. The problem is that it's very difficult to determine whether a question mark is part of a topic name or the beginning of a query parameter. Given the choice between allowing question marks in topic names or disallowing them I'd be inclined to disallow them, but a strong argument could be made that since Mediawiki supports them JAMWiki should as well. At this point I'm not sure what the correct solution is, so if anyone feels strongly that JAMWiki should support topic names containing question marks please add your comments. -- Ryan 30-Nov-2006 00:58 PST
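The ambiguity described above can be seen with the standard `java.net.URI` parser. The sketch below uses made-up request paths (not actual JAMWiki URLs): an unescaped "?" always starts the query string, so the topic name loses its question mark unless the link was percent-encoded as `%3F` when it was generated.

```java
import java.net.URI;

// Illustration only; the paths used here are made-up examples.
final class QuestionMarkAmbiguity {
    /** Returns { decoded path, query } for a request target. */
    static String[] split(String requestTarget) {
        URI uri = URI.create(requestTarget);
        return new String[] { uri.getPath(), uri.getQuery() };
    }

    public static void main(String[] args) {
        // Unescaped: the parser cannot know the "?" was meant as part of the topic name.
        String[] plain = split("/wiki/en/Who?action=edit");
        System.out.println(plain[0] + " | " + plain[1]);

        // Percent-encoded: the "?" survives as part of the decoded path.
        String[] encoded = split("/wiki/en/Who%3F?action=edit");
        System.out.println(encoded[0] + " | " + encoded[1]);
    }
}
```

Encoding every generated link would be one way to support such names, but it does not help with URLs typed by hand or produced by other software, which is part of why disallowing the character is the simpler option.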

Status

Gutsul emailed me the XML import code last night, and it's non-obtrusive enough that I don't see any reason not to merge it into the trunk. I'd like to do a bit of cleanup first for Coding Style and to remove a DataHandler dependency, and it also needs to be updated to match changes made for 0.5.0, but once that's done I'll commit it and would appreciate feedback from others. -- Ryan 29-Nov-2006 12:38 PST

The code is now integrated on the trunk. I've made some updates for coding style and re-arranged a few things, but for the most part it's the same code as Gutsul sent to me. I ran a couple of tests on my local machine to make sure the code works (it does), but I didn't look too closely into the XML processing code, and will leave it to others to provide any feedback for that. A couple of issues I noticed:
  • The current code does not import version history, and it would be nice to have that capability.
  • The current code does not import author history. I'm not sure how to handle that problem, but due to attribution requirements with the GFDL and CC-SA licenses it probably needs to be addressed.
  • Axel and others have expressed interest in more generic XML support, so I'll leave it to those who are interested to work out a more flexible framework.
Since the code is now in Subversion anyone can view and modify it, so please do so and add any comments or feedback here. -- Ryan 29-Nov-2006 15:26 PST

Issues

While importing data from a custom knowledge-database I ran into some issues. These issues are caused by valid XML but invalid data:

  • Some entries had the same title as another entry, so the import failed with a java.lang.Exception: Error by import: java.lang.Exception: Failure while executing insert into jam_category ( child_topic_id, category_name, sort_key ) values ( ?, ?, ? ). While the duplicates could easily be removed from the XML, they were quite hard to find.
  • Although it is described on this page, I ran into the issue with question marks in titles. Perhaps the import could replace this special character?
  • The XML file contained several HTML links. This wasn't a problem during import, but the articles were damaged when viewed (perhaps the spam blacklist or the allow-HTML setting is the cause).

Apart from these small problems, this is a great feature! Thanks for sharing!! --hp 24-Nov-2008 00:43 PST

Thanks for the feedback! I've honestly not used this feature at all, but the original author of this code and others have had success with it. During the 0.8.x release cycle I'm hoping to get to the point where JAMWiki can import a full Mediawiki instance and also export XML that Mediawiki can then import, but that remains several months into the future. In the meantime, if you have an example of the HTML links that failed let me know and I can try to investigate to see what's going on. -- Ryan 24-Nov-2008 08:13 PST
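Since duplicate titles were reported above as hard to find, a pre-check over the import file might help. The following is a sketch using the standard StAX API; it assumes the Mediawiki export layout where each page carries a `<title>` element, and `DuplicateTitleCheck` is a hypothetical class, not part of JAMWiki.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical pre-import check for illustration; not part of JAMWiki.
final class DuplicateTitleCheck {
    /** Returns every title that occurs more than once, in order of first repetition. */
    static Set<String> duplicateTitles(String xml) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // DTDs are not needed for this check, so they are switched off.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Assumption: every <title> element in the file names a page.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    String title = reader.getElementText().trim();
                    if (!seen.add(title)) {
                        duplicates.add(title);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Import file is not well-formed XML", e);
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki>"
                + "<page><title>Alpha</title></page>"
                + "<page><title>Beta</title></page>"
                + "<page><title>Alpha</title></page>"
                + "</mediawiki>";
        System.out.println(duplicateTitles(xml));
    }
}
```

Reporting the duplicates before any database insert would turn the jam_category failure into an actionable message. For large dumps the same loop could read from a file stream instead of a string.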