Tech:Parser Improvements

This page (and all pages in the Tech: namespace) is a developer discussion about a feature that is either proposed for inclusion in JAMWiki or one that has already been implemented. This page is NOT documentation of JAMWiki functionality - for a list of documentation, see Category:JAMWiki.
Status of this feature: IMPLEMENTED. Parser improvements have been implemented in the majority of JAMWiki releases, although the specific issues identified below were handled in versions prior to JAMWiki 1.0.

Description

Beginning with JAMWiki 0.6.5 the parser was modified to make unit testing significantly more robust. As a result it is much easier to validate parser output against Mediawiki, and that validation has revealed some disturbingly poor handling in areas such as unbalanced tags (example: <b><i>unbalanced</b></i>).

It should be a goal of JAMWiki to produce output that is as close to Mediawiki as possible. This article will be used as an ongoing tool to track enhancements and enhancement proposals for the JAMWiki JFlex parser. NOTE: it is unlikely that every single parser change will be tracked here, but the hope is to be able to use this page to track significant changes when possible.

Author(s)

Status

JAMWiki 0.6.x

JAMWiki 0.6.5 began the work of improving the parser by treating parsed tags as a stack of tokens. This change allows significantly more flexibility, and should lead to the ability to handle unbalanced tags and other issues that Mediawiki currently deals with nicely.
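
The tag stack idea can be illustrated with a small sketch. This is not the JAMWiki JFlex parser: the class name TagStackSketch, the method balance, and the repair policy (close any tags opened after the one being closed, drop closing tags that have no matching open tag, close whatever remains open at the end) are all assumptions chosen for illustration. Attributes are ignored to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a tag stack; not JAMWiki code.
final class TagStackSketch {
    // Matches simple tags such as <b> or </i>; attributes are not handled here.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z]+)>");

    static String balance(String html) {
        Deque<String> open = new ArrayDeque<>();   // most recently opened tag on top
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());      // copy text before the tag
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {
                // Opening tag: remember it and emit it.
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // Closing tag: close everything opened after it, then the tag itself.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // Assumed policy: a closing tag with no matching open tag is dropped.
        }
        out.append(html.substring(pos));
        // Close anything still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

Under these assumptions, TagStackSketch.balance("<b><i>text</b></i>") returns <b><i>text</i></b>, and an input such as <b>open is closed automatically as <b>open</b>.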

Update 26-March-2008

Modifications over the past few days include the following:

  • The parser should now automatically close some open HTML tags. Handling of unbalanced tags (<b><i>text</b></i>) is still on the to-do list, and once that change is implemented it should resolve most of the remaining corner cases where a tag could be left open.
  • List handling has been converted to use the new tag stack infrastructure. In addition, unit testing of the list parsing code revealed some instances where JAMWiki produced output that differed from Mediawiki (notably in handling definition lists), and most of these differences have been corrected.
  • Output of newlines and tags should more closely match Mediawiki. This is not a functional change and will solely affect how the HTML source is formatted. For example:

<!-- old output -->
<ul><li>list item
</li></ul>

<!-- new output, matches Mediawiki -->
<ul>
<li>list item</li>
</ul>

  • The tag stack code has been modified to be a bit more robust. Many further improvements are still to come. The end goal is code that is more flexible, easier to read/understand, and hopefully much easier to maintain.

Update 5-April-2008

A long-standing issue with the JAMWiki parser should be resolved by revision 2146. This change means that JAMWiki can finally parse the following examples of bold / italic text correctly:


this is '''''bold''' and italic'' text

this is '''''italic'' and bold''' text

this is '''bold''''' followed by italic'' text

this is ''italic''''' followed by bold''' text
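
The four cases above can be handled with a small state machine over runs of apostrophes. The sketch below is loosely modeled on the general approach Mediawiki takes for quotes and is not the JAMWiki JFlex implementation; the class name QuoteSketch, the method render, and the state names are hypothetical. It assumes a single line of input and treats a run of four apostrophes, or more than five, as literal apostrophes followed by bold or bold-italic markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of bold / italic quote handling; not JAMWiki code.
final class QuoteSketch {
    private static final Pattern QUOTES = Pattern.compile("'{2,}");

    static String render(String line) {
        StringBuilder out = new StringBuilder();
        StringBuilder held = new StringBuilder(); // text held back after ''''' until nesting is known
        String state = "";                        // "", "i", "b", "bi", "ib" or "both"
        Matcher m = QUOTES.matcher(line);
        int pos = 0;
        while (m.find()) {
            int run = m.end() - m.start();
            int mark = (run == 4) ? 3 : Math.min(run, 5); // markup length: 2, 3 or 5
            // Text before the run, plus any extra apostrophes treated as literals.
            String text = line.substring(pos, m.end() - mark);
            pos = m.end();
            if (state.equals("both")) held.append(text); else out.append(text);
            state = apply(mark, state, out, held);
        }
        String rest = line.substring(pos);
        if (state.equals("both")) {
            // ''''' was never closed: emit bold-italic around the held text.
            return out.append("<b><i>").append(held).append(rest).append("</i></b>").toString();
        }
        out.append(rest);
        switch (state) {                          // close whatever is still open
            case "i":  out.append("</i>"); break;
            case "b":  out.append("</b>"); break;
            case "bi": out.append("</i></b>"); break;
            case "ib": out.append("</b></i>"); break;
            default:   break;
        }
        return out.toString();
    }

    // Emits markup for one apostrophe run and returns the new state.
    // "bi" means <b> was opened first (italic innermost); "ib" is the reverse.
    private static String apply(int mark, String state, StringBuilder out, StringBuilder held) {
        String h = held.toString();
        if (state.equals("both")) held.setLength(0);
        switch (mark + ":" + state) {
            case "2:":     out.append("<i>"); return "i";
            case "2:b":    out.append("<i>"); return "bi";
            case "2:i":    out.append("</i>"); return "";
            case "2:bi":   out.append("</i>"); return "b";
            case "2:ib":   out.append("</b></i><b>"); return "b";
            case "2:both": out.append("<b><i>").append(h).append("</i>"); return "b";
            case "3:":     out.append("<b>"); return "b";
            case "3:i":    out.append("<b>"); return "ib";
            case "3:b":    out.append("</b>"); return "";
            case "3:ib":   out.append("</b>"); return "i";
            case "3:bi":   out.append("</i></b><i>"); return "i";
            case "3:both": out.append("<i><b>").append(h).append("</b>"); return "i";
            case "5:":     return "both";
            case "5:b":    out.append("</b><i>"); return "i";
            case "5:i":    out.append("</i><b>"); return "b";
            case "5:bi":   out.append("</i></b>"); return "";
            case "5:ib":   out.append("</b></i>"); return "";
            default:       out.append("<i><b>").append(h).append("</b></i>"); return ""; // "5:both"
        }
    }
}
```

With this sketch, the first example renders as this is <i><b>bold</b> and italic</i> text and the second as this is <b><i>italic</i> and bold</b> text. The key design choice is deferring output after a run of five apostrophes until the next run shows whether bold or italic closes first.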

With all of the changes made to the parser thus far there have been some minor regressions - notably it seems that section edits are unnecessarily trimming newlines - but I haven't noticed any major issues. Remaining tasks include looking into whether paragraph parsing can be improved / simplified, adding better handling of unbalanced HTML tags, and addressing some of the reports of Mediawiki parsing differences that have been made on the Feedback page. Once that work is complete and enough unit tests have been created to verify that everything is working as expected the final 0.6.6 code should be ready for release. -- Ryan 05-Apr-2008 12:47 PDT

Update 12-April-2008

The current status of the parser changes is as follows:

  • Unbalanced tag parsing ("<b><i>text</b></i>") should now work.
  • Fixing some of the issues reported with templates is still on my to-do list. That work could potentially slip to the 0.6.7 release.
  • Paragraph parsing is broken in some cases right now - five of the eight unit tests fail, although the most common cases parse correctly. That issue will definitely need to be fixed prior to 0.6.6.
  • Newlines generated when parsing <pre> tags are slightly different from what Mediawiki generates - I'll try to reconcile this soon as I don't believe it will be tough to resolve.

The parser presently handles paragraphs during a "post-processor" parsing run, but I think that this parsing can be moved into the main "processor" run, which should eliminate some complexity and solve many problems. I don't know if that's something that can be implemented this weekend or whether it will take a couple of weeks, but that's the next item on my to-do list. -- Ryan 12-Apr-2008 18:04 PDT

Update 13-April-2008

I was up late last night and spent additional time on parser work today, so here's another update:

  • Paragraph handling should mostly be fixed. There is still one unit test failing, but it is a minor issue (corner case with empty lines) and JAMWiki 0.6.5 and earlier versions also would have failed to parse the wiki syntax in question correctly.
  • <pre> tag parsing now handles newlines in the same way as Mediawiki in most cases. There are a couple of unit tests failing, but the failures are cosmetic (newlines in HTML) and do not affect how the syntax will be rendered on a web page.

-- Ryan 13-Apr-2008 16:38 PDT

JAMWiki 0.9.x

Update 19-February-2010

There have been numerous parser enhancements made for the JAMWiki 0.9.x release cycle:

  • Performance optimizations. An unnecessary parsing pass has been removed, improving performance by approximately 10-15%.
  • Attribute validation. The parser now checks HTML attributes against the XHTML spec and drops invalid ones. Validation is fairly loose and will clean up incorrectly written attributes such as <div align=left>, <div ALIGN='left'>, etc. This change also improves support for Javascript event handlers for wikis that enable Javascript. Previously JAMWiki simply verified that an attribute was on a list of valid attributes, thus missing some valid attributes and allowing attributes that might not be valid for a specific element (such as <div colspan="3">).
  • Support for the {{#if:}} parser function, and limited support for the {{#expr:}} parser function. Only the +, -, * and / operations are supported, and parentheses must be used to ensure that the order of operations is followed (i.e. {{#expr: 4 + (2 * 3)}} (result: 10) instead of {{#expr: 4 + 2 * 3}} (result: 18)).
  • Added support for the abbr, col, colgroup, ins, del, thead, tbody and tfoot tags.
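
The {{#expr:}} result of 18 for 4 + 2 * 3 is consistent with an evaluator that applies operators strictly left to right, with no operator precedence, so that only parentheses can change the grouping. The sketch below illustrates that behavior under this assumption; the class name ExprSketch and the method eval are hypothetical and this is not the JAMWiki implementation.

```java
// Hypothetical illustration of left-to-right expression evaluation; not JAMWiki code.
final class ExprSketch {
    private final String src;
    private int pos;

    private ExprSketch(String src) {
        this.src = src.replace(" ", ""); // ignore spaces
    }

    static double eval(String expr) {
        ExprSketch p = new ExprSketch(expr);
        double value = p.sequence();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return value;
    }

    // operand (operator operand)*, applied strictly left to right, no precedence.
    private double sequence() {
        double value = operand();
        while (pos < src.length() && "+-*/".indexOf(src.charAt(pos)) >= 0) {
            char op = src.charAt(pos++);
            double rhs = operand();
            switch (op) {
                case '+': value += rhs; break;
                case '-': value -= rhs; break;
                case '*': value *= rhs; break;
                default:  value /= rhs; break; // '/'
            }
        }
        return value;
    }

    // A number, or a parenthesized sub-expression evaluated on its own.
    private double operand() {
        if (pos < src.length() && src.charAt(pos) == '(') {
            pos++;
            double value = sequence();
            if (pos >= src.length() || src.charAt(pos) != ')') {
                throw new IllegalArgumentException("Missing closing parenthesis");
            }
            pos++;
            return value;
        }
        int start = pos;
        while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Number expected at position " + pos);
        }
        return Double.parseDouble(src.substring(start, pos));
    }
}
```

Here ExprSketch.eval("4 + 2 * 3") computes (4 + 2) * 3 = 18.0, while ExprSketch.eval("4 + (2 * 3)") computes 4 + 6 = 10.0.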

I would also like to add support for {{subst:}} which should be very easy to do. -- Ryan • (comments) • 19-Feb-2010 08:52 PST

Revision 2884 adds support for {{subst:}} -- Ryan • (comments) • 20-Feb-2010 17:27 PST
Support has also been added for the {{#ifeq:}} parser function in revision 2905. -- Ryan • (comments) • 01-Mar-2010 23:04 PST


Comments

I suspect that parser updates will be a major focus area of mine during the 0.6.6 release cycle, and if the improvements are significant enough they could warrant bumping the next version to 0.7.0. These changes should also be good for outside developers who want to use the JAMWiki parser engine, as the engine should be made more robust. -- Ryan 23-Mar-2008 10:50 PDT