[erlang-bugs] Two xmerl_xpath predicate handling bugs

Sun Aug 17 10:26:20 CEST 2008

Summary:

  1. "//c[1]" against "<a><b><c/></b><b><c/></b></a>" should return
     both 'c' elements; xmerl_xpath returns the first one.

  2. "/a/b[@e='f'][position()=1]" against "<a><b c='d'/><b
     e='f'/></a>" should return the second 'b' element; xmerl_xpath
     returns an empty set.
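
Both cases can be reproduced with a few lines of Erlang. This is only
a sketch: the module name xpath_repro and the helper query/2 are made
up for illustration, while xmerl_scan:string/1 and
xmerl_xpath:string/2 are the stock xmerl entry points.

```erlang
-module(xpath_repro).
-export([run/0]).

%% Parse an XML string and evaluate an XPath expression against it.
query(XPath, Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    xmerl_xpath:string(XPath, Doc).

%% Returns the number of nodes matched by each of the two expressions.
run() ->
    %% Case 1: per the spec, both 'c' elements should match.
    R1 = query("//c[1]", "<a><b><c/></b><b><c/></b></a>"),
    %% Case 2: per the spec, the second 'b' element should match.
    R2 = query("/a/b[@e='f'][position()=1]",
               "<a><b c='d'/><b e='f'/></a>"),
    {length(R1), length(R2)}.
```

A conforming implementation would give {2,1} from xpath_repro:run();
the behaviour reported here corresponds to {1,0}.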

For the first bug, the XPath spec states:

    NOTE: The location path //para[1] does *not* mean the same as the
    location path /descendant::para[1]. The latter selects the first
    descendant para element; the former selects all descendant para
    elements that are the first para children of their parents.

Accordingly, "//c[1]" against "<a><b><c/></b><b><c/></b></a>" should
match both 'c' elements, but xmerl_xpath only returns the first.

This is because when xmerl_xpath:path_expr applies the child::c
axis/node-test selection to its context nodeset of all nodes,
xmerl_xpath:axis merges the result nodesets before path_expr calls
pred_expr, so the [1] predicate is applied to the merged nodeset
rather than to each individual nodeset.
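
The difference between the two orders of evaluation can be modelled
with plain lists. This is a toy model, not xmerl code: the module
pred_model and both function names are invented, and atoms stand in
for real element records.

```erlang
-module(pred_model).
-export([per_context/2, merged/2]).

%% NodeSets is a list of lists: one child::c result per context node,
%% in document order.

%% What the spec asks for: apply [N] to each nodeset separately.
per_context(N, NodeSets) ->
    [lists:nth(N, Set) || Set <- NodeSets, length(Set) >= N].

%% What merging first amounts to: apply [N] once to the merged nodeset.
merged(N, NodeSets) ->
    All = lists:append(NodeSets),
    case length(All) >= N of
        true  -> [lists:nth(N, All)];
        false -> []
    end.
```

For the document above, the context nodes (document root, 'a', first
'b', its 'c', second 'b', its 'c') give the nodesets
[[], [], [c1], [], [c2], []]. per_context(1, ...) keeps both c1 and
c2, while merged(1, ...) keeps only c1.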

For the second bug, the spec states:

    child::para[attribute::type='warning'][position()=5] selects the
    fifth para child of the context node that has a type attribute
    with value warning

Accordingly, "/a/b[@e='f'][position()=1]" against "<a><b c='d'/><b
e='f'/></a>" should return the second 'b' element, because it is the
first 'b' child of the context node (the root 'a' element) whose 'e'
attribute has the value 'f'; but xmerl_xpath returns an empty set.

This is because xmerl_xpath numbers the node positions only once after
applying the axis/node tests, so "position()" still evaluates to 2 in
xmerl_xpath_pred.  (Note that "/a/b[@e='f'][1]" still works correctly
because xmerl_xpath includes a short-circuit to not depend on
#xmlNode.pos.)
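
The effect of numbering once versus renumbering after each predicate
can also be modelled with plain lists. Again this is a toy model with
invented names (pos_model, renumbered/1, numbered_once/1), where a
node is just a {Name, Attrs} tuple.

```erlang
-module(pos_model).
-export([renumbered/1, numbered_once/1]).

%% True if the node carries e="f". Attrs is a proplist.
has_e_f({_Name, Attrs}) ->
    proplists:get_value(e, Attrs) =:= "f".

%% Pair each node with its 1-based position in the list.
number(Nodes) ->
    lists:zip(lists:seq(1, length(Nodes)), Nodes).

%% Spec behaviour: filter on [@e='f'], renumber the survivors,
%% then test position()=1.
renumbered(Children) ->
    Kept = [N || N <- Children, has_e_f(N)],
    [N || {Pos, N} <- number(Kept), Pos =:= 1].

%% Behaviour described above: positions are assigned once, before
%% any predicate runs, so the surviving node keeps position 2.
numbered_once(Children) ->
    [N || {Pos, N} <- number(Children), has_e_f(N), Pos =:= 1].
```

With Children = [{b,[{c,"d"}]}, {b,[{e,"f"}]}], renumbered/1 returns
the second node and numbered_once/1 returns the empty list.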

As a third bug, you might also argue that "//b/following::b" against
"<a></a>" should return the final 'b' element only once,
because a union of sets contains no duplicates, and node-sets as
defined for expression contexts contain none either.  Still, as long
as the caller runs lists:usort on the result, the only other negative
consequence is worse performance in contrived test cases.
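
The lists:usort workaround could be wrapped like this. It is a
sketch: the module name xpath_unique and unique/2 are invented, and
lists:usort orders nodes by Erlang term order, which I have not
checked against document order.

```erlang
-module(xpath_unique).
-export([unique/2]).

%% Evaluate XPath against the XML string and drop duplicate nodes.
%% Duplicates are identical records, so lists:usort/1 removes them.
unique(XPath, Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    lists:usort(xmerl_xpath:string(XPath, Doc)).
```

For example, xpath_unique:unique("//b/following::b",
"<a><b/><b/><b/></a>") would return each following 'b' at most once.
The three-'b' document is my own stand-in for illustration, not the
test document from this report.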

Also beware I'm unlikely to submit a patch for the above issues any
time soon.  (At this point, I'm more tempted to write my own XPath
implementation that uses lazy axis walking, but I've already spent
far, far more time on XPath than I really should have...)