Monday, March 10, 2008

Dom4j and XPath

In this, the latest installment of obscure gotchas in Java development, I'm going to discuss an interesting behavior of Dom4j that is well worth watching out for: when you pass an XPath expression starting with '/' or '//' to the instance methods Node.selectNodes or Node.selectSingleNode, the search does not start at that node! In fact, it always starts at the root of the document, contrary to what one might expect from looking at the code or the API.

Allow me to illustrate with an example. Let's say you are working with this simplified XML document:

<Account>
    <Owner>
        <ContactInfo>
            <Name>Tom Jones</Name>
            ...
        </ContactInfo>
        ...
    </Owner>
    <Cosigner>
        <ContactInfo>
            <Name>Jim Johnson</Name>
            ...
        </ContactInfo>
        ...
    </Cosigner>
</Account>

Dom4j makes it easy to find the Cosigner node:
Node cosignerNode = document.selectSingleNode("/Account/Cosigner");

and at first glance, I thought the following code snippet would return the Cosigner's name:
cosignerNode.selectSingleNode("//ContactInfo/Name") => "Tom Jones"

Counterintuitively, this code finds the Owner's name, 'Tom Jones'. This is because an XPath expression that starts with '/' or '//' is absolute: Dom4j goes back up to the document root to begin its search, no matter which node you call the method on. With the leading slashes removed, the expression is relative to cosignerNode and works as expected:
cosignerNode.selectSingleNode("ContactInfo/Name") => "Jim Johnson"
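
The behavior is easy to reproduce outside Dom4j as well. The sketch below is a stand-in rather than the original code, and it rests on a few assumptions: it uses the JDK's built-in javax.xml.xpath API instead of Dom4j, it trims the sample document down to the two Name elements, and it introduces a hypothetical helper, fromCosigner, that evaluates an expression with the Cosigner element as the context node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathContextDemo {
    // Trimmed-down version of the sample document (the "..." parts are left out).
    static final String XML =
        "<Account>"
      + "<Owner><ContactInfo><Name>Tom Jones</Name></ContactInfo></Owner>"
      + "<Cosigner><ContactInfo><Name>Jim Johnson</Name></ContactInfo></Cosigner>"
      + "</Account>";

    // Hypothetical helper: evaluates expr with the Cosigner element as the
    // context node and returns the text of the first match, or null if none.
    static String fromCosigner(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node cosigner = (Node) xpath.evaluate("/Account/Cosigner", doc, XPathConstants.NODE);
            Node match = (Node) xpath.evaluate(expr, cosigner, XPathConstants.NODE);
            return match == null ? null : match.getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Absolute expression: searched from the document root, so the Owner matches first.
        System.out.println(fromCosigner("//ContactInfo/Name")); // prints "Tom Jones"
        // Relative expression: searched from the Cosigner element.
        System.out.println(fromCosigner("ContactInfo/Name"));   // prints "Jim Johnson"
    }
}
```

Since the JDK's XPath engine gives the same two answers, the rule appears to come from XPath itself rather than being a quirk specific to Dom4j.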

So in conclusion, be careful whenever you use selectSingleNode or selectNodes: an XPath expression that starts with '/' or '//' is absolute, not relative, so the result will come from a search of the entire document and will not be limited to children of the node on which you invoke it.
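
If you do want a '//'-style search that stays inside the node, standard XPath offers './/' and the explicit descendant axis. Here is a minimal sketch, assuming the JDK's javax.xml.xpath API rather than Dom4j and a trimmed-down copy of the sample document; countFromCosigner is a hypothetical helper that counts the matches found from the Cosigner element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ScopedXPathDemo {
    // Trimmed-down version of the sample document (the "..." parts are left out).
    static final String XML =
        "<Account>"
      + "<Owner><ContactInfo><Name>Tom Jones</Name></ContactInfo></Owner>"
      + "<Cosigner><ContactInfo><Name>Jim Johnson</Name></ContactInfo></Cosigner>"
      + "</Account>";

    // Hypothetical helper: counts the nodes that expr selects when the
    // Cosigner element is the context node.
    static int countFromCosigner(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node cosigner = (Node) xpath.evaluate("/Account/Cosigner", doc, XPathConstants.NODE);
            NodeList matches = (NodeList) xpath.evaluate(expr, cosigner, XPathConstants.NODESET);
            return matches.getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countFromCosigner("//Name"));           // prints 2: whole document
        System.out.println(countFromCosigner(".//Name"));          // prints 1: Cosigner subtree only
        System.out.println(countFromCosigner("descendant::Name")); // prints 1: Cosigner subtree only
    }
}
```

The leading '.' anchors the expression at the context node, so only the Cosigner's own Name is found.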

5 comments:

Mihai Campean said...

This is dead on! I encountered the same issue with dom4j and I got so pissed off because I didn't know how to use the XPath expressions to select some info from certain nodes. Thanks Jared!

Tim said...

Thanks, I ran into the same issue when dealing with dom4j until I found your page. Quite counter-intuitive, I assumed that "/" or "//" would operate using the selected node as the root.

sjgibbs said...

The convention of "/" as the root came from UNIX so has quite a heritage. Relative URLs work the same way for the same reason.

"//" is new to XPath, but the leading "/" is unambiguously "root" followed by an implicit use of (I suspect) the descendant-or-self axis. You can use a similar axis explicitly to get the behavior you expected from "//":

cosignerNode.selectSingleNode("descendant::Name")

Note that XPaths starting with // are hideously slow since they imply a search of the *entire* tree - right down to every leaf node. Using axes relative to the current node means your code runs a *lot* faster, since you exclude more of the tree from the search.

See also:

http://www.w3schools.com/xpath/xpath_axes.asp

dontcare said...

Have you looked at VTD-XML? It is 10x faster and 5x more memory efficient than DOM4J or JDOM

http://vtd-xml.sf.net

Mike Talbutt said...

Thanks - that's been doing my head in all day!