Indexing XML

It is often important to index XPath queries. Not only is XPath useful on its own, but it is also the basis for the FLWOR expressions in XQuery.

A typical XPath expression will select only a small fraction of any XML document (such as the value of a particular attribute). Thus, a sensible strategy is to represent the XML documents as tables. There are several possible maps from XML documents to tables. One of the most common is ORDPATH.

In the ORDPATH model, the root node receives the identifier 1, the first node contained in the root node receives the identifier 1.1, the second one receives the identifier 1.2, and so on. Given the ORDPATH identifiers, we can easily determine whether two nodes are neighbors, or whether they have a child-parent relationship.

As an example, here’s an XML document and its (simplified) ORDPATH representation:


<liste temps="janvier" >
<bateau />
<bateau >
<canard />
</bateau>
</liste>

ORDPATH name type value
1 liste element
1.1 temps attribute janvier
1.2 bateau element
1.3 bateau element
1.3.1 canard element

Given a table, we can easily index it using standard indexes such as B trees or hash tables. For example, if we index the value column, we can quickly process the XPath expression @temps=”janvier”.

Effectively, we can map XPath and XQuery queries into SQL. This leaves relatively little room for XML-specific indexes. I am certain that XML database designers have even smarter strategies, but do they work significantly better?

Reference: P. O’Neil, et al.. ORDPATHs: insert-friendly XML node labels. 2004.

Further reading: Native XML databases: have they taken the world over yet?

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

4 thoughts on “Indexing XML”

  1. Thanks Daniel. I’m interested in anything more you learn on this topic, so please follow up if and when you do.

    I think there’s a typo with “”: it should probably be “”.

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.