David Silverlight

Subscribe to David Silverlight: eMailAlertsEmail Alerts
Get David Silverlight: homepageHomepage mobileMobile rssRSS facebookFacebook twitterTwitter linkedinLinkedIn


Related Topics: XML Magazine

XML: Article

Shedding a Little Light on XML

Shedding a Little Light on XML

Wouldn't it be great to have something named after you? Up to now I've had no luck. Off the top of my head, the only way for this to happen is to (1) have children, (2) invent a lifesaving technique like the Heimlich maneuver, (3) die of a painful disease (Lou Gehrig's disease, for example),...

...(4) name a star after yourself (the "XML Star" was still available the last time I checked), (5) invent a clever method of grouping XML data (aka the Muenchian Method).

Hopefully I will never qualify for option 3. Steve Muench, however, has managed to gain immortality by developing a very innovative way to group XML data.

Grouping XML data is a common programming dilemma, which will be resolved in this month's column using his technique, that is, the Muenchian Method. The Muenchian Method is not inherently obvious, primarily because it employs two functions, key() and generate-id(), neither of which is widely used. However, once they're demonstrated as part of a solution, their usefulness will stand out.

Black Belt
What Is the Muenchian Method?

The Muenchian Method is a novel technique used for grouping XML data using the XSLT key() function. The technique offers a clever solution to the questions "How do I group unsorted XML data efficiently?" and "How do I find the unique values in a set of values?"

While the method is used for grouping data, it's also used to find the unique values in a set of values. It has widespread applications beyond grouping. For example, you might want to count the number of states represented by a list of salesmen's territories. Though it may not sound like a grouping problem (in fact, it isn't a grouping problem), you'd still use the key tables in the Muenchian Method to obtain the answer.

To further explain this technique, I'll use a sample XML document containing unsorted product information. The output generated will produce a list of products grouped by region, then by product name. Both groupings will be sorted.

Overview: A 50,000-Ft View of the Algorithm
Two loops will be used to process the product information.

The outer loop will iterate through all unique "regions" in the XML document. Each iteration of the loop will display the region name, sorted alphabetically, as the header for the HTML table that's generated. At this point the key() and generate-id() functions will be integral in generating a distinct set of regions.

The inner loop will iterate through all products for the current region and will display product information alphabetically. Here, the key() function will be integral in allowing us to quickly access the products for a given region.

Breakdown
Step 1: Define the primary key to be used in the Muenchian grouping.

Defining the primary key is the first step in Muenchian grouping. Using the xsl:key element sets up the groundwork for later accessibility to all "products" for a given "region" via a key table. At this point we're essentially defining how products for a region will be retrieved later on using the key() function, but no work is actually being done. All that's accomplished here is that the XSLT processor is learning how to populate the key table so that later on in the process the key() function can be used to retrieve the elements.

The xsl:key element is defined as follows:

<xsl:key name="products" match="product" use="region"/>

There are three parts to the xsl:key element:

  1. Name: The name that will be used to refer to this key table.
  2. Match: The pattern that refers to the nodes that populate the key table. In this example "product" nodes are those that can be accessed using this key.
  3. Use: This "use" expression defines the value associated with the node used in the key table. In this example the "product" nodes will be accessed using the "region" as the key. This will allow us to retrieve all of the "products" for a specific "region."
Step 2: Loop through the unique regions (the primary key) in the document.

To fully appreciate what happens in Step 2, it's important to understand how the generate-id() function works: generate-id(nodeset) generates a unique identifier for the nodeset that's passed to it. It requires a nodeset as a parameter, but will return only the identifier for the first node (in document order) of the nodeset.

Note: It might appear on the surface that repeated calls to this function would produce a new generate-id() value each time they were called, thus causing the algorithm to fail. However, the same node will always return the same identifier during a single process. In a sense it behaves more like a "return-id" function than a "generate-id" function.

It's important to remember that values returned from generate-id() must be used blindly. There's no rhyme or reason for the values generated other than that they be lexically valid as name tokens in XML. The processor is required to return the same ID for a given node every time it's requested in the running of a single process, but the processor isn't obliged to return the same value the next time around, even with the same data.

In some of the examples the values returned from the generate-id() function are displayed as part of the output. This is shown strictly to give a demonstration of what's happening under the covers. Please note that the generate-id() values should never be "interpreted" or used for analysis and/or comparison. You can't rely on the pattern generated since it will change from execution to execution, even if the data hasn't changed. The values are interesting to look at, however.

On to Step 2...
The end result of the xsl:for-each element below will be a return of product elements for the value of the region that's passed to it. More specifically, it's the result of examining all product elements and filtering out those whose "region" child values are duplicates, then filtering in the first element in document order with each unique value for a "region" child.

<xsl:for-each select="product
[generate-id(.)=generate-id
(key('products',region))]">
The logic behind this statement isn't inherently obvious, but a further breakdown should clarify it. Essentially, the expression will return the elements from which the IDs from the product elements "generate-id(.)" match the IDs returned from the indexed key lookup function "generate-id(key('products',region))".

The goal in this equality test is uniqueness: find "only one" of each value, then act on the found ones. The technique accomplishes this requirement by using the key table to find "the first of each in document order" of each value. This works because (1) the key() function returns the set of nodes of the given value (in document order), and (2) the generate-id() function works on the first node (in document order) when given more than one node.

By combining (1) and (2) above, the result is the first node of each key table value in document order, thus giving us the first node representing each unique value.

The actual IDs generated from the two sets are given in Figure 1, which shows the IDs that correspond to both. Take a moment to examine this figure. It's important to observe that each call to generate-id(key('products',region)) will return a single value. However, when visually scanning the entire list of values displayed in Figure 1, it becomes apparent that there are only two unique IDs in the list (having values of "IDANHZDB" and "IDATGZDB"). On the other hand, the generate-id(.) expression will generate a unique ID for each product.

When visually scanning this column in Figure 1, we see that each element returns a unique value. When evaluating these two expressions in the equality test, only two elements are returned, producing only the unique "regions" from the XML document. Take a moment to look at the values in Listing 1 and Figure 1 to see what actual IDs are generated using these predicates. The downloadable source code for this article also contains this example and the code for Listing 2 (www.sys-con.com/xml).

Step 3: Display the products for each region.

As each region is iterated through, the key() function comes into play once again. This time it's used as part of the "for-each" loop to iterate through the product elements for the given region. Rather than looping through the document, the key function uses the current region value to return a nodeset of all the corresponding product elements from the named key table. Though it's up to the processor to implement the key() function quickly, it's assumed the processor will obtain values from the key table much faster than revisiting the source node tree.

<xsl:for-each select ="key('products', region)">

The expression "key('products',region)" will return all of the "product" elements from the key table whose "use=" expression defined in xsl:key (see Step 1) evaluated to the same value as the "region" child of the current element. In the example we specified use="region". If region has a value of "SouthWest", then all the product elements from the key table that contain a child element with a value of "SouthWest" will be returned.

Q: What are some of the cautions regarding the Muenchian Method?
A:
The Muenchian Method requires an XML processor that supports keys, which not all parsers do. Although MSXML 3.0, Xalan, and Saxon support them, XT does not at the time of this writing. Alternative techniques that don't use the key() function do exist, but they're inherently slower based on the assumption that a processor will optimize the access to key tables (see Elements of Design section).

Keys can be memory intensive in their own way as a result of how they index document information. Using keys requires the information to remain in memory, thus demanding extra overhead for support.

The complete code for the Muenchian Method example can be seen in Listing 2. Refer to Figure 2 to see the output generated.

Elements of Design
A World Without Muench

This month's Elements of Design highlights other methods of grouping. The two methods showcased here were submitted by developers in the field. Each demonstrates a variation on the original message, has its own pros and cons, and simply offers alternative techniques for grouping XML data.

Grouping Without the Use of the key() function
(G. Ken Holman)
As mentioned in the preceding question, not all XSLT processors support the key() function. This solution becomes especially useful in scenarios in which the key() function isn't an option. Following are a few comments explaining the differences between Ken's method and the Muenchian Method, but to fully appreciate it, download the source code for the full example.

The essence of using the variable instead is very similar to a key table. In this solution all products are put into the variable (which itself is in document order):

<xsl:variable name="products" select="//product"/>

Visit every product (which essentially happens when using a predicate à la Muench) in sorted order, but act only on the first product in document order (again using generate-id(), but this time with the variable instead of the lookup table):

<xsl:for-each select="$products">
<xsl:sort select="name" order="ascending"/>
<xsl:variable name="region" select="region"/>

<xsl:if test="generate-id(.)=generate-
id($products[region=$region])">

The rest is similar, except to grab the products of the same region from the variable instead of the key table:

<xsl:for-each select="$products[region=$region]">

Grouping Using the Distinct Template
(EXSLT.ORG)
This solution varies from both Ken's approach and the Muenchian Method because it doesn't use key() or generate-id(). Instead, it uses the set:distinct template defined by the community initiative for commonly required extensions not available in XSLT.

This particular template can be used to return the distinct "region" elements instead of implementing the generate-id() function to return "product" elements with distinct "region" children. The focus changes slightly since we're obliged to deal with nodes being distinguished.

* * *

A working example of the solutions in this article can be downloaded from www.sys-con.com/xml.

More Stories By David Silverlight

David Silverlight is the chief XML evangelist for Infoteria. He has
been working in the trenches for a number of years as a software
architect and consultant, specializing in database-driven Web
applications. He also maintains www.xmlpitstop.com, a resource for
XML examples, resources, and everything else XML.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.