XSLT 2.0: Regular Expression Functions and Instructions


One of the weaknesses in XSLT 1.0 was its very limited set of string manipulation features. Compared with many popular programming languages, its string functions lacked one very powerful feature: regular expressions. Intel SOA Expressway actually offers this functionality with extension functions for our customer base. In XSLT 2.0, the XSLT working group plugged this hole for everyone in a couple of ways that we’ll look at in this post.

Regular expressions are a very flexible way to match strings or portions of strings based on patterns. These patterns are specified by a special syntax and are unrelated to the patterns used to match nodes in XSLT templates. See the Regular Expression Syntax (http://www.w3.org/TR/xpath-functions/#regex-syntax) section of the XQuery 1.0 and XPath 2.0 Functions and Operators specification for the details of the syntax.

Because regular expressions are so popular, I’ll assume some basic knowledge that’s easy to pick up from the web for the examples in this post. Still, it’s worth pointing out that the regular expression syntax supports some advanced features. There are “greedy” quantifiers (* and +) to match the longest substring possible and “reluctant” versions (*? and +?) to match the shortest substring. Enclosing a portion of the regular expression with parentheses, for example “(123)”, specifies a captured subgroup that can be referenced by number. Finally, as you might expect for a language targeted at processing XML, there are some special matching specifications for Unicode and XML name characters.
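
To make the difference between greedy and reluctant quantifiers concrete, here is a minimal sketch written as a simplified stylesheet, so it can be run against any input document. The sample string and the element names are made up for illustration.

```xml
<demo xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Greedy: .* runs to the last "]", so the whole string is one match. -->
  <greedy><xsl:value-of select="replace('[a] and [b]', '\[.*\]', 'X')"/></greedy>
  <!-- Reluctant: .*? stops at the first "]", so each bracketed part is its own match. -->
  <reluctant><xsl:value-of select="replace('[a] and [b]', '\[.*?\]', 'X')"/></reluctant>
</demo>
```

The greedy element should contain “X” and the reluctant element “X and X”.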

Often it’s enough simply to match a regular expression when deciding whether and how to process a node or some text. The matches(string, regex) XPath function takes a string, attempts to match the regular expression against it, and returns true if it finds a match anywhere in the string. This function is handy in predicates and in <xsl:choose> and <xsl:if>, where a Boolean result is needed to decide what to process.
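
As a rough sketch of matches() inside <xsl:choose>, the simplified stylesheet below classifies a few strings. The identifier format and the element names are hypothetical. Because matches() succeeds when any part of the string matches, the pattern is anchored with ^ and $ to test the whole string.

```xml
<ids xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Hypothetical sample values; inside for-each, "." is the current string. -->
  <xsl:for-each select="('ABC-1234', 'abc-1234', 'ABC-12345')">
    <xsl:choose>
      <!-- Three capital letters, a hyphen, then exactly four digits. -->
      <xsl:when test="matches(., '^[A-Z]{3}-[0-9]{4}$')">
        <valid><xsl:value-of select="."/></valid>
      </xsl:when>
      <xsl:otherwise>
        <invalid><xsl:value-of select="."/></invalid>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each>
</ids>
```

Only the first value should come out as valid. Without the anchors, the third value would also pass, since it contains a substring that matches.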

When converting formats, sometimes it’s useful to change parts of a string that can be found with a regular expression. The replace(string, regex, replacement) XPath function returns a new string formed by substituting the replacement string for every match of the regular expression. The replacement string itself can refer to portions of the matched substring with numbered variables, such as $1 for the first captured group and $0 for the entire matched substring.
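
Here is a small sketch of replace() with captured groups, again as a simplified stylesheet that ignores its input. The date format and the element names are assumptions chosen only for illustration.

```xml
<dates xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Reorder a hypothetical MM/DD/YYYY date into YYYY-MM-DD using $1, $2 and $3. -->
  <iso><xsl:value-of select="replace('10/23/2009', '([0-9]{2})/([0-9]{2})/([0-9]{4})', '$3-$1-$2')"/></iso>
  <!-- $0 is the whole match: wrap every run of digits in square brackets. -->
  <marked><xsl:value-of select="replace('Windows 7 and 2009', '[0-9]+', '[$0]')"/></marked>
</dates>
```

The first element should contain “2009-10-23” and the second “Windows [7] and [2009]”. To put a literal dollar sign in the replacement string, write it as \$.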

Breaking a string into a sequence of smaller strings is done with the tokenize(string, regex) XPath function. Matches of the regular expression mark the boundaries between substrings (tokens) and are not included in the result. This function is useful for separating delimited strings, such as strings of comma-separated values. But because tokenize() takes a regular expression, the separator can be much more sophisticated than a single character.
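
The sketch below shows tokenize() with a separator that is more than a single character: a comma or semicolon plus any surrounding whitespace. The input string and the element names are invented for the example.

```xml
<tokens xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- The separator swallows optional whitespace on either side of "," or ";". -->
  <xsl:for-each select="tokenize('red, green;blue ,  yellow', '\s*[,;]\s*')">
    <token><xsl:value-of select="."/></token>
  </xsl:for-each>
</tokens>
```

This should yield four tokens: red, green, blue and yellow, with no stray whitespace left on any of them.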

To offer even more flexible regular expression processing than the XPath functions provide, XSLT 2.0 adds the <xsl:analyze-string> instruction. It breaks a string apart much like tokenize() while allowing the stylesheet to process both the matching and non-matching parts. Because it’s an instruction, it can also create nodes to structure the data, which the XPath functions can’t do.
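
Before the larger example, here is a minimal sketch of <xsl:analyze-string> handling both kinds of substring. It wraps hashtags in elements and copies the rest through as text. The input sentence and the tag element are hypothetical.

```xml
<para xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:analyze-string select="'Happy #TGIF and #Halloween to all'" regex="#(\w+)">
    <!-- Each match becomes an element; group 1 is the tag text without the "#". -->
    <xsl:matching-substring>
      <tag><xsl:value-of select="regex-group(1)"/></tag>
    </xsl:matching-substring>
    <!-- Text between matches is copied unchanged. -->
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</para>
```

Apart from the XML declaration, the result should be <para>Happy <tag>TGIF</tag> and <tag>Halloween</tag> to all</para>. If the <xsl:non-matching-substring> child is left out, the text between matches is simply dropped.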

Here’s an example using <xsl:analyze-string> to pull out the current top ten trends from Twitter and build a very simple HTML page. The Twitter API is a RESTful API, which means it returns results in response to an HTTP request on a resource. In this case, we can ask the search API with the URI http://search.twitter.com/trends.json to get the latest top ten trends, which are returned only in JSON format. An example of the returned trends is shown below.

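{"as_of":"Fri, 23 Oct 2009 16:15:38 +0000","trends":[
{"name":"Follow Friday","url":"http://search.twitter.com/search?q=%22Follow+Friday%22+OR+%22Its+Friday%22"},
{"name":"#iusuallylieabout","url":"http://search.twitter.com/search?q=%23iusuallylieabout"},
{"name":"Nick Griffin","url":"http://search.twitter.com/search?q=%22Nick+Griffin%22"},
{"name":"Cassetteboy","url":"http://search.twitter.com/search?q=Cassetteboy"},
{"name":"Halloween","url":"http://search.twitter.com/search?q=Halloween"},
{"name":"TGIF","url":"http://search.twitter.com/search?q=TGIF"},
{"name":"Windows 7","url":"http://search.twitter.com/search?q=%22Windows+7%22"},
{"name":"Paranormal Activity","url":"http://search.twitter.com/search?q=%22Paranormal+Activity%22"},
{"name":"Soupy Sales","url":"http://search.twitter.com/search?q=%22Soupy+Sales%22"},
{"name":"#mylasttweetonearth","url":"http://search.twitter.com/search?q=%23mylasttweetonearth"}]}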

JSON uses a recursive syntax to represent data objects, so regular expressions can’t parse it in the general case. In this case, though, the data we’re interested in is a flat list enclosed in the bracket characters “[” and “]”. We will process the "name":"value","url":"value" pairs in the list with a regular expression to build a very simple HTML page that lists the top ten trends as links to the Twitter search page for each topic.

The stylesheet below is in the form of a simplified stylesheet module in order to show a self-contained stylesheet.

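<html xsl:version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns="http://www.w3.org/1999/xhtml">
<head><title>Current Twitter Trends</title></head>
<body>
<h2>Current Top 10 Twitter Trends</h2>
<xsl:variable name="trends" select="unparsed-text('http://search.twitter.com/trends.json')"/>
<xsl:variable name="trendList" select="tokenize($trends, '\[|\]')[2]"/>
<ul>
<xsl:analyze-string select="$trendList" regex='\{{.*?:"(.*?)",.*?:"(.*?)"\}}'>
<xsl:matching-substring>
<li><a><xsl:attribute name="href" select="regex-group(2)"/><xsl:value-of select="regex-group(1)"/></a></li>
</xsl:matching-substring>
</xsl:analyze-string>
</ul>
</body></html>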

The variable $trends retrieves the current trends from the Twitter search API by calling the unparsed-text() function. We need to use this function because the trends are returned only in JSON format, which the XSLT processor can’t parse as XML. The $trendList variable uses tokenize() to split the JSON response at each “[” and “]”. The pieces are returned as a sequence of strings, and the “[2]” predicate selects the second one, which holds the top ten list.
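
The bracket-splitting step can be tried on its own with a short stand-in string. This is only a sketch: the sample text is invented, and the trick assumes the response contains exactly one pair of square brackets, so it would not survive nested arrays.

```xml
<pieces xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Split at every "[" or "]"; both must be escaped inside the regular expression. -->
  <xsl:variable name="pieces" select="tokenize('head[item1,item2]tail', '\[|\]')"/>
  <count><xsl:value-of select="count($pieces)"/></count>
  <!-- The second piece is whatever sat between the brackets. -->
  <second><xsl:value-of select="$pieces[2]"/></second>
</pieces>
```

The count should be 3 and the second piece “item1,item2”.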

Next, <xsl:analyze-string> uses the rather involved regular expression “\{{.*?:"(.*?)",.*?:"(.*?)"\}}” to find all the "name":"value","url":"value" list entries. The braces are doubled because the regex attribute is an attribute value template, so the processor sees “\{” and “\}”, which match literal braces. The name and URL values are captured in capturing groups with the enclosing quotes stripped off. In the <xsl:matching-substring> instruction, list items are built with links showing the trend name, captured in regular expression group 1, and pointing to the Twitter search page URL for that trend, captured in regular expression group 2.
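
One way to sidestep brace doubling, shown here as a sketch rather than as what the stylesheet above does, is to keep the pattern in a variable and hand it to the regex attribute through an attribute value template. The input string, the pattern, and the element and attribute names are all hypothetical.

```xml
<pairs xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- select is a plain XPath expression, so each brace is written once. -->
  <xsl:variable name="pair" select="'\{([a-z]+)=([0-9]+)\}'"/>
  <!-- regex is an attribute value template; {$pair} inserts the variable's value. -->
  <xsl:analyze-string select="'{a=1}{b=2}'" regex="{$pair}">
    <xsl:matching-substring>
      <pair key="{regex-group(1)}" value="{regex-group(2)}"/>
    </xsl:matching-substring>
  </xsl:analyze-string>
</pairs>
```

This should produce two pair elements, one with key “a” and value “1” and one with key “b” and value “2”.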

The output of the stylesheet for the trends JSON return value shown above is shown below.

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Current Twitter Trends</title>
</head>
<body>
<h2>Current Top 10 Twitter Trends</h2>
<ul>
<li><a href="http://search.twitter.com/search?q=%22Follow+Friday%22+OR+%22Its+Friday%22">Follow Friday</a></li>
<li><a href="http://search.twitter.com/search?q=%23iusuallylieabout">#iusuallylieabout</a></li>
<li><a href="http://search.twitter.com/search?q=%22Nick+Griffin%22">Nick Griffin</a></li>
<li><a href="http://search.twitter.com/search?q=Cassetteboy">Cassetteboy</a></li>
<li><a href="http://search.twitter.com/search?q=Halloween">Halloween</a></li>
<li><a href="http://search.twitter.com/search?q=TGIF">TGIF</a></li>
<li><a href="http://search.twitter.com/search?q=%22Windows+7%22">Windows 7</a></li>
<li><a href="http://search.twitter.com/search?q=%22Paranormal+Activity%22">Paranormal Activity</a></li>
<li><a href="http://search.twitter.com/search?q=%22Soupy+Sales%22">Soupy Sales</a></li>
<li><a href="http://search.twitter.com/search?q=%23mylasttweetonearth">#mylasttweetonearth</a></li>
</ul>
</body>
</html>

The addition of regular expressions is a big step forward in functionality for XSLT 2.0. Not only does it offer simple matching, it also provides very flexible string substitution and splitting capabilities. And last but certainly not least, the <xsl:analyze-string> instruction allows for very sophisticated processing of parts of a string, including turning unstructured strings into structured XML data. I’m curious: for those using XSLT 2.0, do you find yourself relying more on the XPath functions or on the XSLT instructions?



Russell Davoli (Intel):

@raja38: Much of the regular expression support is in XPath 2.0, so it's shared between XQuery and XSLT 2.0. The choice of XQuery vs XSLT more likely depends on the type of data source to process and personal preference of programming language style.

raja38:

Hi !!!
Thanks for this interesting post ,,, But now a days people using the regular expression in XQuery instead of XSLT... What u think about this ...l ike which one is more sophasticated for regular expressions ...

Rajamani marimuthu

Russell Davoli (Intel):

@contract management: thanks for the kind words! I'll keep trying to keep it interesting.

Russell Davoli (Intel):

@Paul: Lookarounds are useful in some cases, but the W3C chose to not include all the advanced regular expression features in the XPath/XSLT/XQuery specifications. They are concerned with interoperability and not forcing particular implementation languages or libraries. Perhaps now there are good use cases (do you have some for the XML processing domain?) and enough implementations in enough languages and libraries to add lookaheads in the next version of the specifications.

anonymous:

Give me variable-length lookbehinds and I'll be impressed.

Hell, start with lookarounds period.
