XSLT 2.0: Regular Expression Functions and Instructions

One of the weaknesses in XSLT 1.0 was the very simple set of string manipulation features. In comparison to many popular programming languages, the string functions lacked one very powerful feature, regular expressions. Intel SOA Expressway actually offers this functionality with extension functions for our customer base. In XSLT 2.0, the XSLT working group plugged this hole for everyone in a couple of ways that we’ll look at in this post.

Regular expressions are a very flexible way to match strings or portions of strings based on patterns. These patterns are specified by a special syntax and are unrelated to the patterns used to match nodes in XSLT templates. See the Regular Expression Syntax (http://www.w3.org/TR/xpath-functions/#regex-syntax) section of the XQuery 1.0 and XPath 2.0 Functions and Operators specification for the details of the syntax.

Because regular expressions are so popular, I’ll assume some basic knowledge for the examples in this post; it’s easy to pick up from the web. Still, it’s worth pointing out that the regular expression syntax supports some advanced features. There are “greedy” quantifiers (such as * and +) that match the longest substring possible and “reluctant” versions (*? and +?) that match the shortest. Enclosing a portion of the regular expression in parentheses, for example “(123)”, specifies a captured subgroup that can be referenced by number. Finally, as you might expect for a language targeted at processing XML, there are special escapes for Unicode character categories (such as \p{L} for letters) and for XML name characters (\i and \c).
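To make the difference between greedy and reluctant quantifiers concrete, here is a minimal sketch. It assumes a processor that can start at a named template (for example, Saxon's -it:main option); the template name "main" and the sample string are made up for illustration, and replace() is used only to make each match visible.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- "main" is a hypothetical entry point, not something XSLT requires -->
  <xsl:template name="main">
    <xsl:variable name="s" select="'[a][b][c]'"/>
    <!-- Greedy: ".*" runs to the last "]", so the whole string is one match. -->
    <xsl:value-of select="replace($s, '\[.*\]', 'X')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Reluctant: ".*?" stops at the first "]", giving three separate matches. -->
    <xsl:value-of select="replace($s, '\[.*?\]', 'X')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Captured group: $1 is the text matched inside the parentheses. -->
    <xsl:value-of select="replace($s, '\[(.*?)\]', '&lt;$1&gt;')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- \i matches an XML name-start character, \c any XML name character. -->
    <xsl:value-of select="matches('xsl:template', '^\i\c*$')"/>
  </xsl:template>
</xsl:stylesheet>
```

The four lines of output should be X, XXX, <a><b><c>, and true.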

Often it’s enough simply to match a regular expression when deciding whether and how to process a node or some text. The matches(string, regex) XPath function takes a string, attempts to match the regular expression against it, and returns true if it finds a match anywhere in the string. This function is handy in predicates and in <xsl:choose> and <xsl:if>, where a Boolean result is needed to decide what to process.
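As a small sketch of matches() inside <xsl:choose>, the stylesheet below classifies three made-up strings. The named template "main" and the sample values are assumptions for illustration. The point to notice is that matches() succeeds on a match anywhere in the string unless the pattern is anchored with ^ and $.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- "main" is a hypothetical entry point (e.g. Saxon's -it:main) -->
  <xsl:template name="main">
    <xsl:for-each select="('2009-10-23', 'shipped on 2009-10-23', 'yesterday')">
      <xsl:value-of select="."/>
      <xsl:text>: </xsl:text>
      <xsl:choose>
        <!-- Anchored with ^ and $: the whole string must be a date. -->
        <xsl:when test="matches(., '^\d{4}-\d{2}-\d{2}$')">is a date</xsl:when>
        <!-- Unanchored: a date anywhere in the string is enough. -->
        <xsl:when test="matches(., '\d{4}-\d{2}-\d{2}')">contains a date</xsl:when>
        <xsl:otherwise>no date</xsl:otherwise>
      </xsl:choose>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The first string should be reported as a date, the second as containing one, and the third as having none.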

When converting formats, sometimes it’s useful to change parts of a string that can be found with a regular expression. The replace(string, regex, replacement) XPath function returns a new string formed by substituting the replacement string for every match of the regular expression. The replacement string can refer to portions of the matched substring with numbered variables, such as $1 for the first captured group and $0 for the entire matched substring.
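Here is a brief sketch of replace() with numbered variables. The input strings and the named template "main" are invented for the example.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- "main" is a hypothetical entry point (e.g. Saxon's -it:main) -->
  <xsl:template name="main">
    <!-- $1, $2 and $3 are the captured year, month and day. -->
    <xsl:value-of select="replace('2009-10-23', '(\d{4})-(\d{2})-(\d{2})', '$3/$2/$1')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- $0 is the entire matched substring, here each run of digits. -->
    <xsl:value-of select="replace('Windows 7 and Office 2007', '\d+', '[$0]')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- A literal dollar sign in the replacement is written as \$. -->
    <xsl:value-of select="replace('price: 5', '\d+', '\$$0')"/>
  </xsl:template>
</xsl:stylesheet>
```

This should print 23/10/2009, then Windows [7] and Office [2007], then price: $5.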

Breaking a string into a sequence of smaller strings is done with the tokenize(string, regex) XPath function. Matches of the regular expression mark the boundaries between substrings (tokens) and are not included in the result. This function is useful for separating delimited strings, such as comma-separated values. But because tokenize() takes a regular expression, the separator can be much more sophisticated than a single character.
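The sketch below shows a separator that is more than a single character: a comma or semicolon together with any whitespace around it. The sample strings and the named template "main" are assumptions for illustration.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- "main" is a hypothetical entry point (e.g. Saxon's -it:main) -->
  <xsl:template name="main">
    <!-- The separator is a comma or semicolon plus any surrounding whitespace. -->
    <xsl:variable name="colors" select="tokenize('red, green;blue ,  yellow', '\s*[,;]\s*')"/>
    <xsl:value-of select="count($colors)"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="string-join($colors, '|')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Leading or adjacent separators yield zero-length tokens. -->
    <xsl:value-of select="count(tokenize(',a,,b', ','))"/>
  </xsl:template>
</xsl:stylesheet>
```

This should print 4, then red|green|blue|yellow, then 4 again, because the leading comma and the doubled comma each produce an empty token.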

To offer even more flexible regular expression processing than the XPath functions, XSLT 2.0 adds the <xsl:analyze-string> instruction. It breaks a string apart much like tokenize() but lets the stylesheet process both the matching and the non-matching parts. Because it’s an instruction, it can also create nodes to structure the data, which the XPath functions can’t do.
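As a minimal sketch of processing both kinds of parts, the stylesheet below wraps phone-number-like matches and the text between them in separate elements. The element names (message, phone, text), the sample sentence, and the named template "main" are all made up for the example. The braces in the quantifiers are doubled because the regex attribute is an attribute value template.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- "main" is a hypothetical entry point (e.g. Saxon's -it:main) -->
  <xsl:template name="main">
    <message>
      <xsl:analyze-string select="'Call 555-1234 or 555-9876 today'"
                          regex="(\d{{3}})-(\d{{4}})">
        <!-- Runs once per match; regex-group(n) returns the nth captured group. -->
        <xsl:matching-substring>
          <phone prefix="{regex-group(1)}" line="{regex-group(2)}"/>
        </xsl:matching-substring>
        <!-- Runs once per stretch of text between matches; "." is that text. -->
        <xsl:non-matching-substring>
          <text><xsl:value-of select="."/></text>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </message>
  </xsl:template>
</xsl:stylesheet>
```

The result should be a <message> element containing, in order, a <text> for "Call ", a <phone> for 555-1234, a <text> for " or ", a <phone> for 555-9876, and a <text> for " today".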

Here’s an example using <xsl:analyze-string> to pull out the current top ten trends from Twitter and build a very simple HTML page. The Twitter API is a RESTful API, which means it returns results in response to an HTTP request on a resource. In this case, we can ask the search API with the URI http://search.twitter.com/trends.json to get the latest top ten trends, which are returned only in JSON format. An example of the returned trends is shown below.

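{"as_of":"Fri, 23 Oct 2009 16:15:38 +0000","trends":[
{"name":"Follow Friday","url":"http://search.twitter.com/search?q=%22Follow+Friday%22+OR+%22Its+Friday%22"},
{"name":"#iusuallylieabout","url":"http://search.twitter.com/search?q=%23iusuallylieabout"},
{"name":"Nick Griffin","url":"http://search.twitter.com/search?q=%22Nick+Griffin%22"},
{"name":"Cassetteboy","url":"http://search.twitter.com/search?q=Cassetteboy"},
{"name":"Halloween","url":"http://search.twitter.com/search?q=Halloween"},
{"name":"TGIF","url":"http://search.twitter.com/search?q=TGIF"},
{"name":"Windows 7","url":"http://search.twitter.com/search?q=%22Windows+7%22"},
{"name":"Paranormal Activity","url":"http://search.twitter.com/search?q=%22Paranormal+Activity%22"},
{"name":"Soupy Sales","url":"http://search.twitter.com/search?q=%22Soupy+Sales%22"},
{"name":"#mylasttweetonearth","url":"http://search.twitter.com/search?q=%23mylasttweetonearth"}]}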

JSON uses a recursive syntax to represent data objects, so it can’t be parsed in the general case with regular expressions. In this case, though, the data we’re interested in is a flat list enclosed in the bracket characters “[” and “]”. We will process the {"name":"value","url":"value"} entries in the list with a regular expression to build a very simple HTML page that lists the top ten trends as links to the Twitter search page for each topic.

The stylesheet below is written as a simplified stylesheet module, in which the literal <html> result element doubles as the stylesheet, to keep the example self-contained.

<html xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Current Twitter Trends</title>
  </head>
  <body>
    <h2>Current Top 10 Twitter Trends</h2>
    <xsl:variable name="trends" select="unparsed-text('http://search.twitter.com/trends.json')"/>
    <xsl:variable name="trendList" select="tokenize($trends, '\[|\]')[2]"/>
    <xsl:analyze-string select="$trendList" regex='\{{.*?:"(.*?)",.*?:"(.*?)"\}}'>
      <xsl:matching-substring>
        <li><a><xsl:attribute name="href" select="regex-group(2)"/><xsl:value-of select="regex-group(1)"/></a></li>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </body>
</html>

The variable $trends retrieves the current trends from the Twitter search API by calling the unparsed-text() function. We need this function because the trends are returned only in JSON format, which the XSLT processor can’t parse as XML. The $trendList variable uses tokenize() to split the JSON response at each “[” and “]”. The pieces are returned as a sequence of strings, and the “[2]” predicate selects the second one, which holds the top ten list.

Next, <xsl:analyze-string> uses the fairly involved regular expression \{.*?:"(.*?)",.*?:"(.*?)"\} to find each {"name":"value","url":"value"} list entry. The braces are doubled in the stylesheet (\{{ and \}}) because the regex attribute is an attribute value template, in which a single brace would start an XPath expression. The name and URL values are captured in groups 1 and 2 with the enclosing quotes stripped off. Inside the <xsl:matching-substring> instruction, each list item is built as a link whose text is the trend name from regex-group(1) and whose href is the Twitter search page URL from regex-group(2).
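To try the two steps without a network request, here is a sketch that applies the same tokenize() call and the same regular expression to an inline sample. The sample is two entries taken from the response shown earlier; the output element names (trends, trend) and the named template "main" are assumptions for illustration, not part of the original stylesheet.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- "main" is a hypothetical entry point (e.g. Saxon's -it:main) -->
  <xsl:template name="main">
    <!-- Inline stand-in for the text that unparsed-text() would return. -->
    <xsl:variable name="trends">{"as_of":"Fri, 23 Oct 2009 16:15:38 +0000","trends":[
{"name":"Windows 7","url":"http://search.twitter.com/search?q=%22Windows+7%22"},
{"name":"Soupy Sales","url":"http://search.twitter.com/search?q=%22Soupy+Sales%22"}
]}</xsl:variable>
    <!-- Splitting at "[" and "]" gives three pieces; the second is the list. -->
    <xsl:variable name="trendList" select="tokenize($trends, '\[|\]')[2]"/>
    <trends>
      <!-- Doubled braces are attribute value template escapes for { and }. -->
      <xsl:analyze-string select="$trendList" regex='\{{.*?:"(.*?)",.*?:"(.*?)"\}}'>
        <xsl:matching-substring>
          <trend name="{regex-group(1)}" url="{regex-group(2)}"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </trends>
  </xsl:template>
</xsl:stylesheet>
```

The result should be a <trends> element with one <trend> per entry, carrying the name and URL as attributes. The commas and line breaks between entries are non-matching substrings, and they are dropped because there is no <xsl:non-matching-substring> instruction.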

The output of the stylesheet for the trends JSON return value shown above is shown below.

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Current Twitter Trends</title>
</head>
<body>
<h2>Current Top 10 Twitter Trends</h2>
<li><a href="http://search.twitter.com/search?q=%22Follow+Friday%22+OR+%22Its+Friday%22">Follow Friday</a></li>
<li><a href="http://search.twitter.com/search?q=%23iusuallylieabout">#iusuallylieabout</a></li>
<li><a href="http://search.twitter.com/search?q=%22Nick+Griffin%22">Nick Griffin</a></li>
<li><a href="http://search.twitter.com/search?q=Cassetteboy">Cassetteboy</a></li>
<li><a href="http://search.twitter.com/search?q=Halloween">Halloween</a></li>
<li><a href="http://search.twitter.com/search?q=TGIF">TGIF</a></li>
<li><a href="http://search.twitter.com/search?q=%22Windows+7%22">Windows 7</a></li>
<li><a href="http://search.twitter.com/search?q=%22Paranormal+Activity%22">Paranormal Activity</a></li>
<li><a href="http://search.twitter.com/search?q=%22Soupy+Sales%22">Soupy Sales</a></li>
<li><a href="http://search.twitter.com/search?q=%23mylasttweetonearth">#mylasttweetonearth</a></li>
</body>
</html>

The addition of regular expressions is a big step forward in functionality for XSLT 2.0. It offers not only simple matching but also very flexible string substitution and splitting. And last but certainly not least, the <xsl:analyze-string> instruction allows sophisticated processing of parts of a string, including turning unstructured strings into structured XML data. I’m curious: for those using XSLT 2.0, do you find yourself relying more on the XPath functions or on the XSLT instructions?
