Writing Extractors in Intel Mash Maker
In order to apply mashups to a web page, Intel Mash Maker needs to know what the page means. To do this, Mash Maker needs to have an extractor telling it how to extract meaning from the page.
Extractors are written within Mash Maker itself, and are stored in the central Mash Maker database. Following the model of Wikipedia, any user can edit any extractor in the database, and vandalism is dealt with using reversion.
At present, writing an extractor requires a reasonable amount of technical knowledge, including an understanding of HTML and a basic knowledge of XPath. We intend to provide a more approachable interface for creating extractors in the future.
The easiest way to learn how to write extractors is to look at some existing ones. Browse to a page you find interesting, and have a look at the extractor that produced the data. Most of them are fairly simple.
Getting Started
To start writing an extractor for a page, click on the extras button and select "Data Tree Sidebar". Alternatively, you can enable the sidebar using the normal Firefox menus.

The Data Tree sidebar displays a tree representation of the RDF description of the current page. To edit the extractor that produces this data, open the structure editor by clicking on the "structure" button at the top of the sidebar. You will probably also want to install Firebug so that you can easily inspect the HTML for a page.
Once you have opened the structure editor and turned on Firebug, your browser should look like the following:

The Structure Editor
The key to writing an extractor is the structure editor:

This panel displays information about how to extract the property or child item selected in the data tree:
- Name: A name for this kind of page. For example "BBC News Story", "Yelp Listing", "Facebook Friend".
- Browse: Browse other extractors for the same domain. Also provides the ability to revert to a previous version.
- URL Regexp: A regular expression that a URL must match in order for this extractor to be applied.
- Priority: If several extractors match a URL, then the one with the highest priority is chosen.
- Property: This panel describes the property currently selected. Click on a property in the data tree to edit the definition of that property.
- Name: The name of the current property. Either use the drop-down list to select a standard property, or type in a new property name. This corresponds to an RDF URI in the Mash Maker namespace. If two properties have the same name then they are assumed to have the same meaning. [better support for namespaces is likely to appear soon]
- Type: If defining a new property, this is the type. [this approach to defining new properties is likely to change]
- Node XPath: An XPath expression describing how to find the object. XPath is a very easy language to learn. Someone competent with HTML should be able to learn the bulk of it in a few minutes. The easiest way to write XPath for an element is to use Firebug to pick out the thing you are interested in, and then enter a brief description of the node.
- Text RegExp: If the value of the property is text, then you may want only part of the text from the node identified by the XPath. If so, write a regular expression describing the text you want. The value is the first capturing group to have a value. For example, if you want to extract a time, you might write "(\d+:\d+(a|p)m)".
- Required: If this property cannot be found, the whole item is ignored. This is commonly used to simplify the XPath needed for the item.
- Encoded: The text is form-encoded and should be decoded.
- Presence: The result should be a boolean value saying whether we found something matching the XPath and regexp.
- Concat: The result is the text obtained by concatenating all nodes that matched the XPath, rather than just the text of the first matching node.
- Add Top: Add a new property for the whole page.
- Add: Add a property to items on the page.
- Delete: Delete this property.
- Up/Down: Move this property up/down in the property order.
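The Text RegExp, Encoded, Presence and Concat options described above can be illustrated outside Mash Maker with a short Python sketch. It is not Mash Maker's implementation: the function names `first_group` and `property_value` are hypothetical, the input is assumed to be the list of text values of the nodes an XPath matched, and the order in which the options are applied is an assumption made for illustration.

```python
import re
from urllib.parse import unquote_plus


def first_group(pattern, text):
    """Return the first capturing group that has a value, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    for value in match.groups():
        if value is not None:
            return value
    return None


def property_value(node_texts, text_regexp=None, encoded=False,
                   presence=False, concat=False):
    """Compute a property value from the text of the matched nodes.

    node_texts is the text of every node the XPath matched, in document
    order. The ordering of the steps below is an assumption.
    """
    if not node_texts:
        text = None
    elif concat:
        # Concat: join the text of all matching nodes.
        text = "".join(node_texts)
    else:
        # Default: use only the first matching node.
        text = node_texts[0]

    if text is not None and text_regexp:
        # Text RegExp: keep only the first capturing group with a value.
        text = first_group(text_regexp, text)

    if presence:
        # Presence: report whether anything matched, not what matched.
        return text is not None

    if text is not None and encoded:
        # Encoded: decode form-encoded text ("+" and "%XX" escapes).
        text = unquote_plus(text)
    return text
```

For example, `property_value(["Doors open at 7:30pm tonight"], text_regexp=r"(\d+:\d+(a|p)m)")` yields `"7:30pm"`, because the outer group is the first one to have a value. With a pattern such as `(\d+)am|(\d+)pm` applied to `"opens 9pm"`, the first group is empty and the second group supplies the value `"9"`.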
If you have made a change to the extractor then click "Refresh" to refresh the data tree immediately, based on this new definition. Once you are happy with what you have made, click "Publish" to share this extractor with the wider community.
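Once several extractors exist for a site, the URL Regexp and Priority fields decide which one is applied to a given page. The following Python sketch shows one way that selection rule could work. The `Extractor` class, the `choose_extractor` function, and the example patterns and URLs are all hypothetical, and whether Mash Maker matches the pattern anywhere in the URL or against the whole URL is an assumption here.

```python
import re
from dataclasses import dataclass


@dataclass
class Extractor:
    name: str
    url_regexp: str
    priority: int


def choose_extractor(url, extractors):
    """Return the highest-priority extractor whose URL Regexp matches."""
    # Assumption: the pattern may match anywhere in the URL (re.search).
    matching = [e for e in extractors if re.search(e.url_regexp, url)]
    if not matching:
        return None
    return max(matching, key=lambda e: e.priority)


# Hypothetical extractors for illustration only.
EXTRACTORS = [
    Extractor("BBC Generic", r"^https?://news\.bbc\.co\.uk/", 1),
    Extractor("BBC News Story", r"^https?://news\.bbc\.co\.uk/.*/\d+\.stm$", 5),
    Extractor("Yelp Listing", r"^https?://www\.yelp\.com/biz/", 5),
]

chosen = choose_extractor(
    "http://news.bbc.co.uk/2/hi/technology/7000000.stm", EXTRACTORS
)
```

Both BBC patterns match the example URL, so the more specific "BBC News Story" extractor is chosen because it has the higher priority. Giving a narrow pattern a higher priority than a broad fallback is one way to let the two coexist.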
It is particularly important that you give an item a URL property, as this is used to identify the object in many places.
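To show how a Node XPath, a required URL property, and per-item properties fit together, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The page markup, the `extract_items` function, and the property names `url` and `title` are made up for illustration and are not part of Mash Maker.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for this example.
PAGE = """
<html><body><ul>
  <li class="story"><a href="/story/1">First headline</a></li>
  <li class="story">Headline without a link</li>
  <li class="story"><a href="/story/2">Second headline</a></li>
</ul></body></html>
"""


def extract_items(page_source):
    """Find items by XPath and drop those missing the required URL."""
    root = ET.fromstring(page_source.strip())
    items = []
    # Item XPath: every <li class="story"> anywhere in the page.
    for node in root.findall(".//li[@class='story']"):
        link = node.find("a")
        if link is None:
            # The URL property is treated as required: without it,
            # the whole item is ignored.
            continue
        items.append({"url": link.get("href"), "title": link.text})
    return items


items = extract_items(PAGE)
```

Here the item XPath stays simple and deliberately over-matches. Treating the URL as a required property then filters out the list entry that has no link, so `items` contains only the two stories with URLs.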
[This document will be expanded later with more information]