The idea that you might mine the structure of a document or set of documents is a new one, and Scientio is the first company to produce tools to do this. With XML Miner you can mine the structure, data and text of a document simultaneously. On this page we'll try to explain why you would want to mine the structure of a document or data source, and how you would go about it.
For there to be any information to be gained from the structure of a document, that structure itself must be variable in some way. This is what makes semi-structured data such as XML documents interesting. Most XML schemata, NewsML for instance, allow enormous range and variability in the documents that conform to them. More than 90% of the NewsML schema is taken up defining optional items, and this is true of most non-trivial schemata. The contents and structure of such documents vary according to need, and thus useful information can be inferred from both the contents and the structure.
To give you an example, let's look at another, simpler kind of semi-structured data source: books. Books have a structure defined by custom. There is a set of structural elements you might find in any given book, and when they occur there are conventions about their order. The presence and number of the elements are decided by the author and the publisher. The general-purpose schema for a collection of books might look like this:
So, as we know, books contain chapters, perhaps sub-chapters, and maybe prefaces and so on, but the presence and number of any item is variable. We've added two data items to the schema, the title and a boolean that says if the current book is non-fiction or not.
We picked a set of books from the Haydon house library and created a data set containing the key parts of their structure. In a real-world application the data set might contain the text of the books as well, but we've contented ourselves with just their skeletal structure. As a simple example of structure mining, we'll attempt to predict whether a book is fiction or not based on the structure alone.
The result of running XMLMiner on this data is a small Metarule rule set containing just these two rules:
if index is absent then nonfiction will be false (confidence 1.00)
if index is present then nonfiction will be true (confidence 1.00)
So XmlMiner has worked out that fiction books don't have indexes. It's responding to the absence or presence of nodes in the structural trees of the books.
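To make the confidence figures in rules like these concrete, here is a minimal Python sketch. It is not XmlMiner's algorithm or API. The toy records are invented for illustration (they are not the Haydon house library data), and `rule_confidence` is a hypothetical helper that assumes confidence means the fraction of books matching the condition for which the conclusion also holds.

```python
# Toy records, invented for illustration: each book is reduced to one
# structural feature (is an index present?) and the label to predict.
books = [
    {"index_present": True, "nonfiction": True},
    {"index_present": True, "nonfiction": True},
    {"index_present": False, "nonfiction": False},
    {"index_present": False, "nonfiction": False},
    {"index_present": False, "nonfiction": False},
]


def rule_confidence(records, feature, feature_value, label, label_value):
    """Confidence of 'if feature == feature_value then label == label_value'.

    Returns the fraction of records matching the condition for which the
    conclusion also holds, or None if no record matches the condition.
    """
    matching = [r for r in records if r[feature] == feature_value]
    if not matching:
        return None
    hits = sum(1 for r in matching if r[label] == label_value)
    return hits / len(matching)


absent_rule = rule_confidence(books, "index_present", False, "nonfiction", False)
present_rule = rule_confidence(books, "index_present", True, "nonfiction", True)
print(f"if index is absent then nonfiction will be false (confidence {absent_rule:.2f})")
print(f"if index is present then nonfiction will be true (confidence {present_rule:.2f})")
```

On this toy data both rules hold for every matching book, which is what a confidence of 1.00 would mean under the assumption stated above.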
This is a very simple example of structure mining, but the same technique applies to any hierarchical data. Structure mining can also be applied at the same time as data mining and text mining, so rule sets can be created and exploited that depend on one or all of these techniques simultaneously.
You activate structure mining in a data mining run by specifying that the nodes of interest are presence nodes or arity nodes. For presence nodes, XmlMiner checks whether a node is present or absent; for arity nodes, XmlMiner counts the occurrences of the specified nodes. These values are then used as if they were boolean or integer values in the input data set.
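The difference between the two node types can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is only an illustration of the idea, not XmlMiner's configuration or API: the element names (`book`, `preface`, `chapter`, `subchapter`, `index`, `appendix`) and the helpers `presence` and `arity` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical book skeleton; element names are assumed for illustration.
BOOK_XML = """
<book>
  <title>An Example Handbook</title>
  <nonfiction>true</nonfiction>
  <preface/>
  <chapter><subchapter/><subchapter/></chapter>
  <chapter/>
  <chapter><subchapter/></chapter>
  <index/>
</book>
"""


def arity(root, tag):
    """Count the elements with the given tag in the tree rooted at `root`."""
    return sum(1 for _ in root.iter(tag))


def presence(root, tag):
    """True if at least one element with the given tag occurs in the tree."""
    return arity(root, tag) > 0


book = ET.fromstring(BOOK_XML)

# Presence nodes become boolean columns, arity nodes become integer columns.
features = {
    "index_present": presence(book, "index"),
    "appendix_present": presence(book, "appendix"),
    "chapter_count": arity(book, "chapter"),
    "subchapter_count": arity(book, "subchapter"),
}
print(features)
```

A row of values like this, one per book, is the kind of boolean and integer input that could then sit alongside ordinary data fields such as the title in a mining run.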