Extracting knowledge
Data mining and expert systems – two different disciplines?
Data mining is the process of analyzing data to detect useful structure that can be re-used for gain. Expert systems encode an expert’s experience and knowledge as a set of rules so that this can be re-used, also for gain. Scientio staff were among the pioneers of the first data mining systems, and have used and created versions of all the major algorithms used in data mining. This led to the insight that the best of these algorithms generally perform similarly, and that what matters is how an algorithm represents what it has discovered.
For instance, one might learn an interesting relationship with a neural net, and be able to predict future values with it, but one would not be able to extract from the neural net any idea what it had learned. Scientio decided to concentrate on algorithms that created models that the user could easily understand – it decided to use fuzzy logic rules to represent the knowledge learned through data mining. Fortuitously this representation can also be used as a very effective expert system.
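To show why a fuzzy logic rule is easier to read than the weights of a neural net, here is a minimal sketch of a single rule. The rule, the variable names (`age`, `income`) and the membership ranges are invented for illustration and are not taken from XMLMiner.

```python
def triangular(x, left, peak, right):
    """Degree (0.0 to 1.0) to which x belongs to a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def risk_is_high(age, income):
    """Hypothetical rule: IF age IS young AND income IS low THEN risk IS high."""
    young = triangular(age, 15, 25, 40)
    low_income = triangular(income, 0, 15_000, 40_000)
    # Fuzzy AND is taken here as the minimum of the two memberships.
    return min(young, low_income)
```

The rule can be read aloud as a sentence, and its result is a degree of truth rather than a hard yes or no.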
Scientio has thus combined these two disciplines into one. Scientio’s product XMLMiner takes data, mines it and produces an expert system to represent the knowledge learned.
Forms of data mining
There are three main ways to extract knowledge from data:
- Supervised learning is the process of learning by example. The sample data is organised into patterns with one or more associated values the system is to learn to predict. After training you can use the model on fresh data to predict these values.
- Unsupervised learning or clustering is a way of deciding if there is structure in a large set of data, and reporting the structure found to the user. Shopping basket analysis is an example of this.
- Reinforcement learning is the process of optimising a system that can be simulated by changing parameters or functional elements.
Together with our other systems, XmlMiner can be used to perform all three of these kinds of learning. In each case algorithms have been developed to create expert system results without loss of performance. In the case of reinforcement learning Scientio uses Genetic Programming of Fuzzy Logic Rules, a field invented by Scientio’s principal.
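Supervised learning, the first of the three forms listed above, can be illustrated with a very small learner. The sketch below uses a one-feature threshold rule (a "decision stump"), chosen only because it is short; it is not one of Scientio's algorithms, and the training data is made up.

```python
def train_stump(examples):
    """Learn a threshold from (value, label) pairs, where label is True/False.

    Returns the candidate threshold that misclassifies the fewest examples
    when predicting True for value >= threshold.
    """
    candidates = sorted({value for value, _ in examples})

    def errors(threshold):
        return sum((value >= threshold) != label for value, label in examples)

    return min(candidates, key=errors)


def predict(threshold, value):
    """Apply the learned rule to fresh data."""
    return value >= threshold


# Illustrative training patterns: a single input value and the value to predict.
training = [(1.0, False), (2.0, False), (3.5, True), (4.0, True)]
threshold = train_stump(training)
```

After training, `predict(threshold, 5.0)` applies the learned rule to a value the model has not seen before.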
Forms of data
Data mining, as a mature technology, has fallen into a bit of a rut. Twenty years ago the dominant form of data representation was relational database tables, and all the current algorithms are directed at these.
The world is now full of new media and new forms of data that break out of the tabular data paradigm. Data mining has not caught up. The world of data can be represented using the diagram on the left. Conventional data mining tools only target the smallest circle. Scientio’s XMLMiner can target the tree-based data circle, and much of the outer circle.
Mining relationships as well as data
Conventional data mining mines only numbers and categories. Scientio has produced the first general tools to mine the structure of data too. If you consider a family tree as an example of a data set, the structure encodes relationships: literally who is related to whom. There are many forms of data where the relationships are more important than the data items. If you are trying to detect terrorists, for instance, bald facts about individuals might not get you very far, but look at who an individual is phoning or is related to, and the data comes to life.
Tree shaped data can be effectively represented in XML. As shown in the picture, this includes conventional tabular data too.
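As a small illustration of how tabular data fits into XML, the sketch below turns two table rows into a tree using Python's standard library. The element names (`table`, `row`) and the sample records are assumptions made for the example, not a format required by any Scientio product.

```python
import xml.etree.ElementTree as ET

# Two rows of a conventional table, one dictionary per row.
rows = [
    {"name": "Ann", "age": "34"},
    {"name": "Bob", "age": "41"},
]

# Each row becomes a child element; each column becomes a grandchild.
table = ET.Element("table")
for row in rows:
    record = ET.SubElement(table, "row")
    for column, value in row.items():
        ET.SubElement(record, column).text = value

xml_text = ET.tostring(table, encoding="unicode")
```

A table is simply a tree of fixed depth, which is why tabular data sits inside the tree-shaped data set.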
Scientio’s flagship product, XML Miner, can mine both the data and the structure of any XML document. It can learn to make inferences based on the presence or absence of particular kinds of nodes in an XML document, and based on the number of those nodes too.
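The kind of structural input described here, node presence and node counts, can be sketched with the standard library. This shows only how such features might be derived from a document; it says nothing about how XML Miner works internally, and the sample document and tag names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def node_counts(xml_text):
    """Count how many times each element tag occurs in the document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(element.tag for element in root.iter())


doc = "<person><name>Ann</name><phone>1</phone><phone>2</phone></person>"
counts = node_counts(doc)

# Presence or absence of a node type, and the number of such nodes,
# can then be used as inputs to a learning algorithm.
has_email = counts["email"] > 0
phone_total = counts["phone"]
```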
Mining text, data and structure simultaneously
To complete the array of data types that XmlMiner can handle, it also supports text mining. This is conventional word frequency based text mining with stemming and stop words using the naïve Bayes algorithm. Because XML data often contains mixed data and text (examples are RSS feeds, news stories, financial data feeds), and because any of the forms of mining might be useful on any given set of documents, XmlMiner permits all of the kinds of data mining to be performed at once.
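For readers unfamiliar with word frequency based text mining, the following is a minimal sketch of a naïve Bayes text classifier with stop words and stemming. The stop word list, the suffix-stripping stemmer (a crude stand-in for a real stemmer such as Porter's), the add-one smoothing and the training sentences are all assumptions made for the example; XmlMiner's own implementation may differ.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}


def tokens(text):
    """Lower-case, drop stop words, and strip a few common suffixes."""
    result = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        result.append(word)
    return result


def train(labelled_docs):
    """Count word frequencies per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens(text))
    return word_counts, doc_counts


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts = model
    vocabulary = {w for counter in word_counts.values() for w in counter}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokens(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([
    ("shares rise on strong earnings", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins the cup final", "sport"),
    ("striker scores winning goal", "sport"),
])
```

Calling `classify(model, "interest rates rising")` scores the text against each label and returns the more probable one.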