Uncertainty is of fundamental interest for artificial intelligence and data mining applications.
There is little point in using data mining or rule-based systems unless some uncertainty exists as to the correct algorithm to use, the values of data items or the validity of inferences made from them.
By definition, uncertainty is a nebulous thing, but XML Miner consistently attempts to quantify the uncertainty in any application and to give the best possible results when data is missing or vague.
The data mining system, the metarule language and the runtime processor all work together to try to quantify uncertainty at each stage of processing. They do so using the following features.
XML Miner data, text and structure mining
- The user can specify a percentage of the training data to set aside as a test set.
- The fuzzy logic rule induction process minimises and optimises the rule set created and annotates each rule with a degree of certainty, scaled 0-1, based on the support for the rule in the source data.
- The rule set is tested using the runtime processor fuzzy logic inference engine on both the training data and any test data set aside.
- XML Miner reports performance for both sets of data, as RMS error for numeric variables or percentage correct for categorical variables.
- XML Miner also reports the percentage of training or test patterns that did not generate a valid output with a usable confidence level.
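The reporting steps in the list above can be sketched in a few lines of Python. This is only an illustration of the measures named there, not XML Miner's own code: the function names (split_holdout, rms_error, percent_correct, percent_unanswered) are hypothetical, and representing "no usable output" as None is an assumption made for the example.

```python
import math
import random


def split_holdout(patterns, test_percent, seed=0):
    """Shuffle the patterns and set aside test_percent of them as a test set.

    Hypothetical helper: returns (training, test).
    """
    rng = random.Random(seed)
    shuffled = list(patterns)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_percent / 100)
    return shuffled[n_test:], shuffled[:n_test]


def rms_error(pairs):
    """Root-mean-square error over (predicted, actual) numeric pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


def percent_correct(pairs):
    """Percentage of (predicted, actual) categorical pairs that match."""
    return 100.0 * sum(p == a for p, a in pairs) / len(pairs)


def percent_unanswered(predictions):
    """Percentage of patterns with no usable output.

    Assumption: an unusable or missing output is represented as None.
    """
    return 100.0 * sum(p is None for p in predictions) / len(predictions)
```

For example, split_holdout(range(10), 20) keeps eight patterns for training and sets two aside for testing. The same three measures would then be computed once for the training set and once for the test set.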
Runtime processor
- The runtime processor supports fuzzy tri-state logic. Internal logical values are either in the range 0-1, for fuzzy degrees of truth, or -1, representing an unknown state.
- The runtime processor's logical and numeric processing correctly handles unknown data states and propagates them through the rules. A missing data value on one of the inputs does not prevent processing if the output can be inferred from other sources.
- For each output, a confidence value is generated. If the combination of inputs presented did not fire any rules, an unknown state is signalled.
- Each rule can be annotated with a certainty factor in the range 0-1 which is used in the rule aggregation process.
- All inputs to the runtime processor can be either crisp values, or values with associated uncertainty.
- Numeric inputs can be specified as various kinds of fuzzy numbers.
- Categorical inputs can be specified as a set of alternate categories, each with an associated 0-1 confidence level.
- Numeric outputs from the runtime processor are supplied both as a crisp central value and as a fuzzy number, complete with a certainty value.
- For categorical outputs, the runtime processor supplies both the most likely category, with associated uncertainty, and an ordered list of alternate categories (if they exist) annotated with confidence figures.
- Where the runtime processor is supplied with data from an XML document, it will aggregate values across multiple nodes where present.
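To make the tri-state idea concrete, here is a minimal Python sketch of how logical values in the range 0-1, plus -1 for unknown, might be combined and ranked. It rests on several assumptions that the list above does not state: min and max are used for fuzzy AND and OR, a definite 0 or 1 is allowed to decide a result even when the other operand is unknown, and a rule's contribution is its truth value multiplied by its certainty factor. All names (UNKNOWN, fuzzy_and, fuzzy_or, fuzzy_not, aggregate) are hypothetical and do not describe the runtime processor's actual interface.

```python
UNKNOWN = -1.0  # the unknown state; fuzzy truth values lie in 0-1


def fuzzy_not(a):
    """Negation; an unknown input stays unknown."""
    return UNKNOWN if a == UNKNOWN else 1.0 - a


def fuzzy_and(a, b):
    """Assumed min-based AND with unknown propagation."""
    if a == 0.0 or b == 0.0:
        return 0.0  # a definite false decides the result
    if a == UNKNOWN or b == UNKNOWN:
        return UNKNOWN
    return min(a, b)


def fuzzy_or(a, b):
    """Assumed max-based OR with unknown propagation."""
    if a == 1.0 or b == 1.0:
        return 1.0  # a definite true decides the result
    if a == UNKNOWN or b == UNKNOWN:
        return UNKNOWN
    return max(a, b)


def aggregate(fired):
    """Rank categories from (category, truth, certainty) rule results.

    Assumption: each rule supports its category with truth * certainty,
    and the strongest rule per category wins. Returns UNKNOWN when no
    rule fired, otherwise a list of (category, confidence) pairs with
    the most likely category first.
    """
    scores = {}
    for category, truth, certainty in fired:
        if truth == UNKNOWN or truth == 0.0:
            continue  # this rule did not fire
        support = truth * certainty
        scores[category] = max(scores.get(category, 0.0), support)
    if not scores:
        return UNKNOWN
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Under these assumptions, fuzzy_and(0.0, UNKNOWN) evaluates to 0.0 because the known input already settles the result, while fuzzy_and(0.7, UNKNOWN) stays unknown. This mirrors the statement that a missing input does not prevent processing when the output can be inferred from other sources.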