Text Mining 

Semi-structured data, such as that found in XML documents, frequently contains large amounts of text. XML is used for all kinds of messaging, for instance RSS feeds or news stories. These messages contain a mix of long textual sections and short numeric or categorical tagged sections. XML Miner can mine all of these data types simultaneously.

Scientio has two text mining algorithms available:

The first, which is embedded in XML Miner, is the Naive Bayes algorithm. This is language independent, though English language stop words and stemming can be employed if required. Textual elements are pre-processed by our text processor, which creates the probability tables used in Naive Bayes. This process creates a set of vocabularies, with their associated word frequencies and calculated memberships for each pattern of each vocabulary.
In the learning stage that follows, these memberships are used as input data along with any numeric or categorical data items, so textual data is fully integrated into the data mining process. The vocabularies are embedded into the data dictionary section of the rule set, so that the rule set and our standard runtime can perform both data and text based inference on new data.
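
As a rough illustration of the general technique, the sketch below builds one vocabulary of word frequencies per category and then scores new text against each vocabulary with Naive Bayes. It is not XML Miner's text processor: the function names, the tiny stop word list, the add-one smoothing and the uniform priors are all assumptions made for this example, and stemming is left out.

```python
import math
from collections import Counter

# Illustrative stop word list only; a real list would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})


def tokenize(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in stop_words]


def build_vocabularies(labelled_texts, stop_words=frozenset()):
    """Build {label: Counter of word frequencies} from (label, text) pairs."""
    vocabularies = {}
    for label, text in labelled_texts:
        counts = vocabularies.setdefault(label, Counter())
        counts.update(tokenize(text, stop_words))
    return vocabularies


def memberships(text, vocabularies, stop_words=frozenset()):
    """Return {label: probability} for the text.

    Assumes uniform priors and add-one (Laplace) smoothing, so a word
    missing from a vocabulary lowers the score without zeroing it.
    """
    all_words = set().union(*vocabularies.values())
    tokens = tokenize(text, stop_words)
    log_scores = {}
    for label, freqs in vocabularies.items():
        total = sum(freqs.values())
        log_scores[label] = sum(
            math.log((freqs[word] + 1) / (total + len(all_words)))
            for word in tokens
        )
    # Normalise in log space so the memberships sum to 1.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}
```

In a pipeline of the kind described above, the values returned by `memberships` would sit alongside the numeric and categorical fields of a record as ordinary numeric inputs to the learning stage.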

Main points are:
  • Industry standard algorithm
  • Language independent
  • Stemming and stop words can be used for English text
  • Fully integrated and automatic
  • Vocabularies and word frequencies supplied as part of the generated rule set

The second text mining algorithm is used in our new concept mining product, ConceptMine. This can be easily integrated into the data mining system to pre-process textual elements. ConceptMine generates a numeric signature for each document that locates it in 'concept space'. These signatures can be used as input to the data mining process.

Main points are:

  • Language dependent, but most languages supportable
  • Generates much smaller rule sets than other algorithms
  • Has more 'human' recognition characteristics
 

Copyright (c) 2007, Scientio LLC All Rights Reserved