Articles on and examples of using Scientio technology.


Concept mining versus text mining

Apr-22005
We hope you are aware, if you come to the Scientio website a lot, that as well as performing data and structure mining of XML documents, XML miner does Text Mining too.
Text mining is really statistical analysis of word frequencies within a document. The two most popular learning algorithms for Text mining are Naïve Bayes and Support Vector Machines. We implement the former. Text mining generally ignores word order, so “The cat sat on the mat” is the same as “the mat sat on the cat” to most text mining systems. This is called the “bag of words” approach. Some systems use natural language processing, so would see these two sentences as different, but surprisingly their performance is not yet as high as bag of words approaches in the typical text mining application of inferring what a passage of text is about, and relating it to a set of preselected topics.

The big drawback of  bag of words learning algorithms is that the models created are very big. There is typically a record in the model for every word that crops up in the source documents. The size of the models is bounded, thankfully, because languages only have a finite number of words. Stemming, the process of identifying the key part of the word that relates to the meaning, can reduce the total number of words stored. As an example “like” “likes” “liking” and “liked” can all be stored as “like” without loss of recognition performance. You need a different stemming algorithm for each language, so this process sacrifices language independence, which bag of words approaches generally have.

I said above that the key task of text mining was to identify what a piece of text was about – It would be great if a machine could understand text, but we’ll leave that for another day! Inferring what a piece of text is about is a much simpler task than understanding it. As an example, my knowledge of French is poor, but I could still probably work out what two French people were talking about in general, without actually being able to follow the conversation.


I’ll be concentrating, from now on, on the English language. English is particularly rich with homonyms and synonyms. Any word we use might have multiple meanings, and we use the context to disambiguate what is meant. We can formalise the idea of meaning somewhat by saying that meanings correspond to concepts, and that multiple words might be used to represent a particular concept. Each word is likely to map onto multiple concepts, so we end up with a model as shown below.


If we’re trying to work out what a piece of text is about, then the concepts in it should be much more relevant than the words themselves. How do we go from words to concepts, however?

If you’ve ever used a Thesaurus the answer should be clear. Although we generally use them just to look up alternate words with a similar meaning, thesauri group words by meaning and these meanings are arranged in hierarchies.  Several years ago a wonderful open source database of the English language was created called WordNet.  http://wordnet.princeton.edu/  This has all the structures we are looking for as well as a built in stemmer. We created a class library to encapsulate this a over a year ago.
So, given this class library, we can find out the concepts present in a piece of text, and so we can perform “concept mining” on the result rather than text mining.

There are caveats however. The authors of WordNet did not supply any probability information on particular meanings. I.e. word ‘X’ might have meanings ‘A’, ‘B’ and ‘C’. We do not know if ‘A’ is more likely than ‘B’. Although they collected data on relative frequency, and so, by inference, which meaning is more likely, they did not store it. We are therefore required to consider equally every meaning a word might have. Those of you who are up on these things might see that if the frequencies were available, a Bayesian belief network approach to work out which concepts were active in any context might work well.

The first algorithm we came up with for making use of concept mining used only nouns, and was based on the fact that name concepts form hierarchies, and that these are present in WordNet. The hierarchies use hyponymy as links. This means as you go down the tree, each child is an example of the thing above, and as you go up the tree the concepts get more general. Surprisingly, Word Net has only eleven root nodes to these hierarchies, or to put it another way, there are only eleven hierarchies for nouns in the English language. We constructed a signature for a given document by scanning through the text and applying scores to each concept linked to a word in the text as it turned up. Scores were then propagated up through the hierarchy to the root nodes. The signature was formed as a vector of these eleven scores. These signatures were then normalised by word count and we made the normalised signatures inputs to our standard fuzzy rule inference learning algorithm in XmlMiner.

The results on the Reuters data set, (see http://www.scientio.com/reutdemo.aspx) as compared to our existing Naïve Bayes were not sparkling, but interesting. Concept mining scored 55%, compared to 85% for Naïve Bayes, but given that there are 10 reasonably equally distributed topic classes, this is very much better than blind chance, and shows that there is a future in such approaches. On the other hand, the models are very small, since we only have to store eleven integers per document.

 
Posted by Administrator | 0 Comments | Trackback Url | Bookmark with:        
Tags:

Links to this Post

Comments

Name:
URL:
Email:
Comments:

CAPTCHA Image Validation