Articles on and examples of using Scientio technology.


Calculating document similarity efficiently

Jun-102005
We were recently contacted by a customer with a difficult problem. They wanted to compare new documents being loaded into a database with existing ones to make sure the new  documents were not duplicates or near duplicates of those already in the database.

On the face of it this is an innocent enough request, but they wanted to do this with a database containing up to a million documents!

Doing that efficiently is a much bigger problem.

Comparing two documents is relatively simple. You might, for instance, perform a word frequency analysis and then perform some kind of similarity measure between the word vectors produced. However, with a million documents this kind of pair-wise comparison might have to be done on average half a million times every time you wanted to add a new document!

Ideally you would locate the documents in some kind of space, based on content, and then locate any incoming document in the same space, to be sure there were not any near neighbors that might be duplicates. The standard solution is to use word space, i.e. to use the space of word frequencies to compare documents in. Unfortunately the English language contains more than 40,000 words, so this word space has enormous dimensionality!  For all the complexity this brings, this ids the standard solution.

In previous articles we've talked about our developing concept-mining capability. One early spin-off of this is the ability to reduce the content of an English language document to a nine way vector of doubles, representing the excitation of the nine root concepts in the noun tree of the WordNet model of the English language.

As we've discussed before, it's easy to be cynical about this fairly ad-hoc measure; except that when we use it as part of a text mining algorithm it has far better performance than one would expect, although not as good as Naive Bayes and Support Vector Machines.

It seems that this measure might also be useful as a means of locating a document in ""concept space"" rather than a word frequency space, and that having done that, words that are close in the concept space might have the right kind of similarity in the real world.

We constructed an experiment to look into this.

The obvious data set to use is the Reuters data set, which we use for other text mining examples. This contains 8599 samples of text, each around a paragraph long, containing news stories from the mid 80s, each averaging 150 words. We started by extracting the stories, and then running each through our measure, transferring them to ""concept space"".

We used our KDTree class, which is part of our Chaos tools toolbox, as the database in the test. the KDTree class takes as input a two dimensional array. In this case the columns of the array hold the nine excitation values, and the rows represent the documents. Once you've inserted data you can look up neighbors to any new data vector. the KDTree class can return the 'n' nearest, or all the vectors within a given radius. The result of such a search is an array of KDTreePoint objects, that are each annotated with the index of the document they represent and the distance of this document to the new vector in the chosen space. These are sorted, closest first.

So, using the KDTree classes and our concept mining classes we generated a breadboard that enabled us to load documents, encode them, storte them and retrieve them based on the measure.

 The first thing to find out is  whether this measure is sufficiently expressive to enable the user to reliably locate the same document you put in. i.e. if you use the measure to put it into a database, can you use it to get it out again? If the measure created lots of duplicates this might be very hard. It's worth pointing out that the measure considers only nouns, ignoring casing, punctuation and spaces.

The second thing to look at, and the key thing for the application we were considering, is whether corrupting the documents in some way would destroy the ability of the system to retrieve them. To this end we tried the whole process three times, once with uncorrupted data, once with 5% of the text chopped of the end, and once with 5% of the text chopped off the start.

The following table shows the results of the processing and the performance of the system on the 8599 records of the Reuters data set.

rank in search by which the correct document has been found.

No truncation count

No truncation %

truncation at rear count

truncation at rear %

truncation at start count

Truncation at start%

1

8599

100

7403

86.09

5886

68.44

2

 

 

7820

90.94

6572

76.42

3

 

   

7994

92.96

6913

80.39

4

 

 

8096

94.15

7159

83.25

5

 

 

8163

94.92

7334

85.28

6

 

 

8225

95.65

7466

86.82

We also looked at the time requirements for processing these documents.

Timings on a 2Ghz Athlon 64 with 1Gb ram

Constructing language model

4.7 Seconds

Time to create 8599 signatures

117.59 Seconds

Time to load the KDTree with 8599 records

0.046 Seconds

All in all, this measure turns out to be surprisingly effective. To be able to reduce an arbitrary document to a 72 byte signature and reliably get it back again, and to go on to locate similar data items in a robust fashion is quite an achievement.


 
Posted by Administrator | 0 Comments | Trackback Url | Bookmark with:        
Tags:

Links to this Post

Comments

Name:
URL:
Email:
Comments:

CAPTCHA Image Validation