ConceptMine: efficient concept mining and document similarity comparison
ConceptMine is a .NET class library and web service that lets you efficiently compare and index documents for similarity, both lexically and by the concepts they contain. It does this using new concept mining techniques.
It uses a model of the English language (Spanish also available) to calculate a short numeric 'signature' that categorizes the concepts found within a given document.
This signature enables simple and efficient lookup of documents in concept space, and thus efficient comparison, indexing and retrieval. The class library includes indexing code based on k-d trees, which indexes documents and locates similar ones in O(log n) time.
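To illustrate why fixed-length signatures suit tree-based indexing, here is a minimal sketch of a k-d tree nearest-neighbour lookup in Python. It is not ConceptMine's code or API: the names `Node`, `build` and `nearest`, the three-value signatures and the document ids are all assumptions made for illustration (real signatures hold 9 to 16 values).

```python
import math
from typing import List, Optional, Tuple

Signature = Tuple[float, ...]


class Node:
    """One k-d tree node holding a document signature and its id."""

    def __init__(self, point: Signature, doc_id: str, axis: int,
                 left: "Optional[Node]" = None,
                 right: "Optional[Node]" = None) -> None:
        self.point = point
        self.doc_id = doc_id
        self.axis = axis
        self.left = left
        self.right = right


def build(items: List[Tuple[Signature, str]], depth: int = 0) -> Optional[Node]:
    """Build a balanced k-d tree by splitting on the median of each axis in turn."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2
    point, doc_id = items[mid]
    return Node(point, doc_id, axis,
                build(items[:mid], depth + 1),
                build(items[mid + 1:], depth + 1))


def nearest(node: Optional[Node], target: Signature,
            best: Optional[Tuple[float, str]] = None) -> Optional[Tuple[float, str]]:
    """Return (distance, doc_id) of the stored signature closest to target."""
    if node is None:
        return best
    dist = math.dist(node.point, target)
    if best is None or dist < best[0]:
        best = (dist, node.doc_id)
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Only descend the far side if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best


# Made-up signatures for four hypothetical documents.
docs = [
    ((0.10, 0.20, 0.30), "report-v1"),
    ((0.90, 0.80, 0.70), "invoice"),
    ((0.50, 0.50, 0.50), "memo"),
    ((0.11, 0.21, 0.29), "report-v2"),
]
tree = build(docs)
query = (0.10, 0.20, 0.31)
distance, match = nearest(tree, query)
print(match, round(distance, 4))
```

Because every signature has the same small number of dimensions regardless of document length, the tree stays shallow and a lookup can skip most of the corpus. A linear scan would compare the query against every stored document.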
Applications are:
- Detection of duplicate or near duplicate documents in large corpora
- Anti-plagiarism source matching
- Indexing of documents by concept
- Concept driven search engines
- Document clustering
- Document topic inference
- Text and Concept Mining
Features are:
- Easy extension to further languages for which a WordNet model exists (all major languages).
- Efficient, low overhead document matching.
- Can be configured to locate documents from a data source that are near duplicates, differing in formatting or containing typos or revisions.
- Can be configured to analyze the concepts in a document using concept mining, and to generate a list of key words, such as proper nouns, from that document.
- Can be configured to locate documents that are similar in embedded concepts but lexically dissimilar.
- Easy integration with common databases, ASP.NET-based websites, .NET applications and so on.
- Thread-safe for simple multi-user support.
- Java version in development.
- Signatures consist of 9 to 16 floating-point values, depending on language.
- Signature size is independent of document length.
- Signatures are not influenced by formatting or white space.
- Defeats common plagiarism tactics: in anti-plagiarism applications, documents that have been modified by thesaurus-based word substitution, or by rearranging sentence or paragraph order, still generate signatures identical to those of the originals.
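The invariance properties listed above (fixed signature size, no sensitivity to formatting, word order or synonym substitution) can be shown with a toy example. This is a sketch under stated assumptions, not ConceptMine's algorithm: the tiny `CONCEPTS` lexicon is a hypothetical stand-in for a WordNet-style language model, and the `signature` function simply records what share of recognized words falls into each concept.

```python
import re
from typing import Dict, Tuple

# Hypothetical toy lexicon mapping words to concept ids.
# A real system would derive this from a language model such as WordNet.
CONCEPTS: Dict[str, int] = {
    "car": 0, "automobile": 0, "vehicle": 0,
    "fast": 1, "quick": 1, "rapid": 1,
    "buy": 2, "purchase": 2,
}
NUM_CONCEPTS = 3


def signature(text: str) -> Tuple[float, ...]:
    """Return a fixed-length vector: the share of known words in each concept."""
    counts = [0] * NUM_CONCEPTS
    # Lower-casing and extracting letter runs discards case, punctuation and whitespace.
    for word in re.findall(r"[a-z]+", text.lower()):
        concept = CONCEPTS.get(word)
        if concept is not None:
            counts[concept] += 1
    total = sum(counts)
    if total == 0:
        return tuple(0.0 for _ in counts)
    return tuple(c / total for c in counts)


original = "Buy a fast car."
disguised = "   The QUICK automobile:\n\n purchase   "
print(signature(original) == signature(disguised))
```

The two texts share no content words and differ in order and layout, yet both map to one count per concept, so their signatures compare equal. The vector length depends only on the number of concepts, never on the length of the text.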
Requirements: .NET platform
- Visual Studio 2005 for development, or another .NET development environment such as #develop
- Windows 2000, 2003, XP or Vista operating system.
- Linux, Unix, Solaris, Mac OS X support using Mono
- 30 MB of hard disk space
- .NET CLR 2.0
Requirements: Web Service
- Any development environment supporting WSDL web services.
A Spanish version is also available, and further languages can easily be added where a WordNet model exists. Please contact support@scientio.com to discuss language requirements or to request an evaluation.