
Mine concepts and compare documents efficiently for similarity with ConceptMine

ConceptMine is a .Net class library and Web Service that lets you efficiently compare and index documents for similarity, both in lexical terms and in the concepts they contain. It does this using new techniques in Concept Mining.
It uses a model of the English language (Spanish also available) to calculate a short numeric 'signature' that categorizes the concepts found within a given document.
This signature enables simple and efficient lookup of documents in concept space, and thus efficient comparison, indexing and retrieval. The class library contains indexing code based on k-d trees to index documents and locate similar documents in O(log n) time on average.
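The lookup idea can be illustrated with a minimal sketch in Python. It is not ConceptMine's code or API: the names Node, build and nearest, the document ids, and the three-value signatures are all hypothetical (real signatures hold 9 to 16 values). It only shows how a k-d tree over fixed-length numeric signatures finds the closest stored document without scanning every entry.

```python
import math

class Node:
    """One k-d tree node holding a signature and the id of its document."""
    def __init__(self, point, doc_id, axis):
        self.point = point      # the signature, a tuple of floats
        self.doc_id = doc_id    # hypothetical document identifier
        self.axis = axis        # coordinate used to split at this node
        self.left = None
        self.right = None

def build(items, depth=0):
    """Build a k-d tree from a list of (signature, doc_id) pairs."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda item: item[0][axis])
    mid = len(items) // 2
    node = Node(items[mid][0], items[mid][1], axis)
    node.left = build(items[:mid], depth + 1)
    node.right = build(items[mid + 1:], depth + 1)
    return node

def nearest(node, target, best=None):
    """Return (distance, doc_id) of the stored signature closest to target."""
    if node is None:
        return best
    distance = math.dist(node.point, target)
    if best is None or distance < best[0]:
        best = (distance, node.doc_id)
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if it could still contain a closer point.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best

# Made-up signatures, shortened to three values for readability.
docs = [
    ((0.10, 0.20, 0.30), "a"),
    ((0.90, 0.80, 0.70), "b"),
    ((0.50, 0.50, 0.50), "c"),
    ((0.11, 0.21, 0.29), "d"),
]
index = build(docs)
distance, doc_id = nearest(index, (0.88, 0.82, 0.69))
print(doc_id)  # prints: b
```

A near-duplicate check would then compare the returned distance against a threshold chosen for the application; the threshold value is a tuning decision and is not specified here.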

Applications are:
  • Detection of duplicate or near duplicate documents in large corpora
  • Anti-plagiarism source matching 
  • Indexing of documents by concept
  • Concept driven search engines
  • Document clustering
  • Document topic inference
  • Text and Concept Mining
Features are:
  • Easy extension to further languages for which a WordNet model exists (all major languages).
  • Efficient, low overhead document matching.
  • Can be configured to locate documents from a data source that are near duplicates, differing in formatting or containing typos or revisions.
  • Can be configured to provide a concept mining analysis of a document and to generate a list of key words, such as proper nouns, from it.
  • Can be configured to locate documents that are similar in embedded concepts but lexically dissimilar.
  • Easy integration with common databases, ASP.Net based websites, .Net applications etc.
  • Thread-safe for simple multi-user support.
  • Java version in development.
  • Signatures consist of 9-16 floating point values, depending on language.
  • Signature size is independent of document length.
  • Signatures are not influenced by formatting or white space.
  • Defeats common plagiarism tactics: for anti-plagiarism applications, documents that have been modified by thesaurus-based substitution of words, or by re-arranging sentence or paragraph order, will still generate signatures identical to the originals.
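The signature properties listed above (fixed size, insensitivity to formatting and white space, and tolerance of synonym substitution and re-ordering) can be illustrated with a toy sketch in Python. This is not ConceptMine's algorithm: the CONCEPTS table is a hand-written, hypothetical stand-in for a WordNet-style language model, and the names AXES and signature are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon mapping words to a concept label.
# A real system would derive this from a language model such as WordNet.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "fast": "speed", "quick": "speed", "rapid": "speed",
    "road": "route", "street": "route", "highway": "route",
}
# Fixed order of concept dimensions, so every signature has the same length.
AXES = ("vehicle", "speed", "route")

def signature(text):
    """Return the share of each concept among the recognised words."""
    words = re.findall(r"[a-z]+", text.lower())   # drops case, spacing, punctuation
    counts = Counter(CONCEPTS[w] for w in words if w in CONCEPTS)
    total = sum(counts.values())
    if total == 0:
        return tuple(0.0 for _ in AXES)
    return tuple(counts[axis] / total for axis in AXES)

original = "The fast car left the road. The truck took the highway."
reworded = "The truck took   the street.\n\nThe quick automobile left the highway."

print(signature(original) == signature(reworded))  # prints: True
```

Because the toy signature counts concepts rather than words, swapping a word for a synonym in the same concept, re-ordering sentences, or changing white space leaves it unchanged, and its length stays at len(AXES) however long the document is.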
Requirements: .Net™ platform
  • Visual Studio 2005 for development, or another .Net development environment such as #develop
  • Windows 2000, 2003, XP or Vista operating system.
  • Linux, Unix, Solaris, Mac OS X support using Mono
  • 30 MB of hard disk space
  • .Net CLR 2.0

Requirements: Web Service

  • Any development environment supporting WSDL web services.

A Spanish version is also available and further languages can easily be added where a WordNet model exists. Please contact support@scientio.com to discuss language requirements or to request an evaluation.

ConceptMine Documents

  • ConceptMine presentation (ConceptMine presentation.swf)
  • Using concept structures for efficient document comparison and location (PDF)
  • Text and Concept mining briefing document (PDF)

Copyright (c) 2007, Scientio LLC All Rights Reserved