I introduced concept strings in a previous post. To recap, they are a data structure representing a piece of text that encodes part of speech and concept information.
We've been looking at various ways to use large sets of concept strings to make inferences about a new concept string, i.e. to build a machine learning algorithm on top of them.
We've tried several indexing methods, but have finally had success with a Generalized Suffix Tree.
This is an efficient structure for storing sequences that lets you match any part of a new sequence against those already stored. The tree builds in O(n) time, where n is the total length of the stored sequences, and the memory usage scales the same way. You can find a match in time proportional to the length of the sequence you are trying to match, and not the size of the database.
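To make the lookup property concrete, here is a minimal Python sketch. It is not our implementation: it is a naive suffix trie rather than a true Generalized Suffix Tree, so it builds in O(n^2) per sequence instead of O(n), but the lookup behaves the same way, walking one step per query token. The class name `SuffixTrie`, its methods, and the token labels such as `NOUN:animal` are all made up for illustration and are not the real concept string format.

```python
class SuffixTrie:
    """Naive generalized suffix trie over token sequences (illustrative only)."""

    def __init__(self):
        self.children = {}    # token -> child node
        self.seq_ids = set()  # ids of stored sequences that pass through this node

    def add(self, seq_id, tokens):
        # Insert every suffix of the sequence. This is the quadratic part
        # that a real suffix tree construction (e.g. Ukkonen's) avoids.
        for start in range(len(tokens)):
            node = self
            for tok in tokens[start:]:
                if tok not in node.children:
                    node.children[tok] = SuffixTrie()
                node = node.children[tok]
                node.seq_ids.add(seq_id)

    def find(self, query):
        # One dictionary lookup per query token, so the walk depends on
        # len(query) and not on how many sequences are stored.
        node = self
        for tok in query:
            node = node.children.get(tok)
            if node is None:
                return set()
        return set(node.seq_ids)


# Hypothetical token labels, invented for this example.
trie = SuffixTrie()
trie.add(0, ["DET", "NOUN:animal", "VERB:move"])
trie.add(1, ["ADJ", "NOUN:animal", "VERB:move", "ADV"])

shared = trie.find(["NOUN:animal", "VERB:move"])   # found inside both sequences
only_second = trie.find(["VERB:move", "ADV"])      # found only in sequence 1
missing = trie.find(["ADV", "DET"])                # not a substring of either
```

Because every suffix is stored, any contiguous run of tokens from a stored sequence can be found by starting at the root. A real suffix tree gets the same result while compressing chains of single-child nodes into one edge, which is where the linear build time and memory come from.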
On the downside, this was evil to code and test in an object-oriented fashion. There's more work to be done, but expect a web service allowing you to play with them in the next few weeks.