• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 167

text indexing/searching for document management

Hi all,

I have to create a text indexing/searching engine for a document management application.

I remember reading quite a while ago about an engine that used inverted indexing (I think that was the term), combined with dictionary coding, to implement keyword retrieval and compression in a sort of "double whammy". I think it indexed all occurrences of strings with a dictionary 'key', and then shrank the source by storing the dictionary keys rather than the associated strings. This had the advantage of compressing the original text while still being able to search the compressed text - by looking up the dictionary key(s) rather than the associated strings.
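(To picture the scheme: here is a minimal sketch of the inverted-index-plus-dictionary idea in Python. The function names and the simple whitespace tokenization are my own illustration, not from any particular engine.)

```python
from collections import defaultdict

def build_index(docs):
    """Build a dictionary (word -> numeric key), a compressed form of each
    document (list of keys), and an inverted index (key -> doc ids)."""
    dictionary = {}              # word -> key
    compressed = {}              # doc id -> list of keys
    inverted = defaultdict(set)  # key -> ids of docs containing the word
    for doc_id, text in docs.items():
        keys = []
        for word in text.split():
            key = dictionary.setdefault(word, len(dictionary))
            keys.append(key)
            inverted[key].add(doc_id)
        compressed[doc_id] = keys
    return dictionary, compressed, inverted

def search(word, dictionary, inverted):
    """Search by looking up the dictionary key - the raw text is never scanned."""
    key = dictionary.get(word)
    return sorted(inverted.get(key, set())) if key is not None else []
```

With this shape, `search` never touches the original text at all, which is exactly the "double whammy" described above: the same dictionary drives both compression and retrieval.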

I could probably get that up and running but wanted to get some comments before I start what is going to be quite a long project. If anybody has done anything similar, or implemented a different solution I would be very glad if they could point me in the right direction / share experiences.

Many thanks in advance

1 Solution
Here is a demo of how to implement a grep-like search program that may help you get started.

At this link you will find a program with source that shows how to implement a search engine that is able to look inside Zip files:

Hope it helps!
VSF, did you get the code from the first link to work? I get an error on the GetVolumeInformation function on line 207


GaryBond, this sounds like an interesting project. A straightforward version of this engine should not be too hard to write, and if you use a common data backend (Interbase, SQL, or whatever), you should not even have to worry too much about optimizing for efficient data storage.

However, for a really useful version, you would have to consider the following things among others:
* What to do with "random" words, e.g., words that are unique?
* What to do with non-word characters?
* How will your engine react if "invalid" data is to be indexed, like, binary data?
* How do you want to treat text with metadata (HTML, RTF, etc.)?
With a few changes in the code it works... it seems there is some kind of error in the part of the code responsible for showing the drive letters!

GaryBondAuthor Commented:
Hi all,

Firstly, thanks all for your comments - I will have a look at the links over the weekend. (4 days off this weekend here in the UK - woo hoo)

To AvonWyss - yep, I am looking forward to the project - it should be fun. Sadly I don't have the option of using a proprietary product like SQL Server, Interbase, etc. (Otherwise I would leave the text indexing to the database - hey, it could do a better job than me, ha ha). Initially, I was going to do something like this:

1) Scan the data and make a list of all unique words (as delimited by whitespace chars), ranked in order of frequency of occurrence.
2) Assign a one-byte identifier to the most popular 256.
3) From then on assign a two-byte identifier to the other words to complete the dictionary.
4) Compress the data by replacing the words with their dictionary (one- or two-byte) keys.
5) Make an index (probably a linked-list type structure) of the dictionary words followed by every document in which they occur.
6) Searching could then take place by searching for the dictionary keys in the compressed data.
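A minimal Python sketch of steps 1-4 of this plan (illustrative only - the names and exact byte layout are my own assumptions, and it deliberately ignores the one-byte/two-byte ambiguity, which comes up as a problem below):

```python
from collections import Counter

def build_dictionary(texts):
    """Steps 1-3: rank unique words by frequency; the 256 most frequent
    get a one-byte key, the rest get two-byte keys."""
    freq = Counter(w for t in texts for w in t.split())
    key_for = {}
    for rank, (word, _) in enumerate(freq.most_common()):
        if rank < 256:
            key_for[word] = bytes([rank])                # one-byte key
        else:
            n = rank - 256                               # two-byte key
            key_for[word] = bytes([n >> 8, n & 0xFF])
    return key_for

def compress(text, key_for):
    """Step 4: replace each word with its dictionary key."""
    return b"".join(key_for[w] for w in text.split())
```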

The actual data comes from OCR'ing images of the original documents (most of the original documents do not exist any more), and will probably not need to be reproduced. It just exists to provide a 'way in' to the document images.

Some obvious problems:
a) Top limit of 2^16 words in the index, caused by the 2-byte dictionary key. Not very scalable.
b) Needs another flag to say whether a word is represented by a one-byte or two-byte key.
c) The index could be big - depending on how the documents are identified, the linked-list type structures could be large. That sort of negates the compression achieved.
d) The solution is OK if all the data is there at the start - but what if data is to be added after the index is built? I would have to go through the whole process for each set of additional data. Again, not scalable.
e) I would have to keep the OCR data, since 65536 unique words would probably not be enough - so some data would remain uncompressed and un-indexed. (Otherwise we could throw the OCR data away after completing the index.)
f) Your point about metadata is also very valid, and one not addressed either. (I would need a file reader to 'get at' the actual data, I guess, and thereby avoid indexing the metadata.)
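(One common trick for problem b - my own illustration, not from the thread - is to make the keys self-delimiting instead of carrying a separate flag: reserve the high bit of the first byte as a length marker. That gives 128 one-byte keys and 2^15 two-byte keys, and the decoder can walk the compressed stream unambiguously, much like UTF-8 does.)

```python
def encode_key(rank):
    """Encode a word's frequency rank as a self-delimiting key:
    high bit 0 -> one byte (ranks 0..127),
    high bit 1 -> two bytes (ranks 128..128+32767)."""
    if rank < 128:
        return bytes([rank])
    n = rank - 128
    if n >= 1 << 15:
        raise ValueError("rank too large for a two-byte key")
    return bytes([0x80 | (n >> 8), n & 0xFF])

def decode_keys(data):
    """Walk a compressed byte stream and yield the ranks back out."""
    i = 0
    while i < len(data):
        b = data[i]
        if b & 0x80:                                   # two-byte key
            yield 128 + (((b & 0x7F) << 8) | data[i + 1])
            i += 2
        else:                                          # one-byte key
            yield b
            i += 1
```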

Anyhow, sorry to go on a bit, but as you can see the solution at the moment is all in my head. Hence my posting here - to clarify my thinking.

Thanks again for all the comments - please keep 'em coming.
OK, I now have better insight into your project. Note that OCR is likely going to produce many falsely recognized words (for instance, 1st may be recognized as lst, savvy as sawy, etc.), so there will be several index entries for just one word. Not to forget that, if you want to reproduce the text by using the index, you have to make your index case-sensitive, so that "ONE" is not the same as "One", "oNe" or "one".

Therefore, if you want to stay with your initial idea, I would recommend to do the following:
* Make three separate word indexes. I'll show their purpose later on.
* Add flags to your "markers". For instance, you should distinguish between lowercase, UPPERCASE and Normal-cased words, so that each word is only present once in the index.

The three word indexes would be:

* Common words. These words are used very often, like "the", "a", "yes", "no". It does not make sense to store information about which documents they are found in, since they will be in virtually any document, and nobody will really use them for a search.

* Normal words. They appear several times and in different documents, but are not as widely used as common words, and they hold information about which documents contain them. This is the most used and most important index. When a word is used too often, it can be promoted to a common word, and its document information can be dropped.

* Rare words. Words used only very few times will be stored here. When their count goes high enough, they are promoted to normal words. After indexing lots of documents, you can assume that they will not be used for searching and eventually drop them from your index.
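(A sketch of how this three-tier promotion scheme might look in Python - the thresholds are made-up tuning values, and the class and method names are my own, not from the comment above.)

```python
from collections import defaultdict

COMMON_THRESHOLD = 1000   # assumed tuning value: promote to common tier
NORMAL_THRESHOLD = 3      # assumed tuning value: rare -> normal boundary

class TieredIndex:
    """Rare -> normal -> common promotion; document lists (postings)
    are only kept below the common tier."""
    def __init__(self):
        self.count = defaultdict(int)
        self.postings = defaultdict(set)  # word -> doc ids
        self.common = set()

    def add(self, word, doc_id):
        self.count[word] += 1
        if word in self.common:
            return                        # common words: no postings kept
        self.postings[word].add(doc_id)
        if self.count[word] >= COMMON_THRESHOLD:
            self.common.add(word)
            del self.postings[word]       # drop document info on promotion

    def tier(self, word):
        if word in self.common:
            return "common"
        return "normal" if self.count[word] >= NORMAL_THRESHOLD else "rare"
```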

The documents you want to rebuild can basically be stored in three different fashions:
* Original text
* Index-References only
* Mixed text and references

The first one is obviously the easiest to implement. You could also apply some common lossless compression on them to reduce their size.
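(As one example of such a lossless compressor, Python's standard zlib module does a round trip like this - purely illustrative; the implementation language here would of course use whatever equivalent library it has.)

```python
import zlib

text = "the quick brown fox jumps over the lazy dog " * 50
packed = zlib.compress(text.encode("utf-8"), 9)   # 9 = best compression

# Lossless: the original text comes back byte-for-byte.
assert zlib.decompress(packed).decode("utf-8") == text
print(len(text), "->", len(packed), "bytes")
```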

The second one is not a really good idea, since it not only forces you to have every word somewhere in your index, but it will also prevent you from storing any meta-information; even a double space or a line break will not be possible. Therefore it will probably be quite useless.

The third one would allow you to insert special small "markers" in the normal text to substitute words that are present in your index. For instance, you could say that some escape char will say that a marker follows. Then two bytes or so denoting which word in the index is to be used. This may be the best solution, but be aware that index cleanup will also require you to write back the indexed data before removing it from the index.
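(A sketch of this mixed text-and-markers encoding in Python - the choice of escape byte, the two-byte marker width, and the helper names are my own assumptions for illustration.)

```python
ESC = b"\x00"   # assumed escape byte; any byte that never occurs in the text works

def encode_mixed(text, key_for):
    """Replace indexed words with ESC + a two-byte marker; leave everything
    else (spacing, unindexed words) verbatim."""
    out = bytearray()
    for token in text.split(" "):
        if out:
            out += b" "
        if token in key_for:
            n = key_for[token]
            out += ESC + bytes([n >> 8, n & 0xFF])
        else:
            out += token.encode("utf-8")
    return bytes(out)

def decode_mixed(data, word_for):
    """Rebuild the original text by expanding markers back into words."""
    out, i = [], 0
    while i < len(data):
        if data[i:i + 1] == ESC:
            n = (data[i + 1] << 8) | data[i + 2]
            out.append(word_for[n])
            i += 3
        else:
            j = data.find(ESC, i)
            j = len(data) if j < 0 else j
            out.append(data[i:j].decode("utf-8"))
            i = j
    return "".join(out)
```

Note the cleanup issue mentioned above: before a word is removed from the index, every document holding a marker for it would have to be rewritten with the literal word put back.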

I hope that these thoughts are helpful to you.
GaryBondAuthor Commented:
Many thanks, AvonWyss - I have done a few tests now and will broadly go along the route you describe - it's going to be fun 8-)

Thanks for all the help everyone.