
text indexing/searching for document management

Posted on 2002-03-26
Last Modified: 2010-04-04
Hi all,

I have to create a text indexing/searching engine for a document management application.

I remember reading quite a while ago about an engine that used inverted indexing (I think that was the term) with dictionary coding to implement keyword retrieval and compression in a sort of "double whammy". I think it indexed all occurrences of strings with a dictionary 'key', and then shrank the source by storing the dictionary keys rather than the associated strings. This had the advantage of compressing the original text while still being able to search the compressed text - by looking up the dictionary key(s) rather than the associated strings.
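
For readers landing here later, the scheme described above (one dictionary serving both as a compression table and as an inverted index) might be sketched like this in Python. All names and structure are my own illustration, not the engine referred to:

```python
def build_dictionary(docs):
    """Assign each unique word an integer key."""
    words = {}
    for doc in docs:
        for w in doc.split():
            words.setdefault(w, len(words))
    return words

def compress(doc, dictionary):
    """Shrink a document by replacing each word with its dictionary key."""
    return [dictionary[w] for w in doc.split()]

def build_inverted_index(docs, dictionary):
    """Map each dictionary key to the documents containing it."""
    index = {}
    for doc_id, doc in enumerate(docs):
        for key in set(compress(doc, dictionary)):
            index.setdefault(key, []).append(doc_id)
    return index

docs = ["the cat sat", "the dog sat", "a cat ran"]
d = build_dictionary(docs)
idx = build_inverted_index(docs, d)
# Searching for "cat" means one dictionary lookup plus one index lookup,
# without ever decompressing the stored documents.
print(idx[d["cat"]])  # -> [0, 2]
```

The "double whammy" is visible here: the same `dictionary` object is used both to compress and to translate a search term into an index key.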

I could probably get that up and running but wanted to get some comments before I start what is going to be quite a long project. If anybody has done anything similar, or implemented a different solution I would be very glad if they could point me in the right direction / share experiences.

Many thanks in advance

Question by:GaryBond
  • 2
  • 2
  • 2
  • +2


Expert Comment

ID: 6897276
This is a demo of how to implement a grep-like search program, which may help you at the beginning.

At this link you will find a program with source that shows how to implement a search engine that is able to look inside Zip files:

Hope it helps!

Expert Comment

ID: 6897325
VSF, did you get the code from the first link to work? I get an error on the GetVolumeInformation function on line 207


LVL 14

Expert Comment

ID: 6897794
GaryBond, this sounds like an interesting project to me. A straightforward version of this engine should not be too hard to write, and if you use a common data backend (Interbase SQL or whatever), you should not even have to worry too much about optimizing for efficient data storage.

However, for a really useful version, you would have to consider the following things among others:
* What to do with "random" words, e.g. words that are unique?
* What to do with non-word characters?
* How will your engine react if "invalid" data is to be indexed, such as binary data?
* How do you want to treat text with metadata (HTML, RTF, etc.)?
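
The first three of these can be handled up front at tokenization time. A minimal sketch (Python for brevity, though the thread's context is Delphi; the regex and the 30% threshold are my own assumptions):

```python
import re

def tokenize(text):
    """Split on letter runs instead of plain whitespace, so that
    'end.' and 'end' index to the same word and punctuation is ignored."""
    return [w.lower() for w in re.findall(r"[a-zA-Z]+", text)]

def looks_binary(data, threshold=0.3):
    """Crude guard against indexing binary data: reject input where
    more than `threshold` of the bytes are non-printable."""
    if not data:
        return False
    nonprint = sum(b < 9 or 13 < b < 32 or b > 126 for b in data)
    return nonprint / len(data) > threshold

print(tokenize("The end. THE END!"))        # ['the', 'end', 'the', 'end']
print(looks_binary(b"\x00\x01\x02hello"))   # True
```

Unique "random" words are not filtered here; AvonWyss's rare-word handling later in the thread is one way to deal with those.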

Expert Comment

ID: 6898920
With a few changes in the code it works... it seems there is some kind of error in the part responsible for showing the drive letters!


Author Comment

ID: 6898994
Hi all,

Firstly, thanks all for your comments - I will have a look at the links over the weekend. (4 days off this weekend here in the UK - woo hoo)

To AvonWyss - yep, I am looking forward to the project - it should be fun. Sadly I don't have the option of using a proprietary product like SQL Server, Interbase etc. (Otherwise I would leave the text indexing to the database - hey, they could do a better job than me, ha ha). Initially, I was going to do something like this:

1) Scan the data and make a list of all unique words (as delimited by whitespace characters), ranked in order of frequency of occurrence.
2) Assign a one-byte identifier to the 256 most frequent words.
3) From then on, assign a two-byte identifier to the other words to complete the dictionary.
4) Compress the data by replacing the words with their dictionary (one- or two-byte) keys.
5) Make an index (probably a linked-list type structure) of the dictionary words followed by every document in which they occur.
6) Searching could take place by searching for the dictionary keys in the compressed data.
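
Steps 1-4 above could be sketched like this (Python for brevity; note I reserve the top 128 words rather than 256 for one-byte codes, an assumption that lets the high bit of the first byte act as the "two-byte code follows" flag, so no separate flag is needed):

```python
from collections import Counter

def build_ranked_dictionary(docs):
    """Steps 1-3: rank unique words by frequency and assign codes.
    Top 128 words: one byte, high bit clear (0xxxxxxx).
    All others: two bytes, high bit of the first set (1xxxxxxx yyyyyyyy),
    giving a 15-bit id, i.e. 128 + 32768 representable words."""
    freq = Counter(w for doc in docs for w in doc.split())
    codes = {}
    for rank, (word, _) in enumerate(freq.most_common()):
        if rank < 128:
            codes[word] = bytes([rank])
        else:
            n = rank - 128
            codes[word] = bytes([0x80 | (n >> 8), n & 0xFF])
    return codes

def compress(doc, codes):
    """Step 4: replace each word with its dictionary key."""
    return b"".join(codes[w] for w in doc.split())

docs = ["to be or not to be", "not to worry"]
codes = build_ranked_dictionary(docs)
# "to" is the most frequent word, so it gets the first one-byte code.
print(codes["to"])                        # b'\x00'
print(len(compress(docs[0], codes)))      # 6 (six words, one byte each)
```

Word boundaries (spaces, line breaks) are discarded here, which matters if the text ever needs to be reproduced; AvonWyss's later comments address that.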

The actual data comes from OCR'ing images of the original documents (most of the original documents do not exist any more), and will probably not need to be reproduced. It just exists to provide a 'way in' to the document images.

Some obvious problems:
a) Top limit of 2^16 words in the index, caused by the 2-byte dictionary key. Not very scalable.
b) Needs another flag to say whether a word is represented by a one-byte or two-byte key.
c) Index could be big - depending on how the documents are identified, the linked-list type structures could be large. Sort of negates the compression achieved.
d) Solution is OK if all the data is there at the start - but what if data is to be added after the index is built? I would have to go through the whole process for each set of additional data. Again, not scalable.
e) Have to keep the OCR data, since 65536 unique words would probably not be enough - so some data would remain uncompressed and un-indexed. (Otherwise we could throw the OCR data away after we complete the index.)
f) Your point about metadata is also very valid, and not addressed either. (Would need a file reader to 'get at' the actual data, I guess, and thereby not index the metadata.)

Anyhow, sorry to go on a bit, but as you can see the solution at the moment is all in my head. Hence my posting here - to clarify my thinking.

Thanks again for all the comments - please keep 'em coming.
LVL 14

Accepted Solution

AvonWyss earned 200 total points
ID: 6900480
Ok, I now have better insight into your project. Note that OCR is likely to produce many falsely recognized words (for instance, 1st may be recognized as lst, savvy as sawy, etc.), so there will be several index entries for just one word. Not to forget that, if you want to reproduce the text from the index, you have to make your index case-sensitive, so that "ONE" is not the same as "One", "oNe" or "one".

Therefore, if you want to stay with your initial idea, I would recommend the following:
* Make three separate word indexes. I'll show their purpose later on.
* Add flags to your "markers". For instance, you should distinguish between lowercase, UPPERCASE and Normal-cased words, so that each word is only present once in the index.
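
The case-flag idea might look like this (a hypothetical Python sketch; a truly mixed-case word like "oNe" still has to be stored verbatim):

```python
def case_flag(word):
    """Classify a word so the index need only store its lowercase form:
    0 = lowercase, 1 = UPPERCASE, 2 = Normal-cased, 3 = mixed (store verbatim)."""
    if word.islower():
        return 0
    if word.isupper():
        return 1
    if word[:1].isupper() and word[1:].islower():
        return 2
    return 3

def apply_flag(lower_word, flag):
    """Reconstruct the original casing from the stored form and the flag."""
    if flag == 1:
        return lower_word.upper()
    if flag == 2:
        return lower_word.capitalize()
    return lower_word

# "one", "ONE" and "One" all map to a single index entry plus a 2-bit flag.
for w in ["one", "ONE", "One"]:
    assert apply_flag(w.lower(), case_flag(w)) == w
```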

The three word indexes would be:

* Common words. These words are used very often, like "the", "a", "yes", "no". It does not make sense to store information about which documents they are found in, since they will be in virtually every document, and nobody will really use them for a search.

* Normal words. They occur several times and in different documents, but are not as widely used as common words, and they hold information about which documents contain them. This is the most used and most important index. When a word is used too often, it can be promoted to a common word, and its document information can be dropped.

* Rare words. Words used only very few times will be stored here. When their count goes high enough, they are promoted to normal words. After indexing lots of documents, you can assume that the remaining rare words will not be used for searching and eventually drop them from your index.
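
The three-tier scheme with promotion could be sketched as follows (Python for brevity; the thresholds are hypothetical and would need tuning against real data):

```python
RARE_TO_NORMAL = 3       # occurrences before a rare word becomes normal
NORMAL_TO_COMMON = 1000  # occurrences before a word's postings are dropped

class TieredIndex:
    def __init__(self):
        self.rare = {}       # word -> [doc ids], small posting lists
        self.normal = {}     # word -> [doc ids], the real search index
        self.common = set()  # word only; document information dropped
        self.counts = {}

    def add(self, word, doc_id):
        if word in self.common:
            return  # common words are not worth tracking per document
        self.counts[word] = self.counts.get(word, 0) + 1
        tier = self.normal if word in self.normal else self.rare
        tier.setdefault(word, []).append(doc_id)
        if word in self.rare and self.counts[word] >= RARE_TO_NORMAL:
            self.normal[word] = self.rare.pop(word)  # promote rare -> normal
        elif word in self.normal and self.counts[word] >= NORMAL_TO_COMMON:
            del self.normal[word]                    # promote normal -> common
            self.common.add(word)

ti = TieredIndex()
for doc_id in range(3):
    ti.add("cat", doc_id)
# After three occurrences, "cat" has been promoted out of the rare tier.
print("cat" in ti.normal)  # True
```

This also answers GaryBond's scalability worry (d): documents can be added incrementally, and words migrate between tiers as their counts grow.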

The documents you want to rebuild can basically be stored in three different fashions:
* Original text
* Index-References only
* Mixed text and references

The first one is obviously the easiest to implement. You could also apply some common lossless compression on them to reduce their size.

The second one is not a really good idea, since it not only forces you to have every word somewhere in your index, but it will also prevent you from storing any meta-information; even a double space or a line break will not be possible. Therefore this will probably be quite useless.

The third one would allow you to insert special small "markers" in the normal text to substitute words that are present in your index. For instance, you could say that some escape char signals that a marker follows, then two bytes or so denoting which word in the index is to be used. This may be the best solution, but be aware that index cleanup will also require you to write back the indexed data before removing it from the index.
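
The mixed text-and-references storage could be sketched like this (Python for brevity; the choice of 0x1B as the escape byte is my own and assumes it never occurs in the ASCII text itself):

```python
import re

ESC = 0x1B  # hypothetical escape byte introducing a 2-byte index reference

def encode(text, word_ids):
    """Mixed storage: words found in the index become ESC + 2-byte key;
    everything else (spacing, line breaks, punctuation, unknown words)
    is kept verbatim, so the document can be rebuilt exactly."""
    out = bytearray()
    for tok in re.split(r"(\s+)", text):  # keep whitespace runs as tokens
        if tok in word_ids:
            wid = word_ids[tok]
            out += bytes([ESC, wid >> 8, wid & 0xFF])
        else:
            out += tok.encode("ascii")
    return bytes(out)

def decode(data, id_words):
    """Rebuild the original text from the mixed stream."""
    out, i = [], 0
    while i < len(data):
        if data[i] == ESC:
            out.append(id_words[(data[i + 1] << 8) | data[i + 2]])
            i += 3
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)

word_ids = {"document": 0, "management": 1}
id_words = {v: k for k, v in word_ids.items()}
text = "my document  management\nplan"
# The double space and line break survive the round trip intact.
assert decode(encode(text, word_ids), id_words) == text
```

The round-trip property is exactly why index cleanup needs a write-back step: once a word's id is deleted from the index, every stored document still referencing that id must first be rewritten with the word spelled out.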

I hope that these thoughts are helpful to you.

Author Comment

ID: 6944253
Many thanks AvonWyss - I have done a few tests now and will broadly go along the route you describe - it's going to be fun 8-)

Thanks for all the help everyone.
