Solved

text indexing/searching for document management

Posted on 2002-03-26
Last Modified: 2010-04-04
Hi all,

I have to create a text indexing/searching engine for a document management application.

I remember reading quite a while ago about an engine that used inverted indexing (I think it was called), with dictionary coding, to implement keyword retrieval and compression in a sort of "double whammy". I think it indexed all occurrences of strings against a dictionary 'key', and then shrank the source by storing the dictionary keys rather than the associated strings. This gave the double advantage of compressing the original text and being able to search the compressed text directly - by looking up the dictionary key(s) rather than the associated strings.
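
Roughly the idea as I remember it, in a quick Delphi sketch - the names and the plain integer-key representation are just placeholders, not a real design:

uses
  Classes;

{ The dictionary maps every distinct word to a small integer key. Compressed
  text is then just a sequence of keys, and searching means looking up the
  query word's key instead of doing string matching. }

function KeyForWord(Dict: TStringList; const AWord: string): Integer;
begin
  // Dict is assumed to be sorted; the key is stored in Objects[] as an integer
  if not Dict.Find(AWord, Result) then
    Result := -1
  else
    Result := Integer(Dict.Objects[Result]);
end;

function CompressedTextContains(const Keys: array of Integer;
  Dict: TStringList; const QueryWord: string): Boolean;
var
  Key, I: Integer;
begin
  Result := False;
  Key := KeyForWord(Dict, QueryWord);
  if Key < 0 then
    Exit; // word not in the dictionary, so it occurs in no document
  for I := Low(Keys) to High(Keys) do
    if Keys[I] = Key then
    begin
      Result := True;
      Exit;
    end;
end;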

I could probably get that up and running, but I wanted to get some comments before I start what is going to be quite a long project. If anybody has done anything similar, or implemented a different solution, I would be very glad if they could point me in the right direction / share experiences.

Many thanks in advance

Gary.
Question by:GaryBond
8 Comments
 
LVL 9

Expert Comment

by:ginsonic
listening
 
LVL 3

Expert Comment

by:VSF
This is a demo showing how to implement a grep-like search program; it may help you get started.
http://www.victory.hpg.com.br/UpLoads/GREP%20Sample.zip

At this link you will find a program with source that shows how to implement a search engine that is able to look inside Zip files:
www.pcmag.com/article/0,2997,s%253D1476%2526a%253D21498,00.asp

Hope it helps!
VSF
www.victory.hpg.com.br
www.boatoda.hpg.com.br
 
LVL 1

Expert Comment

by:mpoots
VSF, did you get the code from the first link to work? I get an error on the GetVolumeInformation function on line 207.

Marcel
 
LVL 14

Expert Comment

by:AvonWyss
GaryBond, this sounds like an interesting project to me. A straightforward version of this engine should not be too hard to write, and if you use a common database backend (Interbase, SQL Server or whatever), you should not even have to worry too much about optimizing for efficient data storage.

However, for a really useful version, you would have to consider the following things among others:
* What to do with "random" words, e.g., words that are unique?
* What to do with non-word characters?
* How will your engine react if "invalid" data, such as binary data, is to be indexed?
* How do you want to treat text with metadata (HTML, RTF, etc.)?

 
LVL 3

Expert Comment

by:VSF
mpoots:
With a few changes in the code it works... it seems there is some kind of error in the part responsible for showing the drive letters!

VSF
 
LVL 1

Author Comment

by:GaryBond
Hi all,

Firstly, thanks all for your comments - I will have a look at the links over the weekend. (4 days off this weekend here in the UK - woo hoo!)

To AvonWyss - yep, I am looking forward to the project - it should be fun. Sadly I don't have the option of using a proprietary product like SQL Server, Interbase etc. (otherwise I would leave the text indexing to the database - hey, it could do a better job than me, ha ha). Initially, I was going to do something like this:

1) Scan the data and make a list of all unique words (as defined by whitespace chars), ranked in order of frequency of occurrence.
2) Assign a one-byte identifier to the most popular 256.
3) From then on, assign a two-byte identifier to the other words to complete the dictionary.
4) Compress the data by replacing the words with their dictionary (one- or two-byte) keys - see the rough sketch after this list.
5) Make an index (probably a linked-list type structure) of the dictionary words followed by every document in which they occur.
6) Searching could then take place by searching for the dictionary keys in the compressed data.
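
Something like this is what I had in mind for step 4 - only a sketch, and note that using the high bit of the first byte as the one-/two-byte flag would cut the one-byte range down to the 128 most frequent words (which ties in with problems (a) and (b) below):

uses
  Classes;

{ One possible key encoding, just to make the idea concrete: the high bit of
  the first byte flags a two-byte key. That gives 128 one-byte keys for the
  most frequent words and up to 32768 distinct keys overall. }

procedure WriteKey(Stream: TStream; Key: Integer);
var
  B: Byte;
begin
  if Key < 128 then
  begin
    B := Key;                          // 0xxxxxxx: one-byte key
    Stream.Write(B, 1);
  end
  else
  begin
    B := $80 or ((Key shr 8) and $7F); // 1xxxxxxx: first byte of a two-byte key
    Stream.Write(B, 1);
    B := Key and $FF;                  // second byte: low 8 bits of the key
    Stream.Write(B, 1);
  end;
end;

function ReadKey(Stream: TStream): Integer;
var
  B: Byte;
begin
  Stream.Read(B, 1);
  if (B and $80) = 0 then
    Result := B                        // one-byte key
  else
  begin
    Result := (B and $7F) shl 8;       // two-byte key: high bits came first
    Stream.Read(B, 1);
    Result := Result or B;
  end;
end;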

The actual data comes from OCR'ing images of the original documents (most of the original documents do not exist any more), and will probably not need to be reproduced. It just exists to provide a 'way in' to the document images.

Some obvious problems:
a) Top limit of 2^16 words in the index, caused by the 2-byte dictionary key. Not very scalable.
b) Needs another flag to say whether a word is represented by a one-byte or a two-byte key.
c) The index could be big - depending on how the documents are identified, the linked-list type structures could be large. That sort of negates the compression achieved.
d) The solution is OK if all the data is there at the start - but what if data is to be added after the index is built? I would have to go through the whole process for each set of additional data. Again, not scalable.
e) I have to keep the OCR data, since 65536 unique words would probably not be big enough - so some data would remain uncompressed and un-indexed. (Otherwise we could throw the OCR data away after completing the index.)
f) Your point about meta-data is also very valid, and not addressed either. (I would need a file reader to 'get at' the actual data, I guess, and thereby not index the meta-data.)

Anyhow, sorry to go on a bit, but as you can see the solution at the moment is all in my head. Hence my posting here - to clarify my thinking.

Thanks again for all the comments - please keep 'em coming.
Gary.
 
LVL 14

Accepted Solution

by:AvonWyss (earned 200 total points)
OK, I now have a better insight into your project. Note that OCR output is likely to contain many falsely recognized words (for instance, 1st may be recognized as lst, savvy as sawy, etc.), so there will be several index entries for just one word. Not to forget that, if you want to reproduce the text by using the index, you have to make your index case-sensitive, so that "ONE" is not the same as "One", "oNe" or "one".

Therefore, if you want to stay with your initial idea, I would recommend the following:
* Make three separate word indexes. I'll show their purpose later on.
* Add flags to your "markers". For instance, you should distinguish between lowercase, UPPERCASE and Normal-cased words, so that each word is only present once in the index (see the small sketch after this list).
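
For instance, the case flag could be derived along these lines (only a sketch - the names are arbitrary):

uses
  SysUtils;

type
  // Arbitrary names for the sketch; the flag would be stored with the marker
  // so that only one (lowercase) form of each word has to be kept in the index.
  TWordCase = (wcLower, wcUpper, wcNormal, wcMixed);

function ClassifyCase(const AWord: string): TWordCase;
begin
  Result := wcMixed;
  if AWord = '' then
    Exit;
  if AWord = AnsiLowerCase(AWord) then
    Result := wcLower                   // "one"
  else if AWord = AnsiUpperCase(AWord) then
    Result := wcUpper                   // "ONE"
  else if AWord = AnsiUpperCase(AWord[1]) +
                  AnsiLowerCase(Copy(AWord, 2, MaxInt)) then
    Result := wcNormal                  // "One"
  else
    Result := wcMixed;                  // "oNe" - store such words verbatim
end;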

The three word indexes would be:

* Common words. These words are used very often, like "the", "a", "yes", "no". It does not make sense to store information about which documents they are found in, since they will appear in virtually every document, and nobody will really use them for a search.

* Normal words. These appear several times and in different documents, but they are not as widely used as common words; the index holds information about which documents contain them. This is the most used and most important index. When a word is used too often, it can be promoted to a common word, and its document information can be dropped.

* Rare words. Words used only very few times will be stored here. When their count goes high enough, they are promoted to normal words. After indexing lots of documents, you can assume that the remaining rare words will not be used for searching and eventually drop them from your index.
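
To make the promotion idea concrete, a little sketch - the thresholds and names are made up and would need tuning against your real data:

uses
  SysUtils, Classes;

type
  TWordTier = (wtRare, wtNormal, wtCommon);

  // One entry per distinct word; Documents is assumed to be created as a
  // sorted TStringList with Duplicates := dupIgnore when the entry is made.
  TWordEntry = class
    Tier: TWordTier;
    Count: Integer;         // total occurrences seen so far
    Documents: TStringList; // dropped once the word becomes "common"
  end;

const
  // Made-up thresholds - they would need tuning against the real corpus
  PromoteToNormal = 5;
  PromoteToCommon = 10000;

procedure CountOccurrence(Entry: TWordEntry; const DocId: string);
begin
  Inc(Entry.Count);
  if Entry.Tier = wtCommon then
    Exit;                          // common words carry no document information
  Entry.Documents.Add(DocId);
  if (Entry.Tier = wtRare) and (Entry.Count >= PromoteToNormal) then
    Entry.Tier := wtNormal
  else if (Entry.Tier = wtNormal) and (Entry.Count >= PromoteToCommon) then
  begin
    Entry.Tier := wtCommon;        // too frequent to be useful for searching
    FreeAndNil(Entry.Documents);   // its document list can be dropped
  end;
end;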

The documents you want to rebuild can basically be stored in three different fashions:
* Original text
* Index-References only
* Mixed text and references

The first one is obviously the easiest to implement. You could also apply some common lossless compression on them to reduce their size.

The second one is not really a good idea, since it not only forces you to have every word somewhere in your index, but it also prevents you from storing any meta-information; even a double space or a line break will not be possible. Therefore this will probably be quite useless.

The third one would allow you to insert special small "markers" in the normal text to substitute words that are present in your index. For instance, you could define an escape char that says a marker follows, then two bytes or so denoting which word in the index is to be used. This may be the best solution, but be aware that index cleanup will also require you to write back the indexed data before removing it from the index.
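
As a rough illustration of such markers (the escape byte and the two-byte layout here are arbitrary choices, not a fixed format):

uses
  Classes;

const
  MarkerEscape = 1; // arbitrary choice: a byte value that cannot occur in OCR text

{ Writes either a plain word or an index marker: escape byte followed by a
  two-byte word number. Everything that is not an indexed word (spaces, line
  breaks, punctuation, unknown words) is written through unchanged, so
  meta-information survives. WordIndex < 0 means "not in the index". }
procedure WriteWordOrMarker(Output: TStream; const AWord: string;
  WordIndex: Integer);
var
  B: Byte;
begin
  if WordIndex < 0 then
  begin
    if AWord <> '' then
      Output.Write(AWord[1], Length(AWord)); // plain text, copied as-is
  end
  else
  begin
    B := MarkerEscape;
    Output.Write(B, 1);
    B := (WordIndex shr 8) and $FF;          // high byte of the word number
    Output.Write(B, 1);
    B := WordIndex and $FF;                  // low byte of the word number
    Output.Write(B, 1);
  end;
end;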

I hope that these thoughts are helpful to you.
 
LVL 1

Author Comment

by:GaryBond
Many thanks AvonWyss - I have done a few tests now and will broadly go along the route you describe - it's going to be fun 8-)

Thanks for all the help everyone.
regards
Gary
