Information retrieval

Actually I've asked this question before, but I was pretty much stumbling in the dark, so I didn't get the terms right, and my qustion was therefore sort of impossible to answer (though BigRat did give me some ideas, thanks!). Anyway, this time I'm more specific and I'm hoping that reposting will make some more people notice it. Here it comes:

I want to retrieve sorted documents from a _static_ database, which I will build myself. Simplified example: Imagine a lot of rather small documents (maybe 50-800 words each). I want to place them all in one big file, and build an index with all unique words. When this is done, I can lookup any word in the index, and it will show me which documents this particular word exists in. So, I lookup say, 'exchange', in the index, and I have stored the information that this word exists in document 1, 4, 6.... and so on.

This can be done in several ways, and is NOT the problem.

Now imagine that my documents all have some fields in common (let's pretend the documents are bookreviews), eg. writer, publisher, and maybe the year it was written. I would make an index for each field, making it possible to limit the search to writer or publisher. This isn't the problem either.

The problem is that I NEED to be able to display the result sorted in different ways. The above will retrieve the documents in the order they where stored (at least with the implementation methods I can think of). So how do I sort the retrieved result? If I got 500.000 hits, I really don't want to make a sort, which will require a lookup in each document! I would probably need some presorted lookup-tables, which is very possible since the database is 100% static, once it's build.

What I need is information on how to do this. I already bought 2 books, and despite promising titles they where of absolutely no use.

This is a very specific use of information retrieval, but my hope is that some of you out there have knowledge about this, or know where to get it! Any comments, links or booktitles appreciated.

Regards,
Pede
LVL 1
pedeAsked:
Who is Participating?
 
alexoConnect With a Mentor Commented:
Two solutions come to mind.  The first one is feasible only for static databases (such as yours) with a limited number of key choices.  The second one is general.

1) Create a lot of index files.
Each index file will consist of pre-sorted key values of one particular key with pointers/links to the corresponding records.  Retreiving the info sorted by the key is as easy as iterating through the index file.

2) Use an SQL database (via ODBC if it helps) and issue "sorted by" queries.
0
 
alexoCommented:
In case I wasn't clear enough, the index files will contain keys sorted by different criteria.
For example:
 - book names sorted by name
 - book names sorted by author
 - book names sorted by ISBN
 - book authors sorted by name
etc.

Since the DB is 100% static creating the files is no biggie.
0
 
pedeAuthor Commented:
Once again it seems that a hint was what I needed. I will start with the last one.

2. Im not going to use SQL. This is a very specific task, and it should be possible to make smaller and faster systems, than some free SQL-base could provide.

1. A lot of index'es would require way too much space, and this is a very important issue, both concerning RAM and disk (that would be CD-ROM). But you did point me in the right direction (at least I think so ;) so I really can't reject your answer. If you got any ideas on 'less than', 'greater than' and ranges, pleeeesssaseee let me know :)

Thanks,
Pede


0
Easily manage email signatures in Office 365

Managing email signatures in Office 365 can be a challenging task if you don't have the right tool. CodeTwo Email Signatures for Office 365 will help you implement a unified email signature look, no matter what email client is used by users. Test it for free!

 
alexoCommented:
Pede, it is your decision however I prefer not to get get a grade at all than to get a 'C' or 'D'  grade.  If you feel that my answer is not good enough we can continue the discussion until you are either satisfied or another experts provides a better solution.

>> If you got any ideas on 'less than', 'greater than' and ranges [...]
It depends on your implementation.  What did you decide to do?
0
 
pedeAuthor Commented:
Sorry, I didn't think about that. It's not that it was a bad answer, I just didn't use it directly. I don't suppose I could change it now? If I can, please let me know!

My approach is:
When I'm not sorting I work with bitmaps, 1 bit for each record. There is a bitmap attached to each indexword, so finding a word in the index gives me a bitmap where all the 1's are records where this word exists, and the 0's are (of course) documents where this word doesn't exist. When stored on disk, the bitmaps are compressed so they won't take up much space. I will work with bitmaps because AND'ing, OR'ing and NOT'ing is fast and easy.
When it comes to sort, I would make two files for each sortable field. One for ascending and one for descending sort. These files will contain 32-bit values, containing a record number, coresponding to the big file with all records. I will traverse one of these files (depending on witch sort is wanted), and for each value I will check if the coresponding bit in the bitmap to be sorted, is set. IF this bit is set, I will add the 32-bit value to a list. When I'm done, I will have a list containing 32-bit values, sorted by the wanted field, and only containing the records which where set in the bitmap -> a sorted result set.
The difference is that I now have 32-bit values, instead of a bitmap, but that's fine with me. If I want to do further AND, OR or NOT'ing, I would do it on the bitmap, and the sorting will be gone, unless I sort again. All my routines will have to work with both bitmaps and pointers (let's call them Pointermaps), but their semantics are the same, so it's my best solution so far.

Now I'm onto GT (greater than), LT (less than) and [...] (ranges). It seems that the sort-files could be used for this too, but that would require some additional information in the files, making them grow (maybe too) much.

I'm also worried about wildcards. I need at least to be able to search for 'exp' and find all words starting with 'exp', eg. 'experts'. This looks like pretty much the same problem as GT, LT.

Regards,
Pede





0
 
alexoCommented:
>> I don't suppose I could change it now? If I can, please let me know!
Well, you can post a 0-point "question" to the customer support area describing the situation and they (probably Linda) will take care of it.

Now, when dealing with ranges( including GT and LT) you hit the eternal compromise: either size or speed.  Obviously you need access to the field VALUES for ranges and you either save them in the ascending/descending files (size issues) or deal with indirection to the main file for every record (speed issues).  You can also use a more flexible data structure, like a binary tree, instead of a flat file.  Hmmm...  On second thought, why don't you check which data structure provides for fast access times and searches and small size (ignore insertions, etc.)  There are many data structures and algorithms books (Knuth is a classic) or you can scavenge the net resources.
0
 
pedeAuthor Commented:
After browsing the net for 4 hours I still haven't found "the" source of information. Actually I haven't found anything I could use. Only short notices about books, and descriptions of courses in computer science. If you know of any information about my use of information retrieval (ESPECIALLY on the range subject), please make a comment.

And regarding the range subject: Even if I include the values in the asc/desc files, I will still have to traverse them, so a search on, say, >0 could mean traversing the whole file, and OR'ing the hits for every word. I suspect this could be pretty slow.

I have posted the request on customer support, BTW.

Regards,
Pede


0
 
alexoCommented:
You'll traverse one index file without indirections to the main one.
About the net, search for B-trees (or B+ trees)
0
 
pedeAuthor Commented:
I can only agree that it's faster than using indirections into the main file, but still it could be traversing of a megabyte or more. It's the only solution I can think of, but I was hoping that some genious method existed that I didn't know of.

I know about B(+)-trees, but which problem exactly will they solve for me?

/Pede
0
 
alexoCommented:
>> I know about B(+)-trees, but which problem exactly will they solve for me?
Using a tree structure for indexing is faster and more versatile than flat files.  Once you decide to implement more complex queries you'll appreciate it.  Since B(+)-trees are well known, there are numerous free implementations of them (some are in the public domain, some include source).
0
 
linda101698Commented:
Once a question is graded and moved to the Previously Asked Questions (PAQ), I do not have any way to change the grade.

The best I can do is post a question to award alexo additional points.  See the question in this topic area alexo.

Linda Gardner
Customer Service @ Experts Exchange
0
All Courses

From novice to tech pro — start learning today.