Solved

Information retrieval

Posted on 1998-12-22
11
274 Views
Last Modified: 2011-10-03
Actually I've asked this question before, but I was pretty much stumbling in the dark, so I didn't get the terms right, and my qustion was therefore sort of impossible to answer (though BigRat did give me some ideas, thanks!). Anyway, this time I'm more specific and I'm hoping that reposting will make some more people notice it. Here it comes:

I want to retrieve sorted documents from a _static_ database, which I will build myself. Simplified example: Imagine a lot of rather small documents (maybe 50-800 words each). I want to place them all in one big file, and build an index with all unique words. When this is done, I can lookup any word in the index, and it will show me which documents this particular word exists in. So, I lookup say, 'exchange', in the index, and I have stored the information that this word exists in document 1, 4, 6.... and so on.

This can be done in several ways, and is NOT the problem.

Now imagine that my documents all have some fields in common (let's pretend the documents are bookreviews), eg. writer, publisher, and maybe the year it was written. I would make an index for each field, making it possible to limit the search to writer or publisher. This isn't the problem either.

The problem is that I NEED to be able to display the result sorted in different ways. The above will retrieve the documents in the order they where stored (at least with the implementation methods I can think of). So how do I sort the retrieved result? If I got 500.000 hits, I really don't want to make a sort, which will require a lookup in each document! I would probably need some presorted lookup-tables, which is very possible since the database is 100% static, once it's build.

What I need is information on how to do this. I already bought 2 books, and despite promising titles they where of absolutely no use.

This is a very specific use of information retrieval, but my hope is that some of you out there have knowledge about this, or know where to get it! Any comments, links or booktitles appreciated.

Regards,
Pede
0
Comment
Question by:pede
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 4
11 Comments
 
LVL 11

Accepted Solution

by:
alexo earned 100 total points
ID: 1255366
Two solutions come to mind.  The first one is feasible only for static databases (such as yours) with a limited number of key choices.  The second one is general.

1) Create a lot of index files.
Each index file will consist of pre-sorted key values of one particular key with pointers/links to the corresponding records.  Retreiving the info sorted by the key is as easy as iterating through the index file.

2) Use an SQL database (via ODBC if it helps) and issue "sorted by" queries.
0
 
LVL 11

Expert Comment

by:alexo
ID: 1255367
In case I wasn't clear enough, the index files will contain keys sorted by different criteria.
For example:
 - book names sorted by name
 - book names sorted by author
 - book names sorted by ISBN
 - book authors sorted by name
etc.

Since the DB is 100% static creating the files is no biggie.
0
 
LVL 1

Author Comment

by:pede
ID: 1255368
Once again it seems that a hint was what I needed. I will start with the last one.

2. Im not going to use SQL. This is a very specific task, and it should be possible to make smaller and faster systems, than some free SQL-base could provide.

1. A lot of index'es would require way too much space, and this is a very important issue, both concerning RAM and disk (that would be CD-ROM). But you did point me in the right direction (at least I think so ;) so I really can't reject your answer. If you got any ideas on 'less than', 'greater than' and ranges, pleeeesssaseee let me know :)

Thanks,
Pede


0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 
LVL 11

Expert Comment

by:alexo
ID: 1255369
Pede, it is your decision however I prefer not to get get a grade at all than to get a 'C' or 'D'  grade.  If you feel that my answer is not good enough we can continue the discussion until you are either satisfied or another experts provides a better solution.

>> If you got any ideas on 'less than', 'greater than' and ranges [...]
It depends on your implementation.  What did you decide to do?
0
 
LVL 1

Author Comment

by:pede
ID: 1255370
Sorry, I didn't think about that. It's not that it was a bad answer, I just didn't use it directly. I don't suppose I could change it now? If I can, please let me know!

My approach is:
When I'm not sorting I work with bitmaps, 1 bit for each record. There is a bitmap attached to each indexword, so finding a word in the index gives me a bitmap where all the 1's are records where this word exists, and the 0's are (of course) documents where this word doesn't exist. When stored on disk, the bitmaps are compressed so they won't take up much space. I will work with bitmaps because AND'ing, OR'ing and NOT'ing is fast and easy.
When it comes to sort, I would make two files for each sortable field. One for ascending and one for descending sort. These files will contain 32-bit values, containing a record number, coresponding to the big file with all records. I will traverse one of these files (depending on witch sort is wanted), and for each value I will check if the coresponding bit in the bitmap to be sorted, is set. IF this bit is set, I will add the 32-bit value to a list. When I'm done, I will have a list containing 32-bit values, sorted by the wanted field, and only containing the records which where set in the bitmap -> a sorted result set.
The difference is that I now have 32-bit values, instead of a bitmap, but that's fine with me. If I want to do further AND, OR or NOT'ing, I would do it on the bitmap, and the sorting will be gone, unless I sort again. All my routines will have to work with both bitmaps and pointers (let's call them Pointermaps), but their semantics are the same, so it's my best solution so far.

Now I'm onto GT (greater than), LT (less than) and [...] (ranges). It seems that the sort-files could be used for this too, but that would require some additional information in the files, making them grow (maybe too) much.

I'm also worried about wildcards. I need at least to be able to search for 'exp' and find all words starting with 'exp', eg. 'experts'. This looks like pretty much the same problem as GT, LT.

Regards,
Pede





0
 
LVL 11

Expert Comment

by:alexo
ID: 1255371
>> I don't suppose I could change it now? If I can, please let me know!
Well, you can post a 0-point "question" to the customer support area describing the situation and they (probably Linda) will take care of it.

Now, when dealing with ranges( including GT and LT) you hit the eternal compromise: either size or speed.  Obviously you need access to the field VALUES for ranges and you either save them in the ascending/descending files (size issues) or deal with indirection to the main file for every record (speed issues).  You can also use a more flexible data structure, like a binary tree, instead of a flat file.  Hmmm...  On second thought, why don't you check which data structure provides for fast access times and searches and small size (ignore insertions, etc.)  There are many data structures and algorithms books (Knuth is a classic) or you can scavenge the net resources.
0
 
LVL 1

Author Comment

by:pede
ID: 1255372
After browsing the net for 4 hours I still haven't found "the" source of information. Actually I haven't found anything I could use. Only short notices about books, and descriptions of courses in computer science. If you know of any information about my use of information retrieval (ESPECIALLY on the range subject), please make a comment.

And regarding the range subject: Even if I include the values in the asc/desc files, I will still have to traverse them, so a search on, say, >0 could mean traversing the whole file, and OR'ing the hits for every word. I suspect this could be pretty slow.

I have posted the request on customer support, BTW.

Regards,
Pede


0
 
LVL 11

Expert Comment

by:alexo
ID: 1255373
You'll traverse one index file without indirections to the main one.
About the net, search for B-trees (or B+ trees)
0
 
LVL 1

Author Comment

by:pede
ID: 1255374
I can only agree that it's faster than using indirections into the main file, but still it could be traversing of a megabyte or more. It's the only solution I can think of, but I was hoping that some genious method existed that I didn't know of.

I know about B(+)-trees, but which problem exactly will they solve for me?

/Pede
0
 
LVL 11

Expert Comment

by:alexo
ID: 1255375
>> I know about B(+)-trees, but which problem exactly will they solve for me?
Using a tree structure for indexing is faster and more versatile than flat files.  Once you decide to implement more complex queries you'll appreciate it.  Since B(+)-trees are well known, there are numerous free implementations of them (some are in the public domain, some include source).
0
 
LVL 7

Expert Comment

by:linda101698
ID: 1255376
Once a question is graded and moved to the Previously Asked Questions (PAQ), I do not have any way to change the grade.

The best I can do is post a question to award alexo additional points.  See the question in this topic area alexo.

Linda Gardner
Customer Service @ Experts Exchange
0

Featured Post

Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

An Outlet in Cocoa is a persistent reference to a GUI control; it connects a property (a variable) to a control.  For example, it is common to create an Outlet for the text field GUI control and change the text that appears in this field via that Ou…
Windows programmers of the C/C++ variety, how many of you realise that since Window 9x Microsoft has been lying to you about what constitutes Unicode (http://en.wikipedia.org/wiki/Unicode)? They will have you believe that Unicode requires you to use…
The goal of this video is to provide viewers with basic examples to understand and use structures in the C programming language.
Video by: Grant
The goal of this video is to provide viewers with basic examples to understand and use while-loops in the C programming language.

636 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question