Solved

Information retrieval

Posted on 1998-12-22
11
268 Views
Last Modified: 2011-10-03
Actually I've asked this question before, but I was pretty much stumbling in the dark, so I didn't get the terms right, and my qustion was therefore sort of impossible to answer (though BigRat did give me some ideas, thanks!). Anyway, this time I'm more specific and I'm hoping that reposting will make some more people notice it. Here it comes:

I want to retrieve sorted documents from a _static_ database, which I will build myself. Simplified example: Imagine a lot of rather small documents (maybe 50-800 words each). I want to place them all in one big file, and build an index with all unique words. When this is done, I can lookup any word in the index, and it will show me which documents this particular word exists in. So, I lookup say, 'exchange', in the index, and I have stored the information that this word exists in document 1, 4, 6.... and so on.

This can be done in several ways, and is NOT the problem.

Now imagine that my documents all have some fields in common (let's pretend the documents are bookreviews), eg. writer, publisher, and maybe the year it was written. I would make an index for each field, making it possible to limit the search to writer or publisher. This isn't the problem either.

The problem is that I NEED to be able to display the result sorted in different ways. The above will retrieve the documents in the order they where stored (at least with the implementation methods I can think of). So how do I sort the retrieved result? If I got 500.000 hits, I really don't want to make a sort, which will require a lookup in each document! I would probably need some presorted lookup-tables, which is very possible since the database is 100% static, once it's build.

What I need is information on how to do this. I already bought 2 books, and despite promising titles they where of absolutely no use.

This is a very specific use of information retrieval, but my hope is that some of you out there have knowledge about this, or know where to get it! Any comments, links or booktitles appreciated.

Regards,
Pede
0
Comment
Question by:pede
  • 6
  • 4
11 Comments
 
LVL 11

Accepted Solution

by:
alexo earned 100 total points
ID: 1255366
Two solutions come to mind.  The first one is feasible only for static databases (such as yours) with a limited number of key choices.  The second one is general.

1) Create a lot of index files.
Each index file will consist of pre-sorted key values of one particular key with pointers/links to the corresponding records.  Retreiving the info sorted by the key is as easy as iterating through the index file.

2) Use an SQL database (via ODBC if it helps) and issue "sorted by" queries.
0
 
LVL 11

Expert Comment

by:alexo
ID: 1255367
In case I wasn't clear enough, the index files will contain keys sorted by different criteria.
For example:
 - book names sorted by name
 - book names sorted by author
 - book names sorted by ISBN
 - book authors sorted by name
etc.

Since the DB is 100% static creating the files is no biggie.
0
 
LVL 1

Author Comment

by:pede
ID: 1255368
Once again it seems that a hint was what I needed. I will start with the last one.

2. Im not going to use SQL. This is a very specific task, and it should be possible to make smaller and faster systems, than some free SQL-base could provide.

1. A lot of index'es would require way too much space, and this is a very important issue, both concerning RAM and disk (that would be CD-ROM). But you did point me in the right direction (at least I think so ;) so I really can't reject your answer. If you got any ideas on 'less than', 'greater than' and ranges, pleeeesssaseee let me know :)

Thanks,
Pede


0
 
LVL 11

Expert Comment

by:alexo
ID: 1255369
Pede, it is your decision however I prefer not to get get a grade at all than to get a 'C' or 'D'  grade.  If you feel that my answer is not good enough we can continue the discussion until you are either satisfied or another experts provides a better solution.

>> If you got any ideas on 'less than', 'greater than' and ranges [...]
It depends on your implementation.  What did you decide to do?
0
 
LVL 1

Author Comment

by:pede
ID: 1255370
Sorry, I didn't think about that. It's not that it was a bad answer, I just didn't use it directly. I don't suppose I could change it now? If I can, please let me know!

My approach is:
When I'm not sorting I work with bitmaps, 1 bit for each record. There is a bitmap attached to each indexword, so finding a word in the index gives me a bitmap where all the 1's are records where this word exists, and the 0's are (of course) documents where this word doesn't exist. When stored on disk, the bitmaps are compressed so they won't take up much space. I will work with bitmaps because AND'ing, OR'ing and NOT'ing is fast and easy.
When it comes to sort, I would make two files for each sortable field. One for ascending and one for descending sort. These files will contain 32-bit values, containing a record number, coresponding to the big file with all records. I will traverse one of these files (depending on witch sort is wanted), and for each value I will check if the coresponding bit in the bitmap to be sorted, is set. IF this bit is set, I will add the 32-bit value to a list. When I'm done, I will have a list containing 32-bit values, sorted by the wanted field, and only containing the records which where set in the bitmap -> a sorted result set.
The difference is that I now have 32-bit values, instead of a bitmap, but that's fine with me. If I want to do further AND, OR or NOT'ing, I would do it on the bitmap, and the sorting will be gone, unless I sort again. All my routines will have to work with both bitmaps and pointers (let's call them Pointermaps), but their semantics are the same, so it's my best solution so far.

Now I'm onto GT (greater than), LT (less than) and [...] (ranges). It seems that the sort-files could be used for this too, but that would require some additional information in the files, making them grow (maybe too) much.

I'm also worried about wildcards. I need at least to be able to search for 'exp' and find all words starting with 'exp', eg. 'experts'. This looks like pretty much the same problem as GT, LT.

Regards,
Pede





0
What Is Threat Intelligence?

Threat intelligence is often discussed, but rarely understood. Starting with a precise definition, along with clear business goals, is essential.

 
LVL 11

Expert Comment

by:alexo
ID: 1255371
>> I don't suppose I could change it now? If I can, please let me know!
Well, you can post a 0-point "question" to the customer support area describing the situation and they (probably Linda) will take care of it.

Now, when dealing with ranges( including GT and LT) you hit the eternal compromise: either size or speed.  Obviously you need access to the field VALUES for ranges and you either save them in the ascending/descending files (size issues) or deal with indirection to the main file for every record (speed issues).  You can also use a more flexible data structure, like a binary tree, instead of a flat file.  Hmmm...  On second thought, why don't you check which data structure provides for fast access times and searches and small size (ignore insertions, etc.)  There are many data structures and algorithms books (Knuth is a classic) or you can scavenge the net resources.
0
 
LVL 1

Author Comment

by:pede
ID: 1255372
After browsing the net for 4 hours I still haven't found "the" source of information. Actually I haven't found anything I could use. Only short notices about books, and descriptions of courses in computer science. If you know of any information about my use of information retrieval (ESPECIALLY on the range subject), please make a comment.

And regarding the range subject: Even if I include the values in the asc/desc files, I will still have to traverse them, so a search on, say, >0 could mean traversing the whole file, and OR'ing the hits for every word. I suspect this could be pretty slow.

I have posted the request on customer support, BTW.

Regards,
Pede


0
 
LVL 11

Expert Comment

by:alexo
ID: 1255373
You'll traverse one index file without indirections to the main one.
About the net, search for B-trees (or B+ trees)
0
 
LVL 1

Author Comment

by:pede
ID: 1255374
I can only agree that it's faster than using indirections into the main file, but still it could be traversing of a megabyte or more. It's the only solution I can think of, but I was hoping that some genious method existed that I didn't know of.

I know about B(+)-trees, but which problem exactly will they solve for me?

/Pede
0
 
LVL 11

Expert Comment

by:alexo
ID: 1255375
>> I know about B(+)-trees, but which problem exactly will they solve for me?
Using a tree structure for indexing is faster and more versatile than flat files.  Once you decide to implement more complex queries you'll appreciate it.  Since B(+)-trees are well known, there are numerous free implementations of them (some are in the public domain, some include source).
0
 
LVL 7

Expert Comment

by:linda101698
ID: 1255376
Once a question is graded and moved to the Previously Asked Questions (PAQ), I do not have any way to change the grade.

The best I can do is post a question to award alexo additional points.  See the question in this topic area alexo.

Linda Gardner
Customer Service @ Experts Exchange
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

Preface I don't like visual development tools that are supposed to write a program for me. Even if it is Xcode and I can use Interface Builder. Yes, it is a perfect tool and has helped me a lot, mainly, in the beginning, when my programs were small…
Summary: This tutorial covers some basics of pointer, pointer arithmetic and function pointer. What is a pointer: A pointer is a variable which holds an address. This address might be address of another variable/address of devices/address of fu…
The goal of this video is to provide viewers with basic examples to understand and use structures in the C programming language.
Video by: Grant
The goal of this video is to provide viewers with basic examples to understand and use for-loops in the C programming language.

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now