Solved

Create inverted index in C

Posted on 2006-11-18
8
2,004 Views
Last Modified: 2012-05-05
I want to create an inverted index of XML documents in C.  I have a forward index already created which is just a list of document ids and word ids which appear in the document.

What is an efficient algorithm to create an inverted index?  Thinking about it, the only way I can see to do this is to load in the entire forward index into memory, then, starting with word id 1, search through the entire forward index for any occurences of that word, and mark the document id which it corresponded to.  Then, record the list of docids into the forward index file.  Then repeat this with word 2, and so on.

Essentially, this would require the program to traverse the entire forward index for every single word id.  This seems extremely slow, but I can't see any other way to do it faster (other than dividing up the job with multiple threads.)  Is there a faster algorithm that anyone knows of?
0
Comment
Question by:chsalvia
  • 5
  • 3
8 Comments
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971192
Hi chsalvia,

That is essentially what you do but there are much more efficient ways.

Try this:

1. Traverse your current index collecting every word that appears.
2. Sort the word list and remove duplicates.

You now have the keys to your index.

3. Traverse your forward index again and for each word that appears against each document, add the document id to the reverse index.

Paul
0
 

Author Comment

by:chsalvia
ID: 17971256
>> 1. Traverse your current index collecting every word that appears.
>> 2. Sort the word list and remove duplicates.

Well, I only have unique words in the forward index, so fortunately I won't need to worry about duplicates.

>>3. Traverse your forward index again and for each word that appears against each document, add the document id to the reverse index.

Are you saying step 3 would be done in one pass?
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971275
>>Well, I only have unique words in the forward index, so fortunately I won't need to worry about duplicates.
But wont the same word appear many times in the documents?

>>Are you saying step 3 would be done in one pass?
If my picture of your situation is correct, yes.

Here's what I understood you have.

Doc1 AWord,AnotherWord,AThirdWord...
...
Doc22,AWord,YetAnotherWord,..
...

And you are looking for an index:

AWord Doc1,Doc22,...
AnotherWord Doc1,...
...

Am I wrong?

Paul
0
 

Author Comment

by:chsalvia
ID: 17971328
Yes, your understanding of what I'm saying is correct.  I just don't see how I could do it in a single pass.  That would be very memory intensive, because I'd need to maintain a 2-dimensional array in memory which mapped each word to lists of docids.  I suppose I'd need to divide it up, so that maybe each pass could do 10,000 words or something.  I have like 500,000 xml docs in total, so I'm afraid I'd run out of memory if I did it in one pass.
0
Better Security Awareness With Threat Intelligence

See how one of the leading financial services organizations uses Recorded Future as part of a holistic threat intelligence program to promote security awareness and proactively and efficiently identify threats.

 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971377
If you cant do it in memory then do it to disk.

You could probably do it in one pass too.

Create a new empty folder.

For each document in your forward list:
 For each word in that document:
  If a file does not exist for that word, create it, named with the word. Append the document number to the file.

Processing the resulting files should build your reversed index.

This may be a little more effecient if you use a database but it is certainly not something you should expect to take a short time.

Paul
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971396
The primary requirement to keep this kind of process reasonably fast is to reduce the nuber of passes of your data to a minimum.

The above technique would be comparatively slow but it has its benefits. You can stop the process any time and restart it from where it left off.

Tuning can be done by leaving as many word files open as you can as opening and closing them takes a significant length of time. If you are running under windoze I have some code that can get around the open files limit in a clean way. If you want I could dig it out.

Paul
0
 

Author Comment

by:chsalvia
ID: 17971450
Thanks.  I'm running on linux - but your technique sounds good.  

But do you think it would be faster to do it in one pass on disk, using multiple files as you describe, or to do it in multiple passes in memory?  Suppose I make multiple passes, but for each pass I keep 10,000 word-ids in memory, and record word-id/doc-id matches for any of those 10,000?  
0
 
LVL 16

Accepted Solution

by:
PaulCaswell earned 500 total points
ID: 17972150
That sounds like a good plan. The time cost of disk I/O is usually the significant one so watch out for too much disk I/O on the primary tree.

A possibly better alternative would be to keep the secondary index in memory as far as possible but keep hit-counts for each word. When you get low on memory you could page out the least used words to disk. This is less easy but may end up being the better solution if you need to do this often.

I would probably implement the simple method described in #17971377 first and, if that ends up too slow move to the one in the paragraph above or your idea of multiple passes indexing a few words each pass.

One worthwhile test would be to count how many unique words you actually have. That should get a handle on how important optimisation is.

Paul

0

Featured Post

What Is Threat Intelligence?

Threat intelligence is often discussed, but rarely understood. Starting with a precise definition, along with clear business goals, is essential.

Join & Write a Comment

This tutorial is posted by Aaron Wojnowski, administrator at SDKExpert.net.  To view more iPhone tutorials, visit www.sdkexpert.net. This is a very simple tutorial on finding the user's current location easily. In this tutorial, you will learn ho…
Summary: This tutorial covers some basics of pointer, pointer arithmetic and function pointer. What is a pointer: A pointer is a variable which holds an address. This address might be address of another variable/address of devices/address of fu…
The goal of this video is to provide viewers with basic examples to understand how to use strings and some functions related to them in the C programming language.
Video by: Grant
The goal of this video is to provide viewers with basic examples to understand and use nested-loops in the C programming language.

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now