Create inverted index in C

chsalvia asked:

I want to create an inverted index of XML documents in C.  I have a forward index already created, which is just a list of document ids and the word ids which appear in each document.

What is an efficient algorithm to create an inverted index?  Thinking about it, the only way I can see to do this is to load the entire forward index into memory, then, starting with word id 1, search through the entire forward index for any occurrences of that word, and record the document ids it corresponds to.  Then write that list of doc ids to the inverted index file, repeat with word id 2, and so on.

Essentially, this would require the program to traverse the entire forward index for every single word id.  This seems extremely slow, but I can't see any faster way to do it (other than dividing up the job across multiple threads).  Is there a faster algorithm that anyone knows of?
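
For concreteness, the brute-force approach just described might look roughly like the sketch below.  It assumes the forward index is already in memory as an array of (doc id, word id) pairs; that layout and all names are made up for illustration.

#include <stdio.h>

/* Hypothetical in-memory forward index: one (doc id, word id) pair per
 * word occurrence.  The layout is an assumption, not from the question. */
struct fwd_entry {
    unsigned doc_id;
    unsigned word_id;
};

/* Naive inversion: for every word id, rescan the whole forward index.
 * This is O(num_words * num_entries), which is why it feels so slow. */
static void invert_naive(const struct fwd_entry *fwd, size_t n_entries,
                         unsigned max_word_id, FILE *out)
{
    for (unsigned w = 1; w <= max_word_id; w++) {
        fprintf(out, "%u:", w);
        for (size_t i = 0; i < n_entries; i++) {
            if (fwd[i].word_id == w)
                fprintf(out, " %u", fwd[i].doc_id);
        }
        fputc('\n', out);
    }
}

int main(void)
{
    /* Tiny example: doc 1 contains words 1 and 2; doc 2 contains 1 and 3. */
    struct fwd_entry fwd[] = { {1, 1}, {1, 2}, {2, 1}, {2, 3} };
    invert_naive(fwd, sizeof fwd / sizeof fwd[0], 3, stdout);
    return 0;
}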
PaulCaswell:

Hi chsalvia,

That is essentially what you do, but there are much more efficient ways.

Try this:

1. Traverse your current index collecting every word that appears.
2. Sort the word list and remove duplicates.

You now have the keys to your index.

3. Traverse your forward index again and, for each word that appears against each document, add the document id to that word's entry in the reverse index.
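
A minimal sketch of those three steps in C, assuming the forward index fits in memory as an array of (word id, doc id) postings; the struct names and the tiny sample data are made up, not taken from the question:

#include <stdio.h>
#include <stdlib.h>

struct posting { unsigned word_id; unsigned doc_id; };

struct entry {                 /* one reverse-index entry: word -> doc list */
    unsigned word_id;
    unsigned *docs;
    size_t ndocs, cap;
};

static int cmp_uint(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    /* Tiny forward index: doc 1 = {10, 20}, doc 2 = {10, 30}. */
    struct posting fwd[] = { {10, 1}, {20, 1}, {10, 2}, {30, 2} };
    size_t n = sizeof fwd / sizeof fwd[0];

    /* Steps 1-2: collect every word id, sort, remove duplicates. */
    unsigned *keys = malloc(n * sizeof *keys);
    size_t nkeys = 0;
    for (size_t i = 0; i < n; i++) keys[i] = fwd[i].word_id;
    qsort(keys, n, sizeof *keys, cmp_uint);
    for (size_t i = 0; i < n; i++)
        if (nkeys == 0 || keys[nkeys - 1] != keys[i]) keys[nkeys++] = keys[i];

    /* The unique keys become the (sorted) slots of the reverse index. */
    struct entry *rev = calloc(nkeys, sizeof *rev);
    for (size_t k = 0; k < nkeys; k++) rev[k].word_id = keys[k];

    /* Step 3: one more pass over the forward index; for each posting,
     * binary-search the key and append the doc id to that word's list. */
    for (size_t i = 0; i < n; i++) {
        unsigned *slot = bsearch(&fwd[i].word_id, keys, nkeys,
                                 sizeof *keys, cmp_uint);
        struct entry *e = &rev[slot - keys];
        if (e->ndocs == e->cap) {
            e->cap = e->cap ? e->cap * 2 : 4;
            e->docs = realloc(e->docs, e->cap * sizeof *e->docs);
        }
        e->docs[e->ndocs++] = fwd[i].doc_id;
    }

    /* Print the reverse index as "word: doc doc ...". */
    for (size_t k = 0; k < nkeys; k++) {
        printf("%u:", rev[k].word_id);
        for (size_t d = 0; d < rev[k].ndocs; d++)
            printf(" %u", rev[k].docs[d]);
        printf("\n");
        free(rev[k].docs);
    }
    free(rev);
    free(keys);
    return 0;
}

The point of sorting the keys first is that step 3 becomes a single pass with an O(log k) lookup per posting, instead of rescanning the whole forward index once per word.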

Paul
chsalvia (Asker):
>> 1. Traverse your current index collecting every word that appears.
>> 2. Sort the word list and remove duplicates.

Well, I only have unique words in the forward index, so fortunately I won't need to worry about duplicates.

>> 3. Traverse your forward index again and for each word that appears against each document, add the document id to the reverse index.

Are you saying step 3 would be done in one pass?
PaulCaswell:

>> Well, I only have unique words in the forward index, so fortunately I won't need to worry about duplicates.

But won't the same word appear many times across the documents?

>> Are you saying step 3 would be done in one pass?
If my picture of your situation is correct, yes.

Here's what I understood you have.

Doc1: AWord, AnotherWord, AThirdWord, ...
...
Doc22: AWord, YetAnotherWord, ...
...

And you are looking for an index:

AWord: Doc1, Doc22, ...
AnotherWord: Doc1, ...
...

Am I wrong?

Paul
chsalvia (Asker):

Yes, your understanding of what I'm saying is correct.  I just don't see how I could do it in a single pass.  That would be very memory-intensive, because I'd need to maintain a 2-dimensional array in memory mapping each word to a list of doc ids.  I suppose I'd need to divide it up, so that maybe each pass could handle 10,000 words or so.  I have around 500,000 XML docs in total, so I'm afraid I'd run out of memory if I did it in one pass.
PaulCaswell:

If you can't do it in memory then do it to disk.

You could probably do it in one pass too.

Create a new empty folder.

For each document in your forward list:
 For each word in that document:
  If a file does not exist for that word, create it, named with the word. Append the document number to the file.

Processing the resulting files should build your reversed index.
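
A rough sketch of that loop, assuming a text forward index in which each line is "docid word word ..." and an already-created empty directory named postings/ (both are assumptions, not part of the question):

#include <stdio.h>
#include <string.h>

/* For each document line, append the doc id to one file per word.
 * Opening in "a" mode creates the file on first use. */
int build_word_files(const char *fwd_path)
{
    FILE *fwd = fopen(fwd_path, "r");
    if (!fwd) return -1;

    char line[65536];
    while (fgets(line, sizeof line, fwd)) {
        char *doc = strtok(line, " \t\n");        /* first token: doc id */
        if (!doc) continue;

        char *word;
        while ((word = strtok(NULL, " \t\n")) != NULL) {
            char path[4096];
            snprintf(path, sizeof path, "postings/%s", word);

            FILE *wf = fopen(path, "a");
            if (!wf) { fclose(fwd); return -1; }
            fprintf(wf, "%s\n", doc);
            fclose(wf);
        }
    }
    fclose(fwd);
    return 0;
}

Afterwards each file under postings/ holds one doc id per line for its word, and walking the directory gives you the inverted index.  The per-word fopen/fclose is the main cost here, which is what the tuning note further down addresses.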

This may be a little more efficient if you use a database, but it is certainly not something you should expect to take a short time.

Paul
PaulCaswell:

The primary requirement for keeping this kind of process reasonably fast is to reduce the number of passes over your data to a minimum.

The above technique would be comparatively slow, but it has its benefits.  You can stop the process at any time and restart it from where it left off.

Tuning can be done by leaving as many word files open as you can, as opening and closing them takes a significant length of time.  If you are running under windoze I have some code that can get around the open-files limit in a clean way.  If you want, I could dig it out.
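
That open-files code is not shown here, but purely as an illustration of the "keep word files open" idea (not Paul's code), a small least-recently-used cache of open handles is one way to avoid an fopen/fclose per posting; the cache size and names below are arbitrary:

#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 64

struct slot { char word[64]; FILE *fp; unsigned long last_used; };
static struct slot cache[CACHE_SLOTS];
static unsigned long tick;

/* Return an open append handle for this word's postings file, reusing a
 * cached handle when possible and evicting the least recently used one
 * when the cache is full. */
FILE *word_file(const char *word)
{
    int free_slot = -1, lru = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].fp && strcmp(cache[i].word, word) == 0) {
            cache[i].last_used = ++tick;          /* cache hit */
            return cache[i].fp;
        }
        if (!cache[i].fp) free_slot = i;
        else if (cache[i].last_used < cache[lru].last_used) lru = i;
    }

    int i = (free_slot >= 0) ? free_slot : lru;
    if (cache[i].fp) fclose(cache[i].fp);         /* evict */

    char path[4096];
    snprintf(path, sizeof path, "postings/%s", word);
    snprintf(cache[i].word, sizeof cache[i].word, "%s", word);
    cache[i].fp = fopen(path, "a");
    cache[i].last_used = ++tick;
    return cache[i].fp;
}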

Paul
chsalvia (Asker):

Thanks.  I'm running on Linux - but your technique sounds good.

But do you think it would be faster to do it in one pass on disk, using multiple files as you describe, or to do it in multiple passes in memory?  Suppose I make multiple passes, but for each pass I keep 10,000 word ids in memory and record word-id/doc-id matches for any of those 10,000?
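
For what it's worth, the multi-pass variant described here could look something like the sketch below: each pass only tracks word ids inside a fixed window (10,000 wide), so memory stays bounded at the cost of rereading the forward index once per window.  The binary (doc id, word id) pair format and all names are assumptions for illustration.

#include <stdio.h>
#include <stdlib.h>

#define BATCH 10000

struct list { unsigned *docs; size_t n, cap; };

static void push(struct list *l, unsigned doc)
{
    if (l->n == l->cap) {
        l->cap = l->cap ? l->cap * 2 : 4;
        l->docs = realloc(l->docs, l->cap * sizeof *l->docs);
    }
    l->docs[l->n++] = doc;
}

/* One pass per window of BATCH word ids: collect postings for that window
 * only, write them out, then reread the forward index for the next window. */
int invert_batched(const char *fwd_path, unsigned max_word_id, FILE *out)
{
    struct list *lists = calloc(BATCH, sizeof *lists);
    if (!lists) return -1;

    for (unsigned lo = 1; lo <= max_word_id; lo += BATCH) {
        FILE *fwd = fopen(fwd_path, "rb");
        if (!fwd) { free(lists); return -1; }

        unsigned pair[2];             /* pair[0] = doc id, pair[1] = word id */
        while (fread(pair, sizeof pair[0], 2, fwd) == 2) {
            if (pair[1] >= lo && pair[1] < lo + BATCH)
                push(&lists[pair[1] - lo], pair[0]);
        }
        fclose(fwd);

        /* Flush this window's postings, then reset for the next pass. */
        for (unsigned w = 0; w < BATCH && lo + w <= max_word_id; w++) {
            if (lists[w].n == 0) continue;
            fprintf(out, "%u:", lo + w);
            for (size_t i = 0; i < lists[w].n; i++)
                fprintf(out, " %u", lists[w].docs[i]);
            fputc('\n', out);
            free(lists[w].docs);
            lists[w] = (struct list){0};
        }
    }
    free(lists);
    return 0;
}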
ASKER CERTIFIED SOLUTION by PaulCaswell (solution text available to members only).