Solved

Create inverted index in C

Posted on 2006-11-18
2,009 Views
Last Modified: 2012-05-05
I want to create an inverted index of XML documents in C. I already have a forward index, which is just a list of document ids and the word ids that appear in each document.

What is an efficient algorithm to create an inverted index? Thinking about it, the only way I can see to do this is to load the entire forward index into memory, then, starting with word id 1, search through the entire forward index for any occurrences of that word and record the document ids it corresponds to. Then write that list of doc ids to the inverted index file, and repeat with word id 2, and so on.

Essentially, this would require the program to traverse the entire forward index once for every single word id. This seems extremely slow, but I can't see any faster way to do it (other than dividing up the job with multiple threads). Is there a faster algorithm that anyone knows of?
Question by:chsalvia
8 Comments
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971192
Hi chsalvia,

That is essentially what you do, but there are much more efficient ways.

Try this:

1. Traverse your current index collecting every word that appears.
2. Sort the word list and remove duplicates.

You now have the keys to your index.

3. Traverse your forward index again and for each word that appears against each document, add the document id to the reverse index.
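
A minimal sketch of those three steps in C, assuming the forward index fits in memory as an array of (doc id, word id) pairs; the type and function names here are illustrative, not from your actual code:

#include <stdio.h>
#include <stdlib.h>

typedef struct { int doc_id; int word_id; } Pair;

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

void build_inverted(const Pair *fwd, size_t n, FILE *out) {
    /* Steps 1-2: collect every word id, sort, remove duplicates. */
    int *keys = malloc(n * sizeof *keys);
    for (size_t i = 0; i < n; i++) keys[i] = fwd[i].word_id;
    qsort(keys, n, sizeof *keys, cmp_int);
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (k == 0 || keys[k - 1] != keys[i]) keys[k++] = keys[i];

    /* Step 3: one more traversal, appending each doc id to a
     * growable posting list for its word. */
    int **lists = calloc(k, sizeof *lists);
    size_t *len = calloc(k, sizeof *len);
    size_t *cap = calloc(k, sizeof *cap);
    for (size_t i = 0; i < n; i++) {
        int *slot = bsearch(&fwd[i].word_id, keys, k, sizeof *keys, cmp_int);
        size_t j = (size_t)(slot - keys);
        if (len[j] == cap[j]) {
            cap[j] = cap[j] ? cap[j] * 2 : 4;
            lists[j] = realloc(lists[j], cap[j] * sizeof **lists);
        }
        lists[j][len[j]++] = fwd[i].doc_id;
    }

    /* Emit: each word id followed by its document ids. */
    for (size_t j = 0; j < k; j++) {
        fprintf(out, "%d:", keys[j]);
        for (size_t i = 0; i < len[j]; i++)
            fprintf(out, " %d", lists[j][i]);
        fputc('\n', out);
        free(lists[j]);
    }
    free(lists); free(len); free(cap); free(keys);
}

The bsearch on the sorted key list keeps the lookup at O(log k) per posting, so the whole build is two linear passes plus the sort, rather than one full traversal per word.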

Paul
 

Author Comment

by:chsalvia
ID: 17971256
>> 1. Traverse your current index collecting every word that appears.
>> 2. Sort the word list and remove duplicates.

Well, I only have unique words in the forward index, so fortunately I won't need to worry about duplicates.

>>3. Traverse your forward index again and for each word that appears against each document, add the document id to the reverse index.

Are you saying step 3 would be done in one pass?
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971275
>>Well, I only have unique words in the forward index, so fortunately I won't need to worry about duplicates.
But won't the same word appear many times across the documents?

>>Are you saying step 3 would be done in one pass?
If my picture of your situation is correct, yes.

Here's what I understood you have.

Doc1: AWord, AnotherWord, AThirdWord, ...
...
Doc22: AWord, YetAnotherWord, ...
...

And you are looking for an index:

AWord: Doc1, Doc22, ...
AnotherWord: Doc1, ...
...

Am I wrong?

Paul

 

Author Comment

by:chsalvia
ID: 17971328
Yes, your understanding of what I'm saying is correct. I just don't see how I could do it in a single pass. That would be very memory intensive, because I'd need to maintain a two-dimensional structure in memory mapping each word to a list of doc ids. I suppose I'd need to divide it up so that each pass handles maybe 10,000 words. I have around 500,000 XML docs in total, so I'm afraid I'd run out of memory if I did it in one pass.
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971377
If you can't do it in memory, then do it on disk.

You could probably do it in one pass too.

Create a new empty folder.

For each document in your forward list:
 For each word in that document:
  If a file named after that word does not exist, create it. Append the document id to that file.

Processing the resulting files should build your reversed index.
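
A rough sketch of the inner step, assuming word ids rather than word strings are used as file names (either works the same way); error handling is kept minimal:

#include <stdio.h>

/* Append one posting to the per-word file.  Mode "a" creates the
 * file on first use, so no separate existence check is needed. */
void append_posting(const char *dir, int word_id, int doc_id) {
    char path[256];
    snprintf(path, sizeof path, "%s/%d.post", dir, word_id);
    FILE *f = fopen(path, "a");
    if (f) {
        fprintf(f, "%d\n", doc_id);
        fclose(f);
    }
}

Note that this opens and closes a file for every single posting, which is exactly the cost the tuning comment below is about.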

This may be a little more efficient if you use a database, but it is certainly not something you should expect to take a short time.

Paul
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971396
The primary requirement for keeping this kind of process reasonably fast is to reduce the number of passes over your data to a minimum.

The above technique would be comparatively slow but it has its benefits. You can stop the process any time and restart it from where it left off.

Tuning can be done by leaving as many word files open as you can, as opening and closing them takes a significant length of time. If you are running under windoze, I have some code that can get around the open-files limit in a clean way. If you want, I could dig it out.

Paul
 

Author Comment

by:chsalvia
ID: 17971450
Thanks. I'm running on Linux, but your technique sounds good.

But do you think it would be faster to do it in one pass on disk, using multiple files as you describe, or in multiple passes in memory? Suppose I make multiple passes, but for each pass I keep 10,000 word ids in memory and record word-id/doc-id matches for just those 10,000?
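
I'm imagining something roughly like this for each pass (assuming the forward index is a text file of "doc_id word_id" lines; the names are just illustrative):

#include <stdio.h>
#include <stdlib.h>

#define BAND 10000  /* word ids indexed per pass */

typedef struct { int *docs; size_t len, cap; } List;

/* One pass: scan the whole forward index but only buffer postings
 * for word ids in [band, band + BAND), then flush them. */
void one_pass(FILE *fwd, FILE *out, int band) {
    List *lists = calloc(BAND, sizeof *lists);
    int doc, word;
    rewind(fwd);
    while (fscanf(fwd, "%d %d", &doc, &word) == 2) {
        if (word < band || word >= band + BAND) continue;
        List *l = &lists[word - band];
        if (l->len == l->cap) {
            l->cap = l->cap ? l->cap * 2 : 4;
            l->docs = realloc(l->docs, l->cap * sizeof *l->docs);
        }
        l->docs[l->len++] = doc;
    }
    for (int w = 0; w < BAND; w++) {
        if (!lists[w].len) continue;
        fprintf(out, "%d:", band + w);
        for (size_t i = 0; i < lists[w].len; i++)
            fprintf(out, " %d", lists[w].docs[i]);
        fputc('\n', out);
        free(lists[w].docs);
    }
    free(lists);
}

/* Driver: for (int b = 0; b <= max_word_id; b += BAND)
 *             one_pass(fwd, out, b); */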
 
LVL 16

Accepted Solution

by:
PaulCaswell earned 500 total points
ID: 17972150
That sounds like a good plan. The time cost of disk I/O is usually the significant one, so watch out for too much disk I/O on the primary tree.

A possibly better alternative would be to keep the secondary index in memory as far as possible, but keep hit counts for each word. When you get low on memory, you could page out the least-used words to disk. This is less easy, but it may end up being the better solution if you need to do this often.

I would probably implement the simple method described in #17971377 first and, if that ends up too slow, move to the one in the paragraph above or to your idea of multiple passes indexing a few words each pass.

One worthwhile test would be to count how many unique words you actually have. That should give you a handle on how important optimisation is.
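
Something along these lines would do that count, assuming word ids are dense integers below some known ceiling (MAX_WID here is an illustrative guess):

#include <stdio.h>
#include <string.h>

#define MAX_WID 4000000  /* assumed upper bound on word ids */
static unsigned char seen[MAX_WID / 8 + 1];  /* one bit per word id */

/* Count distinct word ids in a forward index of "doc_id word_id" lines. */
long count_unique_words(FILE *fwd) {
    long unique = 0;
    int doc, word;
    memset(seen, 0, sizeof seen);
    rewind(fwd);
    while (fscanf(fwd, "%d %d", &doc, &word) == 2) {
        if (word < 0 || word >= MAX_WID) continue;
        if (!(seen[word / 8] & (1u << (word % 8)))) {
            seen[word / 8] |= (unsigned char)(1u << (word % 8));
            unique++;
        }
    }
    return unique;
}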

Paul


