Solved

Create inverted index in C

Posted on 2006-11-18
8
2,012 Views
Last Modified: 2012-05-05
I want to create an inverted index of XML documents in C.  I have a forward index already created, which is just a list of document ids and the word ids that appear in each document.

What is an efficient algorithm to create an inverted index?  Thinking about it, the only way I can see to do this is to load the entire forward index into memory, then, starting with word id 1, search through the entire forward index for any occurrences of that word and note the document id each one corresponds to.  Then record that list of doc ids in the inverted index file, repeat with word id 2, and so on.

Essentially, this would require the program to traverse the entire forward index for every single word id.  This seems extremely slow, but I can't see any other way to do it faster (other than dividing up the job with multiple threads).  Is there a faster algorithm that anyone knows of?
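
To make that concrete, this is roughly the loop I'm picturing (just an untested sketch to show the shape of it; the struct and names are made up, not my actual code):

/* Naive version: one full scan of the forward index per word id. */
#include <stdio.h>

typedef struct {
    int doc_id;
    int word_id;
} posting;

void naive_invert(const posting *fwd, size_t n, int max_word_id, FILE *out)
{
    for (int w = 1; w <= max_word_id; w++) {
        fprintf(out, "%d:", w);
        for (size_t i = 0; i < n; i++) {          /* full scan every time */
            if (fwd[i].word_id == w)
                fprintf(out, " %d", fwd[i].doc_id);
        }
        fprintf(out, "\n");
    }
}

So it's O(number of words * size of the forward index), which is what worries me.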
0
Comment
Question by:chsalvia
8 Comments
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971192
Hi chsalvia,

That is essentially what you do, but there are much more efficient ways.

Try this:

1. Traverse your current index collecting every word that appears.
2. Sort the word list and remove duplicates.

You now have the keys to your index.

3. Traverse your forward index again, and for each word that appears against each document, add the document id to the reverse index.
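
In C the whole thing could look something like this (an untested sketch only; it assumes the forward index fits in memory as an array of doc-id/word-id pairs and that word ids are small integers usable as array indexes):

#include <stdlib.h>

typedef struct { int doc_id; int word_id; } posting;

typedef struct {
    int    *doc_ids;    /* documents this word appears in */
    size_t  count;
    size_t  capacity;
} posting_list;

/* max_word_id comes from steps 1 and 2: the sorted, de-duplicated word list. */
posting_list *build_inverted(const posting *fwd, size_t n, int max_word_id)
{
    posting_list *inv = calloc((size_t)max_word_id + 1, sizeof *inv);
    if (!inv)
        return NULL;

    /* Step 3: a single pass over the forward index. */
    for (size_t i = 0; i < n; i++) {
        posting_list *pl = &inv[fwd[i].word_id];
        if (pl->count == pl->capacity) {
            size_t cap = pl->capacity ? pl->capacity * 2 : 4;
            int *p = realloc(pl->doc_ids, cap * sizeof *p);
            if (!p)
                return NULL;    /* real code would free everything here */
            pl->doc_ids = p;
            pl->capacity = cap;
        }
        pl->doc_ids[pl->count++] = fwd[i].doc_id;
    }
    return inv;
}

The important point is that building the index is then one pass over the forward data, not one pass per word.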

Paul
0
 

Author Comment

by:chsalvia
ID: 17971256
>> 1. Traverse your current index collecting every word that appears.
>> 2. Sort the word list and remove duplicates.

Well, I only have unique words in the forward index, so fortunately I won't need to worry about duplicates.

>>3. Traverse your forward index again and for each word that appears against each document, add the document id to the reverse index.

Are you saying step 3 would be done in one pass?
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971275
>>Well, I only have unique words in the forward index, so fortunately I won't need to worry about duplicates.
But won't the same word appear many times in the documents?

>>Are you saying step 3 would be done in one pass?
If my picture of your situation is correct, yes.

Here's what I understand you have:

Doc1 AWord,AnotherWord,AThirdWord...
...
Doc22 AWord,YetAnotherWord,...
...

And you are looking for an index:

AWord Doc1,Doc22,...
AnotherWord Doc1,...
...

Am I wrong?

Paul
0
 

Author Comment

by:chsalvia
ID: 17971328
Yes, your understanding of what I'm saying is correct.  I just don't see how I could do it in a single pass.  That would be very memory intensive, because I'd need to maintain a 2-dimensional array in memory which mapped each word to a list of doc ids.  I suppose I'd need to divide it up, so that maybe each pass could do 10,000 words or something.  I have about 500,000 XML docs in total, so I'm afraid I'd run out of memory if I did it in one pass.
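
Roughly, each pass would look something like this (untested sketch; I've assumed a text forward index of "docid wordid" lines just to keep the example short, which isn't my real format):

#include <stdio.h>
#include <stdlib.h>

#define BATCH 10000

typedef struct { int *doc_ids; size_t count, capacity; } posting_list;

static void append_doc(posting_list *pl, int doc_id)
{
    if (pl->count == pl->capacity) {   /* no error checking, just a sketch */
        pl->capacity = pl->capacity ? pl->capacity * 2 : 4;
        pl->doc_ids = realloc(pl->doc_ids, pl->capacity * sizeof *pl->doc_ids);
    }
    pl->doc_ids[pl->count++] = doc_id;
}

/* One pass: collect postings only for word ids in [lo, lo + BATCH).
 * The caller writes the batch out to the inverted index and frees it
 * before starting the next pass. */
posting_list *index_one_batch(const char *fwd_path, int lo)
{
    posting_list *batch = calloc(BATCH, sizeof *batch);
    FILE *f = fopen(fwd_path, "r");
    int doc_id, word_id;

    if (!batch || !f) {
        if (f) fclose(f);
        free(batch);
        return NULL;
    }
    while (fscanf(f, "%d %d", &doc_id, &word_id) == 2)
        if (word_id >= lo && word_id < lo + BATCH)
            append_doc(&batch[word_id - lo], doc_id);
    fclose(f);
    return batch;
}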
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971377
If you can't do it in memory, then do it on disk.

You could probably do it in one pass too.

Create a new empty folder.

For each document in your forward list:
 For each word in that document:
  If a file does not exist for that word, create it, named with the word. Append the document number to the file.

Processing the resulting files should build your reversed index.
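
A bare-bones version of the inner step might look something like this (untested; it naively opens and closes the word's file for every single posting):

#include <stdio.h>

/* Append one doc id to the file for one word.  Opening with "a" creates
 * the file if it does not exist yet.  This assumes the word id is used
 * as the file name inside the empty folder created beforehand. */
int append_posting(const char *dir, int word_id, int doc_id)
{
    char path[1024];
    FILE *f;

    snprintf(path, sizeof path, "%s/%d", dir, word_id);
    f = fopen(path, "a");
    if (!f)
        return -1;
    fprintf(f, "%d\n", doc_id);
    fclose(f);
    return 0;
}

Using the word id rather than the word itself as the file name also avoids having to sanitise words for the filesystem.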

This may be a little more efficient if you use a database, but it is certainly not something you should expect to take a short time.

Paul
0
 
LVL 16

Expert Comment

by:PaulCaswell
ID: 17971396
The primary requirement to keep this kind of process reasonably fast is to reduce the number of passes over your data to a minimum.

The above technique would be comparatively slow, but it has its benefits. You can stop the process at any time and restart it from where it left off.

Tuning can be done by leaving as many word files open as you can, as opening and closing them takes a significant amount of time. If you are running under Windows, I have some code that can get around the open-files limit in a clean way. If you want, I could dig it out.

Paul
0
 

Author Comment

by:chsalvia
ID: 17971450
Thanks.  I'm running on Linux, but your technique sounds good.

But do you think it would be faster to do it in one pass on disk, using multiple files as you describe, or in multiple passes in memory?  Suppose I make multiple passes, but for each pass I keep 10,000 word ids in memory and record word-id/doc-id matches only for those 10,000?
0
 
LVL 16

Accepted Solution

by:
PaulCaswell earned 500 total points
ID: 17972150
That sounds like a good plan. The time cost of disk I/O is usually the significant one, so watch out for too much disk I/O on the primary tree.

A possibly better alternative would be to keep the secondary index in memory as far as possible but keep hit-counts for each word. When you get low on memory you could page out the least used words to disk. This is less easy but may end up being the better solution if you need to do this often.
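
Very roughly, the shape of that could be something like the following (untested; incrementing the hit counts as postings are added, and merging the spilled postings back in at the end, are not shown):

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int          *doc_ids;
    size_t        count, capacity;
    unsigned long hits;        /* how often this word has been seen */
} posting_list;

/* Write one word's postings to a spill file so the memory can be reused;
 * a real version needs a merge step over the spill file later. */
static void spill_word(FILE *spill, int word_id, const posting_list *pl)
{
    fprintf(spill, "%d:", word_id);
    for (size_t i = 0; i < pl->count; i++)
        fprintf(spill, " %d", pl->doc_ids[i]);
    fprintf(spill, "\n");
}

/* When memory gets tight, page out the words with the fewest hits. */
void evict_cold_words(posting_list *inv, int max_word_id,
                      unsigned long hit_threshold, FILE *spill)
{
    for (int w = 1; w <= max_word_id; w++) {
        if (inv[w].count > 0 && inv[w].hits < hit_threshold) {
            spill_word(spill, w, &inv[w]);
            free(inv[w].doc_ids);
            inv[w].doc_ids = NULL;
            inv[w].count = inv[w].capacity = 0;
        }
    }
}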

I would probably implement the simple method described in #17971377 first and, if that ends up too slow, move to the one in the paragraph above or to your idea of multiple passes indexing a few words each pass.

One worthwhile test would be to count how many unique words you actually have. That should give you a handle on how important optimisation is.

Paul

0
