I want to create an inverted index of XML documents in C. I have a forward index already created which is just a list of document ids and word ids which appear in the document.
What is an efficient algorithm to create an inverted index? Thinking about it, the only way I can see to do this is to load in the entire forward index into memory, then, starting with word id 1, search through the entire forward index for any occurences of that word, and mark the document id which it corresponded to. Then, record the list of docids into the forward index file. Then repeat this with word 2, and so on.
Essentially, this would require the program to traverse the entire forward index for every single word id. This seems extremely slow, but I can't see any other way to do it faster (other than dividing up the job with multiple threads.) Is there a faster algorithm that anyone knows of?