asked on

Search engine

Hi, I wan't to make a read-only search engine, mainly for use with CD's. The data will be text, and maybe BLOB's. I have lot's of ideas of how to build the base (and read it too), but I am worried about performance issues. Anyone got links to material covering the subject?
This is not going to be Oracle 9 (coz' not since it's read only ;), but it's got to be _fast_, even with large amounts of data (maybe 600MB of text).

Regards, and thanks in advance

Pede

BigRat

Try this as an idea.
Give each file a number.
Extract words from each file, normalize them (into uppercase for example), and put them into a BTree (make yourself). Each word in the tree maps to a number. Each new word gets a number counting form 1.
When you lookup the word in the BTree and get its number (or it gets a new number) write this number and the number of the file into a flat file (ie: 2 four byte numbers).
When all input has been so processed, SORT the file on the 1st column (Word number) Sort by eliminating duplicates.
Now read the flat file record by record, remember the position of a new word as p, count the number of entries which start with p (hit count) and write these two values back into your BTree.
To make a query on word x you lookup x in the BTree, extract the file offset p and hit count. You allocate hitcount memory locations and read out hitcount file numbers from offset p. You then have a sorted result set. Doing AND and OR between such sets is easy.

pede

ASKER

Hi BigRat, thanks for your answner. I've read it through several times, and it sounds pretty good (except I don't get it 100%, it is a pretty big subject to cover in 11 lines! I suppose my BTree should be balanced, or?), but I really do need more material. I know I didn't give a lot of details, but I can provide a little more: All my data will end up in one huge file. This file contains records of variable length, and there will be something like 100.000 to 800.000 of them. I believe bitmaps must be part of the solution. Let's say I got 500.000 records. I will have a bitmap for each unique word, with 500.000 bits, where 1 means the word IS in this record, and 0 means it's NOT in this record (the bitmaps will be RLE-packed).
Well, this comment is already too long, which is the reason why I need links. There MUST be something about this on the net (I've tried to search for it, but DAMN you get a lot of hits on 'search' and 'engine' and the like).

I wonder if it's faster to pack everything in one file, or I should use seperate files for index' and so on...

And caching... I could go on... Some extensive material is what I need. Maybe a book?

Regards,
Pede

PS. Any comments are welcome! Even if you don't have any links.

BigRat

Whether the BTree is balanced or not is irrelevant. Such trees sort themselves out in the end. You word to set lookup will cost only LogM(N) disk transfers where N is the number of words and M the number of leaves in branch or root, normally around 256. Ie. a couple of disk transfers.
Form your original question I assumed that you wanted to index a set of text documents, and store these and the documents on a CD or such like. The solution which I proposed to you is used in our company for indexing documents and book titles. We actually make software for public and university libraries. The problem is always the number of index words (English has around 250,000 words) which runs into thousands when you start word indexing documents (technically it's called catch-word indexing).
If you ended up with 20K words and for each word you need a 500/8K Byte bitmap for each, you're going to use a LOT of space just for these.
As you have seen by searching the net (Yahoo, Alta-Vista) for two words you can get VERY large result sets. The joining of these result-sets with AND is very important and must be performed quickly.
The packing into one file or several is irrelevant, it's best to leave them alone. The time needed is to read the result sets for each word quickly off the disk. After word lookup you will need N transfers where N = M*4/B. M is the number of times the word turns up in a document, at worst this is the number of documents; and M is the size in Bytes of a block on the disk. Supposing you have 18K documents, and the word "the" turns up in every one and the disk block size is 512 bytes. You'll need 18*1024*4/512 = 144 disk transfers. Note that the data is linearly stored on the media so track buffering (on CDs a MUST) will save you quite a bit of time.
More?

pede

ASKER

Well, actually I'm starting to get a little confused. I thought I had it figured out ('sort' of, hehe), bit now I'm not so sure.
I have a few questions, even short answers will be appreciated! You gave me something to go on, so you can propose an answer, and get your points.

If I start my binary tree with the word 'ZOO' it will end up pretty imbalanced, won't it??

I hope track-buffering isn't as hard as it sounds ;) Only nessecary during lookup in the binary tree, right? (I suppose it will be on disk).

'The packing into one file or several is irrelevant, it's best to leave them alone'. You mean several files?

Regards,
Pede

You wrote that bitmaps are too large, but the way I see it, you will have to store a filepointer (32-bit value) for every occurence of a word. In the case the word 'the' (ausuming is exists in all records) you will need 500K * 32 bit, isn't this correct? With a bitmap I will need 500K * 1. If a word only exists a few times, you will only need a few 32bit pointers, but I will still have the whole bitmap (~64K), but if I RLE pack it (Run Length Encoding), I could get down to, say, 50 bytes. I would of course have to unpack it, but I don't consider CPU-power as a big problem (as opposed to disk-access).

'Doing AND and OR between such sets is easy'. You are probably right, but I can't see it yet. But AND and OR on bitmaps MUST be easier! This search engine will mostly be used to make a search, view the result, and then add additional criterias. I must be able to save intermediate results, which fits very well with the bitmap approach! How would you do this with your engine?

BigRat

I'm prepared to help you further down the line so we won't close it yet. The points backwards :-
The track-buffering is actually done by the operating system. I was trying to say that the strategy of reading the result set linearly was VERY operating system friendly.
You should leave your text files alone - just copy them to the CD-ROM when it gets burnt.
You shouldn't need to write a BTree package yourself. If you ask a question on who's got a source for you, you should get a reply and free code. You don't have to worry about balancing trees. Any decent implementation will handle that for you.
You'll need a sort program to sort the word/file number pairs. On Unix you'll find a sort program for free. On Windows you might have to search around.
You'll need some for of display for the found text file and a set of dialogs for entering the search terms with AND, OR and NOT. You'll find you can quite easily do queries like "(rats AND mice) OR rodents AND NOT (weasles OR ferrets)" and that would be very nice wouldn't it. Handling this sort or expression will require a parser or a simple Reverse Polish analyser. Once again asking here might find you free code! If you try splitting the thing down into functional units you might be able to get other people to supply the bits and that'll save you a LOT of work.

BigRat

Now onto the encoding. Your 500K*1 bit is required for each of the documents. Which is 18K*500K/8. Yes you could pack them via RLE. Each result set in memory is then 500K/8 bytes or roughly 50K. My result sets can be greater than this but are often very small, since the number of entries is the number of documents in which the word appears. If only once then 4 bytes. The other point is you can only construct the bitmap when you know ALL the words. That is you must analyse ALL the documents first. My system creates the "result-sets" (not sorted) as we go along, which makes it very easy to do "for each document in the directory..." type algorithm.
An OR with my result sets is a sorted merge of the number streams. An AND is a sorted AND of teh streams. I process entries until a condition, you just AND or OR two 50K byte chunks. The coding is easier I'll admit but the problem is on the gerneration, query analysis and display. Considerations of result-set format are trivial.

pede

ASKER

If I pack the bitmap, I can get down to very few bytes if the word only exists once. Not 50K, not even 1K. Maybe even below 20 bytes. I would do something like starting with a 32-bit value(x), meaning: the next byte will appear x times. If the word doesn't exist in the first 100K documents, the 32-bit value will be 100K/8, and the following byte will be 0. By this total of 5 bytes I have covered the first 100K documents. So actually I want to use bitmaps with the intention of _saving_ diskspace! I would then unpack the bitmaps when they are in memory. I would never need more than two bitmaps in memory at a time (to do AND, OR and so on).

The engine is not intended to be used with textdocuments in the traditionel way (although that's probably what I wrote at first). The input for the base will come from a textfile (usually created by extracting from SQL-bases), but the file to put on a CD is a binary file. It will mostly be used to search for companies. Searching for companynames, turnover and other financial stuff (there will probably be more figures than actual words). There will be several indexes, eg. one for name, one for turnover and so on. So a document is typically one company, containing all it's information.

A typical search would be:
search for turnover > 100000.
View result.. Hmm.. Too many...
AND company name should contain 'IBM'

.and so on...

Actually I now see a (potential) problem (I'm in the analyzing phase here ;). Finding '> 100000' should ofcourse not be done by AND'ing thousands of bitmaps. Is this easy with your binary tree?

BigRat

If document A contains only word B (unlikely but whatever) then the Bitmap contains only 1 bit, compression gives say 4 bytes for the mapping A->B. But the mapping we require is B->A, namely in which document(s) does word B occur?
The case of > (greater than) and their associates (<,<=,>=) are difficult. In a simple implementation one reads ALL entries from the tree from the entry found (or appropiately) and OR therm all together. That's slow but that's it. The only other possiblity is use semantic information.namely that they are numbers and create two entries "1000" and "1000+". The second entry implements strict >, ie: it is the union of all entries greater than this one. The LessThan is done by finding the entry 1000 and getting the previous entry in the Tree. Most tree implementations support this type of functionality.
The real challenge is to handle (key>=1000) AND (key<=9999) type queries by generating a result-set of result-sets from the tree directly. Few BTree implementations support this directly, but all SQL databases do.
I know of one implementation where when I search for all books from a given publisher via ISBN I want (isbn>=0-201-00000-0) AND (isbn<=0-201-99999-9) (201 is Addison-Wesley) it extracts every ISBN from the system since the ANDs are executed separately. Fun isn't it?
Finally when you say there will be several indexes how can you then merge the BitMaps since they are based on the number of words per index?

pede

ASKER

I would make 1 bitmap for every unique word. The number of bits will be the number of documents in the base. So, if bit, say, 3 is set, this word exists in document 3. This way I map from B->A (and since all bitmaps have the same length, I just merge them). I will have a file with all words and their bitmaps, and a file which maps into this (a level 2 index), which is where I think a binary tree will fit in.
As mentioned before, index'es won't relate to the whole document, but only to different parts (I call them fields), eg. CompanyName. Like a table in SQL-bases.

It worries me that you regard (>,<,<=,>=) as difficult, and merging all bitmaps together is waaaaay too slow, which leaves me with the second option. Related to this is also truncated searching, like seaching for 'add' should find 'Addison-Wesley' and every other word beginning with 'add'.
'The LessThan is done by finding the entry 1000 and getting the previous entry in the Tree'. I don't get this. The previous (let's say it's 999), contains hits for 999 and >999. How can this help?
I also need to handle sorting by different fields, eg. CompanyName or turnover. If I search for turnover > 10000, I really need to be able to show them sorted by this field. My idea was to make some presorted mapping tables (very fuzzy) for every sortable field.

It probably seem like Im over my head here (it feels like it anyway), but I WANT to do this, and I'm willing to go through this stuff again and again until I get it all. The problem is that the only stuff I got is what I get from you (*bow*)! So when this discussion is over, I STILL haven't got a place to look for futher information.

BTW. I don't need to be able to process complex queries in one pass. I'm satisfied with searches, and the ability to AND, OR, NOT the results together afterwards. This, ofcourse, means that I would need to store intermediate results (bitmaps, bitmaps)...

Regards,

Pede

BigRat

Last things first. You might try the local university library and look up in CommACMs back into the sixties on such things. You could even try searching those libraries on the Web for titles on Information Retrieval and so on. The the uni library doesn't give you access to the material, then your local library will be able to order it by InterLibrary loan.
The level of complexity which you ambitiously want to achieve (that's fine by me!) is commercial, so you won't find a great deal of detailed information.
If you enumerate all document before then you know the size of the bit map. However during construction you will need to hold n bitmaps uncompressed in memory where n is the number of words (technically called terms).
My less than comments needs the word entry replaced by entries. The Entry 999 contains only 999. I meant 999 and 998 and ... Sorry.
Presorting is never a good idea. You must take care of all possible cases. If you join result sets of queries into one result set, this will mess up your sorting completely. It is better to sort how the user wants AFTER the result set has been formed. And after you have asked him if he so wants when the result set contains more than say 200 entries. Most people so no and reformulate the search.
And finally the processing complex queries in one pass. It was exactly THAT which this ISBN search DID NOT DO and that it why it read EVERY SINGLE entry out of the tree. But that is really only an optimisation. Incidentally most SQL databases do this sort of optimisation because one cannot rely on the user entering a query which is optimal. Ie: truncated left search rather than a range (example already quoted!)
have ratitude!

pede

ASKER

'My less than comments needs the word entry replaced by entries. The Entry 999 contains only 999. I meant 999 and 998 and' - Aren't we back at the beginning then? We optimized GT (greater than), but LT (less...) is still very slow. What about using a bitmap (or whatever) with ALL documents and 'NOT' with GT-1? or maybe that was what you meant?

Regarding the sorting. I know of a pretty fast engine (Optosof) which have exactly the features I need. (It's very expensive, BTW, so don't ask me to use it;)
It can sort the result, even though I got 300.000 hits, and it does it in a few seconds. Ofcourse it doesn't actually sort the documents. It uses lookup files, and I can see that it makes two lookup files for each sort-field, one for ascending and one for descending sort. I also wan't to be able to display a huge hitlist, in sorted order!

/Pede

pede

ASKER

Please propose an answer so you can get your points. You can continue this discussion with comments anyway, if you wish.

/Pede

ASKER CERTIFIED SOLUTION

BigRat

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

pede

ASKER

Thanks, I'll have a look at it

/Pede