bsher

asked on

Requirements for a search engine?

hi,

What are the requirements (I mean the hardware and software) for creating a search engine, e.g. AltaVista?

And for the database: can MySQL handle the task, or do I need Oracle?

Any ideas?
chris_calabrese

Let's see...

On the software side, you'll need software that can crawl the web looking for references, a full-text indexing engine, and a database to store the indices in.  None of these is too difficult to do right, but they are all very difficult to do with decent performance.  How the database back-end works doesn't really matter, as long as it's fast and can store the amount of data you have.

On the hardware side, you'll need a very fast Internet connection and enough disk space to index the roughly 1 billion pages on the web.  If we guess 100 indexable words per page and say 10 bytes per entry (it's basically a hash with some overhead if you do it right), that's about 1 terabyte of disk space.  Oh yeah, you'll also need computers fast enough to access all this stuff in real time.
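That disk-space figure is just multiplication; here is the same back-of-the-envelope arithmetic written out (all three inputs are the rough estimates from the paragraph above, not measurements):

```python
# Back-of-the-envelope index sizing, using the estimates quoted above.
pages = 1_000_000_000    # rough number of pages on the web (estimate)
words_per_page = 100     # indexable words per page (estimate)
bytes_per_entry = 10     # per hash entry, including overhead (estimate)

index_bytes = pages * words_per_page * bytes_per_entry
print(index_bytes / 10**12, "TB")  # -> 1.0 TB
```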
bsher

ASKER

I am using Perl to develop the search engine and MySQL as the database...

Every time I query a search word, it goes to MySQL. Do other search engines normally work like this, or do they use text files for searching?

And what hardware do I need? A Pentium kaimax? Oracle?
Give me a brief description..

Thanx
ASKER CERTIFIED SOLUTION
chris_calabrese

bsher

ASKER

I was told that a search engine with 100,000 URLs uses the "plain text + Perl"
technique..... and so do even larger search engines.

I used to use that kind of technique, but unfortunately it takes a few seconds per query...

If by flat file you mean something simple like a single file containing
  word:url
pairs, then you have to scan the entire file linearly (assuming no sorting of the file, etc.), which makes access O(n), where n is the size of the file.  If you're on fast hardware, you'll get about 4 MB/s.  If each word is 8 bytes, each URL has 100 indexable words, and each URL averages 20 bytes, that's (8 + 20 + 2) * 100 = 3,000 bytes/URL.  For 100,000 URLs, that's about 286 MB, or about 72 seconds per search.  If you double the number of URLs, you double the time.
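A minimal sketch of that linear scan (in Python rather than the asker's Perl; the `word:url` line layout is the one assumed above):

```python
def search_flat(lines, query):
    """Linear scan over word:url pairs -- O(n) in the index size.

    `lines` can be an open file handle or any iterable of strings.
    """
    hits = []
    for line in lines:
        word, _, url = line.rstrip("\n").partition(":")
        if word == query:
            hits.append(url)
    return hits
```

At 4 MB/s of sequential read, the 286 MB index above costs about 72 seconds per query no matter how the loop is written; the time goes into the scan itself, not the code around it.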

Alternately, if your structure looks like one file with
  word:word-index
another with
  word-index:url-index
and a third with
  url-index:url

Then, with the same 100,000 URLs, and if we assume 10,000 distinct words, the first file will be under 1 MB, the second about 114 MB, and the third about 2 MB, giving an average search time of about 30 seconds, less than half the original design.  Doubling the number of URLs will still double the time.
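The same three-way split can be sketched in memory with Python dicts standing in for the three files (the function names and the `{url: [words]} ` input shape are illustrative assumptions, not anything from the thread):

```python
def build_index(pages):
    """pages: {url: [words on that page]} -- e.g. output of a crawler."""
    word_ids = {}   # word -> word-index          (file 1)
    postings = {}   # word-index -> [url-index]   (file 2)
    urls = []       # url-index -> url            (file 3)
    for url, words in pages.items():
        uid = len(urls)
        urls.append(url)
        for w in dict.fromkeys(words):            # dedupe per page, keep order
            wid = word_ids.setdefault(w, len(word_ids))
            postings.setdefault(wid, []).append(uid)
    return word_ids, postings, urls

def lookup(word, word_ids, postings, urls):
    wid = word_ids.get(word)
    return [] if wid is None else [urls[uid] for uid in postings[wid]]
```

The space saving comes from the indirection: the big middle file stores small integer indices instead of repeating full words and URLs.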

If we switch to a fixed-format binary structure and keep the index sorted, then we can improve the algorithms to O(log n) by using binary search, with the expected time dropping to something like 0.21 seconds (assuming 10 ms per data access), and doubling the file size increases this only to about 0.22 seconds.  Clearly this is a massive improvement - those are 21/100 and 22/100 of a second, not 21 and 22 seconds!
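Here is one way the fixed-format binary file and its binary search might look (the 8-byte word and 20-byte URL field widths reuse the estimates above; the record layout itself is an assumption for illustration):

```python
import struct

# Fixed-width record: 8-byte word field + 20-byte URL field, null-padded,
# with the whole file kept sorted by word.
REC = struct.Struct("8s20s")

def find(blob, query):
    """O(log n) lower-bound binary search over the sorted record blob."""
    n = len(blob) // REC.size
    key = query.encode().ljust(8, b"\0")
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        word, _ = REC.unpack_from(blob, mid * REC.size)
        if word < key:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < n:                       # collect every record for this word
        word, url = REC.unpack_from(blob, lo * REC.size)
        if word != key:
            break
        hits.append(url.rstrip(b"\0").decode())
        lo += 1
    return hits
```

Because every record is the same size, record k lives at byte offset k * 28, so the search can seek straight to it instead of scanning.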

However, the problem with this scheme is that building the sorted files will take on the order of a minute or so.  Also, the main file is still quite large and sorting it will require a lot of space.  These problems can be solved by switching from a sorted file (binary search) to a hashed file, with access now being O(1) (i.e., no matter how large the file gets, the time to do a search will be the same).
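A hashed-file version is easy to sketch with Python's standard `dbm` module (Perl's DB_File/SDBM_File modules play the same role); storing postings as a newline-joined byte string is an assumption of this sketch:

```python
import dbm

def add_posting(path, word, url):
    """Append a URL to the word's posting list in a hashed file."""
    with dbm.open(path, "c") as db:           # "c": create file if missing
        old = db[word] if word in db else b""
        db[word] = old + url.encode() + b"\n"

def get_postings(path, word):
    """O(1) expected lookup, regardless of how large the file grows."""
    with dbm.open(path, "c") as db:
        raw = db[word] if word in db else b""
    return raw.decode().splitlines()
```

No sorting pass is ever needed: the hash places each key directly, which is what removes the minute-long rebuild step described above.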

However, there is not a lot of software out there that's good at building and maintaining hashes, and we still have the problem of the main file growing larger than a single file can be on a given OS platform if we are indexing a very large number of URLs.  These problems can be addressed by going to a real data management system such as the familiar relational database engines (Oracle, Postgres, Sybase, DB2, MySQL, etc.), or to an embedded database package (ndbm).
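The relational version of the same index might look like this, sketched with Python's built-in sqlite3 standing in for the asker's MySQL (the schema and names are illustrative; `INSERT OR IGNORE` is SQLite's spelling of MySQL's `INSERT IGNORE`):

```python
import sqlite3

SCHEMA = """
CREATE TABLE urls     (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
CREATE TABLE words    (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE postings (word_id INTEGER, url_id INTEGER);
CREATE INDEX postings_word ON postings (word_id);
"""

def index_page(con, url, words):
    cur = con.cursor()
    cur.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    url_id = cur.execute("SELECT id FROM urls WHERE url = ?", (url,)).fetchone()[0]
    for w in set(words):
        cur.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (w,))
        word_id = cur.execute("SELECT id FROM words WHERE word = ?", (w,)).fetchone()[0]
        cur.execute("INSERT INTO postings VALUES (?, ?)", (word_id, url_id))
    con.commit()

def search(con, word):
    rows = con.execute(
        "SELECT u.url FROM urls u "
        "JOIN postings p ON p.url_id = u.id "
        "JOIN words w ON w.id = p.word_id "
        "WHERE w.word = ? ORDER BY u.url", (word,)).fetchall()
    return [r[0] for r in rows]
```

The database engine now owns the hashing/B-tree indexing, the on-disk layout, and the growth beyond single-file limits, which is exactly the point of the recommendation above.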