Requirements for a search engine?


What are the requirements (I mean the hardware and software) to create a search engine, e.g. AltaVista?

And for the database: can MySQL handle the task, or do I need Oracle?

Any ideas?
Let's see...

On the software side, you'll need software that can crawl the web looking for references, a full-text indexing engine, and a database to store the indices in.  None of these are too difficult to do right, but they are all very difficult to do with decent performance.  How the database back-end works doesn't really matter, as long as it's fast and can store the amount of data you have.
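To make the indexing step concrete, here is a minimal sketch of the "tokenize a fetched page into indexable words" part, using only the Python standard library. The page content and word list are made-up examples; a real crawler would also fetch pages, follow links, and respect robots.txt.

```python
from html.parser import HTMLParser
import re

class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def indexable_words(html):
    """Return lowercased word tokens from an HTML document."""
    p = TextExtractor()
    p.feed(html)
    text = " ".join(p.parts)
    return re.findall(r"[a-z0-9]+", text.lower())

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>AltaVista indexes the Web.</p></body></html>")
print(indexable_words(page))  # ['altavista', 'indexes', 'the', 'web']
```

The word list this produces is what the full-text indexer would store, keyed back to the page's URL.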

On the hardware side you'll need a very fast internet connection and enough disk space to index the roughly 1 billion pages on the web.  If we guess 100 indexable words per page and say 10 bytes per entry (it's basically a hash with some overhead if you do it right), that's about 1 terabyte of disk space.  Oh yeah, you'll also need fast enough computers to access all this stuff in real time.
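The back-of-the-envelope arithmetic behind that 1 TB figure, spelled out with the same assumed numbers as above:

```python
# Index sizing estimate, using the figures from the text above.
pages = 1_000_000_000      # ~1 billion pages on the web
words_per_page = 100       # assumed indexable words per page
bytes_per_entry = 10       # hash entry plus some overhead

total_bytes = pages * words_per_page * bytes_per_entry
print(total_bytes / 10**12, "TB")  # 1.0 TB
```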
bsherAuthor Commented:
I am using Perl to develop the search engine and MySQL as the database...

Every time I query a search word it goes to MySQL. Do other search engines normally work like this, or do they search text files?

And what hardware do I need? A Pentium kaimax? Oracle?
Give me a brief description..

1)  Every time you run a query, you obviously have to consult the database.  I don't know what kinds of databases the various commercial search engines use (Oracle, MySQL, DB2, ndbm, etc.), but I seriously doubt they're using flat text files given the size of the data involved.

2)  Given that we've already deduced the need for 1 TB of disk, we're definitely not talking anything with a Pentium in it unless it's a cluster of small machines running Linux.  If you're looking for a single machine, you're talking the high-end offerings from Sun, HP, and Compaq/Digital.

3)  The cost of the disk space for this machine alone will be on the order of $50,000-$100,000.  The machine itself, or machines if clustering, will run you about the same.  The T3 line will run you several $,$$$/month too (you can't index the whole web over a DSL or cable modem, you know!).

bsherAuthor Commented:
I was told that a search engine with 100,000 URLs uses a "plain text + Perl"
technique..... and even some larger search engines do.

I used to use that kind of technique, but unfortunately it takes a few seconds per query...

If by flat file you mean something simple like a single file containing
(word, URL) pairs, then you have to scan the entire file linearly (assuming no sorting of the file, etc.), which makes access O(n), where n is the size of the file.  If you're on fast hardware, you'll get about 4 MB/s.  If each word is 8 bytes, each URL has 100 indexable words, and each URL is an average of 20 bytes, that's (8 + 20 + 2) * 100 = 3000 bytes/URL.  For 100,000 URLs, you get about 286 MB, or about 72 seconds/search.  If you double the number of URLs, you'll double the time.
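The arithmetic behind those figures, with the same assumed sizes (8-byte words, 20-byte URLs, 2 bytes of separators, ~4 MB/s sequential read):

```python
# Linear-scan cost estimate for the flat-file design.
word_bytes, url_bytes, sep_bytes = 8, 20, 2
words_per_url = 100
urls = 100_000
scan_rate = 4 * 2**20   # ~4 MB/s sequential read on fast hardware

bytes_per_url = (word_bytes + url_bytes + sep_bytes) * words_per_url  # 3000
file_size = bytes_per_url * urls                                      # 300,000,000 bytes
print(round(file_size / 2**20), "MB,", round(file_size / scan_rate), "s per search")
# 286 MB, 72 s per search
```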

Alternately, if your structure looks like one file with
(word, word-id) pairs, another with
(word-id, url-id) pairs,
and a third with (url-id, url) pairs,

then, with the same 100,000 URLs, and if we assume 10,000 distinct words, the first file will be under 1 MB, the second will be about 114 MB, and the third about 2 MB, for an average search time of about 30 seconds, less than half the original design.  Doubling the number of URLs will still double the time.
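A toy in-memory version of that three-table layout may make the structure clearer. The URLs and page text are hypothetical examples; on disk, each dict below would be one of the three files described above.

```python
# Three mappings: word -> word-id, url -> url-id, and (word-id, url-id) postings.
docs = {
    "http://example.com/a": "perl search engine",
    "http://example.com/b": "mysql search index",
}

word_ids, url_ids, postings, urls = {}, {}, [], []
for url, text in docs.items():
    uid = len(urls)          # assign the next url-id
    url_ids[url] = uid
    urls.append(url)
    for word in text.split():
        wid = word_ids.setdefault(word, len(word_ids))  # assign word-id on first sight
        postings.append((wid, uid))

def search(word):
    """Resolve word -> word-id, scan the postings, map url-ids back to URLs."""
    wid = word_ids.get(word)
    return [urls[u] for w, u in postings if w == wid] if wid is not None else []

print(search("search"))  # ['http://example.com/a', 'http://example.com/b']
```

Splitting out the id mappings is what shrinks the big file: each posting stores two small ids instead of a full word and URL.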

If we switch to a fixed-format binary structure and keep the index sorted, then we can improve the algorithm to O(log n) by using binary search, with the expected time dropping to something like 0.21 seconds (assuming 10 ms per data access), and doubling the file size increases this only to about 0.22 seconds.  Clearly this is a massive improvement - those are 21/100 and 22/100 of a second, not 21 and 22 seconds!
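Binary search over a sorted index can be sketched with the standard library's `bisect` module; the entries here are made-up examples. The `bisect_left` call is the O(log n) step, followed by a short scan over the run of entries for the same word.

```python
import bisect

# Sorted fixed-format index: (word, url) pairs, sorted by word.
index = sorted([
    ("engine", "http://example.com/a"),
    ("mysql", "http://example.com/b"),
    ("perl", "http://example.com/a"),
    ("search", "http://example.com/a"),
    ("search", "http://example.com/b"),
])

def lookup(word):
    """O(log n) to find the first matching entry, then scan the matching run."""
    i = bisect.bisect_left(index, (word, ""))  # "" sorts before any real URL
    out = []
    while i < len(index) and index[i][0] == word:
        out.append(index[i][1])
        i += 1
    return out

print(lookup("search"))  # ['http://example.com/a', 'http://example.com/b']
```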

However, the problem with this scheme is that building the sorted files will take on the order of a minute or so.  Also, the main file is still quite large, and sorting it will require a lot of space.  These problems can be solved by switching from a sorted file (binary search) to a hashed file, with access now being O(1) (i.e., no matter how large the file gets, the time to do a search stays the same).
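Python's `dbm` module (a wrapper over ndbm-style libraries) gives a feel for the hashed-file approach; the keys and URLs are hypothetical. Each lookup hashes the key directly to its location, with no sort order to maintain.

```python
import dbm, os, tempfile

# A hashed index file: word -> comma-separated URL list.
# Lookups cost O(1) regardless of how large the file grows.
path = os.path.join(tempfile.mkdtemp(), "index")
with dbm.open(path, "c") as db:       # "c" = create if missing
    db["perl"] = "http://example.com/a"
    db["search"] = "http://example.com/a,http://example.com/b"

with dbm.open(path, "r") as db:       # values come back as bytes
    print(db["search"].decode().split(","))
# ['http://example.com/a', 'http://example.com/b']
```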

However, there is not a lot of software out there that's good at building and maintaining hashed files, and we still have the problem of the main file growing larger than a single file can be on a given OS platform if we are indexing a very large number of URLs.  These problems can be addressed by going to a real data management system such as the familiar relational database engines (Oracle, PostgreSQL, Sybase, DB2, MySQL, etc.), or an embedded database package (ndbm).
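As a small illustration of the relational approach, here is the three-table layout from earlier expressed in SQL, using Python's built-in `sqlite3` as a stand-in for the engines named above. The schema, URLs, and page text are hypothetical; a production engine would add ranking, batching, and much more.

```python
import sqlite3

# Minimal relational inverted index.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE urls(id INTEGER PRIMARY KEY, url TEXT UNIQUE);
    CREATE TABLE words(id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE postings(word_id INTEGER, url_id INTEGER);
    CREATE INDEX idx_postings_word ON postings(word_id);
""")

def add_page(url, text):
    con.execute("INSERT OR IGNORE INTO urls(url) VALUES (?)", (url,))
    (uid,) = con.execute("SELECT id FROM urls WHERE url = ?", (url,)).fetchone()
    for word in set(text.lower().split()):
        con.execute("INSERT OR IGNORE INTO words(word) VALUES (?)", (word,))
        (wid,) = con.execute("SELECT id FROM words WHERE word = ?", (word,)).fetchone()
        con.execute("INSERT INTO postings VALUES (?, ?)", (wid, uid))

def search(word):
    rows = con.execute("""
        SELECT u.url FROM postings p
        JOIN words w ON w.id = p.word_id
        JOIN urls u ON u.id = p.url_id
        WHERE w.word = ? ORDER BY u.url
    """, (word.lower(),)).fetchall()
    return [r[0] for r in rows]

add_page("http://example.com/a", "Perl search engine")
add_page("http://example.com/b", "MySQL search index")
print(search("search"))  # ['http://example.com/a', 'http://example.com/b']
```

The database now handles the hashing/B-tree details, concurrent access, and files larger than the OS single-file limit.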