Requirements for a search engine?

Posted on 2000-01-20
Medium Priority
Last Modified: 2010-04-21

What are the requirements (I mean hardware and software) for creating a search engine, e.g. AltaVista?

And for the database, can MySQL handle the task, or do I need Oracle?

Any ideas?
Question by:bsher

Expert Comment

ID: 2371733
Let's see...

On the software side, you'll need software that can crawl the web looking for references, a full-text indexing engine, and a database to store the indices in.  None of these is too difficult to get working, but they are all very difficult to do with decent performance.  How the database back-end works doesn't really matter as long as it's fast and can store the amount of data you have.
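To make the "full-text indexing engine" part concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration: the crawler is replaced by a hard-coded dictionary of pages, the URLs are made up, and `build_index` and `search` are hypothetical names, not part of any real package.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, word):
    """Return the sorted list of URLs indexed under a single word."""
    return sorted(index.get(word.lower(), set()))

# Stand-in for crawled pages (made-up URLs and text).
pages = {
    "http://example.com/a": "Perl and MySQL search engine",
    "http://example.com/b": "Oracle database tuning",
}
index = build_index(pages)
print(search(index, "search"))
```

Everything here lives in memory, which is exactly the part that stops working once the index is larger than RAM.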

On the hardware side you'll need a very fast internet connection and enough disk space to index the roughly 1 billion pages on the web.  If we guess 100 indexable words per page and say 10 bytes per entry (it's basically a hash with some overhead if you do it right), that's 1 billion * 100 * 10 bytes, or about 1 terabyte of disk space.  Oh yeah, you'll also need computers fast enough to access all this stuff in real time.
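The disk estimate is just a multiplication, written out here in Python so you can plug in your own guesses. The three inputs are the rough figures from the comment above, not measurements.

```python
# Rough figures from the estimate above; all three are guesses.
page_count = 1_000_000_000   # pages on the web
words_per_page = 100         # indexable words per page
bytes_per_entry = 10         # index entry size, including overhead

total_bytes = page_count * words_per_page * bytes_per_entry
print(total_bytes / 10**12, "TB")
```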

Author Comment

ID: 2372915
I am using Perl to develop the search engine and MySQL as the database...

Every time I query a search word it goes to MySQL. Do other search engines normally work like this, or do they use text files for searching?

And what hardware do I need? Pentium kaimax? Oracle?
Give me a brief description.


Accepted Solution

chris_calabrese earned 40 total points
ID: 2381570
1) Every time you run a query, you obviously have to consult the database.  I don't know what kinds of databases the various commercial search engines use (Oracle, MySQL, DB2, ndbm, etc.), but I seriously doubt they're using flat text files given the size of the data involved.

2)  Given that we've already deduced the need for 1 TB of disk, we're definitely not talking about anything with a Pentium in it unless it's a cluster of small machines running Linux (a la www.google.com).  If you're looking for a single machine, you're talking about the high-end offerings from Sun, HP, and Compaq/Digital.

3)  The cost of the disk space for this machine alone will be on the order of $50,000-$100,000.  The machine itself, or machines if clustering, will run you about the same.  The T3 line will run you several thousand dollars a month too (you can't index the whole web over DSL or a cable modem, you know!).

Author Comment

ID: 2385686
I was told that a search engine with 100,000 URLs is using a "plain text + Perl" technique, and that even larger search engines do this.

I used to use that kind of technique, but unfortunately it takes a few seconds per query...


Expert Comment

ID: 2386035
If by flat file you mean something simple like a single file containing word/URL pairs, then you have to scan the entire file linearly (assuming the file isn't sorted, etc.), which makes access O(n), where n is the size of the file.  If you're on fast hardware, you'll get about 4 MB/s.  If each word is 8 bytes, each URL has 100 indexable words, and each URL is an average of 20 bytes, with 2 bytes of overhead per pair, that's (8 + 20 + 2) * 100 = 3000 bytes/URL.  For 100,000 URLs, you get about 286 MB, or about 72 seconds/search.  If you double the number of URLs, you'll double the time.
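Here is a small Python sketch of both the arithmetic and the linear scan itself. The one-pair-per-line, space-separated file format is my assumption about what "plain text" means here, and `estimate` and `linear_search` are hypothetical helper names.

```python
import io

def estimate(urls, words_per_url=100, bytes_per_pair=8 + 20 + 2, mb_per_s=4):
    """Return (file size in MB, seconds to scan the whole file once)."""
    size_mb = urls * words_per_url * bytes_per_pair / 2**20
    return size_mb, size_mb / mb_per_s

def linear_search(flat_file, word):
    """Read every 'word url' line; the cost grows with the file size, O(n)."""
    flat_file.seek(0)
    hits = []
    for line in flat_file:
        entry_word, url = line.split()
        if entry_word == word:
            hits.append(url)
    return hits

size_mb, seconds = estimate(100_000)
print(f"{size_mb:.0f} MB, {seconds:.0f} s per search")

# Tiny in-memory stand-in for the flat file (made-up URLs).
flat = io.StringIO(
    "perl http://example.com/a\n"
    "oracle http://example.com/b\n"
    "perl http://example.com/c\n"
)
print(linear_search(flat, "perl"))
```

Calling `estimate(200_000)` doubles both numbers, which is the linear growth described above.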

Alternatively, if your structure looks like one file with word/word-ID pairs, another with word-ID/URL-ID pairs, and a third with URL-ID/URL pairs, then, with the same 100,000 URLs, and if we assume 10,000 distinct words, the first file will be under 1 MB, the second will be about 114 MB, and the third about 2 MB, for an average search time of about 30 seconds, less than half that of the original design.  Doubling the number of URLs will still double the time.
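The sizes for the three-file layout can be checked the same way. The record widths below are my own guesses (4-byte IDs, 2 bytes of overhead, 12 bytes per ID pair), picked so that the middle file comes out near the 114 MB quoted above; the other two files land in the same ballpark as the figures in the comment rather than matching them exactly.

```python
MB = 2**20
urls, distinct_words, words_per_url = 100_000, 10_000, 100

# Assumed record widths in bytes; these are guesses, not from the thread.
word_file = distinct_words * (8 + 4 + 2)   # word + word ID + overhead
pair_file = urls * words_per_url * 12      # word ID + URL ID + overhead
url_file = urls * (20 + 4 + 2)             # URL + URL ID + overhead

total_mb = (word_file + pair_file + url_file) / MB
seconds = total_mb / 4                     # same 4 MB/s scan rate as before

print(f"words {word_file / MB:.2f} MB, pairs {pair_file / MB:.0f} MB, "
      f"urls {url_file / MB:.1f} MB, about {seconds:.0f} s to scan")
```

The saving comes entirely from not repeating the 8-byte word and the 20-byte URL in every pair; the scan is still linear.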

If we switch to a fixed-format binary structure and keep the index sorted, then we can improve the algorithm to O(log n) by using binary search, with the expected time dropping to something like .21 seconds (assume 10 ms per data access), and a doubling of the file size increasing this to about .22 seconds.  Clearly this is a massive improvement: those are 21/100 seconds and 22/100 seconds, not 21 and 22!
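A minimal Python sketch of binary search over fixed-width sorted records follows. The 12-byte record layout (8-byte padded word plus a 4-byte ID) is an assumption for illustration, words are assumed to be ASCII and at most 8 characters, and an in-memory `io.BytesIO` stands in for the real index file.

```python
import io
import struct

# Assumed layout: 8-byte NUL-padded word + 4-byte unsigned ID = 12 bytes.
RECORD = struct.Struct(">8sI")

def write_sorted(word_ids):
    """Write the records in sorted word order into an in-memory 'file'."""
    buf = io.BytesIO()
    for word in sorted(word_ids):
        buf.write(RECORD.pack(word.encode("ascii"), word_ids[word]))
    return buf

def lookup(f, word):
    """Binary search by seeking to record boundaries; returns (id, probes)."""
    key = word.encode("ascii").ljust(8, b"\0")
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell() // RECORD.size
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD.size)
        raw, value = RECORD.unpack(f.read(RECORD.size))
        probes += 1
        if raw == key:
            return value, probes
        if raw < key:
            lo = mid + 1
        else:
            hi = mid
    return None, probes

index_file = write_sorted({"oracle": 1, "perl": 2, "mysql": 3, "search": 4})
print(lookup(index_file, "perl"))
print(lookup(index_file, "java"))
```

Each probe is one seek plus one small read, so the probe count times the per-access cost gives the kind of estimate made above, and doubling the file adds only one more probe.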

However, the problem with this scheme is that building the sorted files will take on the order of a minute or so.  Also, the main file is still quite large and sorting it will require a lot of space.  These problems can be solved by switching from a sorted file (binary search) to a hashed file, with access now being O(1) (i.e., no matter how large the file gets, the time to do a search will be the same).
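As a rough picture of what a hashed file looks like, here is a Python sketch of a fixed-slot table using linear probing. The slot layout, the use of `zlib.crc32` as the hash function, and the helper names are all my assumptions; it also assumes non-empty ASCII words of at most 8 characters and a table that is never completely full.

```python
import io
import struct
import zlib

SLOT = struct.Struct(">8sI")   # assumed slot layout: padded word + 4-byte ID
EMPTY = b"\0" * 8              # an all-zero key marks an unused slot

def slot_of(key, n_slots, attempt):
    """Hash the key to a slot number; 'attempt' steps past collisions."""
    return (zlib.crc32(key) + attempt) % n_slots

def build(word_ids, n_slots):
    """Create a zero-filled table and insert every word without sorting."""
    f = io.BytesIO(bytes(SLOT.size * n_slots))
    for word, value in word_ids.items():
        key = word.encode("ascii").ljust(8, b"\0")
        for attempt in range(n_slots):
            pos = slot_of(key, n_slots, attempt) * SLOT.size
            f.seek(pos)
            raw, _ = SLOT.unpack(f.read(SLOT.size))
            if raw == EMPTY or raw == key:
                f.seek(pos)
                f.write(SLOT.pack(key, value))
                break
        else:
            raise ValueError("hash table is full")
    return f

def lookup(f, word, n_slots):
    """Jump straight to the hashed slot; usually one or two reads."""
    key = word.encode("ascii").ljust(8, b"\0")
    for attempt in range(n_slots):
        f.seek(slot_of(key, n_slots, attempt) * SLOT.size)
        raw, value = SLOT.unpack(f.read(SLOT.size))
        if raw == key:
            return value
        if raw == EMPTY:
            return None
    return None

table = build({"perl": 2, "oracle": 1, "mysql": 3}, n_slots=8)
print(lookup(table, "perl", 8), lookup(table, "java", 8))
```

The constant-time behaviour only holds on average and only while the table stays well below full, which is one reason maintaining such files by hand gets tedious.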

However, there is not a lot of software out there that's good at building and maintaining hashes, and we still have the problem of the main file growing larger than a single file can be on a given OS platform if we are indexing a very large number of URLs.  These problems can be addressed by going to a real data management system such as the familiar relational database engines (Oracle, Postgres, Sybase, DB2, MySQL, etc.), or to an embedded database package (ndbm).
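For a feel of the relational version, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for MySQL or Oracle (SQLite is my substitution, chosen only because it needs no server). The table and column names are hypothetical, and the data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE urls (url_id INTEGER PRIMARY KEY, url TEXT UNIQUE);
CREATE TABLE postings (
    word_id INTEGER,
    url_id INTEGER,
    PRIMARY KEY (word_id, url_id)
);
""")
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [(1, "perl"), (2, "oracle")])
conn.executemany("INSERT INTO urls VALUES (?, ?)",
                 [(1, "http://example.com/a"), (2, "http://example.com/b")])
conn.executemany("INSERT INTO postings VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])

def search(word):
    """Look the word up via the UNIQUE and PRIMARY KEY indexes, then join."""
    rows = conn.execute(
        "SELECT u.url FROM words w "
        "JOIN postings p ON p.word_id = w.word_id "
        "JOIN urls u ON u.url_id = p.url_id "
        "WHERE w.word = ? ORDER BY u.url",
        (word,),
    )
    return [row[0] for row in rows]

print(search("perl"))
```

The database engine builds and maintains the indexes behind `UNIQUE` and `PRIMARY KEY` itself, which is the bookkeeping you would otherwise write by hand for the sorted or hashed files.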
