AlexMert

asked on

Fastest way to search in a 27 GB database ?

Hi experts,

Will it take too long to search for a specific text in a 27 GB database full of text, several times per second, even with almost the fastest processors on a server? Is there any other method to make the text search faster? Thank you
Gary

What kind of database?
What kind of text?
What kind of fields?
And so on...
CPU does not matter as much as the amount of RAM, the speed of the RAM, and what the I/O capabilities of the disk subsystem are.

So what CPU do you have, how much RAM, what speed of RAM and is it single, dual, or triple channel, and what is the disk setup?
ASKER CERTIFIED SOLUTION by d-glitch (solution text available to members only)
AlexMert

ASKER

Well, right now I don't have anything, just an idea for something I want to do, and I want to know if I can make searching for ASCII text very fast in a 27 GB database full of ASCII text.

So @d-glitch, creating an index will increase the speed of the search? Sounds interesting
Indexed searches will speed things up by factors of 10^4 or more.

Not necessarily so. If you are searching multiple text fields and using wildcards like %my text%, then indexing will have no impact.
I agree with GaryC123.  It is true that if what you are looking for is indexed, it will speed up the search.

However, if you are doing a wildcard search or a search across multiple fields, then an index does not help at all because you aren't looking at the index.

Think of it this way.  You want to look up the word "the" in a dictionary.  You can treat the top of each page as an "index" because it shows the 1st and last word on each page.  That makes it really easy and fast to find the page with the word "the" on it.  That is way faster than looking at each word on each page.

However, if you want to find every definition that has the word "the" in it you have to look at each definition in the whole dictionary.  The "index" does not help at all.
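A tiny sketch of the difference in Python, using a dict as a stand-in for an index (the entries here are made up):

definitions = {
    "the":     "definite article",
    "theory":  "a system of ideas",
    "weather": "the state of the atmosphere",
}

# Indexed lookup: one hash probe, regardless of how many entries exist.
print(definitions["the"])

# Wildcard-style search (LIKE '%the%'): every entry must be examined,
# so the index is never consulted.
matches = [w for w, d in definitions.items() if "the" in w or "the" in d]
print(matches)  # ['the', 'theory', 'weather']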
No, I just need to look for a fixed, specific text in that database. With indexing, how many searches per minute could I make for that word, if the database were 27 GB with high-speed RAM and a fast processor? What difference would it make?
>> I just need to look for a fixed specific text
This is my sentence full of words
If you want to search for "sentence" or "my sentence" in that, then an index is useless

Without saying what kind of fields (blob, text, etc.), the type of queries, and the type of server (is this a dedicated database server?), all this is just generic advice.
But anyway you've accepted a generic answer.
@GaryC123, as I mentioned in an earlier post, I don't have anything right now, not even the database. I just want to start a project, and I want to know what my opportunities and possibilities are. I can do anything, like buying a server dedicated only to the database and everything else dedicated to this, just to make the searching as fast as possible. But I need to know what to build...
Maybe you should open a new question.
Ask a vague question and you are only going to get vague and probably inaccurate answers.
What is it you are planning on doing?
Over and out.
I just want to know if it's possible to make searching a very big database, 27+ GB, for a specific text fast enough to run several searches per second
Size of the database is irrelevant: 27 GB might sound big to you, but it isn't really.  Databases can be multiple terabytes in size.
Maybe.  What are the data and what are the searches?

Assume your data is 270M 100-byte sentences.  You make a hash table for your data.
One part of your index is a hashed list of 1M words and the sentences that contain them.
Another part is the most common 100K two-word phrases and their sentences.
And maybe you do three-word phrases too.
To do a search, start with the least common word in the search string. What sentences does it appear in?  How many?  You can check them all, or look at the 2nd least common word and look for overlap.  You may be able to do all the work in your small, highly optimized index.
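As a rough sketch of that idea in Python (all names here are illustrative, I'm assuming one sentence per record, and this only shows the single-word table, not the phrase tables):

from collections import defaultdict

def build_index(sentences):
    # Map each lowercased word to the set of sentence IDs containing it.
    index = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            index[word].add(sid)
    return index

def search(index, sentences, query):
    # Start from the least common query word: its candidate list is smallest.
    words = query.lower().split()
    rarest = min(words, key=lambda w: len(index.get(w, ())))
    # Only the rare word's sentences need the full (expensive) comparison.
    return [sid for sid in index.get(rarest, set())
            if query.lower() in sentences[sid].lower()]

sentences = ["My sentence full of words", "Another sentence entirely"]
idx = build_index(sentences)
print(search(idx, sentences, "sentence full"))  # [0]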
Better, real-world examples will get you better, real-world answers.
OK, thanks a lot for all the help. I will optimize the index as d-glitch said, and Gary also gave me a lot of optimism :)
Is the data within this "data base" fixed?  

Or are you adding and removing "records"?

What does a "record" consist of?

What will be your key?
The data is fixed. I don't know what records are, but I won't add or remove absolutely anything from the database. Also there will be no key :) thanks
You do realize that the index is nothing but a small file that contains the keys and pointers to the records.

Without a key, you can't have an index.  In fact, without a key/index you don't have a database, you have a flat file.  Which means each search is going to scan the whole 27 GB file every single time.

Which gets back to my suggestion of getting a computer with the fastest RAM and CPU you can get, creating an in-memory file system, loading the file into it, and going for it.
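As a rough sketch of that brute-force route in Python (the file name is hypothetical, and you'd need roughly 27 GB of free RAM to hold the whole file):

# Read the flat file into RAM once; after that, every search is a
# memory scan instead of a disk scan.
with open("corpus.txt", "rb") as f:  # hypothetical file name
    data = f.read()

def scan(needle: bytes) -> int:
    # Return the byte offset of the first match, or -1 if absent.
    return data.find(needle)

print(scan(b"my sentence"))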
>> You do realize that the index is nothing but a small file that contains the keys and pointers to the records.

The sort of index I have suggested would be carefully designed and highly optimized to meet the users' performance specifications.  It won't and can't exist until the users deliver those specifications.

>> Size of the database is irrelevant

Really hard to imagine anyone believing that.

======================================================================

Since we have essentially zero information on the database and searches, I have assumed the following:
     The database consists of 270M 100-byte sentences, numbered sequentially.
     There are 2.7B words (avg word len => 9).
     There are 1M unique words.

The index would be a hashed list of every word that appears in the  database.  The data associated with each word would include the number of every sentence that contains the word.  

This index contains 2.7B sentence ID numbers (one for each word in the database).
A sentence ID number takes 29 bits (2^28 ≈ 268M falls just short of 270M), so the ID lists alone run to roughly 2.7B × 29 bits ≈ 10 GB, making the table comparable in size to the original database.  The enormous advantage is that it is searchable.

Suppose you need to search the database for the following sentence:
       The rings of Uranus were discovered during a syzygy in 1977.

You look up each word in your hashed index.  Two words stand out as uncommon:  Uranus (500K occurrences) and syzygy (2K occurrences).  Proper hashing lets you find data very fast (approaching one try on average).

Your search effort is reduced from 270M sentence comparisons to a maximum of 2K.
But sentence comparisons may still be too computationally intensive.

You could also look for overlaps of Uranus and syzygy.  Are any of the 2K syzygy sentences also on the list of 500K Uranus sentences?  This would reduce your effort to 2K index entry comparisons.  If there are any overlaps, you will still have to do some sentence comparisons.  If not, you are done.
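A tiny sketch of that overlap check in Python (the posting lists below are made-up sentence IDs standing in for the real index entries): iterate over the smaller list and test membership in the larger one.

# Hypothetical posting lists: the sentence IDs that contain each word.
uranus_ids = set(range(500_000))    # stand-in for ~500K occurrences
syzygy_ids = {7, 123, 900_000_000}  # stand-in for ~2K occurrences

# ~2K cheap index comparisons instead of 270M sentence comparisons.
overlap = [sid for sid in syzygy_ids if sid in uranus_ids]
print(sorted(overlap))  # [7, 123]: only these need a full sentence comparison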
d-glitch,

I believe most of the responders here understand what indexes can do; however, I don't think AlexMert really understands (no offense, AlexMert).

As you noticed, one of his responses was "Also there will be no key", hence my response letting him know what an index really is.

Without really knowing any more about what the data is, we could all be guessing (and making assumptions) about what can and can't be done.