
Pre-caching of large databases

chsalvia asked
I often work with large datasets (over 100GB), and so caching is of course extremely important.

My question is about how MySQL (or other similar DBMSs) handles large data sets.  I assume most DBMSs use some caching mechanism such as LRU to reduce swapping pages from disk.  But I want to know if MySQL has any "pre-fetching" or "pre-caching" option for when a database file is initially opened.

A caching mechanism like LRU is effective when a lot of pages are already cached in memory.  But when you first open a file that isn't the case.  Only after a significant number of accesses would a lot of pages be cached in memory, so I would imagine that when you initially open a large database, each record lookup will almost always require a disk access (i.e. there will be cache-misses almost 100% of the time because the file was just opened so there are no pages in the cache.)

I was wondering how MySQL (or other similar DBMSs) handles this, if at all.  Is there an option to prefetch an even distribution of pages or something when the database is opened?

Top Expert 2007

In MySQL this depends on the storage engine used.


You might post that question in the InnoDB forum.
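For what it's worth, later MySQL releases (5.6 and up) added an InnoDB feature that does almost exactly what the question asks for: the buffer pool can dump its list of cached page IDs at shutdown and reload those pages at startup. A my.cnf fragment (option names are from the MySQL manual; the size value is only an example):

```ini
[mysqld]
# Save the list of buffer-pool page IDs at shutdown...
innodb_buffer_pool_dump_at_shutdown = ON
# ...and re-read those pages at startup to warm the cache.
innodb_buffer_pool_load_at_startup  = ON
# Example size only; tune to the machine's memory.
innodb_buffer_pool_size             = 8G
```

Note that only the page IDs are persisted, not the page contents, so startup still pays the disk reads, just all at once rather than on first access.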


Here is a bit of discussion about this (PostgreSQL-wise):

I would say this problem is a bit research-like.
OK, pre-fetching or pre-caching, but of what?
If you have 100GB of data, how would the server know at startup which indexes, etc., it should load into memory? And even then, there may be no more than 16 GB of memory available (often less). I agree that caching would make things faster, but then you need to somehow know what to cache at startup.

A suggestion might be to record all the SELECTs from the last 2-3 days
in an SQL file and then replay that file when you start the database.
This way there will already be some data in the cache, and subsequent
queries may run more quickly.
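That replay-at-startup idea can be sketched in a few lines. This is an illustration only, using Python's built-in sqlite3 driver instead of a MySQL connection (the function and file names are hypothetical); against MySQL you would pass a connection from a client library and a file of logged SELECTs:

```python
import sqlite3


def warm_cache(conn, log_path):
    """Replay previously logged SELECT statements so the pages they touch
    are read back into the cache. Results are discarded; the side effect
    we want is the page reads. Returns the number of statements replayed."""
    replayed = 0
    with open(log_path) as f:
        for line in f:
            stmt = line.strip()
            if stmt.upper().startswith("SELECT"):
                conn.execute(stmt).fetchall()
                replayed += 1
    return replayed
```

Non-SELECT lines in the log are skipped so a captured general query log can be fed in as-is.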




That's a good point.  How would the server know which data to prefetch?  I suppose one strategy would be to use an LRU (Least Recently Used) based cache, and just write all the row numbers in the LRU cache to disk every time the database closes.  Then, when the database is reopened, read in all the row numbers saved to disk, and cache each of those rows.  This would basically save all the most recently used row numbers to disk, so that when you reopen the database, all commonly accessed rows are already cached.

Apparently, some DBMSs attempt to do row pre-fetching, but others don't.  I don't know what sort of pre-fetching strategy they use, or whether it is in any way similar to what I suggested above.  In the discussion you linked to, it seems that PostgreSQL doesn't, but Oracle does.  I don't know if MySQL does.  Regardless, I suppose a custom solution could be implemented on top of a database like PostgreSQL, even if the DBMS itself doesn't support row pre-caching.  But what do you think about the strategy I described above?



What you suggest can, of course, be done.
Take the MySQL or PostgreSQL sources and see if you can implement it :)
I know that PostgreSQL has a query planner based on a genetic algorithm; the more queries it gets, the better it becomes. I don't think there is any such thing in MySQL yet.

