How will an index work on a two-column table in a huge database

Rohit Bajaj
asked:
Hi,
Suppose I have a MySQL table with columns (originalUrl : varchar(500), shortUrl : varchar(10)).
The queries executed on this table will mostly be:
1. select * from table where shortUrl = X
2. insert into table (originalUrl, shortUrl)

So there should be an index on shortUrl to speed this up.
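
For concreteness, the setup could look like this (a minimal sketch; the table name url_map and index name idx_shorturl are assumptions, not anything from my actual schema):

    CREATE TABLE url_map (
        originalUrl VARCHAR(500) NOT NULL,
        shortUrl    VARCHAR(10)  NOT NULL,
        INDEX idx_shorturl (shortUrl)   -- speeds up lookups by shortUrl
    ) ENGINE=InnoDB;

    -- the two typical statements:
    SELECT * FROM url_map WHERE shortUrl = 'abc123';
    INSERT INTO url_map (originalUrl, shortUrl)
    VALUES ('https://example.com/some/long/path', 'abc123');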

I have the following questions:
1. What exactly will the index store?
My understanding is that the index will store entries like (shortUrl, pointerToDisk), where pointerToDisk locates the exact place on disk where the row is stored.

2. Where is the index stored?
Is it always stored on disk, or in memory?

3. What if the size of the index exceeds that of RAM?
In this case the full index can never be in RAM, so how will queries like select * from table where shortUrl = x execute?
Will a part of the index be pulled in every time to check the location?

4. In case this table is very huge, say 3 TB, how big will the index be?

5. If the index is larger than RAM, queries will take a lot of time. Is there a better alternative, such as a NoSQL database, or splitting the data across two machines rather than storing it on one?

Thanks
Hi Rohit,

My answers cannot be fully accurate because you did not say which storage engine is used. If you need more details, search the web; e.g. the InnoDB index structure is described here: https://dev.mysql.com/doc/refman/8.0/en/innodb-physical-structure.html

Of course, it is better to start with something simpler. The two basic index types are CLUSTERED and NONCLUSTERED. There is at most one clustered index per table, and it contains all the table data. Other indexes (also called secondary indexes) point to the clustered index, if one exists, or to the data on a heap. More info: https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html
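
For example, with InnoDB the PRIMARY KEY is the clustered index, and any secondary index stores the primary key value as its "pointer". A sketch (table and column names assumed):

    CREATE TABLE url_map (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        originalUrl VARCHAR(500) NOT NULL,
        shortUrl    VARCHAR(10)  NOT NULL,
        PRIMARY KEY (id),               -- clustered index: its leaves hold the full rows
        INDEX idx_shorturl (shortUrl)   -- secondary index: stores (shortUrl, id) pairs
    ) ENGINE=InnoDB;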

Short answers to your questions:
1) The index contains the key in some key structure (e.g. a B-tree) and a pointer to the data, which is never the physical place on disk but rather the key in the clustered index or a database file page address.

2) A standard index is always stored on disk, because we cannot risk data volatility. Memory is used for caching only, and for special index types described e.g. here: https://dev.mysql.com/doc/refman/8.0/en/memory-storage-engine.html

3) The database engine must read the index into memory, but it obviously does not need the whole index in RAM. The tree structure allows it to read just a few index pages to retrieve the requested data. Of course, if you query the whole large table, RAM is freed when necessary, a JOIN of large tables can create temp tables on disk, etc.

4) You should create indexes as small as possible... For your table (originalUrl : varchar(500), shortUrl : varchar(10)), if we suppose all long URLs are 500 characters, a 3 TB table means about 5,000,000,000 rows. (Do you really have that many short URLs?) You may decide to make the shortUrl column the key of the clustered index; then you don't need extra space for the index beyond the tree structure above the data.
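
A sketch of that layout (assuming each shortUrl maps to exactly one originalUrl):

    CREATE TABLE url_map (
        shortUrl    VARCHAR(10)  NOT NULL,
        originalUrl VARCHAR(500) NOT NULL,
        PRIMARY KEY (shortUrl)   -- clustered: the B-tree leaves are the rows themselves
    ) ENGINE=InnoDB;

A lookup by shortUrl then goes straight down the clustered B-tree; there is no separate index structure to store or to keep in RAM.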

5) Imagine the query select * from table where shortUrl = X.
This query must traverse the index tree, which means reading just a few pages from disk to find the page containing the originalUrl. So there is no problem with RAM at all. For a better picture of index tree traversal, look here: http://www.unofficialmysqlguide.com/btrees.html
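
You can verify that such a lookup really is an index traversal with EXPLAIN (table name assumed, as above):

    EXPLAIN SELECT * FROM url_map WHERE shortUrl = 'abc123';
    -- expect type=const (PK lookup) or type=ref (secondary index), with rows=1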

Of course, the more RAM you have, the fewer disk swap operations your queries need... I would recommend reading more about it. You may start e.g. here: https://aimeos.org/docs/Administrators/Optimize_performance/MySQL
and here: https://dba.stackexchange.com/questions/100672/innodb-index-doesnt-fit-in-memory

And think about an SSD drive for your database, if you have the funds...
ste5an (Senior Developer) commented:
Good explanation.

Addendum: In your use case, this means in short that your table must have a clustered primary key on the short URL column. And as I wrote in the other post, consider using a GUID (a UUID in MySQL, or better a truly random GUID) as the base value.
I would use shortUrl as the clustered PK index. Additional data is not necessary here. Of course, if one shortUrl should open more than one web page, then you cannot define shortUrl as a PK.
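
A sketch of the GUID suggestion above (deriving the short code from a slice of a UUID's hex digits is just one possible scheme, not something prescribed here):

    -- note: MySQL's UUID() is time-based, not random; a truly random value may be preferable
    INSERT INTO url_map (shortUrl, originalUrl)
    VALUES (LEFT(REPLACE(UUID(), '-', ''), 10),
            'https://example.com/some/long/path');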

1/ If the table is InnoDB, and the short URL is the PK (or a combination of both fields is), there won't be pointers, because the whole table will be stored as a B-tree.

This does not make things much more efficient for very big tables. InnoDB loads entire pages into memory, so if the table is much bigger than your available RAM, and the query distribution spans the whole table, you'll end up thrashing.

If, for example, you know that older queries are much less frequent, you may alleviate this issue by prefixing your short URLs with part of the current timestamp, so that most queries hit the latest pages; possibly use a separate server for older queries.
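
A sketch of that idea (the key layout here, 4 hex chars of days-since-epoch plus a 6-char code, is an assumption for illustration):

    INSERT INTO url_map (shortUrl, originalUrl)
    VALUES (CONCAT(LPAD(HEX(UNIX_TIMESTAMP() DIV 86400), 4, '0'), 'abc123'),
            'https://example.com/some/long/path');
    -- recent keys share a prefix, so they cluster on the same B-tree pages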

Likewise, partitioning the table can help: you can run a cluster of SQL servers and have each of them handle part of the data. Either partition manually and have each server hold the data it needs, or simply issue queries to different servers based on a prefix, or actually partition the table and issue queries to different servers based on the partition a query will hit.
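
For the in-database variant, MySQL can partition on the key directly (the partition count here is arbitrary):

    CREATE TABLE url_map (
        shortUrl    VARCHAR(10)  NOT NULL,
        originalUrl VARCHAR(500) NOT NULL,
        PRIMARY KEY (shortUrl)
    ) ENGINE=InnoDB
    PARTITION BY KEY (shortUrl) PARTITIONS 8;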

Note that an SQL engine is probably not your best bet if you're not storing anything more. Maybe use a key-value store such as memcached with a disk backend, or Redis, or Cassandra.

If you can indicate the volume of your table, we should be able to tell you what will and will not work.

2/ The index is stored on disk, but whatever is in use is normally cached.
With the above setup, each page will contain the part of the index corresponding to the data it holds,
and each entire page will be loaded/unloaded on demand.

3/ Yes, parts of the index; I believe this is covered above. The needed page(s) will be loaded on demand. If RAM is full, older pages need to be flushed. If this happens often while answering basic key-value queries, the DB engine will thrash very heavily. Don't forget the index will be bigger than the equivalent textual data.

4/ I'd blindly answer around 4 TB at first, probably 5 TB if you add and remove entries a lot, 6 TB tops. If the index is the PK, this is not "extra" space. I'm answering for a clustered index containing both columns.

5/ See above: multiple machines.
Why should the index size be 4 (or even 5 or 6) TB for 3 TB of data?

A query asking for one short URL will read 33 pages in the worst case with 5,000,000,000 rows. This is calculated for one index key per page, so the actual average number of page reads will be lower. Making it more complex by adding a timestamp to the URL does not make sense.
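
The arithmetic behind the 33 pages, assuming the worst case of one key per page (so the tree degenerates to a binary tree): log2(5,000,000,000) ≈ 32.2, i.e. at most 33 page reads. A real InnoDB page holds hundreds of keys, so the actual tree is only 3-4 levels deep.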
The index is always bigger than the data; this accounts for the overhead of the index structures themselves. As stated above, the hypothesis was that the index would contain ALL the data, including the target URL. The overhead will vary a lot depending on the insertion policy, whether deletions occur, and how the keys are generated.

If you only store the short URLs in the index and they are actually short (around 5-6 chars, properly optimized), a regular server will be able to keep the full index in RAM, but it will need to read the URLs from disk most of the time.

--

Hmm... the worst case would depend on how long the keys and URLs are. InnoDB pages default to 16 KB, if I remember correctly. I wonder how you calculated the 33 pages, though you may well be right.

Making it more complex by adding timestamps does make sense (I'd probably add only an ASCII representation of the beginning of the timestamp) if you rely on those dates to serve the more recent pages from one host and older ones from a different one. Sharding the data is obviously better, but much more complex to maintain and scale.

That said, if the data is only 3 TB, it is likely that a single enterprise-grade server with enough RAM would be good enough. And if not, it is very likely that an SQL backend is not the best option available.
