Mario Bernheim
asked on
How do I optimize for full table scan on a 1.2Million row table?
MySQL - Percona Edition.
Full table scan is taking over 1 second and crashing our backend servers, the table has about 1.2 million records, if the IP is located then it returns very quickly and has no issues, eventually we hit max connections on the server if it can't find it as it does a full table scan. This server is beefy, 128GB of ram + 800GB SSD and dual hexacore processors
The table structure is this:
These are the indexes:
If we do 800 requests per second (lookups) our database server and our backend servers max out their connections waiting for response from backend server... thoughts?
Full table scan is taking over 1 second and crashing our backend servers, the table has about 1.2 million records, if the IP is located then it returns very quickly and has no issues, eventually we hit max connections on the server if it can't find it as it does a full table scan. This server is beefy, 128GB of ram + 800GB SSD and dual hexacore processors
SELECT *
FROM geoip_proxy_ipv4_09_2018
WHERE INET_ATON('192.168.0.1') BETWEEN ip_from AND ip_to
LIMIT 1
The table structure is this:
CREATE TABLE IF NOT EXISTS `geoip_proxy_ipv4_09_2018` (
`ip_from` decimal(10,0) NOT NULL,
`ip_to` decimal(10,0) NOT NULL,
`proxy_type` varchar(3) NOT NULL,
`country_code` varchar(2) NOT NULL,
`country_name` varchar(128) NOT NULL,
`region_name` varchar(128) NOT NULL,
`city_name` varchar(128) NOT NULL,
UNIQUE KEY `idx_ip_from_ip_to` (`ip_from`,`ip_to`),
KEY `ip_to` (`ip_to`),
KEY `ip_from` (`ip_from`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
These are the indexes:
Indexes
Keyname Type Unique Packed Column Cardinality Collation Null Comment
idx_ip_from_ip_to BTREE Yes No ip_from 1254839 A No
ip_to 1254839 A No
ip_to BTREE No No ip_to 1254839 A No
ip_from BTREE No No ip_from 1254839 A No
If we do 800 requests per second (lookups) our database server and our backend servers max out their connections waiting for response from backend server... thoughts?
ASKER
Thanks, Ive gone ahead and done those changes... the change on the type and kept one index on the from and to columns... unfortunately, when the situation persists... while doing a full table scan it will hang all backend servers and they will eventualy max out and start swapping... the database server also runs out of resources if I leave the benchmark on too long...
ASKER
Hello, this seemed to fix the issue...
The queries will take about (Query took 0.0003 sec) on results where it cant find the IP... question is... WHY?
SELECT proxy_type FROM (SELECT proxy_type, `ip_from` FROM `geoip_proxy_ipv4_09_2018` WHERE ip_to >= INET_ATON('173.245.203.9') LIMIT 1) AS mytable WHERE ip_from <= INET_ATON('173.245.203.9')
The queries will take about (Query took 0.0003 sec) on results where it cant find the IP... question is... WHY?
MySQL's optimization docs don't really address range queries in which the constant value is the target, but it does provide some hints on how this would be optimized. Note that this is a little educated guess-work on my part, based on notes in https://dev.mysql.com/doc/refman/5.7/en/range-optimization.html. We might have more insight based on the EXPLAIN results.
MySQL only uses a single index to generate the initial scan of a table. So initially, MySQL is going to choose between the `ip_to` and `ip_from` indexes. Also, MySQL collapses range scan requirements into the "greatest common denominator" (my term), which is almost always more general than the actual WHERE clause. In this example, in which C is the searched constant, A is ip_from, and B is ip_to:
With the second query, the benefits are immediately clear in that MySQL can now use both indexes, though it uses them consecutively. The first query will result in a simple predicate of "C<=B", which will use the `ip_to` index. The results of this move into the outer query, with another simple predicate of "B<=A", which will use the `ip_from` index. Both of these are strictly index scans, which happens very quickly.
Again, this is a little guess-work, but it makes sense in the context of how MySQL internally optimizes queries. I would imagine that your EXPLAIN results would show a similar scenario to what I've described.
MySQL only uses a single index to generate the initial scan of a table. So initially, MySQL is going to choose between the `ip_to` and `ip_from` indexes. Also, MySQL collapses range scan requirements into the "greatest common denominator" (my term), which is almost always more general than the actual WHERE clause. In this example, in which C is the searched constant, A is ip_from, and B is ip_to:
[...] WHERE C BETWEEN A AND B
/* Refactored = early step of optimization */
WHERE A <= C AND C <= B
Here it becomes a little more clear what's going to happen. MySQL can only choose one of the indexes available, so it will scan the index of the table to "guess" which predicate will yield less rows for the first part of row selection. I.e, are there less rows in "A<=C" or "C<=B"? Whichever it is, MySQL will pre-select all the matching rows for secondary filtering against the entire WHERE clause. In cases where your target is in the middle of the domain range, or in which MySQL guesses incorrectly, it could be pre-selecting half (or slightly more) of the table. The secondary filtering is a row-by-row comparison of all predicates. If your pre-selected rows constitutes a large enough bulk, MySQL may resort to file-based filtering/sorting at this point. I'm guessing that is happening, since you described your server's thrashing and swapping.With the second query, the benefits are immediately clear in that MySQL can now use both indexes, though it uses them consecutively. The first query will result in a simple predicate of "C<=B", which will use the `ip_to` index. The results of this move into the outer query, with another simple predicate of "B<=A", which will use the `ip_from` index. Both of these are strictly index scans, which happens very quickly.
Again, this is a little guess-work, but it makes sense in the context of how MySQL internally optimizes queries. I would imagine that your EXPLAIN results would show a similar scenario to what I've described.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
thanks everyone
Few questions...
Why the index on ip_from..., the index idx_ip_from_ip_to will be used for select with ip_from as well... (try the explain statement)...
(Now you have one index that will never be used but will need to be updated on changes.).
The INET_ATON delivers an INT UNSIGNED ,...... decimal != INT.... So why used decimal type.
INET_ATON will result in a 4 byte unsigned integer.. (max 9 digits).
DECIMAL is a numeric string (with up to 65 digits...) this is not the same as a binary integer, or even mapped to an integer type.
From the manual:
To store values generated by INET_ATON(), use an INT UNSIGNED column rather than INT, which is signed. If you use a signed column, values corresponding to IP addresses for which the first octet is greater than 127 cannot be stored correctly. See Section 11.2.6, “Out-of-Range and Overflow Handling”.
Please try to use the same types for ip_from & ip_to as INET_ATON gives that at least prevents type conversions, and possibly will get the index to actually get used.