I am putting together a website that needs to be able to show the nearest X locations to a given postcode.
In order to do so, I have imported a very large table of 1.7 million records, which is a complete list of all postcodes in the UK. My intention was to use MySQL to calculate the nearest locations, through using the Haversine formula in conjunction with the latitude and longitude, taken from the postcode table.
However, I have only just started test this and have come to realise the massive resource implication, given that in order to find the 20 nearest, I would need to individually calculate the distance from X of all 1.7 million records, then sort by distance...
Here is the MySQL that i was originally playing around with:
SELECT type_place.postcode, type_place.place_title, postcodelatlng.latitude, postcodelatlng.longitude, 111.045*haversine(latitude,longitude,latpoint, longpoint) AS distance_in_km FROM postcodelatlng INNER JOIN type_place ON type_place.postcode=postcodelatlng.postcode INNER JOIN ( SELECT 54.97 AS latpoint, -1.607 AS longpoint ) AS p WHERE postcodelatlng.latitude < 56 AND postcodelatlng.latitude > 54 AND postcodelatlng.longitude<-1.5 AND postcodelatlng.longitude>-1.7 ORDER BY distance_in_km LIMIT 15
I think i perhaps wasnt clear enough. When the system is in use, I would like to be able to calculate the closest 10 locations to any defined location. Therefore given the location could change, it would need calculating at runtime.
Regards
Dean
Dan Craciun
From what I understood, you have 1.7 million possible locations.
Create a table with 1.7 million records, each line containing the location and it's 10 neighbors.
Thanks to everyone for their comments, apologies for the massive delay responding. I have been meaning too, but wanted to run some test before commenting, and never found the time...
I suspect one of the reasons for the poor performance is that the Db is stored on a shared host, so im competing for processing power. I have ran some tests and it looks like the main reason for the delay is simply querying the large postcode table (1.7 million records).
I performed the haversine calc on table of 50,100,150 and 200 records only and it was pretty much instant.
SELECT post4.latitude, post4.longitude, 111.045*haversine(latitude,longitude,latpoint, longpoint) AS distance_in_km FROM post4 INNER JOIN ( SELECT 57.144165160000000 AS latpoint, -2.114847768000000 longpoint ) AS p ORDER BY distance_in_km
That increased to 14 secs when i added a WHERE clause
SELECT * FROM type_place AS p JOIN postcodelatlng AS pc ON p.postcode=pc.postcode WHERE pc.latitude>54 AND pc.latitude<56 AND pc.longitude<-1.5 AND pc.longitude>-1.7
Finally joining all records and performing the haversine calc took 180secs.....
SELECT 111.045*haversine(latitude,longitude,latpoint, longpoint) AS distance_in_km FROM type_place AS p JOIN postcodelatlng AS pc ON p.postcode=pc.postcode JOIN ( SELECT 57.144165160000000 AS latpoint, -2.114847768000000 longpoint ) AS c ORDER BY distance_in_km
I realised that i dont actually need to access the large table, as i only need to know what places are near other places in my smaller table of 1000 rows. Therefore i plan to just add the lat/lng information to the place table and perform the select and calc on that instead.
Just curious - did you read the article and try the pattern with CREATE TEMPORARY TABLE? I'd be interested to know what the speed was like. My guess is that if the LAT and LON are indexed columns, you would see nearly instantaneous results.
Dean OBrien
ASKER
Hi Ray,
Yes i had read you article, but to be honest i kind of discounted the temporary table to begin with, because i thought if the initial JOIN was taking 6+ seconds then it would surely have similar results... lazy i know.
I have been trying this morning to see if that would have indeed been the case, but the script im using doesnt seem to like creating the temporary table with the JOIN.
Im using the PHP below:
<?phpini_set("display_errors",1<wbr ></wbr>);error_reporting(E_ALL);// A SCRIPT TIMER$alpha_time = microtime(TRUE);/* Open a connection */$mysqli = new mysqli("localhost", "xxx", "xxx", "xxx");if (mysqli_connect_errno()) { printf("Connect failed: %s\n", mysqli_connect_error()); exit();}$mysqli->query("CREATE TEMPORARY TABLE IF NOT EXISTS temp_table ( INDEX(place_id) ) ENGINE=MEMORYAS (SELECT * FROM type_place )");if ($result = $mysqli->query("SELECT * FROM temp_table")) { printf("Select returned %d rows.\n", $result->num_rows); /* free result set */ $result->close();}$mysqli->close(); // A SCRIPT TIMER $omega_time = microtime(TRUE); $lapse_time = $omega_time - $alpha_time; $lapse_msec = $lapse_time * 1000.0; $lapse_echo = number_format($lapse_msec,<wbr ></wbr> 1); echo "<br/>SCRIPT TIME: $lapse_echo MILLISECONDS <br/>";?>
The script works fine when simply selecting all from the type_place table(returns 750 rows). However, when i swap that for the following sql:
SELECT p.place_title, pc.latitude, pc.longitude FROM type_place AS p JOIN postcodelatlng AS pc ON p.postcode=pc.postcode WHERE pc.latitude>55 AND pc.latitude<58 AND pc.longitude<-1.5 AND pc.longitude>-2.2
It doesnt return a result, even though the same sql returns 26 records when i run it directly in phpmyadmin.
Finally, not sure if it works in the same way as when called from script, but when i run the create temporary table script in phpmyadmin (making the join to the large table) it also takes 18secs:
CREATE TEMPORARY TABLE nearby ( distance DECIMAL(6,1) ) ENGINE=MEMORY SELECT p.place_title, pc.latitude, pc.longitude FROM type_place AS p JOIN postcodelatlng AS pc ON p.postcode=pc.postcode WHERE pc.latitude>55 AND pc.latitude<58 AND pc.longitude<-1.5 AND pc.longitude>-2.2
I think the general idea of the temporary table is to winnow down the number of possible results to some number that is small enough to run fast. Since I don't have your data, I'm not going to be able to tell you why this is taking so long, but the output of EXPLAIN SELECT might be helpful. Also, you want to have indexes and matching data types on any columns that are used in WHERE, JOIN, GROUP, ORDER, etc.
Thanks for your comment.
I think i perhaps wasnt clear enough. When the system is in use, I would like to be able to calculate the closest 10 locations to any defined location. Therefore given the location could change, it would need calculating at runtime.
Regards
Dean