Link to home
Start Free TrialLog in
Avatar of Dean OBrien
Dean OBrienFlag for United Kingdom of Great Britain and Northern Ireland

asked on

Calculating 'nearest' 10 locations to a postcode

Experts,

I am putting together a website that needs to be able to show the nearest X locations to a given postcode.

In order to do so, I have imported a very large table of 1.7 million records, which is a complete list of all postcodes in the UK. My intention was to use MySQL to calculate the nearest locations, through using the Haversine formula in conjunction with the latitude and longitude, taken from the postcode table.

However, I have only just started test this and have come to realise the massive resource implication, given that in order to find the 20 nearest, I would need to individually calculate the distance from X of all 1.7 million records, then sort by distance...

Here is the MySQL that i was originally playing around with:
SELECT type_place.postcode, type_place.place_title, postcodelatlng.latitude, postcodelatlng.longitude,
       111.045*haversine(latitude,longitude,latpoint, longpoint) AS distance_in_km
 FROM postcodelatlng
 INNER JOIN type_place
 ON type_place.postcode=postcodelatlng.postcode
 INNER JOIN (
     SELECT  54.97  AS latpoint,  -1.607 AS longpoint
   ) AS  p
   WHERE postcodelatlng.latitude < 56 AND postcodelatlng.latitude > 54 AND postcodelatlng.longitude<-1.5 AND postcodelatlng.longitude>-1.7
 ORDER BY distance_in_km
 LIMIT 15

Open in new window


As youi can see, I added a WHERE clause to limit the search to +/- 1 point on lat/long. However, when i ran the test it took 30 secs to complete..

Could someone suggest a better approach to the problem?

Regards
Dean
SOLUTION
Avatar of Dan Craciun
Dan Craciun
Flag of Romania image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Dean OBrien

ASKER

Hi Dan,

Thanks for your comment.

I think i perhaps wasnt clear enough. When the system is in use, I would like to be able to calculate the closest 10 locations to any defined location. Therefore given the location could change, it would need calculating at runtime.

Regards
Dean
From what I understood, you have 1.7 million possible locations.
Create a table with 1.7 million records, each line containing the location and it's 10 neighbors.
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks to everyone for their comments, apologies for the massive delay responding. I have been meaning too, but wanted to run some test before commenting, and never found the time...

I suspect one of the reasons for the poor performance is that the Db is stored on a shared host, so im competing for processing power. I have ran some tests and it looks like the main reason for the delay is simply querying the large postcode table (1.7 million records).

I performed the haversine calc on table of 50,100,150 and 200 records only and it was pretty much instant.
SELECT post4.latitude, post4.longitude, 111.045*haversine(latitude,longitude,latpoint, longpoint) AS distance_in_km FROM post4 INNER JOIN ( SELECT 57.144165160000000 AS latpoint, -2.114847768000000 longpoint ) AS p ORDER BY distance_in_km

Open in new window


Yet, simply joining my place table to the postcode table to get the lat/lng info took 6 -7 secs...
 
SELECT * FROM type_place AS p JOIN postcodelatlng AS pc ON p.postcode=pc.postcode

Open in new window


That increased to 14 secs when i added a WHERE clause
SELECT * FROM type_place AS p JOIN postcodelatlng AS pc ON p.postcode=pc.postcode WHERE pc.latitude>54 AND pc.latitude<56 AND pc.longitude<-1.5 AND pc.longitude>-1.7

Open in new window


Finally joining all records and performing the haversine calc took 180secs.....
SELECT 111.045*haversine(latitude,longitude,latpoint, longpoint) AS distance_in_km FROM type_place AS p JOIN postcodelatlng AS pc ON p.postcode=pc.postcode JOIN ( SELECT 57.144165160000000 AS latpoint, -2.114847768000000 longpoint ) AS c ORDER BY distance_in_km

Open in new window


I realised that i dont actually need to access the large table, as i only need to know what places are near other places in my smaller table of 1000 rows. Therefore i plan to just add the lat/lng information to the place table and perform the select and calc on that instead.

Thanks again
Dean
Just curious - did you read the article and try the pattern with CREATE TEMPORARY TABLE?  I'd be interested to know what the speed was like.  My guess is that if the LAT and LON are indexed columns, you would see nearly instantaneous results.
Hi Ray,

Yes i had read you article, but to be honest i kind of discounted the temporary table to begin with, because i thought if the initial JOIN was taking 6+ seconds then it would surely have similar results... lazy i know.

I have been trying this morning to see if that would have indeed been the case, but the script im using doesnt seem to like creating the temporary table with the JOIN.

Im using the PHP below:
<?php
ini_set("display_errors",1<wbr ></wbr>);
error_reporting(E_ALL);

// A SCRIPT TIMER
$alpha_time = microtime(TRUE);

/* Open a connection */
$mysqli = new mysqli("localhost", "xxx", "xxx", "xxx");

if (mysqli_connect_errno()) {
    printf("Connect failed: %s\n", mysqli_connect_error());
    exit();
}

$mysqli->query("
CREATE TEMPORARY TABLE IF NOT EXISTS 
  temp_table ( INDEX(place_id) ) 
ENGINE=MEMORY
AS (
SELECT * FROM type_place 
)
");

if ($result = $mysqli->query("SELECT * FROM temp_table")) {
    printf("Select returned %d rows.\n", $result->num_rows);
    /* free result set */
    $result->close();
}

$mysqli->close();

    // A SCRIPT TIMER
	$omega_time = microtime(TRUE);
    $lapse_time = $omega_time - $alpha_time;
    $lapse_msec = $lapse_time * 1000.0;
    $lapse_echo = number_format($lapse_msec,<wbr ></wbr> 1);
    echo "<br/>SCRIPT TIME: $lapse_echo MILLISECONDS <br/>";
?>

Open in new window


The script works fine when simply selecting all from the type_place table(returns 750 rows). However, when i swap that for the following sql:
SELECT p.place_title, pc.latitude, pc.longitude FROM type_place AS p JOIN postcodelatlng AS pc ON p.postcode=pc.postcode WHERE pc.latitude>55 AND pc.latitude<58 AND pc.longitude<-1.5 AND pc.longitude>-2.2

Open in new window


It doesnt return a result, even though the same sql returns 26 records when i run it directly in phpmyadmin.

Finally, not sure if it works in the same way as when called from script, but when i run the create temporary table script in phpmyadmin (making the join to the large table) it also takes 18secs:
CREATE TEMPORARY TABLE nearby ( distance DECIMAL(6,1) ) ENGINE=MEMORY SELECT p.place_title, pc.latitude, pc.longitude FROM type_place AS p JOIN postcodelatlng AS pc ON p.postcode=pc.postcode WHERE pc.latitude>55 AND pc.latitude<58 AND pc.longitude<-1.5 AND pc.longitude>-2.2

Open in new window


Regards
Dean
I think the general idea of the temporary table is to winnow down the number of possible results to some number that is small enough to run fast.  Since I don't have your data, I'm not going to be able to tell you why this is taking so long, but the output of EXPLAIN SELECT might be helpful.  Also, you want to have indexes and matching data types on any columns that are used in WHERE, JOIN, GROUP, ORDER, etc.