Editor's Choice: This article has been selected by our editors as an exceptional contribution.

# Using GeoCodes to Find an Average Location

Published:
Updated:
Part of the Global Positioning System
A geocode is the major subset of a GPS coordinate, the other parts being the altitude and the time.  It is commonly expressed in an ordered pair, giving latitude and longitude in signed decimal numbers separated by a comma.  With about 6 digits of precision, you can locate with "rooftop accuracy."  In the world of geocodes, I live at 38.930247,-77.147828.

Finding an Average Location with GeoCode Data
In a recent EE question we were presented with this dialog: I have an array of latitude and longitude pairs (from zero to 15 pairs) that I need to process to derive a single average pair.

Some Additional Discussion to Clarify the Inputs and Desired Outputs
"There is a possibility that some pairs will be wildly inaccurate, and thus throw-off the "average". Omitting the highest and lowest (Longitude?) seems to be an easy method to implement as a general rule that in most cases will avoid capturing the most inaccurate values. All values should be within, say 20 nautical miles of each other, and this method should identify an erroneous out-of-state entry for example.

"One possible algorithm might go something like this... Compute the average location by averaging the lats and lons and using the resulting geocode as the average location.  Then walk the array of input data and compute the distance from the average location for all of the geocodes.  Discard the one that is the greatest distance away.  Lather, rinse, repeat until the greatest distance is within an acceptable locus. (note: this is the approach shown in this article)

Aha! Most of the Data is Trustworthy
"That makes a big difference.  If we know that most of our information is usable and we are just trying to discard the few outlier elements that are plainly wrong, the task has a better definition."

Making it Easy to Find an Average Location
To make this an easy thing we decided on a design pattern that would be object-oriented and would create a property with the average geocode information.  The way you would use it looks something like this:

``````<?php
error_reporting(E_ALL);
\$ga = new GeoAverage;
\$ga->geoRedact();
\$avgGeoCode = \$ga->avgGeoCode;
``````
In the end, the algorithm we used is shown below.  Here is an explanation of the programming.

The GeoObject Class Keeps Its Distance From the Center
The first part of the code is the GeoObject class that encapsulates a geocode and the associated programming and information.  The class constructor (lines 9-15) receives a geocode, which is always a comma-separated pair of decimal numbers containing the latitude and longitude in that order.  It may have more or fewer significant digits, depending on the need for accuracy in geolocation.  The constructor stores the lat,lon information in a class variable.

The setCenter() method (lines 17-25) is much like the constructor in that it receives a geocode, but as soon as it has the code, it calls a class method to compute the distance between the original geocode and the newly added center geocode.  You can instantiate a GeoObject, load a geocode into the setCenter() method and you will receive the distance in the return value from the setCenter() method.

There are two data access methods.  GetGeoCode() returns the geocode loaded by the class constructor.  GetDistance() returns the distance, which will be the same value that was returned by the most recent call to the setCenter() method.

The distance computation method (lines 38-79) implements the Haversine formula to compute the "great circle" distance between the two geocodes.  As a practical matter, you can use plane geometry when the distances are small (perhaps a few hundred miles or less) but as you get closer to global navigation distances you need to consider the curvature of the earth.  To see this effect in action. take a globe and a piece of string.  Hold one end at New York and draw the string across the globe to London.  The string will show you the great circle route.  You can learn more about the Haversine formula in this Wikipedia article.
http://en.wikipedia.org/wiki/Haversine_formula

Sorting by Distance
The next piece of code (lines 84-94) is a pair of usort() callback functions that let us order the array of GeoObjects by ascending or descending radial distance from the center.

The GeoAverage Class Processes a set of GeoObjects
The first part of this class contains the class variable data definitions.  We put these definitions here because we want to be able to see these values right at the top of any data visualization with print_r() or var_dump().  (Note that we did not bother with that in the GeoObject class; it only has a few properties and it is easy to keep them sorted.)  One of our values is private (the numeric value of this->percent) and all of the others are public, making them easy to access from outside the class.  Here are the descriptions of these variables.

\$this->percent is a numeric value between 0.01 and 1.00.  It tells the percentage of data samples that we must keep as we average the collection of geocodes.  Our logic will enable us to discard outliers, geocodes that are a long way away from the norm.  If we discard too many outliers we may not have enough data points to get a "good" result.  We can adjust this percentage to suit our needs.  When the programming discards more than \$this->percent of our geocodes, the result will be captioned "failure."

\$this->pctRequired is the human-readable display of the value of \$this->percent.

\$this->acceptRadius is the distance in miles that is considered to be acceptably close.  Any geocode that is within this distance of the center will be kept and used in the computation of the center location.  If that sounds odd, it will make more sense once you see the algorithm for determining the center.

\$this->pctRetained is the percentage of the geocodes that were used to compute the center.

\$this->maxDistance is the largest distance from the center, among the geocodes that were used to compute the location of the center.

\$this->avgDistance is the average distance from the center to all of the geocodes that were used to compute the location of the center.

\$this->numGeoCodes is the total number of geocodes that were loaded into the object.

\$this->numRetained is the number of geocodes that were used to compute the location of the center.

\$this->numDiscarded is the number of geocodes that were not used to compute the location of the center because they were deemed to be outliers with respect to \$this->acceptRadius.

\$this->avgLat is the arithmetic mean of all the latitudes that were used to compute the location of the center.

\$this->avgLon is the arithmetic mean of all the longitudes that were used to compute the location of the center.

\$this->avgGeoCode is the standard comma-separated geocode notation incorporating \$this->avgLat and \$this->avgLon.

\$this->result is the status of the object.  It will be "Unfinshed" or "Success" or "Failure."

\$this->geoLocale is the array of GeoObjects that were used to compute the location of the center.

The methods() of the GeoAverage Class
The first method is the class constructor (lines 117-122).  It simply resets the default values.  While technically it may not be needed because of our variable definitions, I am just in the habit of programming this way!

The setAcceptRadius() method sets the acceptable distance from the center (lines 125-129).  Since it changes a value that is important to our algorithm, it calls the setResult() method without any parameters to cause the object to assume "Unfinished" status.

The setPercent() method (lines 132-150) sets the percentage of the total number of geocodes that must be within the acceptable distance for the class to assume the "Success" status.  You can call this method with either a decimal fraction or a percentage expression like "60%" and it will behave intuitively. Because this is one of the settings that changes the nature of the output values, it calls the setResult() method without any parameters to cause the object to assume "Unfinished" status.

The private setResult() method (lines 153-161) is used to post the status of the object in the \$this->result property.  When called without any arguments it resets default values and sets the status to "Unfinished."

The addGeoCode() method (lines 163-167) is used to add a new Lat/Lon pair into the array of unprocessed GeoObjects.  Since it changes the input data, it calls the setResult() method without any parameters to cause the object to assume "Unfinished" status.

The remGeoCode() method (lines 169-182) is used to remove a Lat/Lon pair from the array of unprocessed GeoObjects.  Note that it removes only the first matching geocode.  If there are multiple instances of identical geocodes and you want to remove them all you must call this method until it returns FALSE.  Since it changes the input data, it calls the setResult() method without any parameters to cause the object to assume "Unfinished" status.

The setCenters() method (lines 184-190) takes its argument in the form of a geocode and assigns it as the center in all of the GeoObjects.  As the center is assigned in the GeoObjects, each GeoObject automatically recomputes its distance from the center.

The avgGeoCodes() method (lines 192-208) takes its argument in the form of an array of GeoObjects.  It computes the average latitude and the average longitude, then it calls the setCenters() method to update the center points in all of the GeoObjects.  Finally, it sorts the input array in ascending order and returns the array.  The sort has the effect of putting the greatest off-center distance in the last position of the array, where it is easily accessed with end() or array_pop.

The compute_average_distances() method (lines 210-220) takes its argument in the form of an array of GeoObjects.  It computes the average distance for all of the geocodes from the currently set \$this->center properties of the GeoObjects.

The geoRedact() method (lines 222-253) makes a copy of the array of GeoObjects, orders the array by distance from the average center, and removes the last element of the array.  It tests this GeoObject to see if its location is within the values allowed by \$this->acceptRadius.  When it finds the first GeoObject that is acceptable, its work is done because all the other GeoObjects are of equal or lesser distance from the current center point.

The setGeoData() method (lines 255-292) captures some statistics about our efforts to find the average center point and puts these into the public properties of the object.  It sets the "Success" or "Failure" indicator.  Finally it sorts the original collection of GeoObjects in descending order by distance from the programatically chosen center point.  This makes it easy to visualize any outliers, since they will be at the top of the list of GeoObjects when the array is visualized with print_r().

Testing the GeoAverage Class
There are a handful of tests included in the script at the bottom.  Each test uses print_r() to visualize the object as its properties are changed and the average center geocode is recomputed.  Some tests are designed to fail because they do not achieve a high enough percentage of geocodes within the acceptable radius.

``````<?php // RAY_EE_average_location.php
error_reporting(E_ALL);
echo "<pre>";

// AN OBJECT TO HOLD GEOCODES AND DISTANCES FROM THE CENTER
Class GeoObject
{
public function __construct(\$geocode)
{
\$this->geocode = \$geocode;
\$arr = explode(',', \$geocode);
\$this->lat = \$arr;
\$this->lon = \$arr;
}

public function setCenter(\$center)
{
\$this->center = \$center;
\$arr = explode(',', \$center);
\$lat = \$arr;
\$lon = \$arr;
\$this->distance = \$this->compute_distance(\$this->lat, \$this->lon, \$lat, \$lon, 'MILES');
return \$this->distance;
}

public function getGeocode()
{
return \$this->geocode;
}

public function getDistance()
{
return \$this->distance;
}

// MAN PAGE: http://en.wikipedia.org/wiki/Haversine_formula
protected function compute_distance(\$from_lat, \$from_lon, \$to_lat, \$to_lon, \$units='KM')
{
// CHOOSE A UNIT OF MEASURE BY THE FIRST CHARACTER
\$units = strtoupper(substr(trim(\$units),0,1));

// ENSURE THAT ALL COORDINATES ARE FLOATING POINT VALUES
\$from_lat = floatval(\$from_lat);
\$from_lon = floatval(\$from_lon);
\$to_lat   = floatval(\$to_lat);
\$to_lon   = floatval(\$to_lon);

// IF THE SAME POINT THE DISTANCE IS ZERO AND HAVERSINE IS NOT NEEDED
if ( (\$from_lat == \$to_lat) && (\$from_lon == \$to_lon) )
{
return 0.0;
}

// COMPUTE THE DISTANCE WITH THE HAVERSINE FORMULA
\$distance
= acos
)
;

// DISTANCE IN MILES AND KM - ADD OTHER MEASURES IF NEEDED
\$miles = (float) \$distance * 69.0;
\$km    = (float) \$miles * 1.61;

// RETURN MILES
if (\$units == 'M') return round(\$miles,1);

// RETURN KILOMETERS
if (\$units == 'K') return round(\$km,2);

// UNITS NOT UNDERSTOOD
return('INVALID FIFTH ARGUMENT - USE MILES OR KM');
}
}

// USER SORTS FOR AN ARRAY OF GEOOBJECTS
function geoSort_Ascending(\$a, \$b)
{
if     (\$a->getDistance() == \$b->getDistance()) return 0;
return (\$a->getDistance() <  \$b->getDistance()) ? -1 : 1;
}

function geoSort_Descending(\$a, \$b)
{
if     (\$a->getDistance() == \$b->getDistance()) return 0;
return (\$a->getDistance() >  \$b->getDistance()) ? -1 : 1;
}

// A CLASS TO FIND THE AVERAGE LOCATION FROM A SET OF GEOCODES
Class GeoAverage
{
// GROUP OF THESE DEFINITIONS MAKES var_dump() EASIER TO READ
private \$percent       = 1.0;
public  \$pctRequired   = 0;
public  \$pctRetained   = 0;
public  \$maxDistance   = -1.0;
public  \$avgDistance   = -1.0;
public  \$numGeoCodes   = -1.0;
public  \$numRetained   = -1.0;
public  \$avgLat        = 0.0;
public  \$avgLon        = 0.0;
public  \$avgGeoCode    = '0.0,0.0';
public  \$result        = 'Unfinished';
public  \$geoLocale     = array();

// CONSTRUCTOR SETS DEFAULT VALUES FOR RESULTS
public function __construct()
{
\$this->setPercent();
\$this->setResult();
}

// THE MAXIMUM ACCEPTABLE RANGE FROM THE CENTER POINT
{
\$this->setResult();
}

// THE MINIMUM ACCEPTABLE PERCENT OF GEOCODES THAT MUST BE KEPT IN THE FINAL DETERMINATION OF THE LOCALE
public function setPercent(\$limit=100)
{
// IF EXPRESSED IN FRACTIONAL TERMS
if (\$limit <= 1.0)
{
\$this->percent = \$limit;
}
// IF EXPRESSED IN WHOLE NUMBER PERCENT
else
{
\$limit = preg_replace('#[^0-9\.]#', NULL, \$limit);
\$this->percent  = \$limit / 100.0;
}
if (!\$this->percent) \$this->percent = 0.01;
if (\$this->percent < 0.01) \$this->percent = 0.01;
if (\$this->percent > 1.00) \$this->percent = 1.00;

\$this->setResult();
}

// THE STATUS OF THE COMPUTATIONS
private function setResult(\$x='Unfinished')
{
\$this->result = \$x;
if (\$x == 'Unfinished')
{
\$this->avgGeoCode = '0.0,0.0';
\$this->geoLocale = array();
}
}

{
\$this->geoObjects[] = new GeoObject(\$geocode);
\$this->setResult();
}

public function remGeoCode(\$geocode)
{
foreach (\$this->geoObjects as \$key => \$geo)
{
\$gc = (string)\$geo->getGeoCode();
if (\$gc == \$geocode)
{
unset(\$this->geoObjects[\$key]);
\$this->setResult();
return TRUE;
}
}
return FALSE;
}

protected function setCenters(\$center)
{
foreach (\$this->geoObjects as \$key => \$geo)
{
\$this->geoObjects[\$key]->setCenter(\$center);
}
}

protected function avgGeoCodes(\$objs)
{
\$lats = array();
\$lons = array();
foreach (\$objs as \$obj)
{
\$lats[] = \$obj->lat;
\$lons[] = \$obj->lon;
}
\$this->avgLat = array_sum(\$lats) / count(\$lats);
\$this->avgLon = array_sum(\$lons) / count(\$lons);
\$this->setCenters( "\$this->avgLat,\$this->avgLon" );

// SORT THE GEOCODES
usort(\$objs, "geoSort_Ascending");
return \$objs;
}

protected function compute_average_distances(\$objs)
{
\$objs = \$this->avgGeoCodes(\$objs);
\$dist = 0.0;
foreach (\$objs as \$obj)
{
\$dist += \$obj->getDistance();
}
\$dist = number_format( (\$dist / count(\$objs)), 1);
return \$dist;
}

// A FUNCTION TO TRY TO COALESCE THE GEOCODES UNTIL GOOD
public function geoRedact()
{
// COPY THE ARRAY OF GEO-OBJECT POINTS
\$this->geoLocale = \$this->geoObjects;

// THE NUMBER OF ITERATIONS WE CAN MAKE
\$num = count(\$this->geoLocale) - 1;
while (\$num)
{
// GET A NEW AVERAGE GEOCODE
\$this->geoLocale = \$this->avgGeoCodes(\$this->geoLocale);
\$away     = array_pop(\$this->geoLocale);

// COMPARE MAXIMUM DISTANCE TO ACCEPTABLE DISTANCE
\$distance = \$away->getDistance();
{
// RESTORE THIS ELEMENT BECAUSE IT IS INSIDE THE ACCEPTABLE DISTANCE
\$this->geoLocale[] = \$away;
\$this->setGeoData();
return TRUE;
}

// REPEAT UNTIL IT COALESCES OR WE RUN OUT OF DATA
\$num--;
}

// NEVER COALESCED ADEQUATELY -- RETURN THE REMAINING POINTS
\$this->setGeoData();
return FALSE;
}

// A FUNCTION TO SET THE FINAL INFORMATION ABOUT THE ATTEMPT TO COALESCE
protected function setGeoData()
{
// MAXIMUM DISTANCE FROM THE RESOLVED CENTER
\$away = end(\$this->geoLocale);
\$maxDistance = \$away->getDistance();

// AVERAGE DISTANCE FROM THE RESOLVED CENTER
\$avgDistance = \$this->compute_average_distances(\$this->geoLocale);

// NUMBER OF TOTAL AND DISCARDED GEOCODES
\$numGeoCodes  = count(\$this->geoObjects);
\$numRetained  = count(\$this->geoLocale);
\$pctRetained  = (\$numRetained / \$numGeoCodes) * 100;

// TIDY UP SOME STATISTICS
\$this->maxDistance  = number_format(\$maxDistance, 1);
\$this->avgDistance  = number_format(\$avgDistance, 1);
\$this->numGeoCodes  = number_format(\$numGeoCodes);
\$this->numRetained  = number_format(\$numRetained);
\$this->pctRetained  = number_format(\$pctRetained) . '%';
\$this->pctRequired  = number_format(\$this->percent * 100) . '%';
\$this->avgLat       = number_format(\$this->avgLat, 6);
\$this->avgLon       = number_format(\$this->avgLon, 6);
\$this->avgGeoCode   = "\$this->avgLat,\$this->avgLon";

// THE NUMBER OF ELEMENTS WE CAN REMOVE BEFORE DECLARING FAILURE
\$lim = count(\$this->geoLocale) * (1.0 - \$this->percent);
\$lim = round(\$lim);

// SET SIGNAL OF SUCCESS OR FAILURE BASED ON AMOUNT OF RETAINED DATA
\$this->setResult("Failure");
if ( \$numDiscarded <= \$lim) \$this->setResult("Success");

// SORT THE ORIGINAL BY GREATEST DISTANCE FROM THE FINAL LOCATION
usort(\$this->geoObjects, "geoSort_Descending");
}
}

// TESTING OUR SCRIPT
\$ga = new GeoAverage;

// LOAD UP SOME TEST DATA THAT IS CLOSE TOGETHER

// SET A TIGHT GEOGRAPHICAL REQUIREMENT

echo "<h2>Unfinished Object</h2>";
print_r(\$ga);

// PERFORM THE CALCULATIONS
\$ga->geoRedact();

echo "<h2>Finished Object with Success - required 100% / 10 miles</h2>";
print_r(\$ga);

// ADD AN OUTLIER AND SHOW THE UNFINISHED OBJECT
echo "<h2>Unfinished Object (Newly Added GeoCode)</h2>";
print_r(\$ga);

// RECOMPUTE AND SHOW THE FINISHED OBJECT
\$ga->geoRedact();
echo "<h2>Finished Object with Failure - required 100% / 10 miles, got less than 100%</h2>";
print_r(\$ga);

// LOWER OUR STANDARDS AND RECOMPUTE
\$ga->setPercent(0.6);
\$ga->geoRedact();
echo "<h2>Finished Object with Success - required 60% / 10 miles, got more than 60%</h2>";
print_r(\$ga);

// ADD SOME MORE TEST DATA

// CONSTRAIN THE DISTANCE AND PERCENT REQUIREMENT
\$ga->setPercent('70%');

// PERFORM SOME CALCULATIONS WITH THE NEW DATA AND NEW CONSTRAINTS
\$ga->geoRedact();
echo "<h2>Finished Object with Success - required 70% / 20 miles, got more than 70%</h2>";
print_r(\$ga);

// RAISE THE ACCEPTABLE DISTANCE

// PERFORM SOME CALCULATIONS
\$ga->geoRedact();
echo "<h2>Finished Object with Success - required 70% / 80 miles, got more than 70%</h2>";
print_r(\$ga);

// PERFORM SOME CALCULATIONS
\$ga->geoRedact();
echo "<h2>Finished Object with Success - required 70% / 80 miles, got more than 70% (some outliers)</h2>";
print_r(\$ga);

// RAISE OUR STANDARDS AND NARROW OUR LOCUS OF ACCEPTABLE GEOCODES
\$ga->setPercent('75%');

// PERFORM SOME CALCULATIONS
\$ga->geoRedact();
echo "<h2>Finished Object with Failure - required 75% / 70 miles, got less than 75% (more outliers)</h2>";
print_r(\$ga);

// REMOVE SOME OF THE OUTLIER GEOCODES
if (\$ga->remGeoCode('37.3974,-122.0732')) echo PHP_EOL . "REMOVED '37.3974,-122.0732'";
if (\$ga->remGeoCode('37.3974,-122.0732')) echo PHP_EOL . "REMOVED '37.3974,-122.0732'";
if (\$ga->remGeoCode('37.3974,-122.0732')) echo PHP_EOL . "REMOVED '37.3974,-122.0732'";
if (\$ga->remGeoCode('37.3974,-122.0732')) echo PHP_EOL . "REMOVED '37.3974,-122.0732'";
if (\$ga->remGeoCode('37.3974,-122.0732')) echo PHP_EOL . "REMOVED '37.3974,-122.0732'";
if (\$ga->remGeoCode('37.3974,-122.0732')) echo PHP_EOL . "REMOVED '37.3974,-122.0732'";

// RECALCULATE
\$ga->geoRedact();
echo "<h2>Finished Object with Success - required 75% / 70 miles, got more than 75% (fewer outliers)</h2>";
print_r(\$ga);
``````
Conclusion, or What's This Good For?
This article has shown a way to take a set of geocodes and determine if they are close to each other, or scattered about the landscape.  This technology could be used in data scrubbing.  Perhaps you have received several sets of geographic data that have postal codes and geolocations.  In my personal experience these data sets, especially the free ones, are notoriously full of inaccuracies.  You want to compare the data from different sources to see if the data is more-or-less in consonance.  This technology could tell you that.  Another use might be to follow a remote signaling device.  If the device left an acceptably close radius, the technology could raise an alert.  Another possible use might be in "heat-mapping" the travels of a signaling device.  If an occasional spurious signal is received, it could be rejected before the map is drawn.

With the increasing prevalence of mobile devices, knowing where your clients are coming from is more important than ever!