Link to home
Start Free TrialLog in
Avatar of williamskellorn
williamskellorn

asked on

Filter MySQL results by max value for each 24 hour cycle using PHP

I previous asked a question (https://www.experts-exchange.com/questions/28153645/Filter-MySQL-stored-coordinates-to-best-in-24-hour-cycle-using-PHP.html) the solution for which I need to further refine - only after closing the question did I realise the solution didn't quite get what I required.

Using the simplified php below (and attached) on the attached sql example, For each PTT value, I need to return the values from the row with the latest timestamp that has the best location_class value - during each 24 hour cycle.

Currently the location_class value returned is the best for the time window, but the rest of the values returned are from the first row (with earliest timestamp) encountered with a qualifying value for location_class (1,2 or 3), instead of the row with latest timestamp that has the best location_class.

How can I amend the query to achieve this, preferably without requiring a further php loop to iterate through?

At this point I'd be happy for the row with the best location_class, irrespective of whether it is the latest row with that class.

Note - The sample SQL isn't designed to test the requirement for results for each 24 hour cycle - just the issue of selecting the correct row for max location_class.

Many thanks in advance!

<?php
// show all errors while debugging
error_reporting(E_ALL);
ini_set('display_errors', '1');

//set default timezone for date conversion
date_default_timezone_set('UTC');

$connection = mysql_connect ("localhost", "user", "pass");

if (!$connection) 
	{
	die('Not connecting to db server : ' . mysql_error());
	}
// Sets the active MySQL database.
$db_selected = mysql_select_db("satellite_data_example", $connection);

if (!$db_selected) {
	die('Can\'t use db : ' . mysql_error());
	}

// get a unique list of all PTT
$get_ptts = "
SELECT distinct ptt from birds 
ORDER BY ptt";
$result=mysql_query($get_ptts) or die ("Could not retrieve unique PTT numbers: ".mysql_error());
?>

<?php
// Itterates through the MySQL results, for each PTT.
while ($row = @mysql_fetch_assoc($result)) {

	// Var for unique PTT number of current bird
	$bird = $row['ptt'];
	print "Current ptt = " . $bird . "
	<br />
	";

	//get ptt, location_date, location_class, latitude, longitude for best row from each transmission for current bird during timewindow
	$get_best_locs = "
	SELECT MAX(location_class) AS id, location_class, ptt, latitude, longitude, location_date, FROM_UNIXTIME(location_date) FROM birds
	WHERE (ptt=$bird)
	and (location_class IN (1,2,3))
	and (location_date >= 1398072323)
	and (location_date <= 1398115003)
	GROUP BY YEAR(FROM_UNIXTIME(location_date)), MONTH(FROM_UNIXTIME(location_date)), DAY(FROM_UNIXTIME(location_date))";

		
	$result2=mysql_query($get_best_locs) or die ("Could not retrieve all best locations data: ".mysql_error());
	if (mysql_num_rows($result2)) {  // if one or more non-empty rows are returned..
		
			
		while ($row2 = @mysql_fetch_assoc($result2)) {
			    $locDateTimestamp = strtotime($row2["FROM_UNIXTIME(location_date)"]);
				print "ptt - " . $row2['ptt'] . "id - " . $row2['id'] . " class - ". $row2['location_class'] .  " timestamp - " . $row2['location_date'] . " (" . date('d.m.Y H:i:s e', $locDateTimestamp) . ") latlng - (" . $row2['latitude'] . "," . $row2['longitude'] . ")
				<br /><br />";
				
				}
		}
	}
?>

Open in new window

php-example.html
birds.sql
Avatar of Ray Paseur
Ray Paseur
Flag of United States of America image

What constitutes "best" when we're talking about location_class?  In the SQL it seems to have values of 0, 1, 2, 3, A, B.  Which do you want to call "best?"
For the reasons outlined here, you will need to get off of the MySQL extension.  I'll give you an example that uses the currently supported MySQLi extension instead of the obsolete MySQL, as soon as I understand about "best."
Avatar of williamskellorn
williamskellorn

ASKER

Sorry if not clear - the qualifying values for location_class are '1','2'or '3' , and I'm looking for the max value, so 3 would be best, then 2 then 1.

Should have said max, not best.

Apologies.
Thanks, that clarifies it.  Please see if this example gives you the answers you want.
http://iconoun.com/demo/temp_williamskellorn.php

A few notes...

You must stop using the MySQL extension, so take this opportunity to learn the MySQLi implementation.  It's a very easy transition.

Check the table definition. I changed the location_date, latitude and longitude to more reasonable column definitions.

Please post back if you have any questions, or if this is not the way you want the output to be grouped and collated. ~Ray

<?php // demo/temp_williamskellorn.php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);
echo '<pre>';


// SEE http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28427438.html


// DATABASE CONNECTION AND SELECTION VARIABLES - GET THESE FROM YOUR HOSTING COMPANY
$db_host = "localhost"; // PROBABLY THIS IS OK
$db_name = "??";
$db_user = "??";
$db_word = "??";

// OPEN A CONNECTION TO THE DATA BASE SERVER AND SELECT THE DB
$mysqli = new mysqli($db_host, $db_user, $db_word, $db_name);

// DID THE CONNECT/SELECT WORK OR FAIL?
if ($mysqli->connect_errno)
{
    trigger_error("CONNECT FAIL: $mysqli->connect_errno $mysqli->connect_error", E_USER_ERROR);
}


$mysqli->query('SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";');
$mysqli->query('SET time_zone = "+00:00";');

// NOTE THE CORRECTIONS TO THE DATA TYPES IN THE MODIFIED COLUMNS
$mysqli->query("CREATE TEMPORARY TABLE `birds` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `ptt` varchar(50) NOT NULL DEFAULT '',
  `location_date`                  DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
  `location_class` varchar(50) NOT NULL DEFAULT '',
  `latitude`                       DECIMAL(11,6) NOT NULL DEFAULT '0.0',
  `longitude`                      DECIMAL(11,6) NOT NULL DEFAULT '0.0',
  UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1")
;

$arr = array(
array('128303', '1398072323', '0', '38.38434', '-0.38647'),
array('128303', '1398074434', '0', '38.41594', '-0.70849'),
array('128303', '1398077966', 'B', '38.44554', '-0.8972'),
array('128303', '1398087460', '0', '38.89024', '-0.56271'),
array('128303', '1398092003', '1', '38.86478', '-0.84718'),
array('128303', '1398093572', 'B', '38.91282', '-0.87303'),
array('128303', '1398096591', '0', '39.12389', '-0.65677'),
array('128303', '1398098399', '1', '39.19132', '-0.58176'),
array('128303', '1398100680', '1', '39.24069', '-0.62806'),
array('128303', '1398102564', 'B', '39.25547', '-0.63597'),
array('128303', '1398106766', '3', '39.23175', '-0.64606'),
array('128304', '1398109369', '0', '39.26613', '-0.65195'),
array('128304', '1398112352', 'A', '39.26271', '-0.66721'),
array('128304', '1398113216', '1', '39.25291', '-0.66799'),
array('128304', '1398115003', '2', '39.25449', '-0.64654'),
array('128304', '1398115003', '0', '39.25449', '-0.64654')
)
;

// LOAD THE DATA
foreach ($arr as $sub)
{
    // CHANGE THE UNIX TIMESTAMP INTO A HUMAN-READABLE DATETIME VALUE
    $sub[1] = date('c', $sub[1]);
    $mysqli->query("INSERT INTO `birds` (`ptt`, `location_date`, `location_class`, `latitude`, `longitude`) VALUES ('$sub[0]', '$sub[1]', '$sub[2]', '$sub[3]', '$sub[4]') ");
}


// ACTIVATE THIS BLOCK TO LOOK AT THE LOADED DATA
$res = $mysqli->query('SELECT * FROM birds');
while($row = $res->fetch_object())
{
//  print_r($row);
}

$sql = <<<EOD
SELECT id, MAX(location_class) AS location_class, ptt, latitude, longitude, location_date FROM birds
WHERE location_class IN (3,2,1)
GROUP BY ptt ORDER BY location_class DESC, location_date DESC, ptt ASC;
EOD;


$res = $mysqli->query($sql);
while($row = $res->fetch_object())
{
    var_dump($row);
}

Open in new window

Thank you for the swift response Ray.

The results from your mysqli example are still returning coordinates and location_date values from the incorrect rows. The location_class returned in each object is correct (i.e. the max value for each bird (ptt)) but the coordinates and location_date values are from the first row encountered in location_date order with a value within the allowed range for location_class.

In your example (and mine) The latlng coordinates and location_date are currently being returned from rows 5 and 14 - they should be returned from rows 11 and 15 which have the max value location_class for each bird (unique ptt).

Furthermore, the grouping by 24 hour period in my initial example script is still required (GROUP BY YEAR(FROM_UNIXTIME(location_date)), MONTH(FROM_UNIXTIME(location_date)), DAY(FROM_UNIXTIME(location_date)) - so that for each 24 hour cycle we get the max value location_class row for any bird (ptt) that has data for that time period. Apologies for not including data from multiple days in the example to test this.

I've amended the test data and attached again - the first two rows are now from two days earlier, so if the data are grouped by 24 hour cycle, 3 objects should be returned with the data from the following rows, in ascending location_date order (except location_date would be converted to datetime in your script) -

row number - 2
id - 93975
location_class - 2
ptt - 128303
latitude - 38.41594
longitude - -0.70849
location_date - 1397988034

row number - 11
id - 94016
location_class - 3
ptt - 128303
latitude - 39.23175
longitude - -0.64606
location_date - 1398106766

row number - 15
id - 94027
location_class - 2
ptt - 128304
latitude - 39.25449
longitude - -0.64654
location_date - 1398115003


The first while loop iterating through the ptt values in my example is so that each bird has their returned data grouped (as these are then used to build Googlemap polylines for each bird).

I fully agree the table definition should be improved for each column - but I am dealing with a large inherited table that I dare not reformat at this point -though I hope to in the future. The way the data is being analysed by other staff requires the timestamps to be stored as timestamps for now.

Thank you once again - I hope this further clarifies the requirement.
birds-two-days.sql
table that I dare not reformat
Nobody would suggest that you reformat.  Just use the ALTER TABLE command to add the appropriate column definitions, then you can use those column definitions in your work and everyone else can "catch up" as time permits.

Please clarify the role of the ptt column in this data design.  In the original question it says, "For each PTT value, I need to return the values from the row with the latest timestamp that has the best location_class value - during each 24 hour cycle."  But in the latest post, it shows two outputs for ptt=128303.  Which is the correct guidance?

Also, I suppose we need to know the boundaries for the 24-hour grouping.  Are these by distinct dates?  Or do the 24 hour periods span dates, say, from 6:00am until 5:59:59am tomorrow?
The PTT value represents each unique bird that has a satellite-transmitter tag. Each row represents a message from that unique tag. In the full table we have several different possible PTT values as we are tracking several birds.

The objective is to have one set of coordinates and timestamp (a position in time and geographically) for each distinct day that we have any data from a given bird. Those coordinates and position should be from the row with the best quality (max value out of acceptable range of 3,2 or 1 from location_class) message that that bird (ptt) has transmitted during a given day, and if there is more than one message (row of data) during that day with the same max value location_class, ideally the coords and timestamp should be chosen from the row with the latest (highest value) timestamp during that day.

In the revised sample posted there would be two outputs for PTT 128303 because they have location_dates from different distinct days.

My current example script groups by location_date down to year / month / day granularity - so I presume the 24 hour window runs from 00:00:00 to 23:59:59 - distinct days as you say. This is how I would like it to continue to work.

The simplified example doesn't show it, but the chosen coords are then used to construct Googlemap polylines representing the route for each bird (ptt), so they need to be printed in location_date ascending order (they are used to construct Googlemaps initialisation Javascript). Hence the ptt is the key value that the returned data needs to be ordered by.

I hope this finally clears up the issue - I'm sorry it isn't clear what I'm trying to get at!
there would be two outputs for PTT 128303
OK, that eliminates the GROUP clause.  Back to the drawing board...
ASKER CERTIFIED SOLUTION
Avatar of Ray Paseur
Ray Paseur
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Ray - belated thanks for further efforts on this - I haven't had a chance to give it a test yet - will do soon and get back to you.

Many thanks.
Sorry for the delay in replying.

While the suggested solution works if the table column for location_date is defined as date time and the timestamps converted on entry, I currently need to work with the inherited db that has the timestamps stored as strings. It has hundreds of thousands of rows, with more being added every day - and I don't know if creating a temporary table,  grabing all rows and reformating the timestamps, adding them into the temp table before executing the rest of the script will be performant with the frequency that the script is executed - can vary from 10 to 50 times a day.

I've tried to retrofit your suggestion to work with the last sample sql file I provided (birds-two-days.sql) but for unknown reasons the query executing for each row of the intermediate $dcps array won't filter down to the max location_class entry for each ptt, for each location_date whole day as it does with your last suggestion (with date time location_date instead of timestamps).

I'm now wondering whether to just pull down all required data in the query that generates the $dcps array, and then group array entries that have the smae ptt and location_date values, and remove all entries except the one with the highest max_location_class value. I'm guessing this may be as quick or quicker than fetching with a fresh SQL query.

My current code below (which doesn't manage to filter down to the expected three objects after the second query).

<?php // demo/temp_williamskellorn.php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);
echo '<pre>';


// SEE http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28427438.html


// DATABASE CONNECTION AND SELECTION VARIABLES - GET THESE FROM YOUR HOSTING COMPANY
$db_host = "localhost"; // PROBABLY THIS IS OK
$db_name = "satellite_data_example";
$db_user = "??";
$db_word = "??";

// OPEN A CONNECTION TO THE DATA BASE SERVER AND SELECT THE DB
$mysqli = new mysqli($db_host, $db_user, $db_word, $db_name);

// DID THE CONNECT/SELECT WORK OR FAIL?
if ($mysqli->connect_errno)
{
    trigger_error("CONNECT FAIL: $mysqli->connect_errno $mysqli->connect_error", E_USER_ERROR);
}


$mysqli->query('SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";');
$mysqli->query('SET time_zone = "+00:00";');

// HAVE A LOOK AT THE LOADED DATA
$res = $mysqli->query('SELECT * FROM birds');
while($row = $res->fetch_object())
{
    
	
	echo PHP_EOL
    . "$row->ptt "
    . "$row->location_date "
    . "$row->location_class "
    . "$row->latitude,$row->longitude "
    . "ID: $row->id"
    ;
}
echo PHP_EOL;
print '<br /><br />';
// FIND THE PARTS OF EACH ROW WITH THE MAX(location_class) FOR THE GROUP
$sql = <<<EOD
SELECT FROM_UNIXTIME(location_date,'%Y-%m-%d') AS loc_date, MAX(location_class) AS max_location_class, ptt
FROM birds
WHERE location_class IN (1,2,3)
GROUP BY ptt, location_date
ORDER BY ptt, location_date
EOD;
if (!$res = $mysqli->query($sql))
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . PHP_EOL
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}

// COPY THE INFORMATION WE NEED TO IDENTIFY THE RIGHT ROWS
while ($row = $res->fetch_object())
{
    $dcps[] = $row;
}

print_r($dcps);
print '<br /><br />';

// GET ID, LAT, LNG FOR EACH ROW
$dataset = array();
foreach ($dcps as $row)
{
    $sql = <<<EOD
	SELECT *
	FROM birds
	WHERE FROM_UNIXTIME(location_date,'%Y-%m-%d')  = '$row->loc_date'
	  AND location_class = '$row->max_location_class'
	  AND ptt            = '$row->ptt'
	ORDER BY ptt, location_date
	LIMIT 1
EOD;
	if (!$res = $mysqli->query($sql))
	{
	    $err
	    = 'QUERY FAILURE:'
	    . ' ERRNO: '
	    . $mysqli->errno
	    . ' ERROR: '
	    . $mysqli->error
	    . PHP_EOL
	    . ' QUERY: '
	    . $sql
	    ;
	    trigger_error($err, E_USER_ERROR);
	}
	$dataset[] = $res->fetch_object();
}

// SHOW THE WORK PRODUCT
print_r($dataset);

?>

Open in new window

It's just belatedly dawned on me that I'm fighting against the sane way to do this too hard.  I could do a one off operation to add a column to the table which would convert the incoming timestamp data to date time, and then include the conversion in my pull-down script that adds new data each day. I'll then have a date time version of the timestamp to use your suggested approach on.  Obvious, and may have been what you were suggesting initially as a work-around.

Will try this and report back, assign points.

Sorry that I'm a bit slow on the uptake!
Yep, that sounds like a good way to go.
This solution is great for the demonstration problem as explained - I have had subsequent difficulty with the number of queries used when executed on the full db, and have had to fallback to php array filtering to replace the second round of queries to avoid timeouts.

Nevertheless, a good solution - very sorry I didn't mark it as such in good time.
Thanks for the points and thanks for using EE.  If your queries are slow, this is a great place to ask about how to speed them up!  Post the CREATE TABLE statements with your questions, please!  Best regards, ~Ray