mbevilacqua
asked on
select max of rows for each dimension
Table : THE_TABLE
Columns:
THE_TABLE_ID int
DATE_ID date_id - (day since Jan 1 1970)
TIME_ID time_id - subset of DATE_ID (the minute of the day (1 to 60 minutes *24))
DIMENSION_ID - the dimension to show the latest value for each unique entry
How do I select the row with the highest DATE_ID and TIME_ID for each DIMENSION_ID? All ID columns are indexed, so I do not want to use a function. I need this to work on DB2, MSSQL,. and Oracle.
This code example works to find the latest of all rows, but I need it to return a row for each unique DIMESNION_ID.
CONSTRAINTS:
SQL must work in Oracle, DB2, and MSSQL
SQL must use the indexes on DATE_ID and TIME_ID columns.
Columns:
THE_TABLE_ID int
DATE_ID date_id - (day since Jan 1 1970)
TIME_ID time_id - subset of DATE_ID (the minute of the day (1 to 60 minutes *24))
DIMENSION_ID - the dimension to show the latest value for each unique entry
How do I select the row with the highest DATE_ID and TIME_ID for each DIMENSION_ID? All ID columns are indexed, so I do not want to use a function. I need this to work on DB2, MSSQL,. and Oracle.
This code example works to find the latest of all rows, but I need it to return a row for each unique DIMESNION_ID.
CONSTRAINTS:
SQL must work in Oracle, DB2, and MSSQL
SQL must use the indexes on DATE_ID and TIME_ID columns.
SELECT *
FROM (SELECT t.*
, ROW_NUMBER() OVER (ORDER BY DATE_ID DESC, TIME_ID DESC) r
FROM the_table t
) sq
WHERE sq.r <= (select count(distinct DIMENSION_ID) from the_table)
By my reading, you'd like the maximum date_id for each dimension_id, and then get the maximum time_id from that set for each dimension, and finally include the_table_id for rows that have those max. dates and times for each dimension. If so, I believe the query below should work, in general. It should be generic SQL which should work with each database you stated, but I have not tested it specifically on any of them. As to your second constraint, well, that would depend on the query optimizers for each database engine. There's no reason why they wouldn't use the indices for these queries (as there is no external functions involved), but I obviously cannot say for certain.
In the SQL, here is what I mean for each of the subselect abbreviations:
mtpdd -- max time per date and dimension
mdpd -- max date per dimension
mdtpd -- max time and date per dimension
Starting with the first subselect (mtpdd), we do a select which determines the maximum time for each dimension-date pair. Yes, this is more than what you want, but please be patient. The next subselect (mdpd) determines the maximum date for each dimension. When we do a(n INNER) JOIN of mtpdd with mdpd, we get the maximum date and time for each dimension (mdtpd). If all you want is just this information, then you're done. But my interpretation of your question made me think you wanted the the_table_id for each row which has the max date and time for a dimension. So we have perform the JOIN between the_table and mdtpd.
The caveat is that you would get more than one row for any dimension, if there are more than one row with the maximum date and time for a dimension. So if you have code assuming only one row per dimension, that may break or not work as expected. But if your data is such that that would never be the case, then of course you'd only get one row per dimension.
In the SQL, here is what I mean for each of the subselect abbreviations:
mtpdd -- max time per date and dimension
mdpd -- max date per dimension
mdtpd -- max time and date per dimension
Starting with the first subselect (mtpdd), we do a select which determines the maximum time for each dimension-date pair. Yes, this is more than what you want, but please be patient. The next subselect (mdpd) determines the maximum date for each dimension. When we do a(n INNER) JOIN of mtpdd with mdpd, we get the maximum date and time for each dimension (mdtpd). If all you want is just this information, then you're done. But my interpretation of your question made me think you wanted the the_table_id for each row which has the max date and time for a dimension. So we have perform the JOIN between the_table and mdtpd.
The caveat is that you would get more than one row for any dimension, if there are more than one row with the maximum date and time for a dimension. So if you have code assuming only one row per dimension, that may break or not work as expected. But if your data is such that that would never be the case, then of course you'd only get one row per dimension.
SELECT t.* FROM
the_table t
JOIN
(SELECT mtpdd.dimension_id, mtpdd.date_id, mtpdd.max_time AS time_id FROM
(SELECT dimension_id, date_id, max(time_id) as max_time
FROM the_table
GROUP BY dimension_id, date_id) AS mtpdd
JOIN
(SELECT dimension_id, max(date_id) as max_date
FROM the_table
GROUP BY dimension_id) AS mdpd
ON (mtpdd.dimension_id = mdpd.dimension_id
AND mtpdd.date_id = mdpd.max_date))
AS mdtpd
ON (t.dimension_id = mdtpd.dimension_id AND t.date_id = mdtpd.date_id
AND t.time_id = mdtpd.time_id)
ORDER BY dimension_id
After a little extra reading, you may want to replace all occurrences of "JOIN" in my query above with "INNER JOIN". DB2 may not like "JOIN" alone, and we don't want an OUTER JOIN to be assumed.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Greg
Open in new window