Solved

select unique row based on most current date

Posted on 2006-05-17
Medium Priority
1,952 Views
Last Modified: 2008-06-25
OK, I have two example rows here in my database (table `logs`).

Fields:

url                                  ip                thedate
--------------------------------------------------------------------------
http://www.mysite.com/test.php       192.168.25.5      2006-05-16 20:30:56
http://www.mysite.com/test.php       192.168.25.5      2006-05-16 20:02:28
--------------------------------------------------------------------------

There are about 13,000 rows like this, where the same IP is listed hundreds of times with a different timestamp on each row.  What I need is an SQL query that, for each IP, pulls the single row with the latest timestamp along with the matching URL.  I can't GROUP BY thedate because each date/time is different, which would just return all of the rows.  I only want one row per IP (no duplicate IPs), chosen by the latest date.
Is this possible?

Thank you!!
Question by: linuxrox

46 Comments
 

Author Comment

by:linuxrox
ID: 16702753
Trying to figure something out... no luck yet.  None of the queries I've come up with are working yet!
 
LVL 17

Expert Comment

by:akshah123
ID: 16702755
SELECT url, ip, max(thedate)
FROM logs
GROUP BY ip, url;

 
LVL 33

Accepted Solution

by:
snoyes_jw earned 1000 total points
ID: 16702774
Assuming you have version 4.1 or later, it would be:

SELECT *
FROM   logs t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM logs t2
              WHERE t1.ip = t2.ip);

akshah123's solution will not guarantee that the `url` value you get comes from the same row as the maximum `thedate`.
 

Author Comment

by:linuxrox
ID: 16702803
akshah123, the URLs are different too... sorry for not including that info.
So:

SELECT url, ip, max(thedate)
FROM logs
GROUP BY ip, url;

won't work either...
 
LVL 17

Expert Comment

by:akshah123
ID: 16702856
Then either try snoyes_jw's way (a little slow) or try ...

SELECT l1.url, l1.ip, l1.thedate
FROM logs l1
INNER JOIN logs l2 ON l2.ip = l1.ip
GROUP BY l1.ip
HAVING l1.thedate = max(l2.thedate);
 

Author Comment

by:linuxrox
ID: 16702923
snoyes_jw:

I ran your query on my server and it shut down mysqld :) heheh...  I'm running 5.0.18-nt.
I tried it twice; it just froze and I had to use KILL to stop the process hahahaha.
 

Author Comment

by:linuxrox
ID: 16702943
Same deal, akshah123: that query halted the server and I had to shut down mysqld.
 
LVL 17

Assisted Solution

by:akshah123
akshah123 earned 600 total points
ID: 16702961
Please perform

explain
SELECT l1.url, l1.ip, l1.thedate
FROM logs l1
INNER JOIN logs l2 ON l2.ip = l1.ip
GROUP BY l1.ip
HAVING l1.thedate = max(l2.thedate);

and provide the results here.  Also, the output of

SHOW CREATE TABLE logs;

would be quite helpful too.
 
LVL 33

Expert Comment

by:snoyes_jw
ID: 16702963
Well, that's not so good.

What indexes do you have on the table?

You can try the approach given in the manual for older versions of MySQL that didn't support subqueries.  It requires the use of multiple queries and a temporary table.
http://dev.mysql.com/doc/refman/4.1/en/example-maximum-column-group-row.html
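
For reference, a minimal sketch of that temporary-table approach, using the table and column names from the queries above (adjust them to your actual schema), might look like:

-- build the per-IP latest timestamp once
CREATE TEMPORARY TABLE tmp_latest
SELECT ip, MAX(thedate) AS maxdate
FROM logs
GROUP BY ip;

-- join back to pick up the full rows
SELECT l.url, l.ip, l.thedate
FROM logs l
JOIN tmp_latest t ON t.ip = l.ip AND t.maxdate = l.thedate;

DROP TEMPORARY TABLE tmp_latest;

The first statement materializes the per-IP maximum once, so the final join avoids the correlated subquery.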
 

Author Comment

by:linuxrox
ID: 16702983
I don't understand l1.url, l1.ip, or l1.thedate.
 

Author Comment

by:linuxrox
ID: 16702998
There are no indexes on the table logs.
 
LVL 17

Expert Comment

by:akshah123
ID: 16703009
>>there are no indexes on the table logs

You will at least need an index on ip for this query to produce results in a reasonable period of time.
 

Author Comment

by:linuxrox
ID: 16703055
I'm using Navicat 2005 to run the queries.
 
LVL 22

Expert Comment

by:NovaDenizen
ID: 16703074
You need an index on (url, ip, thedate) if you want the query to run in a reasonable amount of time.
 
LVL 33

Expert Comment

by:snoyes_jw
ID: 16703086
akshah123: that query with the HAVING is not going to produce the desired results.  The GROUP BY is executed before the HAVING, so you'll have lost the data needed to search all of l2 for the maximum date value.

linuxrox: The bits l1.url, l1.ip, and l1.thedate come from aliasing the table so we can join it to itself without ambiguity.
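
A version of that idea which does keep the per-IP maximum available is to join against a derived table instead of filtering in HAVING.  This is just a sketch (it assumes MySQL 4.1+ for subqueries and uses the table name from the queries above):

SELECT l.url, l.ip, l.thedate
FROM logs l
INNER JOIN (SELECT ip, MAX(thedate) AS maxdate
            FROM logs
            GROUP BY ip) m
        ON m.ip = l.ip
       AND m.maxdate = l.thedate;

The derived table is computed once, so this usually behaves better than a correlated subquery on larger tables.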
 

Author Comment

by:linuxrox
ID: 16703112
Here's the CREATE TABLE for log.
What needs to be added?

# Table "log" DDL

CREATE TABLE `log` (
  `id` int(14) unsigned NOT NULL auto_increment,
  `url` varchar(255) default NULL,
  `ip` varchar(255) default NULL,
  `thedate` timestamp NULL default NULL,
  `referer` varchar(255) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='InnoDB free: 11264 kB; InnoDB free: 11264 kB; InnoDB free: 1'
 

Author Comment

by:linuxrox
ID: 16703181
I need to read about indexes... I'm unfamiliar with them.  I'm used to aliases; I just thought you had to set an alias explicitly, like
SELECT `log`.ip AS IP, or something to that effect.
 
LVL 22

Expert Comment

by:NovaDenizen
ID: 16703187
ALTER TABLE log ADD INDEX (url, ip, thedate);

I'm guessing that adding this index will basically double the disk space consumed by your table.
 

Author Comment

by:linuxrox
ID: 16703219
OK, I added that index now.
 

Author Comment

by:linuxrox
ID: 16703281
akshah123: here is the EXPLAIN now for:
SELECT l1.url, l1.ip, l1.thedate
FROM log l1
INNER JOIN log l2 ON l2.ip = l1.ip
GROUP BY l1.ip
HAVING l1.thedate = max(l2.thedate);


id  select_type  table  type   key  key_len  rows   Extra
1   SIMPLE       l2     index  url  521      13863  Using index; Using temporary; Using filesort
1   SIMPLE       l1     index  url  521      13863  Using where; Using index
 

Author Comment

by:linuxrox
ID: 16703298
I ran the following even after adding the index:

SELECT l1.url, l1.ip, l1.thedate
FROM log l1
INNER JOIN log l2 ON l2.ip = l1.ip
GROUP BY l1.ip
HAVING l1.thedate = max(l2.thedate)


It hung up mysqld-nt and I had to kill the process.
 
LVL 17

Expert Comment

by:akshah123
ID: 16703311
>>I ran the following even after adding the index:

That is because

ALTER TABLE log ADD INDEX (url, ip, thedate);

is totally useless here.  You need two separate indexes on the table log.  Try:

ALTER TABLE log ADD INDEX (ip);

ALTER TABLE log ADD INDEX (thedate);

ALTER TABLE log DROP INDEX `url`;

(The composite index picked up the default name `url`, after its first column, which is also what shows in the key column of your EXPLAIN output.)
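
If you want to confirm what is actually on the table after those changes, SHOW INDEX is the standard way to check; the output lists every index and its columns:

SHOW INDEX FROM log;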
 
LVL 33

Expert Comment

by:snoyes_jw
ID: 16703339
I concur with akshah123.  The reason is that indexes are evaluated left to right.  None of the queries use `url` in any of the join conditions or WHERE clauses, so that index is not usable.
 

Author Comment

by:linuxrox
ID: 16703377
OK, here is the CREATE TABLE for log now:
CREATE TABLE `log` (
  `id` int(14) unsigned NOT NULL auto_increment,
  `url` varchar(255) default NULL,
  `ip` varchar(255) default NULL,
  `thedate` timestamp NULL default NULL,
  `referer` varchar(255) default NULL,
  PRIMARY KEY  (`id`),
  KEY `ip` (`ip`),
  KEY `thedate` (`thedate`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='InnoDB free: 11264 kB; InnoDB free: 11264 kB; InnoDB free: 1'
 

Author Comment

by:linuxrox
ID: 16703465
Hmm, well, I reran that query, akshah123... it took about 30 seconds and the system seemed locked up.
After the query was over, only 20 results were shown from a table that holds over 12 thousand rows :)
Weird stuff, I guess.
 
LVL 17

Expert Comment

by:akshah123
ID: 16703486
>>>Hmm, well, I reran that query, akshah123... it took about 30 seconds and the system seemed locked up.
After the query was over, only 20 results were shown from a table that holds over 12 thousand rows :)
Weird stuff, I guess.

Did you try snoyes_jw's initial query?  As snoyes_jw pointed out earlier, there might be a major flaw in my query...
 

Author Comment

by:linuxrox
ID: 16703530
Let me try his...
 

Author Comment

by:linuxrox
ID: 16703596
OK, I tried his and it is returning more rows... I stopped the results at 300 because it was bogging down the server and taking so long.  It takes 30 seconds or so and my CPU usage goes skyrocketing.
Don't get me wrong... this server is not very powerful at all, only 550 MHz, but nothing big really runs on it and I never have any issues with it on any other queries.  Anyway, running this:

SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip);

took about the same amount of time, loading up the server a lot, yet returned over 300 rows, which is about half the number of rows that should be returned.  I know from just doing a GROUP BY ip that there are around 600 unique IPs in the database... but like I said, I had to stop the query because it slowed the system to a crawl.
 
LVL 22

Expert Comment

by:NovaDenizen
ID: 16703786
I mentioned the (url, ip, thedate) index with the idea of optimizing this query:
SELECT url, ip, max(thedate)
FROM logs
GROUP BY ip, url;

and I see that I put ip and url in the wrong order.  So, I would suggest dropping my (url, ip, thedate) index and adding an (ip, url, thedate) index, assuming you want to run this particular query.

Adding two separate indexes, one for ip and one for thedate, is not very useful.  The query will only be able to use one of them.
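
Concretely, that swap would be something like the statements below.  This is only a sketch; the index names come from the SHOW CREATE TABLE output earlier in the thread, so verify them with SHOW INDEX FROM log before dropping anything.

-- drop the two single-column indexes added earlier
ALTER TABLE log DROP INDEX `ip`, DROP INDEX `thedate`;

-- add the composite index with ip first
ALTER TABLE log ADD INDEX (ip, url, thedate);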

 

Author Comment

by:linuxrox
ID: 16703832
OK, I'll try that.
Currently, not having done what you just suggested with the indexes, the following query
took about 30 seconds to produce 549 results:

SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip)
order by t1.ip Desc
 

Author Comment

by:linuxrox
ID: 16703906
OK, I made those changes to the indexes.  log now looks like this:
# Table "log" DDL

CREATE TABLE `log` (
  `id` int(14) unsigned NOT NULL auto_increment,
  `url` varchar(255) default NULL,
  `ip` varchar(255) default NULL,
  `thedate` timestamp NULL default NULL,
  `referer` varchar(255) default NULL,
  PRIMARY KEY  (`id`),
  KEY `ip` (`ip`,`url`,`thedate`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='InnoDB free: 11264 kB; InnoDB free: 11264 kB; InnoDB free: 1'

Query (it still takes 20 seconds or so; is that normal?):

SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip)
order by t1.ip Desc
 

Author Comment

by:linuxrox
ID: 16703934
What's funny is that if I just do:

SELECT *
FROM   log

it takes about 2 seconds to show all 14,046 rows and fields!

But this takes at least 8 times longer:

SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip)
order by t1.ip Desc
 
LVL 22

Expert Comment

by:NovaDenizen
ID: 16703954
The subselect is really slow.  I'm still confused about what you want to accomplish with the query.
 

Author Comment

by:linuxrox
ID: 16704029
OK... it's very simple actually.

That table is full of rows containing a URL, an IP address, the date, and a referer URL, all of which are generated by a PHP program and fed into the log table of my database.

So several hundred rows could contain the exact same IP address BUT have a different timestamp and different referers and URLs.
SO, what I want is only one row per IP, the one that ALSO has the most RECENT timestamp, along with its referer and URL.  That way I don't get duplicate IP results from the query, just the row with the most recent timestamp per unique IP address.  Understand what I mean?

I mean, it would be simple if you could do:

SELECT *
FROM   log
WHERE
`log`.thedate = -- some sort of condition picking the most current (or closest to most current) date
GROUP BY
`log`.ip
 
LVL 5

Expert Comment

by:aminerd
ID: 16704263
I believe you should see a performance increase in the subquery version by modifying your current index on ip, url, and thedate to just ip and thedate. In that case, MySQL will read MAX(thedate) directly from the index and not have to seek to the records.
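
If you want to try that, the change would look something like this (a sketch only; the current composite index is named `ip` in the SHOW CREATE TABLE output above, after its first column, so check SHOW INDEX FROM log first):

-- replace the (ip, url, thedate) index with a narrower (ip, thedate) one
ALTER TABLE log DROP INDEX `ip`;
ALTER TABLE log ADD INDEX (ip, thedate);

With that index in place, the MAX(thedate) subquery can be answered from the index alone.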
 

Author Comment

by:linuxrox
ID: 16705575
I transferred the table to my home machine (AMD dual-core X2 4400+ OC'd at 2.5 GHz, 1 GB RAM) and the query took only 3 seconds.
There are 15,322 rows to search through to pull that data and it didn't really place any burden on my home PC.  What I'm wondering is: if I use a typical snippet like

<?php
$query = mysql_query("
SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip)
order by t1.ip Desc
");
while ($r = mysql_fetch_assoc($query)) { echo $r['ip']; }
?>

from within PHP, is that going to bring it to a crawl again on my work machine, which is much slower?

Also, I'm not sure, but from what I've learned here today it seems that using indexes DOES speed things up quite a bit; I noticed that on normal queries.  However, it also seems that if you can avoid subqueries at all, that is the way to go, which in my case means the database design for the logging feature is poor.  From what I've seen of subqueries so far, they don't seem very friendly or efficient with a large number of rows to sift through.  But if I'm NOT using subqueries, is it still best to go ahead and use indexes anyway?
 
LVL 33

Expert Comment

by:snoyes_jw
ID: 16708003
I'm no great expert when it comes to indexing strategies.  But as far as I understand it, you need to look at the parts of the query that affect which rows are selected.  That's stuff that shows up in the WHERE clause and the join condition.  Indexing things that appear in the select list doesn't help any.

For the query above, you are searching for rows based on `thedate` and `ip`.  I think if you index both of those, the `ip` index will be used within the subquery, and `thedate` index will be used on the outer query.  But an EXPLAIN would show which, if any, MySQL will use.

It is natural that a simple SELECT * FROM log would execute quickly; MySQL doesn't have to do any joining or sorting or anything; just fetch and return all rows at once.  If I say, "read me that book", or if I say, "read to me just those words in that book which appear more than once", which can you do faster?  Clearly the first, because you don't have to go count each word and see how often it appears.

Yes, subqueries can be slow.  But sometimes, they are the most reasonable way to get the information you want.  You can always make the results faster by upgrading the hardware.
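
For example, running EXPLAIN on the query from earlier in the thread is one way to check:

EXPLAIN
SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip)
order by t1.ip Desc;

The key column in the output shows which index, if any, each part of the query actually uses.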
 

Author Comment

by:linuxrox
ID: 16708570
Thanks, snoyes_jw.  Yes, the hardware IS an issue; I can't upgrade it though :) no allowance!  And yes, that totally makes sense about the SELECT * case and why it would be much quicker.  Indexing is interesting and new to me; I'll read up on it.
Thank you all for the valuable information and help!  It'll take me a while to figure out how to distribute the points on this one!!  I thank all of you quite a bit!
 

Author Comment

by:linuxrox
ID: 16708709
Something else VERY interesting I just saw: it looks like after running the query a couple of times, when no rows have been altered or added, the query is instant :)  Then I added another row and it slowed back down to about 7 seconds, like normal.  I didn't know MySQL did that; it's as if it sees that nothing has changed in the dataset and returns the same results for the query.  Is that normal?
 
LVL 5

Assisted Solution

by:aminerd
aminerd earned 400 total points
ID: 16708730
"Indexing things that appear in the select list doesn't help any."

Actually, it's my understanding that it can. If everything you're selecting is also present in the index, MySQL doesn't have to seek the drive to each individual record to retreive the data; it just reads it straight off the index. For instance:

SELECT col3 FROM test WHERE col1='' and col2='';

If your index is on col1, col2, and col3, MySQL will read col3 from the index, not from the actual record. It's more effecient because MySQL doesn't have to spin the disk to get col3.
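
For that example, the covering index would be created along these lines (the table name test and its columns are just from the illustration above):

ALTER TABLE test ADD INDEX (col1, col2, col3);

When the query can be served entirely from the index, EXPLAIN shows "Using index" in the Extra column.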
 
LVL 33

Expert Comment

by:snoyes_jw
ID: 16708732
 

Author Comment

by:linuxrox
ID: 16708777
Dang... cool stuff.  So much to learn, so little time!
 
LVL 22

Expert Comment

by:NovaDenizen
ID: 16708954
> looks like after running the query a couple of times when no rows have been altered or no rows added, the query is instant :)

MySQL caches queries and their results.  If the underlying data has not changed since query X was run, and query X comes in again, MySQL just returns the same data without actually running the query.
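
If you want to see whether the query cache is enabled and how it is being used, these are standard commands on MySQL of this era (the query cache was removed later, in MySQL 8.0):

SHOW VARIABLES LIKE 'query_cache%';
SHOW STATUS LIKE 'Qcache%';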
 

Author Comment

by:linuxrox
ID: 16709004
Hmm, it looks like there is a timeout period or something.  Even though the rows haven't changed, after a minute or so it slows down and seems to run the normal query instead of using the cached result, returning the same number of rows, which must be the exact same data because there is no way the current rows can have been altered... strange.
 

Author Comment

by:linuxrox
ID: 16709019
Actually, my bad... I take that back.  I had a window open that auto-refreshed, which changed the time field for a row :)  So yes, the cache works great.
