• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1967

select unique row based on most current date

ok, i have two example rows here in my database (table logs):

fields:

url                              ip              thedate
---------------------------------------------------------------------
http://www.mysite.com/test.php   192.168.25.5    2006-05-16 20:30:56
http://www.mysite.com/test.php   192.168.25.5    2006-05-16 20:02:28
---------------------------------------------------------------------

there are about 13,000 rows similar to this where the same ip is listed hundreds of times with a different timestamp on each row.  What i need is an sql query that pulls a single row per ip, the one with the latest time, along with the matching url.  i can't group by thedate because each date/time is different, which results in all of the rows being returned.  i just want one row per unique ip, the one with the latest date.
is this possible?

thank you!!
Asked by: linuxrox
 
linuxroxAuthor Commented:
trying to figure something out..no luck yet.  none of the queries i've come up with are working yet!
 
akshah123Commented:
SELECT url, ip, max(thedate)
FROM logs
GROUP BY ip, url;
 
snoyes_jwCommented:
Assuming you have version 4.1 or later, it would be:

SELECT *
FROM   logs t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM logs t2
              WHERE t1.ip = t2.ip);

akshah123's solution will not guarantee that the value you get for `url` matches the value you get for `thedate` in any row.
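A concrete illustration of that caveat, using made-up urls for one ip (hypothetical data; MySQL's default settings allow the non-aggregated url in the select list and pick an arbitrary value for it):

-- Suppose one ip visited two different urls:
--   http://www.mysite.com/a.php   192.168.25.5   2006-05-16 20:02:28
--   http://www.mysite.com/b.php   192.168.25.5   2006-05-16 20:30:56
-- Grouping by ip and url returns one row per (ip, url) pair, so two rows here:
SELECT url, ip, MAX(thedate) FROM logs GROUP BY ip, url;
-- Grouping by ip alone returns one row, but the url shown may come from either
-- visit; MySQL does not guarantee it is the row that holds MAX(thedate).
SELECT url, ip, MAX(thedate) FROM logs GROUP BY ip;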
 
linuxroxAuthor Commented:
akshah123, the url's are different also...sorry for not including that info.
so :

SELECT url, ip, max(thedate)
FROM logs
GROUP BY ip, url;

won't work either...
 
akshah123Commented:
then either try snoyes_jw's way (a little slow) or try ...

SELECT l1.url, l1.ip, l1.thedate
FROM logs l1
INNER JOIN logs l2 ON l2.ip = l1.ip
GROUP BY l1.ip
HAVING l1.thedate = max(l2.thedate);
 
linuxroxAuthor Commented:
snoyes_jw:

i ran your query on my server and it shut down mysqld :) heheh....  i'm running 5.0.18-nt
i tried it twice and it just froze and had to use KILL to stop the process hahahaha
 
linuxroxAuthor Commented:
same deal akshah123, that query halted the server and i had to shut down mysqld
 
akshah123Commented:
Please perform

explain
SELECT l1.url, l1.ip, l1.thedate
FROM logs l1
INNER JOIN logs l2 ON l2.ip = l1.ip
GROUP BY l1.ip
HAVING l1.thedate = max(l2.thedate);

and provide the results here. Also, the

SHOW Create table logs;

would be quite helpful too.
 
snoyes_jwCommented:
Well, that's not so good.

What indexes do you have on the table?

You can try the approach given in the manual for older versions of MySQL that didn't support subqueries.  It requires the use of multiple queries and a temporary table.
http://dev.mysql.com/doc/refman/4.1/en/example-maximum-column-group-row.html
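For reference, the temporary-table variant on that page looks roughly like this when adapted to the names used in this thread (a sketch; the column types in the temp table are assumptions):

-- 1) Collect each ip's newest timestamp into a temporary table
CREATE TEMPORARY TABLE tmp (ip VARCHAR(255), maxdate TIMESTAMP);
INSERT INTO tmp SELECT ip, MAX(thedate) FROM logs GROUP BY ip;
-- 2) Join back to pull the full matching rows
SELECT logs.* FROM logs JOIN tmp ON logs.ip = tmp.ip AND logs.thedate = tmp.maxdate;
-- 3) Clean up
DROP TABLE tmp;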
 
linuxroxAuthor Commented:
don't understand l1.url, l1.ip or l1.thedate
 
linuxroxAuthor Commented:
there are no indexes on the table logs
 
akshah123Commented:
>>there are no indexes on the table logs

You will at least need an index on ip for this query to produce results in a reasonable period of time.
 
linuxroxAuthor Commented:
i'm using Navicat 2005 to run the queries
 
NovaDenizenCommented:
You need an index on (url, ip, thedate) if you want the query to run in a reasonable amount of time.
 
snoyes_jwCommented:
akshah123: that query with the having is not going to produce the desired results.  The GROUP BY is executed before the HAVING, so you'll lose the data needed to search all of l2 for the maximum date value.

linuxrox: The bit with l1.url, l1.ip, l1.thedate comes from aliasing the table so we can join it to itself without ambiguity.
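A minimal illustration of that aliasing (the column list and LIMIT are just to keep the output small):

-- l1 and l2 are two independent aliases for the same table; without them,
-- a reference like "ip" would be ambiguous when the table is joined to itself.
SELECT l1.ip, l1.thedate AS this_row_date, l2.thedate AS other_row_date
FROM logs l1
INNER JOIN logs l2 ON l2.ip = l1.ip
LIMIT 5;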
 
linuxroxAuthor Commented:
here's the create table log:
what needs to be added?

# Table "log" DDL

CREATE TABLE `log` (
  `id` int(14) unsigned NOT NULL auto_increment,
  `url` varchar(255) default NULL,
  `ip` varchar(255) default NULL,
  `thedate` timestamp NULL default NULL,
  `referer` varchar(255) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='InnoDB free: 11264 kB; InnoDB free: 11264 kB; InnoDB free: 1'
 
linuxroxAuthor Commented:
need to read about indexes...unfamiliar with them.  i'm used to aliases, just thought you had to set an alias like,
select `log`.ip as IP, or something of that nature, to set an alias.
 
NovaDenizenCommented:
ALTER TABLE log ADD INDEX (url, ip, thedate);

I'm guessing that adding this index will basically double the disk space consumed by your table.
 
linuxroxAuthor Commented:
ok, i added those indexes now
 
linuxroxAuthor Commented:
akshah123: here is now explain on:
SELECT l1.url, l1.ip, l1.thedate
FROM log l1
INNER JOIN log l2 ON l2.ip = l1.ip
GROUP BY l1.ip
HAVING l1.thedate = max(l2.thedate);


id  select_type  table  type   key  key_len  rows   Extra
1   SIMPLE       l2     index  url  521      13863  Using index; Using temporary; Using filesort
1   SIMPLE       l1     index  url  521      13863  Using where; Using index
 
linuxroxAuthor Commented:
i ran the following even after putting in the indexes:

SELECT l1.url, l1.ip, l1.thedate
FROM log l1
INNER JOIN log l2 ON l2.ip = l1.ip
GROUP BY l1.ip
HAVING l1.thedate = max(l2.thedate)


hung up mysqld-nt and i had to kill the process
 
akshah123Commented:
>>i ran the following even after putting in the indexes:

That is because

ALTER TABLE log ADD INDEX (url, ip, thedate);

is totally useless for this query.  You need two separate indexes on the table log.  Try,

ALTER TABLE log ADD INDEX (ip);

ALTER TABLE log ADD INDEX (thedate);

ALTER TABLE log DROP INDEX `url`;  -- DROP INDEX takes the index name; MySQL auto-named the (url, ip, thedate) index after its first column, `url`
 
snoyes_jwCommented:
I concur with akshah123.  The reason is that indexes are evaluated left to right.  None of the queries use `url` for any of the join conditions or where clauses, so that index is not usable.
 
linuxroxAuthor Commented:
ok, here is now the create table log:
CREATE TABLE `log` (
  `id` int(14) unsigned NOT NULL auto_increment,
  `url` varchar(255) default NULL,
  `ip` varchar(255) default NULL,
  `thedate` timestamp NULL default NULL,
  `referer` varchar(255) default NULL,
  PRIMARY KEY  (`id`),
  KEY `ip` (`ip`),
  KEY `thedate` (`thedate`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='InnoDB free: 11264 kB; InnoDB free: 11264 kB; InnoDB free: 1'
 
linuxroxAuthor Commented:
hmm, well i reran that query akshah123...took like 30 seconds and system seemed locked up.
after the query was over only 20 results were shown from a table that holds over 12 thousand rows :)
weird stuff i guess.
 
akshah123Commented:
>>>hmm, well i reran that query akshah123...took like 30 seconds and system seemed locked up.
after the query was over only 20 results were shown from a table that holds over 12 thousand rows :)
weird stuff i guess.

Did you try snoyes_jw's initial query?  As snoyes_jw pointed out earlier, there might be a major flaw in my query...
 
linuxroxAuthor Commented:
let me try his...
 
linuxroxAuthor Commented:
ok, i tried his and it is returning more rows...i stopped the results at 300 because it was just bogging down the server and took so long.  it's taking 30 seconds or so and my cpu usage just goes skyrocketing.
don't get me wrong...this server is not real powerful at all...only 550 MHz, but nothing big runs on it really and i never have any issues with it on any other queries...but anyway, running this:

SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip);

took about the same amount of time, loading up the server a lot yet returned over 300 rows which is about half the number of rows that should be returned.  i know from just doing a group by ip that there are around 600 unique ip's in the database...but like i said, i had to stop the query because it slowed the system down to a crawl.
 
NovaDenizenCommented:
I mentioned the (url, ip, thedate) index with the idea of optimizing this query:
SELECT url, ip, max(thedate)
FROM logs
GROUP BY ip, url;

and I see that I put ip and url in the wrong order.  So, I would suggest dropping my (url, ip, thedate) index and adding a (ip, url, thedate) index, assuming you want to do this particular query.

Adding two separate indexes, one for ip and one for thedate, is not very useful.  The query will only be able to use one of them.
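Given the table as it stands in the SHOW CREATE TABLE output a few comments up (separate `ip` and `thedate` keys), that change would look roughly like this (a sketch; MySQL will auto-name the new composite index `ip` after its first column):

-- Drop the two single-column indexes added earlier (names taken from the DDL above)
ALTER TABLE log DROP INDEX `ip`;
ALTER TABLE log DROP INDEX `thedate`;
-- Add one composite index with the grouping/filtering column first
ALTER TABLE log ADD INDEX (ip, url, thedate);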

 
linuxroxAuthor Commented:
Ok, i'll try that..
currently, having not done what you just suggested with the indexes, the following query
took about 30 seconds to produce 549 results:

SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip)
order by t1.ip Desc
 
linuxroxAuthor Commented:
ok, made those changes with the indexes:  log now looks like this:
# Table "log" DDL

CREATE TABLE `log` (
  `id` int(14) unsigned NOT NULL auto_increment,
  `url` varchar(255) default NULL,
  `ip` varchar(255) default NULL,
  `thedate` timestamp NULL default NULL,
  `referer` varchar(255) default NULL,
  PRIMARY KEY  (`id`),
  KEY `ip` (`ip`,`url`,`thedate`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='InnoDB free: 11264 kB; InnoDB free: 11264 kB; InnoDB free: 1'

query:  (still takes like 20 seconds or so)  is that normal?

SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip)
order by t1.ip Desc
 
linuxroxAuthor Commented:
what's funny is that if i just do:
SELECT *
FROM   log

---it takes like 2 seconds to show all 14,046 rows and fields!

but this takes at least 8 times longer:

SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip)
order by t1.ip Desc
 
NovaDenizenCommented:
The subselect is really slow.  I'm still confused about what you want to accomplish with the query.
 
linuxroxAuthor Commented:
ok..very simple actually...

that table is full of rows containing a URL, an ip address, the date, and a referer url, all of which are generated from a php program and fed into the log table of my database.

so several hundred rows could contain the exact same ip address BUT have a different timestamp and different referers and URLs.
SO, what i was wanting is only ONE row per IP, the one that ALSO has the most RECENT timestamp, together with its referer and url.  that way, i don't get duplicate IP results from the query...just the one row with the most recent timestamp per unique ip address.  understand what i mean?

i mean, it would be simple if you could do:

SELECT *
FROM   log
WHERE  `log`.thedate = -- some sort of way where the date was the most current or closest to most current
GROUP BY `log`.ip
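
One way to express exactly that intent with real SQL is a derived table instead of a correlated subquery (a sketch; the alias `latest` is just illustrative). It computes each ip's newest timestamp once and joins back to fetch the full matching row:

SELECT l.*
FROM log l
JOIN (SELECT ip, MAX(thedate) AS maxdate
      FROM log
      GROUP BY ip) latest
  ON latest.ip = l.ip
 AND latest.maxdate = l.thedate;

Note that if two rows share the same ip and the exact same thedate, both will be returned.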
 
aminerdCommented:
I believe you should see a performance increase in the subquery version by modifying your current index on ip, url, and thedate to just ip and thedate. In that case, MySQL will read MAX(thedate) directly from the index and not have to seek to the records.
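A sketch of that change, based on the most recent SHOW CREATE TABLE output (where the composite index is named `ip`):

-- Drop the current (ip, url, thedate) composite index, named `ip` in the DDL above
ALTER TABLE log DROP INDEX `ip`;
-- Replace it with one the correlated subquery can read MAX(thedate) straight from
ALTER TABLE log ADD INDEX (ip, thedate);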
 
linuxroxAuthor Commented:
i transferred the table to my home machine (AMD dual core X2 4400+ OC'd at 2.5 GHz, 1 gig ram) and the query took only 3 seconds.
there are 15,322 rows to search through to pull that data and it didn't really place any burden on my home pc.  what i'm wondering is, if i use a typical php snippet like this:

<?php
$query = mysql_query("
SELECT *
FROM   log t1
WHERE  thedate=(SELECT MAX(t2.thedate)
              FROM log t2
              WHERE t1.ip = t2.ip)
order by t1.ip Desc
");
while ($r = mysql_fetch_assoc($query)) { echo $r['ip']; }
?>

is that gonna bring it to a crawl again on my work machine, which is much slower?

Also, from what i've learned here today it seems as if using indexes DOES speed things up quite a bit...i noticed that on normal queries.  however, it also seems that if you can get OUT of using subqueries at all, that is the way to go, which in my case points to poor database design in the logging feature.  from what i've seen of subqueries so far, they don't seem too friendly or efficient when there are a large number of rows to sift through.  but if i'm NOT using subqueries, is it still best to go ahead and use indexes anyway?
 
snoyes_jwCommented:
I'm no great expert when it comes to indexing strategies.  But as far as I understand it, you need to look at the parts of the query that affect which rows are selected.  That's stuff that shows up in the WHERE clause and the join condition.  Indexing things that appear in the select list doesn't help any.

For the query above, you are searching for rows based on `thedate` and `ip`.  I think if you index both of those, the `ip` index will be used within the subquery, and `thedate` index will be used on the outer query.  But an EXPLAIN would show which, if any, MySQL will use.

It is natural that a simple SELECT * FROM log would execute quickly; MySQL doesn't have to do any joining or sorting or anything; just fetch and return all rows at once.  If I say, "read me that book", or if I say, "read to me just those words in that book which appear more than once", which can you do faster?  Clearly the first, because you don't have to go count each word and see how often it appears.

Yes, subqueries can be slow.  But sometimes, they are the most reasonable way to get the information you want.  You can always make the results faster by upgrading the hardware.
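
For example, running EXPLAIN on the slow query itself (a sketch against the current table) shows, per table, which index the optimizer picks for the outer query and for the correlated subquery:

EXPLAIN
SELECT *
FROM   log t1
WHERE  thedate = (SELECT MAX(t2.thedate)
                  FROM log t2
                  WHERE t1.ip = t2.ip);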
 
linuxroxAuthor Commented:
thanks snoyes_jw, yes the hardware IS an issue.  i can't upgrade it though :)  no allowance!  and yes that totally makes sense on the select * process and why it would be much quicker.  indexing is interesting and is new to me.  i'll read up on that.  
I thank you all for the valuable information and help!  it'll take me a while to figure out how to distribute the points on this one!!  I thank all of you though quite a bit!
 
linuxroxAuthor Commented:
something else VERY interesting i just saw.  looks like after running the query a couple of times when no rows have been altered or no rows added, the query is instant :)  then i added another row and it slowed down to about 7 seconds or so like normal....i didn't know mysql did that; as if it sees that nothing has changed in the dataset and returns the same results for the query.  is that normal ?
 
aminerdCommented:
"Indexing things that appear in the select list doesn't help any."

Actually, it's my understanding that it can. If everything you're selecting is also present in the index, MySQL doesn't have to seek the drive to each individual record to retrieve the data; it just reads it straight off the index. For instance:

SELECT col3 FROM test WHERE col1='' and col2='';

If your index is on col1, col2, and col3, MySQL will read col3 from the index, not from the actual record. It's more efficient because MySQL doesn't have to spin the disk to get col3.
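
For that example, the covering index would be something like this (a sketch; the table and column names are the hypothetical ones from the query above, and idx_cover is just an illustrative name):

-- With col1, col2, and col3 all in one index, the query above can be answered
-- from the index alone (EXPLAIN shows "Using index"), without touching the rows.
ALTER TABLE test ADD INDEX idx_cover (col1, col2, col3);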
 
linuxroxAuthor Commented:
dang...cool stuff.  so much to learn, so little time!
 
NovaDenizenCommented:
> looks like after running the query a couple of times when no rows have been altered or no rows added, the query is instant :)

mysql caches queries and their results.  If the underlying data has not changed since query X was run, and query X comes in again, it will just return the same data without actually running the query.
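
If you want to confirm that (assuming MySQL 5.0, where the query cache exists), these two statements show its settings and its hit counters:

-- Is the query cache enabled, and how much memory is allotted to it?
SHOW VARIABLES LIKE 'query_cache%';
-- Qcache_hits vs. Qcache_inserts shows how often results are served from the cache.
SHOW STATUS LIKE 'Qcache%';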
 
linuxroxAuthor Commented:
hmm, looks like there is a timeout period or something.  even though the rows haven't changed, after a minute or so it slows down and runs what seems to be the normal query instead of using the cached result, returning the same number of rows; which must be the exact same data because there is no way the current rows could have been altered...strange.
 
linuxroxAuthor Commented:
actually, my bad...i take that back....i had a window open that autorefreshed; thus changing the time field for the row :)  so yes, the cache works great.