Link to home
Start Free TrialLog in
Avatar of letharion
letharionFlag for Sweden

asked on

How to select max/min/first/last by group. Aka "Can my program become a short query"?

I have a program that "selects * from table" and the processes this data, like this:

1) Group each entry by their datetime column, ignoring second precision, order the group by an index value (column i)
2) Find the largest, smallest, first, and last value of column x, in each group.
3) Sum all the values of column y in each group
4) Loop over all data, the write to disc.

I have already written the program, so the task is done, but I think the whole think looks like it could be done easily directly in sql. If this can be done in a "short enough" query, then I'd have fewer bugs to worry about in my own code.

Mainly as an exercise to myself in sql, I'm wondering just how easily this can be done in sql.

Selecting the max(x) and min(x) values of each group is probably quite simple, and I'm hoping the first/last problem could be solved by using max(i)/min(i), but I'm not very familiar with sql, and just randomly guessing seems like a big waste of time.

Again, my question is really "how short" this can be, since if the query requires a lot lines, I'm actually worse of, since I would then need to debug in an unfamiliar language.
Avatar of Goodangel Matope
Goodangel Matope
Flag of Zambia image

Well, if you are unfamiliar with SQL then it might be a good idea to leave it as is. However, if you are familiar with MySQL stored procedures, that would be the way to go. You could create a stored procedure that obtains the result set and saves it to a disk file.
Avatar of letharion

ASKER

>if you are unfamiliar with SQL then it might be a good idea to leave it as is
Good point. But regardless of whether I bring such a query into a production environment or not, it would still have been interesting (and hopefully effective) to learn SQL by having "mapped" a query to my code.
the main reason for doing this in the sql is to reduce the network traffic/overhead of your calling application
having to retrieve each row from the database, which could be considerable

SOMETHING LIKE THIS


SELECT THEDATE,MINX,MAXX,TOTY,Z.X AS FIRSTX,Z1.X AS LASTX
FROM (
select thedate,min(x) as minx,max(x) as maxx,sum(y) as toty,min(I) AS MINI,MAX(I) AS MAXI
FROM (
select date_format(datecol,'%Y%b%d %H%i') as thedate, x, y,i
from yourtable
) AS X
GROUP BY THEDATE
) AS y
) AS xx
iNNER jOIN
(
select date_format(datecol,'%Y%b%d %H%i') as thedate, x,i
from yourtable
) AS Z
ON xx.THEDATE=Z.THEDATE
AND MINI=Z.I
iNNER jOIN
(
select date_format(datecol,'%Y%b%d %H%i') as thedate, x,i
from yourtable
) AS Z1
ON xx.THEDATE=Z1.THEDATE
AND MAXI=Z1.I
ORDER BY 1
Thanks a lot :)

I'm gonna go ahead and see if I can make some sense of that, but I notice one thing right away. The left parenthesis on the line
) AS xx
doesn't seem to have a corresponding right parenthesis.
ASKER CERTIFIED SOLUTION
Avatar of Lowfatspread
Lowfatspread
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
The query validates, and runs.

Unfortunately I get:
ERROR 126 (HY000): Incorrect key file for table '/tmp/#sql_3fc_2.MYI'; try to repair it

I googled a bit and found:
"was the result of a temporary table getting so large that it filled TMPDIR, generating the above error."

I'm gonna look into it.
Hmm, yep, "watch df -h" during the query verifies I run out of disc space. (I have 3G free on that computer)
Ok, so I worked that the disc space issue out.
The query runs for a minute, and then quits with

ERROR 1052 (23000) at line 1: Column 'THEDATE' in field list is ambiguous

I'm wildly guessing here that the very first THEDATE should be y.THEDATE?

Anyway, is there anything in particular that should be indexed for this query to run faster? I've been waiting for >2 hours now
well a covering  index on the three columns

 I,Datecol,x

in that order may help...

and an index on the DateCol itself may also help
Neither did anything to help unfortunately.
I'm just gonna very that the output is what I expect it to be, then I'll close this question.

New question specific to performance optimization here:
https://www.experts-exchange.com/questions/25784510/Optimize-time-for-large-query.html?fromWizard=true
Works well :)