MySQL - DISTINCT on single column



Hello,

I am having a problem applying a DISTINCT attribute to a single column.  Here is my scenarion.

We have a MySQL table used to manage the published files for a given software application

Table:  manifest_file

FileID
FileName
FileCRC32
VersionMajor
VersionMinor
VersionDate
FirmID


FirmID represents who the file is published to.  0 means everyone.


I need to return the FileID for the files with the highest Version numbers & dates.  


SELECT DISTINCT FileName, FileID
FROM manifest_file
WHERE Firmid=1 OR FirmID=0
ORDER BY VersionMajor DESC, VersionMinorDESC, VersionDateDESC

Unfortunately, this applies the DISTINCT attribute to both columns.


Using a GROUP BY returns the right number of results, but the row data is arbitrary (so the FileID and other fields will not be correct)


SELECT FileID, FileName, FileCr32, VersionMajor, VersionMinor, VersionDate
FROM manifest_file
WHERE FirmID=1 OR FirmID=0
GROUP BY FileName
ORDER BY VersionMajor DESC, VersionMinor DESC, VersionDate DESC




Please help me.  I, essentially, just need the Distinct to be applied to FileName but the ORDER BY statement must be honored in order to assure I get the correct results.

Thank you in advance
ahijabAsked:
Who is Participating?
 
Kevin CrossConnect With a Mentor Chief Technology OfficerCommented:
Yes. I agree. However, just be smart in the way this is constructed. Look at my point on formula I showed. If you have '9.16.20111001' and '10.4.20111001', take a look at what happens:

SELECT VersionString
FROM (
   SELECT '9.16.20111001' VersionString UNION
   SELECT '10.4.20111001'
) derived
ORDER BY VersionString DESC
;

Open in new window


If you choose to go with VersionString and need it in that format, then see if this helps you:
http://www.experts-exchange.com/Programming/Languages/SQL_Syntax/A_252-How-to-Mimic-the-T-SQL-IsNumeric-Function-in-MySQL.html
It is an article on creating the IsNumeric() and IsInteger() functions, but I show an example that involved sorting on both sides of a delimiter like '.'. You can actually pull the date apart then just treat the rest as a decimal number then compare.

If you can avoid it, though, that is even better, especially since you are going this route to make your queries simpler. :)

Even though you are choosing to simplify, did you try http:#36898036 ?
If only because I spent the time typing it for you, it would be great for you to see if working. ;) If you run the inner query by itself, you can see what is going on under the covers and helps explain the rank = 1 on the outside query.
0
 
Ray PaseurCommented:
Have you tried SELECT FileID, DISTINCT FileName ...?

It seems to me that this should be two tables instead of one.  A junction table could be used to create the joins you want.
0
 
ahijabAuthor Commented:
Yes, I was hoping for the same - unfortunately, that is not acceptable syntax in modern MySQL.
0
Improve Your Query Performance Tuning

In this FREE six-day email course, you'll learn from Janis Griffin, Database Performance Evangelist. She'll teach 12 steps that you can use to optimize your queries as much as possible and see measurable results in your work. Get started today!

 
Kevin CrossChief Technology OfficerCommented:
Yes, when you use GROUP BY, you have to tell SQL which row to pull. Most systems generate an error, but MySQL simply assigns a FIRST/LAST type aggregate by just picking the value to display. Why you need is to either rank the data or use a JOIN to a derived table getting the max version per FileID. For the latter, you should really check if you need all three, i.e., is VersionDate enough to identify the latest VersionMajor and VersionMinor combination? Normally, you release newer versions with later dates, right?

For the ranking, you can see my ranking article for explanation.

If you need help with examples, please post some sample data along with table create script and specify the expected results, i.e., confirm which field is truly needed to find the right row for a given FileID.
0
 
Ray PaseurCommented:
Crap.  Seems like MySQL should be smart enough to handle that.

Here is what I would do, given the current single-table configuration.  Select all the rows with a thoughtfully designed ORDER BY clause.  Iterate over the selections with the while() iterator.  With each row, add an element to an associative array.  The key of the array will be the FileName, and the value of the array will be the entire row.  As FileName strings match the array keys, the corresponding row of the array will be overwritten and in the end you will have the last element of the results set in the associative array.

Does that make sense for your needs?
0
 
ahijabAuthor Commented:
Sure, I know how I could solve my solution programmatically... but I'd like to use an optimized query instead.  I'm hoping someone can suggest an alternative.
0
 
Ray PaseurCommented:
In that case mwvisa1 can probably help better than I can. I tend to think in PHP terms first and MySQL terms later.  Post the information requested in ID36897013 and that will help.  Best of luck, ~Ray
0
 
Aaron TomoskySD-WAN SimplifiedCommented:
Agree wIth mwvisa1. Group by is the way to go. Mssql wouldn't even let you use a group by without explicitly telling it what to do with the other column. Mysql let's you but like you said it arbitrarily picks values.

I suggest using group by filename an wrap all the other columns in an aggregate function (max, min, average, etc...)

If that doesn't work, you can also do sub queries butthat gets messy fast. Let us know the criteria for the other columns and we can help you write the statement.
0
 
Kevin CrossChief Technology OfficerCommented:
Reading this again, I will expand my thought to ask if FileID is auto-incrementing, I.e., isn't highest FileID per filename the latest version? If it is, then the derived table can be a simple aggregate of MAX(FileID) ... GROUP BY FileName and then in the other query JOIN on both columns, then show the others.
0
 
ahijabAuthor Commented:
Thank you for your feedback and responses:

- FileID is auto-incrementing, however I do not want to depend on the primary key value to determine if it is the most recent.   In various circumstances, the ordering may be jumbled.

- Similarly, I do not want to depend exclusively on VersionDate to determine the file version.  It's possible our code will branch/fork, and v1.5 will be worked on concurrently as v1.4.  That's why VersionDate should be the LAST item of the ORDER BY clause.

- It is possible to give an example query based on the table details I posed above,m wvisa?  I greatly appreciate your responses.


Table:  manifest_file

FileID (PK, int)
FirmID (FK, int)
FileName (varchar[255])
FileCRC32 (varchar[40])
VersionMajor (int)
VersionMinor (int)
VersionDate (DateTime)
0
 
Kevin CrossChief Technology OfficerCommented:
Okay, I got you. I will try to work up an example with ranking approach. For now, for the GROUP BY approach, you can see if you can build a combination numeric value that represents VersionMajor, VersionMinor, and VersionDate in a manner that preserves their impact to ordering and guaranteed to not get collisions say if I have 1.14 and 2.12...you wouldn't want the 1+14 to trump 2+12, so you would multiple out something like:
VersionMajor * 10^13 + VersionMinor * 10^8 + VersionDate = 10001420111001 (for example)

This works if your VersionMinor/VersionMajor are not more than 5 digits long. The length of a BIGINT is 19 digits if I am not mistaken, so you can use that as data type or NUMERIC(18,0).

The query would look something like this:

SELECT f.*
FROM manifest_file f
JOIN (
   SELECT FileName
        , MAX({formula you go with}) AS FileVersion
   FROM manifest_file
   GROUP BY FileName
) fMax ON fMax.FileName = f.FileName
  AND fMax.FileVersion = {formula you go with (using f aliased table columns)}
;

Open in new window


Hopefully that makes sense. Maybe there is a simpler way by nesting further, but trying to not make this a maintenance mess.
0
 
Kevin CrossChief Technology OfficerCommented:
Note on above, you have to convert the date to YYYYMMDD format then to INT.
0
 
Kevin CrossChief Technology OfficerCommented:
Okay, here is the tested ranking code. You can see the explanation and another sample of sales by salesperson in my article. It is equivalent to the ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) functionality in other database platforms where you are partitioning by FileName and ordering by the version information descending.
SELECT FileID, FileName, FileCRC32
     , VersionMajor, VersionMinor, VersionDate
FROM 
(
   SELECT f.*
        , @rownum := 
             CASE f.FileName 
                WHEN @fileName THEN @rownum 
                ELSE 0 
             END + 1 AS rank
        , @filename := f.FileName
   FROM manifest_file f, 
   (
      SELECT @rownum := 0, @filename := NULL
   ) r
   WHERE f.FirmID IN (0, 1)
   ORDER BY FileName
          , VersionMajor DESC, VersionMinor DESC, VersionDate DESC
) derived
WHERE rank = 1
;

Open in new window

0
 
ahijabAuthor Commented:
Thanks for your response.

I have decided the best practice would be just to add another column to the table representing "VersionString" .... which would represent something like: VersionMajor + "." + VersionMinor + "." + Date.Year....etc...

That will allow me to greatly simplify my queries.
0
 
ahijabAuthor Commented:
I didn't see your previous post.  Yes, that worked perfectly.  Thank you!
0
 
Kevin CrossChief Technology OfficerCommented:
I am glad that helped.
Best regards and happy coding,

Kevin
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.