Solved

MySQL - DISTINCT on single column

Posted on 2011-10-01
16
669 Views
Last Modified: 2012-05-12


Hello,

I am having a problem applying a DISTINCT attribute to a single column.  Here is my scenarion.

We have a MySQL table used to manage the published files for a given software application

Table:  manifest_file

FileID
FileName
FileCRC32
VersionMajor
VersionMinor
VersionDate
FirmID


FirmID represents who the file is published to.  0 means everyone.


I need to return the FileID for the files with the highest Version numbers & dates.  


SELECT DISTINCT FileName, FileID
FROM manifest_file
WHERE Firmid=1 OR FirmID=0
ORDER BY VersionMajor DESC, VersionMinorDESC, VersionDateDESC

Unfortunately, this applies the DISTINCT attribute to both columns.


Using a GROUP BY returns the right number of results, but the row data is arbitrary (so the FileID and other fields will not be correct)


SELECT FileID, FileName, FileCr32, VersionMajor, VersionMinor, VersionDate
FROM manifest_file
WHERE FirmID=1 OR FirmID=0
GROUP BY FileName
ORDER BY VersionMajor DESC, VersionMinor DESC, VersionDate DESC




Please help me.  I, essentially, just need the Distinct to be applied to FileName but the ORDER BY statement must be honored in order to assure I get the correct results.

Thank you in advance
0
Comment
Question by:ahijab
  • 7
  • 5
  • 3
  • +1
16 Comments
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 36896964
Have you tried SELECT FileID, DISTINCT FileName ...?

It seems to me that this should be two tables instead of one.  A junction table could be used to create the joins you want.
0
 

Author Comment

by:ahijab
ID: 36896998
Yes, I was hoping for the same - unfortunately, that is not acceptable syntax in modern MySQL.
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36897013
Yes, when you use GROUP BY, you have to tell SQL which row to pull. Most systems generate an error, but MySQL simply assigns a FIRST/LAST type aggregate by just picking the value to display. Why you need is to either rank the data or use a JOIN to a derived table getting the max version per FileID. For the latter, you should really check if you need all three, i.e., is VersionDate enough to identify the latest VersionMajor and VersionMinor combination? Normally, you release newer versions with later dates, right?

For the ranking, you can see my ranking article for explanation.

If you need help with examples, please post some sample data along with table create script and specify the expected results, i.e., confirm which field is truly needed to find the right row for a given FileID.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 36897016
Crap.  Seems like MySQL should be smart enough to handle that.

Here is what I would do, given the current single-table configuration.  Select all the rows with a thoughtfully designed ORDER BY clause.  Iterate over the selections with the while() iterator.  With each row, add an element to an associative array.  The key of the array will be the FileName, and the value of the array will be the entire row.  As FileName strings match the array keys, the corresponding row of the array will be overwritten and in the end you will have the last element of the results set in the associative array.

Does that make sense for your needs?
0
 

Author Comment

by:ahijab
ID: 36897043
Sure, I know how I could solve my solution programmatically... but I'd like to use an optimized query instead.  I'm hoping someone can suggest an alternative.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 36897067
In that case mwvisa1 can probably help better than I can. I tend to think in PHP terms first and MySQL terms later.  Post the information requested in ID36897013 and that will help.  Best of luck, ~Ray
0
 
LVL 38

Expert Comment

by:Aaron Tomosky
ID: 36897483
Agree wIth mwvisa1. Group by is the way to go. Mssql wouldn't even let you use a group by without explicitly telling it what to do with the other column. Mysql let's you but like you said it arbitrarily picks values.

I suggest using group by filename an wrap all the other columns in an aggregate function (max, min, average, etc...)

If that doesn't work, you can also do sub queries butthat gets messy fast. Let us know the criteria for the other columns and we can help you write the statement.
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36897669
Reading this again, I will expand my thought to ask if FileID is auto-incrementing, I.e., isn't highest FileID per filename the latest version? If it is, then the derived table can be a simple aggregate of MAX(FileID) ... GROUP BY FileName and then in the other query JOIN on both columns, then show the others.
0
Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 

Author Comment

by:ahijab
ID: 36897761
Thank you for your feedback and responses:

- FileID is auto-incrementing, however I do not want to depend on the primary key value to determine if it is the most recent.   In various circumstances, the ordering may be jumbled.

- Similarly, I do not want to depend exclusively on VersionDate to determine the file version.  It's possible our code will branch/fork, and v1.5 will be worked on concurrently as v1.4.  That's why VersionDate should be the LAST item of the ORDER BY clause.

- It is possible to give an example query based on the table details I posed above,m wvisa?  I greatly appreciate your responses.


Table:  manifest_file

FileID (PK, int)
FirmID (FK, int)
FileName (varchar[255])
FileCRC32 (varchar[40])
VersionMajor (int)
VersionMinor (int)
VersionDate (DateTime)
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36897947
Okay, I got you. I will try to work up an example with ranking approach. For now, for the GROUP BY approach, you can see if you can build a combination numeric value that represents VersionMajor, VersionMinor, and VersionDate in a manner that preserves their impact to ordering and guaranteed to not get collisions say if I have 1.14 and 2.12...you wouldn't want the 1+14 to trump 2+12, so you would multiple out something like:
VersionMajor * 10^13 + VersionMinor * 10^8 + VersionDate = 10001420111001 (for example)

This works if your VersionMinor/VersionMajor are not more than 5 digits long. The length of a BIGINT is 19 digits if I am not mistaken, so you can use that as data type or NUMERIC(18,0).

The query would look something like this:

SELECT f.*
FROM manifest_file f
JOIN (
   SELECT FileName
        , MAX({formula you go with}) AS FileVersion
   FROM manifest_file
   GROUP BY FileName
) fMax ON fMax.FileName = f.FileName
  AND fMax.FileVersion = {formula you go with (using f aliased table columns)}
;

Open in new window


Hopefully that makes sense. Maybe there is a simpler way by nesting further, but trying to not make this a maintenance mess.
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36897956
Note on above, you have to convert the date to YYYYMMDD format then to INT.
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36898036
Okay, here is the tested ranking code. You can see the explanation and another sample of sales by salesperson in my article. It is equivalent to the ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) functionality in other database platforms where you are partitioning by FileName and ordering by the version information descending.
SELECT FileID, FileName, FileCRC32
     , VersionMajor, VersionMinor, VersionDate
FROM 
(
   SELECT f.*
        , @rownum := 
             CASE f.FileName 
                WHEN @fileName THEN @rownum 
                ELSE 0 
             END + 1 AS rank
        , @filename := f.FileName
   FROM manifest_file f, 
   (
      SELECT @rownum := 0, @filename := NULL
   ) r
   WHERE f.FirmID IN (0, 1)
   ORDER BY FileName
          , VersionMajor DESC, VersionMinor DESC, VersionDate DESC
) derived
WHERE rank = 1
;

Open in new window

0
 

Author Comment

by:ahijab
ID: 36898096
Thanks for your response.

I have decided the best practice would be just to add another column to the table representing "VersionString" .... which would represent something like: VersionMajor + "." + VersionMinor + "." + Date.Year....etc...

That will allow me to greatly simplify my queries.
0
 
LVL 59

Accepted Solution

by:
Kevin Cross earned 500 total points
ID: 36898114
Yes. I agree. However, just be smart in the way this is constructed. Look at my point on formula I showed. If you have '9.16.20111001' and '10.4.20111001', take a look at what happens:

SELECT VersionString
FROM (
   SELECT '9.16.20111001' VersionString UNION
   SELECT '10.4.20111001'
) derived
ORDER BY VersionString DESC
;

Open in new window


If you choose to go with VersionString and need it in that format, then see if this helps you:
http://www.experts-exchange.com/Programming/Languages/SQL_Syntax/A_252-How-to-Mimic-the-T-SQL-IsNumeric-Function-in-MySQL.html
It is an article on creating the IsNumeric() and IsInteger() functions, but I show an example that involved sorting on both sides of a delimiter like '.'. You can actually pull the date apart then just treat the rest as a decimal number then compare.

If you can avoid it, though, that is even better, especially since you are going this route to make your queries simpler. :)

Even though you are choosing to simplify, did you try http:#36898036 ?
If only because I spent the time typing it for you, it would be great for you to see if working. ;) If you run the inner query by itself, you can see what is going on under the covers and helps explain the rank = 1 on the outside query.
0
 

Author Comment

by:ahijab
ID: 36898233
I didn't see your previous post.  Yes, that worked perfectly.  Thank you!
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36898253
I am glad that helped.
Best regards and happy coding,

Kevin
0

Featured Post

Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

Join & Write a Comment

Creating and Managing Databases with phpMyAdmin in cPanel.
Many companies are looking to get out of the datacenter business and to services like Microsoft Azure to provide Infrastructure as a Service (IaaS) solutions for legacy client server workloads, rather than continuing to make capital investments in h…
Video by: Steve
Using examples as well as descriptions, step through each of the common simple join types, explaining differences in syntax, differences in expected outputs and showing how the queries run along with the actual outputs based upon a simple set of dem…
The viewer will learn how to count occurrences of each item in an array.

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now