Select Distinct is returning duplicates ...

PortletPaulEE Topic Advisor
CERTIFIED EXPERT
More years at various grindstones than I care to remember. Some interesting grindstones though: Shipbuilding, Aerospace ..... IT
If you find yourself saying “I have used SELECT DISTINCT but I’m still getting duplicates”, then I'm sorry to say you are using the wrong SQL technique, because SELECT DISTINCT does only one thing: it produces whole rows that are unique. If the results you are getting still have “duplicates”, then you are expecting something that SELECT DISTINCT does not do. In truth, “select distinct” is a rather dull animal with minimal decision-making powers.
 
The most common solution to this “I still got duplicates” problem is to use GROUP BY instead, but please read on.
The first thing to recognize about SELECT DISTINCT is that, overall, it’s a bit boring. It isn’t a function, so if you are using a query like this:
 
SELECT DISTINCT (column1), column2, ... column40
You will be disappointed to learn that those parentheses around column1 do NOT mean “make column1 unique, then by magic decide how to simplify the other columns”.

And: with or without the parentheses, SELECT DISTINCT does not know that you really want it to concentrate on column1 (or on column1 and column2) while ignoring the other columns.
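To see this concretely, here is a minimal sketch (using Python's sqlite3 with an invented two-column table) showing that the parentheses around column1 change nothing: DISTINCT applies to the whole row either way.

```python
import sqlite3

# Invented table and data, purely for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE atable (column1 TEXT, column2 TEXT)")
con.executemany("INSERT INTO atable VALUES (?, ?)",
                [("a", "x"), ("a", "y"), ("b", "x")])

with_parens = con.execute(
    "SELECT DISTINCT (column1), column2 FROM atable ORDER BY 1, 2").fetchall()
without_parens = con.execute(
    "SELECT DISTINCT column1, column2 FROM atable ORDER BY 1, 2").fetchall()

# Three rows either way: the parentheses do not "make column1 unique".
print(with_parens)
print(with_parens == without_parens)  # True
```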

SELECT DISTINCT is a “row operator”, not a function, and not magic. It evaluates ALL of the columns that you have listed and removes rows that have exactly the same values in every column, so that each row returned is different from every other row returned. That might mean, for example, that a value containing an exclamation mark in column27 and a value without one causes 2 rows to be returned even if every other column holds the same values.
col1 col2 col3 col4 col5 col6 col7 col8 col9 cl10 cl11 cl12 ... col27
aaaa bbbb cccc dddd eeee ffff gggg hhhh iiii jjjj kkkk llll ... blah!  << I'm unique!
aaaa bbbb cccc dddd eeee ffff gggg hhhh iiii jjjj kkkk llll ... blah   << me too!
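A tiny runnable version of the example above (Python's sqlite3, with a hypothetical two-column table standing in for the 27 columns): the rows agree everywhere except the last column, so SELECT DISTINCT keeps both.

```python
import sqlite3

# Invented table: col27 differs only by an exclamation mark.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE atable (col1 TEXT, col27 TEXT)")
con.executemany("INSERT INTO atable VALUES (?, ?)",
                [("aaaa", "blah!"), ("aaaa", "blah")])

rows = con.execute("SELECT DISTINCT col1, col27 FROM atable").fetchall()
print(len(rows))  # 2 -- the exclamation mark makes the rows different
```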

A small note: I personally would not describe this as “having duplicates”; I prefer to describe it as “unwanted repetition”.

 

So, what to do?

First step:  Concentrate on the columns that MUST be unique, e.g.
 
select
  column1 --<< MUST-be-unique column(s) here
from atable
group by
  column1 --<< MUST-be-unique column(s) here

Second step: Decide what can be done with the other columns,
e.g. accept the maximum values of each
 
select
  column1
, MAX(column2), MAX(column3), ... MAX(column27) ...
from atable
group by
  column1 --<< MUST-be-unique column(s) here
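Here is a runnable sketch of both steps (Python's sqlite3; the table and data are invented): column1 drives uniqueness, and MAX() decides what to keep from the other columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE atable (column1 TEXT, column2 INTEGER, column3 TEXT)")
con.executemany("INSERT INTO atable VALUES (?, ?, ?)",
                [("a", 1, "old"), ("a", 2, "new"), ("b", 5, "only")])

rows = con.execute("""
    SELECT column1, MAX(column2), MAX(column3)
    FROM atable
    GROUP BY column1
    ORDER BY column1
""").fetchall()
# Note: each MAX() is evaluated per column, so the 'a' row mixes values
# taken from different source rows: MAX(column2)=2, MAX(column3)='old'.
print(rows)  # [('a', 2, 'old'), ('b', 5, 'only')]
```

The comment above highlights a subtlety worth remembering: a GROUP BY row built from several aggregates need not correspond to any single source row.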

Note: A query using GROUP BY is not nearly as boring as poor old “SELECT DISTINCT”, and GROUP BY does offer quite a number of interesting options such as MAX(), MIN(), AVG(), etc.

BUT, I hear you say, "I don’t want MAX() values or MIN() values and suchlike
– I need “the most recent record” (or similar)."

This is just a little more complex to satisfy, but it can be done.
Let’s look at “most recent record”; here are 2 basic options:

1. Using row_number() [if that function is supported in your database], e.g.
 
select *
from ( select *
            , row_number() over (partition by column1 order by column2 DESC) as rn
       from atable
     ) as sq
where rn = 1
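A runnable version of option 1 (Python's sqlite3; SQLite supports window functions from version 3.25 onward, and the table, data, and use of column2 as a date are all invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE atable (column1 TEXT, column2 TEXT)")  # column2 = a date
con.executemany("INSERT INTO atable VALUES (?, ?)",
                [("a", "2023-01-01"), ("a", "2023-06-30"), ("b", "2023-03-15")])

rows = con.execute("""
    SELECT column1, column2
    FROM ( SELECT *,
                  row_number() OVER (PARTITION BY column1
                                     ORDER BY column2 DESC) AS rn
           FROM atable
         ) AS sq
    WHERE rn = 1
    ORDER BY column1
""").fetchall()
# One row per column1, each carrying its most recent column2 value.
print(rows)  # [('a', '2023-06-30'), ('b', '2023-03-15')]
```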

Note: row_number() restarts at 1 for each “partition by” group, which will be your “must be unique” column(s), and which row gets 1 is determined by the order you specify; e.g. a date/time column in descending order gives 1 to “the most recent” row.
IF you have the ability to use row_number(), I recommend using it.
 
2. Using group by and a subquery, e.g.
 
select atable.*
from atable
inner join ( select column1, max(column2) as column2
             from atable
             group by column1
           ) as sq
  on atable.column1 = sq.column1
 and atable.column2 = sq.column2
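And a runnable version of option 2 (Python's sqlite3 again, with an invented table and an extra "note" column to show that the full row comes back), for databases without row_number():

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE atable (column1 TEXT, column2 TEXT, note TEXT)")
con.executemany("INSERT INTO atable VALUES (?, ?, ?)",
                [("a", "2023-01-01", "older"),
                 ("a", "2023-06-30", "newest"),
                 ("b", "2023-03-15", "only")])

rows = con.execute("""
    SELECT atable.column1, atable.column2, atable.note
    FROM atable
    INNER JOIN ( SELECT column1, MAX(column2) AS column2
                 FROM atable GROUP BY column1 ) AS sq
      ON atable.column1 = sq.column1 AND atable.column2 = sq.column2
    ORDER BY atable.column1
""").fetchall()
# Unlike the MAX()-everything approach, each result IS a real source row.
print(rows)  # [('a', '2023-06-30', 'newest'), ('b', '2023-03-15', 'only')]
```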

Note: this is NOT guaranteed to produce unique rows, as “it will depend” on your data (e.g. there may be more than one row where column2 equals max(column2)). Despite this, it can effectively be used on date/time columns to achieve “most recent” selection.

Find More help here:

For a more detailed & in depth article on the query options briefly described here please refer to DISTINCT and GROUP BY... and why does it not work for my query? (by angelIII).


And More opinions here:
 

select distinct ... group by;   NO! No, no, no, please NO!


I should point out that using select distinct isn’t always bad, and it can of course be used to good advantage. For example, “select distinct code from lookup_table;” could be very efficient and neat. Of course, the exact same result could be achieved by a marginally longer query: “select code from lookup_table group by code;”. So what’s my point here?
 
1. select distinct code from lookup_table;
2. select code from lookup_table group by code;

?. select distinct code from lookup_table group by code;
As queries 1. and 2. are directly equivalent,
why oh why do some folks toss a "select distinct" on top of a "group by"??

Please don’t add select distinct to a group by.

It is just not necessary and does nothing to improve your query.
It's one of my “pet hates” (I do have others, but they are not relevant here).
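For the doubters, a quick sketch (Python's sqlite3, invented lookup_table) running all three queries side by side: DISTINCT on top of GROUP BY changes nothing.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lookup_table (code TEXT)")
con.executemany("INSERT INTO lookup_table VALUES (?)",
                [("A",), ("A",), ("B",), ("C",), ("C",)])

q1 = con.execute(
    "SELECT DISTINCT code FROM lookup_table ORDER BY code").fetchall()
q2 = con.execute(
    "SELECT code FROM lookup_table GROUP BY code ORDER BY code").fetchall()
q3 = con.execute(
    "SELECT DISTINCT code FROM lookup_table GROUP BY code ORDER BY code").fetchall()

# All three return exactly A, B, C -- the DISTINCT in q3 is dead weight.
print(q1 == q2 == q3)  # True
```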

 

The “Hail Mary Distinct”    NO! No, no, no, please NO! (cubed)


So far I’ve been rather gentle, but now I’ll get tougher. There is, I’m afraid, some use of SELECT DISTINCT that is distinctly bad, which I will call the “Hail Mary Distinct”. Often this is also associated with poor performance.

Scene 1, Act 1.

You may have just completed quite a lengthy query: maybe a dozen tables/joins, several case expressions, a few calculations, maybe a user-defined function or two. It's just beautiful to behold!

Run.

Damn, it's got "unwanted repetition" in the results. The project manager is breathing down your neck, there's that next interesting thing to do; what now? Simple! Toss in a single word, DISTINCT, and my problems disappear.

Run.

It's been running for over an hour; can I kill it?
The truth is that tossing in that single word DISTINCT is not a solution to "unwanted repetition", and its mere presence is an indicator that you need to revisit the joins being used. Somewhere in there you are multiplying the rows in an undesirable manner, and to solve it you have more work in front of you.

A "Hail Mary Distinct" ADDS more effort to the query, it might disguise the symptoms, but it does not solve the cause(s).

So, what now? Is there a solution?

Well, yes, there most probably is, but it relies on you finding which joins are the culprits. This may mean challenging each one, one at a time; when you start to see "unwanted repetition" appear, you know which join(s) are the cause.

Then, how you deal with each cause may have a slightly different solution - but once again it could employ a group by, but this time as a nested subquery that will replace a joined table.

e.g. instead of this
 
LEFT JOIN MAASSIVE ON MAIN_TBL.X_FK = MAASSIVE.CAT

-- it might become:

LEFT JOIN ( select cat, count(something) as needed_this_info
            from maassive
            where maassive.x = 1
            group by cat
          ) TINY ON MAIN_TBL.X_FK = TINY.CAT
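A runnable sketch of this fix (Python's sqlite3; the table names main_tbl/maassive follow the example above, but all columns and data are invented): the naive join multiplies rows, the pre-aggregated join does not.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE main_tbl (id INTEGER, x_fk TEXT)")
con.execute("CREATE TABLE maassive (cat TEXT, x INTEGER, something INTEGER)")
con.executemany("INSERT INTO main_tbl VALUES (?, ?)", [(1, "c1"), (2, "c2")])
con.executemany("INSERT INTO maassive VALUES (?, ?, ?)",
                [("c1", 1, 10), ("c1", 1, 20), ("c2", 1, 30)])

# Naive join: main_tbl row 1 is repeated once per matching maassive row.
naive = con.execute("""
    SELECT main_tbl.id FROM main_tbl
    LEFT JOIN maassive ON main_tbl.x_fk = maassive.cat
""").fetchall()
print(len(naive))  # 3 rows -- "unwanted repetition"

# Pre-aggregated join: TINY has one row per cat, so one row per main_tbl row.
fixed = con.execute("""
    SELECT main_tbl.id, tiny.needed_this_info FROM main_tbl
    LEFT JOIN ( SELECT cat, COUNT(something) AS needed_this_info
                FROM maassive WHERE maassive.x = 1
                GROUP BY cat ) AS tiny
      ON main_tbl.x_fk = tiny.cat
""").fetchall()
print(len(fixed))  # 2 rows, plus the count each main row actually needed
```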



I cannot think of a good generic way to present this as code, but when confronted with someone else's "Hail Mary Distinct" query (and I've more than seen my share!), and if I'm unfamiliar with the data model, I concentrate on the FROM & WHERE clauses, often replacing the whole select clause with "select count(*)".

I then "start small" by commenting out any unwanted where clause details, then run these cut-down queries, adding one table at a time and recording the record counts. This gives a pretty clear idea of where the data volume is being generated and helps focus effort where it is most likely to produce a benefit.
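The "start small" loop can be sketched roughly like this (Python's sqlite3; the orders/order_lines tables and join conditions are invented purely to show the idea of counting rows as each join is added):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, cust_id INTEGER)")
con.execute("CREATE TABLE order_lines (order_id INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20)])
con.executemany("INSERT INTO order_lines VALUES (?)", [(1,), (1,), (1,), (2,)])

# Progressively longer FROM clauses, one join added at a time.
from_clauses = [
    "orders",
    "orders JOIN order_lines ON order_lines.order_id = orders.id",
]
counts = []
for f in from_clauses:
    n = con.execute(f"SELECT COUNT(*) FROM {f}").fetchone()[0]
    counts.append(n)
    print(f"{n:>5}  {f}")
# The jump from 2 to 4 rows points at order_lines as the row multiplier.
```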

Of course, more sophisticated tools are also available (e.g. the execution plan), but I like the feedback that an iterative approach provides; in particular, the immediate feedback on performance.

Well, that's it. Something of a rant, and not a lot of technical detail, but I feel better. Thanks for reading this far.


Other opinions about Distinct:
Some Simple SQL Rules to Live By (DISTINCT is *usually* bad)
Why I Hate DISTINCT

Comments (2)

Guy Hengel [angelIII / a3], Billing Engineer

Commented:
thanks for the link/reference to my article :)
It's really helpful.
It worked with the second option (2. Using group by and a subquery).
Thanks a lot.
