DISTINCT and GROUP BY... and why does it not work for my query?

0. Introduction

Often, when you run a query with joins, the results contain "duplicates", and often those duplicates can be "eliminated" by using DISTINCT.
Using DISTINCT is simple: just add it after the SELECT keyword, and you don't get duplicates. Unless, that is, your idea of what counts as a duplicate differs from what SQL actually does.
DISTINCT will not return two rows with the same values. To be clear: it compares all the columns returned by the SELECT it is applied to, not the full table/join. The usual problem is that users "expect" SQL to apply the DISTINCT to only one (or more) key fields, for example the first column returned in the select.
Usually, GROUP BY can solve this, but it might not be the most efficient method, or it might fail to meet other requirements of the query.
The better way to solve the problem is to step back and look at both the data and the requested output; once these are clarified, they can be translated into a SQL query "easily".
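To make the DISTINCT behaviour concrete, here is a minimal sketch. The table demo_rows, its columns and its three rows are hypothetical and exist only for this illustration; they are not the article's tables.

```sql
-- Hypothetical demo table, not part of the article's schema.
CREATE TABLE demo_rows (employee_pk INT, note VARCHAR(20));

INSERT INTO demo_rows (employee_pk, note) VALUES (1, 'morning');
INSERT INTO demo_rows (employee_pk, note) VALUES (1, 'morning');
INSERT INTO demo_rows (employee_pk, note) VALUES (1, 'evening');

-- DISTINCT compares every selected column, so this returns 2 rows:
-- (1, 'morning') and (1, 'evening'). Only the exact copy is removed.
SELECT DISTINCT employee_pk, note
  FROM demo_rows;

-- Only when the key is the sole selected column do we get 1 row per key.
SELECT DISTINCT employee_pk
  FROM demo_rows;
```

The first query is the case that surprises people: employee 1 still shows up twice, because the two remaining rows differ in the second column.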

1. Tables and Data

We usually have 2 tables in the scenario:

one master/parent table,

one dependent table, linked through a 1:n relationship, holding information such as history, traffic, accounting, etc.

Notes:

the dependent table is usually referred to as the child table.

the child table should have, like all tables, its own primary key.

In a real-world system, the field employee_pk would be indexed to ensure optimal performance.
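As a rough sketch of such a pair of tables: the child table name and its columns (pk, employee_pk, work_start_dt, work_end_dt) are taken from the query in section 3, while the parent table name tbl_Employee, its name column, the index name and all sample rows are assumptions made for illustration only. DATETIME is the SQL Server/MySQL type name; other engines may need TIMESTAMP instead.

```sql
-- Parent (master) table. Name and columns are assumed for this sketch.
CREATE TABLE tbl_Employee (
    pk   INT         NOT NULL PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);

-- Child table with its own primary key and a 1:n link to the parent.
CREATE TABLE tbl_Employee_WorkRecords (
    pk            INT      NOT NULL PRIMARY KEY,
    employee_pk   INT      NOT NULL REFERENCES tbl_Employee (pk),
    work_start_dt DATETIME NOT NULL,
    work_end_dt   DATETIME NOT NULL
);

-- Index on the linking field (hypothetical index name).
CREATE INDEX ix_WorkRecords_employee_pk
    ON tbl_Employee_WorkRecords (employee_pk);

-- Illustrative rows only, not the article's actual data.
INSERT INTO tbl_Employee (pk, name) VALUES (1, 'Smith');
INSERT INTO tbl_Employee (pk, name) VALUES (2, 'Brown');
INSERT INTO tbl_Employee (pk, name) VALUES (3, 'Bond');

INSERT INTO tbl_Employee_WorkRecords (pk, employee_pk, work_start_dt, work_end_dt)
VALUES (1, 1, '2009-05-18T08:00:00', '2009-05-18T17:00:00');
INSERT INTO tbl_Employee_WorkRecords (pk, employee_pk, work_start_dt, work_end_dt)
VALUES (2, 2, '2009-05-19T09:00:00', '2009-05-19T18:00:00');
-- Two records on the same day for one employee.
INSERT INTO tbl_Employee_WorkRecords (pk, employee_pk, work_start_dt, work_end_dt)
VALUES (3, 3, '2009-05-19T08:00:00', '2009-05-19T12:00:00');
INSERT INTO tbl_Employee_WorkRecords (pk, employee_pk, work_start_dt, work_end_dt)
VALUES (4, 3, '2009-05-19T13:00:00', '2009-05-19T17:00:00');
```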
To visualize this, here is the data, queried using SQL Server Management Studio 2005:
Employees:

Work Records:

If you have trouble with dates/times, please refer to this article.

2. What is the exact requirement?

Request: Suppose we want to see the last work day, per employee.
For Smith and Brown, we foresee no major problems, but Bond has 2 records for the last day...
Hence, you have to clarify:

do we just want the last record (date+time), or any record for the last date (ignoring time)?

if all "duplicates" have the same value, is there another column to discriminate them, so we can decide on which one to take?

if there are multiple records to be taken, what is the result we want to have?

Important:
This "problem" has to be solved first. Once the results from the child table are OK, we can then join those results to the parent table (see step 7).

In our example, we could say:

A: give me just the last record, per employee, considering date + time

B: give me all records of the last day, per employee

C: give me the last date, per employee, with the earliest time for the work-start field, but the latest time for the work-end field

As you can see, I have shown the expected result data, and NOT the code yet; and all the results are different.
In the next steps, we will show the SQL to achieve those results in the different database engines.

3. This might be fine... but usually is not

To get this result (Result_0.JPG), you just need to run this query, which works across all databases (and that is the only plus for this syntax):

select employee_pk
     , max(work_start_dt) last_start
     , max(work_end_dt) last_end
     , max(pk) last_pk
  from tbl_Employee_WorkRecords
 group by employee_pk;
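One reason this form is usually not fine: each max() is computed independently, so the values in one output row need not come from the same record. The sketch below shows this with a hypothetical table demo_work_records and two made-up rows; it uses the same column names as the query above, but none of the data is from the article.

```sql
-- Hypothetical demo table with the same column layout as the child table.
CREATE TABLE demo_work_records (
    pk            INT      NOT NULL PRIMARY KEY,
    employee_pk   INT      NOT NULL,
    work_start_dt DATETIME NOT NULL,
    work_end_dt   DATETIME NOT NULL
);

-- The record with pk 2 was entered late: higher pk, but an earlier date.
INSERT INTO demo_work_records (pk, employee_pk, work_start_dt, work_end_dt)
VALUES (1, 3, '2009-05-19T08:00:00', '2009-05-19T12:00:00');
INSERT INTO demo_work_records (pk, employee_pk, work_start_dt, work_end_dt)
VALUES (2, 3, '2009-05-18T13:00:00', '2009-05-18T17:00:00');

-- Returns a single row for employee 3 in which last_start and last_end
-- come from the record with pk 1, while last_pk is 2.
-- No stored record actually has that combination of values.
SELECT employee_pk
     , MAX(work_start_dt) AS last_start
     , MAX(work_end_dt)   AS last_end
     , MAX(pk)            AS last_pk
  FROM demo_work_records
 GROUP BY employee_pk;
```

If last_pk is later used to look up "the last record", it would point at the older of the two entries here.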

