Solved

Selecting TOP 1 of each record

Posted on 2009-07-06
22
515 Views
Last Modified: 2012-05-07
I'm working with the following data where a simple GROUP BY statement will not work.   The only thing they have in common 100% of the time is a column named "OrderNo".

For Example:

Table 1
============================
CaseName | DateSubmitted | OrderNo | SubjectName

I.E.
============================
"Northwind vs Test" | Jun 1, 2009 | 1000 | John Doe
"Northwind vs. Testing" | Jun 1, 2009 | 1000 | John Doe
"Northwind vs Test" | Jun 2, 2009 | 1000 | Jonathan Doe
"McDonalds vs Burger King" | Aug 1, 2008 | 2000 | Tim Smith
"Mcdonalds vs Burger King" | Aug 2, 2008 | 2000 | Tim Smith

What I would like the query to return is:
"Northwind vs Test" | Jun 1, 2009 | 1000 | John Doe
"McDonalds vs Burger King" | Aug 1, 2008 | 2000 | Tim Smith

As you can tell, the system is like this because of inconsistencies and human error but the rows all say the same thing in a different way.   I don't necessarily care in which order the TOP 1 is selected but I've tried a bunch of stuff and I want to see what ideas you guys have on the issue.

Thanks

0
Comment
Question by:LukeJD
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 4
  • 3
  • +5
22 Comments
 
LVL 17

Expert Comment

by:pssandhu
ID: 24789528
What if you do something like this since you say that then data is basically the same:
Select OrderNo, MAX(DateSubmitted), MAX(SubjectName), MAX(CaseName)
From Table
Group by OrderNo
0
 
LVL 40

Expert Comment

by:mrjoltcola
ID: 24789535
Why will GROUP BY not work?

I would do it with a GROUP BY, unless you can clarify why it wont work.


select casename, min(datesubmitted), orderno, subjectname
  from table1
group by casename, orderno, subjectname
;

Open in new window

0
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 24789536
the problem "test" vs "testing" is that it is not a 100% match (leaving case sensitively apart). so "=" won't work.
using "LIKE" won't help neither, as it's impossible to use a % somewhere to make all th eversions match.
SOUNDEX can help. but it might give you false matches, also.

the real solution will be assisted manual data cleaning.
0
Space-Age Communications Transitions to DevOps

ViaSat, a global provider of satellite and wireless communications, securely connects businesses, governments, and organizations to the Internet. Learn how ViaSat’s Network Solutions Engineer, drove the transition from a traditional network support to a DevOps-centric model.

 
LVL 40

Expert Comment

by:mrjoltcola
ID: 24789565
Oh, I see now. I did not read close enough.

So I agree with angelIII. You could do some rudimentary replacements in the query such as replacing "vs." with "vs" and grouping by a case insensitive version of the column, but that would still leave plural/singular problems.

Yes, data should be cleaned. This is also a case where the table should be normalized so the casename is not in that table itself, but instead use a foreign key.
0
 
LVL 12

Expert Comment

by:williamcampbell
ID: 24789647
Select a random row

SELECT TOP 1 column FROM table
ORDER BY NEWID()


0
 

Author Comment

by:LukeJD
ID: 24789845
I was thinking of a self join, something along these lines:

SELECT a.* FROM (SELECT DISTINCT(o.OrderNo) FROM orders o) b
JOIN orders a ON a.OrderNo = b.OrderNo

But that seems to still create the duplicates.    Is there a way of drawing TOP 1 information based off of a column from another joined table?
0
 

Author Comment

by:LukeJD
ID: 24789882
Also, I wanted to add that there are over 100,000 rows in the table and doing a manual clean up will be labor intensive.
0
 
LVL 41

Accepted Solution

by:
ralmada earned 250 total points
ID: 24791266
What about this? (SQL 2005+)
select * from (
	select *, row_number() over (partition by OrderNo order by OrderNo) as rn
	from Table1
) a
where rn = 1

Open in new window

0
 
LVL 15

Assisted Solution

by:rob_farley
rob_farley earned 250 total points
ID: 24792899
Ok... I'm going to pitch in now.

The ROW_NUMBER() method suggested is great. But for another alternative, that works really well if you're actually looking at two different tables, how about:


SELECT t_top.*
FROM
(
SELECT DISTINCT Name
FROM TableName
) t
CROSS APPLY
(
SELECT TOP (1) *
FROM TableName t2
WHERE t2.Name = t.Name
ORDER BY SomeCriteria
) t_top


If you imagine something which lists the most expensive Product in each ProductSubcategory (in AdventureWorks)

SELECT s.Name, p.*
FROM Production.ProductSubcategory s
  CROSS APPLY
  (SELECT TOP (1) *
  FROM Production.Product ps
  WHERE ps.ProductSubcategoryID = s.ProductSubcategoryID
  ORDER BY ListPrice DESC) p

CROSS APPLY is like INNER JOIN where the second table is actually correlated to the first. You can use OUTER APPLY if you don't want to eliminate rows in the first table that don't produce any rows in the second.

Rob
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 24793500
This should help..

select distinct A.CaseName, A.DateSubmitted, A..OrderNo , A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.OrderNo = B. OrderNo and A.DateSubmitted = B.DateSubmitted
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 24793507
Ooops sorry a bad copy/paste

select distinct A.CaseName, A.DateSubmitted, A..OrderNo , A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.OrderNo = B.OrderNo and A.DateSubmitted = B.DateSubmitted
0
 
LVL 41

Expert Comment

by:ralmada
ID: 24793990
@Racimo, I was tempted to follow the same approach, however I feel your query will fail to eliminate the first two rows, which are duplicates, since they both have same OrderNo and DateSubmitted. That's why I've suggested the ROW_NUMBER() approach in my comment 24791266
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 24794186
@ralmada

<<, however I feel your query will fail to eliminate the first two rows, which are duplicates>>

 The main problem is the lack of unique identifer for tuples.  As angelIII mentionned, there is a need to do data cleaning first by removing dupplicates, else running any extraction will necessarily make one *assume* what the correct answer might be.  

For instance, by using ROWNUMBER(), one has to assume that the first row that appears *in ROWNUMBER sequence* is necessarily the right row to select.  But since the questionner only mentionned OrderNo as a natural identifier, I did not feel it was necessariy to bring additional assumptions to the equation.

I guess the safest way to do that would be to run a dupplicate remover based on CaseName and SubjectName candidate key than run the followin based on the natural key.

something like...

select distinct A.CaseName, A.DateSubmitted, A..OrderNo , A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.CaseName = B.CaseName and A.SubjectName = B.SubjectName andA.SubjectName = B.SubjectName and A.DateSubmitted = B.DateSubmitted
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 24794198
Ooops I mean

select distinct A.CaseName, A.DateSubmitted, A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.CaseName = B.CaseName and A.SubjectName = B.SubjectName and A.DateSubmitted = B.DateSubmitted
0
 
LVL 41

Expert Comment

by:ralmada
ID: 25042600
I still think my comment 24791266 could help the asker.
0
 
LVL 12

Expert Comment

by:williamcampbell
ID: 25043746
Seconded My Comment 24789647 is also useful
0
 
LVL 41

Expert Comment

by:ralmada
ID: 25047607
My recommendation is split points between comment 24791266 and 24792899.
0
 
LVL 12

Expert Comment

by:williamcampbell
ID: 25057084
Thats fine
0

Featured Post

Best Practices: Disaster Recovery Testing

Besides backup, any IT division should have a disaster recovery plan. You will find a few tips below relating to the development of such a plan and to what issues one should pay special attention in the course of backup planning.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Never store passwords in plain text or just their hash: it seems a no-brainier, but there are still plenty of people doing that. I present the why and how on this subject, offering my own real life solution that you can implement right away, bringin…
I have a large data set and a SSIS package. How can I load this file in multi threading?
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

739 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question