LukeJD
asked on
Selecting TOP 1 of each record
I'm working with the following data where a simple GROUP BY statement will not work. The only thing they have in common 100% of the time is a column named "OrderNo".
For Example:
Table 1
========================== ==
CaseName | DateSubmitted | OrderNo | SubjectName
I.E.
========================== ==
"Northwind vs Test" | Jun 1, 2009 | 1000 | John Doe
"Northwind vs. Testing" | Jun 1, 2009 | 1000 | John Doe
"Northwind vs Test" | Jun 2, 2009 | 1000 | Jonathan Doe
"McDonalds vs Burger King" | Aug 1, 2008 | 2000 | Tim Smith
"Mcdonalds vs Burger King" | Aug 2, 2008 | 2000 | Tim Smith
What I would like the query to return is:
"Northwind vs Test" | Jun 1, 2009 | 1000 | John Doe
"McDonalds vs Burger King" | Aug 1, 2008 | 2000 | Tim Smith
As you can tell, the system is like this because of inconsistencies and human error but the rows all say the same thing in a different way. I don't necessarily care in which order the TOP 1 is selected but I've tried a bunch of stuff and I want to see what ideas you guys have on the issue.
Thanks
For Example:
Table 1
==========================
CaseName | DateSubmitted | OrderNo | SubjectName
I.E.
==========================
"Northwind vs Test" | Jun 1, 2009 | 1000 | John Doe
"Northwind vs. Testing" | Jun 1, 2009 | 1000 | John Doe
"Northwind vs Test" | Jun 2, 2009 | 1000 | Jonathan Doe
"McDonalds vs Burger King" | Aug 1, 2008 | 2000 | Tim Smith
"Mcdonalds vs Burger King" | Aug 2, 2008 | 2000 | Tim Smith
What I would like the query to return is:
"Northwind vs Test" | Jun 1, 2009 | 1000 | John Doe
"McDonalds vs Burger King" | Aug 1, 2008 | 2000 | Tim Smith
As you can tell, the system is like this because of inconsistencies and human error but the rows all say the same thing in a different way. I don't necessarily care in which order the TOP 1 is selected but I've tried a bunch of stuff and I want to see what ideas you guys have on the issue.
Thanks
Why will GROUP BY not work?
I would do it with a GROUP BY, unless you can clarify why it wont work.
I would do it with a GROUP BY, unless you can clarify why it wont work.
select casename, min(datesubmitted), orderno, subjectname
from table1
group by casename, orderno, subjectname
;
the problem "test" vs "testing" is that it is not a 100% match (leaving case sensitively apart). so "=" won't work.
using "LIKE" won't help neither, as it's impossible to use a % somewhere to make all th eversions match.
SOUNDEX can help. but it might give you false matches, also.
the real solution will be assisted manual data cleaning.
using "LIKE" won't help neither, as it's impossible to use a % somewhere to make all th eversions match.
SOUNDEX can help. but it might give you false matches, also.
the real solution will be assisted manual data cleaning.
Oh, I see now. I did not read close enough.
So I agree with angelIII. You could do some rudimentary replacements in the query such as replacing "vs." with "vs" and grouping by a case insensitive version of the column, but that would still leave plural/singular problems.
Yes, data should be cleaned. This is also a case where the table should be normalized so the casename is not in that table itself, but instead use a foreign key.
So I agree with angelIII. You could do some rudimentary replacements in the query such as replacing "vs." with "vs" and grouping by a case insensitive version of the column, but that would still leave plural/singular problems.
Yes, data should be cleaned. This is also a case where the table should be normalized so the casename is not in that table itself, but instead use a foreign key.
Select a random row
SELECT TOP 1 column FROM table
ORDER BY NEWID()
SELECT TOP 1 column FROM table
ORDER BY NEWID()
ASKER
I was thinking of a self join, something along these lines:
SELECT a.* FROM (SELECT DISTINCT(o.OrderNo) FROM orders o) b
JOIN orders a ON a.OrderNo = b.OrderNo
But that seems to still create the duplicates. Is there a way of drawing TOP 1 information based off of a column from another joined table?
SELECT a.* FROM (SELECT DISTINCT(o.OrderNo) FROM orders o) b
JOIN orders a ON a.OrderNo = b.OrderNo
But that seems to still create the duplicates. Is there a way of drawing TOP 1 information based off of a column from another joined table?
ASKER
Also, I wanted to add that there are over 100,000 rows in the table and doing a manual clean up will be labor intensive.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
This should help..
select distinct A.CaseName, A.DateSubmitted, A..OrderNo , A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.OrderNo = B. OrderNo and A.DateSubmitted = B.DateSubmitted
select distinct A.CaseName, A.DateSubmitted, A..OrderNo , A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.OrderNo = B. OrderNo and A.DateSubmitted = B.DateSubmitted
Ooops sorry a bad copy/paste
select distinct A.CaseName, A.DateSubmitted, A..OrderNo , A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.OrderNo = B.OrderNo and A.DateSubmitted = B.DateSubmitted
select distinct A.CaseName, A.DateSubmitted, A..OrderNo , A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.OrderNo = B.OrderNo and A.DateSubmitted = B.DateSubmitted
@Racimo, I was tempted to follow the same approach, however I feel your query will fail to eliminate the first two rows, which are duplicates, since they both have same OrderNo and DateSubmitted. That's why I've suggested the ROW_NUMBER() approach in my comment 24791266
@ralmada
<<, however I feel your query will fail to eliminate the first two rows, which are duplicates>>
The main problem is the lack of unique identifer for tuples. As angelIII mentionned, there is a need to do data cleaning first by removing dupplicates, else running any extraction will necessarily make one *assume* what the correct answer might be.
For instance, by using ROWNUMBER(), one has to assume that the first row that appears *in ROWNUMBER sequence* is necessarily the right row to select. But since the questionner only mentionned OrderNo as a natural identifier, I did not feel it was necessariy to bring additional assumptions to the equation.
I guess the safest way to do that would be to run a dupplicate remover based on CaseName and SubjectName candidate key than run the followin based on the natural key.
something like...
select distinct A.CaseName, A.DateSubmitted, A..OrderNo , A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.CaseName = B.CaseName and A.SubjectName = B.SubjectName andA.SubjectName = B.SubjectName and A.DateSubmitted = B.DateSubmitted
<<, however I feel your query will fail to eliminate the first two rows, which are duplicates>>
The main problem is the lack of unique identifer for tuples. As angelIII mentionned, there is a need to do data cleaning first by removing dupplicates, else running any extraction will necessarily make one *assume* what the correct answer might be.
For instance, by using ROWNUMBER(), one has to assume that the first row that appears *in ROWNUMBER sequence* is necessarily the right row to select. But since the questionner only mentionned OrderNo as a natural identifier, I did not feel it was necessariy to bring additional assumptions to the equation.
I guess the safest way to do that would be to run a dupplicate remover based on CaseName and SubjectName candidate key than run the followin based on the natural key.
something like...
select distinct A.CaseName, A.DateSubmitted, A..OrderNo , A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.CaseName = B.CaseName and A.SubjectName = B.SubjectName andA.SubjectName = B.SubjectName and A.DateSubmitted = B.DateSubmitted
Ooops I mean
select distinct A.CaseName, A.DateSubmitted, A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.CaseName = B.CaseName and A.SubjectName = B.SubjectName and A.DateSubmitted = B.DateSubmitted
select distinct A.CaseName, A.DateSubmitted, A.SubjectName
from table1 A inner join
(select CaseName, SubjectName, min(DateSubmitted) as DateSubmitted from table1 group by CaseName, SubjectName) B on A.CaseName = B.CaseName and A.SubjectName = B.SubjectName and A.DateSubmitted = B.DateSubmitted
I still think my comment 24791266 could help the asker.
Seconded My Comment 24789647 is also useful
My recommendation is split points between comment 24791266 and 24792899.
Thats fine
Select OrderNo, MAX(DateSubmitted), MAX(SubjectName), MAX(CaseName)
From Table
Group by OrderNo