Link to home
Start Free TrialLog in
Avatar of AJS_Developer
AJS_Developer

asked on

Use SQL to find unique sets

In a SQL Server 2005 database there is a table called Table_1. It's columns are ID, SetName, and SetValue. I'm after an SQL statement (without the help of iterating with a cursor) that will return all unique combinations of values in SetValue.

For example:

Running the statement against this data:

1          1          A

2          1          B

3          1          C

4          2          A

5          2          B

6          3          A

7          3          B

8          3          C


would return

1          1          A

2          1          B

3          1          C

4          2          A

5          2          B

All rows with SetName 3 would be left out, because really set 3 is just a duplicate of set 1.

Thanks in advance!
Avatar of momi_sabag
momi_sabag
Flag of United States of America image

try something like this

with a as (
select t1.id, t1.setName, t1.value,
       count(*) over(partition by t1.setName) as set_count
       sum (case when t2.setName is not null then 1 else 0 end) over(partition by t1.setName) as set2_count
from yourtable t1 left outer join yourtable t2
 on t1.setName < t2.setName
 and t1.value = t2.value
)
select id,setName,value
from a t3
where set_count = set2_count
sorry
try this one

with a as (
select t1.id, t1.setName, t1.value,
       sum (case when t1.setName is not null and t2.setName is null then 1 else 0 end) over(partition by t1.setName) as only_in_set1
       sum (case when t2.setName is not null and t1.setName is null then 1 else 0 end) over(partition by t1.setName) as only_in_set2
from yourtable t1 full outer join yourtable t2
 on t1.setName < t2.setName
 and t1.value = t2.value
)
select id,setName,value
from a t3
where only_in_set1 = 0
and only_in_set2 = 0
have a look.

select distinct t1.id,t1.setname,t1.setvalue from tab1 t1 full outer join tab1 t2
 on t1.setName
BTW, the table and data I have used in above query is like:


create table tab1
(
id int ,
setname int,
setvalue varchar(1)
)

insert into tab1
select 1,          1,          'A' union all
select 2 ,         1,          'B' union all
select 3  ,        1,          'C' union all
select 4   ,       2,          'A' union all
select 5 ,         2,          'B' union all
select 6,          3,          'A' union all
select 7,          3,          'B' union all
select 8 ,         3,          'C'
Avatar of AJS_Developer
AJS_Developer

ASKER

Thanks RiteshShah!

I made some small changes to your statement (see attached) and I get the correct result.

However (this may be because I didn't phrase the question correctly) if I use the following test data:

1      1      A        
2      1      B        
3      1      C        
4      2      A        
5      2      B        
6      3      D        
7      3      E        
8      3      F        

then what I would expect is

1      1      A        
2      1      B        
3      1      C        
4      2      A        
5      2      B        
6      3      D        
7      3      E        
8      3      F        

because in this case each of the 3 sets are unique, but you're statement returns

1      1      A        
2      1      B        


Select	distinct t1.id,
	t1.setname,
	t1.setvalue
from	table_1 t1 full outer join table_1 t2
		on t1.setName <t2.setName
		and t1.setvalue = t2.setvalue
where	t2.id is not null and t1.id is not null order by t1.id

Open in new window

momi_sabaq,

Using your second suggestion (which I made some syntax changes to, code attached) I get the following incorrect result:

1      1      A        
1      1      A        
2      1      B        
2      1      B        
3      1      C        
4      2      A        
5      2      B        

with a as (
	select	t1.id, t1.setName, t1.Setvalue, 
			sum (case when t1.setName is not null and t2.setName is null then 1 else 0 end) over(partition by t1.setName) as only_in_set1,
			sum (case when t2.setName is not null and t1.setName is null then 1 else 0 end) over(partition by t1.setName) as only_in_set2
	from	table_1 t1 full outer join table_1 t2
				on t1.setName < t2.setName
				and t1.Setvalue = t2.Setvalue
	)
select	id,
		setName,
		Setvalue
from	a t3
where	only_in_set1 = 0
		and only_in_set2 = 0

Open in new window

momi_sabaq,

Using your first suggestion (which I also made some changes to, code attached) gave me the same incorrect result

1      1      A        
1      1      A        
2      1      B        
2      1      B        
3      1      C        
4      2      A        
5      2      B        



with a as (
select t1.id, t1.setName, t1.setvalue, 
       count(*) over(partition by t1.setName) as set_count,
       sum (case when t2.setName is not null then 1 else 0 end) over(partition by t1.setName) as set2_count
from table_1 t1 left outer join table_1 t2
 on t1.setName < t2.setName
 and t1.setvalue = t2.setvalue
)
select id,setName,setvalue
from a t3
where set_count = set2_count

Open in new window

Avatar of Sharath S
are you still looking for the solution?
As a solution to this problem I'd also accept a proof as to why it is impossible to accomplish without using a cursor. Thanks for all your help so far.

If you have the data like this, your query will not work. let me know if you still looking for solution.
select * from #temp
 
id	setname	setvalue
1	1	A
2	1	B
3	1	C
4	2	A
5	2	B
8	2	C
6	3	A
7	3	B
8	3	C
 
Select	distinct t1.id,	t1.setname,	t1.setvalue
from	#temp t1 full outer join #temp t2
		on t1.setName <t2.setName
		and t1.setvalue = t2.setvalue
where	t2.id is not null and t1.id is not null order by t1.id
 
id	setname	setvalue
1	1	A
2	1	B
3	1	C
4	2	A
5	2	B
8	2	C

Open in new window

See code attached for a possible solution. Here are my test results (Have I missed a special case somewhere?)

Case 1: (3 sets total, 3rd is duplicate of first)

1      1      A        
2      1      B        
3      1      C        
4      2      A        
5      2      B        
6      3      A        
7      3      B        
8      3      C        

returns

1      1      A        
2      1      B        
3      1      C        
4      2      A        
5      2      B        

Case 2: (3 sets total, all the same)

1      1      A        
2      1      B        
3      1      C        
4      2      A        
5      2      B        
6      3      A        
7      3      B        
8      3      C        
9      2      C        

returns

1      1      A        
2      1      B        
3      1      C        

Case 4: (5 sets total, 1st and 4th are the same, 3rd and 5th are the same, 2nd unique, sets 1, 3, 4, 5 share at least one common element)

1      1      A        
2      1      B        
3      1      C        
4      2      A        
5      2      B        
6      3      C        
7      3      D        
8      3      E        
9      4      A        
10      4      B        
11      4      C        
12      5      C        
13      5      D        
14      5      E        

returns

1      1      A        
2      1      B        
3      1      C        
4      2      A        
5      2      B        
6      3      C        
7      3      D        
8      3      E        


Thanks for all the suggestions!
WITH t2 AS (
	SELECT	t1.ID,
			t1.SetName,
			t1.SetValue,
			COUNT(*) OVER(PARTITION BY t1.SetName) AS counter
	FROM	Table_1 t1
	)
SELECT	*
FROM	Table_1 t7
WHERE	t7.SetName NOT IN (
			SELECT	DISTINCT t3_SetName AS Duplicate_SetName
			FROM	(
				SELECT	*,
						COUNT(*) OVER(PARTITION BY t3_SetName, t4_SetName) AS counter
				FROM	(
					SELECT	t3.ID		AS t3_ID,
							t3.SetName	AS t3_SetName,
							t3.SetValue	AS t3_SetValue,
							t3.counter	AS t3_counter,
							t4.ID		AS t4_ID,
							t4.SetName	AS t4_SetName,
							t4.SetValue	AS t4_SetValue,
							t4.counter	AS t4_counter
					FROM	t2 t3 LEFT JOIN t2 t4
								ON t4.counter = t3.counter
								AND t4.SetValue = t3.SetValue
								AND t4.SetName < t3.SetName
				) t5
			) t6
			WHERE	counter = t3_counter
					AND NOT t4_ID IS NULL
		)

Open in new window

Hi All,

I couldn't help myself - I've tidied up the code a bit so that it should be easier to transpose to whatever table you are trying to apply this too. Now you can just substitute field / table names in the top WITH clause to get going.

I'm not sure how practical this is going to be either - running on real world data seems to be taking a LONG time.

WITH t2 AS (
    SELECT  t1.ID,
            t1.SetName,
            t1.SetValue,
            COUNT(*) OVER(PARTITION BY t1.SetName) AS counter
    FROM    Table_1 t1
    )
SELECT  ID,
        SetName,
        SetValue
FROM    t2 t7
WHERE   t7.SetName NOT IN (
            SELECT  DISTINCT t3_SetName AS Duplicate_SetName
            FROM    (
                SELECT  *,
                        COUNT(*) OVER(PARTITION BY t3_SetName, t4_SetName) AS counter
                FROM    (
                    SELECT  t3.SetName  AS t3_SetName,
                            t3.counter  AS t3_counter,
                            t4.ID       AS t4_ID,
                            t4.SetName  AS t4_SetName
                    FROM    t2 t3 LEFT JOIN t2 t4
                                ON t4.counter = t3.counter
                                AND t4.SetValue = t3.SetValue
                                AND t4.SetName < t3.SetName
                ) t5
            ) t6
            WHERE   counter = t3_counter
                    AND NOT t4_ID IS NULL
        )

Open in new window


I tried like this. also tested with the below examples and working fine.
select t4.*
  from Table_1 t4
  join (select min(SetName) as SetName
          from (select distinct SetName,
                      (select rtrim(substring(isnull((select top 5 ','+SetValue
                        from Table_1 t1 where t1.SetName = t2.SetName for xml path('')),''),2,2000))) Grp
                  from Table_1 t2) t3 group by Grp)t5
    on t4.SetName = t5.SetName
 order by t4.id
 

create table #temp(id int,SetName int,SetValue varchar(2))
insert into #temp values (1,1,'A'),(2,1,'B'),(3,1,'C'),(4,2,'A'),(5,2,'B'),(6,3,'A'),(7,3,'B'),(8,3,'C')
select * from #temp
 
id	SetName	SetValue
1	1	A
2	1	B
3	1	C
4	2	A
5	2	B
6	3	A
7	3	B
8	3	C
 
select t4.* 
  from #temp t4
  join (select min(SetName) as SetName
          from (select distinct SetName,
                      (select rtrim(substring(isnull((select top 5 ','+SetValue 
                        from #temp t1 where t1.SetName = t2.SetName for xml path('')),''),2,2000))) Grp
                  from #temp t2) t3 group by Grp)t5
    on t4.SetName = t5.SetName
 order by t4.id
 
id	SetName	SetValue
1	1	A
2	1	B
3	1	C
4	2	A
5	2	B
 
 drop table #temp
go
--------------------------------------------------------------------------------------------------------
create table #temp(id int,SetName int,SetValue varchar(2))
insert into #temp values (1,1,'A'),(2,1,'B'),(3,1,'C'),(4,2,'A'),(5,2,'B'),(6,3,'D'),(7,3,'E'),(8,3,'F')
select * from #temp
 
id	SetName	SetValue
1	1	A
2	1	B
3	1	C
4	2	A
5	2	B
6	3	D
7	3	E
8	3	F
 
select t4.* 
  from #temp t4
  join (select min(SetName) as SetName
          from (select distinct SetName,
                      (select rtrim(substring(isnull((select top 5 ','+SetValue 
                        from #temp t1 where t1.SetName = t2.SetName for xml path('')),''),2,2000))) Grp
                  from #temp t2) t3 group by Grp)t5
    on t4.SetName = t5.SetName
 order by t4.id
 
 id	SetName	SetValue
1	1	A
2	1	B
3	1	C
4	2	A
5	2	B
6	3	D
7	3	E
8	3	F
 
 drop table #temp
go
 
--------------------------------------------------------------------------------------------------------
create table #temp(id int,SetName int,SetValue varchar(2))
insert into #temp values (1,1,'A'),(2,1,'B'),(3,1,'C'),(4,2,'A'),(5,2,'B'),(6,2,'C'),(7,3,'A'),(8,3,'B'),(9,3,'C')
select * from #temp
 
id	SetName	SetValue
1	1	A
2	1	B
3	1	C
4	2	A
5	2	B
6	2	C
7	3	A
8	3	B
9	3	C
 
select t4.* 
  from #temp t4
  join (select min(SetName) as SetName
          from (select distinct SetName,
                      (select rtrim(substring(isnull((select top 5 ','+SetValue 
                        from #temp t1 where t1.SetName = t2.SetName for xml path('')),''),2,2000))) Grp
                  from #temp t2) t3 group by Grp)t5
    on t4.SetName = t5.SetName
 order by t4.id
 
id	SetName	SetValue
1	1	A
2	1	B
3	1	C
 
 drop table #temp
go

Open in new window

if you have data like the attached sample set, my above query won't work. An ORDER BY clause is required in this case.
create table #temp(id int,SetName int,SetValue varchar(2))
insert into #temp values (1,1,'A'),(2,1,'B'),(3,1,'C'),(4,2,'A'),(5,2,'B'),(6,3,'C'),(7,3,'B'),(8,3,'A')
select * from #temp
 
id	SetName	SetValue
1	1	A
2	1	B
3	1	C
4	2	A
5	2	B
6	3	C
7	3	B
8	3	A
 
select t4.* 
  from #temp t4
  join (select min(SetName) as SetName
          from (select distinct SetName,
                      (select rtrim(substring(isnull((select top 5 ','+SetValue 
                        from #temp t1 where t1.SetName = t2.SetName order by t1.SetValue for xml path('')),''),2,2000))) Grp
                  from #temp t2) t3 group by Grp)t5
    on t4.SetName = t5.SetName
 order by t4.id
 
id	SetName	SetValue
1	1	A
2	1	B
3	1	C
4	2	A
5	2	B
 
 drop table #temp
go

Open in new window

ASKER CERTIFIED SOLUTION
Avatar of Sharath S
Sharath S
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Sharath_123,

Your solution is brilliant. Using FOR XML as a concatenation technique proved the fastest solution.

On a sample of 5,000 records, with SetValue's of roughly 15 - 20 characters, your script performed consistently at 5 to 8 seconds, ahead of my solution which took 30 - 40 seconds, and a solution that used temp tables to build up a unique string per set that took hours.
Thanks AJS_Developer, Glad to help you.