We help IT Professionals succeed at work.

Compare a group of rows

146 Views
Last Modified: 2014-08-07
We have a table with members. They are identified by a memberID and a memberType. Members can be part of a group, so we add them to a table where the members get a groupNumber. However, if groups consist of exactly the same members, we have to undouble them. See my example:

create table dbo.Grp (grpNr int, memberID varchar(50), memberType varchar(50))

insert into grp values(1, '000000000599267', 'Client')
insert into grp values(1, '000000000599268', 'Client')
insert into grp values(2, '000000004192744', 'Client')
insert into grp values(2, '000000004192745', 'Client')
insert into grp values(3, '000000000599267', 'Client')
insert into grp values(3, '000000000599268', 'Client')
insert into grp values(4, '000000004192745', 'Client')
insert into grp values(4, '000000000599267', 'Client')
insert into grp values(4, '000000004192744', 'Client')

select * from Grp

Open in new window


As you can see, groups 1 and 3 are the same, groups 2 and 4 are unique.

I have tried to find double groups by using CHECKSUM_AGG:

select grpNr, checksum_agg(CHECKSUM(memberType + memberID))
from Grp
group by grpNr

Open in new window


If you run this code, you will see that the result is the same for groups 1, 2 and 3. Only 4 is different. This is not correct; only 1 and 3 are identical.

I have tried to compare groups by concatenating the keys of group members, using FOR XML PATH. This works fine for small amounts of data, but in our production situation we have about 1 million members, who can each be part of many groups. The query with FOR XML PATH takes too long to complete.

Does anyone have any suggestions on how to solve this problem? I have read that CHECKSUM and CHECKSUM_AGG don't guarantee absolute uniqueness, but I didn't know that even on a small testset I would run into double values. I know that HASHBYTES - MD5 is supposed to work better, but can I use that over a group of rows instead of just one row?
Comment
Watch Question

CERTIFIED EXPERT

Commented:
have you tried to select distinct and save the results into a new table?

Author

Commented:
I am not trying to find double rows, but double groups of rows. Members can belong to several groups, so I can't undouble on memberID. So how can I use distinct?
CERTIFIED EXPERT

Commented:
explain "double groups of rows"

you wrote :if groups consist of exactly the same members, we have to undouble them:
undouble groups or undouble members?

maybe write an expected results?

Author

Commented:
I have to undouble groups. In my example, groups 1 and 3 are identical. I want to find a way to recognize groups that consist of exactly the same collection of members.

So, without undoubling the groups may look like this:
grpNr       memberID                                           memberType
----------- -------------------------------------------------- --------------------------------------------------
1           000000000599267                                    Client
1           000000000599268                                    Client
2           000000004192744                                    Client
2           000000004192745                                    Client
3           000000000599267                                    Client
3           000000000599268                                    Client
4           000000004192745                                    Client
4           000000000599267                                    Client
4           000000004192744                                    Client

And when duplicates are removed, I want something like this:

grpNr       memberID                                           memberType
----------- -------------------------------------------------- --------------------------------------------------
1           000000000599267                                    Client
1           000000000599268                                    Client
2           000000004192744                                    Client
2           000000004192745                                    Client
4           000000004192745                                    Client
4           000000000599267                                    Client
4           000000004192744                                    Client

Group 3 is removed, because it was the same as group 1. What I am looking for, is a way to find that groups 1 and 3 are identical, so I can get rid of the duplicate.

Hope this explains what I am trying to achieve.
PortletPaulEE Topic Advisor
CERTIFIED EXPERT
Most Valuable Expert 2014
Awarded 2013

Commented:
Your original query concept works so far as I can see but I did reverse the field position, this result
| GRPNR |       CHKS |
|-------|------------|
|     1 |  117440512 |
|     2 |   16777216 |
|     3 |  117440512 |
|     4 | -271572349 |

-- produced by
SELECT
      grpNr
    , CHECKSUM_AGG(CHECKSUM(memberID + memberType)) AS chks
FROM Grp
GROUP BY
      grpNr
;

Open in new window

This then allows seleting the groups that contain the duplicate membership
SELECT
      grpNr
FROM (
            SELECT
                  grpNr
                , chks
                , COUNT(grpNr) OVER (PARTITION BY chks) AS count_of
            FROM (
                        SELECT
                              grpNr
                            , CHECKSUM_AGG(CHECKSUM(memberID + memberType)) AS chks
                        FROM Grp
                        GROUP BY
                              grpNr
                  ) sq1
      ) sq2
WHERE count_of > 1
;

Open in new window

| GRPNR |
|-------|
|     3 |
|     1 |
	

Open in new window

http://sqlfiddle.com/#!6/c898f/1 (mssql 2012)

Author

Commented:
That is strange... This is my query result:

grpNr       chks
----------- -----------
1           52428800
2           52428800
3           52428800
4           935418437

(4 row(s) affected)

--Produced by

SELECT
      grpNr
    , CHECKSUM_AGG(CHECKSUM(memberID + memberType)) AS chks
FROM Grp
GROUP BY
      grpNr

Open in new window


So I get a different result from CHECKSUM_AGG.
PortletPaulEE Topic Advisor
CERTIFIED EXPERT
Most Valuable Expert 2014
Awarded 2013

Commented:
definitely the same data?
what is your dbms version?
what happens if you use only memberID in that sample?

what is the output of this?
SELECT
      grpNr
    , memberID
    , memberType
    , CHECKSUM(memberID + memberType) AS chks
FROM Grp
ORDER BY
      memberID
    , memberType
    , grpNr
;

Open in new window

| GRPNR |        MEMBERID | MEMBERTYPE |       CHKS |
|-------|-----------------|------------|------------|
|     1 | 000000000599267 |     Client | -288349565 |
|     3 | 000000000599267 |     Client | -288349565 |
|     4 | 000000000599267 |     Client | -288349565 |
|     1 | 000000000599268 |     Client | -372235645 |
|     3 | 000000000599268 |     Client | -372235645 |
|     2 | 000000004192744 |     Client | -841162178 |
|     4 | 000000004192744 |     Client | -841162178 |
|     2 | 000000004192745 |     Client | -857939394 |
|     4 | 000000004192745 |     Client | -857939394 |

Open in new window

http://sqlfiddle.com/#!6/c898f/3

Author

Commented:
My server version: Microsoft SQL Server 2012 (SP1) - 11.0.3128.0 (X64), Build 7601: Service Pack 1.

This is the output:
grpNr       memberID                                           memberType                                         chks
----------- -------------------------------------------------- -------------------------------------------------- -----------
1           000000000599267                                    Client                                             887183941
3           000000000599267                                    Client                                             887183941
4           000000000599267                                    Client                                             887183941
1           000000000599268                                    Client                                             935418437
3           000000000599268                                    Client                                             935418437
2           000000004192744                                    Client                                             166343886
4           000000004192744                                    Client                                             166343886
2           000000004192745                                    Client                                             181023950
4           000000004192745                                    Client                                             181023950

Open in new window



And when I use only memberID:

grpNr       chks
----------- -----------
1           50
2           50
3           50
4           -94274528

--Produced by

SELECT
      grpNr
    , CHECKSUM_AGG(CHECKSUM(memberID)) AS chks
FROM Grp
GROUP BY
      grpNr

Open in new window

PortletPaulEE Topic Advisor
CERTIFIED EXPERT
Most Valuable Expert 2014
Awarded 2013

Commented:
mmm, last try. instead of concatenating the 2 fields just separate then by a comma

SELECT
      grpNr
    , CHECKSUM_AGG(CHECKSUM(memberID , memberType)) AS chks
FROM Grp
GROUP BY
      grpNr
;

Author

Commented:
In that case I get:
grpNr       chks
----------- -----------
1           800
2           800
3           800
4           887792098

Open in new window


So still groups 1-2-3 result in the same checksum.
PortletPaulEE Topic Advisor
CERTIFIED EXPERT
Most Valuable Expert 2014
Awarded 2013

Commented:
sorry, no idea why, this does not seem to be helping
ste5anSenior Developer
CERTIFIED EXPERT

Commented:
You need a relational division to find your matching groups. See Joe's article about it: Divided We Stand: The SQL of Relational Division.

Author

Commented:
@PortletPaul - thanks for your time anyway.

@ste5an - this looks interesting, but I'm not sure how to apply this information. In my example, what would be the divisor?
ste5anSenior Developer
CERTIFIED EXPERT
Commented:
This one is on us!
(Get your first solution completely free - no credit card required)
UNLOCK SOLUTION
PortletPaulEE Topic Advisor
CERTIFIED EXPERT
Most Valuable Expert 2014
Awarded 2013
Commented:
This one is on us!
(Get your first solution completely free - no credit card required)
UNLOCK SOLUTION

Author

Commented:
Both thank you for your reaction. I will not be able to look into it until tomorrow morning, but I'll get back on you a.s.a.p.
ste5anSenior Developer
CERTIFIED EXPERT

Commented:
Use
WITH    CTE
          AS ( SELECT   grpNr ,
                        memberID ,
                        ROW_NUMBER() OVER ( PARTITION BY grpNr ORDER BY memberID ASC ) AS rowNum ,
                        COUNT(memberID) OVER ( PARTITION BY grpNr ) AS numRows
               FROM     @grp
             )
    SELECT DISTINCT
            g1.grpNr ,
            g2.grpNr
    FROM    CTE AS g1
            LEFT JOIN CTE AS g2 ON g1.grpNr <> g2.grpNr
                                   AND g1.memberID = g2.memberID
                                   AND g1.rowNum = g2.rowNum
                                   AND g1.numRows = g2.numRows;

Open in new window

and it depends on the indices. But it gets some table spools, so it may be indeed faster.
PortletPaulEE Topic Advisor
CERTIFIED EXPERT
Most Valuable Expert 2014
Awarded 2013

Commented:
@ste5an, it needs to be an inner join :)
ste5anSenior Developer
CERTIFIED EXPERT

Commented:
Just cosmetics. The INNER JOIN simply shows use a list of dups. the LEFT JOIN shows us all:

(9 row(s) affected)
grpNr       grpNr
----------- -----------
1           3
2           NULL
3           1
4           NULL

(4 row(s) affected)

grpNr       grpNr
----------- -----------
1           3
3           1

(2 row(s) affected)

Open in new window


My point was adding g2.grpNr to display which group is the dup :)
ste5anSenior Developer
CERTIFIED EXPERT

Commented:
@PortletPaul: Take a look at this one, I'm not sure whether it's a copy and paste error.. Group 3 and Group 4 are not equal.

DECLARE @Grp TABLE
    (
      grpNr INT ,
      memberID VARCHAR(50) ,
      memberType VARCHAR(50)
    );

INSERT  INTO @grp
VALUES  ( 1, '1', 'Client' ),
        ( 1, '2', 'Client' ),
        ( 2, '3', 'Client' ),
        ( 2, '4', 'Client' ),
        ( 3, '1', 'Client' ),
        ( 3, '2', 'Client' ),
        ( 3, '3', 'Client' ),
        ( 4, '4', 'Client' ),
        ( 4, '1', 'Client' ),
        ( 4, '3', 'Client' );

SELECT  G0.grpNr ,
        RD.grpNr
FROM    ( SELECT DISTINCT
                    grpNr
          FROM      @Grp
        ) G0
        OUTER APPLY ( SELECT    G1.grpNr
                      FROM      @Grp G1
                      WHERE     NOT EXISTS ( SELECT G2.memberID
                                             FROM   @Grp G2
                                             WHERE  G2.grpNr = G1.grpNr
                                             EXCEPT
                                             SELECT G3.memberID
                                             FROM   @Grp G3
                                             WHERE  G3.grpNr = G0.grpNr )
                                AND ( SELECT    COUNT(*)
                                      FROM      @Grp G2
                                      WHERE     G2.grpNr = G1.grpNr
                                    ) = ( SELECT    COUNT(*)
                                          FROM      @Grp G3
                                          WHERE     G3.grpNr = G0.grpNr
                                        )
                                AND G1.grpNr != G0.grpNr
                      GROUP BY  G1.grpNr
                    ) RD;

WITH    CTE
          AS ( SELECT   grpNr ,
                        memberID ,
                        memberType ,
                        ROW_NUMBER() OVER ( PARTITION BY grpNr ORDER BY memberID ASC, memberType ASC ) AS rowNum ,
                        COUNT(memberID) OVER ( PARTITION BY grpNr ) AS numRows
               FROM     @Grp G
             )
    SELECT DISTINCT
            g1.grpNr ,
            g2.grpNr
    FROM    CTE AS g1
            INNER JOIN CTE AS g2 ON g1.grpNr <> g2.grpNr
                                    AND g1.memberID = g2.memberID
                                    AND g1.memberType = g2.memberType
                                    AND g1.rowNum = g2.rowNum
                                    AND g1.numRows = g2.numRows;

Open in new window

ste5anSenior Developer
CERTIFIED EXPERT

Commented:
Found it: DISTINCT and INNER are hiding some rows..

DECLARE @Grp TABLE
    (
      grpNr INT ,
      memberID VARCHAR(50) ,
      memberType VARCHAR(50)
    );

INSERT  INTO @grp
VALUES  ( 1, '1', 'Client' ),
        ( 1, '2', 'Client' ),
        ( 2, '3', 'Client' ),
        ( 2, '4', 'Client' ),
        ( 3, '1', 'Client' ),
        ( 3, '2', 'Client' ),
        ( 3, '3', 'Client' ),
        ( 4, '4', 'Client' ),
        ( 4, '1', 'Client' ),
        ( 4, '3', 'Client' );

WITH    CTE
          AS ( SELECT   G.grpNr ,
                        G.memberID ,
                        G.memberType ,
                        ROW_NUMBER() OVER ( PARTITION BY G.grpNr ORDER BY G.memberID ASC, G.memberType ASC ) AS rowNum ,
                        COUNT(G.memberID) OVER ( PARTITION BY G.grpNr ) AS numRows
               FROM     @Grp G
             )
    SELECT DISTINCT
            g1.*,
            g2.*
    FROM    CTE AS g1
            left JOIN CTE AS g2 ON g1.grpNr <> g2.grpNr
                                    AND g1.memberID = g2.memberID
                                    AND g1.memberType = g2.memberType
                                    AND g1.rowNum = g2.rowNum
                                    AND g1.numRows = g2.numRows;

Open in new window

ste5anSenior Developer
CERTIFIED EXPERT

Commented:
@pven13:

The NOT EXISTS with the EXCEPT operation is the relational division. The comparision of the cardinalities - the COUNT(*)'s - is needed to match for identity in those groups.

When you remove the cardinality check, then you get also the covered groups. In your sample, group 4 covers group 2.
Senior DBA
CERTIFIED EXPERT
Most Valuable Expert 2018
Distinguished Expert 2019
Commented:
This one is on us!
(Get your first solution completely free - no credit card required)
UNLOCK SOLUTION
ste5anSenior Developer
CERTIFIED EXPERT

Commented:
@Scott: But it requires some interpretation, cause the results are not symmetric.
Scott PletcherSenior DBA
CERTIFIED EXPERT
Most Valuable Expert 2018
Distinguished Expert 2019

Commented:
@ste5an:

Huh?  It lists, only once (without using a "DISTINCT" kludge with its large overhead), the duplicate combinations, with the lower group number always grp1, and the total number of members that are duplicated in those groups.  What more would be needed?
PortletPaulEE Topic Advisor
CERTIFIED EXPERT
Most Valuable Expert 2014
Awarded 2013

Commented:
Scott's solution is very nice.
Ensuring both memberId and memberType are evaluated:
after  "g1.memberID = g2.memberId" (line 12)
add "AND g1.memberType = g2.memberType"
ste5anSenior Developer
CERTIFIED EXPERT

Commented:
Yup, and using != instead of < gives the necessary symmetry.
Scott PletcherSenior DBA
CERTIFIED EXPERT
Most Valuable Expert 2018
Distinguished Expert 2019

Commented:
>> Yup, and using != instead of < gives the necessary symmetry. <<

But that also gives duplicate results.

I want my joins to be 1 to 2, 1 to 3, 1 to 4, etc., only, not both 1 to 2 and 2 to 1, 1 to 3 and 3 to 1, 1 to 4 and 4 to 1, etc..  The latter would of course be wasted I/O, and clutter the results with duplicate values, just in reverse order (even more confusing).

Author

Commented:
Thank you all for your contributions. The solutions offered by ScottPletcher and ste5an both work fine, even on the full dataset, and as far as I have tested so far give the right results. PortletPaul's solution needed some adjustments, but was also useful. It was nice to see these different approaches leading to the same result.

I decided to divide the 500 points over you three, with ScottPletcher and ste5an receiveing more points than PortletPaul. Hope nobody feels offended by that...

Thanks again. I adjusted my code based on ScottPletcher's solution, because it performed better with 1 million members.

Gain unlimited access to on-demand training courses with an Experts Exchange subscription.

Get Access
Why Experts Exchange?

Experts Exchange always has the answer, or at the least points me in the correct direction! It is like having another employee that is extremely experienced.

Jim Murphy
Programmer at Smart IT Solutions

When asked, what has been your best career decision?

Deciding to stick with EE.

Mohamed Asif
Technical Department Head

Being involved with EE helped me to grow personally and professionally.

Carl Webster
CTP, Sr Infrastructure Consultant
Empower Your Career
Did You Know?

We've partnered with two important charities to provide clean water and computer science education to those who need it most. READ MORE

Ask ANY Question

Connect with Certified Experts to gain insight and support on specific technology challenges including:

  • Troubleshooting
  • Research
  • Professional Opinions
Unlock the solution to this question.
Join our community and discover your potential

Experts Exchange is the only place where you can interact directly with leading experts in the technology field. Become a member today and access the collective knowledge of thousands of technology experts.

*This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

OR

Please enter a first name

Please enter a last name

8+ characters (letters, numbers, and a symbol)

By clicking, you agree to the Terms of Use and Privacy Policy.