?
Solved

Compare a group of rows

Posted on 2014-08-06
29
Medium Priority
?
133 Views
Last Modified: 2014-08-07
We have a table with members. They are identified by a memberID and a memberType. Members can be part of a group, so we add them to a table where the members get a groupNumber. However, if groups consist of exactly the same members, we have to undouble them. See my example:

create table dbo.Grp (grpNr int, memberID varchar(50), memberType varchar(50))

insert into grp values(1, '000000000599267', 'Client')
insert into grp values(1, '000000000599268', 'Client')
insert into grp values(2, '000000004192744', 'Client')
insert into grp values(2, '000000004192745', 'Client')
insert into grp values(3, '000000000599267', 'Client')
insert into grp values(3, '000000000599268', 'Client')
insert into grp values(4, '000000004192745', 'Client')
insert into grp values(4, '000000000599267', 'Client')
insert into grp values(4, '000000004192744', 'Client')

select * from Grp

Open in new window


As you can see, groups 1 and 3 are the same, groups 2 and 4 are unique.

I have tried to find double groups by using CHECKSUM_AGG:

select grpNr, checksum_agg(CHECKSUM(memberType + memberID))
from Grp
group by grpNr

Open in new window


If you run this code, you will see that the result is the same for groups 1, 2 and 3. Only 4 is different. This is not correct; only 1 and 3 are identical.

I have tried to compare groups by concatenating the keys of group members, using FOR XML PATH. This works fine for small amounts of data, but in our production situation we have about 1 million members, who can each be part of many groups. The query with FOR XML PATH takes too long to complete.

Does anyone have any suggestions on how to solve this problem? I have read that CHECKSUM and CHECKSUM_AGG don't guarantee absolute uniqueness, but I didn't know that even on a small testset I would run into double values. I know that HASHBYTES - MD5 is supposed to work better, but can I use that over a group of rows instead of just one row?
0
Comment
Question by:pven13
  • 9
  • 8
  • 7
  • +2
29 Comments
 
LVL 14

Expert Comment

by:theruck
ID: 40243079
have you tried to select distinct and save the results into a new table?
0
 
LVL 1

Author Comment

by:pven13
ID: 40243083
I am not trying to find double rows, but double groups of rows. Members can belong to several groups, so I can't undouble on memberID. So how can I use distinct?
0
 
LVL 14

Expert Comment

by:theruck
ID: 40243114
explain "double groups of rows"

you wrote :if groups consist of exactly the same members, we have to undouble them:
undouble groups or undouble members?

maybe write an expected results?
0
Get your Conversational Ransomware Defense e‑book

This e-book gives you an insight into the ransomware threat and reviews the fundamentals of top-notch ransomware preparedness and recovery. To help you protect yourself and your organization. The initial infection may be inevitable, so the best protection is to be fully prepared.

 
LVL 1

Author Comment

by:pven13
ID: 40243137
I have to undouble groups. In my example, groups 1 and 3 are identical. I want to find a way to recognize groups that consist of exactly the same collection of members.

So, without undoubling the groups may look like this:
grpNr       memberID                                           memberType
----------- -------------------------------------------------- --------------------------------------------------
1           000000000599267                                    Client
1           000000000599268                                    Client
2           000000004192744                                    Client
2           000000004192745                                    Client
3           000000000599267                                    Client
3           000000000599268                                    Client
4           000000004192745                                    Client
4           000000000599267                                    Client
4           000000004192744                                    Client

And when duplicates are removed, I want something like this:

grpNr       memberID                                           memberType
----------- -------------------------------------------------- --------------------------------------------------
1           000000000599267                                    Client
1           000000000599268                                    Client
2           000000004192744                                    Client
2           000000004192745                                    Client
4           000000004192745                                    Client
4           000000000599267                                    Client
4           000000004192744                                    Client

Group 3 is removed, because it was the same as group 1. What I am looking for, is a way to find that groups 1 and 3 are identical, so I can get rid of the duplicate.

Hope this explains what I am trying to achieve.
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 40243163
Your original query concept works so far as I can see but I did reverse the field position, this result
| GRPNR |       CHKS |
|-------|------------|
|     1 |  117440512 |
|     2 |   16777216 |
|     3 |  117440512 |
|     4 | -271572349 |

-- produced by
SELECT
      grpNr
    , CHECKSUM_AGG(CHECKSUM(memberID + memberType)) AS chks
FROM Grp
GROUP BY
      grpNr
;

Open in new window

This then allows seleting the groups that contain the duplicate membership
SELECT
      grpNr
FROM (
            SELECT
                  grpNr
                , chks
                , COUNT(grpNr) OVER (PARTITION BY chks) AS count_of
            FROM (
                        SELECT
                              grpNr
                            , CHECKSUM_AGG(CHECKSUM(memberID + memberType)) AS chks
                        FROM Grp
                        GROUP BY
                              grpNr
                  ) sq1
      ) sq2
WHERE count_of > 1
;

Open in new window

| GRPNR |
|-------|
|     3 |
|     1 |
	

Open in new window

http://sqlfiddle.com/#!6/c898f/1 (mssql 2012)
0
 
LVL 1

Author Comment

by:pven13
ID: 40243236
That is strange... This is my query result:

grpNr       chks
----------- -----------
1           52428800
2           52428800
3           52428800
4           935418437

(4 row(s) affected)

--Produced by

SELECT
      grpNr
    , CHECKSUM_AGG(CHECKSUM(memberID + memberType)) AS chks
FROM Grp
GROUP BY
      grpNr

Open in new window


So I get a different result from CHECKSUM_AGG.
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 40243244
definitely the same data?
what is your dbms version?
what happens if you use only memberID in that sample?

what is the output of this?
SELECT
      grpNr
    , memberID
    , memberType
    , CHECKSUM(memberID + memberType) AS chks
FROM Grp
ORDER BY
      memberID
    , memberType
    , grpNr
;

Open in new window

| GRPNR |        MEMBERID | MEMBERTYPE |       CHKS |
|-------|-----------------|------------|------------|
|     1 | 000000000599267 |     Client | -288349565 |
|     3 | 000000000599267 |     Client | -288349565 |
|     4 | 000000000599267 |     Client | -288349565 |
|     1 | 000000000599268 |     Client | -372235645 |
|     3 | 000000000599268 |     Client | -372235645 |
|     2 | 000000004192744 |     Client | -841162178 |
|     4 | 000000004192744 |     Client | -841162178 |
|     2 | 000000004192745 |     Client | -857939394 |
|     4 | 000000004192745 |     Client | -857939394 |

Open in new window

http://sqlfiddle.com/#!6/c898f/3
0
 
LVL 1

Author Comment

by:pven13
ID: 40243253
My server version: Microsoft SQL Server 2012 (SP1) - 11.0.3128.0 (X64), Build 7601: Service Pack 1.

This is the output:
grpNr       memberID                                           memberType                                         chks
----------- -------------------------------------------------- -------------------------------------------------- -----------
1           000000000599267                                    Client                                             887183941
3           000000000599267                                    Client                                             887183941
4           000000000599267                                    Client                                             887183941
1           000000000599268                                    Client                                             935418437
3           000000000599268                                    Client                                             935418437
2           000000004192744                                    Client                                             166343886
4           000000004192744                                    Client                                             166343886
2           000000004192745                                    Client                                             181023950
4           000000004192745                                    Client                                             181023950

Open in new window



And when I use only memberID:

grpNr       chks
----------- -----------
1           50
2           50
3           50
4           -94274528

--Produced by

SELECT
      grpNr
    , CHECKSUM_AGG(CHECKSUM(memberID)) AS chks
FROM Grp
GROUP BY
      grpNr

Open in new window

0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 40243261
mmm, last try. instead of concatenating the 2 fields just separate then by a comma

SELECT
      grpNr
    , CHECKSUM_AGG(CHECKSUM(memberID , memberType)) AS chks
FROM Grp
GROUP BY
      grpNr
;
0
 
LVL 1

Author Comment

by:pven13
ID: 40243269
In that case I get:
grpNr       chks
----------- -----------
1           800
2           800
3           800
4           887792098

Open in new window


So still groups 1-2-3 result in the same checksum.
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 40243347
sorry, no idea why, this does not seem to be helping
0
 
LVL 36

Expert Comment

by:ste5an
ID: 40243378
You need a relational division to find your matching groups. See Joe's article about it: Divided We Stand: The SQL of Relational Division.
0
 
LVL 1

Author Comment

by:pven13
ID: 40243490
@PortletPaul - thanks for your time anyway.

@ste5an - this looks interesting, but I'm not sure how to apply this information. In my example, what would be the divisor?
0
 
LVL 36

Assisted Solution

by:ste5an
ste5an earned 800 total points
ID: 40243519
E.g.
 
DECLARE @Grp TABLE
    (
      grpNr INT ,
      memberID VARCHAR(50) ,
      memberType VARCHAR(50)
    );

INSERT  INTO @grp
VALUES  ( 1, '000000000599267', 'Client' ),
        ( 1, '000000000599268', 'Client' ),
        ( 2, '000000004192744', 'Client' ),
        ( 2, '000000004192745', 'Client' ),
        ( 3, '000000000599267', 'Client' ),
        ( 3, '000000000599268', 'Client' ),
        ( 4, '000000004192745', 'Client' ),
        ( 4, '000000000599267', 'Client' ),
        ( 4, '000000004192744', 'Client' );

SELECT  G0.grpNr ,
        RD.grpNr
FROM    ( SELECT DISTINCT
                    grpNr
          FROM      @Grp
        ) G0
        OUTER APPLY ( SELECT    G1.grpNr
                      FROM      @Grp G1
                      WHERE     NOT EXISTS ( SELECT G2.memberID
                                             FROM   @Grp G2
                                             WHERE  G2.grpNr = G1.grpNr
                                             EXCEPT
                                             SELECT G3.memberID
                                             FROM   @Grp G3
                                             WHERE  G3.grpNr = G0.grpNr )
                                AND ( SELECT    COUNT(*)
                                      FROM      @Grp G2
                                      WHERE     G2.grpNr = G1.grpNr
                                    ) = ( SELECT    COUNT(*)
                                          FROM      @Grp G3
                                          WHERE     G3.grpNr = G0.grpNr
                                        )
                                AND G1.grpNr != G0.grpNr
                      GROUP BY  G1.grpNr
                    ) RD;

Open in new window


But test it carfully.

Caveat: I'm ignoring the memberType for simplicity ;) It also uses the assumption that (grpNr, memberID) are unique.
0
 
LVL 49

Assisted Solution

by:PortletPaul
PortletPaul earned 400 total points
ID: 40243617
I don't know how well this will perform, but it may work for you
WITH
      CTE
      AS (
                  SELECT
                        grpNr
                      , memberID
                      , memberType
                      , ROW_NUMBER() OVER (PARTITION BY grpNr ORDER BY memberID ASC, memberType ASC) AS rowNum
                      , COUNT(memberID) OVER (PARTITION BY grpNr)                                    AS numRows
                  FROM dbo.grp
            )
SELECT DISTINCT
      g1.grpNr
FROM CTE AS g1
      INNER JOIN CTE AS g2
                  ON g1.grpNr <> g2.grpNr
                        AND g1.memberID = g2.memberID
                        AND g1.memberType = g2.memberType
                        AND g1.rowNum = g2.rowNum
                        AND g1.numRows = g2.numRows
;

Open in new window

http://sqlfiddle.com/#!6/c898f/14
0
 
LVL 1

Author Comment

by:pven13
ID: 40243642
Both thank you for your reaction. I will not be able to look into it until tomorrow morning, but I'll get back on you a.s.a.p.
0
 
LVL 36

Expert Comment

by:ste5an
ID: 40243648
Use
WITH    CTE
          AS ( SELECT   grpNr ,
                        memberID ,
                        ROW_NUMBER() OVER ( PARTITION BY grpNr ORDER BY memberID ASC ) AS rowNum ,
                        COUNT(memberID) OVER ( PARTITION BY grpNr ) AS numRows
               FROM     @grp
             )
    SELECT DISTINCT
            g1.grpNr ,
            g2.grpNr
    FROM    CTE AS g1
            LEFT JOIN CTE AS g2 ON g1.grpNr <> g2.grpNr
                                   AND g1.memberID = g2.memberID
                                   AND g1.rowNum = g2.rowNum
                                   AND g1.numRows = g2.numRows;

Open in new window

and it depends on the indices. But it gets some table spools, so it may be indeed faster.
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 40243697
@ste5an, it needs to be an inner join :)
0
 
LVL 36

Expert Comment

by:ste5an
ID: 40243723
Just cosmetics. The INNER JOIN simply shows use a list of dups. the LEFT JOIN shows us all:

(9 row(s) affected)
grpNr       grpNr
----------- -----------
1           3
2           NULL
3           1
4           NULL

(4 row(s) affected)

grpNr       grpNr
----------- -----------
1           3
3           1

(2 row(s) affected)

Open in new window


My point was adding g2.grpNr to display which group is the dup :)
0
 
LVL 36

Expert Comment

by:ste5an
ID: 40243761
@PortletPaul: Take a look at this one, I'm not sure whether it's a copy and paste error.. Group 3 and Group 4 are not equal.

DECLARE @Grp TABLE
    (
      grpNr INT ,
      memberID VARCHAR(50) ,
      memberType VARCHAR(50)
    );

INSERT  INTO @grp
VALUES  ( 1, '1', 'Client' ),
        ( 1, '2', 'Client' ),
        ( 2, '3', 'Client' ),
        ( 2, '4', 'Client' ),
        ( 3, '1', 'Client' ),
        ( 3, '2', 'Client' ),
        ( 3, '3', 'Client' ),
        ( 4, '4', 'Client' ),
        ( 4, '1', 'Client' ),
        ( 4, '3', 'Client' );

SELECT  G0.grpNr ,
        RD.grpNr
FROM    ( SELECT DISTINCT
                    grpNr
          FROM      @Grp
        ) G0
        OUTER APPLY ( SELECT    G1.grpNr
                      FROM      @Grp G1
                      WHERE     NOT EXISTS ( SELECT G2.memberID
                                             FROM   @Grp G2
                                             WHERE  G2.grpNr = G1.grpNr
                                             EXCEPT
                                             SELECT G3.memberID
                                             FROM   @Grp G3
                                             WHERE  G3.grpNr = G0.grpNr )
                                AND ( SELECT    COUNT(*)
                                      FROM      @Grp G2
                                      WHERE     G2.grpNr = G1.grpNr
                                    ) = ( SELECT    COUNT(*)
                                          FROM      @Grp G3
                                          WHERE     G3.grpNr = G0.grpNr
                                        )
                                AND G1.grpNr != G0.grpNr
                      GROUP BY  G1.grpNr
                    ) RD;

WITH    CTE
          AS ( SELECT   grpNr ,
                        memberID ,
                        memberType ,
                        ROW_NUMBER() OVER ( PARTITION BY grpNr ORDER BY memberID ASC, memberType ASC ) AS rowNum ,
                        COUNT(memberID) OVER ( PARTITION BY grpNr ) AS numRows
               FROM     @Grp G
             )
    SELECT DISTINCT
            g1.grpNr ,
            g2.grpNr
    FROM    CTE AS g1
            INNER JOIN CTE AS g2 ON g1.grpNr <> g2.grpNr
                                    AND g1.memberID = g2.memberID
                                    AND g1.memberType = g2.memberType
                                    AND g1.rowNum = g2.rowNum
                                    AND g1.numRows = g2.numRows;

Open in new window

0
 
LVL 36

Expert Comment

by:ste5an
ID: 40243788
Found it: DISTINCT and INNER are hiding some rows..

DECLARE @Grp TABLE
    (
      grpNr INT ,
      memberID VARCHAR(50) ,
      memberType VARCHAR(50)
    );

INSERT  INTO @grp
VALUES  ( 1, '1', 'Client' ),
        ( 1, '2', 'Client' ),
        ( 2, '3', 'Client' ),
        ( 2, '4', 'Client' ),
        ( 3, '1', 'Client' ),
        ( 3, '2', 'Client' ),
        ( 3, '3', 'Client' ),
        ( 4, '4', 'Client' ),
        ( 4, '1', 'Client' ),
        ( 4, '3', 'Client' );

WITH    CTE
          AS ( SELECT   G.grpNr ,
                        G.memberID ,
                        G.memberType ,
                        ROW_NUMBER() OVER ( PARTITION BY G.grpNr ORDER BY G.memberID ASC, G.memberType ASC ) AS rowNum ,
                        COUNT(G.memberID) OVER ( PARTITION BY G.grpNr ) AS numRows
               FROM     @Grp G
             )
    SELECT DISTINCT
            g1.*,
            g2.*
    FROM    CTE AS g1
            left JOIN CTE AS g2 ON g1.grpNr <> g2.grpNr
                                    AND g1.memberID = g2.memberID
                                    AND g1.memberType = g2.memberType
                                    AND g1.rowNum = g2.rowNum
                                    AND g1.numRows = g2.numRows;

Open in new window

0
 
LVL 36

Expert Comment

by:ste5an
ID: 40243831
@pven13:

The NOT EXISTS with the EXCEPT operation is the relational division. The comparision of the cardinalities - the COUNT(*)'s - is needed to match for identity in those groups.

When you remove the cardinality check, then you get also the covered groups. In your sample, group 4 covers group 2.
0
 
LVL 70

Accepted Solution

by:
Scott Pletcher earned 800 total points
ID: 40244257
I think this is reasonably straightforward, and can be done without Windowing functions.

If you have a lot of members, you might gain some efficiency by putting the "grpIndivMbrCounts" into a temp table clustered by grpNr.


;WITH grpIndivMbrCounts AS (
    SELECT grpNr, COUNT(memberID) AS grpMbrCount
    FROM Grp
    GROUP BY grpNr
)
SELECT grpJoinedMbrCounts.*, grpI1.grpMbrCount
FROM (
    SELECT g1.grpNr AS grpNr1, g2.grpNr AS grpNr2, COUNT(*) AS grpMbrCount
    FROM Grp g1
    INNER JOIN Grp g2 ON
        g1.grpNr < g2.grpNr AND
        g1.memberID = g2.memberId
    GROUP BY
        g1.grpNr, g2.grpNr
) AS grpJoinedMbrCounts
INNER JOIN grpIndivMbrCounts AS grpI1 ON
    grpI1.GrpNr = grpJoinedMbrCounts.grpNr1 AND
    grpI1.grpMbrCount = grpJoinedMbrCounts.grpMbrCount
INNER JOIN grpIndivMbrCounts AS grpI2 ON
    grpI2.GrpNr = grpJoinedMbrCounts.grpNr2 AND
    grpI2.grpMbrCount = grpJoinedMbrCounts.grpMbrCount
0
 
LVL 36

Expert Comment

by:ste5an
ID: 40244515
@Scott: But it requires some interpretation, cause the results are not symmetric.
0
 
LVL 70

Expert Comment

by:Scott Pletcher
ID: 40244570
@ste5an:

Huh?  It lists, only once (without using a "DISTINCT" kludge with its large overhead), the duplicate combinations, with the lower group number always grp1, and the total number of members that are duplicated in those groups.  What more would be needed?
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 40244952
Scott's solution is very nice.
Ensuring both memberId and memberType are evaluated:
after  "g1.memberID = g2.memberId" (line 12)
add "AND g1.memberType = g2.memberType"
0
 
LVL 36

Expert Comment

by:ste5an
ID: 40245044
Yup, and using != instead of < gives the necessary symmetry.
0
 
LVL 70

Expert Comment

by:Scott Pletcher
ID: 40245060
>> Yup, and using != instead of < gives the necessary symmetry. <<

But that also gives duplicate results.

I want my joins to be 1 to 2, 1 to 3, 1 to 4, etc., only, not both 1 to 2 and 2 to 1, 1 to 3 and 3 to 1, 1 to 4 and 4 to 1, etc..  The latter would of course be wasted I/O, and clutter the results with duplicate values, just in reverse order (even more confusing).
0
 
LVL 1

Author Comment

by:pven13
ID: 40245532
Thank you all for your contributions. The solutions offered by ScottPletcher and ste5an both work fine, even on the full dataset, and as far as I have tested so far give the right results. PortletPaul's solution needed some adjustments, but was also useful. It was nice to see these different approaches leading to the same result.

I decided to divide the 500 points over you three, with ScottPletcher and ste5an receiveing more points than PortletPaul. Hope nobody feels offended by that...

Thanks again. I adjusted my code based on ScottPletcher's solution, because it performed better with 1 million members.
0

Featured Post

Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This article shows gives you an overview on SQL Server 2016 row level security. You will also get to know the usages of row-level-security and how it works
It is possible to export the data of a SQL Table in SSMS and generate INSERT statements. It's neatly tucked away in the generate scripts option of a database.
Via a live example, show how to extract information from SQL Server on Database, Connection and Server properties
Viewers will learn how to use the INSERT statement to insert data into their tables. It will also introduce the NULL statement, to show them what happens when no value is giving for any given column.
Suggested Courses

850 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question