Find duplicates within row and across group

johnnyg123
johnnyg123 used Ask the Experts™
on
Here is some sample data from 3 SQL Server 2014 tables

 CREATE TABLE #Vendor ( 
     ID int NOT NULL,
	 LastName varchar(50),
	 FirstName varchar(50),
	 State1 varchar(50),
	 License1 varchar(50),
	 State2 varchar(50),
	 License2 varchar(50),
	 State3 varchar(50),
	 License3 varchar(50)

        )
 INSERT INTO #Vendor VALUES

 (1, 'Doe','John','CA','45678','TX','178989','CA','99999'),
 (2, 'Smith','John','CA','12ABCD','CA','12ABCD','NY','45678'),
 (3, 'Doe','Jane','CA','09876',Null,Null,Null,Null),
 (3, 'Doe','Jane','CA','09877','CA','45678',Null,Null)



--------------------------------------------------------------------------------



  CREATE TABLE #SalesPerson ( 
     ID int NOT NULL,
	 LastName varchar(50),
	 FirstName varchar(50),
	 State varchar(50),
	 License varchar(50),
	  )

 INSERT INTO #SalesPerson VALUES

 (1, 'Doe','John','CA','45678'),
 (2, 'Smith','John','CA','12ABCD'),
 (3, 'Doe','Jane','CA','09876')


--------------------------------------------------------------------------------



 CREATE TABLE #Master ( 
     ID int NOT NULL,
	 LastName varchar(50),
	 FirstName varchar(50),
	 State1 varchar(50),
	 License1 varchar(50),
	 CALicense1 char(1),
	 NameMatch1 char(1),
	 State2 varchar(50),
	 License2 varchar(50),
	 CALicense2 char(1),
	 NameMatch2 char(1),
	 State3 varchar(50),
	 License3 varchar(50),
	 CALicense3 char(1),
	 NameMatch3 char(1),
	 UniqueCALicenseWithinID char(1),
	 UniqueCALicenseAcrossID char(1)

        )



		 INSERT INTO #Master VALUES

 (1, 'Doe','John','CA','45678','Y','Y','TX','178989','N','','CA','99999','Y','N','N','Y'),
 (2, 'Smith','John','CA','12ABCD','Y','Y','CA','12ABCD','Y','Y','NY','45678','N','','Y','Y'),
 (3, 'Doe','Jane','CA','09876','Y','Y',Null,Null,'','',Null,Null,'','','Y','Y'),
 (3, 'Doe','Jane','CA','09877','Y','N','CA','45678','Y','Y',Null,Null,'','','Y','N')

----------------------------------------------------------------------------------------------------

Open in new window


Unfortunately, the data the vendor sends that is in the vendor table is not always correct.


Here is what we know about the vendor data

The sales ID and the name associated with it is correct.  (however the license id may be incorrect which would invalidate the record so that's why name check needs to be done    (also,there may or may not be an entry for all sales ids)

License set to null if state is null

There is only suppose to be one entry for each state but as you can see this is not always the case
One of the multiple state entries may or not have the correct state license info.  It could in fact be the license info for a different  sales person





Here is a bit of background on the data

                   A sales person can be licensed in multiple states

                  We are only interested in the sales people licensed in ca




As part of a data scrub process, I have created a master table in the layout shown above

Here is description of what the flags represent

(Note: all flags empty string if state and license are null)

 CALicense  ----    Indicates if the state associated with license is CA..  (Note: if not, NameMatch set to empty string)  (1 occurrence for each license)

NameMatch   ---  indicates if there is a last and first name match in the salesperson table for the License in the vendor table  (for CA)  (1 occurrence for each license)

UniqueLicenseWithinID  --- set to 'Y' if there are not different non null, CA license values within 3 license fields for the sales person id   (only take into account CA license.  possible to have same license number in different state)    This will indicate if this row is valid to use in subsequent processing

UniqueLicenseAcrossID --- set to 'Y' if there is not a match of non null, CA license values in any other license fields for all the sales person ids  (only take into account CA license.  possible to have same license number in different state)  This will indicate if this row is valid to use in subsequent processing   (Note: if this is 'N' all rows containing license will have it set to 'N' as well


The insert for the master table are the expected rows

I know the requirements seem a bit nutty

Please let me know if I need to clarify anything

Thanks!
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Distinguished Expert 2017

Commented:
This will be much easier if you normalize the data FIRST.
Duplicates:
select id, lastname, firstname, license
from
(	select b.id, b.lastname, b.firstname, b.license
	from
	(	select id, license = license1
		from Vendor
		where state1 = 'CA'
		union all 
		select id, license = license2
		from Vendor
		where state2 = 'CA'
		union all 
		select id, license = license3
		from Vendor
		where state3 = 'CA'
	) a
	join SalesPerson b
	on a.id = b.id
	and a.license = b.license
) c
group by id, lastname, firstname, license
having count(*) > 1

Open in new window

Author

Commented:
Thanks so much for all the comments!

Pat
I certainly do appreciate that it would be much easier if the data was normalized first

One thing I forgot to mention

There is a desire  (not mine but from others) to distinguish whether the duplicate was the found in columns within  the same row or whether it was
found within columns in different rows so the source of the error can be traced



John

your solution works perfectly for dups within columns within the same row

I need to also find dups across multiple rows as well

I do apologize for not making this clearer in my question

A separate query to find the dups across rows would be great!

Here is hopefully an example that is clearer with a bit more normalization and no need to get the name

CREATE TABLE #License(
       [id] [int] NULL,
       [license1] [nchar](10) NULL,
       [state1] [nchar](10) NULL,
       [license2] [nchar](10) NULL,
       [state2] [nchar](10) NULL,
       [license3] [nchar](10) NULL,
       [state3] [nchar](10) NULL)
      

 INSERT INTO #License
            ([id]
            ,[license1]
            ,[state1]
            ,[license2]
            ,[state2]
            ,[license3]
            ,[state3])
                              
      VALUES
            (11111111
            ,12345    
            ,'CA'
            ,34567    
            ,'IN'
          ,12345
           ,'CA'    
                           
                  )

 INSERT INTO #License
            ([id]
            ,[license1]
            ,[state1]
            ,[license2]
            ,[state2]
            ,[license3]
            ,[state3]
                        )
      VALUES
            (22222222
            ,66666    
            ,'MI'
            ,999999    
            ,'NY'
         ,135798    
            ,'CA'
                              
                  )


 INSERT INTO #License
            ([id]
            ,[license1]
            ,[state1]
            ,[license2]
            ,[state2]
            ,[license3]
            ,[state3]
                        )
      VALUES
            (333333333
         
           ,999999    
            ,'NY'
         ,135798    
            ,'CA'

                    ,66666    
            ,'MI'
                              
                  )


The query you provided will work perfectly to catch the dup for id 11111111

Just need a query that will return

id                  License          State
22222222    135798          CA

333333333   135798        CA

(doesn't matter which set of columns dups found in)

Thanks!
I put the data in first-normal-form so I could filter on State (CA), and join against SalesPerson to validate the license.  

In your latest post, you introduced new ID's, if these were added to SalesPerson with the corresponding valid license then I believe my query would identify duplication within a row, and across rows.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial