Link to home
Start Free TrialLog in
Avatar of 25112
25112

asked on

merge duplicate record details in same row..

select count(*), EMP_ID FROM EMPHIST WHERE EMPHIST_DATE = '2012-01-01' GROUP BY EMP_ID HAVING COUNT(*)=2 order by 1
There are some entries of Employees listed in more than one office for a particular date.

EMPHIST can be joined OFFICE &  LOCATION to get some other details..

we need a list of the duplicate in one row if possible..

example: Let's say John Adams in listed under Office A and also Office B as of Jan 1st 2012.. we want to report such cases.. with the duplicate info all in one row..
1st column: Office A's code (in Office table)
2nd Column: Office A's Description (in Office table)
3rd Column: County of Office A (in Office table)
4th Column: Location Number of Office A (in LOCATION table)
5th Column: Location Name of Office A(LOCATION table)
6th Column: Employee ID
7th column: Employee FirstName
8th column: Employee LastName
9st column: Office B's code (in Office table)
10th Column: Office B's Description (in Office table)
11th Column: County of Office B (in Office table)
12th Column: Location Number of Office B(in LOCATION table)
13th Column: Location Name of Office B(LOCATION table)
14th Column: Employee ID
15th column: Employee FirstName
16th column: Employee LastName

EmpHist and Office link through Office_key (PK-FK)

EmpHist.OFFICE_KEY = Office.OFFICE_KEY

office links to Location..
Office.Location_KEY = Location.Location_KEY

how would you merge these 2 sets of 8 columns into one row (16 columns) in tsql?
Avatar of Sharath S
Sharath S
Flag of United States of America image

Check this. Not thested though as I do not have any sample data. Change the column/table names as per your schema.
;WITH EMPHIST_CTE 
     AS (SELECT * 
           FROM (SELECT *, 
                        COUNT(*) OVER ( partition BY EMP_ID, EMPHIST_DATE) cnt, 
                        ROW_NUMBER() OVER ( partition BY EMP_ID, EMPHIST_DATE ORDER BY OFFICE_KEY) rn 
                   FROM EMPHIST 
                  WHERE EMPHIST_DATE = '2012-01-01') t1 
          WHERE cnt = 2), 
     EMPHIST_CTE2 
     AS (SELECT EMP_ID, 
                EMP_FNAME, 
                EMP_LNAME, 
                MAX(CASE rn 
                      WHEN 1 THEN OFFICE_KEY 
                    END) OFFICE_KEY1, 
                MAX(CASE rn 
                      WHEN 2 THEN OFFICE_KEY 
                    END) OFFICE_KEY2 
           FROM EMPHIST_CTE 
          GROUP BY EMP_ID, 
                   EMP_FNAME, 
                   EMP_LNAME) 
SELECT o1.Code,o1.Description,o1.Country,l1.Location_Number,l1.Location_Name,
       e1.EMP_ID,e1.EMP_FNAME,e1.EMP_LNAME,
	   o2.Code,o2.Description,o2.Country,l2.Location_Number,l2.Location_Name 
  FROM EMPHIST_CTE2 e1 
       JOIN OFFICE o1 
         ON e1.OFFICE_KEY1 = o1.OFFICE_KEY 
       JOIN LOCATION l1 
         ON o1.Location_KEY = l1.Location_KEY 
       JOIN OFFICE o2 
         ON e1.OFFICE_KEY2 = o1.OFFICE_KEY 
       JOIN LOCATION l1 
         ON o1.Location_KEY = l1.Location_KEY 

Open in new window

Avatar of 25112
25112

ASKER

thanks very much, Sharath.. i tweaked it a little and it works good, now...

sometimes the firstname/lastname are not guarenteed to be the same in duplicates..it may have a small letter transposed or written differently.. (example- Matt vs Matthew etc)  so i worked around it as you can see below..

because of this i had to call EMPHIST twice at the end.. so i just am including EMPID and OFFICE_KEY in the first CTE instead of calling the whole table ..

i also had to add 2 WHERE conditions after the final joins (s1.EMPHIST_Date = '2012-01-01' and s2.EMPHIST_Date = '2012-01-01') as there may be records with other dates also.

i first ran the first CTE and it took 30 minutes.. maybe it cached it.. then when i ran the whole thing, it ran in 4 minutes.. (30k rows)..

0)with the above logic tweaks, is there any other code changes that can benefit the query?
1)with the below, can you recommend any indexes that can cut down the run time?
2)can you save this resultset to a static table through code (like 'SELECT * INTO tmp FROM SOMETABLE' syntax)
3)if I want to run this for 3 duplicate records, i just have to change "=2" to "=3", right? (and add a few more EMPHIST joins at the end...) it will have to be seperate reports/query for 2,3,4 duplicates and so forth, right?
-----------------------------------------------------------------------------
;WITH EMPHIST_CTE
     AS (SELECT *
           FROM (SELECT EMP_ID, OFFICE_KEY,
                        COUNT(*) OVER ( partition BY EMP_ID) cnt,
                        ROW_NUMBER() OVER ( partition BY EMP_ID ORDER BY EMP_ID) rn
                   FROM EMPHIST
                  WHERE EMPHIST_Date = '2012-01-01' ) t1
          WHERE cnt = 2),
     EMPHIST_CTE2
     AS (SELECT EMP_ID,
                MAX(CASE rn
                      WHEN 1 THEN OFFICE_KEY
                    END) OFFICE_KEY1,
                MAX(CASE rn
                      WHEN 2 THEN OFFICE_KEY
                    END) OFFICE_KEY2
           FROM EMPHIST_CTE
          GROUP BY EMP_ID)
SELECT o1.Code,o1.Description,o1.County,l1.Location_Number,l1.Location_Name,
       e1.EMP_ID,s1.EMP_FNAME,s1.EMP_LNAME,
         o2.Code,o2.Description,o2.County,l2.Location_Number,l2.Location_Name,
         s2.EMP_FNAME,s2.EMP_LNAME,
  FROM EMPHIST_CTE2 e1
            JOIN EMPHIST s1
            ON e1.EMP_ID = s1.EMP_ID and e1.OFFICE_KEY1 = s1.OFFICE_KEY
       JOIN OFFICE o1
         ON e1.OFFICE_KEY1 = o1.OFFICE_KEY
       JOIN LOCATION l1
         ON s1.LOCATION_KEY = l1.LOCATION_KEY
            JOIN EMPHIST s2
            ON e1.EMP_ID = s2.EMP_ID and e1.OFFICE_KEY2 = s2.OFFICE_KEY
       JOIN OFFICE o2
         ON e1.OFFICE_KEY2 = o2.OFFICE_KEY
       JOIN LOCATION l2
         ON s2.LOCATION_KEY = l2.LOCATION_KEY
         WHERE s1.EMPHIST_Date = '2012-01-01'
         AND s2.EMPHIST_Date = '2012-01-01'
ASKER CERTIFIED SOLUTION
Avatar of Sharath S
Sharath S
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of 25112

ASKER

Thanks- from 4 minutes it is cut to 1 minute- your idea was great..

1)
i liked how you avoided the extra joins by modifying the first CTE to partition by the date also.
this is very efficient than my original thinking.. . but i did not understand your comment
'I do not suggest joining to EMPHIST again because there will be duplicate records on EMP_ID which results in unnecessary duplicates records.'
it would have been inefficient, but can you please tell me how it would have created dups?

2)
Also combining the name suggestion was nice. It is reasonable enough to have Matt for Matthew or vice versa.
if someone's maiden name is changed for example, and if that is important, i have to go back to adding those extra joins from the previous steps, right?   right now, we'll leave it as it is- it is good.

2a)
one change i did which you will see is we need LOCATION_KEY for the joins. So i brought it into the first CTE.

2b)
i also modified it for upto 4 duplicates, with 2 more left joins.. so there will be lot of NULLs, which is OK... please comment if they are efficient, with 2 more LEFT joins, as many users could be running the same report at the same time.

attached file: Query1.sql


3)
I needed one more tweak...  (pl refer query2.sql)

several offices will be running the report at the same time.. but the office must see only their set of data.
so for example, Office A is running a report, and they want to see which other offices have that employee (say, EmployeeA) listed under them for that date. This OFFICE_KEY will be sent as a parameter.
Office B, Office C, and Office D are also sharing that employee for that date. When Office A runs it they should see the details of EmpID of EmployeeA and the details of Office B,Office C and Office D for that Employee in the report.
Now they may be sharing EmployeeB with Office E & Office F.. so that line also with those 2 records should be in the report.

so with the above example, the result will be like this (Office A running to see which of their Employees were listed under other offices for that date)

EmployeeA Office_B_Details Office_C_Details Office_D_Details
EmployeeB Office_E_Details Office_F_Details NULL

etc

When OfficeB runs the report, they will see a totally different set of result..

so we need some kind of filtering with the OFFICE_KEY that will be passed... (this will reduce (upto) 4 sets of data that we  have now down to 3) in the attached example, 11203 is the office_key that is requesting the data.

i tried one solution, it is attached as Query2.sql.. can you please review it and give me your comments.

i wanted to make sure the requesting officekey is always the first one (rn=1).. hence i created a temp table and deleted the records that dont have this office at all, and then i rearrange the records to make sure this Office is always on the top.. then I display only 3 sets of data (o2,o3,o4) as o1 will be the requesting office.. this likely is not too efficient .. when first executed, it took 2 minutes.. just after a few runs, it is executing as previously (1 minute)... maybe it is caching it.. also i just realized i need to put isnull for o3 and o4 values since they may be NULLs.. but o1 & o2 should always have a value

please suggest a better way to do this..

thanks again very much Sharath.
Query1.sql
Query2.sql