25112
asked on
merge duplicate record details in same row..
select count(*), EMP_ID FROM EMPHIST WHERE EMPHIST_DATE = '2012-01-01' GROUP BY EMP_ID HAVING COUNT(*)=2 order by 1
There are some entries of Employees listed in more than one office for a particular date.
EMPHIST can be joined OFFICE & LOCATION to get some other details..
we need a list of the duplicate in one row if possible..
example: Let's say John Adams in listed under Office A and also Office B as of Jan 1st 2012.. we want to report such cases.. with the duplicate info all in one row..
1st column: Office A's code (in Office table)
2nd Column: Office A's Description (in Office table)
3rd Column: County of Office A (in Office table)
4th Column: Location Number of Office A (in LOCATION table)
5th Column: Location Name of Office A(LOCATION table)
6th Column: Employee ID
7th column: Employee FirstName
8th column: Employee LastName
9st column: Office B's code (in Office table)
10th Column: Office B's Description (in Office table)
11th Column: County of Office B (in Office table)
12th Column: Location Number of Office B(in LOCATION table)
13th Column: Location Name of Office B(LOCATION table)
14th Column: Employee ID
15th column: Employee FirstName
16th column: Employee LastName
EmpHist and Office link through Office_key (PK-FK)
EmpHist.OFFICE_KEY = Office.OFFICE_KEY
office links to Location..
Office.Location_KEY = Location.Location_KEY
how would you merge these 2 sets of 8 columns into one row (16 columns) in tsql?
There are some entries of Employees listed in more than one office for a particular date.
EMPHIST can be joined OFFICE & LOCATION to get some other details..
we need a list of the duplicate in one row if possible..
example: Let's say John Adams in listed under Office A and also Office B as of Jan 1st 2012.. we want to report such cases.. with the duplicate info all in one row..
1st column: Office A's code (in Office table)
2nd Column: Office A's Description (in Office table)
3rd Column: County of Office A (in Office table)
4th Column: Location Number of Office A (in LOCATION table)
5th Column: Location Name of Office A(LOCATION table)
6th Column: Employee ID
7th column: Employee FirstName
8th column: Employee LastName
9st column: Office B's code (in Office table)
10th Column: Office B's Description (in Office table)
11th Column: County of Office B (in Office table)
12th Column: Location Number of Office B(in LOCATION table)
13th Column: Location Name of Office B(LOCATION table)
14th Column: Employee ID
15th column: Employee FirstName
16th column: Employee LastName
EmpHist and Office link through Office_key (PK-FK)
EmpHist.OFFICE_KEY = Office.OFFICE_KEY
office links to Location..
Office.Location_KEY = Location.Location_KEY
how would you merge these 2 sets of 8 columns into one row (16 columns) in tsql?
ASKER
thanks very much, Sharath.. i tweaked it a little and it works good, now...
sometimes the firstname/lastname are not guarenteed to be the same in duplicates..it may have a small letter transposed or written differently.. (example- Matt vs Matthew etc) so i worked around it as you can see below..
because of this i had to call EMPHIST twice at the end.. so i just am including EMPID and OFFICE_KEY in the first CTE instead of calling the whole table ..
i also had to add 2 WHERE conditions after the final joins (s1.EMPHIST_Date = '2012-01-01' and s2.EMPHIST_Date = '2012-01-01') as there may be records with other dates also.
i first ran the first CTE and it took 30 minutes.. maybe it cached it.. then when i ran the whole thing, it ran in 4 minutes.. (30k rows)..
0)with the above logic tweaks, is there any other code changes that can benefit the query?
1)with the below, can you recommend any indexes that can cut down the run time?
2)can you save this resultset to a static table through code (like 'SELECT * INTO tmp FROM SOMETABLE' syntax)
3)if I want to run this for 3 duplicate records, i just have to change "=2" to "=3", right? (and add a few more EMPHIST joins at the end...) it will have to be seperate reports/query for 2,3,4 duplicates and so forth, right?
-------------------------- ---------- ---------- ---------- ---------- ---------- -
;WITH EMPHIST_CTE
AS (SELECT *
FROM (SELECT EMP_ID, OFFICE_KEY,
COUNT(*) OVER ( partition BY EMP_ID) cnt,
ROW_NUMBER() OVER ( partition BY EMP_ID ORDER BY EMP_ID) rn
FROM EMPHIST
WHERE EMPHIST_Date = '2012-01-01' ) t1
WHERE cnt = 2),
EMPHIST_CTE2
AS (SELECT EMP_ID,
MAX(CASE rn
WHEN 1 THEN OFFICE_KEY
END) OFFICE_KEY1,
MAX(CASE rn
WHEN 2 THEN OFFICE_KEY
END) OFFICE_KEY2
FROM EMPHIST_CTE
GROUP BY EMP_ID)
SELECT o1.Code,o1.Description,o1. County,l1. Location_N umber,l1.L ocation_Na me,
e1.EMP_ID,s1.EMP_FNAME,s1. EMP_LNAME,
o2.Code,o2.Description,o2. County,l2. Location_N umber,l2.L ocation_Na me,
s2.EMP_FNAME,s2.EMP_LNAME,
FROM EMPHIST_CTE2 e1
JOIN EMPHIST s1
ON e1.EMP_ID = s1.EMP_ID and e1.OFFICE_KEY1 = s1.OFFICE_KEY
JOIN OFFICE o1
ON e1.OFFICE_KEY1 = o1.OFFICE_KEY
JOIN LOCATION l1
ON s1.LOCATION_KEY = l1.LOCATION_KEY
JOIN EMPHIST s2
ON e1.EMP_ID = s2.EMP_ID and e1.OFFICE_KEY2 = s2.OFFICE_KEY
JOIN OFFICE o2
ON e1.OFFICE_KEY2 = o2.OFFICE_KEY
JOIN LOCATION l2
ON s2.LOCATION_KEY = l2.LOCATION_KEY
WHERE s1.EMPHIST_Date = '2012-01-01'
AND s2.EMPHIST_Date = '2012-01-01'
sometimes the firstname/lastname are not guarenteed to be the same in duplicates..it may have a small letter transposed or written differently.. (example- Matt vs Matthew etc) so i worked around it as you can see below..
because of this i had to call EMPHIST twice at the end.. so i just am including EMPID and OFFICE_KEY in the first CTE instead of calling the whole table ..
i also had to add 2 WHERE conditions after the final joins (s1.EMPHIST_Date = '2012-01-01' and s2.EMPHIST_Date = '2012-01-01') as there may be records with other dates also.
i first ran the first CTE and it took 30 minutes.. maybe it cached it.. then when i ran the whole thing, it ran in 4 minutes.. (30k rows)..
0)with the above logic tweaks, is there any other code changes that can benefit the query?
1)with the below, can you recommend any indexes that can cut down the run time?
2)can you save this resultset to a static table through code (like 'SELECT * INTO tmp FROM SOMETABLE' syntax)
3)if I want to run this for 3 duplicate records, i just have to change "=2" to "=3", right? (and add a few more EMPHIST joins at the end...) it will have to be seperate reports/query for 2,3,4 duplicates and so forth, right?
--------------------------
;WITH EMPHIST_CTE
AS (SELECT *
FROM (SELECT EMP_ID, OFFICE_KEY,
COUNT(*) OVER ( partition BY EMP_ID) cnt,
ROW_NUMBER() OVER ( partition BY EMP_ID ORDER BY EMP_ID) rn
FROM EMPHIST
WHERE EMPHIST_Date = '2012-01-01' ) t1
WHERE cnt = 2),
EMPHIST_CTE2
AS (SELECT EMP_ID,
MAX(CASE rn
WHEN 1 THEN OFFICE_KEY
END) OFFICE_KEY1,
MAX(CASE rn
WHEN 2 THEN OFFICE_KEY
END) OFFICE_KEY2
FROM EMPHIST_CTE
GROUP BY EMP_ID)
SELECT o1.Code,o1.Description,o1.
e1.EMP_ID,s1.EMP_FNAME,s1.
o2.Code,o2.Description,o2.
s2.EMP_FNAME,s2.EMP_LNAME,
FROM EMPHIST_CTE2 e1
JOIN EMPHIST s1
ON e1.EMP_ID = s1.EMP_ID and e1.OFFICE_KEY1 = s1.OFFICE_KEY
JOIN OFFICE o1
ON e1.OFFICE_KEY1 = o1.OFFICE_KEY
JOIN LOCATION l1
ON s1.LOCATION_KEY = l1.LOCATION_KEY
JOIN EMPHIST s2
ON e1.EMP_ID = s2.EMP_ID and e1.OFFICE_KEY2 = s2.OFFICE_KEY
JOIN OFFICE o2
ON e1.OFFICE_KEY2 = o2.OFFICE_KEY
JOIN LOCATION l2
ON s2.LOCATION_KEY = l2.LOCATION_KEY
WHERE s1.EMPHIST_Date = '2012-01-01'
AND s2.EMPHIST_Date = '2012-01-01'
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thanks- from 4 minutes it is cut to 1 minute- your idea was great..
1)
i liked how you avoided the extra joins by modifying the first CTE to partition by the date also.
this is very efficient than my original thinking.. . but i did not understand your comment
'I do not suggest joining to EMPHIST again because there will be duplicate records on EMP_ID which results in unnecessary duplicates records.'
it would have been inefficient, but can you please tell me how it would have created dups?
2)
Also combining the name suggestion was nice. It is reasonable enough to have Matt for Matthew or vice versa.
if someone's maiden name is changed for example, and if that is important, i have to go back to adding those extra joins from the previous steps, right? right now, we'll leave it as it is- it is good.
2a)
one change i did which you will see is we need LOCATION_KEY for the joins. So i brought it into the first CTE.
2b)
i also modified it for upto 4 duplicates, with 2 more left joins.. so there will be lot of NULLs, which is OK... please comment if they are efficient, with 2 more LEFT joins, as many users could be running the same report at the same time.
attached file: Query1.sql
3)
I needed one more tweak... (pl refer query2.sql)
several offices will be running the report at the same time.. but the office must see only their set of data.
so for example, Office A is running a report, and they want to see which other offices have that employee (say, EmployeeA) listed under them for that date. This OFFICE_KEY will be sent as a parameter.
Office B, Office C, and Office D are also sharing that employee for that date. When Office A runs it they should see the details of EmpID of EmployeeA and the details of Office B,Office C and Office D for that Employee in the report.
Now they may be sharing EmployeeB with Office E & Office F.. so that line also with those 2 records should be in the report.
so with the above example, the result will be like this (Office A running to see which of their Employees were listed under other offices for that date)
EmployeeA Office_B_Details Office_C_Details Office_D_Details
EmployeeB Office_E_Details Office_F_Details NULL
etc
When OfficeB runs the report, they will see a totally different set of result..
so we need some kind of filtering with the OFFICE_KEY that will be passed... (this will reduce (upto) 4 sets of data that we have now down to 3) in the attached example, 11203 is the office_key that is requesting the data.
i tried one solution, it is attached as Query2.sql.. can you please review it and give me your comments.
i wanted to make sure the requesting officekey is always the first one (rn=1).. hence i created a temp table and deleted the records that dont have this office at all, and then i rearrange the records to make sure this Office is always on the top.. then I display only 3 sets of data (o2,o3,o4) as o1 will be the requesting office.. this likely is not too efficient .. when first executed, it took 2 minutes.. just after a few runs, it is executing as previously (1 minute)... maybe it is caching it.. also i just realized i need to put isnull for o3 and o4 values since they may be NULLs.. but o1 & o2 should always have a value
please suggest a better way to do this..
thanks again very much Sharath.
Query1.sql
Query2.sql
1)
i liked how you avoided the extra joins by modifying the first CTE to partition by the date also.
this is very efficient than my original thinking.. . but i did not understand your comment
'I do not suggest joining to EMPHIST again because there will be duplicate records on EMP_ID which results in unnecessary duplicates records.'
it would have been inefficient, but can you please tell me how it would have created dups?
2)
Also combining the name suggestion was nice. It is reasonable enough to have Matt for Matthew or vice versa.
if someone's maiden name is changed for example, and if that is important, i have to go back to adding those extra joins from the previous steps, right? right now, we'll leave it as it is- it is good.
2a)
one change i did which you will see is we need LOCATION_KEY for the joins. So i brought it into the first CTE.
2b)
i also modified it for upto 4 duplicates, with 2 more left joins.. so there will be lot of NULLs, which is OK... please comment if they are efficient, with 2 more LEFT joins, as many users could be running the same report at the same time.
attached file: Query1.sql
3)
I needed one more tweak... (pl refer query2.sql)
several offices will be running the report at the same time.. but the office must see only their set of data.
so for example, Office A is running a report, and they want to see which other offices have that employee (say, EmployeeA) listed under them for that date. This OFFICE_KEY will be sent as a parameter.
Office B, Office C, and Office D are also sharing that employee for that date. When Office A runs it they should see the details of EmpID of EmployeeA and the details of Office B,Office C and Office D for that Employee in the report.
Now they may be sharing EmployeeB with Office E & Office F.. so that line also with those 2 records should be in the report.
so with the above example, the result will be like this (Office A running to see which of their Employees were listed under other offices for that date)
EmployeeA Office_B_Details Office_C_Details Office_D_Details
EmployeeB Office_E_Details Office_F_Details NULL
etc
When OfficeB runs the report, they will see a totally different set of result..
so we need some kind of filtering with the OFFICE_KEY that will be passed... (this will reduce (upto) 4 sets of data that we have now down to 3) in the attached example, 11203 is the office_key that is requesting the data.
i tried one solution, it is attached as Query2.sql.. can you please review it and give me your comments.
i wanted to make sure the requesting officekey is always the first one (rn=1).. hence i created a temp table and deleted the records that dont have this office at all, and then i rearrange the records to make sure this Office is always on the top.. then I display only 3 sets of data (o2,o3,o4) as o1 will be the requesting office.. this likely is not too efficient .. when first executed, it took 2 minutes.. just after a few runs, it is executing as previously (1 minute)... maybe it is caching it.. also i just realized i need to put isnull for o3 and o4 values since they may be NULLs.. but o1 & o2 should always have a value
please suggest a better way to do this..
thanks again very much Sharath.
Query1.sql
Query2.sql
Open in new window