Solved

Query to omit near-duplicate rows

Posted on 2008-09-30
24
803 Views
Last Modified: 2012-08-13
I am trying to query a call log database with (simplified) records like this:

Extn  CallDateTime
1234  2008-09-24 12:04:08.493
1234  2008-09-24 12:04:08.523
1234  2008-09-24 12:04:08.536
1345  2008-09-24 16:06:00.707
1345  2008-09-24 16:06:00.740
1345  2008-09-24 16:06:00.753
1456  2008-09-24 16:07:53.943
1456  2008-09-24 16:07:53.973
1456  2008-09-24 16:07:54.003

As you can see, the phone system logs each call three times identically, except for the CallDateTime column, which will differ by fractions of a second. I would like to be able to have these "near" duplicate records filtered out. Basically, if multiple rows are otherwise identical and their CallDateTime values are within a second of each other, just show me one of them. Similar to this:

Extn  CallDateTime
1234  2008-09-24 12:04:08.536
1345  2008-09-24 16:06:00.753
1456  2008-09-24 16:07:54.003

Or even with the time rounded to the nearest second (but I can do that later).

My first attempt at this was to truncate off the milliseconds, then union two sets. But sometimes, the times cross the second border, and truncating does not always yield identical rows. Any ideas?
0
Comment
Question by:tfgeorge
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 9
  • 8
  • 6
  • +1
24 Comments
 
LVL 11

Expert Comment

by:aaronakin
ID: 22609561
Try this...

SELECT Extn, CallDateTime
  FROM
  (
  SELECT Extn, CAST(CallDateTime AS SMALLDATETIME) CallDateTime
    FROM TableName
  ) a
  GROUP BY Extn, CallDateTime
0
 
LVL 11

Expert Comment

by:aaronakin
ID: 22609570
You could add the number of calls to this as well.

SELECT Extn, CallDateTime, COUNT(*) AS NumCalls
  FROM
  (
  SELECT Extn, CAST(CallDateTime AS SMALLDATETIME) CallDateTime
    FROM TableName
  ) a
  GROUP BY Extn, CallDateTime
0
 
LVL 42

Expert Comment

by:dqmq
ID: 22609678
I think the easiest way is to convert the datetime to number of seconds since some point in time using the datediff function.

Select Extn, min(CallDateTime)  
from YourTable
group by datediff("s", CallDateTime, now())
0
What is SQL Server and how does it work?

The purpose of this paper is to provide you background on SQL Server. It’s your self-study guide for learning fundamentals. It includes both the history of SQL and its technical basics. Concepts and definitions will form the solid foundation of your future DBA expertise.

 
LVL 42

Expert Comment

by:dqmq
ID: 22609685
oops... this is what I meant:

Select Extn, min(CallDateTime)  
from YourTable
group by Extn, datediff("s", CallDateTime, now())
0
 
LVL 22

Expert Comment

by:dportas
ID: 22609752
Try this:

SELECT Extn, CallDateTime
FROM
  (SELECT Extn, CallDateTime,
   ROW_NUMBER() OVER (PARTITION BY Extn ORDER BY CallDateTime) r
   FROM tbl) t
WHERE r%3 = 0;

(assumes SQL Server 2005 / 2008)
0
 
LVL 1

Author Comment

by:tfgeorge
ID: 22614774
aaronakin: Your solutions just ignore seconds, which is not what I want, and it still won't work when the times span across the minute boundary.

dqmq: Your solution has the same problem as my first attempt - it just ignores the milliseconds and fails when the time differences cross the second boundary.

dportas: Sorry, but I didn't clarify. This needs to work on MSSQL 2000.
0
 
LVL 11

Expert Comment

by:aaronakin
ID: 22615075
How about this?  It might need to be tweaked slightly, but it should get you closer.

DECLARE @Now DATETIME
SET @Now = GETDATE()

SELECT Extn, CallDateTime
  FROM
  (
  SELECT Extn, ROUND(CAST(DATEDIFF(ms,@Now,CallDateTime) AS DECIMAL(15,2))/1000,0) AS CallDateTime
    FROM TableName
  ) a
  GROUP BY Extn, CallDateTime
0
 
LVL 1

Author Comment

by:tfgeorge
ID: 22615808
aaronakin: I don't think rounding and casting can solve the problem. That will just shift and spread out the boundaries that need to be spanned to cause the problem. For example, I could round them all to the nearest day, but there will still be some instances where they just happen to span that day boundary.
0
 
LVL 11

Expert Comment

by:aaronakin
ID: 22616149
Did you try it out?  I'm really taking all the parts of the day out of the question.  It figures out the number of milliseconds from a particular point in time and then rounds it.  This means that any datetimes within 1 second will be considered the same.  It won't matter which day, hour, minute or second it occurred, just as long as it's within a 1 second period.
0
 
LVL 1

Author Comment

by:tfgeorge
ID: 22616437
Firstly, I get an error with DATEDIFF using milliseconds. See: here.

So if I change it to use seconds and no longer divide by 1000, it still fails to work some of the time.

I don't think your example considers datetimes within a one second period of each other as being the same. It considers datetimes that occurred the same rounded number of seconds ago from now to be the same. But counting the number of seconds or milliseconds from now is no different than counting the number or seconds or milliseconds from 1/1/1753, which is what a datetime is. You will occasionally have datetimes that span that rounding boundary.

Another way to look at it is this: I don't want datetimes that fall within a specific second to be considered the same, I want datetimes that fall within any one second interval to be considered the same. Does that make sense? I hope I'm not off here, but in any case your example does not do what I need on my table.
0
 
LVL 11

Expert Comment

by:aaronakin
ID: 22617330
Ok, what do you want to do in the following case?  There are 3 records; records 1 and 2 are within 1 second and records 2 and 3 are within 1 second, but records 1 & 3 are over 1 second.  In this case, what do you want returned?

Extn      CallDateTime
1234      2008-09-24 12:04:08.500
1234      2008-09-24 12:04:08.900
1234      2008-09-24 12:04:09.600

Let me know and I'll see what I can come up with.
0
 
LVL 1

Author Comment

by:tfgeorge
ID: 22617519
That shouldn't ever happen. The three entries are all going to be within a tenth of a second or so of each other, definitely within a second from the earliest to the last. I'm beginning to think it might have to be done using cursors or another row-by-row method.
0
 
LVL 11

Expert Comment

by:aaronakin
ID: 22617570
If that's the case...

If every Extn has 3 datetimes, and they will always be within 1 second of each other, why not just group by Extn and select the min or max date?  Maybe we just need some more info here.  Can one extension have multiple groups of 3 calls?  If so, what column(s) make up the PK?
0
 
LVL 1

Author Comment

by:tfgeorge
ID: 22618180
I left the other columns out for simplicity's sake, and tried to reduce the question to the part I didn't know how to do. An Extn can, and does have many calls logged to it, and each call is logged one, two, three, or more times (but all within a second). There is one column with a random (non-sequential) but unique integer id.
0
 
LVL 42

Expert Comment

by:dqmq
ID: 22619256
Try this:


Select Extn, CallDateTime  
from YourTable T1
where not exists
  (select 1 from YourTable T2
       where T2.id <> T1.id
           and T2.CallDateTime <= DateAdd(s,1,T1.CallDateTime) )
0
 
LVL 42

Expert Comment

by:dqmq
ID: 22619271
Forgot a line, try:

Select Extn, CallDateTime  
from YourTable T1
where not exists
  (select 1 from YourTable T2
       where T1.extn = t2.extn
           and T2.id <> T1.id
           and T2.CallDateTime <= DateAdd(s,1,T1.CallDateTime) )
0
 
LVL 1

Author Comment

by:tfgeorge
ID: 22624156
dqmq: Your solution doesn't work. I haven't had time to analyze it, but the number of rows returned is way too few to be correct.

I think I've figured out a solution using cursors. I was hoping for a set based solution, but I'm not sure it is possible to compare rows to each other within a sliding time window using a simple query. Here's my solution which seems to work. Am I missing anything here?

DECLARE  @Extn nvarchar(16),
   @CallDateTime datetime,
   @LastExtn nvarchar(16),
   @LastCallDateTime datetime
SET @LastExtn = 'blahblah'

DECLARE get_calls CURSOR FOR
SELECT Extn, CallDateTime FROM MyTable
ORDER BY CallDateTime ASC

OPEN get_calls
FETCH NEXT FROM get_calls INTO @Extn, @CallDateTime

WHILE @@FETCH_STATUS = 0
BEGIN
   IF (@Extn <> @LastExtn) OR (@CallDateTime > DATEADD(ss, 1, @LastCallDateTime)) SELECT @Extn, @CallDateTime
   
   SET @LastExtne = @Extn
   SET @LastCallDateTime = @CallDateTime
   FETCH NEXT FROM get_calls INTO @Extn, @CallDateTime
END

CLOSE get_calls
DEALLOCATE get_calls
0
 
LVL 11

Expert Comment

by:aaronakin
ID: 22627128
Don't give up hope just yet.  I'm almost finished with a solution that seems like it might work.  Will post when complete.
0
 
LVL 11

Expert Comment

by:aaronakin
ID: 22627201
Ok, I'm hoping this will work for you.  The actual query is towards the bottom of the snippet, but I went ahead and left in the temporary table and data I used for testing.

Give it a shot and see if it will solve your problem.
IF OBJECT_ID('tempdb..#test') IS NOT NULL
  DROP TABLE #test
GO
 
CREATE TABLE #Test
(
ID				INT,
Extn			INT,
CallDateTime	DATETIME
)
GO
 
INSERT INTO #Test VALUES (5, 1234, '2008-09-24 12:04:08.500')
INSERT INTO #Test VALUES (3, 1234, '2008-09-24 12:04:08.900')
INSERT INTO #Test VALUES (1, 1234, '2008-09-24 12:04:09.400')
INSERT INTO #Test VALUES (7, 1345, '2008-09-24 12:04:08.500')
INSERT INTO #Test VALUES (6, 1345, '2008-09-24 12:04:08.900')
INSERT INTO #Test VALUES (4, 1345, '2008-09-24 12:04:09.200')
INSERT INTO #Test VALUES (2, 1345, '2008-09-24 12:15:08.500')
INSERT INTO #Test VALUES (8, 1345, '2008-09-24 12:15:08.900')
INSERT INTO #Test VALUES (11, 1345, '2008-09-24 12:15:09.400')
INSERT INTO #Test VALUES (9, 1234, '2008-09-24 16:07:53.943')
INSERT INTO #Test VALUES (13, 1234, '2008-09-24 16:07:53.973')
INSERT INTO #Test VALUES (12, 1234, '2008-09-24 16:07:54.003')
 
SELECT TimeID1 AS ID
     , Extn
     , CallDateTime
  FROM
       (
       SELECT t2.Extn
            , t2.CallDateTime
            , t2.TimeID1
            , t2.TimeID2
            , (SELECT MAX(ID) FROM (SELECT TOP 2 ID FROM #Test WHERE Extn = t2.Extn AND ID NOT IN (t2.TimeID1, t2.TimeID2) AND CallDateTime > t2.CallDateTime AND CallDateTime <= DATEADD(ms,1000,t2.CallDateTime) ORDER BY ID) a) AS TimeID3
         FROM
              (
              SELECT t1.Extn
                   , t1.CallDateTime
                   , t1.ID AS TimeID1
                   , (SELECT ID FROM (SELECT TOP 1 ID FROM #Test WHERE Extn = t1.Extn AND ID <> t1.ID AND CallDateTime > t1.CallDateTime AND CallDateTime <= DATEADD(ms,1000,t1.CallDateTime) ORDER BY ID) a) AS TimeID2
                FROM #Test t1
              ) t2
       ) t3
  WHERE TimeID1 IS NOT NULL
    AND TimeID2 IS NOT NULL
    AND TimeID3 IS NOT NULL
 
DROP TABLE #test

Open in new window

0
 
LVL 1

Author Comment

by:tfgeorge
ID: 22627852
In a post above I made yesterday, I mentioned that a call can be logged 1, 2, 3, or more times, all within a second. Does your solution work for an arbitrary number of entries per call, or does it assume that a call is logged exactly three times?
0
 
LVL 11

Accepted Solution

by:
aaronakin earned 250 total points
ID: 22628443
It assumes exactly 3 entries per call.  Sorry, I must have missed that part.

In that case, yes, you will need to loop through, but I wouldn't use a cursor.  Instead use a table loop.  I took the cursor you posted and changed it to loop through a table instead and fixed it up to work completely.

This DOES take into consideration that there could be 1 to infinity number of calls made within a one second period of time.
DECLARE
	@Extn				NVARCHAR(16),
	@CallDateTime		DATETIME,
	@i					INT,
	@NumTotalCalls		INT,
	@NumCloseCalls		INT,
	@NumRelatedCalls	INT
 
CREATE TABLE #Calls
(
	CallID				INT				IDENTITY(1,1),
	Extn				NVARCHAR(16),
	CallDateTime		DATETIME
)
 
CREATE TABLE #CallsOutput
(
	CallID				INT,
	Extn				NVARCHAR(16),
	CallDateTime		DATETIME
)
 
CREATE TABLE #CallsIgnored
(
	CallID				INT
)
 
INSERT INTO #Calls (Extn, CallDateTime)
  SELECT Extn, CallDateTime
    FROM #Test
    ORDER BY Extn, CallDateTime
 
SELECT @NumTotalCalls = @@ROWCOUNT, @i = 1
 
WHILE @i <= @NumTotalCalls
BEGIN
  IF NOT EXISTS (SELECT 1 FROM #CallsIgnored WHERE CallID = @i)
  BEGIN
    SELECT @Extn = Extn, @CallDateTime = CallDateTime
      FROM #Calls
      WHERE CallID = @i
 
    SELECT @NumRelatedCalls = COUNT(*) FROM #Calls WHERE Extn = @Extn AND CallID > @i
 
    IF @NumRelatedCalls > 0
    BEGIN
      INSERT INTO #CallsIgnored
        SELECT CallID
          FROM #Calls
          WHERE Extn = @Extn
            AND CallID > @i
            AND CallDateTime <= DATEADD(ms,1000,@CallDateTime)
 
      SELECT @NumCloseCalls = @@ROWCOUNT
    END
 
    IF @NumRelatedCalls = 0 OR ISNULL(@NumCloseCalls,0) > 0
      INSERT INTO #CallsOutput VALUES (@i, @Extn, @CallDateTime)
  END
 
  SET @i = @i + 1
END
 
SELECT Extn, CallDateTime FROM #CallsOutput ORDER BY Extn, CallDateTime
 
DROP TABLE #Calls
DROP TABLE #CallsOutput
DROP TABLE #CallsIgnored

Open in new window

0
 
LVL 42

Expert Comment

by:dqmq
ID: 22631027
One more time  (using the test data supplied, above)


Select ID, Extn, CallDateTime  
from #Test T1
where
extn=1345 and
not exists
  (select 1 from #Test T2
       where T1.extn = t2.extn
           and T1.id <> T2.id
           and T2.CallDateTime
                Between DateAdd(s,-1,T1.CallDateTime) and T1.CallDateTime
)
order by extn, id
0
 
LVL 42

Assisted Solution

by:dqmq
dqmq earned 250 total points
ID: 22631426
For a full test, take out the extn=1345 and line:

Select Extn, CallDateTime  
from #Test T1
where
not exists
  (select 1 from #Test T2
       where T1.extn = t2.extn
           and T1.id <> T2.id
           and T2.CallDateTime
                Between DateAdd(s,-1,T1.CallDateTime) and T1.CallDateTime
   )
0
 
LVL 1

Author Comment

by:tfgeorge
ID: 22634656
Thanks to aaronakin and dqmq. I'm splitting the points, as you both provided usable solutions, and I didn't initially specify row or set based solution.
0

Featured Post

Complete VMware vSphere® ESX(i) & Hyper-V Backup

Capture your entire system, including the host, with patented disk imaging integrated with VMware VADP / Microsoft VSS and RCT. RTOs is as low as 15 seconds with Acronis Active Restore™. You can enjoy unlimited P2V/V2V migrations from any source (even from a different hypervisor)

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Introduction In my previous article (http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/A_9150-Loading-XML-Using-SSIS.html) I showed you how the XML Source component can be used to load XML files into a SQL Server database, us…
A Stored Procedure in Microsoft SQL Server is a powerful feature that it can be used to execute the Data Manipulation Language (DML) or Data Definition Language (DDL). Depending on business requirements, a single Stored Procedure can return differe…
This video shows how to set up a shell script to accept a positional parameter when called, pass that to a SQL script, accept the output from the statement back and then manipulate it in the Shell.
Via a live example combined with referencing Books Online, show some of the information that can be extracted from the Catalog Views in SQL Server.

734 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question