Solved

Query to omit near-duplicate rows

Posted on 2008-09-30
24
790 Views
Last Modified: 2012-08-13
I am trying to query a call log database with (simplified) records like this:

Extn  CallDateTime
1234  2008-09-24 12:04:08.493
1234  2008-09-24 12:04:08.523
1234  2008-09-24 12:04:08.536
1345  2008-09-24 16:06:00.707
1345  2008-09-24 16:06:00.740
1345  2008-09-24 16:06:00.753
1456  2008-09-24 16:07:53.943
1456  2008-09-24 16:07:53.973
1456  2008-09-24 16:07:54.003

As you can see, the phone system logs each call three times identically, except for the CallDateTime column, which will differ by fractions of a second. I would like to be able to have these "near" duplicate records filtered out. Basically, if multiple rows are otherwise identical and their CallDateTime values are within a second of each other, just show me one of them. Similar to this:

Extn  CallDateTime
1234  2008-09-24 12:04:08.536
1345  2008-09-24 16:06:00.753
1456  2008-09-24 16:07:54.003

Or even with the time rounded to the nearest second (but I can do that later).

My first attempt at this was to truncate off the milliseconds, then union two sets. But sometimes, the times cross the second border, and truncating does not always yield identical rows. Any ideas?
0
Comment
Question by:tfgeorge
  • 9
  • 8
  • 6
  • +1
24 Comments
 
LVL 11

Expert Comment

by:aaronakin
Comment Utility
Try this...

SELECT Extn, CallDateTime
  FROM
  (
  SELECT Extn, CAST(CallDateTime AS SMALLDATETIME) CallDateTime
    FROM TableName
  ) a
  GROUP BY Extn, CallDateTime
0
 
LVL 11

Expert Comment

by:aaronakin
Comment Utility
You could add the number of calls to this as well.

SELECT Extn, CallDateTime, COUNT(*) AS NumCalls
  FROM
  (
  SELECT Extn, CAST(CallDateTime AS SMALLDATETIME) CallDateTime
    FROM TableName
  ) a
  GROUP BY Extn, CallDateTime
0
 
LVL 42

Expert Comment

by:dqmq
Comment Utility
I think the easiest way is to convert the datetime to number of seconds since some point in time using the datediff function.

Select Extn, min(CallDateTime)  
from YourTable
group by datediff("s", CallDateTime, now())
0
 
LVL 42

Expert Comment

by:dqmq
Comment Utility
oops... this is what I meant:

Select Extn, min(CallDateTime)  
from YourTable
group by Extn, datediff("s", CallDateTime, now())
0
 
LVL 22

Expert Comment

by:dportas
Comment Utility
Try this:

SELECT Extn, CallDateTime
FROM
  (SELECT Extn, CallDateTime,
   ROW_NUMBER() OVER (PARTITION BY Extn ORDER BY CallDateTime) r
   FROM tbl) t
WHERE r%3 = 0;

(assumes SQL Server 2005 / 2008)
0
 
LVL 1

Author Comment

by:tfgeorge
Comment Utility
aaronakin: Your solutions just ignore seconds, which is not what I want, and it still won't work when the times span across the minute boundary.

dqmq: Your solution has the same problem as my first attempt - it just ignores the milliseconds and fails when the time differences cross the second boundary.

dportas: Sorry, but I didn't clarify. This needs to work on MSSQL 2000.
0
 
LVL 11

Expert Comment

by:aaronakin
Comment Utility
How about this?  It might need to be tweaked slightly, but it should get you closer.

DECLARE @Now DATETIME
SET @Now = GETDATE()

SELECT Extn, CallDateTime
  FROM
  (
  SELECT Extn, ROUND(CAST(DATEDIFF(ms,@Now,CallDateTime) AS DECIMAL(15,2))/1000,0) AS CallDateTime
    FROM TableName
  ) a
  GROUP BY Extn, CallDateTime
0
 
LVL 1

Author Comment

by:tfgeorge
Comment Utility
aaronakin: I don't think rounding and casting can solve the problem. That will just shift and spread out the boundaries that need to be spanned to cause the problem. For example, I could round them all to the nearest day, but there will still be some instances where they just happen to span that day boundary.
0
 
LVL 11

Expert Comment

by:aaronakin
Comment Utility
Did you try it out?  I'm really taking all the parts of the day out of the question.  It figures out the number of milliseconds from a particular point in time and then rounds it.  This means that any datetimes within 1 second will be considered the same.  It won't matter which day, hour, minute or second it occurred, just as long as it's within a 1 second period.
0
 
LVL 1

Author Comment

by:tfgeorge
Comment Utility
Firstly, I get an error with DATEDIFF using milliseconds. See: here.

So if I change it to use seconds and no longer divide by 1000, it still fails to work some of the time.

I don't think your example considers datetimes within a one second period of each other as being the same. It considers datetimes that occurred the same rounded number of seconds ago from now to be the same. But counting the number of seconds or milliseconds from now is no different than counting the number or seconds or milliseconds from 1/1/1753, which is what a datetime is. You will occasionally have datetimes that span that rounding boundary.

Another way to look at it is this: I don't want datetimes that fall within a specific second to be considered the same, I want datetimes that fall within any one second interval to be considered the same. Does that make sense? I hope I'm not off here, but in any case your example does not do what I need on my table.
0
 
LVL 11

Expert Comment

by:aaronakin
Comment Utility
Ok, what do you want to do in the following case?  There are 3 records; records 1 and 2 are within 1 second and records 2 and 3 are within 1 second, but records 1 & 3 are over 1 second.  In this case, what do you want returned?

Extn      CallDateTime
1234      2008-09-24 12:04:08.500
1234      2008-09-24 12:04:08.900
1234      2008-09-24 12:04:09.600

Let me know and I'll see what I can come up with.
0
 
LVL 1

Author Comment

by:tfgeorge
Comment Utility
That shouldn't ever happen. The three entries are all going to be within a tenth of a second or so of each other, definitely within a second from the earliest to the last. I'm beginning to think it might have to be done using cursors or another row-by-row method.
0
How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

 
LVL 11

Expert Comment

by:aaronakin
Comment Utility
If that's the case...

If every Extn has 3 datetimes, and they will always be within 1 second of each other, why not just group by Extn and select the min or max date?  Maybe we just need some more info here.  Can one extension have multiple groups of 3 calls?  If so, what column(s) make up the PK?
0
 
LVL 1

Author Comment

by:tfgeorge
Comment Utility
I left the other columns out for simplicity's sake, and tried to reduce the question to the part I didn't know how to do. An Extn can, and does have many calls logged to it, and each call is logged one, two, three, or more times (but all within a second). There is one column with a random (non-sequential) but unique integer id.
0
 
LVL 42

Expert Comment

by:dqmq
Comment Utility
Try this:


Select Extn, CallDateTime  
from YourTable T1
where not exists
  (select 1 from YourTable T2
       where T2.id <> T1.id
           and T2.CallDateTime <= DateAdd(s,1,T1.CallDateTime) )
0
 
LVL 42

Expert Comment

by:dqmq
Comment Utility
Forgot a line, try:

Select Extn, CallDateTime  
from YourTable T1
where not exists
  (select 1 from YourTable T2
       where T1.extn = t2.extn
           and T2.id <> T1.id
           and T2.CallDateTime <= DateAdd(s,1,T1.CallDateTime) )
0
 
LVL 1

Author Comment

by:tfgeorge
Comment Utility
dqmq: Your solution doesn't work. I haven't had time to analyze it, but the number of rows returned is way too few to be correct.

I think I've figured out a solution using cursors. I was hoping for a set based solution, but I'm not sure it is possible to compare rows to each other within a sliding time window using a simple query. Here's my solution which seems to work. Am I missing anything here?

DECLARE  @Extn nvarchar(16),
   @CallDateTime datetime,
   @LastExtn nvarchar(16),
   @LastCallDateTime datetime
SET @LastExtn = 'blahblah'

DECLARE get_calls CURSOR FOR
SELECT Extn, CallDateTime FROM MyTable
ORDER BY CallDateTime ASC

OPEN get_calls
FETCH NEXT FROM get_calls INTO @Extn, @CallDateTime

WHILE @@FETCH_STATUS = 0
BEGIN
   IF (@Extn <> @LastExtn) OR (@CallDateTime > DATEADD(ss, 1, @LastCallDateTime)) SELECT @Extn, @CallDateTime
   
   SET @LastExtne = @Extn
   SET @LastCallDateTime = @CallDateTime
   FETCH NEXT FROM get_calls INTO @Extn, @CallDateTime
END

CLOSE get_calls
DEALLOCATE get_calls
0
 
LVL 11

Expert Comment

by:aaronakin
Comment Utility
Don't give up hope just yet.  I'm almost finished with a solution that seems like it might work.  Will post when complete.
0
 
LVL 11

Expert Comment

by:aaronakin
Comment Utility
Ok, I'm hoping this will work for you.  The actual query is towards the bottom of the snippet, but I went ahead and left in the temporary table and data I used for testing.

Give it a shot and see if it will solve your problem.
IF OBJECT_ID('tempdb..#test') IS NOT NULL

  DROP TABLE #test

GO
 

CREATE TABLE #Test

(

ID				INT,

Extn			INT,

CallDateTime	DATETIME

)

GO
 

INSERT INTO #Test VALUES (5, 1234, '2008-09-24 12:04:08.500')

INSERT INTO #Test VALUES (3, 1234, '2008-09-24 12:04:08.900')

INSERT INTO #Test VALUES (1, 1234, '2008-09-24 12:04:09.400')

INSERT INTO #Test VALUES (7, 1345, '2008-09-24 12:04:08.500')

INSERT INTO #Test VALUES (6, 1345, '2008-09-24 12:04:08.900')

INSERT INTO #Test VALUES (4, 1345, '2008-09-24 12:04:09.200')

INSERT INTO #Test VALUES (2, 1345, '2008-09-24 12:15:08.500')

INSERT INTO #Test VALUES (8, 1345, '2008-09-24 12:15:08.900')

INSERT INTO #Test VALUES (11, 1345, '2008-09-24 12:15:09.400')

INSERT INTO #Test VALUES (9, 1234, '2008-09-24 16:07:53.943')

INSERT INTO #Test VALUES (13, 1234, '2008-09-24 16:07:53.973')

INSERT INTO #Test VALUES (12, 1234, '2008-09-24 16:07:54.003')
 

SELECT TimeID1 AS ID

     , Extn

     , CallDateTime

  FROM

       (

       SELECT t2.Extn

            , t2.CallDateTime

            , t2.TimeID1

            , t2.TimeID2

            , (SELECT MAX(ID) FROM (SELECT TOP 2 ID FROM #Test WHERE Extn = t2.Extn AND ID NOT IN (t2.TimeID1, t2.TimeID2) AND CallDateTime > t2.CallDateTime AND CallDateTime <= DATEADD(ms,1000,t2.CallDateTime) ORDER BY ID) a) AS TimeID3

         FROM

              (

              SELECT t1.Extn

                   , t1.CallDateTime

                   , t1.ID AS TimeID1

                   , (SELECT ID FROM (SELECT TOP 1 ID FROM #Test WHERE Extn = t1.Extn AND ID <> t1.ID AND CallDateTime > t1.CallDateTime AND CallDateTime <= DATEADD(ms,1000,t1.CallDateTime) ORDER BY ID) a) AS TimeID2

                FROM #Test t1

              ) t2

       ) t3

  WHERE TimeID1 IS NOT NULL

    AND TimeID2 IS NOT NULL

    AND TimeID3 IS NOT NULL
 

DROP TABLE #test

Open in new window

0
 
LVL 1

Author Comment

by:tfgeorge
Comment Utility
In a post above I made yesterday, I mentioned that a call can be logged 1, 2, 3, or more times, all within a second. Does your solution work for an arbitrary number of entries per call, or does it assume that a call is logged exactly three times?
0
 
LVL 11

Accepted Solution

by:
aaronakin earned 250 total points
Comment Utility
It assumes exactly 3 entries per call.  Sorry, I must have missed that part.

In that case, yes, you will need to loop through, but I wouldn't use a cursor.  Instead use a table loop.  I took the cursor you posted and changed it to loop through a table instead and fixed it up to work completely.

This DOES take into consideration that there could be 1 to infinity number of calls made within a one second period of time.
DECLARE

	@Extn				NVARCHAR(16),

	@CallDateTime		DATETIME,

	@i					INT,

	@NumTotalCalls		INT,

	@NumCloseCalls		INT,

	@NumRelatedCalls	INT
 

CREATE TABLE #Calls

(

	CallID				INT				IDENTITY(1,1),

	Extn				NVARCHAR(16),

	CallDateTime		DATETIME

)
 

CREATE TABLE #CallsOutput

(

	CallID				INT,

	Extn				NVARCHAR(16),

	CallDateTime		DATETIME

)
 

CREATE TABLE #CallsIgnored

(

	CallID				INT

)
 

INSERT INTO #Calls (Extn, CallDateTime)

  SELECT Extn, CallDateTime

    FROM #Test

    ORDER BY Extn, CallDateTime
 

SELECT @NumTotalCalls = @@ROWCOUNT, @i = 1
 

WHILE @i <= @NumTotalCalls

BEGIN

  IF NOT EXISTS (SELECT 1 FROM #CallsIgnored WHERE CallID = @i)

  BEGIN

    SELECT @Extn = Extn, @CallDateTime = CallDateTime

      FROM #Calls

      WHERE CallID = @i
 

    SELECT @NumRelatedCalls = COUNT(*) FROM #Calls WHERE Extn = @Extn AND CallID > @i
 

    IF @NumRelatedCalls > 0

    BEGIN

      INSERT INTO #CallsIgnored

        SELECT CallID

          FROM #Calls

          WHERE Extn = @Extn

            AND CallID > @i

            AND CallDateTime <= DATEADD(ms,1000,@CallDateTime)
 

      SELECT @NumCloseCalls = @@ROWCOUNT

    END
 

    IF @NumRelatedCalls = 0 OR ISNULL(@NumCloseCalls,0) > 0

      INSERT INTO #CallsOutput VALUES (@i, @Extn, @CallDateTime)

  END
 

  SET @i = @i + 1

END
 

SELECT Extn, CallDateTime FROM #CallsOutput ORDER BY Extn, CallDateTime
 

DROP TABLE #Calls

DROP TABLE #CallsOutput

DROP TABLE #CallsIgnored

Open in new window

0
 
LVL 42

Expert Comment

by:dqmq
Comment Utility
One more time  (using the test data supplied, above)


Select ID, Extn, CallDateTime  
from #Test T1
where
extn=1345 and
not exists
  (select 1 from #Test T2
       where T1.extn = t2.extn
           and T1.id <> T2.id
           and T2.CallDateTime
                Between DateAdd(s,-1,T1.CallDateTime) and T1.CallDateTime
)
order by extn, id
0
 
LVL 42

Assisted Solution

by:dqmq
dqmq earned 250 total points
Comment Utility
For a full test, take out the extn=1345 and line:

Select Extn, CallDateTime  
from #Test T1
where
not exists
  (select 1 from #Test T2
       where T1.extn = t2.extn
           and T1.id <> T2.id
           and T2.CallDateTime
                Between DateAdd(s,-1,T1.CallDateTime) and T1.CallDateTime
   )
0
 
LVL 1

Author Comment

by:tfgeorge
Comment Utility
Thanks to aaronakin and dqmq. I'm splitting the points, as you both provided usable solutions, and I didn't initially specify row or set based solution.
0

Featured Post

Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

Join & Write a Comment

Let's review the features of new SQL Server 2012 (Denali CTP3). It listed as below: PERCENT_RANK(): PERCENT_RANK() function will returns the percentage value of rank of the values among its group. PERCENT_RANK() function value always in be…
International Data Corporation (IDC) prognosticates that before the current the year gets over disbursing on IT framework products to be sent in cloud environs will be $37.1B.
Viewers will learn how to use the INSERT statement to insert data into their tables. It will also introduce the NULL statement, to show them what happens when no value is giving for any given column.
Viewers will learn how to use the UPDATE and DELETE statements to change or remove existing data from their tables. Make a table: Update a specific column given a specific row using the UPDATE statement: Remove a set of values using the DELETE s…

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

8 Experts available now in Live!

Get 1:1 Help Now