Solved

Perenatge of Duplicates

Posted on 2009-07-10
18
213 Views
Last Modified: 2012-05-07
I am trying to find  the Percenatge of duplicates found in a given table e.g I need to know the % of Service calls get a follow up call.
The code below already groups the select result from a view and displays only the records that have a duplicate serial  entry in the view.
THank you
ALTER PROCEDURE [dbo].[usp_CallBackPerSerialNum90Days]

	-- Add the parameters for the stored procedure here

	

AS
 

BEGIN
 

	-- SET NOCOUNT ON added to prevent extra result sets from

	-- interfering with SELECT statements.

	SET ROWCOUNT 0

	SET NOCOUNT ON;
 

SELECT     COUNT(*) AS Recurrance, MODEL, SERIAL, VENDOR, PROGRAM_TYPE, INSTALL_DATE_SEQ, CALL_ID, CALL_STATUS_SCA, CALL_TYPE, 

                      SYMPTOM_CODE, OPEN_DATE_SCA

FROM         ServiceCalls90Days

GROUP BY SERIAL, MODEL, VENDOR, PROGRAM_TYPE, INSTALL_DATE_SEQ, CALL_ID, CALL_STATUS_SCA, CALL_TYPE, SYMPTOM_CODE, 

                      OPEN_DATE_SCA

HAVING      (COUNT(*) > 1)

END

Open in new window

0
Comment
Question by:acardullo
  • 12
  • 6
18 Comments
 

Author Comment

by:acardullo
ID: 24825858
Upped the points value
0
 
LVL 8

Expert Comment

by:Bobaran98
ID: 24828732
Try some of my code from below.  It should be pretty close to what you need, though of course I can't test it.  Let me know!  And if it doesn't work as advertised, let me know specifically which database engine you're using.

Enjoy! :-)




ALTER PROCEDURE [dbo].[usp_CallBackPerSerialNum90Days]

        -- Add the parameters for the stored procedure here

        

AS
 

SET NOCOUNT ON
 
 

/* declare all variables */

DECLARE @numTotal int

DECLARE @numUnique int

DECLARE @numUnqDupes int

DECLARE @numDupes int

DECLARE @numSingles int

DECLARE @pctDupeEntries float

DECLARE @pctUnqDupes float
 
 

/* get counts from database */
 

SET @numTotal = (SELECT COUNT(*) FROM ServiceCalls90Days)
 

SET @numUnique = (SELECT COUNT(DISTINCT SERIAL) FROM ServiceCalls90Days)
 

SET @numUnqDupes = (SELECT COUNT(*)

	FROM (SELECT SERIAL, COUNT(SERIAL) as NumSERIAL FROM ServiceCalls90Days GROUP BY SERIAL) TT

	WHERE TT.NumSERIAL > 1

  )
 

SET @numDupes = (SELECT COUNT(*) FROM ServiceCalls90Days WHERE SERIAL IN (

	SELECT SERIAL

	FROM (SELECT SERIAL, COUNT(SERIAL) as NumSERIAL FROM ServiceCalls90Days GROUP BY SERIAL) TT

	WHERE TT.NumSERIAL > 1

  )
 

SET @numSingles = (SELECT (@numTotal - @numDupes))

/* OR */

--SET @numSingles = (SELECT (@numUnique - @numUnqDupes))

/* OR */

--SET @numSingles = (SELECT COUNT(*) FROM

--	(SELECT SERIAL, COUNT(SERIAL) as NumSERIAL FROM ServiceCalls90Days GROUP BY SERIAL) TT

--	WHERE TT.NumSERIAL = 1

--  )
 
 

/* determine percentage of database entries that are non-unique */

SET @pctDupeEntries = (SELECT (CAST(@numDupes as float)/CAST(@numTotal as float)))
 

/* determine percentage of unique serial numbers that exist more than once in database */

SET @pctUnqDupes = (SELECT (CAST(@numUnqDupes as float)/CAST(@numUnique as float)))
 
 

/* choose the percentage you want, then select as your result */

SELECT @pctDupeEntries as PercentDupes

Open in new window

0
 

Author Comment

by:acardullo
ID: 24848304
Msg 156, Level 15, State 1, Procedure usp_CallBackPerSerialNum90Days, Line 45
Incorrect syntax near the keyword 'SET'.
0
 

Author Comment

by:acardullo
ID: 24848496
Btw  ServiceCalls90Days  is a view
0
 

Author Comment

by:acardullo
ID: 24848582
I think thew problem is in here somwhere
SET @numDupes = (SELECT COUNT(*) FROM ServiceCalls90Days WHERE SERIAL IN (
      SELECT SERIAL
      FROM (SELECT SERIAL, COUNT(SERIAL) as NumSERIAL FROM ServiceCalls90Days GROUP BY SERIAL) TT
      WHERE TT.NumSERIAL > 1
  )
0
 

Author Comment

by:acardullo
ID: 24848623
Also  shouldnt there be a * 100  in there somewhere to calculate the actuall % and shouldnt the variables be decimal 10,4 or something?  THanks
0
 

Author Comment

by:acardullo
ID: 24848627
SOrry for the many post but i missed your question.  im running 2005 SQL
0
 
LVL 8

Expert Comment

by:Bobaran98
ID: 24848775
Sorry, there was a typo-- a parenthesis missing in the block of code you identified.  Try the code below and let me know if you have any further errors.

The fact that ServiceCalls90Days is a view isn't a problem in theory.  Depending on the complexity of the view and its feeder tables, there's always the possibility of efficiency issues.  You'll just have to see how quick your responsiveness is when you implement this stored proc.  However, there's nothing too heavy duty about this code, so any slow-downs you experience will be more the result of the view than anything else.

As for percentages, you can format things however you want.  A percentage is, by definition, a fraction, therefore I formatted this query to return percentages as a decimal number between 0 and 1.  Presumably you're implementing this stored procedure to be called by a program or webpage, and so you should probably do your formatting in the code that calls the stored proc.  If, however, you're going to be displaying these figures exactly as they appear, you could do something like this:

SELECT CAST(@pctDupeEntries*100 as numeric(5,2)) as PercentDupes

The 5 indicates the number of total digits, whereas the 2 indicates the number of digits to the right of the decimal.  Since there won't be a number greater than 100, you only need 3 digits left of the decimal.  However, again, unless you're displaying the number exactly as it's returned by the stored proc, I recommend returning it as a float, because you get greater accuracy that way-- then you can format it in your calling code.

ALTER PROCEDURE [dbo].[usp_CallBackPerSerialNum90Days]

        -- Add the parameters for the stored procedure here

        

AS

 

SET NOCOUNT ON

 

 

/* declare all variables */

DECLARE @numTotal int

DECLARE @numUnique int

DECLARE @numUnqDupes int

DECLARE @numDupes int

DECLARE @numSingles int

DECLARE @pctDupeEntries float

DECLARE @pctUnqDupes float

 

 

/* get counts from database */

 

SET @numTotal = (SELECT COUNT(*) FROM ServiceCalls90Days)

 

SET @numUnique = (SELECT COUNT(DISTINCT SERIAL) FROM ServiceCalls90Days)

 

SET @numUnqDupes = (SELECT COUNT(*)

        FROM (SELECT SERIAL, COUNT(SERIAL) as NumSERIAL FROM ServiceCalls90Days GROUP BY SERIAL) TT

        WHERE TT.NumSERIAL > 1

  )

 

SET @numDupes = (SELECT COUNT(*) FROM ServiceCalls90Days WHERE SERIAL IN (

        SELECT SERIAL

        FROM (SELECT SERIAL, COUNT(SERIAL) as NumSERIAL FROM ServiceCalls90Days GROUP BY SERIAL) TT

        WHERE TT.NumSERIAL > 1

    )

  )

 

SET @numSingles = (SELECT (@numTotal - @numDupes))

/* OR */

--SET @numSingles = (SELECT (@numUnique - @numUnqDupes))

/* OR */

--SET @numSingles = (SELECT COUNT(*) FROM

--      (SELECT SERIAL, COUNT(SERIAL) as NumSERIAL FROM ServiceCalls90Days GROUP BY SERIAL) TT

--      WHERE TT.NumSERIAL = 1

--  )

 

 

/* determine percentage of database entries that are non-unique */

SET @pctDupeEntries = (SELECT (CAST(@numDupes as float)/CAST(@numTotal as float)))

 

/* determine percentage of unique serial numbers that exist more than once in database */

SET @pctUnqDupes = (SELECT (CAST(@numUnqDupes as float)/CAST(@numUnique as float)))

 

 

/* choose the percentage you want, then select as your result */

SELECT @pctDupeEntries as PercentDupes

Open in new window

0
 

Author Comment

by:acardullo
ID: 24849557
I am calling this procedure from the reporting services.
0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 

Author Comment

by:acardullo
ID: 24849596
OK  this si strange
the Numtotal  and numDupes  are yielding the same acount  and the numUnique and numUnqDupes are also yieliding the same amount?
0
 

Author Comment

by:acardullo
ID: 24849611
and the reulst of the query is 1  which is not possible becau ei have way more dupliacte srial numbers in that table
0
 
LVL 8

Accepted Solution

by:
Bobaran98 earned 500 total points
ID: 24850036
All of the results you mention would be correct if you don't have any non-duplicate serial numbers in the view.  In other words, if there's not at least one serial number that appears only once, then the total number of records (numTotal) will equal the number of records that are duplicates (numDupes).  Also, the number of unique serial numbers in the view (numUnique) will match the number of unique serial numbers that are duplicated (numUnqDupes).  Did you pull @numSingles?  It should be 0.  If you don't believe this, uncomment the complete calculation for counting the number of un-duplicated records (see codebox below).

Finally, the resulting percentage should definitely be 1, because that's saying that 100% of the records in your view are duplicated serials.  In that case, perhaps a more helpful number would be to select the value of @pctUnqDupes.

I know this may come across as a bit confusing, and I think it might have to do with our definition of "duplicate."  I'm mostly defining "duplicate" as records whose serial number appears more than once in the view.  Say you have an empty view, no records, then you add a record with serial number 123.  At that point, according to my code, you'll have:

numTotal = 1
numUnique = 1
numUnqDupes = 0
numDupes = 0
numSingles = 1
pctDupeEntries = 0 (or 0%)
pctUnqDupes = 0 (or 0%)

Now, add another entry with that same 123 serial number, and you'll have:

numTotal = 2
 numUnique = 1
 numUnqDupes = 1
 numDupes = 2
 numSingles = 0
 pctDupeEntries = 1 (or 100%)
 pctUnqDupes = 1 (or 100%)

Add a third entry with that same 123 serial number (so now you've got three 123 entries, and that's all you've got).  Now you'll see:

numTotal = 3
 numUnique = 1
 numUnqDupes = 1
 numDupes = 3
 numSingles = 0
 pctDupeEntries = 1 (or 100%)
 pctUnqDupes = 1 (or 100%)

Just to take this one more step, let's now add a new serial number, say 456:

numTotal = 4
  numUnique = 2
  numUnqDupes = 1
  numDupes = 3
  numSingles = 1
  pctDupeEntries = .75 (or 75%)
  pctUnqDupes = .5 (or 50%)
Are you seeing now how I'm doing my math?  Allow me to define these numbers one by one:
  • numTotal = 4 = there are 4 total records in the view
  •  numUnique = 2 = there are 2 unique serial numbers (123 and 456)
  •  numUnqDupes = 1 = one of those unique serial numbers (123) exists in the view more than once
  •  numDupes = 3 = there are 3 records in the view for which the serial number is non-unique (in other words, there are 3 records with the 123 serial number)
  •  numSingles = 1 = there is 1 record in the view for which the serial number is unique (there's only 1 record with the 456 serial number)
  •  pctDupeEntries = .75 (or 75%) = out of the 4 total records, 3 of them have non-unique serial numbers (the 3 with 123)
  •  pctUnqDupes = .5 (or 50%) = out of the 2 unique serial numbers (123 and 456), 1 of them (123) exists multiple times
It occurs to me that there's another percentage that you might like to calculate-- perhaps you want to know how many "extra" entries you have.  In the last example, you've got 1 entry with serial 456 and 3 entries with 123.  That means of your 4 entries total, you've got 2 serial numbers and 2 extras.  In that case, you might say you've got 50% duplicates, but you'd be defining "duplicate" as "extra" rather than as "percentage of entries whose serial number appears more than once."  To determine that percentage, use the following code:

DECLARE @pctExtra float
SET @pctExtra = (SELECT (CAST((@numTotal-@numUnique) as float)/CAST(@numTotal as float)))

--or:
--SET @pctExtra = 1 - (SELECT (CAST(@numUnique as float)/CAST(@numTotal as float)))

Incidentally, I know in this particular case, it looks like this @pctExtra number matches @pctUnqDupes, or perhaps is the complement of it, but it's not.  It just happens to be the same figure because of the numbers I've used.

Does that clear things up at all?  Sorry for the novel!



 
SET @numSingles = (SELECT COUNT(*) FROM

      (SELECT SERIAL, COUNT(SERIAL) as NumSERIAL FROM ServiceCalls90Days GROUP BY SERIAL) TT

      WHERE TT.NumSERIAL = 1

  )

Open in new window

0
 

Author Closing Comment

by:acardullo
ID: 31602135
Thank you. Great Contributor
0
 
LVL 8

Expert Comment

by:Bobaran98
ID: 24850503
Thank you. :-)
0
 

Author Comment

by:acardullo
ID: 24850609
Just wanted to say that i am very happy wiht your solution and the time you took to show me how you derived your comnclusions.  Thanks again!
0
 
LVL 8

Expert Comment

by:Bobaran98
ID: 24851016
It was truly my pleasure.  I love this stuff!
0
 

Author Comment

by:acardullo
ID: 24853090
Bob, if you dont mind.  How would i alter this SP to reflect the call back %  for every REPRESENTATIVE (field in souce view)  thanks.   ill be happy to open new question if youw ant me too.
0
 
LVL 8

Expert Comment

by:Bobaran98
ID: 24853316
Since it's beyond the bounds of your original question, why don't you go ahead and open a new question.  Feel free to repost the code I shared above-- or cross reference this page-- and make sure to specify which percentage calculation you want.  The solution will involve nesting a few queries.  Anyway, once you post the new question, if you want to put a link here, I'll try to take a look at it.  It might be a little while, though, so if someone else can get to it first, that's great too!
0

Featured Post

Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

Join & Write a Comment

Suggested Solutions

If you find yourself in this situation “I have used SELECT DISTINCT but I’m getting duplicates” then I'm sorry to say you are using the wrong SQL technique as it only does one thing which is: produces whole rows that are unique. If the results you a…
Confronted with some SQL you don't know can be a daunting task. It can be even more daunting if that SQL carries some of the old secret codes used in the Ye Olde query syntax, such as: (+)     as used in Oracle;     *=     =*    as used in Sybase …
This video discusses moving either the default database or any database to a new volume.
Here's a very brief overview of the methods PRTG Network Monitor (https://www.paessler.com/prtg) offers for monitoring bandwidth, to help you decide which methods you´d like to investigate in more detail.  The methods are covered in more detail in o…

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now