Solved

# Benford's Law in SQL Server 2005/2008

Posted on 2009-05-06
1,754 Views
Hi,
I tried to create an SQL statement that calculates the frequency of appearance of the first digit of a population (column, like amount) and to compare to Benford's law (the expected number of occurences).

I came across this algorith in a different question:
Public Function BenfordsLaw(D As Integer) As Double
BenfordsLaw = Log(1 + (1 / D)) / Log(10)
End Function

Do you have an idea whether this is an SQL statement, if not is there an sql statement to compare the occurence of the first digit of a column to the expected occurence according to Benford's Law. I also need to calculate the Z-Scores (Z-stat ratios).
0
Question by:TechNovation
• 12
• 8
• 2
• +3

LVL 31

Expert Comment

ID: 24313703
use this.

declare @d int
set @d=10
select Log(1 + (1 / @d)) / Log(10)
0

LVL 31

Expert Comment

ID: 24313719
the function you saw belong to VB but you could do the same task in SQL Server as I saw you above
0

Author Comment

ID: 24313787
It might be a bit stupid question, but how do I link the code above to a particular column in a table in a database?
0

LVL 31

Expert Comment

ID: 24313796
or you can use float as well.
0

LVL 31

Expert Comment

ID: 24313798
sorry forgot to paste the script.

declare @d float
set @d=0.21
select Log(1 + (1 / @d)) / Log(10)

0

LVL 31

Expert Comment

ID: 24313808
>>It might be a bit stupid question, but how do I link the code above to a particular column in a table in a database?<<

that's is upto you, while inserting the data, you can calculate and insert it in your table, or can make calculation on fly while using SELECT statement.

0

Author Comment

ID: 24313849
What about telling the code which column it should look at in which table and in which database? do I need to add something else to the code?
0

LVL 31

Expert Comment

ID: 24313866
if you can show me table structure and tell me which value should be used for calculation, I will make one SELECT statement for you as demonstration.
0

Author Comment

ID: 24313876
Sorry didn't refresh and posted the question again after your answer. But how would the select stataement look like? say I have two columns "Reference" and " Amount", in a table called "BLTABLE" in a database called BLDATABASE. How do I apply the code?
0

Author Comment

ID: 24313886
THe value for calculation is column "Amount" (followed from above)
0

LVL 31

Expert Comment

ID: 24313912
I want to see your TABLE STRUCTURE.
0

Author Comment

ID: 24313921
Structure would be for example:
Reference Amount
A1             100
A2             333
C4             456
34             785
45             453
...              ...

with in total more than 100 million records
0

LVL 31

Expert Comment

ID: 24313946
well in that case you can do something like.

select  Reference, Amount, Log(1 + (1 / amount)) / Log(10) as BenFordValue from yourTableName
where reference='A1'

this is just an example. Do you want to see value in SELECT only or want to store it in database permenently by adding one more column to your table?
0

Author Comment

ID: 24314018
Actually the final result of the query should look like this:
Digit       Actual Ocurrence (in %)      Expected Occurence (BL)       Z-Scores
1                   20                                                30                                     ...
2                   10                                                 20                                    ...
3                     5                                                 10                                    ...
4                   40                                                  5                                    ...
5                     1                                                  5                                    ...
6                     1                                                   5                                    ...
7                     3                                                   5                                    ...
8                   10                                                   5                                    ...
9                   10                                                   5                                    ...
END OF TABLE

BL occurence is made up I didn't have time to populate here the right figures. So the final result should be one newly created table with exactly 4 columns and 9 rows as above.
0

LVL 60

Expert Comment

ID: 24314027
maybe create a function for it

create function dbo.udf_Benfords
(
@amount money
)
returns float
as
begin
declare @x float
set @x = Log(1 + (1 / @amount)) / Log(10)
return(@x)
end
0

Author Comment

ID: 24314078
By the way isn't the code trying to calculate the expected values (formula based)? What I need is how to collect the actual occurence, let the SQL statement go through the column and grab the first digit, group it and count the percentage of actual occurence. And then calculate the Z-scores. So it is more the actual data that I need.
0

LVL 26

Accepted Solution

Chris Luttrell earned 400 total points
ID: 24320322
OK, this put me back in Stats Class but this might work.  It gives results like the attached image.  I played with an existing table I had and was not sure you you would handle negative and < 1.0 amounts so you may have to adjust it some. Hope this is on the right track.

``````CREATE FUNCTION dbo.udfBenfordsLaw (@Digit INT)

RETURNS FLOAT

as

BEGIN

RETURN (CASE @Digit WHEN 0 THEN 0.0 ELSE Log(1.0 + (1.0 / @Digit)) / Log(10) END)

END

GO

SELECT *, dbo.udfBenfordsLaw(FirstDigit) BenfordLaw, (ActualOcurrance-[AVG])/Std Z

FROM (

SELECT SUBSTRING(CONVERT(VARCHAR,ABS(Amount)),1,1) FirstDigit, COUNT(*) ActualOcurrance, SUM(COUNT(*)) OVER () Total, 100.0 * COUNT(*) / SUM(COUNT(*)) OVER () PCT, AVG(COUNT(*)) OVER () AVG, STDEV(COUNT(*)) OVER () Std

FROM LifecycleBudget LB

WHERE ABS(LB.Amount) > 1

GROUP BY SUBSTRING(CONVERT(VARCHAR,ABS(Amount)),1,1)

) x

ORDER BY FirstDigit
``````
Zscores.png
0

Author Comment

ID: 24323394
I will try the code above of CGLuttrell but from the picture, it already looks very good, I will check
0

LVL 51

Assisted Solution

Mark Wills earned 100 total points
ID: 24324090
Just one small enhancement to CGLuttrell 's solution...

Mainly because of the use of the scaler udf is not really needed and with 100 million rows, will take it's toll... See :  http://sqlblogcasts.com/blogs/simons/archive/2008/11/03/TSQL-Scalar-functions-are-evil-.aspx from a performance perspective.

SELECT *, CASE FirstDigit WHEN 0 THEN 0.0 ELSE Log(1.0 + (1.0 / FirstDigit)) / Log(10) END BenfordLaw, (ActualOcurrance-[AVG])/Std Z
FROM (
SELECT SUBSTRING(CONVERT(VARCHAR,ABS(Amount)),1,1) FirstDigit, COUNT(*) ActualOcurrance, SUM(COUNT(*)) OVER () Total, 100.0 * COUNT(*) / SUM(COUNT(*)) OVER () PCT, AVG(COUNT(*)) OVER () AVG, STDEV(COUNT(*)) OVER () Std
FROM LifecycleBudget LB
WHERE ABS(LB.Amount) > 1
GROUP BY SUBSTRING(CONVERT(VARCHAR,ABS(Amount)),1,1)
) x
ORDER BY FirstDigit

CGLuttrell please drop me a line, email is on my bio...
0

Author Comment

ID: 24330137
Excellent. Many thanks to CGLuttrell, and Mark Wills as well. (increased point to 500 as it is a very concise code and valued added.
0

Author Closing Comment

ID: 31578418
Excellent
0

Author Comment

ID: 24330180
I don't know how the garde 8.6 was calculated, but this for me a 10 out of 10.
0

Author Comment

ID: 24330203
Both CGluttrell and Mark Wills you may want to help further with my question about Pareto Analysis in SQL Server:
http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SQL-Server-2005/Q_24389862.html
0

LVL 51

Expert Comment

ID: 24333882
Hi technovation, thanks for the assist, unexpected, and most welcomed...

The grade is a scientific calculation based on the angle of the sun and the phase of the moon... It has to do with elapsed time and number of entries, not always a good indicator as you have pointed out.

Just had a quick look at the other question as well...

Cheers,
Mark Wills
0

Expert Comment

ID: 41825559
Gave me the absolute horn I had to take action on!!
0

## Featured Post

Naughty Me. While I was changing the database name from DB1 to DB_PROD1 (yep it's not real database name ^v^), I changed the database name and notified my application fellows that I did it. They turn on the application, and everything is working. A â€¦
In this article we will get to know that how can we recover deleted data if it happens accidently. We really can recover deleted rows if we know the time when data is deleted by using the transaction log.
This demo shows you how to set up the containerized NetScaler CPX with NetScaler Management and Analytics System in a non-routable Mesos/Marathon environment for use with Micro-Services applications.
This tutorial demonstrates a quick way of adding group price to multiple Magento products.