# Benford's Law in SQL Server 2005/2008

Hi,
I tried to create an SQL statement that calculates the frequency of appearance of the first digit of a population (column, like amount) and to compare to Benford's law (the expected number of occurences).

I came across this algorith in a different question:
Public Function BenfordsLaw(D As Integer) As Double
BenfordsLaw = Log(1 + (1 / D)) / Log(10)
End Function

Do you have an idea whether this is an SQL statement, if not is there an sql statement to compare the occurence of the first digit of a column to the expected occurence according to Benford's Law. I also need to calculate the Z-Scores (Z-stat ratios).
###### Who is Participating?

Senior Database ArchitectCommented:
OK, this put me back in Stats Class but this might work.  It gives results like the attached image.  I played with an existing table I had and was not sure you you would handle negative and < 1.0 amounts so you may have to adjust it some. Hope this is on the right track.

``````CREATE FUNCTION dbo.udfBenfordsLaw (@Digit INT)
RETURNS FLOAT
as
BEGIN
RETURN (CASE @Digit WHEN 0 THEN 0.0 ELSE Log(1.0 + (1.0 / @Digit)) / Log(10) END)
END
GO

SELECT *, dbo.udfBenfordsLaw(FirstDigit) BenfordLaw, (ActualOcurrance-[AVG])/Std Z
FROM (
SELECT SUBSTRING(CONVERT(VARCHAR,ABS(Amount)),1,1) FirstDigit, COUNT(*) ActualOcurrance, SUM(COUNT(*)) OVER () Total, 100.0 * COUNT(*) / SUM(COUNT(*)) OVER () PCT, AVG(COUNT(*)) OVER () AVG, STDEV(COUNT(*)) OVER () Std
FROM LifecycleBudget LB
WHERE ABS(LB.Amount) > 1
GROUP BY SUBSTRING(CONVERT(VARCHAR,ABS(Amount)),1,1)
) x
ORDER BY FirstDigit
``````
Zscores.png
0

Commented:
use this.

declare @d int
set @d=10
select Log(1 + (1 / @d)) / Log(10)
0

Commented:
the function you saw belong to VB but you could do the same task in SQL Server as I saw you above
0

Author Commented:
It might be a bit stupid question, but how do I link the code above to a particular column in a table in a database?
0

Commented:
or you can use float as well.
0

Commented:
sorry forgot to paste the script.

declare @d float
set @d=0.21
select Log(1 + (1 / @d)) / Log(10)

0

Commented:
>>It might be a bit stupid question, but how do I link the code above to a particular column in a table in a database?<<

that's is upto you, while inserting the data, you can calculate and insert it in your table, or can make calculation on fly while using SELECT statement.

0

Author Commented:
What about telling the code which column it should look at in which table and in which database? do I need to add something else to the code?
0

Commented:
if you can show me table structure and tell me which value should be used for calculation, I will make one SELECT statement for you as demonstration.
0

Author Commented:
Sorry didn't refresh and posted the question again after your answer. But how would the select stataement look like? say I have two columns "Reference" and " Amount", in a table called "BLTABLE" in a database called BLDATABASE. How do I apply the code?
0

Author Commented:
THe value for calculation is column "Amount" (followed from above)
0

Commented:
I want to see your TABLE STRUCTURE.
0

Author Commented:
Structure would be for example:
Reference Amount
A1             100
A2             333
C4             456
34             785
45             453
...              ...

with in total more than 100 million records
0

Commented:
well in that case you can do something like.

select  Reference, Amount, Log(1 + (1 / amount)) / Log(10) as BenFordValue from yourTableName
where reference='A1'

this is just an example. Do you want to see value in SELECT only or want to store it in database permenently by adding one more column to your table?
0

Author Commented:
Actually the final result of the query should look like this:
Digit       Actual Ocurrence (in %)      Expected Occurence (BL)       Z-Scores
1                   20                                                30                                     ...
2                   10                                                 20                                    ...
3                     5                                                 10                                    ...
4                   40                                                  5                                    ...
5                     1                                                  5                                    ...
6                     1                                                   5                                    ...
7                     3                                                   5                                    ...
8                   10                                                   5                                    ...
9                   10                                                   5                                    ...
END OF TABLE

BL occurence is made up I didn't have time to populate here the right figures. So the final result should be one newly created table with exactly 4 columns and 9 rows as above.
0

Commented:
maybe create a function for it

create function dbo.udf_Benfords
(
@amount money
)
returns float
as
begin
declare @x float
set @x = Log(1 + (1 / @amount)) / Log(10)
return(@x)
end
0

Author Commented:
By the way isn't the code trying to calculate the expected values (formula based)? What I need is how to collect the actual occurence, let the SQL statement go through the column and grab the first digit, group it and count the percentage of actual occurence. And then calculate the Z-scores. So it is more the actual data that I need.
0

Author Commented:
I will try the code above of CGLuttrell but from the picture, it already looks very good, I will check
0

Just one small enhancement to CGLuttrell 's solution...

Mainly because of the use of the scaler udf is not really needed and with 100 million rows, will take it's toll... See :  http://sqlblogcasts.com/blogs/simons/archive/2008/11/03/TSQL-Scalar-functions-are-evil-.aspx from a performance perspective.

SELECT *, CASE FirstDigit WHEN 0 THEN 0.0 ELSE Log(1.0 + (1.0 / FirstDigit)) / Log(10) END BenfordLaw, (ActualOcurrance-[AVG])/Std Z
FROM (
SELECT SUBSTRING(CONVERT(VARCHAR,ABS(Amount)),1,1) FirstDigit, COUNT(*) ActualOcurrance, SUM(COUNT(*)) OVER () Total, 100.0 * COUNT(*) / SUM(COUNT(*)) OVER () PCT, AVG(COUNT(*)) OVER () AVG, STDEV(COUNT(*)) OVER () Std
FROM LifecycleBudget LB
WHERE ABS(LB.Amount) > 1
GROUP BY SUBSTRING(CONVERT(VARCHAR,ABS(Amount)),1,1)
) x
ORDER BY FirstDigit

CGLuttrell please drop me a line, email is on my bio...
0

Author Commented:
Excellent. Many thanks to CGLuttrell, and Mark Wills as well. (increased point to 500 as it is a very concise code and valued added.
0

Author Commented:
Excellent
0

Author Commented:
I don't know how the garde 8.6 was calculated, but this for me a 10 out of 10.
0

Author Commented:
Both CGluttrell and Mark Wills you may want to help further with my question about Pareto Analysis in SQL Server:
http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SQL-Server-2005/Q_24389862.html
0

Hi technovation, thanks for the assist, unexpected, and most welcomed...

The grade is a scientific calculation based on the angle of the sun and the phase of the moon... It has to do with elapsed time and number of entries, not always a good indicator as you have pointed out.

Just had a quick look at the other question as well...

Cheers,
Mark Wills
0

Commented:
Gave me the absolute horn I had to take action on!!
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.