Difference between two distributions

I have two tables of temperature distributions,

and I want to know whether these two distributions are significantly different.

Therefore I want to apply the two-sample Kolmogorov-Smirnov test.

Can anyone adapt the test for my needs?

thanks,

see my attached code
Regi-es.pas
Awarded 2010
Top Expert 2013

Commented:
I don't see why you would need to adapt the test; it does exactly what you want. The two-sample Kolmogorov-Smirnov test will tell you whether it is likely that both datasets come from the same distribution. You can plug your data in here to see the numbers (and to check your results if you still want to code it): http://www.physics.csbsju.edu/stats/KS-test.html

Try this for more info on the test if you need it: http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Kolmogorov-Smirnov_test.html

There is a Wikipedia article too, but I find the math articles there a little overly complicated and hard to understand: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test
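
For reference, the statistic those pages describe can be written compactly. This is only a sketch of the usual large-sample form; the symbols $F_{1,n_1}$, $F_{2,n_2}$ (the two empirical cumulative distributions), $n_1$, $n_2$ (the sample sizes) and $c(\alpha)$ are notation chosen here and do not come from the thread.

```latex
D_{n_1,n_2} = \sup_x \left| F_{1,n_1}(x) - F_{2,n_2}(x) \right|

\text{reject "same distribution" at level } \alpha \text{ if }\quad
D_{n_1,n_2} > c(\alpha)\,\sqrt{\frac{n_1 + n_2}{n_1 n_2}},
\qquad c(\alpha) = \sqrt{-\tfrac{1}{2}\ln\!\left(\tfrac{\alpha}{2}\right)}
```

For $\alpha = 0.05$ this gives $c(\alpha) \approx 1.358$. The rule is an asymptotic approximation, so it is less reliable for very small samples.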

Commented:
I'd like to implement that in code,

so that the project can run the test and tell the user whether the datasets are the same or not (at the 95% confidence level).
Software architect
Top Expert 2012
Commented:
I added some code to your project. This is a procedure for computing a very simple KS distance between two cumulative frequency distributions.

``````
// line between two points
// y = (x - x1) * (y2-y1)/(x2-x1) + y1
procedure CalcKSTest(data, dataOrig: T1DimArray; var fKSTemp: Double; var fDistance, fFreq, fFreqOrig: Double);

  function GetCurrentCumulativeValue(d: T1DimArray; fT: Double): Double;
  var
    k, nMax_Orig: Integer;
    fNextT, fNextFreq, fPrevT, fPrevFreq: Double;
  begin
    Result := 0;
    fPrevT := 0;
    fPrevFreq := 0;

    //calc max cumulative value - original
    nMax_Orig := 0;
    for k := Low(d) to High(d) do
    begin
      nMax_Orig := nMax_Orig + d[k].nCount;
    end;

    //sum freq. of temp. until fT temperature
    for k := Low(d) to High(d) do
    begin
      if d[k].fTemp <= fT then //until max fT
      begin
        fPrevT := d[k].fTemp;
        Result := Result + d[k].nCount/nMax_Orig;
        fPrevFreq := Result;

        if fPrevT=fT then Break;
      end
      else
      begin
        //calc rest of interpolated line between temp less than fT and first greater than fT
        if d[k].fTemp > fT then
        begin
          fNextT := fPrevT;
          fNextFreq := fPrevFreq;
          if k<High(d) then
          begin
            fNextT := d[k].fTemp;
            fNextFreq := Result + d[k].nCount/nMax_Orig;
          end;

          //go interpolate to fT - line between two points
          if fPrevT<>fNextT then
          begin
            Result := ((fT - fPrevT)*(fNextFreq - fPrevFreq)/(fNextT - fPrevT)) + fPrevFreq;
          end;
        end;

        Break;
      end;
    end;
  end;

var
  i, nMax: Integer;
  f, f_orig, fTemp: Double;
begin
  fKSTemp := 0;
  fDistance := 0;
  fFreq := 0;
  fFreqOrig := 0;
  f := 0;

  //calc max cumulative value
  nMax := 0;
  for i := Low(data) to High(data) do
  begin
    nMax := nMax + data[i].nCount;
  end;

  //go for all temp. values from newer source
  for i := Low(data) to High(data) do
  begin
    f := f + data[i].nCount/nMax;
    fTemp := data[i].fTemp;

    //what is distance from orig?
    //get cumulative freq. for fTemp in orig
    f_orig := GetCurrentCumulativeValue(dataOrig, fTemp);
    //get max distance
    if Abs(f-f_orig)>Abs(fDistance) then
    begin
      //keep values
      fDistance := f-f_orig;
      fFreq := f;
      fFreqOrig := f_orig;
      fKSTemp := fTemp;
    end;
  end;
end;
``````

... to show this visually, add the following at the bottom ...

``````...
data, dataOrig: T1DimArray;
fKSTemp, fDist, fKSFreq, fKSFreqOrig: Double;
...
Series6.Clear;
Series6.LinePen.Color := clBlue;
CalcKSTest(data, dataOrig, fKSTemp, fDist, fKSFreq, fKSFreqOrig);

...
``````

... this will draw a line between the two data series at the point where the KS distance is at its maximum.

Commented:
Drawing the line worked quite well.

But how would I use the maximum distance between these distributions so that the code, for example in a dialog box, reports whether the overall difference is significant at the given confidence level?

Commented:
Hello sinisav,

This is what I'm trying to apply in the function:

given the distributions and the maximum distance between them, and introducing a confidence level, decide whether the distributions are considered different or virtually the same.
KS-Two.jpg

Commented:
If we assume that your cumulative frequency distribution is the probability distribution from the KS test example, what is missing here? My function gives the maximum difference between the two distributions. I think the way you want to use it to reach your main goal is wrong. The KS difference can show you that something is wrong, but not where, and I think that is your problem. Your example images differ in size and position compared to the original one, so the test can only be used to detect that something is wrong (for example, that the average temperature is higher than in the original).
You must decide on the maximum difference between two distributions at which you still consider them "similar". That value is the trigger beyond which you must assume that "something is wrong".
If you know what is wrong here, write descriptive pseudo-code and then we can help you more.
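
One common way to pick that trigger value is the asymptotic critical value of the two-sample KS test. The sketch below is a minimal illustration under assumptions, not code from the attached project: `KSCriticalValue` and `DistributionsDiffer` are hypothetical names, `n1` and `n2` are assumed to be the total number of readings in each table (the sum of `nCount`), and the numbers in the main block are made up.

```pascal
program KSCriticalDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Asymptotic critical value for the two-sample KS distance.
// alpha is the significance level (0.05 for a 95% confidence level).
// Hypothetical helper, not part of the thread's project.
function KSCriticalValue(n1, n2: Integer; alpha: Double): Double;
var
  c: Double;
begin
  // c(alpha) = sqrt(-ln(alpha/2)/2), about 1.358 for alpha = 0.05
  c := Sqrt(-Ln(alpha / 2.0) / 2.0);
  // sqrt(1/n1 + 1/n2) equals sqrt((n1+n2)/(n1*n2)) without integer overflow
  Result := c * Sqrt(1.0 / n1 + 1.0 / n2);
end;

// True when the observed distance d exceeds the critical value,
// i.e. the distributions are considered different at level alpha.
function DistributionsDiffer(d: Double; n1, n2: Integer; alpha: Double): Boolean;
begin
  Result := Abs(d) > KSCriticalValue(n1, n2, alpha);
end;

begin
  // Made-up inputs: 200 and 150 readings, two candidate KS distances.
  WriteLn('critical D = ', KSCriticalValue(200, 150, 0.05):0:4);
  WriteLn('D = 0.12 differs: ', DistributionsDiffer(0.12, 200, 150, 0.05));
  WriteLn('D = 0.20 differs: ', DistributionsDiffer(0.20, 200, 150, 0.05));
end.
```

In the project, `d` would presumably be the `fDistance` returned by `CalcKSTest`; `Abs` is applied because that procedure keeps the sign of the difference.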

Commented:
hello sinisav,

There is a specific function to determine whether the maximum distance between the cumulative distributions is significant.

I'd like to implement that as well.

Please check the link below for the function and the chapter on the KS test for two distributions:

http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf

Something like this code, but I'm having problems implementing it; the probks function is essential for calculating the significance level:

``````
// Significance level for a scaled KS distance alam (alternating series).
Function probks(alam:double):double;

CONST
  eps1 = 0.001;
  eps2 = 1.0e-8;

Var
  a2,fac,sum,term,termbf: real;
  j:integer;

Begin
  a2:=-2.0*alam*alam;
  fac:=2.0;
  sum:=0.0;
  termbf:=0.0;

  for j := 1 to 600 do Begin
    term:=fac*exp(a2*sqr(j));
    sum:=sum+term;
    if (abs(term) <= eps1*termbf) OR (abs(term) <= eps2*sum) then begin
      probks:=sum;
      // series has converged: leave now, otherwise the final
      // probks:=1.0 below overwrites the result
      Exit;
    end
    Else begin
      fac:= -fac;
      termbf:= abs(term)
    end
  end;
  // only reached when the series did not converge
  probks:=1.0;

End;

// d is the maximum distance between the two empirical CDFs,
// prob the significance level returned by probks.
procedure kstwo (var data1:RealArrayN12; n1:integer; Var data2: RealArrayN12; n2:integer; Var d, prob:real);

VAR
  i, j1, j2:integer;
  en1, en2, fn1, fn2, dt, d1, d2: real;
Begin

  sort(n1,data1);
  sort(n2,data2);

  en1:= n1;
  en2:= n2;
  j1:=1;
  j2:=1;
  fn1:=0.0;
  fn2:=0.0;
  d:=0.0;

  WHILE (j1 <= n1) AND (j2 <= n2) DO BEGIN
    d1:=data1[j1];
    d2:=data2[j2];

    IF d1<=d2 then Begin
      fn1:=j1/en1;
      j1:=j1+1;
    END;

    IF d2<=d1 then Begin
      fn2:=j2/en2;
      j2:=j2+1;
    END;

    dt:= abs(fn2-fn1);

    IF dt>d Then d:= dt
  End;

  prob := probks(sqrt(en1*en2/(en1+en2))*d)
End;
``````
ks-test.jpg
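
A possible way to turn a KS distance into a yes/no answer is sketched below. It is a self-contained illustration under assumptions: `KSProb` restates the same alternating series as `probks` with an explicit exit once the series converges, `KSPValue` is a hypothetical wrapper, and `n1`, `n2` are assumed to be the total counts of each table. With binned temperature counts the data contain many ties, so the resulting p-value should be treated as approximate.

```pascal
program KSSignificanceDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Q(lambda) = 2 * sum over j >= 1 of (-1)^(j-1) * exp(-2 * j^2 * lambda^2).
// Returns 1.0 when the series does not converge (very small lambda).
function KSProb(lambda: Double): Double;
const
  EPS1 = 0.001;
  EPS2 = 1.0e-8;
var
  a2, fac, sum, term, termbf: Double;
  j: Integer;
begin
  a2 := -2.0 * lambda * lambda;
  fac := 2.0;
  sum := 0.0;
  termbf := 0.0;
  for j := 1 to 100 do
  begin
    term := fac * Exp(a2 * j * j);
    sum := sum + term;
    if (Abs(term) <= EPS1 * termbf) or (Abs(term) <= EPS2 * sum) then
    begin
      Result := sum;
      Exit; // converged: return before the fallback below
    end;
    fac := -fac;
    termbf := Abs(term);
  end;
  Result := 1.0;
end;

// Approximate p-value for a KS distance d between samples of size n1 and n2.
// Hypothetical wrapper, not part of the thread's project.
function KSPValue(d: Double; n1, n2: Integer): Double;
var
  en: Double;
begin
  en := Sqrt((n1 * 1.0) * n2 / (n1 + n2));
  Result := KSProb(en * Abs(d));
end;

var
  p: Double;
begin
  // Made-up inputs: d would come from the distance calculation,
  // n1 and n2 from summing the counts of each table.
  p := KSPValue(0.20, 200, 150);
  if p < 0.05 then
    WriteLn('Distributions differ significantly, p = ', p:0:4)
  else
    WriteLn('No significant difference, p = ', p:0:4);
  // A distance of 1 (no overlap at all) gives a p-value of practically zero.
  WriteLn('D = 1 significant: ', KSPValue(1.0, 200, 150) < 0.05);
end.
```

Comparing the p-value with 0.05 corresponds to the 95% confidence level asked for earlier in the thread; a different level only changes that one constant.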

Commented:
How would I handle the case where the maximum distance is, for example, 1? Thanks

Commented:
I am very busy lately, but in a few days I will try to adapt this example. Sorry.

Commented:
OK, implementing this function and procedure is very important.

Commented:
Got it working, never mind the last questions.

thanks.
