Difference between two distributions

Frutasamir
I have two tables of temperature distributions,

and I want to know whether these two distributions are significantly different.

Therefore I want to apply the two-sample Kolmogorov-Smirnov test.

Can anyone adapt the test for my needs?

Thanks,

see my attached code:
Regi-es.pas
Awarded 2010
Top Expert 2013

Commented:
I don't see why you would need to adapt the test; it does exactly what you want to do. The two-sample Kolmogorov-Smirnov test will tell you whether it is likely that both datasets come from the same distribution. You can plug your data in here to see the numbers (and to check your results if you still want to code it): http://www.physics.csbsju.edu/stats/KS-test.html

Try this for more info on the test if you need it: http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Kolmogorov-Smirnov_test.html

There is a Wikipedia article too, but I find the math articles there a little overly complicated and hard to understand: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test

Author

Commented:
I'd like to implement that in code,

as part of a project that tells the user whether the two datasets follow the same distribution or not (at the 95% confidence level).
Software architect
Top Expert 2012
Commented:
I added some code to your project. This procedure computes a very simple KS distance between two cumulative frequency distributions.

// line between two points
// y = (x - x1) * (y2-y1)/(x2-x1) + y1
procedure CalcKSTest(data, dataOrig: T1DimArray; var fKSTemp: Double; var fDistance, fFreq, fFreqOrig: Double);

  function GetCurrentCumulativeValue(d: T1DimArray; fT: Double): Double;
  var
    k, nMax_Orig: Integer;
    fNextT, fNextFreq, fPrevT, fPrevFreq: Double;
  begin
    Result := 0;
    fPrevT := 0;
    fPrevFreq := 0;

    //calc max cumulative value - original
    nMax_Orig := 0;
    for k:= Low(d) to High(d) do
    begin
      nMax_Orig := nMax_Orig + d[k].nCount;
    end;

    //sum freq. of temp. until fT temperature
    for k:= Low(d) to High(d) do
    begin
      if d[k].fTemp <= fT then //until max fT
      begin
        fPrevT := d[k].fTemp;
        Result := Result + d[k].nCount/nMax_Orig;
        fPrevFreq := Result;

        if fPrevT=fT then Break;
      end
      else
      begin
        //interpolate along the line between the last temp less than fT and the first temp greater than fT
        if d[k].fTemp > fT then
        begin
          fNextT := fPrevT;
          fNextFreq := fPrevFreq;
          if k<High(d) then
          begin
            fNextT := d[k].fTemp;
            fNextFreq := Result + d[k].nCount/nMax_Orig;
          end;

          //go interpolate to fT - line between two points
          if fPrevT<>fNextT then
          begin
            Result := ((fT - fPrevT)*(fNextFreq - fPrevFreq)/(fNextT - fPrevT)) + fPrevFreq;
          end;
        end;

        Break;
      end;
    end;
  end;

var
  i, nMax: Integer;
  f, f_orig, fTemp: Double;
begin
  fKSTemp :=0;
  fDistance := 0;
  fFreq := 0;
  fFreqOrig := 0;
  f := 0;

  //calc max cumulative value
  nMax := 0;
  for i:= Low(data) to High(data) do
  begin
    nMax := nMax + data[i].nCount;
  end;

  //go for all temp. values from newer source
  for i:= Low(data) to High(data) do
  begin
    f := f + data[i].nCount/nMax;
    fTemp := data[i].fTemp;

    //what is distance from orig?
    //get cumulative freq. for fTemp in orig
    f_orig := GetCurrentCumulativeValue(dataOrig, fTemp);
    //get max distance
    if Abs(f-f_orig)>Abs(fDistance) then
    begin
      //keep values
      fDistance := f-f_orig;
      fFreq := f;
      fFreqOrig := f_orig;
      fKSTemp := fTemp;
    end;
  end;
end;



... to show this visually, add the following at the bottom ...

...
  data, dataOrig: T1DimArray;
  fKSTemp, fDist, fKSFreq, fKSFreqOrig: Double;
...
  Series6.Clear;
  Series6.LinePen.Color := clBlue;
  CalcKSTest(data, dataOrig, fKSTemp, fDist, fKSFreq, fKSFreqOrig);

  Series6.addxy(fKSTemp, fKSFreqOrig);
  Series6.addxy(fKSTemp, fKSFreq);
...



... this will draw a line between the two data series at the point where the KS distance is at its maximum.

Author

Commented:
Drawing the line worked quite well.

But how would I use the maximum distance between these distributions in the code so that, for example, a dialog box reports whether the overall difference is significant at a given confidence level?

Author

Commented:
Hello sinisav,

this is what I'm trying to apply in the function:

given the two distributions, the maximum distance between them, and a confidence level, decide whether the distributions are considered different or virtually the same.
KS-Two.jpg
Sinisa Vuk, Software architect
Top Expert 2012

Commented:
If we assume that your cumulative frequency distribution plays the role of the probability distribution in the KS test example, what is missing here? My function gives the maximum difference between the two distributions. I think the way you want to use it is the wrong way to reach your main goal. The KS difference can show you that something is wrong, but not where, and I think that is your real problem. Your example images differ in size and position compared to the original one, so the test can only be used to detect that something is wrong (for example, that the average temperature is higher than in the original).
You must decide on the maximum difference between the two distributions at which you still consider them "similar". That value is the trigger beyond which you must assume that "something is wrong".
If you know what is wrong here, write descriptive pseudo-code and then we can help you more.
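
One conventional choice for that trigger value is the large-sample critical distance of the two-sample KS test, D_crit = c(alpha) * sqrt((n1 + n2) / (n1 * n2)) with c(alpha) = sqrt(-ln(alpha / 2) / 2), which is about 1.358 for alpha = 0.05. The sketch below is only an illustration: KSCriticalDistance and KSIsSignificant are hypothetical names that do not exist in the posted project, and n1 and n2 are assumed to be the total number of observations behind each distribution (for the tables here, presumably the sum of nCount over each array).

```pascal
{ Hypothetical helpers, not part of the posted project.              }
{ Large-sample critical value for the two-sample KS distance:        }
{   D_crit   = c(alpha) * sqrt((n1 + n2) / (n1 * n2))                }
{   c(alpha) = sqrt(-ln(alpha / 2) / 2)                              }
{ Assumes n1 > 0, n2 > 0 and 0 < alpha < 1.                          }
function KSCriticalDistance(n1, n2: Integer; alpha: Double): Double;
var
  c: Double;
begin
  c := Sqrt(-Ln(alpha / 2.0) / 2.0);   { about 1.358 for alpha = 0.05 }
  { 1.0 * n1 * n2 forces floating-point math and avoids integer overflow }
  KSCriticalDistance := c * Sqrt((n1 + n2) / (1.0 * n1 * n2));
end;

{ True when the observed distance exceeds the critical distance.     }
{ Abs() is used because CalcKSTest returns a signed distance.        }
function KSIsSignificant(dist: Double; n1, n2: Integer; alpha: Double): Boolean;
begin
  KSIsSignificant := Abs(dist) > KSCriticalDistance(n1, n2, alpha);
end;
```

With the fDist value returned by CalcKSTest, a call such as KSIsSignificant(fDist, nTotal, nTotalOrig, 0.05) could drive the dialog box, where nTotal and nTotalOrig are assumed variables holding the two total counts. Because the data are binned counts and CalcKSTest interpolates between bins, the result should be treated as approximate.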

Author

Commented:
Hello sinisav,

there is a specific function to determine whether the maximum distance between the cumulative distributions is significant.

I'd like to implement that as well.

Please check the link below for the function and the chapter on the KS test for two distributions:

http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf

Something like this code, but I am having problems implementing it; the probks function is essential for calculating the significance level:

Function probks(alam:double):double;

CONST
  eps1 = 0.001;
  eps2 = 1.0e-8;

Var
  a2,fac,sum,term,termbf: real;
  j:integer;

Begin
  a2:=-2.0*alam*alam;
  fac:=2.0;
  sum:=0.0;
  termbf:=0.0;

  for j := 1 to 600 do Begin
    term:=fac*exp(a2*sqr(j));
    sum:=sum+term;
    if (abs(term) <= eps1*termbf) OR (abs(term) <= eps2*sum) then begin
      probks:=sum;
      Exit  // converged: leave now, otherwise the fallback below overwrites the result
    end
    Else begin
      fac:= -fac;
      termbf:= abs(term)
    end
  end;
  probks:=1.0;  // reached only if the series did not converge

End;

procedure kstwo (var data1:RealArrayN12; n1:integer; Var data2: RealArrayN12; n2:integer; Var d, prob:real);

VAR
  i, j1, j2:integer;
  en1, en2, fn1, fn2, dt, d1, d2: real;
Begin

  sort(n1,data1);
  sort(n2,data2);

  en1:= n1;
  en2:= n2;
  j1:=1;
  j2:=1;
  fn1:=0.0;
  fn2:=0.0;
  d:=0.0;

  WHILE (j1 <= n1) AND (j2 <= n2) DO BEGIN
    d1:=data1[j1];
    d2:=data2[j2];

    IF d1<=d2 then Begin
      fn1:=j1/en1;
      j1:=j1+1;
    END;

    IF d2<=d1 then Begin
      fn2:=j2/en2;
      j2:=j2+1;
    END;

    dt:= abs(fn2-fn1);

    IF dt>d Then d:= dt
  End;

  prob := probks(sqrt(en1*en2/(en1+en2))*d)
End;


ks-test.jpg

Author

Commented:
How would I implement this if the maximum distance is, for example, 1? Thanks
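
A maximum distance of 1 means the two empirical distributions do not overlap at all, but whether that is significant still depends on the sample sizes. The following is a minimal sketch, assuming the usual asymptotic series Q(lambda) = 2 * sum over j of (-1)^(j-1) * exp(-2 * j^2 * lambda^2) with lambda = sqrt(n1 * n2 / (n1 + n2)) * d. KSTailProb and KSPValue are hypothetical names chosen for this illustration, and the approximation is rough for very small samples.

```pascal
{ Hypothetical names; the series is the same alternating sum used by }
{ probks, stopped as soon as the terms become negligible.            }
function KSTailProb(lambda: Double): Double;
const
  Eps1 = 0.001;
  Eps2 = 1.0e-8;
var
  j: Integer;
  a2, fac, sum, term, termBf: Double;
begin
  KSTailProb := 1.0;              { kept if the series does not converge }
  a2 := -2.0 * lambda * lambda;
  fac := 2.0;
  sum := 0.0;
  termBf := 0.0;
  for j := 1 to 100 do
  begin
    term := fac * Exp(a2 * j * j);
    sum := sum + term;
    if (Abs(term) <= Eps1 * termBf) or (Abs(term) <= Eps2 * sum) then
    begin
      KSTailProb := sum;          { converged: return the partial sum }
      Exit;
    end;
    fac := -fac;                  { alternate the sign of the next term }
    termBf := Abs(term);
  end;
end;

{ Approximate significance level for distance d and sample sizes n1, n2. }
{ Assumes n1 > 0 and n2 > 0.                                             }
function KSPValue(d: Double; n1, n2: Integer): Double;
begin
  KSPValue := KSTailProb(Sqrt((1.0 * n1 * n2) / (n1 + n2)) * d);
end;
```

Under this approximation, KSPValue(1.0, 10, 10) comes out far below 0.05, so two samples of 10 values with a distance of 1 would be reported as different at the 95% level, while KSPValue(1.0, 3, 3) stays above 0.05, so two samples of 3 values would not.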
Sinisa Vuk, Software architect
Top Expert 2012

Commented:
I am very busy lately, but in a few days I will try to adapt this example. Sorry.

Author

Commented:
OK, implementing this function and procedure is very important.

Author

Commented:
Got it working, never mind the last questions.

Thanks.
