
Performing statistical analysis (Kolmogorov-Smirnov test or related) using R, MATLAB, or Excel

haravallabhan used Ask the Experts™
I have two Excel files: one contains a set of marker values with their positions (MARKERID-KS.xls), and the other contains a set of map regions with their positions and their color intensity (MAPID-KS.xls).
Please see the attached Excel files.

I want to run a statistical test to determine whether the relationship between the distributions of these two datasets, Marker and Map, is significant (in relation to the color intensity given in the map). I would like to use a Kolmogorov-Smirnov test (KS test) on this sample to obtain the D value and to find out whether there is any significance between the two datasets; please advise me if there is another test I could use. Can someone here please advise me on how to perform this? Is there a straightforward way to run this test on the data using R or MATLAB? (I haven't used MATLAB but could try.)



Sample of the data

Markers (MARKERID-KS.xls):
MarkerID   Marker   Beginposition   EndPosition
MARK1      IC       6135            6046
MARK2      IC       22428           22440
MARK3      IC       23665           23684
MARK4      IC       30394           30402
MARK5      IC       30961           30964
MARK6      IC       33439           33455
MARK7      IC       36905           36906
MARK8      IC       42219           42234

Map regions (MAPID-KS.xls, header row not included in this sample):
COLID1     IC       60639           63226         5.89
COLID2     IC       42039           42259         5.47
COLID3     IC       42626           43386         5.2
COLID4     IC       63200           63369         4.52
COLID5     IC       30699           32083         3.97
COLID6     IC       66360           66555         3.9


I do not understand your data. The test you are referring to compares two distributions: which value constitutes the distribution you want to test? Apparently, it can be either “begin position” or “end position”, but not both in the same test. The test cannot be “in relation to the colour intensity”, because that value is missing for markers.

Please explain a little more about the data and what exactly you are trying to test, not necessarily using statistical terms.




You are absolutely right, and thanks for clarifying. Because the Marker and Map entries each have a begin and an end position, the data analysis has been challenging, and I was hoping there might be a way to do it that encompasses both positions. I will get back in a few hours with a detailed explanation of the data.

Thanks for the inputs.



Sorry for getting back to you late.

What I am trying to test is whether the marker positions have any significance in relation to the positions on the map. We can ignore the colour intensity for now.

The questions I am trying to address are:

1) Do these two distributions have any significance to each other?
2) Does the marker position (either begin or end) correlate with the begin or end position on the map?
3) Given that both distributions come from a random distribution, how do I establish that there is significance, i.e. how do I do it using the KS test, and are there any other statistical tests that could be used to find significance between these two datasets?
4) Given that the data are in Excel files, how do I perform the test in R? (I have been out of touch with R for some time.)

I would be happy to use just one column, e.g. the end position of Marker and the end position of Map, to get these statistics. Additionally, I have the distance from the begin position of each Marker to the end position of the Map (which is my main concern), and I would like to establish whether there is any significance in its distribution or pattern and whether there is a correlation between the two datasets.

I am unsure how to go about this statistically and was advised by someone to try a KS test. Could you shed some light on this?

Hi haravallabhan,

I was away myself for a while. The figure below shows the cumulative distributions of the field BEGIN for both data sets: MARKER and MAP. They have been obtained very simply by sorting the data points in ascending order and by adding a column for the Y axis with values from 0 to 1, formatted as percent.

The Kolmogorov-Smirnov Test can answer the following question: are the distributions similar? Either one of the distributions is considered as known and correct or both are treated as samples from an unknown distribution.

Formally, the H0 hypothesis is that both samples have been drawn from the same distribution (or that a sample was drawn from the known distribution). The D statistic used by the test is the largest vertical distance between the two cumulative curves. For example, the MARKER distribution reaches 50% around 92000, while the MAP distribution is already at 90% at that point. This gap of 40 percentage points (so D is at least 0.40) is too large to be attributed to chance, so the hypothesis is rejected: the samples cannot have been drawn from the same distribution.

The difference between them is so obvious that no test is necessary.
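As a rough illustration of how D is obtained, here is a minimal MATLAB sketch that needs no toolbox. It uses only the begin positions of the few sample rows quoted in the question, not the full files, and the variable names are my own. The 1.358 coefficient is the usual large-sample approximation for a 5% threshold, so treat the comparison as indicative only for samples this small.

```matlab
% Begin positions copied from the sample rows in the question
markerBegin = [6135 22428 23665 30394 30961 33439 36905 42219];
mapBegin    = [60639 42039 42626 63200 30699 66360];

% Evaluate both empirical CDFs at every observed value
pooled = sort([markerBegin mapBegin]);
ecdfAt = @(sample, t) arrayfun(@(v) mean(sample <= v), t);
F1 = ecdfAt(markerBegin, pooled);
F2 = ecdfAt(mapBegin, pooled);

% D is the largest vertical gap between the two step functions
[D, idx] = max(abs(F1 - F2));

% Approximate 5% critical value (large-sample formula, an assumption here)
n1 = numel(markerBegin);
n2 = numel(mapBegin);
Dcrit = 1.358 * sqrt((n1 + n2) / (n1 * n2));

fprintf('D = %.4f at position %d, approx. 5%% threshold = %.4f\n', ...
        D, pooled(idx), Dcrit);
```

On these fourteen rows D comes out at about 0.71, just under the approximate threshold of about 0.73, so the handful of rows in the question proves little by itself; the complete data sets are what the test should be run on.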

Your real question, however, is quite different. You don't ask whether they are the same, but whether they “have any significance to each other”. To answer that, you need to formulate very precisely what significance that is and how, for example, you could draw one from the other. Once you have a clear formula that seems to fit the data, you can search for a test to validate your hypothesis.

You have shown two data sets with linked numbers (BEGIN and END). The relationship between these two numbers isn't clear. For example, most MAP points show a small increase between BEGIN and END, but there are exceptions with a decrease (up to -13%) and with very large increases (up to 192%). This needs to be addressed first. The dataset MARKER shows 18 points with relatively small variations (mostly increases) and 15 points with large decreases of -66% to -92%. They do not seem to come from the same sampling or the same distribution. You need to treat them separately at this stage.
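The percentage changes mentioned above are simply (END - BEGIN) / BEGIN. A small MATLAB sketch, using only the eight marker rows quoted in the question (variable names are mine), shows one way to compute them and to flag rows where END is smaller than BEGIN:

```matlab
% Marker rows copied from the sample in the question
beginPos = [6135 22428 23665 30394 30961 33439 36905 42219];
endPos   = [6046 22440 23684 30402 30964 33455 36906 42234];

% Relative change from begin to end, in percent
relChange = 100 * (endPos - beginPos) ./ beginPos;

% Rows where the end position lies before the begin position
reversed = endPos < beginPos;

for i = 1:numel(beginPos)
    fprintf('MARK%d: %+.3f%%  reversed=%d\n', i, relChange(i), reversed(i));
end
```

In this small sample only MARK1 is reversed; the large decreases I mention come from the full file.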

You basically want to “do something” with these numbers. Statistics don't work that way. You need to formulate some sort of hypothesis and also validate your data to ensure it can answer your question.

Does that help?


Thank you very much. This is much clearer. The data in the file are not the complete, real data; I made them up to get clarity on my analysis. May I ask how you generated this graph, i.e. which software did you use? Is it MATLAB?

It is an Excel chart. To reproduce such charts:

1. Make a copy of one dataset in column A of a new worksheet
2. Sort it in ascending order
3. Add value 0 as first value if needed
4. Create a second column with values from 0 to 1 (or 0 to 100%)
    For example, for 5 values, enter 0% (next to value 0), and 20%, 40%, 60%, 80%, 100%
5. Create an XY chart based on the data
6. Repeat 1-4 for the second data set (in another area of the sheet)
7. Copy the data range
8. Paste special on the chart to add a new series with new x values
    (the default is adding new points)

The instructions are for Excel, but any charting software can be used, naturally.
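For anyone who prefers MATLAB, the same steps could be sketched as follows. This is only an illustration on the begin positions of the sample rows from the question, with variable names of my own choosing; it mirrors the Excel chart by joining the points with straight lines (use stairs instead of plot for a true step function).

```matlab
% Begin positions copied from the sample rows in the question
markerBegin = [6135 22428 23665 30394 30961 33439 36905 42219];
mapBegin    = [60639 42039 42626 63200 30699 66360];

% Steps 1-4: sort, prepend the value 0, pair with fractions from 0 to 1
x1 = [0 sort(markerBegin)];
y1 = (0:numel(markerBegin)) / numel(markerBegin);
x2 = [0 sort(mapBegin)];
y2 = (0:numel(mapBegin)) / numel(mapBegin);

% Steps 5-8: draw both series on one XY chart
plot(x1, y1, '-o', x2, y2, '-s');
xlabel('Begin position');
ylabel('Cumulative fraction');
legend('Marker', 'Map', 'Location', 'southeast');
```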

Sorry for the late answer. I was away for a while.

Here is MATLAB code to read your data, calculate the Kolmogorov-Smirnov statistics, and draw the CDF plots. You can do this not only for the begin positions in the Marker and Map data, but also for the end positions and the middle positions. Copy the code into the MATLAB editor and run it in cell mode. Make sure your data files are located in your working directory. Uncomment lines 4 and 15 (the two commented-out sort calls) to make sure begin coordinates are always smaller than end coordinates. I plotted begin positions with solid lines, end positions with dotted lines, and middle positions with dash-dot (-.) lines.

For details on KSTEST2 function see MATLAB documentation or here:

You could probably also run some kind of enrichment test, such as Fisher's exact test or a chi-square test, to check whether marker segments overlap with map segments more often than expected.
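Any such enrichment test starts from a count of overlapping segments. The MATLAB sketch below shows only that counting step, on the sample rows from the question; the variable names are mine, and turning the count into a Fisher or chi-square test would still need a background expectation that is not available here.

```matlab
% [begin end] pairs copied from the sample rows in the question
marker = [6135 6046; 22428 22440; 23665 23684; 30394 30402; ...
          30961 30964; 33439 33455; 36905 36906; 42219 42234];
map    = [60639 63226; 42039 42259; 42626 43386; ...
          63200 63369; 30699 32083; 66360 66555];

% Make sure begin <= end in every row
marker = sort(marker, 2);
map    = sort(map, 2);

% Two closed intervals overlap when each one starts before the other ends
overlaps = false(size(marker, 1), 1);
for i = 1:size(marker, 1)
    overlaps(i) = any(marker(i,1) <= map(:,2) & map(:,1) <= marker(i,2));
end

nOverlap = sum(overlaps);
fprintf('%d of %d marker segments overlap a map segment\n', ...
        nOverlap, size(marker, 1));
```

On these rows, two of the eight markers (MARK5 and MARK8) fall inside a map segment.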

In any case, to run a statistical test you first have to formulate your null hypothesis and alternative hypothesis (see any statistics textbook). Then we can help you select the most appropriate statistic.

%% read the files
markerfile = 'MARKERID-KS.xls';
[MarkerData, MarkerText] = xlsread(markerfile); % numeric cells, text cells (including the header row)
% MarkerData = sort(MarkerData,2); % uncomment to make beginposition < endposition
MarkerText(1,:) = strrep(MarkerText(1,:),' ',''); % make sure no spaces in field names
Marker.(MarkerText{1,1}) = MarkerText(2:end,1);
Marker.(MarkerText{1,2}) = MarkerText(2:end,2);
for k=3:size(MarkerText,2)
    Marker.(MarkerText{1,k}) = MarkerData(:,k-2); % xlsread already drops the text header row from the numeric output
end
Marker.MiddlePosition = (Marker.Beginposition + Marker.EndPosition)/2;

mapfile = 'MAPID-KS.xls';
[MapData, MapText] = xlsread(mapfile);
% MapData(:,1:2) = sort(MapData(:,1:2),2); % uncomment to make beginposition < endposition
MapText(1,:) = strrep(MapText(1,:),' ',''); % make sure no spaces in field names
Map.(MapText{1,1}) = MapText(2:end,1);
Map.(MapText{1,2}) = MapText(2:end,2);
for k=3:size(MapText,2)
    Map.(MapText{1,k}) = MapData(:,k-2);
end
Map.MiddlePosition = (Map.BEGIN + Map.END)/2;

%% Kolmogorov-Smirnov statistics
% h = 1 means H0 (same distribution) is rejected at the 5% level; ksstat is D
[KSbegin.h,KSbegin.p,KSbegin.ksstat] = kstest2(Marker.Beginposition,Map.BEGIN);
[KSend.h,KSend.p,KSend.ksstat] = kstest2(Marker.EndPosition,Map.END);
[KSmid.h,KSmid.p,KSmid.ksstat] = kstest2(Marker.MiddlePosition,Map.MiddlePosition);
disp('Kolmogorov-Smirnov statistics for beginning positions:')
disp(KSbegin)
disp('Kolmogorov-Smirnov statistics for end positions:')
disp(KSend)
disp('Kolmogorov-Smirnov statistics for middle positions:')
disp(KSmid)

%% Distribution plot
F1b = cdfplot(Marker.Beginposition);
hold on
F1m = cdfplot(Marker.MiddlePosition);
F1e = cdfplot(Marker.EndPosition);
F2b = cdfplot(Map.BEGIN);
F2m = cdfplot(Map.MiddlePosition);
F2e = cdfplot(Map.END);
hold off
set([F1m F2m],'LineStyle','-.')    % middle positions: dash-dot
set([F1e F2e],'LineStyle',':')     % end positions: dotted
set([F1b F1m F1e],'Color','b')     % marker curves (colour is an arbitrary choice)
set([F2b F2m F2e],'Color','r')     % map curves (colour is an arbitrary choice)
legend([F1b F2b],'Marker','Map','Location','SE')