I have two Excel files: one contains a set of markers with their positions (MARKERID-KS.xls), and the other contains a set of regions with their positions on a map (MAPID-KS.xls), along with their color intensity.

Please see the attached Excel files.

I want to run a statistical test to determine whether the relationship between the distributions of these two datasets is significant, i.e. Marker to Map (in relation to the color intensity given in the map). I would like to use a Kolmogorov-Smirnov test (KS test) on this sample to obtain the D value and find out whether the two datasets differ significantly; please advise if another test would be more suitable. Can someone advise me on how to perform this? Is there a straightforward way to run this test on the data using R or MATLAB (I haven't used MATLAB but could try)?

Thanks

Cheers

Sample of the data

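DATA1: MARKERID

| MarkerID | Marker | Beginposition | EndPosition |
|----------|--------|---------------|-------------|
| MARK1 | IC | 6135 | 6046 |
| MARK2 | IC | 22428 | 22440 |
| MARK3 | IC | 23665 | 23684 |
| MARK4 | IC | 30394 | 30402 |
| MARK5 | IC | 30961 | 30964 |
| MARK6 | IC | 33439 | 33455 |
| MARK7 | IC | 36905 | 36906 |
| MARK8 | IC | 42219 | 42234 |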

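DATA2: MAPID

| MAPID | REGION | BEGIN | END | COLOR INTENSITY |
|-------|--------|-------|-----|-----------------|
| COLID1 | IC | 60639 | 63226 | 5.89 |
| COLID2 | IC | 42039 | 42259 | 5.47 |
| COLID3 | IC | 42626 | 43386 | 5.2 |
| COLID4 | IC | 63200 | 63369 | 4.52 |
| COLID5 | IC | 30699 | 32083 | 3.97 |
| COLID6 | IC | 66360 | 66555 | 3.9 |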

MARKERID-KS.xls

MAPID-KS.xls

I do not understand your data. The test you are referring to compares two distributions: which value constitutes the distribution you want to test? Apparently, it can be either “begin position” or “end position”, but not both in the same test. The test cannot be “in relation to the colour intensity”, because that value is missing for markers.

Please explain a little more about the data and what exactly you are trying to test, not necessarily using statistical terms.

(°v°)

I was away myself for a while. The figure below shows the cumulative distributions of the field BEGIN for both data sets: MARKER and MAP. They have been obtained very simply by sorting the data points in ascending order and by adding a column for the Y axis with values from 0 to 1, formatted as percent.

The Kolmogorov-Smirnov Test can answer the following question: are the distributions similar? Either one of the distributions is considered as known and correct or both are treated as samples from an unknown distribution.

Formally, the H0 hypothesis is that both samples have been drawn from the same distribution (or that a sample was drawn from the known distribution). The D statistic used by the test is the largest vertical distance between the two cumulative curves. For example, the MARKER distribution reaches 50% around 92000, while the MAP distribution is already at 90% at that point. A distance of 40% is too large to be attributed to chance, so the hypothesis is rejected: the samples cannot have been drawn from the same distribution.
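To make the D statistic concrete, here is a minimal Python sketch (Python rather than R or MATLAB, standard library only). It uses just the few sample rows pasted in the question, not the full files, so its numbers will not match the figure. The function names `ks_two_sample_d` and `critical_d` are made up for this illustration, and the threshold is the usual large-sample approximation, which is only rough for samples this small.

```python
import math
from bisect import bisect_right

def ks_two_sample_d(sample_a, sample_b):
    """Two-sample KS statistic D: the largest vertical gap between
    the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the gap can only change at an observed value
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d

def critical_d(n, m, alpha=0.05):
    """Large-sample approximation of the rejection threshold for D."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))

# Begin positions copied from the sample rows in the question only.
marker_begin = [6135, 22428, 23665, 30394, 30961, 33439, 36905, 42219]
map_begin = [60639, 42039, 42626, 63200, 30699, 66360]

d_stat = ks_two_sample_d(marker_begin, map_begin)
threshold = critical_d(len(marker_begin), len(map_begin))
print(f"D = {d_stat:.4f}, approximate 5% threshold = {threshold:.4f}")
```

H0 is rejected at the chosen level when D exceeds the threshold. With only a handful of rows the threshold is high, so the complete datasets should be used for any real conclusion.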

The difference between them is so obvious that no test is necessary.

Your real question, however, is quite different. You don't ask whether they are the same, but whether they “have any significance to each other”. To answer that, you need to formulate very precisely what significance that is and how, for example, you could draw one from the other. Once you have a clear formula that seems to fit the data, you can search for a test to validate your hypothesis.

You have shown two data sets with linked numbers (BEGIN and END). The relationship between these two numbers isn't clear. For example, most MAP points show a small increase between BEGIN and END, but there are exceptions with a decrease (up to -13%) and with very large increases (up to 192%). This needs to be addressed first. The dataset MARKER shows 18 points with relatively small variations (mostly increases) and 15 points with large decreases of -66% to -92%. They do not seem to come from the same sampling or the same distribution. You need to treat them separately at this stage.
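As an illustration of that check, the following Python sketch computes the relative change from begin to end for each row and lists the rows where the end lies before the begin. It assumes "increase" and "decrease" mean (end - begin) / begin, and it only uses the sample rows from the question, so the percentages quoted above (which come from the full files) will not all appear here. The helper name `relative_change` is made up for this example.

```python
def relative_change(begin, end):
    """Change from begin to end as a fraction of begin (assumed definition)."""
    return (end - begin) / begin

# (id, begin, end) copied from the MARKERID sample in the question.
marker_rows = [
    ("MARK1", 6135, 6046),
    ("MARK2", 22428, 22440),
    ("MARK3", 23665, 23684),
    ("MARK4", 30394, 30402),
    ("MARK5", 30961, 30964),
    ("MARK6", 33439, 33455),
    ("MARK7", 36905, 36906),
    ("MARK8", 42219, 42234),
]

for name, begin, end in marker_rows:
    print(f"{name}: {relative_change(begin, end):+.2%}")

# Rows whose end position is smaller than the begin position.
decreasing = [name for name, begin, end in marker_rows if end < begin]
print("End before begin:", decreasing)
```

Rows that end up in `decreasing` are the ones to examine, or to analyse separately, before running any test.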

You basically want to “do something” with these numbers. Statistics doesn't work that way. You need to formulate some sort of hypothesis and also validate your data to ensure it can answer your question.

Does that help?

(°v°)

Q-26305568.PNG

1. Make a copy of one dataset in column A of a new worksheet

2. Sort it in ascending order

3. Add value 0 as first value if needed

4. Create a second column with values from 0 to 1 (or 0 to 100%)

For example, for 5 values, enter 0% (next to value 0), and 20%, 40%, 60%, 80%, 100%

5. Create an XY chart based on the data

6. Repeat 1-4 for the second data set (in another area of the sheet)

7. Copy the data range

8. Paste special on the chart to add a new series with new x values

(the default is adding new points)

The instructions are for Excel, but any charting software can be used, naturally.
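For anyone who prefers a script over a worksheet, steps 1 to 4 can be sketched in Python as below. This is only an illustration of the recipe: `cdf_points` is a hypothetical helper name, "if needed" in step 3 is interpreted as "unless the smallest value is already 0", and the resulting pairs can be passed to any plotting tool.

```python
def cdf_points(values, add_zero=True):
    """Sort ascending, optionally prepend 0, and pair every value
    with a cumulative fraction running from 0 to 1."""
    xs = sorted(values)                                     # step 2
    n = len(xs)
    points = [(x, (i + 1) / n) for i, x in enumerate(xs)]   # step 4
    if add_zero and (not xs or xs[0] != 0):                 # step 3
        points.insert(0, (0, 0.0))
    return points

# BEGIN values copied from the MAPID sample in the question.
map_begin = [60639, 42039, 42626, 63200, 30699, 66360]
for x, y in cdf_points(map_begin):
    print(f"{x:>6}  {y:.0%}")
```

Calling the function once per dataset gives the two series for the XY chart described in steps 5 to 8.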

(°v°)

Here is MATLAB code to read your data, compute the Kolmogorov-Smirnov statistics, and draw CDF plots. You can do this not only for the beginning positions in the Marker and Map data, but also for the end positions and the middle positions. Copy the code into the MATLAB editor and run it in cell mode. Make sure your data files are located in your working directory. Uncomment the two commented-out `sort` lines (marked "uncomment to make beginposition < endposition") to ensure that beginning coordinates are always smaller than end coordinates. I plotted beginning positions with solid lines, end positions with dotted lines, and middle positions with dash-dot (-.-) lines.

For details on KSTEST2 function see MATLAB documentation or here:

http://www.mathworks.com/access/helpdesk/help/toolbox/stats/kstest2.html

You could probably also run some kind of enrichment test, such as Fisher's exact test or a chi-square test, to check whether marker segments overlap with map segments more often than expected.
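The first ingredient of such an enrichment test is the overlap count itself. Here is a small Python sketch of that counting step only, using the sample rows from the question. The `overlaps` helper is made up for this example, and it treats each segment as a closed interval regardless of which end is listed first (that is an assumption about how reversed rows such as MARK1 should be read). Building the contingency table for Fisher or chi-square additionally requires deciding what the "background" is, which is not done here.

```python
def overlaps(seg_a, seg_b):
    """True if two closed intervals share at least one position."""
    a_lo, a_hi = sorted(seg_a)  # tolerate rows where end < begin
    b_lo, b_hi = sorted(seg_b)
    return a_lo <= b_hi and b_lo <= a_hi

# Segments copied from the sample rows in the question.
markers = {
    "MARK1": (6135, 6046), "MARK2": (22428, 22440),
    "MARK3": (23665, 23684), "MARK4": (30394, 30402),
    "MARK5": (30961, 30964), "MARK6": (33439, 33455),
    "MARK7": (36905, 36906), "MARK8": (42219, 42234),
}
regions = {
    "COLID1": (60639, 63226), "COLID2": (42039, 42259),
    "COLID3": (42626, 43386), "COLID4": (63200, 63369),
    "COLID5": (30699, 32083), "COLID6": (66360, 66555),
}

# Markers that touch at least one map region.
hits = [name for name, seg in markers.items()
        if any(overlaps(seg, region) for region in regions.values())]
print(f"{len(hits)} of {len(markers)} markers overlap a map region: {hits}")
```

The counts of overlapping and non-overlapping markers would form one row of the 2x2 table; the other row depends on the null model you choose.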

In any case, to run a statistical test you have to formulate your null hypothesis and your alternative hypothesis (see any statistics textbook). Then we can help you select the most appropriate statistic.

```
%% read the files
markerfile = 'MARKERID-KS.xls';
[MarkerData MarkerText] = xlsread(markerfile);
% MarkerData = sort(MarkerData,2); % uncomment to make beginposition < endposition
MarkerText(1,:) = strrep(MarkerText(1,:),' ',''); % strip spaces from header names before using them as field names
Marker.(MarkerText{1,1}) = MarkerText(2:end,1);
Marker.(MarkerText{1,2}) = MarkerText(2:end,2);
for k=3:size(MarkerText,2)
Marker.(MarkerText{1,k}) = MarkerData(2:end,k-2);
end
Marker.MiddlePosition = (Marker.Beginposition + Marker.EndPosition)/2;
mapfile = 'MAPID-KS.xls';
[MapData MapText] = xlsread(mapfile);
% MapData(:,1:2) = sort(MapData(:,1:2),2); % uncomment to make beginposition < endposition
MapText(1,:) = strrep(MapText(1,:),' ',''); % strip spaces from header names before using them as field names
Map.(MapText{1,1}) = MapText(2:end,1);
Map.(MapText{1,2}) = MapText(2:end,2);
for k=3:size(MapText,2)
Map.(MapText{1,k}) = MapData(2:end,k-2);
end
Map.MiddlePosition = (Map.BEGIN + Map.END)/2;
%% Kolmogorov-Smirnov statistics
[KSbegin.h,KSbegin.p,KSbegin.ksstat] = kstest2(Marker.Beginposition,Map.BEGIN);
[KSend.h,KSend.p,KSend.ksstat] = kstest2(Marker.EndPosition,Map.END);
[KSmid.h,KSmid.p,KSmid.ksstat] = kstest2(Marker.MiddlePosition,Map.MiddlePosition);
disp('Kolmogorov-Smirnov statistics for beginning positions:')
disp(KSbegin)
disp('Kolmogorov-Smirnov statistics for end positions:')
disp(KSend)
disp('Kolmogorov-Smirnov statistics for middle positions:')
disp(KSmid)
%% Distribution plot
F1b = cdfplot(Marker.Beginposition);
hold on
F1m = cdfplot(Marker.MiddlePosition);
F1e = cdfplot(Marker.EndPosition);
F2b = cdfplot(Map.BEGIN);
F2m = cdfplot(Map.MiddlePosition);
F2e = cdfplot(Map.END);
hold off
set(F1b,'LineWidth',2,'Color','r','LineStyle','-')
set(F1m,'LineWidth',2,'Color','r','LineStyle','-.')
set(F1e,'LineWidth',2,'Color','r','LineStyle',':')
set(F2b,'LineWidth',2,'Color','b','LineStyle','-')
set(F2m,'LineWidth',2,'Color','b','LineStyle','-.')
set(F2e,'LineWidth',2,'Color','b','LineStyle',':')
legend([F1b F2b],'Marker','Map','Location','SE')
```
