Solved

# Matlab: how to vectorize for "large data set" loops

Posted on 2013-11-04
Hi, I am learning how to run calculations on "large" data sets in MATLAB. I have a text file containing every trade made in a stock called MTB. My goal is to turn this tick data into daily data.
For example, over 15,000 transactions took place on the first day; my program turns that data into the open, high, low, close, total volume, and net transaction for each day.

My questions:
Can you help me make the code faster?
Do you have any practical "techniques" to verify the calculations, since they are made on such a large data set?

My program took 20.7757 seconds to run,
and I got the following warning. I don't really know what it means:
Warning: 'rows' flag is ignored for cell arrays.
> In cell.unique at 32
In ex5 at 16
Warning: 'rows' flag is ignored for cell arrays.
> In cell.unique at 32
In ex5 at 17

``````
%DESCRIPTION: Turn tick data into daily data
%close,upVol,downVol)
%OUTPUT: openDay,highDay,lowDay,closeDay,volumeDay,netTransaction
%net transaction traded = sum(price*upVol - price*downVol)

clear;
startTime=tic;

%begIdx: index of the first trade of the day in the tick database
%endIdx: index of the last trade of that day

%the number of daily elements, useful for the loop.
n=numel(dailyDate);

%initialize arrays
highDay=[];
lowDay=[];openDay=[];closeDay=[];
volumeDay=[];netTransaction=[];
priceChange(1)=NaN; mfChange(1)=NaN;

%loop: bottleneck is here!!
for j=1:n
    openDay(j)=open(begIdx(j));
    closeDay(j)=close(endIdx(j));
    highDay(j)=max(high(begIdx(j):endIdx(j)));
    lowDay(j)=min(low(begIdx(j):endIdx(j)));
    volumeDay(j)=sum(upVol(begIdx(j):endIdx(j)))+sum(downVol(begIdx(j):endIdx(j)));
    cumSum=0;
    for i=begIdx(j):endIdx(j)
        cumSum=cumSum+close(i)*(upVol(i)-downVol(i));
    end
    netTransaction(j)=cumSum;
end
elapsedTimeNonVectorized=toc(startTime)
``````
MTB-db2.txt
Question by:pgmerLA

LVL 15

Expert Comment

ID: 39622958
What is your MATLAB version? Do you have the Financial Toolbox?

LVL 27

Assisted Solution

d-glitch earned 600 total points
ID: 39624315
Vectorization is not likely to be helpful, particularly if your input data is a text file.

Your processing is extremely simple: find the START, END, MAX, and MIN (comparisons)
and accumulate VOLUME (additions). There is nothing to vectorize. The processing time is probably negligible compared to the I/O.

The biggest win might be to optimize the input format.
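
One possible reading of "optimize the input format" is to parse the text file once and cache the parsed columns in a binary MAT-file, so later runs skip the text parsing. The sketch below is only an illustration under that assumption: the column values and the temporary file name are hypothetical, and the small vectors stand in for the result of reading the real file.

```matlab
% Hypothetical parsed columns standing in for the result of reading the text file.
close = [10.2; 10.1; 10.0];
upVol = [100; 0; 50];

% Write the parsed columns to a MAT-file once (temporary file name is illustrative).
cacheFile = [tempname '.mat'];
save(cacheFile, 'close', 'upVol');

% Later runs load the cached columns instead of re-parsing the text.
cached = load(cacheFile);      % struct with fields close and upVol
sameClose = isequal(cached.close, close);
sameUpVol = isequal(cached.upVol, upVol);

delete(cacheFile);             % clean up the temporary file
```

Whether this helps depends on how much of the run time is actually spent parsing text, which is worth measuring with tic/toc before and after.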

LVL 27

Expert Comment

ID: 39624820
Are you actually running your program on this data?
If so, how long does it take to execute?

Why do you have two for loops?    You should be able to do everything in a single loop.

Are you reading all this data into an array, or processing line-by-line on input?
If you read it into an array, you could use Matlab functions to find the max and min of the price column, and the sum of the shares column.  You would not need any explicit for loops.  Is that what you mean by vectorization?
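
As a rough sketch of the single-loop idea, the inner loop over ticks can be replaced by a sum over the day's index range, with the output arrays preallocated. The data values and the begIdx/endIdx ranges below are hypothetical; the variable names follow the question's code.

```matlab
% Hypothetical in-memory tick columns: day 1 is rows 1-3, day 2 is rows 4-5.
high    = [10.3; 10.4; 10.2; 10.6; 10.9];
low     = [ 9.9; 10.1;  9.8; 10.3; 10.2];
close   = [10.2; 10.1; 10.0; 10.4; 10.8];
upVol   = [100; 0; 50; 200; 0];
downVol = [0; 300; 0; 0; 100];
begIdx  = [1; 4];
endIdx  = [3; 5];

% Preallocate the outputs so they do not grow inside the loop.
n = numel(begIdx);
highDay = zeros(n, 1);
lowDay = zeros(n, 1);
volumeDay = zeros(n, 1);
netTransaction = zeros(n, 1);

for j = 1:n
    r = begIdx(j):endIdx(j);                 % rows belonging to day j
    highDay(j) = max(high(r));
    lowDay(j) = min(low(r));
    volumeDay(j) = sum(upVol(r) + downVol(r));
    % Sum over the range replaces the inner loop from the question.
    netTransaction(j) = sum(close(r) .* (upVol(r) - downVol(r)));
end
```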

LVL 15

Accepted Solution

yuk99 earned 1400 total points
ID: 39625256
Here is a vectorized version. I compared the output and it is the same.
On my machine it runs about 1.6 times faster than yours. I'm using the latest MATLAB, R2013b, and its JIT compiler is pretty well optimized even for non-vectorized code. If you are using an earlier version, vectorization may improve performance a lot. TEXTSCAN gives a big improvement over TEXTREAD (which is actually deprecated). Calculating endIdx from begIdx assumes the data is in proper order, but it is faster than calling UNIQUE a second time.

``````
startTime=tic;
fid = fopen('MTB_db2.txt','rt');
data = textscan(fid,'%s %u %f %f %f %f %f %f','delimiter',',');
fclose(fid);
% getting indices
endIdx = [begIdx(2:end)-1; numel(open)];
% calculations
volumeDay = upVolDay + downVolDay;
elapsedTimeVectorized = toc(startTime)
``````
Here are the results:
``````
elapsedTimeNonVectorized =
5.63204913353818
elapsedTimeVectorized =
3.52853749486268
``````
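
The code as posted in this thread does not show how begIdx or the daily aggregates are computed. The following is a minimal sketch of one way those steps could be vectorized, using unique to find the first row of each day and accumarray for the per-day maximum, minimum, and sums. It is an assumption about the approach, not necessarily the code used for the timings above, and the tick data is synthetic.

```matlab
% Synthetic tick data (hypothetical values): two trading days, five ticks.
tradeDate = {'20131101'; '20131101'; '20131101'; '20131104'; '20131104'};
open    = [10.0; 10.2; 10.1; 10.5; 10.4];
high    = [10.3; 10.4; 10.2; 10.6; 10.9];
low     = [ 9.9; 10.1;  9.8; 10.3; 10.2];
close   = [10.2; 10.1; 10.0; 10.4; 10.8];
upVol   = [100; 0; 50; 200; 0];
downVol = [0; 300; 0; 0; 100];

% First row of each day, plus a day number (1..nDays) for every tick.
[dailyDate, begIdx, dayIdx] = unique(tradeDate, 'first');
begIdx = begIdx(:);
dayIdx = dayIdx(:);
% Last row of each day; valid only if the ticks are stored in date order.
endIdx = [begIdx(2:end) - 1; numel(open)];

% Daily values without an explicit loop.
openDay    = open(begIdx);
closeDay   = close(endIdx);
highDay    = accumarray(dayIdx, high, [], @max);
lowDay     = accumarray(dayIdx, low,  [], @min);
upVolDay   = accumarray(dayIdx, upVol);
downVolDay = accumarray(dayIdx, downVol);
volumeDay  = upVolDay + downVolDay;
netTransaction = accumarray(dayIdx, close .* (upVol - downVol));
```

Here accumarray groups the tick values by day number, so each daily statistic is a single call instead of one loop iteration per day.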

LVL 15

Expert Comment

ID: 39625267
As for the warning: as you can see, I simply removed the 'rows' argument. It is used when you want to select the unique rows of a matrix, and it does not work for cell arrays. Since your cell array tradeDate has only one column, you don't need this argument at all.
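
To illustrate the difference, here is a small sketch with hypothetical date strings: unique is called on a cell array of strings without 'rows', and, for comparison, on a numeric matrix where 'rows' is meaningful.

```matlab
% Cell array of date strings (hypothetical values): no 'rows' flag needed.
tradeDate = {'20131101'; '20131101'; '20131104'};
[dailyDate, begIdx] = unique(tradeDate, 'first');
begIdx = begIdx(:);            % index of the first tick of each day

% For comparison, 'rows' applies to a numeric matrix.
M = [1 2; 1 2; 3 4];
uniqueRows = unique(M, 'rows');
```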

To verify the calculations, test your code on a small data set and compare the results with calculations done by hand or with other tools. For serious development, MATLAB has a nice Unit Testing Framework. You can read about it here: http://www.mathworks.com/help/matlab/matlab-unit-test-framework.html
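
A minimal sketch of the small-data-set check, assuming a tiny hand-checkable input: the expected daily values are worked out by hand and compared against the computed ones with assert. All values and the dayIdx grouping vector are hypothetical, and accumarray stands in for whatever implementation is being tested.

```matlab
% Tiny hand-checkable data set (hypothetical values): 2 days, 4 ticks.
dayIdx  = [1; 1; 2; 2];        % day number of each tick
close   = [10; 11; 20; 19];
upVol   = [100; 0; 0; 50];
downVol = [0; 40; 10; 0];

% Expected results computed by hand.
expectedVolume = [140; 60];
expectedNet = [10*100 - 11*40; -20*10 + 19*50];

% Implementation under test.
volumeDay = accumarray(dayIdx, upVol + downVol);
netTransaction = accumarray(dayIdx, close .* (upVol - downVol));

% Exact comparison for integer volumes, tolerance for price-weighted sums.
assert(isequal(volumeDay, expectedVolume));
assert(all(abs(netTransaction - expectedNet) < 1e-9));
```

The same comparison can also be run between the loop version and the vectorized version on the full data set, since both should produce identical daily arrays.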

LVL 15

Expert Comment

ID: 39625289
Sorry, I missed the open/close output (it was included when I measured the performance above):
``````
% calculations
openDay=open(begIdx);
closeDay=close(endIdx);
...
``````