Solved

Matlab: how to vectorize for "large data set" loops

Posted on 2013-11-04
Last Modified: 2016-03-02
Hi, I am learning to do calculations on "large" data sets in Matlab. I have a text file containing every trade made for a stock called MTB. My goal is to turn this tick data into daily data.
For example, over 15,000 transactions took place on the first day. My program turns that data into the open, high, low, close, total volume, and net transaction for each day.

My questions:
Can you help me make the code faster?
Do you have any practical "techniques" to verify the calculations, since they are made on such a large data set?

My program took 20.7757 seconds to run,
and I got the following warning. I don't really know what it means:
Warning: 'rows' flag is ignored for cell arrays.
> In cell.unique at 32
  In ex5 at 16
Warning: 'rows' flag is ignored for cell arrays.
> In cell.unique at 32
  In ex5 at 17


   
%DESCRIPTION: Turn tick data into daily data
%INPUT: stock tick data(tradeDate,tradeTime,open,high,low,
%close,upVol,downVol)
%OUTPUT: openDay,highDay,lowDay,closeDay,volumeDay,netTransaction
%net transaction traded = sum(price*upVol - price*downVol)

clear;
startTime=tic;
%load data from MTB_db2
[tradeDate, tradeTime,open,high,low,close,upVol,downVol]=textread('MTB_db2.txt','%s %u %f %f %f %f %f %f','delimiter',',');



%begIdx:Index the first trade for the day from tick database and
%endIdx:index for the last trade for that day
[dailyDate begIdx]=unique(tradeDate,'rows','first');
[dailyDate2 endIdx]=unique(tradeDate,'rows','last');

%the number of daily elements, useful for the loop.
n=numel(dailyDate);

%initialize arrays
highDay=[];
lowDay=[];openDay=[];closeDay=[];
volumeDay=[];netTransaction=[];
priceChange(1)=NaN; mfChange(1)=NaN;

%loop: bottleneck is here!!
for j=1:n
openDay(j)=open(begIdx(j));
closeDay(j)=close(endIdx(j));
highDay(j)=max(high(begIdx(j):endIdx(j)));
lowDay(j)=min(low(begIdx(j):endIdx(j)));
volumeDay(j)=sum(upVol(begIdx(j):endIdx(j)))+sum(downVol(begIdx(j):endIdx(j)));
cumSum=0;
for i=begIdx(j):endIdx(j)
    cumSum=cumSum+close(i)*(upVol(i)-downVol(i));
end
netTransaction(j)=cumSum;
end
elapsedTimeNonVectorized=toc(startTime)


MTB-db2.txt
Question by:pgmerLA
 
LVL 15

Expert Comment

by:yuk99
ID: 39622958
What is your MATLAB version? Do you have the Financial Toolbox?
I'll look at your code and answer later.
 
LVL 27

Assisted Solution

by:d-glitch
d-glitch earned 150 total points
ID: 39624315
Vectorization is not likely to be helpful, particularly if your input data is a text file.

Your processing is extremely simple: find the START, END, MAX, and MIN (comparisons)
and accumulate VOLUME (additions).  There is nothing to vectorize.  The processing time is probably negligible compared to the I/O.

The biggest win might be to optimize the input format.
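
One way to read "optimize the input format" is to parse the text file once and cache the result in a binary MAT-file, so later runs skip the text parsing. The sketch below is only an illustration of that idea: the cache file name, the two sample rows, and the variable closePx are made up, and whether it actually helps depends on how often the same file is reloaded.

```matlab
% Write a two-line sample file so the sketch runs on its own
% (the real input would be MTB_db2.txt; the rows below are made up).
txtFile = [tempname '.txt'];
matFile = [tempname '.mat'];          % hypothetical cache location
fid = fopen(txtFile, 'wt');
fprintf(fid, '01/02/2013,930,10.0,10.2,9.9,10.1,100,0\n');
fprintf(fid, '01/02/2013,931,10.1,10.3,10.0,10.2,0,50\n');
fclose(fid);

if exist(matFile, 'file')
    s = load(matFile);                % cached path: binary data, no text parsing
    data = s.data;
else
    fid = fopen(txtFile, 'rt');
    data = textscan(fid, '%s %u %f %f %f %f %f %f', 'delimiter', ',');
    fclose(fid);
    save(matFile, 'data');            % pay the parsing cost only once
end
closePx = data{6};                    % sixth column is the close price
```

On a real project the cache would need to be rebuilt whenever the text file changes.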
0
 
LVL 27

Expert Comment

by:d-glitch
ID: 39624820
Are you actually running your program on this data?
If so, how long does it take to execute?

Why do you have two for loops?    You should be able to do everything in a single loop.

Are you reading all this data into an array, or processing line-by-line on input?
If you read it into an array, you could use Matlab functions to find the max and min of the price column, and the sum of the shares column.  You would not need any explicit for loops.  Is that what you mean by vectorization?
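
The single-loop suggestion could look roughly like the sketch below. It assumes begIdx and endIdx have already been computed as in the question; the data values and the name closePx are made up for illustration. The output arrays are preallocated, and the inner loop is replaced by an element-wise product followed by one sum.

```matlab
% Made-up tick data for two days; begIdx/endIdx mark each day's first and last trade.
high    = [10.2; 10.5; 11.0; 10.9];
low     = [10.0; 10.1; 10.7; 10.6];
closePx = [10.1; 10.4; 10.8; 10.7];
upVol   = [100; 0; 200; 0];
downVol = [0; 300; 0; 100];
begIdx  = [1; 3];
endIdx  = [2; 4];

n = numel(begIdx);
highDay        = zeros(n, 1);     % preallocate instead of growing inside the loop
lowDay         = zeros(n, 1);
volumeDay      = zeros(n, 1);
netTransaction = zeros(n, 1);

for j = 1:n
    rows = begIdx(j):endIdx(j);   % all trades of day j
    highDay(j)   = max(high(rows));
    lowDay(j)    = min(low(rows));
    volumeDay(j) = sum(upVol(rows) + downVol(rows));
    % replaces the inner loop: element-wise product, then one sum
    netTransaction(j) = sum(closePx(rows) .* (upVol(rows) - downVol(rows)));
end
```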

 
LVL 15

Accepted Solution

by:
yuk99 earned 350 total points
ID: 39625256
Here is a vectorized version. I compared the output and it's the same.
In my case it runs about 1.6 times faster than yours. I'm using the latest MATLAB, R2013b, and its JIT compiler is pretty well optimized even for non-vectorized code. If you are using an earlier version, vectorization may improve the performance a lot. TEXTSCAN gives a big improvement over TEXTREAD (which is actually deprecated). Calculating endIdx from begIdx assumes the data is properly ordered, but it's faster than calling UNIQUE a second time.

startTime=tic;
% load data
fid = fopen('MTB_db2.txt','rt');
data = textscan(fid,'%s %u %f %f %f %f %f %f','delimiter',',');
fclose(fid);
[tradeDate, tradeTime,open,high,low,close,upVol,downVol] = data{:};
% getting indices
[dailyDateNum, begIdx, tradeDateNum] = unique(tradeDate,'first');
endIdx = [begIdx(2:end)-1; numel(open)];
% calculations
highDay = accumarray(tradeDateNum,high,[],@max);
lowDay = accumarray(tradeDateNum,low,[],@min);
upVolDay = accumarray(tradeDateNum,upVol,[],@sum);
downVolDay = accumarray(tradeDateNum,downVol,[],@sum);
volumeDay = upVolDay + downVolDay;
netTransaction = accumarray(tradeDateNum,close.*(upVol-downVol),[],@sum);
elapsedTimeVectorized = toc(startTime)
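
For readers new to ACCUMARRAY, here is a minimal sketch of the grouping idea used in the listing above, on made-up data. The name dayIdx and all values are illustrative and do not come from MTB_db2.txt.

```matlab
% Toy tick data: three trades on the first day, two on the second.
tradeDate = {'2013-01-02'; '2013-01-02'; '2013-01-02'; '2013-01-03'; '2013-01-03'};
high      = [10.2; 10.5; 10.4; 11.0; 10.9];
low       = [10.0; 10.1; 10.3; 10.7; 10.6];
upVol     = [100; 0; 50; 200; 0];
downVol   = [0; 300; 0; 0; 100];

% The third output of UNIQUE maps every trade to its day number (1, 2, ...).
[dailyDate, ~, dayIdx] = unique(tradeDate);
dayIdx = dayIdx(:);                                % ACCUMARRAY expects a column of subscripts

% ACCUMARRAY groups the values by day number and reduces each group.
highDay   = accumarray(dayIdx, high, [], @max);    % [10.5; 11.0]
lowDay    = accumarray(dayIdx, low,  [], @min);    % [10.0; 10.6]
volumeDay = accumarray(dayIdx, upVol + downVol);   % default reduction is sum: [450; 300]
```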


Here are the results:
elapsedTimeNonVectorized =
          5.63204913353818
elapsedTimeVectorized =
          3.52853749486268

 
LVL 15

Expert Comment

by:yuk99
ID: 39625267
As for the warning: as you can see, I just removed the 'rows' argument. It's used when you want to select the unique rows of a matrix, and it does not work for cell arrays. Since your cell array tradeDate has only one column, you don't need this argument at all.
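
A small sketch of calling UNIQUE on a cell array of date strings without 'rows'. It also uses the 'stable' option, which is an addition not taken from the answer and assumes your release supports it: it keeps the days in order of first appearance, so deriving endIdx from begIdx still works when the date strings do not sort chronologically. The dates below are made up.

```matlab
% Hypothetical dates in mm/dd/yyyy form, already in file (chronological) order.
tradeDate = {'12/31/2012'; '12/31/2012'; '01/02/2013'; '01/02/2013'; '01/02/2013'};

% No 'rows' flag: each cell is already treated as one element.
% 'stable' keeps the days in order of first appearance instead of sorting the strings.
[dailyDate, begIdx] = unique(tradeDate, 'stable');   % begIdx is [1; 3]
begIdx = begIdx(:);

% Last trade of each day, assuming every day's trades are contiguous in the file.
endIdx = [begIdx(2:end) - 1; numel(tradeDate)];      % endIdx is [2; 5]
```

With the default sorted output, '01/02/2013' would come before '12/31/2012', begIdx would be [3; 1], and the endIdx shortcut would produce a wrong range for this data.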

To verify the calculations, you should test your code on a small data set and compare the results with manual calculations (or with results from other tools). For serious development, MATLAB has a nice Unit Testing Framework. You can read about it here: http://www.mathworks.com/help/matlab/matlab-unit-test-framework.html
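
As a rough illustration of the small-data-set check, the sketch below computes the net transaction for a tiny made-up data set three ways: by hand (in the comment), with a plain loop, and with ACCUMARRAY, then asserts that they agree. All names and values are hypothetical, and it uses plain assert calls rather than the Unit Testing Framework.

```matlab
% Small hand-checkable data set (values are made up).
tradeDate = {'d1'; 'd1'; 'd2'; 'd2'; 'd2'};
closePx   = [10; 11; 12; 11; 13];
upVol     = [100; 0; 50; 0; 20];
downVol   = [0; 40; 0; 30; 0];

[~, begIdx, dayIdx] = unique(tradeDate, 'stable');
begIdx = begIdx(:);
dayIdx = dayIdx(:);
endIdx = [begIdx(2:end) - 1; numel(tradeDate)];

% Reference result: plain loop over days.
nDays = numel(begIdx);
netLoop = zeros(nDays, 1);
for j = 1:nDays
    rows = begIdx(j):endIdx(j);
    netLoop(j) = sum(closePx(rows) .* (upVol(rows) - downVol(rows)));
end

% Vectorized result.
netVec = accumarray(dayIdx, closePx .* (upVol - downVol));

% Hand calculation:
%   day 1: 10*100 + 11*(-40)          = 560
%   day 2: 12*50  + 11*(-30) + 13*20  = 530
expected = [560; 530];
assert(isequal(netLoop, expected));
assert(max(abs(netVec - netLoop)) < 1e-9);   % tolerance: summation order may differ
```

On real price data a tolerance is safer than an exact comparison, because the two versions may add the floating-point terms in a different order.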
 
LVL 15

Expert Comment

by:yuk99
ID: 39625289
Sorry, I missed the open/close outputs (they were included when I measured the performance above):
% calculations
openDay=open(begIdx);
closeDay=close(endIdx);
...


Question has a verified solution.
