Solved

Matlab: how to vectorize for "large data set" loops

Posted on 2013-11-04
Last Modified: 2016-03-02
Hi, I am learning "large" data set calculations using Matlab. I have a txt file containing every trade made for a stock called MTB. My goal is to turn this tick data into daily data.
For example, on the first day, over 15,000 transactions took place; my program turns that data into the open, high, low, close, total volume, and net transaction for each day.

My questions:
Can you help me make the code faster?
Do you have any practical "techniques" to verify the calculations since they are made on such large data set?

My program took 20.7757 seconds to run,
and I got the following warning, which I don't really understand:
Warning: 'rows' flag is ignored for cell arrays.
> In cell.unique at 32
  In ex5 at 16
Warning: 'rows' flag is ignored for cell arrays.
> In cell.unique at 32
  In ex5 at 17


   
%DESCRIPTION: Turn tick data into daily data
%INPUT: stock tick data(tradeDate,tradeTime,open,high,low,
%close,upVol,downVol)
%OUTPUT: openDay,highDay,lowDay,closeDay,volumeDay,netTransaction
%net transaction traded = sum(price*upVol - price*downVol)

clear;
startTime=tic;
%load data from MTB_db2
[tradeDate, tradeTime,open,high,low,close,upVol,downVol]=textread('MTB_db2.txt','%s %u %f %f %f %f %f %f','delimiter',',');



%begIdx:Index the first trade for the day from tick database and
%endIdx:index for the last trade for that day
[dailyDate begIdx]=unique(tradeDate,'rows','first');
[dailyDate2 endIdx]=unique(tradeDate,'rows','last');

%the number of daily elements, useful for the loop.
n=numel(dailyDate);

%initialize arrays
highDay=[];
lowDay=[];openDay=[];closeDay=[];
volumeDay=[];netTransaction=[];
priceChange(1)=NaN; mfChange(1)=NaN;

%loop: bottleneck is here!!
for j=1:n
openDay(j)=open(begIdx(j));
closeDay(j)=close(endIdx(j));
highDay(j)=max(high(begIdx(j):endIdx(j)));
lowDay(j)=min(low(begIdx(j):endIdx(j)));
volumeDay(j)=sum(upVol(begIdx(j):endIdx(j)))+sum(downVol(begIdx(j):endIdx(j)));
cumSum=0;
for i=begIdx(j):endIdx(j)
    cumSum=cumSum+close(i)*(upVol(i)-downVol(i));
end
netTransaction(j)=cumSum;
end
elapsedTimeNonVectorized=toc(startTime)

MTB-db2.txt
Question by:pgmerLA
LVL 15

Expert Comment

by:yuk99
ID: 39622958
What is your MATLAB version? Do you have the Financial Toolbox?
I'll look at your code and answer later.
 
LVL 27

Assisted Solution

by:d-glitch
d-glitch earned 150 total points
ID: 39624315
Vectorization is not likely to be helpful, particularly if your input data is a text file.

Your processing is extremely simple.  Find the START  END  MAX  MIN  (comparisons)
and accumulate VOLUME (additions).  There is nothing to vectorize.  The processing time is probably negligible compared to the I/O.

The biggest win might be to optimize the input format.
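One possible reading of "optimize the input format" is to parse the text file only once and keep the parsed columns in a binary MAT-file for later runs. The sketch below is only an illustration of that idea, not necessarily what was meant above. It writes a tiny stand-in file to a temporary location so it runs on its own; the file names, the two sample rows, and the date format are assumptions.

```matlab
% Assumption: a two-line stand-in file replaces the real tick file here.
txtFile = [tempname '.txt'];
matFile = [tempname '.mat'];   % hypothetical cache file
fid = fopen(txtFile, 'wt');
fprintf(fid, '11/01/2013,930,10.0,10.5,9.9,10.2,100,50\n');
fprintf(fid, '11/01/2013,931,10.2,10.6,10.1,10.4,200,0\n');
fclose(fid);

% Run the same logic twice: the first pass parses the text file and writes
% the cache, the second pass finds the cache and loads it instead.
for pass = 1:2
    if exist(matFile, 'file')
        S = load(matFile);          % fast path: binary load
        data = S.data;
        source = 'mat';
    else
        fid = fopen(txtFile, 'rt'); % slow path: text parsing
        data = textscan(fid, '%s %u %f %f %f %f %f %f', 'delimiter', ',');
        fclose(fid);
        save(matFile, 'data');
        source = 'txt';
    end
end

% Remove the temporary files created by this sketch.
delete(txtFile);
delete(matFile);
```

Whether this helps depends on how often the same file is re-read; timing both paths with tic/toc on the real data is the only way to know.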
 
LVL 27

Expert Comment

by:d-glitch
ID: 39624820
Are you actually running your program on this data?
If so, how long does it take to execute?

Why do you have two for loops?    You should be able to do everything in a single loop.

Are you reading all this data into an array, or processing line-by-line on input?
If you read it into an array, you could use Matlab functions to find the max and min of the price column, and the sum of the shares column.  You would not need any explicit for loops.  Is that what you mean by vectorization?
 
LVL 15

Accepted Solution

by:yuk99
yuk99 earned 350 total points
ID: 39625256
Here is a vectorized version. I compared the output and it's the same.
In my case it runs about 1.6 times faster than yours. I'm using the latest MATLAB, R2013b, and its JIT compiler is pretty well optimized even for non-vectorized code. If you are using an earlier version, vectorization may improve the performance a lot. TEXTSCAN gives a big improvement over TEXTREAD (it's actually deprecated). Calculating endIdx from begIdx assumes the data are properly ordered, but it's faster than calling UNIQUE a second time.

startTime=tic;
% load data
fid = fopen('MTB_db2.txt','rt');
data = textscan(fid,'%s %u %f %f %f %f %f %f','delimiter',',');
fclose(fid);
[tradeDate, tradeTime,open,high,low,close,upVol,downVol] = data{:};
% getting indices
[dailyDateNum, begIdx, tradeDateNum] = unique(tradeDate,'first');
endIdx = [begIdx(2:end)-1; numel(open)];
% calculations
highDay = accumarray(tradeDateNum,high,[],@max);
lowDay = accumarray(tradeDateNum,low,[],@min);
upVolDay = accumarray(tradeDateNum,upVol,[],@sum);
downVolDay = accumarray(tradeDateNum,downVol,[],@sum);
volumeDay = upVolDay + downVolDay;
netTransaction = accumarray(tradeDateNum,close.*(upVol-downVol),[],@sum);
elapsedTimeVectorized = toc(startTime)

Here are the results:
elapsedTimeNonVectorized =
          5.63204913353818
elapsedTimeVectorized =
          3.52853749486268
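The endIdx shortcut in the code above relies on the ticks being ordered so that each day's rows are contiguous and the group numbers ascend. A minimal sketch of how that assumption could be checked, and how first/last rows could be found without it, is given below. The toy group vector stands in for the third output of UNIQUE; note that UNIQUE sorts date strings as text, so the check can fail if the strings do not sort chronologically.

```matlab
% Toy group index per tick (assumption: stands in for tradeDateNum).
tradeDateNum = [1; 1; 1; 2; 2; 3];
n = numel(tradeDateNum);

% Check the ordering assumption: group numbers never decrease.
isOrdered = issorted(tradeDateNum);

% Order-independent first and last row of each day.
rowIdx = (1:n)';
begIdx = accumarray(tradeDateNum, rowIdx, [], @min);
endIdx = accumarray(tradeDateNum, rowIdx, [], @max);

% The shortcut from the post; it agrees with endIdx when isOrdered is true.
endIdxShortcut = [begIdx(2:end) - 1; n];
```

The accumarray variant costs two extra passes over the data, so the shortcut is still the better choice once the ordering has been confirmed.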

 
LVL 15

Expert Comment

by:yuk99
ID: 39625267
As for the warning: as you can see, I just removed the 'rows' argument. It is used when you want to select the unique rows of a matrix, and it does not work for cell arrays. Since your cell array tradeDate has only one column, you don't need this argument at all.
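To make the difference concrete, here is a small sketch with made-up date strings (the date format is an assumption). UNIQUE on a cell array of strings already treats each cell as one item, while 'rows' is meant for matrices such as the character matrix built with CHAR.

```matlab
% Made-up dates, one cell per tick.
tradeDate = {'11/01/2013'; '11/01/2013'; '11/04/2013'; '11/04/2013'; '11/04/2013'};

% Cell array of strings: no 'rows' flag needed.
[dailyDate, begIdx, tradeDateNum] = unique(tradeDate, 'first');

% Character matrix: here 'rows' is the right tool.
dateMatrix = char(tradeDate);
uniqueRows = unique(dateMatrix, 'rows');
```

Here begIdx holds the first row of each day and tradeDateNum maps every tick to its day, which is the form ACCUMARRAY expects.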

To verify the calculations, test your code on a small dataset and compare the results with manual calculations (or with results from other tools). For serious development, MATLAB has a nice Unit Testing Framework. You can read about it here: http://www.mathworks.com/help/matlab/matlab-unit-test-framework.html
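As an illustration of the small-dataset approach, the sketch below builds five hand-made ticks over two days, computes the daily values with ACCUMARRAY and with a plain loop, and compares both against numbers worked out by hand. All values and the variable name closePx are invented for this example.

```matlab
% Hand-made ticks for two days (illustrative values only).
day     = [1; 1; 1; 2; 2];
high    = [10.5; 10.8; 10.6; 11.0; 11.2];
low     = [10.1; 10.4; 10.2; 10.7; 10.9];
closePx = [10; 11; 10; 12; 11];
upVol   = [100; 0; 50; 200; 0];
downVol = [0; 30; 0; 0; 100];

% Vectorized version.
highVec = accumarray(day, high, [], @max);
lowVec  = accumarray(day, low, [], @min);
volVec  = accumarray(day, upVol + downVol);
netVec  = accumarray(day, closePx .* (upVol - downVol));

% Reference loop, written as plainly as possible.
nDays = max(day);
highRef = zeros(nDays, 1); lowRef = zeros(nDays, 1);
volRef  = zeros(nDays, 1); netRef = zeros(nDays, 1);
for d = 1:nDays
    rows = (day == d);
    highRef(d) = max(high(rows));
    lowRef(d)  = min(low(rows));
    volRef(d)  = sum(upVol(rows) + downVol(rows));
    netRef(d)  = sum(closePx(rows) .* (upVol(rows) - downVol(rows)));
end

% Both versions must agree with each other and with the hand calculation:
% day 1 net = 10*100 - 11*30 + 10*50 = 1170, day 2 net = 12*200 - 11*100 = 1300.
assert(isequal(highVec, highRef) && isequal(lowVec, lowRef));
assert(isequal(volVec, volRef) && isequal(volVec, [180; 300]));
assert(isequal(netVec, netRef) && isequal(netVec, [1170; 1300]));
```

The same comparison can be repeated on one or two real trading days cut from the full file before trusting the result on the whole data set.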
 
LVL 15

Expert Comment

by:yuk99
ID: 39625289
Sorry, I missed the open/close output (it was included when I measured the performance above):
% calculations
openDay=open(begIdx);
closeDay=close(endIdx);
...
