Solved

Matlab: how to vectorize for "large data set" loops

Posted on 2013-11-04
Last Modified: 2016-03-02
Hi, I am learning to do calculations on "large" data sets in MATLAB. I have a text file containing every trade made for a stock called MTB. My goal is to turn this tick data into daily data.
For example, over 15,000 transactions took place on the first day. My program turns that data into the open, high, low, close, total volume, and net transaction for each day.

My questions:
Can you help me make the code faster?
Do you have any practical "techniques" to verify the calculations, since they are made on such a large data set?

My program took 20.7757 seconds to run, and I got the following warning. I don't really know what it means:
Warning: 'rows' flag is ignored for cell arrays.
> In cell.unique at 32
  In ex5 at 16
Warning: 'rows' flag is ignored for cell arrays.
> In cell.unique at 32
  In ex5 at 17


   
%DESCRIPTION: Turn tick data into daily data
%INPUT: stock tick data(tradeDate,tradeTime,open,high,low,
%close,upVol,downVol)
%OUTPUT: openDay,highDay,lowDay,closeDay,volumeDay,netTransaction
%net transaction traded = sum(price*upVol - price*downVol)

clear;
startTime=tic;
%load data from MTB_db2
[tradeDate, tradeTime,open,high,low,close,upVol,downVol]=textread('MTB_db2.txt','%s %u %f %f %f %f %f %f','delimiter',',');



%begIdx:Index the first trade for the day from tick database and
%endIdx:index for the last trade for that day
[dailyDate begIdx]=unique(tradeDate,'rows','first');
[dailyDate2 endIdx]=unique(tradeDate,'rows','last');

%the number of daily elements, useful for the loop.
n=numel(dailyDate);

%initialize arrays
highDay=[];
lowDay=[];openDay=[];closeDay=[];
volumeDay=[];netTransaction=[];
priceChange(1)=NaN; mfChange(1)=NaN;

%loop: bottleneck is here!!
for j=1:n
openDay(j)=open(begIdx(j));
closeDay(j)=close(endIdx(j));
highDay(j)=max(high(begIdx(j):endIdx(j)));
lowDay(j)=min(low(begIdx(j):endIdx(j)));
volumeDay(j)=sum(upVol(begIdx(j):endIdx(j)))+sum(downVol(begIdx(j):endIdx(j)));
cumSum=0;
for i=begIdx(j):endIdx(j)
    cumSum=cumSum+close(i)*(upVol(i)-downVol(i));
end
netTransaction(j)=cumSum;
end
elapsedTimeNonVectorized=toc(startTime)


MTB-db2.txt
Question by:pgmerLA
6 Comments
 
LVL 15

Expert Comment

by:yuk99
ID: 39622958
What is your MATLAB version? Do you have the Financial Toolbox?
I'll look at your code and answer later.
 
LVL 27

Assisted Solution

by:d-glitch
d-glitch earned 150 total points
ID: 39624315
Vectorization is not likely to be helpful, particularly if your input data is a text file.

Your processing is extremely simple.  Find the START  END  MAX  MIN  (comparisons)
and accumulate VOLUME (additions).  There is nothing to vectorize.  The processing time is probably negligible compared to the I/O.

The biggest win might be to optimize the input format.
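
One way to act on that idea, as a rough sketch only: parse the text file once, then cache the parsed columns in a binary MAT-file so later runs skip the text parsing. The cache file, the temporary file names, and the two sample rows (including their date format) are assumptions made up for illustration; the format string is the one used in the question.

```matlab
% Tiny stand-in for MTB_db2.txt so the sketch runs on its own.
% The rows and the date format are invented for illustration.
txtFile = [tempname '.txt'];
matFile = [tempname '.mat'];   % hypothetical cache file
fid = fopen(txtFile, 'wt');
fprintf(fid, '20131101,930,10.0,10.5,9.9,10.2,100,50\n');
fprintf(fid, '20131101,931,10.2,10.6,10.1,10.4,200,0\n');
fclose(fid);

if exist(matFile, 'file') == 2
    % Later runs: fast binary load, no text parsing.
    S = load(matFile);
    data = S.data;
else
    % First run: parse the text once, then cache the result.
    fid = fopen(txtFile, 'rt');
    data = textscan(fid, '%s %u %f %f %f %f %f %f', 'Delimiter', ',');
    fclose(fid);
    save(matFile, 'data');
end
```

Whether this helps depends on how often the same file is re-read; for a one-off run the parse cost is paid either way.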
0
 
LVL 27

Expert Comment

by:d-glitch
ID: 39624820
Are you actually running your program on this data?
If so, how long does it take to execute?

Why do you have two for loops?    You should be able to do everything in a single loop.

Are you reading all this data into an array, or processing line-by-line on input?
If you read it into an array, you could use Matlab functions to find the max and min of the price column, and the sum of the shares column.  You would not need any explicit for loops.  Is that what you mean by vectorization?
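
As a small illustration of that point, the inner loop from the question can be replaced by one vector expression over the day's index range. The values below are made up, and closePx is a hypothetical name used here instead of close so the sketch does not shadow the built-in close function.

```matlab
% Made-up tick values for a single day (illustrative only).
closePx = [10.2; 10.4; 10.1];
upVol   = [100; 200; 0];
downVol = [50; 0; 300];
begIdx  = 1;
endIdx  = 3;

% Inner loop, as written in the question.
cumSum = 0;
for i = begIdx:endIdx
    cumSum = cumSum + closePx(i) * (upVol(i) - downVol(i));
end

% Same quantity without the explicit inner loop.
idx = begIdx:endIdx;
netVec = sum(closePx(idx) .* (upVol(idx) - downVol(idx)));
```

The outer loop over days is still there in this form; it only removes the element-by-element accumulation.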
 
LVL 15

Accepted Solution

by:
yuk99 earned 350 total points
ID: 39625256
Here is a vectorized version. I compared the output and it's the same.
In my case it runs about 1.6 times faster than yours. I'm using the latest MATLAB, R2013b, and its JIT compiler is pretty well optimized even for non-vectorized code. If you are using an earlier version, vectorization may improve the performance a lot. TEXTSCAN gives a big improvement over TEXTREAD (which is actually deprecated). Calculating endIdx from begIdx assumes the data is in the proper order, but it's faster than calling UNIQUE a second time.

startTime=tic;
% load data
fid = fopen('MTB_db2.txt','rt');
data = textscan(fid,'%s %u %f %f %f %f %f %f','delimiter',',');
fclose(fid);
[tradeDate, tradeTime,open,high,low,close,upVol,downVol] = data{:};
% getting indices
[dailyDateNum, begIdx, tradeDateNum] = unique(tradeDate,'first');
endIdx = [begIdx(2:end)-1; numel(open)];
% calculations
highDay = accumarray(tradeDateNum,high,[],@max);
lowDay = accumarray(tradeDateNum,low,[],@min);
upVolDay = accumarray(tradeDateNum,upVol,[],@sum);
downVolDay = accumarray(tradeDateNum,downVol,[],@sum);
volumeDay = upVolDay + downVolDay;
netTransaction = accumarray(tradeDateNum,close.*(upVol-downVol),[],@sum);
elapsedTimeVectorized = toc(startTime)


Here are the results:
elapsedTimeNonVectorized =
          5.63204913353818
elapsedTimeVectorized =
          3.52853749486268
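
To make the grouping step easier to follow, here is a toy sketch of the same pattern on five made-up ticks. The date strings are an assumed YYYYMMDD format, and grp is just a shorter name for tradeDateNum. Note that UNIQUE sorts the strings, so the endIdx shortcut only holds when the file order matches that sorted order; the issorted check makes this assumption explicit.

```matlab
% Five made-up ticks over two days (assumed YYYYMMDD date strings).
tradeDate = {'20131101'; '20131101'; '20131104'; '20131104'; '20131104'};
high      = [10.5; 10.6; 11.0; 11.2; 10.9];

% Third output maps every tick to the index of its day.
[dailyDate, begIdx, grp] = unique(tradeDate, 'first');

% The endIdx shortcut assumes ticks are stored day by day in sorted order.
assert(issorted(grp), 'Ticks are not grouped by day in sorted order.');
endIdx = [begIdx(2:end) - 1; numel(high)];

% One max per day, no explicit loop.
highDay = accumarray(grp, high, [], @max);
```

If the check fails on real data, calling UNIQUE a second time with 'last' (as in the question) is the safer route.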

 
LVL 15

Expert Comment

by:yuk99
ID: 39625267
As for the warning: as you can see, I just removed the 'rows' argument. It is used when you want to select the unique rows of a matrix, and it does not work for cell arrays. Since your cell array tradeDate has only one column, you don't need this argument at all.
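
A short sketch of the difference, using made-up values (the date strings are an assumed format):

```matlab
% 'rows' treats each row of a numeric (or char) matrix as one item.
M = [1 2; 1 2; 3 4];
uniqueRows = unique(M, 'rows');    % 2-by-2 matrix: [1 2; 3 4]

% In a cell array of strings, each cell is already one item,
% so plain unique is enough and 'rows' adds nothing.
dates = {'20131101'; '20131101'; '20131104'};
uniqueDates = unique(dates);       % 2-by-1 cell array
```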

To verify the calculations, test your code on a small data set and compare the results with manual calculations (or with results from other tools). For serious development, MATLAB has a nice Unit Testing Framework. You can read about it here: http://www.mathworks.com/help/matlab/matlab-unit-test-framework.html
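
A minimal sketch of such a small-data check, assuming made-up integer ticks whose daily totals are easy to work out by hand. The variable names (grp, price, volDay, netDay) are chosen for this example only.

```matlab
% Hand-checkable ticks: two days, made-up integer values.
grp     = [1; 1; 2; 2; 2];     % day index of each tick
price   = [10; 12; 20; 18; 19];
upVol   = [1; 0; 2; 0; 1];
downVol = [0; 1; 0; 3; 0];

% Vectorized daily totals (accumarray sums by default).
volDay = accumarray(grp, upVol + downVol);
netDay = accumarray(grp, price .* (upVol - downVol));

% Expected values worked out by hand:
%   day 1: volume 1 + 1 = 2,     net 10*1 - 12*1 = -2
%   day 2: volume 2 + 3 + 1 = 6, net 20*2 - 18*3 + 19*1 = 5
assert(isequal(volDay, [2; 6]));
assert(isequal(netDay, [-2; 5]));
```

Integer inputs keep the comparison exact; with real prices a tolerance-based comparison would be the safer choice.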
 
LVL 15

Expert Comment

by:yuk99
ID: 39625289
Sorry, I missed the open/close output (it was included when I measured the performance above):
% calculations
openDay=open(begIdx);
closeDay=close(endIdx);
...
