Solved

Matlab: how to vectorize for "large data set" loops

Posted on 2013-11-04
Last Modified: 2016-03-02
Hi, I am learning to do calculations on "large" data sets in MATLAB. I have a text file containing every trade made in a stock called MTB. My goal is to turn this tick data into daily data.
For example, over 15,000 transactions took place on the first day; my program turns that data into the open, high, low, close, total volume, and net transaction for each day.

My questions:
Can you help me make the code faster?
Do you have any practical "techniques" for verifying the calculations, given that they are made on such a large data set?

My program took 20.7757 seconds, and I got the following warning, which I don't really understand:
Warning: 'rows' flag is ignored for cell arrays.
> In cell.unique at 32
  In ex5 at 16
Warning: 'rows' flag is ignored for cell arrays.
> In cell.unique at 32
  In ex5 at 17


   
%DESCRIPTION: Turn tick data into daily data
%INPUT: stock tick data(tradeDate,tradeTime,open,high,low,
%close,upVol,downVol)
%OUTPUT: openDay,highDay,lowDay,closeDay,volumeDay,netTransaction
%net transaction traded = sum(price*upVol - price*downVol)

clear;
startTime=tic;
%load data from MTB_db2
[tradeDate, tradeTime,open,high,low,close,upVol,downVol]=textread('MTB_db2.txt','%s %u %f %f %f %f %f %f','delimiter',',');



%begIdx:Index the first trade for the day from tick database and
%endIdx:index for the last trade for that day
[dailyDate begIdx]=unique(tradeDate,'rows','first');
[dailyDate2 endIdx]=unique(tradeDate,'rows','last');

%the number of daily elements, useful for the loop.
n=numel(dailyDate);

%initialize arrays
highDay=[];
lowDay=[];openDay=[];closeDay=[];
volumeDay=[];netTransaction=[];
priceChange(1)=NaN; mfChange(1)=NaN;

%loop: bottleneck is here!!
for j=1:n
openDay(j)=open(begIdx(j));
closeDay(j)=close(endIdx(j));
highDay(j)=max(high(begIdx(j):endIdx(j)));
lowDay(j)=min(low(begIdx(j):endIdx(j)));
volumeDay(j)=sum(upVol(begIdx(j):endIdx(j)))+sum(downVol(begIdx(j):endIdx(j)));
cumSum=0;
for i=begIdx(j):endIdx(j)
    cumSum=cumSum+close(i)*(upVol(i)-downVol(i));
end
netTransaction(j)=cumSum;
end
elapsedTimeNonVectorized=toc(startTime)


MTB-db2.txt
Question by:pgmerLA
LVL 15

Expert Comment

by:yuk99
ID: 39622958
What is your MATLAB version? Do you have the Financial Toolbox?
I'll look at your code and answer later.
 
LVL 27

Assisted Solution

by:d-glitch
d-glitch earned 600 total points
ID: 39624315
Vectorization is not likely to be helpful, particularly if your input data is a text file.

Your processing is extremely simple.  Find the START  END  MAX  MIN  (comparisons)
and accumulate VOLUME (additions).  There is nothing to vectorize.  The processing time is probably negligible compared to the I/O.

The biggest win might be to optimize the input format.
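One way to check whether I/O really dominates is to time the read and the arithmetic separately. The sketch below is only an illustration: it writes its own synthetic file, and the names fname, closePx, readSeconds, and calcSeconds are made up for the example, so the timings it prints say nothing about MTB_db2.txt itself.

```matlab
% Write a small synthetic tick file so the sketch runs on its own
% (20,000 identical made-up trades in the same column layout as the question).
fname = [tempname '.txt'];
fid = fopen(fname, 'wt');
for k = 1:20000
    fprintf(fid, '11/01/2013,%u,%.2f,%.2f,%.2f,%.2f,%d,%d\n', ...
            k, 10, 10.5, 9.5, 10.2, 100, 50);
end
fclose(fid);

% Time the file read on its own.
tRead = tic;
fid = fopen(fname, 'rt');
data = textscan(fid, '%s %u %f %f %f %f %f %f', 'Delimiter', ',');
fclose(fid);
readSeconds = toc(tRead);

% Time the arithmetic on its own (here: the net transaction sum).
tCalc = tic;
closePx = data{6};
upVol   = data{7};
downVol = data{8};
net = sum(closePx .* (upVol - downVol));
calcSeconds = toc(tCalc);

fprintf('read: %.4f s, compute: %.4f s\n', readSeconds, calcSeconds);
delete(fname);   % remove the temporary file
```

Running the same two timers around the real read and the real loop would show which part is worth optimizing first.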
 
LVL 27

Expert Comment

by:d-glitch
ID: 39624820
Are you actually running your program on this data?
If so, how long does it take to execute?

Why do you have two for loops?    You should be able to do everything in a single loop.

Are you reading all this data into an array, or processing line-by-line on input?
If you read it into an array, you could use Matlab functions to find the max and min of the price column, and the sum of the shares column.  You would not need any explicit for loops.  Is that what you mean by vectorization?
LVL 15

Accepted Solution

by:
yuk99 earned 1400 total points
ID: 39625256
Here is a vectorized version. I compared the output and it's the same.
On my machine it runs about 1.6 times faster than yours. I'm using the latest MATLAB, R2013b, whose JIT compiler optimizes even non-vectorized code pretty well. If you are using an earlier version, vectorization may improve performance a lot more. TEXTSCAN gives a big improvement over TEXTREAD (which is actually deprecated). Calculating endIdx from begIdx assumes the data is properly ordered by date, but it's faster than calling UNIQUE a second time.

startTime=tic;
% load data
fid = fopen('MTB_db2.txt','rt');
data = textscan(fid,'%s %u %f %f %f %f %f %f','delimiter',',');
fclose(fid);
[tradeDate, tradeTime,open,high,low,close,upVol,downVol] = data{:};
% getting indices
[dailyDateNum, begIdx, tradeDateNum] = unique(tradeDate,'first');
endIdx = [begIdx(2:end)-1; numel(open)];
% calculations
highDay = accumarray(tradeDateNum,high,[],@max);
lowDay = accumarray(tradeDateNum,low,[],@min);
upVolDay = accumarray(tradeDateNum,upVol,[],@sum);
downVolDay = accumarray(tradeDateNum,downVol,[],@sum);
volumeDay = upVolDay + downVolDay;
netTransaction = accumarray(tradeDateNum,close.*(upVol-downVol),[],@sum);
elapsedTimeVectorized = toc(startTime)


Here are the results:
elapsedTimeNonVectorized =
          5.63204913353818
elapsedTimeVectorized =
          3.52853749486268
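To make the grouping step concrete, here is a toy example of what UNIQUE and ACCUMARRAY do with the tick data. It is a minimal sketch, not part of the timed code: the five trades and the name dayIdx are invented for illustration.

```matlab
% Toy tick data: three trades on day 1, two trades on day 2 (made-up values).
tradeDate = {'2013-01-02'; '2013-01-02'; '2013-01-02'; '2013-01-03'; '2013-01-03'};
high    = [10.2; 10.5; 10.1; 11.0; 10.8];
low     = [10.0; 10.3;  9.9; 10.7; 10.6];
close   = [10.1; 10.4; 10.0; 10.9; 10.7];   % note: shadows the CLOSE function, as in the question
upVol   = [100;    0;   50;  200;    0];
downVol = [  0;   30;    0;    0;   80];

% The third output of UNIQUE maps every trade to its day number (1, 2, ...).
[dailyDate, ~, dayIdx] = unique(tradeDate);

% ACCUMARRAY applies a function to all values that share the same day number.
% With no function handle given, it sums the values.
highDay        = accumarray(dayIdx, high, [], @max);
lowDay         = accumarray(dayIdx, low,  [], @min);
volumeDay      = accumarray(dayIdx, upVol + downVol);
netTransaction = accumarray(dayIdx, close .* (upVol - downVol));
```

For this sample, highDay is [10.5; 11.0], lowDay is [9.9; 10.6], and volumeDay is [180; 280], which is easy to confirm by eye.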

 
LVL 15

Expert Comment

by:yuk99
ID: 39625267
As for the warning: as you can see, I simply removed the 'rows' argument. It is used when you want to select the unique rows of a matrix, and it does not work for cell arrays. Since your cell array tradeDate has only one column, you don't need this argument at all.
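A small illustration of the difference, with made-up values (M, dates, and firstIdx are names chosen for this example only):

```matlab
% 'rows' is meaningful for a numeric matrix: each row counts as one item.
M = [1 2; 1 2; 3 4];
uniqueRows = unique(M, 'rows');          % 2-by-2 result: [1 2; 3 4]

% In a cell array of strings each cell is already one item, so no 'rows'
% flag is needed. 'first' asks for the index of the first occurrence.
dates = {'11/01/2013'; '11/01/2013'; '11/04/2013'};
[uniqueDates, firstIdx] = unique(dates, 'first');   % firstIdx is [1; 3]
```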

To verify the calculations, test your code on a small dataset and compare the results with calculations done by hand (or with another tool). For serious development, MATLAB has a nice Unit Testing Framework. You can read about it here: http://www.mathworks.com/help/matlab/matlab-unit-test-framework.html
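As a rough sketch of that approach using plain ASSERT calls rather than the full framework: the sample values, the expected results, and the tolerance below are all invented for the example, and the last two checks are generic sanity checks that could also be run on the full data set.

```matlab
% Small, hand-checkable sample: two days, five trades (made-up values).
dayIdx  = [1; 1; 2; 2; 2];               % day number of each trade
high    = [5.0; 5.5; 6.0; 6.2; 6.1];
low     = [4.8; 5.1; 5.9; 6.0; 5.7];
upVol   = [10;  0; 20;  0;  5];
downVol = [ 0;  4;  0; 10;  0];

% Results worked out by hand for the sample above.
expectedHigh   = [5.5; 6.2];
expectedLow    = [4.8; 5.7];
expectedVolume = [14; 35];

% Recompute with the code under test (here: the accumarray approach).
highDay   = accumarray(dayIdx, high, [], @max);
lowDay    = accumarray(dayIdx, low,  [], @min);
volumeDay = accumarray(dayIdx, upVol + downVol);

% ASSERT stops with an error if any daily value differs from the hand result.
tol = 1e-9;
assert(all(abs(highDay   - expectedHigh)   < tol), 'highDay mismatch');
assert(all(abs(lowDay    - expectedLow)    < tol), 'lowDay mismatch');
assert(all(abs(volumeDay - expectedVolume) < tol), 'volumeDay mismatch');

% Sanity checks that need no hand calculation and scale to the full data:
% daily volumes must add up to the total volume, and high can never be below low.
assert(abs(sum(volumeDay) - (sum(upVol) + sum(downVol))) < tol, 'volume total mismatch');
assert(all(highDay >= lowDay), 'high below low on some day');

disp('All daily aggregates match the hand-computed values.');
```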
 
LVL 15

Expert Comment

by:yuk99
ID: 39625289
Sorry, I missed the open/close outputs (they were included when I measured the performance above):
% calculations
openDay=open(begIdx);
closeDay=close(endIdx);
...
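Since the endIdx shortcut depends on the trades being ordered by date, it may be worth guarding it with a check. This is only a sketch on made-up data; dayIdx is a name chosen for the example, and the ISSORTED test is one possible way to detect a file that is not grouped by date.

```matlab
% Toy date column in file order (made-up dates and prices).
tradeDate = {'2013-11-01'; '2013-11-01'; '2013-11-04'; '2013-11-04'; '2013-11-04'};
open  = [10.0; 10.1; 10.5; 10.6; 10.4];   % shadows OPEN, as in the question
close = [10.1; 10.2; 10.6; 10.4; 10.3];   % shadows CLOSE, as in the question

[dailyDate, begIdx, dayIdx] = unique(tradeDate, 'first');

% The shortcut "endIdx = next day's begIdx - 1" is only valid when the day
% numbers never decrease, i.e. the file is grouped and ordered by date.
assert(issorted(dayIdx), ...
       'Trades are not ordered by date; get endIdx from a second UNIQUE call instead.');

endIdx   = [begIdx(2:end) - 1; numel(tradeDate)];
openDay  = open(begIdx);      % first trade of each day
closeDay = close(endIdx);     % last trade of each day
```

Note that UNIQUE sorts the date strings alphabetically, so date formats that do not sort chronologically as text would fail this check even if the file itself is in time order.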
