asked on

How to increase the speed of a FOR loop binning function

Hi All,

I have built a function that is too slow. I was wondering if anyone knew a way to speed up/ vectorize a for loop function by a factor of 5 or 10? Her name is "MakeVolumeBinIdx"

The purpose of my program is to bin my input data according to a fixed Volume size. So each row should have (almost) the same size volume .

My Input data is made of Date,Time, Price,Volume which is tick data (transaction per transaction) of a stock.
My output: Date,Time, Open,High,Low,Close,Volume (the volume should be almost equal for each candles or bins)

I have included the function's code, the program, the data set and the R files as well.
Also, I have a Hp laptop that runs on intel i7 and windows8.

Thank you in advance :)

The function I need major help with

MakeVolumeBinIdx<-function(data1.volume,volBinSize){
  ### PURPOSE: Find the indexes for a given size of volume bin, that is 
  ###            Find the indexes of the data.frame where the sum of the volume is equal
  ###             to the input volBinSize.
  ### INPUT: a vector of trades volume and the desired volume bin size
  ### OUTPUT: a vector indicating where each row belongs to which volume bin
  
  #create index
  volBin<-1
  sumVol<-0
  Volume<-data1.volume
  volBinIdx <- numeric(length(Volume))
  
  #create cutting for each volume bin
  for(i in seq_len(length(Volume))){
    sumVol<-sumVol + Volume[i]  
    if (sumVol<= volBinSize) {
      volBinIdx[i] <- volBin
    } else {
      volBinIdx[i] <-  volBin <- volBin + 1
      sumVol <- Volume[i]
    }
  }
  
  #clean environment
  rm(Volume,i,sumVol,volBinSize,volBin)
  
  return(volBinIdx)
}

Open in new window

My Program

##### put all functions neededin-memory
source("FT_functions_SO.R")

## read data in
data1<-read.table("XYZ_EE.txt",sep=",",stringsAsFactor=F,header=F,
                      colClasses=c("character","character","numeric","numeric"))

#Name Columns
colnames(data1)<-c("Date","Time","Price","Volume")

#Add columns for total amount traded
data1["TT"]<-data1[,"Price"]*data1[,"Volume"]


##### time volBinIdx5K
start.time.volBinIdx5k<-Sys.time()

volBinIdx5k<-MakeVolumeBinIdx(data1$Volume,5000)
## Purpose: find indexes where cumulative volume equal 5,000 shares
## Input: data1$Volume and the size of the volume bin
## Output: vector with indexes for each row signifying which row belong to which 
##          volume bin

##### time it took volBinIdx5k
end.time.volBinIdx5k<-Sys.time()
time.volBinIdx5k<-end.time.volBinIdx5k-start.time.volBinIdx5k
time.volBinIdx5k


##### time MakeBinCandles
start.time.MakeBindcandles.5k<-Sys.time()

data1.5k.dfm<-MakeBinCandles(data1,volBinIdx5k)
## Purpose: Create candles based on volume bins
## Input: data1
## Output: data.frame: Date,Time,OHLC,volume,HighIdx,LowIdx,MF,TT, VWAP

##### time it took MakeBinCandles for 5000 shares
end.time.MakeBinCandles.5k<-Sys.time()
time.MakeBinCandles.5k<-end.time.MakeBinCandles.5k-start.time.MakeBindcandles.5k
time.MakeBinCandles.5k

Open in new window

The functions:

### FT_functions_EE.R

MakeVolumeBinIdx<-function(data1.volume,volBinSize){
  ### PURPOSE: Find the indexes for a given size of volume bin, that is 
  ###            Find the indexes of the data.frame where the sum of the volume is equal
  ###             to the input volBinSize.
  ### INPUT: a vector of trades volume and the desired volume bin size
  ### OUTPUT: a vector indicating where each row belongs to which volume bin
  
  #create index
  volBin<-1
  sumVol<-0
  Volume<-data1.volume
  volBinIdx <- numeric(length(Volume))
  
  #create cutting for each volume bin
  for(i in seq_len(length(Volume))){
    sumVol<-sumVol + Volume[i]  
    if (sumVol<= volBinSize) {
      volBinIdx[i] <- volBin
    } else {
      volBinIdx[i] <-  volBin <- volBin + 1
      sumVol <- Volume[i]
    }
  }
  
  #clean environment
  rm(Volume,i,sumVol,volBinSize,volBin)
  
  return(volBinIdx)
}


MakeBinCandles<-function(data,volBinIdxk){
  ### PURPOSE: Create candles based on bins
  ### INPUT: a new data.frame containing only Date,Time, Price,Volume,TT, AND
  ###          a vector containing the output of MakeVolumeBinIdx
  ### OUTPUT: a data.frame  with Date,Time, OHLC, Volume,
  ###           HighIdx,lowIdx, TT,VWAP (=volume weighted average price)
  
  library(dplyr)
  
  data.return<-data %>%
    mutate(volBinIdxk=volBinIdxk) %>%
    group_by(volBinIdxk) %>%
    summarize(Date=head(Date,1),
              Time=head(Time,1),
              Open=head(Price,1),
              High=max(Price),
              Low=min(Price),
              Close=tail(Price,1),
              Volume=sum(Volume),
              # HighIdx=which.max(Price),
              # LowIdx=which.min(Price),
              TT=sum(TT,na.rm=T),
              VWAP=TT/Volume) %>%
    select(-volBinIdxk) %>%
    as.data.frame()
  
  return(data.return)
  
}


MakeBinCandlesXts<-function(data){
  ### PURPOSE: Turn bin candles from data.frame into xts object
  ### INPUT: data frame outputed by MakeBinCandles()
  ### OUTPUT: xts object
  library(xts)
  data$Date<-strptime(paste(data$Date,data$Time),"%m/%d/%Y %H:%M:%S")
  
  data<-data[,-2] # if I don't remove it, all columns become characters
  data.xts<-xts(data[,-1],order.by=as.POSIXct(data[,1]))
  
  return (data.xts)
  
}

Open in new window

XYZ-EE.txt
FT-functions-EE.txt
FT-program-EE.txt

gheist

installed.packages() and sessionInfo() please
is it recent version of R (like 3.x?)
rstudio - is it NOT running inside it? same with plain R? (again - is it the latest one i.e build 1091?)
Not everybody uses Windows (e.g me)
Any libraries loaded in your code not shown here?

ASKER CERTIFIED SOLUTION

daveslater

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

aikimark

I've been looking at this question for several hours and I'm not sure I understand it. Wouldn't the Volume just be the average of the volume items for that stock for that day?
= sum(volume)/count(stockID)

Using a stats program like R, I would try to express the output as simply as possible, using R functions.

gheist

R is not deep into multiprocessing.
Depends on speed you want, you might also schedule tasks to parallel library (default on recent versions of R)
like detectCores number of them.

aikimark

Stats based on your sample file:

TradingDate	Open	High	Low	Close	Avg
9/12/2014	79.84	79.95	79.17	79.35	3.95737617753338
9/14/2014	79.27	79.33	79.04	79.15	3.17172646427476
9/15/2014	79.15	79.48	79.1 	79.33	3.6664891173994
9/16/2014	79.33	79.58	79.13	79.48	3.43944172041261

Open in new window