Solved

Generate OHLC Bars from Tick Data

Posted on 2006-11-09
8
3,251 Views
Last Modified: 2008-01-09
I have a database full of tick data from stocks.  Tick data is basically a timestamped record of all transactions that take place on a certain stock.

Here's a sample from one of the tables:

+-------+------------+---------+------+
| ID    | Timestamp  | Price   | Size |
+-------+------------+---------+------+
| 40001 | 1157722663 |    1308 |  100 |
| 40002 | 1157722663 |    1308 |    2 |
| 40003 | 1157722665 |    1308 |   11 |
| 40004 | 1157722666 | 1307.75 |    1 |
| 40005 | 1157722666 |    1308 |    7 |
| 40006 | 1157722667 | 1308.25 |    1 |
| 40007 | 1157722667 | 1308.25 |    6 |
| 40008 | 1157722668 |    1308 |   10 |
| 40009 | 1157722668 |    1308 |    7 |
| 40010 | 1157722668 |    1308 |    7 |
+-------+------------+---------+------+

What I am trying to do is generate a condensed form of this data known as OHLC bars.  This should be familiar for anyone who has looked at stock charts, but I'll explain it for those who haven't.  If I were generating 1 minute bars, I would take the first price from that minute and call it Open, the last price would be the Close, and the High and Low are the max and min for that range.  There is also a fifth parameter, Volume, that is simply the sum of all of the sizes in that range.  I have written a query that partially works but could not optimize it:

SELECT
 FLOOR(a.Timestamp / 60) * 60 AS Timestamp,
 a.Price AS Open,
 MAX(a.Price) AS High,
 MIN(a.Price) AS Low,
 (SELECT b.Price FROM AAPL_ticks b WHERE FLOOR(b.Timestamp / 60) * 60 = FLOOR(a.Timestamp / 60) * 60 LIMIT 1) AS Close,
 SUM(a.Size) AS Volume FROM AAPL_ticks a GROUP BY FLOOR(a.Timestamp / 60)

This query groups it in 60 second increments.  When I run this query, I get the correct output, but it takes forever (somewhere in the range of 1000 seconds for 390 bars).  I have narrowed down the problem to the subquery that calculates the Close. When I take this statement out the query processes thousands of bars in a few seconds.  I've tried writing a stored procedure for this but failed at that as well.  I've been working on this for a while and have finally given up, so any assistance would be very helpful.  Thanks in advance!

- Paccc
0
Comment
Question by:Paccc
  • 5
  • 3
8 Comments
 
LVL 35

Expert Comment

by:Raynard7
ID: 17912243
Hi,
I have built these kind of things before - but normally I would do them in a different way

I am not sure why you are flooring your values and dividing them by 60 then multiplying by 60.  I would just do a modular division by the start time and subtract the remainder from the value - this way you can set any increment.

Normally with these queries you would build them with a start and end time - ie select * from table where time between 123032475 and 29387502 for example.

As you have the start time you would do something like

select b.price from aapl_tics b where (b.timestamp - @startTime mod 60) = 0

for everything that is 1 minute from the start time.

The processor intensive bits here seem to be all the divisions and multiplications - that are repeated on both sides of the equation, and also in the top select

The more divisions that you can remove the easier it is.  Also division of large numbers is much slower than the division of small numbers (less processes) so if you can put a - in to make the numbers smaller you will find it should run much faster.
0
 
LVL 35

Expert Comment

by:Raynard7
ID: 17912255
by my comment

>>for everything that is 1 minute from the start time.

I meant everything that is at 1 minute intervals from the start time
0
 
LVL 35

Expert Comment

by:Raynard7
ID: 17912267
Additionally - since timestamp is a datatype in mysql you would be best not using it - it makes things hard to read and could potentially confuse the sql engine.
0
 

Author Comment

by:Paccc
ID: 17912359
I'm not sure I understand your suggestion fully.  Firstly, I like the idea of using modulus, I don't know why I didn't think of the originally, and changing the field name would be a good idea as well, thanks for that tip.  Anyway, I'm selecting all available data, not a range of date/times.  Second, I don't understand how I can get the close price from the SQL statement you have.  I tried making changes to my query and it still runs rediculously slow.  This is my new query below:

SELECT
 a.Timestamp - (a.Timestamp mod 60) AS Timestamp,
 a.Price AS Open,
 MAX(a.Price) AS High,
 MIN(a.Price) AS Low,
 (SELECT b.Price FROM AAPL_ticks b WHERE b.Timestamp - (a.Timestamp mod 60) = 0 LIMIT 1) AS Close,
 SUM(a.Size) AS Volume FROM AAPL_ticks a GROUP BY a.Timestamp - (a.Timestamp mod 60)

Thanks.

Paccc
0
Free Gift Card with Acronis Backup Purchase!

Backup any data in any location: local and remote systems, physical and virtual servers, private and public clouds, Macs and PCs, tablets and mobile devices, & more! For limited time only, buy any Acronis backup products and get a FREE Amazon/Best Buy gift card worth up to $200!

 
LVL 35

Expert Comment

by:Raynard7
ID: 17912385
hi,

I guess I am a little confused as this is only half a query

but if a.timestamp is the start time then you want to subtract it from the later time then modulo divide by 60

ie

(b.timestamp - a.timestamp) mod 60 = 0

this is because each 60 seconds the difference between the values will be a multiple of 60 - and exact multiples of 60 mod 60 = 0 - so this way you know that there is one each 60 seconds.

If you are doing (a.timestamp mod 60) you would always get the same result - and also as you are consistantly dividing a very large number it probably will be time consuming
0
 

Author Comment

by:Paccc
ID: 17912428
The close price is the last transaction in that range, it may not be a multiple of 60, for instance, the trade that occurs at 51 seconds after the minute might be the last trade for that minute and would thus be the Close price.  I still do not see how your statement would work.  Thanks for the help.

Paccc
0
 
LVL 35

Accepted Solution

by:
Raynard7 earned 250 total points
ID: 17912465
Ok, sorry - I assumed that you were taking a dump of the last settlement at each second

Well - unfortunatley you are probably never going to get a query that has to check this for each minute for each row to run quickly on any database.

I would suggest that you would be best stinking with the stored procedue idea.  

Create a procedure that goes through the recordset of your table ordered by id and then the timestamp value.
You would need to pass this the start and end time and the interval.

firstly create a temporary table to store the values for the summary for any point of time.
then create a value that holds the time that you are currently up to (@presentTime)
create another variables that holds the information for the previous row's data - time, id and price
you would create a value to see what time you were up to
then run a query to return all the value between the start time and end time ordered by id then timestamp value
You would then set @presentTime with the begin time, and set the data values equal to the first row.
the process would then be to
loop through the records - each row storing the data for the row into set variables and storing the previous data from those into a different set of variables, each time asking, is the timestamp for the current row - @presentTime > interval
if it is then you insert into your temporary table the values from the previous row - add Interval to your @present time and loop
if it is not then you just loop

at the end of this process you would have a temporary table full of the most recent rows for each minute if applicable.

this way you only need to cycle through the table once - and the mathematics is limited - with a query using floor and max and min all the time would prove to be very inefficient.
0
 

Author Comment

by:Paccc
ID: 17913336
Thanks Raynard7 for all your help.  I finally got it working using a stored procedure.  It came out pretty ugly since I had to come up with a workaround in order to use dynamic SQL on the cursor.  I will post the code here in the hopes that it might benfit someone else:

CREATE DEFINER=`root`@`localhost` PROCEDURE `OHLC_BARS`(IN Symbol VARCHAR(20),IN Period INT)
BEGIN
  DROP TEMPORARY TABLE IF EXISTS `tmp_data`;
  CREATE TEMPORARY TABLE `tmp_data` (
    `Timestamp` INT UNSIGNED NOT NULL,
    `Price` DOUBLE NOT NULL,
    KEY `Timestamp` (`Timestamp`)
  ) ENGINE=MEMORY DEFAULT CHARSET=latin1;
  SET @query_str := CONCAT("INSERT INTO `tmp_data` (Timestamp,Price) SELECT `Timestamp` - (`Timestamp` MOD ",Period,") as Timestamp, Price FROM ",Symbol,
                    " ORDER BY Timestamp");
  PREPARE query_str FROM @query_str;
  EXECUTE query_str;
  DEALLOCATE PREPARE query_str;
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE bar_ts,this_ts INT DEFAULT 0;
  DECLARE old_price,this_price DOUBLE DEFAULT 0;
  DECLARE cur CURSOR FOR SELECT * FROM `tmp_data`;
  DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done=1;

  DROP TEMPORARY TABLE IF EXISTS `tmp_ohlc`;
  CREATE TEMPORARY TABLE `tmp_ohlc` (
    `Timestamp` INT UNSIGNED NOT NULL,
    `Open` DOUBLE NOT NULL,
    `High` DOUBLE NOT NULL,
    `Low` DOUBLE NOT NULL,
    `Close` DOUBLE,
    `Volume` INT(10) UNSIGNED NOT NULL,
    KEY `Timestamp` (`Timestamp`)
  ) ENGINE=MEMORY DEFAULT CHARSET=latin1;

  SET @query_str := CONCAT("INSERT INTO `tmp_ohlc` (Timestamp,Open,High,Low,Volume) SELECT a.Timestamp - (a.Timestamp mod ",
                    Period,") AS Timestamp, a.Price AS Open, MAX(a.Price) AS High, MIN(a.Price) as Low, SUM(a.Size) AS Volume FROM ",
                    Symbol," a GROUP BY a.Timestamp - (a.Timestamp mod ",
                    Period,") ORDER BY a.Timestamp");
  PREPARE query_str FROM @query_str;
  EXECUTE query_str;
  DEALLOCATE PREPARE query_str;

  OPEN cur;

calc:LOOP
    FETCH cur INTO this_ts,this_price;
    IF done THEN
      LEAVE calc;
    END IF;
    IF NOT old_price THEN SET old_price = this_price; END IF;
    IF this_ts > bar_ts THEN
      UPDATE `tmp_ohlc` SET Close=old_price WHERE `Timestamp` = bar_ts;
      SET bar_ts = this_ts;
    END IF;
    SET old_price = this_price;
  END LOOP;
  UPDATE `tmp_ohlc` SET Close=old_price WHERE `Timestamp` = bar_ts;

  CLOSE cur;

  SELECT * FROM `tmp_ohlc`;
  DROP TEMPORARY TABLE IF EXISTS `tmp_ohlc`;
END;
DROP TEMPORARY TABLE IF EXISTS `tmp_data`;
END
0

Featured Post

Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

Fore-Foreword Today (2016) Maxmind has a new approach to the distribution of its data sets.  This article may be obsolete.  Instead of using the examples here, have a look at the MaxMind API (https://www.maxmind.com/en/geolite2-developer-package). …
This guide whil teach how to setup live replication (database mirroring) on 2 servers for backup or other purposes. In our example situation we have this network schema (see atachment). We need to replicate EVERY executed SQL query on server 1 to…
Internet Business Fax to Email Made Easy - With eFax Corporate (http://www.enterprise.efax.com), you'll receive a dedicated online fax number, which is used the same way as a typical analog fax number. You'll receive secure faxes in your email, fr…
This video gives you a great overview about bandwidth monitoring with SNMP and WMI with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're looking for how to monitor bandwidth using netflow or packet s…

759 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now