Link to home
Start Free TrialLog in
Avatar of Paccc
Paccc

asked on

Generate OHLC Bars from Tick Data

I have a database full of tick data from stocks.  Tick data is basically a timestamped record of all transactions that take place on a certain stock.

Here's a sample from one of the tables:

+-------+------------+---------+------+
| ID    | Timestamp  | Price   | Size |
+-------+------------+---------+------+
| 40001 | 1157722663 |    1308 |  100 |
| 40002 | 1157722663 |    1308 |    2 |
| 40003 | 1157722665 |    1308 |   11 |
| 40004 | 1157722666 | 1307.75 |    1 |
| 40005 | 1157722666 |    1308 |    7 |
| 40006 | 1157722667 | 1308.25 |    1 |
| 40007 | 1157722667 | 1308.25 |    6 |
| 40008 | 1157722668 |    1308 |   10 |
| 40009 | 1157722668 |    1308 |    7 |
| 40010 | 1157722668 |    1308 |    7 |
+-------+------------+---------+------+

What I am trying to do is generate a condensed form of this data known as OHLC bars.  This should be familiar for anyone who has looked at stock charts, but I'll explain it for those who haven't.  If I were generating 1 minute bars, I would take the first price from that minute and call it Open, the last price would be the Close, and the High and Low are the max and min for that range.  There is also a fifth parameter, Volume, that is simply the sum of all of the sizes in that range.  I have written a query that partially works but could not optimize it:

SELECT
 FLOOR(a.Timestamp / 60) * 60 AS Timestamp,
 a.Price AS Open,
 MAX(a.Price) AS High,
 MIN(a.Price) AS Low,
 (SELECT b.Price FROM AAPL_ticks b WHERE FLOOR(b.Timestamp / 60) * 60 = FLOOR(a.Timestamp / 60) * 60 LIMIT 1) AS Close,
 SUM(a.Size) AS Volume FROM AAPL_ticks a GROUP BY FLOOR(a.Timestamp / 60)

This query groups it in 60 second increments.  When I run this query, I get the correct output, but it takes forever (somewhere in the range of 1000 seconds for 390 bars).  I have narrowed down the problem to the subquery that calculates the Close. When I take this statement out the query processes thousands of bars in a few seconds.  I've tried writing a stored procedure for this but failed at that as well.  I've been working on this for a while and have finally given up, so any assistance would be very helpful.  Thanks in advance!

- Paccc
Avatar of Raynard7
Raynard7

Hi,
I have built these kind of things before - but normally I would do them in a different way

I am not sure why you are flooring your values and dividing them by 60 then multiplying by 60.  I would just do a modular division by the start time and subtract the remainder from the value - this way you can set any increment.

Normally with these queries you would build them with a start and end time - ie select * from table where time between 123032475 and 29387502 for example.

As you have the start time you would do something like

select b.price from aapl_tics b where (b.timestamp - @startTime mod 60) = 0

for everything that is 1 minute from the start time.

The processor intensive bits here seem to be all the divisions and multiplications - that are repeated on both sides of the equation, and also in the top select

The more divisions that you can remove the easier it is.  Also division of large numbers is much slower than the division of small numbers (less processes) so if you can put a - in to make the numbers smaller you will find it should run much faster.
by my comment

>>for everything that is 1 minute from the start time.

I meant everything that is at 1 minute intervals from the start time
Additionally - since timestamp is a datatype in mysql you would be best not using it - it makes things hard to read and could potentially confuse the sql engine.
Avatar of Paccc

ASKER

I'm not sure I understand your suggestion fully.  Firstly, I like the idea of using modulus, I don't know why I didn't think of the originally, and changing the field name would be a good idea as well, thanks for that tip.  Anyway, I'm selecting all available data, not a range of date/times.  Second, I don't understand how I can get the close price from the SQL statement you have.  I tried making changes to my query and it still runs rediculously slow.  This is my new query below:

SELECT
 a.Timestamp - (a.Timestamp mod 60) AS Timestamp,
 a.Price AS Open,
 MAX(a.Price) AS High,
 MIN(a.Price) AS Low,
 (SELECT b.Price FROM AAPL_ticks b WHERE b.Timestamp - (a.Timestamp mod 60) = 0 LIMIT 1) AS Close,
 SUM(a.Size) AS Volume FROM AAPL_ticks a GROUP BY a.Timestamp - (a.Timestamp mod 60)

Thanks.

Paccc
hi,

I guess I am a little confused as this is only half a query

but if a.timestamp is the start time then you want to subtract it from the later time then modulo divide by 60

ie

(b.timestamp - a.timestamp) mod 60 = 0

this is because each 60 seconds the difference between the values will be a multiple of 60 - and exact multiples of 60 mod 60 = 0 - so this way you know that there is one each 60 seconds.

If you are doing (a.timestamp mod 60) you would always get the same result - and also as you are consistantly dividing a very large number it probably will be time consuming
Avatar of Paccc

ASKER

The close price is the last transaction in that range, it may not be a multiple of 60, for instance, the trade that occurs at 51 seconds after the minute might be the last trade for that minute and would thus be the Close price.  I still do not see how your statement would work.  Thanks for the help.

Paccc
ASKER CERTIFIED SOLUTION
Avatar of Raynard7
Raynard7

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Paccc

ASKER

Thanks Raynard7 for all your help.  I finally got it working using a stored procedure.  It came out pretty ugly since I had to come up with a workaround in order to use dynamic SQL on the cursor.  I will post the code here in the hopes that it might benfit someone else:

CREATE DEFINER=`root`@`localhost` PROCEDURE `OHLC_BARS`(IN Symbol VARCHAR(20),IN Period INT)
BEGIN
  DROP TEMPORARY TABLE IF EXISTS `tmp_data`;
  CREATE TEMPORARY TABLE `tmp_data` (
    `Timestamp` INT UNSIGNED NOT NULL,
    `Price` DOUBLE NOT NULL,
    KEY `Timestamp` (`Timestamp`)
  ) ENGINE=MEMORY DEFAULT CHARSET=latin1;
  SET @query_str := CONCAT("INSERT INTO `tmp_data` (Timestamp,Price) SELECT `Timestamp` - (`Timestamp` MOD ",Period,") as Timestamp, Price FROM ",Symbol,
                    " ORDER BY Timestamp");
  PREPARE query_str FROM @query_str;
  EXECUTE query_str;
  DEALLOCATE PREPARE query_str;
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE bar_ts,this_ts INT DEFAULT 0;
  DECLARE old_price,this_price DOUBLE DEFAULT 0;
  DECLARE cur CURSOR FOR SELECT * FROM `tmp_data`;
  DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done=1;

  DROP TEMPORARY TABLE IF EXISTS `tmp_ohlc`;
  CREATE TEMPORARY TABLE `tmp_ohlc` (
    `Timestamp` INT UNSIGNED NOT NULL,
    `Open` DOUBLE NOT NULL,
    `High` DOUBLE NOT NULL,
    `Low` DOUBLE NOT NULL,
    `Close` DOUBLE,
    `Volume` INT(10) UNSIGNED NOT NULL,
    KEY `Timestamp` (`Timestamp`)
  ) ENGINE=MEMORY DEFAULT CHARSET=latin1;

  SET @query_str := CONCAT("INSERT INTO `tmp_ohlc` (Timestamp,Open,High,Low,Volume) SELECT a.Timestamp - (a.Timestamp mod ",
                    Period,") AS Timestamp, a.Price AS Open, MAX(a.Price) AS High, MIN(a.Price) as Low, SUM(a.Size) AS Volume FROM ",
                    Symbol," a GROUP BY a.Timestamp - (a.Timestamp mod ",
                    Period,") ORDER BY a.Timestamp");
  PREPARE query_str FROM @query_str;
  EXECUTE query_str;
  DEALLOCATE PREPARE query_str;

  OPEN cur;

calc:LOOP
    FETCH cur INTO this_ts,this_price;
    IF done THEN
      LEAVE calc;
    END IF;
    IF NOT old_price THEN SET old_price = this_price; END IF;
    IF this_ts > bar_ts THEN
      UPDATE `tmp_ohlc` SET Close=old_price WHERE `Timestamp` = bar_ts;
      SET bar_ts = this_ts;
    END IF;
    SET old_price = this_price;
  END LOOP;
  UPDATE `tmp_ohlc` SET Close=old_price WHERE `Timestamp` = bar_ts;

  CLOSE cur;

  SELECT * FROM `tmp_ohlc`;
  DROP TEMPORARY TABLE IF EXISTS `tmp_ohlc`;
END;
DROP TEMPORARY TABLE IF EXISTS `tmp_data`;
END