Link to home
Start Free TrialLog in
Avatar of gnf
gnf

asked on

Calculating correlation between several time series with sql server data mining

I am trying to calculate the trailing correlation between two streams of numbers, each stored in a column in a large table.  Each row is time stamped with a datetime key, but the time stamps are not at regular intervals. They occur at random times.  
I am wondering whether I can use Sql Server data mining and somehow use the functionality of the time series algorithm to get to calculations like correlation.  One advantage of the time series algo in SSAS is that it does not depend on regularly spaced time intervals.  
Or should I calculate it on a streaming basis using an ETL tool like SSIS, sampling the data at regular time intervals, and write the results out to a table?
Avatar of cyberkiwi
cyberkiwi
Flag of New Zealand image

How to you want to correlate them? By position? How about listing say 10-20 numbers from each table and showing how they should correlate?
If you want it by position, you can rownumber them:

select Coalesce(A.RN, B.RN) RN, A.number, B.number from
(select RN=row_number() over (order by datetimekey), * from tbl1) A full join
(select RN=row_number() over (order by datetimekey), * from tbl2) B on A.RN=B.RN

If you want to see them merged by time,

select coalesce(a.datetimekey,b.datetimekey) datetimekey, A.number, B.number
from tbl1 A full join tbl2 B on A.datetimekey = B.datetimekey
order by 1 ASC
Avatar of gnf
gnf

ASKER

Sorry, I should have been a bit clearer in the problem statement.
Let's say we have two stock prices. Their quotes and timestamps arrive randomly and are stored in table a and table b.
Now, I need to calculate the correlation coefficient of these two stock prices over say a 15 minute window.
If I had a row for every exact second, for example, I could do a correlation calculation by using each row's value, grouping the data over 15 minutes worth of 1 second rows.
However, this isn't the case. I have time stamps that are asynchronous. Nevertheless I need to calculate how these two prices move in relation to each other.
One approach may be to first do a query that groups the data into one second buckets, look for the value that is the latest within each second long bucket, and return that as a selected set on which to do a correlation calculation. However, this runs very slowly. A group by query that looks at say a month of data and groups it into one second buckets runs very slowly indeed.
I'm wondering if there is an efficient way to do this without needing to carve the original data into regular time buckets. In SSAS data mining, I've seen the time series algo be able to handle irregular time stamps intervals and still produce valid calculations.
 If this isn't possible, I'm wondering if there is a good way using SSIS to preprocess the data, perhaps by grouping it into one second intervals and doing correlation calcs there.
How far apart are the actual time values? If they come in at say 10-20 second intervals in one table, and 5-8 seconds in the other, isn't using 1 second intervals excessive?

Like you said, you could preprocess the data regularly, making up at 1-second intervals all the gaps in both tables.  This will produce a lot of data to say the least, but it will work.  30 days at a time (2.4mil rows) should take mere minutes if not less.
declare @startdate datetime
set @startdate = getdate() -- a start date

insert tbl1 (timecol, stockdata)
select X.tim, C.stockdata
from
(
-- 31-day month = 2678400 seconds
-- this produces all the seconds for 1 month + a bit from @startdate
select dateadd(s,m.number*1000000+n.number*1000+o.number,@startdate) as tim
from master..spt_values m
inner join master..spt_values n on n.number between 0 and 999 and n.type='P'
inner join master..spt_values o on o.number between 0 and 999 and o.type='P'
where m.number between 0 and 2 and m.type='P'
) X
left join tbl1 A on A.timecol = X.tim
cross apply (
	select top(1) B.stockdata
	from tbl1 B
	where B.timecol < X.tim
	order by B.timecol desc) C
where A.timecol is null -- doesn't yet exist

Open in new window

ASKER CERTIFIED SOLUTION
Avatar of gnf
gnf

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of gnf

ASKER

Sql Server is not ideally suited to working with large flat files like a linear time series, particularly when the table is large (4.2 billion rows).