Solved

Interpolate Missing Data

Posted on 2014-04-07
17
412 Views
Last Modified: 2014-04-25
Hi,

I have a query that inserts data into a table from a data logging system. My issue is that when the system is offline for a prolonged period, I get gaps in the data. The data is an accumulating meter reading, and my query calculates the change between samples and saves it in the DataChange column. If gaps are present, the DataChange is huge for the first sample DateTime after the system comes back online.

I've attached a sample data file to this question to show the issue. If you look at 11/02/2014 (which is when the system went offline), there are missing samples up to 8/3/2014, and therefore a huge DataChange on the first sample after the gap.

What I need to do is run an insert query on this table and insert rows at 15-minute intervals, up until 15 minutes prior to 8/3/2014 when the system is back online. I also need to populate the DataChange and Data columns. To populate these columns I'd calculate the difference between the Data values either side of the gap (in this example 443023 - 428475), divide this by the number of inserted rows, use that value as the DataChange for each inserted row, and accumulate the Data values by adding that value row by row.
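
As a rough worked example (the interval count here is made up for illustration, not taken from the attached file): 443023 - 428475 = 14548. If the offline period covered, say, 100 missing 15-minute intervals, each inserted row would get a DataChange of 14548 / 100 = 145 (with the small remainder absorbed by the first real sample after the gap), and the Data column would step 428475, 428620, 428765, ... back up to the real reading of 443023.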
SampleData.txt
Question by:wint100
17 Comments
 
LVL 65

Expert Comment

by:Jim Horn
ID: 39983597
>My issue is that when the system is offline for a prolonged period, I get gaps in the data.
I have an article out there on T-SQL: Identify bad dates in a time series.  Feel free to take that code and modify it to meet your needs.
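
Not the article code itself, but a minimal sketch of the idea, assuming your logging table has TimeStamp and MeterID columns (YourLogTable below is a placeholder for your table name; LEAD() needs SQL Server 2012 or later, on older versions a self-join on ROW_NUMBER() does the same job):

-- Sketch only: flag samples where the next reading for the same meter
-- is more than 15 minutes away
;with ordered as
( select MeterID, TimeStamp,
         lead(TimeStamp) over (partition by MeterID order by TimeStamp) as NextTimeStamp
  from YourLogTable )
select MeterID, TimeStamp, NextTimeStamp,
       datediff(minute, TimeStamp, NextTimeStamp) as GapMinutes
from ordered
where datediff(minute, TimeStamp, NextTimeStamp) > 15
order by MeterID, TimeStamp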
 
LVL 1

Author Comment

by:wint100
ID: 39984030
I'm afraid SQL isn't my area of expertise, so I may need a little more assistance if possible.

Cheers
 
LVL 51

Expert Comment

by:Mark Wills
ID: 40007873
so, the big "gap" should be replaced by apportioning data / data change over the generated intervals.

didn't we do something like that before ?

will check it out...
 
LVL 1

Author Comment

by:wint100
ID: 40008240
Possibly, I need to incorporate this into my existing query so it does this as part of that query.

The initial problem would be if the gap occurred over a reporting period (e.g. a monthly report) and the sample data didn't catch up until the next month, so we wouldn't yet have the next real reading to get the difference needed to calculate the average DataChange to fill in the gaps. This is why I was considering data mining: getting all the data, with DataChange calculated, into a table I can query at a later date.
 
LVL 51

Expert Comment

by:Mark Wills
ID: 40010816
OK, so this is what I have done so far...

1) saved attachment as tab delimited file and opened/imported in Excel
2) changed some of the columns (get rid of NULL literal, made data and datachange integer columns)
3) imported into SQL Server

and some observations....

4) ID is not sequential so needed to introduce a row_number()
5) did that as part of import, and should probably add an initial layer within the CTE
6) I rely on integer arithmetic (with rounding and divisions)
7) stated goal was to generate INSERTS (insert statement commented but shows select)
8) thought it best to use meterID with row_number()
9) datetime rounding errors (ie .003 second) caused some errors, so now rounded up
10) assumed that an insert was warranted only when there were more than 2 "gaps" (15-minute blocks) in time and there was actually a difference in data values.

The code:

;with cte_15 as
( select round((datediff(minute,c.timestamp+0.00000005,x.timestamp+0.00000005)/15),1) as icount, x.data - c.data as delta_data
, dateadd(minute,datediff(minute,0,c.timestamp+0.00000005),0) as rounded_timestamp, c.*
  from tmp_tldata c
  inner join tmp_tldata x on c.rn +1 = x.rn and c.meterid = x.meterid
), cte_gaps as
  (select 'r' as ins,icount as max_icount, icount, delta_data, (delta_data / icount) as inc_data, rounded_timestamp, rn,ID,TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID
   from cte_15 
   where icount > 2
   and delta_data > 0
   union all 
   select 'y',max_icount, icount - 1, delta_data, inc_data, dateadd(minute,15,rounded_timestamp), rn,ID,TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID
   from cte_gaps
   where icount > 1)
--insert tmp_tldata (ID,TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID) 
select ID,rounded_timestamp,Hundredths,data + (inc_data * (max_icount - icount)) as data,BaseloadData,inc_Data as datachange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID 
from cte_gaps  
where ins = 'y'
order by rn, icount desc
OPTION(MAXRECURSION 0)



a) Import was into tmp_tldata - you would use your real table name...
b) I think there should be a final / additional step to update the following row's DataChange column (see the sketch after this list).
c) The following "data" column should be fine and will pick up the slight variances due to rounding left over from the newly inserted data.
d) haven't optimised
e) because I was a little unsure of the critical columns, I included all. It would be more efficient if we focused on just the columns actually needed within the CTE and link to the appropriate row (via ID) for remaining information to be propagated into inserted columns.
f) that last ORDER BY is not needed for INSERT, just there to make the select / display easier to read.
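
For (b), a minimal sketch of what that follow-up update might look like, assuming DataChange should simply be the step from the previous reading for the same meter, and that ID still identifies each row uniquely after the inserts (LAG() needs SQL Server 2012 or later):

-- Sketch only: recompute DataChange as the difference from the previous
-- reading for the same meter
;with ordered as
( select ID, Data,
         lag(Data) over (partition by MeterID order by TimeStamp) as PrevData
  from tmp_tldata )
update t
set DataChange = o.Data - o.PrevData
from tmp_tldata t
inner join ordered o on o.ID = t.ID
where o.PrevData is not null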

Apologies for the delays... Easter and Family gatherings slightly interrupted.
 
LVL 1

Author Comment

by:wint100
ID: 40010992
Great, thanks Mark. No need for apologies, it is a holiday after all!

So first I add a row number column to my table? I then need to modify my insert query to include a row number OR set this column as an identifier and auto increment?

ID not being sequential is odd, it is an identity column so should work. I'll check my insert query to make sure it is inserting data in the correct order; it should be ordered by timestamp asc.

You mentioned
changed some of the columns (get rid of NULL literal, made data and datachange integer columns)
Is this a required step, i.e. do I need to modify my table to suit? I avoided integer columns so the DataChange isn't rounded, to keep it as accurate as possible.
 
LVL 1

Author Comment

by:wint100
ID: 40011001
Ah, thinking about the ID not being sequential: multiple meters exist in this table, so the IDs for any one MeterID are not consecutive. This may throw in another issue if we don't have a sequential ID for each MeterID.
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 40011382
So first I add a row number column to my table? I then need to modify my insert query to include a row number OR set this column as an identifier and auto increment?
I suspect this was not what Mark meant. He is referring to ROW_NUMBER(), which is a function that creates sequential numbers on the fly.
 
LVL 1

Author Comment

by:wint100
ID: 40011393
Ah, OK, this would be far easier and makes sense. I'll have a read into this function.

On initial testing it works well. I removed the 'rn' from the query and used ID instead, which obviously only works when the ID is sequential. I cleared the table and ran for a single MeterID, making the IDs sequential, just to test. When multiple meters are in use (50 on this project) the ID will no longer be sequential, so I need to crack the row number function.
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 40011432
When multiple meters are in use (50 on this project) the ID will no longer be sequential, so I need to crack the row number function.
More than likely you will have to use the PARTITION BY clause to reset the counter when you hit a new meter.
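
A minimal illustration of the difference (data made up): without PARTITION BY the numbering runs across all meters, with it the count restarts at 1 for each MeterID:

-- numbering restarts per meter, ordered by time
select MeterID, TimeStamp,
       row_number() over (partition by MeterID order by TimeStamp) as rn
from tmp_tldata
order by MeterID, TimeStamp

-- MeterID  TimeStamp                 rn
-- 1        2014-02-11 00:00:00.000   1
-- 1        2014-02-11 00:15:00.000   2
-- 2        2014-02-11 00:00:00.000   1   <-- restarts for the next meter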

But wait for Mark (he is on the other side of the world and is probably counting sheep now), I am sure he will go into more detail.
 
LVL 51

Accepted Solution

by:
Mark Wills earned 500 total points
ID: 40012402
Thanks Anthony, and that is pretty much exactly what I was meaning...

Even though ID might be unique, it is not sequential by timestamp (per meterID), and that is what is needed for our joins: a predictable sequential number that can be pointed to with precision.

You'll note that point 8 above says "thought it best to use meterID with row_number()"

I used this in the import, and it looks like we should have it in the CTE as well. The column 'rn' represents the computed row number.

row_number() over (partition by meterid order by timestamp) as rn



As Anthony previously indicated, the row_number() provides us with a means to create a sequential integer with increments of 1 on the fly.

In terms of rounding, while we could calculate down to a decimal place or two, so far the DATA column looks very much to be an integer, so I was keeping that in mind. It will still be very accurate, because we are apportioning integer increments over the newly inserted rows, and the following (real) row retains its correct value, which effectively takes up the remaining difference left over from the individual inserted values.

e.g from data samples...
ID      TIMESTAMP                       DATA                   
243658  2014-01-16 12:59:59.997  20     413157
243659  2014-01-16 14:29:59.997  20     413220



with time rounding we have 14:30 - 13:00

Based on the difference in timestamp, we get 6 blocks of 15 minutes.
So we take DATA difference and get 413220 - 413157 = 63

That means (integer arithmetic) 63 / 6 = 10 remainder 3

So we need to insert another 5 new rows (calculated as 6 but we must exclude current)

ID      TIMESTAMP                       DATA                   
243658  2014-01-16 12:59:59.997  20     413157
443658  2014-01-16 13:15:00.000  20     413167
443658  2014-01-16 13:30:00.000  20     413177
443658  2014-01-16 13:45:00.000  20     413187
443658  2014-01-16 14:00:00.000  20     413197
443658  2014-01-16 14:15:00.000  20     413207
243659  2014-01-16 14:29:59.997  20     413220  



The last row is untouched, but essentially gets a small inflation (ie 13 instead of 10) of the difference from the prior row, which includes the remainder.

I did assume (without consultation) that a small skew towards that last row is worthwhile because of the "fake" readings in between, and that last row being a real one.

By using decimal arithmetic we could minimise that last gap; even then, if there is any rounding, that's where the remainder will end up.
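
If decimal apportioning were wanted, the only change is a cast before dividing; the line (delta_data / icount) as inc_data in cte_gaps would become something like (the precision here is just an assumption):

-- decimal instead of integer division when deriving the per-row increment
cast(delta_data as decimal(18,4)) / icount as inc_data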

So, incorporating that initial layer to calculate RN, leaving any DATA calculations to the underlying data type, and excluding ID from the insert, our query would now look like:

;with cte_rn as
( select row_number() over (partition by meterid order by timestamp) as rn, *
  from tmp_tldata
), cte_15 as
  ( select round((datediff(minute,c.timestamp+0.00000005,x.timestamp+0.00000005)/15),1) as icount, x.data - c.data as delta_data
           , dateadd(minute,datediff(minute,0,c.timestamp+0.00000005),0) as rounded_timestamp, c.*
    from cte_rn c
    inner join cte_rn x on c.rn +1 = x.rn and c.meterid = x.meterid
), cte_gaps as
  (select 'r' as ins,icount as max_icount, icount, delta_data, (delta_data / icount) as inc_data, rounded_timestamp, rn,ID,TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID
   from cte_15 
   where icount > 2
   and delta_data > 0
   union all 
   select 'y',max_icount, icount - 1, delta_data, inc_data, dateadd(minute,15,rounded_timestamp), rn,ID,TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID
   from cte_gaps
   where icount > 1)
--insert tmp_tldata (TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID) 
select rounded_timestamp,Hundredths,data + (inc_data * (max_icount - icount)) as data,BaseloadData,inc_Data as datachange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID 
from cte_gaps  
where ins = 'y'
order by rn, icount desc
OPTION(MAXRECURSION 0)



And the prior comments about performance / indexes / the DataChange calc (update) for the "next" row in the sequence all still apply.
 
LVL 1

Author Comment

by:wint100
ID: 40012548
Great, thanks Mark.

Unfortunately I'm now on a family holiday until Friday and have no way to test this. I'll get on to it once I get home.

Your point about needing to update that following DataChange is a crucial step, so maybe we can discuss this, and the other points you mention, in the meantime.

Thanks
 
LVL 51

Expert Comment

by:Mark Wills
ID: 40012592
No problems, lemme know when you are back online (will give me time to think of the best way to do the update).
 
LVL 1

Author Comment

by:wint100
ID: 40012735
Will do thanks
 
LVL 51

Expert Comment

by:Mark Wills
ID: 40016486
Well, I can identify the row, but it means we need to capture the outcomes of the CTE, then do the inserts and the update from a temp table.

But I also surmise that DataChange is a calculated column and not raw meter data. So, if possible, it would be best to change the timing of that computation to after the inserts. Of course, that might not be possible, but it would be well worth considering.
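
For reference, the capture-then-update flow would look roughly like this (an outline only; #gap_rows is a made-up temp table name):

-- Sketch: take the query above and change its final SELECT into a
-- SELECT ... INTO, so the generated rows land in a temp table first
select rounded_timestamp, data + (inc_data * (max_icount - icount)) as data,
       inc_data as datachange, MeterID   -- plus the remaining columns as above
into #gap_rows
from cte_gaps
where ins = 'y'
OPTION (MAXRECURSION 0)

-- then:    insert tmp_tldata (...) select ... from #gap_rows
-- finally: update DataChange on the first real row after each gap,
--          now that its new predecessor row exists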
 
LVL 1

Author Comment

by:wint100
ID: 40016703
Yes, DataChange is calculated. I think I will send you the rollup query that does this calculation from the 'data' column, which is the raw meter data.

I think it would be possible to just insert the 'data' and then do all of the DataChange update work afterwards.
 
LVL 51

Expert Comment

by:Mark Wills
ID: 40017985
Well, we cannot respond to anything sent externally, so I will ignore that comment for now ;)

But it does make sense to do all the various inserts and then perform the DataChange calculation.

If that is possible, then we can use the query above for the inserts, and the "crucial" updates will be handled as a matter of course after the inserts.