Interpolate Missing Data

Hi,

I have a query that inserts data into a table from a data logging system. My issue is that when the system is offline for a prolonged period, I get gaps in the data. The data is an accumulating meter reading, and my query calculates the change between samples and saves it in the DataChange column. If gaps are present, the DataChange is huge for the first sample DateTime after the system comes back online.

I've attached a sample data file to this question to show the issue. If you look at 11/02/2014 (which is when the system went offline), samples are missing up to 8/3/2014, and therefore there is a huge DataChange.

What I need to do is run an insert query on this table to insert rows at 15-minute intervals, up until 15 minutes prior to 8/3/2014 when the system is back online. I also need to populate the DataChange and Data columns. To populate these columns I'd calculate the difference between the Data values (in this example 443023 - 428475), divide it by the number of inserted rows, use that value as the DataChange for each inserted row, and accumulate the Data values by adding it row by row.
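For example, if the gap spanned four 15-minute intervals (three missing samples; just for illustration, as the real gap is much longer): 443023 - 428475 = 14548, so each inserted row would get a DataChange of 14548 / 4 = 3637, and the Data values would accumulate as 428475 + 3637 = 432112, then 435749, then 439386, with the real 443023 reading following when the system comes back online.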
SampleData.txt
 
Jim Horn (Microsoft SQL Server Developer, Architect, and Author) commented:
>My issue is that when the system is offline for a prolonged period, I get gaps in the data.
I have an article out there, "T-SQL: Identify bad dates in a time series". Feel free to take that code and modify it to meet your needs.
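The gist of it, as a minimal sketch (not the article's code; table and column names are assumed from this thread, and lag() needs SQL Server 2012+; a ROW_NUMBER() self-join does the same on older versions):

-- Sketch only: flag samples arriving more than 15 minutes after the
-- previous sample for the same meter (tmp_tldata is an assumed name)
;with cte_prev as
( select MeterID, TimeStamp,
         lag(TimeStamp) over (partition by MeterID order by TimeStamp) as prev_ts
  from tmp_tldata )
select MeterID, prev_ts as gap_start, TimeStamp as gap_end,
       datediff(minute, prev_ts, TimeStamp) as gap_minutes
from cte_prev
where datediff(minute, prev_ts, TimeStamp) > 15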
 
wint100 (Author) commented:
I'm afraid SQL isn't my area of expertise, so I may need a little more assistance if possible.

Cheers
 
Mark Wills (Topic Advisor) commented:
So, the big "gap" should be replaced by apportioning the data / data change over the generated intervals.

Didn't we do something like that before?

Will check it out...
 
wint100 (Author) commented:
Possibly. I need to incorporate this into my existing query so it runs as part of that query.

The initial problem would be if the gap spans a reporting period (e.g. a monthly report) and the sample data doesn't catch up until the next month, so we wouldn't have that next real reading to get the difference needed to calculate the average datachange to fill in the gaps. This is why I was considering data mining: getting all the data, with datachange calculated, into a table for me to query at a later date.
 
Mark Wills (Topic Advisor) commented:
OK, so this is what I have done so far...

1) saved attachment as tab delimited file and opened/imported in Excel
2) changed some of the columns (got rid of the NULL literals, made data and datachange integer columns)
3) imported into SQL Server

and some observations....

4) ID is not sequential so needed to introduce a row_number()
5) did that as part of import, and should probably add an initial layer within the CTE
6) I rely on integer arithmetic (with rounding and divisions)
7) stated goal was to generate INSERTS (insert statement commented but shows select)
8) thought it best to use meterID with row_number()
9) datetime rounding errors (ie .003 second) caused some errors, so now rounded up
10) assumed an insert was warranted only when there were more than 2 "gaps" (15-minute blocks) in time and there was actually a difference in data values.

The code:

;with cte_15 as
( select round((datediff(minute,c.timestamp+0.00000005,x.timestamp+0.00000005)/15),1) as icount, x.data - c.data as delta_data
, dateadd(minute,datediff(minute,0,c.timestamp+0.00000005),0) as rounded_timestamp, c.*
  from tmp_tldata c
  inner join tmp_tldata x on c.rn +1 = x.rn and c.meterid = x.meterid
), cte_gaps as
  (select 'r' as ins,icount as max_icount, icount, delta_data, (delta_data / icount) as inc_data, rounded_timestamp, rn,ID,TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID
   from cte_15 
   where icount > 2
   and delta_data > 0
   union all 
   select 'y',max_icount, icount - 1, delta_data, inc_data, dateadd(minute,15,rounded_timestamp), rn,ID,TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID
   from cte_gaps
   where icount > 1)
--insert tmp_tldata (ID,TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID) 
select ID,rounded_timestamp,Hundredths,data + (inc_data * (max_icount - icount)) as data,BaseloadData,inc_Data as datachange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID 
from cte_gaps  
where ins = 'y'
order by rn, icount desc
OPTION(MAXRECURSION 0)



a) Import was into tmp_tldata - you would use your real table name...
b) I think there should be a final / additional step to update the datachange column of the following (next real) row.
c) The "data" column of that following row should be fine, and will pick up the slight variances left over from rounding of the newly inserted data.
d) haven't optimised
e) because I was a little unsure of the critical columns, I included all of them. It would be more efficient if we focused on just the columns actually needed within the CTE and linked back to the appropriate row (via ID) for the remaining information to be propagated into the inserted rows.
f) that last ORDER BY is not needed for INSERT, just there to make the select / display easier to read.

Apologies for the delays... Easter and family gatherings slightly interrupted things.
 
wint100 (Author) commented:
Great, thanks Mark. No need for apologies, it is a holiday after all!

So first I add a row number column to my table? I then need to modify my insert query to include a row number OR set this column as an identifier and auto increment?

ID not being sequential is odd; it is an identity column, so it should work. I'll check my insert query to make sure it is inserting data in the correct order; it should be ordered by timestamp asc.

You mentioned
changed some of the columns (got rid of the NULL literals, made data and datachange integer columns)
Is this a required step, i.e. do I need to modify my table to suit? I avoided integer columns to prevent rounding of the datachange and keep it as accurate as possible.
 
wint100 (Author) commented:
Ah, thinking about the ID not being sequential: multiple meters exist in this table, so the IDs for any one meter are not consecutive. This may throw up another issue if we don't have a sequential ID for each MeterID.
 
Anthony Perkins commented:
So first I add a row number column to my table? I then need to modify my insert query to include a row number OR set this column as an identifier and auto increment?
I suspect this was not what Mark meant. He is referring to ROW_NUMBER(), which is a function that creates sequential numbers on the fly.
 
wint100 (Author) commented:
Ah, OK, that would be far easier and makes sense. I'll have a read into this function.

On initial testing it works well. I removed the 'rn' from the query and used ID instead, which obviously only works when the ID is sequential. I cleared the table and ran a single MeterID, making the IDs sequential, just to test. When multiple meters are in use (50 on this project) the ID will no longer be sequential, so I need to crack the row number function.
 
Anthony Perkins commented:
When multiple meters are in use (50 on this project) the ID will no longer be sequential, so I need to crack the row number function.
More than likely you will have to use the PARTITION BY clause to reset the counter when you hit a new meter.

But wait for Mark (he is on the other side of the world and is probably counting sheep now); I am sure he will go into more detail.
 
Mark Wills (Topic Advisor) commented:
Thanks Anthony, and that is pretty much exactly what I was meaning...

Even though ID might be unique, it is not sequential by timestamp (per meterID), and that is what is needed for our joins: a predictable sequential number that can be pointed to with precision.

You'll note that point 8 above says "thought it best to use meterID with row_number()"

I used this in the import, and it looks like we should have it in the CTE as well. The column 'rn' represents the computed row number.

row_number() over (partition by meterid order by timestamp) as rn



As Anthony previously indicated, the row_number() provides us with a means to create a sequential integer with increments of 1 on the fly.
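To illustrate with a couple of hypothetical meters (made-up values, not from the sample file):

-- hypothetical illustration of the partitioned numbering
select meterid, timestamp,
       row_number() over (partition by meterid order by timestamp) as rn
from tmp_tldata
-- meter 20 is numbered rn = 1, 2, 3, ... by timestamp, then the numbering
-- restarts at rn = 1 for meter 21, no matter how the ID values fall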

In terms of rounding, while we could calculate down to a decimal place or two, so far the DATA column looks very much like an integer, so I was keeping that in mind. It will still be very accurate, because we are apportioning integer increments over the newly inserted rows, and the following row retains its correct value, which effectively takes up the remaining difference from the individual inserted values.

e.g. from the data samples...
ID      TIMESTAMP                       DATA                   
243658  2014-01-16 12:59:59.997  20     413157
243659  2014-01-16 14:29:59.997  20     413220



With time rounding we have 14:30 - 13:00.

Based on the difference in timestamp, we get 6 blocks of 15 minutes.
So we take DATA difference and get 413220 - 413157 = 63

That means (integer arithmetic) 63 / 6 = 10 remainder 3

So we need to insert another 5 new rows (the gap is 6 intervals, but the final interval ends at the existing real row)

ID      TIMESTAMP                       DATA                   
243658  2014-01-16 12:59:59.997  20     413157
443658  2014-01-16 13:15:00.000  20     413167
443658  2014-01-16 13:30:00.000  20     413177
443658  2014-01-16 13:45:00.000  20     413187
443658  2014-01-16 14:00:00.000  20     413197
443658  2014-01-16 14:15:00.000  20     413207
243659  2014-01-16 14:29:59.997  20     413220  



The last row is untouched, but essentially gets a small inflation (ie 13 instead of 10) in its difference from the prior row, which includes the remainder.

I did assume (without consultation) that a small skew towards that last row is worthwhile, because the readings in between are "fake" and that last row is a real one.

By using decimal arithmetic we could minimise that last gap; even then, if there is any rounding, that's where the remainder will end up.
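For example, with the numbers above (a quick check you can run as-is):

select 63 / 6    as integer_division    -- 10 (the remainder of 3 is discarded)
     , 63.0 / 6  as decimal_division    -- 10.500000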

So, incorporating that initial layer to calculate rn, leaving any DATA calculations to the underlying data type, and excluding ID from the insert, our query would now look like:

-- 1) cte_rn: number the rows sequentially per meter, ordered by timestamp
;with cte_rn as
( select row_number() over (partition by meterid order by timestamp) as rn, *
  from tmp_tldata
-- 2) cte_15: pair each row with the next row for the same meter; icount is
--    the gap in 15-minute blocks, delta_data the change in the meter reading
), cte_15 as
  ( select round((datediff(minute,c.timestamp+0.00000005,x.timestamp+0.00000005)/15),1) as icount, x.data - c.data as delta_data
           , dateadd(minute,datediff(minute,0,c.timestamp+0.00000005),0) as rounded_timestamp, c.*
    from cte_rn c
    inner join cte_rn x on c.rn +1 = x.rn and c.meterid = x.meterid
-- 3) cte_gaps: anchor ('r') on rows followed by a real gap, then recursively
--    generate one new 15-minute row ('y') per missing interval
), cte_gaps as
  (select 'r' as ins,icount as max_icount, icount, delta_data, (delta_data / icount) as inc_data, rounded_timestamp, rn,ID,TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID
   from cte_15 
   where icount > 2
   and delta_data > 0
   union all 
   select 'y',max_icount, icount - 1, delta_data, inc_data, dateadd(minute,15,rounded_timestamp), rn,ID,TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID
   from cte_gaps
   where icount > 1)
-- 4) keep only the generated rows; data accumulates in inc_data steps and
--    inc_data becomes the interpolated datachange (uncomment the insert to load)
--insert tmp_tldata (TimeStamp,Hundredths,Data,BaseloadData,DataChange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID) 
select rounded_timestamp,Hundredths,data + (inc_data * (max_icount - icount)) as data,BaseloadData,inc_Data as datachange,BaseloadDataChange,DateStamp,CalendarDate,Occupancy,OAT,SpaceTemp,Widgets,ManualRead,ManualDataChange,BaseloadManualRead,BaseloadManualDataChange,ManualConsumption,BaseloadManualConsumption,interval,MeterID 
from cte_gaps  
where ins = 'y'
order by rn, icount desc
OPTION(MAXRECURSION 0)



And the prior comments about performance / indexes / the datachange calc (update) for the "next" row in the sequence all still apply.
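On the performance / indexes point, an index along these lines would support both the row_number() windowing and the rn self-join (the name is just a suggestion; adjust to your real table):

-- suggested supporting index; adjust the name/table to your schema
create index IX_tldata_meterid_timestamp on tmp_tldata (meterid, timestamp)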
 
wint100 (Author) commented:
Great, thanks Mark.

Unfortunately I'm now on a family holiday until Friday and have no way to test this. I'll get onto it once I get home.

Your point about needing to update the following row's datachange is a crucial step, so maybe we can discuss that, and the other points you mention, in the meantime.

Thanks
 
Mark Wills (Topic Advisor) commented:
No problems, lemme know when you are back online (will give me time to think of the best way to do the update).
 
wint100 (Author) commented:
Will do, thanks.
 
Mark Wills (Topic Advisor) commented:
Well, I can identify the row, but it means we need to capture the outcome of the CTE, then do the inserts and the update from a temp table.

But I also surmise that datachange is a calculated column and not raw meter data. So, if possible, it would be best to move that computation to after the inserts. Of course, that might not be possible, but it would be well worth considering.
 
wint100 (Author) commented:
Yes, datachange is calculated. I think I will send you the rollup query that does this calculation from the 'data' column, which is the raw meter data.

I think it would be possible to just insert the 'data' and then do all of the datachange updates afterward.
 
Mark Wills (Topic Advisor) commented:
Well, we cannot respond to anything sent externally, so I will ignore that comment for now ;)

But, it does make sense to do all the various inserts and then perform the datachange.

If that is possible, then we can use the query above for the inserts, and the "crucial" updates will be handled as a matter of course afterwards.
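A rough sketch of that datachange pass, for when you are back (assuming datachange is simply the difference from the prior data value per meter; lag() needs SQL Server 2012+, otherwise the same rn self-join approach applies):

-- Sketch only: recompute datachange from data after all inserts are done
;with cte_calc as
( select datachange,
         data - lag(data) over (partition by meterid order by timestamp) as new_change
  from tmp_tldata )
update cte_calc
set datachange = new_change
where new_change is not null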