<

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x

T-SQL: Identify bad dates in a time series

Published on
18,605 Points
5,005 Views
6 Endorsements
Last Modified:
Awarded
Jim Horn
Microsoft SQL Server database developer, architect, and author specializing in Business Intelligence, ETL, and Data Warehousing.
Have you ever had to deal with a history table that had multiple rows with a given start date and end date, and sometimes those dates were bad?  This article is a demo of SQL Server T-SQL that will identify all bad dates in a time series so you can remove them from the further processing and handle them gracefully.

History table with good dates, but some bad.
I've had multiple clients where I built SSIS packages to accept this type of data, often the client's quality controls were not very good, and having these bad dates would have caused all kinds of follow-on processes to either fail or output bad results.

In Scope:  This T-SQL Script will identify all rows that have
  • Gaps between the end date of one row and the start date of the next row
  • Overlaps between the end date and start date of two rows
  • Duplicate rows.  This may be allowed and not an error in some cases, but I'll include it here.
  • Start date = End Date.  In some companies this is known as 'trivializing' and not an error, as it allows the row to stay in the table and not be considered a part of history, but I'll include it here.
  • Start date after end date
  • Start date IS NULL

Assumptions:
  • An end date of NULL means an indefinite end date.  Sometimes a dummy number such as 12-31-2999 will be used instead. Neither one is considered better than the other.
  • Whole days are used, so one row that ends on 6-30-13 and another row that begins on 7-1-13 are considered good rows.  If your company uses datetime, you'll have to modify this code slightly to use datetime instead of time, and seconds instead of day.
  • No guessing on the business rule validations, such as if 1-1-2000 is not possible or allowed, or a date range of less than a month is not allowed.
This code works on SQL Server 2008 R2 and 2012.  The one difference between these versions is the 2012 introduction of LEAD and LAG functions which is discussed at the end of this article.

This code will also work on SQL Server 2005 if the date format is changed to datetime.

First let's create a sample time series table that includes many good rows and at least one row for each type of error.  Note that setting the dateformat to YMD you should be fine for any version of SQL Server and for any language.  Without it, any hard-coding of dates could be subject to implicit conversion errors.
 
SET DATEFORMAT YMD
GO

IF OBJECT_ID('policy') IS NOT NULL
	DROP TABLE policy

CREATE TABLE policy (
    id int identity(1,1) PRIMARY KEY, 
    policy_number int,
    dt_start date, 
    dt_end date, 
    annual_premium money)
       
INSERT INTO policy (policy_number, dt_start, dt_end, annual_premium)
VALUES 
      (1, '2000-01-01', '2005-12-31', 100), 
      (1, '2006-01-01', '2010-12-31', 110), 
      (1, '1-1-2011', NULL, 100),                     -- 1 Good
      (2, '2010-01-01', NULL, 145),                   -- 2 Good
      (3, '2010-01-01', '2005-12-31', 110),           -- 3 Good, even through the rows are entered out of order
      (4, '2000-01-01', '2005-12-31', 100), 
      (4, '2006-01-01', '2007-09-05', 110),           -- 4 Good, ended 9-5-2007
      (5, '2000-01-01', '2005-12-31', 100), 
      (5, '2006-01-05', '2010-12-31', 120),           -- 5 GAP from 12-31-2005 to 1-5-2006
      (6, '2000-01-01', '2005-12-31', 100),           -- 6 Good
      (7, '2000-01-01', '2005-12-31', 100), 
      (7, '2005-12-25', '2010-12-31', 110),           -- 7 OVERLAP from 12-31-2005 to 12-31-2005
      (8, '2000-01-01', '2005-12-31', 100), 
      (8, '2006-01-01', '2006-01-01', 110),           -- 8 Good row, but to make life easier we'll throw this one out.
      (9, '2006-01-01', '2010-01-01', 110),           -- 9 Good
      (10, '2006-01-01', '2027-12-31', 110),          -- 10 Good, even though it has a future end date. 
      (11, '2006-01-01', '2004-12-31', 110),          -- 11 Bad, START DATE = END DATE.
      (12, NULL, '2004-12-31', 110),                  -- 12 Bad, START DATE IS NULL     
      (13, '2011-01-01', NULL, 110), 
      (13, '2006-01-01', '2010-12-31', 100),          -- 13 Good, even though the rows are entered out of order
      (3, '2006-01-01', NULL, 110),
      (14, '2000-01-01', '2005-12-31', 100), 
      (14, '2000-01-01', '2005-12-31', 100),          -- 14 one good, one bad, duplicate. 
      (15, '2000-01-01', '2005-12-31', 100), 
      (15, '2000-01-01', '2005-12-31', 100),             
      (15, '2000-01-01', '2005-12-31', 100)           -- 15 one good, two bad, duplicate

Open in new window


Now let's create a temporary table to store these policies, along with some columns to mark the row as having any errors, for later use.
 
IF OBJECT_ID('tempdb..#policy_time_series') IS NOT NULL  
    DROP TABLE #policy_time_series
       
CREATE TABLE #policy_time_series (
    id int,						-- Not identity
    policy_number int, 
    dt_start date, 
    dt_end date, 
    time_variance_days bigint,
    error_type varchar(50)); 

Open in new window


Notice the use of the semicolon at the end.  This is necessary because the next statement is a Common Table Expression (CTE) which requires it.

The guts of this functionality is a single CTE expression that performs:
  • Identify error rows that will cause a date ordering to fail,
  • Order the remaining rows by date
  • Identify gaps and overlays
  • INSERTs the return set into temp table #policy_time_series for further use.
 
with 
-- Create separate sets for all the single-row errors
-- Identify start date > end date
bad_rows_bad_order as (SELECT id, policy_number, dt_start, dt_end, 'Start date after end date' as error_type FROM policy WHERE dt_end < dt_start)
-- Identify Start dates that are NULL
, bad_rows_nulls as (SELECT id, policy_number, dt_start, dt_end, 'Start date IS NULL' as error_type FROM policy WHERE dt_start IS NULL)
-- Identify Start dates = End Dates
, bad_rows_same_date as (SELECT id, policy_number, dt_start, dt_end, 'Start date = End Date' as error_type FROM policy WHERE dt_start = dt_end)
-- Identify duplicates
, bad_rows_duplicate as (
    SELECT id, policy_number, dt_start, dt_end, 'Duplicate Row' as error_type
    FROM (
        SELECT id, policy_number, dt_start, dt_end, ROW_NUMBER() OVER (PARTITION BY policy_number, dt_start, dt_end ORDER BY (SELECT 0)) AS DuplicateRowNumber
        FROM policy ) a
    WHERE a.DuplicateRowNumber > 1)
-- Perform the time series for all the remaining rows
, date_sort_order as 
(
SELECT id, policy_number, dt_start, dt_end, annual_premium, 
      ROW_NUMBER() OVER (PARTITION BY policy_number ORDER BY dt_start, dt_end) as date_sort_order
FROM policy
WHERE 
    id NOT IN (
        SELECT id FROM bad_rows_bad_order UNION ALL
        SELECT id FROM bad_rows_nulls UNION ALL
        SELECT id FROM bad_rows_same_date UNION ALL
        SELECT id FROM bad_rows_duplicate)
)
-- Calculate the difference in days between remaining rows, then add back the errors.
, date_sort_order_with_differences as
(
    -- Good records:  Calculate the difference in days between a row and the next one in line
    SELECT base.id, base.policy_number, base.dt_start, base.dt_end, 
        compare.dt_start as dt_start_next,
        DATEDIFF(DAY, base.dt_end, compare.dt_start) as time_variance_days,
        NULL as error_type         
    FROM   date_sort_order base
        -- Compare dates with previous row
        LEFT JOIN date_sort_order compare ON base.policy_number = compare.policy_number AND base.date_sort_order = (compare.date_sort_order - 1)
    UNION ALL
    SELECT id, policy_number, dt_start, dt_end, NULL as dt_start_next, NULL as time_variance_days, error_type
    FROM bad_rows_bad_order 
    UNION ALL
    SELECT id, policy_number, dt_start, dt_end, NULL as dt_start_next, NULL as time_variance_days, error_type
    FROM bad_rows_nulls 
    UNION ALL
    SELECT id, policy_number, dt_start, dt_end, NULL as dt_start_next, NULL as time_variance_days, error_type
    FROM bad_rows_same_date 
    UNION ALL
    SELECT id, policy_number, dt_start, dt_end, NULL as dt_start_next, NULL as time_variance_days, error_type
    FROM bad_rows_duplicate	
)
-- Now insert all of these rows into a temp table
INSERT INTO #policy_time_series (id, policy_number, dt_start, dt_end,time_variance_days, error_type)
SELECT id, policy_number, dt_start,  dt_end, time_variance_days, error_type
from   date_sort_order_with_differences

Open in new window


Now mark the error_type for gaps and overlays, and show the results
 
-- Mark error_type for gaps and overlaps
UPDATE #policy_time_series 
SET error_type = 
	CASE WHEN time_variance_days > 1 THEN 'Gaps' WHEN time_variance_days < 1 THEN 'Overlap' END
WHERE error_type IS NULL AND time_variance_days IS NOT NULL

-- Show the results
SELECT 'All rows' as row_status, * FROM #policy_time_series ORDER BY id
SELECT 'Good rows'as row_status, * FROM #policy_time_series WHERE error_type IS NULL AND COALESCE(time_variance_days,1) = 1 ORDER BY id
SELECT 'Bad rows' as row_status, * FROM #policy_time_series WHERE error_type IS NOT NULL ORDER BY id

Open in new window


Results
SQL Server 2012 offers an enhancement to this code by introducing the LEAD and LAG functions, which would eliminate the need for the JOIN statement.

 
        -- Compare dates with previous row
        LEFT JOIN date_sort_order compare ON base.policy_number = compare.policy_number AND base.date_sort_order = (compare.date_sort_order - 1)

Open in new window


An outstanding demo by Pinal Dave on how to apply these functions is here.
 
Thank you for reading my article, feel free to leave me some feedback regarding the content or to recommend future work. 
If you liked this article please click the 'Good Article' button.
 
I look forward to hearing from you. -  Jim Horn  ( LinkedIn ) ( Twitter )

 
6
Comment
Author:Jim Horn
1 Comment
LVL 61

Expert Comment

by:mbizup
Good article.  (Voted 'yes')
0

Featured Post

Exploring SharePoint 2016

Explore SharePoint 2016, the web-based, collaborative platform that integrates with Microsoft Office to provide intranets, secure document management, and collaboration so you can develop your online and offline capabilities.

Join & Write a Comment

Via a live example combined with referencing Books Online, show some of the information that can be extracted from the Catalog Views in SQL Server.
Via a live example, show how to backup a database, simulate a failure backup the tail of the database transaction log and perform the restore.

Keep in touch with Experts Exchange

Tech news and trends delivered to your inbox every month