Community Pick: Many members of our community have endorsed this article.
Editor's Choice: This article has been selected by our editors as an exceptional contribution.

T-SQL: Identify bad dates in a time series

Jim HornSQL Server Data Dude
CERTIFIED EXPERT
Microsoft SQL Server database developer, architect, and author specializing in Business Intelligence, ETL, and Data Warehousing.
Published:
Updated:
Have you ever had to deal with a history table that had multiple rows with a given start date and end date, and sometimes those dates were bad?  This article is a demo of SQL Server T-SQL that will identify all bad dates in a time series so you can remove them from the further processing and handle them gracefully.

History table with good dates, but some bad.
I've had multiple clients where I built SSIS packages to accept this type of data, often the client's quality controls were not very good, and having these bad dates would have caused all kinds of follow-on processes to either fail or output bad results.

In Scope:  This T-SQL Script will identify all rows that have
  • Gaps between the end date of one row and the start date of the next row
  • Overlaps between the end date and start date of two rows
  • Duplicate rows.  This may be allowed and not an error in some cases, but I'll include it here.
  • Start date = End Date.  In some companies this is known as 'trivializing' and not an error, as it allows the row to stay in the table and not be considered a part of history, but I'll include it here.
  • Start date after end date
  • Start date IS NULL

Assumptions:
  • An end date of NULL means an indefinite end date.  Sometimes a dummy number such as 12-31-2999 will be used instead. Neither one is considered better than the other.
  • Whole days are used, so one row that ends on 6-30-13 and another row that begins on 7-1-13 are considered good rows.  If your company uses datetime, you'll have to modify this code slightly to use datetime instead of time, and seconds instead of day.
  • No guessing on the business rule validations, such as if 1-1-2000 is not possible or allowed, or a date range of less than a month is not allowed.
This code works on SQL Server 2008 R2 and 2012.  The one difference between these versions is the 2012 introduction of LEAD and LAG functions which is discussed at the end of this article.

This code will also work on SQL Server 2005 if the date format is changed to datetime.

First let's create a sample time series table that includes many good rows and at least one row for each type of error.  Note that setting the dateformat to YMD you should be fine for any version of SQL Server and for any language.  Without it, any hard-coding of dates could be subject to implicit conversion errors.
 
SET DATEFORMAT YMD
                      GO
                      
                      IF OBJECT_ID('policy') IS NOT NULL
                      	DROP TABLE policy
                      
                      CREATE TABLE policy (
                          id int identity(1,1) PRIMARY KEY, 
                          policy_number int,
                          dt_start date, 
                          dt_end date, 
                          annual_premium money)
                             
                      INSERT INTO policy (policy_number, dt_start, dt_end, annual_premium)
                      VALUES 
                            (1, '2000-01-01', '2005-12-31', 100), 
                            (1, '2006-01-01', '2010-12-31', 110), 
                            (1, '1-1-2011', NULL, 100),                     -- 1 Good
                            (2, '2010-01-01', NULL, 145),                   -- 2 Good
                            (3, '2010-01-01', '2005-12-31', 110),           -- 3 Good, even through the rows are entered out of order
                            (4, '2000-01-01', '2005-12-31', 100), 
                            (4, '2006-01-01', '2007-09-05', 110),           -- 4 Good, ended 9-5-2007
                            (5, '2000-01-01', '2005-12-31', 100), 
                            (5, '2006-01-05', '2010-12-31', 120),           -- 5 GAP from 12-31-2005 to 1-5-2006
                            (6, '2000-01-01', '2005-12-31', 100),           -- 6 Good
                            (7, '2000-01-01', '2005-12-31', 100), 
                            (7, '2005-12-25', '2010-12-31', 110),           -- 7 OVERLAP from 12-31-2005 to 12-31-2005
                            (8, '2000-01-01', '2005-12-31', 100), 
                            (8, '2006-01-01', '2006-01-01', 110),           -- 8 Good row, but to make life easier we'll throw this one out.
                            (9, '2006-01-01', '2010-01-01', 110),           -- 9 Good
                            (10, '2006-01-01', '2027-12-31', 110),          -- 10 Good, even though it has a future end date. 
                            (11, '2006-01-01', '2004-12-31', 110),          -- 11 Bad, START DATE = END DATE.
                            (12, NULL, '2004-12-31', 110),                  -- 12 Bad, START DATE IS NULL     
                            (13, '2011-01-01', NULL, 110), 
                            (13, '2006-01-01', '2010-12-31', 100),          -- 13 Good, even though the rows are entered out of order
                            (3, '2006-01-01', NULL, 110),
                            (14, '2000-01-01', '2005-12-31', 100), 
                            (14, '2000-01-01', '2005-12-31', 100),          -- 14 one good, one bad, duplicate. 
                            (15, '2000-01-01', '2005-12-31', 100), 
                            (15, '2000-01-01', '2005-12-31', 100),             
                            (15, '2000-01-01', '2005-12-31', 100)           -- 15 one good, two bad, duplicate

Open in new window


Now let's create a temporary table to store these policies, along with some columns to mark the row as having any errors, for later use.
 
IF OBJECT_ID('tempdb..#policy_time_series') IS NOT NULL  
                          DROP TABLE #policy_time_series
                             
                      CREATE TABLE #policy_time_series (
                          id int,						-- Not identity
                          policy_number int, 
                          dt_start date, 
                          dt_end date, 
                          time_variance_days bigint,
                          error_type varchar(50)); 

Open in new window


Notice the use of the semicolon at the end.  This is necessary because the next statement is a Common Table Expression (CTE) which requires it.

The guts of this functionality is a single CTE expression that performs:
  • Identify error rows that will cause a date ordering to fail,
  • Order the remaining rows by date
  • Identify gaps and overlays
  • INSERTs the return set into temp table #policy_time_series for further use.
 
with 
                      -- Create separate sets for all the single-row errors
                      -- Identify start date > end date
                      bad_rows_bad_order as (SELECT id, policy_number, dt_start, dt_end, 'Start date after end date' as error_type FROM policy WHERE dt_end < dt_start)
                      -- Identify Start dates that are NULL
                      , bad_rows_nulls as (SELECT id, policy_number, dt_start, dt_end, 'Start date IS NULL' as error_type FROM policy WHERE dt_start IS NULL)
                      -- Identify Start dates = End Dates
                      , bad_rows_same_date as (SELECT id, policy_number, dt_start, dt_end, 'Start date = End Date' as error_type FROM policy WHERE dt_start = dt_end)
                      -- Identify duplicates
                      , bad_rows_duplicate as (
                          SELECT id, policy_number, dt_start, dt_end, 'Duplicate Row' as error_type
                          FROM (
                              SELECT id, policy_number, dt_start, dt_end, ROW_NUMBER() OVER (PARTITION BY policy_number, dt_start, dt_end ORDER BY (SELECT 0)) AS DuplicateRowNumber
                              FROM policy ) a
                          WHERE a.DuplicateRowNumber > 1)
                      -- Perform the time series for all the remaining rows
                      , date_sort_order as 
                      (
                      SELECT id, policy_number, dt_start, dt_end, annual_premium, 
                            ROW_NUMBER() OVER (PARTITION BY policy_number ORDER BY dt_start, dt_end) as date_sort_order
                      FROM policy
                      WHERE 
                          id NOT IN (
                              SELECT id FROM bad_rows_bad_order UNION ALL
                              SELECT id FROM bad_rows_nulls UNION ALL
                              SELECT id FROM bad_rows_same_date UNION ALL
                              SELECT id FROM bad_rows_duplicate)
                      )
                      -- Calculate the difference in days between remaining rows, then add back the errors.
                      , date_sort_order_with_differences as
                      (
                          -- Good records:  Calculate the difference in days between a row and the next one in line
                          SELECT base.id, base.policy_number, base.dt_start, base.dt_end, 
                              compare.dt_start as dt_start_next,
                              DATEDIFF(DAY, base.dt_end, compare.dt_start) as time_variance_days,
                              NULL as error_type         
                          FROM   date_sort_order base
                              -- Compare dates with previous row
                              LEFT JOIN date_sort_order compare ON base.policy_number = compare.policy_number AND base.date_sort_order = (compare.date_sort_order - 1)
                          UNION ALL
                          SELECT id, policy_number, dt_start, dt_end, NULL as dt_start_next, NULL as time_variance_days, error_type
                          FROM bad_rows_bad_order 
                          UNION ALL
                          SELECT id, policy_number, dt_start, dt_end, NULL as dt_start_next, NULL as time_variance_days, error_type
                          FROM bad_rows_nulls 
                          UNION ALL
                          SELECT id, policy_number, dt_start, dt_end, NULL as dt_start_next, NULL as time_variance_days, error_type
                          FROM bad_rows_same_date 
                          UNION ALL
                          SELECT id, policy_number, dt_start, dt_end, NULL as dt_start_next, NULL as time_variance_days, error_type
                          FROM bad_rows_duplicate	
                      )
                      -- Now insert all of these rows into a temp table
                      INSERT INTO #policy_time_series (id, policy_number, dt_start, dt_end,time_variance_days, error_type)
                      SELECT id, policy_number, dt_start,  dt_end, time_variance_days, error_type
                      from   date_sort_order_with_differences

Open in new window


Now mark the error_type for gaps and overlays, and show the results
 
-- Mark error_type for gaps and overlaps
                      UPDATE #policy_time_series 
                      SET error_type = 
                      	CASE WHEN time_variance_days > 1 THEN 'Gaps' WHEN time_variance_days < 1 THEN 'Overlap' END
                      WHERE error_type IS NULL AND time_variance_days IS NOT NULL
                      
                      -- Show the results
                      SELECT 'All rows' as row_status, * FROM #policy_time_series ORDER BY id
                      SELECT 'Good rows'as row_status, * FROM #policy_time_series WHERE error_type IS NULL AND COALESCE(time_variance_days,1) = 1 ORDER BY id
                      SELECT 'Bad rows' as row_status, * FROM #policy_time_series WHERE error_type IS NOT NULL ORDER BY id

Open in new window


Results
SQL Server 2012 offers an enhancement to this code by introducing the LEAD and LAG functions, which would eliminate the need for the JOIN statement.

 
        -- Compare dates with previous row
                              LEFT JOIN date_sort_order compare ON base.policy_number = compare.policy_number AND base.date_sort_order = (compare.date_sort_order - 1)

Open in new window


An outstanding demo by Pinal Dave on how to apply these functions is here.
 
Thank you for reading my article, feel free to leave me some feedback regarding the content or to recommend future work. 
If you liked this article please click the 'Good Article' button.
 
I look forward to hearing from you. -  Jim Horn  ( LinkedIn ) ( Twitter )

 
6
7,214 Views
Jim HornSQL Server Data Dude
CERTIFIED EXPERT
Microsoft SQL Server database developer, architect, and author specializing in Business Intelligence, ETL, and Data Warehousing.

Comments (1)

CERTIFIED EXPERT
Most Valuable Expert 2012
Top Expert 2013

Commented:
Good article.  (Voted 'yes')

Have a question about something in this article? You can receive help directly from the article author. Sign up for a free trial to get started.