Link to home
Start Free TrialLog in
Avatar of Morpheus7
Morpheus7

asked on

Using rowversion for incremental load of datawarehouse.

I am currently working on a solution to load a data warehouse from a SQL Server 2005 source. From my research it seems the best way to do an incremental load is to use rowversion. The tables I need to use all have timestamp fields. This is great but I am confused as to the best way to handle these in SSIS, particularly in converting them. I am currently using min_active_rowversion() to los into a variable and compare it with the previous value stored in a utility table.

The code used to get the current active version is;

'select ? = CONVERT(varchar(50),min_active_rowversion(),1)'

I then use the following to get the last rowversion used for the last extract.

Select ? = isnull(VersionStamp,'0x0000000000000001')
from DWPROC.ETLRowversions
where SchemaName = 'dbo'
and TableName = 'Products'

Both results are mapped to String variables, 'MaxR' and 'Latest' respectively.

To determine what has changed the following is used:

"select * from dbo.Products where Timestamp > " +  
@[User::Latest] + " and Timestamp <= " +  @[User::MaxR]

When I run this, I don't get any results and there should be. Stepping through it, I'm getting an empty string in the variables. I think my proble is data conversion but I'm not sure how to implement this. I think I should be using dbo.fn_varbintohexstr() but I am not sure how to use this or where.
Any guidance would be appreciated.
Thanks
Avatar of Vitor Montalvão
Vitor Montalvão
Flag of Switzerland image

Why not use the native Replication feature from SQL Server? It has his own mechanism to only replicate the changes.
Avatar of Morpheus7
Morpheus7

ASKER

Hi,

The source system is already heavy with replication, so the business does not see this as an option in this case.
So they think that building a different process will be less heavy than using a native one?
I haven't tried doing it that way, but it seems like it should work. You can cast the timestamps as bigint and directly compare them numerically. I'm not entirely sure what you are getting when you treat them as strings, but they are byte strings, not character strings, and it doesn't sound like a good idea to mix the two.
Hi Vitor,

They accept that this is the case but replication is definitely not an option.
I wouldn't normally use replication to load a data warehouse. In my designs, which admittedly were smaller ones, it was sufficient to look at datetime (or datetime2) columns or other keys present in the data to determine what had changed or been added since the last batch.

I have also used Change Tracking, although that was a special situation that called for "lightweight replication" from remote data collectors across an SFTP link. It is similar to using rowversion, but it would probably be more complexity than you need.
Hi Megan,

Many thanks for your reply. The DW would be loaded from multiple sources, one of which is SQL 2005.  I really only need data from 1 table from this db. This leaves really limited options for incremental loading. All of the other sources are SQL 2016 and I'm using CDC for those. After research, using rowwersion seemed to be the best method for SQL 2005, but it's giving me a headache dealing with them. If you have any other suggestions, that would be appreciated.
Thanks
I am still having an issue assigning the rowversion to a variable. If I set a default value, the process works fine but otherwise it gives empty strings - help

Thanks
When possible I have used idempotent logic to extract data from source systems. Again, these were relatively small sources, such as a non-profit's association management system. I would keep track of datetime values and would ensure that the starting point for the extract was at least as early as the end of the last one. If there was any overlapping data I would detect it using natural keys and avoid inserting duplicates. This method depends on datetime values being present in the source data that reflect insert and update times with reasonable accuracy, as well as well-defined natural keys.

You can use CT or CDC when the SQL Server version supports it and when the source database is something you own and can modify. If the source DB is vendor-supplied, it is unlikely that you are going to want to modify it that way. CT is a lighter weight method of detecting changes that makes use of the transaction log in conjunction with a set of added system tables, but it still requires significant changes to the source database, as well as recovery procedures in the event that change packets are lost and a resync is required.

(More in next post...)
Here is a schema,
CREATE TABLE [dbo].[Business](
	[BusinessId] [int] IDENTITY(1,1) NOT NULL,
	[Name] [varchar](255) NOT NULL,
	[RowVers] [timestamp] NOT NULL,
 CONSTRAINT [PK_Business] PRIMARY KEY CLUSTERED 
(
	[BusinessId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Open in new window

and a query against a simple table containing a rowversion column, RowVers, that I adapted from an earlier EE solution example:
SELECT [BusinessId]
      ,[Name]
      ,[RowVers]
      ,RowVersBigInt = CAST(RowVers as bigint)
FROM [EE160908].[dbo].[Business]

Open in new window

and the results:
BusinessId	Name	RowVers	RowVersBigInt
1	Bus1	0x00000000000007D1	2001
2	Bus2	0x00000000000007D2	2002
3	Bus3	0x00000000000007D3	2003

Open in new window

As you can see, the rowversion byte strings readily convert to bigint (as long as the top bit remains zero, but it is likely to do so during our lifetimes). The rowversion strings themselves are big-endian, and if you try to work with them without converting to bigint first you may run into problems.
Hi Megan,

Many thanks for that.

I am storing the rowversion value of the last extract as a varchar in my management table. so that the SSIS job can query that and get all data greater than that value from the table.

The code to get this is
Select VersionStamp
from DWPROC.ETLRowversions
where SchemaName = 'dbo'
and TableName = Products

I would then update the management table with the value of the max rowversion in the source table.
The code to get this is:
 SELECT MAX([timestamp])
FROM products

I have created 2 variables of data type varchar to hold these returned values, but when I step through the process they contain an empty string.
ASKER CERTIFIED SOLUTION
Avatar of Megan Brooks
Megan Brooks
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial