Solved

Excel column not converting properly in SSIS data flow

Posted on 2010-09-13
Last Modified: 2012-05-10
Hi,

I am developing an SSIS package to perform an ETL operation on an external Excel file, loading data from it each day into a SQL table. The Excel file is updated by users and has a few characteristics that my SSIS package needs to handle. Firstly, it has a column named [Description] which contains claim references. The format of the claim number varies. At the top of the file most of the claim references are four-digit integers (e.g. 1234), but further down this gives way to a longer format composed of three numbers separated by the forward slash character "/" (e.g. 2010/123456/01). Secondly, the users frequently skip rows when inputting new records to create logical record groups. All the data is held in [Sheet1$], which has a total of 3234 rows including the header row at the top and all the intermittent blank rows within the records. When I make a copy of the Excel file and sort on [Description], I can see that there are 2247 records with a claim reference that is not blank. These are the records I want to pull into the SQL table, leaving the others behind.

I know that Excel tries to guess data types, which is unhelpful to my SSIS data conversion. Therefore, using a tip from one of the experts on this forum from a previous question, I have modified the Excel connection string to use the IMEX=1 option:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\\EDL-PDC\All\Claims Banking.xls;Extended Properties="EXCEL 8.0;HDR=YES;IMEX=1";

Next I wrote the following query for my Excel source to filter out the unwanted records with a NULL claim reference:
SELECT
      [Description] AS claim_no,
      [Date] AS recovery_date,
      [Amount] AS amount
FROM [Sheet1$]
WHERE [Description] IS NOT NULL;

Next I placed a Data Conversion step in my data flow in between the Excel data source and OLE DB destination. The Data Conversion makes a copy of the Description (claim_no) column as type string with length 30.

Finally the claim_no copy column from the Data Conversion is mapped to my claim_no column in the SQL table, which is of type varchar(30).

When I run the package, it completes without error but loads only 875 rows into SQL. All of these rows have a claim_no in the format XXXX (e.g. 1234). When I remove the WHERE clause from my Excel source SQL command text I get 3233 rows but most of the claim numbers have converted to NULL.

Is anyone able to offer a likely explanation for this and a fix? I should mention that the ETL involves two other columns from the Excel file (recovery_date and amount) both of which are included in the Data Conversion as types date and currency. There are also redundant columns in the Excel file that I ignore in the Excel source SQL command text.

Any helpful pointers to where I may be going wrong will be gratefully received.

Many thanks in advance.

P.S. I've posted this in the MS SQL Server 2008 zone (I'm using SQL Server 2008 R2) because I cannot find an SSIS zone (the search for SSIS or Integration Services in the Zones area yields nothing). Please put me straight if there is a dedicated SSIS zone so I may use this in the future.
Question by:irb56
16 Comments

Expert Comment

by:carsRST
Lot of text here, so forgive me if I overlooked anything.

But you're probably better served letting SSIS do most of the work for you.  See example link below on importing an Excel file within SSIS.  

But basically you'll add an Excel source, set it to your file and tab, and let SSIS import it.  There are other things you can do in between to limit the empty descriptions from importing, but easiest is to just import and then delete empty rows afterwards.

Import rows --> then run a DELETE statement to remove the bad data

http://blogs.techrepublic.com.com/datacenter/?p=205

The way you're doing it poses too many risks, as you're seeing.  

Author Comment

by:irb56
Thanks for the prompt response and sorry for the lengthy description of the issue. I packed in a lot of contextual detail. My problem is not really one of unwanted rows so unfortunately the import then a SQL DELETE suggestion doesn't resolve the issue. When I forget trying to filter out the unwanted rows during the load, the real problem is that lots of columns containing genuine claim references in Excel are ending up as NULL values in SQL.

I had a similar issue the other day with a different Excel file and it was resolved by adding the IMEX=1 option into the Excel connection string but on this occasion it hasn't worked. Apart from the different characteristics of the files, the only other difference I can think of is that the Excel source file is accessed via a network UNC path rather than a local path.

Expert Comment

by:carsRST
>>one of unwanted rows so unfortunately the import then a SQL DELETE suggestion doesn't resolve the issue.

I'm saying to import DIFFERENTLY, not just the delete.  See the link - follow it.  The way you're importing is too risky, as you're seeing.


Expert Comment

by:carsRST
You're trying to do the work that SSIS already has fleshed out.  Let it do the import.

Author Comment

by:irb56
Forgive me if I've missed something but the Tech Republic article is a beginner's guide to a simple Excel import into SQL using SSIS, with nothing to handle the false data type assumptions. I've followed these steps which results in the following:
3233 rows imported into SQL (this is all the rows in the Excel file - none missing)
875 rows have a [Description] column value that is not NULL (all have a numeric value in form of 1234)
2358 rows have a [Description] column that is NULL (back to square one - same problem that I started with)

When you let SSIS and the Excel driver do this import without any help (i.e. no changes to the Excel connection string and no intermediate transformation in between the data flow's Excel source and OLE DB destination) the result is that the [Description] source column is identified as DT_R8 (double float) and the destination column (when allowing SSIS to create a new table) is identified as float. There are many values in the Excel [Description] column with values such as 2009/123456/01 and all of these are converting to NULL in the data flow.

Expert Comment

by:carsRST
Those steps are an intro guide to importing from Excel.  You can dictate what the destination desc column should be within SSIS as you create it (or change it in the database afterwards).  You do not have to accept the defaults.  If after changing the desc data type you're still not getting all the values, and a conversion task in the middle of source and destination.

Expert Comment

by:carsRST
Sorry...on phone.  Was supposed to say "add a conversion task"

Author Comment

by:irb56
I have added a conversion task between Excel source and OLE DB destination, which makes a copy of the [Description] column as a string (mapping to a SQL column of type varchar). All the 2009/123456/01 values still convert to NULL. I think we have our wires crossed because I've done all this before and appear to be restating my problem and the steps I've already taken.

I don't understand why the Data Conversion task, which receives the [Description] column as a double-precision float from the data flow pipeline coming out of the Excel source and copies it into [Copy of Description] with data type string, appears to turn 2009/123456/01 into NULL.

The fix should be the IMEX option (see this link: http://microsoftdw.blogspot.com/2005/11/excel-data-not-coming-into-ssis-right.html), which has worked for me previously but not on this occasion. I wonder if it's something to do with the fact that the Excel file used to have a blank row at the very top. Although I've deleted the top row, maybe it still remembers somehow and this upsets things. I'll keep fiddling around as I'm sure this issue is resolvable and may be due to something simple that I'm blind to.

Author Comment

by:irb56
I've added a data viewer to the data flow pipeline coming out of the Excel data source and the problem is that all values that don't fit the 1234 format are coming out NULL at this point, so the issue is not with the Data Conversion task. The question is why does the IMEX option in my Excel connection string appear not to disregard Excel's data type guess? The SSIS package doesn't throw any errors so no evidence of me getting the syntax wrong. I've copied and pasted the connection string directly from my SSIS package (note this is the same as my original posting except that I'm now tinkering with a differently named Excel file stored locally):

Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\BI\Recoveries Test.xls;Extended Properties="EXCEL 8.0;HDR=YES;IMEX=1";

Author Comment

by:irb56
Please find attached Excel file containing the problem column (I've chopped out all other columns). If you can successfully import this column into SQL using SSIS and without losing any data, please let me know how you did it.

Many thanks.
Claim-Refs-to-Import.xls

Expert Comment

by:carsRST
What version of SSIS are you using?

And is the path to the file always the same?

Author Comment

by:irb56
SQL Server 2008 R2 and I've tried different file paths, including a local and network UNC path. Always the same result.

Accepted Solution

by:carsRST
Two things I suggest you do to get this to work.  I have tested and works fine.

1.  In your connection manager at the bottom, click on the one for Excel.  In the properties (usually at the bottom right), find the "ConnectionString" section and add the "IMEX=1;" at the end.  Something like below.
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\temp\import.xls; Extended Properties="EXCEL 8.0;HDR=YES;IMEX=1;"

2.  Next on the main "Control Flow", add a "Script Task" and add the following code to it.  This code will make the first line of actual data a string.  Have this Script Task run BEFORE the data flow task.  
You will need to modify the path in the code.

Option Strict Off

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime

Public Class ScriptMain

    Public Sub Main()

        ' Late-bound Excel automation: Excel must be installed on the
        ' machine that runs the package for CreateObject to succeed.
        Dim app As Object
        Dim wb As Object

        app = CreateObject("Excel.Application")
        wb = app.Workbooks.Open("c:\temp\Claim-Refs-to-Import.xls")

        ' Prefix the first data cell (A2) with an apostrophe so that
        ' Excel stores it as text rather than as a number.
        wb.ActiveSheet.Range("A2").Value = "'" & wb.ActiveSheet.Range("A2").Value

        wb.Save()
        wb.Close()
        app.Quit()
        app = Nothing
        GC.Collect()

        Dts.TaskResult = Dts.Results.Success

    End Sub

End Class


Expert Comment

by:carsRST
Don't copy my connection string.  just add the "IMEX=1;" at the end.
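
For anyone who would rather script that change than edit the property by hand, here is a minimal Python sketch of adding IMEX=1 to the Extended Properties section of an existing Jet connection string. The function name ensure_imex is hypothetical, and the parsing is deliberately naive: it assumes the Extended Properties value is wrapped in double quotes, as in the strings quoted in this thread. It is an illustration only, not part of SSIS.

```python
import re


def ensure_imex(connection_string: str) -> str:
    """Return the connection string with IMEX=1 inside Extended Properties.

    Hypothetical helper: assumes a double-quoted Extended Properties value.
    """
    match = re.search(r'Extended Properties="([^"]*)"', connection_string,
                      flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no quoted Extended Properties section found")

    # Split the existing properties and drop empty fragments.
    props = [p.strip() for p in match.group(1).split(";") if p.strip()]
    # Remove any existing IMEX setting, then append IMEX=1 at the end.
    props = [p for p in props if not p.upper().startswith("IMEX=")]
    props.append("IMEX=1")

    new_section = 'Extended Properties="' + ";".join(props) + ';"'
    return (connection_string[:match.start()]
            + new_section
            + connection_string[match.end():])


original = (r'Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\temp\import.xls;'
            r'Extended Properties="EXCEL 8.0;HDR=YES";')
updated = ensure_imex(original)
print(updated)
```

Running the helper on a string that already contains IMEX=1 leaves it unchanged, so it is safe to apply more than once.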

Author Comment

by:irb56
Thanks very much for your help, especially for going to the trouble of testing the import. Unfortunately the script code isn't valid in my environment when I copy and paste it into a script task. BIDS reports that "the binary code for the script is not found" (and I did change the path name appropriately). By copying just the code of the Main Sub I get rid of the "the binary code for the script is not found" but when running the task an exception is thrown: "Cannot create ActiveX component". I'm a novice with VB so got stuck.

However, I'm pleased to say that I have now fixed the problem a different way and also now understand why the problem is happening. I found this blog useful: http://sqlblog.de/blog/2009/04/ssis-excel-import-column-data-types/. What this told me was that the IMEX=1 option fails to make the column a string if it finds only numeric values in its sampling. The number of column values sampled is defined by the TypeGuessRows key located in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\X.X\Engines\Excel (although on my 64-bit Windows 2008 R2 server I found it here: HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Excel). The default setting for TypeGuessRows is 8 and because the first 8 values for the Description column in my Excel worksheet are all numeric, the IMEX=1 option does not detect mixed data types and doesn't impose a string type to cater for mixed data types. This is why I'm seeing double-precision float type coming out of the Excel source even though the IMEX=1 option is present in the Excel connection string.
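
The sampling behaviour described above can be illustrated with a toy model in Python. This is only a sketch of the behaviour as described in this post, not the Jet driver's actual algorithm; the function names and the sample values are made up for illustration.

```python
def guess_column_type(cells, type_guess_rows=8):
    """Toy model of type guessing with IMEX=1.

    Only the first `type_guess_rows` cells are sampled. The column is
    treated as text if any sampled cell is a string, otherwise numeric.
    """
    sample = [c for c in cells[:type_guess_rows] if c is not None]
    if any(isinstance(c, str) for c in sample):
        return "text"
    return "numeric"


def read_column(cells, type_guess_rows=8):
    """Return the column as the modelled driver would hand it to SSIS."""
    if guess_column_type(cells, type_guess_rows) == "text":
        # Text column: every non-empty value survives as a string.
        return [None if c is None else str(c) for c in cells]
    # Numeric column: anything that is not a number comes back as NULL.
    return [c if isinstance(c, (int, float)) else None for c in cells]


# Eight numeric claim references followed by one in the longer format.
column = [1234, 1235, 1236, 1237, 1238, 1239, 1240, 1241, "2010/123456/01"]
print(guess_column_type(column))
print(read_column(column))
```

In this model only the first eight cells are sampled, so the column is typed as numeric and the longer reference comes back as None, which matches the NULLs seen in the data viewer.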

I tried changing the TypeGuessRows value to 2000 but this just caused the SSIS package to error at the beginning of the data flow task (when I put it back to 8, it worked again). There's probably a way around this but I found another way (see below).

I have fixed the issue by inserting a new row at row 2 in the source Excel file and placing some dummy data values into this row, including a value of "ABC" for the Description column. I've then hidden this row in Excel so that the end users don't see it. After this, the IMEX option works. The column comes out of the Excel source as a Unicode string and all of my data values are preserved. As you originally suggested, I tidy up the redundant values (caused by the gaps in the rows throughout the worksheet) via an Execute SQL task that deletes the NULLs.
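
The clean-up step can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for SQL Server so that the example runs anywhere. The table name recoveries, the sample rows, and the idea that the hidden dummy row (claim_no = 'ABC') also lands in the table and needs deleting are assumptions for illustration, not details taken from the actual package.

```python
import sqlite3

# In-memory database standing in for the real SQL Server table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recoveries ("
    "claim_no VARCHAR(30), recovery_date TEXT, amount REAL)"
)

# Hypothetical rows as they might arrive from the Excel source.
rows = [
    ("ABC", None, None),                      # hidden dummy row from row 2
    ("1234", "2010-09-01", 150.0),
    (None, None, None),                       # blank spacer row
    ("2010/123456/01", "2010-09-02", 75.5),
]
conn.executemany("INSERT INTO recoveries VALUES (?, ?, ?)", rows)


def delete_unwanted(connection, sentinel="ABC"):
    """Delete blank rows and the dummy sentinel row; return rows removed."""
    cur = connection.execute(
        "DELETE FROM recoveries WHERE claim_no IS NULL OR claim_no = ?",
        (sentinel,),
    )
    connection.commit()
    return cur.rowcount


removed = delete_unwanted(conn)
remaining = [r[0] for r in conn.execute(
    "SELECT claim_no FROM recoveries ORDER BY rowid")]
print(removed, remaining)
```

In an Execute SQL task the equivalent would be a single DELETE statement with the same WHERE clause against the destination table.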

Many thanks for your help and perseverance.

Author Closing Comment

by:irb56
Please see final comment from me (irb56), posted on 16/09/10 to see how I eventually resolved the issue.