• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 506
  • Last Modified:

unpivoting and summing up the counts in an SSIS package

Hi Experts, I am attaching the source file (category_list.xlsx). I am assuming it has to be unpivoted to get the output file (category_sum_output.csv). If you look at the output file you will see that I have to sum the counts for a given category type based on the corresponding %. For example, for Coupe, 1.66% corresponds to a count of 8,523 and 0.71% corresponds to a count of 7,950, so the final output file should show Coupe with the sum of the counts (8,523 + 7,950 = 16,473). Please help me do this in SSIS 2005. I know how to use an OLE DB source to pick up the contents of the Excel file (since it is Excel 2007), but I am getting errors at the Unpivot transformation. Please help.
thanks


Category-list.xlsx
Category-sum-output.csv
Asked by shyamaladevib
2 Solutions
 
shyamaladevib (Author) Commented:
Please find the attached files.
 
Megan Brooks (SQL Server Consultant) Commented:
I have had a lot of problems trying to use the Excel Connection Manager, and I don't particularly recommend it because it can be very quirky. I tried your scenario using SSIS 2008 and was informed that the Excel Connection Manager doesn't work with my x64 OS for lack of an OLE DB provider.
The essential problem that I saw, though, was that the Excel Source decided that the Convertible category had a data type of DT_WSTR while it chose DT_R8 for all the other categories. This caused the Unpivot transformation to report an error, since they all must have the same type.
The data type chosen by the Excel Connection Manager and data source seems to be set in stone. I've tried adjusting the external column data types of the data source, but that just produces errors. You can add a conversion (of the Convertible column to DT_R8) ahead of the Unpivot, and that might work, but I am concerned that the percent signs in the incoming data might create further problems. You can remove the percent sign from the DT_WSTR column, but I don't know whether you will even get data for the DT_R8 columns because of that; the Excel Connection Manager tends to change things it doesn't like to NULL. Unfortunately I don't have an x86 platform with which to actually run the package.
I need to regularly read and process Excel input myself. Because of all this quirkiness, often leading to data loss, I have taken to manually converting the Excel files to tab-delimited text. The flat file provider/data source allows you to precisely define the data type for each column.
If someone else has a better solution then I would like to hear about it. I don't enjoy having to convert all my Excel input into .txt files, but when the connection manager/data source arbitrarily decides the type of each column, and converts any values to NULL that don't agree with its decision, what can one do?
It would probably also be possible to convert Excel to Access and read the Access database in SSIS. I have had good results reading other Access sources, and the conversion from Excel to Access might involve fewer manual steps. I haven't tried it yet, though.
 
Megan Brooks (SQL Server Consultant) Commented:
It occurred to me that you could get the Excel Connection Manager/data source to choose the right data types by using a simplified sample spreadsheet when creating the components. The sample spreadsheet would contain only two rows, the header row and one row of data containing exactly the data types that you want.

The problem I ran into, though, would not be fixed by this approach. I found (using SSIS 2005) that if, for example, a ZIP code column started with a 5-digit code then it would be treated as a numeric column, but if it started with a 9-digit code (containing a hyphen) then it would be treated as a text string. Any 5-digit numeric-only codes encountered in the "text" column were converted to nulls without warning. Any 9-digit alphanumeric codes encountered in the numeric column were likewise converted to nulls. So even with the right types declared in the metadata, these components would not be usable. Experiment, though; it's always possible it will work for you. Just watch out for data being converted to nulls.
 
shyamaladevib (Author) Commented:
Hi rscowden,
Thanks a lot for the explanation. Would it be possible to let me know how I could do this with a SQL statement instead of using the Unpivot transformation?
 
Megan Brooks (SQL Server Consultant) Commented:
The quickest thing I can think of is to use a UNION (see the code sample).
SELECT 'Sedans' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Sedans > 0
UNION
SELECT 'Hatchback' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Hatchback > 0
UNION
SELECT 'Coupe' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Coupe > 0
UNION
SELECT 'Suv' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Suv > 0
UNION
SELECT 'Van/Minivan' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE [Van/Minivan] > 0
UNION
SELECT 'Convertible' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Convertible > 0
UNION
SELECT 'Wagon' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Wagon > 0

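For reference, if the same data is staged in a SQL Server 2005 table (compatibility level 90 or later), the T-SQL UNPIVOT operator can collapse the UNION into a single statement. A minimal sketch, assuming a CategoryList table with a numeric [Count] column and one column per category:

-- Sketch only: assumes CategoryList is a SQL Server table and that all
-- category columns share one data type (UNPIVOT, like the SSIS Unpivot
-- transformation, requires this).
SELECT Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
UNPIVOT (Pct FOR Category IN
    (Sedans, Hatchback, Coupe, Suv, [Van/Minivan], Convertible, Wagon)) AS u
WHERE Pct > 0
GROUP BY Category;

Note that this still lists every category column, so it has the same maintenance issue as the UNION when the column list is long.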

 
shyamaladevib (Author) Commented:
Hi rscowden,
Thanks a lot for the response. Though this is an easy solution, I actually have more than 100 columns (the file I provided is a dummy one; the actual contents are different), so it would be far too tedious to do it the way you suggested. Please suggest another effective method.
Thanks again for your time!
 
Megan Brooks (SQL Server Consultant) Commented:
If you have 100+ columns of data to normalize then you will need the union of 100+ queries. You can build the query code in a loop in a stored procedure or external application, and you can use column metadata from SQL Server rather than hard-coding all the column names.
You don't actually have to do a union -- you can execute each query and insert the results into a table. That would help avoid running into varchar length limits or SQL Server complexity limits.
Here's a rough sketch (see code sample). The SELECT/INSERT query would have the same form as those in the earlier example, but would include an INSERT query to send the results to a table, on the theory that you can execute 100+ INSERTS without any difficulty but creating a 100+ query UNION might run into trouble somewhere.
The "for each" would probably actually be a cursor run against a metadata query, such as the system metadata views INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.COLUMNS.
I suspect it would be quicker, though, to build the whole thing by hand using copy/paste and search/replace, unless there is some reason why new queries would need to be constructed repeatedly.

For each category column in table metadata
    begin
    set @q = <construct SELECT/INSERT query for just one column name>
    sp_executesql @q [parameterizing the query might simplify things]
    end

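A minimal T-SQL sketch of that loop, assuming the data sits in a SQL Server staging table named CategoryList with a numeric [Count] column, and that the per-category sums go into a hypothetical CategorySums table:

-- Sketch only: table and column names are assumptions; adjust to your schema.
-- Loops over every column of CategoryList except [Count] and inserts one
-- summed row per category into CategorySums.
DECLARE @col sysname, @q nvarchar(max);

DECLARE col_cursor CURSOR FOR
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'CategoryList'
      AND COLUMN_NAME <> 'Count';

OPEN col_cursor;
FETCH NEXT FROM col_cursor INTO @col;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @q = N'INSERT INTO CategorySums (Category, SumOfCounts) '
           + N'SELECT ''' + REPLACE(@col, N'''', N'''''') + N''', SUM([Count]) '
           + N'FROM CategoryList WHERE ' + QUOTENAME(@col) + N' > 0;';
    EXEC sp_executesql @q;
    FETCH NEXT FROM col_cursor INTO @col;
END;

CLOSE col_cursor;
DEALLOCATE col_cursor;

Because INFORMATION_SCHEMA.COLUMNS supplies the column names, nothing has to be hard-coded when the column list changes.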

 
nmcdermaid Commented:
rscowden,
Does this help with your Excel data type issues?
http://www.sqldts.com/254.aspx
It would work better in Access (or any other database), because you must define a specific data type for each column and you can't store any data that doesn't match that type. In Excel you can type anything you want into a cell and there is no check that it matches a data type. That's the price you pay for the flexibility Excel offers. It's not a database; it's a matrix where you can type anything into any random cell with zero validation.
You can certainly use an expression to convert, say, 57% (a string) to 0.57 (numeric), but again, what if someone types the letter 'A' into the cell? There's nothing stopping them. Excel is just not a reliable data source. Apologies for getting on my soapbox about this, but it is a constant frustration for me. You can't have the flexibility of a validation-free spreadsheet and then expect it to fit into a strongly typed (and consistent) ETL or database system.
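If the raw values do end up staged as text in SQL Server (for example via the tab-delimited flat file route mentioned earlier), a defensive conversion along these lines, sketched against a hypothetical StagingCategoryList table, keeps a stray 'A' from breaking the load:

-- Sketch only: RawPct holds the raw text, e.g. '57%' or 'A'.
-- ISNUMERIC is permissive (it also accepts '$', ',' and the like),
-- so tighten the check if the source is especially dirty.
SELECT RawPct,
       CASE
           WHEN ISNUMERIC(REPLACE(RawPct, '%', '')) = 1
               THEN CAST(REPLACE(RawPct, '%', '') AS float) / 100.0
           ELSE NULL   -- flag or reject rows that are not valid percentages
       END AS PctValue
FROM StagingCategoryList;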
 
Regarding the original question:
Can you attach the Unpivot error that you are getting?
 
Megan Brooks (SQL Server Consultant) Commented:
I just happened to think that a script transformation with an asynchronous output should also be able to perform this kind of unpivot, but that doesn't solve the problems with the Excel data source.

If you can ensure that the Excel data is clean enough to avoid issues with the source component, or copy the Excel data to another format before proceeding, I think that a script transformation can do the rest. You would still have to update the metadata if the number of columns changed, but the script should be able to iterate over the input columns without knowing how many there are or what they are all called, if it can manage to index the input buffer (I am not yet sure about that).
 
shyamaladevib (Author) Commented:
Though it's not a direct solution, these inputs gave me some insight.
Thanks
