• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 506
  • Last Modified:

unpivoting and summing up the counts in an SSIS package

Hi Experts, I am attaching the source file (category_list.xlsx). I am assuming it has to be unpivoted to get the output file (category_sum_output.csv). If you look at the output file you will see that I have to sum the counts for a given category type based on the corresponding %. For example, for Coupe, 1.66% corresponds to a count of 8,523 and 0.71% corresponds to a count of 7,950, so the final output file should show Coupe with the sum of the counts (8,523 + 7,950 = 16,473). Please help me do this in SSIS 2005. I know how to use an OLE DB source to pick up the contents of the Excel file (since it is Excel 2007), but I am getting errors at the Unpivot transformation. Please help.
thanks


Category-list.xlsx
Category-sum-output.csv
Asked by shyamaladevib
2 Solutions
 
shyamaladevib (Author) Commented:
Please find the attached files.
 
Megan Brooks (SQL Server Consultant) Commented:
I have had a lot of problems trying to use the Excel Connection Manager, and I don't particularly recommend it because it can be very quirky. I tried your scenario using SSIS 2008 and was informed that the Excel Connection Manager doesn't work with my x64 OS for lack of an OLE DB provider.
The essential problem that I saw, though, was that the Excel Source decided that the Convertible category had a data type of DT_WSTR while it chose DT_R8 for all the other categories. This caused the Unpivot transformation to report an error, since they all must have the same type.
The data type chosen by the Excel Connection Manager and data source seems to be set in stone. I've tried adjusting the external column data types of the data source, but that just produces errors. You can add a conversion (of the Convertible column to DT_R8) ahead of the Unpivot, and that might work, but I am concerned that the percent signs in the incoming data might create further problems. You can remove the percent sign from the DT_WSTR column, but I don't know whether you will even get data for the DT_R8 columns because of that; the Excel Connection Manager tends to change things it doesn't like to NULL. Unfortunately I don't have an x86 platform with which to actually run the package.
I need to regularly read and process Excel input myself. Because of all this quirkiness, often leading to data loss, I have taken to manually converting the Excel files to tab-delimited text. The flat file provider/data source allows you to precisely define the data type for each column.
If someone else has a better solution then I would like to hear about it. I don't enjoy having to convert all my Excel input into .txt files, but when the connection manager/data source arbitrarily decides the type of each column, and converts any values to NULL that don't agree with its decision, what can one do?
It would probably also be possible to convert Excel to Access and read the Access database in SSIS. I have had good results reading other Access sources, and the conversion from Excel to Access might involve fewer manual steps. I haven't tried it yet, though.
 
Megan Brooks (SQL Server Consultant) Commented:
It occurred to me that you could get the Excel Connection Manager/data source to choose the right data types by using a simplified sample spreadsheet when creating the components. The sample spreadsheet would contain only two rows, the header row and one row of data containing exactly the data types that you want.

The problem I ran into, though, would not be fixed by this approach. I found (using SSIS 2005) that if, for example, a ZIP code column started with a 5-digit code then it would be treated as a numeric column, but if it started with a 9-digit code (containing a hyphen) then it would be treated as a text string. Any 5-digit numeric-only codes encountered in the "text" column were converted to nulls without warning. Any 9-digit alphanumeric codes encountered in the numeric column were likewise converted to nulls. So even with the right types declared in the metadata, these components would not be usable. Experiment, though; it's always possible it will work for you. Just watch out for data being converted to nulls.
 
shyamaladevib (Author) Commented:
Hi rscowden,
Thanks a lot for the explanation. Would it be possible to let me know how I could do this with a SQL statement instead of using the Unpivot transformation?
 
Megan Brooks (SQL Server Consultant) Commented:
The quickest thing I can think of is to use a UNION (see the code sample).
SELECT 'Sedans' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Sedans > 0
UNION
SELECT 'Hatchback' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Hatchback > 0
UNION
SELECT 'Coupe' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Coupe > 0
UNION
SELECT 'Suv' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Suv > 0
UNION
SELECT 'Van/Minivan' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE [Van/Minivan] > 0
UNION
SELECT 'Convertible' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Convertible > 0
UNION
SELECT 'Wagon' AS Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
WHERE Wagon > 0

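For reference, if the same data is staged in a SQL Server 2005 table (compatibility level 90 or later), the T-SQL UNPIVOT operator can collapse the UNION into a single statement. A minimal sketch, assuming a CategoryList table with a numeric [Count] column and one column per category:

-- Sketch only: assumes CategoryList is a SQL Server table and that all
-- category columns share one data type (UNPIVOT, like the SSIS Unpivot
-- transformation, requires this).
SELECT Category, SUM([Count]) AS [Sum of Counts]
FROM CategoryList
UNPIVOT (Pct FOR Category IN
    (Sedans, Hatchback, Coupe, Suv, [Van/Minivan], Convertible, Wagon)) AS u
WHERE Pct > 0
GROUP BY Category;

Note that this still lists every category column, so it has the same maintenance issue as the UNION when the column list is long.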

 
shyamaladevib (Author) Commented:
Hi rscowden,
Thanks a lot for the response. Though this is an easy solution, I actually have more than 100 columns (the file I provided is a dummy one; the actual contents are different), so it would be far too tedious to do it the way you suggested. Please suggest another effective method.
Thanks again for your time!
 
Megan Brooks (SQL Server Consultant) Commented:
If you have 100+ columns of data to normalize then you will need the union of 100+ queries. You can build the query code in a loop in a stored procedure or external application, and you can use column metadata from SQL Server rather than hard-coding all the column names.
You don't actually have to do a union -- you can execute each query and insert the results into a table. That would help avoid running into varchar length limits or SQL Server complexity limits.
Here's a rough sketch (see code sample). The SELECT/INSERT query would have the same form as those in the earlier example, but would include an INSERT query to send the results to a table, on the theory that you can execute 100+ INSERTS without any difficulty but creating a 100+ query UNION might run into trouble somewhere.
The "for each" would probably actually be a cursor run against a metadata query, such as the system metadata views INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.COLUMNS.
I suspect it would be quicker, though, to build the whole thing by hand using copy/paste and search/replace, unless there is some reason why new queries would need to be constructed repeatedly.

For each category column in table metadata
    begin
    set @q = <construct SELECT/INSERT query for just one column name>
    sp_executesql @q [parameterizing the query might simplify things]
    end

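A minimal T-SQL sketch of that loop, assuming the data sits in a SQL Server staging table named CategoryList with a numeric [Count] column, and that the per-category sums go into a hypothetical CategorySums table:

-- Sketch only: table and column names are assumptions; adjust to your schema.
-- Loops over every column of CategoryList except [Count] and inserts one
-- summed row per category into CategorySums.
DECLARE @col sysname, @q nvarchar(max);

DECLARE col_cursor CURSOR FOR
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'CategoryList'
      AND COLUMN_NAME <> 'Count';

OPEN col_cursor;
FETCH NEXT FROM col_cursor INTO @col;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @q = N'INSERT INTO CategorySums (Category, SumOfCounts) '
           + N'SELECT ''' + REPLACE(@col, N'''', N'''''') + N''', SUM([Count]) '
           + N'FROM CategoryList WHERE ' + QUOTENAME(@col) + N' > 0;';
    EXEC sp_executesql @q;
    FETCH NEXT FROM col_cursor INTO @col;
END;

CLOSE col_cursor;
DEALLOCATE col_cursor;

Because INFORMATION_SCHEMA.COLUMNS supplies the column names, nothing has to be hard-coded when the column list changes.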

 
nmcdermaid Commented:
rscowden,
Does this help with your Excel data type issues?
http://www.sqldts.com/254.aspx
It would work better in Access (or any other database), because you must define a specific data type for each column and you can't store any data that doesn't match that type. In Excel you can type anything you want into a cell and there is no check that it matches a data type. That's the price you pay for the flexibility Excel offers. It's not a database; it's a matrix where you can type anything into any random cell with zero validation.
You can certainly use an expression to convert, say, 57% (a string) to 0.57 (numeric), but again, what if someone types the letter 'A' into the cell? There's nothing stopping them. Excel is just not a reliable data source. Apologies for getting on my soapbox about this, but it is a constant frustration for me. You can't have the flexibility of a validation-free spreadsheet and then expect it to fit into a strongly typed (and consistent) ETL or database system.
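If the raw values do end up staged as text in SQL Server (for example via the tab-delimited flat file route mentioned earlier), a defensive conversion along these lines, sketched against a hypothetical StagingCategoryList table, keeps a stray 'A' from breaking the load:

-- Sketch only: RawPct holds the raw text, e.g. '57%' or 'A'.
-- ISNUMERIC is permissive (it also accepts '$', ',' and the like),
-- so tighten the check if the source is especially dirty.
SELECT RawPct,
       CASE
           WHEN ISNUMERIC(REPLACE(RawPct, '%', '')) = 1
               THEN CAST(REPLACE(RawPct, '%', '') AS float) / 100.0
           ELSE NULL   -- flag or reject rows that are not valid percentages
       END AS PctValue
FROM StagingCategoryList;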
 
Regarding the original question:
Can you attach the Unpivot error that you are getting?
 
Megan Brooks (SQL Server Consultant) Commented:
I just happened to think that a script transformation with an asynchronous output should also be able to perform this kind of unpivot, but that doesn't solve the problems with the Excel data source.

If you can ensure that the Excel data is clean enough to avoid issues with the source component, or copy the Excel data to another format before proceeding, I think that a script transformation can do the rest. You would still have to update the metadata if the number of columns changed, but the script should be able to iterate over the input columns without knowing how many there are or what they are all called, if it can manage to index the input buffer (I am not yet sure about that).
 
shyamaladevib (Author) Commented:
Though it's not a direct solution, these inputs gave me some insight.
Thanks
