Link to home
Start Free TrialLog in
Avatar of MrSlausen
MrSlausen

asked on

C# Collection vs. Indexed SQL Server Table

I currently have a bill of materials list that is approximately 1.5M records.  Each record contains a parent, child combination - along with the quantity of children materials that go into the parent material.  There is also a little bit of additional detail that is irrelevant to this test.  See the screenshot attached for an example.

I have a separate list of materials of approximately 1M records with a quantity sold.  I need to "explode" the quantity sold to determine the amount of quantity sold for each child.  

I am currently using a CTE and recursive query in T-SQL to perform the calculation.  The performance is acceptable, but not optimal.  Because this is somewhat of a iterative process with inherent looping, I figure I can get better performance out of a recursive query in C#.

My thought is that I would load the entire BOM into a collection, the entire list of sales by material into another object, and then iterate over the sales object and use a recursive function to create the exploded sales figures for each.

My problem is that I haven't been able to get my C# application to even perform as well as my T-SQL version.

I have two main C# objects:
- A Generic Dictionary<> : to store the BOM information
- A Generic List<> : to store the output

The Generic dictionary stores a custom type (Key: custom Parent, Child object ; Value: ChildQtyInParent)

The List<> stores the results in a Child, Val object called Values.

I haven't built that sales figure object yet, as even the initial base test with a few manually entered test values are much slower in C#.

1. What is the fastest C# collection, and what is fastest implementation.  
2. How can I beat the speed of a recursive query in T-SQL with a C# collection? My guess is that the collection is slower because it is scanning the entire collection each iteration, whereas the indexes on the database table are preventing this in the db environment.
3. Does my C# recursive function look acceptable?

I really appreciate your help! Program.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Test_BomExplosion
{

    class Program
    {
        static void Main(string[] args)
        {
            Actions a = new Actions();      
            
            Console.WriteLine(DateTime.Now.ToLongTimeString() + " : " + DateTime.Now.Millisecond.ToString());

            // Run a test
            a.GetChildren("testMaterial", 0, 100);

            Console.WriteLine(DateTime.Now.ToLongTimeString() + " : " + DateTime.Now.Millisecond.ToString());
            Console.WriteLine(a.lst.Count.ToString());

            Console.ReadLine();
        }
    }

    class Actions
    {
        public List<Values> lst;
        Dictionary<LiteBOM, double> dic;

        public Actions()
        {
            lst = new List<Values>();
            dic = new Dictionary<LiteBOM, double>();

            // Populate BOM Dictionary
            DatabaseDataContext db = new DatabaseDataContext();
            var bom = from b in db.BOMs
                      select b;

            foreach (BOM b in bom)
            {
                LiteBOM lb = new LiteBOM(b.Parent, b.Child);
                dic.Add(lb, (double)b.ChildQty);
            }
        }


        public void GetChildren(string parent, int lvl, double value)
        {
            foreach (var row in dic.Where(row => row.Key.Parent == parent))
            {
                lst.Add(new Values(row.Key.Parent, row.Value * value));
                GetChildren(row.Key.Child, lvl + 1, value);
            }

        } 

    }

    // Used to Store the Exploded Values
    class Values
    {
        public Values(string MAT, double VAL)
        {
            mat = MAT;
            val = VAL;
        }

        private string mat;
        public string Mat
        {
            get { return mat; }
            set { mat = value; }
        }

        private double val;
        public double Val
        {
            get { return val; }
            set { val = value; }
        }

    }


    class LiteBOM
    {
        public LiteBOM(string par, string chi)
        {
            parent = par;
            child = chi;
        }

        private string parent;
        public string Parent
        {
            get { return parent; }
            set { parent = value; }
        }

        private string child;
        public string Child
        {
            get { return child; }
            set { child = value; }
        }
    }
}

Open in new window

db.png
Avatar of Anthony Perkins
Anthony Perkins
Flag of United States of America image

>>My problem is that I haven't been able to get my C# application to even perform as well as my T-SQL version.<<
Quite frankly I am not surprised.  All you are doing is swapping a massive amount of memory to disk.  Collections were never intended as a substitute for a large resultset.

Perhaps the T-SQL query can be optimized.
Avatar of MrSlausen
MrSlausen

ASKER

But, wouldn't you expect that accessing in-memory information would be faster than reading it from disk?
>>But, wouldn't you expect that accessing in-memory information would be faster than reading it from disk? <<
That is just the point, I very much you are in fact using memory, at least not RAM.  All you are doing is swapping out to a page file to sequentially traverse the collection.
You're not going to get to where you want to be without doing the heavy lifting with SQL.  I run into this a lot, where I can get a slow query to run correctly (stage1), then I will consume data from c# app (stage2) then I will revisit the query to tune it up.  I have in the past sped up queries by using temporary tables to eliminate some of the requred processing...
I suggest you let's us take a crack at optimizing the SQL.
You say you are scanning the collection every time
some .net objects can be sorted and contain a 'binary search' method - maybe that could speed up lookups for a large collection
I used this in the ArrayList.

for example, with your code

for each (var row in dic.Where(row => row.Key.Parent == parent))

if that was a sorted array list sorted by row, then you could binary seek the first row, then loop just through the rows you want.

I think there is a fair chance you could reduce the performance time using code, but don't forget that SQL server can probably cache all of those millions of records into RAM itself.


also to consider, I think you are using one large collection there.

What you could do for a tree type structure, is have a recursive collection of collections (which in turn can contain collections - and so on) - so you recurse into just the items you want to sum, and do not
have to search.

Summing could be fast, but you have the overhead of building the structure of collections, but that might not be too bad.

________________________________________________

have you considered scalability?  What if this thing grows to 150 billion records?  
I actually like the idea of creating a nested tree type structure.  Let's explore the SQL optimization before taking a stab that, because creating such a structure is going to take some time.

Here's the SQL.  Any thoughts regarding optimization of the SQL or the indexes on the table?  It takes me approximately 7 minutes to run the entire thing right now. Not bad, but I think it can be faster.


-- THE FUNCTION --
CREATE FUNCTION [gsc].[udf_BOM]
(	 
	 @parent nvarchar(18)
	,@ParVal int
)
RETURNS @rtn TABLE 
(
	 [Material] nvarchar(18)
	,[DepVal]	float
)
AS
BEGIN


;with bomD (child, [VALUE])
as
(
	select child, cast((@ParVal * b.ChildQty) as bigint) as [VALUE]
	from gsc.BOM as b
	where b.parent = @parent
		   UNION ALL
	select b.child, cast((bETC.[VALUE] * b.ChildQty) as bigint) as [VALUE]
	from gsc.BOM as b 
		inner join bomD as bETC on b.parent = bETC.child
)
insert into @rtn (material, depval)
select child, sum([VALUE]) from bomD group by child;



RETURN 
END


-- THE CALLING PIECE --
insert into DepSales
select 
o.Material, a.[Month], a.[Year], a.DemandType, o.DepVal 
from Sales as a
cross apply [gsc].[udf_BOM](Material, Sales) as o

Open in new window

Have you got indexes on parent and child in your BOM table?
Why is parent nvarchar?  Are you actually using unicode?
Why is DepVal a float?  bigint would be far more appropriate.
Have you determined how long the function takes and how long does the actual INSERT take?
Why are you using a multi-statement UDF()?

Please post the schema for your BOM and Sales tables.
I apologize about the delay in responding. I was out of town.

Attached are the schemas for the two tables as well as the indexes.  Do you have any suggestions on optimization?
CREATE TABLE [gsc].[BOM](
	[Child] [nvarchar](18) NULL,
	[Parent] [nvarchar](18) NULL,
	[BomNbr] [nvarchar](10) NULL,
	[UoM] [nvarchar](2) NULL,
	[ChildQty] [float] NULL,
	[ItemCat] [nvarchar](5) NULL,
	[CreatedBy] [nvarchar](50) NULL
) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [idx_bom_child] ON [gsc].[BOM] 
([Child] ASC)

CREATE NONCLUSTERED INDEX [idx_bom_parent] ON [gsc].[BOM] 
([Parent] ASC)

Open in new window

CREATE TABLE [gsc].[Sales]
(
	[id] [int] IDENTITY(1,1) NOT NULL,
	[Material] [nvarchar](18) NOT NULL,
	[Month] [int] NOT NULL,
	[Year] [int] NOT NULL,
	[DemandType] [nvarchar](25) NOT NULL,
	[Sales] [float] NULL
 CONSTRAINT [PK_Sales] PRIMARY KEY CLUSTERED 
( [id] ASC) 
) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IX_DemandType] ON [gsc].[Sales] 
([DemandType] ASC)

CREATE NONCLUSTERED INDEX [IX_Material] ON [gsc].[Sales] 
([Material] ASC)

CREATE NONCLUSTERED INDEX [IX_Month] ON [gsc].[Sales] 
([Month] ASC)

CREATE NONCLUSTERED INDEX [IX_Year] ON [gsc].[Sales] 
([Year] ASC)

Open in new window

In addition to my last post:

1. I don't need the ANSI characters so I could move from NVARCHAR to VARCHAR.  Do you think that will really change the performance?

2. I have to use a float, because there are BOMs with non-integer quantities (.75 items go into the parent).

3. I haven't compared the insert time to the query time.

4. I'm using a UDF because using the table-valued UDF was faster than created a stored procedure with and internal insert function.  
ASKER CERTIFIED SOLUTION
Avatar of Anthony Perkins
Anthony Perkins
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Below is what I meant by an inline UDF.  However, it would seem that it would be pretty easy to move it directly to the Stored Procedure and that way do away with the UDF entirely.
-- THE FUNCTION --
CREATE FUNCTION gsc.udf_BOM(
		@parent varchar(18),
		@ParVal int)

RETURNS TABLE

AS

RETURN
    WITH    bomD(child, [VALUE])
              AS (SELECT    child,
                            @ParVal * b.ChildQty [VALUE]
                  FROM      gsc.BOM b
                  WHERE     b.parent = @parent
                  UNION ALL
                  SELECT    b.child,
                            bETC.[VALUE] * b.ChildQty [VALUE]
                  FROM      gsc.BOM b
                            INNER JOIN bomD bETC ON b.parent = bETC.child
                 )
    SELECT  child material,
            SUM([VALUE]) depval
    FROM    bomD
    GROUP BY child;

Open in new window

acperkins: I think your suggestions are going to take me where I desire to be.  It was this type of feedback on indexes and data types that I was looking for.  I'll definitely move away from the NVARCHAR and FLOAT.

The child rows are not unique, so I can't put a clustered index on that field.  If I were to use a clustered index, it would have to be on the child, parent combination.  

I won't be able to rework my objects until this weekend, but as soon as I do, I'll mark this as complete.
>>The child rows are not unique, so I can't put a clustered index on that field.<<
What makes you say that?  I am not saying that is a good idea or bad idea, just simply that it is a misconception to assume that a clustered index has to be unique.  It does not.
My understanding of the clustered index must just be incorrect.  I (obviously incorrectly) thought a clustered index had to be unique.  As a result, I think it probably makes perfect sense to put a clustered index on the child field.
In the end, I was able to take the query down from about 6 minutes to 2 minutes.  Here are the things I did:

1. converted everything to varchar vs nvarchar (I was using nvarchar without any need for unicode)
2. converted floats to appropriate decimal formats
3. created a clustered index on child in the bom table
4. changed the function from multi-statement UDF to an inline UDF

I still may try to do the same thing with a Stored Procedure, but for now the performance is now acceptable.

As a result, I'm going to close out the question.  I really appreciate the feedback!  I was using float and nvarchar far too frequently, I didn't fully understand the difference between a constraint and a clustered index, and I didn't realize I could use an Inline UDF.  As a result of this exercise, I have a much better understanding of these concepts, which is great.

Thanks everyone!
 
This was excellent!