Solved

SQL Server - create aggregated data from a table

Posted on 2015-01-09
12
284 Views
Last Modified: 2015-01-24
Hi,

I have a sales history table which has about 1 million records but could grow to 3 million or so.

CREATE TABLE salesData(
      [id] [int] IDENTITY(1,1) NOT NULL,
      [originId] [nvarchar](20) NOT NULL,
      [orderId] [nvarchar](250) NOT NULL,
      [orderDate] [datetime] NOT NULL,
      [statusRef] [int] NOT NULL,
      [orderSequence] [int] NOT NULL,
      [productNumber] [int] NOT NULL,
      [customerNumber] [int] NOT NULL,
      [finalQty] [int] NOT NULL,
      [finalValue] [decimal](18, 2) NOT NULL,
      [invoiceNumber] [nvarchar](20) NOT NULL,
      [modifiedDate] [datetime] NOT NULL,
      [deleted] [bit] NOT NULL,
 CONSTRAINT [PK_salesData] PRIMARY KEY CLUSTERED
(
      [id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

The data in this table needs to be available to mobile devices.  Basically a sales agent will select a customer in the mobile app, then a list of products will be displayed.  Alongside this product list, for each product, I need to display how many that customer bought in the last 12 months, month by month.  Something like this:

Sales     :    Jan  Feb  Mar  Apr etc (for rest of months)
                   20    11    0       4
Product:    Dyson Hoover blah blah

1.  The salesData is updated daily from another system.  
2.  The mobile apps access a stored procedure (not written yet) that needs to return the data above (customerNumber, productNumber, then the months and values/counts).  It returns only 12 months (from the current month back to the same month of the previous year).
3.  Lots of mobiles will be accessing stored procedure
4.  The returning of the data from the stored procedure needs to be fast
5.  The data is big as I mentioned

So I thought this:

I would create a dedicated aggregated data table, aggregatedSalesData.  I am not sure how to structure the part of this table that holds the months and values: whether to keep them as flat columns (m1, m1Value, m2, m2Value, etc.) or as columns like monthNum, monthValue so that each month becomes a row.
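For comparison, a minimal sketch of the row-per-month alternative might look like this (table and column names are illustrative, not taken from any existing schema):

```sql
-- Hypothetical row-per-month layout for the aggregated table.
-- One row per (customer, product, month) instead of flat m1..m12 columns.
CREATE TABLE dbo.aggregatedSalesData(
      [customerNumber] [int] NOT NULL,
      [productNumber] [int] NOT NULL,
      [salesMonth] [date] NOT NULL,          -- first day of the month, e.g. 2015-01-01
      [totalQty] [int] NOT NULL,
      [totalValue] [decimal](18, 2) NOT NULL,
 CONSTRAINT [PK_aggregatedSalesData] PRIMARY KEY CLUSTERED
      ([customerNumber] ASC, [productNumber] ASC, [salesMonth] ASC)
) ON [PRIMARY]
```

The trade-off: rows per month are easier to maintain daily (delete the oldest month, insert the newest) and need no schema change if the window ever grows, but the app or stored procedure must pivot them; flat m1..m12 columns return one row per product with no pivot at read time.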

I saw this
http://sqlhints.com/2014/03/09/how-to-get-monthly-data-in-sql-server/

Is this the best way for my situation?

I need advice as to whether I am going the right way with this and, if I am, help with the query to move data every day from salesData to aggregatedSalesData.

Many thanks
Question by:gbzhhu
12 Comments
 
LVL 32

Expert Comment

by:ste5an
ID: 40540468
Creating a snapshot only makes sense when we're talking about a data warehouse, because the snapshot creation process (ETL) is better handled there.

So the first question is: is your table salesData a transactional table?

You should read Kimball: Dimensional Modeling Techniques.

And especially Transaction Fact Tables and Periodic Snapshot Fact Tables.
 
LVL 12

Author Comment

by:gbzhhu
ID: 40540523
Hi ste5an,

I don't know what is meant by a transactional table.  I simply created the table with the SQL Server defaults and populate it from an import_salesData table which gets truncated every day.  Note that performance of the copy from salesData to aggregatedSalesData is not an issue.  That task happens at 4 a.m. and will not affect the mobile devices.  It is the reading of the data for the mobile devices (from the aggregatedSalesData table, unless I am corrected) that needs to be fast.  That is why I am thinking of pre-creating the aggregated data.

I will do some reading on the article you suggested

Thank you
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40540773
The key factor for speed is to cluster the data the way you need to use it.  In particular, get rid of the false notion that id "should" be the "default" clustering key -- that's just an AWFUL myth, the worst one in dbs.

Based on the limited info so far, it sounds as if you need to cluster the aggregatedSalesData on:
( customerNumber, productNumber, orderMonth )


Btw, yes, you do need to go back and better cluster the salesData table itself too.  It might need to be clustered by orderDate first, or by customerNumber first -- I can't tell without more details -- but it should NOT be clustered on id, that much I know already.
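A minimal sketch of that clustering for the aggregated table (assuming a row-per-month layout with an orderMonth column; the names are illustrative):

```sql
-- Cluster the aggregated table the way it will be read:
-- all months for one (customer, product) sit contiguously on disk.
CREATE UNIQUE CLUSTERED INDEX aggregatedSalesData__CL
    ON dbo.aggregatedSalesData ( customerNumber, productNumber, orderMonth );
```

With this key, the mobile read for one customer becomes a single clustered range seek.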
 
LVL 12

Author Comment

by:gbzhhu
ID: 40540812
Hi Scott,

First, I will provide any info that enables you to help me better, so please just ask.  I just do not know what info is useful to you.

I get the clustering of the aggregatedSalesData.  The only data that I really want on that table (if I end up using that table) is  customerNumber, productNumber, monthValues (aggregated)

For salesData, how can I be sure that orderDate is unique (orders can be placed in the same second)?  Also, customerNumber is repeated, so it is not unique either.  Perhaps I could cluster on orderDate and invoiceNumber combined?
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40540834
The clustering key you specify doesn't have to be unique.  Although, if it isn't, it can't be a constraint, so you'd need to use a:
CREATE CLUSTERED INDEX salesData__CL ON dbo.salesData ( orderDate )
statement instead of a CONSTRAINT to create the index.

If you want to make it unique yourself, just add id to the end of the index.

CREATE UNIQUE CLUSTERED INDEX ...  ( orderDate, id )

Typically order data is mostly viewed by date, with the most recent data read the most often by far.

If you frequently / most always use that data by customer# also, it may be better to cluster by that.

Running the query below will show you what SQL considers to be "missing indexes" and usage stats for existing indexes.  That will help a lot in seeing what's really going on.  Post those results and we can review them as well.



USE [<your_db_name_here>]

SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

DECLARE @list_missing_indexes bit
DECLARE @table_name_pattern sysname

--NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.
SET @list_missing_indexes = 1
SET @table_name_pattern = 'salesData'


PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

IF @list_missing_indexes = 1
BEGIN
    SELECT
        GETDATE() AS capture_date,
        DB_NAME(mid.database_id) AS Db_Name,
        dps.row_count,
        OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,
        mid.equality_columns, mid.inequality_columns, mid.included_columns,
        user_seeks, user_scans, ca1.max_days_active, unique_compiles,
        last_user_seek, last_user_scan, avg_total_user_cost, avg_user_impact,
        system_seeks, system_scans, last_system_seek, last_system_scan, avg_total_system_cost, avg_system_impact,
        mid.statement, mid.object_id, mid.index_handle
    FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)
    CROSS APPLY (
        SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'
    ) AS ca1
    LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON
        mig.index_handle = mid.index_handle
    LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON
        migs.group_handle = mig.index_group_handle
    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
        dps.object_id = mid.object_id AND
        dps.index_id IN (0, 1)
    --order by
        --DB_NAME, Table_Name, equality_columns
    WHERE
        1 = 1
        AND mid.database_id = DB_ID() --only current db
        AND OBJECT_NAME(mid.object_id) LIKE @table_name_pattern
        --AND mid.object_id IN (OBJECT_ID('<table_name_1>'), OBJECT_ID('<table_name_2>'))
    ORDER BY
        --avg_total_user_cost * (user_seeks + user_scans) DESC,
        Db_Name, Table_Name, equality_columns, inequality_columns
END --IF

PRINT 'Midpoint @ ' + CONVERT(varchar(30), GETDATE(), 120)
-- list index usage stats (seeks, scans, etc.)
SELECT
    ius2.row_num, DB_NAME() AS db_name,
    CASE WHEN i.name LIKE ca2.table_name + '%'
         THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1, 200)
         ELSE i.name END AS index_name,
    CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +
    CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],
    ca2.table_name,
    i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,
    dps.row_count,
    SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,
    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,
    ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,
    fk.Reference_Count AS fk_ref_count,
    DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date
        ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,
    FILEGROUP_NAME(i.data_space_id) AS filegroup_name,
    ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,
    ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update
FROM sys.indexes i WITH (NOLOCK)
INNER JOIN sys.objects o WITH (NOLOCK) ON
    o.object_id = i.object_id
CROSS JOIN (
    SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'
) AS ca1
CROSS APPLY (
    SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name
) AS ca2
OUTER APPLY (
    SELECT
        ', ' + COL_NAME(object_id, ic.column_id)
    FROM sys.index_columns ic
    WHERE
        ic.key_ordinal > 0 AND
        ic.object_id = i.object_id AND
        ic.index_id = i.index_id
    ORDER BY
        ic.key_ordinal
    FOR XML PATH('')
) AS key_cols (key_cols)
OUTER APPLY (
    SELECT
        ', ' + COL_NAME(object_id, ic.column_id)
    FROM sys.index_columns ic
    WHERE
        ic.key_ordinal = 0 AND
        ic.object_id = i.object_id AND
        ic.index_id = i.index_id
    ORDER BY
        COL_NAME(object_id, ic.column_id)
    FOR XML PATH('')
) AS nonkey_cols (nonkey_cols)
LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
    dps.object_id = i.object_id AND
    dps.index_id = i.index_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON
    ius.database_id = DB_ID() AND
    ius.object_id = i.object_id AND
    ius.index_id = i.index_id
LEFT OUTER JOIN (
    SELECT
        database_id, object_id, MAX(user_scans) AS user_scans,
        ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans
    FROM sys.dm_db_index_usage_stats WITH (NOLOCK)
    WHERE
        database_id = DB_ID()
        --AND index_id > 0
    GROUP BY
        database_id, object_id
) AS ius2 ON
    ius2.database_id = DB_ID() AND
    ius2.object_id = i.object_id
LEFT OUTER JOIN (
    SELECT
        referenced_object_id, COUNT(*) AS Reference_Count
    FROM sys.foreign_keys
    WHERE
        is_disabled = 0
    GROUP BY
        referenced_object_id
) AS fk ON
    fk.referenced_object_id = i.object_id
WHERE
    i.object_id > 100 AND
    i.is_hypothetical = 0 AND
    i.type IN (0, 1, 2) AND
    o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND
    (
     o.name LIKE @table_name_pattern AND
     o.name NOT LIKE 'dtprop%' AND
     o.name NOT LIKE 'filestream[_]' AND
     o.name NOT LIKE 'MSpeer%' AND
     o.name NOT LIKE 'MSpub%' AND
     --o.name NOT LIKE 'queue[_]%' AND
     o.name NOT LIKE 'sys%'
    )
    --AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')
ORDER BY
    --row_count DESC,
    --ius.user_scans DESC,
    --ius2.row_num, --user_scans&|user_seeks
    db_name, table_name,
    -- list clustered index first, if any, then other index(es)
    CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,
    key_cols

PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

SET DEADLOCK_PRIORITY NORMAL
 
LVL 12

Author Comment

by:gbzhhu
ID: 40544083
Many thanks Scott.

Working through this now and will feedback
 
LVL 12

Author Comment

by:gbzhhu
ID: 40544104
I got 2 outputs as follows.  What does that tell you?

2015-01-12 11:30:45.107      pah_promotionPortal_DE_Stage      479848      salesData      [customerNumber], [efssCustomerNumber]      NULL      [orderDate], [productNumber]      1      0      101      1      2015-01-09 17:50:36.473      NULL      9.29434155590326      97.15      0      0      NULL      NULL      0      0      [pah_promotionPortal_DE_Stage].[dbo].[salesData]      2016726237      16159
2015-01-12 11:30:45.107      pah_promotionPortal_DE_Stage      479848      salesData      [efssCustomerNumber]      NULL      [orderDate], [productNumber], [customerNumber]      1      0      101      1      2015-01-09 17:50:27.120      NULL      6.70718491954374      96.07      0      0      NULL      NULL      0      0      [pah_promotionPortal_DE_Stage].[dbo].[salesData]      2016726237      16157


21      pah_promotionPortal_DE_Stage      PK_salesData      Y.Y      salesData      1      479848      id      NULL      0      22      0      2      NULL      2015-01-09 17:50:36.473      NULL      2015-01-07 17:18:21.140      NULL      5      PRIMARY      0      4      0      0      NULL      2015-01-09 17:35:05.453      NULL      NULL
 
LVL 12

Author Comment

by:gbzhhu
ID: 40544391
I have now created the aggregated table.

CREATE TABLE [dbo].[salesDataAggregated](
      [customerRef] [int] NOT NULL,
      [productRef] [int] NOT NULL,
      [m1_totalCount] [int] NULL,
      [m1_totalValue] [decimal](18, 2) NULL,
      [m2_totalCount] [int] NULL,
      [m2_totalValue] [decimal](18, 2) NULL,
      [m3_totalCount] [int] NULL,
      [m3_totalValue] [decimal](18, 2) NULL,
      [m4_totalCount] [int] NULL,
      [m4_totalValue] [decimal](18, 2) NULL,
      [m5_totalCount] [int] NULL,
      [m5_totalValue] [decimal](18, 2) NULL,
      [m6_totalCount] [int] NULL,
      [m6_totalValue] [decimal](18, 2) NULL,
      [m7_totalCount] [int] NULL,
      [m7_totalValue] [decimal](18, 2) NULL,
      [m8_totalCount] [int] NULL,
      [m8_totalValue] [decimal](18, 2) NULL,
      [m9_totalCount] [int] NULL,
      [m9_totalValue] [decimal](18, 2) NULL,
      [m10_totalCount] [int] NULL,
      [m10_totalValue] [decimal](18, 2) NULL,
      [m11_totalCount] [int] NULL,
      [m11_totalValue] [decimal](18, 2) NULL,
      [m12_totalCount] [int] NULL,
      [m12_totalValue] [decimal](18, 2) NULL,
      [modifiedDate] [datetime] NOT NULL,
 CONSTRAINT [PK_salesDataToday_1] PRIMARY KEY CLUSTERED
(
      [customerRef] ASC,
      [productRef] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[salesDataAggregated]  WITH CHECK ADD  CONSTRAINT [FK_salesDataToday_customer] FOREIGN KEY([customerRef])
REFERENCES [dbo].[customer] ([id])
GO

ALTER TABLE [dbo].[salesDataAggregated] CHECK CONSTRAINT [FK_salesDataToday_customer]
GO

ALTER TABLE [dbo].[salesDataAggregated]  WITH CHECK ADD  CONSTRAINT [FK_salesDataToday_product] FOREIGN KEY([productRef])
REFERENCES [dbo].[product] ([id])
GO

ALTER TABLE [dbo].[salesDataAggregated] CHECK CONSTRAINT [FK_salesDataToday_product]
GO

ALTER TABLE [dbo].[salesDataAggregated] ADD  CONSTRAINT [DF_salesDataToday_modifiedDate]  DEFAULT (getdate()) FOR [modifiedDate]
GO
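For the read path, a minimal stored procedure sketch against this table could look like the following (the procedure name is hypothetical; the mobile app would call it once per selected customer):

```sql
-- Return one row per product for a customer, with all 12 monthly
-- counts/values.  The clustered PK (customerRef, productRef) makes
-- this a single range seek, which should satisfy the speed requirement.
CREATE PROCEDURE dbo.usp_GetCustomerProductSales
    @customerRef int
AS
BEGIN
    SET NOCOUNT ON;

    SELECT customerRef, productRef,
           m1_totalCount,  m1_totalValue,  m2_totalCount,  m2_totalValue,
           m3_totalCount,  m3_totalValue,  m4_totalCount,  m4_totalValue,
           m5_totalCount,  m5_totalValue,  m6_totalCount,  m6_totalValue,
           m7_totalCount,  m7_totalValue,  m8_totalCount,  m8_totalValue,
           m9_totalCount,  m9_totalValue,  m10_totalCount, m10_totalValue,
           m11_totalCount, m11_totalValue, m12_totalCount, m12_totalValue
    FROM dbo.salesDataAggregated
    WHERE customerRef = @customerRef;
END
```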
 
LVL 12

Author Comment

by:gbzhhu
ID: 40544401
I have this query that will display the data to be copied to it daily.  Because the month data is created by the pivot (query copied from some website and adapted), I am struggling with the syntax of inserting into the defined table, even though the column count and data types match.

SELECT * FROM
(SELECT c.id [Customer],
         p.id [Product],
         DATENAME(MONTH, orderDate) [Month],
       COUNT(1) [SalesCount]
FROM salesData sd
      INNER JOIN customer c ON c.efssNumber = sd.efssCustomerNumber AND c.number = sd.customerNumber
      INNER JOIN product p ON p.productNumber = sd.productNumber
WHERE orderDate BETWEEN DATEADD(m,-12,GETDATE()) AND GETDATE() and c.id = 5428
GROUP BY c.id, p.id, DATENAME(MONTH, orderDate)) AS MonthlySalesData
PIVOT( SUM([SalesCount])  
    FOR Month IN ([January],[February],[March],[April],[May],
    [June],[July],[August],[September],[October],[November],
    [December])) AS MNamePivot
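One hedged sketch of wiring that pivot into an INSERT: list the target columns explicitly and select the pivot's month-name output columns in the matching order.  This assumes m1 maps to January through m12 = December; if m1 is instead meant to be "current month", the month-name column order would need remapping.

```sql
-- Daily load of monthly counts into the flat aggregated table.
-- The m*_totalValue columns are left NULL here; filling them would need
-- a second pivot over SUM(finalValue) (or a subsequent UPDATE).
INSERT INTO dbo.salesDataAggregated
    (customerRef, productRef,
     m1_totalCount, m2_totalCount, m3_totalCount, m4_totalCount,
     m5_totalCount, m6_totalCount, m7_totalCount, m8_totalCount,
     m9_totalCount, m10_totalCount, m11_totalCount, m12_totalCount,
     modifiedDate)
SELECT [Customer], [Product],
       [January], [February], [March], [April],
       [May], [June], [July], [August],
       [September], [October], [November], [December],
       GETDATE()
FROM
(SELECT c.id AS [Customer],
        p.id AS [Product],
        DATENAME(MONTH, sd.orderDate) AS [Month],
        COUNT(1) AS [SalesCount]
 FROM salesData sd
 INNER JOIN customer c ON c.efssNumber = sd.efssCustomerNumber
                      AND c.number = sd.customerNumber
 INNER JOIN product p ON p.productNumber = sd.productNumber
 WHERE sd.orderDate BETWEEN DATEADD(m, -12, GETDATE()) AND GETDATE()
 GROUP BY c.id, p.id, DATENAME(MONTH, sd.orderDate)) AS MonthlySalesData
PIVOT (SUM([SalesCount])
       FOR [Month] IN ([January],[February],[March],[April],[May],[June],
                       [July],[August],[September],[October],[November],
                       [December])) AS MNamePivot;
```

Months with no sales come through the pivot as NULL; wrap each month column in ISNULL(..., 0) if the app expects zeros.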
 
LVL 69

Accepted Solution

by:
ScottPletcher earned 500 total points
ID: 40544744
>> I got 2 outputs as follows.  What does that tell you? <<

That the table in the "_Stage" db has been very rarely accessed with a WHERE clause.  But at least there are a couple of seeks for:
( [efssCustomerNumber], [customerNumber] )
but none for the id column.
It's still very preliminary because of the extremely low activity, but as of now a better clustering key would be:
( [efssCustomerNumber] ) OR
( [efssCustomerNumber], [customerNumber] )
depending on other usage details and key cardinality.


Btw, when you look at index stats, you want to use the most active version of the table -- typically production -- to get the most accurate results.  The index stats query does not generate a lot of overhead on the server, so don't worry about running it on prod.
 
LVL 12

Author Comment

by:gbzhhu
ID: 40544905
Thanks Scott.

This is new work and the tables are new, with data pulled from the old system.  The minimal activity you noticed is just me.  The tables do not even exist in the live system yet.

I'll use the suggested cluster.  Thank you.
