Avatar of gbzhhu
gbzhhu (United Kingdom)

asked on

SQL Server - create aggregated data from a table

Hi,

I have a sales history table which has about 1 million records but could grow to 3 million or so.

CREATE TABLE salesData(
      [id] [int] IDENTITY(1,1) NOT NULL,
      [originId] [nvarchar](20) NOT NULL,
      [orderId] [nvarchar](250) NOT NULL,
      [orderDate] [datetime] NOT NULL,
      [statusRef] [int] NOT NULL,
      [orderSequence] [int] NOT NULL,
      [productNumber] [int] NOT NULL,
      [customerNumber] [int] NOT NULL,
      [finalQty] [int] NOT NULL,
      [finalValue] [decimal](18, 2) NOT NULL,
      [invoiceNumber] [nvarchar](20) NOT NULL,
      [modifiedDate] [datetime] NOT NULL,
      [deleted] [bit] NOT NULL,
 CONSTRAINT [PK_salesData] PRIMARY KEY CLUSTERED
(
      [id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

The data in this table needs to be available to mobile devices.  Basically a sales agent will select a customer in the mobile app, then a list of products will be displayed.  Along this product list, for each product, I need to display how many units that customer bought in the last 12 months, month by month.  Something like this:

Sales:      Jan  Feb  Mar  Apr  ... (rest of months)
            20   11   0    4
Product:    Dyson Hoover blah blah

1.  The salesData table is updated daily from another system.
2.  The mobile apps call a stored procedure (not yet written) that needs to return the data above (customerNumber, productNumber, then the months and values/counts).  It returns only 12 months (from the current month back to the same month of the previous year).
3.  Lots of mobiles will be calling the stored procedure.
4.  The stored procedure needs to return the data fast.
5.  The data is big, as I mentioned.

So I thought this:

I would create a dedicated aggregated data table, aggregatedSalesData.  I am not sure how to structure the part of this table that holds the months and values: whether to keep them as flat columns (m1, m1Value, m2, m2Value, etc.) or as columns like monthNum, monthValue so that each month becomes a row.
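For comparison, the row-per-month alternative could look like the sketch below; every name in it is a placeholder, not something settled in this thread.  Each (customer, product, month) combination becomes one row, which makes the 12-month window a simple range predicate instead of a fixed set of columns.

```sql
-- Sketch of the row-per-month design (all names hypothetical).
CREATE TABLE aggregatedSalesData (
    customerNumber int            NOT NULL,
    productNumber  int            NOT NULL,
    salesMonth     date           NOT NULL,  -- first day of the month
    totalQty       int            NOT NULL,
    totalValue     decimal(18, 2) NOT NULL,
    CONSTRAINT PK_aggregatedSalesData PRIMARY KEY CLUSTERED
        (customerNumber, productNumber, salesMonth)
);
```

The trade-off: the flat m1..m12 layout returns one row per product with no pivoting at read time, while the row-per-month layout is easier to load and easier to extend beyond 12 months.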

I saw this
http://sqlhints.com/2014/03/09/how-to-get-monthly-data-in-sql-server/

Is this the best way for my situation?

I need advice on whether I am going the right way with this and, if I am, help with the query to move data every day from salesData to aggregatedSalesData.

Many thanks
Avatar of ste5an
ste5an (Germany)

Creating a snapshot only makes sense when we're talking about a data warehouse, because the snapshot creation process (ETL) is better handled there.

So the first question is: is your table salesData a transactional table?

You should read Kimball: Dimensional Modeling Techniques.

And especially Transaction Fact Tables and Periodic Snapshot Fact Tables.
Avatar of gbzhhu

ASKER

Hi ste5an,

I don't know what is meant by a transactional table.  I simply created the table with the SQL Server defaults and populate it from an import_salesData table, which gets truncated every day.  Note that performance for copying data from salesData to aggregatedSalesData is not an issue; that task happens at 4 a.m. and will not affect the mobile devices.  It is reading the data for the mobile devices (from the aggregatedSalesData table, unless I am corrected) that needs to be fast.  That is why I am thinking of pre-creating the aggregated data.

I will do some reading on the article you suggested

Thank you
The key factor for speed is to cluster the data the way you need to use it.  In particular, get rid of the false notion that id "should" be the "default" clustering key -- that's just an AWFUL myth, the worst one in dbs.

Based on the limited info so far, it sounds as if you need to cluster the aggregatedSalesData on:
( customerNumber, productNumber, orderMonth )


Btw, yes, you do need to go back and better cluster the salesData table itself too.  It might need to be clustered by orderDate first, or by customerNumber first (I can't tell without more details), but it should NOT be clustered on id, that much I know already.
Avatar of gbzhhu

ASKER

Hi Scott,

First, I would provide any info to enable you to help me better, so please just ask.  I just do not know what info is useful for you to help me.

I get the clustering of the aggregatedSalesData.  The only data that I really want on that table (if I end up using it) is customerNumber, productNumber, and the aggregated month values.

For salesData, how can I be sure that orderDate is unique (orders can be sent in the same second)?  Also customerNumber is repeated, so not unique.  Perhaps I could cluster on orderDate and invoiceNumber combined?
The clustering key you specify doesn't have to be unique.  Although, if it isn't, it can't be a constraint, so you'd need to use a:
CREATE CLUSTERED INDEX salesData__CL ON dbo.salesData ( orderDate )
statement instead of a CONSTRAINT to create the index.

If you want to make it unique yourself, just add id to the end of the index.

CREATE UNIQUE CLUSTERED INDEX ...  ( orderDate, id )

Typically order data is mostly viewed by date, with the most recent data read the most often by far.

If you frequently / almost always access that data by customer# as well, it may be better to cluster by that first.
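Putting those pieces together: since salesData was created with a clustered primary key on id, the existing constraint has to be dropped (or rebuilt as nonclustered) before the new clustered index can be created.  A hedged sketch of the full sequence, assuming no foreign keys reference PK_salesData:

```sql
-- Sketch only: assumes no foreign keys reference PK_salesData.
-- 1. Drop the existing clustered PK on id.
ALTER TABLE dbo.salesData DROP CONSTRAINT PK_salesData;

-- 2. Cluster on (orderDate, id); appending id makes the key unique
--    even when two orders share the same datetime.
CREATE UNIQUE CLUSTERED INDEX salesData__CL
    ON dbo.salesData (orderDate, id);

-- 3. Keep id enforced as the primary key, but nonclustered.
ALTER TABLE dbo.salesData
    ADD CONSTRAINT PK_salesData PRIMARY KEY NONCLUSTERED (id);
```

On a multi-million-row table the rebuild will take some time and lock the table, so it belongs in a maintenance window.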

Running the query below will show you what SQL considers to be "missing indexes" and usage stats for existing indexes.  That will help a lot in seeing what's really going on.  Post those results and we can review them as well.



USE [<your_db_name_here>]

SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

DECLARE @list_missing_indexes bit
DECLARE @table_name_pattern sysname

--NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.
SET @list_missing_indexes = 1
SET @table_name_pattern = 'salesData'


PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

IF @list_missing_indexes = 1
BEGIN
    SELECT
        GETDATE() AS capture_date,
        DB_NAME(mid.database_id) AS Db_Name,
        dps.row_count,
        OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,
        mid.equality_columns, mid.inequality_columns, mid.included_columns,
        user_seeks, user_scans, ca1.max_days_active, unique_compiles,
        last_user_seek, last_user_scan, avg_total_user_cost, avg_user_impact,
        system_seeks, system_scans, last_system_seek, last_system_scan, avg_total_system_cost, avg_system_impact,
        mid.statement, mid.object_id, mid.index_handle
    FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)
    CROSS APPLY (
        SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'
    ) AS ca1
    LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON
        mig.index_handle = mid.index_handle
    LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON
        migs.group_handle = mig.index_group_handle
    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
        dps.object_id = mid.object_id AND
        dps.index_id IN (0, 1)
    --order by
        --DB_NAME, Table_Name, equality_columns
    WHERE
        1 = 1
        AND mid.database_id = DB_ID() --only current db
        AND OBJECT_NAME(mid.object_id) LIKE @table_name_pattern
        --AND mid.object_id IN (OBJECT_ID('<table_name_1>'), OBJECT_ID('<table_name_2>'))
    ORDER BY
        --avg_total_user_cost * (user_seeks + user_scans) DESC,
        Db_Name, Table_Name, equality_columns, inequality_columns
END --IF

PRINT 'Midpoint @ ' + CONVERT(varchar(30), GETDATE(), 120)
-- list index usage stats (seeks, scans, etc.)
SELECT
    ius2.row_num, DB_NAME() AS db_name,
    CASE WHEN i.name LIKE ca2.table_name + '%'
         THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1, 200)
         ELSE i.name END AS index_name,
    CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +
    CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],
    ca2.table_name,
    i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,
    dps.row_count,
    SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,
    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,
    ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,
    fk.Reference_Count AS fk_ref_count,
    DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date
        ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,
    FILEGROUP_NAME(i.data_space_id) AS filegroup_name,
    ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,
    ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update
FROM sys.indexes i WITH (NOLOCK)
INNER JOIN sys.objects o WITH (NOLOCK) ON
    o.object_id = i.object_id
CROSS JOIN (
    SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'
) AS ca1
CROSS APPLY (
    SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name
) AS ca2
OUTER APPLY (
    SELECT
        ', ' + COL_NAME(object_id, ic.column_id)
    FROM sys.index_columns ic
    WHERE
        ic.key_ordinal > 0 AND
        ic.object_id = i.object_id AND
        ic.index_id = i.index_id
    ORDER BY
        ic.key_ordinal
    FOR XML PATH('')
) AS key_cols (key_cols)
OUTER APPLY (
    SELECT
        ', ' + COL_NAME(object_id, ic.column_id)
    FROM sys.index_columns ic
    WHERE
        ic.key_ordinal = 0 AND
        ic.object_id = i.object_id AND
        ic.index_id = i.index_id
    ORDER BY
        COL_NAME(object_id, ic.column_id)
    FOR XML PATH('')
) AS nonkey_cols (nonkey_cols)
LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
    dps.object_id = i.object_id AND
    dps.index_id = i.index_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON
    ius.database_id = DB_ID() AND
    ius.object_id = i.object_id AND
    ius.index_id = i.index_id
LEFT OUTER JOIN (
    SELECT
        database_id, object_id, MAX(user_scans) AS user_scans,
        ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans
    FROM sys.dm_db_index_usage_stats WITH (NOLOCK)
    WHERE
        database_id = DB_ID()
        --AND index_id > 0
    GROUP BY
        database_id, object_id
) AS ius2 ON
    ius2.database_id = DB_ID() AND
    ius2.object_id = i.object_id
LEFT OUTER JOIN (
    SELECT
        referenced_object_id, COUNT(*) AS Reference_Count
    FROM sys.foreign_keys
    WHERE
        is_disabled = 0
    GROUP BY
        referenced_object_id
) AS fk ON
    fk.referenced_object_id = i.object_id
WHERE
    i.object_id > 100 AND
    i.is_hypothetical = 0 AND
    i.type IN (0, 1, 2) AND
    o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND
    (
     o.name LIKE @table_name_pattern AND
     o.name NOT LIKE 'dtprop%' AND
     o.name NOT LIKE 'filestream[_]' AND
     o.name NOT LIKE 'MSpeer%' AND
     o.name NOT LIKE 'MSpub%' AND
     --o.name NOT LIKE 'queue[_]%' AND
     o.name NOT LIKE 'sys%'
    )
    --AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')
ORDER BY
    --row_count DESC,
    --ius.user_scans DESC,
    --ius2.row_num, --user_scans&|user_seeks
    db_name, table_name,
    -- list clustered index first, if any, then other index(es)
    CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,
    key_cols

PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

SET DEADLOCK_PRIORITY NORMAL
Avatar of gbzhhu

ASKER

Many thanks Scott.

Working through this now and will feedback
Avatar of gbzhhu

ASKER

I got 2 outputs as follows.  What does that tell you?

2015-01-12 11:30:45.107      pah_promotionPortal_DE_Stage      479848      salesData      [customerNumber], [efssCustomerNumber]      NULL      [orderDate], [productNumber]      1      0      101      1      2015-01-09 17:50:36.473      NULL      9.29434155590326      97.15      0      0      NULL      NULL      0      0      [pah_promotionPortal_DE_Stage].[dbo].[salesData]      2016726237      16159
2015-01-12 11:30:45.107      pah_promotionPortal_DE_Stage      479848      salesData      [efssCustomerNumber]      NULL      [orderDate], [productNumber], [customerNumber]      1      0      101      1      2015-01-09 17:50:27.120      NULL      6.70718491954374      96.07      0      0      NULL      NULL      0      0      [pah_promotionPortal_DE_Stage].[dbo].[salesData]      2016726237      16157


21      pah_promotionPortal_DE_Stage      PK_salesData      Y.Y      salesData      1      479848      id      NULL      0      22      0      2      NULL      2015-01-09 17:50:36.473      NULL      2015-01-07 17:18:21.140      NULL      5      PRIMARY      0      4      0      0      NULL      2015-01-09 17:35:05.453      NULL      NULL
Avatar of gbzhhu

ASKER

I have now created the aggregated table.

CREATE TABLE [dbo].[salesDataAggregated](
      [customerRef] [int] NOT NULL,
      [productRef] [int] NOT NULL,
      [m1_totalCount] [int] NULL,
      [m1_totalValue] [decimal](18, 2) NULL,
      [m2_totalCount] [int] NULL,
      [m2_totalValue] [decimal](18, 2) NULL,
      [m3_totalCount] [int] NULL,
      [m3_totalValue] [decimal](18, 2) NULL,
      [m4_totalCount] [int] NULL,
      [m4_totalValue] [decimal](18, 2) NULL,
      [m5_totalCount] [int] NULL,
      [m5_totalValue] [decimal](18, 2) NULL,
      [m6_totalCount] [int] NULL,
      [m6_totalValue] [decimal](18, 2) NULL,
      [m7_totalCount] [int] NULL,
      [m7_totalValue] [decimal](18, 2) NULL,
      [m8_totalCount] [int] NULL,
      [m8_totalValue] [decimal](18, 2) NULL,
      [m9_totalCount] [int] NULL,
      [m9_totalValue] [decimal](18, 2) NULL,
      [m10_totalCount] [int] NULL,
      [m10_totalValue] [decimal](18, 2) NULL,
      [m11_totalCount] [int] NULL,
      [m11_totalValue] [decimal](18, 2) NULL,
      [m12_totalCount] [int] NULL,
      [m12_totalValue] [decimal](18, 2) NULL,
      [modifiedDate] [datetime] NOT NULL,
 CONSTRAINT [PK_salesDataToday_1] PRIMARY KEY CLUSTERED
(
      [customerRef] ASC,
      [productRef] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[salesDataAggregated]  WITH CHECK ADD  CONSTRAINT [FK_salesDataToday_customer] FOREIGN KEY([customerRef])
REFERENCES [dbo].[customer] ([id])
GO

ALTER TABLE [dbo].[salesDataAggregated] CHECK CONSTRAINT [FK_salesDataToday_customer]
GO

ALTER TABLE [dbo].[salesDataAggregated]  WITH CHECK ADD  CONSTRAINT [FK_salesDataToday_product] FOREIGN KEY([productRef])
REFERENCES [dbo].[product] ([id])
GO

ALTER TABLE [dbo].[salesDataAggregated] CHECK CONSTRAINT [FK_salesDataToday_product]
GO

ALTER TABLE [dbo].[salesDataAggregated] ADD  CONSTRAINT [DF_salesDataToday_modifiedDate]  DEFAULT (getdate()) FOR [modifiedDate]
GO
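Given that (customerRef, productRef) clustered primary key, the mobile-facing read should be a single clustered-index seek per customer.  A minimal sketch of the not-yet-written procedure; the name and parameter are my assumption, not from the thread:

```sql
-- Hypothetical read procedure: one clustered-index range seek
-- returns all products (and their 12 month buckets) for a customer.
CREATE PROCEDURE dbo.usp_GetCustomerProductSales
    @customerRef int
AS
BEGIN
    SET NOCOUNT ON;
    SELECT productRef,
           m1_totalCount,  m1_totalValue,  m2_totalCount,  m2_totalValue,
           m3_totalCount,  m3_totalValue,  m4_totalCount,  m4_totalValue,
           m5_totalCount,  m5_totalValue,  m6_totalCount,  m6_totalValue,
           m7_totalCount,  m7_totalValue,  m8_totalCount,  m8_totalValue,
           m9_totalCount,  m9_totalValue,  m10_totalCount, m10_totalValue,
           m11_totalCount, m11_totalValue, m12_totalCount, m12_totalValue
    FROM dbo.salesDataAggregated
    WHERE customerRef = @customerRef;
END
```

Because all 12 months live on one row, no pivoting or aggregation happens at read time, which is what makes this layout fast for many concurrent mobile clients.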
Avatar of gbzhhu

ASKER

I have this query that will display the data to be copied to it daily.  Because the month data is created by the pivot (query copied from a website and adapted), I am struggling with the syntax for inserting into the defined table, even though the column counts and data types match.

SELECT * FROM
(SELECT c.id [Customer],
         p.id [Product],
         DATENAME(MONTH, orderDate) [Month],
       COUNT(1) [SalesCount]
FROM salesData sd
      INNER JOIN customer c ON c.efssNumber = sd.efssCustomerNumber AND c.number = sd.customerNumber
      INNER JOIN product p ON p.productNumber = sd.productNumber
WHERE orderDate BETWEEN DATEADD(m,-12,GETDATE()) AND GETDATE() and c.id = 5428
GROUP BY c.id, p.id, DATENAME(MONTH, orderDate)) AS MonthlySalesData
PIVOT( SUM([SalesCount])
    FOR [Month] IN ([January],[February],[March],[April],[May],
    [June],[July],[August],[September],[October],[November],
    [December])) AS MNamePivot
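One hedged alternative to inserting from the pivot: conditional aggregation builds both the counts and the values in a single pass, so the result maps directly onto the table's columns without a second PIVOT.  The bucket mapping (m1 = current month, m12 = same month last year) is my assumption, not something stated in the thread.

```sql
-- Sketch: first two month buckets shown; m3..m12 follow the same
-- pattern with offsets 2 through 11.
INSERT INTO dbo.salesDataAggregated
    (customerRef, productRef,
     m1_totalCount, m1_totalValue,
     m2_totalCount, m2_totalValue
     -- ... list m3_totalCount/m3_totalValue through m12 the same way
    )
SELECT
    c.id, p.id,
    -- months-ago offset 0 = current month
    SUM(CASE WHEN DATEDIFF(MONTH, sd.orderDate, GETDATE()) = 0
             THEN 1 ELSE 0 END),
    SUM(CASE WHEN DATEDIFF(MONTH, sd.orderDate, GETDATE()) = 0
             THEN sd.finalValue ELSE 0 END),
    -- months-ago offset 1 = previous month
    SUM(CASE WHEN DATEDIFF(MONTH, sd.orderDate, GETDATE()) = 1
             THEN 1 ELSE 0 END),
    SUM(CASE WHEN DATEDIFF(MONTH, sd.orderDate, GETDATE()) = 1
             THEN sd.finalValue ELSE 0 END)
    -- ... repeat with offsets 2 through 11 for m3..m12
FROM salesData sd
    INNER JOIN customer c ON c.efssNumber = sd.efssCustomerNumber
                         AND c.number = sd.customerNumber
    INNER JOIN product p ON p.productNumber = sd.productNumber
WHERE sd.orderDate >= DATEADD(MONTH, -11,
          DATEADD(DAY, 1 - DAY(GETDATE()), CAST(GETDATE() AS date)))
GROUP BY c.id, p.id;
```

Note the WHERE clause starts at the first day of the month 11 months back, so whole calendar months are counted rather than a rolling 365 days; truncating the table before the nightly insert (or using MERGE) would keep the daily refresh idempotent.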
ASKER CERTIFIED SOLUTION
Avatar of Scott Pletcher
Scott Pletcher (United States of America)

Avatar of gbzhhu

ASKER

Thanks Scott.

This is new work and the tables are new, with data pulled from the old system.  The minimal activity you noticed is just me; these tables do not even exist in the live system yet.

I'll use the suggested cluster.  Thank you.