Avatar of gbzhhu
gbzhhu (United Kingdom)

asked on

SQL Server - create aggregated data from a table

Hi,

I have a sales history table which has about 1 million records but could grow to 3 million or so.

CREATE TABLE salesData(
      [id] [int] IDENTITY(1,1) NOT NULL,
      [originId] [nvarchar](20) NOT NULL,
      [orderId] [nvarchar](250) NOT NULL,
      [orderDate] [datetime] NOT NULL,
      [statusRef] [int] NOT NULL,
      [orderSequence] [int] NOT NULL,
      [productNumber] [int] NOT NULL,
      [customerNumber] [int] NOT NULL,
      [finalQty] [int] NOT NULL,
      [finalValue] [decimal](18, 2) NOT NULL,
      [invoiceNumber] [nvarchar](20) NOT NULL,
      [modifiedDate] [datetime] NOT NULL,
      [deleted] [bit] NOT NULL,
 CONSTRAINT [PK_salesData] PRIMARY KEY CLUSTERED
(
      [id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

The data in this table needs to be available to mobile devices.  Basically a sales agent will select a customer in the mobile app, then a list of products will be displayed.  Along this product list, for each product, I need to display how many units that customer bought in the last 12 months, month by month.  Something like this:

Sales:      Jan  Feb  Mar  Apr  ... (rest of months)
            20   11   0    4
Product:    Dyson Hoover blah blah

1.  The salesData table is updated daily from another system.
2.  The mobile apps call a stored procedure (not yet written) that needs to return the data above (customerNumber, productNumber, then the months and values/counts).  It returns only 12 months (from the current month back to the same month of the previous year).
3.  Lots of mobiles will be calling the stored procedure.
4.  The stored procedure needs to return the data fast.
5.  The data is big, as I mentioned.

So I thought this:

I would create a dedicated aggregated data table, aggregatedSalesData.  I am not sure how to structure the part of this table that holds the months and values: whether to keep them as flat columns (m1, m1Value, m2, m2Value, etc.) or as columns like monthNum, monthValue so that each month becomes a row.
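For comparison, the row-per-month alternative could look like the sketch below; every name in it is a placeholder, not something settled in this thread.  Each (customer, product, month) combination becomes one row, which makes the 12-month window a simple range predicate instead of a fixed set of columns.

```sql
-- Sketch of the row-per-month design (all names hypothetical).
CREATE TABLE aggregatedSalesData (
    customerNumber int            NOT NULL,
    productNumber  int            NOT NULL,
    salesMonth     date           NOT NULL,  -- first day of the month
    totalQty       int            NOT NULL,
    totalValue     decimal(18, 2) NOT NULL,
    CONSTRAINT PK_aggregatedSalesData PRIMARY KEY CLUSTERED
        (customerNumber, productNumber, salesMonth)
);
```

The trade-off: the flat m1..m12 layout returns one row per product with no pivoting at read time, while the row-per-month layout is easier to load and easier to extend beyond 12 months.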

I saw this
http://sqlhints.com/2014/03/09/how-to-get-monthly-data-in-sql-server/

Is this the best way for my situation?

I need advice on whether I am going the right way with this and, if I am, help with the query to move data every day from salesData to aggregatedSalesData.

Many thanks
Avatar of ste5an
ste5an (Germany)

Creating a snapshot only makes sense when we're talking about a data warehouse, because the snapshot creation process (ETL) is better handled there.

So the first question is: is your table salesData a transactional table?

You should read Kimball: Dimensional Modeling Techniques.

And especially Transaction Fact Tables and Periodic Snapshot Fact Tables.
Avatar of gbzhhu

ASKER

Hi ste5an,

I don't know what is meant by a transactional table.  I simply created the table with the SQL Server defaults and populate it from an import_salesData table, which gets truncated every day.  Note that performance for copying data from salesData to aggregatedSalesData is not an issue; that task happens at 4 a.m. and will not affect the mobile devices.  It is reading the data for the mobile devices (from the aggregatedSalesData table, unless I am corrected) that needs to be fast.  That is why I am thinking of pre-creating the aggregated data.

I will do some reading on the article you suggested

Thank you
The key factor for speed is to cluster the data the way you need to use it.  In particular, get rid of the false notion that id "should" be the "default" clustering key -- that's just an AWFUL myth, the worst one in dbs.

Based on the limited info so far, it sounds as if you need to cluster the aggregatedSalesData on:
( customerNumber, productNumber, orderMonth )


Btw, yes, you do need to go back and better cluster the salesData table itself too.  It might need to be clustered by orderDate first, or by customerNumber first (I can't tell without more details), but it should NOT be clustered on id, that much I know already.
Avatar of gbzhhu

ASKER

Hi Scott,

First, I would provide any info to enable you to help me better, so please just ask.  I just do not know what info is useful for you to help me.

I get the clustering of the aggregatedSalesData.  The only data that I really want on that table (if I end up using it) is customerNumber, productNumber, and the aggregated month values.

For salesData, how can I be sure that orderDate is unique (orders can be sent in the same second)?  Also customerNumber is repeated, so not unique.  Perhaps I could cluster on orderDate and invoiceNumber combined?
The clustering key you specify doesn't have to be unique.  Although, if it isn't, it can't be a constraint, so you'd need to use a:
CREATE CLUSTERED INDEX salesData__CL ON dbo.salesData ( orderDate )
statement instead of a CONSTRAINT to create the index.

If you want to make it unique yourself, just add id to the end of the index.

CREATE UNIQUE CLUSTERED INDEX ...  ( orderDate, id )

Typically order data is mostly viewed by date, with the most recent data read the most often by far.

If you frequently / almost always access that data by customer# as well, it may be better to cluster by that first.
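Putting those pieces together: since salesData was created with a clustered primary key on id, the existing constraint has to be dropped (or rebuilt as nonclustered) before the new clustered index can be created.  A hedged sketch of the full sequence, assuming no foreign keys reference PK_salesData:

```sql
-- Sketch only: assumes no foreign keys reference PK_salesData.
-- 1. Drop the existing clustered PK on id.
ALTER TABLE dbo.salesData DROP CONSTRAINT PK_salesData;

-- 2. Cluster on (orderDate, id); appending id makes the key unique
--    even when two orders share the same datetime.
CREATE UNIQUE CLUSTERED INDEX salesData__CL
    ON dbo.salesData (orderDate, id);

-- 3. Keep id enforced as the primary key, but nonclustered.
ALTER TABLE dbo.salesData
    ADD CONSTRAINT PK_salesData PRIMARY KEY NONCLUSTERED (id);
```

On a multi-million-row table the rebuild will take some time and lock the table, so it belongs in a maintenance window.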

Running the query below will show you what SQL considers to be "missing indexes" and usage stats for existing indexes.  That will help a lot in seeing what's really going on.  Post those results and we can review them as well.



USE [<your_db_name_here>]

SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

DECLARE @list_missing_indexes bit
DECLARE @table_name_pattern sysname

--NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.
SET @list_missing_indexes = 1
SET @table_name_pattern = 'salesData'


PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

IF @list_missing_indexes = 1
BEGIN
    SELECT
        GETDATE() AS capture_date,
        DB_NAME(mid.database_id) AS Db_Name,
        dps.row_count,
        OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,
        mid.equality_columns, mid.inequality_columns, mid.included_columns,
        user_seeks, user_scans, ca1.max_days_active, unique_compiles,
        last_user_seek, last_user_scan, avg_total_user_cost, avg_user_impact,
        system_seeks, system_scans, last_system_seek, last_system_scan, avg_total_system_cost, avg_system_impact,
        mid.statement, mid.object_id, mid.index_handle
    FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)
    CROSS APPLY (
        SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'
    ) AS ca1
    LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON
        mig.index_handle = mid.index_handle
    LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON
        migs.group_handle = mig.index_group_handle
    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
        dps.object_id = mid.object_id AND
        dps.index_id IN (0, 1)
    --order by
        --DB_NAME, Table_Name, equality_columns
    WHERE
        1 = 1
        AND mid.database_id = DB_ID() --only current db
        AND OBJECT_NAME(mid.object_id) LIKE @table_name_pattern
        --AND mid.object_id IN (OBJECT_ID('<table_name_1>'), OBJECT_ID('<table_name_2>'))
    ORDER BY
        --avg_total_user_cost * (user_seeks + user_scans) DESC,
        Db_Name, Table_Name, equality_columns, inequality_columns
END --IF

PRINT 'Midpoint @ ' + CONVERT(varchar(30), GETDATE(), 120)
-- list index usage stats (seeks, scans, etc.)
SELECT
    ius2.row_num, DB_NAME() AS db_name,
    CASE WHEN i.name LIKE ca2.table_name + '%'
         THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1, 200)
         ELSE i.name END AS index_name,
    CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +
    CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],
    ca2.table_name,
    i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,
    dps.row_count,
    SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,
    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,
    ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,
    fk.Reference_Count AS fk_ref_count,
    DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date
        ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,
    FILEGROUP_NAME(i.data_space_id) AS filegroup_name,
    ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,
    ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update
FROM sys.indexes i WITH (NOLOCK)
INNER JOIN sys.objects o WITH (NOLOCK) ON
    o.object_id = i.object_id
CROSS JOIN (
    SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'
) AS ca1
CROSS APPLY (
    SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name
) AS ca2
OUTER APPLY (
    SELECT
        ', ' + COL_NAME(object_id, ic.column_id)
    FROM sys.index_columns ic
    WHERE
        ic.key_ordinal > 0 AND
        ic.object_id = i.object_id AND
        ic.index_id = i.index_id
    ORDER BY
        ic.key_ordinal
    FOR XML PATH('')
) AS key_cols (key_cols)
OUTER APPLY (
    SELECT
        ', ' + COL_NAME(object_id, ic.column_id)
    FROM sys.index_columns ic
    WHERE
        ic.key_ordinal = 0 AND
        ic.object_id = i.object_id AND
        ic.index_id = i.index_id
    ORDER BY
        COL_NAME(object_id, ic.column_id)
    FOR XML PATH('')
) AS nonkey_cols (nonkey_cols)
LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
    dps.object_id = i.object_id AND
    dps.index_id = i.index_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON
    ius.database_id = DB_ID() AND
    ius.object_id = i.object_id AND
    ius.index_id = i.index_id
LEFT OUTER JOIN (
    SELECT
        database_id, object_id, MAX(user_scans) AS user_scans,
        ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans
    FROM sys.dm_db_index_usage_stats WITH (NOLOCK)
    WHERE
        database_id = DB_ID()
        --AND index_id > 0
    GROUP BY
        database_id, object_id
) AS ius2 ON
    ius2.database_id = DB_ID() AND
    ius2.object_id = i.object_id
LEFT OUTER JOIN (
    SELECT
        referenced_object_id, COUNT(*) AS Reference_Count
    FROM sys.foreign_keys
    WHERE
        is_disabled = 0
    GROUP BY
        referenced_object_id
) AS fk ON
    fk.referenced_object_id = i.object_id
WHERE
    i.object_id > 100 AND
    i.is_hypothetical = 0 AND
    i.type IN (0, 1, 2) AND
    o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND
    (
     o.name LIKE @table_name_pattern AND
     o.name NOT LIKE 'dtprop%' AND
     o.name NOT LIKE 'filestream[_]' AND
     o.name NOT LIKE 'MSpeer%' AND
     o.name NOT LIKE 'MSpub%' AND
     --o.name NOT LIKE 'queue[_]%' AND
     o.name NOT LIKE 'sys%'
    )
    --AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')
ORDER BY
    --row_count DESC,
    --ius.user_scans DESC,
    --ius2.row_num, --user_scans&|user_seeks
    db_name, table_name,
    -- list clustered index first, if any, then other index(es)
    CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,
    key_cols

PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

SET DEADLOCK_PRIORITY NORMAL
Avatar of gbzhhu

ASKER

Many thanks Scott.

Working through this now and will feedback
Avatar of gbzhhu

ASKER

I got 2 outputs as follows.  What does that tell you?

2015-01-12 11:30:45.107      pah_promotionPortal_DE_Stage      479848      salesData      [customerNumber], [efssCustomerNumber]      NULL      [orderDate], [productNumber]      1      0      101      1      2015-01-09 17:50:36.473      NULL      9.29434155590326      97.15      0      0      NULL      NULL      0      0      [pah_promotionPortal_DE_Stage].[dbo].[salesData]      2016726237      16159
2015-01-12 11:30:45.107      pah_promotionPortal_DE_Stage      479848      salesData      [efssCustomerNumber]      NULL      [orderDate], [productNumber], [customerNumber]      1      0      101      1      2015-01-09 17:50:27.120      NULL      6.70718491954374      96.07      0      0      NULL      NULL      0      0      [pah_promotionPortal_DE_Stage].[dbo].[salesData]      2016726237      16157


21      pah_promotionPortal_DE_Stage      PK_salesData      Y.Y      salesData      1      479848      id      NULL      0      22      0      2      NULL      2015-01-09 17:50:36.473      NULL      2015-01-07 17:18:21.140      NULL      5      PRIMARY      0      4      0      0      NULL      2015-01-09 17:35:05.453      NULL      NULL
Avatar of gbzhhu

ASKER

I have now created the aggregated table.

CREATE TABLE [dbo].[salesDataAggregated](
      [customerRef] [int] NOT NULL,
      [productRef] [int] NOT NULL,
      [m1_totalCount] [int] NULL,
      [m1_totalValue] [decimal](18, 2) NULL,
      [m2_totalCount] [int] NULL,
      [m2_totalValue] [decimal](18, 2) NULL,
      [m3_totalCount] [int] NULL,
      [m3_totalValue] [decimal](18, 2) NULL,
      [m4_totalCount] [int] NULL,
      [m4_totalValue] [decimal](18, 2) NULL,
      [m5_totalCount] [int] NULL,
      [m5_totalValue] [decimal](18, 2) NULL,
      [m6_totalCount] [int] NULL,
      [m6_totalValue] [decimal](18, 2) NULL,
      [m7_totalCount] [int] NULL,
      [m7_totalValue] [decimal](18, 2) NULL,
      [m8_totalCount] [int] NULL,
      [m8_totalValue] [decimal](18, 2) NULL,
      [m9_totalCount] [int] NULL,
      [m9_totalValue] [decimal](18, 2) NULL,
      [m10_totalCount] [int] NULL,
      [m10_totalValue] [decimal](18, 2) NULL,
      [m11_totalCount] [int] NULL,
      [m11_totalValue] [decimal](18, 2) NULL,
      [m12_totalCount] [int] NULL,
      [m12_totalValue] [decimal](18, 2) NULL,
      [modifiedDate] [datetime] NOT NULL,
 CONSTRAINT [PK_salesDataToday_1] PRIMARY KEY CLUSTERED
(
      [customerRef] ASC,
      [productRef] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[salesDataAggregated]  WITH CHECK ADD  CONSTRAINT [FK_salesDataToday_customer] FOREIGN KEY([customerRef])
REFERENCES [dbo].[customer] ([id])
GO

ALTER TABLE [dbo].[salesDataAggregated] CHECK CONSTRAINT [FK_salesDataToday_customer]
GO

ALTER TABLE [dbo].[salesDataAggregated]  WITH CHECK ADD  CONSTRAINT [FK_salesDataToday_product] FOREIGN KEY([productRef])
REFERENCES [dbo].[product] ([id])
GO

ALTER TABLE [dbo].[salesDataAggregated] CHECK CONSTRAINT [FK_salesDataToday_product]
GO

ALTER TABLE [dbo].[salesDataAggregated] ADD  CONSTRAINT [DF_salesDataToday_modifiedDate]  DEFAULT (getdate()) FOR [modifiedDate]
GO
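Given that (customerRef, productRef) clustered primary key, the mobile-facing read should be a single clustered-index seek per customer.  A minimal sketch of the not-yet-written procedure; the name and parameter are my assumption, not from the thread:

```sql
-- Hypothetical read procedure: one clustered-index range seek
-- returns all products (and their 12 month buckets) for a customer.
CREATE PROCEDURE dbo.usp_GetCustomerProductSales
    @customerRef int
AS
BEGIN
    SET NOCOUNT ON;
    SELECT productRef,
           m1_totalCount,  m1_totalValue,  m2_totalCount,  m2_totalValue,
           m3_totalCount,  m3_totalValue,  m4_totalCount,  m4_totalValue,
           m5_totalCount,  m5_totalValue,  m6_totalCount,  m6_totalValue,
           m7_totalCount,  m7_totalValue,  m8_totalCount,  m8_totalValue,
           m9_totalCount,  m9_totalValue,  m10_totalCount, m10_totalValue,
           m11_totalCount, m11_totalValue, m12_totalCount, m12_totalValue
    FROM dbo.salesDataAggregated
    WHERE customerRef = @customerRef;
END
```

Because all 12 months live on one row, no pivoting or aggregation happens at read time, which is what makes this layout fast for many concurrent mobile clients.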
Avatar of gbzhhu

ASKER

I have this query that will display the data to be copied to it daily.  Because the month data is created by the pivot (query copied from a website and adapted), I am struggling with the syntax for inserting into the defined table, even though the column counts and data types match.

SELECT * FROM
(SELECT c.id [Customer],
         p.id [Product],
         DATENAME(MONTH, orderDate) [Month],
       COUNT(1) [SalesCount]
FROM salesData sd
      INNER JOIN customer c ON c.efssNumber = sd.efssCustomerNumber AND c.number = sd.customerNumber
      INNER JOIN product p ON p.productNumber = sd.productNumber
WHERE orderDate BETWEEN DATEADD(m,-12,GETDATE()) AND GETDATE() and c.id = 5428
GROUP BY c.id, p.id, DATENAME(MONTH, orderDate)) AS MonthlySalesData
PIVOT( SUM([SalesCount])
    FOR [Month] IN ([January],[February],[March],[April],[May],
    [June],[July],[August],[September],[October],[November],
    [December])) AS MNamePivot
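One hedged alternative to inserting from the pivot: conditional aggregation builds both the counts and the values in a single pass, so the result maps directly onto the table's columns without a second PIVOT.  The bucket mapping (m1 = current month, m12 = same month last year) is my assumption, not something stated in the thread.

```sql
-- Sketch: first two month buckets shown; m3..m12 follow the same
-- pattern with offsets 2 through 11.
INSERT INTO dbo.salesDataAggregated
    (customerRef, productRef,
     m1_totalCount, m1_totalValue,
     m2_totalCount, m2_totalValue
     -- ... list m3_totalCount/m3_totalValue through m12 the same way
    )
SELECT
    c.id, p.id,
    -- months-ago offset 0 = current month
    SUM(CASE WHEN DATEDIFF(MONTH, sd.orderDate, GETDATE()) = 0
             THEN 1 ELSE 0 END),
    SUM(CASE WHEN DATEDIFF(MONTH, sd.orderDate, GETDATE()) = 0
             THEN sd.finalValue ELSE 0 END),
    -- months-ago offset 1 = previous month
    SUM(CASE WHEN DATEDIFF(MONTH, sd.orderDate, GETDATE()) = 1
             THEN 1 ELSE 0 END),
    SUM(CASE WHEN DATEDIFF(MONTH, sd.orderDate, GETDATE()) = 1
             THEN sd.finalValue ELSE 0 END)
    -- ... repeat with offsets 2 through 11 for m3..m12
FROM salesData sd
    INNER JOIN customer c ON c.efssNumber = sd.efssCustomerNumber
                         AND c.number = sd.customerNumber
    INNER JOIN product p ON p.productNumber = sd.productNumber
WHERE sd.orderDate >= DATEADD(MONTH, -11,
          DATEADD(DAY, 1 - DAY(GETDATE()), CAST(GETDATE() AS date)))
GROUP BY c.id, p.id;
```

Note the WHERE clause starts at the first day of the month 11 months back, so whole calendar months are counted rather than a rolling 365 days; truncating the table before the nightly insert (or using MERGE) would keep the daily refresh idempotent.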
ASKER CERTIFIED SOLUTION
Avatar of Scott Pletcher
Scott Pletcher (United States of America)

Avatar of gbzhhu

ASKER

Thanks Scott.

This is new work and the tables are new, with data pulled from the old system.  The minimal activity you noticed is just me; these tables do not even exist in the live system yet.

I'll use the suggested cluster.  Thank you.