websss asked:
SQL Server column store compression for row store queries
I have a time series database with the following columns:
id BIGINT -- PRIMARY KEY,
imei_number BIGINT NOT NULL,
gps_datetime datetime NOT NULL,
event_id INT,
speed INT,
latitude decimal(10,6),
longitude decimal(10,6),
raw_data varchar NULL
There are currently 4 billion rows in this one table, and it is growing by 1 billion a month.
I need to do a couple of things:
1) I would like to partition/split the data onto different drives: perhaps "now to 3 months" on the current drive, and data older than 3 months on a different drive.
2) I would like to compress the older data (older than 3 months, for example).
On the compression point, all queries I run are virtually the same. For example, a typical query filters on:
WHERE imei_number = 1234
  AND gps_datetime BETWEEN 'date1' AND 'date2'
  AND event_id IN (1, 5, 14) -- this clause is sometimes omitted
My clustered index is on the following columns (in this order):
imei_number
gps_datetime
id
I also have a non-clustered index on:
imei_number
event_id
id
and a non-clustered unique index (the primary key) on:
id
I don't have much experience with compression, but I've been trying out Postgres with TimescaleDB, which has a columnstore compression option where you define a SEGMENT BY key. I used imei_number as the SEGMENT BY column (so it lays everything out on disk by IMEI number first), and the compression is outstanding (180x), with query performance the same as normal rowstore.
On my SQL Server I've run some stats on page/row compression, and the savings are not groundbreaking (about 20-30%, based on a stored procedure that estimates them).
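The estimate presumably comes from SQL Server's built-in `sys.sp_estimate_data_compression_savings`; a minimal example, assuming a hypothetical table name `dbo.gps_events` (the original post does not name the table):

```sql
-- Estimate PAGE compression savings for all indexes on the table.
-- Schema/table names are placeholders.
EXEC sys.sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'gps_events',
    @index_id         = NULL,   -- NULL = estimate every index
    @partition_number = NULL,   -- NULL = every partition
    @data_compression = 'PAGE'; -- or 'ROW' / 'NONE'
```

The procedure samples the table into tempdb, so the reported sample sizes are much smaller than the full index sizes.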
Therefore I'd like to try columnstore compression, but I'm not sure if it has something similar to SEGMENT BY that orders everything on disk by imei_number.
Can anyone suggest some directions and options for me to try, please? I need a much higher compression rate, but without killing performance.
Luckily, data older than 3 months is rarely queried.
It's on SQL Server 2017 Enterprise.
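For context: SQL Server 2017 has no direct equivalent of SEGMENT BY (an ORDER clause for clustered columnstore indexes only arrived in SQL Server 2022), but a similar segment layout can be approximated by building the columnstore over already-sorted data. A sketch, again with a hypothetical table name:

```sql
-- 1) First cluster the rowstore on the desired "segment" key.
CREATE CLUSTERED INDEX cix_gps
    ON dbo.gps_events (imei_number, gps_datetime);

-- 2) Convert it to columnstore. MAXDOP = 1 keeps a single thread reading
--    the sorted rows, so segments stay grouped by imei_number.
CREATE CLUSTERED COLUMNSTORE INDEX cix_gps
    ON dbo.gps_events
    WITH (DROP_EXISTING = ON, MAXDOP = 1);
```

With rows grouped this way, queries filtering on imei_number can skip most segments (segment elimination), which is the same effect TimescaleDB gets from SEGMENT BY.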
ASKER
Thanks Ste5an
Regarding the primary key: I often refer to rows by id. Is there any performance benefit to setting up a different PK, as you mentioned?
As for the columnstore index, would I set up the first partition as rowstore and the other partition as columnstore?
Would data automatically migrate to the other partition, or would I have to do this manually?
This is a conundrum.
From a pure performance standpoint, you have the best clustering now: ( imei_number, gps_datetime, id ).
But, based on how you want to move older data to a different device, the best clustering key would be ( gps_datetime, imei_number, id ).
On top of that, though, partitioning can only be based on a single column, as can a columnstore clustered index. For that, gps_datetime first would be much easier to work with. It won't perform the best overall, but it will be far easier to manage.
[The PK isn't relevant, it's the clustering that's critical. Your existing, non-clustered PK is fine.]
How many unique imei numbers do you have?
Are you willing to do the "archiving" to a different device, after the 3 months, with custom code per imei number, or would you rather have an easy way to archive all imei numbers together, even if query performance is thereby reduced?
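The date-first layout maps directly onto table partitioning; a sketch with hypothetical filegroup, table, and boundary names (quarterly partitions, newest on the fast drive):

```sql
-- Filegroups FG_Archive / FG_Current are assumed to exist already,
-- with their files on the slow and fast drives respectively.
CREATE PARTITION FUNCTION pf_gps_quarter (datetime)
    AS RANGE RIGHT FOR VALUES ('2020-01-01', '2020-04-01', '2020-07-01');

CREATE PARTITION SCHEME ps_gps_quarter
    AS PARTITION pf_gps_quarter
    TO (FG_Archive, FG_Archive, FG_Archive, FG_Current); -- 3 boundaries = 4 partitions

-- Rebuild the clustered index on the scheme; the partitioning column
-- (gps_datetime) has to lead the clustering key for this layout.
CREATE CLUSTERED INDEX cix_gps
    ON dbo.gps_events (gps_datetime, imei_number, id)
    WITH (DROP_EXISTING = ON)
    ON ps_gps_quarter (gps_datetime);
```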
ASKER
Thanks Scott
>or would you rather have an easy way to archive all imei numbers together
I would rather the easy option, but I'm not exactly sure what I'm getting myself into here.
Do I need to kick off a daily job to move old data, or does a partition become "old" after 3 months?
I.e. if I run it today, there are 2 partitions:
1. current
2. old
Then in another 3 months it will create another old partition:
1. current
2. old (new)
3. old
Timescale does it by datetime but has a "segment by" option so you can group everything by deviceId, which is apparently what we cannot do here with SQL Server.
There are around 16,000 unique IMEI numbers, and that will keep growing to around 35,000 by the end of the year.
I'm OK with a monthly manual process to move old data to the other partition. My main goals are:
1. Disk space used must be at least halved (or better; this is the main goal).
2. Relatively easy to manage (I'm not keen on spending 8 hours every month on this migration).
3. Performance must be OK, but I'm fine with it not being as fast as the current 3 months, since old data is rarely used.
Most "NewSQL" time-series databases like InfluxDB, Timescale, and MemSQL seem to offer this HTAP-like capability, i.e. rowstore for current data (the OLTP side) and columnstore for older data (the OLAP side).
From what I'm hearing, it's not so simple with SQL Server and more of a pain to manage.
So given the above, what would you suggest?
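For reference, a single SQL Server 2017 table cannot mix rowstore and columnstore partitions, so the usual pattern for these goals is a separate archive table with a clustered columnstore index, fed by a monthly job. A minimal sketch with hypothetical names (in practice, partition switching would avoid the large DELETE):

```sql
-- Archive table: same columns as the hot table, clustered columnstore.
CREATE TABLE dbo.gps_events_archive (
    id           BIGINT        NOT NULL,
    imei_number  BIGINT        NOT NULL,
    gps_datetime DATETIME      NOT NULL,
    event_id     INT           NULL,
    speed        INT           NULL,
    latitude     DECIMAL(10,6) NULL,
    longitude    DECIMAL(10,6) NULL,
    raw_data     VARCHAR(8000) NULL,  -- length assumed; the post omits it
    INDEX ccix_gps_archive CLUSTERED COLUMNSTORE
);

-- Monthly job: move rows older than 3 months into the archive.
INSERT INTO dbo.gps_events_archive WITH (TABLOCK)
SELECT id, imei_number, gps_datetime, event_id, speed,
       latitude, longitude, raw_data
FROM dbo.gps_events
WHERE gps_datetime < DATEADD(MONTH, -3, GETDATE());

DELETE FROM dbo.gps_events
WHERE gps_datetime < DATEADD(MONTH, -3, GETDATE());
```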
Yes, typically compression improves performance, since I/O is usually the bottleneck, not CPU. If, somehow, you have a relative CPU shortage and lots of spare I/O capacity (extraordinarily rare), then compression would hurt more than help.
Indexes 1 and 14 should be page compressed; index 16 only row compressed (row compression is basically free unless you are severely CPU constrained).
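Applied to this table, that advice looks roughly like the following (index and table names are placeholders; the IDs 1, 14, and 16 come from the posts above):

```sql
-- PAGE compression for the clustered index (index_id 1) and the
-- non-clustered index (index_id 14).
ALTER INDEX cix_gps ON dbo.gps_events
    REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON); -- ONLINE needs Enterprise

ALTER INDEX ix_gps_imei_event ON dbo.gps_events
    REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);

-- ROW compression for the unique PK index (index_id 16).
ALTER INDEX pk_gps_events ON dbo.gps_events
    REBUILD WITH (DATA_COMPRESSION = ROW, ONLINE = ON);
```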
ASKER
Thanks
Index 16 with ROW compression came out as follows, so no saving there (columns: index_id, partition, current size KB, estimated size KB, and the corresponding sample sizes):
16 1 135033664 135033664 2408 2408
I guess I can just opt for PAGE compression on the first 2 indexes?
Yep, that makes sense. Usually ROW saves a little, but not always.
Btw, for any pages where page compression ends up not helping, SQL Server will not compress that page. That is, the algorithm is sophisticated enough to only actually apply compression to a given page if it saves a certain amount of space.
Btw, here the table's clustering key should be: (gps_datetime, imei_number, id).