websss asked:
SQL Server column store compression for row store queries
I have a time series database with the following columns:
id BIGINT -- PRIMARY KEY,
imei_number BIGINT NOT NULL,
gps_datetime datetime NOT NULL,
event_id INT,
speed INT,
latitude decimal(10,6),
longitude decimal(10,6),
raw_data varchar NULL
There are currently 4 billion rows in this one table, and it is growing by 1 billion a month.
I need to do a couple of things:
1) I would like to partition/split the data onto different drives: perhaps "now to 3 months" on the current drive, and data older than 3 months on a different drive.
2) I would like to compress the older data (older than 3 months, for example).
On the compression point, all queries I run are virtually the same. For example, a typical query filters on:
WHERE imei_number = 1234
  AND gps_datetime BETWEEN 'date1' AND 'date2'
  AND event_id IN (1, 5, 14) -- this clause is sometimes omitted
My clustered index is on the following columns (in this order):
imei_number
gps_datetime
id
I also have a non-clustered index on:
imei_number
event_id
id
and a non-clustered unique index (the primary key) on:
id
I don't have much experience with compression, but I've been trying out Postgres with TimescaleDB, which has a columnstore compression option where you define a SEGMENT BY key. I used imei_number as the SEGMENT BY column (so it lays everything out on disk by IMEI number first), and the compression is outstanding (180x), with query performance the same as normal rowstore.
On my SQL Server I've run some stats on page/row compression, and the savings are not groundbreaking (about 20-30%, based on a stored procedure that estimates them).
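The estimate presumably comes from SQL Server's built-in `sys.sp_estimate_data_compression_savings`; a minimal example, assuming a hypothetical table name `dbo.gps_events` (the original post does not name the table):

```sql
-- Estimate PAGE compression savings for all indexes on the table.
-- Schema/table names are placeholders.
EXEC sys.sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'gps_events',
    @index_id         = NULL,   -- NULL = estimate every index
    @partition_number = NULL,   -- NULL = every partition
    @data_compression = 'PAGE'; -- or 'ROW' / 'NONE'
```

The procedure samples the table into tempdb, so the reported sample sizes are much smaller than the full index sizes.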
Therefore I'd like to try columnstore compression, but I'm not sure if it has something similar to SEGMENT BY that orders everything on disk by imei_number.
Can anyone suggest some directions and options for me to try, please? I need a much higher compression rate, but without killing performance.
Luckily, data older than 3 months is rarely queried.
It's on SQL Server 2017 Enterprise.
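For context: SQL Server 2017 has no direct equivalent of SEGMENT BY (an ORDER clause for clustered columnstore indexes only arrived in SQL Server 2022), but a similar segment layout can be approximated by building the columnstore over already-sorted data. A sketch, again with a hypothetical table name:

```sql
-- 1) First cluster the rowstore on the desired "segment" key.
CREATE CLUSTERED INDEX cix_gps
    ON dbo.gps_events (imei_number, gps_datetime);

-- 2) Convert it to columnstore. MAXDOP = 1 keeps a single thread reading
--    the sorted rows, so segments stay grouped by imei_number.
CREATE CLUSTERED COLUMNSTORE INDEX cix_gps
    ON dbo.gps_events
    WITH (DROP_EXISTING = ON, MAXDOP = 1);
```

With rows grouped this way, queries filtering on imei_number can skip most segments (segment elimination), which is the same effect TimescaleDB gets from SEGMENT BY.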
ASKER
Thanks Ste5an
Regarding the primary key: I often refer to rows by id. Is there any performance benefit to setting up a different PK, as you mentioned?
As for the columnstore index, would I set up the first partition as rowstore and the other partition as columnstore?
Would data automatically migrate to the other partition, or would I have to do this manually?
This is a conundrum.
From a pure performance standpoint, you have the best clustering now: ( imei_number, gps_datetime, id ).
But, based on how you want to move older data to a different device, the best clustering key would be ( gps_datetime, imei_number, id ).
On top of that, though, partitioning can only be based on a single column, as can a columnstore clustered index. For that, gps_datetime first would be much easier to work with. It won't perform the best overall, but it will be far easier to manage.
[The PK isn't relevant, it's the clustering that's critical. Your existing, non-clustered PK is fine.]
How many unique imei numbers do you have?
Are you willing to do the "archiving" to a different device, after the 3 months, with custom code per imei number, or would you rather have an easy way to archive all imei numbers together, even if query performance is thereby reduced?
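The date-first layout maps directly onto table partitioning; a sketch with hypothetical filegroup, table, and boundary names (quarterly partitions, newest on the fast drive):

```sql
-- Filegroups FG_Archive / FG_Current are assumed to exist already,
-- with their files on the slow and fast drives respectively.
CREATE PARTITION FUNCTION pf_gps_quarter (datetime)
    AS RANGE RIGHT FOR VALUES ('2020-01-01', '2020-04-01', '2020-07-01');

CREATE PARTITION SCHEME ps_gps_quarter
    AS PARTITION pf_gps_quarter
    TO (FG_Archive, FG_Archive, FG_Archive, FG_Current); -- 3 boundaries = 4 partitions

-- Rebuild the clustered index on the scheme; the partitioning column
-- (gps_datetime) has to lead the clustering key for this layout.
CREATE CLUSTERED INDEX cix_gps
    ON dbo.gps_events (gps_datetime, imei_number, id)
    WITH (DROP_EXISTING = ON)
    ON ps_gps_quarter (gps_datetime);
```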
ASKER
Thanks Scott
>or would you rather have an easy way to archive all imei numbers together
I would rather the easy option, but I'm not exactly sure what I'm getting myself into here.
Do I need to kick off a daily job to move old data, or does a partition become "old" after 3 months?
I.e. if I run it today, there are 2 partitions:
1. current
2. old
Then in another 3 months it will create another old partition:
1. current
2. old (new)
3. old
Timescale does it by datetime but has a "segment by" option so you can group everything by deviceId, which is apparently what we cannot do here with SQL Server.
There are around 16,000 unique IMEI numbers, and that will keep growing to around 35,000 by the end of the year.
I'm OK with a monthly manual process to move old data to the other partition. My main goals are:
1. Disk space used must be at least halved (or better; this is the main goal).
2. Relatively easy to manage (I'm not keen on spending 8 hours every month on this migration).
3. Performance must be OK, but I'm fine with it not being as fast as the current 3 months, since old data is rarely used.
Most "NewSQL" time-series databases like InfluxDB, Timescale, and MemSQL seem to offer this HTAP-like capability, i.e. rowstore for current data (the OLTP side) and columnstore for older data (the OLAP side).
From what I'm hearing, it's not so simple with SQL Server and more of a pain to manage.
So given the above, what would you suggest?
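For reference, a single SQL Server 2017 table cannot mix rowstore and columnstore partitions, so the usual pattern for these goals is a separate archive table with a clustered columnstore index, fed by a monthly job. A minimal sketch with hypothetical names (in practice, partition switching would avoid the large DELETE):

```sql
-- Archive table: same columns as the hot table, clustered columnstore.
CREATE TABLE dbo.gps_events_archive (
    id           BIGINT        NOT NULL,
    imei_number  BIGINT        NOT NULL,
    gps_datetime DATETIME      NOT NULL,
    event_id     INT           NULL,
    speed        INT           NULL,
    latitude     DECIMAL(10,6) NULL,
    longitude    DECIMAL(10,6) NULL,
    raw_data     VARCHAR(8000) NULL,  -- length assumed; the post omits it
    INDEX ccix_gps_archive CLUSTERED COLUMNSTORE
);

-- Monthly job: move rows older than 3 months into the archive.
INSERT INTO dbo.gps_events_archive WITH (TABLOCK)
SELECT id, imei_number, gps_datetime, event_id, speed,
       latitude, longitude, raw_data
FROM dbo.gps_events
WHERE gps_datetime < DATEADD(MONTH, -3, GETDATE());

DELETE FROM dbo.gps_events
WHERE gps_datetime < DATEADD(MONTH, -3, GETDATE());
```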
Yes, typically compression improves performance, since I/O is usually the bottleneck, not CPU. If, somehow, you have a relative CPU shortage and lots of spare I/O capacity (extraordinarily rare), then compression would hurt more than help.
Indexes 1 and 14 should be page compressed; index 16 only row compressed (row compression is basically free unless you are severely CPU constrained).
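Applied to this table, that advice looks roughly like the following (index and table names are placeholders; the IDs 1, 14, and 16 come from the posts above):

```sql
-- PAGE compression for the clustered index (index_id 1) and the
-- non-clustered index (index_id 14).
ALTER INDEX cix_gps ON dbo.gps_events
    REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON); -- ONLINE needs Enterprise

ALTER INDEX ix_gps_imei_event ON dbo.gps_events
    REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);

-- ROW compression for the unique PK index (index_id 16).
ALTER INDEX pk_gps_events ON dbo.gps_events
    REBUILD WITH (DATA_COMPRESSION = ROW, ONLINE = ON);
```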
ASKER
Thanks
Index 16 with ROW compression came out as follows, so no saving there (columns: index_id, partition, current size KB, estimated size KB, and the corresponding sample sizes):
16 1 135033664 135033664 2408 2408
I guess I can just opt for PAGE compression on the first 2 indexes?
Yep, that makes sense. Usually ROW saves a little, but not always.
Btw, for any pages where page compression ends up not helping, SQL Server will not compress that page. That is, the algorithm is sophisticated enough to only actually apply compression to a given page if it saves a certain amount of space.
Btw, here the table's clustering key should be: (gps_datetime, imei_number, id).