• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 314
  • Last Modified:

Design of a Horisontal-partitioned Index

I have a table with 500 million rows. It is not practical to split the table, because a number of queries need to extract data that relies on the complete data set being present.

There are 2 defining parameters; call the one the Identity, and the other Time. For each Identity (of which there are possibly 20,000 unique Identities) there are multiple unique Time-stamped rows (i.e. Each Identity has a sub-array with unique time-stamps).

When I update the table I need to check the Identity-Time combination does not already exist; if it does I do not update (i.e. if there is an existing Identity-Time pair, and a same-value line is attempted to be added, it is discarded because the logic of the system says no new Time-Identity pair can be different to one already posted to the system).

I currently have a Clustered index on Time (Descending), and Identity.

The problem is this means the database is physically sorted with Identity (000001) and all its Time-data, then Identity (000002) and all its Time-data, and so forth. So when I check a pair exists the result is quick. But every insert of a new Time for identity (1234567) requires a re-order because of the Clustered index.

Yes, practically SQL handles the paging better than that, but the principle is I have huge IO as a result of the insertion of new data. And we are adding data at a rate of 100,000 rows a day!

I understand there is a thing called a Horizontally-partitioned Index. I have never used them but am hoping it is possible to partition Identity and only the last Day of time-data so inserts of new data happen near the start (end) of the database / Index, and not throughout the large database depending on the physical order that specific Index is subject to.

I have tried using a normal (non-clustered) index, but found the performance of the clustered index vastly improves other queries. So I hope using a partitioned index will give me the best of both worlds. the problem though is I don't understand the construction of a partitioned index.

I have also tried using the Database (and query) tuning tools included in SQL 2014. I first removed every existing index, and then created a few insert statements representative of the normal operation. the problem is the Tuning Wizards are unable to recommend even a single index. I don't know if this is a bug in SQL 2014, or a result of the huge database, but a number of attempts and no recommended indexing.
If I run the queries with "Show Execution Plan" ON then there are recommended indexes to include, but none are partitioned; they are all standard non-clustered indexes specific to the variables being checked, and updated.
  • 4
  • 3
  • 2
  • +1
2 Solutions
Guy Hengel [angelIII / a3]Billing EngineerCommented:
Can you please post the insert/update query?

Note that with view over vertically split table you could also solve such issue. On top you could create INSTEAD OF trigger on such a view....
Partitioning in SQL Server is a powerful feature which, depending on how efficiently it's used, you may get fantastic results, no significant difference, or even negative impact! Yes, it's one of those tools with which you can shoot yourself in the foot!
I have several tables with more than 5 Billion rows and 8-12M transactions per day (insert and update) and several Foreign Key and Unique key constraints defined on those tables.
Without partitioning it would be impossible to get the current performance from these titanic tables. It has also made a significant impact on duration of the index rebuild,  reducing it by around 90%!
In your case, the unique constraint on ID+TimeStamp seems to be one of the key factors. Can you please specify what is the smallest time difference between two timestamps that matters to you for instance minute, second, millisecond? Depending on that I might be able suggest a possible choice for partitioning your table that you can try on a test plaform and check its suitability.
A couple of points, which are important in designing the partitoin functions for this table:
1. Is ID range fixed or can it grow in future?
2. Are there any archiving requirements with regards to that table? For instance do you need to keep all the data in one place forever or do you need to move old data somewhere else? I'm asking to see if you need/can benefit from sliding partitions.
Please also note that partitinion may impact or require changes to other indexes on that table, so any info on the existing indexes can be useful.
Bird757Author Commented:
The Time stamps are no more frequent that every second (exactly seconds - no milli-seconds). But the data is not every second all the time; we have a batch that arrives where every second is covered, and then a gap of a few seconds to minutes, and then another block of data. The issue with the inserts is the data is not delivered sequentially; we can receive a data block containing seconds 1,2,3,8 now, and a few minutes later seconds 5, 6, 8 from this minute, and seconds 1 & 2 from a minute later. The 2nd insert must post seconds 5, 6, ignore 8, and post seconds 1 & 2 from the later minute. The process is to first built an insert array (Table Variable) containing the rows to be inserted, and then to do a single insert statement.

The Identities are physical objects, and do grow over time as new objects are brought into the system. Because of the nature of the objects they grow in small batches; say 100 over a 4 hour period, but only every alternate week.

Your comment about using Identity and TimeStamp as a unique Constraint I understand, but note I have not set this as a constraint - rather I check the data before insertion against existing data, and remove the duplicates before an insert is attempted. The reason I do this is I found a constraint violation caused the data insert block to fail (I insert 10-20 rows relating to a specific Identity at a time in a single statement; the data arrives in blocks).

We tried archiving, but it raised more problems than the physical size of the table. the reason is most queries against the table are for a specific Identity (or set of Identities; up to 999 Identities at a time), and the past week. But the user needs to be able to roll up data for a week, or month, but not for a defined month; for a specific time start and end that does not necessarily start on the 1st and end at month end.
A specific Identity has between 1M and 12M rows of data, growing at 10,000+ rows a day. Archiving is something we will consider, but is already being done to an extent (we clean out data that is no longer relevant to an un-linked dump database because conditions relevant to  an Identity change, and the information is no longer of interest - these deletes are batch jobs run when the server is quiet)

I am currently starting with a clean slate (no existing indexes). I created a VM and have set this up as a parallel system; pulling in the raw data stream from the live system and processing it. This provides me the opportunity to be unaffected by existing indexes.
Problems using Powershell and Active Directory?

Managing Active Directory does not always have to be complicated.  If you are spending more time trying instead of doing, then it's time to look at something else. For nearly 20 years, AD admins around the world have used one tool for day-to-day AD management: Hyena. Discover why

Splendid, that was pretty much comprehensive. The reason I mentioned the Unique Constraint and asked about the ID and TS ranges is because SQL Server’s table partitioning only allows for a single column to be used but in your case you obviously need both ID and TS for storage (inserting new rows) as well as retrieval, so you can get the maximum benefit of partitioning if you introduce a new smart key that combines these two fields into one.
A common practice in data-warehousing is to use a time dimension which depending on the required granularity is an integer value like YYDDDnnnnn in which YY can represent 99 years starting from 00 for the first date-stamp in your DB, DDD shows day of year (001-366) and nnnnn represents time of day in seconds (00001-86400). So if you use a BIGINT for your smart-key, you'll need 10 digits for storing time up to a second in a century and still have 9 digits for your IDs, which should be more than enough. You can still store your ID and timestamp as they are, to minimise unnecessary impact on existing queries/reports etc. but now you have a strong candidate for an integer PK, which you can use for partitioning your table.
The only decision is whether make the smart key as TS+ID or ID+TS and that's why I  asked you about any existing archiving process. For instance if you archiving is date-based then you can use TS+ID for partitioning which will allow you to easily switch out an old partition with minimal impact on availability/performance of the main table and move the separated partition into an archive at your own convenience. On the other hand if your archiving was based on ID or ID ranges, then obviously ID+TS  as a partition key would be more useful.
There are a quite a few useful articles/books out there about nuances of partitioning, Kimberly Tripp, Kalen Delaney and Lynn Langit are all very knowledgable on the topic and have published very useful articles that you can find easily by googling. I also came across a recent article by Brent Ozar about this topic (http://www.brentozar.com/archive/2012/08/potential-problems-partitioning/). The utmost authority on smart-keys is obviously the legendary Kimball.
Of course you can count me too... I'll be happy if I could help more. : )
Bird757Author Commented:
I have been considering the arrangement of PK along the lines you have proposed, and need to disclose a little more about the actual application (and based on your comment of your databases I suspect we are in similar fields). The data is Telemetry data, and the Identity is the Dallas-chip ID of each device. This is significant for 2 reasons; first, I have a BigInt ID (the device id) - and although I have considered a lookup to map the BigInt to a normal Int (since we shouldn't exceed 2e9 devices), this is not practical because the data arrives on different servers around the world, and is processed. A secondary system collects data from these servers - which are also being interrogated independently by users - and creates the consolidated large database (which is structurally identical to the small satellite systems).
That means, in short, a device with ID xxxxx can drop data onto one of a number of satellite collection servers, and the only unique element of this data is the physical Dallas-ID chip on the device. So hence my Primary Key needs to contain a BigInt (for the device), and Time to the second.

That raises the next issue. Device X may have data as recent as "now", but Device Y has not sent data for the past month (out of range). But at some point in the future Device Y is going to connect, and drop data for the missing time. And to make it worse, it may drop 1 packet, containing current data, and an hour, or day later drop the data spanning the past month! Making the structure of this partition very important.

I think I am beginning to understand that the intention is to create a single column that can behave as a primary key. I am still trying to understand how to achieve this because what I originally set out to do was create a horizontal "split" in the index; where the often-hit part contained the Device-Identities, and the past month block of times specific to that Device-Identity. i.e. Device xxxxx and its times from now to a month ago, and Device yyyyy and its times for a month, but the latest time for that device is a week ago.

Maybe this is unrealistic or not achievable...

The end goal I am trying to achieve is to resolve this problem (as identified by an optimisation analysis of our current server):
Index Searches/sec       = average 1,022
Full Scans/ sec          = average 109.033
Page reads/sec           = average 163.902
Buffer cache hit ratio   = average 99.953
Checkpoint page/sec      = average 8.357
Lazy writes/sec            = average 0.124
Page life expectancy      = average 25,425
Page writes/sec            = average 129.263
Total Server Memory(KB) = average 12,578,757
Target server Memory(KB)= average 12,582,920
Batch Reguest/sec      = average 33.852
SQL Re-compilations/sec = average 0.081
SQL Compilations/sec      = average 0.155

For the index searches or Full scan/sec, the result quite high as what I'm looking for is 1 Full scan/sec per 1000 index searches/sec. The result above shows more than 100 Full scan/sec per 1000 index searches/sec.

The page writes/sec counter return higher that it expected and it might be contributed by poor indexing. This conclusion comes from the Full Index scan and index searches/sec above.

The lazy writes/sec result is also good as it below the recommended value which is 20.

Page life expectancy suppose to be over 300 but from the perfmon on your server, the result is over 25000 on average which is very good.  

Buffer cache hit ration result is also good.

The result for SQL Compilations/sec is good as the value is less than 10% of the number of Batch request/sec. And for SQL Re-compilation/sec is a bit high  as the value is above 10 % of the value from SQL compilations/sec.
Hi again, Based on what you mentioned, it seems that you don't have too many choices but to partition your table based on the Chip ID; but even that is better than having a large clustered index in one partition.  Having said that in the case of our very large tables, most of the partition ranges contain 100M or 200M rows. So 500M rows is not really that bad. Your stats also confirm this as most of the figures are pretty good and within the recommended thresholds. Even the relatively high avg full scans/sec doesn't necessarily mean you have a serious issue with the normal operations on your DB; only one crazy query per day by a careless user or a not-so-efficiently-implemented report per day can sway that average!

If you're working on a parallel platform you can go ahead and test the impact of partitioning just by ID. Even if you don't find a way to include the time-stamp in your partition key, you'll at least manage to replace an index scan on the whole table with an index scan on a single partition. Also with proper padding and regular index rebuilds, you'll be able to maintain efficiency and performance of your indexes between two maintenance jobs.

Another point to consider while re-partitioning your table is to use data_compression; the slight CPU overhead of data_compression is often more than compensated by reduced I/O which almost always results in performance gains (on top of saving storage). This will surely improve your I/O stats.
Scott PletcherSenior DBACommented:
Do you (nearly) always specify the DeviceId when querying the table?  

Do you tend to process rows the same deviceids or for the same time?

Partitions would be most useful here just to allow individual partitions to be rebuilt.  Otherwise properly clustering the table should be enough.

Why is Time Descending in the key?  Do you very often sort by time descending?
>>Do you (nearly) always specify the DeviceId when querying the table?  
No, but you can have other (non-clustered) indexes that are not partitioned or partitioned based on a different partition function.

>>Do you tend to process rows the same deviceids or for the same time?
Not necessarily but in the context of the most common scenarios it would simplify the process or improve the efficiency significantly.

>>Partitions would be most useful here just to allow individual partitions to be rebuilt.  Otherwise properly clustering the table should be enough.
No, after a certain point, the cost of inserting into a single-partition clustered index can become too high, for instance when every time you pass the threshold of the capacity of the b-tree, it has to restructure the whole index and for a multi-billion rows table that won't be a trivial feat.

>>Why is Time Descending in the key?  Do you very often sort by time descending?
Not sure, where I implied desc order for the time?
The primary queries You should be concerned about are those required for the main operations, i.e. inserts/updates/checking uniqueness, etc. It'd be much easier and more practical to design and maintain a separate data-warehouse for heavy reports and/or analytical/historical reports, data mining etc. The key is to keep absolute minimum data necessary for the operational DBs in the OLTP system (both for efficiency of the normal operations as well as facilitating HA and DR scenarios) and move all the historical data away into archive DBs, where they really belong.
In our case, the operational data (5-10B rows) is the last 12 months worth of the data, and although the Devices are never deleted, but dependant transactions older than 12 months get removed - by which time, they're already in the analytical DB as with a fairly small latency ( < 12 hours ) the analytical DBs are kept in sync with the main operational (OLTP) system.
Scott PletcherSenior DBACommented:

My q's were actually for the OP, but one thing you said needs to be cleared up because this particular misconception continues to be propagated:

>> for instance, when every time you pass the threshold of the capacity of the b-tree, it has to restructure the whole index <<

100% false.  The B+ trees used in SQL Server never have to be rebalanced because of INSERTs (although technically a lot of DELETEs could make a rebalancing required, that is a much rarer case, and often a table is rebuilt after large number of DELETEs anyway).  Think about it: it wouldn't be possible for major dbs to run consistently well if they periodically had to re-balance billions of rows.
[One reference: http://blogs.microsoft.co.il/itaybraun/2008/07/24/b-tree-binary-tree-and-indexes-architecture/]


Sorry, reading back, I see that you already answered my first set of qs:
>> the reason is most queries against the table are for a specific Identity (or set of Identities; up to 999 Identities at a time), and the past week. <<

Then identity should be the first clustering key, followed by time (ASC), rather than time first.  This will greatly reduce the overhead of identity / deviceid lookups and their related time range(s).

You might still want to partition here just so that you can split the older section from the current, more active section.  Then you have reduce the FILLFACTOR in the current section, to say 70-80 percent initially, but leave it at 99-100 percent in the historical section; even though the historical section will get some new rows as well, it's not worth all the wasted space to reduce the FILL there.  I'm assuming you can rebuild the current partition ONLINE, so that the FILL can be effectively redistributed without taking the table offline.
Bird757Author Commented:
Thanks for the posts. Let me summarise:
I found a method for creating the partition function (what I am going to refer to as the template) in Books Online, and after some manipulation applied it to my data set. I believe there is still a degree of optimisation here because what I did was create a "Template" containing the past 30 day dates, and technically my data is primarily in the past hour although one in 100 will be from a previous day, and 1 in 1000 from the past month.
Then I found a pair of procedures that adjust the function to add "tomorrow" and delete the oldest day - and adapted this to first check if tomorrow is a date with active data (so this procedure pair can run at random, and only adjust the partition function when a new day arrives).

I then created the index as
ON [dbo].[MyBigTable]([dDateCol], [iIDCol])
 ON  pfMyDailyFunctionTemplate ([dDateCol])

The trick was to understand the creation of the partitioning function (pfMyDailyFunctionTemplate), and then applying the correct structure to the index creation.

I have now processed about a million rows in batches of 64k, changing the index on [MyBigTable] to see what structure works, and the result is 4 different partitioned indexes using the same partitioning template on the 2 that contain dates; one index (above) being a cluster to provide a physical order to the table, and then another that uses only date (I have a MAX(Date) query that was causing problems). The nett effect is my disk IO has gone from Millions of reads/sec to around 10k reads/second - a massive improvement. Fragmentation of the indexes also used to be a problem. Now, with the partitioned index the leading (current date) partition is obviously very fragmented, but it is a small index relative to the table as a whole.

So, partitioning the index has helped. Time and further testing will hopefully tell me if creating a partition that is a Date, and an Identity where sonly the date is partitioned is adequate. But my server is looking a lot happier already.

Thanks to all who posted.
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Making Bulk Changes to Active Directory

Watch this video to see how easy it is to make mass changes to Active Directory from an external text file without using complicated scripts.

  • 4
  • 3
  • 2
  • +1
Tackle projects and never again get stuck behind a technical roadblock.
Join Now