  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 407

PostgreSQL: accruing information using automated partitioning, dynamically generating tables

Hello,

I'm creating a database front-end system that has to accrue sales data.

At the moment, there is one table that represents discrete transactions, and another that represents the individual items sold. They have a one-to-many relationship. Both tables can contain an arbitrary, potentially massive amount of data (on the order of tens of millions of rows), and the data can only grow, never shrink. 99% of the time, the user will only want to access recent data. The database starts off with no data.
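
For concreteness, here is a minimal sketch of the sort of schema I mean (the table and column names here are just illustrative, not the real ones):

-- Illustrative names only.
CREATE TABLE transactions (
    transaction_id  serial PRIMARY KEY,
    transaction_ts  timestamp NOT NULL DEFAULT now()
);

CREATE TABLE sale_items (
    item_id         serial  PRIMARY KEY,
    transaction_id  integer NOT NULL REFERENCES transactions,
    product_code    text    NOT NULL,
    quantity        integer NOT NULL,
    unit_price      numeric(10,2) NOT NULL
);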

I've been told that this necessitates partitioning. I'd like to create a table that can *dynamically* generate new tables for every, say, month's worth of data, to keep things efficient. This needs to be totally automated and hidden from the user, as this database is used through a front-end by non-technical people.

Here is the manual page for PostgreSQL partitioning: http://www.postgresql.org/docs/current/static/ddl-partitioning.html

My questions are:

1. Is this "the way it's done"? This has to be a common sort of problem, so there must be an approach that is generally favoured. Is this it?
2. If so, how? The manual doesn't cover creating tables dynamically.
3. If not, what should I do instead?

Thanks in advance,
Sternocera
1 Solution
 
grant300 commented:
First, your data volume (tens of millions of rows) is not trivial, but it is not "massive" either, assuming you are not trying to run this on some broken-down old machine.  Admittedly, you don't want the users having to sift through years' worth of data if they are only interested in the most recent data most of the time.

You can automate the process of adding partitions using stored procedures, but I don't believe there is a way to create a table that partitions itself automatically.

The closest you could come is to put an insert trigger on the table and, at a certain date threshold each month, have it create the next month's partition.  Quite frankly, I am not sure that is the best way to go.
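
For illustration, a rough sketch of what such a trigger might look like with the inheritance-based partitioning from the manual page you linked, using the illustrative names from your question (the EXECUTE ... USING form needs PostgreSQL 8.4 or later, and this ignores the race of two sessions creating the same month's table at once):

CREATE OR REPLACE FUNCTION transactions_insert_router() RETURNS trigger AS $$
DECLARE
    part_name   text;
    month_start date;
BEGIN
    part_name   := 'transactions_' || to_char(NEW.transaction_ts, 'YYYY_MM');
    month_start := date_trunc('month', NEW.transaction_ts)::date;

    -- Create the month's child table the first time a row for that month arrives.
    IF NOT EXISTS (SELECT 1 FROM pg_class WHERE relname = part_name) THEN
        EXECUTE 'CREATE TABLE ' || part_name
             || ' (CHECK (transaction_ts >= ''' || month_start || ''''
             || ' AND transaction_ts < ''' || (month_start + interval '1 month') || '''))'
             || ' INHERITS (transactions)';
    END IF;

    -- Redirect the row into the child table and suppress the insert into the parent.
    EXECUTE 'INSERT INTO ' || part_name || ' SELECT ($1).*' USING NEW;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER transactions_partition_insert
    BEFORE INSERT ON transactions
    FOR EACH ROW EXECUTE PROCEDURE transactions_insert_router();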

What you might want to do is opt for a much simpler mechanism, namely two partitions: Current and Archive.  On a periodic basis you move rows from the Current partition to the Archive partition and change the definition (date range) on the partitions so that the query planner knows where things are actually located.  If after two or three years you decide the Archive partition is getting too large, you can create Archive2 and slot it in between Archive and Current.  In fact, you could create yearly partitions but adjust the definitions so that Current always holds the truly current data, effectively carving a hole out of the current year's partition.
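
For illustration, that two-partition layout might look roughly like this with the inheritance approach (again using the illustrative names from the question; the cut-over date is arbitrary):

CREATE TABLE transactions_archive (
    CONSTRAINT transactions_archive_ts_check
        CHECK (transaction_ts < DATE '2008-01-01')
) INHERITS (transactions);

CREATE TABLE transactions_current (
    CONSTRAINT transactions_current_ts_check
        CHECK (transaction_ts >= DATE '2008-01-01')
) INHERITS (transactions);

-- New rows go to Current.  Queries against the parent table see both children,
-- and with constraint_exclusion enabled the planner can skip Archive whenever
-- the query is restricted to recent dates.  Remember that indexes are not
-- inherited, so create them on each child table.
CREATE RULE transactions_insert AS
    ON INSERT TO transactions
    DO INSTEAD INSERT INTO transactions_current VALUES (NEW.*);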

In any event, you can generate and/or regenerate the rules for the partitions using stored procedures that build the text of the rule and then execute it.  You can also create a procedure for migrating the older data from the Current partition to the Archive (or YearX) partition.
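
A sketch of what such a procedure could look like for the Current/Archive layout above (names are the illustrative ones; it assumes the boundary date only ever moves forward, and because it is one function the whole move runs in a single transaction):

CREATE OR REPLACE FUNCTION move_current_to_archive(p_boundary date)
RETURNS bigint AS $$
DECLARE
    moved bigint;
BEGIN
    -- Widen the Archive definition to the new boundary by building the DDL
    -- as text and executing it.
    EXECUTE 'ALTER TABLE transactions_archive DROP CONSTRAINT transactions_archive_ts_check';
    EXECUTE 'ALTER TABLE transactions_archive ADD CONSTRAINT transactions_archive_ts_check'
         || ' CHECK (transaction_ts < ''' || p_boundary || ''')';

    -- Move the older rows from Current to Archive.
    INSERT INTO transactions_archive
        SELECT * FROM transactions_current WHERE transaction_ts < p_boundary;
    GET DIAGNOSTICS moved = ROW_COUNT;
    DELETE FROM transactions_current WHERE transaction_ts < p_boundary;

    -- Narrow the Current definition to match the new boundary.
    EXECUTE 'ALTER TABLE transactions_current DROP CONSTRAINT transactions_current_ts_check';
    EXECUTE 'ALTER TABLE transactions_current ADD CONSTRAINT transactions_current_ts_check'
         || ' CHECK (transaction_ts >= ''' || p_boundary || ''')';

    RETURN moved;
END;
$$ LANGUAGE plpgsql;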

If I were you, I would start with the Current and Archive partitions, build the tools necessary to manipulate that simple scheme, and then enlarge it if and when that becomes necessary.

Regards,
Bill
 
sternocera (Author) commented:
Bill,

At a push, I can see the second table (the "many" table) containing about 20 million rows, but probably more like 5 million. Since, as you've said, that isn't an unreasonable amount of data for a modern CPU to sift through, the Current and Archive solution seems very attractive. Also, the user will need the archived data very infrequently, so the fact that accessing it is possibly sub-optimal is not a concern.

Fiddling after the fact isn't really an option, because this thing has to work in the field, largely without support.

Moving a few tens of thousands of rows from Current to Archive shouldn't be too computationally expensive, right? I'd like it to happen almost instantaneously, ideally in under 5 seconds on a modern CPU. Not having much experience with databases, I'm not sure whether that's realistic. Please advise.

Anyway, I think that you addressed my main concern, so I'll give you your points. Thanks.

Regards,
Sternocera

 
grant300 commented:
You have to move a logical chunk of rows at a time, based on date, so you can update the rules (or triggers) that maintain the partitions.

Moving 50K rows in 5 seconds is probably unrealistic, but 20 to 25 seconds is reasonable.  If you need to make it faster than that, you probably have to cut the number of rows down, perhaps by doing the move to the archive once per day.  Remember, no matter how much CPU you have, you still have disk I/O to deal with, and disks, while they have gotten bigger, definitely have not gotten much faster.
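
As a usage sketch, the daily move could then be a single call to a migration procedure like the one sketched earlier, fired from cron (via psql) or from the application at a quiet time of day (names hypothetical):

-- Keep roughly the last 30 days in Current; run once per day.
SELECT move_current_to_archive(current_date - 30);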

BTW, we have had very good success on the two-chip dual-core Opteron machines.  They have great memory bandwidth and low latencies and really scream.  IBM makes a very nice, reasonably priced box called the x3655 that will take up to 8 drives in a 2-U form factor.

VERY IMPORTANT:  You have to use a hardware RAID controller with a battery-backed cache.  Even if you don't RAID the disks, setting the controller up with write caching makes these little boxes much, much faster.

Regards,
Bill
