why small number records table grow so big?

backgroud: I did a project two years ago for a client, the database is very tiny so I didn't bother to optimize/maintain it.

the autovacuum is not started ( here is the log information):
2010-10-19 14:59:34 EDT   WARNING:  autovacuum not started because of misconfiguration
2010-10-19 14:59:34 EDT   HINT:  Enable options "stats_start_collector" and "stats_row_level".

basically this postgresql 8.1.9 server  ( Suse Linux Server OS)  is running without maintenance for two years. the table with biggest records is a increase-only log table, which has about only 80,000 records.

Recently it started to have performance issue, so I revisited the server, and found the pgsql/data/dboid/  folder has increased to 3G now. Using "SELECT relfilenode, relpages,relname FROM pg_class" query,  I found one table vmserver has used more than 2G space ( over 1G for table, 1G for a index vmserver_idx1.)

here is the table definition:

 \d vmserver;
                                        Table "public.vmserver"
   Column    |            Type             |                         Modifiers
 id          | bigint                      | not null default nextval(('vmserver_seq'::text)::regclass)
 name        | character varying(100)      | not null default ''::character varying
 ip_address  | character varying(100)      | not null default ''::character varying
 cpu_percent | integer                     | default 0
 ram_percent | integer                     | default 0
 isactive    | boolean                     | default false
 last_update | timestamp without time zone | default ('now'::text)::timestamp without time zone
    "vmserver_ppk" PRIMARY KEY, btree (ip_address)
    "vmserver_idx2" UNIQUE, btree (name)
    "vmserver_idx1" btree (ip_address, last_update)

It has 6 records : ( Yes, 6 records.)
 select * from vmserver;
 id |  name   |   ip_address   | cpu_percent | ram_percent | isactive |        last_update
  5 | server1 | |          13 |           78 | t        | 2010-11-17 07:58:22.950904
  4 | server3  | |          11 |           75 | t        | 2010-11-17 07:58:28.743459
  1 | server2  | |          24 |           88 | t        | 2010-11-17 07:58:33.780282
  3 | server6  | |          15 |           92 | t        | 2010-11-17 07:58:34.781151
  6 | server7  | |          46 |           88 | t        | 2010-11-17 07:58:39.104961
  2 | server4  | |          14 |           87 | t        | 2010-11-17 07:58:41.794194

There are a few application access this database:
1. a .Net application , which only run select query again it:
 DataSet ds = db.SelectQuery("select 'Server '||id||':'||name||':'||ip_address as name, id from vmserver where isactive=true order by id ");

2. a perl script running on each server to report their status every 20 seconds, It only run two querys:
SELECT id,cast (date_part('epoch',now()-last_update) as int ) as update_time FROM vmserver where ip_address='$ip'
update vmserver set cpu_percent=100-$cpu_idle,last_update=now(),ram_percent=$ram_percent  where ip_address='$ip'

3. two store procedure which will be called by a web application also SELECT this table.

So I run reindex "reindex database mydb"; now the index file shrinked to 16K. but the vmserver table is still over 1G ( it actually has two data file now since postgres database can only use 1G data file). I run "vacuum vmserver;", I am supprised, it didn't shink the file.
run "SELECT relfilenode, relpages,relname FROM pg_class where relname='vmserver' " query,  the relpages number is still about 160000 after I vacuum it.

Finally I run:
create table backup_vm_server as select * from vmserver;
TRUNCATE vmserver ;
insert into vmserver select * from backup_vm_server ;

This shrinked the table file.

Question 1: Why the vmserver_idx1 grow so huge? I know I should drop it ( it's just a habid I want every select query running on index, seems very bad idea in this case, since the table is so small and frenquently change on index field.) I will drop it after I fully understand it.  Although last_update is keep changing, I still think the database should be smart enough not let the index file grow to 1G. Anybody can explain it?

Question 2: Why vmserver table grow so huge with 6 records? All the two years, There is no records insert/deleted, only select/update. This is even weird than first question. And the vacuum won't correct it ( the vacuum command took almost 10 minutes.)

I don't need any solution, I can turn  on the autovacuum or setup a cronjob or something to do maintenance, I just need to have a better understanding why this is happenning.


Who is Participating?
It is really very reasonable, it is way more performant to just add a bunch of data to the end of the file and invalidate a region in then to actually seek to the position and overwrite the data.

As well you have to take into account that you are using a transaction based database. That means you can do:

begin transaction;
update a million rows;
wait for 1 hour;
rollback transaction;

what this above does is it updates a million rows. Now imagine you have 20 users accessing the same database. In the time you do your transaction (until the transaction is commited or rollbacked) all the other users need to be able to access the old data, if they where not able to access them, their sessions would just plain hang.

The other part is, imagine you really overwrite all the data and want to do the rollback, how will you do it? ReOverwrite the written data? -> the way postgres is implemented it just tells the system to discard the last updates, and life goes an as if nothing has happend.

In the moment where you COMMIT your million updates, the new data (located now at the end of the file) get valid for all other session and in the same time the old values (before the update) get invalidated so in this instant everyone sees the new data.

It is a complex and complicated system, works remarkable well and I am sure Oracle, DB2 and MSSql are doing the same thing because it's the only logical way to go. The only difference might be how they use the discarded data and when and how they recoup the dead tuples.

A lot of my wisdom comes from (there is really everything you need to know about postgresql included):

the vacuum part:

That is all very easy to explain:

in postgresql, an update does internally:

so for every update you have have one dead tuple (the deleted one). For performance reasons deleted tuples are only marked as deleted but not deleted from the filesystem. The vacuum - process recoups this space.

-- pg-doc --
VACUUM reclaims storage occupied by deleted tuples. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables.
-- pg-doc --

So you have to run vacuum frequently (best solution is to run the autovacuum daemon).


regarding the index, an index on a small table, where the whole table fit's into a page, rather slows down than speed things up, because the sequential search through one memory page, which is in memory, is nothing compared to what has to be done with the index:

It has to be maintained and upgraded on every update
It has to be shrunk with vacuum
It cost's more to search in the index than sequential in one memory page

you can use:

explain <sql-query>

in postgres and read/learn a bit about interpreting the output, this helps a lot to find the most performant versions of your queries.

jackiechen858Author Commented:
Thanks so much Hatrix76. I agree I should delete the index. also I will setup autovacuum. It's interesting to learn that Postgresql do DELETE&INSERT for UPDATE, I tried to find some information about it, but couldn't.

Would you mind to tell me where you get this information? Is this the unique implementation for Postgresql database, or common practice for all the databases?  Doesn't sound very reasonable to me.


jackiechen858Author Commented:
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.