Tool: Postgres Monitoring System

The Postgres Monitoring System (PgMS) is a comprehensive monitoring system that collects data from a postgres server and stores it in another postgres database. PgMS also includes a web interface to display graphs of the data.

Data PgMS Collects:

Logical IO
- Cache Reads/second
Physical IO
- Disk Reads/second
- Disk Writes/second
System Statistics
- CPU usage
- Load Average
- Free Memory
Disk Space Used
- Per Database
- Per Table
Sessions
- Active
- Idle
Call Rates
- Rows Returned/second
Miss Rates
Uptime (Optional)
- Downtime event logging
- Uptime report

System Requirements:

PostgreSQL Database 9.0 and later*
Graphing System:
- Apache 2.4
- PHP 5.4
Data Collection:
- FreeBSD, Linux**
- Perl 5.16
  - Config::Tiny
  - DBI

Notes:
* It will probably work on earlier versions, but the collection script may need tweaking.
** Operating System matters for collecting system information like CPU and memory usage.

Installation Steps:

Extract PgMS.tar.gz into the apache directory where you want it to run from
Configure the settings in pgms_conf.ini
Load CreatePMS.sql into the database server where you want the PgMS data
Login to the postgres server as an admin and update the passwords for pmsuser and pmssite (or whatever you chose in step 1.)
Once the users are setup you can begin collecting data:
5a) Try running the pgmonitor_populate.pl script to make sure you can connect to the new database. Kill it if it's successful with control + C.
5b) Add the following cron entry for pgmonitor_populate.pl to run continuously
*/1 * * * * /usr/local/PgMS/pgmonitor_populate.pl 5432
The pgmonitor_populate.pl script needs to run on every machine you want monitored.
5c) If you want to collect downtime events then you can also add the following:
0 * * * * /usr/local/www/PgMS/find_downtime.pl -nd
This only needs to run on one machine and it will find all downtime events.
Open a webpage and go to <yourdomain>/<yourpath>/PostgresMonitor.php
Optional - Set specific graph options in the pms_config table. These are mainly for setting a maximum Y-axis value for the graphs on the main page.
Server_id = 0 are global options (See table below).
Optional - Set specific warning thresholds in the warning_levels table.
Optional - Set specific uptime requirements in the uptime_target and uptime_requirement tables.

=> \d pms_config
           Table "public.pms_config"
  Column   |         Type          | Modifiers
-----------+-----------------------+-----------
 server_id | integer               | not null
 name      | character varying(25) | not null
 value     | character varying(25) | not null
Foreign-key constraints:
    "pms_config_server_id_fkey" FOREIGN KEY (server_id) REFERENCES servers(id)

=> select * from pms_config;
 server_id |   name   |  value
-----------+----------+---------
         0 | LogYMax  | 350
         0 | PhyYMax  | 15
         0 | CallYMax | 250
         0 | MissYMax | 25
         0 | MemYMax  | 25
         0 | CallYAbs | 100000
         0 | PhyYAbs  | 100000
         0 | LogYAbs  | 100000
         0 | SizeYMax | 1000000
(9 rows)

Open in new window

Cleaning up old data:

clearDataDaily.php - schedule this to run daily if you don't need more than 30 days of historical data. You can edit the file and change the 30 day limit to whatever you want.
pgms_trim.pl - this will try to combine older data into fewer rows to try and keep the table sizes smaller. I recommend backing up your data before trying to run this.

Using the Web Interface

Title Description

The PgMS title displays three different items.

Program
First, the name of the program (PostgreSQL Monitoring System). This never changes.

Server
The second bit of information displayed is the server that is currently being displayed. When a user chooses a server from the drop down box on in the options menu, the page will be reloaded, and the new server will be displayed in this spot.

Database
The third bit of information displayed is the current database being monitored. By default, all databases are summed together and displayed on the page. If a user chooses to view a specific database in the options menu that databases specific stats will be displayed.

Options Menu

The options menu displays all of the separate options that are provided by the PgMS. There are 6 categories of options. Zoom, Server, Database, History, Other Monitoring, and Help.
Zoom
The zoom options gives the user the option to zoom into graphs either individually, or as a complete group. When zooming with group set, all graphs will zoom into the exact same time frame as selected(no matter what graph you choose to zoom). When zooming individually only the selected graph will zoom, leaving the other graphs untouched. Note: when you drill down to a single graph, the option to zoom by group or individually will disappear. In order to unzoom, select the Unzoom button. This will bring all graphs back to the select history state.

Server
There server option allows the user to select any database that is being monitored. Once a server starts being monitored(one the database population script is started running on a database), that server will appear as an option.

Database
There database drop down allows a user to view All databases, or a select database on a server. All databases are monitored on a system, so if it has any activity on it, it will appear in this list. If a database appears in the list, but presents blank graphs when chosen, it signifies that no data has been collected in the specified history time frame. Try selecting a larger history to see the stats for that database.

History
The history option allows you to view history of 30 minutes, 1, 2, 3, 6, 24, 36 hours, 2, 3 days, and 1 week. If you select a time of 6 hours or greater, the data shown is trimmed. This trimming is controlled by the Compression choice. By default all graphs show the max values displayed, but you may choose to show the average value, or the minimum value. You may also choose the compression option "none" to not compress the data, but this can lead to slow load times and hard to read graphs. For history less than 6 hours all data is displayed, so there is no need to have a compression choice.

Help
Help takes you to this page, in case you ever need to reference anything, or see what a graph means.

Main Overview Page

The main page of PgMS displays eight separate graphs. These display the overall health and performance of the database and system, and are explained in detail below. See first screenshot above.

Logical IO
Logical IO displays the number of reads per second that are coming out of the cache of the database.

Physical IO
Physical IO displays the number of reads coming from disk(in reads per second) and also the number of writes to disk(in writers per second). Postgres has a special checkpoint system which flushes all contents of the shared pool every time a "Checkpoint" is hit. This causes a small spike in the writes/second graph 10-15 minutes(or sooner on busier databases such as tools.test). This is expected behavior.

System Stats
System Stats displays the CPU usage percent, the Load Average, and the Free memory in gigs. The CPU is scaled to the y-axis on the left hand side of the graph, while load averave and Free Memory is scaled to the axis on the right of the graph.

Sessions
Sessions Displays the number of active and Idle sessions attached to the database and any one time. These graphs are stacked, so the number idle graph will display the total number of attached sessions(active+idle), but to see the true number of idle sessions, you can just pan over the graph and it will show you.

Call Rates
Call rates displays the number of rows being returned in the database. This is to show how much SQL activity is currently taking place. This is the best way we can currently monitor how many sql statements are being executed in the database at 1 time.

Miss Rates
The miss rate displays the percentage of total reads(logical and physical) that are coming from disk.

Uptime
Displays database uptime and downtime. Covered in a following section.

Disk Space
This graph displays the size of the database selected, or the size of all the databases combined if "All" is set as the database. If you click on this graph and select a single database it will display the table sizes in a stacked line graph. If "All" is set as the database when you click on the graph it will display a stacked graph of all the database sizes.

Drill Downs

If a user clicks the title of any of the six main graphs, they will be taken to a blown up version of that graph. This graph will have every database on that server broken up into separate graphs, so you can compare the activity of the different databases. Note: The system graph does not have separate databases, as all information is not specific to any database. Data writes are based on the entire database cluster, and not any specific database.

Uptime

This section describes how to use and interpret the Uptime graph.

Colors

Green - Uptime (Database is up, no problems)
Dark Green - Not downtime (PgMS was behind and did not collect data, but the database was still up)
Orange - Scheduled downtime (Taking it down for maintenance, and is approved)
Red - Unscheduled downtime (We didn't expect this, eg: server crash, power outage, etc.)

Selecting a Month, Year, or Specific Dates

Month
Click on the drop-down menu labeled "month" and select your desired month. The currently selected month is on top followed by all the months starting from January.
After selecting a month the page should automatically refresh with the new data.

Year
Click on the drop-down menu labeled "year" and select your desired year. The currently selected year is on top followed by the most recent 5 years.
After selecting a year the page should automatically refresh with the new data.

Specific Dates
Click on the text field labeled "from" or "to" and select a date from the calendar that pops up. In the "from" field it won't let you select a date in the future, but the "to" field allows that. Once you selected the dates you wanted press the "Go" button.
Note: If you select a "from" date that is past the "to" date you select, the page may not load correctly.

Target Uptime Data

There are 5 columns in the Target Uptime table:

Database - The database name (if a specific database is selected up top there will only be one row, otherwise if "All" is selected then there will be a row for each database).
Target Uptime - The percent of time we need the database to be up.
Actual Uptime - The percent of time the database has actually been up.
Target Downtime (min) - The maximum number of minutes we can allow to be down.
Actual Downtime (min) - The actual number of minutes this database has been down.
Potential Total Uptime - The percent of time the database CAN be up if the database is up the rest of the time period.
Pass/Fail - Whether or not the actual uptime meets our target (red background with FAIL, or green text PASS).

Downtime Events

Changing Downtime Status
Click on the drop-down menu in the "Scheduled/Unscheduled" column and select either "Unscheduled," "Scheduled," or "Not Downtime." Once you've changed everything you want to and added comments click either save button (top or bottom of the table).
The system will add a comment for the downtime event noting that you changed the status. The color of the downtime on the graph should also change to the new color relating to the status you chose.
Note: When the database selected is "All" the changes will be saved for each database on the selected server.

Commenting on Downtime Events
Enter as many comments as you want for each/some downtime events. Click save and it will save all of them at once and record your name** and the time posted.
Note: When the database selected is "All" the comments will be saved for each database on the selected server.
** Some extra work is required to capture the username when a user logs in. If you don't have a login page over this network then the username will be blank. The $user variable in the PHP files is how you can set the username.

Collection Scripts and Configuration File

The pgmonitor_populate.pl script is responsible for connecting to both the monitor database and the database to be monitored. It's meant to run continuously and the frequency it collects data is defined in the pgms_conf.ini file. This script is also designed to send alerts via email or syslog when it disconnects or warning thresholds are crossed. Both types of alerting can be turned off in pgms_conf.ini.

The pgms_conf.ini file is where you'll store database connection info like user, password, hostname, and port for both databases. Since this is sensitive information I'd recommend locking down permissions for this file. All the php and perl scripts use the information in this file. If you move it out of the same directory the other files are, you'll have to edit the location variables in the scripts.

The find_downtime.pl script can run periodically to find downtime events. It sees a gap of data collection as a downtime event, and the length of the gap is defined in the file.

Security Recommendations

Lock down permissions on pgms_conf.ini.
Host this in a private network, not public.
Consider restricting access to this tool using htaccess or something similar.
1. Consider writing your own function to capture the username when someone logs in.
Consider encrypting the passwords in pgms_conf.ini and adding a function to decrypt them. This link has some instructions for perl.

Help and Support

If you encounter a bug or need assistance getting this to work, please feel free to start a discussion in the PgMS group! You can also send me a message if I haven't replied to the discussion yet, and I'll try to help out ASAP.

This tool was originally written in 2011 by DBAs and built up over the following years. If you're interested in adding functionality to this tool let me know.

Postgres Monitoring System Tool

Documentation