Monitoring through SQL v5

EDB Postgres Distributed provides several monitoring and statistics views that are specific to its distributed nature. The standard Postgres monitoring is also useful for monitoring EDB Postgres Distributed.

Monitoring Overview

A BDR Group consists of multiple servers, often referred to as nodes. All of the nodes need to be monitored to ensure the health of the whole group.

The bdr_monitor role may execute the bdr.monitor functions to provide an assessment of BDR health using one of three levels:

  • OK - often shown as Green
  • WARNING - often shown as Yellow
  • CRITICAL - often shown as Red
  • as well as UNKNOWN - for unrecognized situations, often shown as Red

BDR also provides dynamic catalog views that show the instantaneous state of various internal metrics and also BDR metadata catalogs that store the configuration defaults and/or configuration changes requested by the user. Some of those views and tables are accessible by bdr_monitor or bdr_read_all_stats, but some contain user or internal information that has higher security requirements.

BDR allows you to monitor each of the nodes individually, or to monitor the whole group by access to a single node. If you wish to monitor each node individually, simply connect to each node and issue monitoring requests. If you wish to monitor the group from a single node then use the views starting with bdr.group since these requests make calls to other nodes to assemble a group-level information set.

If you have been granted access to the bdr.run_on_all_nodes() function by bdr_superuser then you may make your own calls to all nodes.

Monitoring Node Join and Removal

By default, the node management functions wait for the join or part operation to complete. This can be turned off using the respective wait_for_completion function argument. If waiting is turned off, then to see when a join or part operation finishes, check the node state indirectly via bdr.node_summary and bdr.event_summary.

When called, the helper function bdr.wait_for_join_completion() will cause a PostgreSQL session to pause until all outstanding node join operations complete.

Here is an example output of a SELECT query from bdr.node_summary that indicates that two nodes are active and another one is joining:

# SELECT node_name, interface_connstr, peer_state_name,
#     node_seq_id, node_local_dbname
# FROM bdr.node_summary;
-[ RECORD 1 ]-----+-----------------------------------------
node_name         | node1
interface_connstr | host=localhost dbname=postgres port=7432
peer_state_name   | ACTIVE
node_seq_id       | 1
node_local_dbname | postgres
-[ RECORD 2 ]-----+-----------------------------------------
node_name         | node2
interface_connstr | host=localhost dbname=postgres port=7433
peer_state_name   | ACTIVE
node_seq_id       | 2
node_local_dbname | postgres
-[ RECORD 3 ]-----+-----------------------------------------
node_name         | node3
interface_connstr | host=localhost dbname=postgres port=7434
peer_state_name   | JOINING
node_seq_id       | 3
node_local_dbname | postgres

Also, the table bdr.node_catchup_info will give information on the catch-up state, which can be relevant to joining nodes or parting nodes.

When a node is parted, it could be that some nodes in the cluster did not receive all the data from that parting node. So it will create a temporary slot from a node that already received that data and can forward it.

The catchup_state can be one of the following:

10 = setup
20 = start
30 = catchup
40 = done

Monitoring Replication Peers

There are two main views used for monitoring of replication activity:

Most of the information provided by bdr.node_slots can be also obtained by querying the standard PostgreSQL replication monitoring views pg_catalog.pg_stat_replication and pg_catalog.pg_replication_slots.

Each node has one BDR group slot which should never have a connection to it and will very rarely be marked as active. This is normal, and does not imply something is down or disconnected. See Replication Slots created by BDR.

Monitoring Outgoing Replication

There is an additional view used for monitoring of outgoing replication activity:

The bdr.node_replication_rates view gives an overall picture of the outgoing replication activity along with the catchup estimates for peer nodes, specifically.

# SELECT * FROM bdr.node_replication_rates;
-[ RECORD 1 ]----+-----------
peer_node_id     | 112898766
target_name      | node1
sent_lsn         | 0/28AF99C8
replay_lsn       | 0/28AF99C8
replay_lag       | 00:00:00
replay_lag_bytes | 0
replay_lag_size  | 0 bytes
apply_rate       | 822
catchup_interval | 00:00:00
-[ RECORD 2 ]----+-----------
peer_node_id     | 312494765
target_name      | node3
sent_lsn         | 0/28AF99C8
replay_lsn       | 0/28AF99C8
replay_lag       | 00:00:00
replay_lag_bytes | 0
replay_lag_size  | 0 bytes
apply_rate       | 853
catchup_interval | 00:00:00

The apply_rate above refers to the rate in bytes per second. It is the rate at which the peer is consuming data from the local node. The replay_lag when a node reconnects to the cluster is immediately set to zero. We are working on fixing this information; as a workaround, we suggest you use the catchup_interval column that refers to the time required for the peer node to catch up to the local node data. The other fields are also available via the bdr.node_slots view, as explained below.

Note

This catalog is only present when bdr-enteprise extension is installed.

Administrators may query bdr.node_slots for outgoing replication from the local node. It shows information about replication status of all other nodes in the group that are known to the current node, as well as any additional replication slots created by BDR on the current node.

# SELECT node_group_name, target_dbname, target_name, slot_name, active_pid,
#     catalog_xmin, client_addr, sent_lsn, replay_lsn, replay_lag,
#     replay_lag_bytes, replay_lag_size
# FROM bdr.node_slots;
-[ RECORD 1 ]---+----------------------------
node_group_name | bdrgroup
target_dbname   | postgres
target_name     | node3
slot_name       | bdr_postgres_bdrgroup_node3
active_pid      | 15089
catalog_xmin    | 691
client_addr     | 127.0.0.1
sent_lsn        | 0/23F7B70
replay_lsn      | 0/23F7B70
replay_lag      | [NULL]
replay_lag_bytes| 120
replay_lag_size | 120 bytes
-[ RECORD 2 ]---+----------------------------
node_group_name | bdrgroup
target_dbname   | postgres
target_name     | node2
slot_name       | bdr_postgres_bdrgroup_node2
active_pid      | 15031
catalog_xmin    | 691
client_addr     | 127.0.0.1
sent_lsn        | 0/23F7B70
replay_lsn      | 0/23F7B70
replay_lag      | [NULL]
replay_lag_bytes| 84211
replay_lag_size | 82 kB

Note that because BDR is a mesh network, to get full view of lag in the cluster, this query has to be executed on all nodes participating.

replay_lag_bytes reports the difference in WAL positions between the local server's current WAL write position and replay_lsn, the last position confirmed replayed by the peer node. replay_lag_size is just a human-readable form of the same. It is important to understand that WAL usually contains a lot of writes that are not replicated but still count in replay_lag_bytes, including VACUUM activity, index changes, writes associated with other databases on the same node, writes for tables that are not part of a replication set, etc. So the lag in bytes reported here is not the amount of data that must be replicated on the wire to bring the peer node up to date, only the amount of server-side WAL that must be processed.

Similarly, replay_lag is not a measure of how long the peer node will take to catch up, or how long it will take to replay from its current position to the write position at the time bdr.node_slots was queried. It measures the delay between when the peer confirmed the most recent commit and the current wall-clock time. We suggest that you monitor replay_lag_bytes and replay_lag_size or catchup_interval in bdr.node_replication_rates, as this column is set to zero immediately after the node reconnects.

The lag in both bytes and time does not advance while logical replication is streaming a transaction. It only changes when a commit is replicated. So the lag will tend to "sawtooth", rising as a transaction is streamed, then falling again as the peer node commits it, flushes it, and sends confirmation. The reported LSN positions will "stair-step" instead of advancing smoothly, for similar reasons.

When replication is disconnected (active = 'f'), the active_pid column will be NULL, as will client_addr and other fields that only make sense with an active connection. The state field will be 'disconnected'. The _lsn fields will be the same as the confirmed_flush_lsn, since that is the last position that the client is known for certain to have replayed to and saved. The _lag fields will show the elapsed time between the most recent confirmed flush on the client and the current time, and the _lag_size and _lag_bytes fields will report the distance between confirmed_flush_lsn and the local server's current WAL insert position.

Note: It is normal for restart_lsn to be behind the other lsn columns; this does not indicate a problem with replication or a peer node lagging. The restart_lsn is the position that PostgreSQL's internal logical decoding must be reading WAL at if interrupted, and generally reflects the position of the oldest transaction that is not yet replicated and flushed. A very old restart_lsn can make replication slow to restart after disconnection and force retention of more WAL than is desirable, but will otherwise be harmless. If you are concerned, look for very long running transactions and forgotten prepared transactions.

Monitoring Incoming Replication

Incoming replication (also called subscription) can be monitored by querying the bdr.subscription_summary view. This shows the list of known subscriptions to other nodes in the EDB Postgres Distributed cluster and the state of the replication worker, e.g.:

# SELECT node_group_name, origin_name, sub_enabled, sub_slot_name,
#     subscription_status
# FROM bdr.subscription_summary;
-[ RECORD 1 ]-------+----------------------------
node_group_name     | bdrgroup
origin_name         | node2
sub_enabled         | t
sub_slot_name       | bdr_postgres_bdrgroup_node1
subscription_status | replicating
-[ RECORD 2 ]-------+----------------------------
node_group_name     | bdrgroup
origin_name         | node3
sub_enabled         | t
sub_slot_name       | bdr_postgres_bdrgroup_node1
subscription_status | replicating

Monitoring WAL senders using LCR

If the Decoding Worker is enabled, information about the current LCR (Logical Change Record) file for each WAL sender can be monitored via the function bdr.wal_sender_stats, e.g.:

postgres=# SELECT * FROM bdr.wal_sender_stats();
   pid   | is_using_lcr |       decoder_slot_name       |              lcr_file_name
---------+--------------+-------------------------------+------------------------------------------
 2059904 | f            |                               |
 2059909 | t            | bdr_postgres_bdrgroup_decoder | 0000000000000000000000140000000000000000
 2059916 | t            | bdr_postgres_bdrgroup_decoder | 0000000000000000000000140000000000000000
(3 rows)

If is_using_lcr is FALSE, decoder_slot_name/lcr_file_name will be NULL. This will be the case if the Decoding Worker is not enabled, or the WAL sender is serving a logical standby.

Additionally, information about the Decoding Worker can be monitored via the function bdr.get_decoding_worker_stat, e.g.:

postgres=# SELECT * FROM bdr.get_decoding_worker_stat();
   pid   | decoded_upto_lsn | waiting | waiting_for_lsn
---------+------------------+---------+-----------------
 1153091 | 0/1E5EEE8        | t       | 0/1E5EF00
(1 row)

Monitoring BDR Replication Workers

All BDR workers show up in the system view bdr.stat_activity, which has the same columns and information content as pg_stat_activity. So this view offers these insights into the state of a BDR system:

  • The wait_event column has enhanced information, if the reason for waiting is related to BDR.
  • The query column will be blank in BDR workers, except when a writer process is executing DDL

The bdr.workers view shows BDR worker specific details, that are not available from bdr.stat_activity.

The view bdr.event_summary shows last error (if any) reported by any worker which has a problem continuing the work. This is persistent information, so it's important to note the time of the error not just the existence of one, because most errors are transient in their nature and BDR workers will retry the failed operation.

Monitoring BDR Writers

There is another system view bdr.writers to monitor writer activities. This views shows the current status of only writer workers. It includes:

  • sub_name to identify the subscription which the writer belongs to
  • pid of the writer process
  • streaming_allowed to know if the writer supports application of in-progress streaming transactions
  • is_streaming to know if the writer is currently applying a streaming transaction
  • commit_queue_position to check the position of the writer in the commit queue.

BDR honours commit ordering by following the same commit order as happened on the origin. In case of parallel writers, multiple writers could be applying different transactions at the same time. The commit_queue_position shows in which order they will commit. Value 0 means that the writer is the first one to commit. Value -1 means that the commit position is not yet known. This can happen for a streaming transaction or when the writer is not applying any transaction at the moment.

Monitoring Global Locks

The global lock, which is currently only used for DDL replication, is a heavyweight lock that exists across the whole BDR group.

There are currently two types of global locks:

  • DDL lock, used for serializing all DDL operations on permanent (not temporary) objects (i.e. tables) in the database
  • DML relation lock, used for locking out writes to relations during DDL operations that change the relation definition

Either or both entry types may be created for the same transaction, depending on the type of DDL operation and the value of the bdr.ddl_locking setting.

Global locks held on the local node are visible in the bdr.global_locks view. This view shows the type of the lock; for relation locks it shows which relation is being locked, the PID holding the lock (if local), and whether the lock has been globally granted or not. In case of global advisory locks, lock_type column shows GLOBAL_LOCK_ADVISORY and relation column shows the advisory key(s) on which the lock is acquired.

The following is an example output of bdr.global_locks while running an ALTER TABLE statement with bdr.ddl_locking = on:

# SELECT lock_type, relation, pid FROM bdr.global_locks;
-[ RECORD 1 ]--------------
lock_type | GLOBAL_LOCK_DDL
relation  | [NULL]
pid       | 15534
-[ RECORD 2 ]--------------
lock_type | GLOBAL_LOCK_DML
relation  | someschema.sometable
pid       | 15534

See the catalog documentation for details on all fields including lock timing information.

Monitoring Conflicts

Replication conflicts can arise when multiple nodes make changes that affect the same rows in ways that can interact with each other. The BDR system should be monitored to ensure that conflicts are identified and, where possible, application changes are made to eliminate them or make them less frequent.

By default, all conflicts are logged to bdr.conflict_history. Since this contains full details of conflicting data, the rows are protected by row-level security to ensure they are visible only by owners of replicated tables. Owners should expect conflicts and analyze them to see which, if any, might be considered as problems to be resolved.

For monitoring purposes use bdr.conflict_history_summary, which does not contain user data. An example query to count the number of conflicts seen within the current day using an efficient query plan is:

SELECT count(*)
FROM bdr.conflict_history_summary
WHERE local_time > date_trunc('day', current_timestamp)
  AND local_time < date_trunc('day', current_timestamp + '1 day');

Apply Statistics

BDR collects statistics about replication apply, both for each subscription and for each table.

Two monitoring views exist: bdr.stat_subscription for subscription statistics and bdr.stat_relation for relation statistics. These views both provide:

  • Number of INSERTs/UPDATEs/DELETEs/TRUNCATEs replicated
  • Block accesses and cache hit ratio
  • Total I/O time for read/write
  • Number of in-progress transactions streamed to file
  • Number of in-progress transactions streamed to writers
  • Number of in-progress streamed transactions committed/aborted

and for relations only, these statistics:

  • Total time spent processing replication for the relation
  • Total lock wait time to acquire lock (if any) for the relation (only)

and for subscriptions only, these statistics:

  • Number of COMMITs/DDL replicated for the subscription
  • Number of times this subscription has connected upstream

Tracking of these statistics is controlled by the BDR GUCs bdr.track_subscription_apply and bdr.track_relation_apply respectively.

The example output from these would look like this:

# SELECT sub_name, nconnect, ninsert, ncommit, nupdate, ndelete, ntruncate, nddl
FROM bdr.stat_subscription;
-[ RECORD 1 ]----------------------------------
sub_name  | bdr_regression_bdrgroup_node1_node2
nconnect  | 3
ninsert   | 10
ncommit   | 5
nupdate   | 0
ndelete   | 0
ntruncate | 0
nddl      | 2

In this case the subscription connected 3 times to the upstream, inserted 10 rows and did 2 DDL commands inside 5 transactions.

Stats counters for these views can be reset to zero using the functions bdr.reset_subscription_stats and bdr.reset_relation_stats.

Standard PostgreSQL Statistics Views

Statistics on table and index usage are updated normally by the downstream master. This is essential for the correct function of autovacuum. If there are no local writes on the downstream master and statistics have not been reset, these two views should show corresponding results between upstream and downstream:

  • pg_stat_user_tables
  • pg_statio_user_tables
Note

We don't necessarily expect the upstream table statistics to be similar to the downstream ones; we only expect them to change by the same amounts. Consider the example of a table whose statistics show 1M inserts and 1M updates; when a new node joins the BDR group, the statistics for the same table in the new node will show 1M inserts and zero updates. However, from that moment, the upstream and downstream table statistics will change by the same amounts, because all changes on one side will be replicated to the other side.

Since indexes are used to apply changes, the identifying indexes on the downstream side may appear more heavily used with workloads that perform UPDATEs and DELETEs than non-identifying indexes are.

The built-in index monitoring views are:

  • pg_stat_user_indexes
  • pg_statio_user_indexes

All these views are discussed in detail in the PostgreSQL documentation on the statistics views.

Monitoring BDR Versions

BDR allows running different Postgres versions as well as different BDR versions across the nodes in the same cluster. This is useful for upgrading.

The view bdr.group_versions_details uses the function bdr.run_on_all_nodes() to retrieve Postgres and BDR versions from all nodes at the same time. For example:

bdrdb=# SELECT node_name, postgres_version, bdr_version
        FROM bdr.group_versions_details;
 node_name | postgres_version | bdr_version
-----------+------------------+-------------
 node1     | 14.1             | 4.0.0
 node2     | 14.1             | 4.0.0

The recommended setup is to try to have all nodes running the same latest versions as soon as possible. It is recommended that the cluster does not run different BDR versions for too long.

For monitoring purposes, we recommend the following alert levels:

  • status=UNKNOWN, message=This node is not part of any BDR group
  • status=OK, message=All nodes are running same BDR versions
  • status=WARNING, message=There is at least 1 node that is not accessible
  • status=WARNING, message=There are node(s) running different BDR versions when compared to other nodes

The described behavior is implemented in the function bdr.monitor_group_versions(), which uses BDR version information returned from the view bdr.group_version_details to provide a cluster-wide version check. For example:

bdrdb=# SELECT * FROM bdr.monitor_group_versions();
 status |                message
--------+-----------------------------------------
 OK     | All nodes are running same BDR versions

Monitoring Raft Consensus

Raft Consensus should be working cluster-wide at all times. The impact of running a EDB Postgres Distributed cluster without Raft Consensus working might be as follows:

  • BDR data changes replication may still be working correctly
  • Global DDL/DML locks will not work
  • Galloc sequences will eventually run out of chunks
  • Eager Replication will not work
  • Cluster maintenance operations (join node, part node, promote standby) are still allowed but they might not finish (simply hang)
  • Node statuses might not be correctly synced among the BDR nodes
  • BDR group replication slot does not advance LSN, thus keeps WAL files on disk

The view bdr.group_raft_details uses the functions bdr.run_on_all_nodes() and bdr.get_raft_status() to retrieve Raft Consensus status from all nodes at the same time. For example:

bdrdb=# SELECT node_id, node_name, state, leader_id
FROM bdr.group_raft_details;
  node_id   | node_name | node_group_name |     state     | leader_id
------------+-----------+-----------------+---------------+------------
 1148549230 | node1     | top_group       | RAFT_LEADER   | 1148549230
 3367056606 | node2     | top_group       | RAFT_FOLLOWER | 1148549230

We can say that Raft Consensus is working correctly if all below conditions are met:

  • A valid state (RAFT_LEADER or RAFT_FOLLOWER) is defined on all nodes
  • Only one of the nodes is the RAFT_LEADER
  • The leader_id is the same on all rows and must match the node_id of the row where state = RAFT_LEADER

From time to time, Raft Consensus will start a new election to define a new RAFT_LEADER. During an election, there might be an intermediary situation where there is no RAFT_LEADER and some of the nodes consider themselves as RAFT_CANDIDATE. The whole election should not take longer than bdr.raft_election_timeout (by default it is set to 6 seconds). If the query above returns an in-election situation, then simply wait for bdr.raft_election_timeout and run the query again. If after bdr.raft_election_timeout has passed and some the conditions above are still not met, then Raft Consensus is not working.

Raft Consensus might not be working correctly on a single node only; for example one of the nodes does not recognize the current leader and considers itself as a RAFT_CANDIDATE. In this case, it is important to make sure that:

  • All BDR nodes are accessible to each other through both regular and replication connections (check file pg_hba.conf)
  • BDR versions are the same on all nodes
  • bdr.raft_election_timeout is the same on all nodes

In some cases, especially if nodes are geographically distant from each other and/or network latency is high, the default value of bdr.raft_election_timeout (6 seconds) might not be enough. If Raft Consensus is still not working even after making sure everything is correct, consider increasing bdr.raft_election_timeout to, say, 30 seconds on all nodes. From BDR 3.6.11 onwards, setting bdr.raft_election_timeout requires only a server reload.

Given how Raft Consensus affects cluster operational tasks, and also as Raft Consensus is directly responsible for advancing the group slot, we can define monitoring alert levels as follows:

  • status=UNKNOWN, message=This node is not part of any BDR group
  • status=OK, message=Raft Consensus is working correctly
  • status=WARNING, message=There is at least 1 node that is not accessible
  • status=WARNING, message=There are node(s) as RAFT_CANDIDATE, an election might be in progress
  • status=WARNING, message=There is no RAFT_LEADER, an election might be in progress
  • status=CRITICAL, message=There is a single node in Raft Consensus
  • status=CRITICAL, message=There are node(s) as RAFT_CANDIDATE while a RAFT_LEADER is defined
  • status=CRITICAL, message=There are node(s) following a leader different than the node set as RAFT_LEADER

The described behavior is implemented in the function bdr.monitor_group_raft(), which uses Raft Consensus status information returned from the view bdr.group_raft_details to provide a cluster-wide Raft check. For example:

bdrdb=# SELECT * FROM bdr.monitor_group_raft();
node_group_name | status |               message
----------------|--------+-------------------------------------
myroup          | OK     | Raft Consensus is working correctly

Monitoring Replication Slots

Each BDR node keeps:

  • One replication slot per active BDR peer
  • One group replication slot

For example:

bdrdb=# SELECT slot_name, database, active, confirmed_flush_lsn
FROM pg_replication_slots ORDER BY slot_name;
        slot_name         | database | active | confirmed_flush_lsn
--------------------------+----------+--------+---------------------
 bdr_bdrdb_bdrgroup       | bdrdb    | f      | 0/3110A08
 bdr_bdrdb_bdrgroup_node2 | bdrdb    | t      | 0/31F4670
 bdr_bdrdb_bdrgroup_node3 | bdrdb    | t      | 0/31F4670
 bdr_bdrdb_bdrgroup_node4 | bdrdb    | t      | 0/31F4670

Peer slot names follow the convention bdr_<DATABASE>_<GROUP>_<PEER>, while the BDR group slot name follows the convention bdr_<DATABASE>_<GROUP>, which can be accessed using the function bdr.local_group_slot_name().

Peer replication slots should be active on all nodes at all times. If a peer replication slot is not active, then it might mean:

  • The corresponding peer is shutdown or not accessible; or
  • BDR replication is broken.

Grep the log file for ERROR or FATAL and also check bdr.event_summary on all nodes. The root cause might be, for example, an incompatible DDL was executed with DDL replication disabled on one of the nodes.

The BDR group replication slot is however inactive most of the time. BDR maintains this slot and advances its LSN when all other peers have already consumed the corresponding transactions. Consequently it is not necessary to monitor the status of the group slot.

The function bdr.monitor_local_replslots() provides a summary of whether all BDR node replication slots are working as expected, e.g.:

bdrdb=# SELECT * FROM bdr.monitor_local_replslots();
 status |                    message
--------+-------------------------------------------------
 OK     | All BDR replication slots are working correctly

One of the following status summaries will be returned:

  • UNKNOWN: This node is not part of any BDR group
  • OK: All BDR replication slots are working correctly
  • OK: This node is part of a subscriber-only group
  • CRITICAL: There is at least 1 BDR replication slot which is inactive
  • CRITICAL: There is at least 1 BDR replication slot which is missing

Monitoring Transaction COMMITs

By default, BDR transactions commit only on the local node. In that case, transaction COMMIT will be processed quickly.

BDR can be used with standard PostgreSQL synchronous replication, while BDR also provides two new transaction commit modes: CAMO and Eager replication. Each of these modes provides additional robustness features, though at the expense of additional latency at COMMIT. The additional time at COMMIT can be monitored dynamically using the bdr.stat_activity catalog, where processes report different wait_event states. A transaction in COMMIT waiting for confirmations from one or more synchronous standbys reports a SyncRep wait event, whereas the two new modes report EagerRep.