Monitoring through SQL v5
EDB Postgres Distributed provides several monitoring and statistics views that are specific to its distributed nature. The standard Postgres monitoring is also useful for monitoring EDB Postgres Distributed.
Monitoring Overview
A BDR Group consists of multiple servers, often referred to as nodes. All of the nodes need to be monitored to ensure the health of the whole group.
The bdr_monitor role may execute the bdr.monitor
functions to provide an
assessment of BDR health using one of three levels:
OK
- often shown as GreenWARNING
- often shown as YellowCRITICAL
- often shown as Red- as well as
UNKNOWN
- for unrecognized situations, often shown as Red
BDR also provides dynamic catalog views that show the instantaneous state of various internal metrics and also BDR metadata catalogs that store the configuration defaults and/or configuration changes requested by the user. Some of those views and tables are accessible by bdr_monitor or bdr_read_all_stats, but some contain user or internal information that has higher security requirements.
BDR allows you to monitor each of the nodes individually, or to monitor the
whole group by access to a single node. If you wish to monitor each node individually,
simply connect to each node and issue monitoring requests. If you wish to monitor
the group from a single node then use the views starting with bdr.group
since these
requests make calls to other nodes to assemble a group-level information set.
If you have been granted access to the bdr.run_on_all_nodes()
function by
bdr_superuser then you may make your own calls to all nodes.
Monitoring Node Join and Removal
By default, the node management functions wait for the join or part
operation to complete. This can be turned off using the respective
wait_for_completion
function argument. If waiting is turned off,
then to see when a join or part operation finishes,
check the node state indirectly via bdr.node_summary
and
bdr.event_summary
.
When called, the helper function bdr.wait_for_join_completion()
will cause
a PostgreSQL session to pause until all outstanding node join operations
complete.
Here is an example output of a SELECT
query from bdr.node_summary
that
indicates that two nodes are active and another one is joining:
Also, the table bdr.node_catchup_info
will give information
on the catch-up state, which can be relevant to joining nodes or parting nodes.
When a node is parted, it could be that some nodes in the cluster did not receive all the data from that parting node. So it will create a temporary slot from a node that already received that data and can forward it.
The catchup_state
can be one of the following:
Monitoring Replication Peers
There are two main views used for monitoring of replication activity:
bdr.node_slots
for monitoring outgoing replicationbdr.subscription_summary
for monitoring incoming replication
Most of the information provided by bdr.node_slots
can be also obtained by querying
the standard PostgreSQL replication monitoring views
pg_catalog.pg_stat_replication
and
pg_catalog.pg_replication_slots
.
Each node has one BDR group slot which should never have a connection to it
and will very rarely be marked as active. This is normal, and does not imply
something is down or disconnected. See Replication Slots created by BDR
.
Monitoring Outgoing Replication
There is an additional view used for monitoring of outgoing replication activity:
bdr.node_replication_rates
for monitoring outgoing replication
The bdr.node_replication_rates
view gives an overall picture of the outgoing
replication activity along with the catchup estimates for peer nodes,
specifically.
The apply_rate
above refers to the rate in bytes per second. It is the rate
at which the peer is consuming data from the local node. The replay_lag
when
a node reconnects to the cluster is immediately set to zero. We are working on
fixing this information; as a workaround, we suggest you use the catchup_interval
column that refers to the time required for the peer node to catch up to the
local node data. The other fields are also available via the bdr.node_slots
view, as explained below.
Note
This catalog is only present when bdr-enteprise extension is installed.
Administrators may query bdr.node_slots
for outgoing replication from the
local node. It shows information about replication status of all other nodes
in the group that are known to the current node, as well as any additional
replication slots created by BDR on the current node.
Note that because BDR is a mesh network, to get full view of lag in the cluster, this query has to be executed on all nodes participating.
replay_lag_bytes
reports the difference in WAL positions between the local
server's current WAL write position and replay_lsn
, the last position
confirmed replayed by the peer node. replay_lag_size
is just a human-readable
form of the same. It is important to understand that WAL usually contains a lot
of writes that are not replicated but still count in replay_lag_bytes
,
including VACUUM
activity, index changes, writes associated with other
databases on the same node, writes for tables that are not part of a
replication set, etc. So the lag in bytes reported here is not the amount of
data that must be replicated on the wire to bring the peer node up to date,
only the amount of server-side WAL that must be processed.
Similarly, replay_lag
is not a measure of how long the peer node will take to
catch up, or how long it will take to replay from its current position to the
write position at the time bdr.node_slots
was queried. It measures the delay
between when the peer confirmed the most recent commit and the current
wall-clock time. We suggest that you monitor replay_lag_bytes
and replay_lag_size
or catchup_interval
in bdr.node_replication_rates
, as this column is set to
zero immediately after the node reconnects.
The lag in both bytes and time does not advance while logical replication is streaming a transaction. It only changes when a commit is replicated. So the lag will tend to "sawtooth", rising as a transaction is streamed, then falling again as the peer node commits it, flushes it, and sends confirmation. The reported LSN positions will "stair-step" instead of advancing smoothly, for similar reasons.
When replication is disconnected (active
= 'f'
), the active_pid
column
will be NULL
, as will client_addr
and other fields that only make sense
with an active connection. The state
field will be 'disconnected'
. The
_lsn
fields will be the same as the confirmed_flush_lsn
, since that is the
last position that the client is known for certain to have replayed to and saved.
The _lag
fields will show the elapsed time between the most recent confirmed
flush on the client and the current time, and the _lag_size
and _lag_bytes
fields will report the distance between confirmed_flush_lsn
and the local
server's current WAL insert position.
Note: It is normal for restart_lsn
to be behind the other lsn
columns;
this does not indicate a problem with replication or a peer node lagging. The
restart_lsn
is the position that PostgreSQL's internal logical decoding must
be reading WAL at if interrupted, and generally reflects the position of the
oldest transaction that is not yet replicated and flushed. A very old
restart_lsn
can make replication slow to restart after disconnection and
force retention of more WAL than is desirable, but will otherwise be harmless.
If you are concerned, look for very long running transactions and forgotten
prepared transactions.
Monitoring Incoming Replication
Incoming replication (also called subscription) can be monitored by querying
the bdr.subscription_summary
view. This shows the list of known subscriptions
to other nodes in the EDB Postgres Distributed cluster and the state of the replication worker, e.g.:
Monitoring WAL senders using LCR
If the Decoding Worker is enabled, information about the
current LCR (Logical Change Record
) file for each WAL sender can be monitored
via the function bdr.wal_sender_stats,
e.g.:
If is_using_lcr
is FALSE
, decoder_slot_name
/lcr_file_name
will be NULL
.
This will be the case if the Decoding Worker is not enabled, or the WAL sender is
serving a logical standby.
Additionally, information about the Decoding Worker can be monitored via the function bdr.get_decoding_worker_stat, e.g.:
Monitoring BDR Replication Workers
All BDR workers show up in the system view bdr.stat_activity
,
which has the same columns and information content as
pg_stat_activity.
So this view offers these insights into the state of a BDR system:
- The wait_event column has enhanced information, if the reason for waiting is related to BDR.
- The
query
column will be blank in BDR workers, except when a writer process is executing DDL
The bdr.workers
view shows BDR worker specific details, that are not
available from bdr.stat_activity
.
The view bdr.event_summary
shows last error (if any) reported by any worker
which has a problem continuing the work. This is persistent information, so
it's important to note the time of the error not just the existence of one,
because most errors are transient in their nature and BDR workers will retry
the failed operation.
Monitoring BDR Writers
There is another system view bdr.writers
to monitor writer activities.
This views shows the current status of only writer workers. It includes:
sub_name
to identify the subscription which the writer belongs topid
of the writer processstreaming_allowed
to know if the writer supports application of in-progress streaming transactionsis_streaming
to know if the writer is currently applying a streaming transactioncommit_queue_position
to check the position of the writer in the commit queue.
BDR honours commit ordering by following the same commit order as
happened on the origin. In case of parallel writers, multiple writers
could be applying different transactions at the same time. The
commit_queue_position
shows in which order they will commit. Value 0
means that the writer is the first one to commit. Value -1
means that
the commit position is not yet known. This can happen for a streaming
transaction or when the writer is not applying any transaction at the
moment.
Monitoring Global Locks
The global lock, which is currently only used for DDL replication, is a heavyweight lock that exists across the whole BDR group.
There are currently two types of global locks:
- DDL lock, used for serializing all DDL operations on permanent (not temporary) objects (i.e. tables) in the database
- DML relation lock, used for locking out writes to relations during DDL operations that change the relation definition
Either or both entry types may be created for the same transaction, depending on
the type of DDL operation and the value of the bdr.ddl_locking
setting.
Global locks held on the local node are visible in the bdr.global_locks
view. This view shows the type of the lock; for
relation locks it shows which relation is being locked, the PID holding the
lock (if local), and whether the lock has been globally granted or not. In case
of global advisory locks, lock_type
column shows GLOBAL_LOCK_ADVISORY
and
relation
column shows the advisory key(s) on which the lock is acquired.
The following is an example output of bdr.global_locks
while running an
ALTER TABLE
statement with bdr.ddl_locking = on
:
See the catalog documentation for details on all fields including lock timing information.
Monitoring Conflicts
Replication conflicts can arise when multiple nodes make changes that affect the same rows in ways that can interact with each other. The BDR system should be monitored to ensure that conflicts are identified and, where possible, application changes are made to eliminate them or make them less frequent.
By default, all conflicts are logged to bdr.conflict_history
. Since this
contains full details of conflicting data, the rows are protected by
row-level security to ensure they are visible only by
owners of replicated tables. Owners should expect conflicts and analyze them
to see which, if any, might be considered as problems to be resolved.
For monitoring purposes use bdr.conflict_history_summary
, which does
not contain user data. An example query to count the number of conflicts
seen within the current day using an efficient query plan is:
Apply Statistics
BDR collects statistics about replication apply, both for each subscription and for each table.
Two monitoring views exist: bdr.stat_subscription
for subscription statistics
and bdr.stat_relation
for relation statistics. These views both provide:
- Number of INSERTs/UPDATEs/DELETEs/TRUNCATEs replicated
- Block accesses and cache hit ratio
- Total I/O time for read/write
- Number of in-progress transactions streamed to file
- Number of in-progress transactions streamed to writers
- Number of in-progress streamed transactions committed/aborted
and for relations only, these statistics:
- Total time spent processing replication for the relation
- Total lock wait time to acquire lock (if any) for the relation (only)
and for subscriptions only, these statistics:
- Number of COMMITs/DDL replicated for the subscription
- Number of times this subscription has connected upstream
Tracking of these statistics is controlled by the BDR GUCs
bdr.track_subscription_apply
and bdr.track_relation_apply
respectively.
The example output from these would look like this:
In this case the subscription connected 3 times to the upstream, inserted 10 rows and did 2 DDL commands inside 5 transactions.
Stats counters for these views can be reset to zero using the functions
bdr.reset_subscription_stats
and bdr.reset_relation_stats
.
Standard PostgreSQL Statistics Views
Statistics on table and index usage are updated normally by the downstream master. This is essential for the correct function of autovacuum. If there are no local writes on the downstream master and statistics have not been reset, these two views should show corresponding results between upstream and downstream:
pg_stat_user_tables
pg_statio_user_tables
Note
We don't necessarily expect the upstream table statistics to be similar to the downstream ones; we only expect them to change by the same amounts. Consider the example of a table whose statistics show 1M inserts and 1M updates; when a new node joins the BDR group, the statistics for the same table in the new node will show 1M inserts and zero updates. However, from that moment, the upstream and downstream table statistics will change by the same amounts, because all changes on one side will be replicated to the other side.
Since indexes are used to apply changes, the identifying indexes on the
downstream side may appear more heavily used with workloads that perform
UPDATE
s and DELETE
s than non-identifying indexes are.
The built-in index monitoring views are:
pg_stat_user_indexes
pg_statio_user_indexes
All these views are discussed in detail in the PostgreSQL documentation on the statistics views.
Monitoring BDR Versions
BDR allows running different Postgres versions as well as different BDR versions across the nodes in the same cluster. This is useful for upgrading.
The view bdr.group_versions_details
uses the function
bdr.run_on_all_nodes()
to retrieve Postgres and BDR versions from all nodes
at the same time. For example:
The recommended setup is to try to have all nodes running the same latest versions as soon as possible. It is recommended that the cluster does not run different BDR versions for too long.
For monitoring purposes, we recommend the following alert levels:
- status=UNKNOWN, message=This node is not part of any BDR group
- status=OK, message=All nodes are running same BDR versions
- status=WARNING, message=There is at least 1 node that is not accessible
- status=WARNING, message=There are node(s) running different BDR versions when compared to other nodes
The described behavior is implemented in the function
bdr.monitor_group_versions()
, which uses BDR version
information returned from the view bdr.group_version_details
to provide a cluster-wide version check. For example:
Monitoring Raft Consensus
Raft Consensus should be working cluster-wide at all times. The impact of running a EDB Postgres Distributed cluster without Raft Consensus working might be as follows:
- BDR data changes replication may still be working correctly
- Global DDL/DML locks will not work
- Galloc sequences will eventually run out of chunks
- Eager Replication will not work
- Cluster maintenance operations (join node, part node, promote standby) are still allowed but they might not finish (simply hang)
- Node statuses might not be correctly synced among the BDR nodes
- BDR group replication slot does not advance LSN, thus keeps WAL files on disk
The view bdr.group_raft_details
uses the functions
bdr.run_on_all_nodes()
and bdr.get_raft_status()
to retrieve Raft
Consensus status from all nodes at the same time. For example:
We can say that Raft Consensus is working correctly if all below conditions are met:
- A valid state (
RAFT_LEADER
orRAFT_FOLLOWER
) is defined on all nodes - Only one of the nodes is the
RAFT_LEADER
- The
leader_id
is the same on all rows and must match thenode_id
of the row wherestate = RAFT_LEADER
From time to time, Raft Consensus will start a new election to define a
new RAFT_LEADER
. During an election, there might be an intermediary
situation where there is no RAFT_LEADER
and some of the nodes consider
themselves as RAFT_CANDIDATE
. The whole election should not take longer
than bdr.raft_election_timeout
(by default it is set to 6 seconds). If
the query above returns an in-election situation, then simply wait for
bdr.raft_election_timeout
and run the query again. If after
bdr.raft_election_timeout
has passed and some the conditions above are
still not met, then Raft Consensus is not working.
Raft Consensus might not be working correctly on a single node only;
for example one of the nodes does not recognize the current leader and
considers itself as a RAFT_CANDIDATE
. In this case, it is important to
make sure that:
- All BDR nodes are accessible to each other through both regular and
replication connections (check file
pg_hba.conf
) - BDR versions are the same on all nodes
bdr.raft_election_timeout
is the same on all nodes
In some cases, especially if nodes are geographically distant from each
other and/or network latency is high, the default value of
bdr.raft_election_timeout
(6 seconds) might not be enough. If Raft
Consensus is still not working even after making sure everything is
correct, consider increasing bdr.raft_election_timeout
to, say, 30
seconds on all nodes. From BDR 3.6.11 onwards, setting
bdr.raft_election_timeout
requires only a server reload.
Given how Raft Consensus affects cluster operational tasks, and also as Raft Consensus is directly responsible for advancing the group slot, we can define monitoring alert levels as follows:
- status=UNKNOWN, message=This node is not part of any BDR group
- status=OK, message=Raft Consensus is working correctly
- status=WARNING, message=There is at least 1 node that is not accessible
- status=WARNING, message=There are node(s) as RAFT_CANDIDATE, an election might be in progress
- status=WARNING, message=There is no RAFT_LEADER, an election might be in progress
- status=CRITICAL, message=There is a single node in Raft Consensus
- status=CRITICAL, message=There are node(s) as RAFT_CANDIDATE while a RAFT_LEADER is defined
- status=CRITICAL, message=There are node(s) following a leader different than the node set as RAFT_LEADER
The described behavior is implemented in the function
bdr.monitor_group_raft()
, which uses Raft Consensus status
information returned from the view bdr.group_raft_details
to provide a cluster-wide Raft check. For example:
Monitoring Replication Slots
Each BDR node keeps:
- One replication slot per active BDR peer
- One group replication slot
For example:
Peer slot names follow the convention bdr_<DATABASE>_<GROUP>_<PEER>
,
while the BDR group slot name follows the convention
bdr_<DATABASE>_<GROUP>
, which can be accessed using the function
bdr.local_group_slot_name()
.
Peer replication slots should be active on all nodes at all times. If a peer replication slot is not active, then it might mean:
- The corresponding peer is shutdown or not accessible; or
- BDR replication is broken.
Grep the log file for ERROR
or FATAL
and also check bdr.event_summary
on
all nodes. The root cause might be, for example, an incompatible DDL was
executed with DDL replication disabled on one of the nodes.
The BDR group replication slot is however inactive most of the time. BDR maintains this slot and advances its LSN when all other peers have already consumed the corresponding transactions. Consequently it is not necessary to monitor the status of the group slot.
The function bdr.monitor_local_replslots()
provides a summary of whether all
BDR node replication slots are working as expected, e.g.:
One of the following status summaries will be returned:
UNKNOWN
:This node is not part of any BDR group
OK
:All BDR replication slots are working correctly
OK
:This node is part of a subscriber-only group
CRITICAL
:There is at least 1 BDR replication slot which is inactive
CRITICAL
:There is at least 1 BDR replication slot which is missing
Monitoring Transaction COMMITs
By default, BDR transactions commit only on the local node. In that case,
transaction COMMIT
will be processed quickly.
BDR can be used with standard PostgreSQL synchronous replication, while
BDR also provides two new transaction commit modes: CAMO and Eager
replication. Each of these modes provides additional robustness
features, though at the expense of additional latency at COMMIT
.
The additional time at COMMIT
can be monitored dynamically using the
bdr.stat_activity
catalog, where processes report different wait_event
states. A transaction in COMMIT
waiting for confirmations from one or
more synchronous standbys reports a SyncRep
wait event, whereas the
two new modes report EagerRep
.
- On this page
- Monitoring Overview
- Monitoring Node Join and Removal
- Monitoring Replication Peers
- Monitoring BDR Replication Workers
- Monitoring BDR Writers
- Monitoring Global Locks
- Monitoring Conflicts
- Apply Statistics
- Standard PostgreSQL Statistics Views
- Monitoring BDR Versions
- Monitoring Raft Consensus
- Monitoring Replication Slots
- Monitoring Transaction COMMITs