Page Comparison

...

bsr implements a block device that replicates data from the local node to all other nodes in the cluster. The actual data and its related metadata are stored separately (usually in the case of external metadata) on the "generic" block device volume of each cluster node. By default, replicated block devices are named in the /dev/bsr<minor> format, or they are referenced directly through a symbolic link (letter) to the device. One or more devices are grouped per resource, and each device is replicated in parallel. A device inside a resource is called a volume, and resources can be replicated between two or more cluster nodes. Node-to-node connections in the cluster are point-to-point links that use the TCP protocol. bsr consists of the basic component bsradm, which understands and processes configuration files, and the low-level components bsrsetup, bsrmeta, and bsrcon. The basic bsr configuration consists of /etc/bsr.conf and any additional files it includes (typically global_common.conf and all *.res files in the /etc path). It is usually useful to define each resource in a separate *.res file under /etc/bsr.d/. The configuration file is designed so that each cluster node holds an identical copy of the entire cluster configuration. This is not an absolute rule, however, because it is sometimes necessary to use different configuration file contents on each node.

...

Called on all nodes after a verify finishes and out-of-sync blocks were found. This handler is mainly used for monitoring purposes. An example would be to call a script that sends an alert SMS.

quorum-lost cmd

Called on a Primary that lost quorum. This handler is usually used to reboot the node if it is not possible to restart the application that uses the storage on top of DRBD.

fence-peer cmd

Called when a node should fence a resource on a particular peer. The handler should not use the same communication path that bsr uses for talking to the peer.

...

initial-split-brain cmd

Called when bsr connects to a peer and detects that the peer is in a split-brain state with the local node. This handler is also called for split-brain scenarios which will be resolved automatically.

...

The local node is currently primary, but bsr believes that it should become a sync target. The node should give up its primary role.

...

The local node is currently primary, but it has lost the after-split-brain auto recovery procedure. The node should be abandoned.

pri-on-incon-degr cmd

The local node is primary, and neither the local lower-level device nor a lower-level device on a peer is up to date. (The primary has no device to read from or to write to.)

split-brain cmd

bsr has detected a split-brain situation which could not be resolved automatically. Manual recovery is necessary. This handler can be used to call for administrator attention.

Section net Parameters

after-sb-0pri policy

Defines how to respond when a split-brain scenario is detected and neither of the two nodes plays the Primary role. (A split-brain scenario is detected when two nodes connect, so the split-brain decision is always between the two nodes.) The defined policies are:

disconnect Simply disconnect.
discard-younger-primary, discard-older-primary Discard the node that became Primary last (discard-younger-primary) or the node that became Primary first (discard-older-primary). If both nodes have become Primary independently, the discard-least-changes policy is used.
discard-zero-changes If data was written on only one node, resynchronize from that node. If both nodes have written data, they disconnect.
discard-least-changes Resynchronize from the node that wrote more data; the node with the fewest changes is discarded.
discard-node-nodename Always discard the named node.

after-sb-1pri policy

Define how to react if a split-brain scenario is detected, with one node in primary role and one node in secondary role. (We detect split-brain scenarios when two nodes connect, so split-brain decisions are always among two nodes.) The defined policies are:

disconnect No automatic resynchronization, simply disconnect.
consensus Discard the data on the secondary node if the after-sb-0pri algorithm would also discard the data on the secondary node. Otherwise, disconnect.
violently-as0p Always take the decision of the after-sb-0pri algorithm, even if it causes an erratic change of the primary's view of the data. This is only useful if a single-node file system (i.e., not OCFS2 or GFS) with the allow-two-primaries flag is used. This option can cause the primary node to crash, and should not be used.
discard-secondary Discard the data on the secondary node.
call-pri-lost-after-sb Always take the decision of the after-sb-0pri algorithm. If the decision is to discard the data on the primary node, call the pri-lost-after-sb handler on the primary node.

after-sb-2pri policy

Define how to react if a split-brain scenario is detected and both nodes are in primary role. (We detect split-brain scenarios when two nodes connect, so split-brain decisions are always among two nodes.) The defined policies are:

disconnect No automatic resynchronization, simply disconnect.
violently-as0p See the violently-as0p policy for after-sb-1pri.
call-pri-lost-after-sb Call the pri-lost-after-sb helper program on one of the machines unless that machine can demote to secondary. The helper program is expected to reboot the machine, which brings the node into a secondary role. Which machine runs the helper program is determined by the after-sb-0pri strategy.

allow-two-primaries

The most common way to configure DRBD devices is to allow only one node to be primary (and thus writable) at a time. In some scenarios it is preferable to allow two nodes to be primary at once; a mechanism outside of DRBD then must make sure that writes to the shared, replicated device happen in a coordinated way. This can be done with a shared-storage cluster file system like OCFS2 and GFS, or with virtual machine images and a virtual machine manager that can migrate virtual machines between physical machines. The allow-two-primaries parameter tells DRBD to allow two nodes to be primary at the same time. Never enable this option when using a non-distributed file system; otherwise, data corruption and node crashes will result!

always-asbp

Normally the automatic after-split-brain policies are only used if current states of the UUIDs do not indicate the presence of a third node. With this option you request that the automatic after-split-brain policies are used as long as the data sets of the nodes are somehow related. This might cause a full sync, if the UUIDs indicate the presence of a third node. (Or double faults led to strange UUID sets.)

connect-int time

As soon as a connection between two nodes is configured with drbdsetup connect, DRBD immediately tries to establish the connection. If this fails, DRBD waits for connect-int seconds and then repeats. The default value of connect-int is 10 seconds.

cram-hmac-alg hash-algorithm

Configure the hash-based message authentication code (HMAC) or secure hash algorithm to use for peer authentication. The kernel supports a number of different algorithms, some of which may be loadable as kernel modules. See the shash algorithms listed in /proc/crypto. By default, cram-hmac-alg is unset. Peer authentication also requires a shared-secret to be configured.

csums-alg hash-algorithm

Normally, when two nodes resynchronize, the sync target requests a piece of out-of-sync data from the sync source, and the sync source sends the data. With many usage patterns, a significant number of those blocks will actually be identical. When a csums-alg algorithm is specified, when requesting a piece of out-of-sync data, the sync target also sends along a hash of the data it currently has. The sync source compares this hash with its own version of the data. It sends the sync target the new data if the hashes differ, and tells it that the data are the same otherwise. This reduces the network bandwidth required, at the cost of higher cpu utilization and possibly increased I/O on the sync target. The csums-alg can be set to one of the secure hash algorithms supported by the kernel; see the shash algorithms listed in /proc/crypto. By default, csums-alg is unset.

csums-after-crash-only

Enabling this option (and csums-alg, above) makes it possible to use the checksum-based resync only for the first resync after a primary crash, but not for later "network hiccups". In most cases, blocks that are marked as need-to-be-resynced have in fact changed, so calculating checksums, and both reading and writing the blocks on the resync target, is effectively all overhead. The advantage of checksum-based resync shows mostly during recovery after a primary crash, where the recovery marks larger areas (those covered by the activity log) as need-to-be-resynced, just in case. Introduced in 8.4.5.

data-integrity-alg alg

DRBD normally relies on the data integrity checks built into the TCP/IP protocol, but if a data integrity algorithm is configured, it will additionally use this algorithm to make sure that the data received over the network match what the sender has sent. If a data integrity error is detected, DRBD will close the network connection and reconnect, which will trigger a resync. The data-integrity-alg can be set to one of the secure hash algorithms supported by the kernel; see the shash algorithms listed in /proc/crypto. By default, this mechanism is turned off. Because of the CPU overhead involved, we recommend not to use this option in production environments. Also see the notes on data integrity below.

fencing fencing_policy

Fencing is a preventive measure to avoid situations where both nodes are primary and disconnected. This is also known as a split-brain situation. DRBD supports the following fencing policies:

dont-care No fencing actions are taken. This is the default policy.
resource-only If a node becomes a disconnected primary, it tries to fence the peer. This is done by calling the fence-peer handler. The handler is supposed to reach the peer over an alternative communication path and call 'drbdadm outdate minor' there.
resource-and-stonith If a node becomes a disconnected primary, it freezes all its I/O operations and calls its fence-peer handler. The fence-peer handler is supposed to reach the peer over an alternative communication path and call 'drbdadm outdate minor' there. In case it cannot do that, it should stonith the peer. I/O is resumed as soon as the situation is resolved. In case the fence-peer handler fails, I/O can be resumed manually with 'drbdadm resume-io'.

ko-count number

If a secondary node fails to complete a write request in ko-count times the timeout parameter, it is excluded from the cluster. The primary node then sets the connection to this secondary node to Standalone. To disable this feature, you should explicitly set it to 0; defaults may change between versions.

max-buffers number

Limits the memory usage per DRBD minor device on the receiving side, or for internal buffers during resync or online-verify. Unit is PAGE_SIZE, which is 4 KiB on most systems. The minimum possible setting is hard coded to 32 (=128 KiB). These buffers are used to hold data blocks while they are written to/read from disk. To avoid possible distributed deadlocks on congestion, this setting is used as a throttle threshold rather than a hard limit. Once more than max-buffers pages are in use, further allocation from this pool is throttled. You want to increase max-buffers if you cannot saturate the IO backend on the receiving side.

max-epoch-size number

Define the maximum number of write requests DRBD may issue before issuing a write barrier. The default value is 2048, with a minimum of 1 and a maximum of 20000. Setting this parameter to a value below 10 is likely to decrease performance.

on-congestion policy,

congestion-fill threshold,

congestion-extents threshold

By default, DRBD blocks when the TCP send queue is full. This prevents applications from generating further write requests until more buffer space becomes available again. When DRBD is used together with DRBD-proxy, it can be better to use the pull-ahead on-congestion policy, which can switch DRBD into ahead/behind mode before the send queue is full. DRBD then records the differences between itself and the peer in its bitmap, but it no longer replicates them to the peer. When enough buffer space becomes available again, the node resynchronizes with the peer and switches back to normal replication. This has the advantage of not blocking application I/O even when the queues fill up, and the disadvantage that peer nodes can fall behind much further. Also, while resynchronizing, peer nodes will become inconsistent. The available congestion policies are block (the default) and pull-ahead. The congestion-fill parameter defines how much data is allowed to be "in flight" in this connection. The default value is 0, which disables this mechanism of congestion control, with a maximum of 10 GiBytes. The congestion-extents parameter defines how many bitmap extents may be active before switching into ahead/behind mode, with the same default and limits as the al-extents parameter. The congestion-extents parameter is effective only when set to a value smaller than al-extents. Ahead/behind mode is available since DRBD 8.3.10.

ping-int interval

When the TCP/IP connection to a peer is idle for more than ping-int seconds, DRBD will send a keep-alive packet to make sure that a failed peer or network connection is detected reasonably soon. The default value is 10 seconds, with a minimum of 1 and a maximum of 120 seconds. The unit is seconds.

ping-timeout timeout

Define the timeout for replies to keep-alive packets. If the peer does not reply within ping-timeout, DRBD will close and try to reestablish the connection. The default value is 0.5 seconds, with a minimum of 0.1 seconds and a maximum of 3 seconds. The unit is tenths of a second.

socket-check-timeout timeout

In setups involving a DRBD-proxy and connections that experience a lot of buffer-bloat, it might be necessary to set ping-timeout to an unusually high value. By default, DRBD uses the same value when waiting to see whether a newly established TCP connection is stable. Since the DRBD-proxy is usually located in the same data center, such a long wait time may hinder DRBD's connect process. In such setups, socket-check-timeout should be set to at least the round trip time between DRBD and DRBD-proxy, that is, to 1 in most cases. The default unit is tenths of a second; the default value is 0 (which causes DRBD to use the value of ping-timeout instead). Introduced in 8.4.5.

protocol name

Use the specified protocol on this connection. The supported protocols are:

A Writes to the DRBD device complete as soon as they have reached the local disk and the TCP/IP send buffer.
B Writes to the DRBD device complete as soon as they have reached the local disk, and all peers have acknowledged the receipt of the write requests.
C Writes to the DRBD device complete as soon as they have reached the local and all remote disks.

rcvbuf-size size

Configure the size of the TCP/IP receive buffer. A value of 0 (the default) causes the buffer size to adjust dynamically. This parameter usually does not need to be set, but it can be set to a value up to 10 MiB. The default unit is bytes.

rr-conflict policy

This option helps to solve the cases when the outcome of the resync decision is incompatible with the current role assignment in the cluster. The defined policies are:

disconnect No automatic resynchronization, simply disconnect.
violently Resync to the primary node is allowed, violating the assumption that data on a block device are stable for one of the nodes. Do not use this option, it is dangerous.
call-pri-lost Call the pri-lost handler on one of the machines. The handler is expected to reboot the machine, which puts it into secondary role.

shared-secret secret

Configure the shared secret used for peer authentication. The secret is a string of up to 64 characters. Peer authentication also requires the cram-hmac-alg parameter to be set.

sndbuf-size size

Configure the size of the TCP/IP send buffer. Since DRBD 8.0.13 / 8.2.7, a value of 0 (the default) causes the buffer size to adjust dynamically. Values below 32 KiB are harmful to the throughput on this connection. Large buffer sizes can be useful especially when protocol A is used over high-latency networks; the maximum value supported is 10 MiB.

tcp-cork

By default, DRBD uses the TCP_CORK socket option to prevent the kernel from sending partial messages; this results in fewer and bigger packets on the network. Some network stacks can perform worse with this optimization. On these, the tcp-cork parameter can be used to turn this optimization off.

timeout time

Define the timeout for replies over the network: if a peer node does not send an expected reply within the specified timeout, it is considered dead and the TCP/IP connection is closed. The timeout value must be lower than connect-int and lower than ping-int. The default is 6 seconds; the value is specified in tenths of a second.
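The unit mismatch in this constraint is easy to overlook: timeout is given in tenths of a second, while connect-int and ping-int are given in seconds. The following is a minimal sketch of a sanity check, not part of any DRBD or bsr tool; the function name and its arguments are hypothetical.

```python
def check_net_timeouts(timeout_tenths: int, connect_int_s: int, ping_int_s: int) -> list:
    """Return a list of problems; an empty list means the constraint holds.

    Hypothetical helper: timeout is in tenths of a second, the two
    intervals are in seconds, as described in the text above.
    """
    timeout_s = timeout_tenths / 10  # convert tenths of a second to seconds
    problems = []
    if timeout_s >= connect_int_s:
        problems.append(f"timeout ({timeout_s}s) must be lower than connect-int ({connect_int_s}s)")
    if timeout_s >= ping_int_s:
        problems.append(f"timeout ({timeout_s}s) must be lower than ping-int ({ping_int_s}s)")
    return problems


if __name__ == "__main__":
    # A 6-second timeout (60 tenths) against 10-second intervals passes the check.
    print(check_net_timeouts(60, 10, 10))
```

A check like this could be run over a parsed configuration before it is applied; it only restates the rule given above and does not consult the running system.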

use-rle

Each replicated device on a cluster node has a separate bitmap for each of its peer devices. The bitmaps are used for tracking the differences between the local and peer device: depending on the cluster state, a disk range can be marked as different from the peer in the device's bitmap, in the peer device's bitmap, or in both bitmaps. When two cluster nodes connect, they exchange each other's bitmaps, and they each compute the union of the local and peer bitmap to determine the overall differences. Bitmaps of very large devices are also relatively large, but they usually compress very well using run-length encoding. This can save time and bandwidth for the bitmap transfers. The use-rle parameter determines if run-length encoding should be used. It is on by default since DRBD 8.4.0.
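To illustrate why such bitmaps compress well, here is a small run-length encoding sketch. It only demonstrates the general technique on a list of bits; the actual on-wire encoding used for bitmap transfers is not described in this document and is not reproduced here. The function names are hypothetical.

```python
from itertools import groupby


def rle_encode(bits):
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in run)) for bit, run in groupby(bits)]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original bit list."""
    bits = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits


if __name__ == "__main__":
    # A mostly clean bitmap with one small out-of-sync range.
    bitmap = [0] * 1000 + [1] * 8 + [0] * 1000
    runs = rle_encode(bitmap)
    print(runs)                      # three runs instead of 2008 bits
    assert rle_decode(runs) == bitmap
```

A bitmap with long stretches of identical bits shrinks to a handful of runs, which is the effect the use-rle parameter relies on.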

verify-alg hash-algorithm

Online verification (drbdadm verify) computes and compares checksums of disk blocks (i.e., hash values) in order to detect if they differ. The verify-alg parameter determines which algorithm to use for these checksums. It must be set to one of the secure hash algorithms supported by the kernel before online verify can be used; see the shash algorithms listed in /proc/crypto. We recommend to schedule online verifications regularly during low-load periods, for example once a month. Also see the notes on data integrity below.

Section on Parameters

address [address-family] address: port

Defines the address family, address, and port of a connection endpoint. The address families ipv4, ipv6, ssocks (Dolphin Interconnect Solutions' "super sockets"), sdp (Infiniband Sockets Direct Protocol), and sci are supported ( sci is an alias for ssocks). If no address family is specified, ipv4 is assumed. For all address families except ipv6, the address is specified in IPV4 address notation (for example, 1.2.3.4). For ipv6, the address is enclosed in brackets and uses IPv6 address notation (for example, [fd01:2345:6789:abcd::1]). The port is always specified as a decimal number from 1 to 65535. On each host, the port numbers must be unique for each address; ports cannot be shared.
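The notation rules above can be summarized in a short parsing sketch. This is an illustration only, assuming the value is written as an optional family name followed by address:port with no space before the port; it is not the parser used by the configuration tools, and parse_address is a hypothetical name.

```python
import re

# Address families named in the text above.
FAMILIES = {"ipv4", "ipv6", "ssocks", "sdp", "sci"}


def parse_address(spec: str):
    """Split '[family] address:port' into (family, address, port)."""
    parts = spec.split()
    if len(parts) == 2:
        family, endpoint = parts
    elif len(parts) == 1:
        family, endpoint = "ipv4", parts[0]  # ipv4 is assumed when no family is given
    else:
        raise ValueError(f"unexpected address format: {spec!r}")
    if family not in FAMILIES:
        raise ValueError(f"unknown address family: {family!r}")
    if family == "ipv6":
        # IPv6 addresses are enclosed in brackets.
        match = re.fullmatch(r"\[([0-9A-Fa-f:.]+)\]:(\d+)", endpoint)
    else:
        # All other families use IPv4 dotted notation.
        match = re.fullmatch(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+)", endpoint)
    if match is None:
        raise ValueError(f"malformed endpoint: {endpoint!r}")
    port = int(match.group(2))
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return family, match.group(1), port


if __name__ == "__main__":
    print(parse_address("1.2.3.4:7789"))
    print(parse_address("ipv6 [fd01:2345:6789:abcd::1]:7789"))
```

The port 7789 in the usage lines is an arbitrary example value.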

node-id value

Defines the unique node identifier for a node in the cluster. Node identifiers are used to identify individual nodes in the network protocol, and to assign bitmap slots to nodes in the metadata. Node identifiers can only be reassigned in a cluster when the cluster is down. It is essential that the node identifiers in the configuration and in the device metadata are changed consistently on all hosts. To change the metadata, dump the current state with drbdmeta dump-md, adjust the bitmap slot assignment, and update the metadata with drbdmeta restore-md. The node-id parameter exists since DRBD 9. Its value ranges from 0 to 16; there is no default.

Section options Parameters (Resource Options)

auto-promote bool-value

A resource must be promoted to primary role before any of its devices can be mounted or opened for writing. Before DRBD 9, this could only be done explicitly ("drbdadm primary"). Since DRBD 9, the auto-promote parameter allows to automatically promote a resource to primary role when one of its devices is mounted or opened for writing. As soon as all devices are unmounted or closed with no more remaining users, the role of the resource changes back to secondary. Automatic promotion only succeeds if the cluster state allows it (that is, if an explicit drbdadm primary command would succeed). Otherwise, mounting or opening the device fails as it already did before DRBD 9: the mount(2) system call fails with errno set to EROFS (Read-only file system); the open(2) system call fails with errno set to EMEDIUMTYPE (wrong medium type). Irrespective of the auto-promote parameter, if a device is promoted explicitly ( drbdadm primary), it also needs to be demoted explicitly (drbdadm secondary). The auto-promote parameter is available since DRBD 9.0.0, and defaults to yes.

cpu-mask cpu-mask

Set the cpu affinity mask for DRBD kernel threads. The cpu mask is specified as a hexadecimal number. The default value is 0, which lets the scheduler decide which kernel threads run on which CPUs. CPU numbers in cpu-mask which do not exist in the system are ignored.

on-no-data-accessible policy

Determine how to deal with I/O requests when the requested data is not available locally or remotely (for example, when all disks have failed). The defined policies are:

io-error System calls fail with errno set to EIO.
suspend-io The resource suspends I/O. I/O can be resumed by (re)attaching the lower-level device, by connecting to a peer which has access to the data, or by forcing DRBD to resume I/O with drbdadm resume-io res. When no data is available, forcing I/O to resume will result in the same behavior as the io-error policy.

This setting is available since DRBD 8.3.9; the default policy is io-error.

peer-ack-window value

On each node and for each device, DRBD maintains a bitmap of the differences between the local and remote data for each peer device. For example, in a three-node setup (nodes A, B, C) each with a single device, every node maintains one bitmap for each of its peers. When nodes receive write requests, they know how to update the bitmaps for the writing node, but not how to update the bitmaps between themselves. In this example, when a write request propagates from node A to B and C, nodes B and C know that they have the same data as node A, but not whether or not they both have the same data. As a remedy, the writing node occasionally sends peer-ack packets to its peers which tell them which state they are in relative to each other. The peer-ack-window parameter specifies how much data a primary node may send before sending a peer-ack packet. A low value causes increased network traffic; a high value causes less network traffic but higher memory consumption on secondary nodes and higher resync times between the secondary nodes after primary node failures. (Note: peer-ack packets may be sent due to other reasons as well, e.g. membership changes or expiry of the peer-ack-delay timer.) The default value for peer-ack-window is 2 MiB, the default unit is sectors. This option is available since 9.0.0.

peer-ack-delay expiry-time

If no new write request is issued for expiry-time after the last finished write request, a peer-ack packet is sent. If a new write request is issued before the timer expires, the timer is reset to expiry-time. (Note: peer-ack packets may be sent due to other reasons as well, e.g. membership changes or the peer-ack-window option.) This parameter may influence resync behavior on remote nodes. Peer nodes need to wait until they receive a peer-ack before releasing a lock on an AL-extent. Resync operations between peers may need to wait for these locks. The default value for peer-ack-delay is 100 milliseconds; the default unit is milliseconds. This option is available since 9.0.0.

quorum value

When activated, a cluster partition requires quorum in order to modify the replicated data set. That means a node in the cluster partition can only be promoted to primary if the cluster partition has quorum. Every node with a disk directly connected to the node that should be promoted counts. If a primary node should execute a write request, but the cluster partition has lost quorum, it will freeze I/O or reject the write request with an error (depending on the on-no-quorum setting). Upon losing quorum, a primary always invokes the quorum-lost handler. The handler is intended for notification purposes; its return code is ignored. The option's value can be set to off, majority, all, or a numeric value. If you set it to a numeric value, make sure that the value is greater than half of your number of nodes. Quorum is a mechanism to avoid data divergence; it can be used instead of fencing when there are more than two replicas. It defaults to off. If all missing nodes are marked as outdated, a partition always has quorum, no matter how small it is; that is, if you disconnect all secondary nodes gracefully, a single primary continues to operate. The moment a single secondary is lost, however, it has to be assumed that it forms a partition with all the missing outdated nodes. If the local partition might then be smaller than the other one, quorum is lost at that moment. If you want to allow permanently diskless nodes to gain quorum, it is recommended not to use majority or all. Specify an absolute number instead, since DRBD's heuristic to determine the complete number of diskful nodes in the cluster is unreliable. The quorum implementation is available starting with the DRBD kernel driver version 9.0.7.

quorum-minimum-redundancy value

This option sets the minimal required number of nodes with an UpToDate disk to allow the partition to gain quorum. This is a different requirement than the plain quorum option expresses. The option's value can be set to off, majority, all, or a numeric value. If you set it to a numeric value, make sure that the value is greater than half of your number of nodes. If you want to allow permanently diskless nodes to gain quorum, it is recommended not to use majority or all. Specify an absolute number instead, since DRBD's heuristic to determine the complete number of diskful nodes in the cluster is unreliable. This option is available starting with the DRBD kernel driver version 9.0.10.

on-no-quorum {io-error | suspend-io}

By default, DRBD freezes I/O on a device that has lost quorum. Setting on-no-quorum to io-error makes it complete all I/O operations with an error if quorum is lost. The on-no-quorum option is available starting with the DRBD kernel driver version 9.0.8.

after-sb-1pri policy

Defines what to do if a split-brain scenario is detected with one primary node and one secondary node. (A split-brain scenario is detected when two nodes connect, so the split-brain decision is always between two nodes.) The defined policies are:

disconnect Simply disconnect.
consensus If a consensus victim node can be selected, the split brain is resolved automatically. Otherwise, it behaves like disconnect.
discard-secondary The data on the secondary node is discarded.

after-sb-2pri policy

Defines how to react when a split-brain scenario is detected and both nodes act as primary. (A split-brain scenario is detected when two nodes connect, so the split-brain decision is always between two nodes.) The defined policy is:

disconnect Simply disconnect.

For 2 primary split brain, only manual recovery via disconnect is available.

allow-two-primaries

bsr does not support dual primary mode.

connect-int time

As soon as a connection between two nodes is configured with bsrsetup connect, bsr immediately tries to establish the connection. If this fails, bsr waits for connect-int seconds and then repeats. The default value of connect-int is 3 seconds.

csums-alg hash-algorithm

Normally, when two nodes resynchronize, the sync target requests a piece of out-of-sync data from the sync source, and the sync source sends the data. With many usage patterns, a significant number of those blocks will actually be identical. When a csums-alg algorithm is specified, when requesting a piece of out-of-sync data, the sync target also sends along a hash of the data it currently has. The sync source compares this hash with its own version of the data. It sends the sync target the new data if the hashes differ, and tells it that the data are the same otherwise. This reduces the network bandwidth required, at the cost of higher cpu utilization and possibly increased I/O on the sync target. The csums-alg can be set to one of the secure hash algorithms supported by the kernel; see the shash algorithms listed in /proc/crypto. By default, csums-alg is unset.
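The exchange described above can be sketched for a single block as follows. This is a simplified model that runs both sides in one process and uses Python's hashlib in place of the kernel's shash algorithms; the function name and the choice of sha1 are assumptions made for illustration.

```python
import hashlib


def resync_block(source_block: bytes, target_block: bytes, alg: str = "sha1"):
    """Model one checksum-based resync request.

    Returns (data the target ends up with, payload bytes sent by the source).
    """
    # The sync target sends a hash of the data it currently has.
    target_digest = hashlib.new(alg, target_block).digest()
    # The sync source hashes its own version of the block and compares.
    source_digest = hashlib.new(alg, source_block).digest()
    if source_digest == target_digest:
        # Blocks are identical: the source only reports "same", no data is sent.
        return target_block, 0
    # Blocks differ: the source sends the new data.
    return source_block, len(source_block)


if __name__ == "__main__":
    same = resync_block(b"a" * 4096, b"a" * 4096)
    changed = resync_block(b"b" * 4096, b"a" * 4096)
    print(same[1], changed[1])  # payload bytes transferred in each case
```

The sketch makes the trade-off visible: both nodes always pay for hashing, and bandwidth is saved only for blocks that turn out to be identical.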

data-integrity-alg alg

bsr normally relies on the data integrity checks built into the TCP/IP protocol, but if a data integrity algorithm is configured, it will additionally use this algorithm to make sure that the data received over the network match what the sender has sent. If a data integrity error is detected, bsr will close the network connection and reconnect, which will trigger a resync. The data-integrity-alg can be set to one of the secure hash algorithms supported by the kernel; see the shash algorithms listed in /proc/crypto. By default, this mechanism is turned off. Because of the CPU overhead involved, we recommend not to use this option in production environments. Also see the notes on data integrity below.

fencing fencing_policy

Fencing is a preventive measure to avoid situations where both nodes are primary and disconnected. This is also known as a split-brain situation. bsr supports the following fencing policies:

dont-care No fencing actions are taken. This is the default policy.
resource-only If a node becomes a disconnected primary, it tries to fence the peer. This is done by calling the fence-peer handler. The handler is supposed to reach the peer over an alternative communication path and call 'bsradm outdate minor' there.
resource-and-stonith If a node becomes a disconnected primary, it freezes all its I/O operations and calls its fence-peer handler. The fence-peer handler is supposed to reach the peer over an alternative communication path and call 'bsradm outdate minor' there. In case it cannot do that, it should stonith the peer. I/O is resumed as soon as the situation is resolved. In case the fence-peer handler fails, I/O can be resumed manually with 'bsradm resume-io'.

ko-count number

Defines the number of transmission retries on the TX node side when a bottleneck occurs during transmission buffering.

max-buffers number

Limits the memory usage per bsr minor device on the receiving side, or for internal buffers during resync or online-verify. Unit is PAGE_SIZE, which is 4 KiB on most systems. The minimum possible setting is hard coded to 32 (=128 KiB). These buffers are used to hold data blocks while they are written to/read from disk. To avoid possible distributed deadlocks on congestion, this setting is used as a throttle threshold rather than a hard limit. Once more than max-buffers pages are in use, further allocation from this pool is throttled. You want to increase max-buffers if you cannot saturate the IO backend on the receiving side.
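Because max-buffers is counted in pages, the resulting memory threshold depends on the page size. The arithmetic can be sketched as below, assuming the 4 KiB page size mentioned above; the helper name is hypothetical.

```python
PAGE_SIZE = 4096       # bytes; 4 KiB on most systems, as noted above
MIN_MAX_BUFFERS = 32   # hard-coded minimum from the text


def max_buffers_bytes(max_buffers: int, page_size: int = PAGE_SIZE) -> int:
    """Convert a max-buffers setting (in pages) to a throttle threshold in bytes."""
    if max_buffers < MIN_MAX_BUFFERS:
        raise ValueError(f"max-buffers must be at least {MIN_MAX_BUFFERS}")
    return max_buffers * page_size


if __name__ == "__main__":
    print(max_buffers_bytes(32) // 1024, "KiB")              # the minimum setting
    print(max_buffers_bytes(2048) // (1024 * 1024), "MiB")   # an example larger setting
```

The value 2048 in the usage lines is only an example, not a documented default.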

max-epoch-size number

Define the maximum number of write requests bsr may issue before issuing a write barrier. The default value is 2048, with a minimum of 1 and a maximum of 20000. Setting this parameter to a value below 10 is likely to decrease performance.

on-congestion policy,

congestion-fill threshold,

congestion-extents threshold

By default, bsr waits when the TCP send queue is full. In this case, the application cannot generate additional write requests until the send queue is available again. If you are using bsr with a proxy, we recommend the pull-ahead congestion policy, which puts bsr into Ahead/Behind mode before the transmission queue is full. bsr then records the differences between itself and the peer in the bitmap, but no longer replicates them to the peer. When enough buffer space becomes available again, the node resynchronizes with the peer and switches back to normal replication. This has the advantage of not blocking application I/O even when the queue is full, but the disadvantage that the peer node may lag far behind the original and is in an Inconsistent state during resynchronization. The available congestion policies are blocking (default), disconnect, and pull-ahead. The congestion-fill parameter defines the threshold for the amount of data being replicated on this connection. The default is 0, which disables the congestion control mechanism; the maximum is 1 TB. The congestion-extents parameter defines the number of bitmap ranges that can be active before switching to Ahead/Behind mode. The congestion-extents parameter is only valid when set to a value less than al-extents.
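The two thresholds can be read as a simple decision rule. The Python sketch below is one possible interpretation of the description above, not bsr's implementation: the function name, the use of "greater than or equal" comparisons, and all example values are assumptions.

```python
MIB = 2**20


def should_pull_ahead(fill_bytes: int, active_extents: int,
                      congestion_fill: int, congestion_extents: int,
                      al_extents: int) -> bool:
    """Decide whether a pull-ahead policy would switch to Ahead/Behind mode."""
    # congestion-fill of 0 disables the fill-based check.
    fill_hit = congestion_fill > 0 and fill_bytes >= congestion_fill
    # congestion-extents only takes effect when it is below al-extents.
    extents_hit = (congestion_extents < al_extents
                   and active_extents >= congestion_extents)
    return fill_hit or extents_hit


# Hypothetical values: 512 MiB being replicated against a 400 MiB threshold.
print(should_pull_ahead(512 * MIB, 10, 400 * MIB, 1000, 1237))
```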

ping-int interval

When the TCP/IP connection to a peer is idle for more than ping-int seconds, bsr will send a keep-alive packet to make sure that a failed peer or network connection is detected reasonably soon. The default value is 3 seconds, with a minimum of 1 and a maximum of 120 seconds. The unit is seconds.

ping-timeout timeout

Define the timeout for replies to keep-alive packets. If the peer does not reply within ping-timeout, bsr will close and try to reestablish the connection. The default value is 3 seconds, with a minimum of 0.1 seconds and a maximum of 3 seconds. The unit is tenths of a second.

protocol name

Use the specified protocol on this connection. The supported protocols are:

A Writes to the bsr device complete as soon as they have reached the local disk and the TCP/IP send buffer.
B Writes to the bsr device complete as soon as they have reached the local disk, and all peers have acknowledged the receipt of the write requests.
C Writes to the bsr device complete as soon as they have reached the local and all remote disks.
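The difference between the three protocols is the point at which a write is reported as complete. The following Python sketch models that rule directly from the descriptions above; the WriteState type and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WriteState:
    on_local_disk: bool    # data has reached the local disk
    in_send_buffer: bool   # data has reached the TCP/IP send buffer
    peers_received: bool   # all peers acknowledged receipt of the write
    peers_on_disk: bool    # data has reached all remote disks


def write_complete(protocol: str, state: WriteState) -> bool:
    """Return True when a write counts as complete under the given protocol."""
    if protocol == "A":
        return state.on_local_disk and state.in_send_buffer
    if protocol == "B":
        return state.on_local_disk and state.peers_received
    if protocol == "C":
        return state.on_local_disk and state.peers_on_disk
    raise ValueError(f"unknown protocol: {protocol!r}")


# A write that is on the local disk and queued for sending, but not yet acknowledged.
state = WriteState(True, True, False, False)
for name in "ABC":
    print(name, write_complete(name, state))
```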

rcvbuf-size size

Configure the size of the TCP/IP receive buffer. A value of 0 (the default) causes the buffer size to adjust dynamically. This parameter usually does not need to be set, but it can be set to a value up to 10 MiB. The default unit is bytes. Not supported on Windows.

sndbuf-size size

Sets the size of the TX buffer allocated by the sending worker thread. The maximum is 1 TB.

tcp-cork

By default, bsr uses the TCP_CORK socket option to prevent the kernel from sending partial messages; this results in fewer and bigger packets on the network. Some network stacks can perform worse with this optimization. On these, the tcp-cork parameter can be used to turn this optimization off.

timeout time

Define the timeout for replies over the network: if a peer node does not send an expected reply within the specified timeout, it is considered dead and the TCP/IP connection is closed. The timeout value must be lower than connect-int and lower than ping-int. The default is 5 seconds; the value is specified in tenths of a second.
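Because timeout is given in tenths of a second, it has to be converted before it can be compared with the other two intervals. The Python sketch below shows the conversion and the ordering check, assuming connect-int and ping-int are both given in seconds. The helper names and the example values are hypothetical.

```python
def timeout_seconds(timeout_tenths: int) -> float:
    """Convert a timeout given in tenths of a second to seconds."""
    return timeout_tenths / 10


def timeout_is_valid(timeout_tenths: int, connect_int_s: int, ping_int_s: int) -> bool:
    """Check that timeout is lower than both connect-int and ping-int."""
    t = timeout_seconds(timeout_tenths)
    return t < connect_int_s and t < ping_int_s


# Hypothetical settings: timeout 60 (6 seconds), connect-int 10, ping-int 10.
print(timeout_seconds(60), timeout_is_valid(60, 10, 10))
# The same timeout is rejected when ping-int is only 5 seconds.
print(timeout_is_valid(60, 10, 5))
```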

use-rle

Each replicated device on a cluster node has a separate bitmap for each of its peer devices. The bitmaps are used for tracking the differences between the local and peer device: depending on the cluster state, a disk range can be marked as different from the peer in the device's bitmap, in the peer device's bitmap, or in both bitmaps. When two cluster nodes connect, they exchange each other's bitmaps, and they each compute the union of the local and peer bitmap to determine the overall differences. Bitmaps of very large devices are also relatively large, but they usually compress very well using run-length encoding. This can save time and bandwidth for the bitmap transfers. The use-rle parameter determines if run-length encoding should be used. It is on by default.
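To see why bitmaps compress well, consider a bitmap that is almost entirely in sync. The Python sketch below uses textbook run-length encoding on a list of bits; it is not bsr's on-wire encoding, and the bitmap contents are made up for the example.

```python
from itertools import groupby


def rle_encode(bits):
    """Encode a sequence of 0/1 bits as (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original list of bits."""
    bits = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits


# 10,000 blocks, of which one contiguous range of 16 is marked out of sync.
bitmap = [0] * 6000 + [1] * 16 + [0] * 3984
runs = rle_encode(bitmap)
print(len(bitmap), "bits encoded as", len(runs), "runs:", runs)
```

The three runs carry the same information as the 10,000 individual bits, which is the effect the use-rle parameter relies on.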

verify-alg hash-algorithm

Online verification (bsradm verify) computes and compares checksums of disk blocks (i.e., hash values) in order to detect if they differ. The verify-alg parameter determines which algorithm to use for these checksums. It must be set to one of the secure hash algorithms supported by the kernel before online verify can be used; see the shash algorithms listed in /proc/crypto. We recommend scheduling online verifications regularly during low-load periods, for example once a month. Also see the notes on data integrity below.

Section on Parameters

address [address-family] address:port

Defines the address family, address, and port of a connection endpoint. The address families ipv4, ipv6, and ssocks are supported. If no address family is specified, ipv4 is assumed. For all address families except ipv6, the address is specified in IPv4 address notation (for example, 1.2.3.4). For ipv6, the address is enclosed in brackets and uses IPv6 address notation (for example, [fd01:2345:6789:abcd::1]). The port is always specified as a decimal number from 1 to 65535. On each host, the port numbers must be unique for each address; ports cannot be shared.
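The notation rules above can be checked mechanically. The following Python sketch splits an endpoint value into family, address, and port using the standard ipaddress module. The function name and the port number 7789 are hypothetical, and the sketch treats every family other than ipv6 as using IPv4 notation, as described above.

```python
import ipaddress


def parse_endpoint(spec: str):
    """Split '[address-family] address:port' into (family, address, port)."""
    parts = spec.split()
    if len(parts) == 1:
        family, hostport = "ipv4", parts[0]  # ipv4 is assumed when omitted
    elif len(parts) == 2:
        family, hostport = parts
    else:
        raise ValueError(f"unexpected endpoint: {spec!r}")

    # The port follows the last colon, which also works for bracketed IPv6.
    host, sep, port_text = hostport.rpartition(":")
    if not sep:
        raise ValueError("missing port")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")

    if family == "ipv6":
        if not (host.startswith("[") and host.endswith("]")):
            raise ValueError("ipv6 addresses must be enclosed in brackets")
        address = ipaddress.IPv6Address(host[1:-1])
    else:
        address = ipaddress.IPv4Address(host)
    return family, str(address), port


print(parse_endpoint("1.2.3.4:7789"))
print(parse_endpoint("ipv6 [fd01:2345:6789:abcd::1]:7789"))
```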

node-id value

Defines the unique node identifier for a node in the cluster. Node identifiers are used to identify individual nodes in the network protocol, and to assign bitmap slots to nodes in the metadata. Node identifiers can only be reassigned in a cluster when the cluster is down. It is essential that the node identifiers in the configuration and in the device metadata are changed consistently on all hosts. To change the metadata, dump the current state with bsrmeta dump-md, adjust the bitmap slot assignment, and update the metadata with bsrmeta restore-md. The node-id parameter must be set. Its value ranges from 0 to 16; there is no default.

Section options Parameters (Resource Options)

auto-promote bool-value

Not supported by bsr.

cpu-mask cpu-mask

Not supported by bsr.

on-no-data-accessible policy

Determines how to handle I/O requests when the requested data is not locally accessible (for example, if all disks fail). Not supported by bsr.

peer-ack-window value

On each node and for each device, bsr maintains a bitmap of the differences between the local and remote data for each peer device. For example, in a three-node setup (nodes A, B, C) each with a single device, every node maintains one bitmap for each of its peers. When nodes receive write requests, they know how to update the bitmaps for the writing node, but not how to update the bitmaps between themselves. In this example, when a write request propagates from node A to B and C, nodes B and C know that they have the same data as node A, but not whether or not they both have the same data. As a remedy, the writing node occasionally sends peer-ack packets to its peers which tell them which state they are in relative to each other. The peer-ack-window parameter specifies how much data a primary node may send before sending a peer-ack packet. A low value causes increased network traffic; a high value causes less network traffic but higher memory consumption on secondary nodes and higher resync times between the secondary nodes after primary node failures. (Note: peer-ack packets may be sent due to other reasons as well, e.g. membership changes or expiry of the peer-ack-delay timer.) The default value for peer-ack-window is 2 MiB, the default unit is sectors.
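Since the default is stated in MiB but the unit is sectors, a quick conversion helps when writing the value into a configuration. The Python sketch below assumes 512-byte sectors, which the text does not state explicitly; the function name is illustrative.

```python
SECTOR_SIZE = 512  # bytes; assumed sector size, not stated in the text


def bytes_to_sectors(num_bytes: int, sector_size: int = SECTOR_SIZE) -> int:
    """Convert a byte count to whole sectors, rejecting partial sectors."""
    if num_bytes % sector_size:
        raise ValueError("byte count is not a multiple of the sector size")
    return num_bytes // sector_size


default_window_bytes = 2 * 1024 * 1024  # the 2 MiB default
print(bytes_to_sectors(default_window_bytes))  # 4096 sectors under this assumption
```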

peer-ack-delay expiry-time

If no new write request is issued for expiry-time after the last finished write request, a peer-ack packet is sent. If a new write request is issued before the timer expires, the timer is reset to expiry-time. (Note: peer-ack packets may be sent due to other reasons as well, e.g. membership changes or the peer-ack-window option.) This parameter may influence resync behavior on remote nodes. Peer nodes need to wait until they receive a peer-ack before releasing a lock on an AL-extent. Resync operations between peers may need to wait for these locks. The default value for peer-ack-delay is 100 milliseconds, the default unit is milliseconds.

Section startup Parameters

The parameters in this section define the behavior of bsr at system startup time, in the bsr init script. They have no effect once the system is up and running.

degr-wfc-timeout timeout

Define how long to wait until all peers are connected in case the cluster consisted of a single node only when the system went down. This parameter is usually set to a value smaller than wfc-timeout. The assumption here is that peers which were unreachable before a reboot are less likely to be reachable after the reboot, so waiting is less likely to help. The timeout is specified in seconds. The default value is 0, which stands for an infinite timeout. Also see the wfc-timeout parameter.

outdated-wfc-timeout timeout

Define how long to wait until all peers are connected if all peers were outdated when the system went down. This parameter is usually set to a value smaller than wfc-timeout. The assumption here is that an outdated peer cannot have become primary in the meantime, so we don't need to wait for it as long as for a node which was alive before. The timeout is specified in seconds. The default value is 0, which stands for an infinite timeout. Also see the wfc-timeout parameter.

stacked-timeouts

Not supported by bsr.

wait-after-sb

This parameter causes bsr to continue waiting in the init script even when a split-brain situation has been detected, and the nodes therefore refuse to connect to each other.

wfc-timeout timeout

Defines how long the init script waits for all peers to connect. This can be useful in combination with a cluster manager that cannot manage bsr resources: when the cluster manager starts, the bsr resources are already running. The timeout is specified in seconds. The default value is 0, which indicates an infinite timeout. See also the degr-wfc-timeout parameter.
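The three startup timeouts apply to different situations, and a value of 0 always means waiting indefinitely. The Python sketch below illustrates how a script might pick the applicable timeout. It is not the bsr init script; the function name, the example values, and the precedence of the degraded case over the outdated case are assumptions.

```python
import math


def startup_wait_seconds(wfc_timeout: int, degr_wfc_timeout: int,
                         outdated_wfc_timeout: int, *,
                         was_degraded: bool, peers_were_outdated: bool) -> float:
    """Pick the startup timeout for the situation; 0 means wait forever."""
    if was_degraded:
        # The cluster consisted of a single node when the system went down.
        chosen = degr_wfc_timeout
    elif peers_were_outdated:
        # All peers were outdated when the system went down.
        chosen = outdated_wfc_timeout
    else:
        chosen = wfc_timeout
    return math.inf if chosen == 0 else float(chosen)


# Hypothetical settings: wfc-timeout 0, degr-wfc-timeout 60, outdated-wfc-timeout 30.
print(startup_wait_seconds(0, 60, 30, was_degraded=False, peers_were_outdated=False))
print(startup_wait_seconds(0, 60, 30, was_degraded=True, peers_were_outdated=False))
```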

Section volume Parameters

device /dev/bsr<minor-number>

Define the device name and minor number of a replicated block device. This is the device that applications are supposed to access; in most cases, the device is not used directly, but as a file system. This parameter is required and the standard device naming convention is assumed. In addition to this device, udev will create /dev/bsr/by-res/resource/volume and /dev/bsr/by-disk/lower-level-device symlinks to the device.

...

Define the lower-level block device that bsr will use for storing the actual data. While the replicated bsr device is configured, the lower-level device must not be used directly. Even read-only access with tools like dumpe2fs(8) and similar is not allowed. The keyword none specifies that no lower-level block device is configured; this also overrides inheritance of the lower-level device.

...
