Overview
bsr implements a block device that replicates data from the local node to all other nodes in the cluster. The actual data and the associated metadata are stored separately (as is usual with external metadata) on an ordinary block device volume on each cluster node.
Replicated block devices are named /dev/bsr<minor> by default, or are referenced directly through a symbolic link (drive letter) to the device. One or more devices are grouped into a resource, and the devices in a resource are replicated in parallel. A device inside a resource is called a volume, and a resource can be replicated between two or more cluster nodes. Connections between cluster nodes are point-to-point links and use the TCP protocol.
bsr consists of the high-level tool bsradm, which understands and processes the configuration files, and the low-level tools bsrsetup, bsrmeta, and bsrcon. The basic bsr configuration consists of /etc/bsr.conf and any additional files it includes (typically global_common.conf and all *.res files under /etc). Usually, each resource is defined in a separate *.res file in /etc/bsr.d/. The configuration file is designed so that each cluster node can hold an identical copy of the entire cluster configuration. This is not a strict requirement, however: sometimes it is necessary to use different configuration files on different nodes.
resource r0 {
    net {
        protocol C;
    }
    disk {
        resync-rate 10M;
        c-plan-ahead 0;
    }
    on alice {
        volume 0 {
            device e minor 2;
            disk e;
            meta-disk f;
        }
        address 10.1.1.31:7789;
    }
    on bob {
        volume 0 {
            disk e;
            meta-disk f;
        }
        address 10.1.1.32:7789;
    }
}
This example defines resource r0, which contains a single replicated device backed by the volume with drive letter e. The resource is replicated between hosts alice and bob, which have the IPv4 addresses 10.1.1.31 and 10.1.1.32 and the node identifiers 0 and 1, respectively. The actual data is stored on volume e, and the metadata is stored on volume f. Connections between the hosts use protocol C.
File Format
The configuration file consists of sections, which contain other sections and parameters depending on the section type. Each section consists of one or more keywords, sometimes a section name, an opening brace ("{"), the contents of the section, and a closing brace ("}"). Parameters within a section consist of a keyword, followed by one or more keywords or values, and a semicolon (";"). Some parameter values have a default scale that applies when a plain number is specified (for example, Kilo). The default scale can be overridden with a suffix (for example, M for Mega). Common suffixes are K = 2^10 = 1024, M = 1024 K, and G = 1024 M. Comments start with a hash sign ("#") and extend to the end of the line. In addition, any section can be prefixed with the keyword skip to make the section and all of its subsections be ignored. Additional files can be included with the include file-pattern statement. Include statements are only allowed outside of sections.
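As a minimal sketch of these syntax rules, the fragment below combines a comment, an include statement, a value with a scale suffix, and a skipped section. The include pattern and the resource name r1 are hypothetical, and the quoting of the pattern is an assumption; the fragment is not a complete configuration.

```
# Comments run from the hash sign to the end of the line.
include "bsr.d/*.res";     # hypothetical pattern; only valid outside of sections

resource r0 {
    disk {
        resync-rate 10M;   # the M suffix overrides the default scale
    }
    # on sections omitted for brevity
}

skip resource r1 {         # this section and its subsections are ignored
    net {
        protocol A;
    }
}
```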
The following sections are defined. Indentation indicates that a section is a subsection of the section above it.
common
    [disk]
    [handlers]
    [net]
    [options]
    [startup]
global
resource
    connection
        path
        net
    connection-mesh
        net
    [disk]
    floating
    handlers
    [net]
    on
        volume
            disk
        [disk]
    options
Sections in brackets affect other parts of the configuration. Inside the common section, they apply to all resources. A disk section in the resource section or in an on section applies to all volumes of that resource, and a net section in the resource section applies to all connections of that resource. This eliminates the need to repeat the same options for each resource, connection, or volume. Options can be overridden in a more specific resource, connection, on, or volume section. The peer-device options are resync-rate, c-plan-ahead, c-delay-target, c-fill-target, c-max-rate, and c-min-rate. For backward compatibility, they can also be specified in any disk section, in which case they are inherited by all relevant connections. If they are given in a connection section, they are inherited by all volumes on that connection. A peer-device-options section is opened with the disk keyword.
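The inheritance rules can be sketched as follows. This is an illustrative fragment under the assumption that overriding works as described above; the chosen values are examples only.

```
common {
    net {
        protocol C;         # default for the connections of every resource
    }
    disk {
        resync-rate 10M;    # peer-device option given in a disk section
    }
}

resource r0 {
    net {
        protocol A;         # overrides the common default for r0 only
    }
    # resync-rate 10M is still inherited from the common section
    # on sections omitted for brevity
}
```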
Sections
common
This section can contain one each of the disk, handlers, net, options, and startup sections. All resources inherit the parameters in these sections as their default values.
connection [name]
Define a connection between two hosts. This section must contain two host parameters or multiple path sections. The optional name is used to refer to the connection in the system log and in other messages. If no name is specified, the peer's host name is used instead.
path
Define a path between two hosts. This section must contain two host parameters.
connection-mesh
Define a connection mesh between multiple hosts. This section must contain a hosts parameter, which has the host names as arguments. This section is a shortcut to define many connections which share the same network options.
disk
Define parameters for a volume. All parameters in this section are optional.
floating [address-family] addr:port
Like the on section, except that instead of the host name a network address is used to determine if it matches a floating section. The node-id parameter in this section is required. If the address parameter is not provided, no connections to peers will be created by default. The device, disk, and meta-disk parameters must be defined in, or inherited by, this section.
global
Define some global parameters. All parameters in this section are optional. Only one global section is allowed in the configuration.
handlers
Define handlers to be invoked when certain events occur. The kernel passes the resource name in the first command-line argument and sets the following environment variables depending on the event's context:
For events related to a particular device: the device's minor number in BSR_MINOR, the device's volume number in BSR_VOLUME.
For events related to a particular device on a particular peer: the connection endpoints in BSR_MY_ADDRESS, BSR_MY_AF, BSR_PEER_ADDRESS, and BSR_PEER_AF; the device's local minor number in BSR_MINOR, and the device's volume number in BSR_VOLUME.
For events related to a particular connection: the connection endpoints in BSR_MY_ADDRESS, BSR_MY_AF, BSR_PEER_ADDRESS, and BSR_PEER_AF; and, for each device defined for that connection: the device's minor number in BSR_MINOR_volume-number.
For events that identify a device, if a lower-level device is attached, the lower-level device's device name is passed in BSR_BACKING_DEV (or BSR_BACKING_DEV_volume-number).
All parameters in this section are optional. Only a single handler can be defined for each event; if no handler is defined, nothing will happen.
net
Define parameters for a connection. All parameters in this section are optional.
on host-name [...]
Define the properties of a resource on a particular host or set of hosts. Specifying more than one host name can make sense in a setup with IP address failover, for example. The host-name argument must match the Linux host name (uname -n). This section usually contains or inherits at least one volume section. The node-id and address parameters must be defined in this section. The device, disk, and meta-disk parameters must be defined in, or inherited by, this section. A normal configuration file contains two or more on sections for each resource. Also see the floating section.
options
Define parameters for a resource. All parameters in this section are optional.
resource name
Define a resource. Usually contains at least two on sections and at least one connection section.
stacked-on-top-of resource
Used instead of an on section for configuring a stacked resource with three to four nodes. Stacking is deprecated in bsr; we recommend using a 1:N replication configuration instead.
startup
The parameters in this section determine the behavior of a resource at startup time.
volume volume-number
Define a volume within a resource. The volume numbers in the various volume sections of a resource define which devices on which hosts form a replicated device.
connection
host name [address [address-family] address] [port port-number]
Defines an endpoint for a connection. Each host statement refers to an on section in a resource. If a port number is defined, this endpoint will use the specified port instead of the port defined in the on section. Each connection section must contain exactly two host parameters. Instead of two host parameters the connection may contain multiple path sections.
path
host name [address [address-family] address] [port port-number]
Defines an endpoint for a connection. Each host statement refers to an on section in a resource. If a port number is defined, this endpoint will use the specified port instead of the port defined in the on section. Each path section must contain exactly two host parameters.
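A minimal sketch of both forms, assuming on sections for alice and bob as in the overview example. The second network (10.2.1.x) and port 7790 are hypothetical values, and the address:port notation is assumed to match the address parameter of the on section. The two variants are alternatives for the same pair of hosts and would not appear together in one resource.

```
# Variant 1: a connection with exactly two host parameters
connection {
    host alice;             # endpoint taken from "on alice"
    host bob port 7790;     # address from "on bob", but a different port
}

# Variant 2: a connection with multiple path sections
connection {
    path {
        host alice address 10.1.1.31:7789;
        host bob   address 10.1.1.32:7789;
    }
    path {
        host alice address 10.2.1.31:7789;   # hypothetical second network
        host bob   address 10.2.1.32:7789;
    }
}
```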
connection-mesh
hosts name...
Defines all nodes of a mesh. Each name refers to an on section in a resource. The port that is defined in the on section will be used.
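For three or more nodes, a mesh avoids writing out every connection by hand. The fragment below is a sketch that assumes a third host named charlie (hypothetical) with its own on section.

```
resource r0 {
    # on alice { ... }, on bob { ... }, on charlie { ... } omitted

    connection-mesh {
        hosts alice bob charlie;   # each name refers to an on section
        net {
            protocol C;            # shared by all connections of the mesh
        }
    }
}
```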
disk
al-extents extents
bsr tracks the disk areas that are active, that is, likely to be written again soon, based on recent write activity. Writes to an active area can go to disk immediately, whereas an inactive area must be activated first, which requires a metadata write. The set of active disk areas is called the activity log.
The activity log saves metadata writes, but the whole log must be resynchronized when a failed node recovers. The size of the activity log is therefore a major factor in how long a resynchronization takes after a primary crash and how quickly the replicated disk becomes consistent again. The activity log consists of a number of 4 MiB segments; the al-extents parameter determines how many of these segments can be active at the same time. The default for al-extents is 6001, with a minimum of 7 and a maximum of 65536. Depending on how the device metadata was created, the effective maximum may be smaller (see bsrmeta).
The effective maximum is 919 * (available on-disk activity log ring buffer area / 4 KB - 1), which yields 6433 for the default 32 KB ring buffer (covering more than 25 GiB of data). We recommend keeping the activity log small enough that the backend storage and the replication link can resynchronize it in about 5 minutes.
Changing al-extents requires taking the resource down.
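The figures for the effective maximum can be checked with a short calculation, using only the values stated above (a 32 KB ring buffer, 4 KB units, and 4 MiB per segment). The last line applies the same segment size to the default of 6001 extents.

```latex
\[
  919 \times \left(\frac{32\ \text{KB}}{4\ \text{KB}} - 1\right) = 919 \times 7 = 6433
\]
\[
  6433 \times 4\ \text{MiB} = 25732\ \text{MiB} \approx 25.1\ \text{GiB}
\]
\[
  6001 \times 4\ \text{MiB} = 24004\ \text{MiB} \approx 23.4\ \text{GiB}
\]
```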
al-updates {yes | no}
With this parameter, the activity log can be turned off entirely (see the al-extents parameter). This will speed up writes because fewer metadata writes will be necessary, but the entire device needs to be resynchronized upon recovery of a failed primary node. The default value for al-updates is yes.
disk-barrier,
disk-flushes,
disk-drain
bsr has three methods of handling the ordering of write requests:
disk-flushes Performs the write I/O and then forces a flush so that all data reaches the disk. The implementation of flush can differ by platform and drive vendor. Older implementations bypassed the disk cache with a technique called 'force unit access'; current implementations generally guarantee the write by emptying the cache. This option is enabled by default.
disk-barrier Ensures that requests are written to disk in the correct order. A barrier guarantees that all requests submitted before the barrier reach the disk before any request submitted after it. This is implemented using 'tagged command queuing' on SCSI devices and 'native command queuing' on SATA devices. Only some devices and device stacks support this method; the device mapper (LVM) supports barriers only in some configurations. Using this option on a system that does not support disk barriers can result in data loss or corruption. This option was supported by older Linux kernels, but kernels since linux-2.6.36 (or 2.6.32 on RHEL6) can no longer detect whether disk barriers are supported. This option is off by default and must be enabled explicitly.
disk-drain Waits for the request queue to "drain" (that is, for outstanding requests to complete) before submitting the next write request. This method requires that requests are stable on disk by the time they complete. This option was previously enabled by default but is now disabled.
disk-timeout
If the lower-level device that stores the data does not complete an I/O request within disk-timeout, bsr treats this as a failure. The lower-level device is then detached, and the disk state of the device becomes Diskless. If bsr is connected to one or more peers, the failed request is forwarded to one of them. This option is dangerous and can lead to a kernel panic. Aborting requests and forcibly detaching the disk is intended for a completely blocked local backing device that neither completes requests nor returns errors. In that situation, a hard reset and failover is usually the only way out. The default value of disk-timeout is 0, which means an infinite timeout. Timeouts are specified in units of 0.1 seconds.
md-flushes
Enable disk flushes and disk barriers on the meta-data device. This option is enabled by default. See the disk-flushes parameter.
on-io-error handler
Configure how bsr responds to I/O errors on lower-level devices. The following policies are defined:
passthrough If the lower-level device returns an error, the affected block is marked out-of-sync (OOS) and the error is passed to the upper layer. The upper layer usually retries the failed I/O. If the retry succeeds, the OOS mark is cleared naturally; otherwise it remains recorded. This is the default for bsr.
call-local-io-error Call the local-io-error handler (see the handlers section).
detach Detach the lower-level device and switch to Diskless state. In Diskless state, I/O cannot be performed and an immediate failover is required.
max-passthrough-count
When on-io-error is set to passthrough, a disk is considered to have failed permanently once passthrough has occurred more than a certain number of times. This parameter specifies that threshold. It is used only on Linux and defaults to 100.
resync-after res-name/volume
Define that a device should only resynchronize after the specified other device. By default, no order between devices is defined, and all devices will resynchronize in parallel. Depending on the configuration of the lower-level devices, and the available network and disk bandwidth, this can slow down the overall resync process. This option can be used to form a chain or tree of dependencies among devices.
peer-device-options
Note that this section is opened with the disk keyword.
resync-rate rate
Defines the bandwidth available for resynchronization. bsr allows normal application I/O even during resynchronization. If resynchronization takes up too much bandwidth, application I/O can become very slow; this parameter can be used to avoid that. This option only takes effect when the dynamic resync controller is disabled.
c-plan-ahead plan_time
Dynamically controls the resynchronization speed. The mechanism is enabled by setting the c-plan-ahead parameter to a positive value, and the maximum bandwidth is limited by the c-max-rate parameter. The c-plan-ahead parameter defines how quickly bsr adapts to changes in the resynchronization rate; it should be set to at least 5 times the network round-trip time (RTT). If c-fill-target is defined, the controller tries to keep the buffers along the data path filled with the defined amount of data; if c-delay-target is defined, it tries to maintain the defined delay instead. Common values of c-fill-target for "normal" data paths range from 4K to 100K. If you use drx, we recommend using c-delay-target instead of c-fill-target. The c-delay-target parameter is used when c-fill-target is undefined or set to 0, and should be set to at least 5 times the network round-trip time. The c-max-rate parameter should be set to either the bandwidth available between the bsr host and the system hosting drx, or the available disk bandwidth. The default values for these parameters are: c-plan-ahead = 20 (in units of 0.1 seconds), c-fill-target = 0 (in sectors), c-delay-target = 1 (in units of 0.1 seconds), and c-max-rate = 102400 (in KiB/s).
c-min-rate min_rate
A node that is both primary and sync source must schedule application I/O requests and resynchronization I/O requests. The c-min-rate parameter limits the bandwidth used for resynchronization I/O to min_rate; the remaining bandwidth is used for replication of application I/O. A c-min-rate value of 0 means that there is no limit on the resynchronization I/O bandwidth, which can slow down application I/O significantly. Use a value of 1 (1 KiB/s) for the lowest possible resynchronization rate. The default value of c-min-rate is 250, in KiB/s.
c-max-rate max_rate
Sets the maximum bandwidth used for resynchronization I/O. bsr trades off replication bandwidth to maintain synchronization bandwidth from c-min-rate to c-max-rate.
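The two resync modes can be sketched as follows. The resource names r0 and r1 are placeholders, and the c-delay-target value of 10 is a hypothetical choice rather than a recommendation; the other values are the defaults quoted above.

```
# Fixed rate: the dynamic controller is off, so resync-rate applies.
resource r0 {
    disk {
        c-plan-ahead 0;
        resync-rate  10M;
    }
    # on sections omitted
}

# Dynamic controller: enabled by a positive c-plan-ahead.
resource r1 {
    disk {
        c-plan-ahead   20;       # 2 seconds, in units of 0.1 s
        c-fill-target  0;        # 0, so c-delay-target is used
        c-delay-target 10;       # 1 second, in units of 0.1 s (hypothetical)
        c-min-rate     250;      # KiB/s
        c-max-rate     102400;   # KiB/s
    }
    # on sections omitted
}
```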
global
dialog-refresh time
The bsr init script can be used to configure and start bsr devices, which may involve waiting for other cluster nodes. While waiting, the init script shows the remaining waiting time. The dialog-refresh parameter defines the number of seconds between updates of that countdown. The default value is 1; a value of 0 turns off the countdown.
disable-ip-verification
Normally, bsr checks if the IP address in the configuration matches the host name. You can disable these checks using the disable-ip-verification parameter.
usage-count {yes | no | ask}
Controls participation in the collection of usage statistics. bsr does not use this parameter.
handlers
after-resync-target cmd
Called on a resync target when a node state changes from Inconsistent to Consistent when a resync finishes. This handler can be used for removing the snapshot created in the before-resync-target handler.
before-resync-target cmd
Called on a resync target before a resync begins. This handler can be used for creating a snapshot of the lower-level device for the duration of the resync: if the resync source becomes unavailable during a resync, reverting to the snapshot can restore a consistent state.
before-resync-source cmd
Called on a resync source before a resync begins.
out-of-sync cmd
Called on all nodes after a verify finishes and out-of-sync blocks were found. This handler is mainly used for monitoring purposes. An example would be to call a script that sends an alert SMS.
fence-peer cmd
Called when a node should fence a resource on a particular peer. The handler should not use the same communication path that bsr uses for talking to the peer.
unfence-peer cmd
Called when a node should remove fencing constraints from other nodes.
initial-split-brain cmd
Called when bsr connects to a peer and detects that the peer is in a split-brain state with the local node. This handler is also called for split-brain scenarios which will be resolved automatically.
local-io-error cmd
Called when an I/O error occurs on a lower-level device.
pri-lost cmd
The local node is currently primary, but bsr believes that it should become a sync target. The node should give up its primary role.
pri-lost-after-sb cmd
The local node is currently primary, but it has lost the after-split-brain auto recovery procedure. The node should be abandoned.
split-brain cmd
bsr has detected a split-brain situation which could not be resolved automatically. Manual recovery is necessary. This handler can be used to call for administrator attention.
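A handlers section might look like the sketch below. The script paths are placeholders, not files shipped with bsr, and quoting the command as a string is an assumption about the syntax.

```
resource r0 {
    handlers {
        # Placeholder scripts; each receives the resource name as its
        # first argument and the BSR_* environment variables described
        # in the handlers section.
        split-brain    "/usr/local/bin/notify-split-brain.sh";
        out-of-sync    "/usr/local/bin/notify-out-of-sync.sh";
        local-io-error "/usr/local/bin/handle-io-error.sh";
    }
    # on sections omitted
}
```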
net
after-sb-0pri policy
Defines how to react when a split-brain scenario is detected and neither of the two nodes is in the Primary role. (A split-brain scenario is detected when two nodes connect, so the split-brain decision is always made between two nodes.) The defined policies are:
disconnect Simply disconnect.
discard-younger-primary,
discard-older-primary Resynchronize from the node that became Primary first (discard-younger-primary) or from the node that became Primary last (discard-older-primary). If both nodes became Primary independently, the discard-least-changes policy is used.
discard-zero-changes If only one of the nodes wrote data, resynchronize from that node. If both nodes wrote data, disconnect.
discard-least-changes Resynchronize from the node that wrote more data.
discard-node-nodename Always discard the named node.
after-sb-1pri policy
Define how to react when a split-brain scenario is detected with one node in the Primary role and one node in the Secondary role. (A split-brain scenario is detected when two nodes connect, so the split-brain decision is always made between two nodes.) The defined policies are:
disconnect Simply disconnect.
consensus If a victim node can be chosen by consensus, the split brain is resolved automatically. Otherwise, this policy behaves like disconnect.
discard-secondary Discard the data on the secondary node.
after-sb-2pri policy
Define how to react when a split-brain scenario is detected and both nodes are in the Primary role. (A split-brain scenario is detected when two nodes connect, so the split-brain decision is always made between two nodes.) The defined policy is:
disconnect Simply disconnect.
For 2 primary split brain, only manual recovery via disconnect is available.
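Put together, the three policies might be configured as in the sketch below. The choice of policies is an example only and uses names listed above.

```
resource r0 {
    net {
        after-sb-0pri discard-zero-changes;  # resync from the only node that wrote
        after-sb-1pri discard-secondary;     # the secondary's data is discarded
        after-sb-2pri disconnect;            # only manual recovery is available
    }
    # on sections omitted
}
```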
allow-two-primaries
bsr does not support dual primary mode.
connect-int time
As soon as a connection between two nodes is configured with bsrsetup connect, bsr immediately tries to establish the connection. If this fails, bsr waits for connect-int seconds and then repeats. The default value of connect-int is 3 seconds.
csums-alg hash-algorithm
Normally, when two nodes resynchronize, the sync target requests a piece of out-of-sync data from the sync source, and the sync source sends the data. With many usage patterns, a significant number of those blocks will actually be identical. When a csums-alg algorithm is specified, the sync target also sends along a hash of the data it currently has when it requests a piece of out-of-sync data. The sync source compares this hash with its own version of the data. It sends the sync target the new data if the hashes differ, and tells it that the data are the same otherwise. This reduces the network bandwidth required, at the cost of higher CPU utilization and possibly increased I/O on the sync target. The csums-alg can be set to one of the secure hash algorithms supported by the kernel; see the shash algorithms listed in /proc/crypto. By default, csums-alg is unset.
data-integrity-alg alg
bsr normally relies on the data integrity checks built into the TCP/IP protocol, but if a data integrity algorithm is configured, it will additionally use this algorithm to make sure that the data received over the network match what the sender has sent. If a data integrity error is detected, bsr will close the network connection and reconnect, which will trigger a resync. The data-integrity-alg can be set to one of the secure hash algorithms supported by the kernel; see the shash algorithms listed in /proc/crypto. By default, this mechanism is turned off. Because of the CPU overhead involved, we recommend not to use this option in production environments. Also see the notes on data integrity below.
fencing fencing_policy
Fencing is a preventive measure to avoid situations where both nodes are primary and disconnected. This is also known as a split-brain situation. bsr supports the following fencing policies:
dont-care No fencing actions are taken. This is the default policy.
resource-only If a node becomes a disconnected primary, it tries to fence the peer by calling the fence-peer handler. The handler is supposed to reach the peer over an alternative communication path and call 'bsradm outdate minor' there.
resource-and-stonith If a node becomes a disconnected primary, it freezes all of its I/O operations and calls its fence-peer handler. The handler is supposed to reach the peer over an alternative communication path and call 'bsradm outdate minor' there. If it cannot do that, it should STONITH the peer. I/O is resumed as soon as the situation is resolved. If the fence-peer handler fails, I/O can be resumed manually with 'bsradm resume-io'.
ko-count number
Defines the number of transmission retries on the TX node side when a bottleneck occurs during transmission buffering.
max-buffers number
Limits the memory usage per bsr minor device on the receiving side, or for internal buffers during resync or online-verify. Unit is PAGE_SIZE, which is 4 KiB on most systems. The minimum possible setting is hard coded to 32 (=128 KiB). These buffers are used to hold data blocks while they are written to/read from disk. To avoid possible distributed deadlocks on congestion, this setting is used as a throttle threshold rather than a hard limit. Once more than max-buffers pages are in use, further allocation from this pool is throttled. You want to increase max-buffers if you cannot saturate the IO backend on the receiving side.
max-epoch-size number
Define the maximum number of write requests bsr may issue before issuing a write barrier. The default value is 2048, with a minimum of 1 and a maximum of 20000. Setting this parameter to a value below 10 is likely to decrease performance.
on-congestion policy,
congestion-fill threshold,
congestion-extents threshold
By default, bsr blocks when the TCP send queue is full, which prevents applications from generating further write requests until space in the send queue becomes available again. When using bsr with a proxy, we recommend the pull-ahead congestion policy, which switches bsr into Ahead/Behind mode before the send queue is full. bsr then records the differences between itself and the peer in its bitmap, but no longer replicates them to the peer. When enough buffer space becomes available again, the node resynchronizes with the peer and switches back to normal replication. This has the advantage of not blocking application I/O even when the queue is full, and the disadvantage that the peer node may fall far behind and is in an Inconsistent state during resynchronization. The available congestion policies are blocking (the default), disconnect, and pull-ahead. The congestion-fill parameter defines how much data may be in flight on this connection. The default value is 0, which disables this congestion control mechanism; the maximum is 1 TB. The congestion-extents parameter defines how many bitmap extents may be active before switching into Ahead/Behind mode. It is effective only when set to a value smaller than al-extents.
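A pull-ahead setup could be sketched as follows. The thresholds are hypothetical, and the use of the G suffix for congestion-fill is an assumption about the accepted units.

```
resource r0 {
    net {
        protocol A;
        on-congestion      pull-ahead;
        congestion-fill    1G;     # hypothetical threshold, nonzero to enable
        congestion-extents 1000;   # must be smaller than al-extents to be effective
    }
    disk {
        al-extents 6001;           # the default, shown for comparison
    }
    # on sections omitted
}
```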
ping-int interval
When the TCP/IP connection to a peer is idle for more than ping-int seconds, bsr will send a keep-alive packet to make sure that a failed peer or network connection is detected reasonably soon. The default value is 3 seconds, with a minimum of 1 and a maximum of 120 seconds. The unit is seconds.
ping-timeout timeout
Define the timeout for replies to keep-alive packets. If the peer does not reply within ping-timeout, bsr will close and try to reestablish the connection. The default value is 3 seconds, with a minimum of 0.1 seconds and a maximum of 3 seconds. The unit is tenths of a second.
protocol name
Use the specified protocol on this connection. The supported protocols are:
A Writes to the bsr device complete as soon as they have reached the local disk and the TCP/IP send buffer.
B Writes to the bsr device complete as soon as they have reached the local disk, and all peers have acknowledged the receipt of the write requests.
C Writes to the bsr device complete as soon as they have reached the local and all remote disks.
rcvbuf-size size
Configure the size of the TCP/IP receive buffer. A value of 0 (the default) causes the buffer size to adjust dynamically. This parameter usually does not need to be set, but it can be set to a value up to 10 MiB. The default unit is bytes. Not supported on Windows.
sndbuf-size size
Set the size of TX buffer allocated by the sending worker thread. You can set up to 1TB.
tcp-cork
By default, bsr uses the TCP_CORK socket option to prevent the kernel from sending partial messages; this results in fewer and bigger packets on the network. Some network stacks can perform worse with this optimization. On these, the tcp-cork parameter can be used to turn this optimization off.
timeout time
Define the timeout for replies over the network: if a peer node does not send an expected reply within the specified timeout, it is considered dead and the TCP/IP connection is closed. The timeout value must be lower than connect-int and lower than ping-int. The default is 5 seconds; the value is specified in tenths of a second.
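Because these parameters use different units, a small sketch may help. All values are hypothetical and were chosen only so that timeout stays below both connect-int and ping-int, as required above.

```
resource r0 {
    net {
        connect-int  10;   # seconds
        ping-int     10;   # seconds
        ping-timeout 5;    # tenths of a second, so 0.5 s
        timeout      50;   # tenths of a second, so 5 s (below 10 s)
    }
    # on sections omitted
}
```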
use-rle
Each replicated device on a cluster node has a separate bitmap for each of its peer devices. The bitmaps are used for tracking the differences between the local and peer device: depending on the cluster state, a disk range can be marked as different from the peer in the device's bitmap, in the peer device's bitmap, or in both bitmaps. When two cluster nodes connect, they exchange each other's bitmaps, and they each compute the union of the local and peer bitmap to determine the overall differences. Bitmaps of very large devices are also relatively large, but they usually compress very well using run-length encoding. This can save time and bandwidth for the bitmap transfers. The use-rle parameter determines if run-length encoding should be used. It is on by default.
verify-alg hash-algorithm
Online verification (bsradm verify) computes and compares checksums of disk blocks (i.e., hash values) in order to detect if they differ. The verify-alg parameter determines which algorithm to use for these checksums. It must be set to one of the secure hash algorithms supported by the kernel before online verify can be used; see the shash algorithms listed in /proc/crypto. We recommend to schedule online verifications regularly during low-load periods, for example once a month. Also see the notes on data integrity below.
on
address [address-family] address:port
Defines the address family, address, and port of a connection endpoint. The address families ipv4, ipv6, and ssocks are supported. If no address family is specified, ipv4 is assumed. For all address families except ipv6, the address is specified in IPv4 address notation (for example, 1.2.3.4). For ipv6, the address is enclosed in brackets and uses IPv6 address notation (for example, [fd01:2345:6789:abcd::1]). The port is always specified as a decimal number from 1 to 65535. On each host, the port numbers must be unique for each address; ports cannot be shared.
node-id value
Defines the unique node identifier for a node in the cluster. Node identifiers are used to identify individual nodes in the network protocol, and to assign bitmap slots to nodes in the metadata. Node identifiers can only be reassigned in a cluster when the cluster is down. It is essential that the node identifiers in the configuration and in the device metadata are changed consistently on all hosts. To change the metadata, dump the current state with bsrmeta dump-md, adjust the bitmap slot assignment, and update the metadata with bsrmeta restore-md. The node-id parameter must be set. Its value ranges from 0 to 16; there is no default.
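The two parameters of the on section can be sketched as follows, using the address notations described above. The IPv6 address is the documentation example from the text, not a real endpoint, and the volume sections are left out.

```
resource r0 {
    on alice {
        node-id 0;
        address 10.1.1.31:7789;                       # no family given: ipv4
        # volume sections omitted
    }
    on bob {
        node-id 1;
        address ipv6 [fd01:2345:6789:abcd::1]:7789;   # ipv6 address in brackets
        # volume sections omitted
    }
}
```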
options
svc-auto-up
Automatically start the resource when the bsr service starts. The default is yes.
svc-auto-down
Automatically stop the resource when the bsr service shuts down. The default is yes.
options(Resource Options)
auto-promote bool-value
Not supported by bsr.
cpu-mask cpu-mask
Not supported by bsr.
on-no-data-accessible policy
Determines how to handle I/O requests when the requested data is not locally accessible (for example, if all disks fail). Not supported by bsr.
peer-ack-window value
On each node and for each device, bsr maintains a bitmap of the differences between the local and remote data for each peer device. For example, in a three-node setup (nodes A, B, C) each with a single device, every node maintains one bitmap for each of its peers. When nodes receive write requests, they know how to update the bitmaps for the writing node, but not how to update the bitmaps between themselves. In this example, when a write request propagates from node A to B and C, nodes B and C know that they have the same data as node A, but not whether or not they both have the same data. As a remedy, the writing node occasionally sends peer-ack packets to its peers which tell them which state they are in relative to each other. The peer-ack-window parameter specifies how much data a primary node may send before sending a peer-ack packet. A low value causes increased network traffic; a high value causes less network traffic but higher memory consumption on secondary nodes and higher resync times between the secondary nodes after primary node failures. (Note: peer-ack packets may be sent due to other reasons as well, e.g. membership changes or expiry of the peer-ack-delay timer.) The default value for peer-ack-window is 2 MiB, the default unit is sectors.
peer-ack-delay expiry-time
If no new write request is issued for expiry-time after the last finished write request, a peer-ack packet is sent. If a new write request is issued before the timer expires, the timer is reset to expiry-time. (Note: peer-ack packets may be sent for other reasons as well, e.g. membership changes or the peer-ack-window option.) This parameter may influence resync behavior on remote nodes. Peer nodes need to wait until they receive a peer-ack before releasing a lock on an AL-extent. Resync operations between peers may need to wait for these locks. The default value for peer-ack-delay is 100 milliseconds; the default unit is milliseconds.
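The two peer-ack parameters can be sketched together in a resource-level options section. The values below simply restate the documented defaults. The conversion of 2 MiB into sectors assumes 512-byte sectors, which the text above does not state.

```
resource r0 {
    options {
        peer-ack-window 4096;   # in sectors; 4096 * 512 bytes = 2 MiB (assumes 512-byte sectors)
        peer-ack-delay 100;     # in milliseconds; the documented default
    }
}
```

Lowering peer-ack-window trades extra network traffic for shorter resync times between secondaries after a primary failure. Raising it does the opposite.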
max-req-write-count
The maximum number of in-process (inflight) write I/O requests that can be allowed on the resource. The default value is 100000.
accelbuf-size size
Size of the buffer used to improve local write performance in asynchronous replication. The default value is 10MB. accelbuf allocates a single buffer and copies the data of each write request into it before the data is copied to the transmit buffer, which allows the I/O to complete quickly. (Supported after bsr 1.7)
max-accelbuf-blk-size size
The target I/O block size of the accelbuf buffer. The default value is 4KB, and accelbuf is applied only for write I/Os of 4KB or less. (Supported after bsr 1.7)
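A hedged sketch of the write-path parameters in one options section follows. It restates the documented defaults and assumes that accelbuf-size and max-accelbuf-blk-size accept the K and M suffixes described under File Format. The use of protocol A to select asynchronous replication is also an assumption, since only protocol C appears in this document.

```
resource r0 {
    net { protocol A; }                # assumption: protocol A selects asynchronous replication
    options {
        max-req-write-count 100000;    # in-flight write requests allowed (documented default)
        accelbuf-size 10M;             # assumed suffix form of the 10MB default
        max-accelbuf-blk-size 4K;      # only writes of 4KB or less go through accelbuf
    }
}
```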
startup
The parameters in this section define the behavior of bsr at system startup time, in the bsr init script. They have no effect once the system is up and running.
degr-wfc-timeout timeout
Define how long to wait until all peers are connected in case the cluster consisted of a single node only when the system went down. This parameter is usually set to a value smaller than wfc-timeout. The assumption here is that peers which were unreachable before a reboot are less likely to be reachable after the reboot, so waiting is less likely to help. The timeout is specified in seconds. The default value is 0, which stands for an infinite timeout. Also see the wfc-timeout parameter.
outdated-wfc-timeout timeout
Define how long to wait until all peers are connected if all peers were outdated when the system went down. This parameter is usually set to a value smaller than wfc-timeout. The assumption here is that an outdated peer cannot have become primary in the meantime, so we don't need to wait for it as long as for a node which was alive before. The timeout is specified in seconds. The default value is 0, which stands for an infinite timeout. Also see the wfc-timeout parameter.
stacked-timeouts
Not supported by bsr.
wait-after-sb
This parameter causes bsr to continue waiting in the init script even when a split-brain situation has been detected, and the nodes therefore refuse to connect to each other.
wfc-timeout timeout
Defines how long the init script waits for all peers to connect. This can be useful in combination with a cluster manager that cannot manage bsr resources: when the cluster manager starts, the bsr resources are already up and running. The timeout is specified in seconds. The default value is 0, which stands for an infinite timeout. Also see the degr-wfc-timeout parameter.
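The relationship between the three timeouts can be sketched as a startup section. The numbers are illustrative only. They follow the guidance above that degr-wfc-timeout and outdated-wfc-timeout are usually smaller than wfc-timeout.

```
resource r0 {
    startup {
        wfc-timeout 120;            # wait up to 120 seconds for all peers at startup
        degr-wfc-timeout 60;        # shorter wait if the cluster was already degraded
        outdated-wfc-timeout 30;    # shorter still if all peers were outdated
    }
}
```

Leaving any of these at 0 keeps the default of waiting indefinitely.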
volume
device /dev/bsr minor-number
Define the device name and minor number of a replicated block device. This is the device that applications are supposed to access; in most cases, the device is not used directly, but as a file system. This parameter is required and the standard device naming convention is assumed. In addition to this device, udev will create /dev/bsr/by-res/resource/volume and /dev/bsr/by-disk/lower-level-device symlinks to the device.
disk {[disk] | none}
Define the lower-level block device that bsr will use for storing the actual data. While the replicated bsr device is configured, the lower-level device must not be used directly. Even read-only access with tools like dumpe2fs(8) and similar is not allowed. The keyword none specifies that no lower-level block device is configured; this also overrides inheritance of the lower-level device.
meta-disk internal,
meta-disk device,
meta-disk device [index]
Define where the metadata of a replicated block device resides: it can be internal, meaning that the lower-level device contains both the data and the metadata, or on a separate device. When the index form of this parameter is used, multiple replicated devices can share the same metadata device, each using a separate index. Each index occupies 128 MiB of data, which corresponds to a replicated device size of at most 4 TiB with two cluster nodes. Sharing metadata devices is no longer recommended; use the LVM volume manager instead to create metadata devices as needed. When the index form of this parameter is not used, the size of the lower-level device determines the size of the metadata. The size needed is 36 KiB + (size of lower-level device) / 32K * (number of nodes - 1). If the metadata device is bigger than that, the extra space is not used. This parameter is required if a disk other than none is specified, and ignored if disk is set to none. A meta-disk parameter without a disk parameter is not allowed.
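To make the sizing formula concrete, the sketch below works through it in comments and then shows two volume definitions, one with external and one with internal metadata. All device paths are hypothetical, and the device statements assume the /dev/bsr<minor> naming convention mentioned in the Overview.

```
# Metadata size for a hypothetical 1 TiB lower-level device replicated among 3 nodes:
#   1 TiB / 32K         = 32 MiB per peer
#   32 MiB * (3 - 1)    = 64 MiB
#   36 KiB + 64 MiB     = 64 MiB + 36 KiB required on the metadata device
volume 0 {
    device /dev/bsr0;        # hypothetical replicated device
    disk /dev/sdb1;          # hypothetical 1 TiB lower-level device
    meta-disk /dev/sdc1;     # hypothetical external device of at least 64 MiB + 36 KiB
}
volume 1 {
    device /dev/bsr1;        # hypothetical
    disk /dev/sdd1;          # hypothetical
    meta-disk internal;      # metadata stored on the lower-level device itself
}
```

The same arithmetic gives the figure quoted for the index form: 4 TiB / 32K * (2 - 1) = 128 MiB, the space one index occupies.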