개요

bsr은 로컬 노드의 데이터를 클러스터의 모든 다른 노드로 복제하는 블록 장치를 구현합니다. 여기서 실제 데이터 및 이와 관련한 메타 데이터는 각 클러스터 노드의 "일반"블록 장치 볼륨에 (일반적으로 외부메타일 경우)개별적으로 저장됩니다. 복제 블록 장치는 기본적으로 /dev/drbd<minor>형식으로 명명하거나 또는 장치의 심볼릭 링크(레터)로 직접 지정해야 합니다. 리소스 당 하나 이상의 장치들이 그룹화 되고 각각의 장치들을 병렬적으로 복제합니다. 리소스 내부의 장치를 volume으로 규정하며 두 개 이상의 클러스터 노드간에 리소스를 복제 할 수 있습니다. 클러스터 노드 간 연결은 지점 간 링크이며 TCP 프로토콜을 사용합니다. bsr은 구성파일을 이해하고 처리하는 기본 구성요소 bsradm 과 저수준의 구성요소 bsrsetup, bsrmeta, bsrcon으로 구성됩니다. 기본적인 bsr 구성은 /etc/drbd.conf 및 여기에 포함 된 추가 파일들(일반적으로 global_common.conf 및 /etc 경로의 안에 있는 모든 * .res 파일)로 구성됩니다. 보통 각 리소스를 etc/bsr.d/. 경로에서 별도의 * .res 파일들로 정의하는 것이 유용합니다. 구성 파일은 각 클러스터 노드가 전체 클러스터 구성의 동일한 사본을 포함하도록 설계되었습니다. 그러나 때로는 노드 별로 각기 다른 구성파일의 내용을 가져야 할 수도 있어서 절대적인 것은 아닙니다.

resource r0 {
        net {
               protocol C;
        }
       disk {
              resync-rate 10M;
              c-plan-ahead 0;
       }
       on alice {
              volume 0 {
                       	device e minor 2;
						disk e;
						meta-disk f;
              }
             address 10.1.1.31:7789;
             node-id 0;
      }
      on bob {
            volume 0 {
                     disk e;
                     meta-disk f;
            }
           address 10.1.1.32:7789;
           node-id 1;
      }
}

이 예에서는 e 레터의 볼륨을 단일 복제 장치가 포함 된 리소스 r0로 정의합니다. 이 리소스는 각각 IPv4 주소 10.1.1.31 및 10.1.1.32와 노드 식별자 0 및 1을 가진 호스트 alice 및 bob 간의 복제를 수행합니다. 실제 데이터는 e 볼륨이지만 및 메타 데이터는 f 볼륨에 저장됩니다. 호스트 간 연결에는 프로토콜 C가 사용됩니다.

파일 포맷

구성 파일은 섹션 유형에 따라 다른 섹션 및 매개 변수를 포함하는 섹션으로 구성됩니다. 각 섹션은 하나 이상의 키워드, 때로는 섹션 이름, 여는 중괄호 ( "{"), 섹션의 내용 및 닫는 중괄호 ( "}")로 구성됩니다. 섹션 내의 매개 변수는 키워드와 하나 이상의 키워드 또는 값 및 세미콜론 ( ";")으로 구성됩니다. 일부 매개 변수 값에는 일반 숫자를 지정할 때 적용되는 기본 스케일이 있습니다 (예 : Kilo). 이러한 기본 스케일은 접미사 (예 : M의 경우 Mega)를 사용하여 재정의 할 수 있습니다. 공통 접미사는 K = 2 ^ 10 = 1024, M = 1024 K 및 G = 1024 M이 지원됩니다. 주석은 해시 기호 ( "#")로 시작하여 해당 줄 끝까지 기술할 수 있습니다. 또한, 모든 섹션에 키워드 skip을 접두어로 붙여서 섹션과 하위 섹션을 무시할 수 있습니다. 추가 파일은 include 파일 패턴 명령문에 포함될 수 있습니다. include 문은 섹션 외부에서만 허용됩니다.

아래에 기술한 섹션이 정의됩니다. 들여 쓰기된 섹션이 하위 섹션임을 표시합니다.

common
   [disk]
   [handlers]
   [net]
   [options]
   [startup]
global
resource
   connection
      path
      net
        volume
           peer-device-options
      [peer-device-options]
   connection-mesh
      net
   [disk]
   floating
   handlers
   [net]
   on
      volume
         disk
         [disk]
   options
   stacked-on-top-of
   startup

괄호 안의 섹션은 구성의 다른 부분에 영향을 줍니다. common 섹션의 내용은 모든 리소스에 적용됩니다. resource 또는 resource 섹션의 disk 섹션은 해당 자원의 모든 볼륨에 적용되며 resource 섹션의 network 섹션은 해당 자원의 모든 connection에 적용됩니다. 이를 통해 각 자원, 연결 또는 볼륨에 대해 동일한 옵션을 반복하지 않아도됩니다. 리소스, 연결, 볼륨 또는 볼륨 섹션에서 보다 구체적인 옵션을 재정의 할 수 있습니다. peer-device 옵션은 resync-rate, c-plan-ahead, c-delay-target, c-fill-target, c-max-rate 및 c-min-rate 로 정의되며 이전 버전과의 호환성을 위해 모든 disk 섹션에서도 지정할 수 있습니다. 그것들은 모든 관련 연결로 상속됩니다. connection 섹션에 부여 된 경우 해당 connection의 모든 볼륨에 상속됩니다. "peer-device-options"섹션은 "disk"키워드로 시작됩니다.

섹션

common

이 섹션에는 각 disk, handler, network, options 및 startup 섹션이 포함될 수 있습니다. 모든 리소스들은 이 섹션의 매개 변수를 기본값으로 상속합니다.

connection [name]

두 호스트 간의 연결을 정의합니다. 이 섹션에는 두 개의 호스트 매개 변수 또는 여러 경로 섹션이 포함되어야합니다. 선택적으로 사용할 수 있는 "Name"은 시스템 로그 및 기타 다른 메시지들의 연결을 나타내는 데 사용됩니다. 이름을 지정하지 않으면 피어의 호스트 이름이 대신 사용됩니다.

path

두 호스트 간의 path를 정의합니다. 이 섹션에는 두 개의 호스트 매개 변수가 포함되어야합니다.

connection-mesh

여러 호스트들 간의 mesh 연결을 정의합니다. 이 섹션에는 호스트 이름을 인수로 갖는 "hosts"매개 변수가 포함되어야합니다. 이 섹션은 동일한 네트워크 옵션을 공유하는 많은 연결을 손쉽게 정의하는 방법입니다.

disk

볼륨의 매개 변수를 정의합니다. 이 섹션의 모든 매개 변수는 선택 사항입니다.

floating [address-family] addr:port

on 섹션과 마찬가지로 호스트 이름 대신 네트워크 주소가 floating 섹션과 일치하는지 확인합니다. 이 섹션의 node-id 매개 변수가 필요합니다. 주소 매개 변수가 제공되지 않으면 기본적으로 피어에 대한 연결이 생성되지 않습니다. device, disk 및 meta-disk 매개 변수는 반드시 이 섹션에서 정의하거나 상위로부터 상속해야 합니다.

global

일부 전역 매개 변수를 정의합니다. 이 섹션의 모든 매개 변수는 선택 사항입니다. 구성에서 하나의 global 섹션만 허용됩니다.

handlers

특정 이벤트가 발생할 때 호출 될 핸들러를 정의합니다. 커널은 핸들러의 첫 번째 명령 줄 인수에서 리소스 이름을 전달하고 이벤트 문맥에 따라 다음 환경 변수를 설정합니다.

특정 device와 관련된 이벤트의 경우: 장치의 부 번호는 BSR_MINOR, 장치의 볼륨 번호는 BSR_VOLUME 에 있습니다.
특정 peer의 특정 device와 관련된 이벤트의 경우: BSR_MY_ADDRESS, BSR_MY_AF, BSR_PEER_ADDRESS 및 BSR_PEER_AF에 connection 엔드 포인트가 있습니다; BSR_MINOR에 있는 장치의 로컬 부 번호 및 BSR_VOLUME에 장치의 볼륨 번호가 있습니다.
특정 connection과 관련된 이벤트의 경우: BSR_MY_ADDRESS, BSR_MY_AF, BSR_PEER_ADDRESS 및, BSR_PEER_AF에 connection 엔드 포인트; 그리고 해당 connection에 대해 정의된 각 device의 장치 부 번호가 DRBD_MINOR_volume-number에 있습니다.
장치를 식별하는 이벤트의 경우 하위 장치가 연결되어 있으면 하위 장치의 장치 이름이 BSR_BACKING_DEV (또는 BSR_BACKING_DEV_volume-number)로 전달됩니다.

이 섹션의 모든 매개 변수는 선택 사항입니다. 각 이벤트에 대해 단일 핸들러만 정의 할 수 있습니다. 핸들러가 정의되어 있지 않으면 아무 일도 일어나지 않습니다.

net

연결 매개 변수를 정의합니다. 이 섹션의 모든 매개 변수는 선택 사항입니다.

on host-name [...]

특정 호스트 또는 호스트 세트에서 리소스의 특성을 정의하십시오. 예를 들어, IP 주소 failover 설정에서 둘 이상의 호스트 이름을 지정하는 경우에 의미가 있습니다. host-name 인수는 OS에 설정된 호스트 이름 (uname -n)과 일치해야합니다. 일반적으로 하나 이상의 볼륨 섹션을 포함하거나 상속합니다. 이 섹션에서 node-id 및 address 매개 변수를 정의해야합니다. device, disk 및 meta-disk 매개 변수는이 섹션에서 반드시 정의하거나 상속해야합니다. 일반적인 구성 파일에서는 각 리소스에 대한 둘 이상의 on 섹션이 있습니다. floating 섹션의 내용을 참고하세요

options

리소스에 대한 매개 변수를 정의합니다. 이 섹션의 모든 매개 변수는 선택 사항입니다.

resource name

리소스를 정의합니다. 일반적으로 두 개 이상의 섹션과 하나 이상의 connection 섹션이 있습니다.

stacked-on-top-of resource

3-4 개의 노드로 스택 된 리소스를 구성하기 위해 on 섹션 대신 사용됩니다. bsr에선 스태킹이 더 이상 사용되지 않으며, 1:N 복제 구성으로 사용하는 것이 좋습니다.

startup

이 섹션의 매개 변수는 리소스 시작 시 리소스의 동작을 결정합니다.

volume volume-number

리소스 내에서 볼륨을 정의합니다. 리소스의 volume 섹션들에 있는 볼륨 번호는 호스트에서 어떤 장치가 복제 장치를 구성하는지 정의합니다.

connection 섹션

host name [address [address-family] address] [port port-number]

연결에 대한 엔드포인트를 정의합니다. 각 host 구문은 리소스의 on 섹션을 나타냅니다. 포트 번호가 정의되면 이 엔드 포인트는 on 섹션에 정의 된 포트 대신 지정된 포트를 사용합니다. 각 connection 섹션에는 정확히 두 개의 host 매개 변수가 포함되어야합니다. 두 개의 host 매개 변수 대신 connection에 다중 path 섹션이 포함될 수 있습니다.

path 섹션

host name [address [address-family] address] [port port-number]

연결에 대한 엔드포인트를 정의합니다. 각 host 구문은 리소스의 on 섹션을 나타냅니다. 포트 번호가 정의되면 이 엔드 포인트는 on섹션에 정의 된 포트 대신 지정된 포트를 사용합니다. 각 path 섹션에는 정확히 두 개의 host 매개 변수가 포함되어야합니다.

connection-mesh 섹션

hosts name…

메쉬의 모든 노드를 정의합니다. 각 이름은 자원의 on 섹션을 나타냅니다. on 섹션에 정의 된 포트가 사용됩니다.

disk 섹션

al-extents extents

bsr은 최근 디스크 쓰기 작업을 근거로 쓰여진(active) 영역과 쓰여진 영역에 최근 다시 쓰여진(hot) 영역에 대해 관리합니다. 쓰기 I/O가 발생하면 active 영역은 디스크에 즉시 쓰면 되지만 inactive 디스크 영역은 먼저 activated 해야 하기 때문에 여기서 메타 데이터 쓰기가 필요합니다. 이 active 디스크 영역을 activity log 라고 합니다.

activity log에 메타 데이터 쓰기를 저장하지만 실패한 노드를 복구할 경우 전체 activity log에 대해 다시 동기화 해야 합니다. 따라서 activity log의 크기는 primary 크래쉬 후 재 동기화에 얼마나 오래 걸릴지, 얼마나 빨리 복제 디스크의 일관성을 맞출지의 주요 요인이 됩니다. activity log는 여러 개의 4MiB 단위 세그먼트로 구성됩니다. al-extents 매개 변수는 동시에 활성화 할 수있는 세그먼트 수를 결정합니다. al-extents의 기본값은 6001이며 최소 7과 최대 65536입니다. 장치 메타 데이터를 생성한 방법에 따라 유효한 최대 값이 더 작을 수도 있습니다 (bsrmeta 참조).

유효 최대 값은 919 * (사용 가능한 온 디스크 activity log 링 버퍼 영역 / 4kB -1)이며, 기본 32KB 링 버퍼에서 최대 6433 (25GiB 이상의 데이터 포함)이 됩니다. 백엔드 스토리지 및 복제 링크가 약 5 분 이내에 재 동기화 될 수있는 양 이내에서 activity log의 크기를 유지하는 게 좋습니다.

al-updates {yes | no}

이 매개 변수를 no 로 설정하면 activity log를 완전히 끌 수 있습니다 (al-extents 매개 변수 참조). 메타 데이터 쓰기가 더 적게 필요하기 때문에 쓰기 속도가 빨라지지만, 실패한 기본 노드의 복구시 전체 장치를 재 동기화해야합니다. al-updats 의 기본값은 yes 입니다.

disk-barrier,

disk-flushes,

disk-drain

bsr에는 쓰기 요청의 순서를 처리하는 세 가지 방법이 있습니다.

disk-flush 디스크에 쓰기 I/O 를 수행한 후 flush 를 강제하여 모든 데이터를 디스크에 기록하도록 조치합니다. 플랫폼에 따라 또는 드라이브 공급 업체에 따라 flush의 구현이 다를 수 있습니다. 예전 방식으로는 'force unit access'라고 명명되는 디스크 캐쉬를 우회하는 기술로 사용되기도 했으나 최근에은 기본적으로 디스크이 캐쉬를 비우는 작업을 통해 디스크 쓰기를 보장하는 방식으로 구현되고 있습니다. 이 옵션은 기본적으로 활성화되어 있습니다.
disk-barrier 이 옵션을 사용하여 요청이 올바른 순서로 디스크에 기록되도록합니다. barrier는 barrier 이전에 제출 된 모든 요청이 이후에 제출 된 요청보다 앞서서 모두 디스크에 요청하도록 보장 합니다. 이는 SCSI 장치의 'tagged command queuing'과 SATA 장치의 'native command queuing'을 사용하여 구현됩니다. 일부 장치 및 장치 스택 만이 이 방법을 지원합니다. device mapper (LVM)는 일부 구성에서만 barrier를 지원합니다. disk-barrier을 지원하지 않는 시스템에서 이 옵션을 사용하면 데이터가 손실되거나 손상 될 수 있습니다. 이 옵션은 예전 리눅스 커널에서는 지원했지만 linux-2.6.36 (또는 2.6.32 RHEL6) 이후의 커널은 더 이상 disk-barrier가 지원되는지 감지할 수 없습니다. 이 옵션은 기본적으로 해제되어 있으며 명시적으로 활성화해야합니다.
disk-drain 쓰기 요청을 제출하기 전에 요청 큐가 "드레인"될 때까지(즉, 요청이 완료 될 때까지) 기다립니다. 이 방법을 사용하려면 요청이 완료될 떄 까지 요청들이 디스크에서 안정적이어야 합니다. 예전에는 이 옵션을 기본 활성화 하였지만 지금은 비활성화 되어 있습니다.

disk-timeout

bsr 장치가 데이터를 저장하는 하위 장치가 정의 된 디스크 시간 초과 내에 I/O 요청을 완료하지 못하면 bsr은 이를 실패로 처리합니다. 이 경우 하위 장치가 detach되고 장치의 디스크 상태가 디스크가 없는 상태로 진행됩니다. bsr이 하나 이상의 피어에 연결된 경우 실패한 요청이 그 중 하나에 전달됩니다. 이 옵션은 위험하며 커널 패닉으로 이어질 수 있습니다! 요청을 "Abort” 또는 강제로 디스크를 제거하는 것은 더 이상 요청을 완료하지도 않고 오류도 반환하지 않는 완전히 block되고 중지된 로컬 백업 장치를 위한 것입니다. 이 상황에서는 일반적으로 하드 리셋 및 페일 오버가 유일한 방법입니다. disk-timeout의 기본값은 0이며, 이는 무한 시간 초과를 나타냅니다. 시간 초과는 0.1 초 단위로 지정됩니다.

md-flushes

메타 데이터 장치에서 디스크 플러시 및 디스크 장벽을 활성화합니다. 이 옵션은 기본적으로 활성화되어 있습니다. disk-flush 매개 변수를 참조하십시오.

on-io-error handler

하위 레벨 장치에서 bsr이 I/O 오류에 대응하는 방식을 구성합니다. 다음과 같은 정책이 정의됩니다.

passthrough 하위 장치에서 오류가 반환될 경우 해당 블럭 계층을 OOS로 기록하고 상위 계층으로 오류를 전달합니다. 해당 오류 블럭은 보통 상위 계층에 의해 재시도 I/O가 발생 되고 재시도 시점에 성공할 경우 OOS 는 자연스럽게 해소되거나 그렇지 않을 경우 OOS 가 기록되어 남겨집니다. bsr 의 기본값 입니다.
call-local-io-error local-io-error 핸들러를 호출합니다 ( handlers 섹션을 참고하세요).
detach 하위레벨 장치를 분리하고 diskless 상태로 전환합니다. diskless 상태에서는 I/O가 수행될 수 없으며 즉시 failover가 필요합니다.

resync-after res-name/volume

지정된 다른 장치가 동기화된 이후에만 장치를 재 동기화하도록 정의합니다. 기본적으로 장치간에는 동기화 순서가 정의되어 있지 않으며 모든 장치가 병렬로 재 동기화 됩니다. 하위 장치 구성, 사용 가능한 네트워크 및 디스크 대역폭에 따라 전체 재 동기화 프로세스가 느려질 수 있기 때문에 이 옵션을 사용하여 장치 간의 종속성 체인 또는 트리를 형성 할 수 있습니다.

peer-device-options 섹션

peer-device-options 섹션 이지만 disk 키워드로 이 섹션을 기술합니다.

c-fill-target fill_target,

c-max-rate max_rate,

c-plan-ahead plan_time

재 동기화 속도를 동적으로 제어합니다. 이 메카니즘은 c-plan-ahead 매개 변수를 양수 값으로 설정하여 사용할 수 있습니다. 목표는 c-fill-target이 정의 된 경우 정의 된 양의 데이터로 데이터 경로를 따라 버퍼를 채우거나 c-delay-target이 정의 된 경우 경로를 따라 정의된 지연을 갖는 것입니다. 최대 대역폭은 c-max-rate 매개 변수에 의해 제한됩니다. c-plan-ahead 매개 변수는 bsr이 재 동기화 속도의 변화에 얼마나 빨리 적응 하는지를 정의합니다. 네트워크 왕복 시간(RTT)의 5 배 이상으로 설정해야합니다. "정상"데이터 경로에 대한 c-fill-target의 공통 값 범위는 4K ~ 100K입니다. drx를 사용하는 경우 c-fill-target 대신 c-delay-target을 사용하는 것이 좋습니다. c-delay-target 매개 변수는 c-fill-target 매개 변수가 정의되지 않거나 0으로 설정된 경우에 사용됩니다. c-delay-target 매개 변수는 네트워크 왕복 시간의 5 배 이상으로 설정해야합니다. c-max-rate 옵션은 bsr 호스트와 drx 를 호스팅하는 시스템간에 사용 가능한 대역폭 또는 사용 가능한 디스크 대역폭으로 설정해야합니다. 이 매개 변수의 기본값은 다음과 같습니다. c-plan-ahead = 20 (0.1 초 단위), c-fill-target = 0 (섹터 단위), c-delay-target = 1 (0.1 초 단위) ) 및 c-max-rate = 102400 (KiB/s 단위).

c-min-rate min_rate

Primary 이고 동기화 소스 인 노드는 애플리케이션 I/O 요청과 동기화 요청을 스케줄링해야 합니다. c-min-rate 매개 변수는 재 동기화 I/O에 사용할 수있는 대역폭의 양을 제한합니다. 나머지 대역폭은 응용 프로그램 I/O의 복제에 사용됩니다. c-min-rate 값이 0이면 재 동기화 I/O 대역폭에 제한이 없음을 의미합니다. 이로 인해 응용 프로그램 I/O 속도가 크게 느려질 수 있습니다. 가장 낮은 재 동기화 속도를 위해서는 1 (1 KiB/s) 값을 사용하십시오. c-min-rate의 기본값은 KiB/s 단위로 250 입니다.

resync-rate rate

재 동기화에 bsr이 사용할 수있는 대역폭을 정의합니다. bsr은 재 동기화 중에도 일반적인 응용 프로그램 I/O를 허용합니다. 재 동기화가 너무 많은 대역폭을 차지하면 응용 프로그램 I/O가 매우 느려질 수 있으며 이 매개 변수를 사용하면 이를 피할 수 있습니다. 이 옵션은 동적 재 동기화 컨트롤러가 비활성화 된 경우에만 작동합니다.

global 섹션

dialog-refresh time

bsr 초기화 스크립트를 사용하여 장치를 구성하고 시작할 수 있습니다. 여기에는 다른 클러스터 노드 대기와 관련 될 수 있습니다. 기다리는 동안 init 스크립트는 남은 대기 시간을 보여줍니다. 대화 상자 새로 고침은 해당 카운트 다운 업데이트 사이의 시간 (초)을 정의하고 기본값은 1입니다. 값이 0이면 카운트 다운이 꺼집니다.

disable-ip-verification

일반적으로 bsr은 구성의 IP 주소가 호스트 이름과 일치하는지 확인합니다. disable-ip-verification 매개 변수를 사용하여 이러한 검사를 비활성화할 수 있습니다.

usage-count {yes | no | ask}

사용 통계를 취합하는 기능이지만 bsr 에서는 사용되지 않습니다.

handlers 섹션

after-resync-target cmd

재 동기화가 완료 될 때 노드 상태가 "Inconsistent"에서 "Consistent"로 변경되면 SyncTarget에서 호출됩니다. 예를 들어, 이 핸들러는 before-resync-target 핸들러에서 작성된 스냅 샷을 제거하는 데 사용할 수 있습니다.

before-resync-target cmd

재 동기화가 시작되기 전에 SyncTarget에서 호출됩니다. 이 핸들러는 재 동기화 기간 동안 하위 레벨 디바이스의 스냅 샷을 작성하는 데 사용할 수 있습니다. 재 동기화 중에 재 동기화 소스를 사용할 수없는 경우 스냅 샷으로 되돌리면 Consistent한 상태로 복원 할 수 있습니다.

before-resync-source cmd

재동기화를 시작하기 전 동기화 소스에서 호출됩니다.

out-of-sync cmd

정합성 검사를 수행 후 OOS 블록이 발견되면 모든 노드에서 호출됩니다. 이 핸들러는 주로 모니터링 목적으로 사용됩니다. 경고 SMS를 보내는 스크립트를 호출하는 것이 그 예입니다.

fence-peer cmd

노드가 특정 피어의 리소스를 차단해야 할 때 호출됩니다. 핸들러는 bsr이 피어와 통신하는 데 사용하는 동일한 통신 경로를 사용해서는 안됩니다.

unfence-peer cmd

노드가 다른 노드에서 펜싱 제약 조건을 제거해야 할 때 호출됩니다.

initial-split-brain cmd

스플릿 브레인 상태임을 감지하면 호출됩니다. 이 핸들러는 스플릿 브레인 자동 해결 시나리오에서도 호출됩니다.

local-io-error cmd

하위 장치에서 I/O 에러가 발생할 경우 호출됩니다.

pri-lost cmd

로컬 노드는 현재 Primary 노드이지만 bsr이 동기화 대상이되어야 한다고 생각합니다. 이 경우 노드는 Primary 역할을 포기해야 합니다.

pri-lost-after-sb cmd

로컬 노드는 현재 Primary 노드이지만 스플릿 후 자동 복구 절차를 통해 잃었을 때 호출됩니다. 이 경우 노드는 버려야합니다.

split-brain cmd

bsr이 자동으로 해결할 수없는 스플릿 브레인 상황을 감지했습니다. 수동 복구가 필요합니다. 이 핸들러는 관리자의 주의를 환기시키는 데 사용될 수 있습니다.

net 섹션

after-sb-0pri policy

스플릿 브레인 시나리오가 감지되고 두 노드 중 어느 것도 Primary 역할을 수행하지 않는 경우 대응 방법을 정의합니다. (두 노드가 연결될 때 스플릿 브레인 시나리오를 감지합니다. 스플릿 브레인 결정은 항상 두 노드 사이에 있습니다.) 정의 된 정책은 다음과 같습니다.

disconnect 단순히 연결을 끊습니다.
discard-younger-primary,
discard-older-primary 먼저 Primary 가 됬던 노드(discard-younger-primary) 또는 마지막으로 Primary 가 됬던 노드(discard-older-primary)를 폐기합니다. 만일 두 노드가 독립적으로 Primary 가 됬었다면 discard-least-changes 정책을 사용합니다.
discard-zero-changes 하나의 노드에서만 데이터를 쓴 경우 이 노드를 기준으로 재 동기화 합니다. 두 노드가 모두 데이터를 쓴 경우 연결을 끊습니다.
discard-least-changes 많은 데이터를 쓴 노드를 기준으로 동기화 합니다.
discard-node-nodename 명명된 노드를 항상 폐기합니다.

after-sb-1pri policy

Primary 노드 1 개와 Secondary 노드 1 개로 스플릿 브레인이 감지되는 경우 대처 방법을 정의합니다. (두 노드가 연결될 때 스플릿 브레인 시나리오를 감지하므로 스플릿 브레인 결정은 항상 두 노드 중 하나입니다.) 정의 된 정책은 다음과 같습니다.

disconnect 단순히 연결을 끊습니다.
consensus 희생노드가 선택될 수 있다면 자동으로 해결합니다. 그렇지 않으면, disconnect처럼 동작합니다.
discard-secondary Secondary 의 노드를 폐기합니다.

after-sb-2pri policy

스플릿 브레인 시나리오가 감지되고 두 노드가 모두 Primary 역할을 하는 경우 대응 방법을 정의합니다. (두 노드가 연결될 때 스플릿 브레인 시나리오를 감지하므로 스플릿 브레인 결정은 항상 두 노드 중 하나 입니다.) 정의 된 정책은 다음과 같습니다.

disconnect 단순히 연결을 끊습니다.

2 primary 스플릿 브레인의 경우 disconnect 를 통한 수동 복구만 사용할 수 있습니다.

allow-two-primaries

bsr 에선 이중 primary 모드를 지원하지 않습니다.

always-asbp

Normally the automatic after-split-brain policies are only used if current states of the UUIDs do not indicate the presence of a third node. With this option you request that the automatic after-split-brain policies are used as long as the data sets of the nodes are somehow related. This might cause a full sync, if the UUIDs indicate the presence of a third node. (Or double faults led to strange UUID sets.)

connect-int time

As soon as a connection between two nodes is configured with drbdsetup connect, DRBD immediately tries to establish the connection. If this fails, DRBD waits for connect-int seconds and then repeats. The default value of connect-int is 10 seconds.

cram-hmac-alg hash-algorithm

Configure the hash-based message authentication code (HMAC) or secure hash algorithm to use for peer authentication. The kernel supports a number of different algorithms, some of which may be loadable as kernel modules. See the shash algorithms listed in /proc/crypto. By default, cram-hmac-alg is unset. Peer authentication also requires a shared-secret to be configured.

csums-alg hash-algorithm

Normally, when two nodes resynchronize, the sync target requests a piece of out-of-sync data from the sync source, and the sync source sends the data. With many usage patterns, a significant number of those blocks will actually be identical. When a csums-alg algorithm is specified, when requesting a piece of out-of-sync data, the sync target also sends along a hash of the data it currently has. The sync source compares this hash with its own version of the data. It sends the sync target the new data if the hashes differ, and tells it that the data are the same otherwise. This reduces the network bandwidth required, at the cost of higher cpu utilization and possibly increased I/O on the sync target. The csums-alg can be set to one of the secure hash algorithms supported by the kernel; see the shash algorithms listed in /proc/crypto. By default, csums-alg is unset.

csums-after-crash-only

Enabling this option (and csums-alg, above) makes it possible to use the checksum based resync only for the first resync after primary crash, but not for later "network hickups". In most cases, block that are marked as need-to-be-resynced are in fact changed, so calculating checksums, and both reading and writing the blocks on the resync target is all effective overhead. The advantage of checksum based resync is mostly after primary crash recovery, where the recovery marked larger areas (those covered by the activity log) as need-to-be-resynced, just in case. Introduced in 8.4.5.

data-integrity-alg alg

DRBD normally relies on the data integrity checks built into the TCP/IP protocol, but if a data integrity algorithm is configured, it will additionally use this algorithm to make sure that the data received over the network match what the sender has sent. If a data integrity error is detected, DRBD will close the network connection and reconnect, which will trigger a resync. The data-integrity-alg can be set to one of the secure hash algorithms supported by the kernel; see the shash algorithms listed in /proc/crypto. By default, this mechanism is turned off. Because of the CPU overhead involved, we recommend not to use this option in production environments. Also see the notes on data integrity below.

fencing fencing_policy

Fencing is a preventive measure to avoid situations where both nodes are primary and disconnected. This is also known as a split-brain situation. DRBD supports the following fencing policies:dont-careNo fencing actions are taken. This is the default policy.resource-onlyIf a node becomes a disconnected primary, it tries to fence the peer. This is done by calling the fence-peer handler. The handler is supposed to reach the peer over an alternative communication path and call ' drbdadm outdate minor' there.resource-and-stonithIf a node becomes a disconnected primary, it freezes all its IO operations and calls its fence-peer handler. The fence-peer handler is supposed to reach the peer over an alternative communication path and call ' drbdadm outdate minor' there. In case it cannot do that, it should stonith the peer. IO is resumed as soon as the situation is resolved. In case the fence-peer handler fails, I/O can be resumed manually with ' drbdadm resume-io'.

ko-count number

If a secondary node fails to complete a write request in ko-count times the timeout parameter, it is excluded from the cluster. The primary node then sets the connection to this secondary node to Standalone. To disable this feature, you should explicitly set it to 0; defaults may change between versions.

max-buffers number

Limits the memory usage per DRBD minor device on the receiving side, or for internal buffers during resync or online-verify. Unit is PAGE_SIZE, which is 4 KiB on most systems. The minimum possible setting is hard coded to 32 (=128 KiB). These buffers are used to hold data blocks while they are written to/read from disk. To avoid possible distributed deadlocks on congestion, this setting is used as a throttle threshold rather than a hard limit. Once more than max-buffers pages are in use, further allocation from this pool is throttled. You want to increase max-buffers if you cannot saturate the IO backend on the receiving side.

max-epoch-size number

Define the maximum number of write requests DRBD may issue before issuing a write barrier. The default value is 2048, with a minimum of 1 and a maximum of 20000. Setting this parameter to a value below 10 is likely to decrease performance.

on-congestion policy,

congestion-fill threshold,

congestion-extents threshold

By default, DRBD blocks when the TCP send queue is full. This prevents applications from generating further write requests until more buffer space becomes available again. When DRBD is used together with DRBD-proxy, it can be better to use the pull-ahead on-congestion policy, which can switch DRBD into ahead/behind mode before the send queue is full. DRBD then records the differences between itself and the peer in its bitmap, but it no longer replicates them to the peer. When enough buffer space becomes available again, the node resynchronizes with the peer and switches back to normal replication. This has the advantage of not blocking application I/O even when the queues fill up, and the disadvantage that peer nodes can fall behind much further. Also, while resynchronizing, peer nodes will become inconsistent. The available congestion policies are block (the default) and pull-ahead. The congestion-fill parameter defines how much data is allowed to be "in flight" in this connection. The default value is 0, which disables this mechanism of congestion control, with a maximum of 10 GiBytes. The congestion-extents parameter defines how many bitmap extents may be active before switching into ahead/behind mode, with the same default and limits as the al-extents parameter. The congestion-extents parameter is effective only when set to a value smaller than al-extents. Ahead/behind mode is available since DRBD 8.3.10.

ping-int interval

When the TCP/IP connection to a peer is idle for more than ping-int seconds, DRBD will send a keep-alive packet to make sure that a failed peer or network connection is detected reasonably soon. The default value is 10 seconds, with a minimum of 1 and a maximum of 120 seconds. The unit is seconds.

ping-timeout timeout

Define the timeout for replies to keep-alive packets. If the peer does not reply within ping-timeout, DRBD will close and try to reestablish the connection. The default value is 0.5 seconds, with a minimum of 0.1 seconds and a maximum of 3 seconds. The unit is tenths of a second.

socket-check-timeout timeout

In setups involving a DRBD-proxy and connections that experience a lot of buffer-bloat it might be necessary to set ping-timeout to an unusual high value. By default DRBD uses the same value to wait if a newly established TCP-connection is stable. Since the DRBD-proxy is usually located in the same data center such a long wait time may hinder DRBD's connect process. In such setups socket-check-timeout should be set to at least to the round trip time between DRBD and DRBD-proxy. I.e. in most cases to 1. The default unit is tenths of a second, the default value is 0 (which causes DRBD to use the value of ping-timeout instead). Introduced in 8.4.5.

protocol name

Use the specified protocol on this connection. The supported protocols are:

A Writes to the DRBD device complete as soon as they have reached the local disk and the TCP/IP send buffer.
B Writes to the DRBD device complete as soon as they have reached the local disk, and all peers have acknowledged the receipt of the write requests.
C Writes to the DRBD device complete as soon as they have reached the local and all remote disks.

rcvbuf-size size

Configure the size of the TCP/IP receive buffer. A value of 0 (the default) causes the buffer size to adjust dynamically. This parameter usually does not need to be set, but it can be set to a value up to 10 MiB. The default unit is bytes.

rr-conflict policy

This option helps to solve the cases when the outcome of the resync decision is incompatible with the current role assignment in the cluster. The defined policies are:

disconnect No automatic resynchronization, simply disconnect.
violently Resync to the primary node is allowed, violating the assumption that data on a block device are stable for one of the nodes. Do not use this option, it is dangerous.
call-pri-lost Call the pri-lost handler on one of the machines. The handler is expected to reboot the machine, which puts it into secondary role.

shared-secret secret

Configure the shared secret used for peer authentication. The secret is a string of up to 64 characters. Peer authentication also requires the cram-hmac-alg parameter to be set.

sndbuf-size size

Configure the size of the TCP/IP send buffer. Since DRBD 8.0.13 / 8.2.7, a value of 0 (the default) causes the buffer size to adjust dynamically. Values below 32 KiB are harmful to the throughput on this connection. Large buffer sizes can be useful especially when protocol A is used over high-latency networks; the maximum value supported is 10 MiB.

tcp-cork

By default, DRBD uses the TCP_CORK socket option to prevent the kernel from sending partial messages; this results in fewer and bigger packets on the network. Some network stacks can perform worse with this optimization. On these, the tcp-cork parameter can be used to turn this optimization off.

timeout time

Define the timeout for replies over the network: if a peer node does not send an expected reply within the specified timeout, it is considered dead and the TCP/IP connection is closed. The timeout value must be lower than connect-int and lower than ping-int. The default is 6 seconds; the value is specified in tenths of a second.

use-rle

Each replicated device on a cluster node has a separate bitmap for each of its peer devices. The bitmaps are used for tracking the differences between the local and peer device: depending on the cluster state, a disk range can be marked as different from the peer in the device's bitmap, in the peer device's bitmap, or in both bitmaps. When two cluster nodes connect, they exchange each other's bitmaps, and they each compute the union of the local and peer bitmap to determine the overall differences. Bitmaps of very large devices are also relatively large, but they usually compress very well using run-length encoding. This can save time and bandwidth for the bitmap transfers. The use-rle parameter determines if run-length encoding should be used. It is on by default since DRBD 8.4.0.

verify-alg hash-algorithm

Online verification (drbdadm verify) computes and compares checksums of disk blocks (i.e., hash values) in order to detect if they differ. The verify-alg parameter determines which algorithm to use for these checksums. It must be set to one of the secure hash algorithms supported by the kernel before online verify can be used; see the shash algorithms listed in /proc/crypto. We recommend to schedule online verifications regularly during low-load periods, for example once a month. Also see the notes on data integrity below.

on 섹션

address [address-family] address: port

Defines the address family, address, and port of a connection endpoint. The address families ipv4, ipv6, ssocks (Dolphin Interconnect Solutions' "super sockets"), sdp (Infiniband Sockets Direct Protocol), and sci are supported ( sci is an alias for ssocks). If no address family is specified, ipv4 is assumed. For all address families except ipv6, the address is specified in IPV4 address notation (for example, 1.2.3.4). For ipv6, the address is enclosed in brackets and uses IPv6 address notation (for example, [fd01:2345:6789:abcd::1]). The port is always specified as a decimal number from 1 to 65535. On each host, the port numbers must be unique for each address; ports cannot be shared.

node-id value

Defines the unique node identifier for a node in the cluster. Node identifiers are used to identify individual nodes in the network protocol, and to assign bitmap slots to nodes in the metadata. Node identifiers can only be reasssigned in a cluster when the cluster is down. It is essential that the node identifiers in the configuration and in the device metadata are changed consistently on all hosts. To change the metadata, dump the current state with drbdmeta dump-md, adjust the bitmap slot assignment, and update the metadata with drbdmeta restore-md. The node-id parameter exists since DRBD 9. Its value ranges from 0 to 16; there is no default.

options 섹션(resource options)

auto-promote bool-value

A resource must be promoted to primary role before any of its devices can be mounted or opened for writing. Before DRBD 9, this could only be done explicitly ("drbdadm primary"). Since DRBD 9, the auto-promote parameter allows to automatically promote a resource to primary role when one of its devices is mounted or opened for writing. As soon as all devices are unmounted or closed with no more remaining users, the role of the resource changes back to secondary. Automatic promotion only succeeds if the cluster state allows it (that is, if an explicit drbdadm primary command would succeed). Otherwise, mounting or opening the device fails as it already did before DRBD 9: the mount(2) system call fails with errno set to EROFS (Read-only file system); the open(2) system call fails with errno set to EMEDIUMTYPE (wrong medium type). Irrespective of the auto-promote parameter, if a device is promoted explicitly ( drbdadm primary), it also needs to be demoted explicitly (drbdadm secondary). The auto-promote parameter is available since DRBD 9.0.0, and defaults to yes.

cpu-mask cpu-mask

Set the cpu affinity mask for DRBD kernel threads. The cpu mask is specified as a hexadecimal number. The default value is 0, which lets the scheduler decide which kernel threads run on which CPUs. CPU numbers in cpu-mask which do not exist in the system are ignored.

on-no-data-accessible policy

Determine how to deal with I/O requests when the requested data is not available locally or remotely (for example, when all disks have failed). The defined policies are:

io-error System calls fail with errno set to EIO.
suspend-io The resource suspends I/O. I/O can be resumed by (re)attaching the lower-level device, by connecting to a peer which has access to the data, or by forcing DRBD to resume I/O with drbdadm resume-io res. When no data is available, forcing I/O to resume will result in the same behavior as the io-error policy.

This setting is available since DRBD 8.3.9; the default policy is io-error.

peer-ack-window value

On each node and for each device, DRBD maintains a bitmap of the differences between the local and remote data for each peer device. For example, in a three-node setup (nodes A, B, C) each with a single device, every node maintains one bitmap for each of its peers. When nodes receive write requests, they know how to update the bitmaps for the writing node, but not how to update the bitmaps between themselves. In this example, when a write request propagates from node A to B and C, nodes B and C know that they have the same data as node A, but not whether or not they both have the same data. As a remedy, the writing node occasionally sends peer-ack packets to its peers which tell them which state they are in relative to each other. The peer-ack-window parameter specifies how much data a primary node may send before sending a peer-ack packet. A low value causes increased network traffic; a high value causes less network traffic but higher memory consumption on secondary nodes and higher resync times between the secondary nodes after primary node failures. (Note: peer-ack packets may be sent due to other reasons as well, e.g. membership changes or expiry of the peer-ack-delay timer.) The default value for peer-ack-window is 2 MiB, the default unit is sectors. This option is available since 9.0.0.

peer-ack-delay expiry-time

If after the last finished write request no new write request gets issued for expiry-time, then a peer-ack packet is sent. If a new write request is issued before the timer expires, the timer gets reset to expiry-time. (Note: peer-ack packets may be sent due to other reasons as well, e.g. membership changes or the peer-ack-window option.) This parameter may influence resync behavior on remote nodes. Peer nodes need to wait until they receive an peer-ack for releasing a lock on an AL-extent. Resync operations between peers may need to wait for for these locks. The default value for peer-ack-delay is 100 milliseconds, the default unit is milliseconds. This option is available since 9.0.0.

startup 섹션

The parameters in this section define the behavior of DRBD at system startup time, in the DRBD init script. They have no effect once the system is up and running.

degr-wfc-timeout timeout

Define how long to wait until all peers are connected in case the cluster consisted of a single node only when the system went down. This parameter is usually set to a value smaller than wfc-timeout. The assumption here is that peers which were unreachable before a reboot are less likely to be reachable after the reboot, so waiting is less likely to help. The timeout is specified in seconds. The default value is 0, which stands for an infinite timeout. Also see the wfc-timeout parameter.

outdated-wfc-timeout timeout

Define how long to wait until all peers are connected if all peers were outdated when the system went down. This parameter is usually set to a value smaller than wfc-timeout. The assumption here is that an outdated peer cannot have become primary in the meantime, so we don't need to wait for it as long as for a node which was alive before. The timeout is specified in seconds. The default value is 0, which stands for an infinite timeout. Also see the wfc-timeout parameter.

stacked-timeouts

On stacked devices, the wfc-timeout and degr-wfc-timeout parameters in the configuration are usually ignored, and both timeouts are set to twice the connect-int timeout. The stacked-timeouts parameter tells DRBD to use the wfc-timeout and degr-wfc-timeout parameters as defined in the configuration, even on stacked devices. Only use this parameter if the peer of the stacked resource is usually not available, or will not become primary. Incorrect use of this parameter can lead to unexpected split-brain scenarios.

wait-after-sb

This parameter causes DRBD to continue waiting in the init script even when a split-brain situation has been detected, and the nodes therefore refuse to connect to each other.

wfc-timeout timeout

Define how long the init script waits until all peers are connected. This can be useful in combination with a cluster manager which cannot manage DRBD resources: when the cluster manager starts, the DRBD resources will already be up and running. With a more capable cluster manager such as Pacemaker, it makes more sense to let the cluster manager control DRBD resources. The timeout is specified in seconds. The default value is 0, which stands for an infinite timeout. Also see the degr-wfc-timeout parameter.

volume 섹션

device /dev/drbdminor-number

Define the device name and minor number of a replicated block device. This is the device that applications are supposed to access; in most cases, the device is not used directly, but as a file system. This parameter is required and the standard device naming convention is assumed. In addition to this device, udev will create /dev/drbd/by-res/resource /volume and /dev/drbd/by-disk/lower-level-device symlinks to the device.

disk {[disk] | none}

Define the lower-level block device that DRBD will use for storing the actual data. While the replicated drbd device is configured, the lower-level device must not be used directly. Even read-only access with tools like dumpe2fs(8) and similar is not allowed. The keyword none specifies that no lower-level block device is configured; this also overrides inheritance of the lower-level device.

meta-disk internal,

meta-disk device,

meta-disk device [index]

Define where the metadata of a replicated block device resides: it can be internal, meaning that the lower-level device contains both the data and the metadata, or on a separate device. When the index form of this parameter is used, multiple replicated devices can share the same metadata device, each using a separate index. Each index occupies 128 MiB of data, which corresponds to a replicated device size of at most 4 TiB with two cluster nodes. We recommend not to share metadata devices anymore, and to instead use the lvm volume manager for creating metadata devices as needed. When the index form of this parameter is not used, the size of the lower-level device determines the size of the metadata. The size needed is 36 KiB + (size of lower-level device) / 32K * (number of nodes - 1). If the metadata device is bigger than that, the extra space is not used. This parameter is required if a disk other than none is specified, and ignored if disk is set to none. A meta-disk parameter without a disk parameter is not allowed.

데이터 무결성 관련

DRBD supports two different mechanisms for data integrity checking: first, the data-integrity-alg network parameter allows to add a checksum to the data sent over the network. Second, the online verification mechanism ( drbdadm verify and the verify-alg parameter) allows to check for differences in the on-disk data.Both mechanisms can produce false positives if the data is modified during I/O (i.e., while it is being sent over the network or written to disk). This does not always indicate a problem: for example, some file systems and applications do modify data under I/O for certain operations. Swap space can also undergo changes while under I/O.Network data integrity checking tries to identify data modification during I/O by verifying the checksums on the sender side after sending the data. If it detects a mismatch, it logs an error. The receiver also logs an error when it detects a mismatch. Thus, an error logged only on the receiver side indicates an error on the network, and an error logged on both sides indicates data modification under I/O.The most recent example of systematic data corruption was identified as a bug in the TCP offloading engine and driver of a certain type of GBit NIC in 2007: the data corruption happened on the DMA transfer from core memory to the card. Because the TCP checksum were calculated on the card, the TCP/IP protocol checksums did not reveal this problem.

구성파일 매뉴얼

개요