Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

Version 1 Next »

Overview

bsrmon is a performance monitoring tool that tracks the entire replication process of the bsr engine that performs replication and records the time required for each logical section of the engine in real time to identify bottlenecks in engine operation. The bsr kernel engine maintains performance-related statistical information such as time records for each section and accumulated calculation values at the time the engine was operated in the form of a file log through the internal jiffies record code. Administrators can monitor the statistics information of these kernel engines by querying them with the bsrmon utility. bsrmon is supported since version 1.6.1.

Functions

Features provided by bsrmon include:

/start

Activates the Performance Monitor feature and starts aggregating performance data and logging files. By default, Performance Monitor is enabled.

Activation of the performance monitor is performed through the following process.

  • windows

    • set HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\bsr\bsrmon_run registry value to 1

    • Activating the performance data aggregation logic of the bsr engine

    • Enable bsrservice's bsrmon /file execution logic

  • linux

    • Set the /etc/bsr.d/.bsrmon_run value to 1.

    • Activating the performance data aggregation logic of the bsr engine

    • /lib/bsr/bsrmon-run Run the script. Run the bsrmon /file command periodically

/stop

Disable Performance Monitor. When disabled, the engine stops aggregating performance data and logging files.

Deactivation is performed as follows.

  • windows

    • Set HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\bsr\bsrmon_run registry value to 0

    • Disable performance data aggregation logic of bsr engine

    • Disable bsrservice's bsrmon /file execution logic

  • linux

    • Set /etc/bsr.d/.bsrmon_run value to 0

    • Disable performance data aggregation logic of bsr engine

    • Stop the /lib/bsr/bsrmon-run script

/status

Query the operating status of the performance monitor.

/file

Saves performance data and memory information to a file. This is a hidden command and is used by the bsrservice and bsrmon-run scripts.

The file storage location is as follows.

  • windows : C:\Program Files\bsr\log\perfmon

  • linux : /var/log/bsr/perfmon/

Data is saved only for the inquiry time and the numerical value for each item, and when the /watch or /report command is executed, the data is parsed and output according to the format.

2021-07-20_05:55:21.443 955 955 244480 244480 1042 1042 266752 266752

/watch {types} [/scroll]

Data that is being aggregated by type is output in real time. When the /scroll option is used, the output is displayed in newline format.

/report {types} [/f {filename}] [/d {YYYY-MM-DD}] [/s {hh:mm[:ss]}] [/e {hh:mm[:ss]}]

Outputs the statistics of the data recorded by the file for each type.

When generating statistics, meaningless data such as numerical values are excluded from the target sample.

  • Data written to zero with no I/O occurring

  • I/O data generated temporarily (when the I/O number is not consecutively more than 2 times)

  • ex) vnr0_IO_STAT

    2021-07-07_21:27:52.304 0 0 0 0 0 0 0 0 // Excluded from samples as recorded performance data is zero
    2021-07-07_21:27:53.325 0 0 0 0 2 2 1 1 // Excluded from samples as transient I/O performance data
    2021-07-07_21:27:54.346 0 0 0 0 0 0 0 0
    ...
    2021-07-07_21:30:28.549 0 0 0 0 408 408 104448 104448 
    2021-07-07_21:30:29.580 0 0 0 0 409 409 104704 104704 // Performance data is not zero and is being recorded continuously. selected as samples
    2021-07-07_21:30:30.605 0 0 0 0 404 404 103424 103424

[/f {filename}]

the file name for the report. Used for statistical calculation of backed up files.

[/d {YYYY-MM-DD}] [/s {hh:mm[:ss]}] [/e {hh:mm[:ss]}]

This option allows you to inquire the figures for a specific section.

  • /d : Outputs statistics of data recorded on that date. If not entered, statistics for the entire period are output.

  • /s : Outputs statistics of data recorded from that time. If not input, statistics of recorded data are output from 00:00:00.

  • /e : Outputs statistics of data recorded up to that time. If not input, statistics of data collected until 23:59:59 are output.

/set {period, file_size, file_cnt} {value}

Command to adjust the numerical values related to monitoring.

period

Set file saving and monitoring cycle. It is set in seconds, and the default value is 1 second.

file_size

Sets the file rolling size. Set in MB unit, the default value is 50 MB.

file_cnt

Sets the number of file rolling. The default value is three.

/io_delay_test {flag} {delay point} {delay time}

Verifies the functionality of the bsr performance monitor by intentionally incurring I/O performance degradation. This feature is for developers.

/debug cmds options

Command for windows to inquire debugfs information.

types

Performance monitoring targets.

iostat, ioclat, reqstat, peer_reqstat, alstat, network, sendbuf, memstat

iostat {resource} {vnr}

Aggregate data for read and write I/O of replication volumes (iops, ios, kbs, kb)

 Usage Example
  • Real-time monitoring of I/O performance data

> bsrmon /watch iostat r0 0
2020-12-06_08:22:36.223
  read : IOPS=0 (IOs=0), BW=0kb/s (0KB)
  write: IOPS=216 (IOs=432), BW=110208kb/s (220416KB)
2020-12-06_08:22:37.640
  read : IOPS=0 (IOs=0), BW=0kb/s (0KB)
  write: IOPS=34 (IOs=34), BW=16896kb/s (16896KB)
2020-12-06_08:22:39.32
  read : IOPS=0 (IOs=0), BW=0kb/s (0KB)
  write: IOPS=326 (IOs=326), BW=155176kb/s (155176KB)
2020-12-06_08:22:40.106
  read : IOPS=0 (IOs=0), BW=0kb/s (0KB)
  write: IOPS=330 (IOs=330), BW=164624kb/s (164624KB)
  ...

→ A situation in which write I/O is continuously occurring

  • Statistical output of I/O performance data

> bsrmon /report iostat r0 0
Report r0 [IO STAT - vnr0]
 Run: 2021-07-20_08:14:27.405 - 2021-08-04_18:59:53.573
  read : ios=24855, bw=6325194kbyte
    IOPS        : min=10, max=1113, avg=972, samples=24
    BW (kbyte/s): min=40, max=284928, avg=248578, samples=24
  write: ios=8360345, bw=2139182512kbyte
    IOPS        : min=3, max=1568, avg=1252, samples=6670
    BW (kbyte/s): min=12, max=401408, avg=320256, samples=6671

→ Data aggregated during the period 2021-07-20_08:14:27.405 - 2021-08-04_18:59:53.573

→ 24 read I/O samples. Measured at an average of 972 iops, a minimum of 10 iops, and a maximum of 1113 iops

→ 6670 write I/O samples. Measured at 1252 iops average, 3 iops min, 1568 iops max

  • Calculate statistics using a backup file

> bsrmon /report iostat r0 0 /f vnr0_IO_STAT_2021-07-20_081427.005 
Report r0 [vnr0_IO_STAT_2021-07-20_081427.005]
 Run: 2021-07-07_21:24:24.383 - 2021-07-20_08:14:26.371
  read : ios=6332219, bw=1595360917kbyte
    IOPS        : min=1, max=1233, avg=980, samples=6452
    BW (kbyte/s): min=4, max=428032, avg=247033, samples=6452
  write: ios=22550577, bw=3961426883kbyte
    IOPS        : min=1, max=6478, avg=786, samples=28468
    BW (kbyte/s): min=4, max=1085440, avg=138461, samples=28484

→ Extract data from file vnr0_IO_STAT_2021-07-20_081427.005

→ Data aggregated for the period 2021-07-07_21:24:24.383 - 2021-07-20_08:14:26.371

→ 6452 read I/O samples. Measured at an average of 980 iops, a minimum of 1 iops, and a maximum of 1233 iops

→ 28468 write I/O samples. Measured average 786 iops, min 1 iops, max 6478 iops

ioclat {resource} {vnr}

Complete latency of local/master I/O (usec)

 Usage Example
  • Real-time monitoring of I/O complete latency

> bsrmon /watch ioclat r0 0
2020-12-08_09:33:19.171
  local clat  (usec): min=520, max=112000, avg=10347
  master clat (usec): min=610, max=256852, avg=28202
2020-12-08_09:33:20.351
  local clat  (usec): min=780, max=478387, avg=44316
  master clat (usec): min=1499, max=492829, avg=114106
2020-12-08_09:33:21.509
  local clat  (usec): min=478, max=19805, avg=4523
  master clat (usec): min=577, max=24335, avg=6303
  ...

→ A situation in which write I/O is continuously occurring

→ Outputs the minimum, maximum, and average values of the time taken for local/master I/O to complete

  • I/O complete latency statistics output

> bsrmon /report ioclat r0 0 
Report r0 [IO COMPLETE - vnr0]
 Run: 2020-12-07_02:42:31.440 - 2020-12-08_09:37:44.115
  local clat  (usec): min=153, max=1635570, avg=8601, samples=50
  master clat (usec): min=217, max=1667260, avg=21801, samples=50

→ Data aggregated for the period 2020-12-07_02:42:31.440 - 2020-12-08_09:37:44.115

→ 50 local clat samples. Measured to have consumed an average of 8601, a minimum of 153, and a maximum of 1635570 usec

→ 50 master clat samples. Measured to have consumed an average of 21801, a minimum of 217, and a maximum of 1667260 usec

reqstat {resource} {vnr}

Aggregate data on the time taken from the time of request creation to a specific section (usec)

  • requests : The number of requests processed during the monitoring period (cycle)

  • before_queue : Time elapsed before being queued in the submit work queue

  • before_al_begin : Time elapsed before getting the lru of act_log

  • in_actlog : Time taken for RQ_IN_ACT_LOG to be set

  • submit : Time taken until request submit is executed

  • bio_endio : Time taken for bio completion logic to be executed

  • pre_send : Time taken before sending data to peer

  • acked : Time elapsed until RQ_NET_PENDING is removed after send

  • net_done : Time taken to RQ_NET_DONE

  • destroy : The time it takes for the request to be released

Time required for each section until al uptate is completed (usec)

  • al_update : The number of active logs updated during the monitoring period (cycle)

  • before_bm_write : Before executing bsr_bm_write_hinted()

  • after_bm_write : After executing bsr_bm_write_hinted()

  • after_sync_page : After executing bsr_md_sync_page_io()

 Usage Example
  • Real-time monitoring of request performance data

> bsrmon /watch reqstat r0 0
2020-12-08_09:42:18.549
  requests  : 445
    before_queue    (usec): min=0, max=19, avg=0
    before_al_begin (usec): min=0, max=4907, avg=11
    in_actlog       (usec): min=1, max=7191, avg=35
    submit          (usec): min=3, max=7215, avg=46
    bio_endio       (usec): min=4, max=7219, avg=50
    destroy         (usec): min=577, max=214980, avg=15711
  al_uptate  : 0
    before_bm_write (usec): min=0, max=0, avg=0
    after_bm_write  (usec): min=0, max=0, avg=0
    after_sync_page (usec): min=0, max=0, avg=0
  PEER bsr03:
    pre_send (usec): min=0, max=0, avg=0
    acked    (usec): min=0, max=0, avg=0
    net_done (usec): min=0, max=0, avg=0
  PEER 100.100.10.31:7792:
    pre_send (usec): min=571, max=214862, avg=15703
    acked    (usec): min=555, max=210004, avg=7966
    net_done (usec): min=575, max=214978, avg=15710
...

→ A situation in which write I/O is continuously occurring

→ Disconnect from peer bsr03

→ al update is not performed

  • Request performance data statistics output

> bsrmon /report reqstat r0 0 
Report r0 [REQUEST STAT - vnr0]
 Run: 2020-12-06_05:46:36.678 - 2020-12-08_09:45:46.219
  requests : total=38180
    before_queue    (usec): min=0, max=0, avg=0, samples=0
    before_al_begin (usec): min=6966, max=460112, avg=4264, samples=2
    in_actlog       (usec): min=1, max=589946, avg=549, samples=28
    submit          (usec): min=2, max=721793, avg=349, samples=72
    bio_endio       (usec): min=3, max=589987, avg=579, samples=28
    destroy         (usec): min=83, max=2362749, avg=46380, samples=161
  al_uptate : total=0
    before_bm_write (usec): min=0, max=0, avg=0, samples=0
    after_bm_write  (usec): min=0, max=0, avg=0, samples=0
    after_sync_page (usec): min=0, max=0, avg=0, samples=0
  PEER 100.100.10.31:7792:
    pre_send (usec): min=7, max=1667187, avg=22364, samples=161
    acked    (usec): min=11, max=1635581, avg=15737, samples=161
    net_done (usec): min=82, max=2362749, avg=39406, samples=161
  PEER bsr03:
    pre_send (usec): min=0, max=0, avg=0, samples=0
    acked    (usec): min=0, max=0, avg=0, samples=0
    net_done (usec): min=0, max=0, avg=0, samples=0

→ 38180 requests processed during the period 2020-12-06_05:46:36.678 - 2020-12-08_09:45:46.219

→ al_update never occurs

peer_reqstat {resource} {vnr}

Aggregate data (usec) about the time taken from the peer request creation time to a specific section

  • peer_requests : Number of peer requests processed during the monitoring period (cycle)

  • submit : Time taken until peer request submit is executed

  • bio_endio : Time taken until the bio completion logic of the peer request is executed

  • destroy : Time taken until peer request is released

 Usage Example
  • Real-time monitoring of peer request performance data

> bsrmon /watch peer_reqstat r0 0 
2021-07-19_21:52:02.191
  PEER bsr-03:
    peer requests : 0
    submit    (usec): min=0, max=0, avg=0
    bio_endio (usec): min=0, max=0, avg=0
    destroy   (usec): min=0, max=0, avg=0
  PEER bsr-02:
    peer requests : 100
    submit    (usec): min=421, max=6907, avg=2184
    bio_endio (usec): min=1021, max=7312, avg=2563
    destroy   (usec): min=1739, max=7955, avg=3244

→ 100 peer requests processed for bsr-02

→ No peer request for bsr-03

  • Peer request performance statistics output

> bsrmon /report peer_reqstat r0 0
Report r0 [PEER REQUEST STAT - vnr0]
 Run: 2021-07-11_22:23:08.890 - 2021-07-19_21:56:18.179
  PEER bsr-02:
    peer requests : total=344054
    submit    (usec): min=1, max=36103, avg=99, samples=6902
    bio_endio (usec): min=1, max=116988, avg=96161, samples=9517
    destroy   (usec): min=47, max=117495, avg=96340, samples=9518
  PEER bsr-03:
    peer requests : total=133288
    submit    (usec): min=1, max=1670, avg=5, samples=6037
    bio_endio (usec): min=1, max=117000, avg=104660, samples=8708
    destroy   (usec): min=63, max=125871, avg=104839, samples=8709

→ Data aggregated during the period 2021-07-11_22:23:08.890 - 2021-07-19_21:56:18.179

→ It took an average of 0.1 seconds for the bio completion logic to be executed.

alstat {resource} {vnr}

Aggregate data for usage figures in active log

  • Aggregation of used values, the maximum value of used values

  • Aggregating increments of hits, misses, starving, locked, changed

  • al_wait retry count count

  • aggregation of causes of AL shortages

    • starving, pending, used, busy, wouldblock

 Usage Example
  • Real-time monitoring of activity log performance data

> bsrmon /watch alstat r0 0
2021-08-04_19:10:11.463
  used    :          3/67 (max=8)
  hits    :        579 (total=2843)
  misses  :       2250 (total=17564)
  starving:          0 (total=0)
  locked  :          0 (total=0)
  changed :        749 (total=5854)
  al_wait retry :          0 (total=0, max=0)
  pending_changes :  1/64
  error   : 0
    NOBUFS - starving : 0
           - pending slot : 0
           - used slot : 0
    BUSY : 0
    WOULDBLOCK : 0
  flags   : __LC_DIRTY __LC_LOCKED 

→ There are 3 al slots in use at the time

→ 1 pending al slot

→ The dirty, locked flag is set

  • activity log performance statistics output

> bsrmon /report alstat r0 0
Report r0 [AL STAT - vnr0]
 Run: 2021-08-03_04:48:33.721 - 2021-08-03_22:47:28.326
  al_extents : 6001
    used     : max=0(all_slot_used=0), avg=0
    hits     : total=0
    misses   : total=0
    starving : total=0
    locked   : total=0
    changed  : total=0
    al_wait retry count : max=0, total=0
    pending_changes     : max=0, total=0
    error : total=0
      NOBUFS - starving     : total=0
             - pending slot : total=0
             - used    slot : total=0
      BUSY       : total=0
      WOULDBLOCK : total=0
 -> al_extents changed 
 Run: 2021-08-03_22:47:52.895 - 2021-08-04_19:27:06.522
  al_extents : 67
    used     : max=67(all_slot_used=59), avg=3
    hits     : total=337528
    misses   : total=2020409
    starving : total=2
    locked   : total=0
    changed  : total=673370
    al_wait retry count : max=2, total=501
    pending_changes     : max=64, total=258
    error : total=1004
      NOBUFS - starving     : total=501
             - pending slot : total=0
             - used    slot : total=503
      BUSY       : total=0
      WOULDBLOCK : total=0

→ al_extents value changed from 6001 to 67 on 2021-08-03_22:47:52.895

→ If al is set to 67, the number of times that all al slots are used 59 times, starving situation occurs 2 times, and al_wait retry occurs 501 times.

network {resource}

Replication network transmit/receive speed (byte/s)

 Usage Example
  • Real-time monitoring of network performance data

> bsrmon /watch network r0
2020-12-08_09:47:45.84
  PEER 100.100.10.31:7792:
    send (byte/s): 62932184
    recv (byte/s): 4820
  PEER bsr03:
    send (byte/s): 0
    recv (byte/s): 0
2020-12-08_09:47:46.160
  PEER 100.100.10.31:7792:
    send (byte/s): 104896568
    recv (byte/s): 12320
  PEER bsr03:
    send (byte/s): 0
    recv (byte/s): 0
...

→ Disconnected from peer bsr03

→ Data transmission to peer 100.100.10.31:7792 is in progress

  • output network performance statistics

> bsrmon /report network r0
Report r0 [NETWORK SPEED]
 Run: 2020-11-30_06:38:44.653 - 2020-12-08_09:53:31.455
  PEER 100.100.10.31:7792: send=52497905byte/s, receive=5902byte/s
    send (byte/s): min=3, max=115395016, avg=52497905, samples=392
    recv (byte/s): min=3, max=15004, avg=5902, samples=367
  PEER bsr03: send=0byte/s, receive=0byte/s
    send (byte/s): min=0, max=0, avg=0, samples=0
    recv (byte/s): min=0, max=0, avg=0, samples=0

→ On average, 52.4 MB of data is being transferred to the peer 100.100.10.31 node.

sendbuf {resource}

Send buffer usage (bytes)

 Usage Example
  • Real-time monitoring of transmit buffer performance data

> bsrmon /watch sendbuf r0
2020-12-08_09:54:13.735
  PEER 100.100.10.31:7792:
    data stream
        size (bytes): 10485761
        used (bytes): 204920
         [P_RS_DATA_REPLY]  -  cnt: 3  size: 209016 bytes
    control stream
        size (bytes): 5242881
        used (bytes): 0
  PEER bsr03:
    data stream
        size (bytes): 10485761
        used (bytes): 0
    control stream
        size (bytes): 5242881
        used (bytes): 0
...

→ The send buffer is allocated to peer 100.100.10.31:7792, bsr03

→ Disconnected from peer bsr03

→ Data transmission to peer 100.100.10.31:7792 is in progress

  • Send buffer performance statistics output

> bsrmon /report sendbuf r0
Report r0 [SEND BUFFER]
 Run: 2020-12-05_13:26:59.969 - 2020-12-08_09:56:33.718
  PEER 100.100.10.31:7792: data stream size=10485761byte, control stream size=5242881byte
    data-used (bytes): min=2097232, max=7603084, avg=4787174, samples=5
    cntl-used (bytes): min=0, max=0, avg=0, samples=0
  PEER bsr03: data stream size=10485761byte, control stream size=5242881byte
    data-used (bytes): min=0, max=0, avg=0, samples=0
    cntl-used (bytes): min=0, max=0, avg=0, samples=0

→ On average, 4.7 MB of buffer out of 10 MB is used

memstat

Memory usage by users and modules

In user space, the memory used by the bsradm, bsrsetup, bsrcon, bsrmon, and bsrservice processes is aggregated.

  • windows

    • Memory information obtained through GetProcessMemoryInfo()

      • WorkingSetSize, QuotaPagedPoolUsage, QuotaNonPagedPoolUsage, PagefileUsage

  • linux

    • Memory information obtained through the ps command

      • rsz : physical memory usage

      • vsz : virtual memory usage

As for the memory used by the kernel module, the following information is aggregated by OS.

  • windows

    • Nonpaged, paged memory usage allocated with 'BS--' tag in the driver

  • linux

    • slab cache informations

      • /sys/kernel/slab/bsr_req

      • /sys/kernel/slab/bsr_al

      • /sys/kernel/slab/bsr_bm

      • /sys/kernel/slab/bsr_ee

 Usage Example
  • Real-time monitoring of linux memory usage

> bsrmon /watch memstat
2020-12-08_09:57:27.171
  module (bytes)
    BSR_REQ : 16334336
    BSR_AL  : 803760
    BSR_BM  : 4161536
    BSR_EE  : 2782560
  user (kbytes)
    name      pid    rsz        vsz       
    bsrmon    29304  1192       12724     
    bsrmon    37474  1200       12720     
    bsrmon    112177 1192       12724     
    bsrmon    113913 1068       12724     
    bsrmon    113978 1308       12728   

→ Multiple bsrmon processes running. A non-terminating bsrmon process exists

  • Memory usage statistics output

> bsrmon /report memstat
Report [MEMORY]
 Run: 2020-12-06_19:04:12.177 - 2020-12-08_09:59:17.716
 module (bytes)
  BSR_REQ: 16303104 - 16459264    
  BSR_AL : 803760                 
  BSR_BM : 4063232 - 5271552      
  BSR_EE : 2782560                
 user (kbytes)
  name          rsz                     vsz
  bsradm        0                       0                       
  bsrsetup      0                       0                       
  bsrmeta       0                       0                       
  bsrmon        1068 - 1368             12720 - 12732     

→ Output the memory usage range of the data collected during the period 2020-12-06_19:04:12.177 - 2020-12-08_09:59:17.716

→ BSR_REQ uses 16303104 ~ 16459264 bytes

→ Considering that there is no data collected in user space except for bsrmon, it can be seen that the cli commands were executed within 1 second(The measurement period of bsrmon is 1 second.).

  • No labels