Overview
bsrmon is a performance monitoring tool that tracks the entire replication process of the bsr engine that performs replication and records the time required for each logical section of the engine in real time to identify bottlenecks in engine operation. The bsr kernel engine maintains performance-related statistical information such as time records for each section and accumulated calculation values at the time the engine was operated in the form of a file log through the internal jiffies record code. Administrators can monitor the statistics information of these kernel engines by querying them with the bsrmon utility. bsrmon is supported since version 1.6.1.
Functions
Features provided by bsrmon include:
/start
Activates the Performance Monitor feature and starts aggregating performance data and logging files. By default, Performance Monitor is enabled.
Activation of the performance monitor is performed through the following process.
windows
set HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\bsr\bsrmon_run registry value to 1
Activating the performance data aggregation logic of the bsr engine
Enable bsrservice's bsrmon /file execution logic
linux
Set the /etc/bsr.d/.bsrmon_run value to 1.
Activating the performance data aggregation logic of the bsr engine
/lib/bsr/bsrmon-run Run the script. Run the bsrmon /file command periodically
/stop
Disable Performance Monitor. When disabled, the engine stops aggregating performance data and logging files.
Deactivation is performed as follows.
windows
Set HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\bsr\bsrmon_run registry value to 0
Disable performance data aggregation logic of bsr engine
Disable bsrservice's bsrmon /file execution logic
linux
Set /etc/bsr.d/.bsrmon_run value to 0
Disable performance data aggregation logic of bsr engine
Stop the /lib/bsr/bsrmon-run script
/status
Query the operating status of the performance monitor.
/file
Saves performance data and memory information to a file. This is a hidden command and is used by the bsrservice and bsrmon-run scripts.
The file storage location is as follows.
Data is saved only for the inquiry time and the numerical value for each item, and when the /watch or /report command is executed, the data is parsed and output according to the format.
2021-07-20_05:55:21.443 955 955 244480 244480 1042 1042 266752 266752
Data that is being aggregated by type is output in real time. When the /scroll option is used, the output is displayed in newline format.
/report {types} [/f {filename}] [/d {YYYY-MM-DD}] [/s {hh:mm[:ss]}] [/e {hh:mm[:ss]}]
Outputs the statistics of the data recorded by the file for each type.
When generating statistics, meaningless data such as numerical values are excluded from the target sample.
[/f {filename}]
the file name for the report. Used for statistical calculation of backed up files.
[/d {YYYY-MM-DD}] [/s {hh:mm[:ss]}] [/e {hh:mm[:ss]}]
This option allows you to inquire the figures for a specific section.
/d : Outputs statistics of data recorded on that date. If not entered, statistics for the entire period are output.
/s : Outputs statistics of data recorded from that time. If not input, statistics of recorded data are output from 00:00:00.
/e : Outputs statistics of data recorded up to that time. If not input, statistics of data collected until 23:59:59 are output.
/set {period, file_size, file_cnt} {value}
Command to adjust the numerical values related to monitoring.
period
Set file saving and monitoring cycle. It is set in seconds, and the default value is 1 second.
file_size
Sets the file rolling size. Set in MB unit, the default value is 50 MB.
file_cnt
Sets the number of file rolling. The default value is three.
/io_delay_test {flag} {delay point} {delay time}
Verifies the functionality of the bsr performance monitor by intentionally incurring I/O performance degradation. This feature is for developers.
/debug cmds options
Command for windows to inquire debugfs information.
types
Performance monitoring targets.
iostat, ioclat, reqstat, peer_reqstat, alstat, network, sendbuf, memstat
iostat {resource} {vnr}
Aggregate data for read and write I/O of replication volumes (iops, ios, kbs, kb)
Usage Example
> bsrmon /watch iostat r0 0
2020-12-06_08:22:36.223
read : IOPS=0 (IOs=0), BW=0kb/s (0KB)
write: IOPS=216 (IOs=432), BW=110208kb/s (220416KB)
2020-12-06_08:22:37.640
read : IOPS=0 (IOs=0), BW=0kb/s (0KB)
write: IOPS=34 (IOs=34), BW=16896kb/s (16896KB)
2020-12-06_08:22:39.32
read : IOPS=0 (IOs=0), BW=0kb/s (0KB)
write: IOPS=326 (IOs=326), BW=155176kb/s (155176KB)
2020-12-06_08:22:40.106
read : IOPS=0 (IOs=0), BW=0kb/s (0KB)
write: IOPS=330 (IOs=330), BW=164624kb/s (164624KB)
...
|
→ A situation in which write I/O is continuously occurring |
> bsrmon /report iostat r0 0
Report r0 [IO STAT - vnr0]
Run: 2021-07-20_08:14:27.405 - 2021-08-04_18:59:53.573
read : ios=24855, bw=6325194kbyte
IOPS : min=10, max=1113, avg=972, samples=24
BW (kbyte/s): min=40, max=284928, avg=248578, samples=24
write: ios=8360345, bw=2139182512kbyte
IOPS : min=3, max=1568, avg=1252, samples=6670
BW (kbyte/s): min=12, max=401408, avg=320256, samples=6671
|
→ Data aggregated during the period 2021-07-20_08:14:27.405 - 2021-08-04_18:59:53.573 → 24 read I/O samples. Measured at an average of 972 iops, a minimum of 10 iops, and a maximum of 1113 iops → 6670 write I/O samples. Measured at 1252 iops average, 3 iops min, 1568 iops max |
> bsrmon /report iostat r0 0 /f vnr0_IO_STAT_2021-07-20_081427.005
Report r0 [vnr0_IO_STAT_2021-07-20_081427.005]
Run: 2021-07-07_21:24:24.383 - 2021-07-20_08:14:26.371
read : ios=6332219, bw=1595360917kbyte
IOPS : min=1, max=1233, avg=980, samples=6452
BW (kbyte/s): min=4, max=428032, avg=247033, samples=6452
write: ios=22550577, bw=3961426883kbyte
IOPS : min=1, max=6478, avg=786, samples=28468
BW (kbyte/s): min=4, max=1085440, avg=138461, samples=28484
|
→ Extract data from file vnr0_IO_STAT_2021-07-20_081427.005 → Data aggregated for the period 2021-07-07_21:24:24.383 - 2021-07-20_08:14:26.371 → 6452 read I/O samples. Measured at an average of 980 iops, a minimum of 1 iops, and a maximum of 1233 iops → 28468 write I/O samples. Measured average 786 iops, min 1 iops, max 6478 iops |
ioclat {resource} {vnr}
Complete latency of local/master I/O (usec)
Usage Example
> bsrmon /watch ioclat r0 0
2020-12-08_09:33:19.171
local clat (usec): min=520, max=112000, avg=10347
master clat (usec): min=610, max=256852, avg=28202
2020-12-08_09:33:20.351
local clat (usec): min=780, max=478387, avg=44316
master clat (usec): min=1499, max=492829, avg=114106
2020-12-08_09:33:21.509
local clat (usec): min=478, max=19805, avg=4523
master clat (usec): min=577, max=24335, avg=6303
...
|
→ A situation in which write I/O is continuously occurring → Outputs the minimum, maximum, and average values of the time taken for local/master I/O to complete |
> bsrmon /report ioclat r0 0
Report r0 [IO COMPLETE - vnr0]
Run: 2020-12-07_02:42:31.440 - 2020-12-08_09:37:44.115
local clat (usec): min=153, max=1635570, avg=8601, samples=50
master clat (usec): min=217, max=1667260, avg=21801, samples=50
|
→ Data aggregated for the period 2020-12-07_02:42:31.440 - 2020-12-08_09:37:44.115 → 50 local clat samples. Measured to have consumed an average of 8601, a minimum of 153, and a maximum of 1635570 usec → 50 master clat samples. Measured to have consumed an average of 21801, a minimum of 217, and a maximum of 1667260 usec |
reqstat {resource} {vnr}
Aggregate data on the time taken from the time of request creation to a specific section (usec)
requests : The number of requests processed during the monitoring period (cycle)
before_queue : Time elapsed before being queued in the submit work queue
before_al_begin : Time elapsed before getting the lru of act_log
in_actlog : Time taken for RQ_IN_ACT_LOG to be set
submit : Time taken until request submit is executed
bio_endio : Time taken for bio completion logic to be executed
pre_send : Time taken before sending data to peer
acked : Time elapsed until RQ_NET_PENDING is removed after send
net_done : Time taken to RQ_NET_DONE
destroy : The time it takes for the request to be released
Time required for each section until al uptate is completed (usec)
al_update : The number of active logs updated during the monitoring period (cycle)
before_bm_write : Before executing bsr_bm_write_hinted()
after_bm_write : After executing bsr_bm_write_hinted()
after_sync_page : After executing bsr_md_sync_page_io()
Usage Example
> bsrmon /watch reqstat r0 0
2020-12-08_09:42:18.549
requests : 445
before_queue (usec): min=0, max=19, avg=0
before_al_begin (usec): min=0, max=4907, avg=11
in_actlog (usec): min=1, max=7191, avg=35
submit (usec): min=3, max=7215, avg=46
bio_endio (usec): min=4, max=7219, avg=50
destroy (usec): min=577, max=214980, avg=15711
al_uptate : 0
before_bm_write (usec): min=0, max=0, avg=0
after_bm_write (usec): min=0, max=0, avg=0
after_sync_page (usec): min=0, max=0, avg=0
PEER bsr03:
pre_send (usec): min=0, max=0, avg=0
acked (usec): min=0, max=0, avg=0
net_done (usec): min=0, max=0, avg=0
PEER 100.100.10.31:7792:
pre_send (usec): min=571, max=214862, avg=15703
acked (usec): min=555, max=210004, avg=7966
net_done (usec): min=575, max=214978, avg=15710
...
|
→ A situation in which write I/O is continuously occurring → Disconnect from peer bsr03 → al update is not performed |
> bsrmon /report reqstat r0 0
Report r0 [REQUEST STAT - vnr0]
Run: 2020-12-06_05:46:36.678 - 2020-12-08_09:45:46.219
requests : total=38180
before_queue (usec): min=0, max=0, avg=0, samples=0
before_al_begin (usec): min=6966, max=460112, avg=4264, samples=2
in_actlog (usec): min=1, max=589946, avg=549, samples=28
submit (usec): min=2, max=721793, avg=349, samples=72
bio_endio (usec): min=3, max=589987, avg=579, samples=28
destroy (usec): min=83, max=2362749, avg=46380, samples=161
al_uptate : total=0
before_bm_write (usec): min=0, max=0, avg=0, samples=0
after_bm_write (usec): min=0, max=0, avg=0, samples=0
after_sync_page (usec): min=0, max=0, avg=0, samples=0
PEER 100.100.10.31:7792:
pre_send (usec): min=7, max=1667187, avg=22364, samples=161
acked (usec): min=11, max=1635581, avg=15737, samples=161
net_done (usec): min=82, max=2362749, avg=39406, samples=161
PEER bsr03:
pre_send (usec): min=0, max=0, avg=0, samples=0
acked (usec): min=0, max=0, avg=0, samples=0
net_done (usec): min=0, max=0, avg=0, samples=0
|
→ 38180 requests processed during the period 2020-12-06_05:46:36.678 - 2020-12-08_09:45:46.219 → al_update never occurs |
peer_reqstat {resource} {vnr}
Aggregate data (usec) about the time taken from the peer request creation time to a specific section
peer_requests : Number of peer requests processed during the monitoring period (cycle)
submit : Time taken until peer request submit is executed
bio_endio : Time taken until the bio completion logic of the peer request is executed
destroy : Time taken until peer request is released
Usage Example
> bsrmon /watch peer_reqstat r0 0
2021-07-19_21:52:02.191
PEER bsr-03:
peer requests : 0
submit (usec): min=0, max=0, avg=0
bio_endio (usec): min=0, max=0, avg=0
destroy (usec): min=0, max=0, avg=0
PEER bsr-02:
peer requests : 100
submit (usec): min=421, max=6907, avg=2184
bio_endio (usec): min=1021, max=7312, avg=2563
destroy (usec): min=1739, max=7955, avg=3244
|
→ 100 peer requests processed for bsr-02 → No peer request for bsr-03 |
> bsrmon /report peer_reqstat r0 0
Report r0 [PEER REQUEST STAT - vnr0]
Run: 2021-07-11_22:23:08.890 - 2021-07-19_21:56:18.179
PEER bsr-02:
peer requests : total=344054
submit (usec): min=1, max=36103, avg=99, samples=6902
bio_endio (usec): min=1, max=116988, avg=96161, samples=9517
destroy (usec): min=47, max=117495, avg=96340, samples=9518
PEER bsr-03:
peer requests : total=133288
submit (usec): min=1, max=1670, avg=5, samples=6037
bio_endio (usec): min=1, max=117000, avg=104660, samples=8708
destroy (usec): min=63, max=125871, avg=104839, samples=8709
|
→ Data aggregated during the period 2021-07-11_22:23:08.890 - 2021-07-19_21:56:18.179 → It took an average of 0.1 seconds for the bio completion logic to be executed. |
alstat {resource} {vnr}
Aggregate data for usage figures in active log
Aggregation of used values, the maximum value of used values
Aggregating increments of hits, misses, starving, locked, changed
al_wait retry count count
aggregation of causes of AL shortages
starving, pending, used, busy, wouldblock
Usage Example
> bsrmon /watch alstat r0 0
2021-08-04_19:10:11.463
used : 3/67 (max=8)
hits : 579 (total=2843)
misses : 2250 (total=17564)
starving: 0 (total=0)
locked : 0 (total=0)
changed : 749 (total=5854)
al_wait retry : 0 (total=0, max=0)
pending_changes : 1/64
error : 0
NOBUFS - starving : 0
- pending slot : 0
- used slot : 0
BUSY : 0
WOULDBLOCK : 0
flags : __LC_DIRTY __LC_LOCKED
|
→ There are 3 al slots in use at the time → 1 pending al slot → The dirty, locked flag is set |
> bsrmon /report alstat r0 0
Report r0 [AL STAT - vnr0]
Run: 2021-08-03_04:48:33.721 - 2021-08-03_22:47:28.326
al_extents : 6001
used : max=0(all_slot_used=0), avg=0
hits : total=0
misses : total=0
starving : total=0
locked : total=0
changed : total=0
al_wait retry count : max=0, total=0
pending_changes : max=0, total=0
error : total=0
NOBUFS - starving : total=0
- pending slot : total=0
- used slot : total=0
BUSY : total=0
WOULDBLOCK : total=0
-> al_extents changed
Run: 2021-08-03_22:47:52.895 - 2021-08-04_19:27:06.522
al_extents : 67
used : max=67(all_slot_used=59), avg=3
hits : total=337528
misses : total=2020409
starving : total=2
locked : total=0
changed : total=673370
al_wait retry count : max=2, total=501
pending_changes : max=64, total=258
error : total=1004
NOBUFS - starving : total=501
- pending slot : total=0
- used slot : total=503
BUSY : total=0
WOULDBLOCK : total=0
|
→ al_extents value changed from 6001 to 67 on 2021-08-03_22:47:52.895 → If al is set to 67, the number of times that all al slots are used 59 times, starving situation occurs 2 times, and al_wait retry occurs 501 times. |
network {resource}
Replication network transmit/receive speed (byte/s)
Usage Example
> bsrmon /watch network r0
2020-12-08_09:47:45.84
PEER 100.100.10.31:7792:
send (byte/s): 62932184
recv (byte/s): 4820
PEER bsr03:
send (byte/s): 0
recv (byte/s): 0
2020-12-08_09:47:46.160
PEER 100.100.10.31:7792:
send (byte/s): 104896568
recv (byte/s): 12320
PEER bsr03:
send (byte/s): 0
recv (byte/s): 0
...
|
→ Disconnected from peer bsr03 → Data transmission to peer 100.100.10.31:7792 is in progress |
> bsrmon /report network r0
Report r0 [NETWORK SPEED]
Run: 2020-11-30_06:38:44.653 - 2020-12-08_09:53:31.455
PEER 100.100.10.31:7792: send=52497905byte/s, receive=5902byte/s
send (byte/s): min=3, max=115395016, avg=52497905, samples=392
recv (byte/s): min=3, max=15004, avg=5902, samples=367
PEER bsr03: send=0byte/s, receive=0byte/s
send (byte/s): min=0, max=0, avg=0, samples=0
recv (byte/s): min=0, max=0, avg=0, samples=0
|
→ On average, 52.4 MB of data is being transferred to the peer 100.100.10.31 node. |
sendbuf {resource}
Send buffer usage (bytes)
Usage Example
> bsrmon /watch sendbuf r0
2020-12-08_09:54:13.735
PEER 100.100.10.31:7792:
data stream
size (bytes): 10485761
used (bytes): 204920
[P_RS_DATA_REPLY] - cnt: 3 size: 209016 bytes
control stream
size (bytes): 5242881
used (bytes): 0
PEER bsr03:
data stream
size (bytes): 10485761
used (bytes): 0
control stream
size (bytes): 5242881
used (bytes): 0
...
|
→ The send buffer is allocated to peer 100.100.10.31:7792, bsr03 → Disconnected from peer bsr03 → Data transmission to peer 100.100.10.31:7792 is in progress |
> bsrmon /report sendbuf r0
Report r0 [SEND BUFFER]
Run: 2020-12-05_13:26:59.969 - 2020-12-08_09:56:33.718
PEER 100.100.10.31:7792: data stream size=10485761byte, control stream size=5242881byte
data-used (bytes): min=2097232, max=7603084, avg=4787174, samples=5
cntl-used (bytes): min=0, max=0, avg=0, samples=0
PEER bsr03: data stream size=10485761byte, control stream size=5242881byte
data-used (bytes): min=0, max=0, avg=0, samples=0
cntl-used (bytes): min=0, max=0, avg=0, samples=0
|
→ On average, 4.7 MB of buffer out of 10 MB is used |
memstat
Memory usage by users and modules
In user space, the memory used by the bsradm, bsrsetup, bsrcon, bsrmon, and bsrservice processes is aggregated.
As for the memory used by the kernel module, the following information is aggregated by OS.
windows
linux
slab cache informations
/sys/kernel/slab/bsr_req
/sys/kernel/slab/bsr_al
/sys/kernel/slab/bsr_bm
/sys/kernel/slab/bsr_ee
Usage Example
> bsrmon /watch memstat
2020-12-08_09:57:27.171
module (bytes)
BSR_REQ : 16334336
BSR_AL : 803760
BSR_BM : 4161536
BSR_EE : 2782560
user (kbytes)
name pid rsz vsz
bsrmon 29304 1192 12724
bsrmon 37474 1200 12720
bsrmon 112177 1192 12724
bsrmon 113913 1068 12724
bsrmon 113978 1308 12728
|
→ Multiple bsrmon processes running. A non-terminating bsrmon process exists |
> bsrmon /report memstat
Report [MEMORY]
Run: 2020-12-06_19:04:12.177 - 2020-12-08_09:59:17.716
module (bytes)
BSR_REQ: 16303104 - 16459264
BSR_AL : 803760
BSR_BM : 4063232 - 5271552
BSR_EE : 2782560
user (kbytes)
name rsz vsz
bsradm 0 0
bsrsetup 0 0
bsrmeta 0 0
bsrmon 1068 - 1368 12720 - 12732
|
→ Output the memory usage range of the data collected during the period 2020-12-06_19:04:12.177 - 2020-12-08_09:59:17.716 → BSR_REQ uses 16303104 ~ 16459264 bytes → Considering that there is no data collected in user space except for bsrmon, it can be seen that the cli commands were executed within 1 second(The measurement period of bsrmon is 1 second.). |