Optimizations


FSR optimization is described in terms of two aspects: the latency of local I/O and the throughput of replication traffic.

Latency

The time from when local I/O is issued until it is written to disk and completed is called the response time, or latency. The lower the latency, the more I/Os can be processed per unit of time and the better the performance. Conversely, if latency increases because of a bottleneck, the number of I/Os per unit of time decreases and performance suffers. In other words, to optimize the performance of the fsr engine, the fsr configuration must be adjusted to minimize this latency.
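As a rough illustration of this relationship (not an fsr tool), the sketch below shows how latency translates into I/Os per second; the figures are hypothetical and assume one outstanding I/O at a time.

```python
# Rough illustration: lower latency means more I/Os per unit of time.
# Assumes one outstanding I/O at a time (queue depth 1); numbers are hypothetical.

def iops_at(latency_ms: float) -> float:
    """I/Os per second achievable when each I/O takes latency_ms milliseconds."""
    return 1000.0 / latency_ms

for latency_ms in (0.1, 0.5, 1.0, 5.0):
    print(f"latency {latency_ms:>4} ms -> about {iops_at(latency_ms):8.0f} IOPS")
```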

Track the delay factors described below in real time with the fsr performance monitor to identify which ones are causing bottlenecks, and then apply the appropriate countermeasures.

System Delay

Before examining the delay factors of the fsr engine itself, determine how much delay the system alone introduces. To do this, measure the raw I/O performance of the volume with no replication configured. The result depends on the storage, but the latency of a 4KB I/O is usually no more than a few milliseconds, and on fast storage only tens to hundreds of microseconds. If the measured value is higher than this, the system itself is introducing delay and the performance of the storage should be checked.
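A minimal sketch of such a baseline measurement is shown below, assuming a test file placed on the volume to be checked; the path and iteration count are placeholders, and a dedicated I/O benchmark tool will give more precise numbers than this simple fsync-based timing.

```python
# Minimal baseline sketch: time synchronous 4 KB writes on a volume with no
# replication configured. The file path and count are placeholders; a dedicated
# I/O benchmark tool gives more precise results than this fsync-based timing.
import os
import time

PATH = "testfile.bin"   # place this file on the volume being measured
BLOCK = b"\0" * 4096    # 4 KB write
COUNT = 1000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
latencies = []
for _ in range(COUNT):
    start = time.perf_counter()
    os.write(fd, BLOCK)
    os.fsync(fd)                      # force the write down to the disk
    latencies.append(time.perf_counter() - start)
os.close(fd)

avg_ms = sum(latencies) / len(latencies) * 1000
print(f"average 4 KB write latency: {avg_ms:.3f} ms over {COUNT} writes")
```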

FSR Engine Delay

The following elements are components implemented inside the fsr engine, and the delay incurred in each of them can be tracked with the performance monitor.

Path Filter

When I/O flows into fsr, the fsr engine runs path-filter logic to determine whether the path of the I/O belongs to a resource, that is, whether the I/O should be replicated. The delay incurred here grows with the number of target paths registered in fsr, so it is advisable to keep the number of replication target paths per resource as small as possible. For example, rather than specifying individual files as replication targets, specifying the top-level directory of the replication target is better for both management and performance.

Exclusion Filter

The more exclusion filters are used together with the path filter, the worse the performance. Keep the number of exclusion filters as small as possible.
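The sketch below illustrates the idea behind both filters: every incoming I/O path is compared against the registered replication paths and then against the exclusion filters, so the cost per I/O grows with the length of each list. The path names and matching rules are illustrative only, not fsr's actual implementation.

```python
# Illustrative sketch of path/exclusion filtering (not fsr's actual code).
# Every I/O path is checked against each registered replication path and
# each exclusion filter, so longer lists mean more work per I/O.
from typing import Iterable

def should_replicate(io_path: str,
                     replication_roots: Iterable[str],
                     exclusions: Iterable[str]) -> bool:
    # Path filter: is the I/O under any registered replication path?
    if not any(io_path.startswith(root) for root in replication_roots):
        return False
    # Exclusion filter: skip paths matching any exclusion entry.
    if any(excl in io_path for excl in exclusions):
        return False
    return True

# Hypothetical configuration: one top-level directory instead of many files.
roots = [r"D:\data\share"]
excludes = [".tmp", "~$"]
print(should_replicate(r"D:\data\share\docs\report.docx", roots, excludes))   # True
print(should_replicate(r"D:\data\share\cache\scratch.tmp", roots, excludes))  # False
```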

Local I/O

fsr performs the local write first, before buffering the replication data. At this point, the local write I/O should in theory show the same performance as the system latency described above. If it differs from the system latency, there is a problem somewhere, most likely a bottleneck inside the fsr engine.

Tx Buffering

After the local write I/O completes, the replication data is queued into a buffer and the original I/O is then completed. If buffering is fast and the original I/O completes immediately, performance is good; if buffering takes time, a bottleneck forms and the I/O is delayed.

Two kinds of buffer are provided: a memory buffer and a file buffer. The memory buffer has high performance and is not a problem, whereas the file buffer incurs a performance penalty. Considering performance, it is better to rely mostly on the memory buffer, but if the file buffer must also be operated because of the system's resource situation, this performance degradation should be taken into account.
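A simplified sketch of the idea behind the two buffer tiers is shown below; the class name, sizes, and spill policy are hypothetical and only illustrate why falling back to a file buffer is slower than staying in memory.

```python
# Hypothetical sketch of a two-tier transmit buffer: replication data is
# queued in memory while space remains, and spills to a file buffer (slower,
# because it adds file I/O) once the memory buffer is full.
# Names and sizes are illustrative, not fsr's internal structures.
import collections
import tempfile

class TxBuffer:
    def __init__(self, memory_limit_bytes: int):
        self.memory_limit = memory_limit_bytes
        self.memory_used = 0
        self.memory_queue = collections.deque()
        self.file_buffer = tempfile.TemporaryFile()  # spill area on disk

    def enqueue(self, data: bytes) -> str:
        """Queue replication data; returns which buffer tier absorbed it."""
        if self.memory_used + len(data) <= self.memory_limit:
            self.memory_queue.append(data)
            self.memory_used += len(data)
            return "memory"
        # Memory buffer full: fall back to the file buffer (extra file I/O).
        self.file_buffer.write(data)
        return "file"

buf = TxBuffer(memory_limit_bytes=8 * 1024)       # tiny limit for the demo
for i in range(4):
    tier = buf.enqueue(b"x" * 4096)               # 4 KB replication payloads
    print(f"chunk {i}: buffered in {tier}")
```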

Transmission Delay

Local I/O completes at the time of buffering, so the stage after buffering is independent of local I/O latency. The transmission delay, the time it takes for buffered data to leave through TX, is an indicator of buffering performance. If this performance is good (low transmission delay), replication proceeds smoothly. If buffering is bottlenecked, frequent buffer overflows force repeated resynchronization and fsr cannot maintain the replication state. Poor buffering performance is most likely caused by an internal fsr bottleneck.

All of the above applies only within the range that the network transmission bandwidth can handle. If the network bandwidth is much lower than the local I/O bandwidth and cannot keep up, increase the bandwidth or consider distributing the local I/O.
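The following back-of-the-envelope sketch shows why the network bandwidth bounds everything: if local writes fill the buffer faster than the network drains it, overflow (and hence resynchronization) is only a matter of time. All figures are hypothetical.

```python
# Back-of-the-envelope sketch: if the buffer fills faster than the network
# drains it, overflow (and hence resynchronization) is only a matter of time.
# All figures are hypothetical.

def seconds_until_overflow(local_write_mb_s: float,
                           network_mb_s: float,
                           buffer_mb: float) -> float:
    surplus = local_write_mb_s - network_mb_s   # data accumulating per second
    if surplus <= 0:
        return float("inf")                     # network keeps up; no overflow
    return buffer_mb / surplus

print(seconds_until_overflow(local_write_mb_s=200, network_mb_s=110, buffer_mb=1024))
# ~11.4 s: a 1 GiB buffer overflows quickly if the link only drains ~110 MB/s
```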


Throughput

Throughput is the amount of data that can be transferred per unit of time. To maintain real-time replication, the throughput of FSR must be at least as high as the local I/O load. If the replication throughput is lower than the local I/O load, buffer overflows may occur and the replication state may not be maintained, for example through repeated resynchronization. To ensure sufficient throughput, consider the following:

Hardware

The higher the hardware specification, the better the throughput. In general you need a well-performing CPU, memory, and disk I/O subsystem, and it is recommended to allocate enough memory to FSR for transmit buffering.

Network Bandwidth

Throughput is ultimately bounded by the network transmission bandwidth. It is therefore advantageous to secure as much transmission bandwidth as possible, but in practice the required network bandwidth is calculated in advance by measuring the write I/O load occurring locally and estimating the replication bandwidth from it. To estimate the replication bandwidth, monitor the write I/O of the local system. For example, if the local system averages 200 MB/s of write I/O (about 1.6 Gbps), a 1 Gbps link cannot carry it, and at least a 10 Gbps network must be established for replication.
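A simple conversion like the one below can be used for this estimate; the 200 MB/s figure matches the example above, and the headroom factor is a policy choice rather than an fsr requirement.

```python
# Estimate the network bandwidth needed for replication from the measured
# local write load. 200 MB/s matches the example above; the headroom factor
# is a policy choice, not an fsr requirement.

def required_gbps(write_mb_per_s: float, headroom: float = 1.0) -> float:
    return write_mb_per_s * 8 / 1000 * headroom  # MB/s -> Gbps

load = 200  # measured average local write I/O in MB/s
print(f"raw replication traffic : {required_gbps(load):.1f} Gbps")   # ~1.6 Gbps
print(f"with 30% headroom       : {required_gbps(load, 1.3):.1f} Gbps")
# A 1 Gbps link cannot carry this; the next common step is 10 Gbps.
```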

Compression

fsr can improve throughput by compressing replication data before sending it. Reducing the size of the data to shorten transfer time is the classic approach to improving throughput. However, software compression consumes CPU, so compressing locally places a certain amount of load on the system, and this should be taken into account. fsr uses the LZ4 compression algorithm, which is designed for very fast compression.
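As a rough illustration of this trade-off, the widely available python-lz4 bindings can be used to compare original and compressed sizes and the CPU time spent; this only demonstrates LZ4's behaviour in general and is not fsr's internal compression path.

```python
# Rough illustration of the LZ4 trade-off using the python-lz4 bindings
# (pip install lz4). This demonstrates LZ4 in general; it is not fsr's
# internal compression path.
import time
import lz4.frame

payload = b"replication payload with repetitive content " * 20000  # ~880 KB

start = time.perf_counter()
compressed = lz4.frame.compress(payload)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"original  : {len(payload)} bytes")
print(f"compressed: {len(compressed)} bytes "
      f"({len(compressed) / len(payload):.1%} of original)")
print(f"compress time: {elapsed_ms:.2f} ms of CPU spent to shrink the transfer")
```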