
Understanding IOPS, MB/s, and Why They Aren’t Enough


People often don’t understand why their performance monitors don’t help them predict or find performance problems. The full answer could fill a book, but a simple first step is understanding what IOPS is telling you, and why, in an FC SAN, you also need to look at frames per second.

I/Os per second, or IOPS, is commonly recognized as a standard measurement of performance, whether measuring a storage array’s back-end drives or the performance of the SAN. IOPS varies with a number of factors, including a system’s balance of read and write operations; whether the traffic is sequential, random, or mixed; the storage drivers; the OS background operations; and even the I/O block size.

Block size is usually determined by the application, with different applications using different block sizes for various circumstances. For example, Oracle will typically use block sizes of 2 KB or 4 KB for online transaction processing, and larger block sizes of 8 KB, 16 KB, or 32 KB for decision support system workloads. Exchange 2007 may use an 8 KB block size, SQL Server may use a minimum of 8 KB, and SAP may use 64 KB or even more.

In addition, when IOPS is considered as a measurement of performance, it’s standard practice to look at throughput (that is, MB/sec) as well, because the two stress the infrastructure in different ways. For example, an application pushing only 100 MB/sec of throughput but 20,000 IOPS may not cause bandwidth issues, but with so many small commands the storage array is put under significant pressure, as its front-end and back-end processors have an immense workload to deal with. Alternatively, if an application has a low number of IOPS but significant throughput, such as long sustained reads, then the pressure falls on the bandwidth of the SAN links. Even with this relationship understood, MB/s and IOPS are still insufficient measures of performance if you don’t take frames per second into consideration.
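To make the relationship concrete, here is a minimal sketch (in Python, not from the original article) of how IOPS and block size combine into throughput; the workload figures are hypothetical examples:

```python
# Minimal sketch of the IOPS / block size / throughput arithmetic.
# The workload numbers below are hypothetical examples.

def throughput_mb_per_sec(iops, block_size_kb):
    """Throughput in MB/s for a given IOPS rate and I/O block size."""
    return iops * block_size_kb / 1024.0

# Small-block, command-heavy workload: modest bandwidth, heavy processor load.
print(throughput_mb_per_sec(20_000, 4))   # ~78 MB/s from 20,000 IOPS at 4 KB

# Large-block, sequential workload: few commands, bandwidth is the constraint.
print(throughput_mb_per_sec(1_600, 64))   # 100 MB/s from only 1,600 IOPS at 64 KB
```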

Why is this? Let’s look at the FC frame. A standard FC frame has a data payload of approximately 2 KB. So if an application uses an 8 KB I/O block size, it takes four FC frames to carry that data; in this case, one I/O is four frames. Looking at IOPS alone doesn’t give a true picture of utilization, because I/O sizes vary widely between applications, ranging from 2 KB up to 256 KB.
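A quick sketch of that frame arithmetic (assuming, as above, a usable data payload of roughly 2 KB per frame; the I/O sizes are just examples):

```python
import math

FC_PAYLOAD_KB = 2  # approximate data payload of a standard FC frame

def frames_per_io(block_size_kb):
    """Number of FC frames needed to carry one I/O of the given block size."""
    return math.ceil(block_size_kb / FC_PAYLOAD_KB)

for size_kb in (2, 8, 64, 256):
    print(f"{size_kb} KB I/O -> {frames_per_io(size_kb)} frames")
# 8 KB -> 4 frames, 256 KB -> 128 frames: the same IOPS figure can mean very
# different frame rates on the link.
```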

Looking at a metric such as the ratio of MB/sec to frames/sec, as displayed in this VirtualWisdom dashboard widget, gives a better picture and understanding of the environment and its performance. On this graph of the MB/sec-to-frames/sec ratio, the line should never drop below 0.2 on the y-axis, which corresponds to the full 2 KB data payload.

If the ratio falls below this level, say to 0.1, as in the widget below, we know that data is not being passed efficiently, even though throughput, as measured in MB/sec, appears to be maintained.
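As a rough illustration of the kind of check such a widget performs (my own sketch, not VirtualWisdom’s implementation; the sample rates are invented, and the ratio is expressed here as KB carried per frame, where a full payload is roughly 2 KB and the suspect level described above sits around 1 KB):

```python
# Rough sketch of the payload-efficiency check described above.
# Sample rates are invented; thresholds follow the ~2 KB frame payload.

FULL_PAYLOAD_KB = 2.0  # data payload of a full FC frame, roughly 2 KB

def kb_per_frame(mb_per_sec, frames_per_sec):
    """Average data carried per frame, in KB (the quantity the ratio widget tracks)."""
    return (mb_per_sec * 1024.0) / frames_per_sec

def check_interval(mb_per_sec, frames_per_sec):
    payload = kb_per_frame(mb_per_sec, frames_per_sec)
    if payload < FULL_PAYLOAD_KB / 2:  # well under a full payload per frame
        return (f"{payload:.2f} KB/frame: plenty of frames, little data -- "
                "suspect management/error frames rather than application I/O")
    return f"{payload:.2f} KB/frame: payload efficiency looks healthy"

print(check_interval(mb_per_sec=100, frames_per_sec=51_200))   # ~2.0 KB/frame, healthy
print(check_interval(mb_per_sec=100, frames_per_sec=102_400))  # ~1.0 KB/frame, suspect
```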

This enables you to proactively identify when a large number of management frames, rather than data frames, are being passed, as devices busily report the physical errors that are occurring.

Without taking frames per second into consideration and having insight into this ratio to MB/s, it’s easy to believe that everything is OK and that data is being passed efficiently, since you see lots of traffic. In reality, all you might be seeing are management frames reporting a problem. By ignoring frames per second, you risk needlessly prolonging troubleshooting and increasing OPEX, simply by failing to identify the root cause of the performance degradation of your critical applications.

For a more complete explanation, and an example of how this applies to identifying slow-draining devices, check out this short video.

 
