
A True Story from Virtual Instruments’ Lab: You Need the Global View

Best Practices, Global View, infrastructure, performance, VirtualWisdom

In our lab here at Virtual Instruments, we run a good-sized VMware infrastructure, and of course, we use VirtualWisdom to monitor the performance of our lab systems.

Following our own best practices, when we first assembled our lab configuration, we recorded our performance baselines and set alerts accordingly. We checked all our Fibre Channel links, and they were free of physical errors. Overall, we were pretty satisfied, and for several months things ran just fine.

Then one day, we started getting alerts that our write exchange completion times were spiking in the 200-300ms range, from a baseline value of less than 20ms. Similarly, our read exchange completion times were jumping into the 100ms range, against a baseline of less than 10ms. We saw the peaks on the read and write exchange times trend higher as time went on, so we thought we were headed for an outage. We reviewed all our changes, logs, and any info we had. We couldn’t figure out which problems accounted for these slowdowns.
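As an illustration of this kind of baseline-and-alert check, here is a minimal Python sketch. The sample values echo the numbers above, but the data, the 3x threshold multiplier, and the function names are all invented for illustration; this is not VirtualWisdom’s alerting logic.

    import statistics

    # Hypothetical write exchange completion times in milliseconds.
    # In practice these would come from your monitoring platform.
    baseline_write_ms = [12, 15, 18, 14, 16, 19, 13]   # healthy period
    recent_write_ms = [210, 280, 190, 250, 300]        # during the slowdowns

    def alert_threshold(baseline, multiplier=3.0):
        """Flag anything well above the baseline mean (here, 3x)."""
        return statistics.mean(baseline) * multiplier

    threshold = alert_threshold(baseline_write_ms)
    for sample in recent_write_ms:
        if sample > threshold:
            print(f"ALERT: write completion time {sample} ms "
                  f"exceeds threshold {threshold:.0f} ms")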

While all this was happening, we received no complaints from the system users (system analysts who review customer databases for issues). We knew that if something were wrong, we would hear about it. In other words, we had a silent problem that was on its way to becoming a deadly one.

After we verified that our switches, cables and connections were fine, we approached our array vendor. They reviewed their logs on our storage ports and things looked fine. The “aha!” moment came when they started to review the overall array performance. Since VirtualWisdom records the time of each slowdown, it was very easy for the array vendor to look at what was happening. It turns out that our array has dual controllers — we use one controller and our engineering group uses the other. During the times of the slowdowns, the engineering group was running stress tests. The other controller was running at 80% of capacity and our controller was experiencing a large number of cache misses, which resulted in the slowdowns.

So, what can you learn from all of this? First, when things are initially assembled and running well, you must baseline your configuration. Unless you know what things look like when systems are running well, you have no idea where to look when they aren’t. If we had not had a baseline of our configuration, we never would have noticed that the read and write exchange completion times were spiking. Second, by establishing a baseline and leveraging the VirtualWisdom platform, we were able to find and clear the problem before there was ever an outage or a complaint. True, you don’t get credit for outage avoidance, but it is a lot less stressful. Our analysts do revenue-generating work, so if they go down, there is a lot of excitement. The last takeaway is that when something happens, it happens for a reason. While everything looked fine to us at the lab level, there were issues occurring one level above us that affected us. So, back to my comment about the global view: when you are having problems that don’t make sense, like the ones we were having in our lab config, start looking around and see whether you are overlooking the fact that you are part of a larger infrastructure.


Ensuring VDI Server Performance Without SSDs

Best Practices, SAN, SSD, VDI, VirtualWisdom

We get a lot of questions about VDI (virtual desktop infrastructure). By now, the benefits of VDI are pretty well understood. Despite those benefits and the potential OPEX and CAPEX savings, businesses are still averse to adopting it, due to a common problem called the “boot storm.”

Boot storms are large slowdowns that occur when a large number of end users log into their systems at the same time, typically in the morning when everyone starts work. The resulting burst of concentrated storage I/O leaves desktop users experiencing extreme slowness on their virtual desktops, to the point where they can become almost unusable. To solve this issue, many vendors suggest investing in expensive SSDs. So much for saving money … one of the big reasons for VDI in the first place.

We’ve found that insight into the SAN fabric and the end-to-end I/O profiles of your VDI deployment can help you ensure adequate desktop performance, even during peak times, by balancing out the load and eliminating any possible physical layer issues.
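As a rough sketch of what that insight looks like in practice, the snippet below flags boot-storm windows in a per-minute IOPS series; the baseline, multiplier, and sample numbers are all hypothetical.

    # Per-minute IOPS samples (invented numbers for illustration).
    iops_by_minute = {
        "08:55": 4_000, "09:00": 38_000, "09:05": 41_000,
        "09:10": 35_000, "09:15": 9_000, "09:20": 5_000,
    }

    BASELINE_IOPS = 6_000   # assumed steady-state load
    STORM_FACTOR = 4        # flag anything 4x over baseline

    storm_window = [minute for minute, iops in iops_by_minute.items()
                    if iops > BASELINE_IOPS * STORM_FACTOR]
    print("Boot-storm minutes:", storm_window)   # ['09:00', '09:05', '09:10']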

VDI servers have distinctive I/O profiles, but they share one need with all other application servers: to monitor and analyze performance, you need a single view of the entire infrastructure. In the example dashboard below, the administrator can see performance metrics as well as physical layer metrics, which together offer a way to watch for indications of performance problems in the VDI environment.

The custom VirtualWisdom dashboard below shows an end-to-end view of a VDI deployment that incorporates a view of the SAN network. On the left-hand side, we have a view of the throughput and demand of the physical servers, enabling us to immediately identify and correct any imbalances that may exist. In the center, we have metrics that highlight any potential physical layer issues or problems that may be occurring from HBA to switch port to storage port. This allows us to proactively eliminate any potential I/O slowdowns. On the right-hand side, we have a view of the storage infrastructure and how the demand from VDI is affecting the storage ports. This allows us to balance out the I/O load across the correct storage ports, identifying and eliminating any congestion or slowdowns.

This customized VDI infrastructure dashboard also enables us to monitor the centralized desktop backups, ensuring not only that they are successful and timely but also that they do not affect the rest of the company’s production environment.

Furthermore, with outsourcers and many companies having international staff, boot storms can occur at many different times of the day. Using VirtualWisdom’s unique playback facility, it’s easy to historically trend such throughput and I/O profiles to enable a safe, stable and cost-effective VDI investment and deployment.


Understanding IOPS, MB/s, and Why They Aren’t Enough

Best Practices, SAN, SAN performance, storage I/O bottleneck, VirtualWisdom

People often don’t understand why their performance monitors fail to either predict or find performance problems. The full answer could fill a book, but a simple first step is understanding what IOPS is telling you, and why, in an FC SAN, you also need to look at frames per second.

I/Os per second, or IOPS, is commonly recognized as a standard measurement of performance, whether for a storage array’s back-end drives or for the SAN itself. IOPS varies with a number of factors, including a system’s balance of read and write operations; whether the traffic is sequential, random, or mixed; the storage drivers; the OS background operations; and even the I/O block size.

Block size is usually determined by the application, with different applications using different block sizes in different circumstances. For example, Oracle will typically use block sizes of 2 KB or 4 KB for online transaction processing, and larger block sizes of 8 KB, 16 KB, or 32 KB for decision support system workloads. Exchange 2007 may use an 8 KB block size, SQL Server may use a minimum of 8 KB, and SAP may use 64 KB or even more.

In addition, when IOPS is used as a measurement of performance, it’s standard practice to consider throughput (that is, MB/sec) as well, because the two stress the infrastructure differently. For example, an application with only 100 MB/sec of throughput but 20,000 IOPS may not cause bandwidth issues, yet with so many small commands, the storage array is put under significant pressure, as its front-end and back-end processors have an immense workload to deal with. Alternatively, if an application has a low number of IOPS but significant throughput, such as long sustained reads, then the pressure falls on the bandwidth of the SAN links. Even with this relationship understood, MB/s and IOPS are still insufficient measures of performance if you don’t also take frames per second into consideration.
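That trade-off is easy to quantify: dividing throughput by IOPS gives the average I/O size, which tells you whether a workload is command-heavy or bandwidth-heavy. Here is a quick sketch, using the 100 MB/sec, 20,000 IOPS example above plus an invented large-block counterpart:

    def avg_io_size_kb(throughput_mb_s, iops):
        """Average I/O size (KB) implied by throughput and IOPS."""
        return throughput_mb_s * 1024 / iops

    print(avg_io_size_kb(100, 20_000))  # ~5.1 KB: many small commands,
                                        # heavy load on array processors
    print(avg_io_size_kb(400, 2_000))   # ~204.8 KB: long sustained reads,
                                        # pressure on SAN link bandwidth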

Why are frames per second so important? Let’s look at the FC frame. A standard FC frame has a data payload of approximately 2 KB. So if an application uses an 8 KB I/O block size, carrying that data requires four FC frames: one I/O is four frames. Looking at IOPS alone cannot give you a true picture of utilization, because I/O sizes differ widely between applications, ranging from 2 KB up to 256 KB.
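A quick sketch of that arithmetic, assuming the approximately 2 KB payload per frame noted above:

    import math

    FC_FRAME_PAYLOAD_KB = 2  # approximate data payload of a standard FC frame

    def frames_per_io(block_size_kb):
        """Full FC frames needed to carry one I/O of the given block size."""
        return math.ceil(block_size_kb / FC_FRAME_PAYLOAD_KB)

    for block_kb in (2, 4, 8, 16, 32, 64, 256):  # block sizes from the text
        print(f"{block_kb:>3} KB I/O -> {frames_per_io(block_kb)} frames")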

Looking at a metric such as the ratio of MB/sec to frames/sec, as displayed in this VirtualWisdom dashboard widget, gives us a better picture and understanding of the environment and its performance. In this graph of the MB/sec-to-frames/sec ratio, the line should never fall below 0.2 on the y-axis, the level corresponding to the full 2 KB data payload.

If the ratio falls below this, say to the 0.1 level, as in the widget below, we know that data is not being passed efficiently, even though throughput, as measured in MB/sec, is being maintained.

This lets you proactively identify when many of the frames being passed are not data but management frames, busily reporting on the physical device errors that are occurring.
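One way to read this ratio is as the average data payload carried per frame. In the sketch below (with invented frame rates), a full 2 KB payload per frame corresponds to the healthy 0.2 level on the widget’s axis, and roughly 1 KB per frame to the suspicious 0.1 level:

    def payload_per_frame_kb(throughput_mb_s, frames_per_s):
        """Average data payload carried per frame, in KB."""
        return throughput_mb_s * 1024 / frames_per_s

    # Healthy: every frame carries a full ~2 KB of data (the 0.2 level).
    print(payload_per_frame_kb(100, 51_200))    # 2.0 KB per frame

    # Suspicious: same throughput, but twice as many frames on the wire
    # (the 0.1 level) -- a hint that many are management frames, not data.
    print(payload_per_frame_kb(100, 102_400))   # 1.0 KB per frame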

Without taking frames per second into consideration, and without insight into its ratio to MB/s, it’s easy to believe that everything is OK and that data is being passed efficiently, since you see lots of traffic. In actuality, all you might be seeing are management frames reporting a problem. By ignoring frames per second, you run the risk of needlessly prolonging troubleshooting and increasing OPEX, simply by failing to identify the root cause of the performance degradation of your critical applications.

For a more complete explanation, and an example of how this applies to identifying slow-draining devices, check out this short video.

