Expanding the Reach of Real-time Monitoring

Best Practices, Real-Time Monitoring, SAN, SNW, VirtualWisdom

It’s been a busy and exciting few months at Virtual Instruments. At SNW Fall last month, we introduced the new high-density VirtualWisdom SAN Performance Probe. By doubling the density and supporting up to 16 Fibre Channel links per unit, the ProbeFC8-HD enables customers to monitor more of their infrastructure for less. In fact, customers can expect to reduce the cost of real-time monitoring by 25 percent and lower power consumption by 40 percent.

We also announced enhanced support for FCoE. With FCoE-specific enhancements to the current SAN Availability Probe module, we’re able to deliver improved monitoring of top-of-rack FCoE switches, extending visibility into infrastructure performance, health and utilization across converged network environments.

We had the chance to meet with a number of customers, press and analysts at SNW to share our news. Check out the news and learn more about our VirtualWisdom platform, courtesy of W. Curtis Preston, Truebit.TV.

SAN Troubleshooting Best Practices

Best Practices, bottlenecks, Real-Time Monitoring, SAN, troubleshooting

People often ask, “Are there any best practices that the troubleshooting experts recommend?” I asked a couple of our top services guys for their recommendations, and I’m sharing them with all of you today:

  1. Don’t stop looking just because you’ve removed the symptom, because if you do, you’re likely to see the same problems later. Sure, to alleviate the immediate problem, you may have to remove users or applications that are less critical, perhaps stop backups, and remove other potential bottlenecks. While this may fix the immediate problem, it often stops the underlying cause from being discovered.
  2. Use “real” real-time monitoring for alerts that get you in front of the issues before the application users feel the pain.
  3. Sometimes you have to broaden your approach beyond what the user is reporting. If you stop there, you will often miss larger issues that may affect other, slightly less latency-sensitive apps.
  4. As a first step for triage, try to isolate whether the cause is on the server or the SAN. Comparing your baseline Exchange Completion Time (ECT) with the ECT during the slowdown will tell you immediately where to start, and where to stop, looking. Your vendors will appreciate it, too.
  5. Try to find the finest granularity in your historical reporting to see which event preceded another, for cause and effect. A one-minute interval is often not sufficiently granular.
  6. Look at your historical I/O patterns, busy times of day, multipath configurations, queue depth settings, top talkers, etc. to gain a profile of behavior. Then compare with your healthy baseline, and rule out things that haven’t changed. You might find six things that appear to be going wrong, but if only one of those things seems to have occurred when the problem was reported, you can focus on that thing immediately. Later on, you can go back to look at the others.
  7. When changes are made to fix the incident, you should get immediate feedback. Without an immediate response, customers often take one of two approaches: 1) They delay or stagger fixes until they can determine the effect of each one; 2) Or they make all changes at the same time, and are then left wondering which change fixed the problem.
  8. Lastly, ask for help sooner rather than later. We’ve heard of problems dragging on for months, vendors getting kicked out of accounts, and literally millions of dollars wasted on adding expensive hardware. Waiting days or weeks to find the root cause of a problem is unacceptable. Bring in a performance pro.
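
To make practices 4 and 6 concrete, here is a minimal Python sketch of comparing an incident window against a healthy baseline and ruling out whatever hasn’t changed. It is an illustration only, not VirtualWisdom functionality: the metric names, the sample values, and the 25% tolerance are all assumptions made up for the example.

```python
def changed_metrics(baseline, current, tolerance=0.25):
    """Return metrics whose relative change from baseline exceeds tolerance.

    The 25% default tolerance is an arbitrary assumption for this sketch.
    """
    changed = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # no relative change can be computed for this metric
        delta = (current[name] - base) / base
        if abs(delta) > tolerance:
            changed[name] = round(delta, 2)
    return changed


# Hypothetical readings: a healthy baseline vs. the window when the
# slowdown was reported. All names and values are invented.
baseline = {"ect_ms": 5.0, "queue_depth": 32, "read_mb_s": 120.0, "crc_errors_per_min": 1.0}
incident = {"ect_ms": 14.0, "queue_depth": 32, "read_mb_s": 115.0, "crc_errors_per_min": 1.0}

suspects = changed_metrics(baseline, incident)
print(suspects)
```

In this made-up example only ECT moved during the incident, so that is the place to start. The metrics that didn’t change can be set aside and revisited later.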



Controlling Over-Provisioning of Your Storage Ports

Best Practices, latency, over-provisioning, SAN, storage arrays, VirtualWisdom

While it’s generally accepted that SAN storage utilization is low, only a few industry luminaries, such as John Toigo, have talked about the severe underutilization of Fibre Channel (FC) SAN fabrics. The challenge, of course, is that few IT shops have actually instrumented their SANs to enable accurate measurements of fabric utilization. Instead, 100% of enterprise applications get the bandwidth that perhaps only 5% of the applications need, wasting CAPEX.

In dealing with several dozen large organizations, we have found that nearly all FC storage networks are seriously over-provisioned, with average utilization rates well below 10%.  Here’s a VirtualWisdom dashboard widget (below) that shows the most heavily utilized storage ports on two storage arrays, taken from an F500 customer.  The figures refer to “% utilization.”

Beyond the obvious unnecessary expense, the reality is that with such low utilization rates, simply building in more SAN hardware to address performance and availability challenges does nothing more than add complexity and increase risk.  With VirtualWisdom, you can consolidate your ports, or avoid buying new ones, and track the net effect on your application latency to the millisecond.  The dashboard widgets below show the “before” and “after” latency figures that resulted from the configuration changes to this SAN, using VirtualWisdom.  They demonstrate a negligible effect.

Latency “before”

Latency “after”

Our most successful customers have tripled utilization and have been able to reduce future storage port purchases by 50% or more, saving $100 – $300K per new storage array.
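
For readers who want to check their own numbers, “% utilization” for a port is simply observed throughput divided by what the link can carry. The Python sketch below shows that arithmetic. The port names and readings are invented, and the link capacities are the nominal per-direction data rates for 4 Gb and 8 Gb Fibre Channel, so treat the results as rough estimates and not as what VirtualWisdom reports.

```python
# Nominal per-direction data rates for Fibre Channel link speeds, in MB/s.
NOMINAL_MB_S = {"4GFC": 400, "8GFC": 800}


def utilization_pct(observed_mb_s, link_speed):
    """Observed throughput as a percentage of the nominal link rate."""
    return 100.0 * observed_mb_s / NOMINAL_MB_S[link_speed]


# Hypothetical storage-port readings: (observed MB/s, link speed).
ports = {
    "array1-port0": (36.0, "8GFC"),
    "array1-port1": (12.0, "4GFC"),
    "array2-port0": (60.0, "8GFC"),
}

report = {name: round(utilization_pct(mb_s, speed), 1)
          for name, (mb_s, speed) in ports.items()}
print(report)
```

Every port in this invented sample lands below 10%, which is the kind of profile that suggests ports can be consolidated.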

For a more detailed discussion of SAN over-provisioning, click here, or check out this ten-minute video discussing this issue and over-tiering.

Eager Attendees Ready to Learn During Hands-On-Lab Sessions at Spring SNW 2012

Best Practices, Hands-On Lab, SAN, SNW, storage, VirtualWisdom

At the spring Storage Network World (SNW) show in Dallas, I had the pleasure of teaching the hands-on lab session for VirtualWisdom with Andrew Benrey, VI Solutions Consultant, and we had a fantastic response to our “Storage Implications for Server Virtualization” session. We co-presented with Avere and HP 3PAR, and during the two-hour session, we covered how to use VirtualWisdom to administer and optimize a Fibre Channel SAN, NAS optimization with the Avere appliance, and the use of thin provisioning and reclamation with the HP 3PAR arrays.

The lab exercises covered all areas of SAN administration. The first exercise looked at how we discover and report physical layer errors. We then looked at queue depth performance, imbalanced paths, and detection of slow-draining devices using buffer-to-buffer credits. In the last exercise, we reviewed a VMware infrastructure showing the virtual machines, Fibre Channel fabric and SCSI performance.

I found it interesting that for most of the lab sessions, many students picked the VirtualWisdom lab to start with. I believe that with the demand for proactive SAN management, more and more people are finding out about the benefits of VirtualWisdom, and they came to the hands-on lab to see for themselves. Our lab was sold out for most sessions, and our most popular session had a sign-up list of 52 for 20 seats. During the six sessions we conducted, we were able to meet and talk with almost 500 attendees in depth about the need for tools like VirtualWisdom and the advantages this platform offers for SAN teams working in a virtualized environment. Attendees liked the ability to quickly walk through the infrastructure from the ESXi server down to the storage array and spot the anomalies. The ability to go back in time was also important. Several customers were in the lab as part of their product evaluation.

Those of you who have seen VirtualWisdom understand how rich our user interface can be. For the lab exercises, I specifically divided up exercises so that the lab attendees had a much simpler and more easily understood interface in which to work. This turned out well as very few of the attendees needed additional help in working with the Dashboard interface.

Storage Network World Hands-On Lab Infrastructure

Ensuring VDI Server Performance Without SSDs

Best Practices, SAN, SSD, VDI, VirtualWisdom

We get a lot of questions about VDI (virtual desktop infrastructure, or interface).  By now, the benefits of VDI are pretty well understood. Despite the benefits and potential OPEX and CAPEX savings, businesses are still averse to its adoption, due to the common problem called the “boot storm.”

Boot storms are large slowdowns that occur when a large number of end users log into their systems at the same time, typically in the morning when everyone starts work. This causes intense, concentrated storage I/O, leading desktop users to experience extreme slowness on their virtual desktops, to the point where they can become almost unusable. To solve this issue, many vendors suggest the option of investing in expensive SSDs. So much for saving money … one of the big reasons for VDI in the first place.

We’ve found that insight into the SAN fabric and the end-to-end I/O profiles of your VDI deployment can help you ensure adequate desktop performance, even during peak times, by balancing out the load and eliminating any possible physical layer issues.

VDI servers have special I/O profiles, but they share one need with all other application servers: to monitor and analyze performance, you need a single view of the entire infrastructure. In this example dashboard, the administrator can see performance metrics as well as physical layer metrics, which together offer a way to watch for indications of performance problems in the VDI environment.

The custom VirtualWisdom dashboard below shows an end-to-end view of a VDI deployment that incorporates a view of the SAN network. On the left-hand side, we have a view of the throughput and demand of the physical servers, enabling us to immediately identify and correct any imbalances that may exist. In the center, we have metrics that highlight any potential physical layer issues or problems that may be occurring from HBA to switch port to storage port. This allows us to proactively eliminate any potential I/O slowdowns. On the right-hand side, we have a view of the storage infrastructure and how the demand from VDI is affecting the storage ports. This allows us to balance out the I/O load across the correct storage ports, identifying and eliminating any congestion or slowdowns.
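
As a rough illustration of what “balancing out the I/O load” means in numbers, the Python sketch below compares the busiest storage port with the average across all ports. It is not how the dashboard computes anything. The port names and MB/s values are invented, and the busiest-to-mean ratio is just one simple way to express imbalance.

```python
from statistics import mean


def imbalance(loads_mb_s):
    """Ratio of the busiest port's load to the mean load (1.0 = perfectly balanced)."""
    avg = mean(loads_mb_s.values())
    return max(loads_mb_s.values()) / avg if avg else 0.0


# Hypothetical per-port throughput during a boot storm, in MB/s.
storage_ports = {"sp-a0": 180.0, "sp-a1": 20.0, "sp-b0": 30.0, "sp-b1": 10.0}

ratio = imbalance(storage_ports)
busiest = max(storage_ports, key=storage_ports.get)
print(f"{busiest} carries {ratio:.1f}x the average port load")
```

A ratio well above 1.0, as in this invented sample, points to one port absorbing most of the VDI demand while the others sit nearly idle.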

This customized VDI infrastructure dashboard also enables us to monitor the centralized desktop backups, ensuring that these are not only successful and timely but also do not affect the rest of the company’s production environment.

Furthermore, with outsourcers and many companies having international staff, boot storms can occur at many different times of the day. Using VirtualWisdom’s unique playback facility, it’s easy to historically trend such throughput and I/O profiles to enable a safe, stable and cost-effective VDI investment and deployment.


Understanding IOPS, MB/s, and Why They Aren’t Enough

Best Practices, SAN, SAN performance storage i/o bottleneck, VirtualWisdom

People often don’t understand why their performance monitors don’t help to either predict or find performance problems.  Well, the answer to that could take a book, but a simple first step is understanding what IOPS is telling you, and why, in a FC SAN, you need to look at frames per second.

I/Os per second, or IOPS, is commonly recognized as a standard measurement of performance, whether to measure a storage array’s back-end drives or the performance of the SAN. IOPS vary with a number of factors, including a system’s balance of read and write operations; whether the traffic is sequential, random or mixed; the storage drivers; the OS background operations; and even the I/O block size.

Block size is usually determined by the application, with different applications using different block sizes for various circumstances. For example, Oracle will typically use block sizes of 2 KB or 4 KB for online transaction processing, and larger block sizes of 8 KB, 16 KB, or 32 KB, for decision support system workload environments. Exchange 2007 may use an 8 KB block size, SQL may use a minimum of 8 KB, and SAP may use 64 KB, or even more.

In addition, when IOPS is considered as a measurement of performance, it’s standard practice to also look at throughput, that is to say, MB/sec. This is because the two have different impacts on performance. For example, an application with only 100 MB/sec of throughput but 20,000 IOPS may not cause bandwidth issues, but with so many small commands, the storage array is put under significant pressure, as its front-end and back-end processors have an immense workload to deal with. Alternatively, if an application has a low number of IOPS but significant throughput, such as long sustained reads, then the pressure will occur on the bandwidth of the SAN links. Even when this relationship is understood, MB/s and IOPS are still insufficient measures of performance if you don’t take frames per second into consideration.
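
The relationship between the two numbers is simple arithmetic: throughput is IOPS multiplied by the average I/O size. The short Python sketch below works through the example above, plus a contrasting large-block case whose figures are invented for illustration. It assumes 1 MB = 1,024 KB.

```python
def avg_io_size_kb(throughput_mb_s, iops):
    """Average I/O size in KB implied by a throughput and an IOPS figure."""
    return throughput_mb_s * 1024 / iops


def throughput_mb_s(iops, io_size_kb):
    """Throughput in MB/s produced by a given IOPS rate and I/O size."""
    return iops * io_size_kb / 1024


# The example from the text: 100 MB/sec at 20,000 IOPS means small I/Os.
small_io = avg_io_size_kb(100, 20_000)

# A contrasting, invented case: few IOPS but large 256 KB sequential reads.
big_reads = throughput_mb_s(500, 256)

print(f"average I/O size: {small_io:.2f} KB")
print(f"large-block throughput: {big_reads:.0f} MB/s")
```

The first case averages roughly 5 KB per I/O, which loads the array’s processors with commands. The second moves more data with a fraction of the commands, which loads the links.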

Why is this? Let’s look at the FC frame. A standard FC frame has a data payload of approximately 2K (the maximum data field is 2,112 bytes). So if an application has an 8K I/O block size, this will require 4 FC frames to carry that data. In this instance, one I/O is 4 frames. To get a true picture of utilization, looking at IOPS alone is not sufficient, because I/O sizes differ widely between applications, ranging from 2K to as much as 256K.
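
Here is that frame arithmetic as a small Python sketch. It uses the round 2K payload from the text, and it counts only the data frames. The command and status frames that accompany each I/O are left out to keep the example simple.

```python
import math

# Round "2K" payload used in the text; a simplifying assumption.
FRAME_PAYLOAD_BYTES = 2048


def frames_per_io(block_size_bytes, payload=FRAME_PAYLOAD_BYTES):
    """Number of data frames needed to carry one I/O of the given size."""
    return math.ceil(block_size_bytes / payload)


def data_frames_per_sec(iops, block_size_bytes):
    """Data frames per second generated by a steady stream of same-size I/Os."""
    return iops * frames_per_io(block_size_bytes)


for size_kb in (2, 8, 256):
    print(f"{size_kb}K I/O -> {frames_per_io(size_kb * 1024)} data frames")
```

The same 1,000 IOPS can therefore mean 1,000 data frames per second for a 2K application or 128,000 for a 256K one, which is why IOPS alone says little about link utilization.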

Looking at a metric such as the ratio of MB/sec to frames/sec, as displayed in this VirtualWisdom dashboard widget, we get a better picture and understanding of the environment and its performance. On this graph of the MB/sec to frames/sec ratio, the line should never fall below the 0.2 mark on the y-axis, which corresponds to the full 2K data payload.

If the ratio falls below this, say at the 0.1 level, as in the widget below, we know that data is not being passed efficiently despite the throughput being maintained, as measured in MB/sec.

This enables you to proactively identify if there are a number of management frames being passed instead of data, as they are busily reporting on the physical device errors that are occurring.

Without taking frames per second into consideration and having an insight into this ratio to MB/s, it’s easy to believe that everything is OK and that data is being passed efficiently, since you see lots of traffic. However, in actuality, all you might be seeing are management frames reporting a problem. By ignoring frames per second, you run the risk of needlessly prolonging troubleshooting and increasing OPEX, simply by failing to identify the root cause of the performance degradation of your critical applications.
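
The idea behind the ratio can be sketched in a few lines of Python. The text doesn’t spell out the scale of the widget’s y-axis, so this sketch works directly in bytes per frame and expresses the result as a fraction of a full 2K payload. The 2,048-byte payload, the 1 MB = 1,048,576 bytes conversion, and the sample readings are assumptions for illustration.

```python
# Round "2K" payload used in the text; a simplifying assumption.
FULL_PAYLOAD_BYTES = 2048


def avg_payload_bytes(mb_per_sec, frames_per_sec):
    """Average number of data bytes carried per frame."""
    if frames_per_sec == 0:
        return 0.0
    return mb_per_sec * 1024 * 1024 / frames_per_sec


def payload_efficiency(mb_per_sec, frames_per_sec):
    """Average payload as a fraction of a full frame (1.0 = every frame full)."""
    return avg_payload_bytes(mb_per_sec, frames_per_sec) / FULL_PAYLOAD_BYTES


# Hypothetical readings: identical throughput, very different frame counts.
healthy = payload_efficiency(100, 51_200)
degraded = payload_efficiency(100, 102_400)

print(f"healthy: {healthy:.2f}, degraded: {degraded:.2f}")
```

Both invented readings show the same 100 MB/sec, but in the second one twice as many frames are needed to move it. That gap is the signal that something other than full data frames is occupying the link.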

For a more complete explanation, and an example of how this applies to identifying slow-draining devices, check out this short video.


Are You a ‘Server Hugger’? How to Virtualize More Apps

Best Practices, SAN, virtualization, VMworld

At VMworld in Las Vegas, leading analyst Bernd Harzog presented an intriguing case for how to increase the use of virtual servers. In his session entitled “Six Aggressive Performance Management Practices to Achieve 80%+ Virtualization,” Bernd described both the reasons why more applications aren’t virtualized today, and what to do about it.

Since there seems to be much industry confusion about “best practices” for increased virtualization, we wanted to highlight some of Bernd’s key takeaways.  First, he accurately identifies the fact that it seems all the benefits accrue to the team managing the infrastructure, NOT to the application owners.  For the app owners, dedicated hardware is a comfort blanket they are unwilling to give up, and he affectionately refers to these folks as “server huggers.”  To these huggers, virtualization is all risk and no reward!

So what’s the answer?  According to Bernd, companies implementing these solutions should deliver better application performance on their shared services virtual infrastructure than they are able to deliver on dedicated physical hardware.  He goes on to offer best practices for HOW.  Bernd describes a six-step process, but it’s step number two that Virtual Instruments can help with the most.

  1. Implement a Resource based Performance and Capacity Management Solution
  2. Put in Place an Understanding of end-to-end Infrastructure Latency
  3. Take Responsibility for Application Response Time!
  4. Rewrite your Service Level Agreements around Response Time, Variability, and Error Rates
  5. Base Your Approach to Capacity on Response Times and Transaction Rates
  6. Make Response Time and Transaction Rate Part of your Chargeback and Workload Allocation Process

As our customers know, Virtual Instruments can help with nearly all of these. But we’re best known for step two, understanding the end-to-end infrastructure performance – not just VMware performance or SAN performance, but literally end-to-end performance – and infrastructure response time (IRT) is the key metric we offer that really differentiates us. It’s perhaps the most valuable metric to the team supporting the virtual infrastructure. Bernd talks about it in some detail, and he goes on to offer advice on criteria that will help accomplish this second step to increasing virtual server success. His list:

  • Measure IRT – Monitor how long it takes the infrastructure to respond to requests for work, not how much resource it takes
  • Deterministic – Get the real data, not a synthetic transaction, or an average
  • Real Time – Get the data when it happens, not seconds or minutes later
  • Comprehensive – Get all of the data, not a periodic sample of the data
  • Zero-Configuration (Discovery) – Discover the environment and its topology, and keep this up to date in real time
  • Application (or VM) Aware – Understand where the load is coming from and where it is going
  • Application Agnostic – Work for every workload or VM type in the environment, irrespective of how the application is built or deployed

We couldn’t agree more! I can’t do justice to Bernd’s presentation, so to hear more, go to the Performance Management Topic at The Virtualization Practice, or listen to the webinar we did with Bernd.
