A True Story from Virtual Instruments’ Lab: You Need the Global View

Best Practices, Global View, infrastructure, performance, VirtualWisdom

In our lab here at Virtual Instruments, we run a good-sized VMware infrastructure, and of course, we use VirtualWisdom to monitor the performance of our lab systems.

Following our own best practices, when we first assembled our lab configuration, we recorded our performance and set alerts accordingly. We checked all our Fibre Channel links and they were free of physical errors. Overall, we were pretty satisfied, and for several months things ran just fine.

Then one day, we started getting alerts that our write exchange completion times were spiking into the 200-300 ms range, from a baseline value of less than 20 ms. Similarly, our read exchange completion times were jumping into the 100 ms range, against a baseline of less than 10 ms. We saw the peaks on the read and write exchange times trend higher as time went on, so we thought we were headed for an outage. We reviewed all our changes, logs, and any other information we had. We couldn’t figure out what accounted for these slowdowns.

While all this was happening, we received no complaints from the system users, who are system analysts reviewing customer databases for issues. We knew that if something was wrong for them, we would get complaints. We had a silent, and potentially deadly, problem on our hands.

After we verified that our switches, cables and connections were fine, we approached our array vendor. They reviewed their logs on our storage ports and things looked fine. The “aha!” moment came when they started to review the overall array performance. Since VirtualWisdom records the time of each slowdown, it was very easy for the array vendor to look at what was happening. It turns out that our array has dual controllers — we use one controller and our engineering group uses the other. During the times of the slowdowns, the engineering group was running stress tests. The other controller was running at 80% of capacity and our controller was experiencing a large number of cache misses, which resulted in the slowdowns.

So, what can you learn from all of this? First, when things are initially assembled or are running well, you must baseline your configuration. Unless you know what things look like when systems are running well, you have no idea where to look. If we had not had a baseline of our configuration, we never would have noticed that the read and write exchange completion times were spiking.

Second, by establishing a baseline and leveraging the VirtualWisdom platform, we were able to find and clear the problem before there was ever an outage or a complaint. True, nobody gets credit for outage avoidance, but it is a lot less stressful. Our analysts are doing revenue-generating work, so if they go down, there is a lot of excitement.

The last takeaway is that when something happens, it happens for a reason. While everything looked fine to us at the lab level, there were issues occurring one level above that affected us. That brings me back to the global view. When you are having problems that don’t make sense, as we were in our lab configuration, start looking around and ask whether you are overlooking the fact that you are part of a larger infrastructure.
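To make the baselining idea concrete, here is a minimal Python sketch. It is not how VirtualWisdom sets its alerts; the sample values, the function names, and the "mean plus three standard deviations" rule are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def build_baseline(samples_ms):
    """Summarize known-good completion times (ms) as a mean and standard deviation."""
    return {"mean": mean(samples_ms), "stdev": pstdev(samples_ms)}

def find_spikes(baseline, samples_ms, sigmas=3.0):
    """Return (index, value) pairs for samples above mean + sigmas * stdev."""
    limit = baseline["mean"] + sigmas * baseline["stdev"]
    return [(i, v) for i, v in enumerate(samples_ms) if v > limit]

# Hypothetical write exchange completion times in milliseconds,
# recorded while the lab was known to be healthy.
healthy = [12, 15, 14, 18, 16, 13, 17]
baseline = build_baseline(healthy)

# Hypothetical later samples containing two spikes.
today = [14, 16, 240, 15, 310, 17]
print(find_spikes(baseline, today))
```

The point of the sketch is only that a spike is defined relative to a recorded baseline; without the `healthy` samples there would be nothing to compare `today` against.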


Ensuring VDI Server Performance Without SSDs

Best Practices, SAN, SSD, VDI, VirtualWisdom

We get a lot of questions about VDI (virtual desktop infrastructure, or interface). By now, the benefits of VDI are pretty well understood. Despite these benefits and the potential OPEX and CAPEX savings, many businesses are still averse to adopting it because of the common problem known as the “boot storm.”

Boot storms are severe slowdowns that occur when a large number of end users log into their systems at the same time, typically in the morning when everyone starts work. The resulting intense, concentrated storage I/O makes virtual desktops extremely slow, to the point where they can become almost unusable. To solve this issue, many vendors suggest investing in expensive SSDs. So much for saving money, which is one of the big reasons for VDI in the first place.

We’ve found that insight into the SAN fabric and the end-to-end I/O profiles of your VDI deployment can help you ensure adequate desktop performance, even during peak times, by balancing out the load and eliminating any possible physical layer issues.

VDI servers have special I/O profiles, but like all other application servers, they need a single view of the entire infrastructure if you are to monitor and analyze their performance. In this example dashboard, the administrator can see performance metrics as well as physical layer metrics, which together offer a way to watch for indications of performance problems in the VDI environment.

The custom VirtualWisdom dashboard below shows an end-to-end view of a VDI deployment that incorporates a view of the SAN network. On the left-hand side, we have a view of the throughput and demand of the physical servers, enabling us to immediately identify and correct any imbalances that may exist. In the center, we have metrics that highlight any potential physical layer issues or problems that may be occurring from HBA to switch port to storage port. This allows us to proactively eliminate any potential I/O slowdowns. On the right-hand side, we have a view of the storage infrastructure and how the demand from VDI is affecting the storage ports. This allows us to balance out the I/O load across the correct storage ports, identifying and eliminating any congestion or slowdowns.

This customized VDI infrastructure dashboard also enables us to monitor the centralized desktop backups, ensuring that these are not only successful and timely but also do not affect the rest of the company’s production environment.

Furthermore, with outsourcers and many companies having international staff, boot storms can occur at many different times of the day. Using VirtualWisdom’s unique playback facility, it’s easy to historically trend such throughput and I/O profiles to enable a safe, stable and cost-effective VDI investment and deployment.


Understanding IOPS, MB/s, and Why They Aren’t Enough

Best Practices, SAN, SAN performance storage i/o bottleneck, VirtualWisdom

People often don’t understand why their performance monitors don’t help to either predict or find performance problems. Well, the answer to that could take a book, but a simple first step is understanding what IOPS is telling you, and why, in an FC SAN, you need to look at frames per second.

I/Os per second, or IOPS, is commonly recognized as a standard measurement of performance, whether to measure a storage array’s back-end drives or the performance of the SAN. IOPS vary with a number of factors, including a system’s balance of read and write operations; whether the traffic is sequential, random or mixed; the storage drivers; the OS background operations; or even the I/O block size.

Block size is usually determined by the application, with different applications using different block sizes for various circumstances. For example, Oracle will typically use block sizes of 2 KB or 4 KB for online transaction processing, and larger block sizes of 8 KB, 16 KB, or 32 KB, for decision support system workload environments. Exchange 2007 may use an 8 KB block size, SQL may use a minimum of 8 KB, and SAP may use 64 KB, or even more.

In addition, when IOPS is considered as a measurement of performance, it’s standard practice that the throughput — that is to say, MB/sec — is also used. This is due to the different impact they have on performance.  For example, an application with only 100MB/sec of throughput, but 20,000 IOPS may not cause bandwidth issues, but with so many small commands, the storage array is put under significant pressure, as its front-end and back-end processors have an immense workload to deal with. Alternatively, if an application has a low number of IOPS but significant throughput, such as long sustained reads, then the pressure will occur on the bandwidth of the SAN links. Despite understanding this relationship, MB/s and IOPS are still insufficient measures of performance when you don’t take into consideration the frames per second.
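The relationship between the two metrics is simple arithmetic: throughput divided by IOPS gives the average I/O size. The Python sketch below works through the first example from the paragraph above. The second workload (400 MB/sec at 1,600 IOPS) is a hypothetical figure added to show the opposite case, and 1 MB is treated as 1,024 KB.

```python
def avg_io_size_kb(throughput_mb_s, iops):
    """Average I/O size in KB, treating 1 MB as 1024 KB."""
    return throughput_mb_s * 1024 / iops

# From the text: modest throughput, many small commands.
small_io = avg_io_size_kb(100, 20_000)

# Hypothetical long sustained reads: few commands, heavy bandwidth.
large_io = avg_io_size_kb(400, 1_600)

print(f"small-command workload: {small_io:.2f} KB per I/O")
print(f"sustained-read workload: {large_io:.2f} KB per I/O")
```

The first workload averages roughly 5 KB per I/O, so the pressure falls on the array’s processors; the second averages 256 KB per I/O, so the pressure falls on link bandwidth.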

Why is this? Let’s look at the FC frame. A standard FC frame has a data payload of approximately 2 KB. So if an application has an 8 KB I/O block size, it will require 4 FC frames to carry that data. In this instance, one I/O is 4 frames. To get a true picture of utilization, looking at IOPS alone is not sufficient, because applications differ widely in their I/O size, ranging from 2 KB to as much as 256 KB.
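The frame arithmetic can be sketched in a few lines of Python. The 2,048-byte payload is the "approximately 2 KB" figure from the text taken as a round number, and the sketch counts data frames only, ignoring command and status frames; both are simplifying assumptions.

```python
import math

# Assumed round figure for the "approximately 2 KB" data payload per FC frame.
FRAME_PAYLOAD_BYTES = 2048

def frames_per_io(block_size_bytes, payload_bytes=FRAME_PAYLOAD_BYTES):
    """Number of data frames needed to carry one I/O of the given size."""
    return math.ceil(block_size_bytes / payload_bytes)

def data_frames_per_sec(iops, block_size_bytes):
    """Data frames per second generated by a steady stream of same-sized I/Os."""
    return iops * frames_per_io(block_size_bytes)

for kb in (2, 8, 64, 256):
    print(f"{kb:>3} KB I/O -> {frames_per_io(kb * 1024)} frames")
```

At the same IOPS figure, a 256 KB workload puts 128 times as many data frames on the wire as a 2 KB workload, which is why IOPS alone says little about link utilization.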

Looking at a metric such as the ratio of MB/sec to frames/sec, as displayed in this VirtualWisdom dashboard widget, we get a better picture and understanding of the environment and its performance. In this graph of the MB/sec to frames/sec ratio, the line should never fall below 0.2 on the y-axis, which corresponds to the 2 KB data payload.

If the ratio falls below this, say at the 0.1 level, as in the widget below, we know that data is not being passed efficiently despite the throughput being maintained, as measured in MB/sec.
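The same check can be sketched outside the widget. The text does not say how the widget scales its y-axis, so this Python sketch does not try to reproduce the 0.2 and 0.1 values; instead it normalizes the average bytes per frame against an assumed full payload of 2,048 bytes, so that 1.0 means full frames and 0.5 means frames are on average half empty. The traffic figures are hypothetical.

```python
# Assumed round figure for a full FC data payload.
FULL_PAYLOAD_BYTES = 2048

def payload_fill(mb_per_sec, frames_per_sec, full_payload=FULL_PAYLOAD_BYTES):
    """Fraction of a full frame payload carried by the average frame."""
    if frames_per_sec == 0:
        return 0.0
    bytes_per_frame = mb_per_sec * 1024 * 1024 / frames_per_sec
    return bytes_per_frame / full_payload

# Hypothetical link: same throughput, but twice as many frames in the second case.
healthy = payload_fill(100, 51_200)
suspect = payload_fill(100, 102_400)

print(f"healthy fill: {healthy:.2f}")
print(f"suspect fill: {suspect:.2f}")
```

In both cases the MB/sec figure is identical, which is exactly why throughput alone would not reveal the difference.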

This enables you to proactively identify when a number of management frames are being passed instead of data, busily reporting on the physical device errors that are occurring.

Without taking frames per second into consideration and having an insight into this ratio to MB/s, it’s easy to believe that everything is OK and that data is being passed efficiently, since you see lots of traffic. However, in actuality, all you might be seeing are management frames reporting a problem. By ignoring frames per second, you run the risk of needlessly prolonging troubleshooting and increasing OPEX costs, simply by failing to identify the root cause of the performance degradation of your critical applications.

For a more complete explanation, and an example of how this applies to identifying slow-draining devices, check out this short video.


Avoid MATCHES in Filters

Best Practices, VirtualWisdom

Where possible, try to avoid using “MATCHES” expressions in Filters that are evaluated often; one suggestion is to move them to UDCs, but it’s not necessarily a constant rule.

I’ve used a few terms in that one-line suggestion, so let me expand on it a bit.

VirtualWisdom lets you make filter expressions such as:

Attached Port Name MATCHES ^OracleServer_*

This powerful logic lets you leverage similar names and terms to select similar servers. Consider selecting similar storage targets or hosts by parts of names, or FCIDs that start or end in the same sequence, or switches using the word “Core” or “Edge” in their role. In fact, a simple filter applied to an alarm can apply a more urgent reaction to a port with errors on a core switch rather than an edge, representing different SLAs or criticality.

The example above says “look for where the Attached Port Name (the nickname of the device attached to a switch) starts with ‘OracleServer_’”.
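For readers who think in code, here is the same "starts with" test written with Python’s `re` module. This only illustrates the idea; VirtualWisdom’s own pattern syntax is not described here and may differ, and the port names are made up.

```python
import re

# "Starts with OracleServer_", written as an anchored regular expression.
PATTERN = re.compile(r"^OracleServer_")

# Hypothetical attached port nicknames.
names = [
    "OracleServer_01",
    "OracleServer_db2",
    "BillingServer_01",
    "MyOracleServer_9",
]

matched = [name for name in names if PATTERN.search(name)]
print(matched)
```

The last name contains the text but does not start with it, so the anchored pattern skips it.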

UDC — User-Defined Context — allows a VirtualWisdom Administrator to define an additional metric in terms of filter expressions: when various conditions match, a constant enumeration is used for that port’s value, or that ITL’s encoding. For example, for switches with certain names, a “DataCenter” column can identify where that switch is to help forward physical layer errors (such as CRCs) to the right team to more quickly address the issue. Different storage or servers involved in different business units can be enumerated, and based on that “BU” flag or value, different SLAs may be applied, or different teams alerted. UDCs are quite powerful, and are processed on every summary that gets stored in the database.

UDCs can use the same “MATCHES” terms that standard filters can use.

The problem with MATCHES is that it strips away some optimization. The Query Optimizer is the part of a database that cross-references the client’s query with the available indices, even aggregate indices, to reduce the processing load by orders of magnitude. Any Oracle Admin who has spent time with “SQL EXPLAIN” has seen the difference a simple re-ordering of expressions can make in a complex query, yielding a more efficient join or fewer rows evaluated to reach a result. These indices only match constant expressions with basic comparison operators such as “==”, “!=”, “<”, and “>”, and are completely inefficient for fuzzy or regular-expression matches.

A “MATCHES” expression in your filter or UDC can increase the load between a VirtualWisdom Portal Server and the underlying MySQL database engine. Although Virtual Instruments Engineering has worked to improve the database schema and queries, resulting in dramatic improvements in processing efficiency and maximum ITL and port count of a Portal Server, we the users still have the power to ruin this with a heavy expression or two.

If a filter isn’t run very often (such as a private dashboard, or a filter used mostly in a daily report), it may not pose very much load on the database; conversely, for a filter that runs constantly, the load of a MATCHES expression repeatedly hits the server for the same data points. What is needed is something like a cache of the filter result, so that the comparison is not rerun so often. That is where a UDC can be used.

For filter expressions that run often, consider moving the MATCHES to a UDC calculation, and convert the filter to a comparison against that precise value. For example, if your filter looks like:

Attached Port Name MATCHES BillingServer_* OR Attached Port Name MATCHES CustRecords_*

This can be converted to a UDC such as:

  • default value: “Other”
  • value “Billing” when “Attached Port Name MATCHES BillingServer_*”
  • value “Records” when “Attached Port Name MATCHES CustRecords_*”

This sort of UDC means that both MATCHES expressions will run on every Port or Exchange of every summary. If only Servers are identified by this pattern of nicknames, you could also avoid this sort of evaluation on non-Servers with the following:

  • default value: “Other”
  • value “Other” when “Attached Device Type != Server”
  • value “Billing” when “Attached Port Name MATCHES BillingServer_*”
  • value “Records” when “Attached Port Name MATCHES CustRecords_*”
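The idea behind this UDC can be sketched in Python: classify each port once, with the cheap constant test first, and let the frequently run filter compare against the stored value. This is an analogy, not VirtualWisdom’s implementation; the field names (`attached_port_name`, `device_type`, `bu`) and the port records are hypothetical, and the patterns follow the names used in the original filter.

```python
import re

BILLING = re.compile(r"^BillingServer_")
RECORDS = re.compile(r"^CustRecords_")

def classify(port):
    """Mimic the UDC: constant comparison first, pattern tests only for servers."""
    if port["device_type"] != "Server":
        return "Other"
    name = port["attached_port_name"]
    if BILLING.search(name):
        return "Billing"
    if RECORDS.search(name):
        return "Records"
    return "Other"

# Hypothetical port records.
ports = [
    {"attached_port_name": "BillingServer_01", "device_type": "Server"},
    {"attached_port_name": "CustRecords_07", "device_type": "Server"},
    {"attached_port_name": "WebServer_03", "device_type": "Server"},
    {"attached_port_name": "BillingServer_array", "device_type": "Storage"},
]

# The pattern matching happens once, when the value is stored.
for port in ports:
    port["bu"] = classify(port)

# The frequently run filter is now a plain equality test on the stored value.
selected = [p["attached_port_name"] for p in ports if p["bu"] in ("Billing", "Records")]
print(selected)
```

The storage port is rejected by the device-type comparison before either pattern is tried, which is the saving the second UDC definition is after.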

In general, if a MATCHES is rarely evaluated, then its load, however heavy, only affects the server at rare times, so in total it has a lower effect. A 100-fold heavier query run only weekly is not worth swapping for a UDC expression run every five minutes.
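A quick back-of-the-envelope calculation shows why. The Python sketch below uses arbitrary cost units: the 100-fold figure and the two schedules come from the paragraph above, while assigning the lighter evaluation a cost of 1 unit is an assumption made purely for comparison.

```python
MINUTES_PER_WEEK = 7 * 24 * 60

def weekly_cost(cost_per_run, interval_minutes):
    """Total cost per week for a job that runs once every interval_minutes."""
    runs_per_week = MINUTES_PER_WEEK // interval_minutes
    return cost_per_run * runs_per_week

# A heavy MATCHES filter, 100 units per run, run once a week.
heavy_weekly = weekly_cost(100, MINUTES_PER_WEEK)

# A light evaluation, 1 unit per run, run every five minutes.
light_frequent = weekly_cost(1, 5)

print(f"weekly heavy filter: {heavy_weekly} units per week")
print(f"five-minute light evaluation: {light_frequent} units per week")
```

Under these assumptions the light evaluation costs about twenty times more per week than the heavy filter, simply because it runs 2,016 times.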

Try to consider each case where MATCHES is used for conversion to a UDC expression, and whether even that evaluation can be avoided by a constant expression evaluated before the MATCHES expression. Your portal server will thank you!
