A True Story from Virtual Instruments’ Lab: You Need the Global View

Best Practices, Global View, infrastructure, performance, VirtualWisdom

In our lab here at Virtual Instruments, we run a good-sized VMware infrastructure, and of course we use VirtualWisdom to monitor the performance of our lab systems.

Following our own best practices, when we first assembled our lab configuration, we recorded our performance and set alerts accordingly. We checked all our Fibre Channel links and they were free of physical errors. Overall, we were pretty satisfied, and for several months things ran just fine.

Then one day, we started getting alerts that our write exchange completion times were spiking into the 200-300 ms range, from a baseline value of less than 20 ms. Similarly, our read exchange completion times were jumping into the 100 ms range, against a baseline of less than 10 ms. We saw the peaks on the read and write exchange times trend higher as time went on, so we thought we were headed for an outage. We reviewed all our changes, logs, and any other information we had, but we couldn’t figure out what accounted for these slowdowns.

While all this was happening, we received no complaints from the system users, the system analysts who review customer databases for issues. We knew that if something were wrong on their side, we would get complaints. We had a silent and potentially deadly problem on our hands.

After we verified that our switches, cables and connections were fine, we approached our array vendor. They reviewed their logs on our storage ports and things looked fine. The “aha!” moment came when they started to review the overall array performance. Since VirtualWisdom records the time of each slowdown, it was very easy for the array vendor to look at what was happening. It turns out that our array has dual controllers — we use one controller and our engineering group uses the other. During the times of the slowdowns, the engineering group was running stress tests. The other controller was running at 80% of capacity and our controller was experiencing a large number of cache misses, which resulted in the slowdowns.

So, what can you learn from all of this? First, when things are initially assembled or are running well, you must baseline your configuration. Unless you know what things are like when systems are running well, you have no idea where to look. If we had not had a baseline of our configuration, we never would have noticed that the read and write exchange completion times were spiking. Second, by establishing a baseline and leveraging the VirtualWisdom platform, we were able to find and clear the problem before there was ever an outage or complaint. Yes, we don’t get credit for outage avoidance, but it is a lot less stressful. Our analysts are doing revenue-generating work, so if they go down, there is a lot of excitement. The last takeaway is that when something happens, it happens for a reason. While everything looked fine to us at the lab level, there were issues occurring one level above that affected us. So back to my comment about the global view: when you are having problems that don’t make sense, like the ones we were having in our lab configuration, start looking around and see if you are overlooking the fact that you are part of a larger infrastructure.
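To make the baselining idea concrete, here is a minimal Python sketch of a baseline check. It is not VirtualWisdom's alerting logic: the sample values, the function names build_baseline and find_spikes, and the 3x multiplier are all assumptions picked for illustration.

```python
from statistics import median

def build_baseline(samples_ms):
    """Summarize 'known good' exchange completion times (ms) as a median."""
    return median(samples_ms)

def find_spikes(samples_ms, baseline_ms, factor=3.0):
    """Return (index, value) pairs that exceed factor x baseline."""
    limit = baseline_ms * factor
    return [(i, v) for i, v in enumerate(samples_ms) if v > limit]

# Hypothetical write exchange completion times recorded while the lab was healthy.
healthy_write_ms = [12, 15, 14, 18, 16, 13, 17]
baseline = build_baseline(healthy_write_ms)

# Hypothetical later samples, including spikes like the ones in the story.
current_write_ms = [14, 16, 240, 15, 290, 17]
spikes = find_spikes(current_write_ms, baseline)
print(baseline, spikes)
```

Without the recorded baseline there is nothing to compare the 240 ms and 290 ms samples against, which is the point of the first takeaway.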


Ensuring VDI Server Performance Without SSDs

Best Practices, SAN, SSD, VDI, VirtualWisdom

We get a lot of questions about VDI (virtual desktop infrastructure, or interface). By now, the benefits of VDI are pretty well understood. Despite those benefits and the potential OPEX and CAPEX savings, many businesses are still averse to adopting it because of a common problem called the “boot storm.”

Boot storms are large slowdowns that occur when a large number of end users log in to their systems at the same time, typically in the morning when everyone starts work. This causes intense, concentrated storage I/O, and desktop users experience extreme slowness on their virtual desktops, to the point where they can become almost unusable. To solve this issue, many vendors suggest investing in expensive SSDs. So much for saving money … one of the big reasons for VDI in the first place.

We’ve found that insight into the SAN fabric and the end-to-end I/O profiles of your VDI deployment can help you ensure adequate desktop performance, even during peak times, by balancing out the load and eliminating any possible physical layer issues.

VDI servers have special I/O profiles, but they share one need with all other application servers: to monitor and analyze performance, you need a single view of the entire infrastructure. In this example dashboard, the administrator can see performance metrics as well as physical layer metrics, which together offer a way to watch for indications of performance problems in the VDI environment.

The custom VirtualWisdom dashboard below shows an end-to-end view of a VDI deployment that incorporates a view of the SAN network. On the left-hand side, we have a view of the throughput and demand of the physical servers, enabling us to immediately identify and correct any imbalances that may exist. In the center, we have metrics that highlight any potential physical layer issues or problems that may be occurring from HBA to switch port to storage port. This allows us to proactively eliminate any potential I/O slowdowns. On the right-hand side, we have a view of the storage infrastructure and how the demand from VDI is affecting the storage ports. This allows us to balance out the I/O load across the correct storage ports, identifying and eliminating any congestion or slowdowns.
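The kind of imbalance check the storage-side view supports can be sketched in Python. This is only an illustration and not how the dashboard computes anything: the port names, the throughput figures, and the 1.5x tolerance are hypothetical.

```python
def port_shares(throughput_mb_s):
    """Return each port's fraction of the total throughput."""
    total = sum(throughput_mb_s.values())
    if total == 0:
        return {port: 0.0 for port in throughput_mb_s}
    return {port: mb / total for port, mb in throughput_mb_s.items()}

def overloaded_ports(throughput_mb_s, tolerance=1.5):
    """Ports carrying more than tolerance x an even share of the load."""
    even_share = 1.0 / len(throughput_mb_s)
    shares = port_shares(throughput_mb_s)
    return sorted(p for p, s in shares.items() if s > even_share * tolerance)

# Hypothetical storage port names and MB/sec observed during a boot storm.
storm = {"SP-A0": 620.0, "SP-A1": 110.0, "SP-B0": 140.0, "SP-B1": 130.0}
hot = overloaded_ports(storm)
print(hot)
```

A port flagged this way is a candidate for moving some of the VDI load to its quieter neighbors.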

This customized VDI infrastructure dashboard also enables us to monitor the centralized desktop backups, ensuring that these are not only successful and timely but also do not affect the rest of the company’s production environment.

Furthermore, with outsourcers and many companies having international staff, boot storms can occur at many different times of the day. Using VirtualWisdom’s unique playback facility, it’s easy to historically trend such throughput and I/O profiles to enable a safe, stable and cost-effective VDI investment and deployment.


Understanding IOPS, MB/s, and Why They Aren’t Enough

Best Practices, SAN, SAN performance storage i/o bottleneck, VirtualWisdom

People often don’t understand why their performance monitors don’t help to either predict or find performance problems. Well, the answer to that could take a book, but a simple first step is understanding what IOPS is telling you, and why, in an FC SAN, you need to look at frames per second.

I/Os per second, or IOPS, is commonly recognized as a standard measurement of performance, whether to measure a storage array’s back-end drives or the performance of the SAN. IOPS varies with a number of factors, including a system’s balance of read and write operations; whether the traffic is sequential, random or mixed; the storage drivers; the OS background operations; and even the I/O block size.

Block size is usually determined by the application, with different applications using different block sizes for various circumstances. For example, Oracle will typically use block sizes of 2 KB or 4 KB for online transaction processing, and larger block sizes of 8 KB, 16 KB, or 32 KB, for decision support system workload environments. Exchange 2007 may use an 8 KB block size, SQL may use a minimum of 8 KB, and SAP may use 64 KB, or even more.

In addition, when IOPS is considered as a measurement of performance, it’s standard practice to look at throughput (that is, MB/sec) as well, because the two affect performance differently. For example, an application with only 100 MB/sec of throughput but 20,000 IOPS may not cause bandwidth issues, but with so many small commands, the storage array is put under significant pressure, as its front-end and back-end processors have an immense workload to deal with. Alternatively, if an application has a low number of IOPS but significant throughput, such as long sustained reads, then the pressure will fall on the bandwidth of the SAN links. Even once you understand this relationship, MB/s and IOPS are still insufficient measures of performance if you don’t take frames per second into consideration.

Why is this? Let’s look at the FC frame. A standard FC frame has a data payload of approximately 2K. So if an application has an 8K I/O block size, it will require 4 FC frames to carry that data. In this instance, one I/O is 4 frames. To get a true picture of utilization, looking at IOPS alone is not sufficient, because applications differ widely in their I/O size, ranging from 2K to as much as 256K.
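The arithmetic above is easy to check with a short Python sketch. The 2,048-byte payload is the “approximately 2K” figure from the text, treated here as an assumption rather than an exact framing constant, and the function names are ours.

```python
import math

FC_PAYLOAD_BYTES = 2048  # the "approximately 2K" payload from the text; an approximation

def avg_io_size_kb(throughput_mb_s, iops):
    """Average I/O size in KB implied by a throughput and an IOPS figure."""
    return throughput_mb_s * 1024 / iops

def frames_per_io(io_size_bytes, payload_bytes=FC_PAYLOAD_BYTES):
    """Data frames needed to carry one I/O of the given size."""
    return math.ceil(io_size_bytes / payload_bytes)

print(avg_io_size_kb(100, 20_000))   # the 100 MB/sec, 20,000 IOPS example
print(frames_per_io(8 * 1024))       # the 8K block from the text
print(frames_per_io(256 * 1024))     # a 256K block
```

The same IOPS figure therefore means very different frame rates depending on block size: under these assumptions an 8K I/O needs 4 data frames and a 256K I/O needs 128.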

Looking at a metric such as the ratio of MB/sec to frames/sec, as displayed in this VirtualWisdom dashboard widget, we get a better picture and understanding of the environment and its performance. On this graph, the line should never fall below 0.2 on the y-axis, which corresponds to the 2K data payload.

If the ratio falls below this, say at the 0.1 level, as in the widget below, we know that data is not being passed efficiently despite the throughput being maintained, as measured in MB/sec.

This lets you proactively identify when a large number of management frames are being passed instead of data, because devices are busy reporting the physical errors that are occurring.

Without taking frames per second into consideration and having an insight into this ratio to MB/s, it’s easy to believe that everything is OK and that data is being passed efficiently, since you see lots of traffic. However, in actuality, all you might be seeing are management frames reporting a problem. By ignoring frames per second, you run the risk of needlessly prolonging troubleshooting and increasing OPEX, simply by failing to identify the root cause of the performance degradation of your critical applications.
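One way to express the same idea outside the widget is as average payload bytes per frame. The Python sketch below does not reproduce the widget's y-axis scale; the 2,048-byte full payload, the 50% fill threshold, and the sample numbers are assumptions for illustration.

```python
FULL_PAYLOAD_BYTES = 2048  # approximate full FC data payload, per the text

def avg_payload_bytes(mb_per_sec, frames_per_sec):
    """Average bytes carried per frame over a sampling interval."""
    if frames_per_sec == 0:
        return 0.0
    return mb_per_sec * 1024 * 1024 / frames_per_sec

def is_inefficient(mb_per_sec, frames_per_sec, min_fill=0.5):
    """True when frames carry less than min_fill of a full payload on average."""
    return avg_payload_bytes(mb_per_sec, frames_per_sec) < FULL_PAYLOAD_BYTES * min_fill

# Hypothetical samples: same throughput, very different frame counts.
healthy = (100, 51_200)    # 100 MB/sec carried in 51,200 frames/sec
suspect = (100, 204_800)   # 100 MB/sec carried in 204,800 frames/sec
print(avg_payload_bytes(*healthy), is_inefficient(*healthy))
print(avg_payload_bytes(*suspect), is_inefficient(*suspect))
```

Both samples look identical on a MB/sec graph; only the frame count shows that the second link is moving four times as many frames to carry the same data.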

For a more complete explanation, and an example of how this applies to identifying slow-draining devices, check out this short video.


Naming Consistency: Make a Script

Best Practices

If you want a consistent, easy-to-use repository, use a script to build directories and copy in your content, which gives consistency as a side-effect of making things easier to move around.

Anyone who has tried to download software from a repository can tell when the ownership changes hands: the directory structure changes in subtle ways. There’s a dot in the path where there wasn’t one before, or the capitalization changes. This isn’t a problem until you try to use the repository in an automated fashion, with scripts and tools. Suddenly, a change from “V” to “v” requires an entirely new case, as if it’s a whole new repository on a different server.

NOTE: if the files are moved around manually, and the owner of the hands doing the moving is a bit flaky or random, then this sort of speed-wobble might as well count as a change of ownership, only more frequent (every release).

People will have a problem with this, but they’ll never tell you just as they’ll never tell you that your shoes don’t match your belt… but unlike fashion faux-pas, inconsistency with directories actually impacts others.

Don’t be a flake.  Be consistent.  A script helps you do that.

Additionally, if the script is the final part of the build process, it reduces the number of manual steps in a build. I would recommend running it either right before or right after your self-tests.
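A publishing script along these lines can be very small. The Python sketch below is one possible shape, not a prescribed layout: the root/product/vVERSION convention, the function names, and the demo file are all hypothetical choices made for the example.

```python
import shutil
import tempfile
from pathlib import Path

def release_dir(root, product, version):
    """Build the one canonical path for a release: <root>/<product>/v<version>."""
    return Path(root) / product.lower() / f"v{version.lstrip('vV')}"

def publish(root, product, version, artifacts):
    """Create the release directory and copy each artifact into it."""
    target = release_dir(root, product, version)
    target.mkdir(parents=True, exist_ok=True)
    for artifact in artifacts:
        shutil.copy2(artifact, target / Path(artifact).name)
    return target

# Demo in a throwaway directory so the sketch can be run safely.
workdir = Path(tempfile.mkdtemp())
build_output = workdir / "tool.tar.gz"
build_output.write_text("placeholder build output")
published = publish(workdir / "repo", "MyTool", "V1.2.0", [build_output])
print(published)
```

Because the path is computed in exactly one place, whoever runs the release gets the same capitalization and the same “v” prefix every time, whether they typed “V1.2.0” or “1.2.0”.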

Avoid MATCHES in Filters

Best Practices, VirtualWisdom

Where possible, try to avoid using “MATCHES” expressions in Filters that are evaluated often; one suggestion is to move them to UDCs, but it’s not necessarily a constant rule.

I’ve used a few terms in that one-line suggestion, so let me expand on it a bit.

VirtualWisdom lets you make filter expressions such as:

Attached Port Name MATCHES ^OracleServer_*

This powerful logic lets you leverage similar names and terms to select similar servers. Consider selecting similar storage targets or hosts by parts of names, or FCIDs that start or end in the same sequence, or switches that use the word “Core” or “Edge” in their role. In fact, a simple filter applied to an alarm can trigger a more urgent reaction to a port with errors on a core switch than on an edge switch, reflecting different SLAs or criticality.

The example above says “look for where the Attached Port Name (the nickname of the device attached to a switch) starts with ‘OracleServer_’”.

UDC — User-Defined Context — allows a VirtualWisdom Administrator to define an additional metric in terms of filter expressions: when various conditions match, a constant enumeration is used for that port’s value, or that ITL’s encoding. For example, for switches with certain names, a “DataCenter” column can identify where that switch is to help forward physical layer errors (such as CRCs) to the right team to more quickly address the issue. Different storage or servers involved in different business units can be enumerated, and based on that “BU” flag or value, different SLAs may be applied, or different teams alerted. UDCs are quite powerful, and are processed on every summary that gets stored in the database.

UDCs can use the same “MATCHES” terms that standard filters can use.

The problem with MATCHES is that it strips away some optimization: the Query Optimizer is a part of a database that cross-references the client’s query with existing possible indices, even aggregate indices, to reduce the processing load by orders of magnitude. Any Oracle Admin who has spent time with the “SQL EXPLAIN” has seen the difference a simple re-ordering of expressions can make in a complex query to get a more efficient join, or fewer rows evaluated for processing to reach a result. These indices only match constant expressions with basic comparison operators such as “==”, “!=”, “<”, “>”, and are completely inefficient for fuzzy or regular-expression matches.
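You can see this optimizer behavior for yourself with any SQL database. The Python sketch below uses SQLite from the standard library purely as a stand-in; it says nothing about VirtualWisdom's actual schema or queries, and the table, index, and LIKE pattern are assumptions chosen to contrast an exact comparison with a fuzzy one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ports (id INTEGER PRIMARY KEY, name TEXT, errors INTEGER)")
conn.execute("CREATE INDEX idx_ports_name ON ports (name)")
conn.executemany(
    "INSERT INTO ports (name, errors) VALUES (?, ?)",
    [(f"OracleServer_{i}", i % 3) for i in range(100)],
)

def plan(sql):
    """Return the optimizer's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# Exact comparison against a constant: the optimizer can use the index.
exact_plan = plan("SELECT * FROM ports WHERE name = 'OracleServer_7'")
# Fuzzy match with a leading wildcard: the index is no help.
fuzzy_plan = plan("SELECT * FROM ports WHERE name LIKE '%Server_7%'")
print(exact_plan)
print(fuzzy_plan)
```

The first plan searches via the index on name, while the second has to scan the whole table and test every row, which is the extra load a pattern match brings.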

A “MATCHES” expression in your filter or UDC can increase the load between a VirtualWisdom Portal Server and the underlying MySQL database engine. Although Virtual Instruments Engineering has worked to improve the database schema and queries, resulting in dramatic improvements in processing efficiency and maximum ITL and port count of a Portal Server, we the users still have the power to ruin this with a heavy expression or two.

If a filter isn’t run very often (such as one on a private dashboard, or one used mostly in a daily report), it may not place much load on the database. Conversely, for a filter that runs constantly, a MATCHES expression repeatedly loads the server to re-evaluate the same data points. What you want is something like a cache of the filter’s result, so the comparison isn’t rerun so often. That is where a UDC can be used.

For filter expressions that run often, consider moving the MATCHES to a UDC calculation, and convert the filter to a comparison against that precise value. For example, if your filter looks like:

Attached Port Name MATCHES BillingServer_* OR Attached Port Name MATCHES CustRecords_*

This can be converted to a UDC such as:

  • default value: “Other”
  • value “Billing” when “Attached Port Name MATCHES BillingServer_*”
  • value “Records” when “Attached Port Name MATCHES CustRecords_*”

This sort of UDC means that both MATCHES expressions will be evaluated on every Port or Exchange of every summary. If only Servers are identified by this pattern of nicknames, you can avoid that evaluation on non-Servers with the following:

  • default value: “Other”
  • value “Other” when “Attached Device Type != Server”
  • value “Billing” when “Attached Port Name MATCHES BillingServer_*”
  • value “Records” when “Attached Port Name MATCHES CustRecords_*”
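The ordering of those rules matters, and a rough Python model shows why. This is not UDC syntax and not how the Portal Server evaluates contexts: the function name, the “first matching rule wins” behavior, the Python regular expressions, and the sample port names are assumptions for illustration, with the patterns following the filter example above.

```python
import re

# Ordered rules: the first rule whose test passes supplies the value.
# The cheap constant comparison comes first so non-servers never reach a regex.
BILLING = re.compile(r"^BillingServer_")
RECORDS = re.compile(r"^CustRecords_")

def business_unit(device_type, port_name):
    """Hypothetical 'BU' context value for one port."""
    if device_type != "Server":
        return "Other"
    if BILLING.search(port_name):
        return "Billing"
    if RECORDS.search(port_name):
        return "Records"
    return "Other"  # default value

ports = [
    ("Server", "BillingServer_01"),
    ("Server", "CustRecords_07"),
    ("Storage", "Array_A_Port3"),
    ("Server", "WebFarm_12"),
]
tagged = [business_unit(kind, name) for kind, name in ports]
print(tagged)

# A frequently run filter can now compare against the stored constant
# instead of re-running the pattern match each time.
billing_or_records = [p for p, bu in zip(ports, tagged) if bu in ("Billing", "Records")]
```

The pattern match happens once, when the value is computed, and every later filter evaluation is a plain equality test of the kind an index can serve.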

In general, if a MATCHES is rarely evaluated, then its load, however heavy, only affects the server at rare times, so in total it has a lower effect. A 100-fold heavier query run only weekly is not worth swapping for a UDC expression run every five minutes.
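A back-of-the-envelope calculation in Python makes the trade-off visible. The cost units are hypothetical (1 for a UDC evaluation, 100 for the heavy query); only the intervals come from the text.

```python
MINUTES_PER_WEEK = 7 * 24 * 60

def weekly_cost(cost_per_run, interval_minutes):
    """Total cost units per week for something evaluated on a fixed interval."""
    runs = MINUTES_PER_WEEK // interval_minutes
    return runs * cost_per_run

# Hypothetical cost units: the UDC evaluation costs 1, the MATCHES filter 100.
udc_every_five_minutes = weekly_cost(cost_per_run=1, interval_minutes=5)
matches_once_a_week = weekly_cost(cost_per_run=100, interval_minutes=MINUTES_PER_WEEK)
print(udc_every_five_minutes, matches_once_a_week)
```

Under these assumed costs, the every-five-minutes UDC adds up to far more work per week than the single heavy query it would replace.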

Try to consider each case where MATCHES is used for conversion to a UDC expression, and whether even that evaluation can be avoided by a constant expression evaluated before the MATCHES expression. Your portal server will thank you!

Three Steps to De-Risking Migration to the Private Cloud

Best Practices, Private Cloud, virtualization

One of our customers recently completed a major datacenter consolidation, which included a move to a private cloud infrastructure for some of their applications.  I asked them how the private cloud initiative went and what they think they’ll get out of it.  During the discussion, they mentioned that they used VirtualWisdom to help with the migration, including the deployment of a major app on vSphere.  I thought I’d share the 3 discrete migration-best-practice steps they took, using VirtualWisdom, to ensure that the project went well.

1.     Find and Eliminate Connectivity Errors

This meant cleaning up multi-pathing errors, both single paths and unbalanced paths. To no one’s surprise, they found quite a few areas that needed clean-up. At the same time, they monitored for physical layer issues; by looking at buffer-to-buffer credits, they uncovered one serious bottleneck and remediated it. Their private cloud migration uses virtualization at both the server and storage level, with a greater utilization of all components, so finding and fixing physical layer issues before the move was deemed essential.

2.     Ensure Optimal Performance

Because part of the project was a consolidation effort, the customer needed to review the configuration of their storage network.  Finding problems and opportunities for reducing the physical number of links without impacting performance was key.  They reviewed Queue Depth settings and found hidden performance improvements that gave them extra bandwidth head-room on the most-used links.

The customer used Exchange Completion Time, the measure of an I/O from the initiator to the LUN and back, as the key metric for performance testing. They benchmarked application latency before and after the queue depth settings were changed, and were able to prove the positive impact. Then, as they brought applications over, they were able to instantly determine, to the millisecond, the impact of the migration on application latency. This prevented potential user satisfaction issues, and they were able to prove that neither the consolidation project nor, separately, the private cloud migration hurt application response times.
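A before-and-after comparison of this kind can be sketched in Python. The latency samples below are invented for illustration and are not the customer's data; the choice of mean and 95th percentile as the summary is also an assumption.

```python
from statistics import mean, quantiles

def summarize(ect_ms):
    """Mean and 95th-percentile exchange completion time, in ms."""
    p95 = quantiles(ect_ms, n=20, method="inclusive")[-1]
    return {"mean": mean(ect_ms), "p95": p95}

def improved(before, after):
    """True when both the mean and the 95th percentile went down."""
    return after["mean"] < before["mean"] and after["p95"] < before["p95"]

# Hypothetical exchange completion times (ms) around a queue depth change.
before_change = summarize([8, 9, 10, 11, 12, 14, 15, 18, 22, 30])
after_change = summarize([6, 7, 7, 8, 8, 9, 10, 11, 12, 14])
print(before_change, after_change, improved(before_change, after_change))
```

Looking at the tail as well as the mean matters here, since a change can lower the average while leaving the slowest I/Os just as slow.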

3.     Optimize Utilization, Reduce Congestion

Good network capacity planning can help maintain networks in optimal working order. It can reduce the risk of outages due to resource limitations, and justify future networking needs. It’s important to look for patterns that occur at various times of day. There is often the equivalent of a “rush hour,” when SAN traffic slows because of significantly increased demand. Using the VirtualWisdom “what if” reporting, this customer uncovered a backup job that was going to create a bottleneck if the consolidation took place exactly as planned. So they found a less busy time of the day to run it, avoiding a potential problem.

They also found a number of no-longer-used reports that took server cycles and network bandwidth. One of the reports drove utilization on one link to approximately 70% for just 2 minutes, and that alone was enough to increase transaction times well past an acceptable range. By correlating metrics on the physical and virtual servers with link utilization, they were able to locate these rogue jobs and re-balance workloads.
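Finding the “rush hour” and a quieter slot for a job like that backup comes down to bucketing utilization by time of day. The Python sketch below is not the “what if” report; the sample data and function names are hypothetical.

```python
from collections import defaultdict

def mean_utilization_by_hour(samples):
    """Average link utilization (%) for each hour of day from (hour, pct) samples."""
    buckets = defaultdict(list)
    for hour, pct in samples:
        buckets[hour].append(pct)
    return {hour: sum(v) / len(v) for hour, v in buckets.items()}

def quietest_hour(samples):
    """Hour of day with the lowest average utilization."""
    by_hour = mean_utilization_by_hour(samples)
    return min(by_hour, key=by_hour.get)

def busiest_hour(samples):
    """Hour of day with the highest average utilization."""
    by_hour = mean_utilization_by_hour(samples)
    return max(by_hour, key=by_hour.get)

# Hypothetical (hour of day, link utilization %) samples collected over several days.
samples = [(9, 72), (9, 68), (13, 55), (13, 45), (2, 12), (2, 8), (22, 30), (22, 34)]
print(busiest_hour(samples), quietest_hour(samples))
```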

Though the private cloud can help speed deployments and reduce costs, there’s little advantage to the end-user if it increases the risk to application performance and availability.  Through de-risking these areas, this customer was able to deliver the benefits of the new compute model and mitigate the risks.


Are You a ‘Server Hugger’? How to Virtualize More Apps

Best Practices, SAN, virtualization, VMworld

At VMworld in Las Vegas, leading analyst Bernd Harzog presented an intriguing case for how to increase the use of virtual servers. In his session entitled “Six Aggressive Performance Management Practices to Achieve 80%+ Virtualization,” Bernd described both the reasons why more applications aren’t virtualized today, and what to do about it.

Since there seems to be much industry confusion about “best practices” for increased virtualization, we wanted to highlight some of Bernd’s key takeaways.  First, he accurately identifies the fact that it seems all the benefits accrue to the team managing the infrastructure, NOT to the application owners.  For the app owners, dedicated hardware is a comfort blanket they are unwilling to give up, and he affectionately refers to these folks as “server huggers.”  To these huggers, virtualization is all risk and no reward!

So what’s the answer?  According to Bernd, companies implementing these solutions should deliver better application performance on their shared services virtual infrastructure than they are able to deliver on dedicated physical hardware.  He goes on to offer best practices for HOW.  Bernd describes a six-step process, but it’s step number two that Virtual Instruments can help with the most.

  1. Implement a Resource based Performance and Capacity Management Solution
  2. Put in Place an Understanding of end-to-end Infrastructure Latency
  3. Take Responsibility for Application Response Time!
  4. Rewrite your Service Level Agreements around Response Time, Variability, and Error Rates
  5. Base Your Approach to Capacity on Response Times and Transaction Rates
  6. Make Response Time and Transaction Rate Part of your Chargeback and Workload Allocation Process

As our customers know, Virtual Instruments can help with nearly all of these.  But we’re best known for step two, understanding the end-to-end infrastructure performance – not just VMware performance or SAN performance, but literally end-to-end performance – and infrastructure response time is the key metric we offer that really differentiates us.  It’s perhaps the most valuable metric to the team supporting the virtual infrastructure.  Bernd talks about it in some detail, and he goes on to offer advice on criteria that will help accomplish this second step to increasing virtual server success.  His list:

  • Measure IRT – Monitor how long it takes the infrastructure to respond to requests for work, not how much resource it takes
  • Deterministic – Get the real data, not a synthetic transaction, or an average
  • Real Time – Get the data when it happens, not seconds or minutes later
  • Comprehensive – Get all of the data, not a periodic sample of the data
  • Zero-Configuration (Discovery) – Discover the environment and its topology, and keep this up to date in real time
  • Application (or VM) Aware – Understand where the load is coming from and where it is going
  • Application Agnostic – Work for every workload or VM type in the environment, irrespective of how the application is built or deployed
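The “Deterministic” and “Comprehensive” criteria are easy to illustrate with a few lines of Python. The response times below are invented for the example: 99 fast I/Os and one stall inside a single polling interval.

```python
from statistics import mean

# Hypothetical per-I/O response times (ms) within one polling interval:
# 99 fast I/Os and a single 300 ms stall.
response_ms = [5] * 99 + [300]

interval_average = mean(response_ms)   # what an averaged poll would report
worst_case = max(response_ms)          # what per-I/O measurement reveals
print(interval_average, worst_case)
```

The averaged figure stays under 8 ms and looks healthy, while one request actually waited 300 ms, which is the kind of event that only real, complete data will show.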

We couldn’t agree more!  I can’t do justice to Bernd’s presentation, so to hear more, go to  the Performance Management Topic at The Virtualization Practice, or listen to the webinar we did with Bernd.

Virtualize Mission Critical Applications?

Best Practices

If the attendees at this summer’s Burton Group conference are any indication, many enterprises are forging ahead with plans to virtualize more and more tier one applications.  Certainly, virtualization and cloud computing were hot topics for the week, and I was surprised at how far along many people are in both areas.  From Burton and Gartner analysts, as well as many customer case-study speakers, the message was clear … IT is undergoing a transformation.  Virtualization benefits are too compelling to just relegate to tier two applications.

But it was also clear that most IT operations folks I listened to aren’t quite sure how they’ll provide SLAs to the business owners of these applications.  The existing physical-layer tools aren’t quite up to the task, and many I talked to were genuinely apprehensive about their ability to monitor and fix performance-related issues in both private and public clouds.

One of the top industry analysts on the subject, Bernd Harzog of The Virtualization Practice, suggested that we do a webcast on the subject. Bernd is quite familiar with the range of emerging virtualization management tools, and so we teamed with him and produced a 50-minute webcast entitled “Beyond VMware Resource and Availability Monitoring – Reducing the Impact of SAN Bottlenecks on your vSphere Environment”. Yeah, I know the title is quite a mouthful.

According to VMware performance specialists, “90% of performance problems seen at VMware customers are SAN related”.  And from my reckoning, 90% of the tools out there are blind to the SAN.  They report on I/O at the server and that’s about it.  So check out the new webcast; find out what solutions are available; and let me know if it’s helpful.  You can register for this free, on-demand webcast at:  http://www.virtualinstruments.com/vmware-performance-webinar.html

The Need for End to End Awareness

Best Practices

In the past week I’ve had two customers mention that they lack, and need, “End to End Awareness” for their environment. They both mentioned that their Host and SRM tools are device-specific and, while great in some respects, fail to provide a comprehensive view of their environment’s performance.

This drew me back to my own days as an end user, when all the SMI-S compliant tools at my disposal gave me wonderful topologies, capacity planning features and end-to-end views but failed to provide the ‘awareness’ of performance I craved. Worse still, I was often guilty of still zoning and provisioning with the legacy SAN switch management tool or the Storage Array Management Console, despite all the APIs that were running in my heterogeneous environment to give me that ‘single management pane’. The simple reason was that, despite all the management capabilities, I was concerned that I still needed the legacy tools to get a detailed picture of what impact my changes would have on the environment’s performance. In hindsight even this wasn’t good enough, as I was depending on metrics averaged over polling intervals, which could not reach the millisecond granularity I needed.

Hence another one of my personal conundrums as a Solutions Consultant for Virtual Instruments: Our solution offers the Awareness of performance that allows you to see every single I/O from HBA to Switch port to LUN that complements the SRM and device specific tools that already exist. We are able to measure every single FC transaction down to the millisecond. So while it’s great that I am now able to explain to customers this unique solution that provides the granular End to End Awareness of performance that I also personally craved, I’m now no longer an end user and hence can’t take advantage of the platform myself!

Changing IT Role in News Publishing Company

Best Practices, EMC World

During one of the “cattle call” lunches at EMC World last week, I sat with an operations manager from a leading U.S. news publisher, who had an interesting observation on how the role of IT has changed at his company.

He reminded me that today, publishing no longer means just “paper.”  For new-age publishers, their bread and butter consists of assimilating, publishing, indexing, and archiving online content.  Their business, in effect, rides on the Internet, and their readers expect no latency in their online apps.  The benchmark is really the physical experience of reading a book, magazine, or newspaper.  There are a lot of news publishers on the Web, so the penalty for downtime or slowdowns is immediate … people go elsewhere.

Beyond just current news, this publisher has to maintain a massive archive of articles.  Because of the huge spikes in readership on an hourly or daily basis, this news leader makes heavy use of virtualization to help balance the load.

Because of this, tracking and guaranteeing performance under virtualization are major challenges. The old ways that sort of work in the purely physical world really aren’t working for them, so they are looking for better measuring and monitoring tools. He said that his current tools will alert his team to problems, but in a virtualized world, those tools aren’t actually leading them to the cause of the problem. At the same time, the IT department is on the hook for performance-based SLAs. So they’re wasting money buying more and more gear to hedge their bets, realizing that it can’t go on forever. Not with the margins in online publishing.

In between bites of short ribs, he told me about a company he’s talking to which measures the kinds of metrics that’ll help him meet his SLAs.  At first, I was worried that maybe VI had a new competitor, but then he noticed my badge and said, “Hey, it’s you guys.”  Small world.  It’ll be fun to see how his investigation progresses.
