May 01
SAN Troubleshooting Best Practices
Best Practices, bottlenecks, Real-Time Monitoring, SAN, troubleshooting Add commentsPeople often ask — “Are there any best practices that the troubleshooting experts recommend?” I asked a couple of our top services guys for their recommendations, and I’m sharing them with all of you today:
- Don’t stop looking just because you’ve removed the symptom, because if you do, you’re likely to see the same problems later. Sure, to alleviate the immediate problem, you may have to remove users or applications that are less critical, perhaps stop backups, and remove other potential bottlenecks. While this may fix the immediate problem, it often stops the underlying cause from being discovered.
- Use “real” real-time monitoring for alerts that get you in front of the issues before the application users feel the pain.
- Sometimes you have to broaden your approach beyond what the user is reporting. If you stop there, you will often miss larger issues that may affect other, slightly less latency-sensitive apps.
- As a first step for triage, try to isolate whether the cause is on the server or the SAN. Comparing your baseline Exchange Completion time with ECT during the slowdown, will tell you immediately where to start, and where to stop looking. Your vendors will appreciate it, too.
- Try to find the finest granularity in your historical reporting to see which event preceded another, for cause and effect. A one-minute interval is often not sufficiently granular.
- Look at your historical I/O patterns, busy times of day, multipath configurations, queue depth settings, top talkers, etc. to gain a profile of behavior. Then compare with your healthy baseline, and rule out things that haven’t changed. You might find 6 things that appear to be going wrong, but if only one of those things seem to have occurred when the problem was reported, you can focus on that thing immediately. Later on, you can go back to look at the others.
- When changes are made to fix the incident, you should get immediate feedback. Without an immediate response, customers often take one of two approaches: 1) They delay or stagger fixes until they can determine the effect of each one; 2) Or they make all changes at the same time, and are then left wondering which change fixed the problem.
- Lastly, ask for help sooner rather than later. We’ve heard of problems dragging on for months, vendors getting kicked out of accounts, and literally millions of dollars wasted on adding expensive hardware. Waiting days or weeks to find the root cause of a problem is unacceptable. Bring in a performance pro.
Recent Comments