Why is Allan so Nuts?


This blog is normally about what I did to solve problems I had. It might not be the best reading, and I’m not the best writer, but I like to share so that the research I did for something can go further.

I am obsessed with details. Why? Why am I so nuts?

In my early career, I went from military to telephony to USL Unix to mobile phones: from errors costing lives, to errors costing 911 calls, to errors costing millions of dollars in retail revenue and downed servers, to an environment (mobile) where it must be correct from the outset or you risk 911 calls and risk losing the ability to fix it in a costly field upgrade.

Recalling a handset can ruin a company, but not being able to place a call can risk a life.

When I helped NASDAQ with a field upgrade of 2700 servers, we saved them millions and millions, in an environment where, if too many servers didn’t come back, all trading would be suspended for fairness/access reasons. When I helped McDonald’s, they needed 100% uptime without fatigue on flash or RAM. K-Mart needed around 2500 servers done in one shot. The FAA needed 100% accuracy and a one-entry form for install information (so consider installing Windows and VW with only one piece of data, three letters and a number, as the only user entry). RadioShack and SDM (Canada’s largest retailer) needed a “hit enter” as the only install command. At every step, I’ve had to keep track of every little detail.

Errors are avoided at the design stage. When it has to work the same every time as delivered, interpreted and late-bound languages are not the best choice (especially when their lack of internationalization causes additional issues; I’m looking at you, Perl). Re-use tested stuff rather than making it up (it’s out there if you look), or discuss and indicate why the existing option is unsuitable, so that others know you looked and can learn from your research. Documented design choices avoid second-guessing by latecomers. If no clear choice is possible and one of many equal options must be picked, document the “Arbitrary choice” and move on, able to refactor if a later detail exposes a design issue that needs correction. Good design should avoid Epochs and variants, but since you cannot see everything yourself, document, discuss, and debate as much as your timeline allows.

Failure scenarios should be graceful, and verbose enough for forensic analysis. Errors and error codes should be unique and easy to shout over a phone from a noisy datacenter, and should be easy to google or research at 3am rather than requiring you to track down an engineer to decode them. Why did you add complexity? Can you explain it in one breath? In general, the myriad little details should be handled by the software, leaving the broader scope and macroscopic config choices to the meatware (the person/people).
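To make the point about unique error codes a bit more concrete, here is a minimal sketch in Python. It is not taken from any of the projects mentioned above: the `ErrorCatalog` class, the `format_log_line` helper, the `INS-0042` code, and the `host`/`free_mb` fields are all made-up names for illustration. The idea is only that each code is registered once, means exactly one thing, and shows up at the start of a log line that a script (or a tired human) can parse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorCode:
    code: str     # short, unique, searchable identifier (hypothetical format)
    summary: str  # explanation short enough to say in one breath


class ErrorCatalog:
    """Registry that refuses duplicate codes, so each code means one thing."""

    def __init__(self):
        self._codes = {}

    def register(self, code, summary):
        if code in self._codes:
            raise ValueError(f"duplicate error code: {code}")
        entry = ErrorCode(code, summary)
        self._codes[code] = entry
        return entry

    def lookup(self, code):
        return self._codes[code]


def format_log_line(error, **context):
    """Render one parsable line: the code first, then sorted key=value pairs."""
    fields = " ".join(f"{key}={value!r}" for key, value in sorted(context.items()))
    return f"{error.code} {error.summary!r} {fields}".rstrip()


# Hypothetical example entry; the code and wording are invented for this sketch.
catalog = ErrorCatalog()
DISK_FULL = catalog.register("INS-0042", "install target disk is full")
print(format_log_line(DISK_FULL, host="store-17", free_mb=0))
```

Registering the same code twice raises a `ValueError`, which pushes the uniqueness check to development time rather than to a 3am phone call. Sorting the context keys keeps the line layout stable from run to run, which is what makes the log easy to grep and diff.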

This is why I see details. This is why I discuss unique error codes (and love Oracle for it). This is why I discuss parsable logs and precise configs.

My memory doesn’t work like other people’s, but my memory works for these things.

This is not the blog where I’m quite critical about things; this is where I hope others will find solutions. I wanted to explain why those solutions tend to be specifically different from others’ solutions.
