Post mortem of this morning's outage:
- Why? – Our link to Peer1 NY went down
- Why? – Our switch appears to have put the port in a failed state
- Why? – After some discussion with the Peer1 NOC, we speculate that it was quite possibly caused by an Ethernet speed / duplex mismatch
- Why? – The switch interface was set to auto-negotiate instead of being manually configured
- Why? – Our network administrator was fully aware of problems like this, and has been for many years. But - we do not have a written standard and verification process for production switch configurations.
Assuming that the third ‘why’ is correct, and it certainly is probable, then we have our root cause. Had we produced a written standard prior to deploying the switch and subsequently reviewed our work to match the standard, this outage would not have occurred. Or, it would occur once, and the standard would get updated as appropriate. Documentation is often thought of as an aid for when the sysadmin isn’t around or for other members of the operations team. It should be clear that it is much more than that.
There is irony in the fact that our system administrator spent the early part of this week drafting a small set of policies and standards for our environment. He now has one more to add to the list.
Now, we could surely take the 5 Whys even further and discover that we would be better off with an HA router / switch configuration, etc. While I see that as fair, the above examination exposes a fundamental flaw in our approach to maintaining this environment which needs to be remedied before adding complexity.