Left understated was that maintenance windows are not like any other time.
Being under time pressure is a vastly different experience
because you are fighting both stress and non-normal operating hours, ie.
every minute into a window means you are more tired and therefore stupider.
We combat this with prudent planning; plans are literally intelligence that
we warehoused when we were smarter.
Anything that can go wrong, will go wrong
Consider all work as transitions from one known-good state to another.
How do you show that a new state is good? How do you rollback to a previous state?
Document what you want to do, what are you doing, and what you have done (post-mortem).
Never perform major hardware or software upgrades on a Friday
Establish maintenance windows.
Is it reasonable that the plan will fit the window?
Plans have abort timings, ie. rollback if not complete by a predetermined time.
When pulling cable, always pull extra cable
Terminate in-building wiring to the building with noise budget for expected segments, eg.
a link from a LAN closet to a workstation has 3 segments: switch to same-rack cable term,
from rack to room-wall, and from wall to workstation.
When a room layout changes, you only need to change out the last segment.
Any changes to the system should be immediately tested
Already covered. See above for transition plan.
Never delete an old version of software until
At worst, archive to tape or offsite.
Do not upgrade/install software in the middle of the school year
Already covered, see above for maintenance windows.
Learn from your mistakes.
While planning, review the literature of others have done (including your past self).
ie, use the 3 Gs: google and git grep.
Failure is not an option.
Failure is the default unplanned option.
Every problem has a solution.
(and the solution might be to just clean up, cf worse is better)
Never say "No" when answering.
(Take the time to educate the other person as to why the request is sub-optimal)
There is no rule 7.
For every action there is an opposite & equal reaction.
Prudent admins consider the range of human responses to policies and rules.
Work smarter, not harder.
Automate Everything and Retire Early.
Always have backups. (note the plural!)
Run mini-disaster recoveries often.
Only people with a key to the room are allowed in the data-centre.
Part of transition planning is avoiding unplanned transitions.
Nullius in verba
Never close your last root shell.
Never disable your emergency recovery.
Periodically prove that it works.
The only reason to answer the phone
Nominate a NOC liaison, enforce communication through them or it.
Do not brag
OpSec is important.
Do not announce
Periodically review security measures vs experience here and elsewhere.