Thursday, 11 October 2012

Prevention is much better than cure

Much of the day-to-day work in Operations tends to fit the break-fix pattern. Things break, and the team spends its time trying to fix them. From my own experience and from talking to engineers in other companies, this break-fix pattern is where 75% of the team's time goes. This is not good.

  • The company can't serve new customers effectively
  • The company can't make significant improvement to existing services
  • Engineers can't work on things they actually like doing (which increases staff turnover and costs the company more money in training)
So why is it hard to solve?
  • Culture - This is the most frustrating thing to deal with: the attitude that repeatedly fixing the same problems is just life in Operations.
  • Up to half of the problems are probably non-technical - Many problems can't be solved behind a computer. They involve educating customers and staff to change work patterns so that problems don't arise.
  • Prevention takes much longer than a quick fix - Stopping an issue from recurring takes considerably longer than applying a workaround, so under time pressure the workaround usually wins.
In my opinion, developing a strategy to prevent problems is the only way for an Operations team to really move forward. Without such a strategy, time and energy will be eaten up in break-fix efforts until the team can do little else. So here's my approach:

Monitor your stuff
Yes, it seems like this should go without saying but I always find critical components that are simply not monitored. RAID isn't much use if you don't know when a disk fails. And I've lost count of the number of problems caused by lack of storage on disks, tapes or other devices. Get the basics covered for everything - CPU, memory, disk, network connectivity.
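
As a rough illustration of covering one of those basics, here is a minimal sketch of a disk usage check in Python. The function names and the 80%/90% thresholds are my own assumptions for the example, not a recommendation; a real setup would normally feed this kind of check into whatever monitoring system you already run.

```python
import shutil


def classify(percent_used: float, warn_at: float = 80.0, crit_at: float = 90.0) -> str:
    """Map a usage percentage to a status string (thresholds are assumed values)."""
    if percent_used >= crit_at:
        return "CRITICAL"
    if percent_used >= warn_at:
        return "WARNING"
    return "OK"


def check_disk(path: str, warn_at: float = 80.0, crit_at: float = 90.0) -> str:
    """Check how full the filesystem holding `path` is."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent_used = 100.0 * usage.used / usage.total
    return classify(percent_used, warn_at, crit_at)


if __name__ == "__main__":
    # "/" is only an example path; point this at the volumes you care about.
    print("/", check_disk("/"))
```

The same pattern (measure, compare against a threshold, report a status) carries over to CPU, memory and network checks.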

Monitor the stuff that breaks
When something breaks, I always think "How would I detect that in the future?" A lot of time is spent simply finding out exactly what the problem is, so any measure that cuts down that time is a win. This means monitoring network ports, system logs and processes.
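
To make that concrete, here is a small sketch of two such detectors: one that tests whether a TCP port accepts connections, and one that picks out log lines matching a pattern. The helper names, the example host and the pattern are hypothetical; in practice the pattern would be whatever message appeared in the logs during the last incident.

```python
import re
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def matching_lines(lines, pattern: str):
    """Return the log lines that match `pattern` (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [line for line in lines if regex.search(line)]


if __name__ == "__main__":
    # Example values only: substitute the service and log messages you care about.
    print("ssh reachable:", port_is_open("127.0.0.1", 22))
    sample = ["Oct 11 10:00 kernel: I/O error on sda", "Oct 11 10:01 cron: job ok"]
    print(matching_lines(sample, r"i/o error"))
```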

Review the stuff that breaks repeatedly
So you've got monitoring in place and you're quick to respond to any failures. But you don't want to fix the same things over and over, because that is a waste of time. At your weekly round-up meetings, discuss the things that break repeatedly and see if there's a long-term fix.
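
One way to bring a list of repeat offenders to that meeting is to count incidents by what broke. The sketch below assumes each incident has been recorded as a (system, symptom) pair, which is an invented format for this example; the data could just as well come from a ticket system export.

```python
from collections import Counter


def repeat_offenders(incidents, min_count: int = 2):
    """Return (incident, count) pairs seen at least `min_count` times, most frequent first."""
    counts = Counter(incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Hypothetical incident records for one week.
    week = [
        ("backup-server", "tape full"),
        ("web01", "disk full"),
        ("backup-server", "tape full"),
        ("web01", "disk full"),
        ("backup-server", "tape full"),
        ("mail01", "queue stuck"),
    ]
    for (system, symptom), n in repeat_offenders(week):
        print(f"{n}x {system}: {symptom}")
```

Anything that shows up here more than once is a candidate for a permanent fix rather than another workaround.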

Track and check your changes
Many issues are accidentally caused by changes made by engineers. At a minimum, all changes need to be tracked so that there's a place to check what might have caused a recent incident. Beyond that, you need a way to perform some basic risk assessment on changes. And by basic I mean asking the question "Could anything fail as a result of this change?"
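
As a sketch of what the tracking side could look like, the snippet below keeps a simple change record and lists the changes made shortly before an incident. The `Change` fields and the 24-hour window are assumptions for illustration; a shared spreadsheet or a ticket queue would serve the same purpose.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One tracked change (fields are illustrative, not a standard)."""
    when: datetime
    engineer: str
    system: str
    summary: str


def changes_before(changes, incident_time, window=timedelta(hours=24), system=None):
    """Return changes made within `window` before the incident, newest first.

    If `system` is given, only changes to that system are returned.
    """
    start = incident_time - window
    hits = [
        c for c in changes
        if start <= c.when <= incident_time and (system is None or c.system == system)
    ]
    return sorted(hits, key=lambda c: c.when, reverse=True)


if __name__ == "__main__":
    # Hypothetical change log entries.
    log = [
        Change(datetime(2012, 10, 9, 14, 0), "alice", "web01", "Upgraded web server package"),
        Change(datetime(2012, 10, 10, 22, 30), "bob", "fw01", "Tightened firewall rules"),
    ]
    incident = datetime(2012, 10, 11, 9, 0)
    for c in changes_before(log, incident):
        print(c.when, c.engineer, c.system, c.summary)
```

The newest change before the incident is usually the first suspect, which is why the results are sorted newest first.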


There's more to preventing problems than this list, but I think it's a good start.