Sunday, 5 March 2017

Better tools do not make better engineers

In these days of DevOps, we have so many tools available to us, from configuration management and NoSQL databases to containers and orchestration. And under open source, many are available at no cost. While the tools may solve predominantly technical problems, I believe that a significant proportion of the problems in building better systems are human.

A common issue I've seen is not understanding enough about what problem needs to be solved in the first place. There always seems to be a new tool around, and sometimes the problem can inadvertently become "How can we implement technology X in our business?" rather than "How can the implementation of X help this business?".

In my team we've discussed whether we should migrate from Puppet to Ansible. Some people say Puppet is slow, complicated and hard to make reliable. If the criticisms are true, then we have the opportunity to be more productive by migrating. If they are not true, then the migration will cost a lot of time only to end up in a similar place. When faced with a problem that looks like it might be solved by the implementation of a new tool, there are a few key questions we should be asking ourselves.

  • Is the problem with the current technology or its implementation?
    • Consider best practices, whether the design is appropriate for the problem faced, whether extensions to the tech (plugins, libraries, hardware) would solve the problem better.
  • Are the problems solved in a more recent version of the tool?
    • The most common issues with a tool are often known by the organisation behind it. Sometimes a fix is in a more recent version, or it may be on the development roadmap.
  • How would the cost of migration be offset by the benefits of the new technology?
    • Costs include the training of staff, the development of new workflows (if required) and the time it takes to develop replacement solutions.
  • Is the new tool weak in places where the old tool is strong?
    • While the benefits of the successor are touted well, the things it does not do as well may be harder to find. Often such problems are not evident until the migration has started.
The weighting given to each question will change depending on your resources, such as time, money, skills and attitude, but I believe the questions are still valid in most cases. They are not reasons to avoid change, but appropriate consideration will make it easier to plan for any change and ensure the benefits are realised.
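
To make the cost question a little more concrete, here is a minimal sketch of a break-even estimate. Everything in it is an assumption for illustration: the class name, the three cost fields (taken from the cost list above) and all of the numbers are hypothetical, and a real estimate would need your own figures and probably more factors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MigrationEstimate:
    """Hypothetical, rough estimate of a tool migration, measured in engineer hours."""

    training_hours: float        # training of staff
    workflow_hours: float        # development of new workflows
    rebuild_hours: float         # developing replacement solutions
    hours_saved_per_week: float  # expected productivity gain after migrating

    def total_cost_hours(self) -> float:
        """Sum of the one-off migration costs."""
        return self.training_hours + self.workflow_hours + self.rebuild_hours

    def break_even_weeks(self) -> Optional[float]:
        """Weeks until the saved time repays the cost, or None if it never does."""
        if self.hours_saved_per_week <= 0:
            return None
        return self.total_cost_hours() / self.hours_saved_per_week


# Illustrative numbers only, not measurements from any real migration.
estimate = MigrationEstimate(
    training_hours=80,
    workflow_hours=40,
    rebuild_hours=200,
    hours_saved_per_week=8,
)
print(estimate.total_cost_hours())
print(estimate.break_even_weeks())
```

If the break-even point is further away than the team is likely to keep using the new tool, that is a strong hint the criticisms of the old tool deserve a closer look first.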

There are issues with every technology but I don't think shifting every time something shiny comes along will make us better engineers. It is often in overcoming challenges that we improve our skill in engineering. Our creativity and understanding of technology are both developed when we persevere with problems. Even if we ultimately fail to solve them the way we like, we still gain by understanding more about what is required for the solution. This way, when we do need to take a technological step forward, we can be confident we're heading in the right direction.

Friday, 13 June 2014

A template for the team weekly roundup

While working in a previous position, I noticed that although the team discussed a lot, the right things weren't always coming up. That team supported live television services, so it was important that engineers were up to date on outstanding problems, workarounds and changes. I found that even when engineers sat next to each other, there was no guarantee that they would exchange the information each of them needed. So I initiated a roundup meeting every Friday to solve this issue.

The focus of the weekly roundup was not to review everything that happened in the week, but to determine where problems were and communicate activity to the team. Action points were captured out of the meeting and assigned later. Minutes were sent out that day. Always write some kind of minutes for review meetings, otherwise decisions and knowledge are lost. The meeting was limited to 30 minutes.

Weekly Roundup Template

  • Changes: Any failed changes or changes that took longer than one hour to complete? (In that environment, most well-written changes could be performed in less than an hour. Any longer and there was probably something wrong.)
  • Incidents: Any repeated incidents?
  • Projects
    • What's been done this week
    • What's the next step
  • The Good - What went well
  • The Bad - What should we be doing better
  • Any Other Business - anything else people want to discuss
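
The first two agenda items can be prepared before the meeting. The sketch below shows one way to pick them out, assuming change and incident records are available as simple Python data. The record shapes, field names and sample data are all hypothetical; only the one-hour limit and the focus on repeated incidents come from the template.

```python
from collections import Counter
from datetime import timedelta

# From the template: changes longer than one hour are worth discussing.
CHANGE_TIME_LIMIT = timedelta(hours=1)


def changes_to_discuss(changes):
    """Return ids of changes that failed or ran longer than the time limit.

    `changes` is a list of dicts with hypothetical keys:
    'id' (str), 'duration' (timedelta) and 'succeeded' (bool).
    """
    return [
        change["id"]
        for change in changes
        if not change["succeeded"] or change["duration"] > CHANGE_TIME_LIMIT
    ]


def repeated_incidents(incidents):
    """Return the (service, symptom) pairs that occurred more than once.

    `incidents` is a list of (service, symptom) tuples.
    """
    counts = Counter(incidents)
    return [key for key, count in counts.items() if count > 1]


# Made-up sample week.
week_changes = [
    {"id": "CHG-1", "duration": timedelta(minutes=20), "succeeded": True},
    {"id": "CHG-2", "duration": timedelta(minutes=95), "succeeded": True},
    {"id": "CHG-3", "duration": timedelta(minutes=30), "succeeded": False},
]
week_incidents = [
    ("encoder", "stream dropped"),
    ("playout", "disk full"),
    ("encoder", "stream dropped"),
]

print(changes_to_discuss(week_changes))
print(repeated_incidents(week_incidents))
```

Grouping incidents by service and symptom is a deliberately crude definition of "repeated"; in practice the team decides what counts as the same incident.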

How this helped:
  • The team had greater awareness of important activity.
  • Engineers had the opportunity to review problems as a team - engineers have differing levels of expertise and experience. Putting problems to the team enabled all strengths to be applied.
  • Better identification of problems - There were numerous occasions where engineers had adopted a practice of successfully working around a problem, but the workaround still cost a significant amount of time. Specifically discussing repeated incidents helped bring underlying problems to light.
  • Improvement for team morale - It's frustrating to be firefighting much of the week only to know that next week will be the same. Being able to raise problems and track progress of solutions helped to improve morale.

Not everybody was a fan of the meeting. But I found that in those cases it reflected the engineer's approach to teamwork, rather than the meeting itself. It was especially useful to newer engineers, as they learned about things that wouldn't normally be discussed with them because of their limited experience. Overall, it turned out to be a very useful 30 minutes out of the week.

Tuesday, 6 May 2014

We like good documentation, but why don't we like to write it?

'...The problem with Software A is that the documentation is really lacking. Look - there's no reference for these commands, and the online reference is outdated. I don't think the product is mature.'
'Yeah, it takes too long to work out how to do something. Have a look at Product B, the features are similar but they've got tutorials on common tasks and the online docs are much better.'

When we're evaluating a new piece of software, this type of conversation is common. But when we need to create documentation ourselves the conversation is quite different.

'Mate, I've got a customer who wants to configure a new public-facing web server in a DMZ. Is there documentation for this?'
'Well, what you have to do is talk to Networks and get them to set it up. Actually no, they need information about customer VLANs and the new server needs to be in the same IP range as their other stuff.'
'Ok, where are the IP ranges documented? And how does it work behind the load balancer?'
'For the IP ranges, there's a document on the "S drive" but I can't remember exactly where it is. Not sure about the load balancer, check with John, he set it up. He might have some notes somewhere...'

Our attitude towards internal documentation and what we expect from third parties couldn't be more different. The key impacts of poor internal documentation are:
  • Incidents take longer to resolve on average - For example, if a website becomes unavailable when a load balancer is failed over, it's probably down to missing configuration somewhere. But if the engineer doesn't know or can't remember how the website was provisioned, she will have to work out from scratch how to provision a website.
  • Changes take longer - Modifying or improving services takes longer because you have to repeatedly determine the specifics of how a service works.
  • Changes are more likely to cause incidents - In the real world, an engineer has a limited amount of time to determine the possible effects of a change. Without adequate documentation about the service, she is less likely to understand fully what the change will do and therefore more likely to cause unintended effects.
  • Engineers' time is wasted repeatedly explaining the same thing - In a team, one person may have to repeat an explanation to different engineers as they require it. But people often forget, especially if they don't do that thing frequently, so they'll have to ask again, consuming time from both engineers and delaying work.

So why don't we make documentation better?
  1. Too much focus on resolving incidents instead of preventing them - Live incidents get a lot of attention and once they're resolved it's on to the next thing. But is this incident a one-off or has it happened before? Wouldn't writing the solution down be useful to the next engineer? Was the incident a result of a change that wasn't implemented correctly because the implementer didn't know exactly what to do? If we start trying to prevent incidents, the value of documentation becomes more obvious.
  2. A belief that memorizing details is the way to increase knowledge - Some are of the opinion that with more experience, an engineer should be able to remember enough about the environment to manage it effectively. This approach simply doesn't scale well. Once an engineer stops working on something for a while, the details begin to slip away. How many times have you come back to a script or service you built and struggled to remember how it worked?
  3. The cost of poor documentation is not obvious - The biggest effect is loss of productivity: less work takes more time. But the average team probably doesn't track how long tasks take. Without a weekly review of incidents and changes, it is easy to miss the fact that people are spending too much time on tasks that could be much quicker.
  4. Lack of skills to solve non-technical problems - Most engineers are geared towards analyzing issues technically. But there isn't much training or focus on how to identify and resolve non-technical challenges like knowledge sharing or incident prevention. Companies end up recruiting technically proficient teams but don't recruit people who can see the non-technical components.

IMHO, the fact that many engineers don't like to write docs is not the real problem. The real issue is that in the culture of IT Operations, there isn't a strong understanding of how capturing knowledge helps make us better engineers.