Infrastructure simplifying engineer
Date: 2020-10-09
This is a translation of my talk at TechLeadConf on 2020-06-09. Before we start, I’d like to get on the same page with you, so could you please estimate how much time it will take to:
Here is a spoiler from TechLeadConf. Unfortunately, it’s in Russian:
(Menti poll results)
Let’s imagine that you expect an environment to be ready 2 days after you create a Jira ticket. Right after that, you receive an email saying it will be ready in 2 weeks. It’s sad but true. And usually I’m that person from the infrastructure team, the one who crushes your expectations. Please don’t blame me: I’m not a stiff, opinionated, or arrogant person. There are serious reasons for that. Let me clarify and put things right.
Before we start, I would like to clarify what infrastructure is. First of all, we should understand how it comes into being. From my point of view, these processes are pretty similar everywhere, so let’s take a look at the infrastructure behind out-of-the-box software for a huge enterprise.
As we know from experience, mounting servers into racks is not a rapid process. On the other hand, the development process must run at full speed.
The world was changing. Our application was containerized. Developers were able to create CoreOS-based VMs dynamically, and each VM had a docker-compose file emulating a customer-like environment. It was a kind of k8s with a minimal set of features, before k8s was even released. Then the first hard questions appeared: there was a bunch of orphaned YML files in a git repository, and nobody was responsible for them. Technical debt was growing because of quick fixes and kludges. Team members were changing, too. The apocalypse came: nobody knew how the infrastructure worked, and nobody saw the whole picture.
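To give a feeling of what those per-VM files looked like, here is a minimal docker-compose sketch of a customer-like environment. The service names, images, and ports are assumptions for illustration, not the actual files from that repository:

```yaml
# Hypothetical docker-compose file emulating a customer-like environment.
# Service names, images and ports are illustrative assumptions.
version: "2"
services:
  app:
    image: example/app:latest   # the containerized application
    ports:
      - "8080:8080"
    depends_on:
      - db
  db:
    image: postgres:9.6         # a database similar to the customer's setup
    environment:
      POSTGRES_PASSWORD: example
```

Multiply files like this by dozens of dynamically created VMs, and it becomes clear how quickly they turn into orphans once their authors move on.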
Things like that happen; it is neither good nor bad. The heat death of the Universe is a given, and it is exactly the same here: infrastructure tends toward chaos.
Creating an instruction or a blueprint might be the first idea that comes to mind when you want to deal with chaos. Here is a short story.
Ages ago, I worked in a huge company somehow related to the petrochemical industry. There was a lot of red tape and formalism inside. Once upon a time, someone had to migrate a service and got approval to implement a temporary network scheme for a couple of weeks. The service was exposed to the Internet, and it was formally authorized. While reverse-engineering the petroleum-storage depot of a big airport, I found that service. The funny thing was that I stumbled upon that temporary scheme 5 years later. It was an awkward situation, because it provided direct, unrestricted access to the DC core network.
As you can see, formal authorizations and blueprints sometimes can’t save you from chaos.
It is roughly the same situation with Jira tickets. Let’s imagine somebody asks you to prepare a new test environment. On the one hand, you prepare the environment and immediately forget how it was configured. On the other hand, if you formalize and automate your agreements as code (it might be a custom DSL, an Ansible playbook, etc.), you get a reproducible solution and a single source of truth. It sounds great: somebody commits changes to git, and they automatically appear in the production infrastructure. But you may ask: is it worth the effort?
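For example, an agreement like “a test environment has nginx installed and a dedicated application user” can be captured as a small Ansible playbook. This is just a sketch: the host group, package, and user names are assumptions for illustration:

```yaml
# Hypothetical playbook formalizing the 'test environment' agreement.
# The host group, package and user names are illustrative assumptions.
- name: Provision a test environment
  hosts: test_env
  become: true
  tasks:
    - name: Install nginx
      package:
        name: nginx
        state: present

    - name: Create the application user
      user:
        name: app
        shell: /bin/bash
        state: present
```

Once such a playbook lives in git, the next test environment is a pipeline run away instead of a half-remembered manual setup.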
Does it make sense to automate agreements? I use a decision-making matrix to answer that question. Conceptually, it looks like the Eisenhower box.
Let me share some examples:
Let’s look at it from a different perspective: your processes might be automated at different levels.
To evolve, or not to evolve: that is the question.
It’s part of the nature of infrastructure that agreements get formalized as code and tend to become * as a Service.
Sooner or later, the number of agreements in your infrastructure grows, and it tends toward chaos. You may face agreement drift between your ideal model of the infrastructure and the real picture of the world. As a result, you might want to put things in order. There are several possible reasons for that:
There was a custom configuration management solution. It looked like IaC and behaved like IaC, but it was hard to maintain: it was too fragile, and nobody wanted to support it. As a result, it was replaced. However, that took 18 months.
You may ask me: ‘Why so long?’. There are some answers in the article Ansible: CoreOS to CentOS, 18 months long journey. In general, the answer is that you are changing processes, agreements, and workflows. Migration is a boring, predictable process, and it follows the Pareto principle:
From the top-level perspective, it is as easy as pie:
There was a project. It was an ordinary project, nothing special. There were operations engineers and developers dealing with exactly the same task: how to provision an application. However, there was a problem: each team tried to do it in its own way. They decided to deal with it and use Ansible as the single source of truth. Sooner or later, they realized that the playbooks were not stable and crashed from time to time. As a result, there was a task to stabilize them.
You can read the whole story in the article How to test Ansible and don’t go nuts. To make a long story short:
For historical reasons, there was a monolithic repository with all the Ansible roles, and a Jenkins multi-branch pipeline was created for it.
A top-level overview of the pipeline:
It’s totally OK that infrastructure changes and evolves into a huge pile of scripts or a bunch of * as a Service. It is actually a good thing, because you can refactor it. I guess you’ve noticed that these cases have something in common:
It is not really important how you shave the yak. It might be sticky notes on a wardrobe, tasks in Jira, or a spreadsheet in Google Docs. The main idea is to track the current status and understand how it is going. By its nature, this process is similar to code refactoring, but with some limitations on tooling. You should not burn out during refactoring, because it is a long, boring journey. Also, I’d like to emphasize that you should have:
After some time, the velocity of infrastructure changes might go down. It is totally OK. From my point of view, there is a correlation between the number of agreements and the SLOC of your IaC. I’m not 100% sure it is the best way to visualize agreements, but there are some remarks worth making:
The conclusion is that we can support IaC growth with a constant number of engineers thanks to IaC tests. In other words, IaC testing increases the velocity of changes and makes them cheaper.
In 2019 I gave a talk called Lessons learned from testing Over 200,000 lines of Infrastructure Code. It was about the similarity between IaC and software development; in particular, I was talking about the IaC testing pyramid. You could create the whole infrastructure from scratch for each commit, but usually there are obstacles: the price is stratospheric, and it requires a lot of time. That makes it reasonable to reuse the testing pyramid from the software development world:
It’s a pretty good question. Let me share some stories.
We started with integration tests. They were hard to maintain and worked too slowly for developing IaC. As a result, we decided to rebuild the process from scratch: we started with linting, and after that the process evolved to unit tests.
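As an illustration of the unit-test level, here is a minimal Molecule scenario that spins up a Docker container and applies a role to it. Molecule here is my example tooling choice, and the platform image and verifier are assumptions rather than that project’s actual configuration:

```yaml
# Hypothetical molecule/default/molecule.yml for unit-testing a single role.
# The Docker image and the verifier choice are illustrative assumptions.
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: instance
    image: centos:7
provisioner:
  name: ansible
verifier:
  name: ansible
```

A scenario like this typically runs in minutes for a single role, which is what makes it usable as the fast middle layer of the pyramid, between linting and full integration tests.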
There was another project. We started with linting from the very beginning, while the codebase was small, and it grew smoothly and painlessly.
The lesson learned:
Let me share my estimates of the SLOC numbers: