You have Continuous Deployment, not Continuous Delivery 🔗
I am a consultant, and I have lived in more production systems than most. I will tell you that CD (continuous deployment) is easy. Everyone does it, and in general, it works. I say this because this blog has CD, and at this point in time, the configuration for that is 43 lines of YAML. It could probably be shorter, but it was easy and it's pretty reliable. When you think about CD, it is a very light extension of what our toolchains do every day. I press the play button or run npm test, and it passes, or it doesn't. Moving those steps to another computer with Nix or Docker is thus trivial. I duplicate the same toolchain, plus a couple of release-specific language tools, and I am in production.
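For reference, here is a minimal sketch of that kind of workflow, assuming GitHub Actions, an npm-based toolchain, and a hypothetical `deploy` script (my actual 43 lines are not reproduced here):

```yaml
# Minimal continuous-deployment sketch: every push to main gets built, tested,
# and shipped. The `deploy` script is a stand-in for whatever publishes the
# build to your host; everything else is the toolchain you already run locally.
name: deploy
on:
  push:
    branches: [main]
jobs:
  build-test-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test          # the same "press play" step, just on another computer
      - run: npm run build
      - run: npm run deploy    # illustrative: push the artifact to production
```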
So, stepping back for a moment: just because I have seen a lot of systems doesn't mean I have seen a lot of good ones. But I have caught a thread: the cost of true continuous delivery is at odds with little-"a" agile and our impression of what is good 'nuf. I am a TDD nut with a penchant for E2E testing; I am from the Pittsburgh area, and I love pickles; Gherkin and Cucumber are my best friends. The Steelers are a fantastic foosball club; pickles don't come for free. You have to grow some cucumbers and marinate them; waiting is something you have to get used to. Running E2E tests is slow, but in both cases, the results are worth it. If you are a Dallas fan, that's "ok," too.
Slow is fast 🔗
Consider this: you have a huge platform that, like all huge platforms, cannot permit itself to be offline for even seconds, and yet we have decided not to use a Waterfall methodology. Yeah, I know it sounds crazy. So what do we do? Build up a test suite, invoke Conway's Law, spin up some teams, and start building systems that can create artifacts for deployment. Regardless of whether you let the computer deploy them or not, those artifacts are "deployable," and that's where we stopped. We fill in the gaps with humans, pressing buttons, watching graphs, and generally doing the stuff humans are bad at. Surprisingly, bugs sneak past, and then more humans furiously tell relaxed humans to figure it out. Like ants during a termite invasion, you've been there: in the special "command center" of 30-some people with only 3 talking. Each team has its assigned scouts waiting for a fight and shielding the colony from wasting its time. Feels busy, feels dangerous, feels fast! Good thing they don't build houses this way, right?
So what's the alternative? And no, it's not "MOAR tests!" Yep, surprised even myself there, 'cause I kinda wanna write some more tests. It's a concept of coverage, though; just because I have merged my change doesn't mean I need it to deploy immediately. I do need it to be tested, and I need that evaluation to be feature-aware. It's not so surprising: we create a system for a fixed purpose; we, some fancy folks with paper hats, create a dish for our customers to consume. Now say it with me: they planned the dish, the cooks made it, and then... they threw the recipe away. Yep, go find a product person adjacent to your team and ask them to describe a given feature's purpose. Not the technical implementation, but the "5 whys," how we got here, and what problem we were trying to solve. Nine out of ten directors will struggle to describe the answer. If you are in the lucky 10 percent, stay there and count your blessings. The rest of us, on the other hand, are entering the pith-helmet phase, wandering off to recolonize the heathens we abandoned a generation or two of developers past.
Consider, if you will, the transition of humans past from an oral tradition to a written one. Product features are nothing more than folklore we must adhere to, not because they are good but because we don't quite understand them. They say things move fast in tech; a developer's average lifespan on a project or at a company is two years. That's ten times faster than the organized media's two-decade reminder that a new generation exists, a meaningless moniker to help inject market separation. While it may feel we are far from the point, it's our lack of history that makes us ignorant and the tradition of ignorance that makes us complicit.
There is a better world! It involves no longer adhering to the convention that we deliver a product to appease a schedule, and instead treating delivery as art. An artifact, if you will, of a moment of creation, never to exist again but persisting into infinity. And we do this with test automation, documentation, and Continuous Integration. The simple nature of Continuous Integration finds its core in how we organize commits. Change management starts at the very level of the code change and the changes being recorded. Each commit is: atomic, able to be built and tested (it doesn't have to pass), and a complete change. One step up, we have a branch or feature, which is complete in its specification and includes tests, internal and external. If external, it exposes immutable contracts that express specific intents, which require change management and re-evaluation to evolve. One step further, our entire product is a hierarchy based on the quality of our commits.
```
Deployment_Artifact
└── E2E_Testing
    └── External Contracts
        └── Unit_Tests
            └── Feature or Branch
                └── Atomic Commit
```
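Read bottom-up, that hierarchy maps fairly naturally onto pipeline stages that gate one another. A hedged sketch, again in GitHub Actions-style YAML, with job names and npm scripts that are purely illustrative:

```yaml
# Each job corresponds to a level of the hierarchy; nothing higher up runs
# until everything below it has earned its place. All names are illustrative.
jobs:
  unit-tests:              # atomic commits / feature branch: build it, test it
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
  contract-tests:          # external contracts: verify the immutable interfaces
    needs: unit-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:contracts
  e2e-tests:               # E2E testing: exercise the assembled features
    needs: contract-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:e2e
  build-artifact:          # deployment artifact: only produced once the rest holds
    needs: e2e-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build
```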
So taking a bit more time to build out this hierarchy on each delivery means we don't have to maintain a long-lived understanding of all of the parts. Instead of long-lived tribal knowledge within a team, we use the details each team exposes and trust that the tests cover our goals. This doesn't mean we will never have bugs or errors because, at some level, we are still humans writing the feature for a computer we partially understand. Each level requires a set of completion criteria that lets us think less, until some ragamuffin pushes a "quick fix" and leaves us forever wondering why one request can have either a customer_id or a customer_uuid; but the fix was valuable enough, and that developer doesn't work here anymore. Since it is now the basis of our entire product path, it is going to stay there. It may feel like broken windows theory, but go on, buy anything of reasonable cost at the big-box appliance store; you are gonna want a discount for that scratch and dent. While broken windows theory has made its way into the Clean Code cult, the reality is more about pride in our environment and respect for our ergonomics. The opposite is trail ethics, where we all subscribe to a set of conditions that leave the world untouched: Leave No Trace. While that is not completely possible when evolving software, we can respect our impacts.
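To make the customer_id/customer_uuid point concrete, here is a hypothetical contract fragment in OpenAPI-style YAML; the schema name and fields are invented for illustration. The value is that a "quick fix" which quietly adds a second identifier has to show up here first, where it can be questioned and recorded:

```yaml
# Hypothetical request contract: the schema is the immutable agreement.
components:
  schemas:
    LookupCustomerRequest:
      type: object
      required: [customer_id]
      properties:
        customer_id:
          type: string
          description: The one identifier this endpoint accepts.
      additionalProperties: false   # an unknown field (say, customer_uuid) is rejected
```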
When you do things right 🔗
Simply, "Watch the pennies and the dollars will take care of themselves. - Franklin", good work is based on a steady stream of quality and consistency. Quality takes effort and time; the results lead to more time saved, not asking why. Early in my career, it seemed normal to expect the highest quality before speed. We didn't advocate for speed until quality could be achieved and even then it was a bonus. This has been replaced for absent-minded "quick win" methodology, which is sometimes linked to Lean Software Development, but mind you read about it first. The truth is lean is not about fast but about eliminating waste, one of those being "relearning". While not quite the same as "don't make me think", it is about having an environment where truth is self-evident in process and in execution. Sounds fancy, and that's cause it is. Continuous Integration is the process of allowing developers to not focus on the past or the tangental concerns and instead focus on the work at hand, management processes, and other operations concern themselves with specific cross-cutting concerns that are related, unblocking, and orthogonal.
All of the companies that have been a joy to work for, and that have eventually grown into something great, have followed these ideals.
What does the "ideal" CI process look like?
A developer runs their test suite and merges their feature; already, a number of tests have passed, along with tests that proved there were enough tests. Processes have evaluated the product to prove that standard practices have been followed. Next, we prepare an artifact for integration testing, which can eventually be promoted to an artifact that will make its way to production. In this phase, a series of end-to-end tests are run looking for fires (smoke tests, if you will) that evaluate success over completeness. These tests are run against both production artifacts and other pre-production artifacts awaiting release. Given these steps pass, we are at what I often refer to as the "rubber meets the road" part of the failure path. We actually deploy it, and we deploy it to a portion of our consumers. There are two conditions that need to be met to complete a deployment: enough interactions have happened to consider the artifact valid, and we have observed no aberration in behavior. The former is related to SLOs (Service Level Objectives) or Core Flows that are expected to work with a specific performance and accuracy. The latter is about real-world requests and their impacts, not artificial ones, and may be referred to as a canary. When both of these items pass, we complete the deployment and immediately start the next release, over and over, until time immemorial or the VC money runs out, whichever comes first.
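No particular tool is implied by those two gates; as a sketch, they might be written down as a hypothetical rollout policy, where every field name below is invented for illustration:

```yaml
# Hypothetical rollout policy capturing the two completion conditions:
# enough real traffic to trust the artifact, and no aberration against the SLOs.
rollout:
  canary:
    initial_traffic_percent: 5       # deploy to a portion of consumers first
    minimum_interactions: 10000      # "enough interactions have happened"
  slo_gates:                         # core flows expected to hold performance/accuracy
    - name: checkout-flow
      success_rate: ">= 99.9%"
      p99_latency_ms: "<= 250"
  on_pass: promote_to_full_traffic   # complete the deployment, start the next release
  on_fail: rollback                  # aberration observed; pull the canary
```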
A lot of stuff just happened in those few steps, and there are lots of permutations of how to achieve this. In a perfect world, a release is not a single developer's code but instead a batch of code that can describe its own changelog through the accuracy of its commit messages. Something that, when it fails, allows us to extract a change using tools like git bisect to eliminate the offending work and allow it to be staged later. While this might sound like stuff that only Google and Amazon do, I have worked in places significantly smaller doing this much or more. We just had a commitment to it and a team dedicated to its perfection. We had engineers interested in the ergonomics of the work outside themselves. And we had leaders who tracked and reported on the process.
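As one sketch of the bisect idea: a follow-up job that runs only when the integration suite for a batch fails, assuming the last released commit is tagged `last-release` and the suite is driven by `npm test` (the job names and tag are illustrative):

```yaml
# Hypothetical follow-up job: when the batch fails, let git walk the commits
# and name the offending change so it can be pulled out and staged later.
find-offending-commit:
  needs: integration-tests           # illustrative name for the failing job
  if: ${{ failure() }}
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0               # bisect needs full history, not a shallow clone
    - uses: actions/setup-node@v4
      with:
        node-version: 20
    - run: |
        git bisect start HEAD last-release        # bad = this batch, good = last release
        git bisect run sh -c "npm ci && npm test" # exit code decides good/bad per commit
        git bisect log                            # records which commit broke the batch
        git bisect reset
```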
If the definition of vacation is not having to carry keys, the definition of Continuous Integration is deploying on a Friday and turning off your computer. It's not always going to be precisely possible, but we can get close.