One of the riskiest parts of software delivery is the production deployment.
It's the first time in the delivery cycle that code changes are exposed to end-users. If the application has a significant level of traffic, it could be the first opportunity to see how it performs under load. It's that scary moment when you open your code up to the scrutiny of a firehose of users, bots and services, hitting endpoints in ways you hadn't expected. It can be a very stressful point in the delivery process if you do not approach it in the right way.
In this article, I am going to take you through the techniques we use to reduce deployment risk and explain how we've gone from an infrequent, stressful and error-prone deployment process to one that lets us execute production deploys at the click of a button. I'll show you how we roll back changes, architect our codebase and enforce delivery practices to reduce the risk, enabling us to deploy hundreds of applications into production every week.
Before getting started, it might be worth reading the blog post my colleague David wrote on the Continuous Delivery practices that we run. This should help you to understand our approach to software delivery.
We've been using the Blue-Green technique since 2010 and we've now got ~60 applications running Blue-Green deploys every week. This process is not perfect, but it has done wonders for our production deployment process, and enables us to get features delivered faster, without the fear of getting the changes into production.
I'll provide a short introduction to Blue-Green deploys, for those who have not heard about it:
A Blue-Green deployment is a technique in which you have two identical production environments that you can deploy an application to. One environment is named Blue, the other Green. You route live users to one environment, let's say Blue, and deploy the new application into Green. Once you've deployed and tested the new application on Green, you flip the traffic and direct the live traffic to the Green environment. The Green environment becomes the new live, and next time you'll deploy to Blue.
There are many benefits to Blue-Green deploys, such as an easy rollback process if the deployment doesn't go as expected, and being able to deploy to production with zero downtime. In the above example, if we had seen an increase in error rates after the flip, we could route the live traffic back onto Blue without any downtime.
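To make the flip and the rollback concrete, here is a minimal Python sketch of the idea. It is an illustration only, not our actual tooling: the `BlueGreenRouter` class, its method names and the version labels are all hypothetical, and a real setup would change load balancer or route configuration rather than an in-memory field.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Hypothetical model of which environment receives live traffic."""

    live: str = "blue"
    versions: dict = field(default_factory=lambda: {"blue": "v1", "green": "v1"})

    @property
    def idle(self) -> str:
        # The environment that is not currently receiving live traffic.
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> str:
        # New releases always go to the idle environment, never to live.
        target = self.idle
        self.versions[target] = version
        return target

    def flip(self) -> str:
        # Send live traffic to the idle environment. The previous live
        # environment is left untouched, which keeps rollback cheap.
        self.live = self.idle
        return self.live

    def rollback(self) -> str:
        # Rolling back is just flipping again, onto the old release.
        return self.flip()
```

With a router starting on Blue, `deploy("v2")` targets Green, `flip()` makes Green live, and `rollback()` returns traffic to Blue, which is still running the previous version.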
There are many practices which we have adopted to enable us to get to a Blue-Green deployment process. These practices have also given us the ability to scale horizontally easily, as the underlying architecture principles are the same.
The application must not persist state to the server. We ensure things like cookies, cache files, logfiles, UGC, etc. are backed by external services like memcached, Amazon S3 and Papertrail.
We use a shared database between the Blue / Green production applications. All database schema changes are stored in migrations, which can be rolled forwards and backwards. Forward rolling DB migrations should be written in such a way as to not affect the currently live application. Backward rolling migrations should reverse the effect of the forward migration.
If you need to perform destructive actions against the database, defer them until neither Green nor Blue has code relying on the destroyed tables. Releasing small and often helps to achieve this.
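As a rough illustration of a reversible, non-destructive migration, the Python sketch below uses the standard library's `sqlite3` module. The table name `user_preferences` and the function names are assumptions made up for this example; in practice you would use the migration tooling of your own framework.

```python
import sqlite3


def migrate_up(conn: sqlite3.Connection) -> None:
    # Additive change: a new table that the currently live code never
    # reads, so the live application is unaffected by rolling forward.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_preferences ("
        "user_id INTEGER PRIMARY KEY, theme TEXT)"
    )


def migrate_down(conn: sqlite3.Connection) -> None:
    # Reverses migrate_up and nothing else.
    conn.execute("DROP TABLE IF EXISTS user_preferences")


def table_names(conn: sqlite3.Connection) -> set:
    # Helper for inspecting the schema before and after a migration.
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}
```

Because the forward step only adds to the schema, both Blue and Green can keep running against the shared database while the new code is tested. Dropping a table that old code still reads would belong in a later, separate migration.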
When features are not ready for consumers to see, we favour dark launching them over delaying a deployment. We find this is a safer way to get code into production, and it minimises the amount of code being deployed when the feature actually goes live.
All server configurations must be stored in code, so they are repeatable and automated. We primarily use a hosted PaaS called Cloud Foundry, which allows us to define a particular buildpack in application manifests. However, whether we're provisioning our own Cloud Foundry infrastructure or custom infrastructure on AWS, we'll make sure we use tools like Chef or Puppet, so infrastructure is identical every time.
Small Changeset Mindset
Commit early and often. Deploy as frequently as possible. These are our main guidelines for our team and customers. If a change has been sitting in a staging environment for more than a few days, the developers are likely to have forgotten the implementation details and you're increasing the risk around a production deploy. The more fear there is around deployments, the less likely they are to happen. We've got scripts which look at the number of days since the last deploy and send alerts to our HipChat room if we exceed a threshold.
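A check like that can be very small. The Python sketch below shows the general shape under stated assumptions: the function names and the three-day threshold are hypothetical, and posting the message to a chat room is left out because it depends on the chat service's API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_DAYS_WITHOUT_DEPLOY = 3  # hypothetical threshold, tune per project


def days_since_deploy(last_deploy: datetime, now: datetime) -> int:
    # Whole days elapsed between the last deploy and now.
    return (now - last_deploy).days


def stale_deploy_message(
    app: str,
    last_deploy: datetime,
    now: datetime,
    threshold: int = MAX_DAYS_WITHOUT_DEPLOY,
) -> Optional[str]:
    # Returns an alert message when the threshold is exceeded, else None.
    days = days_since_deploy(last_deploy, now)
    if days > threshold:
        return f"{app} has not been deployed for {days} days"
    return None
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test.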
Where possible, implement feature toggling so you can get features out to a subset of users (such as the customer and engineering team), and give yourself an opportunity to address any bugs and exceptions before they make their way through to the full user base. This approach increases the engineering effort, but is a useful technique if the application is of sufficient scale.
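A feature toggle of this kind can be sketched in a few lines of Python. The `FeatureToggle` class and the user identifiers below are hypothetical; real applications typically store toggle state in configuration or a database so it can change without a deploy.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureToggle:
    """Hypothetical toggle: on for everyone, or only for listed users."""

    name: str
    enabled_for_everyone: bool = False
    allowed_users: set = field(default_factory=set)

    def is_enabled(self, user_id: str) -> bool:
        # A dark-launched feature is visible only to the allowed subset
        # until it is switched on for the full user base.
        return self.enabled_for_everyone or user_id in self.allowed_users


# Usage: expose a new checkout flow to the customer and engineering team only.
new_checkout = FeatureToggle("new_checkout", allowed_users={"customer-1", "engineer-7"})
```

Going live then becomes a matter of setting `enabled_for_everyone` to `True`, with no new code being deployed at that moment.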
Rollback If You Can't Roll Forwards
If something goes badly wrong with a Blue-Green deploy, you've got the option of a full rollback. In our experience, this is rarely needed, but it's nice to know it's available for those occasions. We tend to favour rolling changes forward where possible, as it's easier to deal with problems when they crop up than to defer them until later.
Ensuring the Blue & Green environments have separate cache servers is an important lesson that we learnt. You should be able to clear individual cache keys or entire application caches without impacting the other environment.
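The property we are after can be shown with a small Python sketch. Here plain dictionaries stand in for the two cache servers, and the `EnvironmentCaches` class is a hypothetical name used only for illustration.

```python
class EnvironmentCaches:
    """One independent store per environment (stand-in for separate cache servers)."""

    def __init__(self, environments=("blue", "green")):
        self._stores = {env: {} for env in environments}

    def set(self, env, key, value):
        self._stores[env][key] = value

    def get(self, env, key, default=None):
        return self._stores[env].get(key, default)

    def delete(self, env, key):
        # Clear a single key in one environment only.
        self._stores[env].pop(key, None)

    def clear(self, env):
        # Clear the whole cache of one environment only.
        self._stores[env].clear()
```

Clearing Green's cache after a deploy leaves whatever Blue has cached intact, so the live environment does not suddenly face a cold cache.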
Once an application is deployed to live, it's important that it's monitored closely for potential problems. We've tried a number of different techniques and tools and currently favour doing the following:
Application Performance Monitoring
We watch New Relic's key performance metrics in real-time and have alerts set up to notify us if they exceed certain thresholds whilst we're not in front of our screens. If we've deployed a brand-new feature, we'll closely monitor the controller actions to see where the application is spending time and quickly detect any potential problems, such as slow-running SQL calls or unnecessarily complex Ruby.
We always look at the exception notifications. These are an early indicator of significant problems, so we like to see exceptions infrequently. If the exception notifications increase to a significant volume we'll investigate further and decide if they warrant a rollback or corrective action.
We also monitor the application logs closely. We set up alerts within tools like Papertrail and Loggly to report any unexpected behaviour we notice. This becomes a useful technique for trying to resolve issues that happen infrequently.
Our deployment process is great, though we feel like there is room for improvement. We're currently exploring the following techniques to help us improve it further:
We see Canary Releases as the evolution of Blue-Green deploys and the obvious next step in improving our production deployment process. If you're not familiar with Canary Releases, the idea behind them is that you deploy a set of changes to a small number of end users, then roll them out to an incrementally larger userbase until all users have access. It's the technique that very large scale sites, like Facebook, use. If you're interested in learning more, you should watch this video of Chuck Rossi from Facebook talking about their release process.
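One common way to pick the canary group, sketched below in Python, is to hash each user into a stable bucket and compare it with the current rollout percentage. This is an assumption about how such routing could work, not a description of Facebook's or our own implementation, and the function names are hypothetical.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    # Stable bucket in the range 0..99: the same user always lands in
    # the same bucket, so their experience does not flap between releases.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def serves_canary(user_id: str, rollout_percent: int) -> bool:
    # Users whose bucket falls below the rollout percentage get the new release.
    return canary_bucket(user_id) < rollout_percent
```

Because the buckets are stable, raising `rollout_percent` only ever adds users to the canary group; nobody who already has the new release is moved back.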
We are looking to introduce a Smoke Test that runs against Blue-Green and verifies that the build has deployed successfully. At present, we manually check the latest Blue-Green deploy before routing live traffic through. We see potential value in automating this process by having a test suite that runs in the production environment and verifies key features are working. If the Smoke Tests fail, we would prevent this build from being routed into live.
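The gating logic could look something like the Python sketch below. It is a sketch of the idea rather than a finished tool: the function names are hypothetical, and each check is simply a callable that returns `True` on success, where a real suite would make HTTP requests against the freshly deployed environment.

```python
from typing import Callable, Dict, List


def run_smoke_tests(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    # Runs every check and returns the names of the ones that failed.
    failures = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            # A check that raises counts as a failure, not a crash.
            passed = False
        if not passed:
            failures.append(name)
    return failures


def can_route_live(checks: Dict[str, Callable[[], bool]]) -> bool:
    # The build may only receive live traffic when nothing failed.
    return not run_smoke_tests(checks)
```

Running all checks instead of stopping at the first failure gives a fuller picture of what is broken before anyone decides whether to fix forward or abandon the build.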
If you're looking for further information on Blue-Green deploys, check out Martin Fowler's introduction to Blue-Green Deploys here or this article by Transport For London on how they utilise Blue-Green deploys on tfl.gov.uk.
You could also check out a RubyGem we've written called CF:Deploy Gem, which handles Blue-Green deploys to a Cloud Foundry environment, including flipping applications into live and rolling back (if necessary!).