Working with Production at Amazon Retail Website

Background

Prior to working at Amazon, I was developing software at a couple of startups, mostly working with products that were in the conceptual phase or the development phase. One of the things I desired the most was to have exposure to products that were live in production or to bring a development project to production. In the end, I got a good taste of that and a lot of hard lessons were learned along the way.

Lesson 1: Pipelines

A good CI/CD (continuous integration / continuous delivery) cycle can be visualized with pipelines. At its most basic level, a pipeline is the workflow that carries committed changes (e.g. in Git repositories) all the way to deployment in Production. The pipeline is like a DAG (directed acyclic graph): a code deployment moves from one stage (node) to the next only if the checks at the current node have passed. Pipelines can have any number of user-defined stages, each with its own semantic meaning. Each stage has thorough checks to ensure that code changes are deployed correctly, and may run additional tests automatically before the change proceeds to the next stage. Pipelines ensure that your newest changes can be tracked, visualized, tested, and made production ready.
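As a rough illustration of the stage-by-stage gating described above, here is a minimal Python sketch. The `Stage` type, the `promote` function, and the idea of a check as a simple callable are all hypothetical names chosen for this example; real pipeline tooling is far more involved, and this models only a linear chain of stages rather than a general DAG.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type for illustration: a check takes a change ID and
# returns True if the check passes.
Check = Callable[[str], bool]


@dataclass
class Stage:
    name: str
    checks: List[Check] = field(default_factory=list)


def promote(change_id: str, stages: List[Stage]) -> List[str]:
    """Walk a change through the stages in order.

    Returns the names of the stages whose checks all passed. The walk
    stops at the first stage where any check fails, so later stages
    (such as Production) never receive the change.
    """
    passed = []
    for stage in stages:
        if not all(check(change_id) for check in stage.checks):
            break
        passed.append(stage.name)
    return passed
```

With a failing check in the middle stage, `promote` returns only the stages before it, which is the property that keeps a bad change away from Production.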

Stages

There is no limit to how many stages you can have, but a Development stage and a Production stage are the typical bare minimum. This separation ensures that if something goes awry with the Development stack, the Production stack is not impacted and continues to operate. For that reason, the Development stage is typically the first stage in the pipeline, while the Production stage is the final stage. The Production stage will often have the most restrictions so that teams can feel assured that the changes to be deployed won't break the Production environment.

More stages can be added, and as a best practice they should be. A Pre-Production environment is commonly used in addition to Development and Production, and it is located in the same network infrastructure as the Production environment. The main benefit is that Pre-Production can access external Production dependencies and services, allowing you to run tests with configurations that are closer to Production. This is normally not the case with the Development environment, whose purpose is to test the first round of bare-minimum changes in an environment isolated from its Production counterpart.

Each environment comprises a completely separate software stack, meaning that it has its own data stores, server configurations, deployment settings, application runtimes, and so on. If, say, you have a Pre-Production stack, it is separate from the Production stack. This allows you to test your new changes in a Pre-Production environment, but with external Production dependencies, which is useful for vetting your newest changes before they are finally pushed to the live Production environment. If things break in the Pre-Production stack, the Production stack is unaffected, which buys you time to fix the root cause before re-testing in the Pre-Production stage.
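To make the idea of separate stacks concrete, here is a small Python sketch. Every value in it is an assumption made up for illustration: the `Stack` fields, the host counts, and the `example.internal` addresses do not come from any real system. It shows each stage owning its own data store while Pre-Production points at Production's external dependencies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stack:
    stage: str
    database_url: str    # each stack owns its own data store
    host_count: int
    dependency_env: str  # which environment's external services it calls


# Hypothetical example values, for illustration only.
STACKS = [
    Stack("Development", "db.dev.example.internal", 2, "development"),
    Stack("Pre-Production", "db.preprod.example.internal", 10, "production"),
    Stack("Production", "db.prod.example.internal", 10, "production"),
]


def shares_data_store(stacks: List[Stack]) -> bool:
    """Return True if any two stacks point at the same data store."""
    urls = [stack.database_url for stack in stacks]
    return len(urls) != len(set(urls))
```

A check like `shares_data_store(STACKS)` returning `False` is one cheap way to confirm that a broken Pre-Production deployment cannot touch Production data.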

Another benefit of a Pre-Production environment is that, since its configuration is closer to Production, you can perform load tests to ensure that the hosts can sustain the TPS (transactions per second) quota even with the newly deployed changes.

Additional stages can also be used based on region. For example, if you have a multi-region service, your pipeline can have one stage per region and enforce deployment to a region with less traffic before deploying to a higher-traffic region.
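One way to derive that ordering is to sort regions by observed traffic. The sketch below assumes you already have a traffic figure per region; the function name, the region names, and the numbers used in the example are hypothetical.

```python
from typing import Dict, List


def region_deploy_order(traffic_by_region: Dict[str, float]) -> List[str]:
    """Order regions from lowest to highest traffic.

    Ties are broken alphabetically so the order is deterministic.
    """
    return sorted(
        traffic_by_region,
        key=lambda region: (traffic_by_region[region], region),
    )
```

For instance, given `{"region-a": 900.0, "region-b": 300.0, "region-c": 50.0}`, the stages would be ordered region-c, region-b, region-a, so the busiest region receives the change last.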

Deployment Checks

After a fresh deployment, it is crucial to run some basic validation checks. These typically consist of the following:

Bake-in Period

The simplest check you can add is a bake-in period after a deployment rolls out. A basic example is waiting for 1 hour after the deployment rollout before proceeding to the next stage in the pipeline. The benefit is that the oncall team has some time to observe the changes before they proceed to production. It's not the most recommended deployment check due to its manual nature, but it beats having nothing in place!
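A bake-in gate can be as small as a timestamp comparison. The following is a minimal sketch, assuming the pipeline records when the rollout finished; `bake_complete` and its parameters are hypothetical names, and the 1-hour default simply mirrors the example above.

```python
from datetime import datetime, timedelta


def bake_complete(
    deployed_at: datetime,
    now: datetime,
    bake_time: timedelta = timedelta(hours=1),
) -> bool:
    """Return True once the bake-in period has fully elapsed."""
    return now - deployed_at >= bake_time
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the gate easy to test.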

Heartbeat Checks

Deploying new changes to prod can cause wide-scale outages. Therefore, we can combine the bake-in period with periodic health checks that monitor the health of our application. If the health checks fail, the pipeline automatically starts a rollback deployment, which reverts the breaking changes and blocks the CI/CD pipeline from promoting them further.
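The combination of a bake-in period and health checks might look like the sketch below. It is not how any particular deployment system works: `monitor_deployment`, the injected `is_healthy` and `rollback` callables, and the rule of rolling back after three consecutive failures are all assumptions made for this example.

```python
import time
from typing import Callable


def monitor_deployment(
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    checks: int = 12,
    interval_seconds: float = 300.0,
    max_consecutive_failures: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a health check during the bake-in period.

    Returns True if the deployment stayed healthy for every check.
    Calls rollback() and returns False as soon as the health check has
    failed max_consecutive_failures times in a row.
    """
    failures = 0
    for _ in range(checks):
        if is_healthy():
            failures = 0  # a healthy response resets the streak
        else:
            failures += 1
            if failures >= max_consecutive_failures:
                rollback()
                return False
        sleep(interval_seconds)
    return True
```

Requiring several consecutive failures is one design choice for avoiding rollbacks on a single flaky response; a stricter pipeline could set `max_consecutive_failures=1`.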

Host Metrics Health

Other triggers for a rollback deployment can include any host metrics that are vital to uptime. For example, low disk space, high CPU usage, or high memory usage can indicate a server that is about to crash and go down, causing your end users to suffer.

These things CAN happen! One issue I've run into in production involved log files that accumulated too rapidly after a configuration change, consuming all of the disk space within a week. Consequently, the servers crashed, which caused an outage in production at midnight for several hours!
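A host-metrics trigger of this kind can be sketched as a threshold comparison. The metric names and limits below are hypothetical values picked for illustration, not recommendations, and the metrics themselves are assumed to come from whatever monitoring agent you already run.

```python
from typing import Dict, List

# Hypothetical thresholds, as percentages; tune these for your own hosts.
THRESHOLDS: Dict[str, float] = {
    "disk_used_pct": 90.0,
    "cpu_pct": 85.0,
    "memory_pct": 90.0,
}


def breached_metrics(
    metrics: Dict[str, float],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> List[str]:
    """Return the sorted names of metrics at or above their threshold.

    Metrics missing from the input are treated as 0.0 (not breached).
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0.0) >= limit
    )
```

A non-empty result could then be used as a rollback trigger or an alarm. In the log-file incident above, a disk usage threshold like this would have fired well before the disks were completely full.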

Validation Steps

After the deployment step at any given stage, a series of validation steps will be run to ensure that the deployment was successful.

These would consist of the following:

  • Integration Tests
    • These tests ensure that the newly deployed change has no regressions with the core functionality of the application on a live environment. This often means testing against live upstream services (e.g., product API service, A/B testing service).
  • Accessibility Tests
    • For user-facing websites or mobile apps, accessibility (often abbreviated as A11y) is very important. This usually involves a test on the front-end side of your application; for browsers, a Selenium-based testing platform is frequently used to test your page on different browsers.
  • Locale tests
    • Having tests for each locale can be very important. For example, what if the Spanish locale of your website has incredibly long text that gets cut off by some HTML elements? Or maybe a translation is completely missing, causing the website to throw fatal errors?
  • Load Tests
    • These tests are generally added for Pre-Production stages, where the environment closely resembles the Production stage. It doesn't make much sense to add them at the Production stage, because they will impact your Production environment and may cause problems for real-world users. It also doesn't make much sense to add them to Development stages (or anything before that) since these environments are not a good representation of what you will use in Production.
    • Load tests give you an idea of how much capacity your service may need to handle current and peak Production traffic. They can also test various endpoints and give you useful metrics for debugging errors, such as the count/rate of HTTP response codes (e.g. 400 Bad Request, 401 Unauthorized, 403 Forbidden).
  • Bake-in Step
    • Although these steps should only be run after the deployment to each server in the cluster is successful, you may still want an extra delay or bake-in time before starting other steps (such as integration tests).
    • Bake-in time can give your servers some time to run start-up scripts. For example, a script that warms up the server's cache via predetermined API calls.
    • Another use case is if your deployment infrastructure does not have a fixed number of hosts and has the ability to auto-scale (e.g. Amazon ECS). Scaling up after the initial deployment may take additional time, so a bake-in time may help to ensure that you are running a good number of hosts before running heavy validation steps such as load tests or integration tests.
  • Other external deployments
    • Sometimes you might need to deploy additional things to other pipelines or workflows that cannot be tracked or managed in the current pipeline. For example, after you finish the deployment step in your Production stage, you may want to upload these deployment logs to some external metrics/analytics service for housekeeping.
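To illustrate the load test metrics mentioned in the list above, here is a minimal Python sketch that tallies HTTP response codes and computes an error rate. It assumes the load test tool can hand you the status code of each response; the function name and the choice to count every code of 400 or above as an error are assumptions for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_status_codes(codes: Iterable[int]) -> Tuple[Dict[int, int], float]:
    """Count responses per HTTP status code and compute the error rate.

    Any status code >= 400 is counted as an error. Returns a
    (counts, error_rate) pair; the error rate is 0.0 for empty input.
    """
    counts = Counter(codes)
    total = sum(counts.values())
    errors = sum(n for code, n in counts.items() if code >= 400)
    error_rate = errors / total if total else 0.0
    return dict(counts), error_rate
```

A validation step could then fail the stage when the error rate exceeds a limit you choose, rather than relying on someone to read the load test report by hand.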

Time Window Blockers

TODO

Lesson 2: Metrics and Alarms

TODO

Lesson 3: Security and Patches

TODO

Lesson 4: Worst Case Scenarios

TODO