Overhauling Loco2's hosting infrastructure with AWS, Docker and Terraform

16 December 2016

Recently I worked on a major overhaul of the infrastructure hosting Loco2.com. In this post, I’ll dig into the details of what we did and why.

History
Infrastructure as code with Terraform
Environment-specific AWS accounts
Transitioning to Virtual Private Cloud
Moving the RDS database into VPC
Provisioning the EC2 Container Service cluster
How ECS runs our containers
Building the Docker image
Deploying the new image
Rolling deploys in ECS
A trick to build Docker images faster
Memory leaks in long-running processes
Developer access to the production environment
Conclusion

History

Loco2 has been hosted on Amazon Web Services for years, but we used it rather like a traditional hosting provider. We had long-running instances which had to be manually provisioned and maintained.

We used Chef to manage the process of configuring servers, but the experience was far from perfect:

Since we provisioned new servers infrequently, our chef scripts were often broken when we came to need them, due to external dependencies changing. As well, unless you lock everything down really tightly, you can easily end up with slightly different versions of packages on different servers, depending on when they were provisioned. This can lead to nasty surprises.
Configuration changes, such as updating the version of Ruby we used, could be complicated to orchestrate. We had to install the new Ruby version alongside the old one, change our application to start using it, and then remove the old version.
We had to maintain a “chef server” component (although we later started using the third-party tool knife-zero to avoid this, since we did not have a large number of instances to manage)

Since our servers were long-running, from time to time we had to give them manual attention to fix broken software, prevent disks filling up, and so on. Although rebuildiing servers was somewhat automated via Chef, it was still time-consuming and error prone. We wanted a system where servers could be added or removed quick and easily.

We were still on EC2 Classic, which was the only option when we first set up our AWS infrastructure. This meant that we missed out on features only available in EC2 Virtual Private Cloud, such as faster, cheaper instance types and better network security.

Infrastructure as code with Terraform

In the past we’d made changes to our cloud resources manually through the AWS Console. This was error-prone, and made it hard for one developer to understand when and why another developer had made a certain change.

To solve this, I introduced a relatively new but powerful tool, Terraform. Terraform allows us to declaratively specify our infrastructure as code, which we store in a git repository. Now, we can see exactly when a certain configuration change was made, who made it, and why.

When we change the configuration code, Terraform finds the differences between our desired infrastructure and our actual infrastructure, and performs the necessary modifications.

It also allows us to refer to values by their logical names. For example, rather than having to find the DNS name of the load balancer, and paste it in to a field to configure a CNAME record, we can just write our config file to refer to aws_alb.web.dns_name, which will automatically be replaced with the relevant value when Terraform runs.

Environment-specific AWS accounts

Our legacy infrastructure had one AWS account for all staging and production resources. In order to enforce better separation between the two environments, I set up completely separate staging and production accounts.

To avoid having to set up users in each separate account, we use our existing account as a gateway. Users log in to this gateway account and can then assume a role in the staging or production account which allows them to administer resources within that account.

The beauty of this approach in conjunction with Terraform is that it allows us to test a change in staging and then when we’ve seen that it works, apply the exact same change to our production account.

Our Terraform repository is laid out like this:

gateway/ - Resources in our gateway account, such as IAM users and anything else we haven’t yet transitioned to environment-specific accounts
- terraform.tfstate - The state file for the gateway account
- *.tf - Various Terraform configuration files
shared/ - Terraform module containing all the shared configuration for the staging and production accounts
- We use lots of sub-modules to keep things organised
staging/ and production/
- terraform.tfstate - The state file for the account
- shared.tf - Config file to call the shared module, passing some variables to tweak things like the EC2 instance types we want to use

When we’re in staging/ or production/, the Terraform AWS provider is configured with an assume_role block, causing Terraform to operate on the correct account.

Transitioning to Virtual Private Cloud

Our services (app servers, PostgreSQL, Redis, and so on) would gradually be migrated to a VPC in the new AWS accounts, but during the transition we still needed to have communication to and from our EC2 Classic instances.

To achieve this, multiple steps were required.

First, I created a new VPC inside our gateway account for each environment. Then, I connected our EC2 Classic instances to those VPCs via ClassicLink. This enables private network traffic to flow between EC2 Classic instances and a VPC.

This VPC was within our gateway account though; we still needed to be able to communicate through to a different VPC inside the staging or production account. This is done with a VPC peering connection which allows VPCs to exchange private network traffic with each other and can be configured to support ClassicLink traffic.

A diagram showing the ClassicLink and peering connection configuration

Moving the RDS database into VPC

Our PostgreSQL database was provisioned using the managed Relational Database Service, but unfortunately it was also using EC2 Classic. Our options for connecting to it from our new VPC instances were sub-optimal:

Sending traffic over the public internet would compromise on security, and incur high bandwidth costs
We could provision a proxy in EC2 Classic and then connect to the proxy via ClassicLink, but this would be a single point of failure unless we embraced the complexity of having multiple proxies and dealt with switching between them. Even so, we’d still face some extra bandwidth costs (although not as expensive as transfer over the public internet)
Ultimately, keeping the database in our gateway account would prevent us from achieving the dream setup of having complete separation between our staging and production environments

Therefore, I decided to migrate the database to the new accounts.

I was concerned about significant downtime if we went the route of snapshotting the database and then restoring the snapshot in the desired account. So I spent some time trying to configure the Database Migration Service, which promised a zero-downtime migration.

This turned out to be far more complicated than the documentation would have you believe, and after lots of messing around the final nail in the coffin was the realisation that DMS does not properly support complex data types such as json, hstore and arrays (it converts them to text objects).

I decided to test the snapshot-and-restore approach to see how long it would take, and discovered that it would only be an hour or so. (I should really have just tried this in the first place.) Therefore I woke up at 3 AM one morning and took the site down to do this. Not ideal, but acceptable.

Unfortunately, following the migration we started to see quite a bit of latency on disk I/O, which slowed the site down. This caused lots of stress and head-scratching, but ultimately we ended up weathering the storm and the problems eventually settled down after a few days.

This is certainly a downside of RDS; while having a managed service is great, when things aren’t working well it’s very hard to dig into the details of why, or to know whether it’ll eventually sort itself out. Over the years I’ve realised the importance of testing every change (such as a database version update) against a copy of the production database before implementing it for real. But even so, problems like this can still crop up.

Provisioning the EC2 Container Service cluster

We decided early on that we’d like to use Docker for deploying our application. The benefits of Docker have been written about in many other places so I won’t go into detail here, but the aim was to make it easier to change our application’s runtime environment, make deployment more robust and predictable, and to avoid having to maintain complex, long-running EC2 instances.

I considered Elastic Beanstalk and EC2 Container Service as options for managing our Docker containers, and settled on ECS as it seemed a more flexible approach and less tied to a specific blessed “AWS way” of doing things.

(Whilst there is plenty of excitement around Kubernetes at the moment, using Kubernetes on AWS would require us to manage it ourselves, whereas ECS is a managed service. If I were building a new cloud deployment from scratch, I’d certainly look closely at the managed Google Container Engine though, which is built on Kubernetes.)

ECS runs Docker on what it calls container instances which are grouped in a cluster. You must provision these container instances yourself, which we do via an Auto Scaling Group. This enables us to specify “there must be X instances running in the cluster”, and EC2 will take care of starting and stopping instances to achieve this. In the future, we could implement dynamic auto-scaling where we increase or decrease the required number of instances in response to real time load. For now, it simply allows our cluster to auto-heal if instances die for any reason. The Auto Scaling Group also balances the instances over two availability zones, ensuring that should one AZ fail, our site will continue running.

Amazon provides VM images specifically for use with ECS which have the ECS agent pre-installed. We use these images, in conjunction with some cloud-init config which does some lightweight provisioning such as hooking up Papertrail for logging and Librato Agent for more detailed metrics. (In the future it may be better to create our own derivative machine images via Packer, which would make it faster and more reliable to bring new instances up.)

How ECS runs our containers

A Docker container running within ECS is called a task. To tell ECS what container image to use, what command to run, how much memory to allocate, and so on, you create a task definition. The task definition specifies the parameters for running a container, and the task actually runs it.

You run a task on a cluster, but you have no control over which container instance it actually runs on; ECS will pick one based on available system resources.

If you want a certain task to always be running, you create a service. For example, we have a service specifying that we should always have X instances of the web server task defintion running. If one of those tasks dies for any reason, ECS will notice and magically start a new one. As with Auto Scaling Groups for EC2 instances, it is also possible to implement auto scaling for ECS services, enabling you to dynamically increase or decrease the number of containers you’re running in response to demand.

Loco2 has two clusters: web and worker. The web cluster has one service, which runs Puma. The worker cluster has three services: worker-core, worker-maintenance-reports and worker-maintenance-other. These all run Sidekiq, but each service picks jobs off a different queue and is set up slightly differently. (As the names suggest, worker-core is the main event and the others deal with less important jobs.)

A diagram showing how our ECS clusters, services and task definitions are related

Building the Docker image

When new code is pushed to Loco2’s git repository, Travis CI runs the tests and builds a Docker image. The image contains everything needed to run the application:

A Ruby binary
Extra operating system packages (things like libpq-dev)
All the gems in the bundle
Precompiled assets
A LOCO2_COMMIT environment variable set via a build arg so we know exactly which commit we’re on

We tag the image with the git commit SHA, as well as with latest (for convenience). When we deploy, we use the git commit tag; this allows us to lock to an exact version of the code. Otherwise, we could have a situation where the latest tag is updated, one of our tasks gets restarted by ECS, and then we have a newer version of the code unintentionally deployed. (This system also makes it crystal clear what version of the code we’re running.)

Once we’ve built the image, we use docker run to invoke rails runner '' (with RAILS_ENV=production). This is a simple smoke test to ensure our Docker image can boot Rails without trouble. The image is then pushed to Docker Hub.

Deploying the new image

When we’re ready to deploy, we type /dockbit deploy production in Slack. This triggers a deployment pipeline in Dockbit. The pipeline has two steps:

Wait for the Travis CI build, and ensure that it passed
Run ./bin/deploy $DOCKBIT_DEPLOYMENT_SHA, which invokes a bash script in our repository

The ./bin/deploy script looks like this. The script defines some functions, and then invokes them at the bottom:

check_image_exists
prepare_deploy

run_concurrently update_service web web
run_concurrently update_service worker worker-core
run_concurrently update_service worker worker-maintenance-reports
run_concurrently update_service worker worker-maintenance-other

wait_for_children

There are two preparation steps:

Check that a Docker image for the commit we’re trying to deploy actually exists in our Docker Hub repository
Run rake deploy:prepare in our new Docker image, which allows us to run arbitrary application code on deploy. We use this to run database migrations, amongst other things. This works by running a task on our worker cluster.

If either of these steps fail, we’ll abort the deployment.

Otherwise, we concurrently update each of our ECS services to tell them we want to start running a newer version of the code. Here’s how we update each service:

Download the JSON describing the latest revision of the task definition
Update the Docker image reference in the JSON to point to the git commit tag we’re deploying (e.g. loco2/loco2:f423bbd8ba70446e09c44848b687512741e54814).
Upload the new JSON, creating a new revision of the task definition
Update the ECS service to tell it use the newer revision of the task definition
Wait for the ECS deployment to finish (more on this below)

Rolling deploys in ECS

Updating a service causes ECS to orchestrate a rolling deploy. This means that there is never a point where zero tasks are running and the application is inaccessible. Instead, ECS gradually starts new tasks and stops old ones until no tasks using the previous task definition are still running.

ECS makes decisions about how to do this based on available memory on the container instances (a new task cannot be started if there is not enough memory available for it), as well as your minimum and maximum healthy percentage settings.

The minimum and maximum healthy percentages govern how the rolling deploy will proceed. If we configure a service with 10 desired tasks running, a minimum healthy percentage of 50% and a maximum of 200%, then during a deploy we may have anywhere between 5 and 20 tasks running. If we set a minimum healthy percentage of 100%, then ECS will need to start new tasks before it stops old ones; this only works if there is sufficient available memory on the container instances.

One crucial fact about rolling deploys is that there may be two versions of the application running at the same time. This means that any database migrations applied in the new deploy must be backwards-compatible with the previously-deployed version of the code, otherwise there will probably be errors.

When a web request comes in to our Application Load Balancer during a rolling deploy, it may be routed to a task running either the old code or the new code. If a user gets routed to one of the new tasks, we don’t want them to get routed to one of the old tasks on a subsequent request, otherwise they may end up inconsistently seeing different versions of a page. To solve this, we use sticky sessions, which ensure that the load balancer always routes the same user to the same ECS task (so long as it’s still running).

Rolling deploys are a fantastic feature of ECS. There is a lot of complex logic going on here which we can just rely on ECS to implement.

A trick to build Docker images faster

Building our Docker image on Travis CI is quite slow. While Docker implements a build cache to maximise the efficiency of rebuilding images, this is irrelevant on Travis CI since we’re always building the image in a completely new VM environment with no data cached.

The most time-consuming part of our image build is installing the bundle, which doesn’t only come down to network speed but also the time taken to install various gems with native extensions to compile.

To speed this up a bit, we have an automated build on Docker Hub which builds an image called loco2/loco2_base every time we push to our git repository.

When we build the image for a given commit, we use this base image as a starting point. However, it probably doesn’t contain the absolute latest code, and our bundle or assets may have changed in the newer code. So we replace all the source code and then re-install the bundle, regenerate the assets and so on.

The Dockerfile we build on Travis CI looks like this:

FROM loco2/loco2_base:latest

# Set the commit ID in an env var
ARG LOCO2_COMMIT
ENV LOCO2_COMMIT $LOCO2_COMMIT

# Clean the source tree so that if any files have been deleted after the base
# image was built, they will get removed from from the final image.
RUN docker/clean.sh

# Now, re-add all the source files that still exist.
ADD . /loco2

# Update any generated files to match the updated source tree
RUN docker/prepare.sh

The docker/clean.sh script looks like this:

#!/bin/bash
set -e

# Preserve generated files so we don't have to generate them again when they
# are unchanged.

ls -A | egrep -v "tmp|public|vendor" | xargs rm -r

pushd public/
ls -A | egrep -v "assets" | xargs rm -r
popd

And the docker/prepare.sh script looks like this:

#!/bin/bash
set -e

bundle check || bundle install --deployment --clean --without='development test' --jobs=4
cp docker/database.yml config/
RAILS_ENV=production rake assets:precompile

Since we preseve the bundled gems and compiled assets from the base image, most of the time docker/prepare.sh runs quickly. If the bundle or the assets have changed, it’ll be a bit slower but still nowhere near as slow as starting from scratch.

This approach makes our image builds faster, but it’s still not exactly instantaneous. There is still quite a lot of time spent actually pulling the loco2/loco2_base image in the first place. Also, if we change our base image Dockerfile (e.g. to add a new operating system package, or upgrade the Ruby version) we must wait for it to be rebuilt before we build the final image on Travis CI.

Why don’t we just use a Docker Hub automated builds for our final image, rather than building it on Travis CI?

It’s slower than building on Travis CI
There’s no support for build args, which we use to set our LOCO2_COMMIT environment variable
There’s no way to tag our images with the git commit, which we need so we can be precise about which version we’re deploying

Memory leaks in long-running processes

As mentioned previously, we use Sidekiq to process background jobs. Unfortunately, over time, the memory used by our Sidekiq processes seems to grow indefinitely (or at least grow pretty large). (This is probably not the fault of Sidekiq itself, but of our own code or code in libraries we’re using.)

While we would ideally spend time finding and fixing the leaks, it’s pretty hard to prioritise this sort of work. So for a long time we have done what many others do and used monit to keep an eye on the memory usage of our Ruby processes, gracefully restarting them when it gets too much.

But using monit doesn’t make a lot of sense in the Docker world, since the container orchestration system (ECS) is already responsible for monitoring our containers.

Docker recently added a “health check” feature which enables you to specify a health check command which will be run inside the container to determine its health. We could implement this to periodically check the memory usage of our process and report the container as unhealthy if it gets too high.

This is all fine and dandy but Docker doesn’t actually do anything about the health check status; that’s really up to the container orchestration system. Ideally, ECS would monitor the Docker container health check status and gracefully restart tasks which are unhealthy.

Unfortunately, ECS doesn’t currently support Docker health checks, though there is an open feature request for it. So we need another solution.

After casting around to try to find out how others were dealing with this problem I drew a blank, so ended up writing a simple memory monitoring script, which Loco2 has made available as open source.

The script runs a program and keeps an eye on its memory use. If it gets too high, it sends a SIGQUIT to the program. If the program doesn’t exit after a certain timeout, it sends a SIGKILL. That’s it - once the program has exited we can rely on ECS to notice that the task died and start a new one, so we don’t need to implement any of our own restarting logic.

It works like this:

memory_monitor --limit 1500 --interval 1 --timeout 30 sidekiq ...

This invocation would run sidekiq, monitoring its resident set size every second, and stopping it within 30 seconds if memory use exceeds 1,500 MB.

Developer access to the production environment

From time to time developers inevitably need to get into the production environment to run rails console, or psql, or a rake task. In our legacy infrastructure we used to just ssh into a server and cd to the directory where the application was. But now we needed to a way to get into a Docker container.

While it would theoretically be possible to do this on our ECS container instances, I decided to provision a dedicated admin instance for these sorts of tasks in order to avoid tying up resources on user-facing servers.

This instance is based on a stock Amazon Linux machine image, and we do some light provisioning via cloud-init to:

Install Docker
Enable SSH access for admin users via public keys stored in IAM, using an approach similar to this (it’s is kind of a hack, but works well)
Secure SSH access with two-factor authentication provided by Duo Unix
Configure a cron job to periodically pull our latest Docker image and remove any old ones

Then, we can access a container in production like this:

$ ssh -t [server] \
    docker run --rm -it \
    -e RAILS_ENV=production \
    loco2/loco2:latest \
    [bash|rails c|psql|...]

(In practise, we wrap this invocation up into a little script for convenience.)

Conclusion

There were quite a lot of steps to get to this point, but I think Loco2 now has a much more robust and maintainable infrastructure. I was really impressed by Terraform and it’s nice to see how quickly it is maturing. ECS is good at what it does too, but I think there are lots of ways it could improve.

When the time came to switch over to the new system it thankfully happened with very little drama!

No doubt there are many different ways of solving the problems we encountered, but I hope this provides a useful insight into the solutions arrived on at Loco2. For me this whole experience underlined how difficult it can be to iterate existing, mature systems with lots of real users vs building something from scratch.

Comments

I'd love to hear from you here instead of on corporate social media platforms! You can also contact me privately.

Juan Paulo Breinlinger

8 years ago

Great writing and loads of good practices here. A couple of comments would be:
# ls -A | egrep -v "tmp|public|vendor" | xargs rm -r
would fail if any of the files has a space which is probably unlikely but is always better to be on the safe side.
Additionally have you tried serverspec with docker? Is actually quite good at testing infrastructure. I developed simple gem that:
- Builds the docker image
- Run a new container from that image
- Run serverspec tests through rspec to test the container use cases.
- Push container to docker hub if tests pass
Cheers!

Juan

Jon Leighton

8 years ago

Thanks Juan! Good point about files with spaces, although since we're using "set -e" we'd hopefully notice if that happened.

I hadn't heard of serverspec, but it looks interesting so thanks for the pointer. What's your gem called?

Incidentally I did experiment with running all our tests on the CI inside the Docker container (i.e. "docker run rspec ..."). Unfortunately it slowed down the CI quite a bit and we decided against it. But it's theoretically a nice idea to increase confidence that the container image works properly.

Juan Paulo Breinlinger

8 years ago

Is something I use for my customers but is not documented. You can take a look at it here:
https://github.com/BreinsNe...
Here is a snapshot on how we build our docker-db container using jenkins:
https://uploads.disquscdn.c...
and here is a gist of the spec file:
https://gist.github.com/jua...
Using this process any technical person in the company can push changes in the infrastructure with a high level of confidence.
If you need more information on how to use it just let me know
Regarding speed, that's why I always prefer to use jenkins. Not that I'm against SaaS but having full control over your build process is nice. Jenkins 2.0 is just amazing and using docker + serverspec you can keep up with services updates quite easy.
Should you need any further help with all this just let me know. Good luck with your new projects!

Juan Paulo Breinlinger

8 years ago

Ah BTW, for deleting files there is a trick which is:
ls -A|tr '\n' '\0'|xargs -0 rm -rf
That'll do

Rob Shepherd

8 years ago

A comprehensive write up, thanks for sharing Jon!

Good heads up on the shortcomings of DMS. That's likely saved me from repeating the same experiments to find the same outcome, cheers!

The pre-built image base to speed things up is a technique i've been fiddling with for java related projects and it's dependency fetching during builds on cache-less containers.
I've been working on a similar solution - albeit using different tools, but the same angle.

Nice one.

Tamsin Slinn

8 years ago

Really interesting, thanks Jon!
I am looking at doing something similar for managing releases with docker.
Once question, if you use the git commit SHA as the tag and push a tagged image to dockerhub on every push to your git repository, do you not find you have a very long list of tagged images on docker with no easy way to tell which ones were ever deployed? I like the idea of human-readable tags (perhaps push to "latest" tag until acceptance tests complete, then tag git and push matching tag to dockerhub).
But maybe I'm just being too old school about release versioning :)

Jon Leighton

8 years ago

Hey Tamsin, glad you liked the post! Yes, you do end up with loads of tags on Docker Hub. However, we don't use Docker Hub to see which revisions we have deployed since this information is a lot easier to see in Dockbit (we can also see which revisions were previously deployed, and it hooks into GitHub so we can easily navigate to the specific commmit). We can also see the currently deployed revision in ECS.

Tamsin Slinn

8 years ago

Thanks Jon, makes sense! We can tag our docker compose file to track releases, so will see how that goes. I'll just have to avoid looking at dockerhub :)

Overhauling Loco2's hosting infrastructure with AWS, Docker and Terraform

Table of Contents

History

Infrastructure as code with Terraform

Environment-specific AWS accounts

Transitioning to Virtual Private Cloud

Moving the RDS database into VPC

Provisioning the EC2 Container Service cluster

How ECS runs our containers

Building the Docker image

Deploying the new image

Rolling deploys in ECS

A trick to build Docker images faster

Memory leaks in long-running processes

Developer access to the production environment

Conclusion

Comments

Juan Paulo Breinlinger

Jon Leighton

Juan Paulo Breinlinger

Juan Paulo Breinlinger

Rob Shepherd

Tamsin Slinn

Jon Leighton

Tamsin Slinn

Add your comment