Eager has been described as a CDN with a user interface. The interface is certainly what we think about the most, but I wanted to share some things we’ve learned building the CDN.
The first rule of building reliable things is that, given enough time, everything fails. Based on that mindset, the goal is to build a system which has some form of redundancy at every level. Fortunately, it isn’t all that hard to build a certain amount of redundancy into your setup.
Commercial CDN Providers
Unless you need to do wondrously complex things at your edge locations, it just makes sense to use a commercial CDN provider like Amazon’s CloudFront or Fastly. Building out a reliable network of servers all around the world would be time-consuming and expensive, and it’s certainly not what we want to put our efforts into.
Reliability Tip Don’t choose one CDN, choose two. If you have two CDNs configured and ready to go, you can fail over using your DNS settings to turn a catastrophic failure into only a few minutes of pain.
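To make the two-CDN idea concrete, here is a minimal sketch of a helper that recommends which CDN hostname your DNS record should point at. It is an illustration under assumptions, not a description of any provider's API: the `FailoverAdvisor` class, the `probe` callback, the failure threshold, and the example hostnames are all hypothetical. It only recommends a target; actually changing DNS is left to a person.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverAdvisor:
    """Recommend a DNS target after repeated failed probes (hypothetical helper)."""

    primary: str                   # hostname of the CDN normally in use
    secondary: str                 # hostname of the standby CDN
    probe: Callable[[str], bool]   # assumed to return True if the host serves a test file
    threshold: int = 3             # consecutive failures before recommending a switch
    failures: int = 0

    def check(self) -> str:
        """Probe the primary once and return the hostname DNS should point at."""
        if self.probe(self.primary):
            self.failures = 0
        else:
            self.failures += 1
        # Only recommend the standby if it is itself healthy.
        if self.failures >= self.threshold and self.probe(self.secondary):
            return self.secondary
        return self.primary
```

Requiring several consecutive failures keeps a single slow probe from triggering a switch, and the short DNS TTL you would need for this to take effect in "a few minutes" is something to set up ahead of time.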
The more important the reliability of a system is, the simpler and more foolproof it needs to be. If a database has to be up and running for files to get generated when the CDN needs them, you’re gonna have reliability challenges. Additionally, files are generally changed infrequently but loaded often. It just makes more sense to generate files when the information backing them changes, and ship them off to a static file host like Amazon S3 to be served from now until the end of time.
Reliability Tip Generate files when they change, not every time they need to be loaded.
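A rough sketch of what "generate on change" can look like: the file is rendered once, at the moment the backing data is saved, and written to every static store. Everything named here is an assumption for illustration (the `FileStore` interface, `MemoryStore`, `publish_settings`, and the `sites/<id>/settings.json` layout); in practice each store would wrap a real static file host.

```python
import json
from typing import Dict, Iterable, Protocol


class FileStore(Protocol):
    """Anything that can hold a static file (hypothetical interface)."""

    def put(self, path: str, body: bytes) -> None: ...


class MemoryStore:
    """Stand-in for a static file host, used here so the sketch runs locally."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def put(self, path: str, body: bytes) -> None:
        self.files[path] = body


def publish_settings(site_id: str, settings: dict, stores: Iterable[FileStore]) -> str:
    """Render the file once, when the data changes, and push it to every store."""
    path = f"sites/{site_id}/settings.json"
    body = json.dumps(settings, sort_keys=True).encode("utf-8")
    for store in stores:
        store.put(path, body)
    return path
```

Calling `publish_settings` from the code path that saves a change means no database has to be reachable when the CDN later asks for the file.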
Who would have thought that putting a filesystem in the cloud would have such an impact on how we build things? Amazon S3 is essentially the perfect CDN origin for static files. Unfortunately, it has one unreliable element: latency. We’ve had requests for files take upwards of 600ms to return. Eventually everything gets cached in the CDNs, but with thousands of edge locations and potentially short cache timeouts, it’s inevitable that a big chunk of requests will be much slower than you’d like.
Reliability Tip Use multiple origins. We push all files both into Amazon S3 and Rackspace CloudFiles. We have a simple service which attempts to load files from both locations, and responds with the first file it gets. This helps us tolerate both latency and a failure of either service. Even better, we have an in-memory cache in the upstream service we can push new files to, keeping requests fast as new files get distributed to the various CDN edge nodes.
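The racing behaviour described above can be sketched with `asyncio`. This is a simplified illustration, not the actual service: the `OriginRacer` class and the fetcher callables are hypothetical, and each fetcher is assumed to be a coroutine that returns the file body or raises on failure.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable

# A fetcher takes a path and returns the file body, or raises on failure.
Fetcher = Callable[[str], Awaitable[bytes]]


class OriginRacer:
    """Serve from an in-memory cache, else from whichever origin answers first."""

    def __init__(self, fetchers: Iterable[Fetcher]) -> None:
        self._fetchers = list(fetchers)
        self._cache: Dict[str, bytes] = {}

    def push(self, path: str, body: bytes) -> None:
        # Newly generated files can be pushed straight into the cache.
        self._cache[path] = body

    async def get(self, path: str) -> bytes:
        cached = self._cache.get(path)
        if cached is not None:
            return cached
        # Start a request to every origin at once.
        tasks = [asyncio.create_task(fetch(path)) for fetch in self._fetchers]
        try:
            for finished in asyncio.as_completed(tasks):
                try:
                    body = await finished
                except Exception:
                    continue  # this origin failed; wait for the next one
                self._cache[path] = body
                return body
        finally:
            # Stop any requests that are still in flight.
            for task in tasks:
                task.cancel()
        raise LookupError(f"no origin could serve {path}")
```

A real version would also need cache expiry and a size limit; those are left out to keep the race itself visible.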
In the last section we talked about a service we have which is responsible for delivering files to our CDN’s edge nodes. It goes without saying that if all of Fastly or S3 could go down, so could our service. Fortunately, Fastly allows us to specify that in the event of a failure of the upstream service, requests should be routed directly to S3.
Reliability Tip This failover is automatic, but our DNS failover to CloudFront is manual. It’s always important to remember that it’s not uncommon for the recovery systems to create more problems than they solve. If a recovery step is expensive to reverse, it often makes sense to leave it to the humans.
You can’t fix problems you don’t know about, and there’s nothing worse than being told of a reliability problem by one of your customers.
It’s always important to remember that different systems have different reliability needs. Our app, which we deploy code to every day, can have the occasional bug without it ending the world. Our static files have to be served to our customers all the time, every time. The more critical the reliability of a system is, the more time and effort it makes sense to invest to get it there. Beyond that, there is a balance between capability and reliability.
Reliability Tip If different parts of your system have different reliability requirements, they should be separated so the parts which can be changed quickly don’t harm the parts which must stay simple and change methodically.
We’ve never had a confirmed outage of our static file system (100% uptime). Ultimately there is no such thing as a perfectly reliable system, but there are varying degrees of reliability. For us, the goal is to be confident that we are the most reliable option our users have.