AWS Power Outage – how to stay up when an EC2 datacenter goes down

Opentracker distributed databases not affected by AWS power outages

Summary

Server issues impact our ability to provide services. Our work to safeguard against catastrophic events has led us to build the distributed, data-driven architecture described below. The result is a system that keeps recording data and generating reports even when an entire datacenter goes down.

Over the past months, there have been several outages at AWS (Amazon Web Services), the largest cloud service in the world (http://venturebeat.com/2012/06/29/amazon-outage-netflix-instagram-pinter…).

Amazon has been actively preaching the “design for failure” model. Under this model, applications should keep working even in the event of a massive, datacenter-wide outage.

Google

Google, among others, has also designed its architecture to keep working in the event of failure. All of Googleʼs software (Maps, Apps, Analytics, GMail, etc.) is based on this architecture. You could even argue that this has been Googleʼs competitive advantage over the past decade, dwarfing its rivals by providing reliable software.

In Googleʼs infrastructure, identical chunks of data are spread across several datacenters, so when one computer, server rack or datacenter fails, requests get routed to nodes that are still working. Application developers do not really need to worry about the fail-safe infrastructure; they just need to abide by certain rules. If they do, it just works, which helps Google develop reliable software quickly.

The Holy Grail for internet companies is creating awesome applications used by millions of users. Growing Internet companies eventually need to worry about the infrastructure failing and the people needed to maintain it.

Databases

Surprisingly, when it comes to common databases, this is hard to achieve. Common databases like MySQL are based on the data residing on a hard drive, and given Murphyʼs Law, these hard drives will eventually fail.

To counteract this, database administrators make backups, set up master-slave replication servers, and monitor everything 24/7. These are the ingredients for recovering from a disaster.

Ironically, the more successful you become, the more servers you have and the more failures you are likely to get, so coping with downtime becomes a sought-after skill when reviewing applicant resumes.

Getting things back to normal when something fails takes a disproportionate amount of human effort and skill.

Opentracker

Our challenge at Opentracker, as a provider of SaaS (Software-as-a-Service), is to host and generate traffic data reports for our clients. Server issues impact both our ability to store data and our ability to generate traffic reports.

Over the years we have experimented with various ways of safeguarding against catastrophic events.

Regarding database management, this has led us to move our development in a new direction, towards a distributed data-driven reporting engine. Our new model involves storage distributed across multiple datacenters. Different physical locations hold identical data, and new data is automatically copied to several locations.
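To make the idea of identical copies in several locations concrete, here is a minimal sketch in Python. It is a toy model and not our production code: the class name ReplicatedStore, the datacenter names dc-a, dc-b and dc-c, and the sample keys are all hypothetical, and a plain in-memory dictionary stands in for each storage node. It only illustrates the principle that every write is copied to every reachable location, and that a location which was down catches up once it returns.

```python
class ReplicatedStore:
    """Toy model: every datacenter holds a full copy of the data."""

    def __init__(self, datacenters):
        # One in-memory dict per datacenter stands in for a real storage node.
        self.replicas = {name: {} for name in datacenters}
        self.down = set()
        # Writes a datacenter missed while it was down, replayed on recovery.
        self.missed = {name: [] for name in datacenters}

    def write(self, key, value):
        """Copy the new record to every reachable datacenter."""
        written = 0
        for name, data in self.replicas.items():
            if name in self.down:
                self.missed[name].append((key, value))
            else:
                data[key] = value
                written += 1
        if written == 0:
            raise RuntimeError("no datacenter reachable")
        return written

    def fail(self, name):
        """Mark a datacenter as unreachable."""
        self.down.add(name)

    def recover(self, name):
        """Bring a datacenter back and replay the writes it missed."""
        self.down.discard(name)
        for key, value in self.missed[name]:
            self.replicas[name][key] = value
        self.missed[name].clear()


# Hypothetical usage: three locations, one of which fails temporarily.
store = ReplicatedStore(["dc-a", "dc-b", "dc-c"])
store.write("visit:1", "homepage")
store.fail("dc-b")
store.write("visit:2", "pricing")   # dc-b misses this write
store.recover("dc-b")               # dc-b catches up on recovery
```

After the recovery step, all three dictionaries hold the same two records again, without anyone having to restore a backup by hand.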

What does this mean?

In practice, this means that two datacenters can go down without anyone noticing, as long as at least one datacenter remains up. If one datacenter is down, there is no extra pressure to get things working, and no need to wake a team of engineers at 4 a.m.
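The reason nobody notices can be sketched as a read that simply skips locations that are down. This is an illustrative Python sketch under stated assumptions, not a description of our actual software: the function read, the datacenter names and the sample record are hypothetical, and each datacenter is again represented by a plain dictionary holding an identical copy.

```python
def read(key, replicas, down):
    """Return the value from the first datacenter that is still up."""
    for name, data in replicas.items():
        if name in down:
            continue  # skip the failed datacenter and try the next one
        if key in data:
            return data[key]
    raise LookupError(f"{key!r} not available in any reachable datacenter")


# Hypothetical setup: three datacenters holding identical data.
replicas = {
    "dc-a": {"visit:1": "homepage"},
    "dc-b": {"visit:1": "homepage"},
    "dc-c": {"visit:1": "homepage"},
}

# Two of the three locations are down; the read is served by the third.
result = read("visit:1", replicas, down={"dc-a", "dc-b"})
```

Only when every location is unreachable does the read fail, which is the one case where someone does need to be woken up.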

Furthermore, there is no need for our application developers to worry about failures or balancing data.

Conclusion

The fact that we stayed up and running while a whole datacenter was down confirmed that we made the correct choices. We have learned that our distributed database exhibits “failsafe” qualities when compared to MySQL. Our cluster, spread over multiple locations, was able to continue both recording and generating reports without interruption, and was able to repair itself automatically after the fact.
