Simulating fault in cloud-scalable services

One of the markers of world-class cloud-scaled services is the ability to gracefully deal with fault. In the cloud, it’s not a matter of if things will fail, it’s a matter of when they will fail. Things have a tendency to fail at the worst possible moments: the launch date of a new platform, the day the media shines the spotlight on your service, or the day the most users hit your service, such as Super Bowl Sunday, Christmas morning, or Black Friday.

To make matters worse, like the old idiom says, when it rains, it pours.

So how do you make sure that when those rainy days come, you’ve not only got your umbrella, but your Paddington Bear boots and blue raincoat too?

Make those faults happen on your terms, on your time, and when you’re ready for them.

The building blocks of faults

A few years ago, some very smart folks at Netflix shared the learnings, tactics, and approaches they used to simulate fault by releasing a set of Chaos Monkeys into their ecosystem. We were tickled by the idea and saw the brilliance behind their innovation, but applying their solution wasn’t feasible since we aren’t running on AWS. So we decided to design our own and add our own flair to it.

In doing so, we first looked at the basic building blocks behind the concept of simulating fault in a production system, and we came up with a few tenets:

  • Faults should happen at randomly scheduled times and intervals
  • Faults tend to happen in clusters (e.g., if there is a large data center issue, not only does your connection to your worker services die, but your DB access, pub/sub queues, and ack services will likely die as well)
  • Faults can happen internally in your service (e.g., CPU/memory issues, local storage issues, file corruption/consistency) and externally outside your service (e.g., connections fail, remote resources time out)

With that in mind, we set off to build our own chaos-creating system in a way that was more extensible in the number and types of faults we could cause.

Introducing the GoDaddy Goons

Naming a product is often a very difficult task, but when we started thinking about what to name a service whose sole purpose is to cause havoc and smash things up in our deployment, the name came very naturally: GoDaddy Goons. Goons are generally thugs employed to handle particularly unsavory tasks. Often employed by gangsters and the like, goons rely on brute force to get the job done, which was a beautiful paradigm for what we needed to build.

[Image: gangster]

Scheduling events

The Gangster service is the brains behind the operation. It acts as the scheduler for the system, generating random events and event types to simulate fault across the platform. The Gangster service is configurable, and can generate isolated events, clustered events, and even regular periodic sequential events.

We’ve put together an admin portal for ease of configuration and management of the scheduled events, but the Gangster itself exposes an admin REST API to allow management of the system. Once event types are registered against particular services, the scheduler then takes care of generating random events as defined. These events are then dispatched either to the HitSquad service that handles external faults or to individual Goons hiding amongst the citizens that handle internal faults for each particular service.
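
To give a feel for what that scheduling could look like, here is a minimal sketch of generating isolated and clustered events with random timing. The names and shapes below (FaultEvent, GangsterScheduler, the window and spread parameters) are purely illustrative and not the actual Gangster code:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

// Hypothetical event descriptor for this sketch; the real Gangster API may differ.
class FaultEvent
{
    final UUID goonIdentifier;
    final Instant fireAt;

    FaultEvent(UUID goonIdentifier, Instant fireAt)
    {
        this.goonIdentifier = goonIdentifier;
        this.fireAt = fireAt;
    }
}

public class GangsterScheduler
{
    private final Random random = new Random();

    // An isolated event: one Goon fired at a random point inside the scheduling window.
    public FaultEvent scheduleIsolated(UUID goonIdentifier, Duration window)
    {
        long jitter = (long) (random.nextDouble() * window.toMillis());
        return new FaultEvent(goonIdentifier, Instant.now().plusMillis(jitter));
    }

    // A clustered event: several related Goons fired close together, mimicking a
    // data center issue that kills DB access, queues, and worker connections at once.
    public List<FaultEvent> scheduleCluster(List<UUID> goonIdentifiers, Duration window, Duration spread)
    {
        Instant clusterStart = Instant.now().plusMillis((long) (random.nextDouble() * window.toMillis()));
        List<FaultEvent> events = new ArrayList<FaultEvent>();
        for (UUID goonIdentifier : goonIdentifiers)
        {
            long offset = (long) (random.nextDouble() * spread.toMillis());
            events.add(new FaultEvent(goonIdentifier, clusterStart.plusMillis(offset)));
        }
        return events;
    }
}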

Internal faults

Every service in our platform has a private REST API that the Gangster service connects to. As new roles and instances come up in the platform, they are registered with the Gangster so that it knows how to communicate with each service’s Goons. The API also allows the Gangster to instruct the service to download additional Goons, giving us the freedom to extend the types of Goons each service contains without restarting the system.

The Gangster can then send tailored commands to each Goon on a service for the type of fault that it wishes to simulate. Because the Goons live on the service itself, they have access to all of the service’s internal resources, which allows them to cause any kind of fault you can imagine.
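
On the service side, dispatch can be as simple as keeping a registry that maps each Goon’s identifier token to its instance and invoking it when the Gangster’s command arrives. This is a simplified sketch (the names are illustrative, not the real implementation), built against the Goon interface shown later in this post:

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of a per-service Goon registry; names are illustrative.
public class GoonRegistry
{
    private final Map<UUID, Goon> goons = new ConcurrentHashMap<UUID, Goon>();

    // Called when a Goon is added to the service, at startup or on demand.
    public void register(Goon goon)
    {
        goons.put(goon.getGoonIdentifier(), goon);
    }

    // Called by the service's private REST layer when the Gangster dispatches a command.
    public void handleCommand(UUID goonIdentifier)
    {
        Goon goon = goons.get(goonIdentifier);
        if (goon == null)
        {
            throw new IllegalArgumentException("No Goon registered for " + goonIdentifier);
        }
        goon.execute();
    }
}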

Here are some examples of internal faults we currently have Goons for:

  • CPU spiking
  • Memory hogging
  • Process restarting

External faults

The HitSquad service handles external faults and is a standalone service that sits outside the platform’s services. It provides a similar REST API that the Gangster service connects to, allowing the Gangster to specify new Goons that need to come on board the HitSquad.

The HitSquad has connections to our external resources like the OpenStack API, Cassandra admin service, and RMQ admin service. The Gangster can instruct the HitSquad’s Goons to perform any number of actions on those resources, ranging from killing connections to simulating reboots, timeouts, and erroneous data going through the pipes to the service.

Here are some examples of external faults we currently have Goons for:

  • Restarting OpenStack VMs
  • Injecting bogus Rabbit MQ messages
  • Disconnecting Cassandra connections
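
As a rough illustration of the second example above, a HitSquad Goon that injects a bogus message could look something like the sketch below. It assumes the standard RabbitMQ Java client and implements the Goon interface described in the next section; the queue name, broker host, and identifier are made up for this example:

import java.nio.charset.StandardCharsets;
import java.util.UUID;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

// Illustrative sketch only: publishes a garbage payload onto a queue so downstream
// handlers have to cope with malformed messages.
public class BogusRabbitMessageGoon implements Goon
{
    private static final String QUEUE_NAME = "completion-handler-queue"; // hypothetical queue

    @Override
    public UUID getGoonIdentifier()
    {
        return UUID.fromString("0A2B4C6D-1111-2222-3333-444455556666");
    }

    @Override
    public void execute()
    {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rmq.example.internal"); // hypothetical broker host

        try
        {
            Connection connection = factory.newConnection();
            Channel channel = connection.createChannel();

            // Publish directly to the queue via the default exchange.
            byte[] garbage = "this is not a valid registry response".getBytes(StandardCharsets.UTF_8);
            channel.basicPublish("", QUEUE_NAME, null, garbage);

            channel.close();
            connection.close();
        }
        catch (Exception e)
        {
            throw new RuntimeException("Failed to inject bogus message", e);
        }
    }
}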

Extensibility

One of the things we’ve learned is that the landscape is ever-changing, so the Goon system was built with that in mind. Implementing a new Goon is as easy as implementing a simple interface: an execute function that gets called when the Goon is given an instruction, and a method that provides a token the Goon can be identified by.

public interface Goon
{
    // Token the Gangster uses to identify and address this Goon.
    public UUID getGoonIdentifier();

    // Invoked when the Gangster instructs this Goon to cause its fault.
    public void execute();
}

public class KillHostGoon implements Goon
{
    private static final GDLogger logger = getLogger(KillHostGoon.class);

    @Override
    public UUID getGoonIdentifier()
    {
        return UUID.fromString("F338EC9D-4FD0-4311-89E9-DBBBDF549406");
    }

    @Override
    public void execute()
    {
        logger.debug("Killing host!");
        Runtime.getRuntime().halt(-1);
    }
}
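
Earlier we mentioned that the Gangster can instruct a service to download additional Goons without restarting anything. One plausible way to do that in Java is plain dynamic class loading; the sketch below assumes the Gangster supplies a jar URL and a class name, which is an illustration of the idea rather than the actual mechanism:

import java.net.URL;
import java.net.URLClassLoader;

// Simplified sketch of loading a new Goon implementation from a downloaded jar.
public class GoonLoader
{
    public Goon loadGoon(URL goonJarUrl, String goonClassName) throws Exception
    {
        // Child class loader that can see the new jar as well as the existing Goon interface.
        URLClassLoader loader = new URLClassLoader(new URL[] { goonJarUrl },
                                                   Goon.class.getClassLoader());
        Class<? extends Goon> goonClass =
                Class.forName(goonClassName, true, loader).asSubclass(Goon.class);
        return goonClass.getDeclaredConstructor().newInstance();
    }
}

The returned instance would then be handed to the service’s Goon registry so the Gangster can address it by its identifier token.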

A sample scenario

The first deployment of GoDaddy Goons was in our Domains platform. If you squint a little, our architecture looks very similar to the following:

[Image: gangstereco2 (platform architecture diagram)]

One of our first Goons helped us simulate fault in our asynchronous workflow by injecting itself into our completion handler services. It would hijack messages from RabbitMQ and drop them, simulating no response from the registry. It was later upgraded to also occasionally kill the connection to the RabbitMQ server, ensuring that our handlers were able to deal with these faults.

As a result of the Goons hammering on the system, we identified the need for a healing service that periodically goes through our DB and determines a recovery action for each affected request. For example, a domain create command that had its result dropped should get a re-query action to check the status of the command, while a contact info update should simply be retried since it is idempotent.
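
To make that concrete, here is a simplified sketch of the kind of per-command decision the healing service makes. The command names and the enum are illustrative, not our actual schema:

// Simplified sketch of choosing a recovery action for a stalled request.
public class HealingService
{
    enum RecoveryAction { REQUERY_STATUS, RETRY, NONE }

    public RecoveryAction recoveryActionFor(String commandType)
    {
        switch (commandType)
        {
            case "DOMAIN_CREATE":
                // The create may have succeeded at the registry even though the
                // response was dropped, so query its status rather than re-issue it.
                return RecoveryAction.REQUERY_STATUS;
            case "CONTACT_INFO_UPDATE":
                // Idempotent, so it is safe to simply send it again.
                return RecoveryAction.RETRY;
            default:
                return RecoveryAction.NONE;
        }
    }
}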

Not only were we able to make sure our bases were covered on the individual completion handler services, but we were also able to ensure that the rest of the system could handle not having the request status updated and could recover gracefully.

Where do we go from here?

We’ve got tons of ideas for new Goons, but we’d love to hear your ideas as well! We’re planning to make the sources available to the community so others can contribute and bring fresh ideas.

If you’re interested in working on these and other fun problems with us, check out our jobs page.