A few years ago, GoDaddy adopted OpenStack as the software that runs our internal cloud. As a result, our ability to execute and build products has increased tremendously. It has helped change the culture of how we develop and architect products, and it has enabled us to leverage automation across our infrastructure to meet the demands of our customers. The next logical progression was to give our customers access to the same power and scale that GoDaddy uses to run its own business, in a simple, easy-to-consume product. Enter GoDaddy Cloud Servers. Our goal was to create a simple-to-use product, powered by a consistent and intuitive API, that empowers developers, IT professionals, and small business owners to get up and running quickly and easily. To delight our customers, we understood that our system had to scale, be highly available, and provide industry-leading performance. This article discusses some of the technologies and tools we used to achieve those goals with the GoDaddy Cloud Servers provisioning system.
To make the product intuitive and easy to use, we decided to expose our own API as a shim layer in front of the OpenStack APIs. This allows us to keep the API consistent with other GoDaddy APIs that are exposed externally and to remove some of the complexity involved in learning and consuming the OpenStack APIs. To keep our APIs consistent, we have adopted Swagger as the framework for describing all of the APIs at GoDaddy, which lets developers easily consume our various products since they all follow a common convention. OpenStack has grown quickly over the years, and its APIs are not entirely consistent. There are also multiple ways to accomplish the same task (allocating and assigning floating IPs via Nova versus Neutron is a good example). We wanted to eliminate this confusion for our end users by building an API that follows the Unix philosophy of “Do one thing and do it well”.
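To give a feel for the shim-layer idea, here is a minimal sketch of the kind of translation it performs: a simplified, customer-facing create-server request mapped onto the body that OpenStack Nova’s `POST /servers` call expects. The field names on the customer side (`name`, `image`, `size`) are hypothetical, not our actual API contract.

```python
def to_nova_request(req):
    """Translate a simplified create-server request (hypothetical field
    names) into the body OpenStack Nova expects for POST /servers."""
    return {
        "server": {
            "name": req["name"],
            "imageRef": req["image"],   # image UUID, resolved elsewhere
            "flavorRef": req["size"],   # flavor UUID for the chosen size
        }
    }
```

The shim keeps the customer-facing request small and stable while the translation to OpenStack’s richer (and occasionally shifting) API lives in one place.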
One of the cool things about GoDaddy Cloud Servers is that it runs on the same platform it exposes. We are truly eating our own dog food with this product. All of our infrastructure runs on our private OpenStack cloud, while our customers live on our public cloud. As mentioned above, one of the goals we set out to achieve was a product that is highly available and horizontally scalable so that we can meet our customers’ ever-changing demands. To achieve this, we decided to put Apache Kafka at the center of our cloud provisioning system. Kafka gives us a centralized messaging system that is scalable, distributed, and durable. Kafka also depends on Apache ZooKeeper, which provides many features that are helpful in building distributed systems, and GoDaddy Cloud Servers leverages both. Below is a high-level architecture diagram of our system:
The main takeaway from the architecture diagram is that we use Kafka to decouple the back-end services from the front-end API. Every request that hits the API becomes a JSON message that lands in Kafka and is consumed by the back-end services that interact with OpenStack. We also take advantage of OpenStack’s Ceilometer notification service to track changes that occur on the OpenStack side by feeding them back into the same Kafka cluster used by the provisioning system. This gives us a much better event-driven architecture than polling OpenStack for state transitions.
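The API-to-Kafka leg of that flow can be sketched as follows. The topic name, broker address, and event fields are illustrative, and the producer uses the kafka-python client as an assumption (the post doesn’t name our client library):

```python
import json

def encode_event(event):
    """Serialize a provisioning event to the JSON bytes placed on the topic."""
    return json.dumps(event, sort_keys=True).encode("utf-8")

def send_provisioning_event(event, brokers="kafka:9092"):
    """Publish one API request as a message to Kafka. Broker and topic
    names are illustrative, not our actual configuration."""
    from kafka import KafkaProducer  # kafka-python; deferred so the pure
                                     # serialization above needs no broker
    producer = KafkaProducer(bootstrap_servers=brokers,
                             value_serializer=encode_event)
    producer.send("provisioning-requests", event)
    producer.flush()  # block until the message is acknowledged
```

The back-end workers then consume these messages and drive the corresponding OpenStack calls, so the API never talks to OpenStack directly.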
Kafka has some key features that let us scale our provisioning system horizontally. First, it replicates messages across the cluster, so if a node in the cluster goes away, no messages are lost. Kafka also splits each topic into partitions, which allows us to route incoming messages to specific partitions. What this buys us is back-end services that can target specific partitions, distributing the load of processing messages across multiple nodes. To keep things simple, we dedicate a VM to each partition in our topic. As load increases, we simply add a new partition, update our partitioning logic (think buckets), and spin up a new node to consume the new partition. We can repeat this as many times as needed to scale out. In the future, as we bring on more data centers, we can shard the data to the appropriate DC for processing, allowing us to scale out indefinitely.
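A rough sketch of the bucketing idea, assuming a CRC32 hash as the partitioning function (the post doesn’t specify ours) and kafka-python for the partition-pinned consumer:

```python
import zlib

def partition_for(key, num_partitions):
    """Bucket a key (e.g. an account ID) into a partition. The hash is
    deterministic, so the same key always lands in the same partition
    as long as num_partitions stays fixed."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def consume_partition(partition, brokers="kafka:9092"):
    """Dedicate this node to a single partition of the topic. Topic and
    broker names are illustrative."""
    from kafka import KafkaConsumer, TopicPartition  # kafka-python
    consumer = KafkaConsumer(bootstrap_servers=brokers)
    consumer.assign([TopicPartition("provisioning-requests", partition)])
    for message in consumer:
        handle(message)  # hypothetical per-event handler
```

Note that growing `num_partitions` remaps some existing keys to new buckets, which is why adding a partition goes hand in hand with updating the partitioning logic.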
At this point, we have a highly scalable system, but what if something fails? Our back-end services come in two flavors. For the first, we only want one instance actively working at any given time, but we always want one running. For the second, we want one instance running on every application server, each processing only the data that belongs to it. To solve the first problem we leverage ZooKeeper’s distributed locking mechanism: we run the service on every application server, but only the lock holder actively works at any given time. If the active node fails, ZooKeeper detects that it is gone very quickly and allows one of the standby services to acquire the lock and start processing. The second flavor of service processes data for the primary Kafka partition it reads events from. The problem with this approach is that if a particular app server goes down, we lose the node responsible for the corresponding partition. To get around this, each service is configured with a failover partner that assumes responsibility for the ailing node’s partition until it is brought back online. Again, this is done with ZooKeeper and distributed locks. In the future, however, we plan to experiment with consumer groups now that they are available in the latest Python Kafka libraries.
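Both patterns can be sketched briefly. The lock example assumes the kazoo ZooKeeper client (our actual client library isn’t named above), and the failover-partner rule shown here (next node in a fixed ring) is an illustrative stand-in for whatever the static configuration actually maps:

```python
def failover_partner(node, nodes):
    """Partner that assumes a failed node's partition. We use a static
    configuration; a next-node-in-the-ring rule is assumed here purely
    for illustration."""
    i = nodes.index(node)
    return nodes[(i + 1) % len(nodes)]

def run_singleton(zk_hosts, lock_path, work):
    """First flavor: run on every app server, but only the lock holder
    is active. If the holder dies, ZooKeeper releases its lock and a
    standby instance acquires it and takes over."""
    from kazoo.client import KazooClient  # kazoo; deferred, needs a live ensemble
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    with zk.Lock(lock_path, "this-host-id"):  # blocks until we hold the lock
        work()
```

Because the lock is backed by an ephemeral ZooKeeper node, a crashed holder releases it automatically once its session expires, which is what makes the standby takeover fast.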
Overall, we are very pleased with the performance and uptime of our provisioning system. Both Kafka and ZooKeeper have proven to be very stable and relatively easy to maintain. More importantly, it has been extremely easy for us to recover from planned and unplanned outages that naturally occur in “cloudy” environments.