Building Resilient and Scalable Architectures

The GoDaddy hosting engineering team develops and supports a wide range of products that host more than 10 million websites. These products, which include shared hosting, dedicated servers, virtual private servers, and a managed WordPress offering, span thousands of servers. So, how do we make new APIs available for these hosting products to consume in a way that’s both resilient and scalable?

Our team is in the process of implementing a new architecture for hosting products that provides all of the common web services required by new products under development. These web services provide integration with third-party vendors as well as with other GoDaddy products such as domain management, DNS, and SSL certificate management. This new architecture, which we call Hosting Foundation Services (HFS), exposes an API to these web services via a common REST endpoint. Each web service is implemented as a microservice with a single, task-oriented concern and is independent of every other web service. We also use a shared-nothing architecture to avoid introducing dependencies between web services and to reduce complexity.

The initial design for such a product was straightforward enough: stand up a cluster of reverse proxy servers behind a load balancer that can proxy the various API calls for each individual web service, and set up enough application servers to handle the expected load and provide for some redundancy. However, we had some additional constraints that influenced how we designed such an architecture:

  • As each product that would eventually use the HFS architecture began to take shape, the need for additional APIs was identified. We needed to be able to add web services in a dynamic fashion that didn’t require reconfiguration of our reverse proxy servers.
  • We didn’t know exactly how many application servers we would need for each web service, and so we needed a scalable solution that would allow us to add application servers dynamically (and in the future, automatically) to meet demand.
  • We knew that as more and more products depended on the HFS architecture, it needed to be resilient to software, hardware, and network failures in a cloud environment.
  • Developers like having a Swagger interface available for each web service, and we wanted to expose a consolidated Swagger interface for all of the available APIs at a single location.

Our new hosting products are cloud-based, and internally we use OpenStack as our virtualization technology. Therefore, it was no problem to construct the initial tooling that created a VM, applied Puppet updates, and installed any of the web services under development. While this approach was adequate early on in the development process, it didn’t scale in practice. As servers were created manually, it became cumbersome to keep track of which servers were associated with which of the APIs under development. We weren’t sure how many servers we would need to spin up for any given web service, and even if we knew how many servers to provision, we still needed something that would scale the solution automatically for each of the different APIs we wanted to provide. Enter the Master Control Program, or MCP.

MCP was conceived as an orchestration tool that could provision and de-provision application servers automatically to scale with the expected load on the VMs servicing the APIs. Orchestration technologies such as Kubernetes or Docker Swarm provide similar capabilities for container-based solutions, but we needed MCP to perform similar orchestration functionality on our VM-based implementation. While we expect to eventually migrate some of these web services from standalone VMs to containers, we will still need MCP’s ability to provision additional VMs as Docker hosts in order to scale out the environment to meet demand.

VM Factory: Speeding Up Server Deployments

During the process of automating server builds, we found that applying system updates and running Puppet took a considerable amount of time. So, we implemented a “VM Factory” process that performs these steps ahead of time and maintains a ready pool of VMs that can be used to quickly spin up additional application servers for a specific web service.
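
The factory itself amounts to a simple replenishment loop. Here is a minimal sketch in Python, with hypothetical helper callables standing in for the real work (creating the VM, applying system updates and Puppet, and marking it ready); the pool size and interval are illustrative:

```python
import time
from typing import Callable, List


def maintain_pool(list_ready_vms: Callable[[], List[str]],
                  build_vm: Callable[[], str],
                  target_size: int = 10,
                  interval_seconds: int = 300) -> None:
    """Keep a pool of pre-built VMs topped up to target_size.

    build_vm is expected to create a VM, apply system updates and Puppet,
    and mark it as ready; list_ready_vms returns unclaimed pool members.
    Both are hypothetical helpers, not part of the actual VM Factory.
    """
    while True:
        ready = list_ready_vms()
        for _ in range(target_size - len(ready)):
            build_vm()  # the slow work happens here, ahead of demand
        time.sleep(interval_seconds)
```

When a new application server is needed, it can be claimed from this pool and only has to have the specific web service installed and configured, which is much faster than building a VM from scratch.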

Service Registration: Where was that API Server again?

Each web service retrieves configuration data from a ZooKeeper cluster. For service registration and discovery, each web service registers its location in the ZooKeeper cluster using ephemeral znodes. That way, if the VM hosting a web service dies, the service registration data in ZooKeeper is updated automatically. On the reverse proxy servers, a daemon process uses the service registration data in ZooKeeper to construct nginx configuration directives that expose each API via its own URL endpoint. As web services are added or removed, the nginx reverse proxy configuration is updated automatically. The ephemeral nature of the service registrations ensures that the reverse proxy configuration is updated for any VMs that are removed. Since the daemon is already handling the creation and removal of service registrations, it can also combine the Swagger specifications provided by each web service to present a consolidated specification to the end user that is always current.
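
As an illustration, here is a minimal sketch of how a web service instance might register itself with an ephemeral znode using the Python kazoo client. The ZooKeeper hosts, znode paths, service name, and payload fields are hypothetical conventions, not necessarily the ones HFS uses:

```python
import json
import socket

from kazoo.client import KazooClient

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"  # hypothetical ZooKeeper ensemble
SERVICE_NAME = "ssl-certs"               # hypothetical web service name
SERVICE_VERSION = "1.4.2"                # hypothetical version string
SERVICE_PORT = 8443

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()

# Make sure the parent path for this service exists.
base_path = f"/services/{SERVICE_NAME}"
zk.ensure_path(base_path)

# Register this instance as an ephemeral, sequential znode. If the VM (or the
# process holding the ZooKeeper session) dies, the znode is deleted
# automatically and the reverse proxy daemon sees the instance disappear.
payload = json.dumps({
    "host": socket.getfqdn(),
    "port": SERVICE_PORT,
    "version": SERVICE_VERSION,
    "swagger": f"https://{socket.getfqdn()}:{SERVICE_PORT}/swagger.json",
}).encode("utf-8")

zk.create(f"{base_path}/instance-", payload, ephemeral=True, sequence=True)
```

On the proxy side, the daemon can place a kazoo ChildrenWatch on the /services tree so that it regenerates the nginx upstream configuration (and the consolidated Swagger document) whenever instances appear or disappear.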

[Diagram: scalable architecture]

Making Sure Supply == Demand

Now that we had addressed the requirement to have a dynamic number of web services and application servers, we still wanted the ability to automatically grow or shrink portions of the infrastructure based on variables such as demand, the health of system components, and component failures. To do this, MCP compares the number of running instances of each web service to a configurable value and either starts or terminates the required number of instances so that the proper number is running and all members are healthy. This also provides automatic remediation when a server fails: MCP detects the mismatch between the configured number of instances and the number of running instances, and starts a new VM as necessary.
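
In essence, MCP runs a reconciliation loop for each web service. A minimal sketch of that loop, with provisioning and discovery details abstracted behind hypothetical callables, might look like this:

```python
from typing import Callable, List


def reconcile(service_name: str,
              desired_count: Callable[[str], int],
              list_healthy: Callable[[str], List[str]],
              provision_vm: Callable[[str], None],
              retire_vm: Callable[[str], None]) -> None:
    """Bring the number of healthy instances in line with the configured count."""
    desired = desired_count(service_name)   # e.g. read from ZooKeeper configuration
    running = list_healthy(service_name)    # e.g. from the ephemeral registrations

    if len(running) < desired:
        # A failed server simply shows up as a shortfall, so remediation is
        # the same code path as an ordinary scale-up.
        for _ in range(desired - len(running)):
            provision_vm(service_name)      # e.g. claim a ready VM from the factory pool
    elif len(running) > desired:
        for instance in running[desired:]:
            retire_vm(instance)             # de-register and tear the VM down
```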

Rethinking Upgrades

Software upgrades of a web service are also handled by MCP by specifying how many instances of the new version should be running. MCP will ensure that the specified number of instances is started, and can verify the health of those web services by checking that the expected service registration information is present in ZooKeeper. Depending on the upgrade scenario, the old version can be de-configured and removed, or left in place if two versions of the web service should run concurrently for A/B testing or for performance comparison purposes.
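
The health check lends itself to a short sketch: wait until the expected number of instances of the new version have registered in ZooKeeper. Assuming each instance stores its version in its registration payload (an illustrative convention carried over from the registration sketch above), this could look like:

```python
import json
import time

from kazoo.client import KazooClient


def wait_for_version(zk: KazooClient, service: str, version: str,
                     expected: int, timeout_seconds: int = 600) -> bool:
    """Return True once `expected` instances of `version` have registered."""
    deadline = time.time() + timeout_seconds
    base_path = f"/services/{service}"   # hypothetical registration path
    while time.time() < deadline:
        count = 0
        for child in zk.get_children(base_path):
            data, _stat = zk.get(f"{base_path}/{child}")
            if json.loads(data).get("version") == version:
                count += 1
        if count >= expected:
            return True
        time.sleep(5)
    return False
```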

Auto-Scaling to Meet Changes in Demand

Now that the count of application servers per web service is configurable, we have the opportunity to make the system scale as needed by measuring things like web traffic and system load, and by adjusting the application server counts automatically. For web services that experience increased traffic, MCP would ensure that additional VMs are spun up and brought online to meet the higher demand. This process can be extended to cover other parts of the infrastructure as well. For instance, to deal with an increasing amount of traffic, the pool of reverse proxy servers running nginx could be scaled dynamically. And when combined with Load Balancer as a Service (LBaaS), MCP would be able to provision additional reverse proxy servers as needed and automatically add them to the load balancer configuration.
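
A scaling policy of this kind can be quite small: map an observed metric to a desired instance count and let the reconciliation loop act on the result. The metric, threshold, and bounds below are illustrative, not values HFS actually uses:

```python
import math


def desired_count(requests_per_second: float,
                  target_rps_per_instance: float = 200.0,
                  min_instances: int = 2,
                  max_instances: int = 20) -> int:
    """Derive a target application server count from observed traffic."""
    needed = math.ceil(requests_per_second / target_rps_per_instance)
    return max(min_instances, min(max_instances, needed))
```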

Conclusion

MCP has allowed our Hosting Foundation Services (HFS) to scale its application servers while web services are added dynamically. Developers and end users have the Swagger specification they need, and access control is simplified by having a single endpoint. We’ve learned to think differently about how the infrastructure is constructed by taking a second look at each part of it and asking, “Is this something that can be provisioned automatically as needed?”

The concepts introduced by MCP could be taken further by bootstrapping a complete environment from only a ZooKeeper cluster and an instance of MCP. In that case, MCP would build out the initially empty environment as needed, including reverse proxy servers, application servers, and any other server type that can be orchestrated via Puppet. At that point, the whole environment could be recreated on demand, making MCP an attractive option for developers to replicate a complex environment with minimal effort.

Join us!

If you’re interested in working on challenging projects like this, GoDaddy has several positions available. Check out the GoDaddy Jobs page to learn about all our current openings across the company.