The GoDaddy FIND Team is responsible for services that help suggest domain names. While this sounds reasonably straightforward, it is a critical part of taking a customer from the point of being interested to purchasing the perfect domain name, and when it comes to premium names, the higher price point means the suggestions must be right on target. Starting from an initial suggestion, be it a domain name or even just search terms, we leverage ElasticSearch to quickly and insightfully suggest names to the customer.
We can even take into account customer information like previous purchases, other domains a customer owns, or any other hint. The inventory of premium names comes from a number of different sources including names available for sale and auctions from both internal as well as partner providers. We load this data continuously and put it into ElasticSearch. From this index, our engine can query for good candidates to present to customers.
Why Should You Care?
Like most things here at GoDaddy, this process needs to be fast, accurate, and reliable. Fast and reliable can be accomplished by any modern cache or key/value lookup system. Accurate can be achieved by quite a few robust search systems, many of which don’t win any races when it comes to speed and make “reliable” a nice way to discover humor. ElasticSearch, when used properly, hits all three requirements. Learning how to use ElasticSearch gives us a solid resource for collecting, analyzing, and serving data that allows our customers to make solid purchasing decisions.
GoDaddy had to configure ElasticSearch by trial-and-error, test-and-measure and, in some cases, making outright guesses to see how everything played out. ElasticSearch, while more mature now, is still somewhat of an emerging technology and learning how to use it for our specific needs was both a challenge and an opportunity.
What Is ElasticSearch?
Let’s take a closer look at how we use ElasticSearch to help customers find domain names at GoDaddy and examine some of the challenges we faced and the solutions we uncovered.
ElasticSearch is a scalable, distributed search engine that lives on top of Lucene, a Java-based indexing and search engine. ElasticSearch is fast and provides an elegant API-based query system. The first thing we need to understand is how ElasticSearch defines indexes that live in shards across nodes and how it replicates data for reliability. Our first challenge was ensuring that our index was always available and returned results fast enough to support real-time search.
An ElasticSearch index is the data set upon which our searches are performed, and shards are pieces of that data set distributed amongst nodes, which are individual installations of ElasticSearch, typically one per machine. For the GoDaddy Domain Find team, we rebuild our index daily while also taking in real-time feeds of auction domains that inform us when names are added and deleted as well as when prices change. We have set up a Jenkins job to bring in this data and add it to our current index throughout the day without impacting read operations. To do this, we have configured our ElasticSearch cluster to separate nodes that hold data from nodes that we make available via our API for doing searches. This way, as we’re taxing the data nodes with new data, the API nodes are not impacted. We even turn off replication during large loads so that each batch loaded does not start a brand new replication operation. These have now become somewhat standard practices with ElasticSearch, but when we were first starting out, it was the Wild West! This strategy produced the best results after much trial and error.
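As a rough illustration of pausing replication around a large load, here is a minimal Python sketch. It only builds the requests; `send` is a hypothetical callable standing in for whatever HTTP client you use, the index name is made up, and each batch is assumed to be an already formatted bulk payload. The `number_of_replicas` index setting is the real knob; everything else is an assumption for illustration, not our production job.

```python
def replica_settings(count):
    """Body for PUT /<index>/_settings that sets the replica count."""
    if count < 0:
        raise ValueError("replica count cannot be negative")
    return {"index": {"number_of_replicas": count}}


def load_with_replication_paused(send, index, batches, replicas_after):
    """Drop replicas to zero, send every batch, then restore the replica count.

    `send(method, path, body)` is a hypothetical transport function.
    """
    settings_path = "/{}/_settings".format(index)
    # With zero replicas, a batch is written once instead of once per copy.
    send("PUT", settings_path, replica_settings(0))
    for batch in batches:
        send("POST", "/{}/_bulk".format(index), batch)
    # Turning replicas back on lets the cluster copy the finished shards once.
    send("PUT", settings_path, replica_settings(replicas_after))


if __name__ == "__main__":
    # Print the requests instead of sending them anywhere.
    load_with_replication_paused(
        lambda method, path, body: print(method, path),
        "premium_names", ["<bulk payload 1>", "<bulk payload 2>"], 1)
```

The sketch does not restore replicas if a batch fails partway through; whether you would want that depends on how you recover from a failed load.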
We scale our system by adding new nodes, which reduces the number of shards per node (thus reducing load) because ElasticSearch re-allocates shards automatically. Indeed, most of the scaling and distribution of ElasticSearch is done automatically with hints based on your configuration. Any given node may have all shards or a subset, thereby distributing the load when a search is performed. This is a key benefit of ElasticSearch in that it does the hard work for you.
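To make the effect of adding nodes concrete, here is a small back-of-the-envelope sketch in Python. It assumes ElasticSearch balances shard copies evenly across nodes, which is a simplification of what the allocator really does, and the function name is ours.

```python
import math


def max_shards_per_node(primaries, replicas, nodes):
    """Rough upper bound on shards one node holds, assuming an even spread."""
    # Copies of the same shard never share a node, so at most `nodes`
    # copies of each shard can actually be assigned.
    assigned_copies = min(1 + replicas, nodes)
    return math.ceil(primaries * assigned_copies / nodes)


if __name__ == "__main__":
    # Five primary shards with one replica each is ten shard copies in total.
    print(max_shards_per_node(5, 1, 2))  # 5
    print(max_shards_per_node(5, 1, 4))  # 3
```

Going from two nodes to four with the same index drops the worst-case load from five shards per node to three.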
The default number of replicas is one. A replica is a replication of your index, so with a replica of one you have two complete sets of data: one to start plus one replica. ElasticSearch will ensure that a shard and its replica are never stored on the same node (provided you have more than one node, of course). If a node should go down, it will take with it either a shard or its replica, but not both. So you’re covered. Likewise, if you have more than two nodes and two replicas, you could suffer a failure of two nodes and still have your data available.
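The arithmetic behind that claim can be sketched in a few lines of Python. This assumes every copy has been assigned to a node and that the failures happen together, before the cluster has had time to rebuild the lost copies; the function name is ours.

```python
def tolerated_node_failures(replicas, nodes):
    """How many nodes can fail at once while every shard keeps one copy."""
    # Each shard has one primary plus `replicas` copies, but copies of the
    # same shard never share a node, so only `nodes` of them can be placed.
    placed_copies = min(1 + replicas, nodes)
    return placed_copies - 1


if __name__ == "__main__":
    print(tolerated_node_failures(replicas=1, nodes=2))  # 1
    print(tolerated_node_failures(replicas=2, nodes=3))  # 2
    print(tolerated_node_failures(replicas=1, nodes=1))  # 0
```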
For our system, we chose to have two data nodes and two client nodes that hold no data, but have the interfaces for performing queries. This was also trial and error. We tried two and we tried six.
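For readers who have not set this up before, the split comes down to two per-node settings. The Python sketch below renders them as config lines. It assumes the boolean `node.master` and `node.data` settings used by ElasticSearch releases of this era (newer releases express roles differently), and the choice of which nodes are master-eligible is an assumption for illustration, not a record of our exact configuration.

```python
# Assumed role definitions; only the two setting names come from ElasticSearch.
NODE_ROLES = {
    # Holds shards and does the indexing work.
    "data": {"node.master": True, "node.data": True},
    # Holds no shards; only routes queries and merges results.
    "client": {"node.master": False, "node.data": False},
}


def config_lines(role):
    """Render a role as lines suitable for a node's config file."""
    settings = NODE_ROLES[role]
    return ["{}: {}".format(key, str(value).lower())
            for key, value in sorted(settings.items())]


if __name__ == "__main__":
    for line in config_lines("client"):
        print(line)
    # node.data: false
    # node.master: false
```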
The Black Magic
Deciding how many shards and replicas is somewhat of an untested black art both here at GoDaddy and also in the wild. Trial and error on the number of shards, if you have the time and patience, is an interesting exercise. We started with the default of five and tested performance. We then increased or decreased and then remeasured. As ElasticSearch matures, this area will likely receive attention from developers. At the ElastiCon15 conference this year, none of the presenters had the same configuration in terms of size and that was rather telling. Each had to determine their configuration individually based on their use cases. One thing to note is that once you create an index with a set number of shards, you cannot change it. You can create a new index, of course, but the one you’ve created has its shard count set. Replica count, however, can be changed at any time and ElasticSearch will re-allocate shards across nodes appropriately.
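The difference between the two settings shows up in the request bodies. The following Python sketch builds a create-index body, where the shard count is chosen, and a settings-update body, where only the replica count may change. The helper names are ours, and the guard is just a local check that mirrors the rule described above rather than anything ElasticSearch ships.

```python
def create_index_body(shards=5, replicas=1):
    """Body for creating an index; the shard count is decided here."""
    return {"settings": {"index": {"number_of_shards": shards,
                                   "number_of_replicas": replicas}}}


def update_settings_body(**changes):
    """Body for updating settings on an existing index."""
    if "number_of_shards" in changes:
        # The shard count is fixed at creation; build a new index instead.
        raise ValueError("number_of_shards cannot be changed after creation")
    return {"index": changes}


if __name__ == "__main__":
    print(create_index_body())
    print(update_settings_body(number_of_replicas=3))
```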
For our purposes in the GoDaddy Domain FIND team, we stuck with the default of five shards because our data set does not contain a huge number of documents and five shards is a decent number. If your document count is high, you would want to consider more shards to split up the data to make queries (and indexing) faster. We also found that for our use, four nodes provided enough headroom and redundancy, so we set three replicas in our configuration.
Down the Rabbit Hole: Cross Data-Center Distribution
What about distributing across multiple data centers like we have? The official answer seems to be, “sure, but it’s not supported.” If the pipe between your data centers is reasonably wide and reliable, there’s no reason you can’t. We tried it with nodes in different data centers and the communication between them got bogged down and caused nodes to overload and time out – taking the whole cluster down! For now, it’s my recommendation that we avoid doing it. This, too, is something the ElasticSearch developers say they’re working on improving. Honestly, I’ll believe it when I see it.
For monitoring, we use a number of tools, but the one we like best is the very simple “head” plugin.
In the image above, we see the output from the head plugin. Each index is spread across six nodes and each node has five shards. In this case, we have set the number of replicas to five, meaning we have a total of six copies (the original plus five replicas). Primary shards have a bold border and replicas do not. This tool gives us a great but simple visual representation of our cluster and also provides for quick ad-hoc queries and modifications to our indexes.
Once everything is reasonably balanced, queries are fast. Ours tend to be sub-10 milliseconds, which allows ElasticSearch to be used as a real-time responsive system. While we had a number of challenges in crafting efficient queries, once we got over that hurdle things have been fast and stable ever since. Word to the wise: don’t put wildcards in your queries that result in huge intermediate results. It’s not pretty.
ElasticSearch also provides a more heavyweight monitoring solution called Marvel. Our first try at Marvel was less than impressive as it put too much of a load on each node and filled the indexes with lots of monitoring information that cluttered things up. I’m told that this has improved dramatically in the past year and we’re keen to give Marvel another try.
Challenge: Indexing a Live System
What about indexing? For our team, our biggest challenge is that we need to ensure 100% uptime and that includes ensuring ongoing indexing does not impact read operations. Failing to provide solid premium domain recommendations means money left on the table. So when we rebuild our entire index every night, we do it in a creative way by indexing to only one node which is marked as never being a master node and containing only data. This is, in essence, a “write” node. We turn off replication, create the new index, and load into it. This operation takes about two hours. Once done, we turn replication back on and let the index be copied to all nodes, including those from which we read. Once that is complete, we then tell ElasticSearch about the new index and make it primary using an alias. This gives us zero downtime for reads as well as keeping the heavy lifting of indexing constrained to one node until it is done. When we add records throughout the day, we do it to that one node and let it push the updates to the read-only nodes.
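The final cut-over is the part that makes this zero downtime, so here is a minimal Python sketch of it. The aliases endpoint accepts a list of actions and applies them together, so readers querying the alias never see a moment with no index behind it. The alias and index names below are hypothetical.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases that repoints an alias in one request."""
    return {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]}


if __name__ == "__main__":
    # Hypothetical names: readers always query the "premium" alias.
    body = alias_swap_body("premium", "premium_monday", "premium_tuesday")
    print(json.dumps(body, indent=2))
```

Because readers only ever query the alias, the old index can be deleted at leisure once the swap has gone through.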
For us at GoDaddy, such strategies make a lot of sense when considering how uptime, responsiveness, and indexing operations can potentially impact read performance.
The fine folk at Elastic created a video wherein they asked a number of people how they’re using ElasticSearch and why they like it. I got to ramble for a while and they used a couple clips.
If you’re interested in working on these and other fun problems with us, check out our jobs page.