Crowdsourcing for Complex Tasks: How to Ensure Quality Output

Here at GoDaddy, we use crowdsourcing in our pipeline for extracting content from the web. This pipeline helps us convert arbitrary web pages into semantically structured data such as price lists (e.g. restaurant menus). There are many ways to design crowdsourcing tasks, ranging from microtasks (seconds per task) to complex tasks (up to hours per task). The best task type depends on the problem you are trying to solve and how much context crowdworkers need. In a previous post, we discussed the trade-offs between task types, explained why complex tasks work better for our use case, and touched on how task design helps ensure high-quality output. In this post, we look in more depth at how to achieve quality output.

A Hierarchy of Workers


Our workers are organized in a 3-level hierarchy, with the least experienced workers at the bottom and most experienced at the top. Workers at the bottom process a task from scratch, while workers at higher levels, called reviewers, review and correct their output. As a task moves through the hierarchy, the quality should improve. This hierarchical structure is uncommon for crowd work, where the typical model is to treat all workers as interchangeable. Here are the main reasons we chose this design:

Complex tasks require more training. Our tasks require a few weeks for a new worker to onboard. This is the opposite of microtasks, where you can typically learn the task in a few minutes. A hierarchy of review allows experienced workers to provide feedback and instruction to new workers, serving as a key part of the training process. Our system facilitates communication by allowing workers to leave messages and annotate a task with comments.

Reviewing takes less time than reprocessing. Reviewing a task for mistakes takes much less time than processing a new task. Some tasks require a lot of manual data entry, in which case it is much easier to check the task for errors than to reproduce it.

The output of complex tasks is difficult to compare. In crowdsourcing, there is a chance that a spammer picks up your task. The common way to deal with this is to send a task to multiple workers and take a majority vote on the output. This works well for tasks with simple outputs like true/false or multiple choice, but is much harder for tasks with complex outputs like a structured price list, where two honest workers rarely produce exactly the same text.
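To make the last point concrete, here is a minimal sketch of majority voting in Python. The function name and the sample answers are made up for illustration and are not taken from our pipeline. It shows that voting aggregates simple answers cleanly, while free-form price lists that differ only in formatting each end up with a single vote.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the share of workers who gave it."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Simple outputs: two of three workers agree, so there is a clear winner.
simple_answers = ["yes", "yes", "no"]
print(majority_vote(simple_answers))

# Complex outputs (hypothetical price lists): the workers agree on the
# content, but small formatting differences mean no two strings match.
price_lists = [
    "Burger 8.99\nFries 3.50",
    "Burger 8.99\nFries 3.5",
    "Burger $8.99\nFries 3.50",
]
winner, share = majority_vote(price_lists)
print(share)  # every answer has exactly one vote
```

Exact-match voting could be replaced with a fuzzy comparison, but then you have to decide how similar two price lists must be to count as the same answer, which is a hard problem in its own right.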

Saving Money & Maintaining Quality with Machine Learning

We have a hierarchy of workers doing complex work. Can we do better? One metric we care about is keeping costs low. Another is output quality. Is there a way for us to lower costs while keeping similar quality output? The simple answer is yes. The secret is that not all reviews are the same. Some reviews result in lots of fixes, while others make no changes at all. What if we could use machine learning to predict how many fixes a review will make? Then we could focus our efforts on only the tasks that need it the most.

We use machine learning to train a model to decide which tasks to review. In training this model, there are a couple of points to consider. First, what data do we use for training? We do not have “ground truth” data in the sense that we are 100% certain the task output is correct. Instead, we have tasks that have been corrected by trusted reviewers. For model training, we assume that the output of a task after review is “ground truth”.

The other question is, what exactly is the model predicting? There are many ways to measure how much a task has changed. We need a quantitative measure of output quality that we can train against. In our case, the task output is a large blob of text with roughly one significant unit of data per line. A natural measure of quality is the percentage of lines that differ between the worker output and “ground truth”. We train our model to predict this percentage line difference, and then pick the tasks with the highest predicted line difference for review. We can evaluate our model’s performance with the following graph, which shows the total errors caught if we review the tasks that the model picked, compared to randomly picked tasks, as we vary the review budget from 0% to 100%.
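As an illustration, percentage line difference could be computed along these lines. This is a sketch under assumptions, not our production code: it uses `difflib.SequenceMatcher` from the Python standard library to align the two outputs, and it counts a line as different if it falls in any inserted, deleted, or replaced span. The function name and the sample price list are hypothetical.

```python
import difflib


def percent_line_difference(worker_output, reviewed_output):
    """Percentage of aligned lines that differ between two task outputs.

    Assumption: the reviewed output stands in for "ground truth", and each
    line holds roughly one unit of data.
    """
    worker_lines = worker_output.splitlines()
    reviewed_lines = reviewed_output.splitlines()
    matcher = difflib.SequenceMatcher(
        a=worker_lines, b=reviewed_lines, autojunk=False
    )

    changed = 0
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A span covers as many rows as the longer of its two sides.
        span = max(i2 - i1, j2 - j1)
        total += span
        if tag != "equal":
            changed += span

    if total == 0:  # both outputs are empty
        return 0.0
    return 100.0 * changed / total


worker = "Burger 8.99\nFries 3.50\nSoda 1.99\nSalad 6.00"
reviewed = "Burger 8.99\nFries 3.75\nSoda 1.99\nSalad 6.00"
print(percent_line_difference(worker, reviewed))  # one of four lines differs
```

Dividing by the number of aligned rows, rather than by the length of either output, keeps the value between 0 and 100 even when the reviewer adds or removes lines.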

In production, we dynamically change the threshold for review to match our desired review budget. Currently, our production review budget is set to around 40%. From the performance graph, you can see that at this budget our model catches 50% more errors than random selection.
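One simple way to match a threshold to a budget is to pick the cutoff from the distribution of recent predictions. The sketch below assumes we keep a window of recent predicted line differences. The function names and the sample scores are hypothetical, and the exact mechanism we use in production may differ.

```python
def review_threshold(recent_scores, budget):
    """Score cutoff such that roughly `budget` of recent tasks meet it.

    `recent_scores` are predicted line differences for recently seen tasks,
    and `budget` is the fraction of tasks we can afford to review (0 to 1).
    """
    if not recent_scores or budget <= 0:
        return float("inf")  # review nothing
    ranked = sorted(recent_scores, reverse=True)
    n_review = min(len(ranked), max(1, round(budget * len(ranked))))
    return ranked[n_review - 1]


def needs_review(predicted_score, threshold):
    """Send a task to review when its predicted difference meets the cutoff."""
    return predicted_score >= threshold


# Hypothetical predictions for ten recent tasks, with a 40% review budget.
scores = [2.0, 35.0, 0.5, 12.0, 60.0, 4.0, 8.0, 1.0, 20.0, 3.0]
threshold = review_threshold(scores, budget=0.4)
reviewed = [s for s in scores if needs_review(s, threshold)]
print(threshold, len(reviewed))
```

Recomputing the threshold as new predictions arrive keeps the share of reviewed tasks close to the budget even if the mix of incoming tasks shifts.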

The Future of Crowdsourcing at GoDaddy

There’s so much more we want to do with human workers in our workflows. Right now, all of our tasks involve some kind of data extraction / entry, but we are curious to see how the crowd can be used for more subjective, creative tasks like designing a website! We also plan to use crowd workers to help train models for personalization to help small businesses better target their users. Stay tuned for more posts about that in the future.

For more details on the topics discussed in this blog post, you can read our paper that appeared in the 41st International Conference on Very Large Data Bases (VLDB 2015).  If you are interested in joining our team, you can apply at