The past few months have been crazy for Locu. Now that our approach is being validated by awesome companies and developers building products with our structured menu data, we’re excited to share what we’ve learned along the way. We’ve got several posts in the pipeline covering crowdsourced labor, design, ethics, and enterprise- and small-business-facing products, to name a few. For those lessons to make sense, we first need to answer a simple question: what does Locu do, anyway?
To explain that, let’s start with something very familiar to Locu: restaurant menus. We can process a lot more than menus, but the approach described below is similar across data domains.
Menu styles are as varied as the restaurants they represent. Visit a few restaurant websites and you will find menus embedded in HTML tables, PDFs, low-quality scanned images, or the Flash animations that your smartphone has taught you to hate. Local price lists such as menus are notoriously unstructured beasts that contain data with immense value to many folks. That’s why we made local data the first stop on our path to structure the world’s information.
A Data Structuring Workflow with Humans in the Loop
So how do we go from an unstructured menu to a structured set of entries in a database listing menu items, prices, descriptions, choices (e.g., pick three toppings!), and additions (e.g., extra cheese for $1!)? While the details are complicated, our workflow roughly comes down to three components: a crawler, a learner, and a set of awesome crowd workers.
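To make "a structured set of entries" concrete, here is a minimal sketch of how one such record could be modeled. The class and field names (Menu, MenuSection, MenuItem, ChoiceGroup, Addition, max_picks) are assumptions made for illustration and are not Locu's actual schema.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class Addition:
    """An optional extra with its own price, e.g. extra cheese for $1."""
    name: str
    price: Decimal


@dataclass
class ChoiceGroup:
    """A set of options the diner picks from, e.g. pick three toppings."""
    prompt: str
    options: List[str]
    max_picks: int = 1


@dataclass
class MenuItem:
    name: str
    price: Optional[Decimal] = None  # some items list no price
    description: str = ""
    choices: List[ChoiceGroup] = field(default_factory=list)
    additions: List[Addition] = field(default_factory=list)


@dataclass
class MenuSection:
    name: str
    items: List[MenuItem] = field(default_factory=list)


@dataclass
class Menu:
    name: str  # e.g. "Lunch" or "Wine"; one venue can have several menus
    sections: List[MenuSection] = field(default_factory=list)


def build_example_menu() -> Menu:
    """Build a one-item menu that uses every field (all values are made up)."""
    pizza = MenuItem(
        name="Build-your-own pizza",
        price=Decimal("12.00"),
        description="Thin crust with tomato sauce and mozzarella.",
        choices=[
            ChoiceGroup(
                prompt="Pick three toppings",
                options=["Mushroom", "Onion", "Olive", "Pepper"],
                max_picks=3,
            )
        ],
        additions=[Addition(name="Extra cheese", price=Decimal("1.00"))],
    )
    return Menu(name="Dinner", sections=[MenuSection(name="Pizza", items=[pizza])])
```

Prices are stored as Decimal rather than float so that a value like 12.00 survives round trips without rounding surprises.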
A Crawler. Locu’s crawler travels around the web, sniffing out anything that smells like a restaurant’s website, and when it finds a restaurant, anything that has the hint of a menu. A restaurant isn’t limited to one menu: one venue might have different offerings for breakfast, lunch, dinner, dessert, and wine, to list a few.
A Learner. Machines are pretty good at recognizing patterns. A dollar sign is a good indication of a price, and various visual elements on a page (boldface, repeated HTML divs) signal menu items and descriptions. We’ll save the details for another post, but we’ve trained a suite of classifiers that identify different menu elements and extract them into a structured collection of menu sections, items, descriptions, and prices.
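As a rough illustration of the kind of signal described above, the sketch below turns one line of menu text into a few features that a classifier could consume. The regular expression, the function names, and the feature names are all hypothetical; this is not our production feature set, only a way to see how "a dollar sign is a good indication of a price" might be encoded.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: a dollar sign, an optional space, digits, optional cents.
PRICE_RE = re.compile(r"\$\s?(\d+(?:\.\d{2})?)")


def extract_price(line: str) -> Optional[str]:
    """Return the first price-like amount in the line as a string, or None."""
    match = PRICE_RE.search(line)
    return match.group(1) if match else None


def line_features(line: str, is_bold: bool) -> Dict[str, object]:
    """Compute simple textual and visual cues for one line of a menu."""
    text = line.strip()
    return {
        "has_price": extract_price(text) is not None,
        "is_bold": is_bold,                          # visual cue taken from the page
        "word_count": len(text.split()),             # descriptions tend to be longer
        "ends_with_period": text.endswith("."),      # prose-like lines hint at descriptions
    }
```

For example, line_features("Margherita $12", is_bold=True) flags both a price and boldface, which together would suggest a menu item rather than a description.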
A Crowd. Machine learning isn’t a perfect drop-in for human intelligence, and getting our menus to near-100% data quality takes a human touch. All machine-extracted data is displayed to workers in an interface like this one:
Data Entry Specialists (DESs) comb through the wikitext-like content that our algorithms extracted while referencing the original menu. They fix errors, type text that the robots missed, and identify duplicate or incorrectly classified menus. DES edits are vetted by Quality Specialists (QSs), workers who have been with us for a while and have done enough excellent DES work to earn a promotion. We’ve found that this apprenticeship relationship between DESs and QSs helps train new workers while maintaining high-quality extracted data.
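The DES and QS relationship can be sketched as a small bookkeeping model: a QS vets each DES edit, and a DES with a strong enough track record is promoted. The thresholds MIN_APPROVED and MIN_APPROVAL_RATE, along with every name below, are invented for illustration; the real promotion criteria are not described in this post.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    DES = "data entry specialist"
    QS = "quality specialist"


@dataclass
class Worker:
    name: str
    role: Role = Role.DES
    approved_edits: int = 0
    rejected_edits: int = 0


# Hypothetical thresholds, chosen only to make the example concrete.
MIN_APPROVED = 50
MIN_APPROVAL_RATE = 0.95


def record_review(editor: Worker, reviewer: Worker, approved: bool) -> None:
    """Tally the outcome of one edit vetted by a Quality Specialist."""
    if reviewer.role is not Role.QS:
        raise ValueError("only a Quality Specialist can vet edits")
    if approved:
        editor.approved_edits += 1
    else:
        editor.rejected_edits += 1


def maybe_promote(worker: Worker) -> bool:
    """Promote a DES to QS once enough of their edits have been approved."""
    total = worker.approved_edits + worker.rejected_edits
    if worker.role is not Role.DES or total == 0:
        return False
    approval_rate = worker.approved_edits / total
    if worker.approved_edits >= MIN_APPROVED and approval_rate >= MIN_APPROVAL_RATE:
        worker.role = Role.QS
        return True
    return False
```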
Learning and training aren’t limited to our crowd workers: our algorithms are lifelong learners, and the menus corrected by our crowd workers keep them learning from their mistakes.
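One simplified way to picture that feedback loop: compare the labels the machine guessed with the labels the crowd settled on, and keep the disagreements as fresh training examples. The sketch assumes the two versions are already aligned token by token and uses made-up label names; real corrections also add and remove text, which this does not handle.

```python
from typing import List, Tuple

# Each entry is (text, label); labels such as "item" or "price" are hypothetical.
Labeled = Tuple[str, str]


def corrections_to_examples(
    machine: List[Labeled], corrected: List[Labeled]
) -> List[Labeled]:
    """Return the crowd-corrected entries whose label differs from the machine's guess."""
    if len(machine) != len(corrected):
        raise ValueError("this sketch assumes the two versions are aligned")
    return [
        fixed
        for guess, fixed in zip(machine, corrected)
        if guess[1] != fixed[1]
    ]
```

Given a machine pass that labeled "$12" as a description and a crowd pass that relabeled it as a price, the function returns just that one corrected entry, ready to be added to the next round of training data.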