At Slice, we work on a very complicated problem: categorizing all commerce purchases. These purchases range from men’s athletic shoes to laptops to concert tickets. Our machine learning team has built an engine that can classify each of our billions of products into our multilevel taxonomy consisting of thousands of categories and thousands of brands.
For example, we would classify the product Panasonic 35-100mm f/2.8 X OIS into the category Electronics & Accessories > Cameras & Photo > Camera Lenses and the brand Panasonic.
With categorization, we’re now able to answer some really interesting questions, like how Apple and Samsung compare to each other this quarter versus last quarter in the mobile phone space, or how loyal Apple iPhone users are vs. Samsung Galaxy users.
In this blog post, we discuss the challenges a traditional machine learning approach may face when approaching such a problem, and how we have addressed them by inserting different types of humans into our system through rules, crowdsourcing, and outsourcing.
The traditional approach to machine learning is to painstakingly create a labeled training set and a machine learning algorithm that can learn from this training set. This algorithm can then be applied to classify unlabeled data:
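To make this concrete, here is a minimal sketch of that train-then-classify loop: a toy Naive Bayes classifier over product titles. The titles, categories, and model choice are illustrative assumptions, not Slice’s actual engine.

```python
# Toy example of the traditional approach: learn from a labeled training
# set, then classify unlabeled titles. All data here is made up.
import math
from collections import Counter, defaultdict

def tokenize(title):
    return title.lower().split()

class NaiveBayesClassifier:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # category -> token counts
        self.doc_counts = Counter()               # category -> #examples
        self.vocab = set()

    def train(self, labeled_examples):
        for title, category in labeled_examples:
            self.doc_counts[category] += 1
            for tok in tokenize(title):
                self.token_counts[category][tok] += 1
                self.vocab.add(tok)

    def classify(self, title):
        total = sum(self.doc_counts.values())
        scores = {}
        for category, docs in self.doc_counts.items():
            denom = sum(self.token_counts[category].values()) + len(self.vocab)
            score = math.log(docs / total)  # log prior
            for tok in tokenize(title):
                # add-one smoothing keeps unseen tokens from zeroing a category
                score += math.log((self.token_counts[category][tok] + 1) / denom)
            scores[category] = score
        return max(scores, key=scores.get)

training = [
    ("panasonic 35-100mm f/2.8 lens", "Camera Lenses"),
    ("canon ef 50mm lens", "Camera Lenses"),
    ("mens running shoes size 10", "Athletic Shoes"),
    ("nike air max shoes", "Athletic Shoes"),
]
clf = NaiveBayesClassifier()
clf.train(training)
print(clf.classify("sony 85mm lens"))  # → Camera Lenses
```

The real system is, of course, far larger in both data and model complexity, but the shape of the pipeline is the same: labeled examples in, a trained model out, predictions on everything unlabeled.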
While machines are great (they are fast, scalable, consistent), there are a few issues that can make things difficult for them.
First off, the engine needs to continuously evolve. New products are being sold every month (“Pokemon Go”, bone conduction headphones, etc.), and if the machine doesn’t adapt to them, it will mistakenly put the beloved “Angry Birds” game into a “Pet Supplies” category. Without continued adaptation, training data quickly becomes unrepresentative and stale.
In addition, there is a long tail of edge cases. For example, if you have a skirts category and a shorts category, how do you classify a skort? Teaching a machine all of these edge cases and gotchas can be quite challenging. This is why it is relatively easy to get a machine learning classifier to 80% or even 90% accuracy, but very difficult to push it to 95% and higher. Before you know it, the machine suffers the death of a thousand paper cuts.
Lastly, if the only human involvement in the system is a machine learning engineer, this is problematic, as machine learning engineers are a limited resource (especially for a small company like Slice), and they quickly become a bottleneck for the system.
Our solution is to build tools that allow other humans to improve our system, so the engine can continuously improve. In addition to machine learning engineers, we have three types of human input:
- analysts with domain expertise
- crowdsourced workers
- outsourced workers
These workers can directly improve the engine by adding training data or creating rules.
Rules are important to our system because they can quickly bring up the accuracy in a deterministic fashion. Improving the algorithms of a machine learning system can sometimes feel like fitting a large rug into a small room. One corner can be fixed at the expense of another. Rules can address this issue. For example, we can create a rule to put products that contain both shampoo and conditioner into “Health & Beauty > Hair Care > Combo Sets” instead of “Health & Beauty > Hair Care > Shampoos” or “Health & Beauty > Hair Care > Conditioners”. In addition, rules can be helpful when training data is sparse or not representative for a particular category (for example, a new category like wearables). Finally, rules can be an effective and easy way to handle the long tail of edge cases.
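A rule can be sketched as a predicate on the product title plus a target category, where rules are checked before falling back to the learned model. The shampoo/conditioner rule mirrors the example above; the predicate representation and first-match-wins ordering are illustrative assumptions, not Slice’s actual rule engine.

```python
# Hypothetical sketch of deterministic rules layered over a classifier.
# Each rule: (predicate over the lowercased title, target category path).
RULES = [
    # The more specific combo rule must come before the single-product rules.
    (lambda t: "shampoo" in t and "conditioner" in t,
     "Health & Beauty > Hair Care > Combo Sets"),
    (lambda t: "shampoo" in t,
     "Health & Beauty > Hair Care > Shampoos"),
    (lambda t: "conditioner" in t,
     "Health & Beauty > Hair Care > Conditioners"),
]

def categorize(title, fallback_classifier=None):
    t = title.lower()
    for predicate, category in RULES:  # first matching rule wins
        if predicate(t):
            return category
    # No rule fired: defer to the machine-learned classifier.
    return fallback_classifier(title) if fallback_classifier else None

print(categorize("Herbal 2-in-1 Shampoo and Conditioner"))
# → Health & Beauty > Hair Care > Combo Sets
```

Because a rule is deterministic, fixing one category this way can’t silently degrade another, which is exactly the rug-fitting problem it sidesteps.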
We have three types of human input because each has its own pros and cons. By mixing and matching them, we can cover one type’s weaknesses with another’s strengths.
Analysts have the domain expertise, and the fluency with data analysis tools (like SQL and data visualization tools), to quickly diagnose the health of categories. However, they are not scalable: a business cannot hire an army of analysts.
Crowdsourced workers are very scalable as they are paid per task and can be located anywhere in the world. As more work needs to be done, you just create tasks on demand on your favorite crowdsourcing platform. You don’t need to hire a large team that may be idle at times. Below is an example of a task we send to a crowdsourced worker.
Note that in the interface above, we don’t ask the crowdsourced worker to choose among thousands of categories. Rather, we ask the worker to choose the best category at each level and to give feedback on the explanations of each category as they choose. Based on which category they choose, we only show the children of that category at the next level. This way, the worker is only ever choosing among a handful of categories, rather than thousands.
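That level-by-level flow can be sketched as a walk down a taxonomy tree, where each worker choice narrows the next menu to the chosen node’s children. The taxonomy fragment below is a made-up illustration.

```python
# Hypothetical taxonomy fragment: nested dicts, leaves are empty dicts.
TAXONOMY = {
    "Electronics & Accessories": {
        "Cameras & Photo": {"Camera Lenses": {}, "Tripods": {}},
        "Cell Phones": {},
    },
    "Health & Beauty": {
        "Hair Care": {"Shampoos": {}, "Conditioners": {}, "Combo Sets": {}},
    },
}

def options_at(path):
    """Return the handful of choices shown at the current level."""
    node = TAXONOMY
    for step in path:  # descend through the worker's picks so far
        node = node[step]
    return sorted(node.keys())

# Each pick narrows the next menu to that node's children.
print(options_at([]))
# → ['Electronics & Accessories', 'Health & Beauty']
print(options_at(["Electronics & Accessories", "Cameras & Photo"]))
# → ['Camera Lenses', 'Tripods']
```

At every step the worker faces a short list, even though the full taxonomy has thousands of leaf categories.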
Now, a given crowdsourced worker may not be 100% accurate in his or her labels. After all, these workers are not domain experts. We’ve noticed that a given worker’s labels are, on average, about 80% accurate. We improve the accuracy of workers’ labels by sending the same task to multiple workers and using voting to triangulate the correct label. This has allowed us to improve the accuracy from 80% to 95%.
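The voting step can be sketched as majority aggregation, plus a back-of-the-envelope check of why redundancy helps: assuming independent errors, three 80%-accurate workers give roughly 90% majority accuracy and five give roughly 94%. The quorum size, tie handling, and independence assumption are illustrative, not Slice’s exact scheme.

```python
# Hypothetical sketch of label aggregation by majority vote.
from collections import Counter
from math import comb

def majority_label(labels, min_agreement=2):
    """Return the winning label if enough workers agree, else None
    (meaning: escalate the task, e.g. to an analyst)."""
    winner, votes = Counter(labels).most_common(1)[0]
    return winner if votes >= min_agreement else None

def majority_accuracy(p, n):
    """P(majority of n independent workers is right, each right w.p. p).
    Ignores votes splitting across 3+ labels -- a simplification."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_label(["Shampoos", "Combo Sets", "Combo Sets"]))  # → Combo Sets
print(majority_label(["Shampoos", "Combo Sets", "Conditioners"]))  # → None
print(round(majority_accuracy(0.8, 3), 3))  # → 0.896
print(round(majority_accuracy(0.8, 5), 3))  # → 0.942
```

In practice the gains also come from routing disagreements to more skilled reviewers, which is where the 95% figure becomes reachable.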
Crowdsourcing is great for micro-tasks, but it doesn’t work well for more involved tasks. For example, tasks that involve data analysis, filtering, or searching through hundreds or thousands of products wouldn’t work well for crowdsourced workers. In addition, these workers may not have much domain expertise.
This is why we augment with outsourced workers. Outsourced workers can take on more involved tasks, and we work closely with them to teach them the domain, the taxonomy, and how to use our tools effectively. Because they get much more practice on these tasks, they become very productive and eventually develop domain expertise of their own.
By utilizing a more elastic and diverse human workforce, we have created continuous monitoring of the quality of our engine and a mechanism to quickly improve the quality when an issue is detected. As a result, we have created a large and diverse training set of tens of millions of products and tens of thousands of rules for our product categorization system. This multi-tiered approach using humans in the loop can be generalized to other machine learning systems that require high precision classifications.
If you’re interested in tackling these challenging data science problems to build the world’s largest purchase graph, we’re hiring!