
This article was published on March 25, 2018

How to use machine learning for your startup’s product

Story by David Crawford

David Crawford is the director of software engineering at Alation, a collaborative data catalog company.

There’s a misconception that to leverage machine learning you need to be a mathematical genius. In reality, most machine-learning applications use well-understood, well-tested, off-the-shelf algorithms.

For many developers, especially those at startups, the real challenge lies in getting good training data. Overcoming this challenge takes clever product development with an eye on user experience.

Do you really need machine learning?

Machine learning can make a good product even better: more engaging, more responsive, and more effective. But before tackling machine learning, ask yourself whether algorithms are right for your product.

Start by testing the learning aspect with humans before jumping into machine learning. This will give you a better sense of whether the result is actually useful. It will also give you an idea of when a human should be involved and when the machine learning should take over.

Often, a product’s sweet spot lands somewhere in between human-run and machine learning-automated. Either the human helps the computer when the algorithm is out of its depth or the computer helps the human to scale. For example, Clara Labs has differentiated its scheduling assistant by knowing which tasks are good for algorithms and when a real person needs to step in. This hybrid approach has helped Clara Labs separate itself from AI-only virtual assistants, which lose trust when the AI falters.
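One common way to build this kind of hybrid is to let the algorithm handle only the cases it is confident about and hand everything else to a person. The sketch below is a minimal illustration of that idea, not a description of how Clara Labs or any other company works. The `Prediction` class, the `route` and `handle` functions, and the 0.9 threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float  # model's estimated probability, between 0.0 and 1.0


def route(prediction: Prediction, threshold: float = 0.9) -> str:
    """Return "machine" when the model is confident enough, else "human"."""
    return "machine" if prediction.confidence >= threshold else "human"


def handle(requests: list[tuple[str, Prediction]],
           threshold: float = 0.9) -> dict[str, list[str]]:
    """Split request ids into a machine-handled queue and a human-review queue."""
    queues: dict[str, list[str]] = {"machine": [], "human": []}
    for request_id, prediction in requests:
        queues[route(prediction, threshold)].append(request_id)
    return queues


if __name__ == "__main__":
    # Hypothetical requests: one the model is sure about, one it is not.
    incoming = [
        ("req-1", Prediction("reschedule", 0.97)),
        ("req-2", Prediction("cancel", 0.55)),
    ]
    print(handle(incoming))
```

The threshold is a product decision as much as a technical one. Raising it sends more work to people but lowers the chance that users see the algorithm fail.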

Once you have identified that your product would benefit from machine learning and know how much machine learning is right, then comes the challenge of labeling the training data.

Labeling the data

Without high-quality, labeled training data, the accuracy of a machine-learning model is limited. Labels are what allow models to predict, classify or analyze data accurately.

Manually labeling data is a thankless, relatively low-level job. The best machine-learning products find ways to integrate labeling into the application’s overall experience.

Trading value for labels

For the sheer number of labels needed to train algorithms, manual labeling is often too time-consuming. Instead, well-designed, thoughtful applications often leverage users to do much of the labeling. The goal is to take the task that humans are good at, transfer the knowledge to the applications and have the applications take over.

For instance, reCAPTCHA is a free service from Google that helps protect websites from spam and abuse. The user has to identify images to prove that they aren’t a bot. At the same time, reCAPTCHA is training algorithms to recognize real-world objects. The images themselves are the training data and as users identify objects, the data gets the labels it needs.
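When many users label the same item, their answers have to be combined into a single label, and some answers will be wrong. A simple, generic way to do this is a majority vote with a minimum level of agreement. The sketch below illustrates that technique only. It is not how reCAPTCHA works internally, and the function name `aggregate_labels` and its `min_votes` and `min_agreement` parameters are assumptions chosen for the example.

```python
from collections import Counter
from typing import Optional


def aggregate_labels(votes: list[str],
                     min_votes: int = 3,
                     min_agreement: float = 0.7) -> Optional[str]:
    """Return the consensus label, or None if there is not enough agreement yet."""
    if len(votes) < min_votes:
        return None  # too few answers to trust any label
    label, count = Counter(votes).most_common(1)[0]
    # Accept the most frequent label only if enough voters agree on it.
    return label if count / len(votes) >= min_agreement else None


if __name__ == "__main__":
    # Hypothetical answers from four users who were shown the same image.
    print(aggregate_labels(["car", "car", "car", "bus"]))
    # Two answers are below min_votes, so no label is assigned yet.
    print(aggregate_labels(["car", "bus"]))
```

Items that return `None` can simply be shown to more users until a consensus emerges.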

There’s a cautionary tale here. Labeling cannot be a means to a far-off end. If the task you’re using to collect labels doesn’t have value, or the user won’t see the value for a long time, users won’t take part. Even reCAPTCHA, with its clear benefit to security and quality, wears on the nerves of internet users, an issue Google has been grappling with.

If users are going to label your data, the labeling must be clearly and immediately valuable. Generally speaking, there are two types of value. The first is making the action valuable in its own right. For example, we’re willing to tag Facebook photos because it lets our friends and family know that they’re in the picture. With the labels, Facebook begins to recognize faces, making it easier to find people in pictures in the future. Although it may take some time before Facebook’s algorithm recognizes your best friend’s face, the act of labeling has value in and of itself.

The second type of value comes when the labeling has an immediate impact. Netflix asks users to rate movies with the promise that it will help improve movie recommendations. To make the value clear, Netflix immediately responds with new recommendations based on the rating you just gave.

Another tactic is to make labeling a game. Foursquare was successful at getting users to provide location data by incentivizing location check-ins. Dedicated users provided valuable labels about locations while competing for “badges” and “mayorship.”

While Foursquare no longer needs to rely on check-ins thanks to passive location tracking, the competitive check-in aspect lives on in Foursquare Swarm. Early on, all those check-ins gave Foursquare information that added greater context to its location data.

While tying the labeling process to clear value is an effective way to enlist users to label data, there are also tactics that don’t require active user involvement.

Derive from behavior

One way around enlisting users to actively label data is to observe their behavior. The benefit of deriving labels from behavior is that the user doesn’t need to actively participate in the labeling process. This eliminates a lot of the pitfalls that can harm user experience.

For example, Amazon observes your buying behaviors to recommend products and deals. At my company, we took a similar route. We monitor data usage, like which reports get used most and which SQL queries are being written, to help analysts find the right data set for the task at hand.
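To make the idea concrete, here is a minimal sketch of turning a usage log into a ranking signal. It is an illustration under stated assumptions, not the implementation used at Alation or Amazon. The log format, the data set names, and the `usage_counts` and `rank_candidates` functions are all hypothetical.

```python
from collections import Counter


def usage_counts(log: list[tuple[str, str]]) -> Counter:
    """Count how often each data set appears in a log of (user, data set) events."""
    return Counter(dataset for _user, dataset in log)


def rank_candidates(candidates: list[str], counts: Counter) -> list[str]:
    """Order candidate data sets so the most frequently used ones come first."""
    # A Counter returns 0 for data sets that never appear in the log.
    return sorted(candidates, key=lambda name: counts[name], reverse=True)


if __name__ == "__main__":
    # Hypothetical query log: which analyst queried which data set.
    log = [
        ("ana", "sales_2017"),
        ("bo", "sales_2017"),
        ("ana", "sales_raw"),
        ("cy", "sales_2017"),
    ]
    matches = ["sales_raw", "sales_tmp", "sales_2017"]
    print(rank_candidates(matches, usage_counts(log)))
```

No one was asked to label anything here. Each query acts as an implicit vote that a data set was useful for some task.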

Learning without teachers

In the near future, users may not be as important for producing training data. Simulations provide a contained environment and a natural way to label data, because the outcome of every run is known. Chess, Go, and Pong are all games that can be easily simulated, allowing thousands or even hundreds of thousands of scenarios to run. Google’s AlphaZero was able to teach itself chess and beat the leading chess programs, going on to master two other games while only playing itself.

While board games are closed environments, simulation is also helping to train devices intended to function in the real world. Autonomous vehicle developer Waymo is using simulations to train self-driving cars. The company is using virtual environments based on real-world locations to train vehicles for real-world driving. While very new, simulation offers the potential to create labels without human intervention.
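A toy example shows why simulation makes labeling cheap. In the sketch below, two random players take turns removing one to three stones from a pile, and whoever takes the last stone wins. Every position seen during a game is labeled afterward with whether the player to move went on to win. This is only an illustration of outcome-based labeling, not how AlphaZero or Waymo train their systems. The game, the function names, and the parameters are all assumptions chosen for the example.

```python
import random


def simulate_game(rng: random.Random, start: int = 10) -> list[tuple[int, int]]:
    """Play one random game and return (stones_left, label) pairs.

    The label is 1 if the player about to move from that position
    eventually won the game, else 0. Assumes start >= 1.
    """
    stones = start
    player = 0
    history = []  # (stones_left, player_to_move) for every position seen
    while stones > 0:
        history.append((stones, player))
        stones -= rng.randint(1, min(3, stones))  # take 1 to 3 stones
        player = 1 - player
    winner = 1 - player  # the player who made the last move took the last stone
    return [(s, 1 if p == winner else 0) for s, p in history]


def generate_dataset(n_games: int, seed: int = 0) -> list[tuple[int, int]]:
    """Run many simulated games and pool their labeled positions."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_games):
        examples.extend(simulate_game(rng))
    return examples


if __name__ == "__main__":
    data = generate_dataset(1000)
    print(len(data), "labeled positions generated without any human input")
```

The labels come from the rules of the simulation itself, so generating more training data is only a matter of running more games.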

User experience is paramount

Machine learning can help make more compelling, responsive products. Users aren’t going to provide their data or patiently train your algorithms if the value isn’t there and the experience isn’t compelling. Whether the user is directly labeling your data, indirectly labeling your data or not involved at all — user experience is paramount.

For startups, this demands another layer of design thinking. Not only does the product itself need to be great, but if users are contributing to the machine learning, the data collection and training process must be just as compelling. But it is exactly these kinds of hurdles that drive creativity.