Companies are collecting a mountain of data. What should they do with it?

It’s called the information age for a reason.

From our tweets and status updates to our Yelp reviews and Amazon product ratings, the internet-connected portion of the human race generates 2.5 quintillion bytes of computer data every single day. That’s 2.5 million one-terabyte hard drives filled every 24 hours.

The takeaway is clear: in 2017, there’s more data than there’s ever been, and there’s only more on the way.

So what are savvy companies doing to harness the data their human users shed every day? They’re finding meaningful ways to release it for public experimentation. By opening up their data, companies large and small can, in effect, invite the public to a hackathon that yields novel applications they wouldn’t otherwise have the time or resources to build.

From America Online to Netflix, companies across industries have released datasets to the public to differing ends.

In the case of crowdsourced restaurant review giant Yelp, its Dataset Challenge saw the company release more than 4 million reviews and 200,000 pictures pertaining to 156,000 individual businesses as downloadable data. From students to computer scientists, interested parties around the world took up the challenge, finding novel ways to sort the data and build applications on top of it. “We think there is incredible promise in the ways people can use Yelp’s data to understand food trends, build Yelp chat bots, or understand the visual content of local businesses,” says Yelp Senior Vice President of Engineering Jason Fennell.
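For the curious, here is roughly what working with that download looks like. The reviews ship as newline-delimited JSON records; the minimal Python sketch below (the file name and the "stars" field follow Yelp's published dataset format, which is an assumption on our part rather than something stated in this article) simply tallies reviews by star rating.

```python
import json
from collections import Counter

# Minimal sketch: count Yelp reviews by star rating.
# Assumes the review file is newline-delimited JSON with a "stars" field,
# per Yelp's published dataset documentation (file name may differ).
star_counts = Counter()

with open("yelp_academic_dataset_review.json", encoding="utf-8") as f:
    for line in f:                      # one JSON object per line
        review = json.loads(line)
        star_counts[review["stars"]] += 1

for stars in sorted(star_counts):
    print(f"{stars} stars: {star_counts[stars]:,} reviews")
```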

For example, students from the University of Virginia’s computer science program fed Yelp’s data into a personalized sentiment classification model based on social psychology theory and the human tendency to associate with people like themselves. Fennell explained that “most text-based sentiment analysis models work at a global level and use localized group psychology, failing to capture wide-ranging opinions amongst users.” UVA’s project yielded a more nuanced picture of people’s regional tastes.
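To make that distinction concrete, here is a minimal sketch, not the UVA team's actual model, of the difference between a "global" classifier that sees only review text and a "personalized" one that also sees which user wrote the review. The toy data, column names, and scikit-learn setup are illustrative assumptions.

```python
# Illustrative sketch only -- not the UVA model. It contrasts a "global"
# sentiment classifier (text features alone) with a "personalized" one that
# also sees who wrote the review, letting the model learn per-user bias.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for Yelp review rows: text, the reviewer, and a binary label
# (1 = positive). Real rows would come from the dataset's review file.
df = pd.DataFrame({
    "text": ["great tacos", "too salty for me", "salty just how I like it",
             "slow service", "great patio", "too sweet for me"],
    "user_id": ["u1", "u2", "u3", "u1", "u3", "u2"],
    "label": [1, 0, 1, 0, 1, 0],
})

# "Global" model: one classifier over text features, blind to the author.
global_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
global_model.fit(df["text"], df["label"])

# "Personalized" model: same text features plus a one-hot user identity.
personalized_model = Pipeline([
    ("features", ColumnTransformer([
        ("tfidf", TfidfVectorizer(), "text"),
        ("user", OneHotEncoder(handle_unknown="ignore"), ["user_id"]),
    ])),
    ("clf", LogisticRegression()),
])
personalized_model.fit(df[["text", "user_id"]], df["label"])

# Same review text, but now the model also knows who wrote it.
print(personalized_model.predict(
    pd.DataFrame({"text": ["too salty for me"], "user_id": ["u3"]})))
```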

Whatever a company’s end goal may be, the experts generally concur: it’s good to release certain types of data to the public, where an army of interested tinkerers can pick it up. It’s an attitude that resonates with well-trod territory in the open-source software community: sharing is caring, and a rising tide lifts all boats.

“Releasing data is good,” says Richard Ford, chief scientist of computer security software company Forcepoint. “I come from an academic world, where we have tons of ideas and no data. Now I’m in the commercial world, where we have tons of data and no time to execute ideas. Releasing data lets other people experiment for us, but we always worry about the potential for deanonymization.”

Therein lies the other edge of the sword: if a public dataset contains sensitive information that can make someone personally identifiable, it sets the stage for all kinds of trouble. It has happened before.

On August 4, 2006, AOL released a database of 20 million internet searches made by 650,000 users over a three-month period. No names appeared in the data, but each search was tied to a numeric ID for the user who made it, and many of those searches contained personally identifiable information. We humans are unique, and that uniqueness makes us identifiable: knowing just a little about a person can be enough to pick them out of a sea of data. AOL pulled the dataset just three days later, but the genie was out of the bottle, with copies already flying around the internet. CNN called it the 57th dumbest business moment of 2006.

That same year, a pair of researchers at the University of Texas at Austin successfully deanonymized entries in the Netflix Prize dataset, which contained the movie ratings of roughly 500,000 Netflix members. Knowing just a few things about a particular subscriber, such as a handful of movies they had rated and roughly when, was enough to find that person’s entries in the data. The main takeaway: stripping names and other personally identifiable information from a database is not sufficient to anonymize it. There are too many other ways in.
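A toy example shows why. The linkage sketch below is not the Texas researchers' algorithm; the subscriber IDs, movie titles, and ratings are invented purely to illustrate how a few known facts can single out one record in an "anonymized" table.

```python
# Toy illustration of the linkage idea behind deanonymization, not the
# researchers' actual method. The "anonymized" table has no names, only
# opaque subscriber IDs -- yet a handful of known ratings is enough to
# single one subscriber out.
anonymized_ratings = {
    "user_0001": {"Movie A": 5, "Movie B": 2, "Movie C": 4},
    "user_0002": {"Movie A": 1, "Movie D": 5, "Movie E": 3},
    "user_0003": {"Movie B": 2, "Movie C": 4, "Movie F": 5},
}

# Auxiliary knowledge: a few ratings we happen to know the target person gave
# (say, from conversation or a public profile).
known_about_target = {"Movie B": 2, "Movie C": 4, "Movie F": 5}

def overlap_score(profile, auxiliary):
    """Count how many of the known ratings match this anonymized profile."""
    return sum(1 for title, stars in auxiliary.items()
               if profile.get(title) == stars)

best_match = max(anonymized_ratings,
                 key=lambda uid: overlap_score(anonymized_ratings[uid],
                                               known_about_target))
print("Most likely record for the target:", best_match)  # user_0003
```

With real data, this kind of matching tolerates noisy or approximate auxiliary information, which is part of what made the Netflix result so striking.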

The consensus seems to be that the pros of releasing data far outweigh the cons, especially when a robust privacy policy is in place. Doing so lets outside perspectives in, and it can result in new, interesting features that company leadership may never have thought of. Public access to well-organized datasets can also play a meaningful role in educating a new community of developers and technologists.

“Respect for privacy is so important when you release a dataset,” says Ford. “Data is an asset, but it’s also a liability.”
