Researchers: Anonymized data does little to protect user privacy

Providing third-parties with data is a necessary cost of living in the 21st century. Whether it’s securing auto insurance, undergoing a routine examination at the dentist, or chatting up friends and relatives on Facebook, each of us will hand over about 1.7MB of data per second next year, according to one recent report.

While our anxiety around how this data will be used has grown considerably in recent years, culminating with the launch of a federal probe by the DOJ in recent weeks, it’s done little to stop the flow of information from individuals to companies, or from one company to another. The data trade, in fact, has overtaken oil as the world’s fastest-growing commodity market according to some experts.

And while we grow increasingly anxious about it, there’s little we can do to stop its flow. We’re assuaged at the thought of our data being anonymized, crucial data points stored as individual blips on a massive database — one that’s so large, with so many of these markers, that it’s nearly impossible to trace back to a single human.

Or, that’s what we were told, anyway. But this has never been true. In fact, we’ve known since the mid-1990s, when Dr. Latanya Sweeney, Professor of Government and Technology in Residence at Harvard University, blew that notion to pieces by identifying the medical records of William Weld (then the Governor of Massachusetts) from just three data points in an anonymous database. Dr. Sweeney, who also heads the Data Privacy Lab at the Institute of Quantitative Social Sciences at Harvard, needed only Weld’s zipcode, his date of birth, and gender to correctly identify him among countless others.

Pressed by NGOs and legislators to truly anonymize data before sharing it with others, companies started to rely on a new method called sampling. In a sample database, no individual, or company, would have access to only a small piece of an anonymous database, and not the entire thing.

In theory, it would lower the risk of re-identification of anonymous individuals by splitting the data into several, smaller samples. This makes it unlikely that any one person would be re-identified, because the number of anonymous data points on each person would be split across several databases — and no company or individual would be able to access all of them.

According to the Office of the Australian Information commissioner, sampling “[creates] uncertainty that any particular person is even included in the dataset.” Or, to put it simply, sampling will prevent re-identification of anonymous individuals. But this too is false.

According to a trio of European researchers, individuals in a sample database can be re-identified 83 percent of the time using just three data points: their gender, date of birth, and zip code. They created a handy tool (that doesn’t store collected data) that you can use to find out how likely you are to be re-identified by these three data points. For me that’s 45 percent of the time, much better than average, but still shockingly high.

In an article published in Nature Communications, the team developed a statistical model that could correctly identify 99.98 percent of Americans using 15 characteristics from an anonymized dataset, including age, gender, and marital status.

The 15 needed characteristics may seem unrealistic for a single company to have obtained, or for an individual to have provided. It’s not. Facebook, Google, and Amazon alone have hundreds, perhaps thousands of data points to pull from, data that you’ve given up based on your search history, the ads you click, and the purchases you make. At this point, the companies don’t even need you to give them this data, as they can make an educated guess that’s reasonably accurate based on your behavior while using certain websites or applications.

And what they aren’t tracking, they’re buying. Data brokers are big business, and exist solely to provide competitive insights into everything from your household income to who you voted for in the last election.

According to the researchers:

Contrary to popular belief, sampling a dataset does not provide plausible deniability and does not effectively [protect] people’s privacy.

We believe that, in general, it is time to move away from de-identification and tighten the rules for constitute truly anonymized data. Making sure data can be used statistically, e.g., for medical research is extremely important but cannot happen at the expense of people’s privacy. Datasets such as the NIGMS and NIH genetic data, the Washington State Health Data, the NYC Taxicab dataset, the Transport For London bike sharing dataset, and the Australian de-identified Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Schedule (PBS) datasets have been show to be easily re-identifiable.

Anonymized data is better than the alternative, but it’s clear that we have some work to do in increasing our understanding of what’s collected and how it may be used against us.