A group of researchers from Tel Aviv University recently discovered how shockingly easy it is to identify a person based on a tiny sample of their musical listening preferences.
The gist: When companies such as Spotify use your data to train their AI, they remove all the identifying markers such as your name, account number, or anything else a computer or person could use to immediately identify you. What’s left is the raw data on your listening preferences, such as how many times you’ve listened to a track and whether you’ve given a track a thumbs up or not.
The big idea here is that Spotify and Pandora don’t need to know who you are in order to serve you the music you like. And this matters because privacy is a huge issue. We have to take companies at their word when they claim they won’t sell our private information or exploit our identities for third-party companies.
The research: Data doesn’t change when you remove labels. It still holds the same information even if you don’t know who generated it. And that means, given enough data, a powerful enough system can usually trace it back to the person responsible.
This is a scary prospect for people who care about their privacy, but typically there isn’t much to worry about. Companies such as Spotify take great pains to protect their data as it’s usually in their best interest to safeguard their users.
However, there’s no way for companies to protect us from good old-fashioned human intuition.
Per the Tel Aviv team’s paper:
In this paper we introduce a methodology to re-identify users based on their music selections, and prove the efficiency of the methodology empirically in four experiments.
The experiments: The researchers didn’t use algorithm-busting AI or digital privacy-smashing techniques to de-anonymize data. They just asked study participants to look at playlists containing only three song selections and decide who, among a group of strangers, each list belonged to.
Per a university press release:
The findings surprised even the researchers. The analysis of the data showed that the group members were able to identify the study participants according to their musical taste at a very high level of between 80 and 100%, even though the group members did not know each other well and had no prior knowledge of each other’s musical preferences.
Quick take: This is astounding. The small study population (N=150) makes it less-than-perfect, but the resulting accuracy is certainly unexpected. What’s most important here is that humans typically aren’t trained to extract high-level features from anonymized data.
That means this experiment proved efficacious in a void. A bad actor could apparently use techniques as simple as combining human intuition with sorting algorithms to de-anonymize data containing more important identifying features than just what songs we’re likely to enjoy.
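To make the "sorting algorithm" half of that idea concrete, here is a minimal Python sketch. It is not the researchers' method, which relied on human judges. It assumes a hypothetical attacker who already holds a rough list of tracks each candidate is known to like, and it ranks those candidates by how much an anonymous three-song playlist overlaps with each list. Every name and track label below is made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets: shared items divided by all distinct items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(
    anonymous_playlist: Set[str],
    known_tastes: Dict[str, Set[str]],
) -> List[Tuple[str, float]]:
    """Score every candidate against the playlist, best match first."""
    scores = [
        (name, jaccard(anonymous_playlist, tastes))
        for name, tastes in known_tastes.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical side information: tracks each candidate is known to like.
known_tastes = {
    "person_a": {"track_01", "track_02", "track_03", "track_04"},
    "person_b": {"track_05", "track_06", "track_07"},
    "person_c": {"track_03", "track_08", "track_09"},
}

# An "anonymized" playlist of three songs, with the owner's name removed.
playlist = {"track_01", "track_03", "track_10"}

ranking = rank_candidates(playlist, known_tastes)
best_guess = ranking[0][0]
print(ranking)
print("Best guess:", best_guess)
```

In this toy setup the playlist shares two tracks with person_a and one with person_c, so person_a ranks first. Real listening data would be far noisier, but the sketch shows why stripping names alone does not erase the signal in the data.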
As the researchers conclude, “In the digital world we live in today, these findings have far-reaching implications on privacy violations, especially since information about people can be inferred from a completely unexpected source, which is therefore lacking in protection against such violations.”
You can check out the full study here.