Google open-sources tool for companies that aims to keep personal data private

For a company that’s in the business of tracking users’ online activities, Google sure is going all out to prove it’s dead serious about privacy.

To that effect, the internet behemoth is open-sourcing a library that it uses to glean insights from aggregate data in a privacy-preserving manner.

Called Differentially Private SQL, the library leverages the idea of differential privacy (DP) — a statistical technique that makes it possible to collect and share aggregate information about users, while safeguarding individual privacy.

This allows developers and organizations to build tools that can learn from aggregate user data without revealing any personally identifiable information.

The approach can be particularly useful if companies want to share confidential data sets with one another without being exposed to de-anonymization (or re-identification) attacks.

Limiting exposure to personal information

“If you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care,” said Miguel Guevara. “Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner.”

For the uninitiated, DP works by strategically adding random noise, either to an individual’s information before it’s uploaded to the cloud or to the aggregate results computed from it. As a result, the total dataset can still yield meaningful results that, while not exact, are accurate enough, without spilling that individual’s sensitive data.
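To illustrate how noise added on the individual's side can still produce useful totals, here is a minimal sketch of randomized response, a classic textbook technique. It is not the code Google released; the function names, the coin-flip probabilities, and the simulated survey are assumptions made for this example.

```python
import random
from typing import List


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report the true answer half the time, otherwise report a fair coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports: List[bool]) -> float:
    """Undo the noise in aggregate.

    P(report is yes) = 0.5 * p + 0.25, where p is the true yes-rate,
    so p = 2 * observed - 0.5.
    """
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5


# Hypothetical survey: 10,000 people, 30% of whom would truthfully answer "yes".
rng = random.Random(42)
true_answers = [i < 3000 for i in range(10000)]
reports = [randomized_response(answer, rng) for answer in true_answers]
estimate = estimate_true_rate(reports)
print(f"estimated yes-rate: {estimate:.3f}")
```

No single report can be trusted, since any "yes" may be the result of a coin flip, yet the estimate over thousands of reports should land close to the true 30% rate.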

What Google has open-sourced is essentially a process that allows organizations to perform differentially private aggregations on databases. In addition to allowing multiple records to be associated with an individual user, “developers can compute counts, sums, averages, medians, and percentiles using our library,” the search giant said.
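A differentially private aggregation of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not Google's library or its API: `dp_sum`, its parameters, and the sample hospital records are hypothetical, and the sketch keeps the first few rows per user where a production system would more likely sample them at random.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_sum(
    records: List[Tuple[str, float]],
    epsilon: float,
    max_rows_per_user: int,
    lower: float,
    upper: float,
    rng: random.Random,
) -> float:
    """Noisy sum in which each user may contribute several records."""
    per_user: Dict[str, List[float]] = defaultdict(list)
    for user_id, value in records:
        # Bound how many rows any one user can contribute (assumption: keep the first ones).
        if len(per_user[user_id]) < max_rows_per_user:
            # Clamp each value so a single row cannot dominate the sum.
            per_user[user_id].append(min(max(value, lower), upper))
    true_sum = sum(sum(values) for values in per_user.values())
    # Adding or removing one user changes the sum by at most this much.
    sensitivity = max_rows_per_user * max(abs(lower), abs(upper))
    return true_sum + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical data: (patient id, days admitted).
stays = [("p1", 3.0), ("p1", 5.0), ("p1", 7.0), ("p2", 100.0)]
noisy_total = dp_sum(stays, epsilon=1.0, max_rows_per_user=2,
                     lower=0.0, upper=30.0, rng=random.Random(0))
print(f"noisy total days: {noisy_total:.1f}")
```

Capping both the number of rows per user and the range of each value is what lets the noise be calibrated to the most any one person could change the result.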

The goal of DP is not data minimization: it won’t stop companies from scooping up your personal data. Rather, it’s about not leaking that information when inferring patterns through data mining techniques.

Google is not the only player

One of Google’s own earliest initiatives with differential privacy was RAPPOR, a method for anonymously crowdsourcing statistics from apps such as Chrome with “strong privacy guarantees.”

Since then, the company has used differential privacy to protect many different types of information, from the location data of its Google Fi mobile customers to features that show how popular a restaurant’s dish is in Google Maps.

Google even plans to leverage DP as part of its new proposal for an anti-tracking policy for the web, a move that has provoked strong criticism from privacy advocates.

But the search giant is far from the only player. Differential privacy undergirds the machine learning algorithms that Apple uses to statistically anonymize iPhone user data while still drawing useful results.

A 2017 study, however, found flaws in Apple’s approach, especially with regard to the privacy budget (or privacy loss parameter), which determines the tradeoff between accuracy and privacy.
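The tradeoff governed by the privacy budget can be made concrete with a small calculation. The sketch below assumes the standard Laplace mechanism, where the noise scale equals the query's sensitivity divided by epsilon, and basic sequential composition, where a total budget is divided among queries. The helper names are hypothetical, and none of this describes Apple's or Google's actual parameters.

```python
import math


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale for the Laplace mechanism: a smaller epsilon means more noise."""
    return sensitivity / epsilon


def split_budget(total_epsilon: float, num_queries: int) -> float:
    """Sequential composition: the epsilons of successive queries add up,
    so an even split gives each query total_epsilon / num_queries."""
    return total_epsilon / num_queries


# A counting query has sensitivity 1: one person changes the count by at most 1.
for epsilon in (0.1, 1.0, 10.0):
    scale = laplace_scale(1.0, epsilon)
    std_dev = math.sqrt(2) * scale  # standard deviation of Laplace noise
    print(f"epsilon={epsilon:>4}: typical error of about {std_dev:.2f}")

# Spreading a total budget of 1.0 over 4 queries leaves each with 0.25,
# so every individual answer is four times noisier.
per_query = split_budget(1.0, 4)
print(f"per-query epsilon: {per_query}, noise scale: {laplace_scale(1.0, per_query)}")
```

A larger budget buys more accurate answers at the cost of weaker privacy, which is why the size of the budget, and how often it is replenished, tends to be the focus of outside scrutiny.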

Uber, similarly, has its own DP tool called FLEX, which is used to keep queries from revealing too much about any individual Uber rider or driver.

A big list of open-source initiatives

Part of the reason rolling out a DP scheme is not straightforward is that the mechanism must be foolproof in safeguarding data against all kinds of unintended consequences after release, including a data breach.

By making the library open-source, Google wants not only to improve its offering through extensive feedback from academia and the general tech community, but also to see the tool embraced by other developers, sparing them the need to design custom DP solutions.

Differentially private data analysis also joins a long list of privacy-focused open-source initiatives from the company, including Federated Learning, TensorFlow Privacy, Private Join and Compute, Private Set Intersection, and confidential computing, all geared toward improving privacy and security at different levels of the internet machinery.

“From medicine, to government, to business, and beyond, it’s our hope that these open-source tools will help produce insights that benefit everyone,” said Guevara.

With Silicon Valley tech majors increasingly under the regulatory spotlight for a string of privacy missteps, Google’s ongoing efforts could be perceived as an attempt to justify the data collection that fuels its lucrative targeted advertising business.

Ultimately, the jury is still out on the benefits of differential privacy. But even if it helps fix only a few of the data security and protection problems ailing big tech today, it’s worth it.