We can’t allow big tech companies to thwart the ‘right to remember’

The European Union recently adopted laws embodying a proposed “right to be forgotten,” to protect individuals from eternal memorialization of unfortunate past indiscretions. However, I feel it’s time to propose a complementary “right to remember,” to ensure that history cannot be erased or rewritten at the whim of those who control the systems we use to communicate, plan, and lead our lives.

Recent court cases have shown that the largest, most powerful companies controlling the internet are willing to take extreme positions regarding their right to control data after it’s been made public. They abuse ambiguous, out-of-date US legislation such as the Computer Fraud and Abuse Act and the Digital Millennium Copyright Act to threaten and punish companies that dare to collect data that is explicitly intended for public view. They also employ sophisticated software to thwart and discourage automated collection of data they deem undesirable. (My company is currently involved in such litigation against LinkedIn.)

The potential harm here goes way beyond the success or failure of a few small Silicon Valley startups. To keep the powers that be accountable, we need to support and protect the right to view, archive, and make available information that was made public in the past. No individual, company or government has the right to restrict access to information of note that was publicly available at any point. No one has the right to fiddle with history.

Data is made public when people want to disseminate that data broadly — to publish it. Platform designers can specify what data stays behind a wall (available only to explicit “users” who have signed up and signed in) and what data is made freely available for search engines, individual users, or — presumably — anyone with an interest and a connection to the internet.

What rights does the public currently have to this supposedly public data? In most countries, that question can be framed through an understanding of the principle of copyright. Copyright is a form of monopoly control afforded to creators of “original work,” to ensure that those who have invested considerable time and effort to develop that work have the ability to profit from it. It’s important, however, to understand the limitations of copyright. From the US Copyright Office:

Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.

There is also a principle called fair use (in the US) or fair dealing (in some other jurisdictions, such as the UK and Canada):

In its most general sense, a fair use is any copying of copyrighted material done for a limited and “transformative” purpose, such as to comment upon, criticize, or parody a copyrighted work. — Stanford University Library

But assuming the use is within the confines of existing copyright law, it seems kosher to archive for posterity publicly available data on the World Wide Web that is accessible through a browser without a username and password or any other access-control mechanism. To do this at any meaningful scale, you would need to automate the task, programming computers to browse (scrape, crawl, collect) this data in ways similar to how a search engine operates.

Websites use a (non-standard, de facto) protocol to tell “robots” (automated systems, as opposed to humans with a visual browser) how the sites would like them to behave. At nearly every major website, you can see this in action simply by typing “/robots.txt” after the site’s domain name. Many websites use this file to allow crawling by major search engines such as Google and Bing, but explicitly discourage automated data collection in general.
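To make that concrete, here is a minimal sketch of how a well-behaved crawler might read such a file, using the robots.txt parser in Python's standard library. The file contents, the example.com URL, and the "my-archiver" user agent are all invented for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two named search crawlers are welcome,
# every other automated visitor is asked to stay away entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the file permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
url = "https://example.com/profile/jane-doe"  # assumed example URL
for agent in ("Googlebot", "Bingbot", "my-archiver"):
    print(agent, may_fetch(parser, agent, url))
```

With this assumed file, the two named search crawlers are permitted and the hypothetical archiver is refused, even though all three are requesting the same public page. Note that the file is only a request: nothing in the protocol itself enforces it.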

Websites encourage search engines to catalog their pages because they want their content to be seen, to be “discoverable.” They are actively promoting the dissemination of their data by encouraging the dominant search engines to index their sites and push them up to the top of search results.

Automatic, human viewing

Let’s think about this for a moment. If someone suggested a protocol that said you could view a website only if you worked at a short list of companies, that person would rightly be pilloried. But what the typical robots.txt file says is exactly that — except it applies to automated viewing as opposed to human viewing.

The deciding factor should not be whether you view the data using one piece of software (a popular browser) or another (a script you wrote in a programming language such as Python). The issue should simply be: Do I have a right to copy (save, archive, collect) this publicly available data?

Various attempts have been made to archive the internet and web, the best known being the Internet Archive’s Wayback Machine, which tries to keep a record of websites as they existed at certain points in time. It’s polite to a fault: until recently, it would never catalog pages unless their robots.txt file allowed it. (In fact, the Internet Archive has removed access to archived sites when the present version of the site prohibits crawling, retroactively erasing historical data.)

The Internet Archive is a nonprofit group that depends on private donors to fund its operations. It’s unlikely it has the resources or ideological disposition to fight protracted legal battles should a major platform decide it doesn’t want certain pages archived.

With the advent of JavaScript-heavy websites that act more like applications than document-oriented web pages, and with the adoption of mobile apps, archiving the web is getting harder all the time. But that’s all the more reason it needs to be addressed sooner rather than later. Leaving this job (including decisions as to what to archive) to a single nonprofit seems like a bad idea.

Countervailing currents

There are some countervailing currents. Technologies such as public key encryption and blockchains (which underlie Bitcoin, Ethereum, and similar platforms) are poised to enable secure, decentralized, trusted identity and transactions. They could also be used to widely distribute and put a digital seal on information, ensuring that no one could rewrite or erase the past to suit their purposes, or take credit for works they did not create.
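As a rough illustration of what a “digital seal” could look like, the sketch below chains SHA-256 hashes over a series of archived snapshots, so that altering any earlier snapshot changes every seal that follows it. This is a toy hash chain only, not a blockchain and not a signature scheme: it has no distribution, consensus, or keys, and the function names, record layout, and sample snapshots are assumptions made up for the example.

```python
import hashlib
import json

# Placeholder "previous seal" for the first record in the chain (assumed convention).
GENESIS = "0" * 64


def seal(prev_seal: str, url: str, captured_at: str, content: str) -> str:
    """Hash one snapshot together with the seal of the snapshot before it."""
    record = json.dumps(
        {
            "prev": prev_seal,
            "url": url,
            "captured_at": captured_at,
            "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        },
        sort_keys=True,  # stable key order so the same record always hashes the same
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def build_chain(snapshots):
    """Return one seal per (url, captured_at, content) snapshot, in order."""
    seals = []
    prev = GENESIS
    for url, captured_at, content in snapshots:
        prev = seal(prev, url, captured_at, content)
        seals.append(prev)
    return seals


def verify_chain(snapshots, seals) -> bool:
    """Recompute the chain and compare it with the previously published seals."""
    return build_chain(snapshots) == seals
```

If the list of seals were published widely (which is the part a blockchain would handle), anyone holding the snapshots could call `verify_chain` to check that nothing had been quietly rewritten since the seals were issued.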

The fox cannot guard the henhouse. All the big players — Google, Facebook, Twitter, LinkedIn, Amazon, and so on — actively “monetize” this data. Google analyzes your searches. Facebook studies your social behavior. Amazon keeps track of your purchases. They then use this data to bring you targeted ads and purchasing opportunities.

It seems rather cynical and self-serving for these companies to collect all of this valuable information, make it public to ensure it’s searchable and monetizable, and then turn around and complain when third parties attempt to collect the same public data and redistribute it, either in its original form or analyzed, aggregated, and made searchable in new and interesting ways.

There is a device in the Men in Black series called the neuralyzer. The agents use this fictional gizmo to wipe the recent memories of bystanders who have witnessed classified things such as alien lifeforms or spacecraft.

Let’s keep that thing fictional.

Story by Dan Miller

Dan Miller is Chief Technology Officer of San Francisco-based hiQ Labs, Inc., a tech startup that collects and analyzes public profile information on LinkedIn to provide its clients with insights about their employees. He is a serial entrepreneur whose accomplishments include co-founding On2 Technologies, an audio/video streaming and compression company that went public with Miller as CEO and was acquired by Google in 2010. On2's HTML5/WebM video compression now powers YouTube and is natively supported in Firefox, Opera, and Chrome.