Anonymous Search Log Files for Wikipedia Released

Wikimedia releases anonymous search log files for Wikipedia, pulls the logs [Updated]

Wikimedia today announced it is making anonymous search log files for Wikipedia (and its sister projects) available for the first time, under a CC0 1.0 Universal license. This means that you can copy, modify, distribute, and perform work on the data, even for commercial purposes, all without asking permission (Wikimedia hopes you’ll cite the organization though!).

Starting today, Wikimedia will publish the search queries for the previous day at dumps.wikimedia.org/other/search. Today’s logs are already there, and Wikimedia plans to eventually have at least the last three months of search data available on any given day.

Each line in the log files is tab-separated and contains the following 10 fields:

Server hostname.
Timestamp (UTC).
Wikimedia project.
URL encoded search query.
Total number of results.
Lucene score of best match.
Interwiki result.
Namespace (coded as integer).
Namespace (human-readable).
Title of best matching article.
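
As a rough illustration of the format above, here is a minimal Python sketch that splits one log line into its 10 fields. The field names, the numeric types, the handling of empty fields, and the use of `unquote_plus` for decoding are all assumptions made for this example; the announcement only lists the fields in order, so treat this as a starting point rather than the official schema.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import unquote_plus

FIELD_COUNT = 10  # the announcement lists exactly 10 tab-separated fields


@dataclass
class SearchLogEntry:
    # Field names are hypothetical; only their order comes from the article.
    hostname: str
    timestamp: str  # kept as text: the exact timestamp format is not stated
    project: str
    query: str  # decoded from its URL-encoded form
    total_results: Optional[int]
    best_score: Optional[float]
    interwiki_result: str
    namespace_id: Optional[int]
    namespace_name: str
    best_title: str


def parse_line(line: str) -> SearchLogEntry:
    """Split one tab-separated log line into a SearchLogEntry."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(fields)}")
    host, ts, project, raw_query, total, score, interwiki, ns_id, ns_name, title = fields
    return SearchLogEntry(
        hostname=host,
        timestamp=ts,
        project=project,
        # Assumption: '+' stands for a space, as in HTML form encoding.
        query=unquote_plus(raw_query),
        # Assumption: numeric fields may be empty, e.g. when nothing matched.
        total_results=int(total) if total else None,
        best_score=float(score) if score else None,
        interwiki_result=interwiki,
        namespace_id=int(ns_id) if ns_id else None,
        namespace_name=ns_name,
        best_title=title,
    )
```

Calling `parse_line` on each line of a downloaded log file would yield one entry per search; lines with an unexpected number of fields raise `ValueError` so they can be logged or skipped.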

The log files contain unsampled queries for all Wikimedia projects in all languages. They are collected both from the search box on wiki pages after visitors submit queries, and from queries submitted from Special:Search pages. Unfortunately, autocomplete search results are not included, as Wikimedia says this generates too much data.

So, the data is supposed to be “anonymous,” but what does that mean? Well, as you can see from the list above, there is nothing in the logs that allows someone to map a query to an individual user. That means no IP addresses and no editor names. Even anonymous tokens are excluded. Furthermore, queries that contain email addresses, credit card numbers, and social security numbers are excluded.
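
Wikimedia has not published how it detects such queries, so the following is only a hypothetical Python sketch of what this kind of exclusion filter might look like. The regular expressions, the Luhn checksum used to reduce false positives on long digit runs, and the function names are all assumptions for illustration, not Wikimedia's actual rules.

```python
import re

# Illustrative patterns only; real-world detection would need to be far more thorough.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US format NNN-NN-NNNN
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits, optional separators


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_sensitive(query: str) -> bool:
    """Heuristically flag a decoded query that should be left out of a public log."""
    if EMAIL_RE.search(query) or SSN_RE.search(query):
        return True
    for match in CARD_RE.finditer(query):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            return True
    return False
```

A filter like this would have to run on the decoded query text rather than its URL-encoded form, and pattern matching alone cannot catch free-form personal details such as names or addresses typed into the search box.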

Wikimedia expects this search query data to help its editor community figure out topics of interest that are currently insufficiently covered, as well as improve its search index by benchmarking improvements against real queries. In fact, the organization even admitted: “We know that most people use external search engines to search Wikipedia because our own search functionality does not always give the same accuracy.”
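
One way editors might look for coverage gaps is to count the queries that returned no results at all. The Python sketch below assumes the 10-field, tab-separated layout described earlier (project in the third field, URL-encoded query in the fourth, total results in the fifth) and treats a total of 0 as a sign of missing coverage; both the function name and that heuristic are assumptions for this example, not something Wikimedia has described.

```python
from collections import Counter
from urllib.parse import unquote_plus


def top_zero_result_queries(lines, n=10):
    """Return the n most frequent (project, query) pairs that had zero results."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 10:
            continue  # skip malformed lines rather than guessing at their layout
        project, raw_query, total = fields[2], fields[3], fields[4]
        if total == "0":
            # Normalize case and whitespace so trivially different spellings merge.
            query = unquote_plus(raw_query).strip().lower()
            counts[(project, query)] += 1
    return counts.most_common(n)
```

Because the function accepts any iterable of lines, it can be fed an open file object directly, for example `top_zero_result_queries(open(path, encoding="utf-8"))` with `path` pointing at a downloaded log.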

At the same time, the online encyclopaedia hopes the data will give outside researchers the opportunity to discover something cool. More specifically, the organization hopes the data will be used to build “innovative applications that highlight topics that Wikipedia is currently not covering.” Personally, I think Google and Microsoft will find the results very interesting.

Update on September 20: Wikimedia has taken down the data to make additional improvements to the anonymization protocol related to the search queries. The organization says “a small percentage of queries contained information unintentionally inserted by users.” Unfortunately, Wikimedia no longer plans to publish this data until further notice.

Image credit: stock.xchng

Story by Emil Protalinski

Emil was a reporter for The Next Web between 2012 and 2014. Over the years, he has covered the tech industry for multiple publications, including Ars Technica, Neowin, TechSpot, ZDNet, and CNET. Stay in touch via Facebook, Twitter, and Google+.
