Anonymous Search Log Files for Wikipedia Released

Wikimedia releases anonymous search log files for Wikipedia, pulls the logs [Updated]

Wikimedia today announced it is making anonymous search log files for Wikipedia (and its sister projects) available for the first time, under a CC0 1.0 Universal license. This means that you can copy, modify, distribute, and otherwise use the data, even for commercial purposes, all without asking permission (though Wikimedia hopes you’ll cite the organization!).

Starting today, Wikimedia will publish the previous day’s search queries at dumps.wikimedia.org/other/search. Today’s logs are already there, and Wikimedia plans to eventually have at least the last three months of search data available on any given day.

Each line in the log files is tab separated and contains the following 10 fields:

Server hostname.
Timestamp (UTC).
Wikimedia project.
URL encoded search query.
Total number of results.
Lucene score of best match.
Interwiki result.
Namespace (coded as integer).
Namespace (human-readable).
Title of best matching article.
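
For readers who want to work with the files, the field list above can be sketched as a small Python parser. This is a minimal sketch rather than Wikimedia's own tooling: the names SearchLogEntry and parse_line are hypothetical, the sample line in the usage note is invented for illustration, and decoding "+" as a space in the query is an assumption about how the logs are URL encoded. Numeric fields are left as strings because the article does not specify their exact format.

```python
from typing import NamedTuple
from urllib.parse import unquote_plus

FIELD_COUNT = 10  # the article lists 10 tab-separated fields per line


class SearchLogEntry(NamedTuple):
    """One log line; field order follows the list in the article."""
    server: str            # server hostname
    timestamp: str         # timestamp (UTC), kept as raw text
    project: str           # Wikimedia project
    query: str             # search query, URL-decoded by parse_line
    total_results: str     # total number of results, kept as raw text
    best_score: str        # Lucene score of best match, kept as raw text
    interwiki_result: str  # interwiki result
    namespace_id: str      # namespace coded as integer, kept as raw text
    namespace_name: str    # human-readable namespace
    best_title: str        # title of best matching article


def parse_line(line: str) -> SearchLogEntry:
    """Split one tab-separated log line into its 10 fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(fields)}")
    entry = SearchLogEntry(*fields)
    # Assumption: the query uses form-style encoding, where "+" means a space.
    return entry._replace(query=unquote_plus(entry.query))
```

Calling parse_line on an invented line such as "search1\t2012-09-19 00:00:01\tenwiki\tmount+everest%21\t42\t7.5\t\t0\tMain\tMount Everest" would yield an entry whose query field reads "mount everest!". A line with any other number of fields raises ValueError instead of being silently misread.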

The log files contain unsampled queries for all Wikimedia projects in all languages. They are collected both from the search box on wiki pages after visitors submit queries, and from queries submitted from Special:Search pages. Unfortunately, autocomplete search results are not included, as Wikimedia says this generates too much data.

So, the data is supposed to be “anonymous,” but what does that mean? Well, as you can see from the list above, there is nothing in the logs that allows someone to map a query to an individual user. That means no IP addresses and no editor names. Even anonymous tokens are excluded. Furthermore, queries that contain email addresses, credit card numbers, and social security numbers are excluded.

Wikimedia expects this search query data to help its editor community identify topics of interest that are currently insufficiently covered, and to help improve its search index by benchmarking changes against real queries. In fact, the organization even admitted: “We know that most people use external search engines to search Wikipedia because our own search functionality does not always give the same accuracy.”

At the same time, the online encyclopaedia hopes the data will give outside researchers the opportunity to discover something cool. More specifically, the organization hopes the data will be used to build “innovative applications that highlight topics that Wikipedia is currently not covering.” Personally, I think Google and Microsoft will find the results very interesting.

Update on September 20: Wikimedia has taken down the data to make additional improvements to the anonymization protocol related to the search queries. The organization says “a small percentage of queries contained information unintentionally inserted by users.” Unfortunately, Wikimedia no longer plans to publish this data until further notice.

Image credit: stock.xchng

Story by Emil Protalinski

Emil was a reporter for The Next Web between 2012 and 2014. Over the years, he has covered the tech industry for multiple publications, including Ars Technica, Neowin, TechSpot, ZDNet, and CNET. Stay in touch via Facebook, Twitter, and Google+.
