Wikimedia today announced that it is making anonymous search log files for Wikipedia (and its sister projects) available for the first time, under a CC0 1.0 Universal license. This means you can copy, modify, distribute, and otherwise use the data, even for commercial purposes, all without asking permission (though Wikimedia hopes you’ll cite the organization!).
Starting today, Wikimedia will publish the search queries for the previous day at dumps.wikimedia.org/other/search. Today’s logs are already there, and Wikimedia plans to eventually have at least the last three months of search data available on any given day.
Each line in the log files is tab-separated and contains the following 10 fields:
- Server hostname.
- Timestamp (UTC).
- Wikimedia project.
- URL-encoded search query.
- Total number of results.
- Lucene score of best match.
- Interwiki result.
- Namespace (coded as integer).
- Namespace (human-readable).
- Title of best matching article.
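For readers who want to work with the files, here is a minimal sketch of how one line might be parsed, based only on the field list above. The class name, the attribute names, and the sample line in the test are made up for illustration. The sketch also assumes that a `+` in the encoded query stands for a space, which the announcement does not confirm. Every value is kept as a string, since the exact format of the timestamp and score fields is not described.

```python
from dataclasses import dataclass
from urllib.parse import unquote_plus

FIELD_COUNT = 10  # the article lists 10 tab-separated fields per line


@dataclass
class SearchLogEntry:
    # Attribute names are hypothetical; only the field order comes from the article.
    hostname: str
    timestamp: str        # UTC, exact format not specified
    project: str
    query: str            # stored decoded
    total_results: str
    best_score: str       # Lucene score of best match
    interwiki: str
    namespace_id: str     # namespace coded as integer
    namespace_name: str   # human-readable namespace
    best_title: str


def parse_line(line: str) -> SearchLogEntry:
    """Split one log line on tabs and decode the URL-encoded query."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(fields)}")
    # Assumption: '+' encodes a space; use urllib.parse.unquote if it does not.
    fields[3] = unquote_plus(fields[3])
    return SearchLogEntry(*fields)
```

Raising on a wrong field count, rather than guessing, makes malformed lines visible early. A caller that prefers to skip them can catch `ValueError`.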
The log files contain unsampled queries for all Wikimedia projects in all languages. They are collected both from the search box on wiki pages after visitors submit queries, and from queries submitted from Special:Search pages. Unfortunately, autocomplete search results are not included, as Wikimedia says this generates too much data.
So, the data is supposed to be “anonymous,” but what does that mean? Well, as you can see from the list above, there is nothing in the logs that allows someone to map a query to an individual user. That means no IP addresses and no editor names. Even anonymous tokens are excluded. Furthermore, queries that contain email addresses, credit card numbers, and social security numbers are excluded.
Wikimedia expects this search query data to help its editor community figure out topics of interest that are currently insufficiently covered, as well as improve its search index by benchmarking improvements against real queries. In fact, the organization even admitted: “We know that most people use external search engines to search Wikipedia because our own search functionality does not always give the same accuracy.”
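One way an editor might look for under-covered topics is to tally the queries that returned no results. The sketch below is a rough illustration, not anything Wikimedia has described. The function name is hypothetical, and it assumes the "total number of results" field is written as a plain decimal integer and that `+` in the encoded query means a space.

```python
from collections import Counter
from urllib.parse import unquote_plus


def zero_result_queries(lines, top_n=10):
    """Return the most frequent queries whose result count is 0.

    `lines` is any iterable of raw log lines (for example, an open file).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 10:
            continue  # skip lines that do not match the 10-field layout
        # Field 5 (index 4) is the total number of results; assumed to be "0" for no hits.
        if fields[4].strip() == "0":
            # Field 4 (index 3) is the URL-encoded query; normalize case and whitespace.
            query = unquote_plus(fields[3]).strip().lower()
            counts[query] += 1
    return counts.most_common(top_n)
```

Lowercasing the query groups trivially different spellings together. A real analysis would probably want smarter normalization.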
At the same time, the online encyclopaedia hopes the data will give outside researchers the opportunity to discover something cool. More specifically, the organization hopes the data will be used to build “innovative applications that highlight topics that Wikipedia is currently not covering.” Personally, I think Google and Microsoft will find the results very interesting.
Update on September 20: Wikimedia has taken down the data to make additional improvements to the anonymization protocol for the search queries. The organization says “a small percentage of queries contained information unintentionally inserted by users.” Unfortunately, Wikimedia no longer plans to publish this data until further notice.
Image credit: stock.xchng