Wikimedia today announced it is replacing its search feature with one provided by enterprise data search and analytics startup Elasticsearch. The non-profit will be rolling out new search infrastructure to all of its wikis, starting with beta users in February and then all users in March or April.
All Wikimedia sites currently use lucene-search-2, a home-grown search system based on Apache Lucene that was written primarily by volunteer Robert Stojnić. While the organization has been able to scale it very well for the past eight years or so, it became clear in early 2013 that a replacement was needed, especially since Stojnić was no longer around to keep it running smoothly.
Here’s a screenshot of the new search box:
Wikimedia explained that it wanted to stop maintaining a special-purpose open-source search system when two very good general-purpose open-source search systems are already available: Solr and Elasticsearch, both also based on Lucene. The organization tried integrating both into MediaWiki but eventually picked Elasticsearch for the following reasons:
- Elasticsearch’s reference manual and contribution documentation promised an easy start and a pleasant time getting changes upstream when needed.
- Elasticsearch’s highly expressive search API lets Wikimedia search any way it needs to and gives the organization confidence that search can be expanded, including via ad-hoc queries.
- Elasticsearch’s index maintenance API lets Wikimedia maintain the index right from its MediaWiki extension, so it’s easier to deploy and test, and should be easier for MediaWiki users outside Wikimedia to use. At the time of the choice, Solr’s schema API was read-only.
- Rack awareness, automatic shard rebalancing, statistics exposed over HTTP, a preference for JSON and YAML over XML, and first-party Debian packages were also nice.
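To give a rough sense of what the search and index maintenance APIs in the list above look like in practice, here is a minimal Python sketch that builds the JSON bodies a client might send to Elasticsearch. It is not taken from CirrusSearch: the field names (`title`, `text`, `namespace`), the boost on `title`, and both helper functions are hypothetical, and the bodies follow the request format of recent Elasticsearch releases, which may differ from the version Wikimedia deployed.

```python
import json


def build_index_body(shards=1, replicas=0):
    """Build a body for creating an index (sent as PUT /<index-name>).

    The fields below are hypothetical stand-ins for wiki page data.
    """
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "text": {"type": "text"},
                "namespace": {"type": "integer"},
            }
        },
    }


def build_search_body(terms, namespace=0, size=10):
    """Build a body for a search (sent as POST /<index-name>/_search).

    Matches the terms against title and text, weighting title hits
    higher, and restricts results to a single namespace.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": terms,
                            "fields": ["title^2", "text"],
                        }
                    }
                ],
                "filter": [{"term": {"namespace": namespace}}],
            }
        },
        "size": size,
    }


if __name__ == "__main__":
    # Print the JSON that would be sent over HTTP.
    print(json.dumps(build_index_body(), indent=2))
    print(json.dumps(build_search_body("lucene search"), indent=2))
```

Because both index setup and queries are plain JSON over HTTP, an application such as a MediaWiki extension can create and update its own index without separately edited server-side schema files, which is the deployment advantage the list describes.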
Wikimedia has written a new extension called CirrusSearch to integrate Elasticsearch with MediaWiki. It is mostly backwards-compatible with the current search, although it can’t handle text inside templates. That being said, single-page edits usually show up in search results within seconds, quality markings on pages (higher or lower) are reflected in search results, and a few new “expert” options have been added (check out the full documentation here).