In September, Facebook rolled out a new feature for its Graph Search product that lets users find posts, comments, check-ins, and status updates. Some may wonder why this wasn’t available when the company first introduced Graph Search to the world. The answer boils down to a simple statement: it was complicated.
Ashoat Tevosyan, a Facebook engineer on the search quality and ranking team, has penned a post on the company’s Engineering blog that gives a first-hand account of the massive undertaking required to build posts search. He describes it as a feature that has been “two years in the making” and that required Facebook to find ways to catalog the 1 billion new posts added every day. What’s more, the index already contains more than 1 trillion total posts, amounting to hundreds of terabytes of data.
Tevosyan chronicles the journey by explaining how Facebook collected the data, built and updated the index, served it, and ranked the results. Interestingly enough, the concept behind posts search came out of one of Facebook’s internal hackathon projects. Tevosyan started out as a company intern, and he spent a night implementing a way for him and his friends to find old posts that they had written on the social network. From there, it grew into the latest Graph Search feature.
One of the first things the company needed to do was find a way to assemble the content necessary for posts search to work at all. This was one of the biggest challenges the company faced, and it put considerable strain on database resources. Tevosyan said that Facebook has 70 different kinds of data that are sorted and indexed, many of them specific to certain types of posts. To say that a quick query was needed would be a gross understatement.
Next up was building the index, which was done using an HBase cluster, Hadoop jobs, and Unicorn, Facebook’s search infrastructure. Keeping the index updated was the next task, and for that the company relied on a program called Wormhole. Its role is to make sure the index stays correct whenever a post is created, updated, or deleted.
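To make the update problem concrete, here is a minimal sketch of an index that stays consistent as posts are created, updated, and deleted. It is a toy in-memory inverted index written for illustration only; it is not Wormhole, HBase, or Unicorn, and the `PostIndex` class, the event format, and the whitespace tokenizer are all assumptions that do not come from Tevosyan’s post.

```python
from collections import defaultdict


class PostIndex:
    """Toy inverted index (hypothetical): term -> set of post ids."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of posts containing it
        self.docs = {}                    # post id -> current text

    def _terms(self, text):
        # Assumed tokenizer: lowercase and split on whitespace.
        return set(text.lower().split())

    def apply(self, event):
        """Apply one change event of the assumed form
        {'op': 'create' | 'update' | 'delete', 'id': ..., 'text': ...}."""
        post_id = event["id"]
        # Drop the old postings first so updates and deletes leave nothing stale.
        old_text = self.docs.pop(post_id, None)
        if old_text is not None:
            for term in self._terms(old_text):
                self.postings[term].discard(post_id)
                if not self.postings[term]:
                    del self.postings[term]
        # Creates and updates then index the new text.
        if event["op"] in ("create", "update"):
            self.docs[post_id] = event["text"]
            for term in self._terms(event["text"]):
                self.postings[term].add(post_id)

    def search(self, query):
        """Return the ids of posts that contain every term in the query."""
        terms = self._terms(query)
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))


idx = PostIndex()
idx.apply({"op": "create", "id": 1, "text": "Hiking in Yosemite"})
idx.apply({"op": "create", "id": 2, "text": "Yosemite photos"})
idx.apply({"op": "update", "id": 1, "text": "Hiking in Tahoe"})
print(idx.search("yosemite"))
```

After the update, post 1 no longer matches "yosemite", so only post 2 is returned. The point of the sketch is that every change has to remove the old entries before adding new ones, which is the kind of bookkeeping a change-propagation system has to get right at far larger scale.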
Facebook then needed to find a way to serve its index to the public. Tevosyan states that it wasn’t as simple as how Facebook handles its other search indexes, because this one was much larger. Holding the posts search index entirely in RAM would have taken 700 terabytes, so the company needed an efficient way to use the data without crushing its server resources. It wound up serving the index from solid-state flash memory, with the most frequently accessed data structures kept in RAM.
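The trade-off can be sketched as a two-tier store: a small, fast tier in front of a large, slower one. The example below is an illustration under stated assumptions, not Facebook’s design. The `TieredStore` class is hypothetical, a plain dictionary stands in for flash, and least-recently-used eviction is swapped in as a simple way to keep hot items in the fast tier; the article only says that the most frequently accessed data structures live in RAM.

```python
from collections import OrderedDict


class TieredStore:
    """Hypothetical two-tier store: a small RAM cache over a larger slow tier."""

    def __init__(self, backing, ram_capacity):
        self.backing = backing        # stands in for flash: large but slower
        self.ram = OrderedDict()      # fast tier, ordered oldest -> newest use
        self.capacity = ram_capacity  # maximum number of items kept in RAM
        self.flash_reads = 0          # counts how often the slow tier is hit

    def get(self, key):
        if key in self.ram:
            # Hit: mark the item as most recently used.
            self.ram.move_to_end(key)
            return self.ram[key]
        # Miss: read from the slow tier, then cache the value in RAM.
        value = self.backing[key]
        self.flash_reads += 1
        self.ram[key] = value
        if len(self.ram) > self.capacity:
            # Evict the least recently used item (assumed policy).
            self.ram.popitem(last=False)
        return value


store = TieredStore({"a": 1, "b": 2, "c": 3}, ram_capacity=2)
for key in ["a", "b", "a", "c", "a"]:
    store.get(key)
print(store.flash_reads)
```

In this run only the first read of each key goes to the slow tier; the repeated reads of "a" are served from RAM. The same idea, applied to index data structures rather than single keys, is what lets a very large index be served without holding all of it in memory.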
As for ranking and surfacing the most useful content to users, Facebook used query rewriting and dynamic result scoring to make its algorithm as effective as possible, although Tevosyan says that his team will “continue to work on refining these models as we roll out to more users and listen to feedback.”
Photo credit: Stephen Lam/Getty Images