Diffbot, the Palo Alto-based startup that helps apps see the Web the way humans do — as actual content with context — has just released a new API and Chrome extension, as well as a handful of interesting stats.
The developer-centric company first launched with two APIs in August 2011, raised $2 million this past May, and has since grown its offering to three services: an Article API, which extracts clean article text from news sites; a Frontpage API, which interprets and returns the individual elements of a homepage; and the Follow API, which lets developers detect changes to any given Web page.
Now, the company is launching its Page Classifier API, which can classify any given Web page as one of 20 page types, such as product or event pages. According to Diffbot, this new API “correctly recognizes and categorizes more than 90% of the Web.” This sort of technology has major potential in ad tech and security tech, but it could also help to humanize the Web and allow us to understand what makes up the Internet piece by piece.
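For developers wondering what calling a page classifier might look like, here is a minimal Python sketch. It is not Diffbot's documented interface: the endpoint URL, the `token` and `url` parameter names, and the `type` field in the response are all assumptions made for illustration, and the response body is made up so the example runs without a network call.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only;
# consult Diffbot's own documentation for the real ones.
CLASSIFIER_ENDPOINT = "https://api.example.com/classify"


def build_classifier_url(page_url: str, token: str) -> str:
    """Build a request URL asking the classifier about a single Web page."""
    query = urlencode({"token": token, "url": page_url})
    return CLASSIFIER_ENDPOINT + "?" + query


def page_type(response_body: str) -> str:
    """Return the page type from a JSON response, or "unknown" if absent."""
    data = json.loads(response_body)
    return data.get("type", "unknown")


# A made-up response body standing in for a real API call.
sample_response = '{"url": "http://example.com/widget", "type": "product"}'

request_url = build_classifier_url("http://example.com/widget", "MY_TOKEN")
print(request_url)
print(page_type(sample_response))
```

In a real integration, the request URL would be fetched over HTTP and the returned page type used to decide what to do next, such as handing a product page to one parser and an article page to another.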
As for Diffbot’s new Chrome extension, it works like the new Page Classifier API but is more consumer-focused and will resonate with those not interested in writing any code. Diffbot told TNW how it works:
With the extension installed, when a user visits Twitter.com, they can see category tags next to Tweets containing links (article, video, picture, etc). When they click on a tag, the Tweets expand to show article text, photos, and more right in the Tweet stream.
As for what can be classified, take a gander at the stats the company dug up on what is shared on Twitter. According to Diffbot, photos and images make up 35% of all shared links on Twitter, with articles and blog posts making up 16%, videos 9% and products 8%. Additionally, Diffbot found that English-language sites comprise 68% of all shared links (Japanese-language pages are second with 7%). For all the stats, check out the company’s infographic here.
Clearly, there’s a lot of power hidden within these APIs, and as the startup grows, it’s positioning itself as the default middleman between content and anyone interested in understanding that content. This is pretty useful stuff for developers and consumers alike.