When I signed up to make TNW the media sponsor for GLUE Conference, I knew that I was getting in over my head. Then, a couple of weeks before the event, I got asked to help sort through the massive list of applications for the Demo Pavilion. Twelve teams went in, and one emerged victorious in a system that let everyone vote for the winner. That team? Distil.it
The Pavilion was an area, graciously sponsored by Alcatel-Lucent, where early-stage startups could demonstrate their products to the people who matter most: those who would potentially be using them. There was a heavy focus on cloud architecture, and graduates of the inaugural TechStars Cloud program were represented en masse.
But back to the subject at hand: Distil is a company that focuses on putting an end to Web scraping. It’s a problem that anyone who has anything online has encountered at some point, and one that is increasingly costing companies money.
What’s interesting about Distil, versus some other companies that are working on the scraping problem, is that it isn’t focused on just digital publishing. While publishing is an obvious target, other industries have the same problem.
Looking at Distil’s site, you’ll see that they’re branching out their tools to work in a number of different areas. As I talked with TechStars Cloud Managing Director Jason Seats, he opened my eyes to some other, not-so-obvious problems. Walmart, for instance, could perhaps hire a company to scrape pricing data from Amazon so that they know how to price their own items.
While that’s just one example, there are many others where direct or ancillary revenue is lost because of the simple act of taking information from one site and placing it elsewhere. Entire businesses are being built on the profitability of crawling content to scrape information.
The shocking part is how easy it is to get content scraped, and how inexpensive the process has become. A post on the Distil blog breaks it down quite nicely, as the team went so far as to hire someone to do a scraping project (of freely available, public data) and then blog the results.
GigaOm’s Stacey Higginbotham has a rather brilliant look at the problem at hand today. Essentially it comes down to the differences between those who would do good and those who would do bad, but the difficulty comes in separating them from one another.
“Not all scrapers and crawlers are out to defraud publishers of ad revenue — some are foisting their robots on the web in a legitimate way to offer consumers a service as BlackLocus does, while others use it for academic research. Even journalists scrape data from web sites for their stories.”
So what does Distil do? Depending upon the plan that you choose, you’ll get everything from SSL support to custom headers, dedicated vanity IP addresses and even a sandboxed, private network. Each of these, alone, is an effective method to combat scraping. But combine them and you’ll quickly have an arsenal of tools at your disposal to keep your content safe.
The market isn’t empty, and Distil has its work cut out for it. CloudFlare’s ScrapeShield immediately comes to mind as an alternative. From the talk that I had with the Distil team, however, they’ve got some great ideas going. It will most definitely be interesting to see their approach as it continues to develop.
The GLUE Conference Demo Pavilion competition wasn’t a blowout by any means. But as one small aside, there’s a funny story you should hear. The voting application was put together by Twilio. Its developer was so certain of its security that he issued the challenge for anyone to hack it.
Bad move in a room full of hackers.
Only a few minutes later, suspiciously padded vote counts meant that the process had to be rolled back to before the challenge and re-secured before voting could continue. Fortunately it was pretty easy to track the massive influx of votes to get things back in line, but a good laugh was had by all.