The Internet is growing so quickly that a remarkable milestone has been passed: Google’s spiders have discovered their trillionth unique URL. That’s 1,000,000,000,000 URLs found on the web, although not all of them end up in the index. By comparison, Cuil is reported to have indexed almost 122 billion pages with the help of the Internet Archive. According to Google, the World Wide Web is growing by several billion pages per day.
Jesse Alpert and Nissan Hajaj explain how Google downloads the web and reprocesses the web-link graph continuously, a good example of how complex indexing actually is:
“To keep up with this volume of information, our systems have come a long way since the first set of web data Google processed to answer queries. Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google’s index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections. So multiple times every day, we do the computational equivalent of fully exploring every intersection of every road in the United States. Except it’d be a map about 50,000 times as big as the U.S., with 50,000 times as many roads and intersections.”
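To give a feel for what "computing PageRank on a link graph" means, here is a minimal sketch of the textbook power-iteration version on a four-page toy web. It is only an illustration, not Google's production system: the `pagerank` function, the `toy_web` graph, the damping factor of 0.85 and the fixed 50 iterations are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    All names and parameter values here are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Every page passes an equal share of its rank to each page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# A hypothetical four-page web: "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)

for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Each pass of the loop visits every link once, so the work per pass grows with the number of links in the graph. That is the part that becomes hard at the scale of a trillion URLs, and the reason the quote compares it to exploring every intersection on a very large map.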