This article was published on June 15, 2008

It really exists: the Terra Incognita of the Web.


This is a guest post by New Media student Edial Dekker

Science fiction writers, the visionaries whose books I consumed as a child, made me believe that within a few years shiny robots would handle all mundane tasks. There are many robots today, but no funny-whistling R2-D2s. Today’s robots are invisible and immaterial, reading and indexing millions of websites on a daily basis. They are built for speed and efficiency, mapping the Internet as quickly and as accurately as possible. A few years ago we thought we could find anything that was out there on the Web. Today we realize the Web is fragmented, divided into four continents with ‘Terra Incognita’ islands: clustered websites that simply can’t be found, no matter how many times you click or how hard you try.

No round-trips

Most search engines do not even try to reach the full Web, because indexing as many websites as possible isn’t necessarily the best way to provide the best search results. The Web is big yet small, but its small-world character is a bit misleading. The Web is a scale-free network, dominated by hubs: nodes with a very large number of links. It is also a directed network: a link points from one page to another, but not back. Andrei Broder, Vice President of Emerging Search Technology for Yahoo!, was the first person to notice how this directed structure had consequences for the topology of the Web itself. For example, if you want to go from website A to website D, you can start from node A, then go to node B, which has a link to node C, which points to D. But you can’t make a round trip along the same links. To get from node D back to node A you would most likely have to find a different route, if one exists at all.
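The A-to-D example can be sketched in a few lines of Python. This is only an illustration: the four-page `links` dictionary and the `reachable` helper are made up for this post and are not taken from Broder's work.

```python
from collections import deque

# Hypothetical four-page web: each page lists the pages it links to.
links = {
    "A": ["B"],
    "B": ["C"],
    "C": ["D"],
    "D": [],
}

def reachable(graph, start):
    """Return every page you can reach from start by following links forward."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

print("D" in reachable(links, "A"))  # True: A -> B -> C -> D
print("A" in reachable(links, "D"))  # False: no link leads back
```

Because each link only works in one direction, the question "can I get from here to there?" has a different answer depending on which end you start from.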

The four different continents of the Web

Albert-László Barabási, a Hungarian scientist famous for his contributions to network theory, has tried to map the Web into four different continents.

The first is the Strongly Connected Component, or Central Core (SCC): this contains a quarter of all websites, is home to all indexed websites and is easily navigable. This does not mean every pair of nodes is directly linked, but the paths are there and allow you to surf between the nodes.

Then there are the IN and the OUT continents: each is about as large as the Central Core, but they are much harder to navigate. From the IN continent you can easily reach the SCC, but there is no path taking you back to the IN continent. In contrast, the OUT continent can easily be reached from the SCC, but has no links to take you back to the core (where all the magic happens). The OUT continent is mostly populated by corporate websites that can easily be reached from outside, but once you get in, there is no way out.

The fourth continent is made up of Tendrils and disconnected Islands: interlinked groups of pages that are unreachable from the SCC and have no links back to it. These groups can contain thousands of documents. Where a website ends up has nothing to do with its content, only with its links to and from other documents.
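The four continents can be recovered from nothing more than the link structure. The sketch below is one simple way to do it, assuming you already know one page that sits in the core: everything reachable forward and backward from that page is the SCC, forward-only pages are OUT, backward-only pages are IN, and the rest are Tendrils and Islands. The `toy_web` graph, the page names and the `bow_tie` function are invented for illustration.

```python
from collections import deque

def reach(graph, start):
    """Pages reachable from start by following links forward."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Flip every link, so reach() on the result walks links backward."""
    rev = {node: [] for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev

def bow_tie(graph, core_page):
    """Split pages into continents, given one page known to be in the core."""
    forward = reach(graph, core_page)             # core plus OUT
    backward = reach(reverse(graph), core_page)   # core plus IN
    core = forward & backward
    all_pages = set(graph) | {t for ts in graph.values() for t in ts}
    return {
        "SCC": core,
        "IN": backward - core,
        "OUT": forward - core,
        "OTHER": all_pages - forward - backward,  # tendrils and islands
    }

# Hypothetical miniature web.
toy_web = {
    "core1": ["core2"],
    "core2": ["core3"],
    "core3": ["core1", "corp"],
    "blog": ["core1"],        # links into the core, nothing links to it
    "corp": [],               # reachable from the core, links nowhere
    "island1": ["island2"],
    "island2": ["island1"],
}

parts = bow_tie(toy_web, "core1")
print(parts)
```

In this toy graph `blog` lands in IN, `corp` in OUT, and the two island pages in OTHER, purely because of how they are linked and regardless of what they contain.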

There’s no way you can reach it

These four continents significantly limit the Web’s navigability. Where you can go depends on the continent where you start your search. No matter how many times you click, when you are in the Central Core there is no way you can reach the IN continent or the Islands that surround it. Ever wondered why search engines give users the option to submit websites? It’s so the crawlers can sniff into those isolated islands, which could otherwise never be found.
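To see why a submitted address helps, picture a crawler as something that starts from a list of known pages and follows links outward. The sketch below is a deliberately simplified model, not how any real search engine works; `crawl`, `toy_web` and the page names are hypothetical.

```python
def crawl(graph, seeds):
    """Follow links outward from every seed page; return all pages found."""
    seen = set(seeds)
    stack = list(seeds)
    while stack:
        page = stack.pop()
        for target in graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

# Hypothetical web: a two-page core and a two-page island with no links between them.
toy_web = {
    "core": ["hub"],
    "hub": ["core"],
    "island1": ["island2"],
    "island2": ["island1"],
}

without_submission = crawl(toy_web, ["core"])
with_submission = crawl(toy_web, ["core", "island1"])

print(sorted(without_submission))  # ['core', 'hub']
print(sorted(with_submission))     # ['core', 'hub', 'island1', 'island2']
```

Starting only from the core, the crawler never sees the island. One submitted page is enough to open up everything that page links to.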

Is this fragmented structure here to stay? Barabási thinks it is. As long as links remain directed, homogenization will never occur. Tim Berners-Lee, one of the founding fathers of the Web, has for many years been stressing the importance of links that track back to where they are linked from. The track-back system that blogs use could also be used to connect the IN and OUT continents. The bottom line is that directed networks always break into the same four continents. The only way to organize is to reorganize the relations documents have with each other. Semantic web, anyone?
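The effect of track-backs can be illustrated with a small experiment: give every link a link back and check what becomes reachable from the core. This is a toy model under the assumption that every link gets a track-back, which is far from how the real Web behaves; `with_trackbacks` and `toy_web` are made up for the example.

```python
from collections import deque

def reach(graph, start):
    """Pages reachable from start by following links forward."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def with_trackbacks(graph):
    """Copy the graph and add a link back for every existing link."""
    two_way = {page: set(targets) for page, targets in graph.items()}
    for src, targets in graph.items():
        for dst in targets:
            two_way.setdefault(dst, set()).add(src)
    return two_way

# Hypothetical web: blog is in IN, corp is in OUT, core and hub form the core.
toy_web = {
    "blog": ["core"],
    "core": ["hub"],
    "hub": ["core", "corp"],
    "corp": [],
}

before = reach(toy_web, "core")
after = reach(with_trackbacks(toy_web), "core")

print(sorted(before))  # ['core', 'corp', 'hub']
print(sorted(after))   # ['blog', 'core', 'corp', 'hub']
```

With links in both directions, the IN and OUT pages become part of one mutually reachable cluster. Islands that have no link to or from the rest of the Web would still stay cut off, since a track-back needs an existing link to track back along.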
