
The magic that makes Google tick
By Matt Loney, ZDNet UK
02 December 2004

The numbers alone are enough to make your eyes water.

  • Over four billion Web pages, each an average of 10KB, all fully indexed.
  • Up to 2,000 PCs in a cluster.
  • Over 30 clusters.
  • 104 interface languages including Klingon and Tagalog.
  • One petabyte of data in a cluster -- so much that hard disk error rates of 10^-15 begin to be a real issue.
  • Sustained transfer rates of 2Gbps in a cluster.
  • An expectation that two machines will fail every day in each of the larger clusters.
  • No complete system failure since February 2000.
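A rough back-of-the-envelope check shows why figures like these matter at cluster scale. The sketch below assumes the disk error rate is meant as 10^-15 errors per bit read, that a petabyte is 10^15 bytes, and that failures are spread evenly over a 2,000-machine cluster; none of these readings is confirmed by the talk, and the variable names are purely illustrative.

```python
# Back-of-the-envelope arithmetic; every input is an assumption noted below.
PETABYTE_BYTES = 10**15       # assumed: decimal petabyte
BIT_ERROR_RATE = 1e-15        # assumed: errors per bit read
MACHINES_PER_CLUSTER = 2000   # "up to 2,000 PCs in a cluster"
FAILURES_PER_DAY = 2          # "two machines will fail every day"


def expected_bit_errors(num_bytes: int, error_rate: float) -> float:
    """Expected number of bad bits when reading num_bytes once."""
    return num_bytes * 8 * error_rate


# Reading the whole petabyte once touches 8 * 10^15 bits.
errors_per_full_read = expected_bit_errors(PETABYTE_BYTES, BIT_ERROR_RATE)

# If failures are evenly spread, each machine fails about once in this many days.
days_between_failures_per_machine = MACHINES_PER_CLUSTER / FAILURES_PER_DAY

print(f"Expected bit errors per full read: {errors_per_full_read:.1f}")
print(f"Days between failures per machine: {days_between_failures_per_machine:.0f}")
```

Under these assumptions a single pass over the data is expected to hit several bad bits, and although an individual machine rarely fails, the cluster as a whole sees failures daily.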

    It is one of the largest computing projects on the planet, arguably employing more computers than any other single, fully managed system (we're not counting distributed computing projects here), some 200 computer science PhDs, and 600 other computer scientists.

    And it is all hidden behind a deceptively simple white Web page that contains a single one-line text box and a button that says Google Search.

    When Arthur C. Clarke said that any sufficiently advanced technology is indistinguishable from magic, he was alluding to the trick of hiding the complexity of the job from the audience, or the user. Nobody hides that complexity better than Google does. So long as we have a connection to the Internet, the Google search page is there day and night, every day of the year, and it is not just there: it returns results. Google recognises that the results are not always perfect, and there are still issues -- more on those later -- but when you understand the complexity of the system behind that Web page you may be able to forgive the imperfections. You may even agree that what Google achieves is nothing short of sorcery.

    On Thursday evening, Google's vice-president of engineering, Urs Hölzle, who has been with the company since 1999 and who is now a Google fellow, gave would-be Google employees an insight into just what it takes to run an operation on such a scale, with such reliability. ZDNet UK snuck in the back to glean some of the secrets of Google's magic.

    Google's vision is broader than most people imagine, said Hölzle: "Most people say Google is a search engine but our mission is to organise information to make it accessible."

    Behind that, he said, comes a vast scale of computing power based on cheap, no-name hardware that is prone to failure. There are hardware malfunctions not just once, but time and time again, many times a day.

    Yes, that's right: Google is built on imperfect hardware. The magic is in writing software that accepts that hardware will fail, and deals with that reality quickly, said Hölzle.

    Google indexes over four billion Web pages, using an average of 10KB per page, which comes to about 40TB. Google is asked to search this data over 1,000 times every second of every day, and typically comes back with sub-second response times. If anything goes wrong, said Hölzle, "you can't just switch the system off and switch it back on again."
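The figures quoted above are easy to check. This is only a sanity-check sketch: it assumes decimal units (1KB = 1,000 bytes, 1TB = 10^12 bytes) and treats "over 1,000 times every second" as a floor of exactly 1,000; the variable names are illustrative.

```python
# Sanity check of the index size and query volume quoted in the article.
PAGES = 4_000_000_000          # "over four billion Web pages"
AVG_PAGE_BYTES = 10 * 1000     # "an average of 10KB", assuming decimal KB
QUERIES_PER_SECOND = 1000      # lower bound: "over 1,000 times every second"
SECONDS_PER_DAY = 24 * 60 * 60

index_bytes = PAGES * AVG_PAGE_BYTES
index_terabytes = index_bytes / 10**12   # assuming decimal TB

queries_per_day = QUERIES_PER_SECOND * SECONDS_PER_DAY

print(f"Index size: {index_terabytes:.0f}TB")
print(f"Queries per day: at least {queries_per_day:,}")
```

So the 40TB figure follows directly from the page count and average size, and the query rate works out to more than 86 million searches a day.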

    How to slam spam
    The job is not helped by the nature of the Web. "In academia," said Hölzle, "the information retrieval field has been around for years, but that is for books in libraries. On the Web, content is not nicely written -- there are many different grades of quality."

    Some pages, he noted, may not even have text. "You may think we don't need to know about those but that's not true -- it may be the home page of a very large company where the Webmaster decided to have everything graphical. The company name may not even appear on the page."

    Google deals with such pages by regarding the Web not as a collection of text documents, but as a collection of linked text documents, with each link containing valuable information.

    "Take a link pointing to the Stanford University home page," said Hölzle. "This tells us several things: First, that someone must think pointing to Stanford is important. The text in the link also gives us some idea of what is on the page being pointed to. And if we know something about the page that contains the link we can tell something about the quality of the page being linked to."
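The idea that link text describes the page being pointed to can be illustrated with a toy index. This is a minimal sketch, not Google's implementation: the function name build_anchor_index, the tuple layout and the example.org URLs are all hypothetical, chosen only to show how a page with no text of its own could still be found through the words other pages use when linking to it.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Map each word of a link's text to the set of pages that link points to.

    links: iterable of (source_url, target_url, anchor_text) tuples.
    """
    index = defaultdict(set)
    for _source, target, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target)
    return index


# Hypothetical link data: two pages point at a purely graphical home page.
links = [
    ("http://example.org/a", "http://www.stanford.edu/", "Stanford University"),
    ("http://example.org/b", "http://www.stanford.edu/", "stanford home page"),
    ("http://example.org/c", "http://example.com/", "some other site"),
]

index = build_anchor_index(links)
print(sorted(index["stanford"]))   # ['http://www.stanford.edu/']
```

Here a search for "stanford" finds the target page even if that page never contains the word itself, because the words come from the links pointing to it.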

    This knowledge is encapsulated in Google's famous PageRank algorithm, which looks not just at the number of links to a page but at the quality or weight of those links. That helps determine which page is most likely to be of use, and so which is presented at the top of the list when the search results are returned to the user. Hölzle believes the PageRank algorithm is 'relatively' spam resistant, and those interested in exactly how it works can find more information here.
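To make "quality or weight of those links" concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production code: the damping factor of 0.85, the fixed iteration count, the even spreading of rank from pages with no outgoing links, and the tiny example graph are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    graph: dict mapping each page to the list of pages it links to.
           Every link target must also appear as a key.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in graph.items():
            if outlinks:
                # A page splits its damped rank equally among its links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "c" and "e" each have exactly one inbound link,
# but "c" is linked from the well-connected "hub" and "e" from the obscure "d".
graph = {
    "a": ["hub"],
    "b": ["hub"],
    "hub": ["c"],
    "c": ["hub"],
    "d": ["e"],
    "e": [],
}

ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page}: {ranks[page]:.3f}")
```

In this toy graph "c" ends up ranked above "e" even though each has a single inbound link, because the page linking to "c" is itself highly ranked. That is the sense in which the weight of a link matters, not just the count.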

    Copyright © 2005 CNET Networks, Inc. All rights reserved. ZDNet is a registered service mark of CNET Networks, Inc. ZDNet Logo is service mark of CNET Networks, Inc.