In the early '90s, the Berkeley NOW (Network of Workstations) project, led by David Culler, posited that groups of less capable machines (running SunOS) could solve scientific and other computing problems at a fraction of the cost of larger computers. In 1994, Donald Becker and Thomas Sterling drove the costs even lower by adopting the then-fledgling Linux operating system to build Beowulf clusters at NASA's Goddard Space Flight Center. By tying desktop machines together with open source tools such as PVM (Parallel Virtual Machine), MPI (Message Passing Interface), and PBS (Portable Batch System), early clusters - often PC towers stacked on metal shelves with a nest of wires interconnecting them - fundamentally altered the balance of scientific computing. Before these first clusters appeared, distributed/parallel computing was prevalent at only a few computing centers, national laboratories, and a very few university departments. Since the introduction of clusters, distributed computing is, literally, everywhere.

There were, however, ugly realities about clusters. The lack of tools meant that getting 16 or 32 machines to work closely together was a heroic systems effort. Open source software was (and often still is) poorly documented and lacked critical functionality that more mature commercial products offered on the "big machines." It often took months, and highly trained experts, to get a cluster up and running, and even longer for applications to run reasonably well on these cheaper machines, if at all. Nonetheless, the potential of scalable, cheap computing was too great to ignore, and the community as a whole grew more sophisticated until clusters became the dominant architecture in high-performance computing. Midsize clusters now number about 100 machines, big clusters consist of 1,000 machines, and the biggest supercomputers are even larger cluster machines.
For HPC (high-performance computing), clusters have arrived. Most are either Linux-based or run a commercial Unix derivative, with the majority of the top 500 machines running a Linux derivative. A new trend toward better hardware integration in the form of blades helps eliminate significant wiring clutter. The past 12 years of clusters have honed community experience: many vendors can turn out "MPI boxes" (homogeneous hardware that enables message-passing parallel applications), and several software tools understand clusters well enough to let non-experts go from bare metal (e.g., that cluster SKU from your favorite computing hardware company) to a functioning cluster of hundreds of individual nodes (computers) in a few hours. At the National Center for Supercomputing Applications, the Tungsten2 Cluster (512 nodes) went from purchase order placement to full production in less than a month and was one of the 50 fastest supercomputers in the world in June 2005.

It might seem that the problems with clusters have been solved, but their wild success means that everyone wants to do more with them. While clusters retain their HPC roots, clusters of Web servers, tiled display walls, database servers, and file servers are becoming commonplace. Nearly every entity in the modern machine room is essentially a clustered architecture, and building a specialized MPI box (the classic Beowulf cluster) is only a small subset of what is needed to support computational researchers.

Early clusters were tractable for experts because all the hardware was identical (purchased that way to simplify things) and every node in the cluster ran only a single software stack. To build these machines, the expert carefully created a "model" node through painstaking, handcrafted system administration, took an image of this model, and cloned its bits onto all the other nodes. When changes were needed, a new model was built and the nodes were recloned.
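The clone-from-a-model workflow described above works only as long as every node really does match the model. The sketch below, in Python, shows one simple way such drift could be detected: reduce each node's software manifest to a single fingerprint and compare it against the model's. All package names, versions, and node names here are hypothetical illustrations, not any particular tool's format.

```python
import hashlib

def manifest_fingerprint(manifest):
    """Hash a node's software manifest (package name -> version) so that
    whole-node comparisons reduce to a single digest check."""
    lines = sorted(f"{pkg}={ver}" for pkg, ver in manifest.items())
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()

def find_drifted_nodes(model_manifest, node_manifests):
    """Return names of nodes whose manifest differs from the model's."""
    want = manifest_fingerprint(model_manifest)
    return [name for name, m in node_manifests.items()
            if manifest_fingerprint(m) != want]

# Hypothetical model node and two compute nodes cloned from it:
model = {"kernel": "2.6.9", "mpich": "1.2.7", "pbs": "2.3.16"}
nodes = {
    "compute-0-0": dict(model),                  # faithful clone
    "compute-0-1": {**model, "mpich": "1.2.6"},  # drifted clone
}
print(find_drifted_nodes(model, nodes))  # -> ['compute-0-1']
```

Sorting the manifest before hashing makes the fingerprint independent of the order in which packages were recorded, so only a genuine content difference flags a node.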
The community learned that for clustered applications to function properly, software must be consistent across all nodes. A variety of mechanisms and underlying assumptions have been used to achieve consistency, with varying degrees of success. Consistency itself isn't intractable, but achieving it becomes significantly more important as the complexity of the infrastructure increases. The modern machine room is complex not merely because all nodes belong to some logical cluster, but because each cluster needs a different software configuration. In Rocks, software that we have developed to provision and manage clusters rapidly, we call these configurations appliances. If every appliance has its own "handcrafted" model, the cost of the cluster goes up dramatically, uniformity of security policy across clusters depends on the cluster experts' ability to apply changes uniformly to all model nodes, and each (sub)cluster must use largely identical hardware. This administrator-heavy model leads to inconsistency in the enterprise and relies too much on human "wetware," when what is needed is a programmatic approach.
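The core of the programmatic approach argued for above is that appliances share configuration rather than each carrying a handcrafted copy of it. The Python sketch below illustrates the idea in miniature - it is not how Rocks actually represents appliances, and every setting and appliance name is a hypothetical example: appliance definitions are composed from a shared base, so a single change to the base (say, a security policy) propagates to every appliance automatically.

```python
# Shared base configuration that every appliance inherits (hypothetical
# settings for illustration only).
BASE = {"sshd": "enabled", "ntp": "pool.ntp.org", "selinux": "enforcing"}

# Per-appliance additions layered on top of the base.
APPLIANCE_EXTRAS = {
    "compute":    {"mpich": "installed", "pbs-mom": "enabled"},
    "web-server": {"httpd": "enabled"},
    "nas":        {"nfsd": "enabled", "raid-monitor": "enabled"},
}

def render(appliance):
    """Merge the shared base with appliance-specific settings; the
    appliance's own keys override the base where they overlap."""
    return {**BASE, **APPLIANCE_EXTRAS[appliance]}

# A policy change made once in the base reaches every appliance, with no
# per-model handcrafting:
BASE["selinux"] = "permissive"
assert all(render(a)["selinux"] == "permissive" for a in APPLIANCE_EXTRAS)
```

Because `render` recomputes each configuration from the shared base on demand, there is no per-appliance model node to drift out of date - which is precisely what the handcrafted-model approach cannot guarantee.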
by Philip Papadopoulos, Greg Bruno, Mason Katz - University of California, San Diego
© ACM, Inc. All rights reserved. |