Construct a supercomputer using Linux clusters

I intend to build a Linux cluster for running
MPI-type programs. I intend to use 16 machines.
Can any of you give me some advice on what kind
of network cards and switches are best suited
for the task? The faster the better. The latency
should be as short as possible.
I am very new to network switches. What kind of
network topology should I adopt to avoid
communication conflicts? For example, is there
a way to build a network so node 1 can speak to
node 2 at the same time as node 3 is speaking to
node 4, without having to wait until one of the tasks
has been completed? I don't mind if I have to
use multiple switches and multiple network cards
at each node.
1 Solution
Get big and fast ones!!!

Sorry, that was my attempt at humor.  My point is that without knowing about 100 other details (the network, the applications involved, the traffic flows the applications produce, the number of users, the proximity of users to equipment, existing technologies, bandwidth requirements, etc.) there is no real way to answer the question.  This is an extremely broad topic that requires a lot of in-depth knowledge about what exists and what is needed, so it is really impossible to give an answer that means anything.

I'm not telling you that you did anything wrong by asking, but instead trying to inform you that if you are in an environment requiring large server clusters, then you most likely should have the resources at hand to bring in a consultant or ask the resident network team to handle this question.

Hope that helps!
Just because he is looking at a 16-node cluster does not mean he has the resources available to answer his questions. I know a few scientific agencies that had to learn the hard way how to build MPP clusters.

I think that some further information is warranted, i.e. what type of traffic, at what speeds, etc. Initially, it sounds as if he is describing a typical n+1 non-blocking switch.

Fill us in, oatnusigma3.
oatnusigma3 (Author) commented:
The cluster I am going to build is only intended for
running parallel programs written in C++ using MPI.
There is going to be a lot of communication between the
16 nodes. This communication will mainly use the
following modes:
(1). Point-to-point communication. This is the case when
one node needs to talk to another node. This type of
communication happens most frequently. And it often
happens that a number of pairs, for example
nodes i and i+1, for i=1,3,5,7,9,11,13,15, want to
communicate simultaneously. Namely, node 1 wants to talk to
node 2, but doesn't want to wait until node 3 has
talked to node 4.
(2). Broadcast. This is the case where one node wants to
send the same information to all the other nodes. Let us
say that node 1 wants to broadcast. Does it have to
send the message to node 2 first, and then node 3, and then
node 4, or can it be achieved at once?
(3). Shift. This is the case where node i wants to pass
information to node i+1, for i=1,2,...,15, and node 16
wants to send its message to node 1. Can this be done at once?
Or does the message actually have to be sent from node 1 to
node 2, and then from node 2 to node 3, and so on?

Most of the messages are not very big. On average
they range from 1Mb to 10Mb. Occasionally they get into the
100Mb range. The most frequent messages are short ones,
below 1 Mb in size. Communication
between nodes is very frequent. I do not have good data
on this right now, but it is likely to be 5 to 50 times
per pair of nodes per second.

The cluster is not going to be connected to the outside world.

There will only be a few users (1 to 5) for the cluster.
And it is likely that we will schedule only one job to
run at any one time.
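
For readers following along, here is a rough sketch (not from the original poster) of the three patterns as lists of (sender, receiver) flows. It assumes a single non-blocking, full-duplex switch, where in principle any set of flows can run at line rate at the same time as long as no port sends twice or receives twice. The function names are made up for illustration, and the broadcast uses a binomial tree, which is one common way MPI libraries implement a broadcast, not necessarily what a given library does.

```python
N = 16  # nodes, numbered 1..16 as in the question


def pairwise(n):
    """Pattern (1): node i sends to node i+1 for odd i, e.g. (1,2), (3,4), ..."""
    return [(i, i + 1) for i in range(1, n, 2)]


def ring_shift(n):
    """Pattern (3): node i sends to node i+1; the last node wraps around to node 1."""
    return [(i, i % n + 1) for i in range(1, n + 1)]


def is_conflict_free(flows):
    """True if no node sends twice and no node receives twice in the same step."""
    senders = [src for src, _ in flows]
    receivers = [dst for _, dst in flows]
    return (len(set(senders)) == len(senders)
            and len(set(receivers)) == len(receivers))


def binomial_broadcast(n, root=1):
    """Pattern (2): rounds of a binomial-tree broadcast.

    In each round every node that already holds the data forwards it to
    one node that does not, so the number of holders doubles per round.
    """
    def node(rank):
        # Map a rank relative to the root (0..n-1) back to a node number.
        return (root - 1 + rank) % n + 1

    rounds = []
    step = 1
    while step < n:
        flows = [(node(r), node(r + step))
                 for r in range(step) if r + step < n]
        rounds.append(flows)
        step *= 2
    return rounds


if __name__ == "__main__":
    print("pairwise conflict-free:", is_conflict_free(pairwise(N)))
    print("ring shift conflict-free:", is_conflict_free(ring_shift(N)))
    for number, flows in enumerate(binomial_broadcast(N), start=1):
        print("broadcast round", number, flows)
```

Under these assumptions the pairwise and shift patterns each fit in a single step, and a 16-node tree broadcast takes 4 rounds instead of 15 back-to-back sends. Real timings also depend on the MPI library, the NICs, and the switch.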
Some time ago Los Alamos National Laboratory built a large supercomputer using 3Com switches to interconnect the nodes (see below for details; I found this in my email archive).

Today, for your small cluster, I would recommend the 3Com SuperStack Switch 4900 (12x1000Base-T ports) with a 4x1000Base-T module (16x1000Base-T ports total) and 3Com 10/100/1000 Base-T PCI-X NICs. This switch has a non-blocking architecture (verified by Tolly Group) and a good price.


SANTA CLARA, Calif.--(BUSINESS WIRE)--August 10, 1999--

Stackable 3Com Gigabit Switches Connect 144 Super-Fast Processors and Link Lab Users to Deliver World-Class Performance at Unprecedented Low Cost

3Com Corporation (Nasdaq:COMS) today announced that Los Alamos National Laboratory, Los Alamos, NM, is using 3Com(R) Gigabit Ethernet switches in its Avalon supercomputer. This device, which is among the most powerful computers in the world, was built in-house using off-the-shelf components from 3Com and Digital Equipment Corporation (DEC), now Compaq. As a result, Avalon cost only about $275,000. In addition, the laboratory's Center for Nonlinear Studies (CNLS) has upgraded its local area network (LAN) using 3Com Gigabit Ethernet systems.

Los Alamos is a scientific institution owned by the U.S. Department of Energy. It will use Avalon for research ranging from astrophysics and global climate modeling to materials and weapons technologies. CNLS, one of the research centers operating within the laboratory, is devoted to the identification and study of fundamental nonlinear phenomena and to promoting their use in applied research. The center coordinates the efforts of the Laboratory's premier researchers from many disciplines and many divisions, mobilizing theoreticians and experimentalists in applied math, statistical physics, solid state and materials physics, polymer chemistry, structural biology and biochemistry.

"Avalon sets a new standard for cost-effective supercomputing," said Michael Warren of Los Alamos National Laboratory's Theoretical Astrophysics Group and Avalon's chief architect. "We're able to process huge amounts of data reliably and without latency, even at supercomputer speeds."

Connectivity lies at the core of Avalon's supercomputer design, known as Beowulf architecture. This approach deploys multiple processors, or nodes, that are networked together. Computations are distributed among the nodes via the network, enabling all the processors to work simultaneously on tasks for much quicker results.

Supercomputing with Gigabit Ethernet Performance

Avalon is powered by a 3Com network based on a Gigabit Ethernet backbone. The supercomputer's nodes are comprised of 144 DEC desktop computers with 533 megahertz, 64-bit processors and 256 megabytes of RAM. In total, Avalon has nearly 40 gigabytes of RAM. The nodes are all linked at full-duplex 100 megabits-per-second (Mbps) Fast Ethernet speeds to four 3Com 36-port SuperStack(R) II 3900 Gigabit Ethernet switches with three 1000 Mbps Gigabit Ethernet uplink modules. Using their Gigabit Ethernet uplinks, these switches connect to a 12-port SuperStack II 9300 Gigabit Ethernet switch, which serves as the network backbone. This robust design features 12 gigabits of connectivity between the SuperStack II 9300 switch and the SuperStack II 3900 switches.

"One of Avalon's key enabling technologies is the ability of the SuperStack II 3900 switch to trunk its three discrete Gigabit Ethernet uplinks into one," said David Neal, systems administrator for Los Alamos' CNLS and a co-developer of Avalon. "This allows each switch to deliver, in effect, a three gigabit link to the SuperStack II 9300 switch, ensuring the network can sustain extremely high traffic volumes between the 144 nodes."

To reduce costs, Avalon's development team deployed the open source Red Hat Linux 5.1 operating system and other software that are freely available on the Internet.

When Avalon was first built last spring, it featured 70 nodes, but the scalability and high-density of the SuperStack II switches permitted Avalon to attain even greater speeds. "Just half the ports on the SuperStack II 3900 switches had processors linked to them, which meant the network was operating at only half its capacity," added Warren. "The wire-speed performance of the 3900s and the full-duplex 12 gigabit backplane of the 9300 let us expand Avalon to 144 processors, which multiplies its computing power."

With 70 nodes, Avalon reached speeds of 19.2 gigaflops, which is 19.2 billion floating point operations per second. At that speed, Avalon was the 315th most powerful computer in the world. With 144 nodes, the supercomputer's performance is expected to approach 60 gigaflops, which will place it within the top one hundred most powerful computing devices. "Avalon began with much more processing power per dollar than any other supercomputer and we improved upon that considerably," said Neal.

Like many supercomputers, Avalon will be used for exotic scientific research, including exploration into the origins of the universe. Avalon, however, demonstrates that Beowulf class supercomputing is accessible to many organizations. "In campus environments, workstations can be linked together and used at night for supercomputing tasks," said Neal. "A high-speed network is essential so the workstations don't compete for bandwidth, undermining

"When we first set up Avalon, we received all the components on a Friday and connected them over the weekend," said Warren. "On Monday, we had one of the most powerful computers on the planet. Supercomputing is no longer super-costly."

Enhancing the Network to Handle Supercomputing

To complement Avalon's enhanced computing performance, CNLS recently upgraded its LAN. Prior to deploying the SuperStack II Switch 9300 switch-based Gigabit Ethernet backbone, CNLS ran its research applications in a switched Ethernet/FDDI environment based on a 3Com CoreBuilder(R) 6000 Ethernet/FDDI switch. With the introduction of Avalon and the increasing size and complexity of the programs it ran, this infrastructure created a bottleneck in research operations that routinely include gigabyte file transfers. Dismissing ATM as too costly and FDDI and Fast Ethernet as too slow, CNLS chose Gigabit Ethernet as its backbone technology. With the reliability of 3Com systems already proven in its earlier network, CNLS again turned to 3Com for its upgrade.

Today, the CNLS network consists of a Gigabit Ethernet backbone comprised of a single SuperStack II Switch 9300 switch with 1000 Mbps links to five 36-port SuperStack II Switch 3900 switches and to a single CoreBuilder 3500 Layer 3 high function switch. The CoreBuilder 3500 switch provides layer 3 switching, while the SuperStack II 3900 switches provide full-duplex 100 Mbps Ethernet connections to networked desktops.

Upgraded Los Alamos Local Area Network

The laboratory has also upgraded the LAN for its Center for Nonlinear Studies (CNLS). The 200-user network depends on a Gigabit Ethernet backbone based on a 3Com SuperStack II Switch 9300 Gigabit Ethernet switch. 3Com Fast Ethernet edge devices deliver switched 100Mbps to network desktops. As the conduit for information delivery at CNLS, the high-speed 3Com network creates the stable, high-performance environment needed to support bandwidth-intensive research and analysis activities. In addition to the hardware, 3Com is providing CNLS with 24 X 7 support, free software upgrades and advance hardware replacement through its comprehensive Guardian (sm) service program.

Running a rough average over your numbers, you are talking about roughly 140 Mb/s. I assume that your use of the lower case "b" means bits and not bytes. Throw in a couple of large transmissions and you are well into the gigabit range.
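
The post above does not say how the 140 Mb/s figure was derived, so the following is only a back-of-the-envelope sketch of the range implied by the numbers in the question. It assumes "Mb" means megabits and ignores latency and protocol overhead. The function names are illustrative.

```python
# Rough per-pair bandwidth estimate from the figures in the question.
# Assumption: "Mb" means megabits, as the comment above also assumes.


def required_rate_mbps(message_megabits, messages_per_second):
    """Sustained rate one node pair would need, in megabits per second."""
    return message_megabits * messages_per_second


def serialization_ms(message_megabits, link_mbps):
    """Milliseconds to clock one message onto the wire at the given link speed.

    Ignores switch latency, protocol overhead, and software overhead.
    """
    return message_megabits * 1000.0 / link_mbps


# Sizes and frequencies quoted in the question.
low = required_rate_mbps(1, 5)      # 1 Mb messages, 5 per second
high = required_rate_mbps(10, 50)   # 10 Mb messages, 50 per second
burst = required_rate_mbps(100, 5)  # occasional 100 Mb messages, 5 per second

if __name__ == "__main__":
    print("per-pair rate, low end :", low, "Mb/s")
    print("per-pair rate, high end:", high, "Mb/s")
    print("per-pair rate, 100 Mb bursts:", burst, "Mb/s")
    for link in (100, 1000):  # Fast Ethernet vs Gigabit Ethernet
        print("10 Mb message on a", link, "Mb/s link:",
              serialization_ms(10, link), "ms")
```

With these inputs a pair needs anywhere from 5 to 500 Mb/s, which is why Fast Ethernet looks too slow for the upper end and Gigabit Ethernet is the more comfortable fit.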

Since 3Com dumped its core enterprise customers about a year ago, I would be very shy of using their equipment. I would recommend an Extreme Alpine 3804 (16 max gb ports, fully non-blocking), Foundry FastIron II GC (32 max gb ports), or perhaps something from HP.

You did not mention the server hardware architecture, so I will assume that it is an Intel architecture. For the NICs, Intel also makes very good server adapters. Both gig over copper and fiber are supported.

Good luck!

Just compare the prices of the Extreme or Foundry switches with 3Com's 4900. HP... Hm, it's OEM...
The 4900 does not meet his needs as he has 16 nodes. It would be much faster to use the backplane to transfer the data than to have two less expensive switches, but have to move the data on a 10gb uplink (which the 4900 does not have).

I see your point about cost though. For any one of the switches that I mentioned, you are looking at $50k and up once it is populated with ports.
1. The 4900 has a backplane capacity of 32 Gb/s (16 ports x 2, full duplex on each port), and I think it meets his needs.

2. The 4900 costs ~$4000-$5000 plus ~$1500 for the 4-port module,
i.e. ~$5500-$6500 total. NICs are $150-$200 each.
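
As a quick check of the arithmetic in the two points above, here is a small sketch. The switch, module, and NIC prices are the ones quoted in the post. The total including NICs is an added assumption of one NIC per node (16 NICs), which the post does not state.

```python
PORTS = 16           # 12 built-in ports plus the 4-port module
PORT_SPEED_GBPS = 1  # each port is 1000Base-T


def full_duplex_demand_gbps(ports, port_speed_gbps):
    """Worst-case switching load: every port sending and receiving at line rate."""
    return ports * port_speed_gbps * 2


def price_range(*items):
    """Sum (low, high) price pairs into one (low, high) range."""
    return (sum(low for low, _ in items), sum(high for _, high in items))


SWITCH = (4000, 5000)  # quoted price range for the 4900
MODULE = (1500, 1500)  # quoted price for the 4-port module
NIC = (150, 200)       # quoted price range per NIC

switch_total = price_range(SWITCH, MODULE)
# Assumption: one NIC per node, so 16 NICs in total.
nics_total = (NIC[0] * PORTS, NIC[1] * PORTS)
cluster_network_total = price_range(switch_total, nics_total)

if __name__ == "__main__":
    print("full-duplex demand:",
          full_duplex_demand_gbps(PORTS, PORT_SPEED_GBPS), "Gb/s")
    print("switch + module:", switch_total)
    print("16 NICs:", nics_total)
    print("switch + module + NICs:", cluster_network_total)
```

The 32 Gb/s backplane figure matches the worst-case demand of 16 full-duplex gigabit ports, which is what "non-blocking" amounts to here.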
The 4900 only has 12 ports. He needs 16. That puts you into a 4005 or 4007 family. Again, looking at ~50 large.

The 4900 has 12x1000Base-T ports and 1 uplink slot. With the 4x1000Base-T uplink module we get the 16 ports needed for oatnusigma3's cluster.

About Switch 4007. "Fastest" L2 configuration for 16x1000Base-SX ports (expandable):

1  x Switch 4007 24-Port Gigabit Ethernet Fabric,
1  x Switch 4007 7-Slot Chassis 930 Watt AC Power Supply,
1  x Switch 4007 Management Module,
4  x Switch 4007 4-port Gigabit Ethernet I/O Module,
16 x 1000BASE-SX GBIC

costs about $25000-$27000, not $50k.
Good solution with the 4900.

I based my numbers on the 4005 with 16 x 1-port Gb modules at the MSRP of $1995 each. Guess I should have looked closer.