insideHPC Joins Flex Rex Communications

We recently announced that insideHPC, a leading online news publication for High Performance Computing, was acquired by Rich Brueckner.

As part of Flex Rex Communications, insideHPC, LLC will continue to publish the latest news on High Performance Computing. With over 700,000 page views per month, insideHPC is one of the best ways to get your products and services noticed by the HPC community.

The Rich Report: The 16 Terabyte PC – SGI Bets on Exascale

It has been over a year since SGI’s merger with Rackable Systems. The two companies came from very different camps, so I was curious to learn where they are today and where they’re headed in the HPC space. So I caught up with the company’s Chief Technology Officer, Eng Lim Goh, to discuss the company’s new products and their plans for Exascale computing.

insideHPC: How long have you been at SGI?

Dr. Eng Lim Goh

Dr. Lim Goh: Over 20 years now. I started as a systems engineer in Singapore working on the GT workstation.

insideHPC: As CTO, what does a typical day look like for you?

Dr. Lim Goh: These days I spend about 50-60 percent of my time outside with customers. That’s particularly important now given the fact that we are a new company, with Rackable having acquired us and then renaming the company “SGI.” So I’m going out communicating not only about the new company, but also about the new line including products in the Internet Cloud space, which are less familiar to our HPC customers. And I’m also going out to our Cloud customers who are not familiar with our HPC line and storage lines.

So it’s a lot of work to bring the community up to speed on both sides, the Cloud side and the HPC side, and that has been going on for a year now. I think we have now settled into more of a run-rate scenario.

insideHPC: That’s interesting. I remember when I first read about the acquisition. I wasn’t familiar with Rackable, so I looked at a corporate overview video that highlighted all their key customers. The list was a who’s-who of heavy-hitter Internet companies like Amazon, Facebook, and Yahoo, and I thought, my gosh, SGI has become the new “Dot in Dot Com” just like Sun was ten or twelve years ago.

Dr. Lim Goh: That’s very complimentary of you to say. In fact, our latest win was with their EC2 and S3 cloud. They’re one of the biggest cloud providers today, and we supply the majority of systems to that enterprise.

insideHPC: That brings up my next question. You have these distinct customer segments: the Cloud/Internet providers and the typical big HPC clusters. They’re both filling up rooms with x86 racks, but how do their needs differ?

Dr. Lim Goh: The differences are as follows. On the Internet/Cloud side, they have the same 500 racks of computer systems in their datacenter, but they run tens of thousands of different applications like map reduce, memcacheDB, and Hadoop that are highly distributed. And then on the other extreme, in the HPC world, you may have 256 racks and you may even be thinking of running just one application across all of that. I’m just talking extremes here, of course. There are overlaps, but given these extremes, you see that the needs are different.

On the HPC side, an interruption to service on any node in the entire facility can affect productivity. You may have checkpoint/restart, but checkpointing and then restarting both take time, unless the user has intentionally gone into the code to tolerate a node failure while an MPI program is running. So a node failure is more disruptive in the HPC world than on the Cloud side, where systems are designed to be highly tolerant of node failures. As such, the focus is different.

Now let’s look at some of the similarities, like power. This is one area where we have been learning a lot from the Cloud side to bring over to the HPC side. Their Internet datacenters are on the order of 10, 20, or even 50 Megawatts, while in the HPC space a 50 MW datacenter is considered extreme. So in this sense, I’d say the Cloud world actually scales bigger.

insideHPC: So they are facing a lot of the same challenges in terms of power and cooling. What did Rackable bring to the table in this area?

Dr. Lim Goh: With regards to power and cooling on the Cloud side, one of the key requirements Rackable addressed was efficiency. In the early days, when datacenters were on the order of a Megawatt, customers had power efficiency specifications at the tray level. And then more recently they were set at the rack level. So if they were ordering 400 racks like one of our cloud customers, they stopped specifying at the chassis level and started specifying at the rack level. 

So that gave us the opportunity to optimize at the rack level: removing power supplies in every chassis and doing AC to DC conversion in the infrastructure at the rack level. Later, with our CloudRack design, we removed fans at the chassis level as well.  In fact, some Internet datacenters are demanding that those racks are able to run very warm, as high as 40 degrees Centigrade, in order to reduce energy consumption on the cooling side. 

So then as they move to even larger scales, with Cloud datacenters that run tens of Megawatts, they are moving to the next level up of granularity and specifying efficiencies at the container level. At that level, we essentially have a modular datacenter, and this is where they started to specify a PUE requirement for each container that we ship. Today the standard requirements are on the order of 1.2 PUE, with more recent procurements demanding even more efficiency than that.

So on the Internet/Cloud side, yes, the expertise brought by Rackable was the ability to scale with the customer’s requirements as they went from 1 Megawatt to tens of Megawatts and keep up with these datacenters’ demands for higher and higher efficiencies.

insideHPC: You mentioned container-based datacenters. I came from Sun, where we never seemed to make hay with our Project Blackbox. How well is SGI doing with its ICE Cube container datacenters?

Dr. Lim Goh: We have shipped containers to a number of customers. Some have allowed us to name them publicly, such as British Telecom and China Unicom, and we also have a couple of Cloud providers who are evaluating ICE Cubes for wider deployment.

insideHPC: Are the HPC customers interested in containers, or are they still on the fence?

Dr. Lim Goh: This is where I think the combination of the two companies, Rackable and SGI, has very strong leverage, because the HPC world is coming up to where the Internet datacenters are in terms of scale. So when we’re talking about Exascale computing here, and they are specifying 20 MW for a future Exascale system, this is something that the Rackable side is very familiar with in terms of power. So for HPC, we are actually drawing a lot on their expertise in delivering to Internet datacenters at that scale and at that requirement for efficiency.

For example, if there is someday an HPC datacenter that requires an extreme PUE number of, say, 1.1 as a key criterion in addition to meeting the Exascale requirements, that only leaves you 10 percent of the power budget for cooling. So we have drawn from the Cloud datacenter side, where they already have such requirements, an air-cooled container that just takes in outside air through a filter to cool your systems. We have one such system now that has passed the experimental stage and is ready for deployment. In many places in the world, if we can build a system that can tolerate 25 degrees Centigrade most of the year, you can get free cooling. For those places where you can’t count on 25 degrees C, adiabatic cooling essentially uses a garden hose (I’m simplifying) type connection to wet the filter, just like a swamp cooler. Depending on humidity levels, you can get a five to ten degree Centigrade reduction in temperature.
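As a rough illustration of what a PUE target implies, here is a quick back-of-the-envelope sketch in Python. The 20 MW figure is the Exascale power budget mentioned above; the helper function name is invented for this example.

```python
# Illustrative sketch: what a PUE target implies for cooling overhead.
# PUE = total facility power / IT equipment power.
def cooling_overhead_fraction(pue):
    """Fraction of IT power left over for cooling and other overhead."""
    return pue - 1.0

# A 20 MW Exascale-class IT load at the PUE figures discussed above:
it_power_mw = 20.0
for pue in (1.2, 1.1):
    overhead_mw = it_power_mw * cooling_overhead_fraction(pue)
    print(f"PUE {pue}: {overhead_mw:.1f} MW for cooling/overhead")
```

So at PUE 1.1, a 20 MW system gets only about 2 MW for everything besides the computers themselves, which is why free-air and evaporative cooling become attractive.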

insideHPC: So that brings up another issue. When you have that kind of scale going on, system management must be a huge undertaking.

Dr. Lim Goh: Absolutely. We have hierarchical systems management tools with a user interface to manage all the way from the compute side to the interconnect side and then all the way to facility power consumption. And of course, at the container level, we have a modular control system that handles temperature, humidity, pressure, and outside air. And that modular system feeds upward to the hierarchical systems management tools.

insideHPC: Since we’re talking about big scale, I think we should dive into the new Ultra Violet product, SGI Altix UV, that you announced at SC09. Is that product shipping now?

Dr. Lim Goh: We began shipping the Altix UV a few weeks ago. We now have about 50 orders, so there is a lot of interest in the system. 

In terms of its use, there are two areas in which the Altix UV is of great interest. On the one hand, you have customers who are very interested in big, scale-up nodes. You know, with today’s Nehalem EX you can get two-, four-, and eight-socket systems. If you think of it that way, the Altix UV scales beyond that eight-socket limit all the way to 256 sockets and 16 Terabytes of memory. So that’s one way to look at the Altix UV. The 16 Terabyte memory limit is there because the Nehalem core only has 44 bits of physical address space.
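The 16 Terabyte figure follows directly from the 44-bit physical address space; a quick check in Python:

```python
# 44 physical address bits -> maximum addressable memory.
terabyte = 2**40          # 1 TiB in bytes
max_addressable = 2**44   # bytes addressable with 44 address bits
print(max_addressable // terabyte)  # -> 16 (Terabytes)
```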

So that’s one of the ways of looking at Altix UV. And the reason people buy that, for example, is heavy analytics where they load in 10 Terabyte datasets and then use the 256 sockets, which equates to up to 2000+ cores, to work on that dataset. 

insideHPC: And that’s a single system image for all those cores?

Dr. Lim Goh: Yes. It runs as a Single System Image on the Linux operating system, either SuSE or Red Hat, and we are in the process of testing Windows on it right now. So when you get Windows running on it, it’s really going to be a very, very big PC. It will look just like a PC. We have engineers who are compiling code on their laptops, and the binary just works on this system. The difference is that their laptops have two Gigabytes of memory and the Altix UV has up to 16 Terabytes of memory and 2000+ physical cores.

So this is going to be a really big PC. Imagine trying to load a 1.5 Terabyte Excel spreadsheet and then working with it all in memory. That’s one way of using the Altix UV.

insideHPC: Did you develop a new chip to do the communications?

Dr. Lim Goh: Yes. We are leveraging an ASIC that we developed. You can call it a node controller, but we call it the Altix UV Hub. Every Hub sits below two Nehalem EX (8-core) sockets. And this Hub essentially talks to every other Hub in every node in the system and fuses the memory in those nodes into one collective. So when the Linux operating system or Windows operating system comes in, it thinks that this is one big node. That’s how it works.

So all the cache coherency, sharing, and so on is done by that chip in hardware, not in software. Others have done this in software, but here it is all done in hardware, even the tracking of who is sharing what in the shared-memory system. It’s all registered in hardware on that chip, and the chip carries its own private memory to keep track of all those vectors.

insideHPC: So how does this kind of Big Node change the way scientists can approach their problems?

Dr. Lim Goh: This is a brilliant question. Although the Altix UV is a great tool for large-scale analytics, we are starting to see a lot of interest from the scientists and engineers. There are many scenarios, but let me describe to you one scenario.

If you take typical scientists, the chemists, physicists, or biologists, they do research in the labs and write programs on their laptops to experiment with ideas. So they work with these ideas on their laptop, small scale, but what do they do today when they need to scale up their problems? Today they either have to encode it in MPI themselves, or bring in computational scientists from a supercomputer center or university to code it for them and run it in parallel. And this transition takes weeks, if not months. What we envision is that the scientist will plug into the Altix UV instead of waiting all that time for the validation. The Altix UV plugs into the middle here by giving the scientists a bigger PC; it does not replace MPI.

Let’s look at a very common example. If you take a cube model with 1000 grid points in the X direction and 1000 grid points in each of the Y and Z directions, and then you march this cube through 1000 time steps, that is a 1 trillion-point (Terapoint) dataset. If every point is an eight-byte, double-precision number, then this is an 8 Terabyte cube with a thousand time steps. You must remember that this is a chemist, physicist, or biologist, right? Why can’t they simply write an array over X, Y, Z, and T? If I have my ideas in the center of that cube, I just want to run it and see how it works. It can take weeks to run; it doesn’t matter. I just want to know the results at the end.
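As a quick sanity check on the sizes in this example, the arithmetic can be sketched in a few lines of Python (grid dimensions and byte counts as quoted above; Terabytes taken loosely as 10^12 bytes):

```python
# Back-of-the-envelope check of the "Terapoint" cube example.
nx = ny = nz = 1000          # grid points per dimension
timesteps = 1000             # time steps marched
bytes_per_point = 8          # one double-precision value per point

points_total = nx * ny * nz * timesteps       # 1e12 -> a Terapoint dataset
bytes_total = points_total * bytes_per_point  # 8e12 bytes ~= 8 Terabytes

print(points_total)         # 1000000000000
print(bytes_total / 10**12) # 8.0 (Terabytes)
```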

The alternative is to go with MPI because you can’t run this on your laptop. An 8 Terabyte dataset would be paging forever. But we can now supply a 10 Terabyte PC to run problems like these. My suspicion is that they will eventually move to MPI as they run more rigorous simulations. So rather than replace MPI, Altix UV gives them a bridge to scale their simulations. 

insideHPC: What other ways might they use Altix UV?

Dr. Lim Goh: There is another way to use the Altix UV that we envision, and that is to use it as a front end to an Exascale system. Imagine your Exascale cluster with tens or hundreds of Petabytes of distributed memory, where you’re using Message Passing or some other kind of API to run a large application. Since this system is going to generate massive amounts of data, it would be good to have a head node that could handle that data for your analysis work. You can’t use a PC any more in the Exascale world; you need something bigger.

insideHPC: So there is a lot of talk these days about Exascale in the next eight or ten years. Where do you see SGI playing a role in that space?

Dr. Lim Goh: I think our role in Exascale will be two-fold. The first fold will be to use this Big PC concept, with 16 Terabytes going to 64 Terabytes in 2012, and use it as the front end to an Exascale system. We would like the next generations of Altix UV to be the front end of every Exascale system that’s out there. Because if you are already spending tens of millions or hundreds of millions of dollars to build an Exascale system, it’s worth spending a little more so that you can get better use and be more productive with the output of that Exascale system.

Another role for SGI is developing the Exascale system itself. And this is where we are looking at providing a partitioned version of the Altix UV to be the key Exascale system. 

So let’s look at Exascale systems now. If you look at the top research priorities for achieving Exascale within this decade, power and cooling is generally number one: how do you get an Exaflop within 20 Megawatts? Number two would be resilience: can the Exascale system stay up long enough to at least do a checkpoint? (laughs) On these two we are working very closely with microprocessor and accelerator vendors.

But the next two priorities are what we are focusing on ourselves: communications across the systems (essentially the interconnect) and usability. As I’ve described on the usability side, we will be looking at the Altix UV as a big head node. 

In the communications area, we believe the interconnect needs to be smarter for an Exascale system to work. Why? Because you cannot get away from global collectives, for example, in an MPI program unless you code specifically for Exascale applications to avoid them. Many of the applications that try to run on an Exascale system this large will have global collectives and will need to do massive communications in the course of running.

insideHPC: So how do you propose to reduce communications overhead in an Exascale system?

Dr. Lim Goh: We sat down and worked out that to cut down that overhead, we need a global address space. With this, memory in every node in the Exascale system is aware (through the node controller) of the memory in every other node in the entire infrastructure, so that when you send a message, a synchronization, or a GET/PUT to do communications, you do it with very little overhead.

But I must emphasize, as many even well-informed HPC people misunderstand, that this global address space is not shared memory. This is the other part of Altix UV that has not been understood well. Let me therefore lay it out.

At the very highest level you have shared memory. The next level down you have global address space, and the next level down you have distributed memory. Distributed memory is what we all know; each node doesn’t know about its neighboring nodes, and what you have to do is send a message across. That’s why it’s called Message Passing.

Shared memory then is all the way up. Every node sees all the memory in every other node and hears all the chatter in every other node. Whether it needs it or not, it will see everything and hear everything. That’s why Linux or Windows can just come in and use the big node.

However, for all the goodness that big shared memory brings by seeing and hearing everything, it cannot scale to a billion threads. It’s just like being in a crowded room and trying to pay attention to all the chatter at once even though it is not meant for you. You would get highly distracted.

So if you go to the other extreme to a distributed memory, you see nothing and you hear nothing. The only way you can get a communication across is to send a message, or essentially write an email and send it to a neighbor. But that is too slow. Thus, shared memory is too inclusive and distributed memory is too exclusive for Exascale. 

So we decided that a global address space is the best middle ground. In that analogy, global address space sees everything, but does not hear the neighbors chattering. All it wants to do is see everything so that it can do a GET/PUT directly, do a SEND/RECEIVE directly, or do a synchronization directly. So a hardware-supported global address space is one way to lower the communications overhead in the Exascale world. And this is especially important when you’re talking about a billion threads. Imagine trying to do a global sum on a billion threads. I hope we can code around it, but my suspicion is that there will still be applications needing to do it.
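To make the three models concrete, here is a toy, single-process Python sketch. It is purely illustrative: the `Node` class and the `put`/`get`/`mpi_send`/`mpi_recv` helpers are invented for this example and are not SGI’s hardware or any real MPI/SHMEM API.

```python
# Toy model contrasting two of the communication styles described above.
class Node:
    def __init__(self, size):
        self.mem = [0] * size   # this node's local memory
        self.inbox = []         # messages arriving from other nodes

# Distributed memory: the only way in is a message ("write an email").
def mpi_send(dst, value):
    dst.inbox.append(value)

def mpi_recv(node):
    return node.inbox.pop(0)    # receiver must explicitly pick it up

# Global address space: any node can PUT/GET into another node's memory
# directly, without the target participating ("sees everything, but
# does not hear the chatter").
def put(dst, offset, value):
    dst.mem[offset] = value

def get(src, offset):
    return src.mem[offset]

a, b = Node(4), Node(4)
mpi_send(b, 42)                 # two-sided: b must also call recv
assert mpi_recv(b) == 42
put(b, 0, 7)                    # one-sided: b does nothing at all
assert get(b, 0) == 7
```

Full shared memory would be the third extreme: a single `mem` that every node reads and writes, with the coherence traffic ("chatter") that implies.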

insideHPC: I can tell by your voice that you have great passion for this subject. It sounds like the next ten years are going to be very exciting for SGI.

Dr. Lim Goh: You are very right about that. Because we sit here looking at the world and saying that we need to go to Exascale. At the same time, people are realizing that, OK, we first have to do R&D on power, cooling, and resiliency, plus we need a smarter interconnect. Sure, SGI is there with the others working on the first set of problems, but we have been working hard on this communications problem for 10 or 15 years already. And now we have what we think is an ideal solution to the usability problem of an Exascale system with our big PC head-node concept as well. So in summary, we believe SGI can really contribute there.

The Rich Report is produced by Rich Brueckner at Flex Rex Communications. You can follow Rich on Twitter.

The Rich Report Podcast: New Exascale Report Tracks the Future

In this podcast, I interview John West and Mike Bernhardt, founders of The Exascale Report. This new, subscription-based publication offers insider reporting on the technologies and initiatives that are in the works for Exascale computing in the coming decade.

“If you look at Exascale efforts internationally, you really find very little awareness among the different groups about what other people are doing,” said John E. West, co-founder of The Exascale Report. “In many cases, it’s not only that a group’s awareness is limited to their own country, it’s limited to their own project. Because these are very difficult to fund, very intensive scientific and engineering efforts to build these computers and the people working on this stuff are really just very busy and very focused. And so what The Exascale Report hopefully does is give an in-depth view to all of our readers on what’s going on in all the other groups. So it will kind of bring the community together and hopefully stimulate some discussion that raises all boats at the same time.”


The Rich Report: InfiniBand Rocks the TOP500

On my recent trip to ISC10 in Hamburg this month, the real buzz centered around the industry reaching sustained Exaflops in the next decade or so. And while this milestone is daunting enough in terms of processing, it struck me that the challenges of getting that many cores connected with low latency is going to be the highest hurdle of all. Is InfiniBand going to get us there, or will it require something way beyond current technology trends? To find out, I caught up with Brian Sparks and Clem Cole from the InfiniBand Trade Association, to talk about IBTA’s latest performance roadmap.

insideHPC: This is TOP500 week. How did IB fare in the rankings?


Brian Sparks, IBTA Marketing Working Group Co-Chair

Brian Sparks: We did really well. InfiniBand is now deployed on 208 of the TOP500 systems, and that’s an increase of 37 percent from a year ago. And that success is really reflected in the upper tiers. In the Top100, we have 64 systems, and in the Top10, we have five systems. InfiniBand has powered Top10 systems on every list since 2003, so what you’re seeing is our momentum continuing to increase.

insideHPC: Are you gaining TOP500 “share” at the expense of proprietary interconnects?

Brian Sparks: A lot of what it’s eaten has been the proprietary interconnects. If you look at all the proprietary links combined, I think there are only 25 to 28 such clusters left on the TOP500. The remaining gains have come at the expense of GigE, which has gone down to the 235-240 range. There are also a couple of 10 GigE clusters entering the mix now, finally.

Clem Cole, Senior Principal Engineer, Intel

Clem Cole: We aren’t here to bash anyone, but I think what Brian describes is correct. I’m enough of a gray-haired historian on this subject to say that the proprietary guys face a shrinking value proposition. That doesn’t mean that there isn’t room for them. I absolutely believe that for the Top5 kind of customers, there will tend to be a value for somebody to invent something that takes it to the next level. Will they do it by starting with IB, or will they start by blowing up IB like IB did with Ethernet? I don’t know. If they do go on their own, history has shown that it’s very tough to survive. The point is, I think that for the bulk of the Top500, the guys who really want a multicomputer site environment with a multi-vendor ecosystem, that’s what IB is about.

insideHPC: What are the highlights of the latest IB roadmap?

Brian Sparks: The bottom line is really that IB continues to evolve with leading bandwidth performance and other enhancements. IBTA’s current roadmap calls for 4x EDR ports at 104Gb/s data rates in 2011. That’s 26Gb/s per lane, a significant uptick from IBTA’s previous roadmap from June 2008, where we projected 4x EDR at less than 80Gb/s in 2011. The other performance news is that the spec is moving from 8b/10b to 64b/66b encoding, so the data throughput will now better match the link speed. Taking the new encoding into account, EDR will deliver over 3X the data bandwidth that QDR now provides.
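For readers who want to check the encoding arithmetic, here is a quick sketch. The lane counts and speeds are the figures quoted above; 8b/10b carries 8 payload bits per 10 line bits, and 64b/66b carries 64 per 66.

```python
# Effective data throughput after line-code overhead.
def data_rate(lanes, lane_gbps, payload_bits, total_bits):
    return lanes * lane_gbps * payload_bits / total_bits

qdr = data_rate(4, 10, 8, 10)    # 4x QDR with 8b/10b  -> 32 Gb/s of data
edr = data_rate(4, 26, 64, 66)   # 4x EDR with 64b/66b -> ~100.8 Gb/s

print(qdr, round(edr, 1), round(edr / qdr, 2))  # 32.0 100.8 3.15
```

The ratio of about 3.15 is where the "over 3X the data bandwidth" claim comes from.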

Clem Cole: I think the big message is that the performance gains are going to keep coming. So if you look at the gains we’re making from current speeds of 40Gb/s to over 100Gb/s in 2011, IB continues to be a very economical solution to one of the toughest problems in High Performance Computing.

insideHPC: Clem, you were part of the IBTA effort from the beginning. Can you tell me more about where IB came from?

Clem Cole: Well, you really need to go back to the beginning. What we think of as clustering today is really an old idea that goes back to Gerald Popek at UCLA in the early 1970s. Jerry was the first one to put his finger on this idea of taking multiple computers and orchestrating them as one big system.

So we could all see that this was the way the industry was going. And along the way, people started building these custom interconnects and we had a situation where the big vendors all had their own proprietary network technology going. At DEC we had something called Memory Channel and we were facing big development costs, like $60-$100 Million for the next generation.

Remember that this is all driven by money. So we as vendors all needed a high-bandwidth technology that was going to meet the demand, but not one of us was going to make any money if we continued to fight each other. So InfiniBand originated from the 1999 merger of two competing designs: Future I/O, developed by Compaq, IBM, and HP, and Next Generation I/O, developed by Intel, Microsoft, and Sun.

So in the end what made InfiniBand, and for that matter Ethernet go, was everybody agreeing that this was good enough. So instead of trying to differentiate with the interconnect, we could do our own value-add in other areas.

insideHPC: How important has the OpenFabrics project been to industry adoption of InfiniBand?

Clem Cole: I think OpenFabrics was one of the most important milestones in getting IB to be successful. The OpenFabrics Alliance took the OpenFabrics Enterprise Distribution upstream and helped make it part of the Linux kernel. So now you’ve got the distributions like SuSE and Red Hat where it’s just in there. And don’t forget Windows as well.

insideHPC: InfiniBand has a reputation of being much harder to implement and manage than Ethernet. Does the IBTA recognize this as an issue that needs addressing?

Brian Sparks: I think that for HPC, IB has gone through a lot of evolution in terms of ease of use. When it first came out, you had a lot of scientists who were eager to play around with things and make it work. And now as you start going into enterprise solutions, people just want to drop it in and not worry about it. So as the years have evolved, we’ve been able to make that possible.

As an organization, IBTA has been trying to address InfiniBand’s reputation as being difficult to work with. We recently came out with an eBook called Introduction to InfiniBand for End Users. It’s kind of an IB for Dummies document with some key know-how, such as what IB management looks like and how it differs from the Ethernet management you’re used to.

So InfiniBand continues to evolve, and these efforts are really important because IB isn’t just for supercomputers and hard-core scientists any more. IB lets you add a server any time you want, and for things like cloud computing that’s a great value to the enterprise as well.

Video: Life is Random, So is Your Storage

I am fascinated with the CGI effects in today’s feature films and it seems like HPC infrastructure has become a competitive weapon for the movie studios. So when I heard that Weta Digital (Lord of the Rings) was using BlueArc, I decided to take a closer look.

In this video, BlueArc Director of HPC Marketing Bjorn Andersson talks about why storage loads are so random and how the company’s storage solutions are built for optimal performance.

Video: Why is Everyone Talking About GPUs for HPC?

I think that NVIDIA was a big winner this year at ISC, and it wasn’t just because China’s new Tesla-powered Nebulae Supercomputer came in at number 2 on the TOP500. The reason for me was rapid adoption; I visited at least half a dozen booths that featured the latest NVIDIA GPUs in a variety of configurations. Clearly, the company is getting traction in the market.

In this video, I interview Andy Keane, NVIDIA General Manager of the Tesla Business Unit and discuss the advantages of GPUs for HPC. He also gives us his views on power efficiency and Exascale computing.

Sun Compute Cluster: “A Solid Pedigree”

Paul Weinberg weighs in on Sun’s new series of preconfigured and pre-architected hardware and software tools for web build out, data center efficiency, and high performance computing:
“The Sun Compute Cluster, the final item in the series, is billed by the vendor as a scalable and open solution for small and medium sized organizations that require compute-intensive data processing. Included is modular processing power to reduce the complexity of IT decision-making, pre-configured high performance computing containing four base building block configurations and a Linux OS-based software stack.” Full Story