The Rich Report: The 16 Terabyte PC – SGI Bets on Exascale

It has been over a year since SGI’s merger with Rackable Systems. The two companies came from very different camps, so I was curious to learn where they are today and where they’re headed in the HPC space. I caught up with the company’s Chief Technology Officer, Eng Lim Goh, to discuss the company’s new products and its plans for Exascale computing.

insideHPC: How long have you been at SGI?

Dr. Eng Lim Goh

Dr. Lim Goh: Over 20 years now. I started as a systems engineer in Singapore working on the GT workstation.

insideHPC: As CTO, what does a typical day look like for you?

Dr. Lim Goh: These days I spend about 50-60 percent of my time outside with customers. That’s particularly important now given the fact that we are a new company, with Rackable having acquired us and then renaming the company “SGI.” So I’m going out communicating not only about the new company, but also about the new line including products in the Internet Cloud space, which are less familiar to our HPC customers. And I’m also going out to our Cloud customers who are not familiar with our HPC line and storage lines.

So, it’s a lot of work to bring the community up to speed on both sides–the Cloud side and the HPC side, and that has been going on for a year now. And I think we have come to more of a run-rate-like scenario now.

insideHPC: That’s interesting. I remember when I first read about the acquisition. I wasn’t familiar with Rackable, so I looked at a corporate overview video that highlighted all their key customers. The list was a who’s-who of heavy-hitter Internet companies like Amazon, Facebook, and Yahoo, and I thought, my gosh, SGI has become the new “Dot in Dot Com” just like Sun was ten or twelve years ago.

Dr. Lim Goh: That’s very complimentary of you to say. In fact our latest win was with their EC2 and S3 cloud. They’re one of the biggest cloud providers today and we supply the majority of systems to that enterprise.

insideHPC: That brings up my next question. You have these distinct customer segments: the Cloud/Internet providers and the typical big HPC clusters. They’re both filling up rooms with x86 racks, but how do their needs differ?

Dr. Lim Goh: The differences are as follows. On the Internet/Cloud side, they may have 500 racks of computer systems in their datacenter, but they run tens of thousands of different applications like MapReduce, MemcacheDB, and Hadoop that are highly distributed. And then on the other extreme, in the HPC world, you may have 256 racks and you may even be thinking of running just one application across all of that. I’m just talking extremes here, of course. There are overlaps, but given these extremes, you see that the needs are different.

On the HPC side, interruption to services on any node in the entire facility can affect productivity. For example, you may have checkpoint restart, but even if you have checkpointed, it takes time to checkpoint and then restart, unless the user has intentionally gone into the code to tolerate a node failure while an MPI program is running. So a node failure can be more disruptive in the HPC world than on the Cloud side, where applications are designed to be highly tolerant of node failures. And as such the focus is different.

Now let’s look at some of the similarities, like power. This is one area where we have been learning a lot from the Cloud side to bring over to the HPC side. Their Internet datacenters are on the order of 10, 20, or even 50 Megawatts, while in the HPC space a 50 MW datacenter is considered extreme. So in this sense, I’d say the Cloud world actually scales bigger.

insideHPC: So they are facing a lot of the same challenges in terms of power and cooling. What did Rackable bring to the table in this area?

Dr. Lim Goh: With regard to power and cooling on the Cloud side, one of the key requirements Rackable addressed was efficiency. In the early days, when datacenters were on the order of a Megawatt, customers had power efficiency specifications at the tray level. And then more recently they were set at the rack level. So if they were ordering 400 racks like one of our cloud customers, they stopped specifying at the chassis level and started specifying at the rack level.

So that gave us the opportunity to optimize at the rack level: removing power supplies in every chassis and doing AC to DC conversion in the infrastructure at the rack level. Later, with our CloudRack design, we removed fans at the chassis level as well.  In fact, some Internet datacenters are demanding that those racks are able to run very warm, as high as 40 degrees Centigrade, in order to reduce energy consumption on the cooling side. 

So then as they move to even larger scales, with Cloud datacenters that run tens of Megawatts, they are moving to the next level up of granularity and specifying efficiencies at the container level. At that level, we essentially have a modular datacenter, and this is where they started to specify a PUE requirement for each container that we ship. Today the standard requirements are on the order of 1.2 PUE, with more recent acquisitions demanding even more efficiency than that.

So on the Internet/Cloud side, yes, the expertise brought by Rackable was to be able to scale with the customer’s requirements as they went from 1 Megawatt to tens of Megawatts and keep up with these datacenters’ demands for higher and higher efficiencies.

insideHPC: You mentioned container-based datacenters. I came from Sun where we never seemed to make hay with our Project Blackbox. How well is SGI doing with its ICE cube container datacenters?

Dr. Lim Goh: We have shipped containers to a number of customers. Some have allowed us to name them publicly, such as British Telecom and China Unicom, and we also have a couple of Cloud providers who are evaluating ICE cubes for wider deployment.

insideHPC: Are the HPC customers interested in containers, or are they still on the fence?

Dr. Lim Goh: This is where I think the combination of the two companies, Rackable and SGI, has very strong leverage because the HPC world is coming up to where the Internet datacenters are in terms of scale. So when we’re talking about Exascale computing here, and they are specifying 20 MW for a future Exascale system, this is something that the Rackable side is very familiar with in terms of power. So for HPC, we are actually drawing a lot on their expertise of delivering to Internet datacenters at that scale and at that requirement for efficiency.

For example, if there is someday an HPC datacenter that requires an extreme PUE number of say 1.1 as a key criterion in addition to meeting the Exascale requirements, that only gives you 10 percent of the power that can be used for cooling. So we have drawn from the Cloud datacenter side, where they already have such requirements for an air-cooled container that just takes in outside air through a filter to cool your systems. We have one such system now that has passed the experimental stage and is ready for deployment. And in many places in the world, if we can build a system that can tolerate 25 degrees Centigrade most of the year, you can get free cooling. However, for those places where you can’t get 25 degrees C, this adiabatic cooling essentially uses a garden hose (I’m simplifying it) type connection to wet the filter just like a swamp cooler. Depending on humidity levels, you can get five to ten degrees Centigrade reduction in temperature.
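
As an aside to the interview, here is a back-of-the-envelope sketch of the PUE arithmetic mentioned above. It assumes the usual definition of PUE as total facility power divided by IT equipment power; the function name and the 20 MW IT load are illustrative assumptions, not figures from an SGI design.

```python
def overhead_power_mw(it_power_mw: float, pue: float) -> float:
    """Power available for cooling and every other non-IT load, in MW.

    Assumes PUE = total facility power / IT equipment power, so the
    overhead is IT power times (PUE - 1).
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_power_mw * (pue - 1.0)


if __name__ == "__main__":
    it_load_mw = 20.0  # hypothetical IT load, chosen only for illustration
    for pue in (1.2, 1.1):
        overhead = overhead_power_mw(it_load_mw, pue)
        print(f"PUE {pue}: {overhead:.1f} MW for cooling and other overhead")
```

Under that definition, a PUE of 1.1 leaves overhead equal to 10 percent of the IT load, which is why outside-air cooling becomes attractive at that target.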

insideHPC: So that brings up another issue. When you have that kind of scale going on, system management must be a huge undertaking.

Dr. Lim Goh: Absolutely. We have hierarchical systems management tools with a user interface to manage all the way from the compute side to the interconnect side and then all the way to facility power consumption. And of course, at the container level, we have a modular control system that handles temperature, humidity, pressure, and outside air. And that modular system feeds upward to the hierarchical systems management tools.

insideHPC: Since we’re talking about big scale, I think we should dive into the new Ultra Violet product, SGI Altix UV, that you announced at SC09. Is that product shipping now?

Dr. Lim Goh: We began shipping the Altix UV a few weeks ago. We now have about 50 orders, so there is a lot of interest in the system. 

In terms of its use, there are two areas in which the Altix UV is of great interest. On the one hand, you have customers who are very interested in big, scale-up nodes. You know, with today’s Nehalem EX you can get two, four, and eight socket systems. If you think in that way, the Altix UV scales beyond that eight socket limit all the way to 256 sockets and 16 Terabytes of memory. So that’s one way to look at the Altix UV. The 16 Terabyte memory limit is because the Nehalem core only has 44 bits for physical address space.

So that’s one of the ways of looking at Altix UV. And the reason people buy that, for example, is heavy analytics where they load in 10 Terabyte datasets and then use the 256 sockets, which equates to up to 2000+ cores, to work on that dataset. 
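
As an aside to the interview, the two headline numbers can be checked with simple arithmetic. This sketch assumes byte-addressable memory and binary terabytes (2^40 bytes); the 8 cores per socket is the Nehalem EX figure mentioned elsewhere in the interview, and the constant names are illustrative.

```python
PHYSICAL_ADDRESS_BITS = 44  # physical address width quoted in the interview
SOCKETS = 256               # maximum socket count quoted in the interview
CORES_PER_SOCKET = 8        # Nehalem EX (8-core), as mentioned in the interview

# Each distinct physical address names one byte (assumption: byte addressing).
addressable_bytes = 2 ** PHYSICAL_ADDRESS_BITS
addressable_tib = addressable_bytes // 2 ** 40

total_cores = SOCKETS * CORES_PER_SOCKET

print(f"{PHYSICAL_ADDRESS_BITS}-bit physical addresses reach {addressable_tib} TiB")
print(f"{SOCKETS} sockets x {CORES_PER_SOCKET} cores = {total_cores} cores")
```

So 44 address bits give exactly 16 binary terabytes, and 256 eight-core sockets give 2048 cores, which matches the "2000+" figure.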

insideHPC: And that’s a single system image for all those cores?

Dr. Lim Goh: Yes. It runs as a Single System Image on the Linux operating system, either SUSE or Red Hat, and we are in the process of testing Windows on it right now. So when you get Windows running on it, it’s really going to be a very, very big PC. It will look just like a PC. We have engineers that are compiling code on their laptops and the binary just works on this system. The difference is that their laptops have two Gigabytes of memory and the Altix UV has up to 16 Terabytes of memory and 2000+ physical cores.

So this is going to be a really big PC. Imagine trying to load a 1.5 Terabyte Excel spreadsheet and then working with it all in memory. That’s one way of using the Altix UV.

insideHPC: Did you develop a new chip to do the communications?

Dr. Lim Goh: Yes. We are leveraging the ASIC chip that we developed. You can call it a node controller, but we call it the Altix UV Hub (HUV). Every hub sits below two Nehalem EX (8-core) sockets. And this Hub essentially talks to every other Hub in every node in the system and fuses the memory in those nodes into one collective. So when the Linux operating system or Windows operating system comes in, it thinks that this is one big node. That’s how it works.

So all the cache coherency, sharing, and all that is done by that chip in hardware, not in software. Others have done this with software, but this is all done in hardware, even the tracking of who is sharing what in the shared memory system. It’s all registered in hardware on that chip, and that chip carries its own private memory to keep track of all the vectors.
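
As an aside to the interview, the idea of tracking "who is sharing what" with vectors can be illustrated with a toy directory that keeps one sharer bit-vector per cache line. This is a generic textbook-style sketch written in software purely for illustration; the class and method names are hypothetical, and it does not describe how the Altix UV Hub is actually implemented.

```python
from typing import Dict, List


class ToyDirectory:
    """Toy coherence directory: one sharer bit-vector per cache line."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        # Maps a cache-line address to a bit-vector; bit n set means node n
        # currently holds a copy of that line.
        self.sharers: Dict[int, int] = {}

    def read(self, node: int, line: int) -> None:
        """Record that `node` now holds a shared copy of `line`."""
        self.sharers[line] = self.sharers.get(line, 0) | (1 << node)

    def write(self, node: int, line: int) -> List[int]:
        """Make `node` the sole holder and return the nodes to invalidate."""
        vector = self.sharers.get(line, 0)
        to_invalidate = [
            n for n in range(self.num_nodes)
            if vector & (1 << n) and n != node
        ]
        self.sharers[line] = 1 << node
        return to_invalidate


if __name__ == "__main__":
    directory = ToyDirectory(num_nodes=4)
    directory.read(0, 0x40)
    directory.read(2, 0x40)
    print(directory.write(1, 0x40))  # nodes 0 and 2 must drop their copies
```

The point of the sketch is only that a writer needs to notify the nodes named in the vector, rather than broadcasting to every node in the system.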

insideHPC: So how does this kind of Big Node change the way scientists can approach their problems?

Dr. Lim Goh: This is a brilliant question. Although the Altix UV is a great tool for large-scale analytics, we are starting to see a lot of interest from the scientists and engineers. There are many scenarios, but let me describe to you one scenario.

If you take typical scientists, the chemists, physicists, or biologists, they do research in the labs and write programs on their laptops to experiment with ideas. So they work with these ideas on their laptops at small scale, but what do they do today when they need to scale up their problems? Today they have to either code it in MPI themselves, or try to get computational scientists in from a supercomputer center or university to code it for them and run it in parallel. And this transition takes weeks, if not months. So what we envision is that the scientist will plug into the Altix UV instead of waiting all that time to validate an idea. The Altix UV fits into the middle here by giving the scientists a bigger PC; it does not replace MPI.

Let’s look at a very common example. If you take a cube model with 1000 grid points in the X direction and 1000 grid points each in the Y and Z directions, and then you march this cube through 1000 time steps, that would be a 1 trillion-point (Terapoint) dataset. Now if every point were an 8-byte, double-precision number, then this is an 8 Terabyte dataset: the cube over a thousand time steps. You must imagine that this is a chemist, physicist, or biologist, right? How do you simply write an array XYZT? If I really have my ideas in the center of that cube, I just want to run it to see how it works. It can take weeks to run; it doesn’t matter. I just want to know the results at the end.

The alternative is to go with MPI because you can’t run this on your laptop. An 8 Terabyte dataset would be paging forever. But we can now supply a 10 Terabyte PC to run problems like these. My suspicion is that they will eventually move to MPI as they run more rigorous simulations. So rather than replace MPI, Altix UV gives them a bridge to scale their simulations. 
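
As an aside to the interview, the size of the XYZT array in this example follows from a short calculation. The sketch below assumes one 8-byte double-precision value per grid point per time step and decimal terabytes (10^12 bytes); the variable names are illustrative.

```python
import math

# Shape of the hypothetical XYZT array: 1000 points per spatial axis,
# marched through 1000 time steps.
shape = (1000, 1000, 1000, 1000)
BYTES_PER_POINT = 8  # one double-precision number per point (assumption)

total_points = math.prod(shape)               # 10**12, one "Terapoint"
total_bytes = total_points * BYTES_PER_POINT  # 8 * 10**12 bytes
total_terabytes = total_bytes / 10 ** 12

print(f"{total_points:,} points, {total_terabytes:.0f} TB held in memory at once")
```

On a machine with enough memory, the scientist could in principle declare that array directly and index it as `a[x][y][z][t]`, which is the simplicity being argued for here.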

insideHPC: What other ways might they use Altix UV?

Dr. Lim Goh: There is another way to use the Altix UV that we envision and that is to use it as a front end to an Exascale system. Imagine your Exascale cluster with tens or hundreds of Petabytes of distributed memory and you’re using Message Passing or some other kind of API to run a large application. Since this system is going to generate massive amounts of data, it would be good to have a head node that could handle that data for your analysis work. You can’t use a PC any more in the Exascale world; you need something bigger.

insideHPC: So there is a lot of talk these days about Exascale in the next eight or ten years. Where do you see SGI playing a role in that space?

Dr. Lim Goh: I think our role in Exascale will be two-fold. The first fold will be to use this Big PC concept, with 16 Terabytes going to 64 Terabytes in 2012, and use it as the front end to an Exascale system. We would like the next generations of Altix UV to be the front end of every Exascale system that’s out there. Because if you are already spending tens of millions or hundreds of millions of dollars to build an Exascale system, it’s worth spending a little more so that you can get better use and be more productive with the output of that Exascale system.

Another role for SGI is developing the Exascale system itself. And this is where we are looking at providing a partitioned version of the Altix UV to be the key Exascale system. 

So let’s look at Exascale systems now: If you look at what the top research priorities are to achieve Exascale within this decade, you can see that in general those are power/cooling as number one; how do you get an Exaflop with 20 Megawatts? Number two would be resilience; can the Exascale system stay up long enough to at least do a checkpoint? (laughs) And on these two we are working very closely with microprocessor and accelerator vendors.

But the next two priorities are what we are focusing on ourselves: communications across the systems (essentially the interconnect) and usability. As I’ve described on the usability side, we will be looking at the Altix UV as a big head node. 

In the communications area, we believe the interconnect needs to be smarter for an Exascale system to work. Why? Because you cannot get away from global collectives, for example in an MPI program, unless you code specifically for Exascale applications to avoid them. Many of the applications that try to run on an Exascale system this large will have global collectives and will need to do massive communications in the course of running.

insideHPC: So how do you propose to reduce communications overhead in an Exascale system?

Dr. Lim Goh: We sat down and worked out that to cut down that overhead, we need a global address space. With this, memory in every node in the Exascale system is aware (through the node controller) of every other memory in the entire infrastructure, so that when you send a message, a synchronization, or GET PUT to do communications, you do it with very little overhead.

But I must emphasize, as many even well-informed HPC people misunderstand, that this global address space is not shared memory. This is the other part of Altix UV that has not been understood well. Let me therefore lay it out.

At the very highest level you have shared memory. In the next level down you have global address space and next level down you have distributed memory. Distributed memory is what we all know; each node doesn’t know about its neighboring nodes and what you have to do is send a message across. That’s why it’s called Message Passing.

Shared memory then is all the way up. Every node sees all the memory in every other node and hears all the chatter in every other node. Whether it needs it or not, it will see everything and hear everything. That’s why Linux or Windows can just come in and use the big node.

However, for all the goodness that hearing and seeing everything in a big shared memory brings you, it cannot scale to a billion threads. It’s just like if you were in a crowded room and tried to pay attention to all the chatter at once even though it is not meant for you. You would get highly distracted.

So if you go to the other extreme to a distributed memory, you see nothing and you hear nothing. The only way you can get a communication across is to send a message, or essentially write an email and send it to a neighbor. But that is too slow. Thus, shared memory is too inclusive and distributed memory is too exclusive for Exascale. 

So we decided that global address space is the best middle ground. In that analogy, global address space sees everything, but does not hear the neighbors chattering. All it wants to do is see everything so that it can do a GET PUT directly, do a SEND RECEIVE directly, or it can do a synchronization directly. So a hardware-supported, global address space is one way to get the communications overhead lowered in the Exascale world. And this is especially important when you’re talking about a billion threads. Imagine trying to do a global sum on a billion threads. I hope we can code around it, but my suspicion is that there will still be applications needing to do it.
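
As an aside to the interview, one way to get a feel for a global sum at that scale is to count communication rounds. The sketch below assumes a simple pairwise (binary tree) reduction, which is a common textbook scheme and not necessarily what SGI or any MPI library uses; the function names are hypothetical.

```python
from typing import Iterable, Tuple


def tree_reduction_rounds(participants: int) -> int:
    """Rounds needed to combine `participants` values pairwise."""
    if participants < 1:
        raise ValueError("need at least one participant")
    # Smallest r with 2**r >= participants, computed with exact integers.
    return (participants - 1).bit_length()


def tree_sum(values: Iterable[float]) -> Tuple[float, int]:
    """Sum a non-empty sequence pairwise; return (total, rounds used)."""
    current = list(values)
    if not current:
        raise ValueError("need at least one value")
    rounds = 0
    while len(current) > 1:
        # Each round, neighbours combine in pairs (an odd one carries over).
        current = [sum(current[i:i + 2]) for i in range(0, len(current), 2)]
        rounds += 1
    return current[0], rounds


if __name__ == "__main__":
    print(tree_sum(range(8)))              # small example on 8 values
    print(tree_reduction_rounds(10 ** 9))  # rounds for a billion threads
```

Even with this logarithmic scheme, a billion participants need 30 dependent rounds, so the fixed overhead paid on every message is multiplied along the critical path. That is the cost a low-overhead, hardware-supported GET PUT is meant to reduce.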

insideHPC: I can tell by your voice that you have great passion for this subject. It sounds like the next ten years are going to be very exciting for SGI.

Dr. Lim Goh: You are very right about that. Because we sit here looking at the world, saying that we need to go to Exascale. And at the same time people are realizing that OK, we first have to do R&D on power, cooling, and resiliency. Plus we need a smarter interconnect. Sure, SGI is there with the others working on the first set of problems, but we have been working hard on this communications problem for 10 or 15 years already. And now we have what we think is an ideal solution to the usability problem of an Exascale system with our big PC head node concept as well. So in summary we believe SGI can really contribute there.

The Rich Report is produced by Rich Brueckner at FlexRex Communications. You can follow Rich on Twitter.


About Rich
FlexRex began his life as a cartoon character I created at Sun Microsystems. As the world's first "fictional blogger," he appeared in numerous parody films that made fun of the whole work-from-home thing. Somewhere along the line, the Sun IT department adopted FlexRex as their spokesman in a half-dozen security awareness films for employees. So when I left Sun recently, I started FlexRex Communications, a marketing company in Portland, Oregon.
