Here’s a talk I just gave today to people from Sun, DARPA, Oracle, nVIDIA, and academia. Since the slides are mostly illustration and not necessarily self-explanatory, here’s a shortened version of the narrative [one paragraph per slide].
Hey there. I was invited to talk to you guys today about memory systems in general, and since this is a diverse group [it spanned everything from process technologists to applications developers], I figured I’d go high-level and tell you about what I see as the important open problems and some potential solutions.
Many people in the audience (including the web audience checking the slides) are perhaps unfamiliar with my group. We study the memory system exclusively and have for the past twelve years, so we’ve given the matter a bit of thought … here are the PhD theses that have come out of our group in that time. [they all investigate various aspects of the memory system]
About five years ago, one of my students, Ankush Varma, now at Intel, said the following. I thought it was pretty insightful, so I wrote it on the whiteboard in my office and still haven’t erased it. Probably won’t come off now … Anyway, what he meant was that technology trickles down from high performance to consumer-level devices, and today we see things like pipelines, caches, branch prediction, VLIW, and even superscalar in low-cost embedded processors. These technologies all started out as very high-end mechanisms. However, the flip side is that the issues we face trickle upwards. The embedded domain has had to deal with resource constraints (energy use, heat extraction, reliability, cost, physical size, etc.) since its inception, and it’s only recently that these issues have arisen in the high end.
This more or less says what I just said.
One of the primary issues facing us is the capacity problem. The bottom graph shows the improvement in DRAM signaling rates over time. The top graph shows the price we paid for these improvements. When SDRAM was introduced in the mid 1990s you could put 8 DIMMs in a memory channel. When DDR was introduced, it doubled the data rate, and you could put 4 DIMMs in a channel. It is hard to drive a wide bus fast, especially when it is multi-drop, and the easiest way to address the signal integrity issues is to reduce the number of drops on the bus. So 8 DIMMs went to 4. When DDR2 came out, it doubled the signaling rate again, and the number of DIMMs per channel was cut to 2. When DDR3 was first discussed at JEDEC they seriously considered reducing that to 1, yes 1, DIMM per channel. I think the screaming could be heard in outer space, so they are currently working on getting to the higher speed grades while still retaining 2 DIMMs per channel, which is a big reason why you haven’t seen DDR3-1600 appear yet. So the net result is that the channel capacity has remained relatively flat over a long period of time (the blue line).
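A rough way to see why the blue line stays flat: channel capacity is the number of DIMMs per channel times the capacity of each DIMM. The DIMM counts in this sketch come from the paragraph above. The assumption that per-DIMM capacity doubles with each generation is mine, purely for illustration, and the function and variable names are hypothetical.

```python
# DIMMs per channel by generation, as described in the text above.
DIMMS_PER_CHANNEL = {"SDRAM": 8, "DDR": 4, "DDR2": 2}


def channel_capacity(generations, base_dimm_units=1.0, growth=2.0):
    """Return {generation: relative channel capacity}.

    base_dimm_units and growth are illustrative assumptions, not measured
    data: each generation's DIMM is assumed to hold `growth` times as much
    as the previous generation's DIMM.
    """
    capacity = {}
    dimm_units = base_dimm_units
    for gen in generations:
        capacity[gen] = DIMMS_PER_CHANNEL[gen] * dimm_units
        dimm_units *= growth
    return capacity


caps = channel_capacity(["SDRAM", "DDR", "DDR2"])
for gen, units in caps.items():
    print(f"{gen:6s} {DIMMS_PER_CHANNEL[gen]} DIMMs/channel -> {units:.0f} units")
```

Under that assumption every generation lands on the same channel capacity: the density gain per DIMM is cancelled by the lost slots.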
The systems guys have been tearing their hair out over that … for instance, in my $2000 desktop system, I would like roughly $600 or $700 of that to go into DRAM. That would buy me several hundred GB of DRAM at today’s prices, yet there is not a desktop on the planet (that I know of) that will allow me to put several hundred GB of DRAM into it. I can afford it, but I can’t shove it into the box. Intel tried to solve the problem by introducing Fully Buffered DIMM, a system architecture that puts a high-speed ASIC onto each DIMM to form the channel. The memory controller talks to the first high-speed ASIC, and that one talks to the next one down the channel … a daisy chain. The links are narrow and fast (six times faster signaling rate than the DRAM devices), so you can put three times the number of channels into a system, given the same pin/trace count. Since the channels are daisy-chained (not multi-drop), you can have more DIMMs per channel (arbitrarily limited to 8). Result: nearly an order of magnitude more DRAM per system. Problem: the high-speed ASICs have SERDES that burn a few watts of power even if you’re not reading from or writing to the DIMM, and so system-level power went up an order of magnitude. RIP Fully Buffered DIMM.
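The capacity and power arithmetic can be sketched in a few lines. The factor of three in channel count and the 8 DIMMs per channel come from the text. The baseline of two channels with 2 DIMMs each and the figure of 3 W per buffer ASIC are my assumptions (the text only says "a few"), and all names are hypothetical.

```python
def dimm_slots(channels, dimms_per_channel):
    """Total DIMM slots in a system."""
    return channels * dimms_per_channel


def idle_buffer_power_w(slots, watts_per_buffer):
    """Power burned by the per-DIMM buffer ASICs even with no traffic."""
    return slots * watts_per_buffer


BASE_CHANNELS = 2          # assumption: a two-channel multi-drop baseline
BASE_DIMMS_PER_CHANNEL = 2 # assumption: DDR2-style limit taken as baseline
WATTS_PER_BUFFER = 3.0     # assumption: "a few watts" taken as 3 W

baseline = dimm_slots(BASE_CHANNELS, BASE_DIMMS_PER_CHANNEL)
fbdimm = dimm_slots(3 * BASE_CHANNELS, 8)  # 3x channels, 8 DIMMs per channel

capacity_gain = fbdimm / baseline
idle_power = idle_buffer_power_w(fbdimm, WATTS_PER_BUFFER)

print(f"DIMM slots: {baseline} -> {fbdimm} ({capacity_gain:.0f}x)")
print(f"Idle buffer power with every slot populated: {idle_power:.0f} W")
```

With these assumed inputs the slot count grows by roughly an order of magnitude, and so does a power term that the multi-drop baseline does not have at all. The exact figures depend entirely on the assumed baseline and per-ASIC power.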
So capacity is ultimately a bandwidth problem … if we could have cheap bandwidth and high-density packaging, we would not have to resort to narrow/fast (and therefore hot) interfaces to the DRAM to get capacity. At the same time, bandwidth is ultimately a power/heat problem, and we do need bandwidth. Numerous studies have shown the rule of thumb is that you need about a GB/s of bandwidth per core … and the industry is relentlessly moving toward increasing numbers of cores on chip. Open problem, must be solved.
Another problem: TLB reach doesn’t scale at all (translation lookaside buffer — it’s a little cache inside the CPU that holds translation information, entries from the page table). The way we implement virtual memory, the TLB has to have mapping information in it before we can access data. Even if we have the data in our cache, if the mapping info is not in the TLB, we can’t get at the data. Have to go to the page table first, load the TLB with the info, and then access the cache. Problem: last-level caches are typically on the order of 10MB, while TLBs typically map on the order of 1MB. This is bad and helps to explain why modern machines spend roughly 20% of their time (and thus power) servicing TLB misses.
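TLB reach is simply the number of entries times the page size, which makes the mismatch easy to quantify. The sketch below assumes 256 entries and 4 KiB pages; I chose those values to land on the "order of 1MB" figure above, and they do not describe any particular processor. The names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries, page_size_bytes):
    """Amount of memory the TLB can map at once."""
    return entries * page_size_bytes


# Assumed, illustrative parameters (not taken from a specific CPU).
TLB_ENTRIES = 256
PAGE_SIZE = 4 * KIB
LLC_SIZE = 10 * MIB  # last-level cache size, per the text

reach = tlb_reach_bytes(TLB_ENTRIES, PAGE_SIZE)
coverage = reach / LLC_SIZE

print(f"TLB reach: {reach // KIB} KiB")
print(f"Fraction of the last-level cache the TLB can map: {coverage:.0%}")
```

Under these assumptions the TLB maps only about a tenth of the last-level cache, so data can be sitting in cache while its translation is missing.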
A trend, since it will figure into solutions to our problems: flash is beating up on disk, and PCM is expected to beat up on flash when it arrives. Even if it gets a little hamstrung to reduce manufacturing costs or increase reliability, who cares — non-volatile solid-state memory is a great resource.
Some obvious conclusions based on what I’ve said so far. Well, obvious to me, and clearly speculative. First, we want significantly more main-memory capacity, and we don’t want to give up bandwidth. It’s clear that can only happen with a system redesign; we need a new memory architecture. Here’s something our group proposed a few years back. Nothing novel; it harkens back to designs from the 1970s and 1980s. The interesting thing is that I see things like this already in production and in research/development right now. It’s pretty clear something like this will make it mainstream soon (hierarchical memory control) … the only question is details: speeds, widths, concurrency, packaging, signaling technology, etc.
Second, non-volatile memory (both flash and PCM) has better scaling characteristics than DRAM, which seems to be nearing its end of life as far as scaling goes. Thus we can expect increasing capacity given the same form factors, over time, from NV memory. Conclusion: let’s make non-volatile memory a first-class citizen in the memory hierarchy, like cache & DRAM. Pictured is a stylized memory hierarchy including cache, DRAM, and disk.
… and here, we’ve split main memory into two pieces: DRAM and flash (or PCM or whatever). Let’s have DIMMs made of non-volatile memory, and an NV controller sitting next to the DRAM controller. Access the NV memory via a load/store interface, not via the operating system’s file-system interface (which imposes a huge and unnecessary overhead). NV is denser than DRAM, so it addresses the capacity problem. DRAM acts as a write buffer for NV, addressing the wear-leveling issue. It is also likely that the capacity requirement for DRAM would lessen, given the extra NV memory.
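To illustrate how a DRAM write buffer cuts the number of writes that reach NV memory, here is a toy model. It is a sketch under my own assumptions, not the design from the slides: the class name, the FIFO eviction policy, and the word-granularity addresses are all hypothetical.

```python
class WriteBuffer:
    """Toy DRAM-side buffer that coalesces stores before they reach NV memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.dirty = {}     # address -> latest value, held in "DRAM"
        self.nv = {}        # address -> value, stands in for NV memory
        self.nv_writes = 0  # number of writes that actually hit NV

    def store(self, addr, value):
        if addr in self.dirty:
            self.dirty[addr] = value  # overwrite in DRAM, no NV traffic
            return
        if len(self.dirty) >= self.capacity:
            self._evict_first_buffered()
        self.dirty[addr] = value

    def load(self, addr):
        # The buffered copy, if any, is newer than the NV copy.
        if addr in self.dirty:
            return self.dirty[addr]
        return self.nv.get(addr)

    def flush(self):
        for addr, value in self.dirty.items():
            self._write_nv(addr, value)
        self.dirty.clear()

    def _evict_first_buffered(self):
        # dicts keep insertion order, so this evicts the address that
        # entered the buffer earliest (FIFO, assumed for simplicity).
        addr = next(iter(self.dirty))
        self._write_nv(addr, self.dirty.pop(addr))

    def _write_nv(self, addr, value):
        self.nv[addr] = value
        self.nv_writes += 1


buf = WriteBuffer(capacity=4)
for i in range(1000):
    buf.store(i % 4, i)  # 1000 stores to only 4 hot addresses
buf.flush()
print(f"stores issued: 1000, writes reaching NV: {buf.nv_writes}")
```

In this deliberately favorable workload the hot addresses fit in the buffer, so a thousand stores collapse into four NV writes. Real traffic would see a smaller benefit that depends on locality and buffer size.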
Last conclusion. Let’s reduce the translation overhead of the TLB … this will require us to redesign the VM (virtual memory) part of the operating system and will require a rethink of the computer architecture as well. There are numerous facilities and ideas to revisit, now that technology parameters are significantly different than they were when these facilities were first investigated. In particular, I think SASOS (single address space operating systems) will make a lot of sense … and they will enable us to move the translation point (notionally, where the TLB sits) from right next to the L1 cache out to the main memory, perhaps even further. We should be putting the file system into the address space (memory-mapped files, persistent objects, etc.) as well. This is do-able for high-end systems and would trickle down to the commodity space if we can show it to be worthwhile.
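Memory-mapped files already give a small taste of "the file system in the address space" on today's operating systems. The sketch below uses Python's standard `mmap` module to update a file through ordinary slice assignment instead of explicit write calls. It illustrates only that existing facility, not a single address space design; the file name and the `demo` function are hypothetical.

```python
import mmap
import os
import tempfile


def demo():
    """Update a file through a memory mapping, then read it back normally."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "persistent.bin")  # hypothetical file name

        # A mapping cannot cover an empty file, so create one page of zeros.
        with open(path, "wb") as f:
            f.write(b"\x00" * mmap.PAGESIZE)

        # Map the file and update it with plain slice assignment.
        with open(path, "r+b") as f:
            with mmap.mmap(f.fileno(), mmap.PAGESIZE) as mem:
                mem[0:5] = b"hello"
                mem.flush()  # push the modified page back to the file

        # Read the file back through the usual file interface.
        with open(path, "rb") as f:
            return f.read(5)


result = demo()
print(result)
```

Once the mapping exists, the update itself involves no read or write system call, which is the property the load/store argument relies on.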
And, relative to the high-end … these things are effectively large embedded systems. Enterprise computers and supercomputers are effectively embedded systems (for all the reasons listed), so they should be treated as such. For instance, use the same optimization techniques, use low-power DSPs (they use Cray-style architectures, for one thing), etc.
… and that’s it for the meat. Hope you find this useful.