Recent advances in 3D integration technology and the high-bandwidth demands of modern processors have led to the development of 3D-stacked memory devices, such as the Hybrid Memory Cube (HMC), that improve DRAM bandwidth while reducing energy cost. One of the salient features of the HMC is the routing capability provided by its logic layer, which enables the creation of a memory network. Memory networks open new opportunities in system design by enabling efficient communication among the different processors in a system, which can also lead to improved programmability.
We first explore the design space of the system interconnect, which defines the connectivity of the multiple processors and memory devices in a system. We show the limitations of the conventional system interconnect design, which we classify as a processor-centric network (PCN), in flexibly utilizing processor bandwidth. By leveraging the routing capability of HMCs, we propose a memory-centric network (MCN), which enables full processor bandwidth utilization across different traffic patterns. However, the MCN introduces challenges, including higher processor-to-processor latency and the need to properly exploit path diversity. Thus, we propose a distributor-based network and a pass-through microarchitecture that reduce the network diameter and per-hop latency, while leveraging the path diversity within the memory network to provide high throughput under adversarial traffic patterns.
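The PCN/MCN distinction can be made concrete with a small model. The sketch below is purely illustrative and not from the dissertation: the link counts, bandwidths, and the static PCN link partitioning are assumed for the example. It shows how a PCN's statically dedicated links strand bandwidth under skewed traffic, while an MCN, where all links feed a routed memory network, can serve any destination mix at full processor bandwidth.

```python
# Illustrative sketch (parameters assumed, not from the dissertation):
# usable processor bandwidth in a PCN vs. an MCN as traffic skews.

LINKS_PER_PROC = 4   # hypothetical off-chip links per processor
LINK_BW = 10.0       # GB/s per link (assumed)

def pcn_usable_bw(frac_local):
    """PCN: links are statically split, e.g. half to local memory and
    half to the remote processor.  Each traffic class is capped by its
    dedicated links, so skewed demand leaves the other links idle."""
    local_bw = (LINKS_PER_PROC // 2) * LINK_BW
    remote_bw = (LINKS_PER_PROC // 2) * LINK_BW
    demand = LINKS_PER_PROC * LINK_BW  # processor can inject this much
    return (min(frac_local * demand, local_bw)
            + min((1 - frac_local) * demand, remote_bw))

def mcn_usable_bw(frac_local):
    """MCN: every link enters the routed memory network, so the usable
    bandwidth is independent of the local/remote traffic mix."""
    return LINKS_PER_PROC * LINK_BW

for f in (0.5, 0.9, 1.0):
    print(f"local fraction {f}: PCN {pcn_usable_bw(f):.0f} GB/s, "
          f"MCN {mcn_usable_bw(f):.0f} GB/s")
```

Under balanced traffic (`frac_local = 0.5`) both reach 40 GB/s in this toy model, but with fully local traffic the PCN drops to 20 GB/s while the MCN stays at 40 GB/s.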
Meanwhile, GPUs, which are commonly used to accelerate various workloads, employ the PCIe interface and can suffer from two major communication bottlenecks, accessing remote GPU memory and accessing host CPU memory, that lead to programmability challenges. This work leverages the memory network to simplify memory management and proposes scalable kernel execution (SKE), in which multiple GPUs are encapsulated as a single virtual GPU to improve programmability. In addition, we propose a unified memory network (UMN), which combines the CPU memory network and the GPU memory network to provide high bandwidth between the CPU and multiple GPUs while eliminating memory copy overhead. To meet the high bandwidth requirement of the GPU and the low latency requirement of the CPU, we propose a sliced flattened butterfly topology, which provides high network bandwidth at low cost, and an overlay network architecture to minimize CPU packet latency.
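The core idea behind SKE, presenting several physical GPUs as one virtual GPU, can be sketched as grid partitioning: a kernel's thread-block grid is divided among the physical GPUs, while the memory network lets each GPU reach the shared data without explicit copies. The function below is a hypothetical illustration of that splitting step only, not the dissertation's implementation.

```python
# Illustrative sketch (hypothetical, not the dissertation's runtime):
# splitting one virtual GPU's 1-D thread-block grid across physical GPUs.

def split_grid(num_blocks, num_gpus):
    """Partition num_blocks thread blocks into contiguous, nearly equal
    ranges, one per physical GPU.  Returns a list of (start, end) pairs."""
    base, rem = divmod(num_blocks, num_gpus)
    ranges, start = [], 0
    for g in range(num_gpus):
        count = base + (1 if g < rem else 0)  # spread the remainder
        ranges.append((start, start + count))
        start += count
    return ranges

print(split_grid(10, 4))   # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

In a real system the runtime would also have to handle multi-dimensional grids and inter-block data sharing; the memory network is what makes the shared-data part transparent to the programmer.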
The memory network and the logic layer of 3D-stacked memory devices, which can provide computational capability, also create the opportunity for near-data processing (NDP), which has the potential to address several obstacles facing modern computer systems, such as memory bandwidth and energy efficiency. Furthermore, standardizing the NDP interface can enable more pervasive use of NDP across a wide range of systems, leveraging economies of scale across the industry. To overcome the challenge of performing address translation in an architecture-neutral manner, so that NDP can access data distributed across multiple memory stacks, we propose a partitioned execution model, which removes the need for an architecture-specific MMU or TLB in the logic layer. In addition, instead of employing a data cache in the logic layer, we introduce NDP buffers to avoid the issue of cache coherence among the main processor and multiple memory stacks. As offloading too much computation to the NDP logic can degrade performance by making it a bottleneck, we also propose low-complexity, dynamic offload decision mechanisms that enable high speedup as well as energy reduction.
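One way a low-complexity dynamic offload decision can work is simple admission control: offload an operation to the NDP logic only while the NDP unit has headroom, and otherwise execute it on the host processor. The class below is an assumed illustration of that idea; the capacity threshold and the occupancy-counting scheme are hypothetical, not the dissertation's mechanism.

```python
# Illustrative sketch (hypothetical parameters): throttle NDP offloading
# so the logic layer does not become the bottleneck.

class OffloadDecider:
    def __init__(self, capacity=16):
        self.capacity = capacity   # max in-flight NDP operations (assumed)
        self.in_flight = 0         # operations currently offloaded

    def should_offload(self):
        # Offload only while the NDP unit has headroom; beyond the
        # threshold, the host processor runs the operation instead.
        return self.in_flight < self.capacity

    def issue(self):
        self.in_flight += 1        # an operation was sent to NDP logic

    def complete(self):
        self.in_flight -= 1        # an NDP operation finished
```

A hardware realization would need only a counter and a comparator per memory stack, which is what keeps the mechanism low-complexity.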