For decades, the growing speed gap between processors and main memory has caused inefficiency in computing systems. To prevent the resulting performance loss, many architectural techniques such as caching, out-of-order execution, and prefetching have been employed. These techniques served well for a long time. However, with the advent of the multi-core era, their effectiveness has decreased, and multi-core systems now suffer serious inefficiency in memory accesses. This is because programs co-running on a processor can break memory-access locality, which is the fundamental rationale of the traditional techniques. For this reason, the main memory system itself has become crucial to multi-core computing: the inefficiency it causes can no longer be covered by other techniques. Hence, this dissertation focuses on improving main memory systems.
A main memory system consists of memory controllers and DRAMs. A processor sends memory requests to a memory controller, which creates an appropriate command sequence. The generated commands are issued to the DRAMs sequentially, and the DRAMs operate according to the commands and finally transfer the requested data to the processor. The performance of a main memory system usually depends on the command sequence and the data transfer rate. Accordingly, main memory systems have been improved by refining the command scheduling method and raising the interface bandwidth. However, the effectiveness of these conventional methodologies has declined because of the long DRAM access latency.
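The request-to-command translation described above can be sketched as follows. This is a minimal illustration, not the controller studied in this dissertation: it assumes an open-page policy and the conventional DRAM command names `ACT` (activate a row), `PRE` (precharge a bank), `RD`, and `WR`, none of which are named in the text. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Request:
    bank: int
    row: int
    is_write: bool = False


@dataclass
class OpenPageController:
    """Hypothetical controller that keeps a row open after each access."""

    # open_rows[bank] is the row currently held in that bank's row buffer.
    open_rows: Dict[int, Optional[int]] = field(default_factory=dict)

    def commands_for(self, req: Request) -> List[Tuple[str, int, int]]:
        """Return the (command, bank, row) sequence needed to serve req."""
        cmds: List[Tuple[str, int, int]] = []
        open_row = self.open_rows.get(req.bank)
        if open_row is not None and open_row != req.row:
            # Row conflict: close the currently open row first.
            cmds.append(("PRE", req.bank, open_row))
        if open_row != req.row:
            # Row is not open yet: activate it.
            cmds.append(("ACT", req.bank, req.row))
            self.open_rows[req.bank] = req.row
        # Row hit (or freshly activated row): issue the column access.
        cmds.append(("WR" if req.is_write else "RD", req.bank, req.row))
        return cmds
```

Under these assumptions, a request to an already open row needs a single command, while a request to a different row in the same bank needs three. This difference is one reason the command sequence affects performance.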
Ideally, DRAM latency can be hidden by command scheduling, but in practice it is impossible to hide all DRAM latency because of the diversity of programs and the complexity of systems. When schedulers fail to hide DRAM access latency, the effective bandwidth decreases, resulting in performance degradation. Moreover, over the past 20 years DRAM latency has decreased by only 20\% while bandwidth has grown 20 times, which implies that DRAM latency has become more and more critical over time. In fact, current main memory systems suffer severely from long DRAM latency. The slow progress in DRAM latency is mainly due to DRAM manufacturing cost: reducing latency raises the manufacturing cost, and this cost burden makes DRAM vendors hesitant to provide low latency. We therefore set our goal as building a new main memory system that achieves low latency without incurring a cost impact on DRAM implementation.
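As a rough worked illustration of the figures above, let $L$ denote the access latency and $T$ the time needed to transfer a fixed-size block. Both symbols are introduced here only for this sketch. If latency falls by 20\% and bandwidth grows 20 times, the latency measured in units of transfer time changes by

\[
\frac{L_{\mathrm{new}} / T_{\mathrm{new}}}{L_{\mathrm{old}} / T_{\mathrm{old}}}
= \frac{0.8\, L_{\mathrm{old}}}{T_{\mathrm{old}} / 20} \cdot \frac{T_{\mathrm{old}}}{L_{\mathrm{old}}}
= 0.8 \times 20 = 16 .
\]

Under these simplifying assumptions, one access whose latency is not hidden costs roughly 16 times as many transfer slots as it did 20 years earlier.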
In general, there are two approaches to relieving the impact of long DRAM latency: 1) reducing DRAM latency itself and 2) increasing DRAM access parallelism. For each approach, we devised several ideas, which are briefly explained below.
1) Reducing DRAM latency itself. DRAM latency is determined by the speed of DRAM-internal circuitry. The core of this circuitry is the sense amplifier, whose role is to distinguish between '0' and '1' by sensing the charge stored in a cell. We observed that the sensing speed depends on the amount of charge stored in the cell, and that this amount changes periodically with refreshes. As a result, the sensing speed varies periodically according to the refresh timing. We exploit this observation to design a non-uniform access time memory controller (NUAT).
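The idea of tying access timing to refresh timing can be illustrated with a small sketch. This is not the NUAT design itself. It assumes a 64 ms refresh window, uses the activate-to-read delay (commonly called tRCD) as the timing parameter being varied, and the bin boundaries and cycle counts are hypothetical values chosen only for illustration.

```python
# Assumed refresh window: every row is refreshed once per 64 ms.
REFRESH_WINDOW_MS = 64.0

# Hypothetical bins: (upper bound on elapsed fraction of the window,
# activate-to-read delay in cycles). A recently refreshed cell holds more
# charge, so this sketch assigns it a shorter delay.
TRCD_BINS = [(0.25, 8), (0.50, 9), (0.75, 10), (1.00, 11)]


def elapsed_since_refresh(now_ms: float, last_refresh_ms: float) -> float:
    """Time since the row's most recent periodic refresh, in milliseconds."""
    return (now_ms - last_refresh_ms) % REFRESH_WINDOW_MS


def select_trcd(now_ms: float, last_refresh_ms: float) -> int:
    """Pick an activate-to-read delay from the elapsed time since refresh."""
    frac = elapsed_since_refresh(now_ms, last_refresh_ms) / REFRESH_WINDOW_MS
    for bound, trcd in TRCD_BINS:
        if frac < bound:
            return trcd
    # Not reachable while frac < 1.0; kept as a conservative fallback.
    return TRCD_BINS[-1][1]
```

In this sketch the controller only needs to know when each row was last refreshed, which it already determines because it issues the refresh commands.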
2) Increasing DRAM access parallelism. DRAM is organized hierarchically: Channel - Rank - BankGroup - Bank. A channel is connected to multiple ranks, each rank has multiple bank groups, and each bank group has multiple banks. This hierarchical structure facilitates parallelism in DRAM. However, the current DRAM architecture supports only bank-level parallelism. We leverage the other levels of the hierarchy to create new forms of parallelism in DRAM. In particular, we propose bank-group-level parallelism (BGLP) and rank-level parallelism (RLP).
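To make the hierarchy concrete, the sketch below decodes a block index into channel, rank, bank-group, and bank fields. The field widths and their ordering are hypothetical and chosen only for illustration. Real address mappings differ between systems, and this is not the mapping used in the proposed work.

```python
from typing import NamedTuple


class DramAddress(NamedTuple):
    channel: int
    rank: int
    bank_group: int
    bank: int


# Hypothetical field widths in bits, listed from the lowest-order field up.
# Placing channel and bank-group bits lowest spreads consecutive blocks
# across independent resources.
FIELD_BITS = (("channel", 1), ("bank_group", 2), ("bank", 2), ("rank", 1))


def decode(block_index: int) -> DramAddress:
    """Split a block index into the fields of the DRAM hierarchy."""
    fields = {}
    for name, bits in FIELD_BITS:
        fields[name] = block_index & ((1 << bits) - 1)
        block_index >>= bits
    return DramAddress(**fields)
```

With this assumed mapping, `decode(0)` and `decode(1)` fall in different channels, and `decode(0)` and `decode(2)` fall in different bank groups of the same channel, so such accesses target distinct levels of the hierarchy.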
To sum up, we exploit circuits and structures that are already implemented in DRAM. For this reason, the proposed architectural techniques are cost-effective and dovetail with existing implementation practices, which is a great advantage for DRAM. In this dissertation, we also demonstrate the quantitative benefit of our proposals through cycle-accurate system simulations to establish their architectural usefulness. In addition, we present further analyses to argue that the proposed work also has merit in terms of other factors such as energy and compatibility.