Over the last decade, hardware prefetching has become an essential technique for improving performance in high-performance processors. By fetching data that is likely to be used ahead of demand, at the cost of additional memory bandwidth, hardware prefetching hides long external memory latency and thereby yields performance benefits. Because it consumes extra bandwidth, prefetching is often regarded as a technique that converts bandwidth into performance.
With the advent of multi-core processors, on-chip networks have become fundamental shared resources for communication between cores. However, the distinct characteristics of prefetch traffic have not been considered in on-chip network design, while prefetchers have remained oblivious to network congestion. Alongside on-chip networks, next-generation memories such as 3D-stacked and non-volatile memories have also emerged for high-performance computing; these, too, are shared, performance-critical resources that influence, and are affected by, prefetching techniques. Nevertheless, existing hardware prefetchers do not account for this variety of system memories and are tuned for only a few fixed configurations. That is, existing prefetching techniques do not accurately account for the bandwidth provided by emerging architectures.
In this dissertation, we strive to design a combined bandwidth-aware prefetcher framework that maximizes the utilization of the bandwidth provided by each emerging architecture for better performance. First, toward a mutually-aware design of prefetching and on-chip networks, we investigate the interactions between prefetchers and on-chip networks, exploiting the synergy of these two components in multi-cores. Considering the difference between prefetch and non-prefetch packets, we propose a priority-based router design that arbitrates in favor of non-prefetch packets over prefetch packets. In addition, we propose a prefetch control mechanism that is sensitive to network congestion. Second, for prefetcher design on diverse memory architectures, we explore how the memory bandwidth available to prefetchers affects performance, and how prefetcher aggressiveness should be tuned for each memory architecture, as well as for application behavior, to maximize system performance. Based on these observations, we propose a new memory-oblivious prefetcher framework that dynamically adjusts prefetch aggressiveness across various memory architectures. We further study the effectiveness of such automatic tuning in hybrid memory systems, and we mitigate the cache pollution exacerbated by the increased speculative data brought in by more aggressive prefetching. Combining these mechanisms, we finally provide an integrated bandwidth-aware prefetcher framework that comprehensively accounts for the various bandwidths offered by emerging architectures.
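To make the congestion-sensitive control idea concrete, the sketch below models a feedback controller that raises or lowers the prefetch degree each sampling epoch based on observed bandwidth utilization and prefetch accuracy. This is a minimal illustrative model, not the dissertation's actual hardware design; the class name, thresholds, and degree limits are all hypothetical tuning knobs chosen for exposition.

```python
# Illustrative sketch (hypothetical, not the dissertation's hardware design):
# a feedback controller that adjusts prefetch aggressiveness from two
# runtime signals measured over a sampling epoch.

class BandwidthAwarePrefetchController:
    """Adjusts the prefetch degree using bandwidth utilization
    (fraction of peak bandwidth in use) and prefetch accuracy
    (useful prefetches / issued prefetches)."""

    def __init__(self, min_degree=1, max_degree=8,
                 high_util=0.9, low_util=0.5, min_accuracy=0.4):
        self.degree = min_degree
        self.min_degree = min_degree
        self.max_degree = max_degree
        self.high_util = high_util        # back off above this utilization
        self.low_util = low_util          # ramp up below this utilization
        self.min_accuracy = min_accuracy  # back off below this accuracy

    def update(self, bw_utilization, accuracy):
        """Called once per sampling epoch with measured statistics;
        returns the prefetch degree for the next epoch."""
        if bw_utilization > self.high_util or accuracy < self.min_accuracy:
            # Memory system is saturated or prefetches are mostly wasted:
            # throttle to return bandwidth to demand requests.
            self.degree = max(self.min_degree, self.degree - 1)
        elif bw_utilization < self.low_util:
            # Headroom is available: spend spare bandwidth on deeper
            # prefetching.
            self.degree = min(self.max_degree, self.degree + 1)
        return self.degree


# Example: plentiful bandwidth ramps the degree up; saturation throttles it.
ctrl = BandwidthAwarePrefetchController()
ctrl.update(0.3, 0.8)   # low utilization, accurate -> degree increases
ctrl.update(0.95, 0.8)  # saturated link -> degree decreases
```

The same structure generalizes across memory architectures: because the controller reacts only to measured utilization rather than a fixed notion of peak bandwidth, it needs no per-memory tuning, which is the intuition behind the memory-oblivious framework described above.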