Valkyrie: Leveraging inter-TLB locality to enhance GPU performance

Programming on a GPU has been made considerably easier with the introduction of Virtual Memory features, which support common pointer-based semantics between the CPU and the GPU. However, supporting virtual memory on a GPU comes with some additional costs and overhead, with the largest being from the support for address translation. The fact that a massive number of threads run concurrently on a GPU means that the translation lookaside buffers (TLBs) are oversubscribed most of the time. Our investigation into a diverse set of GPU workloads shows that TLB misses can be extremely high (up to 99%), which inevitably leads to significant performance degradation due to long-latency page-table walks. Our profiling of TLB-sensitive workloads reveals a high degree of page sharing across the different cores of a GPU. In many applications, a page can be accessed in temporal proximity by multiple cores, following similar memory access patterns. To support the inherent sharing present in GPU workloads, we propose Valkyrie, an integrated cooperative TLB prefetching mechanism and an inter-L1-TLB probing scheme that can efficiently reduce TLB bottlenecks in GPUs. Our evaluation using a diverse set of GPU workloads reveals that Valkyrie is able to achieve an average speedup of 1.95×, while adding modest hardware overhead.
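The general idea behind inter-L1-TLB probing can be illustrated with a toy software model. The sketch below is not the paper's hardware design: the `L1TLB` class, the `translate` function, the LRU replacement policy, the probe order, and the `stats` counters are all hypothetical names and choices made for illustration. It only shows why page sharing across cores lets a miss in one core's L1 TLB be served from a peer's L1 TLB instead of triggering a page-table walk.

```python
from collections import OrderedDict


class L1TLB:
    """Toy per-core TLB: virtual page number -> physical frame number, LRU-evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, vpn):
        # A local hit refreshes the entry's LRU position.
        if vpn in self.entries:
            self.entries.move_to_end(vpn)
            return self.entries[vpn]
        return None

    def insert(self, vpn, pfn):
        self.entries[vpn] = pfn
        self.entries.move_to_end(vpn)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def translate(core, vpn, tlbs, page_table, stats):
    """Resolve vpn for `core`: local L1 TLB, then peer L1 TLBs, then a page walk."""
    pfn = tlbs[core].lookup(vpn)
    if pfn is not None:
        stats["local_hit"] += 1
        return pfn

    # Assumed policy: probe every other core's L1 TLB in index order,
    # without disturbing the peer's LRU state.
    for peer, tlb in enumerate(tlbs):
        if peer == core:
            continue
        pfn = tlb.entries.get(vpn)
        if pfn is not None:
            stats["peer_hit"] += 1
            tlbs[core].insert(vpn, pfn)
            return pfn

    # Fall back to the (here, dictionary-based) page table: the long-latency path.
    stats["page_walk"] += 1
    pfn = page_table[vpn]
    tlbs[core].insert(vpn, pfn)
    return pfn
```

In this model, when two cores touch the same page in close succession, only the first access pays for a page walk; the second is satisfied by probing the first core's TLB.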
Publisher
Institute of Electrical and Electronics Engineers Inc.
Issue Date
2020-10-06
Language
English
Citation

2020 ACM International Conference on Parallel Architectures and Compilation Techniques, PACT 2020, pp.456 - 466

DOI
10.1145/3410463.3414639
URI
http://hdl.handle.net/10203/289868
Appears in Collection
EE-Conference Papers (academic conference papers)