While GPGPU programming models such as CUDA and OpenCL are well suited to exploiting data
parallelism, exploiting pipeline parallelism with them is difficult. Many workloads spend a large
portion of their runtime on I/O device access, serial CPU thread execution, and/or data transfer
over PCIe, so performance can be improved significantly if pipeline parallelism among these
components is properly leveraged. Unfortunately, current GPGPU programming models require
significant programmer effort to exploit this parallelism because of complex data dependencies.
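For illustration, even a simple two-stage overlap of PCIe transfer and kernel execution in plain CUDA forces the programmer to partition the input, create streams, and order every stage by hand. The sketch below is hypothetical (the kernel, chunking scheme, and buffer names are not from this work); it only shows the kind of explicit dependency management today's models require:

```cuda
// Hypothetical sketch: manually pipelining host-to-device copies with
// kernel execution using CUDA streams. The programmer must split the
// input into chunks and encode every stage dependency explicitly.
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {    // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void pipelined_run(float *h_in, float *d_buf, int n, int chunks) {
    // h_in must be pinned (cudaMallocHost) for the copies to be truly
    // asynchronous -- one more constraint the programmer must track.
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    int chunk = n / chunks;                      // assume n % chunks == 0
    for (int c = 0; c < chunks; ++c) {
        int off = c * chunk;
        cudaStream_t st = s[c % 2];              // alternate streams
        // Stage 1: async copy of this chunk; it can overlap with the
        // kernel issued to the other stream in the previous iteration.
        cudaMemcpyAsync(d_buf + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        // Stage 2: kernel for this chunk; it runs after its own copy
        // only because both were issued to the same stream.
        process<<<(chunk + 255) / 256, 256, 0, st>>>(d_buf + off, chunk);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```

Even this minimal pipeline needs pinned host memory, manual chunking, and careful stream assignment; once I/O device access or dependencies across chunks enter the picture, the bookkeeping grows quickly, which is precisely the burden this work aims to remove.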
In this work, we propose a framework that exploits implicit pipeline parallelism without requiring
the programmer to explicitly specify data dependencies. We propose a hardware-based dynamic
dependency tracking mechanism that overlaps the different stages of GPU-accelerated workloads to
reduce runtime. Moreover, our framework requires neither kernel modification nor complex dependency
tracking by the programmer. Our evaluation shows that the proposed framework significantly reduces
overall runtime, which includes not only kernel execution time but also I/O and data transfer time,
by up to 40% and by 24% on average.