The increasing number of cores in graphics processing units (GPUs), together with the bandwidth requirements of these cores, has placed growing demand on memory bandwidth. Memory controllers in these systems often employ out-of-order scheduling to maximize row access locality, but out-of-order scheduling requires complex logic. To provide low-cost, low-complexity memory scheduling, we propose source-based memory scheduling, in which memory accesses are scheduled at the point of injection from the shader cores. We propose two complementary techniques: DRAM-aware source throttling and superpackets. For highly parallel, non-graphics applications, memory access requests from shader cores have been shown to exhibit significant row locality, but this locality is destroyed in the on-chip network. We show how requests can be grouped together into a single superpacket prior to injection to maintain row locality without increasing the complexity of any component, such as the memory scheduler or the on-chip network. We use a local, distributed throttling mechanism to achieve DRAM-aware memory scheduling and to reduce the network congestion caused by full memory controller queues. By combining superpackets and DRAM-aware source throttling, the performance across a wide range of applications is within 93% of that of the complex FR-FCFS scheduler on average, while exceeding the performance of a previously proposed on-chip network modification by 13%, at significantly lower cost and complexity.
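To illustrate the superpacket idea, the following minimal Python sketch groups pending requests that map to the same DRAM row before they would be injected into the on-chip network. It is only a software analogy, not the hardware mechanism proposed here: the 2 KiB row size behind `ROW_SHIFT`, the `max_requests` cap, and the names `SuperPacket` and `build_superpackets` are all assumptions made for illustration, and source throttling is not modeled.

```python
from collections import OrderedDict
from dataclasses import dataclass, field

# Assumption: a 2 KiB DRAM row, so (address >> 11) identifies the row.
# Real address mappings also involve channel and bank bits.
ROW_SHIFT = 11


@dataclass
class SuperPacket:
    """Hypothetical container for requests that target the same DRAM row."""
    row: int
    addresses: list = field(default_factory=list)


def build_superpackets(addresses, max_requests=4):
    """Group request addresses by DRAM row, preserving arrival order.

    A packet is emitted as soon as it holds `max_requests` requests;
    partially filled packets are emitted at the end in first-seen order.
    """
    open_packets = OrderedDict()  # row id -> packet still accepting requests
    ready = []
    for addr in addresses:
        row = addr >> ROW_SHIFT
        pkt = open_packets.get(row)
        if pkt is None:
            pkt = SuperPacket(row)
            open_packets[row] = pkt
        pkt.addresses.append(addr)
        if len(pkt.addresses) == max_requests:
            ready.append(open_packets.pop(row))
    ready.extend(open_packets.values())
    return ready


if __name__ == "__main__":
    # Interleaved requests to two rows, as two access streams might produce.
    pending = [0x0000, 0x1000, 0x0040, 0x1040, 0x0080]
    for pkt in build_superpackets(pending):
        print(pkt.row, [hex(a) for a in pkt.addresses])
```

Because each packet carries only same-row requests, a simple in-order memory controller receiving these packets would see the row hits back to back, which is the property the superpacket technique aims to preserve across the network.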