Intra-cluster coalescing and distributed-block scheduling to reduce GPU NoC pressure

(2019) IEEE TRANSACTIONS ON COMPUTERS. 68(7). p.1064-1076
Abstract
GPUs continue to boost the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this paper, we target redundant network traffic to mitigate GPU NoC congestion. In particular, we observe that in many GPU-compute applications, different SMs in a cluster access shared data. Sending redundant requests to access the same memory location wastes valuable NoC bandwidth: we find on average 19 percent (and up to 48 percent) of the requests to be redundant. To remove redundant NoC traffic, we propose distributed-block scheduling, intra-cluster coalescing (ICC) and the coalesced cache (CC) to coalesce L1 cache misses within and across SMs in a cluster, respectively. Our evaluation results show that distributed-block scheduling, ICC and CC are complementary and improve both performance and energy consumption. We report an average performance improvement of 15 percent (and up to 67 percent) while at the same time reducing system energy by 6 percent (and up to 19 percent) and improving the energy-delay product (EDP) by 19 percent on average (and up to 53 percent), compared to state-of-the-art distributed CTA scheduling.
Keywords
GPU, coalescing, CTA scheduling, inter-CTA locality, NoC pressure, CACHE

Downloads

  • TC.pdf
    • full text (Accepted manuscript) | open access | PDF | 1.20 MB
  • (...).pdf
    • full text (Published version) | UGent only | PDF | 2.18 MB

Citation

MLA
Wang, Lu, et al. “Intra-Cluster Coalescing and Distributed-Block Scheduling to Reduce GPU NoC Pressure.” IEEE TRANSACTIONS ON COMPUTERS, vol. 68, no. 7, 2019, pp. 1064–76, doi:10.1109/tc.2019.2895036.
APA
Wang, L., Zhao, X., Kaeli, D. R., Wang, Z., & Eeckhout, L. (2019). Intra-cluster coalescing and distributed-block scheduling to reduce GPU NoC pressure. IEEE TRANSACTIONS ON COMPUTERS, 68(7), 1064–1076. https://doi.org/10.1109/tc.2019.2895036
Chicago author-date
Wang, Lu, Xia Zhao, David R. Kaeli, Zhiying Wang, and Lieven Eeckhout. 2019. “Intra-Cluster Coalescing and Distributed-Block Scheduling to Reduce GPU NoC Pressure.” IEEE TRANSACTIONS ON COMPUTERS 68 (7): 1064–76. https://doi.org/10.1109/tc.2019.2895036.
Chicago author-date (all authors)
Wang, Lu, Xia Zhao, David R. Kaeli, Zhiying Wang, and Lieven Eeckhout. 2019. “Intra-Cluster Coalescing and Distributed-Block Scheduling to Reduce GPU NoC Pressure.” IEEE TRANSACTIONS ON COMPUTERS 68 (7): 1064–1076. doi:10.1109/tc.2019.2895036.
Vancouver
1. Wang L, Zhao X, Kaeli DR, Wang Z, Eeckhout L. Intra-cluster coalescing and distributed-block scheduling to reduce GPU NoC pressure. IEEE TRANSACTIONS ON COMPUTERS. 2019;68(7):1064–76.
IEEE
[1] L. Wang, X. Zhao, D. R. Kaeli, Z. Wang, and L. Eeckhout, “Intra-cluster coalescing and distributed-block scheduling to reduce GPU NoC pressure,” IEEE TRANSACTIONS ON COMPUTERS, vol. 68, no. 7, pp. 1064–1076, 2019.
@article{8616587,
  abstract     = {{GPUs continue to boost the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this paper, we target redundant network traffic to mitigate GPU NoC congestion. In particular, we observe that in many GPU-compute applications, different SMs in a cluster access shared data. Sending redundant requests to access the same memory location wastes valuable NoC bandwidth: we find on average 19 percent (and up to 48 percent) of the requests to be redundant. To remove redundant NoC traffic, we propose distributed-block scheduling, intra-cluster coalescing (ICC) and the coalesced cache (CC) to coalesce L1 cache misses within and across SMs in a cluster, respectively. Our evaluation results show that distributed-block scheduling, ICC and CC are complementary and improve both performance and energy consumption. We report an average performance improvement of 15 percent (and up to 67 percent) while at the same time reducing system energy by 6 percent (and up to 19 percent) and improving the energy-delay product (EDP) by 19 percent on average (and up to 53 percent), compared to state-of-the-art distributed CTA scheduling.}},
  author       = {{Wang, Lu and Zhao, Xia and Kaeli, David R. and Wang, Zhiying and Eeckhout, Lieven}},
  issn         = {{0018-9340}},
  journal      = {{IEEE TRANSACTIONS ON COMPUTERS}},
  keywords     = {{GPU,coalescing,CTA scheduling,inter-CTA locality,NoC pressure,CACHE}},
  language     = {{eng}},
  number       = {{7}},
  pages        = {{1064--1076}},
  title        = {{Intra-cluster coalescing and distributed-block scheduling to reduce GPU NoC pressure}},
  url          = {{http://dx.doi.org/10.1109/tc.2019.2895036}},
  volume       = {{68}},
  year         = {{2019}},
}
