CUDA Profiler: Calculating Memory and Compute Utilization


I am trying to establish two overall measures of the memory bandwidth utilization and the compute throughput utilization of a GPU-accelerated application, using the CUDA Nsight profiler on Ubuntu. The application runs on a Tesla K20c GPU.

The two metrics I want should be roughly comparable to the ones given in this figure: Figure 1

The problem is that no exact numbers are given there and, more importantly, I do not know how these percentages are calculated.

Memory Bandwidth Utilization

The profiler tells me that the maximum global memory bandwidth of my GPU is 208 GB/s.
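The 208 GB/s figure matches the usual theoretical-peak calculation from the board specifications. A minimal sketch (the 320-bit bus width and 5.2 GT/s effective GDDR5 data rate are assumptions taken from the published K20c spec sheet, not from the profiler output):

```python
# Sketch: theoretical peak memory bandwidth from board specifications.
# Assumed K20c specs: 320-bit memory bus, 5.2 GT/s effective GDDR5 data rate.

def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gt_s: float) -> float:
    """Peak BW = bus width in bytes * effective transfers per second."""
    return (bus_width_bits / 8) * data_rate_gt_s

print(peak_bandwidth_gb_s(320, 5.2))  # -> 208.0, matching the profiler
```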

(screenshot)

Does this refer to the device memory bandwidth or the global memory bandwidth? It says global, but the former would make more sense to me.

For my kernel the profiler tells me that the Device Memory Bandwidth is 98.069 GB/s.

(screenshot)

Assuming that the maximum of 208 GB/s refers to the device memory, could I then simply calculate the memory BW utilization as 98.069/208 ≈ 47%? Note that this kernel is executed multiple times without additional CPU-GPU data transfers. The system BW is therefore not important.
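Written out explicitly, that ratio is just the achieved bandwidth over the peak. A minimal sketch using the profiler numbers quoted above:

```python
# Memory bandwidth utilization = achieved device memory BW / peak BW.
achieved_gb_s = 98.069  # device memory bandwidth reported for the kernel
peak_gb_s = 208.0       # maximum bandwidth reported by the profiler

mem_utilization = achieved_gb_s / peak_gb_s
print(f"{mem_utilization:.0%}")  # -> 47%
```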

Compute Throughput Utilization

I am not exactly sure what the best way is to put Compute Throughput Utilization into a number. My best guess is to use the Instructions per Cycle to max Instructions per cycle ratio. The profiler tells me that the max IPC is 7 (see picture above).

First of all, what does that actually mean? Each multiprocessor has 192 cores and therefore a maximum of 6 active warps. Wouldn't that mean that the max IPC should be 6?

The profiler tells me that my kernel has issued IPC = 1.144 and executed IPC = 0.907. Should I calculate the compute utilization as 1.144/7 = 16% or 0.907/7 = 13% or none of these?
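For comparison, the candidate ratios from the question work out as follows (a sketch only; note that the profiler's own compute utilization figure is not necessarily derived from IPC):

```python
# Candidate compute-utilization ratios from the profiler's IPC numbers.
issued_ipc = 1.144    # issued IPC reported by the profiler
executed_ipc = 0.907  # executed IPC reported by the profiler
max_ipc = 7.0         # maximum IPC reported for the device

print(f"issued:   {issued_ipc / max_ipc:.0%}")    # -> 16%
print(f"executed: {executed_ipc / max_ipc:.0%}")  # -> 13%
```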

Are these two measurements (Memory and compute utilization) giving an adequate first impression of how efficiently my kernel is using the resources? Or are there other important metrics that should be included?

Additional Graph

(screenshot)

Answer (Greg Smith)

NOTE: I will try to update this answer with additional details in the future. I do not think all of the individual components of these calculations are easily visible in the Visual Profiler reports.

Compute Utilization

This is the pipeline utilization of the logical pipes: memory, control flow, and arithmetic. The SMs have numerous execution pipes that are not documented. If you look at the instruction throughput charts you can determine at a high level how to calculate utilization. You can read the Kepler or Maxwell architecture documents for more information on the pipelines. A CUDA core is a marketing term for an integer/single-precision floating-point math pipeline.

This calculation is not based upon IPC. It is based upon pipeline utilization and issue cycles. For example, you can be at 100% utilization if you issue 1 instruction/cycle (never dual-issuing). You can also be at 100% if you issue a double-precision instruction at the maximum rate (which depends on the GPU).
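A hypothetical sketch of the distinction (this is not the profiler's exact formula, which is not fully documented): a pipe's utilization compares the instructions issued to that pipe against the pipe's own peak issue rate, so a single pipe can be saturated while the overall IPC stays far below the device maximum.

```python
# Illustrative only: per-pipe utilization vs. overall IPC.
def pipe_utilization(instructions_issued, elapsed_cycles, peak_issue_per_cycle):
    """Fraction of the pipe's issue slots that were actually used."""
    return instructions_issued / (elapsed_cycles * peak_issue_per_cycle)

# One instruction issued to the pipe every cycle fully utilizes a
# 1-instruction/cycle pipe, even though overall IPC is only 1 (far below 7).
print(pipe_utilization(1000, 1000, peak_issue_per_cycle=1.0))  # -> 1.0
```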

Memory Bandwidth Utilization

The profiler calculates the utilization of L1, TEX, L2, and device memory, and displays the highest value. It is possible for the data-path utilization to be very high while the bandwidth utilization is very low.

A memory latency bound reason should also be calculated. It is easy for a program to be memory-latency bound without being limited by compute utilization or memory bandwidth.
