CUDA Profiler: Calculating Memory and Compute Utilization


I am trying to establish two overall measures of the memory bandwidth utilization and the compute throughput utilization of a GPU-accelerated application, using the CUDA Nsight profiler on Ubuntu. The application runs on a Tesla K20c GPU.

The two metrics I want should be roughly comparable to the ones given in this figure: Figure 1

The problem is that no exact numbers are given there and, more importantly, I do not know how these percentages are calculated.

Memory Bandwidth Utilization

The profiler tells me that the maximum global memory bandwidth of my GPU is 208 GB/s.
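The 208 GB/s figure matches the usual theoretical-peak calculation from the board specifications. A minimal sketch (the 320-bit bus width and 5.2 GT/s effective GDDR5 data rate are assumptions taken from the published K20c spec sheet, not from the profiler output):

```python
# Sketch: theoretical peak memory bandwidth from board specifications.
# Assumed K20c specs: 320-bit memory bus, 5.2 GT/s effective GDDR5 data rate.

def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gt_s: float) -> float:
    """Peak BW = bus width in bytes * effective transfers per second."""
    return (bus_width_bits / 8) * data_rate_gt_s

print(peak_bandwidth_gb_s(320, 5.2))  # -> 208.0, matching the profiler
```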

(screenshot)

Does this refer to the device memory bandwidth or the global memory bandwidth? It says global, but the former would make more sense to me.

For my kernel the profiler tells me that the Device Memory Bandwidth is 98.069 GB/s.

(screenshot)

Assuming that the maximum of 208 GB/s refers to the device memory, could I then simply calculate the memory BW utilization as 98.069/208 ≈ 47%? Note that this kernel is executed multiple times without additional CPU-GPU data transfers. The system BW is therefore not important.
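Written out explicitly, that ratio is just the achieved bandwidth over the peak. A minimal sketch using the profiler numbers quoted above:

```python
# Memory bandwidth utilization = achieved device memory BW / peak BW.
achieved_gb_s = 98.069  # device memory bandwidth reported for the kernel
peak_gb_s = 208.0       # maximum bandwidth reported by the profiler

mem_utilization = achieved_gb_s / peak_gb_s
print(f"{mem_utilization:.0%}")  # -> 47%
```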

Compute Throughput Utilization

I am not exactly sure what the best way is to put Compute Throughput Utilization into a number. My best guess is to use the Instructions per Cycle to max Instructions per cycle ratio. The profiler tells me that the max IPC is 7 (see picture above).

First of all, what does that actually mean? Each multiprocessor has 192 cores and therefore a maximum of 6 active warps. Wouldn't that mean that the max IPC should be 6?

The profiler tells me that my kernel has issued IPC = 1.144 and executed IPC = 0.907. Should I calculate the compute utilization as 1.144/7 = 16% or 0.907/7 = 13% or none of these?
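For comparison, the candidate ratios from the question work out as follows (a sketch only; note that the profiler's own compute utilization figure is not necessarily derived from IPC):

```python
# Candidate compute-utilization ratios from the profiler's IPC numbers.
issued_ipc = 1.144    # issued IPC reported by the profiler
executed_ipc = 0.907  # executed IPC reported by the profiler
max_ipc = 7.0         # maximum IPC reported for the device

print(f"issued:   {issued_ipc / max_ipc:.0%}")    # -> 16%
print(f"executed: {executed_ipc / max_ipc:.0%}")  # -> 13%
```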

Are these two measurements (Memory and compute utilization) giving an adequate first impression of how efficiently my kernel is using the resources? Or are there other important metrics that should be included?

Additional Graph

(screenshot)

Answer (Greg Smith)

NOTE: I will try to update this answer with additional details in the future. I do not think all of the individual components of these calculations are easily visible in the Visual Profiler reports.

Compute Utilization

This is the pipeline utilization of the logical pipes: memory, control flow, and arithmetic. The SMs have numerous execution pipes that are not documented. If you look at the instruction throughput charts you can determine at a high level how to calculate utilization. You can read the Kepler or Maxwell architecture documents for more information on the pipelines. A CUDA core is a marketing term for an integer/single-precision floating-point math pipeline.

This calculation is not based upon IPC. It is based upon pipeline utilization and issue cycles. For example, you can be at 100% utilization if you issue 1 instruction/cycle (never dual-issuing). You can also be at 100% if you issue a double-precision instruction at the maximum rate (which depends on the GPU).
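A hypothetical sketch of the distinction (this is not the profiler's exact formula, which is not fully documented): a pipe's utilization compares the instructions issued to that pipe against the pipe's own peak issue rate, so a single pipe can be saturated while the overall IPC stays far below the device maximum.

```python
# Illustrative only: per-pipe utilization vs. overall IPC.
def pipe_utilization(instructions_issued, elapsed_cycles, peak_issue_per_cycle):
    """Fraction of the pipe's issue slots that were actually used."""
    return instructions_issued / (elapsed_cycles * peak_issue_per_cycle)

# One instruction issued to the pipe every cycle fully utilizes a
# 1-instruction/cycle pipe, even though overall IPC is only 1 (far below 7).
print(pipe_utilization(1000, 1000, peak_issue_per_cycle=1.0))  # -> 1.0
```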

Memory Bandwidth Utilization

The profiler calculates the utilization of L1, TEX, L2, and device memory, and displays the highest value. It is possible for the data-path utilization to be very high while the bandwidth utilization is very low.

A memory latency bound reason should also be calculated. It is easy for a program to be memory-latency bound without being limited by compute utilization or memory bandwidth.
