RuntimeError: module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found one of them on device: cuda:2

Posted by vpap in Dev

vpap

I have 4 GPUs (0, 1, 2, 3). I want to run one Jupyter notebook on GPU 2 and another on GPU 0. After executing

 export CUDA_VISIBLE_DEVICES=0,1,2,3

for the GPU 2 notebook I run

device = torch.device( f'cuda:{2}' if torch.cuda.is_available() else 'cpu')
device, torch.cuda.device_count(), torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.get_device_properties(1)

After creating a new model or loading one, I do

model = nn.DataParallel( model, device_ids = [ 0, 1, 2, 3])
model = model.to( device)

Then, when I start training the model, I get

RuntimeError                              Traceback (most recent call last)
<ipython-input-18-849ffcb53e16> in <module>
 46             with torch.set_grad_enabled( phase == 'train'):
 47                 # [N, Nclass, H, W]
 ---> 48                 prediction = model(X)
 49                 # print( prediction.shape, y.shape)
 50                 loss_matrix = criterion( prediction, y)

~/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491             result = self._slow_forward(*input, **kwargs)
492         else:
--> 493             result = self.forward(*input, **kwargs)
494         for hook in self._forward_hooks.values():
495             hook_result = hook(self, input, result)

~/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
144                 raise RuntimeError("module must have its parameters and buffers "
145                                    "on device {} (device_ids[0]) but found one of "
--> 146                                    "them on device: {}".format(self.src_device_obj, t.device))
147 
148         inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2

乔达格

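DataParallel requires every input tensor to be provided on the first device in its device_ids list and, as the error message says, the module's parameters and buffers must be on that device too.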

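It basically uses that device as a staging area before scattering to the other GPUs, and it is also the device on which the final outputs are gathered before forward returns. If you want device 2 to be the primary device, you just need to put it at the front of the list, as follows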

model = nn.DataParallel(model, device_ids = [2, 0, 1, 3])
model.to(f'cuda:{model.device_ids[0]}')

After that, all tensors provided to the model should be on the first device as well.

x = ... # input tensor
x = x.to(f'cuda:{model.device_ids[0]}')
y = model(x)
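
Putting the steps above together, here is a minimal sketch rather than a definitive implementation. The helper names primary_first, wrap_for_primary and param_devices are hypothetical (they are not part of PyTorch), and the toy nn.Linear model only stands in for the real network. The sketch assumes you want GPU 2 as the primary device and falls back to the CPU when the requested GPUs are not present, so it can also be run on a machine without CUDA.

```python
import torch
import torch.nn as nn


def primary_first(device_ids, primary):
    """Return device_ids reordered so that `primary` comes first."""
    if primary not in device_ids:
        raise ValueError(f"primary device {primary} is not in {device_ids}")
    return [primary] + [d for d in device_ids if d != primary]


def param_devices(model):
    """Collect the devices that hold the model's parameters and buffers."""
    tensors = list(model.parameters()) + list(model.buffers())
    return {t.device for t in tensors}


def wrap_for_primary(model, primary, device_ids):
    """Wrap `model` in nn.DataParallel with `primary` as device_ids[0].

    Assumption for this sketch: if the requested GPUs are not all
    available, return the plain model on the CPU instead of failing.
    """
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0 or max(device_ids) >= n_gpus:
        cpu = torch.device("cpu")
        return model.to(cpu), cpu
    ordered = primary_first(device_ids, primary)
    device = torch.device(f"cuda:{ordered[0]}")
    # Parameters and buffers must live on device_ids[0] before forward runs.
    model = nn.DataParallel(model.to(device), device_ids=ordered)
    return model, device


net = nn.Linear(8, 2)  # toy stand-in for the real model
net, device = wrap_for_primary(net, primary=2, device_ids=[0, 1, 2, 3])

x = torch.randn(4, 8).to(device)  # inputs go to the primary device as well
y = net(x)
print(device, param_devices(net), tuple(y.shape))
```

If param_devices reports anything other than the primary device before the first forward pass, the model was moved to a different GPU than device_ids[0], which is the situation the RuntimeError in the question describes.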


Edited on 2021-01-22
