Distributed TensorFlow: InternalError - Blas GEMM launch failed

Shakeel anjum

I am experimenting with distributed TensorFlow, starting with two processes on localhost (Windows 10, Python 3.6.6, TensorFlow 1.8.0). Each process runs a replica of a simple neural network (one hidden layer), modeled on a subset of the UrbanSounds dataset (5,268 samples per model, each with 193 features).

Following this well-written post: https://learningtensorflow.com/lesson11/ I was able to reproduce their basic example, which computes a mean from the results of two different processes. For my dataset, I modified the code (shown further below) so that the total samples are split into two halves and two different processes each compute the cost function on their half. However, after the RPC server starts successfully, both processes eventually fail with the InternalError shown below.
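For context, the baseline pattern from that lesson looks roughly like this; a condensed sketch with illustrative ports, filename, and values, not the lesson's exact code:

import sys
import tensorflow as tf

# two tasks of one "local" job, each started in its own shell:
#   python baseline.py 0    and    python baseline.py 1
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
task_number = int(sys.argv[1])
server = tf.train.Server(cluster, job_name="local", task_index=task_number)

if task_number == 0:
    # task 0 builds the graph; the op is pinned to task 1's device
    with tf.device("/job:local/task:1"):
        x = tf.constant([1.0, 2.0, 3.0, 4.0])
        mean_op = tf.reduce_mean(x)
    with tf.Session(server.target) as sess:
        print(sess.run(mean_op))   # computed on task 1, printed by task 0
else:
    # task 1 just serves its device to the cluster
    server.join()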

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

It looks to me like there is some basic mistake either in the neural network configuration or in how the dataset is prepared for feed_dict, but I cannot see it, so I need another pair of eyes. Another observation from this experiment is that the GPU mostly maxes out and the code aborts. Could you please help me find whatever is wrong in my code or in my strategy for distributing TensorFlow?
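For reference, this is how I am currently capping GPU memory per process; a minimal excerpt of the relevant part (the full code is below, and allow_growth=True would be the other common TF 1.x way to keep TensorFlow from pre-allocating the whole card):

import tensorflow as tf

# cap each process at ~1/3 of the GPU so two replicas can share one card
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
# alternative: let the allocation grow on demand instead
# gpu_options = tf.GPUOptions(allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

# note: the config only applies to sessions it is actually passed to
session = tf.Session("grpc://localhost:2001", config=config)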

Thanks.

### ERROR TRACE (duplicate rows removed ...) ###
train_data, train_labels (528, 193) (528, 10)
test_data, test_labels (22, 193) (22, 10)
2018-08-27 14:35:29.096572: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-08-27 14:35:29.330127: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.63GiB
...
2018-08-27 14:35:33.982347: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
Traceback (most recent call last):
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
2018-08-27 14:35:33.989312: W T:\src\github\tensorflow\tensorflow\stream_executor\stream.cc:2001] attempting to perform BLAS operation using StreamExecutor without BLAS support
    return fn(*args)
...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

Caused by op 'MatMul', defined at:
  File "tf_dis_audio_test.py", line 78, in <module>
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
...
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]
### CODE SAMPLE ###
# imports used by this snippet (dataset loading is omitted, as in the original post)
import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

datetime_format = "%Y-%m-%d %H:%M:%S.%f"  # assumed; the original snippet does not show this definition

# selected UrbanSounds dataset
print("train_data, train_labels", train_data.shape, train_labels.shape)
print("test_data, test_labels", test_data.shape, test_labels.shape)

# neural network configurations
cost = 0.0
n_tasks = 2
n_epochs = 10
n_classes = 10
n_features = 193
n_hidden_1 = 200
learning_rate = 0.1
sd = 1/np.sqrt(n_features)
cost_history = np.empty(shape=[1], dtype=float)

# task#0 is set as rpc host process
rpc_server = "grpc://localhost:2001"

# run two separate python shells, each with its task number (0,1), as:
#>python this_script.py  0
#>python this_script.py  1
task_number = int(sys.argv[1])

# cluster specs with two localhosts on different ports (2001, 2002)
cluster = tf.train.ClusterSpec({"local": ["localhost:2001", "localhost:2002"]})
server = tf.train.Server(cluster, job_name="local", task_index=task_number)
server.start()

graph = tf.Graph()
with graph.as_default():    
    X = tf.placeholder(tf.float32, [None, n_features])
    Y = tf.placeholder(tf.float32, [None, n_classes])

    w1 = tf.Variable(tf.random_normal([n_features, n_hidden_1], mean = 0, stddev=sd), name="w1")
    b1 = tf.Variable(tf.random_normal([n_hidden_1], mean=0, stddev=sd), name="b1")
    w2 = tf.Variable(tf.random_normal([n_hidden_1, n_classes], mean = 0, stddev=sd), name="w2")
    b2 = tf.Variable(tf.random_normal([n_classes], mean=0, stddev=sd), name="b2")
    
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
    _y = tf.nn.softmax(tf.matmul(z, w2) + b2)
    
    cost_function = tf.reduce_mean(tf.square(Y - _y))
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)
    prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(_y, 1))
    accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32)) * 100.0
    print("#2: {}".format(datetime.utcnow().strftime(datetime_format)[:-3]))

# hack to fix the GPU out-of-memory issue,
# but it does not do any good, the GPU still maxes out :(
gpuops = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpuops)

with tf.Session(rpc_server, graph=graph, config=config) as ss:
    # setting up the session with RPC host
    ss = tf.Session(rpc_server)
    ss.run(tf.global_variables_initializer())

    for epoch in range(n_epochs):
        batch_size = int(len(train_labels) / n_tasks)

        # run session for task#0
        # note: [:batch_size-1] keeps 263 of this half's 264 rows,
        # which matches a.shape=(263, 193) in the error above
        if (task_number == 0):
            _, cost = ss.run([train_step, cost_function], feed_dict={X:train_data[:batch_size-1], Y:train_labels[:batch_size-1]})

        # run session for task#1
        elif (task_number == 1):
            _, cost = ss.run([train_step, cost_function], feed_dict={X:train_data[batch_size:-1], Y:train_labels[batch_size:-1]})

        # recording the running cost of both processes
        cost_history = np.append(cost_history, cost)
        print(" epoch {}: task {}: cost {:.3f}".format(epoch, task_number, cost))

    print("Accuracy SGD ({}): {:.3f}".format(
        epoch, round(ss.run(accuracy, feed_dict={X: test_data, Y: test_labels}), 3)))

Shakeel anjum

Moving the given code, just as it is, to Ubuntu 16.04.4 LTS solved the above problem for me.

I am not sure, but it seems to be related to gRPC + firewall on Windows 10.

If anyone runs into this BLAS error on Windows and manages to solve it there, please post the solution for the rest of us.

Cheers.
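If anyone wants to test the firewall theory before switching OS, here is a quick sketch (not from the original post) that probes the two gRPC ports used above, to help rule out a local block:

import socket

# probe the two ports from the cluster spec above
for port in (2001, 2002):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(2.0)
    result = sock.connect_ex(("localhost", port))   # 0 means connectable
    print("port {}: {}".format(port, "open" if result == 0 else "blocked/closed"))
    sock.close()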

