Distributed TensorFlow: InternalError - Blas GEMM launch failed

Shakeel anjum

I am experimenting with distributed TensorFlow, starting with two processes on localhost (Windows 10, Python 3.6.6, TensorFlow 1.8.0). Each process runs a replica of a simple neural network (one hidden layer), modeled on a subset of the UrbanSounds dataset (5,268 samples per model, each with 193 features).

Following this well-written post: https://learningtensorflow.com/lesson11/ I was able to reproduce their basic example, which computes a mean from the results of two different processes. For my dataset, I modified the code (shown further below) so that the total samples are split into two halves and two different processes each compute the cost function on their half. However, after the RPC server starts successfully, both processes eventually fail with the InternalError shown below.
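For context, the baseline pattern from that lesson looks roughly like this; a condensed sketch with illustrative ports, filename, and values, not the lesson's exact code:

import sys
import tensorflow as tf

# two tasks of one "local" job, each started in its own shell:
#   python baseline.py 0    and    python baseline.py 1
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
task_number = int(sys.argv[1])
server = tf.train.Server(cluster, job_name="local", task_index=task_number)

if task_number == 0:
    # task 0 builds the graph; the op is pinned to task 1's device
    with tf.device("/job:local/task:1"):
        x = tf.constant([1.0, 2.0, 3.0, 4.0])
        mean_op = tf.reduce_mean(x)
    with tf.Session(server.target) as sess:
        print(sess.run(mean_op))   # computed on task 1, printed by task 0
else:
    # task 1 just serves its device to the cluster
    server.join()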

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

It looks to me like there is some basic mistake either in the neural network configuration or in how the dataset is prepared for feed_dict, but I cannot see it, so I need another pair of eyes. Another observation from this experiment is that the GPU mostly maxes out and the code aborts. Could you please help me find whatever is wrong in my code or in my strategy for distributing TensorFlow?
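For reference, this is how I am currently capping GPU memory per process; a minimal excerpt of the relevant part (the full code is below, and allow_growth=True would be the other common TF 1.x way to keep TensorFlow from pre-allocating the whole card):

import tensorflow as tf

# cap each process at ~1/3 of the GPU so two replicas can share one card
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
# alternative: let the allocation grow on demand instead
# gpu_options = tf.GPUOptions(allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

# note: the config only applies to sessions it is actually passed to
session = tf.Session("grpc://localhost:2001", config=config)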

Thanks.

### ERROR TRACE (duplicate rows removed ...) ###
train_data, train_labels (528, 193) (528, 10)
test_data, test_labels (22, 193) (22, 10)
2018-08-27 14:35:29.096572: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-08-27 14:35:29.330127: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.63GiB
...
2018-08-27 14:35:33.982347: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
Traceback (most recent call last):
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
2018-08-27 14:35:33.989312: W T:\src\github\tensorflow\tensorflow\stream_executor\stream.cc:2001] attempting to perform BLAS operation using StreamExecutor without BLAS support
    return fn(*args)
...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

Caused by op 'MatMul', defined at:
  File "tf_dis_audio_test.py", line 78, in <module>
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
...
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]
### CODE SAMPLE ###
# imports used by this snippet (dataset loading is omitted, as in the original post)
import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

datetime_format = "%Y-%m-%d %H:%M:%S.%f"  # assumed; the original snippet does not show this definition

# selected UrbanSounds dataset
print("train_data, train_labels", train_data.shape, train_labels.shape)
print("test_data, test_labels", test_data.shape, test_labels.shape)

# neural network configurations
cost = 0.0
n_tasks = 2
n_epochs = 10
n_classes = 10
n_features = 193
n_hidden_1 = 200
learning_rate = 0.1
sd = 1/np.sqrt(n_features)
cost_history = np.empty(shape=[1], dtype=float)

# task#0 is set as rpc host process
rpc_server = "grpc://localhost:2001"

# run two separate python shells, each with its task number (0,1), as:
#>python this_script.py  0
#>python this_script.py  1
task_number = int(sys.argv[1])

# cluster specs with two localhosts on different ports (2001, 2002)
cluster = tf.train.ClusterSpec({"local": ["localhost:2001", "localhost:2002"]})
server = tf.train.Server(cluster, job_name="local", task_index=task_number)
server.start()

graph = tf.Graph()
with graph.as_default():    
    X = tf.placeholder(tf.float32, [None, n_features])
    Y = tf.placeholder(tf.float32, [None, n_classes])

    w1 = tf.Variable(tf.random_normal([n_features, n_hidden_1], mean = 0, stddev=sd), name="w1")
    b1 = tf.Variable(tf.random_normal([n_hidden_1], mean=0, stddev=sd), name="b1")
    w2 = tf.Variable(tf.random_normal([n_hidden_1, n_classes], mean = 0, stddev=sd), name="w2")
    b2 = tf.Variable(tf.random_normal([n_classes], mean=0, stddev=sd), name="b2")
    
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
    _y = tf.nn.softmax(tf.matmul(z, w2) + b2)
    
    cost_function = tf.reduce_mean(tf.square(Y - _y))
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)
    prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(_y, 1))
    accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32)) * 100.0
    print("#2: {}".format(datetime.utcnow().strftime(datetime_format)[:-3]))

# hack to fix the GPU out-of-memory issue,
# but it does not do any good, the GPU still maxes out :(
gpuops = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpuops)

with tf.Session(rpc_server, graph=graph, config=config) as ss:
    # setting up the session with RPC host
    ss = tf.Session(rpc_server)
    ss.run(tf.global_variables_initializer())

    for epoch in range(n_epochs):
        batch_size = int(len(train_labels) / n_tasks)

        # run session for task#0
        # note: [:batch_size-1] keeps 263 of this half's 264 rows,
        # which matches a.shape=(263, 193) in the error above
        if (task_number == 0):
            _, cost = ss.run([train_step, cost_function], feed_dict={X:train_data[:batch_size-1], Y:train_labels[:batch_size-1]})

        # run session for task#1
        elif (task_number == 1):
            _, cost = ss.run([train_step, cost_function], feed_dict={X:train_data[batch_size:-1], Y:train_labels[batch_size:-1]})

        # recording the running cost of both processes
        cost_history = np.append(cost_history, cost)
        print(" epoch {}: task {}: cost {:.3f}".format(epoch, task_number, cost))

    print("Accuracy SGD ({}): {:.3f}".format(
        epoch, round(ss.run(accuracy, feed_dict={X: test_data, Y: test_labels}), 3)))

Shakeel anjum

Moving the given code, just as it is, to Ubuntu 16.04.4 LTS solved the above problem for me.

I am not sure, but it seems to be related to gRPC + firewall on Windows 10.

If anyone runs into this BLAS error on Windows and manages to solve it there, please post the solution for the rest of us.

Cheers.
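If anyone wants to test the firewall theory before switching OS, here is a quick sketch (not from the original post) that probes the two gRPC ports used above, to help rule out a local block:

import socket

# probe the two ports from the cluster spec above
for port in (2001, 2002):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(2.0)
    result = sock.connect_ex(("localhost", port))   # 0 means connectable
    print("port {}: {}".format(port, "open" if result == 0 else "blocked/closed"))
    sock.close()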

