I have a Data Factory with a pipeline copy activity like this:
{
    "type": "Copy",
    "name": "Copy from storage to SQL",
    "inputs": [
        {
            "name": "storageDatasetName"
        }
    ],
    "outputs": [
        {
            "name": "sqlOutputDatasetName"
        }
    ],
    "typeProperties": {
        "source": {
            "type": "BlobSource"
        },
        "sink": {
            "type": "SqlSink"
        }
    },
    "policy": {
        "concurrency": 1,
        "retry": 3
    },
    "scheduler": {
        "frequency": "Month",
        "interval": 1
    }
}
The input data is about 90 MB in size, roughly 1.5 million rows, split into approximately 20 block blob files of about 4.5 MB each in Azure Storage. Here is a sample of the data (CSV):
A81001,1,1,1,2,600,3.0,0.47236654,141.70996,0.70854986
A81001,4,11,0,25,588,243.0,5.904582,138.87576,57.392536
A81001,7,4,1,32,1342,278.0,7.5578647,316.95795,65.65895
The sink is an Azure SQL Server of tier S2, rated at 50 DTUs. I created a simple table with sensible data types and no keys, indexes, or anything fancy, just columns:
CREATE TABLE [dbo].[Prescriptions](
[Practice] [char](6) NOT NULL,
[BnfChapter] [tinyint] NOT NULL,
[BnfSection] [tinyint] NOT NULL,
[BnfParagraph] [tinyint] NOT NULL,
[TotalItems] [int] NOT NULL,
[TotalQty] [int] NOT NULL,
[TotalActCost] [float] NOT NULL,
[TotalItemsPerThousand] [float] NOT NULL,
[TotalQtyPerThousand] [float] NOT NULL,
[TotalActCostPerThousand] [float] NOT NULL
)
The source, sink and Data Factory are all in the same region (North Europe).
According to Microsoft's "Copy Activity Performance and Tuning Guide", for an Azure Storage source and an Azure SQL S2 sink I should get about 0.4 MBps. By my calculations, that means the 90 MB should transfer in around half an hour (is that right?).
For some reason, it copies 70,000 rows very quickly and then seems to hang. Using SQL Management Studio I can see that the row count in the database table is exactly 70,000, and it hasn't increased at all in 7 hours. Yet the copy task is still running, with no errors:
Any ideas why this is hanging at 70,000 rows? I can't see anything unusual about the 70,001st data row which would cause a problem. I've tried completely trashing the data factory and starting again, and I always get the same behaviour. I have another copy activity with a smaller table (8,000 rows), which completes in 1 minute.
Just to answer my own question in case it helps anyone else:
The issue was with null values. The reason that my run was hanging at 70,000 rows was that at row 76560 of my blob source file, there was a null value in one of the columns. The HIVE script I had used to generate this blob file had written the null value as '\N'. Also, my sink SQL table specified 'NOT NULL' as part of the column, and the column was a FLOAT value.
So I made two changes. First, I added the following property to my blob dataset definition:
"nullValue": "\\N"
Second, I made my SQL table column nullable. It now runs all the way through and doesn't hang! :)
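Making the column nullable is a one-line ALTER; `TotalActCost` here is just an example, and the same applies to whichever column actually received the null:

```sql
-- Allow NULLs in the column that the \N values map to
ALTER TABLE [dbo].[Prescriptions]
ALTER COLUMN [TotalActCost] [float] NULL;
```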
The problem is that Data Factory didn't error; it just got stuck. It would have been nice if the job had failed with a helpful error message telling me which row of data was the problem. I think that because the default write batch size is 10,000, this is why it was stuck at 70,000 rather than at 76,560.