sqoop导入中按列拆分的数据类型是否应该始终是数字数据类型(整数,bignint,数字)?不能是字符串吗?
是的,您可以拆分任何非数字数据类型。
但是不建议这样做。
用于拆分数据Sqoop触发
SELECT MIN(col1), MAX(col2) FROM TABLE
然后根据您的映射器数量进行划分。
现在取整数作为的一个例子--split-by
列
表中有一些id
列的值是1到100,并且您使用了4个映射器(-m 4
在sqoop命令中)
Sqoop使用以下方法获得MIN和MAX值:
SELECT MIN(id), MAX(id) FROM TABLE
输出:
1,100
整数拆分很容易。您将分为4部分:
现在将字符串作为--split-by
列
表中有一些name
列,其值从“ dev”到“ sam”,并且您使用了4个映射器(-m 4
在sqoop命令中)
Sqoop使用以下方法获得MIN和MAX值:
SELECT MIN(id), MAX(id) FROM TABLE
输出:
开发人员,一个人
现在如何将其分为四个部分。根据sqoop文档,
/**
* This method needs to determine the splits between two user-provided
* strings. In the case where the user's strings are 'A' and 'Z', this is
* not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
* splits for strings beginning with each letter, etc.
*
* If a user has provided us with the strings "Ham" and "Haze", however, we
* need to create splits that differ in the third letter.
*
* The algorithm used is as follows:
* Since there are 2**16 unicode characters, we interpret characters as
* digits in base 65536. Given a string 's' containing characters s_0, s_1
* .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
* base 65536. Having mapped the low and high strings into floating-point
* values, we then use the BigDecimalSplitter to establish the even split
* points, then map the resulting floating point values back into strings.
*/
您将在代码中看到警告:
LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
+ "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");
在Integer示例的情况下,所有映射器将获得均衡的负载(所有映射器将从RDBMS获取25个记录)。
如果是字符串,则排序数据的可能性较小。因此,很难为所有映射器提供相似的负载。
简而言之,将整数列用作--split-by
列。
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句