Sqoop导入按列拆分数据类型

巴格瓦蒂

sqoop导入中按列拆分的数据类型是否应该始终是数字数据类型（整数，bignint，数字）？不能是字符串吗？

开发人员

是的，您可以拆分任何非数字数据类型。

但是不建议这样做。

为什么？

用于拆分数据Sqoop触发

SELECT MIN(col1), MAX(col2) FROM TABLE

然后根据您的映射器数量进行划分。

现在取整数作为的一个例子--split-by列

表中有一些id列的值是1到100，并且您使用了4个映射器（-m 4在sqoop命令中）

Sqoop使用以下方法获得MIN和MAX值：

SELECT MIN(id), MAX(id) FROM TABLE

输出：

1,100

整数拆分很容易。您将分为4部分：

1-25
25-50
51-75
76-100

现在将字符串作为--split-by列

表中有一些name列，其值从“ dev”到“ sam”，并且您使用了4个映射器（-m 4在sqoop命令中）

Sqoop使用以下方法获得MIN和MAX值：

SELECT MIN(id), MAX(id) FROM TABLE

输出：

开发人员，一个人

现在如何将其分为四个部分。根据sqoop文档，

/**
   * This method needs to determine the splits between two user-provided
   * strings.  In the case where the user's strings are 'A' and 'Z', this is
   * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
   * splits for strings beginning with each letter, etc.
   *
   * If a user has provided us with the strings "Ham" and "Haze", however, we
   * need to create splits that differ in the third letter.
   *
   * The algorithm used is as follows:
   * Since there are 2**16 unicode characters, we interpret characters as
   * digits in base 65536. Given a string 's' containing characters s_0, s_1
   * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
   * base 65536. Having mapped the low and high strings into floating-point
   * values, we then use the BigDecimalSplitter to establish the even split
   * points, then map the resulting floating point values back into strings.
   */

您将在代码中看到警告：

LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
    + "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");

在Integer示例的情况下，所有映射器将获得均衡的负载（所有映射器将从RDBMS获取25个记录）。

如果是字符串，则排序数据的可能性较小。因此，很难为所有映射器提供相似的负载。

简而言之，将整数列用作--split-by列。

本文收集自互联网，转载请注明来源。

如有侵权，请联系 [email protected] 删除。

编辑于 2020-11-3

我来说两句

0 条评论

登录后参与评论

上一篇：jQuery datatable-设置列宽并换行