Why use as.factor() instead of just factor()?

Posted by Ben in Dev

Ben :

I recently saw Matt Dowle write some code that specifically used as.factor():

for (col in names_factors) set(dt, j=col, value=as.factor(dt[[col]]))

in a comment on this answer.

I used this snippet, but I needed to set the factor levels explicitly so that they appear in my desired order, so I had to change

as.factor(dt[[col]])

to

factor(dt[[col]], levels = my_levels)

This got me thinking: what benefit (if any) is there to using as.factor() rather than just factor()?
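
For context, here is a minimal sketch of why that change was needed: as.factor() takes no levels argument, so for a character vector it always produces sorted levels. The vector sizes and the level order below are made-up illustration data, not taken from the original code.

```r
# Hypothetical data, for illustration only
sizes <- c("low", "high", "medium", "low")

# as.factor() takes a single argument, so the levels come out in sorted order
f_default <- as.factor(sizes)
levels(f_default)

# factor() lets us fix the level order explicitly
f_ordered <- factor(sizes, levels = c("low", "medium", "high"))
levels(f_ordered)
```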

李哲源 :

as.factor is a wrapper for factor, but it returns immediately if the input vector is already a factor:

function (x) 
{
    if (is.factor(x)) 
        x
    else if (!is.object(x) && is.integer(x)) {
        levels <- sort(unique.default(x))
        f <- match(x, levels)
        levels(f) <- as.character(levels)
        if (!is.null(nx <- names(x))) 
            names(f) <- nx
        class(f) <- "factor"
        f
    }
    else factor(x)
}

Comment from Frank: it is not a mere wrapper, because this "quick return" leaves the factor levels as they are, whereas factor() does not:

f = factor("a", levels = c("a", "b"))
#[1] a
#Levels: a b

factor(f)
#[1] a
#Levels: a

as.factor(f)
#[1] a
#Levels: a b

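Expanded answer two years later, covering the following:

What does the manual say?
Performance: as.factor > factor when the input is a factor
Performance: as.factor > factor when the input is an integer
Unused levels or NA levels
Caution when using R's group-by functions: watch for unused or NA levels

What does the manual say?

The documentation ?factor mentions the following: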

‘factor(x, exclude = NULL)’ applied to a factor without ‘NA’s is a
 no-operation unless there are unused levels: in that case, a
 factor with the reduced level set is returned.

 ‘as.factor’ coerces its argument to a factor.  It is an
 abbreviated (sometimes faster) form of ‘factor’.

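Performance: `as.factor` > `factor` when the input is a factor

The word "no-operation" is a bit ambiguous. Don't read it as "doing nothing"; what it actually means is "doing a lot of work without essentially changing anything". Here is an example: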

set.seed(0)
## a randomized long factor with 1e+6 levels, each repeated 10 times
f <- sample(gl(1e+6, 10))

system.time(f1 <- factor(f))  ## default: exclude = NA
#   user  system elapsed 
#  7.640   0.216   7.887 

system.time(f2 <- factor(f, exclude = NULL))
#   user  system elapsed 
#  7.764   0.028   7.791 

system.time(f3 <- as.factor(f))
#   user  system elapsed 
#      0       0       0 

identical(f, f1)
#[1] TRUE

identical(f, f2)
#[1] TRUE

identical(f, f3)
#[1] TRUE

as.factor does return quickly, but factor is not a real "no-op". Let's profile factor to see what it does.

Rprof("factor.out")
f1 <- factor(f)
Rprof(NULL)
summaryRprof("factor.out")[c(1, 4)]
#$by.self
#                      self.time self.pct total.time total.pct
#"factor"                   4.70    58.90       7.98    100.00
#"unique.default"           1.30    16.29       4.42     55.39
#"as.character"             1.18    14.79       1.84     23.06
#"as.character.factor"      0.66     8.27       0.66      8.27
#"order"                    0.08     1.00       0.08      1.00
#"unique"                   0.06     0.75       4.54     56.89
#
#$sampling.time
#[1] 7.98

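It first sorts the unique values of the input vector f, then converts f to a character vector, and finally coerces that character vector back to a factor. Here is the source code of factor for confirmation.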

function (x = character(), levels, labels = levels, exclude = NA, 
    ordered = is.ordered(x), nmax = NA) 
{
    if (is.null(x)) 
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x)) 
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx)) 
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL))) 
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d", 
            nl, nL), domain = NA)
    levels(f) <- if (nl == nL) 
        as.character(labels)
    else paste0(labels, seq_along(levels))
    class(f) <- c(if (ordered) "ordered", "factor")
    f
}

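So factor is really designed to work with a character vector, and it applies as.character to its input to ensure that. We can learn at least two performance-related lessons from the above:

For a data frame DF, lapply(DF, as.factor) is much faster than lapply(DF, factor) for type conversion if many of the columns are already factors.
The slowness of factor can explain why some important R functions are slow, table for example: R: table function surprisingly slow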
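
The first point can be sketched on a small made-up data frame (DF and its column names are assumptions for illustration). For columns that are already factors, as.factor hands back the same object, while factor rebuilds it and drops unused levels along the way.

```r
# Hypothetical data frame whose columns are already factors
DF <- data.frame(
  g1 = factor(c("a", "b", "a"), levels = c("a", "b", "c")),
  g2 = factor(c("x", "x", "y"))
)

via_as_factor <- lapply(DF, as.factor)  # quick return, columns untouched
via_factor    <- lapply(DF, factor)     # each column is rebuilt from scratch

identical(via_as_factor$g1, DF$g1)      # did as.factor keep the column as is?
levels(via_as_factor$g1)                # levels after as.factor
levels(via_factor$g1)                   # levels after factor
```

On toy data like this the timing difference is invisible; the benchmark earlier in this section shows the cost on a long factor.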

Performance: `as.factor` > `factor` when the input is an integer

A factor variable is a close relative of an integer variable.

unclass(gl(2, 2, labels = letters[1:2]))
#[1] 1 1 2 2
#attr(,"levels")
#[1] "a" "b"

storage.mode(gl(2, 2, labels = letters[1:2]))
#[1] "integer"

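This means that converting an integer to a factor is easier than converting a numeric or character vector to a factor. as.factor simply takes advantage of this.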

x <- sample.int(1e+6, 1e+7, TRUE)

system.time(as.factor(x))
#   user  system elapsed 
#  4.592   0.252   4.845 

system.time(factor(x))
#   user  system elapsed 
# 22.236   0.264  22.659
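
The speed difference does not change the result. As a small sanity check on a made-up integer vector (x_small is a hypothetical name), the two routes can be compared on their levels and on their integer codes:

```r
# Small made-up integer vector, for illustration only
x_small <- c(3L, 1L, 2L, 1L, 3L)

a <- as.factor(x_small)  # integer shortcut inside as.factor
b <- factor(x_small)     # general route via as.character

levels(a)
levels(b)
identical(as.integer(a), as.integer(b))
```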

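Unused levels or NA levels

Now let's look at a few examples of how factor and as.factor affect factor levels when the input is already a factor. Frank has given one with an unused factor level; I will provide one with an NA level.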

f <- factor(c(1, NA), exclude = NULL)
#[1] 1    <NA>
#Levels: 1 <NA>

as.factor(f)
#[1] 1    <NA>
#Levels: 1 <NA>

factor(f, exclude = NULL)
#[1] 1    <NA>
#Levels: 1 <NA>

factor(f)
#[1] 1    <NA>
#Levels: 1

There is a (generic) function droplevels that can be used to drop the unused levels of a factor. By default, however, it does not drop an NA level.

## "factor" method of `droplevels`
droplevels.factor
#function (x, exclude = if (anyNA(levels(x))) NULL else NA, ...) 
#factor(x, exclude = exclude)

droplevels(f)
#[1] 1    <NA>
#Levels: 1 <NA>

droplevels(f, exclude = NA)
#[1] 1    <NA>
#Levels: 1

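Caution when using R's group-by functions: watch for unused or NA levels

R functions that perform group-by operations, such as split and tapply, expect us to provide factor variables as the "by" variables. But often we just provide character or numeric variables. So internally these functions need to convert them to factors, and most of them probably use as.factor in the first place (at least this is so for split.default and tapply). The table function looks like an exception: I spotted factor rather than as.factor inside it. There may be some special consideration behind this, but unfortunately it was not obvious to me when I inspected the source code.

Since most group-by R functions use as.factor, if they are given a factor with unused or NA levels, those groups will appear in the result.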

x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])

split(x, f)
#$a
#[1] 1
#
#$b
#[1] 2
#
#$c
#numeric(0)

tapply(x, f, FUN = mean)
# a  b  c 
# 1  2 NA

Interestingly, although table does not rely on as.factor, it preserves those unused levels too:

table(f)
#a b c 
#1 1 0

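Sometimes this behavior is undesired. A classic example is barplot(table(f)), where the unused level still takes up a slot on the axis with an empty bar.

If this really is undesired, we need to remove the unused or NA levels from the factor variable manually, using droplevels or factor.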
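
A minimal sketch of that clean-up, using the same kind of factor as in the example above (redefined here so that the snippet stands alone):

```r
f <- factor(letters[1:2], levels = letters[1:3])

table(f)               # keeps the empty level "c"
table(droplevels(f))   # unused level removed with droplevels
table(factor(f))       # same effect, since factor() rebuilds the levels

# barplot(table(droplevels(f)))  # would then draw bars for "a" and "b" only
```

An NA level needs the explicit droplevels(f, exclude = NA) or a plain factor(f), because droplevels keeps NA levels by default.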

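Hints:

split has an argument drop that defaults to FALSE, so as.factor is used; with drop = TRUE, factor is used instead.
aggregate relies on split, so it also has a drop argument, which defaults to TRUE.
tapply does not have a drop argument, although it relies on split as well. In particular, the documentation ?tapply says that as.factor is (always) used.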
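
The first and third hints can be sketched with the same toy x and f as in the earlier split example. Because tapply has no drop argument, dropping the unused level beforehand is one possible workaround; treat it as a suggestion rather than something the documentation prescribes.

```r
x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])

s_keep <- split(x, f)               # default drop = FALSE: empty group "c" appears
s_drop <- split(x, f, drop = TRUE)  # unused level "c" is dropped

names(s_keep)
names(s_drop)

# tapply has no `drop` argument, so drop the unused level before calling it
t_drop <- tapply(x, droplevels(f), FUN = mean)
t_drop
```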


Edited on 2020-09-22
