基准测试和处理时间结果的差异

Leerssej

我一直在尝试对替换数据帧中NA的最有效方法进行一些测试。

我首先比较了一百万行,十二列数据集中的NA与0的替换解决方案。将所有具备管道能力的设备投入microbenchmark我的系统中,结果如下。

问题1:是否有方法可以测试benchmark函数内部的子集左赋值语句(例如:df1 [is.na(df1)] <-0)

library(dplyr)
library(tidyr)
library(microbenchmark)

set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA, 1:5), 1e6 *12, replace=TRUE),
                            dimnames = list(NULL, paste0("var", 1:12)), ncol=12))

op <- microbenchmark(
    mut_all_ifelse   = df1 %>% mutate_all(funs(ifelse(is.na(.), 0, .))),
    mut_at_ifelse    = df1 %>% mutate_at(funs(ifelse(is.na(.), 0, .)), .cols = c(1:12)),
    # df1[is.na(df1)] <- 0 would sit here, but I can't make it work inside this function
    replace          = df1 %>% replace(., is.na(.), 0),
    mut_all_replace  = df1 %>% mutate_all(funs(replace(., is.na(.), 0))),
    mut_at_replace   = df1 %>% mutate_at(funs(replace(., is.na(.), 0)), .cols = c(1:12)),
    replace_na       = df1 %>% replace_na(list(var1 = 0, var2 = 0, var3 = 0, var4 = 0, var5 = 0, var6 = 0, var7 = 0, var8 = 0, var9 = 0, var10 = 0, var11 = 0, var12 = 0)),
    times = 1000L
)

print(op) #standard data frame of the output
    Unit: milliseconds
            expr       min       lq     mean   median       uq       max neval
  mut_all_ifelse 769.87848 844.5565 871.2476 856.0941 895.4545 1274.5610  1000
   mut_at_ifelse 713.48399 847.0322 875.9433 861.3224 899.7102 1006.6767  1000
         replace 258.85697 311.9708 334.2291 317.3889 360.6112  455.7596  1000
 mut_all_replace  96.81479 164.1745 160.6151 167.5426 170.5497  219.5013  1000
  mut_at_replace  96.23975 166.0804 161.9302 169.3984 172.7442  219.0359  1000
      replace_na 103.04600 161.2746 156.7804 165.1649 168.3683  210.9531  1000
boxplot(op) #boxplot of output

Microbenchmark Ba​​se R,dplyr和tidyr替代品的箱线图

library(ggplot2) #nice log plot of the output
qplot(y=time, data=op, colour=expr) + scale_y_log10()

Microbenchmark Ba​​se R,dplyr和tidyr的颜色logY时间点图替换

为了测试子集赋值运算符,我最初运行了这些测试。

set.seed(24) 
> Book1 <- as.data.frame(matrix(sample(c(NA, 1:5), 1e8 *12, replace=TRUE),
+ dimnames = list(NULL, paste0("var", 1:12)), ncol=12))
> system.time({ 
+     Book1 %>% mutate_all(funs(ifelse(is.na(.), 0, .))) })
   user  system elapsed 
  52.79   24.66   77.45 
> 
> system.time({ 
+     Book1 %>% mutate_at(funs(ifelse(is.na(.), 0, .)), .cols = c(1:12)) })
   user  system elapsed 
  52.74   25.16   77.91 
> 
> system.time({ 
+     Book1[is.na(Book1)] <- 0 })
   user  system elapsed 
  16.65    7.86   24.51 
> 
> system.time({ 
+     Book1 %>% replace_na(list(var1 = 0, var2 = 0, var3 = 0, var4 = 0, var5 = 0, var6 = 0, var7 = 0, var8 = 0, var9 = 0,var10 = 0, var11 = 0, var12 = 0)) })
   user  system elapsed 
   3.54    2.13    5.68 
> 
> system.time({ 
+     Book1 %>% mutate_at(funs(replace(., is.na(.), 0)), .cols = c(1:12)) })
   user  system elapsed 
   3.37    2.26    5.63 
> 
> system.time({ 
+     Book1 %>% mutate_all(funs(replace(., is.na(.), 0))) })
   user  system elapsed 
   3.33    2.26    5.58 
> 
> system.time({ 
+     Book1 %>% replace(., is.na(.), 0) })
   user  system elapsed 
   3.42    1.09    4.51 

In these tests the base replace() comes in first. In the benchmarking trials, the replace falls farther back in the ranks while the tidyr replace_na() wins (by a nose) Running the singular tests repeatedly and on different shapes and sizes of data frames always finds the base replace() in the lead.

Question 2: How could it's benchmark performance be the only result that falls so far out of line with the simple test results?

More perplexingly - Question 3: How can all the mutate_all/_at(replace()) work faster than the simple replace()? Many folks report this: http://datascience.la/dplyr-and-a-very-basic-benchmark/ (and all the links in that article) but I still haven't found an explanation for why beyond that hashing and C++ are used.)

with special thanks already to Tyler Rinker: https://www.r-bloggers.com/microbenchmarking-with-r/ and akrun: https://stackoverflow.com/a/41530071/5088194

alexis_laz

You can include a complex/multi statement in microbenchmark by wrapping it with {} which, basically, converts to a single expression:

microbenchmark(expr1 = { df1[is.na(df1)] = 0 }, 
               exp2 = { tmp = 1:10; tmp[3] = 0L; tmp2 = tmp + 12L; tmp2 ^ 2 }, 
               times = 10)
#Unit: microseconds
#  expr        min         lq       mean     median         uq        max neval cld
# expr1 124953.716 137244.114 158576.030 142405.685 156744.076 284779.353    10   b
#  exp2      2.784      3.132     17.748     23.142     24.012     38.976    10  a 

Worth noting the side-effects of this:

tmp
#[1]  1  2  0  4  5  6  7  8  9 10

in contrast to, say, something like:

rm(tmp)
microbenchmark(expr1 = { df1[is.na(df1)] = 0 },  
               exp2 = local({ tmp = 1:10; tmp[3] = 0L; tmp2 = tmp + 12L; tmp2 ^ 2 }), 
               times = 10)
#Unit: microseconds
#  expr       min         lq        mean     median         uq        max neval cld
# expr1 127250.18 132935.149 165296.3030 154509.553 169917.705 314820.306    10   b
#  exp2     10.44     12.181     42.5956     54.636     57.072     97.789    10  a 
tmp
#Error: object 'tmp' not found

Noticing the side effect a benchmark has, we see that the first operation that removes NA values leaves a fairly light job for the following alternatives:

# re-assign because we changed it before
set.seed(24)
df1 = as.data.frame(matrix(sample(c(NA, 1:5), 1e6 * 12, TRUE), 
                           dimnames = list(NULL, paste0("var", 1:12)), ncol = 12))
unique(sapply(df1, typeof))
#[1] "integer"
any(sapply(df1, anyNA))
#[1] TRUE
system.time({ df1[is.na(df1)] <- 0 })
# user  system elapsed 
# 0.39    0.14    0.53 

The previous benchmark leaves us with:

unique(sapply(df1, typeof))
#[1] "double"
any(sapply(df1, anyNA))
#[1] FALSE

And replacing NA, when there are none, is/should be taken account to do nothing on the input.

Aside of that, note that in all your alternatives you sub-assign a "double" (typeof(0)) to "integer" columns-vectors (sapply(df1, typeof)). While, I don't think there is any case (in the above alternatives) where df1 gets modified in place (since after the creation of a "data.frame", there is stored info to copy its vector-columns in case of modification), there -still- is a minor but avoidable overhead in coercing to "double" and storing as a "double". R before replacing elements in an "integer" vector will allocate and copy (in case of "integer" replacement) or allocate and coerce (in case of "double" replacement). Also, after the first coercion (from a side effect of a benchmark, as noted above), R will operate on "double"s and that contains slower manipulations than on "integer"s. I can't find a straightforward R way to investigate that difference, but in a nutshell (in danger of not being totally accurate) we can simulate these operations by:

# simulate R's copying of int to int
# allocate a new int and copy
int2int = inline::cfunction(sig = c(x = "integer"), body = '
    SEXP ans = PROTECT(allocVector(INTSXP, LENGTH(x)));
    memcpy(INTEGER(ans), INTEGER(x), LENGTH(x) * sizeof(int));
    UNPROTECT(1);
    return(ans);
')
# R's coercing of int to double
# 'coerceVector', internally, allocates a double and coerces to populate it
int2dbl = inline::cfunction(sig = c(x = "integer"), body = '
    SEXP ans = PROTECT(coerceVector(x, REALSXP));
    UNPROTECT(1);
    return(ans);
')
# simulate R's copying form double to double
dbl2dbl = inline::cfunction(sig = c(x = "double"), body = '
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    memcpy(REAL(ans), REAL(x), LENGTH(x) * sizeof(double));
    UNPROTECT(1);
    return(ans);
')

And on a benchmark:

x.int = 1:1e7; x.dbl = as.numeric(x.int)
microbenchmark(int2int(x.int), int2dbl(x.int), dbl2dbl(x.dbl), times = 50)
#Unit: milliseconds
#           expr      min       lq     mean   median       uq      max neval cld
# int2int(x.int) 16.42710 16.91048 21.93023 17.42709 19.38547 54.36562    50  a 
# int2dbl(x.int) 35.94064 36.61367 47.15685 37.40329 63.61169 78.70038    50   b
# dbl2dbl(x.dbl) 33.51193 34.18427 45.30098 35.33685 63.45788 75.46987    50   b

在结束(!)整体上一份说明,取代了00L应节省一些时间...

最后,为了更公平地复制基准,我们可以使用:

library(dplyr)
library(tidyr)
library(microbenchmark) 
set.seed(24)
df1 = as.data.frame(matrix(sample(c(NA, 1:5), 1e6 * 12, TRUE), 
                            dimnames = list(NULL, paste0("var", 1:12)), ncol = 12))

包装功能:

stopifnot(ncol(df1) == 12)  #some of the alternatives are hardcoded to 12 columns
mut_all_ifelse = function(x, val) x %>% mutate_all(funs(ifelse(is.na(.), val, .)))
mut_at_ifelse = function(x, val) x %>% mutate_at(funs(ifelse(is.na(.), val, .)), .cols = c(1:12))
baseAssign = function(x, val) { x[is.na(x)] <- val; x }
baseFor = function(x, val) { for(j in 1:ncol(x)) x[[j]][is.na(x[[j]])] = val; x }
base_replace = function(x, val) x %>% replace(., is.na(.), val)
mut_all_replace = function(x, val) x %>% mutate_all(funs(replace(., is.na(.), val)))
mut_at_replace = function(x, val) x %>% mutate_at(funs(replace(., is.na(.), val)), .cols = c(1:12))
myreplace_na = function(x, val) x %>% replace_na(list(var1 = val, var2 = val, var3 = val, var4 = val, var5 = val, var6 = val, var7 = val, var8 = val, var9 = val, var10 = val, var11 = val, var12 = val))

在基准之前测试结果是否相等:

identical(mut_all_ifelse(df1, 0), mut_at_ifelse(df1, 0))
#[1] TRUE
identical(mut_at_ifelse(df1, 0), baseAssign(df1, 0))
#[1] TRUE
identical(baseAssign(df1, 0), baseFor(df1, 0))
#[1] TRUE
identical(baseFor(df1, 0), base_replace(df1, 0))
#[1] TRUE
identical(base_replace(df1, 0), mut_all_replace(df1, 0))
#[1] TRUE
identical(mut_all_replace(df1, 0), mut_at_replace(df1, 0))
#[1] TRUE
identical(mut_at_replace(df1, 0), myreplace_na(df1, 0))
#[1] TRUE

强制测试为“双精度”:

benchnum = microbenchmark(mut_all_ifelse(df1, 0), 
                          mut_at_ifelse(df1, 0), 
                          baseAssign(df1, 0), 
                          baseFor(df1, 0),
                          base_replace(df1, 0), 
                          mut_all_replace(df1, 0),
                          mut_at_replace(df1, 0), 
                          myreplace_na(df1, 0),
                          times = 10)
benchnum
#Unit: milliseconds
#                    expr       min        lq      mean    median        uq       max neval cld
#  mut_all_ifelse(df1, 0) 1368.5091 1441.9939 1497.5236 1509.2233 1550.1416 1629.6959    10   c
#   mut_at_ifelse(df1, 0) 1366.1674 1389.2256 1458.1723 1464.5962 1503.4337 1553.7110    10   c
#      baseAssign(df1, 0)  532.4975  548.9444  586.8198  564.3940  655.8083  667.8634    10  b 
#         baseFor(df1, 0)  169.6048  175.9395  206.7038  189.5428  197.6472  308.6965    10 a  
#    base_replace(df1, 0)  518.7733  547.8381  597.8842  601.1544  643.4970  666.6872    10  b 
# mut_all_replace(df1, 0)  169.1970  183.5514  227.1978  194.0903  291.6625  346.4649    10 a  
#  mut_at_replace(df1, 0)  176.7904  186.4471  227.3599  202.9000  303.4643  309.2279    10 a  
#    myreplace_na(df1, 0)  172.4926  177.8518  199.1469  186.3645  192.1728  297.0419    10 a

测试时不强制“加倍”:

benchint = microbenchmark(mut_all_ifelse(df1, 0L), 
                          mut_at_ifelse(df1, 0L), 
                          baseAssign(df1, 0L), 
                          baseFor(df1, 0L),
                          base_replace(df1, 0L), 
                          mut_all_replace(df1, 0L),
                          mut_at_replace(df1, 0L),
                          myreplace_na(df1, 0L),
                          times = 10)
benchint
#Unit: milliseconds
#                     expr        min        lq      mean    median        uq       max neval cld
#  mut_all_ifelse(df1, 0L) 1291.17494 1313.1910 1377.9265 1353.2812 1417.4389 1554.6110    10   c
#   mut_at_ifelse(df1, 0L) 1295.34053 1315.0308 1372.0728 1353.0445 1431.3687 1478.8613    10   c
#      baseAssign(df1, 0L)  451.13038  461.9731  477.3161  471.0833  484.9318  528.4976    10  b 
#         baseFor(df1, 0L)   98.15092  102.4996  115.7392  107.9778  136.2227  139.7473    10 a  
#    base_replace(df1, 0L)  428.54747  451.3924  471.5011  470.0568  497.7088  516.1852    10  b 
# mut_all_replace(df1, 0L)  101.66505  102.2316  137.8128  130.5731  161.2096  243.7495    10 a  
#  mut_at_replace(df1, 0L)  103.79796  107.2533  119.1180  112.1164  127.7959  166.9113    10 a  
#    myreplace_na(df1, 0L)  100.03431  101.6999  120.4402  121.5248  137.1710  141.3913    10 a

一种简单的可视化方式:

boxplot(benchnum, ylim = range(min(summary(benchint)$min, summary(benchnum)$min),
                               max(summary(benchint)$max, summary(benchnum)$max)))
boxplot(benchint, add = TRUE, border = "red", axes = FALSE) 
legend("topright", c("coerce", "not coerce"), fill = c("black", "red"))                       

在此处输入图片说明

请注意,df1所有这一切(str(df1)之后都保持不变

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章