R去除某些值后矩阵的最大距离

Kevin 发表于 Dev

凯文

假设我们有一个如下的矩阵，

A <- matrix(c(1,7,13,19,9,5,8,14,20,10,3,4,15,21,1,2,4,16,22,2,8,3,17,23,1,6,3,18,24,2), nrow=5)

 A
       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]
 [1,]    1      5      3      2      8      6
 [2,]    7      8      4      4      3     3
 [3,]    13     14     15     16     17     18
 [4,]    19     20     21     22     23     24
 [5,]    9      10     1      2      1      2

该dist函数可以计算出最大绝对的矩阵的每一行之间的距离A和返回距离矩阵D使用dist(A, method = "maximum")。D[i,j] = \max_{k}(|A[i,k]-A[j,k]|)例如，

 D[1,2] = max( abs( A[1,] - A[2,] ) ) = max(6, 3, 1, 2, 5, 3) = 6

但是，就我而言，我需要首先删除i, jelement，D[i,j] = \max_{k not equal to i or j}(|A[i,k]-A[j,k]|)例如，在上面的示例中，答案是beomes

 D[1,2] = max( abs( A[1,] - A[2,] ) ) = max( 1, 2, 5, 3) = 5

我不知道如何以有效的方式执行此操作，我知道我可以使用for循环，但是数据集很大，for循环非常慢。

ekoam

假设您的真实矩阵还具有多于行的列。这是所需功能的基本R实现：

max_dist <- function(mat, i, j) {
  mat <- mat[c(i, j), -c(i, j)]
  max(abs(mat[1L, ] - mat[2L, ]))
}

dist1 <- function(mat) {
  n <- nrow(mat)
  ids <- do.call(rbind, lapply(2:n, function(i, e) cbind(i:e, rep.int(i - 1L, e - i + 1L)), n))
  out <- apply(ids, 1L, function(i) max_dist(mat, i[[1L]], i[[2L]]))
  attributes(out) <- list(
    Size = n, Labels = dimnames(mat)[[1L]], Diag = FALSE, 
    Upper = FALSE, method = "dist1", call = match.call(), 
    class = "dist"
  )
  out
}

如果您认为R的速度不够快，则可以使用package parallelDist，它允许用户定义C ++距离函数。考虑以下实现：

library(parallelDist)
library(RcppXPtrUtils)
library(RcppArmadillo)

mydist_ptr <- cppXPtr("double mydist(const arma::mat &A, const arma::mat &B) {
  arma::uvec ids = {0, (unsigned int)A(0, 0), (unsigned int)B(0, 0)};
  arma::mat A_ = A, B_ = B;
  A_.shed_cols(ids); B_.shed_cols(ids); 
  return abs((A_ - B_)).max();
}", depends = "RcppArmadillo")

dist2 <- function(mat) {
  # prepend row numbers to the matrix
  # this later allows cpp function `mydist` to identify which rows to drop 
  parDist(cbind(seq_len(nrow(mat)), mat), method = "custom", func = mydist_ptr) 
}

使用以下矩阵进行测试（这small_m是您帖子中的示例）：

small_m <- matrix(c(1,5,3,2,8,6,7,8,4,4,3,3,13,14,15,16,17,18,19,20,21,22,23,24,9,10,1,2,1,2), 5, 6, byrow = TRUE)
large_m <- matrix(rnorm(1000000), 10, 100000)

基准测试

# no real difference between these two implementations when the input matrix is small
> microbenchmark::microbenchmark(dist1(small_m), dist2(small_m))
Unit: microseconds
           expr   min     lq    mean median     uq   max neval cld
 dist1(small_m)  77.4  87.10 112.403  106.5 125.95 212.2   100  a 
 dist2(small_m) 145.5 160.25 177.786  170.2 183.80 286.7   100   b

# `dist2` is faster with large matrix input. However, the efficiency of `dist1` is also acceptable IMO.
> microbenchmark::microbenchmark(dist1(large_m), dist2(large_m))
Unit: milliseconds
           expr      min       lq      mean   median       uq      max neval cld
 dist1(large_m) 129.7531 139.3909 152.13154 143.0549 149.5870 322.0173   100   b
 dist2(large_m)  48.8025  52.5081  55.84333  55.5175  58.6095  67.6470   100  a

输出如下

> dist1(small_m)
   1  2  3  4
2  5         
3 14 15      
4 18 21  6   
5  5  3 16 22
> dist2(small_m)
   1  2  3  4
2  5         
3 14 15      
4 18 21  6   
5  5  3 16 22

本文收集自互联网，转载请注明来源。

如有侵权，请联系 [email protected] 删除。