A more efficient way to create dummy coding

Xiphias

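Question: In Python I would use a dictionary and plenty of map / apply functions. Starting out in R, however, I began with this simple list-based approach, and I would like to know whether there is a more efficient or more elegant way to do the following.

In statistics, you can use dummy variables to represent the levels of a nominal attribute. For example, A / B / C becomes 00, 01, 10, and A / B / C / D becomes 000, 001, 010, 100. Each item may therefore contain at most one 1, so you need n-1 digits to represent n levels (letters).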
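As a small illustration of the n-1 rule, base R's default treatment contrasts can be inspected with contrasts(). This is only a sketch: the factor x is a made-up example, and R assigns the all-zero code to the first level, so the exact digit patterns differ from the ones listed above even though the idea is the same.

```r
# Hypothetical nominal attribute with three levels
x <- factor(c("A", "B", "C"))

# Default treatment coding: one row per level, n - 1 indicator columns
codes <- contrasts(x)
print(codes)

# Each level has at most one 1 across the indicator columns
print(rowSums(codes))
```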

Here I create some data:

data <- data.frame(
  "upper" = c(1,1,1,2,2,2,3,3,3), # var 1
  "country" = c(1,2,3,1,2,3,1,2,3), # var 2
  "price" = c(1,2,3,2,3,1,3,1,2) # var 3
)

Create a list with keys (the attributes) and values (the unique levels of each attribute):

lst <- list()
for (attribute in colnames(data)) {
  lst[[attribute]] = unique(data[[attribute]])
}
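For what it's worth, the loop above can probably be collapsed into a single lapply() call, since a data frame is a list of columns. A minimal sketch, assuming the same data as above (lst2 is a name introduced here only for comparison):

```r
data <- data.frame(
  upper = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  country = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  price = c(1, 2, 3, 2, 3, 1, 3, 1, 2)
)

# lapply() visits each column and keeps the column names as list names
lst2 <- lapply(data, unique)
str(lst2)
```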

Create the dummy coding; the counter i is only there to restrict each attribute to its first n-1 levels:

dummy <- list()
for (attribute in colnames(data)) {
  i <- 1
  for (level in lst[[attribute]]) {
    if (length(lst[[attribute]])!=i) {
      dummy[[paste0(attribute, level)]] <- ifelse(
        data[[attribute]]==level,
        1,
        0
      )
    }
    i <- i + 1
  }
}

Result:

dummy
$upper1
[1] 1 1 1 0 0 0 0 0 0

$upper2
[1] 0 0 0 1 1 1 0 0 0

$country1
[1] 1 0 0 1 0 0 1 0 0

$country2
[1] 0 1 0 0 1 0 0 1 0

$price1
[1] 1 0 0 0 0 1 0 1 0

$price2
[1] 0 1 0 1 0 0 0 0 1

akrun

We create a design matrix with model.matrix, split it by column to get a list of lists, and finally concatenate the list elements together (do.call(c, ..)).

res <- do.call("c", lapply(data, function(x) {
  x1 <- model.matrix(~0+factor(x))
  split(x1, col(x1))
}))
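To make the two steps inside the function easier to follow, here is a sketch that runs them on a single column. The vector x below is simply the upper column from the question; x1 and cols are illustrative names.

```r
x <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)

# ~0 removes the intercept, so every level gets its own 0/1 column
x1 <- model.matrix(~0 + factor(x))
print(dim(x1))

# col(x1) labels each cell with its column index, so split() returns
# one vector per column of the design matrix
cols <- split(x1, col(x1))
print(names(cols))
```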

Since we only need the first two levels of each attribute, we can subset res with c(TRUE, TRUE, FALSE); the logical index is recycled to the end of the list.

res[c(TRUE, TRUE, FALSE)]
#$upper.1
#[1] 1 1 1 0 0 0 0 0 0

#$upper.2
#[1] 0 0 0 1 1 1 0 0 0

#$country.1
#[1] 1 0 0 1 0 0 1 0 0

#$country.2
#[1] 0 1 0 0 1 0 0 1 0

#$price.1
#[1] 1 0 0 0 0 1 0 1 0

#$price.2
#[1] 0 1 0 1 0 0 0 0 1
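Note that the c(TRUE, TRUE, FALSE) pattern relies on every attribute having exactly three levels. As a rough sketch of one way around that, the last level can be dropped per attribute before the pieces are concatenated. This swaps model.matrix for a plain comparison and is not part of the answer above; dummy_cols and res2 are hypothetical names.

```r
data <- data.frame(
  upper = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  country = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  price = c(1, 2, 3, 2, 3, 1, 3, 1, 2)
)

# Hypothetical helper: one 0/1 vector for every unique value except the last
dummy_cols <- function(x) {
  lv <- unique(x)
  lv <- lv[-length(lv)]  # keep the first n - 1 levels
  out <- lapply(lv, function(l) as.numeric(x == l))
  names(out) <- lv
  out
}

# Concatenating the per-attribute lists yields names such as upper.1
res2 <- do.call("c", lapply(data, dummy_cols))
str(res2)
```

Because the number of kept levels is computed per column, attributes with different numbers of levels should not need a hand-written logical mask.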
