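I have a data frame holding replicates from an experiment in different columns. Each row is a sample, and columns a, b and c are its replicates. How can I add the following new columns to this data frame?
"max" - the highest value of a, b, c for each row
"min" - the lowest value of a, b, c for each row
"variation" - max/min for each row
Then, for each row, I want to omit the data point in a, b or c that is furthest from the others, so that the variation of the remaining points is less than 10.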
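df <- data.frame(a = rnorm(10, 100, 20),
                 b = rnorm(10, 2000, 500),
                 c = rnorm(10, 50, 20))
# Restrict apply() to the replicate columns so that the new "max" column
# is not picked up when "min" is computed
replicates <- c("a", "b", "c")
df$max <- apply(df[, replicates], 1, max, na.rm = TRUE)
df$min <- apply(df[, replicates], 1, min, na.rm = TRUE)
df$variation <- df$max / df$min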
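(Also, how can I compute the max and min using dplyr and the %>% notation?)
Here is an example using a dplyr pipe with mutate and group_by. I use tidyr's gather to reshape the data into long format and, at the end, spread to reshape it back into wide format.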
library(dplyr)
library(tidyr)
set.seed(100)
dtf_wide <- data.frame(a = rnorm(10, 100, 20),
                       b = rnorm(10, 2000, 500),
                       c = rnorm(10, 50, 20))
Reshape the data into long format. Group by id (the row number in the wide format), then compute the variation and each value's distance from the median.
dtf <- dtf_wide %>%
  # Explicitly add an identification column (for the grouping)
  mutate(id = row_number()) %>%
  # Put the data in tidy format, one observation per row
  gather(key, value, a:c) %>%
  arrange(id) %>%
  group_by(id) %>%
  mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE),
         median = median(value),
         distancefrommedian = abs(value - median),
         maxdistancefrommedian = max(distancefrommedian))
head(dtf)
# # A tibble: 6 x 7
# # Groups: id [2]
# id key value variation median distancefrommedian maxdistancefrommedian
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 a 89.95615 49.58856 89.95615 0.00000 1954.987
# 2 1 b 2044.94307 49.58856 89.95615 1954.98692 1954.987
# 3 1 c 41.23820 49.58856 89.95615 48.71795 1954.987
# 4 2 a 102.63062 31.37407 102.63062 0.00000 1945.507
# 5 2 b 2048.13723 31.37407 102.63062 1945.50661 1945.507
# 6 2 c 65.28121 31.37407 102.63062 37.34941 1945.507
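In tidyr 1.0.0 and later, gather is superseded by pivot_longer. The following is a minimal sketch of the same long-format step, assuming a recent tidyr is installed; the name dtf_long and the trailing ungroup() are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

set.seed(100)
dtf_wide <- data.frame(a = rnorm(10, 100, 20),
                       b = rnorm(10, 2000, 500),
                       c = rnorm(10, 50, 20))

# dtf_long is a hypothetical name used only for this sketch
dtf_long <- dtf_wide %>%
  mutate(id = row_number()) %>%
  # pivot_longer() plays the role of gather(key, value, a:c);
  # its output is already ordered by original row, so arrange(id) is not needed
  pivot_longer(cols = a:c, names_to = "key", values_to = "value") %>%
  group_by(id) %>%
  mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE),
         median = median(value, na.rm = TRUE),
         distancefrommedian = abs(value - median),
         maxdistancefrommedian = max(distancefrommedian, na.rm = TRUE)) %>%
  ungroup()

head(dtf_long)
```

The counterpart of spread in the same tidyr versions is pivot_wider(names_from = key, values_from = value).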
If the variation is greater than 10, remove the rows whose value is furthest from the median (you can change that rule here to remove more rows if needed).
dtf <- dtf %>%
  # For each id, keep all rows where the variation is at most 10
  filter(variation <= 10 |
           # If the variation is greater than 10,
           # drop the rows where the value is furthest from the median
           (variation > 10 & distancefrommedian < maxdistancefrommedian)) %>%
  # Keep only the variables of interest
  select(id, key, value) %>%
  # Compute the variation again (just to check)
  mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE))
head(dtf)
# id key value variation
# <int> <chr> <dbl> <dbl>
# 1 1 a 89.95615 2.181379
# 2 1 c 41.23820 2.181379
# 3 2 a 102.63062 1.572131
# 4 2 c 65.28121 1.572131
# 5 3 a 98.42166 1.781735
# 6 3 c 55.23923 1.781735
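To make the rule easier to follow, here is a rough base R sketch that applies it to the replicates of a single sample. The function name drop_furthest and its threshold argument are hypothetical and do not appear in the answer above. Unlike the filter above, which.max blanks only one value when several are tied for the largest distance.

```r
# Hypothetical helper: apply the "drop the furthest point" rule to one sample
drop_furthest <- function(values, threshold = 10) {
  variation <- max(values, na.rm = TRUE) / min(values, na.rm = TRUE)
  # Nothing to do when the replicates already agree well enough
  if (variation <= threshold) {
    return(values)
  }
  # Otherwise blank out the value that lies furthest from the median
  distance <- abs(values - median(values, na.rm = TRUE))
  values[which.max(distance)] <- NA
  values
}

x <- c(a = 90, b = 2045, c = 41)
# max/min is about 50, so b (furthest from the median 90) becomes NA
drop_furthest(x)
```

This could be applied to every row of the wide table with t(apply(df[, c("a", "b", "c")], 1, drop_furthest)), which keeps the wide layout and marks the omitted point as NA instead of dropping a column.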
Reshape the data to get a wide-format table similar to the original data frame.
dtf_wide2 <- dtf %>%
  spread(key, value)
head(dtf_wide2)
# id variation a c
# <int> <dbl> <dbl> <dbl>
# 1 1 4.385692 89.95615 41.23820
# 2 2 4.385692 102.63062 65.28121
# 3 3 4.385692 98.42166 55.23923
# 4 4 4.385692 117.73570 65.46809
# 5 5 4.385692 102.33943 33.71242
# 6 6 4.385692 106.37260 41.23099
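For the side question about computing the max and min with dplyr and %>%, one possible approach that stays in wide format is to use the base R functions pmax and pmin inside mutate. This is only a sketch; the name df_summary is an assumption made for the example.

```r
library(dplyr)

set.seed(100)
df <- data.frame(a = rnorm(10, 100, 20),
                 b = rnorm(10, 2000, 500),
                 c = rnorm(10, 50, 20))

# df_summary is a hypothetical name used only for this sketch
df_summary <- df %>%
  # pmax()/pmin() work element-wise across vectors, i.e. row by row here
  mutate(max = pmax(a, b, c, na.rm = TRUE),
         min = pmin(a, b, c, na.rm = TRUE),
         variation = max / min)

head(df_summary)
```

With many replicate columns, rowwise() together with c_across() (dplyr 1.0.0 or later) is an alternative that avoids listing every column by name, at the cost of being slower on large tables.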