I have two data frames in R:
city price bedroom
San Jose 2000 1
Barstow 1000 1
NA 1500 1
Code to recreate it:
data = data.frame(city = c('San Jose', 'Barstow'), price = c(2000,1000, 1500), bedroom = c(1,1,1))
and:
Name Density
San Jose 5358
Barstow 547
Code to recreate it:
population_density = data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
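I want to create an additional column, city_type, in the data dataset based on a condition: if a city's population density is above 1000 it is Urban, if it is below 1000 it is Suburb, and NA stays NA.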
city price bedroom city_type
San Jose 2000 1 Urban
Barstow 1000 1 Suburb
NA 1500 1 NA
I am using a for loop with conditional flow:
for (row in 1:length(data)) {
if (is.na(data[row,'city'])) {
data[row, 'city_type'] = NA
} else if (population[population$Name == data[row,'city'],]$Density>=1000) {
data[row, 'city_type'] = 'Urban'
} else {
data[row, 'city_type'] = 'Suburb'
}
}
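The for loop runs without errors on my original dataset of more than 20000 observations; however, it produces a lot of wrong results (mostly NA).
What is going wrong here? Is there a better way to get the desired result?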
I really like dplyr for this kind of join/filter/mutate pipeline workflow. So here is my suggestion:
library(dplyr)
# The question's snippet lists only two cities for three prices, so I added the third city (NA) here
data <- data.frame(city = c('San Jose', 'Barstow', NA), price = c(2000,1000, 500), bedroom = c(1,1,1))
population <- data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
data %>%
# join the two dataframes by matching up the city name columns
left_join(population, by = c("city" = "Name")) %>%
# add your new column based on the desired condition
mutate(
city_type = ifelse(Density >= 1000, "Urban", "Suburb")
)
Output:
city price bedroom Density city_type
1 San Jose 2000 1 5358 Urban
2 Barstow 1000 1 547 Suburb
3 <NA> 500 1 NA <NA>
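As for what went wrong in the loop: length() on a data frame returns the number of columns, not rows, so 1:length(data) only visits the first three rows and every other row of the new column stays NA. Using seq_len(nrow(data)) would iterate over all rows. If you would rather not depend on dplyr, here is a minimal base R sketch of the same lookup. It assumes the lookup table is called population (as in the loop and the answer above) and that each city name appears at most once in it; the intermediate variable density is just an illustrative name.

# Sample data from the question, with the third city set to NA
data <- data.frame(city = c("San Jose", "Barstow", NA),
                   price = c(2000, 1000, 1500),
                   bedroom = c(1, 1, 1))
population <- data.frame(Name = c("San Jose", "Barstow"),
                         Density = c(5358, 547))

length(data)  # number of columns
nrow(data)    # number of rows (they only coincide here because the sample is tiny)

# match() gives, for each city, its row position in population (NA if not found)
density <- population$Density[match(data$city, population$Name)]

# ifelse() is vectorised and propagates NA, so no explicit NA branch is needed
data$city_type <- ifelse(density >= 1000, "Urban", "Suburb")

Cities that are NA, or that are missing from population, end up with NA in city_type, which matches the expected result in the question.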