I am still new to the functionality in data.table. My goal is to use rle() or rleid() while grouping by several variables at once. rle() is not a typical summary statistic.
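In the test dataset below, I want to count the consecutive repeated records in which a unique bike (bike_id) stays at the same address, and then group the counts by date and bike_id.
Some test data: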
> dat
time bike_id address
1: 2017-11-22 15:45:34 1 Waters Rd
2: 2017-11-22 15:50:16 1 Waters Rd
3: 2017-11-22 16:00:03 1 Washington Ave
4: 2017-11-22 16:10:03 1 Washington Ave
5: 2017-11-22 16:20:02 1 Washington Ave
6: 2017-11-22 16:30:02 2 Shady Lane
7: 2017-11-22 16:40:03 2 Comstock Ave
8: 2017-11-22 16:50:02 2 Comstock Ave
9: 2017-11-22 17:00:02 2 Comstock Ave
10: 2017-11-22 17:10:02 2 Comstock Ave
11: 2017-11-22 17:20:03 3 Scranton Drive
12: 2017-11-22 17:30:03 3 Scranton Drive
13: 2017-11-22 17:40:03 3 Scranton Drive
14: 2017-11-22 17:50:03 3 Shady Lane
15: 2017-11-22 18:00:04 3 Scranton Drive
16: 2017-11-23 18:10:03 1 Shady Lane
17: 2017-11-23 18:20:03 1 Shady Lane
18: 2017-11-23 18:30:02 1 Shady Lane
19: 2017-11-23 18:40:03 1 Shady Lane
20: 2017-11-23 18:50:03 1 Shady Lane
21: 2017-11-23 19:00:03 2 Lovers Lane
22: 2017-11-23 19:10:02 2 Mulholland Drive
23: 2017-11-23 19:20:03 2 Mulholland Drive
24: 2017-11-23 19:30:02 2 Mulholland Drive
25: 2017-11-23 19:40:03 2 Mulholland Drive
time bike_id address
I know that rle(dat$address) would produce the third column of the desired output below, but I am not sure how to use rle() with grouping in data.table.
> output
date bike_id rle
1 2017-11-22 1 2
2 2017-11-22 1 3
3 2017-11-22 2 1
4 2017-11-22 2 4
5 2017-11-22 3 3
6 2017-11-22 3 1
7 2017-11-22 3 1
8 2017-11-23 1 5
9 2017-11-23 2 1
10 2017-11-23 2 4
Any advice would be helpful!
Here is the sample data:
> dput(dat)
structure(list(time = structure(c(1511383534.43394, 1511383816.49785,
1511384403.94561, 1511385003.17654, 1511385602.47887, 1511386202.99895,
1511386803.18361, 1511387402.98233, 1511388002.69461, 1511388602.5818,
1511389203.52712, 1511389803.652, 1511390403.26619, 1511391003.79218,
1511391604.30061, 1511478603.55103, 1511479203.60366, 1511479802.97132,
1511480403.45374, 1511481003.12783, 1511481603.34055, 1511482202.62777,
1511482803.66405, 1511483402.83378, 1511484003.46605), tzone = "", class = c("POSIXct",
"POSIXt")), bike_id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), address = c("Waters Rd",
"Waters Rd", "Washington Ave", "Washington Ave", "Washington Ave",
"Shady Lane", "Comstock Ave", "Comstock Ave", "Comstock Ave",
"Comstock Ave", "Scranton Drive", "Scranton Drive", "Scranton Drive",
"Shady Lane", "Scranton Drive", "Shady Lane", "Shady Lane", "Shady Lane",
"Shady Lane", "Shady Lane", "Lovers Lane", "Mulholland Drive",
"Mulholland Drive", "Mulholland Drive", "Mulholland Drive")), .Names = c("time",
"bike_id", "address"), class = c("data.table", "data.frame"), row.names = c(NA,
-25L), .internal.selfref = <pointer: 0x10300d178>)
Edit:
Here is the only case in which the code from the answer below produces the wrong result:
> dput(dat)
structure(list(bike_id = c(1, 1, 1, 1, 1, 1), lon = c(-76.968,
-76.968, -76.968, -72.141, -72.141, -72.141), lat = c(38.924,
38.924, 38.924, -39.219, -39.219, -39.219), time = structure(c(1511383534.49273,
1511383816.52327, 1511384403.97359, 1511385003.20305, 1511385602.50507,
1511299803.02598), tzone = "", class = c("POSIXct", "POSIXt"))), .Names = c("bike_id",
"lon", "lat", "time"), row.names = c(NA, -6L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x10300d178>)
> dat
bike_id lon lat time
1: 1 -76.968 38.924 2017-11-22 15:45:34
2: 1 -76.968 38.924 2017-11-22 15:50:16
3: 1 -76.968 38.924 2017-11-22 16:00:03
4: 1 -72.141 -39.219 2017-11-22 16:10:03
5: 1 -72.141 -39.219 2017-11-22 16:20:02
6: 1 -72.141 -39.219 2017-11-21 16:30:03
> dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(lat, lon))][, grp := NULL][]
Produces:
bike_id date n
1: 1 2017-11-22 3
2: 1 2017-11-22 3
Expected:
bike_id date n
1: 1 2017-11-22 3
2: 1 2017-11-22 2
3: 1 2017-11-21 1
We can use rleid in data.table:
dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(address))][, grp := NULL][]
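To make the difference between rle() and rleid() concrete, here is a minimal sketch on a made-up vector (`x` and `toy` are hypothetical names, not part of the OP's data). It is meant only to show why rleid() fits into a grouping expression while rle() does not.

```r
library(data.table)

# Hypothetical toy vector, not the OP's data
x <- c("a", "a", "b", "b", "b", "a")

# rle() returns the run lengths and values as a separate, shorter object
runs <- rle(x)
runs$lengths
runs$values

# rleid() returns one run id per element, so it lines up with the rows
# and can be used directly as a grouping variable
ids <- rleid(x)

# Counting rows per run id recovers the same lengths that rle() reports
toy <- data.table(x = x)
counts <- toy[, .N, by = .(grp = rleid(x))]$N
```

Because the run id has the same length as the table, it can sit next to other grouping columns such as bike_id.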
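If there is more than one 'date' per group (as in the second dataset), the previous code keeps only the first 'date' (because of the [1]). Suppose we want to get both 'date' values: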
dat[, .(date = unique(as.Date(time)), n = .N),, .(bike_id, grp = rleid(lon, lat))]
# bike_id grp date n
#1: 1 1 2017-11-22 3
#2: 1 2 2017-11-22 3
#3: 1 2 2017-11-21 3
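However, this also returns multiple rows per group. If only one row per group is needed, we can create a list column (which preserves the class):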
dat[, .(date = list(unique(as.Date(time))), n = .N),, .(bike_id, grp = rleid(lon, lat))]
# bike_id grp date n
#1: 1 1 2017-11-22 3
#2: 1 2 2017-11-22,2017-11-21 3
Or paste the unique elements together.
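The paste option is not shown with code, so the following is a minimal sketch of how it might look. The table `dat2` is rebuilt by hand to resemble the second dataset, the timestamps are assumed to be UTC, and the ", " separator is an arbitrary choice.

```r
library(data.table)

# Hand-built stand-in for the second dataset (times assumed to be UTC)
dat2 <- data.table(
  bike_id = c(1, 1, 1, 1, 1, 1),
  lon = c(-76.968, -76.968, -76.968, -72.141, -72.141, -72.141),
  lat = c(38.924, 38.924, 38.924, -39.219, -39.219, -39.219),
  time = as.POSIXct(c("2017-11-22 15:45:34", "2017-11-22 15:50:16",
                      "2017-11-22 16:00:03", "2017-11-22 16:10:03",
                      "2017-11-22 16:20:02", "2017-11-21 16:30:03"),
                    tz = "UTC")
)

# Collapse the distinct dates of each run into a single string,
# giving one row per (bike_id, run of identical coordinates)
res <- dat2[, .(date = paste(unique(as.Date(time, tz = "UTC")), collapse = ", "),
                n = .N),
            by = .(bike_id, grp = rleid(lon, lat))]
res
```

Unlike the list column, the pasted column is plain character, so the Date class is lost.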
Based on the update in the OP's post about the expected output (for the second dataset), we also need to use 'date' as a grouping variable:
dat[, .(n = .N),, .(bike_id, date = as.Date(time), grp = rleid(lon, lat))][, grp := NULL][]
# bike_id date n
#1: 1 2017-11-21 1
#2: 1 2017-11-22 3
#3: 1 2017-11-22 2
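The rows above come out sorted by date, while the OP's expected output lists them in order of appearance. This looks consistent with the grouping expression being matched to the keyby position because of the empty argument before it, though that reading is an assumption. The sketch below, on a hand-built stand-in `dat2` with times assumed to be UTC, compares an explicit `by =` with an explicit `keyby =`.

```r
library(data.table)

# Hand-built stand-in for the second dataset (times assumed to be UTC)
dat2 <- data.table(
  bike_id = c(1, 1, 1, 1, 1, 1),
  lon = c(-76.968, -76.968, -76.968, -72.141, -72.141, -72.141),
  lat = c(38.924, 38.924, 38.924, -39.219, -39.219, -39.219),
  time = as.POSIXct(c("2017-11-22 15:45:34", "2017-11-22 15:50:16",
                      "2017-11-22 16:00:03", "2017-11-22 16:10:03",
                      "2017-11-22 16:20:02", "2017-11-21 16:30:03"),
                    tz = "UTC")
)

# by = keeps groups in order of first appearance
res_by <- dat2[, .(n = .N),
               by = .(bike_id, date = as.Date(time, tz = "UTC"),
                      grp = rleid(lon, lat))][, grp := NULL][]

# keyby = sorts the result by the grouping columns
res_keyby <- dat2[, .(n = .N),
                  keyby = .(bike_id, date = as.Date(time, tz = "UTC"),
                            grp = rleid(lon, lat))][, grp := NULL][]

res_by$n     # counts in order of appearance
res_keyby$n  # counts sorted by bike_id, date, grp
```

Passing `tz` to as.Date() explicitly avoids depending on the session time zone when a timestamp falls close to midnight.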