Run-length encoding and group by

the_darkside

I'm still fairly new to data.table. My goal is to use rle() or rleid() while grouping by multiple variables. rle() is not a typical summary statistic.

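In the test dataset below, I want to count the consecutive repeated records in which a given bike (bike_id) stays at the same location (address), and then group the counts by date and bike_id.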

Some test data:

> dat
                   time bike_id          address
 1: 2017-11-22 15:45:34       1        Waters Rd
 2: 2017-11-22 15:50:16       1        Waters Rd
 3: 2017-11-22 16:00:03       1   Washington Ave
 4: 2017-11-22 16:10:03       1   Washington Ave
 5: 2017-11-22 16:20:02       1   Washington Ave
 6: 2017-11-22 16:30:02       2       Shady Lane
 7: 2017-11-22 16:40:03       2     Comstock Ave
 8: 2017-11-22 16:50:02       2     Comstock Ave
 9: 2017-11-22 17:00:02       2     Comstock Ave
10: 2017-11-22 17:10:02       2     Comstock Ave
11: 2017-11-22 17:20:03       3   Scranton Drive
12: 2017-11-22 17:30:03       3   Scranton Drive
13: 2017-11-22 17:40:03       3   Scranton Drive
14: 2017-11-22 17:50:03       3       Shady Lane
15: 2017-11-22 18:00:04       3   Scranton Drive
16: 2017-11-23 18:10:03       1       Shady Lane
17: 2017-11-23 18:20:03       1       Shady Lane
18: 2017-11-23 18:30:02       1       Shady Lane
19: 2017-11-23 18:40:03       1       Shady Lane
20: 2017-11-23 18:50:03       1       Shady Lane
21: 2017-11-23 19:00:03       2      Lovers Lane
22: 2017-11-23 19:10:02       2 Mulholland Drive
23: 2017-11-23 19:20:03       2 Mulholland Drive
24: 2017-11-23 19:30:02       2 Mulholland Drive
25: 2017-11-23 19:40:03       2 Mulholland Drive
                   time bike_id          address

I know that rle(dat$address) would produce the third column of the desired output below, but I'm not sure how to use rle() with grouping in data.table.

> output
         date bike_id rle
1  2017-11-22       1   2
2  2017-11-22       1   3
3  2017-11-22       2   1
4  2017-11-22       2   4
5  2017-11-22       3   3
6  2017-11-22       3   1
7  2017-11-22       3   1
8  2017-11-23       1   5
9  2017-11-23       2   1
10 2017-11-23       2   4

Any advice would be helpful!

Here is the sample data:

> dput(dat)
structure(list(time = structure(c(1511383534.43394, 1511383816.49785, 
1511384403.94561, 1511385003.17654, 1511385602.47887, 1511386202.99895, 
1511386803.18361, 1511387402.98233, 1511388002.69461, 1511388602.5818, 
1511389203.52712, 1511389803.652, 1511390403.26619, 1511391003.79218, 
1511391604.30061, 1511478603.55103, 1511479203.60366, 1511479802.97132, 
1511480403.45374, 1511481003.12783, 1511481603.34055, 1511482202.62777, 
1511482803.66405, 1511483402.83378, 1511484003.46605), tzone = "", class = c("POSIXct", 
"POSIXt")), bike_id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 
3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2), address = c("Waters Rd", 
"Waters Rd", "Washington Ave", "Washington Ave", "Washington Ave", 
"Shady Lane", "Comstock Ave", "Comstock Ave", "Comstock Ave", 
"Comstock Ave", "Scranton Drive", "Scranton Drive", "Scranton Drive", 
"Shady Lane", "Scranton Drive", "Shady Lane", "Shady Lane", "Shady Lane", 
"Shady Lane", "Shady Lane", "Lovers Lane", "Mulholland Drive", 
"Mulholland Drive", "Mulholland Drive", "Mulholland Drive")), .Names = c("time", 
"bike_id", "address"), class = c("data.table", "data.frame"), row.names = c(NA, 
-25L), .internal.selfref = <pointer: 0x10300d178>)

Edit:

The only case where the code in the answer below produces an incorrect result is this one:

> dput(dat)
structure(list(bike_id = c(1, 1, 1, 1, 1, 1), lon = c(-76.968, 
-76.968, -76.968, -72.141, -72.141, -72.141), lat = c(38.924, 
38.924, 38.924, -39.219, -39.219, -39.219), time = structure(c(1511383534.49273, 
1511383816.52327, 1511384403.97359, 1511385003.20305, 1511385602.50507, 
1511299803.02598), tzone = "", class = c("POSIXct", "POSIXt"))), .Names = c("bike_id", 
"lon", "lat", "time"), row.names = c(NA, -6L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x10300d178>)

> dat
   bike_id     lon     lat                time
1:       1 -76.968  38.924 2017-11-22 15:45:34
2:       1 -76.968  38.924 2017-11-22 15:50:16
3:       1 -76.968  38.924 2017-11-22 16:00:03
4:       1 -72.141 -39.219 2017-11-22 16:10:03
5:       1 -72.141 -39.219 2017-11-22 16:20:02
6:       1 -72.141 -39.219 2017-11-21 16:30:03

> dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(lat, lon))][, grp := NULL][]

This produces:

   bike_id       date n
1:       1 2017-11-22 3
2:       1 2017-11-22 3

Expected:

   bike_id       date n
1:       1 2017-11-22 3
2:       1 2017-11-22 2
3:       1 2017-11-21 1

akrun

We can use rleid from data.table:

dat[, .(date = as.Date(time)[1], n = .N), .(bike_id, grp = rleid(address))][, grp := NULL][]
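
To make the relationship between rle() and rleid() concrete, here is a minimal sketch on a toy vector. The vector and the names toy, runs, run_id and counts are made up for illustration and are not part of the question's data. It shows that counting rows per rleid() group gives the same numbers as the run lengths from rle().

```r
library(data.table)

# Hypothetical toy vector, not taken from the question's data
address <- c("A", "A", "B", "B", "B", "A")

# rle() summarises each run: its length and its value
runs <- rle(address)
runs$lengths  # 2 3 1
runs$values   # "A" "B" "A"

# rleid() labels every element with the id of the run it belongs to
run_id <- rleid(address)
run_id        # 1 1 2 2 2 3

# Counting rows per run id recovers the run lengths
toy <- data.table(address = address)
counts <- toy[, .(n = .N), by = .(grp = rleid(address))]
counts$n      # 2 3 1
```

Because rleid() returns one id per row, it can be used directly as a grouping column next to other variables such as bike_id.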

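If a group contains more than one date (as in the second dataset), the first call, with as.Date(time)[1], keeps only the first date of each group. Suppose we want every distinct date in each group instead: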

dat[, .(date = unique(as.Date(time)), n = .N), keyby = .(bike_id, grp = rleid(lon, lat))]
#   bike_id grp       date n
#1:       1   1 2017-11-22 3
#2:       1   2 2017-11-22 3
#3:       1   2 2017-11-21 3

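However, this returns multiple rows for a group with more than one date. If only one row per group is needed, we can create a list column, which preserves the Date class of its elements: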

dat[, .(date = list(unique(as.Date(time))), n = .N), keyby = .(bike_id, grp = rleid(lon, lat))]
#   bike_id grp                  date n
#1:       1   1            2017-11-22 3
#2:       1   2 2017-11-22,2017-11-21 3

Or paste the unique elements together into a single string.
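
The paste variant might look like the sketch below. It rebuilds the second dataset from the question under the name dat2 (a hypothetical name), and it assumes the timestamps are in UTC, which the original data does not state. The dates of each group are collapsed into one comma-separated string, so each group stays on a single row at the cost of losing the Date class.

```r
library(data.table)

# Rebuild of the second dataset from the question; tz = "UTC" is an assumption
dat2 <- data.table(
  bike_id = 1,
  lon = c(-76.968, -76.968, -76.968, -72.141, -72.141, -72.141),
  lat = c(38.924, 38.924, 38.924, -39.219, -39.219, -39.219),
  time = as.POSIXct(
    c("2017-11-22 15:45:34", "2017-11-22 15:50:16", "2017-11-22 16:00:03",
      "2017-11-22 16:10:03", "2017-11-22 16:20:02", "2017-11-21 16:30:03"),
    tz = "UTC"
  )
)

# One row per run of identical (lon, lat); unique dates collapsed into a string
res <- dat2[
  , .(date = paste(format(unique(as.Date(time, tz = "UTC"))), collapse = ","), n = .N),
  by = .(bike_id, grp = rleid(lon, lat))
]
res$date  # "2017-11-22" "2017-11-22,2017-11-21"
```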

Update

Based on the update to the OP's post with the expected output (from the second dataset), we also need to use the date as a grouping variable:

dat[, .(n = .N), keyby = .(bike_id, date = as.Date(time), grp = rleid(lon, lat))][, grp := NULL][]
#   bike_id       date n
#1:       1 2017-11-21 1
#2:       1 2017-11-22 3
#3:       1 2017-11-22 2
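
As a check against the first dataset, the same idea can be sketched with address as the run variable. The table dat1 below is a compact, hypothetical rebuild of the question's data that keeps only the date part of time (a simplification), and the count column is named rle only to match the desired output.

```r
library(data.table)

# Compact rebuild of the first dataset; only the date part of `time` is kept
dat1 <- data.table(
  date = as.Date(rep(c("2017-11-22", "2017-11-23"), times = c(15, 10))),
  bike_id = rep(c(1, 2, 3, 1, 2), each = 5),
  address = c(
    rep(c("Waters Rd", "Washington Ave"), times = c(2, 3)),
    rep(c("Shady Lane", "Comstock Ave"), times = c(1, 4)),
    "Scranton Drive", "Scranton Drive", "Scranton Drive",
    "Shady Lane", "Scranton Drive",
    rep("Shady Lane", 5),
    rep(c("Lovers Lane", "Mulholland Drive"), times = c(1, 4))
  )
)

# Group by date, bike and run id, count rows, then drop the helper column
out <- dat1[, .(rle = .N), by = .(date, bike_id, grp = rleid(address))][, grp := NULL][]
out$rle  # 2 3 1 4 3 1 1 5 1 4
```

With by =, groups stay in order of first appearance, which matches the row order of the desired output in the question. keyby = would additionally sort the result by the grouping columns.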
