我有一个数据集 df,如下所示:
subject recipient length folder message date edit
80 out 1/2/2020 1:00:01 AM T
80 out 1/2/2020 1:00:05 AM T
hey [email protected],[email protected] 80 out 1/2/2020 1:00:10 AM T
hey [email protected],[email protected] 80 out 1/2/2020 1:00:15 AM T
hey [email protected],[email protected] 80 out 1/2/2020 1:00:30 AM T
some k 900 in jjjjj 1/2/2020 1:00:35 AM F
some k 900 in jjjjj 1/2/2020 1:00:36 AM F
some k 900 in jjjjj 1/2/2020 1:00:37 AM F
hey [email protected],[email protected] 80 draft 1/2/2020 1:02:00 AM T
hey [email protected],[email protected] 80 draft 1/2/2020 1:02:05 AM T
no a 900 in iii 1/2/2020 1:02:10 AM F
no a 900 in iii 1/2/2020 1:02:15 AM F
no a 900 in iii 1/2/2020 1:02:20 AM F
no a 900 in iii 1/2/2020 1:02:25 AM F
hey [email protected],[email protected] 80 draft 1/2/2020 1:03:00 AM T
hey [email protected],[email protected] 80 draft 1/2/2020 1:03:20 AM T
背景和目标
数据集表示用户何时编辑消息、离开并恢复到完全相同的消息。我正在尝试捕获手头消息的总持续时间。我知道我必须首先将消息分组。我希望根据这些条件对消息进行分组:
如果 Folder 列是 == "out" OR "draft",如果 Message 列是 == "",并且如果 Edit == "T",则 Length 列也应该连续相同。所以一旦我有了这些组,我希望找到这些组的持续时间(开始和结束)。例如,第一组持续时间为 29 秒,因为它从 1/2/2020 1:00:01 AM 开始并在 1/2/2020 1:00:30 AM 结束。第二组将于 2020 年 1 月 2 日 1:02:00 开始,并在凌晨 1:02:05 结束。最后,第三组于 1/2/2020 1:03:00 AM 开始,并于 1:03:20 AM 结束。此外,由于这些组都属于 SAME 消息,我想通过使用以下逻辑将它们完全链接起来: 组最后一行中的主题、收件人和长度内容与下一组主题的第一行匹配, 收件人和长度,
我希望输出如下所示:
期望输出
Start End Duration Group Subject Length
1/2/2020 1:00:10 AM 1/2/2020 1:00:30 AM 20 A hey 80
1/2/2020 1:02:00 AM 1/2/2020 1:02:05 AM 5 A hey 80
1/2/2020 1:03:10 AM 1/2/2020 1:03:20 AM 20 A hey 80
这是我的 dput
structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("",
"hey", "no", "some"), class = "factor"), Recipient = structure(c(1L,
1L, 5L, 5L, 5L, 4L, 4L, 4L, 5L, 5L, 3L, 3L, 3L, 3L, 5L, 5L, 1L,
2L), .Label = c("", " ", "a", "k", "[email protected],[email protected]"
), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L, 900L,
900L, 900L, 80L, 80L, 900L, 900L, 900L, 900L, 80L, 80L, NA, NA
), Folder = structure(c(4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 2L, 2L,
3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("", "draft", "in",
"out"), class = "factor"), Message = structure(c(1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L), .Label = c("",
"jjjjjjj", "llll"), class = "factor"), Date = structure(c(2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 1L, 1L), .Label = c("", "1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM",
"1/2/2020 1:00:10 AM", "1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM",
"1/2/2020 1:00:35 AM", "1/2/2020 1:00:36 AM", "1/2/2020 1:00:37 AM",
"1/2/2020 1:02:00 AM", "1/2/2020 1:02:05 AM", "1/2/2020 1:02:10 AM",
"1/2/2020 1:02:15 AM", "1/2/2020 1:02:20 AM", "1/2/2020 1:02:25 AM",
"1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"), class = "factor"),
Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, NA, NA
)), class = "data.frame", row.names = c(NA, -18L))
这就是我正在做的事情:
df <- df %>%
mutate(Message = ifelse(is.na(Message), '', Message)) %>%
mutate(
cond = Edit & Folder %in% c('out', 'draft') & Message == ''
) %>%
mutate(grp = cumsum(!cond)) %>%
filter(cond, Subject = '')
df$Date <- as.POSIXct(as.character(df$Date), format = '%m/%d/%Y %H:%M:%OS')
get_group_number = function(){
i = 0
function(){
i <<- i+1
i
}
}
group_number = get_group_number()
df <- df %>% group_by(Subject, Recipient, Length) %>% mutate(Group = group_number())
df %>% group_by(grp) %>%
summarise(Start = min(Date), End = max(Date),
Duration = End - Start, Group = unique(Group)[1])
非常感谢任何帮助或建议。
df %>%
# The original data was loaded as factors, which have their uses, but
# converting those to characters will be simpler to work with here.
mutate_if(is.factor, as.character) %>%
# I'm replacing NA in Subj & Recip with an empty string, and trimming
# excess spaces from the start and end. One of the recipients is " "
# but I assume that's functionally the same as blank.
mutate_at(c("Subject", "Recipient"), ~if_else(is.na(.), "", stringr::str_trim(.))) %>%
filter(Subject != '') %>%
mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %H:%M:%OS')) %>%
mutate(cond = Edit & Folder %in% c('out', 'draft') & Message == '') %>%
mutate(segment = cumsum(!cond)) %>%
filter(cond) %>% # EDIT: Added to exclude rows matching cond
# Get summary stats for each segment
group_by(Subject, Recipient, Length, segment) %>%
summarize(Start = min(Date),
End = max(Date),
Duration = End - Start) %>%
# This counts the number of times that these columns don't match their
# predecessor. TRUE = 1, so we get 1 when anything changes.
# Look at ?lag for more on what those parameters mean.
mutate(new_group = (Subject != lag(Subject, 1, "")) *
(Recipient != lag(Recipient, 1, "")) *
(Length != lag(Length, 1, ""))) %>%
ungroup() %>%
mutate(group = LETTERS[cumsum(new_group)])
# A tibble: 3 x 9
Subject Recipient Length segment Start End Duration new_group group
<chr> <chr> <int> <int> <dttm> <dttm> <drtn> <int> <chr>
1 hey [email protected],[email protected] 80 0 2020-01-02 01:00:10 2020-01-02 01:00:30 20 secs 1 A
2 hey [email protected],[email protected] 80 3 2020-01-02 01:02:00 2020-01-02 01:02:05 5 secs 0 A
3 hey [email protected],[email protected] 80 7 2020-01-02 01:03:00 2020-01-02 01:03:20 20 secs 0 A
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句