根据特定条件对数据进行分组并在 R 或 Python 中查找持续时间

林恩

我有一个数据集 df,如下所示:

 subject    recipient                  length   folder    message  date                       edit
                                        80      out                1/2/2020 1:00:01 AM        T                                    
                                        80      out                1/2/2020 1:00:05 AM        T                   
hey        [email protected],[email protected]    80      out                1/2/2020 1:00:10 AM        T
hey        [email protected],[email protected]    80      out                1/2/2020 1:00:15 AM        T
hey        [email protected],[email protected]    80      out                1/2/2020 1:00:30 AM        T
some       k                           900      in       jjjjj     1/2/2020 1:00:35 AM        F
some       k                           900      in       jjjjj     1/2/2020 1:00:36 AM        F 
some       k                           900      in       jjjjj     1/2/2020 1:00:37 AM        F
hey        [email protected],[email protected]    80    draft                1/2/2020 1:02:00 AM        T
hey        [email protected],[email protected]    80    draft                1/2/2020 1:02:05 AM        T    
no         a                          900       in        iii      1/2/2020 1:02:10 AM        F
no         a                          900       in        iii      1/2/2020 1:02:15 AM        F
no         a                          900       in        iii      1/2/2020 1:02:20 AM        F
no         a                          900       in        iii      1/2/2020 1:02:25 AM        F
hey        [email protected],[email protected]   80    draft                 1/2/2020 1:03:00 AM        T
hey        [email protected],[email protected]   80    draft                 1/2/2020 1:03:20 AM        T

背景和目标

数据集表示用户何时编辑消息、离开并恢复到完全相同的消息。我正在尝试捕获手头消息的总持续时间。我知道我必须首先将消息分组。我希望根据这些条件对消息进行分组:

如果 Folder 列是 == "out" OR "draft",如果 Message 列是 == "",并且如果 Edit == "T",则 Length 列也应该连续相同。所以一旦我有了这些组,我希望找到这些组的持续时间(开始和结束)。例如,第一组持续时间为 29 秒,因为它从 1/2/2020 1:00:01 AM 开始并在 1/2/2020 1:00:30 AM 结束。第二组将于 2020 年 1 月 2 日 1:02:00 开始,并在凌晨 1:02:05 结束。最后,第三组于 1/2/2020 1:03:00 AM 开始,并于 1:03:20 AM 结束。此外,由于这些组都属于 SAME 消息,我想通过使用以下逻辑将它们完全链接起来: 组最后一行中的主题、收件人和长度内容与下一组主题的第一行匹配, 收件人和长度,

我希望输出如下所示:

期望输出

 Start                  End                        Duration          Group  Subject  Length
 1/2/2020 1:00:10 AM    1/2/2020 1:00:30 AM        20                A      hey       80
 1/2/2020 1:02:00 AM    1/2/2020 1:02:05 AM        5                 A      hey       80
 1/2/2020 1:03:10 AM    1/2/2020 1:03:20 AM        20                A      hey       80

这是我的 dput

  structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 4L, 
  4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("", 
 "hey", "no", "some"), class = "factor"), Recipient = structure(c(1L, 
 1L, 5L, 5L, 5L, 4L, 4L, 4L, 5L, 5L, 3L, 3L, 3L, 3L, 5L, 5L, 1L, 
 2L), .Label = c("", " ", "a", "k", "[email protected],[email protected]"
  ), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L, 900L, 
 900L, 900L, 80L, 80L, 900L, 900L, 900L, 900L, 80L, 80L, NA, NA
 ), Folder = structure(c(4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 2L, 2L, 
 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("", "draft", "in", 
 "out"), class = "factor"), Message = structure(c(1L, 1L, 1L, 
 1L, 1L, 2L, 2L, 2L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L), .Label = c("", 
 "jjjjjjj", "llll"), class = "factor"), Date = structure(c(2L, 
 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
17L, 1L, 1L), .Label = c("", "1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM", 
"1/2/2020 1:00:10 AM", "1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM", 
"1/2/2020 1:00:35 AM", "1/2/2020 1:00:36 AM", "1/2/2020 1:00:37 AM", 
"1/2/2020 1:02:00 AM", "1/2/2020 1:02:05 AM", "1/2/2020 1:02:10 AM", 
"1/2/2020 1:02:15 AM", "1/2/2020 1:02:20 AM", "1/2/2020 1:02:25 AM", 
"1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"), class = "factor"), 
Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, 
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, NA, NA
)), class = "data.frame", row.names = c(NA, -18L))

这就是我正在做的事情:

   df <- df %>% 
    mutate(Message = ifelse(is.na(Message), '', Message)) %>% 
    mutate(
    cond = Edit & Folder %in% c('out', 'draft') & Message == ''
       ) %>% 
    mutate(grp = cumsum(!cond)) %>% 
   filter(cond, Subject = '')

   df$Date <- as.POSIXct(as.character(df$Date), format = '%m/%d/%Y %H:%M:%OS')

   get_group_number = function(){
   i = 0
   function(){
   i <<- i+1
     i
       }
   }
  group_number = get_group_number()
  df <- df %>% group_by(Subject, Recipient, Length) %>% mutate(Group = group_number())

  df %>% group_by(grp) %>% 
  summarise(Start = min(Date), End = max(Date),
         Duration = End - Start, Group = unique(Group)[1])

非常感谢任何帮助或建议。

乔恩·斯普林
df %>% 
  # The original data was loaded as factors, which have their uses, but
  #   converting those to characters will be simpler to work with here.
  mutate_if(is.factor, as.character) %>% 
  # I'm replacing NA in Subj & Recip with an empty string, and trimming 
  #    excess spaces from the start and end. One of the recipients is " "
  #    but I assume that's functionally the same as blank.
  mutate_at(c("Subject", "Recipient"), ~if_else(is.na(.), "", stringr::str_trim(.))) %>%
  filter(Subject != '') %>%
  mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %H:%M:%OS')) %>%
  mutate(cond = Edit & Folder %in% c('out', 'draft') & Message == '') %>% 
  mutate(segment = cumsum(!cond)) %>%
  filter(cond) %>%   # EDIT: Added to exclude rows matching cond

  # Get summary stats for each segment
  group_by(Subject, Recipient, Length, segment) %>%
  summarize(Start = min(Date),
            End = max(Date),
            Duration = End - Start) %>%

  # This counts the number of times that these columns don't match their
  #   predecessor. TRUE = 1, so we get 1 when anything changes.
  #   Look at ?lag for more on what those parameters mean.
  mutate(new_group = (Subject   != lag(Subject, 1, "")) *
                     (Recipient != lag(Recipient, 1, "")) *
                     (Length    != lag(Length, 1, ""))) %>%
  ungroup() %>%
  mutate(group = LETTERS[cumsum(new_group)])



# A tibble: 3 x 9
  Subject Recipient                   Length segment Start               End                 Duration new_group group
  <chr>   <chr>                        <int>   <int> <dttm>              <dttm>              <drtn>       <int> <chr>
1 hey     [email protected],[email protected]     80       0 2020-01-02 01:00:10 2020-01-02 01:00:30 20 secs          1 A    
2 hey     [email protected],[email protected]     80       3 2020-01-02 01:02:00 2020-01-02 01:02:05  5 secs          0 A    
3 hey     [email protected],[email protected]     80       7 2020-01-02 01:03:00 2020-01-02 01:03:20 20 secs          0 A    

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章

如何对时间实例进行分组,找到它的持续时间并根据特定条件(在 R 或 Python 中)对它们求和

根据条件对数据进行子集和重命名,并在Dplyr中查找持续时间

R 根据特定条件对所有列中的数据进行分组并减去行

从熊猫python中的数据帧创建持续时间

在 Python 中查找成功交付的平均持续时间

如何根据特定条件将行分组?(R 或 Python)

如何使用R计算时间序列数据中的持续时间?

计算时间序列数据帧中事件的持续时间(python 2.7)

通过按ID和其他特定条件分组查找持续时间

在python中通过timedelta的持续时间

使用Python进行持续时间/会话的时间

查找时间序列数据中的总持续时间,没有重叠

如何根据python中的特定条件对值进行排序?

使用熊猫在python中查找两个日期之间的持续时间

R使用data.table中的条件来找到波超过给定值的频率和持续时间

在Python中按多个条件对数据进行分组

创建组并获取持续时间不大于特定数字(R,Dplyr)的持续时间

根据R中的某些条件对数据帧进行子集

计算时间序列python中事件的持续时间

查找每个项目的特定事件的总持续时间

如何根据特定列对数据帧进行分组,并将聚合添加为R中的单独列?

如何根据特定条件对列中的字段进行分组

根据SQL Server中的特定条件对行进行分组

Python中datetime列中多个事件的持续时间

在 R 中跨共享持续时间提取共同异常

从R中的角色中提取持续时间

计算 r 中阈值以下的持续时间

如何根据python中的时间变化对数据集进行分类或重新分组

如何在python中获得单调的持续时间?