给定一个数据框,我如下:
import pandas as pd
import datetime
df = pd.DataFrame([[2, 3],[2, 1],[2, 1],[3, 4],[3, 1],[3, 1],[3, 1],[3, 1],[4, 2],[4, 1],[4, 1],[4, 1]], columns=['id', 'count'])
df['date'] = [datetime.datetime.strptime(x,'%Y-%m-%d %H:%M:%S') for x in
['2016-12-28 15:17:00','2016-12-28 15:29:00','2017-01-05 09:32:00','2016-12-03 18:10:00','2016-12-10 11:31:00',
'2016-12-14 09:32:00','2016-12-18 09:31:00','2016-12-22 09:32:00','2016-11-28 15:31:00','2016-12-01 16:11:00',
'2016-12-10 09:31:00','2016-12-13 12:06:00']]
我想根据条件使用grouby:对于数据相同的数据id
,如果它们的日期差小于4天,则将它们视为相同的组,否则创建一个新列new_id
,然后我将对grouby进行计数并基于计算总和new_id
。
我可以通过以下代码得到结果,但是它太慢了,我怎样才能使其更有效?
df.sort_values(by=['id', 'date'], ascending = [True, False], inplace = True)
df['id'] = df['id'].astype(str)
df['id_up'] = df['id'].shift(-1)
df['id_down'] = df['id'].shift(1)
df['date_up'] = df['date'].shift(-1)
df['date_diff'] = df.apply(lambda df: (df['date'] - df['date_up'])/datetime.timedelta(days=1) if df['id'] == df['id_up'] else 0, axis=1)
df = df.reset_index()
df = df.drop(['index','id_up','id_down','date_up'],axis=1)
df['new'] = ''
for i in range(df.shape[0]):
if i == 0:
df.loc[i,'new'] = 1
else:
if df.loc[i,'id'] != df.loc[i-1,'id']:
df.loc[i,'new'] = 1
else:
if df.loc[i-1,'date_diff'] <= 4:
df.loc[i,'new'] = df.loc[i-1,'new']
else:
df.loc[i,'new'] = df.loc[i-1,'new'] + 1
df['new'] = df['id'].astype(str) + '-' + df['new'].astype(str)
df1 = df.groupby('new')['date'].min()
df1 = df1.reset_index()
df1.rename(columns={"date": "first_date"}, inplace=True)
df = pd.merge(df, df1, on='new')
df1 = df.groupby('new')['date'].max()
df1 = df1.reset_index()
df1.rename(columns={"date": "last_date"}, inplace=True)
df = pd.merge(df, df1, on='new')
df1 = df.groupby('new')['count'].sum()
df1 = df1.reset_index()
df1.rename(columns={"count": "count_sum"}, inplace=True)
df = pd.merge(df, df1, on='new')
print(df)
出:
id count date date_diff new first_date last_date count_sum
0 2 1 2017-01-05 09:32:00 7.752083 2-1 2017-01-05 09:32:00 2017-01-05 09:32:00 1
1 2 1 2016-12-28 15:29:00 0.008333 2-2 2016-12-28 15:17:00 2016-12-28 15:29:00 4
2 2 3 2016-12-28 15:17:00 0.000000 2-2 2016-12-28 15:17:00 2016-12-28 15:29:00 4
3 3 1 2016-12-22 09:32:00 4.000694 3-1 2016-12-22 09:32:00 2016-12-22 09:32:00 1
4 3 1 2016-12-18 09:31:00 3.999306 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
5 3 1 2016-12-14 09:32:00 3.917361 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
6 3 1 2016-12-10 11:31:00 6.722917 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
7 3 4 2016-12-03 18:10:00 0.000000 3-3 2016-12-03 18:10:00 2016-12-03 18:10:00 4
8 4 1 2016-12-13 12:06:00 3.107639 4-1 2016-12-10 09:31:00 2016-12-13 12:06:00 2
9 4 1 2016-12-10 09:31:00 8.722222 4-1 2016-12-10 09:31:00 2016-12-13 12:06:00 2
10 4 1 2016-12-01 16:11:00 3.027778 4-2 2016-11-28 15:31:00 2016-12-01 16:11:00 3
11 4 2 2016-11-28 15:31:00 0.000000 4-2 2016-11-28 15:31:00 2016-12-01 16:11:00 3
要获取该new
列,您可以执行以下操作:
df.sort_values(by=['id', 'date'], ascending = [True, False], inplace = True)
groups = df.groupby('id')
# mask where the date differences exceed threshold
df['new'] = groups.date.diff().abs() > pd.to_timedelta(4, unit='D')
# group within each id
df['new'] = groups['new'].cumsum().astype(int) + 1
# concatenate `id` and `new`:
df['new'] = df['id'].astype(str) + '-' + df['new'].astype(str)
# get other columns with groupby
new_groups = df.groupby('new')
df['first_date'] = new_groups.date.transform('min')
df['last_date'] = new_groups.date.transform('max')
df['count_sum'] = new_groups['count'].transform('sum')
输出:
id count date new first_date last_date count_sum
-- ---- ------- ------------------- ----- ------------------- ------------------- -----------
0 2 1 2017-01-05 09:32:00 2-1 2017-01-05 09:32:00 2017-01-05 09:32:00 1
1 2 1 2016-12-28 15:29:00 2-2 2016-12-28 15:17:00 2016-12-28 15:29:00 4
2 2 3 2016-12-28 15:17:00 2-2 2016-12-28 15:17:00 2016-12-28 15:29:00 4
3 3 1 2016-12-22 09:32:00 3-1 2016-12-22 09:32:00 2016-12-22 09:32:00 1
4 3 1 2016-12-18 09:31:00 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
5 3 1 2016-12-14 09:32:00 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
6 3 1 2016-12-10 11:31:00 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
7 3 4 2016-12-03 18:10:00 3-3 2016-12-03 18:10:00 2016-12-03 18:10:00 4
8 4 1 2016-12-13 12:06:00 4-1 2016-12-10 09:31:00 2016-12-13 12:06:00 2
9 4 1 2016-12-10 09:31:00 4-1 2016-12-10 09:31:00 2016-12-13 12:06:00 2
10 4 1 2016-12-01 16:11:00 4-2 2016-11-28 15:31:00 2016-12-01 16:11:00 3
11 4 2 2016-11-28 15:31:00 4-2 2016-11-28 15:31:00 2016-12-01 16:11:00 3
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句