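I'm having trouble filtering out duplicate rows by the key in the ticker column, subject to conditions on the lowest values (int & dates). The initial dataset looks like this: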
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
2 AA ART 9/30/16 12/1/16 2/8/18 -131
3 AA ART 9/30/16 2/8/17 12/1/17 -62
4 AA ART 9/30/16 2/8/17 2/8/18 -131
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
37 ABMT ART 9/30/16 2/14/17 2/16/18 -139
38 ABMT ART 9/30/16 2/16/17 2/14/18 -137
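Note that the AA value is repeated 4 times and the ABMT value 3 times. I want to filter out some rows based on two conditions. The first condition keeps only the rows with the earliest date0 for each ticker, so the dataset now looks like this: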
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
2 AA ART 9/30/16 12/1/16 2/8/18 -131
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
37 ABMT ART 9/30/16 2/14/17 2/16/18 -139
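To make the first condition concrete, here is a minimal sketch, assuming only the ticker, date0 and diff columns matter and using a subset of the rows above. The names df, earliest and step1 are illustrative, not taken from the original code.

```python
import pandas as pd

# Subset of the rows from the question (AA and ABMT are the duplicated tickers).
df = pd.DataFrame(
    {
        "ticker": ["A", "AA", "AA", "AA", "AA", "ABMT", "ABMT", "ABMT"],
        "date0": ["12/20/16", "12/1/16", "12/1/16", "2/8/17",
                  "2/8/17", "2/14/17", "2/14/17", "2/16/17"],
        "diff": [-81, -62, -131, -62, -131, -137, -139, -137],
    },
    index=[0, 1, 2, 3, 4, 36, 37, 38],
)

# Parse date0 so comparisons are chronological rather than string-based.
df["date0"] = pd.to_datetime(df["date0"], format="%m/%d/%y")

# Condition 1: keep only rows whose date0 is the earliest one for their ticker.
earliest = df.groupby("ticker")["date0"].transform("min")
step1 = df[df["date0"] == earliest]
print(step1)
```

Rows 3, 4 and 38 are dropped because their date0 is later than the earliest date0 of the same ticker.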
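The second condition drops the rows with the lowest diff values, keeping the highest diff for each ticker, to get the final result. The fully filtered dataset now looks like this: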
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
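The second condition could be sketched as follows, assuming the input is the intermediate result after the first condition (again reduced to a few rows and columns). The names step1, keep and final are illustrative.

```python
import pandas as pd

# Rows that survive the first condition (subset of the table above).
step1 = pd.DataFrame(
    {
        "ticker": ["A", "AA", "AA", "ABMT", "ABMT"],
        "diff": [-81, -62, -131, -137, -139],
    },
    index=[0, 1, 2, 36, 37],
)

# Condition 2: for each ticker, find the index label of the row with the
# highest diff, then select those rows and restore the original order.
keep = step1.groupby("ticker")["diff"].idxmax()
final = step1.loc[keep].sort_index()
print(final)
```

If two rows of the same ticker tie on diff, idxmax keeps the first one it encounters.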
Thanks for your help.
Edit:
After Wen's answer, I updated my code to the following:
import pandas as pd
data = pd.read_csv('input_transform.csv')
print(data)
Which returns:
Unnamed: 0 ticker dim cal_date date0 date1 diff
0 0 A ART 9/30/16 12/20/16 12/20/17 -81
1 1 AA ART 9/30/16 12/1/16 12/1/17 -62
2 2 AA ART 9/30/16 12/1/16 2/8/18 -131
3 3 AA ART 9/30/16 2/8/17 12/1/17 -62
4 4 AA ART 9/30/16 2/8/17 2/8/18 -131
5 5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 9 AAME ART 9/30/16 11/14/16 11/14/17 -45
10 36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
11 37 ABMT ART 9/30/16 2/14/17 2/16/18 -139
12 38 ABMT ART 9/30/16 2/16/17 2/14/18 -137
Then I added:
# making sure the date is in date format.
data['date0'] = pd.to_datetime(data['date0'].replace("'", ""))
# making sure the diff is in float or int format
data['diff'] = data['diff'].astype(float)
data.sort_values(['date0', 'diff'], ascending=[False, True]).drop_duplicates('ticker', keep='last').sort_index()
print(data)
Which returns:
Unnamed: 0 ticker dim cal_date date0 date1 diff
0 0 A ART 9/30/16 2016-12-20 12/20/17 -81.0
1 1 AA ART 9/30/16 2016-12-01 12/1/17 -62.0
2 2 AA ART 9/30/16 2016-12-01 2/8/18 -131.0
3 3 AA ART 9/30/16 2017-02-08 12/1/17 -62.0
4 4 AA ART 9/30/16 2017-02-08 2/8/18 -131.0
5 5 AABA ART 9/30/16 2016-11-09 11/9/17 -40.0
6 6 AAC ART 9/30/16 2016-11-08 11/8/17 -39.0
7 7 AAL ART 9/30/16 2016-10-20 10/20/17 -20.0
8 8 AAMC ART 9/30/16 2016-11-07 11/7/17 -38.0
9 9 AAME ART 9/30/16 2016-11-14 11/14/17 -45.0
10 36 ABMT ART 9/30/16 2017-02-14 2/14/18 -137.0
11 37 ABMT ART 9/30/16 2017-02-14 2/16/18 -139.0
12 38 ABMT ART 9/30/16 2017-02-16 2/14/18 -137.0
Unfortunately, no luck so far.
Use sort_values + drop_duplicates:
df.sort_values(['date0','diff'],ascending=[False,True]).drop_duplicates('ticker',keep='last').sort_index()
Out[1071]:
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
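A likely reason the edited code in the question still prints the unfiltered frame: sort_values, drop_duplicates and sort_index each return a new DataFrame and leave data unchanged, so the result has to be assigned to a name. Below is a minimal sketch under that assumption, rebuilding the sample data inline instead of reading input_transform.csv. The name result is illustrative.

```python
import pandas as pd

# Sample data from the question, reduced to the columns the filter uses.
data = pd.DataFrame(
    {
        "ticker": ["A", "AA", "AA", "AA", "AA", "AABA", "AAC", "AAL",
                   "AAMC", "AAME", "ABMT", "ABMT", "ABMT"],
        "date0": ["12/20/16", "12/1/16", "12/1/16", "2/8/17", "2/8/17",
                  "11/9/16", "11/8/16", "10/20/16", "11/7/16", "11/14/16",
                  "2/14/17", "2/14/17", "2/16/17"],
        "diff": [-81, -62, -131, -62, -131, -40, -39, -20,
                 -38, -45, -137, -139, -137],
    },
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 36, 37, 38],
)
data["date0"] = pd.to_datetime(data["date0"], format="%m/%d/%y")

# Sort so that, within each ticker, the row with the earliest date0 and the
# highest diff comes last, then keep that last row per ticker.
# The chained calls return a new DataFrame, so store it instead of discarding it.
result = (
    data.sort_values(["date0", "diff"], ascending=[False, True])
    .drop_duplicates("ticker", keep="last")
    .sort_index()
)
print(result)
```

Printing result rather than data shows one row per ticker. When loading from the CSV, passing index_col=0 to pd.read_csv would likely avoid the extra Unnamed: 0 column, assuming that column holds the original index.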