我有一个在SQL中使用以下方法进行的计算:
SELECT
round(sum(total_size) / (1024*1024), 2) 'Total (PB)' ,
round(sum(keep_size) / (1024*1024), 2) 'Keep (PB)' ,
round(sum(remove_size) / (1024*1024), 2) 'Remove (PB)'
FROM (
SELECT
case when dedupe_status='K' then path when dedupe_status='R' then null when dedupe_status='G' then group_super end as g_key,
round(sum(file_size), 2) total_size,
case when dedupe_status='R' then round(sum(file_size), 2) when dedupe_status='K' then 0 when dedupe_status='G' then round(sum(file_size) - file_size, 2) end remove_size,
case when dedupe_status='R' then 0 when dedupe_status='K' then file_size when dedupe_status='G' then round(sum(file_size) - (sum(file_size) - file_size), 2) end keep_size
from dedupe__df group by g_key
) clean_list
我仅将此内容作为参考。这是我要在熊猫中进行相同计算的数据框。这是我的数据帧:
df=pd.DataFrame([
{'dedupe_status': 'R', 'size': 134, 'dedupe_key': 'g_149'},
{'dedupe_status': 'K', 'size': 101, 'dedupe_key': 'g9'},
{'dedupe_status': 'G', 'size': 101, 'dedupe_key': 'x09'},
{'dedupe_status': 'G', 'size': 405, 'dedupe_key': 'xx01'},
{'dedupe_status': 'G', 'size': 4, 'dedupe_key': 'x09'},
{'dedupe_status': 'G', 'size': 1405, 'dedupe_key': 'xx01'},
{'dedupe_status': 'G', 'size': 401, 'dedupe_key': 'xx01'},
])
我想要得到的结果有三个值的Total Size
,Remove Size
和Keep Size
。计算方法如下:
Total
:容易,只需总结所有大小即可。Keep
:如果状态为K
(保留),则将尺寸相加。如果状态为R
(删除),请跳过它;如果状态为G
(分组),则对进行分组,dedupe_key
并仅保留其中一种大小(如果最简单,可以抓住first
或没关系min
)。换句话说,当值为时G
,表示该组中的所有元素都是重复的,我们只需要保留其中一个即可。Remove
:Total
-Keep
有了以上值,我们将拥有:
field value # comments
Total 2551 # df['size'].sum()
Keep 607 # 101 (K) + 101 (G: x09) + 405(G: xx01)
Remove 1944 # 134 (R) + 4 (G: x08) + 1405+401 (G: xx01)
到目前为止,我有:
>>> df['new_key'] = df.apply(lambda row: 'R' if row.dedupe_status == 'R' else row.dedupe_key, axis=1)
>>> df.groupby('new_key').agg(total=('size', 'sum'))
剩下的我该怎么办?
让我们尝试np.where
找到那些Keep
:
mask = np.where(df.dedupe_status.eq('R') |
(df.duplicated(['dedupe_status', 'dedupe_key']) &
df.dedupe_status.eq('G')
),
'Remove', 'Keep')
ret = df.groupby(mask)['size'].sum()
ret.loc['Total'] = ret.sum()
输出:
Keep 607
Remove 1944
Total 2551
Name: size, dtype: int64
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句