# 熊猫中的条件计算

``````SELECT
round(sum(total_size) / (1024*1024), 2) 'Total (PB)' ,
round(sum(keep_size) / (1024*1024), 2) 'Keep (PB)' ,
round(sum(remove_size) / (1024*1024), 2) 'Remove (PB)'
FROM (
SELECT
case when dedupe_status='K' then path when dedupe_status='R' then null when dedupe_status='G' then group_super end as g_key,
round(sum(file_size), 2) total_size,
case when dedupe_status='R' then round(sum(file_size), 2) when dedupe_status='K' then 0 when dedupe_status='G' then round(sum(file_size) - file_size, 2) end remove_size,
case when dedupe_status='R' then 0 when dedupe_status='K' then file_size when dedupe_status='G' then round(sum(file_size) - (sum(file_size) - file_size), 2) end keep_size
from dedupe__df group by g_key
) clean_list
``````

``````df=pd.DataFrame([
{'dedupe_status': 'R', 'size': 134, 'dedupe_key': 'g_149'},
{'dedupe_status': 'K', 'size': 101, 'dedupe_key': 'g9'},
{'dedupe_status': 'G', 'size': 101, 'dedupe_key': 'x09'},
{'dedupe_status': 'G', 'size': 405, 'dedupe_key': 'xx01'},
{'dedupe_status': 'G', 'size': 4, 'dedupe_key': 'x09'},
{'dedupe_status': 'G', 'size': 1405, 'dedupe_key': 'xx01'},
{'dedupe_status': 'G', 'size': 401, 'dedupe_key': 'xx01'},
])
``````

• `Total`：容易，只需总结所有大小即可。
• `Keep`：如果状态为`K`保留），则将尺寸相加。如果状态为`R`删除），请跳过它；如果状态为`G`分组），则对进行分组，`dedupe_key`保留其中一种大小（如果最简单，可以抓住`first`没关系`min`）。换句话说，当值为时`G`，表示该组中的所有元素都是重复的，我们只需要保留其中一个即可。
• `Remove``Total`-`Keep`

``````field           value            # comments
Total           2551             # df['size'].sum()
Keep            607              # 101 (K) + 101 (G: x09) + 405(G: xx01)
Remove          1944             # 134 (R) + 4 (G: x08) + 1405+401 (G: xx01)
``````

``````>>> df['new_key'] = df.apply(lambda row: 'R' if row.dedupe_status == 'R' else row.dedupe_key, axis=1)
>>> df.groupby('new_key').agg(total=('size', 'sum'))
``````

Quang Hoang：

``````mask = np.where(df.dedupe_status.eq('R') |
(df.duplicated(['dedupe_status', 'dedupe_key']) &
df.dedupe_status.eq('G')
),
'Remove', 'Keep')

ret.loc['Total'] = ret.sum()
``````

``````Keep       607
Remove    1944
Total     2551
Name: size, dtype: int64
``````

0 条评论