I have the DataFrame below, and for each code I need to select rows based on the canceled and order columns.
Say code xxx has orders [6, 1, 5, 1] and a canceled value of 11. I need an algorithm that selects the rows whose orders sum to that canceled total of 11, i.e. [6 & 5].
If no rows match, select the closest id and append it to the list together with its difference from the canceled value, as shown below: (111111, 35), where 111111 is the selected id and 35 is the difference between 55 and 20. I need an algorithm that can handle 10k rows.
My expected output:
**code** **canceled** **order** **ids**
xxx 11.0 13 [128281, 128283]
cvd 20 55 (111111, 35)
import pandas as pd
ccc = [
    {"code": "xxx", "canceled": 11.0, "id": "128281", "order": 6},
    {"code": "xxx", "canceled": 11.0, "id": "128282", "order": 1},
    {"code": "xxx", "canceled": 11.0, "id": "128283", "order": 5},
    {"code": "xxx", "canceled": 11.0, "id": "128284", "order": 1},
    {"code": "xxS", "canceled": 0.0, "id": "108664", "order": 4},
    {"code": "xxS", "canceled": 0.0, "id": "110515", "order": 1},
    {"code": "xxS", "canceled": 0.0, "id": "113556", "order": 1},
    {"code": "eeS", "canceled": 5.0, "id": "115236", "order": 1},
    {"code": "eeS", "canceled": 5.0, "id": "108586", "order": 1},
    {"code": "eeS", "canceled": 5.0, "id": "114107", "order": 1},
    {"code": "eeS", "canceled": 5.0, "id": "113472", "order": 3},
    {"code": "eeS", "canceled": 5.0, "id": "114109", "order": 3},
    {"code": "544W", "canceled": 44.0, "id": "107650", "order": 20},
    {"code": "544W", "canceled": 44.0, "id": "127763", "order": 4},
    {"code": "544W", "canceled": 44.0, "id": "128014", "order": 20},
    {"code": "544W", "canceled": 44.0, "id": "132434", "order": 58},
    {"code": "cvd", "canceled": 20.0, "id": "11111", "order": 55},
    {"code": "eeS", "canceled": 5.0, "id": "11111", "order": 5}
]
I tried the solution below and it works, but I need it to look for the exact value if one exists. I also need to select the most probable ids that sum to the canceled value. I want to eliminate possibilities like (111111, 35).
df = pd.DataFrame(ccc)

def selected_ids(datum):
    ids = datum.id
    nbc = int(datum.canceled)
    order = datum.order
    count = []
    arr = []
    for loc, i in enumerate(order):
        count.append(i)
        arr.append(ids[loc])
        if nbc == int(i):
            return ids[loc]
        elif nbc == 0:
            return ''
        elif nbc < int(i):
            return (ids[loc], (int(i) - nbc))
        if nbc < sum(count):
            return [arr[:-1], (arr[-1], sum(count) - nbc)]

xcv = df.sort_values('order').groupby('code').agg({
    'code': 'first',
    'canceled': 'first',
    'order': list,
    'id': list
})
xcv['Orders_to_cancel'] = xcv.apply(
    selected_ids, axis=1
)
xcv
What about the following (the dataframe is restricted to codes `xxx`, `cvd` and `eeS` for readability)?
df2 = df.groupby('code').agg({
    'canceled': 'first',
    'order': list,
    'id': list
}).reset_index().rename(
    columns={
        'id': 'ids',
        'order': 'orders',
    }
)
df2['orders_sum'] = df2.orders.apply(sum)
print(df2)
### code canceled orders ids orders_sum
### 0 cvd 20.0 [55] [11111] 55
### 1 eeS 5.0 [1, 1, 1, 3, 3, 5] [115236, 108586, 114107, 113472, 114109, 11111] 14
### 2 xxx 11.0 [6, 1, 5, 1] [128281, 128282, 128283, 128284] 13
Then, we may first want to check whether some `ids` have an `order` that already directly fits the `canceled` value.
df2['direct_ids'] = df2.apply(
    lambda r: [
        i for o, i in zip(r.orders, r.ids)
        if o == r.canceled
    ],
    axis=1
)
print(df2.loc[:, ('code', 'canceled', 'direct_ids')])
### code canceled direct_ids
### 0 cvd 20.0 []
### 1 eeS 5.0 [11111] # We could have had more than one id, hence the list
### 2 xxx 11.0 []
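The direct check above is just a zip-and-filter over the two aggregated lists; a minimal standalone sketch with the `eeS` group (values copied from the sample data):

```python
# 'eeS' group from the sample data: parallel orders/ids lists and its canceled value
orders = [1, 1, 1, 3, 3, 5]
ids = ['115236', '108586', '114107', '113472', '114109', '11111']
canceled = 5.0

# keep every id whose own order already equals the canceled value
direct_ids = [i for o, i in zip(orders, ids) if o == canceled]
print(direct_ids)  # ['11111']
```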
...otherwise we have to get all the possible combinations of `ids`
import itertools as it
import pprint as pp
df2['ids_'] = df2.ids.apply(lambda l:l[:40]) # Let's make the bet that 40 ids will be enough to find the sum we want, avoiding memory error at the same time.
df2['combos'] = df2.ids_.apply(
    lambda l: list(it.chain.from_iterable(
        it.combinations(l, i + 1)
        for i in range(len(l))
    ))
)
pp.pprint(df2.combos[2])  # an illustration with row index 2 (code 'xxx')
### [('128281',),
### ('128282',),
### ('128283',),
### ('128284',),
### ('128281', '128282'),
### ('128281', '128283'),
### ('128281', '128284'),
### ('128282', '128283'),
### ('128282', '128284'),
### ('128283', '128284'),
### ('128281', '128282', '128283'),
### ('128281', '128282', '128284'),
### ('128281', '128283', '128284'),
### ('128282', '128283', '128284'),
### ('128281', '128282', '128283', '128284')]
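As a side note, the number of non-empty combinations grows as 2**n - 1 per code, which is why the 40-id cap above is a bet rather than a guarantee; a small sketch of the count:

```python
import itertools as it

def powerset(items):
    """All non-empty combinations of `items`, shortest first."""
    return list(it.chain.from_iterable(
        it.combinations(items, i + 1) for i in range(len(items))
    ))

# 4 ids (the 'xxx' group) yield 2**4 - 1 == 15 combinations;
# 40 ids would already yield 2**40 - 1, i.e. about a trillion.
print(len(powerset(['128281', '128282', '128283', '128284'])))  # 15
```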
Now we need to compute all the distances between the `canceled` value and the `order`-sums these combinations produce.
df2['distances'] = df2.apply(
    lambda r: {
        combo: abs(
            r.canceled - df.loc[
                df.code.isin([r.code]) & df.id.isin(combo),
                ('order',)
            ].sum()[0]
        ) for combo in r.combos
    },
    axis=1
)
pp.pprint(df2.distances[2])
### {('128281',): 5.0,
### ('128281', '128282'): 4.0,
### ('128281', '128282', '128283'): 1.0,
### ('128281', '128282', '128283', '128284'): 2.0,
### ('128281', '128282', '128284'): 3.0,
### ('128281', '128283'): 0.0, #<--- this is the 'xxx'-combination we want
### ('128281', '128283', '128284'): 1.0,
### ('128281', '128284'): 4.0,
### ('128282',): 10.0,
### ('128282', '128283'): 5.0,
### ('128282', '128283', '128284'): 4.0,
### ('128282', '128284'): 9.0,
### ('128283',): 6.0,
### ('128283', '128284'): 5.0,
### ('128284',): 10.0}
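The same distance computation can be reproduced without pandas, which makes the selected combination easy to check by hand; a sketch using the `xxx` group only:

```python
import itertools as it

# 'xxx' group from the sample data: id -> order, plus its canceled value
orders = {'128281': 6, '128282': 1, '128283': 5, '128284': 1}
canceled = 11.0

# distance of each non-empty combination's order-sum to the canceled value
distances = {
    combo: abs(canceled - sum(orders[i] for i in combo))
    for combo in it.chain.from_iterable(
        it.combinations(orders, n + 1) for n in range(len(orders))
    )
}
best = min(distances, key=distances.get)
print(best, distances[best])  # ('128281', '128283') 0.0
```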
..and now we can isolate the exact combination(s) we want
default_minv = [float('inf')]
df2['min_distance'] = df2.distances.apply(
    lambda ds: min(ds.values() or default_minv)  # avoid errors when ds.values() is empty
)
df2['summed_ids'] = df2.apply(
    lambda r: [
        c for c, d in r.distances.items()
        if d == r.min_distance
    ],
    axis=1
)
print(df2.loc[:, ('code', 'canceled', 'orders_sum', 'min_distance', 'summed_ids')])
### code canceled orders_sum min_distance summed_ids
### 0 cvd 20.0 55 35.0 [(11111,)]
### 1 eeS 5.0 14 0.0 [(11111,), (115236, 108586, 113472), (115236, ...
### 2 xxx 11.0 13 0.0 [(128281, 128283)]
Note that i) as shown above, I defined `min_distance` as a distinct column simply because mixing objects of different types in a single column is bad practice, and ii) the approach is generic, so you can get several combinations of `ids` as `summed_ids`, i.e. when many of them share the same `min_distance`.
[...]我还需要选择最可能的ID总计为被取消的值。
从此以后,这样做就像
cols_of_interest = ['code', 'canceled', 'orders_sum', 'direct_ids', 'summed_ids']
sub_df = df2.loc[
    (df2.min_distance == 0) | df2.direct_ids.map(len), cols_of_interest
]
print(sub_df)
### code canceled orders_sum direct_ids summed_ids
### 1 eeS 5.0 14 [11111] [(11111,), (115236, 108586, 113472), (115236, ...
### 2 xxx 11.0 13 [] [(128281, 128283)]
To avoid storing all the combinations (i.e. no need to define `df2['combos']` as before), you can do the following:
df2['distances'] = df2.apply(
    lambda r: {
        combo: abs(
            r.canceled - df.loc[
                df.code.isin([r.code]) & df.id.isin(combo),
                ('order',)
            ].sum()[0]
        ) for combo in it.chain.from_iterable(
            it.combinations(r.ids, i + 1)
            for i in range(len(r.ids))
        )
    },
    axis=1
)
Since, I admit, this is starting to turn into code golf, consider the (whole) code below
import itertools as it
df2 = df.groupby('code').agg({
    'canceled': 'first',
    'order': list,
    'id': list
}).reset_index().rename(
    columns={
        'id': 'ids',
        'order': 'orders',
    }
)
df2['orders_sum'] = df2.orders.apply(sum)
df2['direct_ids'] = df2.apply(
    lambda r: [
        i for o, i in zip(r.orders, r.ids)
        if o == r.canceled
    ],
    axis=1
)

def distances_computer(r):
    combos = it.chain.from_iterable(
        it.combinations(r.ids, i + 1)
        for i in range(len(r.ids))
    )
    distances_ = []
    for combo in combos:
        d = abs(
            r.canceled - df.loc[
                df.code.isin([r.code]) & df.id.isin(combo),
                ('order',)
            ].sum()[0]
        )
        distances_.append((combo, d))
        if d == 0:  # iteration stops as soon as a zero-distance is found
            break
    # Minimize the number of returned distances, keeping only the 10
    # smallest (actually you may want to put `1` instead of `10`).
    distances = sorted(distances_, key=lambda item: item[1])[:10]
    return dict(distances)

df2['distances'] = df2.apply(
    distances_computer, axis=1
)
default_minv = [float('inf')]
df2['min_distance'] = df2.distances.apply(
    lambda ds: min(ds.values() or default_minv)  # avoid errors when ds.values() is empty
)
df2['summed_ids'] = df2.apply(
    lambda r: [
        c for c, d in r.distances.items()
        if d == r.min_distance
    ],
    axis=1
)
cols_of_interest = ['code', 'canceled', 'orders_sum', 'direct_ids', 'summed_ids']
sub_df = df2.loc[
    (df2.min_distance == 0) | df2.direct_ids.map(len), cols_of_interest
]
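One caveat with all of the above: enumerating combinations is exponential in the number of ids per code, so it cannot honor the 10k-row requirement once a single code has more than a few dozen rows. A more scalable alternative (my own sketch, not part of the answer above, and assuming the `order` values are non-negative integers) is a subset-sum dynamic program that records, for every reachable order-sum, one set of row indices producing it; `closest_subset` is a hypothetical helper name:

```python
def closest_subset(orders, ids, target):
    """Return (ids, distance) for a non-empty subset whose order-sum
    is closest to `target`. Assumes non-negative integer orders."""
    best = {0: ()}  # reachable sum -> one tuple of indices producing it
    for idx, o in enumerate(orders):
        # snapshot the dict so each row is used at most once
        for s, combo in list(best.items()):
            best.setdefault(s + o, combo + (idx,))
    # exclude the empty subset (sum 0) unless nothing else is reachable
    candidates = [s for s in best if s != 0] or [0]
    s_best = min(candidates, key=lambda s: abs(s - target))
    return [ids[i] for i in best[s_best]], abs(s_best - target)

# 'xxx' group: an exact match exists (6 + 5 == 11)
print(closest_subset([6, 1, 5, 1],
                     ['128281', '128282', '128283', '128284'], 11))
# → (['128281', '128283'], 0)

# 'cvd' group: no match, so the closest row comes back with its distance
print(closest_subset([55], ['11111'], 20))
# → (['11111'], 35)
```

Each reachable sum keeps a single witness, so memory is bounded by the number of distinct sums rather than by the 2**n combinations.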