Python Pandas Select values from another dataFrame depending on the total from another column

Lwanga

I have a DataFrame below but I need to select rows from each code depending on the canceled and order column.

Say code xxx has orders [6, 1, 5, 1] and the order is 11. I need an algorithm that can select the rows that meet the total 11 say rows with orders [6 & 5]

If No rows match, then select the closest ids and add it to a list with their difference from the canceled as below 111111 is the selected id and 35 is the diff between 55 and 20. i need an algorithm that can handle 10k Rows

My Expected output

**code**    **canceled**    **order**              **ids**
     xxx           11.0            13     [128281, 128283]
     cvd             20            55         (111111, 35)
import pandas as pd

ccc = [
    {"code":"xxx","canceled":11.0,"id":"128281","order":6},
    {"code":"xxx","canceled":11.0,"id":"128282","order":1},
    {"code":"xxx","canceled":11.0,"id":"128283","order":5},
    {"code":"xxx","canceled":11.0,"id":"128284","order":1},
    {"code":"xxS","canceled":0.0,"id":"108664","order":4},
    {"code":"xxS","canceled":0.0,"id":"110515","order":1},
    {"code":"xxS","canceled":0.0,"id":"113556","order":1},
    {"code":"eeS","canceled":5.0,"id":"115236","order":1},
    {"code":"eeS","canceled":5.0,"id":"108586","order":1},
    {"code":"eeS","canceled":5.0,"id":"114107","order":1},
    {"code":"eeS","canceled":5.0,"id":"113472","order":3},
    {"code":"eeS","canceled":5.0,"id":"114109","order":3},
    {"code":"544W","canceled":44.0,"id":"107650","order":20},
    {"code":"544W","canceled":44.0,"id":"127763","order":4},
    {"code":"544W","canceled":44.0,"id":"128014","order":20},
    {"code":"544W","canceled":44.0,"id":"132434","order":58},
    {"code":"cvd","canceled":20.0,"id":"11111","order":55},
    {"code":"eeS","canceled":5.0,"id":"11111","order":5}
]

I tried a solution the below solution it works but i need to look for the exact value if it exists. I also need to select the most possible ids that sum up to the canceled value. I want to eliminate the possibility of having this (111111, 35)

df = pd.DataFrame(ccc)

def selected_ids(datum):
    ids = datum.id
    nbc = int(datum.canceled)
    order = datum.order
    count = []
    arr = []

    for loc, i in enumerate(order):

        count.append(i)
        arr.append(ids[loc])

        if nbc == int(i):
            return ids[loc]

        elif nbc  == 0:
            return ''

        elif nbc  < int(i):
            return (ids[loc], (int(i)-nbc))

        if nbc < sum(count):
             return [arr[:-1], (arr[-1],sum(count)-nbc)]

xcv = df.sort_values('order').groupby('code').agg({
    'code':'first',
    'canceled': 'first',
    'order': list,
    'id':list
})
xcv['Orders_to_cancel'] = xcv.apply(
    selected_ids, axis = 1
)
xcv
keepAlive

What about (with a dataframe restricted to the codes xxx, cvd and eeS for the sake of readability)

df2 = df.groupby('code').agg({
    'canceled' : 'first',
    'order'    : list,
    'id'       : list
}).reset_index().rename(
    columns={
        'id'    : 'ids',
        'order' : 'orders',
    }
)
df2['orders_sum'] = df2.orders.apply(sum)

print(df2)
###   code  canceled              orders                                              ids  orders_sum
### 0  cvd      20.0                [55]                                          [11111]          55
### 1  eeS       5.0  [1, 1, 1, 3, 3, 5]  [115236, 108586, 114107, 113472, 114109, 11111]          14
### 2  xxx      11.0        [6, 1, 5, 1]                 [128281, 128282, 128283, 128284]          13

We may then first want to check if some ids already have an order value that fits directly the canceled value.

df2['direct_ids'] = df2.apply(
    lambda r: [
        i for o, i in zip(r.orders, r.ids)
        if o == r.canceled
    ],
    axis = 1
)

print(df2.loc[:, ('code', 'canceled', 'direct_ids')])   
###   code  canceled direct_ids
### 0  cvd      20.0         []
### 1  eeS       5.0    [11111]    # We could have had more than one id, hence the list
### 2  xxx      11.0         []

... otherwise we necessarily have to get all the possible ids's combinations

import itertools as it
import pprint as pp

df2['ids_']   = df2.ids.apply(lambda l:l[:40])  # Let's make the bet that 40 ids will be enough to find the sum we want, avoiding memory error at the same time.
df2['combos'] = df2.ids_.apply(
    lambda l: list(it.chain.from_iterable(
        it.combinations(l, i + 1)
        for i in range(len(l))
    ))
)

pp.pprint(df2.combos[2]) # an illustration with the indexed-by-2 combinations (code `'xxx'`)
### [('128281',),
###  ('128282',),
###  ('128283',),
###  ('128284',),
###  ('128281', '128282'),
###  ('128281', '128283'),
###  ('128281', '128284'),
###  ('128282', '128283'),
###  ('128282', '128284'),
###  ('128283', '128284'),
###  ('128281', '128282', '128283'),
###  ('128281', '128282', '128284'),
###  ('128281', '128283', '128284'),
###  ('128282', '128283', '128284'),
###  ('128281', '128282', '128283', '128284')]

We now need to compute all the distances between canceled values and order-sums that result from these combinations.

df2['distances'] = df2.apply(
    lambda r : {
        combo : abs(
            r.canceled - df.loc[
            df.code.isin([r.code]) & df.id.isin(combo),
            ('order',)
        ].sum()[0]) for combo in r.combos
    },
    axis = 1
)
pp.pprint(df2.distances[2])                                                                          
### {('128281',): 5.0,
###  ('128281', '128282'): 4.0,
###  ('128281', '128282', '128283'): 1.0,
###  ('128281', '128282', '128283', '128284'): 2.0,
###  ('128281', '128282', '128284'): 3.0,
###  ('128281', '128283'): 0.0,           #<--- this is the 'xxx'-combination we want
###  ('128281', '128283', '128284'): 1.0,
###  ('128281', '128284'): 4.0,
###  ('128282',): 10.0,
###  ('128282', '128283'): 5.0,
###  ('128282', '128283', '128284'): 4.0,
###  ('128282', '128284'): 9.0,
###  ('128283',): 6.0,
###  ('128283', '128284'): 5.0,
###  ('128284',): 10.0}

.. and we can now isolate the exact combination(s) we want

default_minv        = [float('inf')]
df2['min_distance'] = df2.distances.apply(
    lambda ds : min(ds.values() or default_minv) # to avoid errors when ds.values() is empty
)
df2['summed_ids'] = df2.apply(
    lambda r : [
        c for c, d in r.distances.items()
        if d == r.min_distance
    ],
    axis = 1
)

print(df2.loc[:, ('code', 'canceled', 'orders_sum', 'min_distance', 'summed_ids')])                     
###   code  canceled  orders_sum  min_distance                                         summed_ids
### 0  cvd      20.0          55          35.0                                         [(11111,)]
### 1  eeS       5.0          14           0.0  [(11111,), (115236, 108586, 113472), (115236, ...
### 2  xxx      11.0          13           0.0                                 [(128281, 128283)]

i) As you can see above, I have defined min_distance as a distinct column, simply because it is not a good practice to have different/multiple types of objects within a single column and ii) the approach is generalized so that you can have multiple combinations of ids as summed_ids, i.e. if many of them have the same min_distance.


[...] I also need to select the most possible ids that sum up to the canceled value.

Doing so is henceforth as easy as

cols_of_interest = ['code', 'canceled', 'orders_sum', 'direct_ids', 'summed_ids']
sub_df = df2.loc[
    (df2.min_distance==0) | df2.direct_ids.map(len), cols_of_interest
]
print(sub_df)                                                        
###   code  canceled  orders_sum direct_ids                                         summed_ids
### 1  eeS       5.0          14    [11111]  [(11111,), (115236, 108586, 113472), (115236, ...
### 2  xxx      11.0          13         []                                 [(128281, 128283)]

The avoid the storage of all the combinations (no need to define df2['combos'] as before), something you could do is:

df2['distances'] = df2.apply(
    lambda r : {
        combo : abs(
            r.canceled - df.loc[
            df.code.isin([r.code]) & df.id.isin(combo),
            ('order',)
        ].sum()[0]) for combo in it.chain.from_iterable(
            it.combinations(r.ids, i + 1)
            for i in range(len(r.ids))
        )
    },
    axis = 1
)

UPDATE

Because I admit that it started to become code-golf, consider the (entire) code that follows

import itertools as it

df2 = df.groupby('code').agg({
    'canceled' : 'first',
    'order'    : list,
    'id'       : list
}).reset_index().rename(
    columns={
        'id'    : 'ids',
        'order' : 'orders',
    }
)

df2['orders_sum'] = df2.orders.apply(sum)


df2['direct_ids'] = df2.apply(
    lambda r: [
        i for o, i in zip(r.orders, r.ids)
        if o == r.canceled
    ],
    axis = 1
)

def distances_computer(r):
    combos = it.chain.from_iterable(
        it.combinations(r.ids, i + 1)
        for i in range(len(r.ids))
    )
    distances_ = []
    for combo in combos:
        d = abs(
            r.canceled - df.loc[
            df.code.isin([r.code]) & df.id.isin(combo),
            ('order',)
        ].sum()[0])
        distances_.append((combo, d)) 
        if d == 0: # Iterations stops as soon as a zero-distance is found.
            break
    # Let's minimize the number of returned distances, keeping only the 10
    # smallest
    distances = sorted(distances_, key=lambda item:item[1])[:10] # Actually you may want to put `1` instead of `10`.
    return dict(distances)

df2['distances'] = df2.apply(
    distances_computer, axis = 1
)   


default_minv        = [float('inf')]
df2['min_distance'] = df2.distances.apply(
    lambda ds : min(ds.values() or default_minv) # to avoid errors when ds.values() is empty
)
df2['summed_ids'] = df2.apply(
    lambda r : [
        c for c, d in r.distances.items()
        if d == r.min_distance
    ],
    axis = 1
)

cols_of_interest = ['code', 'canceled', 'orders_sum', 'direct_ids', 'summed_ids']
sub_df = df2.loc[
    (df2.min_distance==0) | df2.direct_ids.map(len), cols_of_interest
]

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

How to calculate the values of a pandas DataFrame column depending on the results of a rolling function from another column

pandas: modifying values in dataframe from another column

Pandas column values between values from another dataframe column

select data in a column of a pandas dataframe depending on pieces of string in another column

Subtracting the values in another DataFrame from every column in a pandas DataFrame

How to conditionally replace Pandas dataframe column values from another dataframe

pandas: insert rows in a dataframe from column values in another dataframe

Fill a column in the dataframe based on similar values from another dataframe in pandas

Efficiently add column to Pandas DataFrame with values from another DataFrame

Efficiently replace values from a column to another column Pandas DataFrame

Replace values from one column with another column Pandas DataFrame

Pandas: select from column with index corresponding to values in another column

How to fill null values in pandas dataframe, with values from another column in python?

Group values and check if the values ​from another column cancel each other out in python pandas dataframe

Fill new column in one dataframe with values from another, based on values in two other columns? (Python/Pandas)

pandas dataframe select list value from another column

how to select rows in pandas dataframe based on between from another column

Pandas, add column reading values from another dataframe

New column based on matching values from another dataframe pandas

Create variables from values in another column in a pandas dataframe

Replace pandas column with list values from another panadas dataframe

Need to compare one Pandas (Python) dataframe with values from another dataframe

Python Pandas: Concatenate and update dataframe values from another dataframe

Add column with values depending on another column to a dataframe

Selecting rows from one DataFrame depending on values from another

Pandas: multiply column by column from another dataframe?

Python Pandas: How to subtract values in two non-consecutive rows in a specific column of a dataframe from one another

Python pandas: create new column based on category values from another dataframe

Pandas python Random select values from a list colum by another column value

TOP Ranking

HotTag

Archive