Algorithm for grouping records

GoTo

I have a table that looks like this:

Group Name
1     A
1     B
2     R
2     F
3     B
3     C

I need to group these records by the following rule: if a group contains at least one Name that also appears in another group, then the two groups belong to the same result group. In my case, Group 1 contains A and B, and Group 3 contains B and C. They share the name B, so they must be in the same group. As a result, I want to get something like this:

Group Name ResultGroup
1     A    1
1     B    1
2     R    2
2     F    2
3     B    1
3     C    1

I have already found a solution, but my table has about 200k records, so it takes too much time (more than 12 hours). Is there a way to optimize it? Maybe using pandas or something like that?

def printList(l, head=""):
    if(head!=""):
        print(head)
    for i in l:
        print(i)

def find_group(groups, vals):
    for k in groups.keys():
        for v in vals:
            if v in groups[k]:
                return k
    return 0

task = [ [1, "AAA"], [1, "BBB"], [3, "CCC"], [4, "DDD"], [5, "JJJ"], [6, "AAA"], [6, "JJJ"], [6, "CCC"], [9, "OOO"], [10, "OOO"], [10, "DDD"], [11, "LLL"], [12, "KKK"] ]

ptrs = {}
groups = {}

group_id = 1

printList(task, "Initial table")

for i in range(0, len(task)):
    itask = task[i]
    resp = itask[1]
    val = [ x[0] for x in task if x[1] == resp ]
    minval = min(val)
    for v in val:
        if not v in ptrs.keys(): ptrs[v] = minval

    myGroup = find_group(groups, val)
    if(myGroup == 0):
        groups[group_id] = list(set(val))
        myGroup = group_id
        group_id += 1
    else:
        groups[myGroup].extend(val)
        groups[myGroup] = list(set(groups[myGroup]))

    itask.append(myGroup)
    task[i] = itask

print()
printList(task, "Result table")
MaPy

You can groupby 'Name' and keep the first Group:

import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 2, 2, 3, 3], 'Name': ['A', 'B', 'R', 'F', 'B', 'C']})
df2 = df.groupby('Name').first().reset_index()

Then merge with the original data-frame and drop duplicates of the original group:

df3 = df.merge(df2, on='Name', how='left')
df3 = df3[['Group_x', 'Group_y']].drop_duplicates('Group_x')
df3.columns = ['Group', 'ResultGroup']

One more merge will give you the result:

df.merge(df3, on='Group', how='left')

Group Name  ResultGroup
    1    A            1
    1    B            1
    2    R            2
    2    F            2
    3    B            1
    3    C            1
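
One caveat, as far as I can tell: the merge approach above only looks at the first Group seen for each Name, so it depends on row order and may not link groups that are connected only through a chain of shared names (for example, 1 and 2 share A, 2 and 3 share B). If that matters for your data, a union-find (disjoint-set) pass is one possible alternative. The following is only a minimal sketch in plain Python, assuming the table fits in memory as a list of (group, name) pairs and that group ids are comparable; the function name assign_result_groups and its helpers are illustrative and not from the original post.

```python
def assign_result_groups(rows):
    """rows: a list of (group, name) pairs.

    Returns a list of (group, name, result_group), where result_group is
    the smallest group id among all groups linked by shared names.
    """
    parent = {}  # group id -> parent group id in the disjoint-set forest

    def find(g):
        # Register unseen groups, then walk up to the root.
        parent.setdefault(g, g)
        root = g
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[g] != root:
            parent[g], g = root, parent[g]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the representative of the merged set.
            if ra < rb:
                parent[rb] = ra
            else:
                parent[ra] = rb

    first_group_for_name = {}
    for group, name in rows:
        find(group)  # make sure the group is registered
        if name in first_group_for_name:
            # This name was already seen: link the two groups.
            union(group, first_group_for_name[name])
        else:
            first_group_for_name[name] = group

    return [(group, name, find(group)) for group, name in rows]


# Example with the table from the question.
example_rows = [(1, "A"), (1, "B"), (2, "R"), (2, "F"), (3, "B"), (3, "C")]
for row in assign_result_groups(example_rows):
    print(row)
```

Each row is visited a constant number of times and each find/union is close to constant time, so this should scale to a few hundred thousand rows far better than rescanning the whole table for every record. Starting from a data-frame, the input could be built with list(zip(df['Group'], df['Name'])).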
