Algorithm for grouping records

GoTo

I have a table that looks like this:

Group Name
1     A
1     B
2     R
2     F
3     B
3     C

I need to group these records by the following rule: if a group contains at least one Name that also appears in another group, the two groups belong to the same result group. In my case, group 1 contains A and B, and group 3 contains B and C. They share the name B, so they must end up in the same group. As a result, I want to get something like this:

Group Name ResultGroup
1     A    1
1     B    1
2     R    2
2     F    2
3     B    1
3     C    1

I have already found a solution, but my table has about 200k records, so it takes too much time (more than 12 hours). Is there a way to optimize it? Maybe using pandas or something like that?

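def printList(l, head=""):
    # Print an optional heading, then one row per line.
    if head:
        print(head)
    for i in l:
        print(i)

def find_group(groups, vals):
    # Return the id of the first result group that already contains
    # any of vals, or 0 if no result group contains one yet.
    for k, members in groups.items():
        for v in vals:
            if v in members:
                return k
    return 0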

task = [ [1, "AAA"], [1, "BBB"], [3, "CCC"], [4, "DDD"], [5, "JJJ"], [6, "AAA"], [6, "JJJ"], [6, "CCC"], [9, "OOO"], [10, "OOO"], [10, "DDD"], [11, "LLL"], [12, "KKK"] ]

ptrs = {}
groups = {}

group_id = 1

printList(task, "Initial table")

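for i in range(len(task)):
    itask = task[i]
    resp = itask[1]
    # All groups that share this row's name (rescans the whole table for every row).
    val = [x[0] for x in task if x[1] == resp]
    minval = min(val)
    for v in val:
        if v not in ptrs:
            ptrs[v] = minval

    myGroup = find_group(groups, val)
    if myGroup == 0:
        # None of these groups has been seen yet: open a new result group.
        groups[group_id] = list(set(val))
        myGroup = group_id
        group_id += 1
    else:
        # Add these groups to the existing result group, without duplicates.
        groups[myGroup].extend(val)
        groups[myGroup] = list(set(groups[myGroup]))

    # Store the result group as a third column on the row.
    itask.append(myGroup)
    task[i] = itask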

print()
printList(task, "Result table")
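
A note on where the time probably goes: the list comprehension that builds `val` walks the whole table once per row, so the loop does on the order of n² comparisons (roughly 4 × 10^10 for 200k rows). A minimal sketch of building that lookup once instead, assuming the rows can be treated as (group, name) pairs; the name `groups_by_name` is made up for this illustration:

```python
from collections import defaultdict

# Same sample rows as in the question, written as (group, name) pairs.
task = [(1, "AAA"), (1, "BBB"), (3, "CCC"), (4, "DDD"), (5, "JJJ"),
        (6, "AAA"), (6, "JJJ"), (6, "CCC"), (9, "OOO"), (10, "OOO"),
        (10, "DDD"), (11, "LLL"), (12, "KKK")]

# One pass over the table: name -> list of groups that contain it.
groups_by_name = defaultdict(list)
for group, name in task:
    groups_by_name[name].append(group)

# Inside the main loop, a dictionary lookup then stands in for the
# per-row scan, e.g. for a row whose name is "AAA":
val = groups_by_name["AAA"]
print(val)  # [1, 6]
```

This only removes the repeated scan; `find_group` still loops over every result group for each row, so it may remain a bottleneck on large inputs.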

MaPy

You can groupby 'Name' and keep the first Group:

import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 2, 2, 3, 3], 'Name': ['A', 'B', 'R', 'F', 'B', 'C']})
df2 = df.groupby('Name').first().reset_index()

Then merge with the original DataFrame and keep one row per original group:

df3 = df.merge(df2, on='Name', how='left')
df3 = df3[['Group_x', 'Group_y']].drop_duplicates('Group_x')
df3.columns = ['Group', 'ResultGroup']

One more merge will give you the result:

df.merge(df3, on='Group', how='left')

Group Name  ResultGroup
    1    A            1
    1    B            1
    2    R            2
    2    F            2
    3    B            1
    3    C            1
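
The mapping above links each group to the first group seen for one of its names, which is enough for the six-row sample. On the longer `task` list from the question it appears to leave chains unmerged: group 6 maps to 1 through AAA, but groups 3 and 5 keep their own ids even though they share CCC and JJJ with group 6. If chained overlaps matter, a union-find (disjoint-set) structure is one common way to handle them in near-linear time. The following is a sketch and not part of the answer above; the function name `assign_result_groups` is made up for illustration, and it labels each result group with its smallest original group id rather than with a running counter.

```python
def assign_result_groups(rows):
    """Map each group to the smallest group id it is connected to.

    rows is an iterable of (group, name) pairs. Two groups are connected
    when they share a name, directly or through a chain of other groups.
    """
    parent = {}

    def find(g):
        # Walk up to the root of g's set.
        root = g
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the way at the root.
        while parent[g] != root:
            parent[g], g = root, parent[g]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the root so labels are stable.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    first_group_for_name = {}
    for group, name in rows:
        parent.setdefault(group, group)
        if name in first_group_for_name:
            # Name seen before: merge this group with the earlier one.
            union(group, first_group_for_name[name])
        else:
            first_group_for_name[name] = group

    return {g: find(g) for g in parent}


sample = [(1, "A"), (1, "B"), (2, "R"), (2, "F"), (3, "B"), (3, "C")]
print(assign_result_groups(sample))  # {1: 1, 2: 2, 3: 1}
```

With pandas, the returned dictionary could be applied to the original column with something like `df['Group'].map(...)` to produce the ResultGroup column.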


Edited at 2020-12-6
