Get multiple column values based on a partial match with another column's values in a pandas DataFrame

Shihab Ullah

I have the following DataFrame:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]','[email protected]', '[email protected]', '[email protected]', '[email protected]',  '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']},                                        
                                 {'main_url': 'http://kirsebaergaarden.com', 'emails': ['[email protected]','[email protected]']},
                                 {'main_url': 'http://koglernes.dk', 'emails': ['[email protected]']},
                                  {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                               ])

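However, I want to keep only those entries in the 'emails' column whose part after '@' is the same as the part of the corresponding 'main_url' value after 'http://', which would yield the following DataFrame: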

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['[email protected]']},                                        
                                 {'main_url': 'http://kirsebaergaarden.com', 'emails': ['[email protected]']},
                                 {'main_url': 'http://koglernes.dk', 'emails': ['[email protected]']},
                                  {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                               ])

Given that I have millions of rows to transform, any hints or approaches would be appreciated.

Onur Guven

Give this a try; I think it should be able to handle a few million rows.

def list_check(emails_list, email_match):
    match_indexes = [i for i, s in enumerate(emails_list) if email_match in s]
    return [emails_list[index] for index in match_indexes]

# Parse main_url to get domain column (df is the DataFrame from the question, i.e. URL_WITH_EMAILS_DF)
df['domain'] = list(map(lambda x: x.split('//')[1], df['main_url']))

# Apply list_check to your dataframe using emails and domain columns
df['emails'] = list(map(lambda x, y: list_check(x, y), df['emails'], df['domain']))

# Drop domain column
df.drop(columns=['domain'], inplace=True)

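The list_check function finds the indexes of the entries in the email list that contain your match string, then uses those indexes to pull the values from the email list and returns them in a list.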
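One thing to be aware of: `email_match in s` is a substring test, so an address at a domain that merely contains the target string would also be kept. If you need the part after '@' to equal the host exactly, a stricter variant might look like the sketch below. The names domain_of and keep_matching and the sample addresses are made up for illustration, and it assumes every main_url carries a scheme such as 'http://'.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, or '' if none can be parsed."""
    return urlparse(url).hostname or ''


def keep_matching(emails, url):
    """Keep only the emails whose part after the last '@' equals the URL's host."""
    domain = domain_of(url)
    return [e for e in emails if e.rpartition('@')[2].lower() == domain]


# Placeholder rows (hypothetical addresses, not taken from the question)
rows = [
    ('http://example.com', ['info@example.com', 'bob@other.org', 'x@notexample.com']),
    ('http://example.org', []),
]
filtered = [keep_matching(emails, url) for url, emails in rows]

# With the DataFrame from the question, the same helper could be applied as:
# df['emails'] = [keep_matching(e, u) for e, u in zip(df['emails'], df['main_url'])]
```

Here 'x@notexample.com' is dropped even though it contains 'example.com' as a substring, which the plain `in` check would have let through. Whether you want this stricter behaviour (for example, whether subdomains or a leading 'www.' should count as a match) depends on your data.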

Output: