Speeding up fuzzy matching of words between two lists

Username

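I have a list of roughly 500 items. I want to replace every fuzzy-matched item in that list with the shortest of the matching items.

Is there a way to speed up my fuzzy-matching implementation?

Note: I posted a similar question before, but I have reframed it because it got no responses.

My implementation: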

from collections import defaultdict

from fuzzywuzzy import fuzz


def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    """
    Intended call:
        list1 = list(ds1.Title)
        list2 = list(ds1.Title)
    """
    matchdict = defaultdict(list)

    for i, u in enumerate(list1):
        for i1, u1 in enumerate(list2):
            # Since list orders are the same, this makes sure this isn't the same item.
            if i != i1:
                if fuzz.partial_token_sort_ratio(u, u1) >= cutoff:
                    pair = (u, u1)

                    # Because there are potential duplicates, I have to make the key constant.
                    # Otherwise, putting list1 as the key will result in both duplicate items
                    # serving as the key.
                    #
                    # Potential problem:
                    # what if there are different shortstr?
                    shortstr = min(pair, key=len)
                    longstr = max(pair, key=len)
                    matchdict[shortstr].append(longstr)
    return matchdict
David

I will assume you already have python-Levenshtein installed, which gives you roughly a 4x speedup.
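If you are not sure whether that assumption holds on your machine, a quick check like the following may help. It is only a sketch: it assumes the package exposes a top-level module named Levenshtein, and the helper name has_python_levenshtein is made up for this example.

```python
import importlib.util


def has_python_levenshtein():
    """Return True if a module named `Levenshtein` can be imported (assumed module name)."""
    return importlib.util.find_spec("Levenshtein") is not None


# True when the faster backend appears to be installed, False otherwise.
speedup_available = has_python_levenshtein()
```

When the module is missing, fuzzywuzzy falls back to a slower pure-Python matcher and warns about it on import.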

Optimizing the loop and the dictionary access:

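import itertools

from fuzzywuzzy import fuzz


def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    matchdict = dict()

    # Every ordered pair of distinct indices; assumes both lists have the same length.
    for i1, i2 in itertools.permutations(range(len(list1)), 2):
        u1 = list1[i1]
        u2 = list2[i2]

        if fuzz.partial_token_sort_ratio(u1, u2) >= cutoff:
            shortstr = min(u1, u2, key=len)
            longstr = max(u1, u2, key=len)
            # setdefault stores the new list in the dict; get() would not.
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict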
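As a small aside on the loop rewrite, here is a sketch of why a single loop over itertools.permutations of the indices can stand in for two nested loops with an "is this the same index?" test. The three-item list is made up for illustration.

```python
import itertools

items = ["a", "b", "c"]  # hypothetical sample data

# Nested loops with an explicit "not the same index" test.
nested = [(i, j) for i in range(len(items)) for j in range(len(items)) if i != j]

# One flat loop over the same ordered pairs of distinct indices.
flat = list(itertools.permutations(range(len(items)), 2))
```

Both variants visit the same index pairs in the same order, so the flat loop only removes the per-iteration comparison and one level of nesting.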

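Apart from the fuzz call, find_fuzzymatch_samelist is now about as fast as it gets. If you read the fuzzywuzzy source code, you will see that each string is preprocessed again on every iteration. We can do all of that once, up front: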

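import sys

from fuzzywuzzy.string_processing import StringProcessor

# Mirrors the helpers in fuzzywuzzy.utils so that this snippet stands alone.
PY3 = sys.version_info[0] == 3
bad_chars = str("").join(chr(i) for i in range(128, 256))
if PY3:
    translation_table = dict((ord(c), None) for c in bad_chars)


def _asciionly(s):
    # Drop every character in the range 128-255.
    if PY3:
        return s.translate(translation_table)
    else:
        return s.translate(None, bad_chars)


def full_pre_process(s, force_ascii=True):
    if force_ascii:
        s = _asciionly(s)
    # Keep only letters and numbers (see Unicode docs).
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
    # Force into lowercase.
    string_out = StringProcessor.to_lower_case(string_out)
    # Remove leading and trailing whitespace.
    string_out = StringProcessor.strip(string_out)

    # Sort the whitespace-separated tokens (not the individual characters).
    out = ' '.join(sorted(string_out.split()))
    return out.strip()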


import itertools

from fuzzywuzzy import fuzz


def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    matchdict = dict()
    # Preprocess every string once, up front; the originals are kept for the result.
    proc1 = [full_pre_process(each) for each in list1]
    if list1 is not list2:
        proc2 = [full_pre_process(each) for each in list2]
    else:
        # Comparing a list to itself: reuse the processed list instead of redoing the work.
        proc2 = proc1

    for i1, i2 in itertools.permutations(range(len(list1)), 2):
        # The strings are already cleaned and token-sorted, so partial_ratio is enough.
        if fuzz.partial_ratio(proc1[i1], proc2[i2]) >= cutoff:
            pair = (list1[i1], list2[i2])

            shortstr = min(pair, key=len)
            longstr = max(pair, key=len)
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict
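Two further ideas, offered as a hedged sketch and not as part of the answer above. First, if the scorer gives the same result for (a, b) and (b, a), which is an assumption you should verify for your scorer, itertools.combinations visits each unordered pair once and halves the number of calls. Second, the stated goal was to replace each item with its shortest match, so the sketch builds that replacement map directly. The scorer simple_ratio is a stand-in built on the standard library's difflib, not a fuzzywuzzy function, and all names here are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def simple_ratio(a, b):
    """Hypothetical stand-in scorer (0-100) built on difflib, not on fuzzywuzzy."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def shortest_match_map(items, scorer=simple_ratio, cutoff=90):
    """Map each item to the shortest item it matched (or to itself if none)."""
    best = {item: item for item in items}
    # combinations() yields each unordered pair once, so the scorer runs
    # n*(n-1)/2 times instead of n*(n-1). Assumes the scorer is symmetric.
    for a, b in combinations(items, 2):
        if scorer(a, b) >= cutoff:
            shortest = min(a, b, best[a], best[b], key=len)
            best[a] = best[b] = shortest
    return best


titles = ["The Great Gatsby", "The Great Gatsby.", "Moby Dick"]  # sample data
mapping = shortest_match_map(titles)
deduplicated = [mapping[t] for t in titles]
```

Any scorer with the same two-argument shape can be passed in through the scorer parameter. Note that this single pass does not fully resolve chains of matches (a matches b, b matches c, but a does not match c), so treat it as a starting point.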
