Python: Group similar words within sentences in pandas

Posted by albator in Dev

albator

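I have a database of sentences, many of which are just single words. I often have variants of the same word, such as purchase and purchases. When I count the words, both forms are counted separately, which distorts the totals. What I need is the following:

I want to loop over my column and, the first time I encounter a word, replace similar words in the other sentences with it. I tried fuzzywuzzy, but I only get the words back at the end, not the sentences.

For example: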

This topic is about purchasing

He was talking about shopping

would become:

This topic is about purchasing

He was talking about purchasing

It does not matter if the sentences end up distorted.

I applied this code, but the result is not satisfactory:

import pandas
from fuzzywuzzy import fuzz

# Replace strings that are at least 90% similar
def func(input_list):
    for count, item in enumerate(input_list):
        rest_of_input_list = input_list[:count] + input_list[count + 1:]
        new_list = []
        for other_item in rest_of_input_list:
            similarity = fuzz.ratio(item, other_item)
            if similarity >= 90:
                new_list.append(item)
            else:
                new_list.append(other_item)
        input_list = new_list[:count] + [item] + new_list[count :]
                
    return input_list

df = pandas.read_csv('input.txt') # Read data from csv
result = []
for column in list(df):
    column_values = list(df[column])
    first_words = [x[:x.index(" ")] if " " in x else x for x in column_values]
    result.append(func(first_words))
    
new_df = pandas.DataFrame(result).transpose() 
new_df.columns = list(df)

print(new_df)

Serge de Gosson de Varennes

Maybe this is a possible solution. Given the following data:

input
This topic is about purchasing
He was talking about shopping
That was the reason for the request
About request
requests
My home is nice
My home is beautiful
My homes are nice

and:

import pandas as pd
from fuzzywuzzy import fuzz

# Replace strings that are at least 50% similar (fuzz.ratio score >= 50)
def func(input_list):
    for count, item in enumerate(input_list):
        rest_of_input_list = input_list[:count] + input_list[count + 1:]
        new_list = []
        for other_item in rest_of_input_list:
            similarity = fuzz.ratio(item, other_item)
            if similarity >= 50:
                new_list.append(item)
            else:
                new_list.append(other_item)
        input_list = new_list[:count] + [item] + new_list[count :]
                
    return input_list

df = pd.read_csv('input.txt')

result = []
for column in list(df):
    column_values = list(df[column])
    result.append(func(column_values))
    
new_df = pd.DataFrame(result).transpose() 
new_df.columns = ['output']

full_df = pd.concat([df,new_df], axis=1)
print(full_df)

You get the following output:

                             input                               output
0       This topic is about purchasing        He was talking about shopping
1        He was talking about shopping        He was talking about shopping
2  That was the reason for the request  That was the reason for the request
3                        About request                             requests
4                             requests                             requests
5                      My home is nice                    My homes are nice
6                 My home is beautiful                    My homes are nice
7                    My homes are nice                    My homes are nice

Note that I changed the similarity threshold. In fact, if you check the similarity scores, none of them reaches 90.
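To check that claim yourself, a small sketch like the following prints the pairwise scores for a few of the sentences. It uses `difflib` from the standard library as a stand-in for `fuzz.ratio` (an assumption, made so the snippet runs without fuzzywuzzy installed), so the exact numbers may differ slightly from fuzzywuzzy's. The helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (difflib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


sentences = [
    "This topic is about purchasing",
    "He was talking about shopping",
    "About request",
    "requests",
]

# Score every unordered pair of sentences once.
scores = {(a, b): similarity(a, b) for a, b in combinations(sentences, 2)}

# Print the pairs from most to least similar.
for (a, b), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:3} | {a!r} vs {b!r}")
```

Looking at the highest scores first is a quick way to pick a threshold that suits your data.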

Update

Another approach is:

import pandas as pd
import fuzzywuzzy.fuzz as fuzz

df = pd.read_csv('input.txt')
print('--- before ---')
print(df)
SENTENCES = df['input'].to_list()
print('--- changes ---')
for index, word in enumerate(SENTENCES):
    # Compare each sentence only with the sentences that come after it.
    for other_index, other_word in enumerate(SENTENCES[index+1:], index+1):
        result = fuzz.token_sort_ratio(word, other_word)
        if result > 10:
            print(f'OK | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}')
        elif result > 20:
            # Never reached: any score above 20 already satisfies the test above.
            print(f'   | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}')
        if result > 45:
            # Above the threshold, overwrite the current entry with the later sentence.
            SENTENCES[index] = other_word

df['output'] = SENTENCES

print(df)

It gives some information about what is happening:

--- before ---
                                 input
0       This topic is about purchasing
1        He was talking about shopping
2  That was the reason for the request
3                        About request
4                             requests
5                      My home is nice
6                 My home is beautiful
7                    My homes are nice
--- changes ---
OK |  51 |  0 This topic is about purchasing ->  1 He was talking about shopping
OK |  34 |  0 This topic is about purchasing ->  2 That was the reason for the request
OK |  37 |  0 This topic is about purchasing ->  3 About request
OK |  16 |  0 This topic is about purchasing ->  4 requests
OK |  36 |  0 This topic is about purchasing ->  5 My home is nice
OK |  28 |  0 This topic is about purchasing ->  6 My home is beautiful
OK |  30 |  0 This topic is about purchasing ->  7 My homes are nice
OK |  41 |  1 He was talking about shopping ->  2 That was the reason for the request
OK |  43 |  1 He was talking about shopping ->  3 About request
OK |  16 |  1 He was talking about shopping ->  4 requests
OK |  23 |  1 He was talking about shopping ->  5 My home is nice
OK |  37 |  1 He was talking about shopping ->  6 My home is beautiful
OK |  26 |  1 He was talking about shopping ->  7 My homes are nice
OK |  38 |  2 That was the reason for the request ->  3 About request
OK |  37 |  2 That was the reason for the request ->  4 requests
OK |  16 |  2 That was the reason for the request ->  5 My home is nice
OK |  22 |  2 That was the reason for the request ->  6 My home is beautiful
OK |  31 |  2 That was the reason for the request ->  7 My homes are nice
OK |  67 |  3 About request ->  4 requests
OK |  21 |  3 About request ->  5 My home is nice
OK |  36 |  3 About request ->  6 My home is beautiful
OK |  33 |  3 About request ->  7 My homes are nice
OK |  17 |  4 requests ->  5 My home is nice
OK |  29 |  4 requests ->  6 My home is beautiful
OK |  32 |  4 requests ->  7 My homes are nice
OK |  54 |  5 My homes are nice ->  6 My home is beautiful
OK | 100 |  5 My homes are nice ->  7 My homes are nice
OK |  54 |  6 My home is beautiful ->  7 My homes are nice
                                 input                               output
0       This topic is about purchasing        He was talking about shopping
1        He was talking about shopping        He was talking about shopping
2  That was the reason for the request  That was the reason for the request
3                        About request                             requests
4                             requests                             requests
5                      My home is nice                    My homes are nice
6                 My home is beautiful                    My homes are nice
7                    My homes are nice                    My homes are nice
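The update uses `fuzz.token_sort_ratio` rather than `fuzz.ratio`. Roughly, it compares the strings after sorting their words, so word order stops mattering. The sketch below only approximates that idea with `difflib` from the standard library. It is not fuzzywuzzy's actual implementation (which also cleans up punctuation, as far as I know), and the name `token_sort_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> int:
    """Score two strings (0-100) after lowercasing and sorting their words."""

    def normalise(text: str) -> str:
        # Lowercase, split on whitespace, sort the words, and rejoin them.
        return " ".join(sorted(text.lower().split()))

    matcher = SequenceMatcher(None, normalise(a), normalise(b))
    return round(100 * matcher.ratio())


# Same words in a different order: both normalise to the same string.
print(token_sort_similarity("My home is nice", "nice is my home"))

# Different words: the score drops below 100.
print(token_sort_similarity("About request", "requests"))
```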

To get just the DataFrame:

import pandas as pd
import fuzzywuzzy.fuzz as fuzz

df = pd.read_csv('input.txt')

SENTENCES = df['input'].to_list()

for index, word in enumerate(SENTENCES):
    # Compare each sentence only with the sentences that come after it.
    for other_index, other_word in enumerate(SENTENCES[index+1:], index+1):
        result = fuzz.token_sort_ratio(word, other_word)
        # Above the threshold, overwrite the current entry with the later sentence.
        if result > 45:
            SENTENCES[index] = other_word

df['output'] = SENTENCES

print(df)
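The code above replaces whole sentences. If the goal is to keep each sentence and only swap similar words for the first form seen, as the question describes, a word-level pass might look like the sketch below. This is not the approach from the answer above: it uses `difflib` from the standard library instead of fuzzywuzzy, and the names `similar` and `canonicalise` and the 0.85 threshold are assumptions chosen for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float) -> bool:
    """Return True when two words are at least `threshold` similar (0-1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def canonicalise(sentences, threshold=0.85):
    """Replace each word with the first similar word seen so far."""
    seen = []    # canonical words, in order of first appearance
    result = []
    for sentence in sentences:
        words = []
        for word in sentence.split():
            # Reuse the earliest canonical word that is similar enough.
            match = next((c for c in seen if similar(word, c, threshold)), None)
            if match is None:
                seen.append(word)
                match = word
            words.append(match)
        result.append(" ".join(words))
    return result


sentences = ["About request", "requests", "My home is nice", "My homes are nice"]
cleaned = canonicalise(sentences)
print(cleaned)

# Count words after the replacement, so variants are counted together.
counts = Counter(word for s in cleaned for word in s.split())
print(counts.most_common(3))
```

With a DataFrame, something like `df['output'] = canonicalise(df['input'].to_list())` should apply it to the column. The threshold needs tuning: too low and unrelated short words get merged, too high and plurals are missed.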


Edited on 2021-10-31
