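I have a database of sentences, although an entry is often just a single word. I frequently have words that mean the same thing, such as purchasing and shopping. When I count words, they are counted separately, which skews the counts. What I need is the following:
I want to loop over my column and, the first time I see a word, replace similar words in the other sentences with it. I tried fuzzy matching, but I only end up with the words and not the sentences.
For example: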
This topic is about purchasing
He was talking about shopping
becomes:
This topic is about purchasing
He was talking about purchasing
It does not matter if the sentences end up distorted.
I applied this code, but the result is not satisfactory:
import pandas
from fuzzywuzzy import fuzz


# Replace strings that are at least 90% similar
def func(input_list):
    for count, item in enumerate(input_list):
        rest_of_input_list = input_list[:count] + input_list[count + 1:]
        new_list = []
        for other_item in rest_of_input_list:
            similarity = fuzz.ratio(item, other_item)
            if similarity >= 90:
                new_list.append(item)
            else:
                new_list.append(other_item)
        input_list = new_list[:count] + [item] + new_list[count:]
    return input_list


df = pandas.read_csv('input.txt')  # Read data from csv
result = []
for column in list(df):
    column_values = list(df[column])
    # Keep only the first word of each value
    first_words = [x[:x.index(" ")] if " " in x else x for x in column_values]
    result.append(func(first_words))

new_df = pandas.DataFrame(result).transpose()
new_df.columns = list(df)
print(new_df)
Here is one possible solution. Given the following data:
input
This topic is about purchasing
He was talking about shopping
That was the reason for the request
About request
requests
My home is nice
My home is beautiful
My homes are nice
and this code:
import pandas as pd
from fuzzywuzzy import fuzz


# Replace strings that are at least 50% similar
def func(input_list):
    for count, item in enumerate(input_list):
        rest_of_input_list = input_list[:count] + input_list[count + 1:]
        new_list = []
        for other_item in rest_of_input_list:
            similarity = fuzz.ratio(item, other_item)
            if similarity >= 50:
                new_list.append(item)
            else:
                new_list.append(other_item)
        input_list = new_list[:count] + [item] + new_list[count:]
    return input_list


df = pd.read_csv('input.txt')
result = []
for column in list(df):
    column_values = list(df[column])
    # Compare whole sentences instead of only the first word
    result.append(func(column_values))

new_df = pd.DataFrame(result).transpose()
new_df.columns = ['output']
full_df = pd.concat([df, new_df], axis=1)
print(full_df)
You get the following output:
                                 input                               output
0       This topic is about purchasing        He was talking about shopping
1        He was talking about shopping        He was talking about shopping
2  That was the reason for the request  That was the reason for the request
3                        About request                             requests
4                             requests                             requests
5                      My home is nice                    My homes are nice
6                 My home is beautiful                    My homes are nice
7                    My homes are nice                    My homes are nice
Note that I lowered the similarity threshold from 90 to 50. In fact, if you check the similarity scores, none of these pairs reaches 90.
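To see why 90 is too strict, you can print the pairwise scores. The snippet below is only a rough sketch: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, and the helper name `ratio` and the shortened sentence list are my own assumptions, so the numbers may differ slightly from what fuzzywuzzy reports.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale (assumed stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


sentences = [
    "This topic is about purchasing",
    "He was talking about shopping",
    "About request",
    "requests",
    "My home is nice",
    "My homes are nice",
]

# Score every unordered pair once
scores = {(a, b): ratio(a, b) for a, b in combinations(sentences, 2)}

# Print the pairs from most to least similar
for (a, b), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:3} | {a!r} vs {b!r}")
```

Even the closest pair here stays below 90, so a threshold of 90 would leave every sentence unchanged.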
Update
Another approach is:
import pandas as pd
import fuzzywuzzy.fuzz as fuzz

df = pd.read_csv('input.txt')

print('--- before ---')
print(df)

SENTENCES = df['input'].to_list()

print('--- changes ---')
for index, word in enumerate(SENTENCES):
    # Compare each sentence only with the sentences that come after it
    for other_index, other_word in enumerate(SENTENCES[index+1:], index+1):
        result = fuzz.token_sort_ratio(word, other_word)
        if result > 10:
            print(f'OK | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}')
        elif result > 20:
            # Unreachable as written: any score above 20 is also above 10
            print(f'   | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}')
        # Replace the current sentence with the later one when similar enough
        if result > 45:
            SENTENCES[index] = other_word

df['output'] = SENTENCES
print(df)
It prints some information about what is happening:
--- before ---
                                 input
0       This topic is about purchasing
1        He was talking about shopping
2  That was the reason for the request
3                        About request
4                             requests
5                      My home is nice
6                 My home is beautiful
7                    My homes are nice
--- changes ---
OK |  51 |  0 This topic is about purchasing ->  1 He was talking about shopping
OK |  34 |  0 This topic is about purchasing ->  2 That was the reason for the request
OK |  37 |  0 This topic is about purchasing ->  3 About request
OK |  16 |  0 This topic is about purchasing ->  4 requests
OK |  36 |  0 This topic is about purchasing ->  5 My home is nice
OK |  28 |  0 This topic is about purchasing ->  6 My home is beautiful
OK |  30 |  0 This topic is about purchasing ->  7 My homes are nice
OK |  41 |  1 He was talking about shopping ->  2 That was the reason for the request
OK |  43 |  1 He was talking about shopping ->  3 About request
OK |  16 |  1 He was talking about shopping ->  4 requests
OK |  23 |  1 He was talking about shopping ->  5 My home is nice
OK |  37 |  1 He was talking about shopping ->  6 My home is beautiful
OK |  26 |  1 He was talking about shopping ->  7 My homes are nice
OK |  38 |  2 That was the reason for the request ->  3 About request
OK |  37 |  2 That was the reason for the request ->  4 requests
OK |  16 |  2 That was the reason for the request ->  5 My home is nice
OK |  22 |  2 That was the reason for the request ->  6 My home is beautiful
OK |  31 |  2 That was the reason for the request ->  7 My homes are nice
OK |  67 |  3 About request ->  4 requests
OK |  21 |  3 About request ->  5 My home is nice
OK |  36 |  3 About request ->  6 My home is beautiful
OK |  33 |  3 About request ->  7 My homes are nice
OK |  17 |  4 requests ->  5 My home is nice
OK |  29 |  4 requests ->  6 My home is beautiful
OK |  32 |  4 requests ->  7 My homes are nice
OK |  54 |  5 My homes are nice ->  6 My home is beautiful
OK | 100 |  5 My homes are nice ->  7 My homes are nice
OK |  54 |  6 My home is beautiful ->  7 My homes are nice
                                 input                               output
0       This topic is about purchasing        He was talking about shopping
1        He was talking about shopping        He was talking about shopping
2  That was the reason for the request  That was the reason for the request
3                        About request                             requests
4                             requests                             requests
5                      My home is nice                    My homes are nice
6                 My home is beautiful                    My homes are nice
7                    My homes are nice                    My homes are nice
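A note on `fuzz.token_sort_ratio`: it compares the two strings after sorting their words, so word order does not affect the score. The snippet below is a rough approximation of that idea and not fuzzywuzzy's actual implementation. It uses `difflib.SequenceMatcher` from the standard library, and the names `token_sort_key` and `approx_token_sort_ratio` are hypothetical helpers of mine.

```python
from difflib import SequenceMatcher


def token_sort_key(text: str) -> str:
    """Lowercase the text, split it on whitespace, sort the words, rejoin."""
    return " ".join(sorted(text.lower().split()))


def approx_token_sort_ratio(a: str, b: str) -> int:
    """Approximate 0-100 similarity of the two word-sorted strings."""
    matcher = SequenceMatcher(None, token_sort_key(a), token_sort_key(b))
    return round(100 * matcher.ratio())


# Same words in a different order give a perfect score
print(approx_token_sort_ratio("request about", "About request"))

# A shorter string that shares one word scores lower
print(approx_token_sort_ratio("About request", "requests"))
```

This is why whole sentences that share most of their words score high even when the words are arranged differently.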
To get just the dataframe:
import pandas as pd
import fuzzywuzzy.fuzz as fuzz

df = pd.read_csv('input.txt')
SENTENCES = df['input'].to_list()

for index, word in enumerate(SENTENCES):
    # Compare each sentence only with the sentences that come after it
    for other_index, other_word in enumerate(SENTENCES[index+1:], index+1):
        result = fuzz.token_sort_ratio(word, other_word)
        # Replace the current sentence with the later one when similar enough
        if result > 45:
            SENTENCES[index] = other_word

df['output'] = SENTENCES
print(df)
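If you would rather keep the sentences and replace only individual words, with the first spelling seen winning, something along these lines might work. This is a minimal sketch under my own assumptions, not a tested solution: it uses `difflib.SequenceMatcher` instead of fuzzywuzzy, the function names and the 0.85 threshold are arbitrary choices of mine, and it only catches words that are spelled alike. Words that are merely related in meaning, such as purchasing and shopping, are not similar as strings and stay untouched.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float) -> bool:
    """True when the two words are close in spelling (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def canonicalize(sentences, threshold=0.85):
    """Replace each word with the first previously seen word it resembles."""
    seen = []  # canonical words, in order of first appearance
    result = []
    for sentence in sentences:
        words = []
        for word in sentence.split():
            match = next((c for c in seen if similar(word, c, threshold)), None)
            if match is None:
                # First time a word like this shows up: keep it as canonical
                seen.append(word)
                match = word
            words.append(match)
        result.append(" ".join(words))
    return result


sentences = [
    "This topic is about purchasing",
    "He was talking about shopping",
    "That was the reason for the request",
    "About request",
    "requests",
    "My home is nice",
    "My home is beautiful",
    "My homes are nice",
]

for before, after in zip(sentences, canonicalize(sentences)):
    print(f"{before!r} -> {after!r}")
```

On this data, "requests" becomes "request" and "homes" becomes "home", so a word count afterwards no longer splits them. The threshold needs tuning on real data, because a lower value starts to merge unrelated short words.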