I have two text files, search.txt and log.txt, which contain data like the following.
search.txt
19:00:15 , mouse , FALSE
19:00:15 , branded luggage bags and trolley , TRUE
19:00:15 , Leather shoes for men , FALSE
19:00:15 , printers , TRUE
19:00:16 , adidas watches for men , TRUE
19:00:16 , Mobile Charger Stand/Holder black , FALSE
19:00:16 , watches for men , TRUE
log.txt
19:00:00 , trakjkfsa,
19:00:00 , door,
19:00:00 , sweater,
19:00:00 , sweater,
19:00:00 , sweater,
19:00:00 , dis,
19:00:01 , not,
19:00:01 , nokia,
19:00:01 , collar,
19:00:01 , nokia,
19:00:01 , collar,
19:00:01 , gsm,
19:00:01 , sweater,
19:00:01 , sweater,
19:00:01 , gsm,
19:00:02 , gsm,
19:00:02 , show,
19:00:02 , wayfreyerv,
19:00:02 , door,
19:00:02 , collar,
19:00:02 , or,
19:00:02 , harman,
19:00:02 , women's,
19:00:02 , collar,
19:00:02 , sweater,
19:00:02 , head,
19:00:03 , womanw,
19:00:03 , com.shopclues.utils.k@42233ff0,
19:00:03 , samsu,
19:00:03 , adidas,
19:00:03 , collar,
19:00:04 , ambas,
19:00:04 , harman,
19:00:04 , mi,
19:00:04 , nor,
19:00:04 , airtel,
19:00:04 , ,
19:00:04 , adidas,
19:00:05 , harman,
19:00:05 , collar,
19:00:05 , flip,
19:00:05 , brass,
19:00:05 , laptop,
19:00:05 , collar,
19:00:05 , wayfreyer,
19:00:05 , head,
19:00:05 , adidas,
19:00:05 , discn,
19:00:05 , head,
19:00:05 , adidas,
19:00:05 , collar,
19:00:05 , collar,
19:00:06 , disco,
19:00:06 , head,
19:00:06 , harman,
19:00:06 , nigh,
19:00:06 , microsoft,
19:00:06 , ambassado,
19:00:07 , salwar,
19:00:07 , bb,
19:00:07 , harman,
19:00:07 , ambassador,
19:00:07 , ambassador,
19:00:07 , salwar,
19:00:08 , microsoft,
19:00:08 , ac,
19:00:08 , jea,
19:00:08 , gens,
19:00:08 , ambassador,
19:00:08 , orpa,
19:00:09 , ac,
19:00:09 , black,
19:00:09 , asus,
19:00:09 , salwar,
19:00:09 , salwar,
19:00:09 , ac,
19:00:10 , whechains,
19:00:10 , gens,
19:00:10 , ambassador,
19:00:10 , sony,
19:00:10 , salwa,
19:00:10 , ac,
19:00:10 , woman,
19:00:10 , li,
19:00:11 , boxers,
19:00:11 , harman,
19:00:11 , sal,
19:00:11 , ambassador,
19:00:11 , sony,
19:00:11 , ,
19:00:11 , boxers,
19:00:12 , adidas,
19:00:12 , samsung,
19:00:12 , boxer,
19:00:12 , boxers,
19:00:12 , com.shopclues.utils.k@427b9538,
19:00:12 , harman,
19:00:12 , wechains#002,
19:00:12 , collar,
19:00:13 , collar,
19:00:13 , collar,
19:00:13 , one,
19:00:13 , collar,
19:00:13 , ambassador,
19:00:13 , hitech,
19:00:13 , fanc,
19:00:13 , adidas,
19:00:13 , bp,
19:00:13 , asus,
19:00:13 , ambassador,
19:00:13 , harman,
19:00:14 , lin,
19:00:14 , one,
19:00:14 , samsung,
19:00:14 , cond,
19:00:14 , atx,
19:00:15 , blackles#002,
19:00:15 , woman,
19:00:15 , asus,
19:00:15 , airtel,
19:00:15 , weel,
19:00:15 , aenglish,
19:00:15 , orpat,
19:00:15 , one,
19:00:15 , condom,
19:00:15 , one,
19:00:15 , ling,
19:00:15 , fancy,
19:00:15 , orpat,
19:00:15 , woman,
19:00:19 , watches fo,
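What I need to do is open both files and, for each query in search.txt (starting with the first), go to log.txt and look for queries related to it within 60 seconds before and after its timestamp. If any related queries are found, they should be collected in a list and appended to the corresponding line of search.txt.
The output should look like this: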
search.txt
19:00:15 , mouse , FALSE - []
19:00:15 , branded luggage bags and trolley , TRUE - []
19:00:15 , Leather shoes for men , FALSE - []
19:00:15 , printers , TRUE - []
19:00:16 , adidas watches for men , TRUE - [adidas,adidas,adidas,adidas,adidas,adidas]
19:00:16 , Mobile Charger Stand/Holder black , FALSE - []
19:00:16 , watches for men , TRUE
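Let's take an example. If "mouse" is the query that was placed at "19:00:15" in search.txt, I need to go to log.txt and find the queries related to "mouse" in the time range "18:59:15-19:01:15", that is, 60 seconds before and after the search.txt timestamp. If there are related queries, they are stored as a list on that line of search.txt.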
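The window check described above can be sketched as follows. This is only a minimal illustration: the helper names parse_time and within_window are hypothetical, and it assumes both timestamps fall on the same day, since the files only carry a time of day.

```python
import datetime as dt

def parse_time(text):
    """Parse an HH:MM:SS string; strptime fills in a default date of 1900-01-01."""
    return dt.datetime.strptime(text.strip(), "%H:%M:%S")

def within_window(search_ts, log_ts, seconds=60):
    """Return True if log_ts lies within +/- `seconds` of search_ts."""
    return abs(log_ts - search_ts) <= dt.timedelta(seconds=seconds)

search_ts = parse_time("19:00:15")
print(within_window(search_ts, parse_time("18:59:15")))  # True  (exactly 60 s before)
print(within_window(search_ts, parse_time("19:01:16")))  # False (61 s after)
```

A window that crosses midnight would need real dates rather than bare times.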
Here is the code:
import datetime
from collections import defaultdict

def getting_partial_queries(querylist):
    # Build every prefix (2+ characters) of the full query string.
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2, len(basequery) + 1):
        querylist.append(basequery[:n])
    return querylist

# Map each logged query to the list of times it was seen.
queries_time = defaultdict(list)
with open('log.txt') as f:
    for line in f:
        fields = [x.strip() for x in line.split(',')]
        timestamp = datetime.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [x.strip() for x in line.split(',')]
        timestamp = datetime.datetime.strptime(fields[0], "%H:%M:%S")
        queries = getting_partial_queries(fields[1].split())
        results = []
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                # Up to 60 seconds before the search (inclusive of the same second).
                if timestamp - datetime.timedelta(seconds=60) <= ts <= timestamp:
                    results.append(q)
                # Up to 60 seconds after; elif avoids counting ts == timestamp twice.
                elif timestamp < ts <= timestamp + datetime.timedelta(seconds=60):
                    results.append(q)
        outputf.write(line.strip() + " , {}\n".format(results))
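I'm still not sure what you mean by "partial queries", but the code below can handle it: you only need to redefine what a partial query is in the function filter_out_common_queries. For example, if you are looking for exact matches of the query from search.txt, you can replace the "# add your logic here" line with return [' '.join(querylist), ].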
import datetime as dt
from collections import defaultdict

def filter_out_common_queries(querylist):
    # add your logic here
    return querylist

queries_time = defaultdict(list)  # personally, I'd use 'set' as the default factory
with open('log.txt') as f:
    for line in f:
        fields = [x.strip() for x in line.split(',')]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [x.strip() for x in line.split(',')]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        # "adidas watches for men" -> "adidas" "watches" "for" "men".
        # "for" is a very generic keyword. You should do well to filter these out.
        queries = filter_out_common_queries(fields[1].split())
        results = []
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                if timestamp - dt.timedelta(seconds=15) <= ts <= timestamp:
                    results.append(q)
        outputf.write(line.strip() + " - {}\n".format(results))
Output based on your input data:
19:00:15 , mouse , FALSE - []
19:00:15 , branded luggage bags and trolley , TRUE - []
19:00:15 , Leather shoes for men , FALSE - []
19:00:15 , printers , TRUE - []
19:00:16 , adidas watches for men , TRUE - ['adidas', 'adidas', 'adidas', 'adidas', 'adidas', 'adidas']
19:00:16 , Mobile Charger Stand/Holder black , FALSE - ['black']
19:00:16 , watches for men , TRUE - []
Note that a match for "black" was found for "Mobile Charger Stand/Holder black". That is because the code above looks up each word of the query separately.
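If matching on very generic words is not wanted, one option is to drop them before the lookup. The following is a minimal sketch, assuming a hand-picked stop-word set; COMMON_WORDS is hypothetical and not part of the original code. It would not remove the "black" match, but it does discard words such as "for".

```python
# Hypothetical stop-word list; adjust it to your own data.
COMMON_WORDS = {"for", "and", "or", "the", "of", "in", "with"}

def filter_out_common_queries(querylist):
    """Keep only the words that are not in COMMON_WORDS (case-insensitive)."""
    return [word for word in querylist if word.lower() not in COMMON_WORDS]

print(filter_out_common_queries("adidas watches for men".split()))
# ['adidas', 'watches', 'men']
```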
Edit: to implement what you described in your comment, redefine filter_out_common_queries as follows.
def filter_out_common_queries(querylist):
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2, len(basequery) + 1):
        querylist.append(basequery[:n])
    return querylist
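To see what this redefinition produces, here is a small self-contained sketch that runs the prefix matching against a few of the log lines shown in the question, held in memory instead of read from files. The names prefixes, parse_time and match_line are hypothetical, and the symmetric 60-second window follows the question rather than the 15-second look-back used above.

```python
import datetime as dt
from collections import defaultdict

def prefixes(query):
    """Return every prefix of `query` that is at least two characters long."""
    return [query[:n] for n in range(2, len(query) + 1)]

def parse_time(text):
    """Parse an HH:MM:SS string into a datetime (date defaults to 1900-01-01)."""
    return dt.datetime.strptime(text.strip(), "%H:%M:%S")

# A few lines copied from log.txt.
log_lines = [
    "19:00:03 , adidas,",
    "19:00:04 , adidas,",
    "19:00:15 , woman,",
    "19:00:19 , watches fo,",
]

# Map each logged query to the list of times it was seen.
queries_time = defaultdict(list)
for line in log_lines:
    fields = [x.strip() for x in line.split(",")]
    queries_time[fields[1]].append(parse_time(fields[0]))

def match_line(search_line, seconds=60):
    """Return the logged queries that are prefixes of the search query
    and were logged within +/- `seconds` of the search timestamp."""
    fields = [x.strip() for x in search_line.split(",")]
    timestamp = parse_time(fields[0])
    window = dt.timedelta(seconds=seconds)
    results = []
    for q in prefixes(fields[1]):
        # .get() avoids creating empty entries in the defaultdict for misses.
        for ts in queries_time.get(q, []):
            if abs(ts - timestamp) <= window:
                results.append(q)
    return results

print(match_line("19:00:16 , adidas watches for men , TRUE"))
# ['adidas', 'adidas']
print(match_line("19:00:16 , watches for men , TRUE"))
# ['watches fo']
```

Because only prefixes of the whole query string are generated, "watches fo" matches the query "watches for men" but not "adidas watches for men".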