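I'm running into a performance problem with a Python function that loads two 5+ GB tab-delimited txt files. The files share the same format but contain different values, and a third text file acts as a key to determine which values should be kept for output. I'd appreciate any help in speeding this up.
Here is the code: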
import csv


def rchfile():
    # there are 24752 text lines per stress period, 520 columns, 476 rows
    # there are 52 lines per MODFLOW model row
    lst = []
    out = []
    tcel = 0
    end_loop_break = False
    # key file that will set which file values to use. If cell address is not present or value of cellid = 1 use
    # baseline.csv, otherwise use test_p97 file.
    with open('input/nrd_cells.csv') as csvfile:
        reader = csv.reader(csvfile)
        for item in reader:
            lst.append([int(item[0]), int(item[1])])
    # two files that are used for data
    with open('input/test_baseline.rch', 'r') as b, open('input/test_p97.rch', 'r') as c:
        for x in range(3):  # skip the first 3 lines that are the file header
            b.readline()
            c.readline()
        while True:  # loop until end of file, this should loop here 1,025 times
            if end_loop_break:
                break
            for x in range(2):  # skip the first 2 lines that are the stress period header
                b.readline()
                c.readline()
            for rw in range(1, 477):
                if end_loop_break:
                    break
                for cl in range(52):
                    # read both files at the same time to get the different data and split the 10 values in the row
                    b_row = b.readline().split()
                    c_row = c.readline().split()
                    if not b_row:
                        # end of file: assign the flag (the posted code compared with == here, which did nothing)
                        end_loop_break = True
                        break
                    for x in range(1, 11):
                        # search for the cell address in the key file to find which file's data to keep
                        testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]
                        if not testval:  # cell address not in key file
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 1:  # cell address value == 1
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 2:  # cell address value == 2
                            out.append(c_row[x - 1])
                            print(cl * 10 + x + tcel)  # test output for cell location
                tcel += 520
    print('success')
The key file looks like this:
37794, 1
37795, 0
37796, 2
The data files are about 5 GB each and are complicated from a counting standpoint, but the format is standard and looks like this:
0 0 0 0 0 0 0 0 0 0
1.5 1.5 0 0 0 0 0 0 0 0
The process takes a very long time, and I'm hoping someone can help speed it up.
I believe your speed problem comes from this line:
testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]
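You are iterating over the entire key list for every single value in the HUGE data files. That is not good.
It looks like cl * 10 + x + tcel is the value you are searching for in lst[n][0].
I would suggest using a dict instead of a list to store the data in lst.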
lst = {}
for item in reader:
    lst[int(item[0])] = int(item[1])
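As a self-contained illustration of that loading step, here is a minimal sketch. The function name load_key_map is hypothetical (it is not in the original code), and it assumes every non-blank row of the key file has the form "address, flag" as in the sample shown in the question.

```python
import csv


def load_key_map(lines):
    """Build {cell_address: flag} from an iterable of CSV lines.

    `lines` can be an open file object or any iterable of strings.
    Assumes each non-blank row looks like "37794, 1" (hypothetical helper).
    """
    key_map = {}
    for row in csv.reader(lines):
        if not row:  # csv.reader yields [] for blank lines; skip them
            continue
        # int() tolerates the leading space left after the comma
        key_map[int(row[0])] = int(row[1])
    return key_map
```

With the real file this would be called as `with open('input/nrd_cells.csv') as csvfile: lst = load_key_map(csvfile)`.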
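Now lst is a mapping, which means you can simply use the in operator to check whether a key exists. This is a near-instant lookup, because the dict type is hash-based and very efficient for key lookups.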
something in lst
# for example
(cl * 10 + x + tcel) in lst
You can get the value with:
lst[something]
# or
lst[cl * 10 + x + tcel]
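To get a feel for the difference, the following rough sketch compares the list scan with the dict lookup. The key data here is made up (100,000 synthetic address/flag pairs), the function names are hypothetical, and the timings will vary by machine, so no figures are quoted.

```python
import timeit

# Synthetic key data (an assumption for illustration): 100,000 [address, flag] pairs.
pairs = [[address, 1 + address % 2] for address in range(100_000)]
mapping = {address: flag for address, flag in pairs}


def flag_by_scan(address):
    """Approach from the question: scan the whole list for a matching address."""
    hits = [i for i, pair in enumerate(pairs) if pair[0] == address]
    return pairs[hits[0]][1] if hits else None


def flag_by_dict(address):
    """dict approach: one hash lookup, returning None when the address is absent."""
    return mapping.get(address)


if __name__ == "__main__":
    target = 99_999
    scan_time = timeit.timeit(lambda: flag_by_scan(target), number=20)
    dict_time = timeit.timeit(lambda: flag_by_dict(target), number=20)
    print(f"list scan: {scan_time:.4f}s  dict lookup: {dict_time:.6f}s")
```

Both functions return the same answer; the scan does work proportional to the length of the key list on every call, while the dict lookup takes roughly constant time on average.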
With a little refactoring, your code should speed up considerably.
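One possible shape for that refactoring is sketched below. The helper name select_values and its parameters are hypothetical, and the sketch assumes the key data is already a dict mapping cell address to flag. It mirrors the branching in the question's code: a missing address or a flag of 1 keeps the baseline value, a flag of 2 keeps the test value, and any other flag appends nothing.

```python
def select_values(b_row, c_row, first_address, key_map):
    """Pick values for one text line of the two data files.

    b_row, c_row  -- the split values from the baseline and test files
    first_address -- cell address of the first value on this line
    key_map       -- dict of {cell_address: flag} (assumed already loaded)
    """
    chosen = []
    for offset, (b_val, c_val) in enumerate(zip(b_row, c_row)):
        # A missing address defaults to flag 1, i.e. use the baseline value.
        flag = key_map.get(first_address + offset, 1)
        if flag == 1:
            chosen.append(b_val)
        elif flag == 2:
            chosen.append(c_val)
        # Any other flag (for example 0) appends nothing, as in the question's code.
    return chosen
```

Inside the existing loops, the inner `for x in range(1, 11)` block could then be replaced with `out.extend(select_values(b_row, c_row, cl * 10 + 1 + tcel, lst))`.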