Extracting data from multiple text files using Python

Daniel

I am trying to extract data from several text files at the same time.

import fileinput

num_lines = sum(1 for line in open('2grams.txt'))  ## in order not to print junk

count = 0
f0 = open("2gram_glues.txt", 'r')
f1 = open("2grams.txt", 'r')
f2 = open("output.txt", 'w')
f3 = open('2mwus.txt', 'r')

with fileinput.input(files=('2grams.txt', '2gram_glues.txt', '2mwus.txt')) as f:
    for line in f:
        f3.seek(0, 0)

        for line1 in f3:

            if line == line1:
                f2.write("The 2 Gram is: " + line.strip() + "\t The score is: " + f0.readline())
                count += 1
                if count >= num_lines:
                    break


f0.close()
f1.close()
f2.close()
f3.close()

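2grams.txt and 2gram_glues.txt have the same number of lines, and their data correspond line by line. However, what I actually want to write to the output file is the data in 2mwus.txt that intersects with 2grams.txt, and 2mwus.txt has a different number of lines.

The problem is that I want to print each matching 2mwus.txt entry together with its value from 2gram_glues.txt (which contains a score).

The scores I get from 2gram_glues.txt simply come out in file order, rather than being the scores that belong to the 2mwus.txt entries.

What am I doing wrong when writing the data?

Link to the text files: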

https://drive.google.com/folderview?id=0B1oTQq97VF44V1p3MEZwQkhqTjQ&usp=sharing

Mike

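I don't think you need fileinput. In your code, f0.readline() simply returns the next unread line of 2gram_glues.txt every time a match is written, so the scores come out in file order no matter which 2-gram matched. Instead, use the index of the matching line in 2grams.txt to pick the score: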

# Read the scores once; scores_lines[i] belongs to line i of 2grams.txt.
with open("2gram_glues.txt", 'r') as scores:
    scores_lines = scores.readlines()

intersect = open('2grams.txt', 'r')

with open('2mwus.txt', 'r') as base:
    for line in base:

        # Split the trailing number off the 2-gram (see the walkthrough below).
        line = line.rstrip()
        number = line[-2:]
        number = int(number.lstrip())

        line = line[:-2]
        line = line.rstrip()

        # Rescan 2grams.txt from the top for every 2mwus.txt entry.
        intersect.seek(0, 0)

        for i, line_intersect in enumerate(intersect):
            line_intersect = line_intersect.rstrip()
            if line == line_intersect:
                # scores_lines[i] still ends with its newline.
                print("**The 2 Gram is: " + line.strip() + "\t The score is: " + scores_lines[i] +
                      'The number is ' + str(number))

intersect.close()
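
The nested loop above rescans 2grams.txt once for every line of 2mwus.txt. As a minimal sketch of an alternative, assuming that line i of 2grams.txt pairs with line i of 2gram_glues.txt (as the question states) and that each 2-gram appears only once, the pairs can be loaded into a dictionary so that each lookup is a single step. The function name match_scores and the sample lines are made up for illustration:

```python
def match_scores(gram_lines, glue_lines, mwu_lines):
    """Return (gram, score, number) for each 2mwus entry found in 2grams.

    Hypothetical helper: it takes iterables of lines, so it works on open
    files as well as on plain lists.
    """
    # Pair line i of 2grams.txt with line i of 2gram_glues.txt.
    score_by_gram = {
        gram.rstrip(): glue.strip()
        for gram, glue in zip(gram_lines, glue_lines)
    }

    matches = []
    for line in mwu_lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        # Same slicing as above: the last two characters hold the number.
        number = int(line[-2:].lstrip())
        gram = line[:-2].rstrip()
        if gram in score_by_gram:
            matches.append((gram, score_by_gram[gram], number))
    return matches


# Made-up sample lines in the layout described in this answer.
grams = ["(850, 900,\n", "the phone\n", "not in mwus\n"]
glues = ["0.857143\n", "0.188679\n", "0.5\n"]
mwus = ["(850, 900,\t12 \n", "the phone\t10 \n", "missing gram\t2 \n"]

for gram, score, number in match_scores(grams, glues, mwus):
    print("The 2 Gram is: " + gram + "\t The score is: " + score
          + "\t The number is " + str(number))
```

With real files, the three arguments could be the open file objects for 2grams.txt, 2gram_glues.txt and 2mwus.txt. If a 2-gram is repeated in 2grams.txt, the dictionary keeps only the last score seen for it.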

Slicing and stripping

From:

'(850,·900,\t12·'
'(frequencies·850,\t4·'
'phone·but\t2·'

#\t denotes tabulation, · denotes spaces

Using:

line = line.rstrip()

gives:

'(850,·900,\t12'
'(frequencies·850,\t4'
'phone·but\t2'

Then get the number:

number = line[-2:]

which gives:

'12'
'\t4'
'\t2'

Then strip the whitespace to the left of the number and convert it:

number = int(number.lstrip())

which gives:

12
4
2

Continuing with our "line":

'(850,·900,\t12'
'(frequencies·850,\t4'
'phone·but\t2'

Using

line = line[:-2]
line = line.rstrip()

gives:

'(850, 900,'
'(frequencies 850,'
'phone but'

A bit cumbersome, but it avoids the need for RegEx.
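
The [-2:] slice works as long as the number has at most two digits. A sketch of an alternative that still avoids regular expressions, assuming the 2-gram and the number are always separated by a tab as in the samples above, is to split once from the right. split_entry is a name made up for this example:

```python
def split_entry(line):
    """Split a '<2-gram><tab><number>' line into (gram, number)."""
    # rsplit with maxsplit=1 splits only at the last tab, so spaces or
    # tabs inside the 2-gram itself are left alone.
    gram, number = line.rstrip().rsplit("\t", 1)
    return gram.rstrip(), int(number)


print(split_entry("(850, 900,\t12 "))      # ('(850, 900,', 12)
print(split_entry("phone but\t2 "))        # ('phone but', 2)
print(split_entry("made up gram\t1234 "))  # ('made up gram', 1234)
```

A line without any tab would raise a ValueError on the unpacking, which may or may not be what you want for malformed input.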

Output

**The 2 Gram is: (850, 900,  The score is: 0.857143
The number is 12
**The 2 Gram is: (Bands 4    The score is: 0.4
The number is 2
**The 2 Gram is: (frequencies 850,   The score is: 1
The number is 4
**The 2 Gram is: 1, 3,   The score is: 1
The number is 8
**The 2 Gram is: 13, 25)     The score is: 0.666667
The number is 2
**The 2 Gram is: 1800, 1900  The score is: 1
The number is 8
**The 2 Gram is: 1900, 2100  The score is: 1
The number is 10
**The 2 Gram is: 5 compatible    The score is: 0.444444
The number is 2
**The 2 Gram is: A1428: UMTS/HSPA+/DC-HSDPA  The score is: 0.5
The number is 2
**The 2 Gram is: A1429: UMTS/HSPA+/DC-HSDPA  The score is: 0.4
The number is 2
**The 2 Gram is: Australia, Germany,     The score is: 1
The number is 2
**The 2 Gram is: B (800,     The score is: 1
The number is 2
**The 2 Gram is: Full specs  The score is: 1
The number is 2
**The 2 Gram is: GSM model   The score is: 0.428571
The number is 6
**The 2 Gram is: In deciding     The score is: 1
The number is 2
**The 2 Gram is: KDDI network    The score is: 0.5
The number is 2
**The 2 Gram is: South Korea).   The score is: 1
The number is 2
**The 2 Gram is: UMTS/HSPA+/DC-HSDPA (850,   The score is: 0.666667
The number is 6
**The 2 Gram is: US AT&T     The score is: 1
The number is 2
**The 2 Gram is: US, along   The score is: 1
The number is 2
**The 2 Gram is: bands 4     The score is: 0.4
The number is 2
**The 2 Gram is: bands, making   The score is: 1
The number is 2
**The 2 Gram is: battery life    The score is: 0.363636
The number is 2
**The 2 Gram is: blazing fast    The score is: 1
The number is 2
**The 2 Gram is: didn't come     The score is: 0.666667
The number is 3
**The 2 Gram is: fact that   The score is: 0.4
The number is 3
**The 2 Gram is: iPhone 5    The score is: 0.526316
The number is 5
**The 2 Gram is: meet compatibility  The score is: 1
The number is 2
**The 2 Gram is: model A1429:    The score is: 0.5
The number is 4
**The 2 Gram is: networks in     The score is: 0.258065
The number is 4
**The 2 Gram is: networks. However,  The score is: 1
The number is 2
**The 2 Gram is: one GSM.    The score is: 0.363636
The number is 2
**The 2 Gram is: phone but   The score is: 0.1
The number is 2
**The 2 Gram is: phone. This     The score is: 0.444444
The number is 2
**The 2 Gram is: release three   The score is: 0.8
The number is 2
**The 2 Gram is: sim card    The score is: 0.8
The number is 2
**The 2 Gram is: standards worldwide.    The score is: 1
The number is 2
**The 2 Gram is: support LTE     The score is: 0.296296
The number is 4
**The 2 Gram is: the phone   The score is: 0.188679
The number is 10
**The 2 Gram is: to my   The score is: 0.12
The number is 3
**The 2 Gram is: works great     The score is: 0.4
The number is 2

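Take-home thoughts:

  • Watch out for whitespace; rstrip is your ally.
  • Names like f1, f2 and f3 feel intuitive at first, but they will confuse you in the long run. Use meaningful names!

Hope this helps!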
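
To illustrate the first point: lines read from a file keep their trailing newline (and any trailing spaces or tabs), so two lines that look identical when printed can still compare unequal. A small sketch with made-up strings, using repr to make the invisible characters visible:

```python
from_grams = "phone but\n"   # as read from one file
from_mwus = "phone but \n"   # same words, plus a trailing space

print(repr(from_grams))  # 'phone but\n'
print(repr(from_mwus))   # 'phone but \n'

print(from_grams == from_mwus)                    # False
print(from_grams.rstrip() == from_mwus.rstrip())  # True
```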
