I have this Python script in which I use the nltk library to parse, tokenize, tag, and chunk some, let's say, random text from the web. I need to format and write to a file the output of `chunked1`, `chunked2`, `chunked3`. These have type `class 'nltk.tree.Tree'`. More specifically, I need to write only the lines that match the regular expressions `chunkGram1`, `chunkGram2`, `chunkGram3`. How can I do that?
```python
#! /usr/bin/python2.7

import nltk
import re
import codecs

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]


def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)
        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)
        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)
        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:
        #     for i, line in enumerate(chunked1):
        #         if "JJ" in line:
        #             outfile.write(line)
        #         elif "NNP" in line:
        #             outfile.write(line)

processLanguage()
```
When I try to run it, I get the error:

```
Traceback (most recent call last):
  File "sentdex.py", line 47, in <module>
    processLanguage()
  File "sentdex.py", line 40, in processLanguage
    outfile.write(line)
  File "C:\Python27\lib\codecs.py", line 688, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, tuple found
```
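For reference, the traceback points at `outfile.write(line)`: iterating over a chunked `nltk.tree.Tree` yields a mix of `Tree` nodes and plain `(word, tag)` tuples, and `write()` only accepts strings. A minimal sketch of the fix, with hypothetical tuples standing in for the tagged output:

```python
import codecs

# chunked1 iterates over a mix of Tree nodes and plain (word, tag) tuples;
# file.write() accepts only strings, hence the TypeError on a tuple.
# Hypothetical stand-ins for a few tagged tuples:
lines = [("library", "NN"), ("digital", "JJ"), ("Library", "NNP")]

with codecs.open("output.txt", "w", encoding="utf8") as outfile:
    for line in lines:
        # membership test checks the tuple's elements, as in the question
        if "JJ" in line or "NNP" in line:
            outfile.write(str(line) + "\n")  # str() converts the tuple first
```

The same `str()` conversion works for the `Tree` nodes themselves, which render as their bracketed chunk notation.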
EDIT: After @Alvas's answer I managed to do what I wanted. What I would like to know now, however, is how to strip all the non-ASCII characters from a text corpus. Example:
```python
#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)


#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged

processLanguage()
```
The above comes from another answer here on S/O. However, it doesn't seem to work. What's wrong? The error I get is:

```
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
not in range(128)
```
Your code has several issues, though the main culprit is that your `for` loop does not modify `xstring` at all.

I'll address all of the issues in your code here:
You cannot write paths like that with a single `\`, since `\t` will be interpreted as a tab character and `\f` as a form feed. You must double them. I know it was just an example here, but this kind of confusion comes up often:

```python
with open('path\\to\\file.txt', 'r') as infile:
    xstring = infile.readlines()
```
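The effect is easy to see in the interpreter: with single backslashes the path silently contains control characters, while doubled backslashes (or a raw string) survive intact:

```python
bad = 'path\to\file.txt'       # \t becomes a tab, \f a form feed
good = 'path\\to\\file.txt'    # doubled backslashes stay as backslashes
raw = r'path\to\file.txt'      # raw string: equivalent to `good`

print(repr(bad))   # the tab and form feed are visible in the repr
```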
The next line, `infile.close`, is wrong. It does not call the close method; it actually does nothing at all. Furthermore, your file was already closed by the `with` clause. If you see this line anywhere in any answer, please just comment that `file.close` is wrong, should be `file.close()`, and downvote the answer.
The following should work, but you need to be aware that replacing every non-ASCII character with `' '` will mangle words such as naïve and café:

```python
def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
```
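For instance, a quick check (run under Python 3, where string literals are unicode by default):

```python
def remove_non_ascii(line):
    # replace every codepoint outside ASCII with a space
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

print(remove_non_ascii('café'))   # 'caf ' -- the é is lost
print(remove_non_ascii('naïve'))  # 'na ve'
```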
But here is why your code fails with the unicode exception: you are not modifying the elements of `xstring` at all. That is, you are computing the line with the non-ASCII characters removed, yes, but that is a new value that is never stored back into the list:

```python
for i, line in enumerate(xstring):
    line = remove_non_ascii(line)
```

Instead it should be:

```python
for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)
```
Or, my preferred, very pythonic:

```python
xstring = [remove_non_ascii(line) for line in xstring]
```
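The difference is easy to demonstrate with a toy list: rebinding the loop variable leaves the list untouched, while assigning through the index actually changes it:

```python
words = ['caf\xe9', 'plain']

# Rebinding `line` does not touch the list:
for i, line in enumerate(words):
    line = line.replace('\xe9', 'e')
assert words == ['caf\xe9', 'plain']   # unchanged

# Assigning through the index does:
for i, line in enumerate(words):
    words[i] = line.replace('\xe9', 'e')
assert words == ['cafe', 'plain']      # modified in place
```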
Though these unicode errors happen mostly because you are using Python 2.7 to handle pure-Unicode text; the recent Python 3 releases are far ahead of it here, so I'd recommend that, if you are only at the beginning of your task, you upgrade to Python 3.4+ as soon as possible.
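Under Python 3, for comparison, the decoding happens explicitly at the file boundary, so the corpus arrives as real `str` objects and the `UnicodeDecodeError` disappears. A sketch assuming a UTF-8 file (the file name `corpus.txt` is hypothetical, and the example writes it first so it is self-contained):

```python
# Write a small UTF-8 corpus so the example is self-contained.
with open('corpus.txt', 'w', encoding='utf8') as f:
    f.write('caf\xe9 au lait\n')

# Python 3: declare the encoding when opening; each line is already
# a decoded str, so downstream processing needs no ASCII workarounds.
with open('corpus.txt', encoding='utf8') as infile:
    xstring = infile.readlines()

print(repr(xstring[0]))
```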