NLTK ne_chunk without pos_tag

Posted by Sang in Dev

Sang

I'm trying to chunk a sentence using ne_chunk and pos_tag in nltk.

from nltk.tag import pos_tag
from nltk.tree import Tree
from nltk.chunk import ne_chunk

sentence = "Michael and John is reading a booklet in a library of Jakarta"
tagged_sent = pos_tag(sentence.split())

# Keep only the named-entity subtrees; plain (word, tag) tuples are skipped.
print_chunk = [chunk for chunk in ne_chunk(tagged_sent) if isinstance(chunk, Tree)]

print(print_chunk)

This is the result:

[Tree('GPE', [('Michael', 'NNP')]), Tree('PERSON', [('John', 'NNP')]), Tree('GPE', [('Jakarta', 'NNP')])]

My question: is it possible to leave out the pos_tag (like NNP above) and keep only the Tree labels 'GPE' and 'PERSON'? And what does 'GPE' mean?

Thanks in advance.

alexis

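The named entity chunker gives you a tree containing both the chunks and the tags. You can't change that, but you can take the tags out. Starting from your tagged_sent: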

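chunks = ne_chunk(tagged_sent)
simple = []
for elt in chunks:
    if isinstance(elt, Tree):
        # Rebuild the subtree with the same label but only the words (tags dropped).
        simple.append(Tree(elt.label(), [word for word, tag in elt]))
    else:
        # Outside a chunk, elt is a (word, tag) tuple; keep just the word.
        simple.append(elt[0])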

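If you only want the chunks, omit the else: clause in the code above. You can adapt the code to wrap the chunks any way you like; I used an nltk Tree to keep the changes to a minimum. Note that some chunks consist of several words (try adding "New York" to your example), so a chunk's contents must be a list, not a single element.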
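If a Tree is more than you need, the same loop can collect plain (label, phrase) pairs instead. This is only a minimal sketch: chunk_phrases and StandInChunk are hypothetical names, not part of nltk. StandInChunk merely imitates the label() method of an nltk Tree so that the demo runs without the chunker models, and the demo data is hand-built rather than real chunker output. In real use you would pass the result of ne_chunk(tagged_sent).

```python
def chunk_phrases(chunks):
    """Return (label, phrase) pairs for every labeled subtree in `chunks`."""
    pairs = []
    for elt in chunks:
        # Named-entity subtrees expose label(); plain (word, tag) tuples do not.
        if hasattr(elt, "label"):
            words = [word for word, tag in elt]
            # Join the words so multi-word chunks come out as one phrase.
            pairs.append((elt.label(), " ".join(words)))
    return pairs


class StandInChunk(list):
    """Hypothetical stand-in for an nltk Tree, only so this demo runs without NLTK."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label


# Hand-built data shaped like ne_chunk output (illustrative, not real chunker output).
demo = [
    StandInChunk("PERSON", [("John", "NNP")]),
    ("visited", "VBD"),
    StandInChunk("GPE", [("New", "NNP"), ("York", "NNP")]),
]
print(chunk_phrases(demo))  # [('PERSON', 'John'), ('GPE', 'New York')]
```

Joining the words into a single string is a design choice; keep the list of words if you need to process them individually later.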

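PS. "GPE" stands for "geo-political entity" (so tagging Michael as one is obviously a chunker mistake). You can find a list of "commonly used tags" in the nltk book.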
