I am trying to scrape a website, and I managed to extract all the text I want using this template:
nameList = bsObj.findAll("strong")
for text in nameList:
    string = text.get_text()
    if "Title" in string:
        print(string)
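This gives me the text in the following form:
Title 1: textthatineed
Title 2: textthatineed
Title 3: textthatineed
Title 4: textthatineed
Title 5: textthatineed
Title 6: textthatineed
Title 7: textthatineed ....
Is there a way, with BeautifulSoup or anything else in Python, to cut the string and get only "textthatineed" without the "Title (number): " prefix?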
Say we have
s = 'Title 1: textthatineed'
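The text we want starts two characters after the colon's position (one for the colon itself, one for the space that follows), so we find the index of the colon, move forward two characters, and take the substring from that index to the end: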
index = s.find(':') + 2
title = s[index:]
Note that find() returns only the index of the first occurrence, so text that itself contains a colon is unaffected.
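
One caveat with the approach above: when the string has no colon, find() returns -1, so the computed index becomes 1 and the slice silently drops the first character. A minimal sketch of a variant that guards against that, assuming the prefix always ends at the first colon, could use str.partition instead. The helper name strip_title_prefix and the sample strings are made up for illustration; they are not from the question.

```python
def strip_title_prefix(s):
    """Return the text after the first colon, or s unchanged if there is no colon."""
    # partition() splits on the first occurrence only, so later colons are kept.
    head, sep, tail = s.partition(":")
    if not sep:
        # No colon found: leave the string as it is (an assumption about the desired behavior).
        return s
    # strip() removes the space after the colon, however many there are.
    return tail.strip()


# Hypothetical sample strings standing in for the get_text() results.
lines = ["Title 1: textthatineed", "Title 2: text: with colon", "no colon here"]
results = [strip_title_prefix(line) for line in lines]
print(results)
```

In the scraping loop from the question, this would be called on the value returned by text.get_text() in place of printing it directly.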