如何过滤字符串列表中的关键字？

奥列斯特·托科文科（Orest Tokovenko）

我有一个字符串列表，这些字符串是我使用BeautifulSoup抓取的链接。我无法弄清楚如何仅返回包含单词“ The”的字符串。该解决方案可能使用Regex，但对我不起作用。

我试过了

for i in links_list:
     if re.match('^The', i) is not None:
        eps_only.append(i)

但我收到类似的错误

File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object

该列表如下所示：

['index.html', 'seinfeld-scripts.html', 'episodes_oveview.html', 'seinfeld-characters.html', 'buy-seinfeld.html', 'http://addthis.com/bookmark.php?v=250&username=doctoroids', None, None, None, None, 'http://community.seinfeldscripts.com', 'buy-seinfeld.html', 'seinfeld-t-shirt.html', 'seinfeld-dvd.html', 'episodes_oveview.html', 'alpha.html', '    http://www.shareasale.com/r.cfm?u=439896&b=119192&m=16934&afftrack=seinfeldScriptsTop&urllink=search%2E80stees%2Ecom%2F%3Fcategory%3D80s%2BTV%26i%3D1%26theme%3DSeinfeld%26u1%3Dcategory%26u2%3Dtheme', ' TheSeinfeldChronicles.htm', ' TheStakeout.htm', ' TheRobbery.htm', ' MaleUnbonding.htm', ' TheStockTip.htm', ' TheExGirlfriend.htm', ' ThePonyRemark.htm', ' TheJacket.htm', ' ThePhoneMessage.htm', ' TheApartment.htm', ' TheStatue.htm', ' TheRevenge.htm', ' TheHeartAttack.htm', ' TheDeal.htm', ' TheBabyShower.htm', ' TheChineseRestaurant.htm', ' TheBusboy.htm', 'TheNote.html', ' TheTruth.htm', 'ThePen.html', ' TheDog.htm', ' TheLibrary.htm', ' TheParkingGarage.htm', 'TheCafe.html', ' TheTape.htm', 'TheNoseJob.html', 'TheStranded.html', ...]

更新：完整代码

import requests
import re
from bs4 import BeautifulSoup

##################
##--user agent--##
##################

user_agent_desktop = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '\
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 '\
    'Safari/537.36'

headers = {'User-Agent': user_agent_desktop}

#########################
##--fetching the page--##
#########################

URL = 'https://www.seinfeldscripts.com/seinfeld-scripts.html'
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')


############################################################
##--scraping the links to the scripts from the main page--##
############################################################

links_list = []
eps_only = []

for link in soup.find_all('a'):
    links_list.append(link.get('href'))

### sorting for links that contain 'the' ###

for i in filter(None, links_list):
    if re.match('^The', str(i)) is not None:
        eps_only.append(i)
        print(eps_only)

贾维斯

您应该过滤None从BeautifulSoup返回的列表元素（不带）：

for i in filter(None, links_list):
    if re.match('^The', str(i)) is not None:
        eps_only.append(i)
print(eps_only)

本文收集自互联网，转载请注明来源。

如有侵权，请联系 [email protected] 删除。

编辑于 2021-01-26

我来说两句

0 条评论

登录后参与评论

上一篇：javascript中无法理解的对象重新分配

过滤以特定关键字开头的字符串列表

如何在字符串列表中搜索关键字，然后返回该字符串？

遍历字符串列表，查找关键字并打印

如何过滤字符串列表中的关键字？

如何过滤字符串列表中的关键字？

UITableView的项目向下滚动后更改颜色，然后快速备份

Linux的官方Adobe Flash存储库是否已过时？

用日期数据透视表和日期顺序查询

应用发明者仅从列表中选择一个随机项一次

Mac OS X更新后的GRUB 2问题

验证REST API参数

Java Eclipse中的错误13，如何解决？

带有错误“ where”条件的查询如何返回结果？

ggplot：对齐多个分面图-所有大小不同的分面

尝试反复更改屏幕上按钮的位置 - kotlin android studio

如何从视图一次更新多行（ASP.NET - Core）

计算数据帧中每行的NA

蓝屏死机没有修复解决方案

在 Python 2.7 中。如何从文件中读取特定文本并分配给变量

离子动态工具栏背景色

VB.net将2条特定行导出到DataGridView

通过 Git 在运行 Jenkins 作业时获取 ClassNotFoundException

在Windows 7中无法删除文件（2）

python中的boto3文件上传

当我尝试下载 StanfordNLP en 模型时，出现错误

Node.js中未捕获的异常错误，发生调用