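I am trying to extract the SMILES String and Repeat_Unit values from the table on the following web page: https://khazana.gatech.edu/module_search/material_detail.php?id=1&m=9
It may not be the most efficient approach, but the following code extracts these values successfully: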
from bs4 import BeautifulSoup
import requests
link='https://khazana.gatech.edu/module_search/material_detail.php?id=1&m=9'
link=requests.get(link)
soup=BeautifulSoup(link.text)
data=[]
tables=soup.find_all('table')
#the desired table was selected based on list index because there is no other attributes
table_body=tables[9].find('tbody')
rows=table_body.findAll('tr')
for row in rows:
    cols=row.findAll('td')
    cols=[ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
print (data[13][1])
print (data[14][1])
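In my application, I need to extract the SMILES String and Repeat_Unit values from thousands of similar web pages, whose addresses differ only in the number that follows id= (1 in this example).
I have a pandas DataFrame in which one column, '#id', holds the data IDs. To get the SMILES string and repeat unit for each ID, I modified the code above to: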
data=[]
SMILES=[]
Repeat_Unit=[]
for index, prow in df.iterrows():
    a=prow['#id']
    link='https://khazana.gatech.edu/module_search/material_detail.php?id='+str(a)+'&m=9'
    link=requests.get(link)
    soup=BeautifulSoup(link.text)
    tables=soup.find_all('table')
    for table in tables:
        table_body=tables[9].find('tbody')
        rows=table_body.findAll('tr')
        for row in rows:
            cols=row.findAll('td')
            cols=[ele.text.strip() for ele in cols]
            data.append([ele for ele in cols if ele])
    SMILES.append(data[13][1])
    Repeat_Unit.append(data[14][1])
Now, when I call SMILES or Repeat_Unit, I get the following error:
IndexError Traceback (most recent call last)
<ipython-input-55-74f7ef016c59> in <module>()
36 cols=[ele.text.strip() for ele in cols]
37 data.append([ele for ele in cols if ele])
---> 38 SMILES.append(data[13][1])
39 Repeat_Unit.append(data[14][1])
IndexError: list index out of range
Even if I iterate over data before appending to SMILES, I still get the same error.
Thanks in advance for your help!
Instead of relying on fixed table and row indices, look up each value by its row label. Use:
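import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

s = ['SMILES String', 'Repeat Unit']
N = 10
data=[]
for a in np.arange(1,N + 1):
    link='https://khazana.gatech.edu/module_search/material_detail.php?id='+str(a)+'&m=9'
    link=requests.get(link)
    soup=BeautifulSoup(link.text, 'lxml')
    #one dict per page, keyed by row label
    d = {}
    for x in s:
        #find the text node equal to the label, then take the first child of the next <td>
        #https://stackoverflow.com/a/5999786/2901002
        out = soup.find(text=x).parent.findNext('td').contents[0]
        d[x] = out
    data.append(d)
df = pd.DataFrame(data)
print (df)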
   Repeat Unit  \
0  C5O3(CH2-OH)-O-C5O3(CH2-OH)-O
1  Polystyrene
2  CH2-CH(CH3)-CH2-CH(CH3)
3  CHF-CF2-CHF-CF2
4  <img border="0" height="60" src="block_images/...
5  CNS-C6H3-CSN-C6H3
6  CH(CF3)-O-CH2
7  (CH2)5-O-CO
8  CH2-CH2-C(CF3)2-O
9  <img border="0" height="60" src="block_images/...

   SMILES String
0  C(C(O)C1(O))C(CO)OC1O
1  CC(C1=CC=CC=C1)CC(C2=CC=CC=C2)CC(C3=CC=CC=C3)C...
2  CC(C)CC(C)CC(C)
3  C(F)C(F)(F)
4  C(S1)=CC=C1C(S2)=CC=C2
5  C(OC1=C2)=NC1=CC=C2C(OC3=C4)=NC3=CC=C4
6  C(C(F)(F)(F))OCC(C(F)(F)(F))OC
7  CCCCCOC(=O)CCCCCOC(=O)
8  CCC(C(F)(F)(F))(C(F)(F)(F))OCCC(C(F)(F)(F))(C(...
9  CCCC
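
Note that rows 4 and 9 above hold an <img> tag rather than text, and a page that lacks one of the labels would raise an error in the lookup. A minimal sketch of a more defensive variant follows. The helper name extract_fields and the sample HTML are hypothetical and only illustrate the idea; the sketch assumes each label sits alone in a <td> with its value in the next <td>, which may not hold for every page.

```python
from bs4 import BeautifulSoup


def extract_fields(html, labels):
    """Return {label: text or None} for each label (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for label in labels:
        # Locate the <td> whose stripped text is exactly the label.
        label_cell = soup.find(
            lambda tag: tag.name == "td" and tag.get_text(strip=True) == label
        )
        if label_cell is None:
            # Label not present on this page.
            result[label] = None
            continue
        value_cell = label_cell.find_next("td")
        text = value_cell.get_text(strip=True) if value_cell is not None else ""
        # An empty string (e.g. the cell only holds an <img>) becomes None.
        result[label] = text or None
    return result


# Made-up HTML standing in for a downloaded page (an assumption, not the real site).
sample_html = """
<table><tbody>
<tr><td>SMILES String</td><td> CCCC </td></tr>
<tr><td>Repeat Unit</td><td><img src="block_images/example.png"></td></tr>
</tbody></table>
"""

fields = extract_fields(sample_html, ["SMILES String", "Repeat Unit", "Missing"])
print(fields)
```

Inside the download loop you could call extract_fields(link.text, s) in place of the inner for loop and append the returned dict to data, so missing or image-only values show up as None in the DataFrame instead of stopping the run.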