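I am trying to scrape a web page with the code below, but I get an error caused by "mismatched" rows. What I want is a pandas DataFrame containing the course name, followed by the full-time code, full-time URL, part-time code, and part-time URL. The problem is that not every course has both a full-time and a part-time version, so when I try to replace the blanks with "NA" to get the same number of values in every row, it raises an error.
The following code outputs every course that has both a full-time and a part-time version. It does not raise an error, because it only keeps courses for which all 5 elements are present: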
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Specify URL and parse the page
url = "http://eecs.qmul.ac.uk/postgraduate/programmes"
html = urlopen(url)
# The post never defines soup; this line is assumed, and the parser choice is an assumption too
soup = BeautifulSoup(html, "html.parser")

# Print the first 10 table rows
rows = soup.find_all('tr')
print(rows[:10])

# Create data frame
df = pd.DataFrame(columns=['Course Name', 'Part Time Code', 'Part Time URL', 'Full Time Code', 'Full Time URL'])

# Loop through all rows
for row in rows:
    courses = row.find_all("td")
    # The fragments list will store things to be included in the final string, such as the course title and its URLs
    fragments = []
    for course in courses:
        # Skip cells that contain only whitespace
        if course.text.isspace():
            continue
        # Add the <td>'s text to fragments
        fragments.append(course.text)
        # Try and find an <a> tag
        a_tag = course.find("a")
        if a_tag:
            # If one was found, add the URL to fragments
            fragments.append(a_tag["href"])
    # Make a string containing every fragment with ", " spacing them apart.
    cleantext = ", ".join(fragments)
    # Add the row to the dataframe only if all five values are present
    if len(fragments) == 5:
        df.loc[len(df.index)] = fragments

df.head(30)
This is what I used to try to replace the blanks with NA so that every row has 5 elements:
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Specify URL and parse the page
url = "http://eecs.qmul.ac.uk/postgraduate/programmes"
html = urlopen(url)
# The post never defines soup; this line is assumed, and the parser choice is an assumption too
soup = BeautifulSoup(html, "html.parser")

# Collect all table rows
rows = soup.find_all('tr')

# Create data frame
df = pd.DataFrame(columns=['Course Name', 'Part Time Code', 'Part Time URL', 'Full Time Code', 'Full Time URL'])

# Loop through all rows
for row in rows:
    courses = row.find_all("td")
    # The fragments list will store things to be included in the final string, such as the course title and its URLs
    fragments = []
    for course in courses:
        if course.text.isspace():
            fragments.append("NA")
        else:
            # Add the <td>'s text to fragments
            fragments.append(course.text)
            # Try and find an <a> tag
            a_tag = course.find("a")
            if a_tag:
                # If one was found, add the URL to fragments
                fragments.append(a_tag["href"])
            else:
                fragments.append("NA")
    # Make a string containing every fragment with ", " spacing them apart.
    cleantext = ", ".join(fragments)
    # Add the row to the dataframe whenever anything was collected (this is the line that fails)
    if len(fragments) > 0:
        df.loc[len(df.index)] = fragments

df.head(30)
This is the error it returns:
ValueError Traceback (most recent call last)
<ipython-input-28-94bb08463416> in <module>()
38 #Add rows to the dataframe if the information exists
39 if len(fragments) > 0:
---> 40 df.loc[len(df.index)] = fragments
41 df.head(30)
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _setitem_with_indexer_missing(self, indexer, value)
1854 # must have conforming columns
1855 if len(value) != len(self.obj.columns):
-> 1856 raise ValueError("cannot set a row with mismatched columns")
1857
1858 value = Series(value, index=self.obj.columns, name=indexer)
ValueError: cannot set a row with mismatched columns
Can you tell me how to fix this so that courses without a part-time code or URL are still included in the DataFrame?
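For context, the ValueError is raised because assigning a list through `df.loc[...]` requires exactly one value per column, five in this case. The following is only a sketch of building a fixed-width row, not a tested fix for the real page: the `(text, href)` tuples are hypothetical stand-ins for what `td.text` and `td.find("a")` would return, `build_row` is a made-up helper name, and the cell order (name, part-time, full-time) is assumed.

```python
NA = "NA"

def build_row(cells):
    """Return exactly five values: name, PT code, PT URL, FT code, FT URL.

    Each cell is a (text, href) tuple; href is None when the cell has no link.
    """
    name_text, _ = cells[0]
    row = [name_text.strip()]
    # Every code cell contributes exactly two values, whether or not it is empty.
    for text, href in cells[1:3]:
        code = text.strip()
        row.append(code if code else NA)
        row.append(href if href else NA)
    return row

# Hypothetical sample data: the second course has no part-time entry.
sample_rows = [
    [("Course A", None), ("X1Y1", "/a-pt"), ("X1Y2", "/a-ft")],
    [("Course B", None), (" ", None), ("Z9Z9", "/b-ft")],
]
rows = [build_row(cells) for cells in sample_rows]
for r in rows:
    print(len(r), r)
```

Because every row now has five values, each one could be assigned to a five-column DataFrame without a length mismatch.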
It is much simpler than that. Find the table by its id, then feed the prettify-ed version directly into pandas IO. pandas handles NaN out of the box.
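from io import StringIO
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

# Naming the parser explicitly avoids bs4's "no parser specified" warning (the parser choice is an assumption)
soup = BeautifulSoup(urlopen('http://eecs.qmul.ac.uk/postgraduate/programmes'), 'html.parser')
table = soup.find("table", {"id": "PGCourse"})
# Wrapping the markup in StringIO avoids the deprecation warning newer pandas versions emit for literal HTML strings
df = pd.read_html(StringIO(table.prettify()))[0]
# rename columns
df.columns = ['Course Name', 'Part Time Code', 'Full Time Code']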
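As a small offline illustration of pandas filling gaps with NaN (this is not the `read_html` call itself), here is a sketch that builds a frame from hypothetical records in which one course has no part-time code. The course names and codes are invented sample data.

```python
import pandas as pd

# Hypothetical sample records; the second course has no part-time entry.
records = [
    {"Course Name": "Course A", "Part Time Code": "X1Y1", "Full Time Code": "X1Y2"},
    {"Course Name": "Course B", "Full Time Code": "Z9Z9"},
]

# pandas aligns the records on their keys and marks the missing value as NaN.
demo = pd.DataFrame(records)

# If a literal "NA" string is preferred over NaN, fillna can substitute it.
demo_filled = demo.fillna("NA")
print(demo_filled)
```

Keeping NaN rather than the string "NA" is usually the more convenient choice, since methods such as `isna()` and `dropna()` then work directly.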
Edit: OK, then to get the links you need to iterate:
pt_links, ft_links = [], []
for row in table.find_all("tr")[1:]:
    row_data = row.find_all("td")
    pt, ft = row_data[1], row_data[2]
    pt_link = pt.find_all('a')
    pt_links.append('' if len(pt_link) == 0 else pt_link[0]['href'])
    ft_link = ft.find_all('a')
    ft_links.append('' if len(ft_link) == 0 else ft_link[0]['href'])

df['Part Time URL'] = pt_links
df['Full Time URL'] = ft_links
# rearrange the columns (optional)
df = df[['Course Name', 'Part Time Code', 'Part Time URL', 'Full Time Code', 'Full Time URL']]
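The per-cell link lookup above is written out twice, so it could also be factored into a small helper. The sketch below tries this on inline sample markup so it runs without network access. The HTML is hypothetical and only mimics a three-column table with id `PGCourse`, and `first_href` is a made-up helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mimicking a three-column course table.
sample_html = """
<table id="PGCourse">
  <tr><th>Course</th><th>Part-time</th><th>Full-time</th></tr>
  <tr><td>Course A</td><td><a href="/a-pt">X1Y1</a></td><td><a href="/a-ft">X1Y2</a></td></tr>
  <tr><td>Course B</td><td></td><td><a href="/b-ft">Z9Z9</a></td></tr>
</table>
"""

def first_href(cell):
    """Return the href of the first <a> in the cell, or '' if there is none."""
    a_tag = cell.find("a")
    return a_tag.get("href", "") if a_tag else ""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table", {"id": "PGCourse"})

links = []
# Skip the header row, then read the part-time and full-time cells of each course.
for tr in sample_table.find_all("tr")[1:]:
    cells = tr.find_all("td")
    links.append((first_href(cells[1]), first_href(cells[2])))

print(links)
```

Each tuple holds the part-time and full-time URL for one course, with an empty string where the cell has no link.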