How do I create a pandas DataFrame with blank cells from web-scraping output?

user3062448

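I am trying to scrape a web page with the code below, but I get an error because of "mismatched" rows. What I want to end up with is a pandas DataFrame containing the course name, followed by the full-time code, full-time URL, part-time code and part-time URL. The problem is that not every course has both a full-time and a part-time version, so when I try to replace the blanks with "NA" to get the same number of elements in every row, it raises an error.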

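The following code outputs every course that has both a full-time and a part-time version. It raises no error, because it only keeps courses for which all 5 elements are present: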

# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Specify the URL and parse the page
url = "http://eecs.qmul.ac.uk/postgraduate/programmes"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# Print the first 10 table rows
rows = soup.find_all('tr')
print(rows[:10])
# Create the data frame
df = pd.DataFrame(columns=['Course Name', 'Part Time Code', 'Part Time URL', 'Full Time Code', 'Full Time URL'])
# Loop through all rows
for row in rows:
    courses = row.find_all("td")
    # fragments collects the course title, the codes and their URLs for this row
    fragments = []
    for course in courses:
        # Skip cells that contain only whitespace
        if course.text.isspace():
            continue
        # Add the <td>'s text to fragments
        fragments.append(course.text)
        # Try to find an <a> tag
        a_tag = course.find("a")
        if a_tag:
            # If one was found, add its URL to fragments
            fragments.append(a_tag["href"])
    # Add the row to the data frame only if all 5 pieces of information exist
    if len(fragments) == 5:
        df.loc[len(df.index)] = fragments
df.head(30)

This is the output:

This is what I tried in order to replace the blanks with NA, so that every row has 5 elements:

#Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup
#Specify URL
url = "http://eecs.qmul.ac.uk/postgraduate/programmes"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')  
#Create data frame
df = pd.DataFrame(columns = ['Course Name', 'Part Time Code', 'Part Time URL', 'Full Time Code', 'Full Time URL'])
#Create loop to go through all rows
for row in rows:
    courses = row.find_all("td")
    # The fragments list will store things to be included in the final string, such as the course title and its URLs
    fragments = []
    for course in courses:
        if course.text.isspace():
           fragments.append("NA")
        else:
        # Add the <td>'s text to fragments
           fragments.append(course.text)
        # Try and find an <a> tag 
           a_tag = course.find("a")
        if a_tag:
            # If one was found, add the URL to fragments
           fragments.append(a_tag["href"])
        else:
            fragments.append("NA")
        # Make a string containing every fragment with ", " spacing them apart.
        cleantext = ", ".join(fragments)
        #Add rows to the dataframe if the information exists
        if len(fragments) > 0:
           df.loc[len(df.index)] = fragments 
df.head(30)

This is the error it returns:

ValueError                                Traceback (most recent call last)
<ipython-input-28-94bb08463416> in <module>()
     38         #Add rows to the dataframe if the information exists
     39         if len(fragments) > 0:
---> 40            df.loc[len(df.index)] = fragments
     41 df.head(30)

2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _setitem_with_indexer_missing(self, indexer, value)
   1854                     # must have conforming columns
   1855                     if len(value) != len(self.obj.columns):
-> 1856                         raise ValueError("cannot set a row with mismatched columns")
   1857 
   1858                 value = Series(value, index=self.obj.columns, name=indexer)

ValueError: cannot set a row with mismatched columns

Can you tell me how to fix this, so that courses without a part-time code or URL are still included in the DataFrame?
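
The `ValueError` in the traceback can be reproduced without any scraping. The sketch below is a minimal illustration with made-up course values (hypothetical, not taken from the page): assigning a list to a new `df.loc` row only works when the list has exactly one value per column, which is why a row with too few or too many fragments fails.

```python
import pandas as pd

columns = ['Course Name', 'Part Time Code', 'Part Time URL',
           'Full Time Code', 'Full Time URL']
df = pd.DataFrame(columns=columns)

# Five values for five columns: the new row is accepted.
df.loc[len(df.index)] = ['Course A', 'PTA1', '/pt-a', 'FTA1', '/ft-a']

# Only three values (hypothetical course with no part-time version):
# pandas rejects the row with a ValueError like the one in the traceback.
try:
    df.loc[len(df.index)] = ['Course B', 'FTB1', '/ft-b']
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
print(len(df))  # the rejected row was not added
```

In the second attempt above, the length check also runs inside the inner loop, so the assignment is tried after the very first cell, when `fragments` holds fewer than 5 values.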

Josh Friedlander

It's much simpler than that. Find the table by its id, then feed the prettified version straight into pandas IO. pandas handles NaNs out of the box.

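from io import StringIO
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen('http://eecs.qmul.ac.uk/postgraduate/programmes'), 'html.parser')
table = soup.find("table", {"id": "PGCourse"})
# Wrap the markup in StringIO: newer pandas versions deprecate passing a raw HTML string
df = pd.read_html(StringIO(table.prettify()))[0]
# rename columns
df.columns = ['Course Name', 'Part Time Code', 'Full Time Code']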
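
To see the NaN handling without hitting the network, here is a small offline sketch. The HTML string is a hypothetical stand-in for the real page (only the `PGCourse` id comes from the code above; the rows, codes and links are invented), and it assumes an HTML parser that `pd.read_html` can use, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: Course B has no part-time version, so that cell is empty.
HTML = """
<table id="PGCourse">
  <tr><th>Course</th><th>Part time</th><th>Full time</th></tr>
  <tr><td>Course A</td><td><a href="/pt-a">PTA1</a></td><td><a href="/ft-a">FTA1</a></td></tr>
  <tr><td>Course B</td><td></td><td><a href="/ft-b">FTB1</a></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"id": "PGCourse"})

# read_html returns a list of DataFrames, one per <table> it finds.
df = pd.read_html(StringIO(table.prettify()))[0]
df.columns = ['Course Name', 'Part Time Code', 'Full Time Code']

# The empty cell comes back as a missing value rather than breaking the row.
print(df)
print(df['Part Time Code'].isna().tolist())
```

Every row keeps all three columns, so no manual padding is needed for the codes. The link targets are not part of the cell text, which is why they have to be collected separately.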

Edit: OK, then to get the links you need to iterate:

pt_links, ft_links = [], [] 
for row in table.find_all("tr")[1:]:
    row_data = row.find_all("td")
    pt, ft = row_data[1], row_data[2]
    pt_link = pt.find_all('a')
    pt_links.append('' if len(pt_link) == 0 else pt_link[0]['href'])
    ft_link = ft.find_all('a')
    ft_links.append('' if len(ft_link) == 0 else ft_link[0]['href'])

df['Part Time URL'] = pt_links
df['Full Time URL'] = ft_links

# rearrange the columns (optional)
df = df[['Course Name', 'Part Time Code', 'Part Time URL', 'Full Time Code', 'Full Time URL']]
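
If you would rather stay close to the original loop and pad with "NA", one possible alternative is to build a fixed-length list per row and create the DataFrame once at the end. This is only a sketch under assumptions: the HTML is a hypothetical stand-in for the page, the helper `cell_pair` is an invented name, and it assumes each data row has exactly three `<td>` cells.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
HTML = """
<table id="PGCourse">
  <tr><th>Course</th><th>Part time</th><th>Full time</th></tr>
  <tr><td>Course A</td><td><a href="/pt-a">PTA1</a></td><td><a href="/ft-a">FTA1</a></td></tr>
  <tr><td>Course B</td><td> </td><td><a href="/ft-b">FTB1</a></td></tr>
</table>
"""

COLUMNS = ['Course Name', 'Part Time Code', 'Part Time URL',
           'Full Time Code', 'Full Time URL']


def cell_pair(td):
    """Return [code, url] for one cell, using 'NA' for whatever is missing."""
    text = td.get_text(strip=True)
    a_tag = td.find("a")
    url = a_tag.get("href", "NA") if a_tag else "NA"
    return [text or "NA", url]


soup = BeautifulSoup(HTML, "html.parser")
records = []
for row in soup.find("table", {"id": "PGCourse"}).find_all("tr"):
    cells = row.find_all("td")
    if len(cells) != 3:
        continue  # skips the header row (<th> cells) and any malformed row
    name = cells[0].get_text(strip=True)
    # Always 1 + 2 + 2 = 5 values, so every row matches the 5 columns.
    records.append([name] + cell_pair(cells[1]) + cell_pair(cells[2]))

df = pd.DataFrame(records, columns=COLUMNS)
print(df)
```

Because each cell always contributes exactly two values, the row length can never drift away from the number of columns. Building the DataFrame from a list in one call also avoids growing it row by row with `df.loc`.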

Edited on 2021-11-10