How to scrape data from multiple pages by dynamically updating the URL with rvest

MNM posted in Dev

MNM

I am trying to extract data from this website. I am interested in the draft selections by year, which run from 1963 to 2018. The URLs follow a common pattern, for example https://www.eliteprospects.com/draft/nhl-entry-draft/2018, https://www.eliteprospects.com/draft/nhl-entry-draft/2017, and so on.

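So far I have managed to extract the data for a single year. I have written a custom function that, given its inputs, scrapes the page and returns the data as a tidy data frame.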

library(rvest)
library(tidyverse)
library(stringr)
get_draft_data <- function(draft_type, draft_year){

  # replace the spaces in the draft type with '-' to match the URL slug
  draft_slug <- str_replace_all(draft_type, " ", "-")

  # build the page URL and read the page
  page <- str_c("https://www.eliteprospects.com/draft/", draft_slug, "/", draft_year) %>%
    read_html()

  # Now scrape the team data from the page
  # Extract the team data
  draft_team<- page %>%

    html_nodes(".team") %>%
    html_text()%>%
    str_squish() %>%
    as_tibble()

  # Extract the player data
  draft_player<- page %>%

    html_nodes("#drafted-players .player") %>%
    html_text()%>%
    str_squish() %>%
    as_tibble()

  # Extract the seasons data
  draft_season<- page %>%

    html_nodes(".seasons") %>%
    html_text()%>%
    str_squish() %>%
    as_tibble()

  # Join the data frames together.
  all_data <- cbind(draft_team, draft_player, draft_season)

  return(all_data)

} # end function

# Testing the function
draft_data<-get_draft_data("nhl entry draft", 2011)
glimpse(draft_data)
Observations: 212
Variables: 3
$ value <chr> "Team", "Edmonton Oilers", "Colorado Avalanche", "Florida Panth...
$ value <chr> "Player", "Ryan Nugent-Hopkins (F)", "Gabriel Landeskog (F)", "...
$ value <chr> "Seasons", "8", "8", "7", "8", "6", "8", "8", "8", "7", "7", "3...

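Question: how can I write the code so that the year in the page URL increments automatically, letting the scraper pull the data for each year and write it into a data frame?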

Note: I have looked at several similar, related questions (1, 2, 3, 4) but could not find a solution.

塔式起重机

I would just write a function that scrapes a given year, and then bind the rows from each year together.

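Use paste() to build the URL dynamically from a fixed string and the year variable
Write a scrape function for that URL (note: you do not need html_text here; the data is stored as a table, so you can pull it out directly with html_table())
Loop the function over multiple years with lapply()
Combine the data frames in the resulting list with bind_rows()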

Below is an example of this process for 2010 through 2012.

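library(rvest)
library(tidyverse)

# scrape the draft table for a single year
scrape.draft <- function(year) {

  url <- paste0("https://www.eliteprospects.com/draft/nhl-entry-draft/", year)

  out <- read_html(url) %>%
    html_table(header = TRUE) %>%
    .[[2]] %>%                      # keep the second table found on the page
    filter(!grepl("ROUND", GP)) %>% # drop rows whose GP column contains "ROUND"
    mutate(draftYear = year)

  return(out)

}

# scrape 2010 through 2012 and stack the results into one data frame
temp <- lapply(2010:2012, scrape.draft) %>%
  bind_rows()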
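
To cover the full 1963 to 2018 range from the question, the same lapply() and bind_rows() pattern can be wrapped so that one failing page does not stop the whole run. The following is only a sketch: scrape_years and its pause argument are hypothetical names that are not part of rvest or dplyr, and the one-second delay between requests is an assumed courtesy value, not a documented requirement of the site.

```r
library(dplyr)

# Hypothetical helper: call `scraper` once per year and stack the results.
# `scraper` is any function that takes a year and returns a data frame.
# `pause` is an assumed delay, in seconds, before each request.
scrape_years <- function(years, scraper, pause = 1) {
  results <- lapply(years, function(year) {
    Sys.sleep(pause)
    tryCatch(
      scraper(year),
      error = function(e) {
        # report the failing year and carry on with the rest
        message("Skipping ", year, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  # drop the years that failed, then bind the remaining data frames by row
  bind_rows(Filter(Negate(is.null), results))
}

# Example call (makes one network request per year):
# all_drafts <- scrape_years(1963:2018, scrape.draft)
```

If a column is parsed as a different type in different years, bind_rows() will raise an error, so those columns may need converting to a common type (for example character) inside the scraper first.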


Edited on 2020-12-10
