使用 vba 从网站抓取数据 - 问题

lalachka

我正在尝试从这次搜索中的每个设施中获取地址、设施类型和一些其他数据。我能够获得搜索结果和设施列表,但我无法弄清楚如何从页面获取数据。

编辑我在答案中应用了建议,这是新代码,OBJECT REQUIRED 错误在调试行我试图点击每个链接并获取名称、地址、设施类型以及其他任何数据页

Sub Test()
    Dim ie2 As New InternetExplorer
    'Set ie = New InternetExplorerMedium

    With ie2
        .Visible = True
        .navigate "https://healthapps.state.nj.us/facilities/fsSetSearch.aspx?by=county"

        FacType = "Long-Term Care (Nursing Homes)"
        While .Busy Or .ReadyState < 4: DoEvents: Wend

        With .Document
            .querySelector("#middleContent_cbType_0").Click
            .querySelector("#middleContent_btnGetList").Click
        End With
            
        While .Busy Or .ReadyState < 4: DoEvents: Wend
            

        Pause (2)
        
        Dim list2 As Object, i2  As Long, line1 As String, line2 As String

        Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")
        
        For i2 = 0 To list2.Length - 1
            list2.Item(i2).Click
            Debug.Print .Document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText
            
            While .Busy Or .ReadyState < 4: DoEvents: Wend

            Pause (2)

            address = Replace(Replace(Replace(line1 & " " & line2, "<span id=" & Chr(34) & "middleContent_lbAddress" & Chr(34) & ">", ""), "<br>", ", "), "</span>", "")

            WriteTable .Document.getElementsByTagName("table")(3), .Document.getElementById("middleContent_Menu1").innerText

            .Navigate2 .Document.URL
            While .Busy Or .ReadyState < 4: DoEvents: Wend
            Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")

        Next
        .Quit                                    '
    End With

End Sub

我在这一行收到 OBJECT REQUIRED 错误

地址 = Replace(Replace(Replace(.Document.getElementById("middleContent_lbAddress").outerHTML, "<span id=" & Chr(34) & "middleContent_lbAddress" & Chr(34) & ">", ""), "
", ", "), "", "")

但我很确定我使用错误的方式来获取数据。所以,即使没有错误,我也不会给我我需要的东西。

Sub Test()
    Dim ie2 As New InternetExplorer
    'Set ie = New InternetExplorerMedium

    With ie2
        .Visible = False
        .navigate "https://healthapps.state.nj.us/facilities/fsSetSearch.aspx?by=county"

        While .Busy Or .ReadyState < 4: DoEvents: Wend

        With .Document
            .querySelector("#middleContent_cbType_0").Click
            .querySelector("#middleContent_btnGetList").Click
        End With
            
        While .Busy Or .ReadyState < 4: DoEvents: Wend
            
        Dim list2 As Object, i2  As Long
        Set list2 = .Document.querySelectorAll("#main_table")
             
        For i2 = 0 To list2.Length - 1
            list2.Item(i2).Click

            While .Busy Or .ReadyState < 4: DoEvents: Wend

            Pause (2)
    
            If .Document.getElementById("middleContent_lbResultTitle") Is Nothing Then
                Pause (5)
            End If

            If .Document.getElementById("middleContent_lbResultTitle").outerHTML Like "*Long-Term Care Facility*" Then
                FacType = "Long-Term Care (Nursing Homes)"
            End If

            Address = Replace(Replace(Replace(.Document.getElementById("middleContent_lbAddress").outerHTML, "<span id=" & Chr(34) & "middleContent_lbAddress" & Chr(34) & ">", ""), "<br>", ", "), "</span>", "")

            WriteTable .Document.getElementsByTagName("table")(3), .Document.getElementById("middleContent_Menu1").innerText


            .Navigate2 .Document.URL
            While .Busy Or .ReadyState < 4: DoEvents: Wend
            Set list2 = .Document.querySelectorAll("#main_table")

        Next
        .Quit                                    '
    End With
End Sub
哈尔

这是单个节点Set list2 = .Document.querySelectorAll("#main_table")相反,假设所有结果的结构相同,请使用以下内容:

Dim i As Long, line1 As String, line2 As String, address As String

Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")

For i = 0 To list2.Length - 1
    line1 = list2.Item(i).NextSibling.NextSibling.NodeValue
    line2 = list2.Item(i).NextSibling.NextSibling.NextSibling.NodeValue
    address = line1 & " " & line2 'apply string cleaning here
Next

这最初针对每个结果的超链接,然后使用 nextSibling 在 br 元素之间移动以获得地址行 1 和 2。您需要在地址变量上编写一些字符串清理。

如果您决定单击每个超链接,则在详细信息页面上使用.document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText来检索完整地址。

导航到每个页面的示例(检查检索到的 url 是否完整,不需要前缀)

Dim i As Long, address As String, urls(), numLinks As Long

Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")
numLinks = List.Length - 1
ReDim urls(0 To numLinks)

For i = 0 To numLinks
    urls(i) = list2.Item(i).href
Next

For i = 0 To numLinks
    .navigate2 urls(i)
    While .Busy Or .ReadyState <> 4: DoEvents: Wend
    'time loop maybe goes here
    address = .Document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText
    Debug.Print address
Next

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章