Trouble getting the right website data into excel

Ludo

I wanted tomake a list of my LEGO collection by entering the set numbers in the first column, and having excel look them up online and fill in the details like set name, brick count, ...

This is my code:

Option Explicit  
Sub BrickLinkDataExtraction()

    Dim x As Integer
    Dim i As Integer
    Dim IE As New InternetExplorer

    For i = 5 To Cells(Rows.Count, 1).End(xlUp).Row

        IE.navigate "https://brickset.com/sets/" & Cells(RowIndex:=i, columnindex:=1).Value
        IE.Visible = False

        Do
            DoEvents
        Loop Until IE.readyState = READYSTATE_COMPLETE 'hier moet ik zeggen "tot rijen vol zijn", zoiets? IsEmpty(Range("i+1" & "A"))

        Dim Doc As HTMLDocument
        Set Doc = IE.document
        Dim NAME As String
        NAME = Trim(Doc.getElementsByTagName("dd")(1).innerText)
        Dim THEME As String
        THEME = Trim(Doc.getElementsByTagName("dd")(4).innerText)
        Dim YEAR As String
        YEAR = Trim(Doc.getElementsByTagName("dd")(6).innerText)
        Dim BRICKS As String
        BRICKS = Trim(Doc.getElementsByTagName("dd")(8).innerText)
        Dim MINIFIGS As String
        MINIFIGS = Trim(Doc.getElementsByTagName("dd")(9).innerText)

        If IsEmpty(Cells(RowIndex:=i, columnindex:=2)) Then
            Cells(RowIndex:=i, columnindex:=2).Value = NAME
        End If

        If IsEmpty(Cells(RowIndex:=i, columnindex:=3)) Then
            Cells(RowIndex:=i, columnindex:=3).Value = BRICKS
        End If

        If IsEmpty(Cells(RowIndex:=i, columnindex:=4)) Then
            Cells(RowIndex:=i, columnindex:=4).Value = MINIFIGS
        End If

        If IsEmpty(Cells(RowIndex:=i, columnindex:=5)) Then
            Cells(RowIndex:=i, columnindex:=5).Value = THEME
        End If

        If IsEmpty(Cells(RowIndex:=i, columnindex:=6)) Then
            Cells(RowIndex:=i, columnindex:=6).Value = YEAR
        End If

    Next

    IE.Quit

    Cells.Columns.AutoFit

End Sub

This works fine, until the code reaches a set that doesn't have the tags in the same order, or not using minifigs. Then I get the wrong info in my spreadsheet.

How can I specify I need the beloning to the "name", instead of specifying the second, fifth, ... ?

So for example https://brickset.com/sets/10224 works as intended; but https://brickset.com/sets/10262 enters the original retail price in the minifigs column.

Also, is there a way to optimize the code so it doesn't take as long to run?

QHarr

I would switch to XMLHTTP GET requests to retrieve the information you want faster.

The HTML doesn't lend itself to a nice way of selecting only the items of interest as you have discovered. Positional matching falls over when there aren't the same number of items on each page.

The pattern that is consistent, is that item names (dt tagged) and values (dd tagged) come in pairs. For example, "Name" comes with "Town Hall"; so you can collect the dt elements in one nodeList, and the dd in another; loop the first checking that the item names you want exist. The length of the list of item names will match the length of the list of associated values, so you only need loop the items and access the values nodeList using the same index as where your required item name was found.


PROCESS:

I store the sets of interest in an array, sets, which I read in from Sheet1 column A. I loop that array, concatenating the current set number onto a base url constant to get the actual url for the lego set. The XMLHTTP GET Request is issued against that url.

An helper function is employed, GetHTMLDoc, to process the request and return an HTMLDocument with the page html.

I use an additional helper function, GetItemsInfo, to retrieve the various items you want from the page HTML stored in the recently returned HTMLDocument. It creates a dictionary, resultsDict, whose keys are the items of interest i.e. "Name","Theme" etc. These keys have an initial vbNullstring value, and if the key is found on the page, the value in the dictionary, for that key, is overwritten with the value found on the page.

The dictionaries of results for each page are stored in an array, results, which I later loop to write the results out to the page.


TODO:

  1. You could develop this with some additional error handling. For example, when the GET request is unable to return the required HTML due to page not found, or to handle blank cells in column A between the start and end rows.
  2. There is a free SOAP based API you could explore.I am not sure it offers all the available items of interest from an initial skim of the documentation.
  3. Handle potential unwanted characters in return strings e.g. the  in the James Bond title which appears on writing to the sheet if not handled. In that case I used Replace$(info(i).innerText, Chr$(194), vbNullString).

CSS SELECTORS:

I use the fact that each dd tag of interest is preceeded by a dt tag, within the parent dl tag of interest:

example layout

This means I can collect all the dt tags with a parent dl tag, using a CSS selector to target the page styling. I then loop over the returned nodeList, checking each node's innerText values against my dictionary keys. If they match (exists) then I know the item I want exists on the page. Given that there is a matching dd tag for each dt tag, within the parent element, I know that the value I want will be at the same index in the nodeList I can return by grabbing all the dd tags with parent tag dl. I can then overwrite the dictionary value with the found value.

I apply a CSS selector of dl dt, to return all elements with a dt tag having a parent dl tag. This selector is applied via the .querySelectorAll method of HTMLDocument. This returns a nodeList whose .Length can be traversed to access individual nodes by index, starting at 0. This is the "titles" nodeList - it contains each item name e.g "Name","Theme"etc.

Example CSS query:

A similar CSS query is used for the values, returning a nodeList which I refer to as info, using dl dd


VBA:

Option Explicit
Public Sub GetInfo()
    Dim i As Long, html As HTMLDocument, sets(), http As Object, results(), url As String
    Const BASE_URL As String = "https://brickset.com/sets/"
    Application.ScreenUpdating = False

    With ThisWorkbook.Worksheets("Sheet1")
        Dim lastRow As Long: lastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
        If lastRow < 5 Then
            Exit Sub
        ElseIf lastRow = 5 Then
            ReDim sets(1, 1): sets(1, 1) = .Range("A5").Value
        Else
            sets = .Range("A5:A" & lastRow).Value
        End If

        ReDim results(0 To UBound(sets, 1) - 1) 
        Set http = CreateObject("MSXML2.XMLHTTP")

        For i = LBound(sets, 1) To UBound(sets, 1)
            url = BASE_URL & sets(i, 1)
            Set html = GetHTMLDoc(http, url)
            Set results(i - 1) = GetItemsInfo(html)
        Next

        Dim headers()
        headers = Array("Set", "Name", "Theme", "Year released", "Pieces", "Minifigs")
        .Cells(4, 1).Resize(1, UBound(headers) + 1) = headers

        For i = LBound(results) To UBound(results)
            .Cells(i + 5, 2).Resize(1, results(i).Count) = results(i).Items
        Next
    End With
    Application.ScreenUpdating = True
End Sub

Public Function GetHTMLDoc(ByVal http As Object, ByVal url As String) As HTMLDocument
    Dim html As New HTMLDocument, sResponse As String
    With http
        .Open "GET", url, False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With       
    html.body.innerHTML = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))   
    Set GetHTMLDoc = html
End Function

Public Function GetItemsInfo(ByVal html As HTMLDocument) As Object
    Dim titles As Object, info As Object, i As Long
    Dim resultsDict As Object
    Set resultsDict = CreateObject("Scripting.Dictionary")
    resultsDict.Add "Name", vbNullString
    resultsDict.Add "Theme", vbNullString
    resultsDict.Add "Year released", vbNullString
    resultsDict.Add "Pieces", vbNullString
    resultsDict.Add "Minifigs", vbNullString

    With html
        Set titles = .querySelectorAll("dl dt")
        Set info = .querySelectorAll("dl dd")

        For i = 0 To titles.Length - 1
            If resultsDict.Exists(titles(i).innerText) Then
                resultsDict(titles(i).innerText) = Replace$(info(i).innerText, Chr$(194), vbNullString)
            End If
        Next
    End With
    Set GetItemsInfo = resultsDict
End Function

RESULTS:

Results


References (VBE > Tools > References):

  1. Microsoft HTML Object Library

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

Trouble getting right values against each item

Trouble getting the right output from a file

Trouble getting JSON data with PHP

Trouble getting data from doInBackground

Trouble getting into json data object

EXCEL 365 - Trouble getting a 4-level dropdown menu to display DATA based on other criteria

Getting Website data on android app

Trouble getting the right appearance of an array (Python3)

Trouble with getting right value from html input using jquery

Trouble getting relationship right between two comboboxes and the main window

BigQuery trouble getting substring based on character from the right

Getting Trouble in Dismissing Ads from Website using Selenium

Trouble with getting value from a website - vb.net

Trouble getting data from JSON with ReactJS

Trouble parsing XML and getting data into pandas dataframe

Having trouble getting data from a json file

Trouble getting data from json file back

Trouble with Angular/Ajax/PHP not getting the data

Having trouble getting json data from file

Linking website data to Excel sheet

The right side of all the text on my website is getting cut off

Trouble writing list of data in an excel file

Excel - Auto fetching data from website to excel

Trouble with data types after scraping a website with lxml and xpath

Python Question have trouble saving data from website to .txt file

Getting the right formatting for my XML data

Python getting Data from Coinmarketcap Website API

Web Scraper not getting the full data from a website

Getting specific data from a website with Python