I wanted tomake a list of my LEGO collection by entering the set numbers in the first column, and having excel look them up online and fill in the details like set name, brick count, ...
This is my code:
Option Explicit
Sub BrickLinkDataExtraction()
Dim x As Integer
Dim i As Integer
Dim IE As New InternetExplorer
For i = 5 To Cells(Rows.Count, 1).End(xlUp).Row
IE.navigate "https://brickset.com/sets/" & Cells(RowIndex:=i, columnindex:=1).Value
IE.Visible = False
Do
DoEvents
Loop Until IE.readyState = READYSTATE_COMPLETE 'hier moet ik zeggen "tot rijen vol zijn", zoiets? IsEmpty(Range("i+1" & "A"))
Dim Doc As HTMLDocument
Set Doc = IE.document
Dim NAME As String
NAME = Trim(Doc.getElementsByTagName("dd")(1).innerText)
Dim THEME As String
THEME = Trim(Doc.getElementsByTagName("dd")(4).innerText)
Dim YEAR As String
YEAR = Trim(Doc.getElementsByTagName("dd")(6).innerText)
Dim BRICKS As String
BRICKS = Trim(Doc.getElementsByTagName("dd")(8).innerText)
Dim MINIFIGS As String
MINIFIGS = Trim(Doc.getElementsByTagName("dd")(9).innerText)
If IsEmpty(Cells(RowIndex:=i, columnindex:=2)) Then
Cells(RowIndex:=i, columnindex:=2).Value = NAME
End If
If IsEmpty(Cells(RowIndex:=i, columnindex:=3)) Then
Cells(RowIndex:=i, columnindex:=3).Value = BRICKS
End If
If IsEmpty(Cells(RowIndex:=i, columnindex:=4)) Then
Cells(RowIndex:=i, columnindex:=4).Value = MINIFIGS
End If
If IsEmpty(Cells(RowIndex:=i, columnindex:=5)) Then
Cells(RowIndex:=i, columnindex:=5).Value = THEME
End If
If IsEmpty(Cells(RowIndex:=i, columnindex:=6)) Then
Cells(RowIndex:=i, columnindex:=6).Value = YEAR
End If
Next
IE.Quit
Cells.Columns.AutoFit
End Sub
This works fine, until the code reaches a set that doesn't have the tags in the same order, or not using minifigs. Then I get the wrong info in my spreadsheet.
How can I specify I need the beloning to the "name", instead of specifying the second, fifth, ... ?
So for example https://brickset.com/sets/10224 works as intended; but https://brickset.com/sets/10262 enters the original retail price in the minifigs column.
Also, is there a way to optimize the code so it doesn't take as long to run?
I would switch to XMLHTTP GET requests to retrieve the information you want faster.
The HTML doesn't lend itself to a nice way of selecting only the items of interest as you have discovered. Positional matching falls over when there aren't the same number of items on each page.
The pattern that is consistent, is that item names (dt
tagged) and values (dd
tagged) come in pairs. For example, "Name"
comes with "Town Hall"
; so you can collect the dt
elements in one nodeList
, and the dd
in another; loop the first checking that the item names you want exist. The length of the list of item names will match the length of the list of associated values, so you only need loop the items and access the values nodeList
using the same index as where your required item name was found.
PROCESS:
I store the sets of interest in an array, sets
, which I read in from Sheet1
column A. I loop that array, concatenating the current set number onto a base url constant to get the actual url for the lego set. The XMLHTTP GET Request is issued against that url.
An helper function is employed, GetHTMLDoc
, to process the request and return an HTMLDocument
with the page html.
I use an additional helper function, GetItemsInfo
, to retrieve the various items you want from the page HTML stored in the recently returned HTMLDocument
. It creates a dictionary, resultsDict
, whose keys are the items of interest i.e. "Name","Theme"
etc. These keys have an initial vbNullstring
value, and if the key is found on the page, the value in the dictionary, for that key, is overwritten with the value found on the page.
The dictionaries of results for each page are stored in an array, results
, which I later loop to write the results out to the page.
TODO:
Â
in the James Bond title which appears on writing to the sheet if not handled. In that case I used Replace$(info(i).innerText, Chr$(194), vbNullString)
.I use the fact that each dd
tag of interest is preceeded by a dt
tag, within the parent dl
tag of interest:
This means I can collect all the dt
tags with a parent dl
tag, using a CSS selector to target the page styling. I then loop over the returned nodeList
, checking each node's innerText
values against my dictionary keys. If they match (exists) then I know the item I want exists on the page. Given that there is a matching dd
tag for each dt
tag, within the parent element, I know that the value I want will be at the same index in the nodeList
I can return by grabbing all the dd
tags with parent tag dl
. I can then overwrite the dictionary value with the found value.
I apply a CSS selector of dl dt
, to return all elements with a dt
tag having a parent dl
tag. This selector is applied via the .querySelectorAll
method of HTMLDocument
. This returns a nodeList
whose .Length
can be traversed to access individual nodes by index, starting at 0. This is the "titles"
nodeList - it contains each item name e.g "Name","Theme"
etc.
Example CSS query:
A similar CSS query is used for the values, returning a nodeList
which I refer to as info
, using dl dd
VBA:
Option Explicit
Public Sub GetInfo()
Dim i As Long, html As HTMLDocument, sets(), http As Object, results(), url As String
Const BASE_URL As String = "https://brickset.com/sets/"
Application.ScreenUpdating = False
With ThisWorkbook.Worksheets("Sheet1")
Dim lastRow As Long: lastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
If lastRow < 5 Then
Exit Sub
ElseIf lastRow = 5 Then
ReDim sets(1, 1): sets(1, 1) = .Range("A5").Value
Else
sets = .Range("A5:A" & lastRow).Value
End If
ReDim results(0 To UBound(sets, 1) - 1)
Set http = CreateObject("MSXML2.XMLHTTP")
For i = LBound(sets, 1) To UBound(sets, 1)
url = BASE_URL & sets(i, 1)
Set html = GetHTMLDoc(http, url)
Set results(i - 1) = GetItemsInfo(html)
Next
Dim headers()
headers = Array("Set", "Name", "Theme", "Year released", "Pieces", "Minifigs")
.Cells(4, 1).Resize(1, UBound(headers) + 1) = headers
For i = LBound(results) To UBound(results)
.Cells(i + 5, 2).Resize(1, results(i).Count) = results(i).Items
Next
End With
Application.ScreenUpdating = True
End Sub
Public Function GetHTMLDoc(ByVal http As Object, ByVal url As String) As HTMLDocument
Dim html As New HTMLDocument, sResponse As String
With http
.Open "GET", url, False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
html.body.innerHTML = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
Set GetHTMLDoc = html
End Function
Public Function GetItemsInfo(ByVal html As HTMLDocument) As Object
Dim titles As Object, info As Object, i As Long
Dim resultsDict As Object
Set resultsDict = CreateObject("Scripting.Dictionary")
resultsDict.Add "Name", vbNullString
resultsDict.Add "Theme", vbNullString
resultsDict.Add "Year released", vbNullString
resultsDict.Add "Pieces", vbNullString
resultsDict.Add "Minifigs", vbNullString
With html
Set titles = .querySelectorAll("dl dt")
Set info = .querySelectorAll("dl dd")
For i = 0 To titles.Length - 1
If resultsDict.Exists(titles(i).innerText) Then
resultsDict(titles(i).innerText) = Replace$(info(i).innerText, Chr$(194), vbNullString)
End If
Next
End With
Set GetItemsInfo = resultsDict
End Function
RESULTS:
References (VBE > Tools > References):
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments