我撰寫了一個宏來從網站的三個頁面中抓取一些欄位。我曾經Array()存盤和寫入結果,以使執行速度更快。
只要涉及單個頁面的內容,腳本就可以正常作業。但是,當我使用串列中的三個鏈接時,事情會出錯。To be specific, the script overwrites previous results. 例如,我應該在執行后得到 150 個結果。相反,我從最后一個鏈接獲得了 50 個結果。
到目前為止我已經寫過:
Public Sub FetchData()
Dim Xhr As Object, Html As HTMLDocument, Ws As Worksheet
Dim Link As Variant, Links As Variant, LeadInfo() As String
Dim I&, HtmlDoc As HTMLDocument, Listings As Object, Headers()
Dim URLS(), N As Variant
Links = Array( _
"https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=1&pagesize=50", _
"https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50", _
"https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50" _
)
Set Ws = ThisWorkbook.Worksheets("Sheet1")
Set Xhr = CreateObject("MSXML2.XMLHTTP")
Set Html = New HTMLDocument
Set HtmlDoc = New HTMLDocument
Headers = Array("Title", "URL", "User", "Asked")
For Each Link In Links
With Xhr
.Open "GET", Link, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
.send
Html.body.innerHTML = .responseText
End With
Set Listings = Html.querySelectorAll(".summary")
ReDim LeadInfo(1 To Listings.Length, 1 To 4)
On Error Resume Next
For I = 0 To Listings.Length - 1
HtmlDoc.body.innerHTML = Listings.item(I).innerHTML
LeadInfo(I 1, 1) = HtmlDoc.querySelector(".question-hyperlink").innerText
LeadInfo(I 1, 2) = HtmlDoc.querySelector(".question-hyperlink").getAttribute("href")
LeadInfo(I 1, 3) = HtmlDoc.querySelector(".user-details > a").innerText
LeadInfo(I 1, 4) = HtmlDoc.querySelector(".user-action-time > span.relativetime").innerText
Next I
On Error GoTo 0
If IsEmpty(Ws.Cells(1, 1).Value) Then Ws.Cells(1, 1).Resize(1, UBound(Headers) 1) = Headers
Ws.Cells(2, 1).Resize(UBound(LeadInfo, 1), UBound(LeadInfo, 2)) = LeadInfo
Next Link
End Sub
我怎樣才能寫出三個鏈接的所有結果,而不是只寫最后一個鏈接的結果?
uj5u.com熱心網友回復:
您已經擁有頁面大小(即每頁最大結果)、頁數和標題大小。只需對陣列進行維度來存盤結果并將其寫出一次。比重復 ReDim 更有效,后者復制陣列并寫出導致 I/O 的內容。
使用變數來跟蹤要填充到陣列中的行。
將你的寫作移到回圈之外。
宣告Listings As MSHTML.IHTMLDOMChildrenCollection以便提供更新的 Excel 版本(至少向后兼容 2010)。
Option Explicit
Public Sub FetchData()
Dim Xhr As Object, Html As MSHTML.HTMLDocument, Ws As Worksheet
Dim Link As Variant, Links() As Variant, LeadInfo() As String
Dim I As Long, HtmlDoc As MSHTML.HTMLDocument, Listings As MSHTML.IHTMLDOMChildrenCollection
Dim Headers() As Variant
Links = Array( _
"https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=1&pagesize=50", _
"https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50", _
"https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50" _
)
Set Ws = ThisWorkbook.Worksheets("Sheet1")
Set Xhr = CreateObject("MSXML2.XMLHTTP")
Set Html = New HTMLDocument
Set HtmlDoc = New HTMLDocument
Headers = Array("Title", "URL", "User", "Asked")
ReDim LeadInfo(1 To (UBound(Links) 1) * 50, 1 To UBound(Headers) 1) 'size according to headers and page size
Dim rowNumber As Long
For Each Link In Links
With Xhr
.Open "GET", Link, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
.send
Html.body.innerHTML = .responseText
End With
Set Listings = Html.querySelectorAll(".summary")
On Error Resume Next
For I = 0 To Listings.Length - 1
rowNumber = rowNumber 1
HtmlDoc.body.innerHTML = Listings.Item(I).innerHTML
LeadInfo(rowNumber, 1) = HtmlDoc.querySelector(".question-hyperlink").innerText
LeadInfo(rowNumber, 2) = Replace$(HtmlDoc.querySelector(".question-hyperlink").href, "about:", "https://stackoverflow.com")
LeadInfo(rowNumber, 3) = HtmlDoc.querySelector(".user-details > a").innerText
LeadInfo(rowNumber, 4) = HtmlDoc.querySelector(".user-action-time > span.relativetime").innerText
Next I
On Error GoTo 0
Next Link
If IsEmpty(Ws.Cells(1, 1).Value) Then Ws.Cells(1, 1).Resize(1, UBound(Headers) 1) = Headers
Ws.Cells(2, 1).Resize(UBound(LeadInfo, 1), UBound(LeadInfo, 2)) = LeadInfo
End Sub
轉載請註明出處,本文鏈接:https://www.uj5u.com/caozuo/380752.html
上一篇:在VBA中創建記錄陣列
下一篇:Excel應用程式。輸入完整地址
