I am doing web scraping with Python, Selenium, and the headless Chrome driver, and it involves running a loop:
# perform loop
CustId = 2000
while CustId <= 3000:
    # Part 1: Customer REST call:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    dict_from_json = json.loads(soup.find("body").text)
    # logic for webscraping is here......
    CustId = CustId + 1
# close driver at end of everything
driver.close()
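However, for some customer IDs the page may not exist. I have no control over this, and the code stops with a 404 "page not found" error. How can I ignore this and continue the loop?
I'm guessing I need a try...except?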
A helpful uj5u.com user replied:
One way to do this might be to try:
# this goes inside the while loop
try:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    dict_from_json = json.loads(soup.find("body").text)
    # logic for webscraping is here......
    CustId = CustId + 1
except Exception:
    # skip this customer ID and keep looping
    print("404 error found, moving on")
    CustId = CustId + 1
Sorry if this doesn't work; I haven't tested it.
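A note on why try/except helps here: as far as I know, Selenium's driver.get() does not raise an exception just because the server answers 404; the browser simply loads the error page. The failure most likely comes from json.loads() being handed a body that is not JSON. Under that assumption, a narrower sketch catches only json.JSONDecodeError instead of everything. The helper name parse_json_body is hypothetical and not from the original post.

```python
import json


def parse_json_body(body_text):
    """Return the parsed JSON, or None when the body is not valid JSON.

    Assumption: a missing customer page shows up as a non-JSON body
    (for example an HTML error page), not as an exception from driver.get().
    """
    try:
        return json.loads(body_text)
    except json.JSONDecodeError:
        return None
```

Inside the loop you would call parse_json_body(soup.find("body").text) and skip the ID when it returns None.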
A helpful uj5u.com user replied:
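You can check what text appears in the page body's h1 tag when the 404 error occurs, then test for it in an if clause so that the scraping block only runs when that text is absent.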
CustId = 2000
while CustId <= 3000:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    if "Page not found" not in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        # logic for webscraping is here......
    # increment outside the if, so a missing page does not stall the loop
    CustId = CustId + 1
Or:
CustId = 2000
while CustId <= 3000:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    if "404" not in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        # logic for webscraping is here......
    # increment outside the if, so a missing page does not stall the loop
    CustId = CustId + 1
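One caveat with a plain substring test: a perfectly valid customer record can contain the text "404" (customer 2404, for instance), and it would then be skipped by mistake. A rough sketch of a stricter check follows. It only treats the page as an error when the body is not a JSON object or array and also contains a marker string. The function name is_error_page and the marker list are assumptions for illustration; the real error text depends on the site.

```python
import json


def is_error_page(body_text, markers=("Page not found", "404")):
    """Return True when the body looks like an error page rather than JSON data."""
    try:
        parsed = json.loads(body_text)
    except json.JSONDecodeError:
        parsed = None
    # A JSON object or array is treated as real data, even if it contains "404".
    if isinstance(parsed, (dict, list)):
        return False
    return any(marker in body_text for marker in markers)
```

With this, a body such as '{"id": 2404}' is still scraped, while '404 Page not found' is skipped.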
A helpful uj5u.com user replied:
An ideal approach is to use the range() function and call driver.quit() at the end, as follows:
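# range() excludes its stop value, so use 3001 to include ID 3000 as in the question
for CustId in range(2000, 3001):
    try:
        urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
        driver.get(urlg)
        if "404" not in driver.page_source:
            soup = BeautifulSoup(driver.page_source, "lxml")
            dict_from_json = json.loads(soup.find("body").text)
            # logic for webscraping is here......
    except Exception:
        # skip this customer ID and move on to the next one
        continue
# quit() ends the whole browser session, not just the current window
driver.quit()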
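If it helps to know afterwards which IDs were skipped, the loop can be written so that it records them instead of silently continuing. This is only a sketch under assumptions: scrape_customers and fetch_body are hypothetical names, and fetch_body stands in for whatever returns the page body text (with Selenium, a small function that calls driver.get(url) and returns the body text of driver.page_source).

```python
import json


def scrape_customers(fetch_body, first_id, last_id):
    """Collect parsed JSON per customer ID and list the IDs that had no usable page.

    fetch_body is a caller-supplied function: it takes a URL and returns body text.
    """
    results, skipped = {}, []
    for cust_id in range(first_id, last_id + 1):  # +1 so last_id is included
        url = f"https://mywebsite.com/customerRest/show/?id={cust_id}"
        try:
            results[cust_id] = json.loads(fetch_body(url))
        except json.JSONDecodeError:
            # Assumed failure mode: the 404 page is not valid JSON.
            skipped.append(cust_id)
    return results, skipped
```

Passing the fetch step in as a function also makes the loop easy to try out with a fake before pointing it at the real site.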