Python Web Scraping: Handling Page 404 Errors

I am doing web scraping with Python, Selenium, and a headless Chrome driver, which involves running a loop:

# perform loop

CustId = 2000
while CustId <= 3000:

  # Part 1: Customer REST call:
  urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
  driver.get(urlg)

  soup = BeautifulSoup(driver.page_source, "lxml")

  dict_from_json = json.loads(soup.find("body").text)

  # logic for webscraping is here......

  CustId = CustId + 1

# close driver at end of everything

driver.close()

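However, for some customer IDs the page may not exist. I have no control over this, and the code stops with a 404 page-not-found error. How can I ignore this and continue the loop?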

I'm guessing I need a try...except?

Reply from a uj5u.com user:

One way to do this might be to use try:

# this block replaces the body of the while loop
try:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)

    soup = BeautifulSoup(driver.page_source, "lxml")

    dict_from_json = json.loads(soup.find("body").text)

    # logic for webscraping is here......

    CustId = CustId + 1
except Exception:
    # any failure while loading or parsing the page ends up here
    print("404 error found, moving on")
    CustId = CustId + 1

Sorry if this doesn't work; I haven't tested it.
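
One caveat about catching everything here: as far as I know, Selenium's driver.get() does not raise an exception for an HTTP 404 response; it simply loads the error page. The failure in the original loop most likely comes from json.loads() trying to parse the error page's text. A minimal sketch that catches only that error follows. parse_customer_json is a hypothetical helper name, and it assumes a valid customer page contains nothing but a JSON object.

```python
import json


def parse_customer_json(body_text):
    """Return the parsed JSON object from a page body, or None if it is not one.

    Hypothetical helper: assumes a valid customer page contains only a JSON
    object, while a 404 page contains HTML or plain text.
    """
    try:
        data = json.loads(body_text)
    except json.JSONDecodeError:
        # Not JSON at all, so most likely an error page
        return None
    # A bare number such as "404" is valid JSON but not a customer record
    return data if isinstance(data, dict) else None


# Example with two made-up page bodies
print(parse_customer_json('{"id": 2000, "name": "Alice"}'))
print(parse_customer_json("Page not found"))
```

Inside the loop, the result of parse_customer_json(soup.find("body").text) can be compared against None to decide whether to skip the current ID.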

Reply from a uj5u.com user:

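You can check what text appears in the h1 tag of the page body when a 404 error occurs, then test for it in an if clause so that the parsing block only runs when that text is absent.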

CustId = 2000
while CustId <= 3000:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    # only parse when the error text is absent from the body
    if "Page not found" not in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        # logic for webscraping is here......

    CustId = CustId + 1

Or:

CustId = 2000
while CustId <= 3000:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    # only parse when "404" is absent from the body
    if "404" not in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        # logic for webscraping is here......

    CustId = CustId + 1
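
Matching on page text can misfire: the string "404" could also appear in a valid response, for instance inside an ID such as 2404. Selenium's WebDriver API does not expose the HTTP status code, so one alternative is to request the status separately before calling driver.get(). The sketch below uses the standard library's urllib for that. page_exists is a hypothetical helper, and it assumes the endpoint answers a plain unauthenticated request the same way it answers the browser, which may not hold if the site requires a login session.

```python
import urllib.error
import urllib.request


def page_exists(url, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx status, False on HTTP 404.

    Hypothetical helper. `opener` defaults to urllib.request.urlopen and is a
    parameter only so the function can be exercised without network access.
    """
    try:
        with opener(url) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # other HTTP errors (403, 500, ...) are not silently ignored
```

In the loop, calling page_exists(urlg) before driver.get(urlg) lets missing IDs be skipped without inspecting the page text at all, at the cost of one extra request per ID.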

Reply from a uj5u.com user:

An ideal approach would be to use the range() function and call driver.quit() at the end, as follows:

for CustId in range(2000, 3001):  # stop value 3001 so that ID 3000 is included
    try:
        urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
        driver.get(urlg)
        if "404" not in driver.page_source:
            soup = BeautifulSoup(driver.page_source, "lxml")
            dict_from_json = json.loads(soup.find("body").text)
            # logic for webscraping is here......
    except Exception:
        # skip this ID and carry on with the next one
        continue
driver.quit()
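
Silently continuing hides which IDs were skipped. The sketch below keeps the same for/range structure but records the skipped IDs so they can be reviewed afterwards. scrape_customers and fetch_body_text are hypothetical names; fetch_body_text stands in for the driver.get() and BeautifulSoup steps and is assumed to return the page body as a string.

```python
import json


def scrape_customers(fetch_body_text, first_id, last_id):
    """Collect parsed JSON per customer ID and list the IDs that were skipped.

    `fetch_body_text` is a hypothetical callable taking a customer ID and
    returning the page body text (for example via driver.get and BeautifulSoup).
    """
    results = {}
    skipped = []
    for cust_id in range(first_id, last_id + 1):  # +1 makes last_id inclusive
        try:
            results[cust_id] = json.loads(fetch_body_text(cust_id))
        except json.JSONDecodeError:
            # Body was not JSON (likely a 404 page): note the ID and move on
            skipped.append(cust_id)
    return results, skipped


# Stand-in for the real browser call: ID 2001 pretends to be a missing page
def fake_fetch(cust_id):
    return "Page not found" if cust_id == 2001 else json.dumps({"id": cust_id})


results, skipped = scrape_customers(fake_fetch, 2000, 2002)
print(sorted(results))  # [2000, 2002]
print(skipped)          # [2001]
```

Passing the fetch step in as a function keeps the loop logic testable without starting a browser; with Selenium, the callable would wrap driver.get() and return soup.find("body").text.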

Please credit the source when reposting. Link to this article: https://www.uj5u.com/shujuku/460958.html
