Connection problem when scraping: only 1 of 2 scrapes starts (the other is ignored and works correctly only once every 5-6 attempts)

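I am building a small scraping script, purely for learning and personal use (non-profit). My problem is not with the scraping itself but with the connection (I think, although I can access the site without trouble and I do not get any bad-request errors). I noticed that the scraping sometimes works and sometimes does not: only 1 of the 2 scrapes starts. It is no longer even a 50/50 split. Serie B is now scraped correctly only about once in every 5 to 7 attempts.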

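Code description: the code connects to Tor through Firefox, with Tor configured as the proxy. It then runs 2 separate scrapes with 2 for loops (Serie A and Serie B). The goal is simply to scrape the team names in both for loops.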

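Problem: I do not get any errors, but the Serie B scrape seems to be ignored. Only Serie A is scraped, not Serie B (even though both use the same scraping code). A few days ago both scrapes worked, and Serie B failed only occasionally. Now Serie B is scraped correctly only once in every 5 to 7 attempts.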

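Intuitively, I would say the problem is the Tor connection. I also tried copying the Tor connection code and pasting it before the Serie B for loop, so that Serie A and Serie B each set up their own Tor connection. At first this worked and both Serie A and Serie B were scraped. On later attempts, Serie B was not scraped.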

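What is the problem? Is it the Python code? Is it the Tor connection through the Firefox proxy? Something else? What should I change, and how can I fix it? If my code is wrong, what should I write instead? Thanks.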

    ######## TOR CONNECTION WITH FIREFOX ########
    from selenium import webdriver
    from selenium.webdriver.common.by import By  # needed for By.CSS_SELECTOR below
    from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
    import os
    
    tor_linux = os.popen('/home/james/.local/share/torbrowser/tbb/x86_64/tor-browser_en-US') 
    
    profile = FirefoxProfile('/home/james/.local/share/torbrowser/tbb/x86_64/tor-browser_en-US/Browser/TorBrowser/Data/Browser/profile.default')
    profile.set_preference('network.proxy.type', 1)
    profile.set_preference('network.proxy.socks', '127.0.0.1')
    profile.set_preference('network.proxy.socks_port', 9050)
    profile.set_preference("network.proxy.socks_remote_dns", False) 
    
    profile.update_preferences()
    
    firefox_options = webdriver.FirefoxOptions()
    firefox_options.binary_location = '/usr/bin/firefox' 
    
    driver = webdriver.Firefox(
        firefox_profile=profile, options=firefox_options, 
        executable_path='/usr/bin/geckodriver')
    ########################################################################    
    
    #I need this for subsequent insertion into the database
    Values_SerieA = []
    Values_SerieB = []
    
    
    #### SCRAPING SERIE A ####
    driver.minimize_window()
    driver.get("https://www.diretta.it/serie-a/classifiche/")
    for SerieA in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieA_text = SerieA.text
        Values_SerieA.append(tuple([SerieA_text])) # add the teams to the empty Values list
        print(SerieA_text)
    driver.close
    
    #### SCRAPING SERIE B ####
    driver.minimize_window()
    driver.get("https://www.diretta.it/serie-b/classifiche/")
    for SerieB in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieB_text = SerieA.text
        Values_SerieB.append(tuple([SerieB_text])) # add the teams to the empty Values list
        print(SerieB_text)
    driver.close

Answer:

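A few things worth mentioning:

Selenium calls are synchronous, so calling driver.implicitly_wait(2) after requesting the site gives the page time to load before your driver starts looking for elements that are not in the DOM yet. (An implicit wait makes element lookups keep polling for up to the given number of seconds instead of failing or returning nothing right away.)
You are trying to minimize the driver window even though the last step before that was closing the driver window. Try swapping the first two lines of the Serie B section, then add a time.sleep(2) or driver.implicitly_wait(2) after them.
I have not used a proxy with the driver, so I cannot tell you whether it is causing a connection problem. If you can access the site without getting some kind of bad-request error, I would assume the connection is not the problem.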
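
To rule the proxy in or out, one cheap check is whether anything is listening on the SOCKS port the profile points at. The sketch below is only an assumption about how you might test that: the helper name socks_port_is_open is hypothetical, and a successful TCP connection shows that the port accepts connections, not that Tor has a working circuit. Note also that a standalone tor daemon listens on 9050 by default, while the Tor bundled with Tor Browser normally uses 9150, so it may be worth checking both.

```python
import socket


def socks_port_is_open(host="127.0.0.1", port=9050, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only proves that something accepts
    connections on the port, not that Tor can actually reach the site.
    """
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling socks_port_is_open() before creating the driver, and again with port=9150, would tell you whether the port in the profile matches the one Tor is really using on your machine.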

=== Try this ===

    #### SCRAPING SERIE A ####

    # request the site
    driver.get("https://www.diretta.it/serie-a/classifiche/")

    # give element lookups up to 2 seconds to find matches while the page loads
    driver.implicitly_wait(2)

    # once you're sure the page is loaded, minimize the window
    driver.minimize_window()

    for SerieA in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieA_text = SerieA.text
        Values_SerieA.append(tuple([SerieA_text]))  # add the team to the Values list
        print(SerieA_text)
    # do not close the window here: the Serie B section still needs it

    #### SCRAPING SERIE B ####

    # request the site
    driver.get("https://www.diretta.it/serie-b/classifiche/")

    # give element lookups up to 2 seconds to find matches while the page loads
    driver.implicitly_wait(2)

    # once you're sure the window is loading correctly you can move
    # this back up to happen before the wait
    driver.minimize_window()

    for SerieB in driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='tableCellParticipant__name']"):
        SerieB_text = SerieB.text  # read from SerieB, not SerieA
        Values_SerieB.append(tuple([SerieB_text]))  # add the team to the Values list
        print(SerieB_text)
    driver.quit()  # end the session once both leagues are scraped
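
Since the two sections differ only in the URL, they could also be folded into one helper, which removes the chance of a copy-paste slip between the loops. The sketch below is one possible way to do that, not the asker's or Selenium's own method: scrape_team_names and TEAM_SELECTOR are hypothetical names, and the hand-rolled polling loop stands in for Selenium's WebDriverWait so that the function works with any object that has get() and find_elements(). It passes the string "css selector", which is the value of By.CSS_SELECTOR.

```python
import time

# Selector taken from the question; the constant name is an assumption.
TEAM_SELECTOR = "a[href^='/squadra'][class^='tableCellParticipant__name']"


def scrape_team_names(driver, url, selector=TEAM_SELECTOR, timeout=10.0, poll=0.5):
    """Load url and return the team names as a list of 1-tuples.

    Polls find_elements until at least one element matches or the
    timeout expires, so a slowly loading page yields an empty list
    only after the full timeout instead of immediately.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string behind selenium's By.CSS_SELECTOR
        elements = driver.find_elements("css selector", selector)
        if elements or time.monotonic() >= deadline:
            return [(element.text,) for element in elements]
        time.sleep(poll)


# Possible usage with the driver created in the question:
# Values_SerieA = scrape_team_names(driver, "https://www.diretta.it/serie-a/classifiche/")
# Values_SerieB = scrape_team_names(driver, "https://www.diretta.it/serie-b/classifiche/")
# driver.quit()
```

An empty result after the full timeout then points at the page or the connection rather than at timing, which makes the intermittent Serie B failures easier to diagnose.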


Tags: python, python-3.x, selenium, selenium-webdriver, proxy
