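I'm having trouble using Selenium in Python to move to the next page in a robust way across the 448 consecutive pages of https://www.digitalwallonia.be/fr/cartographie/. I have tried (too) many things without satisfactory results, so it is hard to post the relevant code.
I'd like to see your solution. Apologies if the question is poorly worded: this is my first one.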
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.implicitly_wait(20)
browser.get('https://www.digitalwallonia.be/fr/cartographie')
browser.find_element("xpath",'//*[@id="axeptio_btn_acceptAll"]').click()
browser.find_element("xpath",'//*[@id="axeptio_btn_configure"]').click()
browser.find_element("xpath",'//*[@id="axeptio_btn_acceptAllAndNext"]').click()
WebDriverWait(browser, 1000).until(EC.element_to_be_clickable((By.CLASS_NAME,'next'))).click()
input('Press ENTER to close the automated browser')
browser.quit()
I get the following error: selenium.common.exceptions.ElementNotInteractableException: Message: Element could not be scrolled into view
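Answer from a uj5u.com user:
I'd suggest a few things here:
- You're better off using WebDriverWait rather than implicitly_wait, because implicitly_wait only waits for the element to be present, whereas with WebDriverWait you can wait for more mature element states, i.e. visible, clickable, etc.
- Don't mix WebDriverWait and implicitly_wait in the same code; doing so can cause problems.
- The next page button is at the bottom of the page, so you need to scroll down before you can click the pagination button.
- There's no need to set timeouts longer than 30 seconds.
The code below works: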
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
# Raw string so the backslashes in the Windows path are not treated as escapes
webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=webdriver_service, options=options)
url = "https://www.digitalwallonia.be/fr/cartographie"
actions = ActionChains(driver)
wait = WebDriverWait(driver, 10)
driver.get(url)
# Dismiss the cookie consent dialog
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_acceptAll"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_configure"]'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="axeptio_btn_acceptAllAndNext"]'))).click()
# Scroll down so the pagination controls are in view, then click "next"
driver.execute_script("window.scrollBy(0, arguments[0]);", 800)
time.sleep(0.5)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.next a'))).click()
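The code above clicks the next-page link once. To walk through all the pages, the same scroll-then-click step could be repeated in a loop. What follows is only a sketch: click_through_pages, locate_next and run_with_selenium are hypothetical names, and it assumes that the '.next a' link stops being clickable on the last page, which has not been checked against the site.

```python
def click_through_pages(driver, locate_next, max_pages):
    """Click the 'next' link repeatedly; return the number of pages visited.

    driver      -- a Selenium WebDriver (only execute_script is used here)
    locate_next -- hypothetical callable returning the clickable 'next'
                   element, or None when no such element is found
    max_pages   -- safety cap so the loop always terminates
    """
    pages_visited = 1  # the page that is already loaded
    while pages_visited < max_pages:
        next_link = locate_next()
        if next_link is None:
            break  # assumed to mean the last page was reached
        # Bring the link into the viewport before clicking, since it sits
        # at the bottom of the page.
        driver.execute_script(
            "arguments[0].scrollIntoView({block: 'center'});", next_link
        )
        next_link.click()
        pages_visited += 1
    return pages_visited


def run_with_selenium(driver, max_pages=448):
    """Wire the loop to Selenium; call after the cookie dialog is dismissed."""
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, 10)

    def locate_next():
        # Return None instead of raising when the link never becomes clickable
        try:
            return wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, ".next a"))
            )
        except TimeoutException:
            return None

    return click_through_pages(driver, locate_next, max_pages)
```

Since the results are reloaded after each click, a real run would probably also need to wait for the new results to render before scraping them or clicking again, for example with EC.staleness_of on an element from the previous page.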
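Answer from a uj5u.com user:
Each time you click to go to the next page (the "Suivant" button), the page's JavaScript sends a POST request to an API endpoint, with headers and a payload. The headers, payload and API endpoint can be found in the browser's dev tools, in the Network tab (select XHR calls only). So we can try to scrape the API URL with requests and avoid the overhead of selenium/chromedriver. Here is one way to get the data: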
import requests
import pandas as pd

big_df = pd.DataFrame()
url = 'https://search.production.ribo.digitalwallonia.be/contentful-entries_production/_search/template'
headers = {
    'content-type': 'application/json',
    'Origin': 'https://www.digitalwallonia.be',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
counter = 0  # offset of the first result in the next batch
while True:
    payload = '{"id":"filter-profile-search-template-fr-v3","params":{"categoriesSlugList":[],"programsSlugList":[],"from":' + str(counter) + ',"regionsList":[],"size":100}}'
    r = s.post(url, data=payload)
    # Flatten the nested JSON hits into columns and append them
    big_df = pd.concat([big_df, pd.json_normalize(r.json()['hits']['hits'])], axis=0, ignore_index=True)
    counter = counter + 100
    # The site shows 448 pages of 12 items each
    if counter > 448*12:
        break
print(big_df)
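The request body in the loop above is put together by string concatenation. As a possible alternative, it could be built from a dict with json.dumps, with the offsets generated by range. This is a sketch only: build_payload and page_offsets are hypothetical helper names, the template id and parameter names are copied from the payload above, and 448*12 is the same upper bound the loop uses.

```python
import json


def build_payload(offset, size=100):
    """Return the JSON request body for one batch of results."""
    return json.dumps({
        "id": "filter-profile-search-template-fr-v3",
        "params": {
            "categoriesSlugList": [],
            "programsSlugList": [],
            "from": offset,
            "regionsList": [],
            "size": size,
        },
    })


def page_offsets(total, size=100):
    """Return the 'from' offsets needed to cover `total` items."""
    return list(range(0, total, size))


offsets = page_offsets(448 * 12)
print(len(offsets), offsets[:3], offsets[-1])
print(build_payload(offsets[1]))
```

Each body could then be sent with s.post(url, data=build_payload(offset)), which issues the same sequence of offsets as the while loop.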
We get 100 items per request (the actual page fetches 12 at a time). After about a minute, you should see the following dataframe in your terminal:
_index _type _id _score sort _source.sys.id _source.sys.contentType.sys.id _source.sys.updatedAt _source.fields.addresses.fr _source.fields.belgianEnterprisesNumbers.fr _source.fields.urlsWebSite.fr _source.fields.shortDescription.en _source.fields.shortDescription.fr _source.fields.logoAssetImage.fr.file.en.fileName _source.fields.logoAssetImage.fr.file.en.details.image.width _source.fields.logoAssetImage.fr.file.en.details.image.height _source.fields.logoAssetImage.fr.file.en.details.size _source.fields.logoAssetImage.fr.file.en.contentType _source.fields.logoAssetImage.fr.file.en.url _source.fields.logoAssetImage.fr.file.fr.fileName _source.fields.logoAssetImage.fr.file.fr.details.image.width _source.fields.logoAssetImage.fr.file.fr.details.image.height _source.fields.logoAssetImage.fr.file.fr.details.size _source.fields.logoAssetImage.fr.file.fr.contentType _source.fields.logoAssetImage.fr.file.fr.url _source.fields.logoAssetImage.fr.title.en _source.fields.logoAssetImage.fr.title.fr _source.fields.title.en _source.fields.title.fr _source.fields.slug.en _source.fields.slug.fr _source.fields.urlsSocialNetwork.fr _source.fields.shortTitle.en _source.fields.shortTitle.fr _source.fields.founders.fr _source.fields.mainNaceCode.fr _source.fields.staffing.fr _source.fields.logoAssetImage.fr _source.fields.partnersAdditionalDescriptions.fr _source.fields.incubators.fr
0 contentful-entries_productionv3 _doc 3O1t8sTHhj5ZGrmGKtHI6y None [ Dynamix JAVA] 3O1t8sTHhj5ZGrmGKtHI6y profile 2022-09-01T14:36:06.899Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.388591169708497, 'Lat': 50.7035958197085}, 'Northeast': {'Lng': 4.391289130291502, 'Lat': 50.7062937802915}}, 'coordinates': [4.3898572, 50.7050388], 'type': 'Point', 'Location': {'Lng': 4.3898572, 'Lat': 50.7050388}}, 'Metadata': {'PlaceId': 'ChIJOZeR297Rw0cR_y-bZPZvwzQ', 'AddressType': 'head office', 'Timestamp': '2022-08-29T13:55:32.180Z'}, 'FormattedAddress': 'Av. des Dauphins 17, 1410 Waterloo, Belgique', 'MainAddress': True}] [0715677777] [{'Metadata': {'Timestamp': '2022-08-29T15:58:45 02:00'}, 'URL': 'https://dynamix-it.be/'}] Consulting company specialised in JAVA, SAP, DotNet, and son one. Société de consultance spécialisée en JAVA, SAP, DotNet, etc. dynamix_java.png 160.0 160.0 15950.0 image/png //images.ctfassets.net/myqv2p4gx62v/3jrjoVohZ1ooo2VMkum0Ns/1e5bd1ac59dab0126baea85f9156b872/dynamix_java.png dynamix java.png 160.0 160.0 15950.0 image/png //images.ctfassets.net/myqv2p4gx62v/3jrjoVohZ1ooo2VMkum0Ns/8e23b45bf77a17026df43cd072d06a52/dynamix_java.png Dynamix Java Dynamix Java Dynamix JAVA Dynamix JAVA dynamix-java dynamix-java [{'Metadata': {'Timestamp': '2022-08-29T15:58:14 02:00'}, 'URL': 'https://www.facebook.com/DYNAMIXJAVASPRL'}, {'Metadata': {'Timestamp': '2022-08-29T15:58:27 02:00'}, 'URL': 'https://www.linkedin.com/company/dynamixjava/'}] NaN NaN NaN NaN NaN NaN NaN NaN
1 contentful-entries_productionv3 _doc 4D2kOg0t4iRD11fzJFaPc8 None [ Lan-Area ] 4D2kOg0t4iRD11fzJFaPc8 profile 2022-08-25T08:42:32.473Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.744188919708497, 'Lat': 50.3149442697085}, 'Northeast': {'Lng': 4.746886880291502, 'Lat': 50.3176422302915}}, 'coordinates': [4.745529299999999, 50.31632769999999], 'type': 'Point', 'Location': {'Lng': 4.745529299999999, 'Lat': 50.31632769999999}}, 'Metadata': {'PlaceId': 'ChIJm9XAKz6SwUcRs45ovYpmEpc', 'AddressType': 'head office', 'Timestamp': '2022-06-21T14:17:33.655Z'}, 'FormattedAddress': 'Rue d'Ermeton 14, 5537 Anhée, Belgique', 'MainAddress': True}] [0779822986] [{'Metadata': {'Timestamp': '2022-08-25T10:42:29 02:00'}, 'URL': 'https://www.lan-area.be/'}] Platform exclusively focused on local sports competition. Lan-Area has created a central calendar where all local events are announced and a Belgian community space where players can post their teams, courses and successes. Plateforme exclusivement tournée vers la compétition e-sportive locale . Lan-Area a créé un calendrier central où tous les évènements locaux sont annoncés et un espace communautaire belge où les joueurs peuvent afficher leurs équipes, parcours et succès. lan-Aera.jpg 450.0 250.0 21154.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/3Gg1nuukov4gaypTawIQs8/346ee9006b0b5e3e33d2fab6ce293a47/lan-Aera.jpg lan-Aera.jpg 450.0 250.0 21154.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/3Gg1nuukov4gaypTawIQs8/7f30ce6782073cf51d16c1f67ef5ee0d/lan-Aera.jpg lan-Aera Logo Lan-Aera Lan-Aera Lan-Area lan-aera lan-area [{'Metadata': {'Timestamp': '2022-06-21T15:06:34 02:00'}, 'URL': 'https://www.facebook.com/lanarea2020'}, {'Metadata': {'Timestamp': '2022-06-21T15:07:31 02:00'}, 'URL': 'https://twitter.com/LanArea5'}, {'Metadata': {'Timestamp': '2022-06-21T15:59:53 02:00'}, 'URL': 'https://www.twitch.tv/ladh_lanarea'}] NaN NaN NaN NaN NaN NaN NaN NaN
2 contentful-entries_productionv3 _doc 6sbdRDRWJXTTtbR1wycE52 None [1-formation.be] 6sbdRDRWJXTTtbR1wycE52 profile 2022-05-15T11:21:20.388Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.863200770107277, 'Lat': 50.46117977010727}, 'Northeast': {'Lng': 4.865900429892721, 'Lat': 50.46387942989271}}, 'coordinates': [4.864224099999999, 50.462539], 'type': 'Point', 'Location': {'Lng': 4.864224099999999, 'Lat': 50.462539}}, 'Metadata': {'PlaceId': 'ChIJa-SkInKZwUcRsc1Xs-GqwSE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:43:01.598Z'}, 'FormattedAddress': 'Rue des Fossés Fleuris 42, 5000 Namur, Belgique', 'MainAddress': True}] [0891973792] [{'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'http://www.1-formation.be/'}] Training in IT following based on four subjects: office applications, web and image, web marketing and communication, personnel management and development. Formations en informatique suivant quatre thématiques: bureautique, web et image, webmarketing et communication, management et développement personnel. NaN NaN NaN NaN NaN NaN logo-f-1-formation.jpg 350.0 77.0 5569.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/7Itx3K16vYyGTHuYUD7TfW/7103d85dbce48d1c3a0535dac76df5c0/logo-f-1-formation.jpg NaN logo-f-1-formation.jpg 1-formation.be 1-formation.be 1-formationbe 1-formationbe [{'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'https://twitter.com/1formation_be'}, {'Metadata': {'Timestamp': '2022-05-07T18:43:01.459Z'}, 'URL': 'https://www.facebook.com/1formation'}] NaN NaN NaN NaN NaN NaN NaN NaN
3 contentful-entries_productionv3 _doc 4EuOqP1eQIeka5xHcoq5mQ None [1-position.be] 4EuOqP1eQIeka5xHcoq5mQ profile 2022-05-15T11:21:23.274Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.863200770107277, 'Lat': 50.46117977010727}, 'Northeast': {'Lng': 4.865900429892721, 'Lat': 50.46387942989271}}, 'coordinates': [4.864224099999999, 50.462539], 'type': 'Point', 'Location': {'Lng': 4.864224099999999, 'Lat': 50.462539}}, 'Metadata': {'PlaceId': 'ChIJa-SkInKZwUcRsc1Xs-GqwSE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:51:39.745Z'}, 'FormattedAddress': 'Rue des Fossés Fleuris 42, 5000 Namur, Belgique', 'MainAddress': True}] [0891973792] [] Communications agency and IT training centre: website creation, professional SEO, the creation of Google Adwords campaigns, copywriting and web content, visual identity creation, communications consulting. Agence de communication et centre de formation informatique: création de sites web, référencement professionnel, création et gestion de campagnes Google AdWords, copywriting et écriture web, création d'identité visuelle, conseil en communication. NaN NaN NaN NaN NaN NaN marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png 169.0 129.0 11128.0 image/png //images.ctfassets.net/myqv2p4gx62v/2RMVJINCIXiF4O2hZIb6kx/c1aebc77207c1a5ae67af5ebd87b1dd3/marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png NaN marque-cp52u3ifgt9us27gak951f15p6-1369821343-position.png 1-position.be 1-position.be 1-positionbe 1-positionbe [{'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://twitter.com/1position'}, {'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://www.facebook.com/pages/1-positionbe/147447630063'}, {'Metadata': {'Timestamp': '2022-05-07T18:51:39.679Z'}, 'URL': 'https://www.linkedin.com/company/1-position.be'}] NaN NaN NaN NaN NaN NaN NaN NaN
4 contentful-entries_productionv3 _doc 1VvYEZncg0lEDL8RzGAvmE None [123 Automation Engineering & Development] 1VvYEZncg0lEDL8RzGAvmE profile 2022-05-15T05:25:51.214Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.456926070107278, 'Lat': 50.53833147010727}, 'Northeast': {'Lng': 4.459625729892722, 'Lat': 50.54103112989272}}, 'coordinates': [4.4582759, 50.5396813], 'type': 'Point', 'Location': {'Lng': 4.4582759, 'Lat': 50.5396813}}, 'Metadata': {'PlaceId': 'EjNSdWUgZGVzIEFydGlzYW5zIDQsIDYyMTAgTGVzIEJvbnMgVmlsbGVycywgQmVsZ2lxdWUiGhIYChQKEgn75Aq3dyzCRxFEh7hEj1NdPBAE', 'AddressType': 'head office', 'Timestamp': '2022-05-07T15:17:32.918Z'}, 'FormattedAddress': 'Rue des Artisans 4, 6210 Les Bons Villers, Belgique', 'MainAddress': True}] [0820888531] [{'Metadata': {'Timestamp': '2022-05-07T15:17:32.867Z'}, 'URL': 'http://www.123automation.be/'}] NaN Automation et robotique industrielle: étude, conception, développement, intégration et maintenance de solutions automatisées visant l’amélioration de la productivité dans les processus de fabrication quels qu’ils soient. NaN NaN NaN NaN NaN NaN 123automation.png 319.0 111.0 5802.0 image/png //images.ctfassets.net/myqv2p4gx62v/6uY3Y6EDfICh8wdp4XNK7Z/082273035f7a600ec34098b09ab4fee9/123automation.png NaN 123automation.png 123 Automation Engineering & Development 123 Automation Engineering & Development 123-automation 123-automation [] NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5360 contentful-entries_productionv3 _doc 1AbDfyZ4rHL18Bw6aiJKSA None [école Centrale des Arts et Métiers - HE Vinci] 1AbDfyZ4rHL18Bw6aiJKSA profile 2022-05-15T11:43:23.005Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.452325870107279, 'Lat': 50.84853592010727}, 'Northeast': {'Lng': 4.455025529892723, 'Lat': 50.85123557989272}}, 'coordinates': [4.4538028, 50.8499896], 'type': 'Point', 'Location': {'Lng': 4.4538028, 'Lat': 50.8499896}}, 'Metadata': {'PlaceId': 'ChIJwdgtpYbcw0cRfjW1nUhDNk8', 'AddressType': 'head office', 'Timestamp': '2022-05-07T15:44:19.720Z'}, 'FormattedAddress': 'Prom. de l'Alma 50, 1200 Woluwe-Saint-Lambert, Belgique', 'MainAddress': True}] [0459279954, 0409454123] [{'Metadata': {'Timestamp': '2022-05-07T15:44:19.660Z'}, 'URL': 'http://www.ecam.be/'}] NaN L'ECAM est un Institut Supérieur Industriel ayant pour objet la formation de Master en sciences industrielles dans une des spécialités suivantes: ?automatisation, construction, électromécanique, électronique, géomètre, informatique, business analyst (alternance). NaN NaN NaN NaN NaN NaN ecam.jpg 512.0 512.0 93657.0 image/jpeg //images.ctfassets.net/myqv2p4gx62v/4e2oSTcbXRABuyibUwgs95/4e5d8f540ccc67065a94eb528418ddd7/ecam.jpg NaN ecam.jpg école Centrale des Arts et Métiers - HE Vinci école Centrale des Arts et Métiers - HE Vinci ecole-centrale-des-arts-et-metiers ecole-centrale-des-arts-et-metiers [] ECAM ECAM NaN NaN NaN NaN NaN NaN
5361 contentful-entries_productionv3 _doc 5vp8xZpO6CucXtOmc1H8yR None [école communale fondamentale de Seneffe] 5vp8xZpO6CucXtOmc1H8yR profile 2022-05-15T09:12:19.246Z [{'Geometry': {'Viewport': {'Southwest': {'Lng': 4.252977370107278, 'Lat': 50.52898217010728}, 'Northeast': {'Lng': 4.255677029892722, 'Lat': 50.53168182989272}}, 'coordinates': [4.2543333, 50.5303456], 'type': 'Point', 'Location': {'Lng': 4.2543333, 'Lat': 50.5303456}}, 'Metadata': {'PlaceId': 'ChIJt1KItgg0wkcR6ekUYWMbdDg', 'AddressType': 'head office', 'Timestamp': '2022-05-07T18:58:11.863Z'}, 'FormattedAddress': 'Rue de Buisseret 19, 7180 Seneffe, Belgique', 'MainAddress': True}] NaN [] NaN Ecole fondamentale. NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN école communale fondamentale de Seneffe école communale fondamentale de Seneffe ecole-communale-de-seneffe ecole-communale-de-seneffe [] NaN NaN NaN NaN NaN NaN NaN NaN
[...]
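This dataframe has 5365 rows × 40 columns. You can inspect the initial JSON response and dissect it further; you may need more, less, or different information from it.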
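As one illustration of dissecting the response further, the sketch below pulls a few fields out of a raw hit using plain dictionaries. The key paths (_source.fields.title.fr, _source.fields.slug.fr, and _source.fields.addresses.fr with FormattedAddress and MainAddress) are inferred from the column names and values printed above, not from any API documentation; summarize_hit is a hypothetical helper, and sample_hit is an abridged record based on the first row of that output.

```python
def summarize_hit(hit):
    """Pick the French title, slug and main address out of one search hit."""
    fields = hit.get("_source", {}).get("fields", {})
    addresses = fields.get("addresses", {}).get("fr") or []
    # Prefer the address flagged as main; fall back to the first one.
    main = next((a for a in addresses if a.get("MainAddress")), None)
    if main is None and addresses:
        main = addresses[0]
    return {
        "title": fields.get("title", {}).get("fr"),
        "slug": fields.get("slug", {}).get("fr"),
        "address": main.get("FormattedAddress") if main else None,
    }


# Abridged, hand-written record modeled on the first row shown above
sample_hit = {
    "_source": {
        "fields": {
            "title": {"fr": "Dynamix JAVA"},
            "slug": {"fr": "dynamix-java"},
            "addresses": {
                "fr": [
                    {
                        "FormattedAddress": "Av. des Dauphins 17, 1410 Waterloo, Belgique",
                        "MainAddress": True,
                    }
                ]
            },
        }
    }
}

print(summarize_hit(sample_hit))
```

Applied to every element of r.json()['hits']['hits'], this would give a much narrower table than the 40-column dataframe.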
Requests documentation: https://requests.readthedocs.io/en/latest/
Pandas documentation for json_normalize: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
Tags: python python-3.x selenium selenium-webdriver web-scraping
