I'm trying to scrape data from Airbnb using BeautifulSoup and Selenium. I want to collect every listing from this search page.
Here is what I have so far:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def scrape_page(page_url):
    driver_path = "C:/Users/parkj/Downloads/chromedriver_win32/chromedriver.exe"
    driver = webdriver.Chrome(service = Service(driver_path))
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'itemprop')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    return soup

def extract_listing(page_url):
    page_soup = scrape_page(page_url)
    listings = page_soup.find_element(By.CLASS_NAME, "itemprop")
    return listings

page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto Prefecture, Japan&date_picker_type=flexible_dates&search_type=unknown"

#items = extract_listing(page_url)
#process items to get all information you need, just an example
#[{'name':items.select_one('[itemprop="name"]')['content'],
#  'url':items.select_one('[itemprop="url"]')['content']}
# for i in items]

test = scrape_page(page_url)
test
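scrape_page() seems to return the HTML of the search page, but not the full content. It does not include the information I need, which is in this part of the HTML:
[Image: screenshot of the HTML markup]
I did some research and found that WebDriverWait might help, but I get a TimeoutException.
[Image: screenshot of the TimeoutException error]
The end goal is to get the name and URL of each listing. The first 3 items of the resulting list should look like this: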
[{'name': '?Kyoto?/Near Station & Bus/Temple/Twin Room(^^???',
'url': 'www.airbnb.com/rooms/50290730?adults=1&children=0&infants=0&check_in=2022-07-20&check_out=2022-07-27&previous_page_section_name=1000'},
{'name': 'Stay in Kyoto central island',
'url': 'www.airbnb.com/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'},
{'name': '和楽庵【Single】100 Year old Machiya Guest House (1pax)',
'url': 'www.airbnb.com/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'}]
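I apologize in advance if I haven't included enough information in this question, as this is my first time posting here. I would appreciate any help, thank you.
A uj5u.com user replied:
I don't use Selenium often, but I would recommend requests.
Try this: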
from requests import get
from bs4 import BeautifulSoup
headers = {'User-agent':'Mozilla/5.0 (X11; Linux i686; rv:100.0) Gecko/20100101 Firefox/100.0.'}
res = get('https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto Prefecture, Japan&date_picker_type=flexible_dates&search_type=unknown', headers=headers)
soup = BeautifulSoup(res.text, features="html.parser")
url_list = soup.find_all("meta", attrs={"itemprop":"url"})
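In my case it returned 20 results, as many as there are on one page. If you want more, you need to scrape another page.
The Firefox user agent is very important. Many pages do not block this user agent; it is an old scraping trick.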
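One side note on the URL itself: it contains a raw space, a comma, and square brackets. As far as I know requests tidies this up on its own, but if you would rather make the encoding explicit, here is a minimal sketch using only the standard library. The names BASE_URL, SEARCH_PARAMS and build_search_url are made up for this example; the parameter values are copied from the question's URL.

```python
from urllib.parse import urlencode

# Base path and query parameters copied from the URL in the question.
BASE_URL = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes"
SEARCH_PARAMS = [
    ("tab_id", "home_tab"),
    ("flexible_trip_lengths[]", "one_week"),
    ("refinement_paths[]", "/homes"),
    ("place_id", "ChIJYRsf-SB0_18ROJWxOMJ7Clk"),
    ("query", "Kyoto Prefecture, Japan"),
    ("date_picker_type", "flexible_dates"),
    ("search_type", "unknown"),
]


def build_search_url(base_url, params):
    """Return base_url with params percent-encoded into the query string.

    build_search_url is a hypothetical helper, not part of any library.
    """
    return f"{base_url}?{urlencode(params)}"


search_url = build_search_url(BASE_URL, SEARCH_PARAMS)
print(search_url)
```

The resulting search_url can be passed to get() in place of the hand-written string.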
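A uj5u.com user replied:
itemprop is an attribute, not an ID, so By.ID never matches and the wait times out. In this case, wait for a more specific element by using a CSS selector: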
wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))
Also try to avoid using Selenium syntax on a BeautifulSoup object; use CSS selectors with the bs4 syntax instead:
listings = page_soup.select('[itemprop="itemListElement"]')
Example
...
def scrape_page(page_url):
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    return soup

def extract_listing(page_url):
    page_soup = scrape_page(page_url)
    listings = page_soup.select('[itemprop="itemListElement"]')
    return listings

page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto Prefecture, Japan&date_picker_type=flexible_dates&search_type=unknown"

items = extract_listing(page_url)

#process items to get all information you need, just an example
[{'name':i.select_one('[itemprop="name"]')['content'],
  'url':i.select_one('[itemprop="url"]')['content']}
 for i in items]
Output
[{'name': '?Kyoto?/N?he Bahnhof & Bus/Tempel/Einzelzimmer(^^?',
'url': 'www.airbnb.de/rooms/50293998?adults=1&children=0&infants=0&check_in=2022-06-22&check_out=2022-06-29&previous_page_section_name=1000'},
{'name': '100 Jahre altes Machiya-G?stehaus (1Pax)',
'url': 'www.airbnb.de/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-08-22&check_out=2022-08-29&previous_page_section_name=1000'},
{'name': '27, Deluxe Designer Zweibett- / Dreibettzimmer in Shijo (1-3 Personen / Nichtraucher)',
'url': 'www.airbnb.de/rooms/41413491?adults=1&children=0&infants=0&check_in=2023-05-16&check_out=2023-05-23&previous_page_section_name=1000'},
{'name': 'Aufenthalt auf der zentralen Insel Kyoto',
'url': 'www.airbnb.de/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-24&check_out=2022-07-01&previous_page_section_name=1000'},
{'name': 'Sweet 202 Privatzimmer ??',
'url': 'www.airbnb.de/rooms/30217767?adults=1&children=0&infants=0&check_in=2022-07-18&check_out=2022-07-25&previous_page_section_name=1000'},
{'name': 'Kyoto Sanjo Ohashi Superior Zweibettzimmer Studio Nichtraucher Superior Zweibettzimmer',
'url': 'www.airbnb.de/rooms/45207535?adults=1&children=0&infants=0&check_in=2022-09-27&check_out=2022-10-04&previous_page_section_name=1000'},
{'name': 'Toller Blick auf den Fluss, sch?nes traditionelles Haus',
'url': 'www.airbnb.de/rooms/25762078?adults=1&children=0&infants=0&check_in=2022-12-07&check_out=2022-12-14&previous_page_section_name=1000'},
{'name': 'Doppelzimmer - Waschmaschine in allen Zimmern ☆ Guest House 10-Minuten zu Fu? von Kyoto Station -',
'url': 'www.airbnb.de/rooms/51433076?adults=1&children=0&infants=0&check_in=2022-06-13&check_out=2022-06-20&previous_page_section_name=1000'},
{'name': 'In der N?he des Bahnhofs Kyoto Gemütliches Zimmer in einem traditionellen Haus',
'url': 'www.airbnb.de/rooms/25600163?adults=1&children=0&infants=0&check_in=2022-09-12&check_out=2022-09-19&previous_page_section_name=1000'},
{'name': 'Gemütliche und ruhige zweist?ckige japanische Wohnung',
'url': 'www.airbnb.de/rooms/38743436?adults=1&children=0&infants=0&check_in=2023-03-11&check_out=2023-03-18&previous_page_section_name=1000'},
{'name': '51★Günstigste★5 Minuten zu Fu? Shin-Osaka Sta.★Max 1 G?ste',
'url': 'www.airbnb.de/rooms/14539052?adults=1&children=0&infants=0&check_in=2022-07-03&check_out=2022-07-10&previous_page_section_name=1000'},
{'name': '和楽庵【Doppel】100 Jahre altes Machiya G?stehaus (2pax)',
'url': 'www.airbnb.de/rooms/22867502?adults=1&children=0&infants=0&check_in=2022-08-26&check_out=2022-09-02&previous_page_section_name=1000'},
{'name': 'Expo Hostel Nishi #1 /1000yen Fahrrad für deinen Aufenthalt',
'url': 'www.airbnb.de/rooms/8295322?adults=1&children=0&infants=0&check_in=2022-08-27&check_out=2022-09-03&previous_page_section_name=1000'},
{'name': '★Lovely RiverSide House in★der N?he von Einkaufsviertel★3 Betten',
'url': 'www.airbnb.de/rooms/40117962?adults=1&children=0&infants=0&check_in=2022-07-07&check_out=2022-07-14&previous_page_section_name=1000'},
{'name': 'ZIMMER - Bereich Central Kyoto Gion',
'url': 'www.airbnb.de/rooms/15215980?adults=1&children=0&infants=0&check_in=2022-06-14&check_out=2022-06-21&previous_page_section_name=1000'},
{'name': 'Raum, um das Kyoto zu genie?en.',
'url': 'www.airbnb.de/rooms/9263813?adults=1&children=0&infants=0&check_in=2022-09-08&check_out=2022-09-15&previous_page_section_name=1000'},
{'name': 'Stilvolles modernes Kyo-Machiya 500 金閣寺 m vom Trockner entfernt',
'url': 'www.airbnb.de/rooms/20041502?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'},
{'name': 'Hotel Sou Kyoto Gion Queen Studio',
'url': 'www.airbnb.de/rooms/40236377?adults=1&children=0&infants=0&check_in=2022-06-22&check_out=2022-06-29&previous_page_section_name=1000'},
{'name': 'Workation GroLiving in KYOTO',
'url': 'www.airbnb.de/rooms/612511811801466646?adults=1&children=0&infants=0&check_in=2022-07-19&check_out=2022-07-26&previous_page_section_name=1000'},
{'name': '【home quarantin ok】shibainuatiniya/Kyoto Sta/Toji',
'url': 'www.airbnb.de/rooms/34028813?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'}]
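If you want to try the extraction step without launching a browser, here is a rough sketch that runs the same selectors against a small hand-written fragment. SAMPLE_HTML is a made-up stand-in for the real page (which is much larger and may change at any time), and parse_listings is a hypothetical helper name. It also skips listings where one of the tags is missing, since select_one returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the microdata markup discussed above.
SAMPLE_HTML = """
<div itemprop="itemListElement">
  <meta itemprop="name" content="Stay in Kyoto central island">
  <meta itemprop="url" content="www.airbnb.com/rooms/42780789?adults=1">
</div>
<div itemprop="itemListElement">
  <meta itemprop="name" content="Listing without a URL tag">
</div>
"""


def parse_listings(html):
    """Return a list of {'name', 'url'} dicts, skipping incomplete listings."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select('[itemprop="itemListElement"]'):
        name_tag = item.select_one('[itemprop="name"]')
        url_tag = item.select_one('[itemprop="url"]')
        if name_tag is None or url_tag is None:
            continue  # select_one returns None when nothing matches
        results.append({
            "name": name_tag["content"],
            # The content attribute carries no scheme, so prepend one.
            "url": "https://" + url_tag["content"],
        })
    return results


listings = parse_listings(SAMPLE_HTML)
print(listings)
```

With the Selenium version above, the same function could be fed driver.page_source instead of the sample string.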
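A uj5u.com user replied:
If you want the full content of the page, I think you should look for something that can run the site's JavaScript, something like a stripped-down version of the Chrome engine.
I don't know whether it can do the job, but something like Qt WebEngine might.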
