BeautifulSoup does not return the full HTML from an Airbnb search page

I am trying to scrape data from Airbnb using BeautifulSoup and Selenium. I want to collect every listing from this search page.

This is what I have so far:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def scrape_page(page_url):
    
    driver_path = "C:/Users/parkj/Downloads/chromedriver_win32/chromedriver.exe"
    driver = webdriver.Chrome(service = Service(driver_path))
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'itemprop')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    
    return soup

def extract_listing(page_url):
    
    page_soup = scrape_page(page_url)
    listings = page_soup.find_element(By.CLASS_NAME, "itemprop")
    return listings

page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto Prefecture, Japan&date_picker_type=flexible_dates&search_type=unknown"
#items = extract_listing(page_url)

#process items to get all information you need, just an example
#[{'name':items.select_one('[itemprop="name"]')['content'],
#  'url':items.select_one('[itemprop="url"]')['content']} 
# for i in items]

test = scrape_page(page_url)
test

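scrape_page() does return HTML from the search page, but not the complete content. It is missing the information I need, which is this part of the HTML:

[Image: screenshot of the HTML]

I did some research and found that WebDriverWait might help, but I get a TimeoutException.

[Image: TimeoutException error]

The end goal is to get the name and URL of each listing. The first 3 items of the resulting list should look something like this: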

[{'name': '?Kyoto?/Near Station & Bus/Temple/Twin Room(^^???',
  'url': 'www.airbnb.com/rooms/50290730?adults=1&children=0&infants=0&check_in=2022-07-20&check_out=2022-07-27&previous_page_section_name=1000'},
 {'name': 'Stay in Kyoto central island',
  'url': 'www.airbnb.com/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'},
 {'name': '和楽庵【Single】100 Year old Machiya Guest House (1pax)',
  'url': 'www.airbnb.com/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'}]

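I apologize in advance if I have not included enough information in this question, as this is my first time posting here. I would appreciate any help, thank you.

Reply from a uj5u.com user:

I don't use Selenium often, but I would recommend requests.

Try this: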

from requests import get
from bs4 import BeautifulSoup

headers = {'User-agent':'Mozilla/5.0 (X11; Linux i686; rv:100.0) Gecko/20100101 Firefox/100.0.'}

res = get('https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto Prefecture, Japan&date_picker_type=flexible_dates&search_type=unknown', headers=headers)

soup = BeautifulSoup(res.text, features="html.parser")

url_list = soup.find_all("meta", attrs={"itemprop":"url"})

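In my case it returned 20 results, which is as many as there are on one page. If you want more, you need to scrape another page.

The Firefox User-Agent header is very important. Many sites do not block this user agent; it is an old scraping trick.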
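
To scrape further pages you first need their URLs. The thread does not say how Airbnb paginates, so the following is only a sketch: it assumes the search page accepts an offset query parameter, here called items_offset, and that each page holds 20 results. Both the helper name page_urls and that parameter name are assumptions; check them against the "next page" links in your browser before relying on this. As a side effect, the query string is properly URL-encoded (the original URL contains spaces).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_urls(base_url, pages, page_size=20, offset_param="items_offset"):
    """Yield one search URL per result page.

    offset_param is an ASSUMPTION about the site's pagination parameter,
    not something confirmed in this thread.
    """
    parts = urlsplit(base_url)
    # Keep every existing query parameter except a previously set offset.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != offset_param
    ]
    for page in range(pages):
        page_query = query + [(offset_param, str(page * page_size))]
        # urlencode percent-encodes spaces, commas and brackets.
        yield urlunsplit(parts._replace(query=urlencode(page_query)))
```

Each generated URL can then be passed to get(..., headers=headers) exactly as in the snippet above.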

Reply from a uj5u.com user:

Wait for a more specific element, in this case selected with a CSS selector:

wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))

Also avoid using Selenium syntax on the BeautifulSoup object; use CSS selectors in bs4 syntax instead:

listings = page_soup.select('[itemprop="itemListElement"]')
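
To illustrate why the lookup by ID finds nothing while the attribute selector works, here is a minimal sketch on a hypothetical, heavily simplified HTML fragment. It is not Airbnb's real markup; it only reuses the itemprop values mentioned in this thread. By.ID in Selenium, like find(id=...) in BeautifulSoup, matches an id="..." attribute, whereas itemprop is itself an attribute name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; only the itemprop values come from the thread.
SAMPLE_HTML = """
<div itemprop="itemListElement">
  <meta itemprop="name" content="Stay in Kyoto central island">
  <meta itemprop="url" content="www.airbnb.com/rooms/42780789">
</div>
<div itemprop="itemListElement">
  <meta itemprop="name" content="Hotel Sou Kyoto Gion Queen Studio">
  <meta itemprop="url" content="www.airbnb.com/rooms/40236377">
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# No element has id="itemprop", so an ID lookup comes back empty.
by_id = soup.find(id="itemprop")

# An attribute selector matches every listing container.
listings = soup.select('[itemprop="itemListElement"]')

print(by_id, len(listings))
```

With Selenium the empty ID lookup never succeeds, so the wait runs until its timeout and raises TimeoutException.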

Example

...
def scrape_page(page_url):
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    
    return soup

def extract_listing(page_url):
    
    page_soup = scrape_page(page_url)
    listings = page_soup.select('[itemprop="itemListElement"]')
    return listings

page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths[]=one_week&refinement_paths[]=/homes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto Prefecture, Japan&date_picker_type=flexible_dates&search_type=unknown"
items = extract_listing(page_url)

#process items to get all information you need, just an example
[{'name':i.select_one('[itemprop="name"]')['content'],
 'url':i.select_one('[itemprop="url"]')['content']} 
for i in items]

Output

[{'name': '?Kyoto?/Nähe Bahnhof & Bus/Tempel/Einzelzimmer(^^?',
  'url': 'www.airbnb.de/rooms/50293998?adults=1&children=0&infants=0&check_in=2022-06-22&check_out=2022-06-29&previous_page_section_name=1000'},
 {'name': '100 Jahre altes Machiya-Gästehaus (1Pax)',
  'url': 'www.airbnb.de/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-08-22&check_out=2022-08-29&previous_page_section_name=1000'},
 {'name': '27, Deluxe Designer Zweibett- / Dreibettzimmer in Shijo (1-3 Personen  / Nichtraucher)',
  'url': 'www.airbnb.de/rooms/41413491?adults=1&children=0&infants=0&check_in=2023-05-16&check_out=2023-05-23&previous_page_section_name=1000'},
 {'name': 'Aufenthalt auf der zentralen Insel Kyoto',
  'url': 'www.airbnb.de/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-24&check_out=2022-07-01&previous_page_section_name=1000'},
 {'name': 'Sweet 202 Privatzimmer ??',
  'url': 'www.airbnb.de/rooms/30217767?adults=1&children=0&infants=0&check_in=2022-07-18&check_out=2022-07-25&previous_page_section_name=1000'},
 {'name': 'Kyoto Sanjo Ohashi Superior Zweibettzimmer Studio Nichtraucher Superior Zweibettzimmer',
  'url': 'www.airbnb.de/rooms/45207535?adults=1&children=0&infants=0&check_in=2022-09-27&check_out=2022-10-04&previous_page_section_name=1000'},
 {'name': 'Toller Blick auf den Fluss, schönes traditionelles Haus',
  'url': 'www.airbnb.de/rooms/25762078?adults=1&children=0&infants=0&check_in=2022-12-07&check_out=2022-12-14&previous_page_section_name=1000'},
 {'name': 'Doppelzimmer - Waschmaschine in allen Zimmern ☆ Guest House 10-Minuten zu Fuß von Kyoto Station -',
  'url': 'www.airbnb.de/rooms/51433076?adults=1&children=0&infants=0&check_in=2022-06-13&check_out=2022-06-20&previous_page_section_name=1000'},
 {'name': 'In der Nähe des Bahnhofs Kyoto Gemütliches Zimmer in einem traditionellen Haus',
  'url': 'www.airbnb.de/rooms/25600163?adults=1&children=0&infants=0&check_in=2022-09-12&check_out=2022-09-19&previous_page_section_name=1000'},
 {'name': 'Gemütliche und ruhige zweistöckige japanische Wohnung',
  'url': 'www.airbnb.de/rooms/38743436?adults=1&children=0&infants=0&check_in=2023-03-11&check_out=2023-03-18&previous_page_section_name=1000'},
 {'name': '51★Günstigste★5 Minuten zu Fuß Shin-Osaka Sta.★Max 1 Gäste',
  'url': 'www.airbnb.de/rooms/14539052?adults=1&children=0&infants=0&check_in=2022-07-03&check_out=2022-07-10&previous_page_section_name=1000'},
 {'name': '和楽庵【Doppel】100 Jahre altes Machiya Gästehaus (2pax)',
  'url': 'www.airbnb.de/rooms/22867502?adults=1&children=0&infants=0&check_in=2022-08-26&check_out=2022-09-02&previous_page_section_name=1000'},
 {'name': 'Expo Hostel Nishi #1 /1000yen Fahrrad für deinen Aufenthalt',
  'url': 'www.airbnb.de/rooms/8295322?adults=1&children=0&infants=0&check_in=2022-08-27&check_out=2022-09-03&previous_page_section_name=1000'},
 {'name': '★Lovely RiverSide House in★der Nähe von Einkaufsviertel★3 Betten',
  'url': 'www.airbnb.de/rooms/40117962?adults=1&children=0&infants=0&check_in=2022-07-07&check_out=2022-07-14&previous_page_section_name=1000'},
 {'name': 'ZIMMER - Bereich Central Kyoto Gion',
  'url': 'www.airbnb.de/rooms/15215980?adults=1&children=0&infants=0&check_in=2022-06-14&check_out=2022-06-21&previous_page_section_name=1000'},
 {'name': 'Raum, um das Kyoto zu genießen.',
  'url': 'www.airbnb.de/rooms/9263813?adults=1&children=0&infants=0&check_in=2022-09-08&check_out=2022-09-15&previous_page_section_name=1000'},
 {'name': 'Stilvolles modernes Kyo-Machiya 500 金閣寺 m vom Trockner entfernt',
  'url': 'www.airbnb.de/rooms/20041502?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'},
 {'name': 'Hotel Sou Kyoto Gion Queen Studio',
  'url': 'www.airbnb.de/rooms/40236377?adults=1&children=0&infants=0&check_in=2022-06-22&check_out=2022-06-29&previous_page_section_name=1000'},
 {'name': 'Workation GroLiving in  KYOTO',
  'url': 'www.airbnb.de/rooms/612511811801466646?adults=1&children=0&infants=0&check_in=2022-07-19&check_out=2022-07-26&previous_page_section_name=1000'},
 {'name': '【home quarantin ok】shibainuatiniya/Kyoto Sta/Toji',
  'url': 'www.airbnb.de/rooms/34028813?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'}]
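
The url values in the output above have no scheme and carry query parameters specific to this search (dates, guest counts). If you only need a clean link and the listing ID, something along these lines may help. This is a sketch: the helper name normalize_listing_url is made up for illustration, and it assumes every path has the form /rooms/<id> as in the output shown.

```python
from urllib.parse import urlsplit


def normalize_listing_url(raw_url):
    """Return (listing_id, clean_url) for a scraped room URL.

    Assumes the path looks like /rooms/<id>, as in the output above.
    """
    # The scraped values start with the host, so add a scheme before parsing.
    if "://" not in raw_url:
        raw_url = "https://" + raw_url
    parts = urlsplit(raw_url)
    # The last path segment is the listing ID.
    listing_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
    # Drop the query string; keep scheme, host and path.
    clean_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return listing_id, clean_url
```

For example, it could be applied to each 'url' value in the list comprehension above.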

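Reply from a uj5u.com user:

If you want the full content of the page, I think you should look for something that can run the site's JavaScript, something like a stripped-down version of the Chrome engine.

I don't know whether it can do the job, but something like Qt WebEngine might.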
