Unable to find the correct href using Selenium/BS4

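I'm trying to use the code below to make my financial data collection easier, but it seems to have a few problems. I want to scrape the following page for one specific href: 'https://www.witan.com/investor-information/factsheets/#currentPage=1'.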

The href I'm trying to extract is: href="/media/1767/witan-investment-trust_factsheet_310821.pdf"

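At the moment I'm doing this with Selenium, but it is a bit slow, so if it is possible to scrape this with BS4 I'm open to suggestions. My attempts so far have all failed.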

import re

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set the Selenium options
options = Options()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--window-size=1920,1200")

# Request the website using Selenium & ChromeDriver
driver = webdriver.Chrome('C:/AnaConda/chromedriver.exe', options=options)
driver.get('https://www.witan.com/investor-information/factsheets/#currentPage=1')  # request the page
html = driver.page_source

soup = BeautifulSoup(html, "html.parser")
link_finder = soup.findAll('a', href=re.compile('/witan-investment-trust-factsheet'))[0]

When I run the code above, I get: <a class="ico-arrow document-view size" href="/media/1750/witan-investment-trust-factsheet-30jun2021.pdf" target="_blank" ...

Hopefully someone can help!

Reply from a uj5u.com user:

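The HTML fragment containing the PDF links is loaded asynchronously via JavaScript, so BeautifulSoup will not see the links in the initial page returned by a plain HTTP request. To print all the PDF links, you can request that fragment directly: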

import requests
from bs4 import BeautifulSoup

api_url = "https://www.witan.com/umbraco/surface/listing/DocumentListing"

params = {
    "currentPage": "1",
    "year": "2021",
    "isArchive": "false",
    # Key name back-translated from "分頁"; the exact spelling is an assumption.
    "paging": "true",
}

with requests.session() as s:
    # load cookies:
    s.get("https://www.witan.com/investor-information/factsheets/")
    # get the document page:
    soup = BeautifulSoup(s.get(api_url, params=params).content, "html.parser")
    for a in soup.select(".document-view"):
        print("https://www.witan.com" + a["href"])

Prints:

https://www.witan.com/media/1767/witan-investment-trust_factsheet_310821.pdf
https://www.witan.com/media/1763/witan-investment-trust_factsheet_310721.pdf
https://www.witan.com/media/1750/witan-investment-trust-factsheet-30jun2021.pdf
https://www.witan.com/media/1730/witan-investment-trust_factsheet_310521.pdf
https://www.witan.com/media/1718/witan-factsheet-30apr2021.pdf
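
The file names in this list do not follow a single pattern: some use `_factsheet_` and others `-factsheet-`. That is probably why the pattern in the question, which requires a hyphen, only matched the June file. The sketch below is illustrative and not part of the original answer. It works on the relative hrefs copied from the output above, and the `tolerant` pattern and the `matching_links` helper are hypothetical names based only on the five file names shown here.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.witan.com"

# Relative hrefs, copied from the output printed above.
hrefs = [
    "/media/1767/witan-investment-trust_factsheet_310821.pdf",
    "/media/1763/witan-investment-trust_factsheet_310721.pdf",
    "/media/1750/witan-investment-trust-factsheet-30jun2021.pdf",
    "/media/1730/witan-investment-trust_factsheet_310521.pdf",
    "/media/1718/witan-factsheet-30apr2021.pdf",
]

# Pattern from the question: requires "-factsheet" with a hyphen.
strict = re.compile(r"/witan-investment-trust-factsheet")

# Assumed, more tolerant pattern: "factsheet" followed by "-" or "_".
tolerant = re.compile(r"factsheet[-_]")


def matching_links(pattern, hrefs, base_url=BASE_URL):
    """Return absolute URLs for every href that the pattern matches."""
    return [urljoin(base_url, h) for h in hrefs if pattern.search(h)]


strict_links = matching_links(strict, hrefs)
tolerant_links = matching_links(tolerant, hrefs)

print(strict_links)       # only the 30jun2021 file matches
print(tolerant_links[0])  # first entry in the list, the 310821 file
```

With the tolerant pattern, taking the first match relies on the listing being returned newest first, as it is in the output above. If that ordering is not guaranteed, the date would have to be parsed from each file name instead.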
