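I am trying to use the code below to ease my financial data collection. However, it seems to have a few problems. I want to scrape the following page for a specific href: 'https://www.witan.com/investor-information/factsheets/#currentPage=1'.
The href I am trying to extract is href="/media/1767/witan-investment-trust_factsheet_310821.pdf".
I am currently doing this with Selenium, but it is a bit slow, so if it is possible to scrape the page with BS4 I am open to suggestions. My attempts so far have all failed.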
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set up the Selenium options
options = Options()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--window-size=1920,1200")
# Request the website using Selenium & ChromeDriver
driver = webdriver.Chrome('C:/AnaConda/chromedriver.exe', options=options)
driver.get('https://www.witan.com/investor-information/factsheets/#currentPage=1')  # request the website
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
# Take the first <a> whose href matches the pattern
link_finder = soup.find_all('a', href=re.compile('/witan-investment-trust-factsheet'))[0]
When I run the code above, I get: <a class="ico-arrow document-view size" href="/media/1750/witan-investment-trust-factsheet-30jun2021.pdf" target="_blank" ...
Hopefully someone can help!
Answer from a uj5u.com user:
The HTML containing the PDF links is loaded asynchronously via JavaScript, so BeautifulSoup will not see them in the initial page. To print all the PDF links, you can do this:
import requests
from bs4 import BeautifulSoup

api_url = "https://www.witan.com/umbraco/surface/listing/DocumentListing"
params = {
    "currentPage": "1",
    "year": "2021",
    "isArchive": "false",
    # Key name reconstructed from a garbled source line; verify it against
    # the request shown in the browser's network tab.
    "paging": "true",
}

with requests.Session() as s:
    # load cookies:
    s.get("https://www.witan.com/investor-information/factsheets/")
    # get document page:
    soup = BeautifulSoup(s.get(api_url, params=params).content, "html.parser")

# every document link carries the "document-view" class
for a in soup.select(".document-view"):
    print("https://www.witan.com" + a["href"])
Prints:
https://www.witan.com/media/1767/witan-investment-trust_factsheet_310821.pdf
https://www.witan.com/media/1763/witan-investment-trust_factsheet_310721.pdf
https://www.witan.com/media/1750/witan-investment-trust-factsheet-30jun2021.pdf
https://www.witan.com/media/1730/witan-investment-trust_factsheet_310521.pdf
https://www.witan.com/media/1718/witan-factsheet-30apr2021.pdf
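
If only one factsheet is needed, for example the most recent one, note that the filenames above are not uniform: some use underscores and a six-digit date, others use hyphens and a date like 30jun2021. That is why a pattern such as '/witan-investment-trust-factsheet' only matches the June file. Below is a minimal sketch of picking the latest link from a list of hrefs. The helper names (factsheet_date, latest_factsheet), the regex, and the reading of six-digit stamps as ddmmyy are assumptions based only on the five filenames listed, not something the site documents.

```python
import re
from datetime import date
from urllib.parse import urljoin

BASE_URL = "https://www.witan.com"

# Assumed pattern: "factsheet", then "-" or "_", then a date stamp such as
# 310821 (read as ddmmyy) or 30jun2021 (ddMonyyyy), then ".pdf".
FACTSHEET_RE = re.compile(
    r"factsheet[-_](\d{6}|\d{2}[a-z]{3}\d{4})\.pdf$", re.IGNORECASE
)
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def factsheet_date(href):
    """Return the date encoded in a factsheet href, or None if none is found."""
    match = FACTSHEET_RE.search(href)
    if match is None:
        return None
    stamp = match.group(1).lower()
    try:
        if stamp.isdigit():
            # Six digits: assumed to be day, month, two-digit year.
            return date(2000 + int(stamp[4:6]), int(stamp[2:4]), int(stamp[0:2]))
        month = MONTHS.get(stamp[2:5])
        if month is None:
            return None
        return date(int(stamp[5:9]), month, int(stamp[0:2]))
    except ValueError:
        # Not a real calendar date (e.g. day 32).
        return None


def latest_factsheet(hrefs):
    """Return the absolute URL of the most recently dated factsheet, or None."""
    dated = [(factsheet_date(h), h) for h in hrefs]
    dated = [(d, h) for d, h in dated if d is not None]
    if not dated:
        return None
    return urljoin(BASE_URL, max(dated)[1])


hrefs = [
    "/media/1767/witan-investment-trust_factsheet_310821.pdf",
    "/media/1763/witan-investment-trust_factsheet_310721.pdf",
    "/media/1750/witan-investment-trust-factsheet-30jun2021.pdf",
    "/media/1730/witan-investment-trust_factsheet_310521.pdf",
    "/media/1718/witan-factsheet-30apr2021.pdf",
]
print(latest_factsheet(hrefs))
# https://www.witan.com/media/1767/witan-investment-trust_factsheet_310821.pdf
```

In practice the hrefs list would come from the a["href"] values collected in the loop above rather than being hard-coded.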
