我想抓取表中所有專案的 URL,但是當我嘗試時,什么也沒有出現。該代碼非常基本,因此我可以看到為什么它可能不起作用。然而,即使試圖刮掉這個網站的標題,也沒有任何結果。我至少期待 h1 標簽,因為它在桌子外面......
網站:https ://www.vanguard.com.au/personal/products/en/overview
import requests
from bs4 import BeautifulSoup
lists =[]
url = 'https://www.vanguard.com.au/personal/products/en/overview'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
title = soup.find_all('h1', class_='heading2 gbs-font-vanguard-red')
for links in soup.find_all('a', style='padding-bottom: 1px;'):
link_text = links['href']
lists.append(link_text)
print(title)
print(lists)
uj5u.com熱心網友回復:
如果問題是由 JavaScript 事件監聽器引起的,我建議你使用beautifulsoupwithselenium來抓取這個網站。所以,讓我們在發送請求時應用 selenium 并獲取頁面源,然后使用beautifulsoup它來決議它。
此外,您應該使用 title = soup.find()而不是title = soup.findall()為了只獲得一個標題。
使用 Firefox 的代碼示例:
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup
url = 'https://www.vanguard.com.au/personal/products/en/overview'
browser = webdriver.Firefox(executable_path=GeckoDriverManager().install())
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()
lists =[]
title = soup.find('h1', class_='heading2 gbs-font-vanguard-red')
for links in soup.find_all('a', style='padding-bottom: 1px;'):
link_text = links['href']
lists.append(link_text)
print(title)
print(lists)
輸出:
<h1 class="heading2 gbs-font-vanguard-red">Investment products</h1>
['/personal/products/en/detail/8132', '/personal/products/en/detail/8219', '/personal/products/en/detail/8121',...,'/personal/products/en/detail/8217']
uj5u.com熱心網友回復:
最常見的問題(許多現代頁面):此頁面用于JavaScript添加元素但requests/BeautifulSoup無法運行JavaScript。
您可能需要使用Selenium來控制可以運行的真實網路瀏覽器JavaScript。
此示例僅使用 Selenium,不使用BeautifulSoup
我使用xpath,但您也可以使用css selector.
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.vanguard.com.au/personal/products/en/overview'
lists = []
#driver = webdriver.Chrome(executable_path="/path/to/chromedrive.exe")
driver = webdriver.Firefox(executable_path="/path/to/geckodrive.exe")
driver.get(url)
title = driver.find_element(By.XPATH, '//h1[@]')
print(title.text)
all_items = driver.find_elements(By.XPATH, '//a[@style="padding-bottom: 1px;"]')
for links in all_items:
link_text = links.get_attribute('href')
print(link_text)
lists.append(link_text)
- ChromeDriver(適用于 Chrome)
- GeckoDriver(用于 Firefox)
uj5u.com熱心網友回復:
從源獲取資料總是比通過 Selenium 更有效。看起來鏈接是通過 portId 創建的。
import pandas as pd
import requests
url = 'https://www3.vanguard.com.au/personal/products/funds.json'
payload = {
'context': '/personal/products/',
'countryCode': 'au.ret',
'paths': "[[['funds','legacyFunds'],'AU']]",
'method': 'get'}
jsonData = requests.get(url, params=payload).json()
results = jsonData['jsonGraph']['funds']['AU']['value']
df1 = pd.json_normalize(results, record_path=['children'])
df2 = pd.json_normalize(results, record_path=['listings'])
df = pd.concat([df1, df2], axis=0)
df['url_link'] = 'https://www.vanguard.com.au/personal/products/en/detail/' df['portId'] '/Overview'
輸出:
print(df[['fundName', 'url_link']])
fundName url_link
0 Vanguard Active Emerging Market Equity Fund https://www.vanguard.com.au/personal/products/...
1 Vanguard Active Global Credit Bond Fund https://www.vanguard.com.au/personal/products/...
2 Vanguard Active Global Growth Fund https://www.vanguard.com.au/personal/products/...
3 Vanguard Australian Corporate Fixed Interest I... https://www.vanguard.com.au/personal/products/...
4 Vanguard Australian Fixed Interest Index Fund https://www.vanguard.com.au/personal/products/...
.. ... ...
23 Vanguard MSCI Australian Small Companies Index... https://www.vanguard.com.au/personal/products/...
24 Vanguard MSCI Index International Shares (Hedg... https://www.vanguard.com.au/personal/products/...
25 Vanguard MSCI Index International Shares ETF https://www.vanguard.com.au/personal/products/...
26 Vanguard MSCI International Small Companies In... https://www.vanguard.com.au/personal/products/...
27 Vanguard International Credit Securities Hedge... https://www.vanguard.com.au/personal/products/...
[66 rows x 2 columns]
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/442007.html
