Extracting carbon offset projects from a website with Beautiful Soup returns nothing

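I am trying to extract data from this website (https://alliedoffsets.com/#/profile/2). It has many projects like this one, and I want to get the values of Estimated Average Wholesale Price and Estimated Annual Emission Reduction. When I try to print the tags using Beautiful Soup, it does not find them and gives an empty result. I know this may be something basic, but I am stuck. The data is probably populated on the page with JavaScript, but I cannot find a way to get at it.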

import pandas as pd
import requests
from bs4 import BeautifulSoup

url='https://alliedoffsets.com/#/profile/1'
r=requests.get(url)
url=r.content
soup = BeautifulSoup(url,'html.parser')

tab=soup.find("thead",{"class":"sr-only"})
print(tab)

A uj5u.com user replied:

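The page is rendered with JavaScript, so BeautifulSoup cannot extract the HTML elements directly from the raw response. Selenium can be used to obtain the rendered HTML and then search for elements by ID, class, XPath, and so on.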

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import re

url = 'https://alliedoffsets.com/#/profile/1'

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)

# web driver goes to page
driver.get(url)

# use WebDriverWait to wait until page is rendered

# find Estimated Average Wholesale Price
elt = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'direct-price-panel'))
    )
# extract just the price from the text
print(re.sub(r'.*(\$\S+).*', r'\1', elt.text))

# find Estimated Annual Emission Reduction
elt = driver.find_element(By.XPATH, "//*[strong[contains(., 'Estimated Annual Emission Reduction')]]")
print(elt.text.split(":")[1])

Output:

 $5.06
 11603 tCO2
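
As a side note on why the question's `requests.get` call never sees these values: everything after `#` in a URL is a fragment, which an HTTP client does not send to the server. `requests` therefore only fetches the site's base page, and the `/profile/1` route is left for the JavaScript running in a browser to resolve. A minimal sketch using only the standard library (the variable names are illustrative):

```python
from urllib.parse import urlsplit

url = "https://alliedoffsets.com/#/profile/1"
parts = urlsplit(url)

# The part that is actually requested from the server
requested = f"{parts.scheme}://{parts.netloc}{parts.path}"
# The part that stays on the client for in-page routing
client_side_route = parts.fragment

print(requested)
print(client_side_route)
```

This is why a browser-driving tool such as Selenium, or a direct call to the underlying data API, is needed here.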

A uj5u.com user replied:

The data you see is loaded from an external URL via JavaScript. To load the data using the requests/json modules, you can use the following example:

import json
import requests

url = "https://carbon-registry.herokuapp.com/1.0/provider/1"
params = {
    "embedded": '{"provider_capital_types":1,"provider_capital_types.capital_type":1,"provider_countries":1,"provider_countries.country":1,"contacts":1,"contacts.office":1,"provider_currencies":1,"provider_currencies.currency":1,"provider_languages":1,"provider_languages.language":1,"offices":1,"offices.country":1,"provider_sectors":1,"provider_sectors.sector":1,"provider_social_medias":1,"provider_social_medias.social_media":1,"provider_provider_types":1,"provider_provider_types.provider_type":1,"provider_stats":1,"provider_stats.stat":1,"provider_descriptions":1,"provider_descriptions.description":1,"relationships":1,"relationships.description":1,"provider_statuses":1,"provider_statuses.status":1}'
}
headers = {"Authorization": "Bearer 8hCH4MuPCa5t6ra8wtAz8xOQfJdjLvDVZk07ib60TZ"}

data = requests.get(url, headers=headers, params=params).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

stats = {s["stat"]["name"]: s for s in data["provider_stats"]}

print(f"{stats['Estimated Direct Price']['value']=}")
print(f"{stats['Estimated Annual Emission Reduction']['value']=}")

Prints:

stats['Estimated Direct Price']['value']=5.0630778182036105
stats['Estimated Annual Emission Reduction']['value']=11603
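
The long `embedded` string in this answer is compact JSON mapping each relation name to 1, so it can be generated rather than typed by hand, and the stats lookup can be wrapped in a function for reuse across profiles. The sketch below runs offline: `build_embedded`, `stats_by_name`, and the `sample` payload are hypothetical, and the payload only mirrors the shape this answer's code reads (`provider_stats`, then `stat`, `name`, and `value`), not the full API response.

```python
import json

def build_embedded(relations):
    """Serialize relation names into the compact JSON form used by the `embedded` parameter."""
    return json.dumps({name: 1 for name in relations}, separators=(",", ":"))

def stats_by_name(data):
    """Map each stat's name to its value."""
    return {s["stat"]["name"]: s["value"] for s in data["provider_stats"]}

# Only the two relations the stats lookup reads (an assumption: the answer
# requests many more, and the API may or may not need them all)
embedded = build_embedded(["provider_stats", "provider_stats.stat"])
print(embedded)

# Hand-written sample in the shape the answer's code relies on
sample = {
    "provider_stats": [
        {"stat": {"name": "Estimated Direct Price"}, "value": 5.0630778182036105},
        {"stat": {"name": "Estimated Annual Emission Reduction"}, "value": 11603},
    ]
}
stats = stats_by_name(sample)
print(stats["Estimated Direct Price"], stats["Estimated Annual Emission Reduction"])
```

With a real response, `stats_by_name(data)` could be called once per profile and the resulting dicts collected into a list for pandas.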

A uj5u.com user replied:

The website is dynamic, so you can follow the next example, which uses selenium and bs4, to get the correct data.

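from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

url = 'https://alliedoffsets.com/#/profile/1'

# Selenium 4 expects the driver path to be wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))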
                            
driver.get(url)
driver.maximize_window()
time.sleep(5)

soup = BeautifulSoup(driver.page_source,'lxml')
driver.close()

Price = soup.select_one('p#direct-price-panel').contents[1].strip().replace('/tCO2e','')
Reduction= soup.select('.panel')[-1].contents[1].strip().replace('tCO2','')
print('Estimated Average Wholesale Price: ' + str(Price))
print('Estimated Annual Emission Reduction: ' + str(Reduction))

Output:

Estimated Average Wholesale Price: $5.06
Estimated Annual Emission Reduction: 11603
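
The scraped values are still strings such as `$5.06` and `11603`. If they are headed for a pandas DataFrame, as the question's imports suggest, it may help to convert them to numbers first. A minimal sketch, where `parse_amount` is a hypothetical helper and the input strings are modelled on the text shown on this page:

```python
import re

def parse_amount(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = re.search(r"\d+(?:,\d{3})*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return float(match.group().replace(",", ""))

price = parse_amount("$5.06/tCO2e")
reduction = parse_amount(" 11603 tCO2")
print(price, reduction)
```

Because the helper takes the first number it finds, the digit inside a unit such as `tCO2` is ignored as long as the amount comes first.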
