I am scraping astronauts' countries from this website: https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch order . I am using BeautifulSoup for this task, but I have run into some problems. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch order'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
tags = soup.find_all('div', class_ ='astronaut_index__content container--xl mxa f fr fw aifs pl15 pr15 pt0')
for item in tags:
    name = item.select_one('bau astronaut_cell__title bold mr05')
    country = item.select_one('mouseover__contents rel py05 px075 bau caps small ac').get_text(strip = True)
    data.append([name,country])
df = pd.DataFrame(data)
df
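df comes back empty. I don't know what is going on. When I take the code out of the for loop, it seems it cannot find the select_one function. The function should come from bs4, so I'm not sure why it isn't working. Also, is there a repeatable pattern for web scraping that I'm missing? Every time I try to solve one of these problems, it seems to be a different beast.
Any help would be greatly appreciated! Thank you!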
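A uj5u.com user replied:
The data at that URL is generated dynamically by JavaScript, and BeautifulSoup on its own cannot scrape dynamically rendered data. You can pair BeautifulSoup with a browser automation tool such as Selenium, which is what I do here. Please run the code.
Script: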
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch order'
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
time.sleep(5)
driver.get(url)
time.sleep(5)  # give the JavaScript time to render the list
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()  # the rendered HTML is captured, so the browser can close
tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    #print(name)
    # not every cell has a country element, so check before reading its text
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    if country:
        country=country.get_text()
    #print(country)
    data.append([name, country])
cols=['name','country']
df = pd.DataFrame(data,columns=cols)
print(df)
Output:
                 name                   country
0       Bess, Cameron  United States of America
1          Bess, Lane  United States of America
2          Dick, Evan  United States of America
3       Taylor, Dylan  United States of America
4    Strahan, Michael  United States of America
..                ...                       ...
295     Jones, Thomas  United States of America
296      Sega, Ronald  United States of America
297     Usachov, Yury                    Russia
298   Fettman, Martin  United States of America
299       Wolf, David  United States of America
[300 rows x 2 columns]
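A side note on the selectors, since the question also asked why select_one seemed to find nothing. The script above writes every class name with a leading dot, while the question's code passed the bare class names. Here is a minimal sketch of the difference. The HTML is a made-up fragment that only mimics the class names from the page, so treat it as an illustration and not as the site's real markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that only mimics the class names used in the script above.
html = """
<div class="astronaut_cell x">
  <div class="bau astronaut_cell__title bold mr05">Usachov, Yury</div>
  <div class="mouseover__contents rel py05 px075 bau caps small ac">Russia</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.select_one(".astronaut_cell.x")

# Bare words separated by spaces are read as nested tag names
# (<bau> containing <astronaut_cell__title> ...), so nothing matches.
missing = cell.select_one("bau astronaut_cell__title bold mr05")

# A leading dot marks a class; chaining dots requires all classes on one element.
name = cell.select_one(".bau.astronaut_cell__title.bold.mr05")
country = cell.select_one(".mouseover__contents.rel.py05.px075.bau.caps.small.ac")

print(missing)                       # None
print(name.get_text(strip=True))     # Usachov, Yury
print(country.get_text(strip=True))  # Russia
```

When a selector matches nothing, select_one returns None, and calling .get_text() on that None raises an AttributeError. That may be the error that looked like a missing function in the question, though I can't be sure without the traceback.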
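A uj5u.com user replied:
The page is loaded dynamically using JavaScript, so requests cannot access the content directly. The data is loaded from another address and arrives in JSON format. You can get it like this: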
import json
import requests
url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
req = requests.get(url)
data = json.loads(req.text)
Once it is loaded, you can iterate over it and pull out the relevant information. For example:
for astro in data['astronauts']:
    print(astro['astroNumber'],astro['firstName'],astro['lastName'],astro['rank'])
Output:
1 Yuri Gagarin Colonel
10 Walter Schirra Captain
100 Georgi Ivanov Major General
101 Leonid Popov Major General
102 Bertalan Farkas Brigadier General
and so on.
You can then load the output into a pandas dataframe or anything else.
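As a rough sketch of that last step, the records can first be reduced to plain rows. The sample below is hypothetical: it copies only the four keys printed above, and I am assuming astroNumber is stored as a string, which the real file may not do. The helper name to_rows is my own and not part of any library:

```python
import json

# Hypothetical sample that mirrors only the keys printed above; the real
# payload may hold more fields and different value types.
sample = json.loads("""
{"astronauts": [
  {"astroNumber": "1", "firstName": "Yuri", "lastName": "Gagarin", "rank": "Colonel"},
  {"astroNumber": "10", "firstName": "Walter", "lastName": "Schirra", "rank": "Captain"},
  {"astroNumber": "100", "firstName": "Georgi", "lastName": "Ivanov", "rank": "Major General"}
]}
""")

FIELDS = ("astroNumber", "firstName", "lastName", "rank")

def to_rows(payload, fields=FIELDS):
    """Return one dict per astronaut, keeping only the requested fields."""
    # .get() leaves a missing field as None instead of raising KeyError.
    return [{f: astro.get(f) for f in fields}
            for astro in payload.get("astronauts", [])]

rows = to_rows(sample)
for row in rows:
    print(row["astroNumber"], row["firstName"], row["lastName"], row["rank"])
```

Passing rows to pd.DataFrame(rows) would then give one column per field. With the real data you would call to_rows(data) on the dictionary loaded earlier.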
