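I am trying to scrape the following web page (the URL used in the code below):
However, I can't seem to do it, and I get some kind of error when using Beautiful Soup. Here is my code so far: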
import requests
from bs4 import BeautifulSoup

page = requests.get("https://opensea.io/rankings?sortBy=total_volume").content
soup = BeautifulSoup(page, 'html.parser')
values = soup.find_all('a')
I'm not sure why it fails, but what I want is every href value that contains the string "/collection/".
Any help would be much appreciated.
A uj5u.com user replied:
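I expect you are getting a 403 Forbidden error. The site probably has some very good blocking in place: I quickly tried ScrapingAnt and then tried replicating my browser's entire request, and both were blocked.
If you are willing to try selenium, I have this function for handling cases like this; you can paste it into your code and then call it: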
# linkToSoup_selenium is the helper function mentioned above (not shown here)
soup = linkToSoup_selenium('https://opensea.io/rankings?sortBy=total_volume')
values = soup.find_all(
    lambda l: l.name == 'a' and
    l.get('href') is not None and
    '/collection/' in l.get('href')
)
# OR
# values = [a for a in soup.select('a') if a.get('href') and '/collection/' in a.get('href')]
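To see what this kind of filter returns, here is a minimal sketch that runs it against a small hand-written HTML snippet instead of the live page. The sample markup and the function name `collection_links` are made up for illustration; the real OpenSea markup may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the rendered page.
SAMPLE_HTML = """
<div>
  <a href="/collection/alpha">Alpha</a>
  <a href="/rankings">Rankings</a>
  <a>No href here</a>
  <a href="/collection/beta">Beta</a>
</div>
"""

def collection_links(html):
    """Return the href of every <a> whose href contains '/collection/'."""
    soup = BeautifulSoup(html, 'html.parser')
    # href=True skips anchors that have no href attribute at all
    anchors = soup.find_all('a', href=True)
    return [a['href'] for a in anchors if '/collection/' in a['href']]

print(collection_links(SAMPLE_HTML))
```

Passing `href=True` to `find_all` does the same job as the `is not None` check in the lambda, so the list comprehension only has to test for the substring.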
Edit: to get all of the collections:
import time
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# Selenium 3 style driver path; newer Selenium 4 releases drop this
# argument, in which case call webdriver.Chrome() instead
driver = webdriver.Chrome('chromedriver.exe')
url = 'https://opensea.io/rankings?sortBy=total_volume'
driver.get(url)
driver.maximize_window()

scrollCt = 0
rows = []
maxScrols = 1000 # adjust as preferred
while scrollCt < maxScrols:
    time.sleep(0.3) # adjust as necessary - I seem to need >0.25s
    aro = driver.find_elements(
        By.CSS_SELECTOR,
        'a[role="row"][overflow="hidden"][href*="/collection/"]')
    aro = [(
        a.get_attribute('href'),
        a.find_element(By.XPATH, '../..') # parent's parent
    ) for a in aro]
    # keep only the rows whose link has not been collected yet
    aroFil = [a for a in aro if a[0] not in [h['link'] for h in rows]]
    aroFil = [{
        'outerHTML': [a[1].get_attribute('outerHTML')], # for BeautifulSoup
        'innerText': a[1].get_attribute('innerText').strip(),
        'link': a[0]
    } for a in aroFil]
    rows += aroFil
    print(f'[{scrollCt}] found {len(aro)}, filtered down to {len(aroFil)} rows [at {len(rows)} total]')
    if aroFil == []:
        # break # if you don't want to go to next page
        try:
            driver.find_element(By.XPATH, '//i[@value="arrow_forward_ios"]/..').click()
        except Exception as e:
            print('failed to go to next page', str(e))
            break
    scrollCt += 1
    driver.find_element(By.CSS_SELECTOR, 'body').send_keys(Keys.PAGE_DOWN)

# if you want to extract more with BeautifulSoup (needs the html5lib package)
for i, r in enumerate(rows):
    rSoup = BeautifulSoup(r['outerHTML'][0].encode('utf-8'), 'html5lib')
    rows[i]['bs4Text'] = rSoup.get_text(strip=True)
    ###### EXTRACT INFO AS NEEDED ######
    del rows[i]['outerHTML'] # if you don't need it anymore

# with open('x.json', 'w') as f: json.dump(rows, f, indent=4) # save json file
driver.quit()
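The loop above only keeps rows whose link has not been seen before, and treats an empty batch as the signal to move to the next page. That bookkeeping can be sketched on its own, without a browser. The helper name `merge_new_rows` and the sample links are hypothetical, and this version tracks seen links in a set rather than rebuilding a list on every pass, which is a swapped-in variation and not what the answer's code does.

```python
def merge_new_rows(rows, batch):
    """Append to rows every item of batch whose 'link' is not in rows yet.

    Returns the list of items that were actually added, so an empty
    result means the batch contained nothing new.
    """
    seen = {r['link'] for r in rows}
    added = []
    for row in batch:
        if row['link'] not in seen:
            seen.add(row['link'])
            rows.append(row)
            added.append(row)
    return added

# Simulated scroll passes with made-up links
rows = []
first = merge_new_rows(rows, [{'link': '/collection/a'}, {'link': '/collection/b'}])
second = merge_new_rows(rows, [{'link': '/collection/b'}, {'link': '/collection/c'}])
third = merge_new_rows(rows, [{'link': '/collection/c'}])

print(len(first), len(second), len(third), len(rows))
```

In the scraping loop, a pass like `third` that adds nothing is the point where the code tries to click through to the next page.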
