Scraping data and collecting all href values

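I am trying to scrape the OpenSea rankings page (the URL requested in the code below).

However, I can't get it to work, and I run into some kind of error when using Beautiful Soup. This is my code so far: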

import requests
from bs4 import BeautifulSoup

page = requests.get("https://opensea.io/rankings?sortBy=total_volume").content
soup = BeautifulSoup(page, 'html.parser')
values = soup.find_all('a')

I'm not sure why it fails, but what I want is every href value that contains "/collection/".

Any help would be much appreciated.

A uj5u.com user replied:

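I expect you are getting a 403 Forbidden error. The site seems to have some very effective blocking in place: I quickly tried ScrapingAnt and then tried replicating my browser's entire request, and both were blocked.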
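
To confirm that the failure really is a 403 and not a parsing problem, it may help to check the response status before handing the body to BeautifulSoup. The following is a minimal sketch, assuming a requests-style response object that exposes `status_code` and `content`; `html_or_raise` is a hypothetical helper name and is not part of requests or of the code above.

```python
from types import SimpleNamespace


def html_or_raise(response):
    """Return the response body, or raise if the server refused the request.

    `response` only needs `status_code` and `content` attributes, which is
    what a requests.Response provides. The helper name is hypothetical.
    """
    if response.status_code == 403:
        raise PermissionError("403 Forbidden: the site is blocking this client")
    if response.status_code != 200:
        raise RuntimeError(f"unexpected HTTP status {response.status_code}")
    return response.content


# Stand-in objects so the sketch runs without network access; with requests
# installed you would pass the result of requests.get(url) instead.
blocked = SimpleNamespace(status_code=403, content=b"")
ok = SimpleNamespace(status_code=200, content=b"<a href='/collection/x'>x</a>")

try:
    html_or_raise(blocked)
except PermissionError as err:
    message = str(err)

page = html_or_raise(ok)
```

If the helper raises for the real request, the empty result from `findAll('a')` is explained by the block page, not by the parser.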

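If you're willing to try Selenium, I have a function, linkToSoup_selenium, for handling cases like this. You can paste it into your code and then call it: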

soup = linkToSoup_selenium('https://opensea.io/rankings?sortBy=total_volume')
values = soup.find_all(
            lambda l: l.name == 'a' and
            l.get('href') is not None and
            '/collection/' in l.get('href')
        )
# OR
# values = [a for a in soup.select('a') if a.get('href') and '/collection/' in a.get('href')]
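
Once the matching tags are in `values`, the href strings themselves still need to be pulled out, made absolute and de-duplicated. A minimal sketch of that step, using only the standard library so it runs on its own; `collection_links` is a hypothetical helper, the base URL is taken from the address in the question, and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://opensea.io"  # site root taken from the URL in the question


def collection_links(hrefs, base=BASE_URL):
    """Return absolute, de-duplicated links whose href contains '/collection/'.

    `hrefs` is any iterable of href strings (None entries are skipped), for
    example (a.get('href') for a in values). First-seen order is preserved.
    """
    seen = set()
    links = []
    for href in hrefs:
        if not href or "/collection/" not in href:
            continue
        absolute = urljoin(base, href)  # turns '/collection/x' into a full URL
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# The collection names below are made up for illustration.
sample = [
    "/collection/example-one",
    None,
    "/rankings",
    "/collection/example-one",
    "https://opensea.io/collection/example-two",
]
links = collection_links(sample)
```

With the soup from above, the call would look like `collection_links(a.get('href') for a in values)`.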

Edit: to get all of the collections:

import json
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Selenium 4.6+ locates chromedriver itself; older versions need the driver path
driver = webdriver.Chrome()
url = 'https://opensea.io/rankings?sortBy=total_volume'
driver.get(url)
driver.maximize_window()
scrollCt = 0
rows = []

maxScrolls = 1000  # adjust as preferred
while scrollCt < maxScrolls:
    time.sleep(0.3)  # adjust as necessary - I seem to need >0.25s
    # find_element(s)_by_* helpers were removed in Selenium 4.3; use By locators
    aro = driver.find_elements(
        By.CSS_SELECTOR,
        'a[role="row"][overflow="hidden"][href*="/collection/"]')
    aro = [(
        a.get_attribute('href'),
        a.find_element(By.XPATH, '../..')  # parent's parent
    ) for a in aro]

    # keep only rows whose link has not been collected yet
    aroFil = [a for a in aro if a[0] not in [h['link'] for h in rows]]
    aroFil = [{
        'outerHTML': [a[1].get_attribute('outerHTML')],  # for BeautifulSoup
        'innerText': a[1].get_attribute('innerText').strip(),
        'link': a[0]
    } for a in aroFil]
    rows += aroFil  # accumulate; plain assignment would discard earlier rows

    print(f'[{scrollCt}] found {len(aro)}, filtered down to {len(aroFil)} rows [at {len(rows)} total]')
    if aroFil == []:
        # break # if you don't want to go to next page
        try:
            driver.find_element(By.XPATH, '//i[@value="arrow_forward_ios"]/..').click()
        except Exception as e:
            print('failed to go to next page', str(e))
            break

    scrollCt += 1  # count this scroll so the loop can reach its limit
    driver.find_element(By.CSS_SELECTOR, 'body').send_keys(Keys.PAGE_DOWN)


# if you want to extract more with BeautifulSoup
for i, r in enumerate(rows):
    rSoup = BeautifulSoup(r['outerHTML'][0].encode('utf-8'), 'html5lib')

    rows[i]['bs4Text'] = rSoup.get_text(strip=True)
    ###### EXTRACT INFO AS NEEDED ######

    del rows[i]['outerHTML']  # if you don't need it anymore

# with open('x.json', 'w') as f: json.dump(rows, f, indent=4) # save json file 

driver.quit()
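
The loop above rebuilds the list of known links on every scroll in order to filter out rows it has already seen. A minimal sketch of the same bookkeeping with a set, which keeps each membership check cheap as the list grows; `add_new_rows` and the sample pairs are hypothetical and stand in for the (href, element) pairs collected through Selenium.

```python
def add_new_rows(rows, seen_links, candidates):
    """Append candidates whose link has not been seen; return how many were added.

    `candidates` is an iterable of (link, text) pairs. `rows` and `seen_links`
    are updated in place.
    """
    added = 0
    for link, text in candidates:
        if link in seen_links:
            continue
        seen_links.add(link)
        rows.append({"link": link, "innerText": text.strip()})
        added += 1
    return added


rows, seen_links = [], set()
first = add_new_rows(rows, seen_links, [("/collection/a", " A "), ("/collection/b", "B")])
second = add_new_rows(rows, seen_links, [("/collection/b", "B"), ("/collection/c", "C")])
third = add_new_rows(rows, seen_links, [("/collection/c", "C")])  # nothing new
```

A return value of 0 means the scroll produced no new rows, which corresponds to the `aroFil == []` branch that moves on to the next page.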

Please credit the source when reposting. Link to this article: https://www.uj5u.com/yidong/522191.html

Tags: Python, web scraping, Beautiful Soup
