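I'm trying to write a function that scrapes college baseball team roster pages for a project. I wrote a function that crawls the roster page and collects the list of links I want to scrape. But when I try to scrape each player's individual link, the code runs but can't find the data on their pages.
Here is the link to the page I start crawling from: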
https://gvsulakers.com/sports/baseball/roster
These are just the helper functions that I call in the function I'm having trouble with:
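import requests
from bs4 import BeautifulSoup

# Note: `headers` is defined elsewhere in the asker's script (not shown here).

def parse_row(rows):
    # Return the text of every <td> cell in a table row
    return [str(x.string) for x in rows.find_all('td')]

def scrape(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return soop

def find_data(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    # Collect every table row on the page and parse its cells
    row = soop.find_all('tr')
    lopr = [parse_row(rows) for rows in row]
    return lopr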
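This is where I run into the problem. When I assign the result of type1_roster to a variable and print it, I only get an empty list. Ideally, it should contain the data for one or more players from the roster page.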
# Roster page crawler
def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    href_tags = soop.find_all(href=True)
    hrefs = [tag.get('href') for tag in href_tags]
    # get all player links
    player_hrefs = []
    for href in hrefs:
        if 'sports/baseball/roster' in href:
            if 'sports/baseball/roster/coaches' not in href:
                if 'https:' not in href:
                    player_hrefs.append(href)
    # get rid of duplicates
    player_links = list(set(player_hrefs))
    # scrape the roster links
    for link in player_links:
        player_ = url + link[24:]
        return(find_data(player_))
Answer:
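A few things:
- I would pass the headers as a global.
- I think you are slicing one character too late when building the link for player_.
- You need to rework the logic of find_data(), because the data lives in a mix of element types, e.g. in spans, rather than in table/tr/td elements. The HTML attributes are nice and descriptive, which makes targeting content easy.
- You can target the player links on the landing page more tightly with the CSS selector list shown below. This removes the need for multiple loops and for list(set()).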
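To make the slicing point concrete, here is a minimal offline sketch. The href is a made-up example (an assumption, not taken from the site) shaped like the relative player links on the roster page; `urljoin` from the standard library is shown as one possible alternative to slicing by hand.

```python
from urllib.parse import urljoin

ROSTER_PATH = "/sports/baseball/roster"
roster_url = "https://gvsulakers.com" + ROSTER_PATH

# Hypothetical href, shaped like the relative links on the roster page.
link = "/sports/baseball/roster/jane-doe/1234"

too_late = roster_url + link[24:]   # also drops the "/" after "roster"
correct = roster_url + link[23:]    # len(ROSTER_PATH) == 23
joined = urljoin(roster_url, link)  # resolves the href without manual slicing

print(too_late)
print(correct)
print(joined)
```

With the slice at 24, the separating slash is lost and the player URL is malformed, which is consistent with the request going through but the expected data not being found.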
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}

def scrape(url):
    page = requests.get(url, headers=HEADERS)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return soop

def find_data(url):
    page = requests.get(url, headers=HEADERS)
    # print(page)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    # re-think logic here to return desired data e.g.
    # soop.select_one('.sidearm-roster-player-jersey-number').text
    first_name = soop.select_one('.sidearm-roster-player-first-name').text
    # soop.select_one('.sidearm-roster-player-last-name').text
    # need targeted string cleaning possibly
    bio = soop.select_one('#sidearm-roster-player-bio').get_text('')
    return (first_name, bio)

def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    # select only the player profile links inside the roster container
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    # scrape the roster links
    for link in player_links:
        # slice off the leading '/sports/baseball/roster' (23 characters)
        player_ = url + link[23:]
        # print(player_)
        # note: returning here exits on the first link, so only one player is returned
        return(find_data(player_))

print(type1_roster('gvsulakers'))
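Because the `return` sits inside the `for` loop, the code above hands back data for the first player only. If data for every player is wanted, one possible approach is to accumulate the results in a list. The sketch below is an illustration, not part of the original answer: `collect_player_data` and `fake_find_data` are hypothetical names, and the stand-in function replaces the real `find_data` so the loop can be tried without network access.

```python
def collect_player_data(player_urls, find_data):
    """Call find_data on every URL and return all results as a list."""
    results = []
    for player_url in player_urls:
        try:
            results.append(find_data(player_url))
        except AttributeError:
            # select_one() returns None when nothing matches, so `.text`
            # raises AttributeError; skip that page instead of aborting.
            continue
    return results


# Offline stand-in for find_data (hypothetical), used only for this demo.
def fake_find_data(url):
    if url.endswith("/missing"):
        raise AttributeError("'NoneType' object has no attribute 'text'")
    return (url.rsplit("/", 1)[-1], "bio text")


urls = [
    "https://example.com/roster/a",
    "https://example.com/roster/missing",
    "https://example.com/roster/b",
]
print(collect_player_data(urls, fake_find_data))
```

In the real scraper, the list of player URLs would be built inside type1_roster and the actual find_data passed in; whether to skip or report pages with missing elements is a design choice.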
