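When I try to scrape the roster links, I get https://gwsports.com/roster.aspx?path=wpolo. When I open that link in Chrome, it changes to https://gwsports.com/sports/mens-water-polo/roster. I want to scrape the links in the second, final format (https://gwsports.com/sports/mens-water-polo/roster).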
pip install -U gazpacho
from gazpacho import get, Soup
url = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s=[link.attrs['href'] for link in links]
print(s)
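Answer:
This is not a scraping problem: you are getting the exact URL that appears on the page. That URL then redirects you to the final URL you need.
You can use the requests library to get the final URL: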
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
           'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
url = 'https://gwsports.com/roster.aspx?path=wpolo'
r = requests.get(url, allow_redirects=True, headers=headers)
if r.status_code == 200:
    print(r.url)  # URL after redirections
else:
    print('Request failed')
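To see where a redirect passes through, the response object from requests keeps the intermediate responses in its `history` attribute. The following is a minimal sketch, not part of the original answer; the helper name `redirect_chain` is hypothetical, and it accepts any response-like object exposing `history`, `status_code`, and `url`.

```python
def redirect_chain(response):
    # response.history lists the intermediate responses, oldest first;
    # response itself is the final response after all redirects.
    hops = [(hop.status_code, hop.url) for hop in response.history]
    hops.append((response.status_code, response.url))
    return hops
```

Passing the `r` returned by `requests.get(url, allow_redirects=True, headers=headers)` to this helper should give one `(status_code, url)` pair per hop, ending with the final roster URL.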
Combining the requests call with your gazpacho code gives:
from gazpacho import get, Soup
import requests

def get_final_url(url, root):
    # Note this function assumes url is relative and always prepends root
    # You may want to extend it to detect absolute URLs
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
               'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
    r = requests.get(root + url, allow_redirects=True, headers=headers)
    if r.status_code == 200:
        return r.url  # URL after redirections
    else:
        raise requests.HTTPError

url = 'https://gwsports.com'
root = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
# Collect every <a> tag whose href contains "roster"
links = soup.find('a', {'href': "roster"}, partial=True)
s = [get_final_url(link.attrs['href'], root) for link in links]
print(s)
Output:
['https://gwsports.com/sports/baseball/roster', 'https://gwsports.com/sports/mens-basketball/roster', 'https://gwsports.com/sports/mens-golf/roster', 'https://gwsports.com/sports/mens-soccer/roster', 'https://gwsports.com/sports/mens-swimming-and-diving/roster', 'https://gwsports.com/sports/mens-cross-country/roster', 'https://gwsports.com/sports/mens-water-polo/roster', 'https://gwsports.com/sports/womens-basketball/roster', 'https://gwsports.com/sports/womens-gymnastics/roster', 'https://gwsports.com/sports/womens-lacrosse/roster', 'https://gwsports.com/sports/womens-rowing/roster', 'https://gwsports.com/sports/womens-soccer/roster', 'https://gwsports.com/sports/softball/roster', 'https://gwsports.com/sports/womens-swimming-and-diving/roster', 'https://gwsports.com/sports/womens-tennis/roster', 'https://gwsports.com/sports/womens-cross-country/roster', 'https://gwsports.com/sports/womens-volleyball/roster']
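The comment in get_final_url notes that it assumes relative links. One way to lift that assumption is `urljoin` from the standard library, which leaves an absolute href unchanged and resolves a relative one against the root. This is a minimal sketch, not part of the original answer; the helper name `to_absolute` is hypothetical.

```python
from urllib.parse import urljoin

def to_absolute(root, href):
    # urljoin returns href unchanged when it is already absolute,
    # and resolves it against root when it is relative.
    return urljoin(root, href)

print(to_absolute('https://gwsports.com', '/roster.aspx?path=wpolo'))
print(to_absolute('https://gwsports.com', 'https://gwsports.com/sports/baseball/roster'))
```

The result of `to_absolute` could then be passed to `requests.get` directly, instead of concatenating root and href by hand.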
