Python BeautifulSoup does not extract every URL

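I am trying to find all the URLs on this page: https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments

More specifically, I want the links that are hyperlinked under each "Subject Code". However, when I run my code, almost no links are extracted.

I would like to know why this happens and how to fix it.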

from bs4 import BeautifulSoup
import requests

url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"

page = requests.get(url)
soup = BeautifulSoup(page.text, features="lxml")

for link in soup.find_all('a'):
    print(link.get('href'))

This is my first attempt at web scraping.

Reply from a uj5u.com user:

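The site has anti-bot protection; just add a User-Agent to your request headers. Also, don't forget to inspect your soup when something goes wrong.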

from bs4 import BeautifulSoup
import requests

url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
ua={'User-Agent':'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_8_2) AppleWebKit/531.2 (KHTML, like Gecko) Chrome/26.0.869.0 Safari/531.2'}
r = requests.get(url, headers=ua)
soup = BeautifulSoup(r.text, features="lxml")

for link in soup.find_all('a'):
    print(link.get('href'))

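The message in the soup was:

Sorry for the inconvenience.

We have detected excessive or unusual web requests originating from your browser, and we are unable to determine whether these requests are automated.

To proceed to the requested page, please complete the captcha below.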
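
To make "check your soup" concrete, here is a minimal sketch of a helper that flags a response as a block page. The name `looks_blocked` and the marker phrases are my own assumptions based on the message quoted above; the site's exact wording may differ, so adjust the markers to whatever you actually see in `r.text`.

```python
from bs4 import BeautifulSoup

# Assumed marker phrases, based on the block message quoted above.
# The real page may word this differently.
BLOCK_MARKERS = ("captcha", "unusual web requests")


def looks_blocked(html: str) -> bool:
    """Return True if the visible page text contains any block-page marker."""
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True).lower()
    return any(marker in text for marker in BLOCK_MARKERS)


# Made-up snippets standing in for a real response body (r.text).
blocked_html = (
    "<html><body><p>To proceed to the requested page, "
    "please complete the captcha below.</p></body></html>"
)
normal_html = (
    '<html><body><table id="mainTable">'
    '<tr><td><a href="#">AANB</a></td></tr></table></body></html>'
)

print(looks_blocked(blocked_html))
print(looks_blocked(normal_html))
```

In a real script you would call `looks_blocked(r.text)` right after `requests.get(...)` and stop early if it returns True, instead of wondering why `find_all('a')` returns so little.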

Reply from a uj5u.com user:

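I would use nth-child(1) to restrict the selection to the first column of the table matched by id, then simply extract the .text. If the text contains *, supply a default string for courses that are not offered; otherwise, concatenate the retrieved course identifier onto the base query string: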

import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments', headers=headers)
soup = bs(r.content, 'lxml')
no_course = ''
base = 'https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-department&dept='
course_info = {i.text:(no_course if '*' in i.text else base + i.text) for i in soup.select('#mainTable td:nth-child(1)')}
print(course_info)
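
To see what the selector picks up without hitting the site, the same idea can be tried on a small hand-written table. This is only a sketch: the sample HTML, the subject code `ABCD *`, and the function name `subject_links` are made up for illustration, and the real page's markup may differ. It uses `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for the real page; the actual markup may differ.
SAMPLE_HTML = """
<table id="mainTable">
  <tr><th>Subject Code</th><th>Subject Title</th></tr>
  <tr><td><a href="#">AANB</a></td><td>Applied Animal Biology</td></tr>
  <tr><td>ABCD *</td><td>Subject with no courses offered</td></tr>
</table>
"""

BASE = (
    "https://courses.students.ubc.ca/cs/courseschedule"
    "?pname=subjarea&tname=subj-department&dept="
)


def subject_links(html: str, no_course: str = "") -> dict:
    """Map each first-column cell's text to a department URL, or to no_course if marked with *."""
    soup = BeautifulSoup(html, "html.parser")
    links = {}
    # td:nth-child(1) matches only cells that are the first child of their row.
    for cell in soup.select("#mainTable td:nth-child(1)"):
        code = cell.get_text(strip=True)
        links[code] = no_course if "*" in code else BASE + code
    return links


result = subject_links(SAMPLE_HTML)
for code, link in result.items():
    print(code, "->", link or "(not offered)")
```

Stripping the cell text with `get_text(strip=True)` is a small deviation from the `.text` used above; it guards against stray whitespace ending up in the URL.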
