Why doesn't this work? Python web scraping

I am using this code to retrieve the text of all the li tags, but it doesn't work.

from bs4 import BeautifulSoup
import requests
page = requests.get("https://archief.amsterdam/inventarissen/scans/31245/120.3")
soup = BeautifulSoup(page.content, 'html.parser')
result = soup.find_all('#modal > div > div.content > div > div > ul > li:nth-child(1) > span.file-name')

for i in range(len(result)):
    print(result[i].text.strip())

print(len(result))

[Image: the website I want to get the data from]

Answer from a uj5u.com user:

It looks like the site creates these tags with JavaScript, and the requests module does not run JS at all, so the tags never appear in page.content.
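A minimal offline sketch of this effect, using a made-up HTML string rather than the real page: the response contains only an empty list plus a script, and because nothing executes that script, BeautifulSoup finds no li elements. The markup, the id values and the file name are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what the server sends: an empty <ul> and a
# script that would fill it in a browser. This is not the real page's markup.
raw_html = """
<div id="modal">
  <ul id="files"></ul>
  <script>
    var item = document.createElement("li");
    item.textContent = "scan-001.jpg";
    document.getElementById("files").appendChild(item);
  </script>
</div>
"""

soup = BeautifulSoup(raw_html, "html.parser")
# The script is kept as plain text and never executed, so no <li> exists.
matches = soup.select("li")
print(len(matches))
```

A browser would show one list item for the same markup, which is why the elements are visible in the developer tools but missing from page.content.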

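You can either use something like requests-html or Selenium to let the JS run before you access the content, or scrape the data that the page loads directly. I checked, and the page makes a request to a server that returns the data you need in JSON format. Check the Network tab of your browser's developer tools while the page loads for more information, or if you want to use this approach.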

Also:

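Assuming you want to get every file name, you can simplify the selector to li span.file-name.
Python supports for loops of the form for result in results, so you can use that instead of the more traditional, JavaScript-style index loop. Here is an example.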

# This is assuming the "result" variable is renamed to "results".
for result in results:
    print(result.text.strip())

print(len(results))
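One more detail worth noting: find_all expects tag names and attribute filters, not CSS selector strings, so a selector should be passed to select instead. A small sketch on a made-up HTML fragment follows. The markup and file names are assumptions, standing in for the page after JavaScript has run.

```python
from bs4 import BeautifulSoup

# Hypothetical rendered markup; the real page's structure may differ.
rendered_html = """
<ul>
  <li><span class="file-name">scan-001.jpg</span></li>
  <li><span class="file-name">scan-002.jpg</span></li>
</ul>
"""

soup = BeautifulSoup(rendered_html, "html.parser")
# select() takes a CSS selector; find_all() would treat the same string
# as a literal tag name and match nothing.
results = soup.select("li span.file-name")
names = [result.text.strip() for result in results]
for name in names:
    print(name)

print(len(results))
```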

Data scraping approach (in reply to a comment)

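Replace the web page URL in the requests.get call with the API URL.
Convert the JSONP text returned by the server into regular JSON, so that it can be parsed with Python's standard json library.
Iterate over the parsed JSON, take the value of "name" from each entry and add it to a list.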

Full example:

import json
import requests

# The URL from the network tab.
api_url = "https://webservices.picturae.com/archives/scans/31245/120.3?apiKey=eb37e65a-eb47-11e9-b95c-60f81db16c0e&lang=nl_NL&findingAid=31245&path=120.3&callback=callback_json5"
response = requests.get(api_url)
# The split() and strip() calls here remove parts of the request
# that are JSONP, not JSON. We need just the JSON data.
raw_json = response.text.split("(", 1)[1].strip(")")
# Load the JSON data into a regular Python dictionary.
data = json.loads(raw_json)
# Add all the filenames from the data into the filenames list.
filenames = []
for scan in data["scans"]["scans"]:
    filename = scan["name"]
    print(filename)
    filenames.append(filename)

print("\nFilename count:", len(filenames))
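The split() and strip() calls work when the body ends exactly with the closing parenthesis. If the server ever appended a semicolon or a newline, the leftover characters would make json.loads fail. A slightly more defensive sketch is shown below. The helper name jsonp_to_data and the sample string are hypothetical, and the sample only imitates the shape of the real response.

```python
import json
import re


def jsonp_to_data(text):
    """Return the parsed JSON payload from a JSONP response body."""
    # Capture everything between the first "(" and the last ")",
    # tolerating whitespace and an optional trailing semicolon.
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Made-up sample shaped like the response used above.
sample = 'callback_json5({"scans": {"scans": [{"name": "a (1).jpg"}, {"name": "b.jpg"}]}});\n'
data = jsonp_to_data(sample)
filenames = [scan["name"] for scan in data["scans"]["scans"]]
print(filenames)
```

With the real request, this would be called as jsonp_to_data(response.text) in place of the split() and strip() line.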

